Solved parsing packagesite.yaml

dvl@

Aspiring Daemon
Developer

Reaction score: 69
Messages: 572

I'm trying to parse packagesite.yaml for reasons and I'm looking for coding help please.

I keep running into encoding issues. I've tried:
  • latin-1
  • ascii
  • utf-8
  • ISO-8859-1
So far, I am unable to parse all of the file.

My proof-of-concept script is:

Python:
#!/usr/local/bin/python

import yaml
import io
import sys

line = sys.stdin.readline()
while line:
    docs = yaml.load_all(line, Loader=yaml.FullLoader)
    for doc in docs:
        print(doc['name'], doc['version'])
    line = sys.stdin.readline()
Using https://pkg.freebsd.org/FreeBSD:12:amd64/latest/packagesite.txz as the source of the sample input (see how I got it):

Code:
$ head -1 packagesite.yaml | ~/bin/yaml-test-packages.stdin.all.line.by.line
py37-pyasn1-modules 0.2.7
To reproduce one of these encoding issues:

Code:
$ head -14074 packagesite.yaml | tail -1 | ~/bin/yaml-test-packages.stdin.all.line.by.line
Traceback (most recent call last):
  File "/usr/home/dan/bin/yaml-test-packages.stdin.all.line.by.line", line 10, in <module>
    for doc in docs:
  File "/usr/local/lib/python3.7/site-packages/yaml/__init__.py", line 127, in load_all
    loader = Loader(stream)
  File "/usr/local/lib/python3.7/site-packages/yaml/loader.py", line 24, in __init__
    Reader.__init__(self, stream)
  File "/usr/local/lib/python3.7/site-packages/yaml/reader.py", line 74, in __init__
    self.check_printable(stream)
  File "/usr/local/lib/python3.7/site-packages/yaml/reader.py", line 144, in check_printable
    'unicode', "special characters are not allowed")
yaml.reader.ReaderError: unacceptable character #xdcbc: special characters are not allowed
  in "<unicode string>", position 1421
That's the line for zh-auto-cn-l10n:

Code:
$ grep -hn zh-auto-cn-l10n packagesite.yaml
14074:{"name":"zh-auto-cn-l10n","origin":"chinese/auto-cn-l10n","version":"1.1_3","comment":"The automatic localization for Simplified Chinese zh_CN.eucCN locale","maintainer":"ports@FreeBSD.org","www":"UNKNOWN","abi":"FreeBSD:12:amd64","arch":"freebsd:12:x86:64","prefix":"/usr/local","sum":"7d87b8636a0a77528b79cad0172eab1a10da472320b9873e0f3ba8942dc1b155","flatsize":19656,"path":"All/zh-auto-cn-l10n-1.1_3.txz","repopath":"All/zh-auto-cn-l10n-1.1_3.txz","licenselogic":"single","pkgsize":7496,"desc":"Simplified Chinese (GB2312 encoding) zh_CN.eucCN automatic localization\nInstall this port and you will have a Simplified Chinese FreeBSD system","deps":{"relaxconf":{"origin":"sysutils/relaxconf","version":"1.1.1_3"},"wqy-fonts":{"origin":"x11-fonts/wqy","version":"20100803_10,1"},"zh-scim-pinyin":{"origin":"chinese/scim-pinyin","version":"0.5.92_4"},"zh-scim-tables":{"origin":"chinese/scim-tables","version":"0.5.10_1"}},"categories":["chinese"],"options":{"FCITX":"off","FIREFLYTTF":"off","MINICHINPUT":"off","RELAXCONF":"on","SCIM":"on","WQY":"on"},"annotations":{"FreeBSD_version":"1201000"},"messages":[{"message":"English Instructions:\n Please tell your users to merge their old dotfiles with the new ones, in\n    /usr/local/share/skel/zh_CN.eucCN/dot.*\n\n For future adduser\n    # adduser -k /usr/local/share/skel/zh_CN.eucCN\n\n**************************************************************************\n\n????????˵??:\n ??????????û??Ƚ????ǵ??¾?????,????\n /usr/local/share/skel/zh_CN.eucCN/dot.*\n\n ????Ժ???Ҫ?????û?,???????????µķ?ʽ:\n    # adduser -k /usr/local/share/skel/zh_CN.eucCN","type":"install"},{"message":"===>   NOTICE:\n\nThe zh-auto-cn-l10n port currently does not have a maintainer. As a result, it is\nmore likely to have unresolved issues, not be up-to-date, or even be removed in\nthe future. 
To volunteer to maintain this port, please create an issue at:\n\nhttps://bugs.freebsd.org/bugzilla\n\nMore information about port maintainership is available at:\n\nhttps://www.freebsd.org/doc/en/articles/contributing/ports-contributing.html#maintain-port"}]}
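
A note on the character in the traceback: #xdcbc is a lone surrogate, which is what Python's surrogateescape error handler produces for an undecodable byte 0xBC. Python 3.7 applies that handler to sys.stdin when it runs in the C or POSIX locale, so that is probably what happened here (an assumption, since the locale is not shown in this thread). A minimal sketch, using made-up sample bytes rather than the real package line:

```python
# Hypothetical sample: two non-UTF-8 bytes embedded in otherwise ASCII text.
raw = b'"message":"\xbc\xf2 adduser"'

# surrogateescape maps each undecodable byte 0xNN to the code point U+DC00 + 0xNN.
text = raw.decode("utf-8", errors="surrogateescape")

# Collect the escaped bytes; these are the code points PyYAML refuses
# with "special characters are not allowed".
escaped = [hex(ord(ch)) for ch in text if 0xDC80 <= ord(ch) <= 0xDCFF]
print(escaped)

# The escape is reversible, so the original bytes can be recovered unchanged.
assert text.encode("utf-8", errors="surrogateescape") == raw
```

If that reading is right, the problem is less about picking the correct codec and more about the line containing bytes that are not valid in the encoding Python assumed for stdin.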
 

guidok

Member

Reaction score: 5
Messages: 22

Assuming the encoding errors are in fields you are not interested in (e.g. desc, comment, messages), you should be safe to replace the bytes that trigger an encoding error with the replacement character (U+FFFD, see also https://docs.python.org/3.7/library/codecs.html#codec-base-classes). In Python, for sys.stdin, this is easily done using:

Python:
sys.stdin = io.TextIOWrapper(sys.stdin.buffer, errors='replace')
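
To see what errors='replace' does without touching the real file, here is a small sketch that applies the same wrapper to an in-memory stream. The sample bytes are made up, and encoding="utf-8" is spelled out here as an assumption (the one-liner above uses the locale default instead):

```python
import io

# Hypothetical one-line input with two bytes that are not valid UTF-8.
raw = io.BytesIO(b'{"name":"demo","desc":"\xbc\xf2"}\n')

# Same kind of wrapper as above, applied to an in-memory stream
# instead of sys.stdin.buffer.
stream = io.TextIOWrapper(raw, encoding="utf-8", errors="replace")
line = stream.readline()

# Each undecodable byte has become U+FFFD; everything else is untouched.
print(line.count("\ufffd"))
```

The replacement is lossy, so this only makes sense when the affected fields are ones you do not need.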
It seems packagesite.yaml is not a proper YAML file. Although it contains multiple YAML documents, these documents are not separated by --- as required by the specification. So before we can start parsing this file in one go, we need to turn it into a proper YAML file first:

Bash:
sed 's/.*/---\
&/' packagesite.yaml > proper_multidoc_packagesite.yaml
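
If sed is not at hand, roughly the same transformation can be sketched in Python. The function name add_document_markers is made up for this illustration, and the sample lines are placeholders:

```python
def add_document_markers(lines):
    """Yield a '---' separator line before every input line."""
    for line in lines:
        yield "---\n"
        # Make sure the last line is newline-terminated as well.
        yield line if line.endswith("\n") else line + "\n"


# Hypothetical two-line sample standing in for packagesite.yaml.
sample = ['{"name":"a","version":"1"}\n', '{"name":"b","version":"2"}\n']
print("".join(add_document_markers(sample)), end="")
```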
With a proper YAML file and guarded against encoding errors we can now parse the file:

Python:
import io
import sys

from ruamel.yaml import YAML

sys.stdin = io.TextIOWrapper(sys.stdin.buffer, errors='replace')

yaml = YAML(typ='safe')
docs = yaml.load_all(sys.stdin)

for doc in docs:
    print(f"{doc['origin']}\t{doc['name']}\t{doc['version']}")
This version uses ruamel.yaml instead of PyYAML. ruamel.yaml is a fork of PyYAML that adds YAML 1.2 support.

Edit: spelling
 
dvl@


I think we have a simple solution, with combined suggestions from Fosstodon and bsd.network:

Use textproc/jq from https://stedolan.github.io/jq/

Code:
$ time jq -rc '[1, .origin, .name, .version] | @tsv' < ~/tmp/FreeBSD\:12\:amd64/latest/packagesite.yaml > packagesite.csv

real    0m1.351s
user    0m1.295s
sys     0m0.055s

$ time ./import-via-copy-packagesite.py

real    0m1.731s
user    0m0.131s
sys     0m0.008s

The data get in there fast enough.
Next step: go from that raw data into normalized form. That should be easier and faster now that it's in a PostgreSQL database on FreeBSD.
Thank you.
 
guidok


Ah, that's even better. The lines in packagesite.yaml are valid JSON documents. Didn't think of that.
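
Since each line is a JSON document, the Python side could skip YAML entirely. A minimal sketch, assuming one JSON object per line and reusing the errors='replace' guard from above; iter_packages is a made-up name, and the sample input is a trimmed, hypothetical stand-in for the real file:

```python
import io
import json


def iter_packages(binary_stream):
    """Yield one dict per non-empty line, treating each line as a JSON document."""
    text = io.TextIOWrapper(binary_stream, encoding="utf-8", errors="replace")
    for line in text:
        if line.strip():
            yield json.loads(line)


# Hypothetical sample; the second line carries bytes that are not valid UTF-8.
sample = io.BytesIO(
    b'{"name":"py37-pyasn1-modules","version":"0.2.7"}\n'
    b'{"name":"zh-auto-cn-l10n","origin":"chinese/auto-cn-l10n",'
    b'"version":"1.1_3","desc":"\xbc\xf2"}\n'
)

rows = [
    (pkg.get("origin", ""), pkg["name"], pkg["version"])
    for pkg in iter_packages(sample)
]
for row in rows:
    print("\t".join(row))
```

For the real file, something like iter_packages(sys.stdin.buffer) or iter_packages(open(path, "rb")) should work the same way, though that has not been tested against the full packagesite.yaml here.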
 
dvl@


guidok said: Ah, that's even better. The lines in packagesite.yaml are valid JSON documents. Didn't think of that.
I thought it was not valid JSON... so I never even tried.
 