Solved parsing packagesite.yaml

dvl@ · Apr 9, 2020

I'm trying to parse packagesite.yaml for reasons and I'm looking for coding help please.

I keep running into encoding issues. I've tried:

latin-1
ascii
utf-8
ISO-8859-1

So far, I am unable to parse all of the file.

My proof-of-concept script is:

Python:

#!/usr/local/bin/python

import yaml
import io
import sys

line = sys.stdin.readline()
while line:
    docs = yaml.load_all(line, Loader=yaml.FullLoader)
    for doc in docs:
        print(doc['name'], doc['version'])
    line = sys.stdin.readline()

The source file is https://pkg.freebsd.org/FreeBSD:12:amd64/latest/packagesite.txz (see how I got it). With its first line as sample input:

Code:

$ head -1 packagesite.yaml | ~/bin/yaml-test-packages.stdin.all.line.by.line
py37-pyasn1-modules 0.2.7

To encounter one of these encoding issues:

Code:

$ head -14074 packagesite.yaml | tail -1 | ~/bin/yaml-test-packages.stdin.all.line.by.line
Traceback (most recent call last):
  File "/usr/home/dan/bin/yaml-test-packages.stdin.all.line.by.line", line 10, in <module>
    for doc in docs:
  File "/usr/local/lib/python3.7/site-packages/yaml/__init__.py", line 127, in load_all
    loader = Loader(stream)
  File "/usr/local/lib/python3.7/site-packages/yaml/loader.py", line 24, in __init__
    Reader.__init__(self, stream)
  File "/usr/local/lib/python3.7/site-packages/yaml/reader.py", line 74, in __init__
    self.check_printable(stream)
  File "/usr/local/lib/python3.7/site-packages/yaml/reader.py", line 144, in check_printable
    'unicode', "special characters are not allowed")
yaml.reader.ReaderError: unacceptable character #xdcbc: special characters are not allowed
  in "<unicode string>", position 1421

That's the line for zh-auto-cn-l10n:

Code:

$ grep -hn zh-auto-cn-l10n packagesite.yaml
14074:{"name":"zh-auto-cn-l10n","origin":"chinese/auto-cn-l10n","version":"1.1_3","comment":"The automatic localization for Simplified Chinese zh_CN.eucCN locale","maintainer":"ports@FreeBSD.org","www":"UNKNOWN","abi":"FreeBSD:12:amd64","arch":"freebsd:12:x86:64","prefix":"/usr/local","sum":"7d87b8636a0a77528b79cad0172eab1a10da472320b9873e0f3ba8942dc1b155","flatsize":19656,"path":"All/zh-auto-cn-l10n-1.1_3.txz","repopath":"All/zh-auto-cn-l10n-1.1_3.txz","licenselogic":"single","pkgsize":7496,"desc":"Simplified Chinese (GB2312 encoding) zh_CN.eucCN automatic localization\nInstall this port and you will have a Simplified Chinese FreeBSD system","deps":{"relaxconf":{"origin":"sysutils/relaxconf","version":"1.1.1_3"},"wqy-fonts":{"origin":"x11-fonts/wqy","version":"20100803_10,1"},"zh-scim-pinyin":{"origin":"chinese/scim-pinyin","version":"0.5.92_4"},"zh-scim-tables":{"origin":"chinese/scim-tables","version":"0.5.10_1"}},"categories":["chinese"],"options":{"FCITX":"off","FIREFLYTTF":"off","MINICHINPUT":"off","RELAXCONF":"on","SCIM":"on","WQY":"on"},"annotations":{"FreeBSD_version":"1201000"},"messages":[{"message":"English Instructions:\n Please tell your users to merge their old dotfiles with the new ones, in\n    /usr/local/share/skel/zh_CN.eucCN/dot.*\n\n For future adduser\n    # adduser -k /usr/local/share/skel/zh_CN.eucCN\n\n**************************************************************************\n\n????????˵??:\n ??????????û??Ƚ????ǵ??¾?????,????\n /usr/local/share/skel/zh_CN.eucCN/dot.*\n\n ????Ժ???Ҫ?????û?,???????????µķ?ʽ:\n    # adduser -k /usr/local/share/skel/zh_CN.eucCN","type":"install"},{"message":"===>   NOTICE:\n\nThe zh-auto-cn-l10n port currently does not have a maintainer. As a result, it is\nmore likely to have unresolved issues, not be up-to-date, or even be removed in\nthe future. 
To volunteer to maintain this port, please create an issue at:\n\nhttps://bugs.freebsd.org/bugzilla\n\nMore information about port maintainership is available at:\n\nhttps://www.freebsd.org/doc/en/articles/contributing/ports-contributing.html#maintain-port"}]}

guidok · Apr 10, 2020

Assuming the encoding errors are in fields you are not interested in (e.g. desc, comment, messages), you should be safe to replace the characters that trigger an encoding error with replacement characters (U+FFFD, see also https://docs.python.org/3.7/library/codecs.html#codec-base-classes). In Python, for sys.stdin, this is easily done using:

Python:

sys.stdin = io.TextIOWrapper(sys.stdin.buffer, errors='replace')
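
To illustrate what the error handler changes, here is a small sketch. It uses an in-memory byte stream with a made-up record as a stand-in for sys.stdin.buffer, and it spells out encoding='utf-8', whereas the one-liner above relies on the locale default. The reading of #xdcbc as a surrogate-escaped 0xbc byte is an inference from the traceback, not something confirmed in this thread.

```python
import io

# A raw 0xbc byte is not valid UTF-8 on its own.
bad = b'\xbc'

# surrogateescape carries the byte through as a lone surrogate, U+DCBC,
# which matches the character the YAML reader rejected in the traceback.
escaped = bad.decode('utf-8', errors='surrogateescape')

# replace swaps the byte for U+FFFD, the Unicode replacement character.
replaced = bad.decode('utf-8', errors='replace')

# The same handler applied to a stream: an in-memory stand-in for
# sys.stdin.buffer holding one made-up record.
raw = io.BytesIO(b'{"name":"demo","desc":"\xbc"}\n')
line = io.TextIOWrapper(raw, encoding='utf-8', errors='replace').readline()

print(hex(ord(escaped)), hex(ord(replaced)), line.count('\ufffd'))
```

The original byte is lost with errors='replace', so this only makes sense when the affected fields are not needed afterwards.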

It seems packagesite.yaml is not a proper YAML file. Although it contains multiple YAML documents, these documents are not separated by --- as required by the specification. So before we can start parsing this file in one go, we need to turn it into a proper YAML file first:

Bash:

sed 's/.*/---\
&/' packagesite.yaml > proper_multidoc_packagesite.yaml
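
The same transformation can be sketched in Python for anyone who prefers to keep everything in one script. The function name add_doc_separators and the two sample records are made up for illustration:

```python
import io


def add_doc_separators(src, dst):
    """Copy each line from src to dst, preceded by a '---' separator line."""
    for line in src:
        dst.write('---\n')
        # Make sure every record ends with a newline, including the last one.
        dst.write(line if line.endswith('\n') else line + '\n')


# Tiny in-memory demonstration with two made-up package records.
src = io.StringIO('{"name":"a"}\n{"name":"b"}\n')
dst = io.StringIO()
add_doc_separators(src, dst)
result = dst.getvalue()
print(result, end='')
```

For the real file, src and dst would be open file objects for packagesite.yaml and proper_multidoc_packagesite.yaml.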

With a proper YAML file and guarded against encoding errors we can now parse the file:

Python:

import io
import sys

from ruamel.yaml import YAML

sys.stdin = io.TextIOWrapper(sys.stdin.buffer, errors='replace')

yaml = YAML(typ='safe')
docs = yaml.load_all(sys.stdin)

for doc in docs:
    print(f"{doc['origin']}\t{doc['name']}\t{doc['version']}")

This version uses ruamel.yaml instead of PyYAML. ruamel.yaml is a fork of PyYAML that adds, among other things, YAML 1.2 support.


dvl@ · Apr 10, 2020

I think we have a simple solution, with combined suggestions from Fosstodon and bsd.network:

use textproc/jq from https://stedolan.github.io/jq/

Code:

$ time jq -rc '[1, .origin, .name, .version] |
@tsv
' < ~/tmp/FreeBSD\:12\:amd64/latest/packagesite.yaml > packagesite.csv

real    0m1.351s
user    0m1.295s
sys     0m0.055s

$ time ./import-via-copy-packagesite.py

real    0m1.731s
user    0m0.131s
sys     0m0.008s

The data get in there fast enough.
Next step, go from that raw data into normalized form. That should be easier & faster now that it's in a PostgreSQL database on FreeBSD.
Thank you.

dvl@ · Apr 10, 2020

guidok said:
Assuming the encoding errors are in fields you are not interested in, eg desc, comment, messages, you should be safe to replace the characters that trigger an encoding error with replacement characters (U+FFFD, see also https://docs.python.org/3.7/library/codecs.html#codec-base-classes). In Python and for sys.stdin this is easily done using:
...

Thank you. Your help is much appreciated.

guidok · Apr 10, 2020

Ah, that's even better. The lines in packagesite.yaml are valid JSON documents. Didn't think of that.
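
If each line really is a JSON document, the Python standard library should be enough too, with no YAML parser and no --- separators needed. A minimal sketch, assuming one JSON object per line with the origin, name and version keys seen above; the helper name iter_packages and the first sample record are made up for illustration:

```python
import io
import json


def iter_packages(stream):
    """Yield (origin, name, version) for each non-empty JSON line."""
    for line in stream:
        line = line.strip()
        if not line:
            continue
        pkg = json.loads(line)
        yield pkg['origin'], pkg['name'], pkg['version']


# In-memory stand-in for sys.stdin.buffer. The second record carries bytes
# that are not valid UTF-8 in a field that is never read.
raw = io.BytesIO(
    b'{"name":"demo-pkg","origin":"misc/demo-pkg","version":"1.0"}\n'
    b'{"name":"zh-auto-cn-l10n","origin":"chinese/auto-cn-l10n",'
    b'"version":"1.1_3","desc":"\xff\xfe"}\n'
)

# Undecodable bytes become U+FFFD instead of raising an error.
text = io.TextIOWrapper(raw, encoding='utf-8', errors='replace')

rows = list(iter_packages(text))
for origin, name, version in rows:
    print(f'{origin}\t{name}\t{version}')
```

To run it against the real file, wrap sys.stdin.buffer instead of the BytesIO object and pipe packagesite.yaml in.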

dvl@ · Apr 10, 2020

guidok said:
Ah, that's even better. The lines in packagesite.yaml are valid JSON documents. Didn't think of that.

I thought it was not valid JSON... so I never even tried.
