32-bit PCRE missing metacharacters (pcre32)

escape · Jun 25, 2014

I have a problem using the 32-bit Perl Compatible Regular Expressions (PCRE) library: the regular expression metacharacters do not seem to work. The library has to be built with 32-bit support; libpcre32 was built and can be linked with -lpcre32. The version is 8.35. I'm not sure whether big-endian support was on. The machine is an i586 PC (little endian) running FreeBSD 10, which has been updated more than once. The PCRE in use now is built directly from source, without any OS-specific modifications.

I have written a C program that calls pcre32_compile2, pcre32_study, pcre32_exec (and pcre32_pattern_to_host_byte_order). The program and the function calls seem to work properly: for example, the line "Wheel is round" matches the regular expression "Wheel". It does not match, for example, "Whe.l" or "Wh[a-z]el". Literal matching works and metacharacters do not.

Additional information: with the option flag PCRE_UTF32, an error says that the string is not valid UTF-32. UTF-32 should be almost the same as four-byte UCS (UCS-4); it differs only in its range. Is it possible that the bytes of the 4-byte characters are somehow not in the right order?

I put the string into an array of four-byte characters (I think this is 4-byte UCS) like this: I read 8-bit bytes from a stream (a file or standard input) and put the first byte into the first item of the array, the second byte into the second item, and so on. One character is always four bytes. This seems to work: it matches "Wheel". I do not remember what happened when I tried writing a BOM as the first character of the pattern. Is it possible, or is it always the case, that PCRE expects big-endian pattern text in 32-bit mode? My machine is little endian.

How should I encode the pattern text to fix the missing-metacharacters problem? Is this a bug (maybe a known one in the implementation that I do not know about) or a new one? Should the pattern be, for example, in 4-byte big-endian format, or is PCRE fully functional only in 8-bit mode (ASCII, 8-bit and possibly UTF-8)? How do I switch on little-endian support?

Has anyone used pcre32 and could give a link to a working example?

escape · Jun 27, 2014

My post may have been uninformative without examples. Maybe someone who knew the answer would have recognized the problem anyway. The problem is now almost solved. Here are the intermediate results:

1) Both pattern and subject text are in original byte order:

cat ~/Wheel.txt | ./test2_regexp_search "W.+" # Miss

cat ~/Wheel.txt | ./test2_regexp_search ".+" # Miss

cat ~/Wheel.txt | ./test2_regexp_search "W" # Hit

2) Reversing the four bytes of each 32-bit (4-byte) character in the pattern only:

cat ~/Wheel.txt | ./test2_regexp_search "W.+" # Miss

cat ~/Wheel.txt | ./test2_regexp_search ".+" # Hits several times

cat ~/Wheel.txt | ./test2_regexp_search "W" # Miss

3) Reversing the bytes of each character in both the pattern and the subject block:

cat ~/Wheel.txt | ./test2_regexp_search "Whe.l" # Works properly, as do all of the above (even several times in a row):

cat ~/Wheel.txt | ./test2_regexp_search "W.+" # Hit

cat ~/Wheel.txt | ./test2_regexp_search ".+" # Hits several times

cat ~/Wheel.txt | ./test2_regexp_search "W" # Hit

In http://www.pcre.org/pcre.txt, the "CHARACTER CODES" section describes UTF-16 and says that a BOM is not handled: "The PCRE functions do not handle this, expecting strings to be in host byte order." (man pcre32 has the same text.) Still, the bytes have to be converted:

[1][2][3][4] -> [4][3][2][1]

This applies to both the pattern and the subject text (assuming big endian instead of host byte order). Does the subject always have to be converted programmatically as well? pcre32_utf32_to_host_byte_order does not seem to do anything in my case (little endian).
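The conversion [1][2][3][4] -> [4][3][2][1] can be sketched in plain C as below. This is only an illustration, not code from the test program: the names swap32 and swap32_buffer are hypothetical, and uint32_t is assumed to be compatible with PCRE's 32-bit character type.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Reverse the four bytes of one 32-bit unit: [1][2][3][4] -> [4][3][2][1]. */
uint32_t swap32(uint32_t v)
{
    return (v >> 24) | ((v >> 8) & 0x0000FF00u) |
           ((v << 8) & 0x00FF0000u) | (v << 24);
}

/* Hypothetical helper: swap every 32-bit unit of a buffer in place. */
void swap32_buffer(uint32_t *buf, size_t n)
{
    assert(buf != NULL);
    for (size_t i = 0; i < n; i++)
        buf[i] = swap32(buf[i]);
}
```

A '.' that was stored with its significant byte first reads as 0x2E000000 on a little-endian host; after the swap it is 0x0000002E, the value a metacharacter has to have to be recognized as one.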

Calling pcre32_pattern_to_host_byte_order or not made no difference, as expected: it converts an already compiled regular expression to host byte order (if needed).

So "assuming host byte order" does not work at the moment? Am I right? (FreeBSD 10.0-RELEASE-p2 i386)

Or do I have it all wrong, and the data should be read as 32-bit words, not as bytes? But both the subject and the pattern were formed on the same machine, so they should be in host byte order.
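To illustrate the "32-bit words, not bytes" idea: if every character is assigned to a 32-bit unit as a number, the compiler stores it in host byte order and no manual byte placement is needed. The following is a minimal sketch under assumptions: widen_bytes is a hypothetical helper, the input is taken to be plain ASCII/Latin-1, and uint32_t is assumed to be compatible with PCRE_UCHAR32.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: widen 8-bit ASCII/Latin-1 bytes into 32-bit units.
   Each unit is assigned as a numeric value, so its in-memory layout is
   whatever the host uses (low byte first on a little-endian machine). */
size_t widen_bytes(const unsigned char *src, size_t n, uint32_t *dst)
{
    assert(src != NULL && dst != NULL);
    for (size_t i = 0; i < n; i++)
        dst[i] = (uint32_t)src[i];
    return n;
}
```

With this approach '.' becomes the value 0x2E on any host. Filling the four bytes of each unit by hand, most significant byte first, produces the same value only on a big-endian machine.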

escape · Jun 27, 2014

Code rechecked: [1][2][3][4]. In the test program, byte number one holds the most significant bits of each character in the data block (that is, the data is big endian).

pcre_utf32_to_host_byte_order(3) advises using this function to convert the subject block to the correct order.

This was a bit difficult to find in the documentation. It seems that the subject text has to be free of BOMs (if any) and, being in big-endian format as here, needs only its bytes swapped for the host's endianness (LE host).
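For reference, here is a portable sketch of that conversion done by hand: it decodes big-endian 4-byte characters into host-order units with shifts, so it behaves the same on little-endian and big-endian hosts, and it drops a leading BOM. The function name decode_utf32be is hypothetical, and this is an assumption-based alternative for illustration, not a description of what the PCRE function does internally.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: decode big-endian UTF-32 bytes into host-order
   32-bit units. A BOM (U+FEFF) at the very start is skipped.
   Returns the number of units written to dst; trailing bytes that do
   not form a full 4-byte character are ignored. */
size_t decode_utf32be(const unsigned char *src, size_t nbytes, uint32_t *dst)
{
    assert(src != NULL && dst != NULL);
    size_t count = 0;
    for (size_t i = 0; i + 4 <= nbytes; i += 4) {
        uint32_t cp = ((uint32_t)src[i] << 24) |
                      ((uint32_t)src[i + 1] << 16) |
                      ((uint32_t)src[i + 2] << 8) |
                      (uint32_t)src[i + 3];
        if (i == 0 && cp == 0x0000FEFFu)
            continue; /* drop a leading BOM */
        dst[count++] = cp;
    }
    return count;
}
```

Because the value is assembled arithmetically, the result is already in host byte order and can be used for both the pattern and the subject.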

The problem seems to be solved.
