lex scanner on a socket

Hello,

I have a program with multiple threads generating data. I would like each of them to write to a socket that I can attach a lex scanner to, so that a single scanner can parse data coming from multiple sources.

I assume I would use an AF_UNIX socket for this, but is SOCK_STREAM the best socket type to use?

Is this possible? How do I set up the socket, and how do I set up the scanner to use it as input?

This is briefly mentioned in the lex man page, but I got the impression I would have to read a byte at a time for this to work - is that accurate?
 
You could use fdopen(3) to make a socket into a FILE * and then assign that to yyin. Lex seems limited to one input stream at a time, but maybe you could swap yyin out based on the socket you're processing.

Something tells me that making a socket into a FILE * could have all sorts of bad side effects. You'll have to experiment ...

EDIT:
And yes, you'd probably want the socket to be SOCK_STREAM for reliable, ordered transmission. Although with AF_UNIX even datagram sockets are reliable, SOCK_STREAM gives you the byte-stream semantics stdio expects.

I haven't tried the C++ interface for lex. Maybe it doesn't have this single stream limitation. In all honesty, I haven't touched lex in many years.
 
On the receiving end you could do something as simple as:
Code:
nc -l 1234 | ./scanner
(where ./scanner is the binary built from your lex specification). See nc(1) for examples of this simple, yet brilliant, little tool.
 
I just noticed a section titled "Multiple Input Buffers" in the man page of lex(3).

Yeah, use fdopen(3) with yy_create_buffer for every new socket, and then yy_switch_to_buffer whenever you switch to reading from a different socket.

By the way, libevent (or even libev) makes this pretty trivial. You could pass the YY_BUFFER_STATE as the extra argument to your callback.

EDIT:
Yeah, well, don't actually call recv(2) yourself (call yylex() or whatever). But that raises another problem ... blocking I/O. You can get away with this by using one thread per socket (if the number of connections you expect is small), in which case you wouldn't need libevent. Then you have to investigate the thread safety of yy_switch_to_buffer (it's probably not thread safe). Or, if you wanted an event-driven design, you'd need to figure out how lex might cope with non-blocking I/O (it probably can't). In fact, I'm not even sure how stdio would deal with a non-blocking socket.

Man, this went from possibly elegant to horribly ugly.
 
Well to give you some context on what I have going on:

Multiple TCP sockets are opened to read/write HTTP requests/responses. I need a way to parse those HTTP headers and such across all of the TCP connections (Like when a browser hits a page it opens several TCP connections, but all the HTTP requests are related)

In my case, I thought I could use something like lex to remember something like a cookie state to see which cookies were set across all of the TCP connections.

I wasn't even planning on using multiple input buffers ... I just wanted to have one buffer that I always kept open. Basically intercept the socket traffic and tee it off to the lex scanner so that it can do the parsing out of the stream of traffic.

I will play around with trying to fdopen on the socket and see what happens.
 
dcole said:
Well to give you some context on what I have going on:

Multiple TCP sockets are opened to read/write HTTP requests/responses. I need a way to parse those HTTP headers and such across all of the TCP connections (Like when a browser hits a page it opens several TCP connections, but all the HTTP requests are related)

In my case, I thought I could use something like lex to remember something like a cookie state to see which cookies were set across all of the TCP connections.

I wasn't even planning on using multiple input buffers ... I just wanted to have one buffer that I always kept open. Basically intercept the socket traffic and tee it off to the lex scanner so that it can do the parsing out of the stream of traffic.

I will play around with trying to fdopen on the socket and see what happens.

Do you really need a powerful lex parser then? Are there libraries that can do this parsing in a more canonical way?

I think lex can be made to solve this problem ... although it's not straightforward to me how. At this stage, I would
  1. Investigate the C++ interface - it may cope better with multiple streams (but then you have to look into stdiobuf or similar GNU extensions to interface with istream).
  2. Experiment with stdio operating on FILE * backed by non-blocking files - an easy way is to set O_NONBLOCK with fcntl(2) on STDIN_FILENO and see the effects on various functions like getc(3).
  3. Look into Boost Spirit (which is a C++ parser generator) - I hate boost with a passion, but if it can be made to solve your problem quickly, go for it.
 
nslay said:
Do you really need a powerful lex parser then? Are there libraries that can do this parsing in a more canonical way?

I think lex can be made to solve this problem ... although it's not straightforward to me how. At this stage, I would
  1. Investigate the C++ interface - it may cope better with multiple streams (but then you have to look into stdiobuf or similar GNU extensions to interface with istream).
  2. Experiment with stdio operating on FILE * backed by non-blocking files - an easy way is to set O_NONBLOCK with fcntl(2) on STDIN_FILENO and see the effects on various functions like getc(3).
  3. Look into Boost Spirit (which is a C++ parser generator) - I hate boost with a passion, but if it can be made to solve your problem quickly, go for it.


I had actually thought about looking into converting this to C++. It is a possibility if I use an AF_UNIX socket and just need to pick it up on the receiving end of that socket.

I had also looked into Boost Spirit; however, I actually already have a set of Lex rules that successfully parse HTTP. The problem is that they were written to statically parse a single stream of HTTP. Now I need something that can parse multiple streams and remember state across them. This was not possible with the current setup because it took multiple calls to yylex().

My latest idea is maybe using the multiple input buffers idea. Basically I could send a sentinel EOF-like character at the end of each send on the socket. On <EOF>, do a select(2) to await the next readable socket, accept(2) any new connection, fdopen the file descriptor, and read again. According to that man page, yy_switch_to_buffer() does not change the start condition.


Just some more thoughts.
 
dcole said:
I had actually thought about looking into converting this to C++. It is a possibility if I use an AF_UNIX socket and just need to pick it up on the receiving end of that socket.

I had also looked into Boost Spirit; however, I actually already have a set of Lex rules that successfully parse HTTP. The problem is that they were written to statically parse a single stream of HTTP. Now I need something that can parse multiple streams and remember state across them. This was not possible with the current setup because it took multiple calls to yylex().

My latest idea is maybe using the multiple input buffers idea. Basically I could send a sentinel EOF-like character at the end of each send on the socket. On <EOF>, do a select(2) to await the next readable socket, accept(2) any new connection, fdopen the file descriptor, and read again. According to that man page, yy_switch_to_buffer() does not change the start condition.


Just some more thoughts.

Digging further into the lex(3) man page, you can do something like this (and use non-blocking I/O!):

Code:
/* Stop scanning when YY_INPUT yields YY_NULL */
%option noyywrap

%{
#include <errno.h>
#include <sys/socket.h>

/* Current socket to process */
extern int currentSocket;

#define YY_INPUT(buf,result,max_size) \
  { \
    ssize_t recvSize = recv(currentSocket, buf, max_size, 0); \
    result = recvSize > 0 ? recvSize : YY_NULL; \
    if (recvSize == 0 || (recvSize == -1 && errno != EAGAIN)) { \
      /* Close socket and clean up on error or closed connection */ \
    } \
  }
%}

Use this with yy_create_buffer (you may pass the FILE * argument as NULL according to the man page, since you don't use yyin) and yy_switch_to_buffer.

I'm not sure if the above is syntactically correct, but you get the idea. It would be better to have a global pointer to some context structure that holds the socket than just the socket itself since you'll need to clean up the context, close the socket, and unregister it from your event loop (I recommend using libevent).

If you were using libevent, it would be something like:
Code:
struct ClientContext {
  int fd; /* Your socket */
  struct event *readEvent; /* event pointer used to register/unregister yourself from the event loop */
  YY_BUFFER_STATE bufferState;
  /* ... */
  /* Other context information here */
};

struct ClientContext *currentContext;

void OnRecv(evutil_socket_t fd, short what, void *arg) {
  currentContext = arg;
  yy_switch_to_buffer(currentContext->bufferState);
  yylex(); /* Your YY_INPUT should call event_del(), event_free(), yy_delete_buffer() and close() the socket on error or closed connection */
}

Unfortunately, you'll gain nothing by using threads.
 