C++ regex: is there a way to use (.*?)

Postby ChrisCphDK » 10 Aug 2009, 08:23

Hi.

I'm working on a method that should search a string of text and extract text in-between tags, and store the resulting strings in a vector<string> to be returned.

Example:

string to search: "dummy<t>test</t><t>test2</t>dummy"
tags: "<t>", "</t>"
desired result strings in vector: "test", "test2"

I'm using the regular expression library by including <regex.h>.
If I build a pattern like: "<t>" + "(.*)?" + "</t>"
Then the result is the classic greedy-problem where it stops at the very last tag instead of the ones in between.
But according to the man page and a regcomp() error message, I cannot use "(.*?)" to create a lazy / non-greedy expression.

Is there another way to do this?

Best reg.
Chris

Postby ChrisCphDK » 10 Aug 2009, 08:42

Forgot to mention system specs:
MacBook Pro.
Mac OS X (10.5.8)

Postby Alt » 10 Aug 2009, 09:11

I may be wrong, but maybe you need a Perl-compatible regexp library (PCRE)?
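
For readers who only want to see the lazy quantifier at work: this is not PCRE itself, but a minimal sketch that swaps in C++11's std::regex, whose default ECMAScript grammar also accepts (.*?). It assumes a C++11 compiler, which the original poster may not have had; the function name lazyBetweenTags and the hard-coded <t> tags are made up for illustration.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: collects the text of every <t>...</t> pair using a
// lazy (non-greedy) quantifier. Requires C++11 <regex>, not POSIX <regex.h>.
std::vector<std::string> lazyBetweenTags(const std::string& text)
{
    // ECMAScript is the default std::regex grammar and supports ".*?".
    static const std::regex re("<t>(.*?)</t>");

    std::vector<std::string> results;
    // sregex_iterator walks over all non-overlapping matches in the string.
    for (std::sregex_iterator it(text.begin(), text.end(), re), end;
         it != end; ++it)
    {
        results.push_back((*it)[1].str());   // group 1 = text between the tags
    }
    return results;
}
```

Called on the example string from the first post, this should yield the two strings "test" and "test2".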

Postby Levenson » 10 Aug 2009, 10:30

To select all the text inside the HTML tags, use the pattern [^<>]+. So if your text is between <t> and </t> tags, you can use this pattern:
(?:<t>)([^<>]+)(?:</t>)

Your desired text will be in group $1. But there are many subtleties.
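
One of those subtleties is that (?:...) is Perl syntax, not POSIX. A minimal sketch, assuming the <regex.h> API from the first post, that checks whether a pattern compiles as a POSIX extended regex; the helper name compilesAsPosixEre is made up for illustration.

```cpp
#include <regex.h>
#include <string>

// Hypothetical helper: returns true if `pattern` compiles as a POSIX
// extended regular expression (REG_EXTENDED) with regcomp().
bool compilesAsPosixEre(const std::string& pattern)
{
    regex_t re;
    if (regcomp(&re, pattern.c_str(), REG_EXTENDED) != 0)
        return false;        // compilation failed, nothing to free
    regfree(&re);            // release the compiled pattern
    return true;
}
```

With this, "<t>([^<>]+)</t>" is expected to compile, while the variant with the (?:...) groups is expected to be rejected, because the ? follows an opening parenthesis with nothing to repeat.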
To Think...

Postby ChrisCphDK » 10 Aug 2009, 11:21

Thanks for the quick replies Alt and Levenson.

@Levenson: I get the same error ("repetition-operator operand invalid") when using the complete regex you provided. If I substitute the (.*?) with your ([^<>]+) I get the correct match, but only one match.

I think maybe I'm misunderstanding how matches work, so here's the code I made for gathering all strings and storing them in a vector:

Code:
std::vector<std::string> regexGetMultiTextBetweenTags(const std::string xml, const std::string &startTag, const std::string &endTag)
{
   regex_t reg; //will store the compiled regex pattern
   regmatch_t matches[MAXREGEXMATCHES]; //found matches, maxregmatches = 500
   //std::string pattern = "(?:<test>)([^<>]+)(?:</test>)"; //
   std::string pattern = startTag + "([^<>]+)" + endTag; //
   std::string result; //the result to return
   std::vector<std::string> vecResults;
   char error[128];
   int errnum = 0;
   if((errnum = regcomp(&reg, pattern.c_str(), REG_EXTENDED)) != 0) //compiles the regex
   {
      regerror(errnum, &reg, error, sizeof(error));
      std::cout << "regex error: " << error << std::endl;
      return vecResults; //empty vector
   }
   int res = regexec(&reg, xml.c_str(), MAXREGEXMATCHES, matches, 0); //execute the regex
   if(res != 0) //REG_NOMATCH (no matches found) or an error occurred
   {
      regfree(&reg); //release the compiled pattern before returning
      return vecResults;
   }
   
   //the matches[0] contain the full string which is not relevant, so we start from 1
   for(int i = 1; i < MAXREGEXMATCHES && matches[i].rm_so != -1; i++)
   {
      std::cout << "i: " << i << std::endl;
      result = xml.substr(matches[i].rm_so, matches[i].rm_eo-matches[i].rm_so);
      vecResults.push_back(result);
   }
   regfree(&reg);
   
   return vecResults;
}


Am I getting it wrong if I expect the regex to continue through the entire string looking for a match, and then each time it finds one it stores the position in the matches-array?

Best regards
Chris
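
On how the matches array is to be understood: as far as POSIX describes it, one regexec() call reports a single match. Slot 0 of the regmatch_t array is the whole match and slot n is the n-th parenthesised group of the pattern; the array is not a list of successive matches. A minimal sketch to make that visible (the helper name slotsOfFirstMatch and the array size of 8 are arbitrary choices for illustration):

```cpp
#include <regex.h>
#include <string>
#include <vector>

// Hypothetical helper: runs regexec() once and returns the text of every
// filled slot of the regmatch_t array
// (slot 0 = whole match, slot n = n-th parenthesised group).
std::vector<std::string> slotsOfFirstMatch(const std::string& text,
                                           const std::string& pattern)
{
    std::vector<std::string> slots;
    regex_t re;
    if (regcomp(&re, pattern.c_str(), REG_EXTENDED) != 0)
        return slots;                       // pattern did not compile

    const size_t kMaxSlots = 8;             // arbitrary size for this sketch
    regmatch_t m[kMaxSlots];
    if (regexec(&re, text.c_str(), kMaxSlots, m, 0) == 0) {
        // Unused slots have rm_so == -1; stop at the first one.
        for (size_t i = 0; i < kMaxSlots && m[i].rm_so != -1; ++i)
            slots.push_back(text.substr(m[i].rm_so, m[i].rm_eo - m[i].rm_so));
    }
    regfree(&re);
    return slots;
}
```

For the example string and the pattern "<t>([^<>]+)</t>" this should return just two entries, "<t>test</t>" and "test". The second tag pair never shows up, which matches the "only one match" result above: to reach it, regexec() has to be called again on the rest of the string.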

Postby ChrisCphDK » 10 Aug 2009, 22:51

Decided to abandon the regex and use string methods instead to locate and extract data.
Still, I'm curious as to how matches in regex are to be understood.
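
The poster's string-method code is not shown in the thread. As a rough sketch of what such an approach could look like, assuming plain std::string::find and literal (non-regex) tags, with the function name textBetweenTags invented for illustration:

```cpp
#include <string>
#include <vector>

// Hypothetical helper: returns the text between each startTag/endTag pair,
// scanning left to right with std::string::find (no regex involved).
std::vector<std::string> textBetweenTags(const std::string& text,
                                         const std::string& startTag,
                                         const std::string& endTag)
{
    std::vector<std::string> results;
    if (startTag.empty() || endTag.empty())
        return results;                          // nothing sensible to search for

    std::string::size_type pos = 0;
    while (true) {
        std::string::size_type open = text.find(startTag, pos);
        if (open == std::string::npos)
            break;                               // no further opening tag
        std::string::size_type begin = open + startTag.size();
        std::string::size_type close = text.find(endTag, begin);
        if (close == std::string::npos)
            break;                               // opening tag without a close
        results.push_back(text.substr(begin, close - begin));
        pos = close + endTag.size();             // continue after the closing tag
    }
    return results;
}
```

Because find() always returns the nearest closing tag, this behaves like a non-greedy match without needing any regex support. It does not handle nested tags.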

Postby trev » 14 Aug 2009, 11:09

Code:

/******************************************************************************
 *  $Id: regex_filter.c,v 1.3 2009/08/14 11:04:37 trev Exp $
 *
 *  Name   : regex_filter
 *  Date   : 14 August 2009
 *  Author : Trev
 *
 *  Syntax : regex_filter < input_file > output_file
 *           cat input_file | regex_filter > output_file
 *
 *  Purpose: Extract content from <t></t> tags
 *
 *  Notes  : Based on comlaw_regbinaryformats_filter
 *
 *  ToDo   :
 *****************************************************************************/

/* Includes */

#include <regex.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Prototypes */

int main(int, char *[]);
void extract_content(char *);   /* extract content from tags */

/* Defines */

#define MAX_STR         1024    /* size file input line buffer */

/* Globals */


/************
 *  MAIN()  *
 ************/

int main(int argc, char *argv[])
{
char buffer[MAX_STR] = "";      /* Input file line buffer */

while (fgets(buffer, sizeof(buffer), stdin) != NULL)
        {
        extract_content(buffer);
        }

exit(0);
}

/***********************
 *  extract_content()  *
 ***********************/

void extract_content(char string[])
{
regex_t    preg;
regmatch_t mtch[1];
size_t     nmatch;
int rm;         /* regcomp() return code */
int start = 0;  /* Offset from the beginning of the line; must start at 0 */
char tempstr[MAX_STR] = "";

/* Pattern */
rm=regcomp(&preg, "<t>[^<]+</t>", REG_EXTENDED);
if (rm != 0)    /* Pattern failed to compile */
        return;

/* How many matches do we want in a line? */
nmatch = 1;

/* Execute regex */
while(regexec(&preg, string+start, nmatch, mtch, 0)==0) /* Found a match */
        {
          /* Skip the 3 chars of "<t>"; the match is 7 chars longer than the content ("<t>" + "</t>") */
          strncpy(tempstr, string+start+mtch[0].rm_so+3, mtch[0].rm_eo-mtch[0].rm_so-7);
          printf("%s\n", tempstr);

          /* Update the offset */
          start +=mtch[0].rm_eo;

          /* Have to zero string or problem with shorter  */
          /* string after a longer string                 */
          memset(tempstr, '\0', strlen(tempstr));
        }

/* Clean up */
regfree(&preg);
}


Code:
$ echo 'dummy<t>test</t><t>test2</t>dummy' | ./regex_filter
test
test2


Presumably you could adapt my C solution for C++ :)
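
One possible C++ adaptation, as a sketch rather than a definitive version: it keeps the same regexec() loop that advances past each match, but uses a capture group instead of the hard-coded tag lengths and returns a std::vector<std::string> as asked for in the first post. The function name extractAllBetweenTags is made up, and the tags are assumed to contain no regex metacharacters.

```cpp
#include <regex.h>
#include <string>
#include <vector>

// Hypothetical helper: returns the text between every startTag/endTag pair.
// Assumes the tags contain no regex metacharacters (e.g. "<t>" and "</t>").
std::vector<std::string> extractAllBetweenTags(const std::string& text,
                                               const std::string& startTag,
                                               const std::string& endTag)
{
    std::vector<std::string> results;
    const std::string pattern = startTag + "([^<>]+)" + endTag;

    regex_t re;
    if (regcomp(&re, pattern.c_str(), REG_EXTENDED) != 0)
        return results;                       // pattern did not compile

    regmatch_t m[2];                          // m[0] = whole match, m[1] = group 1
    const char* cursor = text.c_str();
    // Each regexec() call finds one match; restart after it to find the next.
    while (regexec(&re, cursor, 2, m, 0) == 0) {
        results.push_back(std::string(cursor + m[1].rm_so,
                                      m[1].rm_eo - m[1].rm_so));
        cursor += m[0].rm_eo;                 // offsets are relative to cursor
    }
    regfree(&re);
    return results;
}
```

The loop always makes progress because [^<>]+ needs at least one character, so m[0].rm_eo is never zero.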

Postby ChrisCphDK » 15 Aug 2009, 09:36

trev, thanks for your code example.
Much appreciated.

