C++ regex: is there a way to use (.*?)

Hi.

I'm working on a method that should search a string of text and extract text in-between tags, and store the resulting strings in a vector<string> to be returned.

Example:

string to search: "dummy<t>test</t><t>test2</t>dummy"
tags: "<t>", "</t>"
desired result strings in vector: "test", "test2"

I'm using the POSIX regular expression library by including <regex.h>.
If I build a pattern like: "<t>" + "(.*)?" + "</t>"
then I get the classic greedy problem: the match runs on to the very last closing tag instead of stopping at the ones in between.
But according to the man page and a regcomp() error message, I cannot use "(.*?)" to create a lazy / non-greedy expression.

Is there another way to do this?

Best reg.
Chris
 
To select the text inside HTML tags, use the pattern [^<>]+. So if your text is between <t> and </t> tags, you could use this pattern:
(?:<t>)([^<>]+)(?:</t>)
Your desired text will be in group $1. But there are many subtleties.
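
One of those subtleties: the lazy quantifier "*?" and the non-capturing group "(?:...)" are Perl-style syntax, and POSIX extended regular expressions (what regcomp() compiles with REG_EXTENDED) do not define them. If a C++11 compiler is available, std::regex with its default ECMAScript grammar does accept "(.*?)". A minimal sketch of that route, where the helper name textBetweenTags is made up for illustration and the tags are assumed to contain no regex metacharacters:

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper (not from the thread): collect the text between each
// startTag/endTag pair using std::regex, whose default ECMAScript grammar
// accepts the lazy quantifier "*?".
// Assumption: the tags contain no regex metacharacters.
std::vector<std::string> textBetweenTags(const std::string &text,
                                         const std::string &startTag,
                                         const std::string &endTag)
{
    const std::regex re(startTag + "(.*?)" + endTag);
    std::vector<std::string> results;

    // sregex_iterator steps through every non-overlapping match in turn.
    for (std::sregex_iterator it(text.begin(), text.end(), re), end;
         it != end; ++it)
    {
        results.push_back((*it)[1].str()); // group 1 = text between the tags
    }
    return results;
}
```

Called as textBetweenTags("dummy<t>test</t><t>test2</t>dummy", "<t>", "</t>"), this should return "test" and "test2".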
 
Thanks for the quick replies Alt and Levenson.

@Levenson: I get the same error ("repetition-operator operand invalid") when using the complete regex you provided. If I substitute the (.*?) with your ([^<>]+) I get the correct match, but only one match.

I think maybe I'm misunderstanding how matches work, so here's the code I made for gathering all strings and storing them in a vector:

Code:
std::vector<std::string> regexGetMultiTextBetweenTags(const std::string xml, const std::string &startTag, const std::string &endTag)
{
	regex_t reg; //will store the compiled regex pattern
	regmatch_t matches[MAXREGEXMATCHES]; //found matches, maxregmatches = 500
	//std::string pattern = "(?:<test>)([^<>]+)(?:</test>)"; //
	std::string pattern = startTag + "([^<>]+)" + endTag; //
	std::string result; //the result to return
	std::vector<std::string> vecResults;
	char error[128];
	int errnum = 0;
	if((errnum = regcomp(&reg, pattern.c_str(), REG_EXTENDED)) != 0) //compiles the regex
	{
		regerror(errnum, &reg, error, sizeof(error));
		std::cout << "regex error: " << error << std::endl;
		return vecResults; //empty vector
	}
	int res = regexec(&reg, xml.c_str(), MAXREGEXMATCHES, matches, 0); //execute the regex
	if(res != 0) //REG_NOMATCH (no match found) or an error occurred
	{
		regfree(&reg); //release the compiled pattern before the early return
		return vecResults;
	}
	
	//the matches[0] contain the full string which is not relevant, so we start from 1
	for(int i = 1; i < MAXREGEXMATCHES && matches[i].rm_so != -1; i++) 
	{
		std::cout << "i: " << i << std::endl;
		result = xml.substr(matches[i].rm_so, matches[i].rm_eo-matches[i].rm_so);
		vecResults.push_back(result);
	}
	regfree(&reg);
	
	return vecResults;
}

Am I getting it wrong if I expect the regex to continue through the entire string looking for a match, and then each time it finds one it stores the position in the matches-array?

Best regards
Chris
 
I decided to abandon the regex and use string methods instead to locate and extract the data.
Still, I'm curious as to how matches in regex are to be understood.
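
On how the matches are to be understood: a single regexec() call reports one match only, the leftmost one. The regmatch_t array is not a list of successive matches. Slot 0 holds the whole match and slots 1..n hold the parenthesised subexpressions of that same match. With one capture group in the pattern, only slots 0 and 1 are ever filled, which would explain getting a single result. A small sketch that returns the filled slots of one call (the helper name slotsOfOneCall and the limit of 10 slots are made up for illustration):

```cpp
#include <regex.h>
#include <string>
#include <vector>

// Hypothetical helper: run regexec() ONCE and return the text of each slot
// of the regmatch_t array. Slot 0 is the whole match, slots 1..re_nsub are
// the capture groups of that same match.
std::vector<std::string> slotsOfOneCall(const std::string &pattern,
                                        const std::string &text)
{
    std::vector<std::string> slots;
    regex_t reg;
    if (regcomp(&reg, pattern.c_str(), REG_EXTENDED) != 0)
        return slots; // pattern did not compile

    const size_t maxSlots = 10; // illustrative limit
    regmatch_t m[maxSlots];
    if (regexec(&reg, text.c_str(), maxSlots, m, 0) == 0)
    {
        // re_nsub is the number of parenthesised subexpressions in the pattern.
        for (size_t i = 0; i <= reg.re_nsub && i < maxSlots; ++i)
        {
            if (m[i].rm_so == -1)
                slots.push_back(""); // group did not take part in the match
            else
                slots.push_back(text.substr(m[i].rm_so, m[i].rm_eo - m[i].rm_so));
        }
    }
    regfree(&reg);
    return slots;
}
```

For the pattern "<t>([^<>]+)</t>" and the example string, this gives two entries: "<t>test</t>" (slot 0) and "test" (slot 1). To reach "test2", regexec() has to be called again on the remainder of the string.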
 
Code:
/******************************************************************************
 *  $Id: regex_filter.c,v 1.3 2009/08/14 11:04:37 trev Exp $
 *
 *  Name   : regex_filter
 *  Date   : 14 August 2009
 *  Author : Trev
 *
 *  Syntax : regex_filter < input_file > output_file
 *           cat input_file | regex_filter > output_file
 *
 *  Purpose: Extract content from <t></t> tags
 *
 *  Notes  : Based on comlaw_regbinaryformats_filter
 *
 *  ToDo   :
 *****************************************************************************/

/* Includes */

#include <regex.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Prototypes */

int main(int, char *[]);
void extract_content(char *);   /* extract content from tags */

/* Defines */

#define MAX_STR         1024    /* size file input line buffer */

/* Globals */


/************
 *  MAIN()  *
 ************/

int main(int argc, char *argv[])
{
char buffer[MAX_STR] = "";      /* Input file line buffer */

while (fgets(buffer, sizeof(buffer), stdin) != NULL)
        {
        extract_content(buffer);
        }

exit(0);
}

/***********************
 *  extract_content()  *
 ***********************/

void extract_content(char string[])
{
regex_t    preg;
regmatch_t mtch[1];
size_t     nmatch;
int rm;         /* regcomp() return code */
int start = 0;  /* Offset from the beginning of the line */
char tempstr[MAX_STR] = "";

/* Pattern */
rm=regcomp(&preg, "<t>[^<]+</t>", REG_EXTENDED);
if (rm != 0)    /* Pattern failed to compile */
        return;

/* How many matches do we want in a line? */
nmatch = 1;

/* Execute regex */
while(regexec(&preg, string+start, nmatch, mtch, 0)==0) /* Found a match */
        {
          strncpy(tempstr, string+start+mtch[0].rm_so+3, mtch[0].rm_eo-mtch[0].rm_so-7);
          printf("%s\n", tempstr);

          /* Update the offset */
          start +=mtch[0].rm_eo;

          /* Have to zero string or problem with shorter  */
          /* string after a longer string                 */
          memset(tempstr, '\0', strlen(tempstr));
        }

/* Clean up */
regfree(&preg);
}

Code:
$ echo 'dummy<t>test</t><t>test2</t>dummy' | ./regex_filter
test
test2

Presumably you could adapt my C solution for C++ :)
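
A rough idea of what such a C++ adaptation might look like, keeping <regex.h> and the same "call regexec() again from where the last match ended" loop. The function name extractAll is made up for illustration. It assumes the tags contain no regex metacharacters and that the text between them contains no '<':

```cpp
#include <regex.h>
#include <string>
#include <vector>

// Hypothetical C++ adaptation of the C loop above: call regexec()
// repeatedly, each time starting where the previous match ended.
// Assumptions: tags hold no regex metacharacters; content holds no '<'.
std::vector<std::string> extractAll(const std::string &text,
                                    const std::string &startTag,
                                    const std::string &endTag)
{
    std::vector<std::string> results;
    const std::string pattern = startTag + "([^<]+)" + endTag;

    regex_t reg;
    if (regcomp(&reg, pattern.c_str(), REG_EXTENDED) != 0)
        return results; // pattern did not compile

    regmatch_t m[2];                 // m[0] = whole match, m[1] = group 1
    const char *cursor = text.c_str();
    while (regexec(&reg, cursor, 2, m, 0) == 0)
    {
        // Offsets are relative to cursor, not to the start of text.
        results.push_back(std::string(
            cursor + m[1].rm_so,
            static_cast<std::string::size_type>(m[1].rm_eo - m[1].rm_so)));
        cursor += m[0].rm_eo;        // continue after this match
    }
    regfree(&reg);
    return results;
}
```

Using the capture group in m[1] avoids the hard-coded tag lengths (3 and 7) of the C version, so other tags can be passed in.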
 