C++ regex: is there a way to use (.*?)

Discussion in 'Userland Programming and Scripting' started by ChrisCphDK, Aug 10, 2009.

  1. ChrisCphDK

    ChrisCphDK New Member

    Messages:
    6
    Thanks Received:
    0
    Hi.

    I'm working on a method that should search a string of text and extract text in-between tags, and store the resulting strings in a vector<string> to be returned.

    Example:

    string to search: "dummy<t>test</t><t>test2</t>dummy"
    tags: "<t>", "</t>"
    desired result strings in vector: "test", "test2"

    I'm using the regular expression library by including <regex.h>.
    If I build a pattern like: "<t>" + "(.*)?" + "</t>"
    then the result is the classic greedy problem, where the match runs to the very last closing tag instead of stopping at the ones in between.
    But according to the man page and a regcomp() error message, I cannot use "(.*?)" to create a lazy / non-greedy expression.

    Is there another way to do this?

    Best reg.
    Chris
     
  2. ChrisCphDK

    Forgot to mention system specs:
    MacBook Pro.
    Mac OS X (10.5.8)
     
  3. Alt

    Alt New Member

    Messages:
    726
    Thanks Received:
    79
    I may be wrong, but maybe you need a Perl-compatible regular expression library (PCRE)?
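    For readers on a newer toolchain, here is a minimal sketch of the lazy "(.*?)" approach using the standard <regex> header instead of POSIX <regex.h> or PCRE. It assumes a C++11 (or later) compiler, where the default ECMAScript grammar of std::regex accepts lazy quantifiers. The function name lazyExtract is made up for illustration, and the tags are assumed to contain no regex metacharacters.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: collects the text between every startTag/endTag pair
// using a lazy quantifier. Assumes the tags contain no regex metacharacters
// (true for "<t>" and "</t>").
std::vector<std::string> lazyExtract(const std::string &text,
                                     const std::string &startTag,
                                     const std::string &endTag)
{
    // ECMAScript is the default std::regex grammar and accepts "(.*?)".
    const std::regex pattern(startTag + "(.*?)" + endTag);
    std::vector<std::string> results;
    for (std::sregex_iterator it(text.begin(), text.end(), pattern), end;
         it != end; ++it)
    {
        results.push_back((*it)[1].str()); // group 1 is the text between the tags
    }
    return results;
}
```

    Called with the example string from the first post and the tags "<t>" and "</t>", this should return "test" and "test2".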
     
  4. Levenson

    Levenson New Member

    Messages:
    40
    Thanks Received:
    8
    To select all the text inside the HTML tags, use the pattern [^<>]+. So if your text is between the tags <t> and </t>, you may use this pattern
    Your desired text will be in the $1 group. But there are many subtleties.
     
  5. ChrisCphDK

    Thanks for the quick replies Alt and Levenson.

    @Levenson: I get the same error ("repetition-operator operand invalid") when using the complete regex you provided. If I substitute the (.*?) with your ([^<>]+) I get the correct match, but only one match.

    I think maybe I'm misunderstanding how matches work, so here's the code I made for gathering all strings and storing them in a vector:

    Code:
    std::vector<std::string> regexGetMultiTextBetweenTags(const std::string xml, const std::string &startTag, const std::string &endTag)
    {
    	regex_t reg; //will store the compiled regex pattern
    	regmatch_t matches[MAXREGEXMATCHES]; //found matches, maxregmatches = 500
    	//std::string pattern = "(?:<test>)([^<>]+)(?:</test>)"; //
    	std::string pattern = startTag + "([^<>]+)" + endTag; //
    	std::string result; //the result to return
    	std::vector<std::string> vecResults;
    	char error[128];
    	int errnum = 0;
    	if((errnum = regcomp(&reg, pattern.c_str(), REG_EXTENDED)) != 0) //compiles the regex
    	{
    		regerror(errnum, &reg, error, sizeof(error));
    		std::cout << "regex error: " << error << std::endl;
    		return vecResults; //empty vector
    	}
    	int res = regexec(&reg, xml.c_str(), MAXREGEXMATCHES, matches, 0); //execute the regex
    	if(res != 0) //REG_NOMATCH (no matches found) or an error occurred
    	{
    		regfree(&reg); //release the compiled pattern before returning
    		return vecResults;
    	}
    	
    	//matches[0] contains the full match, which is not relevant, so we start from 1
    	for(int i = 1; i < MAXREGEXMATCHES && matches[i].rm_so != -1; i++) 
    	{
    		std::cout << "i: " << i << std::endl;
    		result = xml.substr(matches[i].rm_so, matches[i].rm_eo-matches[i].rm_so);
    		vecResults.push_back(result);
    	}
    	regfree(&reg);
    	
    	return vecResults;
    }
    


    Am I getting it wrong if I expect the regex to continue through the entire string looking for a match, and then each time it finds one it stores the position in the matches-array?

    Best regards
    Chris
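    A sketch of how the matches array behaves, assuming the POSIX <regex.h> API used above: one regexec() call reports one match. matches[0] holds the offsets of the whole match and matches[k] those of the k-th parenthesised group, so with a single group only slots 0 and 1 are filled, no matter how many tag pairs the string contains. The helper name singleMatchSlots and the slot limit of 10 are illustrative choices, not part of the original code.

```cpp
#include <regex.h>
#include <string>
#include <vector>

// Hypothetical helper: runs regexec() once and returns the text of every
// slot it filled. Slot 0 is the whole match; slot k is the k-th group.
std::vector<std::string> singleMatchSlots(const std::string &text,
                                          const std::string &pattern)
{
    std::vector<std::string> slots;
    regex_t reg;
    if (regcomp(&reg, pattern.c_str(), REG_EXTENDED) != 0)
        return slots; // pattern did not compile

    const size_t maxSlots = 10; // arbitrary limit for this sketch
    regmatch_t matches[maxSlots];
    if (regexec(&reg, text.c_str(), maxSlots, matches, 0) == 0)
    {
        // Unused slots are reported with rm_so == -1.
        for (size_t i = 0; i < maxSlots && matches[i].rm_so != -1; ++i)
        {
            slots.push_back(text.substr(matches[i].rm_so,
                                        matches[i].rm_eo - matches[i].rm_so));
        }
    }
    regfree(&reg);
    return slots;
}
```

    With the pattern "<t>([^<>]+)</t>" and the example string, this should yield two entries, "<t>test</t>" and "test". Finding the second tag pair takes a further regexec() call that starts after the end of the first match.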
     
  6. ChrisCphDK

    Decided to abandon the regex and use string methods instead to locate and extract data.
    Still, I'm curious as to how matches in regex are to be understood.
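    The string-method approach is not shown in the thread. One possible shape for it, using only std::string::find and substr, is sketched below. The function name findBetweenTags is hypothetical, and unlike the "[^<>]+" pattern this version also returns empty content for a pair such as "<t></t>".

```cpp
#include <string>
#include <vector>

// Hypothetical helper: extracts the text between each startTag/endTag pair
// by searching for the tags directly, without any regex library.
std::vector<std::string> findBetweenTags(const std::string &text,
                                         const std::string &startTag,
                                         const std::string &endTag)
{
    std::vector<std::string> results;
    if (startTag.empty() || endTag.empty())
        return results; // empty tags would never advance the search position

    std::string::size_type pos = 0;
    while (true)
    {
        const std::string::size_type open = text.find(startTag, pos);
        if (open == std::string::npos)
            break; // no further opening tag
        const std::string::size_type contentStart = open + startTag.size();
        const std::string::size_type close = text.find(endTag, contentStart);
        if (close == std::string::npos)
            break; // opening tag without a closing tag
        results.push_back(text.substr(contentStart, close - contentStart));
        pos = close + endTag.size(); // continue after the closing tag
    }
    return results;
}
```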
     
  7. trev

    trev Member

    Messages:
    366
    Thanks Received:
    45
    Code:
    
    /******************************************************************************
     *  $Id: regex_filter.c,v 1.3 2009/08/14 11:04:37 trev Exp $
     *
     *  Name   : regex_filter
     *  Date   : 14 August 2009
     *  Author : Trev
     *
     *  Syntax : regex_filter < input_file > output_file
     *           cat input_file | regex_filter > output_file
     *
     *  Purpose: Extract content from <t></t> tags
     *
     *  Notes  : Based on comlaw_regbinaryformats_filter
     *
     *  ToDo   :
     *****************************************************************************/
    
    /* Includes */
    
    #include <regex.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    
    /* Prototypes */
    
    int main(int, char *[]);
    void extract_content(char *);   /* extract content from tags */
    
    /* Defines */
    
    #define MAX_STR         1024    /* size file input line buffer */
    
    /* Globals */
    
    
    /************
     *  MAIN()  *
     ************/
    
    int main(int argc, char *argv[])
    {
    char buffer[MAX_STR] = "";      /* Input file line buffer */
    
    while (fgets(buffer, sizeof(buffer), stdin) != NULL)
            {
            extract_content(buffer);
            }
    
    exit(0);
    }
    
    /***********************
     *  extract_content()  *
     ***********************/
    
    void extract_content(char string[])
    {
    regex_t    preg;
    regmatch_t mtch[1];
    size_t     rm, nmatch;
    int start = 0;  /* Offset from the beginning of the line */
    char tempstr[MAX_STR] = "";
    
    /* Pattern */
    rm=regcomp(&preg, "<t>[^<]+</t>", REG_EXTENDED);
    if (rm != 0)    /* Pattern failed to compile */
            {
            fprintf(stderr, "regcomp failed\n");
            return;
            }
    
    /* How many matches do we want in a line? */
    nmatch = 1;
    
    /* Execute regex */
    while(regexec(&preg, string+start, nmatch, mtch, 0)==0) /* Found a match */
            {
              /* Skip the 3 characters of "<t>"; 7 is the combined length of "<t>" and "</t>" */
              strncpy(tempstr, string+start+mtch[0].rm_so+3, mtch[0].rm_eo-mtch[0].rm_so-7);
              printf("%s\n", tempstr);
    
              /* Update the offset */
              start +=mtch[0].rm_eo;
    
              /* Have to zero string or problem with shorter  */
              /* string after a longer string                 */
              memset(tempstr, '\0', strlen(tempstr));
            }
    
    /* Clean up */
    regfree(&preg);
    }
    


    Code:
    $ echo 'dummy<t>test</t><t>test2</t>dummy' | ./regex_filter
    test
    test2
    


    Presumably you could adapt my C solution for C++ :)
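    A rough idea of what such a C++ adaptation might look like, assuming the same POSIX <regex.h> API: call regexec() in a loop and move the search start past each match. The function name extractAllWithRegexec is made up here, and a capture group replaces the fixed offsets 3 and 7 so that other tags can be passed in. The tags are assumed to contain no regex metacharacters.

```cpp
#include <regex.h>
#include <string>
#include <vector>

// Hypothetical C++ adaptation of the loop above: collects the captured text
// of every match by calling regexec() repeatedly on the rest of the string.
std::vector<std::string> extractAllWithRegexec(const std::string &text,
                                               const std::string &startTag,
                                               const std::string &endTag)
{
    std::vector<std::string> results;
    regex_t reg;
    const std::string pattern = startTag + "([^<>]+)" + endTag;
    if (regcomp(&reg, pattern.c_str(), REG_EXTENDED) != 0)
        return results; // pattern did not compile

    regmatch_t m[2]; // m[0]: whole match, m[1]: text between the tags
    const char *cursor = text.c_str();
    while (regexec(&reg, cursor, 2, m, 0) == 0)
    {
        // Offsets are relative to cursor, not to the start of text.
        results.push_back(std::string(
            cursor + m[1].rm_so,
            static_cast<std::string::size_type>(m[1].rm_eo - m[1].rm_so)));
        cursor += m[0].rm_eo; // resume after the end of this match
    }
    regfree(&reg);
    return results;
}
```

    For the example input this should produce the same two strings as the C program, "test" and "test2".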
     
  8. ChrisCphDK

    trev, thanks for your code example.
    Much appreciated.