i was trying to parse a string with pyparsing so all the words were separated from the punctuation signs, i was using this expression to do it:

OneOrMore(Word(alphanums)) + OneOrMore(Char(printables))

But when i parse the following string with this expression:

return abc(1, ULLONG_MAX)

All the words inside the parentheses get split:

['return', 'abc', '(', '1', ',', 'U', 'L', 'L', 'O', 'N', '_', 'M', 'A', 'X', ')', ';']

But if i use this expression:

OneOrMore(Word(alphanums)) + OneOrMore(Char(string.punctuation))

Only a part of the string gets parsed:

['return', 'abc', '(']

What is wrong with those expressions?

  • e0qdk
    link
    fedilink
    31 year ago

    Haven’t used that particular library, but have written libraries that do similar sorts of things and have played with a few other similar libraries in C++ and Haskell. I’ve taken a quick glance at the documentation here, but since I don’t know this library specifically apologizes in advance if I make a mistake.

    For OneOrMore(Word(alphanums)) + OneOrMore(Char(printables)) it looks it matches as many alphanum Words as it can (whitespace sequences being an acceptable separator between tokens by default) and when it hits ( it cannot continue with that so tries to match the next expression in the sequence. (i.e. OneOrMore(Char(printables)))

    The documentation says:

    Char - a convenience form of Word that will match just a single character from a string of matching characters

    Presumably, that means it will not group the characters together, which is why you get individual character matches after that point for all the remaining non-whitespace characters. (Your result also seems to imply there was a semicolon at the end of your input?)

    For OneOrMore(Word(alphanums)) + OneOrMore(Char(string.punctuation)) it looks like it cannot match further than ( since 1 is not a punctuation character; so, you got the tokens for the parts of the string that matched. (If you chained the parser expression with something like + Word(alphanum) I’d expect you’d get another token [i.e. "1"] added onto the end of your result.) You may eventually want StringEnd/LineEnd or something like that – I’d expect they’d fail the parser expression if there’s unconsumed input (for error detection), but again, haven’t used this specific library, so it may work different than I expect.

    There appears to be a Combine class you can use to join string results together; that might be useful for future reference.

    i was trying to parse a string with pyparsing so all the words were separated from the punctuation signs

    Have not tested it (since I don’t have a copy of the library installed anywhere and can’t set up an environment for it easily right now) but perhaps something like OneOrMore(Word(alphanums)|Char(string.punctuation)) would be more like what you are looking for?