grep/pdfgrep’s inability to match across lines

@[email protected] · edited 8 months ago

@TootSweet · edited 8 months ago

TIL there are people who (try to) use grep for natural language.

Whatever the case, I’d rather not modify (GNU) grep for this purpose. If grep can be hacked to do this (tr '\n' ' ' | grep 'the orange menace' or something maybe only a little more sophisticated?) then more power to you. Otherwise, making a separate tool is probably better.
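For what it's worth, the tr idea can be sketched roughly as follows. The sample file and its contents are made up for illustration, and -o is added only so the output is the matched text rather than the whole flattened file.

```shell
# Hypothetical sample: the phrase is split across a line break.
sample=$(mktemp)
printf 'Everyone was talking about the orange\nmenace again today.\n' > "$sample"

# Plain grep counts zero matching lines, because no single line holds the
# whole phrase. grep exits non-zero on no match, hence the "|| true".
grep -c 'the orange menace' "$sample" || true

# Turn newlines into spaces first, then search the resulting single line.
tr '\n' ' ' < "$sample" | grep -o 'the orange menace'
```

The obvious cost is that line numbers and surrounding-line context (-n, -A, -B) stop being meaningful once everything is one line.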

If all you’re advocating for is allowing grep to use some other character as a delimiter, I might be able to get behind something like bash’s $IFS or awk’s FS variable (maybe). But I couldn’t get behind anything backwards-incompatible.
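On the delimiter idea, GNU grep already has something close: the -z (--null-data) option switches the line terminator from newline to NUL, and the GNU grep manual's FAQ entry on matching across lines mentions it. A minimal sketch, assuming GNU grep and a made-up sample file:

```shell
# Hypothetical sample: the phrase is split across a line break.
sample=$(mktemp)
printf 'Everyone was talking about the orange\nmenace again today.\n' > "$sample"

# With -z, a text file containing no NUL bytes is read as one big "line",
# so [[:space:]]+ can match the newline between "orange" and "menace".
# -o prints only the match, terminated by NUL; tr turns that NUL into a newline.
grep -zoE 'the[[:space:]]+orange[[:space:]]+menace' "$sample" | tr '\0' '\n'
```

Without -o, a hit prints the entire file, since the whole file is a single record. The option is a GNU extension rather than POSIX, so it is not something a portable script can rely on.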

Meanwhile, PDFGREP isn’t associated with any maintainers of grep, is it? I’ve never used it and I don’t know if you’re saying “because PDFGREP is good at handling natural language, grep should be too” or “grep should be good at handling natural language and PDFGREP even more so, but neither is.” Either way, I don’t follow how PDFGREP is relevant to discussions about grep (unless they are related, but I’d be surprised if that was the case.)

Oh, and this is 100% feature/enhancement request territory. Not a bug report in any sense.

@[email protected] · edited 8 months ago

If all you’re advocating for is allowing grep to use some other character as a delimiter, I might be able to get behind something like bash’s $IFS or awk’s FS variable (maybe). But I couldn’t get behind anything backwards-incompatible.

Of course. GREP has an immeasurable number of scripts dependent on it worldwide, going back 50 years, and it’s among Debian’s 23 essential packages:

dpkg-query -Wf '${Package;-40}${Essential}\n' | grep yes

Changing grep’s default behavior now would bring the world down. Dams would shatter. Nuclear power plants would melt down. Traffic lights would go berserk. It would be like a Die Hard 4 “fire sale”. Planes would fall out of the sky. Skynet would come online and wipe us all out. It would have to be a separate option.

TIL there are people who (try to) use grep for natural language.

The very first task grep was created for was specifically natural language input. Search “Federalist Papers grep”. There’s also a short documentary about this out in the wild somewhere, but I don’t have a link handy.

Oh, and this is 100% feature/enhancement request territory. Not a bug report in any sense.

This is conventional wisdom, but it comes from a viewpoint that misses grep’s intended purpose.

But now that the defect has been rooted in for ~50 years, perhaps it’s fair enough to leave grep alone. For me it depends on how lean the improvement could be. Bloating grep out too much would not be favorable, but substantial replication of code between two different tools is also unfavorable. Small is good, but swiss army knives of tools also bring great value if they can be lean and internally simple.

I don’t know if you’re saying “because PDFGREP is good at handling natural language, grep should be too”

Not at all. They both have the same problem. But this same limitation in pdfgrep is a nuisance in more situations because PDFs are proportionally more likely to contain natural language.

Either way, I don’t follow how PDFGREP is relevant to discussions about grep

They have the same expression language and roughly the same options. PDFGREP is most likely not much more than a grep wrapper that extracts the text from the PDF first.

@TootSweet · 8 months ago

But now that the defect has been rooted in…

Not a defect. What is it with people equating “doesn’t do this one harebrained thing I want it to” with “broken?”

It’s not a bug if it works as designed. Unless somewhere some official documentation says (some specific version of) grep supports what you’re advocating for but the actual grep command doesn’t, it’s not a defect. It’s a feature request.

To qualify as a “bug”, I’d also accept “it used to do this and it doesn’t any more and not on purpose”.

Even if (say, GNU) grep maintainers decided they’d make grep support what you’re going for, there’d still be design to do. Should it be a flag? Should the regex syntax be extended to support this? Should we add an environment variable? Some combination of the three? Something else? If we go with the flag, what should it be called and what should be its semantic meaning? Should it take an argument? Etc, etc, etc.

Even assuming this feature is necessary to fulfill “grep’s intended purpose” (and I’m far from convinced it is), that doesn’t make it a bug if it was never designed in to the program.

@[email protected] · edited 8 months ago

It’s not a bug if it works as designed.

What you claim here is that software cannot have a defective design. Of course software can have design defects. These are the hardest to correct.

I’d also accept “it used to do this and it doesn’t any more and not on purpose”.

This is conventional wisdom. Past behavior is no more an indication of correctness than defectiveness. GREP’s purpose was to process natural language. A line feed is not a sensible terminator in that application. For 50 years people just lived with the limitation, worked around it, or adapted to single-token searches. It does not cease to be a defect because workarounds were available.

that doesn’t make it a bug if it was never designed in to the program.

The original design was implemented on an extremely resource-poor system by today’s standards, where 64k was a HUGE amount of space. It was built to function under limitations that no longer exist. I would say the design is not defective so long as your target platform is a PDP-11 from the 1970s. Otherwise the design should evolve along with the tasks and machines.

mozz · 8 months ago

grep isn’t really designed as a natural language search tool but perl -pe can do a pretty similar thing to what you’re looking for.

perl -0777 -pe 's/\n/ /g' file.txt | perl -ne 'print "$1\n" while /(.{0,20}(the.orange.menace).{0,20})/g'

@[email protected] · 8 months ago

grep isn’t really designed as a natural language search tool

My understanding of GREP history is that Ken Thompson created grep to do some textual analysis on The Federalist Papers, which to me sounds like it was designed for processing natural language. But it was on a PDP-11 which had resource constraints. Lines of text would be more uniform to manage than sentences given limited resources of the 1970s.

Thanks for the PERL code. Though I might favor sed or awk for that job. Of course that also means complicating emacs’ grep mode facility. And for PDFs I guess I’d opt for pdfgrep’s limitations over doing a text extraction on every PDF.

mozz · 8 months ago

Hm… yeah, I didn’t know that; I just sort of assumed that it was for searching code etc initially, but you are correct.

BTW I just learned about pcregrep -M which can do a little more directly what you’re asking for – you can do pcregrep -M 'the(.|\n)orange(.|\n)menace' which seems to work, although you may want -A or -B to give a little more useful output also.