Some will regard this as an enhancement request. To each his own, but IMO *grep has always had a huge deficiency when processing natural languages due to line breaks. PDFGREP especially because most PDF docs carry a payload of natural language.
If I need to search for “the.orange.menace” (dots are 1-char wildcards), of course I want to be told of cases like this:
A court whereby no one is above the law found the orange
menace guilty on 34 counts of fraud..
When processing a natural language, a sentence terminator is almost always a more sensible boundary. There’s probably no command older than grep that’s still in use today. So it’s bizarre that it has not evolved much. In the 90s there was a LexisNexis search tool which was far superior for natural language queries. E.g. (IIRC):
foo w/s bar
:: matches if “foo” appears within the same sentence as “bar”
foo w/4 bar
:: matches if “foo” appears within four words of “bar”
foo pre/5 bar
:: matches if “foo” appears before “bar”, within five words
foo w/p bar
:: matches if “foo” appears within the same paragraph as “bar”
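For what it’s worth, the w/s operator can be roughly approximated with standard tools. This is only a minimal sketch under loud assumptions: `same_sentence` is a hypothetical helper name (not an existing command), the sentence splitter is naive (every `.`, `!` or `?` ends a sentence, so abbreviations and decimals split wrongly), and it relies on `grep -w`, which is widely available but not required by POSIX.

```shell
# same_sentence WORD1 WORD2: read text on stdin and print every sentence
# that contains both words (hypothetical helper; naive sentence splitting).
same_sentence() {
  tr '\n' ' ' |            # undo hard line wraps
    tr '.!?' '\n\n\n' |    # naive: one sentence per line, terminator dropped
    grep -i -w -e "$1" |   # keep sentences containing WORD1 ...
    grep -i -w -e "$2"     # ... that also contain WORD2
}

printf 'The court found the orange\nmenace guilty. The menace appealed.\n' |
  same_sentence orange menace
```

Here only the first sentence is printed, even though “orange” and “menace” sit on different input lines; the second sentence contains just one of the two words.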
Newlines as record separators are probably sensible for all things other than natural language. But for natural language grep is a hack.
TIL there are people who (try to) use grep for natural language.
Whatever the case, I’d rather not modify (GNU) grep for this purpose. If grep can be hacked to do this (
tr '\n' ' ' | grep 'the orange menace'
or something maybe only a little more sophisticated?) then more power to you. Otherwise, making a separate tool is probably better.
If all you’re advocating for is allowing grep to use some other character as a delimiter, I might be able to get behind something like bash’s $IFS or awk’s $FS variable (maybe). But I couldn’t get behind anything backwards-incompatible.
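As a point of comparison, awk already exposes a configurable record boundary through its record separator, `RS` (the sibling of `FS`). Setting `RS` to the empty string makes each blank-line-separated paragraph one record. The sketch below uses that for a paragraph-scoped search; `para_grep` is a hypothetical name made up for illustration, and the pattern is an awk regular expression rather than a grep one.

```shell
# para_grep REGEX: print each blank-line-separated paragraph from stdin
# whose text matches REGEX once its wrapped lines are joined
# (hypothetical helper, not an existing tool).
para_grep() {
  awk -v re="$1" '
    BEGIN { RS = "" }           # paragraph mode: blank lines separate records
    { gsub(/\n/, " ") }         # join the wrapped lines of the paragraph
    $0 ~ re { print }           # print the paragraph if it matches
  '
}

printf 'found the orange\nmenace guilty.\n\nOther paragraph.\n' |
  para_grep 'the.orange.menace'
```

Nothing about grep changes here; it only shows that a non-newline record boundary is an opt-in setting in a neighbouring tool, which is the backwards-compatible shape being discussed.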
Meanwhile, PDFGREP isn’t associated with any maintainers of grep, is it? I’ve never used it and I don’t know if you’re saying “because PDFGREP is good at handling natural language, grep should be too” or “grep should be good at handling natural language and PDFGREP even more so, but neither is.” Either way, I don’t follow how PDFGREP is relevant to discussions about grep (unless they are related, but I’d be surprised if that was the case.)
Oh, and this is 100% feature/enhancement request territory. Not a bug report in any sense.
If all you’re advocating for is allowing grep to use some other character as a delimiter, I might be able to get behind something like bash’s $IFS or awk’s $FS variable (maybe). But I couldn’t get behind anything backwards-incompatible.
Of course. GREP has an immeasurable number of scripts dependent on it worldwide going back 50 years and it’s among Debian’s 23 essential packages:
dpkg-query -Wf '${Package;-40}${Essential}\n' | grep yes
Changing grep’s default behavior now would bring the world down. Dams would shatter. Nuclear power plants would melt down. Traffic lights would go berserk. It would be like a Die Hard 4 “fire sale”. Planes would fall out of the sky. Skynet would come online and wipe us all out. It would have to be a separate option.
TIL there are people who (try to) use grep for natural language.
The very first task grep was created for is specifically natural language input. Search “Federalist Papers grep”. There’s also a short documentary about this out in the wild somewhere but I don’t have any link handy.
Oh, and this is 100% feature/enhancement request territory. Not a bug report in any sense.
This is conventional wisdom coming from a viewpoint that simultaneously misses grep’s intended purpose.
But now that the defect has been rooted in for ~50 years, perhaps it’s fair enough to leave grep alone. For me it depends on how lean the improvement could be. Bloating grep out too much would not be favorable, but substantial replication of code between two different tools is also unfavorable. Small is good, but swiss army knives of tools also bring great value if they can be lean and internally simple.
I don’t know if you’re saying “because PDFGREP is good at handling natural language, grep should be too”
Not at all. They both have the same problem. But this same limitation in pdfgrep is a nuisance in more situations because PDFs are proportionally more likely to contain natural language.
Either way, I don’t follow how PDFGREP is relevant to discussions about grep
They have the same expression language and roughly the same options. PDFGREP is most likely not much more than a grep wrapper that extracts the text from the PDF first.
But now that the defect has been rooted in…
Not a defect. What is it with people equating “doesn’t do this one harebrained thing I want it to” with “broken?”
It’s not a bug if it works as designed. Unless somewhere some official documentation says (some specific version of) grep supports what you’re advocating for but the actual grep command doesn’t, it’s not a defect. It’s a feature request.
To qualify as a “bug”, I’d also accept “it used to do this and it doesn’t any more and not on purpose”.
Even if (say, GNU) grep maintainers decided they’d make grep support what you’re going for, there’d still be design to do. Should it be a flag? Should the regex syntax be extended to support this? Should we add an environment variable? Some combination of the three? Something else? If we go with the flag, what should it be called and what should be its semantic meaning? Should it take an argument? Etc, etc, etc.
Even assuming this feature is necessary to fulfill “grep’s intended purpose” (and I’m far from convinced it is), that doesn’t make it a bug if it was never designed in to the program.
It’s not a bug if it works as designed.
What you claim here is that software cannot have a defective design. Of course you have design defects. These are the hardest to correct.
I’d also accept “it used to do this and it doesn’t any more and not on purpose”.
This is conventional wisdom. Past behavior is no more an indication of correctness than of defectiveness. GREP’s purpose was to process natural language. A line feed is not a sensible terminator in that application. For 50 years people have just lived with the limitation, worked around it, or adapted to single-token searches. It does not cease to be a defect because workarounds were available.
that doesn’t make it a bug if it was never designed in to the program.
The original design was implemented on an extremely resource-poor system by today’s standards, where 64k was a HUGE amount of space. It was built to function under limitations that no longer exist. I would say the design is not defective so long as your target platform is a PDP-11 from the 1970s. Otherwise the design should evolve along with the tasks and machines.
grep isn’t really designed as a natural language search tool but perl -pe can do a pretty similar thing to what you’re looking for.
perl -0777 -pe 's/\n/ /g' file.txt | perl -ne 'print "$1\n" while /(.{0,20}(the.orange.menace).{0,20})/g'
grep isn’t really designed as a natural language search tool
My understanding of GREP history is that Ken Thompson created grep to do some textual analysis on The Federalist Papers, which to me sounds like it was designed for processing natural language. But it was on a PDP-11 which had resource constraints. Lines of text would be more uniform to manage than sentences given limited resources of the 1970s.
Thanks for the PERL code. Though I might favor sed or awk for that job. Of course that also means complicating emacs’ grep mode facility. And for PDFs I guess I’d opt for pdfgrep’s limitations over doing a text extraction on every PDF.
Hm… yeah, I didn’t know that; I just sort of assumed that it was for searching code etc initially, but you are correct.
BTW I just learned about
pcregrep -M
which can do a little more directly what you’re asking for – you can do
pcregrep -M 'the(.|\n)orange(.|\n)menace'
which seems to work, although you may want -A or -B to give a little more useful output also.