grep has gained a few competitors over the years. ack and grin both aim to fill in the gaps in grep's functionality, and provide a style of interaction that is focused on searching large code repositories. By default, they search recursively, colorize their output, and ignore certain "obvious" files that no one generally wants to search through. The latter both cuts noise out of search results, and allows them to run faster.
I've used all three over the years; mostly grep and grin. I've encountered surprising issues with ack and grin, both in terms of performance and behavior:
-a
option, but it turns
out: not really.
The discovery and resolution of these issues has caused me to switch back and forth a few times, and finally I decided to do a performance test of all three, just to make sure I wasn't missing out on something.
grep --color=auto -sPIr \ --exclude-from=excludes \ --exclude-dir='.*' \ --exclude-dir='~.dep' \ --exclude-dir='~.dot' \ --exclude-dir='~.nib' \ --exclude-dir='~.plst' \ --exclude-dir=blib \ --exclude-dir=CVS \ --exclude-dir=RCS \ --exclude-dir=SCCS \ --exclude-dir=_darcs \ --exclude-dir=_sgbak \ --exclude-dir=autom4te.cache \ --exclude-dir=cover_db \ --exclude-dir=_build \ --exclude-dir=build \ 'aflksdjfk?falk\d+sdfj' `pwd` ack -a \ --ignore-dir=build \ --ignore-dir=dist \ --type-set='junk=.pyc,.pyo,.so,.o,.a,.tgz,.tar.gz,.rar,.zip,~,#,.bak,.png,.jpg,.gif,.bmp,.tif,.tiff,.pyd,.dll,.exe,.obj,.lib' \ --type=nojunk \ 'aflksdjfk?falk\d+sdfj' grin \ -d 'CVS,RCS,.svn,.hg,.bzr,build,dist,.git,~.dep,~.dot,~.nib,~.plst,blib,SCCS,_darcs,_sgbak,autom4te.cache,cover_db,_build' \ 'aflksdjfk?falk\d+sdfj'
For the full picture, here is the contents of the excludes
file
referenced by the grep command:
.* *.pyc *.pyo *.so *.o *.a *.tgz *.tar.gz *.rar *.zip *~ *# *.bak *.png *.jpg *.gif *.bmp *.tif *.tiff *.pyd *.dll *.exe *.obj *.lib #?*# .*.swp _*.swp core.[0-9]*
I ran this on my personal code projects directory, which has all kinds of random stuff that's built up over the years. Code, images, executables, archive files, SQLite databases, and so on. Here are the results:
grep | 12.524s |
ack | 25.873s |
grin | 14.703 |
There were a few exceptions with grin; it can't ignore file patterns other
than literal suffixes, so ack's default ignoring of #.+#\$
,
[._].*\\.swp\$
, and core\\.\\d+\$
could not be
applied to grin. Thankfully, all of those filenames are pretty rare, so it
shouldn't matter much.
ack's file suffix ignoring mechanism, which is a bit circuitous, turns out
to be completely disabled when using -a
. It is impossible to
ignore any files when using -a
; only literal directory names (no
globs). In other words, you can modify ack's whitelist of file suffixes, but
if you want to forego the whitelist and use a blacklist approach instead,
tough luck. You either use the whitelist or you search everything.
If you are okay with the whitelist approach, ack is pretty close in performance to the other two. It performed the same search as above in in 13.988 seconds. This number can't be compared strictly to the others, but it's as close as I can get.
So in short, performance is fairly uniform, with grep being the fastest by a fairly small margin (around 10-20%).
grep did not have the --exclude-dir
option until
version 2.5.3. That was released in 2007 or 2008 (it's surprisingly
hard to track down the date), but Ubuntu 10.04 is still using grep 2.5.1. In
light of this, and to be fair to grep with regard to any recent performance
enhancements, I installed the newest (2.6.3) package from Launchpad.
Now that I had the --exclude-dir
option available to
me, I had a lot of trouble with it. If you tell it to ignore .*
(any "dot directories"), and you then pass .
as the directory to
search, it will immediately exit without having done anything. It might seem
obvious why when I state it that way, but I was truly baffled for a little
while. One solution is to pass `pwd`
instead of .
;
But now, all of the filenames in your search results will have their full,
absolute path shown, and that's usually quite long and ridiculous to sift
through. Another solution is to never ignore .*
, but rather
ignore specific names like .git
and .svn
. You can
even ignore almost every dot-dir you'll encounter in the real world by using
--exclude-dir='.[a-zA-Z0-9]*'
. This will fail if a dot-dir starts
with anything other than an ascii alphanumeric character, but it should be good
for the vast majority of cases. By the way, .?*
and
.??*
mysteriously do not work. For me, they
prevent grep from recursing. I don't understand that at all. It may be some
weird artifact of the options grep is passing to fnmatch()
.
grep also fails to exclude a directory glob that
looks/like/this*
. I'm not sure why this happens either.
Beyond those issues, grep, unlike ack and grin, has a pretty complete set of options for excluding files and directories.
There are some other issues with grep. You probably know about these. They all mostly have solutions now.
-E
or -P
.
--color=auto
.
-s
.
I've had a couple issues with grin over time. The first is a lack of a
-w
(word) option. You can simulate it by doing
\bpattern\b
, but that's pretty tedious. The author did not seem
very interested in implementing this feature when I asked. ack and grep have
it.
My other issue with grin is well known by the author:
[...] setuptools installs scripts indirectly; the scripts installed to $prefix/bin or Python2xScripts use setuptools' pkg_resources module to load the exact version of grin egg that installed the script, then runs the script's main() function. This [...] can add substantial startup overhead [...]. If you want the response of grin to be snappier, I recommend installing custom scripts that just import the grin module and run the appropriate main() function. -- From the grin PyPI page
Not only does the default script start up a bit slower than it could, but if you hit control-C soon after grin starts up, you might get an ugly Python traceback, because grin hasn't gotten to its KeyboardInterrupt try/catch statement yet.
This is more a Python packaging limitation than a problem with grin per se. Nonetheless, it's another annoyance to deal with as a user, and fair or not, it makes it less appealing.
grin only supports excluding literal directory names, and filename suffixes.
Its default whitelisting behavior is a really poor choice in my opinion. If it isn't familiar with a given file extension, it will simply ignore it. Since you have no idea it's ignoring it, you won't know that you missed something until there is some unfortunate side effect. That can be downright dangerous when refactoring big, old, ugly code that has stuff in all kinds of unpredictable filenames. A coworker and I have both sadly run into this problem while working on a messy legacy PHP project where some PHP files had names ending with ".inc". I can picture this whitelist behavior biting a lot of people in the ass when they don't realize that's how ack works.
The default whitelisting approach might be forgivable if it were possible to turn it off and go with a blacklisting approach, but that, according to the author, is simply not supported.
These are all of the variations of file/directory exclusions I could dream up, and their support across these three tools:
Excluding Files | |||
---|---|---|---|
grep | grin | ack | |
fixed name (foo ) | OK | suffixes only | - |
glob (fo* ) | OK | - | - |
fixed name w/path (path/to/foo ) | OK | - | - |
glob w/path (path/to/fo* ) | - | - | - |
Excluding Directories | |||
grep | grin | ack | |
fixed name (foo ) | OK | OK | OK |
glob (fo* ) | OK | - | - |
fixed name w/path (path/to/foo ) | OK | - | - |
glob w/path (path/to/fo* ) | OK | - | - |
I'm going with this for now:
export GREP_OPTIONS='-rIPs --exclude-dir=.[a-zA-Z0-9]* --exclude=.* --exclude=*~ --color=auto' alias cgrep='grep --color=always'
This brings grep 95% of the way towards doing what I appreciate about grin and ack. You do still have to pass the directory name to search, whereas ack and grin will default to the current directory if you don't tell them otherwise. However, I can live with typing another space and period.
The cgrep alias will force colors on, which you can pipe through
less -R
if you want to page the output.
What I think would work amazingly would be a hierarchical set of exclusion
rule files. Let's give them the filename .grepignore
. You could
have a .grepignore
in your home directory which would list
files/dirs you always want to ignore. Then in each project's
directory (which are children of your home directory) you could have another
.grepignore
file that would ignore the specific files that you
want to ignore in that project. $GREP would then ignore the superset of all
the .grepignore
files from /
down to the directory
you're in. This seems like it would be elegant, simple, and effective.
I looked into implementing something like this in grin, but it turns out
there may be a good reason for grin not being able to ignore
multi/level/paths/with/glob*s -- Python's fnmatch function is (re-)implemented
in pure Python and does not use the system fnmatch. Thus, it's impossible to
use the FNM_PATHNAME
flag, which enables sane multi-level
globbing. Python's fnmatch thinks that the glob foo*bar
matches
foo/x/y/z/bar
, which is strange and contrary to most other
tools.
Implementing hierarchical exclusion rule files in grep would certainly be more laborious, since it's written in C instead of Python. I may try doing it with some kind of wrapper script instead. Anyone wanna beat me to it?