Grep Tutorial: Searching File Contents

Anthony James

The grep command allows searching the contents of a file from the command line. It’s a very useful tool to find a particular line in, say, a log file or a conf file. And because it’s a command line program, you can combine it with other commands in various ways to produce powerful results. In this tutorial, you will learn both the basics and some more advanced applications of grep.

Looking for a needle in a haystack

Suppose you have a large configuration file in which you want to find a particular setting. For example, you might want to know the current maximum upload file size of your PHP installation. The following grep command will quickly give you your answer:

    grep upload_max_filesize php.ini

In my case, it outputs “upload_max_filesize = 2M”. Pretty neat, isn’t it? Let’s have a closer look at this command and what it does. The first part, “grep”, invokes the grep command. The second part, “upload_max_filesize”, is the needle we’re looking for, and the third part, “php.ini”, is the file that is our haystack. When invoked, grep reads the haystack file line by line, looking for the needle, and prints every line that contains the needle. The general way of invoking grep is:

    grep <needle> <haystack>

As you see, there are two places of interest in this format. The first is the needle, and the second is the haystack. Read on to learn more about both of these, and how you can use them to your advantage.
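To try the needle-and-haystack form without touching a real configuration file, here is a minimal sketch. The temporary directory and the contents of the sample php.ini are made up for illustration; only the grep invocation itself comes from the text above.

```shell
#!/bin/sh
# Work in a throwaway directory that is removed when the script exits.
workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT

# A tiny stand-in for php.ini; these settings are invented sample data.
cat > "$workdir/php.ini" <<'EOF'
memory_limit = 128M
post_max_size = 8M
upload_max_filesize = 2M
max_file_uploads = 20
EOF

# grep <needle> <haystack>: print every line of the file containing the needle.
match=$(grep upload_max_filesize "$workdir/php.ini")
echo "$match"
```

Only the one line containing the needle is printed; the other three settings are skipped.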

On the use of needles

The needle, or search pattern as it is more commonly called, specifies what a line should contain to be printed by grep. This can be a word or a single character, but also a so-called regular expression. If you’re not familiar with regular expressions, there are plenty of wonderful tutorials on the web, though it could take you a while to master them. For this tutorial, I will just show you the most commonly used example, and warn you about some pitfalls.

The simplest search pattern is just a literal word or character. You can find all recent pages requested by Googlebot in your web server’s access log by executing this:

    grep Googlebot access.log

Note that this search is case sensitive. A line containing “googlebot” or “GoogleBot” is not printed. If you’d like to ignore case, add the -i option:

    grep -i Googlebot access.log

Note that options are placed before the search pattern. This command will also print lines containing “googlebot”, “GoogleBot” or even “gOoGLeBoT”.

Now that we’re looking at a web server access log anyway, you might also want to view all requests by actual people. A logical way to do this would be to filter out all lines containing requests by known bots. An interesting option of grep is the -v option, which inverts the search results by printing all lines that do not contain the pattern:

    grep -iv Googlebot access.log

Remember that -i made the search case insensitive, and that “-iv” is equivalent to “-i -v”, so that this command prints all lines not containing “Googlebot”, “gOoGLeBoT” and so on. Everything that’s printed now was not requested by Googlebot. But there are more known bots, like bingbot for example. How do you exclude both from the output of grep?

This is where the most commonly used regular expression comes in:

    grep -Eiv 'Googlebot|bingbot' access.log

Notice three important things:

  1. The -E option allows the use of extended regular expressions as a pattern. Without it, grep uses basic regular expressions, in which a plain “|” is just an ordinary character, so -E is what makes this pattern work on any computer on which you might execute this command.
  2. The “|” character in the pattern means or, and makes the pattern match both “Googlebot” and “bingbot”. You can add more matching words as long as you place “|” characters between them, like ‘Googlebot|bingbot|Baiduspider|facebookexternalhit’.
  3. Finally, the single quotes around the pattern prevent your shell from interpreting the special characters in the regular expression. Without the quotes, the shell treats “|” as a pipe and tries to run “bingbot” as a command. This may cause disaster, but usually results in some kind of “command not found” error.
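The -i, -v and -E options discussed above can be compared side by side on a small sample log. This is only a sketch: the four log lines, the IP addresses and the temporary directory are invented for illustration and are not taken from a real server.

```shell
#!/bin/sh
workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT

# Invented access log: two Googlebot requests (different capitalisation),
# one bingbot request and one request from an ordinary browser.
cat > "$workdir/access.log" <<'EOF'
66.249.66.1 - - [10/Oct/2023:13:55:36 +0000] "GET / HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; Googlebot/2.1)"
203.0.113.7 - - [10/Oct/2023:13:56:01 +0000] "GET /about HTTP/1.1" 200 1024 "-" "Mozilla/5.0 (X11; Linux x86_64)"
157.55.39.1 - - [10/Oct/2023:13:57:12 +0000] "GET / HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; bingbot/2.0)"
66.249.66.2 - - [10/Oct/2023:13:58:44 +0000] "GET /robots.txt HTTP/1.1" 200 64 "-" "googlebot"
EOF

# Case sensitive: only the line spelled exactly "Googlebot" matches.
case_sensitive=$(grep Googlebot "$workdir/access.log")

# Case insensitive: the lowercase "googlebot" line matches as well.
case_insensitive=$(grep -i Googlebot "$workdir/access.log")

# Inverted: every line that does NOT mention Googlebot (bingbot is still there).
no_googlebot=$(grep -iv Googlebot "$workdir/access.log")

# Extended regex with "|": drop Googlebot and bingbot in one go.
humans=$(grep -Eiv 'Googlebot|bingbot' "$workdir/access.log")

echo "$humans"
```

With this sample data, only the request from 203.0.113.7 survives the last filter.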

Notice that the “|” character has a special meaning in this context. A common pitfall when using grep is to put special characters in the search pattern without realizing they have a special meaning. For example, the dot character matches any single character, even if you don’t use the -E option. Have a look at this command:

    grep i.e. textfile.txt

You might want to find all lines that contain “i.e.”, the abbreviation for “that is”. However, this command will also print lines containing “item” or “cabinet” or even “businessman”. All of these words contain the character sequence “i<some character>e<some character>”, and the period matches any single character. If you want grep to look for the search pattern exactly as it is, you should use the -F option to look for fixed strings and put the search pattern between single quotes to prevent your shell from interpreting the special characters. The correct command would be:

    grep -F 'i.e.' textfile.txt

However, when you want to combine multiple alternatives containing special characters in a single pattern, the -F option gets in the way, because it also strips the “|” character of its special meaning. In this case you’ll need to escape the special characters that grep should search for as they are. For example, if you want to look for “i.e.” or “e.g.”, escape the dots using backslashes, like this:

    grep -E 'i\.e\.|e\.g\.' textfile.txt

A backslash before a special character tells grep to search for the special character literally, ignoring its special meaning. For a full list of characters that need escaping, refer to the man page of grep (run “man grep” in your command line).
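The dot pitfall is easy to reproduce. The sketch below uses an invented four-line text file and compares the unescaped pattern, the -F variant and the escaped -E variant. The last command is an alternative the text above does not cover: the standard -e option can be repeated to give several patterns, which also works together with -F.

```shell
#!/bin/sh
workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT

# Invented sample text: one line with "i.e.", one with "e.g.", two decoys.
cat > "$workdir/textfile.txt" <<'EOF'
Use a shell, i.e. a command interpreter.
The first item on the list.
Many tools, e.g. grep, read stdin.
A cabinet full of manuals.
EOF

# Unescaped dots match any character, so "item" and "cabinet" match too.
unescaped=$(grep 'i.e.' "$workdir/textfile.txt")

# -F searches for the fixed string "i.e." and nothing else.
fixed=$(grep -F 'i.e.' "$workdir/textfile.txt")

# Escaped dots keep "|" special while the dots are taken literally.
escaped=$(grep -E 'i\.e\.|e\.g\.' "$workdir/textfile.txt")

# Alternative (not from the article): several fixed strings via repeated -e.
multi_fixed=$(grep -F -e 'i.e.' -e 'e.g.' "$workdir/textfile.txt")

echo "$escaped"
```

On this sample, the unescaped pattern prints three lines, -F prints one, and the last two commands both print the “i.e.” and “e.g.” lines.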

On the selection of haystacks

So far, we have only searched single text-based files. However, grep can search multiple files. You can just put multiple files after the search pattern and grep will search them all. Also, you can use the -r option to recursively search all files under a directory. For example:

    # This command prints all lines where a comment is started in
    # index.php and product.php:
    grep -E '//|/\*' index.php product.php

    # This command prints all lines containing TODO in all files under the
    # current directory:
    grep -r TODO .

Note the backslash in the first pattern: “*” is a special character too, so it has to be escaped to match a literal asterisk. Also note that “.” signifies the current (working) directory in the command line, and that the -r option also searches files in subdirectories and their subdirectories and so on. When you search multiple files, grep will automatically print the name of the file in which a matching line was found before the matching line itself. The output will look like:

    ./index.php: // TODO: add simpler user authentication
    ./header.php: /* TODO: fix the following JavaScript:

Always useful to know which TODO is in which file, isn’t it?

Finally, when you use the -r option and there are some binary files (non-text files) under the directory being searched, you may get results like this: “Binary file somefile matches”. Grep can search binary files, but as they are neither line-based nor human-readable, all that’s printed if there is a match is this line. You can stop grep from wasting time on these files by using the -I (capital i) option to Ignore binary files.

Now, there is one more kind of haystack that I want to tell you about: the standard input. If you do not specify a haystack, grep will search its stdin. This means that you can pipe the output of another command to grep, and then search that output for a search pattern. This allows you to quickly look for lines containing something you want to know.
For example, you can find out if a certain package is installed on your system:

    # For Ubuntu/Debian users
    dpkg -l | grep libpng

    # For RHEL/CentOS/Fedora users
    rpm -qa | grep libpng

The commands before the “|” character generate a list of all installed packages, and grep only prints the lines containing “libpng”. You can also search your kernel ring buffer for errors or warnings:

    dmesg | grep -Ei 'error|warning'

Additionally, you can use grep to check if a certain process is running:

    ps -ef | grep sshd

This is actually so commonly used that many distros now include a command dedicated to this way of grepping: pgrep. If it’s installed, the equivalent of the above command is “pgrep -fl sshd”. It prints less information, but doesn’t print a superfluous line for the grep command itself.
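The three kinds of haystack (several files, a directory tree, and standard input) can be tried out in a scratch directory. This is a sketch under assumptions: the file names and contents are invented, and package_list is a hypothetical stand-in for a real command such as “dpkg -l” or “rpm -qa”, so the example runs on any system. The output of -r is piped through sort only because the order in which grep visits files is not fixed.

```shell
#!/bin/sh
workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT
cd "$workdir" || exit 1

# Invented project tree: two text files and one file containing a NUL byte,
# which grep typically treats as binary.
mkdir -p templates
printf '%s\n' '<?php' '// TODO: add simpler user authentication' > index.php
printf '%s\n' '<?php' '/* TODO: fix the following JavaScript: */' > templates/header.php
printf 'TODO\000binary\n' > logo.bin

# Recursive search. For logo.bin, grep reports only that the binary file
# matches; the exact wording and stream of that notice vary between versions.
all_hits=$(grep -r TODO . 2>&1 | sort)

# Same search with -I: binary files are skipped altogether.
text_hits=$(grep -rI TODO . | sort)
echo "$text_hits"

# Hypothetical stand-in for a package listing command.
package_list() {
    printf '%s\n' 'libjpeg-turbo8 2.1.2' 'libpng16-16 1.6.37' 'zlib1g 1.2.11'
}

# No haystack given, so grep searches whatever arrives on its stdin.
piped=$(package_list | grep libpng)
echo "$piped"
```

Each line of text_hits is prefixed with the file it came from, while the piped search prints the matching line without any file name because there is no file.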

Bonus: a cool trick

Now, I don’t want you to finish reading this article without knowing a not-so-obvious trick. Recently, somebody wanted me to believe that a certain volume of songs we both knew was very egocentric and contained the word “I” more often than the word “you”. I happened to have a text-based digital copy of the volume. Using the following grep commands, I found out that the volume wasn’t so egocentric after all:

    grep -iorw I songvolume/ | wc -l
    grep -iorw you songvolume/ | wc -l

It turned out that “I” occurred 1257 times and “you” occurred 1803 times. Let me explain how this trick works:

  • “-iorw”: these options mean, in the order in which they are listed:
    • be case insensitive
    • for each match, print only the matching characters (this means that if a word occurs multiple times on a line, grep prints it multiple times on separate lines, which is needed for an accurate count)
    • recursively search all files under the directory given as a haystack
    • match only whole words, not every “i” occurring anywhere
  • “| wc -l”: pipes the output of grep, containing one line for every time “I” or “you” occurs, to “wc -l”, which counts the number of lines it gets and prints that number.
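To see why the -o option matters for the count, here is a small sketch with an invented two-file “songvolume” directory (the lyrics are made up). It counts occurrences with -o and, for comparison, counts matching lines without it.

```shell
#!/bin/sh
workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT
cd "$workdir" || exit 1

# Invented lyrics, just enough to make the counts differ.
mkdir songvolume
cat > songvolume/song1.txt <<'EOF'
I said I would wait for you
If you go, I follow
EOF
cat > songvolume/song2.txt <<'EOF'
You and I, you and me, you know
EOF

# One output line per occurrence of the whole word, in any case.
count_i=$(grep -iorw I songvolume/ | wc -l)
count_you=$(grep -iorw you songvolume/ | wc -l)

# Without -o, a line with two matches is printed (and counted) only once.
lines_with_i=$(grep -irw I songvolume/ | wc -l)

echo "I: $count_i, you: $count_you, lines containing I: $lines_with_i"
```

In this sample, “I” occurs four times on three lines, so leaving out -o would undercount. The -w option keeps “If”, “said” and “wait” from being counted.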

The combination of grep and wc -l is used in many more tricks, whenever the number of lines containing something is of interest. Think of web access logs (how many times has my site been visited by Googlebot?), programming code (how many TODO comments are left?) or process tables (how many instances of apache are running at this moment?). If you know another cool way to use grep, feel free to share it in the comments!

