Episode 3 Wildcards and grep

In this episode, we use basic wildcards to select files, and then explore how the ‘grep’ command can search for words or phrases across multiple files. As always, you can follow along using the same directory structure by downloading it from https://github.com/commandlinetv/sample-files.

10 August 2015

[Rhythmic, dark electronic intro music]

League

Welcome back to Command Line TV. This is Episode 3.

Today we’re going to talk about wildcards and text processing using pipelines.

First of all, do we have any follow-up from last time?

Lopes

I did have a question about accessing files, especially when it comes to their extensions.

We did access a .gz file and I was curious – one of the other files we had was a .tar.gz

and I was curious as to which one was the actual extension type?

League

Yeah, so first of all, extensions in UNIX don’t mean quite as much as they do on other systems.

They are primarily there for humans, and the system can work –

most commands at least can work perfectly well with whatever extension you want to give it.

So when you have a file like this .tar.gz – what’s happening there is that it’s one type of file,

which is a .tar. A tar is an archive file, kind of like a zip.

But what’s interesting about tar is that it’s not by itself compressed.

All it does is it packages up a bunch of files, so that it creates one file,

and then you can compress that separately.

So that’s why it gets 2 extensions. The .tar means that it’s archived files,

and the .gz means that it’s compressed.

And so they go in that order.

But extensions are really not as meaningful as on other systems.

So for example, if I wanted to rename that as something else, I can still use it as a compressed file.

Or another pretty shocking example – last time I think we looked at a .png file –

we did an external viewer to open up the PNG. That was in thinkjava/figs.

cd thinkjava/figs/
ls *.png

And I have this PNG, so we did xdg-open gridworld.png to look at this file.

xdg-open gridworld.png

And it pops up in a separate viewer here.

But that viewer doesn’t – and even xdg-open itself – doesn’t care that it’s a .png extension.

I can rename that file. So, to rename a file you use mv –

we’re going to learn a lot more about mv, probably in the next episode.

And then I rename it so it has a .txt extension.

mv gridworld.png gridworld.txt

So it looks like that would be a text file. But when I do xdg-open on that,

it still opens up the image viewer. It still knows that it’s an image.

So that’s a little bit odd.

The way that it knows that is that it actually looks at the content of the file, rather than the extension.

So there’s a command that does that too, called the file command.

file gridworld.txt

When you run file on a text file, or on any type of file,

it looks into the file content and identifies what’s there to tell you what it is.

And so file knows, even though I called this .txt, it knows it’s a PNG image, and it has its dimensions

and everything else about the image. So that’s pretty useful.

For that .tar.gz, that you mentioned, if I run the file command, it just says that it’s compressed data.

It doesn’t actually decompress to say what is there behind the compressed data.
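
The point that tools look at content rather than the extension can be tried with gzip. This is a minimal sketch, assuming gzip is installed; the filenames notes.txt and notes.dat are made up for illustration and are not part of the sample files.

```shell
# Work in a scratch directory so nothing real is touched.
demo=$(mktemp -d)
cd "$demo"

# Make a small text file and compress it; gzip replaces it with notes.txt.gz.
printf 'hello from a compressed file\n' > notes.txt
gzip notes.txt

# Rename it so the extension no longer says anything about compression.
mv notes.txt.gz notes.dat

# gzip reads the content, not the name, so decompressing still works.
# Reading from standard input avoids any filename-suffix check.
restored=$(gzip -dc < notes.dat)
echo "$restored"

# If the file command is installed, it also reports on the content.
if command -v file >/dev/null 2>&1; then
    file notes.dat
fi

# Clean up the scratch directory.
cd / && rm -rf "$demo"
```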

Lopes

So you just did ls *.png. I know that ‘star’ is one of the wildcards.

What are other things that this * wildcard is capable of?

League

Right, so * – the idea is that it matches filenames.

And * can match any number of characters, including zero characters.

And including characters that seem like they would be special, like dots.

So a very common way to use it is with an extension, so *.png means anything followed by a .png.

And we know that works.

But you can also use it in some other interesting ways.

So if I want to see every filename here that has an a in it, I could say ls *a*, right.

ls *a*

And this means that any characters come before an a, and any characters come after an a.

And both of those stars could match emptiness, so it could start with an a or end with an a.

And we get a subset of the files that were listed – just the ones that have an a in them.

Or if it’s just a*, then that’s any filename that starts with an a.

ls a*

So you don’t have to do it just along the lines of extensions.

You might think it only works with *.png or something like card.*.

ls card.*

Those work fine, but you can also use it in more flexible ways.
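
The star patterns above can be tried safely in a scratch directory. This is a small sketch; the filenames are invented for illustration and are not the ones in thinkjava/figs. It uses echo instead of ls, since the shell substitutes the matching names either way.

```shell
# Work in a scratch directory with a few empty files.
demo=$(mktemp -d)
cd "$demo"
touch apple.png banana.txt card.fig card.png cherry.png

# echo prints whatever names the shell substitutes for each pattern.
contains_a=$(echo *a*)   # an "a" anywhere in the name
starts_a=$(echo a*)      # names that begin with "a"
cards=$(echo card.*)     # "card." followed by anything
pngs=$(echo *.png)       # anything that ends in ".png"

echo "contains a:  $contains_a"
echo "starts with: $starts_a"
echo "card files:  $cards"
echo "png files:   $pngs"

# Clean up the scratch directory.
cd / && rm -rf "$demo"
```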

Lopes

Besides star, are there any other wildcards that exist?

League

Sure, some of it depends on what shell you’re using and how it’s configured.

But I’ll give you 2 more of the basic ones that are always available.

One is the question mark. So, like a star, a question mark matches characters, but it only matches

exactly one character – not zero or more. So you can use it to substitute a missing character.

And a great example of that is – I have files here that have numbers in them, so list1 list2 list3.

If I wanted to match all the list files with any number after them,

I could say list?.fig to get anything where there’s one character following list.

ls list?.fig

Or it could be list?.* to mix and match both kinds of wildcards.

ls list?.*

So that question mark matching one character is useful in lots of situations.

You can also pile them together a little bit.

So if I wanted to match multiple characters, but a specific number of them, like 3 characters,

then I could put 3 question marks. So let’s say for example list*.???.

ls list*.???

So the star will match anything – that’s going to match my numbers –

and then the question mark matches one character, but it does that 3 times.

So that will get any file that starts with list and has a 3 character extension.

But it would not work if there was a one or 2 character extension.
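
Here is a sketch of how the question mark differs from the star, assuming a scratch directory with made-up list files (list10.fig and list1.c are added only to show what does not match).

```shell
# Work in a scratch directory with a few empty files.
demo=$(mktemp -d)
cd "$demo"
touch list1.fig list2.fig list3.fig list10.fig list1.c

# ? stands for exactly one character, so list10.fig is left out.
one_char=$(echo list?.fig)

# ? and * can be mixed: one character, a dot, then any extension.
mixed=$(echo list?.*)

# Three question marks require an extension of exactly 3 characters.
# "set --" stores the matches as $1, $2, ... so $# is how many matched.
set -- list*.???
three_char_ext=$#

echo "list?.fig   -> $one_char"
echo "list?.*     -> $mixed"
echo "list*.???   -> $three_char_ext files"

# Clean up the scratch directory.
cd / && rm -rf "$demo"
```

The last pattern matches the four .fig files, including list10.fig, but not list1.c, whose extension is only one character long.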

Lopes

So the star could be used to search for things that you have unknown lengths for,

and the question mark is used for more precise queries?

League

Yeah, I think so. If you know exactly that it’s one character, or how many chars it needs to be, then that’s useful.

One great example is when you’re doing C – so I’m going to go over here to a little C program.

So C programs usually use the extensions .c and .h especially, and those are single character extensions.

So you can do something like ls *.? to just pull out those single-character extensions.

ls *.?

It turns out every file in this folder has a single-character extension, so it matched all of them.

But if these were interspersed with other files, it would allow me to select just those.

That leads me to the third kind of wildcard we can do today, which is the square brackets.

So when you put square brackets into a file expression like this, then you can put

individual characters that would match.

So if I only want to match files that end with a c or an h, I put the ch in brackets.

ls *.[ch]

And that matches only one character, like the question mark,

but the character has to be one of the specified characters.

So this would match the .c and the .h but not the .o.

You can switch the order of these, but that won’t make a difference – it’s just any character from the set.

ls *.[hc]

So the order within that set doesn’t matter.
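
A quick sketch of the square brackets, assuming a scratch directory with three made-up files standing in for the C program folder:

```shell
# Work in a scratch directory with a source, a header, and an object file.
demo=$(mktemp -d)
cd "$demo"
touch hello.c hello.h hello.o

# [ch] matches exactly one character, which must be c or h.
ch=$(echo *.[ch])

# Swapping the order inside the brackets changes nothing.
hc=$(echo *.[hc])

echo "*.[ch] -> $ch"
echo "*.[hc] -> $hc"

# Clean up the scratch directory.
cd / && rm -rf "$demo"
```

Both patterns pick up hello.c and hello.h and skip hello.o.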

Lopes

I know that besides the square brackets, we also used the squiggly braces as a wildcard.

What’s the purpose of that?

League

It’s a little bit of overlap with what we’ve already seen, but it works a little differently.

So it’s important to understand the difference.

If I use squiggly braces here, I can specify different possible extensions.

They can be more than one character, so we’re going to separate them with commas.

So if I did something like *.{c,h}, that would match anything that ends in a c or ends in an h.

ls *.{c,h}

So that is the same as with square brackets, no more power.

But let’s look at some other files up here.

So I’ve got a bunch of files that start with config, right – config.* – there are 4 of them.

And if I only wanted the .h and .log, then I could use the curly braces to specify {h,log}

ls *.{h,log}

and it would only match those 2.
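
One way to see how braces work differently from brackets is the sketch below. It assumes bash is installed, because braces are a feature of shells such as bash and zsh rather than of plain POSIX sh; the filenames are made up, and config.nope deliberately does not exist.

```shell
# Work in a scratch directory with a few empty files.
demo=$(mktemp -d)
cd "$demo"
touch hello.c hello.h main.c config.h config.log config.status

# Run each command through bash explicitly, in case the current
# shell is a plain sh that does not understand braces.

# Brackets: one pattern, so the matches come back as one sorted list.
brackets=$(bash -c 'echo *.[ch]')

# Braces: rewritten to "*.c *.h" first, then each pattern is matched,
# so the .c files are listed before the .h files.
braces=$(bash -c 'echo *.{c,h}')

# Braces generate every name listed, whether or not the file exists.
generated=$(bash -c 'echo config.{h,log,nope}')

echo "brackets:  $brackets"
echo "braces:    $braces"
echo "generated: $generated"

# Clean up the scratch directory.
cd / && rm -rf "$demo"
```

So braces are closer to a shorthand for typing several names or patterns than to a wildcard that is checked against the files on disk.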

Lopes

So far we’ve only used wildcards with the ls command.

Can wildcards be used with the other commands we’ve used so far, such as more or less?

League

Yeah, definitely. So wildcards can be used with any command.

In fact, wildcards are expanded by your shell program – the program that is interpreting all your commands.

That means they can work with commands that aren’t even necessarily programmed to use them.

So let’s try it with a couple other commands. So cat we know can show the contents of a file.

So if I do cat config.h it will dump out those contents on my screen.

cat config.h

If I give it multiple files, using for example curly braces, it will just dump the contents of both files.

cat config.{h,log}

And of course that scrolled off the screen, but then I can pipe it into more or less.

So I’m first getting the config.h and later on – somewhere – I’ll get the config.log.

So it can cat both of those files at the same time.
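
The claim that the shell, not the command, expands wildcards can be checked with a tiny helper. In this sketch, count_args is a hypothetical function written for the illustration, and the two config files are made up.

```shell
# Work in a scratch directory with two small files.
demo=$(mktemp -d)
cd "$demo"
printf 'first file\n' > config.h
printf 'second file\n' > config.log

# A helper that knows nothing about wildcards:
# it only reports how many arguments it was handed.
count_args() {
    echo "$#"
}

# The shell replaces config.* with both filenames before the call.
expanded=$(count_args config.*)

# Quoting the pattern stops the expansion, so one argument arrives.
literal=$(count_args 'config.*')

# cat likewise receives two ordinary filenames and prints both in turn.
both=$(cat config.*)

echo "unquoted: $expanded arguments"
echo "quoted:   $literal argument"
echo "$both"

# Clean up the scratch directory.
cd / && rm -rf "$demo"
```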

A command we learned last week that does something especially useful with multiple files is head.

So if I do head config.h – you remember what head does?

head config.h

Lopes

It just shows the intro to that file?

League

Yeah, the first 8 or 10 lines, whatever that is.

And we can specify an option here to make it shorter or longer.

head -3 config.h

But you can also give that multiple files.

So if I said config.* here – this is pretty cool – it actually puts a little header with the filename,

head -3 config.*

and gives me 3 lines from that file, a blank line, and then the next file.

3 lines from that file, blank line, and so on.

So it’s showing me the top couple of lines from each of multiple files.

And the multiple files are just based on the wildcard.
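
A sketch of head with several files, assuming three made-up config files of five lines each. It spells the option as -n 3, which is the POSIX form of the -3 shorthand typed in the episode.

```shell
# Work in a scratch directory.
demo=$(mktemp -d)
cd "$demo"

# Create three files, each holding five lines tagged with its own name.
for name in config.h config.log config.status; do
    for n in 1 2 3 4 5; do echo "$name line $n"; done > "$name"
done

# With more than one file, head prints a "==> name <==" header
# before the first 3 lines of each.
preview=$(head -n 3 config.*)
echo "$preview"

# Count the headers and the content lines that came back.
headers=$(echo "$preview" | grep -c '^==> ')
content=$(echo "$preview" | grep -c ' line ')

# Clean up the scratch directory.
cd / && rm -rf "$demo"
```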

Lopes

Using head, we were able to find out what the top 3 or 4 or 5 lines of each file had.

What if instead we wanted to search throughout those files for particular words or phrases?

League

Great, there’s a perfect command for that, that you’re going to love. [Laughs]

This is one of the most powerful Unix commands that is accessible to a beginner.

And it’s called grep.

What you do with grep is you specify a word or pattern of some sort that it will search for in files.

So let’s say I want to search for a word like Copyright.

And the files I want to search in are what I put next.

So you could list multiple files here – like that – or you could use your wildcards to specify which files.

What if I just put star, all by itself?

grep Copyright *

That will match any file at all in the current directory.

So this command says I want to see occurrences of the word Copyright in any file in the current directory.

And what this output does – it shows us a filename, so for example, there’s a file called Makefile.

And then in that file, it’s showing me only the lines that match the word Copyright.

So the Makefile seems to have 4 lines that match. And then there’s another file called Makefile.am

which has one line which matches, and so on. The README has one line.

So that’s the basic structure of grep.
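
To make the output format concrete, here is a minimal sketch with three made-up files (not the ones from the sample directory). When grep is given several files, each matching line is prefixed with the name of the file it came from.

```shell
# Work in a scratch directory with three small files.
demo=$(mktemp -d)
cd "$demo"
printf 'Copyright 2015 Example Author\nint main(void) { return 0; }\n' > hello.c
printf 'A sample project.\nCopyright notice: see hello.c\n' > readme.txt
printf 'nothing relevant here\n' > notes.txt

# Search every file in the current directory for the word Copyright.
matches=$(grep Copyright *)
echo "$matches"

# Clean up the scratch directory.
cd / && rm -rf "$demo"
```

Only hello.c and readme.txt show up; notes.txt has no matching line, so grep prints nothing for it.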

Lopes

Now the grep command you just typed dumped a lot of content onto the terminal.

Is there a better way to organize or view what we’re trying to see?

League

Yeah, one thing is that it paged off the screen, so we have to scroll up to see some of it.

And of course we know how to do a pager, so we could pipe it to less

grep Copyright * | less

and see only one screen at a time. That’s part of it.

But something else really cool you can do is grep

– at least the version from the GNU project (which we mentioned last time as well).

grep supports coloring its output.

So if you say double-dash color (--color), then it will give you this colorful output where

grep --color Copyright *

the filenames are in purple, the text you’re looking for is in red, and then the rest of that line is black.

And that just makes it a whole lot easier to see the different matches.

Lopes

So like most of the commands we’ve learned, it seems that grep has a case-sensitivity issue.

Is there a way to work around that?

League

Yeah, you notice here that I typed Copyright with a capital C.

And all of the matches it’s giving are a capital C.

If I searched for copyright with a lowercase C, it would also look for matches

– oh let’s also keep the color – there are ways to specify we always want color output,

but I’m not going to get into them right now.

So I’ll just remember to put that --color.

So when I use lowercase copyright, it’s finding instances where copyright appears in lowercase,

and those are different than the upper-case ones. So yeah it is case-sensitive.

But I can put a -i, which means that it will do a case-insensitive search.

grep --color -i copyright *

So now it will give me every match of copyright – some of them are lowercase, as in here,

and some of them are uppercase. And I think there are even ones,

if I search up a little bit, that are all caps.

Yeah here it appears in all caps, which we didn’t get by doing c or C, but it matches that too.
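
The effect of -i can be sketched with one made-up file, license.txt, that spells the word three different ways:

```shell
# Work in a scratch directory.
demo=$(mktemp -d)
cd "$demo"
printf 'Copyright 2015\nthis is under copyright\nCOPYRIGHT NOTICE\nunrelated line\n' > license.txt

# Without -i, only the exact lowercase spelling matches.
sensitive=$(grep copyright license.txt)

# With -i, any mix of upper and lower case matches.
insensitive=$(grep -i copyright license.txt)

echo "case-sensitive:"
echo "$sensitive"
echo "case-insensitive:"
echo "$insensitive"

# Clean up the scratch directory.
cd / && rm -rf "$demo"
```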

Lopes

I notice that the last 2 lines that your terminal put out

didn’t seem to put out anything in regards to copyright, that we searched for.

League

Yeah these are errors, or warning messages.

The last one here – grep can search through binary files as well as text,

but it’s not useful to show you the lines of a binary file, because they won’t be understandable.

So it just says that it matches, without showing me the line where it matches.

So that explains that one.

These other ones, which also appeared sprinkled throughout up here, are about directories.

When I specified *, star matches everything in the current directory,

but that includes other directories.

So grep by default doesn’t descend into directories, and it doesn’t search in a directory on its own.

So it just gives me a warning that one of the filenames that I included here, by typing *,

it’s not going to look at.

There are 2 things I can do about that.

One is to just silence those types of messages.

So there’s an s option. It can either go separately, or – like we did with ls -ltr –

you can merge that in with another switch.

So i and s are different options that I’m specifying here, but they’re all

part of the one dash: -is.

So s means to silence any error messages.

And if I do it that way, it doesn’t say anything about those directories, just silently ignores them.

grep --color -is copyright *

So that is a little bit of a cleaner output.

The other option is you can actually ask grep to look inside directories, and grep through

all of the files within them. So when you do that,

you specify -r for recursive. And now we’re going to see lots of files

grep --color -ir copyright *

we didn’t see before. And some of them with slash in them, which means it’s in a sub-directory.

So previously it just ignored the src folder.

But now it’s going and looking at all the files in there, also searching for copyright.

So that allows you to search many more files, very quickly.
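
The difference between -s and -r can be sketched as follows, assuming a scratch directory with one file at the top level and one inside a made-up src folder. The exact wording of the directory warning varies between grep versions, so the sketch does not depend on it.

```shell
# Work in a scratch directory that contains a subdirectory.
demo=$(mktemp -d)
cd "$demo"
mkdir src
printf 'copyright in the top level\n' > readme.txt
printf 'copyright in a subdirectory\n' > src/main.c

# "|| true" keeps the script going if grep reports a nonzero status
# because one of its arguments was a directory.

# Plain grep: src is not searched, and a warning may appear on the terminal.
plain=$(grep copyright * || true)

# -s: same matches, but messages about unreadable arguments are silenced.
quiet=$(grep -s copyright * || true)

# -r: descend into src and search the files inside it too.
deep=$(grep -r copyright * || true)

echo "with -s:"
echo "$quiet"
echo "with -r:"
echo "$deep"

# Clean up the scratch directory.
cd / && rm -rf "$demo"
```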

Lopes

Well now that we used grep to search for any instance of copyright – it puts out a lot of data –

is there any way to tell exactly where within the file that line is?

League

Yeah, that’s a great question, and very useful.

There’s a very simple option we can add to grep, which is -n.

So again, I can keep it as part of this same block or make a separate -n.

grep --color -ir -n copyright *

And what this will do is it adds a number after each filename here.

That number tells me what line it appeared on in that file.

So you can see that in this one file, hello.c, copyright appeared on line 3,

but it also appeared down on 167, 180, and 183.

So it gives you a sense of whether all of the instances appear in the same place,

or are they spread out more – stuff like that.
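
A short sketch of the -n output, using two made-up files. Each result has the form filename, line number, then the matching line, separated by colons.

```shell
# Work in a scratch directory with two small files.
demo=$(mktemp -d)
cd "$demo"
printf 'Copyright 2015\nint x;\nint y;\nsee the copyright notice\n' > hello.c
printf 'A sample project.\ncopyright details\n' > readme.txt

# -i ignores case and -n adds the line number of each match.
numbered=$(grep -in copyright *)
echo "$numbered"

# Clean up the scratch directory.
cd / && rm -rf "$demo"
```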

There are a couple more options related to changing the output style.

One of them is – let’s say that I only want to see what files matched.

I don’t really care to see the text of the line that matched, just which files.

So that is an option called -l. This is a lowercase ‘L’.

And I’m not going to do recursive anymore, but I’ll turn the -s back on,

grep --color -ils copyright *

which means suppress the error messages.

So -l will change this to be much simpler.

It’s just going to give me a list of filenames that contain that word copyright.

It doesn’t show me where it matches, and the file only appears once in this list

even if it has multiple occurrences of that text.
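
Here is a minimal sketch of -l, assuming three made-up files: one with two matching lines, one that only matches when case is ignored, and one with no match.

```shell
# Work in a scratch directory with three small files.
demo=$(mktemp -d)
cd "$demo"
printf 'printf one\nprintf two\n' > hello.c
printf 'uses the PRINTF macro\n' > config.h
printf 'no match in here\n' > notes.txt

# -l prints each matching filename once, however many lines match.
listed=$(grep -l printf *)

# Adding -i also picks up the file with the capitalized spelling.
listed_any_case=$(grep -il printf *)

echo "exact case:"
echo "$listed"
echo "any case:"
echo "$listed_any_case"

# Clean up the scratch directory.
cd / && rm -rf "$demo"
```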

So let’s try to search for a different word, and we’ll see different files that match.

Let’s try printf – and we still see printf in many of those files.

grep --color -ils printf *

If you want to go back to the style we had before, just delete the -l,

grep --color -is printf *

and now we will see where it matches.

A lot of these are matching capitalized versions of PRINTF.

So if I wanted to see if lowercase printf appears in any files, and which files it appears in,

I’ll get rid of the -i and keep the -l, and now it appears in all of those.

grep --color -ls printf *

And here are some of those.

grep --color -s printf *

So grep is a very powerful tool, just for searching for text within a file.

So in addition to that -l, which just prints the filenames that match,

there’s one other option that’s really cool, which is -c – lowercase ‘C’.

And that means to print a count of how many matches within the file.

grep --color -cs printf *

But again, it doesn’t show us the lines that match – it just shows us the counts.

So that looks like this.

What’s happening here is it has a filename, and following that filename it puts a number,

which is the number of times that the match – I believe it’s actually the number of lines that match.

So if the word printf appears on the same line twice, it only counts once.
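
The guess that -c counts matching lines rather than individual occurrences can be checked with a made-up file in which printf appears twice on one line. The -o option used at the end to count occurrences is an assumption about the installed grep: GNU and BSD grep support it, but it is not part of POSIX.

```shell
# Work in a scratch directory.
demo=$(mktemp -d)
cd "$demo"

# hello.c: printf appears twice on line 1 and once on line 3.
printf 'printf("a"); printf("b");\nputs("c");\nprintf("d");\n' > hello.c
printf 'nothing to see\n' > notes.txt

# -c counts lines that match, so line 1 counts only once.
line_count=$(grep -c printf hello.c)

# With several files, each one gets its own count, including zeros.
per_file=$(grep -c printf *)

# -o prints every occurrence on its own line; counting those lines
# gives the number of occurrences instead.
occurrences=$(grep -o printf hello.c | grep -c printf)

echo "matching lines: $line_count"
echo "occurrences:    $occurrences"
echo "$per_file"

# Clean up the scratch directory.
cd / && rm -rf "$demo"
```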

Lopes

I guess this is a good place to mention that, like all the other commands,

you can use the --help option with grep.

League

Definitely. So grep has lots more options to explore.

And if we do --help and less, like that, then we get to see some more of the options that it supports.

grep --help | less

Another small tip, if you have a phrase you want to search for rather than just a single word:

remember that spaces are significant in command lines.

So if I put spaces – let’s say I want to search for free software in all of the files.

The problem with that is it interprets the first parameter as what you’re searching for,

and the rest as filenames that you’re searching in.

So there is no file called software which means this is going to be a problem.

So to do that, I can use quotes.

The same way that I quoted spaces in filenames.

So I can use quotes there to group together "free software" as one thing that I search for.

And then, wherever that appears will show up.

But I need the quotes to group it together.
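
The quoting problem can be sketched with two made-up files. Without quotes, grep searches for the word free and treats software as a filename. With quotes, the whole phrase is the search text.

```shell
# Work in a scratch directory with two small files.
demo=$(mktemp -d)
cd "$demo"
printf 'This is free software.\nfree as in freedom\n' > readme.txt
printf 'proprietary software\n' > notes.txt

# Unquoted: the pattern is "free", and "software" is taken as a file
# that does not exist. The error is hidden here with 2>/dev/null, and
# "|| true" keeps the script going despite grep's error status.
unquoted=$(grep free software * 2>/dev/null || true)

# Quoted: the two words travel together as one search phrase.
phrase=$(grep "free software" *)

echo "unquoted:"
echo "$unquoted"
echo "quoted:"
echo "$phrase"

# Clean up the scratch directory.
cd / && rm -rf "$demo"
```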

Lopes

So today we went over the wildcards as well as a lot of features that grep has to offer our Linux users.

League

Yeah, and next time I think we will look at a few more of the text processing commands.

There are commands called cut, sort, and uniq.

A lot of data in Unix systems is kept in plain text files.

And these commands will allow us to process them and search them in particular ways.

And they all interact with each other very nicely.

We may also look a little bit at renaming files using the move (mv) command, which came up today.

So we’ll go into some of the features of that as well.

So join us next time!

[Dark electronic beat]

[Captions by Christopher League]

[End]