Episode 3 Wildcards and grep

In this episode, we use basic wildcards to select files, and then explore how the ‘grep’ command can search for words or phrases across multiple files. As always, you can follow along using the same directory structure by downloading it from https://github.com/commandlinetv/sample-files.

10 August 2015

	• [Rhythmic, dark electronic intro music]
League	• Welcome back to Command Line TV. This is Episode 3. • Today we’re going to talk about wildcards and text processing using pipelines. • First of all, do we have any follow-up from last time?
Lopes	• I did have a question about accessing files, especially when it comes to their extensions. • We did access a `.gz` file and I was curious – one of the other files we had was a `.tar.gz` – • and I was curious as to which one was the actual extension type?
League	• Yeah, so first of all, extensions in UNIX don’t mean quite as much as they do on other systems. • They are primarily there for humans, and the system can work – • most commands at least can work perfectly well with whatever extension you want to give it. • So when you have a file like this `.tar.gz` – what’s happening there is that it’s one type of file, • which is a `.tar` – `tar` is an archive file, kind of like a zip. • But what’s interesting about tar is that it’s not by itself compressed. • All it does is it packages up a bunch of files, so that it creates one file, • and then you can compress that separately. • So that’s why it gets 2 extensions. The `.tar` means that it’s archived files, • and the `.gz` means that it’s compressed. • And so they go in that order. • But extensions are really not as meaningful as on other systems. • So for example, if I wanted to rename that as something else, I can still use it as a compressed file. • Or another pretty shocking example – last time I think we looked at a `.png` file – • we did an external viewer to open up the PNG. That was in `thinkjava/figs`. cd thinkjava/figs/ ls *.png • And I have this PNG, so we did `xdg-open gridworld.png` to look at this file. xdg-open gridworld.png • And it pops up in a separate viewer here. • But that viewer doesn’t – and even `xdg-open` itself – doesn’t care that it’s a `.png` extension. • I can rename that file. So, to rename a file is `mv` – • we’re going to learn a lot more about `mv`, probably in the next episode. • And then I rename it so it has a `.txt` extension. • So it looks like that would be a text file. But when I do `xdg-open` on that, • it still opens up the image viewer. It still knows that it’s an image. • So that’s a little bit odd. • The way that it knows that is that it actually looks at the content of the file, rather than the extension. • So there’s a command that does that too, called the `file` command. 
file gridworld.txt • When you run `file` on a text, or on any type of file. • It looks into the file content and identifies what’s there to tell you what it is. • And so `file` knows, even though I called this `.txt`, it knows it’s a PNG image, and it has its dimensions • and everything else about the image. So that’s pretty useful. • For that `.tar.gz`, that you mentioned, if I run the `file` command, it just says that it’s compressed data. • It doesn’t actually decompress to say what is there behind the compressed data.
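The point about extensions can be checked with a small experiment. This is a minimal sketch, assuming `gzip` is installed; the names `notes.txt` and `notes.jpg` are made up for illustration and do not come from the sample files.

```shell
# Work in a throwaway directory so nothing real is touched.
demo=$(mktemp -d)
cd "$demo"

# Create a small text file and compress it; gzip names the result notes.txt.gz.
printf 'hello from a compressed file\n' > notes.txt
gzip notes.txt

# Rename it so the extension is misleading.
mv notes.txt.gz notes.jpg

# Reading through standard input means gzip never sees the filename,
# so it can only go by the content of the file.
restored=$(gzip -dc < notes.jpg)
echo "$restored"

# If the file command is installed, it also reports the type from the content.
if command -v file >/dev/null 2>&1; then
    file notes.jpg
fi
```

The data still decompresses even though the name claims to be a JPEG, which is the sense in which extensions are mainly there for humans.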
Lopes	• So you just did `ls *.png`. I know that ‘star’ is one of the wildcards. • What are other things that this `*` wildcard is capable of?
League	• Right, so `*` – the idea is that it matches filenames. • And `*` can match any number of characters, including zero characters. • And including characters that seem like they would be special, like dots. • So a very common way to use it is with an extension, so `*.png` means anything followed by a `.png`. • And we know that works. • But you can also use it in some other interesting ways. • So if I want to see every filename here that has an `a` in it. Okay, I could say `ls *a*`, right. ls *a* • And this means that any characters come before an `a`, and any characters come after an `a`. • And both of those stars could match emptiness, so it could start with an `a` or end with an `a`. • And we get a subset of the files that were listed – just the ones that have an `a` in them. • Or if it’s just `a*`, then that’s any filename that starts with an `a`. ls a* • So you don’t have to do it just along the lines of extensions. • You might think it only works with `*.png` or something like `card.*`. ls card.* • Those work fine, but you can also use it in more flexible ways.
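The star patterns described above can be tried safely in a scratch directory. This is a sketch under assumptions: the filenames are invented for illustration, and `echo` is used instead of `ls` only so the matches can be captured on one line.

```shell
# Scratch directory with a few empty, made-up files.
demo=$(mktemp -d)
cd "$demo"
touch apple.png banana.png card.fig card.png kiwi.txt

pngs=$(echo *.png)      # anything ending in .png
with_a=$(echo *a*)      # any name containing an a, anywhere
starts_a=$(echo a*)     # names that start with an a
cards=$(echo card.*)    # card with any extension

echo "*.png  -> $pngs"
echo "*a*    -> $with_a"
echo "a*     -> $starts_a"
echo "card.* -> $cards"
```

Note that `kiwi.txt` appears in none of the results, since it has no `a` and no matching extension.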
Lopes	• Besides star, are there any other wildcards that exist?
League	• Sure, some of it depends on what shell you’re using and how it’s configured. • But I’ll give you 2 more of the basic ones that are always available. • One is the question mark. So, like a star, a question mark matches characters, but it only matches • exactly one character – not zero or more. So you can use it to substitute a missing character. • And a great example of that is – I have files here that have numbers in them, so `list1 list2 list3`. • If I wanted to match all the list files with any number after them, • I could say `list?.fig` to get anything where there’s one character following `list`. ls list?.fig • Or it could be `list?.*` to mix and match both kinds of wildcards. ls list?.* • So that question mark matching one character is useful in lots of situations. • You can also pile them together a little bit. • So if I wanted to match multiple characters, but a specific number of them, like 3 characters. • Then I could put 3 question marks. So let’s say for example `list*.???`. ls list*.??? • So the star will match anything – that’s going to match my numbers – • and then the question mark matches one character, but it does that 3 times. • So that will get any file that starts with list and has a 3 character extension. • But it would not work if there was a one or 2 character extension.
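A short sketch of the question-mark patterns from this answer, using invented filenames in a scratch directory. The file `list1.c` is an assumption added here only to show a one-character extension being excluded.

```shell
demo=$(mktemp -d)
cd "$demo"
touch list1.fig list2.fig list3.fig list1.c notes.fig

one_char=$(echo list?.fig)    # exactly one character between list and .fig
mixed=$(echo list?.*)         # one character, then any extension
three_ext=$(echo list*.???)   # any middle part, then a 3-character extension

echo "list?.fig  -> $one_char"
echo "list?.*    -> $mixed"
echo "list*.???  -> $three_ext"
```

The last pattern skips `list1.c` because three question marks need exactly three characters after the dot.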
Lopes	• So the star could be used to search for things that you have unknown lengths for, • and the question mark is used for more precise queries?
League	• Yeah, I think so. If you know exactly that it’s one character, or how many chars it needs to be, then that’s useful. • One great example is when you’re doing C – so I’m going to go over here to a little C program. • So C programs usually use the extensions `.c` and `.h` especially, and those are single character extensions. • So you can do something like `ls *.?` to just pull out those single-character extensions. ls *.? • It turns out every file in this folder is a single-character extension, so it matched all of them. • But if these were interspersed with other files, it would allow me to select just those. • That leads me to the third kind of wildcard we can do today, which is the square brackets. • So when you put square brackets into a file expression like this, then you can put • individual characters that would match. • So if I only want to match files that end with a `c` or an `h`, I put the `ch` in brackets. ls *.[ch] • And that matches only one character, like the question mark, • but the character has to be one of the specified characters. • So this would match the `.c` and the `.h` but not the `.o`. • You can switch the order of these, but that won’t make a difference – it’s just any character from the set. ls *.[hc] • So the order within that set doesn’t matter.
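The difference between `?` and square brackets can be sketched as follows. The filenames are made up to resemble a small C project and are not taken from the sample files.

```shell
demo=$(mktemp -d)
cd "$demo"
touch hello.c hello.h hello.o main.c README

any_single=$(echo *.?)      # any one-character extension
c_or_h=$(echo *.[ch])       # extension must be c or h
h_or_c=$(echo *.[hc])       # same set written in the other order

echo "*.?    -> $any_single"
echo "*.[ch] -> $c_or_h"
echo "*.[hc] -> $h_or_c"
```

Both bracket patterns give the same list, and `hello.o` is left out because `o` is not in the set.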
Lopes	• I know that besides the square brackets, we also used the squiggly braces as a wildcard. • What’s the purpose of that?
League	• It’s a little bit of overlap with what we’ve already seen, but it works a little differently. • So it’s important to understand the difference. • If I use squiggly braces here, I can specify different possible extensions. • They can be more than one character, so we’re going to separate them with commas. • So if I did something like `*.{c,h}`, that would match anything that ends in a `c` or ends in an `h`. ls *.{c,h} • So that is the same as with square brackets, no more power. • But let’s look at some other files up here. • So I’ve got a bunch of files that start with config, right – `config.*` – there are 4 of them. • And if I only wanted the `.h` and `.log`, then I could use the curly braces to specify `{h,log}` ls *.{h,log} • and it would only match those 2.
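A sketch of how braces differ from brackets, assuming the shell is bash (brace expansion is not part of plain POSIX `sh`). The four `config.*` names are invented stand-ins. The last two lines rely on bash's default settings, where a wildcard that matches nothing is left unchanged.

```shell
demo=$(mktemp -d)
cd "$demo"
touch config.h config.log config.status config.sub

# Braces list whole alternatives, which may be longer than one character.
picked=$(echo config.{h,log})
echo "config.{h,log} -> $picked"

# Braces generate every alternative, whether or not such a file exists.
generated=$(echo missing.{a,b})
echo "missing.{a,b}  -> $generated"

# Brackets only match existing names; with no match, bash's default
# behaviour is to leave the pattern as typed.
unmatched=$(echo missing.[ab])
echo "missing.[ab]   -> $unmatched"
```

In other words, braces are a text shorthand that happens to be handy for filenames, while brackets are a true filename match.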
Lopes	• So far we’ve only used wildcards with the `ls` command. • Can wildcards be used with the other commands we’ve used so far, such as `more` or `less`?
League	• Yeah, definitely. So wildcards can be used with any command. • In fact, wildcards are expanded by your shell program – the program that is interpreting all your commands. • That means they can work with commands that aren’t even necessarily programmed to use them. • So let’s try it with a couple other commands. So `cat` we know can show the contents of a file. • So if I do `cat config.h` it will dump out those contents on my screen. cat config.h • If I give it multiple files, using for example curly braces, it will just dump the contents of both files. cat config.{h,log} • And of course that scrolled off the screen, but then I can pipe it into `more` or `less`. • So I’m first getting the `config.h` and later on – somewhere – I’ll get the `config.log`. • So it can `cat` both of those files at the same time. • A command we learned last week that does something especially useful with multiple files is `head`. • So if I do `head config.h` – you remember what `head` does? head config.h
Lopes	• It just shows the intro to that file?
League	• Yeah, the first 8 or 10 lines, whatever that is. • And we can specify an option here to make it shorter or longer. head -3 config.h • But you can also give that multiple files. • So if I said `config.*` here – this is pretty cool – it actually puts a little header with the filename, head -3 config.* • and gives me 3 lines from that file, a blank line, and then the next file. • 3 lines from that file, blank line, and so on. • So it’s showing me the top couple of lines from each of multiple files. • And the multiple files are just based on the wildcard.
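The idea that the shell expands wildcards before the command runs, and that `head` then labels each file, can be sketched like this. The file contents and the helper `count_args` are hypothetical, and `-n 3` is used as the standard spelling of the `-3` shorthand.

```shell
demo=$(mktemp -d)
cd "$demo"
printf 'h1\nh2\nh3\nh4\nh5\n' > config.h
printf 'l1\nl2\nl3\nl4\nl5\n' > config.log

# A tiny helper that reports how many arguments it was given.
count_args() { echo "$#"; }

# The shell replaces config.* with the matching names first,
# so the helper (like any command) just sees two filenames.
nargs=$(count_args config.*)
echo "arguments after expansion: $nargs"

# With several files, head prints a header line before each one.
out=$(head -n 3 config.*)
echo "$out"
```

Because the expansion happens in the shell, `head` needs no wildcard support of its own.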
Lopes	• Using `head`, we were able to find out what the top 3 or 4 or 5 lines of each file had. • What if instead we wanted to search throughout those files for particular words or phrases?
League	• Great, there’s a perfect command for that, that you’re going to love. [Laughs] • This is one of the most powerful Unix commands that is accessible to a beginner. • And it’s called `grep`. • What you do with `grep` is you specify a word or pattern of some sort that it will search for in files. • So let’s say I want to search for a word like `Copyright`. • And the files I want to search in are what I put next. • So you could list multiple files here – like that – or you could use your wildcards to specify which files. • What if I just put star, all by itself? grep Copyright * • That will match any file at all in the current directory. • So this command says I want to see occurrences of the word `Copyright` in any file in the current directory. • And what this output does – it shows us a filename, so for example, there’s a file called `Makefile`. • And then in that file, it’s showing me only the lines that match the word `Copyright`. • So the `Makefile` seems to have 4 lines that match. And then there’s another file called `Makefile.am` • which has one line which matches, and so on. The `README` has one line. • So that’s the basic structure of grep.
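The basic shape of a `grep` search across several files might look like the sketch below. The three files and their contents are invented for illustration; only the names echo the ones mentioned in the episode.

```shell
demo=$(mktemp -d)
cd "$demo"
printf '# Copyright notice for the build\nall: hello\n' > Makefile
printf 'Welcome\nCopyright details are in COPYING\n' > README
printf 'int main(void) { return 0; }\n' > hello.c

# First argument is the text to look for; the rest are the files to search.
# With more than one file, each matching line is prefixed by its filename.
matches=$(grep Copyright *)
echo "$matches"
```

The file `hello.c` contributes nothing to the output because none of its lines contain the word.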
Lopes	• Now the `grep` command you just typed dumped a lot of content onto the terminal. • Is there a better way to organize or view what we’re trying to see?
League	• Yeah, one thing is that it paged off the screen, so we have to scroll up to see some of it. • And of course we know how to do a pager, so we could pipe it to `less` grep Copyright * | less • and see only one screen at a time. That’s part of it. • But something else really cool you can do is `grep` • – at least the version from the GNU project (which we mentioned last time as well). • `grep` supports coloring its output. • So if you say double-dash color `--color` then it will give you this colorful output where grep --color Copyright * • the filenames are in purple, the text you’re looking for is in red, and then the rest of that line is black. • And that just makes it a whole lot easier to see the different matches.
Lopes	• So like most of the commands we’ve learned, it seems that `grep` has a case-sensitivity issue. • Is there a way to work around that?
League	• Yeah, you notice here that I typed `Copyright` with a capital C. • And all of the matches it’s giving are a capital C. • If I searched for `copyright` with a lowercase C, it would also look for matches • – oh let’s also keep the color – there are ways to specify we always want color output, • but I’m not going to get into them right now. • So I’ll just remember to put that `--color`. • So when I use lowercase `copyright`, it’s finding instances where copyright appears in lowercase, • and those are different than the upper-case ones. So yeah it is case-sensitive. • But I can put a `-i`, which means that it will do an insensitive search. grep --color -i copyright * • So now it will give me every match of `copyright` – some of them are lowercase, as in here, • and some of them are uppercase. And I think there are even ones, • if I search up a little bit, that are all caps. • Yeah here it appears in all caps, which we didn’t get by doing `c` or `C`, but it matches that too.
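Case sensitivity and the `-i` option can be compared side by side. This is a small sketch using a made-up file, `sample.txt`, that contains the word in three different capitalizations.

```shell
demo=$(mktemp -d)
cd "$demo"
printf 'Copyright holder\nsee the copyright file\nCOPYRIGHT NOTICE\nnothing here\n' > sample.txt

capital=$(grep Copyright sample.txt)      # only the capitalized form
lower=$(grep copyright sample.txt)        # only the lowercase form
either=$(grep -i copyright sample.txt)    # any mix of upper and lower case

echo "Capital C:    $capital"
echo "Lowercase c:  $lower"
echo "With -i:"
echo "$either"
```

Only the `-i` search picks up the all-caps line, matching what is seen in the episode.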
Lopes	• I notice that the last 2 lines that your terminal put out • didn’t seem to put out anything in regards to `copyright`, that we searched for.
League	• Yeah these are errors, or warning messages. • The last one here – `grep` can search through binary files as well as text, • but it’s not useful to show you the lines of a binary file, because they won’t be understandable. • So it just says that it matches, without showing me the line where it matches. • So that explains that one. • These other ones, which also appeared sprinkled throughout up here. • When I specified `*`, star matches everything in the current directory, • but that includes other directories. • So `grep` by default doesn’t descend into directories, and it doesn’t search in a directory on its own. • So it just gives me a warning that one of the filenames that I included here, by typing `*`, • it’s not going to look at. • There are 2 things I can do about that. • One is to just silence those types of messages. • So there’s an `s` option. It can either go separately, or – like we did with `ls -ltr` – • you can merge that in with another switch. • So `i` and `s` are different options that I’m specifying here, but they’re all • part of the one dash: `-is`. • So `s` means to silence any error messages. • And if I do it that way, it doesn’t say anything about those directories, just silently ignores them. grep --color -is copyright * • So that is a little bit of a cleaner output. • The other option is you can actually ask `grep` to look inside directories, and grep through • all of the files within them. So when you do that, • you specify `-r` for recursive. And now we’re going to see lots of files grep --color -ir copyright * • we didn’t see before. And some of them with slash in them, which means it’s in a sub-directory. • So previously it just ignored the `src` folder. • But now it’s going and looking at all the files in there, also searching for `copyright`. • So that allows you to search many more files, very quickly.
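The two ways of dealing with directories can be sketched as below. The layout (a `README` plus a `src` folder) is invented for illustration. The `-r` option is assumed to be available, as it is in GNU and BSD `grep`, though it is not guaranteed on every system.

```shell
demo=$(mktemp -d)
cd "$demo"
mkdir src
printf 'Copyright holder\n' > README
printf '/* copyright notice */\n' > src/hello.c

# Option 1: -s keeps grep quiet about things it cannot search, such as src.
# grep may still exit with an error status for the skipped directory,
# so "|| true" stops that from being treated as a failure.
flat=$(grep -is copyright *) || true
quiet_errs=$(grep -is copyright * 2>&1 >/dev/null) || true

# Option 2: -r descends into directories and searches the files inside.
recursive=$(grep -ir copyright *)

echo "Without -r: $flat"
echo "With -r:"
echo "$recursive"
```

The recursive result includes a path with a slash, `src/hello.c`, which is the sign that the match came from a sub-directory.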
Lopes	• Well now that we used `grep` to search for any instance of `copyright` – it puts out a lot of data • is there any way to tell exactly where within the file that line is?
League	• Yeah, that’s a great question, and very useful. • There’s a very simple option we can add to grep, which is `-n`. • So again, I can keep it as part of this same block or make a separate `-n`. grep --color -ir -n copyright * • And what this will do is it adds a number after each filename here. • That number tells me what line it appeared on in that file. • So you can see that in this one file, `hello.c`, copyright appeared on line 3, • but it also appeared down on 167, 180, and 183. • So it gives you a sense of whether all of the instances appear in the same place, • or are they spread out more – stuff like that. • There are a couple more options related to changing the output style. • One of them is – let’s say that I only want to see what files matched. • I don’t really care to see the text of the line that matched, just which files. • So that is an option called `-l`. This is a lowercase ‘L’. • And I’m not going to do recursive anymore, but I’ll turn the `-s` back on, grep --color -ils copyright * • which means suppress the error messages. • So `-l` will change this to be much simpler. • It’s just going to give me a list of filenames that contain that word copyright. • It doesn’t show me where it matches, and the file only appears once in this list • even if it has multiple occurrences of that text. • So let’s try to search for a different word, and we’ll see different files that match. • Let’s try `printf` – and we still see `printf` in many of those files. grep --color -ils printf * • If you want to go back to the style we had before, just delete the `-l`, grep --color -is printf * • and now we will see where it matches. • A lot of these are matching capitalized versions of `PRINTF`. • So if I wanted to see if lowercase `printf` appears in any files, and which files it appears in, • I’ll get rid of the `-i` and keep the `-l`. And now it appears in all of those. grep --color -ls printf * • And here are some of those.
grep --color -s printf * • So `grep` is a very powerful tool, just for searching for text within a file. • So in addition to that `-l`, which just prints the filenames that match, • there’s one other option that’s really cool, which is `-c` – lowercase ‘C’. • And that means to print a count of how many matches within the file. grep --color -cs printf * • But again, it doesn’t show us the lines that match – it just shows us the counts. • So that looks like this. • What’s happening here is it has a filename, and following that filename it puts a number, • which is the number of times that the match – I believe it’s actually the number of lines that match. • So if the word `printf` appears on the same line twice, it only counts once.
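The three output styles from this answer (`-n`, `-l`, and `-c`) can be compared on a tiny example. The files below are made up; line 3 of `hello.c` deliberately contains `printf` twice to check the remark that `-c` counts matching lines rather than individual occurrences.

```shell
demo=$(mktemp -d)
cd "$demo"
cat > hello.c <<'EOF'
#include <stdio.h>
int main(void) {
    printf("one "); printf("two ");
    printf("three\n");
    return 0;
}
EOF
printf 'PRINTF appears here in capitals\n' > notes.txt

numbered=$(grep -n printf *)   # filename:line-number:text
names=$(grep -l printf *)      # just the names of files that match
names_i=$(grep -il printf *)   # same, ignoring case
counts=$(grep -c printf *)     # filename:number-of-matching-lines

echo "$numbered"
echo "-l:  $names"
echo "-il: $names_i"
echo "$counts"
```

Here `hello.c` is reported with a count of 2, even though the word occurs three times, and `notes.txt` is listed with a count of 0.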
Lopes	• I guess this is a good place to mention that, like all the other commands, • you can use the `--help` option with grep.
League	• Definitely. So `grep` has lots more options to explore. • And if we do `--help` and `less`, like that, then we get to see some more of the options that it supports. grep --help | less • Another small tip is that, if you have a phrase you want to search for rather than just a single word. • Remember that spaces are significant in command lines. • So if I put spaces – let’s say I want to search for `free software` in all of the files. • The problem with that is it interprets the first parameter as what you’re searching for, • and the rest as filenames that you’re searching in. • So there is no file called `software` which means this is going to be a problem. • So to do that, I can use quotes. • The same way that I quoted spaces in filenames. • So I can use quotes there to group together `"free software"` as one thing that I search for. • And then, wherever that appears will show up. • But I need the quotes to group it together.
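The effect of quoting a phrase can be seen in a short sketch. The two files and their wording are invented; `-s` is added to the unquoted attempt only to hide the complaint about the non-existent file `software`.

```shell
demo=$(mktemp -d)
cd "$demo"
printf 'This is free software; you may share it.\n' > README
printf 'software that is free of charge\n' > notes.txt

# Unquoted: grep searches for "free" and treats "software" as a filename.
# "|| true" ignores the error status caused by that missing file.
unquoted=$(grep -s free software *) || true

# Quoted: the whole phrase is one argument, so only the exact phrase matches.
quoted=$(grep "free software" *)

echo "Unquoted:"
echo "$unquoted"
echo "Quoted:"
echo "$quoted"
```

The unquoted search matches both files because each contains the word `free`, while the quoted search finds only the line with the full phrase.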
Lopes	• So today we went over the wildcards as well as a lot of features that `grep` has to offer our Linux users.
League	• Yeah, and next time I think we will look at a few more of the text processing commands. • There’s a command called `cut`, and `sort`, and `uniq`. • A lot of data in Unix systems is kept in plain text files. • And these commands will allow us to process them and search them in particular ways. • And they all interact with each other very nicely. • We may also look a little bit at renaming files using the move (`mv`) command, which came up today. • So we’ll go into some of the features of that as well. • So join us next time!
	• [Dark electronic beat] • [Captions by Christopher League] • [End]