Episode 4: Text manipulation

We learn about text manipulation commands like cut, sort, and uniq. We build sophisticated pipelines to analyze data, including surveys and web logs. We also look briefly at invoking simple text editors from the command line, like nano, gedit, and TextEdit.

18 August 2015

[Rhythmic, dark electronic intro music]

League

Welcome to Episode 4 of Command Line TV.

Today we’re going to talk about cut, uniq and sort, some commands for text processing,

and with me is my co-host, as always, Christian Lopes.

Do you have any questions from last time?

Lopes

I did have one question in terms of – we did a lot of text manipulations,

but what if I just wanted to start with a text file? How would I open one of those files?

League

If you have a text file – like I’ve got here, languages.txt, for example –

and we know we can dump its content out using cat.

cat languages.txt

We can also use things like more or less to see it.

less languages.txt

But if you want to modify the content or create a new file,

what you want to do is open up a text editor.

There are a variety of different text editors available on the command line,

or from the command line, and some of it depends on

what platform you’re on and what’s installed.

But one simple thing that works on almost every platform is a really simple editor called nano.

So if I do nano on languages.txt, it will open up a little editor

nano languages.txt

where it shows me the content of this, and I can move around through it,

but I can also go in here and make changes to it.

Now it’s a little bit unusual the way its commands work. There isn’t a menu.

You can’t really use the mouse like you’re used to because this is a command line-only editor.

But down across the bottom are some hints as to keystrokes that work.

So if you want to save the file – save is what they call “write out” –

you hit ‘control-O’ and it just confirms the file name for you.

You confirm that and press ‘Enter’, and it saves the file,

and then when you want to exit, there’s ‘control-X’.

Now we’re back and that file has been modified. So nano is very simple to use.
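
And if the file you name doesn’t exist yet, nano just starts you with an empty buffer and creates the file when you write it out, so a command along these lines would begin a brand-new file:

nano notes.txt    # notes.txt here is just an example name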

But you can also use graphical editors.

Most Ubuntu systems will have gedit installed,

and you can just type “gedit” with the file name.

gedit languages.txt

It will open up a separate window, and you can use your mouse here.

It’s got a menu – this is more like a standard editor.

So we can go in there and make changes and save it and so on.

And then when you exit that, you just go back to the command line.

There might be some messages here from gedit as it ran, but most of them are ignorable.

On a Mac, the way to hook into the TextEdit application is to type open -e.

This means open with ‘TextEdit’, and then you put the file name.

open -e languages.txt

That won’t work here because we’re on Linux but that should do okay on the Mac.

Lopes

Now that we’ve been working on the terminal so much, it’s obvious how important

these text files are. What else can we do with the text files in terms of editing and whatnot?

League

Text files are very important on UNIX systems because a lot of the configuration

and just data is kept in text files.

So we have a lot of commands that are around editing and manipulating text in different ways.

We’re going to look at some of those commands, but first I want to show you one of these

configuration files that’s part of pretty much every UNIX system, and that’s the password file.

So if I use cat to dump out the contents of /etc/passwd

cat /etc/passwd

etc is a directory at the top level of a UNIX system that keeps a lot of configuration files

within it, and the password file is one of those – its name is abbreviated to passwd.

I dumped that out and I’ve got all kinds of information here

about different users in the system.

So this is my personal account and then I’ve got an account down here for cltv,

and we’ve got information about each of these users. Most of these are system users,

so that means that they represent different software on the system or different servers.

For example, ‘Postgres’ and ‘CouchDB’ are two different database systems

and they each have users for them. But then the different information that we have here

is separated by colons – seven fields in all.

The first field is the user name,

the second field traditionally was where we would store the password,

which is why this is called a password file, but eventually UNIX systems started

moving the passwords into a separate file so that they could be protected a lot better.

So on most systems this will just have an x in it or it will be blank.

Then I’ve got a ‘user ID’ number and a ‘group ID’ number.

Following that, for me it’s empty, but it’s supposed to be the full name of that user.

You can put a user’s complete name there.

So I’ve got things like that for my database users.

And then finally these last two are the user’s home directory,

so when I’m logged in as this person and I type ~, that’s the directory it refers to.

And then the shell for that user, so there are different command shells that are available.

The one we’ve been using is bash; it’s the most common,

it’s the default on Ubuntu and on the Mac, but there are others,

and later on we’ll look at some of the differences between them.
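
Putting that together, a single entry laid out with all seven fields looks roughly like this – the ID numbers are only illustrative, and the empty fifth field is where the full name would go:

league:x:1000:1000::/home/league:/bin/bash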

Lopes

What if I don’t want to see all this information on the screen?

What if I’m only interested in the user and that user’s home directory?

League

We can cut out different fields of this so that you don’t have to see everything,

and that’s a command called cut. So what we will do is type cut,

and then you have to specify with cut what the delimiter is,

that means what is separating each field, so for the password file that’s the colon.

And then you have to specify which fields you want to see,

so if I just wanted to see the first field – they’re numbered starting from 1 –

then I’d put the file name I’m looking at, so that command will just dump out the first field,

cut -d: -f1 /etc/passwd

which is the names. If I also want to include the home directory –

I think that was the sixth field – I could put 1,6 here, and that will give me

cut -d: -f1,6 /etc/passwd

the user name and that user’s home directory, separated again by the delimiter,

but it omits all of the other information. You can also do a range,

so if the fields you want are in order – let’s say I want 1 to 4 –

then that command with -f 1-4 gives me the first four fields

cut -d: -f1-4 /etc/passwd

of each entry in the password file.

Lopes

So again that would be the user name, what originally was the password field, the ID?

League

Yeah, user ID.

Lopes

The user ID and then the group ID.

League

Exactly.

There’s another way that cut can work which is useful if you’ve got some text where it’s

already aligned into columns instead of being delimited by a particular character.

An example of that is if we take the output of ls -l, you see how things line up here,

ls -l

so they intentionally insert some spaces around these numbers and in dates and so on

so that everything winds up in these columns, and we can process that

by specifying which character positions to cut.

So if I do ls -l and pipe that output into cut, then I can specify -c

to say which characters you want. For example, I could do characters 1-10

ls -l | cut -c1-10

and that would be just the first part here, these first 10 characters.

So let’s say I want to cut out these file sizes, the numbers here.

What I would normally do is just eyeball that to try and get it right, like I might guess

that it’s about 23 characters in, and if I leave off the ending of that range

it will just take character 23 to the end, so that seemed to work out exactly right.

ls -l | cut -c23-

I start with just these numbers and there’s even a space before the 358,

so that’s cutting right at that point.

And then I want to go out to, let’s say, 33, but that brought in some of the months,

ls -l | cut -c23-33

so I’m going to cut it back to 29, and that looks about right,

ls -l | cut -c23-29

so I’ve been able to figure out this range of character positions that will exactly

cut out that information from the original ls -l output.

Lopes

On that password file that we were looking at,

was there any particular order to the file structure?

League

Usually not.

If I dump out the password file here, these just appear in arbitrary order.

It could be that the latest additions were at the end so it just puts them in the

order that they were created, but I’m not even sure that that’s reliable;

basically the order of the password file doesn’t matter.

But if you wanted to see it in order, there’s a command to do that. It’s called sort.

So instead of cat, if I just sort password, this doesn’t change that file,

sort /etc/passwd

it only reads that file as input and gives it to us as output.

Later on we’ll see how to take that output and store it back into a file.
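
As a quick preview, that just means redirecting the output into a new file with >, along these lines:

sort /etc/passwd > passwd-sorted.txt    # passwd-sorted.txt is just an example name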

But if I sort it, what it does is it looks at each line and just

puts them in alphabetical order, so now I’ve got the users that start with a c up here

and s down here and so on, so that’s alphabetical order.

sort is really great even in a pipeline if you want to sort the output of another command.

So a command that I think is really useful to demonstrate that is called du,

which stands for “disk usage.” It’s basically showing us how much space files and folders

are taking up on our system. I’m going to use the -s option to summarize the results,

and * means for every file and folder in the current directory. So I get this.

du -s *

These numbers out front are in units of kilobytes although some versions

of du use different units.

There are some ways to specify what units you want as options.
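
For example, most versions of du accept an -h option for human-readable sizes, so something like this would print the same summary in kilobytes, megabytes, or gigabytes as appropriate:

du -sh *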

We’re going to get into all of that later;

we’re going to do a section on file system tools and we’ll talk about du a lot more.

But what I want to show here is just that I might want to put these in order.

du is showing them in the order that they appear,

like alphabetical order by the file name, just like ls does.

But with the number beside it of how much space that’s taking up, I might want to sort that,

so I can take that result and pipe it into sort, and then it puts those in order,

du -s * | sort

except it’s kind of weird because I’ve got these 4’s right in the middle here.

So if you look at just the first character, these are in order, right: 2 3 4 4 5 7 9.

But it’s not really numerical order based on the entire number.

So sorting text is different from sorting numbers, and when you’re sorting numbers,

you give sort the -n option and that will compensate for that.

du -s * | sort -n

So now I get them in numerical order from smallest to biggest.

If you want it reversed, of course sort can take -r to reverse it

du -s * | sort -n -r

and now the smallest come out at the bottom.

So that’s a useful usage of du and sort together.

Lopes

Back in the password file, we have a lot of users

and they have a lot of separate home directories.

However, some of them are the same; a lot just have / – the root – as their directory,

but cltv and yours, league, are under /home.

Is there a way to sort these out and see how many there are?

League

Yeah – first of all you might just want to cut out the home directory field

so that we can see the variety of things that are there more easily.

Recall that our delimiter was colon, and I think that’s Field 6 in the password file.

So here are all of the home directories, and one thing you could do –

we learned grep previously – you could grep for /home to see just those users

cut -d: -f6 /etc/passwd|grep /home

that have directories in /home.

That’s one indication that those are regular human users as opposed to system users,

which usually have their directories somewhere else.

But another thing that we can do that’s pretty cool is remove the duplicates

or count the duplicates. Here I’ve got a bunch of different users that just use

the root as their directory, so if I do unique – and unique is just spelled uniq –

what it will do is remove duplicates, so now we only see the slash once,

cut -d: -f6 /etc/passwd|uniq

except we actually see it twice, and that’s because those were not consecutive.

So basically if two lines are side by side or adjacent, then uniq can detect that

they’re the same and eliminate them, but if they’re not adjacent, it doesn’t notice.

So it’s very common, unless you know that the data is already in order,

to put a sort before the uniq.

Now I take those directories, sort them – let’s just look at the result of that first –

cut -d: -f6 /etc/passwd|sort

so now all of the slashes are together and these appear in alphabetical order,

and then if I do uniq, it will have the slash only appear once because all of them

cut -d: -f6 /etc/passwd|sort |uniq

were together and now uniq can filter them properly. Another thing that uniq supports

is I can do the -c option, which means count the number of duplicates

cut -d: -f6 /etc/passwd|sort |uniq -c

instead of just eliminating them. So now I see that the slash appeared 14 times

and the rest of these appeared once. Another cool thing that I thought we could do is

try to count out based on just the top level of the directory,

so how many are in var vs. how many are in srv.

To do that, what I’m going to try to do is cut again.

So the first cut from the password file just gives me these directory names,

cut -d: -f6 /etc/passwd

and then if I cut with a different delimiter – so let’s pretend slash is a delimiter –

I can basically cut apart these directory names as if they were separate fields.

So field 1 would be before the first slash, and this first bit here is Field 2,

cut -d: -f6 /etc/passwd | cut -d/ -f2

so -f2, and now I’m getting a blank where it was just the root directory

or I’m getting var, home, srv, whatever.

Then I can sort those and then uniq -c, and I see that there are two people

cut -d: -f6 /etc/passwd | cut -d/ -f2 |sort|uniq -c

with home directories in home, three in srv, seven in var and so on.

Lopes

We have here a survey.tsv file. Is there a way we can use the uniq command

to do a more practical analysis of data?

League

Yeah, definitely. This survey file came from a spreadsheet which originally was a Google form.

We can see here I made a Google form just to survey a class about how well

things are going so far, so these 1-5 represent Strongly Agree to Strongly Disagree.

I don’t remember which order is which, but it doesn’t matter.

I basically downloaded this spreadsheet into tab-separated values, tsv,

and that’s what we’ve got over here.

If we take a look at that file, it just has all of the data in it,

cat survey.tsv

and in between each value is a tab character, right there. So we can do cut on that, and the

delimiter is tab, but we don’t have to worry about how to specify a tab on the command line.

If I just tried to type a tab here it wouldn’t really work. But tab is the default delimiter

for cut, so it assumes a tab if I don’t specify -d at all, which means I can just

leave it out and specify which fields I want. So if I just want Field 1, that’s the

cat survey.tsv | cut -f1

time stamp and Field 2 is the first question, Field 3 is the second question, and so on.

cat survey.tsv | cut -f2

Now I can try to summarize this data by applying sort to it.

cat survey.tsv | cut -f2 | sort

It’s putting our responses in order from one to five.

Then how do I summarize those? uniq -c will count them. So now I can tell that there were

cat survey.tsv | cut -f2 | sort | uniq -c

12 responses of 1 and three people responded 2 and so on.

That corresponds to the same type of summary you can get out of a Google spreadsheet itself,

so here I’ve got a summary of responses, and for this question we saw that 12 people said 1,

three people said 2 and so on, so it’s exactly the same data.
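
Incidentally, if you ever did need to spell a tab out explicitly, one way that works in bash is its $'\t' quoting, so a pipeline roughly like this behaves the same as leaving -d off entirely:

cat survey.tsv | cut -d$'\t' -f2 | sort | uniq -c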

Lopes

So far we’ve used a colon as the delimiter, as well as cut’s native tab default.

What if we needed to use a special character as a delimiter?

League

Yeah, if you’ve got a special character like a quote or a space or something,

then you have to do something special to pass that into cut.

cat weblog.txt

One example we could look at is this weblog file. I grabbed about a day’s worth

of accesses to my web server and it’s a lot to look at, but this is basically one line of text

here that goes on past the width of my screen. But the first part of that is an IP address

and there’s a time stamp and some other information. Let’s say I just want to know

what IP addresses are accessing my web site. I could split that out based on this

space character, so I’m going to do cut with the delimiter being a space,

but you can’t just type space because that would mean I’m just all done with my option,

so we put the space in quotes. You could actually use single quotes or double quotes

for this as long as it’s consistent. But there are other cases where

you might want to cut on a quote character, so if I wanted to cut on that as a delimiter,

then I would put the double quote within the single quotes, and that should work.
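
Spelled out, that would look something like this – assuming the log wraps each request in double quotes, field 2 would be whatever falls between the first pair of them:

cut -d'"' -f2 weblog.txt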

So let’s cut based on space and I’ll take Field 1 of the weblog.

cut -d' ' -f1 weblog.txt

That’s going to give me these IP addresses. They don’t necessarily appear in order;

they’re in order of access, so I could have the same IP address appearing at different times.

So I’m going to do sort to put them in order by the IP address and then I can do uniq -c

cut -d' ' -f1 weblog.txt | sort

to count how many times each IP address accessed the site, right?

cut -d' ' -f1 weblog.txt | sort | uniq -c

Now I’ve got a count of how many times each IP accessed the site, but maybe

I want to put those in order, so I just get back that previous command and tack onto it

another sort, but this one is going to be a numeric sort because those are numbers

cut -d' ' -f1 weblog.txt |sort|uniq -c|sort -n

so I do -n. Now I see in order of how many times each IP address accessed my web site,

and I can do all of that just by incrementally building up this pipeline.

Lopes

You could also append a -r to put it in reverse,

so you could see the least-accessed ones as well.

League

Yup – -r would give me a bunch of ones at the end. This is an interesting finding here.

cut -d' ' -f1 weblog.txt |sort|uniq -c|sort -n -r

These are IP addresses, but they start with this ‘quad f’ prefix.

That’s IP Version 6 notation – an address beginning with ::ffff: is really an

IP Version 4 address mapped into IPv6 form, and most accesses are from IP Version 4.

These at the bottom are native accesses from IP Version 6, so those addresses look quite different,

but it’s kind of neat that they popped out when we did that in reverse.

Lopes

So in today’s episode, we did a lot on text manipulation in terms of

counting files and organizing them. What do we have in store for our viewers next time?

League

I think next time we are – I think we promised it last time too – but we’re going to get to

manipulating files themselves, like renaming them with mv or moving them

from one directory to another, copying files, deleting files and things like that.

We’ll use mv, cp and rm. Those will be the commands for next time.

Lopes

All right.

League

See you then.

[Dark electronic beat]

[Captions by GetTranscribed.com and Christopher League]

[End]