Episode 4: Text manipulation

We learn about text manipulation commands like cut, sort, and uniq. We build sophisticated pipelines to analyze data, including surveys and web logs. We also look briefly at invoking simple text editors from the command line, like nano, gedit, and TextEdit.

18 August 2015

[Rhythmic, dark electronic intro music]

League

Welcome to Episode 4 of Command Line TV.

Today we’re going to talk about cut, uniq and sort, some commands for text processing,

and with me is my co-host, as always, Christian Lopes.

Do you have any questions from last time?

Lopes

I did have one question in terms of – we did a lot of text manipulations,

but what if I just wanted to start with a text file? How would I open one of those files?

League

If you have a text file – like I’ve got here, languages.txt, for example –

and we know we can dump its content out using cat.

cat languages.txt

We can also use things like more or less to see it.

less languages.txt

But if you want to modify the content or create a new file,

what you want to do is open up a text editor.

There are a variety of different text editors available on the command line,

or from the command line, and some of it depends on

what platform you’re on and what’s installed.

But one simple thing that works on almost every platform is a really simple editor called nano.

So if I do nano on languages.txt, it will open up a little editor

nano languages.txt

where it shows me the content of this, and I can move around through it,

but I can also go in here and make changes to it.

Now it’s a little bit unusual the way its commands work. There isn’t a menu.

You can’t really use the mouse like you’re used to because this is a command line-only editor.

But down across the bottom are some hints as to keystrokes that work.

So if you want to save the file – save is what they call “write out” –

you hit ‘control-O’ and it just confirms the file name for you.

You confirm that and press ‘Enter’, and it saves the file,

and then when you want to exit, there’s ‘control-X’.

Now we’re back and that file has been modified. So nano is very simple to use.
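
And if the file you name doesn’t exist yet, nano just starts you with an empty buffer and creates the file when you write it out, so a command along these lines would begin a brand-new file:

nano notes.txt    # notes.txt here is just an example name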

But you can also use graphical editors.

Most Ubuntu systems will have gedit installed,

and you can just type “gedit” with the file name.

gedit languages.txt

It will open up a separate window, and you can use your mouse here.

It’s got a menu – this is more like a standard editor.

So we can go in there and make changes and save it and so on.

And then when you exit that, you just go back to the command line.

There might be some messages here from gedit as it ran, but most of them are ignorable.

On a Mac, the way to hook into the TextEdit application is to type open -e.

This means open with ‘TextEdit’, and then you put the file name.

open -e languages.txt

That won’t work here because we’re on Linux but that should do okay on the Mac.

Lopes

Now that we’ve been working on the terminal so much, it’s obvious how important

these text files are. What else can we do with the text files in terms of editing and whatnot?

League

Text files are very important on UNIX systems because a lot of the configuration

and just data is kept in text files.

So we have a lot of commands that are around editing and manipulating text in different ways.

We’re going to look at some of those commands, but first I want to show you one of these

configuration files that’s part of pretty much every UNIX system, and that’s the password file.

So if I use cat to dump out the contents of /etc/passwd

cat /etc/passwd

etc is a directory at the top level of a UNIX system that keeps a lot of configuration files

within it, and the password file is one of those – its name is abbreviated to passwd.

I dumped that out and I’ve got all kinds of information here

about different users in the system.

So this is my personal account and then I’ve got an account down here for cltv,

and we’ve got information about each of these users. Most of these are system users,

so that means that they represent different software on the system or different servers.

For example, ‘Postgres’ and ‘CouchDB’ are two different database systems

and they each have users for them. But then the different information that we have here

is separated by colons – seven fields in all.

The first field is the user name,

the second field traditionally was where we would store the password,

which is why this is called a password file, but eventually UNIX systems started

moving the passwords into a separate file so that they could be protected a lot better.

So on most systems this will just have an x in it or it will be blank.

Then I’ve got a ‘user ID’ number and a ‘group ID’ number.

Following that, for me it’s empty, but it’s supposed to be the full name of that user.

You can put a user’s complete name there.

So I’ve got things like that for my database users.

And then finally these last two are the user’s home directory,

so when I’m logged in as this person and I type ~, that’s the directory it refers to.

And then the shell for that user, so there are different command shells that are available.

The one we’ve been using is bash; it’s the most common,

it’s the default on Ubuntu and on the Mac, but there are others,

and later on we’ll look at some of the differences between them.
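
Putting that together, a single entry laid out with all seven fields looks roughly like this – the ID numbers are only illustrative, and the empty fifth field is where the full name would go:

league:x:1000:1000::/home/league:/bin/bash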

Lopes

What if I don’t want to see all this information on the screen?

What if I’m only interested in the user and that user’s home directory?

League

We can cut out different fields of this so that you don’t have to see everything,

and that’s a command called cut. So what we will do is type cut,

and then you have to specify with cut what the delimiter is,

that means what is separating each field, so for the password file that’s the colon.

And then you have to specify which fields you want to see,

so if I just wanted to see the first field – they’re numbered starting from 1 –

then I’d put the file name I’m looking at, so that command will just dump out the first field,

cut -d: -f1 /etc/passwd

which is the names. If I also want to include the home directory –

I think that was the sixth field – I could put 1,6 here, and that will give me

cut -d: -f1,6 /etc/passwd

the user name and that user’s home directory, separated again by the delimiter,

but it omits all of the other information. You can also do a range,

so if the fields you want are in order – let’s say I want 1 to 4 –

then that command with -f 1-4 gives me the first four fields

cut -d: -f1-4 /etc/passwd

of each entry in the password file.

Lopes

So again that would be the user name, what originally was the password field, the ID?

League

Yeah, user ID.

Lopes

The user ID and then the group ID.

League

Exactly.

There’s another way that cut can work which is useful if you’ve got some text where it’s

already aligned into columns instead of being delimited by a particular character.

An example of that is if we take the output of ls -l, you see how things line up here,

ls -l

so they intentionally insert some spaces around these numbers and in dates and so on

so that everything winds up in these columns, and we can process that

by specifying which character positions to cut.

So if I do ls -l and pipe that output into cut, then I can specify -c

to say which characters you want. For example, I could do characters 1-10

ls -l | cut -c1-10

and that would be just the first part here, these first 10 characters.

So let’s say I want to cut out these file sizes, the numbers here.

What I would normally do is just eyeball that to try and get it right, like I might guess

that it’s about 23 characters in, and if I leave off the ending of that range

it will just take character 23 to the end, so that seemed to work out exactly right.

ls -l | cut -c23-

I start with just these numbers and there’s even a space before the 358,

so that’s cutting right at that point.

And then I want to go out to, let’s say, 33, but that brought in some of the months,

ls -l | cut -c23-33

so I’m going to cut it back to 29, and that looks about right,

ls -l | cut -c23-29

so I’ve been able to figure out this range of character positions that will exactly

cut out that information from the original ls -l output.

Lopes

On that password file that we were looking at,

was there any particular order to the file structure?

League

Usually not.

If I dump out the password file here, these just appear in arbitrary order.

It could be that the latest additions were at the end so it just puts them in the

order that they were created, but I’m not even sure that that’s reliable;

basically the order of the password file doesn’t matter.

But if you wanted to see it in order, there’s a command to do that. It’s called sort.

So instead of cat, if I just sort password, this doesn’t change that file,

sort /etc/passwd

it only reads that file as input and gives it to us as output.

Later on we’ll see how to take that output and store it back into a file.
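
As a quick preview, that just means redirecting the output into a new file with >, along these lines:

sort /etc/passwd > passwd-sorted.txt    # passwd-sorted.txt is just an example name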

But if I sort it, what it does is it looks at each line and just

puts them in alphabetical order, so now I’ve got the users that start with a c up here

and s down here and so on, so that’s alphabetical order.

sort is really great even in a pipeline if you want to sort the output of another command.

So a command that I think is really useful to demonstrate that is called du,

which stands for “disk usage.” It’s basically showing us how much space files and folders

are taking up on our system. I’m going to use the -s option to summarize the results,

and * means for every file and folder in the current directory. So I get this.

du -s *

These numbers out front are in units of kilobytes although some versions

of du use different units.

There are some ways to specify what units you want as options.
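
For example, most versions of du accept an -h option for human-readable sizes, so something like this would print the same summary in kilobytes, megabytes, or gigabytes as appropriate:

du -sh *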

We’re going to get into all of that later;

we’re going to do a section on file system tools and we’ll talk about du a lot more.

But what I want to show here is just that I might want to put these in order.

du is showing them in the order that they appear,

like alphabetical order by the file name, just like ls does.

But with the number beside it of how much space that’s taking up, I might want to sort that,

so I can take that result and pipe it into sort, and then it puts those in order,

du -s * | sort

except it’s kind of weird because I’ve got these 4’s right in the middle here.

So if you look at just the first character, these are in order, right: 2 3 4 4 5 7 9.

But it’s not really numerical order based on the entire number.

So sorting text is different from sorting numbers, and when you’re sorting numbers,

you give sort the -n option and that will compensate for that.

du -s * | sort -n

So now I get them in numerical order from smallest to biggest.

If you want it reversed, of course sort can take -r to reverse it

du -s * | sort -n -r

and now the smallest come out at the bottom.

So that’s a useful usage of du and sort together.

Lopes

Back in the password file, we have a lot of users

and they have a lot of separate home directories.

However, some of them are the same; a lot just have / – the root – as their directory,

but cltv and yours, league, are under /home.

Is there a way to sort these out and see how many there are?

League

Yeah – first of all you might just want to cut out the home directory field

so that we can see the variety of things that are there more easily.

Recall that our delimiter was colon, and I think that’s Field 6 in the password file.

So here are all of the home directories, and one thing you could do –

we learned grep previously – you could grep for /home to see just those users

cut -d: -f6 /etc/passwd|grep /home

that have directories in /home.

That’s one indication that those are regular human users as opposed to system users,

which usually have their directories somewhere else.

But another thing that we can do that’s pretty cool is remove the duplicates

or count the duplicates. Here I’ve got a bunch of different users that just use

the root as their directory, so if I do unique – and unique is just spelled uniq –

what it will do is remove duplicates, so now we only see the slash once,

cut -d: -f6 /etc/passwd|uniq

except we actually see it twice, and that’s because those were not consecutive.

So basically if two lines are side by side or adjacent, then uniq can detect that

they’re the same and eliminate them, but if they’re not adjacent, it doesn’t notice.

So it’s very common, unless you know that the data is already in order,

to put a sort before the uniq.

Now I take those directories, sort them – let’s just look at the result of that first –

cut -d: -f6 /etc/passwd|sort

so now all of the slashes are together and these appear in alphabetical order,

and then if I do uniq, it will have the slash only appear once because all of them

cut -d: -f6 /etc/passwd|sort |uniq

were together and now uniq can filter them properly. Another thing that uniq supports

is I can do the -c option, which means count the number of duplicates

cut -d: -f6 /etc/passwd|sort |uniq -c

instead of just eliminating them. So now I see that the slash appeared 14 times

and the rest of these appeared once. Another cool thing that I thought we could do is

try to count out based on just the top level of the directory,

so how many are in var vs. how many are in srv.

To do that, what I’m going to try to do is cut again.

So the first cut from the password file just gives me these directory names,

cut -d: -f6 /etc/passwd

and then if I cut with a different delimiter – so let’s pretend slash is a delimiter –

I can basically cut apart these directory names as if they were separate fields.

So field 1 would be before the first slash, and this first bit here is Field 2,

cut -d: -f6 /etc/passwd | cut -d/ -f2

so -f2, and now I’m getting a blank where it was just the root directory

or I’m getting var, home, srv, whatever.

Then I can sort those and then uniq -c, and I see that there are two people

cut -d: -f6 /etc/passwd | cut -d/ -f2 |sort|uniq -c

with home directories in home, three in srv, seven in var and so on.

Lopes

We have here a survey.tsv file. Is there a way we can use the uniq command

to do a more practical analysis of data?

League

Yeah, definitely. This survey file came from a spreadsheet which originally was a Google form.

We can see here I made a Google form just to survey a class about how well

things are going so far, so these 1-5 represent Strongly Agree to Strongly Disagree.

I don’t remember which order is which, but it doesn’t matter.

I basically downloaded this spreadsheet into tab-separated values, tsv,

and that’s what we’ve got over here.

If we take a look at that file, it just has all of the data in it,

cat survey.tsv

and in between each value is a tab character, right there. So we can do cut on that, and the

delimiter is tab, but we don’t have to worry about how to specify a tab on the command line.

If I just tried to type a tab here it wouldn’t really work. But tab is the default delimiter

for cut, so it assumes a tab if I don’t specify -d at all, which means I can just

leave it out and specify which fields I want. So if I just want Field 1, that’s the

cat survey.tsv | cut -f1

time stamp and Field 2 is the first question, Field 3 is the second question, and so on.

cat survey.tsv | cut -f2

Now I can try to summarize this data by applying sort to it.

cat survey.tsv | cut -f2 | sort

It’s putting our responses in order from one to five.

Then how do I summarize those? uniq -c will count them. So now I can tell that there were

cat survey.tsv | cut -f2 | sort | uniq -c

12 responses of 1 and three people responded 2 and so on.

That corresponds to the same type of summary you can get out of a Google spreadsheet itself,

so here I’ve got a summary of responses, and for this question we saw that 12 people said 1,

three people said 2 and so on, so it’s exactly the same data.
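
Incidentally, if you ever did need to spell a tab out explicitly, one way that works in bash is its $'\t' quoting, so a pipeline roughly like this behaves the same as leaving -d off entirely:

cat survey.tsv | cut -d$'\t' -f2 | sort | uniq -c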

Lopes

So far we’ve used a colon as the delimiter, as well as cut’s native tab default.

What if we needed to use a special character as a delimiter?

League

Yeah, if you’ve got a special character like a quote or a space or something,

then you have to do something special to pass that into cut.

cat weblog.txt

One example we could look at is this weblog file. I grabbed about a day’s worth

of accesses to my web server and it’s a lot to look at, but this is basically one line of text

here that goes on past the width of my screen. But the first part of that is an IP address

and there’s a time stamp and some other information. Let’s say I just want to know

what IP addresses are accessing my web site. I could split that out based on this

space character, so I’m going to do cut with the delimiter being a space,

but you can’t just type space because that would mean I’m just all done with my option,

so we put the space in quotes. You could actually use single quotes or double quotes

for this as long as it’s consistent. But there are other cases where

you might want to cut on a quote character, so if I wanted to cut on that as a delimiter,

then I would put the double quote within the single quotes, and that should work.
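
Spelled out, that would look something like this – assuming the log wraps each request in double quotes, field 2 would be whatever falls between the first pair of them:

cut -d'"' -f2 weblog.txt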

So let’s cut based on space and I’ll take Field 1 of the weblog.

cut -d' ' -f1 weblog.txt

That’s going to give me these IP addresses. They don’t necessarily appear in order;

they’re in order of access, so I could have the same IP address appearing at different times.

So I’m going to do sort to put them in order by the IP address and then I can do uniq -c

cut -d' ' -f1 weblog.txt | sort

to count how many times each IP address accessed the site, right?

cut -d' ' -f1 weblog.txt | sort | uniq -c

Now I’ve got a count of how many times each IP accessed the site, but maybe

I want to put those in order, so I just get back that previous command and tack onto it

another sort, but this one is going to be a numeric sort because those are numbers

cut -d' ' -f1 weblog.txt |sort|uniq -c|sort -n

so I do -n. Now I see in order of how many times each IP address accessed my web site,

and I can do all of that just by incrementally building up this pipeline.

Lopes

You could also append a -r to put it in reverse,

so you could see the least-accessed ones as well.

League

Yup – -r would give me a bunch of ones at the end. This is an interesting finding here.

cut -d' ' -f1 weblog.txt |sort|uniq -c|sort -n -r

These are IP addresses, but they start with this ‘quad f’ prefix.

That’s IP Version 6 notation – an address beginning with ::ffff: is really an

IP Version 4 address mapped into IPv6 form, and most accesses are from IP Version 4.

These at the bottom are native accesses from IP Version 6, so those addresses look quite different,

but it’s kind of neat that they popped out when we did that in reverse.

Lopes

So in today’s episode, we did a lot on text manipulation in terms of

counting files and organizing them. What do we have in store for our viewers next time?

League

I think next time we are – I think we promised it last time too – but we’re going to get to

manipulating files themselves, like renaming them with mv or moving them

from one directory to another, copying files, deleting files and things like that.

We’ll use mv, cp and rm. Those will be the commands for next time.

Lopes

All right.

League

See you then.

[Dark electronic beat]

[Captions by GetTranscribed.com and Christopher League]

[End]