Friday, March 15, 2013

Introduction to Regular Expressions within a Shell

A utility that every Linux user will use is ls (short for list), which lists the contents of a directory. By default, ls lists everything in a directory (apart from hidden files), which is fine when there are only a few files. But what if that directory contains dozens or hundreds of files? Wildcards, combined with the bracket expressions that the shell borrows from regular expressions (commonly referred to as regex), let you search for partial filenames effectively and elegantly. Strictly speaking, the shell calls this pattern matching, or globbing, and it covers only a small part of what full regular expressions can do. The usefulness of these techniques will become more evident when writing shell scripts.

Most users coming from other operating systems are familiar with the asterisk (*) wildcard, and some may know about the question mark (?) wildcard when searching for files. An asterisk, used in a search pattern, basically says, "any number of characters (including zero) that can be anything". A question mark means, "a single character that can be anything".

Take note of the difference between the two in practice:

$ ls
cap        document2  document3a  file1  file3  savedemail  test1  zap
document1  document3  document3b  file2  rap    something   testb
$ ls document*
document1  document2  document3  document3a  document3b
$ ls document?
document1  document2  document3
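
Since I mentioned shell scripts, here is a minimal sketch of how the same two wildcards might be used in one. It assumes a POSIX-style shell with mktemp available, and the scratch directory and the count_star and count_qmark variables are names I made up for the illustration.

```shell
#!/bin/sh
# Build a scratch directory holding a few of the file names from the listing above.
workdir=$(mktemp -d)
touch "$workdir/document1" "$workdir/document2" "$workdir/document3" \
      "$workdir/document3a" "$workdir/document3b" "$workdir/file1"

# The shell expands the pattern into a list of names before the loop starts.
count_star=0
for f in "$workdir"/document*; do
    count_star=$((count_star + 1))
done

# The question mark insists on exactly one character after "document".
count_qmark=0
for f in "$workdir"/document?; do
    count_qmark=$((count_qmark + 1))
done

echo "document* matched $count_star names"
echo "document? matched $count_qmark names"

# Remove the scratch directory again.
rm -rf "$workdir"
```

One thing to watch for in scripts: with the default settings of sh and Bash, a pattern that matches nothing is left as it was typed, so the loop body would run once with the literal pattern.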

I won't go into too much depth on regular expressions here. There are many tutorials and even courses that cover them in depth, as they can get very complicated. The goal here is to demonstrate some basic usage so you are aware of their existence, and hopefully to whet your appetite enough to want to learn to use them even more effectively.

Let's use the following directory for the next examples:

$ ls
cap        document12  document14  document3   document3b  file2  rap
document1  document13  document2   document3a  file1       file3  zap

Anything within brackets [ ] matches a single character in a string. This is how the majority of your expressions will be represented. You can use a list of characters that you want to match or a range; for example, [abcd] is the same as [a-d]. The saving is more evident when including more characters, such as [A-Za-z]. On that note, yes, these patterns are case sensitive, although how a range such as [a-z] treats uppercase letters can depend on your locale settings. If you want to include the hyphen/minus symbol in your search, it needs to be at the beginning or end of the list, like [-a-z]. If you wish to include the closing bracket, it needs to be the first character in the list, right after the opening bracket, like []a-z]. Here are some examples to show you what I mean.

$ ls *ap
cap  rap  zap
$ ls [a-y]ap
cap  rap
$ ls [rz]ap
rap  zap
$ ls [a-ms-z]ap
cap  zap
$ ls document*
document1   document13  document2  document3a
document12  document14  document3  document3b
$ ls document[1-3][a-z]
document3a  document3b
$ ls document[1-3]
document1  document2  document3
$ ls document[1-3]?
document12  document13  document14  document3a  document3b
$ ls document[1-3]*
document1   document13  document2  document3a
document12  document14  document3  document3b
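
The hyphen and closing-bracket rules are hard to show with ls unless you create oddly named files, so here is a rough sketch that tests names against patterns with a case statement instead, which uses the same matching rules as filename expansion. The matches function is a hypothetical helper of my own, and I set LC_ALL=C on the assumption that you want ranges to follow plain ASCII order.

```shell
#!/bin/sh
# Assumption: force the C locale so ranges such as [a-z] follow ASCII order.
LC_ALL=C
export LC_ALL

# matches PATTERN STRING (hypothetical helper)
# Succeeds when STRING matches the shell pattern PATTERN.
matches() {
    case $2 in
        $1) return 0 ;;   # unquoted, so $1 is treated as a pattern
        *)  return 1 ;;
    esac
}

# Try every name against every pattern and report the hits.
for name in cap rap zap -ap ']ap' Cap; do
    for pattern in '[rz]ap' '[a-y]ap' '[-a-z]ap' '[]a-z]ap'; do
        if matches "$pattern" "$name"; then
            echo "$pattern matches $name"
        fi
    done
done
```

Note that the patterns are single-quoted when they are passed around, so the shell does not try to expand them against the files in the current directory first.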

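The caret symbol (^) at the start of a bracket list basically says 'everything except' the characters listed. The portable POSIX spelling is an exclamation mark, as in [!c]; Bash and many other shells accept the caret as well. Curly brackets { } have no special meaning inside the brackets: they are simply two more characters in the list, so [^c{r}] excludes c, r, { and }, and with these files it matches the same names as [^cr].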

$ ls *ap
cap  rap  zap
$ ls [^c]ap
rap  zap
$ ls [^cr]ap
zap
$ ls [^c{r}]ap
zap
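
To check how curly brackets behave inside a bracket list, here is a small sketch using the exclamation-mark spelling of negation, which is the form POSIX specifies. It assumes mktemp is available; the scratch directory, the extra file named {ap, and the two counter variables are my own additions for the test.

```shell
#!/bin/sh
# Scratch directory with the three files from above plus one named "{ap".
workdir=$(mktemp -d)
touch "$workdir/cap" "$workdir/rap" "$workdir/zap" "$workdir/{ap"

# Any single character except c or r, followed by "ap".
set -- "$workdir"/[!cr]ap
without_braces=$#

# The braces are ordinary list members here, so "{ap" is excluded too.
set -- "$workdir"/[!c{r}]ap
with_braces=$#

printf '%s matched %s names\n' '[!cr]ap' "$without_braces"
printf '%s matched %s names\n' '[!c{r}]ap' "$with_braces"

# Remove the scratch directory again.
rm -rf "$workdir"
```

The first pattern should pick up both zap and {ap, while the second should pick up only zap, which suggests the braces are being treated as plain characters to exclude rather than as any kind of sub-expression.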

This concludes my introduction to using regular expressions within a shell. Keep in mind that this is not all-inclusive, and there is far more to learn about regular expressions and their uses. Here are some websites that cover regex in more depth...

Yuvalinux - Place to Learn Linux
