Friday, March 15, 2013

Introduction to Regular Expressions within a Shell





A utility that every Linux user will use is ls (short for "list"), which lists the contents of a directory. By default, ls lists every entry in a directory (apart from hidden files, whose names start with a dot), which is fine when there are only a few files. But what if that directory contains dozens or hundreds of files? Wildcards, combined with the bracket patterns shown below, let you search for partial filenames effectively and elegantly. Strictly speaking, the shell calls these glob patterns; they are a simpler relative of full regular expressions (commonly referred to as regex), but they are a good first step toward them. The usefulness of these techniques becomes more evident when writing shell scripts.

Most users coming from other operating systems are familiar with the asterisk (*) wildcard, and some may know about the question mark (?) wildcard when searching for files. An asterisk, used in a search pattern, basically says, "any number of characters (including zero) that can be anything". A question mark means, "a single character that can be anything".

Take note of the difference between the two in practice:
$ ls
cap document2 document3a file1 file3 savedemail test1 zap
document1 document3 document3b file2 rap something testb
$ ls document*
document1 document2 document3 document3a document3b
$ ls document?
document1 document2 document3
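
Because the shell expands these patterns before ls ever runs, they behave the same way inside a script. Here is a minimal sketch of that idea, assuming a scratch directory made with mktemp and a helper I have named count_matches (a made-up name, not a standard command). The file names mirror the listing above.

```shell
# Build a scratch directory so the example does not depend on existing files.
demo_dir=$(mktemp -d)
touch "$demo_dir/document1" "$demo_dir/document2" "$demo_dir/document3" \
      "$demo_dir/document3a" "$demo_dir/document3b" "$demo_dir/file1"

# count_matches is a hypothetical helper. The shell expands the pattern
# before the function is called, so "$#" is the number of matching names.
count_matches() {
    # An unmatched pattern is passed through unchanged; report that as zero.
    if [ "$#" -eq 1 ] && [ ! -e "$1" ]; then
        echo 0
    else
        echo "$#"
    fi
}

star_count=$(count_matches "$demo_dir"/document*)      # any number of trailing characters
question_count=$(count_matches "$demo_dir"/document?)  # exactly one trailing character

echo "document* matches $star_count names"
echo "document? matches $question_count names"

# Remove the scratch directory again.
rm -rf "$demo_dir"
```

With these six files, the asterisk pattern matches five names and the question mark pattern matches three, just like the ls output above.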

 


I won't go into too much depth on using regular expressions. There are many tutorials and even courses that cover them in depth, as they can get very complicated. The goal here is to demonstrate some basic usage so you are aware of their existence and hopefully whet your appetite enough to want to learn to use them even more effectively.


Let's use the following directory for the next examples:
$ ls
cap document12 document14 document3 document3b file2 rap
document1 document13 document2 document3a file1 file3 zap

 


Anything within brackets [ ] stands in for a single character in a string. This is how the majority of your expressions will be represented. You can list the characters that you want to match or give a range; for example, [abcd] is the same as [a-d]. The efficiency gain is more evident when including more characters, such as [A-Za-z]. On that note, yes, these patterns are case sensitive (although how a range such as [a-z] treats uppercase letters can depend on your locale settings). If you want to include the hyphen/minus symbol in your search, it needs to be at the beginning or end of the list, like [-a-z]. If you wish to include the closing bracket symbol, you need to place it first in the list, directly after the opening bracket, like []a-z]. Here are some examples to show you what I mean.

$ ls *ap
cap rap zap
$ ls [a-y]ap
cap rap
$ ls [rz]ap
rap zap
$ ls [a-ms-z]ap
cap zap
$ ls document*
document1 document13 document2 document3a
document12 document14 document3 document3b
$ ls document[1-3][a-z]
document3a document3b
$ ls document[1-3]
document1 document2 document3
$ ls document[1-3]?
document12 document13 document14 document3a document3b
$ ls document[1-3]*
document1 document13 document2 document3a
document12 document14 document3 document3b
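
The shell's case statement uses the same bracket syntax, which makes it a handy way to check the rules about ranges, the hyphen, and the closing bracket without creating any files. This is only a rough sketch: matches is a helper name I made up, and I set LC_ALL=C on the assumption that you want plain byte-order ranges, since range behavior can vary between locales.

```shell
# Use the C locale so ranges such as [a-d] follow plain byte order.
LC_ALL=C
export LC_ALL

# matches is a hypothetical helper: it succeeds when the string in $1
# matches the pattern in $2, using the same rules as filename patterns.
matches() {
    # $2 is left unquoted on purpose so the shell treats it as a pattern.
    case "$1" in
        $2) return 0 ;;
        *)  return 1 ;;
    esac
}

matches "c" "[a-d]"  && echo "c is in [a-d]"
matches "-" "[-a-z]" && echo "a leading hyphen is taken literally"
matches "]" "[]a-z]" && echo "a leading ] is taken literally"
matches "C" "[a-d]"  || echo "uppercase C is not in [a-d]"
```

The last line is the case sensitivity point in action: in the C locale, [a-d] does not match an uppercase C.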

 


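The caret symbol (^) at the start of a bracketed list is used to basically say 'everything except' the characters that follow. The portable spelling is an exclamation mark, as in [!c]; bash and many other shells accept either one. Note that curly brackets { } have no special meaning inside the brackets. They are simply two more characters in the list, so [^c{r}] excludes c, r, { and }, and in this directory it gives the same result as [^cr].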

$ ls *ap
cap rap zap
$ ls [^c]ap
rap zap
$ ls [^cr]ap
zap
$ ls [^c{r}]ap
zap
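
If you want to see what the curly brackets inside a bracketed list actually do, add a file whose name starts with one. The sketch below assumes a scratch directory made with mktemp, a made-up file named {ap, and a helper I have called base_names (not a standard command). It uses the [!...] spelling of negation, which POSIX specifies; bash also accepts [^...].

```shell
# Scratch directory with the three names from the listing, plus a
# hypothetical file whose name begins with a literal curly bracket.
demo_dir=$(mktemp -d)
touch "$demo_dir/cap" "$demo_dir/rap" "$demo_dir/zap" "$demo_dir/{ap"

# base_names is a hypothetical helper that prints the file-name part of
# each argument on a single line, separated by spaces.
base_names() {
    out=""
    for entry in "$@"; do
        out="$out${out:+ }${entry##*/}"
    done
    echo "$out"
}

without_cr=$(base_names "$demo_dir"/[!cr]ap)      # excludes c and r only
with_braces=$(base_names "$demo_dir"/[!c{r}]ap)   # also excludes { and }

echo "[!cr]ap    -> $without_cr"
echo "[!c{r}]ap  -> $with_braces"

# Remove the scratch directory again.
rm -rf "$demo_dir"
```

The first pattern still matches {ap alongside zap, while the second matches only zap. That difference shows the curly brackets are being treated as ordinary characters in the list rather than as any kind of grouping.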

 


This concludes my introduction to using regular expressions within a shell. Keep in mind that this is not all-inclusive, and there is a ton more information regarding regular expressions and their uses. Here are some websites that cover regex in more depth...

