Regular Expressions
A regular expression is a pattern template you define that a Linux utility (such as sed or gawk) uses to filter text.
Regular expressions are implemented by a regular expression engine.
In the Linux world, there are two popular regular expression engines:
■ The POSIX Basic Regular Expression (BRE) engine
■ The POSIX Extended Regular Expression (ERE) engine
Defining BRE Patterns
1. Plain text
$ echo "This is a test" | sed -n '/test/p'
This is a test
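A regular expression does not care where in the data stream the pattern occurs; what matters is whether the pattern matches the text. Patterns are case sensitive, and spaces are treated like any other character.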
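The three properties above (position does not matter, case matters, spaces count) can be checked with a short sketch. The helper name matches is hypothetical, and it assumes the pattern contains no forward slash, since the slash delimits the sed pattern.

```shell
# Hypothetical helper: print the text only if it matches the BRE pattern.
# $1 = pattern (must not contain '/'), $2 = text to test
matches() {
    printf '%s\n' "$2" | sed -n "/$1/p"
}

matches 'test'  'This is a test'   # prints the line: the pattern may occur anywhere
matches 'Test'  'This is a test'   # prints nothing: matching is case sensitive
matches 'is a'  'This is a test'   # prints the line: the single space matches
matches 'is  a' 'This is a test'   # prints nothing: two spaces do not match one
```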
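2. Special characters
The special characters recognized by regular expressions are:
.*[]^${}\+?|()
Do not use these characters by themselves in a text pattern. To match one of them as a literal character, escape it with a backslash (\).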
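A small sketch of escaping, assuming a hypothetical helper named matches. It also escapes the forward slash, which is not a regular expression special character but delimits the pattern in sed.

```shell
# Hypothetical helper: print the text only if it matches the BRE pattern.
# $1 = pattern, $2 = text to test
matches() {
    printf '%s\n' "$2" | sed -n "/$1/p"
}

matches '\$4'   'The cost is $4.00'   # prints the line: \$ is a literal dollar sign
matches '\\'    'a \ b'               # prints the line: \\ is a literal backslash
matches '\/'    '3 / 2'               # prints the line: \/ is a literal slash in sed
matches 'a\.b'  'axb'                 # prints nothing: \. matches only a real dot
matches 'a.b'   'axb'                 # prints the line: an unescaped dot matches any character
```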
3. Anchor characters
1) The caret character (^) defines a pattern that starts at the beginning of a line of text in the data stream. If the pattern is located any place other than the start of the line of text, the regular expression pattern fails.
$ echo "Books are great" | sed -n '/^Book/p'
Books are great
2) The opposite of looking for a pattern at the start of a line is looking for it at the end of a line. The dollar sign ($) special character defines the end anchor. Add this special character after a text pattern to indicate that the line of data must end with the text pattern:
$ echo "This is a good book" | sed -n '/book$/p'
This is a good book
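The two anchors can also be used together. The sketch below is an illustration, not part of the original examples: it matches a line only when the whole line equals the text, and uses the pattern ^$ to delete empty lines. The function name exact_line is hypothetical.

```shell
# Hypothetical helper: print the text only if the entire line is "this is a test".
exact_line() {
    printf '%s\n' "$1" | sed -n '/^this is a test$/p'
}

exact_line 'this is a test'          # prints the line
exact_line 'I said this is a test'   # prints nothing: extra text before the pattern

# ^$ matches a line with nothing between its start and its end,
# so this deletes the empty line and keeps the other two.
printf '%s\n' 'line one' '' 'line two' | sed '/^$/d'
```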
3) The dot special character is used to match any single character except a newline character. The dot character must match a character though; if there’s no character in the place of the dot, then the pattern will fail.
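A minimal sketch of the dot, using made-up sample lines and a hypothetical function name dot_demo. A space counts as a character, while the start of the line does not.

```shell
# Feed three sample lines through the pattern ".at".
dot_demo() {
    printf '%s\n' \
        'The cat is sleeping.' \
        'This test is at line two.' \
        'at the start of the line' |
        sed -n '/.at/p'
}

# The first line matches ("cat"), the second matches (space before "at"),
# the third does not: nothing precedes "at" for the dot to match.
dot_demo
```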
4) Character classes
Use square brackets to define a character class.
$ sed -n '/[ch]at/p' data6
The cat is sleeping.
That is a very nice hat.
5) Negating character classes
$ sed -n '/[^ch]at/p' data6
This test is at line two.
6) Using ranges
You can use a range of characters within a character class by using the dash symbol.
Just specify the first character in the range, a dash, then the last character in the range. The regular expression includes any character that’s within the specified character range.
$ sed -n '/^[0-9][0-9][0-9][0-9][0-9]$/p' data8
60633
46201
45902
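The anchors in the five-digit pattern matter as much as the ranges. The following sketch wraps the pattern in a hypothetical function is_zip and compares it with an unanchored variant (also hypothetical), which accepts any line that merely contains five digits in a row.

```shell
# Hypothetical helpers: succeed if the argument matches the pattern.
is_zip() {
    [ -n "$(printf '%s\n' "$1" | sed -n '/^[0-9][0-9][0-9][0-9][0-9]$/p')" ]
}
has_five_digits() {
    [ -n "$(printf '%s\n' "$1" | sed -n '/[0-9][0-9][0-9][0-9][0-9]/p')" ]
}

for code in 60633 4620 461234 4a201; do
    if is_zip "$code"; then echo "valid:   $code"; else echo "invalid: $code"; fi
done

# Without ^ and $, the six-digit value slips through.
if has_five_digits 461234; then echo "unanchored pattern accepts 461234"; fi
```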
7) Special character classes
BRE Special Character Classes
Class          Description
[[:alpha:]]    Match any alphabetical character, either upper or lower case.
[[:alnum:]]    Match any alphanumeric character 0–9, A–Z, or a–z.
[[:blank:]]    Match a space or Tab character.
[[:digit:]]    Match a numerical digit from 0 through 9.
[[:lower:]]    Match any lower-case alphabetical character a–z.
[[:print:]]    Match any printable character.
[[:punct:]]    Match a punctuation character.
[[:space:]]    Match any whitespace character: space, Tab, NL, FF, VT, CR.
[[:upper:]]    Match any upper-case alphabetical character A–Z.
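A short sketch of three of these classes in use. The sample lines and the function name sample are made up for illustration.

```shell
# Hypothetical sample data, one entry per line.
sample() {
    printf '%s\n' 'abc' 'ABC' 'abc123' 'no digits, but punctuation!'
}

sample | sed -n '/[[:digit:]]/p'                  # lines that contain a digit
sample | sed -n '/[[:punct:]]/p'                  # lines that contain punctuation
sample | sed -n '/^[[:upper:]][[:upper:]]*$/p'    # lines made only of upper-case letters
```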
8) Asterisk
Placing an asterisk after a character signifies that the character must appear zero or more times in the text to match the pattern:
$ echo "ik" | sed -n '/ie*k/p'
ik
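The asterisk can follow a dot or a character class as well as a plain character. A sketch with a hypothetical helper named matches:

```shell
# Hypothetical helper: print the text only if it matches the BRE pattern.
# $1 = pattern (must not contain '/'), $2 = text to test
matches() {
    printf '%s\n' "$2" | sed -n "/$1/p"
}

matches 'colou*r' 'color'      # prints the line: zero occurrences of "u"
matches 'colou*r' 'colour'     # prints the line: one occurrence of "u"
matches 'b[ae]*t' 'baaeeet'    # prints the line: any mix of "a" and "e" between b and t
matches 'b[ae]*t' 'baakeeet'   # prints nothing: "k" is not in the class
matches 'regular.*expression' 'this is a regular pattern expression'   # .* spans any text
```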
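Extended Regular Expressions
gawk supports these patterns; sed, which uses BRE by default, does not. (GNU sed accepts them only when invoked with the -E option.)
1) Question mark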
The question mark is similar to the asterisk, but with a slight twist. The question mark indicates that the preceding character can appear zero or one time, but that’s all. It doesn’t match repeating occurrences of the character:
$ echo "bt" | gawk '/be?t/{print $0}'
bt
2) Plus sign
The plus sign is another pattern symbol that’s similar to the asterisk, but with a different twist than the question mark. The plus sign indicates that the preceding character can appear one or more times, but must be present at least once. The pattern doesn’t match if the character is not present:
$ echo "beeet" | gawk '/be+t/{print $0}'
beeet
3) Curly braces
Curly braces are available in ERE to allow you to specify a limit on a repeatable regular expression. This is often referred to as an interval. You can express the interval in two formats:
■ m: The regular expression appears exactly m times.
■ m,n: The regular expression appears at least m times, but no more than n times.
This feature allows you to fine-tune exactly how many times you allow a character (or character class) to appear in a pattern.
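Both interval formats can be tried with the sketch below. It uses grep -E rather than gawk, purely as an assumption that grep is available on more systems; the interval syntax is the same ERE syntax. The helper name check is hypothetical.

```shell
# Hypothetical helper: report whether the text matches the ERE pattern.
# $1 = pattern, $2 = text to test
check() {
    if printf '%s\n' "$2" | grep -Eq "$1"; then
        echo "match: $2"
    else
        echo "no match: $2"
    fi
}

check 'be{1}t'   'bet'     # match: exactly one "e"
check 'be{1}t'   'beet'    # no match: two "e" characters
check 'be{1,2}t' 'beet'    # match: between one and two "e" characters
check 'be{1,2}t' 'beeet'   # no match: three is more than the upper limit
check 'be{1,2}t' 'bt'      # no match: zero is fewer than the lower limit
```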
4) Pipe symbol
The pipe symbol allows you to specify two or more patterns that the regular expression engine uses in a logical OR formula when examining the data stream. If any of the patterns match the data stream text, the text passes. If none of the patterns match, the data stream text fails.
The format for using the pipe symbol is:
expr1|expr2|...
$ echo "The cat is asleep" | gawk '/cat|dog/{print $0}'
The cat is asleep
5) Grouping expressions
Regular expression patterns can also be grouped by using parentheses. When you group a regular expression pattern, the group is treated like a standard character. You can apply a special character to the group just as you would to a regular character.
$ echo "Sat" | gawk '/Sat(urday)?/{print $0}'
Sat
$ echo "Saturday" | gawk '/Sat(urday)?/{print $0}'
Saturday
$
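Grouping is often combined with the pipe symbol to build sets of alternatives. The sketch below uses grep -E instead of gawk as an assumption about what is installed; the pattern syntax is the same. The helper name check is hypothetical.

```shell
# Hypothetical helper: report whether the text matches the ERE pattern.
# $1 = pattern, $2 = text to test
check() {
    if printf '%s\n' "$2" | grep -Eq "$1"; then
        echo "match: $2"
    else
        echo "no match: $2"
    fi
}

check '(c|b)a(b|t)'    'cat'        # match: "c", then "a", then "t"
check '(c|b)a(b|t)'    'bab'        # match: "b", then "a", then "b"
check '(c|b)a(b|t)'    'tab'        # no match: "t" is not one of the first alternatives
check '^Sat(urday)?$'  'Saturday'   # match: the optional group is present
check '^Sat(urday)?$'  'Saturn'     # no match: "urn" is not the optional group
```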
A Few Examples
1) Counting directory files
$ cat countfiles
#!/bin/bash
# count number of files in your PATH
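# turn the colon-separated PATH into a space-separated list
mypath=$(echo "$PATH" | sed 's/:/ /g')
count=0
for directory in $mypath
do
    check=$(ls "$directory")
    for item in $check
    do
        count=$(( count + 1 ))
    done
    echo "$directory - $count"
    count=0    # reset the counter for the next directory
done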
$ ./countfiles
/usr/local/bin - 79
/bin - 86
/usr/bin - 1502
/usr/X11R6/bin - 175
/usr/games - 2
/usr/java/j2sdk1.4.1_01/bin - 27
2) Validating a phone number
$ cat isphone
#!/bin/bash
# script to filter out bad phone numbers
gawk --re-interval '/^\(?[2-9][0-9]{2}\)?(| |-|\.)[0-9]{3}( |-|\.)[0-9]{4}/{print $0}'
Older versions of gawk do not recognize regular expression intervals by default, so the --re-interval command line option is needed there; gawk 4.0 and later recognize intervals without it.
The pattern is meant to accept the common U.S. formats below. The digits are placeholders: the pattern also requires the area code to start with a digit from 2 through 9.
(123)456-7890
(123) 456-7890
123-456-7890
123.456.7890
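The phone pattern can be exercised against each format with a sketch like the one below. It is a variant, not the script above: it uses grep -E instead of gawk, writes the empty alternative (| |-|\.) as an optional group, and adds a trailing $ so that extra digits are rejected. All three changes are assumptions made for illustration, and the name is_phone is hypothetical.

```shell
# Hypothetical helper: succeed if the argument looks like a U.S. phone number.
is_phone() {
    printf '%s\n' "$1" |
        grep -Eq '^\(?[2-9][0-9]{2}\)?( |-|\.)?[0-9]{3}( |-|\.)[0-9]{4}$'
}

for number in '317-555-1234' '(317)555-1234' '(317) 555-1234' '317.555.1234' \
              '123-456-7890' '317 555 12345'; do
    if is_phone "$number"; then
        echo "valid:   $number"
    else
        echo "invalid: $number"
    fi
done
```

The last two entries are reported as invalid: the first starts its area code with 1, and the second has five digits in the final group.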
3) Parsing an email address
^([a-zA-Z0-9_\-\.\+]+)@([a-zA-Z0-9_\-\.]+)\.([a-zA-Z]{2,5})$
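A sketch of how an email pattern of this shape might be tested. It uses grep -E, where a backslash inside square brackets is taken literally, so the character classes are rewritten without backslashes and with the hyphen placed last. The underscore in the classes and the helper name is_email are assumptions, not taken from the pattern as printed above.

```shell
# Hypothetical helper: succeed if the argument looks like user@host.tld.
is_email() {
    printf '%s\n' "$1" |
        grep -Eq '^[a-zA-Z0-9_.+-]+@[a-zA-Z0-9_.-]+\.[a-zA-Z]{2,5}$'
}

for address in 'rich@here.now' 'rich.blum@here.now' 'rich@here.n' \
               'rich/blum@here.now' 'rich@here-now'; do
    if is_email "$address"; then
        echo "valid:   $address"
    else
        echo "invalid: $address"
    fi
done
```

The first two addresses pass. The others fail because the top-level domain is too short, the user name contains a slash, or the host has no dot-separated top-level domain.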