Basic and extended regular expressions: usage and application scenarios

Regular expression

1. What is a regular expression

Simply put, a regular expression is a symbolic notation for identifying patterns in text. Regular expressions resemble the shell wildcards used to match file and path names, but they are far more versatile. Many command-line tools and most programming languages support them as a way to solve text manipulation problems. However, the syntax varies slightly from tool to tool and from language to language, which complicates matters. For convenience, this discussion is limited to regular expressions as described by the POSIX standard, which covers most command-line tools. Many programming languages (most notably Perl) use a somewhat larger and richer set of notations.

2. Regular expression classification

"Regular expression" is commonly abbreviated REGEXP (REGular EXPression).
POSIX regular expressions are divided into two categories:

  • Basic REGEXP (basic regular expression)
  • Extended REGEXP (extended regular expression)

3. Basic regular expressions

[root@redhat ~]# ls
1 2 4 6 8 a b c d e f g h i j k l m n o p q r s t u v w x y z
10 3 5 7 9 A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
//Metacharacters
    . //Match any single character
    [] //Match any single character in the specified set or range
    [^] //Match any single character not in the specified set or range (negation)
    
Effect test:
[root@redhat ~]# ls | grep '[adf]' //Do not separate the characters with commas or spaces; otherwise the comma and the space are themselves treated as characters to match
a
d
f
[root@redhat ~]# ls | grep '[^adf]' //A leading ^ negates the set: match any single character other than a, d, or f
[root@redhat ~]# ls | grep '[a-z]' //Match lowercase a through z. Whether [a-Z] spans lowercase a through uppercase Z depends on the locale's collation order; [a-zA-Z] or [[:alpha:]] is more portable.
a
b
c
d
e
f
g
... (output omitted)
t
u
v
w
x
y
z
[root@redhat ~]# ls | grep '[^a-Z]' //Match any single character other than the letters a through Z (the range a-Z is locale-dependent)
1
10
2
3
4
5
6
7
8
9
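
The meaning of a range such as [a-Z] depends on the locale, and in the C locale it may be rejected as an invalid range. The following is a minimal sketch, assuming a POSIX-compatible grep, that uses the named classes [[:lower:]] and [[:alpha:]] instead. The sample names are made up for illustration and stand in for the ls output above.

```shell
# Sample names standing in for the ls output above (illustrative only).
names=$(printf '%s\n' a d Z 1 10)

# [[:lower:]] matches lowercase letters regardless of collation order.
lower=$(printf '%s\n' "$names" | LC_ALL=C grep '[[:lower:]]')

# [^[:alpha:]] matches any character that is not a letter.
non_alpha=$(printf '%s\n' "$names" | LC_ALL=C grep '[^[:alpha:]]')

printf 'lowercase:\n%s\n' "$lower"
printf 'non-letters:\n%s\n' "$non_alpha"
```

With this sample input, the first search should keep a and d, and the second should keep 1 and 10.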

Supplement: the escape character "\"

Some characters have a special meaning in regular expressions or in the shell, and sometimes we want them to be taken literally instead. For example, "|" is the shell pipe, which passes the output of the previous command to the next command as its input; "^" anchors the beginning of a line; and "$" anchors the end of a line. When one of these characters should simply stand for itself, its special meaning has to be removed. This is called escaping, and it is done by putting "\" in front of the character.
Effect test:
[root@redhat ~]# ls
'^' 2 5 8 A B d E g H j K m N p Q s T v W y Z
 1 3 6 9 abc c D f G i J l M o P r S u V x Y
 10 4 7 a b C e F h I k L n O q R t U w X z
 
Suppose we want to match the literal character "^". A first attempt might be:
[root@redhat ~]# ls | grep '[^]'
grep: Unmatched [, [^, [:, [., or [= //Syntax error: a leading "^" inside brackets negates the set, and the "]" right after it is taken as a literal member, so the bracket expression is never closed.

The following produces the desired output here:
[root@redhat ~]# ls | grep '[\^]' //Prints "^". Note that inside brackets "\" is itself literal, so this set matches either "\" or "^"; outside brackets, '\^' matches a literal "^"
 ^
 
A "^" that is not the first character inside the brackets is also literal, so we can put it at the end of the set, as follows:
 [root@redhat ~]# ls | grep '[235^]'
^
2
3
5
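
The three ways of matching a literal "^" can be compared side by side. This is a minimal sketch with made-up sample names (not the directory listing above); it assumes a grep that supports the standard -F option.

```shell
# Illustrative names; 'a^b' is added to show that matches are not whole-line.
names=$(printf '%s\n' '^' 2 abc 'a^b')

# Outside brackets, \^ is a literal caret.
escaped=$(printf '%s\n' "$names" | grep '\^')

# -F treats the whole pattern as a fixed string, so no escaping is needed.
fixed=$(printf '%s\n' "$names" | grep -F '^')

# Inside brackets, a caret that is not first is literal: this set is "2 or ^".
bracket=$(printf '%s\n' "$names" | grep '[2^]')

printf '%s\n' "$escaped" "$fixed" "$bracket"
```

The first two searches should select the same lines; the bracket form additionally selects the name 2.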
[root@redhat ~]# ls
'^' 11 3 5 8 A B d E g H j K m N p Q s T v W y Z
 1 2 33 6 9 abc c D f G i J l M o P r S u V x Y
 10 22 4 7 a b C e F h I k L n O q R t U w X z
//Quantifiers (greedy)
    * //Match the preceding character any number of times, including zero
    .* //Match any characters of any length; .* in a regular expression plays the role of * in a shell wildcard
    \? //Match the preceding character zero or one time (GNU extension in basic regular expressions)
    \+ //Match the preceding character at least once (GNU extension in basic regular expressions)
    \{m,n\} //Match the preceding character at least m times and at most n times

Effect test:
[root@redhat ~]# ls | grep '^[123]*$'
1
11
2
22
3
33

[root@redhat ~]# ls | grep '^1\?$' //Match the preceding 1 zero or one time. Zero times would be an empty name, which cannot exist, so only 1 is printed
1
[root@redhat ~]# ls | grep '^11\?$'
1
11

[root@redhat ~]# touch ba baa baaa
[root@redhat ~]# ls | grep '^ba\+$' //Match the preceding character at least once
ba
baa
baaa

[root@redhat ~]# ls | grep '^ba\{2,3\}$' // Match the previous single character at least 2 times and at most 3 times
baa
baaa
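
In basic regular expressions, \? and \+ are GNU extensions, while the interval form \{m,n\} is defined by POSIX. The sketch below, using made-up names rather than the files created above, shows how the interval form can express the same quantifiers on tools that lack the GNU spellings.

```shell
# Illustrative names only.
names=$(printf '%s\n' b ba baa baaa baaaa)

# One or more a: \{1,\} is the POSIX spelling of GNU's \+.
one_or_more=$(printf '%s\n' "$names" | grep '^ba\{1,\}$')

# Zero or one a: \{0,1\} is the POSIX spelling of GNU's \?.
zero_or_one=$(printf '%s\n' "$names" | grep '^ba\{0,1\}$')

# Between two and three a.
two_to_three=$(printf '%s\n' "$names" | grep '^ba\{2,3\}$')

printf '%s\n' "$one_or_more" "$zero_or_one" "$two_to_three"
```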

//Position anchoring
    ^ //Anchor the beginning of the line: the pattern after it must appear at the start of the line
    $ //Anchor the end of the line: the pattern before it must appear at the end of the line
    ^$ //Blank line
    \< or \b //Anchor the beginning of a word: the pattern after it must appear at the start of a word
    \> or \b //Anchor the end of a word: the pattern before it must appear at the end of a word
//Grouping
    \(\)
    Example: \(ab\)*
    //Back references
        \1 //Refers to the text matched by the group opened by the first left parenthesis
        \2 //Refers to the text matched by the group opened by the second left parenthesis
        
The effect is as follows:
[root@redhat ~]# vim abc
[root@redhat ~]# cat abc
hello 0714-2564851 hehe
(+86)13872364194
1514864158
                             //There is a blank line here
[root@redhat ~]# cat abc | grep '^$' //Match blank lines

[root@redhat ~]# cat abc | grep -v '^$' //The -v option inverts the match, so only the lines that are not blank are printed
hello 0714-2564851 hehe
(+86)13872364194
1514864158
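
The pattern '^$' matches only completely empty lines. A line that holds nothing but spaces or tabs looks blank but is not matched. The following sketch, on made-up sample text, counts both kinds with the standard -c option and filters them out with -v.

```shell
# Sample text: three text lines, one empty line, one line of three spaces.
text=$(printf 'hello\n\n   \nworld\nend')

# -c counts matching lines instead of printing them.
empty=$(printf '%s\n' "$text" | grep -c '^$')

# [[:space:]]* also accepts lines that contain only whitespace.
blank=$(printf '%s\n' "$text" | grep -c '^[[:space:]]*$')

# -v keeps only the lines that have some non-whitespace content.
kept=$(printf '%s\n' "$text" | grep -v '^[[:space:]]*$')

printf 'empty=%s blank=%s\n%s\n' "$empty" "$blank" "$kept"
```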

[root@redhat ~]# echo "hello world hello1 tom hello zhangsan" | grep '\<hello\>'
hello world hello1 tom hello zhangsan //The line is printed because it contains hello as a whole word (the first and the third hello). hello1 starts with hello but does not end at a word boundary, so it is not matched.
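
Because grep prints whole lines, the effect of the word anchors is easier to see when each candidate sits on its own line. This is a minimal sketch assuming GNU grep (for \< and \>); the -w option, available in GNU and BSD grep, is shown as a similar whole-word shortcut. The sample lines are made up.

```shell
# One candidate per line (illustrative data).
lines=$(printf '%s\n' 'hello world' 'hello1 tom' 'say hello' 'othello')

# \< and \> require a word boundary on both sides of hello.
whole=$(printf '%s\n' "$lines" | grep '\<hello\>')

# -w asks grep to accept only matches that form whole words.
with_w=$(printf '%s\n' "$lines" | grep -w 'hello')

printf '%s\n' "$whole"
```

Both searches should keep "hello world" and "say hello" and drop the lines with hello1 and othello.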

[root@redhat ~]# touch ab abab ababab abababab
[root@redhat ~]# ls |grep '^\(ab\)\?$'
ab
[root@redhat ~]# ls |grep '^\(ab\)\1$' //\1 must repeat exactly the text matched by the first group, so the pattern means abab
abab
[root@redhat ~]# touch abdcdc
[root@redhat ~]# ls |grep '^\(ab\)\(dc\)\2$' //\2 must repeat exactly the text matched by the second group, so the pattern means abdcdc
abdcdc
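
A back reference repeats whatever text the group actually matched, not the group's pattern. The following sketch, on made-up names, uses that to find names that consist of some string written twice.

```shell
# Illustrative names only.
names=$(printf '%s\n' ab abab abcabc abcab xyzxyz)

# \(..*\) captures one or more characters; \1 must repeat exactly that text.
doubled=$(printf '%s\n' "$names" | grep '^\(..*\)\1$')

printf '%s\n' "$doubled"
```

Here abab, abcabc and xyzxyz qualify, while ab and abcab do not, because neither can be split into two identical halves.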

Extended content:
Swap the positions of world and ftx in hello world ftx
[root@redhat ~]# echo 'hello world ftx' | sed 's/hello \(.*\) \(.*\)/hello \2 \1/g'
hello ftx world //As mentioned before, .* matches characters of any length. In the replacement, \2 stands for the text matched by the second group and \1 for the text matched by the first group.
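
The swap above works because the input has exactly three words. Since .* is greedy, a longer input changes what each group captures. The sketch below illustrates this and shows one possible alternative, [^ ]*, which cannot cross a space; the four-word input is made up for the demonstration.

```shell
# Greedy .* takes as much as it can, so with four words the first group
# captures "a b" and the second captures "c".
greedy=$(echo 'hello a b c' | sed 's/hello \(.*\) \(.*\)/hello \2 \1/')

# [^ ]* cannot cross a space, so each group holds exactly one word.
exact=$(echo 'hello world ftx' | sed 's/^\([^ ]*\) \([^ ]*\) \([^ ]*\)$/\1 \3 \2/')

printf '%s\n' "$greedy" "$exact"
```

The first command yields "hello c a b"; the second yields "hello ftx world", as in the original example.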

4. Extended regular expressions

//Character matching
    . //Match any single character
    [] //Match any single character in the specified set or range
    [^] //Match any single character not in the specified set or range
//Quantifiers
    * //Match the preceding character or group any number of times, including zero
    ? //Match the preceding character or group zero or one time
    + //Match the preceding character or group at least once
    {m,n} //Match the preceding character or group at least m times and at most n times

//Position anchoring
    ^ //Anchor the beginning of the line: the pattern after it must appear at the start of the line
    $ //Anchor the end of the line: the pattern before it must appear at the end of the line
    ^$ //Blank line
    \< or \b //Anchor the beginning of a word: the pattern after it must appear at the start of a word
    \> or \b //Anchor the end of a word: the pattern before it must appear at the end of a word
//Grouping
    () //Group
    Example: (ab)*
    //Back references
        \1, \2, \3, ... //Each refers to the text matched by the group with that number
        \1 //Refers to the text matched by the group opened by the first left parenthesis
        \2 //Refers to the text matched by the group opened by the second left parenthesis
//Alternation
    | //Or: by default it separates everything to its left from everything to its right
    //Example: C|cat means C or cat. To mean Cat or cat, use grouping: (C|c)at
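
The scope of | is easiest to check on a small input. This is a minimal sketch with made-up names; it uses the standard -x option so that only whole-line matches count.

```shell
# Illustrative names only.
names=$(printf '%s\n' C c cat Cat at)

# Without grouping, | splits the whole pattern: "C" or "cat".
loose=$(printf '%s\n' "$names" | grep -Ex 'C|cat')

# With grouping, | applies only inside the parentheses: "Cat" or "cat".
grouped=$(printf '%s\n' "$names" | grep -Ex '(C|c)at')

printf 'C|cat:\n%s\n(C|c)at:\n%s\n' "$loose" "$grouped"
```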

5. Advantages of extended regular expressions

Compared with basic regular expressions, extended regular expressions remove the need for most backslashes, which makes patterns more concise and easier to read.
//Add the -E option to grep or sed to use extended regular expressions
The effect is as follows:
[root@redhat ~]# echo 'hello world ftx' | sed -E 's/hello (.*) (.*)/hello \2 \1/g'
hello ftx world
[root@redhat ~]# ls |grep -E '^(ab)(dc)\2$'
abdcdc

The two results above show that extended regular expressions behave essentially the same as basic regular expressions, because the extended syntax is a refinement of the basic one. The most visible difference is that the grouping parentheses no longer need a backslash; the same holds for ?, +, {m,n} and |.

[root@redhat ~]# touch c cat
[root@redhat ~]# ls | grep '^c\|cat$'
c
cat
[root@redhat ~]# ls | grep -E '^c|cat$'
c
cat
[root@redhat ~]# touch Cat cat
[root@redhat ~]# ls | grep '^Cat\|cat'
cat
Cat
[root@redhat ~]# ls | grep -E '^Cat|cat'
cat
Cat
[root@redhat ~]# ls | grep -E '^(C|c)at'
cat
Cat

Usage scenarios

At work, you will often need to extract useful information from a body of text. Regular expressions let us match and extract it.

Extract landline numbers from the abc file
[root@redhat ~]# cat abc
hello 0714-2564851 hehe
(+86)13872364194
15148641584

(0235)5766813
12345678910
[root@redhat ~]# cat abc | grep -E '\(?0[0-9]{3}\)?-?[0-9]{7}'
hello 0714-2564851 hehe
(0235)5766813

Idea analysis:
'\(?0[0-9]{3}\)?-?[0-9]{7}'

\(? escapes ( so that it is a literal parenthesis. The ? follows because the ( may or may not be present, so we match it 0 or 1 times
0[0-9]{3}: landline area codes generally start with 0, so the first character is a literal 0. [0-9] matches one digit from 0 to 9, and {3} repeats it 3 times. The area code before the - usually has 4 digits, so after the leading 0 only three more digits need to be matched
\)? matches the optional closing ) in the same way
-? matches "-" 0 or 1 times, because the - may or may not be present
[0-9]{7} matches the 7 digits that follow
//This gives us the results we want
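
grep prints the whole matching line, including the surrounding words. If only the number itself is wanted, the -o option can be added. Note that -o is an assumption about the grep in use: it is available in GNU and BSD grep but is not required by POSIX. The sample lines are copied from the abc file.

```shell
# Sample lines copied from the abc file above.
lines=$(printf '%s\n' 'hello 0714-2564851 hehe' '(+86)13872364194' '(0235)5766813')

# -o prints each match on its own line instead of the whole matching line.
landlines=$(printf '%s\n' "$lines" | grep -oE '\(?0[0-9]{3}\)?-?[0-9]{7}')

printf '%s\n' "$landlines"
```

With this input the result should be the two numbers 0714-2564851 and (0235)5766813, without the words "hello" and "hehe".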



//Once the individual pieces are understood, queries like this are easy to write. Extracting mobile phone numbers works the same way:
The content of the file abc is the same as before:
[root@redhat ~]# cat abc
hello 0714-2564851 hehe
(+86)13872364194
15148641584

(0235)5766813
12345678910

//The idea is the same, but some domain knowledge is needed. Carriers assign mobile numbers with fixed prefixes, for example 138, 155, 177, 183, 191. The first digit of a domestic mobile number must be 1, and the second digit is restricted: here it is taken to be only 3, 5, 7, 8, or 9, so the pattern has to take this into account. In some cases the number is also preceded by the international country code (+86), which should be matched as an optional prefix. Putting these points together, the result is as follows:
[root@redhat ~]# cat abc | grep -E '(\(\+86\))?1[35789][0-9]{9}'
(+86)13872364194
15148641584

Without this domain knowledge, the following happens:
[root@redhat ~]# cat abc | grep -E '(\(\+86\))?1[0-9]{10}'
(+86)13872364194
15148641584
12345678910 //This line is matched even though it is not a mobile phone number.
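
One further point: the mobile number pattern is not anchored, so a longer run of digits that merely contains a valid-looking number also matches. A possible refinement, assuming each line holds a single candidate number, is to anchor the pattern with ^ and $. The 15-digit entry below is made up to show the difference.

```shell
# Candidates: two valid numbers, one invalid prefix, one string that is too long.
numbers=$(printf '%s\n' '(+86)13872364194' '15148641584' '12345678910' '151486415841234')

# Unanchored: the 15-digit string still matches because it contains 11 suitable digits.
loose=$(printf '%s\n' "$numbers" | grep -E '(\(\+86\))?1[35789][0-9]{9}')

# Anchored with ^ and $: the whole line must be the number, nothing more.
strict=$(printf '%s\n' "$numbers" | grep -E '^(\(\+86\))?1[35789][0-9]{9}$')

printf 'loose:\n%s\nstrict:\n%s\n' "$loose" "$strict"
```

When the numbers are embedded in other text, as in the first line of abc, full-line anchors are too strict and a different boundary would be needed.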