
Overview

* Metacharacters
* Examples
* Exercises

Note: updated daily. + marks an update; - marks a placeholder (no content yet, to be filled in later). What's new:

  • New metacharacters added

Programming prerequisites

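This article uses C# as the tool for learning regular expressions. Regular expressions are supported in practically every language, either built in or through a library. In C# there are several ways to use them: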

// The approach mainly used in this article (convenient to write)
using System;
using System.Text.RegularExpressions;

string[] s = { "catTheOtherText", "TheOtherTextcat",
               "TheOtherText", "TheOthercatText", "ThecOtheraTextt" };
string strRegular = "^cat";
foreach (var str in s)
{
    // Regex.Match returns the first match of the pattern in str
    Match match = Regex.Match(str, strRegular);
    if (match.Success)
    {
        Console.WriteLine(str + "    The Regular is  " + strRegular);
    }
}

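This form is easy to follow, but it leaves out the matching options. The available options differ from language to language, so check the relevant manual when you need them. When you do need one (for example ignoring case, which in a shell tool such as grep is done by passing -i before the regular expression), write it like this: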

string[] s = { "catTheOtherText", "TheOtherTextcat",
               "TheOtherText", "TheOthercatText", "ThecOtheraTextt" };
string strRegular = "^cat";
// RegexOptions exposes additional regular expression options.
// IgnoreCase makes the match case-insensitive.
Regex reg = new Regex(strRegular, RegexOptions.IgnoreCase);
foreach( var str in s)
{
    Match match = reg.Match(str);
    
    if (match.Success)
    {
        Console.WriteLine(str + "    The Regular is  " + strRegular);
    }

}
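To make the effect of the option visible, here is a minimal sketch that runs the same pattern with and without RegexOptions.IgnoreCase. The class name IgnoreCaseDemo, the helper Filter, and the capitalized inputs are assumptions made up for this illustration; they are not part of the original example.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class IgnoreCaseDemo
{
    // Returns the inputs in which the pattern finds a match.
    public static List<string> Filter(IEnumerable<string> inputs, string pattern, RegexOptions options)
    {
        var regex = new Regex(pattern, options);
        var hits = new List<string>();
        foreach (var input in inputs)
        {
            if (regex.IsMatch(input))
            {
                hits.Add(input);
            }
        }
        return hits;
    }

    public static void Main()
    {
        // "CatTheOtherText" and "CATTheOtherText" are hypothetical inputs added for this sketch.
        string[] inputs = { "catTheOtherText", "CatTheOtherText", "CATTheOtherText", "TheOtherTextcat" };

        // Case-sensitive: only the lowercase "cat" prefix matches.
        Console.WriteLine(string.Join(", ", Filter(inputs, "^cat", RegexOptions.None)));
        // prints: catTheOtherText

        // Case-insensitive: all three "cat" prefixes match.
        Console.WriteLine(string.Join(", ", Filter(inputs, "^cat", RegexOptions.IgnoreCase)));
        // prints: catTheOtherText, CatTheOtherText, CATTheOtherText
    }
}
```

Several options can be combined with the | operator, for example RegexOptions.IgnoreCase | RegexOptions.Multiline.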

Metacharacters

^

Start of a line. For example, ^cat selects the lines that begin with cat.

string[] s = { "catTheOtherText", "TheOtherTextcat",
               "TheOtherText", "TheOthercatText", "ThecOtheraTextt" };
string strRegular = "^cat";
// "catTheOtherText"

$

End of a line. For example, cat$ selects the lines that end with cat.

string[] s = { "catTheOtherText", "TheOtherTextcat",
               "TheOtherText", "TheOthercatText", "ThecOtheraTextt" };
string strRegular = "cat$";
// output
// "TheOtherTextcat"

[]

Character class: matches any one of the listed characters. For example, [ab] matches a or b.

string[] s = { "aTheOtherText", "bTheOtherText",
               "abTheOtherText", "TheOtheraText", "b" };
string strRegular = "[ab]";
// output
// aTheOtherText    The Regular is  [ab]
// bTheOtherText    The Regular is  [ab]
// abTheOtherText    The Regular is  [ab]
// TheOtheraText    The Regular is  [ab]
// b    The Regular is  [ab]

[-]

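A hyphen inside [] joins the two ends of an ordered range. For example, [1-3] matches any __one__ digit from 1 to 3. In particular, [0123456789abcdefgABCDEFG] can be written as [0-9a-gA-G] or as [A-G0-9a-g] (the order of the ranges does not matter).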

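Going further

(More on this later.) Inside [], the hyphen has a special meaning (as does a leading ^, covered in the next section), while other metacharacters such as ? lose their special meaning and are matched literally.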

string[] s = { "2TheOtherText", "1TheOtherText",
               "90TheOtherText", "TheOtheraText", "9", "TheOther?Text" };
string strRegular = "[0-9?]";
// 2TheOtherText    The Regular is  [0-9?]
// 1TheOtherText    The Regular is  [0-9?]
// 90TheOtherText    The Regular is  [0-9?]
// 9    The Regular is  [0-9?]
// TheOther?Text    The Regular is  [0-9?]
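The difference between ? inside and outside a character class can be checked directly. The following is a small sketch under that reading; the class name ClassLiteralDemo, the helper Check, and the sample patterns are hypothetical and chosen only for this illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ClassLiteralDemo
{
    // Thin wrapper so each case reads as "input, pattern".
    public static bool Check(string input, string pattern)
    {
        return Regex.IsMatch(input, pattern);
    }

    public static void Main()
    {
        // Inside [], ? is an ordinary character.
        Console.WriteLine(Check("TheOther?Text", "[?]"));  // prints: True
        Console.WriteLine(Check("TheOtherText", "[?]"));   // prints: False

        // Outside [], ? is a quantifier: the s is optional.
        Console.WriteLine(Check("Text", "^Texts?$"));      // prints: True
        Console.WriteLine(Check("Text?", "^Texts?$"));     // prints: False

        // A hyphen only forms a range between two characters;
        // placed first in the class it is matched literally.
        Console.WriteLine(Check("-", "[-0-9]"));           // prints: True
        Console.WriteLine(Check("-", "[0-9]"));            // prints: False
    }
}
```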

[^]

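Negated character class: matches any one character that is not listed. For example, [^a] matches any character other than a, so it selects every line containing at least one character that is not a. Likewise, [^0-9] selects every line except those made up only of the digits 0-9.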

string[] s = { "123456", "Text",
               "?Text", "123456Text", "9" , "?", 
               "Text123456", "T123e45x6t" };
string strRegular = "[^0-9?]";
// Text    The Regular is  [^0-9?]
// ?Text    The Regular is  [^0-9?]
// 123456Text    The Regular is  [^0-9?]
// Text123456    The Regular is  [^0-9?]
// T123e45x6t    The Regular is  [^0-9?]

|

Alternation (or). For example, ab|ba matches ab or ba, much like the | (or) operator in programming.

string[] s = { "abc", "cba",
               "bca", "ab", "a"};
string strRegular = "ab|ba";
// abc    The Regular is  ab|ba
// cba    The Regular is  ab|ba
// ab    The Regular is  ab|ba

()

Wrapping | in () limits its scope, so the alternation applies only to what is inside the parentheses. For example, a(ab|ba) matches aab or aba.

string[] s = { "abc", "cba",
                "bca", "ab", "a", "aab"};
string strRegular = "c(ab|ba)";
// cba    The Regular is  c(ab|ba)

Note that the following three are equivalent:

string strRegular1 = "(Geoff|Jeff)(rey|ery)";
string strRegular2 = "(Geo|Je)ff(rey|ery)";
string strRegular3 = "(Geo|Je)ff(re|er)y";

The following form is also equivalent to the three above (though rather clumsy):

string strOtherWayRegular = "Jeffrey|Geoffery|Jeffery|Geoffrey"; 
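One way to convince yourself of the equivalence is to run all four patterns over the same inputs and compare the answers. This is only a sketch: the class name EquivalenceDemo, the helper AllAgree, and the misspelled negative cases are assumptions added for illustration, and agreeing on a handful of inputs is a spot check rather than a proof.

```csharp
using System;
using System.Text.RegularExpressions;

public static class EquivalenceDemo
{
    public static readonly string[] Patterns =
    {
        "(Geoff|Jeff)(rey|ery)",
        "(Geo|Je)ff(rey|ery)",
        "(Geo|Je)ff(re|er)y",
        "Jeffrey|Geoffery|Jeffery|Geoffrey",
    };

    // True when every pattern gives the same match / no-match answer for the input.
    public static bool AllAgree(string input)
    {
        bool first = Regex.IsMatch(input, Patterns[0]);
        foreach (var pattern in Patterns)
        {
            if (Regex.IsMatch(input, pattern) != first)
            {
                return false;
            }
        }
        return true;
    }

    public static void Main()
    {
        // "Geofrey" and "Jefry" are hypothetical misspellings used as negative cases.
        string[] names = { "Geoffrey", "Geoffery", "Jeffrey", "Jeffery", "Geofrey", "Jefry" };
        foreach (var name in names)
        {
            Console.WriteLine(name
                + "    matches: " + Regex.IsMatch(name, Patterns[0])
                + "    all agree: " + AllAgree(name));
        }
    }
}
```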

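Word boundaries

C# handles this differently from some other tools, so consult the regular expression manual for the language you are using. Some flavors, such as GNU grep, provide \< and \> to match the start and the end of a word. C# only provides \b, which matches a word boundary without distinguishing start from end (which one it is depends on where you place it).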
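Since the position of \b decides whether it acts as a word start or a word end, a short sketch may help. The class name WordBoundaryDemo, the helper HasMatch, and the sample words are hypothetical and chosen only for this illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class WordBoundaryDemo
{
    public static bool HasMatch(string input, string pattern)
    {
        return Regex.IsMatch(input, pattern);
    }

    public static void Main()
    {
        // Hypothetical inputs chosen for this sketch.
        string[] inputs = { "cat", "a cat sat", "concatenate", "catalog", "bobcat" };

        // \bcat   : cat at the start of a word
        // cat\b   : cat at the end of a word
        // \bcat\b : cat as a whole word
        string[] patterns = { @"\bcat", @"cat\b", @"\bcat\b" };

        foreach (var pattern in patterns)
        {
            foreach (var input in inputs)
            {
                if (HasMatch(input, pattern))
                {
                    Console.WriteLine(input + "    The Regular is  " + pattern);
                }
            }
        }
    }
}
```

With these inputs, \bcat selects cat, a cat sat and catalog; cat\b selects cat, a cat sat and bobcat; \bcat\b selects only cat and a cat sat.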

Fuzzy matching (quantifiers)

?

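This part always succeeds, because matching zero occurrences is allowed. It must be preceded by a character, and it means that character occurs __0 or 1 times__. For example, abca? matches abc or abca. "One character" is slightly ambiguous: several characters can be wrapped in parentheses, and the group then counts as a single unit. For example, ab(abc)? matches ababc or ab.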

string[] s = { "abca", "abcaaa",
                "abc", "abcd", "babcd"};
string strRegular = "abca?";
// abca    The Regular is  abca?
// abcaaa    The Regular is  abca?
// abc    The Regular is  abca?
// abcd    The Regular is  abca?
// babcd    The Regular is  abca?

string[] s = { "ababc", "abca",
                "babcdeg", "cbagcd" };
string strRegular = "ab(abc)?";
// ababc    The Regular is  ab(abc)?
// abca    The Regular is  ab(abc)?
// babcdeg    The Regular is  ab(abc)?

+

It must be preceded by a character, and it means that character occurs __1 or more times__. For example, abca+ matches abca or abcaa.

string[] s = { "abca", "abcaaa",
                "abc", "abcd", "babcd"};
string strRegular = "abca+";
// abca    The Regular is  abca+
// abcaaa    The Regular is  abca+

*

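This part always succeeds, because matching zero occurrences is allowed. It must be preceded by a character, and it means that character occurs __0 or more times__. For example, ba* matches b, ba, or baa.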

string[] s = { "abca", "abcaaa",
                "abc", "abcd", "babcd"};
string strRegular = "abca*";
// abca    The Regular is  abca*
// abcaaa    The Regular is  abca*
// abc    The Regular is  abca*
// abcd    The Regular is  abca*
// babcd    The Regular is  abca*

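Going further: ? and *

After the examples above you may be puzzled by these two, since the results look identical. The difference only shows up when the pattern is pinned down on both sides of the quantified character. For example, compare ^aa?b and ^aa*b: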

string[] s = { "ab", "aaaab",
                "aab"};
string strRegular = "^aa?b";
// ab    The Regular is  ^aa?b
// aab    The Regular is  ^aa?b

string[] s = { "ab", "aaaab",
                "aab"};
string strRegular = "^aa*b";
// ab    The Regular is  ^aa*b
// aaaab    The Regular is  ^aa*b
// aab    The Regular is  ^aa*b
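Another way to see the difference, without anchoring the pattern, is to print the text that was actually matched (Match.Value) instead of only checking Match.Success. The sketch below does that; the class name QuantifierDemo and the helper MatchedText are hypothetical names introduced for this illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class QuantifierDemo
{
    // Returns the text consumed by the first match, or a marker when nothing matches.
    public static string MatchedText(string input, string pattern)
    {
        Match match = Regex.Match(input, pattern);
        return match.Success ? match.Value : "<no match>";
    }

    public static void Main()
    {
        // All three patterns succeed on "abcaaa", but they consume different text.
        Console.WriteLine(MatchedText("abcaaa", "abca?"));  // prints: abca
        Console.WriteLine(MatchedText("abcaaa", "abca+"));  // prints: abcaaa
        Console.WriteLine(MatchedText("abcaaa", "abca*"));  // prints: abcaaa

        // On "abcd" the trailing a is absent: ? and * match zero of them, + fails.
        Console.WriteLine(MatchedText("abcd", "abca?"));    // prints: abc
        Console.WriteLine(MatchedText("abcd", "abca+"));    // prints: <no match>
        Console.WriteLine(MatchedText("abcd", "abca*"));    // prints: abc
    }
}
```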

Exercises

[][^]

'q1xText', 'Textq1',
'q2cText', 'Textq2c'

  1. Which of these does q[0-9][^x] match?

  2. Which of these does q[0-9][^x]$ match?

  3. Which of these does ^q[0-9][^x] match?

Answers

  1. q2cText The Regular is q[0-9][^x]
     Textq2c The Regular is q[0-9][^x]

  2. Textq2c The Regular is q[0-9][^x]$

  3. q2cText The Regular is ^q[0-9][^x]
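The answers can be verified by running the three patterns over the four inputs. This is a minimal sketch in the same style as the earlier examples; the class name ExerciseCheck and the helper Matches are hypothetical names introduced here.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class ExerciseCheck
{
    public static readonly string[] Inputs = { "q1xText", "Textq1", "q2cText", "Textq2c" };

    // Returns the exercise inputs in which the pattern finds a match.
    public static List<string> Matches(string pattern)
    {
        var hits = new List<string>();
        foreach (var input in Inputs)
        {
            if (Regex.IsMatch(input, pattern))
            {
                hits.Add(input);
            }
        }
        return hits;
    }

    public static void Main()
    {
        string[] patterns = { "q[0-9][^x]", "q[0-9][^x]$", "^q[0-9][^x]" };
        foreach (var pattern in patterns)
        {
            foreach (var hit in Matches(pattern))
            {
                Console.WriteLine(hit + "    The Regular is  " + pattern);
            }
        }
    }
}
```

Note that Textq1 never matches: [^x] still has to consume one character, and nothing follows q1 in that input.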