Python Regular expression

A regular expression is a special sequence of characters that helps you match or find other strings or sets of strings, using a specialized syntax held in a pattern.

To avoid confusion over backslashes when working with regular expressions, write patterns as raw strings: r'expression'.
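As a minimal sketch of why the raw-string prefix matters (the sample text and pattern are made up for illustration): in an ordinary string literal, \b is the backspace character, so the regex engine never sees a word boundary.

```python
import re

text = "a cat sat"  # hypothetical sample text

# Without the r prefix, "\b" is the backspace character (\x08),
# so this pattern looks for backspace + "cat" + backspace.
plain = re.findall("\bcat\b", text)

# With a raw string, the backslashes reach the regex engine unchanged
# and \b means "word boundary".
raw = re.findall(r"\bcat\b", text)

print(plain)  # []
print(raw)    # ['cat']
```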

For anything not covered here, see the official documentation.

1. Basic patterns that match single chars

Expression    Matches
a, X, 9, <    ordinary characters match themselves exactly
.             any single character except newline '\n'; equivalent to [^\n]
\w            a "word" character: a letter, digit, or underscore; [a-zA-Z0-9_] for ASCII text
\W            any non-word character
\b            the boundary between a word and a non-word character
\s            a single whitespace character (space, newline, return, tab, form feed, vertical tab); [ \t\n\r\f\v] for ASCII text
\S            any non-whitespace character
\t, \n, \r    tab, newline, carriage return
\d            a decimal digit; [0-9] for ASCII text
^             the start of the string
$             the end of the string (or just before a trailing newline)
\             escapes the next character, removing its special meaning
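The patterns in the table can be tried out with a short sketch like the following. The sample strings are invented for illustration only.

```python
import re

text = "Room 42, floor 7\n"  # hypothetical sample text

digits = re.findall(r"\d", text)          # every decimal digit
words = re.findall(r"\b\w+\b", text)      # runs of word characters between boundaries
spaces = re.findall(r"\s", text)          # every whitespace character
any_char = re.findall(r".", "a\nb")       # . skips the newline
starts_with_room = re.match(r"^Room", text) is not None
literal_dot = re.findall(r"\.", "v1.2")   # escaped dot matches only a real "."

print(digits)       # ['4', '2', '7']
print(words)        # ['Room', '42', 'floor', '7']
print(spaces)       # [' ', ' ', ' ', '\n']
print(any_char)     # ['a', 'b']
print(literal_dot)  # ['.']
```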

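More special elements

Pattern    Description
[...]      matches any one character from the set
[^...]     matches any one character not in the set
*          matches 0 or more repetitions of the preceding expression
+          matches 1 or more repetitions of the preceding expression
?          matches 0 or 1 repetition of the preceding expression; placed after another quantifier, it makes that quantifier non-greedy
{n}        matches exactly n repetitions of the preceding expression
{n,m}      matches from n to m repetitions of the preceding expression
a|b        matches a or b
()         matches the expression inside the parentheses and also marks a group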
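A brief sketch of these elements in use; every sample string here is an assumption chosen to make the behaviour visible.

```python
import re

# Character sets and negated sets
vowels = re.findall(r"[aeiou]", "regex")
non_digits = re.findall(r"[^0-9]", "a1b2")

# Quantifiers apply to the expression immediately before them
star = re.findall(r"ab*", "a ab abbb")            # b repeated 0 or more times
plus = re.findall(r"ab+", "a ab abbb")            # b repeated 1 or more times
optional = re.findall(r"colou?r", "color colour") # u present 0 or 1 time
exact = re.findall(r"\d{3}", "12 345 6789")       # exactly 3 digits
ranged = re.findall(r"\d{2,3}", "1 22 333 4444")  # 2 to 3 digits

# Alternation and groups
either = re.findall(r"cat|dog", "cat dog bird")
pairs = re.findall(r"(\d+)-(\d+)", "10-20 30-40")  # one tuple per match

print(vowels, non_digits)  # ['e', 'e'] ['a', 'b']
print(star, plus)          # ['a', 'ab', 'abbb'] ['ab', 'abbb']
print(optional)            # ['color', 'colour']
print(exact, ranged)       # ['345', '678'] ['22', '333', '444']
print(either, pairs)       # ['cat', 'dog'] [('10', '20'), ('30', '40')]
```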

2. Compilation flags

Compilation flags let you modify some aspects of how regular expressions work. Flags are available in the re module under two names, a long name such as IGNORECASE and a short, one-letter form such as I.

Flag                    Meaning
ASCII, A (re.A)         Makes several escapes like \w, \b, \s and \d match only ASCII characters with the respective property
DOTALL, S (re.S)        Makes . match any character, including newlines
IGNORECASE, I (re.I)    Does case-insensitive matching
LOCALE, L (re.L)        Does locale-aware matching
MULTILINE, M (re.M)     Multi-line matching, affecting ^ and $
VERBOSE, X (re.X)       Enables verbose REs, which can be organized more cleanly and understandably (X stands for "extended")
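To make three of the less obvious flags concrete, here is a small sketch of re.M, re.X and re.A. The sample text and the date pattern are illustrative assumptions.

```python
import re

text = "first line\nsecond line"  # hypothetical sample text

# Without MULTILINE, ^ matches only at the start of the whole string.
default_starts = re.findall(r"^\w+", text)
# With MULTILINE, ^ also matches right after each newline.
multiline_starts = re.findall(r"^\w+", text, re.M)

# VERBOSE ignores whitespace in the pattern and allows # comments.
date_re = re.compile(r"""
    (\d{4})   # year
    -
    (\d{2})   # month
    -
    (\d{2})   # day
""", re.X)
parts = date_re.search("released on 2024-01-15").groups()

# ASCII restricts \w to [a-zA-Z0-9_], so accented letters split the words.
ascii_words = re.findall(r"\w+", "naïve café", re.A)

print(default_starts)    # ['first']
print(multiline_starts)  # ['first', 'second']
print(parts)             # ('2024', '01', '15')
print(ascii_words)       # ['na', 've', 'caf']
```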

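Note: the flag used most often here is re.S. It makes . match every character, including newlines, which is frequently needed when matching web pages because HTML nodes often contain line breaks.

re.I makes matching case-insensitive.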
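The effect of both flags can be seen in a short sketch. The HTML snippet below is a made-up example, not taken from a real page.

```python
import re

html = "<div>\nhello\n</div>"  # hypothetical HTML snippet with line breaks

# Without re.S, . cannot cross the newlines, so nothing matches.
no_flag = re.search(r"<div>(.*?)</div>", html)

# With re.S, . also matches "\n" and the node content is captured.
with_s = re.search(r"<div>(.*?)</div>", html, re.S)
inner = with_s.group(1)

# With re.I, letter case is ignored.
case_insensitive = re.search(r"HELLO", html, re.I).group()

print(no_flag)           # None
print(repr(inner))       # '\nhello\n'
print(case_insensitive)  # hello
```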

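3. The regex functions

re.match(pattern, string, flags=0) attempts to match the RE pattern against string, with optional flags.

match checks for a match only at the beginning of the string.

re.match tries to match a pattern starting at the beginning of the string. If the pattern does not match there, match() returns None; if it does, match() returns a match object. It is therefore typically used to check whether a string conforms to a given regular expression.

search checks for a match anywhere in the string.

re.search scans the whole string and returns the first successful match.

findall() returns all matches as a list.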
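The difference between the three functions is easiest to see side by side. This is a minimal sketch; the sample text is invented.

```python
import re

text = "tel: 555-1234, fax: 555-9876"  # hypothetical sample text
number = r"\d{3}-\d{4}"

# match only looks at the beginning of the string.
at_start = re.match(number, text)   # the string starts with "tel", so None
prefix = re.match(r"tel", text)     # a match object

# search scans the string and stops at the first match.
first = re.search(number, text).group()

# findall collects every non-overlapping match into a list.
everything = re.findall(number, text)

print(at_start)        # None
print(prefix.group())  # tel
print(first)           # 555-1234
print(everything)      # ['555-1234', '555-9876']
```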

The parameters are described below:

Parameter    Description
pattern      the regular expression to be matched
string       the string to be searched for the pattern
flags        optional flags; combine several with bitwise OR (|)
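As a sketch of combining flags with bitwise OR, the following passes re.I and re.S together. The text and pattern are assumptions chosen so that each flag alone is not enough.

```python
import re

text = "Hello\nWORLD"  # hypothetical sample text

# re.I handles the case difference, re.S lets . match the newline.
both = re.search(r"hello.world", text, re.I | re.S)

# With only one of the two flags, the pattern fails.
only_i = re.search(r"hello.world", text, re.I)
only_s = re.search(r"hello.world", text, re.S)

print(repr(both.group()))  # 'Hello\nWORLD'
print(only_i, only_s)      # None None
```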

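As mentioned earlier, you can wrap the substring you want to extract in parentheses (). The parentheses mark the start and end of a subexpression, and each subexpression corresponds to a group, numbered in order. Calling group() with a group's index returns the extracted text.

group(1) returns the text matched by the first pair of parentheses, while group() (the same as group(0)) returns the complete match.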
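A minimal sketch of extracting a group follows. The sample string and pattern are assumptions for illustration, not the article's original example.

```python
import re

content = "Hello 1234567 World_This is a Regex Demo"  # hypothetical sample text

# The parentheses around \d+ define group 1.
result = re.match(r"^Hello\s(\d+)\sWorld", content)

whole = result.group()    # same as result.group(0): the complete match
number = result.group(1)  # text captured by the first pair of parentheses
position = result.span()  # (start, end) of the complete match

print(whole)     # Hello 1234567 World
print(number)    # 1234567
print(position)  # (0, 19)
```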

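re.sub(pattern, repl, string, count=0, flags=0) replaces the matches found in a string.

Note: repl is the replacement string; it can also be a function.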
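Both forms of repl can be sketched as follows. The sample text and the helper name double are hypothetical.

```python
import re

text = "item1 costs 10, item2 costs 25"  # hypothetical sample text

# repl as a string: every run of digits becomes "#".
masked = re.sub(r"\d+", "#", text)

# count limits how many replacements are made.
masked_once = re.sub(r"\d+", "#", text, count=1)


def double(match):
    """Hypothetical helper: receives a match object, returns the new text."""
    return str(int(match.group()) * 2)


# repl as a function: only standalone numbers are doubled,
# because \b prevents a match inside "item1" and "item2".
doubled = re.sub(r"\b\d+\b", double, text)

print(masked)       # item# costs #, item# costs #
print(masked_once)  # item# costs 10, item2 costs 25
print(doubled)      # item1 costs 20, item2 costs 50
```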

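4. Greedy vs. non-greedy matching

.* matches as many characters as possible, while .*? matches as few characters as possible.

Note, however, that if the part you want to match is at the end of the string, .*? may match nothing at all, because it matches as few characters as possible.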
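Both behaviours, including the end-of-string pitfall, can be reproduced with a short sketch. The strings and the URL are made up for illustration.

```python
import re

content = "Hello 1234567 World"  # hypothetical sample text

# Greedy: .* swallows as much as it can and leaves \d+ a single digit.
greedy = re.match(r"^He.*(\d+).*World$", content).group(1)

# Non-greedy: .*? stops as early as possible, so \d+ gets the whole number.
lazy = re.match(r"^He.*?(\d+).*World$", content).group(1)

url = "http://example.com/comment/abc123"  # hypothetical URL

# At the end of the pattern, .*? is satisfied with zero characters.
lazy_tail = re.match(r"http.*?comment/(.*?)", url).group(1)
greedy_tail = re.match(r"http.*?comment/(.*)", url).group(1)

print(greedy)            # 7
print(lazy)              # 1234567
print(repr(lazy_tail))   # ''
print(greedy_tail)       # abc123
```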

5. The (?P<name>...) syntax

Similar to regular parentheses, but the substring matched by the group is accessible via the symbolic group name name. Group names must be valid Python identifiers, and each group name must be defined only once within a regular expression. A symbolic group is also a numbered group, just as if the group were not named.

(?P=name): A backreference to a named group; it matches whatever text was matched by the earlier group named name.

e.g.: (?P<quote>['"]).*?(?P=quote) matches a string quoted with either single or double quotes.
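The quote pattern above can be tried out with a minimal sketch like this one. The sample texts and the year/month group names are assumptions added for illustration.

```python
import re

# The backreference (?P=quote) must match the same quote character
# that the named group captured.
quoted = re.compile(r"""(?P<quote>['"]).*?(?P=quote)""")
text = "say \"hello\" or 'bye'"  # hypothetical sample text
matches = [m.group() for m in quoted.finditer(text)]

# A named group is also a numbered group.
m = re.search(r"(?P<year>\d{4})-(?P<month>\d{2})", "date: 2024-01")
by_name = m.group("year")
by_number = m.group(1)
as_dict = m.groupdict()

print(matches)             # ['"hello"', "'bye'"]
print(by_name, by_number)  # 2024 2024
print(as_dict)             # {'year': '2024', 'month': '01'}
```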
