Regular Expressions

Common Characters

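Character	Meaning
\d	A digit
\w	A word character (letter, digit, or underscore)
.	Any single character (except a newline)
*	Zero or more of the preceding expression
+	One or more of the preceding expression
?	Zero or one of the preceding expression
{n}	Exactly n repetitions of the preceding expression
{n,m}	Between n and m repetitions of the preceding expression
\s	A whitespace character (space, tab, newline, and so on)
\	Escape character
|	Alternation (or)
^	Start of the string (or of each line with re.M); ^\d means "starts with a digit"
$	End of the string (or of each line with re.M); \d$ means "ends with a digit"
[0-9a-zA-Z_]	One digit, letter, or underscore
[0-9a-zA-Z_]+	One or more of them, such as `'a100'`, `'0_Z'`, `'Py3000'`
\A	Start of the string
\Z	End of the string
.*	Greedy match
.*?	Non-greedy match (at the end of a pattern it may match nothing)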
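The table entries can be tried out directly with the re module. The following is a minimal sketch; the sample strings and variable names are made up for illustration.

```python
import re

digits = re.findall(r'\d+', 'tel: 010-12345')       # runs of digits
words = re.findall(r'\w+', 'Py3000, a_b!')           # runs of word characters
either = re.findall(r'cat|dog', 'cat, dog, bird')    # alternation with |
optional = re.findall(r'colou?r', 'color colour')    # ? makes the "u" optional

# {n} and {n,m} count repetitions; ^ and $ pin the match to both ends.
three = re.match(r'^\d{3}$', '345') is not None       # exactly three digits
four = re.match(r'^\d{3}$', '3456') is not None       # one digit too many
ranged = re.match(r'^\d{2,3}$', '34') is not None     # two or three digits

print(digits)               # ['010', '12345']
print(words)                # ['Py3000', 'a_b']
print(either)               # ['cat', 'dog']
print(optional)             # ['color', 'colour']
print(three, four, ranged)  # True False True
```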

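The underscore is not a special character in regular expressions; escaping it as \_ is harmless but unnecessary.

An uppercase shorthand class means the opposite of its lowercase form (\D is a non-digit, \W a non-word character, \S a non-whitespace character).

Writing a pattern as a raw string, r'…', keeps Python from interpreting its backslashes as string escapes.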
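A short sketch of the last two notes, using made-up sample strings: a raw string spares you from doubling every backslash, and the uppercase classes match the complement of the lowercase ones.

```python
import re

# Without the r prefix, every backslash in the pattern has to be doubled.
plain = '\\d+\\.\\d+'
raw = r'\d+\.\d+'
same = plain == raw  # both spell the same pattern

price = re.search(raw, 'total: 3.14 USD').group()

# \D is the opposite of \d, and \S is the opposite of \s.
non_digits = re.findall(r'\D+', 'a1b22c')
non_spaces = re.findall(r'\S+', 'two  words')

print(same)        # True
print(price)       # 3.14
print(non_digits)  # ['a', 'b', 'c']
print(non_spaces)  # ['two', 'words']
```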

The re Library

match()

Matches from the beginning of the string. Returns a Match object on success and None on failure.

re.match(pattern, string, [flags])

import re

content = 'hello, 122323 world _ this is a regex demo'
result = re.match(r'^he.*(\d+).*demo$', content)
print(result)           # a Match object
print(result.group(1))  # 3, because the greedy .* consumes as many characters as it can

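content = 'http://weibo.com/comment/KERGCN'
result1 = re.match(r'http.*?comment(.*?)', content)  # a trailing .*? matches as little as possible
result2 = re.match(r'http.*?/(.*?)/KERGCN', content)
print(result1.group(1))  # empty string: nothing forces the trailing .*? to consume anything
print(result2.group(1))  # /weibo.com/comment (the first / after http is the one in ://)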
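Because match() returns None on failure, calling .group() on the result without checking it raises AttributeError. One way to guard against that is sketched below; the helper name leading_number is hypothetical.

```python
import re

def leading_number(text):
    """Return the integer at the start of text, or None if there is none."""
    m = re.match(r'\d+', text)
    return int(m.group()) if m else None

first = leading_number('42 apples')   # the string starts with digits
second = leading_number('apples 42')  # match() only looks at the start

print(first)   # 42
print(second)  # None
```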

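# Save the leading number of each line in no_info.txt.
with open('./no_info.txt', 'r') as f:
    for line in f:
        m = re.match(r'\d+', line)
        if m:  # skip lines that do not start with a digit
            db_clawer.saveToNoInfo(int(m.group()))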

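Flags

Flag	Effect
re.I	Case-insensitive matching
re.L	Locale-dependent matching (bytes patterns only)
re.M	Multi-line mode: ^ and $ also match at the start and end of each line
re.S	Makes . match any character, including a newline
re.U	Unicode matching (the default for str patterns in Python 3)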
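The effect of the most common flags can be seen on a two-line string. This is an illustrative sketch; the sample text and variable names are assumptions.

```python
import re

text = 'Hello\nworld'

case_sensitive = re.match(r'hello', text)          # None: "H" is not "h"
case_insensitive = re.match(r'hello', text, re.I)  # matches "Hello"

no_dotall = re.match(r'Hello.world', text)         # None: . does not match "\n"
dotall = re.match(r'Hello.world', text, re.S)      # . now matches the newline

single_line = re.findall(r'^\w+', text)            # ^ only at the string start
multi_line = re.findall(r'^\w+', text, re.M)       # ^ at the start of each line

# Flags are combined with the | operator.
combined = re.match(r'hello.WORLD', text, re.I | re.S)

print(single_line)           # ['Hello']
print(multi_line)            # ['Hello', 'world']
print(combined is not None)  # True
```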

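search()

Scans the string for a matching substring and returns the first match, or None if there is none.

Note that most HTML text contains many newlines, so add the re.S flag wherever . has to cross a line break; otherwise the pattern may fail to match.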
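The HTML caveat is easy to reproduce. In this sketch the HTML fragment is invented for illustration: the same pattern fails without re.S and succeeds with it, and match() fails either way because the fragment does not start with the pattern.

```python
import re

html = '<div>\n  <p class="title">Regex\nDemo</p>\n</div>'
pattern = r'<p class="title">(.*?)</p>'

at_start = re.match(pattern, html)          # None: the string starts with <div>
without_s = re.search(pattern, html)        # None: . cannot cross the "\n"
with_s = re.search(pattern, html, re.S)     # found: . also matches "\n"

print(at_start)         # None
print(without_s)        # None
print(with_s.group(1))  # prints "Regex" and "Demo" on two lines
```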

compile()

Compiles a regular expression that is used repeatedly into a reusable pattern object.

pattern = re.compile(regex)
pattern.search('....')
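A compiled pattern exposes the same methods as the module-level functions, minus the pattern argument. A minimal sketch with made-up input lines:

```python
import re

# Compile once, then reuse the object inside the loop.
date_pattern = re.compile(r'(\d{4})-(\d{2})-(\d{2})')

lines = ['released 2019-03-25', 'no date here', 'patched 2020-11-02']
found = []
for line in lines:
    m = date_pattern.search(line)
    if m:  # search() returns None for lines without a date
        found.append(m.groups())

print(found)  # [('2019', '03', '25'), ('2020', '11', '02')]
```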

Other Methods

re.findall()

re.findall(pattern, string, flags=0)  # returns a list
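What the returned list contains depends on the number of groups in the pattern: whole matches when there are none, and tuples when there are several. A small sketch with an invented sample string:

```python
import re

text = 'alice=30, bob=25'

names = re.findall(r'[a-z]+', text)       # no groups: list of whole matches
pairs = re.findall(r'(\w+)=(\d+)', text)  # two groups: list of tuples
nothing = re.findall(r'\d{5}', text)      # no match: empty list, not None

print(names)    # ['alice', 'bob']
print(pairs)    # [('alice', '30'), ('bob', '25')]
print(nothing)  # []
```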

re.split()

re.split(pattern, string, maxsplit=0, flags=0)
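A brief sketch of the split parameters, using a made-up input string. maxsplit caps the number of splits, and a capturing group in the pattern keeps the separators in the result.

```python
import re

parts = re.split(r'[,;]\s*', 'a, b;c,  d')
limited = re.split(r'[,;]\s*', 'a, b;c,  d', maxsplit=1)  # split only once
with_seps = re.split(r'([,;])\s*', 'a, b;c')              # separators kept

print(parts)      # ['a', 'b', 'c', 'd']
print(limited)    # ['a', 'b;c,  d']
print(with_seps)  # ['a', ',', 'b', ';', 'c']
```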

re.finditer()

results = re.finditer(pattern, string, flags=0)  # an iterator over the Match objects
for result in results:
    print(result)
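Unlike findall(), finditer() yields Match objects, so the position of each hit is available as well. A minimal sketch with an invented sample string:

```python
import re

text = 'x=1, y=22, z=333'

# Collect the matched text together with its start and end offsets.
spans = [(m.group(), m.start(), m.end()) for m in re.finditer(r'\d+', text)]

print(spans)  # [('1', 2, 3), ('22', 7, 9), ('333', 13, 16)]
```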

re.sub()

re.sub(pattern, repl, string, count=0, flags=0)
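A sketch of the common ways to use re.sub(), with made-up inputs: repl may be a plain string, a string with backreferences such as \1, or a function that receives each Match object.

```python
import re

masked = re.sub(r'\d', '*', 'tel 555-1234')               # replace every digit
first_only = re.sub(r'\d', '*', 'tel 555-1234', count=1)  # replace only the first
swapped = re.sub(r'(\w+)@(\w+)', r'\2@\1', 'user@host')   # backreferences

# A function as repl computes each replacement from the match.
doubled = re.sub(r'\d+', lambda m: str(int(m.group()) * 2), '3 apples, 10 pears')

print(masked)      # tel ***-****
print(first_only)  # tel *55-1234
print(swapped)     # host@user
print(doubled)     # 6 apples, 20 pears
```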

Groups

Use parentheses ().

result.group(0)  # the entire matched text
result.group(n)  # the n-th captured substring
result.groups()  # a tuple of all captured substrings
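The group accessors can be checked on a small example. In this sketch the input is invented; note that group(0) is the matched text, which here is shorter than the input string.

```python
import re

m = re.match(r'(\d{3})-(\d{4})', '555-1234 ext. 9')

whole = m.group(0)    # the text the pattern matched, not the whole input
area = m.group(1)     # first pair of parentheses
number = m.group(2)   # second pair of parentheses
everything = m.groups()

print(whole)       # 555-1234
print(area)        # 555
print(number)      # 1234
print(everything)  # ('555', '1234')
```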

Greedy Matching

By default, quantifiers match as much as possible.

Non-Greedy Matching

Append a ? to the quantifier:

\d+?

.*?
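The difference shows up as soon as the text contains more than one possible end point. A minimal sketch with made-up input:

```python
import re

html = '<b>one</b> and <b>two</b>'

greedy = re.findall(r'<b>(.*)</b>', html)  # .* runs to the last </b>
lazy = re.findall(r'<b>(.*?)</b>', html)   # .*? stops at the first </b>

most = re.match(r'\d+', '12345').group()    # as many digits as possible
least = re.match(r'\d+?', '12345').group()  # as few digits as possible

print(greedy)  # ['one</b> and <b>two']
print(lazy)    # ['one', 'two']
print(most)    # 12345
print(least)   # 1
```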

Examples

>>> 'a b   c'.split(' ')
['a', 'b', '', '', 'c']

>>> re.split(r'[\s\,]+', 'a,b, c d')
['a', 'b', 'c', 'd']

>>> re.split(r'[\s\,\;]+', 'a,b;; c d')
['a', 'b', 'c', 'd']

>>> re.match(r'^(\d+)(0*)$', '102300').groups()
('102300', '')

>>> re.match(r'^(\d+?)(0*)$', '102300').groups()
('1023', '00')