Monday, November 8, 2010

Python Regular Expressions

For example, suppose we want to parse HTML like this with a regular expression:
 <tr><td><font color="#bbbbbb">5587  </font></td><td>изумление</td><td>S</td><td>13.98</td><td>20.65</td>
 <td>#N/A</td><td>#N/A</td><td>#N/A</td><td>#N/A</td><td>#N/A</td><td>#N/A</td><td>#N/A</td><td></td></tr>


The code will be:

import re
rows = re.finditer(r'(<tr.+?tr>)', html)  # non-greedy: unlike .+, it matches the shortest possible string
for row in rows:
    # twelve consecutive <td>...</td> cells, one capture group each
    cells = re.finditer(r'(<td.+?td>)' * 12, row.group(1))
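
The loop above can be tried end to end with a small input. The following is only a sketch under assumptions: the two-row `html` string is a made-up, shortened sample, and `re.findall` with one capture group is used in place of the twelve-group pattern so that the number of cells does not have to be known in advance.

```python
import re

# Hypothetical sample: two short table rows (shortened for the example)
html = (
    '<tr><td>5587</td><td>word</td><td>S</td></tr>'
    '<tr><td>5588</td><td>other</td><td>V</td></tr>'
)

table = []
# re.DOTALL lets '.' also match newlines, in case a row spans several lines
for row in re.finditer(r'<tr.+?tr>', html, re.DOTALL):
    # Capture only the text between <td> and </td> for every cell in the row
    cells = re.findall(r'<td>(.*?)</td>', row.group(0), re.DOTALL)
    table.append(cells)

print(table)
```

Each element of `table` is the list of cell contents for one row. This variant assumes plain `<td>` tags without attributes.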


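Parentheses '(' and ')' define a group; groups can be accessed by index or given names.
\ - escapes the next character so that it is matched literally (a literal < or > does not need it in Python's re)
.+ - matches any string of one or more characters (except a newline, by default); it is greedy, so it matches the longest possible string and can consume a lot of resources
.+? - matches any string of one or more characters, but non-greedy (the shortest possible match); the better choice for parsing constructions like
<tr>...</tr>..<tr>...</tr>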
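
As an illustration of the note on groups, here is a minimal sketch of reading groups by index and by name. The `row` string and the group names `rank` and `word` are invented for the example.

```python
import re

# Hypothetical input: a row with a number cell and a word cell
row = '<tr><td>5587</td><td>word</td></tr>'

# (?P<name>...) gives a group a name; plain (...) groups are numbered from 1
m = re.search(r'<td>(?P<rank>\d+)</td><td>(?P<word>.+?)</td>', row)

by_index = (m.group(1), m.group(2))           # access by position
by_name = (m.group('rank'), m.group('word'))  # access by name
as_dict = m.groupdict()                       # all named groups as a dict

print(by_index, by_name, as_dict)
```

Named groups are still numbered, so both ways of access return the same strings.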







r'(<tr.+tr>)' finds only one match: {<tr>...</tr>..<tr>...</tr>}
r'(<tr.+?tr>)' finds two matches: {<tr>...</tr>}, {<tr>...</tr>}
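
The difference can be checked directly. This is a small sketch; the `html` string is a made-up input with two rows.

```python
import re

# Hypothetical input with two rows separated by '..'
html = '<tr>a</tr>..<tr>b</tr>'

greedy = re.findall(r'(<tr.+tr>)', html)   # longest possible match
lazy = re.findall(r'(<tr.+?tr>)', html)    # shortest possible matches

print(greedy)  # ['<tr>a</tr>..<tr>b</tr>']
print(lazy)    # ['<tr>a</tr>', '<tr>b</tr>']
```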



Primitive function to remove HTML tags:
import re

def remove_tags(html):
    pattern = re.compile(r'<.*?>')  # non-greedy: each tag is matched separately
    result = pattern.sub("", html)
    return result
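
A possible usage sketch, applied to the first cell of the sample row. The function is repeated in a shorter form so the snippet runs on its own, and the variable names `cell` and `text` are just for illustration. Note that `.` does not match a newline by default, so a tag split across lines would be left in place.

```python
import re

def remove_tags(html):
    # Non-greedy so that each tag is removed on its own
    return re.sub(r'<.*?>', '', html)

cell = '<td><font color="#bbbbbb">5587  </font></td>'
text = remove_tags(cell).strip()  # strip() drops the trailing spaces left in the cell

print(text)  # 5587
```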

Find a string in braces whose contents do not contain a given symbol (e.g. '}'):
re.finditer(r'(\{[^}]+\})', text)
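
A short sketch of how this pattern behaves; the `text` value is an invented input. The class `[^}]` matches any character except `}`, and `+` requires at least one such character, so empty braces are skipped.

```python
import re

# Hypothetical input with two non-empty brace groups and one empty pair
text = 'a {first} b {second item} c {}'

found = [m.group(1) for m in re.finditer(r'(\{[^}]+\})', text)]

print(found)  # ['{first}', '{second item}']
```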
