<tr><td><font color="#bbbbbb">5587 </font></td><td>изумление</td><td>S</td><td>13.98</td><td>20.65</td>
<td>#N/A</td><td>#N/A</td><td>#N/A</td><td>#N/A</td><td>#N/A</td><td>#N/A</td><td>#N/A</td><td></td></tr>
The code will be:
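import re
rows = re.finditer(r'(\<tr.+?tr\>)', html)  # html: string with the page source
for row in rows:
    # assumed: the cell pattern is applied to the text of the matched row
    cells = re.finditer(r'(\<td.+?td\>)(\<td.+?td\>)(\<td.+?td\>)(\<td.+?td\>)(\<td.+?td\>)(\<td.+?td\>)(\<td.+?td\>)(\<td.+?td\>)(\<td.+?td\>)(\<td.+?td\>)(\<td.+?td\>)(\<td.+?td\>)', row.group(1))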
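As a minimal sketch of the row/cell loop above, the following variant matches one cell at a time instead of hard-coding twelve groups, so it works for rows of any width. The names `page`, `ROW_RE`, `CELL_RE` and `parse_rows` are made up for illustration, and the sample row is shortened. `re.DOTALL` is added on the assumption that a row may span several lines, since `.` does not match a newline by default.

```python
import re

# Hypothetical sample: one table row split over two lines.
page = (
    '<tr><td><font color="#bbbbbb">5587 </font></td><td>изумление</td><td>S</td>\n'
    '<td>13.98</td><td>#N/A</td><td></td></tr>'
)

# Same lazy patterns as in the post, compiled once; DOTALL lets '.' cross line breaks.
ROW_RE = re.compile(r'(\<tr.+?tr\>)', re.DOTALL)
CELL_RE = re.compile(r'(\<td.+?td\>)', re.DOTALL)

def parse_rows(text):
    """Return a list of rows, each a list of raw <td>...</td> strings."""
    table = []
    for row in ROW_RE.finditer(text):
        cells = [cell.group(1) for cell in CELL_RE.finditer(row.group(1))]
        table.append(cells)
    return table

rows = parse_rows(page)
print(len(rows), len(rows[0]))
```

Matching cells one by one trades the fixed group numbers for a plain list, which is easier to handle when the number of columns is not known in advance.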
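Round brackets '(' and ')' define a group; you can iterate over groups or name them.
\ - take the next character literally
.+ - match any non-empty string (any symbols except a newline), greedily: it finds the longest possible match and can take a lot of resources
.+? - match any non-empty string (any symbols except a newline), lazily: it finds the shortest possible match, which is better for parsing constructions like
<tr>...</tr>..<tr>...</tr>
'(\<tr.+tr\>)' - greedy: one match from the first <tr to the last tr>
'(\<tr.+?tr\>)' - lazy: a separate match for each <tr>...</tr>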
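The difference between the greedy and the lazy pattern can be checked on a small input. This is only a sketch: the two-row `snippet` and the group name `cell` are invented for the example.

```python
import re

# Hypothetical snippet with two rows on one line.
snippet = '<tr><td>a</td></tr><tr><td>b</td></tr>'

# Greedy: '.+' runs to the last 'tr>', so both rows end up in one match.
greedy = re.findall(r'(\<tr.+tr\>)', snippet)

# Lazy: '.+?' stops at the first 'tr>', so each row is matched separately.
lazy = re.findall(r'(\<tr.+?tr\>)', snippet)

print(len(greedy), len(lazy))

# A named group: (?P<name>...) lets you read the group by name instead of by number.
named = re.search(r'\<td\>(?P<cell>.+?)\</td\>', snippet)
print(named.group('cell'))
```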
Primitive function to remove html tags:
def remove_tags(html):
    pattern = re.compile('<.*?>')
    result = pattern.sub("", html)
    return result
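A possible way to use it on a single cell is sketched below. The function is repeated so that the example runs on its own (with the parameter renamed to `markup`, a name chosen here to avoid clashing with the standard `html` module), and the sample strings are made up. Note that the pattern only removes tags: character entities such as `&amp;` are left in place, so `html.unescape` from the standard library is applied as an extra step, and a tag that spans several lines is not removed because `.` does not match a newline.

```python
import re
from html import unescape

TAG_RE = re.compile('<.*?>')

def remove_tags(markup):
    """Same idea as remove_tags above: drop everything between '<' and '>'."""
    return TAG_RE.sub("", markup)

# Hypothetical cell taken from a table row.
cell = '<td><font color="#bbbbbb">5587 </font></td>'
text = remove_tags(cell).strip()
print(text)

# Entities survive tag removal, so decode them separately.
decoded = unescape(remove_tags('<td>a &amp; b</td>'))
print(decoded)
```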
Find a string in braces whose contents do not include a given symbol (here '}'; [^}] means any symbol except '}'):
re.finditer(r'({[^}]+})', text)
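A short sketch of this pattern in action, with an invented input string `text`. Because `+` needs at least one symbol, an empty pair of braces is not matched.

```python
import re

# Hypothetical input with three brace pairs, the last one empty.
text = 'a {first} b {second one} c {}'

# [^}]+ takes one or more symbols that are not '}', so each match ends at the nearest '}'.
matches = [m.group(1) for m in re.finditer(r'({[^}]+})', text)]
print(matches)
```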
