python - Scraping poorly formatted HTML with BeautifulSoup
Update: I found this post while throwing spaghetti at walls and came up with this, which totally works in a loop. The CSV isn't beautiful, but it can be adapted.
    import csv

    data = []
    table = soup.find('table', border=6)
    rows = table.findAll('tr')
    for row in rows:
        cols = row.findAll('td')
        cells = [ele.text.strip() for ele in cols]
        data = [ele for ele in cells if ele]  # get rid of empty values
        #print data
        record = data
        # 'ab' is the Python 2 idiom; on Python 3 use open(..., 'a', newline='')
        writer = csv.writer(open('cpms10.csv', 'ab'))
        writer.writerow(record)
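The update code reopens the CSV for every row it writes. Here is a minimal sketch of the same idea that opens the file once. It assumes Python 3 and that each row has already been reduced to a list of strings (for example the `cells` list above). `append_rows` is a made-up helper name, not part of `csv` or BeautifulSoup.

```python
import csv


def append_rows(path, rows):
    """Append each non-empty row to the CSV at `path`, opening the file once."""
    written = 0
    # newline='' lets the csv module control line endings on Python 3
    with open(path, "a", newline="") as handle:
        writer = csv.writer(handle)
        for row in rows:
            cells = [cell.strip() for cell in row]
            cells = [cell for cell in cells if cell]  # drop empty values, as in the update
            if cells:  # skip rows that were nothing but padding
                writer.writerow(cells)
                written += 1
    return written
```

Note that dropping empty cells also drops empty values, so a label with no value ends up as a one-column row. That is the same behavior as the update code, not a fix for it.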
I'm trying to scrape data from a series of pages like this one with BeautifulSoup. I want the data from the right side of each page, in the proper order, with column headings starting with year.
I've been using this, but it doesn't get the actual year because there's padding in the first row, and it stops after the first section, when I want all four:
    table = soup.find('table', border=6)
    data = {}
    for row in table.findAll('tr')[2:]:
        cells = row.findAll('td')
        key = cells[0].text.strip()
        value = cells[1].text.strip()
        data[key] = value
        record = (key, value)
        writer = csv.writer(open('cpms.csv', 'ab'))
        writer.writerow(record)
I've tried adding , {'height': '19'} and , 'font' after findAll('td') to narrow down the selection, but that doesn't work.
This is the HTML for the first section of the table, though if you look at the whole page, there's an earlier table, and the td tags never close out until the end of the document.
Any ideas/assistance greatly appreciated!
    <table width=845 border=6 cellpadding=0 cellspacing=0 bgcolor=#c0c0c0>
    <tr><td height=28 width=19 valign=top bgcolor=#336699> </td> <td valign=middle colspan=3 bgcolor=#336699><font color=white size=3><b>10-1101 - department of transportation - dept code:a101101 - class code:01101</b></font></td></tr>
    <tr><td colspan=4 height=18 valign=top> </td></tr>
    <tr><td valign=top rowspan=54> </td> <td valign=top height=18 width=255 align=left bgcolor=#ffffb4><font size=2><b>year</b></td> <td rowspan=54 width=5 valign=top><img src='images/spacer.gif' width=10></td> <td valign=top align='right' style='{padding-right:150px}' bgcolor=#ffffb4><font size=2>2010</td></tr>
    <tr><td height=18 valign=top align=left bgcolor=#ffffb4><font size=2><b>appropriation title</b></td> <td valign=top bgcolor=#ffffb4><font size=2>dona ana co east mesa area roads & drainage</td></tr>
    <tr><td height=18 valign=top align=left bgcolor=#ffffb4><font size=2><b>fund code</b></td> <td valign=top align='right' style='{padding-right:150px}' bgcolor=#ffffb4><font size=2>severance tax bonds</td></tr>
    <tr><td height=18 valign=top align=left bgcolor=#ffffb4><font size=2><b>eo 2013-006 eligibility</b></td> <td valign=top align='right' style='{padding-right:150px}' bgcolor=#ffffb4><font size=2></td></tr>
    <tr><td height=18 valign=top align=left bgcolor=#ffffb4><font size=2><b>bond sale date</b></td> <td valign=top align='right' style='{padding-right:150px}' bgcolor=#ffffb4><font size=2>***</td></tr>
    <tr><td height=18 valign=top align=left bgcolor=#ffffb4><font size=2><b>bond series number</b></td> <td valign=top align='right' style='{padding-right:150px}' bgcolor=#ffffb4><font size=2></td></tr>
    <tr><td height=18 valign=top align=left bgcolor=#ffffb4><font size=2><b>amount of bond sale</b></td> <td valign=top align='right' style='{padding-right:150px}' bgcolor=#ffffb4><font size=2>$0 </td></tr>
    <tr><td height=18 valign=top align=left bgcolor=#ffffb4><font size=2><b>category</b></td> <td valign=top align='right' style='{padding-right:150px}' bgcolor=#ffffb4><font size=2></td></tr>
    <tr><td height=18 valign=top align=left bgcolor=#ffffb4><font size=2><b>subcategory</b></td> <td valign=top align='right' style='{padding-right:150px}' bgcolor=#ffffb4><font size=2></td></tr>
    <tr><td height=18 valign=top align=left bgcolor=#ffffb4><font size=2><b>county</b></td> <td valign=top align='right' style='{padding-right:150px}' bgcolor=#ffffb4><font size=2>dona ana</td></tr>
    <tr><td height=18 valign=top align=left bgcolor=#ffffb4><font size=2><b>state amount</b></td> <td valign=top align='right' style='{padding-right:150px}' bgcolor=#ffffb4><font size=2>$135,000</td></tr>
    <tr><td height=18 valign=top align=left bgcolor=#ffffb4><font size=2><b>chapter/section</b></td> <td valign=top align='right' style='{padding-right:150px}' bgcolor=#ffffb4><font size=2>105 / 18</td></tr>
    <tr><td height=18 valign=top align=left bgcolor=#ffffb4><font size=2><b>reversion date</b></td> <td valign=top align='right' style='{padding-right:150px}' bgcolor=#ffffb4><font size=2>6/30/2014</td></tr>
    <tr><th colspan=2>share/bof data</th> <td height=12</td></tr>
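In the sample above, the height attribute appears only on the label cells, which may be why filtering on it didn't help. Both the label and the value cells do carry bgcolor=#ffffb4, while the padding and spacer cells don't. Here is a rough sketch that filters on that instead, so the padding in the year row no longer shifts the columns and empty values are kept. It assumes bs4 with the built-in html.parser, and it runs on a trimmed copy of the rows above rather than the full page, so the unclosed td tags elsewhere in the document may still need different handling. `SAMPLE` and `extract_pairs` are names made up for this illustration.

```python
from bs4 import BeautifulSoup

# Trimmed copy of three rows from the HTML above (assumption: the real
# pages follow the same bgcolor pattern for label and value cells).
SAMPLE = """
<table width=845 border=6 cellpadding=0 cellspacing=0 bgcolor=#c0c0c0>
<tr><td valign=top rowspan=54> </td>
<td valign=top height=18 width=255 align=left bgcolor=#ffffb4><font size=2><b>year</b></td>
<td rowspan=54 width=5 valign=top><img src='images/spacer.gif' width=10></td>
<td valign=top align='right' style='{padding-right:150px}' bgcolor=#ffffb4><font size=2>2010</td></tr>
<tr><td height=18 valign=top align=left bgcolor=#ffffb4><font size=2><b>fund code</b></td>
<td valign=top align='right' style='{padding-right:150px}' bgcolor=#ffffb4><font size=2>severance tax bonds</td></tr>
<tr><td height=18 valign=top align=left bgcolor=#ffffb4><font size=2><b>category</b></td>
<td valign=top align='right' style='{padding-right:150px}' bgcolor=#ffffb4><font size=2></td></tr>
</table>
"""


def extract_pairs(html):
    """Return (label, value) tuples for rows with exactly two yellow cells."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", border="6")
    pairs = []
    for row in table.find_all("tr"):
        # Keep only the label/value cells; padding and spacer cells have no bgcolor
        cells = row.find_all("td", bgcolor="#ffffb4")
        if len(cells) == 2:
            label = cells[0].get_text(strip=True)
            value = cells[1].get_text(strip=True)  # may be an empty string
            pairs.append((label, value))
    return pairs


pairs = extract_pairs(SAMPLE)
headers = [label for label, _ in pairs]  # column headings, starting with year
values = [value for _, value in pairs]   # one CSV row per page
```

From there, `headers` could be written once and `values` appended per page with `csv.writer`. Whether the other three sections of the page use the same cell colors is something to check against the real pages.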