Parse HTML table data to JSON and save to text file in Python 2.7
I'm trying to extract the data on crime rates across states from this web page: http://www.disastercenter.com/crime/uscrime.htm
I am able to get the data into a text file, but I would like the response in JSON format. How can I do this in Python?
Here is my code:
    import urllib
    import re
    from bs4 import BeautifulSoup

    link = "http://www.disastercenter.com/crime/uscrime.htm"
    f = urllib.urlopen(link)
    myfile = f.read()
    soup = BeautifulSoup(myfile)
    soup1 = soup.find('table', width="100%")
    soup3 = str(soup1)
    # Strip every HTML tag, leaving only the text
    result = re.sub("<.*?>", "", soup3)
    print(result)
    output = open("output.txt", "w")
    output.write(result)
    output.close()
The following code gets the data from the two tables and outputs all of it as a JSON-formatted string.
Working example (Python 2.7.9):
    from lxml import html
    import requests
    import re as regular_expression
    import json

    page = requests.get("http://www.disastercenter.com/crime/uscrime.htm")
    tree = html.fromstring(page.text)

    # The two data tables sit in the 2nd and 5th rows of the outer table
    tables = [tree.xpath('//table/tbody/tr[2]/td/center/center/font/table/tbody'),
              tree.xpath('//table/tbody/tr[5]/td/center/center/font/table/tbody')]
    tabs = []

    for table in tables:
        tab = []
        for row in table:
            for col in row:
                var = col.text_content()
                var = var.strip().replace(" ", "")
                var = var.split('\n')
                # Keep only data rows, which start with a four-digit year
                if regular_expression.match(r'^\d{4}$', var[0].strip()):
                    tab_row = {}
                    tab_row["year"] = var[0].strip()
                    tab_row["population"] = var[1].strip()
                    tab_row["total"] = var[2].strip()
                    tab_row["violent"] = var[3].strip()
                    tab_row["property"] = var[4].strip()
                    tab_row["murder"] = var[5].strip()
                    tab_row["forcible_rape"] = var[6].strip()
                    tab_row["robbery"] = var[7].strip()
                    tab_row["aggravated_assault"] = var[8].strip()
                    tab_row["burglary"] = var[9].strip()
                    tab_row["larceny_theft"] = var[10].strip()
                    tab_row["vehicle_theft"] = var[11].strip()
                    tab.append(tab_row)
        tabs.append(tab)

    json_data = json.dumps(tabs)

    output = open("output.txt", "w")
    output.write(json_data)
    output.close()
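As a side note, the row-to-dictionary step in the answer above could probably be shortened by pairing a list of field names with the split values using zip. The following is only a sketch that runs offline: FIELDS, row_to_dict, the placeholder sample row, and the temporary output path are all assumptions made for illustration, and the sample values are not real figures from the page.

```python
import json
import os
import tempfile

# Assumed field order, copied from the dictionary keys in the answer above.
FIELDS = ["year", "population", "total", "violent", "property", "murder",
          "forcible_rape", "robbery", "aggravated_assault", "burglary",
          "larceny_theft", "vehicle_theft"]


def row_to_dict(cell_text):
    """Map one newline-separated table row to a dict keyed by FIELDS.

    Returns None for rows that do not look like data rows.
    """
    values = [v.strip() for v in cell_text.strip().split("\n")]
    # Data rows start with a four-digit year and supply one value per field.
    starts_with_year = len(values[0]) == 4 and values[0].isdigit()
    if len(values) < len(FIELDS) or not starts_with_year:
        return None
    return dict(zip(FIELDS, values))


# Placeholder input: one fake data row and one header-like row.
sample = "\n".join(["1999"] + [str(n) for n in range(1, 12)])
candidates = [sample, "Year\nPopulation"]
rows = [d for d in (row_to_dict(t) for t in candidates) if d is not None]

# Write to a temporary directory so the sketch does not touch real files.
out_path = os.path.join(tempfile.mkdtemp(), "output.txt")

# json.dump writes straight to the file; "with" closes it even on error.
with open(out_path, "w") as handle:
    json.dump(rows, handle, indent=2)

# Reading the file back confirms that it holds valid JSON.
with open(out_path) as handle:
    loaded = json.load(handle)
```

Keeping the field names in one list means that adding or reordering a column is a one-line change, and json.load on the saved file gives a quick check that the output is well formed.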