html - Trying to use Beautiful Soup (Python) to find 2 partial matches in an attribute's value
(This is a follow-up question to a previous post, which user alecxe (https://stackoverflow.com/users/771848/alecxe) helped me with. It makes more sense to post the follow-up as an independent question, though, so that it is more searchable for others.)
I have a Python script that uses Beautiful Soup to locate web reports on a hosting service.
Right now the script is pretty exacting, and I'd like to make it a bit more flexible. I feel a regex is what I need, but maybe nested searches would work too. I'm open to suggestions.
My current code works like this:
    def search_table_for_report(table, report_name, report_type):
        # Search the rows of the table to find the given report name,
        # then grab the download URL for the given type.
        for row in table.find_all('tr')[1:]:  # [1:] makes the loop skip the first item, aka the headers
            col = row.find_all('td')
            if report_name in col[0].string:
                print("----- Parse out file type request URL")
                report_type = report_type.upper()
                # This works, using an exact match
                label = row.find("input", {"aria-label": "Select " + report_name + " format " + report_type})
                # This doesn't work, using a regex
                #label = row.find("input", {"aria-label": re.compile("\b" + report_name + ".*\b" + report_type + ".*")})
                print("----- Okay, found the right checkbox, now grab the href link ----")
                link_url = label.find_next_sibling("a", href=True)["href"]
                return link_url

This searches through a table like this:
    <tr class="odd">
      <td header="c1"> Report Download </td>
      <td header="c2">
        <input aria-label="Select Report format PDF" id="documentchkbx0" name="documentchkbx" type="checkbox" value="5446"/>
        <a href="/a/document.html?key=5446">
          <img alt="portable document format" src="/img/icons/icon_pdf.gif"> </img>
        </a>
        <input aria-label="Select Report format XLS" id="documentchkbx1" name="documentchkbx" type="checkbox" value="5447"/>
        <a href="/a/document.html?key=5447">
          <img alt="excel spreadsheet format" src="/img/icons/icon_xls.gif"> </img>
        </a>
      </td>
      <td header="c4"> 04/27/2015 </td>
      <td header="c5"> 05/26/2015 </td>
      <td header="c6"> 05/26/2015 10:00am EDT </td>
    </tr>

I'd like to search the "aria-label" value for 2 values, or rather 2 partial matches within it. Essentially, instead of finding "Select Report format XLS", I may need to find "Select Matrix format PDF". I'm pretty sure the "Select" and "format" bits will always be there, but I can't be sure, so I need to make the 2nd word and the final extension type partial-match searches. The partial bit (instead of exact) is important because the "Report" word may have trailing words I don't expect, such as "Select Report II format XLS", which will fail if I do an exact search for "Select Report format XLS".
So I need some code (a regex, presumably) to search for a given name (in place of "Report") and a given type (in place of "XLS"). Here is what I've tried, but it's not working. I think the regex syntax is good, so I suspect I'm jamming the re.compile in the wrong spot, or using it in a way that Beautiful Soup does not expect.
    label = row.find("input", {"aria-label": re.compile("\b" + report_name + ".*\b" + report_type + ".*")})

I hope I explained this well. I'm happy to clarify any confusion.
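One likely reason the attempt above finds nothing: in an ordinary Python string literal, "\b" is the backspace character, not the regex word boundary, so the compiled pattern looks for a character that never appears in the label. The sketch below compares the two spellings using only the `re` module. The sample values for `report_name`, `report_type`, and `label` are assumptions taken from the table shown earlier.

```python
import re

# Hypothetical sample values, modelled on the table above
report_name = "Report"
report_type = "PDF"
label = "Select Report format PDF"

# "\b" in a normal string literal is the backspace character (\x08),
# so this pattern requires a literal backspace before the report name.
broken = re.compile("\b" + report_name + ".*\b" + report_type + ".*")

# r"\b" keeps the backslash, so the regex engine sees a word boundary.
fixed = re.compile(r"\b" + report_name + r".*\b" + report_type + r".*")

broken_match = broken.search(label)
fixed_match = fixed.search(label)

print(broken_match)            # no match for the backspace version
print(fixed_match is not None)  # the raw-string version matches
```

With raw strings, the same `re.compile(...)` expression can be placed in the `{"aria-label": ...}` dictionary exactly where the original attempt put it.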
I figured out the issue. The bs4 search technique was fine, but the regex pattern needed to be a bit smarter. It works great using the code below! I'm not sure how to make the search case-insensitive, but that's okay.
    # Build the pattern to search on,
    # where report_name and report_type are strings passed to the function
    regex_criteria = r'.*' + report_name + r'.*' + report_type

    # Search the value of the "aria-label" attribute
    # across the inputs on the page
    target_input = row.find("input", {"aria-label": re.compile(regex_criteria)})
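For the case-insensitivity question, one possible approach is to pass the `re.IGNORECASE` flag to `re.compile`. The sketch below also wraps both inputs in `re.escape`, which is an addition not present in the code above; it keeps characters such as "." or "+" in a report name from being read as regex syntax. The helper name `build_label_pattern` and the sample labels are hypothetical. The leading `.*` is left out here because Beautiful Soup filters against a compiled pattern with its `search()` method, which already matches anywhere in the attribute value.

```python
import re


def build_label_pattern(report_name, report_type):
    """Hypothetical helper: report name first, report type later, any text between."""
    # re.escape makes the inputs match literally; re.IGNORECASE ignores case.
    return re.compile(
        re.escape(report_name) + r".*" + re.escape(report_type),
        re.IGNORECASE,
    )


pattern = build_label_pattern("report", "xls")

# Assumed sample labels, modelled on the aria-label values in the question
labels = [
    "Select Report format PDF",
    "Select Report format XLS",
    "Select Report II format XLS",
    "Select Matrix format XLS",
]
matches = [label for label in labels if pattern.search(label)]
print(matches)

# With Beautiful Soup, the compiled pattern is passed the same way as above:
# target_input = row.find("input", {"aria-label": pattern})
```

Because the pattern is case-insensitive, the `report_type.upper()` step from the original function would no longer be needed for the match itself.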