google bigquery - Trying to find exact word match within separate table field, accounting for negative words -
i have tried many different queries 1 right, become complete mess. long story short, trying find exact word match (isolated word separated spaces) based on 3 separate keywords , excluding matches contain negative keywords.
field_name_1, field_name_2 , field_name_3 positive words. negative_keywords comma separated group of words first split , used negate results ut.title contains negative keyword.
in essence query asking: "find ut.title has either field_name_1, field_name_2, or field_name_3 @ same time not have word split negative_keywords field."
any great appreciated. unfortunately regex seemingly not possible because field_name_x constants. in advance!
my current overbloated mess of query below:
select ut.i_id i_id, up.id p_id, up.option_id option_id ds_test.table_1 ut cross join ( select field_name_1, field_name_2, field_name_3, split(negative_keywords ,",") negative_keywords, option_id, id ds_test.table_2 ) ( (ut.title contains " "+up.field_name_1+" ") or (left(ut.title, length(up.field_name_1+" ")) contains up.field_name_1+" ") or (right(ut.title, length(" "+up.field_name_1)) contains " "+up.field_name_1) or (ut.title contains " "+up.field_name_2+" ") or (left(ut.title, length(up.field_name_2+" ")) contains up.field_name_2+" ") or (right(ut.title, length(" "+up.field_name_2)) contains " "+up.field_name_2) or (ut.title contains " "+up.field_name_3+" ") or (left(ut.title, length(up.field_name_3+" ")) contains up.field_name_3+" ") or (right(ut.title, length(" "+up.field_name_3)) contains " "+up.field_name_3) or (ut.title contains concat(substr(up.field_name_1, 1 , length(up.field_name_1))," ")) or (ut.title contains concat(substr(up.field_name_2, 1 , length(up.field_name_2))," ")) or (ut.title contains concat(substr(up.field_name_3, 1 , length(up.field_name_3))," ")) , (not ut.title contains concat(substr(up.negative_keywords, 1 , length(up.negative_keywords))," ")) ) group each i_id, p_id, option_id ignore case for example:
in table ds_test.table_1: field title contains "the x301-p , x301-b top of charts"
in table ds_test.table_2: field_name_1, field_name_2, field_name_3, negative_keywords respectively:
row 1 = |x301-f|x301p|x301-p|x301-a,x301-c| row 2 = |x301-b|x301b|x301-d|x301-h,x301-p| row 3 = |x301 |x30 | | | row 1 true. there x301-p , none of negative keywords in title.
row 2 false. though there x301-b in title, there x301-p negative keyword.
row 3 false. though there x301 and/or x30 in title, match because substrings of x301-p or x301-b, therefore x301 or x30 not full individual words within title.
the idea is:
- split negative keywords repeated field
- remove negative words using
omit record if some(title contains negative)construct - match full words using contains surrounding spaces, or catch beginning/end of string use custom pattern like
putting altogether using data example:
select title, field_1, field_2, field_3 ( select title, field_1, field_2, field_3, split(table2.negative) negative (select * (select 'the x301-b tops x301-p' title), (select 'the x301-p , x301-b top of charts' title)) table1 cross join (select * (select 'x301-f' field_1, 'x301p' field_2, 'x301-p' field_3, 'x301-a,x301-c' negative), (select 'x301-b' field_1, 'x301b' field_2, 'x301-d' field_3, 'x301-h,x301-p' negative), (select 'x301' field_1, 'x30' field_2, '' field_3, '' negative)) table2 ) title contains ' ' + field_1 + ' ' or title '% ' + field_1 or title field_1 + ' %' or title contains ' ' + field_2 + ' ' or title '% ' + field_2 or title field_2 + ' %' or title contains ' ' + field_3 + ' ' or title '% ' + field_3 or title field_3 + ' %' omit record if some(title contains negative) update: since looks evaluation of expensive on real datasets, alternative pad title spaces on both sides before doing contains check. modified query below:
select title, field_1, field_2, field_3 ( select title, field_1, field_2, field_3, split(table2.negative) negative (select ' ' + title + ' ' title (select 'the x301-b tops x301-p' title), (select 'the x301-p , x301-b top of charts' title)) table1 cross join (select * (select 'x301-f' field_1, 'x301p' field_2, 'x301-p' field_3, 'x301-a,x301-c' negative), (select 'x301-b' field_1, 'x301b' field_2, 'x301-d' field_3, 'x301-h,x301-p' negative), (select 'x301' field_1, 'x30' field_2, '' field_3, '' negative)) table2 ) title contains ' ' + field_1 + ' ' or title contains ' ' + field_2 + ' ' or title contains ' ' + field_3 + ' ' omit record if some(title contains negative)
Comments
Post a Comment