JavaScript regex with varying input
I want to filter the following information out of a long piece of text. I copy and paste the text into a text field and want to process it into a table as the result, with:
- name
- address
- status
Example snippet (the names, addresses, etc. are more or less randomized):
thuisprikindeling voor: vrijdag 15 mei 2015 de smart bon 22 afspraken pagina 1/4 persoonlijke mededeling: algemene mededeling: prikpostgegevens: reek-eeklo extern, (-) telefoonnummer fax mobiel 0499/9999999 email dummy.dummy@gmail.com
dummy foo v stationstreet 2 8000 new york f n - sober bsn: 1655 thuis analyses: werknr: pin: 000000002038905 opdrachtgever: laboratorium arts: mededeling: comments // difficult
fo dummy foo v butterstreet 6 8740 melbourne f n - sober bsn: 15898 thuis analyses: werknr: afd 3 pin: 000000002035900 opdrachtgever: laboratorium arts: mededeling: zh bla / bla bla - afd 3 - social beer
john fooo v waterstreet 1 9990 rome f n - sober bsn: 17878 thuis / analyses: werknr: k111 pin: 000000002037888 opdrachtgever: laboratorium arts: mededeling: tryout/foo
fo smooth m.foo m queen elisabethstreet 19 9990 paris f nn - not sober bsn: 14877
What I want out of this:
dummy foo stationstreet 2 8000 new york sober
fo dummy foo butterstreet 6 8740 melbourne sober
john fooo waterstreet 1 9990 rome sober
fo smooth m.foo queen elisabethstreet 19 9990 paris not sober
My strategy at the moment is the following:
- Filter the lines that have at least 2 words in capitals at the beginning of the line, and a 4-digit postal code.
- Then discard all the other lines, because I only need the lines with names and addresses.
- Then strip out all the information needed from that line.
- Strip out the name / address / status.
I use the following code:
    // Regular expressions
    // Filter lines that start with at least 2 uppercase words followed by a space
    var pattern = /^(([A-Z'.* ]{2,} ){2,}[A-Z]{1,})(?=.*BSN)/;
    var postcode = /\d{4}/;
    // The status searches ignore case
    var searchsober = /(n - sober)+/i;
    var searchnotsober = /(nn - not sober)+/i;
    var result = [];

    // inputtext holds the text pasted into the text field
    var adres = inputtext.split('\n');
    for (var i = 0; i < adres.length; i++) {
        // Only handle a line if it contains a postal code and starts with
        // at least 2 uppercase words followed by a space
        var temp = adres[i];
        if (pattern.test(temp) && postcode.test(temp)) {
            // Remove the BSN part so the remaining digits can be used to find the postal code
            temp = temp.replace(/BSN.*/g, "");
            // Example: DUMMY FOO V stationstreet 2 8000 new york f n - sober

            // Select the name: take the first element of the match array
            // DUMMY FOO
            var name = temp.match(/^([-A-Z'*.]{2,} ){1,}[-A-Z.]{2,}/)[0];
            // Remove the name from the string
            temp = temp.replace(/^([-A-Z'*.]{2,} ){1,}[-A-Z.]{2,}/, "");
            //  V stationstreet 2 8000 new york f n - sober

            // Filter out the gender, using jQuery for whitespace trimming
            // V
            var gender = $.trim(temp.match(/^( [A-Z'*.]{1} )/)[0]);
            // Remove the gender
            temp = temp.replace(/^( [A-Z'*.]{1} )/, "");
            // stationstreet 2 8000 new york f n - sober

            // Look for the status
            var status = "unknown";
            if (searchnotsober.test(temp)) {
                status = "not sober";
            } else if (searchsober.test(temp)) {
                status = "sober";
            } else {
                status = "unknown";
            }

            // Select the address
            // stationstreet 2 8000 new york
            var addressMatch = temp.match(/^.*[0-9]{4}.[\w-]{2,40}/);
            var address = addressMatch ? $.trim(addressMatch[0]) : "";

            // Assemble the person object
            var person = {name: name + "", address: address + "", gender: gender + "", status: status + "", location: [], marker: []};
            result.push(person);
        }
    }
The problem I have is that:
- sometimes the names are not written in capitals
- sometimes the postal code is not added, so the code stops working
- sometimes they put a * in front of the name
A broader question: what strategy can I take to tackle these types of messy input problems? Should I make cases for every mistake I see in the snippets I get? I feel like I don't know what I will get out of this piece of code every time I run it with different input.
Here is a general way of handling it:
- Find all lines that are likely matches. Match on "sober" or whatever makes it unlikely that you miss a match, even if it gives you false positives.
- Filter out the false positives. This is the part you will have to update and tweak as you go. Make sure you only filter out what isn't relevant at all.
- Apply strict filtering to the input: whatever doesn't match gets logged/reported for manual handling, and whatever does match now conforms to a known, strict pattern.
- Normalize and extract the data, which should be easier now since you have limited the possible input at this stage.
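
The four steps above could be sketched roughly as follows. This is only an illustration under assumptions, not a drop-in solution: the function names (`findCandidates`, `dropFalsePositives`, `toPerson`, `parse`), the "Mededeling:" false-positive rule, the single-letter column that precedes the status code, and the sample lines (including their casing and line breaks) are all made up for the example and would need to be adapted to the real data.

```javascript
// Stage 1: cast a wide net. Any line that mentions "sober" is a candidate.
function findCandidates(text) {
  return text.split(/\r?\n/).filter(function (line) {
    return /sober/i.test(line);
  });
}

// Stage 2: drop known false positives. This list is expected to grow over time.
// Hypothetical rule: free-text comment lines start with "Mededeling:".
var falsePositives = [/^\s*mededeling:/i];

function dropFalsePositives(lines) {
  return lines.filter(function (line) {
    return !falsePositives.some(function (re) { return re.test(line); });
  });
}

// Stage 3: one strict pattern. Assumed layout of a record line:
//   [*] NAME  V|M  STREET NR  4-digit-code CITY  X  N|NN - Sober|Not Sober
// Case is ignored and a leading "*" is allowed; a missing postal code is not.
var strict = /^\s*\*?\s*(.+?)\s+([VM])\s+(.+?\s\d{4}\s+.+?)\s+[A-Z]\s+N{1,2}\s+-\s+(not sober|sober)\b/i;

// Stage 4: normalize the captured groups into one consistent shape.
function toPerson(match) {
  return {
    name: match[1].toUpperCase(),
    gender: match[2].toUpperCase(),
    address: match[3].replace(/\s+/g, " "),
    status: match[4].toLowerCase()
  };
}

function parse(text) {
  var accepted = [];
  var rejected = []; // lines to report for manual handling
  dropFalsePositives(findCandidates(text)).forEach(function (line) {
    var match = strict.exec(line);
    if (match) {
      accepted.push(toPerson(match));
    } else {
      rejected.push(line);
    }
  });
  return { accepted: accepted, rejected: rejected };
}

// Made-up sample lines covering the three problems from the question.
var sample = [
  "DUMMY FOO V Stationstreet 2 8000 New York F N - Sober BSN: 1655",
  "* john fooo V Waterstreet 1 9990 Rome F N - Sober BSN: 17878",
  "FO SMOOTH M.FOO M Queen Elisabethstreet 19 9990 Paris F NN - Not Sober BSN: 14877",
  "FO DUMMY FOO V Butterstreet 6 Melbourne F N - Sober BSN: 15898",
  "Mededeling: was not sober last time"
].join("\n");

var parsed = parse(sample);
console.log(parsed.accepted);
console.log(parsed.rejected);
```

The lines that end up in `rejected` are the ones to inspect by hand. Each new kind of mistake then becomes either a new false-positive rule or a deliberate loosening of the strict pattern, instead of a silent wrong result.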