python - Extract certain integers from string value, of different length, which contains unwanted integers. Pattern or Position

March 15, 2011

I am a bit of a beginner programmer, and I am looking for an explanation of my problem. I am looking to extract ID numbers from a string into a new column, then fill in the missing numbers.

I am working with a pandas DataFrame and have the following set of street names, some with an ID number and others missing it:

*start station*:
"19th & l st (31224)"
"14th & r st nw (31202)"
"paul rd & pl nw (31602)"
"14th & r st nw"
"19th & l st"
"paul rd & pl nw"

Desired outcome:

*start station*         *startstatnum*
"14th & r st nw"        31202
"19th & l st"           31224
"paul rd & pl nw"       31602
"14th & r st nw"        31202
"19th & l st"           31224
"paul rd & pl nw"       31602

I am having difficulty after the first step of splitting. I can split based on position with the following:

def stat_num(stat_num):
    # Take the text after the last '(' and before the next ')'
    return stat_num.split('(')[-1].split(')')[0].strip()

db["startstatnum"] = pd.DataFrame({'num': db['start station'].apply(stat_num)})

This gives:

*start station*              *startstatnum*
"19th & l st (31224)"        31224
"14th & r st nw (31202)"     31202
"paul rd & pl nw (31602)"    31602
"14th & r st nw"             "14th & r st nw"
"19th & l st"                "19th & l st"
"paul rd & pl nw"            "paul rd & pl nw"

The problem arises when I want to find and fill in startstatnum for the stations whose ID numbers I don't have.

I have been trying to get to know str.extract, str.contains, and re.findall, and tried the following as a possible stepping stone:

db['start_s2']  = db['start_stat_num'].str.extract(" ((\d+))")
db['start_s2']  = db['start station'].str.contains(" ((\d+))")
db['start_s2']  = db['start station'].re.findall(" ((\d+))")

I have also tried the following, from here:

def parseintegers(mixedlist):
    return [x for x in db['start station'] if (isinstance(x, int) or isinstance(x, long)) and not isinstance(x, bool)]

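However, when I pass the values in, I get a list 'x' with 1 value. As a bit of a noob, I don't think going the pattern route is best, as it will also take in unwanted integers (although I could possibly turn those into NaNs, since they would be less than 30000, the lowest value for an ID number). I also have an idea that it is something simple that I'm overlooking, but after about 20 straight hours and a lot of searching, I am at a bit of a loss.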

Any help would be extremely appreciated.

Here's a way that worked for me. Firstly, extract the numbers in parentheses:

In [71]:
df['start stat num'] = df['start station'].str.findall(r'\((\d+)\)').str[0]
df

Out[71]:
             start station start stat num
0      19th & l st (31224)          31224
1   14th & r st nw (31202)          31202
2  paul rd & pl nw (31602)          31602
3           14th & r st nw            NaN
4              19th & l st            NaN
5          paul rd & pl nw            NaN
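As a small sketch of why anchoring the pattern on the parentheses sidesteps the "unwanted integers" worry from the question: digits such as the "19" in "19th" are never matched, because they are not wrapped in parentheses. This variant uses str.extract instead of str.findall, which is my substitution rather than the method above, and the variable name station_ids is made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({"start station": [
    "19th & l st (31224)",
    "14th & r st nw (31202)",
    "paul rd & pl nw (31602)",
    "14th & r st nw",
]})

# Capture only digits wrapped in parentheses; the "19" in "19th" is ignored.
# expand=False returns a Series rather than a one-column DataFrame.
station_ids = df["start station"].str.extract(r"\((\d+)\)", expand=False)
print(station_ids)
```

Rows without a parenthesised number come back as NaN, the same as in the findall version, and the extracted IDs are still strings at this point.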

Now remove the number from the station name, as we don't need it there anymore:

In [72]:
df['start station'] = df['start station'].str.split(r' \(').str[0]
df

Out[72]:
     start station start stat num
0      19th & l st          31224
1   14th & r st nw          31202
2  paul rd & pl nw          31602
3   14th & r st nw            NaN
4      19th & l st            NaN
5  paul rd & pl nw            NaN
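If a station name could itself contain an opening parenthesis, splitting on the first " (" would cut the name short. One possible alternative, sketched here under that assumption, is to strip only a trailing "(digits)" group with str.replace. The sample Series and the name cleaned are illustrative only.

```python
import pandas as pd

names = pd.Series(["19th & l st (31224)", "14th & r st nw"])

# Remove a parenthesised run of digits at the end of the string,
# together with any whitespace around it; other text is left untouched.
cleaned = names.str.replace(r"\s*\(\d+\)\s*$", "", regex=True)
print(cleaned.tolist())
```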

Now we can fill in the missing station numbers by calling map on the df with the NaN rows removed and the station name set as the index. This looks up each station name and returns the station number:

In [73]:
df['start stat num'] = df['start station'].map(df.dropna().set_index('start station')['start stat num'])
df

Out[73]:
     start station start stat num
0      19th & l st          31224
1   14th & r st nw          31202
2  paul rd & pl nw          31602
3   14th & r st nw          31202
4      19th & l st          31224
5  paul rd & pl nw          31602
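The three steps can be put together as one sketch. It assumes, beyond the sample data above, that the same labelled station may appear more than once; in that case the lookup index would contain duplicates, which map may reject, so the sketch drops duplicate names first. The extra rows and the name lookup are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({"start station": [
    "19th & l st (31224)",
    "14th & r st nw (31202)",
    "19th & l st (31224)",   # repeated labelled row (assumed case)
    "14th & r st nw",
    "19th & l st",
    "unknown st",            # no labelled counterpart anywhere
]})

# Step 1: pull the parenthesised ID into its own column (NaN when absent).
df["start stat num"] = df["start station"].str.extract(r"\((\d+)\)", expand=False)

# Step 2: strip the trailing "(digits)" from the station name.
df["start station"] = df["start station"].str.replace(r"\s*\(\d+\)\s*$", "", regex=True)

# Step 3: build a name -> ID lookup from the labelled rows, with unique names.
lookup = (
    df.dropna(subset=["start stat num"])
      .drop_duplicates(subset="start station")
      .set_index("start station")["start stat num"]
)
df["start stat num"] = df["start station"].map(lookup)
print(df)
```

A station that never appears with an ID, like the last row here, stays NaN. The IDs are strings throughout; pd.to_numeric can convert the column if numbers are needed.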
