How to split a dataframe string column into multiple columns? - python

I have a pandas dataframe consisting of a single column. I want to split this column on the '&' sign and add the data to the right of each '=' sign as a new column. Examples are below.
The dataframe I have:
tags
0 letter1=A&letter2=B&letter3=C
1 letter1=D&letter2=E&letter3=F
2 letter1=G&letter2=H&letter3=I
3 letter1=J&letter2=K&letter3=L
4 letter1=M&letter2=N&letter3=O
5 letter1=P&letter2=R&letter3=S
. .
. .
The dataframe I want to end up with:
letter1 letter2 letter3
0 A B C
1 D E F
2 G H I
3 J K L
4 M N O
.
.
I tried to do something with this code snippet:
columnname = df["tags"][0].split("&")[i].split("=")[0]
value = df["tags"][0].split("&")[i].split("=")[1]
But I'm not sure how to do this for the whole dataframe. I'm looking for a faster, more stable way.
Thanks in advance.

You can do this:
import pandas as pd

tags = [
    "letter1=A&letter2=B&letter3=C",
    "letter1=D&letter2=E&letter3=F",
    "letter1=G&letter2=H&letter3=I",
    "letter1=J&letter2=K&letter3=L",
    "letter1=M&letter2=N&letter3=O",
    "letter1=P&letter2=R&letter3=S"
]
df = pd.DataFrame({"tags": tags})
df["letter1"] = df["tags"].apply(lambda x: x.split("&")[0].split("=")[-1])
df["letter2"] = df["tags"].apply(lambda x: x.split("&")[1].split("=")[-1])
df["letter3"] = df["tags"].apply(lambda x: x.split("&")[2].split("=")[-1])
df = df[["letter1", "letter2", "letter3"]]
df
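As an aside: since the tags are effectively URL query strings, the standard library's urllib.parse.parse_qsl can parse an arbitrary number of key=value pairs, so the column names don't have to be hard-coded. A minimal sketch of the idea (not part of the original answer):
from urllib.parse import parse_qsl

import pandas as pd

df = pd.DataFrame({"tags": ["letter1=A&letter2=B&letter3=C",
                            "letter1=D&letter2=E&letter3=F"]})
# parse_qsl("letter1=A&letter2=B") -> [("letter1", "A"), ("letter2", "B")]
parsed = df["tags"].apply(lambda s: dict(parse_qsl(s)))
result = pd.DataFrame(parsed.tolist())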

Split into separate columns via str.split, using '&':
step1 = df.tags.str.split("&", expand=True)
Get the new columns from the first row of step1:
new_columns = step1.loc[0, :].str[:-2].array
Get rid of the letter1= style prefix in each value, and set new_columns as the header:
step1.set_axis(new_columns, axis='columns').transform(lambda col: col.str[-1])
letter1 letter2 letter3
0 A B C
1 D E F
2 G H I
3 J K L
4 M N O
5 P R S
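One caveat: col.str[-1] keeps only the last character, which works here because every value is a single letter. A hedged variant that tolerates longer values splits on '=' instead (a sketch, assuming the same df as above):
step1 = df["tags"].str.split("&", expand=True)
# take the keys from the first row: "letter1=A" -> "letter1"
new_columns = step1.iloc[0].str.split("=").str[0].tolist()
# keep whatever follows the '=' in each cell, however long it is
result = (step1.apply(lambda col: col.str.split("=").str[1])
               .set_axis(new_columns, axis="columns"))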

d = list(df["tags"])
r = {}
for i in d:
    for ele in i.split("&"):
        if ele.split("=")[0] in r:
            r[ele.split("=")[0]].append(ele.split("=")[1])
        else:
            r[ele.split("=")[0]] = []
            r[ele.split("=")[0]].append(ele.split("=")[1])
df = pd.DataFrame({i: pd.Series(r[i]) for i in r})
print(df)
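The same bookkeeping reads more cleanly with collections.defaultdict, which creates the empty list on first access. A sketch of the idea, assuming the original df with its "tags" column:
from collections import defaultdict

import pandas as pd

r = defaultdict(list)
for row in df["tags"]:
    for pair in row.split("&"):
        key, value = pair.split("=")
        r[key].append(value)
df = pd.DataFrame(r)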

Using regex
import pandas as pd
import re

tags = [
    "letter1=A&letter2=B&letter3=C",
    "letter1=D&letter2=E&letter3=F",
    "letter1=G&letter2=H&letter3=I",
    "letter1=J&letter2=K&letter3=L",
    "letter1=M&letter2=N&letter3=O",
    "letter1=P&letter2=R&letter3=S"
]
df = pd.DataFrame({"tags": tags})
pattern = re.compile(r"=(\w+)")  # capture the value after each '=' sign
# note: assign the columns in the order findall returns them
df['letter1'], df['letter2'], df['letter3'] = zip(*df["tags"].apply(lambda x: pattern.findall(x)))
Output
tags letter1 letter2 letter3
0 letter1=A&letter2=B&letter3=C A B C
1 letter1=D&letter2=E&letter3=F D E F
2 letter1=G&letter2=H&letter3=I G H I
3 letter1=J&letter2=K&letter3=L J K L
4 letter1=M&letter2=N&letter3=O M N O
5 letter1=P&letter2=R&letter3=S P R S
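A pandas-native variant of the same regex idea is Series.str.extract with named groups, which avoids the zip/apply round-trip. A sketch, assuming fixed keys and single \w+ token values:
extracted = df["tags"].str.extract(
    r"letter1=(?P<letter1>\w+)&letter2=(?P<letter2>\w+)&letter3=(?P<letter3>\w+)")
df = df.join(extracted)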

Related

How to put the print information into a current dataframe

I'm having a brain fart. I wrote some code to get keywords from my data frame. It works, but how can I put the printed information into my current data frame? Thank you for the help in advance.
from scipy.sparse import coo_matrix

def sort_coo(coo_matrix):
    tuples = zip(coo_matrix.col, coo_matrix.data)
    return sorted(tuples, key=lambda x: (x[1], x[0]), reverse=True)

def extract_topn_from_vector(feature_names, sorted_items, topn=10):
    """get the feature names and tf-idf score of top n items"""
    # use only topn items from vector
    sorted_items = sorted_items[:topn]
    score_vals = []
    feature_vals = []
    # word index and corresponding tf-idf score
    for idx, score in sorted_items:
        # keep track of feature name and its corresponding score
        score_vals.append(round(score, 3))
        feature_vals.append(feature_names[idx])
    # create a dict of feature -> score
    # results = zip(feature_vals, score_vals)
    results = {}
    for idx in range(len(feature_vals)):
        results[feature_vals[idx]] = score_vals[idx]
    return results

# tf_idf_vector, feature_names and doc come from earlier in my notebook
# sort the tf-idf vectors by descending order of scores
sorted_items = sort_coo(tf_idf_vector.tocoo())
# extract only the top n; n here is 5
keywords = extract_topn_from_vector(feature_names, sorted_items, 5)

# now print the results - NEED TO PUT THIS INFORMATION IN MY CURRENT DATAFRAME
print("\nAbstract:")
print(doc)
print("\nKeywords:")
for k in keywords:
    print(k, keywords[k])
First: a DataFrame is not Excel, so it may not look the way you expect.
You can use append() to add a new row of text. It automatically fills the row with NaN if it is shorter, or adds NaN columns if it is longer.
import pandas as pd

data = {
    'X': ['A','B','C'],
    'Y': ['D','E','F'],
    'Z': ['G','H','I']
}
df = pd.DataFrame(data)
print(df)

df = df.append({"X": 'Abstract:'}, ignore_index=True)
df = df.append({"X": 'Keywords:'}, ignore_index=True)

keywords = {"first": 123, "second": 456, "third": 789}
for key, value in keywords.items():
    df = df.append({"X": key, "Y": value}, ignore_index=True)
print(df)
Result:
# Before
X Y Z
0 A D G
1 B E H
2 C F I
# After
X Y Z
0 A D G
1 B E H
2 C F I
3 Abstract: NaN NaN
4 Keywords: NaN NaN
5 first 123 NaN
6 second 456 NaN
7 third 789 NaN
Later you can replace the NaN with something else, e.g. an empty string:
df = df.fillna('')
Result:
X Y Z
0 A D G
1 B E H
2 C F I
3 Abstract:
4 Keywords:
5 first 123
6 second 456
7 third 789
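One caveat worth flagging: DataFrame.append() was deprecated in pandas 1.4 and removed in 2.0. A minimal sketch of the same idea using pd.concat, its replacement:
import pandas as pd

df = pd.DataFrame({'X': ['A','B','C'], 'Y': ['D','E','F'], 'Z': ['G','H','I']})
keywords = {"first": 123, "second": 456, "third": 789}
# build all the extra rows first, then concatenate once
rows = [{"X": "Abstract:"}, {"X": "Keywords:"}]
rows += [{"X": k, "Y": v} for k, v in keywords.items()]
df = pd.concat([df, pd.DataFrame(rows)], ignore_index=True).fillna('')
print(df)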

Is it possible to split a column value and add a new column at the same time for dataframe?

I have a dataframe with some columns delimited with '|', and I need to flatten this dataframe. Example:
name type
a l
b m
c|d|e n
For this df, I want to flatten it to:
name type
a l
b m
c n
d n
e n
To do this, I used this command:
df = df.assign(name=df.name.str.split('|')).explode('name').drop_duplicates()
Now I want to do one more thing besides the flattening above:
name type co_occur
a l
b m
c n d
c n e
d n e
That is, not only split 'c|d|e' into multiple rows, but also create a new column containing the 'co_occur' relationship, in which 'c', 'd', and 'e' each co-occur with one another.
I don't see an easy way to do this by modifying:
df = df.assign(name=df.name.str.split('|')).explode('name').drop_duplicates()
I think this is what you want. Use itertools.combinations and piece everything together:
from itertools import combinations
import io
import pandas as pd

data = '''name type
a l
b m
c|d|e n
j|k o
f|g|h|i p
'''
df = pd.read_csv(io.StringIO(data), sep=r'\s+', engine='python')

# hold the new dataframes as you iterate via apply()
df_hold = []

def explode_combos(x):
    combos = list(combinations(x['name'].split('|'), 2))
    df_hold.append(pd.DataFrame([{'name': c[0], 'type': x['type'], 'co_cur': c[1]} for c in combos]))
    return

# only apply() to those rows that need to be exploded
dft = df[df['name'].str.contains(r'\|')].apply(explode_combos, axis=1)
# concatenate the result
dfn = pd.concat(df_hold)
# add back the rows that weren't operated on (note the ~)
df_final = pd.concat([df[~df['name'].str.contains(r'\|')], dfn]).fillna('')
name type co_cur
0 a l
1 b m
0 c n d
1 c n e
2 d n e
0 j o k
0 f p g
1 f p h
2 f p i
3 g p h
4 g p i
5 h p i
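A hedged alternative that builds the same rows in one pass, without the module-level df_hold list (rows whose name contains no '|' are kept with an empty co_cur):
from itertools import combinations

import pandas as pd

rows = []
for names, t in zip(df['name'].str.split('|'), df['type']):
    if len(names) == 1:
        rows.append({'name': names[0], 'type': t, 'co_cur': ''})
    else:
        # every unordered pair within the '|'-delimited group
        rows.extend({'name': a, 'type': t, 'co_cur': b}
                    for a, b in combinations(names, 2))
df_final = pd.DataFrame(rows)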

Fuzzy match in a column in a dataframe in python

I have a column of strings. I want to do a fuzzy match and mark rows that have an over-80% match with another row in a column next to them. I can do this with the following code on a small dataset, but my real dataset is too big for it to run efficiently. Is there a better way to do this? This is what I have done.
import pandas as pd
from fuzzywuzzy import fuzz

l = [[1,'a','b','c','help pls'], [2,'a','c','c','yooo'], [3,'a','c','c','you will not pass'],
     [4,'a','b','b','You shall not pass'], [5,'a','c','c','You shall not pass!']]
df = pd.DataFrame(l, columns=['Serial No','one','two','three','four'])
df['yes/no 2'] = ""
for i in range(0, df.shape[0]):
    for j in range(0, df.shape[0]):
        if i != j:
            if fuzz.token_sort_ratio(df.iloc[i, df.shape[1]-2], df.iloc[j, df.shape[1]-2]) > 80:
                df.iloc[i, df.shape[1]-1] = "yes"
import pandas as pd
from fuzzywuzzy import fuzz

l = [[1,'a','b','c','help pls'], [2,'a','c','c','yooo'], [3,'a','c','c','you will not pass'],
     [4,'a','b','b','You shall not pass'], [5,'a','c','c','You shall not pass!']]
df = pd.DataFrame(l, columns=['Serial No','one','two','three','four'])

def match(row):
    thresh = 80
    return fuzz.token_sort_ratio(row["two"], row["three"]) > thresh

df["Yes/No"] = df.apply(match, axis=1)
print(df)
Serial No one two three four Yes/No
0 1 a b c help pls False
1 2 a c c yooo True
2 3 a c c you will not pass True
3 4 a b b You shall not pass True
4 5 a c c You shall not pass! True
import pandas as pd
from fuzzywuzzy import fuzz, process

l = [[1,'a','b','c','help pls'], [2,'a','c','c','yooo'], [3,'a','c','c','you will not pass'],
     [4,'a','b','b','You shall not pass'], [5,'a','c','c','You shall not pass!']]
df = pd.DataFrame(l, columns=['Serial No','one','two','three','four']).reset_index()

def match(df, col):
    thresh = 80
    return df[col].apply(
        lambda x: "Yes" if len(process.extractBests(
            x[1],
            [xx[1] for i, xx in enumerate(df[col]) if i != x[0]],
            scorer=fuzz.token_sort_ratio,
            score_cutoff=thresh + 1,
            limit=1)) > 0 else "No")

df["five"] = df.apply(lambda x: (x["index"], x["four"]), axis=1)
df["Yes/No"] = df.pipe(match, "five")
print(df)
index Serial No one two three four five Yes/No
0 0 1 a b c help pls (0, help pls) No
1 1 2 a c c yooo (1, yooo) No
2 2 3 a c c you will not pass (2, you will not pass) Yes
3 3 4 a b b You shall not pass (3, You shall not pass) Yes
4 4 5 a c c You shall not pass! (4, You shall not pass!) Yes
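If the real dataset is too big for these nested Python-level loops, one option (an assumption, not used in the answers above) is rapidfuzz, a faster drop-in replacement for fuzzywuzzy whose process.cdist scores every pair in one vectorized call. A sketch against the df defined above:
import numpy as np
from rapidfuzz import fuzz, process

texts = df['four'].tolist()
# n x n matrix of token_sort_ratio scores for all pairs at once
scores = process.cdist(texts, texts, scorer=fuzz.token_sort_ratio)
np.fill_diagonal(scores, 0)  # ignore self-matches
df['Yes/No'] = np.where(scores.max(axis=1) > 80, 'Yes', 'No')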

Python pandas apply too slow Fuzzy Match

import pandas as pd

def fuzzy_clean(i, dfr, merge_list, key):
    # narrow dfr down to the rows that match this row on every merge column
    for col in range(0, len(merge_list)):
        if col == 0:
            scaled_down = dfr[dfr[merge_list[col]] == i[merge_list[col]]]
        else:
            scaled_down = scaled_down[scaled_down[merge_list[col]] == i[merge_list[col]]]
    if len(scaled_down) > 0:
        if i[key] in scaled_down[key].values.tolist():
            return i[key]
        else:
            # return the closest timestamp in the filtered frame
            return pd.to_datetime(scaled_down.loc[(scaled_down[key] - i[key]).abs().idxmin(), key])
    else:
        return i[key]

df[key] = df.apply(lambda i: fuzzy_clean(i, dfr, merge_list, key), axis=1)
I'm trying to eventually merge two dataframes, dfr and df. The issue is that I need to merge on about nine columns, one of which is a timestamp that doesn't quite match up between the two dataframes: sometimes it lags slightly, sometimes it leads. The function above works, but in practice it is just too slow when running through hundreds of thousands of rows.
merge_list is a list of columns that the dataframes share and that match up 100%.
key is the name of the 'timestamp' column, which both dataframes share and which doesn't match up well.
Any suggestions for speeding this up would be greatly appreciated!
The data looks like the following:
df:
timestamp A B C
0 100 x y z
1 101 y i u
2 102 r a e
3 103 q w e
dfr:
timestamp A B C
0 100.01 x y z
1 100.99 y i u
2 101.05 y i u
3 102 r a e
4 103.01 q w e
5 103.20 q w e
I want df to look like the following:
timestamp A B C
0 100.01 x y z
1 100.99 y i u
2 102 r a e
3 103.01 q w e
Adding the final merge for reference:
def fuzzy_merge(df_left, df_right, on, key, how='outer'):
    df_right[key] = df_right.apply(lambda i: fuzzy_clean(i, df_left, on, key), axis=1)
    return pd.merge(df_left, df_right, on=on+[key], how=how, indicator=True).sort_values(key)
I've found a solution that I believe works. Pandas has merge_asof; I'm still checking for possible double counting, but it seems to do a decent job:
pd.merge_asof(left_df, right_df, on='timestamp', by=merge_list, direction='nearest')
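A hedged, runnable sketch of that call on the sample data from the question. merge_asof requires both frames sorted on the key and matching key dtypes; copying the right-hand timestamp into a regular column (ts_right, a name introduced here) keeps it in the output, since merge_asof otherwise reports the left frame's timestamps:
import pandas as pd

df = pd.DataFrame({'timestamp': [100, 101, 102, 103],
                   'A': list('xyrq'), 'B': list('yiaw'), 'C': list('zuee')})
dfr = pd.DataFrame({'timestamp': [100.01, 100.99, 101.05, 102.0, 103.01, 103.20],
                    'A': list('xyyrqq'), 'B': list('yiiaww'), 'C': list('zuueee')})
merge_list = ['A', 'B', 'C']

out = pd.merge_asof(df.astype({'timestamp': 'float64'}),    # key dtypes must match
                    dfr.assign(ts_right=dfr['timestamp']),  # keep the right timestamps
                    on='timestamp', by=merge_list, direction='nearest')
# out['ts_right'] -> 100.01, 100.99, 102.00, 103.01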

conditional column output for pandas dataframe

I have a pandas DataFrame looking like this:
nameA statusA nameB statusB
a Q x X
b Q y X
c X z Q
d X o Q
e Q p X
f Q r Q
I want to print the rows of this dataframe based on the following rule: output column nameA if statusA is Q; else if statusB is Q, output column nameB. If statusA and statusB are both Q, output both nameA and nameB.
Is there a one-liner for this?
UPDATE:
expected output:
a,Q
b,Q
z,Q
o,Q
e,Q
f,Q,r,Q
data['con'] = data['statusA'] + data['statusB']
data.apply(lambda v: v['nameA'] if v['con'] == 'QX' else v['nameB'] if v['con'] == 'XQ' else v['nameA'] + ',' + v['nameB'], axis=1)
0 a
1 b
2 z
3 o
4 e
5 f,r
dtype: object
You can use string concatenation to produce the exact expected output. Something like:
data.apply(lambda v: v['nameA'] + ',Q' if v['con'] == 'QX' else v['nameB'] + ',Q' if v['con'] == 'XQ' else v['nameA'] + ',Q,' + v['nameB'] + ',Q', axis=1)
0 a,Q
1 b,Q
2 z,Q
3 o,Q
4 e,Q
5 f,Q,r,Q
dtype: object
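A vectorized alternative with numpy.select avoids the row-wise apply; a sketch, assuming the statuses only take the values 'Q' and 'X':
import numpy as np
import pandas as pd

data = pd.DataFrame({'nameA': list('abcdef'), 'statusA': list('QQXXQQ'),
                     'nameB': list('xyzopr'), 'statusB': list('XXQQXQ')})
both = (data['statusA'] == 'Q') & (data['statusB'] == 'Q')
# conditions are checked in order, so the both-Q case must come first
out = np.select(
    [both, data['statusA'] == 'Q', data['statusB'] == 'Q'],
    [data['nameA'] + ',Q,' + data['nameB'] + ',Q',
     data['nameA'] + ',Q',
     data['nameB'] + ',Q'],
    default='')
print(out)  # ['a,Q' 'b,Q' 'z,Q' 'o,Q' 'e,Q' 'f,Q,r,Q']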
