I have a pandas DataFrame looking like this:
nameA statusA nameB statusB
a Q x X
b Q y X
c X z Q
d X o Q
e Q p X
f Q r Q
I want to print the rows of this dataframe based on the following rule: output column nameA if statusA is Q, else output column nameB if statusB is Q. In case statusA and statusB are both Q, both columns nameA and nameB should be output.
Is there a one-liner for this?
UPDATE:
expected output:
a,Q
b,Q
z,Q
o,Q
e,Q
f,Q,r,Q
> data['con'] = data['statusA'] + data['statusB']
> data.apply(lambda v: v['nameA'] if v['con'] == 'QX' else v['nameB'] if v['con'] == 'XQ' else v['nameA'] + ',' + v['nameB'], axis=1)
0 a
1 b
2 z
3 o
4 e
5 f,r
dtype: object
You can use string concatenation to produce the exact result. Something like:
> data.apply(lambda v: v['nameA'] + ',Q' if v['con'] == 'QX' else v['nameB'] + ',Q' if v['con'] == 'XQ' else v['nameA'] + ',Q,' + v['nameB'] + ',Q', axis=1)
0 a,Q
1 b,Q
2 z,Q
3 o,Q
4 e,Q
5 f,Q,r,Q
dtype: object
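If you'd rather avoid apply, the same three-way choice can be written vectorized with numpy.select. A minimal sketch, assuming (as in the sample data) that each status is always either 'Q' or 'X':
import numpy as np

a_q = data['statusA'].eq('Q')  # rows where statusA is Q
b_q = data['statusB'].eq('Q')  # rows where statusB is Q

# first matching condition wins: both Q, then A only, then B only
out = np.select(
    [a_q & b_q, a_q, b_q],
    [data['nameA'] + ',Q,' + data['nameB'] + ',Q',
     data['nameA'] + ',Q',
     data['nameB'] + ',Q'],
    default='',
)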
I have a dataframe with some columns delimited with '|', and I need to flatten this dataframe. Example:
name type
a l
b m
c|d|e n
For this df, I want to flatten it to:
name type
a l
b m
c n
d n
e n
To do this, I used this command:
df = df.assign(name=df.name.str.split('|')).explode('name').drop_duplicates()
Now I want to do one more thing besides the above flatten operation:
name type co_occur
a l
b m
c n d
c n e
d n e
That is, not only split 'c|d|e' into multiple rows, but also create a new column 'co_occur' that records that 'c', 'd' and 'e' co-occur with each other.
I don't see an easy way to do this by modifying:
df = df.assign(name=df.name.str.split('|')).explode('name').drop_duplicates()
I think this is what you want. Use combinations and piece everything together:
from itertools import combinations
import io
import pandas as pd
data = '''name type
a l
b m
c|d|e n
j|k o
f|g|h|i p
'''
df = pd.read_csv(io.StringIO(data), sep=r'\s+', engine='python')
# hold the new dataframes as you iterate via apply()
df_hold = []
def explode_combos(x):
    # one output row per 2-element combination of the names in this row
    combos = list(combinations(x['name'].split('|'), 2))
    df_hold.append(pd.DataFrame([{'name': c[0], 'type': x['type'], 'co_occur': c[1]}
                                 for c in combos]))
    return
# only apply() to those rows that need to be exploded
dft = df[df['name'].str.contains(r'\|')].apply(explode_combos, axis=1)
# concatenate the result
dfn = pd.concat(df_hold)
# add back to rows that weren't operated on (see the ~)
df_final = pd.concat([df[~df['name'].str.contains(r'\|')], dfn]).fillna('')
name type co_occur
0 a l
1 b m
0 c n d
1 c n e
2 d n e
0 j o k
0 f p g
1 f p h
2 f p i
3 g p h
4 g p i
5 h p i
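A variant that avoids collecting frames in the df_hold side list is to build plain row dicts in a loop and construct the result once at the end. A sketch under the same assumptions (iterrows is fine here since only the pipe-delimited rows do extra work):
rows = []
for _, row in df.iterrows():
    parts = row['name'].split('|')
    if len(parts) == 1:
        rows.append({'name': parts[0], 'type': row['type'], 'co_occur': ''})
    else:
        # one row per 2-element combination, e.g. c|d|e -> (c,d), (c,e), (d,e)
        for a, b in combinations(parts, 2):
            rows.append({'name': a, 'type': row['type'], 'co_occur': b})
df_final = pd.DataFrame(rows)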
I have one dataframe which I have to divide into 2 dataframes.
Example:
Project_Number Indication
S100 X
S100 Y
S200 Z
S300 P
S300 Q
S300 R
S400 S
Now I have to divide it into 2 based on Project_Number. If a particular Project_Number has more than one row, those rows should go into the first dataframe; if it has a single row, it should go into the second.
Output:
df1-
Project_Number Indication
S100 X
S100 Y
S300 P
S300 Q
S300 R
df2-
Project_Number Indication
S200 Z
S400 S
Use Series.duplicated with keep=False for all dupes:
m = df['Project_Number'].duplicated(keep=False)
df1 = df[m]
df2 = df[~m]
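If you ever need a cutoff other than "appears more than once", the same mask can be built with groupby and transform('size'). A sketch:
m = df.groupby('Project_Number')['Project_Number'].transform('size') > 1
df1 = df[m]
df2 = df[~m]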
You can do this in a few steps using groupby() and duplicated():
df = pd.DataFrame([x.split(" ") for x in ("""S100 X
S100 Y
S200 Z
S300 P
S300 Q
S300 R
S400 S""").split("\n")], columns="Project_Number,Indication".split(","))
(has_multiple1, df1), (has_multiple2, df2) = list(df.groupby(df['Project_Number'].duplicated(keep=False)))
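Note the tuple unpacking assumes both groups exist; if one side could be empty, collecting the groups into a dict is safer. A sketch:
groups = dict(list(df.groupby(df['Project_Number'].duplicated(keep=False))))
df1 = groups.get(True, df.iloc[0:0])   # duplicated projects, or an empty frame
df2 = groups.get(False, df.iloc[0:0])  # single-row projects, or an empty frame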
Consider these series:
>>> a = pd.Series('abc a abc c'.split())
>>> b = pd.Series('a abc abc a'.split())
>>> pd.concat((a, b), axis=1)
0 1
0 abc a
1 a abc
2 abc abc
3 c a
>>> unknown_operation(a, b)
0 False
1 True
2 True
3 False
The desired logic is to determine if the string in the left column is a substring of the string in the right column. pd.Series.str.contains does not accept another Series, and pd.Series.isin checks if the value exists in the other series (not in the same row specifically). I'm interested to know if there's a vectorized solution (not using .apply or a loop), but it may be that there isn't one.
Let us try numpy's defchararray, which is vectorized (np.char.find is the same function under its public alias):
from numpy.core.defchararray import find
find(df[1].values.astype(str), df[0].values.astype(str)) != -1
Out[740]: array([False,  True,  True, False])
IIUC,
df[1].str.split('', expand=True).eq(df[0], axis=0).any(axis=1) | df[1].eq(df[0])
Output:
0    False
1     True
2     True
3    False
dtype: bool
Note this only catches single-character substrings (plus exact equality via the final eq), so it happens to work for this example but not for longer substrings.
I tested various functions with a randomly generated DataFrame of 1,000,000 five-letter entries.
Running on my machine, the averages of 3 runs were:
zip (0.21s) > v_find (0.79s) > to_list (1s) > any (3.55s) > apply (8.6s)
Hence, I would recommend using zip:
[x[0] in x[1] for x in zip(df['A'], df['B'])]
or vectorized find (as proposed by BENY)
np.char.find(df['B'].values.astype(str), df['A'].values.astype(str)) != -1
My test setup:
import random
import string
import numpy as np
import pandas as pd

n = 1_000_000  # number of rows

def generate_string(length):
    return ''.join(random.choices(string.ascii_uppercase + string.digits, k=length))

A = [generate_string(5) for x in range(n)]
B = [generate_string(5) for y in range(n)]
df = pd.DataFrame({"A": A, "B": B})

# names mirror the labels above (some shadow builtins; fine in a throwaway benchmark)
to_list = pd.Series([a in b for a, b in df[['A', 'B']].values.tolist()])
apply = df.apply(lambda s: s["A"] in s["B"], axis=1)
v_find = np.char.find(df['B'].values.astype(str), df['A'].values.astype(str)) != -1
any = df["B"].str.split('', expand=True).eq(df["A"], axis=0).any(axis=1) | df["B"].eq(df["A"])
zip = [x[0] in x[1] for x in zip(df['A'], df['B'])]
I have a pandas dataframe consisting of a single column. I want to split this column on the '&' sign and add the data to the right of each '=' sign as new columns. Examples are below.
The dataframe I have;
tags
0 letter1=A&letter2=B&letter3=C
1 letter1=D&letter2=E&letter3=F
2 letter1=G&letter2=H&letter3=I
3 letter1=J&letter2=K&letter3=L
4 letter1=M&letter2=N&letter3=O
5 letter1=P&letter2=R&letter3=S
. .
. .
The dataframe I want to get:
letter1 letter2 letter3
0 A B C
1 D E F
2 G H I
3 J K L
4 M N O
.
.
I tried to do something with this code snippet.
columnname= df["tags"][0].split("&")[i].split("=")[0]
value =df["tags"][0].split("&")[i].split("=")[1]
But I'm not sure how to do it for the whole dataframe. I am looking for a faster, more stable way.
Thanks in advance,
Do this:
import pandas as pd
tags = [
"letter1=A&letter2=B&letter3=C",
"letter1=D&letter2=E&letter3=F",
"letter1=G&letter2=H&letter3=I",
"letter1=J&letter2=K&letter3=L",
"letter1=M&letter2=N&letter3=O",
"letter1=P&letter2=R&letter3=S"
]
df = pd.DataFrame({"tags": tags})
df["letter1"] = df["tags"].apply(lambda x: x.split("&")[0].split("=")[-1])
df["letter2"] = df["tags"].apply(lambda x: x.split("&")[1].split("=")[-1])
df["letter3"] = df["tags"].apply(lambda x: x.split("&")[2].split("=")[-1])
df = df[["letter1", "letter2", "letter3"]]
df
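If the keys aren't fixed ahead of time, a more generic sketch (assuming every tag is a well-formed &-separated list of key=value pairs) is to parse each row into a dict and let pandas build the columns:
parsed = df["tags"].apply(lambda x: dict(kv.split("=") for kv in x.split("&")))
df = pd.DataFrame(parsed.tolist())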
Split into separate columns via str.split, using & as the separator:
step1 = df.tags.str.split("&", expand=True)
Get the new columns from the first row of step1:
new_columns = step1.loc[0, :].str[:-2].array
Get rid of the letter1= prefix in each column by keeping only the last character (the values here are single letters), and set new_columns as the header:
step1.set_axis(new_columns, axis='columns').transform(lambda col: col.str[-1])
letter1 letter2 letter3
0 A B C
1 D E F
2 G H I
3 J K L
4 M N O
5 P R S
d=list(df["tags"])
r={}
for i in d:
for ele in i.split("&"):
if ele.split("=")[0] in r.keys():
r[ele.split("=")[0]].append(ele.split("=")[1])
else:
r[ele.split("=")[0]]=[]
r[ele.split("=")[0]].append(ele.split("=")[1])
df = pd.DataFrame({i:pd.Series(r[i]) for i in r})
print (df)
Using regex
import pandas as pd
import re
tags = [
"letter1=A&letter2=B&letter3=C",
"letter1=D&letter2=E&letter3=F",
"letter1=G&letter2=H&letter3=I",
"letter1=J&letter2=K&letter3=L",
"letter1=M&letter2=N&letter3=O",
"letter1=P&letter2=R&letter3=S"
]
df = pd.DataFrame({"tags": tags})
pattern = re.compile(r"=(\w+)")  # capture the value after each '='
df['letter1'], df['letter2'], df['letter3'] = zip(*df["tags"].apply(lambda x: pattern.findall(x)))
Output
tags letter1 letter2 letter3
0 letter1=A&letter2=B&letter3=C A B C
1 letter1=D&letter2=E&letter3=F D E F
2 letter1=G&letter2=H&letter3=I G H I
3 letter1=J&letter2=K&letter3=L J K L
4 letter1=M&letter2=N&letter3=O M N O
5 letter1=P&letter2=R&letter3=S P R S
def fuzzy_clean(i, dfr, merge_list, key):
    # narrow dfr down to the rows that match row i on every merge column
    for col in range(0, len(merge_list)):
        if col == 0:
            scaled_down = dfr[dfr[merge_list[col]] == i[merge_list[col]]]
        else:
            scaled_down = scaled_down[scaled_down[merge_list[col]] == i[merge_list[col]]]
    if len(scaled_down) > 0:
        if i[key] in scaled_down[key].values.tolist():
            return i[key]
        else:
            # otherwise return the value of key in scaled_down closest to i[key]
            return scaled_down[key].loc[(scaled_down[key] - i[key]).abs().idxmin()]
    else:
        return i[key]

df[key] = df.apply(lambda i: fuzzy_clean(i, dfr, merge_list, key), axis=1)
I'm trying to eventually merge together two dataframes, dfr and df. The issue is that I need to merge on about 9 columns, one of which is a timestamp that doesn't quite match up between the two dataframes: sometimes it lags slightly, sometimes it leads. I wrote the function above, which works, but in practice it is just too slow when running through hundreds of thousands of rows.
merge_list is a list of columns that the two dataframes share and that match up exactly.
key is the name of the 'timestamp' column they share, which is what doesn't match up well.
Any suggestions for speeding this up would be greatly appreciated!
The data looks like the following:
df:
timestamp A B C
0 100 x y z
1 101 y i u
2 102 r a e
3 103 q w e
dfr:
timestamp A B C
0 100.01 x y z
1 100.99 y i u
2 101.05 y i u
3 102 r a e
4 103.01 q w e
5 103.20 q w e
I want df to look like the following:
timestamp A B C
0 100.01 x y z
1 100.99 y i u
2 102 r a e
3 103.01 q w e
Adding the final merge for reference:
def fuzzy_merge(df_left, df_right, on, key, how='outer'):
    df_right[key] = df_right.apply(lambda i: fuzzy_clean(i, df_left, on, key), axis=1)
    return pd.merge(df_left, df_right, on=on + [key], how=how, indicator=True).sort_values(key)
I've found a solution that I believe works: pandas has merge_asof. I'm still checking for possible double counting, but it seems to do a decent job.
pd.merge_asof(left_df, right_df, on='timestamp', by=merge_list, direction='nearest')
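For reference, a sketch of that call on the sample data above (timestamp_r is a made-up name to keep dfr's timestamps visible in the result, since merge_asof keeps the left frame's key; the astype matters because merge_asof wants both key columns to share a dtype):
import pandas as pd

df = pd.DataFrame({'timestamp': [100, 101, 102, 103],
                   'A': list('xyrq'), 'B': list('yiaw'), 'C': list('zuee')})
dfr = pd.DataFrame({'timestamp': [100.01, 100.99, 101.05, 102, 103.01, 103.20],
                    'A': list('xyyrqq'), 'B': list('yiiaww'), 'C': list('zuueee')})

dfr['timestamp_r'] = dfr['timestamp']  # keep the right-hand timestamps as data
merged = pd.merge_asof(df.astype({'timestamp': float}).sort_values('timestamp'),
                       dfr.sort_values('timestamp'),
                       on='timestamp', by=['A', 'B', 'C'], direction='nearest')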