Element-wise ternary conditional operation on dataframes - python

Say, given dataframes df1, df2, df3, what is the best way to get df = df1 if (df2 > 0) else df3, element-wise?

You can use df.where to achieve this:
In [3]:
df1 = pd.DataFrame(np.random.randn(5,3), columns=list('abc'))
df2 = pd.DataFrame(np.random.randn(5,3), columns=list('abc'))
df3 = pd.DataFrame(np.random.randn(5,3), columns=list('abc'))
print(df1)
print(df2)
print(df3)
a b c
0 -0.378401 1.456254 -0.327311
1 0.491864 -0.757420 -0.014689
2 0.028873 -0.906428 -0.252586
3 -0.686849 1.515643 1.065322
4 0.570760 -0.857298 -0.152426
a b c
0 1.273215 1.275937 -0.745570
1 -0.460257 -0.756481 1.043673
2 0.452731 1.071703 -0.454962
3 0.418926 1.395290 -1.365873
4 -0.661421 0.798266 0.384397
a b c
0 -0.641351 -1.469222 0.160428
1 1.164031 1.781090 -1.218099
2 0.096094 0.821062 0.815384
3 -1.001950 -1.851345 0.772869
4 -1.137854 1.205580 -0.922832
In [4]:
df = df1.where(df2 > 0, df3)
df
Out[4]:
a b c
0 -0.378401 1.456254 0.160428
1 1.164031 1.781090 -0.014689
2 0.028873 -0.906428 0.815384
3 -0.686849 1.515643 0.772869
4 -1.137854 -0.857298 -0.152426

Alternatively:
df = df1[df2 > 0].combine_first(df3)
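As another option (not from the answers above, just a sketch), numpy.where makes the same element-wise pick; it ignores index alignment and returns a plain array, so this assumes the three frames share the same shape and label order:
# pick from df1 where df2 > 0, otherwise from df3, then restore the labels
df = pd.DataFrame(np.where(df2 > 0, df1, df3), index=df1.index, columns=df1.columns)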

Related

Use dataframe column containing "column name strings", to return values from dataframe based on column name and index without using .apply()

I have a dataframe as follows:
df=pandas.DataFrame()
df['A'] = numpy.random.random(10)
df['B'] = numpy.random.random(10)
df['C'] = numpy.random.random(10)
df['Col_name'] = numpy.random.choice(['A','B','C'],size=10)
I want to obtain an output that uses 'Col_name' and the respective index of the dataframe row to lookup the value in the dataframe.
I can get the desired output with .apply() as follows:
df['output'] = df.apply(lambda x: x[ x['Col_name'] ], axis=1)
.apply() is slow over a large dataframe because it iterates row by row. Is there an obvious solution in pandas that is faster/vectorised?
You can also pick each column name (or give a list of possible names), use it as a mask to filter your dataframe, pick the values from the matching column, and assign them to all rows matching the mask. Then repeat this for the next column.
for column_name in df:  # or: for column_name in ['A', 'B', 'C']
    df.loc[df['Col_name'] == column_name, 'output'] = df[column_name]
Rows that will not match any mask will have NaN values.
P.S. According to my test with 10,000,000 random rows, the method with .apply() takes 2 min 24 s to finish while this method takes only 4.3 s.
Use melt to flatten your dataframe and keep the rows where Col_name equals the variable column:
df['output'] = df.melt('Col_name', ignore_index=False).query('Col_name == variable')['value']
print(df)
# Output
A B C Col_name output
0 0.202197 0.430735 0.093551 B 0.430735
1 0.344753 0.979453 0.999160 C 0.999160
2 0.500904 0.778715 0.074786 A 0.500904
3 0.050951 0.317732 0.363027 B 0.317732
4 0.722624 0.026065 0.424639 C 0.424639
5 0.578185 0.626698 0.376692 C 0.376692
6 0.540849 0.805722 0.528886 A 0.540849
7 0.918618 0.869893 0.825991 C 0.825991
8 0.688967 0.203809 0.734467 B 0.203809
9 0.811571 0.010081 0.372657 B 0.010081
Transformation after melt:
>>> df.melt('Col_name', ignore_index=False)
Col_name variable value
0 B A 0.202197
1 C A 0.344753
2 A A 0.500904 # keep
3 B A 0.050951
4 C A 0.722624
5 C A 0.578185
6 A A 0.540849 # keep
7 C A 0.918618
8 B A 0.688967
9 B A 0.811571
0 B B 0.430735 # keep
1 C B 0.979453
2 A B 0.778715
3 B B 0.317732 # keep
4 C B 0.026065
5 C B 0.626698
6 A B 0.805722
7 C B 0.869893
8 B B 0.203809 # keep
9 B B 0.010081 # keep
0 B C 0.093551
1 C C 0.999160 # keep
2 A C 0.074786
3 B C 0.363027
4 C C 0.424639 # keep
5 C C 0.376692 # keep
6 A C 0.528886
7 C C 0.825991 # keep
8 B C 0.734467
9 B C 0.372657
Update
Alternative with set_index and stack for #Rabinzel:
df['output'] = (
    df.set_index('Col_name', append=True).stack()
      .loc[lambda x: x.index.get_level_values(1) == x.index.get_level_values(2)]
      .droplevel([1, 2])
)
print(df)
# Output
A B C Col_name output
0 0.209953 0.332294 0.812476 C 0.812476
1 0.284225 0.566939 0.087084 A 0.284225
2 0.815874 0.185154 0.155454 A 0.815874
3 0.017548 0.733474 0.766972 A 0.017548
4 0.494323 0.433719 0.979399 C 0.979399
5 0.875071 0.789891 0.319870 B 0.789891
6 0.475554 0.229837 0.338032 B 0.229837
7 0.123904 0.397463 0.288614 C 0.288614
8 0.288249 0.631578 0.393521 A 0.288249
9 0.107245 0.006969 0.367748 C 0.367748
import pandas as pd
import numpy as np
df=pd.DataFrame()
df['A'] = np.random.random(10)
df['B'] = np.random.random(10)
df['C'] = np.random.random(10)
df['Col_name'] = np.random.choice(['A','B','C'],size=10)
df["output"] = np.nan
Even though you do not like going row by row, I still routinely use loops to go through each row, just to know where it breaks when it breaks. Here are two loops, just to satisfy myself. The output column is created ahead of time with NaN values because the loops need it to exist.
# each row, by index (using .loc to avoid chained assignment)
for i in range(len(df)):
    df.loc[i, 'output'] = df.loc[i, df.loc[i, 'Col_name']]
# each row, but selected by column name
for col in df['Col_name'].unique():
    mask = df['Col_name'] == col
    df.loc[mask, 'output'] = df.loc[mask, col]
Here are some "non-loop" ways to do so (note that DataFrame.lookup was deprecated in pandas 1.2 and removed in 2.0):
df["output"] = df.lookup(df.index, df.Col_name)
# caution: the variant below does not work as written, because df[df['Col_name']]
# selects whole columns (one per row value) rather than one value per row
df['output'] = np.where(np.isnan(df['output']), df[df['Col_name']], np.nan)
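Since lookup is gone in current pandas, a factorize-based replacement (roughly what the pandas deprecation notes suggest) looks like this sketch:
# factorize the column names, reorder the frame's columns to match the
# factorized order, then pick one value per row from the underlying array
idx, cols = pd.factorize(df['Col_name'])
df['output'] = df.reindex(cols, axis=1).to_numpy()[np.arange(len(df)), idx]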

`pd.concat` with `join=='inner'` doesn't produce intersection of pandas dataframes

I am trying to extract common rows from several dataframes using pd.concat:
>>> import numpy as np
>>> import pandas as pd
>>> x = np.random.random(size=(5, 3))
>>> df1 = pd.DataFrame(x)
>>> df2 = pd.DataFrame(x[1:3])
>>> df3 = pd.DataFrame(x[2:4])
>>> df1
0 1 2
0 0.257662 0.453542 0.805230
1 0.060493 0.463148 0.715994
2 0.452379 0.470137 0.965709
3 0.447546 0.964252 0.163247
4 0.187212 0.973557 0.871090
>>> df2
0 1 2
0 0.060493 0.463148 0.715994
1 0.452379 0.470137 0.965709
>>> df3
0 1 2
0 0.452379 0.470137 0.965709
1 0.447546 0.964252 0.163247
As you can see, only the row 0.452379 0.470137 0.965709 is common to all three dataframes. To extract it, I tried:
>>> pd.concat([df1, df2, df3], join='inner')
0 1 2
0 0.257662 0.453542 0.805230
1 0.060493 0.463148 0.715994
2 0.452379 0.470137 0.965709
3 0.447546 0.964252 0.163247
4 0.187212 0.973557 0.871090
0 0.060493 0.463148 0.715994
1 0.452379 0.470137 0.965709
0 0.452379 0.470137 0.965709
1 0.447546 0.964252 0.163247
Thus, join==inner doesn't seem to work! I should also point out that ignore_index=True has no effect on the behavior. In an article on Real Python, using axis=1 is suggested. However, it is wrong in my opinion:
>>> pd.concat([df1, df2, df3], join='inner', axis=1)
0 1 2 0 1 2 0 1 2
0 0.257662 0.453542 0.805230 0.060493 0.463148 0.715994 0.452379 0.470137 0.965709
1 0.060493 0.463148 0.715994 0.452379 0.470137 0.965709 0.447546 0.964252 0.163247
What is wrong with what I am doing? Also, how would I extract common rows from several dataframes if this way doesn't work? I am using Pandas version 0.25.3.
In short, go with reduce(lambda left, right: pd.merge(left, right, on=cols), dfs)
(see Method #2 - make sure to include from functools import reduce), but please see an explanation for pd.concat (Method #1) as well:
Method #1 (concat): I think the most dynamic, robust pd.concat way (of the ways I've tried with concat specifically) is to set all of the columns as the index before concatenating. The only major benefit of this solution over the second method below is that you don't have to import an additional library; however, you could also write similar code with merge without using another library:
dfs = [df1, df2, df3]
cols = [*df1.columns]  # enclosing with [*] is the same as tolist()
for df in dfs:
    df.set_index(cols, inplace=True)  # can only use inplace when looping through dfs (at least using my simpler method)
pd.concat(dfs, join='inner', axis=1).reset_index()  # see below paragraph for explanation
Out[1]:
0 1 2
0 0.452379 0.470137 0.965709
Please note that join='inner' means you are joining on the index, NOT on the unique rows. Also, with the default axis=0, join='inner' only intersects the columns; since all three frames share the same columns here, that is why effectively nothing happens in the original attempt.
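To see what join='inner' actually does with the default axis=0, here is a small made-up example (not from the question): the column sets are intersected, while every row from every frame is kept:
a = pd.DataFrame({'x': [1, 2], 'y': [3, 4]})
b = pd.DataFrame({'x': [5, 6], 'z': [7, 8]})
# only the shared column 'x' survives; all four rows remain
print(pd.concat([a, b], join='inner'))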
Method #2: (merge with reduce):
#Anky pointed out that how='inner' is default with merge. This was actually the first answer I posted, but I got confused about expected output and went full circle. Please see the simplest answer below:
from functools import reduce
dfs = [df1, df2, df3]
cols = [*df1.columns]
reduce(lambda left,right: pd.merge(left,right,on=cols), dfs)
Out[2]:
0 1 2
0 0.452379 0.470137 0.965709
If you are attempting to look for common rows:
temp = pd.concat([df1, df2, df3])
temp[temp.duplicated()]
I'm sure there is a more elegant solution to this, however.
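A slightly tighter variant of the same idea (a sketch, assuming none of the individual frames contains internal duplicates) is to count how often each row occurs in the concatenation and keep only the rows that occur once per frame:
temp = pd.concat([df1, df2, df3])
# a row common to all three frames appears exactly three times
counts = temp.groupby(list(temp.columns)).size()
common = counts[counts == 3].index.to_frame(index=False)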
Try this,
df = pd.merge(df1, df2, how='inner', on=[col1, col2, col3])
# add an extra tag column
df_list = [df1, df2, df3]
for i, dfi in enumerate(df_list):
    dfi['tag'] = i + 1
# merge DataFrame
df = pd.concat([df1, df2, df3], ignore_index=True)
# find the duplicated rows
cols = df.columns[:-1].tolist()
cond = df[cols].duplicated(keep=False)
obj = df[cond].groupby(cols)['tag'].agg(tuple)
# filter
cond = obj.map(len) == len(df_list)
obj[cond]
obj example:
# 0 1 2
# 0.148080 0.837398 0.565498 (1, 3)
# 0.572673 0.256735 0.620923 (1, 2, 3)
# 0.822542 0.856137 0.645639 (1, 2)
# Name: tag, dtype: object
In a similar fashion to what #Ajay A said,
import numpy as np
import pandas as pd
x = np.random.random(size=(5, 3))
df1 = pd.DataFrame(x)
df2 = pd.DataFrame(x[1:3])
df3 = pd.DataFrame(x[2:4])
Then,
df1
Out[22]:
0 1 2
0 0.845894 0.530659 0.629198
1 0.697229 0.225557 0.314540
2 0.972633 0.685077 0.191109
3 0.069966 0.961317 0.352933
4 0.176633 0.663602 0.235032
df2
Out[23]:
0 1 2
0 0.697229 0.225557 0.314540
1 0.972633 0.685077 0.191109
df3
Out[24]:
0 1 2
0 0.972633 0.685077 0.191109
1 0.069966 0.961317 0.352933
Then you can use pd.merge with how='inner'
pd.merge(df2, df3, how='inner')
Out[25]:
0 1 2
0 0.972633 0.685077 0.191109
or, if what you are looking for is the intersection of the three,
pd.merge(pd.merge(df1,df2,how='inner'), df3, how='inner')
Out[26]:
0 1 2
0 0.972633 0.685077 0.191109
Use a for loop to handle a df_list.
df_list = [df1, df2, df3]
df_intersection = df1
for df in df_list[1:]:
    df_intersection = pd.merge(df_intersection, df, how='inner')

pd.merge and check changed data

If I have these dataframes:
df1 = pd.DataFrame({'index': [1,2,3,4],
'col1': ['a','b','c','d'],
'col2': ['h','e','l','p']})
df2 = pd.DataFrame({'index': [1,2,3,4],
'col1': ['a','e','f','d'],
'col2': ['h','e','lp','p']})
df1
index col1 col2
0 1 a h
1 2 b e
2 3 c l
3 4 d p
df2
index col1 col2
0 1 a h
1 2 e e
2 3 f lp
3 4 d p
I want to merge them and see whether or not the rows are different and get an output like this
index col1 col1_validation col2 col2_validation
0 1 a True h True
1 2 b False e True
2 3 c False l False
3 4 d True p True
how can I achieve that?
It looks like col1 and col2 from your "merged" dataframe are just taken from df1. In that case, you can simply compare the col1, col2 between the original data frames and add those as columns:
cols = ["col1", "col2"]
val_cols = ["col1_validation", "col2_validation"]
# (optional) new dataframe, so you don't mutate df1
df = df1.copy()
new_cols = (df1[cols] == df2[cols])
df[val_cols] = new_cols
You can merge and compare the two data frames with something similar to the following:
df1 = pd.DataFrame({'index': [1,2,3,4],
'col1': ['a','b','c','d'],
'col2': ['h','e','l','p']})
df2 = pd.DataFrame({'index': [1,2,3,4],
'col1': ['a','e','f','d'],
'col2': ['h','e','lp','p']})
# give columns unique name when merging
df1.columns = df1.columns + '_df1'
df2.columns = df2.columns + '_df2'
# merge/combine data frames
combined = pd.concat([df1, df2], axis = 1)
# add calculated columns
combined['col1_validation'] = combined['col1_df1'] == combined['col1_df2']
combined['col2_validation'] = combined['col2_df1'] == combined['col2_df2']
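If you are on pandas 1.1 or newer, DataFrame.compare is also worth knowing about here; it is not the exact layout asked for, but it reports only the cells that differ between two identically-labelled frames (a sketch, assuming df1 and df2 as originally defined in the question, before the column renaming above):
# each differing cell shows the value from df1 ('self') next to the value
# from df2 ('other'); cells that are equal are omitted
diff = df1.compare(df2)
print(diff)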

How to split pandas.DataFrame by index order?

If I have two pandas.DataFrame with the same columns.
df1 = pd.DataFrame(np.random.rand(5, 6), columns=list('abcdef'))
df2 = pd.DataFrame(np.random.rand(5, 6), columns=list('abcdef'))
I concatenate them into one:
df = pd.concat([df1, df2], ignore_index = False)
The index values now are not ignored.
After I perform some data manipulation without changing the index values, how can I reverse the concatenation, so that I end up with the two original data frames again?
I recommend using keys in concat
df = pd.concat([df1, df2], ignore_index=False, keys=['df1', 'df2'])
df
Out[28]:
a b c d e f
df1 0 0.426246 0.162134 0.231001 0.645908 0.282457 0.715134
1 0.973173 0.854198 0.419888 0.617750 0.115466 0.565804
2 0.474284 0.757242 0.452319 0.046627 0.935915 0.540498
3 0.046215 0.740778 0.204866 0.047914 0.143158 0.317274
4 0.311755 0.456133 0.704235 0.255057 0.558791 0.319582
df2 0 0.449926 0.330672 0.830240 0.861221 0.234013 0.299515
1 0.552645 0.620980 0.313907 0.039247 0.356451 0.849368
2 0.159485 0.620178 0.428837 0.315384 0.910175 0.020809
3 0.687249 0.824803 0.118434 0.661684 0.013440 0.611711
4 0.576244 0.915196 0.544099 0.750581 0.192548 0.477207
Convert back
df1, df2 = [y.reset_index(level=0, drop=True) for _, y in df.groupby(level=0)]
df1
Out[30]:
a b c d e f
0 0.426246 0.162134 0.231001 0.645908 0.282457 0.715134
1 0.973173 0.854198 0.419888 0.617750 0.115466 0.565804
2 0.474284 0.757242 0.452319 0.046627 0.935915 0.540498
3 0.046215 0.740778 0.204866 0.047914 0.143158 0.317274
4 0.311755 0.456133 0.704235 0.255057 0.558791 0.319582
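Since the keys end up in the outer index level, you can also slice each frame back out directly with xs, without grouping (a sketch based on the frame built above):
# cross-section on the outer level; xs drops that level automatically
df1_back = df.xs('df1')
df2_back = df.xs('df2')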
If you prefer to do it without groupby or keys, you could use this.
list_dfs = [df1, df2]
df = pd.concat(list_dfs, ignore_index = False)
new_dfs = []
counter = 0
for i in list_dfs:
    new_dfs.append(df[counter:counter + len(i)])
    counter += len(i)

How to assign a value_count output to a dataframe

I am trying to assign the output from a value_count to a new df. My code follows.
import pandas as pd
import glob
df = pd.concat((pd.read_csv(f, names=['date','bill_id','sponsor_id']) for f in glob.glob('/home/jayaramdas/anaconda3/df/s11?_s_b')))
column_list = ['date', 'bill_id']
df = df.set_index(column_list, drop = True)
df = df['sponsor_id'].value_counts()
df.columns=['sponsor', 'num_bills']
print (df)
The value counts are not being assigned the column headers I specified ('sponsor', 'num_bills'). I'm getting the following output from printing the head:
1036 426
791 408
1332 401
1828 388
136 335
Name: sponsor_id, dtype: int64
Your column count doesn't match: you read 3 columns from the csv and then set the index to 2 of them. value_counts produces a Series with the column values as the index and the counts as the values, so you need to reset_index and then overwrite the column names:
df = df.reset_index()
df.columns=['sponsor', 'num_bills']
Example:
In [276]:
df = pd.DataFrame({'col_name':['a','a','a','b','b']})
df
Out[276]:
col_name
0 a
1 a
2 a
3 b
4 b
In [277]:
df['col_name'].value_counts()
Out[277]:
a 3
b 2
Name: col_name, dtype: int64
In [278]:
type(df['col_name'].value_counts())
Out[278]:
pandas.core.series.Series
In [279]:
df = df['col_name'].value_counts().reset_index()
df.columns = ['col_name', 'count']
df
Out[279]:
col_name count
0 a 3
1 b 2
Appending value_counts() to multi-column dataframe:
df = pd.DataFrame({'C1':['A','B','A'],'C2':['A','B','A']})
vc_df = df.value_counts().to_frame('Count').reset_index()
display(df, vc_df)
C1 C2
0 A A
1 B B
2 A A
C1 C2 Count
0 A A 2
1 B B 1
