Updating values with another dataframe - Python

I have 2 pandas dataframes. The second one is contained in the first one. How can I replace the values in the first one with the ones in the second?
Consider this example:
import pandas as pd

df1 = pd.DataFrame(0, index=[1, 2, 3], columns=['a', 'b', 'c'])
df2 = pd.DataFrame(1, index=[1, 2], columns=['a', 'c'])

ris = [[1, 0, 1],
       [1, 0, 1],
       [0, 0, 0]]
and ris has the same index and columns as df1.
A possible solution is:
for i in df2.index:
    for j in df2.columns:
        df1.loc[i, j] = df2.loc[i, j]
But this is ugly.

I think you can use copy with combine_first:
df3 = df1.copy()
df1[df2.columns] = df2[df2.columns]
print(df1.combine_first(df3))
     a  b    c
1  1.0  0  1.0
2  1.0  0  1.0
3  0.0  0  0.0
Another solution is to create an empty DataFrame df4 with the index and columns from df1 and fill it with a double combine_first:
df4 = pd.DataFrame(index=df1.index, columns=df1.columns)
df4 = df4.combine_first(df2).combine_first(df1)
print(df4)
     a    b    c
1  1.0  0.0  1.0
2  1.0  0.0  1.0
3  0.0  0.0  0.0

Try
In [7]: df1['a'] = df2['a']

In [8]: df1['c'] = df2['c']
or, in one step:
In [14]: df1[['a','c']] = df2[['a','c']]
If the column names are not known:
In [25]: for col in df2.columns:
   ....:     df1[col] = df2[col]
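Not mentioned in the answers above, but pandas also provides DataFrame.update for exactly this pattern: it overwrites df1 in place with the non-NaN values from df2, aligning on index and columns. A minimal sketch with the example frames:
import pandas as pd

df1 = pd.DataFrame(0, index=[1, 2, 3], columns=['a', 'b', 'c'])
df2 = pd.DataFrame(1, index=[1, 2], columns=['a', 'c'])

df1.update(df2)   # in place; rows 1 and 2 of columns a and c become 1 (dtypes may be upcast)
print(df1)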


Drop Non-equivalent Multiindex Rows in Pandas Dataframe

Goal
If the min sub-column does not equal the max sub-column in any of the columns (ao, hia, cyp1a2s, cyp3a4s in this case), drop the row; keep it only when min equals max in every column (a NaN pair counts as equal).
Example
import numpy as np
import pandas as pd

arrays = [np.array(['ao', 'ao', 'hia', 'hia', 'cyp1a2s', 'cyp1a2s', 'cyp3a4s', 'cyp3a4s']),
          np.array(['min', 'max', 'min', 'max', 'min', 'max', 'min', 'max'])]
tuples = list(zip(*arrays))
index = pd.MultiIndex.from_tuples(tuples, names=['', ''])
df = pd.DataFrame(np.array([[1, 1, 0, 0, float('nan'), float('nan'), 0, 0],
                            [1, 1, 0, 0, float('nan'), 1, 0, 0],
                            [0, 2, 0, 0, float('nan'), float('nan'), 1, 1]]),
                  index=['1', '2', '3'], columns=index)
df
    ao       hia      cyp1a2s  cyp3a4s
   min  max  min  max  min  max  min  max
1  1.0  1.0  0.0  0.0  NaN  NaN  0.0  0.0
2  1.0  1.0  0.0  0.0  NaN  1.0  0.0  0.0
3  0.0  2.0  0.0  0.0  NaN  NaN  1.0  1.0
Want
df = pd.DataFrame(np.array([[1, 1, 0, 0, float('nan'), float('nan'), 0, 0]]),
                  index=['1'], columns=index)
df
    ao       hia      cyp1a2s  cyp3a4s
   min  max  min  max  min  max  min  max
1  1.0  1.0  0.0  0.0  NaN  NaN  0.0  0.0
Attempt
df.apply(lambda x: x['min'].map(str) == x['max'].map(str), axis=1)
KeyError: ('min', 'occurred at index 1')
Note
The actual dataframe has 50+ columns.
Use DataFrame.xs to select each second level of the MultiIndex as its own DataFrame, then replace NaNs:
df1 = df.xs('min', axis=1, level=1).fillna('nan')
df2 = df.xs('max', axis=1, level=1).fillna('nan')
Or convert data to strings:
df1 = df.xs('min', axis=1, level=1).astype('str')
df2 = df.xs('max', axis=1, level=1).astype('str')
Compare the DataFrames with DataFrame.eq, test whether every value in a row is True with DataFrame.all, and finally filter with boolean indexing:
df = df[df1.eq(df2).all(axis=1)]
print (df)
    ao       hia      cyp1a2s  cyp3a4s
   min  max  min  max  min  max  min  max
1  1.0  1.0  0.0  0.0  NaN  NaN  0.0  0.0
The reason df.apply() didn't work is that you need to reference both levels of the columns.
Also, .map(str) is not valid on the float64 values here, so .astype(str) is used instead.
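To make that concrete, here is a sketch (my illustration, not part of the original answer) of a corrected apply on the original MultiIndex frame: it references both column levels and compares as strings so that a NaN pair counts as equal:
cols = df.columns.get_level_values(0).unique()
mask = df.apply(lambda r: all(str(r[(c, 'min')]) == str(r[(c, 'max')]) for c in cols), axis=1)
print(df[mask])   # keeps only row '1' for the example data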
The following works for more than one column:
eqCols = ['cyp1a2s', 'hia']
neqCols = list(set(df.xs('min', level=1, axis=1).columns) - set(eqCols))
EQ = lambda r, c: r[c]['min'].astype(str) == r[c]['max'].astype(str)
df[df.apply(lambda r: ([EQ(r, c) for c in eqCols][0]) & ([(not EQ(r, c)) for c in neqCols][0]), axis=1)]

Pandas: Groupby each column in a different way

Let's say that I have the following data-frame:
import numpy as np
import pandas as pd

df = pd.DataFrame({"unique_id": [1, 1, 1], "att1_amr": [11, 11, 11], "att2_nominal": [1, np.nan, np.nan], "att3_nominal": [np.nan, 1, np.nan], "att4_bok": [33.33, 33.33, 33.33], "att5_nominal": [np.nan, np.nan, np.nan], "att6_zpq": [22.22, 22.22, 22.22]})
What I want to do is group the rows of the data-frame by unique_id so that I can apply one aggregation to the columns whose name contains the word nominal and another to all the other columns. To be more specific, I want to aggregate the nominal columns with sum(min_count=1) and the others with first() or last(). The result should be the following:
df_result = pd.DataFrame({"unique_id": [1], "att1_amr": [11], "att2_nominal": [1], "att3_nominal": [1], "att4_bok": [33.33], "att5_nominal": [np.nan], "att6_zpq": [22.22]})
Thank you!
You can build the dictionary dynamically: first map all columns containing nominal to a lambda function, then map all the other columns to 'last', merge the two dictionaries, and finally call DataFrameGroupBy.agg:
d1 = dict.fromkeys(df.columns[df.columns.str.contains('nominal')],
                   lambda x: x.sum(min_count=1))
d2 = dict.fromkeys(df.columns.difference(['unique_id'] + list(d1)), 'last')
d = {**d1, **d2}

df = df.groupby('unique_id').agg(d)
print (df)
           att2_nominal  att3_nominal  att5_nominal  att1_amr  att4_bok  att6_zpq
unique_id
1                   1.0           1.0           NaN        11     33.33     22.22
Another, cleaner solution:
d = {k: (lambda x: x.sum(min_count=1)) if 'nominal' in k else 'last'
     for k in df.columns.difference(['unique_id'])}

df = df.groupby('unique_id').agg(d)
print (df)
           att1_amr  att2_nominal  att3_nominal  att4_bok  att5_nominal  att6_zpq
unique_id
1                11           1.0           1.0     33.33           NaN     22.22
Why not just:
>>> df.ffill().bfill().drop_duplicates()
   att1_amr  att2_nominal  att3_nominal  att4_bok  att5_nominal  att6_zpq  unique_id
0        11           1.0           1.0     33.33           NaN     22.22          1
The solution provided by @jezrael works just fine and is the most elegant one; however, I ran into severe performance issues. Surprisingly, I found the following to be a much faster solution that achieves the same goal.
nominal_cols = df.filter(like="nominal").columns.values
other_cols = [col for col in df.columns.values if col not in nominal_cols and col != "unique_id"]
df1 = df.groupby('unique_id', as_index=False)[nominal_cols].sum(min_count=1)
df2 = df.groupby('unique_id', as_index=False)[other_cols].first()
pd.merge(df1, df2, on=["unique_id"], how="inner")
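If you want to verify the speed difference on your own data, a rough timing sketch (the helper function names are mine, not from either answer; actual numbers depend entirely on your data):
import time

def agg_with_dict(df):
    # dictionary-based aggregation from the answer above
    d = {k: (lambda x: x.sum(min_count=1)) if 'nominal' in k else 'last'
         for k in df.columns.difference(['unique_id'])}
    return df.groupby('unique_id').agg(d)

def agg_split_merge(df):
    # split-and-merge variant from this answer
    nominal_cols = df.filter(like="nominal").columns.values
    other_cols = [c for c in df.columns if c not in nominal_cols and c != "unique_id"]
    out1 = df.groupby('unique_id', as_index=False)[nominal_cols].sum(min_count=1)
    out2 = df.groupby('unique_id', as_index=False)[other_cols].first()
    return pd.merge(out1, out2, on="unique_id", how="inner")

for fn in (agg_with_dict, agg_split_merge):
    start = time.perf_counter()
    fn(df)
    print(fn.__name__, time.perf_counter() - start)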

Adding a new column with specific dtype in pandas

Can we assign a new column to a pandas DataFrame and also declare the datatype in one fell swoop?
import pandas as pd

df = pd.DataFrame({'BP': ['100/80'], 'Sex': ['M']})
df2 = (df.drop('BP', axis=1)
         .assign(BPS=lambda x: df.BP.str.extract('(?P<BPS>\d+)/'))
         .assign(BPD=lambda x: df.BP.str.extract('/(?P<BPD>\d+)'))
      )
print(df2)
df2.dtypes
Can we have dtype as np.float using only the chained expression?
Obviously, you don't have to do this, but you can.
df.drop('BP', 1).join(
    df['BP'].str.split('/', expand=True)
            .set_axis(['BPS', 'BPD'], axis=1, inplace=False)
            .astype(float))

  Sex    BPS   BPD
0   M  100.0  80.0
Your two str.extract calls can be done away with in favour of a single str.split call. You can then make one astype call.
Personally, if you ask me about style, I would say this looks more elegant:
u = (df['BP'].str.split('/', expand=True)
             .set_axis(['BPS', 'BPD'], axis=1, inplace=False)
             .astype(float))

df.drop('BP', 1).join(u)

  Sex    BPS   BPD
0   M  100.0  80.0
Add astype when you assign the values:
df2 = (df.drop('BP', axis=1)
         .assign(BPS=lambda x: df.BP.str.extract('(?P<BPS>\d+)/').astype(float))
         .assign(BPD=lambda x: df.BP.str.extract('/(?P<BPD>\d+)').astype(float))
      )
df2.dtypes

Sex     object
BPS    float64
BPD    float64
dtype: object
What I would do:
df.assign(**df.pop('BP').str.extract(r'(?P<BPS>\d+)/(?P<BPD>\d+)').astype(float))

  Sex    BPS   BPD
0   M  100.0  80.0
Use df.insert:
import pandas as pd

df = pd.DataFrame([[1, 2], [3, 4]], columns=['a', 'b'])
print('df to start with:', df, '\ndtypes:', df.dtypes, sep='\n')
print('\n')

df.insert(
    len(df.columns), 'new col 1', pd.Series([[1, 2, 3], 'a'], dtype=object))
df.insert(
    len(df.columns), 'new col 2', pd.Series([1, 2, 3]))
df.insert(
    len(df.columns), 'new col 3', pd.Series([1., 2, 3]))
print('df with columns added:', df, '\ndtypes:', df.dtypes, sep='\n')
output
df to start with:
   a  b
0  1  2
1  3  4

dtypes:
a    int64
b    int64
dtype: object


df with columns added:
   a  b  new col 1  new col 2  new col 3
0  1  2  [1, 2, 3]          1        1.0
1  3  4          a          2        2.0

dtypes:
a              int64
b              int64
new col 1     object
new col 2      int64
new col 3    float64
dtype: object
Just assign numpy arrays of the required type (inspired by a related question/answer).
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'a': np.array([1, 2, 3], dtype=int),
    'b': np.array([4, 5, 6], dtype=float),
})
print('df to start with:', df, '\ndtypes:', df.dtypes, sep='\n')
print('\n')

df['new col 1'] = np.array([[1, 2, 3], 'a', np.nan], dtype=object)
df['new col 2'] = np.array([1, 2, 3], dtype=int)
df['new col 3'] = np.array([1, 2, 3], dtype=float)
print('df with columns added:', df, '\ndtypes:', df.dtypes, sep='\n')
output
df to start with:
   a    b
0  1  4.0
1  2  5.0
2  3  6.0

dtypes:
a      int64
b    float64
dtype: object


df with columns added:
   a    b  new col 1  new col 2  new col 3
0  1  4.0  [1, 2, 3]          1        1.0
1  2  5.0          a          2        2.0
2  3  6.0        NaN          3        3.0

dtypes:
a              int64
b            float64
new col 1     object
new col 2      int64
new col 3    float64
dtype: object

Comparing two dataframes and store values based on conditions in python or R

I have 2 dataframes, each with 2 columns (shown in the picture). I'm trying to define a function or perform an operation that scans df2 against df1 and stores
df2["values"] in df1["values"] wherever df2["ID"] matches df1["ID"].
I want the result as shown in New_df1 (picture).
I have tried a for loop with append(), but it's really tricky to make it work...
You can do this via pandas.concat, sorting, and dropping duplicates:
import numpy as np
import pandas as pd

df1 = pd.DataFrame([[i, np.nan] for i in list('abcdefghik')],
                   columns=['ID', 'Values'])
df2 = pd.DataFrame([['a', 2], ['c', 5], ['e', 4], ['g', 7], ['h', 1]],
                   columns=['ID', 'Values'])

res = pd.concat([df1, df2], axis=0)\
        .sort_values(['ID', 'Values'])\
        .drop_duplicates('ID')
print(res)
#   ID  Values
# 0  a     2.0
# 1  b     NaN
# 1  c     5.0
# 3  d     NaN
# 2  e     4.0
# 5  f     NaN
# 3  g     7.0
# 4  h     1.0
# 8  i     NaN
# 9  k     NaN
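An alternative sketch (my suggestion, not from the original answer): align on ID directly with Series.map, which preserves df1's row order and index:
new_df1 = df1.copy()
new_df1['Values'] = new_df1['ID'].map(df2.set_index('ID')['Values'])
print(new_df1)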

Set values in dataframe based on columns in other dataframe

import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(5, 3), columns=['A', 'B', 'C'])
df2 = pd.DataFrame(np.random.randn(5, 3), columns=['X','Y','Z'])
I can easily set the values in df to zero if they are less than a constant:
df[df < 0.0] = 0.0
Can someone tell me how to instead compare to a column in a different dataframe? I assumed this would work, but it does not:
df[df < df2.X] = 0.0
IIUC you need to use lt and pass axis=0 to compare column-wise:
In [83]:
df = pd.DataFrame(np.random.randn(5, 3), columns=['A', 'B', 'C'])
df2 = pd.DataFrame(np.random.randn(5, 3), columns=['X', 'Y', 'Z'])
df

Out[83]:
          A         B         C
0  2.410659 -1.508592 -1.626923
1 -1.550511  0.983712 -0.021670
2  1.295553 -0.388102  0.091239
3  2.179568  2.266983  0.030463
4  1.413852 -0.109938  1.232334
In [87]:
df2

Out[87]:
          X         Y         Z
0  0.267544  0.355003 -1.478263
1 -1.419736  0.197300 -1.183842
2  0.049764 -0.033631  0.343932
3 -0.863873 -1.361624 -1.043320
4  0.219959  0.560951  1.820347
In [86]:
df[df.lt(df2.X, axis=0)] = 0
df

Out[86]:
          A         B         C
0  2.410659  0.000000  0.000000
1  0.000000  0.983712 -0.021670
2  1.295553  0.000000  0.091239
3  2.179568  2.266983  0.030463
4  1.413852  0.000000  1.232334
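If you would rather not modify df in place, a non-mutating sketch (my suggestion, not from the answer above) using DataFrame.mask gives the same values:
res = df.mask(df.lt(df2.X, axis=0), 0.0)   # replaces entries where the condition is True
print(res)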
