set values in dataframe based on columns in other dataframe


import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(5, 3), columns=['A', 'B', 'C'])
df2 = pd.DataFrame(np.random.randn(5, 3), columns=['X','Y','Z'])
I can easily set the values in df to zero if they are less than a constant:
df[df < 0.0] = 0.0
Can someone tell me how to compare against a column in a different dataframe instead? I assumed this would work, but it does not:
df[df < df2.X] = 0.0

If I understand correctly, you need to use lt and pass axis=0, so that df2.X is aligned on the row index and compared against every column of df:
In [83]:
df = pd.DataFrame(np.random.randn(5, 3), columns=['A', 'B', 'C'])
df2 = pd.DataFrame(np.random.randn(5, 3), columns=['X','Y','Z'])
df
Out[83]:
A B C
0 2.410659 -1.508592 -1.626923
1 -1.550511 0.983712 -0.021670
2 1.295553 -0.388102 0.091239
3 2.179568 2.266983 0.030463
4 1.413852 -0.109938 1.232334
In [87]:
df2
Out[87]:
X Y Z
0 0.267544 0.355003 -1.478263
1 -1.419736 0.197300 -1.183842
2 0.049764 -0.033631 0.343932
3 -0.863873 -1.361624 -1.043320
4 0.219959 0.560951 1.820347
In [86]:
df[df.lt(df2.X, axis=0)] = 0
df
Out[86]:
A B C
0 2.410659 0.000000 0.000000
1 0.000000 0.983712 -0.021670
2 1.295553 0.000000 0.091239
3 2.179568 2.266983 0.030463
4 1.413852 0.000000 1.232334
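
As far as I can tell, the plain df < df2.X fails because a Series is matched against the DataFrame's column labels (A, B, C) by default, not against its rows. Below is a minimal sketch of the axis=0 comparison on small made-up values (the numbers are illustrative assumptions, not the random data above). It uses mask instead of boolean assignment so the original frame is left untouched:

```python
import pandas as pd

# Small deterministic frames (illustrative values, not from the question).
df = pd.DataFrame({"A": [1.0, -2.0, 3.0], "B": [0.5, 4.0, -1.0]})
df2 = pd.DataFrame({"X": [0.0, 5.0, -3.0]})

# axis=0 matches df2["X"] to df by row label, so every column of df
# is compared against the same per-row threshold.
below = df.lt(df2["X"], axis=0)

# mask() returns a new frame with 0.0 wherever the condition is True;
# df itself is not modified.
result = df.mask(below, 0.0)
print(result)
```

Only row 1 is zeroed here, because both of its values are below the threshold 5.0 in df2.X.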

Related

Comparing two dataframes and store values based on conditions in python or R

I have 2 dataframes, each with 2 columns (shown in the picture). I'm trying to define a function or perform an operation that looks up each ID of df1 in df2 and stores df2["values"] in df1["values"] wherever df2["ID"] matches df1["ID"].
I want the result as shown in New_df1 (picture).
I have tried a for loop with append(), but it's really tricky to make it work...

You can do this via pandas.concat, sorting and dropping duplicates. sort_values puts NaN last by default, so drop_duplicates keeps the non-null value for each ID:
import numpy as np
import pandas as pd

df1 = pd.DataFrame([[i, np.nan] for i in 'abcdefghik'],
                   columns=['ID', 'Values'])
df2 = pd.DataFrame([['a', 2], ['c', 5], ['e', 4], ['g', 7], ['h', 1]],
                   columns=['ID', 'Values'])
res = (pd.concat([df1, df2], axis=0)
         .sort_values(['ID', 'Values'])
         .drop_duplicates('ID'))
print(res)
# ID Values
# 0 a 2.0
# 1 b NaN
# 1 c 5.0
# 3 d NaN
# 2 e 4.0
# 5 f NaN
# 3 g 7.0
# 4 h 1.0
# 8 i NaN
# 9 k NaN
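
A different technique that may be easier to read is a lookup with Series.map. This is a sketch, not the approach from the answer above, and it assumes the IDs in df2 are unique (the name lookup is introduced here for illustration). It keeps df1's original row order and index:

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({"ID": list("abcdefghik"), "Values": np.nan})
df2 = pd.DataFrame({"ID": list("acegh"), "Values": [2, 5, 4, 7, 1]})

# Build an ID -> value lookup from df2 (assumes each ID occurs once in df2).
lookup = df2.set_index("ID")["Values"]

# map() returns NaN for IDs that are missing from the lookup.
res = df1.assign(Values=df1["ID"].map(lookup))
print(res)
```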

Dataframe empty when passing column names

I am facing an issue: passing a numpy array to DataFrame without column names initializes it properly, but if I pass column names, the result is empty.
x = np.array([(1, '1'), (2, '2')], dtype = 'i4,S1')
df = pd.DataFrame(x)
In []: df
Out[]:
f0 f1
0 1 1
1 2 2
df2 = pd.DataFrame(x, columns=['a', 'b'])
In []: df2
Out[]:
Empty DataFrame
Columns: [a, b]
Index: []

I think you need to specify the column names in the dtype parameter; see "DataFrame from structured or record array" in the documentation:
x = np.array([(1, '1'), (2, '2')], dtype=[('a', 'i4'),('b', 'S1')])
df2 = pd.DataFrame(x)
print (df2)
a b
0 1 b'1'
1 2 b'2'
Another solution, without the dtype parameter (note that numpy then stores both columns as strings):
x = np.array([(1, '1'), (2, '2')])
df2 = pd.DataFrame(x, columns=['a', 'b'])
print (df2)
a b
0 1 1
1 2 2
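
If the structured array already exists with the default field names, one possible workaround is to build the frame first and rename the columns afterwards. This is only a sketch: the decode step assumes the S1 field holds ASCII bytes, as in the question.

```python
import numpy as np
import pandas as pd

# Structured array with auto-generated field names f0 and f1.
x = np.array([(1, "1"), (2, "2")], dtype="i4,S1")

# Let pandas pick up the field names, then rename the columns.
df = pd.DataFrame(x)
df.columns = ["a", "b"]

# The S1 field arrives as bytes; decode it to regular strings
# (assumes the bytes are ASCII).
df["b"] = df["b"].str.decode("ascii")
print(df)
```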

The dtype parameter is the cause; without specifying it, this works as expected.
See the example in the DataFrame documentation:
import numpy as np
import pandas as pd
x = np.array([(1, "11"), (2, "22")])
df = pd.DataFrame(x)
print(df)
df2 = pd.DataFrame(x, columns=['a', 'b'])
print(df2)

Element-wise ternary conditional operation on dataframes

Given dataframes df1, df2 and df3, what is the best way to get df = df1 if (df2 > 0) else df3, element-wise?

You can use df.where to achieve this:
In [3]:
df1 = pd.DataFrame(np.random.randn(5,3), columns=list('abc'))
df2 = pd.DataFrame(np.random.randn(5,3), columns=list('abc'))
df3 = pd.DataFrame(np.random.randn(5,3), columns=list('abc'))
print(df1)
print(df2)
print(df3)
a b c
0 -0.378401 1.456254 -0.327311
1 0.491864 -0.757420 -0.014689
2 0.028873 -0.906428 -0.252586
3 -0.686849 1.515643 1.065322
4 0.570760 -0.857298 -0.152426
a b c
0 1.273215 1.275937 -0.745570
1 -0.460257 -0.756481 1.043673
2 0.452731 1.071703 -0.454962
3 0.418926 1.395290 -1.365873
4 -0.661421 0.798266 0.384397
a b c
0 -0.641351 -1.469222 0.160428
1 1.164031 1.781090 -1.218099
2 0.096094 0.821062 0.815384
3 -1.001950 -1.851345 0.772869
4 -1.137854 1.205580 -0.922832
In [4]:
df = df1.where(df2 > 0, df3)
df
Out[4]:
a b c
0 -0.378401 1.456254 0.160428
1 1.164031 1.781090 -0.014689
2 0.028873 -0.906428 0.815384
3 -0.686849 1.515643 0.772869
4 -1.137854 -0.857298 -0.152426

Alternatively, mask df1 and fill the gaps from df3 (this also fills any NaN already present in df1):
df = df1[df2 > 0].combine_first(df3)
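
To make the element-wise selection easier to check by hand, here is a small sketch with made-up integer values (an assumption for illustration, not the random data above). It also shows numpy.where, which gives the same values but returns a plain array that has to be wrapped back into a DataFrame:

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({"a": [1, 3], "b": [2, 4]})
df2 = pd.DataFrame({"a": [1, -1], "b": [-1, 1]})
df3 = pd.DataFrame({"a": [10, 30], "b": [20, 40]})

# Keep df1 where df2 > 0, otherwise take the value from df3.
via_where = df1.where(df2 > 0, df3)

# Same selection with numpy; index and columns must be restored manually.
via_numpy = pd.DataFrame(np.where(df2 > 0, df1, df3),
                         index=df1.index, columns=df1.columns)
print(via_where)
```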

Updating values with another dataframe

I have 2 pandas dataframes. The index and columns of the second one are contained in those of the first. How can I replace the values in the first one with the ones in the second?
Consider this example:
df1 = pd.DataFrame(0, index=[1,2,3], columns=['a','b','c'])
df2 = pd.DataFrame(1, index=[1, 2], columns=['a', 'c'])
ris = [[1, 0, 1],
       [1, 0, 1],
       [0, 0, 0]]
where ris is the expected result, with the same index and columns as df1.
A possible solution is:
for i in df2.index:
    for j in df2.columns:
        df1.loc[i, j] = df2.loc[i, j]
But this is ugly.

I think you can use copy with combine_first:
df3 = df1.copy()
df1[df2.columns] = df2[df2.columns]
print(df1.combine_first(df3))
a b c
1 1.0 0 1.0
2 1.0 0 1.0
3 0.0 0 0.0
Another solution is to create a new empty DataFrame df4 with the index and columns of df1, and fill it with two chained combine_first calls:
df4 = pd.DataFrame(index=df1.index, columns=df1.columns)
df4 = df4.combine_first(df2).combine_first(df1)
print(df4)
a b c
1 1.0 0.0 1.0
2 1.0 0.0 1.0
3 0.0 0.0 0.0

Try
In [7]: df1['a'] = df2['a']
In [8]: df1['c'] = df2['c']
In [14]: df1[['a','c']] = df2[['a','c']]
If the column names are not known:
In [25]: for col in df2.columns:
   ....:     df1[col] = df2[col]
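
One caveat with plain column assignment: it aligns on the index, so rows of df1 that are missing from df2 (row 3 here) end up as NaN in columns a and c and do not keep their old value. A sketch of an alternative that leaves those cells alone is DataFrame.update, which overwrites df1 in place with the non-NA values of df2, aligned on both index and columns:

```python
import pandas as pd

df1 = pd.DataFrame(0, index=[1, 2, 3], columns=["a", "b", "c"])
df2 = pd.DataFrame(1, index=[1, 2], columns=["a", "c"])

# update() modifies df1 in place and returns None; cells that have no
# counterpart in df2 (column b, row 3) keep their original values.
df1.update(df2)
print(df1)
```

Depending on the pandas version, the updated columns may come back as floats, so cast with astype(int) if integer columns are required.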

Confusing pandas Panel behavior : am I doing something wrong?

In [142]:
import pandas as pd
df = pd.DataFrame([[0,1,2,3]], columns=['a', 'b', 'c', 'd'])
df1 = pd.DataFrame()
d = {'name' : pd.Panel(items=['x', 'y', 'z'])}
d['name']['x']
Out[142]:
Index([], dtype='object') Empty DataFrame
0 rows × 0 columns
This doesn't seem to work:
In [143]:
d['name']['x'] = df
d['name']['x']
Out[143]:
Index([], dtype='object') Empty DataFrame
0 rows × 0 columns
But this does:
In [144]:
df1 = df
df1
Out[144]:
a b c d
0 0 1 2 3
1 rows × 4 columns
Is there something about Panels that I'm missing?
