For example, given two dataframes as below:
df1
index a b
0 1 1
1 1 1
df2
index a b
1 2 2
2 2 2
I want df1.append(df2) with overwrite, so the result would be as below:
merged df
index a b
0 1 1
1 2 2 <= value overwritten by df2
2 2 2
Is there a good way to do this in pandas?
Using combine_first:
# align both frames on the 'index' column, then let df2's values win
df1 = df1.set_index('index')
df2 = df2.set_index('index')
df2.combine_first(df1)
Out[279]:
a b
index
0 1.0 1.0
1 2.0 2.0
2 2.0 2.0
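Note that combine_first upcasts to float because the two frames are aligned (introducing NaN) before the gaps are filled. If integers are wanted, a cast afterwards restores them; a small follow-up sketch, assuming no NaN remain after the merge:
# safe here because every index label is covered by df1 or df2
merged = df2.combine_first(df1).astype(int)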
I have a DataFrame with a duplicate column, namely Weather.
As seen in the picture of the dataframe, one of them contains NaN values; that is the one I want to remove from the DataFrame.
I tried this method:
data_cleaned4.drop('Weather', axis=1)
It dropped both columns, as a label-based drop does. I then tried to pass a condition to the drop method, but it shows me an error:
data_cleaned4.drop(data_cleaned4['Weather'].isnull().sum() > 0, axis=1)
Can anyone tell me how to remove this column? Remember that the second-to-last column contains the NaN values, not the last one.
A general solution: df.isnull().any(axis=0).values flags the columns that contain any NaN, and df.columns.duplicated(keep=False) marks every duplicated column as True. ANDing the two selects the columns to drop, so its negation keeps the columns you want to retain.
General Solution:
df.loc[:, ~((df.isnull().any(axis=0).values) & df.columns.duplicated(keep=False))]
Input
A B C C A
0 1 1 1 3.0 NaN
1 1 1 1 2.0 1.0
2 2 3 4 NaN 2.0
3 1 1 1 4.0 1.0
Output
A B C
0 1 1 1
1 1 1 1
2 2 3 4
3 1 1 1
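For reference, a minimal reproduction of the input above; the frame is built positionally because of the duplicate labels:
import numpy as np
import pandas as pd

# sample frame with duplicate column labels 'A' and 'C'
df = pd.DataFrame([[1, 1, 1, 3.0, np.nan],
                   [1, 1, 1, 2.0, 1.0],
                   [2, 3, 4, np.nan, 2.0],
                   [1, 1, 1, 4.0, 1.0]],
                  columns=['A', 'B', 'C', 'C', 'A'])

# keep only columns that are not both duplicated and NaN-carrying
mask = df.isnull().any(axis=0).values & df.columns.duplicated(keep=False)
print(df.loc[:, ~mask])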
Just for column C:
df.loc[:, ~(df.columns.duplicated(keep=False) & (df.isnull().any(axis=0).values)
& (df.columns == 'C'))]
Input
A B C C A
0 1 1 1 3.0 NaN
1 1 1 1 2.0 1.0
2 2 3 4 NaN 2.0
3 1 1 1 4.0 1.0
Output
A B C A
0 1 1 1 NaN
1 1 1 1 1.0
2 2 3 4 2.0
3 1 1 1 1.0
Because of the duplicate names, a label-based drop removes both columns, so the code below works by position instead: it checks which of the two Weather columns contains NaN and rebuilds the frame without that position.
checkone = data_cleaned4.iloc[:, -1].isna().any()
# position to drop: the last column if it holds the NaNs, else the second-to-last
drop_pos = data_cleaned4.shape[1] - 1 if checkone else data_cleaned4.shape[1] - 2
# rebuild by position; data_cleaned4.drop('Weather', axis=1) would remove both duplicates
data_cleaned4 = data_cleaned4.iloc[:, [j for j in range(data_cleaned4.shape[1]) if j != drop_pos]]
Without a testable sample, and assuming you don't have NaNs anywhere else in your dataframe,
df = df.dropna(axis=1)
should work.
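On the sample frame built above this keeps exactly the NaN-free columns:
# drops every column that contains at least one NaN
print(df.dropna(axis=1))
#    A  B  C
# 0  1  1  1
# 1  1  1  1
# 2  2  3  4
# 3  1  1  1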
I have a dataframe which looks like:
0 target_year ID v1 v2
1 2000 1 0.3 1
2 2000 2 1.2 4
...
10 2001 1 3 2
11 2001 2 2 2
And I would like the following output:
0 ID v1_1 v2_1 v1_2 v2_2
1 1 0.3 1 3 2
2 2 1.2 4 2 2
Do you have any idea how to do that?
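For reference, a minimal construction of this frame, with values copied from the example above (the "..." rows from the question are omitted, and the leading 0, 1, 2, ... is just the default index):
import pandas as pd

df = pd.DataFrame({'target_year': [2000, 2000, 2001, 2001],
                   'ID': [1, 2, 1, 2],
                   'v1': [0.3, 1.2, 3.0, 2.0],
                   'v2': [1, 4, 2, 2]})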
You could use pd.pivot_table, using the GroupBy.cumcount of ID as columns.
Then we can use a list comprehension with f-strings to merge the MultiIndex header into a single level:
cols = df.groupby('ID').ID.cumcount() + 1
df_piv = pd.pivot_table(data=df.drop('target_year', axis=1)[['v1','v2']],
                        index=df.ID,
                        columns=cols)
df_piv.columns = [f'{i}_{j}' for i, j in df_piv.columns]
v1_1 v1_2 v2_1 v2_2
ID
1 0.3 3.0 1 2
2 1.2 2.0 4 2
Use GroupBy.cumcount for a counter column, reshape with DataFrame.set_index plus DataFrame.unstack, and finally flatten the header in a list comprehension with f-strings:
g = df.groupby('ID').ID.cumcount() + 1
df = df.drop('target_year', axis=1).set_index(['ID', g]).unstack()
df.columns = [f'{a}_{b}' for a, b in df.columns]
df = df.reset_index()
print (df)
ID v1_1 v1_2 v2_1 v2_2
0 1 0.3 3.0 1 2
1 2 1.2 2.0 4 2
If your data covers only two years, you can also merge:
cols = ['ID','v1', 'v2']
df[df.target_year.eq(2000)][cols].merge(df[df.target_year.eq(2001)][cols],
on='ID',
suffixes=['_1','_2'])
Output
ID v1_1 v2_1 v1_2 v2_2
0 1 0.3 1 3.0 2
1 2 1.2 4 2.0 2
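If more than two years can appear, the same merge idea extends by building one renamed frame per year and folding them together; a sketch reusing cols from above, with the suffix taken from the year's position (frames and result are names introduced here):
from functools import reduce

frames = [g[cols].rename(columns={c: f'{c}_{i}' for c in ['v1', 'v2']})
          for i, (_, g) in enumerate(df.groupby('target_year'), start=1)]
result = reduce(lambda left, right: left.merge(right, on='ID'), frames)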
df:
index a b c d
-
0 1 2 NaN NaN
1 2 NaN 3 NaN
2 5 NaN 6 NaN
3 1 NaN NaN 5
df expect:
index one two
-
0 1 2
1 2 3
2 5 6
3 1 5
The expected output above is self-explanatory: I just need to shift the two non-NaN values from columns [a, b, c, d] into another set of two columns ["one", "two"].
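For reference, a minimal construction of the input, keeping index as a regular column the way the later answers expect (the first answer assumes it was made the real index via df = df.set_index('index') first):
import numpy as np
import pandas as pd

df = pd.DataFrame({'index': [0, 1, 2, 3],
                   'a': [1, 2, 5, 1],
                   'b': [2, np.nan, np.nan, np.nan],
                   'c': [np.nan, 3, 6, np.nan],
                   'd': [np.nan, np.nan, np.nan, 5]})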
Use back-filling of missing values and select the first 2 columns:
df = df.bfill(axis=1).iloc[:, :2].astype(int)
df.columns = ["one", "two"]
print (df)
one two
index
0 1 2
1 2 3
2 5 6
3 1 5
Or pop + combine_first (pop already removes b, c and d as it extracts them, so no separate drop is needed):
df['two'] = df.pop('b').combine_first(df.pop('c')).combine_first(df.pop('d'))
df.columns = ['index', 'one', 'two']
Or pop + fillna:
df['two'] = df.pop('b').fillna(df.pop('c')).fillna(df.pop('d'))
df.columns = ['index', 'one', 'two']
In both cases, print(df) gives:
index one two
0 0 1 2.0
1 1 2 3.0
2 2 5 6.0
3 3 1 5.0
If you want output like jezrael's above (works for both cases), add:
df=df.set_index('index')
and then print(df) gives:
one two
index
0 1 2.0
1 2 3.0
2 5 6.0
3 1 5.0
I want to deal with duplicates in a pandas df:
df=pd.DataFrame({'A':[1,1,1,2,1],'B':[2,2,1,2,1],'C':[2,2,1,1,1],'D':['a','c','a','c','c']})
df
I want to keep only rows with unique values of A, B, C and create binary columns D_a and D_c, so the result will be something like this, without doing super slow loops on each row:
result= pd.DataFrame({'A':[1,1,2],'B':[2,1,2],'C':[2,1,1],'D_a':[1,1,0],'D_c':[1,1,1]})
Thanks a lot
You can use:
df1 = (df.groupby(['A','B','C'])['D']
         .value_counts()
         .unstack(fill_value=0)
         .add_prefix('D_')
         .clip(upper=1)  # cap counts at 1 (.clip_upper(1) in older pandas)
         .reset_index()
         .rename_axis(None, axis=1))
print (df1)
A B C D_a D_c
0 1 1 1 1 1
1 1 2 2 1 1
2 2 2 1 0 1
Using get_dummies + sum -
# one dummy column per D value, summed per (A, B, C) group
# (older pandas used .sum(level=[0, 1, 2]) instead of the groupby)
df = df.set_index(['A', 'B', 'C'])\
       .D.str.get_dummies()\
       .groupby(level=[0, 1, 2]).sum()\
       .add_prefix('D_')\
       .reset_index()
df
A B C D_a D_c
0 1 1 1 1 1
1 1 2 2 1 1
2 2 2 1 0 1
You can do something like this
df.loc[df['D']=='a', 'D_a'] = 1
df.loc[df['D']=='c', 'D_c'] = 1
This will put a 1 in a new column wherever an "a" or "c" appears.
A B C D D_a D_c
0 1 2 2 a 1.0 NaN
1 1 2 2 c NaN 1.0
2 1 1 1 a 1.0 NaN
3 2 2 1 c NaN 1.0
4 1 1 1 c NaN 1.0
But then you have to replace the NaNs with 0:
df = df.fillna(0)
Next, the rows sharing the same A, B, C have to be combined, because a group can collect its 1s from different original rows (selecting the columns and calling drop_duplicates alone would keep both (1, 2, 2) rows, one marked D_a and one marked D_c). A groupby with max does the combining:
df = df.groupby(['A', 'B', 'C'], as_index=False)[['D_a', 'D_c']].max()
Hope this is the solution you were looking for.
When directly appending two dataframes with different numbers of columns, an error occurs: pandas.io.common.CParserError: Error tokenizing data. C error: Expected 4 fields in line 242, saw 5. How can I avoid this error with pandas?
I have figured out one naive approach: preprocess the original data so the numbers of columns are equal.
Can it be more elegant? I think the missing columns could be filled with np.nan after pd.append.
You should be able to concat the dataframes as shown; the missing column is filled with NaN automatically.
You will need to rename the columns to suit your needs.
df1 = pd.DataFrame({'a':[1,2,3,4],'b':[1,2,3,4],'c':[1,2,3,4]})
df2 = pd.DataFrame({'a':[1,2,3,4],'c':[1,2,3,4]})
df = pd.concat([df1,df2])
print('df1')
print(df1)
print('\ndf2')
print(df2)
print('\ndf')
print(df)
Output:
df1
a b c
0 1 1 1
1 2 2 2
2 3 3 3
3 4 4 4
df2
a c
0 1 1
1 2 2
2 3 3
3 4 4
df
a b c
0 1 1.0 1
1 2 2.0 2
2 3 3.0 3
3 4 4.0 4
0 1 NaN 1
1 2 NaN 2
2 3 NaN 3
3 4 NaN 4
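If the repeated 0 to 3 row labels in df are unwanted, pd.concat can renumber them; one small addition:
# ignore_index=True discards the original labels and renumbers 0..n-1
df = pd.concat([df1, df2], ignore_index=True)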