I have two DataFrames.
df1 has the following form:
ID col1 col2
0 1 2 10
1 3 1 21
and df2 looks like this:
ID field1 field2
0 1 4 1
1 1 3 3
2 3 5 4
3 3 9 5
4 1 2 0
I want to concatenate both DataFrames so that I end up with only one line per ID, like this:
ID col1 col2 field1_1 field2_1 field1_2 field2_2 field1_3 field2_3
0 1 2 10 4 1 3 3 2 0
1 3 1 21 5 4 9 5
I have tried merging and pivoting the data with df.pivot(index=df1.index, columns='ID'), but because the number of rows per ID varies, I get a ValueError:
ValueError: all arrays must be same length
Without worrying about formatting yet, we can merge and add a level to a MultiIndex that counts the occurrences of each 'ID'.
df = df1.merge(df2)
# number each row within its ID group: 0, 1, 2, ...
cc = df.groupby('ID').cumcount()
df.set_index(['ID', 'col1', 'col2', cc]).unstack()
field1 field2
0 1 2 0 1 2
ID col1 col2
1 2 10 4.0 3.0 2.0 1.0 3.0 0.0
3 1 21 5.0 9.0 NaN 4.0 5.0 NaN
We can nail down the formatting with:
df = df1.merge(df2)
cc = df.groupby('ID').cumcount() + 1
d1 = df.set_index(['ID', 'col1', 'col2', cc]).unstack().sort_index(axis=1, level=1)
# flatten the MultiIndex columns to 'field1_1', 'field2_1', 'field1_2', ...
d1.columns = d1.columns.to_series().map('{0[0]}_{0[1]}'.format)
d1.reset_index()
ID col1 col2 field1_1 field2_1 field1_2 field2_2 field1_3 field2_3
0 1 2 10 4.0 1.0 3.0 3.0 2.0 0.0
1 3 1 21 5.0 4.0 9.0 5.0 NaN NaN
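For reference, a minimal runnable setup for the frames above (values taken from the printed output), against which the snippets just shown can be run:
import pandas as pd

df1 = pd.DataFrame({'ID': [1, 3], 'col1': [2, 1], 'col2': [10, 21]})
df2 = pd.DataFrame({'ID': [1, 1, 3, 3, 1],
                    'field1': [4, 3, 5, 9, 2],
                    'field2': [1, 3, 4, 5, 0]})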
from io import StringIO
import pandas as pd
x1 = """No.,col1,col2,col3,A
123,2,5,2,NaN
453,4,3,1,3
146,7,9,4,2
175,2,4,3,NaN
643,0,0,0,2
"""
x2 = """No.,col1,col2,col3,A
123,24,57,22,1
453,41,39,15,2
175,21,43,37,3
"""
df1 = pd.read_csv(StringIO(x1), sep=",")
df2 = pd.read_csv(StringIO(x2), sep=",")
How can I fill the NaN values in df1 with the corresponding values from df2, matched on the No. column, to get
No. col1 col2 col3 A
123 2 5 2 1
453 4 3 1 3
146 7 9 4 2
175 2 4 3 3
643 0 0 0 2
I tried the following line, but nothing changed:
df1['A'].fillna(df2['A'])
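That line fails for two reasons: fillna with a Series aligns on the index, which here is the default RangeIndex rather than No., and the result is never assigned back to df1. A quick check:
# fillna aligns on the index (here the default RangeIndex), not on 'No.':
# df2 row 2 is No. 175, but it lines up with df1 row 2, which is No. 146.
# The result is also never assigned back, so df1 stays unchanged anyway.
print(df1['A'].fillna(df2['A']))
# 0    1.0   <- filled from df2 row 0 (coincidentally the right No.)
# 1    3.0
# 2    2.0
# 3    NaN   <- df2 has no row at index 3, so nothing to fill with
# 4    2.0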
Use combine_first, which is designed for exactly this purpose:
(df1.set_index('No.')
.combine_first(df2.set_index('No.'))
.reset_index()
)
output:
No. col1 col2 col3 A
0 123 2.0 5.0 2.0 1.0
1 146 7.0 9.0 4.0 2.0
2 175 2.0 4.0 3.0 3.0
3 453 4.0 3.0 1.0 3.0
4 643 0.0 0.0 0.0 2.0
or fillna after setting 'No.' as the index (unlike combine_first, which returns the sorted union of the indexes, this keeps df1's row order):
(df1.set_index('No.')
.fillna(df2.set_index('No.'))
.reset_index()
)
output:
No. col1 col2 col3 A
0 123 2 5 2 1.0
1 453 4 3 1 3.0
2 146 7 9 4 2.0
3 175 2 4 3 3.0
4 643 0 0 0 2.0
Try this:
# align df2's 'A' values to the order of df1's 'No.' column, then fill by position
df1['A'] = df1['A'].fillna(df2.set_index('No.').reindex(df1['No.'])['A'].reset_index(drop=True))
Another way with fillna and map:
df1["A"] = df1["A"].fillna(df1["No."].map(df2.set_index("No.")["A"]))
>>> df1
No. col1 col2 col3 A
0 123 2 5 2 1.0
1 453 4 3 1 3.0
2 146 7 9 4 2.0
3 175 2 4 3 3.0
4 643 0 0 0 2.0
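For clarity: Series.map with a Series argument looks up each value of df1['No.'] in the index of the passed Series. A minimal illustration with the frames above:
lookup = df2.set_index('No.')['A']  # index 123, 453, 175 -> values 1, 2, 3
print(df1['No.'].map(lookup))
# 0    1.0
# 1    2.0
# 2    NaN   <- 146 has no match in df2, so fillna keeps the existing value
# 3    3.0
# 4    NaN   <- 643 has no match either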
I have a DataFrame like this:
df
Col A Col B Col C
25 1 2
NaN 3 1
27 2 3
29 3 1
I want to fill the NaN values in Col A based on Col B and Col C.
My output df should look like this:
Col A Col B Col C
25 1 2
29 3 1
27 2 3
29 3 1
I have tried this code: df.groupby(['Col B','Col C']).ffill()
but it didn't work. Any suggestion would be helpful.
Here you go:
df['Col A'] = df["Col A"].fillna(df.groupby(['Col B','Col C'])["Col A"].transform(lambda x: x.mean()))
print(df)
Prints:
Col A Col B Col C
0 25.0 1 2
1 29.0 3 1
2 27.0 2 3
3 29.0 3 1
You can try
df.fillna(df.groupby(['Col B','Col C']).transform('first'), inplace=True)
df
Out[386]:
Col A Col B Col C
0 25.0 1 2
1 29.0 3 1
2 27.0 2 3
3 29.0 3 1
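The groupby ffill attempt from the question fails here because, within the (Col B=3, Col C=1) group, the NaN row comes before the row holding 29, so a forward fill has nothing to carry forward. Chaining a backward fill covers both directions; a sketch against the frame above:
# forward- then backward-fill within each (Col B, Col C) group,
# so a known value on either side of the NaN can fill it
df['Col A'] = df.groupby(['Col B', 'Col C'])['Col A'].transform(lambda s: s.ffill().bfill())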
How can I add a field that returns 1/0 if the value in any specified column is not NaN?
Example:
import numpy as np
import pandas as pd

df = pd.DataFrame({'id': [1,2,3,4,5,6,7,8,9,10],
                   'val1': [2,2,np.nan,np.nan,np.nan,1,np.nan,np.nan,np.nan,2],
                   'val2': [7,0.2,5,8,np.nan,1,0,np.nan,1,1],
                   })
display(df)
mycols = ['val1', 'val2']
# if any entry in mycols is not NaN, then countif = 1 for that row; else 0
Desired output dataframe:
id val1 val2 countif
0 1 2.0 7.0 1
1 2 2.0 0.2 1
2 3 NaN 5.0 1
3 4 NaN 8.0 1
4 5 NaN NaN 0
5 6 1.0 1.0 1
6 7 NaN 0.0 1
7 8 NaN NaN 0
8 9 NaN 1.0 1
9 10 2.0 1.0 1
We do not need COUNTIF-style logic in pandas; try notna + any:
df['out'] = df[['val1','val2']].notna().any(axis=1).astype(int)
df
Out[381]:
id val1 val2 out
0 1 2.0 7.0 1
1 2 2.0 0.2 1
2 3 NaN 5.0 1
3 4 NaN 8.0 1
4 5 NaN NaN 0
5 6 1.0 1.0 1
6 7 NaN 0.0 1
7 8 NaN NaN 0
8 9 NaN 1.0 1
9 10 2.0 1.0 1
Using the iloc accessor, filter the last two columns. Check whether the count of non-NaN values in each row is greater than zero, and convert the resulting Booleans to integers.
df['countif'] = df.iloc[:, 1:].notna().sum(axis=1).gt(0).astype(int)
id val1 val2 countif
0 1 2.0 7.0 1
1 2 2.0 0.2 1
2 3 NaN 5.0 1
3 4 NaN 8.0 1
4 5 NaN NaN 0
5 6 1.0 1.0 1
6 7 NaN 0.0 1
7 8 NaN NaN 0
8 9 NaN 1.0 1
9 10 2.0 1.0 1
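Both answers hardcode the column names; using the mycols list defined in the question makes the check reusable for any set of columns:
# 1 if any of the listed columns is non-NaN in that row, else 0
df['countif'] = df[mycols].notna().any(axis=1).astype(int)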
I have the following data Frame:
df = pd.DataFrame({"a":[1,2,3,4,5], "b":[3,2,1,2,2], "c": [2,1,0,2,1]})
a b c
0 1 3 2
1 2 2 1
2 3 1 0
3 4 2 2
4 5 2 1
and I want to shift columns a and b at indexes 0 to 2. I.e. my desired result is
a b c
0 NaN NaN 2
1 1 3 1
2 2 2 0
3 4 1 2
4 5 2 1
If I do
df[["a", "b"]][0:3] = df[["a", "b"]][0:3].shift(1)
and look at df, it appears to not have changed.
However, if I select only the rows or only the columns, it works:
Single Column, select subset of rows:
df["a"][0:3] = df["a"][0:3].shift(1)
Output:
a b c
0 NaN 3 2
1 1.0 2 1
2 2.0 1 0
3 4.0 2 2
4 5.0 2 1
Likewise, if I select a list of columns but all rows, it works as expected, too:
df[["a", "b"]] = df[["a", "b"]].shift(1)
output:
a b c
0 NaN NaN 2
1 1.0 3.0 1
2 2.0 2.0 0
3 3.0 1.0 2
4 4.0 2.0 1
Why does df[["a", "b"]][0:3] = df[["a", "b"]][0:3].shift(1) not work as expected? am I missing something?
The problem is the double selection: first the columns and then the rows, so the assignment updates a copy (chained assignment). See also: evaluation order matters.
A possible solution is a single selection with DataFrame.loc, using index labels and column names:
df.loc[0:2, ["a", "b"]] = df.loc[0:2, ["a", "b"]].shift(1)
print (df)
a b c
0 NaN NaN 2
1 1.0 3.0 1
2 2.0 2.0 0
3 4.0 2.0 2
4 5.0 2.0 1
If the index is not the default and it is necessary to select the first two rows by position:
df = pd.DataFrame({"a":[1,2,3,4,5], "b":[3,2,1,2,2], "c": [2,1,0,2,1]},
index=list('abcde'))
df.loc[df.index[0:2], ["a", "b"]] = df.loc[df.index[0:2], ["a", "b"]].shift(1)
print (df)
a b c
a NaN NaN 2
b 1.0 3.0 1
c 3.0 1.0 0
d 4.0 2.0 2
e 5.0 2.0 1
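To surface the problem instead of having the assignment silently update a copy, pandas versions that predate copy-on-write can escalate the chained-assignment warning to an error (under copy-on-write, the default from pandas 3.0, chained assignment simply never works); a sketch:
import pandas as pd

# escalate pandas' chained-assignment warning to an error
# (this option exists in versions before copy-on-write became the default)
pd.set_option('mode.chained_assignment', 'raise')

df = pd.DataFrame({"a": [1, 2, 3, 4, 5], "b": [3, 2, 1, 2, 2], "c": [2, 1, 0, 2, 1]})

# df[["a", "b"]] returns a copy; slicing and assigning into that copy
# raises SettingWithCopyError instead of silently changing nothing
df[["a", "b"]][0:3] = df[["a", "b"]][0:3].shift(1)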
I have this dataframe.
from pandas import DataFrame
import pandas as pd
df = pd.DataFrame({'name': ['A','D','M','T','B','C','D','E','A','L'],
'id': [1,1,1,2,2,3,3,3,3,5],
'rate': [3.5,4.5,2.0,5.0,4.0,1.5,2.0,2.0,1.0,5.0]})
>> df
name id rate
0 A 1 3.5
1 D 1 4.5
2 M 1 2.0
3 T 2 5.0
4 B 2 4.0
5 C 3 1.5
6 D 3 2.0
7 E 3 2.0
8 A 3 1.0
9 L 5 5.0
df = df.groupby('id')['rate'].mean()
What I want is this:
1) find the mean rate of every id;
2) give the number of ids (length) whose mean is >= 3;
3) give back all rows of the DataFrame where the id's mean is >= 3.
Expected output:
Number of ids (length) where mean >= 3: 3
>> dataframe where (mean(id) >=3)
>>df
name id rate
0 A 1 3.5
1 D 1 4.5
2 M 1 2.0
3 T 2 5.0
4 B 2 4.0
5 L 5 5.0
Use GroupBy.transform to broadcast the mean of each group back to the size of the original DataFrame, so you can filter with boolean indexing:
df = df[df.groupby('id')['rate'].transform('mean') >=3]
print (df)
name id rate
0 A 1 3.5
1 D 1 4.5
2 M 1 2.0
3 T 2 5.0
4 B 2 4.0
9 L 5 5.0
Detail:
print (df.groupby('id')['rate'].transform('mean'))
0 3.333333
1 3.333333
2 3.333333
3 4.500000
4 4.500000
5 1.625000
6 1.625000
7 1.625000
8 1.625000
9 5.000000
Name: rate, dtype: float64
Alternative solution with DataFrameGroupBy.filter:
df = df.groupby('id').filter(lambda x: x['rate'].mean() >=3)
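For step 2 of the question, the count of ids with mean >= 3 can be read straight off the per-group means of the original df:
means = df.groupby('id')['rate'].mean()
print('Number of ids (length) where mean >= 3:', (means >= 3).sum())  # 3 (ids 1, 2 and 5)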