I have two DataFrames.
df1 has the following form:
ID col1 col2
0 1 2 10
1 3 1 21
and df2 looks like this:
ID field1 field2
0 1 4 1
1 1 3 3
2 3 5 4
3 3 9 5
4 1 2 0
I want to concatenate both DataFrames so that I end up with only one line per ID, like this:
ID col1 col2 field1_1 field2_1 field1_2 field2_2 field1_3 field2_3
0 1 2 10 4 1 3 3 2 0
1 3 1 21 5 4 9 5
I have tried merging and pivoting the data with df.pivot(index=df1.index, columns='ID'), but because the number of rows per ID varies, I get a ValueError:
ValueError: all arrays must be same length
Without worrying about formatting yet, we can merge and add a level to a MultiIndex that counts the occurrences of each 'ID'.
df = df1.merge(df2)
# number each row within its ID group: 0, 1, 2, ...
cc = df.groupby('ID').cumcount()
df.set_index(['ID', 'col1', 'col2', cc]).unstack()
field1 field2
0 1 2 0 1 2
ID col1 col2
1 2 10 4.0 3.0 2.0 1.0 3.0 0.0
3 1 21 5.0 9.0 NaN 4.0 5.0 NaN
We can nail down the formatting with:
df = df1.merge(df2)
cc = df.groupby('ID').cumcount() + 1
d1 = df.set_index(['ID', 'col1', 'col2', cc]).unstack().sort_index(axis=1, level=1)
# flatten the MultiIndex columns to 'field1_1', 'field2_1', 'field1_2', ...
d1.columns = d1.columns.to_series().map('{0[0]}_{0[1]}'.format)
d1.reset_index()
ID col1 col2 field1_1 field2_1 field1_2 field2_2 field1_3 field2_3
0 1 2 10 4.0 1.0 3.0 3.0 2.0 0.0
1 3 1 21 5.0 4.0 9.0 5.0 NaN NaN
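For reference, a minimal runnable setup for the frames above (values taken from the printed output), against which the snippets just shown can be run:
import pandas as pd

df1 = pd.DataFrame({'ID': [1, 3], 'col1': [2, 1], 'col2': [10, 21]})
df2 = pd.DataFrame({'ID': [1, 1, 3, 3, 1],
                    'field1': [4, 3, 5, 9, 2],
                    'field2': [1, 3, 4, 5, 0]})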
from io import StringIO
import pandas as pd
x1 = """No.,col1,col2,col3,A
123,2,5,2,NaN
453,4,3,1,3
146,7,9,4,2
175,2,4,3,NaN
643,0,0,0,2
"""
x2 = """No.,col1,col2,col3,A
123,24,57,22,1
453,41,39,15,2
175,21,43,37,3
"""
df1 = pd.read_csv(StringIO(x1), sep=",")
df2 = pd.read_csv(StringIO(x2), sep=",")
How can I fill the NaN values in df1 with the corresponding values from df2, matched on the No. column, to get
No. col1 col2 col3 A
123 2 5 2 1
453 4 3 1 3
146 7 9 4 2
175 2 4 3 3
643 0 0 0 2
I tried the following line, but nothing changed:
df1['A'].fillna(df2['A'])
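That line fails for two reasons: fillna with a Series aligns on the index, which here is the default RangeIndex rather than No., and the result is never assigned back to df1. A quick check:
# fillna aligns on the index (here the default RangeIndex), not on 'No.':
# df2 row 2 is No. 175, but it lines up with df1 row 2, which is No. 146.
# The result is also never assigned back, so df1 stays unchanged anyway.
print(df1['A'].fillna(df2['A']))
# 0    1.0   <- filled from df2 row 0 (coincidentally the right No.)
# 1    3.0
# 2    2.0
# 3    NaN   <- df2 has no row at index 3, so nothing to fill with
# 4    2.0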
Use combine_first, which is designed for exactly this purpose:
(df1.set_index('No.')
.combine_first(df2.set_index('No.'))
.reset_index()
)
output:
No. col1 col2 col3 A
0 123 2.0 5.0 2.0 1.0
1 146 7.0 9.0 4.0 2.0
2 175 2.0 4.0 3.0 3.0
3 453 4.0 3.0 1.0 3.0
4 643 0.0 0.0 0.0 2.0
or fillna after setting 'No.' as the index (unlike combine_first, which returns the sorted union of the indexes, this keeps df1's row order):
(df1.set_index('No.')
.fillna(df2.set_index('No.'))
.reset_index()
)
output:
No. col1 col2 col3 A
0 123 2 5 2 1.0
1 453 4 3 1 3.0
2 146 7 9 4 2.0
3 175 2 4 3 3.0
4 643 0 0 0 2.0
Try this:
# align df2's 'A' values to the order of df1's 'No.' column, then fill by position
df1['A'] = df1['A'].fillna(df2.set_index('No.').reindex(df1['No.'])['A'].reset_index(drop=True))
Another way with fillna and map:
df1["A"] = df1["A"].fillna(df1["No."].map(df2.set_index("No.")["A"]))
>>> df1
No. col1 col2 col3 A
0 123 2 5 2 1.0
1 453 4 3 1 3.0
2 146 7 9 4 2.0
3 175 2 4 3 3.0
4 643 0 0 0 2.0
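For clarity: Series.map with a Series argument looks up each value of df1['No.'] in the index of the passed Series. A minimal illustration with the frames above:
lookup = df2.set_index('No.')['A']  # index 123, 453, 175 -> values 1, 2, 3
print(df1['No.'].map(lookup))
# 0    1.0
# 1    2.0
# 2    NaN   <- 146 has no match in df2, so fillna keeps the existing value
# 3    3.0
# 4    NaN   <- 643 has no match either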
I have a DataFrame like this:
df
Col A Col B Col C
25 1 2
NaN 3 1
27 2 3
29 3 1
I want to fill the NaN values in Col A based on Col B and Col C.
My output df should look like this:
Col A Col B Col C
25 1 2
29 3 1
27 2 3
29 3 1
I have tried this code: df.groupby(['Col B','Col C']).ffill()
but it didn't work. Any suggestion would be helpful.
Here you go:
df['Col A'] = df["Col A"].fillna(df.groupby(['Col B','Col C'])["Col A"].transform(lambda x: x.mean()))
print(df)
Prints:
Col A Col B Col C
0 25.0 1 2
1 29.0 3 1
2 27.0 2 3
3 29.0 3 1
You can try
df.fillna(df.groupby(['Col B','Col C']).transform('first'), inplace=True)
df
Out[386]:
Col A Col B Col C
0 25.0 1 2
1 29.0 3 1
2 27.0 2 3
3 29.0 3 1
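The groupby ffill attempt from the question fails here because, within the (Col B=3, Col C=1) group, the NaN row comes before the row holding 29, so a forward fill has nothing to carry forward. Chaining a backward fill covers both directions; a sketch against the frame above:
# forward- then backward-fill within each (Col B, Col C) group,
# so a known value on either side of the NaN can fill it
df['Col A'] = df.groupby(['Col B', 'Col C'])['Col A'].transform(lambda s: s.ffill().bfill())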
How can I add a field that returns 1/0 if the value in any specified column is not NaN?
Example:
import numpy as np
import pandas as pd

df = pd.DataFrame({'id': [1,2,3,4,5,6,7,8,9,10],
                   'val1': [2,2,np.nan,np.nan,np.nan,1,np.nan,np.nan,np.nan,2],
                   'val2': [7,0.2,5,8,np.nan,1,0,np.nan,1,1],
                   })
display(df)
mycols = ['val1', 'val2']
# if any entry in mycols is not NaN, then countif = 1 for that row; else 0
Desired output dataframe:
id val1 val2 countif
0 1 2.0 7.0 1
1 2 2.0 0.2 1
2 3 NaN 5.0 1
3 4 NaN 8.0 1
4 5 NaN NaN 0
5 6 1.0 1.0 1
6 7 NaN 0.0 1
7 8 NaN NaN 0
8 9 NaN 1.0 1
9 10 2.0 1.0 1
We do not need COUNTIF-style logic in pandas; try notna + any:
df['out'] = df[['val1','val2']].notna().any(axis=1).astype(int)
df
Out[381]:
id val1 val2 out
0 1 2.0 7.0 1
1 2 2.0 0.2 1
2 3 NaN 5.0 1
3 4 NaN 8.0 1
4 5 NaN NaN 0
5 6 1.0 1.0 1
6 7 NaN 0.0 1
7 8 NaN NaN 0
8 9 NaN 1.0 1
9 10 2.0 1.0 1
Using the iloc accessor, filter the last two columns. Check whether the count of non-NaN values in each row is greater than zero, and convert the resulting Booleans to integers.
df['countif'] = df.iloc[:, 1:].notna().sum(axis=1).gt(0).astype(int)
id val1 val2 countif
0 1 2.0 7.0 1
1 2 2.0 0.2 1
2 3 NaN 5.0 1
3 4 NaN 8.0 1
4 5 NaN NaN 0
5 6 1.0 1.0 1
6 7 NaN 0.0 1
7 8 NaN NaN 0
8 9 NaN 1.0 1
9 10 2.0 1.0 1
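Both answers hardcode the column names; using the mycols list defined in the question makes the check reusable for any set of columns:
# 1 if any of the listed columns is non-NaN in that row, else 0
df['countif'] = df[mycols].notna().any(axis=1).astype(int)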
I have the following data Frame:
df = pd.DataFrame({"a":[1,2,3,4,5], "b":[3,2,1,2,2], "c": [2,1,0,2,1]})
a b c
0 1 3 2
1 2 2 1
2 3 1 0
3 4 2 2
4 5 2 1
and I want to shift columns a and b at indexes 0 to 2. I.e. my desired result is
a b c
0 NaN NaN 2
1 1 3 1
2 2 2 0
3 4 1 2
4 5 2 1
If I do
df[["a", "b"]][0:3] = df[["a", "b"]][0:3].shift(1)
and look at df, it appears to not have changed.
However, if I select only the rows or only the columns, it works:
Single Column, select subset of rows:
df["a"][0:3] = df["a"][0:3].shift(1)
Output:
a b c
0 NaN 3 2
1 1.0 2 1
2 2.0 1 0
3 4.0 2 2
4 5.0 2 1
Likewise, if I select a list of columns but all rows, it works as expected, too:
df[["a", "b"]] = df[["a", "b"]].shift(1)
output:
a b c
0 NaN NaN 2
1 1.0 3.0 1
2 2.0 2.0 0
3 3.0 1.0 2
4 4.0 2.0 1
Why does df[["a", "b"]][0:3] = df[["a", "b"]][0:3].shift(1) not work as expected? am I missing something?
The problem is the double selection: first the columns and then the rows, so the assignment updates a copy (chained assignment). See also: evaluation order matters.
A possible solution is a single selection with DataFrame.loc, using index labels and column names:
df.loc[0:2, ["a", "b"]] = df.loc[0:2, ["a", "b"]].shift(1)
print (df)
a b c
0 NaN NaN 2
1 1.0 3.0 1
2 2.0 2.0 0
3 4.0 2.0 2
4 5.0 2.0 1
If the index is not the default and it is necessary to select the first two rows by position:
df = pd.DataFrame({"a":[1,2,3,4,5], "b":[3,2,1,2,2], "c": [2,1,0,2,1]},
index=list('abcde'))
df.loc[df.index[0:2], ["a", "b"]] = df.loc[df.index[0:2], ["a", "b"]].shift(1)
print (df)
a b c
a NaN NaN 2
b 1.0 3.0 1
c 3.0 1.0 0
d 4.0 2.0 2
e 5.0 2.0 1
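To surface the problem instead of having the assignment silently update a copy, pandas versions that predate copy-on-write can escalate the chained-assignment warning to an error (under copy-on-write, the default from pandas 3.0, chained assignment simply never works); a sketch:
import pandas as pd

# escalate pandas' chained-assignment warning to an error
# (this option exists in versions before copy-on-write became the default)
pd.set_option('mode.chained_assignment', 'raise')

df = pd.DataFrame({"a": [1, 2, 3, 4, 5], "b": [3, 2, 1, 2, 2], "c": [2, 1, 0, 2, 1]})

# df[["a", "b"]] returns a copy; slicing and assigning into that copy
# raises SettingWithCopyError instead of silently changing nothing
df[["a", "b"]][0:3] = df[["a", "b"]][0:3].shift(1)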
I have this dataframe.
from pandas import DataFrame
import pandas as pd
df = pd.DataFrame({'name': ['A','D','M','T','B','C','D','E','A','L'],
'id': [1,1,1,2,2,3,3,3,3,5],
'rate': [3.5,4.5,2.0,5.0,4.0,1.5,2.0,2.0,1.0,5.0]})
>> df
name id rate
0 A 1 3.5
1 D 1 4.5
2 M 1 2.0
3 T 2 5.0
4 B 2 4.0
5 C 3 1.5
6 D 3 2.0
7 E 3 2.0
8 A 3 1.0
9 L 5 5.0
df = df.groupby('id')['rate'].mean()
What I want is this:
1) find the mean rate of every id;
2) give the number of ids (length) whose mean is >= 3;
3) give back all rows of the DataFrame where the id's mean is >= 3.
Expected output:
Number of ids (length) where mean >= 3: 3
>> dataframe where (mean(id) >=3)
>>df
name id rate
0 A 1 3.5
1 D 1 4.5
2 M 1 2.0
3 T 2 5.0
4 B 2 4.0
5 L 5 5.0
Use GroupBy.transform to broadcast the mean of each group back to the size of the original DataFrame, so you can filter with boolean indexing:
df = df[df.groupby('id')['rate'].transform('mean') >=3]
print (df)
name id rate
0 A 1 3.5
1 D 1 4.5
2 M 1 2.0
3 T 2 5.0
4 B 2 4.0
9 L 5 5.0
Detail:
print (df.groupby('id')['rate'].transform('mean'))
0 3.333333
1 3.333333
2 3.333333
3 4.500000
4 4.500000
5 1.625000
6 1.625000
7 1.625000
8 1.625000
9 5.000000
Name: rate, dtype: float64
Alternative solution with DataFrameGroupBy.filter:
df = df.groupby('id').filter(lambda x: x['rate'].mean() >=3)
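For step 2 of the question, the count of ids with mean >= 3 can be read straight off the per-group means of the original df:
means = df.groupby('id')['rate'].mean()
print('Number of ids (length) where mean >= 3:', (means >= 3).sum())  # 3 (ids 1, 2 and 5)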