How to replace part of a DataFrame with another DataFrame - python

I have two DataFrames, and I'd like to filter the data and replace a list of columns in df1 with the same columns from df2.
I want to filter this df with df1.loc[df1["name"]=="A"]:
first_data={"col1":[2,3,4,5,7],
"col2":[4,2,4,6,4],
"col3":[7,6,9,11,2],
"col4":[14,11,22,8,5],
"name":["A","A","V","A","B"],
"n_roll":[8,2,1,3,9]}
df1=pd.DataFrame.from_dict(first_data)
and put the columns ["col1","col2","n_roll"] where name=="A"
into the same places in df2 (at the same indexes).
sec_df={"col1":[55,0,57,1,3],
"col2":[55,0,4,4,53],
"col3":[55,33,9,0,2],
"col4":[55,0,22,4,5],
"name":["A","A","V","A","B"],
"n_roll":[8,2,1,3,9]}
df2=pd.DataFrame.from_dict(sec_df)
If I use the list cols=["col1","col2","col3","col4"],
I'd like to get this:
data={"col1":[55,0,4,1,7],
"col2":[55,0,4,4,4],
"col3":[55,33,9,0,2],
"col4":[55,0,22,4,5],
"name":["A","A","V","A","B"],
"n_roll":[8,2,1,3,9]}
df=pd.DataFrame.from_dict(data)
df

You can achieve this with a double combine_first.
Combine a filtered version of df1 with df2.
However, the columns that were excluded from the filtered version of df1 are left behind with NaN values. But that is okay -- just do another combine_first with df2 to get those values!
(df1.loc[df1['name'] != 'A', ["col1","col2","n_roll"]]  # keep df1's values only where name != 'A'
 .combine_first(df2)                                    # fill the holes from df2
 .combine_first(df2))                                   # sweep up any remaining NaNs
Out[1]:
col1 col2 col3 col4 n_roll name
0 55.0 55.0 55.0 55.0 8.0 A
1 0.0 0.0 33.0 0.0 2.0 A
2 4.0 4.0 9.0 22.0 1.0 V
3 1.0 4.0 0.0 4.0 3.0 A
4 7.0 4.0 2.0 5.0 9.0 B
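An alternative, arguably more explicit sketch (mine, not from the original answer) uses a boolean mask with .loc, assuming the goal is to take df2's values in the rows where name is "A":
cols = ["col1", "col2", "col3", "col4"]   # columns to overwrite (n_roll is identical in both frames)
mask = df1["name"].eq("A")                # rows whose values should come from df2
df = df1.copy()                           # keep df1 intact
df.loc[mask, cols] = df2.loc[mask, cols]  # row-wise replacement at the same indexes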

You can achieve this in one line!
df1=df1[df1.name!='A'].append(df2[df2.name=='A']).sort_index()
col1 col2 col3 col4 name n_roll
0 55 55 55 55 A 8
1 0 0 33 0 A 2
2 4 4 9 22 V 1
3 1 4 0 4 A 3
4 7 4 2 5 B 9
How it works
d=df1[df1.name!='A']      # select the df1 rows where name is not 'A'
e=df2[df2.name=='A']      # select the df2 rows where name is 'A'
d.append(e)               # combine the two DataFrames
d.append(e).sort_index()  # sort by index to restore the original row order
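Note that DataFrame.append was deprecated in pandas 1.4 and removed in 2.0; a pd.concat sketch of the same idea:
df1 = pd.concat([df1[df1.name != 'A'],    # df1 rows to keep
                 df2[df2.name == 'A']])   # df2 rows to swap in
df1 = df1.sort_index()                    # restore the original row order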

Related

Take n last rows of a dataframe with no NaN

Let's take this dataframe:
df = pd.DataFrame(dict(Col1 = [1,2,np.nan,4,5,6], Col2=[4,np.nan,5,np.nan,1,5]))
Col1 Col2
0 1.0 4.0
1 2.0 NaN
2 NaN 5.0
3 4.0 NaN
4 5.0 1.0
5 6.0 5.0
I would like to extract the last n rows of df with no NaN.
Could you please help me get this expected result?
Col1 Col2
0 5 1
1 6 5
EDIT: Let's say I don't know where the last NaN is.
Use DataFrame.dropna with DataFrame.tail, then convert to integers:
N = 2
df1 = df.dropna().tail(N).astype(int)
#alternative
#df1 = df.dropna().iloc[-N:].astype(int)
print (df1)
Col1 Col2
4 5 1
5 6 5
EDIT: To take the last contiguous block of rows with no missing values, detect missing values with DataFrame.isna and DataFrame.any, then reverse the order and take the cumulative sum, so the trailing group of complete rows has 0 in the mask:
m = df.isna().any(axis=1).iloc[::-1].cumsum().eq(0).sort_index()
df1 = df[m].astype(int)
print (df1)
Col1 Col2
4 5 1
5 6 5
If no rows match, it correctly returns an empty DataFrame:
df = pd.DataFrame(dict(Col1 = [1,2,np.nan,4], Col2=[np.nan,np.nan,5,np.nan]))
print (df)
Col1 Col2
0 1.0 NaN
1 2.0 NaN
2 NaN 5.0
3 4.0 NaN
m = df.isna().any(axis=1).iloc[::-1].cumsum().eq(0).sort_index()
df1 = df[m].astype(int)
print (df1)
Empty DataFrame
Columns: [Col1, Col2]
Index: []
Another way is to use isna with cumsum and drop_duplicates to find the position just past the last row containing a NaN, and then filter by position:
last_na = df.isna().cumsum(axis=0).drop_duplicates(keep='first').index.max() + 1
new_df = df.iloc[last_na:]
print(new_df)
Col1 Col2
4 5.0 1.0
5 6.0 5.0
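A plain NumPy alternative (my sketch, not from the answers above): locate the positions of incomplete rows directly and slice everything after the last one:
import numpy as np
bad = np.flatnonzero(df.isna().any(axis=1).to_numpy())  # positions of rows containing NaN
new_df = df.iloc[bad.max() + 1:] if bad.size else df    # rows after the last incomplete one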

How to replace NaN values in Python [duplicate]

This question already has answers here:
How to replace NaN values by Zeroes in a column of a Pandas Dataframe?
(17 answers)
Closed 4 years ago.
I have a number of NaN values in my dataframe, and I want to replace them with an empty string.
What I've tried so far, which isn't working:
df_conbid_N_1 = pd.read_csv("test-2019.csv",dtype=str, sep=';', encoding='utf-8')
df_conbid_N_1['Excep_Test'] = df_conbid_N_1['Excep_Test'].replace("NaN","")
Use fillna (docs) -- the replace("NaN", "") attempt fails because missing values are actual NaN floats, not the literal string "NaN".
An example:
df = pd.DataFrame({'no': [1, 2, 3],
'Col1':['State','City','Town'],
'Col2':['abc', np.NaN, 'defg'],
'Col3':['Madhya Pradesh', 'VBI', 'KJI']})
df
no Col1 Col2 Col3
0 1 State abc Madhya Pradesh
1 2 City NaN VBI
2 3 Town defg KJI
df.Col2.fillna('', inplace=True)
df
no Col1 Col2 Col3
0 1 State abc Madhya Pradesh
1 2 City VBI
2 3 Town defg KJI
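One caveat (my note, not part of the original answer): chained inplace modification like df.Col2.fillna('', inplace=True) can raise warnings or stop updating df under copy-on-write in recent pandas; plain assignment is the safer sketch:
df['Col2'] = df['Col2'].fillna('')   # same effect, no chained inplace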
Simple! You can do it this way:
df_conbid_N_1 = pd.read_csv("test-2019.csv",dtype=str, sep=';',encoding='utf-8').fillna("")
We have pandas' fillna to fill missing values.
Let's go through some use cases with a sample dataframe:
df = pd.DataFrame({'col1':['John', np.nan, 'Anne'], 'col2':[np.nan, 3, 4]})
col1 col2
0 John NaN
1 NaN 3.0
2 Anne 4.0
As mentioned in the docs, fillna accepts the following as fill values:
value: scalar, dict, Series, or DataFrame
So we can replace with a constant value, such as an empty string with:
df.fillna('')
col1 col2
0 John
1 3
2 Anne 4
You can also replace with a dictionary mapping column_name:replace_value:
df.fillna({'col1':'Alex', 'col2':2})
col1 col2
0 John 2.0
1 Alex 3.0
2 Anne 4.0
Or you can also replace with another pd.Series or pd.DataFrame:
df_other = pd.DataFrame({'col1':['John', 'Franc', 'Anne'], 'col2':[5, 3, 4]})
df.fillna(df_other)
col1 col2
0 John 5.0
1 Franc 3.0
2 Anne 4.0
This is very useful since it allows you to fill missing values in the dataframe's columns using a statistic extracted from those columns, such as the mean or the mode. Say we have:
df = pd.DataFrame(np.random.choice(np.r_[np.nan, np.arange(3)], (3,5)))
print(df)
0 1 2 3 4
0 NaN NaN 0.0 1.0 2.0
1 NaN 2.0 NaN 2.0 1.0
2 1.0 1.0 2.0 NaN NaN
Then we can easily do:
df.fillna(df.mean())
0 1 2 3 4
0 1.0 1.5 0.0 1.0 2.0
1 1.0 2.0 1.0 2.0 1.0
2 1.0 1.0 2.0 1.5 1.5
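The same pattern works for the mode. DataFrame.mode returns a DataFrame because there can be ties, so take its first row -- a small sketch, assuming you want the most frequent value per column:
df.fillna(df.mode().iloc[0])   # fill each column with its most frequent value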

Move values in rows in a new column in pandas

I have a DataFrame with an id column and several columns of data, like the column "value" in this example.
For this DataFrame, I want to move all the values that correspond to the same id into new columns in the same row, as shown below.
I guess there is a function opposite to "melt" that allows this, but I can't work out how to pivot this DF.
The dicts for the input and output DFs are:
d = {"id":[1,1,1,2,2,3,3,4,5],"value":[12,13,1,22,21,23,53,64,9]}
d2 = {"id":[1,2,3,4,5],"value1":[12,22,23,64,9],"value2":[1,21,53,"","",],"value3":[1,"","","",""]}
Create a MultiIndex with cumcount, reshape with unstack, and change the column names with add_prefix:
df = (df.set_index(['id',df.groupby('id').cumcount()])['value']
.unstack()
.add_prefix('value')
.reset_index())
print (df)
id value0 value1 value2
0 1 12.0 13.0 1.0
1 2 22.0 21.0 NaN
2 3 23.0 53.0 NaN
3 4 64.0 NaN NaN
4 5 9.0 NaN NaN
Missing values can be replaced with fillna, but that mixes numeric and string data, so some functions may fail:
df = (df.set_index(['id',df.groupby('id').cumcount()])['value']
.unstack()
.add_prefix('value')
.reset_index()
.fillna(''))
print (df)
id value0 value1 value2
0 1 12.0 13 1
1 2 22.0 21
2 3 23.0 53
3 4 64.0
4 5 9.0
You can GroupBy to a list, then expand the series of lists:
df = pd.DataFrame(d) # create input dataframe
res = df.groupby('id')['value'].apply(list).reset_index() # groupby to list
res = res.join(pd.DataFrame(res.pop('value').values.tolist())) # expand lists to columns
print(res)
id 0 1 2
0 1 12 13.0 1.0
1 2 22 21.0 NaN
2 3 23 53.0 NaN
3 4 64 NaN NaN
4 5 9 NaN NaN
In general, such operations will be expensive as the number of columns is arbitrary. Pandas / NumPy solutions work best when you can pre-allocate memory, which isn't possible here.
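Since the question mentions pivot: the same cumcount trick also works with DataFrame.pivot directly. A sketch (the helper column name 'key' is my own choice):
df = pd.DataFrame(d)
res = (df.assign(key=df.groupby('id').cumcount() + 1)       # number the repeats within each id
         .pivot(index='id', columns='key', values='value')  # one column per repeat number
         .add_prefix('value')                               # 1, 2, 3 -> value1, value2, value3
         .reset_index())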

Pandas merge two df

I have two DataFrames
df1 has following form
ID col1 col2
0 1 2 10
1 3 1 21
and df2 looks like this
ID field1 field2
0 1 4 1
1 1 3 3
2 3 5 4
3 3 9 5
4 1 2 0
I want to concatenate both DataFrames so that I have only one line per ID; it'd look like this:
ID col1 col2 field1_1 field2_1 field1_2 field2_2 field1_3 field2_3
0 1 2 10 4 1 3 3 2 0
1 3 1 21 5 4 9 5
I have tried merging and pivoting the data with df.pivot(index=df1.index, columns='ID'),
but because the length is variable, I get a ValueError.
ValueError: all arrays must be same length
Without worrying about formatting yet, we merge and add a MultiIndex level that counts occurrences of each 'ID'.
df = df1.merge(df2)
cc = df.groupby('ID').cumcount()
df.set_index(['ID', 'col1', 'col2', cc]).unstack()
field1 field2
0 1 2 0 1 2
ID col1 col2
1 2 10 4.0 3.0 2.0 1.0 3.0 0.0
3 1 21 5.0 9.0 NaN 4.0 5.0 NaN
We can nail down the formatting with:
df = df1.merge(df2)
cc = df.groupby('ID').cumcount() + 1
d1 = df.set_index(['ID', 'col1', 'col2', cc]).unstack().sort_index(axis=1, level=1)
d1.columns = d1.columns.to_series().map('{0[0]}_{0[1]}'.format)
d1.reset_index()
ID col1 col2 field1_1 field2_1 field1_2 field2_2 field1_3 field2_3
0 1 2 10 4.0 1.0 3.0 3.0 2.0 0.0
1 3 1 21 5.0 4.0 9.0 5.0 NaN NaN
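The .map('{0[0]}_{0[1]}'.format) step flattens the two-level column index into single strings; an equivalent, perhaps more readable list-comprehension sketch:
d1.columns = [f'{field}_{n}' for field, n in d1.columns]   # ('field1', 1) -> 'field1_1'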

How To Create A Dataframe Series

I have a dictionary of Pandas Series objects that I want to turn into a DataFrame. The key for each series should be the column heading. The individual series overlap, but each label is unique.
I thought I should be able to just do
df = pd.DataFrame(data)
But I keep getting the error InvalidIndexError: Reindexing only valid with uniquely valued Index objects.
I get the same error if I try to turn each series into a frame, and use pd.concat(data, axis=1).
Which doesn't make sense if you take the column label into account. What am I doing wrong, and how do I fix it?
I believe you need reset_index with parameter drop=True for each Series in a dict comprehension, because of duplicates in the index:
s = pd.Series([1,4,5,2,0], index=[1,2,2,3,5])
s1 = pd.Series([5,7,8,1],index=[1,2,3,4])
data = {'a':s, 'b': s1}
print (s.reset_index(drop=True))
0 1
1 4
2 5
3 2
4 0
dtype: int64
df = pd.concat({k:v.reset_index(drop=True) for k,v in data.items()}, axis=1)
print (df)
a b
0 1 5.0
1 4 7.0
2 5 8.0
3 2 1.0
4 0 NaN
If you need to drop rows with duplicated index values, use boolean indexing with Index.duplicated:
print (s[~s.index.duplicated()])
1 1
2 4
3 2
5 0
dtype: int64
df = pd.concat({k:v[~v.index.duplicated()] for k,v in data.items()}, axis=1)
print (df)
a b
1 1.0 5.0
2 4.0 7.0
3 2.0 8.0
4 NaN 1.0
5 0.0 NaN
Another solution is to aggregate the duplicated labels, here with the mean:
print (s.groupby(level=0).mean())
1 1.0
2 4.5
3 2.0
5 0.0
dtype: float64
df = pd.concat({k:v.groupby(level=0).mean() for k,v in data.items()}, axis=1)
print (df)
a b
1 1.0 5.0
2 4.5 7.0
3 2.0 8.0
4 NaN 1.0
5 0.0 NaN
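To see which series actually trigger the error, you can check index uniqueness up front -- a small diagnostic sketch:
print({k: v.index.is_unique for k, v in data.items()})
# {'a': False, 'b': True} -> 'a' carries the duplicated label 2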
