Combining two dataframes still shows NaN values - python

I would like to fill my first dataframe with data from the second dataframe. Since I don't need any special condition, the combine_first function looks like the right choice for me.
Unfortunately, when I try to combine the two dataframes, the result is still the original dataframe.
My code:
import pandas as pd
df1 = pd.DataFrame({'Gen1': [5, None, 3, 2, 1],
                    'Gen2': [1, 2, None, 4, 5]})
df2 = pd.DataFrame({'Gen1': [None, 4, None, None, None],
                    'Gen2': [None, None, 3, None, None]})
df1.combine_first(df2)
Then, when I print(df1), I get df1 exactly as I initialized it above.
Where did I make a mistake?

combine_first works fine if you assign the output back; the very similar method DataFrame.update, by contrast, works in place:
df = df1.combine_first(df2)
print (df)
   Gen1  Gen2
0   5.0   1.0
1   4.0   2.0
2   3.0   3.0
3   2.0   4.0
4   1.0   5.0
df1.update(df2)
print (df1)
   Gen1  Gen2
0   5.0   1.0
1   4.0   2.0
2   3.0   3.0
3   2.0   4.0
4   1.0   5.0

combine_first returns a new dataframe that contains the changes; it does not update the existing dataframe in place, so you have to assign the returned dataframe back:
df1 = df1.combine_first(df2)
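Note that the difference between the two methods goes beyond copy versus in-place: combine_first only fills holes (NaN) in the calling dataframe, while update overwrites any cell for which the other frame has a non-NaN value. A minimal sketch with made-up values:
import pandas as pd

a = pd.DataFrame({'x': [1.0, None]})
b = pd.DataFrame({'x': [9.0, 2.0]})

print(a.combine_first(b))  # keeps a's 1.0 and only fills the hole -> [1.0, 2.0]

a.update(b)  # overwrites 1.0 with b's 9.0, in place
print(a)     # -> [9.0, 2.0]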


How to fill up each element in a list to given columns of a dataframe?

Let's say I have a dataframe as shown.
I now have a list like [6,7,6]. How do I fill these values into my 3 desired columns, i.e. [One, Two, Four], of the dataframe? Notice that I have not given column 'Three'. The final dataframe should look like:
You can append a Series:
df = pd.DataFrame([[2, 4, 4, 8]],
                  columns=['One', 'Two', 'Three', 'Four'])
values = [6, 3, 6]
lst = ['One', 'Two', 'Four']
df = df.append(pd.Series(values, index=lst), ignore_index=True)
or a dict:
df = df.append(dict(zip(lst, values)), ignore_index=True)
output:
   One  Two  Three  Four
0  2.0  4.0    4.0   8.0
1  6.0  3.0    NaN   6.0
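Note that DataFrame.append was deprecated in pandas 1.4 and removed in 2.0, so on recent versions the same idea is spelled with pd.concat (this applies to the other append-based snippets in this question as well):
import pandas as pd

df = pd.DataFrame([[2, 4, 4, 8]],
                  columns=['One', 'Two', 'Three', 'Four'])
values = [6, 3, 6]
lst = ['One', 'Two', 'Four']

# Build a one-row frame and let concat align on column names;
# the missing 'Three' column becomes NaN in the new row.
row = pd.DataFrame([dict(zip(lst, values))])
df = pd.concat([df, row], ignore_index=True)
print(df)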
you could also fill the new row cell by cell with .loc (note that df[column] = element would overwrite the whole column instead of appending a row):
columnstobefilled = ["One", "Two", "Four"]
elementsfill = [6, 3, 6]
new_row = len(df)  # next row label, assuming a default RangeIndex
for column, element in zip(columnstobefilled, elementsfill):
    df.loc[new_row, column] = element
Since you want the list values to go to specific places, you have to specify where each value should go. One way to do this is with a key-value pair object, a dictionary. Once you create it, you can use append to include it as a row in your dataframe:
d = {'one':6,'Two':7,'Four':6}
df.append(d,ignore_index=True)
   one  Two  Three  Four
0  2.0  4.0    4.0   8.0
1  6.0  7.0    NaN   6.0
Dataset:
df = pd.DataFrame({'one': 2, 'Two': 4, 'Three': 4, 'Four': 8},
                  index=[0])
import pandas as pd
df = pd.DataFrame({'One': 2, 'Two': 4, 'Three': 4, 'Four': 8}, index=[0])
new_row = {'One': 6, 'Two': 7, 'Three': None, 'Four': 6}
df = df.append(new_row, ignore_index=True)  # append returns a new frame, so assign it back
print(df)
output:
   One  Two  Three  Four
0  2.0  4.0    4.0   8.0
1  6.0  7.0    NaN   6.0
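On pandas 2.x, where append is gone, an append-free alternative (a sketch, assuming the default RangeIndex) is enlargement via .loc with a Series that is aligned on the column names:
import pandas as pd

df = pd.DataFrame({'One': 2, 'Two': 4, 'Three': 4, 'Four': 8}, index=[0])
# Assigning a Series to a new label aligns it with the columns;
# 'Three' is absent from the Series, so that cell becomes NaN.
df.loc[len(df)] = pd.Series([6, 7, 6], index=['One', 'Two', 'Four'])
print(df)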

Group latest values in pandas columns for a given id

I have a pandas dataframe containing some metrics for a given date and user.
>>> pd.DataFrame({"user": ['juan','juan','juan','gonzalo'], "date": [1, 2, 3, 1], "var1": [1, 2, None, 1], "var2": [None, 4, 5, 6]})
      user  date  var1  var2
0     juan     1   1.0   NaN
1     juan     2   2.0   4.0
2     juan     3   NaN   5.0
3  gonzalo     1   1.0   6.0
Now, for each user, I want to extract the 2 most recent values for each variable (var1, var2), ignoring NaN unless there aren't enough values to fill the data.
For reference, this should be the resulting dataframe for the data depicted above
user     var1_0  var1_1  var2_0  var2_1
juan        2.0     1.0     5.0     4.0
gonzalo     1.0     NaN     6.0     NaN
Each "historical" value is added as a new column with a _0 or _1 suffix.
First sort by both columns with DataFrame.sort_values if necessary, reshape with DataFrame.melt and remove missing values, keep the last 2 rows per group (the most recent values) with GroupBy.tail, then create a counter column with GroupBy.cumcount, pivot with DataFrame.pivot, and flatten the MultiIndex:
df1 = (df.sort_values(['user','date'])
         .melt(id_vars='user', value_vars=['var1','var2'])
         .dropna(subset=['value'])
      )
df1 = df1.groupby(['user','variable']).tail(2)
df1['g'] = df1.groupby(['user','variable']).cumcount(ascending=False)
df1 = df1.pivot(index='user', columns=['variable', 'g'], values='value')
# older pandas versions
# df1 = df1.set_index(['user','variable', 'g'])['value'].unstack([1,2])
df1.columns = df1.columns.map(lambda x: f'{x[0]}_{x[1]}')
df1 = df1.reset_index()
print (df1)
      user  var1_0  var1_1  var2_0  var2_1
0  gonzalo     1.0     NaN     6.0     NaN
1     juan     2.0     1.0     5.0     4.0
You could group by user and aggregate to get the 2 most recent values. That gets you almost all the way there - but instead of columns you have a list of elements. If you want the actual 2 columns, you have to split the newly created list into columns. Full code:
import pandas as pd
import numpy as np

df = pd.DataFrame(
    {
        "user": ["juan", "juan", "juan", "gonzalo"],
        "date": [1, 2, 3, 1],
        "var1": [1, 2, None, 1],
        "var2": [None, 4, 5, 6],
    }
)

# This almost gets you there
df = (
    df.sort_values(by="date")
    .groupby("user")
    .agg({"var1": lambda x: x.dropna().head(2), "var2": lambda x: x.dropna().head(2)})
)

# Split the columns and get the correct column names
df[["var1_0", "var2_0"]] = df[["var1", "var2"]].apply(
    lambda row: pd.Series(el[0] if isinstance(el, np.ndarray) else el for el in row),
    axis=1,
)
df[["var1_1", "var2_1"]] = df[["var1", "var2"]].apply(
    lambda row: pd.Series(el[-1] if isinstance(el, np.ndarray) else None for el in row),
    axis=1,
)
print(df)
>>
              var1        var2  var1_0  var2_0  var1_1  var2_1
user
gonzalo        1.0         6.0     1.0     6.0     NaN     NaN
juan    [1.0, 2.0]  [4.0, 5.0]     1.0     4.0     2.0     5.0
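If the intermediate list columns are unwanted, here is an alternative sketch (not from the original answers) that builds the suffixed columns per group directly:
import numpy as np
import pandas as pd

df = pd.DataFrame({"user": ['juan', 'juan', 'juan', 'gonzalo'],
                   "date": [1, 2, 3, 1],
                   "var1": [1, 2, None, 1],
                   "var2": [None, 4, 5, 6]})

def last_two(s):
    # two most recent non-NaN values, newest first, padded with NaN
    vals = s.dropna().tail(2).iloc[::-1].tolist()
    vals += [np.nan] * (2 - len(vals))
    return pd.Series(vals, index=[f'{s.name}_0', f'{s.name}_1'])

out = (df.sort_values('date')
         .groupby('user')[['var1', 'var2']]
         .apply(lambda g: pd.concat([last_two(g[c]) for c in g.columns]))
         .reset_index())
print(out)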

Concatenate all columns in dataframe except for NaN

Another simple one. I have a DataFrame (1056 x 39) that contains reference variables from a pivot table. I now need to generate a column of concatenated values across all columns, excluding NaNs. The trouble is that I have quite a few NaNs that are interfering with the output.
Based on another post that I have found Concatenating all columns in pandas dataframe, I can use this approach.
df['Merge'] = df.astype(str).agg(' or '.join, axis=1)
The trouble is that the NaNs remain. How can I modify this line to exclude NaN values (essentially skip them) so that the output only contains the concatenated values?
The intended output should appear as (first row):
df['Merge'][0] = 'Var1 or Var2 or Var20 or Var28' (all NaN values were excluded)
Thanks :)
You can stack to remove the NaN, then cast to string and use groupby + str.join:
import numpy as np
import pandas as pd

df = pd.DataFrame([[1.0, np.nan, 2, 3, 'foo'], [np.nan, None, 5, 'bar', 'bazz']])
df['merged'] = df.stack().astype(str).groupby(level=0).agg(' or '.join)
#     0    1  2    3     4                merged
#0  1.0  NaN  2    3   foo  1.0 or 2 or 3 or foo
#1  NaN  NaN  5  bar  bazz      5 or bar or bazz
Or you can apply along the rows, dropping nulls, casting to string, then joining the non-nulls.
df = pd.DataFrame([[1.0, np.nan, 2, 3, 'foo'], [np.nan, None, 5, 'bar', 'bazz']])
df['merged'] = df.apply(lambda row: ' or '.join(row.dropna().astype(str)), axis=1)
#     0    1  2    3     4                merged
#0  1.0  NaN  2    3   foo  1.0 or 2 or 3 or foo
#1  NaN  NaN  5  bar  bazz      5 or bar or bazz
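One caveat with the stack approach: a row whose values are all NaN contributes nothing to the stacked Series (the classic stack drops NaN by default), so that row's merged value comes back as NaN rather than an empty string, while the row-wise apply returns '' for it. A small sketch of the difference:
import numpy as np
import pandas as pd

df = pd.DataFrame([[np.nan, np.nan], [1, 'foo']])

print(df.stack().astype(str).groupby(level=0).agg(' or '.join))
# row 0 is missing from the result entirely -> NaN after assigning back

print(df.apply(lambda row: ' or '.join(row.dropna().astype(str)), axis=1))
# row 0 -> '' (empty string)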

Boolean Operation on Pandas Dataframe Column Average - this has got to be simple

I have a pandas dataframe:
import pandas as pd
import numpy as np

df = pd.DataFrame({'A': [1, 5, 3],
                   'B': [4, 2, 6]})
df['avg'] = df.mean(axis=1)
df[df < df['avg']]
I would like to keep all the values in the dataframe that are below the average value in column df['avg']. When I perform the operation below, I am returned all NaNs:
df[df < df['avg']]
If I set up a for loop I can get the boolean mask of what I want.
col_names = ['A', 'B']
for colname in col_names:
    df[colname] = df[colname] < df['avg']
What I am searching for would look like this:
df_desired = pd.DataFrame({
    'A': [1, np.nan, 3],
    'B': [np.nan, 2, np.nan],
    'avg': [2.5, 3.5, 4.5]
})
How do I do this? There has to be a pythonic way to do this.
You can use .mask(..) [pandas-doc] here. We can use numpy broadcasting to generate a boolean array that marks the values higher than the row's average:
>>> df.mask(df.values > df['avg'].values[:, None])
     A    B  avg
0  1.0  NaN  2.5
1  NaN  2.0  3.5
2  3.0  NaN  4.5
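For reference, DataFrame.where is the inverse of DataFrame.mask, so keeping the values at or below the row average gives the identical result (a sketch on the same data):
>>> df.where(df.le(df['avg'], axis=0))
     A    B  avg
0  1.0  NaN  2.5
1  NaN  2.0  3.5
2  3.0  NaN  4.5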
I think this is somewhat more idiomatic, and clearer, than the accepted solution:
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 5, 3],
                   'B': [4, 2, 6]})
print(df)
df['avg'] = df.mean(axis=1)
print(df)
df[df[['A', 'B']].ge(df['avg'], axis=0)] = np.nan
print(df)
Output:
   A  B
0  1  4
1  5  2
2  3  6
   A  B  avg
0  1  4  2.5
1  5  2  3.5
2  3  6  4.5
     A    B  avg
0  1.0  NaN  2.5
1  NaN  2.0  3.5
2  3.0  NaN  4.5
Speaking of the accepted solution, it is no longer recommended to use .values in order to convert a Pandas DataFrame or Series to a NumPy array. Fortunately, we don't actually need NumPy at all here; the row-wise comparison can be written entirely in pandas:
df.mask(df.gt(df['avg'], axis=0))

Combine two pandas DataFrame into one new

I have two Pandas DataFrames whose data come from different sources, but both DataFrames have the same column names. When they are combined, each value column needs to keep its own name.
Like this:
speed_df = pd.DataFrame.from_dict({
    'ts': [0, 1, 3, 4],
    'val': [5, 4, 2, 1]
})
temp_df = pd.DataFrame.from_dict({
    'ts': [0, 1, 2],
    'val': [9, 8, 7]
})
And I need to have a result like this:
final_df = pd.DataFrame.from_dict({
    'ts': [0, 1, 2, 3, 4],
    'speed': [5, 4, NaN, 2, 1],
    'temp': [9, 8, 7, NaN, NaN]
})
Later I will deal with the empty cells (here filled with NaN) by copying the previous valid value, and get something like this:
final_df = pd.DataFrame.from_dict({
    'ts': [0, 1, 2, 3, 4],
    'speed': [5, 4, 4, 2, 1],
    'temp': [9, 8, 7, 7, 7]
})
Use pd.merge
In [406]: (pd.merge(speed_df, temp_df, how='outer', on='ts')
             .rename(columns={'val_x': 'speed', 'val_y': 'temp'})
             .sort_values(by='ts'))
Out[406]:
   ts  speed  temp
0   0    5.0   9.0
1   1    4.0   8.0
4   2    NaN   7.0
2   3    2.0   NaN
3   4    1.0   NaN
In [407]: (pd.merge(speed_df, temp_df, how='outer', on='ts')
             .rename(columns={'val_x': 'speed', 'val_y': 'temp'})
             .sort_values(by='ts').ffill())
Out[407]:
   ts  speed  temp
0   0    5.0   9.0
1   1    4.0   8.0
4   2    4.0   7.0
2   3    2.0   7.0
3   4    1.0   7.0
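An alternative to merge (a sketch, not from the original answers) is to index both frames by ts and let pd.concat align them side by side on the index:
final_df = (pd.concat([speed_df.set_index('ts')['val'].rename('speed'),
                       temp_df.set_index('ts')['val'].rename('temp')],
                      axis=1)
              .sort_index()
              .reset_index())
An outer join on the index is the concat default, so ts values missing from either frame come back as NaN, ready for the forward fill.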
This uses two main DataFrame operations: pd.merge and a forward fill. Here is the code:
df = speed_df.merge(temp_df, how='outer', on='ts')
df = df.rename(columns=dict(val_x='speed', val_y='temp'))
df = df.sort_values('ts')
df = df.ffill()  # assign back; fillna(method='ffill') is deprecated on recent pandas
Hope this is helpful.
Thanks
You need to do a full outer join using the pandas.merge function:
d = pd.merge(speed_df, temp_df, on='ts', how='outer').rename(
    columns={'val_x': 'speed', 'val_y': 'temp'})
d = d.sort_values('ts')
d['speed'] = d['speed'].ffill()
d['temp'] = d['temp'].ffill()
That should give you the same forward-filled result as shown above.
