pandas two dataframes, some sort of merge - python

I have a dataframe with two columns like this:
df['one'] = [1, 2, 3, 4, 5]
df['two'] = [np.nan, 15, np.nan, 22, np.nan]
I need some sort of join or merge which will give me a dataframe like this:
df['result'] = [1, 15, 3, 22, 5]
any ideas?

You can use np.where to do it: where df.two is NaN, take the value from df.one; otherwise keep df.two.
import pandas as pd
import numpy as np
# your data
# ========================================
df = pd.DataFrame(dict(one=[1,2,3,4,5], two=[np.nan, 15, np.nan, 22, np.nan]))
print(df)
   one  two
0    1  NaN
1    2   15
2    3  NaN
3    4   22
4    5  NaN
# processing
# ========================================
df['result'] = np.where(df.two.isnull(), df.one, df.two)
print(df)
   one  two  result
0    1  NaN       1
1    2   15      15
2    3  NaN       3
3    4   22      22
4    5  NaN       5
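A fillna-based one-liner should give the same result here, since Series.fillna accepts another Series and fills by index alignment:
df['result'] = df.two.fillna(df.one)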

You can use the pandas method combine_first() to fill the missing values from a DataFrame or Series with values from another; in this case, you want to fill the missing values in df['two'] with the corresponding values in df['one']:
In [342]: df['result'] = df['two'].combine_first(df['one'])
In [343]: df
Out[343]:
   one  two  result
0    1  NaN       1
1    2   15      15
2    3  NaN       3
3    4   22      22
4    5  NaN       5
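As a side note, combine_first is also defined on DataFrames, where it aligns on both index and columns; a minimal sketch with hypothetical frames df_a and df_b:
import pandas as pd
import numpy as np

# hypothetical frames: values from df_a take priority, holes are patched from df_b
df_a = pd.DataFrame({'x': [1.0, np.nan], 'y': [np.nan, 4.0]})
df_b = pd.DataFrame({'x': [9.0, 9.0], 'y': [9.0, 9.0]})
print(df_a.combine_first(df_b))
#      x    y
# 0  1.0  9.0
# 1  9.0  4.0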

Related

Dynamically Fill NaN Values in Dataframe

I have a dataframe with a series of numbers. For example:
Index  Column 1
    1        10
    2        12
    3        24
    4       NaN
    5        20
    6        15
    7       NaN
    8       NaN
    9         2
I can't use bfill or ffill because the rule is dynamic: a value followed by n consecutive NaNs should be spread over itself and those NaNs, i.e. replaced by value / (n + 1). For example, rows 3 and 4 should both become 12 (24/2), and rows 6, 7 and 8 should all become 5 (15/3). All other numbers should remain unchanged.
How should I do that?
Note: Edited the dataframe to be more general by inserting a new row between rows 4 and 5 and another row at the end.
You can do:
m = (df["Column 1"].notna()) & (
(df["Column 1"].shift(-1).isna()) | (df["Column 1"].shift().isna())
)
out = df.groupby(m.cumsum()).transform(
lambda x: x.fillna(0).mean() if x.isna().any() else x
)
print(out):
   Index  Column 1
0      1      10.0
1      2      12.0
2      3      12.0
3      4      12.0
4      5      20.0
5      6       5.0
6      7       5.0
7      8       5.0
8      9       2.0
Explanation and intermediate values:
Basically, look for the rows whose own value is not NaN but whose next or previous value is NaN. Each such row starts a new group.
So m in the above code looks like:
0     True
1    False
2     True
3    False
4     True
5     True
6    False
7    False
8     True
Now I want to form groups of rows shaped like [True, <all False>], because those are the groups I want to take the average of. For that, use cumsum: each True increments the group id.
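With that, m.cumsum() (the grouping key) looks like:
0    1
1    1
2    2
3    2
4    3
5    4
6    4
7    4
8    5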
If you want to take a look at those groups, you can use ngroup() after groupby on m.cumsum():
0    0
1    0
2    1
3    1
4    2
5    3
6    3
7    3
8    4
The above is only to show what the groups are.
Now, for each group, we take the mean if the group contains any NaN value; this is detected with x.isna().any().
If the group has any NaN, assign the mean after filling the NaNs with 0 (filling with 0 first makes the NaNs count toward the divisor, e.g. (24 + 0) / 2 = 12); otherwise keep the group as is. This is accomplished by the lambda:
lambda x: x.fillna(0).mean() if x.isna().any() else x
Why not use interpolate? It has a method argument that would probably fit your need.
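For example, the default linear method fills each NaN from its neighbours, which is not the same as the division rule in the question; a sketch using the question's original data:
import pandas as pd
import numpy as np

s = pd.Series([10, 12, 24, np.nan, 15, np.nan, np.nan])
# linear interpolation fills each NaN from its neighbours,
# e.g. index 3 becomes (24 + 15) / 2 = 19.5, which is not
# the 24/2 division rule the question asks for
print(s.interpolate(method='linear'))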
However, if you really want to apply the rule exactly as described above, you can do something like this (note that iterating over rows in pandas is considered bad practice, but it does the job):
import pandas as pd
import numpy as np

df = pd.DataFrame([10,
                   12,
                   24,
                   np.nan,
                   15,
                   np.nan,
                   np.nan])

for col in df:
    for idx in df.index:  # (iterating over rows is considered bad practice)
        local_idx = idx
        # walk forward over the run of consecutive NaNs
        while local_idx + 1 < len(df) and np.isnan(df.at[local_idx + 1, col]):
            local_idx += 1
        if (local_idx - idx) > 0:
            # divide the leading value by the run length (NaNs + 1)
            fillvalue = df.at[idx, col] / (local_idx - idx + 1)
            for fillidx in range(idx, local_idx + 1):
                df.at[fillidx, col] = fillvalue

df
Output:
      0
0  10.0
1  12.0
2  12.0
3  12.0
4   5.0
5   5.0
6   5.0

what is the best way to create running total columns in pandas

What is the most pandastic way to create running total columns at various levels (without iterating over the rows)?
input:
import pandas as pd
import numpy as np
df = pd.DataFrame()
df['test'] = np.nan,np.nan,'X','X','X','X',np.nan,'X','X','X','X','X','X',np.nan,np.nan,'X','X'
df['desired_output_level_1'] = np.nan,np.nan,'1','1','1','1',np.nan,'2','2','2','2','2','2',np.nan,np.nan,'3','3'
df['desired_output_level_2'] = np.nan,np.nan,'1','2','3','4',np.nan,'1','2','3','4','5','6',np.nan,np.nan,'1','2'
output:
   test desired_output_level_1 desired_output_level_2
0   NaN                    NaN                    NaN
1   NaN                    NaN                    NaN
2     X                      1                      1
3     X                      1                      2
4     X                      1                      3
5     X                      1                      4
6   NaN                    NaN                    NaN
7     X                      2                      1
8     X                      2                      2
9     X                      2                      3
10    X                      2                      4
11    X                      2                      5
12    X                      2                      6
13  NaN                    NaN                    NaN
14  NaN                    NaN                    NaN
15    X                      3                      1
16    X                      3                      2
The test column can only contain X's or NaNs.
The number of consecutive X's is random.
In the 'desired_output_level_1' column, I am trying to number each run of consecutive X's.
In the 'desired_output_level_2' column, I am trying to count the position within each run.
Can anyone help? Thanks in advance.
Perhaps not the most pandastic way, but seems to yield what you are after.
Three key points:
we are operating only on rows that are not NaN, so let's create a mask:
mask = df['test'].notna()
For the level_1 computation, it's easy to detect where a run starts, i.e. a change from NaN to non-NaN, by shifting the rows by one:
df.loc[mask, "level_1"] = (df["test"].isna() & df["test"].shift(-1).notna()).cumsum()
For level 2 computation, it's a bit trickier. One way to do it is to run the computation for each level_1 group and do .transform to preserve the indexing:
df.loc[mask, "level_2"] = (
df.loc[mask, ["level_1"]]
.assign(level_2=1)
.groupby("level_1")["level_2"]
.transform("cumsum")
)
Last step (if needed) is to transform columns to strings:
df['level_1'] = df['level_1'].astype('Int64').astype('str')
df['level_2'] = df['level_2'].astype('Int64').astype('str')
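As a side note, the level_2 step can also be written with cumcount, which numbers the rows within each group starting at 0 (my variation, applied before the string conversion above, not part of the original answer):
df.loc[mask, "level_2"] = df.loc[mask].groupby("level_1").cumcount() + 1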

Drop dataframe rows with values that are an array of NaN

I have a dataframe where, in one column, I've ended up with some values that are not merely "NaN" but an array of NaNs (i.e., "[nan, nan, nan]").
I want to change those values to 0. If it were simply "nan" I would use:
df.fillna(0)
But that doesn't work in this instance.
For instance if:
df1 = pd.DataFrame({
    'ID': [1, 2, 3, 4, 5, 6],
    'Version': [1, 1, 2, 2, 1, 2],
    'Cost': [17, np.nan, 24, [np.nan, np.nan, np.nan], 13, 8]})
Using df1.fillna(0) yields:
   ID  Version             Cost
0   1        1               17
1   2        1                0
2   3        2               24
3   4        2  [nan, nan, nan]
4   5        1               13
5   6        2                8
Whereas I'd like to get this output:
   ID  Version  Cost
0   1        1    17
1   2        1     0
2   3        2    24
3   4        2     0
4   5        1    13
5   6        2     8
In your case the Cost column has object dtype, so you can first convert it to numeric and then fillna.
import pandas as pd

df = pd.DataFrame({"ID": list(range(1, 7)),
                   "Version": [1, 1, 2, 2, 1, 2],
                   "Cost": [17, 0, 24, ['nan', 'nan', 'nan'], 13, 8]})
where df.dtypes gives:
ID          int64
Version     int64
Cost       object
dtype: object
So you can convert this column with to_numeric, using errors='coerce', which assigns np.nan wherever conversion is not possible:
df["Cost"] = pd.to_numeric(df["Cost"], errors='coerce')\
.fillna(0)
or, if you prefer, in two steps:
df["Cost"] = pd.to_numeric(df["Cost"], errors='coerce')
df["Cost"] = df["Cost"].fillna(0)

Merging DataFrames on dates in chronological order in Pandas

I have about 50 DataFrames in a list that have a form like this, where the particular dates included in each DataFrame are not necessarily the same.
>>> print(df1)
    Unnamed: 0  df1_name
0   2004/04/27    2.2700
1   2004/04/28    2.2800
2   2004/04/29    2.2800
3   2004/04/30    2.2800
4   2004/05/04    2.2900
5   2004/05/05    2.3000
6   2004/05/06    2.3200
7   2004/05/07    2.3500
8   2004/05/10    2.3200
9   2004/05/11    2.3400
10  2004/05/12    2.3700
Now, I want to merge these 50 DataFrames together on the date column (unnamed first column in each DataFrame), and include all dates that are present in any of the DataFrames. Should a DataFrame not have a value for that date, it can just be NaN.
So a minimal example:
>>> print(sample1)
   Unnamed: 0  sample_1
0  2004/04/27         1
1  2004/04/28         2
2  2004/04/29         3
3  2004/04/30         4
>>> print(sample2)
   Unnamed: 0  sample_2
0  2004/04/28         5
1  2004/04/29         6
2  2004/05/01         7
3  2004/05/03         8
Then after the merge
>>> print(merged_df)
   Unnamed: 0  sample_1  sample_2
0  2004/04/27         1       NaN
1  2004/04/28         2         5
2  2004/04/29         3         6
3  2004/04/30         4       NaN
....
Is there an easy way to make use of the merge or join functions of Pandas to accomplish this? I have gotten awfully stuck trying to determine how to combine the dates like this.
All you need is pd.concat on all your sample dataframes, but you have to set up a couple of things first: set the index of each one to the column you want to merge on, and make sure that column is a date column. Below is an example of how to do it.
One liner
pd.concat([s.set_index('Unnamed: 0') for s in [sample1, sample2]], axis=1).rename_axis('Unnamed: 0').reset_index()
   Unnamed: 0  sample_1  sample_2
0  2004/04/27       1.0       NaN
1  2004/04/28       2.0       5.0
2  2004/04/29       3.0       6.0
3  2004/04/30       4.0       NaN
4  2004/05/01       NaN       7.0
5  2004/05/03       NaN       8.0
I think this is more understandable
sample1 = pd.DataFrame([
    ['2004/04/27', 1],
    ['2004/04/28', 2],
    ['2004/04/29', 3],
    ['2004/04/30', 4],
], columns=['Unnamed: 0', 'sample_1'])

sample2 = pd.DataFrame([
    ['2004/04/28', 5],
    ['2004/04/29', 6],
    ['2004/05/01', 7],
    ['2004/05/03', 8],
], columns=['Unnamed: 0', 'sample_2'])

list_of_samples = [sample1, sample2]

for i, sample in enumerate(list_of_samples):
    s = list_of_samples[i].copy()
    cols = s.columns.tolist()
    cols[0] = 'Date'  # rename the unnamed first column
    s.columns = cols
    s.Date = pd.to_datetime(s.Date)
    s.set_index('Date', inplace=True)
    list_of_samples[i] = s

pd.concat(list_of_samples, axis=1)
            sample_1  sample_2
Date
2004-04-27       1.0       NaN
2004-04-28       2.0       5.0
2004-04-29       3.0       6.0
2004-04-30       4.0       NaN
2004-05-01       NaN       7.0
2004-05-03       NaN       8.0
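Since the question asks about merge specifically, an outer merge folded over the list of frames is another possible sketch (my variation, using the original sample1/sample2 with their 'Unnamed: 0' column):
from functools import reduce

# outer-merge every frame on the shared date column; dates missing
# from a frame come through as NaN in that frame's value column
# (string dates in YYYY/MM/DD form sort chronologically)
merged_df = reduce(
    lambda left, right: pd.merge(left, right, on='Unnamed: 0', how='outer'),
    [sample1, sample2]
).sort_values('Unnamed: 0', ignore_index=True)
print(merged_df)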

Merging and Filling in Pandas DataFrames

I have two dataframes in Pandas. The columns are named the same and they have the same dimensions, but they have different (and missing) values.
I would like to merge based on one key column and take the max or non-missing data for each equivalent row.
import datetime

import pandas as pd
import numpy as np

df1 = pd.DataFrame({'key': [1, 3, 5, 7],
                    'a': [np.nan, 0, 5, 1],
                    'b': [datetime.datetime.today() - datetime.timedelta(days=x)
                          for x in range(0, 4)]})
df1
    a                          b  key
0 NaN 2014-08-01 10:37:23.828683    1
1   0 2014-07-31 10:37:23.828726    3
2   5 2014-07-30 10:37:23.828736    5
3   1 2014-07-29 10:37:23.828744    7
df2 = pd.DataFrame({'key': [1, 3, 5, 7],
                    'a': [2, 0, np.nan, 3],
                    'b': [datetime.datetime.today() - datetime.timedelta(days=x)
                          for x in range(2, 6)]})
df2.loc[2, 'b'] = np.nan  # use .loc; .ix was removed from pandas
df2
    a                          b  key
0   2 2014-07-30 10:38:13.857203    1
1   0 2014-07-29 10:38:13.857253    3
2 NaN                        NaT    5
3   3 2014-07-27 10:38:13.857272    7
The end result would look like:
df_together
   a                          b  key
0  2 2014-07-30 10:38:13.857203    1
1  0 2014-07-29 10:38:13.857253    3
2  5 2014-07-30 10:37:23.828736    5
3  3 2014-07-27 10:38:13.857272    7
I hope my example covers all cases. If both dataframes have NaN (or NaT) values, then the result should also have NaN (or NaT) values. Try as I might, I can't get the pd.merge function to give what I want.
Often it is easiest in these circumstances to do:
df_together = pd.concat([df1, df2]).groupby('key').max()
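Since max() skips NaN/NaT by default, the non-missing value wins within each key, and a value missing from both frames stays NaN/NaT. If you want key back as a regular column (a small addition on my part), add reset_index():
df_together = pd.concat([df1, df2]).groupby('key').max().reset_index()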
