Merging and Filling in Pandas DataFrames

I have two dataframes in Pandas. The columns are named the same and they have the same dimensions, but they have different (and missing) values.
I would like to merge based on one key column and take the max or non-missing data for each equivalent row.
import datetime
import pandas as pd
import numpy as np
df1 = pd.DataFrame({'key':[1,3,5,7], 'a':[np.nan, 0, 5, 1], 'b':[datetime.datetime.today() - datetime.timedelta(days=x) for x in range(0,4)]})
df1
df1
a b key
0 NaN 2014-08-01 10:37:23.828683 1
1 0 2014-07-31 10:37:23.828726 3
2 5 2014-07-30 10:37:23.828736 5
3 1 2014-07-29 10:37:23.828744 7
df2 = pd.DataFrame({'key':[1,3,5,7], 'a':[2, 0, np.nan, 3], 'b':[datetime.datetime.today() - datetime.timedelta(days=x) for x in range(2,6)]})
df2.loc[2,'b'] = pd.NaT
df2
a b key
0 2 2014-07-30 10:38:13.857203 1
1 0 2014-07-29 10:38:13.857253 3
2 NaN NaT 5
3 3 2014-07-27 10:38:13.857272 7
The end result would look like:
df_together
a b key
0 2 2014-07-30 10:38:13.857203 1
1 0 2014-07-29 10:38:13.857253 3
2 5 2014-07-30 10:37:23.828736 5
3 3 2014-07-27 10:38:13.857272 7
I hope my example covers all cases. If both dataframes have NaN (or NaT) values, then the result should also have NaN (or NaT) values. Try as I might, I can't get the pd.merge function to give me what I want.

Often it is easiest in these circumstances to do:
df_together = pd.concat([df1, df2]).groupby('key').max()
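Applied to the question's frames (a sketch, with fixed dates standing in for datetime.today()), the per-key max fills each gap from the other frame:

```python
import datetime

import numpy as np
import pandas as pd

df1 = pd.DataFrame({'key': [1, 3, 5, 7],
                    'a': [np.nan, 0, 5, 1],
                    'b': [datetime.datetime(2014, 8, 1) - datetime.timedelta(days=x)
                          for x in range(4)]})
df2 = pd.DataFrame({'key': [1, 3, 5, 7],
                    'a': [2, 0, np.nan, 3],
                    'b': [datetime.datetime(2014, 7, 30) - datetime.timedelta(days=x)
                          for x in range(4)]})
df2.loc[2, 'b'] = pd.NaT

# max() skips NaN/NaT, so a value missing in one frame is taken from the
# other; a value present in both resolves to the larger (for dates: the later).
df_together = pd.concat([df1, df2]).groupby('key').max()
```

Note that where both frames have a value, max keeps the larger one; if you instead want df2's values to take priority and only fill its gaps from df1, `df2.set_index('key').combine_first(df1.set_index('key'))` is an alternative worth trying.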

Identifying consecutive NaN's with pandas part 2

I have a question related to the earlier question: Identifying consecutive NaN's with pandas
I am new on stackoverflow so I cannot add a comment, but I would like to know how I can partly keep the original index of the dataframe when counting the number of consecutive nans.
So instead of:
df = pd.DataFrame({'a':[1,2,np.NaN, np.NaN, np.NaN, 6,7,8,9,10,np.NaN,np.NaN,13,14]})
df
Out[38]:
a
0 1
1 2
2 NaN
3 NaN
4 NaN
5 6
6 7
7 8
8 9
9 10
10 NaN
11 NaN
12 13
13 14
I would like to obtain the following:
Out[41]:
a
0 0
1 0
2 3
5 0
6 0
7 0
8 0
9 0
10 2
12 0
13 0
I have found a workaround. It is quite ugly, but it does the trick. I hope you don't have massive data, because it might not perform well:
import pandas as pd
import numpy as np
df = pd.DataFrame({'a':[1,2,np.nan, np.nan, np.nan, 6,7,8,9,10,np.nan,np.nan,13,14]})
df1 = df.a.isnull().astype(int).groupby(df.a.notnull().astype(int).cumsum()).sum()
# Determine the different groups of NaNs. We only want to keep the 1st. The 0's are non-NaN values, the 1's are the first in a group of NaNs.
b = df.isna()
df2 = b.cumsum() - b.cumsum().where(~b).ffill().fillna(0).astype(int)
df2 = df2.loc[df2['a'] <= 1]
# Set index from the non-zero 'NaN-count' to the index of the first NaN
df3 = df1.loc[df1 != 0]
df3.index = df2.loc[df2['a'] == 1].index
# Update the values from df3 (which has the right values, and the right index), to df2
df2.update(df3)
The NaN-group trick is inspired by the answer to the question linked above.
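For reference, the same steps can be condensed into a few lines (a sketch; the column name 'a' is assumed, as in the question):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, 2, np.nan, np.nan, np.nan, 6, 7, 8, 9, 10,
                         np.nan, np.nan, 13, 14]})

b = df['a'].isna()
# Position within each NaN run: 0 for non-NaN rows, 1, 2, ... inside a run
run_pos = b.cumsum() - b.cumsum().where(~b).ffill().fillna(0)

# Keep non-NaN rows and the first NaN of each run, initialised to 0
out = pd.Series(0, index=df.index[run_pos <= 1], name='a')

# Length of each NaN run, written onto the run's first row
run_len = b.astype(int).groupby((~b).cumsum()).sum()
out[df.index[run_pos == 1]] = run_len[run_len > 0].to_numpy()
```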

pandas add a column with only one row

This sounds a bit weird, but I think it's exactly what I need now:
I have several pandas dataframes that contain columns with float numbers, for example:
a b c
0 0 1 2
1 3 4 5
2 6 7 8
Now I want to add a column, with only one row, whose value is equal to the average of column 'a', which in this case is 3.0. So the new dataframe will look like this:
a b c average
0 0 1 2 3.0
1 3 4 5
2 6 7 8
And all the rows below are empty.
I've tried things like df['average'] = np.mean(df['a']) but that gives me a whole column of 3.0. Any help will be appreciated.
Assign a Series; this is cleaner.
df['average'] = pd.Series(df['a'].mean(), index=df.index[[0]])
Or, even better, assign with loc:
df.loc[df.index[0], 'average'] = df['a'].mean().item()
Filling NaNs is straightforward, you can do
df['average'] = df['average'].fillna('')
df
a b c average
0 0 1 2 3
1 3 4 5
2 6 7 8
You can do something like:
df['average'] = [np.mean(df['a'])]+['']*(len(df)-1)
Here is a full example:
import pandas as pd
import numpy as np
df = pd.DataFrame(
[(0,1,2), (3,4,5), (6,7,8)],
columns=['a', 'b', 'c'])
print(df)
a b c
0 0 1 2
1 3 4 5
2 6 7 8
df['average'] = ''
df.loc[0, 'average'] = df['a'].mean()  # avoids chained-assignment warnings
print(df)
a b c average
0 0 1 2 3
1 3 4 5
2 6 7 8

Merging DataFrames on dates in chronological order in Pandas

I have about 50 DataFrames in a list that have a form like this, where the particular dates included in each DataFrame are not necessarily the same.
>>> print(df1)
Unnamed: 0 df1_name
0 2004/04/27 2.2700
1 2004/04/28 2.2800
2 2004/04/29 2.2800
3 2004/04/30 2.2800
4 2004/05/04 2.2900
5 2004/05/05 2.3000
6 2004/05/06 2.3200
7 2004/05/07 2.3500
8 2004/05/10 2.3200
9 2004/05/11 2.3400
10 2004/05/12 2.3700
Now, I want to merge these 50 DataFrames together on the date column (unnamed first column in each DataFrame), and include all dates that are present in any of the DataFrames. Should a DataFrame not have a value for that date, it can just be NaN.
So a minimal example:
>>> print(sample1)
Unnamed: 0 sample_1
0 2004/04/27 1
1 2004/04/28 2
2 2004/04/29 3
3 2004/04/30 4
>>> print(sample2)
Unnamed: 0 sample_2
0 2004/04/28 5
1 2004/04/29 6
2 2004/05/01 7
3 2004/05/03 8
Then after the merge
>>> print(merged_df)
Unnamed: 0 sample_1 sample_2
0 2004/04/27 1 NaN
1 2004/04/28 2 5
2 2004/04/29 3 6
3 2004/04/30 4 NaN
....
Is there an easy way to make use of the merge or join functions of Pandas to accomplish this? I have gotten awfully stuck trying to determine how to combine the dates like this.
All you need is pd.concat on all your sample dataframes, but you have to set up a couple of things first: set the index of each one to the column you want to merge on, and make sure that column is parsed as a date. Below is an example of how to do it.
One liner
pd.concat([s.set_index('Unnamed: 0') for s in [sample1, sample2]], axis=1).rename_axis('Unnamed: 0').reset_index()
Unnamed: 0 sample_1 sample_2
0 2004/04/27 1.0 NaN
1 2004/04/28 2.0 5.0
2 2004/04/29 3.0 6.0
3 2004/04/30 4.0 NaN
4 2004/05/01 NaN 7.0
5 2004/05/03 NaN 8.0
I think this is more understandable
sample1 = pd.DataFrame([
    ['2004/04/27', 1],
    ['2004/04/28', 2],
    ['2004/04/29', 3],
    ['2004/04/30', 4],
], columns=['Unnamed: 0', 'sample_1'])
sample2 = pd.DataFrame([
    ['2004/04/28', 5],
    ['2004/04/29', 6],
    ['2004/05/01', 7],
    ['2004/05/03', 8],
], columns=['Unnamed: 0', 'sample_2'])
list_of_samples = [sample1, sample2]
for i, sample in enumerate(list_of_samples):
    s = sample.copy()
    cols = s.columns.tolist()
    cols[0] = 'Date'
    s.columns = cols
    s.Date = pd.to_datetime(s.Date)
    s.set_index('Date', inplace=True)
    list_of_samples[i] = s
pd.concat(list_of_samples, axis=1)
sample_1 sample_2
Date
2004-04-27 1.0 NaN
2004-04-28 2.0 5.0
2004-04-29 3.0 6.0
2004-04-30 4.0 NaN
2004-05-01 NaN 7.0
2004-05-03 NaN 8.0
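If you'd rather stay with the merge function the question asks about, an outer merge chained over the list with functools.reduce gives the same result (a sketch using the two sample frames; with ~50 frames the repeated pairwise merges are slower than the concat approach above):

```python
from functools import reduce

import pandas as pd

sample1 = pd.DataFrame({'Unnamed: 0': ['2004/04/27', '2004/04/28',
                                       '2004/04/29', '2004/04/30'],
                        'sample_1': [1, 2, 3, 4]})
sample2 = pd.DataFrame({'Unnamed: 0': ['2004/04/28', '2004/04/29',
                                       '2004/05/01', '2004/05/03'],
                        'sample_2': [5, 6, 7, 8]})

dfs = [sample1, sample2]  # in practice, the list of ~50 frames

# how='outer' keeps every date that appears in any frame; reduce chains
# the pairwise merges across the whole list.
merged_df = reduce(lambda left, right: pd.merge(left, right,
                                                on='Unnamed: 0', how='outer'),
                   dfs).sort_values('Unnamed: 0').reset_index(drop=True)
```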

Update in pandas on specific columns

I want to update values in one pandas dataframe based on the values in another dataframe, but I want to specify which column to update by (i.e., which column should be the “key” for looking up matching rows). Right now it seems to treat the first column as the key. Is there a way to pass it a specific column name?
Example:
import pandas as pd
import numpy as np
df_a = pd.DataFrame()
df_a['x'] = range(5)
df_a['y'] = range(4, -1, -1)
df_a['z'] = np.random.rand(5)
df_b = pd.DataFrame()
df_b['x'] = range(5)
df_b['y'] = range(5)
df_b['z'] = range(5)
print('df_b:')
print(df_b.head())
print('\nold df_a:')
print(df_a.head(10))
df_a.update(df_b)
print('\nnew df_a:')
print(df_a.head())
Out:
df_b:
x y z
0 0 0 0
1 1 1 1
2 2 2 2
3 3 3 3
4 4 4 4
old df_a:
x y z
0 0 4 0.333648
1 1 3 0.683656
2 2 2 0.605688
3 3 1 0.816556
4 4 0 0.360798
new df_a:
x y z
0 0 0 0
1 1 1 1
2 2 2 2
3 3 3 3
4 4 4 4
You see, what it did was replace y and z in df_a with the respective columns in df_b, based on matches of x between df_a and df_b (strictly speaking, update aligns on the index, which here happens to coincide with x).
What if I wanted to keep y the same? What if I want it to replace based on y and not x? Also, what if there are multiple columns on which I'd like to do the replacement (in the real problem, I have to update a dataset with a new dataset, where rows matching on two or three columns should be updated with the values from a fourth column)?
Basically, I want to do some sort of a merge-replace action, where I specify which columns I am merging/replacing on and which column should be replaced.
Hope this makes things clearer. If this cannot be accomplished with update in pandas, I am wondering if there is another way (short of writing a separate function with for loops for it).
This is my current solution, but it seems somewhat inelegant:
df_merge = df_a.merge(df_b, on='y', how='left', suffixes=('_a', '_b'))
print(df_merge.head())
df_merge['x'] = df_merge.x_b
df_merge['z'] = df_merge.z_b
df_update = df_a.copy()
df_update.update(df_merge)
print(df_update)
Out:
x_a y z_a x_b z_b
0 0 0 0.505949 0 0
1 1 1 0.231265 1 1
2 2 2 0.241109 2 2
3 3 3 0.579765 NaN NaN
4 4 4 0.172409 NaN NaN
x y z
0 0 0 0.000000
1 1 1 1.000000
2 2 2 2.000000
3 3 3 0.579765
4 4 4 0.172409
5 5 5 0.893562
6 6 6 0.638034
7 7 7 0.940911
8 8 8 0.998453
9 9 9 0.965866
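A common pattern for this (not from the answer above, but standard pandas): since update always aligns on the index, move the desired key column into the index first, update, then restore it. A sketch with the question's frames:

```python
import numpy as np
import pandas as pd

df_a = pd.DataFrame({'x': range(5), 'y': range(4, -1, -1),
                     'z': np.random.rand(5)})
df_b = pd.DataFrame({'x': range(5), 'y': range(5), 'z': range(5)})

# update() aligns on the index, so make 'y' the key by moving it there;
# a list of columns works the same way for multi-column keys.
df_a = df_a.set_index('y')
df_a.update(df_b.set_index('y'))   # only matching 'y' labels are overwritten
df_a = df_a.reset_index()
```

To replace only some columns, pass a trimmed frame, e.g. df_b.set_index('y')[['z']].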

Adding row in Pandas DataFrame keeping index order

I have a DataFrame and I would like to add some rows that don't yet exist in it. I have found the .loc method, but this adds the values at the end, not in sorted order. For example
import numpy as np
import pandas as pd
dfi = pd.DataFrame(np.arange(6).reshape(3,2),columns=['A','B'])
>>> dfi
A B
0 0 1
1 2 3
2 4 5
[3 rows x 2 columns]
Adding a nonexistent row through .loc:
dfi.loc[5,:] = 0
>>> dfi
A B
0 0 1
1 2 3
2 4 5
5 0 0
[4 rows x 2 columns]
So far everything ok. But this is what happens when trying to add another row, with index smaller than the last one:
dfi.loc[3,:] = 0
>>> dfi
A B
0 0 1
1 2 3
2 4 5
5 0 0
3 0 0
[5 rows x 2 columns]
I would like it to put the row with index 3 between row 2 and row 5. I could sort the DataFrame by index every time, but that would take too long. Is there another way?
My actual problem involves a DataFrame whose indexes are datetime objects. I didn't put the whole detail of that implementation here because it would obscure my real problem: adding rows to a DataFrame such that the result has an ordered index.
If your index is almost continuous, only missing a few values here and there, you may try the following:
In [15]:
df = pd.DataFrame(np.zeros((100, 2)), columns=['A', 'B'])
df['A'] = np.nan
df['B'] = np.nan
In [16]:
df.iloc[[0, 1, 2]] = pd.DataFrame({'A': [0, 2, 4], 'B': [1, 3, 5]})
df.iloc[5] = [0, 0]
df.iloc[3] = 0
print(df.dropna())
A B
0 0 1
1 2 3
2 4 5
3 0 0
5 0 0
[5 rows x 2 columns]
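If rows arrive in batches, the sorting the question mentions only needs to happen once at the end rather than after every insertion; a minimal sketch:

```python
import numpy as np
import pandas as pd

dfi = pd.DataFrame(np.arange(6).reshape(3, 2), columns=['A', 'B'])

# Append the new labels in any order, then restore order with one sort.
dfi.loc[5, :] = 0
dfi.loc[3, :] = 0
dfi = dfi.sort_index()
```

The same works for a DatetimeIndex, since sort_index orders timestamps chronologically.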
