I have about 50 DataFrames in a list that have a form like this, where the particular dates included in each DataFrame are not necessarily the same.
>>> print(df1)
Unnamed: 0 df1_name
0 2004/04/27 2.2700
1 2004/04/28 2.2800
2 2004/04/29 2.2800
3 2004/04/30 2.2800
4 2004/05/04 2.2900
5 2004/05/05 2.3000
6 2004/05/06 2.3200
7 2004/05/07 2.3500
8 2004/05/10 2.3200
9 2004/05/11 2.3400
10 2004/05/12 2.3700
Now, I want to merge these 50 DataFrames together on the date column (unnamed first column in each DataFrame), and include all dates that are present in any of the DataFrames. Should a DataFrame not have a value for that date, it can just be NaN.
So a minimal example:
>>> print(sample1)
Unnamed: 0 sample_1
0 2004/04/27 1
1 2004/04/28 2
2 2004/04/29 3
3 2004/04/30 4
>>> print(sample2)
Unnamed: 0 sample_2
0 2004/04/28 5
1 2004/04/29 6
2 2004/05/01 7
3 2004/05/03 8
Then after the merge
>>> print(merged_df)
Unnamed: 0 sample_1 sample_2
0 2004/04/27 1 NaN
1 2004/04/28 2 5
2 2004/04/29 3 6
3 2004/04/30 4 NaN
....
Is there an easy way to make use of the merge or join functions of Pandas to accomplish this? I have gotten awfully stuck trying to determine how to combine the dates like this.
All you need is pd.concat on all of your sample DataFrames, but two things have to be set up first: make the index of each one the column you want to merge on, and make sure that column holds dates. Below is an example of how to do it.
One-liner
pd.concat([s.set_index('Unnamed: 0') for s in [sample1, sample2]], axis=1).rename_axis('Unnamed: 0').reset_index()
Unnamed: 0 sample_1 sample_2
0 2004/04/27 1.0 NaN
1 2004/04/28 2.0 5.0
2 2004/04/29 3.0 6.0
3 2004/04/30 4.0 NaN
4 2004/05/01 NaN 7.0
5 2004/05/03 NaN 8.0
I think this version is more understandable:
sample1 = pd.DataFrame([
    ['2004/04/27', 1],
    ['2004/04/28', 2],
    ['2004/04/29', 3],
    ['2004/04/30', 4],
], columns=['Unnamed: 0', 'sample_1'])
sample2 = pd.DataFrame([
    ['2004/04/28', 5],
    ['2004/04/29', 6],
    ['2004/05/01', 7],
    ['2004/05/03', 8],
], columns=['Unnamed: 0', 'sample_2'])
list_of_samples = [sample1, sample2]

for i, sample in enumerate(list_of_samples):
    s = sample.copy()
    # rename the first (unnamed) column to 'Date'
    cols = s.columns.tolist()
    cols[0] = 'Date'
    s.columns = cols
    # parse the dates and make them the index so concat can align on them
    s.Date = pd.to_datetime(s.Date)
    s.set_index('Date', inplace=True)
    list_of_samples[i] = s

pd.concat(list_of_samples, axis=1)
sample_1 sample_2
Date
2004-04-27 1.0 NaN
2004-04-28 2.0 5.0
2004-04-29 3.0 6.0
2004-04-30 4.0 NaN
2004-05-01 NaN 7.0
2004-05-03 NaN 8.0
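To scale this to the full list of ~50 DataFrames, the same idea applies. Below is a minimal sketch; the list name dfs, the helper name combine_on_date, and the 'Date' index label are illustrative choices, not names from the question.

import pandas as pd

def combine_on_date(dfs):
    """Outer-align a list of DataFrames on their (unnamed) first date column."""
    indexed = []
    for df in dfs:
        s = df.copy()
        date_col = s.columns[0]                    # the unnamed first column
        s[date_col] = pd.to_datetime(s[date_col])  # make it a real date column
        indexed.append(s.set_index(date_col))
    # concat along axis=1 does an outer join on the index,
    # so dates missing from a frame become NaN in its column
    return pd.concat(indexed, axis=1).rename_axis('Date').sort_index()

# merged = combine_on_date(dfs)  # dfs would be your list of ~50 DataFrames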
I have a dataframe with a series of numbers. For example:
Index Column 1
1 10
2 12
3 24
4 NaN
5 20
6 15
7 NaN
8 NaN
9 2
I can't use bfill or ffill as the rule is dynamic, taking the value from the previous row and dividing by the number of consecutive NaN + 1. For example, rows 3 and 4 should be replaced with 12 as 24/2, rows 6, 7 and 8 should be replaced with 5. All other numbers should remain unchanged.
How should I do that?
Note: Edited the dataframe to be more general by inserting a new row between rows 4 and 5 and another row at the end.
You can do:
m = (df["Column 1"].notna()) & (
(df["Column 1"].shift(-1).isna()) | (df["Column 1"].shift().isna())
)
out = df.groupby(m.cumsum()).transform(
lambda x: x.fillna(0).mean() if x.isna().any() else x
)
print(out):
Index Column 1
0 1 10.0
1 2 12.0
2 3 12.0
3 4 12.0
4 5 20.0
5 6 5.0
6 7 5.0
7 8 5.0
8 9 2.0
Explanation and intermediate values:
Basically, look for the rows whose own value is not NaN but whose next or previous value is NaN. Those rows form the first row of each such group.
So m in the code above looks like:
0 True
1 False
2 True
3 False
4 True
5 True
6 False
7 False
8 True
Now I want to form groups of rows of the shape [True, <all Falses>], because those are the groups I want to average over. For that, use cumsum.
If you want to take a look at those groups, you can use ngroup() after groupby on m.cumsum():
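For example:
print(df.groupby(m.cumsum()).ngroup())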
0 0
1 0
2 1
3 1
4 2
5 3
6 3
7 3
8 4
The above is only to show what the groups are.
Now, for each group, take the mean of the group if the group contains any NaN value; that is checked with x.isna().any().
If the group has a NaN, assign the mean after filling the NaN with 0; otherwise keep the group as is. This is accomplished by the lambda:
lambda x: x.fillna(0).mean() if x.isna().any() else x
Why not use interpolate? It has a method= argument whose options would probably fit your needs.
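A minimal sketch of that idea (note that linear interpolation spreads values between the surrounding numbers, which is close to, but not exactly, the divide-by-count rule described in the question):

import pandas as pd
import numpy as np

s = pd.Series([10, 12, 24, np.nan, 20, 15, np.nan, np.nan, 2], name="Column 1")
# each NaN is replaced by a value evenly spaced between its non-missing neighbours
print(s.interpolate(method="linear"))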
However, if you really want to do as you described above, you can do something like this. (Note that iterating over rows in pandas is considered bad practice, but it does the job)
import pandas as pd
import numpy as np

df = pd.DataFrame([10, 12, 24, np.nan, 15, np.nan, np.nan])

for col in df:
    for idx in df.index:  # (iterating over rows is considered bad practice)
        # walk forward over the run of NaNs that follows this row
        local_idx = idx
        while local_idx + 1 < len(df) and np.isnan(df.at[local_idx + 1, col]):
            local_idx += 1
        if local_idx - idx > 0:
            # split the current value evenly over itself and the trailing NaNs
            fillvalue = df.at[idx, col] / (local_idx - idx + 1)
            for fillidx in range(idx, local_idx + 1):
                df.at[fillidx, col] = fillvalue

df
Output:
0
0 10.0
1 12.0
2 12.0
3 12.0
4 5.0
5 5.0
6 5.0
I work with Python and I am trying to merge two tables, df_agg and df_total. I used how='left' with the expectation that all rows from the first table would be kept. It is important to note that the first table contains duplicates in the join column id, while the second table does not have duplicates in id.
df_new = pd.merge(df_agg,df_total, on='id', how='left')
The merge command executes successfully, but the results are unexpected: instead of df_new['total'] having the same sum as df_agg['total'], the sum of df_new['total'] is greater.
Can anybody explain what causes this and suggest arguments to the function so that the sum is the same before and after merging?
It means id actually has duplicates in both DataFrames, so the new DataFrame has more rows than df_agg (a 'product' of the duplicated rows is created, one row for every combination).
df_agg = pd.DataFrame( {"id": [1,1,2,3,3], 'a':range(5) })
df_total = pd.DataFrame( {"id": [1,1,1,3,4], 'b':range(10,15) })
df_new = pd.merge(df_agg,df_total, on='id', how='left')
print (df_new)
id a b
0 1 0 10.0
1 1 0 11.0
2 1 0 12.0
3 1 1 10.0
4 1 1 11.0
5 1 1 12.0
6 2 2 NaN
7 3 3 13.0
8 3 4 13.0
print (len(df_new), len(df_agg))
9 5
A possible solution is to remove the duplicates:
df_new = pd.merge(df_agg,df_total.drop_duplicates('id'), on='id', how='left')
print (df_new)
id a b
0 1 0 10.0
1 1 1 10.0
2 2 2 NaN
3 3 3 13.0
4 3 4 13.0
print (len(df_new), len(df_agg))
5 5
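To confirm on which side the duplicates live before merging, a quick uniqueness check helps; a small sketch, assuming the df_agg and df_total defined above:

# True only if every id appears exactly once
print(df_agg['id'].is_unique)    # False
print(df_total['id'].is_unique)  # False -> this is what inflates the left join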
I have the following dataframe:
>>> data = pd.DataFrame({'Name': ['CTA15', 'CTA15', 'AC007', 'AC007', 'AC007'],
...                      'ID': [22, 22, 2, 2, 2],
...                      'Sample': ['PE12', 'PL14', 'AE29', 'AE04', 'PE03'],
...                      'count_col': [2, 2, 3, 3, 3]})
>>> data
ID Name Sample count_col
0 22 CTA15 PE12 2
1 22 CTA15 PL14 2
2 2 AC007 AE29 3
3 2 AC007 AE04 3
4 2 AC007 PE03 3
I need to rearrange my data frame as follows:
Name Sample count_col
CTA15 PE12 2
PL14
AC007 AE29 3
AE04
PE03
What I tried is,
pd.pivot_table(All_variants_REL, index=['Name', 'Sample'],
               values=['Count'], aggfunc={'Name': np.size})
But it does not show an accurate count in the count column.
Any help would be great.
It seems you need mask + astype with a boolean mask created by duplicated:
Note: I add a cast to str, because otherwise the count column ends up with mixed values (strings with ints) and some pandas functions can break.
Note 1: the solution works only if the values in the Name column are sorted.
cols = ['Name','count']
df[cols] = df[cols].astype(str).mask(df.duplicated(['Name']), '')
print (df)
Name ID Sample count
0 CTA15 22 PE12 2
1 22 PL14
2 AC007 2 AE29 3
3 2 AE04
4 2 PE03
If you need NaN instead of empty strings, simply omit the astype(str) and the '' argument - but then the last column's values are converted to float (because NaN is a float):
cols = ['Name','count']
df[cols] = df[cols].mask(df.duplicated(['Name']))
print (df)
Name ID Sample count
0 CTA15 22 PE12 2.0
1 NaN 22 PL14 NaN
2 AC007 2 AE29 3.0
3 NaN 2 AE04 NaN
4 NaN 2 PE03 NaN
If you need lists, it is possible to use:
cols = ['Name','count', 'ID']
df = df.groupby(cols)['Sample'].apply(list).reset_index()
print (df)
Name count ID Sample
0 AC007 3 2 [AE29, AE04, PE03]
1 CTA15 2 22 [PE12, PL14]
Why not simply set a MultiIndex? Doing so means all columns still show even if you have many more columns than in the example DataFrame.
>>> data = pd.DataFrame({'Name': ['CTA15', 'CTA15', 'AC007', 'AC007', 'AC007'],
...                      'ID': [22, 22, 2, 2, 2],
...                      'Sample': ['PE12', 'PL14', 'AE29', 'AE04', 'PE03'],
...                      'count_col': [2, 2, 3, 3, 3]})
(Side note: I wouldn't recommend having a column with the name count as it is a DataFrame method and will cause issues down the road. For example, data.count does not return a Series as we might expect.)
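A tiny illustration of that pitfall, assuming a column literally named count:

import pandas as pd

d = pd.DataFrame({'count': [2, 2, 3]})
print(type(d.count))     # <class 'method'> - attribute access hits the DataFrame.count method
print(type(d['count']))  # <class 'pandas.core.series.Series'> - bracket access gets the column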
>>> data
ID Name Sample count_col
0 22 CTA15 PE12 2
1 22 CTA15 PL14 2
2 2 AC007 AE29 3
3 2 AC007 AE04 3
4 2 AC007 PE03 3
Set the multi index, which will serve as solution for an arbitrarily large DataFrame.
>>> data.set_index(['Name', 'Sample'])
ID count_col
Name Sample
CTA15 PE12 22 2
PL14 22 2
AC007 AE29 2 3
AE04 2 3
PE03 2 3
I have a dataframe with two columns like this:
df['one'] = [1, 2, 3, 4, 5]
df['two'] = [np.nan, 15, np.nan, 22, np.nan]
I need some sort of join or merge which will give me a dataframe like this:
df['result'] = [1, 15, 3, 22, 5]
Any ideas?
You can use np.where to do it. So if df.two is NaN, you use df.one's value, otherwise use df.two.
import pandas as pd
import numpy as np
# your data
# ========================================
df = pd.DataFrame(dict(one=[1,2,3,4,5], two=[np.nan, 15, np.nan, 22, np.nan]))
print(df)
one two
0 1 NaN
1 2 15
2 3 NaN
3 4 22
4 5 NaN
# processing
# ========================================
df['result'] = np.where(df.two.isnull(), df.one, df.two)
one two result
0 1 NaN 1
1 2 15 15
2 3 NaN 3
3 4 22 22
4 5 NaN 5
You can use the pandas method combine_first() to fill the missing values from a DataFrame or Series with values from another; in this case, you want to fill the missing values in df['two'] with the corresponding values in df['one']:
In [342]: df['result']= df['two'].combine_first(df['one'])
In [343]: df
Out[343]:
one two result
0 1 NaN 1
1 2 15 15
2 3 NaN 3
3 4 22 22
4 5 NaN 5
I have two dataframes in Pandas. The columns are named the same and they have the same dimensions, but they have different (and missing) values.
I would like to merge based on one key column and take the max or non-missing data for each equivalent row.
import datetime
import pandas as pd
import numpy as np
df1 = pd.DataFrame({'key': [1, 3, 5, 7], 'a': [np.nan, 0, 5, 1], 'b': [datetime.datetime.today() - datetime.timedelta(days=x) for x in range(0, 4)]})
df1
a b key
0 NaN 2014-08-01 10:37:23.828683 1
1 0 2014-07-31 10:37:23.828726 3
2 5 2014-07-30 10:37:23.828736 5
3 1 2014-07-29 10:37:23.828744 7
df2 = pd.DataFrame({'key': [1, 3, 5, 7], 'a': [2, 0, np.nan, 3], 'b': [datetime.datetime.today() - datetime.timedelta(days=x) for x in range(2, 6)]})
df2.loc[2, 'b'] = pd.NaT
df2
a b key
0 2 2014-07-30 10:38:13.857203 1
1 0 2014-07-29 10:38:13.857253 3
2 NaN NaT 5
3 3 2014-07-27 10:38:13.857272 7
The end result would look like:
df_together
a b key
0 2 2014-07-30 10:38:13.857203 1
1 0 2014-07-29 10:38:13.857253 3
2 5 2014-07-30 10:37:23.828736 5
3 3 2014-07-27 10:38:13.857272 7
I hope my example covers all cases. If both dataframes have NaN (or NaT) values, then the result should also have NaN (or NaT) values. Try as I might, I can't get the pd.merge function to give what I want.
Often it is easiest in these circumstances to do:
df_together = pd.concat([df1, df2]).groupby('key').max()
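This works because max skips NaN/NaT by default, so for each key the non-missing (or larger) value wins. If you want key back as a regular column, matching the layout in the question, you can reset the index; a small sketch using the df1 and df2 from above:

df_together = pd.concat([df1, df2]).groupby('key').max().reset_index()
print(df_together)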