Merge to replace NaNs of the same column in a pandas DataFrame? - python

I have the following dataframe, to which I want to merge multiple dataframes; this df consists of ID, date, and many other variables:
ID date ..other variables...
A 2017Q1
A 2017Q2
A 2018Q1
B 2017Q1
B 2017Q2
B 2017Q3
C 2018Q1
C 2018Q2
.. ..
And I have a bunch of dataframes (one per quarter) with asset-holdings information:
df_2017Q1:
ID date asset_holdings
A 2017Q1 1
B 2017Q1 2
C 2017Q1 4
...
df_2017Q2
ID date asset_holdings
A 2017Q2 2
B 2017Q2 5
C 2017Q2 4
...
df_2017Q3
ID date asset_holdings
A 2017Q3 1
B 2017Q3 2
C 2017Q3 10
...
df_2017Q4..
ID date asset_holdings
A 2017Q4 10
B 2017Q4 20
C 2017Q4 14
...
df_2018Q1..
ID date asset_holdings
A 2018Q1 11
B 2018Q1 23
C 2018Q1 15
...
df_2018Q2...
ID date asset_holdings
A 2018Q2 11
B 2018Q2 26
C 2018Q2 19
...
....
desired output
ID date asset_holdings ..other variables...
A 2017Q1 1
A 2017Q2 2
A 2018Q1 11
B 2017Q1 2
B 2017Q2 5
B 2017Q3 2
C 2018Q1 15
C 2018Q2 19
.. ..
I think merging on ID and date should do it, but a separate merge per quarterly dataframe will create n extra columns, which I do not want. So I want to create a single column "asset_holdings" and merge the right dfs while updating the NaN values, but I am not sure this is the smartest way. Any help will be appreciated!

Try pd.concat() to concatenate your different DataFrames, then use sort_values(['ID', 'date']) to sort the values by the columns ID and date. See the example below as a demonstration.
import pandas as pd
df1 = pd.DataFrame({'ID':list('ABCD'), 'date':['2017Q1']*4, 'other':[1,2,3,4]})
df2 = pd.DataFrame({'ID':list('ABCD'), 'date':['2017Q2']*4, 'other':[4,3,2,1]})
df3 = pd.DataFrame({'ID':list('ABCD'), 'date':['2018Q1']*4, 'other':[7,6,5,4]})
ans = pd.concat([df1, df2, df3]).sort_values(['ID', 'date'], ignore_index=True)
>>> ans
ID date other
0 A 2017Q1 1
1 A 2017Q2 4
2 A 2018Q1 7
3 B 2017Q1 2
4 B 2017Q2 3
5 B 2018Q1 6
6 C 2017Q1 3
7 C 2017Q2 2
8 C 2018Q1 5
9 D 2017Q1 4
10 D 2017Q2 1
11 D 2018Q1 4
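To keep the other variables from the main frame, the concatenated quarterly frames can then be left-merged onto the main df on ID and date, so only a single asset_holdings column is added. A minimal sketch with hypothetical data:

```python
import pandas as pd

# Hypothetical main frame with other variables
main = pd.DataFrame({'ID': ['A', 'A', 'B'],
                     'date': ['2017Q1', '2017Q2', '2017Q1'],
                     'other': [9, 8, 7]})
# Hypothetical quarterly holdings frames
df_2017Q1 = pd.DataFrame({'ID': ['A', 'B'], 'date': ['2017Q1'] * 2, 'asset_holdings': [1, 2]})
df_2017Q2 = pd.DataFrame({'ID': ['A', 'B'], 'date': ['2017Q2'] * 2, 'asset_holdings': [2, 5]})

# Stack all quarters into one long frame, then left-merge onto the main df;
# rows without a matching quarter simply keep NaN in asset_holdings
holdings = pd.concat([df_2017Q1, df_2017Q2], ignore_index=True)
result = main.merge(holdings, on=['ID', 'date'], how='left')
```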

Related

How to search for time series data across two data frames

I have two pandas data frames df1 and df2 like the following:
df1 containing all the data of type a in increasing time order:
type Date
0 a 1970-01-01
1 a 2008-08-01
2 a 2009-07-24
3 a 2010-09-30
4 a 2011-09-29
5 a 2013-06-11
6 a 2013-12-17
7 a 2015-06-02
8 a 2016-06-14
9 a 2017-06-21
10 a 2018-11-26
11 a 2019-06-03
12 a 2019-12-16
df2 containing all the data of type b in increasing time order:
type Date
0 b 2017-11-29
1 b 2018-05-30
2 b 2018-11-26
3 b 2019-06-03
4 b 2019-12-16
5 b 2020-06-18
6 b 2020-12-17
7 b 2021-06-28
A type a entry and a type b entry are considered matching if the date difference between them is within one year. One type a entry can only match one type b entry, and vice versa. How can I time-efficiently find the maximum number of matching pairs, in increasing time order, like the following?
type1 Date1 type2 Date2
0 a 2017-06-21 b 2017-11-29
1 a 2018-11-26 b 2018-05-30
2 a 2019-06-03 b 2018-11-26
3 a 2019-12-16 b 2019-06-03
Use merge_asof:
df3 = pd.merge_asof(df1.rename(columns={'Date': 'Date1', 'type': 'type1'}),
                    df2.rename(columns={'Date': 'Date2', 'type': 'type2'}),
                    left_on='Date1',
                    right_on='Date2',
                    direction='nearest',
                    allow_exact_matches=False,
                    tolerance=pd.Timedelta('365 days')).dropna(subset=['Date2'])
print(df3)
   type1      Date1 type2      Date2
9      a 2017-06-21     b 2017-11-29
10     a 2018-11-26     b 2018-05-30
11     a 2019-06-03     b 2018-11-26
12     a 2019-12-16     b 2020-06-18
Note that merge_asof matches each type a row to its nearest non-exact type b date independently, so the last pair differs from the desired output; enforcing a strict one-to-one matching would need a further step.
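A self-contained sketch of the merge_asof approach, with the relevant rows of the question's frames rebuilt and the dates parsed to datetime (which merge_asof requires, along with sorted keys):

```python
import pandas as pd

# Rebuild (part of) the question's frames with proper datetime columns
df1 = pd.DataFrame({'type': 'a', 'Date': pd.to_datetime(
    ['2015-06-02', '2016-06-14', '2017-06-21', '2018-11-26', '2019-06-03', '2019-12-16'])})
df2 = pd.DataFrame({'type': 'b', 'Date': pd.to_datetime(
    ['2017-11-29', '2018-05-30', '2018-11-26', '2019-06-03', '2019-12-16', '2020-06-18'])})

# For each type-a date, find the nearest non-identical type-b date within a year;
# rows with no match inside the tolerance come back as NaT and are dropped
df3 = pd.merge_asof(df1.rename(columns={'Date': 'Date1', 'type': 'type1'}),
                    df2.rename(columns={'Date': 'Date2', 'type': 'type2'}),
                    left_on='Date1', right_on='Date2',
                    direction='nearest', allow_exact_matches=False,
                    tolerance=pd.Timedelta('365 days')).dropna(subset=['Date2'])
```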

Save previous entry per group / id and date in a column

I have a dataframe in python, with the following sorted format:
df
Name Date Value
A 01.01.20 10
A 02.01.20 20
A 03.01.20 15
B 01.01.20 5
B 02.01.20 10
B 03.01.20 5
C 01.01.20 3
C 03.01.20 6
Not every Name has every date filled. How can I create a new column with the previous date's value (if it is missing, just pick the current value), so that it leads to:
Name Date Value Previous
A 01.01.20 10 10
A 02.01.20 20 10
A 03.01.20 15 20
B 01.01.20 5 5
B 02.01.20 10 5
B 03.01.20 5 10
C 01.01.20 3 3
C 03.01.20 6 6
Use DataFrameGroupBy.shift with Series.fillna:
df['Date'] = pd.to_datetime(df['Date'], format='%d.%m.%y')
df['Previous'] = df.groupby('Name')['Value'].shift().fillna(df['Value'])
print (df)
Name Date Value Previous
0 A 2020-01-01 10 10.0
1 A 2020-01-02 20 10.0
2 A 2020-01-03 15 20.0
3 B 2020-01-01 5 5.0
4 B 2020-01-02 10 5.0
5 B 2020-01-03 5 10.0
6 C 2020-01-01 3 3.0
7 C 2020-01-03 6 3.0
But if you need to shift by one calendar day, so that in the last group a missing previous day falls back to the current value as in the desired output, the solution is different: first create a DatetimeIndex, then build the new column with DataFrame.join:
df['Date'] = pd.to_datetime(df['Date'], format='%d.%m.%y')
df = df.set_index('Date')
s = df.groupby('Name')['Value'].shift(freq='D').rename('Previous')
df = df.join(s, on=['Name','Date']).fillna({'Previous': df['Value']})
print (df)
Name Value Previous
Date
2020-01-01 A 10 10.0
2020-01-02 A 20 10.0
2020-01-03 A 15 20.0
2020-01-01 B 5 5.0
2020-01-02 B 10 5.0
2020-01-03 B 5 10.0
2020-01-01 C 3 3.0
2020-01-03 C 6 6.0
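An equivalent merge-based sketch (same sample data) that avoids the index juggling: shift every row's date forward one day and merge back on (Name, Date), so a row inherits a Previous value only when the prior calendar day actually exists for that Name:

```python
import pandas as pd

# Same sample data as in the question
df = pd.DataFrame({
    'Name': list('AAABBBCC'),
    'Date': pd.to_datetime(['2020-01-01', '2020-01-02', '2020-01-03'] * 2
                           + ['2020-01-01', '2020-01-03']),
    'Value': [10, 20, 15, 5, 10, 5, 3, 6],
})

# Shift the dates forward a day and merge back; unmatched rows (no prior
# calendar day for that Name) fall back to the current Value
prev = (df.assign(Date=df['Date'] + pd.Timedelta(days=1))
          .rename(columns={'Value': 'Previous'}))
out = df.merge(prev, on=['Name', 'Date'], how='left')
out['Previous'] = out['Previous'].fillna(out['Value'])
```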

Insert row in pandas Dataframe based on Date Column

I have a DataFrame df and a list li. My dataframe contains:
Student Score Date
A 10 15-03-19
C 11 16-03-19
A 12 16-03-19
B 10 16-03-19
A 9 17-03-19
My list contains the names of all students: li = ['A', 'B', 'C'].
If any student did not come on a particular day, insert that student's name into the dataframe with a score value of 0.
My Final Dataframe should be like:
Student Score Date
A 10 15-03-19
B 0 15-03-19
C 0 15-03-19
C 11 16-03-19
A 12 16-03-19
B 10 16-03-19
A 9 17-03-19
B 0 17-03-19
C 0 17-03-19
Use DataFrame.reindex with MultiIndex.from_product:
li = list('ABC')
mux = pd.MultiIndex.from_product([df['Date'].unique(), li], names=['Date', 'Student'])
df = df.set_index(['Date', 'Student']).reindex(mux, fill_value=0).reset_index()
print (df)
Date Student Score
0 15-03-19 A 10
1 15-03-19 B 0
2 15-03-19 C 0
3 16-03-19 A 12
4 16-03-19 B 10
5 16-03-19 C 11
6 17-03-19 A 9
7 17-03-19 B 0
8 17-03-19 C 0
An alternative is a left join with DataFrame.merge and a helper DataFrame created by itertools.product; last, replace the missing values with fillna:
from itertools import product
df1 = pd.DataFrame(list(product(df['Date'].unique(), li)), columns=['Date', 'Student'])
df = df1.merge(df, how='left').fillna(0)
print (df)
Date Student Score
0 15-03-19 A 10.0
1 15-03-19 B 0.0
2 15-03-19 C 0.0
3 16-03-19 A 12.0
4 16-03-19 B 10.0
5 16-03-19 C 11.0
6 17-03-19 A 9.0
7 17-03-19 B 0.0
8 17-03-19 C 0.0
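A self-contained version of the reindex approach, with the question's rows rebuilt:

```python
import pandas as pd

df = pd.DataFrame({'Student': ['A', 'C', 'A', 'B', 'A'],
                   'Score': [10, 11, 12, 10, 9],
                   'Date': ['15-03-19', '16-03-19', '16-03-19', '16-03-19', '17-03-19']})
li = list('ABC')

# Build every (Date, Student) combination, then reindex so the missing
# combinations appear with Score filled in as 0
mux = pd.MultiIndex.from_product([df['Date'].unique(), li], names=['Date', 'Student'])
out = df.set_index(['Date', 'Student']).reindex(mux, fill_value=0).reset_index()
```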

Grouping in pandas and assigning a repetition number (first, second, third)

I have a python pandas dataframe that looks like this:
date userid
2017-03 a
2017-04 b
2017-06 b
2017-08 b
2017-05 c
2017-08 c
I would like create a third column that indicates the number of times that the sample was repeated at that date, so the frame looks like this:
date userid repetition
2017-03 a 1
2017-04 b 1
2017-06 b 2
2017-08 b 3
2017-05 c 1
2017-08 c 2
So far, I grouped it by userid, but I only found the way to get the total counts:
data['newcol'] = data.groupby(['userid'])['date'].transform('count')
Thank you very much!!
Use cumcount:
In [282]: df.groupby('userid').cumcount().add(1)
Out[282]:
0 1
1 1
2 2
3 3
4 1
5 2
dtype: int64
In [283]: df.assign(repetition=df.groupby('userid').cumcount().add(1))
Out[283]:
date userid repetition
0 2017-03 a 1
1 2017-04 b 1
2 2017-06 b 2
3 2017-08 b 3
4 2017-05 c 1
5 2017-08 c 2
Or, assign it in place:
In [285]: df['repetition'] = df.groupby('userid').cumcount().add(1)
In [286]: df
Out[286]:
date userid repetition
0 2017-03 a 1
1 2017-04 b 1
2 2017-06 b 2
3 2017-08 b 3
4 2017-05 c 1
5 2017-08 c 2
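For contrast with the asker's transform('count') attempt: transform repeats the total group size on every row, while cumcount gives the running position within the group. A minimal sketch:

```python
import pandas as pd

df = pd.DataFrame({'date': ['2017-03', '2017-04', '2017-06', '2017-08', '2017-05', '2017-08'],
                   'userid': list('abbbcc')})

# transform('count') puts the total group size on every row of the group
df['total'] = df.groupby('userid')['date'].transform('count')
# cumcount gives the 0-based running position; add 1 for 1-based numbering
df['repetition'] = df.groupby('userid').cumcount().add(1)
```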

Generating a sub-dataframe based on a value in a column

I have the following data frame in pandas. Now I want to generate a sub-dataframe whenever I see a certain value in the Activity column. For example, I want a dataframe with all the data for Name A if A's Activity column contains the value 3 or 5.
Name Date Activity
A 01-02-2015 1
A 01-03-2015 2
A 01-04-2015 3
A 01-04-2015 1
B 01-02-2015 1
B 01-02-2015 2
B 01-03-2015 1
B 01-04-2015 5
C 01-31-2015 1
C 01-31-2015 2
C 01-31-2015 2
So for the above data, I want to get
df_A as
Name Date Activity
A 01-02-2015 1
A 01-03-2015 2
A 01-04-2015 3
A 01-04-2015 1
df_B as
B 01-02-2015 1
B 01-02-2015 2
B 01-03-2015 1
B 01-04-2015 5
Since Name C does not have 3 or 5 in the column Activity, I do not want to get this data frame.
Also, the names in the data frame can vary with each input file.
Once I have these data frame separated, I want to plot a time series.
You can group the dataframe by column Name, apply a custom function f, and then select the dataframes df_A and df_B:
print(df)
Name Date Activity
0 A 2015-01-02 1
1 A 2015-01-03 2
2 A 2015-01-04 3
3 A 2015-01-04 1
4 B 2015-01-02 1
5 B 2015-01-02 2
6 B 2015-01-03 1
7 B 2015-01-04 5
8 C 2015-01-31 1
9 C 2015-01-31 2
10 C 2015-01-31 2
def f(df):
    if ((df['Activity'] == 3) | (df['Activity'] == 5)).any():
        return df

g = df.groupby('Name').apply(f).reset_index(drop=True)
df_A = g.loc[g.Name == 'A']
print(df_A)
Name Date Activity
0 A 2015-01-02 1
1 A 2015-01-03 2
2 A 2015-01-04 3
3 A 2015-01-04 1
df_B = g.loc[g.Name == 'B']
print(df_B)
Name Date Activity
4 B 2015-01-02 1
5 B 2015-01-02 2
6 B 2015-01-03 1
7 B 2015-01-04 5
In the end you can plot each sub-dataframe as a time series:
df_A.plot()
df_B.plot()
EDIT:
If you want to create the dataframes dynamically, you can find all unique values of column Name with drop_duplicates:
for name in g.Name.drop_duplicates():
    print(g.loc[g.Name == name])
Name Date Activity
0 A 2015-01-02 1
1 A 2015-01-03 2
2 A 2015-01-04 3
3 A 2015-01-04 1
Name Date Activity
4 B 2015-01-02 1
5 B 2015-01-02 2
6 B 2015-01-03 1
7 B 2015-01-04 5
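The same selection can also be done in one step with GroupBy.filter, which keeps only the groups satisfying a predicate; a sketch with the sample data:

```python
import pandas as pd

df = pd.DataFrame({
    'Name': list('AAAABBBBCCC'),
    'Date': ['01-02-2015', '01-03-2015', '01-04-2015', '01-04-2015',
             '01-02-2015', '01-02-2015', '01-03-2015', '01-04-2015',
             '01-31-2015', '01-31-2015', '01-31-2015'],
    'Activity': [1, 2, 3, 1, 1, 2, 1, 5, 1, 2, 2],
})

# filter drops the groups (here: C) whose Activity never hits 3 or 5
g = df.groupby('Name').filter(lambda grp: grp['Activity'].isin([3, 5]).any())
```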
You can use a dictionary comprehension to create a sub dataframe for each Name with an Activity value of 3 or 5.
active_names = df[df.Activity.isin([3, 5])].Name.unique().tolist()
dfs = {name: df.loc[df.Name == name, :] for name in active_names}
>>> dfs['A']
Name Date Activity
0 A 01-02-2015 1
1 A 01-03-2015 2
2 A 01-04-2015 3
3 A 01-04-2015 1
>>> dfs['B']
Name Date Activity
4 B 01-02-2015 1
5 B 01-02-2015 2
6 B 01-03-2015 1
7 B 01-04-2015 5
