Insert row in pandas Dataframe based on Date Column - python

I have a Dataframe df and a list li, My dataframe column contains:
Student Score Date
A 10 15-03-19
C 11 16-03-19
A 12 16-03-19
B 10 16-03-19
A 9 17-03-19
My list contain Name of all Student li=[A,B,C]
If any student have not came on particular day then insert the name of student in dataframe with score value = 0
My Final Dataframe should be like:
Student Score Date
A 10 15-03-19
B 0 15-03-19
C 0 15-03-19
C 11 16-03-19
A 12 16-03-19
B 10 16-03-19
A 9 17-03-19
B 0 17-03-19
C 0 17-03-19

Use DataFrame.reindex with MultiIndex.from_product:
li = list('ABC')
mux = pd.MultiIndex.from_product([df['Date'].unique(), li], names=['Date', 'Student'])
df = df.set_index(['Date', 'Student']).reindex(mux, fill_value=0).reset_index()
print (df)
Date Student Score
0 15-03-19 A 10
1 15-03-19 B 0
2 15-03-19 C 0
3 16-03-19 A 12
4 16-03-19 B 10
5 16-03-19 C 11
6 17-03-19 A 9
7 17-03-19 B 0
8 17-03-19 C 0
Alternative is use left join with DataFrame.merge and helper DataFrame created by product, last replace missing values by fillna:
from itertools import product
df1 = pd.DataFrame(list(product(df['Date'].unique(), li)), columns=['Date', 'Student'])
df = df1.merge(df, how='left').fillna(0)
print (df)
Date Student Score
0 15-03-19 A 10.0
1 15-03-19 B 0.0
2 15-03-19 C 0.0
3 16-03-19 A 12.0
4 16-03-19 B 10.0
5 16-03-19 C 11.0
6 17-03-19 A 9.0
7 17-03-19 B 0.0
8 17-03-19 C 0.0

Related

Save previos entry per group / id and date in a column

I have a dataframe in python, with the following sorted format:
df
Name Date Value
A 01.01.20 10
A 02.01.20 20
A 03.01.20 15
B 01.01.20 5
B 02.01.20 10
B 03.01.20 5
C 01.01.20 3
C 03.01.20 6
So not every Name has every date filled, how can I create a new column with previos date value (if it is missing, just pick the current value) so that it leads to:
Name Date Value Previos
A 01.01.20 10 10
A 02.01.20 20 10
A 03.01.20 15 20
B 01.01.20 5 5
B 02.01.20 10 5
B 03.01.20 5 10
C 01.01.20 3 3
C 03.01.20 6 6
Use DataFrameGroupBy.shift with Series.fillna:
df['Date'] = pd.to_datetime(df['Date'], format='%d.%m.%y')
df['Previos'] = df.groupby('Name')['Value'].shift().fillna(df['Value'])
print (df)
Name Date Value Previos
0 A 2020-01-01 10 10.0
1 A 2020-01-02 20 10.0
2 A 2020-01-03 15 20.0
3 B 2020-01-01 5 5.0
4 B 2020-01-02 10 5.0
5 B 2020-01-03 5 10.0
6 C 2020-01-01 3 3.0
7 C 2020-01-03 6 3.0
But if need shift by 1 day so in last group are same values like original solution is different - first is created DatetimeIndex and for new column is used DataFrame.join:
df['Date'] = pd.to_datetime(df['Date'], format='%d.%m.%y')
df = df.set_index('Date')
s = df.groupby('Name')['Value'].shift(freq='D').rename('Previous')
df = df.join(s, on=['Name','Date']).fillna({'Previous': df['Value']})
print (df)
Name Value Previous
Date
2020-01-01 A 10 10.0
2020-01-02 A 20 10.0
2020-01-03 A 15 20.0
2020-01-01 B 5 5.0
2020-01-02 B 10 5.0
2020-01-03 B 5 10.0
2020-01-01 C 3 3.0
2020-01-03 C 6 6.0

Adding rows in dataframe based on values of another dataframe

I have the following two dataframes. Please note that 'amt' is grouped by 'id' in both dataframes.
df1
id code amt
0 A 1 5
1 A 2 5
2 B 3 10
3 C 4 6
4 D 5 8
5 E 6 11
df2
id code amt
0 B 1 9
1 C 12 10
I want to add a row in df2 for every id of df1 not contained in df2. For example as Id's A, D and E are not contained in df2,I want to add a row for these Id's. The appended row should contain the id not contained in df2, null value for the attribute code and stored value in df1 for attribute amt
The result should be something like this:
id code name
0 B 1 9
1 C 12 10
2 A nan 5
3 D nan 8
4 E nan 11
I would highly appreciate if I can get some guidance on it.
By using pd.concat
df=df1.drop('code',1).drop_duplicates()
df[~df.id.isin(df2.id)]
pd.concat([df2,df[~df.id.isin(df2.id)]],axis=0).rename(columns={'amt':'name'}).reset_index(drop=True)
Out[481]:
name code id
0 9 1.0 B
1 10 12.0 C
2 5 NaN A
3 8 NaN D
4 11 NaN E
Drop dups from df1 then append df2 then drop more dups then append again.
df2.append(
df1.drop_duplicates('id').append(df2)
.drop_duplicates('id', keep=False).assign(code=np.nan),
ignore_index=True
)
id code amt
0 B 1.0 9
1 C 12.0 10
2 A NaN 5
3 D NaN 8
4 E NaN 11
Slight variation
m = ~np.in1d(df1.id.values, df2.id.values)
d = ~df1.duplicated('id').values
df2.append(df1[m & d].assign(code=np.nan), ignore_index=True)
id code amt
0 B 1.0 9
1 C 12.0 10
2 A NaN 5
3 D NaN 8
4 E NaN 11

How to replace part of dataframe in pandas

I have sample dataframe like this
df1=
A B C
a 1 2
b 3 4
b 5 6
c 7 8
d 9 10
I would like to replace a part of this dataframe (col A=a and b) with this dataframe
df2=
A B C
b 9 10
b 11 12
c 13 14
I would like to get result below
df3=
A B C
a 1 2
b 9 10
b 11 12
c 13 14
d 9 10
I tried
df1[df1.A.isin("bc")]...
But I couldnt figure out how to replace.
someone tell how to replace dataframe.
As I explained try update.
import pandas as pd
df1 = pd.DataFrame({"A":['a','b','b','c'], "B":[1,2,4,6], "C":[3,2,1,0]})
df2 = pd.DataFrame({"A":['b','b','c'], "B":[100,400,300], "C":[39,29,100]}).set_index(df1.loc[df1.A.isin(df2.A),:].index)
df1.update(df2)
Out[75]:
A B C
0 a 1.0 3.0
1 b 100.0 39.0
2 b 400.0 29.0
3 c 300.0 100.0
You need combine_first or update by column A, but because duplicates need cumcount:
df1['g'] = df1.groupby('A').cumcount()
df2['g'] = df2.groupby('A').cumcount()
df1 = df1.set_index(['A','g'])
df2 = df2.set_index(['A','g'])
df3 = df2.combine_first(df1).reset_index(level=1, drop=True).astype(int).reset_index()
print (df3)
A B C
0 a 1 2
1 b 9 10
2 b 11 12
3 c 13 14
4 d 9 10
Another solution:
df1['g'] = df1.groupby('A').cumcount()
df2['g'] = df2.groupby('A').cumcount()
df1 = df1.set_index(['A','g'])
df2 = df2.set_index(['A','g'])
df1.update(df2)
df1 = df1.reset_index(level=1, drop=True).astype(int).reset_index()
print (df1)
A B C
0 a 1 2
1 b 9 10
2 b 11 12
3 c 13 14
4 d 9 10
If duplicatesof column A in df1 are same in df2 and have same length:
df2.index = df1.index[df1.A.isin(df2.A)]
df3 = df2.combine_first(df1)
print (df3)
A B C
0 a 1.0 2.0
1 b 9.0 10.0
2 b 11.0 12.0
3 c 13.0 14.0
4 d 9.0 10.0
you could solve your problem with the following:
import pandas as pd
df1 = pd.DataFrame({'A':['a','b','b','c','d'],'B':[1,3,5,7,9],'C':[2,4,6,8,10]})
df2 = pd.DataFrame({'A':['b','b','c'],'B':[9,11,13],'C':[10,12,14]}).set_index(df1.loc[df1.A.isin(df2.A),:].index)
df1.loc[df1.A.isin(df2.A), ['B', 'C']] = df2[['B', 'C']]
Out[108]:
A B C
0 a 1 2
1 b 9 10
2 b 11 12
3 c 13 14
4 d 9 10

Merge dataframes on nearest datetime / timestamp

I have two data frames as follows:
A = pd.DataFrame({"ID":["A", "A", "C" ,"B", "B"], "date":["06/22/2014","07/02/2014","01/01/2015","01/01/1991","08/02/1999"]})
B = pd.DataFrame({"ID":["A", "A", "C" ,"B", "B"], "date":["02/15/2015","06/30/2014","07/02/1999","10/05/1990","06/24/2014"], "value": ["3","5","1","7","8"] })
Which look like the following:
>>> A
ID date
0 A 2014-06-22
1 A 2014-07-02
2 C 2015-01-01
3 B 1991-01-01
4 B 1999-08-02
>>> B
ID date value
0 A 2015-02-15 3
1 A 2014-06-30 5
2 C 1999-07-02 1
3 B 1990-10-05 7
4 B 2014-06-24 8
I want to merge A with the values of B using the nearest date. In this example, none of the dates match, but it could the the case that some do.
The output should be something like this:
>>> C
ID date value
0 A 06/22/2014 8
1 A 07/02/2014 5
2 C 01/01/2015 3
3 B 01/01/1991 7
4 B 08/02/1999 1
It seems to me that there should be a native function in pandas that would allow this.
Note: as similar question has been asked here
pandas.merge: match the nearest time stamp >= the series of timestamps
You can use reindex with method='nearest' and then merge:
A['date'] = pd.to_datetime(A.date)
B['date'] = pd.to_datetime(B.date)
A.sort_values('date', inplace=True)
B.sort_values('date', inplace=True)
B1 = B.set_index('date').reindex(A.set_index('date').index, method='nearest').reset_index()
print (B1)
print (pd.merge(A,B1, on='date'))
ID_x date ID_y value
0 B 1991-01-01 B 7
1 B 1999-08-02 C 1
2 A 2014-06-22 B 8
3 A 2014-07-02 A 5
4 C 2015-01-01 A 3
You can also add parameter suffixes:
print (pd.merge(A,B1, on='date', suffixes=('_', '')))
ID_ date ID value
0 B 1991-01-01 B 7
1 B 1999-08-02 C 1
2 A 2014-06-22 B 8
3 A 2014-07-02 A 5
4 C 2015-01-01 A 3
pd.merge_asof(A, B, on="date", direction='nearest')

Generating sub data frame based on a value in an column

I have following data frame in pandas. Now I want to generate sub data frame if I see a value in Activity column. So for example, I want to have data frame with all the data with Name A IF Activity column as value 3 or 5.
Name Date Activity
A 01-02-2015 1
A 01-03-2015 2
A 01-04-2015 3
A 01-04-2015 1
B 01-02-2015 1
B 01-02-2015 2
B 01-03-2015 1
B 01-04-2015 5
C 01-31-2015 1
C 01-31-2015 2
C 01-31-2015 2
So for the above data, I want to get
df_A as
Name Date Activity
A 01-02-2015 1
A 01-03-2015 2
A 01-04-2015 3
A 01-04-2015 1
df_B as
B 01-02-2015 1
B 01-02-2015 2
B 01-03-2015 1
B 01-04-2015 5
Since Name C does not have 3 or 5 in the column Activity, I do not want to get this data frame.
Also, the names in the data frame can vary with each input file.
Once I have these data frame separated, I want to plot a time series.
You can groupby dataframe by column Name, apply custom function f and then select dataframes df_A and df_B:
print df
Name Date Activity
0 A 2015-01-02 1
1 A 2015-01-03 2
2 A 2015-01-04 3
3 A 2015-01-04 1
4 B 2015-01-02 1
5 B 2015-01-02 2
6 B 2015-01-03 1
7 B 2015-01-04 5
8 C 2015-01-31 1
9 C 2015-01-31 2
10 C 2015-01-31 2
def f(df):
if ((df['Activity'] == 3) | (df['Activity'] == 5)).any():
return df
g = df.groupby('Name').apply(f).reset_index(drop=True)
df_A = g.loc[g.Name == 'A']
print df_A
Name Date Activity
0 A 2015-01-02 1
1 A 2015-01-03 2
2 A 2015-01-04 3
3 A 2015-01-04 1
df_B = g.loc[g.Name == 'B']
print df_B
Name Date Activity
4 B 2015-01-02 1
5 B 2015-01-02 2
6 B 2015-01-03 1
7 B 2015-01-04 5
df_A.plot()
df_B.plot()
In the end you can use plot - more info
EDIT:
If you want create dataframes dynamically, use can find all unique values of column Name by drop_duplicates:
for name in g.Name.drop_duplicates():
print g.loc[g.Name == name]
Name Date Activity
0 A 2015-01-02 1
1 A 2015-01-03 2
2 A 2015-01-04 3
3 A 2015-01-04 1
Name Date Activity
4 B 2015-01-02 1
5 B 2015-01-02 2
6 B 2015-01-03 1
7 B 2015-01-04 5
You can use a dictionary comprehension to create a sub dataframe for each Name with an Activity value of 3 or 5.
active_names = df[df.Activity.isin([3, 5])].Name.unique().tolist()
dfs = {name: df.loc[df.Name == name, :] for name in active_names}
>>> dfs['A']
Name Date Activity
0 A 01-02-2015 1
1 A 01-03-2015 2
2 A 01-04-2015 3
3 A 01-04-2015 1
>>> dfs['B']
Name Date Activity
4 B 01-02-2015 1
5 B 01-02-2015 2
6 B 01-03-2015 1
7 B 01-04-2015 5

Categories

Resources