I have following data frame in pandas. Now I want to generate sub data frame if I see a value in Activity column. So for example, I want to have data frame with all the data with Name A IF Activity column as value 3 or 5.
Name Date Activity
A 01-02-2015 1
A 01-03-2015 2
A 01-04-2015 3
A 01-04-2015 1
B 01-02-2015 1
B 01-02-2015 2
B 01-03-2015 1
B 01-04-2015 5
C 01-31-2015 1
C 01-31-2015 2
C 01-31-2015 2
So for the above data, I want to get
df_A as
Name Date Activity
A 01-02-2015 1
A 01-03-2015 2
A 01-04-2015 3
A 01-04-2015 1
df_B as
B 01-02-2015 1
B 01-02-2015 2
B 01-03-2015 1
B 01-04-2015 5
Since Name C does not have 3 or 5 in the column Activity, I do not want to get this data frame.
Also, the names in the data frame can vary with each input file.
Once I have these data frame separated, I want to plot a time series.
You can groupby dataframe by column Name, apply custom function f and then select dataframes df_A and df_B:
print df
Name Date Activity
0 A 2015-01-02 1
1 A 2015-01-03 2
2 A 2015-01-04 3
3 A 2015-01-04 1
4 B 2015-01-02 1
5 B 2015-01-02 2
6 B 2015-01-03 1
7 B 2015-01-04 5
8 C 2015-01-31 1
9 C 2015-01-31 2
10 C 2015-01-31 2
def f(df):
if ((df['Activity'] == 3) | (df['Activity'] == 5)).any():
return df
g = df.groupby('Name').apply(f).reset_index(drop=True)
df_A = g.loc[g.Name == 'A']
print df_A
Name Date Activity
0 A 2015-01-02 1
1 A 2015-01-03 2
2 A 2015-01-04 3
3 A 2015-01-04 1
df_B = g.loc[g.Name == 'B']
print df_B
Name Date Activity
4 B 2015-01-02 1
5 B 2015-01-02 2
6 B 2015-01-03 1
7 B 2015-01-04 5
df_A.plot()
df_B.plot()
In the end you can use plot - more info
EDIT:
If you want create dataframes dynamically, use can find all unique values of column Name by drop_duplicates:
for name in g.Name.drop_duplicates():
print g.loc[g.Name == name]
Name Date Activity
0 A 2015-01-02 1
1 A 2015-01-03 2
2 A 2015-01-04 3
3 A 2015-01-04 1
Name Date Activity
4 B 2015-01-02 1
5 B 2015-01-02 2
6 B 2015-01-03 1
7 B 2015-01-04 5
You can use a dictionary comprehension to create a sub dataframe for each Name with an Activity value of 3 or 5.
active_names = df[df.Activity.isin([3, 5])].Name.unique().tolist()
dfs = {name: df.loc[df.Name == name, :] for name in active_names}
>>> dfs['A']
Name Date Activity
0 A 01-02-2015 1
1 A 01-03-2015 2
2 A 01-04-2015 3
3 A 01-04-2015 1
>>> dfs['B']
Name Date Activity
4 B 01-02-2015 1
5 B 01-02-2015 2
6 B 01-03-2015 1
7 B 01-04-2015 5
Related
I have 2 DataFrames, and I want to replace the values in one dataframe, with the values of the other dataframe, base on the columns on the first one. I put the compositions to clarify.
DF1:
A B C D E
Date
01/01/2019 1 2 3 4 5
02/01/2019 1 2 3 4 5
03/01/2019 1 2 3 4 5
DF2:
name1 name2 name3
Date
01/01/2019 A B D
02/01/2019 B C E
03/01/2019 A D E
THE RESULT I WANT:
name1 name2 name3
Date
01/01/2019 1 2 4
02/01/2019 2 3 5
03/01/2019 1 4 5
Try:
result = df2.melt(id_vars="index").merge(
df1.melt(id_vars="index"),
left_on=["index", "value"],
right_on=["index", "variable"],
).drop(columns=["value_x", "variable_y"]).pivot(
index="index", columns="variable_x", values="value_y"
)
print(result)
The two melt's transform your dataframes to only contain the numbers in one column, and an additional column for the orignal column names:
df1.melt(id_vars='index')
index variable value
0 01/01/2019 A 1
1 02/01/2019 A 1
2 03/01/2019 A 1
3 01/01/2019 B 2
4 02/01/2019 B 2
5 03/01/2019 B 2
...
These you can now join on index and value/variable. The last part is just removing a couple of columns and then reshaping the table back to the desired form.
The result is
variable_x name1 name2 name3
index
01/01/2019 1 2 4
02/01/2019 2 3 5
03/01/2019 1 4 5
Use DataFrame.lookup for each column separately:
for c in df2.columns:
df2[c] = df1.lookup(df1.index, df2[c])
print (df2)
name1 name2 name3
01/01/2019 1 2 4
02/01/2019 2 3 5
03/01/2019 1 4 5
General solution is possible different index and columns names:
print (df1)
A B C D G
01/01/2019 1 2 3 4 5
02/01/2019 1 2 3 4 5
05/01/2019 1 2 3 4 5
print (df2)
name1 name2 name3
01/01/2019 A B D
02/01/2019 B C E
08/01/2019 A D E
df1.index = pd.to_datetime(df1.index, dayfirst=True)
df2.index = pd.to_datetime(df2.index, dayfirst=True)
cols = df2.stack().unique()
idx = df2.index
df11 = df1.reindex(columns=cols, index=idx)
print (df11)
A B D C E
2019-01-01 1.0 2.0 4.0 3.0 NaN
2019-01-02 1.0 2.0 4.0 3.0 NaN
2019-01-08 NaN NaN NaN NaN NaN
for c in df2.columns:
df2[c] = df11.lookup(df11.index, df2[c])
print (df2)
name1 name2 name3
2019-01-01 1.0 2.0 4.0
2019-01-02 2.0 3.0 NaN
2019-01-08 NaN NaN NaN
I have a Dataframe df and a list li, My dataframe column contains:
Student Score Date
A 10 15-03-19
C 11 16-03-19
A 12 16-03-19
B 10 16-03-19
A 9 17-03-19
My list contain Name of all Student li=[A,B,C]
If any student have not came on particular day then insert the name of student in dataframe with score value = 0
My Final Dataframe should be like:
Student Score Date
A 10 15-03-19
B 0 15-03-19
C 0 15-03-19
C 11 16-03-19
A 12 16-03-19
B 10 16-03-19
A 9 17-03-19
B 0 17-03-19
C 0 17-03-19
Use DataFrame.reindex with MultiIndex.from_product:
li = list('ABC')
mux = pd.MultiIndex.from_product([df['Date'].unique(), li], names=['Date', 'Student'])
df = df.set_index(['Date', 'Student']).reindex(mux, fill_value=0).reset_index()
print (df)
Date Student Score
0 15-03-19 A 10
1 15-03-19 B 0
2 15-03-19 C 0
3 16-03-19 A 12
4 16-03-19 B 10
5 16-03-19 C 11
6 17-03-19 A 9
7 17-03-19 B 0
8 17-03-19 C 0
Alternative is use left join with DataFrame.merge and helper DataFrame created by product, last replace missing values by fillna:
from itertools import product
df1 = pd.DataFrame(list(product(df['Date'].unique(), li)), columns=['Date', 'Student'])
df = df1.merge(df, how='left').fillna(0)
print (df)
Date Student Score
0 15-03-19 A 10.0
1 15-03-19 B 0.0
2 15-03-19 C 0.0
3 16-03-19 A 12.0
4 16-03-19 B 10.0
5 16-03-19 C 11.0
6 17-03-19 A 9.0
7 17-03-19 B 0.0
8 17-03-19 C 0.0
I'm trying to import data to a pandas DataFrame with columns being date string, label, value. My data looks like the following (just with 4 dates and 5 labels)
from numpy import random
import numpy as np
import pandas as pd
# Creating the data
dates = ("2015-01-01", "2015-01-02", "2015-01-03", "2015-01-04")
values = [random.rand(5) for _ in range(4)]
data = dict(zip(dates,values))
So, the data is a dictionary where the keys are dates, the keys a list of values where the index is the label.
Loading this data structure into a DataFrame
df1 = pd.DataFrame(data)
gives me the dates as columns, the label as index, and the value as the value.
An alternative loading would be
df2 = pd.DataFrame()
df2.from_dict(data, orient='index')
where the dates are index, and columns are labels.
In either of both cases do I manage to do pivoting or stacking to my preferred view.
How should I approach the pivoting/stacking to get the view I want? Or should I change my data structure before loading it into a DataFrame? In particular I'd like to avoid of having to create all the rows of the table beforehand by using a bunch of calls to zip.
IIUC:
Option 1
pd.DataFrame.stack
pd.DataFrame(data).stack() \
.rename('value').rename_axis(['label', 'date']).reset_index()
label date value
0 0 2015-01-01 0.345109
1 0 2015-01-02 0.815948
2 0 2015-01-03 0.758709
3 0 2015-01-04 0.461838
4 1 2015-01-01 0.584527
5 1 2015-01-02 0.823529
6 1 2015-01-03 0.714700
7 1 2015-01-04 0.160735
8 2 2015-01-01 0.779006
9 2 2015-01-02 0.721576
10 2 2015-01-03 0.246975
11 2 2015-01-04 0.270491
12 3 2015-01-01 0.465495
13 3 2015-01-02 0.622024
14 3 2015-01-03 0.227865
15 3 2015-01-04 0.638772
16 4 2015-01-01 0.266322
17 4 2015-01-02 0.575298
18 4 2015-01-03 0.335095
19 4 2015-01-04 0.761181
Option 2
comprehension
pd.DataFrame(
[[i, d, v] for d, l in data.items() for i, v in enumerate(l)],
columns=['label', 'date', 'value']
)
label date value
0 0 2015-01-01 0.345109
1 1 2015-01-01 0.584527
2 2 2015-01-01 0.779006
3 3 2015-01-01 0.465495
4 4 2015-01-01 0.266322
5 0 2015-01-02 0.815948
6 1 2015-01-02 0.823529
7 2 2015-01-02 0.721576
8 3 2015-01-02 0.622024
9 4 2015-01-02 0.575298
10 0 2015-01-03 0.758709
11 1 2015-01-03 0.714700
12 2 2015-01-03 0.246975
13 3 2015-01-03 0.227865
14 4 2015-01-03 0.335095
15 0 2015-01-04 0.461838
16 1 2015-01-04 0.160735
17 2 2015-01-04 0.270491
18 3 2015-01-04 0.638772
19 4 2015-01-04 0.761181
I have two data frames as follows:
A = pd.DataFrame({"ID":["A", "A", "C" ,"B", "B"], "date":["06/22/2014","07/02/2014","01/01/2015","01/01/1991","08/02/1999"]})
B = pd.DataFrame({"ID":["A", "A", "C" ,"B", "B"], "date":["02/15/2015","06/30/2014","07/02/1999","10/05/1990","06/24/2014"], "value": ["3","5","1","7","8"] })
Which look like the following:
>>> A
ID date
0 A 2014-06-22
1 A 2014-07-02
2 C 2015-01-01
3 B 1991-01-01
4 B 1999-08-02
>>> B
ID date value
0 A 2015-02-15 3
1 A 2014-06-30 5
2 C 1999-07-02 1
3 B 1990-10-05 7
4 B 2014-06-24 8
I want to merge A with the values of B using the nearest date. In this example, none of the dates match, but it could the the case that some do.
The output should be something like this:
>>> C
ID date value
0 A 06/22/2014 8
1 A 07/02/2014 5
2 C 01/01/2015 3
3 B 01/01/1991 7
4 B 08/02/1999 1
It seems to me that there should be a native function in pandas that would allow this.
Note: as similar question has been asked here
pandas.merge: match the nearest time stamp >= the series of timestamps
You can use reindex with method='nearest' and then merge:
A['date'] = pd.to_datetime(A.date)
B['date'] = pd.to_datetime(B.date)
A.sort_values('date', inplace=True)
B.sort_values('date', inplace=True)
B1 = B.set_index('date').reindex(A.set_index('date').index, method='nearest').reset_index()
print (B1)
print (pd.merge(A,B1, on='date'))
ID_x date ID_y value
0 B 1991-01-01 B 7
1 B 1999-08-02 C 1
2 A 2014-06-22 B 8
3 A 2014-07-02 A 5
4 C 2015-01-01 A 3
You can also add parameter suffixes:
print (pd.merge(A,B1, on='date', suffixes=('_', '')))
ID_ date ID value
0 B 1991-01-01 B 7
1 B 1999-08-02 C 1
2 A 2014-06-22 B 8
3 A 2014-07-02 A 5
4 C 2015-01-01 A 3
pd.merge_asof(A, B, on="date", direction='nearest')
With this DataFrame:
import pandas as pd
df = pd.DataFrame([[1,1],[1,2],[1,3],[1,5],[1,7],[1,9]], index=pd.date_range('2015-01-01', periods=6), columns=['a', 'b'])
i.e.
a b
2015-01-01 1 1
2015-01-02 1 2
2015-01-03 1 3
2015-01-04 1 5
2015-01-05 1 7
2015-01-06 1 9
the fact of using df = df.groupby(df.b // 4).last() makes the datetime index disappear. Why?
a b
b
0 1 3
1 1 7
2 1 9
Expected result instead:
a b
2015-01-03 1 3
2015-01-05 1 7
2015-01-06 1 9
For groupby your index always getting from grouping values. For you case you could use reset_index and then set_index:
df['c'] = df.b // 4
result = df.reset_index().groupby('c').last().set_index('index')
In [349]: result
Out[349]:
a b
index
2015-01-03 1 3
2015-01-05 1 7
2015-01-06 1 9