I have the following panel dataset, where "winner" = 1 if someone is a winner in a given period (date) and 0 if they are a loser.
ID date winner
A 2017Q4 NaN
A 2018Q4 1
A 2019Q4 0
A 2020Q4 0
A 2021Q4 1
B 2017Q4 NaN
B 2018Q4 1
B 2019Q4 1
B 2020Q4 0
B 2021Q4 0
C 2017Q4 NaN
C 2018Q4 0
C 2019Q4 0
C 2020Q4 0
C 2021Q4 0
D 2017Q4 NaN
D 2018Q4 0
D 2019Q4 1
D 2020Q4 1
D 2021Q4 1
I want to create four dummy variables: WW = 1 if someone is a winner in two consecutive periods, LL = 1 if a loser in two consecutive periods, WL = 1 if a winner in one period and a loser in the next, and LW vice versa.
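For reference, the example frame can be rebuilt like this (a sketch; values copied from the table above):
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'ID': np.repeat(list('ABCD'), 5),
    'date': ['2017Q4', '2018Q4', '2019Q4', '2020Q4', '2021Q4'] * 4,
    'winner': [np.nan, 1, 0, 0, 1,
               np.nan, 1, 1, 0, 0,
               np.nan, 0, 0, 0, 0,
               np.nan, 0, 1, 1, 1],
})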
UPDATE
When I apply the answers below, I get the following:
ID date winner WW LL WL LW
A 2017Q4 NaN
A 2018Q4 1 0 0 0 0
A 2019Q4 0 0 0 1 0
A 2020Q4 0 0 1 0 0
A 2021Q4 1 0 0 0 1
B 2017Q4 NaN
B 2018Q4 1 0 0 0 0
B 2019Q4 1 1 0 0 0
B 2020Q4 0 0 0 1 0
B 2021Q4 0 0 1 0 0
C 2017Q4 NaN
C 2018Q4 0 0 0 0 0
C 2019Q4 0 0 1 0 0
C 2020Q4 0 0 1 0 0
C 2021Q4 0 0 1 0 0
D 2017Q4 NaN
D 2018Q4 0 0 0 0 0
D 2019Q4 1 0 0 0 1
D 2020Q4 1 1 0 0 0
D 2021Q4 1 1 0 0 0
How do I make sure I get NaN when the previous value is NaN?
Desired output:
ID date winner WW LL WL LW
A 2017Q4 NaN
A 2018Q4 1 NaN NaN NaN NaN
A 2019Q4 0 0 0 1 0
A 2020Q4 0 0 1 0 0
A 2021Q4 1 0 0 0 1
B 2017Q4 NaN
B 2018Q4 1 NaN NaN NaN NaN
B 2019Q4 1 1 0 0 0
B 2020Q4 0 0 0 1 0
B 2021Q4 0 0 1 0 0
C 2017Q4 NaN
C 2018Q4 0 NaN NaN NaN NaN
C 2019Q4 0 0 1 0 0
C 2020Q4 0 0 1 0 0
C 2021Q4 0 0 1 0 0
D 2017Q4 NaN
D 2018Q4 0 NaN NaN NaN NaN
D 2019Q4 1 0 0 0 1
D 2020Q4 1 1 0 0 0
D 2021Q4 1 1 0 0 0
What is the simplest way to do this?
Here's one way: use groupby.shift to get the previous record, then numpy.select to assign labels, which get_dummies converts to dummy variables:
import numpy as np
import pandas as pd

# previous period's value within each ID
df['previous'] = df.groupby('ID')['winner'].shift()
tmp = df[['previous', 'winner']]
dummy_vars = ['WW', 'LL', 'WL', 'LW']

# label each row by its (previous, current) pattern ('' where unmatched),
# expand the labels into dummy columns, blank out rows with no history,
# and drop the helper columns
out = (df.join(pd.get_dummies(np.select([tmp.eq(1).all(1),       # W then W
                                         tmp.eq(0).all(1),       # L then L
                                         tmp.eq([1, 0]).all(1),  # W then L
                                         tmp.eq([0, 1]).all(1)], # L then W
                                        dummy_vars, ''))[dummy_vars + ['']]
                 .mask(df['previous'].isna(), ''))
         .drop(columns=['previous', '']))
Output:
ID date winner WW LL WL LW
0 A 2018Q4 1
1 A 2019Q4 0 0 0 1 0
2 A 2020Q4 0 0 1 0 0
3 A 2021Q4 1 0 0 0 1
4 B 2018Q4 1
5 B 2019Q4 1 1 0 0 0
6 B 2020Q4 0 0 0 1 0
7 B 2021Q4 0 0 1 0 0
8 C 2018Q4 0
9 C 2019Q4 0 0 1 0 0
10 C 2020Q4 0 0 1 0 0
11 C 2021Q4 0 0 1 0 0
12 D 2018Q4 0
13 D 2019Q4 1 0 0 0 1
14 D 2020Q4 1 1 0 0 0
15 D 2021Q4 1 1 0 0 0
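The blanks above are the empty strings written by the mask; if you want real NaN instead, as in the desired output, one hedged follow-up is:
out = out.replace('', np.nan)   # empty strings -> NaN in the dummy columns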
Another approach:
1. map 1 and 0 to "W" and "L"
2. get the 2-period streak
3. get_dummies on the "streak"
4. join to the original DataFrame, ignoring the first row of each ID
# map 1/0 to "W"/"L" (the NaN rows are filled too, but get masked out below)
wins = df["winner"].fillna(0).map({1: "W", 0: "L"})
# previous letter + current letter = the 2-period streak
streaks = wins.shift() + wins
# dummies only where the previous row belongs to the same ID
other = pd.get_dummies(streaks.where(df["ID"].eq(df["ID"].shift())))
# join, keeping NaN wherever the previous winner value is missing
output = df.join(other.where(df["ID"].duplicated() & df["winner"].shift().notna()))
>>> output
ID date winner LL LW WL WW
0 A 2017Q4 NaN NaN NaN NaN NaN
1 A 2018Q4 1.0 NaN NaN NaN NaN
2 A 2019Q4 0.0 0.0 0.0 1.0 0.0
3 A 2020Q4 0.0 1.0 0.0 0.0 0.0
4 A 2021Q4 1.0 0.0 1.0 0.0 0.0
5 B 2017Q4 NaN NaN NaN NaN NaN
6 B 2018Q4 1.0 NaN NaN NaN NaN
7 B 2019Q4 1.0 0.0 0.0 0.0 1.0
8 B 2020Q4 0.0 0.0 0.0 1.0 0.0
9 B 2021Q4 0.0 1.0 0.0 0.0 0.0
10 C 2017Q4 NaN NaN NaN NaN NaN
11 C 2018Q4 0.0 NaN NaN NaN NaN
12 C 2019Q4 0.0 1.0 0.0 0.0 0.0
13 C 2020Q4 0.0 1.0 0.0 0.0 0.0
14 C 2021Q4 0.0 1.0 0.0 0.0 0.0
15 D 2017Q4 NaN NaN NaN NaN NaN
16 D 2018Q4 0.0 NaN NaN NaN NaN
17 D 2019Q4 1.0 0.0 1.0 0.0 0.0
18 D 2020Q4 1.0 0.0 0.0 0.0 1.0
19 D 2021Q4 1.0 0.0 0.0 0.0 1.0
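For reference, a more direct sketch that yields NaN automatically (assumes the numpy/pandas imports above): compute the previous value per ID and build each dummy only where both values are known:
prev = df.groupby('ID')['winner'].shift()
valid = prev.notna() & df['winner'].notna()
for name, (p, w) in {'WW': (1, 1), 'LL': (0, 0), 'WL': (1, 0), 'LW': (0, 1)}.items():
    # 1 when previous == p and current == w; NaN where either value is missing
    df[name] = np.where(valid, ((prev == p) & (df['winner'] == w)).astype(int), np.nan)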
Related
For example, I have a dataframe:
          0         1         2         3         4         5         6
0  0.493212  0.586246       NaN  0.589289       NaN  0.629087  0.593872
1  0.568513  0.367722       NaN       NaN       NaN       NaN  0.423369
2  0.700540  0.735529       NaN       NaN  0.494135       NaN       NaN
3       NaN       NaN       NaN  0.338822  0.466331  0.765367  0.830820
4  0.512891       NaN  0.623782  0.642438       NaN  0.541117  0.929810
If I compare it like:
df >= 0.5
The result is:
   0  1  2  3  4  5  6
0  0  1  0  1  0  1  1
1  1  0  0  0  0  0  0
2  1  1  0  0  0  0  0
3  0  0  0  0  0  1  1
4  1  0  1  1  0  1  1
How can I keep the nan cells? I mean, I need 0.5 > np.nan to give np.nan, not False.
IIUC, you can use a mask:
df.lt(0.5).astype(int).mask(df.isna())
output:
0 1 2 3 4 5 6
0 1.0 0.0 NaN 0.0 NaN 0.0 0.0
1 0.0 1.0 NaN NaN NaN NaN 1.0
2 0.0 0.0 NaN NaN 1.0 NaN NaN
3 NaN NaN NaN 1.0 1.0 0.0 0.0
4 0.0 NaN 0.0 0.0 NaN 0.0 0.0
If you want to keep the integer type:
out = df.lt(0.5).astype(pd.Int64Dtype()).mask(df.isna())
output:
0 1 2 3 4 5 6
0 1 0 <NA> 0 <NA> 0 0
1 0 1 <NA> <NA> <NA> <NA> 1
2 0 0 <NA> <NA> 1 <NA> <NA>
3 <NA> <NA> <NA> 1 1 0 0
4 0 <NA> 0 0 <NA> 0 0
Use DataFrame.mask and convert the values to integers:
df = (df >= 0.5).astype(int).mask(df.isna())
print (df)
0 1 2 3 4 5 6
0 0.0 1.0 NaN 1.0 NaN 1.0 1.0
1 1.0 0.0 NaN NaN NaN NaN 0.0
2 1.0 1.0 NaN NaN 0.0 NaN NaN
3 NaN NaN NaN 0.0 0.0 1.0 1.0
4 1.0 NaN 1.0 1.0 NaN 1.0 1.0
Details:
print ((df >= 0.5).astype(int))
0 1 2 3 4 5 6
0 0 1 0 1 0 1 1
1 1 0 0 0 0 0 0
2 1 1 0 0 0 0 0
3 0 0 0 0 0 1 1
4 1 0 1 1 0 1 1
Another idea with numpy.select:
df[:] = np.select([df.isna(), df >= 0.5], [None, 1], default=0)
print (df)
0 1 2 3 4 5 6
0 0.0 1.0 NaN 1.0 NaN 1.0 1.0
1 1.0 0.0 NaN NaN NaN NaN 0.0
2 1.0 1.0 NaN NaN 0.0 NaN NaN
3 NaN NaN NaN 0.0 0.0 1.0 1.0
4 1.0 NaN 1.0 1.0 NaN 1.0 1.0
By the way, if you need True/False with NaN, you can use the nullable boolean data type:
df = (df >= 0.5).astype(int).mask(df.isna()).astype('boolean')
print (df)
0 1 2 3 4 5 6
0 False True <NA> True <NA> True True
1 True False <NA> <NA> <NA> <NA> False
2 True True <NA> <NA> False <NA> <NA>
3 <NA> <NA> <NA> False False True True
4 True <NA> True True <NA> True True
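On recent pandas there is one more option, a hedged sketch assuming pandas >= 1.2 (where convert_dtypes produces nullable Float64) and starting from the original float frame:
nullable = df.convert_dtypes()            # float64 -> nullable Float64
out = (nullable >= 0.5).astype('Int64')   # 1/0 with <NA> kept where the input was NaN
Comparisons on nullable columns propagate <NA> natively, so no mask step is needed.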
I would like to subtract [a groupby mean of subset] from the [original] dataframe:
I have a pandas DataFrame data whose index is a datetime (monthly, say 100 years = 100 yr * 12 mo) and which has 10 columns of station IDs (i.e., a 1200-row * 10-column pd.DataFrame).
1)
I would like to first take a subset of the above data, e.g. the top 50 years (i.e., 50 yr * 12 mo),
data_sub = data_org[data_org.index.year <= top_50_year]
and calculate the monthly mean for each month for each station (column), e.g.,
mean_sub = data_sub.groupby(data_sub.index.month).mean()
or
mean_sub = data_sub.groupby(data_sub.index.month).transform('mean')
which seem to do the job.
2)
Now I want to subtract the above from the [original], NOT from the [subset], e.g.,
data_org - mean_sub
which I do not know how to do. So in summary, I would like to calculate the monthly mean from a subset of the original data (e.g., only using 50 years), and subtract that monthly mean from the original data month by month.
Subtracting was easy when I used the full [original] data to calculate the mean (i.e., .transform('mean') or .apply(lambda x: x - x.mean()) do the job), but what should I do when the mean is calculated from a [subset] of the data?
Could you share your insight for this problem? Thank you in advance!
@mozway:
The input (and also the output) shape looks like the following:
(image omitted: input shape with random values)
The output values are simply the anomalies from the [subset]'s monthly mean. Thank you.
One idea is to replace the non-matching values with NaN using DataFrame.where; after GroupBy.transform you then get the same index as the original DataFrame, so the subtraction is possible:
import numpy as np
import pandas as pd

np.random.seed(123)
data_org = pd.DataFrame(np.random.randint(10, size=(10, 3)),
                        index=pd.date_range('2000-01-01', periods=10, freq='3M'))
print (data_org)
0 1 2
2000-01-31 2 2 6
2000-04-30 1 3 9
2000-07-31 6 1 0
2000-10-31 1 9 0
2001-01-31 0 9 3
2001-04-30 4 0 0
2001-07-31 4 1 7
2001-10-31 3 2 4
2002-01-31 7 2 4
2002-04-30 8 0 7
top_50_year = 2000
data1 = data_org.where(data_org.index.to_series().dt.year <= top_50_year)
print (data1)
0 1 2
2000-01-31 2.0 2.0 6.0
2000-04-30 1.0 3.0 9.0
2000-07-31 6.0 1.0 0.0
2000-10-31 1.0 9.0 0.0
2001-01-31 NaN NaN NaN
2001-04-30 NaN NaN NaN
2001-07-31 NaN NaN NaN
2001-10-31 NaN NaN NaN
2002-01-31 NaN NaN NaN
2002-04-30 NaN NaN NaN
mean_data1 = data1.groupby(data1.index.month).transform('mean')
print (mean_data1)
0 1 2
2000-01-31 2.0 2.0 6.0
2000-04-30 1.0 3.0 9.0
2000-07-31 6.0 1.0 0.0
2000-10-31 1.0 9.0 0.0
2001-01-31 2.0 2.0 6.0
2001-04-30 1.0 3.0 9.0
2001-07-31 6.0 1.0 0.0
2001-10-31 1.0 9.0 0.0
2002-01-31 2.0 2.0 6.0
2002-04-30 1.0 3.0 9.0
df = data_org - mean_data1
print (df)
0 1 2
2000-01-31 0.0 0.0 0.0
2000-04-30 0.0 0.0 0.0
2000-07-31 0.0 0.0 0.0
2000-10-31 0.0 0.0 0.0
2001-01-31 -2.0 7.0 -3.0
2001-04-30 3.0 -3.0 -9.0
2001-07-31 -2.0 0.0 7.0
2001-10-31 2.0 -7.0 4.0
2002-01-31 5.0 0.0 -2.0
2002-04-30 7.0 -3.0 -2.0
Another idea with filtering:
np.random.seed(123)
data_org = pd.DataFrame(np.random.randint(10, size=(10, 3)),
                        index=pd.date_range('2000-01-01', periods=10, freq='3M'))
print (data_org)
0 1 2
2000-01-31 2 2 6
2000-04-30 1 3 9
2000-07-31 6 1 0
2000-10-31 1 9 0
2001-01-31 0 9 3
2001-04-30 4 0 0
2001-07-31 4 1 7
2001-10-31 3 2 4
2002-01-31 7 2 4
2002-04-30 8 0 7
top_50_year = 2000
data_sub = data_org[data_org.index.year <= top_50_year]
print (data_sub)
0 1 2
2000-01-31 2 2 6
2000-04-30 1 3 9
2000-07-31 6 1 0
2000-10-31 1 9 0
mean_sub = data_sub.groupby(data_sub.index.month).mean()
print (mean_sub)
0 1 2
1 2 2 6
4 1 3 9
7 6 1 0
10 1 9 0
Create a new column m for the months:
data_org['m'] = data_org.index.month
print (data_org)
0 1 2 m
2000-01-31 2 2 6 1
2000-04-30 1 3 9 4
2000-07-31 6 1 0 7
2000-10-31 1 9 0 10
2001-01-31 0 9 3 1
2001-04-30 4 0 0 4
2001-07-31 4 1 7 7
2001-10-31 3 2 4 10
2002-01-31 7 2 4 1
2002-04-30 8 0 7 4
and join mean_sub on this column with DataFrame.join:
mean_data1 = data_org[['m']].join(mean_sub, on='m')
print (mean_data1)
m 0 1 2
2000-01-31 1 2 2 6
2000-04-30 4 1 3 9
2000-07-31 7 6 1 0
2000-10-31 10 1 9 0
2001-01-31 1 2 2 6
2001-04-30 4 1 3 9
2001-07-31 7 6 1 0
2001-10-31 10 1 9 0
2002-01-31 1 2 2 6
2002-04-30 4 1 3 9
df = data_org - mean_data1
print (df)
0 1 2 m
2000-01-31 0 0 0 0
2000-04-30 0 0 0 0
2000-07-31 0 0 0 0
2000-10-31 0 0 0 0
2001-01-31 -2 7 -3 0
2001-04-30 3 -3 -9 0
2001-07-31 -2 0 7 0
2001-10-31 2 -7 4 0
2002-01-31 5 0 -2 0
2002-04-30 7 -3 -2 0
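A more compact sketch of the same mapping idea (assuming mean_sub from above and data_org without the helper column m): look up each row's month in mean_sub, re-label with the original index, and subtract:
monthly = mean_sub.reindex(data_org.index.month).set_axis(data_org.index)
anomaly = data_org - monthly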
I have a dataframe with a date+time and a label, which I want to reshape into date (month) columns holding the label frequencies for that month:
date_time label
1 2017-09-26 17:08:00 0
3 2017-10-03 13:27:00 2
4 2017-10-04 19:04:00 0
11 2017-10-11 18:28:00 1
27 2017-10-13 11:22:00 0
28 2017-10-13 21:43:00 0
39 2017-10-16 14:43:00 0
40 2017-10-16 21:39:00 0
65 2017-10-21 21:53:00 2
...
98 2017-11-01 20:08:00 3
99 2017-11-02 12:00:00 3
100 2017-11-02 12:01:00 2
109 2017-11-02 12:03:00 3
110 2017-11-03 22:24:00 0
111 2017-11-04 09:05:00 3
112 2017-11-06 12:36:00 3
113 2017-11-06 12:48:00 2
128 2017-11-07 15:20:00 2
143 2017-11-10 16:36:00 3
144 2017-11-10 20:00:00 0
145 2017-11-10 20:02:00 0
I group the label frequency by month with this line (thanks partially to this post):
df2 = df.groupby([pd.Grouper(key='date_time', freq='M'), 'label'])['label'].count()
which outputs
date_time label
2017-09-30 0 1
2017-10-31 0 6
1 1
2 8
3 2
2017-11-30 0 25
4 2
5 1
2 4
3 11
2017-12-31 0 14
5 3
2 5
3 7
2018-01-31 0 8
4 1
5 1
2 2
3 3
but, as mentioned before, I would like to get the data by month/date columns:
2017-09-30 2017-10-31 2017-11-30 2017-12-31 2018-01-31
0 1 6 25 14 8
1 0 1 0 0 0
2 0 8 4 5 2
3 0 2 11 7 3
4 0 0 2 0 1
5 0 0 1 3 1
Currently I can sort of split the data with
pd.concat([df2[m] for m in df2.index.levels[0]], axis=1).fillna(0)
but I lose the column names:
label label label label label
0 1.0 6.0 25.0 14.0 8.0
1 0.0 1.0 0.0 0.0 0.0
2 0.0 8.0 4.0 5.0 2.0
3 0.0 2.0 11.0 7.0 3.0
4 0.0 0.0 2.0 0.0 1.0
5 0.0 0.0 1.0 3.0 1.0
So I have to do a longer version where I generate a series, rename it, concatenate and then fill in the blanks:
m_list = []
for m in df2.index.levels[0]:
    m_labels = df2[m]
    m_labels = m_labels.rename(m)
    m_list.append(m_labels)
pd.concat(m_list, axis=1).fillna(0)
resulting in
2017-09-30 2017-10-31 2017-11-30 2017-12-31 2018-01-31
0 1.0 6.0 25.0 14.0 8.0
1 0.0 1.0 0.0 0.0 0.0
2 0.0 8.0 4.0 5.0 2.0
3 0.0 2.0 11.0 7.0 3.0
4 0.0 0.0 2.0 0.0 1.0
5 0.0 0.0 1.0 3.0 1.0
Is there a shorter/more elegant way to get to this last dataframe from my original one?
You just need unstack here:
df.groupby([pd.Grouper(key='date_time', freq='M'), 'label'])['label'].count().unstack(0,fill_value=0)
Out[235]:
date_time 2017-09-30 2017-10-31 2017-11-30
label
0 1 5 3
1 0 1 0
2 0 2 3
3 0 0 6
Based on your groupby output:
s.unstack(0,fill_value=0)
Out[240]:
date_time 2017-09-30 2017-10-31 2017-11-30 2017-12-31 2018-01-31
label
0 1 6 25 14 8
1 0 1 0 0 0
2 0 8 4 5 2
3 0 2 11 7 3
4 0 0 2 0 1
5 0 0 1 3 1
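For reference, a hedged one-liner with pd.crosstab gives the same counts (columns come out as monthly Periods rather than month-end timestamps; missing combinations are filled with 0 automatically):
out = pd.crosstab(df['label'], df['date_time'].dt.to_period('M'))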
I want to merge static data with time-varying data.
First dataframe
a_columns = pd.MultiIndex.from_product([["A","B","C"],["1","2"]])
a_index = pd.date_range("20100101","20110101",freq="BM")
a = pd.DataFrame(columns=a_columns,index=a_index)#A
Second dataframe
b_columns = ["3","4","5"]
b_index = ["A","B","C"]
b = pd.DataFrame(columns=b_columns,index=b_index)
How do I join these two? My desired dataframe has the same form as a, but with additional columns.
Thanks!
I think you need to reshape b with stack and then create a DataFrame with to_frame. concat needs a DatetimeIndex, so the new index is taken from the first value of a's index.
Last, concat + sort_index:
#added some data - 2
a_columns = pd.MultiIndex.from_product([["A","B","C"],["1","2"]])
a_index = pd.date_range("20100101","20110101",freq="BM")
a = pd.DataFrame(2,columns=a_columns,index=a_index)#A
#added some data - 1
b_columns = ["3","4","5"]
b_index = ["A","B","C"]
b = pd.DataFrame(1,columns=b_columns,index=b_index)
c = b.stack().to_frame(a.index[0]).T
print (c)
A B C
3 4 5 3 4 5 3 4 5
2010-01-29 1 1 1 1 1 1 1 1 1
d = pd.concat([a,c], axis=1).sort_index(axis=1)
print (d)
A B C
1 2 3 4 5 1 2 3 4 5 1 2 3 4 5
2010-01-29 2 2 1.0 1.0 1.0 2 2 1.0 1.0 1.0 2 2 1.0 1.0 1.0
2010-02-26 2 2 NaN NaN NaN 2 2 NaN NaN NaN 2 2 NaN NaN NaN
2010-03-31 2 2 NaN NaN NaN 2 2 NaN NaN NaN 2 2 NaN NaN NaN
2010-04-30 2 2 NaN NaN NaN 2 2 NaN NaN NaN 2 2 NaN NaN NaN
2010-05-31 2 2 NaN NaN NaN 2 2 NaN NaN NaN 2 2 NaN NaN NaN
2010-06-30 2 2 NaN NaN NaN 2 2 NaN NaN NaN 2 2 NaN NaN NaN
2010-07-30 2 2 NaN NaN NaN 2 2 NaN NaN NaN 2 2 NaN NaN NaN
2010-08-31 2 2 NaN NaN NaN 2 2 NaN NaN NaN 2 2 NaN NaN NaN
2010-09-30 2 2 NaN NaN NaN 2 2 NaN NaN NaN 2 2 NaN NaN NaN
2010-10-29 2 2 NaN NaN NaN 2 2 NaN NaN NaN 2 2 NaN NaN NaN
2010-11-30 2 2 NaN NaN NaN 2 2 NaN NaN NaN 2 2 NaN NaN NaN
2010-12-31 2 2 NaN NaN NaN 2 2 NaN NaN NaN 2 2 NaN NaN NaN
Last, if you need to replace the NaNs only in the added columns from the first row:
d[c.columns] = d[c.columns].ffill()
print (d)
A B C
1 2 3 4 5 1 2 3 4 5 1 2 3 4 5
2010-01-29 2 2 1.0 1.0 1.0 2 2 1.0 1.0 1.0 2 2 1.0 1.0 1.0
2010-02-26 2 2 1.0 1.0 1.0 2 2 1.0 1.0 1.0 2 2 1.0 1.0 1.0
2010-03-31 2 2 1.0 1.0 1.0 2 2 1.0 1.0 1.0 2 2 1.0 1.0 1.0
2010-04-30 2 2 1.0 1.0 1.0 2 2 1.0 1.0 1.0 2 2 1.0 1.0 1.0
2010-05-31 2 2 1.0 1.0 1.0 2 2 1.0 1.0 1.0 2 2 1.0 1.0 1.0
2010-06-30 2 2 1.0 1.0 1.0 2 2 1.0 1.0 1.0 2 2 1.0 1.0 1.0
2010-07-30 2 2 1.0 1.0 1.0 2 2 1.0 1.0 1.0 2 2 1.0 1.0 1.0
2010-08-31 2 2 1.0 1.0 1.0 2 2 1.0 1.0 1.0 2 2 1.0 1.0 1.0
2010-09-30 2 2 1.0 1.0 1.0 2 2 1.0 1.0 1.0 2 2 1.0 1.0 1.0
2010-10-29 2 2 1.0 1.0 1.0 2 2 1.0 1.0 1.0 2 2 1.0 1.0 1.0
2010-11-30 2 2 1.0 1.0 1.0 2 2 1.0 1.0 1.0 2 2 1.0 1.0 1.0
2010-12-31 2 2 1.0 1.0 1.0 2 2 1.0 1.0 1.0 2 2 1.0 1.0 1.0
A similar solution with reindex:
c = b.stack().to_frame(a.index[0]).T.reindex(a.index, method='ffill')
print (c)
A B C
3 4 5 3 4 5 3 4 5
2010-01-29 1 1 1 1 1 1 1 1 1
2010-02-26 1 1 1 1 1 1 1 1 1
2010-03-31 1 1 1 1 1 1 1 1 1
2010-04-30 1 1 1 1 1 1 1 1 1
2010-05-31 1 1 1 1 1 1 1 1 1
2010-06-30 1 1 1 1 1 1 1 1 1
2010-07-30 1 1 1 1 1 1 1 1 1
2010-08-31 1 1 1 1 1 1 1 1 1
2010-09-30 1 1 1 1 1 1 1 1 1
2010-10-29 1 1 1 1 1 1 1 1 1
2010-11-30 1 1 1 1 1 1 1 1 1
2010-12-31 1 1 1 1 1 1 1 1 1
d = pd.concat([a,c], axis=1).sort_index(axis=1)
print (d)
A B C
1 2 3 4 5 1 2 3 4 5 1 2 3 4 5
2010-01-29 2 2 1 1 1 2 2 1 1 1 2 2 1 1 1
2010-02-26 2 2 1 1 1 2 2 1 1 1 2 2 1 1 1
2010-03-31 2 2 1 1 1 2 2 1 1 1 2 2 1 1 1
2010-04-30 2 2 1 1 1 2 2 1 1 1 2 2 1 1 1
2010-05-31 2 2 1 1 1 2 2 1 1 1 2 2 1 1 1
2010-06-30 2 2 1 1 1 2 2 1 1 1 2 2 1 1 1
2010-07-30 2 2 1 1 1 2 2 1 1 1 2 2 1 1 1
2010-08-31 2 2 1 1 1 2 2 1 1 1 2 2 1 1 1
2010-09-30 2 2 1 1 1 2 2 1 1 1 2 2 1 1 1
2010-10-29 2 2 1 1 1 2 2 1 1 1 2 2 1 1 1
2010-11-30 2 2 1 1 1 2 2 1 1 1 2 2 1 1 1
2010-12-31 2 2 1 1 1 2 2 1 1 1 2 2 1 1 1
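An alternative sketch that skips the ffill/reindex step by repeating b's stacked row across a's whole index up front:
# every date gets the same static row; columns are b's stacked MultiIndex
c = pd.DataFrame([b.stack()] * len(a), index=a.index)
d = pd.concat([a, c], axis=1).sort_index(axis=1)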
Following the question I asked (Combine similar rows to one row in python dataframe):
I have the original data below, and have 2 questions to ask:
yyyymmdd hr ariel cat kiki mmax vicky gaolie shiu nick ck
0 2015-12-27 9 0 0 0 0 0 0 0 23 0
1 2015-12-27 10 0 0 0 0 0 0 0 2 0
2 2015-12-27 11 0 0 0 0 0 0 0 20 0
3 2015-12-27 12 0 0 0 0 0 0 0 4 0
4 2015-12-27 17 0 0 0 0 0 0 0 2 0
5 2015-12-27 19 1 0 0 0 0 0 0 0 0
6 2015-12-28 8 0 8 0 0 0 0 0 0 0
7 2015-12-28 9 11 11 0 0 0 0 19 0 0
8 2015-12-28 10 85 13 0 0 2 0 15 0 0
9 2015-12-28 11 2 11 0 0 2 0 14 0 0
10 2015-12-28 12 2 20 0 4 0 0 10 0 0
11 2015-12-28 13 8 9 0 9 3 0 9 0 0
12 2015-12-28 14 4 10 0 8 0 0 22 0 0
13 2015-12-28 15 3 3 0 2 0 0 16 0 0
14 2015-12-28 16 14 5 1 1 0 0 19 0 0
15 2015-12-28 17 15 1 2 0 0 0 19 0 0
16 2015-12-28 18 0 0 0 6 0 0 0 0 0
17 2015-12-28 19 0 0 0 5 0 0 0 0 0
18 2015-12-28 20 0 0 0 1 0 0 0 0 0
how can I "fill" the "hr" index of the DataFrame? The result should be something like this:
yyyymmdd hr ariel cat kiki mmax vicky gaolie shiu nick ck
12/27/15 8 NaN NaN NaN NaN NaN NaN NaN NaN NaN
12/27/15 9 0 0 0 0 0 0 0 23 0
12/27/15 10 0 0 0 0 0 0 0 2 0
12/27/15 11 0 0 0 0 0 0 0 20 0
12/27/15 12 0 0 0 0 0 0 0 4 0
12/27/15 13 NaN NaN NaN NaN NaN NaN NaN NaN NaN
12/27/15 14 NaN NaN NaN NaN NaN NaN NaN NaN NaN
12/27/15 15 NaN NaN NaN NaN NaN NaN NaN NaN NaN
12/27/15 16 NaN NaN NaN NaN NaN NaN NaN NaN NaN
12/27/15 17 0 0 0 0 0 0 0 2 0
12/27/15 18 NaN NaN NaN NaN NaN NaN NaN NaN NaN
12/27/15 19 1 0 0 0 0 0 0 0 0
12/27/15 20 NaN NaN NaN NaN NaN NaN NaN NaN NaN
12/28/15 8 0 8 0 0 0 0 0 0 0
12/28/15 9 11 11 0 0 0 0 19 0 0
12/28/15 10 85 13 0 0 2 0 15 0 0
12/28/15 11 2 11 0 0 2 0 14 0 0
12/28/15 12 2 20 0 4 0 0 10 0 0
12/28/15 13 8 9 0 9 3 0 9 0 0
12/28/15 14 4 10 0 8 0 0 22 0 0
12/28/15 15 3 3 0 2 0 0 16 0 0
12/28/15 16 14 5 1 1 0 0 19 0 0
12/28/15 17 15 1 2 0 0 0 19 0 0
12/28/15 18 0 0 0 6 0 0 0 0 0
12/28/15 19 0 0 0 5 0 0 0 0 0
12/28/15 20 0 0 0 1 0 0 0 0 0
How can I plot line charts based on the columns and hr?
x-axis = columns , i.e. : ariel ,cat, kiki...
y-axis = hr, i.e. : 8,9,10...20
every subplot represents one date (i.e. 2015-12-27, 2015-12-28..)
Here is the framework of the plot I want to get (linked picture omitted).
You can convert yyyymmdd to datetime, combine with the hr information and then resample to hourly frequency like so:
import pandas as pd

df.yyyymmdd = pd.to_datetime(df.yyyymmdd)
# fold the hour into the timestamp
df.yyyymmdd = df.apply(lambda x: x.yyyymmdd + pd.DateOffset(hours=x.hr), axis=1)
df.set_index('yyyymmdd', inplace=True)
# .asfreq() keeps the recorded values and inserts NaN rows for the missing hours
# (modern pandas needs an explicit step after resample)
df = df.resample('H').asfreq()
to get:
hr ariel cat kiki mmax vicky gaolie shiu nick ck
yyyymmdd
2015-12-27 09:00:00 9 0 0 0 0 0 0 0 23 0
2015-12-27 10:00:00 10 0 0 0 0 0 0 0 2 0
2015-12-27 11:00:00 11 0 0 0 0 0 0 0 20 0
2015-12-27 12:00:00 12 0 0 0 0 0 0 0 4 0
2015-12-27 13:00:00 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2015-12-27 14:00:00 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2015-12-27 15:00:00 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2015-12-27 16:00:00 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2015-12-27 17:00:00 17 0 0 0 0 0 0 0 2 0
2015-12-27 18:00:00 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2015-12-27 19:00:00 19 1 0 0 0 0 0 0 0 0
2015-12-27 20:00:00 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2015-12-27 21:00:00 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2015-12-27 22:00:00 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2015-12-27 23:00:00 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2015-12-28 00:00:00 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2015-12-28 01:00:00 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2015-12-28 02:00:00 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2015-12-28 03:00:00 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2015-12-28 04:00:00 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2015-12-28 05:00:00 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2015-12-28 06:00:00 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2015-12-28 07:00:00 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2015-12-28 08:00:00 8 0 8 0 0 0 0 0 0 0
2015-12-28 09:00:00 9 11 11 0 0 0 0 19 0 0
2015-12-28 10:00:00 10 85 13 0 0 2 0 15 0 0
2015-12-28 11:00:00 11 2 11 0 0 2 0 14 0 0
2015-12-28 12:00:00 12 2 20 0 4 0 0 10 0 0
2015-12-28 13:00:00 13 8 9 0 9 3 0 9 0 0
2015-12-28 14:00:00 14 4 10 0 8 0 0 22 0 0
2015-12-28 15:00:00 15 3 3 0 2 0 0 16 0 0
2015-12-28 16:00:00 16 14 5 1 1 0 0 19 0 0
2015-12-28 17:00:00 17 15 1 2 0 0 0 19 0 0
2015-12-28 18:00:00 18 0 0 0 6 0 0 0 0 0
2015-12-28 19:00:00 19 0 0 0 5 0 0 0 0 0
2015-12-28 20:00:00 20 0 0 0 1 0 0 0 0 0
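The hourly index now covers every hour of each day; if you only want hours 8 through 20, as in the desired output, a hedged follow-up:
df = df.between_time('08:00', '20:00')   # both ends inclusive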
You could plot the result as follows, assuming that you want one figure per date with one subplot per column:
import matplotlib.pyplot as plt

# pd.Grouper(freq='D') replaces the deprecated pd.TimeGrouper('D')
for d, data in df.groupby(pd.Grouper(freq='D')):
    data.plot.line(figsize=(10, 20), subplots=True, sharey=True)
    plt.gcf().savefig('cats {}.png'.format(d), bbox_inches='tight')
    plt.close()
to get one saved figure per date (output images omitted).