I'm working with the following dataset with hourly counts (df):
The dataframe has 8784 rows (hourly values for the year 2016).
I'd like to see if there are daily trends (e.g. whether there is an increase in the morning hours). For this I'd like to create a plot with the hour of the day (0 to 23) on the x-axis and the number of cyclists on the y-axis (something like the picture below from http://ofdataandscience.blogspot.co.uk/2013/03/capital-bikeshare-time-series-clustering.html).
I experimented with different combinations of pivot, resample and set_index, and with plotting the results in matplotlib, without success. In other words, I couldn't find a way to sum the observations for each hour and then plot those sums per weekday.
Any ideas how to do this? Thanks in advance!
I think you can group by hour and weekday and aggregate with sum (or maybe mean), then reshape with unstack and use DataFrame.plot:
ax = df.groupby([df['Date'].dt.hour, 'weekday'])['Cyclists'].sum().unstack().plot()
Solution with pivot_table:
df1 = df.pivot_table(index=df['Date'].dt.hour,
                     columns='weekday',
                     values='Cyclists',
                     aggfunc='sum')
df1.plot()
Sample:
N = 200
np.random.seed(100)
rng = pd.date_range('2016-01-01', periods=N, freq='H')
df = pd.DataFrame({'Date': rng, 'Cyclists': np.random.randint(100, size=N)})
df['weekday'] = df['Date'].dt.day_name()  # dt.weekday_name in older pandas
print (df.head())
Cyclists Date weekday
0 8 2016-01-01 00:00:00 Friday
1 24 2016-01-01 01:00:00 Friday
2 67 2016-01-01 02:00:00 Friday
3 87 2016-01-01 03:00:00 Friday
4 79 2016-01-01 04:00:00 Friday
print (df.groupby([df['Date'].dt.hour, 'weekday'])['Cyclists'].sum().unstack())
weekday Friday Monday Saturday Sunday Thursday Tuesday Wednesday
Date
0 102 91 120 53 95 86 21
1 102 83 100 27 20 94 25
2 121 53 105 56 10 98 54
3 164 78 54 30 8 42 6
4 163 0 43 48 89 84 37
5 49 13 150 47 72 95 58
6 24 57 32 39 30 76 39
7 127 76 128 38 12 33 94
8 72 3 59 44 18 58 51
9 138 70 67 18 93 42 30
10 77 3 7 64 92 22 66
11 159 84 49 56 44 0 24
12 156 79 47 34 57 55 55
13 42 10 65 53 0 98 17
14 116 87 61 74 73 19 45
15 106 60 14 17 54 53 89
16 22 3 55 72 92 68 45
17 154 48 71 13 66 62 35
18 60 52 80 30 16 50 16
19 79 43 2 17 5 68 12
20 11 36 94 53 51 35 86
21 180 5 19 68 90 23 82
22 103 71 98 50 34 9 67
23 92 38 63 91 67 48 92
df.groupby([df['Date'].dt.hour, 'weekday'])['Cyclists'].sum().unstack().plot()
EDIT:
You can also convert weekday to an ordered categorical for correct sorting of the columns by day of week:
names = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
df['weekday'] = df['weekday'].astype(pd.CategoricalDtype(categories=names, ordered=True))
df.groupby([df['Date'].dt.hour, 'weekday'])['Cyclists'].sum().unstack().plot()
Related
I have a process whose end product is a pandas DataFrame, variable in both data and length; its output is structured like this example:
9 80340796
10 80340797
11 80340798
12 80340799
13 80340800
14 80340801
15 80340802
16 80340803
17 80340804
18 80340805
19 80340806
20 80340807
21 80340808
22 80340809
23 80340810
24 80340811
25 80340812
26 80340813
27 80340814
28 80340815
29 80340816
30 80340817
31 80340818
32 80340819
33 80340820
34 80340821
35 80340822
36 80340823
37 80340824
38 80340825
39 80340826
40 80340827
41 80340828
42 80340829
43 80340830
44 80340831
45 80340832
46 80340833
I need to get the numbers in the second column above into the following grid format, based on the numbers in the first column above.
1 2 3 4 5 6 7 8 9 10 11 12
A 1 9 17 25 33 41 49 57 65 73 81 89
B 2 10 18 26 34 42 50 58 66 74 82 90
C 3 11 19 27 35 43 51 59 67 75 83 91
D 4 12 20 28 36 44 52 60 68 76 84 92
E 5 13 21 29 37 45 53 61 69 77 85 93
F 6 14 22 30 38 46 54 62 70 78 86 94
G 7 15 23 31 39 47 55 63 71 79 87 95
H 8 16 24 32 40 48 56 64 72 80 88 96
So the end result in this example would be that grid populated with the corresponding values from the second column.
Any advice on how to go about this would be much appreciated. I've been asked for this by a colleague so that the data is easy for their team to read (it matches the layout of a physical test), but I have no idea how to produce it.
pandas pivot_table can do what you want here, but first you have to create two auxiliary columns: one determining which grid column the value has to go in, and another for which row. You can get them as shown in the following example:
import numpy as np
import pandas as pd
df = pd.DataFrame({'num': list(range(9, 28)), 'val': list(range(80001, 80020))})
max_rows = 8
df['row'] = (df['num'] - 1) % max_rows  # 0-7: position within a grid column
df['col'] = np.ceil(df['num'] / max_rows).astype(int)  # 1-based grid column
df.pivot_table(values=['val'], columns=['col'], index=['row'])
val
col 2 3 4
row
0 80001.0 80009.0 80017.0
1 80002.0 80010.0 80018.0
2 80003.0 80011.0 80019.0
3 80004.0 80012.0 NaN
4 80005.0 80013.0 NaN
5 80006.0 80014.0 NaN
6 80007.0 80015.0 NaN
7 80008.0 80016.0 NaN
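To get the letter row labels from the desired grid, you could map the 0-7 row numbers onto 'A'-'H' afterwards (a minimal sketch under the same 8-row layout; the grid name is just for illustration):
import string

grid = df.pivot_table(values='val', columns='col', index='row')
grid.index = [string.ascii_uppercase[i] for i in grid.index]  # 0 -> 'A', ..., 7 -> 'H'
print(grid)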
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randint(0,100,size=(100, 4)), columns=list('ABCD'))
print(df)
I want column 'A' always to have a value greater than column 'B'.
# put the row-wise max in A and the row-wise min in B
df.A, df.B = df[['A', 'B']].max(axis=1), df[['A', 'B']].min(axis=1)
Try this:
# swap the first two values in any row where A <= B
newdf = df.apply(lambda x: x if x['A'] > x['B'] else pd.Series([x['B'], x['A'], *x[2:]], index=x.index), axis=1)
print(newdf)
Output:
A B C D
0 85 14 22 85
1 62 54 20 1
2 82 78 48 59
3 81 59 54 39
4 92 12 79 44
5 69 64 8 11
6 49 34 48 69
7 68 28 80 27
8 72 17 2 40
9 26 15 49 62
10 29 2 86 12
11 69 7 32 99
12 39 35 65 32
13 45 36 36 12
14 54 21 29 79
15 91 82 35 80
16 67 16 4 37
17 94 82 93 37
18 64 18 2 15
19 13 11 28 82
20 78 9 93 45
21 72 41 16 33
22 92 71 62 69
23 87 79 71 11
24 31 14 8 24
25 85 27 43 3
26 82 34 14 52
27 41 32 39 48
28 13 12 24 86
29 96 17 14 80
.. .. .. .. ..
70 17 13 20 91
71 26 7 57 96
72 41 0 24 58
73 98 68 90 13
74 88 35 81 56
75 65 43 70 86
76 82 81 44 68
77 97 45 23 66
78 81 45 78 48
79 62 24 43 62
80 43 13 42 49
81 97 28 75 45
82 3 0 54 40
83 57 46 16 38
84 87 46 35 13
85 41 13 78 89
86 62 36 94 23
87 84 35 69 93
88 63 18 39 3
89 45 42 30 6
90 81 8 49 82
91 28 28 11 47
92 97 81 49 92
93 86 24 82 40
94 76 72 30 51
95 93 92 1 69
96 97 76 38 81
97 87 49 26 64
98 98 25 93 55
99 57 2 87 10
[100 rows x 4 columns]
You can apply this to any number of columns; only the first two are swapped and the rest pass through unchanged.
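If the pair to enforce is not the first two columns, here is a small sketch of the same max/min idea for an arbitrary pair (the function name and the C/D example are just for illustration):
def enforce_order(df, hi, lo):
    # ensure df[hi] >= df[lo] row-wise by swapping values where needed
    df[hi], df[lo] = df[[hi, lo]].max(axis=1), df[[hi, lo]].min(axis=1)
    return df

df = enforce_order(df, 'C', 'D')  # now C >= D in every row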
import numpy as np
import pandas as pd
#np.random.seed(1)
df = pd.DataFrame(np.random.randint(0,100,size=(100, 4)), columns=list('ABCD'))
# sort each row in place, descending: ndarray.sort is ascending, so we sort a
# left-to-right reversed view of the underlying array (this relies on df.values
# returning a view, which holds for a single-dtype frame in older pandas)
df.values[:, ::-1].sort()
print(df)
It gives the following output:
A B C D
0 72 37 12 9
1 79 75 64 5
2 76 71 16 1
3 50 25 20 6
4 84 28 18 11
5 68 50 29 14
6 96 94 87 87
7 86 13 9 7
8 63 61 57 22
9 81 60 1 0
10 88 47 13 8
11 72 71 30 3
12 70 57 49 21
13 68 43 24 3
14 80 76 52 26
15 82 64 41 15
16 98 87 68 25
17 26 25 22 7
18 67 27 23 9
19 83 57 38 37
20 34 32 10 8
Just a quick question: how would one go about getting the last cell value of an Excel spreadsheet, for every single column, when working with it as a DataFrame in pandas? I'm having quite some difficulty with this. I know the index can be found with len(), but I can't quite wrap my head around it. Thank you, any help would be greatly appreciated.
If you want the last cell of a dataframe, meaning the bottom-right-most cell, then you can use .iloc:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(1, 101).reshape((10, -1)))
df
Output:
0 1 2 3 4 5 6 7 8 9
0 1 2 3 4 5 6 7 8 9 10
1 11 12 13 14 15 16 17 18 19 20
2 21 22 23 24 25 26 27 28 29 30
3 31 32 33 34 35 36 37 38 39 40
4 41 42 43 44 45 46 47 48 49 50
5 51 52 53 54 55 56 57 58 59 60
6 61 62 63 64 65 66 67 68 69 70
7 71 72 73 74 75 76 77 78 79 80
8 81 82 83 84 85 86 87 88 89 90
9 91 92 93 94 95 96 97 98 99 100
Use .iloc with -1 index selection on both rows and columns.
df.iloc[-1,-1]
Output:
100
DataFrame.head(n) gets the top n results from the dataframe. DataFrame.tail(n) gets the bottom n results from the dataframe.
If your dataframe is named df, you could use df.tail(1) to get the last row of the dataframe. The returned value is also a dataframe.
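If instead you want the last value of every single column, as the question asks, note that the last row already holds one value per column; a minimal sketch on the same frame as above:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(1, 101).reshape((10, -1)))
last_per_column = df.iloc[-1]  # last row as a Series: one entry per column
print(last_per_column)  # 91, 92, ..., 100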
I have a DataFrame (df) and I need to split it into n DataFrames based on column position, following the triangular number series pattern:
df1 = df[[0]]
df2 = df[[1,2]]
df3 = df[[3,4,5]]
df4 = df[[6,7,8,9]]
etc.
Consider the dataframe df
df = pd.DataFrame(
np.arange(100).reshape(10, 10),
columns=list('ABCDEFGHIJ')
)
df
A B C D E F G H I J
0 0 1 2 3 4 5 6 7 8 9
1 10 11 12 13 14 15 16 17 18 19
2 20 21 22 23 24 25 26 27 28 29
3 30 31 32 33 34 35 36 37 38 39
4 40 41 42 43 44 45 46 47 48 49
5 50 51 52 53 54 55 56 57 58 59
6 60 61 62 63 64 65 66 67 68 69
7 70 71 72 73 74 75 76 77 78 79
8 80 81 82 83 84 85 86 87 88 89
9 90 91 92 93 94 95 96 97 98 99
i_s, j_s = np.arange(4).cumsum(), np.arange(1, 5).cumsum()
df1, df2, df3, df4 = [
df.iloc[:, i:j] for i, j in zip(i_s, j_s)
]
Here i_s holds the slice starts (0, 1, 3, 6) and j_s the triangular-number stops (1, 3, 6, 10), so each slice takes one more column than the previous. Verify:
pd.concat(dict(enumerate([df.iloc[:, i:j] for i, j in zip(i_s, j_s)])), axis=1)
0 1 2 3
A B C D E F G H I J
0 0 1 2 3 4 5 6 7 8 9
1 10 11 12 13 14 15 16 17 18 19
2 20 21 22 23 24 25 26 27 28 29
3 30 31 32 33 34 35 36 37 38 39
4 40 41 42 43 44 45 46 47 48 49
5 50 51 52 53 54 55 56 57 58 59
6 60 61 62 63 64 65 66 67 68 69
7 70 71 72 73 74 75 76 77 78 79
8 80 81 82 83 84 85 86 87 88 89
9 90 91 92 93 94 95 96 97 98 99
First build the triangular number series of column boundaries, then use them to slice the DataFrame:
n = len(df.columns)
bounds = []
i = 1
end = 0
while end < n:
    begin = end
    end = min(i * (i + 1) // 2, n)  # triangular numbers: 1, 3, 6, 10, ...
    bounds.append((begin, end))
    i += 1
dfs = [df.iloc[:, b:e] for b, e in bounds]  # df1, df2, df3, df4 as a list
So I am trying to merge the following columns of data, which are currently indexed as daily entries (but only have one point per week). I have separated the columns into year variables, but am having trouble getting them into a combined DataFrame that disregards the date index, so that I can build min/max columns by week across the years. I am not sure how to get merge/join to do this.
#Create year variables, append to new dataframe with new index
I have the following:
def minmaxdata():
    Totrigs = dataforgraphs()
    tr = Totrigs
    yrs = [tr['2007'], tr['2008'], tr['2009'], tr['2010'], tr['2011'], tr['2012'], tr['2013'], tr['2014']]
    yrlist = ['tr07', 'tr08', 'tr09', 'tr10', 'tr11', 'tr12', 'tr13', 'tr14']
    dic = dict(zip(yrlist, yrs))
    yr07, yr08, yr09, yr10, yr11, yr12, yr13, yr14 = dic['tr07'], dic['tr08'], dic['tr09'], dic['tr10'], dic['tr11'], dic['tr12'], dic['tr13'], dic['tr14']
    minmax = yr07.append([yr08, yr09, yr10, yr11, yr12, yr13, yr14], ignore_index=True)
I would like a Dataframe like the following:
2007 2008 2009 2010 2011 2012 2013 2014 min max
1 10 13 10 12 34 23 22 14 10 34
2 25 ...
3 22
4 ...
5
.
.
. ...
52
I'm not sure what your original data look like, but I don't think it's a good idea to hard-code all the years; you lose reusability. I'll set up a sequence of random integers indexed by date, with one date per week.
In [65]: idx = pd.date_range('2007-1-1', '2014-12-31', freq='W')
In [66]: df = pd.DataFrame(np.random.randint(100, size=len(idx)), index=idx, columns=['value'])
In [67]: df.head()
Out[67]:
value
2007-01-07 7
2007-01-14 2
2007-01-21 85
2007-01-28 55
2007-02-04 36
In [68]: df.tail()
Out[68]:
value
2014-11-30 76
2014-12-07 34
2014-12-14 43
2014-12-21 26
2014-12-28 17
Then get the year and the week number within each year:
In [69]: df['year'] = df.index.year
In [70]: df['week'] = df.groupby('year').cumcount()+1
(You may try df.index.week for the week number, but I've seen weird behavior, like January starting at week 53.)
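For reference, newer pandas exposes ISO week numbers directly via DatetimeIndex.isocalendar(), which is exactly where that week-53 behavior comes from (a sketch, assuming pandas >= 1.1):
# ISO calendar weeks: Jan 1 can fall in ISO week 52 or 53 of the prior year,
# which is why df.index.week could start a January at 53
df['iso_week'] = df.index.isocalendar().week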
Finally, do a pivot table to transform and get row-wise max/min:
In [71]: df2 = df.pivot_table(index='week', columns='year', values='value')
In [72]: df2['max'] = df2.max(axis=1)
In [73]: df2['min'] = df2.min(axis=1)
And now our dataframe df2 looks like this and should be what you need:
In [74]: df2
Out[74]:
year 2007 2008 2009 2010 2011 2012 2013 2014 max min
week
1 7 82 13 32 24 58 18 10 82 7
2 2 5 29 0 2 97 59 83 97 0
3 85 89 8 83 63 73 47 49 89 8
4 55 5 1 44 78 10 13 87 87 1
5 36 41 48 98 98 24 24 69 98 24
6 51 43 62 60 44 57 34 33 62 33
7 37 66 72 46 28 11 73 36 73 11
8 30 13 86 93 46 67 95 15 95 13
9 78 84 16 21 70 39 43 90 90 16
10 9 2 88 15 39 81 44 96 96 2
11 34 76 16 44 44 26 30 77 77 16
12 2 24 23 13 25 69 25 74 74 2
13 66 91 67 77 18 47 95 66 95 18
14 59 52 22 42 40 99 88 21 99 21
15 76 17 31 57 43 31 91 67 91 17
16 76 38 53 43 84 45 78 9 84 9
17 88 53 34 22 99 93 61 42 99 22
18 78 19 82 19 5 80 55 69 82 5
19 54 92 56 6 2 85 7 67 92 2
20 8 56 86 41 60 76 31 81 86 8
21 64 76 11 38 41 98 39 72 98 11
22 21 86 34 1 15 27 26 95 95 1
23 82 90 3 17 62 18 93 20 93 3
24 47 42 32 27 83 8 22 14 83 8
25 15 66 70 16 4 22 26 14 70 4
26 12 68 21 7 86 2 27 10 86 2
27 85 85 9 39 17 94 67 42 94 9
28 73 80 96 49 46 23 69 84 96 23
29 57 74 6 71 79 31 79 7 79 6
30 18 84 85 34 71 69 0 62 85 0
31 24 40 93 53 72 46 44 71 93 24
32 95 4 58 57 68 27 95 71 95 4
33 65 84 87 41 38 45 71 33 87 33
34 62 14 41 83 79 63 44 13 83 13
35 49 96 50 62 25 45 69 63 96 25
36 6 38 86 34 98 60 67 80 98 6
37 99 44 26 19 19 20 57 17 99 17
38 2 40 7 65 68 58 68 13 68 2
39 72 31 83 65 69 39 10 76 83 10
40 90 31 42 20 7 8 62 79 90 7
41 10 46 82 96 30 43 12 84 96 10
42 79 38 28 78 25 9 80 2 80 2
43 64 83 63 40 29 86 10 15 86 10
44 89 91 62 48 53 69 16 0 91 0
45 99 26 85 45 26 53 79 86 99 26
46 35 14 46 25 74 6 68 44 74 6
47 17 9 84 88 29 83 85 1 88 1
48 18 69 55 16 77 35 16 76 77 16
49 60 4 36 50 81 28 50 34 81 4
50 36 29 38 28 81 86 71 43 86 28
51 41 82 95 27 95 77 74 26 95 26
52 2 81 89 82 28 2 11 17 89 2
53 NaN NaN NaN NaN NaN 0 NaN NaN 0 0
EDIT:
If you need max/min over certain columns only, just list them. In this case (2007-2013) they are consecutive, so you can do the following:
df2['max_2007to2013'] = df2[list(range(2007, 2014))].max(axis=1)
If not, simply list them explicitly, e.g. df2[[2007, 2010, 2012, 2013]].max(axis=1).