As part of a larger task, I want to calculate the monthly mean values for each specific station. This is already difficult to do, but I am getting close.
The dataframe has many columns, but ultimately I only use the following information:
Date Value Station_Name
0 2006-01-03 18 2
1 2006-01-04 12 2
2 2006-01-05 11 2
3 2006-01-06 10 2
4 2006-01-09 22 2
... ... ...
3510 2006-12-23 47 45
3511 2006-12-24 46 45
3512 2006-12-26 35 45
3513 2006-12-27 35 45
3514 2006-12-30 28 45
I am running into two issues when using:
df.groupby(['Station_Name', pd.Grouper(freq='M')])['Value'].mean()
It results in something like:
Station_Name Date
2 2003-01-31 29.448387
2003-02-28 30.617857
2003-03-31 28.758065
2003-04-30 28.392593
2003-05-31 30.318519
...
45 2003-09-30 16.160000
2003-10-31 18.906452
2003-11-30 26.296667
2003-12-31 30.306667
2004-01-31 29.330000
I can't seem to use this as a regular dataframe: the datetime is off, since instead of showing the month it gives back the last day of each month; the station name appears only once in the index rather than repeating down a column; and the mean value has no column name at all. The result isn't a dataframe but a pandas.core.series.Series. Yet when I call .to_frame() on it, the type claims it is indeed a DataFrame, while it still looks wrong. I don't get this part.
I found that, in order to get a normal dataframe back, you can pass
as_index=False
to the groupby method. But this results in the months not being shown:
df.groupby(['Station_Name', pd.Grouper(freq='M')], as_index=False)['Value'].mean()
Gives:
Station_Name Value
0 2 29.448387
1 2 30.617857
2 2 28.758065
3 2 28.392593
4 2 30.318519
... ... ...
142 45 16.160000
143 45 18.906452
144 45 26.296667
145 45 30.306667
146 45 29.330000
I can't just simply add the month later, as not every station has an observation in every month.
I've tried other methods, such as:
df.resample("M").mean()
But it doesn't seem possible to combine this with grouping by station; it returns the mean over everything.
Edit: This is ultimately what I would want.
Station_Name Date Value
0 2 2003-01 29.448387
1 2 2003-02 30.617857
2 2 2003-03 28.758065
3 2 2003-04 28.392593
4 2 2003-05 30.318519
... ... ...
142 45 2003-08 16.160000
143 45 2003-09 18.906452
144 45 2003-10 26.296667
145 45 2003-11 30.306667
146 45 2003-12 29.330000
OK, how about this:
df = df.groupby(['Station_Name', df['Date'].dt.to_period('M')])['Value'].mean().reset_index()
Output:
Station_Name Date Value
0 2 2006-01 14.6
1 45 2006-12 38.2
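The reset_index() at the end is what turns the groupby result (a Series with a MultiIndex of station and month) back into a regular dataframe with named columns. If you would rather build on resample, here is a sketch of an alternative, assuming Date is a datetime64 column: group by station first, then resample within each group:

# A sketch of a resample-based alternative, assuming 'Date' is datetime64:
res = (df.set_index('Date')
         .groupby('Station_Name')['Value']
         .resample('M')
         .mean()
         .reset_index())
# 'Date' now holds month-end timestamps; convert to monthly periods if preferred:
res['Date'] = res['Date'].dt.to_period('M')

Either way you get one row per station and month, with Station_Name, Date and Value as ordinary columns.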
I have my CSV data saved as a dataframe and I want to take the values of a row and then use them in a function. I'll try to show what I am looking for. I have tried sorting by amounts, but I can't figure out how to separate out the data after that step. I am new to Pandas and I would appreciate any helpful, problem-relevant feedback.
UPDATE: If you suggest using .apply on the dataframe, could you show me a good way of applying a complex function? The Pandas documentation only shows simple functions, which I don't find useful given the context.
Here is the df
Date Amount
0 12/27/2019 NaN
1 12/27/2019 -14.00
2 12/27/2019 -15.27
3 12/30/2019 -1.00
4 12/30/2019 -35.01
5 12/30/2019 -9.99
6 01/02/2020 -7.57
7 01/03/2020 1225.36
8 01/03/2020 -40.00
9 01/03/2020 -59.90
10 01/03/2020 -9.52
11 01/06/2020 100.00
12 01/06/2020 -6.41
13 01/06/2020 -31.07
14 01/06/2020 -2.50
15 01/06/2020 -7.46
16 01/06/2020 -18.98
17 01/06/2020 -1.25
18 01/06/2020 -2.50
19 01/06/2020 -1.25
20 01/06/2020 -170.94
21 01/06/2020 -150.00
22 01/07/2020 -20.00
23 01/07/2020 -18.19
24 01/07/2020 -4.00
25 01/08/2020 -1.85
26 01/08/2020 -1.10
27 01/09/2020 -21.00
28 01/09/2020 -31.00
29 01/09/2020 -7.13
30 01/10/2020 -10.00
31 01/10/2020 -1.75
32 01/10/2020 -125.00
33 01/13/2020 -10.60
34 01/13/2020 -2.50
35 01/13/2020 -7.00
36 01/13/2020 -46.32
37 01/13/2020 -1.25
38 01/13/2020 -39.04
39 01/13/2020 -9.46
40 01/13/2020 -179.00
41 01/13/2020 -140.00
42 01/15/2020 -150.04
I want to take the amount value from a row and then look for a matching amount value in another row. Once a matching value is found, I want to compute the timedelta between the two matching rows.
Thus far, every time I have tried a conditional statement of some sort I get an error. Does anyone have any ideas how I might be able to accomplish this task?
Here is a bit of code I have started with.
amount_1 = df.loc[1, 'Amount']
amount_2 = df.loc[2, 'Amount']
print(amount_1, amount_2)
date_1 = df.loc[2, 'Date'] #skipping the first row.
x = 2
x += 1
date_2 = df.loc[x, 'Date']
## Not real code, but a logical flow I am aiming for
if amount_2 == amount_1:
    timed = date_2 - date_1
    print(timed, amount_2)
elif amount_2 != amount_1:
    # go to the next row and check
You could use something like this:
distinct_values = df["Amount"].unique()  # select all distinct values
for value_unique in distinct_values:  # for each distinct value
    temp_df = df.loc[df["Amount"] == value_unique]  # find the rows with that value
    # You could iterate over that temp_df to do your timedelta operations...
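Building on that, here is a minimal sketch of the timedelta step, assuming 'Date' and 'Amount' are named as in your frame, and that "matching" means consecutive occurrences of the same amount:

import pandas as pd

# A minimal sketch: parse the dates, then for each amount let diff()
# compute the elapsed time since its previous occurrence
df['Date'] = pd.to_datetime(df['Date'], format='%m/%d/%Y')
df['TimeDelta'] = df.sort_values('Date').groupby('Amount')['Date'].diff()

# NaT marks the first occurrence of each amount; drop those to keep only matches
print(df.dropna(subset=['TimeDelta']))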
I am trying to trim the actions a user has done in a session whenever the number of actions reaches a threshold.
Here is the data set (only a few records):
user_id,session_id,item_id,rating,length,time
123,36,28,3.5,6243.0,2015-03-07 22:44:40
123,36,29,2.5,4884.0,2015-03-07 22:44:14
123,36,30,3.5,6846.0,2015-03-07 22:44:28
123,36,54,6.5,10281.0,2015-03-07 22:43:56
123,36,61,3.5,7639.0,2015-03-07 22:43:44
123,36,62,7.5,18640.0,2015-03-07 22:43:34
123,36,63,8.5,7189.0,2015-03-07 22:44:06
123,36,97,2.5,7627.0,2015-03-07 22:42:53
123,36,98,4.5,9000.0,2015-03-07 22:43:04
123,36,99,7.5,7514.0,2015-03-07 22:43:13
223,63,30,8.0,5412.0,2015-03-22 01:42:10
123,36,30,5.5,8046.0,2015-03-07 22:42:05
223,63,32,8.5,4872.0,2015-03-22 01:42:03
123,36,32,7.5,11914.0,2015-03-07 22:41:54
225,63,35,7.5,6491.0,2015-03-22 01:42:19
123,36,35,5.5,7202.0,2015-03-07 22:42:15
123,36,36,6.5,6806.0,2015-03-07 22:42:43
123,36,37,2.5,6810.0,2015-03-07 22:42:34
225,63,41,5.0,15026.0,2015-03-22 01:42:37
225,63,45,6.5,8532.0,2015-03-07 22:42:25
I can group the data by user_id and session_id and get a count of the items a user has rated in a session:
df.groupby(['user_id', 'session_id']).agg({'item_id':'count'}).rename(columns={'item_id': 'count'})
The list of items that a user has rated in a session can be obtained with:
df.groupby(['user_id','session_id'])['item_id'].apply(list)
The goal is the following: if a user has rated more than 3 items in a session, I want to pick only the first three items (keep only the first three per user per session) from the original data frame. Maybe use the time to sort the items?
I first tried to find which sessions contain more than 3 items, but I am struggling to go beyond that:
df.groupby(['user_id', 'session_id'])['item_id'].apply(
lambda x: (x > 3).count())
Example: from the original df, user 123 should keep only the first three records belonging to session 36.
It seems like you want to use groupby with head:
In [8]: df.groupby([df.user_id, df.session_id]).head(3)
Out[8]:
user_id session_id item_id rating length time
0 123 36 28 3.5 6243.0 2015-03-07 22:44:40
1 123 36 29 2.5 4884.0 2015-03-07 22:44:14
2 123 36 30 3.5 6846.0 2015-03-07 22:44:28
10 223 63 30 8.0 5412.0 2015-03-22 01:42:10
12 223 63 32 8.5 4872.0 2015-03-22 01:42:03
14 225 63 35 7.5 6491.0 2015-03-22 01:42:19
18 225 63 41 5.0 15026.0 2015-03-22 01:42:37
19 225 63 45 6.5 8532.0 2015-03-07 22:42:25
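If "first three" should mean the three earliest by timestamp, as the question hints, sort before taking the head. A small sketch, assuming the time column parses as datetimes:

# Sort by time so head(3) picks the three earliest-rated items per group
df['time'] = pd.to_datetime(df['time'])
res = df.sort_values('time').groupby(['user_id', 'session_id']).head(3)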
One way is to use sort_values followed by groupby.cumcount. A method I find useful is to extract any Series or MultiIndex data before applying any filtering.
The example below keeps only user_id / session_id combinations with at least 3 items and takes the first 3 rows of each such group, sorting by time first so that "first" means earliest:
df = df.sort_values('time')  # sort first so "first 3" means the 3 earliest ratings
sizes = df.groupby(['user_id', 'session_id']).size()
counter = df.groupby(['user_id', 'session_id']).cumcount() + 1  # counting begins at 0
indices = df.set_index(['user_id', 'session_id']).index
res = df[(indices.map(sizes.get) >= 3) & (counter <= 3)]
print(res)
    user_id  session_id  item_id  rating   length                 time
13      123          36       32     7.5  11914.0  2015-03-07 22:41:54
11      123          36       30     5.5   8046.0  2015-03-07 22:42:05
15      123          36       35     5.5   7202.0  2015-03-07 22:42:15
19      225          63       45     6.5   8532.0  2015-03-07 22:42:25
14      225          63       35     7.5   6491.0  2015-03-22 01:42:19
18      225          63       41     5.0  15026.0  2015-03-22 01:42:37
I have a dataframe with daily data, for over 3 years.
I would like to construct another dataframe containing the data from the last 5 days of each month.
In that case, the rows of the 'date' column of the newly constructed dataframe would be:
2013-01-27
2013-01-28
2013-01-29
2013-01-30
2013-01-31
2013-02-23
2013-02-25
2013-02-26
2013-02-27
2013-02-28
Could someone tell me how I could manage that? Many thanks!
One way to do this is to use dt.day and dt.days_in_month with boolean indexing:
import numpy as np
import pandas as pd

df = pd.DataFrame({'Date': pd.date_range('2010-01-01', '2013-12-31', freq='D'),
                   'Value': np.random.rand(1461)})
df_out = df[df['Date'].dt.day > df['Date'].dt.days_in_month - 5]
print(df_out.head(20))
Output:
Date Value
26 2010-01-27 0.097695
27 2010-01-28 0.236572
28 2010-01-29 0.910922
29 2010-01-30 0.777657
30 2010-01-31 0.943031
54 2010-02-24 0.217144
55 2010-02-25 0.970090
56 2010-02-26 0.658967
57 2010-02-27 0.189376
58 2010-02-28 0.229299
85 2010-03-27 0.986992
86 2010-03-28 0.980633
87 2010-03-29 0.258102
88 2010-03-30 0.827310
89 2010-03-31 0.813219
115 2010-04-26 0.135519
116 2010-04-27 0.263941
117 2010-04-28 0.120624
118 2010-04-29 0.993652
119 2010-04-30 0.901466
Assuming that your column is named Date:
df.groupby([df.Date.dt.month, df.Date.dt.year]).apply(lambda x: x[-5:]).reset_index(drop=True).sort_values('Date')
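Note that x[-5:] slices the last five rows of each month group positionally, so it assumes the frame is already in date order. A sketch of an equivalent spelling that sorts first and uses tail:

# groupby aligns the year/month key Series by index, so sorting first is safe
df_out = (df.sort_values('Date')
            .groupby([df.Date.dt.year, df.Date.dt.month])
            .tail(5))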
total_val_count = dataset[attr].value_counts()
for i in range(len(total_val_count.index)):
    print(total_val_count[i])
I have written this piece of code, which counts the occurrences of all distinct values of an attribute in a dataframe. The problem I am facing is that I am unable to access the first value by using index 0; I get a KeyError: 0 on the very first loop iteration.
The total_val_count contains proper values as shown below:
34 2887
4 2708
13 2523
35 2507
33 2407
3 2404
36 2382
26 2378
16 2282
22 2187
21 2141
12 2104
25 2073
5 2052
15 2044
17 2040
14 2027
28 1984
27 1980
23 1979
24 1960
30 1953
29 1936
31 1884
18 1877
7 1858
37 1767
20 1762
11 1740
8 1722
6 1693
32 1692
10 1662
9 1576
19 1308
2 1266
1 175
38 63
dtype: int64
total_val_count is a Series. The index of the Series holds the distinct values in dataset[attr], and the values of the Series are the number of times each of those values appears.
When you index a Series with total_val_count[i], Pandas looks for i in the index and returns the associated value. In other words, total_val_count[i] indexes by index label, not by ordinal position.
Think of a Series as a mapping from the index to the values. When using plain indexing, e.g. total_val_count[i], it behaves more like a dict than a list.
You are getting a KeyError because 0 is not a value in the index.
To index by ordinal, use total_val_count.iloc[i].
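A small sketch with made-up data to show the difference:

import pandas as pd

s = pd.Series([10, 20], index=[5, 7])
print(s.iloc[0])  # 10 -- positional (ordinal) indexing
print(s[5])       # 10 -- label-based indexing
# s[0] raises KeyError: 0, because 0 is not a label in the index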
Having said that, using for i in range(len(total_val_count.index)) -- or, what amounts to the same thing, for i in range(len(total_val_count)) -- is not recommended. Instead of
for i in range(len(total_val_count)):
    print(total_val_count.iloc[i])
you could use
for value in total_val_count.values:
    print(value)
This is more readable, and allows you to access the desired value as a variable, value, instead of the more cumbersome total_val_count.iloc[i].
Here is an example which shows how to iterate over the values, over the keys, and over both the keys and values together:
import pandas as pd

s = pd.Series([1, 2, 3, 2, 2])
total_val_count = s.value_counts()
print(total_val_count)
# 2    3
# 3    1
# 1    1
# dtype: int64

for value in total_val_count.values:
    print(value)
# 3
# 1
# 1

for key in total_val_count.keys():
    print(key)
# 2
# 3
# 1

for key, value in total_val_count.items():
    print(key, value)
# 2 3
# 3 1
# 1 1

for i in range(len(total_val_count)):
    print(total_val_count.iloc[i])
# 3
# 1
# 1