Select specific days data for each month in a dataframe - python

I have a dataframe with daily data, for over 3 years.
I would like to construct another dataframe containing the data from the last 5 days of each month.
In that case, the 'date' column of the newly constructed dataframe would contain:
2013-01-27
2013-01-28
2013-01-29
2013-01-30
2013-01-31
2013-02-23
2013-02-25
2013-02-26
2013-02-27
2013-02-28
Could someone tell me how I could manage that?
Many thanks!

One way to do this is to use dt.day and dt.days_in_month with boolean indexing:
import numpy as np
import pandas as pd

# four years of daily data (1461 days) with random values
df = pd.DataFrame({'Date': pd.date_range('2010-01-01', '2013-12-31', freq='D'),
                   'Value': np.random.rand(1461)})

# keep rows whose day-of-month falls within the last 5 days of that month
df_out = df[df['Date'].dt.day > df['Date'].dt.days_in_month - 5]
print(df_out.head(20))
Output:
Date Value
26 2010-01-27 0.097695
27 2010-01-28 0.236572
28 2010-01-29 0.910922
29 2010-01-30 0.777657
30 2010-01-31 0.943031
54 2010-02-24 0.217144
55 2010-02-25 0.970090
56 2010-02-26 0.658967
57 2010-02-27 0.189376
58 2010-02-28 0.229299
85 2010-03-27 0.986992
86 2010-03-28 0.980633
87 2010-03-29 0.258102
88 2010-03-30 0.827310
89 2010-03-31 0.813219
115 2010-04-26 0.135519
116 2010-04-27 0.263941
117 2010-04-28 0.120624
118 2010-04-29 0.993652
119 2010-04-30 0.901466

Assuming that your date column is named Date:
(df.groupby([df.Date.dt.month, df.Date.dt.year])
   .apply(lambda x: x[-5:])
   .reset_index(drop=True)
   .sort_values('Date'))
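For what it's worth, a sketch of the same idea without the apply. Note the difference between the two answers: this keeps the last 5 rows per month, which equals the last 5 calendar days only if there is exactly one row per day, whereas the dt.days_in_month approach above selects by calendar day regardless of gaps.
# last 5 rows per (year, month) group; tail() avoids the apply-lambda
df_out = (df.groupby([df.Date.dt.year, df.Date.dt.month])
            .tail(5)
            .sort_values('Date'))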

Related

Pandas: calculating mean value of multiple columns using datetime and Grouper removes columns or doesn't return correct Dataframe

As part of a larger task, I want to calculate the monthly mean values for each specific station. This is already difficult to do, but I am getting close.
The dataframe has many columns, but ultimately I only use the following information:
Date Value Station_Name
0 2006-01-03 18 2
1 2006-01-04 12 2
2 2006-01-05 11 2
3 2006-01-06 10 2
4 2006-01-09 22 2
... ... ...
3510 2006-12-23 47 45
3511 2006-12-24 46 45
3512 2006-12-26 35 45
3513 2006-12-27 35 45
3514 2006-12-30 28 45
I am running into two issues, using:
df.groupby(['Station_Name', pd.Grouper(freq='M')])['Value'].mean()
It results in something like:
Station_Name Date
2 2003-01-31 29.448387
2003-02-28 30.617857
2003-03-31 28.758065
2003-04-30 28.392593
2003-05-31 30.318519
...
45 2003-09-30 16.160000
2003-10-31 18.906452
2003-11-30 26.296667
2003-12-31 30.306667
2004-01-31 29.330000
I can't seem to use this as a regular dataframe, and the datetime is off: instead of the month the mean belongs to, it shows the last day of that month. Also the station name appears only once per group in the index, not on every row, and the mean values have no column name at all. The result isn't a dataframe but a pandas.core.series.Series. Converting it with the .to_frame() method does give a DataFrame again, but not in the shape I need, which is the part I don't get.
I found that in order to return a normal dataframe, I should use as_index=False in the groupby method. But this results in the months not being shown:
df.groupby(['Station_Name', pd.Grouper(freq='M')], as_index=False)['Value'].mean()
Gives:
Station_Name Value
0 2 29.448387
1 2 30.617857
2 2 28.758065
3 2 28.392593
4 2 30.318519
... ... ...
142 45 16.160000
143 45 18.906452
144 45 26.296667
145 45 30.306667
146 45 29.330000
I can't just simply add the month later, as not every station has an observation in every month.
I've tried using other methods, such as
df.resample("M").mean()
But it doesn't seem possible to do this on multiple columns. It returns the mean value of everything.
Edit: This is ultimately what I would want.
Station_Name Date Value
0 2 2003-01 29.448387
1 2 2003-02 30.617857
2 2 2003-03 28.758065
3 2 2003-04 28.392593
4 2 2003-05 30.318519
... ... ...
142 45 2003-08 16.160000
143 45 2003-09 18.906452
144 45 2003-10 26.296667
145 45 2003-11 30.306667
146 45 2003-12 29.330000
OK, how about this:
df = df.groupby(['Station_Name', df['Date'].dt.to_period('M')])['Value'].mean().reset_index()
Output:
Station_Name Date Value
0 2 2006-01 14.6
1 45 2006-12 38.2
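If a plain string like 2006-01 is preferred over the Period dtype that to_period('M') produces, a small follow-up (a sketch; column names as in the question):
# the Date column is a Period after reset_index(); render it as 'YYYY-MM' strings
df['Date'] = df['Date'].astype(str)
# or convert back to a timestamp at the start of each month:
# df['Date'] = df['Date'].dt.to_timestamp()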

How can I iterate over a Pandas DataFrame and run a function on its rows

I have my CSV data saved as a dataframe and I want to take the values of a row and then use them in a function. I'll try to show what I am looking for. I have tried sorting by amounts, but I can't figure out how to separate out the data after that step. I am new to Pandas and I would appreciate any helpful and problem-relevant feedback.
UPDATE: If you suggest using .apply on the dataframe, could you show me a good way of applying a complex function? The Pandas documentation only shows simple functions, which I don't find useful given the context.
Here is the df
Date Amount
0 12/27/2019 NaN
1 12/27/2019 -14.00
2 12/27/2019 -15.27
3 12/30/2019 -1.00
4 12/30/2019 -35.01
5 12/30/2019 -9.99
6 01/02/2020 -7.57
7 01/03/2020 1225.36
8 01/03/2020 -40.00
9 01/03/2020 -59.90
10 01/03/2020 -9.52
11 01/06/2020 100.00
12 01/06/2020 -6.41
13 01/06/2020 -31.07
14 01/06/2020 -2.50
15 01/06/2020 -7.46
16 01/06/2020 -18.98
17 01/06/2020 -1.25
18 01/06/2020 -2.50
19 01/06/2020 -1.25
20 01/06/2020 -170.94
21 01/06/2020 -150.00
22 01/07/2020 -20.00
23 01/07/2020 -18.19
24 01/07/2020 -4.00
25 01/08/2020 -1.85
26 01/08/2020 -1.10
27 01/09/2020 -21.00
28 01/09/2020 -31.00
29 01/09/2020 -7.13
30 01/10/2020 -10.00
31 01/10/2020 -1.75
32 01/10/2020 -125.00
33 01/13/2020 -10.60
34 01/13/2020 -2.50
35 01/13/2020 -7.00
36 01/13/2020 -46.32
37 01/13/2020 -1.25
38 01/13/2020 -39.04
39 01/13/2020 -9.46
40 01/13/2020 -179.00
41 01/13/2020 -140.00
42 01/15/2020 -150.04
I want to take the amount value from a row, then look for a matching amount value in another row. Once a match is found, I want to compute the timedelta between the two rows.
Thus far, every time I have tried a conditional statement of some sort I get an error. Does anyone have any ideas how I might be able to accomplish this task?
Here is a bit of code I have started with.
amount_1 = df.loc[1, 'Amount']
amount_2 = df.loc[2, 'Amount']
print(amount_1, amount_2)
date_1 = df.loc[2, 'Date'] #skipping the first row.
x = 2
x += 1
date_2 = df.loc[x, 'Date']
## Not real code, but a logical flow I am aiming for
if amount_2 == amount_1:
    timed = date_2 - date_1
    print(timed, amount_2)
elif amount_2 != amount_1:
    pass  # go to the next row and check
You could use something like this:
distinct_values = df["Amount"].unique()  # all distinct values
for value_unique in distinct_values:  # for each distinct value
    temp_df = df.loc[df["Amount"] == value_unique]  # rows with that value
    # You could iterate over that temp_df to do your timedelta operations...
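To sketch the timedelta step itself (assuming, as the question implies, that each amount should be compared with its previous occurrence; column names taken from the question):
import pandas as pd

df['Date'] = pd.to_datetime(df['Date'])
df = df.sort_values('Date')
# time elapsed since the previous row with the same Amount (NaT if none)
df['TimeSinceSameAmount'] = df.groupby('Amount')['Date'].diff()
print(df[df['TimeSinceSameAmount'].notna()])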

Find median in nth range in Python

I am trying to find the median in my dataset for every 15 days. The dataset has three columns: index, value and date.
This is for evaluating the median according to some conditions; each 15-day window will get a new value according to those conditions.
I've tried several approaches (mostly Python comprehensions), but I am still too much of a beginner to solve it properly.
value date index
14 13065 1983-07-15 14
15 13065 1983-07-16 15
16 13065 1983-07-17 16
17 13065 1983-07-18 17
18 13065 1983-07-19 18
19 13065 1983-07-20 19
20 13065 1983-07-21 20
21 13065 1983-07-22 21
22 13065 1983-07-23 22
23 ..... ......... ..
medians = [dataset['value'].median() for range(0, len(dataset['index']), 15) in dataset['value']]
I am expecting to return medians from the dataframe to a new variable.
SyntaxError: can't assign to function call
Assuming you have data in the below format:
import numpy as np
import pandas as pd

# 1000 consecutive days of random integer values
test = pd.DataFrame({'date': pd.date_range(start='2016/02/12', periods=1000, freq='1D'),
                     'value': np.random.randint(1, 1000, 1000)})
test.head()
date value
0 2016-02-12 243
1 2016-02-13 313
2 2016-02-14 457
3 2016-02-15 236
4 2016-02-16 893
If you want the median for every 15 days, use pd.Grouper and group by date:
test.groupby(pd.Grouper(freq='15D', key='date')).median().reset_index()
date value
2016-02-12 457.0
2016-02-27 733.0
2016-03-13 688.0
2016-03-28 504.0
2016-04-12 591.0
Note that while using pd.Grouper, your date column should be of type datetime. If it's not, convert using:
test['date'] = pd.to_datetime(test['date'])
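And to get the medians into a plain variable, as the question asks (a sketch building on the test frame above):
medians = test.groupby(pd.Grouper(freq='15D', key='date'))['value'].median()
median_list = medians.tolist()  # one median per 15-day window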

Data cleaning and preparation for Time-Series-LSTM

I need to prepare my data to feed it into an LSTM for predicting the next day.
My dataset is a time series in seconds, but I only have 3-5 hours of data a day. (I only have this specific dataset, so I can't change it.)
I have a date-time column and a certain value.
E.g.:
datetime             Value
2015-03-15 12:00:00   1000
2015-03-15 12:00:01     10
...
I would like to write code that extracts e.g. 4 hours per day and deletes the first extracted hour, but only for specific months (because that data is faulty).
I managed to write code that extracts e.g. 2 hours for the x-data (input) and y-data (output).
I hope I could explain my problem to you.
The dataset is one year of per-second data, 6pm-11pm; the rest is missing.
In e.g. August-November the first hour is faulty data and needs to be deleted.
init = True
for day in np.unique(x_df.index.date):
    temp = x_df.loc[(day + pd.DateOffset(hours=18)):(day + pd.DateOffset(hours=20))]
    if len(temp) == 7201:
        if init:
            x_df1 = np.array([temp.values])
            init = False
        else:
            #print(temp.values.shape)
            x_df1 = np.append(x_df1, np.array([temp.values]), axis=0)
    #else:
    #    if not temp.empty:
    #        print(temp.index[0].date(), len(temp))
x_df1 = np.array(x_df1)
print('X-Shape:', x_df1.shape,
      'Y-Shape:', y_df1.shape)
#sample, timesteps and features for LSTM
X-Shape: (32, 7201, 6) Y-Shape: (32, 7201)
My expected result is to have a dataset of e.g. 4 hours a day where the first hour in e.g. August, September, and October is deleted.
I would be also very happy if there is someone who can also provide me with a nicer code to do so.
Probably not the most efficient solution, but maybe it still fits.
First let's generate some random data for the first 4 months, 5 days per month:
import datetime
import random
import pandas as pd

df = pd.DataFrame()
for month in range(1, 5):  # first 4 months
    for day in range(5, 10):  # 5 days each
        hour = random.randint(18, 19)
        minute = random.randint(1, 59)
        dt = datetime.datetime(2018, month, day, hour, minute, 0)
        dti = pd.date_range(dt, periods=60*60*4, freq='S')  # 4 hours of seconds
        values = [random.randrange(1, 101, 1) for _ in range(len(dti))]
        df = df.append(pd.DataFrame(values, index=dti, columns=['Value']))
Now let's define a function to filter the first row per day:
def first_value_per_day(df):
    # first row of each calendar day; droplevel removes the added group level
    res_df = df.groupby(df.index.date).apply(lambda x: x.iloc[[0]])
    res_df.index = res_df.index.droplevel(0)
    return res_df
and print the results:
print(first_value_per_day(df))
Value
2018-01-05 18:31:00 85
2018-01-06 18:25:00 40
2018-01-07 19:54:00 52
2018-01-08 18:23:00 46
2018-01-09 18:08:00 51
2018-02-05 18:58:00 6
2018-02-06 19:12:00 16
2018-02-07 18:18:00 10
2018-02-08 18:32:00 50
2018-02-09 18:38:00 69
2018-03-05 19:54:00 100
2018-03-06 18:37:00 70
2018-03-07 18:58:00 26
2018-03-08 18:28:00 30
2018-03-09 18:34:00 71
2018-04-05 18:54:00 2
2018-04-06 19:16:00 100
2018-04-07 18:52:00 85
2018-04-08 19:08:00 66
2018-04-09 18:11:00 22
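As a side note, the same first-row-per-day result can be had without the apply (a sketch; head(1) keeps the original index, so no droplevel is needed):
fvpd = df.groupby(df.index.date).head(1)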
Now we need a list of the specific months that should be processed, in this case 2 and 3. We use the defined function, filter its result for each selected month, and loop over those rows to find the indexes of all values within the first hour after each day's first entry, then drop them:
MONTHS_TO_MODIFY = [2, 3]
HOURS_TO_DROP = 1

fvpd = first_value_per_day(df)
for m in MONTHS_TO_MODIFY:
    fvpdm = fvpd[fvpd.index.month == m]
    for idx, value in fvpdm.iterrows():
        start_dt = idx
        end_dt = idx + datetime.timedelta(hours=HOURS_TO_DROP)
        index_list = df[(df.index >= start_dt) & (df.index < end_dt)].index.tolist()
        df.drop(index_list, inplace=True)
Result:
print(first_value_per_day(df))
Value
2018-01-05 18:31:00 85
2018-01-06 18:25:00 40
2018-01-07 19:54:00 52
2018-01-08 18:23:00 46
2018-01-09 18:08:00 51
2018-02-05 19:58:00 1
2018-02-06 20:12:00 42
2018-02-07 19:18:00 34
2018-02-08 19:32:00 34
2018-02-09 19:38:00 61
2018-03-05 20:54:00 15
2018-03-06 19:37:00 88
2018-03-07 19:58:00 36
2018-03-08 19:28:00 38
2018-03-09 19:34:00 42
2018-04-05 18:54:00 2
2018-04-06 19:16:00 100
2018-04-07 18:52:00 85
2018-04-08 19:08:00 66
2018-04-09 18:11:00 22
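For completeness, a sketch of a vectorized alternative to the drop loop (assumes df has a DatetimeIndex as above; the function name is mine):
import pandas as pd

def drop_first_hours(df, months, hours=1):
    ts = df.index.to_series()
    # first timestamp of each calendar day, broadcast to every row of that day
    day_start = ts.groupby(df.index.date).transform('min')
    in_first_hours = ts < day_start + pd.Timedelta(hours=hours)
    in_selected_month = df.index.month.isin(months)
    return df[~(in_first_hours & in_selected_month)]

df = drop_first_hours(df, MONTHS_TO_MODIFY, HOURS_TO_DROP)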

Unable to index x-axis of bokeh line chart with timestamps

I have been trying to make a bokeh line chart, however I am running into the issue of indexing the x-axis with a column of time stamps from my pandas data frame. Currently my data frame looks like this:
TMAX TMIN TAVG DAY NUM
2007-04-30 65 46 55.5 2007-04-30 1
2007-05-01 75 45 60.0 2007-05-01 2
2007-05-02 66 52 59.0 2007-05-02 3
2007-05-03 65 43 54.0 2007-05-03 4
2007-05-04 61 45 53.0 2007-05-04 5
2007-05-05 65 43 54.0 2007-05-05 6
2007-05-06 77 51 64.0 2007-05-06 7
2007-05-07 89 66 77.5 2007-05-07 8
2007-05-08 91 56 73.5 2007-05-08 9
2007-05-09 83 48 65.5 2007-05-09 10
2007-05-10 68 47 57.5 2007-05-10 11
2007-05-11 65 46 55.5 2007-05-11 12
2007-05-12 63 43 53.0 2007-05-12 13
2007-05-13 65 46 55.5 2007-05-13 14
2007-05-14 71 46 58.5 2007-05-14 15
....
[3592 rows x 5 columns]
I want to index the line plot with the values of the "DAY" column, however, I get an error no matter the approach I take. The documentation for line plots says that "x (str or list(str), optional) – specifies variable(s) to use for x axis". My code is as follows:
xyvalues = np.array([df['TAVG'], df_reg['ry'], df['DAY']])
regr = Line(data=xyvalues, x='DAY', title="Linear Regression of Data", ylabel="Average Daily Temperature", xlabel="Number of Days")
output_file("regression.html")
show(regr)
This gives me the error "TypeError: Cannot compare type 'Timestamp' with type 'float64'". I have tried converting it to float, but it doesn't seem to have an effect. Any help would be much appreciated. The df_reg['ry'] is data from a linear regression data frame.
Documentation for line graphs can be found here: http://docs.bokeh.org/en/latest/docs/reference/charts.html#line
Inside Line, you need to pass a pandas data frame to the data argument in order to be able to refer to your variable DAY for the x axis ticks. Here I create a new pandas DataFrame from the other two:
import pandas as pd
from bokeh.charts import Line  # the (older) bokeh.charts API used here
from bokeh.io import output_file, show

df2 = pd.DataFrame(data=dict(TAVG=df['TAVG'], ry=df_reg['ry'], DAY=df['DAY']))
regr = Line(data=df2, x='DAY',
            title="Linear Regression of Data",
            ylabel="Average Daily Temperature",
            xlabel="Number of Days")
output_file("regression.html")
show(regr)
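Note that bokeh.charts (including Line) was deprecated and later removed from Bokeh; with the modern bokeh.plotting API the equivalent would be roughly (a sketch, reusing df2 from above):
from bokeh.plotting import figure, output_file, show

p = figure(x_axis_type='datetime',
           title="Linear Regression of Data",
           x_axis_label="Date",
           y_axis_label="Average Daily Temperature")
p.line(df2['DAY'], df2['TAVG'], legend_label="TAVG")
p.line(df2['DAY'], df2['ry'], line_color="red", legend_label="regression")
output_file("regression.html")
show(p)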
