Segmenting a dataframe based on date with a datetime column in Python

I have a dataframe named data as shown:
Date     Value  X1   X2  X3
2019-05  15     23   65  98
2019-05  34     132  56  87
2019-06  23     66   90  44
The Date column is in a Year-Month datetime format, starting from 2017-01 with the most recent being 2022-05. I want to write a piece of code that will extract the data into separate dataframes. More specifically, I want one dataframe to contain the rows of the current month and year (2022-05), another dataframe to contain the data from the previous month (2022-04), and one more dataframe that contains data from 12 months ago (2021-05).
For my code I have the following:
import pandas as pd
from datetime import datetime as dt

# parse_dates is needed so the .dt accessor works on the Date column
data = pd.read_csv("results.csv", parse_dates=["Date"])
current = data[data["Date"].dt.month == dt.now().month]
My results show the following:
Date     Value  X1   X2  X3
2019-05  15     23   65  98
2019-05  34     132  56  87
2020-05  23     66   90  44
So I get the rows that match the current month, but I need it to also match the current year. I assumed I could just add multiple conditions to match the current month and current year, but that did not seem to work for me.
Also, is there a way to write the code so that I can extract the data from the previous month and the previous year based on what the current month-year is? My first thought was to just subtract 1 from the month and do the same thing for the year, and if the current month is January, write an exception to subtract 1 from both the month and year for the previous-month analysis.

Split your DF into a dict of DFs and then access the one you want directly by the date (YYYY-MM).
index  Date     Value  X1   X2  X3
0      2017-04  15     23   65  98
1      2019-05  34     132  56  87
2      2021-06  23     66   90  44
dfs = {x: df[df.Date == x] for x in df.Date.unique()}
dfs['2017-04']
index  Date     Value  X1  X2  X3
0      2017-04  15     23  65  98
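The keys for "this month", "last month", and "a year ago" don't need manual month/year arithmetic with a January special case — pandas Period objects roll over year boundaries for you. A minimal sketch, assuming the Date values are YYYY-MM strings; the 2022-05 "now" is hard-coded for illustration (pd.Period.now("M") would give the real current month):

```python
import pandas as pd

now = pd.Period("2022-05", freq="M")  # stand-in for pd.Period.now("M")
keys = {
    "current": str(now),             # "2022-05"
    "previous_month": str(now - 1),  # "2022-04" (subtracting rolls over years automatically)
    "year_ago": str(now - 12),       # "2021-05"
}
# e.g. dfs[keys["previous_month"]] would then pick the previous month's frame
```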

You can do this with a groupby operation, which is a first-class operation in tabular analysis (SQL/pandas). In this case, you want to group by both year and month, creating one dataframe per group:
dfs = []
for key, group_df in df.groupby([df.Date.dt.year, df.Date.dt.month]):
    dfs.append(group_df)
dfs will have the subgroups you want.
One thing worth noting: there is a performance cost to breaking a dataframe into a list of frames. It's just as likely that you could do whatever processing comes next directly in the groupby statement, such as df.groupby(...).X1.transform('sum') for example.
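The transform route mentioned above keeps everything in one frame instead of a list — a minimal sketch using made-up numbers from the question's table:

```python
import pandas as pd

# Toy frame with two May rows and one June row
df = pd.DataFrame({
    "Date": pd.to_datetime(["2019-05-01", "2019-05-01", "2019-06-01"]),
    "X1": [23, 132, 66],
})

# transform broadcasts each (year, month) group's total back onto every row
df["X1_month_total"] = df.groupby(
    [df.Date.dt.year, df.Date.dt.month])["X1"].transform("sum")
```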

Related

How to aggregate time-series data over specific ranges?

I have a pandas dataframe that looks like this, whereby each row represents data collected on a different day (days 1 -> 5) for each participant (long form).
ID Heart_Rate
1 89
1 98
1 99
1 73
1 54
...
24 88
24 90
24 79
24 92
24 97
How can I aggregate the data over the first 3 days for each participant, such that I create a new data frame with one row per patient, where the data represents the mean heart rate over those 72 hours?
We can set the index of the dataframe to ID, group the dataframe on level=0 and use head to select the first three rows for each user ID, then group on level=0 again and take the mean to get the average heart rate over the first 72 hours (the old mean(level=0) shortcut was removed in pandas 2.0):
out = df.set_index('ID').groupby(level=0).head(3).groupby(level=0).mean()
An alternate approach which is more efficient, but applicable only if there is always an equal number of rows for each user ID and the dataframe is sorted on the ID column:
import numpy as np

n_days = 5         # number of rows present for each user ID
n_days_to_avg = 3  # first n rows/days to average
m = np.isin(np.r_[:len(df)] % n_days, np.r_[:n_days_to_avg])
out = df[m].groupby('ID').mean()
>>> out
Heart_Rate
ID
1 95.333333
24 85.666667
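A self-contained version of the head(3) approach, reconstructing the question's sample (the readings for IDs 2-23 are omitted, and set_index isn't needed if you group on the ID column directly):

```python
import pandas as pd

# Reconstructed sample: 5 daily readings per participant
df = pd.DataFrame({
    "ID": [1] * 5 + [24] * 5,
    "Heart_Rate": [89, 98, 99, 73, 54, 88, 90, 79, 92, 97],
})

# First 3 rows per ID, then the mean within each ID
out = df.groupby("ID").head(3).groupby("ID")["Heart_Rate"].mean()
```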

How to export dictionary with multiple values in Excel file

I have a dictionary with multiple values to a key. For ex:
dict = {u'Days': [u'Monday', u'Tuesday', u'Wednesday', u'Thursday'], u'Temp_value': [54, 56, 57, 45], u'Level_value': [30, 34, 35, 36]} and so on...
I want to export this data to Excel in the below-mentioned format.
Column 1 Column 2 column 3 so on...
Days Temp_value Level_value
Monday 54 30
Tuesday 56 34
Wednesday 57 35
Thursday 45 36
How can I do that?
Use pandas
import pandas as pd
df = pd.DataFrame(your_dict)
df.to_excel('your_file.xlsx', index=False)
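A worked sketch of the same idea on the question's dict (renamed here to avoid shadowing the builtin dict; note each key becomes a column and the list values become rows, which is exactly the layout asked for):

```python
import pandas as pd

# Dict-of-lists matching the question's shape
schedule = {
    "Days": ["Monday", "Tuesday", "Wednesday", "Thursday"],
    "Temp_value": [54, 56, 57, 45],
    "Level_value": [30, 34, 35, 36],
}
df = pd.DataFrame(schedule)  # keys -> columns, list entries -> rows
# df.to_excel("your_file.xlsx", index=False)  # requires openpyxl to be installed
```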

Averaging over specific months in pandas

I'm having trouble creating averages using pandas. My problem is that I want to create averages combining the months Nov, Dec, Jan, Feb, March for each winter; however, they fall in different years, so I can't just average the values falling within one calendar year. I have tried subsetting the data into two datetime objects as..
nd_npi_obs = ndjfm_npi_obs[ndjfm_npi_obs.index.month.isin([11,12])]
jfm_npi_obs = ndjfm_npi_obs[ndjfm_npi_obs.index.month.isin([1,2,3])]
..however I'm having trouble manipulating the dates (years) in order to do a simple average. I'm inexperienced with pandas and wondering if there is a more elegant way than exporting to Excel and changing the year! The data is in the form..
Date
1899-01-01 00:00:00 100994.0
1899-02-01 00:00:00 100932.0
1899-03-01 00:00:00 100978.0
1899-11-01 00:00:00 100274.0
1899-12-01 00:00:00 100737.0
1900-01-01 100655.0
1900-02-01 100633.0
1900-03-01 100512.0
1900-11-01 101212.0
1900-12-01 100430.0
Interesting problem. Since you are averaging over five months, this makes resampling trickier. You should be able to overcome this with logical indexing and by building a new dataframe. I assume your index is a datetime value.
import numpy as np
import pandas as pd

index = pd.date_range('1899-09-01', '1902-03-01', freq='M')
data = np.random.randint(0, 100, (index.size, 5))
df = pd.DataFrame(index=index, data=data, columns=list('ABCDE'))
# find rows that meet your criteria and average
idx1 = (df.index.year==1899) & (df.index.month >10)
idx2 = (df.index.year==1900) & (df.index.month < 4)
winterAve = df.loc[idx1 | idx2, :].mean(axis=0)
Just to visually check that the indexing/slicing is doing what we need....
>>> df.loc[idx1 | idx2, :]
A B C D E
1899-11-30 48 91 87 29 47
1899-12-31 63 5 0 35 22
1900-01-31 37 8 89 86 38
1900-02-28 7 35 56 63 46
1900-03-31 72 34 96 94 35
You should be able to put this in a for loop to iterate over multiple years, etc.
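Instead of hand-writing a year/month mask per winter, you can label each Nov-Mar row with the year its winter started in and group once. A sketch with synthetic placeholder values (your real series would replace the val column); note that winters at the edges of the data will be partial:

```python
import numpy as np
import pandas as pd

# Synthetic monthly series spanning two full winters
idx = pd.date_range("1899-01-01", "1901-03-01", freq="MS")
df = pd.DataFrame({"val": np.arange(len(idx), dtype=float)}, index=idx)

# Keep only Nov-Mar, then tag each row with its winter's starting year:
# Nov/Dec keep their own year, Jan-Mar belong to the previous year's winter
winter = df[df.index.month.isin([11, 12, 1, 2, 3])]
winter_year = np.where(winter.index.month >= 11,
                       winter.index.year, winter.index.year - 1)
winter_means = winter.groupby(winter_year).mean()
```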
Group the data by month using pd.Grouper:
g = df.groupby(pd.Grouper(freq="M"))  # DataFrameGroupBy (grouped by month)
For each group, calculate the average of only the 'A' column:
monthly_averages = g.aggregate({"A": "mean"})

how to speed up dataframe analysis

I'm looping through a DataFrame of 200k rows. It does what I want, but it takes hours. I'm not very sophisticated when it comes to all the ways you can join and manipulate DataFrames, so I wonder if I'm doing this in a very inefficient way. It's quite simple; here's the code:
three_yr_gaps = []
for index, row in df.iterrows():
    three_yr_gaps.append(df[(df['GROUP_ID'] == row['GROUP_ID']) &
                            (df['BEG_DATE'] >= row['THREE_YEAR_AGO']) &
                            (df['END_DATE'] <= row['BEG_DATE'])]['GAP'].sum() + row['GAP'])
df['GAP_THREE'] = three_yr_gaps
The DF has a column called GAP that holds an integer value. The logic I'm employing to sum this number up is:
for each row, get these rows from the dataframe:
those that match on the group id, and...
those that have a beginning date within the last 3 years of this row's start date, and...
those that have an ending date before this row's beginning date.
Sum up those rows' GAP numbers, add this row's GAP number, and append the result to a list.
So is there a faster way to introduce this logic into some kind of automatic merge or join that could speed up this process?
PS.
I was asked for some clarification on input and output, so here's a constructed dataset to play with:
import pandas as pd
from dateutil import parser

df = pd.DataFrame(
    columns=['ID_NBR', 'GROUP_ID', 'BEG_DATE', 'END_DATE', 'THREE_YEAR_AGO', 'GAP'],
    data=[['09', '185', parser.parse('2008-08-13'), parser.parse('2009-07-01'), parser.parse('2005-08-13'), 44],
          ['10', '185', parser.parse('2009-08-04'), parser.parse('2010-01-18'), parser.parse('2006-08-04'), 35],
          ['11', '185', parser.parse('2010-01-18'), parser.parse('2011-01-18'), parser.parse('2007-01-18'), 0],
          ['12', '185', parser.parse('2014-09-04'), parser.parse('2015-09-04'), parser.parse('2011-09-04'), 0]])
And here's what I wrote at the top of the script, which may help:
The purpose of this script is to extract gaps counts over the
last 3 year period. It uses gaps.sql as its source extract. this query
returns a DataFrame that looks like this:
ID_NBR GROUP_ID BEG_DATE END_DATE THREE_YEAR_AGO GAP
09 185 2008-08-13 2009-07-01 2005-08-13 44
10 185 2009-08-04 2010-01-18 2006-08-04 35
11 185 2010-01-18 2011-01-18 2007-01-18 0
12 185 2014-09-04 2015-09-04 2011-09-04 0
The python code then looks back at the previous 3 years (those
previous rows that have the same GROUP_ID, whose effective dates
come after their own THREE_YEAR_AGO, and whose end dates come before
their own beginning date). Those rows are added up and a new column
called GAP_THREE is made. What remains is this:
ID_NBR GROUP_ID BEG_DATE END_DATE THREE_YEAR_AGO GAP GAP_THREE
09 185 2008-08-13 2009-07-01 2005-08-13 44 44
10 185 2009-08-04 2010-01-18 2006-08-04 35 79
11 185 2010-01-18 2011-01-18 2007-01-18 0 79
12 185 2014-09-04 2015-09-04 2011-09-04 0 0
You'll notice that row id_nbr 11 has a value of 79 for the last 3 years, but id_nbr 12 has 0, because the last gap (35) was in 2009, which is more than 3 years before 12's beginning date in 2014.
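One vectorized alternative to the row-by-row loop (a sketch, not a definitive answer): self-merge the frame on GROUP_ID so every row is paired with every other row in its group, filter the pairs down to those inside the 3-year window, and sum per row with a single groupby. The pairing is quadratic in group size, but it avoids iterrows entirely:

```python
import pandas as pd

# The question's sample data, with dates parsed up front
df = pd.DataFrame(
    columns=['ID_NBR', 'GROUP_ID', 'BEG_DATE', 'END_DATE', 'THREE_YEAR_AGO', 'GAP'],
    data=[['09', '185', '2008-08-13', '2009-07-01', '2005-08-13', 44],
          ['10', '185', '2009-08-04', '2010-01-18', '2006-08-04', 35],
          ['11', '185', '2010-01-18', '2011-01-18', '2007-01-18', 0],
          ['12', '185', '2014-09-04', '2015-09-04', '2011-09-04', 0]])
for col in ['BEG_DATE', 'END_DATE', 'THREE_YEAR_AGO']:
    df[col] = pd.to_datetime(df[col])

# Pair every row with every row sharing its GROUP_ID
pairs = df.merge(df, on='GROUP_ID', suffixes=('', '_other'))
# Keep only partner rows whose dates fall inside this row's 3-year window
in_window = ((pairs['BEG_DATE_other'] >= pairs['THREE_YEAR_AGO'])
             & (pairs['END_DATE_other'] <= pairs['BEG_DATE']))
summed = pairs[in_window].groupby('ID_NBR')['GAP_other'].sum()
# Rows with no qualifying partners get 0; then add each row's own GAP
df['GAP_THREE'] = df['ID_NBR'].map(summed).fillna(0).astype(int) + df['GAP']
```

On the sample above this reproduces the expected GAP_THREE column (44, 79, 79, 0).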

How to obtain one column from a Series object in pandas?

I originally have 3 columns: timestamp, response_time and type. I need to find the mean response_time for rows that share the same timestamp, so I grouped all matching timestamps together and applied the mean function on them. I got the following series, which is fine:
0 16.949689
1 17.274615
2 16.858884
3 17.025155
4 17.062008
5 16.846885
6 17.172994
7 17.025797
8 17.001974
9 16.924636
10 16.813300
11 17.152066
12 17.291899
13 16.946970
14 16.972884
15 16.871824
16 16.840024
17 17.227682
18 17.288211
19 17.370553
20 17.395759
21 17.449579
22 17.340357
23 17.137308
24 16.981012
25 16.946727
26 16.947073
27 16.830850
28 17.366538
29 17.054468
30 16.823983
31 17.115429
32 16.859003
33 16.919645
34 17.351895
35 16.930233
36 17.025194
37 16.824997
And I need to be able to plot column 1 vs column 2, but I am not able to extract them separately.
I obtained this column by doing groupby('timestamp') and then a mean() on that.
The problem I need to solve is how to extract each column of this series. Or is there a better way to calculate the mean of one column for all matching entries of another column?
ORIGINAL DATA :
1445544152817,SEND_MSG,123
1445544152817,SEND_MSG,123
1445544152829,SEND_MSG,135
1445544152829,SEND_MSG,135
1445544152830,SEND_MSG,135
1445544152830,GET_QUEUE,12
1445544152830,SEND_MSG,136
1445544152830,SEND_MSG,136
1445544152830,SEND_MSG,136
1445544152831,SEND_MSG,138
1445544152831,SEND_MSG,136
1445544152831,SEND_MSG,137
1445544152831,SEND_MSG,137
1445544152831,SEND_MSG,137
1445544152832,SEND_MSG,138
1445544152832,SEND_MSG,138
1445544152833,SEND_MSG,138
1445544152833,SEND_MSG,139
1445544152834,SEND_MSG,140
1445544152834,SEND_MSG,140
1445544152834,SEND_MSG,140
1445544152835,SEND_MSG,140
1445544152835,SEND_MSG,141
1445544152849,SEND_MSG,155
1445544152849,SEND_MSG,155
1445544152850,GET_QUEUE,21
1445544152850,GET_QUEUE,21
For each timestamp I want to find the average of response_time and plot it. I did that successfully, as shown in the series above, but I cannot separate the timestamp and response_time columns anymore.
A Series always has just one column; the first column you see is the index, which you can get with your_series.index (it is an attribute, not a method). If you want the timestamp to become a data column again, and not an index, you can use the as_index keyword in groupby:
df.groupby('timestamp', as_index=False).mean()
Or use your_series.reset_index().
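A minimal sketch of the as_index=False route on a few of the rows above; the plotting call in the comment is illustrative and assumes matplotlib:

```python
import pandas as pd

# A few rows from the question's data: timestamp, type, response_time
df = pd.DataFrame({
    "timestamp": [1445544152817, 1445544152817, 1445544152829,
                  1445544152829, 1445544152830],
    "type": ["SEND_MSG"] * 5,
    "response_time": [123, 123, 135, 135, 135],
})

# as_index=False keeps timestamp as a regular column after aggregation
means = df.groupby("timestamp", as_index=False)["response_time"].mean()
# means["timestamp"] and means["response_time"] are now separate columns,
# e.g. plt.plot(means["timestamp"], means["response_time"])
```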
If it's a series, you can directly use:
your_series.mean()
You can extract a column by:
df['column_name']
then you can apply mean() to the series:
df['column_name'].mean()
