Pandas Time Series: Remove Rows Per ID - python

I have a Pandas dataframe of the form:
Date ID Temp
2019/03/27 1 23
2019/04/27 2 32
2019/04/27 1 42
2019/04/28 1 41
2019/01/27 2 33
2019/08/27 2 23
What do I need to do?
Select rows that are at least 30 days older than the latest measurement for each ID.
i.e. the latest date for ID = 2 is 2019/08/27, so for ID = 2 I need to select only rows that are at least 30 days older. So, the row with 2019/08/27 for ID = 2 will itself be dropped.
Similarly, the latest date for ID = 1 is 2019/04/28. This means I can select rows for ID = 1 only if the date is less than 2019/03/28 (30 days older). So, the row 2019/04/27 with ID = 1 will be dropped.
How can I do this in Pandas? Any help is greatly appreciated.
Thank you.
Final dataframe will be:
Date ID Temp
2019/03/27 1 23
2019/04/27 2 32
2019/01/27 2 33

In your case, use groupby + transform('last') to get the latest date per ID, then filter the original df:
Yourdf=df[df.Date<df.groupby('ID').Date.transform('last')-pd.Timedelta('30 days')].copy()
Date ID Temp
0 2019-03-27 1 23
1 2019-04-27 2 32
4 2019-01-27 2 33
Notice I am adding .copy() at the end to prevent the SettingWithCopyWarning.
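For reference, here is a minimal, self-contained sketch of the same idea (my own assumption that Date may still be a string column, and using transform('max') instead of transform('last') so it also works when the dates are not sorted within each ID):
import pandas as pd

df = pd.DataFrame({'Date': ['2019/03/27', '2019/04/27', '2019/04/27',
                            '2019/04/28', '2019/01/27', '2019/08/27'],
                   'ID': [1, 2, 1, 1, 2, 2],
                   'Temp': [23, 32, 42, 41, 33, 23]})
# Make sure Date is a real datetime, otherwise the comparison below fails
df['Date'] = pd.to_datetime(df['Date'])
# Latest date per ID, broadcast back to every row of that ID
latest = df.groupby('ID')['Date'].transform('max')
# Keep rows dated more than 30 days before that latest date
Yourdf = df[df['Date'] < latest - pd.Timedelta('30 days')].copy()
print(Yourdf)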

Related

count values of groups by consecutive days

I have data with 3 columns: date, id, sales.
My first task is filtering sales above 100. I did it.
The second task is grouping id by consecutive days.
index date       id sales
0     01/01/2018 03 101
1     01/01/2018 07 178
2     02/01/2018 03 120
3     03/01/2018 03 150
4     05/01/2018 07 205
the result should be:
index id count
0     03 3
1     07 1
2     07 1
I need to do this task without using pandas/DataFrames, but right now I can't imagine from which side to attack this problem.
Just for the effort, I tried the suggested solution here: count consecutive days python dataframe,
but the ids are not grouped.
Here is my code:
data = df[df['sales'] >= 100]
data['date'] = pd.to_datetime(data['date']).dt.date
s = data.groupby('id').date.diff().dt.days.ne(1).cumsum()
new_frame = data.groupby(['id', s]).size().reset_index(level=0, drop=True)
It is very important that "new_frame" has a "count" column, because afterwards I need to count ids by ranges of those counted days in the "count" column, e.g. the count of ids in the range of 0-7 days, 7-12 days, etc., but that is not part of my question.
Thank you a lot
Your code is close, but needs some fine-tuning, as follows:
data = df[df['sales'] >= 100]
data['date'] = pd.to_datetime(data['date'], dayfirst=True)
df2 = data.sort_values(['id', 'date'])
s = df2.groupby('id').date.diff().dt.days.ne(1).cumsum()
new_frame = df2.groupby(['id', s]).size().reset_index(level=1, drop=True).reset_index(name='count')
Result:
print(new_frame)
id count
0 3 3
1 7 1
2 7 1
Summary of changes:
As your dates are in dd/mm/yyyy instead of the default mm/dd/yyyy, you have to specify the parameter dayfirst=True in pd.to_datetime(). Otherwise, 02/01/2018 will be regarded as 2018-02-01 instead of 2018-01-02 as expected and the day diff with adjacent entries will be around 30 as opposed to 1.
We added a sort step to sort by columns id and date to simplify the later grouping during the creation of the series s.
In the last groupby(), the code reset_index(level=0, drop=True) should be dropping level=1 instead, since level=0 is the id field, which we want to keep.
In the last groupby(), we do an extra .reset_index(name='count') to make the Pandas series change back to a dataframe and also name the new column as count.
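Since the question mentions needing to do this without pandas/DataFrames, here is a minimal standard-library sketch of the same idea (my own assumption about the input shape: a list of (date, id, sales) tuples with dd/mm/yyyy date strings):
from datetime import datetime, timedelta

rows = [('01/01/2018', '03', 101), ('01/01/2018', '07', 178),
        ('02/01/2018', '03', 120), ('03/01/2018', '03', 150),
        ('05/01/2018', '07', 205)]

# Keep sales >= 100 and parse the dd/mm/yyyy dates
data = [(datetime.strptime(d, '%d/%m/%Y'), i) for d, i, s in rows if s >= 100]
data.sort(key=lambda t: (t[1], t[0]))   # sort by id, then by date

counts = []   # one (id, count) pair per run of consecutive days
for date, id_ in data:
    if counts and counts[-1][0] == id_ and (date - prev_date) == timedelta(days=1):
        counts[-1] = (id_, counts[-1][1] + 1)   # extend the current run
    else:
        counts.append((id_, 1))                 # start a new run
    prev_date = date
print(counts)   # [('03', 3), ('07', 1), ('07', 1)]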

How to make a sum of rows which are filtered with specific column values in Pandas?

I have a dataframe like below.
I'm trying to sum the Week 7 and Week 8
SalesQuantity values for each of the ProductCodes [1-317], update their Week 7 rows' SalesQuantity with the new value, and then delete their Week 8 rows from the DataFrame.
The Week column range is [7-26] and all of the weeks include ProductCodes [1-317],
because the original data is from before the group by [Week, ProductCode].
Week ProductCode SalesQuantity
7 1 49.285714
7 2 36.714286
7 3 33.285714
7 4 36.857143
7 5 42.714286
... ... ...
8 3 61.000000
26 314 4.285714
26 315 3.571429
26 316 6.142857
26 317 3.285714
Example result: from the above table, adding the Week 7 + 8 SalesQuantities for ProductCode 3: 61.000000 + 33.285714 = 94.285714 is the new updated SalesQuantity value for Week 7 for ProductCode 3.
After that, I need to delete the Week 8 row for ProductCode 3.
How can I automate this for all of the ProductCodes [1-317]?
Thanks
Use the groupby() method:
sumSales = data[['ProductCode', 'SalesQuantity']].groupby('ProductCode').sum()
This creates a new DataFrame with the sum of SalesQuantity, indexed by the ProductCode. The data[['ProductCode', 'SalesQuantity']] part creates a sub-selection of the original data frame, otherwise the weeks also get summed.
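That said, here is a minimal sketch of the specific Week 7 + Week 8 operation described in the question (one possible approach; it assumes one row per (Week, ProductCode) with columns named Week, ProductCode and SalesQuantity):
import pandas as pd

# Two Week-7 rows and one Week-8 row taken from the question's table
df = pd.DataFrame({'Week': [7, 7, 8],
                   'ProductCode': [3, 4, 3],
                   'SalesQuantity': [33.285714, 36.857143, 61.0]})

week7 = df['Week'] == 7
week8 = df['Week'] == 8

# Week 8 SalesQuantity per ProductCode
add = df.loc[week8].set_index('ProductCode')['SalesQuantity']

# Fold the Week 8 quantity into the matching Week 7 row (0 if a code has no Week 8 row)
df.loc[week7, 'SalesQuantity'] = (df.loc[week7, 'SalesQuantity']
                                  + df.loc[week7, 'ProductCode'].map(add).fillna(0))

# Drop the Week 8 rows now that they are folded into Week 7
df = df[~week8].reset_index(drop=True)
print(df)   # ProductCode 3 in Week 7 is now 94.285714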

How to remove duplicate entries but keep the first row's value for some selected columns and the last row's value for others?

I'm creating charts in Periscope Data and using pandas to derive our results. I'm facing difficulties when removing duplicates from the results.
This is what our data looks like in the final dataframe after calculating.
vendor_ID date opening purchase paid closing
B2345 01/01/2015 5 20 10 15
B2345 01/01/2015 15 50 20 45
B2345 02/01/2015 45 4 30 19
I want to remove the duplicate entries based on vendor_ID and date, but keep the first entry's opening and the last entry's closing,
i.e., the expected result I want:
vendor_ID date opening purchase paid closing
B2345 01/01/2015 5 70 30 45
B2345 02/01/2015 45 4 30 19
I've tried the code below to remove the duplicates, but it did not give me the result I want.
df.drop_duplicates(subset=["vendor_ID", "date"], keep="last", inplace=True)
How do I code it in such a way as to remove the duplicates and keep the first and last values as mentioned in the above example?
Use GroupBy.agg with GroupBy.first, GroupBy.last and GroupBy.sum specified per output column:
Note (thanks @Erfan): if you need the minimum and maximum of a column instead of the first and last, change the dict to {'opening':'min','purchase':'sum','paid':'sum', 'closing':'max'}.
df1 = (df.groupby(["vendor_ID", "date"], as_index=False)
.agg({'opening':'first','purchase':'sum','paid':'sum', 'closing':'last'}))
print (df1)
vendor_ID date opening purchase paid closing
0 B2345 01/01/2015 5 70 30 45
1 B2345 02/01/2015 45 4 30 19
Also if not sure if datetimes are sorted:
df['date'] = pd.to_datetime(df['date'], dayfirst=True)
df = df.sort_values(["vendor_ID", "date"])
df1 = (df.groupby(["vendor_ID", "date"], as_index=False)
.agg({'opening':'first','purchase':'sum','paid':'sum', 'closing':'last'}))
print (df1)
vendor_ID date opening purchase paid closing
0 B2345 2015-01-01 5 70 30 45
1 B2345 2015-01-02 45 4 30 19
You can also create the dictionary dynamically, to sum all columns except the grouping keys and the two used for first and last:
df['date'] = pd.to_datetime(df['date'], dayfirst=True)
df = df.sort_values(["vendor_ID", "date"])
d = {'opening':'first', 'closing':'last'}
sum_cols = df.columns.difference(list(d.keys()) + ['vendor_ID','date'])
final_d = {**dict.fromkeys(sum_cols,'sum'), **d}
df1 = df.groupby(["vendor_ID", "date"], as_index=False).agg(final_d).reindex(df.columns,axis=1)
print (df1)
vendor_ID date opening purchase paid closing
0 B2345 2015-01-01 5 70 30 45
1 B2345 2015-01-02 45 4 30 19
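For reference, the sample frame from the question can be built like this to try the snippets above (a minimal reconstruction of the data shown in the question):
import pandas as pd

df = pd.DataFrame({'vendor_ID': ['B2345', 'B2345', 'B2345'],
                   'date': ['01/01/2015', '01/01/2015', '02/01/2015'],
                   'opening': [5, 15, 45],
                   'purchase': [20, 50, 4],
                   'paid': [10, 20, 30],
                   'closing': [15, 45, 19]})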

how to speed up dataframe analysis

I'm looping through a DataFrame of 200k rows. It's doing what I want but it takes hours. I'm not very sophisticated when it comes to all the ways you can join and manipulate DataFrames so I wonder if I'm doing this in a very inefficient way. It's quite simple, here's the code:
three_yr_gaps = []
for index, row in df.iterrows():
    three_yr_gaps.append(df[(df['GROUP_ID'] == row['GROUP_ID']) &
                            (df['BEG_DATE'] >= row['THREE_YEAR_AGO']) &
                            (df['END_DATE'] <= row['BEG_DATE'])]['GAP'].sum() + row['GAP'])
df['GAP_THREE'] = three_yr_gaps
The DF has a column called GAP that holds an integer value. The logic I'm employing to sum this number up is:
for each row, select these rows from the dataframe:
those that match on the group id, and...
those that have a beginning date within the last 3 years of this row's start date, and...
those that have an ending date before this row's beginning date.
Then sum up those rows' GAP numbers, add this row's GAP number, and append the total to a list.
So is there a faster way to introduce this logic into some kind of automatic merge or join that could speed up this process?
PS.
I was asked for some clarification on input and output, so here's a constructed dataset to play with:
from dateutil import parser
import pandas as pd

df = pd.DataFrame(columns=['ID_NBR', 'GROUP_ID', 'BEG_DATE', 'END_DATE', 'THREE_YEAR_AGO', 'GAP'],
                  data=[['09', '185', parser.parse('2008-08-13'), parser.parse('2009-07-01'), parser.parse('2005-08-13'), 44],
                        ['10', '185', parser.parse('2009-08-04'), parser.parse('2010-01-18'), parser.parse('2006-08-04'), 35],
                        ['11', '185', parser.parse('2010-01-18'), parser.parse('2011-01-18'), parser.parse('2007-01-18'), 0],
                        ['12', '185', parser.parse('2014-09-04'), parser.parse('2015-09-04'), parser.parse('2011-09-04'), 0]])
and here's what I wrote at the top of the script, may help:
The purpose of this script is to extract gaps counts over the
last 3 year period. It uses gaps.sql as its source extract. this query
returns a DataFrame that looks like this:
ID_NBR GROUP_ID BEG_DATE END_DATE THREE_YEAR_AGO GAP
09 185 2008-08-13 2009-07-01 2005-08-13 44
10 185 2009-08-04 2010-01-18 2006-08-04 35
11 185 2010-01-18 2011-01-18 2007-01-18 0
12 185 2014-09-04 2015-09-04 2011-09-04 0
The Python code then looks back at the previous 3 years (those
previous rows that have the same GROUP_ID but whose effective dates
come after this row's THREE_YEAR_AGO and whose end dates come before
this row's beginning date). Those rows' GAPs are added up and a new column
called GAP_THREE is made. What remains is this:
ID_NBR GROUP_ID BEG_DATE END_DATE THREE_YEAR_AGO GAP GAP_THREE
09 185 2008-08-13 2009-07-01 2005-08-13 44 44
10 185 2009-08-04 2010-01-18 2006-08-04 35 79
11 185 2010-01-18 2011-01-18 2007-01-18 0 79
12 185 2014-09-04 2015-09-04 2011-09-04 0 0
You'll notice that row ID_NBR 11 has a value of 79 for the last 3 years, but ID_NBR 12 has 0, because the last gap was 35 in 2009, which is more than 3 years before 12's beginning date of 2014.
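One possible way to vectorize this logic is a self-merge (a sketch of my own, assuming ID_NBR uniquely identifies a row; note that a self-merge can use a lot of memory when GROUP_ID groups are large):
import pandas as pd

# Pair every row with every other row in the same GROUP_ID
merged = df.merge(df, on='GROUP_ID', suffixes=('', '_other'))

# Keep pairs where the other row falls inside this row's 3-year window
in_window = merged[(merged['BEG_DATE_other'] >= merged['THREE_YEAR_AGO']) &
                   (merged['END_DATE_other'] <= merged['BEG_DATE'])]

# Sum the matching GAPs per row, then add the row's own GAP
sums = in_window.groupby('ID_NBR')['GAP_other'].sum()
df['GAP_THREE'] = df['ID_NBR'].map(sums).fillna(0) + df['GAP']
print(df[['ID_NBR', 'GAP', 'GAP_THREE']])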

How to obtain 1 column from a series object pandas?

I originally have 3 columns: timestamp, response_time and type. What I need to do is find the mean of response_time where the timestamps are the same, so I grouped the timestamps together and applied the mean function on them. I got the following series, which is fine:
0 16.949689
1 17.274615
2 16.858884
3 17.025155
4 17.062008
5 16.846885
6 17.172994
7 17.025797
8 17.001974
9 16.924636
10 16.813300
11 17.152066
12 17.291899
13 16.946970
14 16.972884
15 16.871824
16 16.840024
17 17.227682
18 17.288211
19 17.370553
20 17.395759
21 17.449579
22 17.340357
23 17.137308
24 16.981012
25 16.946727
26 16.947073
27 16.830850
28 17.366538
29 17.054468
30 16.823983
31 17.115429
32 16.859003
33 16.919645
34 17.351895
35 16.930233
36 17.025194
37 16.824997
And I need to be able to plot column 1 vs column 2, but I am not able to extract them separately.
I obtained this column by doing groupby('timestamp') and then a mean() on that.
The problem I need to solve is how to extract each column of this series. Or is there a better way to calculate the mean of one column for all the same entries of another column?
ORIGINAL DATA :
1445544152817,SEND_MSG,123
1445544152817,SEND_MSG,123
1445544152829,SEND_MSG,135
1445544152829,SEND_MSG,135
1445544152830,SEND_MSG,135
1445544152830,GET_QUEUE,12
1445544152830,SEND_MSG,136
1445544152830,SEND_MSG,136
1445544152830,SEND_MSG,136
1445544152831,SEND_MSG,138
1445544152831,SEND_MSG,136
1445544152831,SEND_MSG,137
1445544152831,SEND_MSG,137
1445544152831,SEND_MSG,137
1445544152832,SEND_MSG,138
1445544152832,SEND_MSG,138
1445544152833,SEND_MSG,138
1445544152833,SEND_MSG,139
1445544152834,SEND_MSG,140
1445544152834,SEND_MSG,140
1445544152834,SEND_MSG,140
1445544152835,SEND_MSG,140
1445544152835,SEND_MSG,141
1445544152849,SEND_MSG,155
1445544152849,SEND_MSG,155
1445544152850,GET_QUEUE,21
1445544152850,GET_QUEUE,21
For each timestamp I want to find the average of response_time and plot it. I did that successfully, as shown in the series above (the first data), but I cannot separate the timestamp and response_time columns anymore.
A Series always has just one column. The first column you see is the index. You can get it with your_series.index. If you want the timestamp to become a data column again, and not an index, you can use the as_index keyword in groupby:
df.groupby('timestamp', as_index = False).mean()
Or use your_series.reset_index().
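For example, here is a minimal sketch of reading the raw data shown above and plotting the mean response_time per timestamp (my assumptions: the columns are timestamp, type, response_time, the data sits in a headerless file I'll call data.csv, and matplotlib is available):
import pandas as pd
import matplotlib.pyplot as plt

# Assumed column layout of the raw data: timestamp, type, response_time
df = pd.read_csv('data.csv', header=None,
                 names=['timestamp', 'type', 'response_time'])

# Mean response time per timestamp, keeping timestamp as a regular column
means = df.groupby('timestamp', as_index=False)['response_time'].mean()

plt.plot(means['timestamp'], means['response_time'])
plt.xlabel('timestamp')
plt.ylabel('mean response_time')
plt.show()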
If it's a Series, you can directly use:
your_series.mean()
You can extract a column with:
df['column_name']
and then apply mean() to that Series:
df['column_name'].mean()
