I have been spending hours trying to write a function to detect a trend in a time series by taking the past 4 months of data prior to today. I organized my monthly data with dt.month, but the issue is that I cannot get the previous year's 12th month if today is January. Here is a toy dataset:
data1 = pd.DataFrame({'Id': ['001', '001', '001', '001', '001', '001', '001', '001', '001',
                             '002', '002', '002', '002', '002', '002', '002', '002', '002'],
                      'Date': ['2020-01-12', '2019-12-30', '2019-12-01', '2019-11-01', '2019-08-04', '2019-08-04', '2019-08-01', '2019-07-20', '2019-06-04',
                               '2020-01-11', '2019-12-12', '2019-12-01', '2019-12-01', '2019-09-10', '2019-08-10', '2019-08-01', '2019-06-20', '2019-06-01'],
                      'Quantity': [3, 5, 6, 72, 1, 5, 6, 3, 9, 3, 6, 7, 3, 2, 5, 74, 3, 4]})
My data cleaning to get the format I need is this:
data1['Date'] = pd.to_datetime(data1['Date'], format='%Y-%m-%d')
data2 = data1.groupby('Id').apply(lambda x: x.set_index('Date').resample('M').sum())['Quantity'].reset_index()
data2['M'] = pd.to_datetime(data2['Date']).dt.month
data2['Y'] = pd.to_datetime(data2['Date']).dt.year
data = pd.DataFrame(data2.groupby(['Id', 'Date', 'M', 'Y'])['Quantity'].sum())
And my function looks like this:
import time

def check_trend():
    today_month = int(time.strftime("%-m"))
    data['n3-n4'] = data['Quantity'].loc[data['M'] == (today_month - 3)] - data['Quantity'].loc[data['M'] == (today_month - 4)]
    data['n2-n3'] = data['Quantity'].loc[data['M'] == (today_month - 2)] - data['Quantity'].loc[data['M'] == (today_month - 3)]
    data['n2-n1'] = data['Quantity'].loc[data['M'] == (today_month - 2)] - data['Quantity'].loc[data['M'] == (today_month - 1)]
    if data['n3-n4'] < 0 and data['n2-n3'] < 0 and data['n2-n1'] < 0:
        data['Trend'] = 'Yes'
    elif data['n3-n4'] > 0 and data['n2-n3'] > 0 and data['n2-n1'] > 0:
        data['Trend'] = 'Yes'
    else:
        data['Trend'] = 'No'

print(check_trend())
I have looked at this: Get (year,month) for the last X months, but it does not seem to work for a specific groupby object.
I would really appreciate a hint! At the very least, I would love to know whether this method of identifying a trend in a dataset is a good one. After that I plan on using exponential smoothing if there is no trend and Holt's method if there is a trend.
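For reference, this is roughly the forecasting step I have in mind afterwards (just a sketch with statsmodels, assuming a per-Id monthly Quantity series and a has_trend flag coming from whatever trend check I end up with):

from statsmodels.tsa.holtwinters import Holt, SimpleExpSmoothing

def forecast_quantity(series, has_trend, periods=3):
    # Holt's linear method when a trend was detected,
    # simple exponential smoothing otherwise
    model = Holt(series) if has_trend else SimpleExpSmoothing(series)
    return model.fit().forecast(periods)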
UPDATE: thanks to @Vorsprung durch Technik, I have the function working well, but I still struggle to incorporate the result into a new dataframe containing the Ids from data2:
forecast = pd.DataFrame()
forecast['Id'] = data1['Id'].unique()
for k, g in data2.groupby(level='Id'):
    forecast['trendup'] = g.tail(5)['Quantity'].is_monotonic_increasing
    forecast['trendown'] = g.tail(5)['Quantity'].is_monotonic_decreasing
This returns the same value for every row of the dataframe, as if the calculation were done for only one Id. How can I make sure it gets calculated for EACH Id value?
I don't think you need check_trend().
There are built-in functions for this:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Index.is_monotonic_increasing.html
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Index.is_monotonic_decreasing.html
Let me know if this does what you need:
data2 = data1.groupby('Id').apply(lambda x: x.set_index('Date').resample('M').sum())
for k, g in data2.groupby(level='Id'):
    print(g.tail(4)['Quantity'].is_monotonic_increasing)
    print(g.tail(4)['Quantity'].is_monotonic_decreasing)
This is what is returned by g.tail(4):
                Quantity
Id  Date
001 2019-10-31         0
    2019-11-30        72
    2019-12-31        11
    2020-01-31         3

                Quantity
Id  Date
002 2019-10-31         0
    2019-11-30         0
    2019-12-31        16
    2020-01-31         3
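For the UPDATE (one flag per Id): assigning a scalar to forecast['trendup'] inside the loop broadcasts one group's value to the whole column on every iteration, so only one group's result survives. A sketch of one way to collect a value per Id instead (assuming data2 is the resampled frame from above, indexed by Id and Date):

# build one row of flags per Id, then attach them to the forecast frame
rows = []
for k, g in data2.groupby(level='Id'):
    rows.append({'Id': k,
                 'trendup': g.tail(5)['Quantity'].is_monotonic_increasing,
                 'trendown': g.tail(5)['Quantity'].is_monotonic_decreasing})

forecast = pd.DataFrame({'Id': data1['Id'].unique()})
forecast = forecast.merge(pd.DataFrame(rows), on='Id', how='left')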
"DF","00000000#11111.COM","FLTINT1000130394756","26JUL2010","B2C","6799.2"
"Rail","00000.POO#GMAIL.COM","NR251764697478","24JUN2011","B2C","2025"
"DF","0000650000#YAHOO.COM","NF2513521438550","01JAN2013","B2C","6792"
"Bus","00009.GAURAV#GMAIL.COM","NU27012932319739","26JAN2013","B2C","800"
"Rail","0000.ANU#GMAIL.COM","NR251764697526","24JUN2011","B2C","595"
"Rail","0000MANNU#GMAIL.COM","NR251277005737","29OCT2011","B2C","957"
"Rail","0000PRANNOY0000#GMAIL.COM","NR251297862893","21NOV2011","B2C","212"
"DF","0000PRANNOY0000#YAHOO.CO.IN","NF251327485543","26JUN2011","B2C","17080"
"Rail","0000RAHUL#GMAIL.COM","NR2512012069809","25OCT2012","B2C","5731"
"DF","0000SS0#GMAIL.COM","NF251355775967","10MAY2011","B2C","2000"
"DF","0001HARISH#GMAIL.COM","NF251352240086","22DEC2010","B2C","4006"
"DF","0001HARISH#GMAIL.COM","NF251742087846","12DEC2010","B2C","1000"
"DF","0001HARISH#GMAIL.COM","NF252022031180","09DEC2010","B2C","3439"
"Rail","000AYUSH#GMAIL.COM","NR2151120122283","25JAN2013","B2C","136"
"Rail","000AYUSH#GMAIL.COM","NR2151213260036","28NOV2012","B2C","41"
"Rail","000AYUSH#GMAIL.COM","NR2151313264432","29NOV2012","B2C","96"
"Rail","000AYUSH#GMAIL.COM","NR2151413266728","29NOV2012","B2C","96"
"Rail","000AYUSH#GMAIL.COM","NR2512912359037","08DEC2012","B2C","96"
"Rail","000AYUSH#GMAIL.COM","NR2517612385569","12DEC2012","B2C","96"
Above is the sample data.
The data is sorted according to email addresses and the file is very large, around 1.5 GB.
I want the output in another CSV file, something like this:
"DF","00000000#11111.COM","FLTINT1000130394756","26JUL2010","B2C","6799.2",1,0 days
"Rail","00000.POO#GMAIL.COM","NR251764697478","24JUN2011","B2C","2025",1,0 days
"DF","0000650000#YAHOO.COM","NF2513521438550","01JAN2013","B2C","6792",1,0 days
"Bus","00009.GAURAV#GMAIL.COM","NU27012932319739","26JAN2013","B2C","800",1,0 days
"Rail","0000.ANU#GMAIL.COM","NR251764697526","24JUN2011","B2C","595",1,0 days
"Rail","0000MANNU#GMAIL.COM","NR251277005737","29OCT2011","B2C","957",1,0 days
"Rail","0000PRANNOY0000#GMAIL.COM","NR251297862893","21NOV2011","B2C","212",1,0 days
"DF","0000PRANNOY0000#YAHOO.CO.IN","NF251327485543","26JUN2011","B2C","17080",1,0 days
"Rail","0000RAHUL#GMAIL.COM","NR2512012069809","25OCT2012","B2C","5731",1,0 days
"DF","0000SS0#GMAIL.COM","NF251355775967","10MAY2011","B2C","2000",1,0 days
"DF","0001HARISH#GMAIL.COM","NF251352240086","09DEC2010","B2C","4006",1,0 days
"DF","0001HARISH#GMAIL.COM","NF251742087846","12DEC2010","B2C","1000",2,3 days
"DF","0001HARISH#GMAIL.COM","NF252022031180","22DEC2010","B2C","3439",3,10 days
"Rail","000AYUSH#GMAIL.COM","NR2151213260036","28NOV2012","B2C","41",1,0 days
"Rail","000AYUSH#GMAIL.COM","NR2151313264432","29NOV2012","B2C","96",2,1 days
"Rail","000AYUSH#GMAIL.COM","NR2151413266728","29NOV2012","B2C","96",3,0 days
"Rail","000AYUSH#GMAIL.COM","NR2512912359037","08DEC2012","B2C","96",4,9 days
"Rail","000AYUSH#GMAIL.COM","NR2512912359037","08DEC2012","B2C","96",5,0 days
"Rail","000AYUSH#GMAIL.COM","NR2517612385569","12DEC2012","B2C","96",6,4 days
"Rail","000AYUSH#GMAIL.COM","NR2517612385569","12DEC2012","B2C","96",7,0 days
"Rail","000AYUSH#GMAIL.COM","NR2151120122283","25JAN2013","B2C","136",8,44 days
"Rail","000AYUSH#GMAIL.COM","NR2151120122283","25JAN2013","B2C","136",9,0 days
That is, if an entry occurs for the first time I need to append 1, if it occurs a second time I need to append 2, and so on. In other words, I need to count the number of occurrences of each email address in the file, and if an email occurs twice or more I also want the difference between the dates. Remember that the dates are not sorted, so they have to be sorted per email address as well. I am looking for a solution in Python using the numpy or pandas library, or any other library that can handle this kind of huge data without throwing an out-of-memory error. I have a dual-core processor running CentOS 6.3 with 4 GB of RAM.
Make sure you have pandas 0.11 and read these docs: http://pandas.pydata.org/pandas-docs/dev/io.html#hdf5-pytables, and these recipes: http://pandas.pydata.org/pandas-docs/dev/cookbook.html#hdfstore (especially 'merging on millions of rows').
Here is a solution that seems to work. The workflow is:
- read the data from your csv in chunks and append it to an HDFStore
- iterate over that store, which creates another store that does the combining
Essentially we take a chunk from the table and combine it with a chunk from every other part of the file. The combiner function does not reduce; instead it calculates your function (the diff in days) between all elements in that chunk, eliminating duplicates as you go and keeping the latest data after each loop. Almost like a recursive reduce.
This should be O(num_of_chunks**2) in memory and calculation time.
The chunksize could be, say, 1M rows (or more) in your case.
processing [0] [datastore.h5]
processing [1] [datastore_0.h5]
    count                 date  diff                        email
4       1  2011-06-24 00:00:00     0           0000.ANU#GMAIL.COM
1       1  2011-06-24 00:00:00     0          00000.POO#GMAIL.COM
0       1  2010-07-26 00:00:00     0           00000000#11111.COM
2       1  2013-01-01 00:00:00     0         0000650000#YAHOO.COM
3       1  2013-01-26 00:00:00     0       00009.GAURAV#GMAIL.COM
5       1  2011-10-29 00:00:00     0          0000MANNU#GMAIL.COM
6       1  2011-11-21 00:00:00     0    0000PRANNOY0000#GMAIL.COM
7       1  2011-06-26 00:00:00     0  0000PRANNOY0000#YAHOO.CO.IN
8       1  2012-10-25 00:00:00     0          0000RAHUL#GMAIL.COM
9       1  2011-05-10 00:00:00     0            0000SS0#GMAIL.COM
12      1  2010-12-09 00:00:00     0         0001HARISH#GMAIL.COM
11      2  2010-12-12 00:00:00     3         0001HARISH#GMAIL.COM
10      3  2010-12-22 00:00:00    13         0001HARISH#GMAIL.COM
14      1  2012-11-28 00:00:00     0           000AYUSH#GMAIL.COM
15      2  2012-11-29 00:00:00     1           000AYUSH#GMAIL.COM
17      3  2012-12-08 00:00:00    10           000AYUSH#GMAIL.COM
18      4  2012-12-12 00:00:00    14           000AYUSH#GMAIL.COM
13      5  2013-01-25 00:00:00    58           000AYUSH#GMAIL.COM
import pandas as pd
import StringIO
import numpy as np
from time import strptime
from datetime import datetime
# your data
data = """
"DF","00000000#11111.COM","FLTINT1000130394756","26JUL2010","B2C","6799.2"
"Rail","00000.POO#GMAIL.COM","NR251764697478","24JUN2011","B2C","2025"
"DF","0000650000#YAHOO.COM","NF2513521438550","01JAN2013","B2C","6792"
"Bus","00009.GAURAV#GMAIL.COM","NU27012932319739","26JAN2013","B2C","800"
"Rail","0000.ANU#GMAIL.COM","NR251764697526","24JUN2011","B2C","595"
"Rail","0000MANNU#GMAIL.COM","NR251277005737","29OCT2011","B2C","957"
"Rail","0000PRANNOY0000#GMAIL.COM","NR251297862893","21NOV2011","B2C","212"
"DF","0000PRANNOY0000#YAHOO.CO.IN","NF251327485543","26JUN2011","B2C","17080"
"Rail","0000RAHUL#GMAIL.COM","NR2512012069809","25OCT2012","B2C","5731"
"DF","0000SS0#GMAIL.COM","NF251355775967","10MAY2011","B2C","2000"
"DF","0001HARISH#GMAIL.COM","NF251352240086","22DEC2010","B2C","4006"
"DF","0001HARISH#GMAIL.COM","NF251742087846","12DEC2010","B2C","1000"
"DF","0001HARISH#GMAIL.COM","NF252022031180","09DEC2010","B2C","3439"
"Rail","000AYUSH#GMAIL.COM","NR2151120122283","25JAN2013","B2C","136"
"Rail","000AYUSH#GMAIL.COM","NR2151213260036","28NOV2012","B2C","41"
"Rail","000AYUSH#GMAIL.COM","NR2151313264432","29NOV2012","B2C","96"
"Rail","000AYUSH#GMAIL.COM","NR2151413266728","29NOV2012","B2C","96"
"Rail","000AYUSH#GMAIL.COM","NR2512912359037","08DEC2012","B2C","96"
"Rail","000AYUSH#GMAIL.COM","NR2517612385569","12DEC2012","B2C","96"
"""
# read in and create the store
data_store_file = 'datastore.h5'
store = pd.HDFStore(data_store_file,'w')
def dp(x, **kwargs):
    return [ datetime(*strptime(v,'%d%b%Y')[0:3]) for v in x ]
chunksize = 5

reader = pd.read_csv(StringIO.StringIO(data), names=['x1','email','x2','date','x3','x4'],
                     header=0, usecols=['email','date'], parse_dates=['date'],
                     date_parser=dp, chunksize=chunksize)

for i, chunk in enumerate(reader):
    # create the global index, and keep it in the frame too
    chunk['indexer'] = chunk.index + i*chunksize
    df = chunk.set_index('indexer')

    # need to set a minimum size for the email column
    store.append('data', df, min_itemsize={'email': 100})
store.close()
# define the combiner function
def combiner(x):
    # given a group of emails (the same), return a combination
    # with the new data

    # sort by the date
    y = x.sort('date')

    # calc the diff in days (an integer)
    y['diff'] = (y['date'] - y['date'].iloc[0]).apply(lambda d: float(d.item().days))

    y['count'] = pd.Series(range(1, len(y) + 1), index=y.index, dtype='float64')

    return y
# reduce the store (and create a new one by chunks)
in_store_file = data_store_file
in_store1 = pd.HDFStore(in_store_file)
# iter on the store 1
for chunki, df1 in enumerate(in_store1.select('data', chunksize=2*chunksize)):
    print "processing [%s] [%s]" % (chunki, in_store_file)

    out_store_file = 'datastore_%s.h5' % chunki
    out_store = pd.HDFStore(out_store_file, 'w')

    # iter on store 2
    in_store2 = pd.HDFStore(in_store_file)
    for df2 in in_store2.select('data', chunksize=chunksize):

        # concat & drop dups
        df = pd.concat([df1, df2]).drop_duplicates(['email', 'date'])

        # group and combine
        result = df.groupby('email').apply(combiner)

        # remove the mi (that we created in the groupby)
        result = result.reset_index('email', drop=True)

        # only store those rows which are in df2!
        result = result.reindex(index=df2.index).dropna()

        # store to the out_store
        out_store.append('data', result, min_itemsize={'email': 100})

    in_store2.close()
    out_store.close()
    in_store_file = out_store_file
in_store1.close()
# show the reduced store
print pd.read_hdf(out_store_file,'data').sort(['email','diff'])
Use the built-in sqlite3 database: you can insert the data, sort and group as necessary, and there is no problem using a file that is larger than the available RAM.
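For example, a rough sketch of that idea (the file names, batch size, and output quoting here are assumptions of the sketch, not part of the question):

import csv
import sqlite3
from datetime import datetime

conn = sqlite3.connect('bookings.db')  # on-disk scratch database, so RAM is not the limit
conn.execute('CREATE TABLE IF NOT EXISTS bookings '
             '(kind TEXT, email TEXT, ref TEXT, rawdate TEXT, seg TEXT, amount TEXT, isodate TEXT)')

# stream the big csv into sqlite in batches, adding a sortable ISO date column
with open('input.csv') as f:
    batch = []
    for kind, email, ref, rawdate, seg, amount in csv.reader(f):
        iso = datetime.strptime(rawdate, '%d%b%Y').strftime('%Y-%m-%d')
        batch.append((kind, email, ref, rawdate, seg, amount, iso))
        if len(batch) == 100000:
            conn.executemany('INSERT INTO bookings VALUES (?,?,?,?,?,?,?)', batch)
            batch = []
    if batch:
        conn.executemany('INSERT INTO bookings VALUES (?,?,?,?,?,?,?)', batch)
    conn.commit()

# let sqlite return the rows grouped by email and sorted by date,
# then compute the running count and day differences row by row
prev_email, prev_date, count = None, None, 0
with open('output.csv', 'w') as out:
    writer = csv.writer(out, quoting=csv.QUOTE_ALL)
    for kind, email, ref, rawdate, seg, amount, iso in conn.execute(
            'SELECT * FROM bookings ORDER BY email, isodate'):
        date = datetime.strptime(iso, '%Y-%m-%d')
        if email != prev_email:
            count, diff = 1, 0
        else:
            count, diff = count + 1, (date - prev_date).days
        writer.writerow([kind, email, ref, rawdate, seg, amount, count, '%d days' % diff])
        prev_email, prev_date = email, date

conn.close()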
Another possible (system-admin) way: avoid databases and SQL queries, plus a whole lot of runtime processes and hardware resource requirements.
Update 20/04: added more code and a simplified approach:
1. Convert the timestamps to seconds (from the Epoch) and use UNIX sort on the email and this new field (that is: sort -k2 -k4 -n -t, < converted_input_file > output_file).
2. Initialize 3 variables: EMAIL, PREV_TIME and COUNT.
3. Iterate over each line; whenever a new email is encountered, append "1,0 days" and set PREV_TIME=timestamp, COUNT=1, EMAIL=new_email.
4. Next line: 3 possible scenarios:
   a) same email, different timestamp: calculate the days, increment COUNT by 1, update PREV_TIME, append "COUNT, difference_in_days";
   b) same email, same timestamp: increment COUNT, append "COUNT, 0 days";
   c) new email: start again from step 3.
An alternative to step 1 is to add a new field TIMESTAMP and remove it when printing out the line.
Note: if 1.5 GB is too huge to sort in one go, split it into smaller chunks, using the email as the split point. You can run these chunks in parallel on different machines.
/usr/bin/gawk -F'","' '{
    split("JAN FEB MAR APR MAY JUN JUL AUG SEP OCT NOV DEC", month, " ");
    for (i=1; i<=12; i++) mdigit[month[i]] = i;
    print $0 "," mktime(substr($4,6,4) " " mdigit[substr($4,3,3)] " " substr($4,1,2) " 00 00 00")
}' < input.txt | /usr/bin/sort -k2 -k7 -n -t, > output_file.txt
output_file.txt:
"DF","00000000#11111.COM","FLTINT1000130394756","26JUL2010","B2C","6799.2",1280102400
"DF","0001HARISH#GMAIL.COM","NF252022031180","09DEC2010","B2C","3439",1291852800
"DF","0001HARISH#GMAIL.COM","NF251742087846","12DEC2010","B2C","1000",1292112000
"DF","0001HARISH#GMAIL.COM","NF251352240086","22DEC2010","B2C","4006",1292976000
...
You then pipe the output to a Perl, Python or AWK script to process steps 2 through 4.
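For completeness, here is a rough Python sketch of steps 2 through 4, reading the sorted output_file.txt produced above (the column positions and the output quoting are assumptions of this sketch):

import csv

SECONDS_PER_DAY = 86400

prev_email, prev_ts, count = None, None, 0

with open('output_file.txt') as f, open('final_output.csv', 'w') as out:
    writer = csv.writer(out, quoting=csv.QUOTE_ALL)
    for row in csv.reader(f):
        email, ts = row[1], int(row[-1])   # epoch seconds appended by the gawk step
        if email != prev_email:            # step 3: new email
            count, diff_days = 1, 0
        elif ts != prev_ts:                # step 4a: same email, later timestamp
            count, diff_days = count + 1, (ts - prev_ts) // SECONDS_PER_DAY
        else:                              # step 4b: same email, same timestamp
            count, diff_days = count + 1, 0
        # drop the helper epoch field and append the count and day difference
        writer.writerow(row[:-1] + [count, '%d days' % diff_days])
        prev_email, prev_ts = email, ts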