Need to compare very large files (around 1.5 GB) in Python

"DF","00000000#11111.COM","FLTINT1000130394756","26JUL2010","B2C","6799.2"
"Rail","00000.POO#GMAIL.COM","NR251764697478","24JUN2011","B2C","2025"
"DF","0000650000#YAHOO.COM","NF2513521438550","01JAN2013","B2C","6792"
"Bus","00009.GAURAV#GMAIL.COM","NU27012932319739","26JAN2013","B2C","800"
"Rail","0000.ANU#GMAIL.COM","NR251764697526","24JUN2011","B2C","595"
"Rail","0000MANNU#GMAIL.COM","NR251277005737","29OCT2011","B2C","957"
"Rail","0000PRANNOY0000#GMAIL.COM","NR251297862893","21NOV2011","B2C","212"
"DF","0000PRANNOY0000#YAHOO.CO.IN","NF251327485543","26JUN2011","B2C","17080"
"Rail","0000RAHUL#GMAIL.COM","NR2512012069809","25OCT2012","B2C","5731"
"DF","0000SS0#GMAIL.COM","NF251355775967","10MAY2011","B2C","2000"
"DF","0001HARISH#GMAIL.COM","NF251352240086","22DEC2010","B2C","4006"
"DF","0001HARISH#GMAIL.COM","NF251742087846","12DEC2010","B2C","1000"
"DF","0001HARISH#GMAIL.COM","NF252022031180","09DEC2010","B2C","3439"
"Rail","000AYUSH#GMAIL.COM","NR2151120122283","25JAN2013","B2C","136"
"Rail","000AYUSH#GMAIL.COM","NR2151213260036","28NOV2012","B2C","41"
"Rail","000AYUSH#GMAIL.COM","NR2151313264432","29NOV2012","B2C","96"
"Rail","000AYUSH#GMAIL.COM","NR2151413266728","29NOV2012","B2C","96"
"Rail","000AYUSH#GMAIL.COM","NR2512912359037","08DEC2012","B2C","96"
"Rail","000AYUSH#GMAIL.COM","NR2517612385569","12DEC2012","B2C","96"
Above is the sample data.
The data is sorted by email address, and the file is very large, around 1.5 GB.
I want the output in another CSV file, something like this:
"DF","00000000#11111.COM","FLTINT1000130394756","26JUL2010","B2C","6799.2",1,0 days
"Rail","00000.POO#GMAIL.COM","NR251764697478","24JUN2011","B2C","2025",1,0 days
"DF","0000650000#YAHOO.COM","NF2513521438550","01JAN2013","B2C","6792",1,0 days
"Bus","00009.GAURAV#GMAIL.COM","NU27012932319739","26JAN2013","B2C","800",1,0 days
"Rail","0000.ANU#GMAIL.COM","NR251764697526","24JUN2011","B2C","595",1,0 days
"Rail","0000MANNU#GMAIL.COM","NR251277005737","29OCT2011","B2C","957",1,0 days
"Rail","0000PRANNOY0000#GMAIL.COM","NR251297862893","21NOV2011","B2C","212",1,0 days
"DF","0000PRANNOY0000#YAHOO.CO.IN","NF251327485543","26JUN2011","B2C","17080",1,0 days
"Rail","0000RAHUL#GMAIL.COM","NR2512012069809","25OCT2012","B2C","5731",1,0 days
"DF","0000SS0#GMAIL.COM","NF251355775967","10MAY2011","B2C","2000",1,0 days
"DF","0001HARISH#GMAIL.COM","NF251352240086","09DEC2010","B2C","4006",1,0 days
"DF","0001HARISH#GMAIL.COM","NF251742087846","12DEC2010","B2C","1000",2,3 days
"DF","0001HARISH#GMAIL.COM","NF252022031180","22DEC2010","B2C","3439",3,10 days
"Rail","000AYUSH#GMAIL.COM","NR2151213260036","28NOV2012","B2C","41",1,0 days
"Rail","000AYUSH#GMAIL.COM","NR2151313264432","29NOV2012","B2C","96",2,1 days
"Rail","000AYUSH#GMAIL.COM","NR2151413266728","29NOV2012","B2C","96",3,0 days
"Rail","000AYUSH#GMAIL.COM","NR2512912359037","08DEC2012","B2C","96",4,9 days
"Rail","000AYUSH#GMAIL.COM","NR2512912359037","08DEC2012","B2C","96",5,0 days
"Rail","000AYUSH#GMAIL.COM","NR2517612385569","12DEC2012","B2C","96",6,4 days
"Rail","000AYUSH#GMAIL.COM","NR2517612385569","12DEC2012","B2C","96",7,0 days
"Rail","000AYUSH#GMAIL.COM","NR2151120122283","25JAN2013","B2C","136",8,44 days
"Rail","000AYUSH#GMAIL.COM","NR2151120122283","25JAN2013","B2C","136",9,0 days
That is, if an entry occurs for the first time I need to append 1, if it occurs a second time I need to append 2, and so on; in other words, I need to count the number of occurrences of each email address in the file. If an email occurs twice or more, I also want the difference between the dates. Note that the dates are not sorted, so they have to be sorted for each email address as well. I am looking for a solution in Python using NumPy, pandas, or any other library that can handle this amount of data without running out of memory. I have a dual-core processor running CentOS 6.3 with 4 GB of RAM.

Make sure you have pandas 0.11 and read these docs: http://pandas.pydata.org/pandas-docs/dev/io.html#hdf5-pytables, and these recipes: http://pandas.pydata.org/pandas-docs/dev/cookbook.html#hdfstore (especially the 'merging on millions of rows' one).
Here is a solution that seems to work. The workflow:
1. Read data from your csv in chunks and append it to an hdfstore.
2. Iterate over that store, creating another store that holds the combined result.
Essentially we take a chunk from the table and combine it with a chunk from every other part of the file. The combiner function does not reduce; instead it calculates your function (the diff in days) between all elements in that chunk, eliminating duplicates as you go and keeping the latest data after each loop. Kind of like a recursive reduce.
This should be roughly O(number_of_chunks**2) in memory and computation time.
The chunksize could be, say, 1 million rows (or more) in your case.
processing [0] [datastore.h5]
processing [1] [datastore_0.h5]
count date diff email
4 1 2011-06-24 00:00:00 0 0000.ANU#GMAIL.COM
1 1 2011-06-24 00:00:00 0 00000.POO#GMAIL.COM
0 1 2010-07-26 00:00:00 0 00000000#11111.COM
2 1 2013-01-01 00:00:00 0 0000650000#YAHOO.COM
3 1 2013-01-26 00:00:00 0 00009.GAURAV#GMAIL.COM
5 1 2011-10-29 00:00:00 0 0000MANNU#GMAIL.COM
6 1 2011-11-21 00:00:00 0 0000PRANNOY0000#GMAIL.COM
7 1 2011-06-26 00:00:00 0 0000PRANNOY0000#YAHOO.CO.IN
8 1 2012-10-25 00:00:00 0 0000RAHUL#GMAIL.COM
9 1 2011-05-10 00:00:00 0 0000SS0#GMAIL.COM
12 1 2010-12-09 00:00:00 0 0001HARISH#GMAIL.COM
11 2 2010-12-12 00:00:00 3 0001HARISH#GMAIL.COM
10 3 2010-12-22 00:00:00 13 0001HARISH#GMAIL.COM
14 1 2012-11-28 00:00:00 0 000AYUSH#GMAIL.COM
15 2 2012-11-29 00:00:00 1 000AYUSH#GMAIL.COM
17 3 2012-12-08 00:00:00 10 000AYUSH#GMAIL.COM
18 4 2012-12-12 00:00:00 14 000AYUSH#GMAIL.COM
13 5 2013-01-25 00:00:00 58 000AYUSH#GMAIL.COM
import pandas as pd
import StringIO
import numpy as np
from time import strptime
from datetime import datetime
# your data
data = """
"DF","00000000#11111.COM","FLTINT1000130394756","26JUL2010","B2C","6799.2"
"Rail","00000.POO#GMAIL.COM","NR251764697478","24JUN2011","B2C","2025"
"DF","0000650000#YAHOO.COM","NF2513521438550","01JAN2013","B2C","6792"
"Bus","00009.GAURAV#GMAIL.COM","NU27012932319739","26JAN2013","B2C","800"
"Rail","0000.ANU#GMAIL.COM","NR251764697526","24JUN2011","B2C","595"
"Rail","0000MANNU#GMAIL.COM","NR251277005737","29OCT2011","B2C","957"
"Rail","0000PRANNOY0000#GMAIL.COM","NR251297862893","21NOV2011","B2C","212"
"DF","0000PRANNOY0000#YAHOO.CO.IN","NF251327485543","26JUN2011","B2C","17080"
"Rail","0000RAHUL#GMAIL.COM","NR2512012069809","25OCT2012","B2C","5731"
"DF","0000SS0#GMAIL.COM","NF251355775967","10MAY2011","B2C","2000"
"DF","0001HARISH#GMAIL.COM","NF251352240086","22DEC2010","B2C","4006"
"DF","0001HARISH#GMAIL.COM","NF251742087846","12DEC2010","B2C","1000"
"DF","0001HARISH#GMAIL.COM","NF252022031180","09DEC2010","B2C","3439"
"Rail","000AYUSH#GMAIL.COM","NR2151120122283","25JAN2013","B2C","136"
"Rail","000AYUSH#GMAIL.COM","NR2151213260036","28NOV2012","B2C","41"
"Rail","000AYUSH#GMAIL.COM","NR2151313264432","29NOV2012","B2C","96"
"Rail","000AYUSH#GMAIL.COM","NR2151413266728","29NOV2012","B2C","96"
"Rail","000AYUSH#GMAIL.COM","NR2512912359037","08DEC2012","B2C","96"
"Rail","000AYUSH#GMAIL.COM","NR2517612385569","12DEC2012","B2C","96"
"""
# read in and create the store
data_store_file = 'datastore.h5'
store = pd.HDFStore(data_store_file,'w')
def dp(x, **kwargs):
    return [ datetime(*strptime(v,'%d%b%Y')[0:3]) for v in x ]

chunksize = 5
reader = pd.read_csv(StringIO.StringIO(data), names=['x1','email','x2','date','x3','x4'],
                     header=0, usecols=['email','date'], parse_dates=['date'],
                     date_parser=dp, chunksize=chunksize)

for i, chunk in enumerate(reader):
    # create the global index, and keep it in the frame too
    chunk['indexer'] = chunk.index + i*chunksize
    df = chunk.set_index('indexer')

    # need to set a minimum size for the email column
    store.append('data', df, min_itemsize={'email': 100})

store.close()
# define the combiner function
def combiner(x):
    # given a group of emails (the same), return a combination
    # with the new data

    # sort by the date
    y = x.sort('date')

    # calc the diff in days (an integer)
    y['diff'] = (y['date'] - y['date'].iloc[0]).apply(lambda d: float(d.item().days))

    y['count'] = pd.Series(range(1, len(y) + 1), index=y.index, dtype='float64')

    return y
# reduce the store (and create a new one by chunks)
in_store_file = data_store_file
in_store1 = pd.HDFStore(in_store_file)

# iter on the store 1
for chunki, df1 in enumerate(in_store1.select('data', chunksize=2*chunksize)):
    print "processing [%s] [%s]" % (chunki, in_store_file)

    out_store_file = 'datastore_%s.h5' % chunki
    out_store = pd.HDFStore(out_store_file, 'w')

    # iter on store 2
    in_store2 = pd.HDFStore(in_store_file)
    for df2 in in_store2.select('data', chunksize=chunksize):

        # concat & drop dups
        df = pd.concat([df1, df2]).drop_duplicates(['email', 'date'])

        # group and combine
        result = df.groupby('email').apply(combiner)

        # remove the mi (that we created in the groupby)
        result = result.reset_index('email', drop=True)

        # only store those rows which are in df2!
        result = result.reindex(index=df2.index).dropna()

        # store to the out_store
        out_store.append('data', result, min_itemsize={'email': 100})

    in_store2.close()
    out_store.close()
    in_store_file = out_store_file

in_store1.close()

# show the reduced store
print pd.read_hdf(out_store_file, 'data').sort(['email', 'diff'])
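For the real 1.5 GB input you would point read_csv at the file path instead of the StringIO buffer and use a much larger chunksize. A rough sketch (the file name and chunksize here are assumptions; with no header row in the file, pass header=None):

# sketch only: same columns/parser as above, but reading the real file in
# 1,000,000-row chunks
reader = pd.read_csv('transactions.csv',
                     names=['x1', 'email', 'x2', 'date', 'x3', 'x4'],
                     header=None, usecols=['email', 'date'],
                     parse_dates=['date'], date_parser=dp,
                     chunksize=1000000)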

Use the built-in sqlite3 database: you can insert the data, sort and group as necessary, and there's no problem using a file which is larger than available RAM.
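A minimal sketch of that sqlite3 route, assuming the input is a file called "transactions.csv" (the 6-column layout shown above) and the result goes to "output.csv"; the file, table and column names are made up for illustration. The database does the heavy sorting on disk, and the sorted rows are streamed back one at a time:

import csv
import sqlite3
from datetime import datetime

conn = sqlite3.connect("transactions.db")          # on-disk DB, not limited by RAM
conn.execute("""CREATE TABLE IF NOT EXISTS txn
                (kind TEXT, email TEXT, ref TEXT, date TEXT,
                 chan TEXT, amount TEXT, iso_date TEXT)""")

def to_iso(d):
    # '26JUL2010' -> '2010-07-26', so dates sort correctly as text
    return datetime.strptime(d, "%d%b%Y").strftime("%Y-%m-%d")

with open("transactions.csv", newline="") as f:
    rows = ((k, e, r, d, c, a, to_iso(d)) for k, e, r, d, c, a in csv.reader(f))
    conn.executemany("INSERT INTO txn VALUES (?,?,?,?,?,?,?)", rows)
conn.commit()

with open("output.csv", "w", newline="") as out:
    writer = csv.writer(out, quoting=csv.QUOTE_ALL)
    prev_email, prev_iso, count = None, None, 0
    # sqlite streams the sorted rows back; they never have to fit in memory
    for kind, email, ref, date, chan, amount, iso in conn.execute(
            "SELECT * FROM txn ORDER BY email, iso_date"):
        if email != prev_email:
            count, diff = 1, 0
        else:
            count += 1
            diff = (datetime.strptime(iso, "%Y-%m-%d")
                    - datetime.strptime(prev_iso, "%Y-%m-%d")).days
        writer.writerow([kind, email, ref, date, chan, amount,
                         count, "%d days" % diff])
        prev_email, prev_iso = email, iso
conn.close()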

Another possible (system-admin) way, avoiding databases and SQL queries plus a whole lot of runtime processes and hardware requirements.
Update 20/04: added more code and a simplified approach:
1. Convert the timestamp to seconds since the Epoch and append it as a new field, then use UNIX sort on the email and this new field (that is: sort -t, -k2 -k7 -n < converted_input_file > output_file).
2. Initialize 3 variables: EMAIL, PREV_TIME and COUNT.
3. Iterate over each line; whenever a new email is encountered, append "1,0 days", then set PREV_TIME=timestamp, COUNT=1, EMAIL=new_email.
4. For the next line there are 3 possible scenarios:
   a) same email, different timestamp: calculate the difference in days, increment COUNT, update PREV_TIME, append "COUNT, difference_in_days";
   b) same email, same timestamp: increment COUNT, append "COUNT, 0 days";
   c) new email: start again from step 3.
An alternative to step 1 is to add the new TIMESTAMP field and remove it when printing the line back out.
Note: if 1.5 GB is too huge to sort in one go, split it into smaller chunks, using the email as the split point. You can run these chunks in parallel on different machines.
/usr/bin/gawk -F'","' '{
    split("JAN FEB MAR APR MAY JUN JUL AUG SEP OCT NOV DEC", month, " ");
    for (i=1; i<=12; i++) mdigit[month[i]]=i;
    print $0 "," mktime(substr($4,6,4) " " mdigit[substr($4,3,3)] " " substr($4,1,2) " 00 00 00")
}' < input.txt | /usr/bin/sort -k2 -k7 -n -t, > output_file.txt
output_file.txt:
"DF","00000000#11111.COM","FLTINT1000130394756","26JUL2010","B2C","6799.2",1280102400
"DF","0001HARISH#GMAIL.COM","NF252022031180","09DEC2010","B2C","3439",1291852800
"DF","0001HARISH#GMAIL.COM","NF251742087846","12DEC2010","B2C","1000",1292112000
"DF","0001HARISH#GMAIL.COM","NF251352240086","22DEC2010","B2C","4006",1292976000
...
You then pipe the output to a Perl, Python or AWK script that processes steps 2 through 4.
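For example, a rough Python sketch of steps 2 through 4, reading the sorted pipeline output on stdin (field positions follow the sample output above; this is an illustration, not tested against the real file):

#!/usr/bin/env python
import sys

prev_email, prev_ts, count = None, None, 0
for line in sys.stdin:
    record, ts = line.rstrip("\n").rsplit(",", 1)   # split off the appended epoch field
    ts = int(ts)
    email = record.split('","')[1]                  # second quoted column
    if email != prev_email:
        count, diff_days = 1, 0                     # first occurrence of this email
    else:
        count += 1
        diff_days = (ts - prev_ts) // 86400         # whole days since the previous booking
    print('%s,%d,%d days' % (record, count, diff_days))
    prev_email, prev_ts = email, ts

Run it as the last stage of the pipeline, e.g. ... | sort ... | python process.py > final.csv.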

Related

Is there a better way to iterate through this calculation?

Running this code produces the error message:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
I have 6 years' worth of competitors' results from a half marathon in one CSV file.
The function year_runners aims to create a new column for each year with the difference in finishing time between consecutive runners.
Is there a more efficient way of producing the same result?
Thanks in advance.
Pos Gun_Time Chip_Time Name Number Category
1 1900-01-01 01:19:15 1900-01-01 01:19:14 Steve Hodges 324 Senior Male
2 1900-01-01 01:19:35 1900-01-01 01:19:35 Theo Bately 92 Supervet Male
# calculating the time difference between each finisher in a year and adding
# this result into a new column called time_diff
def year_runners(year, x, y):
    print('Event held in', year)
    # x is the first number (position) for the runner of that year,
    # y is the last number (position) for that year, e.g. the 2016 event spans df[246:534]
    time_diff = 0
    for index, row in df.iterrows():
        # using Gun time as the start-time for all
        # using Chip time as the finishing time for each runner
        # work out the time difference between the x-placed runner and the runner behind (x + 1)
        time_diff = df2015.loc[(x + 1), 'Gun_Time'] - df2015.loc[(x), 'Chip_Time']
        # set the time_diff column to the value of time_diff for each row x of the dataframe
        df2015.loc[x, 'time_diff'] = time_diff
        print("Runner", (x + 1), "time, minus runner", x, "=", time_diff)
        x += 1
        if x > y:
            break
Hi everyone, this was solved using the shift technique.
youtube.com/watch?v=nZzBj6n_abQ
df2015['shifted_Chip_Time'] = df2015['Chip_Time'].shift(1)
df2015['time_diff'] = df2015['Gun_Time'] - df2015['shifted_Chip_Time']
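If the single CSV has (or can be given) a column identifying the event year, the same shift idea covers all six years at once. A sketch, assuming a hypothetical 'Year' column:

# group by the (assumed) Year column so the shift never crosses year boundaries
df['shifted_Chip_Time'] = df.groupby('Year')['Chip_Time'].shift(1)
df['time_diff'] = df['Gun_Time'] - df['shifted_Chip_Time']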

Improve running time when using inflation method on pandas

I'm trying to get real prices for my data in pandas. Right now I am just playing with one year's worth of data (3,962,050 rows) and it took 443 seconds to inflate the values using the code below. Is there a quicker way to find the real value? Is it possible to use pooling? I have many more years, and it would take too long to wait every time.
Portion of df:
year quarter fare
0 1994 1 213.98
1 1994 1 214.00
2 1994 1 214.00
3 1994 1 214.50
4 1994 1 214.50
import time

import cpi
import pandas as pd

def inflate_column(data, column):
    """
    Adjust for inflation the series of values in column of the
    dataframe data. Using cpi library.
    """
    print('Beginning to inflate ' + column)
    start_time = time.time()
    df = data.apply(lambda x: cpi.inflate(x[column], x.year), axis=1)
    print("Inflating process took", time.time() - start_time, " seconds to run")
    return df

df['real_fare'] = inflate_column(df, 'fare')
You have multiple rows for each year: you can call cpi.inflate just once per year, store the results in a dict, and then reuse those values instead of calling cpi.inflate for every row.
all_years = df["year"].unique()
dict_years = {}
for year in all_years:
    dict_years[year] = cpi.inflate(1.0, year)

df['real_fare'] = # apply here: dict_years[row['year']]*row['fare']
You can fill in the last line using apply, or do it some other way, e.g. df['real_fare'] = df['fare'] * ...
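One way to fill in that last line (a sketch building on dict_years above): map each row's year to its cached inflation factor and multiply by the fare, which stays fully vectorized.

# dict_years holds cpi.inflate(1.0, year) for each distinct year (see above)
df['real_fare'] = df['year'].map(dict_years) * df['fare']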

How can I add time series data into a range of time series list?

I have several spreadsheets with specific time series data. I want to summarize those specific times into a summary sheet with ranges of times.
For example, we have our summary date ranges:
[Dec 21, Dec 22, Dec 23] (midnight to midnight).
and the data would be something like this:
Dec 21 10:00 = 15
Dec 21 11:00 = 10
Dec 22 13:00 = 5
Dec 22 16:00 = 10
Dec 23 2:00 = 6
Dec 23 12:00 = 6
Thus I would like the summary to end up being: Dec 21 = 25, Dec 22 = 15, Dec 23 = 12.
I'm using python, datetime, and the openpyxl module to access and create time values.
I'm having a hard time getting my head around the creation of the time series list, as well as the actual addition.
Getting the actual datetimes and values from the individual sheets is easy.
for sheet in projectList:
    ws = wb[sheet]
    LOCSum = 0
    LOCList = {}
    for cols in range(8, 30):
        LOCDate = ws.cell(row=4, column=cols).value   # a datetime
        LOCSum = ws.cell(row=70, column=cols).value   # a number
        LOCList = LOCList + appendToListOfValues(LOCDate, LOCSum)
    fitListOfValuesIntoSummary(LOCList)
Once I've got LOCDate and LOCSum, how can I put them together into a list that can then be added to the summary? (The appendToListOfValues() function doesn't really exist.) Should it be a dictionary? A tuple?
Then, once I've got a time series list, how do I make it fit into the summary list? (The fitListOfValuesIntoSummary() function also doesn't exist.)
And the final kicker: what should I do if the data is outside the designated ranges? Do I just add it to a "before" and "after" range for the summary list?
Please point me in the direction of some literature as well.
(As I've been typing up this question:)
Would just automatically adding the found value to the summary cell in the Excel doc work?
if LOCDate >= summaryDate + 1:
    summaryDate = summaryDate + 1
if summaryDate <= LOCDate <= summaryDate + 1:
    ws[summary]['correctCol' + 'correctRow'].value = ws[summary]['correctCol' + 'correctRow'].value + LOCSum
This ended up working for me.
for sheet in projectList:
    ws = wb[sheet]
    LOCDate = 0
    LOCSum = 0
    weeklyLOCAvailCol = 7
    for cols in range(8, 30):
        LOCDate = ws.cell(row=19, column=cols).value   # a datetime
        LOCSum = ws.cell(row=70, column=cols).value    # a number
        # check if the current value's date is greater than the next slot
        if LOCDate >= wb[summName].cell(row=4, column=weeklyLOCAvailCol + 1).value:
            weeklyLOCAvailCol += 1   # if so, move up one slot
        # then add the current LOCSum value to whatever is in the current date range cell
        wb[summary].cell(row=LOCAvailpos, column=weeklyLOCAvailCol).value = \
            wb[summary].cell(row=LOCAvailpos, column=weeklyLOCAvailCol).value + LOCSum
So what ends up happening is that each summary cell holds the running total from each individual sheet's cells.
H70-J70 on each sheet is summed into H28 as long as the dates above H70-J70 are less than the next time range's first datetime; once K70's date is found to be greater than I28's datetime, the next set is accumulated in I28.
This isn't ideal as it requires an Excel spreadsheet, but it does slot data into ranges. If there is a faster way to do this, I think it would help others.
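For what it's worth, a pandas-based alternative could avoid the column-walking entirely: collect (date, value) pairs from the sheets and let groupby bin them by day. This is only a sketch; the workbook name is made up and the row/column positions are taken from the snippets above.

import pandas as pd
from openpyxl import load_workbook

wb = load_workbook("project_data.xlsx", data_only=True)   # hypothetical file name
records = []
for sheet in projectList:                                  # projectList as in the question
    ws = wb[sheet]
    for col in range(8, 30):
        records.append((ws.cell(row=19, column=col).value,   # a datetime
                        ws.cell(row=70, column=col).value))  # a number

df = pd.DataFrame(records, columns=["when", "value"]).dropna()
df["when"] = pd.to_datetime(df["when"])
daily = df.groupby(df["when"].dt.normalize())["value"].sum()
print(daily)   # e.g. Dec 21 -> 25, Dec 22 -> 15, Dec 23 -> 12

Values that fall outside the wanted ranges could then be dropped, or kept in "before"/"after" buckets, with a simple boolean filter on the dates before writing the totals back to the summary sheet.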

Write a function from csv using dataframes to read and return column values in python

I have the following data set in a csv file:
vehicle---time-----aspd[m/s]------gspd[m/s]----hdg---alt[m-msl]
veh_1---17:19.5---0.163471505---0.140000001---213---273.8900146
veh_2---17:19.5---0.505786836---0.170000002---214---273.9100037
veh_3---17:19.8---0.173484877---0.109999999---213---273.980011
veh_4---44:12.4---18.64673424---19.22999954---316---388.9299927
veh_5---44:13.0---18.13533401---19.10000038---316---389.1700134
I am trying to write a function launch_time() with two inputs (dataframe, vehicle name) that returns the first time the gspd is reported above 10.0 m/s.
The output time must be converted from a string (HH:MM:SS.SS) to a minutes after 12:00 format.
It should look something like this:
>>> launch_time(df, veh_1)
30.0
I will use this function to iterate through each vehicle and then need to record the results into a list of tuples with the format (v_name, launch time) in launch sequence order.
It should look something like this:
[('veh_1', 30.0), ('veh_2', 15.0)]
Disclosure: my python/pandas knowledge is very entry-level.
You can use read_csv with the separator -{3,}, i.e. a regex matching 3 or more - characters:
import pandas as pd
from pandas.compat import StringIO
temp=u"""vehicle---time-----aspd[m/s]------gspd[m/s]----hdg---alt[m-msl]
veh_1---17:19.5---0.163471505---0.140000001---213---273.8900146
veh_2---17:19.5---0.505786836---0.170000002---214---273.9100037
veh_3---17:19.8---0.173484877---0.109999999---213---273.980011
veh_4---44:12.4---18.64673424---19.22999954---316---388.9299927
veh_5---45:13.0---18.13533401---19.10000038---316---389.1700134"""
#after testing replace StringIO(temp) to filename
df = pd.read_csv(StringIO(temp), sep="-{3,}", engine='python')
print (df)
vehicle time aspd[m/s] gspd[m/s] hdg alt[m-msl]
0 veh_1 17:19.5 0.163472 0.14 213 273.890015
1 veh_2 17:19.5 0.505787 0.17 214 273.910004
2 veh_3 17:19.8 0.173485 0.11 213 273.980011
3 veh_4 44:12.4 18.646734 19.23 316 388.929993
4 veh_5 45:13.0 18.135334 19.10 316 389.170013
Then convert the time column with to_timedelta, filter all rows above 10 m/s with boolean indexing, sort_values by time, group on vehicle with groupby and take the first row of each group, and finally zip the vehicle and time columns and convert to a list:
df.time = (pd.to_timedelta('00:' + df.time, unit='h')
             .astype('timedelta64[m]').astype(int))
req = (df[df['gspd[m/s]'] > 10]
       .sort_values('time', ascending=True)
       .groupby('vehicle', as_index=False).head(1))
print(req)
vehicle time aspd[m/s] gspd[m/s] hdg alt[m-msl]
4 veh_5 45 18.135334 19.10 316 389.170013
3 veh_4 44 18.646734 19.23 316 388.929993
L = list(zip(req['vehicle'],req['time']))
print (L)
[('veh_5', 45), ('veh_4', 44)]
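A possible wrapper with the requested signature, built on the same idea (a sketch only: it assumes the frame as originally read, with time still the raw MM:SS.S strings, and reads the result simply as minutes):

def launch_time(df, vehicle):
    # rows for this vehicle where ground speed exceeds 10 m/s
    fast = df[(df['vehicle'] == vehicle) & (df['gspd[m/s]'] > 10)]
    if fast.empty:
        return None
    minutes = pd.to_timedelta('00:' + fast['time']).dt.total_seconds() / 60
    return round(minutes.min(), 1)

# launch sequence as a list of (vehicle, launch_time) tuples, earliest first
pairs = [(v, launch_time(df, v)) for v in df['vehicle'].unique()]
launch_order = sorted([p for p in pairs if p[1] is not None], key=lambda p: p[1])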

find a user/customer max count of continuous date using Python

Scenario: I have a sample data frame like below
user_id | date_login
--------|-----------
101 | 2015-10-11
101 | 2015-10-12
101 | 2015-11-01
101 | 2015-11-02
101 | 2015-11-03
102 | 2015-10-12
102 | 2015-10-13
...
I would like to know each user's max active days, i.e. the maximum number of continuous days on which he/she logs into the system. For the sample data frame above, the desired result should be:
user_id | max_continuous_login_count
--------|-----------
101|3
102|2
I'm thinking of converting the dates into numbers to compare; is that necessary? Is there any good practice here?
Thanks for the help.
Solution:
import datetime
from collections import defaultdict
from functools import reduce

dataset = [(101, "2015-10-11"), (101, "2015-10-12"), (102, "2015-10-13")]

data = defaultdict(list)
for user, date in dataset:
    data[user].append(datetime.datetime.strptime(date, "%Y-%m-%d").date())
    data[user].sort()

def count_days(data, new_date):
    max_days, current_max, last_date = data
    # Check if there's one day difference, else reset back to 1.
    if abs((new_date - last_date).days) != 1:
        current_max = 0
    current_max += 1
    return max(max_days, current_max), current_max, new_date

result = {}
for user, dates in data.items():
    result[user] = reduce(count_days, dates, (0, 0, datetime.date.min))[0]
What I did here was first convert the dataset into a dict mapping each user to his login dates. Along the way, I converted the dates to date objects and sorted them in the correct order (just in case the dataset is garbled).
I then created a function count_days() which checks whether the difference between 2 dates is 1 day. If it is, it increases the running count of days. Then, using reduce, I built a new results dict mapping each user id to max_days.
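Since the question mentions a data frame, a pandas version of the same idea may also help (a sketch; the sample frame below just mirrors the question's data): mark where a streak breaks, label the streaks with cumsum, and take the longest streak per user.

import pandas as pd

# hypothetical frame matching the question's layout
df = pd.DataFrame({
    "user_id": [101, 101, 101, 101, 101, 102, 102],
    "date_login": ["2015-10-11", "2015-10-12", "2015-11-01",
                   "2015-11-02", "2015-11-03", "2015-10-12", "2015-10-13"],
})
df["date_login"] = pd.to_datetime(df["date_login"])
df = df.drop_duplicates().sort_values(["user_id", "date_login"])

# a gap of more than one day (or a new user) starts a new streak
new_streak = df.groupby("user_id")["date_login"].diff().ne(pd.Timedelta(days=1))
streak_id = new_streak.cumsum()

result = (df.groupby(["user_id", streak_id])
            .size()                      # length of each streak
            .groupby("user_id").max()    # longest streak per user
            .rename("max_continuous_login_count")
            .reset_index())
print(result)   # user 101 -> 3, user 102 -> 2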
