I am trying to create a dummy file to make some ML predictions afterwards. The input is about 2,000 'routes', and I want to create a dummy that contains year-month-day-hour combinations for 7 days, i.e. 168 rows per route, about 350k rows in total.
The problem I am facing is that pandas becomes terribly slow at appending rows once the DataFrame reaches a certain size.
I am using the following code:
DAYS = [0, 1, 2, 3, 4, 5, 6]
HODS = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23]
ISODOW = {
1: "monday",
2: "tuesday",
3: "wednesday",
4: "thursday",
5: "friday",
6: "saturday",
7: "sunday"
}
def createMyPredictionDummy(start=datetime.datetime.now(), sourceFile=(utils.mountBasePath + 'routeProperties.csv'), destFile=(utils.outputBasePath + 'ToBePredictedTTimes.csv')):
    '''Generate a dummy file that can be used for predictions'''
    data = ['route', 'someProperties']
    dataFile = data + ['yr', 'month', 'day', 'dow', 'hod']
    # New DataFrame with all required columns
    file = pd.DataFrame(columns=dataFile)
    # Old data frame that has only the target columns
    df = pd.read_csv(sourceFile, converters=convert, delimiter=',')
    df = df[data]
    # Counter - To avoid constant lookup for length of the DF
    ix = 0
    routes = df['route'].drop_duplicates().tolist()
    # Iterate through all routes and create a row for every route-yr-month-day-hour combination for 7 days --> about 350k rows
    for no, route in enumerate(routes):
        print('Current route is %s which is no. %g out of %g' % (str(route), no+1, len(routes)))
        routeDF = df.loc[df['route'] == route].iloc[0].tolist()
        for i in range(0, 7):
            tmpDate = start + datetime.timedelta(days=i)
            day = tmpDate.day
            month = tmpDate.month
            year = tmpDate.year
            dow = ISODOW[tmpDate.isoweekday()]
            for hod in HODS:
                file.loc[ix] = routeDF + [year, month, day, dow, hod]  # This is becoming terribly slow
                ix += 1
    file.to_csv(destFile, index=False)
    print('Wrote file')
I think the main problem lies in appending the row with .loc[] - Is there any way to append a row more efficiently?
If you have any other suggestions, I am happy to hear them all!
Thanks and best,
carbee
(this is more of a long comment than an answer, sorry but without example data I can't run much...)
Since it seems to me that you're adding rows one at a time sequentially (i.e. the dataframe is indexed by integers accessed sequentially) and you always know the order of the columns, you're probably much better off creating a list of lists and then transforming it into a DataFrame at the end. That is, define something like file_list = [] and then replace the line file.loc[ix] = ... with:
file_list.append(routeDF + [year, month, day, dow, hod])
In the end, you can then define
file = pd.DataFrame(file_list, columns=dataFile)
If furthermore all your data is of a fixed type (e.g. int, depending on what is your routeDF and by not converting dow until after creating the dataframe) you might be even better off by pre-allocating a numpy array and writing into it, but I'm quite sure that adding elements to a list will not be the bottleneck of your code, so this is probably excessive optimization.
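To make this concrete, here is a minimal sketch of how the rewritten loop could look with that approach, reusing the names from the question (untested without the input file):
file_list = []
for no, route in enumerate(routes):
    routeDF = df.loc[df['route'] == route].iloc[0].tolist()
    for i in range(7):
        tmpDate = start + datetime.timedelta(days=i)
        dow = ISODOW[tmpDate.isoweekday()]
        for hod in HODS:
            # appending to a plain Python list is cheap; no DataFrame reallocation
            file_list.append(routeDF + [tmpDate.year, tmpDate.month, tmpDate.day, dow, hod])

# build the DataFrame once, at the very end
file = pd.DataFrame(file_list, columns=dataFile)
file.to_csv(destFile, index=False)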
Another alternative that minimizes changes to your code: simply preallocate enough space by creating a DataFrame full of NaN instead of a DataFrame with no rows, i.e. change the definition of file to (after moving the line with drop_duplicates up):
file = pd.DataFrame(columns=dataFile, index=range(len(routes)*168))
I'm quite sure this is faster than your code, but it might still be slower than the list of lists approach above since it won't know which data types to expect until you fill in data (it might e.g. convert your ints to float which is not ideal). But again, once you get rid of the continuous reallocations due to expanding a DataFrame at each step, this will probably not be your bottleneck anymore (the double loop will likely be.)
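For that variant only the setup changes; a short sketch of the idea (moving the drop_duplicates line up so the length is known beforehand):
routes = df['route'].drop_duplicates().tolist()
file = pd.DataFrame(columns=dataFile, index=range(len(routes) * 168))
# the loop body can stay as it is: file.loc[ix] = ... now overwrites an existing,
# preallocated row instead of growing the frame on every assignment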
You create an empty dataframe named file and then fill it by appending rows one at a time; this seems to be the problem. You could instead do something like this:
def createMyPredictionDummy(...):
    ...
    # make it yield a dict of attributes from the for loop
    for hod in HODS:
        yield data

# then use this to create the *file* dataframe outside that function
newDF = pd.DataFrame([r for r in createMyPredictionDummy()])
newDF.to_csv(destFile, index=False)
print('Wrote file')
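For example, a rough sketch of what that could look like, reusing the names from the question (untested since I don't have the source data; the converters argument is omitted, 'someProperties' stands for the question's placeholder column, and sourceFile/destFile are assumed to be defined as in the question):
def createMyPredictionDummy(start, sourceFile):
    '''Yield one dict per route-year-month-day-hour combination.'''
    df = pd.read_csv(sourceFile, delimiter=',')
    for route in df['route'].drop_duplicates():
        routeRow = df.loc[df['route'] == route].iloc[0]
        for i in range(7):
            tmpDate = start + datetime.timedelta(days=i)
            for hod in HODS:
                yield {'route': route,
                       'someProperties': routeRow['someProperties'],
                       'yr': tmpDate.year,
                       'month': tmpDate.month,
                       'day': tmpDate.day,
                       'dow': ISODOW[tmpDate.isoweekday()],
                       'hod': hod}

newDF = pd.DataFrame([r for r in createMyPredictionDummy(datetime.datetime.now(), sourceFile)])
newDF.to_csv(destFile, index=False)
print('Wrote file')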
My data comes from BigQuery exported to a GCS bucket as CSV files, and if the file size is quite massive, BigQuery automatically splits the data into several chunks. Since this is time series data, a single time series might be scattered across different files. I have a custom function that I want to apply to each TimeseriesID.
Here are some constraints on the data:
The data is sorted by TimeseriesID and TimeID
The number of rows in each file may vary, but there is at least 1 row per file (a single-row file is very unlikely, though)
The starting TimeID is not always 0
The length of each time series may vary, but at most it will be scattered across 2 files. No time series is scattered across 3 different files.
Here's the initial setup to illustrate the problem:
import numpy as np
import pandas as pd

# Please take note this is just for simplicity. The actual goal is not to calculate the mean for each group, but to apply a custom_func to each TimeseriesID
def custom_func(x):
    return np.mean(x)

# Please take note this is just for simplicity. In reality, I read the files one by one since reading all the data at once is not possible
df1 = pd.DataFrame({"TimeseriesID":['A','A','A','B'],"TimeID":[0,1,2,4],"value":[10,20,5,30]})
df2 = pd.DataFrame({"TimeseriesID":['B','B','B','C'],"TimeID":[5,6,7,8],"value":[10,20,5,30]})
df3 = pd.DataFrame({"TimeseriesID":['C','D','D','D'],"TimeID":[9,1,2,3],"value":[10,20,5,30]})
This would be pretty trivial if I could just concat all the files, but the problem is that if I concat all the dataframes, they won't fit in memory.
The output I want should be similar to this, but without concatenating all the files:
pd.concat([df1,df2,df3],axis=0).groupby('TimeseriesID').agg({"value":custom_func})
I'm also aware of vaex and dask, but I want to stick with plain pandas for the time being.
I'm also open to solutions that involve modifying the BigQuery export to split the files better.
The approach presented by the OP, using concat with millions of records, would be overkill for memory and other resources.
I have tested the OP's code using Google Colab notebooks and it is a bad approach:
import pandas as pd
import numpy as np
import time
# Please take note this is just for simplicity. The actual goal is not to calculate the mean for each group, but to apply a custom_func to each TimeseriesID
def custom_func(x):
    return np.mean(x)

# Please take note this is just for simplicity. In reality, I read the files one by one since reading all the data at once is not possible
df1 = pd.DataFrame({"TimeseriesID":['A','A','A','B'],"TimeID":[0,1,2,4],"value":[10,20,5,30]})
df2 = pd.DataFrame({"TimeseriesID":['B','B','B','C'],"TimeID":[5,6,7,8],"value":[10,20,5,30]})
df3 = pd.DataFrame({"TimeseriesID":['C','D','D','D'],"TimeID":[9,1,2,3],"value":[10,20,5,30]})
start = time.time()
df = pd.concat([df1,df2,df3]).groupby('TimeseriesID').agg({"value":custom_func})
elapsed = (time.time() - start)
print(elapsed)
print(df.head())
output will be:
0.023952960968017578
                  value
TimeseriesID
A             11.666667
B             16.250000
C             20.000000
D             18.333333
As you can see, concat takes time to process; with only a few records the cost is barely noticeable. The approach should be as follows:
Get the files with the data you are going to process, i.e. only the workable columns.
Create a dictionary of keys and values from the processed files, collecting the values for each key into its own file where necessary. You can store the results in a 'results' directory as json/csv:
A.csv will have all key 'A' values
...
n.csv will have all key 'n' values
Iterate through the results directory and start building your final output inside a dictionary:
{'A': [10, 20, 5], 'B': [30, 10, 20, 5], 'C': [30, 10], 'D': [20, 5, 30]}
Apply the custom function to each key's list of values:
{'A': 11.666666666666666, 'B': 16.25, 'C': 20.0, 'D': 18.333333333333332}
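A quick sketch of those last two steps, reusing the custom_func defined earlier (the combined dictionary is just the example values shown above):
combined = {'A': [10, 20, 5], 'B': [30, 10, 20, 5], 'C': [30, 10], 'D': [20, 5, 30]}
# apply the custom function to every key's list of values
results = {key: custom_func(values) for key, values in combined.items()}
print(results)  # {'A': 11.666..., 'B': 16.25, 'C': 20.0, 'D': 18.333...}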
You can check the logic using the code below; I use json to store the data:
from google.colab import files
import json
import pandas as pd
#initial dataset
df1 = pd.DataFrame({"TimeseriesID":['A','A','A','B'],"TimeID":[0,1,2,4],"value":[10,20,5,30]})
df2 = pd.DataFrame({"TimeseriesID":['B','B','B','C'],"TimeID":[5,6,7,8],"value":[10,20,5,30]})
df3 = pd.DataFrame({"TimeseriesID":['C','D','D','D'],"TimeID":[9,1,2,3],"value":[10,20,5,30]})
#get unique keys and its values
df1.groupby('TimeseriesID')['value'].apply(list).to_json('df1.json')
df2.groupby('TimeseriesID')['value'].apply(list).to_json('df2.json')
df3.groupby('TimeseriesID')['value'].apply(list).to_json('df3.json')
#as this is an example you can download the output as jsons
files.download('df1.json')
files.download('df2.json')
files.download('df3.json')
Update 06/10/2021
I have tuned the code for the OP's needs. This part creates the refined files.
from google.colab import files
import json
#you should use your own function to get the data from the file
def retrieve_data(uploaded,file):
    return json.loads(uploaded[file].decode('utf-8'))

#you should use your own function to get a list of files to process
def retrieve_files():
    return files.upload()

key_list = []
#call a function that gets a list of files to process
file_to_process = retrieve_files()
#read every raw file:
for file in file_to_process:
    file_data = retrieve_data(file_to_process,file)
    for key,value in file_data.items():
        if key not in key_list:
            key_list.append(key)
            with open(f'{key}.json','w') as new_key_file:
                new_json = json.dumps({key:value})
                new_key_file.write(new_json)
        else:
            with open(f'{key}.json','r+') as key_file:
                raw_json = key_file.read()
                old_json = json.loads(raw_json)
                new_json = json.dumps({key:old_json[key]+value})
                key_file.seek(0)
                key_file.write(new_json)

for key in key_list:
    files.download(f'{key}.json')

print(key_list)
Update 07/10/2021
I have updated the code to avoid confusion. This part processes the refined files.
import time
import numpy as np
#Once we get the refined values we can use it to apply custom functions
def custom_func(x):
    return np.mean(x)

#Get key and data content from single json
def get_data(file_data):
    content = file_data.popitem()
    return content[0],content[1]

#load key list and build our refined dictionary
refined_values = []
#call a function that gets a list of files to process
file_to_process = retrieve_files()
start = time.time()
#read every refined file:
for file in file_to_process:
    #read content of file n
    file_data = retrieve_data(file_to_process,file)
    #parse and apply function per file read
    key,data = get_data(file_data)
    func_output = custom_func(data)
    #start building refined list
    refined_values.append([key,func_output])
elapsed = (time.time() - start)
print(elapsed)
df = pd.DataFrame.from_records(refined_values,columns=['TimerSeriesID','value']).sort_values(by=['TimerSeriesID'])
df = df.reset_index(drop=True)
print(df.head())
output will be:
0.00045609474182128906
TimerSeriesID value
0 A 11.666667
1 B 16.250000
2 C 20.000000
3 D 18.333333
To summarize:
When handling large datasets, always focus on the data that you are actually going to use and keep it minimal, using only the workable values.
Processing times are faster when operations are performed with basic operators or native Python libraries.
I have a script that collects data from an experiment and adds it to a PyTables table. The script gets data in batches (say, groups of 10). It's a little cumbersome in the code to add one row at a time via the normal method, e.g.:
data_batch = experiment.read()
last_time = time.time()
for data_row in data_batch:
    row = table.row
    row['timestamp'] = last_time
    last_time += dt
    row['column1'] = data_row[0]
    row['column2'] = data_row[1]
    row.append()
table.flush()
I would much rather do something like this:
data_batch = experiment.read()
start_index = len(table)
num_rows = len(data_batch)
table.append_n_rows(num_rows)
table.cols.timestamp[start_index:] = last_time + np.arange(num_rows) * dt
last_time += dt * num_rows
table.cols.column1[start_index:] = data_batch[:, 0]
table.cols.column2[start_index:] = data_batch[:, 1]
table.flush()
Does anyone know if there is some function that does the equivalent of table.append_n_rows? Right now, all I can do is [table.row for i in range(num_rows)], which I feel is hacky and inefficient.
You are on the right track. In table.append(rows), the rows argument can be any object that can be converted to a structured array. This includes: "NumPy structured arrays, lists of tuples or array records, and a string or Python buffer". (I prefer NumPy arrays because I routinely work with them. Your answer shows how to use a list of tuples.)
There is a significant performance advantage to adding data in batches instead of 1 row at a time. I ran some tests and posted them to SO a few years ago. I/O performance is primarily related to the number of batches, not the batch size. Take a look at this answer for details: pytables writes much faster than h5py
Also, if you are going to create a large table, consider setting the expectedrows parameter when you create the table. This will also improve I/O performance, and it has the side benefit of setting an appropriate chunksize.
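For instance, a minimal sketch; the table description and row count here are made up for illustration, not taken from the question:
import tables as tb

class Data(tb.IsDescription):
    timestamp = tb.Float64Col()
    column1 = tb.Float64Col()
    column2 = tb.Float64Col()

with tb.open_file('experiment.h5', mode='w') as h5f:
    # expectedrows is only a hint, but it lets PyTables pick a sensible chunkshape
    table = h5f.create_table(h5f.root, 'data', Data, "Experiment data",
                             expectedrows=1000000)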
Recommended approach with your data:
data_batch = experiment.read()
last_time = time.time()
row_list = []
for data_row in data_batch:
    row_list.append( (last_time, data_row[0], data_row[1] ) )
    last_time += dt
your_table.append( row_list )
your_table.flush()
There is an example in the PyTables source code. I'm going to paste it here to avoid a dead link in the future:
import tables as tb
class Particle(tb.IsDescription):
    name = tb.StringCol(16, pos=1)    # 16-character String
    lati = tb.IntCol(pos=2)           # integer
    longi = tb.IntCol(pos=3)          # integer
    pressure = tb.Float32Col(pos=4)   # float (single-precision)
    temperature = tb.FloatCol(pos=5)  # double (double-precision)

fileh = tb.open_file('test4.h5', mode='w')
table = fileh.create_table(fileh.root, 'table', Particle, "A table")

# Append several rows in only one call
table.append([("Particle: 10", 10, 0, 10 * 10, 10**2),
              ("Particle: 11", 11, -1, 11 * 11, 11**2),
              ("Particle: 12", 12, -2, 12 * 12, 12**2)])
fileh.close()
I have two arrays (data frames actually but the function below deals with arrays)
bh_arr = array of bank holiday dates in UK
sales_dates = Sales dates for a few years (millions of rows, I mean really millions)
I want to know for each date in sales_dates, how many days to the next bank holiday (from bh_arr).
I built a function like the one below and it works, but as is evident from the code it is very wasteful: it calculates all the differences first and then takes the non-negative minimum.
def get_days_to_bh_arr(sale_dates, bh_arr):
"""
Subtract all elements of bh_arr from each element of sales_dates.
Get the min ( > 0) for each element of sales_dates.
Return that array.
"""
bh_arr = pd.to_datetime(bh_arr)
res = []
for each in sale_dates:
gg = int(np.min([stuff for stuff in (bh_arr - pd.to_datetime(each))/np.timedelta64(1,'D') if stuff >= 0]))
res.append(gg)
return np.array(res)
bh_arr = ['2018-03-30', '2018-08-27', '2019-05-27']
sales_dates = ['2018-03-15', '2019-05-22', '2018-02-01', '2018-08-05', '2018-06-21']
get_days_to_bh_arr(sales_dates, bh_arr)
15, 5, 57, 22, 67
In the actual code, I finally make the call like so:
sales['days_to_next_bh'] = get_days_to_bh_arr(sales['full_date'], bh['holiday']).astype(np.int32)
Is there a more efficient way of writing the function (of course there is)?
If not, should I try something else like finding the next date from a sorted 'bh_arr', for each date in 'sales_dates' and only at the end do the subtraction? How would I make that work?
Could I vectorise that instead of looping?
Any guidance would be much appreciated.
In pandas this can be done with pd.merge_asof to bring the closest bank holiday in the future. Then datetime subtraction gets the days in between.
Because merge_asof requires sorting, we need to reset the index so that we can maintain the original ordering after the merge.
import pandas as pd
df_b = pd.DataFrame({'bh': pd.to_datetime(bh_arr)})
df_s = pd.DataFrame({'sd': pd.to_datetime(sales_dates)})
df_s = (pd.merge_asof(df_s.reset_index().sort_values('sd'),
df_b.sort_values('bh'),
direction='forward',
left_on='sd',
right_on='bh')
.sort_values('index'))
arr = (df_s['bh'] - df_s['sd']).dt.days.to_numpy()
#array([15, 5, 57, 22, 67], dtype=int64)
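Alternatively, since bh_arr can be sorted, np.searchsorted gives the position of the next bank holiday for every sale date in a single vectorised call. A sketch, which assumes every sale date has at least one bank holiday after it (as in the example data):
import numpy as np
import pandas as pd

bh = np.sort(pd.to_datetime(bh_arr).values)
sd = pd.to_datetime(sales_dates).values
# position of the first bank holiday that falls on or after each sale date
idx = np.searchsorted(bh, sd, side='left')
days_to_bh = (bh[idx] - sd) / np.timedelta64(1, 'D')
# array([15.,  5., 57., 22., 67.])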
Problem: Given the dataframe below, I'm trying to come up with the code that will apply a function to three distinct columns without having to write three separate function calls.
The code for the data:
import pandas as pd
data = {'name': ['Jason', 'Molly', 'Tina', 'Jake', 'Amy'],
'days': [365, 365, 213, 318, 71],
'spend_30day': [22, 241.5, 0, 27321.05, 345],
'spend_90day': [22, 451.55, 64.32, 27321.05, 566.54],
'spend_365day': [854.56, 451.55, 211.65, 27321.05, 566.54]}
df = pd.DataFrame(data)
cols = df.columns.tolist()
cols = ['name', 'days', 'spend_30day', 'spend_90day', 'spend_365day']
df = df[cols]
df
The function below will essentially annualize spend; if someone has fewer than, say, 365 days in the "days" column, the following function will tell me what the spend would have been if they had 365 days:
def annualize_spend_365(row):
    if row['days']/(float(365)) < 1:
        return (row['spend_365day']/(row['days']/float(365)))
    else:
        return row['spend_365day']
Then I apply the function to the particular column:
df.spend_365day = df.apply(annualize_spend_365, axis=1).round(2)
df
This works exactly as I want it to for that one column. However, I don't want to have to rewrite this for each of the three different "spend" columns (30, 90, 365). I want to be able to write code that will generalize and apply this function to multiple columns in one pass.
I thought I could create lists of the columns and their respective days, use the "zip" function, and nest the function in a for loop, but my attempt below ultimately fails:
spend_cols = [df.spend_30day, df.spend_90day, df.spend_365day]
days_list = [30, 90, 365]
for col, day in zip(spend_cols, days_list):
    def annualize_spend(row):
        if row.days/(float(day)) < 1:
            return (row.col)/((row.days)/float(day))
        else:
            return row.col
    col = df.apply(annualize_spend, axis = 1)
The error:
AttributeError: ("'Series' object has no attribute 'col'")
I'm not sure why the loop approach is failing. Regardless, I'm hoping for guidance on how to generalize function application in pandas. Thanks in advance!
Look at your two function definitions:
def annualize_spend_365(row):
    if row['days']/(float(365)) < 1:
        return (row['spend_365day']/(row['days']/float(365)))
    else:
        return row['spend_365day']
and
#col in [df.spend_30day, df.spend_90day, df.spend_365day]
def annualize_spend(row):
    if row.days/(float(day)) < 1:
        return (row.col)/((row.days)/float(day))
    else:
        return row.col
See the difference? In the first case you access the fields with explicit field names, and it works. In the second case you try to access row.col, which fails: attribute access looks for a field literally called "col", it does not substitute the value of your variable col. On top of that, in your loop col is bound to the column Series themselves (df.spend_30day and so on), not to the column names. Instead try
spend_cols = ['spend_30day', 'spend_90day', 'spend_365day']
before your loop, and access the field as row[col], so that the field name is the value of the string col rather than the literal string "col". And anyway, I'm not sure how wise it is to take col as an output variable inside your loop over col.
I'm unfamiliar with pandas.DataFrame.apply, but it's probably possible to use a single function definition, which takes the number of days and the field of interest as input variables:
def annualize_spend(col,day,row):
    if row['days']/(float(day)) < 1:
        return (row[col])/((row['days'])/float(day))
    else:
        return row[col]
spend_cols = ['spend_30day', 'spend_90day', 'spend_365day']
days_list = [30, 90, 365]
for col, day in zip(spend_cols, days_list):
    col = df.apply(lambda row,col=col,day=day: annualize_spend(col,day,row), axis = 1)
The lambda will ensure that only one input parameter of your function is left dangling when it gets applied.
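Putting that together, a sketch that also writes the results back into the dataframe (here assigning to df[col] rather than to the loop variable, and rounding as in the original question):
def annualize_spend(row, col, day):
    # scale the spend up if the row covers fewer than `day` days
    if row['days'] / float(day) < 1:
        return row[col] / (row['days'] / float(day))
    return row[col]

spend_cols = ['spend_30day', 'spend_90day', 'spend_365day']
days_list = [30, 90, 365]
for col, day in zip(spend_cols, days_list):
    df[col] = df.apply(annualize_spend, axis=1, args=(col, day)).round(2)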
I'm trying to write a function that takes a continuous time series and returns a data structure which describes any missing gaps in the data (e.g. a DF with columns 'start' and 'end'). It seems like a fairly common issue for time series, but despite messing around with groupby, diff, and the like -- and exploring SO -- I haven't been able to come up with much better than the below.
It's a priority for me that this use vectorized operations to remain efficient. There has got to be a more obvious solution using vectorized operations -- hasn't there? Thanks for any help, folks.
import pandas as pd
def get_gaps(series):
"""
#param series: a continuous time series of data with the index's freq set
#return: a series where the index is the start of gaps, and the values are
the ends
"""
missing = series.isnull()
different_from_last = missing.diff()
# any row not missing while the last was is a gap end
gap_ends = series[~missing & different_from_last].index
# count the start as different from the last
different_from_last[0] = True
# any row missing while the last wasn't is a gap start
gap_starts = series[missing & different_from_last].index
# check and remedy if series ends with missing data
if len(gap_starts) > len(gap_ends):
gap_ends = gap_ends.append(series.index[-1:] + series.index.freq)
return pd.Series(index=gap_starts, data=gap_ends)
For the record, Pandas==0.13.1, Numpy==1.8.1, Python 2.7
This problem can be transformed into finding runs of continuous numbers in a list: find all the indices where the series is null, and if a run of indices (3,4,5,6) are all null, you only need to extract the start and end, (3,6).
import numpy as np
import pandas as pd
from operator import itemgetter
from itertools import groupby
def find_gap(s):
    """ just treat it as a list
    """
    nullindex = np.where( s.isnull())[0]
    ranges = []
    for k, g in groupby(enumerate(nullindex), lambda (i,x): i-x):
        group = map(itemgetter(1), g)
        ranges.append((group[0], group[-1]))
    startgap, endgap = zip(*ranges)
    return pd.Series( endgap, index=startgap )

# create an example
data = [2, 3, 4, 5, 12, 13, 14, 15, 16, 17]
s = pd.Series( data, index=data)
s = s.reindex(xrange(18))
print find_gap(s)
reference : Identify groups of continuous numbers in a list
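Since the question asks specifically for vectorised operations, the run detection can also be done with numpy alone. A sketch, written against current pandas and Python 3, so it would need minor tweaks for the pandas 0.13 / Python 2.7 setup mentioned in the question:
import numpy as np
import pandas as pd

def find_gaps_vectorised(s):
    """Map the index label where each gap starts to the label where it ends."""
    null_pos = np.flatnonzero(s.isnull().values)        # positions of missing rows
    if len(null_pos) == 0:
        return pd.Series(dtype=object)
    breaks = np.flatnonzero(np.diff(null_pos) > 1) + 1  # where a new run of nulls begins
    starts = null_pos[np.r_[0, breaks]]
    ends = null_pos[np.r_[breaks - 1, len(null_pos) - 1]]
    return pd.Series(s.index[ends], index=s.index[starts])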