So I know you're never supposed to iterate over a Pandas DataFrame, but I can't find another way around this problem.
I have a bunch of different time series, say they're end-of-day stock prices. They're in a DataFrame like this:
Ticker Price
0 AAA 10
1 AAA 11
2 AAA 10.5
3 BBB 100
4 BBB 110
5 CCC 60
etc.
For each Ticker, I want to take a variety of models and train them on successively larger batches of data. Specifically, I want to take a model, train it on day 1's data, and predict day 2; then train the same model on days 1 and 2 and predict day 3, and so on. For each day N, I slice the data up to the day before, [day0:dayN-1], train on that subset, and predict day N.
Essentially I'm implementing sklearn's TimeSeriesSplit, except I'm doing it myself because the models I'm training aren't in sklearn (for example, one model is Prophet).
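For reference, here is what sklearn's TimeSeriesSplit produces on a toy index, just to show the expanding-window pattern I'm reproducing (the sample size and split count are placeholders):
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Expanding window: train on everything up to day N-1, test on day N.
for train_idx, test_idx in TimeSeriesSplit(n_splits=3).split(np.arange(6)):
    print("train:", train_idx, "test:", test_idx)
# train: [0 1 2] test: [3]
# train: [0 1 2 3] test: [4]
# train: [0 1 2 3 4] test: [5]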
The idea is I try a bunch of models on a bunch of different Tickers, then I see which models work well for which Tickers.
So my basic code for running one model on all my data looks like:
import pandas as pd

def make_predictions(df):
    res = pd.DataFrame()
    for ticker in df.ticker.unique():
        df_ticker = df[df['ticker'] == ticker]
        for i, _ in df_ticker.iterrows():
            X = df_ticker[0:i]
            X = do_preparations(X)          # do some processing to prepare the data
            m = train_model(X)              # train the model
            forecast = make_predictions(m)  # predict one week
            df_ticker.loc[i, 'preds'] = forecast['y'][0]
        res = pd.concat([res, df_ticker])
    return res
But my code runs super slow. Can I speed this up somehow?
I can't figure out how I would use .apply() or any of the other common anti-iterating techniques.
Consider several items:
First, avoid the quadratic copying that comes from calling pd.concat inside a loop. Instead, build a list/dict of data frames and concatenate them once outside the loop.
Second, avoid DataFrame.iterrows since you only use the index i. Instead, traverse the index directly.
Third, for compactness, avoid unique() followed by subsetting with [...]. Instead, use groupby() inside a dictionary or list comprehension, which may be slightly faster than the list.append approach; because of your multiple processing steps, an inner defined function is needed.
The inner loop may be unavoidable, as you are really running different models.
def make_predictions(df):

    def proc_model(sub_df):
        for i in sub_df.index:
            X = sub_df.loc[0:i]
            X = do_preparations(X)          # do some processing to prepare the data
            m = train_model(X)              # train the model
            forecast = make_predictions(m)  # predict one week
            sub_df.loc[i, 'preds'] = forecast['y'][0]
        return sub_df

    # BUILD DICTIONARY OF DATA FRAMES
    df_dict = {i: proc_model(g) for i, g in df.groupby('ticker')}

    # CONCATENATE DATA FRAMES
    res = pd.concat(df_dict, ignore_index=True)
    return res
I have a dataframe with a million rows and almost 100 features. I first need to cast one feature to string and then drop about 17 features. Then I need to add a column called pred to the data frame. The way I add this column is to group the rows by their "Reta" feature: if a -1 is found, all rows in that class get a pred value of -1, otherwise they get 1. This can be done with this code:
# getting the prediction
hs_p = {}
for i in range(len(classes)):
    class_name = classes[i]
    # this could be rewritten so that once we find a -1 we stop instead of checking everything
    check = df.loc[df['CLUSTER'] == class_name]['Reta'].values.tolist()
    if (-1 in check):
        hs_p[class_name] = -1
    else:
        hs_p[class_name] = 1
hs_p_col = []
print("prediction done")

# Adding the prediction column to the df
for i in hs_p:
    df.loc[df['CLUSTER'] == i, 'pred'] = hs_p[i]
The problem is that the data is huge, and it has been running for a long time with still no result. I thought about parallelizing with Python's multiprocessing library. However, I understand that multiprocessing divides the dataframe into chunks, so one chunk would get some of a class's rows and another chunk would get the rest, and the pred column would not be computed accurately. Any ideas about how to do this?
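A vectorized sketch of the rule described above (reusing the CLUSTER/Reta names from the code) would look something like this, which may sidestep the loops and the need for multiprocessing altogether:
import numpy as np

# pred = -1 for every row whose CLUSTER group contains a -1 in Reta, else 1
has_neg = df.groupby('CLUSTER')['Reta'].transform(lambda s: (s == -1).any())
df['pred'] = np.where(has_neg, -1, 1)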
I am doing this educational challenge on kaggle https://www.kaggle.com/c/competitive-data-science-predict-future-sales
The training set is a file of daily sales numbers for some products, and for the test set we need to predict the sales of similar items for the month of November.
Now I would like to use my model to make daily predictions and thus expand the test data set by 30 rows (one per day of November) for each original row.
I have the following code:
for row in test.itertuples():
    df = pd.DataFrame(index=nov15, columns=test.columns)
    df['shop_id'] = row.shop_id
    df['item_category_id'] = row.item_category_id
    df['item_price'] = row.item_price
    df['item_id'] = row.item_id
    df = df.reset_index()
    df.columns = ['date', 'item_id', 'shop_id', 'item_category_id', 'item_price']
    df = df[train.columns]
    tt = pd.concat([tt, df])
nov15 is a pandas date range from 1/Nov/2015 to 30/Nov/2015.
tt is just an empty DataFrame that I fill by expanding it by 30 rows (Nov 1 to 30) for every row in the test set.
test is the original dataframe I am copying the rows from.
It runs, but it takes hours to complete.
Knowing pandas and learning from previous experiences, there is probably an efficient way to do this.
Thank you for your help!
So I have found a "more" efficient way, and then someone over at Reddit's r/learnpython has told me about the correct and most efficient way.
The above dilemma is easily solved by the pandas explode function.
And these two lines do what I did above, but within seconds:
test['date'] = [nov15 for _ in range(len(test))]
test = test.explode('date')
Now, my own "more efficient" second solution, which is nowhere near as good or as fast, was simply to make 30 copies of the dataframe with a 'date' column added and concatenate them.
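For comparison, that second approach boils down to something like this sketch (same test frame and nov15 range as above):
# one copy of the test frame per day in November, then stack them
tt = pd.concat([test.assign(date=d) for d in nov15], ignore_index=True)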
I am playing with the famous Titanic dataset.
I have train.csv and test.csv loaded into an array of size two called combined, and I concatenated the Ticket column of each like this:
all_tickets = combined[0]['Ticket'].append(combined[1]['Ticket'])
Now, my goal is to apply the same categorical codes to each DataFrame. However, when I tried this for the first one:
combined[0]['TicketCode'] = all_tickets.astype('category').cat.codes
It complained: cannot reindex from a duplicate axis
It makes sense, since the two sets are of different sizes. How can I achieve my goal in this situation? By grouping the Tickets with a range enumerator?
Thanks
At the end, I just built a dictionary and applied to both DataFrames:
ticket_codes = {}
i = 0
grouped_series = combined[0]['Ticket'].append(combined[1]['Ticket'])

for g, _ in grouped_series.groupby(grouped_series):
    ticket_codes[g] = i
    i = i + 1

# positions holds the indices of the two DataFrames, i.e. [0, 1]
for x in positions:
    combined[x]['TicketCode'] = combined[x]['Ticket'].apply(lambda y: ticket_codes[y])
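An equivalent vectorized sketch, assuming combined holds the two DataFrames as above, would be to fix the category set once and cast both Ticket columns with it:
all_tickets = pd.concat([combined[0]['Ticket'], combined[1]['Ticket']])
ticket_dtype = pd.CategoricalDtype(categories=all_tickets.unique())

for x in (0, 1):
    combined[x]['TicketCode'] = combined[x]['Ticket'].astype(ticket_dtype).cat.codes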
To summarize: how do I perform groupby operations in parallel for a limited number of groups at a time, writing the result of each group's apply function to disk?
My problem: I'm trying to create a supervised structure for regression models from information about a lot of clients separated into years. From the same clients I have to build different models, with different inputs X and labels Y, so my idea is to create a single X and Y dataframe holding all variables at once and slice each one according to the task. For example, X could hold information on salary, age or sex, but model 1 would use only age and sex, while model 2 would use only salary.
As clients are not present every year, I can only use clients that are present from one period to the next one.
Instead of selecting the intersection of clients for each pair of contiguous years, I'm trying to concatenate all the information and perform groupby operations by client ID (and then filter by year sequence, for example using the rows where the difference between periods is 1). The problem with using Dask for this task is that the distributed workers are running low on memory (even after increasing the limit to 30GB each). Note that for each group I'm creating a new dataframe, so I'm not reducing the calculation to a single number per group, hence the memory-intensive operation.
What I'm currently doing is performing a groupby operation, then iterating over the groupby object and writing to disk sequentially, for example like this:
x_file = open('X.csv', 'w')

for name, group in concatenated_data.groupby('ID'):
    data_x = my_func(group)  # In my real code, my_func returns x and y dataframes
    data_x.to_csv(x_file, header=None)

x_file.close()
which writes the data sequentially, applying my_func, which selects the x and y for each group.
What I want is to perform the operation for a controlled number of groups (let's say 3 at a time), writing the result of each group to disk (maybe with data_x.to_csv(x_file, single_file=True)).
Of course I could do the same with a dask dataframe and iterate over the groupby object using get_group(), but I don't believe it would run in parallel while also keeping the memory in check.
EDIT: Example
import pandas as pd

# Let's say I have 3 csv files:
data = ['./data_2016', './data_2017', './data_2018']  # Each file contains millions of rows (1 per client ID) and around 85 columns

# and certain variables
x_vars = ['x1', 'x2', 'x3']  # x variables
y_vars = ['y1', 'y2', 'x1']  # note that some variables can appear in both x and y (like using today's salary to predict tomorrow's salary)

data = [pd.read_csv(x) for x in data]

def func1(df_):
    # do some preprocessing stuff
    return df_

data = map(func1, data)  # Some preprocessing and adding some columns (for example a column for the year)

concatenated_data = pd.concat(data, axis=0)  # Big file, all clients from 2016-2018 stacked row-wise

def my_func(df_):  # function applied above
    # order by year
    df_['Diff'] = df_.year.diff()        # calculate the difference between years
    df_['shifted'] = df_.Diff.shift(-1)  # calculate the shift of the difference
    # For example, *client z* may appear in 2016 and 2018, so his year difference is 2.
    # I can't use *client z*'s x_vars to predict y (only a single-period-ahead regression).
    x = df_.loc[df_['shifted'] == 1, x_vars]  # select only contiguous years
    y = df_.loc[df_['Diff'] == 1, y_vars]     # the same, but a year ahead of x
    return (x, y)

# ... Iteration over groupby object
Instead of using groupby() to reduce, I'm expanding the single big file into x and y dataframes, where y holds information one period ahead of x.
As you can see, using a dask dataframe groupby (omitted for simplicity) would parallelize the my_func operation, but as I understand it, it would also wait until all operation nodes are completed, thus depleting my memory. What I would like is to run my_func for a certain set of groups (ideally as many as memory can hold), finish them, save them to disk (without problems related to parallel saving) and only then proceed to the next batch of groups.
Maybe I can use some dask delayed objects, but I don't think it will make good use of my memory if I set the batches manually.
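Here is a rough sketch of the kind of batching I have in mind, using concurrent.futures instead of dask (the batched helper and the batch size of 3 are illustrative, and my_func would have to be importable by the worker processes):
from concurrent.futures import ProcessPoolExecutor
from itertools import islice

def batched(iterable, n):
    """Yield successive lists of at most n items."""
    it = iter(iterable)
    while True:
        chunk = list(islice(it, n))
        if not chunk:
            return
        yield chunk

groups = (g for _, g in concatenated_data.groupby('ID'))

with ProcessPoolExecutor(max_workers=3) as ex, \
        open('X.csv', 'w') as x_file, open('Y.csv', 'w') as y_file:
    for batch in batched(groups, 3):
        # each batch of 3 groups runs in parallel; results are written
        # sequentially in the parent, so only one batch is held in memory at a time
        for x, y in ex.map(my_func, batch):
            x.to_csv(x_file, header=False)
            y.to_csv(y_file, header=False)
Whether this actually helps would depend on how expensive it is to pickle each group's dataframe over to the workers.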
I'm not sure if this is what you are looking for
Generate data
import pandas as pd
import numpy as np
import dask.dataframe as dd
import os

n = 200

df = pd.DataFrame({"grp": np.random.choice(list("abcd"), n),
                   "x": np.random.randn(n),
                   "y": np.random.randn(n),
                   "z": np.random.randn(n)})

df.to_csv("file.csv", index=False)
# we will need this later on
df.to_parquet("file.parquet", index=False)
Pandas solution
# we save our files on a given folder
fldr = "output1"
os.makedirs(fldr, exist_ok=True)

# we read the columns we need only
cols2read = ["grp", "x", "y"]
df = pd.read_csv("file.csv")
df = df[cols2read]

def write_file(x, fldr):
    name = x["grp"].iloc[0]
    x.to_csv(f"{fldr}/{name}.csv", index=False)

df.groupby("grp")\
  .apply(lambda x: write_file(x, fldr))
Dask solution
This is basically the same, but we read the file with dask.dataframe, add meta to our apply, and call compute().
# we save our files on a given folder
fldr = "output2"
os.makedirs(fldr, exist_ok=True)

# we read the columns we need only
cols2read = ["grp", "x", "y"]
df = dd.read_csv("file.csv")
df = df[cols2read]

def write_file(x, fldr):
    name = x["grp"].iloc[0]
    x.to_csv(f"{fldr}/{name}.csv", index=False)

df.groupby("grp")\
  .apply(lambda x: write_file(x, fldr), meta='f8')\
  .compute()
Working with parquet
Here I suggest you work with parquet, as it's going to be way more efficient.
cols2read = ["grp", "x", "y"]
df = dd.read_parquet("file.parquet",
columns=cols2read)
df.to_parquet("output3/",
partition_on="grp")
Inside output3 you will find several folders called grp=a and so on. Each of them could eventually contain several files, but you can read all of them with pd.read_parquet("output3/grp=a").
Ok so I have a dataframe object that's indexed as follows:
index, rev, metric1 (more metrics.....)
exp1, 92365, 0.018987
exp2, 92365, -0.070901
exp3, 92365, 0.150140
exp1, 87654, 0.003008
exp2, 87654, -0.065196
exp3, 87654, -0.174096
For each of these metrics I want to create individual stacked barplots comparing them based on their rev.
Here's what I've tried:
df = df[['rev', 'metric1']]
df = df.groupby("rev")
df.plot(kind = 'bar')
This results in 2 individual bar graphs of the metric. Ideally I would have these two merged and stacked (right now stacked=True does nothing). Any help would be much appreciated.
This would give me my ideal result; however, I don't think reorganizing to fit this is the best way to achieve my goal, as I have many metrics and many revisions.
index, metric1(rev87654), metric1(rev92365)
exp1, 0.018987, 0.003008
exp2, -0.070901, -0.065196
exp3, 0.150140, -0.174096
This is my goal. (made by hand)
http://i.stack.imgur.com/5GRqB.png
Following from this matplotlib gallery example:
http://matplotlib.org/examples/api/barchart_demo.html
There they get multiple bars to plot by calling bar once for each set.
You could access these values in pandas with indexing operations as follows:
import numpy as np
import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(16.2, 10), dpi=300)

Y = Tire2[Tire2.SL == Tire2.SL.unique()[0]].SA.values[0:13]
X = np.linspace(0, np.size(Y), np.size(Y))
ax.bar(X, Y, width=.4)

Y = Tire2[Tire2.SL == Tire2.SL.unique()[2]].SA.values[0:13]
X = np.linspace(0, np.size(Y), np.size(Y)) + .5
ax.bar(X, Y, width=.4, color='r')
Working from the inside out:
Get all of the unique values of 'SL' in one of the cols (rev in your case).
Get a Boolean vector of all rows where 'SL' equals the first (or nth) unique value.
Index Tire2 by that Boolean vector (this will pull out only those rows where the vector is True).
Access the values of SA, or a metric in your case. (I took only the [0:13] values because I was testing this on a huge data set.)
Bar-plot those values.
If your experiments are consistently in order in the frame (as shown), that's that. Otherwise you might need to run a little sorting to get your Y values in the right order; .sort_values(column_name) should take care of that. In my code, I'd slip it in between ...[0]] and .SA...
In general, this kind of operation can really help you out in wrangling big frames. .between is useful. And you can always add, multiply, etc. the Boolean vectors to construct more complex logic.
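For instance, a tiny illustration of combining Boolean masks, reusing the Tire2 columns from the snippet above (the threshold values are made up):
# rows from the first SL level whose SA falls between -5 and 5
mask = (Tire2.SL == Tire2.SL.unique()[0]) & Tire2.SA.between(-5, 5)
subset = Tire2[mask]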
I'm not sure how to get the plot you want automatically without doing exactly the reorganization you specify at the end. The answer by user3823992 gives you more detailed control of the plots, but if you want them more automatic here is some temporary reorganization that should work using the indexing similarly but also concatenating back into a DataFrame that will do the plot for you.
import numpy as np
import pandas as pd

exp = ['exp1', 'exp2', 'exp3'] * 2
rev = [1, 1, 1, 2, 2, 2]
met1 = np.linspace(-0.5, 1, 6)
met2 = np.linspace(1.0, 5.0, 6)
met3 = np.linspace(-1, 1, 6)
df = pd.DataFrame({'rev': rev, 'met1': met1, 'met2': met2, 'met3': met3}, index=exp)

for met in df.columns:
    if met != 'rev':
        merged = df[df['rev'] == df.rev.unique()[0]][met]
        merged.name = merged.name + 'rev' + str(df.rev.unique()[0])
        for rev in df.rev.unique()[1:]:
            tmp = df[df['rev'] == rev][met]
            tmp.name = tmp.name + 'rev' + str(rev)
            merged = pd.concat([merged, tmp], axis=1)
        merged.plot(kind='bar')
This should give you three plots, one for each of my fake metrics.
EDIT: Or something like this might also do the job:
df['exp'] = df.index
# note: the old rows=/cols= keywords are now index=/columns=
pt = pd.pivot_table(df, values='met1', index=['exp'], columns=['rev'])
pt.plot(kind='bar')