Optimizing Python Code: Faster groupby and for loops - python

I want to make the for loop below faster in Python.
import math
import numpy as np
import pandas as pd
import scipy.stats
np.random.seed(1)
xl = pd.DataFrame({'Concat' : np.arange(101,999), 'ships_x' : np.random.randint(1001,3000,size=898)})
yl = pd.DataFrame({'PickDate' : np.random.randint(1,8,size=10000),'Concat' : np.random.randint(101,999,size=10000), 'ships_x' : np.random.randint(101,300,size=10000), 'ships_y' : np.random.randint(1001,3000,size=10000)})
tempno = [np.random.randint(1,100,size=5)]
k=1
p = pd.DataFrame(0,index=np.arange(len(xl)),columns=['temp','cv']).astype(object)
for ib in range(len(xl)):
    tempno1 = np.append(tempno,ib)
    temp = list(set(tempno1))
    temptab = yl[yl['Concat'].isin(np.array(xl['Concat'][tempno1]))].groupby('PickDate')[['ships_x','ships_y']].sum().reset_index()
    temptab['contri'] = temptab['ships_x']/temptab['ships_y']
    p.at[k-1,'cv'] = 1 if math.isnan(scipy.stats.variation(temptab['contri'])) else scipy.stats.variation(temptab['contri'])
    p.at[k-1,'temp'] = temp
    k = k+1
where:
xl, yl - the two data frames I am working on, with columns such as Concat, ships_x and ships_y.
tempno - an initial list of indices into the xl dataframe, referring to a list of 'Concat' values.
In each iteration of the for loop, one extra index is added to tempno, and the 'yl' dataframe is subset to the rows whose 'Concat' values match those selected from 'xl'. The coefficient of variation (from the scipy library) of the result is then recorded in a new dataframe 'p'.
The problem is that this takes too long, because the loop runs for thousands of iterations. The groupby line takes the most time. I have made a few changes; the code now looks like the version below, with the changes noted in comments. There is a slight improvement, but not enough for my purpose. Please suggest the fastest possible way to implement this. Many thanks.
# Getting all tempno1 into a list with one step
tempno1 = [np.append(tempno,ib) for ib in range(len(xl))]
temp = [list(set(tempk)) for tempk in tempno1]
# Taking only needed columns from x and y dfs
xtemp = xl[['Concat']]
ytemp = yl[['Concat','ships_x','ships_y','PickDate']]
# Shortlisting y df and groupby in two different steps
ytemp = [ytemp[ytemp['Concat'].isin(np.array(xtemp['Concat'][tempnokk]))] for tempnokk in tempno1]
temptab = [ytempk.groupby('PickDate')[['ships_x','ships_y']].sum().reset_index() for ytempk in ytemp]
tempkcontri = [tempk['ships_x']/tempk['ships_y'] for tempk in temptab]
tempkcontri = [contri.to_frame('contri') for contri in tempkcontri]
temptab = [temptab[i].join(tempkcontri[i]) for i in range(len(temptab))]
pcv = [1 if math.isnan(scipy.stats.variation(temptabkk['contri'])) else scipy.stats.variation(temptabkk['contri']) for temptabkk in temptab]
p = pd.DataFrame({'temp' : temp,'cv': pcv})
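One way to avoid repeating the groupby in every iteration is to aggregate yl once per (Concat, PickDate) pair and then only add precomputed rows inside the loop. The following is a rough sketch of that idea, not a tested drop-in replacement. The function name cv_per_added_index is hypothetical, xl['Concat'] is assumed to hold unique values, base_idx is assumed to be a flat collection of row positions of xl, and the coefficient of variation is computed with NumPy as the population standard deviation divided by the mean, which corresponds to the default behaviour of scipy.stats.variation.

```python
import numpy as np
import pandas as pd


def cv_per_added_index(xl, yl, base_idx):
    """Hypothetical helper: CV of ships_x/ships_y per PickDate for base_idx plus each row of xl.

    Assumes xl['Concat'] is unique and base_idx holds row positions of xl.
    """
    grouped = yl.groupby(['Concat', 'PickDate'])

    def pivot(series):
        # One row per Concat (in xl order), one column per PickDate, 0 where there is no data.
        wide = series.unstack(fill_value=0).reindex(index=xl['Concat'], fill_value=0)
        return wide.to_numpy(dtype=float)

    x = pivot(grouped['ships_x'].sum())
    y = pivot(grouped['ships_y'].sum())
    n = pivot(grouped.size())  # row counts, to know which PickDates are present

    base = sorted(set(base_idx))  # isin() ignores duplicates, so do the same here
    in_base = set(base)
    base_x = x[base].sum(axis=0)
    base_y = y[base].sum(axis=0)
    base_n = n[base].sum(axis=0)

    out = np.empty(len(xl))
    for ib in range(len(xl)):
        extra = 0.0 if ib in in_base else 1.0
        tx = base_x + extra * x[ib]
        ty = base_y + extra * y[ib]
        # Keep only the PickDates that the original groupby would have returned.
        present = (base_n + extra * n[ib]) > 0
        contri = tx[present] / ty[present]
        cv = contri.std() / contri.mean() if present.any() else np.nan
        out[ib] = 1.0 if np.isnan(cv) else cv
    return out
```

With the data from the question this could be called as cv_per_added_index(xl, yl, tempno[0]), since tempno is a list wrapping a single array. The loop that remains only adds short NumPy vectors, so its cost no longer depends on the size of yl.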

Related

Calculation of the removal percentage for chemical parameters (faster code)

I have to calculate the removal percentages of chemical/biological parameters (e.g. after an oxidation process) in a wastewater treatment plant.
My code works so far and does exactly what it should, but it is really slow.
For a 15x80 data frame, the calculation on the original dataset took about 10 s on my laptop and 4 s on my PC. That is too long, especially if I have to deal with more rows.
What the code does:
The formula for the single removal is defined as: 1 - n(i)/n(i-1)
and for the total removal: 1 - n(i)/n(0)
Every measuring point has its own ID. The code searches for the IDs, performs the calculation, and saves the result in the data frame.
Here is an example (I can't post the original data):
import pandas as pd
import numpy as np
data = {"ID": ["X1_P0001", "X2_P0001", "X3_P0001", "X1_P0002", "X2_P0002", "X3_P0002", "X4_P0002","X5_P0002", "X1_P0003", "X2_P0003", "X3_P0003"],
        "Measurement": [100, 80, 60, 120,90,70,50,25, 85,65,35]}
df = pd.DataFrame(data)
df["S_removal"]= np.nan
df["T_removal"]= np.nan
Data Frame before calculation
This is my function for the calculation:
def removal_TEST(Rem1, Measure, Rem2):
    lst = [i.split("_")[1] for i in df["ID"]] #takes relevant ID information
    y = np.unique(lst) #stores unique ID values to loop over them
    for ID in y:
        id_list = []
        for i in range(0, len(df["ID"])):
            if ID in df["ID"][i]:
                id_list.append(i)
            else: # this stores only the relevant id in a new list
                id_list.append(np.nan)
        indexlist = pd.Series(id_list)
        first_index = indexlist.first_valid_index() #gets the first and last index of the id list
        last_index = indexlist.last_valid_index()
        col_indizes = []
        for i in range(first_index, last_index+1):
            col_indizes.append(i)
        for i in col_indizes:
            if i == 0:
                continue # for i=0 there is no 0-1 element, so i=0 should be skipped
            else:
                Rem1[i]= 1-(Measure[i]/Measure[i-1])
        Rem1[first_index]= np.nan #first entry of an ID must be NaN value
        for i in range(first_index, last_index+1):
            col_indizes.append(i)
        for i in range(len(Rem2)):
            for i in col_indizes:
                Rem2[i]= 1-(Measure[i]/Measure[first_index])
        Rem2[first_index]= np.nan
this is the result:
Final Data Frame
I am new to Python and to stackoverflow (so sorry if my code and question are not so good to read). Are there any good libraries to speed up my code, or do you have some suggestions?
Thank you :)
Your use of Pandas seems to be getting in the way of solving the problem. The only relevant state seems to be when the group changes and the first and previous measurement values for each row.
I'd be tempted to solve this just using Python primitives, but you could solve this in other ways if you had lots of data (i.e. millions of rows).
import pandas as pd
df = pd.DataFrame({
"ID": ["X1_P0001", "X2_P0001", "X3_P0001", "X1_P0002", "X2_P0002", "X3_P0002", "X4_P0002","X5_P0002", "X1_P0003", "X2_P0003", "X3_P0003"],
"Measurement": [100, 80, 60, 120,90,70,50,25, 85,65,35],
"S_removal": float('nan'),
"T_removal": float('nan'),
})
# somewhere keep track of the last group identifier
last = None
# iterate over rows
for idx, ID, meas in zip(df.index, df['ID'], df['Measurement']):
    # what's the current group name
    _, grp = ID.split('_', 1)
    # see if we're in a new group
    if grp != last:
        last = grp
        # track the group's measurement
        grp_meas = meas
    else:
        # calculate things
        df.loc[idx, 'S_removal'] = 1 - meas / last_meas
        df.loc[idx, 'T_removal'] = 1 - meas / grp_meas
    # keep track of the last measurement
    last_meas = meas
I've commented the code in the hopes it makes sense. This takes ~2 seconds for 1000 copies of your example data, so 11000 rows.
Given that OP has said this needs to be done for a wide dataset, here's another version that reduces runtime to ~30ms for 11000 rows and 2 columns:
import numpy as np
import pandas as pd
data = {
"ID": ["X1_P0001", "X2_P0001", "X3_P0001", "X1_P0002", "X2_P0002", "X3_P0002", "X4_P0002","X5_P0002", "X1_P0003", "X2_P0003", "X3_P0003"],
"M1": [100, 80, 60, 120,90,70,50,25, 85,65,35],
"M2": [100, 80, 60, 120,90,70,50,25, 85,65,35],
}
# reset_index() because code below assumes they are unique
df = pd.concat([pd.DataFrame(data)]*1000).reset_index()
# column names
measurement_col_names = ['M1', 'M2']
single_output_names = ['S1', 'S2']
total_output_names = ['T1', 'T2']
# somewhere keep track of the last group identifier
last = None
# somewhere to store intermediate state
vals_idx = []
meas_vals = []
last_vals = []
grp_vals = []
# iterate over rows
for idx, ID, meas in zip(df.index, df['ID'], df.loc[:,measurement_col_names].values):
    # what's the current group name
    _, grp = ID.split('_', 1)
    # we're in a new group
    if grp != last:
        last = grp
        # track the group's measurement
        grp_meas = meas
    else:
        # track values and which rows they apply to
        vals_idx.append(idx)
        meas_vals.append(meas)
        last_vals.append(last_meas)
        grp_vals.append(grp_meas)
    # keep track of the last measurement
    last_meas = meas
# convert to numpy array so it vectorises nicely
meas_vals = np.array(meas_vals)
# perform calculation using fast numpy operations
df.loc[vals_idx, single_output_names] = 1 - (meas_vals / last_vals)
df.loc[vals_idx, total_output_names] = 1 - (meas_vals / grp_vals)
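For comparison, the same two formulas can also be written without any Python-level loop by letting pandas group on the ID suffix. This is only a sketch and not part of the answer above. It assumes that the part of the ID after the first underscore identifies the group and that rows within a group are already in measurement order.

```python
import pandas as pd

df = pd.DataFrame({
    "ID": ["X1_P0001", "X2_P0001", "X3_P0001", "X1_P0002", "X2_P0002", "X3_P0002",
           "X4_P0002", "X5_P0002", "X1_P0003", "X2_P0003", "X3_P0003"],
    "Measurement": [100, 80, 60, 120, 90, 70, 50, 25, 85, 65, 35],
})

# Group key: the part of the ID after the first underscore, e.g. "P0001".
grp = df["ID"].str.split("_", n=1).str[1]
by_grp = df["Measurement"].groupby(grp)

# Single removal, 1 - n(i)/n(i-1): shift() yields NaN for the first row of each group.
df["S_removal"] = 1 - df["Measurement"] / by_grp.shift()

# Total removal, 1 - n(i)/n(0): blank out the first row of each group.
first = by_grp.transform("first")
df["T_removal"] = (1 - df["Measurement"] / first).where(by_grp.cumcount() > 0)
```

Unlike the loop versions, groupby collects all rows with the same suffix even if they are not adjacent, so the results only agree when each group forms one contiguous run, as in the example data.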

How to vectorize this peak finding for loop in Python?

Basically I'm writing a peak finding function that needs to be able to beat scipy.argrelextrema in benchmarking. Here is a link to the data I'm using, and the code:
https://drive.google.com/open?id=1U-_xQRWPoyUXhQUhFgnM3ByGw-1VImKB
If this link expires, the data can be found at dukascopy bank's online historical data downloader.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
data = pd.read_csv('EUR_USD.csv')
data.columns = ['Date', 'open', 'high', 'low', 'close','volume']
data.Date = pd.to_datetime(data.Date, format='%d.%m.%Y %H:%M:%S.%f')
data = data.set_index(data.Date)
data = data[['open', 'high', 'low', 'close']]
data = data.drop_duplicates(keep=False)
price = data.close.values
def fft_detect(price, p=0.4):
    trans = np.fft.rfft(price)
    trans[round(p*len(trans)):] = 0
    inv = np.fft.irfft(trans)
    dy = np.gradient(inv)
    peaks_idx = np.where(np.diff(np.sign(dy)) == -2)[0] + 1
    valleys_idx = np.where(np.diff(np.sign(dy)) == 2)[0] + 1
    patt_idx = list(peaks_idx) + list(valleys_idx)
    patt_idx.sort()
    label = [x for x in np.diff(np.sign(dy)) if x != 0]
    # Look for Better Peaks
    l = 2
    new_inds = []
    for i in range(0,len(patt_idx[:-1])):
        search = np.arange(patt_idx[i]-(l+1),patt_idx[i]+(l+1))
        if label[i] == -2:
            idx = price[search].argmax()
        elif label[i] == 2:
            idx = price[search].argmin()
        new_max = search[idx]
        new_inds.append(new_max)
    plt.plot(price)
    plt.plot(inv)
    plt.scatter(patt_idx,price[patt_idx])
    plt.scatter(new_inds,price[new_inds],c='g')
    plt.show()
    return peaks_idx, price[peaks_idx]
It smooths the data using a fast Fourier transform (FFT), takes the derivative to find the minimum and maximum indices of the smoothed data, and then finds the corresponding peaks in the unsmoothed data. Sometimes the peaks it finds are not ideal because of smoothing effects, so I run this for loop to search for higher or lower points around each index, within the bounds specified by l. I need help vectorizing this for loop, and I have no idea how to do it. Without the for loop, my code is about 50% faster than scipy.argrelextrema, but the for loop slows it down. If I can find a way to vectorize it, it would be a very quick and effective alternative to scipy.argrelextrema. These two images represent the data without and with the for loop respectively.
This may do it. It's not perfect, but hopefully it does what you want and shows you a bit about how to vectorize. Happy to hear any improvements you think up.
label = np.array(label[:-1]) # not sure why this is 1 unit longer than search.shape[0]?
# the idea is to make the index matrix you're for looping over row by row all in one go.
# This part is sloppy and you can improve this generation.
search = np.vstack([np.arange(patt_idx[i]-(l+1),patt_idx[i]+(l+1)) for i in range(0,len(patt_idx[:-1]))]) # a list, since np.vstack does not accept a generator in recent NumPy; you can refine this.
# then you can make the price matrix
price = price[search]
# and you can swap the sign of elements so you only need to do argmin instead of both argmin and argmax
price[label==-2] = - price[label==-2]
# now find the indices of the minimum price on each row
idx = np.argmin(price,axis=1)
# and then extract the refined indices from the search matrix
new_inds = search[np.arange(idx.shape[0]),idx] # this too can be cleaner.
# not sure what's going on here so that search[:,idx] doesn't work for me
# probably just a misunderstanding
I find that this reproduces your result but I did not time it. I suspect the search generation is quite slow but probably still faster than your for loop.
Edit:
Here's a better way to produce search:
patt_idx = np.array(patt_idx)
starts = patt_idx[:-1]-(l+1)
stops = patt_idx[:-1]+(l+1)
ds = stops-starts
s0 = stops.shape[0]
s1 = ds[0]
search = np.reshape(np.repeat(stops - ds.cumsum(), ds) + np.arange(ds.sum()),(s0,s1))
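Because every window has the same width, the index matrix can also be built by broadcasting the candidate indices against a fixed row of offsets. The sketch below puts the whole refinement step into one function. The name refine_extrema is hypothetical, the labels are assumed to contain only -2 (peak) and 2 (valley), and the clipping at the array edges is an addition that the original loop does not have.

```python
import numpy as np


def refine_extrema(price, patt_idx, label, l=2):
    """Hypothetical helper: move each index to the local max (label -2) or min (label 2)."""
    price = np.asarray(price, dtype=float)
    patt_idx = np.asarray(patt_idx)
    label = np.asarray(label)
    # Same window as np.arange(i - (l + 1), i + (l + 1)), built for every i at once.
    offsets = np.arange(-(l + 1), l + 1)
    search = patt_idx[:, None] + offsets
    # Assumption: clamp to valid positions so windows near the edges do not wrap around.
    search = np.clip(search, 0, len(price) - 1)
    windows = price[search]
    # Negate the peak windows so a single argmin finds maxima and minima alike.
    signed = np.where(label[:, None] == -2, -windows, windows)
    best = signed.argmin(axis=1)
    return search[np.arange(len(patt_idx)), best]
```

Inside fft_detect this would be called with the candidates that the loop visits, for example refine_extrema(price, patt_idx[:-1], label[:len(patt_idx) - 1], l).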
Here is an alternative. It uses a list comprehension, which is generally faster than an explicit for loop.
l = 2
# Define the bounds beforehand, its marginally faster than doing it in the loop
upper = np.array(patt_idx) + l + 1
lower = np.array(patt_idx) - l - 1
# List comprehension...
new_inds = [price[low:hi].argmax() + low if lab == -2 else
            price[low:hi].argmin() + low
            for low, hi, lab in zip(lower, upper, label)]
# Find maximum within each interval
new_max = price[new_inds]
new_global_max = np.max(new_max)

Efficiently assign arrays along an axis (common)number in Python

I have an image which I subsample
Count=0
classim = np.zeros([temp1.shape[0],temp1.shape[1]])
for rows in range(int(np.floor(temp1.shape[0]/SAMPLE_SIZE))):
    for cols in range(int(np.floor(temp1.shape[1]/SAMPLE_SIZE))):
        classim[np.multiply(rows,SAMPLE_SIZE):np.multiply(rows+1,SAMPLE_SIZE),
                np.multiply(cols,SAMPLE_SIZE):np.multiply(cols+1,SAMPLE_SIZE)] = predict.argmax(axis=-1)[Count]
        Count = np.add(Count,1)
This is terribly slow. I get the labels from "predict.argmax(axis=-1)[Count]", but can of course have it in vector form.
In other words, how can I vectorise the above loop?
Taking your row calculations outside the inner loop would help a little, since they are then made only once per row.
A few other tidy-ups give:
Count = 0
classim = np.zeros_like(temp1)
predict_args = predict.argmax(axis=-1)
for rows in range(temp1.shape[0]//SAMPLE_SIZE):
    row_0 = rows * SAMPLE_SIZE
    row_1 = (rows+1) * SAMPLE_SIZE
    for cols in range(temp1.shape[1]//SAMPLE_SIZE):
        col_0 = cols * SAMPLE_SIZE
        col_1 = (cols+1) * SAMPLE_SIZE
        classim[row_0:row_1,col_0:col_1] = predict_args[Count]
        Count+=1
You would need to tell us more about the predict object before I could do much more. But these changes will help a little.
--EDIT--
You could take advantage of the numpy.repeat function. Then there is no need to iterate through the whole classim:
SAMPLE_SIZE = 2
temp1 = np.arange(20*20).reshape((20,20))
sample_shape = (temp1.shape[0]//SAMPLE_SIZE, temp1.shape[1]//SAMPLE_SIZE)
#This line should work as per your question, but returns a single value
#predict_args = predict.argmax(axis=-1)
#Use this for illustration purposes
predict_args = np.arange(sample_shape[0] * sample_shape[1])
subsampled = predict_args.reshape(sample_shape)
classim = np.repeat(np.repeat(subsampled,SAMPLE_SIZE,axis =1),SAMPLE_SIZE, axis=0)
print(subsampled)
print(classim)
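The np.repeat approach can be wrapped so that it also covers images whose sides are not exact multiples of SAMPLE_SIZE, where the loop version leaves the leftover border at zero. The following sketch assumes the labels are ordered row by row, as Count is in the loop. The function name upsample_labels and its parameters are made up for illustration.

```python
import numpy as np


def upsample_labels(labels, image_shape, sample_size):
    """Hypothetical helper: paint one label per sample_size x sample_size block."""
    n_rows = image_shape[0] // sample_size
    n_cols = image_shape[1] // sample_size
    # Labels are assumed to be ordered row by row, like Count in the loop version.
    blocks = np.asarray(labels)[:n_rows * n_cols].reshape(n_rows, n_cols)
    classim = np.zeros(image_shape, dtype=blocks.dtype)
    # Pixels beyond the last full block stay 0, matching the loop.
    classim[:n_rows * sample_size, :n_cols * sample_size] = np.repeat(
        np.repeat(blocks, sample_size, axis=0), sample_size, axis=1)
    return classim
```

In the question's terms the call would look like upsample_labels(predict.argmax(axis=-1), temp1.shape, SAMPLE_SIZE), provided temp1 is two-dimensional and predict yields one label per block.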

Creating new pandas columns with original value plus random number in error range

I have a pandas dataframe which has a column 'INTENSITY' and a numpy array of same length containing the error for each intensity. I would like to generate columns with randomly generated numbers in the error range.
So far I use two nested for loops to create the new columns but I feel like this is inefficient:
theor_err = [ sqrt(abs(x)) for x in theor_df[str(INTENSITY)] ]
theor_err = np.asarray(theor_err)
for nr_sample in range(2):
    sample = np.zeros(len(theor_df[str(INTENSITY)]))
    for i, error in enumerate(theor_err):
        sample[i] = theor_df[str(INTENSITY)][i] + random.uniform(-error, error)
    theor_df['gen_{}'.format(nr_sample)] = Series(sample, index=theor_df.index)
theor_df.head()
Is there a more efficient way of approaching a problem like this?
Numpy can handle arrays for you. So, you can do it like this:
import pandas as pd
import numpy as np
a=pd.DataFrame([10,20,15,30],columns=['INTENSITY'])
a['theor_err']=np.sqrt(np.abs(a.INTENSITY))
a['sample']=a['INTENSITY']+np.random.uniform(-a['theor_err'],a['theor_err'])
Suppose you want to generate 6 samples. You can try the code below, and tune the number of samples by setting the value of k.
df = pd.DataFrame([[1],[2],[3],[4],[-5]], columns=["intensity"])
k = 6
sample_names = ["sample" + str(i+1) for i in range(k)]
df["err"] = np.sqrt(np.abs((df["intensity"])))
df[sample_names] = pd.DataFrame(
    df["err"].map(lambda x: np.random.uniform(-x, x, k)).values.tolist())
df.loc[:,sample_names] = df.loc[:,sample_names].add(df.intensity, axis=0)
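All k columns can also be drawn in a single call, because the bounds of a uniform draw broadcast against the requested output shape. The sketch below uses NumPy's Generator API. The seed, the example values and the gen_ column names (borrowed from the question) are only there for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seeded only to make the sketch reproducible

df = pd.DataFrame({"INTENSITY": [10.0, 20.0, 15.0, 30.0, -5.0]})
k = 3  # number of generated columns

intensity = df["INTENSITY"].to_numpy()[:, None]  # shape (n, 1)
err = np.sqrt(np.abs(intensity))                 # shape (n, 1)
# low and high broadcast against size=(n, k): one draw per row and per sample.
samples = intensity + rng.uniform(-err, err, size=(len(df), k))

gen_cols = [f"gen_{i}" for i in range(k)]
df = df.join(pd.DataFrame(samples, columns=gen_cols, index=df.index))
```

Each generated value lies within the square root of the absolute intensity of its own row, which is the same error range the nested loops in the question use.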

Python Pandas Merge Causing Memory Overflow

I'm new to Pandas and am trying to merge a few subsets of data. I'm giving a specific case where this happens, but the question is general: How/why is it happening and how can I work around it?
The data I load is around 85 Megs or so but I often watch my python session run up close to 10 gigs of memory usage then give a memory error.
I have no idea why this happens, but it's killing me as I can't even get started looking at the data the way I want to.
Here's what I've done:
Importing the Main data
import io, zipfile
import requests
import numpy as np
import pandas as pd
STAR2013url="http://www3.cde.ca.gov/starresearchfiles/2013/p3/ca2013_all_csv_v3.zip"
STAR2013fileName = 'ca2013_all_csv_v3.txt'
r = requests.get(STAR2013url)
z = zipfile.ZipFile(io.BytesIO(r.content))
STAR2013=pd.read_csv(z.open(STAR2013fileName))
Importing some Cross-Referencing Tables
STARentityList2013url = "http://www3.cde.ca.gov/starresearchfiles/2013/p3/ca2013entities_csv.zip"
STARentityList2013fileName = "ca2013entities_csv.txt"
r = requests.get(STARentityList2013url)
z = zipfile.ZipFile(io.BytesIO(r.content))
STARentityList2013=pd.read_csv(z.open(STARentityList2013fileName))
STARlookUpTestID2013url = "http://www3.cde.ca.gov/starresearchfiles/2013/p3/tests.zip"
STARlookUpTestID2013fileName = "Tests.txt"
r = requests.get(STARlookUpTestID2013url)
z = zipfile.ZipFile(io.BytesIO(r.content))
STARlookUpTestID2013=pd.read_csv(z.open(STARlookUpTestID2013fileName))
STARlookUpSubgroupID2013url = "http://www3.cde.ca.gov/starresearchfiles/2013/p3/subgroups.zip"
STARlookUpSubgroupID2013fileName = "Subgroups.txt"
r = requests.get(STARlookUpSubgroupID2013url)
z = zipfile.ZipFile(io.BytesIO(r.content))
STARlookUpSubgroupID2013=pd.read_csv(z.open(STARlookUpSubgroupID2013fileName))
Renaming a Column ID to Allow for Merge
STARlookUpSubgroupID2013 = STARlookUpSubgroupID2013.rename(columns={'001':'Subgroup ID'})
STARlookUpSubgroupID2013
Successful Merge
merged = pd.merge(STAR2013,STARlookUpSubgroupID2013, on='Subgroup ID')
Try a second merge. This is where the Memory Overflow Happens
merged=pd.merge(merged, STARentityList2013, on='School Code')
I did all of this in ipython notebook, but don't think that changes anything.
Although this is an old question, I recently came across the same problem.
In my instance, duplicate keys are required in both dataframes, and I needed a method which could tell if a merge will fit into memory ahead of computation, and if not, change the computation method.
The method I came up with is as follows:
Calculate merge size:
def merge_size(left_frame, right_frame, group_by, how='inner'):
    left_groups = left_frame.groupby(group_by).size()
    right_groups = right_frame.groupby(group_by).size()
    left_keys = set(left_groups.index)
    right_keys = set(right_groups.index)
    intersection = right_keys & left_keys
    left_diff = left_keys - intersection
    right_diff = right_keys - intersection
    left_nan = len(left_frame[left_frame[group_by] != left_frame[group_by]])
    right_nan = len(right_frame[right_frame[group_by] != right_frame[group_by]])
    left_nan = 1 if left_nan == 0 and right_nan != 0 else left_nan
    right_nan = 1 if right_nan == 0 and left_nan != 0 else right_nan
    sizes = [(left_groups[group_name] * right_groups[group_name]) for group_name in intersection]
    sizes += [left_nan * right_nan]
    left_size = [left_groups[group_name] for group_name in left_diff]
    right_size = [right_groups[group_name] for group_name in right_diff]
    if how == 'inner':
        return sum(sizes)
    elif how == 'left':
        return sum(sizes + left_size)
    elif how == 'right':
        return sum(sizes + right_size)
    return sum(sizes + left_size + right_size)
Note:
At present with this method, the key can only be a label, not a list. Using a list for group_by currently returns a sum of merge sizes for each label in the list. This will result in a merge size far larger than the actual merge size.
If you are using a list of labels for the group_by, the final row size is:
min([merge_size(df1, df2, label, how) for label in group_by])
Check if this fits in memory
The merge_size function defined here returns the number of rows which will be created by merging two dataframes together.
By multiplying this with the count of columns from both dataframes, then multiplying by the size of np.float[32/64], you can get a rough idea of how large the resulting dataframe will be in memory. This can then be compared against psutil.virtual_memory().available to see if your system can calculate the full merge.
import numpy as np
import psutil
def mem_fit(df1, df2, key, how='inner'):
    rows = merge_size(df1, df2, key, how)
    cols = len(df1.columns) + (len(df2.columns) - 1)
    required_memory = (rows * cols) * np.dtype(np.float64).itemsize
    return required_memory <= psutil.virtual_memory().available
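To see why duplicate keys make a merge grow so quickly, it can help to count the rows of an inner join on a toy example: every key contributes (rows on the left) x (rows on the right). The sketch below is a simplified illustration of that counting idea, not the merge_size function above. The helper name inner_merge_rows and the toy frames are made up, and the key column is assumed to contain no NaN values.

```python
import pandas as pd


def inner_merge_rows(left, right, key):
    """Hypothetical helper: row count of pd.merge(left, right, on=key) for an inner join.

    Assumes the key column has no NaN; pd.merge matches NaN with NaN,
    while value_counts() ignores NaN by default.
    """
    left_counts = left[key].value_counts()
    right_counts = right[key].value_counts()
    # Keys found on one side only are filled with 0 and contribute no rows.
    return int(left_counts.mul(right_counts, fill_value=0).sum())


left = pd.DataFrame({"School Code": [1, 1, 2, 3], "score": [10, 20, 30, 40]})
right = pd.DataFrame({"School Code": [1, 1, 1, 2, 4], "name": list("abcde")})

estimate = inner_merge_rows(left, right, "School Code")
actual = len(pd.merge(left, right, on="School Code"))
```

Here key 1 appears twice on the left and three times on the right, so it alone produces six of the seven merged rows. A key that repeats thousands of times on both sides multiplies in the same way.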
The merge_size method has been proposed as an extension of pandas in this issue: https://github.com/pandas-dev/pandas/issues/15068
