Computation between two large columns in a Pandas Dataframe - python

I have a dataframe that has 2 columns of zip codes, and I would like to add another column with their distance values. I am able to do this for a fairly low number of rows, but I am now working with a dataframe that has about 500,000 rows. The code I have works, but on my current dataframe it has been running for about 30 minutes without completing, so I suspect what I'm doing is extremely inefficient.
Here is the code:
import pgeocode
dist = pgeocode.GeoDistance('us')
def distance_pairing(start, end):
    return dist.query_postal_code(start, end)
zips['distance'] = zips.apply(lambda x: distance_pairing(x['zipstart'], x['zipend']), axis=1)
zips
I know looping is out of the question, so is there something else I can do, efficiency wise that would make this better?

Whenever possible, use vectorized operations in pandas and numpy. In this case:
zips['distance'] = dist.query_postal_code(
    zips['zipstart'].values,
    zips['zipend'].values,
)
This won't always work, but in this case, the underlying pgeocode.haversine function is written (in numpy) to accommodate arrays of x and y coordinates. This should speed up your code by several orders of magnitude for a dataframe of this size.
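For reference, a minimal, self-contained sketch of the vectorized call, assuming the same column names as in the question (the sample zip codes here are made up purely for illustration):

import pandas as pd
import pgeocode

# Hypothetical sample data; the real frame has ~500,000 rows
zips = pd.DataFrame({
    'zipstart': ['90210', '10001', '60601'],
    'zipend':   ['94105', '02139', '73301'],
})

dist = pgeocode.GeoDistance('us')

# Passing whole arrays lets pgeocode compute every distance in one vectorized call
zips['distance'] = dist.query_postal_code(
    zips['zipstart'].values,
    zips['zipend'].values,
)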

Related

How to groupby or resample by a specific number of rows -- using Dask (Python)

I'm trying to downsample Dask dataframes by any x number of rows.
For instance, if I was using datetimes as an index, I could just use:
df = df.resample('1h').ohlc()
But I don't want to resample by datetimes, I want to resample by a fixed number of rows...something like:
df = df.resample(rows=100).ohlc()
I did a bunch of searching and found these three old SO pages:
This one suggests:
df.groupby(np.arange(len(df))//x), where x = the number of rows.
pd.DataFrame(df.values.reshape(-1,2,df.shape[1]).mean(1)), but I have trouble understanding this one.
pd.DataFrame(np.einsum('ijk->ik',df.values.reshape(-1,2,df.shape[1]))/2.0), but I also have trouble understanding this one.
This one suggests df.groupby(np.arange(len(df))//x) again.
This one suggests df_sub = df.rolling(x).mean()[::x], but it says it's wasteful, and doesn't seem optimized for Dask.
The best, fastest option seems to be df.groupby(np.arange(len(df))//x), and it works fine in Pandas. However, when I try it in Dask, I get: ValueError: Grouper and axis must be same length
How do I resample by # of rows using Dask?
I have dataframes with:
A standard index (e.g. 1,2,3,4,5...,n)
Datetime values I could potentially use as an index (although I don't necessarily want to)
Non-standard lengths (i.e., some of them have an even number of rows, and some have an odd number).
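For reference, the fixed-row grouping the question mentions does work in plain pandas; here is a minimal sketch (the column name and group size are hypothetical), which may help pin down whether the issue is specific to how Dask partitions the frame:

import numpy as np
import pandas as pd

# Hypothetical single-column frame standing in for the real data
df = pd.DataFrame({'price': np.random.rand(1000)})
x = 100  # number of rows per group

# Group consecutive blocks of x rows and compute OHLC for each block
ohlc = df.groupby(np.arange(len(df)) // x)['price'].ohlc()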

Python scatter matrices from dataframe with too many columns

I am new to Python and data science, and I am currently working on a project based on a very large dataframe with 75 columns. I am doing some data exploration and I would like to check for possible correlations between the columns. For smaller dataframes I know I could use pandas.plotting.scatter_matrix() on the dataframe to do so. However, in my case this produces a 75x75 matrix, and I can't even visualize the individual plots.
An alternative would be creating lists of 5 columns and using scatter_matrix multiple times, but this method would produce too many scatter matrices. For instance, with 15 columns this would be:
import pandas as pd
df = pd.read_csv('dataset.csv')
list1 = df.columns[0:5]
list2 = df.columns[5:10]
list3 = df.columns[10:15]
pd.plotting.scatter_matrix(df[list1])
pd.plotting.scatter_matrix(df[list2])
pd.plotting.scatter_matrix(df[list3])
In order to use this same method with 75 columns, I'd have to go on until list15. This looks very inefficient. I wonder if there would be a better way to explore correlations in my dataset.
The problem here is, to a lesser extent, the technical part. Producing the plots (5,625 of them) will take quite a long time, and the plots will also take up a fair bit of memory.
So I would ask a few questions to get around the problems:
Is it really necessary to have all these scatter plots?
Can I reduce the dimensionality in advance?
Why do I have such a high number of dimensions?
If the plots are really useful, you could produce them on your own and stitch them together, or simply wait until the function finishes.
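If the chunked scatter matrices are still wanted, the approach from the question can at least be automated with a loop rather than writing list1 through list15 by hand; a rough sketch, assuming the same 'dataset.csv' and a chunk size of 5 columns:

import pandas as pd

df = pd.read_csv('dataset.csv')

chunk_size = 5
for start in range(0, df.shape[1], chunk_size):
    cols = df.columns[start:start + chunk_size]
    # Plots the pairwise scatter plots within this chunk of columns only
    pd.plotting.scatter_matrix(df[cols], figsize=(10, 10))

Note that, like the original approach, this only shows pairs within the same chunk, not correlations across chunks.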

Speed up iteration over DataFrame items

I wrote a function in which each cell of a DataFrame is divided by a number saved in another dataframe.
from tqdm import tqdm

def calculate_dfA(df_t, xout):
    df_A = df_t.copy()
    vector_x = xout.T
    for index_col, column in tqdm(df_A.iteritems()):
        for index_row, row in df_A.iterrows():
            df_A.iloc[index_row, index_col] = df_A.iloc[index_row, index_col] / vector_x.iloc[0, index_col]
    return df_A
The DataFrame on which I apply the calculation has a size of 14839 rows x 14839 columns. According to tqdm, the processing speed is roughly 4.5 s/it. Accordingly, the calculation would require approximately 50 days, which is not feasible for me. Is there a way to speed up my calculation?
You need to vectorize your division:
result = df_A.values/vector_x
This will broadcast along the row dimension and divide along the column dimension, as you seem to ask for.
Compared to your double for-loop, you are taking advantage of contiguity and homogeneity of the data in memory. This allows for a massive speedup.
Edit: Coming back to this answer today, I noticed that converting to a numpy array first speeds up the computation. Locally I get a 10x speedup for an array of a size similar to the one in the question above. I have edited my answer.
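As a small, self-contained sketch of the broadcasting idea (the shapes here are tiny and purely illustrative):

import numpy as np
import pandas as pd

df_t = pd.DataFrame(np.arange(12.0).reshape(3, 4))   # 3 rows x 4 columns
xout = pd.DataFrame([[1.0], [2.0], [4.0], [8.0]])    # one divisor per column, as a column vector
vector_x = xout.T.values                             # shape (1, 4) after transposing

# (3, 4) / (1, 4): numpy broadcasts the divisors across every row at once
result = pd.DataFrame(df_t.values / vector_x,
                      index=df_t.index, columns=df_t.columns)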
I'm on mobile now, but you should try to avoid every for loop in Python; there's always a better way.
For one, I know you can multiply a pandas column (Series) by another column to get your desired result.
I think that to multiply every column with the matching column of another DataFrame you would still need to iterate (but only with one for loop, which is already a performance boost).
I would strongly recommend that you temporarily convert to a numpy ndarray and work with that.

Alternative method for two way interpolation

I wrote some code to perform interpolation based on two criteria: the amount of insurance and the deductible amount (%). I was struggling to do the interpolation all at once, so I split the filtering. The table hf contains the known data on which I am basing my interpolation results. The table df contains the new data, which needs the developed factors interpolated based on hf.
Right now my workaround is to first filter each table based on the Ded_amount percentage, then perform the interpolation into an empty dataframe and append after each loop iteration.
I feel like this is inefficient and that there is a better way to do this; I'm looking for feedback on improvements I can make. Thanks.
Test data is provided below.
import pandas as pd
from scipy import interpolate
known_data={'AOI':[80000,100000,150000,200000,300000,80000,100000,150000,200000,300000],'Ded_amount':['2%','2%','2%','2%','2%','3%','3%','3%','3%','3%'],'factor':[0.797,0.774,0.739,0.733,0.719,0.745,0.737,0.715,0.711,0.709]}
new_data={'AOI':[85000,120000,130000,250000,310000,85000,120000,130000,250000,310000],'Ded_amount':['2%','2%','2%','2%','2%','3%','3%','3%','3%','3%']}
hf=pd.DataFrame(known_data)
df=pd.DataFrame(new_data)
deduct_fact=pd.DataFrame()
for deduct in hf['Ded_amount'].unique():
    deduct_table=hf[hf['Ded_amount']==deduct]
    aoi_table=df[df['Ded_amount']==deduct]
    x=deduct_table['AOI']
    y=deduct_table['factor']
    f=interpolate.interp1d(x,y,fill_value="extrapolate")
    xnew=aoi_table[['AOI']]
    ynew=f(xnew)
    append_frame=aoi_table
    append_frame['Factor']=ynew
    deduct_fact=deduct_fact.append(append_frame)
Yep, there is a way to do this more efficiently, without having to make a bunch of intermediate dataframes and append them. Have a look at this code:
import pandas as pd
from scipy import interpolate
known_data={'AOI':[80000,100000,150000,200000,300000,80000,100000,150000,200000,300000],'Ded_amount':['2%','2%','2%','2%','2%','3%','3%','3%','3%','3%'],'factor':[0.797,0.774,0.739,0.733,0.719,0.745,0.737,0.715,0.711,0.709]}
new_data={'AOI':[85000,120000,130000,250000,310000,85000,120000,130000,250000,310000],'Ded_amount':['2%','2%','2%','2%','2%','3%','3%','3%','3%','3%']}
hf=pd.DataFrame(known_data)
df=pd.DataFrame(new_data)
# Create this column now
df['Factor'] = None
# I like specifying this explicitly; easier to debug
deduction_amounts = list(hf.Ded_amount.unique())
for deduction_amount in deduction_amounts:
    # You can index a dataframe and call a column in one line
    x, y = hf[hf['Ded_amount']==deduction_amount]['AOI'], hf[hf['Ded_amount']==deduction_amount]['factor']
    f = interpolate.interp1d(x, y, fill_value="extrapolate")
    # This is the most important bit. Lambda function on the dataframe
    df['Factor'] = df.apply(lambda x: f(x['AOI']) if x['Ded_amount']==deduction_amount else x['Factor'], axis=1)
The way the lambda function works is:
It goes row by row through the column 'Factor' and gives it a value based on conditions on the other columns.
It returns the interpolation of the AOI column of df (this is what you called xnew) if the deduction amount matches, otherwise it just returns the same thing back.
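If the row-wise apply itself ever becomes the bottleneck, a hedged variant of the same idea is to assign each group's interpolated values in one vectorized step through a boolean mask (this sketch reuses the hf, df, deduction_amounts, and interpolate names defined above):

for deduction_amount in deduction_amounts:
    known = hf[hf['Ded_amount'] == deduction_amount]
    f = interpolate.interp1d(known['AOI'], known['factor'], fill_value="extrapolate")
    mask = df['Ded_amount'] == deduction_amount
    # interp1d accepts an array, so the whole group is interpolated at once
    df.loc[mask, 'Factor'] = f(df.loc[mask, 'AOI'])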

Speeding up the iteration and calculation of a pandas dataframe

I have a dataframe which consists of 6K items, each item having 70 fields to describe it. So the length of my dataframe is around 420K rows. Then I apply my function like this:
df_dirty[['basic_score', 'additional_score']] = df_dirty.apply(compare.compare, axis=1)
The compare function takes a row from df_dirty and then takes an ID from that row; depending on the ID, it takes two other cells from that row and performs a comparison of those two cells. The comparison may be as simple as
if cell1 == cell2:
    return True
else:
    return False
or a more difficult calculation that takes the values of those cells and then calculates if their ratio is in some range or whatever.
Overall, the function I apply to my dataframe performs some more actions, so it's very time consuming for large datasets of complex data (not only clean numbers, but combinations of numbers and text, etc.).
I was wondering if there are any faster ways to do this than simply applying a function?
I have some ideas about what I should do with this:
Put everything on a server and perform all calculations overnight, so it would be faster to just ask for an already calculated result.
I also thought that this would maybe be faster if I wrote my compare function in C. What are my other options?
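One concrete direction, before reaching for C or an overnight server job: wherever the comparison boils down to something like the cell1 == cell2 check above, it can be written as a vectorized column operation instead of a row-wise apply. A rough sketch with hypothetical column names and made-up thresholds:

import pandas as pd

# Hypothetical frame standing in for df_dirty
df_dirty = pd.DataFrame({
    'cell1': [1.0, 2.0, 3.0, 4.0],
    'cell2': [1.0, 5.0, 3.0, 8.0],
})

# Vectorized equality check: one boolean column, no Python-level row loop
df_dirty['basic_score'] = df_dirty['cell1'] == df_dirty['cell2']

# The "ratio is in some range" style of check can also be vectorized
# (the 0.8-1.25 bounds are purely illustrative)
ratio = df_dirty['cell1'] / df_dirty['cell2']
df_dirty['additional_score'] = ratio.between(0.8, 1.25)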
