I have a dataframe of paths. The task is to get the last modification time for each folder, using something like datetime.fromtimestamp(os.path.getmtime('PATH_HERE')), into a separate column.
import pandas as pd
import numpy as np
import os
from datetime import datetime
df1 = pd.DataFrame({'Path' : ['C:\\Path1' ,'C:\\Path2', 'C:\\Path3']})
# For an MVCE, use the commented-out code below. WARNING: this WILL create directories on your machine.
#for path in df1['Path']:
# os.mkdir(r'PUT_YOUR_PATH_HERE\\' + os.path.basename(path))
I can do the task with the below, but it is a slow loop if I have many folders:
for each_path in df1['Path']:
    df1.loc[df1['Path'] == each_path, 'Last Modification Time'] = datetime.fromtimestamp(os.path.getmtime(each_path))
How would I go about vectorizing this process to improve speed? os.path.getmtime cannot accept the Series. I'm looking for something like:
df1['Last Modification Time'] = datetime.fromtimestamp(os.path.getmtime(df1['Path']))
I'm going to present three approaches, assuming we work with 100 paths. Approach 3 is, I think, strictly preferable.
# Data initialisation:
paths100 = ['a_whatever_path_here'] * 100
df = pd.DataFrame(columns=['paths', 'time'])
df['paths'] = paths100
def fun1():
    # Naive for loop. High readability, slow.
    for path in df['paths']:
        mask = df['paths'] == path
        df.loc[mask, 'time'] = datetime.fromtimestamp(os.path.getmtime(path))

def fun2():
    # Naive for loop optimised. Medium readability, medium speed.
    for i, path in enumerate(df['paths']):
        df.loc[i, 'time'] = datetime.fromtimestamp(os.path.getmtime(path))

def fun3():
    # List comprehension. High readability, high speed.
    df['time'] = [datetime.fromtimestamp(os.path.getmtime(path)) for path in df['paths']]
%timeit fun1()
>>> 164 ms ± 2.03 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit fun2()
>>> 11.6 ms ± 67.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit fun3()
>>> 13.1 ns ± 0.0327 ns per loop (mean ± std. dev. of 7 runs, 100000000 loops each)
You can use a groupby transform (so that you are doing the expensive call only once per group):
g = df1.groupby("Path")["Path"]
s = pd.to_datetime(g.transform(lambda x: os.path.getmtime(x.name)))
df1["Last Modification Time"] = s # putting this on two lines so it looks nicer...
I have a dataframe with links and I want to add quotes to the start and end of each link. This is what my df looks like:
links
https://us.search.yahoo.com
https://us.search.google.com
https://us.search.wikipedia.com
I want my output to be:
links
'https://us.search.yahoo.com'
'https://us.search.google.com'
'https://us.search.wikipedia.com'
Thank you in advance!
I was curious about the impact of using the less efficient apply method here.
Given:
                          link   text                      title
0  https://us.search.yahoo.com  Yahoo  Yahoo Search - Web Search
I explode its size to 1 million rows:
df = pd.concat([df]*int(1e6), ignore_index=True)
And then run some timing tests:
%timeit df.link.apply(repr)
93.5 ms ± 2.35 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit "'" + df.link + "'"
68.3 ms ± 1.6 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
It's clear that the vectorized version of this, "'" + df.link + "'" is significantly faster, but it's unlikely to make a practical difference unless your dataframe is insanely large.
This is relatively crude, but given you've not supplied a reproducible example it's the best I can come up with:
import pandas as pd
df = pd.DataFrame({'link': 'https://us.search.yahoo.com', 'text': 'Yahoo', 'title': 'Yahoo Search - Web Search'}, index=[0])
print(df["link"].apply(repr))
Output:
'https://us.search.yahoo.com'
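If you want the quoted strings stored back in the dataframe rather than just printed, a minimal sketch using the vectorized concatenation timed above:
# Sketch: vectorized string concatenation, assigned back to the column.
df['link'] = "'" + df['link'] + "'"
print(df['link'])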
I was writing a parser for some data stored in large text files (1 GB+).
Output was a numpy array of complex numbers, so naturally I used statements like
vals[...] = real_part + 1j*imag_part
where real_part and imag_part were obtained from numpy.fromstring(...)
I noticed that if I simply replace vals[...] = real_part + 1j*imag_part with
vals[...] = 1j*imag_part + real_part
I get an almost 2x performance boost, which can be significant for large datasets.
I did some testing and obtained confusing results:
Code:
import timeit
import numpy as np
a = np.float64(1.0)
print('type of 1j*a+a is',type(1j*a+a))
print('type of a+1j*a is',type(a+1j*a))
print('type of a+a*1j is',type(a+a*1j))
setup_line = 'import numpy as np; b = np.zeros(1,dtype=complex)'
N = 1000000
t1 = timeit.timeit("a=np.fromstring('1.1 2.2',sep=' ',dtype=float); b[0]=1j*a[1]+a[0]", setup=setup_line, number=N)
t2 = timeit.timeit("a=np.fromstring('1.1 2.2',sep=' ',dtype=float); b[0]=a[0]+1j*a[1]", setup=setup_line, number=N)
t3 = timeit.timeit("a=np.fromstring('1.1 2.2',sep=' ',dtype=float); b[0]=a[0]+a[1]*1j", setup=setup_line, number=N)
print(f't2/t1 = {t2/t1}')
print(f't3/t1 = {t3/t1}')
print(f'type of 1.0*a is {type(1.0*a)}')
print(f'type of 1.0.__mul__(a) is {type((1.0).__mul__(a))}')
print(f'type of a.__rmul__(1.0) is {type(a.__rmul__(1.0))}')
print(f'type of a*1.0 is {type(a*1.0)}')
print(f'type of 1j*a is {type(1j*a)}')
print(f'type of a*1j is {type(a*1j)}')
Output:
type of 1j*a+a is <class 'complex'>
type of a+1j*a is <class 'numpy.complex128'>
type of a+a*1j is <class 'numpy.complex128'>
t2/t1 = 2.720535618997823
t3/t1 = 3.9634173211365487
type of 1.0*a is <class 'numpy.float64'>
type of 1.0.__mul__(a) is <class 'float'>
type of a.__rmul__(1.0) is <class 'numpy.float64'>
type of a*1.0 is <class 'numpy.float64'>
type of 1j*a is <class 'complex'>
type of a*1j is <class 'numpy.complex128'>
So performance is better in the first case because all calculations are executed in Python's built-in complex class. The performance boost is also close to the practical case of line-by-line parsing.
What's more confusing is why the type of 1.0*a is not equal to that of (1.0).__mul__(a) but is equal to that of a.__rmul__(1.0). Is that how it's supposed to be?
What's the difference between 1.0*a and 1j*a?
I agree with @hpaulj. Below you can see that the difference goes away for larger amounts of data.
For
a = np.random.random(size=10**7)
%timeit 1j*a+a
%timeit a+1j*a
%timeit a+a*1j
I get
152 ms ± 7.02 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
158 ms ± 2.27 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
151 ms ± 2.4 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
i.e. no significant difference. Did you see a difference on your actual data, or only in the single-value case?
To answer your comment: does it matter, performance-wise, whether I use numpy functions line by line or on a large amount of data at once?
Unfortunately this is one of the most important things about using numpy. You really want to avoid loops that apply numpy functions one element at a time; instead, work with whole arrays and let numpy handle everything at once. See the sketch below.
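As a minimal illustration of that point (a sketch of my own, not from the original answer):
import numpy as np

a = np.random.random(10**6)

# Elementwise loop: every iteration pays Python call and conversion overhead.
out_loop = np.empty(len(a), dtype=complex)
for i in range(len(a)):
    out_loop[i] = a[i] + 1j * a[i]

# Whole-array operation: one call, the loop runs in compiled code.
out_vec = a + 1j * a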
My suggestion is to use pandas for this task. My test shows that it is ~3 times faster than plain Python. My setup:
import numpy as np
import pandas as pd
from io import StringIO
n = 10**7
test_string = "\n".join(map(lambda x: f"{x[0]}, {x[1]}", zip(np.random.random((n)),np.random.random((n)))))
Pure Python, taking ~11.6 s ± 64.9 ms:
%%timeit
arr = np.empty(n,dtype='complex')
for i,line in enumerate(test_string.split('\n')):
    x,y = line.split(', ')
    arr[i] = float(x)+1j*float(y)
Pandas, taking ~3.6 s ± 43.6 ms:
%%timeit
df = pd.read_csv(StringIO(test_string),names=['1', 'j'])
out = df['1']+1j*df['j'].to_numpy()
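For this simple two-column format you could also stay entirely in numpy (a sketch of my own, assuming the whole string fits in memory):
# Sketch: parse every number in one call, then build the complex array in bulk.
flat = np.fromstring(test_string.replace('\n', ','), sep=',')
out = flat[0::2] + 1j * flat[1::2]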
I suspect the 'raw' times will tell us more than the ratios:
In [182]: timeit 1j*a+a
209 ns ± 12.1 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [183]: timeit a+1j*a
5.26 µs ± 11.1 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [184]: timeit a+a*1j
10 µs ± 9.28 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
With a single value calculation like this, function calls and conversion to/from numpy can dominate the times. We aren't seeing the 'pure' calculation times. For that we need to work with larger arrays.
It's well known, for example, that math.sin is faster than np.sin on a single value; np.sin only wins once you have hundreds of values. I suspect something like this is going on here, but I don't have time now to explore it.
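If you want to verify that yourself, a quick sketch (no numbers shown, since they depend on your machine):
import math
import timeit
import numpy as np

x = 0.5
arr = np.random.random(1000)

print(timeit.timeit(lambda: math.sin(x), number=100_000))   # scalar: math.sin wins
print(timeit.timeit(lambda: np.sin(x), number=100_000))     # scalar: numpy call overhead dominates
print(timeit.timeit(lambda: np.sin(arr), number=100_000))   # array: one call handles 1000 values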
I am trying to build a dataframe from data gathered from a crontab file. I am unsure how exactly to take the pieces and compile them into a dataframe.
Here is what I have so far:
from crontab import CronTab
import re
system_cron=CronTab()
user_cron=CronTab(user=True)
user_cron
#create and clean list of bash files
line=0
listJobs=[]
for job in user_cron:
    match = re.search('.sh', str(user_cron[line]))
    if match:
        pos=str(user_cron[line]).find('.sh')+3
        start=(str(user_cron[line])[::-1]).find(' ', 0)
        print(str(user_cron[line])[-start:pos])
        listJobs.append(str(user_cron[line])[-start:pos])
    line = line+1
listJobs = list(set(listJobs))
listJobs.remove("keybash.sh")
# listJobs is now a list of .sh files including their paths
# loop through the .sh files to pull the python notebooks
for job in listJobs:
    with open(job, 'r') as file:
        text=file.read()
    text = text.splitlines()
    print(job)
    print(text)
type(text)
listFiles=[]
line=0
for file in text:
    match = re.search('ipynb', str(text[line]))
    if match:
        pos=str(text[line]).find('ipynb')+5
        start=(str(text[line])[::-1]).find(' ', 0)
        print(str(text[line])[-start:pos])
        listFiles.append(str(text[line])[-start:pos])
    line=line+1
listFiles
So now I have two lists with different numbers of rows, and I'm not sure how to join them to get something like this:
I'm wondering if I should have used a dictionary, or converted to a dataframe and then looped through that. What would be the most efficient way to alter my code to achieve what I need?
%%timeit
df = pd.DataFrame(columns=['key'])
for i in range(10):
    df.loc[i] = i
15.8 ms ± 204 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%%timeit
df = pd.DataFrame()
for i in range(10):
    df = df.append({'key': i}, ignore_index=True)
14.7 ms ± 849 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%%timeit
data = []
for i in range(10):
    data.append({'key': i})
df = pd.DataFrame(data)
668 µs ± 14.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
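Applied to the crontab code above, that last pattern means collecting plain Python records and building the dataframe once at the end. A sketch (assumes pandas is imported as pd; the column names 'script' and 'notebook' are my own):
# Sketch: accumulate one dict per match, construct the DataFrame in a single call.
records = []
for job in listJobs:
    with open(job, 'r') as fh:
        for text_line in fh.read().splitlines():
            if re.search('ipynb', text_line):
                pos = text_line.find('ipynb') + 5
                start = (text_line[::-1]).find(' ', 0)
                records.append({'script': job, 'notebook': text_line[-start:pos]})
df = pd.DataFrame(records)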
I have code that runs a simulation for 21 days, in which n orders arrive per day and some values have to be assigned to each order by randomly sampling another dataframe. This simulation has to run 100 times to calculate some cumulative statistics. I have tried two approaches, but they take so long that I am looking for a way to speed this up.
Approach #1:
def get_values():
    index=random.randint(0,(len(shipments)-1)) #generate number
    country,segment,time,weight,costs,carrier=shipments[['SHIP TO COUNTRY','SEGMENT','TRANSIT TIME','OTM_WEIGHT','TOTAL_TRANSPORT_COST','CARRIER']].iloc[index] #data sampling using the random number created
    costs=round(costs,2)
    return [country,segment,time,weight,costs,carrier]

def get_arrival():
    index=random.randint(0,len(daily_arrivals)-1)
    arrival=(daily_arrivals['DELIVERY DATE'].iloc[index])
    return arrival

def LC_run(env,df,j,balance,dispatch,orders):
    while True:
        arrival=get_arrival()
        for m in range(arrival):
            order_nmbr='order #{}'.format(m)
            pickup=env.now
            country,segment,time,weight,cost,carrier=get_values()
            delivery=(pickup+time)
            df.loc[len(df)]=[order_nmbr,country,segment,pickup,delivery,time,weight,cost,carrier,j]
            balance+=cost
            dispatch+=weight
            orders+=1
        yield env.timeout(1)
df=pd.DataFrame(columns=['Order #','SHIP TO COUNTRY','SEGMENT','PICK UP DATE','DELIVERY DATE','TRANSIT TIME','OTM_WEIGHT', 'TRANSPORTATION COST','CARRIER','SIM_RUN'])
pdf=pd.DataFrame(columns=['SIM_RUN','TOTAL_TRANSPORTATION_COST','TOTAL_WEIGHT','TOTAL_NUMBER_OF_ORDERS'])
for j in range(10):
    global balance, dispatch, orders
    balance=0
    dispatch=0
    orders=0
    #how many times is the simulation running
    env = simpy.Environment()
    env.process(LC_run(env,df,j,balance,dispatch,orders))
    env.run(until=21) #how long is the simulation running until
    pdf.loc[len(pdf)]=[j,balance,dispatch,orders]
    print('run #{}'.format(j))
Approach #2 (note: pick-up time is not important, it was just an extra to add; also, here I run the simulation one day at a time instead of the full 21 days, and instead of n arrivals I have an arrival rate):
def get_values():
    index=random.randint(0,(len(shipments)-1)) #generate number
    country,segment,time,weight,costs,carrier=shipments[['SHIP TO COUNTRY','SEGMENT','TRANSIT TIME','OTM_WEIGHT','TOTAL_TRANSPORT_COST','CARRIER']].iloc[index] #data sampling using the random number created
    costs=round(costs,2)
    return [country,segment,time,weight,costs,carrier]

def get_arrival():
    index=random.randint(0,len(daily_arrivals)-1)
    arrival=(24/daily_arrivals['DELIVERY DATE'].iloc[index])
    #print(arrival)
    return arrival

def LC_run(env,df,i,arrival,j,balance,dispatch,orders):
    while True:
        order_nmbr='order #{}'.format(orders)
        pickup=i
        pick_up_time=('{} day, {} hours').format(i, i+(env.now*24))
        country,segment,time,weight,cost,carrier=get_values()
        delivery=(pickup+time)
        df.loc[len(df)]=[order_nmbr,country,segment,pickup,pick_up_time,delivery,time,weight,cost,carrier,j]
        yield env.timeout(arrival)
        balance+=cost
        dispatch+=weight
        orders+=1
df=pd.DataFrame(columns=['Order #','SHIP TO COUNTRY','SEGMENT','PICK UP DATE','PICKUP TIME','DELIVERY DATE','TRANSIT TIME','OTM_WEIGHT', 'TRANSPORTATION COST','CARRIER','SIM_RUN'])
pdf=pd.DataFrame(columns=['SIM_RUN','TOTAL_TRANSPORTATION_COST','TOTAL_WEIGHT','TOTAL_NUMBER_OF_ORDERS'])
for j in range(10):
    global balance, dispatch, orders
    balance=0
    dispatch=0
    orders=0
    for i in range(21): #how many times is the simulation running
        env = simpy.Environment()
        arrival=get_arrival()
        env.process(LC_run(env,df,i,arrival,j,balance,dispatch,orders))
        env.run(until=24) #how long is the simulation running until
    pdf.loc[len(pdf)]=[j,balance,dispatch,orders]
    print('run #{}'.format(j))
loc is not so fast; df.iloc[-1] can speed things up. Here are three different approaches:
%timeit df.loc[len(df)] = [99, 'abc', .23, now, 99]
2.61 ms ± 94.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit df.values[-1,:] = [99, 'abc', .23, now, 99]
5.59 ms ± 291 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit df.iloc[-1] = [99, 'abc', .23, now, 99]
1.03 ms ± 88.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
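Note that df.iloc[-1] assigns into the existing last row rather than appending a new one, so it only helps with a pre-allocated frame. If you can restructure the simulation, the bigger win (consistent with the list-of-dicts timing shown earlier in this collection) is to collect rows in a plain Python list inside LC_run and build the dataframe once at the end. A sketch, reusing the names from Approach #1 (the changed LC_run signature is my own):
rows = []  # shared list, filled by LC_run instead of df.loc[len(df)] = ...

def LC_run(env, rows, j):
    while True:
        arrival = get_arrival()
        for m in range(arrival):
            country, segment, time, weight, cost, carrier = get_values()
            rows.append({'Order #': 'order #{}'.format(m),
                         'SHIP TO COUNTRY': country, 'SEGMENT': segment,
                         'PICK UP DATE': env.now, 'DELIVERY DATE': env.now + time,
                         'TRANSIT TIME': time, 'OTM_WEIGHT': weight,
                         'TRANSPORTATION COST': cost, 'CARRIER': carrier,
                         'SIM_RUN': j})
        yield env.timeout(1)

# after all simulation runs:
df = pd.DataFrame(rows)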
I'm missing information on what would be the most efficient (read: fastest) way of using user-defined functions in a groupby-apply setting in either Pandas or Numpy. I have done some of my own tests but am wondering if there are other methods out there that I have not come across yet.
Take the following example DataFrame:
import numpy as np
import pandas as pd
idx = pd.MultiIndex.from_product([range(0, 100000), ["a", "b", "c"]], names = ["time", "group"])
df = pd.DataFrame(columns=["value"], index = idx)
np.random.seed(12)
df["value"] = np.random.random(size=(len(idx),))
print(df.head())
               value
time group
0    a      0.154163
     b      0.740050
     c      0.263315
1    a      0.533739
     b      0.014575
I would like to calculate (for example, the below could be any arbitrary user-defined function) the percentage change over time per group. I could do this in a pure Pandas implementation as follows:
def pct_change_pd(series, num):
    return series / series.shift(num) - 1
out_pd = df.sort_values(['group', 'time']).groupby(["group"]).apply(pct_change_pd, num=1)
But I could also modify the function and apply it over a numpy array:
def shift_array(arr, num, fill_value=np.nan):
    if num >= 0:
        return np.concatenate((np.full(num, fill_value), arr[:-num]))
    else:
        return np.concatenate((arr[-num:], np.full(-num, fill_value)))

def pct_change_np(series, num):
    idx = series.index
    arr = series.values.flatten()
    arr_out = arr / shift_array(arr, num=num) - 1
    return pd.Series(arr_out, index=idx)
out_np = df.sort_values(['group', 'time']).groupby(["group"]).apply(pct_change_np, num=1)
out_np = out_np.reset_index(level=2, drop=True)
From my testing, it seems that the numpy method, even with its additional overhead of converting between np.array and pd.Series, is faster.
Pandas:
%%timeit
out_pd = df.sort_values(['group', 'time']).groupby(["group"]).apply(pct_change_pd, num=1)
113 ms ± 548 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Numpy:
%%timeit
out_np = df.sort_values(['group', 'time']).groupby(["group"]).apply(pct_change_np, num=1)
out_np = out_np.reset_index(level=2, drop=True)
94.7 ms ± 642 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
As the index grows and the user-defined function becomes more complex, the Numpy implementation will continue to outperform the Pandas implementation more and more. However, I wonder if there are alternative methods to achieving similar results that are even faster. I'm specifically after another (more efficient) groupby-apply methodology that would allow me to work with any arbitrary user-defined function, not just with the shown example of calculating the percentage change. Would be happy to hear if they exist!
Often the name of the game is to try to use whatever functions are in the toolbox (often optimized and C compiled) rather than applying your own pure Python function. For example, one alternative would be:
def f1(df, num=1):
    grb_kwargs = dict(sort=False, group_keys=False)  # avoid redundant ops
    z = df.sort_values(['group', 'time'])
    return z / z.groupby('group', **grb_kwargs).transform(pd.Series.shift, num) - 1
That is about 32% faster than the .groupby('group').apply(pct_change_pd, num=1). On your system, it would yield around 85ms.
And then, there is the trick of doing your "expensive" calculation on the whole df, but masking out the parts that are spillovers from other groups:
def f2(df, num=1):
    grb_kwargs = dict(sort=False, group_keys=False)  # avoid redundant ops
    z = df.sort_values(['group', 'time'])
    z2 = z.shift(num)
    gid = z.groupby('group', **grb_kwargs).ngroup()
    z2.loc[gid != gid.shift(num)] = np.nan
    return z / z2 - 1
That one is fully 2.1x faster (on your system would be around 52.8ms).
Finally, when there is no vectorized function you can use directly, you can use numba to speed up your code (which can then be written with loops to your heart's content)... A classic example is cumulative sum with caps, as in this SO post and this one.
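To give a flavour of the numba route (a sketch of my own, not from the linked posts; it hard-codes the pct-change logic rather than taking an arbitrary UDF):
import numba
import numpy as np

@numba.njit
def grouped_pct_change(values, group_ids, num):
    # Plain loop, compiled by numba: look back num rows, but only within the same group.
    out = np.empty(values.shape[0])
    for i in range(values.shape[0]):
        j = i - num
        if j < 0 or group_ids[j] != group_ids[i]:
            out[i] = np.nan
        else:
            out[i] = values[i] / values[j] - 1.0
    return out

# Usage sketch, reusing the sorted frame from above:
# z = df.sort_values(['group', 'time'])
# gid = z.groupby('group', sort=False).ngroup().to_numpy()
# z['pct_change'] = grouped_pct_change(z['value'].to_numpy(), gid, 1)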
Your first function and using .apply() gives me this result:
In [42]: %%timeit
...: out_pd = df.sort_values(['group', 'time']).groupby(["group"]).apply(pct_change_pd, num=1)
155 ms ± 887 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Using groups (with dfg = df.sort_values(['group', 'time']).groupby('group') computed beforehand), the time goes to 56 ms.
%%timeit
num=1
outpd_list = []
for g in dfg.groups.keys():
    gc = dfg.get_group(g)
    outpd_list.append(gc['value'] / gc['value'].shift(num) - 1)
out_pd = pd.concat(outpd_list, axis=0)
56 ms ± 821 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
And if you change this one line in the above code to use the built-in function, you get a bit more time savings:
outpd_list.append(gc['value'].pct_change(num))
41.2 ms ± 283 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
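For this particular UDF there is also a grouped built-in that avoids the Python-level loop entirely; a minimal sketch, assuming your pandas version provides GroupBy.pct_change:
# Sketch: let pandas do the per-group shift and divide in one call.
out_pd = df.sort_values(['group', 'time']).groupby('group')['value'].pct_change(1)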