Other questions attempting to provide the Python equivalent of R's sweep function (like here) do not really address the multiple-argument case where it is most useful.
Say I wish to apply a two-argument function to each row of a DataFrame, pairing it with the matching element from a column of another DataFrame:
df = data.frame("A" = 1:3,"B" = 11:13)
df2= data.frame("X" = 10:12,"Y" = 10000:10002)
sweep(df,1, FUN="*",df2$X)
In Python I got the equivalent using apply on what is basically a loop over the row positions:
df = pd.DataFrame( { "A" : range(1,4),"B" : range(11,14) } )
df2 = pd.DataFrame( { "X" : range(10,13),"Y" : range(10000,10003) } )
pd.Series(range(df.shape[0])).apply(lambda row_count: np.multiply(df.iloc[row_count,:],df2.iloc[row_count,df2.columns.get_loc('X')]))
I highly doubt this is efficient in pandas; what is a better way of doing it?
Both bits of code should result in a DataFrame/matrix of 6 numbers when applying *:
A B
1 10 110
2 22 132
3 36 156
I should state clearly that the aim is to insert one's own function into this sweep-like behaviour, say:
df = data.frame("A" = 1:3,"B" = 11:13)
df2= data.frame("X" = 10:12,"Y" = 10000:10002)
myFunc = function(a,b) { floor((a + b)^min(a/2,b/3)) }
sweep(df,1, FUN=myFunc,df2$X)
resulting in:
A B
[1,] 3 4
[2,] 3 4
[3,] 3 5
What is a good way of doing that in python pandas?
If I understand this correctly, you are looking to apply a binary function f(x, y) to a DataFrame (for the x) row-wise, with the arguments for y taken from a Series. One way to do this is to borrow the implementation from pandas internals itself. If you want to extend this function (e.g. to apply along columns), it can be done in a similar manner, as long as f is binary. If you need more arguments, you can simply do a functools.partial on f to make it binary.
import pandas as pd
from pandas.core.dtypes.generic import ABCSeries

def sweep(df, series, FUN):
    # note: _combine_match_index is a private pandas method, so this may break in newer versions
    assert isinstance(series, ABCSeries)
    # row-wise application
    assert len(df) == len(series)
    return df._combine_match_index(series, FUN)

# define your binary operator
def f(x, y):
    return x * y
# the input data frames
df = pd.DataFrame( { "A" : range(1,4),"B" : range(11,14) } )
df2 = pd.DataFrame( { "X" : range(10,13),"Y" : range(10000,10003) } )
# apply
test1 = sweep(df, df2.X, f)
# performance
# %timeit sweep(df, df2.X, f)
# 155 µs ± 1.27 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
# another method
import numpy as np
test2 = pd.Series(range(df.shape[0])).apply(lambda row_count: np.multiply(df.iloc[row_count,:],df2.iloc[row_count,df2.columns.get_loc('X')]))
# %timeit performance
# 1.54 ms ± 56.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
assert all(test1 == test2)
Hope this helps.
In pandas, broadcasting a Series down the rows is built into the arithmetic methods via the axis argument:
df.mul(df2.X, axis=0)
A B
0 10 110
1 22 132
2 36 156
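If you need an arbitrary element-wise binary function rather than one of the built-in operator methods (mul, add, ...), one option (a sketch, not part of the answers above) is to drop down to NumPy broadcasting and wrap the result back into a DataFrame:
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": range(1, 4), "B": range(11, 14)})
df2 = pd.DataFrame({"X": range(10, 13), "Y": range(10000, 10003)})

def f(x, y):
    # any element-wise binary function works here; * reproduces the table above
    return x * y

# reshape df2.X into a column vector so it broadcasts across the rows of df
swept = pd.DataFrame(
    f(df.to_numpy(), df2["X"].to_numpy()[:, None]),
    index=df.index,
    columns=df.columns,
)
print(swept)
This stays vectorised as long as f is written with NumPy operations (np.floor, np.minimum, and so on) rather than Python scalar functions.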
I am working on route-building code and have half a million records, which take around 3-4 hours to execute.
For creating dataframe:
# initialize list of lists
data = [[['1027', '(K)', 'TRIM']], [['SJCL', '(K)', 'EJ00', '(K)', 'ZQFC', '(K)', 'DYWH']]]
# Create the pandas DataFrame
df = pd.DataFrame(data, columns = ['route'])
It will look something like this:
route
[1027, (K), TRIM]
[SJCL, (K), EJ00, (K), ZQFC, (K), DYWH]
Code I have used:
def func_list(hd1):
    required_list = []
    for j, i in enumerate(hd1):
        # print(i, j)
        if j == 0:
            req = i
        else:
            if i[0].isupper() or i[0].isdigit():
                required_list.append(req)
                req = i
            else:
                req = req + i
    required_list.append(req)
    return required_list

df['route2'] = df.route.apply(lambda x: func_list(x))
#op
route2
[1027(K), TRIM]
[SJCL(K), EJ00(K), ZQFC(K), DYWH]
For half a million rows it takes 3-4 hours; I don't know how to reduce this, please help.
Use explode to flatten your dataframe:
import numpy as np

# one token per row (the index keeps the original row number); merge '(' tokens into the
# previous token, then keep only tokens whose successor was not merged and regroup per row
sr1 = df['route'].explode()
sr2 = pd.Series(np.where(sr1.str[0] == '(', sr1.shift() + sr1, sr1), index=sr1.index)
df['route'] = sr2[sr1.eq(sr2).shift(-1, fill_value=True)].groupby(level=0).apply(list)
print(df)
# Output:
0 [1027(K), TRIM]
1 [SJCL(K), EJ00(K), ZQFC(K), DYWH]
dtype: object
For 500K records:
7.46 s ± 97.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
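A quick sanity check on the sample rows (a sketch; like the code above, it assumes merged tokens always start with '(' and that no route begins with one):
data = [[['1027', '(K)', 'TRIM']],
        [['SJCL', '(K)', 'EJ00', '(K)', 'ZQFC', '(K)', 'DYWH']]]
sample = pd.DataFrame(data, columns=['route'])
expected = sample['route'].apply(func_list)  # func_list from the question

sr1 = sample['route'].explode()
sr2 = pd.Series(np.where(sr1.str[0] == '(', sr1.shift() + sr1, sr1), index=sr1.index)
result = sr2[sr1.eq(sr2).shift(-1, fill_value=True)].groupby(level=0).apply(list)
assert result.tolist() == expected.tolist()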
I'm attempting to generate codes, parts of which have to follow certain predefined rules (see commentary). I only need as many codes as there are rows in the df passed in - each code is assigned back to that same df on a simple per-row basis. Returning a list 'seems' less ideal than assigning directly to the df within the function, but I've not been able to achieve this. Unfortunately I need to pass in 3 dfs separately due to other constraints in processing elsewhere, but each time they will have a different single-character suffix (e.g. X|Y|Z). The codes do not 'need' to be sequential between the different dfs, although having some sequencing for each could be useful... and is the way I've attempted it thus far.
However, my current 'working' attempt here, though functional, takes far too long. I am hopeful that someone can point out some possible wins for optimising any part of this. Typically each df is <500k rows, more usually 100-200k.
Generate an offer code
Desired outcome:
Sequence that takes the format: YrCodeMthCode+AAAA+99+[P|H|D]
Where:
YrCode and Mth code are supplied*
AAAA is a generated pseudo-unique char sequence*
99 should not contain zeros, and is always 2 digits* (Any, Incl non-sequential)
P|H|D is a defined identifier argument, must be passed in
Typically the df.shape[0] dimensions are never more than 65. But happy to create blank/new and merge with existing if faster.
*The uniqueness of YrCodeMthCode+AAA+99 only needs to cover 500k records each month(as MthCode will change/refresh x12)
import itertools
import random
import numpy as np
import pandas as pd
np.random.seed(0)
df = pd.DataFrame(np.random.randn(200, 3), columns=list('ABC'))
offerCodeLength = 6
allowedOfferCodeChars = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
campaignMonth = 'January'
campaignYear = 2021
yearCodesDict = {2021:'G',2022:'H',2023:'I', 2024:'J', 2025:'K', 2026:'L', 2027:'M'}
monthCodesDict = {'January':'A', 'February':'B', 'March':'C',
                  'April':'D', 'May':'E', 'June':'F',
                  'July':'G', 'August':'H', 'September':'I',
                  'October':'J', 'November':'K', 'December':'L'}
OfferCodeDateStr = str(yearCodesDict[campaignYear])+str(monthCodesDict[campaignMonth])
iterator = 0
breakPoint = df.shape[0]
def generateOfferCode(OfferCodeDateStr, offerCodeLength, breakPoint, OfferCodeSuffix):
    allowedOfferCodeChars = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    iterator = 0  # limit amount generated
    offerCodesList = []
    for item in itertools.product(allowedOfferCodeChars, repeat=offerCodeLength):
        # generate a 2 digit number, with NO zeros (to avoid 0 vs O call centre issues)
        psuedoRandNum = str(int(''.join(random.choices('123456789', k=random.randint(10, 99)))) % 10**2)
        if iterator < breakPoint:  # breakpoint is the length of the associated dataframe / number of codes required
            OfferCodeString = "".join(item)
            OfferCodeString = OfferCodeDateStr + OfferCodeString + psuedoRandNum + OfferCodeSuffix  # join Yr, Mth chars to the generated rest
            offerCodesList.append(OfferCodeString)
            iterator += 1
    return offerCodesList
generateOfferCode(OfferCodeDateStr, offerCodeLength, breakPoint, 'P')
Pretty sure the 'k=randint(10,99))))%10**2' part is less than ideal, but I'm unsure how to better optimise it... a sliced string?
I'm only defining the breakpoint outside, as when I used .shape[0] directly it appeared even slower.
I'm aware that my loop use is probably poor, and that there has to be a more vectorised solution that only creates what I need and applies it directly back to the passed df.
Example timings on mine:
(offerCodeLength set to just 4)
x100 : 5.99 s ± 227 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Wall time: 47.5 s
x1000 : 5.87 s ± 243 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Wall time: 46.4 s
IIUC, you could try:
import random

def generateOfferCode(OfferCodeDateStr, offerCodeLength, breakPoint, offerCodeSuffix):
    seen = set()
    offerCodesList = list()
    for i in range(breakPoint):
        psuedoRandNum = ''.join(random.choices('123456789', k=2))
        OfferCodeString = "".join(random.choices("ABCDEFGHIJKLMNOPQRSTUVWXYZ", k=6))
        while OfferCodeString in seen:
            OfferCodeString = "".join(random.choices("ABCDEFGHIJKLMNOPQRSTUVWXYZ", k=6))
        seen.add(OfferCodeString)
        offerCodesList.append(f"{OfferCodeDateStr}{OfferCodeString}{psuedoRandNum}{offerCodeSuffix}")
    return offerCodesList
df["offerCode"] = generateOfferCode(OfferCodeDateStr, 6, df.shape[0], 'P')
>>> df
A B C offerCode
0 1.764052 0.400157 0.978738 GAZGCPGE28P
1 2.240893 1.867558 -0.977278 GADYNNWU69P
2 0.950088 -0.151357 -0.103219 GAEQUFPI48P
3 0.410599 0.144044 1.454274 GAUCSCHW76P
4 0.761038 0.121675 0.443863 GAFMVTBP28P
.. ... ... ... ...
195 -0.470638 -0.216950 0.445393 GAOXGTOU88P
196 -0.392389 -3.046143 0.543312 GAXPQOFI25P
197 0.439043 -0.219541 -1.084037 GACBKIJV93P
198 0.351780 0.379236 -0.470033 GAVYQEQL46P
199 -0.216731 -0.930157 -0.178589 GALNKYVE23P
Performance
>>> %timeit generateOfferCode(OfferCodeDateStr, 6, 500000, 'P')
829 ms ± 22.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
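If generation speed ever becomes the bottleneck, a further option (a sketch, not part of the answer above; the function name is mine, and it trades the uniqueness guarantee for speed, so duplicates would need re-checking afterwards, e.g. with pd.Series.duplicated) is to draw all characters in one NumPy call and assemble the strings through a fixed-width byte view:
import numpy as np

def generateOfferCodesNp(prefix, n, suffix, codeLength=6, seed=None):
    rng = np.random.default_rng(seed)
    letters = rng.integers(0, 26, size=(n, codeLength)) + ord('A')  # A-Z
    digits = rng.integers(1, 10, size=(n, 2)) + ord('0')            # 1-9 only, no zeros
    raw = np.concatenate([letters, digits], axis=1).astype(np.uint8)
    bodies = raw.view(f'S{codeLength + 2}').ravel().astype(str)     # one fixed-width string per row
    return np.char.add(np.char.add(prefix, bodies), suffix)

df["offerCode"] = generateOfferCodesNp(OfferCodeDateStr, df.shape[0], 'P')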
Suppose I have DataFrame A, which consists of two columns: geometry (point) and hour.
DataFrame B also consists of geometry (shape) and hour.
I am familiar with the standard sjoin. What I want to do is make sjoin link rows from the two tables only when the hours are the same. In traditional join terminology the keys are geometry and hour. How can I attain this?
I have reviewed two logical approaches:
spatial join followed by a filter
shard (filter) the data frames first on hour, spatial join the shards, and concatenate the results from the sharded data frames
then tested the results for equality and ran some timings.
Conclusions
There is little difference between the timings on this test data set; simple() is quicker if the number of points is small.
import pandas as pd
import numpy as np
import geopandas as gpd
import shapely.geometry
import requests
# source some points and polygons
# fmt: off
dfp = pd.read_html("https://www.latlong.net/category/cities-235-15.html")[0]
dfp = gpd.GeoDataFrame(dfp, geometry=dfp.loc[:,["Longitude", "Latitude",]].apply(shapely.geometry.Point, axis=1))
res = requests.get("https://opendata.arcgis.com/datasets/69dc11c7386943b4ad8893c45648b1e1_0.geojson")
df_poly = gpd.GeoDataFrame.from_features(res.json())
# fmt: on
# bulk up number of points
dfp = pd.concat([dfp for _ in range(1000)]).reset_index()
HOURS = 24
dfp["hour"] = np.random.randint(0, HOURS, len(dfp))
df_poly["hour"] = np.random.randint(0, HOURS, len(df_poly))
def simple():
    # join everything spatially, then keep only rows where the hours match
    return gpd.sjoin(dfp, df_poly).loc[lambda d: d["hour_left"] == d["hour_right"]]

def shard():
    # filter both frames to one hour at a time, sjoin each shard, then concatenate
    return pd.concat(
        [
            gpd.sjoin(*[d.loc[d["hour"].eq(h)] for d in [dfp, df_poly]])
            for h in range(HOURS)
        ]
    )
print(f"""length test: {len(simple()) == len(shard())} {len(simple())}
dataframe test: {simple().sort_index().equals(shard().sort_index())}
points: {len(dfp)}
polygons: {len(df_poly)}""")
%timeit simple()
%timeit shard()
output
length test: True 3480
dataframe test: True
points: 84000
polygons: 379
6.48 s ± 311 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
4.05 s ± 34.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
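One small follow-up on the simple() route (a sketch): sjoin suffixes the clashing hour columns as hour_left / hour_right, so after filtering you may want to collapse them back into a single key column:
joined = (
    gpd.sjoin(dfp, df_poly)
    .loc[lambda d: d["hour_left"] == d["hour_right"]]
    .drop(columns="hour_right")
    .rename(columns={"hour_left": "hour"})
)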
I'm missing information on what would be the most efficient (read: fastest) way of using user-defined functions in a groupby-apply setting in either Pandas or Numpy. I have done some of my own tests but am wondering if there are other methods out there that I have not come across yet.
Take the following example DataFrame:
import numpy as np
import pandas as pd
idx = pd.MultiIndex.from_product([range(0, 100000), ["a", "b", "c"]], names = ["time", "group"])
df = pd.DataFrame(columns=["value"], index = idx)
np.random.seed(12)
df["value"] = np.random.random(size=(len(idx),))
print(df.head())
value
time group
0 a 0.154163
b 0.740050
c 0.263315
1 a 0.533739
b 0.014575
I would like to calculate (for example, the below could be any arbitrary user-defined function) the percentage change over time per group. I could do this in a pure Pandas implementation as follows:
def pct_change_pd(series, num):
    return series / series.shift(num) - 1
out_pd = df.sort_values(['group', 'time']).groupby(["group"]).apply(pct_change_pd, num=1)
But I could also modify the function and apply it over a numpy array:
def shift_array(arr, num, fill_value=np.nan):
    if num >= 0:
        return np.concatenate((np.full(num, fill_value), arr[:-num]))
    else:
        return np.concatenate((arr[-num:], np.full(-num, fill_value)))

def pct_change_np(series, num):
    idx = series.index
    arr = series.values.flatten()
    arr_out = arr / shift_array(arr, num=num) - 1
    return pd.Series(arr_out, index=idx)
out_np = df.sort_values(['group', 'time']).groupby(["group"]).apply(pct_change_np, num=1)
out_np = out_np.reset_index(level=2, drop=True)
From my testing, it seems that the numpy method, even with its additional overhead of converting between np.array and pd.Series, is faster.
Pandas:
%%timeit
out_pd = df.sort_values(['group', 'time']).groupby(["group"]).apply(pct_change_pd, num=1)
113 ms ± 548 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Numpy:
%%timeit
out_np = df.sort_values(['group', 'time']).groupby(["group"]).apply(pct_change_np, num=1)
out_np = out_np.reset_index(level=2, drop=True)
94.7 ms ± 642 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
As the index grows and the user-defined function becomes more complex, the Numpy implementation will continue to outperform the Pandas implementation more and more. However, I wonder if there are alternative methods to achieving similar results that are even faster. I'm specifically after another (more efficient) groupby-apply methodology that would allow me to work with any arbitrary user-defined function, not just with the shown example of calculating the percentage change. Would be happy to hear if they exist!
Often the name of the game is to try to use whatever functions are in the toolbox (often optimized and C compiled) rather than applying your own pure Python function. For example, one alternative would be:
def f1(df, num=1):
    grb_kwargs = dict(sort=False, group_keys=False)  # avoid redundant ops
    z = df.sort_values(['group', 'time'])
    return z / z.groupby('group', **grb_kwargs).transform(pd.Series.shift, num) - 1
That is about 32% faster than the .groupby('group').apply(pct_change_pd, num=1). On your system, it would yield around 85ms.
And then, there is the trick of doing your "expensive" calculation on the whole df, but masking out the parts that are spillovers from other groups:
def f2(df, num=1):
    grb_kwargs = dict(sort=False, group_keys=False)  # avoid redundant ops
    z = df.sort_values(['group', 'time'])
    z2 = z.shift(num)
    gid = z.groupby('group', **grb_kwargs).ngroup()
    z2.loc[gid != gid.shift(num)] = np.nan
    return z / z2 - 1
That one is fully 2.1x faster (on your system it would be around 52.8 ms).
Finally, when there is no way to find a vectorized function to use directly, you can use numba to speed up your code (which can then be written with loops to your heart's content)... A classic example is cumulative sum with caps, as in this SO post and this one.
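To make that concrete, here is a minimal numba sketch for the percentage-change example (assuming numba is installed and the frame is sorted by group first; the function name is just illustrative) - the loop stays explicit but runs compiled:
import numba
import numpy as np

@numba.njit
def grouped_pct_change(values, group_ids, num):
    # values and group_ids are aligned 1-D arrays, already sorted by group
    out = np.empty(values.shape[0])
    for i in range(values.shape[0]):
        j = i - num
        if j < 0 or group_ids[j] != group_ids[i]:
            out[i] = np.nan  # no prior observation within this group
        else:
            out[i] = values[i] / values[j] - 1
    return out

z = df.sort_values(['group', 'time'])
gid = z.groupby('group', sort=False).ngroup().to_numpy()
z['pct_change'] = grouped_pct_change(z['value'].to_numpy(), gid, 1)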
Your first function, used with .apply(), gives me this result:
In [42]: %%timeit
...: out_pd = df.sort_values(['group', 'time']).groupby(["group"]).apply(pct_change_pd, num=1)
155 ms ± 887 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Using the groupby groups directly (with dfg = df.sort_values(['group', 'time']).groupby(["group"]) created beforehand), the time goes to 56 ms.
%%timeit
num = 1
outpd_list = []
for g in dfg.groups.keys():
    gc = dfg.get_group(g)
    outpd_list.append(gc['value'] / gc['value'].shift(num) - 1)
out_pd = pd.concat(outpd_list, axis=0)
56 ms ± 821 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
And if you change this one line in the above code to use the built-in pct_change function, you get a bit more time savings:
outpd_list.append(gc['value'].pct_change(num))
41.2 ms ± 283 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
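For this particular UDF there is also a groupby-level shortcut worth noting (the general point about arbitrary functions still stands): pct_change is available directly on the groupby object, which avoids both apply and the explicit loop:
out_pd = (
    df.sort_values(['group', 'time'])
      .groupby('group', sort=False)['value']
      .pct_change(periods=1)
)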
Python: what would be the most efficient way to read a file without a standard delimiter, with millions of records, and place it in a pandas DataFrame?
The file is "file_sd.txt":
A123456MESTUDIANTE 000-12
A123457MPROFESOR 003103
I128734MPROGRAMADOR00-111
A129863FARQUITECTO 00-456
# Fields and position:
# - Activity Indicator : indAct -> 01 Character
# - Person Code : codPer -> 06 Characters
# - Gender (M / F) : sex -> 01 Character
# - Occupation : occupation -> 11 Characters
# - Amount(User format): amount -> 06 Characters (Convert to Number)
I'm not sure whether this is the best option:
import pandas as pd
import numpy as np
def stoI(cad):
    pos = cad.find("-")
    if pos < 0: return int(cad)
    return int(cad[pos+1:])*-1
#Read Txt
data = pd.read_csv(r'D:\file_sd.txt',header = None)
data_sep = pd.DataFrame(
    {
        'indAct': data[0].str.slice(0, 1),
        'codPer': data[0].str.slice(1, 7),
        'sexo':   data[0].str.slice(7, 8),
        'ocupac': data[0].str.slice(8, 19),
        'monto':  np.vectorize(stoI)(data[0].str.slice(19, 25))
    })
print(data_sep)
indAct codPer sexo ocupac monto
0 A 123456 M ESTUDIANTE -12
1 A 123457 M PROFESOR 3103
2 I 128734 M PROGRAMADOR -111
3 A 129863 F ARQUITECTO -456
For 7 million rows, this solution gives:
%timeit df_slice()
11.1 s ± 166 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
You have a fixed-width file, so you should use the appropriate pd.read_fwf reader. In this case we will specify the number of characters that belong to each field and the column names.
df = pd.read_fwf('test.txt', header=None, widths=[1, 6, 1, 11, 6])
df.columns = ['indAct' ,'codPer', 'sexo', 'ocupac', 'monto']
# indAct codPer sexo ocupac monto
#0 A 123456 M ESTUDIANTE 000-12
#1 A 123457 M PROFESOR 003103
#2 I 128734 M PROGRAMADOR 00-111
#3 A 129863 F ARQUITECTO 00-456
Now you can fix the dtypes of fields. 'monto' can be made into a number by stripping the 0s and calling pd.to_numeric.
df['monto'] = pd.to_numeric(df['monto'].str.strip('0'), errors='coerce')
# indAct codPer sexo ocupac monto
#0 A 123456 M ESTUDIANTE -12
#1 A 123457 M PROFESOR 3103
#2 I 128734 M PROGRAMADOR -111
#3 A 129863 F ARQUITECTO -456
As your comment notes, this might on the surface seem slower. The advantage is that pd.read_fwf, as an I/O operation, does a lot of automatic data cleaning:
It will properly downcast columns from object if all the data are int/float/numeric. With string slicing, you need to manually typecast the columns.
It will properly trim the whitespace from strings in fields that don't fully consume the character limit. This is an additional step you need to perform after slicing.
It properly fills missing data (all-blank fields) with NaN. String slicing will keep the blank strings, which must be dealt with separately; pandas does not recognize '' as null, so this is how missing data should be properly handled.
In the case of many object columns that all fully encompass the entire character limit, with no missing data, string slicing has the advantage. But for a general unknown dataset that you need to ingest and ETL, once you start tacking on string stripping and type conversions to every column, you will likely find that the designated pandas I/O operations are the best option.
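Both steps can also be combined by pushing the numeric conversion into the read itself via converters (a sketch, with the same assumptions about the monto field as the stripping logic above; the column names are the ones chosen earlier):
df = pd.read_fwf(
    'test.txt',
    header=None,
    widths=[1, 6, 1, 11, 6],
    names=['indAct', 'codPer', 'sexo', 'ocupac', 'monto'],
    converters={'monto': lambda s: pd.to_numeric(s.strip('0') or '0')},
)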
You can extract the columns using regex pattern matching. For the data in question, we can define the regex as:
data[0].str.extract(r'(?P<indAct>[AI]{1})(?P<codPer>[0-9]{6})(?P<sexo>[MF]{1})(?P<ocupac>[A-Z\s]{11})[0]*[^-|1-9](?P<monto>[0-9-\s]*$)')
The assumption is that the data is clean, which may or may not be a valid assumption.
Here is a comparison of the approach in the question and the one here:
#Data size is 300 rows. ( 4 rows in the question replicated 75 times)
import pandas as pd
import numpy as np
# Returns a number from a string with zeros to the left; for example, stoI('0000-810') returns -810
def stoI(cad):
    pos = cad.find("-")
    if pos < 0: return int(cad)
    return int(cad[pos+1:])*-1

data = pd.read_csv('file_sd.txt', header=None)

# Read Txt
def df_slice():
    return pd.DataFrame(
        {
            'indAct': data[0].str.slice(0, 1),
            'codPer': data[0].str.slice(1, 7),
            'sexo':   data[0].str.slice(7, 8),
            'ocupac': data[0].str.slice(8, 19),
            'monto':  np.vectorize(stoI)(data[0].str.slice(19, 25))
        })

def df_extract():
    return data[0].str.extract(r'(?P<indAct>[AI]{1})(?P<codPer>[0-9]{6})(?P<sexo>[MF]{1})(?P<ocupac>[A-Z\s]{11})[0]*[^-|1-9](?P<monto>[0-9-\s]*$)')
%timeit df_slice()
1.84 ms ± 30.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit df_extract()
975 µs ± 15 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Hope this helps!