I am working on route-building code and have half a million records, which take around 3-4 hours to process.
For creating dataframe:
# initialize list of lists
data = [[['1027', '(K)', 'TRIM']], [['SJCL', '(K)', 'EJ00', '(K)', 'ZQFC', '(K)', 'DYWH']]]
# Create the pandas DataFrame
df = pd.DataFrame(data, columns = ['route'])
It will look something like this:
route
[1027, (K), TRIM]
[SJCL, (K), EJ00, (K), ZQFC, (K), DYWH]
Code I have used:
def func_list(hd1):
    required_list = []
    for j, i in enumerate(hd1):
        # print(i, j)
        if j == 0:
            req = i
        else:
            if i[0].isupper() or i[0].isdigit():
                required_list.append(req)
                req = i
            else:
                req = req + i
    required_list.append(req)
    return required_list

df['route2'] = df['route'].apply(func_list)
#op
route2
[1027(K), TRIM]
[SJCL(K), EJ00(K), ZQFC(K), DYWH]
For half a million rows it takes 3-4 hours; I don't know how to reduce this, please help.
Use explode to flatten your dataframe:
sr1 = df['route'].explode()
sr2 = pd.Series(np.where(sr1.str[0] == '(', sr1.shift() + sr1, sr1), index=sr1.index)
df['route'] = sr2[sr1.eq(sr2).shift(-1, fill_value=True)].groupby(level=0).apply(list)
print(df)
# Output:
0 [1027(K), TRIM]
1 [SJCL(K), EJ00(K), ZQFC(K), DYWH]
dtype: object
For 500K records:
7.46 s ± 97.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
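For comparison, here is a sketch of the same merging logic as a plain per-row function (the name `merge_tokens` is my own); it folds any token that starts with `(` into the previous element, which matches the original's upper/digit test for this data:

```python
import pandas as pd

def merge_tokens(tokens):
    # fold tokens that start with '(' into the previous element
    out = []
    for t in tokens:
        if out and t.startswith('('):
            out[-1] += t
        else:
            out.append(t)
    return out

df = pd.DataFrame({'route': [['1027', '(K)', 'TRIM'],
                             ['SJCL', '(K)', 'EJ00', '(K)', 'ZQFC', '(K)', 'DYWH']]})
df['route2'] = df['route'].map(merge_tokens)
```

A simple Python loop per row like this is often competitive when each list is short, since the work per row is tiny.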
Related
Suppose I have dataframe A, which consists of two columns: geometry (point) and hour.
Dataframe B also consists of geometry (shape) and hour.
I am familiar with the standard sjoin. What I want to do is make sjoin link rows from the two tables only when the hours are the same; in traditional join terminology the keys are geometry and hour. How can I achieve this?
I have reviewed two logical approaches:
spatial join followed by a filter
shard (filter) the data frames first on hour, spatial join the shards, and concatenate the results from the sharded data frames
Then test the results for equality and run some timings.
Conclusions
There is little difference between the timings on this test data set; the simple approach is quicker if the number of points is small.
import pandas as pd
import numpy as np
import geopandas as gpd
import shapely.geometry
import requests
# source some points and polygons
# fmt: off
dfp = pd.read_html("https://www.latlong.net/category/cities-235-15.html")[0]
dfp = gpd.GeoDataFrame(dfp, geometry=dfp.loc[:,["Longitude", "Latitude",]].apply(shapely.geometry.Point, axis=1))
res = requests.get("https://opendata.arcgis.com/datasets/69dc11c7386943b4ad8893c45648b1e1_0.geojson")
df_poly = gpd.GeoDataFrame.from_features(res.json())
# fmt: on
# bulk up number of points
dfp = pd.concat([dfp for _ in range(1000)]).reset_index()
HOURS = 24
dfp["hour"] = np.random.randint(0, HOURS, len(dfp))
df_poly["hour"] = np.random.randint(0, HOURS, len(df_poly))
def simple():
    return gpd.sjoin(dfp, df_poly).loc[lambda d: d["hour_left"] == d["hour_right"]]

def shard():
    return pd.concat(
        [
            gpd.sjoin(*[d.loc[d["hour"].eq(h)] for d in [dfp, df_poly]])
            for h in range(HOURS)
        ]
    )
print(f"""length test: {len(simple()) == len(shard())} {len(simple())}
dataframe test: {simple().sort_index().equals(shard().sort_index())}
points: {len(dfp)}
polygons: {len(df_poly)}""")
%timeit simple()
%timeit shard()
output
length test: True 3480
dataframe test: True
points: 84000
polygons: 379
6.48 s ± 311 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
4.05 s ± 34.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
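The shard-then-concatenate pattern is not specific to spatial joins; a minimal non-spatial analogue with a plain pandas merge (toy data and names of my own) illustrates the same trade-off:

```python
import pandas as pd

left = pd.DataFrame({"val": [1, 2, 3, 4], "hour": [0, 1, 0, 1], "key": [1, 1, 2, 2]})
right = pd.DataFrame({"name": ["a", "b"], "hour": [0, 1], "key": [1, 2]})

# "simple": join on key, then filter out rows where the hours differ
simple = left.merge(right, on="key").loc[lambda d: d["hour_x"] == d["hour_y"]]

# "shard": filter both frames per hour first, join the shards, concatenate
shard = pd.concat(
    left[left["hour"].eq(h)].merge(right[right["hour"].eq(h)], on="key")
    for h in range(2)
)
```

Sharding avoids materialising the full join before filtering, which is why it wins once the join output gets large.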
I am calculating the RSI value for a stock price, where the previous row is needed for the result of the current row. I currently do it by looping over the full dataframe as many times as there are entries, which takes a lot of time (around 15 seconds on my PC).
Is there any way to improve that code?
import pandas as pd
from pathlib import Path
filename = Path("Tesla.csv")
test = pd.read_csv(filename)
data = pd.DataFrame(test[["Date","Close"]])
data["Change"] = (data["Close"].shift(-1)-data["Close"]).shift(1)
data["Gain"] = 0.0
data["Loss"] = 0.0
data.loc[data["Change"] >= 0, "Gain"] = data["Change"]
data.loc[data["Change"] <= 0, "Loss"] = data["Change"]*-1
data.loc[:, "avgGain"] = 0.0
data.loc[:, "avgLoss"] = 0.0
data["avgGain"].iat[14] = data["Gain"][1:15].mean()
data["avgLoss"].iat[14] = data["Loss"][1:15].mean()
for index in data.iterrows():
    data.loc[15:, "avgGain"] = (data.loc[14:, "avgGain"].shift(1)*13 + data.loc[15:, "Gain"])/14
    data.loc[15:, "avgLoss"] = (data.loc[14:, "avgLoss"].shift(1)*13 + data.loc[15:, "Loss"])/14
The used dataset can be downloaded here:
TSLA historic dataset from yahoo finance
The goal is to calculate the RSI value based on the avgGain and avgLoss values, which are computed as follows:
The avgGain value on rows 0:14 does not exist.
The avgGain value on row 15 is the mean of rows 1:14 of the Gain column.
The avgGain value from row 16 onwards is calculated as:
(13*avgGain(previous row) + Gain(current row))/14
itertuples is faster than iterrows, and vectorized operations generally perform best in terms of time.
Here you can calculate average gains and losses (rolling averages) over 14 days with the rolling method with a window size of 14.
%%timeit
data["avgGain"].iat[14] = data["Gain"][1:15].mean()
data["avgLoss"].iat[14] = data["Loss"][1:15].mean()
for index in data.iterrows():
    data.loc[15:, "avgGain"] = (data.loc[14:, "avgGain"].shift(1)*13 + data.loc[15:, "Gain"])/14
    data.loc[15:, "avgLoss"] = (data.loc[14:, "avgLoss"].shift(1)*13 + data.loc[15:, "Loss"])/14
1.12 s ± 3.73 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
data['avgGain_alt'] = data['Gain'].rolling(window=14).mean().fillna(0)
data['avgLoss_alt'] = data['Loss'].rolling(window=14).mean().fillna(0)
1.38 ms ± 2.31 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
data.head(15)
Using vectorized operations to calculate the moving averages is dramatically faster than calculating with loops (roughly 800x on the timings above).
However, note that there is also some calculation error in your code for the averages after the first one: a plain rolling mean is not the same as the recursive average you describe.
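As a sketch of how the recursive smoothing itself can be vectorized (not just approximated by a rolling mean), assuming a plain series of gains: `ewm` with `alpha=1/14` and `adjust=False` applies exactly the recurrence avg = (13*previous + current)/14. The question seeds the first average with a 14-period mean, so initial values would differ; this only illustrates the recurrence:

```python
import pandas as pd

gains = pd.Series([0.5, 1.0, 0.0, 2.0, 0.3])

# adjust=False gives y[t] = (1 - alpha)*y[t-1] + alpha*x[t],
# which for alpha = 1/14 is exactly (13*y[t-1] + x[t]) / 14
avg_gain = gains.ewm(alpha=1/14, adjust=False).mean()

# the same recurrence written as an explicit loop, for comparison
check = [gains.iloc[0]]
for x in gains.iloc[1:]:
    check.append((13 * check[-1] + x) / 14)
```

Since `ewm` runs in compiled code, this removes the Python-level loop entirely.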
Essentially I have data which provides a start time, the number of time slots and the duration of each slot.
I want to convert that into a dataframe of start and end times - which I've achieved but I can't help but think is not efficient or particularly pythonic.
The real data has multiple ID's hence the grouping.
import pandas as pd
slots = pd.DataFrame({"ID": 1, "StartDate": pd.to_datetime("2019-01-01 10:30:00"), "Quantity": 3, "Duration": pd.to_timedelta(30, unit="minutes")}, index=[0])
grp_data = slots.groupby("ID")
bob = []
for rota_id, row in grp_data:
    start = row.iloc[0, 1]
    delta = row.iloc[0, 3]
    for quantity in range(1, int(row.iloc[0, 2] + 1)):
        data = {"RotaID": rota_id,
                "DateStart": start,
                "Duration": delta,
                "DateEnd": start + delta}
        bob.append(data)
        start = start + delta
fred = pd.DataFrame(bob)
This might be answered elsewhere but I've no idea how to properly search this since I'm not sure what my problem is.
EDIT: I've updated my code to be more efficient with its function calls and it is faster, but I'm still interested in knowing if there is a vectorised approach to this.
How about this way:
indices_dup = [np.repeat(i, quantity) for i, quantity in enumerate(slots.Quantity.values)]
slots_ext = slots.loc[np.concatenate(indices_dup).ravel(), :]
# Add a counter per ID; used to 'shift' the duration along StartDate
slots_ext['counter'] = slots_ext.groupby('ID').cumcount()
# Calculate DateStart and DateEnd based on counter and Duration
slots_ext['DateStart'] = (slots_ext.counter) * slots_ext.Duration.values + slots_ext.StartDate
slots_ext['DateEnd'] = (slots_ext.counter + 1) * slots_ext.Duration.values + slots_ext.StartDate
slots_ext.loc[:, ['ID', 'DateStart', 'Duration', 'DateEnd']].reset_index(drop=True)
Performance
Looking at performance on a larger dataframe (duplicated 1000 times) using
slots_large = pd.concat([slots] * 1000, ignore_index=True).drop('ID', axis=1).reset_index().rename(columns={'index': 'ID'})
Yields:
Old method: 289 ms ± 4.59 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
New method: 8.13 ms ± 278 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
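A sketch of the same expansion using `Index.repeat`, which may read more directly than building the repeat indices by hand (column names follow the question's `slots` frame):

```python
import pandas as pd

slots = pd.DataFrame({"ID": 1,
                      "StartDate": pd.to_datetime("2019-01-01 10:30:00"),
                      "Quantity": 3,
                      "Duration": pd.to_timedelta(30, unit="minutes")}, index=[0])

# repeat each row Quantity times, then number the repeats within each ID
out = slots.loc[slots.index.repeat(slots["Quantity"])].reset_index(drop=True)
counter = out.groupby("ID").cumcount()

out["DateStart"] = out["StartDate"] + counter * out["Duration"]
out["DateEnd"] = out["StartDate"] + (counter + 1) * out["Duration"]
out = out[["ID", "DateStart", "Duration", "DateEnd"]]
```

Multiplying the integer counter by the timedelta column shifts each repeat by one slot, so no Python-level loop is needed.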
In case this ever helps anyone: I found that my data set had varying deltas per ID, and @RubenB's initial answer doesn't handle those. Here is my final solution based on their code:
# RubenB's code
indices_dup = [np.repeat(i, quantity) for i, quantity in enumerate(slots.Quantity.values)]
slots_ext = slots.loc[np.concatenate(indices_dup).ravel(), :]
# Calculate the cumulative sum of the delta per rota ID
slots_ext["delta_sum"] = slots_ext.groupby("ID")["Duration"].cumsum()
slots_ext["delta_sum"] = pd.to_timedelta(slots_ext["delta_sum"], unit="minutes")
# Use the cumulative sum to calculate the running end dates and then the start dates
first_value = slots_ext.StartDate[0]
slots_ext["EndDate"] = slots_ext.delta_sum.values + slots_ext.StartDate
slots_ext["StartDate"] = slots_ext.EndDate.shift(1)
slots_ext.loc[0, "StartDate"] = first_value
slots_ext.reset_index(drop=True, inplace=True)
Python: what would be the most efficient way to read a file with millions of records and no default delimiter, and place it in a pandas DataFrame?
The File is : "file_sd.txt"
A123456MESTUDIANTE 000-12
A123457MPROFESOR 003103
I128734MPROGRAMADOR00-111
A129863FARQUITECTO 00-456
# Fields and position:
# - Activity Indicator : indAct -> 01 Character
# - Person Code : codPer -> 06 Characters
# - Gender (M / F) : sex -> 01 Character
# - Occupation : occupation -> 11 Characters
# - Amount(User format): amount -> 06 Characters (Convert to Number)
I'm not sure. Is this the best option?:
import pandas as pd
import numpy as np
def stoI(cad):
    pos = cad.find("-")
    if pos < 0:
        return int(cad)
    return int(cad[pos+1:]) * -1
#Read Txt
data = pd.read_csv(r'D:\file_sd.txt',header = None)
data_sep = pd.DataFrame(
    {
        'indAct': data[0].str.slice(0, 1),
        'codPer': data[0].str.slice(1, 7),
        'sexo': data[0].str.slice(7, 8),
        'ocupac': data[0].str.slice(8, 19),
        'monto': np.vectorize(stoI)(data[0].str.slice(19, 25))
    })
print(data_sep)
indAct codPer sexo ocupac monto
0 A 123456 M ESTUDIANTE -12
1 A 123457 M PROFESOR 3103
2 I 128734 M PROGRAMADOR -111
3 A 129863 F ARQUITECTO -456
For 7 million rows, the result of this solution is:
%timeit df_slice()
11.1 s ± 166 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
You have a fixed-width file, so you should use the appropriate pd.read_fwf reader. In this case we will specify the number of characters that belong to each field and the column names.
df = pd.read_fwf('test.txt', header=None, widths=[1, 6, 1, 11, 6])
df.columns = ['indAct' ,'codPer', 'sexo', 'ocupac', 'monto']
# indAct codPer sexo ocupac monto
#0 A 123456 M ESTUDIANTE 000-12
#1 A 123457 M PROFESOR 003103
#2 I 128734 M PROGRAMADOR 00-111
#3 A 129863 F ARQUITECTO 00-456
Now you can fix the dtypes of fields. 'monto' can be made into a number by stripping the 0s and calling pd.to_numeric.
df['monto'] = pd.to_numeric(df['monto'].str.strip('0'), errors='coerce')
# indAct codPer sexo ocupac monto
#0 A 123456 M ESTUDIANTE -12
#1 A 123457 M PROFESOR 3103
#2 I 128734 M PROGRAMADOR -111
#3 A 129863 F ARQUITECTO -456
As your comment notes this might on the surface seem slower. The advantage is that pd.read_fwf, as an I/O operation has a lot of automatic data cleaning.
It will properly downcast columns from object if all data are int/float/numeric. For string slicing, you need to manually typecast the columns.
It will properly trim the whitespace from strings in fields that don't fully consume the character limit. This is an additional step you need to perform after slicing.
It properly fills missing data (all blank fields) with NaN. string slicing will keep the blank strings and must be dealt with separately. pandas does not recognize '' as null, so this is how missing data should be properly handled.
In the case of many object columns that all fully consume the entire character limit, with no missing data, string slicing has the advantage. But for a general unknown dataset that you need to ingest and ETL, once you start tacking on string stripping and type conversions for every column, you will likely find that the designated pandas I/O operations are the best option.
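The cleanup can also be folded into the read itself: `read_fwf` accepts a `converters` mapping, so the sign fix-up for `monto` can happen during parsing. A sketch, assuming the sample layout above (the helper `to_int` adapts the question's `stoI`):

```python
import io
import pandas as pd

raw = """A123456MESTUDIANTE 000-12
A123457MPROFESOR   003103
I128734MPROGRAMADOR00-111
A129863FARQUITECTO 00-456
"""

def to_int(cad):
    # '000-12' -> -12 : honour the embedded minus sign, drop the zero padding
    cad = cad.strip()
    pos = cad.find("-")
    return int(cad) if pos < 0 else -int(cad[pos+1:])

df = pd.read_fwf(io.StringIO(raw), header=None, widths=[1, 6, 1, 11, 6],
                 names=['indAct', 'codPer', 'sexo', 'ocupac', 'monto'],
                 converters={'monto': to_int})
```

This keeps the whitespace trimming and dtype handling of `read_fwf` while producing a numeric `monto` column in one pass.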
You can extract the columns using regex pattern matching. For the data in question, we can define the regex as:
data[0].str.extract(r'(?P<indAct>[AI]{1})(?P<codPer>[0-9]{6})(?P<sexo>[MF]{1})(?P<ocupac>[A-Z\s]{11})[0]*[^-|1-9](?P<monto>[0-9-\s]*$)')
The assumption is that the data is clean, which may or may not be a valid assumption.
Here is a comparison of the approach in the question and the one here:
#Data size is 300 rows. ( 4 rows in the question replicated 75 times)
import pandas as pd
import numpy as np
# Returns a number from a string with zeros on the left, e.g. stoI('0000-810') returns -810
def stoI(cad):
    pos = cad.find("-")
    if pos < 0:
        return int(cad)
    return int(cad[pos+1:]) * -1

# Read Txt
data = pd.read_csv('file_sd.txt', header=None)

def df_slice():
    return pd.DataFrame(
        {
            'indAct': data[0].str.slice(0, 1),
            'codPer': data[0].str.slice(1, 7),
            'sexo': data[0].str.slice(7, 8),
            'ocupac': data[0].str.slice(8, 19),
            'monto': np.vectorize(stoI)(data[0].str.slice(19, 25))
        })
def df_extract():
    return data[0].str.extract(r'(?P<indAct>[AI]{1})(?P<codPer>[0-9]{6})(?P<sexo>[MF]{1})(?P<ocupac>[A-Z\s]{11})[0]*[^-|1-9](?P<monto>[0-9-\s]*$)')
%timeit df_slice()
1.84 ms ± 30.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit df_extract()
975 µs ± 15 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Hope this helps!
Other questions attempting to provide the Python equivalent of R's sweep function (like here) do not really address the case of multiple arguments, where it is most useful.
Say I wish to apply a two-argument function to each row of a DataFrame with the matching element from a column of another DataFrame:
df = data.frame("A" = 1:3,"B" = 11:13)
df2= data.frame("X" = 10:12,"Y" = 10000:10002)
sweep(df,1, FUN="*",df2$X)
In Python I got the equivalent using apply on what is basically a loop over the row counts.
df = pd.DataFrame( { "A" : range(1,4),"B" : range(11,14) } )
df2 = pd.DataFrame( { "X" : range(10,13),"Y" : range(10000,10003) } )
pd.Series(range(df.shape[0])).apply(lambda row_count: np.multiply(df.iloc[row_count,:],df2.iloc[row_count,df2.columns.get_loc('X')]))
I highly doubt this is efficient in pandas, what is a better way of doing this?
Both bits of code should result in a Dataframe/matrix of 6 numbers when applying *:
A B
1 10 110
2 22 132
3 36 156
I should state clearly that the aim is to insert one's own function into this sweep-like behaviour, say:
df = data.frame("A" = 1:3,"B" = 11:13)
df2= data.frame("X" = 10:12,"Y" = 10000:10002)
myFunc = function(a,b) { floor((a + b)^min(a/2,b/3)) }
sweep(df,1, FUN=myFunc,df2$X)
resulting in:
A B
[1,] 3 4
[2,] 3 4
[3,] 3 5
What is a good way of doing that in python pandas?
If I understand this correctly, you are looking to apply a binary function f(x,y) to a dataframe row-wise (for the x), with the arguments for y coming from a series. One way to do this is to borrow the implementation from pandas internals itself. If you want to extend this function (e.g. to apply along columns), it can be done in a similar manner, as long as f is binary. If you need more arguments, you can simply do a partial on f to make it binary.
import pandas as pd
from pandas.core.dtypes.generic import ABCSeries
def sweep(df, series, FUN):
    assert isinstance(series, ABCSeries)
    # row-wise application
    assert len(df) == len(series)
    return df._combine_match_index(series, FUN)

# define your binary operator
def f(x, y):
    return x * y
# the input data frames
df = pd.DataFrame( { "A" : range(1,4),"B" : range(11,14) } )
df2 = pd.DataFrame( { "X" : range(10,13),"Y" : range(10000,10003) } )
# apply
test1 = sweep(df, df2.X, f)
# performance
# %timeit sweep(df, df2.X, f)
# 155 µs ± 1.27 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)#
# another method
import numpy as np
test2 = pd.Series(range(df.shape[0])).apply(lambda row_count: np.multiply(df.iloc[row_count,:],df2.iloc[row_count,df2.columns.get_loc('X')]))
# %timeit performance
# 1.54 ms ± 56.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
assert all(test1 == test2)
Hope this helps.
In pandas:
df.mul(df2.X, axis=0)
A B
0 10 110
1 22 132
2 36 156
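For the custom-function case, the same idea can be sketched with NumPy broadcasting. A sketch, assuming the goal is to reproduce the R output above; note that R's min() inside sweep reduces over all elements of both arguments, so the exponent collapses to a single scalar:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": range(1, 4), "B": range(11, 14)})
df2 = pd.DataFrame({"X": range(10, 13), "Y": range(10000, 10003)})

a = df.to_numpy()
b = df2["X"].to_numpy()[:, None]  # column vector; broadcasts across df's columns

# R's min() is a reduction over ALL elements of both arguments,
# so the exponent is a single scalar, matching the R result above
p = min((a / 2).min(), (b / 3).min())
result = pd.DataFrame(np.floor((a + b) ** p).astype(int),
                      columns=df.columns, index=df.index)
```

If you instead want an elementwise minimum per cell, swap the scalar `p` for `np.minimum(a / 2, b / 3)`; the broadcasting machinery is the same.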