Pandas slowness as dataframe size increases - python

I want to remove all URLs from a column. The column is of string type.
My dataframe has two columns: str_val [str] and str_length [int].
I am using the following code:
import time

t1 = time.time()
reg_exp_val = r"((?:https?://|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+)"
df_mdr_pd['str_val1'] = df_mdr_pd.str_val.str.replace(reg_exp_val, r'')
print(time.time() - t1)
When I run the code on 10000 instances, it finishes in 0.6 seconds. For 100000 instances the execution just gets stuck. I tried using .loc[i, i+10000] and running it in a for loop, but that did not help either.

The problem was due to the regular expression I was using. The one which worked for me was
r"(?i)\b((?:https?://|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]|\(([^\s()<>]|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'\".,<>?«»“”‘’]))",
which was taken from this link.
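For reference, a minimal sketch of timing the corrected pattern on a small synthetic frame before running it on the full data (the sample strings and the 100000-row size are made up; only the column names match the setup above):

import time
import pandas as pd

reg_exp_val = r"(?i)\b((?:https?://|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]|\(([^\s()<>]|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'\".,<>?«»“”‘’]))"
# Synthetic stand-in for the real data
df_mdr_pd = pd.DataFrame({"str_val": ["see https://example.com/page for details", "no url here"] * 50000})
t1 = time.time()
df_mdr_pd["str_val1"] = df_mdr_pd["str_val"].str.replace(reg_exp_val, "", regex=True)
print(time.time() - t1)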

Related

how to get result of expanding/resample function in original dataframe using python

I have a dataframe with 1-minute timestamps of open, high, low, close, and volume for a token.
Using the expanding or resample function, one can get a new dataframe based on the time interval; in my case it is a 1-day interval.
I am looking to get that output in the original dataframe. Please assist with the same.
Original dataframe:
Desired dataframe:
Here "date_1d" is the time-interval column for my use case. I used the expanding function, but as the value changes in the "date_1d" column, expanding still works on the whole dataframe:
df["high_1d"] = df["high"].expanding().max()
df["low_1d"] = df["low"].expanding().min()
df["volume_1d"] = df["volume"].expanding().min()
Then the next challenge was how to find open and close based on the "date_1d" column.
Please assist, or ask more questions if my desired output is not clear.
FYI - the data is huge: 5 years of 1-minute data for 100 tokens.
Thanks in advance,
Sukhwant
I'm not sure if I understand it right, but it looks like you want to group by each day and calculate first/last/min/max for each group.
Is the column date_1d already there?
If not:
df["date_1d"] = df["date"].dt.strftime('%Y%m%d')
For the calculations:
df["open_1d"] = df.groupby("date_1d")['open'].transform('first')
df["high_1d"] = df.groupby("date_1d")['high'].transform('max')
df["low_1d"] = df.groupby("date_1d")['low'].transform('min')
df["close_1d"] = df.groupby("date_1d")['close'].transform('last')
EDIT:
Have a look at whether this works in your code as you expect (until we have some of your data I can only guess, sorry :D):
df['high_1d'] = df.groupby('date_1d')['high'].expanding().max().values
It groups the data per "date_1d", but within each group only the current row and the rows above it are considered.
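A minimal sketch on the same kind of made-up toy frame shows how the running maximum restarts at each new date_1d value:

import pandas as pd

toy = pd.DataFrame({
    "date_1d": ["20230101", "20230101", "20230102", "20230102"],
    "high":    [10, 12, 8, 9],
})
# The expanding max is computed within each group, so it resets on the second day.
toy["high_1d"] = toy.groupby("date_1d")["high"].expanding().max().values
print(toy["high_1d"].tolist())   # [10.0, 12.0, 8.0, 9.0]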
EDIT: Found a neat solution using the transform method. It removes the need for a "Day" column, as the df.groupby is done using the index's day attribute.
import datetime
import pandas as pd
import yfinance as yf

df = yf.download("AAPL", interval="1m",
                 start=datetime.date.today() - datetime.timedelta(days=6))
df['Open_1d'] = df["Open"].groupby(df.index.day).transform('first')
df['Close_1d'] = df["Close"].groupby(df.index.day).transform('last')
df['High_1d'] = df['High'].groupby(df.index.day).expanding().max().droplevel(level=0)
df['Low_1d'] = df['Low'].groupby(df.index.day).expanding().min().droplevel(level=0)
df['Volume_1d'] = df['Volume'].groupby(df.index.day).expanding().sum().droplevel(level=0)
df
Happy Coding!

Comparing values in Dataframes

I am doing a Python project and trying to cut down on some computational time at the start using Pandas.
The code currently is:
for c1 in centres1:
    for c2 in centres2:
        if ((c1[0]-c2[0])**2 + (c1[1]-c2[1])**2) < search_rad*search_rad:
            possible_commet.append([c1, c2])
I am trying to put centres1 and centres2 into dataframes and then compare each value to each other value. Would pandas help me cut some time off this (currently 2 minutes)? If not, how could I work around it?
Thanks
Unfortunately this is never going to be fast, as you are performing n-squared operations. For example, if you are comparing n objects where n = 1000, then you have 1 million comparisons. If however n = 10_000, then you have 100 million comparisons. A problem 10x bigger becomes 100 times slower.
Nevertheless, for loops in Python are relatively expensive. Using a library like pandas may mean that you only need to make one function call, which will shave some time off. Without any input data it's hard to assist further, but the below should provide some building blocks:
import pandas
df1 = pandas.DataFrame(centres1)
df2 = pandas.DataFrame(centres2)
df3 = df1.merge(df2, how='cross')
df3['combined_centre'] = (df3['0_x'] - df3['0_y'])**2 + (df3['1_x'] - df3['1_y'])**2
df3[df3['combined_centre'] < search_rad**2]
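If pandas alone doesn't buy enough, a loop-free sketch using NumPy broadcasting is another option (not part of the answer above; it assumes centres1 and centres2 are lists of (x, y) pairs and search_rad is a number):

import numpy as np

# Hypothetical inputs standing in for the real data
centres1 = [(0.0, 0.0), (5.0, 5.0), (10.0, 0.0)]
centres2 = [(0.5, 0.5), (20.0, 20.0)]
search_rad = 2.0

a = np.asarray(centres1, dtype=float)                       # shape (n1, 2)
b = np.asarray(centres2, dtype=float)                       # shape (n2, 2)
d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=2)     # squared distances, shape (n1, n2)
i_idx, j_idx = np.nonzero(d2 < search_rad ** 2)
possible_commet = [[centres1[i], centres2[j]] for i, j in zip(i_idx, j_idx)]
print(possible_commet)   # [[(0.0, 0.0), (0.5, 0.5)]]

Note that the full (n1, n2) distance matrix has to fit in memory; for very large inputs you would chunk one of the arrays.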
Yes, for sure pandas will help in cutting off at least some time from what you are getting right now, but you can also try this out:
for c1, c2 in zip(centres1, centres2):
    if ((c1[0]-c2[0])**2 + (c1[1]-c2[1])**2) < search_rad*search_rad:
        possible_commet.append([c1, c2])

Python Dask - Group-by performance on all columns

I want to count the number of unique rows in my data. Below is a quick input/output example.
#input
A,B
0,0
0,1
1,0
1,0
1,1
1,1
#output
A,B,count
0,0,1
0,1,1
1,0,2
1,1,2
The data in my pipeline has more than 5000 columns and more than 1M rows; each cell is a 0 or a 1. Below are my two attempts at scaling with Dask (with 26 columns):
import numpy as np
import pandas as pd
import string
import time
import dask.dataframe as dd
from dask.distributed import Client

client = Client(n_workers=6, threads_per_worker=2, processes=True)
columns = list(string.ascii_uppercase)
data = np.random.randint(2, size=(1000000, len(columns)))
ddf_parent = dd.from_pandas(pd.DataFrame(data, columns=columns), npartitions=20)
#1st solution
ddf = ddf_parent.astype(str)
ddf_concat = ddf.apply(''.join, axis =1).to_frame()
ddf_concat.columns = ['pattern']
ddf_concat = ddf_concat.groupby('pattern').size()
start = time.time()
ddf_concat = ddf_concat.compute()
print(time.time()-start)
#2nd solution
ddf_concat_other = ddf_parent.groupby(list(ddf.columns)).size()
start = time.time()
ddf_concat_other = ddf_concat_other.compute()
print(time.time() - start)
results:
9.491615056991577
12.688117980957031
The first solution concatenates every column into a string and then runs the group-by on it. The second one just groups by all the columns. I am leaning toward the first, as it is faster in my tests, but I am open to suggestions. Feel free to completely change my solution if there is anything better in terms of performance. (Also, interestingly, sort=False does not speed up the group-by, which may be related to this: https://github.com/dask/dask/issues/5441 and this: https://github.com/rapidsai/cudf/issues/2717.)
NOTE:
After some testing, the first solution scales relatively well with the number of columns. I guess one improvement could be to hash the strings so they always have a fixed length (a rough sketch of this follows after this note). Any suggestion on the number of partitions in this case? From the remote dashboard I can see that after a couple of operations the nodes in the computational graph reduce to only 3, not taking advantage of the other available workers.
The second solution fails when the number of columns increases.
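A rough sketch of the hashing idea mentioned in the note above (MD5 is an arbitrary choice, and the small synthetic frame only stands in for ddf_parent):

import hashlib
import string
import dask.dataframe as dd
import numpy as np
import pandas as pd

columns = list(string.ascii_uppercase)
pdf = pd.DataFrame(np.random.randint(2, size=(10000, len(columns))), columns=columns)
ddf = dd.from_pandas(pdf, npartitions=4).astype(str)

pattern = ddf.apply("".join, axis=1, meta=(None, "object"))
# Hash each concatenated row to a fixed 32-character digest before grouping.
hashed = pattern.map(lambda s: hashlib.md5(s.encode()).hexdigest(), meta=(None, "object"))
counts = hashed.to_frame("pattern").groupby("pattern").size().compute()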
NOTE2:
Also, with the first solution, something really strange is happening with what I guess is how Dask schedules and maps operations. What happens is that after some time a single worker gets many more tasks than the others; that worker then exceeds 95% of its memory and crashes, the tasks get split correctly again, but after some time another worker gets more tasks (and the cycle restarts). The pipeline runs fine, but I was wondering if this is the expected behavior. Attached is a screenshot.

Looking for ways to improve the speed of my python script that uses the Pandas library

I am fairly new to Pandas and have started using the library to work with data sets in Power BI. I recently had to write a snippet of code to run some calculations on a column of integers, but had a hard time translating my code from standard Python to Pandas. The code essentially casts the column to a list, runs a loop over the items in the list, and appends the resulting number to a new list that I then turn into its own column.
I have read that running loops in Pandas can be slow, and the execution of the code below does indeed seem slow. Any help pointing me in the right direction would be much appreciated!
Here is the code that I am trying to optimize:
import pandas as pd
df = dataset  # Required step in Power BI
gb_list = df['Estimated_Size'].T.tolist()
hours_list = []
for size in gb_list:
    hours = -0.50
    try:
        for count in range(0, round(size)):
            if count % 100 == 0:
                hours += .50
            else:
                continue
    except:
        hours = 0
    hours_list.append(hours)
df['Total Hours'] = hours_list
IIUC, your code is equivalent to:
df['Total Hours'] = (df['Estimated_Size'] // 100) * 0.5
Except that I'm not clear what value you want when Estimated_Size is exactly 100.
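A quick hedged check (with made-up sizes) comparing the original loop with the vectorised form; it makes the behaviour at exactly 100 (and at 0) easy to inspect:

import pandas as pd

def loop_hours(size):
    # Original logic from the question, wrapped in a function
    hours = -0.50
    try:
        for count in range(0, round(size)):
            if count % 100 == 0:
                hours += .50
    except:
        hours = 0
    return hours

check = pd.DataFrame({"Estimated_Size": [0, 50, 100, 250]})
check["loop"] = check["Estimated_Size"].map(loop_hours)
check["vectorised"] = (check["Estimated_Size"] // 100) * 0.5
print(check)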

append() function is taking a long time to run

Firstly, I just got started with Python, so I do not have much knowledge. I tried searching for this problem and could not find a proper solution.
Just to give a brief overview: I am studying a traffic accidents database that has almost 165000 rows and 39 columns. One of the steps I am taking is running the apriori algorithm (from the apyori library) on this data.
If you want, you can download the data (.csv) here.
However, in order to do that, I have to transform my pandas dataframe into a list, and this is the part where I am having problems.
I am using the following code:
def list_apriori(df):
    apr = []
    for i in range(0, 164699):
        apr.append([str(df.values[i, j]) for j in range(0, 38)])
    return apr
I left this code running for almost 40 minutes and it didn't finish, so I thought there might be a better way to do it.
I made a test with:
def list_apriori(df):
    apr = []
    for i in range(0, 50):
        apr.append([str(df.values[i, j]) for j in range(0, 10)])
    return apr
It finished in less than 5 minutes (which I believe is a long time, given that it is only running on 50 rows and 10 columns).
I also tried changing computers, but I did not notice any difference.
Do you have any suggestions on whether and how I can improve the code so it runs faster?
Thanks in advance.
EDIT
I believe that the problem was the conversion to string. Thanks @ninesalt for the help!
The code that worked is the following:
def list_apriori(df):
    result = df.astype(str)
    apr = []
    for i in range(0, 164699):
        apr.append([result.values[i, j] for j in range(0, 38)])
    return apr
Here is exactly what you wanted but without the loops. This takes 3 seconds on my PC and the dataframe is the same size as the one in your example (165000, 39)
import numpy as np
import pandas as pd
arr = np.random.randint(0, 100, (165000, 39))
df = pd.DataFrame(arr)
result = df.astype(str)
firstrow = result.iloc[[0]]
print(firstrow) # prints first row as a string
Whenever you have something that you think is an expensive operation, there's almost always an easier and more efficient way to do it with the library you're using; you just have to check the docs.
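If the end goal is the list-of-lists input that apyori expects, one hedged way to get it without the row and column loops (using the same synthetic frame as above) is:

import numpy as np
import pandas as pd

arr = np.random.randint(0, 100, (165000, 39))
df = pd.DataFrame(arr)
# astype(str) converts every cell once; values.tolist() then builds the nested list without Python-level loops.
apr = df.astype(str).values.tolist()
print(apr[0])   # first row as a list of strings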
