Firstly, I just got started with Python, so I do not have much knowledge. I searched for this problem but could not find a proper solution.
To give a brief summary: I am studying a traffic accidents database that has almost 165,000 rows and 39 columns. One of the steps I am taking is running the apriori algorithm (from the apyori library) on this database.
If you want, you can download the database (.csv) here.
However, in order to do that, I have to transform my pandas DataFrame into a list, and this is the part where I am having problems.
I am using the following code:
def list_apriori(df):
    apr = []
    for i in range(0, 164699):
        apr.append([str(df.values[i, j]) for j in range(0, 38)])
    return apr
I left this code running for almost 40 minutes and it didn't finish, so I thought that maybe there is an improved way to do so.
I've made a test with:
def list_apriori(df):
    apr = []
    for i in range(0, 50):
        apr.append([str(df.values[i, j]) for j in range(0, 10)])
    return apr
It finished in less than 5 minutes (which I believe is still a long time, given that it is only running on 50 rows and 10 columns).
I also tried another computer, but I did not notice any difference.
Do you have any suggestions on whether and how I can improve the code so that it runs faster?
Thanks in advance.
EDIT
I believe the problem was the conversion to string. Thanks @ninesalt for the help!
The code that worked is the following:
def list_apriori(df):
    result = df.astype(str)  # convert everything to strings once, up front
    apr = []
    for i in range(0, 164699):
        apr.append([result.values[i, j] for j in range(0, 38)])
    return apr
Here is exactly what you wanted, but without the loops. This takes 3 seconds on my PC, and the DataFrame is the same size as the one in your example (165000, 39):
import numpy as np
import pandas as pd
arr = np.random.randint(0, 100, (165000, 39))
df = pd.DataFrame(arr)
result = df.astype(str)
firstrow = result.iloc[[0]]
print(firstrow)  # prints the first row, with every value now a string
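If the end goal is the list-of-lists that apyori expects, the same astype(str) trick extends to a fully loop-free conversion. This is my sketch (using the standard values.tolist(), not part of the original answer) on the same synthetic frame:
import numpy as np
import pandas as pd

# Synthetic frame matching the (165000, 39) shape of the real data
arr = np.random.randint(0, 100, (165000, 39))
df = pd.DataFrame(arr)

# Convert every cell to a string once, then let NumPy build the nested list
apr = df.astype(str).values.tolist()   # 165000 inner lists of 39 strings each
print(apr[0])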
Whenever you have something that you think is an expensive operation, there's almost always an easier and more efficient way to do it with the library you're using; you just have to check the docs.
Related
I want to remove all URLs from a column. The column has string format.
My DataFrame has two columns: str_val [str] and str_length [int].
I am using the following code:
t1 = time.time()
reg_exp_val = r"((?:https?://|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+)"
df_mdr_pd['str_val1'] = df_mdr_pd.str_val.str.replace(reg_exp_val, r'')
print(time.time()-t1)
When I run the code on 10000 instances, it finishes in 0.6 seconds. For 100000 instances the execution just gets stuck. I also tried slicing with .loc[i:i+10000] and running it in a for loop, but that did not help either.
The problem was due to the regular expression I was using. The one that worked for me was
r"(?i)\b((?:https?://|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]|\(([^\s()<>]|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'\".,<>?«»“”‘’]))"
which was taken from this link.
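For reference, a minimal sketch of applying that replacement pattern with pandas (the sample data here is made up; regex=True is passed explicitly because newer pandas versions expect it for pattern replacements):
import pandas as pd

# Hypothetical sample; the real frame has columns str_val [str] and str_length [int]
df_mdr_pd = pd.DataFrame({'str_val': ['see https://example.com/page for details',
                                      'no url in this row']})

# Pattern from the answer above
reg_exp_val = r"(?i)\b((?:https?://|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]|\(([^\s()<>]|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'\".,<>?«»“”‘’]))"

# Strip every match from the column
df_mdr_pd['str_val1'] = df_mdr_pd['str_val'].str.replace(reg_exp_val, '', regex=True)
print(df_mdr_pd)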
I am doing a Python project and trying to cut down on some computational time at the start using Pandas.
The code currently is:
for c1 in centres1:
    for c2 in centres2:
        if ((c1[0]-c2[0])**2 + (c1[1]-c2[1])**2) < search_rad*search_rad:
            possible_commet.append([c1, c2])
I am trying to put centres1 and centres2 into data frames and then compare each value to every other value. Would pandas help me cut some time off it (currently 2 minutes)? If not, how could I work around it?
Thanks
Unfortunately this is never going to be fast, as you are performing n-squared operations. For example, if you are comparing n objects where n = 1000, you only have 1 million comparisons. If however n = 10_000, you have 100 million comparisons. A problem 10x bigger becomes 100 times slower.
Nevertheless, for loops in Python are relatively expensive. Using a library like pandas may mean you only need to make one function call, which will shave some time off. Without any input data it's hard to assist further, but the code below should provide some building blocks:
import pandas

df1 = pandas.DataFrame(centres1)
df2 = pandas.DataFrame(centres2)

# Cross join: every row of df1 paired with every row of df2
df3 = df1.merge(df2, how='cross')

# Squared distance between the two centres in each pair
df3['combined_centre'] = (df3['0_x'] - df3['0_y'])**2 + (df3['1_x'] - df3['1_y'])**2

# Keep only the pairs inside the search radius
df3[df3['combined_centre'] < search_rad**2]
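If the centres fit in memory as arrays, a NumPy broadcasting version of the same idea is another option. This is my sketch, not part of the original answer, and it builds an n1 x n2 distance matrix, so memory grows quadratically with the number of centres:
import numpy as np

# Hypothetical (n, 2) arrays of x/y coordinates standing in for centres1/centres2
centres1 = np.random.rand(1000, 2) * 100
centres2 = np.random.rand(1000, 2) * 100
search_rad = 5.0

# Broadcasting computes every pairwise squared distance in one vectorized step
diff = centres1[:, None, :] - centres2[None, :, :]
sq_dist = (diff ** 2).sum(axis=-1)

# Indices of all pairs closer than the search radius
i_idx, j_idx = np.nonzero(sq_dist < search_rad ** 2)
possible_commet = [[centres1[i], centres2[j]] for i, j in zip(i_idx, j_idx)]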
Yes, pandas will for sure help in cutting off at least some of the time you are getting right now, but you could also try this:
# note: zip pairs the i-th centre of each list; it does not compare every c1 with every c2
for c1, c2 in zip(centres1, centres2):
    if ((c1[0]-c2[0])**2 + (c1[1]-c2[1])**2) < search_rad*search_rad:
        possible_commet.append([c1, c2])
I want to count the number of unique rows in my data. Below is a quick input/output example.
#input
A,B
0,0
0,1
1,0
1,0
1,1
1,1
#output
A,B,count
0,0,1
0,1,1
1,0,2
1,1,2
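For reference, in plain pandas the desired output is a straightforward group-by count. A small sketch on the toy data above (my addition, not part of the original question):
import pandas as pd

# Toy data from the input example above
df = pd.DataFrame({'A': [0, 0, 1, 1, 1, 1], 'B': [0, 1, 0, 0, 1, 1]})

# Count identical rows; reset_index turns the group sizes into a 'count' column
counts = df.groupby(['A', 'B']).size().reset_index(name='count')
print(counts)
#    A  B  count
# 0  0  0      1
# 1  0  1      1
# 2  1  0      2
# 3  1  1      2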
The data in my pipeline have more than 5000 columns and more than 1M rows; each cell is a 0 or a 1. Below are my two attempts at scaling with Dask (shown here with 26 columns):
import numpy as np
import pandas as pd
import string
import time
import dask.dataframe as dd
from dask.distributed import Client

client = Client(n_workers=6, threads_per_worker=2, processes=True)
columns = list(string.ascii_uppercase)
data = np.random.randint(2, size=(1000000, len(columns)))
ddf_parent = dd.from_pandas(pd.DataFrame(data, columns=columns), npartitions=20)
#1st solution
ddf = ddf_parent.astype(str)
ddf_concat = ddf.apply(''.join, axis =1).to_frame()
ddf_concat.columns = ['pattern']
ddf_concat = ddf_concat.groupby('pattern').size()
start = time.time()
ddf_concat = ddf_concat.compute()
print(time.time()-start)
#2nd solution
ddf_concat_other = ddf_parent.groupby(list(ddf.columns)).size()
start = time.time()
ddf_concat_other = ddf_concat_other.compute()
print(time.time() - start)
results:
9.491615056991577
12.688117980957031
The first solution concatenates every column into a string and then runs the group-by on it. The second one just groups by all the columns. I am leaning toward the first one, as it is faster in my tests, but I am open to suggestions. Feel free to completely change my solution if there is anything better in terms of performance. (Also, interestingly, sort=False does not speed up the group-by, which may be related to https://github.com/dask/dask/issues/5441 and https://github.com/rapidsai/cudf/issues/2717.)
NOTE:
After some testing, the first solution scales relatively well with the number of columns. I guess one improvement could be to hash the strings so they always have a fixed length (see the sketch after this note). Any suggestion on the partition number in this case? From the remote dashboard I can see that after a couple of operations the nodes in the computational graph reduce to only 3, not taking advantage of the other workers available.
The second solution fails when the number of columns increases.
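A minimal sketch of the hashing idea mentioned in the note above, assuming the ddf_parent frame from the snippet earlier (an untested variation on the first solution, not code from the original post):
import hashlib

# Replace each concatenated row pattern with a fixed-length md5 digest before the
# group-by, so the group keys stay small even with thousands of columns. The digest
# is not reversible, so the original 0/1 pattern is lost and only the counts remain.
ddf = ddf_parent.astype(str)
pattern = ddf.apply(''.join, axis=1, meta=('pattern', 'object'))
hashed = pattern.map(lambda s: hashlib.md5(s.encode()).hexdigest(),
                     meta=('pattern', 'object'))
ddf_concat = hashed.to_frame()
ddf_concat.columns = ['pattern']
counts = ddf_concat.groupby('pattern').size().compute()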
NOTE2:
Also, with the first solution, something really strange is happening with what I guess is how Dask schedules and maps operations. After some time a single worker gets many more tasks than the others, then that worker exceeds 95% of its memory and crashes; the tasks are then split correctly, but after a while another worker gets more tasks (and the cycle restarts). The pipeline runs fine, but I was wondering if this is the expected behavior. Attached a screenshot:
I am fairly new to Pandas and have started using the library to work with data sets in Power BI. I recently had to write a snippet of code to run some calculations on a column of integers, but had a hard time translating my code from standard Python to Pandas. The code essentially casts the column to a list, runs a loop over the items in the list, and appends each resulting number to a new list that I then make into its own column.
I have read that running loops in Pandas can be slow, and the execution of the code below does indeed seem slow. Any help pointing me in the right direction would be much appreciated!
Here is the code that I am trying to optimize:
import pandas as pd

df = dataset  # Required step in Power BI
gb_list = df['Estimated_Size'].T.tolist()
hours_list = []
for size in gb_list:
    hours = -0.50
    try:
        for count in range(0, round(size)):
            if count % 100 == 0:
                hours += .50
            else:
                continue
    except:
        hours = 0
    hours_list.append(hours)
df['Total Hours'] = hours_list
IIUC, your code is equivalent to:
df['Total Hours'] = (df['Estimated_Size'] // 100) * 0.5
Except that I'm not clear what value you want when Estimated_Size is exactly a multiple of 100 (the loop and the one-liner differ there).
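A quick check on made-up values (my addition, not from the original post) shows where the two approaches agree and where the loop differs at exact multiples of 100:
import pandas as pd

# Made-up sizes for illustration only
sample = pd.DataFrame({'Estimated_Size': [50, 100, 150, 250]})
sample['Total Hours'] = (sample['Estimated_Size'] // 100) * 0.5
print(sample)
#    Estimated_Size  Total Hours
# 0              50          0.0
# 1             100          0.5   # the original loop yields 0.0 here
# 2             150          0.5
# 3             250          1.0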
I have some very simple code written to simulate a stock price assuming random movement between -2% and +2% a day (it's overly simplistic, but for demonstration purposes I figured it was easier than using a GBM formula).
The issue I have is that it's very slow; I understand that this is because I'm using double loops. From what I understand I might be able to use vectorization, but I can't figure out how.
Basically what I did was create 100 simulations assuming 256 trading days in a year; each day the previous stock price is multiplied by a random number between .98 and 1.02.
I currently do this using a nested for loop. As I gather, this is not good, but as a novice I'm having a hard time vectorizing. I've read about it online, and from what I understand, instead of using loops you would try to convert both of them into matrices and use matrix multiplication, but I'm unsure how to apply that here. Can anyone point me in the right direction?
from numpy import exp, sqrt, log, linspace
from random import gauss
from random import uniform
import pandas as pd

nsims = 100
stpx = 100
days = 256

mainframe = pd.DataFrame(0, index=list(range(1, days)), columns=list(range(1, nsims)))
mainframe.iloc[0] = stpx

for i in range(0, nsims-1):
    for x in range(1, days-1):
        mainframe.iloc[x, i] = mainframe.iloc[x-1, i] * uniform(.98, 1.02)
Vectorization can be tricky when one calculation relies on the result of a previous calculation, like in this case where day x needs to know the results from day x-1. I won't say it can't be done as quite possibly somebody can find a way, but here's my solution that at least gets rid of one of the loops. We still loop through days, but we do all 100 simulations at once by generating an array of random numbers and making use of numpy's element-wise multiplication (which is much faster than using a loop):
You will need to add the following import:
import numpy as np
And then replace your nested loop with this single loop:
for x in range(1, days-1):
    mainframe.iloc[x] = mainframe.iloc[x-1] * np.random.uniform(0.98, 1.02, nsims-1)
Edit to add: because you are using a very simple formula that only involves basic multiplication, you can actually get rid of both loops by generating a random matrix of numbers, taking numpy's cumulative product column-wise, and multiplying it into a DataFrame where every value begins at 100. I'm not sure such an approach would be viable if you started using a more complicated formula, though. Here it is anyway:
import pandas as pd
import numpy as np
nsims = 100
stpx = 100
days = 256
mainframe = pd.DataFrame(stpx, index=list(range(1, days)), columns=list(range(1, nsims)))
rand_matrix = np.random.uniform(0.98, 1.02, (days-2, nsims-1)).cumprod(axis=0)
mainframe.iloc[1:] *= rand_matrix
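As a quick sanity check (my addition, not part of the original answer), the first row should still equal the starting price, and describe() gives a rough view of the spread of simulated prices:
# First row was never multiplied, so every simulation still starts at stpx
assert (mainframe.iloc[0] == stpx).all()

# Inspect the first simulated path and the overall price range across simulations
print(mainframe[1].head())
print(mainframe.describe().loc[['min', 'max']])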