How to speed up this task in Python

I have a large Pandas dataframe, 24'000'000 rows × 6 columns plus index.
I need to read the integer in column 1 (which is either 1 or 2), then force the value in column 3 to be negative if column 1 is 1, or positive if it is 2. I use the following code in a Jupyter notebook:
for i in range(1000):
    if df.iloc[i,1] == 1:
        df.iloc[i,3] = abs(df.iloc[i,3])*(-1)
    if df.iloc[i,1] == 2:
        df.iloc[i,3] = abs(df.iloc[i,3])
The code above takes 2 min 30 sec to run for 1'000 rows only. For the 24M rows, it would take 41 days to complete!
Something is not right. The code runs in a Jupyter notebook in Chrome on Windows, on a pretty high-end PC.
The Pandas dataframe is created with pd.read_csv and then sorted and indexed this way:
df.sort_values(by="My_time_stamp", ascending=True, inplace=True)
df = df.reset_index(drop=True)
The creation and sorting of the dataframe just takes a few seconds. I have other calculations to perform on this dataframe, so I clearly need to understand what I'm doing wrong.

np.where
a = np.where(df.iloc[:, 1].to_numpy() == 1, -1, 1)
b = np.abs(df.iloc[:, 3].to_numpy())
df.iloc[:, 3] = a * b

Vectorize it:
df.iloc[:, 3] = df.iloc[:, 3].abs() * (2 * (df.iloc[:, 1] != 1) - 1)
Explanation:
Treated as integers, the boolean series df.iloc[:, 1] != 1 becomes ones and zeroes. Multiplied by 2, those become twos and zeroes. After subtracting one, it is -1 where column 1 equals 1, and 1 otherwise. Finally, it is multiplied by the absolute value of column 3, which enforces the sign.
Vectorization typically provides an order of magnitude or two of speedup compared to for loops.
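To make the idea concrete, here is a minimal self-contained sketch on a toy frame (the column names are made up for illustration; the question itself indexes columns by position with iloc):
import pandas as pd

# hypothetical example: col_1 holds the flag (1 or 2), col_3 holds the values to re-sign
df = pd.DataFrame({
    "col_0": range(5),
    "col_1": [1, 2, 1, 2, 1],
    "col_2": 0,
    "col_3": [3.0, -4.0, 5.0, -6.0, 7.0],
})

sign = 2 * (df["col_1"] != 1) - 1      # -1 where col_1 == 1, +1 otherwise
df["col_3"] = df["col_3"].abs() * sign  # one vectorized pass over all rows
print(df)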

Use map:
df.iloc[:, 3] = df.iloc[:, 3].abs().mul(df.iloc[:, 1].map({2: 1, 1: -1}))

Another way to do this:
import pandas as pd
Take an example data set:
df = pd.DataFrame({'x1':[1,2,1,2], 'x2':[4,8,1,2]})
Make new column, code values as -1 and +1:
df['nx1'] = df['x1'].replace({1:-1, 2:1})
Multiply columnwise:
df['nx1'] * df['x2']
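If the result should replace the original values, the product can be assigned back and the helper column dropped (a small follow-up sketch):
# keep the signed values and remove the temporary sign column
df['x2'] = df['nx1'] * df['x2']
df = df.drop(columns='nx1')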

Related

How to sample from a dataset and get the indices of the samples in the initial dataset

I have a dataset A with a shape of (1000, 10).
I want to do sampling such that:
B = pd.DataFrame(A).sample(frac = 0.2)
How can I get the indices of A that correspond to the rows of B? Or how can I sort A based on B so that those 200 rows of B are at the beginning of A?
I have tried this code, but I don't understand why it gives me an error:
I = np.argwhere((A == B[:, None]).all(axis=2))[:, 1]
or this one
np.arange(A.shape[0])[np.isin(A,B).all(axis=1)]
thanks
1. Make a boolean column in A that tells whether the row is in B
2. We can get the rows that are in B by index with A.index.isin(B.index)
3. Sort with the new column and delete the column
# after defining A and B
# step 1, 2
A["isinB"] = A.index.isin(B.index)
# step 3: sort descending so Trues go to the front, Falses go to the end
A.sort_values("isinB", ascending=False).drop(columns="isinB")
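If only the row positions of the sample are needed (rather than reordering A), a small sketch assuming A is a DataFrame: .sample keeps the original index labels, so B.index already identifies the sampled rows, and Index.get_indexer translates those labels into integer positions.
import numpy as np
import pandas as pd

A = pd.DataFrame(np.random.rand(1000, 10))
B = A.sample(frac=0.2)

# label-based: B.index already identifies the sampled rows of A
sampled_labels = B.index

# position-based: translate those labels into integer positions within A
sampled_positions = A.index.get_indexer(B.index)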

Odd dropping of pandas rows based on conditions

I use the function:
def df_proc(df, n):
    print(list(df.lab).count(0))  # control label to see if it changes after conditional dropping
    print('C:', list(df.lab).count(1))
    df = df.drop(df[df.lab.eq(0)].sample(n).index)
    print(list(df.lab).count(0))
    print('C:', list(df.lab).count(1))
    return df
to drop pandas rows based on certain conditions (where df.lab == 0). This works fine on a small df (e.g. n = 100), however when I increase the number of rows in the df something odd happens: the counts of the other labels (!= 0) also begin to decrease and are affected by the condition.
For example:
# dummy example:
import random
import pandas as pd

list2 = [random.randrange(0, 6, 1) for i in range(1500000)]
list1 = [random.randrange(0, 100, 1) for i in range(1500000)]
dft = pd.DataFrame(list(zip(list1, list2)), columns=['A', 'lab'])
dftest = df_proc(dft,100000)
gives...
249797
C: 249585
149797
C: 249585
But when I run this on my actual df:
dftest = df_proc(S1,100000)
I get a change in my control labels which is weird.
467110
C: 70434
260616
C: 49395
I'm not sure where the error could have come from. I have tried using frac and df.query('lab == 0') but still run into the same error. The other thing I noticed is that with small n the control labels are unchanged; it's only when I increase n.
dftest = df_proc(S1,1)
gives:
467110
C: 70434
467107
C: 70434
That doesn't add up, as 3 samples have been removed, not 1.
If it's only about filtering, why not use:
dft = dft[dft['lab'] != 0]
This will filter out all rows with lab=0.
The error was that drop eliminates rows based on index labels; my df was a concatenation of several dataframes, so index labels were duplicated, and I had to use reset_index to overcome the problem.
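A minimal sketch of that situation and the fix, using a hypothetical frame built by concatenation so that index labels repeat:
import random
import pandas as pd

# hypothetical reconstruction: a frame built from concatenated pieces, so index labels repeat
part = pd.DataFrame({'A':   [random.randrange(0, 100) for _ in range(500000)],
                     'lab': [random.randrange(0, 6) for _ in range(500000)]})
S1 = pd.concat([part, part.copy(), part.copy()])   # labels 0..499999 appear three times

# with duplicate labels, .drop removes every row sharing a sampled label;
# resetting the index makes each label unique, so exactly n rows are dropped
S1 = S1.reset_index(drop=True)
dftest = df_proc(S1, 100000)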

How to change cycles in pandas

I have a dataframe and I need to change the 3rd column according to the rule:
1) if the difference between row i+1 and row i in the 2nd column is > 1, then the 3rd column increases by 1
I wrote the code using a loop, but it runs for an eternity.
I wrote the code in pure Python, but there must be a better way to do this in pandas.
So, how can I rewrite my code in pandas to reduce the time?
old_store_id = -1
for i in range(0, df_sort.shape[0]):
    if (old_store_id != df_sort.iloc[i, 0]):
        old_store_id = df_sort.iloc[i, 0]
        continue
    if (df_sort.iloc[i, 1] - df_sort.iloc[i - 1, 1]) > 1:
        df_sort.iloc[i, 2] = df_sort.iloc[i - 1, 2] + 1
    else:
        df_sort.iloc[i, 2] = df_sort.iloc[i - 1, 2]
df['value'] = df.groupby('store_id')['period_id'].transform(lambda x: (x.diff()>1).cumsum()+1)
So we group by store_id, check when the diff between periods is greater than 1, then take the cumsum of the bool. We added 1 to make the counter start at 1 instead of 0.
Make sure that period_id is sorted correctly before using the above code, otherwise it will not work.
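For example, a sketch of the sort step, assuming the column names store_id and period_id used above:
# sort so that consecutive rows within each store are in period order
df = df.sort_values(['store_id', 'period_id'])
df['value'] = df.groupby('store_id')['period_id'].transform(lambda x: (x.diff() > 1).cumsum() + 1)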

Using pandas, how to filter rows with similar values in two columns

I have a big dataframe (~10 millon rows). Each row has:
category
start position
end position
If two rows are in the same category and their start and end positions overlap within a ±5 tolerance, I want to keep just one of the rows.
For example
1, cat1, 10, 20
2, cat1, 12, 21
3, cat2, 10, 25
I want to filter out 1 or 2.
What I'm doing right now isn't very efficient,
import pandas as pd

df = pd.read_csv('data.csv', sep='\t', header=None)

params = {'min_distance': 5}   # the ±5 tolerance
dfs = {}
discard = []
rows = []
for seq in df.category.unique():
    dfs[seq] = df[df.category == seq]

for index, row in df.iterrows():
    if index in discard:
        continue
    df_2 = dfs[row.category]
    res = df_2[(abs(df_2.start - row.start) <= params['min_distance']) & (abs(df_2.end - row.end) <= params['min_distance'])]
    if len(res.index) > 1:
        discard.extend(res.index.values)
    rows.append(row)
df = pd.DataFrame(rows)
I've also tried a different approach making use of a sorted version of the dataframe.
my_index = 0
indexes = []
discard = []
count = 0
curr = 0
total_len = len(df.index)
while my_index < total_len - 1:
    row = df.iloc[[my_index]]
    cond = True
    next_index = 1
    while cond:
        second_row = df.iloc[[my_index + next_index]]
        c1 = (row.iloc[0].category == second_row.iloc[0].category)
        c2 = (abs(second_row.iloc[0].sstart - row.iloc[0].sstart) <= params['min_distance'])
        c3 = (abs(second_row.iloc[0].send - row.iloc[0].send) <= params['min_distance'])
        cond = c1 and c2 and c3
        if cond and (c2 and c3):
            indexes.append(my_index)
            cond = True
            next_index += 1
    indexes.append(my_index)
    my_index += next_index
indexes.append(total_len - 1)
The problem is that this solution is not perfect; sometimes it misses a row because the overlap could be several rows ahead, not in the next one.
I'm looking for ideas on how to approach this problem in a more pandas-friendly way, if one exists.
The approach here should be this:
pandas.groupby by category
agg(Func) on the groupby result
Func should implement the logic of finding the best range inside each category (sorted search, balanced trees, or anything else)
Do you want to merge all similar rows, or only 2 consecutive ones?
If all similar, I suggest you first order the rows by category, then by the 2 other columns, and squash the similar ones into a single row.
If only 2 consecutive, check whether the next value is in the range you set and, if yes, merge it. Here you can see how:
merge rows pandas dataframe based on condition
I don't believe the numeric comparisons can be made without a loop, but you can make at least part of this cleaner and more efficient:
dfs = {}
for seq in df.category.unique():
    dfs[seq] = df[df.category == seq]
Instead of this, use df.groupby('category').apply(drop_duplicates).droplevel(0), where drop_duplicates is a function containing your second loop. The function will then be called separately for each category, with a dataframe that contains only that category's rows. The outputs will be combined back into a single dataframe, which will have a MultiIndex with the value of "category" as an outer level; this can be removed with droplevel(0).
Secondly, within the category you could sort by the first of the two numeric columns for another small speed-up:
def drop_duplicates(df):
    df = df.sort_values("sstart")
    ...
This will allow you to stop the inner loop as soon as the sstart column value is out of range, instead of comparing every row to every other row.
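Putting those pieces together, here is a rough sketch; the column names category, sstart, send and the ±5 tolerance are taken from the question, and this is only one possible way to implement the inner dedup logic:
import pandas as pd

# toy frame with the structure described in the question
df = pd.DataFrame({
    "category": ["cat1", "cat1", "cat2"],
    "sstart":   [10, 12, 10],
    "send":     [20, 21, 25],
})

def drop_duplicates(group, tol=5):
    # sort by start so the inner scan can stop early
    group = group.sort_values("sstart")
    starts = group["sstart"].to_numpy()
    ends = group["send"].to_numpy()
    keep_labels, kept_starts, kept_ends = [], [], []
    for i in range(len(group)):
        duplicate = False
        # walk already-kept rows backwards; once the start gap exceeds the
        # tolerance, earlier rows (with even smaller starts) cannot match either
        for s, e in zip(reversed(kept_starts), reversed(kept_ends)):
            if starts[i] - s > tol:
                break
            if abs(ends[i] - e) <= tol:
                duplicate = True
                break
        if not duplicate:
            keep_labels.append(group.index[i])
            kept_starts.append(starts[i])
            kept_ends.append(ends[i])
    return group.loc[keep_labels]

filtered = df.groupby("category").apply(drop_duplicates).droplevel(0)
print(filtered)  # keeps one of the two overlapping cat1 rows plus the cat2 row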

How can I speed up the numpy operations?

I currently have a dataframe as shown below:
This dataframe has 1 million rows. I would like to perform the following operation:
Say for row 0, b is 6.
I would like to create another column, c.
The value of c for row 0 is computed as the mean of column a over the rows (i.e. the 8 rows in the above image) whose position is in the range from 6-3 to 6+3 (here 3 is a fixed number for all rows).
Currently I have performed this operation by converting column a and column b to numpy arrays and then looping. Below I have attached the code:
import time
import numpy as np
import pandas as pd

index = range(0, 1000000)
columns = ['A', 'B']
data = np.array([np.random.randint(10, size=1000000), np.random.randint(10, size=1000000)]).T
df = pd.DataFrame(data, index=index, columns=columns)

values_b = df.B.values
values_a = df.A.values
sum_array = []
program_starts = time.time()
for i in range(df.shape[0]):
    value_index = [values_b[i] - 3, values_b[i] + 3]
    sum_array.append(np.sum(values_a[value_index]))
time_now = time.time()
print('time taken ', time_now - program_starts)
This code is taking around 8 seconds to run.
How can I make this run faster? I tried to parallelize the task by splitting the array into chunks of 0.1 million rows and then calling this for loop in parallel on each chunk, but it takes even more time. Any help would be appreciated.
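One vectorized reading of the description, assuming the intent is to aggregate values_a over the positions b-3 through b+3 (clipped to the array bounds), is to build the whole index matrix at once and gather in a single step:
import numpy as np
import pandas as pd

n = 1_000_000
df = pd.DataFrame({
    "A": np.random.randint(10, size=n),
    "B": np.random.randint(10, size=n),
})

values_a = df["A"].to_numpy()
values_b = df["B"].to_numpy()

# one row of candidate positions per element: b-3, b-2, ..., b+3, clipped to the valid range
offsets = np.arange(-3, 4)
idx = np.clip(values_b[:, None] + offsets, 0, n - 1)

# gather and aggregate without a Python-level loop (use .sum to mirror the original np.sum)
df["C"] = values_a[idx].mean(axis=1)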
