Multiprocessing functions for dataframes

I have an Excel sheet with two columns: the first is keywords and the second is Url.
I am writing a script to extract groups of keywords that share three or more URLs.
I wrote the code below, but the main loop takes around an hour on a large Excel sheet.
import pandas as pd
import numpy as np
import time
loop = 1
numerator = 0
continuee = []
df_list = []
for index in list(df.sort_values('Url').set_index('Url').index.unique()):
    if len(df.sort_values('Url').set_index('Url').loc[index].values) == 1:
        list1 = list(df.sort_values('Url').set_index('Url').loc[index].values)
    elif len(df.sort_values('Url').set_index('Url').loc[index].keywords.values) > 1:
        list1 = list(df.sort_values('Url').set_index('Url').loc[index].keywords.values)
    df1 = df[df.keywords.isin(list1)]
    df1 = df1[df1.Url.duplicated(keep=False)]
    df1 = df1.groupby('Url').filter(lambda x: x.Url.value_counts() == df1.keywords.nunique())
    df1 = df1.groupby('keywords').filter(lambda x: x.keywords.value_counts() >= 3)
    df1 = df1.groupby('Url').filter(lambda x: x.Url.value_counts() == df1.keywords.nunique())
    if df1.keywords.nunique() > 1:
        silos = list(df1.keywords.unique())
        df_list.append({numerator: silos})
        word = word[~(word.isin(silos))]
        numerator += 1
    else:
        singles = list(word[word.keywords.isin(list1)].keywords.unique())
        df_list.append({"single": singles})
        word = word[~(word.isin(singles))]
    print(loop)
    loop += 1
trial = pd.DataFrame(df_list)
if 'single' in list(trial.columns):
    for i in list(word.keywords.unique()):
        if i not in list(trial.single):
            df_list.append({"single": i})
else:
    for i in list(word.keywords.unique()):
        df_list.append({"single": i})
trial = pd.DataFrame(df_list)
I tried many times to use multiprocessing, but I failed because I don't really understand how it works with pandas. Is there a way to speed this up? Also, if I wanted to pass in a couple of other functions, how would I do it? Many thanks in advance.

From what I can gather, this should solve it:
by_size = df.groupby(df.columns.tolist()).size().reset_index()
three_or_more=by_size[by_size[0]>=3].iloc[:,:-1]
Example:
>>> df
keyword url
0 2 2
1 4 3
2 2 1
3 4 3
4 1 1
5 2 1
6 4 1
7 2 1
8 1 1
9 3 3
>>> by_size = df.groupby(df.columns.tolist()).size().reset_index()
>>> by_size
keyword url 0
0 1 1 2
1 2 1 3
2 2 2 1
3 3 3 1
4 4 1 1
5 4 3 2
>>> three_or_more=by_size[by_size[0]>=3].iloc[:,:-1]
>>> three_or_more
keyword url
1 2 1
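
If the goal is instead to find pairs of keywords that have three or more URLs in common (one possible reading of the question, not confirmed by the asker), a self-merge on Url is another loop-free option. The following is only a minimal sketch under that assumption; the sample data and the names pairs, shared and three_or_more_shared are made up for illustration.

```python
import pandas as pd

# Hypothetical sample data using the column names from the question.
df = pd.DataFrame({
    "keywords": ["a", "a", "a", "b", "b", "b", "c", "c"],
    "Url": ["u1", "u2", "u3", "u1", "u2", "u3", "u1", "u9"],
})

# Pair every keyword with every other keyword that appears on the same Url.
pairs = df.merge(df, on="Url", suffixes=("_left", "_right"))
# Keep each unordered pair once and drop a keyword paired with itself.
pairs = pairs[pairs["keywords_left"] < pairs["keywords_right"]]

# Count how many distinct Urls each keyword pair has in common.
shared = (
    pairs.groupby(["keywords_left", "keywords_right"])["Url"]
    .nunique()
    .reset_index(name="shared_urls")
)
three_or_more_shared = shared[shared["shared_urls"] >= 3]
print(three_or_more_shared)
```

Note that the self-merge grows quadratically with the number of keywords per Url, so a few very popular URLs can make the intermediate frame large.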

Related

Combining looping and conditional to make new columns on dataframe

I want to write a function with a loop and a conditional that counts only when Actual Result = 1,
so that the number increases by 1 each time Actual Result = 1.
This is my dataframe:
This is my code, but it doesn't produce the result I want:
def func_count(x):
    for i in range(1,880):
        if x['Actual Result']==1:
            result = i
        else:
            result = '-'
        return result
X_machine_learning['Count'] = X_machine_learning.apply(lambda x:func_count(x),axis=1)
When I filter on Count != '-', the result looks like this:
The number is always 1 and does not increase by 1 each time Actual Result = 1. Any solution?

Try something like this:
import pandas as pd
df = pd.DataFrame({
    'age': [30,25,40,12,16,17,14,50,22,10],
    'actual_result': [0,1,1,1,0,0,1,1,1,0]
})
count = 0
lst_count = []
for i in range(len(df)):
    if df['actual_result'][i] == 1:
        count+=1
        lst_count.append(count)
    else:
        lst_count.append('-')
df['count'] = lst_count
print(df)
Result
age actual_result count
0 30 0 -
1 25 1 1
2 40 1 2
3 12 1 3
4 16 0 -
5 17 0 -
6 14 1 4
7 50 1 5
8 22 1 6
9 10 0 -

Actually, you don't need to loop over the dataframe; explicit loops are mostly a pandas anti-pattern that is best avoided. With df as your dataframe, you could try the following instead:
m = df["Actual Result"] == 1
df["Count"] = m.cumsum().where(m, "-")
Result for the following dataframe
df = pd.DataFrame({"Actual Result": [1, 1, 0, 1, 1, 1, 0, 0, 1, 0]})
is
Actual Result Count
0 1 1
1 1 2
2 0 -
3 1 3
4 1 4
5 1 5
6 0 -
7 0 -
8 1 6
9 0 -
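
Mixing integers and the string "-" turns Count into an object column. If the column should stay numeric, one variation (a sketch, not part of the answer above) is to leave the non-matching rows as missing values and use pandas' nullable integer dtype:

```python
import pandas as pd

df = pd.DataFrame({"Actual Result": [1, 1, 0, 1, 1, 1, 0, 0, 1, 0]})

m = df["Actual Result"] == 1
# where(m) leaves NaN on the non-matching rows; "Int64" is the nullable
# integer dtype, so the column stays numeric and shows <NA> instead of "-".
df["Count"] = m.cumsum().where(m).astype("Int64")
print(df)
```

The running count is the same as before; only the placeholder for the rows with Actual Result = 0 changes.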

Calculate a np.arange within a pandas dataframe from other columns

I want to create a new column with all the coordinates the car needs to pass through to reach a certain goal. This should be stored as a list in a pandas dataframe.
To start with I have this:
import numpy as np
import pandas as pd
cars = pd.DataFrame({'x_now': np.repeat(1,5),
                     'y_now': np.arange(5,0,-1),
                     'x_1_goal': np.repeat(1,5),
                     'y_1_goal': np.repeat(10,5)})
The output would be:
x_now y_now x_1_goal y_1_goal
0 1 5 1 10
1 1 4 1 10
2 1 3 1 10
3 1 2 1 10
4 1 1 1 10
I have tried to add new columns like this, and it does not work:
for xy_index in range(len(cars)):
    if cars.at[xy_index, 'x_now'] == cars.at[xy_index,'x_1_goal']:
        cars.at[xy_index, 'x_car_move_route'] = np.repeat(cars.at[xy_index, 'x_now'].astype(int),(
            abs(cars.at[xy_index, 'y_now'].astype(int)-cars.at[xy_index, 'y_1_goal'].astype(int))))
    else:
        cars.at[xy_index, 'x_car_move_route'] = \
            np.arange(cars.at[xy_index,'x_now'], cars.at[xy_index,'x_1_goal'],
                      (cars.at[xy_index,'x_1_goal'] - cars.at[xy_index,'x_now']) / (
                          abs(cars.at[xy_index,'x_1_goal'] - cars.at[xy_index,'x_now'])))
In the end I want the columns x_car_move_route and y_car_move_route so I can loop over the coordinates the cars need to pass through. I will display them with tkinter. I will also add more goals, since this is actually only the first turn they need to make.
x_now y_now x_1_goal y_1_goal x_car_move_route y_car_move_route
0 1 5 1 10 [1,1,1,1,1] [6,7,8,9,10]
1 1 4 1 10 [1,1,1,1,1,1] [5,6,7,8,9,10]
2 1 3 1 10 [1,1,1,1,1,1,1] [4,5,6,7,8,9,10]
3 1 2 1 10 [1,1,1,1,1,1,1,1] [3,4,5,6,7,8,9,10]
4 1 1 1 10 [1,1,1,1,1,1,1,1,1] [2,3,4,5,6,7,8,9,10]

You can apply() something like this route() function along axis=1, which means route() will receive rows from cars. It generates either x or y coordinates depending on what's passed into var (from args).
You can tweak/fix as needed, but it should get you started:
def route(row, var):
    var2 = 'y' if var == 'x' else 'x'
    now, now2 = row[f'{var}_now'], row[f'{var2}_now']
    goal, goal2 = row[f'{var}_1_goal'], row[f'{var2}_1_goal']
    diff, diff2 = goal - now, goal2 - now2
    if diff == 0:
        result = np.array([now] * abs(diff2)).astype(int)
    else:
        result = 1 + np.arange(now, goal, diff / abs(diff)).astype(int)
    return result
cars['x_car_move_route'] = cars.apply(route, args=('x',), axis=1)
cars['y_car_move_route'] = cars.apply(route, args=('y',), axis=1)
x_now y_now x_1_goal y_1_goal x_car_move_route y_car_move_route
0 1 5 1 10 [1,1,1,1,1] [6,7,8,9,10]
1 1 4 1 10 [1,1,1,1,1,1] [5,6,7,8,9,10]
2 1 3 1 10 [1,1,1,1,1,1,1] [4,5,6,7,8,9,10]
3 1 2 1 10 [1,1,1,1,1,1,1,1] [3,4,5,6,7,8,9,10]
4 1 1 1 10 [1,1,1,1,1,1,1,1,1] [2,3,4,5,6,7,8,9,10]
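
For the moving axis, the core step can also be written with an integer step taken from np.sign, which avoids the float step and handles movement in either direction. This is only a sketch; steps is a hypothetical helper, not part of the answer above.

```python
import numpy as np

def steps(now, goal):
    """Coordinates visited after leaving `now`, up to and including `goal`."""
    step = int(np.sign(goal - now))  # +1, -1, or 0 when already at the goal
    if step == 0:
        return np.array([], dtype=int)
    # Start one step past `now` and stop one step past `goal` (end is exclusive).
    return np.arange(now + step, goal + step, step)

y_route = steps(5, 10)              # moving up along y
x_route = np.full(len(y_route), 1)  # x stays at 1 for the same number of steps
back = steps(10, 5)                 # the same helper also works downwards
print(x_route, y_route, back)
```

The stationary axis is simply repeated once per step of the moving axis, which matches the expected lists in the question.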

How to set ranges of rows in pandas?

I have the following working code that sets "new_col" to 1 in the rows covered by the intervals given by starts and ends.
import pandas as pd
import numpy as np
df = pd.DataFrame({"a": np.arange(10)})
starts = [1, 5, 8]
ends = [1, 6, 10]
value = 1
df["new_col"] = 0
for s, e in zip(starts, ends):
    df.loc[s:e, "new_col"] = value
print(df)
a new_col
0 0 0
1 1 1
2 2 0
3 3 0
4 4 0
5 5 1
6 6 1
7 7 0
8 8 1
9 9 1
I want these intervals to come from another dataframe pointer_df.
How to vectorize this?
pointer_df = pd.DataFrame({"starts": starts, "ends": ends})
Attempt:
df.loc[pointer_df["starts"]:pointer_df["ends"], "new_col"] = 2
print(df)
This obviously doesn't work and raises
raise AssertionError("Start slice bound is non-scalar")
AssertionError: Start slice bound is non-scalar
EDIT:
It seems all answers use some kind of Python for loop.
The question was how to vectorize the operation above.
Is this not doable without for loops or list comprehensions?

You could do:
pointer_df = pd.DataFrame({"starts": starts, "ends": ends})
rang = np.arange(len(df))
indices = [i for s, e in pointer_df.to_numpy() for i in rang[slice(s, e + 1, None)]]
df.loc[indices, 'new_col'] = value
print(df)
Output
a new_col
0 0 0
1 1 1
2 2 0
3 3 0
4 4 0
5 5 1
6 6 1
7 7 0
8 8 1
9 9 1
If you want a method that does not use any for loop or list comprehension and relies only on NumPy, you could do:
def indices(start, end, ma=10):
    limits = end + 1
    lens = np.where(limits < ma, limits, end) - start
    np.cumsum(lens, out=lens)
    i = np.ones(lens[-1], dtype=int)
    i[0] = start[0]
    i[lens[:-1]] += start[1:]
    i[lens[:-1]] -= limits[:-1]
    np.cumsum(i, out=i)
    return i
pointer_df = pd.DataFrame({"starts": starts, "ends": ends})
df.loc[indices(pointer_df.starts.values, pointer_df.ends.values, ma=len(df)), "new_col"] = value
print(df)
I adapted the method to your use case from the one in this answer.
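
Another loop-free option, shown here as a sketch rather than as the method from either snippet above, is to compare every row label against every interval with NumPy broadcasting and build a boolean mask. It assumes the default integer index and uses memory proportional to rows × intervals, so it suits a modest number of intervals.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": np.arange(10)})
pointer_df = pd.DataFrame({"starts": [1, 5, 8], "ends": [1, 6, 10]})

# Shape (n_rows, 1) against shape (n_intervals,) broadcasts to (n_rows, n_intervals).
rows = df.index.to_numpy()[:, None]
starts = pointer_df["starts"].to_numpy()
ends = pointer_df["ends"].to_numpy()

# A row is selected if it falls inside at least one inclusive [start, end] interval.
mask = ((rows >= starts) & (rows <= ends)).any(axis=1)

df["new_col"] = 0
df.loc[mask, "new_col"] = 1
print(df)
```

Because the comparison is purely on values, an end beyond the last row (10 here) needs no special handling.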

for i,j in zip(pointer_df["starts"],pointer_df["ends"]):
    print (i,j)
Apply the same method, but on your dictionary.

How can I reference particular cells in a dataframe?

I am a beginner and this is my first project. I searched for the answer but it still isn't clear.
I have imported a worksheet from Excel using pandas.
Rabbit Class:
   Num  Behavior  Speaking  Listening
0    1         3         1          1
1    2         1         1          1
2    3         3         1          1
3    4         1         1          1
4    5         3         2          2
5    6         3         2          3
6    7         3         3          1
7    8         3         3          3
8    9         2         3          2
What I want to do is write if statements. For example, if a student's behavior is a "1" I want to print one string, otherwise print a different string. How can I reference a particular cell of the worksheet to set up such a function? I tried val = df.at(1, "Behavior"), but that clearly isn't working.
Here is the code I have so far:
import os
import pandas as pd
from pandas import ExcelWriter
from pandas import ExcelFile
path = r"C:\Users\USER\Desktop\Python\rabbit_class.xls"
df = pd.read_excel(path)  # assumed: the posted snippet never loads the file
print("Rabbit Class:")
print(df)

You can also do:
dff = df.loc[df['Behavior'] == 1]
if not dff.empty:
    pass  # do something

What you want is to find rows where df.Behavior is equal to 1. Use any of the following three methods.
# Method-1
df[df["Behavior"]==1]
# Method-2
df.loc[df["Behavior"]==1]
# Method-3
df.query("Behavior==1")
Output:
Num Behavior Speaking Listening LastColumn
0 0 1 3 1 1
Note: Dummy Data
Your sample data does not have a column header (the last one). So I named it LastColumn and read-in the data as a dataframe.
# Dummy Data (imports added so the snippet runs on its own)
import re
import numpy as np
import pandas as pd
s = """
Num Behavior Speaking Listening LastColumn
0 1 3 1 1
1 2 1 1 1
2 3 3 1 1
3 4 1 1 1
4 5 3 2 2
5 6 3 2 3
6 7 3 3 1
7 8 3 3 3
8 9 2 3 2
"""
# Make Dataframe
ss = re.sub(r'\s+', ',', s)
ss = ss[1:-1]
sa = np.array(ss.split(',')).reshape(-1,5)
df = pd.DataFrame(dict((k,v) for k,v in zip(sa[0,:], sa[1:,].T)))
df = df.astype(int)
df

Hopefully the example below helps:
import pandas as pd
df = pd.read_excel(r"D:\test_stackoverflow.xlsx")
print(df.columns)
def _filter(col, filter_):
    return df[df[col]==filter_]
print(_filter('Behavior', 1))
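
On the df.at(1, "Behavior") attempt from the question: .at is an indexer rather than a method, so it takes square brackets. A minimal sketch with a few made-up rows in the same layout as the worksheet (the message strings are placeholders):

```python
import pandas as pd

# Hypothetical rows mirroring the first three students of the worksheet.
df = pd.DataFrame({
    "Num": [1, 2, 3],
    "Behavior": [3, 1, 3],
    "Speaking": [1, 1, 1],
    "Listening": [1, 1, 1],
})

# df.at[row_label, column_label] returns a single cell.
val = df.at[1, "Behavior"]

message = "good" if val == 1 else "needs attention"
print(val, message)
```

df.loc[1, "Behavior"] returns the same cell; .at is just the faster accessor for a single scalar.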

Thank you all for your answers. I finally figured out what I was trying to do using the following code:
i = 0
for i in df.index:
    student_number = df["Student Number"][i]
    print(student_number)
    student_name = student_list[int(student_number) - 1]
    behavior = df["Behavior"][i]
    if behavior == 1:
        print("%s's behavior is good" % student_name)
    elif behavior == 2:
        print ("%s's behavior is average." % student_name)
    else:
        print ("%s's behavior is poor" % student_name)
    speaking = df["Speaking"][i]

Return rows based off the most recent increase in value from other columns python

The title of this question is a little hard to write out succinctly.
I have a pandas df that contains integers and a relevant key column. When a value is present in the key column, I want to return the most recent increase in integers from the other columns.
For the df below, the key column is [Area]. When X is in [Area], I want to find the most recent increase in integers from columns ['ST_A','PG_A','ST_B','PG_B'].
import pandas as pd
d = ({
'ST_A' : [0,0,0,0,0,1,1,1,1],
'PG_A' : [0,0,0,1,1,1,2,2,2],
'ST_B' : [0,1,1,1,1,1,1,1,1],
'PG_B' : [0,0,0,0,0,0,0,1,1],
'Area' : ['','','X','','X','','','','X'],
})
df = pd.DataFrame(data = d)
Output:
ST_A PG_A ST_B PG_B Area
0 0 0 0 0
1 0 0 1 0
2 0 0 1 0 X
3 0 1 1 0
4 0 1 1 0 X
5 1 1 1 0
6 1 2 1 0
7 1 2 1 1
8 1 2 1 1 X
I tried to use df = df.loc[(df['Area'] == 'X')] but this returns the rows where X is situated. I need something that uses X to return the most recent row where there was an increase in Columns ['ST_A','PG_A','ST_B','PG_B'].
I have also tried:
cols = ['ST_A','PG_A','ST_B','PG_B']
df[cols] = df[cols].diff()
df = df.fillna(0.)
df = df.loc[(df[cols] == 1).any(axis=1)]
This returns all rows where there was an increase in columns ['ST_A','PG_A','ST_B','PG_B'], not only the most recent increase before each X in ['Area'].
Intended Output:
ST_A PG_A ST_B PG_B Area
1 0 0 1 0
3 0 1 1 0
7 1 2 1 1
Does this question make sense or do I need to simplify it?

I believe you can use NumPy here via np.searchsorted:
import numpy as np
increases = np.where(df.iloc[:, :-1].diff().gt(0).max(1))[0]
marks = np.where(df['Area'].eq('X'))[0]
idx = increases[np.searchsorted(increases, marks) - 1]
res = df.iloc[idx]
print(res)
ST_A PG_A ST_B PG_B Area
1 0 0 1 0
3 0 1 1 0
7 1 2 1 1
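
One edge case to keep in mind with this approach: if an X appears before any increase, np.searchsorted returns 0, and subtracting 1 gives -1, which silently wraps around to the last increase. The sketch below adds a guard for that; the arrays are hard-coded stand-ins for increases and marks, with an extra hypothetical mark at row 0.

```python
import numpy as np

increases = np.array([1, 3, 5, 6, 7])  # rows where some column went up
marks = np.array([0, 2, 4, 8])         # rows flagged with X; row 0 has no earlier increase

# Position of the last increase strictly before each mark (-1 if there is none).
pos = np.searchsorted(increases, marks) - 1
valid = pos >= 0
idx = increases[pos[valid]]
print(idx)
```

With the default side="left", a mark that sits on a row which itself has an increase picks the previous increase; passing side="right" would pick that same row instead.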

This is not efficient, but it works; it is a big chunk of code that is rather slow:
indexes=np.where(df['Area']=='X')[0].tolist()
indexes2=list(map((1).__add__,np.where(df[df.columns[:-1]].sum(axis=1) < df[df.columns[:-1]].shift(-1).sum(axis=1).sort_index())[0].tolist()))
l=[]
for i in indexes:
    if min(indexes2,key=lambda x: abs(x-i)) in l:
        l.append(min(indexes2,key=lambda x: abs(x-i))-2)
    else:
        l.append(min(indexes2,key=lambda x: abs(x-i)))
print(df.iloc[l].sort_index())
Output:
Area PG_A PG_B ST_A ST_B
1 0 0 0 1
3 1 0 0 1
7 2 1 1 1
