I have a function that receives a DataFrame and a dictionary that maps each column name to an (operator, threshold) pair. The setup looks like:
import operator
import pandas as pd

df = pd.DataFrame(...)
df["passed_thresholds"] = False
threshold_dict = {"height": (operator.lt, 0.7), "width": (operator.gt, 0.1)}

def my_func(df, threshold_dict):
    # return df with "passed_thresholds" set to True for rows that meet all thresholds
    ...
What I want to do is to find all the rows in df that meet the thresholds in threshold_dict and set the "passed_thresholds" column to be True for those rows only. Usually I can do this pretty easily with:
df.loc[(df["height"] < 0.7) & (df["width"] > 0.1), "passed_thresholds"] = True
But the issue here is that I won't know how many elements will be inside threshold_dict or what their values will be. By the way, threshold_dict is flexible and I can change how it looks/works if you have a better idea for it; for example, maybe passing in an operator function isn't the best approach.
One option is to build one boolean Series per threshold in a comprehension, concat them column-wise, and then reduce with all:

out = pd.concat([op(df[col], val) for col, (op, val) in threshold_dict.items()], axis=1).all(axis=1)
df['passed_thresholds'] = out
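For reference, a minimal end-to-end sketch of my_func (the sample data here is made up for illustration):

import operator
import numpy as np
import pandas as pd

def my_func(df, threshold_dict):
    # build one boolean mask per (column, operator, threshold) entry;
    # a row passes only if every mask is True
    masks = [op(df[col], val) for col, (op, val) in threshold_dict.items()]
    df["passed_thresholds"] = np.logical_and.reduce(masks)
    return df

df = pd.DataFrame({"height": [0.5, 0.9, 0.6], "width": [0.2, 0.3, 0.05]})
threshold_dict = {"height": (operator.lt, 0.7), "width": (operator.gt, 0.1)}
print(my_func(df, threshold_dict))
# only row 0 passes (0.5 < 0.7 and 0.2 > 0.1); rows 1 and 2 each fail one test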
I have a groupby question that I can't solve. It is probably simple, but I can't get it to work nicely. I am trying to compute some statistics on a variable with pandas groupby chained with the very handy agg function. I would like to add to the list below a count of the values above a given threshold.
df = df.groupby(['scenario','Name','year','month'])["Value"].agg([np.min,np.max,np.mean,np.std])
Usually, I compute the number of values above a given threshold as shown below, but I can't find a way to add this to the aggregation function. Do you know how I could do that?
df = df[df > 0].groupby(['scenario','Name','year','month']).count()
Your answer works. Alternatively, you can keep it to one line and avoid defining a separate function by using a lambda instead:
df = df.groupby(["scenario", "Name", "year", "month"])["Value"].agg([np.min, np.max, np.mean, np.std, lambda x: ((x > 0)*1).sum()])
The logic here: (x > 0) returns a boolean Series; * 1 casts the booleans to integers (True = 1, False = 0); .sum() then adds up the 1s and 0s within each group, so the total counts the values greater than 0.
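A tiny illustration of that bool-to-int trick on a plain Series:

import pandas as pd

s = pd.Series([3, -1, 0, 5])
print(((s > 0) * 1).sum())  # 2: two values are greater than 0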
Running a quick test on the time taken, your solution is faster, but I thought I would give an alternative solution anyway.
I found a solution by creating a function and passing it to the agg function.
def counta(x):
    m = np.count_nonzero(x > 10)
    return m

df = df.groupby(['scenario', 'Name', 'year', 'month'])["Value"].agg([np.min, np.max, np.mean, np.std, counta])
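On recent pandas (0.25+), named aggregation expresses the same thing with explicit output column names; a sketch assuming the threshold of 10 used above:

out = df.groupby(["scenario", "Name", "year", "month"])["Value"].agg(
    vmin="min",
    vmax="max",
    vmean="mean",
    vstd="std",
    # count of values above the threshold within each group
    above_10=lambda x: (x > 10).sum(),
)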
I have what I thought would be a straightforward thing to do in python using dask. I have a dataframe with some records in it, and I want to add a new column based on calling a function with values from two other columns as parameters.
Here is what I mean (pretend ge exists and takes two parameters):
import numpy as np
import dask.array as da

def gc(x, y):
    return ge(x, y)

def gdf(df):
    func1 = np.vectorize(gc)
    gh = da.from_array(func1(df.x, df.y))
    df['gh'] = gh
However, I seem to get one issue or another no matter what I try to do. Currently, in the above state, I get
Number of partitions do not match (2 != 33)
It feels like I'm either going about this all wrong (like maybe I need map_blocks or map_partitions or even gufunc), or I'm missing something easy where I can set the number of partitions on my array to match that of my dataframe.
Any help would be appreciated.
It should be possible to do this with assign or map_partitions:
func1 = np.vectorize(gc)
df = df.assign(gh=lambda df: func1(df.x, df.y))

# or try this
def myfunc(df):
    df['gh'] = func1(df.x, df.y)
    return df

df = df.map_partitions(myfunc)
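Since dask often cannot infer the output schema of a custom function, passing meta to map_partitions avoids the schema-inference warning. A self-contained sketch, with gc standing in as a hypothetical placeholder for the real ge:

import dask.dataframe as dd
import numpy as np
import pandas as pd

def gc(x, y):
    # hypothetical placeholder; substitute the real ge(x, y) here
    return x + y

func1 = np.vectorize(gc)

pdf = pd.DataFrame({"x": [1.0, 2.0, 3.0], "y": [10.0, 20.0, 30.0]})
df = dd.from_pandas(pdf, npartitions=2)

def myfunc(part):
    part["gh"] = func1(part.x, part.y)
    return part

# meta describes the result's columns and dtypes so dask can build the graph lazily
df = df.map_partitions(myfunc, meta=pdf.assign(gh=np.float64()))
print(df.compute())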
I have a dataframe that looks like the following, but with many rows:
import pandas as pd
data = {'intent': ['order_food', 'order_food', 'order_taxi', 'order_call', 'order_call', 'order_taxi'],
        'Sent': ['i need hamburger', 'she wants sushi', 'i need a cab', 'call me at 6', 'she called me', 'i would like a new taxi'],
        'key_words': [['need', 'hamburger'], ['want', 'sushi'], ['need', 'cab'], ['call', '6'], ['call'], ['new', 'taxi']]}
df = pd.DataFrame(data, columns=['intent', 'Sent', 'key_words'])
I have calculated the jaccard similarity using the code below (not my solution):
def lexical_overlap(doc1, doc2):
    words_doc1 = set(doc1)
    words_doc2 = set(doc2)
    intersection = words_doc1.intersection(words_doc2)
    return intersection
and modified the code given by @Amit Amola to compare overlapping words between every possible pair of rows, creating a dataframe out of it:
from itertools import combinations
from pandas import DataFrame

overlapping_word_list = []
for val in list(combinations(range(len(data_new)), 2)):
    overlapping_word_list.append(f"the shared keywords between {data_new.iloc[val[0], 0]} and {data_new.iloc[val[1], 0]} sentences are: {lexical_overlap(data_new.iloc[val[0], 1], data_new.iloc[val[1], 1])}")

# creating an overlap dataframe
banking_overlapping_words_per_sent = DataFrame(overlapping_word_list, columns=['overlapping_list'])
Since my dataset is huge, running this code to compare all rows takes forever, so I would like to compare only sentences that have the same intent and skip pairs with different intents. I am not sure how to do only that.
IIUC you just need to iterate over the unique values in the intent column and then use loc to grab just the rows that correspond to each one. If an intent has more than two rows you will still need combinations to get the unique pairs within it.
from itertools import combinations

for intent in df.intent.unique():
    # selecting with a single column label returns a Series of sentences
    rows = df.loc[df.intent == intent, "Sent"].to_list()
    for x, y in combinations(rows, 2):
        overlap = lexical_overlap(x, y)
        print(f"Overlap for ({x}) and ({y}) is {overlap}")

# Overlap for (i need hamburger) and (she wants sushi) is 46.666666666666664
# Overlap for (i need a cab) and (i would like a new taxi) is 40.0
# Overlap for (call me at 6) and (she called me) is 54.54545454545454
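Note that lexical_overlap as defined in the question returns the intersection set, while the values printed above are percentages. Those numbers match a character-level Jaccard score, so the answer presumably ran a variant along these lines (a reconstruction, not the original code):

def lexical_overlap(doc1, doc2):
    # character-level Jaccard similarity as a percentage:
    # |intersection| / |union| * 100 over the sets of characters
    chars_doc1 = set(doc1)
    chars_doc2 = set(doc2)
    return len(chars_doc1 & chars_doc2) / len(chars_doc1 | chars_doc2) * 100

# e.g. set("i need hamburger") and set("she wants sushi") share 7 of 15
# distinct characters, giving the 46.666... shown above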
OK, so I figured out what to do to get my desired output mentioned in the comments, based on @gold_cy's answer:
for intent in df.intent.unique():
    # grab the intent, key_words and Sent columns as a list of row lists
    rows = df.loc[df.intent == intent, ['intent', 'key_words', 'Sent']].values.tolist()
    for x, y in combinations(rows, 2):
        overlap = lexical_overlap(x[1], y[1])
        print(f"Overlap of intent ({x[0]}) for ({x[2]}) and ({y[2]}) is {overlap}")
I have a pandas DataFrame which holds the performance results for many athletes. Now I want to group the data by 'BIB#' and 'COURSE', so I write:
grupper = df.groupby(['BIB#', 'COURSE'])
Next, I want to find the two best runs (column = 'FINISH') for each 'BIB#' and 'COURSE', so I write:
x = grupper.apply(lambda x: x.nsmallest(2, 'FINISH'))
This gives me a DataFrame containing the two best runs for each BIB#/COURSE group.
Then, I want to calculate the mean of the two best runs for each athlete for each of the BIB and COURSE but can't find an appropriate solution. I have tried to apply mean() like in the code below but that calculates the mean for each column in the dataframe and that's not what I want.
x = grupper.apply(lambda x: x.nsmallest(2, 'FINISH')).mean()
What can I do?
I think you need to pass mean into the apply method after nsmallest:
x = grupper['FINISH'].apply(lambda x: x.nsmallest(2).mean())
Your own approach should also work if you chain mean after nsmallest:
x = grupper.apply(lambda x: x.nsmallest(2, 'FINISH').mean())
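A small self-contained check of the first form (the sample data here is invented, since the original frame isn't shown):

import pandas as pd

df = pd.DataFrame({
    "BIB#": [1, 1, 1, 2, 2, 2],
    "COURSE": ["A"] * 6,
    "FINISH": [52.1, 50.3, 51.0, 49.8, 50.5, 49.2],
})
grupper = df.groupby(["BIB#", "COURSE"])

# mean of the two fastest runs per athlete and course
x = grupper["FINISH"].apply(lambda s: s.nsmallest(2).mean())
print(x)
# BIB# 1 -> (50.3 + 51.0) / 2 = 50.65; BIB# 2 -> (49.2 + 49.8) / 2 = 49.5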
TL;DR - I want to mimic the behaviour of functions such as DataFrameGroupBy.std()
I have a DataFrame which I group.
I want to take one row to represent each group, and then add extra statistics regarding these groups to the resulting DataFrame (such as the mean and std of these groups)
Here's an example of what I mean:
import numpy
import pandas

df = pandas.DataFrame({"Amount": [numpy.nan, 0, numpy.nan, 0, 0, 100, 200, 50, 0, numpy.nan, numpy.nan, 100, 200, 100, 0],
                       "Id": [0, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2],
                       "Date": pandas.to_datetime(["2011-11-02", "NA", "2011-11-03", "2011-11-04",
                                                   "2011-11-05", "NA", "2011-11-04", "2011-11-04",
                                                   "2011-11-06", "2011-11-06", "2011-11-06", "2011-11-06",
                                                   "2011-11-08", "2011-11-08", "2011-11-08"], errors='coerce')})
g = df.groupby("Id")
f = g.first()
f["std"] = g.Amount.std()
Now, this works - but let's say I want a special std, which ignores 0, and regards each unique value only once:
def get_unique_std(group):
    vals = group.unique()
    vals = vals[vals > 0]
    return vals.std() if vals.shape[0] > 1 else 0
If I use
f["std"] = g.Amount.transform(get_unique_std)
I only get zeros... (Also for any other function such as max etc.)
But if I do it like this:
std = g.Amount.transform(get_unique_std)
I get the correct result, only not grouped anymore... I guess I can calculate all of these into columns of the original DataFrame (in this case df) before I take the representing row of the group:
df["std"] = g.Amount.transform(get_unique_std)
# regroup again the modified df
g = df.groupby("Id")
f = g.first()
But that would just be a waste of memory space since many rows corresponding to the same group would get the same value, and I'd also have to group df twice - once for calculating these statistics, and a second time to get the representing row...
So, as stated in the beginning, I wonder how I can mimic the behaviour of DataFrameGroupBy.std().
I think you may be looking for DataFrameGroupBy.agg(). It returns one value per group, indexed by the group keys just like f, whereas transform returns one value per original row; assigning the transform result to f aligns on index labels, which is why you only saw zeros. You can pass your custom function like this and get a grouped result:
g.Amount.agg(get_unique_std)
On modern pandas (0.25+) you can also use named aggregation to get each name as a column (the older dict form, g.Amount.agg({'my_std': ...}), and the pandas.np alias were deprecated and later removed):

g.Amount.agg(my_std=get_unique_std, numpy_std=numpy.std)
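To tie this back to the goal stated at the top: agg's result is indexed by Id just like f, so it can be assigned directly without grouping df twice. A short sketch using the sample frame above:

g = df.groupby("Id")
f = g.first()

# one value per group, indexed by Id, so it aligns with f directly
# (unlike transform, which returns one value per original row)
f["std"] = g.Amount.agg(get_unique_std)
print(f)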