Pass Multiple Columns to groupby.transform - python

I understand that when you call groupby.transform on a DataFrame column, that column is passed to the function that transforms the data. What I cannot understand is how to pass multiple columns to the function.
import pandas as pd
import numpy as np

people = pd.DataFrame(np.random.randn(5, 5),
                      columns=['a', 'b', 'c', 'd', 'e'],
                      index=['Joe', 'Steve', 'Wes', 'Jim', 'Travis'])
key = ['one', 'two', 'one', 'two', 'one']
Now I can easily demean the data etc., but what I can't seem to do properly is transform data inside groups using multiple column values as parameters of the function. For example, if I wanted to add a column 'f' that took the value a.mean() - b.mean() * c for each observation, how can that be achieved using the transform method?
I have tried variants of the following:
people['f'] = np.nan
Grouped = people.groupby(key)

def TransFunc(col1, col2, col3):
    return col1.mean() - col2.mean() * col3

Grouped.f.transform(TransFunc(Grouped['a'], Grouped['b'], Grouped['c']))
But this is clearly wrong. I have also tried to wrap the function in a lambda but can't quite make that work either.
I am able to achieve the result by iterating through the groups in the following manner:
for group in Grouped:
    Amean = np.mean(list(group[1].a))
    Bmean = np.mean(list(group[1].b))
    CList = list(group[1].c)
    IList = list(group[1].index)
    for y in range(len(CList)):
        people.loc[IList[y], 'f'] = (Amean - Bmean) * CList[y]
But that does not seem a satisfactory solution, particularly if the index is non-unique. Also, I know this must be possible using groupby.transform.
To generalise the question: how does one write transformation functions whose parameters involve values from multiple columns?
Help appreciated.

You can use the apply() method:
import numpy as np
import pandas as pd

np.random.seed(0)
people2 = pd.DataFrame(np.random.randn(5, 5),
                       columns=['a', 'b', 'c', 'd', 'e'],
                       index=['Joe', 'Steve', 'Wes', 'Jim', 'Travis'])
key = ['one', 'two', 'one', 'two', 'one']
Grouped = people2.groupby(key)

def f(df):
    df["f"] = (df.a.mean() - df.b.mean()) * df.c
    return df

people2 = Grouped.apply(f)
print(people2)
If you want a more general method:
Grouped = people2.groupby(key)

def f(a, b, c, **kw):
    return (a.mean() - b.mean()) * c

people2["f"] = Grouped.apply(lambda df: f(**df))
print(people2)

This is based upon the answer provided by HYRY (thanks), which made me see how this could be achieved. My version does nothing more than generalise the function and pass in its arguments when it is called. I think, though, that the function has to be called via a lambda:
import pandas as pd
import numpy as np
people = pd.DataFrame(np.random.randn(5, 5),
                      columns=['a', 'b', 'c', 'd', 'e'],
                      index=['Joe', 'Steve', 'Wes', 'Jim', 'Travis'])
key = ['one', 'two', 'one', 'two', 'one']
people['f'] = ""
Grouped = people.groupby(key)

def FUNC(df, col1, col2, col3, col4):
    df[col1] = (df[col2].mean() - df[col3].mean()) * df[col4]
    return df

people2 = Grouped.transform(lambda x: FUNC(x, 'f', 'a', 'b', 'c'))
This appears to me to be the best way I have seen of doing this. Basically the entire group DataFrame is passed to the function as x, and the column names can then be passed as arguments.
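For completeness, a minimal sketch of another route that avoids the lambda entirely: compute the per-group means with column-wise transform('mean') calls and combine them. This assumes the people frame and key list defined above.
# per-group means broadcast back onto the original index, combined row-wise
a_mean = people['a'].groupby(key).transform('mean')
b_mean = people['b'].groupby(key).transform('mean')
people['f'] = (a_mean - b_mean) * people['c']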

Related

Highlighting specific columns in bar chart in python using altair

I want to highlight specific data points in a bar chart in Python based on my requirement. Using Altair, I am able to achieve this for one data point (e.g. 'A' in the code). Here's the sample data frame and code:
import pandas as pd
import altair as alt

df = pd.DataFrame({
    'Rank': [25, 20, 40, 10, 50, 35],
    'Name': ['A', 'B', 'C', 'D', 'E', 'F'],
})

bar1 = alt.Chart(df).mark_bar().encode(
    y=alt.Y('Rank'),
    x=alt.X('Name:O', sort='x'),
    color=alt.condition(
        alt.datum.Name == 'A',
        alt.value('red'),
        alt.value('blue')
    )
)
bar1
How can I highlight two or more data points (e.g. A and B) with the same color and the others with a different one? I tried passing the names as a list, Select = ['A', 'B'], and then using alt.datum.Name == Select, but that does not work.
How can I get this done?
Also, I'm trying to understand why passing a list did not work.
Thank you.
You could use the FieldOneOfPredicate to check if the Name column is one of the items in the list:
import pandas as pd
import altair as alt

df = pd.DataFrame({
    'Rank': [25, 20, 40, 10, 50, 35],
    'Name': ['A', 'B', 'C', 'D', 'E', 'F'],
})

bar1 = alt.Chart(df).mark_bar().encode(
    y=alt.Y('Rank'),
    x=alt.X('Name:O', sort='x'),
    color=alt.condition(
        alt.FieldOneOfPredicate('Name', ['A', 'B']),
        alt.value('red'),
        alt.value('blue')
    )
)
bar1
You can read more about it in the Vega-Lite docs. You could also use two expression strings with an "or" operator:
color=alt.condition(
    "datum.Name == 'A'"
    " || datum.Name == 'B'",  # split over two rows for readability
    alt.value('red'),
    alt.value('blue')
)
I don't think there is a single Vega expression operator that you can use for checking membership like Python's in. This answer doesn't mention it either.
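As for why passing the list did not work: alt.datum.Name == Select builds a single equality comparison against the whole list rather than a membership test. A minimal sketch that keeps the names in a Python list (Select is the variable name from the question) is to hand the list straight to FieldOneOfPredicate:
Select = ['A', 'B']

color=alt.condition(
    alt.FieldOneOfPredicate('Name', Select),  # membership test against the list
    alt.value('red'),
    alt.value('blue')
)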

Merge Dictionaries Based on Matching Values from 2 Specified Keys [Matching Values are in Numpy Arrays]

Been looking around, but haven't yet found a solution to this. Sorry if I missed it.
I'm trying to create the equivalent of a pandas merge or SQL JOIN with dictionaries where the values are numpy arrays.
Below is an example of the inputs and the desired output.
Example inputs:
import numpy as np

dict_1 = {
    'col1': np.array(['one', 'two', 'three', 'four']),
    'col2': np.array(['item1', 'item2', 'item3', 'item4'])}
dict_2 = {
    'col3': np.array(['two', 'two', 'six', 'seven', 'eight']),
    'col4': np.array(['item2', 'item3', 'item4', 'item5', 'item 5'])}
Example desired output (updated):
new_dict = {
    'col1': array(['one', 'two', 'two', 'two', 'three', 'four']),
    'col2': array(['item1', 'item2', 'item2', 'item3', 'item3', 'item4']),
    'col3': array([np.nan, 'two', 'two', np.nan, np.nan]),
    'col4': array([np.nan, 'item2', 'item3', np.nan, np.nan, np.nan]),
}
So the goal here is that the function would identify matches between col1 in dict_1 and col3 in dict_2, then return all matches, with priority given to the left side.
i.e., priority on the left side means 'four' is returned because it's in dict_1, even though there is no match, similar to what you'd get from:
a pandas merge with how='left' on 2 DataFrames
a LEFT JOIN in SQL on two database tables
Of course I could turn the dictionaries into DataFrames and use pandas merge, but ideally I'm looking to solve this without using pandas.
Any help would be appreciated!!
Thank you
Here's how to accomplish the desired result with a pandas merge:
import pandas as pd
import numpy as np

dict_1 = {
    'col1': np.array(['one', 'two', 'three', 'four']),
    'col2': np.array(['item1', 'item2', 'item3', 'item4'])}
dict_2 = {
    'col3': np.array(['two', 'two', 'six', 'seven', 'eight']),
    'col4': np.array(['item2', 'item3', 'item4', 'item5', 'item 5'])}

df1 = pd.DataFrame(dict_1)
df2 = pd.DataFrame(dict_2)
new_df = df1.merge(df2, how='left', left_on='col1', right_on='col3')
new_df
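If the end result needs to be a plain dict of NumPy arrays again, the merged frame can be converted back, for example:
# convert the merged DataFrame back into a dict of NumPy arrays
new_dict = {col: new_df[col].to_numpy() for col in new_df.columns}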
Ended up figuring out how to do this (not a 1:1 match to the above requirements, but similar, and it gets to the same desired output).
Definitely not the cleanest solution in the world (not even close), but it gets the job done...
Here's the final script:
import numpy as np

struct_arr1 = np.array([('jason', '28', 'j#j.com', 'j#j.com'),
                        ('jared', '31', 'jm#j.com', 'j#j.com'),
                        ('george', '28', 'gmm#j.com', 'j#j.com')],
                       dtype=[('name', 'object'), ('ag', 'object'), ('emai', 'object'), ('email', 'object')])
struct_arr2 = np.array([('jason', '22', 'jm#j.com'),
                        ('jason', '27', 'jmm#j.com'),
                        ('jared', 22, 'm#j.com')],
                       dtype=[('name', 'object'), ('age', 'object'), ('email', 'object')])

def removeDuplicates(lst):
    return [t for t in (set(tuple(i) for i in lst))]

def join_by_left(key, r1, r2):
    # figure out the field names of the two input arrays
    key1 = r1.dtype.descr
    list_keys1 = [d[0] for d in key1]
    key2 = r2.dtype.descr
    list_keys2 = [d[0] for d in key2]
    len_keys = len(list_keys1)
    # get a dict of the rows of r2 grouped by key
    rows2 = {}
    for row2 in r2:
        rows2.setdefault(row2[key], []).append(row2)
    # merge the data into the return array
    new_arr = []
    for row1 in r1:
        if row1[key] in rows2:  # return matches on key
            for row2 in rows2[row1[key]]:
                ret = tuple(row1[list_keys1]) + tuple(row2[list_keys2])
                new_arr.append(ret)
        else:  # for a left join, return non-matches on the left side
            for j in range(len_keys):
                null = ('na', 'na', 'na')
                ret = tuple(row1[list_keys1]) + tuple(null)
                new_arr.append(ret)
    new_arr = removeDuplicates(lst=new_arr)  # remove any duplicates
    return new_arr

new_arr = join_by_left(key='name', r1=struct_arr1, r2=struct_arr2)

# convert to list of lists
final = []
r = len(new_arr[0])
for num in range(r):
    fields = [i[num] for i in new_arr]
    final.append(fields)

# convert list of lists to dict
new_dict = {}
i = 1
for item in final:
    ndict = {f'field{i}': np.array(item)}
    new_dict.update(ndict)
    i = i + 1

print(new_dict)
Returns
{
'field1': array(['jared', 'jason', 'george', 'jason'], dtype='<U6'),
'field2': array(['31', '28', '28', '28'], dtype='<U2'),
'field3': array(['jm#j.com', 'j#j.com', 'gmm#j.com', 'j#j.com'], dtype='<U9'),
'field4': array(['j#j.com', 'j#j.com', 'j#j.com', 'j#j.com'], dtype='<U7'),
'field5': array(['jared', 'jason', 'na', 'jason'], dtype='<U5'),
'field6': array(['22', '27', 'na', '22'], dtype='<U11'),
'field7': array(['m#j.com', 'jmm#j.com', 'na', 'jm#j.com'], dtype='<U9')
}
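For reference, here is a minimal pure-Python/NumPy sketch of a left join that works directly on the original dicts of arrays from the question. The function name left_join_dicts and the np.nan padding for unmatched left rows are illustrative choices, not from the question:
import numpy as np

# inputs from the question
dict_1 = {
    'col1': np.array(['one', 'two', 'three', 'four']),
    'col2': np.array(['item1', 'item2', 'item3', 'item4'])}
dict_2 = {
    'col3': np.array(['two', 'two', 'six', 'seven', 'eight']),
    'col4': np.array(['item2', 'item3', 'item4', 'item5', 'item 5'])}

def left_join_dicts(left, right, left_key, right_key):
    # index the right-hand rows by their key value
    lookup = {}
    for i, k in enumerate(right[right_key]):
        lookup.setdefault(k, []).append(i)

    out = {name: [] for name in list(left) + list(right)}
    for i in range(len(left[left_key])):
        # unmatched left rows produce one output row padded with np.nan
        matches = lookup.get(left[left_key][i], [None])
        for j in matches:
            for name in left:
                out[name].append(left[name][i])
            for name in right:
                out[name].append(right[name][j] if j is not None else np.nan)
    return {name: np.array(vals, dtype=object) for name, vals in out.items()}

new_dict = left_join_dicts(dict_1, dict_2, 'col1', 'col3')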

How to speed up nested for loop with dataframe?

I have a dataframe like this:
test = pd.DataFrame({'id':['a','C','D','b','b','D','c','c','c'], 'text':['a','x','a','b','b','b','c','c','c']})
Using the following for-loop I can add x to a new_col. This for-loop works fine for the small dataframe. However, for dataframes that have thousands of rows, it will take many hours to process. Any suggestions to speed it up?
for index, row in test.iterrows():
    if row['id'] == 'C':
        if test['id'][index+1] == 'D':
            test['new_col'][index+1] = test['text'][index]
Try using shift() and conditions.
import pandas as pd
import numpy as np

df = pd.DataFrame({'id': ['a', 'C', 'D', 'b', 'b', 'D', 'c', 'c', 'c'],
                   'text': ['a', 'x', 'a', 'b', 'b', 'b', 'c', 'c', 'c']})

df['temp_col'] = df['id'].shift()
df['new_col'] = np.where((df['id'] == 'D') & (df['temp_col'] == 'C'), df['text'].shift(), "")
del df['temp_col']
print(df)
We can also do it without a temporary column (thanks and credit to Prayson 🙂):
df['new_col'] = np.where((df['id'].eq('D')) & (df['id'].shift().eq('C')), df['text'].shift(), "")
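A roughly equivalent .loc-based sketch; if new_col does not already exist, the non-matching rows are simply left as NaN rather than "":
# boolean mask: rows whose id is 'D' and whose previous id is 'C'
mask = (df['id'] == 'D') & (df['id'].shift() == 'C')
df.loc[mask, 'new_col'] = df['text'].shift()[mask]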

pandas map function to data frame across all columns and one fixed column

I have a pandas dataframe df with 5 columns, 'A', 'B', 'C', 'D', 'E'
I would like to apply a function to the first 4 columns ('A', 'B', 'C', 'D') that takes two inputs X[i] and E[i] for row i where X is one of the first four columns.
Ignoring E[i], this is fairly straightforward:
def do_something(value):
    return some_transformation(value)

df[['A', 'B', 'C', 'D']].applymap(do_something)
Similarly, if I have a constant value I can do it with map:
def do_something(value, i):
    return some_transformation(value, i)

df[['A', 'B', 'C', 'D']].map(lambda f: do_something(f, 6))
But how do I do this if instead of 6 I want to pass in the value of E in the same row?
Using np.vectorize, you can pass whole columns to the function while the actual computation happens element-wise over each pair of values.
def do_something(x, y):
    return some_transformation(x, y)

v = np.vectorize(do_something)
df[['A', 'B', 'C', 'D']].apply(v, args=(df['E'],))
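A pandas-only alternative sketch, using Series.combine to pair each element of a column with the E value in the same row. The toy df and do_something below are stand-ins for illustration, not from the question:
import pandas as pd
import numpy as np

# toy frame and transformation, assumed only for the example
df = pd.DataFrame(np.arange(20).reshape(4, 5), columns=list('ABCDE'))

def do_something(value, e):
    return value * e  # stand-in for the real transformation

# combine() applies do_something element-wise to (column value, E value) pairs
df[['A', 'B', 'C', 'D']] = df[['A', 'B', 'C', 'D']].apply(
    lambda col: col.combine(df['E'], do_something)
)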

Pyspark join and operation on values within a list in column

I have two dataframes, namely
df1 = sc.parallelize([
    ['u1', 'type1', ['a', 'b']],
    ['u2', 'type1', ['a', 'c', 'd']],
    ['u1', 'type2', ['d']]
]).toDF(('person', 'type', 'keywords'))

df2 = sc.parallelize([
    ['a', 2],
    ['b', 1],
    ['c', 0],
    ['d', 1],
    ['e', 3],
]).toDF(('keyword', 'score'))
I need to calculate, for each person and type, the average score of its keywords. So this average would be 1.5 for person 'u1' on type 'type1', as it has keywords 'a' and 'b', which contribute (2 + 1) / 2 = 1.5.
I have tried an approach encompassing a join:
df = df1.join(df2) \
    .select('person', 'type', 'keywords', 'keyword', 'score') \
    .groupBy('person', 'type') \
    .agg(avg('score'))
but the problem is that it computes the average over every possible keyword, not only those that the given person and type actually have, so I get 1.4 everywhere, which is the sum of all scores for all keywords divided by their count.
I need to average only the scores of the keywords that appear in each person's and type's keywords list.
You'll have to explode the keywords first:
from pyspark.sql.functions import explode, avg, col

(df1.select("person", "type", explode("keywords").alias("keyword"))
    .join(df2, "keyword")
    .groupBy("person", "type")
    .agg(avg("score")))
While it could be possible to do something like this
from pyspark.sql.functions import expr

(df1.join(df2, expr("array_contains(keywords, keyword)"))
    .groupBy("person", "type")
    .agg(avg("score")))
to achieve the same result, in practice you want to avoid it, because it expands into a Cartesian product before filtering.
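If df2 is small, a hedged alternative sketch is to collect the scores into a plain Python dict and compute the average with a UDF, avoiding the join entirely (the names scores, mean_score and avg_score are my own, not from the question):
from pyspark.sql.functions import udf
from pyspark.sql.types import DoubleType

# collect the (small) score table into a plain Python dict
scores = {row['keyword']: row['score'] for row in df2.collect()}

@udf(DoubleType())
def mean_score(keywords):
    # average the scores of the keywords present in the dict
    vals = [scores[k] for k in keywords if k in scores]
    return float(sum(vals)) / len(vals) if vals else None

result = df1.withColumn('avg_score', mean_score('keywords'))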
