Here is my attempt.
The DataFrame I have has a column that decides how each sample's features should be handled.
For example, df has two columns, DATA and TYPE. TYPE has three classes: S1, S2 and S3, and I define a different function for each type of sample.
#### S1
def f_s1(data):
    result = data + 1
    return result

#### S2
def f_s2(data):
    result = data + 2
    return result

#### S3
def f_s3(data):
    result = data + 3
    return result
I also created a mapping dict:
f_map = {'S1': f_s1, 'S2': f_s2, 'S3': f_s3}
Then I use the pandas Series.map utility to map these functions onto the type of each sample.
df['result'] = df['TYPE'].map(f_map)(df['DATA'])
But it didn't work; it fails with TypeError: 'Series' object is not callable.
Any advice would be appreciated!
df['TYPE'].map(f_map) creates a Series of functions, and if you want to apply them to your data column element-wise, one option would be to use the zip() function as follows:
df['result'] = [func(data) for func, data in zip(df['TYPE'].map(f_map), df['DATA'])]
df
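An equivalent (if typically slower) per-row option, a minimal sketch assuming the same df and f_map as above, is a row-wise apply:
df['result'] = df.apply(lambda row: f_map[row['TYPE']](row['DATA']), axis=1)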
Alternatively, you can group by TYPE and then apply the specific function for each type (or group) to the DATA column in that group, assuming your predefined functions use vectorized operations and therefore accept a Series as their parameter:
df = pd.DataFrame({'TYPE':['S1', 'S2', 'S3', 'S1'], 'DATA':[1, 1, 1, 1]})
df['result'] = (df.groupby('TYPE').apply(lambda g: f_map.get(g['TYPE'].iloc[0])(g['DATA']))
.reset_index(level = 0, drop = True))
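With that toy df, each of these approaches should produce the same frame (f_s1, f_s2 and f_s3 simply add 1, 2 and 3 respectively):
  TYPE  DATA  result
0   S1     1       2
1   S2     1       3
2   S3     1       4
3   S1     1       2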
This seems like it should be pretty simple, but I'm stumped for some reason. I have a list of PySpark columns that I would like to sort by name (including aliasing, as that is how they will be displayed/written to disk). Here are some example tests and things I've tried:
def test_col_sorting():
    from pyspark.sql import SparkSession
    import pyspark.sql.functions as f

    # Active spark context needed
    spark = SparkSession.builder.getOrCreate()

    # Data to sort
    cols = [f.col('c'), f.col('a'), f.col('b').alias('z')]

    # Attempt 1
    result = sorted(cols)
    # This fails with ValueError: Cannot convert column into bool: please use '&' for 'and',
    # '|' for 'or', '~' for 'not' when building DataFrame boolean expressions.

    # Attempt 2
    result = sorted(cols, key=lambda x: x.name())
    # Fails for the same reason; `name()` returns a Column object, not a string

    # Assertion I want to hold true:
    assert result == [f.col('a'), f.col('c'), f.col('b').alias('z')]
Is there any reasonable way to actually get the string back out of the Column object that was used to initialize it (but also respecting aliasing)? If I could get this from the object I could use it as a key.
Note that I am NOT looking to sort the columns on a DataFrame, as answered in this question: Python/pyspark data frame rearrange columns. These Column objects are not bound to any DataFrame. I also do not want to sort the column based on the values of the column.
Answering my own question: it seems that you can't do this without some amount of parsing from the column string representation. You also don't need regex to handle this. These two methods should take care of it:
from typing import List

from pyspark.sql import Column


def get_column_name(col: Column) -> str:
    """
    PySpark doesn't allow you to directly access the column name with respect to aliases
    from an unbound column. We have to parse this out from the string representation.

    This works on columns with one or more aliases as well as unaliased columns.

    Returns:
        Col name as str, with respect to aliasing
    """
    c = str(col).lstrip("Column<'").rstrip("'>")
    return c.split(' AS ')[-1]


def sorted_columns(cols: List[Column]) -> List[Column]:
    """
    Returns sorted list of columns, with respect to aliases

    Args:
        cols: List of PySpark Columns (e.g. [f.col('a'), f.col('b').alias('c'), ...])

    Returns:
        Sorted list of PySpark Columns by name, with respect to aliasing
    """
    return sorted(cols, key=lambda x: get_column_name(x))
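One caveat worth flagging (not from the original answer): str.lstrip/str.rstrip strip character sets rather than literal prefixes/suffixes, so a column whose name starts with letters that appear in "Column<'" (for example 'note') would come back mangled. On Python 3.9+ a hedged variant of get_column_name can use removeprefix/removesuffix instead:
def get_column_name(col: Column) -> str:
    # removeprefix/removesuffix (Python 3.9+) remove the literal markers,
    # not a set of characters, so names like 'note' survive intact
    c = str(col).removeprefix("Column<'").removesuffix("'>")
    return c.split(' AS ')[-1]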
Some tests to validate behavior:
import pytest
from pyspark.sql import SparkSession
import pyspark.sql.functions as f


@pytest.fixture(scope="session")
def spark() -> SparkSession:
    # Provide a session spark fixture for all tests
    yield SparkSession.builder.getOrCreate()


def test_get_col_name(spark: SparkSession):
    col = f.col('a')
    actual = get_column_name(col)
    assert actual == 'a'


def test_get_col_name_alias(spark: SparkSession):
    col = f.col('a').alias('b')
    actual = get_column_name(col)
    assert actual == 'b'


def test_get_col_name_multiple_alias(spark: SparkSession):
    col = f.col('a').alias('b').alias('c')
    actual = get_column_name(col)
    assert actual == 'c'


def test_sorted_columns(spark: SparkSession):
    cols = [f.col('z').alias('c'), f.col('a'), f.col('d').alias('e').alias('f'), f.col('b')]
    actual = sorted_columns(cols)
    expected = [f.col('a'), f.col('b'), f.col('z').alias('c'), f.col('d').alias('e').alias('f')]

    # We can't directly compare lists of cols, so we zip and check the repr of each element
    for a, b in zip(actual, expected):
        assert str(a) == str(b)
I think it's fair to say that being unable to access this information in a straightforward way is a failure of the PySpark API. There are a multitude of valid reasons to want to ascertain what name an unbound Column would resolve to, and it should not have to be parsed out in such a hacky way.
If you're only interested in grabbing the column names and sorting those (without any relation to any data), you can use the column object's __repr__ method and a regex to extract the actual name of your column.
So for these columns
import pyspark.sql.functions as f
cols = [f.col('c'), f.col('a'), f.col('b').alias('z')]
You could do this:
import re

# Making a list of string representations of our columns
col_repr = [x.__repr__() for x in cols]
# ["Column<'c'>", "Column<'a'>", "Column<'b AS z'>"]

# Using regex to extract the interesting part of the column name
# while making sure we're properly grabbing the alias name. Notice
# that we're grabbing the right part of the column name in `b AS z`
col_names = [re.search(r"([a-zA-Z]+)'>", x).group(1) for x in col_repr]
# ['c', 'a', 'z']

# Sorting this array
sorted_col_names = sorted(col_names)
# ['a', 'c', 'z']
NOTE: This example is simple (it only accepts lowercase and uppercase letters in column names), but as your column names get more complex, it's just a question of adapting your regex pattern.
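If you also want the Column objects themselves back in that order (a hedged sketch building on the same regex, not part of the original answer), the extracted name can serve as the sort key:
name_pattern = re.compile(r"([a-zA-Z]+)'>")
sorted_cols = sorted(cols, key=lambda c: name_pattern.search(str(c)).group(1))
# [Column<'a'>, Column<'c'>, Column<'b AS z'>]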
I have this example dataframe:
import pandas as pd

vocab_list = ['running', 'sitting', 'stand', 'walk']
col_list = ['browse', 'wander', 'saunter', 'jogging', 'prancing']

df = pd.DataFrame(vocab_list, columns=['vocab'])
df.set_index('vocab', inplace=True)
df = df.reindex(col_list, axis=1)
I need to apply a user-defined function to all columns, using values from the index of the dataframe.
Take my user-defined function to be the cosine similarity between pairs of strings from the index and the columns:
import spacy
from tqdm import tqdm
from pandarallel import pandarallel

nlp = spacy.load('en_core_web_lg')
pandarallel.initialize(progress_bar=True)

def func(col):
    print(col.name)   # Will print the strings in vocab_list in each call
    print(col.index)  # Will print an Index object containing the names of columns
    doc = nlp(col.name)
    for i, ind in tqdm(enumerate(col.index), leave=False):
        user = nlp(ind)
        check_lemma = doc[0].lemma_ != user[0].lemma_
        pos_equality = doc[0].pos_ == user[0].pos_
        if check_lemma and pos_equality:
            col.iloc[i] = doc.similarity(user)
        else:
            col.iloc[i] = 0
    return col

df = df.parallel_apply(func, axis=1)
Is there a way to do this without having a for loop in the user-defined function?
The col in the function is a Series object made from the row; I can access the index string of the row via col.name.
Also, col.index gives me an Index object for this Series, containing the names of the columns, but how do I go from there to the similarities without a for loop?
NOTE: My actual dataframe has ~3000 columns and ~120000 index values, so I would prefer not to have a for loop inside the user-defined function.
EDIT: I have edited the question with the user-defined function currently being used.
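A minimal sketch, assuming the df, nlp and pandarallel setup above: precomputing the spacy Doc objects once avoids calling nlp() on the same strings for every row, and the inner loop becomes a comprehension that builds each row in one go (still a Python-level loop, just a much cheaper one):
# Hedged sketch: one Doc per row label and one per column label, computed once
row_docs = {r: nlp(r) for r in df.index}
col_docs = {c: nlp(c) for c in df.columns}

def func(col):
    doc = row_docs[col.name]
    values = [
        doc.similarity(col_docs[c])
        if (doc[0].lemma_ != col_docs[c][0].lemma_ and doc[0].pos_ == col_docs[c][0].pos_)
        else 0
        for c in col.index
    ]
    return pd.Series(values, index=col.index)

# assumes the precomputed Doc objects can be shipped to the pandarallel workers
df = df.parallel_apply(func, axis=1)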
I have a list of time series (= pandas dataframes) and want to calculate the matrix profile for each time series (of a device).
One option is to iterate over all the devices, which seems to be slow.
A second option would be to group by the devices and apply a UDF. The problem is that the UDF returns rows 1:1, i.e. not a single scalar value per group; the output has the same number of rows as the input.
Is it still possible to somehow vectorize this calculation for each group when 1:1 (or at least non-scalar) values are returned?
import pandas as pd

df = pd.DataFrame({
    'foo': [1, 2, 3], 'baz': [1.1, 0.5, 4], 'bar': [1, 2, 1]
})
display(df)
print('***************************')

# slow version retaining all the rows
for g in df.bar.unique():
    print(g)
    this_group = df[df.bar == g]
    # perform a UDF which needs to have all the values per group
    # i.e. for real I want to calculate the matrixprofile for each time-series of a device
    this_group['result'] = this_group.baz.apply(lambda x: 1)
    display(this_group)
    print('***************************')

def my_non_scalar1_1_agg_function(x):
    display(pd.DataFrame(x))
    return x

# neatly vectorized application of a non_scalar function
# but this fails as: Must produce aggregated value
df = df.groupby(['bar']).baz.agg(my_non_scalar1_1_agg_function)
display(df)
For non-aggregating functions applied to each distinct group, i.e. functions that do not return a scalar value per group, you need to iterate the method across groups and then compile the results together.
Therefore, consider a list or dict comprehension using groupby(), followed by concat. Be sure the method takes and returns a full data frame, series, or ndarray.
# LIST COMPREHENSION
df_list = [ myfunction(sub) for index, sub in df.groupby(['group_column']) ]
final_df = pd.concat(df_list)
# DICT COMPREHENSION
df_dict = { index: myfunction(sub) for index, sub in df.groupby(['group_column']) }
final_df = pd.concat(df_dict, ignore_index=True)
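For instance, with the toy frame from the question (a hedged sketch; myfunction here is a stand-in that just tags each row with its group size rather than the real matrix-profile computation):
def myfunction(sub):
    sub = sub.copy()
    sub['result'] = len(sub)  # placeholder for the per-group matrix profile
    return sub

final_df = pd.concat([myfunction(sub) for _, sub in df.groupby('bar')], ignore_index=True)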
Indeed this (see also the link above in the comment) is a way to get it to work in a faster, more desirable way. Perhaps there is an even better alternative:
import pandas as pd

df = pd.DataFrame({
    'foo': [1, 2, 3], 'baz': [1.1, 0.5, 4], 'bar': [1, 2, 1]
})
display(df)

grouped_df = df.groupby(['bar'])
altered = []
for index, subframe in grouped_df:
    display(subframe)
    subframe = subframe  # obviously we need to apply the UDF here - not the idempotent operation (= doing nothing)
    altered.append(subframe)
    print(index)
    # print(subframe)

pd.concat(altered, ignore_index=True)
# pd.DataFrame(altered)
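As a hedged side note (not from either answer): when the per-group result really is 1:1 with the input rows, pandas' groupby().transform is built for exactly that shape, e.g. with a stand-in function that demeans baz within each group:
df['result'] = df.groupby('bar')['baz'].transform(lambda s: s - s.mean())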
Apologies if this question appears to be a duplicate of other questions, but I could not find an answer that addresses my problem exactly.
I split a dataframe, called "data", into multiple subsets that are stored in a dictionary of dataframes named "dfs" as follows:
# Partition DF
dfs = {}
chunk = 5
for n in range(data.shape[0] // chunk + 1):
    df_temp = data.iloc[n*chunk:(n+1)*chunk]
    df_temp = df_temp.reset_index(drop=True)
    dfs[n] = df_temp
Now, I would like to apply a pre-defined helper function called "fun_c" to EACH of the dataframes (that are stored in the dictionary object called "dfs").
Is it correct for me to apply the function to the dfs in one go, as follows?
result = fun_c(dfs)
If not, what would be the correct way of doing this?
It depends on the output you're looking for:
If you want a dict in the output, then you should apply the function to each dict item:
result = {key: fun_c(val) for key, val in dfs.items()}
If you want a list of dataframes/values in the output, then apply the function to each dict value:
result = [fun_c(val) for val in dfs.values()]
But this style isn't wrong either; you can iterate however you like inside the helper function as well:
def fun_c(dfs):
    result = None
    # either
    for key, val in dfs.items():
        pass
    # or
    for val in dfs.values():
        pass
    return result
Let me know if this helps!
Since you want this:
"Now, I would like to apply a pre-defined helper function called "fun_c" to EACH of the dataframes (that are stored in the dictionary object called "dfs")."
Let's say your dataframe dict looks like this and your helper function takes in a single dataframe.
dfs = {0 : df0, 1: df1, 2: df2, 3:df3}
Let's iterate through the dictionary, apply the fun_c function on each of the dataframes, and save the results in another dictionary having the same keys:
dfs_result = {k: fun_c(v) for k, v in dfs.items()}
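If fun_c returns a DataFrame per chunk, the pieces can then be stitched back together afterwards (a hedged follow-up, not part of the original answer; assumes pandas is imported as pd):
combined = pd.concat(dfs_result.values(), ignore_index=True)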
I am currently trying to implement a statistical test for a specific row based on the content of different rows. Given the dataframe in the following image:
[image: DataFrame]
I would like to create a new column based on a function that takes into account all the rows of the dataframe that have the same string in the "Template" column.
For example, in this case there are 2 rows with Template "[Are|Off]", and for each one of those rows I would need to create an element in a new column based on the "Clicks", "Impressions" and "Conversions" of both rows.
How would you best approach this problem?
PS: I apologise in advance for the way I am describing the problem; as you might have noticed, I am not a professional coder :D But I would really appreciate your help!
Here is the formula with which I solved this in Excel:
[image: Excel chi-squared test formula]
This might be overly general but I would use some sort of function map if different things should be done depending on the template name:
import pandas as pd
import numpy as np
import collections

template_column = ['are|off', 'are|off', 'comp', 'comp', 'comp|city']
n = len(template_column)
df = pd.DataFrame(np.random.random((n, 3)), index=range(n), columns=['Clicks', 'Impressions', 'Conversions'])
df['template'] = template_column

# Use a defaultdict so that you can define a default value if a template is
# not defined
function_map = collections.defaultdict(lambda: lambda df: np.nan)

# Now define functions to compute what the new columns should do depending on
# the template.
function_map.update({
    'are|off': lambda df: df.sum().sum(),
    'comp': lambda df: df.mean().mean(),
    'something else': lambda df: df.mean().max()
})

# The lambda functions are just placeholders. You could do whatever you want
# in these functions... for example:
def do_special_stuff(df):
    """Do something that uses rows and columns...
    You could also do looping or whatever you want as long
    as the result is a scalar, or a sequence with the same
    length as the group (its number of rows).
    """
    crazy_stuff = np.prod(np.sum(df.values, axis=1)[:, None] + 2 * df.values, axis=1)
    return crazy_stuff

function_map['comp'] = do_special_stuff

def wrap(f):
    """Wrap a function so that it returns an updated dataframe"""
    def wrapped(df):
        df = df.copy()
        new_column_data = f(df.drop('template', axis=1))
        df['new_column'] = new_column_data
        return df
    return wrapped

# wrap all the functions so that each template has a function defined that does
# the correct thing
series_function_map = {k: wrap(function_map[k]) for k in df['template'].unique()}

# throw everything back together
new_df = pd.concat([series_function_map[label](group)
                    for label, group in df.groupby('template')],
                   ignore_index=True)

# print your shiny new dataframe
print(new_df)
The result is then something like:
Clicks Impressions Conversions template new_column
0 0.959765 0.111648 0.769329 are|off 4.030594
1 0.809917 0.696348 0.683587 are|off 4.030594
2 0.265642 0.656780 0.182373 comp 0.502015
3 0.753788 0.175305 0.978205 comp 0.502015
4 0.269434 0.966951 0.478056 comp|city NaN
Hope it helps!
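Since the original goal was a chi-squared test, here is a hedged sketch of what one of those per-template functions could look like using scipy (chi2_contingency is a stand-in for the Excel formula in the question, and the question's column is named 'Template' rather than the 'template' used above):
from scipy.stats import chi2_contingency

def chi2_p(group):
    counts = group[['Clicks', 'Impressions', 'Conversions']]
    if len(counts) < 2:
        # a contingency table needs at least two rows
        return pd.Series(np.nan, index=group.index)
    _, p, _, _ = chi2_contingency(counts)
    return pd.Series(p, index=group.index)

# every row gets the p-value computed from all rows sharing its template
df['p_value'] = df.groupby('template', group_keys=False).apply(chi2_p)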
OK, so after the groupby you need to apply this formula, and you can do this in pandas as well:
import numpy as np

t = df.groupby("Template")  # this is for groupby

def calculater(b5, b6, c5, c6):
    return b5 / (b5 + b6) * (c5 + c6)

df['result'] = np.vectorize(calculater)(df["b5"], df["b6"], df["c5"], df["c6"])
Here b5, b6, ... are the column names of the cells shown in the image.
This should work for you, though you may need to make some minor changes to the maths.