I am using a custom function in pandas that iterates over the cells of a dataframe, finds the same row in a different dataframe, extracts it as a tuple, picks a random value from that tuple, adds a user-specified amount of noise to the value, and writes it back to the original dataframe. Is there a way to do this with applymap? I couldn't find one, so I used itertuples, but I expect an applymap solution to be more efficient.
import random

import numpy as np
import pandas as pd

# Mock data creation
key = pd.DataFrame({'col1': [1, 2, 3], 'col2': [4, 5, 6], 'col3': [7, 8, 9]})
results = pd.DataFrame(np.zeros((3, 3)))

def apply_value(value):
    key_index = # <-- THIS IS WHERE I NEED A WAY TO ACCESS INDEX
    key_tup = key.iloc[key_index]
    length = (len(key_tup) - 1)
    random_int = random.randint(1, length)
    random_value = key_tup[random_int]
    return random_value

results = results.applymap(apply_value)
If I understood your problem correctly, this piece of code should work. The problem is that applymap does not pass the row index to the function, so you have to nest two apply calls: the outer one iterates over rows (the row label is available as x.name, and that is our key), and the inner one iterates over the cells of each row. Hope it helps. Let me know if it does :D
import random

import numpy as np
import pandas as pd

# Mock data creation
key = pd.DataFrame({'col1': [1, 2, 3], 'col2': [4, 5, 6], 'col3': [7, 8, 9]})
results = pd.DataFrame(np.zeros((3, 3)))

def apply_value(value, key_index):
    key_tup = key.loc[key_index]
    length = (len(key_tup) - 1)
    random_int = random.randint(1, length)
    # key_tup is a Series labelled by column name, so index it by position
    random_value = key_tup.iloc[random_int]
    return random_value

# x is one row of results; x.name is that row's index label
results = results.apply(lambda x: x.apply(lambda d: apply_value(d, x.name)), axis=1)
Strictly speaking, you don't need to access the row index inside your function; there are simpler ways to implement this.
You can probably do without it entirely; you don't even need to do a pandas join/merge on the rows of key.
But first, you need to fix your example data, if key is really supposed to be a dataframe of tuples.
So you want to:
sweep over each row with apply(..., axis=1)
look up the matching row for each cell with key.loc[key_index]...
...which is supposed to give you a tuple key_tup, but in your example key is a plain dataframe, not a dataframe of tuples:
key_tup = key.iloc[key_index]
The business with:
length = (len(key_tup) - 1)
random_int = random.randint(1, length)
random_value = key_tup[random_int]
can be simplified to just:
np.random.choice(key_tup)
(note that this can also pick the first element, which randint(1, length) skips), in which case you likely don't need to define apply_value() at all.
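To make the "no row index needed" idea concrete, here is a minimal sketch that fills the whole results frame at once with NumPy indexing. It assumes results has the same number of rows as key and that any column of key may be picked. The noise_scale value and the use of Gaussian noise are my own assumptions, since the question only says "a user-specified amount of noise".

```python
import numpy as np
import pandas as pd

# Mock data from the question
key = pd.DataFrame({'col1': [1, 2, 3], 'col2': [4, 5, 6], 'col3': [7, 8, 9]})
results = pd.DataFrame(np.zeros((3, 3)))

rng = np.random.default_rng(0)
noise_scale = 0.1  # hypothetical user-specified noise level

key_values = key.to_numpy()
# For every cell (i, j) of results, draw a random column position of key
cols = rng.integers(0, key_values.shape[1], size=results.shape)
# Row i of results always reads from row i of key
rows = np.arange(len(key))[:, None]
picked = key_values[rows, cols]

# Add noise and rebuild the frame with the original labels
noise = rng.normal(0.0, noise_scale, size=results.shape)
results = pd.DataFrame(picked + noise, index=results.index, columns=results.columns)
```

Because the row positions are broadcast against the random column positions, no Python-level loop or apply is involved.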
Related
I have a pandas dataframe and want to add a value in a new column ('new') to all rows of each group produced by .groupby() on another column ('A').
At the moment I am doing it in several steps:
1- loop through all unique values of column A
2- calculate the value to add (run a function on a different column, e.g. 'B')
3- store the value I would like to add to 'new' in a separate list (just one value per group!)
4- zip the list of unique groups (.groupby('A').unique()) with those values
5- loop again through the zipped values to store them in the dataframe.
This is very inefficient and takes a long time to run.
Is there a native pandas way to do it in fewer steps that will run faster?
Example code:
mylist = []
df_groups = df.groupby('A')
groups = df['A'].unique()
for group in groups:
    g = df_groups.get_group(group)
    idxmin = g.index.min()
    example = g.loc[idxmin]
    mylist.append(myfunction(example['B']))
zipped = zip(groups, mylist)
df['new'] = np.nan
for group, val in zipped:
    df.loc[df['A'] == group, 'new'] = val
A better way to do that would be highly appreciated.
EDIT 1:
I could just run myfunction on all rows of the dataframe, but since it's a heavy function, that would also take very long, so I would prefer to run it as few times as possible (that is, once per group).
Please try this, if this is what you are asking for. I am using the min function here; you can replace it with your own.
import pandas as pd
data = {
"calories": [400, 300, 300, 400],
"duration": [50, 40, 45, 35]
}
#load data into a DataFrame object:
df = pd.DataFrame(data)
df['min_value_duration'] = df.groupby('calories')['duration'].transform('min')
print(df)
Reference: https://www.analyticsvidhya.com/blog/2020/03/understanding-transform-function-python/
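If the function is expensive and should run only once per group, one option is to evaluate it on a single representative row per group and then broadcast the result with map. The sketch below is only an illustration under assumptions: myfunction is a hypothetical stand-in for the heavy function, the data is made up, and the representative row is the one with the smallest index in each group, as in the question's loop.

```python
import pandas as pd

calls = []  # records each call so we can see how often the function runs

def myfunction(b):
    # Hypothetical stand-in for the heavy function from the question
    calls.append(b)
    return b * 10

df = pd.DataFrame({"A": ["x", "y", "x", "y", "x"], "B": [1, 2, 3, 4, 5]})

# One row per group: the row with the smallest index for each value of A
first_rows = df.sort_index().drop_duplicates("A")

# Run the heavy function once per group, keyed by the group label
per_group = first_rows.set_index("A")["B"].map(myfunction)

# Broadcast the per-group result back to every row of that group
df["new"] = df["A"].map(per_group)
```

Here myfunction is called twice (once for "x", once for "y") no matter how many rows each group has.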
Suppose we have a master dictionary master_dict = {"a": df1, "b": df2, "c": df3}. Now suppose we have a list called condition_list. Suppose func is a function that returns a new dictionary that has the original keys of master_dict along with potentially new keys.
What is the best way to get the below code to work when the length of condition_list is greater than 2:
if len(condition_list) == 1:
    df = master_dict[condition_list[0]]
else:
    df = func(master_dict[condition_list[0]])
    df = df[condition_list[1]]
You need to ask more clearly: state the input and the expected output, and try to include demo code. Anyway, use a loop.
for i in range(len(condition_list)):
    if i == 0:
        df = master_dict[condition_list[i]]
    else:
        df = func(df)[condition_list[i]]
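The loop above can be wrapped in a small helper so it works for any non-empty condition_list. This is just a sketch: resolve is a name I made up, and the toy func and nested dictionaries below stand in for the real function and dataframes, which the question does not show.

```python
def resolve(master_dict, condition_list, func):
    """Follow condition_list through master_dict, applying func between steps."""
    current = master_dict[condition_list[0]]
    for key in condition_list[1:]:
        # func is assumed to return a dictionary that contains `key`
        current = func(current)[key]
    return current

# Toy stand-ins: each value is already a dict, so func can return it unchanged
def func(value):
    return value

master_dict = {"a": {"b": {"c": 42}}, "z": "unused"}

one_step = resolve(master_dict, ["z"], func)
three_steps = resolve(master_dict, ["a", "b", "c"], func)
```

With a single condition the helper returns master_dict[condition_list[0]] directly, matching the first branch of the original if/else.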
If "df" is a pandas dataframe, the conditions can be applied all at once. Search for "select dataframe with multiple conditions".
I need to count occurrences of a value in several columns, and I want the individual counts for each column in a list.
Is there a faster/better way of doing this? My solution takes quite some time.
from pyspark.sql.functions import col

dataframe.cache()
counts = [dataframe.filter(col(str(i)) == "value").count() for i in range(150)]
You can do a conditional count aggregation:
import pyspark.sql.functions as F

# One pass over the data: count the rows where each column equals "value"
df2 = df.agg(*[
    F.count(F.when(F.col(str(i)) == "value", 1)).alias(str(i))
    for i in range(150)
])
result = df2.toPandas().transpose()[0].tolist()
You can try the following approach/design.
Write a map function for each row of the data frame, like this:
VALUE = 'value'
def row_mapper(df_row):
    return [each == VALUE for each in df_row]
Write a reduce function for the data frame that takes two rows as input:
def reduce_rows(df_row1, df_row2):
    return [x + y for x, y in zip(df_row1, df_row2)]
Note: these are simple Python functions to help you understand the idea, not UDFs that you can apply directly in PySpark.
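To see how the two functions fit together, here is a small plain-Python run using functools.reduce on a made-up list of rows. It only illustrates the map/reduce idea and does not use PySpark; the sample rows are my own.

```python
from functools import reduce

VALUE = 'value'

def row_mapper(df_row):
    # True where the cell matches VALUE, False otherwise
    return [each == VALUE for each in df_row]

def reduce_rows(df_row1, df_row2):
    # Element-wise sum; booleans add up as 0/1
    return [x + y for x, y in zip(df_row1, df_row2)]

# Made-up rows with three columns each
rows = [
    ('value', 'x', 'value'),
    ('value', 'y', 'y'),
    ('z', 'value', 'value'),
]

counts = reduce(reduce_rows, map(row_mapper, rows))
print(counts)  # [2, 1, 2]
```

Each entry of counts is the number of rows in which that column equals VALUE.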
I have a list of time series (as a pandas dataframe) and want to calculate the matrix profile for each time series (one per device).
One option is to iterate over all the devices, which seems to be slow.
A second option would be to group by device and apply a UDF. The problem is that the UDF returns rows 1:1, i.e. not a single scalar value per group; it outputs the same number of rows as it receives.
Is it still possible to somehow vectorize this calculation for each group when 1:1 (or at least non-scalar) values are returned?
import pandas as pd

df = pd.DataFrame({
    'foo': [1, 2, 3], 'baz': [1.1, 0.5, 4], 'bar': [1, 2, 1]
})
display(df)
print('***************************')

# slow version retaining all the rows
for g in df.bar.unique():
    print(g)
    # copy so that adding a column does not write into a slice of df
    this_group = df[df.bar == g].copy()
    # perform a UDF which needs to have all the values per group
    # i.e. for real I want to calculate the matrixprofile for each time-series of a device
    this_group['result'] = this_group.baz.apply(lambda x: 1)
    display(this_group)
print('***************************')

def my_non_scalar1_1_agg_function(x):
    display(pd.DataFrame(x))
    return x

# neatly vectorized application of a non_scalar function
# but this fails as: Must produce aggregated value
df = df.groupby(['bar']).baz.agg(my_non_scalar1_1_agg_function)
display(df)
For non-aggregating functions that are applied to each distinct group and return a non-scalar value, you need to run the method on each group and then combine the results.
Therefore, consider a list or dict comprehension over groupby(), followed by concat. Make sure the method takes and returns a full data frame, series, or ndarray.
# LIST COMPREHENSION
df_list = [ myfunction(sub) for index, sub in df.groupby(['group_column']) ]
final_df = pd.concat(df_list)
# DICT COMPREHENSION
df_dict = { index: myfunction(sub) for index, sub in df.groupby(['group_column']) }
final_df = pd.concat(df_dict, ignore_index=True)
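As a runnable illustration of the comprehension-plus-concat pattern on the question's mock data, here is a minimal sketch. per_group_udf is a hypothetical stand-in for the matrix profile calculation: it needs all values of a group and returns one value per input row. For this simple case, groupby(...).transform gives the same numbers and is shown for comparison.

```python
import pandas as pd

df = pd.DataFrame({
    'foo': [1, 2, 3], 'baz': [1.1, 0.5, 4.0], 'bar': [1, 2, 1]
})

def per_group_udf(sub):
    # Hypothetical stand-in for the real per-device calculation:
    # uses the whole group, returns one value per row
    out = sub.copy()
    out['result'] = sub['baz'] - sub['baz'].mean()
    return out

# Run the function once per group, then stitch the pieces back together
pieces = [per_group_udf(sub) for _, sub in df.groupby('bar')]
final_df = pd.concat(pieces).sort_index()

# Same values via transform, which keeps the original row order by itself
via_transform = df.groupby('bar')['baz'].transform(lambda s: s - s.mean())
```

sort_index() restores the original row order, since concat returns the rows grouped by 'bar'.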
Indeed, this (see also the link above in the comment) is a way to get it to work faster and closer to what I wanted. Perhaps there is an even better alternative.
import pandas as pd
df = pd.DataFrame({
'foo':[1,2,3], 'baz':[1.1, 0.5, 4], 'bar':[1,2,1]
})
display(df)
grouped_df = df.groupby(['bar'])
altered = []
for index, subframe in grouped_df:
    display(subframe)
    subframe = subframe  # placeholder: apply the UDF here instead of this no-op
    altered.append(subframe)
    print(index)
#print (subframe)
pd.concat(altered, ignore_index=True)
#pd.DataFrame(altered)
I'm about to write a backtesting tool, so for every row I'd like to have access to the whole dataframe up to that row. In the following example I'm doing it from a fixed index using a loop. I'm wondering if there is a better solution.
import numpy as np
import pandas as pd
N
df = pd.DataFrame({"a":np.arange(N)})
for i in range(3, N):
    print(df["a"][:i].values)
UPDATE (toy example)
I need to apply a custom function to all the previous values. Here as a toy example I will use the sum of the square of all previous values.
def toyFun(v):
    return np.sum(v**2)

res = np.empty(N)
res[:] = np.nan
for i in range(3, N):
    res[i] = toyFun(df["a"][:i].values)
df["res"] = res
If you are selecting rows of a particular column, say 'a', you can use the .iloc indexer (short for "integer location") to select rows by position.
df = pd.DataFrame({'a': [1,2,3,4]})
print(df.a.iloc[:2]) # get first two values
So, you can do:
for i in range(3, 10):
    print(df.a.iloc[:i])
The best way is to use a temporary column holding the running results; that way you are not re-calculating everything for each row.
df["a"].apply(lambda x: x**2).cumsum()
Then re-index as you wish:
res[3:] = df["a"].apply(lambda x: x**2).cumsum()[2:N-1].values
or assign it directly to the dataframe.
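As a sketch of how this lines up with the loop from the question, the snippet below compares the loop against a shifted cumulative sum and against a generic expanding window. N = 10 is an assumed value, since the question does not give one. The expanding version works for any function of "all previous values" but calls the function once per row, so it is slower than the cumsum trick.

```python
import numpy as np
import pandas as pd

N = 10  # assumed size; the question does not state a value
df = pd.DataFrame({"a": np.arange(N)})

def toyFun(v):
    return np.sum(v**2)

# Reference: the loop from the question
expected = np.full(N, np.nan)
for i in range(3, N):
    expected[i] = toyFun(df["a"].iloc[:i].values)

# Running sum of squares, shifted one row so row i only sees rows 0..i-1
fast = (df["a"] ** 2).cumsum().shift(1)
fast.iloc[:3] = np.nan  # the loop starts at i = 3

# Generic alternative: expanding window, then the same one-row shift
generic = df["a"].expanding(min_periods=3).apply(toyFun, raw=True).shift(1)

df["res"] = fast
```

Both fast and generic match the loop's output, with NaN in the first three rows.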