Most efficient way of applying a function based on condition


Suppose we have a master dictionary master_dict = {"a": df1, "b": df2, "c": df3} and a list called condition_list. Suppose func is a function that returns a new dictionary containing the original keys of master_dict, potentially along with new keys.
What is the best way to get the code below to work when the length of condition_list is greater than 2?
if len(condition_list) == 1:
    df = master_dict[condition_list[0]]
else:
    df = func(master_dict[condition_list[0]])
    df = df[condition_list[1]]

Please ask more clearly: declare the inputs and the expected output, and try to include a small demo. That said, a loop works:
for i in range(len(condition_list)):
    if i == 0:
        df = master_dict[condition_list[i]]
    else:
        df = func(df)[condition_list[i]]
If df is a pandas DataFrame, the conditions can be applied all at once. Search for "select dataframe with multiple conditions".
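The same chain of lookups can also be written as a fold. This is only a sketch: func below is a hypothetical stand-in (the question does not show the real one), plain integers stand in for the dataframes, and resolve is a name made up for illustration.

```python
from functools import reduce

def func(value):
    # Hypothetical stand-in for the question's func: returns a dict of derived values.
    return {"double": value * 2, "neg": -value}

# Integers stand in for df1, df2, df3 to keep the sketch dependency-free.
master_dict = {"a": 1, "b": 2, "c": 3}

def resolve(master, conditions, f):
    """Look up the first key in master, then apply f and index by each later key."""
    if not conditions:
        raise ValueError("conditions must contain at least one key")
    first, *rest = conditions
    return reduce(lambda current, key: f(current)[key], rest, master[first])

result_one = resolve(master_dict, ["a"], func)
result_three = resolve(master_dict, ["b", "double", "neg"], func)
print(result_one, result_three)
```

With a single condition the fold has nothing to iterate over, so the first lookup is returned unchanged, which matches the len(condition_list) == 1 branch in the question.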

Related

Iterate through different dataframes and apply a function to each one

I have 4 different dataframes containing time series data that all have the same structure.
My goal is to take each individual dataframe and pass it through a function I have defined that will group them by datestamp, sum the columns and return a new dataframe with the columns I want. So in total I want 4 new dataframes that have only the data I want.
I just looked through this post:
Loop through different dataframes and perform actions using a function
but applying this did not change my results.
Here is my code:
I am putting the dataframes in a list so I can iterate through them
dfs = [vds, vds2, vds3, vds4]
This is my function I want to pass each dataframe through:
def VDS_pre(df):
    df = df.groupby(['datestamp', 'timestamp']).sum().reset_index()
    df = df.rename(columns={'datestamp': 'Date', 'timestamp': 'Time', 'det_vol': 'VolumeVDS'})
    df = df[['Date', 'Time', 'VolumeVDS']]
    return df
This is the loop I made to iterate through my dataframe list and pass each one through my function:
for df in dfs:
    df = VDS_pre(df)
However once I go through my loop and go to print out the dataframes, they have not been modified and look like they initially did. Thanks for the help!

However once I go through my loop and go to print out the dataframes, they have not been modified and look like they initially did.
Yes, that is the expected behaviour. They have not been modified because assigning to the loop variable inside a for item in lst: loop only rebinds that name. It changes neither lst nor the variables whose values were used to build lst, as the following code demonstrates:
v1 = 1; v2 = 2; v3 = 3
lst = [v1, v2, v3]
for item in lst:
    item = 0
print(lst, v1, v2, v3)  # gives: [1, 2, 3] 1 2 3
To get the result you expect, you can use a list comprehension together with Python's sequence unpacking:
vds, vds2, vds3, vds4 = [VDS_pre(df) for df in [vds, vds2, vds3, vds4]]
or the following code, which uses a list of strings holding the variable names of the dataframes (it relies on exec, so it only works reliably for module-level variables and is generally best avoided):
sdfs = ['vds', 'vds2', 'vds3', 'vds4']
for sdf in sdfs:
    exec(f'{sdf} = VDS_pre({sdf})')
Now printing vds, vds2, vds3 and vds4 will output the modified dataframes.
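If looking things up by name is what you are after, a dictionary keyed by name gives the same convenience without exec. A minimal sketch, where preprocess is a hypothetical stand-in for VDS_pre and plain lists stand in for the dataframes:

```python
def preprocess(values):
    # Hypothetical stand-in for VDS_pre: returns a new object and leaves the input alone.
    return [v * 10 for v in values]

# Plain lists stand in for the four dataframes.
frames = {"vds": [1, 2], "vds2": [3], "vds3": [], "vds4": [4, 5, 6]}

# Build a new dict with the same keys; the originals stay untouched.
processed = {name: preprocess(frame) for name, frame in frames.items()}

print(processed["vds"])   # processed version
print(frames["vds"])      # original is unchanged
```

Each result is then reachable as processed["vds"], processed["vds2"], and so on, and adding a fifth frame needs no extra variable.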

Pandas operations such as groupby(...).sum(), rename and column selection return new objects rather than modifying the frame in place. Your snippet assigns the result to the loop variable df, which is never written back to your list, so nothing is stored once the loop finishes.
If you don't need to keep original frames, you may simply overwrite them:
for i, df in enumerate(dfs):
    dfs[i] = VDS_pre(df)
Otherwise, use a second list and append each result to it:
l = []
for df in dfs:
    df2 = VDS_pre(df)
    l.append(df2)
Better still, use a list comprehension to write this snippet as a single line of code.
Either way, the results of your processing are now stored.
Additionally, if your frames have the same structure and can be combined into a single frame, consider concatenating them first and then applying your function once to the result. That would be the more idiomatic pandas approach.
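Both suggestions might look roughly like this. The sample frames are made up for illustration, and the source column is an assumption added so that rows from different frames are not summed together after concatenation.

```python
import pandas as pd

def VDS_pre(df):
    # Same steps as the function in the question.
    df = df.groupby(['datestamp', 'timestamp']).sum().reset_index()
    df = df.rename(columns={'datestamp': 'Date', 'timestamp': 'Time', 'det_vol': 'VolumeVDS'})
    return df[['Date', 'Time', 'VolumeVDS']]

# Hypothetical sample data standing in for the real time series.
vds = pd.DataFrame({'datestamp': ['d1', 'd1', 'd2'],
                    'timestamp': ['t1', 't1', 't1'],
                    'det_vol': [1, 2, 3]})
vds2 = pd.DataFrame({'datestamp': ['d1'], 'timestamp': ['t2'], 'det_vol': [5]})

# Option 1: a list comprehension builds the processed frames in one line.
processed = [VDS_pre(df) for df in [vds, vds2]]

# Option 2: tag each frame with its origin, concatenate, and aggregate once.
named = {'vds': vds, 'vds2': vds2}
combined = pd.concat([df.assign(source=name) for name, df in named.items()],
                     ignore_index=True)
totals = combined.groupby(['source', 'datestamp', 'timestamp'],
                          as_index=False)['det_vol'].sum()
print(totals)
```

Option 2 yields one long frame instead of four separate ones, so it only fits if a single combined result is acceptable downstream.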

Create a dictionary from pandas empty dataframe with only column names

I have a pandas data frame with only two column names (a single row, which can also be considered the header). I want to make a dictionary out of this, with the first column being the key and the second column being the value. I already tried the to_dict() method, but it does not work because the dataframe is empty.
Example
df=|Land |Norway| to {'Land': 'Norway'}
I can change the pandas data frame to some other type and find my way around it, but this question is mostly to learn the best/different/efficient approach for this problem.
For now I have this as the solution :
dict(zip(a.iloc[0:0,0:1],a.iloc[0:0,1:2]))
Is there any other way to do this?

Here's a simple way: convert the columns to a list, then the list to a dictionary.
import pandas as pd
def list_to_dict(a):
    # Pull consecutive items from one iterator, pairing them as (key, value).
    it = iter(a)
    ret_dict = dict(zip(it, it))
    return ret_dict
df = pd.DataFrame([], columns=['Land', 'Norway'])
dict_val = list_to_dict(df.columns.to_list())
dict_val  # {'Land': 'Norway'}

Very manual solution
df = pd.DataFrame(columns=['Land', 'Norway'])
df = pd.DataFrame({df.columns[0]: df.columns[1]}, index=[0])
If you have any number of columns and you want each sequential pair to have this transformation, try:
df = pd.DataFrame(dict(zip(df.columns[::2], df.columns[1::2])), index=[0])
Note: The first version raises an IndexError if your DataFrame has fewer than two columns, and with an odd number of columns the pairwise version silently drops the last one.
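If a plain dictionary (rather than a new DataFrame) is the goal, the same pairing can be applied to the column labels directly. A small sketch; the four column names are made up for illustration:

```python
import pandas as pd

# Hypothetical empty frame whose header holds alternating key/value labels.
df = pd.DataFrame(columns=['Land', 'Norway', 'City', 'Oslo'])

cols = df.columns.tolist()
if len(cols) % 2:
    # Guard against a dangling key with no value.
    raise ValueError("expected an even number of columns")

# Even positions become keys, odd positions become values.
pairs = dict(zip(cols[::2], cols[1::2]))
print(pairs)
```

Working on df.columns avoids touching the (empty) data at all, which is why the number of rows does not matter here.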

Python: Apply a function to multiple subsets of a dataframe (stored in a dictionary)

Regards,
Apologies if this question appears to be a duplicate of other questions, but I could not find an answer that addresses my exact problem.
I split a dataframe, called "data", into multiple subsets that are stored in a dictionary of dataframes named "dfs" as follows:
# Partition DF
dfs = {}
chunk = 5
for n in range((data.shape[0] // chunk + 1)):
    df_temp = data.iloc[n*chunk:(n+1)*chunk]
    df_temp = df_temp.reset_index(drop=True)
    dfs[n] = df_temp
Now, I would like to apply a pre-defined helper function called "fun_c" to EACH of the dataframes (that are stored in the dictionary object called "dfs").
Is it correct to apply the function to dfs in one go, as follows?
result = fun_c(dfs)
If not, what would be the correct way of doing this?

It depends on the output you're looking for.
If you want a dict as output, apply the function to each dict item:
result = {key: fun_c(val) for key, val in dfs.items()}
If you want a list of dataframes/values as output, apply the function to each dict value:
result = [fun_c(val) for val in dfs.values()]
But your style isn't wrong either; you can just as well iterate inside the helper function:
def fun_c(dfs):
    result = None
    # either
    for key, val in dfs.items():
        pass
    # or
    for val in dfs.values():
        pass
    return result
Let me know if this helps!

Since you want this:
"Now, I would like to apply a pre-defined helper function called "fun_c" to EACH of the dataframes (that are stored in the dictionary object called "dfs")."
Let's say your dataframe dict looks like this and your helper function takes in a single dataframe.
dfs = {0: df0, 1: df1, 2: df2, 3: df3}
Let's iterate through the dictionary, apply the fun_c function to each of the dataframes, and save the results in another dictionary having the same keys:
dfs_result = {k: fun_c(v) for k, v in dfs.items()}
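Putting the partitioning and the per-chunk call together might look like the sketch below. The fun_c body and the sample data are hypothetical. The ceiling division is also a small deviation from the question's loop, which (as far as I can tell) produces an empty trailing chunk whenever the row count is an exact multiple of chunk.

```python
import pandas as pd

def fun_c(df):
    # Hypothetical helper: takes ONE dataframe and returns a summary value.
    return int(df['x'].sum())

# Made-up data: 12 rows, split into chunks of 5.
data = pd.DataFrame({'x': range(12)})
chunk = 5

# Ceiling division gives exactly as many chunks as are needed.
n_chunks = -(-len(data) // chunk)
dfs = {n: data.iloc[n * chunk:(n + 1) * chunk].reset_index(drop=True)
       for n in range(n_chunks)}

# Apply the helper to each chunk, keeping the same keys.
dfs_result = {k: fun_c(v) for k, v in dfs.items()}
print(dfs_result)
```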

How to access index inside function for applymap in pandas?

I am using a custom function in pandas that iterates over cells in a dataframe, finds the same row in a different dataframe, extracts it as a tuple, extracts a random value from that tuple, and then adds a user specified amount of noise to the value and returns it to the original dataframe. I was hoping to find a way to do this that uses applymap, is it possible? I couldn't find a way using applymap, so I used itertuples, but an applymap solution should be more efficient.
import random
import numpy as np
import pandas as pd
# Mock data creation
key = pd.DataFrame({'col1': [1, 2, 3], 'col2': [4, 5, 6], 'col3': [7, 8, 9]})
results = pd.DataFrame(np.zeros((3, 3)))
def apply_value(value):
    key_index = # <-- THIS IS WHERE I NEED A WAY TO ACCESS INDEX
    key_tup = key.iloc[key_index]
    length = (len(key_tup) - 1)
    random_int = random.randint(1, length)
    random_value = key_tup[random_int]
    return random_value
results = results.applymap(apply_value)

If I understood your problem correctly, this piece of code should work. The problem is that applymap does not hold the index of the dataframe, so what you have to do is to apply nested apply functions: the first iterates over rows, and we get the key from there, and the second iterates over columns in each row. Hope it helps. Let me know if it does :D
import random
import numpy as np
import pandas as pd
# Mock data creation
key = pd.DataFrame({'col1': [1, 2, 3], 'col2': [4, 5, 6], 'col3': [7, 8, 9]})
results = pd.DataFrame(np.zeros((3, 3)))
def apply_value(value, key_index):
    # key_index is the row label passed in from the outer apply
    key_tup = key.loc[key_index]
    length = (len(key_tup) - 1)
    random_int = random.randint(1, length)
    # .iloc makes the positional lookup explicit
    random_value = key_tup.iloc[random_int]
    return random_value
results = results.apply(lambda x: x.apply(lambda d: apply_value(d, x.name)), axis=1)

Strictly speaking, you don't need to access the row index inside your function; there are simpler ways to implement this.
You can probably do without it entirely, and you don't even need a pandas join/merge with the rows of key.
But first, you need to fix your example data if key is really supposed to be a dataframe of tuples.
So you want to:
sweep over each row with apply(..., axis=1)
look up the matching row of key for each cell with key.loc[key_index]...
...which is supposed to give you a tuple key_tup, but in your example key is a plain dataframe, not a dataframe of tuples:
key_tup = key.iloc[key_index]
The business with:
length = (len(key_tup) - 1)
random_int = random.randint(1, length)
random_value = key_tup[random_int]
can be simplified to just:
np.random.choice(key_tup)
(note that this can also pick the first element, which randint(1, length) skips), in which case you likely don't need to declare apply_value() at all.
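One way to avoid the per-cell function entirely is to draw all the random column positions at once and index key with them. This is a sketch under a few assumptions: results and key have the same number of rows in the same order, every column of key is an eligible pick, and the seed is there only to make the run reproducible.

```python
import numpy as np
import pandas as pd

# Mock data as in the question.
key = pd.DataFrame({'col1': [1, 2, 3], 'col2': [4, 5, 6], 'col3': [7, 8, 9]})
results = pd.DataFrame(np.zeros((3, 3)))

rng = np.random.default_rng(0)  # seeded only for reproducibility

n_rows, n_cols = results.shape
# One random column position in key for every cell of results.
col_pos = rng.integers(0, key.shape[1], size=(n_rows, n_cols))
# Each cell reads from the key row at its own row position.
row_pos = np.arange(n_rows)[:, None]

picked = key.to_numpy()[row_pos, col_pos]
results = pd.DataFrame(picked, index=results.index, columns=results.columns)
print(results)
```

Noise could then be added to the whole array in one step before wrapping it back into a dataframe.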

Pandas: Use iterrows on Dataframe subset

What is the best way to do iterrows with a subset of a DataFrame?
Let's take the following simple example:
import datetime as DT
import pandas as pd
df = pd.DataFrame({
    'Product': list('AAAABBAA'),
    'Quantity': [5, 2, 5, 10, 1, 5, 2, 3],
    'Start': [
        DT.datetime(2013, 1, 1, 9, 0),
        DT.datetime(2013, 1, 1, 8, 5),
        DT.datetime(2013, 2, 5, 14, 0),
        DT.datetime(2013, 2, 5, 16, 0),
        DT.datetime(2013, 2, 8, 20, 0),
        DT.datetime(2013, 2, 8, 16, 50),
        DT.datetime(2013, 2, 8, 7, 0),
        DT.datetime(2013, 7, 4, 8, 0)]})
df = df.set_index(['Start'])
Now I would like to modify a subset of this DataFrame using the iterrows function, e.g.:
for i, row_i in df[df.Product == 'A'].iterrows():
    row_i['Product'] = 'A1'  # actually a more complex calculation
However, the changes do not persist.
Is there any possibility (except a manual lookup using the index 'i') to make persistent changes on the original Dataframe ?

Why do you need iterrows() for this? I think it's always preferable to use vectorized operations in pandas (or numpy). Note that .ix has been removed from current pandas versions, so use .loc:
df.loc[df['Product'] == 'A', 'Product'] = 'A1'

The best way that comes to mind is to generate a new vector with the desired result, where you can loop all you want, and then assign it back to the column:
# make a copy of the column
P = df.Product.copy()
# do the operation, or loop if you really must
P[P == "A"] = "A1"
# reassign to the original df
df["Product"] = P
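For a calculation that is more involved than a constant replacement, the same boolean mask can be used both to read the subset and to write the results back. A rough sketch on a shortened version of the example data; the "double any quantity above 2" rule is made up purely for illustration.

```python
import datetime as DT
import pandas as pd

# Shortened version of the question's frame.
df = pd.DataFrame({
    'Product': list('AABA'),
    'Quantity': [5, 2, 1, 3],
    'Start': [
        DT.datetime(2013, 1, 1, 9, 0),
        DT.datetime(2013, 1, 1, 8, 5),
        DT.datetime(2013, 2, 8, 20, 0),
        DT.datetime(2013, 7, 4, 8, 0)],
}).set_index('Start')

# Select the subset once; the mask is a boolean Series aligned on the index.
mask = df['Product'] == 'A'

# Simple case: constant replacement, written straight into the original frame.
df.loc[mask, 'Product'] = 'A1'

# More complex case (hypothetical rule): compute on the subset, assign back via the mask.
df.loc[mask, 'Quantity'] = df.loc[mask, 'Quantity'].apply(lambda q: q * 2 if q > 2 else q)
print(df)
```

The changes persist because df.loc[mask, column] = ... writes into df itself, whereas df[df.Product == 'A'].iterrows() hands out copies of the rows.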
