Creating N dataframes from a list - python

I want to create n DataFrames, using each value s as the name of a DataFrame, but all I managed to create is a list full of DataFrames. Is it possible to turn this list into the individual DataFrames inside it?
# estacao has something like ['ABc', 'dfg', 'hil', ..., 'xyz'], and these values should name each DataFrame
estacao = dados.Station.unique()
for s, i in zip(estacao, range(126)):
    estacao[i] = dados.groupby('Station').get_group(s)

I'd use a dictionary here. Then the keys can take the values of s, and each value can be the dataframe corresponding to that group:
groups = dados.Station.unique()
groupby_ = dados.groupby('Station')
dataframes = {s: groupby_.get_group(s) for s in groups}
Then calling each one by name is as simple as:
group_df = dataframes['group_name']
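For a self-contained illustration, here is a minimal sketch with toy data standing in for dados (the station names are made up):

import pandas as pd

dados = pd.DataFrame({'Station': ['ABc', 'dfg', 'ABc'],
                      'value': [1, 2, 3]})

groupby_ = dados.groupby('Station')
dataframes = {s: groupby_.get_group(s) for s in dados.Station.unique()}
print(dataframes['ABc'])  # only the rows where Station == 'ABc'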

If you REALLY NEED to create DataFrames named after s (which I named group in the following example), using exec is the solution.
groups = dados.Station.unique()
groupby_ = dados.groupby('Station')
for group in groups:
    exec(f"{group} = groupby_.get_group('{group:s}')")
CAVEAT
See this answer to understand why using exec and eval commands is not always desirable.
Why should exec() and eval() be avoided?
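If you really need name-based variables but want to avoid exec, one alternative (still discouraged for the same reasons) is assigning into globals() directly; a sketch reusing the names above:

for group in groups:
    # creates a module-level variable named after each group
    globals()[group] = groupby_.get_group(group)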


Iterate through different dataframes and apply a function to each one

I have 4 different dataframes containing time series data that all have the same structure.
My goal is to take each individual dataframe and pass it through a function I have defined that will group them by datestamp, sum the columns and return a new dataframe with the columns I want. So in total I want 4 new dataframes that have only the data I want.
I just looked through this post:
Loop through different dataframes and perform actions using a function
but applying this did not change my results.
Here is my code:
I am putting the dataframes in a list so I can iterate through them
dfs = [vds, vds2, vds3, vds4]
This is my function I want to pass each dataframe through:
def VDS_pre(df):
    df = df.groupby(['datestamp', 'timestamp']).sum().reset_index()
    df = df.rename(columns={'datestamp': 'Date', 'timestamp': 'Time', 'det_vol': 'VolumeVDS'})
    df = df[['Date', 'Time', 'VolumeVDS']]
    return df
This is the loop I made to iterate through my dataframe list and pass each one through my function:
for df in dfs:
    df = VDS_pre(df)
However, once I go through my loop and print out the dataframes, they have not been modified and look like they did initially. Thanks for the help!
However, once I go through my loop and print out the dataframes, they have not been modified and look like they did initially.
Yes, this is actually the case. The reason why they have not been modified is:
Assignment to the loop variable in a for item in lst: loop has no effect on either lst or the identifiers/variables from which the list items got their values, as demonstrated by the following code:
v1 = 1; v2 = 2; v3 = 3
lst = [v1, v2, v3]
for item in lst:
    item = 0
print(lst, v1, v2, v3)  # gives: [1, 2, 3] 1 2 3
To achieve the result you expect to obtain you can use a list comprehension and the list unpacking feature of Python:
vds, vds2, vds3, vds4 = [VDS_pre(df) for df in [vds, vds2, vds3, vds4]]
or the following code, which uses a list of strings holding the identifier/variable names of the dataframes:
sdfs = ['vds', 'vds2', 'vds3', 'vds4']
for sdf in sdfs:
    exec(f'{sdf} = VDS_pre({sdf})')
Now printing vds, vds2, vds3 and vds4 will output the modified dataframes.
Pandas operations return a new copy of the data. Your snippet stores the result in the df variable, which is neither written back to your initial list nor to the original variables. This is why you don't have any stored result after execution.
If you don't need to keep original frames, you may simply overwrite them:
for i, df in enumerate(dfs):
    dfs[i] = VDS_pre(df)
If you do need to keep the originals, use a second list and append each result to it:
l = []
for df in dfs:
    df2 = VDS_pre(df)
    l.append(df2)
Or, even better, use a list comprehension to rewrite this snippet as a single line of code, for instance:
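l = [VDS_pre(df) for df in dfs]  # one line replaces the append loop above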
Now you are able to store the result of your processing.
Additionally, if your frames have the same structure and can be merged into a single frame, you may consider concatenating them first and applying your function once to the result. That would be more idiomatic pandas.
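For instance, a sketch assuming the frames share the datestamp/timestamp columns from your function (the keys argument tags each source frame so their sums stay separate):

import pandas as pd

combined = pd.concat([vds, vds2, vds3, vds4],
                     keys=['vds', 'vds2', 'vds3', 'vds4'],
                     names=['source', None])
summed = (combined
          .reset_index(level='source')
          .groupby(['source', 'datestamp', 'timestamp'])
          .sum()
          .reset_index())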

use dictionary value as a variable for df

I'm importing multiple dataframes and wrote the following process: 1. a list of files to be converted to dataframes, 2. a list of names I want for the corresponding dataframes, 3. the two lists combined into a dictionary:
tbls = ['tbl1', 'tbl2', 'tbl3']
dbname = ['dfABC', 'dfrand', 'dfXYZ']
dictdf = dict(zip(tbls, dbname))
Then I cycle through tbls to import the dataframes. (getdf below is a short function I wrote that reads the path, sheet name, etc. for the excel/csv file in which the table (data) sits and imports the data.)
for tbl in tbls:
    dictdf[tbl] = getdf(tbl, dfRT, sfsession)
The process works, except that the dataframes are written into the dictionary, i.e. dfABC in the dictionary is replaced with a dataframe of 65K rows and 27 cols, and so on.
What I want is dfABC = a dataframe of 65K rows and 27 cols, i.e. as an actual variable rather than a dictionary entry. I tried:
str(dictdf[tbl]) = getdf(tbl, dfRT, sfsession)
but that gave an error. Is there a way to do this? thanks.
Solved it using exec and flipping the dictionary (the flip isn't needed for the solution):
tbls = ['tbl1', 'tbl2', 'tbl3']
dfs = ['dfABC', 'dfrand', 'dfXYZ']
dictdf = dict(zip(dfs, tbls))

for df in dfs:
    tbl = dictdf[df]
    exec(f"{df} = getdf('{tbl}', dfRT, sfsession)")
Please note @Xukrao's and @Yo_Chris's comments on keeping the dfs within the dictionary as the superior solution.
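For reference, that dictionary-based approach might look like this (a sketch reusing tbls and getdf from above):

dfs = {}
for tbl in tbls:
    dfs[tbl] = getdf(tbl, dfRT, sfsession)

# access any frame by its table name, no exec needed
df_abc = dfs['tbl1']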
I found this question useful to understand how exec worked: What's the difference between eval, exec, and compile?

Python: Apply a function to multiple subsets of a dataframe (stored in a dictionary)

Apologies if this question appears to be a duplicate of other questions, but I could not find an answer that addresses my problem exactly.
I split a dataframe, called "data", into multiple subsets that are stored in a dictionary of dataframes named "dfs" as follows:
# Partition DF
dfs = {}
chunk = 5
for n in range(data.shape[0] // chunk + 1):
    df_temp = data.iloc[n*chunk:(n+1)*chunk]
    df_temp = df_temp.reset_index(drop=True)
    dfs[n] = df_temp
Now, I would like to apply a pre-defined helper function called "fun_c" to EACH of the dataframes (that are stored in the dictionary object called "dfs").
Is it correct for me to apply the function to the dfs in one go, as follows?
result = fun_c(dfs)
If not, what would be the correct way of doing this?
It depends on the output you're looking for:
If you want a dict as the output, then apply the function to each dict value, keeping the keys:
result = {key: fun_c(val) for key, val in dfs.items()}
If you want a list of dataframes/values as the output, then apply the function to each dict value:
result = [fun_c(val) for val in dfs.values()]
But this style isn't wrong either; you can iterate however you like inside the helper function as well:
def fun_c(dfs):
    result = None
    # either iterate over key/value pairs
    for key, val in dfs.items():
        pass
    # or over the values only
    for val in dfs.values():
        pass
    return result
Let me know if this helps!
Since you want this:
Now, I would like to apply a pre-defined helper function called
"fun_c" to EACH of the dataframes (that are stored in the dictionary
object called "dfs").
Let's say your dataframe dict looks like this and your helper function takes in a single dataframe.
dfs = {0: df0, 1: df1, 2: df2, 3: df3}
Let's iterate through the dictionary, apply the fun_c function on each of the dataframes, and save the results in another dictionary having the same keys:
dfs_result = {k: fun_c(v) for k, v in dfs.items()}
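As a quick self-contained check (fun_c here is a trivial stand-in, just to show the shapes):

import pandas as pd

dfs = {0: pd.DataFrame({'a': [1, 2]}), 1: pd.DataFrame({'a': [3, 4]})}
fun_c = lambda df: df['a'].sum()  # replace with your own helper

dfs_result = {k: fun_c(v) for k, v in dfs.items()}
print(dfs_result)  # {0: 3, 1: 7}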

Dynamically add columns to dataframe via apply

The following code applies a function f to a dataframe column data_df["c"] and concatenates the results to the original dataframe, i.e. it concatenates 1024 new columns onto data_df.
data_df = apply_and_concat(data_df, "c", lambda x: f(x, y), [y + "-dim" + str(i) for i in range(0,1024)])
def apply_and_concat(df, field, func, column_names):
    return pd.concat((
        df,
        df[field].apply(
            lambda cell: pd.Series(func(cell), index=column_names))), axis=1)
The problem is that I want to execute this dynamically, meaning that I don't know how many columns it returns. f returns a list. Is there any better or easier way to add these columns without the need to specify the number of columns beforehand?
Your use of pd.concat((df, df.apply(...)), axis=1) already solves the main task well. It seems like your main question really boils down to "how do I name an unknown number of columns", where you're happy to use names based on sequential integers. For that, use itertools.count():
import itertools

f_modified = lambda x: dict(zip(
    ('{}-dim{}'.format(y, i) for i in itertools.count()),
    f(x, y)
))
Then use f_modified instead of f. That way, you get a dictionary instead of a list, with an arbitrary number of dynamically generated names as keys. When converting this dictionary to a Series, you'll end up with the keys being used as the index, so you don't need to provide an explicit list as the index, and hence don't need to know the number of columns in advance.
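For instance, a minimal usage sketch (assuming, as in the question, that f(x, y) returns a list and that y is a string prefix already in scope):

# each dict from f_modified becomes a Series; its keys become the new column names
new_cols = data_df['c'].apply(lambda cell: pd.Series(f_modified(cell)))
data_df = pd.concat((data_df, new_cols), axis=1)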

compare list of dictionaries to dataframe, show missing values

I have a list of dictionaries
example_list = [{'email': 'myemail@email.com'}, {'email': 'another@email.com'}]
and a dataframe with an 'Email' column
I need to compare the list against the dataframe and return the values that are not in the dataframe.
I can certainly iterate over the list, check in the dataframe, but I was looking for a more pythonic way, perhaps using list comprehension or perhaps a map function in dataframes?
To return those values that are not in df['Email'], here are a couple of options involving set difference operations:
np.setdiff1d
import numpy as np

emails = [d['email'] for d in example_list]
diff = np.setdiff1d(emails, df['Email'])  # returns a NumPy array
set.difference
# returns a set
diff = set(d['email'] for d in example_list).difference(df['Email'])
One way is to take one set from another. For a functional solution you can use operator.itemgetter:
from operator import itemgetter
res = set(map(itemgetter('email'), example_list)) - set(df['Email'])
Note that the - operator is syntactic sugar for set.difference.
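A quick self-contained check (toy values; the 'Email' column name follows the question):

import pandas as pd
from operator import itemgetter

example_list = [{'email': 'myemail@email.com'}, {'email': 'another@email.com'}]
df = pd.DataFrame({'Email': ['another@email.com']})

res = set(map(itemgetter('email'), example_list)) - set(df['Email'])
print(res)  # {'myemail@email.com'}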
I ended up converting the list into a dataframe, comparing the two dataframes by merging them on a column, and then creating a dataframe out of the missing values
so, for example
example_list = [{'email': 'myemail@email.com'}, {'email': 'another@email.com'}]
# rename so both frames share the 'Email' column for the merge
df_two = pd.DataFrame(example_list).rename(columns={'email': 'Email'})
common = df_one.merge(df_two, on=['Email'])
df_diff = df_one[~df_one.Email.isin(common.Email)]
