Create new dataframe for each factor level in column

Create new dataframe for each factor level in column - python

There are 50+ different levels in a column, and each level needs to be broken into it's own dataframe and written to a file (excel or csv).
I've seen this as a possible solution:
df1, df2, df3, df4 = [x for _, x in df.groupby(df['column_of_interest'])]
but is there a way not to hard code the number of data frames?

Is there a way not to hard code the number of data frames?
Yes, there is. Use a dictionary or list. Using dict:
dfs = {i: x for i, (_, x) in enumerate(df.groupby('column_of_interest'), 1)}
Then access your dataframes via dfs[1], dfs[2], etc.
Alternatively, using list:
dfs = [x for _, x in df.groupby('column_of_interest')]
Then use dfs[0], dfs[1], etc.
If you don't need to store your dataframe slices, just iterate a groupby object and use to_csv. This is convenient with f-strings (PEP 498, Python 3.6+):
for idx, (value, x) in enumerate(df.groupby('column_of_interest'), 1):
x.to_csv(f'slice_{value}.csv') # include value in filename
x.to_csv(f'slice_{idx}.csv') # include numeric index in filename

You could save the dataframes directly
[df1.to_csv("coi_%s.csv"%val) for val, df1 in df.groupby(df['column_of_interest'])]
Or with a explicit for loop
for val, df1 in df.groupby(df['column_of_interest']):
#Write the df1 to csv or excel
df1.to_csv("coi_%s.csv"%val)

One way can do that using locals but not recommend, personally think jpp's answer is the right way for this type of request .
variables = locals()
for key,value in df.groupby(df['column_of_interest']):
variables["df{0}".format(key)]= value

Related

Iterate through different dataframes and apply a function to each one

I have 4 different dataframes containing time series data that all have the same structure.
My goal is to take each individual dataframe and pass it through a function I have defined that will group them by datestamp, sum the columns and return a new dataframe with the columns I want. So in total I want 4 new dataframes that have only the data I want.
I just looked through this post:
Loop through different dataframes and perform actions using a function
but applying this did not change my results.
Here is my code:
I am putting the dataframes in a list so I can iterate through them
dfs = [vds, vds2, vds3, vds4]
This is my function I want to pass each dataframe through:
def VDS_pre(df):
df = df.groupby(['datestamp','timestamp']).sum().reset_index()
df = df.rename(columns={'datestamp': 'Date','timestamp':'Time','det_vol': 'VolumeVDS'})
df = df[['Date','Time','VolumeVDS']]
return df
This is the loop I made to iterate through my dataframe list and pass each one through my function:
for df in dfs:
df = VDS_pre(df)
However once I go through my loop and go to print out the dataframes, they have not been modified and look like they initially did. Thanks for the help!

However once I go through my loop and go to print out the dataframes, they have not been modified and look like they initially did.
Yes, this is actually the case. The reason why they have not been modified is:
Assignment to an item in a for item in lst: loop does not have any effect on both the lst and the identifier/variables from which the lst items got their values as it is demonstrated with following code:
v1=1; v2=2; v3=3
lst = [v1,v2,v3]
for item in lst:
item = 0
print(lst, v1, v2, v3) # gives: [1, 2, 3] 1 2 3
To achieve the result you expect to obtain you can use a list comprehension and the list unpacking feature of Python:
vds,vds2,vds3,vds4=[VDS_pre(df) for df in [vds,vds2,vds3,vds4]]
or following code which is using a list of strings with the identifier/variable names of the dataframes:
sdfs = ['vds', 'vds2', 'vds3', 'vds4']
for sdf in sdfs:
exec(str(f'{sdf} = VDS_pre(eval(sdf))'))
Now printing vds, vds2, vds3 and vds4 will output the modified dataframes.

Pandas frame operations return new copy of data. Your snippet store the result in df variable which is not stored or updated to your initial list. This is why you don't have any stored result after execution.
If you don't need to keep original frames, you may simply overwrite them:
for i, df in enumerate(dfs):
dfs[i] = VDS_pre(df)
If not just use a second list and append result to it.
l = []
for df in dfs:
df2 = VDS_pre(df)
l.append(df2)
Or even better use list comprehension to rewrite this snippet into a single line of code.
Now you are able to store the result of your processing.
Additionally if your frames have the same structure and can be merged as a single frame, you may consider to first concat them and then apply your function on it. That would be totally pandas.

Python: Apply a function to multiple subsets of a dataframe (stored in a dictionary)

Regards,
Apologies if this question appears be to a duplicate of other questions. But I could find an answer that addresses my problem in its exactitude.
I split a dataframe, called "data", into multiple subsets that are stored in a dictionary of dataframes named "dfs" as follows:
# Partition DF
dfs = {}
chunk = 5
for n in range((data.shape[0] // chunk + 1)):
df_temp = data.iloc[n*chunk:(n+1)*chunk]
df_temp = df_temp.reset_index(drop=True)
dfs[n] = df_temp
Now, I would like to apply a pre-defined helper function called "fun_c" to EACH of the dataframes (that are stored in the dictionary object called "dfs").
Is it correct for me to apply the function to the dfs in one go, as follows(?):
result = fun_c(dfs)
If not, what would be the correct way of doing this?

it depends on the output you're looking for:
If you want a dict in the output, then you should apply the function to each dict item
result = dict({key: fun_c(val) for key, val in dfs.items()})
If you want a list of dataframes/values in the output, then apply the function to each dict value
result = [fun_c(val) for val in dfs.items()]
But this style isnt wrong either, you can iterate however you like inside the helper function as well:
def fun_c(dfs):
result = None
# either
for key, val in dfs.items():
pass
# or
for val in dfs.values():
pass
return result
Let me know if this helps!

Since you want this:
Now, I would like to apply a pre-defined helper function called
"fun_c" to EACH of the dataframes (that are stored in the dictionary
object called "dfs").
Let's say your dataframe dict looks like this and your helper function takes in a single dataframe.
dfs = {0 : df0, 1: df1, 2: df2, 3:df3}
Let's iterate through the dictionary, apply the fun_c function on each of the dataframes, and save the results in another dictionary having the same keys:
dfs_result = {k:fun_c[v] for k, v in dfs.items()}

Dynamically add columns to dataframe via apply

The following code applies a function f to a dataframe column data_df["c"] and concats the results to the original dataframe, i.e. concatenating 1024 columns to the dataframe data_df.
data_df = apply_and_concat(data_df, "c", lambda x: f(x, y), [y + "-dim" + str(i) for i in range(0,1024)])
def apply_and_concat(df, field, func, column_names):
return pd.concat((
df,
df[field].apply(
lambda cell: pd.Series(func(cell), index=column_names))), axis=1)
The problem is that I want to execute this dynamically, meaning that I don't know how many columns it returns. freturns a list. Is there any better or easier way to add these columns without the need to specify the number of columns before?

Your use of pd.concat(df, df.apply(...), axis=1) already solves the main task well. It seems like your main question really boils down to "how do I name an unknown number of columns", where you're happy to use a name based on sequential integers. For that, use itertools.count():
import itertools
f_modified = lambda x: dict(zip(
('{}-dim{}'.format(y, i) for i in itertools.count()),
f(x, y)
))
Then use f_modified instead of f. That way, you get a dictionary instead of a list, with an arbitrary number of dynamically generated names as keys. When converting this dictionary to a Series, you'll end up with the keys being used as the index, so you don't need to provide an explicit list as the index, and hence don't need to know the number of columns in advance.

Slicing datasets and store in new dataframes quickly?

I'm new to python and would appreciate your help here.
I imported 4 dataset with the same headers into python. Now I want to create 4 dataframes that contain only selected columns from the 4 datasets. I know how to do it the ugly way but what's the most efficient way to perform this task?
I tried a for loop but couldn't make it work :D
Datasets imported as df1,df2,df3,df4
dataset_list = (df1,df2,df3,df4)
new_dataframes= (df_1,df_2,df_3,df_4)
for i in dataset_list:
for e in new_dataframes:
e = i.loc[0:,['column1','column2','column3','column4']]

You could use a dictionary comprehension:
cols = ['column1','column2','column3','column4']
dfs = {k: df[cols] for k, df in enumerate([df1, df2, df3, df4], 1)}
The benefit of this method is it caters for an arbitrary number of items without having to manually increment variable names.

How about this approach:
dataset_list = (df1,df2,df3,df4)
def slice(df):
return df.loc[:, ['column1','column2','column3','column4']]
df_1,df_2,df_3,df_4 = map(slice, dataset_list)

How to iterate over columns of pandas dataframe to run regression

I have this code using Pandas in Python:
all_data = {}
for ticker in ['FIUIX', 'FSAIX', 'FSAVX', 'FSTMX']:
all_data[ticker] = web.get_data_yahoo(ticker, '1/1/2010', '1/1/2015')
prices = DataFrame({tic: data['Adj Close'] for tic, data in all_data.iteritems()})
returns = prices.pct_change()
I know I can run a regression like this:
regs = sm.OLS(returns.FIUIX,returns.FSTMX).fit()
but how can I do this for each column in the dataframe? Specifically, how can I iterate over columns, in order to run the regression on each?
Specifically, I want to regress each other ticker symbol (FIUIX, FSAIX and FSAVX) on FSTMX, and store the residuals for each regression.
I've tried various versions of the following, but nothing I've tried gives the desired result:
resids = {}
for k in returns.keys():
reg = sm.OLS(returns[k],returns.FSTMX).fit()
resids[k] = reg.resid
Is there something wrong with the returns[k] part of the code? How can I use the k value to access a column? Or else is there a simpler approach?

for column in df:
print(df[column])

You can use iteritems():
for name, values in df.iteritems():
print('{name}: {value}'.format(name=name, value=values[0]))

This answer is to iterate over selected columns as well as all columns in a DF.
df.columns gives a list containing all the columns' names in the DF. Now that isn't very helpful if you want to iterate over all the columns. But it comes in handy when you want to iterate over columns of your choosing only.
We can use Python's list slicing easily to slice df.columns according to our needs. For eg, to iterate over all columns but the first one, we can do:
for column in df.columns[1:]:
print(df[column])
Similarly to iterate over all the columns in reversed order, we can do:
for column in df.columns[::-1]:
print(df[column])
We can iterate over all the columns in a lot of cool ways using this technique. Also remember that you can get the indices of all columns easily using:
for ind, column in enumerate(df.columns):
print(ind, column)

You can index dataframe columns by the position using ix.
df1.ix[:,1]
This returns the first column for example. (0 would be the index)
df1.ix[0,]
This returns the first row.
df1.ix[:,1]
This would be the value at the intersection of row 0 and column 1:
df1.ix[0,1]
and so on. So you can enumerate() returns.keys(): and use the number to index the dataframe.

A workaround is to transpose the DataFrame and iterate over the rows.
for column_name, column in df.transpose().iterrows():
print column_name

Using list comprehension, you can get all the columns names (header):
[column for column in df]

Based on the accepted answer, if an index corresponding to each column is also desired:
for i, column in enumerate(df):
print i, df[column]
The above df[column] type is Series, which can simply be converted into numpy ndarrays:
for i, column in enumerate(df):
print i, np.asarray(df[column])

I'm a bit late but here's how I did this. The steps:
Create a list of all columns
Use itertools to take x combinations
Append each result R squared value to a result dataframe along with excluded column list
Sort the result DF in descending order of R squared to see which is the best fit.
This is the code I used on DataFrame called aft_tmt. Feel free to extrapolate to your use case..
import pandas as pd
# setting options to print without truncating output
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', None)
import statsmodels.formula.api as smf
import itertools
# This section gets the column names of the DF and removes some columns which I don't want to use as predictors.
itercols = aft_tmt.columns.tolist()
itercols.remove("sc97")
itercols.remove("sc")
itercols.remove("grc")
itercols.remove("grc97")
print itercols
len(itercols)
# results DF
regression_res = pd.DataFrame(columns = ["Rsq", "predictors", "excluded"])
# excluded cols
exc = []
# change 9 to the number of columns you want to combine from N columns.
#Possibly run an outer loop from 0 to N/2?
for x in itertools.combinations(itercols, 9):
lmstr = "+".join(x)
m = smf.ols(formula = "sc ~ " + lmstr, data = aft_tmt)
f = m.fit()
exc = [item for item in x if item not in itercols]
regression_res = regression_res.append(pd.DataFrame([[f.rsquared, lmstr, "+".join([y for y in itercols if y not in list(x)])]], columns = ["Rsq", "predictors", "excluded"]))
regression_res.sort_values(by="Rsq", ascending = False)

I landed on this question as I was looking for a clean iterator of columns only (Series, no names).
Unless I am mistaken, there is no such thing, which, if true, is a bit annoying. In particular, one would sometimes like to assign a few individual columns (Series) to variables, e.g.:
x, y = df[['x', 'y']] # does not work
There is df.items() that gets close, but it gives an iterator of tuples (column_name, column_series). Interestingly, there is a corresponding df.keys() which returns df.columns, i.e. the column names as an Index, so a, b = df[['x', 'y']].keys() assigns properly a='x' and b='y'. But there is no corresponding df.values(), and for good reason, as df.values is a property and returns the underlying numpy array.
One (inelegant) way is to do:
x, y = (v for _, v in df[['x', 'y']].items())
but it's less pythonic than I'd like.

Most of these answers are going via the column name, rather than iterating the columns directly. They will also have issues if there are multiple columns with the same name. If you want to iterate the columns, I'd suggest:
for series in (df.iloc[:,i] for i in range(df.shape[1])):
...

assuming X-factor, y-label (multicolumn):
columns = [c for c in _df.columns if c in ['col1', 'col2','col3']] #or '..c not in..'
_df.set_index(columns, inplace=True)
print( _df.index)
X, y = _df.iloc[:,:4].values, _df.index.values

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Create new dataframe for each factor level in column - python

You could save the dataframes directly [df1.to_csv("coi_%s.csv"%val) for val, df1 in df.groupby(df['column_of_interest'])] Or with a explicit for loop for val, df1 in df.groupby(df['column_of_interest']): #Write the df1 to csv or excel df1.to_csv("coi_%s.csv"%val)

One way can do that using locals but not recommend, personally think jpp's answer is the right way for this type of request . variables = locals() for key,value in df.groupby(df['column_of_interest']): variables["df{0}".format(key)]= value

Related

Iterate through different dataframes and apply a function to each one

Python: Apply a function to multiple subsets of a dataframe (stored in a dictionary)

Dynamically add columns to dataframe via apply

Slicing datasets and store in new dataframes quickly?

How to iterate over columns of pandas dataframe to run regression

Categories

Resources