Slicing datasets and storing them in new dataframes quickly? - python

I'm new to python and would appreciate your help here.
I imported 4 datasets with the same headers into Python. Now I want to create 4 dataframes that contain only selected columns from the 4 datasets. I know how to do it the ugly way, but what's the most efficient way to perform this task?
I tried a for loop but couldn't make it work :D
Datasets imported as df1, df2, df3, df4:
dataset_list = (df1, df2, df3, df4)
new_dataframes = (df_1, df_2, df_3, df_4)
for i in dataset_list:
    for e in new_dataframes:
        e = i.loc[0:, ['column1','column2','column3','column4']]

You could use a dictionary comprehension:
cols = ['column1','column2','column3','column4']
dfs = {k: df[cols] for k, df in enumerate([df1, df2, df3, df4], 1)}
The benefit of this method is it caters for an arbitrary number of items without having to manually increment variable names.
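A self-contained sketch of this approach with toy stand-ins for the four datasets (the column names and values here are assumptions, not from the original post):

```python
import pandas as pd

# Toy stand-ins for the four imported datasets (contents are assumptions)
df1 = pd.DataFrame({'column1': [1], 'column2': [2], 'column3': [3],
                    'column4': [4], 'extra': [99]})
df2, df3, df4 = df1.copy(), df1.copy(), df1.copy()

cols = ['column1', 'column2', 'column3', 'column4']
# One dict entry per dataset, keyed 1..4 via enumerate(..., 1)
dfs = {k: df[cols] for k, df in enumerate([df1, df2, df3, df4], 1)}

print(list(dfs))             # [1, 2, 3, 4]
print(list(dfs[1].columns))  # ['column1', 'column2', 'column3', 'column4']
```

Access the sliced frames as dfs[1], dfs[2], and so on.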

How about this approach:
dataset_list = (df1, df2, df3, df4)

def slice(df):
    return df.loc[:, ['column1','column2','column3','column4']]

df_1, df_2, df_3, df_4 = map(slice, dataset_list)

Related

Create a new column in multiple dataframes using for loop

I have multiple dataframes with the same structure but different values
for instance,
df0, df1, df2...., df9
To each dataframe I want to add a column named eventdate that consists of one date, for instance 2021-09-15, using a for loop:
for i in range(0, 9):
    df+str(i)['eventdate'] = "2021-09-15"
but I get an error message
SyntaxError: cannot assign to operator
I think it's because df isn't defined. This should be very simple. Any idea how to do this? Thanks.
dfs = [df0, df1, df2, ..., df9]
dfs_new = []
for i, df in enumerate(dfs):
    df['eventdate'] = "2021-09-15"
    dfs_new.append(df)
If you can't generate a list, you could use eval(f"df{str(num)}"), but this method isn't recommended from what I've seen.
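For completeness, a runnable sketch of the list approach above, with toy stand-ins for df0..df9 (the contents are assumptions):

```python
import pandas as pd

# Toy stand-ins for df0..df9 (contents are assumptions)
dfs = [pd.DataFrame({'x': [i]}) for i in range(10)]

# Column assignment mutates each dataframe in place, so iterating
# the list is enough; no eval() or variable-name tricks needed
for df in dfs:
    df['eventdate'] = "2021-09-15"

print(dfs[9]['eventdate'].tolist())  # ['2021-09-15']
```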

How to speed up an ordinary dataframe loop in python? vectorisation? multiprocessing?

I have a simple piece of code.
Essentially, I want to speed up my loop that creates a dataframe using dataframes.
I haven't found an example and would appreciate anyone's help.
df_new = []
for df_i in df:
    df_selected = df[df['good_value'] == df_i_list]
    df_new = pd.concat([df_new, df_selected])
Given your code does not work, this is the best I can come up with.
Start with a list of dataframes, then select the rows in your dataframes to another list and then concat in one step.
Since concat is the heavy operation, this makes sure you call it only once, which is how it's meant to be used.
import pandas as pd
dfs = [df1, df2, df3, df4, ...]
sel = [df[df['column_to_filter'] == 'good_value'] for df in dfs]
df_new = pd.concat(sel) # might be useful to add `ignore_index=True`
Alternatively, if all the rows live in a single dataframe, a single .isin() filter avoids the loop entirely:
df_new = df[df['good_value'].isin(df_i_list)]
In my testing, the pd.concat approach was about 4x slower than .isin().
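To make the comparison concrete, a small sketch (the column name and values are assumptions) showing that filtering per value and concatenating returns the same rows as one .isin() filter:

```python
import pandas as pd

df = pd.DataFrame({'good_value': list('abcabcabc'), 'x': range(9)})
df_i_list = ['a', 'c']  # values to keep (assumed)

# Loop-and-concat: one filter per value, one concat at the end
sel = [df[df['good_value'] == v] for v in df_i_list]
df_concat = pd.concat(sel).sort_index()

# Single vectorised filter
df_isin = df[df['good_value'].isin(df_i_list)]

print(df_concat.equals(df_isin))  # True
```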

Better pattern for storing results in loop?

When I work with data, very often I will have a bunch of similar objects I want to iterate over to do some processing and store the results.
import pandas as pd
import numpy as np
df1 = pd.DataFrame(np.random.randint(0, 1000, 20))
df2 = pd.DataFrame(np.random.randint(0, 1000, 20))
results = []
for df in [df1, df2]:
    tmp_result = df.median()   # do some processing
    results.append(tmp_result) # append results
The problem I have with this is that it's not clear which dataframe the results correspond to. I thought of using the objects as keys for a dict, but this won't always work as dataframes are not hashable objects and can't be used as keys to dicts:
import pandas as pd
import numpy as np
df1 = pd.DataFrame(np.random.randint(0, 1000, 20))
df2 = pd.DataFrame(np.random.randint(0, 1000, 20))
results = {}
for df in [df1, df2]:
    tmp_result = df.median()  # do some processing
    results[df] = tmp_result  # doesn't work
I can think of a few hacks to get around this, like defining unique keys for the input objects before the loop, or storing the input and the result as a tuple in the results list. But in my experience those approaches are rather unwieldy, error-prone, and I suspect they're not terribly great for memory usage either. Mostly, I just end up using the first example, and make sure I'm careful to manually keep track of the position of the results.
Are there any obvious solutions or best practices to this problem here?
You can keep the original dataframe and the result together in a class:
class Whatever:
    def __init__(self, df):
        self.df = df
        self.result = None

whatever1 = Whatever(pd.DataFrame(...))
whatever2 = Whatever(pd.DataFrame(...))
for whatever in [whatever1, whatever2]:
    whatever.result = whatever.df.median()
There are many ways to improve this depending on your situation: generate the result right in the constructor, add a method to generate and store it, compute it on the fly from a property, and so on.
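For instance, the compute-on-the-fly variant could look like this sketch (it uses functools.cached_property, Python 3.8+, so the result is computed lazily on first access and then cached; the toy data is an assumption):

```python
from functools import cached_property
import pandas as pd

class Whatever:
    def __init__(self, df):
        self.df = df

    @cached_property
    def result(self):
        # Computed lazily on first access, then cached on the instance
        return self.df.median()

w = Whatever(pd.DataFrame({'a': [1, 2, 3]}))
print(w.result['a'])  # 2.0
```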
I would concatenate your data frames, adding an index for each data frame, then use a group-by operation.
pd.concat([df1, df2], keys=['df1', 'df2']).groupby(level=0).median()
If your actual processing is more complex, you could use .apply() instead of .median().
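A runnable sketch of this with toy data (the values are assumptions):

```python
import pandas as pd

df1 = pd.DataFrame({'x': [1, 2, 3]})
df2 = pd.DataFrame({'x': [10, 20, 30]})

# keys= adds an outer index level naming each source frame,
# so groupby(level=0) computes one median per original dataframe
medians = pd.concat([df1, df2], keys=['df1', 'df2']).groupby(level=0).median()
print(medians.loc['df1', 'x'], medians.loc['df2', 'x'])  # 2.0 20.0
```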
You can try something like this:
dd = {'df1': df1,
      'df2': df2}
results_dict = {}
for k, v in dd.items():
    results_dict[k] = v.mean()
results_df = pd.concat(results_dict, keys=results_dict.keys(), axis=1)
print(results_df)
Output:
df1 df2
0 561.65 549.85
If you want correspondingly named output dfs, search SO for using globals() in a loop and see if you can name them after the inputs.
For df1, you could name it
df1.name = 'df1_output'
then use globals() to set the name of the output df to df1.name. Then you'd have df1 and df1_output.

Create new dataframe for each factor level in column

There are 50+ different levels in a column, and each level needs to be broken into its own dataframe and written to a file (excel or csv).
I've seen this as a possible solution:
df1, df2, df3, df4 = [x for _, x in df.groupby(df['column_of_interest'])]
but is there a way not to hard code the number of data frames?
Is there a way not to hard code the number of data frames?
Yes, there is. Use a dictionary or list. Using dict:
dfs = {i: x for i, (_, x) in enumerate(df.groupby('column_of_interest'), 1)}
Then access your dataframes via dfs[1], dfs[2], etc.
Alternatively, using list:
dfs = [x for _, x in df.groupby('column_of_interest')]
Then use dfs[0], dfs[1], etc.
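A runnable sketch of the dict version with toy data (the column values are assumptions); note that groupby sorts group keys by default, so dfs[1] is the group with the smallest key:

```python
import pandas as pd

df = pd.DataFrame({'column_of_interest': ['a', 'a', 'b'], 'x': [1, 2, 3]})

# One sub-dataframe per level, keyed 1..n in sorted key order
dfs = {i: x for i, (_, x) in enumerate(df.groupby('column_of_interest'), 1)}

print(sorted(dfs))           # [1, 2]
print(dfs[1]['x'].tolist())  # [1, 2]
```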
If you don't need to store your dataframe slices, just iterate a groupby object and use to_csv. This is convenient with f-strings (PEP 498, Python 3.6+):
for idx, (value, x) in enumerate(df.groupby('column_of_interest'), 1):
    x.to_csv(f'slice_{value}.csv')  # include value in filename
    x.to_csv(f'slice_{idx}.csv')    # or include the numeric index in the filename
You could save the dataframes directly
[df1.to_csv("coi_%s.csv"%val) for val, df1 in df.groupby(df['column_of_interest'])]
Or with an explicit for loop:
for val, df1 in df.groupby(df['column_of_interest']):
    # Write df1 to csv or excel
    df1.to_csv("coi_%s.csv" % val)
One way to do that is using locals(), but it's not recommended; personally I think jpp's answer is the right way to handle this type of request.
variables = locals()
for key, value in df.groupby(df['column_of_interest']):
    variables["df{0}".format(key)] = value

List comprehension pandas assignment

How do I use list comprehension, or any other technique to refactor the code I have? I'm working on a DataFrame, modifying values in the first example, and adding new columns in the second.
Example 1
df['start_d'] = pd.to_datetime(df['start_d'],errors='coerce').dt.strftime('%Y-%b-%d')
df['end_d'] = pd.to_datetime(df['end_d'],errors='coerce').dt.strftime('%Y-%b-%d')
Example 2
df['col1'] = 'NA'
df['col2'] = 'NA'
I'd prefer to avoid using apply, just because it'll increase the number of lines
I think you need a simple loop, especially if you want to avoid apply and have many columns:
cols = ['start_d','end_d']
for c in cols:
    df[c] = pd.to_datetime(df[c], errors='coerce').dt.strftime('%Y-%b-%d')
If you need a list comprehension, concat is necessary because the result is a list of Series:
comp = [pd.to_datetime(df[c], errors='coerce').dt.strftime('%Y-%b-%d') for c in cols]
df = pd.concat(comp, axis=1)
But a solution with apply is still possible (note that .dt.strftime must go inside the lambda, since .dt only exists on a Series, not on a DataFrame):
df[cols] = df[cols].apply(lambda x: pd.to_datetime(x, errors='coerce').dt.strftime('%Y-%b-%d'))
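As a sanity check, a self-contained sketch (the toy dates are assumptions) showing the loop and the apply version produce the same result, including a bad value coerced to NaT:

```python
import pandas as pd

df = pd.DataFrame({'start_d': ['2020-01-05', 'bad'],
                   'end_d': ['2020-02-10', '2020-03-15']})
cols = ['start_d', 'end_d']

# Loop version
df_loop = df.copy()
for c in cols:
    df_loop[c] = pd.to_datetime(df_loop[c], errors='coerce').dt.strftime('%Y-%b-%d')

# apply version: .dt.strftime inside the lambda, once per column Series
df_apply = df.copy()
df_apply[cols] = df_apply[cols].apply(
    lambda x: pd.to_datetime(x, errors='coerce').dt.strftime('%Y-%b-%d'))

print(df_loop.equals(df_apply))   # True
print(df_loop.loc[0, 'start_d'])  # 2020-Jan-05
```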
