I have a dictionary of dataframes
list_of_dfs = {'df1': Dataframe, 'df2': Dataframe, 'df3': Dataframe, 'df4': Dataframe}
Each dataframe contains the same variables (price, volume, "Sale/Purchase"), and I want to manipulate them to end up with a new set of dataframes. Each new dataframe should keep only the observations that have "Sell" in the variable called "Sale/Purchase":
sell=df[df["Sale/Purchase"]=="Sell"]
My question is how do I loop over the dictionary in order to get a new dictionary with this new subset?
I don't know how to write the command for this loop. I know it has to start like this:
# Create an empty dictionary called new_dfs to hold the results
new_dfs = {}
# Loop over key-value pair
for key, df in list_of_dfs.items():
But then, because I have little experience looping over a dictionary of dataframes, I don't know how to write the filter command. I would be really thankful if someone could help me.
Thanks in advance.
Try this,
# Placeholder: each value stands for an actual pandas DataFrame
dict_of_dfs = {'df1': Dataframe, 'df2': Dataframe, 'df3': Dataframe, 'df4': Dataframe}
# Create an empty dictionary called new_dfs to hold the results
new_dfs = {}
# Loop over key-value pairs
for key, df in dict_of_dfs.items():
    new_dfs[key] = df[df["Sale/Purchase"] == "Sell"]
Explanation:
new_dfs = {}  # Here we have created an empty dictionary.
# A dictionary contains keys and values.
# To add keys and values to our dictionary,
# we assign them as shown below:
new_dfs[our_key_1] = our_value_1
new_dfs[our_key_2] = our_value_2
.
.
.
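To make the loop above concrete, here is a minimal runnable sketch. The two small dataframes and their values are made up for illustration; only the column names come from the question.

```python
import pandas as pd

# Hypothetical sample data; only the column names follow the question.
dict_of_dfs = {
    "df1": pd.DataFrame({
        "price": [10.0, 11.5, 9.8],
        "volume": [100, 250, 80],
        "Sale/Purchase": ["Sell", "Purchase", "Sell"],
    }),
    "df2": pd.DataFrame({
        "price": [20.0, 21.0],
        "volume": [5, 7],
        "Sale/Purchase": ["Purchase", "Sell"],
    }),
}

new_dfs = {}
for key, df in dict_of_dfs.items():
    # The boolean mask keeps only the rows whose column equals "Sell".
    new_dfs[key] = df[df["Sale/Purchase"] == "Sell"]

for key, df in new_dfs.items():
    print(key, len(df))
```

The original dataframes in dict_of_dfs are left untouched; new_dfs holds the filtered results under the same keys.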
You can map a function:
lambda df: df[df["Sale/Purchase"] == "Sell"]
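How:
Syntax: map(fun, iter)
map(lambda df: df[df["Sale/Purchase"] == "Sell"], list_of_dfs)
This works when list_of_dfs is a list or a set of dataframes. Mapping directly over a dict would iterate over its keys, not its dataframes.
For a dict: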
df_dict = {k: df[df["Sale/Purchase"]=="Sell"] for k, df in list_of_dfs.items()}
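As a rough illustration of how the two approaches relate, the sketch below maps a filter function over the dict's values and zips the keys back on, then compares the result with the dict comprehension. The sample data and the helper name only_sells are assumptions made for this example.

```python
import pandas as pd

def only_sells(df):
    # Hypothetical helper: keep rows where "Sale/Purchase" equals "Sell".
    return df[df["Sale/Purchase"] == "Sell"]

# Made-up sample data using the column name from the question.
frames = {
    "a": pd.DataFrame({"price": [1, 2], "Sale/Purchase": ["Sell", "Purchase"]}),
    "b": pd.DataFrame({"price": [3, 4], "Sale/Purchase": ["Sell", "Sell"]}),
}

# map() runs over the values; zip() pairs each result with its key again.
mapped = dict(zip(frames.keys(), map(only_sells, frames.values())))

# The dict comprehension does the same in one expression.
comprehended = {k: only_sells(df) for k, df in frames.items()}

print(all(mapped[k].equals(comprehended[k]) for k in frames))
```

Both forms give the same dictionary; the comprehension is usually easier to read because the keys never leave the expression.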
Something like:
sells = {k: v[v["Sale/Purchase"] == "Sell"] for k, v in list_of_dfs.items()}
This pattern is called dictionary comprehension. According to this question this is the fastest and most Pythonic approach.
You should provide an example of the data you are dealing with to get a more precise answer.
Related
I have 4 different dataframes containing time series data that all have the same structure.
My goal is to take each individual dataframe and pass it through a function I have defined that will group them by datestamp, sum the columns and return a new dataframe with the columns I want. So in total I want 4 new dataframes that have only the data I want.
I just looked through this post:
Loop through different dataframes and perform actions using a function
but applying this did not change my results.
Here is my code:
I am putting the dataframes in a list so I can iterate through them
dfs = [vds, vds2, vds3, vds4]
This is my function I want to pass each dataframe through:
def VDS_pre(df):
    df = df.groupby(['datestamp', 'timestamp']).sum().reset_index()
    df = df.rename(columns={'datestamp': 'Date', 'timestamp': 'Time', 'det_vol': 'VolumeVDS'})
    df = df[['Date', 'Time', 'VolumeVDS']]
    return df
This is the loop I made to iterate through my dataframe list and pass each one through my function:
for df in dfs:
    df = VDS_pre(df)
However once I go through my loop and go to print out the dataframes, they have not been modified and look like they initially did. Thanks for the help!
"However once I go through my loop and go to print out the dataframes, they have not been modified and look like they initially did."
Yes, this is actually the case. The reason why they have not been modified is:
Assigning to the loop variable in a "for item in lst:" loop only rebinds the name item. It has no effect on lst, nor on the variables from which the lst items got their values, as the following code demonstrates:
v1=1; v2=2; v3=3
lst = [v1,v2,v3]
for item in lst:
    item = 0
print(lst, v1, v2, v3) # gives: [1, 2, 3] 1 2 3
To achieve the result you expect to obtain you can use a list comprehension and the list unpacking feature of Python:
vds,vds2,vds3,vds4=[VDS_pre(df) for df in [vds,vds2,vds3,vds4]]
or the following code, which uses a list of strings holding the variable names of the dataframes:
sdfs = ['vds', 'vds2', 'vds3', 'vds4']
for sdf in sdfs:
    exec(f'{sdf} = VDS_pre(eval(sdf))')
Now printing vds, vds2, vds3 and vds4 will output the modified dataframes.
The pandas operations in your function return a new dataframe rather than modifying the original. Your loop stores that result in the variable df, which is never written back to your initial list. This is why nothing appears to have changed after execution.
If you don't need to keep the original frames, you can simply overwrite them:
for i, df in enumerate(dfs):
    dfs[i] = VDS_pre(df)
If you do need them, use a second list and append each result to it:
l = []
for df in dfs:
    df2 = VDS_pre(df)
    l.append(df2)
Or, even better, use a list comprehension to rewrite this snippet as a single line of code.
Now you are able to store the result of your processing.
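The list comprehension mentioned above could look like the following sketch. VDS_pre is the function from the question; the two sample frames and their values are invented for illustration.

```python
import pandas as pd

def VDS_pre(df):
    # Function from the question: group, sum, rename, select columns.
    df = df.groupby(['datestamp', 'timestamp']).sum().reset_index()
    df = df.rename(columns={'datestamp': 'Date', 'timestamp': 'Time',
                            'det_vol': 'VolumeVDS'})
    return df[['Date', 'Time', 'VolumeVDS']]

# Hypothetical sample frames with the columns used in the question.
vds = pd.DataFrame({'datestamp': ['2020-01-01', '2020-01-01', '2020-01-02'],
                    'timestamp': ['00:00', '00:00', '00:00'],
                    'det_vol': [1, 2, 3]})
vds2 = pd.DataFrame({'datestamp': ['2020-01-01', '2020-01-01'],
                     'timestamp': ['00:00', '00:05'],
                     'det_vol': [4, 5]})

dfs = [vds, vds2]

# One processed frame per input frame; the inputs stay unchanged.
results = [VDS_pre(df) for df in dfs]

for result in results:
    print(result)
```

Each entry of results lines up with the entry at the same position in dfs, so results[0] is the processed version of vds.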
Additionally, if your frames have the same structure and can be merged into a single frame, you may consider concatenating them first and then applying your function once. That would be the more idiomatic pandas approach.
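A possible shape for the concatenate-first idea is sketched below. It is not the VDS_pre function itself: it adds a source column (an assumption, so the frames can still be told apart after concatenation) and groups on that column as well. The sample data is made up.

```python
import pandas as pd

# Hypothetical sample frames with the columns used in the question.
frames = {
    "vds": pd.DataFrame({"datestamp": ["2020-01-01", "2020-01-01", "2020-01-02"],
                         "timestamp": ["00:00", "00:00", "00:00"],
                         "det_vol": [1, 2, 3]}),
    "vds2": pd.DataFrame({"datestamp": ["2020-01-01", "2020-01-01"],
                          "timestamp": ["00:00", "00:05"],
                          "det_vol": [4, 5]}),
}

# Tag every row with the name of its frame, then stack all rows together.
combined = pd.concat(
    [df.assign(source=name) for name, df in frames.items()],
    ignore_index=True,
)

# A single groupby now covers all frames at once.
summed = (
    combined
    .groupby(["source", "datestamp", "timestamp"], as_index=False)["det_vol"]
    .sum()
    .rename(columns={"datestamp": "Date", "timestamp": "Time",
                     "det_vol": "VolumeVDS"})
)
print(summed)
```

If separate frames are still needed afterwards, the rows for one input can be selected with summed[summed["source"] == "vds"].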
I have the below dataframe:
And I have the below dictionary:
resource_ids_dict = {'Austria':1586023272, 'Bulgaria':1550004006, 'Croatia':1131119835, 'Denmark':1703440195,
'Finland':2005848983, 'France':1264698819, 'Germany':1907737079, 'Greece':2113941104,
'Italy':27898245, 'Netherlands':1832579427, 'Norway':1054291604, 'Poland':1188865122,
'Romania':270819662, 'Russia':2132391298, 'Serbia':1155274960, 'South Africa':635838568,
'Spain':52600180, 'Switzerland':842323896, 'Turkey':1716131192, 'UK':199152257}
I am using the above dictionary values to make calls to a vendor API. I then append all the return data into a dataframe df.
What I would like to do now is add a column after ID that contains the dictionary key whose value matches the entry in ResourceSetID.
I have had a look on the web, but haven't managed to find anything (probably because I am not searching for the right keywords). Surely this should be a one-liner? I want to avoid looping through the dataframe and the dictionary and mapping that way.
Use Series.map, but first you need to swap the keys and values of the dictionary:
d = {v:k for k, v in resource_ids_dict.items()}
#alternative
#d = dict(zip(resource_ids_dict.values(), resource_ids_dict.keys()))
df['new'] = df['ResourceSetID'].map(d)
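Here is a small self-contained sketch of the same idea. The dataframe contents are invented (the original frame is not shown), the dictionary is shortened to three entries, and the column name Country is an assumption. DataFrame.insert is used so the new column lands directly after ID.

```python
import pandas as pd

# Shortened version of the dictionary from the question.
resource_ids_dict = {'Austria': 1586023272, 'Bulgaria': 1550004006,
                     'Croatia': 1131119835}

# Hypothetical frame; only the column names ID and ResourceSetID
# are taken from the question.
df = pd.DataFrame({'ID': [1, 2, 3, 4],
                   'ResourceSetID': [1550004006, 1586023272, 1131119835, 999]})

# Invert the mapping so that the IDs become the lookup keys.
id_to_country = {v: k for k, v in resource_ids_dict.items()}

# Place the mapped column at position 1, right after 'ID'.
df.insert(1, 'Country', df['ResourceSetID'].map(id_to_country))
print(df)
```

IDs that are missing from the dictionary (999 here) map to NaN instead of raising an error.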
My searching did not turn up a solution for this one. I hope it is simple and I just missed it.
I am trying to assign a dataframe variable based on a dictionary key. I want to loop through a dictionary with keys 0, 1, 2, 3, ... and save the dataframes as df_0, df_1, df_2, ... I am able to get the keys and values working and can assign one dataframe, but cannot find a way to assign new dataframes based on the keys.
I tried How to create a new dataframe with every iteration of for loop in Python but it didn't seem to work.
Here is what I tried:
docs_dict = {0: '2635_base', 1: '2635_tri'}
for keys, docs in docs_dict.items():
    print(keys, docs)
    df = pd.read_excel(Path(folder_loc[docs]) / file_name[docs], sheet_name=sheet_name[docs], skiprows=3)
Output: 0 2635_base 1 2635_tri from the print statement, and %whos DataFrame > df as expected.
What I would like to get is: df_0 and df_1 based on the excel files in other dictionaries which work fine.
df[keys] = pd.read_excel(Path(folder_loc[docs]) / file_name[docs], sheet_name=sheet_name[docs], skiprows=3)
produces a ValueError: Wrong number of items passed 26, placement implies 1
SOLVED thanks to RubenB for pointing me to How do I create a variable number of variables? and the answer by @rocky-li using globals()
for keys, docs in docs_dict.items():
    print(keys, docs)
    globals()['df_{}'.format(keys)] = pd.read_excel(...)
>> Output: dataframes df_0, df_1, ...
You might want to try a dict comprehension like this (substitute pd.read_excel(...docs...) with whatever you need to read the dataframe from disk):
docs_dict = {0: '2635_base', 1: '2635_tri'}
dfs_dict = {k: pd.read_excel(...docs...) for k, docs in docs_dict.items()}
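To show how the resulting dictionary is used in place of variables named df_0, df_1, and so on, here is a sketch that can run without any Excel files. The function load_frame is a hypothetical stand-in for the pd.read_excel call and just builds a tiny dataframe.

```python
import pandas as pd

docs_dict = {0: '2635_base', 1: '2635_tri'}

def load_frame(doc_name):
    # Hypothetical stand-in for the pd.read_excel(...) call in the question.
    return pd.DataFrame({'doc': [doc_name], 'name_length': [len(doc_name)]})

# One dataframe per key, stored under that key.
dfs_dict = {key: load_frame(doc) for key, doc in docs_dict.items()}

# dfs_dict[0] plays the role of df_0, dfs_dict[1] of df_1, and so on.
print(dfs_dict[0])
print(dfs_dict[1])
```

Keeping the frames in a dictionary means later code can loop over them with dfs_dict.items() instead of having to know the variable names in advance.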
Regards,
Apologies if this question appears to be a duplicate of other questions, but I could not find an answer that addresses my exact problem.
I split a dataframe, called "data", into multiple subsets that are stored in a dictionary of dataframes named "dfs" as follows:
# Partition DF
dfs = {}
chunk = 5
for n in range(data.shape[0] // chunk + 1):
    df_temp = data.iloc[n*chunk:(n+1)*chunk]
    df_temp = df_temp.reset_index(drop=True)
    dfs[n] = df_temp
Now, I would like to apply a pre-defined helper function called "fun_c" to EACH of the dataframes (that are stored in the dictionary object called "dfs").
Is it correct for me to apply the function to the dfs in one go, as follows?
result = fun_c(dfs)
If not, what would be the correct way of doing this?
It depends on the output you're looking for:
If you want a dict in the output, then apply the function to each dict item:
result = {key: fun_c(val) for key, val in dfs.items()}
If you want a list of dataframes/values in the output, then apply the function to each dict value:
result = [fun_c(val) for val in dfs.values()]
But your style isn't wrong either; you can iterate however you like inside the helper function as well:
def fun_c(dfs):
    result = None
    # either
    for key, val in dfs.items():
        pass
    # or
    for val in dfs.values():
        pass
    return result
Let me know if this helps!
Since you want this:
Now, I would like to apply a pre-defined helper function called
"fun_c" to EACH of the dataframes (that are stored in the dictionary
object called "dfs").
Let's say your dataframe dict looks like this and your helper function takes in a single dataframe.
dfs = {0 : df0, 1: df1, 2: df2, 3:df3}
Let's iterate through the dictionary, apply the fun_c function on each of the dataframes, and save the results in another dictionary having the same keys:
dfs_result = {k: fun_c(v) for k, v in dfs.items()}
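Putting the partitioning and the per-frame application together, a minimal sketch might look like this. The data frame and the body of fun_c are invented for illustration; the real helper is not shown in the question. The chunks are built with range(0, len(data), chunk), which differs from the question's loop: the "// chunk + 1" form produces an empty last chunk whenever the number of rows is an exact multiple of chunk.

```python
import pandas as pd

# Hypothetical data: 12 rows, split into chunks of 5.
data = pd.DataFrame({'x': range(12)})
chunk = 5

# Step through the row positions in strides of `chunk`.
dfs = {
    n: data.iloc[start:start + chunk].reset_index(drop=True)
    for n, start in enumerate(range(0, len(data), chunk))
}

def fun_c(df):
    # Hypothetical helper that takes a single dataframe.
    return int(df['x'].sum())

# Apply the helper to every chunk, keeping the same keys.
dfs_result = {k: fun_c(v) for k, v in dfs.items()}
print(dfs_result)
```

Because the helper takes one dataframe at a time, calling fun_c(dfs) on the whole dictionary would not work here; the comprehension is what passes each frame in turn.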
I have a large set of data that I have processed to generate a dictionary. Now I want to create a dataframe from this dictionary. The values of the dictionary are lists of tuples. From those values I need to find the unique labels to build the columns of the dataframe:
d = {'0001': [('skiing',0.789),('snow',0.65),('winter',0.56)],'0002': [('drama', 0.89),('comedy', 0.678),('action',-0.42), ('winter',-0.12),('kids',0.12)],'0003': [('action', 0.89),('funny', 0.58),('sports',0.12)],'0004': [('dark', 0.89),('Mystery', 0.678),('crime',0.12), ('adult',-0.423)],'0005': [('cartoon', -0.89),('comedy', 0.678),('action',0.12)],'0006': [('drama', -0.49),('funny', 0.378),('Suspense',0.12), ('Thriller',0.78)],'0007': [('dark', 0.79),('Mystery', 0.88),('crime',0.32), ('adult',-0.423)]}
(size of the dictionary close to 800,000 records)
I iterate over the dictionary to find out the unique headers:
col_headers = []
entities = []
for key, scores in d.items():  # d.iteritems() in Python 2
    entities.append(key)
    d[key] = dict(scores)
    col_headers.extend(d[key].keys())
col_headers = list(set(col_headers))
I believe this takes a long time to process. Using dict might also be an issue, since I suspect it is much slower. Furthermore, constructing the dataframe row by row slows the process down even more:
df = pd.DataFrame(columns=col_headers, index=entities)
for k in d:
    df.loc[k] = pd.Series(d[k])
df = df.fillna(0.0)
How can I speed this up to reduce the processing time?
@ajcr almost gets it.
But you probably also need to unwrap the internal key-value pairs into a dictionary along the way.
df = pd.DataFrame.from_dict({ k: dict(v) for k,v in d.items() },
orient="index").fillna(0)
Then optionally, if you want to homogenize the style of column titles:
df.columns = [c.lower() for c in df.columns]
If you wanted to go entirely crazy, you could then sort the columns:
df = df.sort_index(axis=1)
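For reference, the steps of this answer can be run end to end on a small subset of the sample dictionary, as in the sketch below. It uses sort_index(axis=1) to sort the columns, since DataFrame.sort is not available in current pandas versions.

```python
import pandas as pd

# Three entries taken from the sample dictionary in the question.
d = {
    '0001': [('skiing', 0.789), ('snow', 0.65), ('winter', 0.56)],
    '0003': [('action', 0.89), ('funny', 0.58), ('sports', 0.12)],
    '0005': [('cartoon', -0.89), ('comedy', 0.678), ('action', 0.12)],
}

# Turn each list of (label, score) tuples into a dict, then let pandas
# build one row per key and one column per distinct label.
df = pd.DataFrame.from_dict({k: dict(v) for k, v in d.items()},
                            orient='index').fillna(0)

# Optional: lowercase the column titles and sort the columns by name.
df.columns = [c.lower() for c in df.columns]
df = df.sort_index(axis=1)
print(df)
```

Labels that do not occur for a given key end up as 0 after fillna, so every row has a value in every column.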