I'm importing multiple dataframes and wrote the following process: 1. a list of files to be converted to dataframes, 2. a list of names I want for the corresponding dataframes, and 3. a dictionary combining the two lists:
tbls = ['tbl1', 'tbl2', 'tbl3']
dbname = ['dfABC', 'dfrand', 'dfXYZ']
dictdf = dict(zip(tbls, dbname))
Then I cycle through tbls to import the dataframes. (getdf below is a short function I wrote that reads the path, sheet name, etc. for the excel/csv file in which the table data sits and imports the data.)
for tbl in tbls:
    dictdf[tbl] = getdf(tbl, dfRT, sfsession)
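For context, getdf isn't shown in the question; a purely hypothetical sketch of a reader with that signature (the layout of the lookup table dfRT is invented here, and sfsession goes unused) might be:

import pandas as pd

def getdf(tbl, dfRT, sfsession):
    # Hypothetical sketch: assume dfRT maps each table name to a file
    # path and a sheet name; sfsession is left unused in this sketch.
    row = dfRT.loc[dfRT['table'] == tbl].iloc[0]
    if str(row['path']).endswith('.csv'):
        return pd.read_csv(row['path'])
    return pd.read_excel(row['path'], sheet_name=row['sheet'])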
The process works, except that the dataframes are written into the dictionary, i.e. dfABC in the dictionary is replaced with a dataframe of 65K rows and 27 columns, and so on.
What I want is dfABC = a dataframe of 65K rows and 27 columns, i.e. a standalone variable rather than a dictionary entry. I tried:
str(dictdf[tbl]) = getdf(tbl, dfRT, sfsession)
but that gave an error. Is there a way to do this? Thanks.
Solved using exec and flipping the dictionary (the flip isn't needed for the solution):
tbls = ['tbl1', 'tbl2', 'tbl3']
dfs = ['dfABC', 'dfrand', 'dfXYZ']
dictdf = dict(zip(dfs, tbls))

for df in dfs:
    tbl = dictdf[df]
    exec(f"{df} = getdf('{tbl}', dfRT, sfsession)")
Please note #Xukrao's and #Yo_Chris's comments on keeping the dfs within the dictionary as the superior solution.
I found this question useful for understanding how exec works: What's the difference between eval, exec, and compile?
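For reference, a minimal sketch of that dictionary-based alternative, reusing the asker's getdf, dfRT and sfsession objects:

tbls = ['tbl1', 'tbl2', 'tbl3']
dfnames = ['dfABC', 'dfrand', 'dfXYZ']

# Key each imported dataframe by the name you wanted as a variable.
dfs = {name: getdf(tbl, dfRT, sfsession) for name, tbl in zip(dfnames, tbls)}

# Access by name, no exec required:
dfs['dfABC'].head()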
I have 4 different dataframes containing time series data that all have the same structure.
My goal is to take each individual dataframe and pass it through a function I have defined that will group them by datestamp, sum the columns and return a new dataframe with the columns I want. So in total I want 4 new dataframes that have only the data I want.
I just looked through this post:
Loop through different dataframes and perform actions using a function
but applying this did not change my results.
Here is my code:
I am putting the dataframes in a list so I can iterate through them:
dfs = [vds, vds2, vds3, vds4]
This is the function I want to pass each dataframe through:
def VDS_pre(df):
    df = df.groupby(['datestamp', 'timestamp']).sum().reset_index()
    df = df.rename(columns={'datestamp': 'Date', 'timestamp': 'Time', 'det_vol': 'VolumeVDS'})
    df = df[['Date', 'Time', 'VolumeVDS']]
    return df
This is the loop I made to iterate through my dataframe list and pass each one through my function:
for df in dfs:
    df = VDS_pre(df)
However, once I go through my loop and print out the dataframes, they have not been modified and look like they did initially. Thanks for the help!
However, once I go through my loop and print out the dataframes, they have not been modified and look like they did initially.
Yes, this is actually the case. The reason why they have not been modified is:
Assignment to the loop variable in a for item in lst: loop has no effect on either lst or the variables from which the list items got their values, as the following code demonstrates:
v1 = 1; v2 = 2; v3 = 3
lst = [v1, v2, v3]
for item in lst:
    item = 0
print(lst, v1, v2, v3)  # gives: [1, 2, 3] 1 2 3
To achieve the result you expect, you can use a list comprehension and Python's iterable unpacking:
vds, vds2, vds3, vds4 = [VDS_pre(df) for df in [vds, vds2, vds3, vds4]]
or the following code, which uses a list of strings holding the identifier/variable names of the dataframes:
sdfs = ['vds', 'vds2', 'vds3', 'vds4']
for sdf in sdfs:
    exec(f'{sdf} = VDS_pre(eval(sdf))')
Now printing vds, vds2, vds3 and vds4 will output the modified dataframes.
Pandas frame operations return a new copy of the data. Your snippet stores the result in the df variable, which is neither kept nor written back to your initial list. This is why you don't have any stored result after execution.
If you don't need to keep the original frames, you may simply overwrite them:
for i, df in enumerate(dfs):
    dfs[i] = VDS_pre(df)
Otherwise, use a second list and append the results to it:
l = []
for df in dfs:
    df2 = VDS_pre(df)
    l.append(df2)
Or, even better, use a list comprehension to rewrite this snippet as a single line of code, as sketched below. Either way, you are now able to store the results of your processing.
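A minimal sketch of that one-liner, reusing the names from the snippets above:

l = [VDS_pre(df) for df in dfs]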
Additionally, if your frames have the same structure and can be merged into a single frame, you may consider first concatenating them and then applying your function once to the result, as sketched below. That would be the idiomatic pandas approach.
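A sketch of that concat-first variant, assuming the four frames really do share the same columns (pandas imported as pd):

combined = pd.concat([vds, vds2, vds3, vds4], ignore_index=True)
result = VDS_pre(combined)  # one grouped/summed frame instead of four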
I want to create n DataFrames, using the value s as the name of each DataFrame, but I could only create a list full of DataFrames. Is it possible to turn this list into the individual DataFrames inside it?
# estacao holds something like [ABc, dfg, hil, ..., xyz], and these should be the names of the DataFrames
estacao = dados.Station.unique()
for s, i in zip(estacao, range(126)):
    estacao[i] = dados.groupby('Station').get_group(s)
I'd use a dictionary here. Then you can name the keys with s and the values can each be the dataframe corresponding to that group:
groups = dados.Station.unique()
groupby_ = dados.groupby('Station')
dataframes = {s: groupby_.get_group(s) for s in groups}
Then calling each one by name is as simple as:
group_df = dataframes['group_name']
If you REALLY NEED to create DataFrames named after s (which I named group in the following example), using exec is the solution.
groups = dados.Station.unique()
groupby_ = dados.groupby('Station')
for group in groups:
    exec(f"{group} = groupby_.get_group('{group:s}')")
CAVEAT
See this answer to understand why using exec and eval commands is not always desirable.
Why should exec() and eval() be avoided?
I want to aggregate some API responses into a DataFrame.
The request consistently returns a number of JSON key-value pairs, let's say A, B, C. Occasionally, however, it will return A, B, C, D.
I would like something comparable to SQL's OUTER JOIN that will simply add the new row, while filling the corresponding column of previous rows with NULL or some other placeholder.
The pandas join options insist upon imposing a unique suffix for each side, which I really don't want.
Am I looking at this the wrong way?
If there is no easy solution, I could just select a subset of the consistently available columns but I really wanted to download the lot and do the processing as a separate stage.
You can use pandas.concat, as it provides all the functionality required for your problem. Let this toy problem illustrate a possible solution.
import numpy as np
import pandas as pd

# This generates random data as key and value pairs.
def gen_data(_size):
    import string
    keys = list(string.ascii_uppercase)
    return dict((k, [v]) for k, v in zip(np.random.choice(keys, _size),
                                         np.random.randint(1000, size=_size)))
counter = 0
df = pd.DataFrame()
while True:
    if counter > 5:
        break
    # Receive the data
    new_data = gen_data(5)
    # Convert it to a DataFrame object
    new_data = pd.DataFrame(new_data)
    # Append this data to the running stack; sort=True aligns the columns,
    # so rows missing a key get NaN there
    df = pd.concat((df, new_data), axis=0, sort=True)
    counter += 1
df.reset_index(drop=True, inplace=True)
print(df.to_string())
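As an aside, growing a DataFrame inside a loop copies the accumulated data on every iteration; a sketch of a common variant, with the same toy data, collects the pieces first and concatenates once:

pieces = [pd.DataFrame(gen_data(5)) for _ in range(6)]
df = pd.concat(pieces, axis=0, sort=True, ignore_index=True)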
I have an Excel file with multiple sheets (far more than the three used in this example). I would like to dynamically import them sheet by sheet and assign suffixes to the columns of each to distinguish between them, since they are the same variables acquired at different times. I am able to do it using the following code:
import pandas as pd
filename = 'test.xlsx'
xls = pd.ExcelFile(filename)
df_1 = pd.read_excel(xls, '#1')
df_1 = df_1.add_suffix('_1')
df_2 = pd.read_excel(xls, '#2')
df_2 = df_2.add_suffix('_2')
df_3 = pd.read_excel(xls, '#3')
df_3 = df_3.add_suffix('_3')
However, this becomes a bit tedious when I have a large number of variables assigned to different sheets. Thus, I would like to see if there is a way to dynamically do this with a for loop, whereby I would also update the DataFrame name for each iteration.
Is there a way to do this?
Is it recommended to assign variables dynamically?
I tried some more pythonic approaches to the scenario you described, using a list comprehension and a dict comprehension (you can choose whichever you prefer):
df_dict = {'df_' + str(c): pd.read_excel(xls, i) for c, i in enumerate(xls.sheet_names, 1)}
df_list = [pd.read_excel(xls, i) for i in xls.sheet_names]
print(df_dict['df_1'])
print(df_list[0])
As you can see through tests, both will produce the same DataFrames.
With the list, you access your data through a numeric index (df_list[0], df_list[1] and so on).
With the dict, you access it through keys using the names you suggested, e.g. df_dict['df_1'] for the first sheet.
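Note that the comprehensions above skip the add_suffix step from the question; a sketch that folds it back in:

df_dict = {f'df_{c}': pd.read_excel(xls, i).add_suffix(f'_{c}')
           for c, i in enumerate(xls.sheet_names, 1)}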
Another approach would be to dynamically create variables, assigning them to your globals dict. For example, the code below will produce the same result as the ones shown above:
for c, i in enumerate(xls.sheet_names, 1):
    globals()['df_' + str(c)] = pd.read_excel(xls, i)

print(df_1)
However, I don't recommend using this unless it's REALLY necessary, since you can easily lose track of the variables created in your program.
import pandas as pd

filename = 'test.xlsx'
xls = pd.ExcelFile(filename)

dfs = []
# xls.sheet_names holds the list of all sheet names in the workbook.
for c, i in enumerate(xls.sheet_names, 1):
    df = pd.read_excel(xls, i)
    df = df.add_suffix('_' + str(c))
    dfs.append(df)

# dfs[0], dfs[1], ... contain the dataframes of the respective sheets.
I am trying to match the stop_id in stop_times.csv to the stop_id in stops.csv in order to copy over the stop_lat and stop_lon to their respective columns in stop_times.csv.
Gist files:
stops.csv LINK
stop_times.csv LINK
Here's my code:
import pandas as pd

st = pd.read_csv('csv/stop_times.csv', sep=',')
st.set_index(['trip_id', 'stop_sequence'])
stops = pd.read_csv('csv/stops.csv')

for i in range(len(st)):
    for x in range(len(stops)):
        if st['stop_id'][i] == stops['stop_id'][x]:
            st['stop_lat'][i] = stops['stop_lat'][x]
            st['stop_lon'][i] = stops['stop_lon'][x]

st.to_csv('csv/stop_times.csv', index=False)
I'm aware that the script is assigning to a copy, but I'm not sure what other method to use, as I'm fairly new to pandas.
You can merge the two DataFrames:
pd.merge(stops, st, on='stop_id')
Since there are stop_lat columns in each, the merge will give you stop_lat_x (the good one, from stops) and stop_lat_y (the always-zero one). You can then drop or ignore the bad columns and output the resulting DataFrame however you want, as sketched below.
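A minimal sketch of that cleanup, with the suffix pairs assumed to come out as described above:

merged = pd.merge(stops, st, on='stop_id')
# Keep the coordinates from stops.csv and drop the zeroed duplicates.
merged = merged.drop(columns=['stop_lat_y', 'stop_lon_y'])
merged = merged.rename(columns={'stop_lat_x': 'stop_lat', 'stop_lon_x': 'stop_lon'})
merged.to_csv('csv/stop_times.csv', index=False)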