In Python, I've created a bunch of dataframes like so:
df1 = pd.read_csv("1.csv")
...
df50 = pd.read_csv("50.csv") # import modes may vary based on the csv, no real way to shorten this
For every dataframe, I'd like to perform an operation that requires a string name. For instance, given an existing database db,
df1.to_sql("df1", db) # and so on.
The dataframes may have non-sequential names, so I can't do for i in range(1,51): "df"+str(i).
I'm looking for the right way to do this, instead of repeating the line 50 times. My idea was something like
for df in [df1, df2... df50]:
    df.to_sql(df.__name__, db)  # but dataframes don't have a __name__
How do I get the string "df1" from the dataframe I've called df1?
Is there an even nicer way to do all this?
Since the name appears to have been created following a pattern in the first place, just use code to replicate that pattern:
for i, df in enumerate([df1, df2... df50], start=1):  # start=1 so the first name is df1, not df0
    df.to_sql(f'df{i}', db)
(Better yet, don't have those variables in the first place; create the list directly.)
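For example, a minimal sketch of that approach, assuming the files really are named 1.csv through 50.csv and db is the existing database connection from the question:

import pandas as pd

# build the list directly instead of fifty separate variables
dfs = [pd.read_csv(f"{i}.csv") for i in range(1, 51)]

# the list position reproduces the old df1..df50 naming
for i, df in enumerate(dfs, start=1):
    df.to_sql(f"df{i}", db)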
The dataframes may have non-sequential names, so I can't do for i in range(1,51): "df"+str(i).
Oh. Well in that case, if you want to associate textual names that don't follow a pattern with the objects, that is what a dict is for:
dfs = {
"df1": pd.read_csv("1.csv"),
# whichever other names and values make sense
}
which you can iterate over easily:
for name, df in dfs.items():
    df.to_sql(name, db)
If there is a logical rule that relates the input filename to the one that should be used for the to_sql call, you can use a dict comprehension to build the dict:
dfs = {to_sql_name(csv_name): pd.read_csv(csv_name) for csv_name in ...}
Or do the loading and processing in the same loop:
for csv_name in ...:
    pd.read_csv(csv_name).to_sql(to_sql_name(csv_name), db)
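For concreteness, a fuller sketch of that loop. The glob pattern and the to_sql_name rule here are assumptions (the original post elides both); adjust them to whatever actually relates your filenames to table names. db is again the database from the question:

import re
from pathlib import Path

import pandas as pd

def to_sql_name(csv_path):
    # hypothetical rule: drop the directory and extension, replace non-word characters
    return re.sub(r"\W", "_", csv_path.stem)

for csv_path in Path(".").glob("*.csv"):
    pd.read_csv(csv_path).to_sql(to_sql_name(csv_path), db)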
In Python, I'm reading an excel file with multiple sheets, with the intention of each sheet being its own dataframe:
df = pd.read_excel('Book1.xlsx', sheet_name=None)
So to get the dictionary keys for each dataframe (or sheet) I can use df.keys(), which gives me each sheet name from the original Excel file: dict_keys(['GF-1', 'H_2 S-Z', 'GB-SF+NZ'])
I can then assign each dictionary into its own dataframe using:
for key in df.keys():
    globals()[key] = df[key]
But, because the sheet names from the original Excel file contain special characters ( -, spaces, + etc), I can't call up any of the dataframes individually:
H_2 S-Z.head()
^
SyntaxError: invalid syntax
I know that dataframe 'names' cannot contain special characters or start with numbers etc, so how do I remove those special characters?
I don't think the dict_keys can be edited (e.g. using regex). I also thought about creating a list of the dataframes, then perhaps doing a regex for loop to iterate over each dataframe name, but I'm not sure that would assign the 'new' dataframe name back to each dataframe.
Can anyone help me?
You can use re.sub with a dict comprehension to get rid of the characters (-, +, whitespace, ...):
import re
dict_dfs = pd.read_excel("Book1.xlsx", sheet_name=None)
dict_dfs = {re.sub(r"[-+\s]", "_", k): v for k,v in dict_dfs.items()}
for key in dict_dfs.keys():
    globals()[key] = dict_dfs[key]
As suggested by @cottontail, you can also use re.sub(r"\W", "_", k).
NB: As a result (in the global scope), you'll have as many variables (pandas.core.frame.DataFrame objects) as there are worksheets in your Excel file.
print([(var, type(val)) for var, val in globals().items()
       if type(val) == pd.core.frame.DataFrame])
#[('GF_1', pandas.core.frame.DataFrame),
# ('H_2_S_Z', pandas.core.frame.DataFrame),
# ('GB_SF_NZ', pandas.core.frame.DataFrame)]
globals() is already a dictionary (you can confirm by isinstance(globals(), dict)), so the individual sheets can be accessed as any dict value:
globals()['H_2 S-Z'].head()
etc.
That being said, instead of creating individually named dataframes, storing the sheets as dataframes in a single dictionary may be more readable and accessible for you down the road. The individual names are already creating problems, given that you cannot name your dataframes after the sheet names. If you change the dataframe names, you'll then need another mapping that tells you which sheet name corresponds to which dataframe name, so it's a lot of work tbh. As you already have a dictionary of dataframes in df, why not access the individual sheets by df['H_2 S-Z'] etc.?
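For instance, sticking with the dictionary the question already builds:

df = pd.read_excel('Book1.xlsx', sheet_name=None)

df['H_2 S-Z'].head()  # one sheet, special characters and all

for name, sheet in df.items():  # or loop over every sheet
    print(name, sheet.shape)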
I am trying to add a new column at the end of my pandas dataframe that will contain the values of previous cells in key:value pair. I have tried the following:
import json
df["json_formatted"] = df.apply
(
lambda row: json.dumps(row.to_dict(), ensure_ascii=False), axis=1
)
It creates the column json_formatted successfully with all the required data, but the problem is that it also adds json_formatted itself as another key. I don't want that. I want the JSON data to contain only the information from the original df columns. How can I do that?
Note: I set ensure_ascii=False because the column names are in Japanese characters.
Create a new variable holding the created column and add it afterwards:
json_formatted = df.apply(lambda row: json.dumps(row.to_dict(), ensure_ascii=False), axis=1)
df['json_formatted'] = json_formatted
This behaviour shouldn't happen; it is most likely caused by having run the function more than once (you added the column, then ran df.apply on the same dataframe, so the new column was included in the second round of serialization).
You can avoid this by making your columns explicit: df[['col1', 'col2']].apply()
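A sketch of that fix, with hypothetical columns col1 and col2 standing in for your real ones; since only the listed columns are serialized, re-running the cell can no longer fold json_formatted back into the JSON:

df["json_formatted"] = df[["col1", "col2"]].apply(
    lambda row: json.dumps(row.to_dict(), ensure_ascii=False), axis=1
)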
apply is an expensive operation in Pandas, and if performance matters it is better to avoid it. An alternative way to do this is:
df["json_formatted"] = [json.dumps(s, ensure_ascii=False) for s in df.T.to_dict().values()]
I have a problem with a list containing many dataframes. I create them this way:
listWithDf = []
listWithDf.append(file)  # repeated for each dataframe that gets loaded
And I got: [output showing the list of dataframes]
And now I wanna work with the data inside this list, but I want to have one dataframe with all the data. I know this is a very ugly way, and it must be changed every time the number of dataframes changes:
df = pd.concat([listWithDf[0], listWithDf[1], ...])
So, I was wondering: is there any better way to unpack a list like that? Or maybe there is a different way to make some dataframe in a loop that contains the data I need.
Here's a way you can do it, as suggested in the comments by @sjw:
df = pd.concat(listWithDf)
Here's a method with a loop (but it's unnecessary!):
df = pd.concat([i for i in listWithDf])
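One small addition that is often wanted here: each frame loaded from a separate file carries its own 0-based index, so passing ignore_index=True renumbers the rows of the combined frame:

df = pd.concat(listWithDf, ignore_index=True)  # rows renumbered 0..n-1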
This is a question about how to make building a Pandas.DataFrame more elegant / succinct.
I want to create a dataframe from a list of tuples.
I can go as usual and create it from the list after collecting all of them, for example,
import pandas as pd
L = []
for d in mydata:
    a, b, c = food(d)
    L.append((a, b, c))  # append a single tuple
df = pd.DataFrame(data=L, columns=['A', 'B', 'C'])
However, I would like instead to add the rows to the dataframe immediately, rather than keeping the intermediate list, hence using dataframes as the sole data structure in my code.
This seems much more elegant to me; one possible way to do this is indeed to use DataFrame's append function, as suggested by @PejoPhylo:
df = pd.DataFrame(columns=['A','B','C'])
for d in mydata:
    a, b, c = food(d)
    df = df.append([(a, b, c)])
However, if I do this, it creates additional integer-named columns (0, 1, 2, etc.).
I could also add a dictionary in each row:
df = pd.DataFrame(columns=['A','B','C'])
for d in mydata:
    a, b, c = food(d)
    df = df.append([{'A': a, 'B': b, 'C': c}])
But I would still like some way to add the data without specifying the names of the columns at each iteration.
Is there a way to do this which will be as efficient as the uppermost version of the code, yet not seem cumbersome?
Let's say I have a list of objects (in this instance, dataframes)
myList = [dataframe1, dataframe2, dataframe3 ...]
I want to loop over my list and create new objects based on the names of the list items. What I want is a pivoted version of each dataframe, called "dataframe[X]_pivoted", where [X] is the identifier for that dataframe.
My pseudocode looks something like:
for d in myList:
    d + '_pivot' = d.pivot_table(index='columnA', values=['columnB'], aggfunc=np.sum)
And my desired output looks like this:
myList = [dataframe1, dataframe2 ...]
dataframe1_pivoted # contains a pivoted version of dataframe1
dataframe2_pivoted # contains a pivoted version of dataframe2
dataframe3_pivoted # contains a pivoted version of dataframe3
Help would be much appreciated.
Thanks
John
You do not want to do that. Creating variables dynamically is almost always a very bad idea. The correct thing to do is simply to use an appropriate data structure to hold your data, e.g. either a list (as your elements are all just numbered, you can just as well access them via an index) or a dictionary (if you really, really want to give a name to each individual thing):
pivoted_list = []
for df in myList:
    # whatever you need to do to turn a dataframe into a pivoted one
    pivoted_df = df.pivot_table(index='columnA', values=['columnB'], aggfunc='sum')
    pivoted_list.append(pivoted_df)

# now access your results by index
do_something(pivoted_list[0])
do_something(pivoted_list[1])
The same thing can be expressed as a list comprehension. Assume pivot is a function that takes a dataframe and turns it into a pivoted frame, then this is equivalent to the loop above:
pivoted_list = [pivot(df) for df in myList]
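For concreteness, a sketch of such a pivot function, reusing the pivot_table call from the question (aggfunc='sum' is the string form of the np.sum used there):

def pivot(df):
    # the index and values columns come from the original post
    return df.pivot_table(index='columnA', values=['columnB'], aggfunc='sum')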
If you are certain that you want to have names for the elements, you can create a dictionary, by using enumerate like this:
pivoted_dict = {}
for index, df in enumerate(myList):
    # whatever you need to do to turn a dataframe into a pivoted one
    pivoted_df = df.pivot_table(index='columnA', values=['columnB'], aggfunc='sum')
    dfname = "dataframe{}_pivoted".format(index + 1)
    pivoted_dict[dfname] = pivoted_df

# access results by name
do_something(pivoted_dict["dataframe1_pivoted"])
do_something(pivoted_dict["dataframe2_pivoted"])
The way to achieve that, given the target name as a string, is:
globals()[name + '_pivoted'] = df.pivot_table(...)
[edit] after looking at your edit, I see that you may want to do something like this:
for i, d in enumerate(myList, start=1):
    globals()['dataframe%d_pivoted' % i] = d.pivot_table(...)  # start=1 matches dataframe1_pivoted
However, as others have suggested, it is inadvisable to do so if it is going to create lots of global variables.
There are better ways (read: data structures) to do so.