Building a dataframe on the fly - python

This is a question about how to make building a Pandas.DataFrame more elegant / succinct.
I want to create a dataframe from a list of tuples.
I can do it the usual way and create it from the list after collecting all the tuples, for example:
import pandas as pd

L = []
for d in mydata:
    a, b, c = food(d)
    L.append((a, b, c))
df = pd.DataFrame(data=L, columns=['A', 'B', 'C'])
However, I would like instead to immediately add the rows to the dataframe, instead of keeping the intermediate list, hence using dataframes as the sole datastructure in my code.
This seems much more elegant to me; one possible way to do this is indeed to use DataFrame's append function, as suggested by @PejoPhylo:
df = pd.DataFrame(columns=['A', 'B', 'C'])
for d in mydata:
    a, b, c = food(d)
    df.append([(a, b, c)])
However, if I do this, it creates additional columns, named 1, 2, 3, etc.
I could also add a dictionary in each row:
df = pd.DataFrame(columns=['A', 'B', 'C'])
for d in mydata:
    a, b, c = food(d)
    df.append([{'A': a, 'B': b, 'C': c}])
But I would still like some way to add the data without specifying the names of the columns at each iteration.
Is there a way to do this that is as efficient as the first version of the code, yet does not seem cumbersome?
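For reference, a minimal sketch of the list-based approach written as a single expression, assuming mydata and food as above; collecting the rows first and constructing the frame once is generally the fastest option, since row-wise DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0:
import pandas as pd

# build all rows first, then construct the frame in a single call
df = pd.DataFrame([food(d) for d in mydata], columns=['A', 'B', 'C'])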

Related

Unpack a list with many dataframes

I have a problem with a list containing many dataframes. I create them this way:
listWithDf = []
listWithDf.append(file)
Now I want to work with the data inside this list, but I want to have one dataframe with all the data. I know the way below is very ugly and must be changed every time the number of dataframes changes:
df = pd.concat([listWithDf[0], listWithDf[1], ...])
So, I was wondering whether there is any better way to unpack a list like that, or maybe a different way to build a dataframe in a loop that contains the data I need.
Here's a way you can do it, as suggested in the comments by @sjw:
df = pd.concat(listWithDf)
Here's a method with a loop (but it's unnecessary!):
df = pd.concat([i for i in listWithDf])
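For completeness, a minimal sketch of the whole pattern, with the file loading hedged as an assumption (the CSV names are placeholders): build the list in a loop, then concatenate once at the end.
import pandas as pd

listWithDf = []
for path in ['a.csv', 'b.csv']:           # hypothetical file names
    listWithDf.append(pd.read_csv(path))  # one dataframe per file

# a single concat at the end; ignore_index=True gives a fresh 0..n-1 index
df = pd.concat(listWithDf, ignore_index=True)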

How to use a variable name as string in Python

In Python, I've created a bunch of dataframes like so:
df1 = pd.read_csv("1.csv")
...
df50 = pd.read_csv("50.csv") # import modes may vary based on the csv, no real way to shorten this
For every dataframe, I'd like to perform an operation which requires assigning a string as a name. For instance, given an existing database db,
df1.to_sql("df1", db) # and so on.
The dataframes may have a non-sequential name, so I can't do for i in range(1,51): "df"+str(i).
I'm looking for the right way to do this, instead of repeating the line 50 times. My idea was something like
for df in [df1, df2... df50]:
    df.to_sql(df.__name__, db)  # but dataframes don't have a __name__
How do I get the string "df1" from the dataframe I've called df1?
Is there an even nicer way to do all this?
Since the name appears to have been created following a pattern in the first place, just use code to replicate that pattern:
for i, df in enumerate([df1, df2... df50], 1):
    df.to_sql(f'df{i}', db)
(Better yet, don't have those variables in the first place; create the list directly.)
The dataframes may have a non-sequential name, so I can't do for i in range(1,51): "df"+str(i).
Oh. Well, in that case, if you want to associate textual names that don't follow a pattern with the objects, that is what a dict is for:
dfs = {
    "df1": pd.read_csv("1.csv"),
    # whichever other names and values make sense
}
which you can iterate over easily:
for name, df in dfs.items():
    df.to_sql(name, db)
If there is a logical rule that relates the input filename to the one that should be used for the to_sql call, you can use a dict comprehension to build the dict:
dfs = {to_sql_name(csv_name): pd.read_csv(csv_name) for csv_name in ...}
Or do the loading and processing in the same loop:
for csv_name in ...:
    pd.read_csv(csv_name).to_sql(to_sql_name(csv_name), db)
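As a sketch of that last loop, assuming the CSVs live in the current directory and that the (hypothetical) to_sql_name rule simply prefixes the file stem with "df"; db is the existing database connection from the question:
from pathlib import Path
import pandas as pd

def to_sql_name(csv_path):
    # hypothetical naming rule: "1.csv" -> "df1"; replace with whatever logic applies
    return f"df{Path(csv_path).stem}"

for csv_name in Path('.').glob('*.csv'):   # assumed location of the CSV files
    pd.read_csv(csv_name).to_sql(to_sql_name(csv_name), db, if_exists='replace')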

Dynamically update pandas column names to avoid code change

Is there a way to dynamically update column names that are based on previous column names? Or what are best practices for column names while processing data? Below I explain the problem:
When processing data, I often need to create columns that are calculated from the previous columns, and I set up the names like below:
|STUDENT|GRADE|GRADE_AVG|GRADE_AVG_FORMATTED|GRADE_AVG_FORMATTED_FINAL
The problem is, if I need to make a change in the middle of this data flow [for example, hypothetically, say I needed to scale the grade before taking the average], I would have to rename all the column names that were produced after this point. See below:
|STUDENT|GRADE|**GRADE_SCALED**|GRADE_SCALED_AVG|GRADE_SCALED_AVG_FORMATTED|GRADE_SCALED_AVG_FORMATTED_FINAL
Since the code that calculates each column is based on the previous column names, this process of renaming in the code gets really cumbersome, especially for big datasets for which a lot of code has been produced. Any suggestions on how to dynamically update the column names, or best practices for this?
To clarify, an extension of the example:
my code would look like:
df[GRADE_AVG] = df[GRADE].apply(something)
df[GRADE_AVG_FORMATTED] = df[GRADE_AVG].apply(something)
df[GRADE_AVG_FORMATTED_FINAL] = df[GRADE_AVG_FORMATTED].apply(something)
...
... more column names based on the previous one..
...
df[FINAL_SCORE] = df[GRADE_AVG_FORMATTED_FINAL_REVISED...etc]
And then... I need to change GRADE_AVG to GRADE_SCALED_AVG in the code, so I will have to change those column names. This is a small example, but when there are a lot of column names based on the previous ones, changing the code gets messy.
What I do is to change all the column names in the code, like below (but this gets really impractical), hence my question:
df[GRADE_SCALED_AVG] = df[GRADE].apply(something)
df[GRADE_SCALED_AVG_FORMATTED] = df[GRADE_SCALED_AVG].apply(something)
df[GRADE_SCALED_AVG_FORMATTED_FINAL] = df[GRADE_SCALED_AVG_FORMATTED].apply(something)
...
... more column names based on the previous one..
...
df[FINAL_SCORE] = df[GRADE_SCALED_AVG_FORMATTED_FINAL_REVISED...etc]
Let's say your columns start with GRADE. You can do this:
df.columns = ['GRADE_SCALED_'+ '_'.join(x.split('_')[1:]) if x.startswith('GRADE') else x for x in df.columns]
# sample test case
>>> l = ['abc','GRADE_AVG','GRADE_AVG_TOTAL']
>>> ['GRADE_SCALED_'+ '_'.join(x.split('_')[1:]) if x.startswith('GRADE') else x for x in l]
['abc', 'GRADE_SCALED_AVG', 'GRADE_SCALED_AVG_TOTAL']
A nice way to rename dynamically is with the rename method:
import pandas as pd
import re
header = '|STUDENT|GRADE|GRADE_AVG|GRADE_AVG_FORMATTED|GRADE_AVG_FORMATTED_FINAL'
df = pd.DataFrame(columns=header.split('|')) # your dataframe
print(df)
# now rename: can take a function or a dictionary as a parameter
df1 = df.rename(lambda x: re.sub('^GRADE', 'GRADE_SCALE', x), axis=1)
print(df1)
#Empty DataFrame
#Columns: [, STUDENT, GRADE, GRADE_AVG, GRADE_AVG_FORMATTED, GRADE_AVG_FORMATTED_FINAL]
#Index: []
#Empty DataFrame
#Columns: [, STUDENT, GRADE_SCALE, GRADE_SCALE_AVG, GRADE_SCALE_AVG_FORMATTED, GRADE_SCALE_AVG_FORMATTED_FINAL]
#Index: []
However, in your case, I'm not sure this is what you are looking for. Are the AVG and FORMATTED columns generated from the GRADE column? Also, is this RENAMING or REPLACING? Doesn't the content of the columns change as well?
It seems a more complete description of the problem might help.
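One possible practice, sketched here purely as an illustration of the pattern the question's own code implies (it assumes the GRADE_* names are plain string variables, and df and something come from the question's snippets): derive each name from its parent, so a rename such as GRADE to GRADE_SCALED is made in exactly one place.
# hypothetical name constants; editing only the first line renames the whole chain
GRADE = 'GRADE_SCALED'                               # was 'GRADE'
GRADE_AVG = GRADE + '_AVG'
GRADE_AVG_FORMATTED = GRADE_AVG + '_FORMATTED'
GRADE_AVG_FORMATTED_FINAL = GRADE_AVG_FORMATTED + '_FINAL'

# the processing code itself then never needs to change
df[GRADE_AVG] = df[GRADE].apply(something)
df[GRADE_AVG_FORMATTED] = df[GRADE_AVG].apply(something)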

List of Series to Dataframe

I have a list of pandas Series objects, which I've created by doing something like this:
li = []
li.append(input_df.iloc[0])
li.append(input_df.iloc[4])
where input_df is a pandas DataFrame.
I want to convert this list of Series objects back to a pandas DataFrame, and was wondering if there is an easy way to do it.
Based on the post you can do this by doing:
pd.DataFrame(li)
To everyone suggesting pd.concat: this is not a Series anymore. They are adding values to a list, and the data type of li is a list. So to convert the list to a dataframe, they should use pd.DataFrame(<list name>).
Since the right answer has got hidden in the comments, I thought it would be better to mention it as an answer:
pd.concat(li, axis=1).T
will convert the list li of Series to a DataFrame.
It seems that you wish to perform a customized melting of your dataframe.
Using the pandas library, you can do it with one line of code. I am creating below the example to replicate your problem:
import pandas as pd

input_df = pd.DataFrame(data={'1': [1, 2, 3, 4, 5],
                              '2': [1, 2, 3, 4, 5],
                              '3': [1, 2, 3, 4, 5],
                              '4': [1, 2, 3, 4, 5],
                              '5': [1, 2, 3, 4, 5]})
Using pd.DataFrame, you will be able to create your new dataframe that melts your two selected lists:
li = []
li.append(input_df.iloc[0])
li.append(input_df.iloc[4])
new_df = pd.DataFrame(li)
If what you want is for those two rows to end up in a single column, I would not pass them as a list back to a dataframe.
Instead, you can simply append one to the other, disregarding the names of each:
new_df = input_df.iloc[0].append(input_df.iloc[4])
Let me know if this answers your question.
The answer was already mentioned, but I would like to share my version:
li_df = pd.DataFrame(li).T
If you want each Series to be a row of the dataframe, you should not use concat() followed by .T, unless all your values are of the same datatype.
If your data has both numerical and string values, then the transpose() function will mangle the dtypes, likely turning them all to objects.
The right way to do this in general is:
Convert each Series to a dict.
Pass the list of dicts either into the pd.DataFrame() constructor directly, or use pd.DataFrame.from_records.
In your case the following should work:
my_list_of_dicts = [s.to_dict() for s in li]
my_df = pd.DataFrame(my_list_of_dicts)
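A quick illustrative check of the dtype point above, on a small hypothetical frame with one string column and one integer column:
import pandas as pd

input_df = pd.DataFrame({'name': ['a', 'b'], 'score': [1, 2]})
li = [input_df.iloc[0], input_df.iloc[1]]

print(pd.concat(li, axis=1).T.dtypes)                   # name and score both come back as object
print(pd.DataFrame([s.to_dict() for s in li]).dtypes)   # name: object, score: int64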

Looping through a list in Python and creating new objects based on items

Let's say I have a list of objects (in this instance, dataframes)
myList = [dataframe1, dataframe2, dataframe3 ...]
I want to loop over my list and create new objects based on the names of the list items. What I want is a pivoted version of each dataframe, called "dataframe[X]_pivot" where [X] is the identifier for that dataframe.
My pseudocode looks something like:
for d in myList:
    d+'_pivot' = d.pivot_table(index='columnA', values=['columnB'], aggfunc=np.sum)
And my desired output looks like this:
myList = [dataframe1, dataframe2 ...]
dataframe1_pivoted # contains a pivoted version of dataframe1
dataframe2_pivoted # contains a pivoted version of dataframe2
dataframe3_pivoted # contains a pivoted version of dataframe3
Help would be much appreciated.
Thanks
John
You do not want to do that. Creating variables dynamically is almost always a very bad idea. The correct thing to do would be to simply use an appropriate data structure to hold your data, e.g. either a list (as your elements are all just numbered, you can just as well access them via an index) or a dictionary (if you really want to give a name to each individual thing):
pivoted_list = []
for df in mylist:
    pivoted_df = ...  # whatever you need to do to turn a dataframe into a pivoted one
    pivoted_list.append(pivoted_df)
# now access your results by index
do_something(pivoted_list[0])
do_something(pivoted_list[1])
The same thing can be expressed as a list comprehension. Assume pivot is a function that takes a dataframe and turns it into a pivoted frame, then this is equivalent to the loop above:
pivoted_list = [pivot(df) for df in mylist]
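A possible definition of that pivot helper, assuming the same pivot_table call and placeholder column names ('columnA', 'columnB') used in the question:
import numpy as np

def pivot(df):
    # mirror the question's call: sum columnB for each value of columnA
    return df.pivot_table(index='columnA', values=['columnB'], aggfunc=np.sum)
This slots directly into the comprehension above.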
If you are certain that you want to have names for the elements, you can create a dictionary, by using enumerate like this:
pivoted_dict = {}
for index, df in enumerate(mylist):
    pivoted_df = ...  # whatever you need to do to turn a dataframe into a pivoted one
    dfname = "dataframe{}_pivoted".format(index + 1)
    pivoted_dict[dfname] = pivoted_df
# access results by name
do_something(pivoted_dict["dataframe1_pivoted"])
do_something(pivoted_dict["dataframe2_pivoted"])
The way to achieve that is:
globals()[d+'_pivot'] = d.pivot_table(...)
[edit] after looking at your edit, I see that you may want to do something like this:
for i, d in enumerate(myList, 1):
    globals()['dataframe%d_pivoted' % i] = d.pivot_table(...)
However, as others have suggested, it is unadvisable to do so if that is going to create lots of global variables.
There are better ways (read: data structures) to do so.
