Pandas - Concatenating Dataframes - python

I have a script with if statements that can create up to 14 dataframes:
['result_14', 'result_13', 'result_12', 'result_11', 'result_10', 'result_9', 'result_8', 'result_7', 'result_6', 'result_5', 'result_4', 'result_3', 'result_2', 'result_1']
Not all dataframes are created every time I run the script; which ones exist depends on a secondary input variable. I am now trying to concatenate the dataframes but run into an error for those that do not exist.
pd.concat(([result_14, result_13, result_12, result_11, result_10, result_9, result_8, result_7, result_6, result_5, result_4, result_3, result_2, result_1]), ignore_index=True)
NameError: name 'result_13' is not defined
I have tried finding all DataFrames that exist in memory and filtering them by name, but this gives me a list of variable names (strings) rather than a list of DataFrames:
alldfs = [var for var in dir() if isinstance(eval(var), pd.core.frame.DataFrame)]
SelectDFs = [s for s in alldfs if "result" in s]
SelectDFs
['result_14', 'result_15', 'result_12', 'result_11', 'result_10', 'result_9', 'result_8', 'result_7', 'result_6', 'result_5', 'result_4', 'result_3', 'result_2', 'result_1']
pd.concat(([SelectDFs]), ignore_index=True)
TypeError: cannot concatenate object of type '<class 'list'>'; only Series and DataFrame objs are valid

In IPython or Jupyter, you can list the DataFrame variables with the %who_ls magic:
%who_ls DataFrame
# %whos DataFrame
In your case
l = %who_ls DataFrame
pd.concat([eval(dfn) for dfn in l if dfn.startswith('result')], ignore_index=True)

You are passing a list of strings, not DataFrame objects.
Once SelectDFs holds the actual DataFrame objects, you can pass it directly, without the extra brackets:
pd.concat(SelectDFs, ignore_index=True)
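One way to avoid both eval and the NameError is to store each DataFrame in a dict at the point where it is created. The following is only a minimal sketch: secondary_input, the conditions and the sample data are hypothetical stand-ins for whatever the real script does.

```python
import pandas as pd

# Hypothetical stand-in for the script's secondary input variable.
secondary_input = "b"

# Collect each DataFrame under its name at the point where it is created,
# instead of assigning to separate variables such as result_1 ... result_14.
results = {}

if secondary_input in ("a", "b"):
    results["result_1"] = pd.DataFrame({"x": [1, 2]})
if secondary_input == "a":
    results["result_2"] = pd.DataFrame({"x": [3]})
if secondary_input == "b":
    results["result_3"] = pd.DataFrame({"x": [4, 5]})

# Only the frames that were actually created are in the dict,
# so there is no undefined name to trip over.
combined = pd.concat(list(results.values()), ignore_index=True)
print(combined)
```

Because the dict only ever contains frames that exist, the concat call does not need to know which branches ran.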

Have you tried converting them into DataFrames? The error raised by concat says the inputs need to be DataFrames rather than lists, so have you tried converting your lists into DataFrames?
This link may help you:
Convert List to Pandas Dataframe Column

Related

Add selected rows from an existing Pandas DataFrame to a new Pandas DataFrame in for loop in Python

I want to select some rows based on a condition from an existing Pandas DataFrame and then insert it into a new DataFrame.
At first, I tried this:
second_df = pd.DataFrame()
for specific_idx in specific_idx_set:
    second_df = existing_df.iloc[specific_idx]
len(specific_idx_set), second_df.shape => (1000), (15,)
As you can see, I'm iterating over a set of 1000 indexes. However, after the loop only one of these rows was stored in the new DataFrame (second_df), while I expected it to contain 1000 rows with 15 columns.
So I tried another way:
specific_rows = list()
for specific_val in specific_idx_set:
    specific_rows.append(existing_df[existing_df[col] == specific_val])
new_df = pd.DataFrame(specific_rows)
And I got this error:
ValueError: Must pass 2-d input. shape=(1000, 1, 15)
Then, I wrote this code:
specific_rows = list()
new_df = pd.DataFrame()
for specific_val in specific_idx_set:
    specific_rows.append(existing_df[existing_df[col] == specific_val])
pd.concat([new_df, specific_rows])
But I got this error:
TypeError: cannot concatenate object of type '<class 'list'>'; only Series and DataFrame objs are valid
You need to modify your last attempt: drop the empty DataFrame and pass pd.concat a list of DataFrames only:
specific_rows = list()
for specific_val in specific_idx_set:
    specific_rows.append(existing_df[existing_df[col] == specific_val])
out = pd.concat(specific_rows)
The problem with your attempt is that the list passed to pd.concat mixes a DataFrame with a list, which raises the error:
pd.concat([new_df, specific_rows])
# specific_rows is a list
# new_df is a DataFrame
If you do need to include new_df, join two lists instead: the one-element list [new_df] plus the list specific_rows gives a flat list of DataFrames:
pd.concat([new_df] + specific_rows)
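The collect-then-concatenate pattern can be tried on a small example. This is a sketch under assumptions: the sample data, the column name "key" and the values in specific_idx_set are all hypothetical, and a list is used instead of a set so the row order is predictable. The isin variant at the end is a different technique, not part of the answer above.

```python
import pandas as pd

# Hypothetical sample data standing in for existing_df, col and specific_idx_set.
existing_df = pd.DataFrame({"key": [1, 2, 3, 2, 4], "value": list("abcde")})
col = "key"
specific_idx_set = [2, 4]

# Collect one filtered DataFrame per value, then concatenate once at the end.
specific_rows = [existing_df[existing_df[col] == v] for v in specific_idx_set]
out = pd.concat(specific_rows)

# Alternative technique (not from the answer above): a single boolean mask
# built with isin selects the matching rows in their original order.
out_mask = existing_df[existing_df[col].isin(specific_idx_set)]

print(out)
```

Calling pd.concat once on the full list is generally preferable to concatenating inside the loop, since every concat call copies the data.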

use a list plus some strings to select columns from a dataframe

I am trying to make a dynamic list and then combine it with a fixed string to select columns from a dataframe:
import pandas as pd
df = pd.DataFrame([], columns=['c1','c2','c3','c4'])
column_list= ['c2','c3']
df2 = df[['c1',column_list]]
but I get the following error:
TypeError: unhashable type: 'list'
I tried a dict as well, but that gives a similar error.
In your code, pandas tries to look up two columns, 'c1' and ['c2','c3'], which is not possible because only hashable objects can be column names. Even if this didn't raise an error (e.g. if you used a tuple), it wouldn't give you what you want. You need a flat, 1D list.
Use expansion:
df[['c1', *column_list]]
Or addition:
df[['c1']+column_list]
Output:
Empty DataFrame
Columns: [c1, c2, c3]
Index: []

Apply the same block of formatting code to multiple dataframes at once

My raw data is in multiple data files that all have the same format. After importing the 10 csv files using pd.read_csv(filename.csv), I have a series of dataframes df1, df2, df3, etc.
I want to apply all of the code below to each of the dataframes.
I therefore created a function to do it:
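def my_func(df):
    # Strip whitespace from the column names and from every string cell.
    df = df.rename(columns=lambda x: x.strip())
    # Note: applymap is deprecated since pandas 2.1 in favor of DataFrame.map.
    df = df.applymap(lambda x: x.strip() if isinstance(x, str) else x)
    df.date = pd.to_datetime(df.date)
    # Split long_margin on the first space into two new columns A and B
    # (n is passed by keyword, which newer pandas versions require).
    df = df.join(df['long_margin'].str.split(' ', n=1, expand=True).rename(columns={0:'A', 1:'B'}))
    df = df.drop(columns=['long_margin'])
    df = df.drop(columns=['cash_interest'])
    mapping = {df.columns[6]: 'daily_turnover', df.columns[7]: 'cash_interest', df.columns[8]: 'long_margin', df.columns[9]: 'short_margin'}
    df = df.rename(columns=mapping)
    return df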
I then tried to call the function as follows:
list_of_datasets = [df1, df2, df3]
for dataframe in list_of_datasets:
    dataframe = my_func(dataframe)
If I run this code manually, changing df to df1, df2, etc., it works, but it doesn't seem to work in my function (or in the way I am calling it).
What am I missing?
As I understand it, in
for dataframe in list_of_datasets:
    dataframe = my_func(dataframe)
dataframe is a name that refers to an object in the list; it is not a slot in the list itself. When for x in something: runs, Python binds a new variable x to each element in turn. That variable is normally ignored once the loop ends (although it is not deleted).
If the function only modifies the object in place, that is fine: the changes are visible through the list, because the list holds the same object.
But as soon as the function creates a new object named df (a new object with a new id, rather than a modification of the old one) and returns it, the assignment in the loop simply rebinds dataframe to that new object. The element in the list is not replaced; it only reflects whatever in-place changes were made before the function created a new DataFrame.
To see exactly when this happens, add print(id(df)) before and after each line of code in the function and in the loop. When the id changes, you are dealing with a new object, not with the element of the list.
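The rebinding behaviour described above can be seen in a small, self-contained sketch. The function strip_columns and the sample frames are hypothetical and much simpler than my_func; they only illustrate that assigning to the loop variable does not touch the list.

```python
import pandas as pd

def strip_columns(df):
    # rename returns a NEW DataFrame; the object passed in is left unchanged.
    return df.rename(columns=lambda c: c.strip())

frames = [pd.DataFrame({" a ": [1]}), pd.DataFrame({" b ": [2]})]
original_first = frames[0]

for frame in frames:
    frame = strip_columns(frame)  # rebinds the loop variable only

# The list still holds the original, unmodified objects.
unchanged = [list(f.columns) for f in frames]
same_object = frames[0] is original_first

# Assigning by index replaces the elements of the list itself.
for i, frame in enumerate(frames):
    frames[i] = strip_columns(frame)

changed = [list(f.columns) for f in frames]
replaced = frames[0] is not original_first

print(unchanged, changed)
```

After the first loop the column names still contain the surrounding spaces; only the index assignment in the second loop puts the new objects into the list.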
Alex is correct.
To make this work, you could use a list comprehension:
list_of_datasets = [my_func(df) for df in list_of_datasets]
or create a new list for the outputs:
formatted_dfs = []
for dataframe in list_of_datasets:
    formatted_dfs.append(my_func(dataframe))

Dataframe is not defined when trying to concatenate in loop (Python - Pandas)

Consider the following list (named columns_list):
['total_cases',
'new_cases',
'total_deaths',
'new_deaths',
'total_cases_per_million',
'new_cases_per_million',
'total_deaths_per_million',
'new_deaths_per_million',
'total_tests',
'new_tests',
'total_tests_per_thousand',
'new_tests_per_thousand',
'new_tests_smoothed',
'new_tests_smoothed_per_thousand',
'tests_units',
'stringency_index',
'population',
'population_density',
'median_age',
'aged_65_older',
'aged_70_older',
'gdp_per_capita',
'extreme_poverty',
'cvd_death_rate',
'diabetes_prevalence',
'female_smokers',
'male_smokers',
'handwashing_facilities',
'hospital_beds_per_thousand',
'life_expectancy']
Those are columns in two dataframes: US (df_us) and Canada (df_canada). I would like to create one dataframe for each item in the list, by concatenating its corresponding column from both df_us and df_canada.
for i in columns_list:
    df_i = pd.concat([df_canada[i],df_us[i]],axis=1)
Yet, when I type
df_new_deaths
I get the following output: name 'df_new_deaths' is not defined
Why?
You're not actually saving the dataframes: inside the loop, df_i is one literal variable name that is overwritten on every iteration, so df_new_deaths is never defined.
Add the dataframe for each column to a list and access it by index.
Note that pd.concat with axis=1 already returns a DataFrame when given two Series, so the pd.DataFrame wrapper below is redundant but harmless.
df_list = list()
for i in columns_list:
    df_list.append(pd.DataFrame(pd.concat([df_canada[i],df_us[i]],axis=1)))
Alternatively, add the dataframes to a dict, where the column name is also the key:
df_dict = dict()
for i in columns_list:
    df_dict[i] = pd.DataFrame(pd.concat([df_canada[i],df_us[i]],axis=1))
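The dict approach can be sketched on a reduced example. The two small frames below are hypothetical stand-ins for df_canada and df_us, and the keys argument is an addition not in the answer above: it labels the two resulting columns, which would otherwise share the same name.

```python
import pandas as pd

# Hypothetical miniature versions of df_canada and df_us.
df_canada = pd.DataFrame({"new_cases": [10, 20], "new_deaths": [1, 2]})
df_us = pd.DataFrame({"new_cases": [100, 200], "new_deaths": [5, 6]})
columns_list = ["new_cases", "new_deaths"]

# One DataFrame per column, keyed by the column name.
# keys= names the two resulting columns so they can be told apart.
df_dict = {
    col: pd.concat([df_canada[col], df_us[col]], axis=1, keys=["canada", "us"])
    for col in columns_list
}

# Look up by name instead of relying on a variable called df_new_deaths.
print(df_dict["new_deaths"])
```

Looking a frame up with df_dict["new_deaths"] plays the role that the variable df_new_deaths was meant to play in the question.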

List Objects into Individual CSV

I have a list of dataframes which I wish to convert to multiple csv.
Example:
List_Df = [df1,df2,df3,df4]
for i in List_Df:
    i.to_csv("C:\\Users\\Public\\Downloads\\"+i+".csv")
Expected output: Having 4 csv files with the names df1.csv,df2.csv ...
But I am facing two problems:
First problem:
AttributeError: 'list' object has no attribute 'to_csv'
Second problem:
("C:\\Users\\Public\\Downloads\\"+ i +".csv") <- i returns the object, as it's supposed to, but I want Python to automatically take the object's name and use it with .csv
Any help will be greatly appreciated as I am new to Python and SOF.
Thank you :)
Try this:
import pandas as pd
List_Df = [df1,df2,df3,df4]
# start=1 so the files are named df1.csv, df2.csv, ... as expected
for i,e in enumerate(List_Df, start=1):
    df = pd.DataFrame(e)
    df.to_csv("C:\\Users\\Public\\Downloads\\"+"df"+str(i)+".csv")
For your second problem, you would have to name the dataframes first, for example:
for j,df in enumerate(List_Df, start=1):
    df.name = 'df'+str(j)
    df.to_csv("C:\\Users\\Public\\Downloads\\%s.csv" %(df.name))
or just build the name from a string and the index, without naming the dataframes first:
for j,df in enumerate(List_Df, start=1):
    name = 'df'+str(j)
    df.to_csv("C:\\Users\\Public\\Downloads\\%s.csv" %(name))
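Since a DataFrame does not know the name of the variable that refers to it, another option is to keep the names explicitly as dict keys. The following is a sketch under assumptions: the sample frames are hypothetical, and a temporary directory is used only so the example runs anywhere; the real target folder would replace it.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical sample frames; the desired file names are the dict keys.
frames = {
    "df1": pd.DataFrame({"a": [1, 2]}),
    "df2": pd.DataFrame({"a": [3, 4]}),
}

# A temporary directory is used here only so the sketch runs anywhere;
# replace it with the real output folder.
out_dir = Path(tempfile.mkdtemp())

for name, df in frames.items():
    # index=False leaves the row index out of the file.
    df.to_csv(out_dir / f"{name}.csv", index=False)

written = sorted(p.name for p in out_dir.glob("*.csv"))
print(written)
```

Building the path with pathlib also avoids the doubled backslashes needed in a plain Windows path string.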
