Dataframe is not defined when trying to concatenate in loop (Python - Pandas)

Consider the following list (named columns_list):
['total_cases',
'new_cases',
'total_deaths',
'new_deaths',
'total_cases_per_million',
'new_cases_per_million',
'total_deaths_per_million',
'new_deaths_per_million',
'total_tests',
'new_tests',
'total_tests_per_thousand',
'new_tests_per_thousand',
'new_tests_smoothed',
'new_tests_smoothed_per_thousand',
'tests_units',
'stringency_index',
'population',
'population_density',
'median_age',
'aged_65_older',
'aged_70_older',
'gdp_per_capita',
'extreme_poverty',
'cvd_death_rate',
'diabetes_prevalence',
'female_smokers',
'male_smokers',
'handwashing_facilities',
'hospital_beds_per_thousand',
'life_expectancy']
Those are columns in two dataframes: US (df_us) and Canada (df_canada). I would like to create one dataframe for each item in the list, by concatenating its corresponding column from both df_us and df_canada.
for i in columns_list:
    df_i = pd.concat([df_canada[i], df_us[i]], axis=1)
Yet, when I type
df_new_deaths
I get the following error: NameError: name 'df_new_deaths' is not defined
Why?

You're not actually saving the dataframes: the loop assigns every result to the same variable df_i, so df_new_deaths is never defined. Python does not substitute the string in i into the variable name.
Add the dataframe for each column to a list and access it by index:
df_list = list()
for i in columns_list:
    df_list.append(pd.concat([df_canada[i], df_us[i]], axis=1))
Or, more conveniently, add the dataframes to a dict where the column name is the key:
df_dict = dict()
for i in columns_list:
    df_dict[i] = pd.concat([df_canada[i], df_us[i]], axis=1)
Note that pd.concat along axis=1 returns a DataFrame even when its inputs are Series, so no extra pd.DataFrame(...) wrapper is needed.
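A minimal runnable sketch of the dict approach, using two tiny stand-in dataframes (the data here is made up; only the pattern matters):

```python
import pandas as pd

# Stand-ins for df_canada and df_us sharing a subset of the real columns
df_canada = pd.DataFrame({"new_deaths": [1, 2], "total_cases": [10, 20]})
df_us = pd.DataFrame({"new_deaths": [3, 4], "total_cases": [30, 40]})
columns_list = ["new_deaths", "total_cases"]

# One two-column dataframe per metric, keyed by the metric name
df_dict = {i: pd.concat([df_canada[i], df_us[i]], axis=1) for i in columns_list}

print(df_dict["new_deaths"].shape)  # (2, 2): one column per country
```

Looking up df_dict["new_deaths"] then plays the role the asker intended for df_new_deaths.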

Related

Add selected rows from an existing Pandas DataFrame to a new Pandas DataFrame in for loop in Python

I want to select some rows based on a condition from an existing Pandas DataFrame and then insert them into a new DataFrame.
At first, I tried this way:
second_df = pd.DataFrame()
for specific_idx in specific_idx_set:
    second_df = existing_df.iloc[specific_idx]

len(specific_idx_set), second_df.shape  # => 1000, (15,)
As you can see, I'm iterating over a set of 1000 indexes. However, after adding these 1000 rows to the new DataFrame (second_df), only one of the rows was stored in it, while I expected 1000 rows with 15 columns.
So I tried a new way:
specific_rows = list()
for specific_val in specific_idx_set:
    specific_rows.append(existing_df[existing_df[col] == specific_val])

new_df = pd.DataFrame(specific_rows)
And I got this error:
ValueError: Must pass 2-d input. shape=(1000, 1, 15)
Then I wrote this code:
specific_rows = list()
new_df = pd.DataFrame()
for specific_val in specific_idx_set:
    specific_rows.append(existing_df[existing_df[col] == specific_val])

pd.concat([new_df, specific_rows])
But I got this error:
TypeError: cannot concatenate object of type '<class 'list'>'; only Series and DataFrame objs are valid
You need to modify your last solution: remove the empty DataFrame and pass pd.concat a list of DataFrames only:
specific_rows = list()
for specific_val in specific_idx_set:
    specific_rows.append(existing_df[existing_df[col] == specific_val])

out = pd.concat(specific_rows)
The problem with your solution is that joining a list with a DataFrame raises an error:
pd.concat([new_df, specific_rows])
# specific_rows is a list
# new_df is a DataFrame
If you need to include the DataFrame, join lists instead: concatenate the one-element list [new_df] with the list specific_rows, so the result is a list of DataFrames:
pd.concat([new_df] + specific_rows)
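If specific_idx_set really holds positional row indices, as the first attempt suggests, the loop can be avoided entirely: .iloc accepts a list of positions. A sketch with made-up data:

```python
import pandas as pd

existing_df = pd.DataFrame({"a": range(5), "b": range(5, 10)})
specific_idx_set = {0, 2, 4}

# One .iloc call with a list of positions selects all rows at once
second_df = existing_df.iloc[sorted(specific_idx_set)]

print(second_df.shape)  # (3, 2)
```

This also explains the original bug: second_df = existing_df.iloc[specific_idx] overwrites second_df on every iteration, so only the last row survives.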

Create a new column in multiple dataframes using for loop

I have multiple dataframes with the same structure but different values,
for instance
df0, df1, df2, ..., df9.
To each dataframe I want to add a column named eventdate that consists of one date, for instance 2019-09-15, using a for loop:
for i in range(0, 9);
    df+str(i)['eventdate'] = "2021-09-15"
but I get an error message:
SyntaxError: cannot assign to operator
I think it's because df isn't defined. This should be very simple... Any idea how to do this? Thanks.
dfs = [df0, df1, df2, ..., df9]
dfs_new = []
for df in dfs:
    df['eventdate'] = "2021-09-15"
    dfs_new.append(df)
If you can't build such a list, you could fall back on eval(f"df{i}"), but eval is generally discouraged.
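A way to sidestep both the list and eval is to never create numbered variables at all and keep the dataframes in a dict from the start (a sketch with made-up one-row frames standing in for df0..df9):

```python
import pandas as pd

# Build the frames directly into a dict instead of df0, df1, ... variables
dfs = {f"df{i}": pd.DataFrame({"x": [i]}) for i in range(10)}

# Adding the column mutates each frame in place
for name, df in dfs.items():
    df["eventdate"] = "2021-09-15"

print(dfs["df3"]["eventdate"].iloc[0])  # 2021-09-15
```

Strings like "df3" are then ordinary dict keys, which is exactly the dynamic naming the question was reaching for.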

Create a dictionary from pandas empty dataframe with only column names

I have a pandas dataframe with only two column names (a single row, which can also be considered headers). I want to make a dictionary out of this, with the first column being the key and the second column being the value. I already tried the
to_dict() method, but it's not working as it's an empty dataframe.
Example:
df = |Land|Norway| → {'Land': 'Norway'}
I can change the pandas dataframe to some other type and work around it, but this question is mostly to learn the best/different/efficient approaches to this problem.
For now I have this as the solution:
dict(zip(a.iloc[0:0, 0:1], a.iloc[0:0, 1:2]))
Is there any other way to do this?
Here's a simple way: convert the columns to a list and the list to a dictionary.
def list_to_dict(a):
    it = iter(a)
    ret_dict = dict(zip(it, it))
    return ret_dict

df = pd.DataFrame([], columns=['Land', 'Norway'])
dict_val = list_to_dict(df.columns.to_list())
dict_val  # {'Land': 'Norway'}
A very manual solution:
df = pd.DataFrame(columns=['Land', 'Norway'])
df = pd.DataFrame({df.columns[0]: df.columns[1]}, index=[0])
If you have any number of columns and want each sequential pair transformed this way, try:
df = pd.DataFrame(dict(zip(df.columns[::2], df.columns[1::2])), index=[0])
Note: you will get an error if your DataFrame does not have at least two columns.
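For exactly the two-column case in the question, no helper function or new DataFrame is needed at all; the column pair can be turned into a dict directly (same assumption that the frame has at least two columns):

```python
import pandas as pd

df = pd.DataFrame([], columns=["Land", "Norway"])

# Pair the first column name (key) with the second (value)
dict_val = {df.columns[0]: df.columns[1]}
print(dict_val)  # {'Land': 'Norway'}
```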

Use list elements in both code and variable assignment with for loop

I have an index (list) called newSeries0 and I'd like to do the following:
for seriesName in newSeries0:
    seriesName = fred.get_series_first_release(seriesName)
    seriesName = pd.DataFrame(seriesName)
    seriesName = seriesName.resample('D').fillna('ffill')
    seriesName.rename(columns={'value': str(seriesName)}, inplace=True)

In other words, I'd like to create a dataframe from each name in newSeries0 (using the fred api) which has the (variable) name of that series. Each dataframe is forward-filled, and the column name of the data is changed to the name of the data series.
Is zip or map involved?
In the end I'd like to have
a=dataframe of a
b=dataframe of b
c=dataframe of c
...
where a,b,c... are the names of the data series in my index(list) newSeries0, so when I call a I get the dataframe of a.
Just use a dictionary:
dataframe_dict = {}
for seriesName in newSeries0:
    series = fred.get_series_first_release(seriesName)
    series = series.resample('D').ffill()
    dataframe_dict[seriesName] = series
Note that the fetched data goes into a separate variable (series), so the original name stays available as the dict key; reassigning seriesName itself, as in your loop, loses the name. Then dataframe_dict['a'] gives you the dataframe of a. If you want everything in one dataframe afterwards:
df = pd.DataFrame()
for name in dataframe_dict:
    df[name] = dataframe_dict[name]
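Since fred.get_series_first_release isn't available here, a sketch with made-up daily Series shows the same dict-of-Series pattern; note that pd.DataFrame(dict) builds the combined frame in one step, aligning on the index:

```python
import pandas as pd

idx = pd.date_range("2021-01-01", periods=3, freq="D")
# Stand-ins for the resampled series fetched from the API
dataframe_dict = {
    "a": pd.Series([1.0, 2.0, 3.0], index=idx),
    "b": pd.Series([4.0, 5.0, 6.0], index=idx),
}

# One column per series, keyed by name, aligned on the shared date index
df = pd.DataFrame(dataframe_dict)
print(df.columns.tolist())  # ['a', 'b']
```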

Python dataframe groupby by dictionary list then sum

I have two dataframes. The first, named mergedcsv, looks like:
[screenshot: mergedcsv dataframe]
The second, named idgrp_df, is in a dictionary-like format: for each region ID it holds a list of corresponding string ids.
[screenshot: idgrp_df dataframe - keys with lists]
For each row in mergedcsv (and the corresponding row in idgrp_df) I wish to select the columns of mergedcsv whose labels appear in the list in idgrp_df for that row, sum the values in those columns, and write the result to a new column in mergedcsv. The function should iterate through all rows of mergedcsv (582 rows x 600 columns).
My attempt is:
mergedcsv['TotRegFlows'] = mergedcsv.groupby([idgrp_df], as_index=False).numbers.apply(lambda x: x.iat[0].sum())
It returns: ValueError: Grouper for <class 'pandas.core.frame.DataFrame'> not 1-dimensional.
This relates to the input dataframe for the groupby. How can I access the list for each row as the input for the groupby?
So for example, for the first row in mergedcsv I wish to select the columns labeled F95RR04, F95RR06 and F95RR15 (reading from the list in the first row of idgrp_df), sum the values in those columns for that row, and insert the sum into the TotRegFlows column.
Any ideas as to how I can utilize the list would be very much appreciated.
Edits:
Many thanks IanS. Your solution is useful. After modifying the code based on this advice I realised that (as suggested) the indexes of my two dataframes are out of sync: mergedcsv had None as index and idgrp_df has the 'REG_ID' column as index. I set mergedcsv's index to 'REG_ID' as well. Then I realised that mergedcsv has 582 rows (REG_ID is not unique) while idgrp_df has 220 rows (REG_ID is unique). I therefore think I am missing a groupby on the REG_ID index in mergedcsv.
I have modified the code as follows:
mergedcsv.set_index('REG_ID', inplace=True)
print(mergedcsv.index.name)
print(idgrp_df.index.name)

mergedcsvgroup = mergedcsv.groupby('REG_ID')[mergedcsv.columns].apply(lambda y: y.tolist())
mergedcsvgroup['TotRegFlows'] = mergedcsvgroup.apply(lambda row: row[idgrp_df.loc[row.name]].sum(), axis=1)
I get a KeyError: 'REG_ID'.
Any further recommendations are most welcome. Would it be more efficient to combine the groupby and apply into one line?
I am new to working with pandas and am trying to build experience in Python.
Further amendments:
Without an index on mergedcsv:
mergedcsv['TotRegFlows'] = mergedcsv.apply(lambda row: row[idgrp_df.loc[row.name]].groupby('REG_ID').sum(), axis=1)
this throws a KeyError: ('the label [0] is not in the [index]', u'occurred at index 0')
With an index on mergedcsv:
mergedcsv.set_index('REG_ID', inplace=True)
columnlist = list(mergedcsv.columns.values)
mergedcsv['TotRegFlows'] = mergedcsv.apply(lambda row: row[idgrp_df.loc[row.name]].groupby('REG_ID')[columnlist].transform().sum(), axis=1)
this throws a TypeError: ("unhashable type: 'list'", u'occurred at index 7')
Or, finally, separating out the groupby:
columnlist = list(mergedcsv.columns.values)
mergedcsvgroup = mergedcsv.groupby('REG_ID')
mergedcsv['TotRegFlows'] = mergedcsvgroup.apply(lambda row: row[idgrp_df.loc[row.name]].sum())
this throws a TypeError: unhashable type: 'list'. The axis=1 argument is also not available with groupby apply.
Any ideas how I can use the lists with the apply function? I've explored tuples in the apply code but have not had any success.
Any suggestions much appreciated.
If I understand correctly, I have a simple solution with apply:
Setup
import pandas as pd
df = pd.DataFrame({'A': [1,2,3], 'B': [4,5,6], 'C': [7,8,9]})
lists = pd.Series([['A', 'B'], ['A', 'C'], ['C']])
Solution
I apply a lambda function that gets the list of columns to be summed from the lists series:
df.apply(lambda row: row[lists[row.name]].sum(), axis=1)
The trick is that, when iterating over rows (axis=1), row.name is the original index of the dataframe df. I use that to access the list from the lists series.
Notes
This solution assumes that both dataframes share the same index, which appears not to be the case in the screenshots you included. You have to address that.
Also, if idgrp_df is a dataframe and not a series, then you need to access its values with .loc.
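Under the same shared-index assumption, here is how the .loc variant looks when the lists live in a one-column dataframe rather than a Series (toy data; the column name 'ids' is made up for the sketch):

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6], "C": [7, 8, 9]})
# Lists stored in a dataframe column, sharing df's index
idgrp_df = pd.DataFrame({"ids": [["A", "B"], ["A", "C"], ["C"]]})

# row.name is the index label; .loc[row.name, "ids"] fetches that row's list
df["TotRegFlows"] = df.apply(
    lambda row: row[idgrp_df.loc[row.name, "ids"]].sum(), axis=1
)
print(df["TotRegFlows"].tolist())  # [5, 10, 9]
```

Row 0 sums A+B = 1+4, row 1 sums A+C = 2+8, row 2 sums C = 9.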
