I have multiple dataframes with the same structure but different values
for instance,
df0, df1, df2...., df9
To each dataframe I want to add a column named eventdate containing a single date, for instance 2021-09-15, using a for loop:
for i in range(0, 10):
    df+str(i)['eventdate'] = "2021-09-15"
but I get an error message
SyntaxError: cannot assign to operator
I think it's because df isn't defined. This should be very simple. Any idea how to do this? Thanks.
dfs = [df0, df1, df2, ..., df9]
dfs_new = []
for df in dfs:
    df['eventdate'] = "2021-09-15"
    dfs_new.append(df)
If you can't build a list of the dataframes, you could fall back on eval(f"df{i}"), but from what I've seen that approach isn't recommended.
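A minimal, self-contained sketch of the list approach, with placeholder dataframes standing in for df0 through df9 (which aren't shown in the question):

import pandas as pd

# Placeholder dataframes standing in for df0 ... df9 from the question.
dfs = [pd.DataFrame({"value": [i, i + 1]}) for i in range(10)]

# Assigning the column mutates each dataframe in place, so the original
# objects are updated too; no copies are needed.
for df in dfs:
    df["eventdate"] = "2021-09-15"

print(dfs[0])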
The first df I have is one that has station codes and names, along with lat/long (not as relevant), like so:
code name latitude longitude
I have another df with start/end dates for travel times. This df has only the station code, not the station name, like so:
start_date start_station_code end_date end_station_code duration_sec
I am looking to add columns with the names of the start/end stations to the second df by matching the first df's "code" column against the second df's "start_station_code" / "end_station_code".
I am relatively new to pandas, and was looking for a way to optimize doing this as my current method takes quite a while. I use the following code:
for j in range(0, len(df_stations)):
    for i in range(0, len(df)):
        if df_stations['code'][j] == df['start_station_code'][i]:
            df['start_station'][i] = df_stations['name'][j]
        if df_stations['code'][j] == df['end_station_code'][i]:
            df['end_station'][i] = df_stations['name'][j]
I am looking for a faster method, any help is appreciated. Thank you in advance.
Use merge. If you are familiar with SQL, merge is the pandas equivalent of a JOIN (pass how="left" for LEFT JOIN behaviour; the default is an inner join):
cols = ["code", "name"]
result = (
second_df
.merge(first_df[cols], left_on="start_station_code", right_on="code")
.merge(first_df[cols], left_on="end_station_code", right_on="code")
.rename(columns={"code_x": "start_station_code", "code_y": "end_station_code"})
)
The answer by @Code-Different is very nearly correct. However, the columns to be renamed are the name columns, not the code columns. For neatness you will likely want to drop the additional code columns that get created by the merges. Using your names for the dataframes, df and df_stations, the code needed to produce required_df is:
cols = ["code", "name"]
required_df = (
df
.merge(df_stations[cols], left_on="start_station_code", right_on="code")
.merge(df_stations[cols], left_on="end_station_code", right_on="code")
.rename(columns={"name_x": "start_station", "name_y": "end_station"})
.drop(columns = ['code_x', 'code_y'])
)
As you may notice, the merges leave the dataframe with duplicate 'code' columns, which get suffixed automatically; this is the built-in default behaviour of merge. See https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html for more detail.
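Here is a self-contained sketch with made-up station data (the values are placeholders, not from the question); passing how="left" also keeps rides whose codes have no match in df_stations:

import pandas as pd

# Hypothetical sample data standing in for the question's dataframes.
df_stations = pd.DataFrame({"code": [1, 2], "name": ["Central", "Harbour"]})
df = pd.DataFrame({
    "start_station_code": [1, 2, 3],   # code 3 has no matching station
    "end_station_code": [2, 1, 1],
    "duration_sec": [300, 450, 600],
})

cols = ["code", "name"]
required_df = (
    df
    .merge(df_stations[cols], left_on="start_station_code", right_on="code", how="left")
    .merge(df_stations[cols], left_on="end_station_code", right_on="code", how="left")
    .rename(columns={"name_x": "start_station", "name_y": "end_station"})
    .drop(columns=["code_x", "code_y"])
)
print(required_df)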
I have two dfs, and want to manipulate them in some way with a for loop.
I have found that creating a new column within the loop updates the df, but other commands like set_index or dropping columns do not stick.
import pandas as pd
import numpy as np
gen1 = pd.DataFrame(np.random.rand(12,3))
gen2 = pd.DataFrame(np.random.rand(12,3))
df1 = pd.DataFrame(gen1)
df2 = pd.DataFrame(gen2)
all_df = [df1, df2]
for x in all_df:
    x['test'] = x[1] + 1
    x = x.set_index(0).drop(2, axis=1)
    print(x)
Note that when each df is printed inside the loop, all the commands appear to have worked on both dfs. But when I call either df afterwards, only the new column 'test' is there; the set_index and the dropped column are undone.
Am I missing something as to why only one of the commands has been made permanent? Thank you.
Here's what's going on:
x is a variable that, at the start of each iteration of your for loop, refers to an element of the list all_df. When you assign to x['test'], you are using x to update that element, so it does what you want.
However, when you assign something new to x, you are simply causing x to refer to that new thing without touching the contents of what x previously referred to (namely, the element of all_df that you are hoping to change).
You could try something like this instead:
for x in all_df:
    x['test'] = x[1] + 1
    x.set_index(0, inplace=True)
    x.drop(2, axis=1, inplace=True)
print(df1)
print(df2)
Please note that using inplace is often discouraged (see here for example), so you may want to consider whether there's a way to achieve your objective using new DataFrame objects created based on df1 and df2.
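For instance, here is a minimal sketch of that inplace-free alternative, starting from the freshly created all_df list (before any in-place changes): build new DataFrame objects and rebind the names afterwards.

# Build new objects instead of mutating in place, then rebind the names.
all_df = [
    x.assign(test=x[1] + 1).set_index(0).drop(2, axis=1)
    for x in all_df
]
df1, df2 = all_df
print(df1)
print(df2)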
Consider the following list (named columns_list):
['total_cases',
'new_cases',
'total_deaths',
'new_deaths',
'total_cases_per_million',
'new_cases_per_million',
'total_deaths_per_million',
'new_deaths_per_million',
'total_tests',
'new_tests',
'total_tests_per_thousand',
'new_tests_per_thousand',
'new_tests_smoothed',
'new_tests_smoothed_per_thousand',
'tests_units',
'stringency_index',
'population',
'population_density',
'median_age',
'aged_65_older',
'aged_70_older',
'gdp_per_capita',
'extreme_poverty',
'cvd_death_rate',
'diabetes_prevalence',
'female_smokers',
'male_smokers',
'handwashing_facilities',
'hospital_beds_per_thousand',
'life_expectancy']
Those are columns in two dataframes: US (df_us) and Canada (df_canada). I would like to create one dataframe for each item in the list, by concatenating its corresponding column from both df_us and df_canada.
for i in columns_list:
    df_i = pd.concat([df_canada[i], df_us[i]], axis=1)
Yet, when I type
df_new_deaths
I get the following error: NameError: name 'df_new_deaths' is not defined
Why?
You're not actually saving the dataframes: the loop keeps overwriting the single variable df_i, so df_new_deaths is never defined.
Add the dataframe built for each column to a list and access it by index. Note that df_canada[i] and df_us[i] are each a single column (a pandas Series); concatenating them along axis=1 already gives you a DataFrame, so no extra pd.DataFrame wrapper is needed.
df_list = []
for i in columns_list:
    df_list.append(pd.concat([df_canada[i], df_us[i]], axis=1))
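For example (a hypothetical lookup, assuming the loop above has run and df_canada/df_us are loaded), the frame for new_deaths is then found by its position in columns_list:

# Look the combined frame up by the column's position in columns_list.
df_new_deaths = df_list[columns_list.index('new_deaths')]
print(df_new_deaths.head())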
Alternatively, add the dataframes to a dict, where the column name is also the key:
df_dict = {}
for i in columns_list:
    df_dict[i] = pd.concat([df_canada[i], df_us[i]], axis=1)
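With the dict approach the lookup reads more naturally (again assuming df_canada and df_us are loaded):

# Retrieve the combined frame for one column by its name.
df_new_deaths = df_dict['new_deaths']
print(df_new_deaths.head())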
I'm new to python and would appreciate your help here.
I imported 4 datasets with the same headers into python. Now I want to create 4 dataframes that contain only selected columns from the 4 datasets. I know how to do it the ugly way, but what's the most efficient way to perform this task?
I tried a for loop but couldn't make it work :D
Datasets imported as df1,df2,df3,df4
dataset_list = (df1, df2, df3, df4)
new_dataframes = (df_1, df_2, df_3, df_4)
for i in dataset_list:
    for e in new_dataframes:
        e = i.loc[0:, ['column1', 'column2', 'column3', 'column4']]
You could use a dictionary comprehension:
cols = ['column1','column2','column3','column4']
dfs = {k: df[cols] for k, df in enumerate([df1, df2, df3, df4], 1)}
The benefit of this method is it caters for an arbitrary number of items without having to manually increment variable names.
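As a hypothetical usage example (assuming df1 through df4 are already loaded and have those columns), the trimmed copy of df2 is then retrieved by key:

# dfs maps 1 -> trimmed df1, 2 -> trimmed df2, and so on.
print(dfs[2].head())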
How about this approach:
dataset_list = (df1, df2, df3, df4)
def slice(df):
    return df.loc[:, ['column1', 'column2', 'column3', 'column4']]
df_1, df_2, df_3, df_4 = map(slice, dataset_list)
I have 2 dataframes. df1 is built from several Series of cumulative values.
df1 = pd.DataFrame({'winnings': cumsums_winnings_s, 'returns':cumsums_returns_s, 'spent': cumsums_spent_s, 'runs': cumsums_runs_s, 'wins': cumsums_wins_s, 'expected': cumsums_expected_s}, columns=["winnings", "returns", "runs", "wins", "expected"])
df2 is built by running each row of df1 through a function that takes 3 columns and produces a result for each row, specialSauce:
df2= pd.DataFrame(list(map(lambda w,r,e: doStuff(w,r,e), df1['wins'], df1['runs'], df1['expected'])), columns=["specialSauce"])
print(df2.append(df1))
produces a frame with all the columns, but the rows that came from df1 have NaN in specialSauce and the rows that came from df2 have NaN in df1's columns (and vice versa if df1/df2 are switched in the append).
So the problem I have is how to append these 2 dataframes correctly.
As I understand things, your issue seems to be related to the fact that you get NaN's in the result DataFrame.
The reason for this is that you are trying to .append() one dataframe to the other while they don't have the same columns.
df2 has one extra column, the one created with doStuff, while df1 does not have that column. When trying to append one pd.DataFrame to the other, the result will have all the columns of both pd.DataFrame objects. Naturally, you will have some NaN's in ['specialSauce'] since this column does not exist in df1.
This would be the same if you were to use pd.concat(); both methods do the same thing in this case. One thing you could do to bring the result closer to what you want is use the ignore_index flag like this:
>>> df2.append(df1, ignore_index=True)
This would at least give you a 'fresh' index for the result pd.DataFrame.
EDIT
If what you're looking for is to "append" the result of doStuff to the end of your existing df, in the form of a new column (['specialSauce']), then what you'll have to do is use pd.concat() like this:
>>> pd.concat([df1, df2], axis=1)
This will return the result pd.DataFrame as you want it.
If you had a pd.Series to add to the columns of df1 then you'd need to add it like this:
>>> df1['specialSauce'] = <'specialSauce values'>
I hope that helps, if not please rephrase the description of what you're after.
Ok, there are a couple of things going on here. You've left code out and I had to fill in the gaps. For example you did not define doStuff, so I had to.
doStuff = lambda w, r, e: w + r + e
With that defined, your code does not run. I had to guess what you were trying to do. I'm guessing that you want to have an additional column called 'specialSauce' adjacent to your other columns.
So, this is how I set it up and solved the problem.
Setup and Solution
import pandas as pd
import numpy as np
np.random.seed(314)
df = pd.DataFrame(np.random.randn(100, 6),
                  columns=["winnings", "returns",
                           "spent", "runs",
                           "wins", "expected"]).cumsum()
doStuff = lambda w, r, e: w + r + e
df['specialSauce'] = df[['wins', 'runs', 'expected']].apply(lambda x: doStuff(*x), axis=1)
print(df.head())
winnings returns spent runs wins expected specialSauce
0 0.166085 0.781964 0.852285 -0.707071 -0.931657 0.886661 -0.752067
1 -0.055704 1.163688 0.079710 0.155916 -1.212917 -0.045265 -1.102266
2 -0.554241 1.928014 0.271214 -0.462848 0.452802 1.692924 1.682878
3 0.627985 3.047389 -1.594841 -1.099262 -0.308115 4.356977 2.949601
4 0.796156 3.228755 -0.273482 -0.661442 -0.111355 2.827409 2.054611
Also
You tried to use pd.DataFrame.append(). Per the documentation, it attaches the DataFrame passed as the argument below the rows of the DataFrame being appended to. What you wanted here is pd.concat() with axis=1. (Note that DataFrame.append has since been deprecated and removed from pandas; pd.concat is the general replacement.)
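A minimal sketch of that column-wise concat, reusing the df1/df2 shapes from the question (doStuff and the values are stand-ins, not the asker's real data):

import pandas as pd

# Stand-in data and function, mirroring the shapes in the question.
doStuff = lambda w, r, e: w + r + e
df1 = pd.DataFrame({"wins": [1, 2], "runs": [3, 4], "expected": [5, 6]})
df2 = pd.DataFrame(
    list(map(doStuff, df1["wins"], df1["runs"], df1["expected"])),
    columns=["specialSauce"],
)

# Column-wise concat lines the rows up by index instead of stacking them.
result = pd.concat([df1, df2], axis=1)
print(result)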