Adding a calculated column to pandas dataframe - python

I am completely new to Python, pandas and programming in general, and I cannot figure out the following:
I have accessed a database with the help of pandas and I have put the data from the query into a dataframe, df. One of the column contains birthdays, which can have the following forms:
- 01/25/1980 (string)
- 01/25 (string)
- None (NoneType)
Now, I would like to add a new column to df, which stores the ages of the people in the database. So I have done the following:
def addAge(df):
today = date.today()
df["age"] = None
for index, row in df.iterrows():
if row["birthday"] != None:
if len(row["birthday"]) == 10:
birthday = df["birthday"]
birthdayDate = datetime.date(int(birthday[6:]), int(birthday[:2]), int(birthday[3:5]))
row["age"] = today.year - birthdayDate.year - ((today.month, today.day) < (birthdayDate.month, birthdayDate.day))
print row["birthday"], row["age"] #this is just for testing
addAge(df)
print df
The line print row["birthday"], row["age"] correctly prints the birthdays and the ages. But when I call print df, the column age always contains "None". Could you guys explain to me what I have been doing wrong? Thanks!

When you call iterrows() you are getting copies of each row and cannot assign back to the larger dataframe. In general, you should be trying to using vectorized methods, rather than iterating over the rows.
So for example in this case, to parse the 'birthday' column, you could do something like this: For the rows that have a length of 10, the string will parsed into a datetime, otherwise it will be filled with a missing value.
import numpy as np
import pandas as pd
df['birthday'] = np.where(df['birthday'].str.len() == 10, pd.to_datetime(df['birthday']), '')
To calculate the ages, you can use .apply, which applies a function over each row of a series.
So if you wrapped your age calculation in a function:
def calculate_age(birthdayDate, today):
if pd.isnull(birthdayDate):
return np.nan
else:
return today.year - birthdayDate.year -
((today.month, today.day) < (birthdayDate.month, birthdayDate.day))
Then, you could calculate the age column like this:
today = date.today()
df['age'] = df['birthday'].apply(lambda x: calculate_age(x, today))

Related

extracting values matching timestamps by a new set of timestamps

sample table here
i am trying to look up corresponding commodity prices from columns(CU00.SHF,AU00.SHF,SC00.SHF,I8888.DCE C00.DCE), with a new set of timestamps, the dates of which are 32 days later than the dates in column 'history_date'.
i tried .loc and .at in a loop to extract the matching values with below functions:
latest_day = data.iloc[data.shape[0] - 1, 0].date()
def next_trade_day(x):
x = pd.to_datetime(x).date() #imported is_workday funtion requires datetime type
while True:
if is_workday(x + timedelta(32)) != False:
break
return (pd.Timestamp((x + timedelta(32))))
if is_workday(x + timedelta(32)) == False:
x = x + timedelta(1)
return pd.Timestamp(x + timedelta(32))
def end_price(x):
x = pd.Timestamp(x)
if x <= latest_day:
return data.at[x,'CU00.SHF']
if x > latest_day:
return'None'
return data.at[x,'CU00.SHF']
but it always gives
KeyError: Timestamp('2023-02-03 00:00:00')
any idea how should i achieve the target?
thanks in advance!
if you want work datetime:
convert column datetime
check date converted, use filte
pd.to_datetime(df['your column'],errors='ignore')
df.loc[df.['your column'] > 'your-date' ]
if work both, then check your full code.

Adding a column to pandas dataframe conditionally

I am working on a personal project collecting the data on Covid-19 cases. The data set only shows the total number of Covid-19 cases per state cumulatively. I would like to add a column that contains the new cases added that day. This is what I have so far:
import pandas as pd
from datetime import date
from datetime import timedelta
import numpy as np
#read the CSV from github
hist_US_State = pd.read_csv("https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-states.csv")
#some code to get yesterday's date and the day before which is needed later.
today = date.today()
yesterday = today - timedelta(days = 1)
yesterday = str(yesterday)
day_before_yesterday = today - timedelta(days = 2)
day_before_yesterday = str(day_before_yesterday)
#Extracting yesterday's and the day before cases and combine them in one dataframe
yesterday_cases = hist_US_State[hist_US_State["date"] == yesterday]
day_before_yesterday_cases = hist_US_State[hist_US_State["date"] == day_before_yesterday]
total_cases = pd.DataFrame()
total_cases = day_before_yesterday_cases.append(yesterday_cases)
#Adding a new column called "new_cases" and this is where I get into trouble.
total_cases["new_cases"] = yesterday_cases["cases"] - day_before_yesterday_cases["cases"]
Can you please point out what I am doing wrong?
Because you defined total_cases as a concatenation (via append) of yesterday_cases and day_before_yesterday_cases, its number of rows is equal to the sum of the other two dataframes. It looks like yesterday_cases and day_before_yesterday_cases both have 55 rows, and so total_cases has 110 rows. Thus your last line is trying to assign 55 values to a series of 110 values.
You may either want to reshape your data so that each date is its own column, or work in arrays of dataframes.

How do I make this function iterable (getting indexerror)

I am fairly new to python and coding in general.
I have a big data file that provides daily data for the period 2011-2018 for a number of stock tickers (300~).
The data is a .csv file with circa 150k rows and looks as follows (short example):
Date,Symbol,ShortExemptVolume,ShortVolume,TotalVolume
20110103,AAWW,0.0,28369,78113.0
20110103,AMD,0.0,3183556,8095093.0
20110103,AMRS,0.0,14196,18811.0
20110103,ARAY,0.0,31685,77976.0
20110103,ARCC,0.0,177208,423768.0
20110103,ASCMA,0.0,3930,26527.0
20110103,ATI,0.0,193772,301287.0
20110103,ATSG,0.0,23659,72965.0
20110103,AVID,0.0,7211,18896.0
20110103,BMRN,0.0,21740,213974.0
20110103,CAMP,0.0,2000,11401.0
20110103,CIEN,0.0,625165,1309490.0
20110103,COWN,0.0,3195,24293.0
20110103,CSV,0.0,6133,25394.0
I have a function that allows me to filter for a specific symbol and get 10 observations before and after a specified date (could be any date between 2011 and 2018).
import pandas as pd
from datetime import datetime
import urllib
import datetime
def get_data(issue_date, stock_ticker):
df = pd.read_csv (r'D:\Project\Data\Short_Interest\exampledata.csv')
df['Date'] = pd.to_datetime(df['Date'], format="%Y%m%d")
d = df
df = pd.DataFrame(d)
short = df.loc[df.Symbol.eq(stock_ticker)]
# get the index of the row of interest
ix = short[short.Date.eq(issue_date)].index[0]
# get the item row for that row's index
iloc_ix = short.index.get_loc(ix)
# get the +/-1 iloc rows (+2 because that is how slices work), basically +1 and -1 trading days
short_data = short.iloc[iloc_ix-10: iloc_ix+11]
return [short_data]
I want to create a script that iterates a list of 'issue_dates' and 'stock_tickers'. The list (a .csv) looks as following:
ARAY,07/08/2017
ARAY,24/04/2014
ACETQ,16/11/2015
ACETQ,16/11/2015
NVLNA,15/08/2014
ATSG,29/09/2017
ATI,24/05/2016
MDRX,18/06/2013
MDRX,18/06/2013
AMAGX,10/05/2017
AMAGX,14/02/2014
AMD,14/09/2016
To break down my problem and question I would like to know how to do the following:
First, how do I load the inputs?
Second, how do I call the function on each input?
And last, how do I accumulate all the function returns in one dataframe?
To load the inputs and call the function for each row; iterate over the csv file and pass each row's values to the function and accumulate the resulting Seriesin a list.
I modified your function a bit: removed the DataFrame creation so it is only done once and added a try/except block to account for missing dates or tickers (your example data didn't match up too well). The dates in the second csv look like they are day/month/year so I converted them for that format.
import pandas as pd
import datetime, csv
def get_data(df, issue_date, stock_ticker):
'''Return a Series for the ticker centered on the issue date.
'''
short = df.loc[df.Symbol.eq(stock_ticker)]
# get the index of the row of interest
try:
ix = short[short.Date.eq(issue_date)].index[0]
# get the item row for that row's index
iloc_ix = short.index.get_loc(ix)
# get the +/-1 iloc rows (+2 because that is how slices work), basically +1 and -1 trading days
short_data = short.iloc[iloc_ix-10: iloc_ix+11]
except IndexError:
msg = f'no data for {stock_ticker} on {issue_date}'
#log.info(msg)
print(msg)
short_data = None
return short_data
df = pd.read_csv (datafile)
df['Date'] = pd.to_datetime(df['Date'], format="%Y%m%d")
results = []
with open('issues.csv') as issues:
for ticker,date in csv.reader(issues):
day,month,year = map(int,date.split('/'))
# dt = datetime.datetime.strptime(date, r'%d/%m/%Y')
date = datetime.date(year,month,day)
s = get_data(df,date,ticker)
results.append(s)
# print(s)
Creating a single DataFrame or table for all that info may be problematic especially since the date ranges are all different. Probably should ask a separate question regarding that. Its mcve should probably just include a few minimal Pandas Series with a couple of different date ranges and tickers.

How to add a new column to a pandas dataframe while calculate the elements?

I want to add a new column to a pandas df which will be calculated based on another column.
Here's a shourtcut of the df:
If the date is between start_date1 and end_date1 it should output in the period column "0". If the date is between start_date2 and end_date2 output a "1" and so on.
Is there any way to do this without a loop?
Thanks for your help :)
Larry
First of all you need to check if your column date has a datetime format.
You can check this with df.dtypes. If it does not have the format date (datetime64), you have to convert it to datetime with:
df['date'] = pd.to_datetime(df.date, format='%Y%m%d', errors='ignore')
Please note that the argument errors='ignore' has its risks, so its optional.
Now to make your calculated field, you can do this:
# define function to calculate periods based on date
def calculate_period(row):
if row['date'] > start_date1 & row['date'] < end_date1:
return "0"
elif row['date'] > start_date2 & row['date'] < end_date2:
return "1"
elif row['date'] > start_date3 & row['date'] < end_date3:
return "2"
else:
return "unknown"
# apply function to create the new column
df['period'] = df.apply(calculate_period, axis=1)
If you need more period values, you can extend the elif statements as you like.
Since there is lack of information about your data. I assumed that start_date1 and end_date1 are variables you defined.
If these are columns as well. The function would look like this:
# define function to calculate periods based on date
def calculate_period(row):
if row['date'] > row['start_date1'] & row['date'] < row['end_date1']:
return "0"
elif row['date'] > row['start_date2'] & row['date'] < row['end_date2']:
return "1"
elif row['date'] > row['start_date3'] & row['date'] < row['end_date3']:
return "2"
else:
return "unknown"
# apply function to create the new column
df['period'] = df.apply(calculate_period, axis=1)
Good luck.

Apply a function in a dataframe's columns [Python]

I just wrote this function to calculated the age's person based in two columns in a Python DataFrame. Unfortunately, if a use the return the function return the same value for all rows, but if I use the print statement the function gives me the right values.
Here is the code:
def calc_age(dataset):
index = dataset.index
for element in index:
year_nasc = train['DT_NASCIMENTO_BENEFICIARIO'][element][6:]
year_insc = train['ANO_CONCESSAO_BOLSA'][element]
age = int(year_insc) - int(year_nasc)
print ('Age: ', age)
#return age
train['DT_NASCIMENTO_BENEFICIARIO'] = 03-02-1987
train['ANO_CONCESSAO_BOLSA'] = 2009
What am I doing wrong?!
If what you want is to subtract the year of DT_NASCIMENTO_BENEFICIARIO from ANO_CONCESSAO_BOLSA, and df is your DataFrame:
# cast to datetime
df["DT_NASCIMENTO_BENEFICIARIO"] = pd.to_datetime(df["DT_NASCIMENTO_BENEFICIARIO"])
df["age"] = df["ANO_CONCESSAO_BOLSA"] - df["DT_NASCIMENTO_BENEFICIARIO"].dt.year
# print the result, or do something else with it:
print(df["age"])

Categories

Resources