Any difference between iterating over values in columns of a dataframe and assigning a variable to data in a column? - python

I ran the following code, but Spyder returned "float division by zero":
import pandas as pd

file = pd.read_csv(r"data_ET.csv")

def normalise(df, columnName):
    for value in df[columnName]:
        df[columnName] = (value - df[columnName].min())/(df[columnName].max()-df[columnName].min())  # this line raises the error
    return df[columnName]
#b)
normalised_RTfirstpass = normalise(file, 'RTfirstpass')
normalised_logfreq = normalise(file, 'log_freq')
file['normalised RTfirstpass'] = normalised_RTfirstpass
file['normalised logfreq'] = normalised_logfreq
print(file)
When I changed it to the following, it worked (the change here is assigning the column values to a variable):
import pandas as pd

file = pd.read_csv(r"data_ET.csv")

def normalise(df, columnName):
    value = df[columnName]
    df[columnName] = (value - df[columnName].min())/(df[columnName].max()-df[columnName].min())
    return df[columnName]
#b)
normalised_RTfirstpass = normalise(file, 'RTfirstpass')
normalised_logfreq = normalise(file, 'log_freq')
file['normalised RTfirstpass'] = normalised_RTfirstpass
file['normalised logfreq'] = normalised_logfreq
print(file)
Can anybody explain why the latter works but the former does not?

df[columnName] returns a pd.Series object. In your first version, value is a single scalar, so the assignment inside the loop overwrites the entire column with one broadcast scalar expression. After the first iteration the column is constant, df[columnName].max() - df[columnName].min() becomes 0, and the next iteration fails with "float division by zero".
In the latter case, value = df[columnName] is the whole Series, and pandas broadcasts the scalar df[columnName].min() (an int/float value) across it, performing the arithmetic element-wise on the whole column at once. That is why you do not need to iterate over every value in the column.
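A minimal vectorised version of the normalisation, assuming a numeric column (the sample values here are made up for illustration):

```python
import pandas as pd

def normalise(df, column_name):
    # Broadcast the scalar min/max across the whole Series: no Python loop needed
    col = df[column_name]
    return (col - col.min()) / (col.max() - col.min())

df = pd.DataFrame({"RTfirstpass": [200.0, 350.0, 500.0]})
df["normalised RTfirstpass"] = normalise(df, "RTfirstpass")
print(df["normalised RTfirstpass"].tolist())  # [0.0, 0.5, 1.0]
```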


Function to add a column based on the input from a specific column

I have the following dataframe:
import pandas as pd
import numpy as np
import yfinance as yf  # needed for pdr_override() below
from pandas_datareader import data as pdr
from datetime import date, timedelta

yf.pdr_override()
end = date.today()
start = end - timedelta(days=7300)
# download dataframe
data = pdr.get_data_yahoo('^GSPC', start=start, end=end)
Now, that I have the dataframe, I want to create a function to add the logarithmic return based on a column to the dataframe called 'data', with the following code:
data['log_return'] = np.log(data['Adj Close'] / data['Adj Close'].shift(1))
How I think the function should look like is like this:
def add_log_return(df):
    # add returns in a logarithmic fashion
    added = df.copy()
    added["log_return"] = np.log(df[column] / df[column].shift(1))
    added["log_return"] = added["log_return"].apply(lambda x: x*100)
    return added
How can I select a specific column as an input of the function add_log_return(df['Adj Close']), so the function adds the logarithmic return to my 'data' dataframe?
data = add_log_return(df['Adj Close'])
Just add an argument column to your function!
def add_log_return(df, column):
    # add returns in a logarithmic fashion
    added = df.copy()
    added["log_return"] = np.log(df[column] / df[column].shift(1)) * 100
    return added

new_df = add_log_return(old_df, 'Adj_Close')
Note that I removed the line in your function that applied a lambda just to multiply by 100. It's much faster to do this in a vectorized manner by including it in the np.log(...) line.
However, if I were you, I'd just return the Series object instead of copying the dataframe and modifying and returning the copy.
def log_return(col: pd.Series) -> pd.Series:
    return np.log(col / col.shift(1)) * 100
Now, the caller can do what they want with it:
df['log_ret'] = log_return(df['Adj_Close'])
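A quick sanity check of the Series-returning version on synthetic prices (the numbers are hypothetical; the first return is NaN because there is no previous price):

```python
import numpy as np
import pandas as pd

def log_return(col: pd.Series) -> pd.Series:
    # Element-wise log of the price ratio, scaled to percent
    return np.log(col / col.shift(1)) * 100

prices = pd.Series([100.0, 105.0, 102.0])  # hypothetical closing prices
rets = log_return(prices)
# rets.iloc[0] is NaN; rets.iloc[1] is log(105/100) * 100
```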

pandas: while loop to simultaneously advance through multiple lists and call functions

I want my code to:
read data from a CSV and make a dataframe: "source_df"
see if the dataframe contains any columns specified in a list: "possible_columns"
call a unique function to replace the values in each column whose header is found in the "possible_columns" list, then insert the modified values in a new dataframe: "destination_df"
Here it is:
import pandas as pd

# creates source_df
file = "yes-no-true-false.csv"
data = pd.read_csv(file)
source_df = pd.DataFrame(data)

# creates destination_df
blanklist = []
destination_df = pd.DataFrame(blanklist)

# create the column header lists for comparison in the while loop
columns = source_df.head(0)
possible_columns = ['yes/no', 'true/false']

# establish the functions list and define the functions to replace column values
fix_functions_list = ['yes_no_fix()', 'true_false_fix()']

def yes_no_fix():
    destination_df['yes/no'] = destination_df['yes/no fixed'].replace("No", "0").replace("Yes", "1")

def true_false_fix():
    destination_df['true/false'] = destination_df['true/false fixed'].replace('False', '1').replace('True', '0')

'''use the counter to call a unique function from the functions list to replace the values in
each column whose header is found in the "possible_columns" list, insert the modified values
in "destination_df", then advance the counter'''
counter = 0
while counter < len(possible_columns):
    if possible_columns[counter] in columns:
        destination_df.insert(counter, possible_columns[counter], source_df[possible_columns[counter]])
        fix_functions_list[counter]
    counter = counter + 1

# see if it works
print(destination_df.head(10))
When I print(destination_df), I see the unmodified column values from source_df. When I call the functions independently they work, which makes me think something is going wrong in my while loop.
Your issue is that you are trying to call a function that is stored in a list as a string:
fix_functions_list[counter]
This does not actually run the function; it only accesses the string value. Store references to the function objects themselves instead:
def yes_no_fix():
    destination_df['yes/no'] = destination_df['yes/no fixed'].replace("No", "0").replace("Yes", "1")

def true_false_fix():
    destination_df['true/false'] = destination_df['true/false fixed'].replace('False', '1').replace('True', '0')

fix_functions_list = {0: yes_no_fix, 1: true_false_fix}
and change the call site so the trailing parentheses actually invoke the function:
fix_functions_list[counter]()
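Put together, a runnable sketch of the function-dispatch fix (using an in-memory frame instead of the CSV; the column names here are hypothetical stand-ins):

```python
import pandas as pd

destination_df = pd.DataFrame({'yes/no': ['Yes', 'No'],
                               'true/false': ['True', 'False']})

def yes_no_fix():
    destination_df['yes/no'] = destination_df['yes/no'].replace('No', '0').replace('Yes', '1')

def true_false_fix():
    destination_df['true/false'] = destination_df['true/false'].replace('False', '1').replace('True', '0')

# Store the function objects themselves, not strings, so they can be called
fix_functions = {0: yes_no_fix, 1: true_false_fix}
for counter in range(len(fix_functions)):
    fix_functions[counter]()  # the trailing () actually invokes the function
```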
# creates source_df
file = "yes-no-true-false.csv"
data = pd.read_csv(file)
source_df = pd.DataFrame(data)

possible_columns = ['yes/no', 'true/false']
mapping_dict = {'yes/no': {"No": "0", "Yes": "1"},
                'true/false': {'False': '1', 'True': '0'}}

old_columns = [column for column in source_df.columns if column not in possible_columns]
existing_columns = [column for column in source_df.columns if column in possible_columns]

new_df = source_df[existing_columns].copy()
for column in new_df.columns:
    new_df[column] = new_df[column].map(mapping_dict[column])
new_df[old_columns] = source_df[old_columns]

the cell value is not returned in pandas (where function)

I used the following code to:
map the values from another data frame (the map function is used)
finalise the values in the column (where() is used)
Requirement: the VT_Final column should take one of these values (V_Team1, V_Team2, or Non-PA).
Issue: VT_Final returns empty cells (blanks).
Please advise.
Code:
Bookings['V_Team1'] = Bookings.Marker1.map(Manpower_1.set_index('Marker1')['Vertical Team'].to_dict())
Bookings['V_Team2'] = Bookings.Marker1.map(Attrition_1.set_index('Marker1')['Vertical Team'].to_dict())
Bookings['VT_Final'] = Bookings['V_Team1']
Bookings['VT_Final'].where(Bookings['V_Team1'] !='')
Bookings['VT_Final'] = Bookings['V_Team2']
Bookings['VT_Final'].where(Bookings['V_Team1'] =='')
Bookings['VT_Final'] = 'Non PA'
Bookings['VT_Final'].where((Bookings['V_Team1'] =='')&(Bookings['V_Team2']==''))
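One likely reason for the blanks: Series.where returns a new Series rather than modifying in place, so the where(...) results above are discarded, and each plain Bookings['VT_Final'] = ... assignment overwrites the previous one. A common pattern for a three-way choice like this is np.select; a sketch with hypothetical data standing in for the mapped V_Team1 / V_Team2 columns:

```python
import numpy as np
import pandas as pd

# Hypothetical data standing in for the mapped team columns
Bookings = pd.DataFrame({'V_Team1': ['A', '', ''],
                         'V_Team2': ['', 'B', '']})

# The first matching condition wins; the default covers the "neither team" case
conditions = [Bookings['V_Team1'] != '', Bookings['V_Team2'] != '']
choices = [Bookings['V_Team1'], Bookings['V_Team2']]
Bookings['VT_Final'] = np.select(conditions, choices, default='Non PA')
```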

Heatmap plot of a pandas dataframe - TypeError

I have two pandas dataframes that on inspection look identical. One was created using the Pandas builtin:
df.corr(method='pearson')
While the other was created with a custom function:
def cor_matrix(dataframe, method):
    coeffmat = pd.DataFrame(index=dataframe.columns, columns=dataframe.columns)
    pvalmat = pd.DataFrame(index=dataframe.columns, columns=dataframe.columns)
    for i in range(dataframe.shape[1]):
        for j in range(dataframe.shape[1]):
            x = np.array(dataframe[dataframe.columns[i]])
            y = np.array(dataframe[dataframe.columns[j]])
            bad = ~np.logical_or(np.isnan(x), np.isnan(y))
            if method == 'spearman':
                corrtest = spearmanr(np.compress(bad, x), np.compress(bad, y))
            if method == 'pearson':
                corrtest = pearsonr(np.compress(bad, x), np.compress(bad, y))
            coeffmat.iloc[i, j] = corrtest[0]
            pvalmat.iloc[i, j] = corrtest[1]
    return (coeffmat, pvalmat)
Both look identical, have the same type (pandas.core.frame.DataFrame), and their entries are also of the same type (numpy.float64).
However when I try to plot these using:
import matplotlib.pyplot as plt
plt.imshow((df))
Only the dataframe created with the pandas builtin function works. For the other dataframe I receive the error: TypeError: Image data cannot be converted to float. Can anyone explain what is going on, how the two dataframes differ, and what can be done to address the error?
Edit - It looks as though there is one difference, when I convert the dataframes to a numpy array, the one that doesn't work has dtype = object at the end. Is there a way to remove this?
Amending the function to specify the dataframe as float fixed the issue:
def cor_matrix(dataframe, method):
    coeffmat = pd.DataFrame(index=dataframe.columns, columns=dataframe.columns)
    pvalmat = pd.DataFrame(index=dataframe.columns, columns=dataframe.columns)
    for i in range(dataframe.shape[1]):
        for j in range(dataframe.shape[1]):
            x = np.array(dataframe[dataframe.columns[i]])
            y = np.array(dataframe[dataframe.columns[j]])
            bad = ~np.logical_or(np.isnan(x), np.isnan(y))
            if method == 'spearman':
                corrtest = spearmanr(np.compress(bad, x), np.compress(bad, y))
            if method == 'pearson':
                corrtest = pearsonr(np.compress(bad, x), np.compress(bad, y))
            coeffmat.iloc[i, j] = corrtest[0]
            pvalmat.iloc[i, j] = corrtest[1]
    # This is to convert to float type, otherwise it can cause problems when e.g. plotting
    coeffmat = coeffmat.apply(pd.to_numeric, errors='ignore')
    pvalmat = pvalmat.apply(pd.to_numeric, errors='ignore')
    return (coeffmat, pvalmat)
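The root cause can be reproduced in a few lines: a DataFrame created with only an index and columns defaults to dtype object, and filling it cell-by-cell keeps that dtype even when every value is a float, which is what imshow cannot convert. Converting the columns to numeric fixes it (a minimal sketch):

```python
import numpy as np
import pandas as pd

# A DataFrame created with no data defaults to dtype object
m = pd.DataFrame(index=['a', 'b'], columns=['a', 'b'])
m.iloc[0, 0] = 1.0  # cell-by-cell assignment keeps the object dtype
m.iloc[0, 1] = 0.5
m.iloc[1, 0] = 0.5
m.iloc[1, 1] = 1.0

m_float = m.apply(pd.to_numeric)  # convert each column to a numeric dtype
```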

what is the source of this error: python pandas

import pandas as pd

census_df = pd.read_csv('census.csv')
#census_df.head()

def answer_seven():
    census_df_1 = census_df[(census_df['SUMLEV'] == 50)].set_index('CTYNAME')
    census_df_1['highest'] = census_df_1[['POPESTIAMTE2010','POPESTIAMTE2011','POPESTIAMTE2012','POPESTIAMTE2013','POPESTIAMTE2014','POPESTIAMTE2015']].max()
    census_df_1['lowest'] = census_df_1[['POPESTIAMTE2010','POPESTIAMTE2011','POPESTIAMTE2012','POPESTIAMTE2013','POPESTIAMTE2014','POPESTIAMTE2015']].min()
    x = abs(census_df_1['highest'] - census_df_1['lowest']).tolist()
    return x[0]

answer_seven()
This tries to use the data from census.csv to find the counties that have the largest absolute change in population within 2010-2015 (the POPESTIMATE columns); I wanted to simply find the difference between the absolute values of the max and min for each year/column. You must return a string. Also, [(census_df['SUMLEV'] == 50)] means only counties are taken, as their SUMLEV is set to 50. But the code gives an error that ends with:
KeyError: "['POPESTIAMTE2010' 'POPESTIAMTE2011' 'POPESTIAMTE2012'
'POPESTIAMTE2013'\n 'POPESTIAMTE2014' 'POPESTIAMTE2015'] not in index"
Am I indexing the wrong data structure? I'm really new to datascience and coding.
I think the column names in the code have a typo. The pattern is 'POPESTIMATE201?', not 'POPESTIAMTE201?'.
Any help with shortening the code will be appreciated. Here is the code that works:
census_df = pd.read_csv('census.csv')

def answer_seven():
    cdf = census_df[(census_df['SUMLEV'] == 50)].set_index('CTYNAME')
    columns = ['POPESTIMATE2010', 'POPESTIMATE2011', 'POPESTIMATE2012', 'POPESTIMATE2013', 'POPESTIMATE2014', 'POPESTIMATE2015']
    cdf['big'] = cdf[columns].max(axis=1)
    cdf['sml'] = cdf[columns].min(axis=1)
    cdf['change'] = cdf[['big']].sub(cdf['sml'], axis=0)
    return cdf['change'].idxmax()
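The same row-wise max/min logic can be checked on synthetic data (the county names and numbers below are made up for illustration):

```python
import pandas as pd

# Synthetic stand-in for the census data
cdf = pd.DataFrame({'CTYNAME': ['Alpha', 'Beta'],
                    'POPESTIMATE2010': [100, 50],
                    'POPESTIMATE2011': [110, 300]}).set_index('CTYNAME')

cols = ['POPESTIMATE2010', 'POPESTIMATE2011']
# Row-wise max minus row-wise min: the absolute population swing per county
change = cdf[cols].max(axis=1) - cdf[cols].min(axis=1)
# change.idxmax() names the county with the largest swing
```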
