Problem calculating values with a lambda in Pandas - Python

I need to convert the value of the 'Amount' field to dollars, based on the value of another field, 'Currency', but I don't understand why the value of the first record is repeated throughout the dataframe.
Here is my code:
def calculo_dolar_2(data):
    valor = (data * 1000) / float(precio_dolar)
    return valor

df_conversion_dolar_2['ED'] = df_conversion_dolar_2['Currency'].apply(lambda x: (df_conversion_dolar_2['Amount'].apply(calculo_dolar_2)) if x == '$$' else df_conversion_dolar_2['Amount'])
df_conversion_dolar_2
I tried this other way as well, but without success:
precio_dolar = 800

def calculo_dolar_3(data):
    if data == '$$':
        valor = (df_conversion_dolar_2['Amount'] * 1000) / float(precio_dolar)
    else:
        valor = df_conversion_dolar_2['Amount']
    return valor

df_conversion_dolar_2['ED'] = df_conversion_dolar_2['Currency'].apply(lambda x: df_conversion_dolar_2['Amount'].apply(calculo_dolar_3))
df_conversion_dolar_2
What is causing this?

I haven't tested the code, but this is how I would do it:
# make your code clear (what is 2?)
df = df_conversion_dolar_2
precio_dolar = 800

# First, build a boolean selector (the question uses '$$' for dollars)
dolar_select = df['Currency'] == '$$'

# Selecting the dollar rows at the column 'Amount' looks like this.
# This line only shows what happens and is not needed in your final code:
df.loc[dolar_select, 'Amount']

# Now apply your conversion to the selected data only:
df['ED'] = df.loc[dolar_select, 'Amount'].map(lambda x: (x * 1000) / float(precio_dolar))

# Finally, fill the NaN values in 'ED' (the non-selected rows) from 'Amount':
df.loc[df['ED'].isna(), 'ED'] = df['Amount']
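The assign-then-fill two-step can also be collapsed into a single expression with Series.where, which keeps a value where the condition is True and replaces it where it is False; a minimal sketch under the same names as above:
# Keep Amount for non-dollar rows, convert the rest
df['ED'] = df['Amount'].where(~dolar_select, df['Amount'] * 1000 / float(precio_dolar))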

I think what you're trying to do can be accomplished like so (note that apply needs axis=1 to operate row-wise):
def calculo_dolar_2(data):
    valor = (data * 1000) / float(precio_dolar)
    return valor

df_conversion_dolar_2['ED'] = df_conversion_dolar_2.apply(lambda x: calculo_dolar_2(x['Amount']) if x['Currency'] == '$$' else x['Amount'], axis=1)
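A row-wise apply works, but it runs the Python function once per row. For larger frames, here is a vectorized sketch with numpy.where, assuming the same column names and precio_dolar as above:
import numpy as np

df = df_conversion_dolar_2
# Convert where Currency is '$$', otherwise keep Amount unchanged
df['ED'] = np.where(df['Currency'] == '$$',
                    (df['Amount'] * 1000) / float(precio_dolar),
                    df['Amount'])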

Related

How to make a correlation of one column to many columns and return a list?

I would like to create a correlation function between one column and the others: pass in the dataframe with all columns, correlate against a specific column, and return a list of metrics and correlations. I am doing it like this:
correlations = df.corr().unstack().sort_values(ascending=True)
correlations = pd.DataFrame(correlations).reset_index()
correlations.columns = ['corr_matrix', 'dfbase', 'correlation']
correlations.query("corr_matrix == 'venda por m2' & dfbase != 'venda por m2'")
but I would like to know a way to do this with a function.
Something like this should do:
def get_nonself_correlation(df, self_name):
    temp = df.corr()
    temp = temp.loc[temp.index != self_name, temp.columns == self_name]
    temp = temp.unstack().reset_index()
    temp.columns = ['corr_matrix', 'dfbase', 'correlation']
    return temp
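Usage with the column from the question would then look like this (assuming df is your dataframe):
correlations = get_nonself_correlation(df, 'venda por m2')
print(correlations)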

Script keeps showing "SettingWithCopyWarning"

Hello, my problem is that my script keeps showing the message below:
SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
downcast=downcast
I searched Google for a while regarding this, and it seems like my code is somehow
assigning a sliced dataframe to a new variable, which is problematic.
The problem is **I can't find where my code gets problematic**.
I tried the copy function and separated the nested functions, but it is not working.
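For context, here is a minimal sketch (fake data, not my actual code) of the pattern that raises this warning:
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [4.0, None, 6.0]})
sliced = df[df['a'] > 1]             # may be a view or a copy of df
sliced['b'].fillna(0, inplace=True)  # -> SettingWithCopyWarning

# Taking an explicit copy removes the ambiguity:
sliced = df[df['a'] > 1].copy()
sliced['b'] = sliced['b'].fillna(0)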
I attached my code below.
from operator import gt, lt
import numpy as np

def case_sorting(file_get, col_get, methods_get, operator_get, value_get):
    ops = {">": gt, "<": lt}
    col_get = str(col_get)
    value_get = int(value_get)
    if methods_get == "|x|":  # '==' rather than 'is' for string comparison
        new_file = file_get[ops[operator_get](file_get[col_get], value_get)]
    else:
        new_file = file_get[ops[operator_get](file_get[col_get], np.percentile(file_get[col_get], value_get))]
    return new_file
Basically, what I was about to do was make a Flask API that gets an Excel file as input and returns a CSV file with some filtering. So I defined some functions first.
def get_brandlist(df_input, brand_input):
    if brand_input == "default":
        final_list = (pd.unique(df_input["브랜드"])).tolist()
    else:
        final_list = brand_input.split("/")
    if '브랜드' in final_list:
        final_list.remove('브랜드')
    final_list = [x for x in final_list if str(x) != 'nan']
    return final_list
Then I defined the main function:
def select_bestitem(df_data, brand_name, col_name, methods, operator, value):
    # // 2-1 // remove rows and columns where all values are NaN
    # (note: the original axis=0 & 1 evaluates to 0, so it only dropped rows)
    df_data = df_data.dropna(axis=0, how='all').dropna(axis=1, how='all')
    df_data.fillna(method='pad', inplace=True)
    # // 2-2 // iterate over all rows to find which row contains the brand value
    default_number = 0
    for row in df_data.itertuples():
        if '브랜드' in row:
            df_data.columns = df_data.iloc[default_number, :]
            break
        else:
            default_number = default_number + 1
    # // 2-3 // create the list containing all the target brand names
    brand_list = get_brandlist(df_input=df_data, brand_input=brand_name)
    # // 2-4 // subset the target brands into another dataframe
    df_data_refined = df_data[df_data.iloc[:, 1].isin(brand_list)]
    # // 2-5 // split the dataframe by brand name and apply the input condition
    df_per_brand = {}
    df_per_brand_modified = {}
    for brand_each in brand_list:
        df_per_brand[brand_each] = df_data_refined[df_data_refined['브랜드'] == brand_each]
        file = df_per_brand[brand_each].copy()
        df_per_brand_modified[brand_each] = case_sorting(file_get=file, col_get=col_name,
                                                         methods_get=methods,
                                                         operator_get=operator, value_get=value)
    # // 2-6 // merge all the remaining dataframes
    # (pd.concat; DataFrame.append was deprecated and later removed)
    df_merged = pd.concat([df_per_brand_modified[b] for b in brand_list], ignore_index=True)
    final_df = df_merged.to_csv(index=False, sep=',', encoding='utf-8')
    return final_df
I am going to import this function in my app.py later.
I am quite new to coding, so I am really sorry if my code is hard to understand, but I just really want to get rid of this annoying warning message. Thanks for the help in advance :)

Python - Alternate solutions to iterrows

I have written the following code to create a dataframe and add new rows and columns based on certain conditions. Unfortunately, it takes a lot of time to execute.
Are there any alternate ways to do this?
Any inputs are highly appreciated.
dfCircuito = None
for index, row in dadosCircuito.iterrows():
    for mes in range(1, 13):
        for nue in range(1, 5):
            for origem in range(1, 3):
                for suprimento in range(1, 3):
                    for tipo in range(1, 3):
                        df = pd.DataFrame(dadosCircuito.iloc[[index]])
                        df['MES'] = mes
                        if nue == 1:
                            df['NUE'] = 'N'
                        elif nue == 2:
                            df['NUE'] = 'C'
                        elif nue == 3:
                            df['NUE'] = 'F'
                        else:
                            df['NUE'] = 'D'
                        if origem == 1:
                            df['Origem'] = 'DISTRIBUICAO'
                        else:
                            df['Origem'] = 'SUBTRANSMISSAO'
                        if suprimento == 1:
                            df['Suprimento'] = 'INTERNO'
                        else:
                            df['Suprimento'] = 'EXTERNO'
                        if tipo == 1:
                            df['TipoOcorrencia'] = 'EMERGENCIAL'
                        else:
                            df['TipoOcorrencia'] = 'PROGRAMADA'
                        dfCircuito = pd.concat([dfCircuito, df], axis=0)
If I understand you correctly, you are trying to add a number of rows per row of dadosCircuito. The extra rows are permutations of mes=1...12; nue=N,C,F,D; ...
You can create a dataframe containing the permutations of attributes, then join it back to dadosCircuito:
mes = range(1, 13)
nues = list('NCFD')
origems = ['DISTRIBUICAO', 'SUBTRANSMISSAO']
suprimentos = ['INTERNO', 'EXTERNO']
tipos = ['EMERGENCIAL', 'PROGRAMADA']

# Make sure dadosCircuito.index is unique. If not, call reset_index first:
# dadosCircuito = dadosCircuito.reset_index()
df = pd.MultiIndex.from_product([dadosCircuito.index, mes, nues, origems, suprimentos, tipos],
                                names=['index', 'MES', 'NUE', 'Origem', 'Suprimento', 'TipoOcorrencia']) \
       .to_frame(index=False) \
       .set_index('index')
dfCircuito = dadosCircuito.join(df)
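If you are on pandas 1.2 or newer, a cross join gives the same result without building the shared index by hand; a sketch reusing the lists defined above:
perms = pd.MultiIndex.from_product([mes, nues, origems, suprimentos, tipos],
                                   names=['MES', 'NUE', 'Origem', 'Suprimento', 'TipoOcorrencia']) \
          .to_frame(index=False)
dfCircuito = dadosCircuito.merge(perms, how='cross')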

Apply a function in a dataframe's columns [Python]

I just wrote this function to calculate a person's age based on two columns in a Python DataFrame. Unfortunately, if I use return, the function returns the same value for all rows, but if I use a print statement it gives me the right values.
Here is the code:
def calc_age(dataset):
    index = dataset.index
    for element in index:
        year_nasc = train['DT_NASCIMENTO_BENEFICIARIO'][element][6:]
        year_insc = train['ANO_CONCESSAO_BOLSA'][element]
        age = int(year_insc) - int(year_nasc)
        print('Age: ', age)
        #return age

# Example values:
# train['DT_NASCIMENTO_BENEFICIARIO'] -> '03-02-1987'
# train['ANO_CONCESSAO_BOLSA'] -> 2009
What am I doing wrong?!
If what you want is to subtract the year of DT_NASCIMENTO_BENEFICIARIO from ANO_CONCESSAO_BOLSA, and df is your DataFrame:
# cast to datetime
df["DT_NASCIMENTO_BENEFICIARIO"] = pd.to_datetime(df["DT_NASCIMENTO_BENEFICIARIO"])
df["age"] = df["ANO_CONCESSAO_BOLSA"] - df["DT_NASCIMENTO_BENEFICIARIO"].dt.year
# print the result, or do something else with it:
print(df["age"])

pandas fillna is not working on a subset of the dataset

I want to impute the missing values for df['box_office_revenue'] with the median of the subset specified by df['release_year'] == x and df[genre] > 0 (where genre is one of the binary genre columns).
Here is my median finder function:
def find_median(df, year, genre, col_year, col_rev):
    median = df[(df[col_year] == year) & (df[col_rev].notnull()) & (df[genre] > 0)][col_rev].median()
    return median
The median function works; I checked. I added the line below since I was getting some CopyValue error:
pd.options.mode.chained_assignment = None # default='warn'
I then go through the years and genres, where col_name = ['is_drama', 'is_horror', ...]:
i = df['release_year'].min()
while i < df['release_year'].max():
    for genre in col_name:
        median = find_median(df, i, genre, 'release_year', 'box_office_revenue')
        df[(df['release_year'] == i) & (df[genre] > 0)]['box_office_revenue'].fillna(median, inplace=True)
    print(i)
    i += 1
However, nothing changed!
len(df['box_office_revenue'].isnull())
The output was 35527, meaning none of the null values in df['box_office_revenue'] had been filled.
Where did I go wrong?
Here is a quick look at the data (screenshot omitted); the other columns are just binary variables.
You mentioned
I did the code below since I was getting some CopyValue error...
The warning is important. You did not give your data, so I cannot actually check, but the problem is likely due to:
df[(df['release_year'] == i) & (df[genre] > 0)]['box_office_revenue'].fillna(..)
Let's break this down:
First you select some rows with:
df[(df['release_year'] == i) & (df[genre] > 0)]
Then from that, you select a column with:
...['box_office_revenue']
And now you have a problem...
Why?
The problem is that when you selected some rows (i.e., not all), pandas was forced to create a copy of your dataframe. You then select a column of the copy! Then you fillna() on the copy. Not super useful.
How do I fix it?
Select the column first:
df['box_office_revenue'][(df['release_year'] == i) & (df[genre] > 0)].fillna(..)
By selecting the entire column first, pandas is not forced to make a copy, and thus subsequent operations should work as desired.
This is not elegant, but I think it works. Basically, I calculate the means conditioned on genre and year, then join the data to a dataframe containing the imputing values. Then, wherever the revenue data is null, I replace the null with the imputed value.
import pandas as pd
import numpy as np

# Fake data
rev = np.random.normal(size=10_000, loc=20)
rev_ix = np.random.choice(range(rev.size), size=100)
rev[rev_ix] = np.NaN
year = np.random.choice(range(1950, 2018), replace=True, size=10_000)
genre = np.random.choice(list('abc'), size=10_000, replace=True)
df = pd.DataFrame({'rev': rev, 'year': year, 'genre': genre})

imputing_vals = df.groupby(['year', 'genre']).mean()
s = df.set_index(['year', 'genre'])
s.rev.isnull().any()  # True

# Create a dataframe with a new column containing the group means
s = s.join(imputing_vals, rsuffix='_R')
s.loc[s.rev.isnull(), 'rev'] = s.loc[s.rev.isnull(), 'rev_R']
new_df = s['rev'].reset_index()
new_df.rev.isnull().any()  # False
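A more compact route over the same fake data (a sketch, not part of the original answer) is groupby().transform, which broadcasts each group's mean back to the original shape:
group_means = df.groupby(['year', 'genre'])['rev'].transform('mean')
df['rev'] = df['rev'].fillna(group_means)
df['rev'].isnull().any()  # False, unless an entire year/genre group was NaN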
This URL describing chained assignment seems useful for such a case: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#evaluation-order-matters
As seen in the above URL, instead of doing this in your for loop:
for genre in col_name:
    median = find_median(df, i, genre, 'release_year', 'box_office_revenue')
    df[(df['release_year'] == i) & (df[genre] > 0)]['box_office_revenue'].fillna(median, inplace=True)
You can try:
for genre in col_name:
    median = find_median(df, i, genre, 'release_year', 'box_office_revenue')
    df.loc[(df['release_year'] == i) & (df[genre] > 0) & (df['box_office_revenue'].isnull()), 'box_office_revenue'] = median
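This works because the single .loc assignment writes straight into df rather than into an intermediate copy, and the extra isnull() condition leaves already-present revenue values untouched.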
