create variable based on variable and column condition - pyspark

I'm trying to create a new column based on a plain Python variable, ModelType, and a DataFrame column, model.
Currently I'm doing it this way:
if ModelType == 'FRSG':
    df = df.withColumn(MODEL_NAME + '_veh', F.when(df["model"].isin(MDL_CD), df["ford_cd"]))
elif ModelType == 'TYSG':
    df = df.withColumn(MODEL_NAME + '_veh', F.when(df["model"].isin(MDL_CD), df["toyota_cd"]))
else:
    df = df.withColumn(MODEL_NAME + '_veh', F.when(df["model"].isin(MDL_CD), df["cm_cd"]))
I have also tried this:
df = df.withColumn(MODEL_NAME + '_veh', F.when((ModelType == 'FRSG') & (df["model"].isin(MDL_CD)), df["ford_cd"]))
but since ModelType is not a column, it gives an error:
TypeError: condition should be a Column
Is there a more efficient way to do the same thing?

You can also use a dict that holds the possible mappings for ModelType and use it like this:
model_mapping = {"FRSG": "ford_cd", "TYSG": "toyota_cd"}
df = df.withColumn(
    MODEL_NAME + '_veh',
    F.when(df["model"].isin(MDL_CD), df[model_mapping.get(ModelType, "cm_cd")])
)
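The lookup in this answer happens in plain Python, before any Spark expression is built, which is why ModelType never has to be a Column. A minimal sketch of just that lookup, with no Spark involved; MODEL_MAPPING and resolve_code_column are hypothetical names introduced here for illustration:

```python
# Hypothetical names for illustration; only the mapping itself comes from the answer.
MODEL_MAPPING = {"FRSG": "ford_cd", "TYSG": "toyota_cd"}


def resolve_code_column(model_type, mapping=MODEL_MAPPING, default="cm_cd"):
    """Return the source column name for model_type, falling back to default."""
    return mapping.get(model_type, default)


# The branch is decided once, in Python, for each possible ModelType value.
chosen = {mt: resolve_code_column(mt) for mt in ("FRSG", "TYSG", "OTHER")}
print(chosen)
```

The returned name would then be used as df[resolve_code_column(ModelType)] in the then part of F.when(...), exactly as the dict lookup is used above.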

I would probably use a variable for the column to be chosen in the then part:
if ModelType == 'FRSG':
    x = "ford_cd"
elif ModelType == 'TYSG':
    x = "toyota_cd"
else:
    x = "cm_cd"
df = df.withColumn(MODEL_NAME + '_veh', F.when(df["model"].isin(MDL_CD), df[x]))

Related

Trying to filter a CSV file with multiple variables using pandas in python

import pandas as pd
import numpy as np
df = pd.read_csv("adult.data.csv")
print("data shape: "+str(data.shape))
print("number of rows: "+str(data.shape[0]))
print("number of cols: "+str(data.shape[1]))
print(data.columns.values)
datahist = {}
for index, row in data.iterrows():
    k = (str(row['age']) + str(row['sex']) +
         str(row['workclass']) + str(row['education']) +
         str(row['marital-status']) + str(row['race']))
    if k in datahist:
        datahist[k] += 1
    else:
        datahist[k] = 1
uniquerows = 0
for key, value in datahist.items():
    if value == 1:
        uniquerows += 1
print(uniquerows)
for key, value in datahist.items():
    if value == 1:
        print(key)
df.loc[data['age'] == 58] & df.loc[data['sex'] == Male]
I have been trying to get the above code to work.
I have limited experience in coding but it seems like the issue lies with some of the columns being objects. The int64 columns work just fine when it comes to filtering.
Any assistance will be much appreciated!

df.loc[data['age'] == 58] & df.loc[data['sex'] == Male]
First, you are using Male as a variable name; you probably meant the string 'Male'. Second, look at the placement of [ and ]: you extract the part of the DataFrame where age equals 58, then extract the part where sex equals Male, and then try to combine the two pieces with a bitwise and. You should apply & to the conditions rather than to pieces of the DataFrame, that is:
data.loc[(data['age'] == 58) & (data['sex'] == 'Male')]
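A small self-contained sketch of the same idea follows. The rows are made up for illustration and stand in for adult.data.csv; only the column names age and sex come from the question.

```python
import pandas as pd

# Illustrative data only; the real file is adult.data.csv from the question.
data = pd.DataFrame({
    "age": [58, 58, 30, 58],
    "sex": ["Male", "Female", "Male", "Male"],
})

# Each comparison yields a boolean Series. The parentheses are required
# because & binds more tightly than ==.
mask = (data["age"] == 58) & (data["sex"] == "Male")
selected = data.loc[mask]
print(selected)
```

If a string comparison still matches nothing on the real file, one thing worth checking is stray whitespace in the values, which data['sex'].str.strip() would remove.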

The int64 columns work just fine because you've specified the condition correctly as:
data['age'] == 58
However, the object column condition data['sex'] == Male should be specified as a string:
data['sex'] == 'Male'
Also, I noticed that you loaded the DataFrame as df = pd.read_csv("adult.data.csv"). Did you mean this instead?
data = pd.read_csv("adult.data.csv")
The query at the end combines two conditions, and each one should be enclosed in parentheses inside the square-bracket [ ] filter. If the DataFrame is named data (instead of df), it should be:
data.loc[(data['age'] == 58) & (data['sex'] == 'Male')]
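The counting loop in the question can probably also be written without iterrows. The sketch below assumes the goal is to count rows whose combination of the six listed columns occurs exactly once; the rows are made up for illustration, and key_cols and occurs_once are hypothetical names.

```python
import pandas as pd

# Illustrative rows only; the column names are those used in the question.
data = pd.DataFrame({
    "age": [58, 58, 30, 41],
    "sex": ["Male", "Male", "Female", "Male"],
    "workclass": ["Private", "Private", "Private", "State-gov"],
    "education": ["HS-grad", "HS-grad", "Bachelors", "Masters"],
    "marital-status": ["Divorced", "Divorced", "Never-married", "Married"],
    "race": ["White", "White", "Black", "White"],
})
key_cols = ["age", "sex", "workclass", "education", "marital-status", "race"]

# keep=False marks every member of a repeated combination as a duplicate,
# so the negation selects rows whose combination occurs exactly once.
occurs_once = ~data.duplicated(subset=key_cols, keep=False)
uniquerows = int(occurs_once.sum())
print(uniquerows)
print(data.loc[occurs_once, key_cols])
```

Unlike concatenating the values into one string, this compares the columns one by one, so two different rows cannot collapse into the same key by accident.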

fillna() method does not replace values in data

I need to replace the missing values in my dataset. I used the fillna() method and the function runs, but when I check, the data is still null.
import pandas as pd
import numpy as np
dataset = pd.read_csv('mamografia.csv')
dataset
mamografia = dataset
mamografia
malignos = mamografia[mamografia['Severidade'] == 0].isnull().sum()
print('Valores ausentes: ')
print()
print('Valores Malignos: ', malignos)
print()
belignos = mamografia[mamografia['Severidade'] == 1].isnull().sum()
print('Valores Belignos:', belignos)
def substitui_ausentes(lista_col):
    for lista in lista_col:
        if lista != 'Idade':
            mamografia[lista].fillna(value = mamografia[lista][(mamografia['Severidade'] == 0)].mode())
            mamografia[lista].fillna(value = mamografia[lista][(mamografia['Severidade'] == 1)].mode())
        else:
            mamografia[lista].fillna(value = mamografia[lista][(mamografia['Severidade'] == 0)].mean())
            mamografia[lista].fillna(value = mamografia[lista][(mamografia['Severidade'] == 1)].mean())
mamografia.columns
substitui_ausentes(mamografia.columns)
mamografia
I'm trying to replace the null values using fillna().

By default, fillna does not work in place; it returns the result of the operation.
You can either assign the result back:
df = df.fillna(...)
or override the default behaviour by setting the parameter inplace=True:
df.fillna(..., inplace=True)
However, your code will still not work, because you want to fill the different severities separately.
Since the function is being rewritten, let's also make it more idiomatic pandas by not having it modify the DataFrame by default:
def substitui_ausentes(dfc, reglas, inplace=False):
    # Work on the caller's frame only when explicitly asked to.
    if inplace:
        df = dfc
    else:
        df = dfc.copy()
    # One dict of fill values per severity: {severity: {column: value}}.
    fill_values = df.groupby('Severidade').agg(reglas).to_dict(orient='index')
    for k in fill_values:
        df.loc[df['Severidade'] == k] = df.loc[df['Severidade'] == k].fillna(fill_values[k])
    return df
Note that you now need to call the function like this:
reglas = {
    'Idade': lambda x: pd.Series.mode(x)[0],
    'Densidade': 'mean'
}
substitui_ausentes(mamografia, reglas, inplace=True)
The reglas dictionary needs to include only the columns you want to fill and how you want to fill them.
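An alternative that avoids the explicit loop is groupby(...).transform, which broadcasts each group's statistic back onto the original rows. This is a minimal sketch rather than a drop-in replacement. The data is made up, and it assumes the rule from the question: the mean for Idade and the mode for the other column. Note that Series.mode() returns a Series, so the first element is selected explicitly.

```python
import numpy as np
import pandas as pd

# Illustrative data only; column names follow the question and the answer.
mamografia = pd.DataFrame({
    "Severidade": [0, 0, 0, 1, 1, 1],
    "Idade": [40.0, np.nan, 60.0, 70.0, 80.0, np.nan],
    "Densidade": [3.0, 3.0, np.nan, 2.0, np.nan, 2.0],
})

# transform returns a Series aligned with the original rows, so fillna
# can match each missing cell with the statistic of its own severity group.
media_idade = mamografia.groupby("Severidade")["Idade"].transform("mean")
mamografia["Idade"] = mamografia["Idade"].fillna(media_idade)

# mode() returns a Series (there can be ties), so take its first value.
moda_densidade = mamografia.groupby("Severidade")["Densidade"].transform(
    lambda s: s.mode().iloc[0]
)
mamografia["Densidade"] = mamografia["Densidade"].fillna(moda_densidade)
print(mamografia)
```

Assigning the result back to the column, as done here, sidesteps the in-place question entirely.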

Python: Referencing a column in DataFrame using a function

This is the code that I have:
.......
gender = data.loc[np.where(data['ID']==id)]["gender"].tolist()[0]
cell = data.loc[np.where(data['ID']==id)]["cell"].tolist()[0]
bloodtype = data.loc[np.where(data['ID']==id)]["bloodType"].tolist()[0]
expList = tempPos[(tempPos.gender == gender) & (tempPos.cell == cell) & (tempPos.bloodType == bloodtype)]
.......
Now there are several other columns that can be referenced in the tempPos dataframe (in the above example I am using gender, cell & bloodType)
Is there a way I can define a function and reference the columns dynamically?
def generateProbability(col1, col2, col3):
    .......
    col1val = data.loc[np.where(data['ID']==id)][col1].tolist()[0]
    col2val = data.loc[np.where(data['ID']==id)][col2].tolist()[0]
    col3val = data.loc[np.where(data['ID']==id)][col3].tolist()[0]
    expList = tempPos[(tempPos.col1 == col1val) & (tempPos.col2 == col2val) & (tempPos.col3 == col3val)]
    ........
generateProbability("Age","Gender","bloodType")
Thanks.

You needn't create a function. You could just access them through a for loop.
column_outputs = {}  # Dictionary to hold column outputs.
for column in data.columns:
    column_outputs[column] = data.loc[data['ID'] == id, column].tolist()[0]
column_outputs should now contain the desired values for each column, which you can access using the column name as a key.
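On the dynamic filter itself: attribute access such as tempPos.col1 looks for a column literally named col1, whereas bracket indexing accepts a name held in a variable. Below is a minimal sketch under that observation. The data is made up, and matching_rows is a hypothetical helper, not the asker's generateProbability.

```python
import pandas as pd

# Illustrative frames only; the names data, tempPos and ID follow the question.
data = pd.DataFrame({
    "ID": [1, 2],
    "gender": ["F", "M"],
    "cell": ["A", "B"],
    "bloodType": ["O", "AB"],
})
tempPos = pd.DataFrame({
    "gender": ["F", "F", "M"],
    "cell": ["A", "B", "B"],
    "bloodType": ["O", "O", "AB"],
})


def matching_rows(record_id, *columns):
    """Return the rows of tempPos that match the given record in every column."""
    record = data.loc[data["ID"] == record_id].iloc[0]
    mask = pd.Series(True, index=tempPos.index)
    for col in columns:
        # Bracket indexing works with a column name stored in a variable.
        mask &= tempPos[col] == record[col]
    return tempPos[mask]


expList = matching_rows(1, "gender", "cell", "bloodType")
print(expList)
```

Because the columns are passed as strings, the same function works for any number of columns without further changes.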

python conditional subtraction of a column from a constant

I have a DataFrame, for instance:
df <- data.frame("condition" = [a,a,a,b,b,b,a,b], "dv1" = [7,8,6,3,2,1,5,4])
and I want to subtract 10 from column dv1 only if the value in column condition equals "a". Is there a way to do this in Python, such as with def and if? I have tried the following, but it doesn't work:
def recode():
for i in df["condition']:
if i == "a":
return abs(10-df["dv1"])

Indent the code like this:
def recode():
    for i in df["condition"]:
        if i == "a":
            return abs(10-df["dv1"])
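Even when indented, this function returns at the first "a" it meets and transforms the whole dv1 column, not only the matching rows. A vectorized sketch that touches only the rows where condition is "a" is given below. It assumes the intended recode is abs(10 - dv1), as in the attempted function, and it builds the example frame with pandas.

```python
import pandas as pd

# The example values from the question, built as a pandas DataFrame.
df = pd.DataFrame({
    "condition": ["a", "a", "a", "b", "b", "b", "a", "b"],
    "dv1": [7, 8, 6, 3, 2, 1, 5, 4],
})

# Boolean mask of the rows to recode.
is_a = df["condition"] == "a"

# Assumption: the intended recode is abs(10 - dv1), as in the attempted function.
df.loc[is_a, "dv1"] = (10 - df.loc[is_a, "dv1"]).abs()
print(df)
```

If plain subtraction is what is wanted instead, the right-hand side would become df.loc[is_a, "dv1"] - 10.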

Trying different ways to do a conditional and the output is not what expected (PYTHON)

I'm trying to modify some cells of a column based on a double condition. The code should write "none" in the column "FCH_REC" (which is originally filled with different dates) if column "LL" is equal to "201" and column "DEC" is equal to "RC". I also want to write "none" in the column "FCH_REC" if column "LL" is equal to "400" and column "DEC" is equal to "RCLA". Here is what I tried.
First I converted those columns to strings.
table["FCH_REC"] = table["FCH_REC"].astype(str)
table["LL"] = table["LL"].astype(str)
table["DEC"] = table["DEC"].astype(str)
Second I tried this:
table.loc[(tabla['LL'] == '201') & (tabla['DEC'] == "RC" ), "FCH_REC"] = None
table.loc[(tabla['LL'] == '400') & (tabla['DEC'] == "RCLA" ), "FCH_REC"] = None
Third I tried this:
table.columns = table.columns.str.replace(' ', '')
table.loc[(tabla['LL'] == '201') & (table['DEC'] == "RC" ), "FCH_REC"] = "None"
table.loc[(renove['LL'] == '400') & (table['DEC'] == "RCLA" ), "FCH_REC"] = "None"
Fourth I tried this (this one is having problems with the syntax):
table["FCH_REC"] = table[["FCH_REC","LL"]].apply(lambda row:
    row["FCH_REC"] = "None" if row["LL"] in ("RC", "RCLA") else row["FCH_REC"] )
Fifth I tried this:
for i in list(range(0,table.shape[0])):
    if table.loc[i,"DEC"] in ("RC", "RCLA"):
        table.loc[i,"FCH_REC"] == "NONE"
I don't know what's going on.
Thanks for the help!
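Judging only from the snippets above, two things stand out: some conditions refer to tabla or renove rather than table, and the loop in the fifth attempt uses == (a comparison) where = (an assignment) seems intended. A minimal sketch follows, assuming a single DataFrame named table is meant; the rows are made up for illustration.

```python
import pandas as pd

# Illustrative rows only; the column names come from the question.
table = pd.DataFrame({
    "LL": ["201", "400", "201", "400"],
    "DEC": ["RC", "RCLA", "RCLA", "RC"],
    "FCH_REC": ["2020-01-01", "2020-02-01", "2020-03-01", "2020-04-01"],
})

# Either pair of conditions qualifies a row; | combines the two masks.
mask = (
    ((table["LL"] == "201") & (table["DEC"] == "RC"))
    | ((table["LL"] == "400") & (table["DEC"] == "RCLA"))
)
table.loc[mask, "FCH_REC"] = "none"
print(table)
```

Every mask here is built from the same frame that is being assigned to, which keeps the row labels aligned.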
