Python: Referencing a column in DataFrame using a function

This is the code that I have:
.......
gender = data.loc[np.where(data['ID']==id)]["gender"].tolist()[0]
cell = data.loc[np.where(data['ID']==id)]["cell"].tolist()[0]
bloodtype = data.loc[np.where(data['ID']==id)]["bloodType"].tolist()[0]
expList = tempPos[(tempPos.gender == gender) & (tempPos.cell == cell) & (tempPos.bloodType == bloodtype)]
.......
Now there are several other columns that can be referenced in the tempPos dataframe (in the above example I am using gender, cell & bloodType).
Is there a way I can define a function and reference the columns dynamically?
def generateProbability(col1, col2, col3):
    .......
    col1val = data.loc[np.where(data['ID']==id)][col1].tolist()[0]
    col2val = data.loc[np.where(data['ID']==id)][col2].tolist()[0]
    col3val = data.loc[np.where(data['ID']==id)][col3].tolist()[0]
    expList = tempPos[(tempPos.col1 == col1val) & (tempPos.col2 == col2val) & (tempPos.col3 == col3val)]
    ........
generateProbability("Age","Gender","bloodType")
Thanks.

You needn't create a function; you can collect every column's value with a for loop.
column_outputs = {}  # Dictionary to hold column outputs.
for column in data.columns:
    column_outputs[column] = data.loc[data['ID'] == id, column].tolist()[0]
column_outputs should now contain the desired values for each column, which you can access using the column name as a key.
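If you do want the dynamic-column function from the question, here is a minimal sketch (assuming data, tempPos, and id as defined above). The key is bracket indexing: tempPos[col] accepts a column name held in a variable, whereas attribute access such as tempPos.col1 looks for a literal column named col1.
def generateProbability(col1, col2, col3):
    # Look up the reference row for this ID once.
    row = data.loc[data['ID'] == id].iloc[0]
    # Bracket indexing lets the column names come from variables.
    mask = (tempPos[col1] == row[col1]) & (tempPos[col2] == row[col2]) & (tempPos[col3] == row[col3])
    return tempPos[mask]

expList = generateProbability("Age", "Gender", "bloodType")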

Related

Trying to filter a CSV file with multiple variables using pandas in python

import pandas as pd
import numpy as np
df = pd.read_csv("adult.data.csv")
print("data shape: "+str(data.shape))
print("number of rows: "+str(data.shape[0]))
print("number of cols: "+str(data.shape[1]))
print(data.columns.values)
datahist = {}
for index, row in data.iterrows():
    k = str(row['age']) + str(row['sex']) + \
        str(row['workclass']) + str(row['education']) + \
        str(row['marital-status']) + str(row['race'])
    if k in datahist:
        datahist[k] += 1
    else:
        datahist[k] = 1
uniquerows = 0
for key, value in datahist.items():
    if value == 1:
        uniquerows += 1
print(uniquerows)
for key, value in datahist.items():
    if value == 1:
        print(key)
df.loc[data['age'] == 58] & df.loc[data['sex'] == Male]
I have been trying to get the above code to work.
I have limited experience in coding but it seems like the issue lies with some of the columns being objects. The int64 columns work just fine when it comes to filtering.
Any assistance will be much appreciated!
df.loc[data['age'] == 58] & df.loc[data['sex'] == Male]
Firstly, you are attempting to use a Male variable; you probably meant a string, i.e. it should be 'Male'. Secondly, observe the [ and ] placement: you are extracting the part of the DataFrame with age equal to 58, then extracting the part with sex equal to Male, and then trying to combine those two pieces with bitwise and. You should use & on the conditions rather than on pieces of the DataFrame, that is:
df.loc[(data['age'] == 58) & (data['sex'] == 'Male')]
The int64 columns work just fine because you've specified the condition correctly as:
data['age'] == 58
However, the object column condition data['sex'] == Male should be specified as a string:
data['sex'] == 'Male'
Also, I noticed that you have loaded the dataframe df = pd.read_csv("adult.data.csv"). Do you mean this instead?
data = pd.read_csv("adult.data.csv")
The query at the end includes 2 conditions, each of which should be enclosed in parentheses within the square-bracket [ ] filter. If the dataframe name is data (instead of df), it should be:
data.loc[(data['age'] == 58) & (data['sex'] == 'Male')]
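To see the corrected filter in action, here is a minimal self-contained sketch, with a toy frame standing in for adult.data.csv:
import pandas as pd

# Toy frame standing in for adult.data.csv.
data = pd.DataFrame({'age': [58, 58, 40], 'sex': ['Male', 'Female', 'Male']})

# Each condition yields a boolean Series; the parentheses are required
# because & binds more tightly than ==.
print(data.loc[(data['age'] == 58) & (data['sex'] == 'Male')])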

create variable based on variable and column condition - pyspark

I'm trying to create a new variable based on a simple variable ModelType and a DataFrame column model.
Currently I'm doing it this way:
if ModelType == 'FRSG':
    df = df.withColumn(MODEL_NAME+'_veh', F.when(df["model"].isin(MDL_CD), df["ford_cd"]))
elif ModelType == 'TYSG':
    df = df.withColumn(MODEL_NAME+'_veh', F.when(df["model"].isin(MDL_CD), df["toyota_cd"]))
else:
    df = df.withColumn(MODEL_NAME+'_veh', F.when(df["model"].isin(MDL_CD), df["cm_cd"]))
I have tried this as well
df=df.withColumn(MODEL_NAME+'_veh', F.when((ModelType == 'FRSG') &(df["model"].isin(MDL_CD)), df["ford_cd"]))
but since the variable ModelType is not a column, it gives an error:
TypeError: condition should be a Column
Is there another, more efficient way to do this?
You can also use a dict that holds the possible mappings for ModelType and use it like this:
model_mapping = {"FRSG": "ford_cd", "TYSG": "toyota_cd"}
df = df.withColumn(
    MODEL_NAME + '_veh',
    F.when(df["model"].isin(MDL_CD), df[model_mapping.get(ModelType, "cm_cd")])
)
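For a runnable check, here is a minimal sketch with toy values standing in for the question's MODEL_NAME, MDL_CD, and ModelType:
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Toy stand-ins for the question's variables.
MODEL_NAME = 'MDL'
MDL_CD = ['A1', 'B2']
ModelType = 'FRSG'

df = spark.createDataFrame(
    [('A1', 'F100', 'T100', 'C100'), ('Z9', 'F200', 'T200', 'C200')],
    ['model', 'ford_cd', 'toyota_cd', 'cm_cd'])

model_mapping = {"FRSG": "ford_cd", "TYSG": "toyota_cd"}
df = df.withColumn(
    MODEL_NAME + '_veh',
    F.when(df["model"].isin(MDL_CD), df[model_mapping.get(ModelType, "cm_cd")]))
df.show()  # MDL_veh is F100 for model A1, null for Z9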
I would probably use a variable for the column to be chosen in the then part:
if ModelType == 'FRSG':
    x = "ford_cd"
elif ModelType == 'TYSG':
    x = "toyota_cd"
else:
    x = "cm_cd"
df = df.withColumn(MODEL_NAME+'_veh', F.when(df["model"].isin(MDL_CD), df[x]))

Iterate over a pandas data frame or groupby object

I have a DataFrame, df_headlines, with a date column and a score column whose values are -1, 0, or 1.
I want to group by the date column and then count how many times -1, 0, and 1 appear by date and then whichever has the highest count, use that as the daily_score.
I started with a groupby:
df_group = df_headlines.groupby('date')
This returns a groupby object and I'm not sure how to work with this given what I want to do above:
Can I iterate through it using the following?
for index, row in df_group.iterrows():
    daily_pos = []
    daily_neg = []
    daily_neu = []
As Ch3steR hinted at in a comment, you can iterate through your groups in the following way:
for name, group in df_headlines.groupby('date'):
    daily_pos = len(group[group['score'] == 1])
    daily_neg = len(group[group['score'] == -1])
    daily_neu = len(group[group['score'] == 0])
    print(name, daily_pos, daily_neg, daily_neu)
For each iteration, the variable name will contain a value from the date column (e.g. 4/13/20, 4/14/20, 5/13/20), and the variable group will contain a dataframe of all rows for the date contained in the name variable.
Try:
df_headlines.groupby("date")["score"].agg(lambda s: s.value_counts().idxmax())
No loop required: this returns the most common score within each group. (Note that nlargest(1) would give the largest score in each group, not the most frequent one.)
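For example, with a toy headlines frame (assuming date and score columns as described in the question):
import pandas as pd

df_headlines = pd.DataFrame({
    'date': ['4/13/20', '4/13/20', '4/13/20', '4/14/20'],
    'score': [1, 1, -1, 0]})

daily_score = df_headlines.groupby("date")["score"].agg(lambda s: s.value_counts().idxmax())
print(daily_score)  # 4/13/20 -> 1, 4/14/20 -> 0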

Replace value of column based on value in separate column

I have a pandas DataFrame that looks like:
ID | StateName | ZipCode
------------------------
0  | MD        | 20814
1  |           | 90210
2  | DC        | 20006
3  |           | 05777
4  |           | 12345
I have a function that will fill in StateName based on ZipCode value:
def FindZip(x):
    search = ZipcodeSearchEngine()
    zipcode = search.by_zipcode(x)
    return zipcode['State']
I want to fill in the blanks in the column StateName - based on the value of the corresponding ZipCode. I've unsuccessfully tried this:
test['StateName'] = test['StateName'].apply(lambda x: FindZip(test['Zip_To_Use']) if x == "" else x)
Basically, I want to apply a function to a column different from the column I am trying to change. I would appreciate any help! Thanks!
You can try the following:
test['StateName'] = test.apply(lambda x: FindZip(x['Zip_To_Use'])
                               if x['StateName'] == ""
                               else x['StateName'], axis = 1)
The above code applies the function to the dataframe instead of to StateName alone; with axis = 1 it runs row by row, so each row's Zip_To_Use value is available inside the lambda.
Updated:
Updated with multiple conditions in the if statement (looking at the solution below):
test['StateName'] = test.apply(lambda x: FindZip(x['Zip_To_Use'])
                               if ((x['StateName'] == "") and (x['Zip_To_Use'] != ""))
                               else x['StateName'], axis = 1)
I came up with a not very "pandorable" workaround. I would still love to see a more "pythonic" or "pandorable" solution if anyone has ideas! I essentially created a new list of the same length as the DataFrame and iterated through each row and then wrote over the column with the new list.
state = [FindZip(test['Zip_To_Use'].iloc[i])
         if (test['StateName'].iloc[i] == "" and test['Zip_To_Use'].iloc[i] != "")
         else test['StateName'].iloc[i]
         for i in range(len(test))]
Restated in a regular for loop (for readability):
state = []
for i in range(len(test)):
    if (test['StateName'].iloc[i] == "" and test['Zip_To_Use'].iloc[i] != ""):
        state.append(FindZip(test['Zip_To_Use'].iloc[i]))
    else:
        state.append(test['StateName'].iloc[i])
And then reassigned the column with this new list
test['StateName'] = state
Please let me know if you have a better solution!
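A more idiomatic ("pandorable") alternative is masked assignment: select only the rows that need filling and apply the function to just those zip codes. A short sketch, assuming the FindZip function and test frame from above:
mask = (test['StateName'] == "") & (test['Zip_To_Use'] != "")
test.loc[mask, 'StateName'] = test.loc[mask, 'Zip_To_Use'].apply(FindZip)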

How do I get all the rows before a specific index in Pandas?

I am reading an xlsx file, and for every row I want to create columns based on the rows before it.
import pandas as pd
import numpy as np
def get_total(x):
    name = x["NAME"]
    city = x["CITY"]
    index = x.index
    records = df[(df.index < index) & (df["NAME"] == name) & (df["CITY"] == city)]
    return records.shape[0]

data_filename = "data.xlsx"
df = pd.read_excel(data_filename, na_values=["", " ", "-"])
df["TOTAL"] = df.apply(lambda x: get_total(x), axis=1)
The get_total function is a simple example of what I want to achieve.
I could use df.reset_index(inplace=True) to get the dataframe's index as a column. I think there must be a better way to get the index of a row.
You can rewrite your function like this:
def get_total(x):
    name = x["NAME"]
    city = x["CITY"]
    index = x.name  # the current row's index label
    records = df.loc[:index - 1]  # loc slicing is inclusive, so stop just before this row (assumes the default RangeIndex)
    return records.loc[(records['NAME'] == name) & (records['CITY'] == city)].shape[0]
The name attribute holds the current row's index value.
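As a quick check, a minimal self-contained sketch with a toy frame standing in for the Excel file:
import pandas as pd

# Toy frame standing in for data.xlsx.
df = pd.DataFrame({'NAME': ['Ann', 'Bob', 'Ann', 'Ann'],
                   'CITY': ['NY', 'LA', 'NY', 'NY']})

df["TOTAL"] = df.apply(get_total, axis=1)
print(df)  # TOTAL is 0, 0, 1, 2: the count of earlier rows sharing NAME and CITY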
