Adding a new permanent column to a data frame in Python

I am currently building a fake dataset to play with. I have one dataset, called patient_data that has the patient's info:
patient_data = pd.DataFrame(np.random.randn(100,5),columns='id name dob sex state'.split())
This gives me a sample of 100 observations, with variables like name, birthday, etc.
Clearly, some of these (like name, sex, and state) are categorical variables, and it makes no sense to have random numbers attached to them.
So for the "sex" column, I created a function that turns every random number < 0 into "male" and everything else into "female". I would like to create a new variable called "gender" and store the result in it:
def malefemale(x):
    if x < 0:
        print('male')
    else:
        print('female')
Then I wrote code to apply this function to the data frame and create the new variable "gender":
patient_data.assign(gender = patient_data['sex'].apply(malefemale))
But when I type "patient_data" in the Jupyter notebook, the data frame has not been updated to include this new variable. It seems like nothing was done.
Does anybody know how I can permanently add this new gender variable to my patient_data dataframe, with the function working properly?

I think you need to assign the result back (assign returns a new DataFrame rather than changing patient_data in place), and for the new values use numpy.where:
patient_data = patient_data.assign(gender=np.where(patient_data['sex']<0, 'male', 'female'))
print(patient_data.head(10))
         id      name       dob       sex     state  gender
0  0.588686  1.333191  2.559850  0.034903  0.232650  female
1  1.606597  0.168722  0.275342 -0.630618 -1.394375    male
2  0.912688 -1.273570  1.140656 -0.788166  0.265234    male
3 -0.372272  1.174600  0.300846  1.959095 -1.083678  female
4  0.413863  0.047342  0.279944  1.595921  0.585318  female
5 -1.147525  0.533511 -0.415619 -0.473355  1.045857    male
6 -0.602340 -0.379730  0.032407  0.946186  0.581590  female
7 -0.234415 -0.272176 -1.160130 -0.759835 -0.654381    male
8 -0.149291  1.986763 -0.675469 -0.295829 -2.052398    male
9  0.600571 -1.577449 -0.906590  1.042335 -2.104928  female

You need to change your custom function so that it returns the value instead of printing it:
def malefemale(x):
    if x < 0:
        return "male"
    else:
        return "female"
Then simply apply the custom function and assign the result to a new column:
patient_data['gender'] = patient_data['sex'].apply(malefemale)
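To see why the original attempt appeared to do nothing, here is a minimal sketch, assuming pandas and NumPy are installed. It uses a smaller frame than the question and an arbitrary seed, both chosen only for illustration. The two points it shows are that assign hands back a new frame, and that the function has to return its label for apply to collect anything.

```python
import numpy as np
import pandas as pd

# Small illustrative frame; the seed is an arbitrary choice for repeatability.
rng = np.random.default_rng(0)
patient_data = pd.DataFrame(rng.standard_normal((6, 2)), columns=["id", "sex"])

def malefemale(x):
    # Return the label; print() would hand None back to apply().
    return "male" if x < 0 else "female"

# assign() builds a new DataFrame and leaves patient_data untouched.
unchanged = patient_data.assign(gender=patient_data["sex"].apply(malefemale))
has_gender_before = "gender" in patient_data.columns

# Either rebind the name to the result of assign(), or set the column directly.
patient_data["gender"] = np.where(patient_data["sex"] < 0, "male", "female")
has_gender_after = "gender" in patient_data.columns

print(has_gender_before, has_gender_after)
```

The first flag is False and the second is True, which matches what the notebook showed before and after assigning back.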

Related

Setting specific rows to the value found in a row if differing index

I work with a lot of CSV data for my job. I am trying to use Pandas to copy a member's 'Email' into the 'PrimaryMemberEmail' column of their spouse's row. Here is a sample of what I mean:
import pandas as pd
user_data = {'FirstName':['John','Jane','Bob'],
             'Lastname':['Snack','Snack','Tack'],
             'EmployeeID':['12345','12345S','54321'],
             'Email':['John#issues.com','NaN','Bob#issues.com'],
             'DOB':['09/07/1988','12/25/1990','07/13/1964'],
             'Role':['Employee On Plan','Spouse On Plan','Employee Off Plan'],
             'PrimaryMemberEmail':['NaN','NaN','NaN'],
             'PrimaryMemberEmployeeId':['NaN','12345','NaN']
             }
df = pd.DataFrame(user_data)
I have thousands of rows like this. I need to populate 'PrimaryMemberEmail' only when the user is a spouse, using the 'Email' of the associated primary holder. So in this case I would want to autopopulate the 'PrimaryMemberEmail' for Jane Snack with that of her spouse, John Snack, which is 'John#issues.com'. I cannot find a good way to do this. Currently I am using:
p = -1  # starting value assumed; the original snippet does not show how p is initialized
for i in (df['EmployeeID']):
    p = (p + len(df['EmployeeID']) - (len(df['EmployeeID'])-1))  # net effect: p += 1
    EEID = df['EmployeeID'].iloc[p]
    if 'S' in EEID:
        df['PrimaryMemberEmail'].iloc[p] = df['Email'].iloc[p-1]
What bothers me is that this only works if my file arrives ordered as in the example DataFrame, with each spouse directly below the primary member. Also, my NaN values do not work with dropna() or other methods, but that is a question for another time.
I am new to Python and programming. I am trying to add value to myself in my current health career and I find this all very fascinating. Any help is appreciated.
IIUC, map the values and fillna:
df['PrimaryMemberEmail'] = (df['PrimaryMemberEmployeeId']
                            .map(df.set_index('EmployeeID')['PrimaryMemberEmail'])
                            .fillna(df['PrimaryMemberEmail'])
                            )
Alternatively, if you have real NaNs (not strings), use boolean indexing:
df.loc[df['PrimaryMemberEmployeeId'].notna(),
       'PrimaryMemberEmail'] = df['PrimaryMemberEmployeeId'].map(df.set_index('EmployeeID')['PrimaryMemberEmail'])
output:
   FirstName  Lastname  EmployeeID  DOB         Role               PrimaryMemberEmail  PrimaryMemberEmployeeId
0  John       Mack      12345       09/07/1988  Employee On Plan   John#issues.com     NaN
1  Jane       Snack     12345S      12/25/1990  Spouse On Plan     John#issues.com     12345
2  Bob        Tack      54321       07/13/1964  Employee Off Plan  Bob#issues.com      NaN
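The frame as posted in the question keeps the member's address in a separate 'Email' column and uses the literal string 'NaN' as a placeholder. Under those assumptions, here is a hedged sketch of the same map idea that looks up 'Email' instead. The name email_by_id and the trimmed column set are illustrative only.

```python
import pandas as pd

# Trimmed copy of the question's frame; 'NaN' here is a plain string placeholder.
df = pd.DataFrame({
    "FirstName": ["John", "Jane", "Bob"],
    "EmployeeID": ["12345", "12345S", "54321"],
    "Email": ["John#issues.com", "NaN", "Bob#issues.com"],
    "PrimaryMemberEmail": ["NaN", "NaN", "NaN"],
    "PrimaryMemberEmployeeId": ["NaN", "12345", "NaN"],
})

# Lookup table: EmployeeID -> that employee's Email.
email_by_id = df.set_index("EmployeeID")["Email"]

# Rows that name a primary member; compared against the string placeholder,
# since notna() would treat the text 'NaN' as a present value.
is_dependent = df["PrimaryMemberEmployeeId"] != "NaN"

# Fill only those rows; the lookup is by ID, so row order does not matter.
df.loc[is_dependent, "PrimaryMemberEmail"] = (
    df.loc[is_dependent, "PrimaryMemberEmployeeId"].map(email_by_id)
)

print(df[["FirstName", "PrimaryMemberEmail"]])
```

Because the lookup goes through the employee ID rather than the previous row, it does not depend on the spouse appearing directly after the primary member.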

How to filter and sort specific csv using python

Please help me with a Python script to filter the CSV below.
Below is an example of the CSV dump, on which I have already done the initial filtering.
Last_name  Gender  Name      Phone      city
Ford       Male    Tom       123        NY
Rich       Male    Robert    21312      LA
Ford       Female  Jessica   123123     NY
Ford       Male    John      3412       NY
Rich       Other   Linda     12312      LA
Ford       Other   James     4321       NY
Smith      Male    David     123123     TX
Rich       Female  Mary      98689      LA
Rich       Female  Jennifer  86860      LA
Ford       Male    Richard   12123      NY
Smith      Other   Daniel    897097     TX
Ford       Other   Lisa      123123123  NY
import re
def gather_info(L_name):
    dump_filename = "~/Documents/name_report.csv"
    LN = []
    with open(dump_filename, "r") as FH:
        for var in FH.readlines():
            if L_name in var:
                final = var.split(",")
                print(final[1], final[2], final[3])
    return LN
if __name__ == "__main__":
    L_name = input("Enter the Last name: ")
    la_name = gather_info(L_name)
With this, I am able to filter by last name. For example, if I choose L_name as Ford, I get this output:
Gender  Name     Phone
Male    Tom      123
Female  Jessica  123123
Male    John     3412
Other   James    4321
Male    Richard  12123
Other   Lisa     22412
I need help extending the script so that it selects each gender in turn, together with the matching values, and performs other functions on them before moving on to the next gender. For example, it first selects the gender Male [Tom, John] and performs the other functions, then selects the gender Female [Jessica] and performs the same functions, and then selects the gender Other [James, Lisa] and performs the same functions.
I would recommend using the pandas module, which allows for easy filtering and grouping of data:
import pandas as pd
if __name__ == '__main__':
    data = pd.read_csv('name_reports.csv')
    L_name = input("Enter the last name: ")
    by_last_name = data[data['Last_name'] == L_name]
    groups = by_last_name.groupby(['Gender'])
    for group_name, group_data in groups:
        print(group_name)
        print(group_data)
Breaking this down into its pieces, the first part is
data = pd.read_csv('name_reports.csv')
This reads the data from the CSV and places it into a dataframe.
Second, we have
by_last_name = data[data['Last_name'] == L_name]
This filters the dataframe to keep only the rows with Last_name equal to L_name.
Next we group the data.
groups = by_last_name.groupby(['Gender'])
This groups the filtered dataframe by gender.
Then we iterate over the groups. Each iteration yields a tuple with the group name and the dataframe associated with that group.
for group_name, group_data in groups:
    print(group_name)
    print(group_data)
This loop just prints out the data. To access fields from it, you can use the iterrows function:
for index, row in group_data.iterrows():
    print(row['city'])
    print(row['Phone'])
    print(row['Name'])
And then you can use those for whatever function you want. I would recommend reading the pandas documentation, since depending on the function you plan on using there may be a better way to do it with the library. Here is the link to the library: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html
Since you cannot use the pandas module, a method using only the csv module would look like this:
import csv
def has_last_name(row,last_name):
    return row['Last_name'] == last_name
def has_gender(row,current_gender):
    return row['Gender'] == current_gender
if __name__ == '__main__':
    data = None
    genders = ['Male','Female','Other']
    with open('name_reports.csv') as csvfile:
        data = list(csv.DictReader(csvfile,delimiter=','))
    L_name = input('Enter the Last name: ')
    get_by_last_name = lambda row: has_last_name(row,L_name)
    filtered_by_last_name = list(filter(get_by_last_name,data))
    for gender in genders:
        get_by_gender = lambda row: has_gender(row,gender)
        filtered_by_gender = list(filter(get_by_gender,filtered_by_last_name))
        print(filtered_by_gender)
The important part is the built-in filter function. It takes a function that accepts one item and returns a bool, plus an iterable, and returns an iterator over the items for which that function returns true. The other important part is csv.DictReader, which yields each row of your CSV file as a dictionary, allowing you to access attributes by key instead of by index.
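As a variation on the csv-only approach, here is a rough sketch that groups the names in a single pass with a defaultdict instead of filtering once per gender. It assumes the per-gender step only needs the list of names. The inline SAMPLE text stands in for name_reports.csv, and the helper name names_by_gender is made up for this example.

```python
import csv
import io
from collections import defaultdict

# Inline stand-in for name_reports.csv, using a few rows from the question.
SAMPLE = """Last_name,Gender,Name,Phone,city
Ford,Male,Tom,123,NY
Rich,Male,Robert,21312,LA
Ford,Female,Jessica,123123,NY
Ford,Male,John,3412,NY
Ford,Other,James,4321,NY
"""

def names_by_gender(csvfile, last_name):
    """Map each gender to the list of names that share last_name."""
    grouped = defaultdict(list)
    for row in csv.DictReader(csvfile):
        if row["Last_name"] == last_name:
            grouped[row["Gender"]].append(row["Name"])
    return dict(grouped)

groups = names_by_gender(io.StringIO(SAMPLE), "Ford")
for gender, names in groups.items():
    # Replace print with whatever per-gender processing is needed.
    print(gender, names)
```

With a real file, the io.StringIO(SAMPLE) argument would be replaced by the handle from open('name_reports.csv', newline='').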

Select array entries by criteria

I'm trying to write a conditional statement for fields from an imported CSV (data_dict), but my current code using np.where does not seem to work. I am trying to find the ages (data_dict['Age']) of people depending on whether they are male or female (data_dict['Gender']). How would I approach solving this? Please see my code below. Many thanks.
Sample Data
Index,Age,Year,Crash_Month,Crash_Day,Crash_Time,Road_User,Gender,Crash_Type,Injury_Severity,Crash_LGA,Crash_Area_Type
1,37,2000,1,1,4:30:59,PEDESTRIAN,MALE,UNKNOWN,1,MARIBYRNONG,MELBOURNE
2,22,2000,1,1,0:07:35,DRIVER,MALE,ADJACENT DIRECTION,1,YARRA,MELBOURNE
3,47,2000,1,1,4:51:37,DRIVER,FEMALE,ADJACENT DIRECTION,0,YARRA,MELBOURNE
4,70,2000,1,1,4:27:56,DRIVER,MALE,ADJACENT DIRECTION,1,BANYULE,MELBOURNE
Expected Result
Age of Males: [37,22,70,...]
Age of Females: [47,...]
Current Result
Age of Males: []
Age of Females: []
gender1 = np.array(data_dict['Gender'])
age1 = np.array(data_dict['Age'])
age_females = age1[np.where(gender1 == 'Female')]
age_males = age1[np.where(gender1 == 'Male')]
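Judging from the sample rows, one likely cause is letter case: the file stores MALE and FEMALE, while the code compares against 'Male' and 'Female', so neither comparison matches anything. Below is a minimal sketch that normalises case before comparing. The data_dict here is a hypothetical stand-in built from the four sample rows, and it assumes the ages were read in as strings.

```python
import numpy as np

# Hypothetical stand-in for the imported CSV columns (values from the sample rows).
data_dict = {
    "Age": ["37", "22", "47", "70"],
    "Gender": ["MALE", "MALE", "FEMALE", "MALE"],
}

# Normalise case so the comparison does not depend on how the file spells it.
gender = np.char.upper(np.array(data_dict["Gender"]))
# Ages read from a CSV are often strings; convert them to integers.
age = np.array(data_dict["Age"]).astype(int)

# A boolean mask selects the matching entries directly; np.where is not required.
age_males = age[gender == "MALE"]
age_females = age[gender == "FEMALE"]

print("Age of Males:", age_males.tolist())
print("Age of Females:", age_females.tolist())
```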

Getting data from API and sum 2 values for each year

I want to get data from an API where the structure looks like:
district - passed - gender - year - value
This API shows how many people from a country passed an exam, but it has 2 rows for the same year, one per gender (male and female).
I want to sum the values for each year, for example:
CountryA passed male 2019 12
CountryA passed female 2019 30
So the result should be 42.
Unfortunately, my method returns only the female value.
I was trying something like this:
def task2(self,data,district=None):
    passed = {stat.year: stat.amount for stat in data
              if self.getPercentageAmountOfPassed(stat.status,stat.district,stat.year)}
def getPercentageAmountOfPassed(self,status,district,year):
    return status == 'passed' and district == 'CountryA' and year <= 2019
I'm pretty sure that I get the data from the API, because I solved other examples with other parameters.
EXAMPLE OF DATA:
[year][district][amount][passed][gender]
[Data(2010.0, Polska,160988.0,przystąpiło,mężczyźni),
Data(2010.0, Polska,205635.0,przystąpiło,kobiety),
Data(2011.0, Polska,150984.0,przystąpiło,mężczyźni)]
This happens because you overwrite previous entries when you have the same year. You need to sum them instead:
from collections import defaultdict
def sum_per_year(data):
    passed = defaultdict(int)
    for stat in data:
        if getPercentageAmountOfPassed(stat.status, stat.district, stat.year):
            passed[int(stat.year)] += stat.amount
    return passed
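The difference between overwriting and summing can be seen in a small self-contained sketch. The Data namedtuple, the is_wanted helper and the sample rows are hypothetical stand-ins for the API records, built from the CountryA example in the question.

```python
from collections import defaultdict, namedtuple

# Hypothetical record type mirroring the attributes used in the question.
Data = namedtuple("Data", "year district amount status gender")

data = [
    Data(2019.0, "CountryA", 12.0, "passed", "male"),
    Data(2019.0, "CountryA", 30.0, "passed", "female"),
    Data(2018.0, "CountryA", 7.0, "passed", "male"),
    Data(2019.0, "CountryB", 99.0, "passed", "male"),
]

def is_wanted(stat):
    # Same condition as getPercentageAmountOfPassed in the question.
    return stat.status == "passed" and stat.district == "CountryA" and stat.year <= 2019

# Dict comprehension: the second 2019 row replaces the first one.
overwritten = {int(s.year): s.amount for s in data if is_wanted(s)}

# Accumulating: both 2019 rows contribute to the total.
summed = defaultdict(float)
for s in data:
    if is_wanted(s):
        summed[int(s.year)] += s.amount

print(overwritten[2019], summed[2019])
```

The comprehension keeps only the last value seen for 2019, which is the female row, while the accumulating loop adds the two rows together.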

How to organize data after using filter

Let's say I have a database table like this:
NumCOL NameCol AgeCOL ColourCOL
----------------------------------------
1 Joel 18 Blue
2 Joey 22 Red
3 Jacob 25 Green
4 Jack 27 Blue
5 Joey 21 Red
In this example, NumCOL is unique and increments by 1; NameCOL, AgeCOL and ColourCOL are not unique. I need to filter in a way that lets me grab all entries based on ColourCOL, and then grab the entries in NameCOL and AgeCOL that are associated with that colour. I use the code below (where a user selects the colour blue):
colour = request.POST.get('FormColour') #colour = blue
name = Database.objects.filter(ColourCOL=colour).values_list('NameCOL', flat=True)
Now there are two names that have the colour blue, so I was thinking of appending all the data to a list and reading from that list when needed, like this:
ex_list = []
tracker = 0
for x in name:
    age = Database.objects.filter(ColourCOL=colour, NameCOL=name).values_list('AgeCOL', flat=True)
    ex_list.append([name[tracker], name, age])
    tracker += 1
My concern is that I won't be able to retrieve data from the list effectively (if at all) if I need to write to a PDF or display it in a table. Another option is sorting by NumCOL, but users only have access to the data in ColourCOL. So what is the best way to organize all the information so that I can display the selected colour, and then the name and age of everyone associated with that colour?
Thanks for the help.
colour = request.POST.get('FormColour') #colour = blue
data_dict = Database.objects.filter(ColourCOL=colour).values('NameCOL', 'AgeCOL', 'NumCOL')
for data in data_dict:
    print(data['NameCOL'], data['AgeCOL'], data['NumCOL'])
You can use values to retrieve each matching row as a dict, iterate over the result, and access each column by name.
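For the table or PDF step, here is a rough sketch of how the per-row dicts could be turned into display rows. It runs without Django: the rows list is a hand-written stand-in for what values('NameCOL', 'AgeCOL', 'NumCOL') would yield for the two blue entries, and the table layout is only one possible choice.

```python
# Stand-in for the result of .values('NameCOL', 'AgeCOL', 'NumCOL'): one dict per row.
rows = [
    {"NameCOL": "Jack", "AgeCOL": 27, "NumCOL": 4},
    {"NameCOL": "Joel", "AgeCOL": 18, "NumCOL": 1},
]
colour = "Blue"  # the value the user selected

# Sort by the unique key, then build (colour, name, age) tuples ready for
# a template loop or a PDF table writer.
table = [
    (colour, row["NameCOL"], row["AgeCOL"])
    for row in sorted(rows, key=lambda r: r["NumCOL"])
]

for line in table:
    print(*line)
```

Each name stays paired with its own age because both come from the same row, which avoids the separate per-name query in the question.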
