Adding information in a particular text - python

I have a text file, and the content is as follows:
id income gender
1 6423435 female
2 1245638 male
3 6246554 female
4 9755105 female
5 5345215 female
6 5624209 female
7 8294732 male
I want to add two more pieces of information to it, a gender code (0 or 1) and another income value, and then save the result as another text file. This time each line should look like the following:
id;income;gender;anotherincome;gender_coded
In this case, how can I add these two pieces of information to the text?
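One possible approach is to split each line on whitespace, append the two new fields, and join everything with semicolons. The sketch below is only an illustration: the question does not say which gender maps to 0 or 1, nor where the second income comes from, so the `GENDER_CODES` mapping and the `another_income` values are assumptions made up for the example.

```python
# Sample input copied from the question; in practice it would be read from the file.
raw = """id income gender
1 6423435 female
2 1245638 male
3 6246554 female
"""

# Assumption: female -> 0, male -> 1 (the question does not specify the coding).
GENDER_CODES = {"female": 0, "male": 1}

# Hypothetical second income values keyed by id (not given in the question).
another_income = {"1": 1000, "2": 2000, "3": 3000}


def add_columns(text, extra_income):
    """Return semicolon-separated text with the two extra columns appended."""
    lines = text.strip().splitlines()
    header = lines[0].split() + ["anotherincome", "gender_coded"]
    out = [";".join(header)]
    for line in lines[1:]:
        id_, income, gender = line.split()
        fields = [id_, income, gender, str(extra_income[id_]), str(GENDER_CODES[gender])]
        out.append(";".join(fields))
    return "\n".join(out) + "\n"


result = add_columns(raw, another_income)
print(result)
```

To save the result, the returned string can be written out with `open("output.txt", "w")`, where the file name is again just a placeholder.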

Related

How is DIFF calculated on customer demographics in featuretools?

I have two tables: one of customer information and one of transaction info.
Customer information includes each person's quality of health (from 0 to 100)
e.g. if I extract just the Name and HealthQuality columns:
John: 70
Mary: 20
Paul: 40
etc etc.
After applying featuretools I noticed a new DIFF(HealthQuality) variable.
According to the docs, this is what DIFF does:
"Compute the difference between the value in a list and the previous value in that list."
Is featuretools calculating the difference between Mary and John's health quality in this instance?
I don't think this kind of feature synthesis really works for customer records e.g. CUM_SUM(emails_sent) for John. John's record is one row, and he has one value for the amount of emails we sent him.
For now I am using the ignore_variables=[all_customer_info] option to remove all of the customer data, keeping only the transactions table.
This also leads me into another question.
Using data from the transactions table, John now has a DIFF(MEAN(transactions.amount)). What is the DIFF measured in this instance?
   id  MEAN(transactions.amount)  DIFF(MEAN(transactions.amount))
0   1                  21.950000                              NaN
1   2                  20.000000                        -1.950000
2   3                  35.604581                        15.604581
3   4                        NaN                              NaN
4   5                  22.782682                              NaN
5   6                  35.616306                        12.833624
6   7                  24.560536                       -11.055771
7   8                 331.316552                       306.756016
8   9                  60.565852                      -270.750700
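The numbers in the table are consistent with a row-over-row difference in id order: each customer's mean minus the mean of the customer on the previous row. The following is a minimal pandas sketch of that reading, assuming DIFF behaves like `Series.diff()` here. The column names `mean_amount` and `diff` are made up for the example, and the snippet does not call featuretools itself.

```python
import numpy as np
import pandas as pd

# Mean transaction amounts per customer, copied from the table above.
means = pd.DataFrame({
    "id": range(1, 10),
    "mean_amount": [21.95, 20.0, 35.604581, np.nan, 22.782682,
                    35.616306, 24.560536, 331.316552, 60.565852],
})

# Difference between each row and the row before it.
means["diff"] = means["mean_amount"].diff()
print(means)
```

Under this reading, the value for id 2 is 20.00 - 21.95 = -1.95, and id 5 gets NaN because the previous row (id 4) has no mean.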

Seaborn countplot show wrong results on Titanic dataset

I'm working on the Titanic dataset, which I got from this website:
https://public.opendatasoft.com/explore/dataset/titanic-passengers/table/?flg=fr
I want to show the number of male and female persons for each survived class (yes or no).
First of all I got the whole number of male and female persons using:
bysex=data1['Sex'].value_counts()
print(bysex)
This gave me these results:
male 577
female 314
Name: Sex, dtype: int64
The results show that the number of male persons is greater than female persons.
But when I use seaborn to show the number of male and female persons for each survived class using this code:
plot1 = sns.FacetGrid(data1, col='Survived')
plot1.map(sns.countplot,'Sex')
Then I get these results:
[Figure: two count plots of Sex, one panel per Survived value]
Here it shows that the number of females is greater than the number of males, and for the non-survivor class the number of females (around 450) is even greater than the total number of females (314).
How is this possible?
I think there is something wrong with the mapping.
In the left plot the Sex labels are interchanged.
data1.loc[data1["Survived"] == "No", 'Sex'].value_counts()
male 468
female 81
Name: Sex, dtype: int64
and the second plot is right.
data1.loc[data1["Survived"] == "Yes", 'Sex'].value_counts()
female 233
male 109
Name: Sex, dtype: int64
On the other hand when you use
ax = sns.countplot(x="Survived", hue="Sex", data=data1)
you get the right results.
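The counts can also be checked without plotting by cross-tabulating the two columns. The sketch below rebuilds a frame with the same counts reported above (the real dataset has more columns) and uses `pd.crosstab`. It is only a sanity check on the numbers, not a replacement for the plot.

```python
import pandas as pd

# Counts per (Survived, Sex) pair, taken from the value_counts output above.
counts = {
    ("No", "male"): 468,
    ("No", "female"): 81,
    ("Yes", "female"): 233,
    ("Yes", "male"): 109,
}

# Expand the counts into one row per passenger.
rows = [(survived, sex) for (survived, sex), n in counts.items() for _ in range(n)]
data1 = pd.DataFrame(rows, columns=["Survived", "Sex"])

# One row per Survived value, one column per Sex.
table = pd.crosstab(data1["Survived"], data1["Sex"])
print(table)
```

If the faceted plot is still wanted, passing an explicit order to the mapped function, for example `plot1.map(sns.countplot, 'Sex', order=['male', 'female'])`, should keep the bars in the same order in both panels. Seaborn warns about mapping `countplot` without `order` for this reason.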

cleaning a column of strings in a pandas dataframe with str comprehension

I have a dataframe (df1) constructed from a survey in which participants entered their gender as a string and so there is a gender column that looks like:
id gender age
1 Male 19
2 F 22
3 male 20
4 Woman 32
5 female 26
6 Male 22
7 make 24
etc.
I've been using
df1.replace('male', 'Male')
for example, but this is really clunky and involves knowing the exact format of each response to fix it.
I've been trying to use various string comprehensions and string operations in Pandas, such as .split(), .replace(), and .capitalize(), with np.where() to try to get:
id gender age
1 Male 19
2 Female 22
3 Male 20
4 Female 32
5 Female 26
6 Male 22
7 Male 24
I'm sure there must be a way to use regex to do this but I can't seem to get the code right.
I know that it is probably a multi-step process of removing " ", then capitalising the entry, then replacing the capitalised values.
Any guidance would be much appreciated pythonistas!
Kev
Adapt the code in my comment to replace every record that starts with an f with the word Female:
import re
df1["gender"] = df1.gender.apply(lambda s: re.sub(
    r"^F[A-Za-z]*",      # pattern: any entry starting with F after title-casing
    "Female",            # replacement
    s.strip().title())   # string: trim whitespace, then title-case
)
Similarly, replace F with M in the pattern and use Male as the replacement to handle the male entries.
Relevant regex docs
Regex help
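A vectorised variant using the pandas `.str` accessor might look like the sketch below. It assumes that any answer starting with "f" or "w" (for example "Woman") means Female and any answer starting with "m" (including the typo "make") means Male. That rule is an assumption for this sample and would need checking against the real survey responses.

```python
import pandas as pd

# Sample data copied from the question.
df1 = pd.DataFrame({
    "id": [1, 2, 3, 4, 5, 6, 7],
    "gender": ["Male", "F", "male", "Woman", "female", "Male", "make"],
    "age": [19, 22, 20, 32, 26, 22, 24],
})

# Normalise whitespace and case first so the patterns stay simple.
cleaned = df1["gender"].str.strip().str.lower()

# Assumption: f*/w* -> Female, m* -> Male. Female is replaced first; the
# second pattern is case-sensitive, so "Female" is not touched by it.
df1["gender"] = (
    cleaned.str.replace(r"^[fw].*", "Female", regex=True)
           .str.replace(r"^m.*", "Male", regex=True)
)
print(df1)
```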

Add values in a column based on name in a different column - python

I have a dataframe with a column that has participants' full names, and another column that has the attendance for a specific year. Each participant's name appears multiple times with their attendance for that year. I want to add the attendance values for a specific person to see how many times they attended in total. Right now I am using this command, but it adds all the values in the attendance column.
StudentinfoAll['Attendance_x'].sum(axis=0)
How do I edit this so that it gives me the sum of the attendance values for a specific person? Thank you for your help.
Here is what my data frame looks like:
Full Name    Attendance  Question 1  Question 2
Dan Smith    4           3.0         2.0
Erika Jones  5           6.0         0.0
Dan Smith    3           5.0         7.0
Erika Jones  5           5.0         3.0
Assuming you want the total by student (not just for one student at a time), you need a group by operation. For example, with a test.csv input of:
Full Name,Attendance,Question 1,Question 2
Dan Smith,4,3.0,2.0
Erika Jones,5,6.0,0.0
Dan Smith,3,5.0,7.0
Erika Jones,5,5.0,3.0
And some aggregation code of:
import pandas as pd
df = pd.read_csv('test.csv')
# Sum the Attendance column within each Full Name group
print(df.groupby('Full Name').agg({'Attendance': 'sum'}))
you get the following output (attendance by full name):
             Attendance
Full Name
Dan Smith             7
Erika Jones          10
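If only one person's total is needed, a boolean mask on the name column is enough. A minimal sketch using the sample rows above follows; the variable names `is_dan` and `dan_total` are illustrative.

```python
import pandas as pd

# Sample rows copied from the question (only the columns needed here).
df = pd.DataFrame({
    "Full Name": ["Dan Smith", "Erika Jones", "Dan Smith", "Erika Jones"],
    "Attendance": [4, 5, 3, 5],
})

# Select the rows for one person, then sum their attendance.
is_dan = df["Full Name"] == "Dan Smith"
dan_total = df.loc[is_dan, "Attendance"].sum()
print(dan_total)
```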

How to filter and groupby in python

I have a data set (made up the below as an example) and I am trying to group and filter at the same time. I want to groupby the occupation and then filter the Sex for just males. I am also working in pandas.
Occupation Age Sex
Accountant 23 Female
Doctor 33 Male
Accountant 43 Male
Doctor 28 Female
I'd like the final result to look something like this:
Occupation Sex
Accountant 1
Doctor 1
So far I have come up with the below, but it doesn't filter the Sex column for males:
data.groupby(['occupation'])[['sex']].count()
Thank you.
Use query prior to groupby
data.query('Sex == "Male"').groupby('Occupation').Sex.size().reset_index()
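The same idea can be written with a boolean mask instead of `query`. The sketch below uses the sample data from the question and filters before grouping. The column names follow the capitalised headers shown in the table, which is an assumption about the real data.

```python
import pandas as pd

# Sample data copied from the question.
data = pd.DataFrame({
    "Occupation": ["Accountant", "Doctor", "Accountant", "Doctor"],
    "Age": [23, 33, 43, 28],
    "Sex": ["Female", "Male", "Male", "Female"],
})

# Keep only the male rows, then count them per occupation.
males = data[data["Sex"] == "Male"]
result = males.groupby("Occupation")["Sex"].count().reset_index()
print(result)
```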
