Difficulty in plotting Pandas Multi-indexed DataFrame or series - python

Please see my DataFrame output below.
s = pd.DataFrame(combined_df.groupby(['session','age_range', 'gender']).size())
s.head(25)
                                 0
session   age_range gender
Evening   0 - 17    female   31022
                    male     21754
          18 - 24   female   79086
                    male     71563
                    unknown     75
          25 - 29   female   29321
                    male     46125
                    unknown     44
          30 - 34   female   21480
                    male     25803
                    unknown     33
          35 - 44   female   17369
                    male     20335
                    unknown    121
          45 - 54   female    8420
                    male     12385
                    unknown     24
          55+       female    3433
                    male      9880
                    unknown    212
Mid Night 0 - 17    female   18456
                    male     12185
          18 - 24   female   50536
                    male     45829
                    unknown     62
This is how my multi-indexed DataFrame looks. All I am trying to do is plot the data in such a way that I can compare the male and female users of different age groups active during the different sessions (say Morning, Evening, Noon and Night).
For example, I want to plot the male and female users of age groups 0-17, 18-24, 25-29, ... across the different sessions that I have.
Note: I have tried a few examples from Stack Overflow and other websites but was still unsuccessful in getting what I need. I have been struggling with this for many days and even the documentation seems vague to me, so please throw some light on this problem.

I think you can use unstack with DataFrame.plot.bar:
import matplotlib.pyplot as plt

# aggregate to a Series with a (session, age_range, gender) MultiIndex
df = combined_df.groupby(['session', 'age_range', 'gender']).size()

# unstack the innermost level (gender) into columns, then plot grouped bars
df.unstack(fill_value=0).plot.bar()
plt.show()
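For reference, here is a minimal self-contained sketch of the same unstack-then-plot pattern; the rows in the dict below are invented for illustration and combined_df is rebuilt from them:
import pandas as pd
import matplotlib.pyplot as plt

# hypothetical sample shaped like the question's data
combined_df = pd.DataFrame({
    'session':   ['Evening'] * 4 + ['Mid Night'] * 4,
    'age_range': ['0 - 17', '0 - 17', '18 - 24', '18 - 24'] * 2,
    'gender':    ['female', 'male'] * 4,
})

counts = combined_df.groupby(['session', 'age_range', 'gender']).size()

# columns become female/male; the x-axis shows (session, age_range) pairs
counts.unstack(fill_value=0).plot.bar()
plt.tight_layout()
plt.show()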

Related

Trying to use first 23 rows of a Pandas data frame as headers and then pivot on the headers

I'm pulling in the data frame using tabula. Unfortunately, the data is arranged in rows as below. I need to take the first 23 rows and use them as column headers for the remainder of the data. I need each row to contain these 23 headers for each of about 60 clinics.
    Col                                           Val
0   Date                                          9/13/2017
1   Clinic                                        Gray Medical Center
2   Location                                      1234 E. 164th Ave Thornton CA 12345
3   Clinic Manager                                Jane Doe
4   Lease Cost                                    $23,074.80 Rent, $5,392.88 CAM
5   Square Footage                                9,840
6   Lease Expiration                              7/31/2023
8   Care Provided                                 Family Medicine
9   # of Providers (Full Time)                    12
10  # FTE's Providing Care                        14
11  # Providers (Part-Time)                       1
12  Patients seen per week                        750
13  Number of patients in rooms per provider      4
14  Number of patients in waiting room            2
15  # Exam Rooms                                  31
16  Procedure rooms                               1
17  Other rooms                                   X-Ray, Phlebotomist/blood draw
18  Specify other                                 NaN
20  Other data:                                   Facilities assistance needed. 50% of business...
21  TI Needs:                                     Paint and Carpet (flooring is in good conditio...
23  Conclusion & Recommendation                   Lay out and occupancy flow are good for this p...
24  Date                                          9/13/2017
25  Clinic                                        Main Cardiology
26  Location                                      12000 Wall St Suite 13 Main CA 12345
27  Clinic Manager                                John Doe
28  Lease Cost                                    $9610.42 Rent, $2,937.33 CAM
29  Square Footage                                4,406
30  Lease Expiration                              5/31/2024
32  Care Provided                                 Cardiology
33  # of Providers (Full Time)                    2
34  # FTE's Providing Care                        11, 2 - P.T.
35  # Providers (Part-Time)                       2
36  Patients seen per week                        188
37  Number of patients in rooms per provider      0
38  Number of patients in waiting room            2
39  # Exam Rooms                                  6
40  Procedure rooms                               0
41  Other rooms                                   1 - Pacemaker, 1 - Treadmill, 1- Echo, 1 - Ech...
42  Specify other                                 Nurse Office, MA station, Reading Room, 2 Phys...
44  Other data:                                   Occupied in Emerus building. Needs facilities ...
45  TI Needs:                                     New build out, great condition.
47  Conclusion & Recommendation                   Practice recently relocated from 84th and Alco...
I was able to get my data frame in a better place by fixing the headers. I'm re-posting the first 3 "groups" of data to better illustrate the structure of the data frame. Everything repeats (headers and values) for each clinic.
Try this:
df2 = pd.DataFrame(df[23:].values.reshape(-1, 23),
                   columns=df[:23][0])
print(df2)
Here, 23 is the number of columns in each row of the resulting df2; you can replace it with the desired number of columns.
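As a rough, self-contained sketch of the same reshape idea (the single-column frame below is made up; in the question it would come from tabula, and 23 would replace 3):
import pandas as pd

# hypothetical single-column frame: 3 header rows, then 3 values per "clinic"
df = pd.DataFrame(['Date', 'Clinic', 'Location',
                   '9/13/2017', 'Gray Medical Center', '1234 E. 164th Ave',
                   '9/13/2017', 'Main Cardiology', '12000 Wall St'])

n = 3  # number of headers per group (23 in the question)
df2 = pd.DataFrame(df[n:].values.reshape(-1, n),
                   columns=df[:n][0])
print(df2)  # two rows, one per clinic, with Date/Clinic/Location columns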

Find the maximum value in a column with respect to another column

I have the below data frame:
Input:
  first_name last_name  age  preTestScore  postTestScore
0      Jason    Miller   42             4             25
1      Molly  Jacobson   52            24             94
2       Tina       Ali   36            31             57
3       Jake    Milner   24             2             62
4        Amy     Cooze   73             3             70
I want the output as:
Amy 73
So basically I want to find the highest value in the age column, together with the name of the person who has the highest age.
I tried it with pandas using groupby as below:
df2=df.groupby(['first_name'])['age'].max()
But with this I am getting the output below:
first_name
Amy 73
Jake 24
Jason 42
Molly 52
Tina 36
Name: age, dtype: int64
whereas I only want:
Amy 73
How shall I go about it in pandas?
You can get your result with the code below:
df.loc[df.age.idxmax(),['first_name','age']]
Here, with df.age.idxmax() we are getting the index of the row which has the maximum age value.
Then with df.loc[df.age.idxmax(),['first_name','age']] we are getting the columns 'first_name' & 'age' at that index.
This line of code should do the job:
df[df['age'] == df['age'].max()][['first_name', 'age']]
The [['first_name', 'age']] part holds the names of the columns you want in the result output.
Change them as you want.
In this case the output will be:
  first_name  age
4        Amy   73
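For illustration, a small self-contained sketch (re-typing the sample rows from the question) that shows both answers side by side:
import pandas as pd

df = pd.DataFrame({
    'first_name': ['Jason', 'Molly', 'Tina', 'Jake', 'Amy'],
    'last_name':  ['Miller', 'Jacobson', 'Ali', 'Milner', 'Cooze'],
    'age':        [42, 52, 36, 24, 73],
})

# idxmax: label of the single row with the largest age
print(df.loc[df.age.idxmax(), ['first_name', 'age']])

# boolean mask: every row whose age equals the maximum (also handles ties)
print(df[df['age'] == df['age'].max()][['first_name', 'age']])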

cleaning a column of strings in a pandas dataframe with str comprehension

I have a dataframe (df1) constructed from a survey in which participants entered their gender as a string and so there is a gender column that looks like:
id gender age
1 Male 19
2 F 22
3 male 20
4 Woman 32
5 female 26
6 Male 22
7 make 24
etc.
I've been using
df1.replace('male', 'Male')
for example, but this is really clunky and involves knowing the exact format of each response to fix it.
I've been trying to use various string comprehensions and string operations in Pandas, such as .split(), .replace(), and .capitalize(), with np.where() to try to get:
id gender age
1 Male 19
2 Female 22
3 Male 20
4 Female 32
5 Female 26
6 Male 22
7 Male 24
I'm sure there must be a way to use regex to do this but I can't seem to get the code right.
I know that it is probably a multi-step process of removing " ", then capitalising the entry, then replacing the capitalised values.
Any guidance would be much appreciated pythonistas!
Kev
Adapt the code in my comment to replace every record that starts with an F with the word Female:
import re

df1["gender"] = df1.gender.apply(lambda s: re.sub(
    "(^F)([A-Za-z]+)*",    # pattern
    "Female",              # replacement
    s.strip().title())     # string: whitespace stripped and title-cased
)
Similarly, swap the F for an M in the pattern and replace with Male to handle the male entries.
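As a rough, runnable sketch of this approach on the sample gender column (the values are re-typed from the question; the male pattern and the end anchor are my own adaptation of the hint above):
import re
import pandas as pd

df1 = pd.DataFrame({'gender': ['Male', 'F', 'male', 'Woman', 'female', 'Male', 'make']})

def clean_gender(s):
    s = s.strip().title()                      # normalise whitespace and case
    s = re.sub(r'^F[A-Za-z]*$', 'Female', s)   # anything starting with F -> Female
    return re.sub(r'^M[A-Za-z]*$', 'Male', s)  # anything starting with M -> Male (catches 'Make' too)

df1['gender'] = df1['gender'].apply(clean_gender)
print(df1['gender'].tolist())
# ['Male', 'Female', 'Male', 'Woman', 'Female', 'Male', 'Male']
# values like 'Woman' are left untouched and would need an extra rule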

How to perform groupby and mean on categorical columns in Pandas

I'm working on a dataset called gradedata.csv in Python Pandas where I've created a new binned column called 'Status' as 'Pass' if grade > 70 and 'Fail' if grade <= 70. Here is the listing of first five rows of the dataset:
      fname     lname  gender  age  exercise  hours  grade  \
0    Marcia      Pugh  female   17         3     10   82.4
1    Kadeem  Morrison    male   18         4      4   78.2
2      Nash    Powell    male   18         5      9   79.3
3   Noelani    Wagner  female   14         2      7   83.2
4   Noelani    Cherry  female   18         4     15   87.4
                                    address status
0   9253 Richardson Road, Matawan, NJ 07747   Pass
1          33 Spring Dr., Taunton, MA 02780   Pass
2          41 Hill Avenue, Mentor, OH 44060   Pass
3        8839 Marshall St., Miami, FL 33125   Pass
4  8304 Charles Rd., Lewis Center, OH 43035   Pass
Now, how do I compute the mean hours of exercise of female students with a 'status' of passing?
I've used the code below, but it isn't working:
print(df.groupby('gender', 'status')['exercise'].mean())
I'm new to Pandas. Can anyone please help me solve this?
You are very close. Note that your groupby key must be one of mapping, function, label, or list of labels. In this case, you want a list of labels. For example:
res = df.groupby(['gender', 'status'])['exercise'].mean()
You can then extract your desired result via pd.Series.get:
query = res.get(('female', 'Pass'))
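For reference, a minimal runnable sketch with the gender, status and exercise values re-typed from the five sample rows above:
import pandas as pd

df = pd.DataFrame({
    'gender':   ['female', 'male', 'male', 'female', 'female'],
    'status':   ['Pass', 'Pass', 'Pass', 'Pass', 'Pass'],
    'exercise': [3, 4, 5, 2, 4],
})

res = df.groupby(['gender', 'status'])['exercise'].mean()
print(res)                          # female/Pass -> 3.0, male/Pass -> 4.5
print(res.get(('female', 'Pass')))  # 3.0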

Grouping by in Pandas dataframe

I feel like I am missing something really simple here, can someone tell me what is wrong with this code?
I am trying to group by Sex where the Age > 30 and the Survived value = 1.
'Sex' is a boolean value (1 or 0), if that makes a difference
data_r.groupby('Sex')([data_r.Age >30],[data_r.Survived == 1]).count()
This is throwing:
"'DataFrameGroupBy' object is not callable"
Any ideas? Thanks.
You need to filter first and then groupby:
data_r[(data_r.Age>30) & (data_r.Survived==1)].groupby('Sex').count()
You can do your filtering before grouping.
data_r.query('Age > 30 and Survived == 1').groupby('Sex').count()
Output:
        PassengerId  Survived  Pclass  Name  Age  SibSp  Parch  Ticket  Fare  \
Sex
female           83        83      83    83   83     83     83      83    83
male             41        41      41    41   41     41     41      41    41
        Cabin  Embarked
Sex
female     47        81
male       25        41
IMHO, I'd use size; it is safer, since count does not include null (NaN) values. Notice the different values across the columns above; this is due to NaN values.
data_r.query('Age > 30 and Survived == 1').groupby('Sex').size()
Output:
Sex
female 83
male 41
dtype: int64
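To illustrate the count-versus-size point, here is a tiny made-up frame (column names follow the Titanic-style data in the question; the values are invented):
import numpy as np
import pandas as pd

data_r = pd.DataFrame({
    'Sex':      [0, 0, 1, 1, 1],
    'Age':      [45, 31, 52, 29, 40],
    'Survived': [1, 1, 1, 1, 1],
    'Cabin':    ['C85', np.nan, 'E46', 'B20', np.nan],
})

filtered = data_r.query('Age > 30 and Survived == 1')

# count() reports non-null values per column, so Cabin comes up short
print(filtered.groupby('Sex').count())

# size() reports the number of rows per group, NaNs included
print(filtered.groupby('Sex').size())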
