Python: How to find properties of items in a series

I have a series of numbers that I have broken into buckets using pandas.cut.
agepreg_cuts = pd.cut(df['agepreg'], [0, 20, 25, 30, np.inf], right=False)  # np.inf: pd.np was removed in pandas 2.0
I then count it and display the count.
agepreg_count = (df.groupby(agepreg_cuts).count())
agepreg_count
Which gives me much more information than I want:
           sest  cmintvw  totalwgt_lb
agepreg
[0, 20)    3182        0         1910
[20, 25)   4246        0         2962
[25, 30)   3178        0         2336
[30, inf)  2635        0         1830
Now I want to format it like this:
INAPPLICABLE    352
0 to 20        3182
20 to 25       4246
25 to 30       3178
30 to 50       2635
Total         13593
Which leads me to a couple of questions.
How do I extract the begin/end properties (e.g. 25/30) from the bin [25,30)?
How do I discover properties in a series so that I do not have to ask SO the previous question?
For reference, the data I am using comes from the NSFG (National Survey of Family Growth). The free book ThinkStats2 has companion code and data on GitHub.
From the 'code' directory, you can run the following line to load the dataframe.
import nsfg
df = nsfg.ReadFemPreg()
df

You could iterate over the frame using iterrows and then work on the categorical value, like:
In [679]: for x, i in agepreg_count.iterrows():
   .....:     print(' to '.join(x[1:-1].split(', ')), i['agepreg'])
   .....:
0 to 20 0
20 to 25 43
25 to 30 27
30 to inf 30
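As a side note, in newer pandas versions (an assumption about your setup), the bin index entries are pd.Interval objects rather than plain strings, so the endpoints from your first question are available directly as attributes instead of via string slicing. A minimal sketch:
for interval, row in agepreg_count.iterrows():
    # each index entry is a pd.Interval; .left and .right are the bin edges
    print('{:g} to {:g}'.format(interval.left, interval.right), row['agepreg'])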

If you are just looking for a well-formatted string (your example suggests it), you can pass a labels argument to the cut function.
## create labels from breakpoints
import numpy as np

breaks = [0, 20, 25, 30, np.inf]  # np.inf replaces the removed pd.np.inf
diff = np.diff(breaks).tolist()
## make tuples of *breaks* and length of intervals
joint = list(zip(breaks, diff))
## format label
s1 = "{left:,.0f} to {right:,.0f}"
labels = [s1.format(left=yr[0], right=yr[0] + yr[1] - 1) for yr in joint]
labels
['0 to 19', '20 to 24', '25 to 29', '30 to inf']
Then, cut using the breaks and labels.
df['agebin'] = pd.cut(df['agepreg'], breaks, labels=labels, right=False)
And summarize:
df.groupby('agebin')['agebin'].size()
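To get the exact layout you asked for (an INAPPLICABLE row for missing ages plus a Total row), here is one sketch, assuming the NaN ages correspond to your "INAPPLICABLE" cases:
summary = df['agebin'].value_counts(dropna=False, sort=False)
summary.index = summary.index.astype(str)            # turn bins/NaN into plain strings
summary = summary.rename(index={'nan': 'INAPPLICABLE'})
summary['Total'] = summary.sum()
print(summary)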

Related

making and updating multiple pandas dataframes using dicts (avoiding repetitive code)

I have a dataframe of id numbers (n = 140, but it could be more or less) and I have 5 group leaders. Each group leader needs to be randomly assigned a number of these ids (for ease, let's make it even, so n = 28 each, but I need to be able to control the amounts), and those rows need to be split out into a new df and then dropped from the original dataframe so that there is no crossover between leaders.
import pandas as pd
import numpy as np
#making the df
df = pd.DataFrame()
df['ids'] = np.random.randint(1, 140, size=140)
df['group_leader'] = ''
# list of leader names
leaders = ['John', 'Paul', 'George', 'Ringo', 'Apu']
I can do this for each leader with something like
df.loc[df.sample(n=28).index, 'group_leader'] = 'George'
g = df[df['group_leader'] == 'George'].copy()
df = df[df['group_leader'] != 'George']
print(df.shape[0])  # double checking that df has fewer ids in it
However, doing this individually for each group leader seems really un-pythonic (not that I'm an expert on that), and it is not easy to refactor into a function.
I thought that I might be able to do it with a dict and a for loop
frames = dict.fromkeys('group_leaders', pd.DataFrame())
for i in frames.keys():  # allows me to fill the cells with the string key?
    df.loc[df.sample(n=28).index, 'group_leader'] = str(i)
    frames[i].update(df[df['group_leader'] == str(i)].copy())  # also tried append()
    print(frames[i].head())
    df = df[df['group_leader'] != str(i)]
    print(f'df now has {df.shape[0]} ids left')  # just in case there's a remainder of ids
However, the new dataframes are still empty and I get the error:
Traceback (most recent call last):
  File "C:\Users\path\to\the\file\file.py", line 38, in <module>
    df.loc[df.sample(n=28).index, 'group_leader'] = str(i)
  File "C:\Users\path\to\the\file\pandas\core\generic.py", line 5356, in sample
    locs = rs.choice(axis_length, size=n, replace=replace, p=weights)
  File "mtrand.pyx", line 909, in numpy.random.mtrand.RandomState.choice
ValueError: a must be greater than 0 unless no samples are taken
This leads me to believe that I'm doing two things wrong:
Either making the dict incorrectly or updating it incorrectly.
Making the for loop run in such a way that it tries to run one time too many.
I have tried to be as clear as possible and present a minimally useful version of what I need, any help would be appreciated.
Note - I'm aware that 5 divides well into 140 and there may be cases where this isn't the case but I'm pretty sure I can handle that myself with if-else if it's needed.
You can use np.repeat and np.random.shuffle:
leaders = ['John', 'Paul', 'George', 'Ringo', 'Apu']
leaders = np.repeat(leaders, 28)
np.random.shuffle(leaders)
df['group_leader'] = leaders
Output:
>>> df
ids group_leader
0 138 John
1 36 Apu
2 99 John
3 91 George
4 58 Ringo
.. ... ...
135 43 Ringo
136 84 Apu
137 94 John
138 56 Ringo
139 58 Paul
[140 rows x 2 columns]
>>> df.value_counts('group_leader')
group_leader
Apu 28
George 28
John 28
Paul 28
Ringo 28
dtype: int64
Update
df = pd.DataFrame({'ids': np.random.randint(1, 113, size=113)})
leaders = ['John', 'Paul', 'George', 'Ringo', 'Apu']
leaders = np.repeat(leaders, int(np.ceil(len(df) / len(leaders))))  # repeats must be an integer
np.random.shuffle(leaders)
df['group_leader'] = leaders[:len(df)]
Output:
>>> df.value_counts('group_leader')
group_leader
Apu 23
John 23
Ringo 23
George 22
Paul 22
dtype: int64
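If you still need one DataFrame per leader (the original goal), a short sketch using groupby instead of repeated sampling and dropping:
frames = {leader: group.copy() for leader, group in df.groupby('group_leader')}
print(frames['George'].shape[0])  # each leader's rows, with no crossover between frames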

How to create a pivot table with value ranges in the index and headers and values in the frame?

This is my input data (shown as an image), and the output I want:
PAX range  DELHI  PUNE  MUMBAI
0-50          56    22      56
51-100        55    33      77
101-150       52    27      89
A couple of things are not clear from your question:
There is no PAX column in the dataframe; perhaps there are more columns not shown, in which case it is all right.
I'm assuming by your comment that the aggregation function you want is a count of rows.
If this is all correct, then you can bin the values and pass them to a groupby call:
output = df.groupby([
    pd.cut(df.PAX, bins=[0, 50, 100, 150]), 'City'
]).size().unstack()
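Since the input data was posted as an image, here is a self-contained sketch with hypothetical PAX/City values to show the pattern end to end:
import pandas as pd

df = pd.DataFrame({
    'PAX': [10, 60, 120, 30, 90, 140],  # hypothetical values
    'City': ['DELHI', 'PUNE', 'MUMBAI', 'DELHI', 'PUNE', 'MUMBAI'],
})
output = df.groupby([
    pd.cut(df.PAX, bins=[0, 50, 100, 150]), 'City'
]).size().unstack()
print(output)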

Summing Pandas columns between two rows

I have a Pandas dataframe with columns labeled Ticks, Water, and Temp, with a few million rows (possibly billions on a complete dataset). It looks something like this:
...
'Ticks' 'Water' 'Temp'
215 4 26.2023
216 1 26.7324
217 17 26.8173
218 2 26.9912
219 48 27.0111
220 1 27.2604
221 19 27.7563
222 32 28.3002
...
(All temperatures are in ascending order, and all 'Ticks' are linearly spaced and in ascending order too.)
What I'm trying to do is reduce the data down to a single 'Water' value for each floored, integer 'Temp' value, keeping just the first 'Tick' value (or the last; it doesn't have much of an effect on the analysis).
The current direction I'm working in is to start at the first row and save the tick value; check whether the temperature is an integer value greater than the previous one; add the water value; move to the next row; and keep adding water values while the temperature is not a whole integer higher. When the temperature is an integer higher, append the saved 'Tick' value, the integer temperature value, and the summed water count to a new dataframe.
I'm sure this will work, but I'm thinking there should be a way to do this a lot more efficiently using some application of df.loc or df.iloc, since everything is nicely in ascending order.
My hopeful output for this would be a much shorter dataset with values that look something like this:
...
'Ticks' 'Water' 'Temp'
215 24 26
219 68 27
222 62 28
...
Use GroupBy.agg and Series.astype:
new_df = (df.groupby(df['Temp'].astype(int))
            .agg({'Ticks': 'first', 'Water': 'sum'})
            # .agg(Ticks=('Ticks', 'first'), Water=('Water', 'sum'))
            .reset_index()
            .reindex(columns=df.columns))
print(new_df)
print(new_df)
Output
   Ticks  Water  Temp
0    215     24    26
1    219     68    27
2    222     32    28
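The commented-out .agg(Ticks=('Ticks', 'first'), Water=('Water', 'sum')) line is the equivalent named-aggregation form, available from pandas 0.25 onward; both spellings produce the same result.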
I have some trouble understanding the rules for which ticks you want in the final dataframe, but here is a way to get the indices of all Temps with an equal floored value:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import pandas as pd
import numpy as np
data = pd.DataFrame({
    'Ticks': [215, 216, 217, 218, 219, 220, 221, 222],
    'Water': [4, 1, 17, 2, 48, 1, 19, 32],
    'Temp': [26.2023, 26.7324, 26.8173, 26.9912, 27.0111, 27.2604, 27.7563, 28.3002]})
# first floor all temps
data['Temp'] = data['Temp'].apply(np.floor)
# get the indices of all equal temps
groups = data.groupby('Temp').groups
print(groups)
# maybe apply mean?
data = data.groupby('Temp').mean()
print(data)
Hope this helps.

Stacked Plot To Represent Genders For An Age Group From CSV Containing Identifier, Age and Gender In Python / Pandas / Matplotlib

I have CSV data with age, gender (men, women) and an identifier. I grouped the age and gender of individuals by the count of identifiers in pandas with:
counts = df.groupby(['Age', 'Gender']).count()
print(counts)
and the result looked something like this :
Age Gender Id_count
15 W 1
17 M 1
19 M 2
20 M 6
W 1
21 M 3
W 1
23 M 4
W 3
24 M 8
W 3
25 M 9
26 M 6
W 1
27 M 3
W 1
28 M 9
W 2
29 M 5
W 1
30 M 3
31 M 9
W 1 ..
Unique ages in my dataset run from 15 to 90. I now want to do an age-group analysis with a stacked plot at the end. For that, I want to bin the ages into groups (10-20, 21-30, 31-40 and so on) and plot the sum of identifiers in each age group, showing the sum on top of the bar; my aim is to get two different colors in the stacked bar representing men and women according to their proportion of id_count. To implement this, I created a dictionary where I gave ranges as shown below:
df['id_counted'] = np.round(df['Age'])
categories_dict = {15: 'Between 10 and 20',
                   16: 'Between 10 and 20',
                   17: 'Between 10 and 20',
                   18: 'Between 10 and 20',
                   19: 'Between 10 and 20',
                   20: 'Between 10 and 20',
                   21: 'Between 21 and 30',
                   22: 'Between 21 and 30', ..
                   90: 'Between 81 and 90'}
Then I created this dataframe.
df['category'] = df['id_counted'].map(categories_dict)
count2 = df.groupby(['category', 'Age', 'Gender', 'Id_Count']).count()
total = count2.sum(level=0)
print(total)
Now I have successfully counted the total of identifiers in each age group. It looked something like this:
Between 10 and 20 11
Between 21 and 30 62
Between 31 and 40 82
Between 41 and 50 120
Between 51 and 60 125
Between 61 and 70 141
Between 71 and 80 192
Between 81 and 90 38
But I lost my way here, because I wanted to plot gender too. Let's take ages between 10 and 20: the total, 11, should be on top of my bar, and the portions (9 men and 2 women) should be plotted as a stacked bar. I thought about another approach, because I don't think this one will get me to my result. I generated a grouped dataframe with the counts of each M and W per age, then calculated the total number of individuals per age group.
totals = counts.sum(level=0)
Now to plot :
plt.bar(ages, counts['M'], bottom=None, color='blue', label='M')
plt.bar(ages, counts['W'], bottom=counts['M'], color='red', label='W')
plt.legend()
plt.xlabel('Age Group')
plt.ylabel('Occurrences Of Identifiers')
plt.title('ttl', fontsize=20)
for age, tot in zip(ages, totals.values.flatten()):
    plt.annotate('{:d}'.format(tot), xy=(age + 0.39, tot), xytext=(0, 1),
                 textcoords='offset points', ha='center', va='bottom')
plt.savefig('plot.png')  # placeholder filename; the original plt.save() is not a Matplotlib function
plt.show()
plt.close()
and got this plot, which turned out to be okay, but it is for individual ages, and my target is to generate the same plot for the age groups in my dictionary. I would be very grateful if anyone would suggest an idea to obtain my aimed result. Thank you so much for your time.
Assigning age groups is easier using np.digitize.
n = 100
age = np.random.randint(15, 91, size=n)
gender = np.random.randint(2, size=n)
df = pd.DataFrame({'Age': age, 'Gender': gender})  # DataFrame.from_items was removed in newer pandas
bins = np.arange(1, 10) * 10
df['category'] = np.digitize(df.Age, bins, right=True)
print(df.head())
Age Gender category
0 22 1 2
1 54 0 5
2 85 1 8
3 77 0 7
4 86 1 8
Now count grouping by category and gender, then unstack the result to have gender as columns.
counts = df.groupby(['category', 'Gender']).Age.count().unstack()
print(counts)
Gender 0 1
category
1 2 7
2 7 5
3 6 4
4 11 9
5 5 8
6 2 4
7 10 7
8 6 7
Plotting is now a breeze.
counts.plot(kind='bar', stacked=True)
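To also write each bar's total on top of the stacked bars (the sum you wanted to display), a minimal sketch, assuming the counts frame from above:
ax = counts.plot(kind='bar', stacked=True)
for x, total in enumerate(counts.sum(axis=1)):
    ax.annotate('{:.0f}'.format(total), xy=(x, total), ha='center', va='bottom')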
This turned out to be my code at last:
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
plt.style.use('fivethirtyeight')
df = pd.read_csv('/home/Desktop/cocktail_ids_age_gender.csv')
df.values
bins = np.arange(10, 100, 10)
df['category'] = np.digitize(df.Age, bins, right=True)
counts = df.groupby(['category', 'Gender']).Age.count().unstack()
print(counts)
ax = counts.plot(kind='bar',stacked = False, colormap = 'Paired')
for p in ax.patches:
    ax.annotate(np.round(p.get_height(), decimals=0).astype(np.int64),
                (p.get_x() + p.get_width() / 2., p.get_height()),
                ha='center', va='center', xytext=(2, 10), textcoords='offset points')
plt.xlabel('Age Group')
plt.ylabel('Co-Occurrences')
plt.title('Comparison Of Occurrences In An Age Group', fontsize=20)
plt.show()
And I decided to leave it stacked anyway, because it made analysis easier. Everything turned out well, thanks to goyo. But the only thing that is still bothering me is my x-axis: instead of showing 1, 2, 3, 4, ... I wanted to show 10-20, 20-30 and so on. I am not grasping how I could do that. Can anyone help me? Thank you.
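For the x-axis question: the tick positions are the category numbers produced by np.digitize, so you can map them back to readable range labels. A sketch, assuming the bins = np.arange(10, 100, 10) defined above:
labels = ['{}-{}'.format(lo + 1, hi) for lo, hi in zip(bins[:-1], bins[1:])]
ax.set_xticklabels([labels[i - 1] for i in counts.index], rotation=0)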

Analyze data using python

I have a csv file in the following format:
30 1964 1 1
30 1962 3 1
30 1965 0 1
31 1959 2 1
31 1965 4 1
33 1958 10 1
33 1960 0 1
34 1959 0 2
34 1966 9 2
34 1958 30 1
34 1960 1 1
34 1961 10 1
34 1967 7 1
34 1960 0 1
35 1964 13 1
35 1963 0 1
The first column denotes the age and the last column denotes the survival status (1 if the patient survived 5 years or longer; 2 if the patient died within 5 years).
I have to calculate which age has the highest survival rate. I am new to Python and I cannot figure out how to proceed. I was able to calculate the most repeated age using the mode function, but I cannot figure out how to check one column and print the corresponding other column. Please help.
I was able to find an answer where I had to analyze just the first column.
import csv
from statistics import mode  # mode() needs this import

df = open('Dataset.csv')
csv_df = csv.reader(df)
a = []
b = []
for row in csv_df:
    a.append(row[0])
    b.append(row[3])
print('The age that has maximum reported incidents of cancer is ' + mode(a))
I am not entirely sure whether I understood your logic for determining the age with the maximum survival rate. Assuming that the age with the highest number of 1s has the highest survival rate, the following code is written.
I have done the reading part a little differently, as the dataset acted weird when I used csv. If the csv module works fine in your environment, use it. The idea is to retrieve each value in each row; we are interested in the 0th and 3rd columns.
In the following code, we maintain a dictionary, survival_map, and count the frequency of a particular age being associated with a 1.
import operator

survival_map = {}
with open('Dataset.csv', 'r') as in_f:  # text mode; 'rb' would yield bytes, breaking split(',')
    for row in in_f:
        row = row.rstrip()  # to remove the end-of-line character
        items = row.split(',')  # I converted the tab space to a comma, had a problem otherwise
        age = int(items[0])
        survival_rate = int(items[3])
        if survival_rate == 1:
            if age in survival_map:
                survival_map[age] += 1
            else:
                survival_map[age] = 1
Once we build the dictionary, {33: 2, 34: 5, 35: 2, 30: 3, 31: 2}, it is sorted in descending order by value:
sorted_survival_map = sorted(survival_map.items(), key=operator.itemgetter(1), reverse=True)
max_survival = sorted_survival_map[0]
UPDATE:
For a single max value, the OP's suggestion (in a comment) is preferred. Posting it here, with survival_map in place of the name dict, which shadows the builtin:
maximum = max(survival_map, key=survival_map.get)
print(maximum, survival_map[maximum])
For multiple max values:
max_keys = []
max_value = 0
for k, v in survival_map.items():
    if v > max_value:
        max_keys = [k]
        max_value = v
    elif v == max_value:
        max_keys.append(k)
print([(x, max_value) for x in max_keys])
Of course, this could be achieved with a dictionary comprehension; however, for readability, I am proposing this. Also, it is done in a single pass through the items of the dictionary rather than going through it multiple times, so the solution has O(n) time complexity and should be the fastest.
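For completeness, a sketch of the comprehension-based alternative mentioned above (two passes over the dictionary instead of one):
max_value = max(survival_map.values())
max_keys = [k for k, v in survival_map.items() if v == max_value]
print([(k, max_value) for k in max_keys])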
