I have a dataframe whose columns come in pairs: one column holds values and the adjacent column holds the corresponding counts. I want to plot a histogram using the values as the x variable and the counts as the frequencies.
For example, I have the following columns:
Age Counts
60 1204
45 700
21 400
. .
. .
34 56
10 150
I want my code to bin the Age values into ten-year intervals between the minimum and maximum values, total the frequencies for each interval from the Counts column, and then plot a histogram. Is there a way to do this using matplotlib?
I have tried the following but in vain:
patient_dets.plot(x='PatientAge', y='PatientAgecounts', kind='hist')
(patient_dets is the dataframe with 'PatientAge' and 'PatientAgecounts' as columns)
I think you need Series.plot.bar:
patient_dets.set_index('PatientAge')['PatientAgecounts'].plot.bar()
If you need bins, one possible solution is pd.cut:
import numpy as np
import pandas as pd

# helper df with the min and max age of each group
df1 = pd.DataFrame({'G':['14 yo and younger','15-19','20-24','25-29','30-34',
                         '35-39','40-44','45-49','50-54','55-59','60-64','65+'],
                    'Min':[0,15,20,25,30,35,40,45,50,55,60,65],
                    'Max':[14,19,24,29,34,39,44,49,54,59,64,120]})
print (df1)
G Min Max
0 14 yo and younger 0 14
1 15-19 15 19
2 20-24 20 24
3 25-29 25 29
4 30-34 30 34
5 35-39 35 39
6 40-44 40 44
7 45-49 45 49
8 50-54 50 54
9 55-59 55 59
10 60-64 60 64
11 65+ 65 120
# bin edges: the first Min followed by every Max
cutoff = np.hstack([np.array(df1.Min[0]), df1.Max.values])
labels = df1.G.values
patient_dets['Groups'] = pd.cut(patient_dets.PatientAge, bins=cutoff, labels=labels, right=True, include_lowest=True)
print (patient_dets)
PatientAge PatientAgecounts Groups
0 60 1204 60-64
1 45 700 45-49
2 21 400 20-24
3 34 56 30-34
4 10 150 14 yo and younger
patient_dets.groupby('Groups')['PatientAgecounts'].sum().plot.bar()
You can use pd.cut() to bin your data and then plot with plot(kind='bar'):
import numpy as np
import pandas as pd

nBins = 10
my_bins = np.linspace(patient_dets.Age.min(), patient_dets.Age.max(), nBins)
patient_dets.groupby(pd.cut(patient_dets.Age, bins=my_bins))['Counts'].sum().plot(kind='bar')
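Since the question asks for matplotlib specifically: plt.hist accepts a weights argument, so each age contributes its count to the bin totals directly. A minimal sketch with ten-year bins, using the example columns Age and Counts as above:
import numpy as np
import matplotlib.pyplot as plt

# ten-year bin edges spanning the data
edges = np.arange(patient_dets.Age.min(), patient_dets.Age.max() + 10, 10)

# weights= makes each age add its count to its bin's total
plt.hist(patient_dets.Age, bins=edges, weights=patient_dets.Counts)
plt.xlabel('Age')
plt.ylabel('Frequency')
plt.show()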
I have a dataset that consists of 5 rows of points that form curves. I want to separate the inner row from the others, or if possible each row, and store them in separate arrays. Is there any way to do this, for example by somehow flattening the curved data and sorting it afterwards based on the x and y values?
I would like to assign each row, from left to right, numbers from 0 to the maximum of the row. Right now the labels for each dot are not useful to me, and I can't change the labels.
Here are the first 50 data points of my data set:
x y
0 -6.4165 0.3716
1 -4.0227 2.63
2 -7.206 3.0652
3 -3.2584 -0.0392
4 -0.7565 2.1039
5 -0.0498 -0.5159
6 2.363 1.5329
7 -10.7253 3.4654
8 -8.0621 5.9083
9 -4.6328 5.3028
10 -1.4237 4.8455
11 1.8047 4.2297
12 4.8147 3.6074
13 -5.3504 8.1889
14 -1.7743 7.6165
15 1.1783 6.9698
16 4.3471 6.2411
17 7.4067 5.5988
18 -2.6037 10.4623
19 0.8613 9.7628
20 3.8054 9.0202
21 7.023 8.1962
22 9.9776 7.5563
23 0.1733 12.6547
24 3.7137 11.9097
25 6.4672 10.9363
26 9.6489 10.1246
27 12.5674 9.3369
28 3.2124 14.7492
29 6.4983 13.7562
30 9.2606 12.7241
31 12.4003 11.878
32 15.3578 11.0027
33 6.3128 16.7014
34 9.7676 15.6557
35 12.2103 14.4967
36 15.3182 13.5166
37 18.2495 12.5836
38 9.3947 18.5506
39 12.496 17.2993
40 15.3987 16.2716
41 18.2212 15.1871
42 21.1241 14.0893
43 12.3548 20.2538
44 15.3682 18.9439
45 18.357 17.8862
46 21.0834 16.6258
47 23.9992 15.4145
48 15.3776 21.9402
49 18.3568 20.5803
50 21.1733 19.3041
It seems that your curves follow a pattern, so you could select the curve of interest using slicing. I had to offset the selection slightly to get the five curves, because the first 8 points are not in the same order as the rest of the data, so those initial 8 data points are discarded. They could be added back in afterwards if required.
import pandas as pd
import matplotlib.pyplot as plt
df = pd.DataFrame({ 'x': [-6.4165, -4.0227, -7.206, -3.2584, -0.7565, -0.0498, 2.363, -10.7253, -8.0621, -4.6328, -1.4237, 1.8047, 4.8147, -5.3504, -1.7743, 1.1783, 4.3471, 7.4067, -2.6037, 0.8613, 3.8054, 7.023, 9.9776, 0.1733, 3.7137, 6.4672, 9.6489, 12.5674, 3.2124, 6.4983, 9.2606, 12.4003, 15.3578, 6.3128, 9.7676, 12.2103, 15.3182, 18.2495, 9.3947, 12.496, 15.3987, 18.2212, 21.1241, 12.3548, 15.3682, 18.357, 21.0834, 23.9992, 15.3776, 18.3568, 21.1733],
'y': [0.3716, 2.63, 3.0652, -0.0392, 2.1039, -0.5159, 1.5329, 3.4654, 5.9083, 5.3028, 4.8455, 4.2297, 3.6074, 8.1889, 7.6165, 6.9698, 6.2411, 5.5988, 10.4623, 9.7628, 9.0202, 8.1962, 7.5563, 12.6547, 11.9097, 10.9363, 10.1246, 9.3369, 14.7492, 13.7562, 12.7241, 11.878, 11.0027, 16.7014, 15.6557, 14.4967, 13.5166, 12.5836, 18.5506, 17.2993, 16.2716, 15.1871, 14.0893, 20.2538, 18.9439, 17.8862, 16.6258, 15.4145, 21.9402, 20.5803, 19.3041]})
# Generate the 5 dataframes
df_list = [df.iloc[i+8::5, :] for i in range(5)]
# Generate the plot
fig = plt.figure()
for frame in df_list:
    plt.scatter(frame['x'], frame['y'])
plt.show()
# Print the data of the innermost curve
print(df_list[4])
OUTPUT: (scatter plot of the five separated curves)
The 5th dataframe, df_list[4], contains the data of the innermost curve:
x y
12 4.8147 3.6074
17 7.4067 5.5988
22 9.9776 7.5563
27 12.5674 9.3369
32 15.3578 11.0027
37 18.2495 12.5836
42 21.1241 14.0893
47 23.9992 15.4145
You can then add the missing data like this:
# Retrieve the two missing points of the inner curve
inner_curve = pd.concat([df_list[4], df[5:7]]).sort_index(ascending=True)
print(inner_curve)
# Plot the inner curve only
fig2 = plt.figure()
plt.scatter(inner_curve['x'], inner_curve['y'], color = '#9467BD')
plt.show()
OUTPUT: inner curve
x y
5 -0.0498 -0.5159
6 2.3630 1.5329
12 4.8147 3.6074
17 7.4067 5.5988
22 9.9776 7.5563
27 12.5674 9.3369
32 15.3578 11.0027
37 18.2495 12.5836
42 21.1241 14.0893
47 23.9992 15.4145
(plot of the complete inner curve)
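The question also asks to number each curve's points from left to right. A short sketch building on df_list from above (the 'curve' and 'pos' column names are my own):
import pandas as pd

# stack the five curves, keeping track of which slice each point came from
labeled = pd.concat(df_list, keys=range(5)).reset_index(level=0)
labeled = labeled.rename(columns={'level_0': 'curve'})

# number each curve's points from left to right by x
labeled['pos'] = labeled.sort_values('x').groupby('curve').cumcount()
print(labeled.head())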
I have a df with different groups and two predictions (iqr, median).
cntx_iqr pred_iqr cntx_median pred_median
18-54 83 K18-54 72
R18-54 34 R18-54 48
25-54 33 18-34 47
K18-54 29 18-54 47
18-34 27 R25-54 29
K18-34 25 25-54 23
K25-54 24 K25-54 14
R18-34 22 R18-34 8
R25-54 17 K18-34 6
Now I want to plot them using seaborn, and I have melted the data for plotting. However, it does not look right to me:
pd.melt(df, id_vars=['cntx_iqr', 'cntx_median'], value_name='category', var_name="kind")
I am aiming to compare the predictions (pred_iqr, pred_median) across the two group columns (cntx_iqr, cntx_median), maybe with a stacked barplot or some other useful plot, to see how each group differs between the two predictions.
Not sure how you obtained the data frame, but you need to match the values first:
df = df[['cntx_iqr','pred_iqr']].merge(df[['cntx_median','pred_median']],
left_on="cntx_iqr",right_on="cntx_median")
df.head()
cntx_iqr pred_iqr cntx_median pred_median
0 18-54 83 18-54 47
1 R18-54 34 R18-54 48
2 25-54 33 25-54 23
3 K18-54 29 K18-54 72
4 18-34 27 18-34 47
Once you have this, you can just make a scatterplot:
import seaborn as sns

sns.scatterplot(x='pred_iqr', y='pred_median', data=df)
The barplot requires a bit of pivoting, but should be:
sns.barplot(x='cntx_iqr', y='value', hue='variable',
            data=df.melt(id_vars='cntx_iqr', value_vars=['pred_iqr', 'pred_median']))
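If the scatter comparison is the goal, labelling each point with its group makes the plot easier to read. A small sketch building on the merged df above (the annotation offsets are my own styling choice):
import matplotlib.pyplot as plt
import seaborn as sns

ax = sns.scatterplot(x='pred_iqr', y='pred_median', data=df)

# tag each point with its group label, offset a few points from the marker
for _, row in df.iterrows():
    ax.annotate(row['cntx_iqr'], (row['pred_iqr'], row['pred_median']),
                xytext=(3, 3), textcoords='offset points')
plt.show()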
I have a large pandas dataframe read in as table. I would like to calculate the mean and standard deviation of Age within each of the two groups, CRPS and CONTROLS, so I can plot them in a bar plot with the std deviations as the error bars.
I can get the mean of the whole Age column. I figured it's a for loop that I have to construct, but I don't know how to go further than table["Age"].mean(), which just gives me the average of all data points' age values. This is where I need some guidance: I want the code to look in the group column and calculate the average and standard deviation of the ages for that group. So, for example, an average and a standard deviation for the ages of the CRPS group.
The first 25 rows are shown below to give an idea of what the dataframe looks like. I have also imported numpy as np.
Group Age
0 CRPS 50
1 CRPS 59
2 CRPS 22
3 CRPS 48
4 CRPS 53
5 CRPS 48
6 CRPS 29
7 CRPS 44
8 CRPS 28
9 CRPS 42
10 CRPS 35
11 CONTROLS 54
12 CONTROLS 43
13 CRPS 50
14 CRPS 62
15 CONTROLS 64
16 CONTROLS 39
17 CRPS 40
18 CRPS 59
19 CRPS 46
20 CONTROLS 56
21 CRPS 21
22 CRPS 45
23 CONTROLS 41
24 CRPS 46
25 CONTROLS 35
I don't think you need a for-loop.
Instead, you might try something like:
table.loc[table['Group'] == 'CRPS', 'Age'].mean()
I haven't tested with your table, but I think that will work.
The idea is to first create a boolean array, which is True for rows where the Group field contains 'CRPS', then to select all of those rows with .loc, and finally to take the mean. You could iterate over all of the groups in the following way:
mean_age = dict()
for group in set(table['Group']):
    mean_age[group] = table.loc[table['Group'] == group, 'Age'].mean()
Maybe this is where you intended to use a for loop.
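For the original goal (means with standard deviations as error bars), groupby with agg avoids the loop entirely. A sketch, assuming the dataframe is called table as in the question:
import matplotlib.pyplot as plt

# mean and standard deviation of Age per group in one step
stats = table.groupby('Group')['Age'].agg(['mean', 'std'])
print(stats)

# bar plot with the std column as error bars (capsize is optional styling)
stats['mean'].plot(kind='bar', yerr=stats['std'], capsize=4)
plt.ylabel('Age')
plt.show()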
Priority Expected Actual
High 47 30
Medium 22 14
Required 16 5
I'm trying to create two other columns: 'Expected_values', which should hold 47*5 for the High row, 22*3 for the Medium row, and 16*10 for the Required row; and 'Actual_values', which should hold 30*5 for the High row, 14*3 for the Medium row, and 5*10 for the Required row, like this:
Priority Expected Actual Expected_values Actual_values
Required 16 5 160 50
High 47 30 235 150
Medium 22 14 66 42
Any simple way to do that in pandas or numpy?
Try:
import numpy as np

a = np.array([5, 3, 10])
df['Expected_values'] = df.Expected * a
df['Actual_values'] = df.Actual * a
print(df)
Priority Expected Actual Expected_values Actual_values
0 High 47 30 235 150
1 Medium 22 14 66 42
2 Required 16 5 160 50
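The positional array above depends on the rows staying in the order High, Medium, Required. Mapping by the Priority label is safer if the order can change; a small sketch, with the multiplier dict mirroring the values in the question:
# map each priority to its multiplier by label rather than by position
multiplier = {'High': 5, 'Medium': 3, 'Required': 10}
df['Expected_values'] = df['Expected'] * df['Priority'].map(multiplier)
df['Actual_values'] = df['Actual'] * df['Priority'].map(multiplier)
print(df)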
I am working with a very large dataframe (3.5 million x 150, taking 25 GB of memory when unpickled) and I need to find the maximum of one column for each id number and keep only the row with that maximum value. Each row is a recorded observation for one id at a certain date, and I also need the latest date.
This is animal test data where there are twenty additional columns, seg1-seg20, for each id and date that are filled with test-day information consecutively; for example, the first test's data fills seg1, the second test's data fills seg2, etc. The "value" field indicates how many segments have been filled, in other words how many tests have been done, so the row with the maximum "value" has the most test data. Ideally I want only these rows and not the previous rows. For example:
import pandas as pd

df = pd.DataFrame({'id':[1000,1000,1001,2000,2000,2000],
"date":[20010101,20010201,20010115,20010203,20010223,20010220],
"value":[3,1,4,2,6,6],
"seg1":[22,76,23,45,12,53],
"seg2":[23,"",34,52,24,45],
"seg3":[90,"",32,"",34,54],
"seg4":["","",32,"",43,12],
"seg5":["","","","",43,21],
"seg6":["","","","",43,24]})
df
date id seg1 seg2 seg3 seg4 seg5 seg6 value
0 20010101 1000 22 23 90 3
1 20010201 1000 76 1
2 20010115 1001 23 34 32 32 4
3 20010203 2000 45 52 2
4 20010223 2000 12 24 34 43 43 41 6
5 20010220 2000 12 24 34 43 44 35 6
And eventually it should be:
date id seg1 seg2 seg3 seg4 seg5 seg6 value
0 20010101 1000 22 23 90 3
2 20010115 1001 23 34 32 32 4
4 20010223 2000 12 24 34 43 43 41 6
I first tried to use .groupby('id').max() but couldn't find a way to use it to drop rows. The resulting dataframe MUST contain the ORIGINAL ROWS and not just the maximum value of each column for each id. My current solution is:
for i in df.id.unique():
    df = df.drop(df.loc[df.id==i].sort_values(['value','date']).index[:-1])
But this takes around 10 seconds to run each time through, I assume because it is trying to call up the entire dataframe each time through. There are 760,000 unique ids, each 17 digits long, so at this rate it will take far too long to be feasible.
Is there another method that would be more efficient? Currently every column is read in as an "object", and converting the relevant columns to the smallest possible integer type doesn't seem to help either.
I tried groupby('id').max() and it works, and it also drops the rows. Did you remember to reassign the df variable? This operation (like almost all pandas operations) is not in-place.
If you do:
df.groupby('id', sort = False).max()
You will get:
date value
id
1000 20010201 3
1001 20010115 4
2000 20010223 6
And if you don't want id as the index, you do:
df.groupby('id', sort = False, as_index = False).max()
And you will get:
id date value
0 1000 20010201 3
1 1001 20010115 4
2 2000 20010223 6
I don't know if that's going to be much faster, though.
Update
This way the index will not be reset:
df.loc[df.groupby('id')['value'].idxmax()]
And you will get:
date id seg1 seg2 seg3 seg4 seg5 seg6 value
0 20010101 1000 22 23 90 3
2 20010115 1001 23 34 32 32 4
4 20010223 2000 12 24 34 43 43 43 6
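On 3.5 million rows, a vectorized alternative that avoids groupby.apply is to sort and keep the last row per id. A sketch; ties on value are broken by the later date, which matches the requirement:
# sort so the highest value (and the latest date on ties) comes last
# within each id, then keep only that last row
result = (df.sort_values(['id', 'value', 'date'])
            .drop_duplicates('id', keep='last'))
print(result)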