What measure of central tendency is better measure ? Mean or Median?

What measure of central tendency is better measure ? Mean or Median? - python

I have a dataset which has temperatures of different cities (total cities = 20).
Dataset:
Columns-> city1 city2 city3 .... city20
23 34 45 56
34 56 26 54
12 23 33 64
34 67 31 42
Now for each row I want to find the mean and want to check if 50% of data points in a particular row are less than mean or not. If there are datapoints which are less than mean then I make a separate column where I replace the entire row by mean otherwise by median.
In below code I am calculating mean and then I just use for loop to check if the 50% datapoints are less than mean or not. Is there any other smart way to do this ? My ultimate goal is to create a column and each cell in that column will have mean of all temperatures from that particular row if 50% datapoints are less than mean otherwise use median in the column cell.
Code:
mean1 = data.mean(axis=1)

For each row we compare the sum of different from mean and median , pick the less one , inyour case , row 1 to 3 we chose median, row 4 we chose mean
df['New']=np.where(df.sub(df.mean(1).values).pow(2).sum(1)>df.sub(df.median(1).values).pow(2).sum(1),df.median(1),df.mean(1))
df
Out[1429]:
city1 city2 city3 city20 New
0 23 34 45 56 39.5
1 34 56 26 54 42.5
2 12 23 33 64 33.0
3 34 67 31 42 38.0

Related

Calculating average/standard deviations of rows containing certain string in pandas dataframe

I have a large pandas dataframe read as table. I would like to calculate the means and standard deviations of the two different groups, CRPS and Age, so I can plot them in a bar plot with std deviations as the error bars.
I can get the mean calculated by just the Age column. I figured it's a for loop that I have to construct, but I don't know how to construct further than table["Age"].mean(), which just gives me the average of all data points' age values. This is where I need some guidance. I want to look in the group column, tell it to calculate the average and standard deviation for the ages of that group. So, an average and standard deviation value for the ages of the CRPS group, for example.
I have the first 25 rows down below just to show what the dataframe looks like. I also have imported numpy as np as well.
Group Age
0 CRPS 50
1 CRPS 59
2 CRPS 22
3 CRPS 48
4 CRPS 53
5 CRPS 48
6 CRPS 29
7 CRPS 44
8 CRPS 28
9 CRPS 42
10 CRPS 35
11 CONTROLS 54
12 CONTROLS 43
13 CRPS 50
14 CRPS 62
15 CONTROLS 64
16 CONTROLS 39
17 CRPS 40
18 CRPS 59
19 CRPS 46
20 CONTROLS 56
21 CRPS 21
22 CRPS 45
23 CONTROLS 41
24 CRPS 46
25 CONTROLS 35

I don't think you need a for-loop.
Instead, you might try something like:
table.iloc[table['Group'] == 'CRPS']['Age'].mean()
I haven't tested with your table, but I think that will work.
The idea is to first create a boolean array, which is true for row indices where the group field contains 'CRPS', then to select all of those rows using iloc, and finally to take the mean. You could iterate over all of the groups in the following way:
mean_age = dict()
for group in set(table['Group']):
mean_age[group] = table.iloc[table['Group'] == group]['Age'].mean()
Maybe this is where you intended to use a for loop.

I need help building new dataframe from old one, by applying method to each row, keeping same index and columns

I have a dataframe (df_input), and im trying to convert it to another dataframe (df_output), through applying a formula to each element in each row. The formula requires information about the the whole row (min, max, median).
df_input:
A B C D E F G H I J
2011-01-01 60 48 26 29 41 91 93 87 39 65
2011-01-02 88 52 24 99 1 27 12 26 64 87
2011-01-03 13 1 38 60 8 50 59 1 3 76
df_output:
F(A)F(B)F(C)F(D)F(E)F(F)F(G)F(H)F(I)F(J)
2011-01-01 93 54 45 52 8 94 65 37 2 53
2011-01-02 60 44 94 62 78 77 37 97 98 76
2011-01-03 53 58 16 63 60 9 31 44 79 35
Im trying to go from df_input to df_output, as above, after applying f(x) to each cell per row. The function foo is trying to map element x to f(x) by doing an OLS regression of the min, median and max of the row to some co-ordinates. This is done each period.
I'm aware that I iterate over the rows and then for each row apply the function to each element. Where i am struggling is getting the output of foo, into df_output.
for index, row in df_input.iterrows():
min=row.min()
max=row.max()
mean=row.mean()
#apply function to row
new_row = row.apply(lambda x: foo(x,min,max,mean)
#add this to df_output
help!
My current thinking is to build up the new df row by row? I'm trying to do that but im getting a lot of multiindex columns etc. Any pointers would be great.
thanks so much... merry xmas to you all.

Consider calculating row aggregates with DataFrame.* methods and then pass series values in a DataFrame.apply() across columns:
# ROW-WISE AGGREGATES
df['row_min'] = df.min(axis=1)
df['row_max'] = df.max(axis=1)
df['row_mean'] = df.mean(axis=1)
# COLUMN-WISE CALCULATION (DEFAULT axis=0)
new_df = df[list('ABCDEFGHIJ')].apply(lambda col: foo(col,
df['row_min'],
df['row_max'],
df['row_mean']))

Plot histogram using two columns (values, counts) in python dataframe

I have a dataframe having multiple columns in pairs: if one column is values then the adjacent column is the corresponding counts. I want to plot a histogram using values as x variable and counts as the frequency.
For example, I have the following columns:
Age Counts
60 1204
45 700
21 400
. .
. .
34 56
10 150
I want my code to bin the Age values in ten-year intervals between the maximum and minimum values and get the cumulative frequencies for each interval from the Counts column and then plot a histogram. Is there a way to do this using matplotlib ?
I have tried the following but in vain:
patient_dets.plot(x='PatientAge', y='PatientAgecounts', kind='hist')
(patient_dets is the dataframe with 'PatientAge' and 'PatientAgecounts' as columns)

I think you need Series.plot.bar:
patient_dets.set_index('PatientAge')['PatientAgecounts'].plot.bar()
If need bins, one possible solution is with pd.cut:
#helper df with min and max ages
df1 = pd.DataFrame({'G':['14 yo and younger','15-19','20-24','25-29','30-34',
'35-39','40-44','45-49','50-54','55-59','60-64','65+'],
'Min':[0, 15,20,25,30,35,40,45,50,55,60,65],
'Max':[14,19,24,29,34,39,44,49,54,59,64,120]})
print (df1)
G Max Min
0 14 yo and younger 14 0
1 15-19 19 15
2 20-24 24 20
3 25-29 29 25
4 30-34 34 30
5 35-39 39 35
6 40-44 44 40
7 45-49 49 45
8 50-54 54 50
9 55-59 59 55
10 60-64 64 60
11 65+ 120 65
cutoff = np.hstack([np.array(df1.Min[0]), df1.Max.values])
labels = df1.G.values
patient_dets['Groups'] = pd.cut(patient_dets.PatientAge, bins=cutoff, labels=labels, right=True, include_lowest=True)
print (patient_dets)
PatientAge PatientAgecounts Groups
0 60 1204 60-64
1 45 700 45-49
2 21 400 20-24
3 34 56 30-34
4 10 150 14 yo and younger
patient_dets.groupby(['PatientAge','Groups'])['PatientAgecounts'].sum().plot.bar()

You can use pd.cut() to bin your data, and then plot using the function plot('bar')
import numpy as np
nBins = 10
my_bins = np.linspace(patient_dets.Age.min(),patient_dets.Age.max(),nBins)
patient_dets.groupby(pd.cut(patient_dets.Age, bins =nBins)).sum()['Counts'].plot('bar')

Pandas timeseries bins and indexing

I have some experimental data collected from a number of samples at set time intervals, in a dataframe organised like so:
Studynumber Time Concentration
1 20 80
1 40 60
1 60 40
2 15 95
2 44 70
2 65 30
Although the time intervals are supposed to be fixed, there is some variation in the data based on when they were actually collected. I want to create bins of the Time column, calculate an 'average' concentration, and then compare the difference between actual concentration and average concentration for each studynumber, at each time.
To do this, I created a column called 'roundtime', then used a groupby to calculate the mean:
data['roundtime']=data['Time'].round(decimals=-1)
meanconc = data.groupby('roundtime')['Concentration'].mean()
This gives a pandas series of the mean concentrations, with roundtime as the index. Then I want to get this back into the main frame to calculate the difference between each actual concentration and the mean concentration:
data['meanconcentration']=meanconc.loc[data['roundtime']].reset_index()['Concentration']
This works for the first 60 or so values, but then returns NaN for each entry, I think because the index of data is longer than the index of meanconcentration.
On the one hand, this looks like an indexing issue - equally, it could be that I'm just approaching this the wrong way. So my question is: a) can this method work? and b) is there another/better way of doing it? All advice welcome!

Use transform to add a column from a groupby aggregation, this will create a Series with it's index aligned to the original df so you can assign it back correctly:
In [4]:
df['meanconcentration'] = df.groupby('roundtime')['Concentration'].transform('mean')
df
Out[4]:
Studynumber Time Concentration roundtime meanconcentration
0 1 20 80 20 87.5
1 1 40 60 40 65.0
2 1 60 40 60 35.0
3 2 15 95 20 87.5
4 2 44 70 40 65.0
5 2 65 30 60 35.0

efficiently find maxes of one column over id in a Pandas dataframe

I am working with a very large dataframe (3.5 million X 150 and takes 25 gigs of memory when unpickled) and I need to find maximum of one column over an id number and a date and keep only the row with the maximum value. Each row is a recorded observation for one id at a certain date and I also need the latest date.
This is animal test data where there are twenty additional columns seg1-seg20 for each id and date that are filled with test day information consecutively, for example, first test data fills seg1, second test data fills seg2 ect. The "value" field indicates how many segments have been filled, in other words how many tests have been done, so the row with the maximum "value" has the most test data. Ideally I only want these rows and not the previous rows. For example:
df= DataFrame({'id':[1000,1000,1001,2000,2000,2000],
"date":[20010101,20010201,20010115,20010203,20010223,20010220],
"value":[3,1,4,2,6,6],
"seg1":[22,76,23,45,12,53],
"seg2":[23,"",34,52,24,45],
"seg3":[90,"",32,"",34,54],
"seg4":["","",32,"",43,12],
"seg5":["","","","",43,21],
"seg6":["","","","",43,24]})
df
date id seg1 seg2 seg3 seg4 seg5 seg6 value
0 20010101 1000 22 23 90 3
1 20010201 1000 76 1
2 20010115 1001 23 34 32 32 4
3 20010203 2000 45 52 2
4 20010223 2000 12 24 34 43 43 41 6
5 20010220 2000 12 24 34 43 44 35 6
And eventually it should be:
date id seg1 seg2 seg3 seg4 seg5 seg6 value
0 20010101 1000 22 23 90 3
2 20010115 1001 23 34 32 32 4
4 20010223 2000 12 24 34 43 43 41 6
I first tried to use .groupby('id').max but couldnt find a way to use it to drop rows. The resulting dataframe MUST contain the ORIGINAL ROWS and not just the maximum value of each column with each id. My current solution is:
for i in df.id.unique():
df =df.drop(df.loc[df.id==i].sort(['value','date']).index[:-1])
But this takes around 10 seconds to run each time through, I assume because its trying to call up the entire dataframe each time through. There are 760,000 unique ids, each are 17 digits long, so it will take way too long to be feasible at this rate.
Is there another method that would be more efficient? Currently it reads every column in as an "object" but converting relevant columns to the lowest possible bit of integer doesnt seem to help either.

I tried with groupby('id').max() and it works, and it also drop the rows. Did you remeber to reassign the df variable? Because this operation (and almost all Pandas' operations) are not in-place.
If you do:
df.groupby('id', sort = False).max()
You will get:
date value
id
1000 20010201 3
1001 20010115 4
2000 20010223 6
And if you don't want id as the index, you do:
df.groupby('id', sort = False, as_index = False).max()
And you will get:
id date value
0 1000 20010201 3
1 1001 20010115 4
2 2000 20010223 6
I don't know if that's going to be much faster, though.
Update
This way the index will not be reseted:
df.iloc[df.groupby('id').apply(lambda x: x['value'].idxmax())]
And you will get:
date id seg1 seg2 seg3 seg4 seg5 seg6 value
0 20010101 1000 22 23 90 3
2 20010115 1001 23 34 32 32 4
4 20010223 2000 12 24 34 43 43 43 6

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

What measure of central tendency is better measure ? Mean or Median? - python

Related

Calculating average/standard deviations of rows containing certain string in pandas dataframe

I need help building new dataframe from old one, by applying method to each row, keeping same index and columns

Plot histogram using two columns (values, counts) in python dataframe

Pandas timeseries bins and indexing

efficiently find maxes of one column over id in a Pandas dataframe

Categories

Resources