Adding confidence intervals for population rates in a dataframe - python

I have a dataframe where I have created a new column that sums the values of the first three columns (dates). I then calculated a rate for each row based on the population column.
I would like to create lower and upper 95% confidence limits for the "sum_of_days_rate" for each row in this dataset.
I can compute a mean of the first three columns, but I am not sure how to compute lower and upper bounds for the rate based on their sum.
Sample of the dataset below:
import pandas as pd

data = {'09/01/2021': [74, 84, 38],
        '10/11/2021': [43, 35, 35],
        '12/01/2021': [35, 37, 16],
        'population': [23000, 69000, 48000]}
df = pd.DataFrame(data, columns=['09/01/2021', '10/11/2021', '12/01/2021', 'population'])

# Total events across the three date columns, then rate per 100,000 population
df['sum_of_days'] = df.loc[:, df.columns[0:3]].sum(axis=1)
df['sum_of_days_rate'] = df['sum_of_days'] / df['population'] * 100000

To estimate a confidence interval you need to make certain assumptions about the data: how it is distributed, or what the associated error would be. I am not sure what your data points mean or why you are summing them.
A commonly used distribution for rates is the Poisson distribution, and you can construct the confidence interval given a mean:
import scipy.stats

# 95% confidence interval for a Poisson-distributed rate with the given mean
lb, ub = scipy.stats.poisson.interval(0.95, df.sum_of_days_rate)
df['lb'] = lb
df['ub'] = ub
The arrays lb and ub are the lower and upper bounds of the 95% confidence interval. The final data frame looks like this:
   09/01/2021  10/11/2021  12/01/2021  population  sum_of_days  sum_of_days_rate     lb     ub
0          74          43          35       23000          152        660.869565  611.0  712.0
1          84          35          37       69000          156        226.086957  197.0  256.0
2          38          35          16       48000           89        185.416667  159.0  213.0
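As a quick check on what scipy.stats.poisson.interval returns here, the bounds can be reproduced from the Poisson quantile function (ppf), i.e. the 2.5% and 97.5% quantiles of a Poisson distribution with that mean. A sketch using the first row's rate from the table above:
from scipy.stats import poisson

mean_rate = 660.869565                   # first row's sum_of_days_rate
lower = poisson.ppf(0.025, mean_rate)    # ~611.0, matches lb above
upper = poisson.ppf(0.975, mean_rate)    # ~712.0, matches ub above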

Related

Errors attempting to use linearmodels.panel.PanelOLS entity effects (not time effects)

I have a Pandas DataFrame like (abridged):
        age  gender  control  county
11877  67.0       F        0  AL-Calhoun
11552  60.0       F        0  AL-Coosa
11607  60.0       F        0  AL-Talladega
13821   NaN     NaN        1  AL-Mobile
11462  59.0       F        0  AL-Dale
I want to run a linear regression with fixed effects by county entity (not by time) to balance-check my control and treatment groups for an experimental design, such that my dependent variable is membership in the treatment group (control = 1) or not (control = 0).
From what I have seen, to do this I need to use linearmodels.panel.PanelOLS and set my entity field (county) as my index.
So far as I'm aware my model should look like this:
# set index on entity effects field:
to_model = to_model.set_index(["county"])
# implement fixed effects linear model
model = PanelOLS.from_formula("control ~ age + gender + EntityEffects", to_model)
When I try to do this, I get the below error:
ValueError: The index on the time dimension must be either numeric or date-like
I have seen a lot of implementations of such models online and they all seem to use a temporal effect, which is not relevant in my case. If I try to encode my county field using numerics, I get a different error.
# create a dict to map county values to numerics
county_map = dict(zip(to_model["county"].unique(), range(len(to_model.county.unique()))))
# create a numeric column as alternative to county
to_model["county_numeric"] = to_model["county"].map(county_map)
# set index on numeric entity effects field
to_model = to_model.set_index(["county_numeric"])
FactorEvaluationError: Unable to evaluate factor `control`. [KeyError: 'control']
How am I able to implement this model using the county as a unit fixed effect?
Assuming you have multiple entries for each county, you could use the following. The key step is to use a groupby transform to create a distinct numeric index for each county, which can be used as a fake time index.
import numpy as np
import pandas as pd
import string
import linearmodels as lm

# Generate a random DF
rs = np.random.default_rng(1213892)
counties = rs.choice([c for c in string.ascii_lowercase], (1000, 3))
counties = np.array([["".join(c)] * 10 for c in counties]).ravel()
age = rs.integers(18, 65, (10 * 1000))
gender = rs.choice(["m", "f"], size=(10 * 1000))
control = rs.integers(0, 2, size=10 * 1000)
df = pd.DataFrame(
    {"counties": counties, "age": age, "gender": gender, "control": control}
)

# Construct a dummy numeric index for each county
numeric_index = df.groupby("counties").age.transform(lambda c: np.arange(len(c)))
df["numeric_index"] = numeric_index
df = df.set_index(["counties", "numeric_index"])

# Take a look
df.head(15)
                        age gender  control
counties numeric_index
qbt      0               51      m        1
         1               36      m        0
         2               28      f        1
         3               28      m        0
         4               47      m        0
         5               19      m        1
         6               32      m        1
         7               54      m        0
         8               36      m        1
         9               52      m        0
nub      0               19      m        0
         1               57      m        0
         2               49      f        0
         3               53      m        1
         4               30      f        0
This just shows that the model can be estimated.
# Fit the model
# Note: the results are meaningless; this just shows that it works
mod = lm.PanelOLS.from_formula("control ~ age + gender + EntityEffects", data=df)
mod.fit()
                          PanelOLS Estimation Summary
================================================================================
Dep. Variable:                control   R-squared:                        0.0003
Estimator:                   PanelOLS   R-squared (Between):              0.0005
No. Observations:               10000   R-squared (Within):               0.0003
Date:                Thu, May 12 2022   R-squared (Overall):              0.0003
Time:                        11:08:00   Log-likelihood                   -6768.3
Cov. Estimator:            Unadjusted
                                        F-statistic:                      1.4248
Entities:                         962   P-value                           0.2406
Avg Obs:                       10.395   Distribution:                  F(2,9036)
Min Obs:                      10.0000
Max Obs:                       30.000   F-statistic (robust):             2287.4
                                        P-value                           0.0000
Time periods:                      30   Distribution:                  F(2,9036)
Avg Obs:                       333.33
Min Obs:                       2.0000
Max Obs:                       962.00

                             Parameter Estimates
===============================================================================
             Parameter  Std. Err.     T-stat    P-value    Lower CI    Upper CI
-------------------------------------------------------------------------------
age            -0.0002     0.0004    -0.5142     0.6072     -0.0010      0.0006
gender[T.f]     0.5191     0.0176     29.559     0.0000      0.4847      0.5535
gender[T.m]     0.5021     0.0175     28.652     0.0000      0.4678      0.5365
===============================================================================

F-test for Poolability: 0.9633
P-value: 0.7768
Distribution: F(961,9036)

Included effects: Entity
PanelEffectsResults, id: 0x2246f38a9d0
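Applied to the question's own to_model frame, the same trick would look roughly like this (a sketch, assuming to_model has county, age, gender and control columns as in the question; groupby().cumcount() produces the same per-group 0..n-1 counter as the transform above):
import linearmodels as lm

# Per-county running counter, used purely as a dummy "time" index
to_model["obs"] = to_model.groupby("county").cumcount()
to_model = to_model.set_index(["county", "obs"])

mod = lm.PanelOLS.from_formula("control ~ age + gender + EntityEffects", data=to_model)
print(mod.fit())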

How to sample data from Pandas Dataframe based on value count from another column

I have a dataframe of about 400,000 observations. I want to sample 50,000 observations, proportional to how often each state appears in a 'state' column. So if 5% of all observations are from TX, then 2,500 of the samples should be from TX, and so on.
I tried the following:
import pandas as pd
df.sample(n=50000, weights = 'state', random_state = 101)
That gave me this error.
TypeError: '<' not supported between instances of 'str' and 'int'
Is there a different way to do this?
Weights modify the probability of any one row to be selected, but can’t provide strict guarantees on counts of given values, as you want. For that you would need .groupby('state'):
>>> rate = df['state'].value_counts(normalize=True)
>>> rate
TX 0.5
NY 0.3
CA 0.2
>>> df.groupby('state').apply(lambda s: s.sample(int(10 * rate[s.name]))).droplevel('state')
state val
69 CA 33
19 CA 99
37 NY 89
36 NY 63
75 NY 3
42 TX 42
53 TX 52
50 TX 68
72 TX 70
2 TX 18
Replace 10 with the number of samples you want, i.e. 50_000. This gives slightly more flexibility than the more efficient answer by @Psidom below.
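For example, a concrete variant targeting 50,000 rows in total might look like this (a sketch; group_keys=False avoids the extra group index level, so the droplevel step above is not needed, and the rounded per-group counts may add up to a few rows more or less than 50,000):
n_total = 50_000
rate = df['state'].value_counts(normalize=True)
sampled = (df.groupby('state', group_keys=False)
             .apply(lambda s: s.sample(int(round(n_total * rate[s.name])),
                                       random_state=101)))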
You can use groupby.sample:
df.groupby('state').sample(frac=0.125, random_state=101)
The weights parameter works differently from groups: it expects a list of numbers to use as sampling probabilities, which is useful when you want unequal probability weighting across rows.
For instance, the following sample will always return a data frame drawn from the first two rows, since the last two rows have weights of 0 and will never be selected:
df = pd.DataFrame({'a': [1,2,3,4]})
df.sample(n=2, weights=[0.5,0.5,0,0])
a
0 1
1 2
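To sanity-check either approach, you could compare the state proportions in the sample with those in the full data (a quick sketch, assuming the 'state' column from the question):
sampled = df.groupby('state').sample(frac=0.125, random_state=101)

# The sample proportions should closely match the full-data proportions
print(df['state'].value_counts(normalize=True))
print(sampled['state'].value_counts(normalize=True))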

Applying `pd.qcut` on multiple columns

I have a DataFrame containing 2 columns x and y that represent coordinates in a Cartesian system. I want to obtain groups with an even (or almost even) number of points. I was thinking about using pd.qcut(), but as far as I can tell it can be applied to only one column at a time.
For example, I would like to divide the whole set of points into 4 intervals in x and 4 intervals in y (the numbers might not be exactly equal) so that each group has a roughly even number of points. I expect to see 16 intervals in total (4x4).
I tried a very direct approach, which obviously didn't produce the right result (compare the counts of 51 and 99, for example). Here is the code:
df['x_bin']=pd.qcut(df.x,4)
df['y_bin']=pd.qcut(df.y,4)
grouped=df.groupby([df.x_bin,df.y_bin]).count()
print(grouped)
The output:
x_bin y_bin
(7.976999999999999, 7.984] (-219.17600000000002, -219.17] 51 51
(-219.17, -219.167] 60 60
(-219.167, -219.16] 64 64
(-219.16, -219.154] 99 99
(7.984, 7.986] (-219.17600000000002, -219.17] 76 76
(-219.17, -219.167] 81 81
(-219.167, -219.16] 63 63
(-219.16, -219.154] 53 53
(7.986, 7.989] (-219.17600000000002, -219.17] 78 78
(-219.17, -219.167] 77 77
(-219.167, -219.16] 68 68
(-219.16, -219.154] 51 51
(7.989, 7.993] (-219.17600000000002, -219.17] 70 70
(-219.17, -219.167] 55 55
(-219.167, -219.16] 77 77
(-219.16, -219.154] 71 71
Am I making a mistake in thinking it is possible to do with pandas only or am I missing something else?
The problem is that the distribution of the rows along x might not be the same as the distribution along y.
You are empirically mimicking a correlation analysis and finding that there is a slight negative relation... the y values are higher at the lower end of the x scale and rather flat at the higher end of x.
So, if you want an even number of data points in each bin, I would suggest splitting the df into x bins and then applying qcut on y within each x bin (so the y bins have different cut points but even sample sizes).
Edit
Something like:
split_df = [(xbin, xdf) for xbin, xdf in df.groupby(pd.qcut(df.x, 4))]  # no aggregation so far, just splitting the df evenly on x
split_df = [(xbin, xdf.groupby(pd.qcut(xdf.y, 4)).x.size())
            for xbin, xdf in split_df]  # now each xdf is evenly cut on y
Now you need to work on each xdf separately; attempting to concatenate all the xdfs will result in an error. The index of each xdf is a CategoricalIndex, and the first xdf would need to contain all categories for concat to work (i.e. split_df[0][1].index must include the bins of all the other xdfs). Alternatively, you could change the index to the center of each interval as a float64, for both the x bins and the y bins.
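One way to keep everything in a single frame (a sketch, assuming df has numeric x and y columns as in the question) is to compute the y bins group-wise with transform and assign them back; labels=False makes the integer bin labels comparable even though the cut points differ per x bin:
import pandas as pd

df['x_bin'] = pd.qcut(df.x, 4)
# Cut y separately within each x bin so every (x_bin, y_bin) cell gets
# roughly the same number of points
df['y_bin'] = df.groupby('x_bin')['y'].transform(lambda s: pd.qcut(s, 4, labels=False))
print(df.groupby(['x_bin', 'y_bin']).size())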

Pandas: how to apply weight column to create a new dataframe with weighted data

I have this dataset from US Census Bureau with weighted data:
    Weight  Income  ......
2      136   72000
5       18   18000
10      21   65000
11      12   57000
23      43   25700
The first person represents 136 people, the second 18 and so on. There are a lot of other columns and I need to produce several charts and calculations. It would be too much work to apply the weight every time I need to build a chart, pivot table, etc.
Ideally, I would like to use this:
df2 = df.iloc[np.repeat(df.index.values, df.PERWT)]
To create an unweighted or flat dataframe.
This produces a new large (1.4GB) dataframe:
Weight Wage
0 136 72000
0 136 72000
0 136 72000
0 136 72000
0 136 72000
.....
The thing is that using all the columns of the dataset, my computer runs out of memory.
Any idea on how to use the weights to create a new weighted dataframe?
I've tried this:
df2 = df.sample(frac=1, weights=df['Weight'])
But it seems to produce the same data. Changing frac to 0.5 could be a solution, but I'll lose 50% of the information.
Thanks!
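For reference, a minimal sketch of the np.repeat expansion applied to the small sample above (the Weight/Income names are taken from the sample; selecting only the columns you actually need before expanding keeps the memory footprint down):
import numpy as np
import pandas as pd

df = pd.DataFrame({'Weight': [136, 18, 21, 12, 43],
                   'Income': [72000, 18000, 65000, 57000, 25700]})

# Repeat each row Weight times to get the "flat", unweighted frame
flat = df[['Income']].iloc[np.repeat(np.arange(len(df)), df['Weight'])]
print(len(flat))  # 230 rows = 136 + 18 + 21 + 12 + 43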

Pandas timeseries bins and indexing

I have some experimental data collected from a number of samples at set time intervals, in a dataframe organised like so:
Studynumber Time Concentration
1 20 80
1 40 60
1 60 40
2 15 95
2 44 70
2 65 30
Although the time intervals are supposed to be fixed, there is some variation in the data based on when they were actually collected. I want to create bins of the Time column, calculate an 'average' concentration, and then compare the difference between actual concentration and average concentration for each studynumber, at each time.
To do this, I created a column called 'roundtime', then used a groupby to calculate the mean:
data['roundtime']=data['Time'].round(decimals=-1)
meanconc = data.groupby('roundtime')['Concentration'].mean()
This gives a pandas series of the mean concentrations, with roundtime as the index. Then I want to get this back into the main frame to calculate the difference between each actual concentration and the mean concentration:
data['meanconcentration']=meanconc.loc[data['roundtime']].reset_index()['Concentration']
This works for the first 60 or so values, but then returns NaN for each entry, I think because the index of data is longer than the index of meanconcentration.
On the one hand, this looks like an indexing issue - equally, it could be that I'm just approaching this the wrong way. So my question is: a) can this method work? and b) is there another/better way of doing it? All advice welcome!
Use transform to add a column from a groupby aggregation; this creates a Series with its index aligned to the original df, so you can assign it back correctly:
In [4]:
df['meanconcentration'] = df.groupby('roundtime')['Concentration'].transform('mean')
df
Out[4]:
Studynumber Time Concentration roundtime meanconcentration
0 1 20 80 20 87.5
1 1 40 60 40 65.0
2 1 60 40 60 35.0
3 2 15 95 20 87.5
4 2 44 70 40 65.0
5 2 65 30 60 35.0
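From there, the difference the question asks for (actual minus bin average) is just a column subtraction, as a quick follow-up sketch:
# Difference between each measured concentration and its time-bin average
df['diff_from_mean'] = df['Concentration'] - df['meanconcentration']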
