I need some help with calculating the confidence interval for a range of sample sizes and their corresponding population sizes. I have a data frame with 3 columns: one column has the country name, one has the sample size of a survey that was done in that country, and one has the population size. I want to iterate through those sample sizes and population sizes and calculate the confidence interval for each sample. The only thing is, I have no idea where to start.
Basically, I want to build something like the 'find confidence interval' calculator (the 2nd one) on this page: http://www.surveysystem.com/sscalc.htm, except that I want to pass a list of sample sizes and population sizes. I hope you guys can help! Thank you in advance.
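For illustration, something along these lines is what I have in mind (a rough sketch, assuming the usual normal-approximation formula with a finite population correction, which I believe is what that calculator uses, a 95% confidence level, a worst-case 50% proportion, and made-up column names 'country', 'sample_size' and 'population'):
# Rough sketch; the column names, the 95% level and p = 0.5 are assumptions
import numpy as np
import pandas as pd
from scipy.stats import norm

df = pd.DataFrame({'country': ['A', 'B', 'C'],
                   'sample_size': [384, 1000, 278],
                   'population': [100000, 5000, 1000]})

z = norm.ppf(0.975)   # two-sided 95% confidence, about 1.96
p = 0.5               # worst-case proportion

n = df['sample_size']
N = df['population']
# standard error with finite population correction
se = np.sqrt(p * (1 - p) / n) * np.sqrt((N - n) / (N - 1))
df['ci'] = z * se * 100   # confidence interval as +/- percentage points
print(df)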
I need 2 lists (one for the highest teacher-student ratios and the other for the lowest) containing the codes of the 10 districts whose schools have the highest and lowest teacher/student ratio, respectively.
This is the dataset; the columns I need to work on are underlined in red.
I need to find the 10 highest values of dataset.students / dataset.teachers, but I also have to associate each ratio with its dataset.district row.
I tried but I can't get around this.
Please help me.
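To be concrete, something like this is what I think I need (a rough sketch with made-up data; the column names district, students and teachers are just my guesses at what they're called):
# Rough sketch with made-up district codes; column names are assumptions
import pandas as pd

dataset = pd.DataFrame({'district': ['D01', 'D02', 'D03', 'D04'],
                        'students': [1200, 800, 950, 400],
                        'teachers': [60, 55, 30, 35]})

dataset['ratio'] = dataset['students'] / dataset['teachers']

# district codes for the 10 highest and 10 lowest ratios
highest = dataset.nlargest(10, 'ratio')['district'].tolist()
lowest = dataset.nsmallest(10, 'ratio')['district'].tolist()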
Here is a dummy example of the DF I'm working with. It effectively comprises binned data, where the first column gives a category and the second column the number of individuals in that category.
df = pd.DataFrame(data={'Category': ['A','B','C','D','E','F','G','H','I'],
                        'Count': [1000,200,850,350,4000,20,35,4585,2]})
I want to take a random sample, say of 100 individuals, from these data. So for example my random sample could be:
sample1 = pd.DataFrame(data={'Category': ['A','B','C','D','E','F','G','H','I'],
                             'Count': [15,2,4,4,35,0,15,25,0]})
I.e. the sample cannot contain more individuals than are actually in any of the categories. Sampling 0 individuals from a category is possible (and more likely for categories with a lower Count).
How could I go about doing this? I feel like there must be a simple answer but I can't think of it!
Thank you in advance!
You can try sampling with replacement, weighted by the counts:
df.sample(n=100, replace=True, weights=df.Count).groupby(by='Category').count()
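If you need a hard guarantee that no category is ever drawn more times than its Count (sampling with replacement can, in principle, exceed it), one alternative sketch is to expand the counts into one row per individual and sample those without replacement:
# Sketch: expand counts into individual rows, then sample without replacement
import pandas as pd

df = pd.DataFrame(data={'Category': ['A','B','C','D','E','F','G','H','I'],
                        'Count': [1000,200,850,350,4000,20,35,4585,2]})

individuals = df.loc[df.index.repeat(df['Count']), 'Category']
drawn = individuals.sample(n=100).value_counts()

sample1 = df[['Category']].copy()
sample1['Count'] = sample1['Category'].map(drawn).fillna(0).astype(int)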
I'm using the diamonds dataset; below are the columns.
Question: create bins having equal population. Also generate a report that contains a cross tab between the bins and cut, representing the number in each cell as a percentage of the total.
Being a beginner, I created the Volume column and tried to create bins with equal population using qcut, but I'm not able to proceed further. Could someone help me out with an approach to solve this?
pd.qcut(diamond['Volume'], q=4)
You are on the right path: pd.qcut() attempts to break the data you provide into q equal-sized bins (though it may have to adjust a little, depending on the shape of your data).
pd.qcut() also lets you specify labels=False as an argument, which will give you back the number of the bin into which each observation falls. This is a little confusing, so here's a quick explanation: you could pass labels=['A','B','C','D'] (given your request for 4 bins), which would return the label of the bin into which each row falls. By telling pd.qcut that you don't have labels to give the bins, the function returns a bin number instead, just without a specific label. Otherwise (with the default labels), the function gives back the interval of values into which each observation (row) fell.
The reason you want the bin number is because of your next request: a cross-tab for the bin-indicator column and cut. First, create a column with the bin numbering:
diamond['binned_volume'] = pd.qcut(diamond['Volume'], q=4, labels=False)
Next, use pd.crosstab() to get your table:
pd.crosstab(diamond['binned_volume'], diamond['cut'], normalize=True)
The normalize=True argument makes the table express each entry as a fraction of the grand total, which is the last part of your question, I believe.
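Putting it together, a minimal sketch might look like this (I'm assuming Volume was built as x * y * z and loading diamonds from seaborn's sample datasets just to have something runnable; adjust to however you load yours):
# Sketch: equal-population bins on Volume, then a percentage cross-tab vs. cut
import pandas as pd
import seaborn as sns

diamond = sns.load_dataset('diamonds')                 # any copy of the diamonds data will do
diamond['Volume'] = diamond['x'] * diamond['y'] * diamond['z']   # assumption: Volume = x*y*z

diamond['binned_volume'] = pd.qcut(diamond['Volume'], q=4, labels=False)

# each cell as a percentage of the grand total
report = pd.crosstab(diamond['binned_volume'], diamond['cut'], normalize=True) * 100
print(report.round(2))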
I'd like to write code that creates a histogram from a matrix of data containing information about movies. The data set (matrix) contains several columns, and I'm interested in the column that contains movie release years and another column that says whether or not they pass the Bechdel test (the data set uses "Pass" and "Fail" as indicators of whether a movie passed or failed the test). Knowing the column numbers of these two columns (release year and pass/fail), how can I create a histogram of the movies that fail the test, with the x axis containing bins of movie years? The bin sizes are not too important; whatever pyplot defaults to would be fine.
What I can do (which is not a lot) is this:
plt.hist(year_by_Test_binary[:,0])
which creates a pretty but meaningless histogram of how many movies were released in bins of years (the matrix has years in the 0th column).
If you couldn't already tell, I am python-illiterate and struggling. Any help would be appreciated.
Assuming n is the index of the Bechdel-test column and your data is a NumPy array:
plt.hist([matrix[matrix[:,n] == 'Pass', 0], matrix[matrix[:,n] == 'Fail', 0]])
We're giving pyplot two vectors of years, one for movies that pass and one for movies that fail. It will then draw a histogram for each category, so you can visually compare the two.
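For a self-contained illustration, here's a tiny made-up matrix with years in column 0 and the Pass/Fail flag in column n = 1 (your real array and n will differ):
# Tiny made-up example; the real matrix and column index n will differ
import numpy as np
import matplotlib.pyplot as plt

matrix = np.array([[1994, 'Pass'], [2001, 'Fail'], [2008, 'Fail'],
                   [2010, 'Pass'], [2015, 'Fail']], dtype=object)
n = 1

years_pass = matrix[matrix[:, n] == 'Pass', 0].astype(int)
years_fail = matrix[matrix[:, n] == 'Fail', 0].astype(int)

plt.hist([years_pass, years_fail], label=['Pass', 'Fail'])
plt.xlabel('Release year')
plt.legend()
plt.show()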
To convert your data to a matrix, use:
numpy.asarray(data)
and to plot it you can use:
plt.plot(data)
for a line plot, or
plt.hist(data, bins)
for a histogram, where bins is the number of bins for your data.
I need to confirm a few things related to the pandas exponentially weighted moving average function.
If I have a data set df for which I need to find a 12-day exponential moving average, would the method below be correct?
exp_12 = df.ewm(span=20, min_periods=12, adjust=False).mean()
Given that the data set contains 20 readings, the span (total number of values) should equal 20.
Since I need to find a 12-day moving average, min_periods=12.
I interpret span as the total number of values in the data set, or the total time covered.
Can someone confirm if my above interpretation is correct?
I can't grasp the significance of adjust.
I've attached the link to the pandas DataFrame.ewm documentation below.
http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.ewm.html
Quoting from Pandas docs:
Span corresponds to what is commonly called an “N-day EW moving average”.
In your case, set span=12.
You do not need to specify that you have 20 data points; pandas takes care of that. min_periods may not be required here.
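A minimal sketch with made-up data, just to show the call:
# 20 made-up readings; span=12 gives the "12-day EW moving average"
import pandas as pd

df = pd.DataFrame({'price': range(1, 21)})

# adjust=False uses the recursive form y[t] = (1 - alpha) * y[t-1] + alpha * x[t]
exp_12 = df.ewm(span=12, adjust=False).mean()
print(exp_12.tail())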