Margin of Error for Complex Sample in Python

I have a weighted Stata dataset from a national survey (n=6342). The data have already been weighted, i.e. each respondent represents about 4000 people on average.
I am reading the dataset with the pandas.read_stata function. Basically, what I need to achieve is to extract the data for each question with the respective frequencies (%), along with the margin of error for each frequency.
I have written Python code to do it, and it works perfectly fine for the frequencies themselves, i.e. summing the weighted values for each answer and dividing by the total weight sum.
Pseudo-code looks like this:
q_5 = dataset['q5'].unique()
frequencies = {}
for value in q_5:
    variable = dataset[dataset['q5'] == value]
    freq = (variable['indwt'].sum() / weights_sum) * 100
    freq = round(freq, 0)
    frequencies[value] = freq
However, I cannot get proper confidence intervals or margins of error, since this is a complex sample.
I was advised to use R instead, but taking the syntax learning curve into consideration, I would rather stick with Python.
Is there any statistical package for Python that can calculate the margin of error for a complex sample?
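For illustration, a weights-only approximation (a rough sketch, not a full complex-sample estimator: it ignores strata and clusters and inflates the simple-random-sampling margin of error via Kish's effective sample size):
import numpy as np

def weighted_moe(weights, mask, z=1.96):
    # weighted proportion of respondents selected by mask
    w = np.asarray(weights, dtype=float)
    p = w[mask].sum() / w.sum()
    # Kish's effective sample size: n_eff = (sum w)^2 / sum(w^2)
    n_eff = w.sum() ** 2 / (w ** 2).sum()
    # half-width of the z-level confidence interval for p
    return z * np.sqrt(p * (1 - p) / n_eff)

# e.g. the margin of error for one answer category of q5
moe = weighted_moe(dataset['indwt'], dataset['q5'] == value)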

Related

Predict the future graph based on averages of given data

I am trying to make a future stock price forecaster. I am nearly done, but the final step has stumped me:
how do I predict the future of the graph based on the different averages of the given data?
# how it works up to now:
import numpy as np

stockprice = list(range(1, 10000))  # [1, 2, 3, ..., 9999]
# for every number in the stock price, add up the last x numbers
# (x would be an input) and divide them (calculate the average)
int_toSplitBy = 50                   # x, the window size (an input)
StockData_AverageFinder = stockprice
Averaged_StockData = np.array([])

StockDataSeperate = StockData_AverageFinder[-int_toSplitBy:-1]
Average = 0
for num in StockDataSeperate:
    Average += num
Average = Average / len(StockDataSeperate)
Averaged_StockData = np.append(Averaged_StockData, Average)
# doing this x amount of times and exponentiating the number to average by, by x
Using this data (the averaged stock price graphs), is it possible to predict the future of the raw data?
If anyone has any links or ideas I would be so grateful!
Obviously, using a moving average for future values does not work, since you don't have values beyond the present. In theory you would assume that near-term stock prices follow a random walk, so your best guess for a future value would be to simply predict the last known value.
However, a more "exciting" solution could be to train an LSTM by turning the stock price series into a supervised learning problem. It is important that you don't predict the price itself but the returns between the stock prices in your time series. Of course, you can also use the returns of moving averages as inputs, or even multiple moving averages, and conduct multivariate time series forecasting.
Hopefully I don't have to mention that stock price prediction is not that "easy" - it's a good exercise though.
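A minimal sketch of that supervised setup (assuming TensorFlow/Keras; the window length, network size, and training settings are arbitrary placeholders, not tuned values):
import numpy as np
import tensorflow as tf

prices = np.asarray(stockprice, dtype=float)
returns = np.diff(prices) / prices[:-1]           # predict returns, not prices

window = 10                                       # arbitrary look-back length
X = np.array([returns[i:i + window] for i in range(len(returns) - window)])
y = returns[window:]
X = X[..., np.newaxis]                            # (samples, timesteps, features)

model = tf.keras.Sequential([
    tf.keras.layers.LSTM(32, input_shape=(window, 1)),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")
model.fit(X, y, epochs=10, batch_size=32, verbose=0)

# one-step-ahead forecast of the next return
next_return = model.predict(returns[-window:].reshape(1, window, 1))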

How to interpolate based on previous behaviour in Python?

I have a DataFrame tracking temperature over time.
For a few days there was a problem and the sensor recorded 0, so the plot drops to zero over that stretch.
I replaced the 0s with NaNs and then used the interpolate method, but the result is not what I need, even with method='time'.
So how can I use a customised interpolation, or something similar, to correct this based on the previous behaviour?
Thank you
I would not interpolate. I would just take N elements before and after the gap, compute the average temperature, and fill the gap with random values drawn from a normal distribution around that average (you can use the standard deviation too).
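A rough sketch of that idea (my own illustration, assuming the bad readings are already NaN and the Series has a sorted index; n_context is an arbitrary choice):
import numpy as np
import pandas as pd

def fill_gaps_with_noise(series, n_context=48, rng=None):
    """Fill each NaN run with draws from a normal distribution fitted to the
    n_context valid points before and after the run."""
    rng = np.random.default_rng() if rng is None else rng
    s = series.copy()
    isna = s.isna()
    run_ids = (isna != isna.shift()).cumsum()      # label contiguous runs
    for _, gap in s[isna].groupby(run_ids[isna]):
        before = s.loc[:gap.index[0]].dropna().iloc[-n_context:]
        after = s.loc[gap.index[-1]:].dropna().iloc[:n_context]
        context = pd.concat([before, after])
        s.loc[gap.index] = rng.normal(context.mean(), context.std(), len(gap))
    return s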

How to normalize unix timestamps to sum to discrete numeric values?

I have an "ideal" formula in which I have a sum of values like
score[i] = SUM(properties[i]) * frequency[i] + recency[i]
with properties being a vector of values, and frequency and recency scalar values, taken from a given dataset of N items. While all the variables here are numeric with discrete integer values, the recency value is a UNIX timestamp in a given time range (like 1 month since now, or 1 week since now, etc., on a daily basis).
In the dataset, each item i has a date value expressed as recency[i], a frequency value frequency[i], and the list properties[i]. All properties of item[i] are therefore evaluated on each day expressed as recency[i] in the proposed time range.
According to this formula, the recency contribution to the score of item[i] is a negative one: the older the timestamp, the better the score (hence the + sign in that formula).
My idea was to use a re-scaler approach over the given range, like
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler(feature_range=(min(recencyVec), max(recencyVec)))
scaler = scaler.fit(values)
normalized = scaler.transform(values)
where recencyVec collects the recency values of all data points, min(recencyVec) is the first day, and max(recencyVec) is the last day.
This uses the scikit-learn MinMaxScaler, transforming the recency values by scaling each feature to the given range, as suggested in How to Normalize and Standardize Time Series Data in Python.
Is this the correct approach for this numerical formulation? What alternative approaches might be possible for normalizing the timestamp values when they are summed with other discrete numeric values?
Is recency then an absolute UNIX timestamp? Or do you already subtract the current timestamp? If not, then depending on your goal it might be sufficient to simply subtract the current UNIX timestamp from recency, so that it consistently describes "seconds from now", i.e. the time delta instead of the absolute UNIX time.
Of course, that would create quite a large score, but it will be consistent.
What scaling you use depends on your goal (what is an acceptable score?), but many options are valid as long as they are monotonic. In addition to min-max scaling (where I would suggest using 0 as the minimum and some known maximum time offset as the maximum), you might also want to consider a log transformation.
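A small sketch of both suggestions (recencyVec is the question's vector of UNIX timestamps; the rest is illustrative):
import time
import numpy as np

now = time.time()
recency_delta = now - np.asarray(recencyVec, dtype=float)  # "seconds from now"

# log1p is monotonic, so it preserves the ordering while compressing the range
recency_scaled = np.log1p(recency_delta)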

How can I replace outliers with the mean of the previous and next neighbours?

I have a really large dataset from beating two laser frequencies and reading out the beat frequency with a frequency counter.
The problem is that I have a lot of outliers in my dataset.
Filtering is not an option, since filtering/removing outliers kills precious information for the Allan deviation I use to analyse my beat frequency.
The problem with removing the outliers is that I want to compare the Allan deviations of three different beat frequencies. If I remove some points, I will have a shorter x-axis than before and my Allan deviation x-axis will scale differently. (The ADEV basically builds up a new x-axis starting with intervals of my sample rate up to my longest measurement time -> which is my highest beat frequency x-axis value.)
Sorry if this is confusing, I wanted to give as much information as possible.
So anyway, what I did until now is: I got my whole Allan deviation to work and removed outliers successfully, chopping my list into intervals and comparing all y-values of each interval to the standard deviation of the interval.
What I want to change now is that instead of removing the outliers, I want to replace them with the mean of their previous and next neighbours.
Below you can find my test code for a list with outliers; it seems to have a problem using numpy's where, and I don't really understand why.
The error is given as "'numpy.int32' object has no attribute 'where'". Do I have to convert my dataset to a pandas structure?
What the code does is search for values above/below my threshold, replace them with NaN, and then replace the NaN with my mean. I'm not really familiar with NaN replacement, so I would be very grateful for any help.
import numpy as np

l = np.array([[0,4],[1,3],[2,25],[3,4],[4,28],[5,4],[6,3],[7,4],[8,4]])
print(*l)
sd = np.std(l[:,1])
print(sd)
for i in l[:,1]:
    if l[i,1] > sd:
        print(l[i,1])
        l[i,1].where(l[i,1].replace(to_replace = l[i,1], value = np.nan),
                     other = (l[i,1].fillna(method='ffill') + l[i,1].fillna(method='bfill'))/2)
So what I want is a list/array with the outliers replaced by the means of their previous/following neighbours.
Error message: 'numpy.int32' object has no attribute 'where'
One option is indeed to move all the work into pandas with just
import pandas as pd
dataset = pd.DataFrame({'Column1':data[:,0],'Column2':data[:,1]})
That will solve the error, as the pandas DataFrame object has a where method.
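For instance, a sketch of that pandas route (mask plus forward/backward fill; my illustration, not part of the original answer):
import numpy as np
import pandas as pd

data = np.array([[0,4],[1,3],[2,25],[3,4],[4,28],[5,4],[6,3],[7,4],[8,4]])
dataset = pd.DataFrame({'Column1': data[:,0], 'Column2': data[:,1]})

col = dataset['Column2'].astype(float)
# mask() turns outliers into NaN; each NaN then becomes the mean of its
# nearest valid neighbours via forward- and backward-fill
masked = col.mask((col - col.mean()).abs() > 2 * np.std(col))
dataset['Column2'] = masked.fillna((masked.ffill() + masked.bfill()) / 2)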
However, that is not obligatory, and we can still operate with just numpy.
For example, the easiest way to detect outliers is to check whether a value falls outside the range mean ± k·std (the code below uses k = 2).
Code example below, using your setting:
import numpy as np

l = np.array([[0,4],[1,3],[2,25],[3,4],[4,28],[5,4],[6,3],[7,4],[8,4]])
std = np.std(l[:,1])
mean = np.mean(l[:,1])
for i in range(len(l[:,1])):
    if (l[i,1] <= mean + 2*std) & (l[i,1] >= mean - 2*std):
        pass
    else:
        if (i != len(l[:,1]) - 1) & (i != 0):
            l[i,1] = (l[i-1,1] + l[i+1,1]) / 2
        else:
            l[i,1] = mean
What we did here: first, check whether the value is an outlier, at the line
if (l[i,1] <= mean + 2*std) & (l[i,1] >= mean - 2*std):
    pass
Then check whether it is the first or last element:
if (i != len(l[:,1]) - 1) & (i != 0):
If it is, just put the mean in that field:
else:
    l[i,1] = mean

Exponential Weighted Moving Average using Pandas

I need to confirm a few things related to the pandas exponential weighted moving average function.
If I have a data set df for which I need to find a 12-day exponential moving average, would the method below be correct?
exp_12 = df.ewm(span=20, min_periods=12, adjust=False).mean()
Given that the data set contains 20 readings, the span (total number of values) should equal 20.
Since I need to find a 12-day moving average, min_periods=12.
I interpret span as the total number of values in a data set, or the total time covered.
Can someone confirm if my interpretation above is correct?
I also can't grasp the significance of adjust.
I've attached the link to the pandas.DataFrame.ewm documentation below.
http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.ewm.html
Quoting from the pandas docs:
Span corresponds to what is commonly called an “N-day EW moving average”.
In your case, set span=12.
You do not need to specify that you have 20 data points; pandas takes care of that. min_periods may not be required here.
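For example (a tiny illustration with dummy data rather than the asker's df):
import pandas as pd

readings = pd.Series(range(1, 21), dtype=float)      # 20 readings
exp_12 = readings.ewm(span=12, adjust=False).mean()  # 12-day EW moving average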
