I have a 'city' column which has more than 1000 unique entries. (The entries are integers for some reason and are currently assigned float type.)
I tried df['city'].value_counts()/len(df) to get their frequencies. It returned a table. The first few values were 0.12, .4, .4, .3...
I'm a complete beginner, so I'm not sure how to use this information to assign everything in, say, the bottom 10th percentile to 'other'.
I want to reduce the unique city values from 1000 to something like 10, so I can later use get_dummies on this.
Let's go through the logic of the expected actions:
Count the frequency of every city
Calculate the 10th-percentile frequency (the bottom decile cutoff)
Find the cities with frequencies at or below that cutoff
Change them to 'other'
You started in the right direction. To get frequencies for every city:
city_freq = (df['city'].value_counts())/df.shape[0]
We want to find the bottom 10%. We use pandas' quantile to do it:
bottom_decile = city_freq.quantile(q=0.1)
Now bottom_decile is a float representing the frequency value that separates the bottom 10% of cities from the rest. Cities with a frequency at or below that threshold:
less_freq_cities = city_freq[city_freq<=bottom_decile]
less_freq_cities will hold the entries for those cities. If you want to change their value in 'df' to "other", make sure to select only the 'city' column in .loc (otherwise the whole row gets overwritten):
df.loc[df["city"].isin(less_freq_cities.index.tolist()), "city"] = "other"
complete code:
city_freq = (df['city'].value_counts())/df.shape[0]
bottom_decile = city_freq.quantile(q=0.1)
less_freq_cities = city_freq[city_freq<=bottom_decile]
df.loc[df["city"].isin(less_freq_cities.index.tolist()), "city"] = "other"
This is how you replace the bottom 10% (or whatever fraction you want, just change the q param in quantile) with a value of your choice.
EDIT:
As suggested in a comment, to get normalized frequencies it's better to use
city_freq = df['city'].value_counts(normalize=True)
instead of dividing by the shape. But actually, we don't need normalized frequencies here: pandas' quantile will work even if they are not normalized. We can use
city_freq = df['city'].value_counts()
and it will still work.
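To tie this back to the goal of feeding the reduced column into get_dummies, here is a minimal end-to-end sketch (df and the 'city' column are from the question; q=0.1 and the dummy prefix are just illustrative choices):

import pandas as pd

# Frequency of each city; raw counts are fine, quantile does not need them normalized
city_freq = df['city'].value_counts()

# Cities whose frequency sits in the bottom 10% of all city frequencies
threshold = city_freq.quantile(q=0.1)
rare_cities = city_freq[city_freq <= threshold].index

# Collapse the rare cities into one 'other' category, then one-hot encode
df.loc[df['city'].isin(rare_cities), 'city'] = 'other'
dummies = pd.get_dummies(df['city'], prefix='city')

Note that lumping only the bottom 10% of ~1000 cities still leaves roughly 900 categories; to really get down to about 10 dummy columns you could instead keep only the most frequent cities, e.g. city_freq.nlargest(10).index, and map everything else to 'other'.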
I used the values 360D and 360 (which I thought were equivalent) for the window parameter of the .rolling() method. However, they produced different graphs. Could you please explain the difference between those two values?
rolling_stats = data.Ozone.rolling(window='360D').agg(['mean', 'std'])
stats = data.join(rolling_stats)
stats.plot(subplots=True)
plt.show()
rolling_stats = data.Ozone.rolling(window=360).agg(['mean', 'std'])
stats = data.join(rolling_stats)
stats.plot(subplots=True)
plt.show()
The difference is that with the string '360D' the window is a time span of 360 calendar days (this requires a datetime-like index), while with the integer 360 the window is a fixed count of observations, i.e. the last 360 rows. If your data has one reading per business day, 360 rows corresponds to roughly 360 business days, which is why the two plots look similar but not identical.
The other difference is the default for min_periods. With an offset window ('360D') it defaults to 1, so pandas produces a value from the very first row onward, with the window simply growing until it spans 360 days; for the first row the rolling mean is just data.Ozone.iloc[0] itself. With an integer window (360), min_periods defaults to the window size, so rolling(...).agg() returns NaN for the first 359 rows. That is why there is a gap at the start of your second plot but not in the first.
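As a quick illustration of the two window types (a minimal sketch with a made-up daily series, not your Ozone data):

import pandas as pd
import numpy as np

# A small daily series standing in for the Ozone column
idx = pd.date_range('2020-01-01', periods=10, freq='D')
s = pd.Series(np.arange(10, dtype=float), index=idx)

# Time-based window: spans 3 calendar days, min_periods defaults to 1,
# so a value is produced from the first row onward
print(s.rolling(window='3D').mean())

# Count-based window: exactly 3 observations, min_periods defaults to 3,
# so the first 2 results are NaN
print(s.rolling(window=3).mean())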
I have a really large dataset from beating two laser frequencies and reading out the beat frequency with a frequency counter.
The problem is that I have a lot of outliers in my dataset.
Filtering is not an option, since filtering/removing the outliers kills precious information for the Allan deviation I use to analyze my beat frequency.
The problem with removing the outliers is that I want to compare the Allan deviations of three different beat frequencies. If I now remove some points, I will have a shorter x-axis than before and my Allan deviation x-axis will scale differently. (The ADEV basically builds up a new x-axis starting with intervals of my sample rate up to my longest measurement time, which is my highest beat frequency x-axis value.)
Sorry if this is confusing; I wanted to give as much information as possible.
So anyway, what I did until now is get my whole Allan deviation to work and remove outliers successfully, by chopping my list into intervals and comparing all y-values of each interval to the standard deviation of that interval.
What I want to change now is that instead of removing the outliers, I want to replace them with the mean of their previous and next neighbours.
Below you can find my test code for a list with outliers; it seems to have a problem using numpy's where and I don't really understand why.
The error is given as "'numpy.int32' object has no attribute 'where'". Do I have to convert my dataset to a pandas structure?
What the code does is search for values above/below my threshold, replace them with NaN, and then replace the NaN with the mean of its neighbours. I'm not really familiar with NaN replacement, so I would be very grateful for any help.
import numpy as np

l = np.array([[0,4],[1,3],[2,25],[3,4],[4,28],[5,4],[6,3],[7,4],[8,4]])
print(*l)
sd = np.std(l[:,1])
print(sd)
for i in l[:,1]:
    if l[i,1] > sd:
        print(l[i,1])
        l[i,1].where(l[i,1].replace(to_replace=l[i,1], value=np.nan),
                     other=(l[i,1].fillna(method='ffill') + l[i,1].fillna(method='bfill')) / 2)
So what I want is a list/array with the outliers replaced by the means of their previous/following neighbours.
Error message: 'numpy.int32' object has no attribute 'where'
One option is indeed to move all the work into pandas, just with
import pandas as pd
dataset = pd.DataFrame({'Column1': l[:,0], 'Column2': l[:,1]})
That will solve the error, since a pandas DataFrame/Series has a where method (a plain numpy scalar like l[i,1] does not); see the sketch below.
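A minimal sketch of that pandas route, using the question's own threshold (any value larger than the standard deviation counts as an outlier) and replacing each outlier with the mean of its previous and next valid neighbours:

import numpy as np
import pandas as pd

l = np.array([[0,4],[1,3],[2,25],[3,4],[4,28],[5,4],[6,3],[7,4],[8,4]])
s = pd.Series(l[:, 1], dtype=float)

# Mark outliers as NaN (same rule as in the question: value > standard deviation)
s = s.mask(s > np.std(l[:, 1]))

# For a NaN, ffill gives the previous valid value and bfill the next one,
# so their average is the neighbour mean; non-NaN entries stay unchanged
s = (s.ffill() + s.bfill()) / 2
print(s.to_numpy())   # [4.  3.  3.5 4.  4.  4.  3.  4.  4. ]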
However, that is not obligatory and we can still operate with just numpy.
For example, an easy way to detect outliers is to check whether values fall outside the range mean ± k*std. Code example below, using your setting (with k = 2):
import numpy as np

l = np.array([[0,4],[1,3],[2,25],[3,4],[4,28],[5,4],[6,3],[7,4],[8,4]])
std = np.std(l[:,1])
mean = np.mean(l[:,1])
for i in range(len(l[:,1])):
    if (l[i,1] <= mean + 2*std) & (l[i,1] >= mean - 2*std):
        pass
    else:
        if (i != len(l[:,1]) - 1) & (i != 0):
            l[i,1] = (l[i-1,1] + l[i+1,1]) / 2
        else:
            l[i,1] = mean
What we do here is first check whether the value is an outlier, at the line
if (l[i,1] <= mean + 2*std) & (l[i,1] >= mean - 2*std):
    pass
Then we check that it is not the first or last element:
if (i != len(l[:,1]) - 1) & (i != 0):
If it is the first or last element, we just put the mean into that field:
else:
    l[i,1] = mean
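Since the question specifically asked about np.where: in numpy, where is a top-level function (np.where), not a method on scalar elements, which is exactly what the error message is complaining about. A vectorized sketch of the same neighbour-mean replacement (same mean ± 2*std rule; the edges simply fall back to the overall mean):

import numpy as np

l = np.array([[0,4],[1,3],[2,25],[3,4],[4,28],[5,4],[6,3],[7,4],[8,4]])
y = l[:, 1].astype(float)

mean, std = y.mean(), y.std()
outlier = (y > mean + 2 * std) | (y < mean - 2 * std)

# Mean of the previous and next neighbour for every interior point
neighbour_mean = np.empty_like(y)
neighbour_mean[1:-1] = (y[:-2] + y[2:]) / 2
neighbour_mean[[0, -1]] = mean            # fall back to the mean at the edges

# np.where picks the replacement wherever the outlier mask is True
y_clean = np.where(outlier, neighbour_mean, y)
print(y_clean)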
I need to confirm a few things related to the pandas exponential weighted moving average function.
If I have a data set df for which I need to find a 12 day exponential moving average, would the method below be correct?
exp_12 = df.ewm(span=20, min_periods=12, adjust=False).mean()
Given that the data set contains 20 readings, the span (total number of values) should equal 20.
Since I need to find a 12 day moving average, I set min_periods=12.
I interpret span as the total number of values in the data set, or the total time covered.
Can someone confirm whether my interpretation is correct?
I also can't work out the significance of adjust.
I've attached the link to the pandas.DataFrame.ewm documentation below.
http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.ewm.html
Quoting from Pandas docs:
Span corresponds to what is commonly called an “N-day EW moving average”.
In your case, set span=12.
You do not need to specify that you have 20 data points; pandas takes care of that. min_periods may not be required here.
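On adjust (which the question also asks about): per the pandas docs, adjust=True (the default) computes the EW average as a directly weighted average of all past observations, while adjust=False uses the recursive form y[t] = (1 - alpha) * y[t-1] + alpha * x[t]; the two converge for longer series. A minimal sketch with a made-up column name 'price' (span=12 is the only value taken from the question):

import pandas as pd
import numpy as np

# A made-up series of 20 readings standing in for the question's data
df = pd.DataFrame({'price': np.random.default_rng(0).normal(100, 5, 20)})

# 12-day EWMA: span=12 is what defines "12-day", regardless of having 20 rows
ewma_adjusted = df['price'].ewm(span=12, adjust=True).mean()
ewma_recursive = df['price'].ewm(span=12, adjust=False).mean()

# Optional: require at least 12 observations before emitting a value
ewma_min12 = df['price'].ewm(span=12, min_periods=12, adjust=False).mean()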
Presented as an example.
Two data sets. One collected over a 1 hour period. One collected over a 20 min period within that hour.
Each data set contains instances of events that can be transformed into single columns of true (-) or false (_), representing whether the event is occurring or not.
DS1.event:
_-__-_--___----_-__--_-__---__
DS2.event:
__--_-__--
I'm looking for a way to automate the correlation (correct me if the terminology is incorrect) of the two data sets and find the offset(s) into DS1 at which DS2 is most (top x many) likely to have occurred. This will probably end up with some matching percentage that I can then threshold to determine the validity of the match.
Such that
_-__-_--___----_-__--_-__---__
__--_-__--
DS1.start + 34min ~= DS2.start
Additional information:
DS1 was recorded at roughly 1 Hz. DS2 at roughly 30 Hz. This makes it less likely that there will be a 100% clean match.
Alternate methods (to pandas) will be appreciated, but python/pandas are what I have at my disposal.
Sounds like you just want something like a cross correlation?
I would first convert the strings to a numeric representation, so replace your - and _ with 1 and 0.
You can do that using the string's replace method, e.g. signal.replace("-", "1").replace("_", "0"), so that every character is a digit before the conversion below.
Convert them to a list or a numpy array:
event1 = [int(x) for x in signal1]
event2 = [int(x) for x in signal2]
Then calculate the cross correlation between them:
xcor = np.correlate(event1, event2, "full")
That will give you the cross correlation value at each time lag. You just want to find the largest value, and the time lag at which it happens:
nR = max(xcor)
maxLag = np.argmax(xcor) # I imported numpy as np here; with mode 'full', the offset into event1 is maxLag - (len(event2) - 1)
Giving you something like:
Cross correlation value: 5
Lag: 20
It sounds like you're more interested in the lag value here. What the lag tells you is essentially how many time/positional shifts are required to get the maximum cross-correlation value (degree of match) between your two signals.
You might want to take a look at the docs for np.correlate and np.convolve to decide which mode (full, same, or valid) you want to use, as that's determined by the length of your data and by what you want to happen when your signals are different lengths.
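Putting the pieces together as a runnable sketch (the event strings are the two from the question; the lag is reported in samples, since turning it into minutes depends on the sample spacing of DS1):

import numpy as np

ds1 = "_-__-_--___----_-__--_-__---__"
ds2 = "__--_-__--"

# Convert the -/_ strings to 1/0 arrays
event1 = np.array([int(c) for c in ds1.replace("-", "1").replace("_", "0")])
event2 = np.array([int(c) for c in ds2.replace("-", "1").replace("_", "0")])

# Full cross-correlation: one value per possible alignment of the two series
xcor = np.correlate(event1, event2, "full")

# Index of the best alignment, converted to an offset into DS1
best = np.argmax(xcor)
lag = best - (len(event2) - 1)
print("max correlation:", xcor[best], "at lag", lag, "samples into DS1")

# Top x candidate offsets instead of just the best one
top_x = 3
candidates = np.argsort(xcor)[::-1][:top_x] - (len(event2) - 1)
print("top", top_x, "candidate lags:", candidates)

Also note that since DS1 and DS2 were recorded at roughly 1 Hz and 30 Hz, you would want to resample them onto a common time base first; otherwise the lag in samples does not translate cleanly into a time offset.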