I need some help with a very quick calculation. In the denominator line below I need to get the sum of the occurances series, yet I only need to sum over entries whose index is above a certain value; for example, I need the sum of all of them, but excluding the counts stored at index 2 and below. So, theoretically, I need something along the lines of:
denominator = np.sum(occurances, yet only sum the entries above occurances[2])  # pseudocode
# the next bit uses the True/False columns to find the ranges in which a
# series of avalanches happen.
fst = bins.index[bins['avalanche'] & ~ bins['avalanche'].shift(1).fillna(False)]
lst = bins.index[bins['avalanche'] & ~ bins['avalanche'].shift(-1).fillna(False)]
for i, j in zip(fst, lst):
    bins.loc[j, 'total count'] = bins.loc[i:j, 'count'].sum()  # .loc slicing is label-inclusive, so i:j covers the whole run
    bins.loc[j, 'total duration'] = (j-i+1)*bin_width
with pd.ExcelWriter(bin_file) as writer:  # context manager replaces writer.save(), which was removed in pandas 2.0
    bins.to_excel(writer)
# When a series of avalanches occur, we need to add them up.
occurances = bins.groupby(bins['total count']).size()
# Fill in the gaps with zero
occurances = occurances.reindex(np.arange(occurances.index.min(), occurances.index.max() + 1), fill_value=0)  # + 1 because np.arange excludes the stop value
# Create a new series that shows the percentage of outcomes
denominator = np.sum(occurances)
print(denominator)
percentage = occurances/denominator
So, this reads an Excel file into a dataframe. Nonetheless, as I mentioned earlier, I'm having trouble calculating the variable denominator. occurances simply counts the number of times a given value is present; however, I need to calculate denominator such that:
denominator = np.sum(occurances) - occurances[2] - occurances[1]
Yet if occurances[2] or occurances[1] isn't present, it crashes. So how would I go about taking the sum of occurances[3] and above? I also tried:
denominator = np.sum(occurances) >=occurances[3]
but it only gave me a True/False value and crashed shortly after. So I basically need the sum of the values present at occurances[3] and above. Thank you, any help is appreciated.
Use a conditional index on the series' index:
denominator = occurances[occurances.index >= 3].sum()
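For example, on a toy series (an illustrative stand-in for the real groupby output, with the avalanche size as the index) this keeps only the counts at index 3 and above:

import pandas as pd

occurances = pd.Series([5, 7, 4, 2, 1], index=[1, 2, 3, 4, 5])  # hypothetical counts
denominator = occurances[occurances.index >= 3].sum()  # 4 + 2 + 1
print(denominator)  # 7

occurances.loc[3:].sum() is an equivalent spelling, since .loc slicing on the index is label-inclusive, and neither crashes when index 1 or 2 is missing.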
Is it possible to use .nlargest to get the two highest numbers in a set of numbers, but ensure that they are at least X rows apart?
For example, in the following data I would want to find the largest values but ensure that they are more than 5 rows apart from each other. Is there an easy way to do this?
data = {'Pressure': [100, 112, 114, 120, 123, 420, 1222, 132, 123, 333, 123, 1230, 132, 1, 23, 13, 13, 13, 123, 13, 123, 3, 222, 2303, 1233, 1233, 1, 1, 30, 20, 40, 401, 10, 40, 12, 122, 1, 12, 333]}
If I understand the question correctly, you need to output the largest value, and then the next largest value that's at least X rows apart from it (based on the index).
First value is just df.Pressure.max(), where df = pd.DataFrame(data). Its index is df.Pressure.idxmax().
Second value is either before or after the first value's index:
max_before = df.Pressure.loc[:df.Pressure.idxmax() - X].max()
max_after = df.Pressure.loc[df.Pressure.idxmax() + X:].max()
second_value = max(max_before, max_after)
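A minimal runnable sketch putting that together, using the sample data above with X = 5 (the pd.notna guard handles the case where the maximum sits within X rows of either end, which leaves one slice empty):

import pandas as pd

df = pd.DataFrame({'Pressure': [100, 112, 114, 120, 123, 420, 1222, 132, 123, 333, 123, 1230, 132, 1, 23, 13, 13, 13, 123, 13, 123, 3, 222, 2303, 1233, 1233, 1, 1, 30, 20, 40, 401, 10, 40, 12, 122, 1, 12, 333]})
X = 5

first_value = df.Pressure.max()   # 2303
first_idx = df.Pressure.idxmax()  # 23

# .loc slicing is label-inclusive, so these slices end/start exactly X rows away
max_before = df.Pressure.loc[:first_idx - X].max()
max_after = df.Pressure.loc[first_idx + X:].max()
second_value = max(v for v in (max_before, max_after) if pd.notna(v))  # 1230

print(first_value, second_value)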
I have a CSV file, structured in two columns: "Time.s", "Volt.mv".
Example:
Time.s, Volt.mv
0, 1.06
0.0039115, 1.018
0.0078229, 0.90804
So, I have to return time values that exceed the threshold of 0.95 indicating the deviation of each value with respect to the average of that time interval.
I calculated the average like this:
ecg_time_mean = ecg["Time.s"].mean()
print(ecg_time_mean)
Then, I tried to make a For Loop with condition:
ecg_dev = []
for elem in ecg["Time.s"]:
    if elem > 0.95:
        deviation = ecg["Time.s"].std()
        sqrt = deviation**(1/2.0)
        dev = sqrt/ecg_time_mean
        ecg_dev.append(elem)
ecg["Deviation"] = dev
print(ecg)
And I would like to print the output in a new column called "Deviation".
This is the output: the if condition is never taken into account, and the Deviation column contains the same number in every row.
I can't understand the problem. Thank you in advance, guys!
You are appending elem, not your dev, to the list.
You didn't assign ecg_dev to your new column; you assigned the single VALUE dev to that column.
If you notice, dev is actually the same value throughout your loop. The only inputs to the calculation of dev are the std and the mean, which are computed over the entire column, so the looping changes nothing.
Also, ecg_dev is a different length than ecg, because it is shorter (due to the if). So even if you assigned it to the new column, it would fail.
The sqrt is fixed across ecg; you do not need to recalculate it inside the loop.
I do not understand what you want based on what you wrote here:
So, I have to return time values that exceed the threshold of 0.95
indicating the deviation of each value with respect to the average of
that time interval.
What is Time.s? Is it a monotonically increasing time, or is it an interval length? If it is a monotonically increasing time, "deviation of each value with respect to the average" does not make sense. If it is an interval, then "average of that time interval" does not make sense.
For Time.s > 0.95, you want the deviation of what value (or column) with respect to the average of what column during the time interval?
I will amend my answer when you clarify these points. (The question was clarified in the comments.)
It looks like you want the deviation of the time from the mean time for the cases where Volt.mv exceeds 0.95. In this case you do not need std() at all.
ecg_time_mean = ecg["Time.s"].mean()
ecg['TimeDeviationFromMean'] = ecg['Time.s'] - ecg_time_mean
ecg_above95 = ecg[ecg['Volt.mv'] > 0.95]
The ecg_above95 dataframe should be what you need.
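As a minimal sketch of what that produces, built from the three example rows in the question (column names as given there):

import pandas as pd

ecg = pd.DataFrame({'Time.s': [0, 0.0039115, 0.0078229],
                    'Volt.mv': [1.06, 1.018, 0.90804]})

ecg_time_mean = ecg['Time.s'].mean()
ecg['TimeDeviationFromMean'] = ecg['Time.s'] - ecg_time_mean
ecg_above95 = ecg[ecg['Volt.mv'] > 0.95]
print(ecg_above95)  # keeps the first two rows, where Volt.mv exceeds 0.95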
The issue is that you are setting each of the values to the last calculation of dev.
To do what you want with the for loop, you have to make two edits:
First, append the calculated dev to ecg_dev
ecg_dev.append(dev) #not elem
As a side step, you need to append NaNs when elem <= 0.95:
if elem > 0.95:
    ....
else:
    ecg_dev.append(np.nan)
Second, you need to set the column to ecg_dev
ecg['Deviation'] = ecg_dev
However, assuming that you are using a pandas DataFrame, you can speed up your code by skipping the for loop altogether and calculating directly:
ecg['Deviation'] = #deviation calculation
ecg.loc[ecg['Time.s'] <= 0.95, 'Deviation'] = np.nan  # .loc avoids the chained-assignment pitfall
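A minimal vectorized sketch of that idea, assuming (as in the other answer) that the intended "deviation" is each time's distance from the mean time, and that ecg is the DataFrame loaded from the CSV:

# deviation of each Time.s value from the column mean (assumed interpretation)
deviation = ecg['Time.s'] - ecg['Time.s'].mean()
# keep it only where Time.s > 0.95; .where fills the rest with NaN
ecg['Deviation'] = deviation.where(ecg['Time.s'] > 0.95)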
I have a huge dataset, where I'm trying to reduce the dimensionality by removing the variables that fulfill these two conditions:
Count of unique values in a feature / sample size < 10%
Count of most common value / Count of second most common value > 20 times
The first condition is no problem. The second condition is where I'm stuck, as I'm trying to be as efficient as possible because of the size of the dataset. I'm trying to use numpy, since I've read that it's faster than pandas. A possible solution was numpy-most-efficient-frequency-counts-for-unique-values-in-an-array, but I'm having too much trouble trying to get the counts of the two most common values.
My attempt:
n = df.shape[0]/10
variable = []
condition_1 = []
condition_2 = []
for i in df:
    variable.append(i)
    condition_1.append(df[i].unique().shape[0] < n)
    condition_2.append(most_common_value_count / second_most_common_value_count > 20)  # placeholder counts: this is the part I can't compute
result = pd.DataFrame({"Variables": variable,
                       "Condition_1": condition_1,
                       "Condition_2": condition_2})
The dataset df contains positive and negative values (so I can't use np.bincount), as well as categorical variables, objects, datetimes, dates, and NaN values.
Any suggestions? Remember that it's critical to minimize the number of steps in order to maximize efficiency.
As noted in the comments, you may want to use np.unique (or pd.unique). You can set return_counts=True to get the value counts. These will be the second item in the tuple returned by np.unique, hence the [1] index below. After sorting them, the most common count will be the last value, and the second most common count will be the next to last value, so you can get them both by indexing with [-2:].
You could then construct a Boolean list indicating which columns meet your condition #2 (or rather the opposite). This list can then be used as a mask to reduce the dataframe:
def counts_ratio(s):
    """Take a pandas series s and return the
    count of its most common value /
    count of its second most common value."""
    counts = np.sort(np.unique(s, return_counts=True)[1])[-2:]
    if len(counts) < 2:
        return np.inf  # a column with a single unique value: treat the ratio as infinite
    return counts[1] / counts[0]

condition2 = [counts_ratio(df[col]) <= 20
              for col in df.columns]
df_reduced = df[df.columns[condition2]]
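As a small hypothetical check of the ratio logic, reusing the counts_ratio function above (the column names and values are made up for illustration):

import numpy as np
import pandas as pd

df = pd.DataFrame({'mostly_a': ['a'] * 95 + ['b', 'c', 'c', 'd', 'e'],
                   'balanced': list(range(50)) * 2})

print(counts_ratio(df['mostly_a']))  # 95 / 2 = 47.5 -> fails condition #2 (ratio > 20)
print(counts_ratio(df['balanced']))  # 2 / 2 = 1.0   -> passes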
I'm trying to calculate Welles Wilder's type of moving average (also called a smoothed or running moving average) in a pandas dataframe.
The method to calculate the Wilder's moving average for 'n' periods of series 'A' is:
Calculate the mean of the first 'n' values in 'A' and set as the mean for the 'n' position.
For the following values use the previous mean weighed by (n-1) and the current value of the series weighed by 1 and divide all by 'n'.
My question is: how to implement this in a vectorized way?
I tried to do it by iterating over the dataframe (which, from what I read, isn't recommended because it's slow). It works and the values are correct, but I get a warning:
SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
and it's probably not the most efficient way of doing it.
My code so far:
import pandas as pd
import numpy as np
#Building Random sample:
datas = pd.date_range('2020-01-01','2020-01-31')
np.random.seed(693)
A = np.random.randint(40,60, size=(31,1))
df = pd.DataFrame(A,index = datas, columns = ['A'])
period = 12 # Main parameter
initial_mean = A[0:period].mean() # Equation for the first value.
size = len(df.index)
df['B'] = np.full(size, np.nan)
df.B[period-1] = initial_mean
for x in range(period, size):
    df.B[x] = ((df.A[x] + (period-1)*df.B[x-1]) / period) # Equation for the following values.
print(df)
You can use the Pandas ewm() method, which behaves exactly as you described when adjust=False:
When adjust is False, weighted averages are calculated recursively as:
weighted_average[0] = arg[0];
weighted_average[i] = (1-alpha)*weighted_average[i-1] + alpha*arg[i]
If you want to do the simple average of the first period items, you can do that first and apply ewm() to the result.
You can calculate a series with the average of the first period items, followed by the other items repeated verbatim, with the formula below (pd.concat is used because Series.append was removed in pandas 2.0):
pd.concat([
    pd.Series(
        data=[df['A'].iloc[:period].mean()],
        index=[df['A'].index[period-1]],
    ),
    df['A'].iloc[period:],
])
So in order to calculate the Wilder moving average and store it in a new column 'C', you can use:
df['C'] = pd.concat([
    pd.Series(
        data=[df['A'].iloc[:period].mean()],
        index=[df['A'].index[period-1]],
    ),
    df['A'].iloc[period:],
]).ewm(
    alpha=1.0 / period,
    adjust=False,
).mean()
At this point, you can calculate df['B'] - df['C'] and you'll see that the difference is almost zero (there's some rounding error with float numbers.) So this is equivalent to your calculation using a loop.
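For instance, a quick sanity check along those lines:

# the loop result (B) and the ewm result (C) should agree to float precision
print((df['B'] - df['C']).abs().max())  # tiny, on the order of float rounding error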
You might want to consider skipping the direct average over the first period items and simply applying ewm() from the start, which assumes the first row is the previous average in the first calculation. The results will be slightly different, but once you've gone through a couple of periods those initial values will hardly influence the results.
That is a much simpler calculation:
df['D'] = df['A'].ewm(
alpha=1.0 / period,
adjust=False,
).mean()
I have a CSV data file that I've split by a column value into 5 datasets, one for each person, using:
P = {}  # one entry per person
for i in range(1,6):
    PersonData = df[df['Person'] == i].values
    P[i] = PersonData
I want to sort the data into ascending order according to one column, then split the data half way at that column to find the median.
So I sorted the data with the following:
dataP = {}
for i in range(1,6):
    sortData = P[i][P[i][:,9].argsort()]
    P[i] = sortData
    dataP[i] = pd.DataFrame(P[i])  # store in dataP so dataP[1] below works
dataP[1]
Using that I get a dataframe for each of my datasets 1-5, sorted by the relevant column (9), depending on which number I put into dataP[i].
Then I calculate half the length:
for i in range(1,6):
    middle = len(dataP[i])/2
    print(middle)
Here is where I'm stuck!
I need to create a new column in each dataP[i] dataframe that splits the length in 2 and gives the value 0 if it's in the first half and 1 if it's in the second.
This is what I've tried but I don't understand why it doesn't produce a new list of values 0 and 1 that I can later append to dataP[i]:
for n in range(1, (len(dataP[i]))):
    for n, line in enumerate(dataP[i]):
        if middle > n:
            confval = 0
        elif middle < n:
            confval = 1
for i in range(1,6):
    Confval[i] = confval
Confval[1]
Sorry if this is basic; I'm quite new to this, so a lot of what I've written might not be the best way to do it or even necessary. Sorry also for the long post.
Any help would be massively appreciated. Thanks in advance!
If I'm reading your question right, I believe you are attempting to do two things:
Find the median value of a column.
Create a new column which is 0 if the value is below the median and 1 if it is above.
Let's tackle #1 first:
median = df['originalcolumn'].median()
That easy! There are many great pandas functions for things like this.
Ok so number two:
df['newcolumn'] = (df['originalcolumn'] > median).astype(int)
What we're doing here is creating a new bool series: False where the value is at or below the median, True otherwise. Then we cast that to int, which gives us 0s and 1s.
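A quick hypothetical run of the two steps together (column name and values made up for illustration):

import pandas as pd

df = pd.DataFrame({'originalcolumn': [3, 1, 4, 1, 5, 9, 2, 6]})

median = df['originalcolumn'].median()                         # 3.5
df['newcolumn'] = (df['originalcolumn'] > median).astype(int)  # four 0s, four 1s
print(df)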