Why is pd.qcut() producing massive boundaries?

I have a dataframe of event data in which one column is the time interval in which each event occurred. I would like to use pd.qcut() to compute percentiles within each interval, based on the events that fall in it, and assign each event its respective percentile.
def event_quartiler(event_row):
    in_interval = paired_events.loc[events['TimeInterval'] == event_row['TimeInterval']]
    quartiles = pd.qcut(in_interval['DateTime'], 100)
    counter = 1
    for quartile in quartiles.unique():
        if(event_row['DateTime'] in quartile):
            return counter
        counter = counter + 1
        if(counter > 100): break
    return -1

events['Quartile'] = events.apply(event_quartiler, axis=1)
I expected this to simply set the Quartile column to each event's percentile, but instead the code takes forever to run and eventually blows up with this error:
ValueError: ("Bin edges must be unique: array([1.55016605e+18, 1.55016616e+18, 1.55016627e+18, 1.55016632e+18,\n 1.55016632e+18, 1.55016636e+18,
... (I put the ellipsis here because there are 100 data points)
1.55017534e+18, 1.55017545e+18,\n 1.55017555e+18]).\nYou can drop duplicate edges by setting the 'duplicates' kwarg", 'occurred at index 6539')
There is nothing different about the data at 6539 or any of the events in its interval, but I cannot find where I am going wrong with the code either.

I figured out the problem: qcut computes its bin edges from the data's quantiles, while cut takes the min and max and divides that range into n equal-width bins. Because I was asking for more quantile bins than there were actual data points in some intervals, qcut produced duplicate edges and failed.
Just using cut with 100 bins solved my problem and I was able to make percentiles.
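For reference, here is a minimal sketch of the cut-based fix, assuming the events DataFrame and the column names ('TimeInterval', 'DateTime', 'Quartile') from the question; the groupby/transform pattern is my own substitution for the row-wise apply:

import pandas as pd

def event_percentiler(times):
    # pd.cut() splits this interval's time range into 100 equal-width bins;
    # labels=False returns integer bin codes 0-99, so add 1 to get 1-100.
    return pd.cut(times, bins=100, labels=False) + 1

events['Quartile'] = events.groupby('TimeInterval')['DateTime'].transform(event_percentiler)

Because cut() derives its edges from the min and max rather than from quantiles, it never produces duplicate edges, even when an interval holds fewer than 100 events.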

Related

How to calculate stdev inside a For Loop with conditions in Python

I have a CSV file, structured in two columns: "Time.s" and "Volt.mv". Example:
0, 1.06
0.0039115, 1.018
0.0078229, 0.90804
So, I have to return the time values that exceed the threshold of 0.95, indicating the deviation of each value with respect to the average of that time interval.
I calculated the average like this:
ecg_time_mean = ecg["Time.s"].mean()
print(ecg_time_mean)
Then, I tried to make a For Loop with condition:
ecg_dev = []
for elem in ecg["Time.s"]:
    if elem > 0.95:
        deviation = ecg["Time.s"].std()
        sqrt = deviation**(1/2.0)
        dev = sqrt/ecg_time_mean
        ecg_dev.append(elem)
ecg["Deviation"] = dev
print(ecg)
And I would like to print the output in a new column called "Deviation".
This is the output: the if condition seems to be ignored, and the Deviation column contains the same number in every row.
I can't understand the problem. Thank you in advance, guys!
You are appending elem, not dev, to the list that should become your new column.
You didn't assign ecg_dev to your new column; you assigned the single value dev to that column.
If you look closely, dev is actually the same value throughout your loop: the only variables factored into its calculation are the std and the mean, which are computed over the entire column, so the looping changes nothing.
Also, ecg_dev has a different length than ecg, because it is shorter (due to the if). So even if you assigned it to the new column, it would fail.
The sqrt is fixed across ecg; you do not need to recalculate it inside the loop.
I do not understand what you want based on what you wrote here:
So, I have to return time values that exceed the threshold of 0.95
indicating the deviation of each value with respect to the average of
that time interval.
What is Time.s? Is it a monotonically increasing time or an interval length? If it is a monotonically increasing time, "deviation of each value with respect to the average" does not make sense. If it is an interval, then "average of that time interval" does not make sense.
For Time.s > 0.95, do you want the deviation of which value (or column) with respect to the average of which column during the time interval?
I will amend my answer when you clarify these points. (The question was clarified in the comments.)
It looks like you want the deviation of time taken from the mean time taken for the cases where volt.mv exceeds 0.95. In this case you do not need std() at all.
ecg_time_mean = ecg["Time.s"].mean()
ecg['TimeDeviationFromMean'] = ecg['Time.s'] - ecg_time_mean
ecg_above95 = ecg[ecg['Volt.mv'] > 0.95]
The ecg_above95 dataframe should be what you need.
The issue is that you are setting every value in the column to the last calculated dev.
To do what you want with the for loop, you have to make two edits:
First, append the calculated dev to ecg_dev
ecg_dev.append(dev) #not elem
As a side step, you need to append NaNs when elem <= 0.95:
if elem > 0.95:
    ....
else:
    ecg_dev.append(np.nan)
Second, you need to set the column to ecg_dev
ecg['Deviation'] = ecg_dev
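Putting the edits together, a minimal sketch, assuming the ecg DataFrame from the question and keeping your original dev formula:

import numpy as np

ecg_time_mean = ecg["Time.s"].mean()
ecg_dev = []
for elem in ecg["Time.s"]:
    if elem > 0.95:
        deviation = ecg["Time.s"].std()
        sqrt = deviation ** (1 / 2.0)
        dev = sqrt / ecg_time_mean
        ecg_dev.append(dev)       # append dev, not elem
    else:
        ecg_dev.append(np.nan)    # keep the list the same length as ecg
ecg["Deviation"] = ecg_dev        # assign the full-length list, not a single value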
Assuming that you are using a pandas DataFrame, however, you can speed up your code by skipping the for loop altogether and calculating directly:
ecg['Deviation'] = ...  # vectorized deviation calculation goes here
ecg.loc[ecg['Time.s'] <= 0.95, 'Deviation'] = np.nan  # .loc avoids chained assignment
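Using the definition from the earlier answer (deviation of Time.s from its mean), the vectorized version could look like this; this fills in the placeholder above under that assumption only:

import numpy as np

ecg['Deviation'] = ecg['Time.s'] - ecg['Time.s'].mean()
ecg.loc[ecg['Time.s'] <= 0.95, 'Deviation'] = np.nan  # .loc avoids chained assignment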

Pandas: Find end frequency spectrum above a defined threshold

Long-time reader, first-time poster.
I am working with x,y data for frequency response plots in Pandas DataFrames. Here is an example of the data and the plots (see full .csv file at end of post):
fbc['x'],fbc['y']
(0 [89.25, 89.543, 89.719, 90.217, 90.422, 90.686...
1 [89.25, 89.602, 90.422, 90.568, 90.744, 91.242...
2 [89.25, 89.689, 89.895, 90.305, 91.008, 91.74,...
3 [89.25, 89.514, 90.041, 90.275, 90.422, 90.832...
Name: x, dtype: object,
0 [-77.775, -77.869, -77.766, -76.572, -76.327, ...
1 [-70.036, -70.223, -71.19, -71.229, -70.918, -...
2 [-73.079, -73.354, -73.317, -72.753, -72.061, ...
3 [-70.854, -71.377, -74.069, -74.712, -74.647, ...
Name: y, dtype: object)
where x = frequency and y = amplitude data. The resulting plots for each of these look as follows:
(x,y plot linked in the original post; not enough reputation to embed images yet.)
I can create a plot for each row of the x,y data in the Dataframe.
What I need to do in pandas (Python) is identify the highest frequency in the data before the frequency response drops to the noise floor permanently. As you can see, there are places where the y data may dip to a very low value (say < -50) but then return to > -40.
How can I find, in pandas/Python (ideally without iteration, due to very large data sizes), the highest frequency whose amplitude is > -40, such that the response never jumps back above that level afterwards? Basically, I'm trying to find the end of the frequency band. I've tried working with some of the pandas statistics (which would also be nice to have), but have been unsuccessful in getting useful data.
Thanks in advance for any pointers and direction you can provide.
Here is a .csv file that can be imported with csv.reader: https://www.dropbox.com/s/ia7icov5fwh3h6j/sample_data.csv?dl=0
I believe I have come up with a solution:
Based on a suggestion from @katardin, I came up with the following, though I think it can be optimized. Again, I will be dealing with huge amounts of data, so if anyone can find a more elegant solution it would be appreciated.
for row in fbc['y']:
    list_reverse = row
    # Reverse y data so we read from the end (right to left)
    test_list = list_reverse[::-1]
    # Find the first value of y data above the noise floor (> -50)
    res = next(x for x, val in enumerate(test_list) if val > -50)
    # Since we reversed the y data we must take the opposite of the returned
    # res to get the correct index
    index = len(test_list) - res
    # Print results
    print("The index of element is : " + str(index))
Where the output is index numbers as follows:
The index of element is : 2460
The index of element is : 2400
The index of element is : 2398
The index of element is : 2382
I have checked each one, and each corresponds to the exact high-frequency roll-off point I was looking for. Great suggestion!
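For very large datasets, here is a hedged sketch of the same search without the manual reversal (my own suggestion, not from the thread): let NumPy find the last above-floor sample in each row directly.

import numpy as np

def end_of_band(y, floor=-50.0):
    # Indices of all samples above the noise floor
    above = np.nonzero(np.asarray(y) > floor)[0]
    # +1 mirrors the reversed-list arithmetic in the loop above
    return above[-1] + 1 if above.size else -1

fbc['end_index'] = fbc['y'].apply(end_of_band)

This still visits each row once via apply, but the per-row scan is a single vectorized comparison rather than a Python-level generator over a reversed list.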

Welles Wilder's moving average with pandas

I'm trying to calculate Welles Wilder's type of moving average (also called a cumulative moving average) in a pandas DataFrame.
The method to calculate Wilder's moving average for n periods of series A is:
Calculate the mean of the first n values in A and set it as the mean at position n.
For the following values, use the previous mean weighted by (n-1) and the current value of the series weighted by 1, and divide the total by n.
My question is: how to implement this in a vectorized way?
I tried to do it by iterating over the DataFrame (which, from what I read, isn't recommended because it's slow). It works and the values are correct, but I get a warning:
SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
and it's probably not the most efficient way of doing it.
My code so far:
import pandas as pd
import numpy as np

# Building random sample:
datas = pd.date_range('2020-01-01', '2020-01-31')
np.random.seed(693)
A = np.random.randint(40, 60, size=(31, 1))
df = pd.DataFrame(A, index=datas, columns=['A'])

period = 12  # Main parameter
initial_mean = A[0:period].mean()  # Equation for the first value.
size = len(df.index)
df['B'] = np.full(size, np.nan)
df.B[period-1] = initial_mean

for x in range(period, size):
    df.B[x] = (df.A[x] + (period-1)*df.B[x-1]) / period  # Equation for the following values.

print(df)
You can use the Pandas ewm() method, which behaves exactly as you described when adjust=False:
When adjust is False, weighted averages are calculated recursively as:
weighted_average[0] = arg[0];
weighted_average[i] = (1-alpha)*weighted_average[i-1] + alpha*arg[i]
If you want to do the simple average of the first period items, you can do that first and apply ewm() to the result.
You can calculate a series with the average of the first period items, followed by the other items repeated verbatim, with the formula:
pd.Series(
    data=[df['A'].iloc[:period].mean()],
    index=[df['A'].index[period-1]],
).append(
    df['A'].iloc[period:]
)
So in order to calculate the Wilder moving average and store it in a new column 'C', you can use:
df['C'] = pd.Series(
    data=[df['A'].iloc[:period].mean()],
    index=[df['A'].index[period-1]],
).append(
    df['A'].iloc[period:]
).ewm(
    alpha=1.0 / period,
    adjust=False,
).mean()
At this point, you can calculate df['B'] - df['C'] and you'll see that the difference is almost zero (there's some rounding error with float numbers.) So this is equivalent to your calculation using a loop.
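One caveat: Series.append() was removed in pandas 2.0, so on current pandas the same 'C' calculation needs pd.concat() instead. A sketch under that assumption, using df and period from the question:

import pandas as pd

seed = pd.Series(
    data=[df['A'].iloc[:period].mean()],   # plain average of the first `period` values
    index=[df['A'].index[period - 1]],
)
df['C'] = pd.concat([seed, df['A'].iloc[period:]]).ewm(
    alpha=1.0 / period,
    adjust=False,
).mean()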
You might want to consider skipping the plain average over the first period items and simply applying ewm() from the start, which treats the first row as the previous average in the first calculation. The results will be slightly different, but after a couple of periods those initial values will hardly influence the results.
That would be a much simpler calculation:
df['D'] = df['A'].ewm(
    alpha=1.0 / period,
    adjust=False,
).mean()

Pandas dataframe find first and last element given condition and calculate slope

The situation:
I have a pandas DataFrame with some data about the production of a product. The product is produced in 3 phases. The phases are not fixed, meaning that the number of cycles they last keeps changing. During the production phases, the temperature of the product is measured at each cycle.
Please see the table below (included as an image in the original post):
The problem:
I need to calculate the slope for each cycle of each phase for each product, and add it to the dataframe in a new column called "Slope". The one you can see highlighted in yellow was added by me manually in an Excel file. The real dataset contains hundreds of parameters (not only temperatures), so in reality I need to calculate the slope for many, many columns; that is why I tried to define a function.
My solution is not working at all:
This is the code I tried, but it does not work. I am trying to catch the first and last row for the given product and the given phase, then take the temperatures of those two rows and their difference; that way I could calculate the slope.
This is all I could come up with so far (I created another column called "Max_cylce_no", which stores the maximum cycle number for each phase):
temp_at_start = -1

def slope(col_name):
    global temp_at_start
    start_cycle_no = 1
    if row["Cycle"] == 1:
        temp_at_start = row["Temperature"]
        start_row = df.index(row)
        cycle_numbers = row["Max_cylce_no"]
        last_cycle_row = cycle_numbers + start_row
        last_temp = df.loc[last_cycle_row, "Temperature"]
And the way I would like to apply it:
df.apply(slope("Temperature"), axis=1)
Unfortunately I get a NameError right away, saying: name 'row' is not defined.
Could you please help me and point me in the right direction to solve this problem? It is giving me a really hard time. :(
Thank you in advance!
I believe you need GroupBy.transform: subtract the first value from the last and divide by the length:
f = lambda x: (x.iloc[-1] - x.iloc[0]) / len(x)
df['new'] = df.groupby(['Product_no','Phase_no'])['Temperature'].transform(f)
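A quick usage sketch with made-up numbers (the column names follow the question and the line above):

import pandas as pd

df = pd.DataFrame({
    'Product_no':  [1, 1, 1, 1],
    'Phase_no':    [1, 1, 2, 2],
    'Cycle':       [1, 2, 1, 2],
    'Temperature': [20.0, 24.0, 30.0, 27.0],
})

f = lambda x: (x.iloc[-1] - x.iloc[0]) / len(x)
df['Slope'] = df.groupby(['Product_no', 'Phase_no'])['Temperature'].transform(f)

# Each (product, phase) group gets (last - first) / group length:
# phase 1 -> (24 - 20) / 2 = 2.0, phase 2 -> (27 - 30) / 2 = -1.5
print(df)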

Replacing specified elements in a list (python), using an index array - error message

I am trying to create a large list to later add into a pandas DataFrame, with the elements of this list corresponding to the experimental conditions of the data in that row (i.e. basal condition, + some drug, etc.).
These conditions come in blocks; e.g. the first 500 rows (corresponding to the first 500 frames of imaging data) correspond to basal conditions (so each element should be 'basal'), the next 500 to some added drug, and so on.
The precise size of each of these blocks, and the first row of each block, varies from experiment to experiment, so the code should ideally generate these blocks from the numbers I input specifying the timing of the different conditions in each experiment.
To do this, I first generate a list of 'basal' repeated according to the total number of rows, then use the timing variables marking the start of each condition to overwrite every entry from that index to the end of the list with the next condition. Code is:
epochs = ['basal'] * frames

if ttx == True:
    ttx_epoch = np.arange(ttx_t*freq, frames, 1, dtype=int)
    epochs[ttx_epoch] = 'TTX'
if lo_k == True:
    lok_epoch = np.arange(lo_k_t*freq, frames, 1, dtype=int)
    epochs[lok_epoch] = 'Low K'
if hi_k == True:
    hik_epoch = np.arange(hi_k_t*freq, frames, 1, dtype=int)
    print(hik_epoch)
    epochs[hik_epoch] = 'High K'
When I attempt to run this, I get the error message:
TypeError: only integer scalar arrays can be converted to a scalar index
despite specifying the dtype of the arange index array as int.
Any ideas where I'm going wrong?
SOLVED: by finding an alternative way.
I realised the whole task was unnecessary, as I could achieve the desired result by specifying a range to index into the dataframe itself (rather than generating an array to then insert into the dataframe).
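For what it's worth, the TypeError itself comes from fancy-indexing a plain Python list: lists only accept a single integer (or slice) as an index, while NumPy arrays accept index arrays. A minimal sketch of the original approach with epochs as an ndarray (the frames, freq, and ttx_t values are made up for illustration):

import numpy as np

frames, freq, ttx_t = 1500, 1, 500                    # hypothetical experiment parameters
epochs = np.array(['basal'] * frames, dtype=object)   # object dtype avoids string truncation
ttx_epoch = np.arange(ttx_t * freq, frames, 1, dtype=int)
epochs[ttx_epoch] = 'TTX'                             # fancy indexing works on ndarrays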
