Long-time reader, first time posting.
I am working with x,y data for frequency response plots in Pandas DataFrames. Here is an example of the data and the plots (see full .csv file at end of post):
fbc['x'],fbc['y']
(0 [89.25, 89.543, 89.719, 90.217, 90.422, 90.686...
1 [89.25, 89.602, 90.422, 90.568, 90.744, 91.242...
2 [89.25, 89.689, 89.895, 90.305, 91.008, 91.74,...
3 [89.25, 89.514, 90.041, 90.275, 90.422, 90.832...
Name: x, dtype: object,
0 [-77.775, -77.869, -77.766, -76.572, -76.327, ...
1 [-70.036, -70.223, -71.19, -71.229, -70.918, -...
2 [-73.079, -73.354, -73.317, -72.753, -72.061, ...
3 [-70.854, -71.377, -74.069, -74.712, -74.647, ...
Name: y, dtype: object)
where x = frequency and y = amplitude data. The resulting plots for each of these looks as follows:
(See the x,y plot of the data at the linked image - not enough reputation points to embed it here yet.)
I can create a plot for each row of the x,y data in the DataFrame.
What I need to do in Pandas (Python) is identify the highest frequency in the data before the frequency response drops to the noise floor (permanently). As you can see, there are places where the y data may dip to a very low value (say < -50) but then return to > -40.
How can I find, in Pandas / Python (ideally without iteration, due to very large data sizes), the highest frequency whose amplitude is still > -40, such that I know the response never rises back above -40 again after that point? Basically, I'm trying to find the end of the frequency band. I've tried working with some of the Pandas statistics (which would also be nice to have), but have been unsuccessful in getting useful data.
Thanks in advance for any pointers and direction you can provide.
Here is a .csv file that can be imported with csv.reader: https://www.dropbox.com/s/ia7icov5fwh3h6j/sample_data.csv?dl=0
I believe I have come up with a solution:
Based on a suggestion from @katardin I came up with the following, though I think it can be optimized. Again, I will be dealing with huge amounts of data, so if anyone can find a more elegant solution it would be appreciated.
for row in fbc['y']:
    list_reverse = row
    # Reverse y data so we read from the end (right to left)
    test_list = list_reverse[::-1]
    # Find the first value of y data above the noise floor (> -50)
    res = next(x for x, val in enumerate(test_list) if val > -50)
    # Since we reversed the y data, we must take the complement of the
    # returned res to get the correct index
    index = len(test_list) - res
    # Print results
    print("The index of element is : " + str(index))
Where the output is index numbers as follows:
The index of element is : 2460
The index of element is : 2400
The index of element is : 2398
The index of element is : 2382
I have checked each one, and it corresponds to the exact high-frequency roll-off point I have been looking for. Great suggestion!
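As a follow-up for anyone who needs to cut down the per-row Python work: the reversed scan above can be pushed into numpy for each row (the outer loop over rows remains because each cell holds a list). This is only a sketch, using the same -50 floor and the same index convention as the loop above; last_above, band_end_idx and band_end_freq are names invented here for illustration:
import numpy as np
def last_above(y, floor=-50):
    # Boolean mask of samples above the noise floor
    above = np.asarray(y) > floor
    if not above.any():
        return 0  # this row never rises above the floor
    # Position (counting from 1) of the last sample above the floor,
    # i.e. the point after which the response never comes back up
    return len(above) - np.argmax(above[::-1])
fbc['band_end_idx'] = fbc['y'].apply(last_above)
# Corresponding frequency for each row (0-based lookup, hence the - 1)
fbc['band_end_freq'] = [xs[i - 1] if i > 0 else None
                        for xs, i in zip(fbc['x'], fbc['band_end_idx'])]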
I am trying to normalize price at a certain point in time with respect to price 10 seconds later using this formula: ((price(t+10s) - price(t)) / price(t)) / spread(t)
Both price and spread are columns in my dataframe. And I have indexed my dataframe by timestamp (pd.datetime object) because I figured that would make calculating price(t+10sec) easier.
What I've tried so far:
pos['timestamp'] = pd.to_datetime(pos['timestamp'])
pos.set_index('timestamp')

def normalize_data(pos):
    t0 = pd.to_datetime('2021-10-27 09:30:13.201')
    x = pos['mid_price']
    y = ((x[t0 + pd.Timedelta('10 sec')] - x) / x) / (spread)
    return y

pos['norm_price'] = normalize_data(pos)
This gives me an error because I'm indexing x[t0 + pd.Timedelta('10 sec')] but not the other x's in the equation. I also don't think I'm using pd.Timedelta or the x[t0+pd.Time...] lookup correctly, and I'm unsure how to fix all this / define a better function.
Any input would be much appreciated.
Your problem is here:
pos.set_index('timestamp')
This line of code will return a new dataframe, and leave your original dataframe unchanged. So, your function normalize_data is working on the original version of pos, which does not have the index you want, and neither will x. Change your code to this:
pos = pos.set_index('timestamp')
And that should get things working.
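Once the index is in place, you can also vectorize the whole formula instead of evaluating it for a single t0. A rough sketch, assuming every timestamp has an exact match 10 seconds later in the data (rows without one come back as NaN) and that the spread column is literally named 'spread':
import pandas as pd
pos['timestamp'] = pd.to_datetime(pos['timestamp'])
pos = pos.set_index('timestamp')
# Price 10 seconds after each row's own timestamp
future_price = pos['mid_price'].reindex(pos.index + pd.Timedelta('10s')).to_numpy()
# ((price(t+10s) - price(t)) / price(t)) / spread(t)
pos['norm_price'] = ((future_price - pos['mid_price']) / pos['mid_price']) / pos['spread']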
I have a dataframe of event data in which one column is the time interval that the event occurred in. I would like to use pd.qcut() to compute percentiles within each interval, based on the events that fall in it, and assign each event its respective percentile.
def event_quartiler(event_row):
    # All events that fall in the same time interval as this event
    in_interval = paired_events.loc[events['TimeInterval'] == event_row['TimeInterval']]
    # Split the interval's events into 100 quantile bins
    quartiles = pd.qcut(in_interval['DateTime'], 100)
    counter = 1
    for quartile in quartiles.unique():
        if event_row['DateTime'] in quartile:
            return counter
        counter = counter + 1
        if counter > 100:
            break
    return -1

events['Quartile'] = events.apply(event_quartiler, axis=1)
I expected that this would simply set the Quartile column to each event's respective percentile, but instead the code takes forever to run and eventually blows up with this error:
ValueError: ("Bin edges must be unique: array([1.55016605e+18, 1.55016616e+18, 1.55016627e+18, 1.55016632e+18,\n 1.55016632e+18, 1.55016636e+18,
... (I put the ellipsis here because there are 100 data points)
1.55017534e+18, 1.55017545e+18,\n 1.55017555e+18]).\nYou can drop duplicate edges by setting the 'duplicates' kwarg", 'occurred at index 6539')
There is nothing different about the data at 6539 or any of the events in its interval, but I cannot find where I am going wrong with the code either.
I figured out the problem: qcut tries to put the data points themselves into equal-count quantile bins, while cut takes the min and max and cuts the range into n bins. Because in this example I was asking for more bins than there were actual data points in an interval, qcut was failing.
Just using cut with 100 bins solved my problem and I was able to make percentiles.
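For reference, a minimal sketch of that cut-based approach using groupby + transform instead of apply, which also avoids looping over the bins (labels=False returns the bin number directly). Column names are as in the question; it assumes your pandas version can cut datetime values, otherwise convert DateTime to int64 first:
import pandas as pd
def percentile_bin(s):
    # 100 equal-width bins between the interval's min and max DateTime;
    # labels=False gives the 0-99 bin code, +1 shifts it to 1-100
    return pd.cut(s, bins=100, labels=False) + 1
events['Quartile'] = events.groupby('TimeInterval')['DateTime'].transform(percentile_bin)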
The situation:
I have a pandas dataframe with some data about the production of a product. The product is produced in 3 phases. The phases are not fixed, meaning that their cycles (the time they last) change. During the production phases, the temperature of the product is measured at each cycle.
Please see the table below:
The problem:
I need to calculate the slope for each cycle of each phase for each product, and add it to the dataframe in a new column called "Slope". The one you can see highlighted in yellow was added by me manually in an Excel file. The real dataset contains hundreds of parameters (not only temperatures), so in reality I need to calculate the slope for many, many columns; therefore I tried to define a function.
My solution is not working at all:
This is the code I tried, but it does not work. I am trying to catch the first and last row for the given product and phase, get the temperature values of those two rows and their difference, and from that calculate the slope.
This is all I could come up with so far (I created another column called "Max_cylce_no", which stores the maximum cycle number for each phase):
temp_at_start = -1

def slope(col_name):
    global temp_at_start
    start_cycle_no = 1
    if row["Cycle"] == 1:
        # Temperature and position of the first row (cycle 1) of the phase
        temp_at_start = row["Temperature"]
        start_row = df.index(row)
        cycle_numbers = row["Max_cylce_no"]
        last_cycle_row = cycle_numbers + start_row
        # Temperature of the last cycle's row for this phase
        last_temp = df.loc[last_cycle_row, "Temperature"]
And the way I would like to apply it:
df.apply(slope("Temperature"), axis=1)
Unfortunately I get a NameError right away saying: name 'row' is not defined.
Could you please point me in the right direction on how to solve this problem? It is giving me a really hard time. :(
Thank you in advance!
I believe you need GroupBy.transform: subtract the first value from the last and divide by the length of the group:
f = lambda x: (x.iloc[-1] - x.iloc[0]) / len(x)
df['new'] = df.groupby(['Product_no','Phase_no'])['Temperature'].transform(f)
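Since the question mentions hundreds of parameter columns, the same transform can be applied to several columns in one go. A sketch under that assumption; value_cols is a hypothetical list of your measurement columns:
f = lambda x: (x.iloc[-1] - x.iloc[0]) / len(x)
value_cols = ['Temperature']  # extend with your other parameter columns
slopes = df.groupby(['Product_no', 'Phase_no'])[value_cols].transform(f)
# Store each slope in its own new column, e.g. 'Temperature_slope'
df[[c + '_slope' for c in value_cols]] = slopes.to_numpy()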
I'm struggling to understand this error, so I'll give you both an example that works and the one I'm interested in that doesn't.
I have to analyse a set of data with hourly prices for an entire year in it, called sys_prices, which - after various transformations - is a numpy.ndarray object with 8785 rows (1 column), and every row is a numpy.ndarray item with only one element, a numpy.float64 number.
The code not working is the following:
stop_day = 95
start_day = stop_day - 10  # 10 days before
stop_day = (stop_day - 1) * 24
start_day = (start_day - 1) * 24
pcs = []  # list of prices to analyse
for ii in range(start_day, stop_day):
    pcs.append(sys_prices[ii][0])
p, x = np.histogram(pcs, bins='fd')
The *24 factor converts the day index into an hourly index within the dataset, to respect the hourly resolution.
What I expect is to supply the list pcs to the histogram function, so that I get the histogram values and the bin edges in p and x, respectively.
I say that I expect this because the following code works:
start_day = 1
start_month = 1
start_year = 2016
stop_day = 1
stop_month = 2
stop_year = 2016
num_prices = (date(stop_year, stop_month, stop_day) - date(start_year, start_month, start_day)).days * 24
jan_prices = []
for ii in range(num_prices):
    jan_prices.append(sys_prices[ii][0])
p, x = np.histogram(jan_prices, bins='fd')  # bin the data
The difference between the two snippets is that the non-working one analyzes only 10 days, counting backwards from an arbitrarily chosen day of the year, while the working one uses all the prices in the month of January (i.e. the first 744 values of the dataset).
Strange(r) thing: I used different values for stop_day, and it seems that 95 raises the error, while 99 or 100 or 200 don't.
Could you help me?
I solved it: there was a single NaN in the dataset that I couldn't spot.
For those wondering how to spot it, I just used this code to find the index of the item:
nanlist = []
for ii in range(len(array)):
    if numpy.isnan(array[ii]):
        nanlist.append(ii)
Where array is your container.
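A loop-free way to get the same indices, as a small numpy sketch (array is your container, as above):
import numpy as np
arr = np.asarray(array, dtype=float)
nanlist = np.where(np.isnan(arr))[0]  # indices of every NaN entry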
The problem occurs because, by default, histogram uses min(pcs) and max(pcs) to determine the range of the bins, but since you have NaNs in your dataset the min and max become NaN. You can fix this by using np.nanmin and np.nanmax for the range parameter.
p, x = np.histogram(pcs, range=(np.nanmin(pcs), np.nanmax(pcs)), bins='fd')
I think this is better than the accepted answer since it does not require modifying pcs.
I am trying to loop through a Series data type which was randomly generated from an existing data set (to serve as a training data set). Here is the output of my Series data set after the split:
Index data
0 1150
1 2000
2 1800
. .
. .
. .
1960 1800
1962 1200
. .
. .
. .
20010 1500
There is no index 1961 because the random selection process used to create the training data set removed it. When I try to loop through to calculate my residual sum of squares, it does not work. Here is my loop code:
def ResidSumSquares(x, y, intercept, slope):
    out = 0
    temprss = 0
    for i in x:
        out = (slope * x.loc[i]) + intercept
        temprss = temprss + (y.loc[i] - out)
    RSS = temprss**2
    return print("RSS: {}".format(RSS))
KeyError: 'the label [1961] is not in the [index]'
I am still very new to Python and I am not sure of the best way to fix this.
Thank you in advance.
I found the answer right after I posted the question, my apologies. Posted by @mkln in:
How to reset index in a pandas data frame?
df = df.reset_index(drop=True)
This resets the index of the entire Series; it is not exclusive to the DataFrame data type.
My updated function code works like a charm:
def ResidSumSquares(x, y, intercept, slope):
    out = 0
    myerror = 0
    x = x.reset_index(drop=True)
    y = y.reset_index(drop=True)
    for i in x:
        out = slope * x.loc[i] + float(intercept)
        myerror = myerror + (y.loc[i] - out)
    RSS = myerror**2
    return print("RSS: {}".format(RSS))
You omit your actual call to ResidSumSquares. How about not resetting the index within the function and passing the training set as x? Iterating over an unusual (not 1, 2, 3, ...) index shouldn't be a problem.
A few observations:
1. As currently written, your function is calculating the squared sum of the errors, not the sum of squared errors... is this intentional? The latter is typically what is used in regression-type applications. Since your variable is named RSS (I assume residual sum of squares), you will want to revisit this.
2. If x and y are consistent subsets of the same larger dataset, then you should have the same indices for both, right? Otherwise, by dropping the index you may be matching unrelated x and y values and glossing over a bug earlier in the code.
3. Since you are using Pandas, this can easily be vectorized to improve readability and speed (Python loops have high overhead).
Example of (3), assuming (2), and illustrating the differences between approaches in (1):
# assuming your indices should be aligned,
# pandas will link xs and ys by index
vectorized_error = y - (slope*x + float(intercept))
# your residual sum of squares -- you have to square first!
rss = (vectorized_error**2).sum()
# if you really want the square of the summed errors...
sse = (vectorized_error.sum())**2
Edit: didn't notice this has been dead for a year.