How to remove divergent values from a list - python

I'm extracting some data from images to build a graph, but some of the extracted values are wrong (the diverging values). Is there a way to remove them without changing the length of the list? I tried calculating the mean and assigning it to those values, but the mean itself turns out far too large as well, because the diverging values inflate it.

Assuming your data is a np.array:
import numpy as np
min_value = 12  # The minimal value of your data; points below it are considered diverging. Change it as you want.
good_data = data[data > min_value]  # Advanced (boolean) indexing keeps only the non-diverging points
good_mean = np.mean(good_data)  # Mean computed without the diverging values
new_data = np.where(data > min_value, data, good_mean)  # Same length as data, with diverging values replaced by good_mean
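Applied to a small made-up array (the numbers are illustrative only), this gives:
data = np.array([14.0, 15.0, 3.0, 16.0, 17.0])
good_mean = np.mean(data[data > 12])  # mean of 14, 15, 16, 17 -> 15.5
new_data = np.where(data > 12, data, good_mean)  # 3.0 replaced by 15.5: [14.0, 15.0, 15.5, 16.0, 17.0]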

How to get value from nsmallest instead of .core.series.Series

Pretty new to Python, so any advice is always welcome.
I am trying to map data from multiple sets of coordinates onto one set, using bilinear interpolation.
I have a set of DataFrames I iterate over, trying to find the nearest neighbors for my interpolation.
Since my grids may not be uniformly spaced, I sort by Y position first:
for i in range(0, len(df_x['X'])):
    x_pos = df_x._get_value(i, 'X')  # pull x coord, y coord
    y_pos = df_y._get_value(i, 'Y')
    for n in data_list:
        df = data_list[n]
        d_y = abs(df['Y'] - y_pos)  # array of distance from Y pos
        d_y.drop_duplicates()  # remove duplicates
        nn_y1 = d_y.nsmallest(1)  # finds closest row
        nn_y2 = d_y.nsmallest(2).iloc[-1]  # finds next closest row
        print(type(nn_y1))
        d_x_y1 = df[df['DesignY'] == nn_y1]  # creates list of X at closest row
I think this should provide me with the upper and lower bounds nearest my points.
However, when I then sort by X position I get an error:
ValueError: Can only compare identically-labeled Series objects
I think this is because nn_y1 comes out as <class 'pandas.core.series.Series'>.
Any advice on how to get the value instead of the Series? I could create a DataFrame with one element, but that seems hacky. I tried some combinations of _get_value() but to no avail.
nsmallest returns:
"The n smallest values in the Series, sorted in increasing order." (Type Series)
In this case the simple way is to unpack from nsmallest(2) since both values are needed:
nn_y1, nn_y2 = d_y.nsmallest(2)
To modify the code directly, iloc is needed to get the first value out of the Series:
nn_y1 = d_y.nsmallest(1).iloc[0]
Alternatively, d_y.nsmallest(2) can be called once and indexed with iloc to get both values:
smallest = d_y.nsmallest(2)
nn_y1 = smallest.iloc[0]
nn_y2 = smallest.iloc[1]
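For example, on a small made-up Series of distances (values are illustrative only):
import pandas as pd
d_y = pd.Series([3.2, 0.5, 1.7, 2.6, 4.1])
nn_y1, nn_y2 = d_y.nsmallest(2)  # nn_y1 == 0.5, nn_y2 == 1.7, plain floats rather than Series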

Can you extract indexes of data over a threshold from numpy array or pandas dataframe

I am using the following to compare several strings to each other. It's the fastest method I've been able to devise, but it produces a very large 2D array, which I can inspect manually to find what I want. Ideally, I would like to set a threshold and pull the index(es) for each value over that number. To complicate matters, I don't want the index of a string compared with itself, but a string might be duplicated elsewhere and I would want to know when that's the case, so I can't simply ignore 1's.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
texts = sql.get_corpus()
vectorizer = TfidfVectorizer()
vectors = vectorizer.fit_transform(texts)
similarity = cosine_similarity(vectors)
sql.get_corpus() returns a list of strings, currently around 1600 of them.
Is what I want possible? I've tried comparing each of the 1.4M combinations to each other using Levenshtein, which works, but it takes 2.5 hours versus half an hour for the approach above. I've also tried vectors with spaCy, which takes days.
I'm not entirely sure I read your post correctly, but I believe this should get you started:
import numpy as np
# randomly distributed data we want to filter
data = np.random.rand(5, 5)
# get index of all values above a threshold
threshold = 0.5
above_threshold = data > threshold
# I am assuming your matrix has all string comparisons to
# itself on the diagonal
not_ident = np.identity(5) == 0.
# [edit: to prevent duplicate comparisons, use this instead of not_ident]
#upper_only = np.triu(np.ones((5,5)) - np.identity(5))
# 2D array, True when criteria met
result = above_threshold * not_ident
print(result)
# original shape, but 0 in place of all values not matching above criteria
values_orig_shape = data * result
print(values_orig_shape)
# all values that meet criteria, as a 1D array
values = data[result]
print(values)
# indices of all values that meet criteria (in same order as values array)
indices = [index for index,value in np.ndenumerate(result) if value]
print(indices)
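Applied to the cosine-similarity matrix from the question, a minimal sketch might look like this (the random matrix is only a stand-in for similarity = cosine_similarity(vectors); np.argwhere is an equivalent way to collect the indices):
import numpy as np
rng = np.random.default_rng(0)
similarity = rng.random((6, 6))  # stand-in for cosine_similarity(vectors)
threshold = 0.5
mask = similarity > threshold
np.fill_diagonal(mask, False)  # drop each string's comparison with itself
pairs = np.argwhere(mask)  # [row, col] index pairs over the threshold
scores = similarity[mask]  # the matching similarity values, in the same order
print(pairs)
print(scores)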

Pandas: Find end frequency spectrum above a defined threshold

Long-time reader, first time posting.
I am working with x,y data for frequency response plots in Pandas DataFrames. Here is an example of the data and the plots (see full .csv file at end of post):
fbc['x'],fbc['y']
(0 [89.25, 89.543, 89.719, 90.217, 90.422, 90.686...
1 [89.25, 89.602, 90.422, 90.568, 90.744, 91.242...
2 [89.25, 89.689, 89.895, 90.305, 91.008, 91.74,...
3 [89.25, 89.514, 90.041, 90.275, 90.422, 90.832...
Name: x, dtype: object,
0 [-77.775, -77.869, -77.766, -76.572, -76.327, ...
1 [-70.036, -70.223, -71.19, -71.229, -70.918, -...
2 [-73.079, -73.354, -73.317, -72.753, -72.061, ...
3 [-70.854, -71.377, -74.069, -74.712, -74.647, ...
Name: y, dtype: object)
where x = frequency and y = amplitude data. The resulting plots for each of these looks as follows:
See the x,y plot image at this link (not enough reputation points to embed it yet).
I can create a plot for each row of the x,y data in the DataFrame.
What I need to do in pandas (Python) is identify the highest frequency in the data before the frequency response drops to the noise floor permanently. As you can see, there are places where the y data may dip to a very low value (say < -50) but then return to > -40.
How can I find, in pandas / Python (ideally without iteration, given the very large data sizes), the highest frequency with amplitude > -40 such that the response never comes back above -40 afterwards? Basically, I'm trying to find the end of the frequency band. I've tried working with some of the pandas statistics (which would also be nice to have), but have been unsuccessful in getting useful data out of them.
Thanks in advance for any pointers and direction you can provide.
Here is a .csv file that can be imported with csv.reader: https://www.dropbox.com/s/ia7icov5fwh3h6j/sample_data.csv?dl=0
I believe I have come up with a solution.
Based on a suggestion from @katardin I came up with the following, though I think it can be optimized. Again, I will be dealing with huge amounts of data, so if anyone can find a more elegant solution it would be appreciated.
for row in fbc['y']:
    # Reverse the y data so we read from the end (right to left)
    test_list = row[::-1]
    # Find the first y value above the noise floor (> -50)
    res = next(x for x, val in enumerate(test_list) if val > -50)
    # Since we reversed the y data, take len - res to get the index in the original order
    index = len(test_list) - res
    # Print results
    print("The index of element is : " + str(index))
Where the output is index numbers as follows:
The index of element is : 2460
The index of element is : 2400
The index of element is : 2398
The index of element is : 2382
Each one I have checked and corresponds to the exact high frequency roll-off point I have been looking for. Great suggestion!
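For very large datasets, the per-row scan can also be done with NumPy; this is only a sketch of the same idea (it reuses fbc['y'] and the -50 noise floor from above and assumes each row is a plain sequence of numbers):
import numpy as np
for row in fbc['y']:
    y = np.asarray(row, dtype=float)
    above = np.nonzero(y > -50)[0]  # indices of all samples above the noise floor
    index = above[-1] + 1 if above.size else 0  # one past the last sample above -50, matching len(test_list) - res
    print("The index of element is : " + str(index))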

How to iterate through a 3D array and calculate the mean of each cell

I want to create a loop for the following line in Python (I use PyCharm):
mean_diff = np.mean(np.array([diff_list[0].values, diff_list[1].values, diff_list[2].values, diff_list[3].values, ..., diff_list[100].values]), axis=0)
With this I get the mean of each individual cell across the different arrays (raster change over time).
I tried the following:
for x in range(100):
    mean_diff = np.mean(np.array([diff_list[x].values]), axis=0)
But what happens here is that it calculates the mean between the mean of the last iteration and the new array, and so on, instead of adding everything up first and calculating the mean of the total afterwards. One idea was to create a "sum array" first with all the diff_list values in it, but I failed to do that too. The original type of my diff_list is a list containing DataFrames (one array per row, so it's a 3D array / DataFrame?); see picture: the image shows the structure of the list.
You need to populate the array, not do the computation, within the loop. Python list comprehensions are perfect for this:
Your first program is the equivalent of:
mean_diff = np.mean(np.array([a.values for a in diff_list[:101]]), axis=0)
Or if you prefer:
x = []
for a in diff_list[:101]:
    x.append(a.values)
mean_diff = np.mean(np.array(x), axis=0)
If you are using the whole list instead of its first 101 elements you can drop the "[:101]".
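As a quick sanity check, here is a minimal self-contained sketch with three small DataFrames standing in for diff_list:
import numpy as np
import pandas as pd
diff_list = [pd.DataFrame([[1.0, 2.0], [3.0, 4.0]]),
             pd.DataFrame([[5.0, 6.0], [7.0, 8.0]]),
             pd.DataFrame([[9.0, 10.0], [11.0, 12.0]])]
mean_diff = np.mean(np.array([a.values for a in diff_list]), axis=0)
print(mean_diff)  # cell-wise means: [[5, 6], [7, 8]]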

Python Dataframe find row value which changes every 10 percent

I have a DataFrame with a column like (0.12, 0.14, 0.16, 0.13, 0.23, 0.25, 0.28, 0.32, 0.33). I want a new column that only records a value when it changes by more than 0.1 or -0.1; until then, the previously recorded value is carried forward.
So the new column should be like (0.12, 0.12, 0.12, 0.12, 0.23, 0.23, 0.23, 0.32, 0.32).
Does anyone know how to write this in a simple way?
Thanks ahead.
I'm not really sure what you're trying to achieve by snapping the data to these particular values. You might want to consider either rounding to the midpoint, or using a ceiling/floor function after multiplying the array by 10.
What you're trying to achieve can, however, be done like this:
import numpy as np

def cookdata(data):
    # Assuming your data is sorted, as in the example array in your question
    data = np.asarray(data)
    i = 0
    startidx = 0
    while np.unique(data).size > np.ceil((data.max() - data.min()) / 0.1):
        lastidx = startidx + np.where(data[startidx:] < np.unique(data)[0] + 0.1 * (i + 1))[0].size
        data[startidx:lastidx] = np.unique(data)[i]
        startidx = lastidx
        i += 1
    return data
This returns a dataset as asked in your question. I am sure there are better ways to do it:
data = np.sort(np.random.uniform(0.12, 0.5, 10))
data
array([ 0.12959374, 0.14192312, 0.21706382, 0.27638412, 0.27745105,
0.28516701, 0.37941334, 0.4037809 , 0.41016534, 0.48978927])
cookdata(data)
array([ 0.12959374, 0.12959374, 0.12959374, 0.27638412, 0.27638412,
0.27638412, 0.37941334, 0.37941334, 0.37941334, 0.48978927])
The function bins the values relative to the first value of the array.
You might, however, want to consider simpler operations that do not require snapping values to arbitrary data points. Consider np.round(data, decimals=1). In your case you could also use a floor function, as in np.floor(data/0.1)*0.1, or, if you want to keep the initial value:
data = np.asarray(data)
datamin = data.min()
data = np.floor((data-datamin)/0.1)*0.1+datamin
data
array([ 0.12959374, 0.12959374, 0.12959374, 0.22959374, 0.22959374,
0.22959374, 0.32959374, 0.32959374, 0.32959374, 0.42959374])
Here the data is binned into multiples of 0.1 above the first value, rather than snapped to arbitrary values between those multiples.
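If the goal is literally "keep the last recorded value until the current value differs from it by more than a threshold", a minimal pandas sketch could look like this (the column name 'value' is an assumption; note that with a strict 0.1 threshold the 0.23 -> 0.32 step, a change of only 0.09, would not trigger a new value, so the threshold may need adjusting to reproduce the expected output exactly):
import pandas as pd

def carry_forward(values, threshold=0.1):
    # Keep the last recorded value until the current value differs
    # from it by more than `threshold`, then record the new value.
    out = []
    last = None
    for v in values:
        if last is None or abs(v - last) > threshold:
            last = v
        out.append(last)
    return out

df = pd.DataFrame({'value': [0.12, 0.14, 0.16, 0.13, 0.23, 0.25, 0.28, 0.32, 0.33]})
df['recorded'] = carry_forward(df['value'], threshold=0.1)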
