I have a DataFrame with a column like (0.12, 0.14, 0.16, 0.13, 0.23, 0.25, 0.28, 0.32, 0.33). I want a new column that only records a value once it has changed by more than 0.1 (or less than -0.1); until then, the previously recorded value is repeated.
So the new column should look like (0.12, 0.12, 0.12, 0.12, 0.23, 0.23, 0.23, 0.32, 0.32).
Does anyone know how to write this in a simple way?
Thanks in advance.
I'm not really sure what you're trying to achieve by snapping the data to these somewhat arbitrary values. You might want to consider either rounding to the nearest midpoint, or applying a ceiling/floor function after multiplying the array by 10.
What you're asking for can, however, be done like this:
import numpy as np

def cookdata(data):
    # Assuming your data is sorted, as in the example array in your question
    data = np.asarray(data)
    i = 0
    startidx = 0
    while np.unique(data).size > np.ceil((data.max() - data.min()) / 0.1):
        lastidx = startidx + np.where(data[startidx:] < np.unique(data)[0] + 0.1 * (i + 1))[0].size
        data[startidx:lastidx] = np.unique(data)[i]
        startidx = lastidx
        i += 1
    return data
This returns a dataset as asked in your question. I am sure there are better ways to do it:
data = np.sort(np.random.uniform(0.12, 0.5, 10))
data
array([ 0.12959374, 0.14192312, 0.21706382, 0.27638412, 0.27745105,
0.28516701, 0.37941334, 0.4037809 , 0.41016534, 0.48978927])
cookdata(data)
array([ 0.12959374, 0.12959374, 0.12959374, 0.27638412, 0.27638412,
0.27638412, 0.37941334, 0.37941334, 0.37941334, 0.48978927])
The function bins the values relative to the first (smallest) value and represents each bin by the first value that falls into it.
You might, however, want to consider simpler operations that do not snap values to arbitrary data points. Consider np.round(data, decimals=1). In your case you could also use a floor function, as in np.floor(data/0.1)*0.1, or, if you want to keep the initial value:
data = np.asarray(data)
datamin = data.min()
data = np.floor((data-datamin)/0.1)*0.1+datamin
data
array([ 0.12959374, 0.12959374, 0.12959374, 0.22959374, 0.22959374,
0.22959374, 0.32959374, 0.32959374, 0.32959374, 0.42959374])
Here the data is snapped to exact multiples of 0.1 above the first value, rather than to the first observed value within each 0.1-wide interval.
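The first-value-per-bin behaviour of cookdata can also be written without the while loop. This is only a sketch (floating point rounding exactly on a bin edge may still need care), but on the data from the question it reproduces the expected output:

import numpy as np

def cookdata_vectorized(data, step=0.1):
    # Group values into `step`-wide bins measured from the minimum and replace
    # every value by the first value that fell into its bin.
    data = np.asarray(data)
    bins = np.floor((data - data.min()) / step).astype(int)
    _, first_idx, inverse = np.unique(bins, return_index=True, return_inverse=True)
    return data[first_idx][inverse]

values = np.array([0.12, 0.14, 0.16, 0.13, 0.23, 0.25, 0.28, 0.32, 0.33])
print(cookdata_vectorized(values))
# [0.12 0.12 0.12 0.12 0.23 0.23 0.23 0.32 0.32]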
I'm extracting some data from some images to get a graph, but some of the extracted values are wrong (the diverging values). Is there a way to remove them without changing the length of the list? I tried calculating the mean and assigning it to those values, but apparently the mean comes out far too big as well.
Assuming your data is a np.array:
import numpy as np

min_value = 12  # The minimal value of your data; points below it are considered diverging. Change it as you want.
good_data = data[data > min_value]                      # boolean (advanced) indexing keeps only the good points
good_mean = np.mean(good_data)                          # mean computed without the diverging values
new_data = np.where(data > min_value, data, good_mean)  # replace diverging points by that mean
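A small self-contained example of the same idea, with made-up numbers (the cut-off of 12 is just illustrative):

import numpy as np

data = np.array([20.1, 19.8, 3.2, 21.0, 20.5, 1.7, 19.9])  # 3.2 and 1.7 are the diverging points

min_value = 12                                  # points below this are considered diverging
good_mean = np.mean(data[data > min_value])     # mean computed from the good points only (20.26)
new_data = np.where(data > min_value, data, good_mean)
print(new_data)                                 # diverging points replaced, length unchanged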
I have the following numpy array named 'data'. It consists of 15118 rows and 2 columns. The first column mostly consists of 0.01 steps, but sometimes there is a step in between (shown in red in my plot) which I would like to remove/filter out.
I achieved this with the following code:
import numpy as np

# Create array [0, 0.01, ..., 140], rounded to 2 decimals to prevent floating point error
b = np.round(np.arange(0, 140.01, 0.01), 2)

# New empty data array
new_data = np.empty(shape=[0, 2])

# Loop over values to remove/filter out data
for x in b:
    Index = np.where(x == data[:, 0])[0][0]
    new_data = np.vstack([new_data, data[Index]])
I feel like this code is far from optimal and I was wondering if anyone knows a faster/better way of achieving this?
Here's a solution using pandas for resampling. You could probably achieve the same result in pure numpy, but there are a number of floating point and rounding pitfalls you would face; it may be better to let a trusted library do the work for you.
Let's say arr is your data array and assume the index is in fractions of a second. You can convert your array to a dataframe with a timedelta index:
import numpy as np
import pandas as pd

df = pd.DataFrame(arr[:, 1], index=arr[:, 0])
df.index = pd.to_timedelta(df.index, unit="s")
Then resampling is pretty easy. 10ms is the frequency you want; first() should give you the expected result, dropping everything but the records at the 10ms ticks, but feel free to experiment with other aggregation functions:
df = df.resample("10ms").first()
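For instance, if you would rather average the extra rows into their 10ms bin than drop them, mean() is a possible variation (not part of the original approach, just an illustration):

df = df.resample("10ms").mean()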
Finally, you can get back to your array with something like:
np.vstack([pd.to_numeric(df.index, downcast="float").values / 1e9,
           df.values.squeeze()]).T
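Putting the pieces together on a tiny synthetic array (a sketch only; the two-column layout with time in seconds in the first column is assumed from the question):

import numpy as np
import pandas as pd

# Synthetic data: a regular 0.01 s grid plus one unwanted in-between row at 0.015 s
arr = np.array([[0.00, 1.0],
                [0.01, 2.0],
                [0.015, 99.0],
                [0.02, 3.0],
                [0.03, 4.0]])

df = pd.DataFrame(arr[:, 1], index=arr[:, 0])
df.index = pd.to_timedelta(df.index, unit="s")
df = df.resample("10ms").first()                  # keeps only the record at each 10 ms tick

out = np.vstack([pd.to_numeric(df.index, downcast="float").values / 1e9,
                 df.values.squeeze()]).T
print(out)                                        # the 0.015 s row is gone, the 0.01 s grid remains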
Okay, so I have a 1000x100 array of random numbers. I want to threshold this array with a list of numbers; these numbers go from 3 to 9. If values are higher than the threshold, I want the sum of the row appended to a list.
I have tried many ways, including a triple nested for loop with a conditional. Right now I have found a way to compare the array against a list of numbers, but each time I do that I generate new random numbers for the array.
import numpy as np

xpatient = 5
sd_healthy = 2
xhealthy = 7
sd_patient = 2
thresholdvalue1 = (xpatient - sd_healthy) * 10
thresholdvalue2 = (xhealthy + sd_patient) * 10
thresholdlist = []
x1 = []
Ahealthy = np.random.randint(10, size=(1000, 100))
Apatient = np.random.randint(10, size=(1000, 100))
TParray = np.random.randint(10, size=(1, 61))

def thresholding(A, B):
    for i in range(A, B):
        thresholdlist.append(i)

thresholding(thresholdvalue1, thresholdvalue2 + 1)
thresholdarray = np.asarray(thresholdlist)
thedivisor = 10
newthreshold = thresholdarray / thedivisor

for x in range(61):
    Apatient = np.random.randint(10, size=(1000, 100))
    Apatient = [Apatient >= newthreshold[x]] * Apatient
    x1.append([sum(x) for x in zip(*Apatient)])
So my for loop regenerates the random array inside it, but if I don't do that, I don't get to apply each threshold in turn. I want the threshold for the whole array to be 3, 3.1, 3.2, etc.
I hope I got my point across. Thanks in advance.
You can solve your problem using this approach:
import numpy as np

def get_sums_by_threshold(data, threshold, axis):
    # axis=0 sums over rows (one result per column), axis=1 sums over columns (one result per row)
    result = list(np.where(data >= threshold, data, 0).sum(axis=axis))
    return result
xpatient = 5
sd_healthy = 2
xhealthy = 7
sd_patient = 2
thresholdvalue1 = (xpatient - sd_healthy) * 10
thresholdvalue2 = (xhealthy + sd_patient) * 10

np.random.seed(100)  # to keep the generated array reproducible
data = np.random.randint(10, size=(1000, 100))
thresholds = [num / 10.0 for num in range(thresholdvalue1, thresholdvalue2 + 1)]
sums = list(map(lambda x: get_sums_by_threshold(data, x, axis=0), thresholds))
But you should know that your initial array contains only integer values, so you will get the same result for every threshold that falls between the same pair of consecutive integers (e.g. 3.1, 3.2, ..., 3.9 all keep exactly the values >= 4). If you want the initial array of the given shape to hold float numbers from 0 to 9 instead, you can do the following:
data = np.random.randint(90,size=(1000,100)) / 10.0
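As a quick sanity check of that point (a sketch reusing the get_sums_by_threshold function defined above, run on a fresh integer-valued array rather than the float version):

# With integer data, thresholds 3.1 and 3.9 keep exactly the same elements
# (all values >= 4), so the resulting sums are identical.
int_data = np.random.randint(10, size=(1000, 100))
print(get_sums_by_threshold(int_data, 3.1, axis=0) == get_sums_by_threshold(int_data, 3.9, axis=0))  # True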
Background
I deal with a csv datasheet that contains columns of numbers. I am working on a program that takes the first column, asks the user for a time as a float (i.e. 45 and a half hours = 45.5), and then subtracts that number from the first column. I have been successful in that regard. Now I need to find the row index of the "zero" time point. I use min to find that index and then use it to look up the value in the following column, A1.1. I need the reading at time 0 so I can normalize A1.1 to it, so that on a graph the reading at the 0 time point is 1 in column A1.1 (and eventually in all subsequent columns, but baby steps for me).
time_zero = float(input("Which time would you like to be set to 0?"))
df['A1']= df['A1']-time_zero
This works fine so far to set the zero time.
zero_location_series = df[df['A1'] == df['A1'].min()]
r1 = zero_location_series[' A1.1']
df[' A1.1'] = df[' A1.1']/r1
Here's where I run into trouble. The first line correctly identifies a series that I can use for all my other columns. Next, r1 correctly identifies the proper A1.1 value, and this value appears to be a float when I check type(r1).
However, when I divide df[' A1.1'] by r1, it yields only one correct value, the row where r1/r1 = 1. All other values come out as NaN.
My Questions:
How do I divide a column by a float? Why am I getting NaN?
Is there a faster way to do this, as I need to do it for 16 columns (i.e. 'A2/r2', 'A3/r3', etc.)?
Do I need to use inplace=True anywhere to make the operations stick before resaving the data, or is that only for adding/deleting rows?
Example
A DataFrame that looks like this: http://i.imgur.com/ObUzY7p.png
The zero time sets properly (image not shown).
After dividing the column: http://i.imgur.com/TpLUiyE.png
This should work:
df['A1.1']=df['A1.1']/df['A1.1'].min()
I think the reason df[' A1.1'] = df[' A1.1']/r1 did not work is that r1 is a Series. Try r1? (in IPython) instead of type(r1), and pandas will tell you that r1 is a Series, not an individual float number.
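If you do want to divide by that single reading rather than recompute the minimum, a minimal sketch (assuming the same df and zero_location_series as in the question) is to pull the scalar out of the one-element series first; dividing a column by a Series aligns on the index, which is where the NaNs come from:

# Dividing a column by a one-element Series aligns on the index, so every row
# whose index is missing from r1 becomes NaN. Extracting the scalar avoids that.
r1 = zero_location_series[' A1.1'].iloc[0]
df[' A1.1'] = df[' A1.1'] / r1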
To do it for all columns in one attempt, you can iterate over each column, like this:
for c in df:
    df[c] = df[c] / df[c].min()
If you want to divide every value in the column by r1, it's best to use apply, for example:
import pandas as pd

df = pd.DataFrame([1, 2, 3, 4, 5])

# apply an anonymous function to the first column ([0]): divide every value
# in the column by 3
df = df[0].apply(lambda x: x / 3.0)
print(df)
So you'd probably want something like this:
df = df["A1.1"].apply(lambda x: x/r1, 0)
This really only answers part 2 of your question. Apply is probably your best bet for running a function over multiple rows and columns quickly. As for why you're getting NaNs when dividing by a float: is it possible the values in your columns are something other than floats or integers?
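To cover all 16 columns in one go, a hedged sketch (the reading-column names ' A2.1', ' A3.1', ... are assumptions following the question's naming with a leading space, and abs().idxmin() is used here to find the row whose adjusted time is closest to zero):

# Assumes df['A1'] has already had time_zero subtracted, as in the question.
zero_row = df['A1'].abs().idxmin()              # index label of the zero-time row

reading_cols = [' A1.1', ' A2.1', ' A3.1']      # extend this list to all 16 reading columns
df[reading_cols] = df[reading_cols].apply(lambda col: col / col.loc[zero_row])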
I am trying to convert all my code, written in MATLAB, to Python. I have a problem and I couldn't find a way to solve it. Maybe someone has an idea.
I have a file with m rows and two columns. I want to read the file and then sort it based on the second column. Then I must take the sorted first column (from the first row to the 1000th row), find the values larger than a threshold (here, for example, 0.2), and sum them.
Hope someone has an idea.
Thanks
If the file has, for example, fields separated by tabs and rows separated by newlines, the problem is quite simple:
f = open("filename.csv")
data = [list(map(float, line.split("\t"))) for line in f]
data.sort(key=lambda x: x[1])                          # sort rows by the second column
result = sum(x[0] for x in data[:1000] if x[0] > 0.2)
Consider using NumPy arrays and their accompanying functions. They are (usually) quite similar to those in MATLAB, which might make your conversion from the latter easier.
import numpy as np

data = np.genfromtxt("filename.csv", delimiter="\t", dtype=float)
idx = np.argsort(data[:, 1])                  # row order when sorting by the second column
data1000 = data[idx[:1000]]                   # first 1000 rows of the sorted data
result = np.sum(data1000[data1000[:, 0] > 0.2, 0])