Python: processing data so that only constant values remain

I have data from a measurement and I want to process it so that only the values that are constant remain. The measured signal consists of parts where the value stays constant for some time; then I make a change to the system that causes the value to increase. It takes time for the system to reach a constant value after the adjustment.
I wrote a program that compares every value with the 10 previous values. If it is equal to them within a tolerance, it gets saved.
The code works, but I feel this can be done more cleanly and efficiently, so that it is suitable for processing larger amounts of data. I just don't know how to make the for-loop more efficient. Do you have any suggestions?
Thank you in advance.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
df = pd.read_csv('radiale Steifigkeit_22_04_2022_raw.csv',
                 sep=";",
                 decimal=',',
                 skipinitialspace=True,
                 comment='\t')
#df = df.drop(df.columns[[0,4]], axis=1)
#print(df.head())
#print(df.dtypes)
#df.plot(x = 'Time_SYS 01-cDAQ:1_A-In-All_Rec_rel', y = 'Kraft')
#df.plot(x = 'Time_SYS 01-cDAQ:1_A-In-All_Rec_rel', y = 'Weg')
#plt.show()
s = pd.Series(df['Weg'], name = 'Weg')
f = pd.Series(df['Kraft'], name= 'Kraft')
t = pd.Series(df['Time_SYS 01-cDAQ:1_A-In-All_Rec_rel'], name= 'Zeit')
#s_const = pd.Series()
s_const = []
f_const = []
t_const = []
s = np.abs(s)
#plt.plot(s)
#plt.show()
c = 0
#this for-loop compares the value s[i] with the previous 10 measurements.
#If it is equal within a tolerance it is saved into s_const.
#for i in range(0, 2000):
for i in range(len(s)):
    if i > 10:
        si = round(s[i], 3)
        s1i = round(s[i-1], 3)
        s2i = round(s[i-2], 3)
        s3i = round(s[i-3], 3)
        s4i = round(s[i-4], 3)
        s5i = round(s[i-5], 3)
        s6i = round(s[i-6], 3)
        s7i = round(s[i-7], 3)
        s8i = round(s[i-8], 3)
        s9i = round(s[i-9], 3)
        s10i = round(s[i-10], 3)
        if si == s1i == s2i == s3i == s4i == s5i == s6i == s7i == s8i == s9i == s10i:
            c = c + 1
            s_const.append(s[i])
            f_const.append(f[i])

Here is a very performant implementation using itertools (based on "Check if all elements in a list are identical"):
from itertools import groupby
def all_equal(iterable):
    g = groupby(iterable)
    return next(g, True) and not next(g, False)
data = [1, 2, 2, 2, 2, 3, 3, 4, 4, 4, 4, 4, 4, 5, 5, 5]
window = 3
stable = [i for i in range(len(data) - window + 1) if all_equal(data[i:i+window])]
print(stable) # -> [1, 2, 7, 8, 9, 10, 13]
The algorithm produces a list of indices in your data where a stable period of length window starts.
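The groupby trick checks exact equality; the original tolerance-based comparison can also be vectorized. A minimal sketch, assuming NumPy ≥ 1.20 for sliding_window_view (the signal, window, and tolerance below are placeholder values):
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

s = np.abs(np.array([0.0, 0.0, 0.0, 0.1, 0.5, 0.5, 0.5, 0.5, 0.9]))
window = 4   # compare each value with the previous window-1 samples
tol = 1e-3   # tolerance, replacing round(..., 3)

views = sliding_window_view(s, window)    # each row is one window of consecutive samples
stable = np.ptp(views, axis=1) <= tol     # a window is "constant" if its spread stays within tol
idx = np.nonzero(stable)[0] + window - 1  # index of the last sample in each constant window
s_const = s[idx]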

Related

numpy to find products of all combinations of pairs of numbers in a matrix row

I'm given an n x m matrix and my goal is to find the "cross-product" of all the features; specifically, each row in the product matrix consists of the entries
x_ij * x_ij',  with j < j',  j = 1, ..., m,  j' = j+1, ..., m
so that the resulting row is the product of all combinations of pairs in that row. Is there an elegant way to do this using numpy functions rather than Python loops?
Edit: example
[1, 2, 3, 4]
should become
[1*2, 1*3, 1*4, 2*3, 2*4, 3*4]
which give:
[2, 3, 4, 6, 8, 12]
Basically you want to generate all the possible subsets containing 2 elements of your original set.
Short answer:
# With m = 4
c = np.multiply(*np.add(np.triu_indices(4,1),1))
General solution for any input array:
If using itertools is an option then you can use:
import numpy as np
import itertools

x = list(itertools.combinations([1, 2, 3, 4], 2))
c = np.prod(x, -1)
c output:
array([ 2, 3, 4, 6, 8, 12])
From the doc:
itertools.combinations(iterable, r): return r-length tuples in sorted order with no repeated elements.
The number of elements in c corresponds to the binomial coefficient C(n, k), "n choose k", where n = len([1, 2, 3, 4]) and k = 2, so here C(4, 2) = 6.
Note that itertools.combinations() only hides the for loop; since there is no closed-form formula for this problem, a loop is unavoidable.
Numpy only solution:
Numpy only solution: in your specific case, where the iterable is the sequence of n positive integers [1, 2, 3, 4, ..., n], you can notice that the (row, column) indices of the nonzero entries of an upper-triangular matrix of size n-1 reproduce the same pairs as combinations, so:
# Number of elements in your array
n = 4
# Upper triangular matrix of size n-1
x = np.triu(np.ones([n - 1, n - 1]))
# argwhere gives the (row, col) pairs; adding np.arange(1, 3) = [1, 2]
# shifts them to the 1-based values being multiplied
c = np.prod(np.argwhere(x) + np.arange(1, 3), -1)
And again c output:
array([ 2, 3, 4, 6, 8, 12])
Or (with the help of @Nachikel; I wasn't aware of the existence of np.triu_indices()) the one-liner:
c = np.multiply(*np.add(np.triu_indices(4,1),1))
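Broken into steps, the one-liner does the following (a sketch for n = 4):
import numpy as np

rows, cols = np.triu_indices(4, 1)   # rows = [0 0 0 1 1 2], cols = [1 2 3 2 3 3]
pairs = np.add((rows, cols), 1)      # shift the 0-based indices to the values 1..4
c = np.multiply(pairs[0], pairs[1])  # elementwise product -> [ 2  3  4  6  8 12]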
Benchmarking (the resulting timing plots are omitted here; one of them also includes itertools). The code:
import numpy as np
import itertools
import timeit
import matplotlib.pyplot as plt
def itertools1(m):
    x = list(itertools.combinations(np.arange(1, m + 1), 2))
    np.prod(x, -1)

def numpy1(m):
    n = m - 1
    x = np.triu(np.ones([n, n]))
    np.prod(np.argwhere(x) + np.arange(1, 3), -1)

def numpy2(m):
    np.multiply(*np.add(np.triu_indices(m, 1), 1))

def benchmark_time(m):
    SETUP_CODE = '''
from __main__ import numpy1
from __main__ import numpy2
from __main__ import itertools1
'''
    x = np.zeros([3, len(m)])
    for ind, mi in enumerate(m):
        print('For m = {}'.format(mi))

        # timeit.repeat statement
        TEST_CODE = 'itertools1({})'.format(mi)
        times = timeit.repeat(setup=SETUP_CODE,
                              stmt=TEST_CODE,
                              repeat=10,
                              number=50)
        x[0, ind] = np.average(times)
        print('itertools1 gives:\t{} s'.format(np.round(np.average(times), 3)))

        TEST_CODE = 'numpy1({})'.format(mi)
        times = timeit.repeat(setup=SETUP_CODE,
                              stmt=TEST_CODE,
                              repeat=10,
                              number=50)
        x[1, ind] = np.average(times)
        print('numpy1 gives:\t\t{} s'.format(np.round(np.average(times), 3)))

        TEST_CODE = 'numpy2({})'.format(mi)
        times = timeit.repeat(setup=SETUP_CODE,
                              stmt=TEST_CODE,
                              repeat=10,
                              number=50)
        x[2, ind] = np.average(times)
        print('numpy2 gives:\t\t{} s\n'.format(np.round(np.average(times), 3)))
    return x
m = np.arange(10,150,10)
x = benchmark_time(m)
plt.plot(m,x.T)
plt.legend(('itertools', 'numpy triu', 'numpy triu_indices'))
plt.xlabel('m')
plt.ylabel('sec')
plt.show()

Python 'for' loop performance too slow

I have over 500,000 rows in my dataframe and a number of similar 'for' loops, which cause my code to take over an hour to complete its computation. Is there a more efficient way of writing the following 'for' loop so that things run a lot faster?
col_26 = []
col_27 = []
col_28 = []
for ind in df.index:
    if df['A_factor'][ind] > df['B_factor'][ind]:
        col_26.append('Yes')
        col_27.append('No')
        col_28.append(df['A_value'][ind])
    elif df['A_factor'][ind] < df['B_factor'][ind]:
        col_26.append('No')
        col_27.append('Yes')
        col_28.append(df['B_value'][ind])
    else:
        col_26.append('')
        col_27.append('')
        col_28.append(float('nan'))
You might want to look into the pandas iterrows() function or using apply; you can look at this article as well: https://towardsdatascience.com/how-to-make-your-pandas-loop-71-803-times-faster-805030df4f06
Try column operations:
data = {'A_factor': [1, 2, 3, 4, 5],
        'A_value': [10, 20, 30, 40, 50],
        'B_factor': [2, 3, 1, 2, 6],
        'B_value': [11, 22, 33, 44, 55]}
df = pd.DataFrame(data)

df['col_26'] = ''
df['col_27'] = ''
df['col_28'] = np.nan

# Note: rows where A_factor == B_factor fall into ~mask here, so ties are
# treated like the A < B branch of the original loop.
mask = df['A_factor'] > df['B_factor']
df.loc[mask, 'col_26'] = 'Yes'
df.loc[~mask, 'col_26'] = 'No'
df.loc[mask, 'col_28'] = df[mask]['A_value']
df.loc[~mask, 'col_27'] = 'Yes'
df.loc[mask, 'col_27'] = 'No'
df.loc[~mask, 'col_28'] = df[~mask]['B_value']
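A compact alternative to the two masks is np.select, which also covers the tie case in the same pass (a sketch, using the question's column names):
import numpy as np

conditions = [df['A_factor'] > df['B_factor'],
              df['A_factor'] < df['B_factor']]

df['col_26'] = np.select(conditions, ['Yes', 'No'], default='')
df['col_27'] = np.select(conditions, ['No', 'Yes'], default='')
df['col_28'] = np.select(conditions, [df['A_value'], df['B_value']], default=np.nan)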
Appending to lists in Python is relatively slow. Preallocating the lists before the iteration can speed things up. For example,
import timeit

def f():
    x = []
    for ii in range(500000):
        x.append(str(ii))

def f2():
    x = [""] * 500000
    for ii in range(500000):
        x[ii] = str(ii)

timeit.timeit("f()", "from __main__ import f", number=10)
# Output: 1.6317970999989484
timeit.timeit("f2()", "from __main__ import f2", number=10)
# Output: 1.3037318000024243
Since you're already using pandas / numpy, there are ways to vectorize your operations so they don't need looping. For example:
a_factor = df["A_factor"].to_numpy()
b_factor = df["B_factor"].to_numpy()
a_value = df["A_value"].to_numpy()
b_value = df["B_value"].to_numpy()

col_26 = np.empty(a_factor.shape, dtype='U3')  # U3 => string of size 3
col_27 = np.empty(a_factor.shape, dtype='U3')
col_28 = np.empty(a_factor.shape)

a_greater = a_factor > b_factor
b_greater = a_factor < b_factor
both_equal = a_factor == b_factor

col_26[a_greater] = 'Yes'
col_26[b_greater] = 'No'
col_27[a_greater] = 'No'
col_27[b_greater] = 'Yes'
col_28[a_greater] = a_value[a_greater]
col_28[b_greater] = b_value[b_greater]
col_28[both_equal] = np.nan
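The arrays can then be attached back to the dataframe as columns:
df["col_26"] = col_26
df["col_27"] = col_27
df["col_28"] = col_28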
append makes Python request more heap memory as the list grows; using append in a for loop allocates and frees memory continually. So it is better to tell Python up front how many items you need:
col_26 = ['Yes'] * 500000
col_27 = ['No'] * 500000
col_28 = [float('nan')] * 500000

for ind in df.index:
    if df['A_factor'][ind] > df['B_factor'][ind]:
        col_28[ind] = df['A_value'][ind]
    elif df['A_factor'][ind] < df['B_factor'][ind]:
        col_26[ind] = 'No'
        col_27[ind] = 'Yes'
        col_28[ind] = df['B_value'][ind]
    else:
        col_26[ind] = ''
        col_27[ind] = ''

How to iterate over pandas df with a def function variable function

I hope you can guide me here, because I am a little lost and not very experienced in Python programming.
My goal: I have to calculate the "adducts" for a given "Compound"; both represent numbers, but for each "Compound" there are 46 different "Adducts".
Each adduct is calculated as follows:
Adduct = Exact_mass * M / Charge + Adduct_mass
where Exact_mass is a number, M and Charge are numbers (1, 2, 3, etc.) according to each type of adduct, and Adduct_mass is a number (positive or negative) according to each adduct.
My data: 2 data frames. One with the adduct names, M, Charge, and Adduct_mass.
The other one contains the Compound_name and Exact_mass of the compounds I want to iterate over (I just put in a small data set).
Adducts: df_al
import pandas as pd
data = [["M+3H", 3, 1, 1.007276], ["M+3Na", 3, 1, 22.989], ["M+H", 1, 1, 1.007276], ["2M+H", 1, 2, 1.007276], ["M-3H", 3, 1, -1.007276]]
df_al = pd.DataFrame(data, columns=["Ion_name", "Charge", "M", "Adduct_mass"])
Compounds: df
import pandas as pd
data1 = [[1, "C3H64O7", 596.465179], [2, "C30H42O7", 514.293038], [4, "C44H56O8", 712.397498], [4, "C24H32O6S", 448.191949], [5, "C20H28O3", 316.203834]]
df = pd.DataFrame(data1, columns=["CdId", "Formula", "exact_mass"])
My code:
df_name = df_al["Ion_name"]
df_mass = df_al["Adduct_mass"]  # must match the column name defined above
df_div = df_al["Charge"]        # the charge is the divisor in the formula
df_M = df_al["M"]
Then I defined a function for each ion, using the index to set each value:
def A0(x):
    return x*df_M[0]/df_div[0] + df_mass[0]

def A1(x):
    return x*df_M[1]/df_div[1] + df_mass[1]

def A2(x):
    return x*df_M[2]/df_div[2] + df_mass[2]

def A3(x):
    return x*df_M[3]/df_div[3] + df_mass[3]

def A4(x):
    return x*df_M[4]/df_div[4] + df_mass[4]

def A5(x):
    return x*df_M[5]/df_div[5] + df_mass[5]

def A6(x):
    return x*df_M[6]/df_div[6] + df_mass[6]

and so on, up to A46.
Then I .map each function to each of the compounds and store each value in a new column in the df. (Here is my other problem: how to add the name of each ion at the top of each column, matching the corresponding function?)
df[df_name.loc[0]] = df["exact_mass"].map(A0)
df[df_name.loc[1]] = df["exact_mass"].map(A1)
df[df_name.loc[2]] = df["exact_mass"].map(A2)
df[df_name.loc[3]] = df["exact_mass"].map(A3)
df[df_name.loc[4]] = df["exact_mass"].map(A4)
df[df_name.loc[5]] = df["exact_mass"].map(A5)
df[df_name.loc[6]] = df["exact_mass"].map(A6)
... and so on, until applying A46.
I think there could be a simpler way to define the function so that it changes according to each ion (maybe a for loop?), and also a simpler way to apply the functions and get the corresponding names without .loc for each one.
Thanks!
One way is using functools.partial together with map.
Given the regularity of your function calls, I would try something like:
from functools import partial

def func(x, n):
    return x*df_M[n]/df_div[n] + df_mass[n]

for i in range(max_i):  # change max_i to the integer you need
    df[df_name.loc[i]] = list(map(partial(func, n=i), df["exact_mass"]))
    # df[df_name.loc[i]] = df["exact_mass"].map(partial(func, n=i)) should work as well
more info here https://docs.python.org/3.7/library/functools.html#functools.partial
Here's a proposition: define
def A(x, i):
    return x*df_M[i]/df_div[i] + df_mass[i]
Then calling A(x, 5) is the same as A5(x). Then you loop through all your stuff:
for i in range(47):
    df[df_name.loc[i]] = df['exact_mass'].map(lambda x: A(x, i))
I think there is probably a more elegant way to do this, but this should work.
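As a further sketch (assuming the df and df_al defined in the question), the whole computation can be vectorized with broadcasting, with no per-ion function at all:
import numpy as np
import pandas as pd

masses = df["exact_mass"].to_numpy()[:, None]        # shape (n_compounds, 1)
factors = (df_al["M"] / df_al["Charge"]).to_numpy()  # M/Charge for each adduct
offsets = df_al["Adduct_mass"].to_numpy()

# Broadcasting gives an (n_compounds, n_adducts) table in one step,
# with the ion names as column headers.
adducts = pd.DataFrame(masses * factors + offsets,
                       columns=df_al["Ion_name"], index=df.index)
df = pd.concat([df, adducts], axis=1)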

Spliting dataframe in 10 equal parts and merge 9 parts after picking one at a time in loop

I need to split a dataframe into 10 parts, then use one part as the test set and the remaining 9 (merged) as the training set. I have come up with the following code, where I am able to split the dataset, and am trying to merge the remaining sets after picking one of the 10.
The first iteration goes fine, but I get the following error in the second iteration.
df = pd.DataFrame(np.random.randn(10, 4), index=list(xrange(10)))
for x in range(3):
    dfList = np.array_split(df, 3)
    testdf = dfList[x]
    dfList.remove(dfList[x])
    print testdf
    traindf = pd.concat(dfList)
    print traindf
    print "================================================"
I don't think you have to split the dataframe into 10 parts, just into 2.
I use this code for splitting a dataframe into a training set and a validation set:
test_index = np.random.choice(df.index, int(len(df.index)/10), replace=False)
test_df = df.loc[test_index]
train_df = df.loc[~df.index.isin(test_index)]
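The same split can also be written with DataFrame.sample (a sketch; random_state is optional, for reproducibility):
test_df = df.sample(frac=0.1, random_state=0)
train_df = df.drop(test_df.index)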
Okay, I got it working this way:
df = pd.DataFrame(np.random.randn(10, 4), index=list(xrange(10)))
dfList = np.array_split(df, 3)
for x in range(3):
    trainList = []
    for y in range(3):
        if y == x:
            testdf = dfList[y]
        else:
            trainList.append(dfList[y])
    traindf = pd.concat(trainList)
    print testdf
    print traindf
    print "================================================"
But a better approach is welcome.
You can use the permutation function from numpy.random
import numpy as np
import pandas as pd
import math as mt
l = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13]
df = pd.DataFrame({'a': l, 'b': l})
# shuffle the dataframe index
shuffled_idx = np.random.permutation(df.index)

# divide the shuffled index into N equal(ish) parts; for this example, let N = 4
N = 4
n = len(shuffled_idx) / N

parts = []
for j in range(N):
    parts.append(shuffled_idx[mt.ceil(j*n): mt.ceil(j*n + n)])

# to show each shuffled part of the data frame
for k in parts:
    print(df.iloc[k])
I wrote a script for the purpose of splitting a Pandas dataframe randomly; find / fork it on GitHub. Also see the Pandas documentation on merge, join, and concatenate functionality.
The same code for your reference:
import pandas as pd
import numpy as np
from xlwings import Sheet, Range, Workbook
# path to file
df = pd.read_excel(r"//PATH TO FILE//")
df.columns = [c.replace(' ', "_") for c in df.columns]
x = df.columns[0].encode("utf-8")

# number of parts the data frame or the list needs to be split into
n = 7
seq = list(df[x])
np.random.shuffle(seq)
lists1 = [seq[i:i+n] for i in range(0, len(seq), n)]
listsdf = pd.DataFrame(lists1).reset_index()
dataframesDict = dict()

# calling the xlwings Workbook function
Workbook()
for i in range(0, n):
    if Sheet.count() < n:
        Sheet.add()
    dataframesDict[i] = df.loc[df.Column_Name.isin(list(listsdf[listsdf.columns[i+1]]))]
    Range(i, "A1").value = dataframesDict[i]
Looks like you are trying to do a k-fold type thing, rather than a one-off split. This code should help. You may also find that the scikit-learn k-fold functionality works in your case; that's also worth checking out (see the sketch after the code below).
# Split dataframe by rows into n roughly equal portions and return a list of them.
def splitDf(df, n):
    splitPoints = list(map(lambda x: int(x * len(df) / n), list(range(1, n))))
    splits = list(np.split(df.sample(frac=1), splitPoints))
    return splits

# Take splits from splitDf, and return a test set (splits[index]) and a
# training set (the rest).
def makeTrainAndTest(splits, index):
    # index is zero based, so range 0-9 for a 10-fold split
    test = splits[index]
    leftLst = splits[:index]
    rightLst = splits[index+1:]
    train = pd.concat(leftLst + rightLst)
    return train, test
You can then use these functions to make the folds:
df = <my_total_data>
n = 10
splits = splitDf(df, n)
trainTest = []
for i in range(0, n):
    trainTest.append(makeTrainAndTest(splits, i))

# Get test set 2
test2 = trainTest[2][1]
# Get training set zero
train0 = trainTest[0][0]
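For comparison, a minimal sketch of the scikit-learn equivalent mentioned above, assuming scikit-learn is installed:
from sklearn.model_selection import KFold

kf = KFold(n_splits=10, shuffle=True, random_state=0)
for train_idx, test_idx in kf.split(df):
    train, test = df.iloc[train_idx], df.iloc[test_idx]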

How do I generate a table from a list

I have a list that contains sublists with 3 values, and I need to print it out as a formatted table.
I also need to compare the third-column values with each other to tell whether they are increasing or decreasing as you go down.
bb = 3.9
lowest = 0.4

# appending all the information to a list
allinfo = []
while bb >= lowest:
    everything = angleWithPost(bb, cc, dd, ee)
    allinfo.append(everything)
    bb -= 0.1
I think the general idea for finding out whether or not the third column values are increasing or decreasing is:
# Checking whether the Fnet values are increasing or decreasing
ii = 0
while ii <= (10 * (bb - lowest)):
    if allinfo[ii][2] > allinfo[ii+1][2]:
        abc = "decreasing"
    elif allinfo[ii][2] < allinfo[ii+1][2]:
        abc = "increasing"
    ii += 1
Then I want to print out my table, similar to the one above:
jj = 0
while jj <= (10 * (bb - lowest)):
    print "%8.2f %12.2f %12.2f %s" % (allinfo[jj][0], allinfo[jj][1], allinfo[jj][2], abc)
    jj += 1
Here is the angleWithPost part:
import math as m

def chainPoints(aa, DIS, SEG, H):
    # xtuple x chain points
    n = 0
    xterms = []
    xterm = -DIS
    while n <= SEG:
        xterms.append(xterm)
        n += 1
        xterm = -DIS + n*2*DIS/(SEG)
    # ytuple y chain points
    k = 0
    yterms = []
    while k <= SEG:
        yterm = H + aa*m.cosh(xterms[k]/aa) - aa*m.cosh(DIS/aa)
        yterms.append(yterm)
        k += 1
    return (xterms, yterms)

def chainLength(aa, DIS, SEG, H):
    # using x points and y points from the chainPoints function
    xterms, yterms = chainPoints(aa, DIS, SEG, H)
    # length of chain
    ff = 1
    Lterm = 0.
    totallength = 0.
    while ff <= SEG:
        Lterm = m.sqrt((xterms[ff]-xterms[ff-1])**2 + (yterms[ff]-yterms[ff-1])**2)
        totallength += Lterm
        ff += 1
    return totallength

def angleWithPost(aa, DIS, SEG, H):
    xterms, yterms = chainPoints(aa, DIS, SEG, H)
    totallength = chainLength(aa, DIS, SEG, H)
    # Find the angle
    thetaradians = (m.pi)/2. + m.atan((yterms[1]-yterms[0])/(xterms[1]-xterms[0]))
    # Need to print out the degrees
    thetadegrees = (180/m.pi)*thetaradians
    # finding the net force (rho and grav are defined elsewhere in the script)
    Fnet = abs(rho*grav*totallength)/(2.*m.cos(thetaradians))
    return (totallength, thetadegrees, Fnet)
Review this Python 2 implementation, which uses map and an iterator trick.
from itertools import izip_longest, islice
from pprint import pprint

data = [
    [1, 2, 3],
    [1, 2, 4],
    [1, 2, 3],
    [1, 2, 5],
]

class AddDirection(object):
    def __init__(self):
        # This default is used if the series begins with equal values or has a
        # single element.
        self.increasing = True

    def __call__(self, pair):
        crow, nrow = pair
        if nrow is None or crow[-1] == nrow[-1]:
            # This is the last row or the direction didn't change. Just return
            # the direction we previously had.
            inc = self.increasing
        elif crow[-1] > nrow[-1]:
            inc = False
        else:
            # Here crow[-1] < nrow[-1].
            inc = True
        self.increasing = inc
        return crow + ["Increasing" if inc else "Decreasing"]

result = map(AddDirection(), izip_longest(data, islice(data, 1, None)))
pprint(result)
The output:
pts/1$ python2 a.py
[[1, 2, 3, 'Increasing'],
[1, 2, 4, 'Decreasing'],
[1, 2, 3, 'Increasing'],
[1, 2, 5, 'Increasing']]
Whenever you want to transform the contents of a list (in this case the list of rows), map is a good place where to begin thinking.
When the algorithm requires data from several places of a list, offsetting the list and zipping the needed values is also a powerful technique. Using generators so that the list doesn't have to be copied makes this viable in real code.
Finally, when you need to keep state between calls (in this case the direction), using an object is the best choice.
Sorry if the code is too terse!
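For reference, the same approach runs under Python 3 with two small changes: izip_longest is now zip_longest, and the map result must be materialized with list():
from itertools import zip_longest, islice

result = list(map(AddDirection(), zip_longest(data, islice(data, 1, None))))
pprint(result)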
Basically you want to add a 4th column to the inner list and print the results?
# print headers of table here; use .format for consistent padding
previous = 0
for l in outer_list:
    if l[2] > previous:
        l.append('increasing')
    elif l[2] < previous:
        l.append('decreasing')
    previous = l[2]
    # print row here; use .format for consistent padding
Update for a list of tuples (adding the value to the tuple):
import random

outer_list = [(i, i, random.randint(0, 10)) for i in range(0, 10)]
previous = 0
allinfo = []
for l in outer_list:
    if l[2] > previous:
        allinfo.append(l + ('increasing',))
    elif l[2] < previous:
        allinfo.append(l + ('decreasing',))
    previous = l[2]
    # print row here; use .format for consistent padding
print(allinfo)
This most definitely can be optimized and you could reduce the number of times you are iterating over the data.
