I hope you can guide me here, because I am a little lost and not very experienced in Python programming.
My goal: I have to calculate the "adducts" for a given "Compound". Both are numbers, but for each "Compound" there are 46 different "Adducts".
Each adduct is calculated as follows:
Adduct = Exact_mass*M/Charge + Adduct_mass
where Exact_mass is a number, M and Charge are integers (1, 2, 3, etc.) that depend on the type of adduct, and Adduct_mass is a number (positive or negative) that also depends on the adduct.
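Plugging the first compound and the M+H row of the sample data below into that formula, the arithmetic is just:

```python
# M+H adduct for compound C3H64O7 (values taken from the sample data below)
exact_mass = 596.465179
M, charge = 1, 1
adduct_mass = 1.007276

adduct = exact_mass * M / charge + adduct_mass
print(round(adduct, 6))  # 597.472455
```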
My data: two data frames. One holds the adduct names, M, Charge, and Adduct_mass.
The other holds the Compound_name and Exact_mass of the compounds I want to iterate over (I just put in a small data set).
Adducts: df_al
import pandas as pd
data = [["M+3H", 3, 1, 1.007276], ["M+3Na", 3, 1, 22.989], ["M+H", 1, 1, 1.007276], ["2M+H", 1, 2, 1.007276], ["M-3H", 3, 1, -1.007276]]
df_al = pd.DataFrame(data, columns=["Ion_name", "Charge", "M", "Adduct_mass"])
Compounds: df
import pandas as pd
data1 = [[1, "C3H64O7", 596.465179], [2, "C30H42O7", 514.293038], [4, "C44H56O8", 712.397498], [4, "C24H32O6S", 448.191949], [5, "C20H28O3", 316.203834]]
df = pd.DataFrame(data1, columns=["CdId", "Formula", "exact_mass"])
My code:
df_name = df_al["Ion_name"]
df_mass = df_al["Adduct_mass"]
df_div = df_al["Charge"]
df_M = df_al["M"]
Then I defined a function for each ion, using the index to pick out each value:
def A0(x):
    return x*df_M[0]/df_div[0] + df_mass[0]
def A1(x):
    return x*df_M[1]/df_div[1] + df_mass[1]
def A2(x):
    return x*df_M[2]/df_div[2] + df_mass[2]
def A3(x):
    return x*df_M[3]/df_div[3] + df_mass[3]
def A4(x):
    return x*df_M[4]/df_div[4] + df_mass[4]
def A5(x):
    return x*df_M[5]/df_div[5] + df_mass[5]
def A6(x):
    return x*df_M[6]/df_div[6] + df_mass[6]
and so on, up to A46.
Then I .map each function over the compounds and store the values in a new column of df. (Here is my other problem: how do I put the name of each ion at the top of the column that matches the corresponding function?)
df[df_name.loc[0]] = df["exact_mass"].map(A0)
df[df_name.loc[1]] = df["exact_mass"].map(A1)
df[df_name.loc[2]] = df["exact_mass"].map(A2)
df[df_name.loc[3]] = df["exact_mass"].map(A3)
df[df_name.loc[4]] = df["exact_mass"].map(A4)
df[df_name.loc[5]] = df["exact_mass"].map(A5)
df[df_name.loc[6]] = df["exact_mass"].map(A6)
...
and so on, until A46 has been applied.
I think there must be a simpler way to define the function so that it changes according to each ion (maybe a for loop?), and also a simpler way to apply the functions and get the corresponding column names without a .loc for each one.
Thanks!
One way is using functools.partial together with map.
Given the regularity of your function calls, I would try something like:
from functools import partial

def func(x, n):
    return x*df_M[n]/df_div[n] + df_mass[n]

for i in range(max_i):  # replace max_i with the number of adducts you need
    df[df_name.loc[i]] = df["exact_mass"].map(partial(func, n=i))
    # Note: in Python 3 a bare map() returns an iterator, so the
    # Series.map form above is the safer way to fill the column.
more info here https://docs.python.org/3.7/library/functools.html#functools.partial
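If partial is unfamiliar: it freezes some arguments of a function and returns a new callable. A minimal illustration with toy names (not from the question):

```python
from functools import partial

def scale(x, factor):
    return x * factor

double = partial(scale, factor=2)  # a new function with factor fixed to 2
print(double(21))  # 42
```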
Here's a proposal: define
def A(x, i):
    return x*df_M[i]/df_div[i] + df_mass[i]

Then A(x, 5) is the same as A5(x), and you can loop through everything:

for i in range(len(df_al)):
    df[df_name.loc[i]] = df['exact_mass'].map(lambda x: A(x, i))
I think there is probably a more elegant way to do this, but this should work.
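For what it's worth, here is a fully vectorized sketch of my own (not part of the answers above) that avoids per-ion functions entirely: pair every compound with every adduct via a cross join (needs pandas >= 1.2), compute all masses at once, then pivot the ion names into columns. I use only two compounds here so CdId stays unique for the pivot:

```python
import pandas as pd

df_al = pd.DataFrame(
    [["M+3H", 3, 1, 1.007276], ["M+3Na", 3, 1, 22.989],
     ["M+H", 1, 1, 1.007276], ["2M+H", 1, 2, 1.007276],
     ["M-3H", 3, 1, -1.007276]],
    columns=["Ion_name", "Charge", "M", "Adduct_mass"])
df = pd.DataFrame(
    [[1, "C3H64O7", 596.465179], [2, "C30H42O7", 514.293038]],
    columns=["CdId", "Formula", "exact_mass"])

# One row per (compound, adduct) pair; compute every mass in one shot.
pairs = df.merge(df_al, how="cross")
pairs["adduct"] = (pairs["exact_mass"] * pairs["M"] / pairs["Charge"]
                   + pairs["Adduct_mass"])

# Turn the ion names into columns, one row per compound.
wide = pairs.pivot(index="CdId", columns="Ion_name", values="adduct")
result = df.merge(wide.reset_index(), on="CdId")
print(result.round(6))
```

This keeps the ion names as column headers automatically, which also answers the naming part of the question.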
I have data from a measurement and I want to process it so that only the constant values remain. The measured signal consists of stretches where the value stays constant for some time; then I make a change to the system that causes the value to increase, and it takes time for the system to reach a constant value again after the adjustment.
I wrote a program that compares every value with the 10 previous values. If it is equal to them within a tolerance, it gets saved.
The code works, but I feel this can be done more cleanly and efficiently, so that it is suitable for processing larger amounts of data. I just don't know how to make the for loop more efficient. Do you have any suggestions?
Thank you in advance.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
df = pd.read_csv('radiale Steifigkeit_22_04_2022_raw.csv',
                 sep=";",
                 decimal=',',
                 skipinitialspace=True,
                 comment='\t')
#df = df.drop(df.columns[[0,4]], axis=1)
#print(df.head())
#print(df.dtypes)
#df.plot(x = 'Time_SYS 01-cDAQ:1_A-In-All_Rec_rel', y = 'Kraft')
#df.plot(x = 'Time_SYS 01-cDAQ:1_A-In-All_Rec_rel', y = 'Weg')
#plt.show()
s = pd.Series(df['Weg'], name = 'Weg')
f = pd.Series(df['Kraft'], name= 'Kraft')
t = pd.Series(df['Time_SYS 01-cDAQ:1_A-In-All_Rec_rel'], name= 'Zeit')
#s_const = pd.Series()
s_const = []
f_const = []
t_const = []
s = np.abs(s)
#plt.plot(s)
#plt.show()
c = 0
#this for-loop compares the value s[i] with the previous 10 measurements.
#If it is equal within a tolerance it is saved into s_const.
for i in range(len(s)):
    #for i in range(0,2000):
    if i > 10:
        si = round(s[i],3)
        s1i = round(s[i-1],3)
        s2i = round(s[i-2],3)
        s3i = round(s[i-3],3)
        s4i = round(s[i-4],3)
        s5i = round(s[i-5],3)
        s6i = round(s[i-6],3)
        s7i = round(s[i-7],3)
        s8i = round(s[i-8],3)
        s9i = round(s[i-9],3)
        s10i = round(s[i-10],3)
        if si == s1i == s2i == s3i == s4i == s5i == s6i == s7i == s8i == s9i == s10i:
            c = c+1
            s_const.append(s[i])
            f_const.append(f[i])
Here is a very performant implementation using itertools (based on Check if all elements in a list are identical):
from itertools import groupby
def all_equal(iterable):
    g = groupby(iterable)
    return next(g, True) and not next(g, False)
data = [1, 2, 2, 2, 2, 3, 3, 4, 4, 4, 4, 4, 4, 5, 5, 5]
window = 3
stable = [i for i in range(len(data) - window + 1) if all_equal(data[i:i+window])]
print(stable) # -> [1, 2, 7, 8, 9, 10, 13]
The algorithm produces a list of indices in your data where a stable period of length window starts.
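As an alternative sketch of my own (not from the answer above), the same stable-window indices can be found fully vectorized with pandas, using the fact that a window is all-equal exactly when its rolling max equals its rolling min:

```python
import pandas as pd

data = [1, 2, 2, 2, 2, 3, 3, 4, 4, 4, 4, 4, 4, 5, 5, 5]
window = 3

s = pd.Series(data)
rolled = s.rolling(window)
# True at the *last* index of each all-equal window
# (the first window-1 entries are NaN and compare False).
ends = rolled.max() == rolled.min()
# Shift back to each window's start index to match the itertools version.
stable = [i - (window - 1) for i in ends[ends].index]
print(stable)  # -> [1, 2, 7, 8, 9, 10, 13]
```

For a tolerance-based comparison like the original code, you could round the series first, or compare `rolled.max() - rolled.min()` against the tolerance instead.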
I need to find a more efficient solution to the following problem:
Given a dataframe with 4 variables in each row, I need to find the list of 8 elements that covers all the variables per row in the maximum number of rows.
A working, but very slow, solution is to create a second dataframe containing all possible combinations (basically a permutation without repetition), then loop through every combination and compare it with the initial dataframe. The number of matching rows is counted and added to the second dataframe.
import numpy as np
import pandas as pd
from itertools import combinations
df = pd.DataFrame(np.random.randint(0,20,size=(100, 4)), columns=list('ABCD'))
df = 'x' + df.astype(str)
listofvalues = df['A'].tolist()
listofvalues.extend(df['B'].tolist())
listofvalues.extend(df['C'].tolist())
listofvalues.extend(df['D'].tolist())
listofvalues = list(dict.fromkeys(listofvalues))
possiblecombinations = list(combinations(listofvalues, 6))
dfcombi = pd.DataFrame(possiblecombinations, columns = ['M','N','O','P','Q','R'])
dfcombi['List'] = dfcombi.M.map(str) + ',' + dfcombi.N.map(str) + ',' + dfcombi.O.map(str) + ',' + dfcombi.P.map(str) + ',' + dfcombi.Q.map(str) + ',' + dfcombi.R.map(str)
dfcombi['Count'] = ''
for x, row in dfcombi.iterrows():
    comparelist = row['List'].split(',')
    pointercounter = df.index[df['A'].isin(comparelist) & df['B'].isin(comparelist) & df['C'].isin(comparelist) & df['D'].isin(comparelist)].tolist()
    # write back via .at - assigning to the row yielded by iterrows
    # modifies a copy and does not persist in dfcombi
    dfcombi.at[x, 'Count'] = len(pointercounter)
I assume there must be a way to avoid the for loop and replace it with something vectorized; I just cannot figure out how.
Thanks!
Your code can be rewritten as:
# working with integers is much faster than working with strings
enums, codes = df.stack().factorize()
# encodings of df
s = [set(x) for x in enums.reshape(-1,4)]
# possible combinations
from itertools import combinations, product
possiblecombinations = np.array([set(x) for x in combinations(range(len(codes)), 6)])
# count the combination with issubset
ret = [0]*len(possiblecombinations)
for a, (i,b) in product(s, enumerate(possiblecombinations)):
    ret[i] += a.issubset(b)
# the combination with maximum count
max_combination = possiblecombinations[np.argmax(ret)]
# in code {0, 3, 4, 5, 17, 18}
# and in values:
codes[list(max_combination)]
# Index(['x5', 'x15', 'x12', 'x8', 'x0', 'x6'], dtype='object')
All that took about 2 seconds, as opposed to your code, which took around 1.5 minutes.
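If the df.stack().factorize() line above looks opaque: factorize maps each distinct label to a small integer (returning the integer codes plus the index of unique values), which is what makes the set arithmetic cheap. A tiny illustration with made-up labels:

```python
import pandas as pd

df = pd.DataFrame({"A": ["x5", "x3"], "B": ["x3", "x5"]})
# stack() flattens the frame row-major into a Series,
# factorize() then encodes each distinct label as an integer.
enums, codes = df.stack().factorize()
print(enums)        # integer code per cell, row-major: [0 1 1 0]
print(list(codes))  # ['x5', 'x3']
```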
In the Python Data Science Handbook the following example is given (the penultimate line is the one which I don't understand, as indicated):
import pandas as pd
import numpy as np
import seaborn as sns
sns.set()
#Downloaded from: https://raw.githubusercontent.com/jakevdp/data-CDCbirths/master/births.csv
births = pd.read_csv('births.csv')
births['decades'] = (births['year'] // 10) * 10
# Robust sigma-clipping operation - ignore this
quartiles = np.percentile(births['births'], [25, 50, 75])
mu = quartiles[1]
sig = 0.74 * (quartiles[2] - quartiles[0])
births = births.query('(births > @mu - 5 * @sig) & (births < @mu + 5 * @sig)')
births['day'] = births['day'].astype(int)
births.index = pd.to_datetime(10000 * births.year +
                              100 * births.month +
                              births.day, format='%Y%m%d')
births_by_date = births.pivot_table('births', [births.index.month, births.index.day])
#Help on the loop below
births_by_date.index = [pd.datetime(2012, month, day)
                        for (month, day) in births_by_date.index]
print(births_by_date.index)
I don't understand the construction of the births_by_date.index in the for loop. I understand that the loop is getting applied to the pivot table, but I've never seen what looks like the output array put before the loop.
Can someone explain how this works, or direct me to an appropriate explanation please?
I have tried:
How do I save results of a "for" loop into a single variable?
numerous tutorials, such as this one: https://www.learnpython.org/en/Loops
various other questions, but I can't find anything similar.
It's called a "list comprehension" which you can read about here among other sources. The comprehension is evaluated and then assigned back to the index of the dataframe, basically to give a year to your dates. It's equivalent to:
some_list = []
for month, day in births_by_date.index:
    some_list.append(pd.datetime(2012, month, day))
births_by_date.index = some_list
It's a list comprehension as already mentioned. It's a concise syntax for running a loop on a list and generating another list by transforming it.
A simple example to double the elements of a list:
items = [1, 2, 3, 4]
doubled_items = [2*item for item in items]
# doubled_items is [2, 4, 6, 8]
This is essentially the same as:
items = [1, 2, 3, 4]
doubled_items = []
for item in items:
    doubled_items.append(2*item)
I am being told, in comments, to fix my function to make it look "cleaner". I've tried a lot, but I don't know how to use a lambda to accomplish what I'm trying to do. My code works; it just isn't what is being asked of me.
Here is my code with suggestions on how to fix it.
def immutable_fibonacci(position):
    # define a lambda instead of def here
    def compute_fib(previousSeries, ignore):
        newList = previousSeries
        if len(newList) < 2:  # Do this outside and keep this function focused only on returning a new list with the last element being the sum of the previous two
            newList.append(1)
        else:
            first = newList[-1]
            second = newList[-2]
            newList.append(first+second)
        return newList
    range=[None]*position
    return reduce(compute_fib, range, [])

#Above is too much code. How about something like this:
#next_series = lambda series,_ : (Use these instead of the above line)
#return reduce(next_series, range(position - 2), [1, 1])
Anything helps.. I am just confused on how I can implement these suggestions.
Here is what I attempted.
def immutable_fibonacci(position):
    range=[None]*position
    next_series = lambda series, _ : series.append(series[-1] + series[-2])
    return reduce(next_series, range(position - 2), [1, 1])
The append method returns None; you need to return a new list instead. Note the lambda still needs a second, ignored parameter, because reduce always passes two arguments (the accumulator and the current item):
next_series = lambda series, _: series + [series[-1] + series[-2]]
Also, rebinding the name range serves no purpose and shadows the built-in range you call on the next line.
from functools import reduce

def immutable_fibonacci(position):
    next_series = lambda series, _: series + [series[-1] + series[-2]]
    return reduce(next_series, range(position - 2), [1, 1])
This is assuming you only call the function for positions >= 2. Conventionally fib(0) is 0 and fib(1) is 1.
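If you also need the small positions to work, here is a sketch of one way to guard them (my own addition, keeping the answer's convention of a series that starts at 1, 1):

```python
from functools import reduce

def immutable_fibonacci(position):
    # The lambda assumes the accumulator already holds two elements,
    # so handle the short series up front.
    if position <= 0:
        return []
    if position == 1:
        return [1]
    next_series = lambda series, _: series + [series[-1] + series[-2]]
    return reduce(next_series, range(position - 2), [1, 1])

print(immutable_fibonacci(7))  # [1, 1, 2, 3, 5, 8, 13]
```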
I'm having a problem with an old function computing the concentration of pandas categorical columns. There seems to have been a change making it impossible to subset the result of the .value_counts() method of a categorical series.
Minimal non-working example:
import pandas as pd
import numpy as np
df = pd.DataFrame({"A":["a","b","c","a"]})
def get_concentration(df, cat):
    tmp = df[cat].astype("category")
    counts = tmp.value_counts()
    obs = len(tmp)
    all_cons = []
    for key in counts.keys():
        single = np.square(np.divide(float(counts[key]), float(obs)))
        all_cons.append(single)
    return np.sum(all_cons)
get_concentration(df, "A")
This results in a key error for counts["a"]. I'm quite sure this worked in a past version of pandas and the documentation doesn't seem to mention a change regarding the .value_counts() method.
Let's agree on methodology:
>>> df.A.value_counts()
a 2
b 1
c 1
obs = len(df['A'].astype('category'))
>>> obs
4
The concentration should be as follows (per the Herfindahl Index):
>>> (2 / 4.) ** 2 + (1 / 4.) ** 2 + (1 / 4.) ** 2
0.375
Which is equivalent to (Pandas 0.17+):
>>> ((df.A.value_counts() / df.A.count()) ** 2).sum()
0.375
If you really want a function:
def concentration(df, col):
    return ((df[col].value_counts() / df[col].count()) ** 2).sum()
>>> concentration(df, 'A')
0.375
Since you're iterating in a loop anyway (rather than working in a vectorized way), you might as well iterate explicitly over key-value pairs. It simplifies the syntax, IMHO:
import pandas as pd
import numpy as np
df = pd.DataFrame({"A":["a","b","c","a"]})
def get_concentration(df, cat):
    tmp = df[cat].astype("category")
    counts = tmp.value_counts()
    obs = len(tmp)
    all_cons = []
    # See the change in the following line - you're iterating
    # over key-value pairs anyway, so why not do so explicitly?
    for k, v in counts.to_dict().items():
        single = np.square(np.divide(float(v), float(obs)))
        all_cons.append(single)
    return np.sum(all_cons)
>>> get_concentration(df, "A")
0.375
To fix the current function, you just need to access the values through the index with .loc (see below; .ix is deprecated and removed in modern pandas). You might be better off using a vectorized function - I've added one at the end.
df = pd.DataFrame({"A":["a","b","c","a"]})

def get_concentration(df, cat):
    tmp = df[cat].astype('category')
    counts = tmp.value_counts()
    obs = len(tmp)
    all_cons = []
    for key in counts.index:
        single = np.square(np.divide(float(counts.loc[key]), float(obs)))
        all_cons.append(single)
    return np.sum(all_cons)
yields:
get_concentration(df, "A")
0.375
You might want to try a vectorized version, which also doesn't necessarily need the category dtype, such as:
def get_concentration(df, cat):
    counts = df[cat].value_counts()
    # divide by the total number of observations (counts.sum()),
    # not by the number of distinct categories (len(counts))
    return counts.div(counts.sum()).pow(2).sum()