Loop through rows of dataframe at specific row values - python

My dataframe contains three replications for each treatment. I want to loop through each treatment and, within each treatment, fit a model to each replication. I managed to loop through the treatments, but I also need to loop through the replications within each treatment. Ideally, the output should be saved into a new dataframe that contains 'treatment' and 'replication' columns. Any suggestions?
The dataframe (df) looks like this:
treatment replication time y
**8 1 1 0.1**
8 1 2 0.1
8 1 3 0.1
**8 2 1 0.1**
8 2 2 0.1
8 2 3 0.1
**10 1 1 0.1**
10 1 2 0.1
10 1 3 0.1
**10 2 1 0.1**
10 2 2 0.1
10 2 3 0.1
for i, g in df.groupby('treatment'):
    k = g.iloc[0].y
    popt, pcov = curve_fit(model, x, y)
    fit_m = popt
I then tried iterrows, but now I can no longer use iloc[0] on NPQ to get the initial value. Any idea how to solve this? The code and the error message are:
for index, row in HL.iterrows():
    g = (index, row['filename'], row['hr'], row['time'], row['NPQ'])
    k = g.iloc[0]['NPQ']
AttributeError: 'tuple' object has no attribute 'iloc'
Thank you in advance

grouped_df = HL.groupby(["hr", "filename"])
for key, g in grouped_df:
    k = g.iloc[0].y
    popt, pcov = curve_fit(model, x, y)
    fit_m = popt
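Grouping by both columns at once yields one sub-frame per (treatment, replication) pair, and the per-group results can be collected in a list of dicts and turned into a new dataframe. The following is only a minimal sketch: it uses the column names from the question's sample table, the y values are made up, and because the original `model` function is not shown, `np.polyfit` (a straight-line fit) stands in for `curve_fit`. The result column names `y0`, `slope` and `intercept` are hypothetical.

```python
import numpy as np
import pandas as pd

# Illustrative data with the question's column names (y values are made up).
df = pd.DataFrame({
    "treatment":   [8, 8, 8, 8, 8, 8, 10, 10, 10, 10, 10, 10],
    "replication": [1, 1, 1, 2, 2, 2, 1, 1, 1, 2, 2, 2],
    "time":        [1, 2, 3] * 4,
    "y":           [0.1, 0.2, 0.3, 0.2, 0.4, 0.6, 0.3, 0.3, 0.3, 1.0, 0.9, 0.8],
})

rows = []
# Grouping by two columns gives a (treatment, replication) tuple as the key.
for (treatment, replication), g in df.groupby(["treatment", "replication"]):
    y0 = g["y"].iloc[0]  # initial value of this replication
    # Stand-in for curve_fit(model, ...): a first-degree polynomial fit.
    slope, intercept = np.polyfit(g["time"], g["y"], deg=1)
    rows.append({
        "treatment": treatment,
        "replication": replication,
        "y0": y0,
        "slope": slope,
        "intercept": intercept,
    })

# One row per (treatment, replication) pair.
results = pd.DataFrame(rows)
print(results)
```

To use a real model, replace the `np.polyfit` line with a `curve_fit` call on `g["time"]` and `g["y"]` and store the entries of `popt` in the dict.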

Related

Pandas sum last n rows of df.count() results into one row

I am looking for a way to generate nice summary statistics of a dataframe. Consider the following example:
>> df = pd.DataFrame({"category":['u','v','w','u','y','z','y','z','x','x','y','z','x','z','x']})
>> df['category'].value_counts()
z 4
x 4
y 3
u 2
v 1
w 1
>> ??
count pct
z 4 27%
x 4 27%
y 3 20%
Other (3) 4 27%
The result sums the value counts of the n=3 last rows up, deletes them and then adds them as one row to the original value counts. Also it would be nice to have everything as percents. Any ideas how to implement this? Cheers!
For a DataFrame with percentages, slice with Series.iloc, create a DataFrame with Series.to_frame, then add the new row and a new column filled with percentages:
s = df['category'].value_counts()
n = 3
out = s.iloc[:-n].to_frame('count')
out.loc[f'Other ({n})'] = s.iloc[-n:].sum()
out['pct'] = out['count'].div(out['count'].sum()).apply(lambda x: f"{x:.0%}")
print (out)
count pct
z 4 27%
x 4 27%
y 3 20%
Other (3) 4 27%
I would use tail(-3) to get the last values except for the first 3:
counts = df['category'].value_counts()
others = counts.tail(-3)
counts[f'Others ({len(others)})'] = others.sum()
counts.drop(others.index, inplace=True)
counts.to_frame(name='count').assign(pct=lambda d: d['count'].div(d['count'].sum()).mul(100).round())
Output:
count pct
z 4 27.0
x 4 27.0
y 3 20.0
Others (3) 4 27.0
This snippet
df = pd.DataFrame({"category":['u','v','w','u','y','z','y','z','x','x','y','z','x','z','x']})
cutoff_index = 3
categegory_counts = pd.DataFrame([df['category'].value_counts(),df['category'].value_counts(normalize=True)],index=["Count","Percent"]).T.reset_index()
other_rows = categegory_counts[cutoff_index:].set_index("index")
categegory_counts = categegory_counts[:cutoff_index].set_index("index")
summary_table = pd.concat([categegory_counts,pd.DataFrame(other_rows.sum(),columns=[f"Other ({len(other_rows)})"]).T])
summary_table = summary_table.astype({'Count':'int'})
summary_table['Percent'] = summary_table['Percent'].apply(lambda x: "{0:.2f}%".format(x*100))
print(summary_table)
will give you what you need. Also in a nice format;)
Count Percent
z 4 26.67%
x 4 26.67%
y 3 20.00%
Other (3) 4 26.67%

Create a binary matrix after comparing columns' values in a dataframe

The text is long but the question is simple!
I have two dataframes that hold different information about the same variables, and I need to create a binary matrix as my output after following some steps.
Let's say my dataframes are these:
market_values = pd.DataFrame({'variableA': (1,2.0,3), 'variableB': (np.nan,2,np.nan), 'variableC': (9,10,15), 'variableD' : (18,25,43),'variableE':(36,11,12),'variableF':(99,10,98), 'variableG': (42,19,27)})
variableA variableB variableC variableD variableE variableF variableG
0 1.0 NaN 9 18 36 99 42
1 2.0 2.0 10 25 11 10 19
2 3.0 NaN 15 43 12 98 27
negociation_values = pd.DataFrame({'variableA': (0.1,0.2,0.3), 'variableB': (0.5,np.nan,0.303), 'variableC': (0.9,0.10,0.4), 'variableD' : (0.12,0.11,0.09),'variableE':(np.nan,0.13,0.21),'variableF':(0.14,np.nan,0.03), 'variableG': (1.4,0.134,0.111)})
variableA variableB variableC variableD variableE variableF variableG
0 0.1 0.500 0.9 0.12 NaN 1.4 0.141
1 0.2 NaN 0.1 0.11 0.13 NaN 0.134
2 0.3 0.303 0.4 0.09 0.21 0.03 0.111
And I need to follow these steps:
1 - Check if two columns in my 'market_values' df have at least one value that is equal (for the same row).
2 - If a pair of columns has one value that is equal (for the same row), then I need to compare these same columns in my 'negociation_values' df.
3 - Then I have to discover which variable has the higher negociation value (for a given row).
4 - Finally I need to create a binary matrix. For the variables with equal values, I'll put 1 where one negociation value is higher and 0 for the other. If a column doesn't have an equal value with another column, I'll just put 1 for the entire column.
The desired output matrix will be like:
variableA variableB variableC variableD variableE variableF variableG
0 0 1 0 1 1 1 1
1 1 0 1 1 1 0 1
2 0 1 1 1 1 0 1
The main difficulty is at steps 3 and 4.
I've done steps 1 and 2 so far. They're below:
arr = market_values.to_numpy()
is_equal = ((arr == arr[None].T).any(axis=1))
is_equal[np.tril_indices_from(is_equal)] = False
inds_of_same_cols = [*zip(*np.where(is_equal))]
equal_cols = [market_values.columns[list(inds)].tolist() for inds in inds_of_same_cols]
print(equal_cols)
-----------------
[['variableA', 'variableB'], ['variableC', 'variableF']]
h = []
for i in equal_cols:
    op = pd.DataFrame(negociation_values[i])
    h.append(op)
print(h)
-------
[ variableA variableB
0 0.1 0.500
1 0.2 NaN
2 0.3 0.303,
variableC variableF
0 0.9 0.14
1 0.1 NaN
2 0.4 0.03]
The code above returns me the negociation values for the columns that have at least one equal value in the market values df.
Unfortunately, I don't know where to go from here. I need to write a code that says something like: "If variableA > variableB (for a row), insert '1' in a new matrix under variableA column and a '0' under variableB column for that row. keep doing that and then do that for the others". Also, I need to say "If a variable doesn't have an equal value in some other column, insert 1 for all values in this binary matrix"
Your negociation_values definition and the presented table are not the same.
Here is the definition I used:
market_values = pd.DataFrame({'variableA': (1,2.0,3), 'variableB': (np.nan,2,np.nan), 'variableC': (9,10,15), 'variableD' : (18,25,43),'variableE':(36,11,12),'variableF':(99,10,98), 'variableG': (42,19,27)})
negociation_values = pd.DataFrame({'variableA': (0.1,0.2,0.3), 'variableB': (0.5,np.nan,0.303), 'variableC': (0.9,0.10,0.4), 'variableD' : (0.12,0.11,0.09),'variableE':(np.nan,0.13,0.21),'variableF':(1.4,np.nan,0.03), 'variableG': (0.141,0.134,0.111)})
The following code gives me the required matrix (though there are a number of edge cases you will need to consider)
cols = market_values.columns.values
bmatrix = pd.DataFrame(index=market_values.index, columns=cols, data=1)
for idx, col in enumerate(cols):
    print(cols[idx+1:])
    df_m = market_values[cols[idx+1:]]
    df_n = negociation_values[cols[idx+1:]]
    v = df_n.loc[:, df_m.sub(market_values[col], axis=0).eq(0).any()].sub(negociation_values[col], axis=0).applymap(lambda x: 1 if x > 0 else 0)
    if v.columns.size > 0:
        bmatrix[v.columns[0]] = v
        bmatrix[col] = 1 - v
The result is as required:
The pseudo code is:
for each column of the market matrix:
    subtract from the later columns,
    keep columns with any zeros (edge case: more than one column),
    from column with zero, find difference between corresponding negoc. matrix,
    set result to 1 if > 0, otherwise 0,
    enter into binary matrix
Hope that makes sense.
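The same pseudo code can also be written as a loop over column pairs. This is only a sketch under a few assumptions: it uses the negociation_values definition from this answer, it treats a missing negociation value as "not higher" (NaN comparisons are False), and it simply overwrites a column if it happens to match more than one other column, so the edge cases mentioned above still need separate handling.

```python
from itertools import combinations

import numpy as np
import pandas as pd

market_values = pd.DataFrame({
    'variableA': (1, 2.0, 3), 'variableB': (np.nan, 2, np.nan),
    'variableC': (9, 10, 15), 'variableD': (18, 25, 43),
    'variableE': (36, 11, 12), 'variableF': (99, 10, 98),
    'variableG': (42, 19, 27)})
negociation_values = pd.DataFrame({
    'variableA': (0.1, 0.2, 0.3), 'variableB': (0.5, np.nan, 0.303),
    'variableC': (0.9, 0.10, 0.4), 'variableD': (0.12, 0.11, 0.09),
    'variableE': (np.nan, 0.13, 0.21), 'variableF': (1.4, np.nan, 0.03),
    'variableG': (0.141, 0.134, 0.111)})

# Start with 1 everywhere; columns without a matching partner keep it.
bmatrix = pd.DataFrame(1, index=market_values.index,
                       columns=market_values.columns)

for first, second in combinations(market_values.columns, 2):
    # Steps 1-2: the pair qualifies if any row holds the same market value
    # (NaN == NaN is False, so missing values never count as a match).
    if (market_values[first] == market_values[second]).any():
        # Steps 3-4: 1 where the second column has the higher negociation
        # value, 0 otherwise; the first column gets the complement.
        second_wins = (negociation_values[second]
                       > negociation_values[first]).astype(int)
        bmatrix[second] = second_wins
        bmatrix[first] = 1 - second_wins

print(bmatrix)
```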

How to compute a selective norm for a Multi-Index DF

I have measurement data in MultiIndex spreadsheet format and need to compute a norm by dividing each value of a column by its corresponding reference value.
How can this be done efficiently and 'readable' using Python Pandas, i.e. how do I filter the correct reference value in order to compute the normed values?
Here's the input data:
            result
var run ID
10  1   A       10
        B       50
    2   A       30
        B       70
20  1   A      100
        B      500
    2   A      300
        B      700
30  1   A     1000
        B     5000
    2   A     3000
        B     7000
and this is the desired result:
            normed
var run ID
10  1   A      0.1
        B      0.1
    2   A      0.1
        B      0.1
20  1   A      1.0
        B      1.0
    2   A      1.0
        B      1.0
30  1   A     10.0
        B     10.0
    2   A     10.0
        B     10.0
As can be seen, var = 20 is the reference, but it gets even more complicated since there are two runs (1 and 2) and two devices under test.
I can create a mask df[df['var'] == 20] when the DF is flattened using df.reset_index() (see comment #1), but I don't know how to proceed from here.
Any help is deeply appreciated!
Update
I have found a solution using query() in a for loop:
df_norm = pd.DataFrame()
df_flat = df.reset_index()
var_ref = 20
for ident in 'A', 'B':
    for run in 1, 2:
        q = f'var == {var_ref} & run == {run} & ID == "{ident}"'
        ref = df_flat.query(q)
        #ref
        #ref.result
        #ref.result.iloc[0]
        q = f'run == {run} & ID == "{ident}"'
        df_m = df_flat.query(q)
        norm = df_m.result / ref.result.iloc[0]
        #norm
        df__ = pd.DataFrame(norm.rename('norm'))
        df__ = df_flat.merge(df__, left_index=True, right_index=True)
        df_norm = pd.concat([df_norm, df__])
df_norm.sort_index()
Maybe there's a more elegant way to do it?
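One possibly more readable alternative is to pull out the reference rows once and merge them back on the remaining keys, which avoids the nested loops. The sketch below rebuilds the input table from the question; the names `flat`, `ref`, `merged` and `normed` are made up for illustration, and it assumes that every (run, ID) pair has exactly one row at the reference `var`.

```python
import pandas as pd

# Rebuild the example input with a (var, run, ID) MultiIndex.
index = pd.MultiIndex.from_product(
    [[10, 20, 30], [1, 2], ["A", "B"]], names=["var", "run", "ID"])
df = pd.DataFrame(
    {"result": [10, 50, 30, 70, 100, 500, 300, 700,
                1000, 5000, 3000, 7000]},
    index=index)

var_ref = 20
flat = df.reset_index()

# Reference value for every (run, ID) pair, taken at var == var_ref.
ref = (flat.loc[flat["var"] == var_ref, ["run", "ID", "result"]]
           .rename(columns={"result": "ref"}))

# Attach the matching reference to every row, then divide.
merged = flat.merge(ref, on=["run", "ID"], how="left")
merged["normed"] = merged["result"] / merged["ref"]

# Restore the original MultiIndex layout.
normed = merged.set_index(["var", "run", "ID"])[["normed"]]
print(normed)
```

The left merge keeps the row order of `flat`, so the output lines up with the input table.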

Pandas sequentially apply function using output of previous value

I want to compute the "carryover" of a series. This computes a value for each row and then adds it to the previously computed value (for the previous row).
How do I do this in pandas?
decay = 0.5
test = pd.DataFrame(np.random.randint(1,10,12),columns = ['val'])
test
val
0 4
1 5
2 7
3 9
4 1
5 1
6 8
7 7
8 3
9 9
10 7
11 2
decayed = []
for i, v in test.iterrows():
    if i == 0:
        decayed.append(v.val)
        continue
    d = decayed[i-1] + v.val*decay
    decayed.append(d)
test['loop_decay'] = decayed
test.head()
val loop_decay
0 4 4.0
1 5 6.5
2 7 10.0
3 9 14.5
4 1 15.0
Consider a vectorized version with cumsum() where you cumulatively sum (val * decay) with the very first val.
However, you then need to subtract the very first (val * decay) since cumsum() includes it:
test['loop_decay'] = test.loc[0, 'val'] + (test['val']*decay).cumsum() - test.loc[0, 'val']*decay
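To see that the cumulative sum reproduces the loop, both can be computed side by side. This is a small check using the values printed in the question instead of random numbers; the column name `cumsum_decay` is made up for the comparison.

```python
import pandas as pd

decay = 0.5
# Values copied from the question's printed example.
test = pd.DataFrame({"val": [4, 5, 7, 9, 1, 1, 8, 7, 3, 9, 7, 2]})

# Loop version: each entry is the previous result plus val * decay.
decayed = [float(test["val"].iloc[0])]
for v in test["val"].iloc[1:]:
    decayed.append(decayed[-1] + v * decay)
test["loop_decay"] = decayed

# Vectorized version: first value + cumulative sum of val * decay,
# minus the first val * decay that cumsum() already counted.
first = test["val"].iloc[0]
test["cumsum_decay"] = first + (test["val"] * decay).cumsum() - first * decay

print(test.head())
```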
You can utilize pd.Series.shift() to create a dataframe with val[i] and val[i-1] and then apply your function across a single axis (1 in this case):
# Create a series that shifts the rows by 1
test['val2'] = test.val.shift()
# Set the first row on the shifted series to 0
test.loc[0, 'val2'] = 0
# Apply the decay formula:
test['loop_decay'] = test.apply(lambda x: x['val'] + x['val2'] * 0.5, axis=1)

Apply curve_fit on dataframe columns

I have a pandas.DataFrame with multiple columns and I would like to apply a curve_fit function to each of them. I would like the output to be a dataframe with the optimal values fitting the data in the columns (for now, I am not interested in their covariance).
The df has the following structure:
a b c
0 0 0 0
1 0 0 0
2 0 0 0
3 0 0 0
4 0 0 0
5 0 0 0
6 1 0 1
7 1 1 1
8 1 1 1
9 1 1 1
10 1 1 1
11 1 1 1
12 1 1 1
13 1 1 1
14 2 1 2
15 6 2 6
16 7 2 7
17 8 2 8
18 9 2 9
19 7 2 7
I have defined a function to fit to the data as so:
def sigmoid(x, a, x0, k):
    y = a / (1 + np.exp(-k*(x-x0)))
    return y
def fitdata(dataseries):
    popt, pcov = curve_fit(sigmoid, dataseries.index, dataseries)
    return popt
I can apply the function and get an array in return:
result_a=fitdata(df['a'])
In []: result_a
Out[]: array([ 8.04197008, 14.48710063, 1.51668241])
If I try to df.apply the function I get the following error:
fittings=df.apply(fitdata)
ValueError: Shape of passed values is (3, 3), indices imply (3, 20)
Ultimately I would like the output to look like:
a b c
0 8.041970 2.366496 8.041970
1 14.487101 12.006009 14.487101
2 1.516682 0.282359 1.516682
Can this be done with something similar to apply?
I hope my solution works for you.
result = pd.DataFrame()
for i in df.columns:
    frames = [result, pd.DataFrame(fitdata(df[i]))]
    result = pd.concat(frames, axis=1)
result.columns = df.columns
a b c
0 8.041970 2.366496 8.041970
1 14.487101 12.006009 14.487101
2 1.516682 0.282359 1.516682
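Something closer to a plain apply may also work if the fitting function returns a pd.Series instead of a bare array, because apply then uses the Series index as the row labels of the result. The following is a minimal sketch rather than a drop-in replacement: the data are synthetic, noise-free sigmoids, and the starting guess `p0` is an assumption added here to help convergence (the question's fitdata does not pass one).

```python
import numpy as np
import pandas as pd
from scipy.optimize import curve_fit

def sigmoid(x, a, x0, k):
    return a / (1 + np.exp(-k * (x - x0)))

def fitdata(dataseries):
    x = dataseries.index.to_numpy(dtype=float)
    y = dataseries.to_numpy(dtype=float)
    # Assumed starting guess: plateau near max(y), midpoint near mean(x).
    p0 = [y.max(), x.mean(), 1.0]
    popt, _ = curve_fit(sigmoid, x, y, p0=p0)
    # Returning a labelled Series lets apply build a parameters-by-columns frame.
    return pd.Series(popt, index=["a", "x0", "k"])

# Synthetic, noise-free example data (not the data from the question).
x = np.arange(20)
df = pd.DataFrame({
    "a": sigmoid(x, 5.0, 10.0, 1.0),
    "b": sigmoid(x, 2.0, 8.0, 0.8),
})

fittings = df.apply(fitdata)
print(fittings)
```

Each column of `fittings` holds the three fitted parameters for the corresponding data column, in the layout the question asks for, with parameter names as the row labels.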
I think the issue is that applying your fitting function returns an array of shape 3x3 (the 3 fit parameters per column, as returned by conner), whereas apply expects something with the shape of your df, i.e. 20x3.
So you have to evaluate your fit function with these parameters to get the fitted y-values.
def fitdata(dataseries):
    # fit the data
    fitParams, fitCovariances = curve_fit(sigmoid, dataseries.index, dataseries)
    # we have to evaluate the function with the fitted coeffs. at the x-values
    # (the index) so that we get fitted data in the shape of the df again.
    y_fit = sigmoid(dataseries.index.values, fitParams[0], fitParams[1], fitParams[2])
    return y_fit
Have a look here for more examples
(this post is based on both previous answers and provides a complete example including an improvement in the dataframe construction of the fit parameters)
The following function fit_to_dataframe fits an arbitrary function to each column in your data and returns the fit parameters (covariance ignored here) in a convenient format:
def fit_to_dataframe(df, function, parameter_names):
    popts = {}
    pcovs = {}
    for c in df.columns:
        popts[c], pcovs[c] = curve_fit(function, df.index, df[c])
    fit_parameters = pd.DataFrame.from_dict(popts,
                                            orient='index',
                                            columns=parameter_names)
    return fit_parameters
fit_parameters = fit_to_dataframe(df, sigmoid, parameter_names=['a', 'x0', 'k'])
The fit parameters are available in the following form:
a x0 k
a 8.869996 11.714575 0.844969
b 2.366496 12.006009 0.282359
c 8.041970 14.487101 1.516682
In order to inspect the fit result, you can use the following function to plot the results:
def plot_fit_results(df, function, fit_parameters):
    NUM_POINTS = 50
    t = np.linspace(df.index.values.min(), df.index.values.max(), NUM_POINTS)
    df.plot(style='.')
    for idx, column in enumerate(df.columns):
        plt.plot(t,
                 function(t, *fit_parameters.loc[column]),
                 color='C{}'.format(idx))
    plt.show()
plot_fit_results(df, sigmoid, fit_parameters)
Result: Output Graph
This answer is also available as an interactive jupyter notebook here.
