I have the following dataframe:
df = pd.DataFrame({'date': ['2020-6', '2020-07', '2020-8'],
                   'd3_real': [1.2, 1.3, 0.8], 'd7_real': [1.5, 1.8, 1.2],
                   'd14_real': [1.9, 2.1, 1.5], 'd30_real': [2.1, 2.2, 1.8],
                   'd7_mul': [1.12, 1.1, 1.15], 'd14_mul': [1.08, 1.1, 1.14],
                   'd30_mul': [1.23, 1.25, 1.12]})
The dX_real columns hold the actual values on day 3, day 7, day 14 and day 30, and the dX_mul columns hold the multiplier for each of those days.
I want to calculate predictions in the following way: I take a target column (d3_real, d7_real, ...) and multiply it by each multiplier that applies to the case. For example, to project from d3_real to day 30, I need to multiply it by the d7, d14 and d30 multipliers:
df['d30_from_d3'] = df.iloc[:,1] * df.iloc[:,5] * df.iloc[:,6] * df.iloc[:,7]
df['d30_from_d7'] = df.iloc[:,2] * df.iloc[:,6] * df.iloc[:,7]
df['d30_from_d14'] = df.iloc[:,3] * df.iloc[:,7]
Is there any way to automate this with a loop? I do not know how to multiply each dX_real column without writing a conditional for each case, since the number of multiplications changes.
This is what I have tried; it is not working as expected, as it is only multiplying by the first multiplier:
pos_reals = [1, 2, 3]
pos_mul = [5, 6, 7]
clases = ['d3', 'd7', 'd14']

for target in pos_reals:
    for clase in pos_mul:
        df[f'f{clases}_hm_p_d30'] = df.iloc[:, target]
However, from here, I do not know how to specify which values need to be multiplied for d3, d7 and d14.
Thanks!
bbb = [[1, 5, 6, 7], [2, 6, 7], [3, 7]]
ddd = ['d30_from_d3', 'd30_from_d7', 'd30_from_d14']

for i in range(len(ddd)):
    df[ddd[i]] = df.iloc[:, bbb[i][0]]
    for x in range(1, len(bbb[i])):
        df[ddd[i]] = df[ddd[i]] * df.iloc[:, bbb[i][x]]
Output
date d3_real d7_real d14_real d30_real d7_mul d14_mul d30_mul \
0 2020-6 1.2 1.5 1.9 2.1 1.12 1.08 1.23
1 2020-07 1.3 1.8 2.1 2.2 1.10 1.10 1.25
2 2020-8 0.8 1.2 1.5 1.8 1.15 1.14 1.12
d30_from_d3 d30_from_d7 d30_from_d14
0 1.785370 1.99260 2.337
1 1.966250 2.47500 2.625
2 1.174656 1.53216 1.680
Here the outer loop takes the name of each new column from the ddd list and seeds that column with its first value. In the nested loop, the positions of the remaining columns are read from bbb and their values are multiplied in. Check this against your data, or show the expected result for your example so the output can be matched.
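A variant of the same loop that avoids hard-coded column positions: since the column names follow a pattern, the multiplier columns can be selected by name and collapsed with prod. A sketch, assuming the column layout from the question:

mul_cols = ['d7_mul', 'd14_mul', 'd30_mul']
starts = ['d3', 'd7', 'd14']

for i, start in enumerate(starts):
    # each start day is multiplied by every multiplier from the next milestone onward
    df[f'd30_from_{start}'] = df[f'{start}_real'] * df[mul_cols[i:]].prod(axis=1)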
Related
I am new to Python and Pandas and am trying to write a simple function that will repeat a value x amount of times according to an adjacent value.
For example:
I want to take the first column (Weight) and add it to a new column based on the amount next to it (Wheels). So the column will have 1.5 27x, then immediately after will have 2.4 177x, and repeat this for all values shown. Does anyone know a simple way to do this?
Use Series.repeat:
out = df['Weight'].repeat(df['Wheels'])
print(out)
# Output
0 1.5
0 1.5
1 2.4
1 2.4
1 2.4
Name: Weight, dtype: float64
Setup:
df = pd.DataFrame({'Weight': [1.5, 2.4], 'Wheels': [2, 3]})
print(df)
# Output
Weight Wheels
0 1.5 2
1 2.4 3
Assuming you have a pandas dataframe named df.
import numpy as np
np.repeat(df['Weight'], df['Wheels'])
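Either way, the repeated values keep their original row labels; if a clean 0..n-1 index is wanted, a reset can be chained on (same setup as above):

out = df['Weight'].repeat(df['Wheels']).reset_index(drop=True)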
I have a relatively large dataframe (~24000 rows and 15 columns) which has 2D coordinate data of rat movements, outputted by a neural network (DeepLabCut).
As part of this output data, there is a p-value score that measures how certain the neural network was when applying each label. I'm trying to filter out low-quality predictions by copying the previous row into its place each time a low p-value is encountered, on the assumption that the rat remained still for that frame.
Here's my function thus far:
def checkPVals(DataFrame, CutOff):
    for Cols in DataFrame.columns.values:
        if Cols % 3 == 0:
            for Vals in DataFrame.index.values:
                if float(DataFrame[Cols][Vals]) < CutOff:
                    if (Vals != 0):
                        PreviousRow = DataFrame.loc[Vals - 1, Cols - 3:Cols]
                        DataFrame.loc[Vals, Cols - 3:Cols] = PreviousRow
    return DataFrame
Here is a sample of the input data frame:
pd.DataFrame(data={
    "x": [1, 2, 3, 4],
    "y": [5, 4, 3, 2],
    "likelihood": [1, 1, 0.3, 1]
})
Here is a sample of the desired output:
x y Pval
0 1 5 1.0
1 2 4 1.0
2 2 4 1.0
3 4 2 1.0
With the idea being that row index 2 is replaced with values from row index 1, such that when the inter-frame Euclidean distance between these coordinates is calculated, the distance is 0, implying the label (rat) has not moved.
Clearly, my current implementation is very inefficient. I was looking at iterrows(), but that converts my data into a Series and messes with it. My other thought was to convert the p-value columns into np.arrays, iterate through those, take the indices of the p-values below threshold and then swap those rows for the previous one iteratively. However, I feel like that would take just as long.
Any help is very much appreciated. Thank you!
I'm pretty sure I understood what you are attempting to do. If you could update your question with a sample output that's paired with your sample input, that would be greatly beneficial.
If I understood correctly, you should be using a vectorized approach instead of explicit looping (this will massively speed up your data wrangling). Essentially you can mask the rows of the dataframe depending on whether or not the "likelihood" column is above a certain value. Once you mask the low likelihoods away (i.e. replace those values with NaN), you can simply forward fill the entire dataframe to fill in the "bad" rows with the previous row's values.
df = pd.DataFrame(data={
    "x": [1, 2, 3, 4],
    "y": [5, 4, 3, 2],
    "likelihood": [1, 1, 0.3, 1]
})

cutoff = 0.5
new_df = df.mask(df["likelihood"] < cutoff).ffill()
print(new_df)
x y likelihood
0 1.0 5.0 1.0
1 2.0 4.0 1.0
2 2.0 4.0 1.0
3 4.0 2.0 1.0
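For the full 15-column frame, the same mask-and-ffill idea can be applied per coordinate triplet. A minimal sketch, assuming hypothetical flat <bodypart>_x / _y / _likelihood column names (DeepLabCut's real output uses a multi-level header, so the column selection would need adapting):

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'nose_x': [1, 2, 3, 4], 'nose_y': [5, 4, 3, 2],
    'nose_likelihood': [1, 1, 0.3, 1],
    'tail_x': [9, 8, 7, 6], 'tail_y': [1, 2, 3, 4],
    'tail_likelihood': [1, 0.2, 1, 1],
})

cutoff = 0.5
for part in ['nose', 'tail']:  # hypothetical bodypart names
    cols = [f'{part}_x', f'{part}_y', f'{part}_likelihood']
    # blank out the whole triplet wherever its likelihood is below the cutoff...
    df.loc[df[f'{part}_likelihood'] < cutoff, cols] = np.nan

# ...then pull the previous frame's values forward
df = df.ffill()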
I have time series data per row (with columns as time steps) and I'd like to left and right pad each row with 0s based on a conditional row value (i.e. 'Padding amount'). This is what I have:
Padding amount T1 T2 T3
0 3 2.9 2.8
1 2.9 2.8 2.7
1 2.8 2.3 2.0
2 4.4 3.3 2.3
And this is what I'd like to produce:
Padding amount T1 T2 T3 T4 T5
0 3 2.9 2.8 0 0 (--> padding = 0, so no change)
1 0 2.9 2.8 2.7 0 (--> shifted one to the right)
1 0 2.8 2.3 2.0 0
2 0 0 4.4 3.3 2.3 (--> shifted two to the right)
I see that Keras has sequence padding, but I'm not sure how that would apply here, considering all rows already have the same number of entries. I've looked at shift and np.roll, but I'm sure a solution for this already exists somewhere.
In numpy, you could construct an array of indices for the locations where you want to place your array elements.
Let's say you have
padding = np.array([0, 1, 1, 2])
data = np.array([[3.0, 2.9, 2.8],
[2.9, 2.8, 2.7],
[2.8, 2.3, 2.0],
[4.4, 3.3, 2.3]])
M, N = data.shape
The output array would be
output = np.zeros((M, N + padding.max()))
You can make an index of where the data goes:
rows = np.arange(M)[:, None]
cols = padding[:, None] + np.arange(N)
Since the shape of the index broadcasts to the shape of the data, you can assign to the output directly:
output[rows, cols] = data
Not sure how this applies to a DataFrame exactly, but you could probably construct a new one after operating on the values of the old one. Alternatively, you could probably implement all these operations equivalently directly in pandas.
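For instance, a minimal sketch of wrapping the padded array back into a DataFrame (the T1..T5 column names are assumed from the question):

import pandas as pd

result = pd.DataFrame(output, columns=[f'T{i + 1}' for i in range(output.shape[1])])
result.insert(0, 'Padding amount', padding)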
This is one way of doing it; I've made the process flexible in terms of how many time periods/steps it can take:
import pandas as pd

# data
d = {'Padding amount': [0, 1, 1, 2],
     'T1': [3, 2.9, 2.8, 4.4],
     'T2': [2.9, 2.7, 2.3, 3.3],
     'T3': [2.8, 2.7, 2.0, 2.3]}

# create DF
df = pd.DataFrame(data=d)

# get max padding amount
maxPadd = df['Padding amount'].max()

# list of time period columns
timePeriodsCols = [c for c in df.columns.tolist() if 'T' in c]

# reversed list, so values are moved right-to-left without being overwritten
reverseList = timePeriodsCols[::-1]

# number of periods
noOfPeriods = len(timePeriodsCols)

# create the new columns that are needed
for i in range(noOfPeriods + 1, noOfPeriods + 1 + maxPadd):
    df['T' + str(i)] = ''

# loop over records
for i, row in df.iterrows():
    # get padding amount
    padAmount = df.at[i, 'Padding amount']

    # if zero then do nothing
    if padAmount == 0:
        continue

    # else: roll each column value right by the padding amount
    # and set the old location to zero
    else:
        for col in reverseList:
            df.at[i, df.columns[df.columns.get_loc(col) + padAmount]] = \
                df.at[i, df.columns[df.columns.get_loc(col)]]
            df.at[i, df.columns[df.columns.get_loc(col)]] = 0

print(df)
Padding amount T1 T2 T3 T4 T5
0 0 3.0 2.9 2.8
1 1 0.0 2.9 2.7 2.7
2 1 0.0 2.8 2.3 2
3 2 0.0 0.0 4.4 3.3 2.3
I have an array that contains numbers that are distances, and another that represents certain values at that distance. How do I calculate the average of all the data at a fixed value of the distance?
e.g. distances (d): [1 1 14 6 1 12 14 6 6 7 4 3 7 9 1 3 3 6 5 8]
e.g. the data corresponding to each entry of the distances:
[3.3 2.1 3.5 2.5 4.6 7.4 2.6 7.8 9.2 10.11 14.3 2.5 6.7 3.4 7.5 8.5 9.7 4.3 2.8 4.1]
so value=3.3 at d=1; value=2.1 at d=1; value=3.5 at d=14; etc.
For example, at distance d=6 I should take the mean of 2.5, 7.8, 9.2 and 4.3.
I've used the following code that works, but I do not know how to store the values into a new array:
from numpy import mean

for d in set(key):
    print(d, mean([dist[i] for i in range(len(key)) if key[i] == d]))
Please help! Thanks
You've got the hard part done; putting your results into a new list is as easy as:
result = []
for d in set(key):
    result.append(mean([dist[i] for i in range(len(key)) if key[i] == d]))
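If each mean should stay paired with its distance, a dict comprehension does the same job (same key/dist names as in the question):

result = {d: mean([dist[i] for i in range(len(key)) if key[i] == d])
          for d in set(key)}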
Using pandas
g = pd.DataFrame({'d':d, 'k':k}).groupby('d')
Option 1: transform to get the values in the same positions
g.transform('mean').values
Option 2: take the mean directly and get a dict with the mapping
g.mean().to_dict()['k']
Setup
d = np.array([1, 1, 14, 6, 1, 12, 14, 6, 6, 7, 4, 3, 7, 9, 1, 3, 3, 6, 5, 8])
k = np.array([3.3, 2.1, 3.5, 2.5, 4.6, 7.4, 2.6, 7.8, 9.2, 10.11, 14.3,
              2.5, 6.7, 3.4, 7.5, 8.5, 9.7, 4.3, 2.8, 4.1])
scipy.sparse + csr_matrix
from scipy import sparse

s = d.shape[0]        # number of samples
r = np.arange(s + 1)  # indptr: one entry per row
m = d.max() + 1       # number of distance bins
b = np.bincount(d)    # count of samples per distance

# one row per sample, one column per distance; summing down the rows
# gives the total of k per distance
out = sparse.csr_matrix((k, d, r), (s, m)).sum(0).A1
(out / b)[d]
array([ 4.375, 4.375, 3.05 , 5.95 , 4.375, 7.4 , 3.05 , 5.95 ,
5.95 , 8.405, 14.3 , 6.9 , 8.405, 3.4 , 4.375, 6.9 ,
6.9 , 5.95 , 2.8 , 4.1 ])
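The same grouped means can also be reached without scipy via np.bincount's weights argument; note the division emits NaN for distances that never occur, which indexing with d then drops:

sums = np.bincount(d, weights=k)   # per-distance totals of k
counts = np.bincount(d)            # per-distance counts
with np.errstate(invalid='ignore'):
    means = sums / counts          # NaN where a distance is absent
means[d]                           # mean mapped back onto each entry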
You could use numpy's array in combination with where, from the same library.
You can define a function to get the positions of the desired distances:
from numpy import mean, array, where

def key_distances(distances, d):
    return where(distances == d)[0]
then you use it for getting the values at those positions.
Let's say you have:
d = array([1,1,14,6,1,12,14,6,6,7,4,3,7,9,1,3,3,6,5,8])
v = array([3.3,2.1,3.5,2.5,4.6,7.4,2.6,7.8,9.2,10.11,14.3,2.5,6.7,3.4,7.5,8.5,9.7,4.3,2.8,4.1])
Then you might do something like:
vs = v[key_distances(d,d[1])]
Then get your mean:
print(mean(vs))
The numpy_indexed package (disclaimer: I am its author) was designed with these use-cases in mind:
import numpy_indexed as npi
npi.group_by(d).mean(dist)
Pandas can do similar things, but its API isn't really tailored to these use-cases; for such an elementary operation as a group-by, it feels kind of wrong to have to hoist your data into a completely new data structure.
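For reference, if I recall the numpy_indexed API correctly, the grouped reductions return the unique keys together with the per-group values, so (with the d/k setup above) the call unpacks as:

unique_d, mean_per_d = npi.group_by(d).mean(k)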
Given the following numpy.ndarrays of identical length:
nparray_upper = [ 5.2 4.9 7.6 10.1]
nparray_base = [ 2.2 2.6 5.5 11.02]
nparray_lower = [ 4.3 1.4 3.2 8.9]
and a fixed size variable
multiplier = 10
how do I multiply the index of each entry by the multiplier, based on a condition?
indexMultiplierCondition = np.where(((nparray_base <= nparray_upper) & (nparray_base >= nparray_lower)), INDEX * multiplier, 0).sum()
The above should return
indexMultiplierCondition = 30
because only 2.6 and 5.5 in nparray_base fall between the lower and upper levels, and the sum of their indices (1 and 2) multiplied by 10 is 30.
This should be as efficient as possible.
np.where returns a tuple, so you can retrieve the tuple's first element (which is an np.ndarray) and multiply it by a scalar of your choice.
For example,
# with a = nparray_upper, b = nparray_base, c = nparray_lower and m = multiplier
i = np.where((b <= a) & (c <= b))
i
# (array([1, 2], dtype=int64),)
i[0] * m
# array([10, 20], dtype=int64)
(i[0] * m).sum()
# 30
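Put together as a runnable whole (np.nonzero(mask) is equivalent to np.where(mask) with a single argument):

import numpy as np

nparray_upper = np.array([5.2, 4.9, 7.6, 10.1])
nparray_base = np.array([2.2, 2.6, 5.5, 11.02])
nparray_lower = np.array([4.3, 1.4, 3.2, 8.9])
multiplier = 10

mask = (nparray_base <= nparray_upper) & (nparray_base >= nparray_lower)
indexMultiplierCondition = (np.nonzero(mask)[0] * multiplier).sum()
print(indexMultiplierCondition)  # 30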