Calculate running total based off original value in pandas - python

I wish to take an initial value of 1000 and multiply it by the first value in the 'Change' column, then take that result and multiply it by the second value in the 'Change' column, and so on.
I could do this by using a loop as follows:
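import numpy as np
import pandas as pd

changes = [0.97, 1.02, 1.1, 0.88, 1.01]
df = pd.DataFrame({'Change': changes})
df['Total'] = np.nan
# Seed the first row from the initial value of 1000
df.loc[0, 'Total'] = 1000 * df.loc[0, 'Change']
# Each later row multiplies the previous total by the current change
for i in range(1, len(df)):
    df.loc[i, 'Total'] = df.loc[i - 1, 'Total'] * df.loc[i, 'Change']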
Output:
Change Total
0 0.97 970.000000
1 1.02 989.400000
2 1.10 1088.340000
3 0.88 957.739200
4 1.01 967.316592
But this will be too slow for a large dataset. Is there any way to do this without loops?
Thanks
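One loop-free option, as a minimal sketch: the loop above is a running product, so Series.cumprod scaled by the starting value should give the same column. The name initial is only an illustrative variable, not something from the question.

```python
import pandas as pd

df = pd.DataFrame({'Change': [0.97, 1.02, 1.1, 0.88, 1.01]})

initial = 1000  # starting value from the question
# cumprod gives the running product of 'Change'; scaling by the
# starting value reproduces the loop's 'Total' column
df['Total'] = initial * df['Change'].cumprod()
print(df)
```

This works because Total[i] = 1000 * Change[0] * ... * Change[i], which is exactly what a cumulative product computes.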

Related

Divide a column(containing both negative numbers and positive numbers) of a Pandas Dataframe with a specific number and create a new column in python?

Column A contains 49531.78, -3.178, -2.119.
I want to divide 49939.203 by each value in A. In Excel I used the formula =IFERROR(49939.203/A1,0).
B should be roughly 1.01, -15714.04, -23567.35. How can I do this division in Python?
Use Series.rdiv to divide from the right side:
df['B'] = df.A.rdiv(49939.203).replace(np.inf, 0).round(2)
print (df)
A B
0 49531.780 1.01
1 -3.178 -15714.03
2 -2.119 -23567.34
3 0.000 0.00
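For reference, a self-contained sketch of the same idea. The sample frame is rebuilt here with a zero row added, as in the output above. Replacing -inf as well as inf is my own assumption about what IFERROR(..., 0) should cover (a negative zero in A would divide to -inf).

```python
import numpy as np
import pandas as pd

# Sample data from the question plus one zero row
df = pd.DataFrame({'A': [49531.78, -3.178, -2.119, 0.0]})

# rdiv computes 49939.203 / A; division by zero yields inf (or -inf),
# which is mapped to 0 to mimic IFERROR(..., 0)
df['B'] = df['A'].rdiv(49939.203).replace([np.inf, -np.inf], 0).round(2)
print(df)
```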

Cumulative sum of a pandas column until a maximum value is met, and average adjacent rows

I'm a biology student who is fairly new to Python and was hoping someone might be able to help with a problem I have yet to solve.
With some earlier code I have created a pandas dataframe that looks like the example below:
Distance. No. of values Mean rSquared
1 500 0.6
2 80 0.3
3 40 0.4
4 30 0.2
5 50 0.2
6 30 0.1
I can provide my previous code to create this dataframe, but I didn't think it was particularly relevant.
I need to sum the "No. of values" column until I reach a value >= 100, and then combine the adjacent rows that make up that sum, taking the weighted average of the distance and mean r2 values, as seen in the example below:
Mean Distance. No. Of values Mean rSquared
1 500 0.6
(80*2+40*3)/120 (80+40) = 120 (80*0.3+40*0.4)/120
(30*4+50*5+30*6)/110 (30+50+30) = 110 (30*0.2+50*0.2+30*0.1)/110
etc...
I know pandas has its .cumsum function, which I might be able to use in a for loop with an if statement that checks the upper limit and resets the sum back to 0 when it is greater than or equal to the upper limit. However, I haven't a clue how to average the adjacent rows.
Any help would be appreciated!
You can use this code snippet to solve your problem.
# First, compute some weighted values
df.loc[:, "weighted_distance"] = df["Distance"] * df["No. of values"]
df.loc[:, "weighted_mean_rSquared"] = df["Mean rSquared"] * df["No. of values"]
min_threshold = 100
indexes = []
temp_sum = 0
# placeholder for final result
final_df = pd.DataFrame()
columns = ["Distance", "No. of values", "Mean rSquared"]
# resetting the index so that positional lookups with iloc match the loop index
df = df.reset_index(drop=True)
# main loop to check and compute the desired output
for index, _ in df.iterrows():
    temp_sum += df.iloc[index]["No. of values"]
    indexes.append(index)
    # if the sum reaches or exceeds 'min_threshold' then do some computation
    if temp_sum >= min_threshold:
        temp_distance = df.iloc[indexes]["weighted_distance"].sum() / temp_sum
        temp_mean_rSquared = df.iloc[indexes]["weighted_mean_rSquared"].sum() / temp_sum
        # create temporary dataframe and concatenate with the 'final_df'
        temp_df = pd.DataFrame([[temp_distance, temp_sum, temp_mean_rSquared]], columns=columns)
        final_df = pd.concat([final_df, temp_df])
        # reset the variables
        temp_sum = 0
        indexes = []
NumPy has a function, numpy.frompyfunc, that you can use to get the cumulative value based on a threshold.
Here's how to implement it. With that, you can then figure out the index at which the value reaches the threshold. Use that to calculate the Mean Distance and Mean rSquared for the values in your original dataframe.
I also leveraged @sujanay's idea of calculating the weighted values first.
import numpy as np
import pandas as pd

c = ['Distance', 'No. of values', 'Mean rSquared']
d = [[1, 500, 0.6], [2, 80, 0.3], [3, 40, 0.4],
     [4, 30, 0.2], [5, 50, 0.2], [6, 30, 0.1]]
df = pd.DataFrame(d, columns=c)
# calculate the weighted distance and weighted mean rSquared first
df.loc[:, "w_distance"] = df["Distance"] * df["No. of values"]
df.loc[:, "w_mean_rSqrd"] = df["Mean rSquared"] * df["No. of values"]
# use numpy.frompyfunc to set up the threshold condition: keep adding while the
# running sum is below 100, otherwise restart from the current value
# (a < 100 rather than a <= 100, so a sum of exactly 100 also closes a group)
sumvals = np.frompyfunc(lambda a, b: a + b if a < 100 else b, 2, 1)
# assign value to cumvals based on threshold
# (np.object was removed in NumPy 1.24; the builtin object works everywhere)
df['cumvals'] = sumvals.accumulate(df['No. of values'].to_numpy(), dtype=object)
# find out all records that have >= 100 as cumulative values
idx = df.index[df['cumvals'] >= 100].tolist()
# if last row not in idx, then add it to the list
if (len(df) - 1) not in idx:
    idx += [len(df) - 1]
# iterate through idx for each set and calculate Mean Distance and Mean rSquared
i = 0
for j in idx:
    df.loc[j, 'Mean Distance'] = (df.iloc[i:j+1]["w_distance"].sum() / df.loc[j, 'cumvals']).round(2)
    df.loc[j, 'New Mean rSquared'] = (df.iloc[i:j+1]["w_mean_rSqrd"].sum() / df.loc[j, 'cumvals']).round(2)
    i = j + 1
print(df)
The output of this will be:
Distance No. of values ... Mean Distance New Mean rSquared
0 1 500 ... 1.00 0.60
1 2 80 ... NaN NaN
2 3 40 ... 2.33 0.33
3 4 30 ... NaN NaN
4 5 50 ... NaN NaN
5 6 30 ... 5.00 0.17
If you want to extract only the records that are non NaN, you can do:
final_df = df[df['Mean Distance'].notnull()]
This will result in:
Distance No. of values ... Mean Distance New Mean rSquared
0 1 500 ... 1.00 0.60
2 3 40 ... 2.33 0.33
5 6 30 ... 5.00 0.17
I looked up BEN_YO's implementation of numpy.frompyfunc. The original SO post is "Restart cumsum and get index if cumsum more than value".
If you figure out the grouping first, pandas groupby-functionality will do a lot of the remaining work for you. A loop is appropriate to get the grouping (unless somebody has a clever one-liner):
>>> groups = []
>>> group = 0
>>> cumsum = 0
>>> for n in df["No. of values"]:
...     if cumsum >= 100:
...         cumsum = 0
...         group = group + 1
...     cumsum = cumsum + n
...     groups.append(group)
...
>>> groups
[0, 1, 1, 2, 2, 2]
Before doing the grouped operations you need to use the No. of values information to get the weighting in:
df[["Distance.", "Mean rSquared"]] = df[["Distance.", "Mean rSquared"]].multiply(df["No. of values"], axis=0)
Now get the sums like this:
>>> sums = df.groupby(groups)["No. of values"].sum()
>>> sums
0 500
1 120
2 110
Name: No. of values, dtype: int64
And finally the weighted group averages like this:
>>> df[["Distance.", "Mean rSquared"]].groupby(groups).sum().div(sums, axis=0)
Distance. Mean rSquared
0 1.000000 0.600000
1 2.333333 0.333333
2 5.000000 0.172727
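The groupby steps above can be collected into one runnable sketch. bin_by_threshold is a hypothetical helper name of mine, and the column is spelled 'Distance' without the trailing dot here. Unlike the in-place multiply above, this version leaves the original columns untouched.

```python
import pandas as pd

df = pd.DataFrame({
    'Distance': [1, 2, 3, 4, 5, 6],
    'No. of values': [500, 80, 40, 30, 50, 30],
    'Mean rSquared': [0.6, 0.3, 0.4, 0.2, 0.2, 0.1],
})

def bin_by_threshold(frame, threshold=100):
    """Hypothetical helper: merge consecutive rows until their count reaches threshold."""
    groups, group, running = [], 0, 0
    for n in frame['No. of values']:
        if running >= threshold:  # previous group is complete, start a new one
            running, group = 0, group + 1
        running += n
        groups.append(group)
    counts = frame['No. of values']
    # weight each row by its count, then divide the group sums by the group counts
    weighted = frame[['Distance', 'Mean rSquared']].multiply(counts, axis=0)
    sums = counts.groupby(groups).sum()
    out = weighted.groupby(groups).sum().div(sums, axis=0)
    out.insert(1, 'No. of values', sums)
    return out

result = bin_by_threshold(df)
print(result)
```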

Merge multiple measurements into a pandas dataframe

I have some measurements organized in *.csv files as follows:
m_number,value
0,0.154
1,0.785
…
55,0.578
NaN,NaN
0,1.214
1,0.742
…
So there is always a set of x measurements (x should be constant inside a single file but it's not guaranteed and I have to check this number) separated by a NaN line.
After reading the data into a dataframe, I want to reorganize it for later usage:
m_number value 1 value 2 value 3 value 4
0 0 0.154 0.214 0.229 0.234
1 1 0.785 0.742 0.714 0.771
...
55 55 0.578 0.647 0.597 0.623
Each set of measurements should be one column.
Here's a snippet of the code:
split_index = df.index[df_benchmark['id'].isnull()]
df_sliced = pd.DataFrame()
for i, index in enumerate(split_index):
    if i == 0:
        df_sliced = df.loc[0:index - 1].copy()
    else:
        #ToDo: Rename first column to 'value 1' if more than 1 measurement
        temp = df['value'].loc[0:index - 1].copy()
        temp.reset_index(drop=True, inplace=True)
        df_sliced['value '+str(i)] = temp
    df.drop(df.index[0:index - split_index[i - 1]], inplace=True)
The code works, but I do not like my current approach. So I'm asking if there's a better and more elegant solution for this problem.
Best,
Julz
You can use cumsum, set_index, and unstack to do this in three lines of code:
import numpy as np
import pandas as pd

#Create dummy data with 4 runs of 10 measures
df = pd.DataFrame({'m_number':np.tile(np.arange(10),4), 'value':np.random.random(40)})
#Use condition to find first run and increment using cumsum and unstack to create
#MultiIndex column headers (original version; it puts the runs in rows)
df_u = df.set_index([df['m_number'].eq(0).cumsum(), df['m_number']])[['value']].unstack()
#Use condition to find first run and increment using cumsum and unstack to create
#MultiIndex column headers (Corrected per comments below)
df_u = df.set_index([df['m_number'], df['m_number'].eq(0).cumsum()])[['value']].unstack()
#Flatten MultiIndex column headers
df_u.columns = [f'{i}_{j}' for i, j in df_u.columns]
#Display results
df_u
Output:
value_1 value_2 value_3 value_4
m_number
0 0.919057 0.064409 0.288592 0.742759
1 0.449587 0.867031 0.193493 0.853700
2 0.551929 0.925111 0.895273 0.117306
3 0.487501 0.893696 0.696540 0.381469
4 0.389431 0.818801 0.771516 0.489404
5 0.790619 0.478995 0.023236 0.344112
6 0.015389 0.815073 0.195856 0.628263
7 0.068860 0.483731 0.752803 0.581106
8 0.109404 0.281335 0.330910 0.909965
9 0.695120 0.538676 0.766864 0.247283
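The dummy data above has no NaN separator rows, so here is a hedged sketch of how the file layout from the question might be handled. The small raw frame is made-up sample data in that layout, and the names run, sizes and wide are mine. It labels each run by counting the separators seen so far, checks that all runs have the same length, and pivots to one column per run.

```python
import numpy as np
import pandas as pd

# Hypothetical sample in the question's layout: runs separated by a NaN row
raw = pd.DataFrame({
    'm_number': [0, 1, 2, np.nan, 0, 1, 2, np.nan, 0, 1, 2],
    'value': [0.154, 0.785, 0.578, np.nan,
              0.214, 0.742, 0.647, np.nan,
              0.229, 0.714, 0.597],
})

# Each NaN row starts a new run, so the count of NaNs so far is the run label
run = raw['m_number'].isna().cumsum() + 1
data = raw.dropna().assign(run=run)
data['m_number'] = data['m_number'].astype(int)

# Check that every run has the same number of measurements
sizes = data.groupby('run').size()
if sizes.nunique() != 1:
    print('Runs differ in length:', sizes.to_dict())

# One column per run, one row per measurement number
wide = data.pivot(index='m_number', columns='run', values='value')
wide.columns = [f'value {r}' for r in wide.columns]
wide = wide.reset_index()
print(wide)
```

If the runs do differ in length, the pivot still works and the shorter runs simply get NaN in the missing rows.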

Selecting random values from dataframe without replacement

I am following the answer from the link:
If I have a dataframe df as:
Month Day mnthShape
1 1 1.01
1 1 1.09
1 1 0.96
1 2 1.01
1 1 1.09
1 2 0.96
1 3 1.01
1 3 1.09
1 3 1.78
I want to get the following from df:
Month Day mnthShape
1 1 1.01
1 2 1.01
1 1 0.96
where the mnthShape values are selected at random from the index without replacement. i.e. if the query is df.loc[(1, 1)] it should look for all values for (1, 1) and select randomly from it a value to be displayed above. If another df.loc[(1,1)] appears it should select randomly again but without replacement.
I know I need to modify the code to use the following:
apply(np.random.choice, replace=False)
But not sure how to do it.
Edit:
Every time I do df.loc[(1, 1)], it should give a new value without replacement. I intend to do df.loc[(1, 1)] multiple times. In the previous question, it was just one time.
If you're trying to sample from the dataset without replacement, it probably makes sense to do this all in one go, rather than iteratively pulling a sample from the dataset.
Pulling N samples from each month/day combo requires that there be sufficient combinations to pull N without replacement. But assuming this is true, you could write a function to sample N values from a subset of the data:
import numpy as np

def select_n(subset, n=2):
    # Pick n distinct row positions within this Month/Day subset
    choices = np.random.choice(len(subset), size=n, replace=False)
    return (
        subset
        .mnthShape
        .iloc[choices]
        .reset_index(drop=True)
        .rename_axis('choice'))
To apply this across the whole dataset:
In [34]: df.groupby(['Month', 'Day']).apply(select_n)
Out[34]:
choice 0 1
Month Day
1 1 1.09 0.96
2 0.96 1.01
3 1.09 1.01
If you really need to pull these one at a time, you'll still need to generate the samples all at once to guarantee that they're drawn without replacement, but you could generate the sample indices separately from subsetting the data:
In [48]: indices = np.random.choice(3, size=2, replace=False)
In [49]: df[((df.Month == 1) & (df.Day == 2))].iloc[indices[0]]
Out[49]:
Month 1.00
Day 2.00
mnthShape 1.01
Name: 3, dtype: float64
In [50]: df[((df.Month == 1) & (df.Day == 2))].iloc[indices[1]]
Out[50]:
Month 1.00
Day 2.00
mnthShape 0.96
Name: 5, dtype: float64
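If the values really have to come out one query at a time, one possible sketch is to shuffle each Month/Day group once up front and then hand values out from a per-group queue. The queues dict and the draw function are hypothetical names of mine, not part of pandas.

```python
import pandas as pd

df = pd.DataFrame({
    'Month': [1] * 9,
    'Day': [1, 1, 1, 2, 1, 2, 3, 3, 3],
    'mnthShape': [1.01, 1.09, 0.96, 1.01, 1.09, 0.96, 1.01, 1.09, 1.78],
})

# Shuffle every (Month, Day) group once; each list then acts as a queue
queues = {
    (int(month), int(day)): group['mnthShape'].sample(frac=1).tolist()
    for (month, day), group in df.groupby(['Month', 'Day'])
}

def draw(month, day):
    """Return the next unused value for (month, day).

    Raises IndexError once the group has no values left.
    """
    return queues[(month, day)].pop()

first = draw(1, 1)
second = draw(1, 1)  # a different row of the (1, 1) group than `first`
print(first, second)
```

Because each group is shuffled only once, repeated calls never return the same row twice, which is the without-replacement guarantee asked for.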

Calculate a rolling window weighted average on a Pandas column

I'm relatively new to python, and have been trying to calculate some simple rolling weighted averages across rows in a pandas data frame. I have a dataframe of observations df and a dataframe of weights w. I create a new dataframe to hold the inner-product between these two sets of values, dot.
As w is of smaller dimension, I use a for loop to calculate the weighted average by row, of the leading rows equal to the length of w.
More clearly, my set-up is as follows:
import pandas as pd
df = pd.DataFrame([0,1,2,3,4,5,6,7,8], index = range(0,9))
w = pd.DataFrame([0.1,0.25,0.5], index = range(0,3))
dot = pd.DataFrame(0, columns = ['dot'], index = df.index)
for i in range(0,len(df)):
    df.loc[i] = sum(df.iloc[max(1,(i-3)):i].values * w.iloc[-min(3,(i-1)):4].values)
I would expect the result to be as follows (i.e. when i = 4)
dot.loc[4] = sum(df.iloc[max(1,(4-3)):4].values * w.iloc[-min(3,(4-1)):4].values)
print(dot.loc[4]) # 2.1
However, when running the for loop above, I receive the error:
ValueError: operands could not be broadcast together with shapes (0,1) (2,1)
Which is where I get confused - I think it must have to do with how I call i into iloc, as I don't receive shape errors when I manually calculate it, as in the example with 4 above. However, looking at other examples and documentation, I don't see why that's the case... Any help is appreciated.
Your first problem is that you are trying to multiply arrays of two different sizes. For example, when i=0 the different parts of your for loop return
df.iloc[max(1,(0-3)):0].values.shape
# (0,1)
w.iloc[-min(3,(0-1)):4].values.shape
# (2,1)
Which is exactly the error you are getting. The easiest way I can think of to make the arrays multipliable is to pad your dataframe with leading zeros, using concatenation.
df2 = pd.concat([pd.Series([0,0]),df], ignore_index=True)
df2
0
0 0
1 0
2 0
3 1
4 2
5 3
6 4
7 5
8 6
9 7
10 8
You can now use your for loop (with some minor tweaking):
for i in range(len(df)):
    dot.loc[i] = sum(df2.iloc[i:i+3].values * w.values)
A nicer way might be the one JohnE suggested: use the rolling and apply functions built into pandas, thereby getting rid of your for loop:
import numpy as np
df2.rolling(3, min_periods=3).apply(lambda x: np.dot(x, w[0].to_numpy()), raw=True)
0
0 NaN
1 NaN
2 0.00
3 0.50
4 1.25
5 2.10
6 2.95
7 3.80
8 4.65
9 5.50
10 6.35
You can also drop the first two padding rows and reset the index:
df2.rolling(3, min_periods=3).apply(lambda x: np.dot(x, w[0].to_numpy()), raw=True).drop([0,1]).reset_index(drop=True)
0
0 0.00
1 0.50
2 1.25
3 2.10
4 2.95
5 3.80
6 4.65
7 5.50
8 6.35
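As an alternative sketch, the same padded rolling dot product can be written in plain NumPy. This assumes NumPy 1.20 or later for sliding_window_view, and the variable names are mine.

```python
import numpy as np
import pandas as pd

values = np.arange(9, dtype=float)      # the observations 0..8
weights = np.array([0.1, 0.25, 0.5])    # the weights w

# Pad with len(weights) - 1 leading zeros, as in the answer above
padded = np.concatenate([np.zeros(len(weights) - 1), values])
# Each row of `windows` is one trailing window of length len(weights)
windows = np.lib.stride_tricks.sliding_window_view(padded, len(weights))
# Matrix-vector product = weighted sum of every window at once
dot = pd.Series(windows @ weights, name='dot')
print(dot)
```

This avoids calling a Python lambda once per window, which may matter on long series.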
