Selecting random values from dataframe without replacement - python

I am following the answer from the link:
If I have a dataframe df as:
Month  Day  mnthShape
    1    1       1.01
    1    1       1.09
    1    1       0.96
    1    2       1.01
    1    1       1.09
    1    2       0.96
    1    3       1.01
    1    3       1.09
    1    3       1.78
I want to get the following from df:
Month  Day  mnthShape
    1    1       1.01
    1    2       1.01
    1    1       0.96
where the mnthShape values are selected at random from the index without replacement. I.e., if the query is df.loc[(1, 1)], it should look at all the values for (1, 1) and pick one of them at random to display above. If another df.loc[(1, 1)] appears, it should select at random again, but without replacement.
I know I need to modify the code to use the following:
apply(np.random.choice, replace=False)
But not sure how to do it.
Edit:
Every time I do df.loc[(1, 1)], it should give a new value without replacement. I intend to do df.loc[(1, 1)] multiple times. In the previous question, it was just one time.

If you're trying to sample from the dataset without replacement, it probably makes sense to do this all in one go, rather than iteratively pulling a sample from the dataset.
Pulling N samples from each month/day combo requires that each combo have at least N rows to draw from without replacement. Assuming this is true, you could write a function to sample N values from a subset of the data:
import numpy as np

def select_n(subset, n=2):
    # draw n distinct row positions from this subset
    choices = np.random.choice(len(subset), size=n, replace=False)
    return (
        subset
        .mnthShape
        .iloc[choices]
        .reset_index(drop=True)
        .rename_axis('choice'))
to apply this across the whole dataset:
In [34]: df.groupby(['Month', 'Day']).apply(select_n)
Out[34]:
choice        0     1
Month Day
1     1    1.09  0.96
      2    0.96  1.01
      3    1.09  1.01
If you really need to pull these one at a time, you'll still need to generate the samples all at once to guarantee that they're drawn without replacement, but you could generate the sample indices separately from subsetting the data:
In [48]: indices = np.random.choice(3, size=2, replace=False)

In [49]: df[(df.Month == 1) & (df.Day == 2)].iloc[indices[0]]
Out[49]:
Month        1.00
Day          2.00
mnthShape    1.01
Name: 3, dtype: float64

In [50]: df[(df.Month == 1) & (df.Day == 2)].iloc[indices[1]]
Out[50]:
Month        1.00
Day          2.00
mnthShape    0.96
Name: 5, dtype: float64
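If you really do want df.loc[(1, 1)]-style lookups to hand back a fresh value on every call, one sketch (my own addition, not from the thread; group_samplers and next_mnth_shape are hypothetical names) is to shuffle each group once up front and pop from it:

import numpy as np
import pandas as pd

# Shuffle each (Month, Day) group once; every lookup then pops the next value,
# which guarantees draws without replacement across repeated calls.
group_samplers = {
    key: iter(grp.mnthShape.sample(frac=1).tolist())
    for key, grp in df.groupby(['Month', 'Day'])
}

def next_mnth_shape(month, day):
    # Raises StopIteration once a group is exhausted (nothing left to draw)
    return next(group_samplers[(month, day)])

next_mnth_shape(1, 1)  # a random value for (1, 1)
next_mnth_shape(1, 1)  # a different value, still without replacement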

Related

Calculate running total based off original value in pandas

I wish to take an initial value of 1000 and multiply it by the first value in the 'Change' column, then take that result and multiply it by the second value in the 'Change' column, and so on.
I could do this by using a loop as follows:
changes = [0.97, 1.02, 1.1, 0.88, 1.01]
df = pd.DataFrame()
df['Change'] = changes
df['Total'] = np.nan
df['Total'][0] = 1000 * df['Change'][0]
for i in range(1, len(df)):
    df['Total'][i] = df['Total'][i-1] * df['Change'][i]
Output:
   Change        Total
0    0.97   970.000000
1    1.02   989.400000
2    1.10  1088.340000
3    0.88   957.739200
4    1.01   967.316592
But this will be too slow for a large dataset. Is there any way to do this without loops?
Thanks
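A loop-free sketch (my own, not from the thread) would use pandas' cumprod, which reproduces the totals above:

import pandas as pd

changes = [0.97, 1.02, 1.1, 0.88, 1.01]
df = pd.DataFrame({'Change': changes})
# Running product of all changes so far, scaled by the initial 1000
df['Total'] = 1000 * df['Change'].cumprod()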

How to transform all values NOT = 0 into 1 with REGEX, efficiently

I have a column that contains 0s and dates like 12/02/19. I want to transform all the dates into ones and multiply them by the column Enrolls_F.
I would prefer REGEX, but any other option is fine too; it is a large dataset, and a simple for loop was more than my kernel could run.
Data:
df = pd.DataFrame({ 'Enrolled_Date': ['0','2019/02/04','0','0','2019/02/04','0'] , 'Enrolls_F': ['1.11','1.11','0.222','1.11','5.22','1'] })
Attempts:
Trying to search for everything that starts with 2, replace it with 1, and multiply by Enrolls_F:
df_test = (df.replace({'Enrolled_Date': r'2.$'}, {'Enrolled_Date': '1'}, regex=True)) * df.Enrolls_F
# Nothing happens
IIUC, this should help you get the trouble sorted:
import pandas as pd
import numpy as np
df = pd.DataFrame({ 'Enrolled_Date': ['0','2019/02/04','0','0','2019/02/04','0'] , 'Enrolls_F': ['1.11','1.11','0.222','1.11','5.22','1'] })
df['Enrolled_Date'] = np.where(df['Enrolled_Date'] == '0',0,1)
df['multiplication_column'] = df['Enrolled_Date'] * df['Enrolls_F']
print(df)
Output:
   Enrolled_Date Enrolls_F multiplication_column
0              0      1.11
1              1      1.11                  1.11
2              0     0.222
3              0      1.11
4              1      5.22                  5.22
5              0         1
Note that Enrolls_F holds strings, so the blank cells above are empty strings. If you want the output as floats, try this:
df.Enrolled_Date.ne('0').astype(int) * df.Enrolls_F.astype(float)
Out[212]:
0    0.00
1    1.11
2    0.00
3    0.00
4    5.22
5    0.00
dtype: float64
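Since the question asked for a regex route: the attempted pattern r'2.$' only matches a "2" followed by exactly one trailing character, which no full date satisfies. A sketch with a pattern anchored at the start of the string (my own pattern, not from the thread):

# Match any value beginning with "2" (the dates) and collapse it to "1"
flags = df['Enrolled_Date'].replace(r'^2.*$', '1', regex=True).astype(int)
df['multiplication_column'] = flags * df['Enrolls_F'].astype(float)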

Populate column in dataframe based on iat values

lookup={'Tier':[1,2,3,4],'Terr.1':[0.88,0.83,1.04,1.33],'Terr.2':[0.78,0.82,0.91,1.15],'Terr.3':[0.92,0.98,1.09,1.33],'Terr.4':[1.39,1.49,1.66,1.96],'Terr.5':[1.17,1.24,1.39,1.68]}
df={'Tier':[1,1,2,2,3,2,4,4,4,1],'Territory':[1,3,4,5,4,4,2,1,1,2]}
df=pd.DataFrame(df)
lookup=pd.DataFrame(lookup)
lookup contains the lookup values, and df contains the data being fed into iat.
I get the correct values when I print(lookup.iat[tier,terr]). However, when I try to set those values in a new column, it endlessly runs, or in this simple test case just copies 1 value 10 times.
for i in df["Tier"]:
    tier = i - 1
    for j in df["Territory"]:
        terr = j
        #print(lookup.iat[tier, terr])
        df["Rate"] = lookup.iat[tier, terr]
Any thoughts on a possible better solution?
You can use apply() after some modification to your lookup dataframe:
lookup = lookup.rename(columns={i: i.split('.')[-1] for i in lookup.columns}).set_index('Tier')
lookup.columns = lookup.columns.astype(int)
df['Rate'] = df.apply(lambda x: lookup.loc[x['Tier'],x['Territory']], axis=1)
Returns:
   Tier  Territory  Rate
0     1          1  0.88
1     1          3  0.92
2     2          4  1.49
3     2          5  1.24
4     3          4  1.66
5     2          4  1.49
6     4          2  1.15
7     4          1  1.33
8     4          1  1.33
9     1          2  0.78
Once lookup is modified a bit, in the same way as @rahlf23 did but also using stack, you can merge both dataframes:
df['Rate'] = df.merge(lookup.rename(columns={i: int(i.split('.')[-1])
                                             for i in lookup.columns if 'Terr' in i})
                            .set_index('Tier').stack()
                            .reset_index().rename(columns={'level_1': 'Territory'}),
                      how='left')[0]
If you have a big dataframe df, this should be faster than using apply and loc.
Also, if any (Tier, Territory) pair in df does not exist in lookup, this method won't throw an error.
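If speed really matters, a further sketch (my own, not from the thread) skips pandas lookups entirely and indexes the rate matrix with NumPy, assuming lookup still has its original Tier + Terr.N layout and that Tier and Territory are dense 1-based integers:

import numpy as np

# Rate matrix: row = Tier - 1, column = Territory - 1
rates = lookup.set_index('Tier').to_numpy()
df['Rate'] = rates[df['Tier'].to_numpy() - 1,
                   df['Territory'].to_numpy() - 1]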

Pandas - Using `.rolling()` on multiple columns

Consider a pandas DataFrame which looks like the one below
      A     B     C
0  0.63  1.12  1.73
1  2.20 -2.16 -0.13
2  0.97 -0.68  1.09
3 -0.78 -1.22  0.96
4 -0.06 -0.02  2.18
I would like to use the function .rolling() to perform the following calculation for t = 0,1,2:
Select the rows from t to t+2
Take the 9 values contained in those 3 rows, from all the columns. Call this set S
Compute the 75th percentile of S (or other summary statistics about S)
For instance, for t = 1 we have
S = { 2.2 , -2.16, -0.13, 0.97, -0.68, 1.09, -0.78, -1.22, 0.96 } and the 75th percentile is 0.97.
I couldn't find a way to make it work with .rolling(), since it apparently takes each column separately. I'm now relying on a for loop, but it is really slow.
Do you have any suggestion for a more efficient approach?
One solution is to stack the data, multiply your window size by the number of columns, and slice the result by the number of columns. Also, since you want a forward-looking window, reverse the order of the stacked DataFrame:
wsize = 3
cols = len(df.columns)
(df.stack(dropna=False)[::-1]
   .rolling(window=wsize * cols)
   .quantile(0.75)[cols - 1::cols]
   .reset_index(-1, drop=True)
   .sort_index())
Output:
0    1.12
1    0.97
2    0.97
3     NaN
4     NaN
dtype: float64
In the case of many columns and a small window:
import pandas as pd
import numpy as np
wsize = 3
# line up each row with its next (wsize - 1) rows, then reduce row-wise
df2 = pd.concat([df.shift(-x) for x in range(wsize)], axis=1)
s_quant = df2.quantile(0.75, axis=1)
# Only necessary if you need to enforce sufficient data.
s_quant[df2.isnull().any(axis=1)] = np.NaN
Output (s_quant):
0    1.12
1    0.97
2    0.97
3     NaN
4     NaN
Name: 0.75, dtype: float64
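Another sketch (my own, not from the thread), assuming NumPy >= 1.20, builds the forward-looking windows directly with sliding_window_view:

import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

arr = df.to_numpy()
# shape (n - 2, n_cols, 3): one 9-value block per valid t
windows = sliding_window_view(arr, window_shape=3, axis=0)
result = np.quantile(windows.reshape(len(windows), -1), 0.75, axis=1)
# result[1] == 0.97, matching the worked example for t = 1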
You can use numpy's ravel. You may still need a for loop:
for i in range(0, 3):
    print(df.iloc[i:i+3].values.ravel())
If your t steps in 3s (non-overlapping windows), you can use numpy's reshape function to create an n*9 array, as sketched below.
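For instance, a sketch of the non-overlapping case (my own; the total cell count must be divisible by 9):

import numpy as np

vals = df.to_numpy()[:3]           # first 3 rows -> one window of 9 values
blocks = vals.reshape(-1, 9)       # one row per window
np.percentile(blocks, 75, axis=1)  # -> array([1.12]), matching t = 0 above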

Calculate a rolling window weighted average on a Pandas column

I'm relatively new to python, and have been trying to calculate some simple rolling weighted averages across rows in a pandas data frame. I have a dataframe of observations df and a dataframe of weights w. I create a new dataframe to hold the inner-product between these two sets of values, dot.
As w has fewer rows than df, I use a for loop to calculate the weighted average row by row, over the leading rows equal to the length of w.
More clearly, my set-up is as follows:
import pandas as pd
df = pd.DataFrame([0,1,2,3,4,5,6,7,8], index = range(0,9))
w = pd.DataFrame([0.1,0.25,0.5], index = range(0,3))
dot = pd.DataFrame(0, columns = ['dot'], index = df.index)
for i in range(0, len(df)):
    dot.loc[i] = sum(df.iloc[max(1,(i-3)):i].values * w.iloc[-min(3,(i-1)):4].values)
I would expect the result to be as follows (i.e. when i = 4)
dot.loc[4] = sum(df.iloc[max(1,(4-3)):4].values * w.iloc[-min(3,(4-1)):4].values)
print dot.loc[4] #2.1
However, when running the for loop above, I receive the error:
ValueError: operands could not be broadcast together with shapes (0,1) (2,1)
Which is where I get confused - I think it must have to do with how I call i into iloc, as I don't receive shape errors when I manually calculate it, as in the example with 4 above. However, looking at other examples and documentation, I don't see why that's the case... Any help is appreciated.
Your first problem is that you are trying to multiply arrays of two different sizes. For example, when i=0 the different parts of your for loop return
df.iloc[max(1,(0-3)):0].values.shape
# (0,1)
w.iloc[-min(3,(0-1)):4].values.shape
# (2,1)
Which is exactly the error you are getting. The easiest way I can think of to make the arrays multipliable is to pad your dataframe with leading zeros, using concatenation.
df2 = pd.concat([pd.Series([0,0]),df], ignore_index=True)
df2
    0
0   0
1   0
2   0
3   1
4   2
5   3
6   4
7   5
8   6
9   7
10  8
You can now use your for loop (with some minor tweaking):
for i in range(len(df)):
    dot.loc[i] = sum(df2.iloc[max(0, i):i+3].values * w.values)
A nicer way might be the one JohnE suggested: use the rolling and apply functions built into pandas, thereby getting rid of your for loop:
import numpy as np
df2.rolling(3, min_periods=3).apply(lambda x: np.dot(x, w[0]))
       0
0    NaN
1    NaN
2   0.00
3   0.50
4   1.25
5   2.10
6   2.95
7   3.80
8   4.65
9   5.50
10  6.35
You can also drop the first two padding rows and reset the index
df2.rolling(3, min_periods=3).apply(lambda x: np.dot(x, w[0])).drop([0, 1]).reset_index(drop=True)
      0
0  0.00
1  0.50
2  1.25
3  2.10
4  2.95
5  3.80
6  4.65
7  5.50
8  6.35
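As a closing sketch (my own addition, not from the thread), the same weighted rolling sum can be computed without any padding via np.convolve, which simply drops the incomplete leading windows:

import numpy as np

# Convolution flips its kernel, so reverse w to get a weighted rolling sum
np.convolve(df[0].to_numpy(), w[0].to_numpy()[::-1], mode='valid')
# -> array([1.25, 2.1 , 2.95, 3.8 , 4.65, 5.5 , 6.35])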
