How to transform all values NOT equal to 0 into 1 with REGEX, efficiently - python

I have a column that contains 0s and dates like 12/02/19. I want to transform all dates into ones and multiply by the column Enrolls_F.
I'd prefer to use REGEX, but any other option is fine too. It is a large dataset; I tried a simple for loop and my kernel could not run it.
Data:
df = pd.DataFrame({ 'Enrolled_Date': ['0','2019/02/04','0','0','2019/02/04','0'] , 'Enrolls_F': ['1.11','1.11','0.222','1.11','5.22','1'] })
Attempts:
I'm trying to search for everything that starts with 2, replace it with 1, and multiply by Enrolls_F:
df_test = (df.replace({'Enrolled_Date': r'2.$'}, {'Enrolled_Date': '1'}, regex=True)) * df.Enrolls_F
# Nothing happens

IIUC, this should help you get the trouble sorted:
import pandas as pd
import numpy as np
df = pd.DataFrame({ 'Enrolled_Date': ['0','2019/02/04','0','0','2019/02/04','0'] , 'Enrolls_F': ['1.11','1.11','0.222','1.11','5.22','1'] })
# Map '0' to 0 and any date string to 1.
df['Enrolled_Date'] = np.where(df['Enrolled_Date'] == '0', 0, 1)
# Note: Enrolls_F holds strings, so 1 * '1.11' gives '1.11' and 0 * '1.11' gives ''.
df['multiplication_column'] = df['Enrolled_Date'] * df['Enrolls_F']
print(df)
Output:
   Enrolled_Date Enrolls_F multiplication_column
0              0      1.11
1              1      1.11                  1.11
2              0     0.222
3              0      1.11
4              1      5.22                  5.22
5              0         1

If you want the output as float, try this:
df.Enrolled_Date.ne('0').astype(int) * df.Enrolls_F.astype(float)
Out[212]:
0 0.00
1 1.11
2 0.00
3 0.00
4 5.22
5 0.00
dtype: float64
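If you still want the regex route from the original attempt, here is a minimal sketch (assuming every non-zero entry is a date string starting with 2, as in the sample data) that replaces the dates with '1' before casting and multiplying:
import pandas as pd

df = pd.DataFrame({'Enrolled_Date': ['0','2019/02/04','0','0','2019/02/04','0'],
                   'Enrolls_F': ['1.11','1.11','0.222','1.11','5.22','1']})

# Replace anything starting with 2 (a date) by '1', then cast both columns to float.
flags = df['Enrolled_Date'].replace(r'^2.*$', '1', regex=True).astype(float)
df['multiplication_column'] = flags * df['Enrolls_F'].astype(float)
print(df)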

Related

Populate column in dataframe based on iat values

lookup={'Tier':[1,2,3,4],'Terr.1':[0.88,0.83,1.04,1.33],'Terr.2':[0.78,0.82,0.91,1.15],'Terr.3':[0.92,0.98,1.09,1.33],'Terr.4':[1.39,1.49,1.66,1.96],'Terr.5':[1.17,1.24,1.39,1.68]}
df={'Tier':[1,1,2,2,3,2,4,4,4,1],'Territory':[1,3,4,5,4,4,2,1,1,2]}
df=pd.DataFrame(df)
lookup=pd.DataFrame(lookup)
lookup contains the lookup values, and df contains the data being fed into iat.
I get the correct values when I print(lookup.iat[tier,terr]). However, when I try to set those values in a new column, it endlessly runs, or in this simple test case just copies 1 value 10 times.
for i in df["Tier"]:
tier=i-1
for j in df["Territory"]:
terr=j
#print(lookup.iat[tier,terr])
df["Rate"]=lookup.iat[tier,terr]
Any thoughts on a possible better solution?
You can use apply() after some modification to your lookup dataframe:
lookup = lookup.rename(columns={i: i.split('.')[-1] for i in lookup.columns}).set_index('Tier')
lookup.columns = lookup.columns.astype(int)
df['Rate'] = df.apply(lambda x: lookup.loc[x['Tier'],x['Territory']], axis=1)
Returns:
   Tier  Territory  Rate
0     1          1  0.88
1     1          3  0.92
2     2          4  1.49
3     2          5  1.24
4     3          4  1.66
5     2          4  1.49
6     4          2  1.15
7     4          1  1.33
8     4          1  1.33
9     1          2  0.78
Once lookup is modified a bit, in the same way as #rahlf23 did, plus using stack, you can merge both dataframes:
df['Rate'] = df.merge(lookup.rename(columns={i: int(i.split('.')[-1])
                                             for i in lookup.columns if 'Terr' in i})
                            .set_index('Tier').stack()
                            .reset_index().rename(columns={'level_1': 'Territory'}),
                      how='left')[0]
If you have a big dataframe df, this should be faster than using apply and loc.
Also, if any pair (Tier, Territory) in df does not exist in lookup, this method won't throw an error.
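If you want to avoid apply entirely, here is a minimal sketch of a vectorized lookup with NumPy fancy indexing, assuming Tier and Territory are both 1-based and line up with the rows and columns of lookup:
import pandas as pd

lookup = pd.DataFrame({'Tier': [1, 2, 3, 4],
                       'Terr.1': [0.88, 0.83, 1.04, 1.33], 'Terr.2': [0.78, 0.82, 0.91, 1.15],
                       'Terr.3': [0.92, 0.98, 1.09, 1.33], 'Terr.4': [1.39, 1.49, 1.66, 1.96],
                       'Terr.5': [1.17, 1.24, 1.39, 1.68]})
df = pd.DataFrame({'Tier': [1, 1, 2, 2, 3, 2, 4, 4, 4, 1],
                   'Territory': [1, 3, 4, 5, 4, 4, 2, 1, 1, 2]})

rates = lookup.set_index('Tier').to_numpy()              # rows = Tier-1, columns = Territory-1
df['Rate'] = rates[df['Tier'] - 1, df['Territory'] - 1]  # pick one rate per row in one shot
print(df)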

Selecting random values from dataframe without replacement

I am following the answer from the link:
If I have a dataframe df as:
Month Day mnthShape
1 1 1.01
1 1 1.09
1 1 0.96
1 2 1.01
1 1 1.09
1 2 0.96
1 3 1.01
1 3 1.09
1 3 1.78
I want to get the following from df:
Month Day mnthShape
1 1 1.01
1 2 1.01
1 1 0.96
where the mnthShape values are selected at random from the index without replacement, i.e. if the query is df.loc[(1, 1)], it should look at all values for (1, 1) and randomly select one of them to display above. If another df.loc[(1, 1)] appears, it should select randomly again, but without replacement.
I know I need to modify the code to use the following:
apply(np.random.choice, replace=False)
But not sure how to do it.
Edit:
Every time I do df.loc[(1, 1)], it should give a new value without replacement. I intend to do df.loc[(1, 1)] multiple times. In the previous question, it was just one time.
If you're trying to sample from the dataset without replacement, it probably makes sense to do this all in one go, rather than iteratively pulling a sample from the dataset.
Pulling N samples from each month/day combo requires that there be sufficient combinations to pull N without replacement. But assuming this is true, you could write a function to sample N values from a subset of the data:
def select_n(subset, n=2):
    # Draw n distinct row positions from this group (without replacement).
    choices = np.random.choice(len(subset), size=n, replace=False)
    return (
        subset
        .mnthShape
        .iloc[choices]
        .reset_index(drop=True)
        .rename_axis('choice'))
To apply this across the whole dataset:
In [34]: df.groupby(['Month', 'Day']).apply(select_n)
Out[34]:
choice 0 1
Month Day
1 1 1.09 0.96
2 0.96 1.01
3 1.09 1.01
If you really need to pull these one at a time, you'll still need to generate the samples all at once to guarantee that they're drawn without replacement, but you could generate the sample indices separately from subsetting the data:
In [48]: indices = np.random.choice(3, size=2, replace=False)
In [49]: df[((df.Month == 1) & (df.Day == 2))].iloc[indices[0]]
Out[49]:
Month 1.00
Day 2.00
mnthShape 1.01
Name: 3, dtype: float64
In [50]: df[((df.Month == 1) & (df.Day == 2))].iloc[indices[1]]
Out[50]:
Month 1.00
Day 2.00
mnthShape 0.96
Name: 5, dtype: float64
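If you really do need one draw at a time, another option (a minimal sketch using the question's sample data, not part of the answer above) is to shuffle each (Month, Day) group once and then consume the values through iterators, so every draw is without replacement:
import pandas as pd

df = pd.DataFrame({'Month': [1] * 9,
                   'Day': [1, 1, 1, 2, 1, 2, 3, 3, 3],
                   'mnthShape': [1.01, 1.09, 0.96, 1.01, 1.09, 0.96, 1.01, 1.09, 1.78]})

# Shuffle every group once, then pop values one at a time; each pop is without replacement.
shuffled = {key: iter(grp['mnthShape'].sample(frac=1).tolist())
            for key, grp in df.groupby(['Month', 'Day'])}

print(next(shuffled[(1, 1)]))  # first draw for (1, 1)
print(next(shuffled[(1, 1)]))  # second draw, guaranteed to come from a different row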

Pandas - Using `.rolling()` on multiple columns

Consider a pandas DataFrame which looks like the one below
A B C
0 0.63 1.12 1.73
1 2.20 -2.16 -0.13
2 0.97 -0.68 1.09
3 -0.78 -1.22 0.96
4 -0.06 -0.02 2.18
I would like to use the function .rolling() to perform the following calculation for t = 0,1,2:
Select the rows from t to t+2
Take the 9 values contained in those 3 rows, from all the columns. Call this set S
Compute the 75th percentile of S (or other summary statistics about S)
For instance, for t = 1 we have
S = { 2.2 , -2.16, -0.13, 0.97, -0.68, 1.09, -0.78, -1.22, 0.96 } and the 75th percentile is 0.97.
I couldn't find a way to make it work with .rolling(), since it apparently takes each column separately. I'm now relying on a for loop, but it is really slow.
Do you have any suggestion for a more efficient approach?
One solution is to stack the data, then multiply your window size by the number of columns and slice the result by the number of columns. Also, since you want a forward-looking window, reverse the order of the stacked DataFrame:
wsize = 3
cols = len(df.columns)
df.stack(dropna=False)[::-1].rolling(window=wsize*cols).quantile(0.75)[cols-1::cols].reset_index(-1, drop=True).sort_index()
Output:
0 1.12
1 0.97
2 0.97
3 NaN
4 NaN
dtype: float64
In the case of many columns and a small window:
import pandas as pd
import numpy as np
wsize = 3
df2 = pd.concat([df.shift(-x) for x in range(wsize)], axis=1)
s_quant = df2.quantile(0.75, axis=1)
# Only necessary if you need to enforce sufficient data.
s_quant[df2.isnull().any(axis=1)] = np.nan
Output: s_quant
0 1.12
1 0.97
2 0.97
3 NaN
4 NaN
Name: 0.75, dtype: float64
You can use numpy ravel, though you may still have to use for loops:
for i in range(0, 3):
    print(df.iloc[i:i+3].values.ravel())
If your t steps in 3s, you can use the numpy reshape function to create an n*9 dataframe.
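For a fully vectorized take on the same forward-looking window, here is a minimal sketch using NumPy's sliding_window_view (assuming NumPy >= 1.20; rows without a complete window are simply dropped rather than set to NaN):
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [0.63, 2.20, 0.97, -0.78, -0.06],
                   'B': [1.12, -2.16, -0.68, -1.22, -0.02],
                   'C': [1.73, -0.13, 1.09, 0.96, 2.18]})

wsize = 3
vals = df.to_numpy()
# Forward-looking windows of wsize rows x all columns, flattened to 9 values each.
windows = np.lib.stride_tricks.sliding_window_view(vals, (wsize, vals.shape[1]))
windows = windows.reshape(len(df) - wsize + 1, -1)
print(pd.Series(np.percentile(windows, 75, axis=1), index=df.index[:len(windows)]))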

Multiply many columns pandas

I have a data frame like this, but with many more columns. I would like to multiply each pair of adjacent columns, put their product in a new column beside them called Sub_pro, and at the end have the total sum of all Sub_pro columns in a column called F_Pro, with precision reduced to 3 decimal places. I don't know how to get the Sub_pro columns. Below is my code.
import pandas as pd
df = pd.read_excel("C:dummy")
df['F_Pro'] = ("Result" * "Attribute").sum(axis=1)
df.round(decimals=3)
print (df)
Input
Id Result Attribute Result1 Attribute1
1 0.5621 0.56 536 0.005642
2 0.5221 0.5677 2.15 93
3 0.024564 5.23 6.489 8
4 11.564256 4.005 0.45556 5.25
5 0.6123 0.4798 0.6667 5.10
Desired output
id Result    Attribute Sub_Pro     Result1 Attribute1 Sub_pro1  F_Pro
1  0.5621    0.56      0.314776    536     0.005642   3.024112  3.338888
2  0.5221    0.5677    0.29639617  2.15    93         199.95    200.2463962
3  0.024564  5.23      0.12846972  6.489   8          51.912    52.04046972
4  11.564256 4.005     46.31484528 0.45556 5.25       2.39169   48.70653528
5  0.6123    0.4798    0.29378154  0.6667  5.1        3.40017   3.69395154
Because you have several columns with similar names, here is one way using filter. To see how it works on your df, you do df.filter(like='Result') and you get the columns whose name contains Result:
Result Result1
0 0.562100 536.00000
1 0.522100 2.15000
2 0.024564 6.48900
3 11.564256 0.45556
4 0.612300 0.66670
You can create an array containing the columns 'Sub_Pro':
import numpy as np
arr_sub_pro = np.round(df.filter(like='Result').values* df.filter(like='Attribute').values,3)
and you get the values of the sub_pro columns in arr_sub_pro:
array([[3.1500e-01, 3.0240e+00],
[2.9600e-01, 1.9995e+02],
[1.2800e-01, 5.1912e+01],
[4.6315e+01, 2.3920e+00],
[2.9400e-01, 3.4000e+00]])
Now you need to add them at the right position in the dataframe; I think a for loop is necessary:
for nb, col in zip(range(arr_sub_pro.shape[1]), df.filter(like='Attribute').columns):
    df.insert(df.columns.get_loc(col) + 1, 'Sub_pro{}'.format(nb), arr_sub_pro[:, nb])
Here I get the location of the column Attribute(nb) and insert the values from column nb of arr_sub_pro at the next position.
To add the column 'F_Pro', you can do:
df.insert(len(df.columns), 'F_Pro', arr_sub_pro.sum(axis=1))
the final df looks like:
Id Result Attribute Sub_pro0 Result1 Attribute1 Sub_pro1 \
0 1 0.562100 0.5600 0.315 536.00000 0.005642 3.024
1 2 0.522100 0.5677 0.296 2.15000 93.000000 199.950
2 3 0.024564 5.2300 0.128 6.48900 8.000000 51.912
3 4 11.564256 4.0050 46.315 0.45556 5.250000 2.392
4 5 0.612300 0.4798 0.294 0.66670 5.100000 3.400
F_Pro
0 3.339
1 200.246
2 52.040
3 48.707
4 3.694
import pandas as pd

src = "/opt/repos/pareto/test/stack/data.csv"
df = pd.read_csv(src)

def multiply(x):
    res = x.copy()
    keys_len = len(x)
    idx = 1
    # Walk the row two columns at a time: (Result, Attribute), (Result1, Attribute1), ...
    while idx + 1 < keys_len:
        left = x.iloc[idx]
        right = x.iloc[idx + 1]
        new_key = "sub_prod_{}".format(idx)
        # Multiply and round to three decimal places.
        res[new_key] = round(left * right, 3)
        idx = idx + 2
    return res

res_df = df.apply(lambda x: multiply(x), axis=1)
This solves the problem, but you then need to reorder the columns; you could also iterate over the keys instead of making a deep copy of the full row. I hope the code helps you.
Here's one way using NumPy and a dictionary comprehension:
# extract NumPy array for relevant columns
A = df.iloc[:, 1:].values
n = int(A.shape[1] / 2)
# calculate products and feed to pd.DataFrame
prods = pd.DataFrame({'Sub_Pro_' + str(i): np.prod(A[:, 2*i: 2*(i+1)], axis=1)
                      for i in range(n)})
# calculate sum of product rows
prods['F_Pro'] = prods.sum(axis=1)
# join to original dataframe
df = df.join(prods)
print(df)
Id Result Attribute Result1 Attribute1 Sub_Pro_0 Sub_Pro_1 \
0 1 0.562100 0.5600 536.00000 0.005642 0.314776 3.024112
1 2 0.522100 0.5677 2.15000 93.000000 0.296396 199.950000
2 3 0.024564 5.2300 6.48900 8.000000 0.128470 51.912000
3 4 11.564256 4.0050 0.45556 5.250000 46.314845 2.391690
4 5 0.612300 0.4798 0.66670 5.100000 0.293782 3.400170
F_Pro
0 3.338888
1 200.246396
2 52.040470
3 48.706535
4 3.693952
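As a compact variant of the same idea, here is a minimal sketch that pairs the value columns by position with step-2 slicing, using the sample input from the question:
import pandas as pd

df = pd.DataFrame({'Id': [1, 2, 3, 4, 5],
                   'Result': [0.5621, 0.5221, 0.024564, 11.564256, 0.6123],
                   'Attribute': [0.56, 0.5677, 5.23, 4.005, 0.4798],
                   'Result1': [536, 2.15, 6.489, 0.45556, 0.6667],
                   'Attribute1': [0.005642, 93, 8, 5.25, 5.10]})

# Even slots hold Result*, odd slots hold Attribute*; multiply them pairwise.
vals = df.iloc[:, 1:].to_numpy()
sub_pro = vals[:, 0::2] * vals[:, 1::2]
for i in range(sub_pro.shape[1]):
    df[f'Sub_pro{i}'] = sub_pro[:, i].round(3)
df['F_Pro'] = sub_pro.sum(axis=1).round(3)
print(df)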

Remove strongly correlated columns from DataFrame [duplicate]

This question already has answers here:
How to calculate correlation between all columns and remove highly correlated ones using pandas?
I have a DataFrame like this
dict_ = {'Date':['2018-01-01','2018-01-02','2018-01-03','2018-01-04','2018-01-05'],'Col1':[1,2,3,4,5],'Col2':[1.1,1.2,1.3,1.4,1.5],'Col3':[0.33,0.98,1.54,0.01,0.99]}
df = pd.DataFrame(dict_, columns=dict_.keys())
I then calculate the pearson correlation between the columns and filter out columns that are correlated above my threshold of 0.95
def trimm_correlated(df_in, threshold):
    df_corr = df_in.corr(method='pearson', min_periods=1)
    df_not_correlated = ~(df_corr.mask(np.eye(len(df_corr), dtype=bool)).abs() > threshold).any()
    un_corr_idx = df_not_correlated.loc[df_not_correlated[df_not_correlated.index] == True].index
    df_out = df_in[un_corr_idx]
    return df_out
which yields
uncorrelated_factors = trimm_correlated(df, 0.95)
print(uncorrelated_factors)
Col3
0 0.33
1 0.98
2 1.54
3 0.01
4 0.99
So far I am happy with the result, but I would like to keep one column from each correlated pair, so in the above example I would like to include Col1 or Col2, to get something like this:
Col1 Col3
0 1 0.33
1 2 0.98
2 3 1.54
3 4 0.01
4 5 0.99
Also on a side note, is there any further evaluation I can do to determine which of the correlated columns to keep?
thanks
You can use np.tril() instead of np.eye() for the mask:
def trimm_correlated(df_in, threshold):
    df_corr = df_in.corr(method='pearson', min_periods=1)
    df_not_correlated = ~(df_corr.mask(np.tril(np.ones([len(df_corr)]*2, dtype=bool))).abs() > threshold).any()
    un_corr_idx = df_not_correlated.loc[df_not_correlated[df_not_correlated.index] == True].index
    df_out = df_in[un_corr_idx]
    return df_out
Output:
Col1 Col3
0 1 0.33
1 2 0.98
2 3 1.54
3 4 0.01
4 5 0.99
Use this directly on the dataframe to sort out the top correlation values.
import pandas as pd
import numpy as np

def correl(X_train):
    cor = X_train.corr()
    corrm = np.corrcoef(X_train.transpose())
    corr = corrm - np.diagflat(corrm.diagonal())
    print("max corr:", corr.max(), ", min corr: ", corr.min())
    c1 = cor.stack().sort_values(ascending=False).drop_duplicates()
    high_cor = c1[c1.values != 1]
    # Change this value to get more correlation results.
    thresh = 0.9
    display(high_cor[high_cor > thresh])

correl(X)
output:
max corr: 0.9821068918331252 , min corr: -0.2993837739125243
count_rech_2g_8 sachet_2g_8 0.982107
count_rech_2g_7 sachet_2g_7 0.979492
count_rech_2g_6 sachet_2g_6 0.975892
arpu_8 total_rech_amt_8 0.946617
arpu_3g_8 arpu_2g_8 0.942428
isd_og_mou_8 isd_og_mou_7 0.938388
arpu_2g_6 arpu_3g_6 0.933158
isd_og_mou_6 isd_og_mou_8 0.931683
arpu_3g_7 arpu_2g_7 0.930460
total_rech_amt_6 arpu_6 0.930103
isd_og_mou_7 isd_og_mou_6 0.926571
arpu_7 total_rech_amt_7 0.926111
dtype: float64
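For completeness, here is a minimal sketch of the common upper-triangle pattern for dropping one column of each highly correlated pair, using the question's numeric columns and the 0.95 threshold:
import numpy as np
import pandas as pd

df = pd.DataFrame({'Col1': [1, 2, 3, 4, 5],
                   'Col2': [1.1, 1.2, 1.3, 1.4, 1.5],
                   'Col3': [0.33, 0.98, 1.54, 0.01, 0.99]})

corr = df.corr().abs()
# Keep only the upper triangle so each pair is checked exactly once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.95).any()]
print(df.drop(columns=to_drop))   # drops Col2, keeps Col1 and Col3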
