Python: how to map values between a matrix and a dataframe?

I have a dataframe net that contains the distance d between two locations A and B.
net =
   A  B    d
0  5  3  3.5
1  2  0  2.3
2  3  2  1.2
3  4  5  2.2
4  0  1  3.2
5  0  3  4.5
Then I have a symmetric matrix M that contains the distances between all pairs of locations, so:
M =
     0    1    2    3    4    5
0  0.0  3.2  2.3  4.5  1.7  5.2
1  3.2  0.0  2.1  0.7  3.9  3.8
2  2.3  2.1  0.0  1.2  1.5  4.7
3  4.5  0.7  1.2  0.0  3.2  3.5
4  1.7  3.9  1.5  3.2  0.0  2.2
5  5.2  3.8  4.7  3.5  2.2  0.0
I want to generate a new dataframe df1 that contains, for each row of net, two random, distinct locations A and B whose distance ds lies in the same unit interval as d, i.e. np.floor(d) < ds < np.floor(d) + 1.
This is what I am doing:
import numpy as np
import pandas as pd
from numpy.random import randint

H = []
W = []
for i in net.index:
    tmp = net['d'][i]
    ds = np.where((M > np.floor(tmp)) & (M < np.floor(tmp) + 1))
    size = len(ds[0])
    ind = randint(size)  # pick a random pair of locations with distance in the interval
    h = ds[0][ind]
    w = ds[1][ind]
    H.append(h)
    W.append(w)
df1 = pd.DataFrame()
df1['A'] = H
df1['B'] = W

Group the stacked M by floor division by 1 (i.e. by unit distance interval), then use that grouping to query and sample a pair for each row:
g = M.stack().index.to_series().groupby(M.stack() // 1)
net.d.apply(lambda x: pd.Series(g.get_group(x // 1).sample(1).iloc[0], list('AB')))
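
For reference, here is a rough end-to-end sketch of that approach, assuming M is a pandas DataFrame and net is the frame from the question; the sampled (row, column) pairs of M become the A/B columns of df1:

import numpy as np
import pandas as pd

net = pd.DataFrame({'A': [5, 2, 3, 4, 0, 0],
                    'B': [3, 0, 2, 5, 1, 3],
                    'd': [3.5, 2.3, 1.2, 2.2, 3.2, 4.5]})
M = pd.DataFrame([[0.0, 3.2, 2.3, 4.5, 1.7, 5.2],
                  [3.2, 0.0, 2.1, 0.7, 3.9, 3.8],
                  [2.3, 2.1, 0.0, 1.2, 1.5, 4.7],
                  [4.5, 0.7, 1.2, 0.0, 3.2, 3.5],
                  [1.7, 3.9, 1.5, 3.2, 0.0, 2.2],
                  [5.2, 3.8, 4.7, 3.5, 2.2, 0.0]])

# Stack M into a Series indexed by (row, col) pairs and group those pairs
# by the unit interval their distance falls into.
stacked = M.stack()
g = stacked.index.to_series().groupby(stacked // 1)

# For every distance in net, sample one (A, B) pair from the matching interval.
df1 = net.d.apply(lambda x: pd.Series(g.get_group(x // 1).sample(1).iloc[0],
                                      index=list('AB')))
print(df1)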

Related

Subtract number from dataframe depending on a corresponding number in the dataframe

I have a dataframe of 2 columns, say df:
year cases
1.1 12
1.2 14
1.4 19
1.6 23
1.6 14
2.1 26
2.5 27
2.7 35
3.1 21
3.3 24
3.8 28
and a list of false cases, say f
f = [3,4,8]
I want to write code so that for every +1 year, the respective 'false cases' value is subtracted from the number of cases.
So for example, whilst 1 < year < 2, I want: cases - 3
Then when 2 < year < 3, I want: cases - 4
and when 3 < year < 4, I want: cases - 8
and so on
so that a new column, say actual cases is:
year actual cases
1.1 9 (12-3)
1.2 11 (14-3)
1.4 16 (19-3)
1.6 20 (23-3)
1.6 11 (14-3)
2.1 22 (26-4)
2.5 23 (27-4)
2.7 31 (35-4)
3.1 13 (21-8)
3.3 16 (24-8)
3.8 20 (28-8)
I tried something along the lines of
for i in range(0,df[["year"]:
    if int(df[["year"][i]) > int(df[["year"][i+1]):
        df[["cases"][i] - f[i]
But this is clearly wrong and I am not sure what to do.
You can do something like this:
df['cases'] - (df['year']//1).astype(int).map({e:i for e, i in enumerate(f, 1)})
or
df['cases'] - pd.Series(f).reindex(df['year']//1-1).to_numpy()
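
A minimal reproduction of the map-based variant on the question's data, assuming f[0] applies to years in [1, 2), f[1] to [2, 3), and so on:

import pandas as pd

df = pd.DataFrame({'year':  [1.1, 1.2, 1.4, 1.6, 1.6, 2.1, 2.5, 2.7, 3.1, 3.3, 3.8],
                   'cases': [12, 14, 19, 23, 14, 26, 27, 35, 21, 24, 28]})
f = [3, 4, 8]

# map the integer part of each year (1, 2, 3, ...) to its false-case count
corrections = (df['year'] // 1).astype(int).map({e: i for e, i in enumerate(f, 1)})
df['actual cases'] = df['cases'] - corrections
print(df)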
I would do it like this:
f = [3, 4, 8]
for i, row in df.iterrows():
    if 1 <= row["year"] < 2:
        df.at[i, "case"] = row["case"] - f[0]
    elif 2 <= row["year"] < 3:
        df.at[i, "case"] = row["case"] - f[1]
    else:
        df.at[i, "case"] = row["case"] - f[2]
The original dataframe:
year case
0 1.0 8
1 1.1 5
2 1.2 17
3 1.3 1
4 1.4 12
The result:
year case
0 1.0 5
1 1.1 2
2 1.2 14
3 1.3 -2
4 1.4 9
Or you can do this:
df["year"] = df["year"].astype(int)
for i, j in enumerate(f, 1):
df["case"] - j
Something like this should work:
def my_fun(df, year, factor):
    mask = df['year'].astype(int) == year
    df.loc[mask, 'cases'] = df.loc[mask, 'cases'] - factor
    return df
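
A usage sketch, assuming the year/cases frame from the question and calling the function once per whole-year bucket:

f = [3, 4, 8]
for year, factor in enumerate(f, start=1):
    # year 1 -> subtract 3, year 2 -> subtract 4, year 3 -> subtract 8
    df = my_fun(df, year, factor)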

Operations on specific elements of a dataframe in Python

I'm trying to convert kilometer values in one column of a dataframe to mile values. I've tried various things and this is what I have now:
def km_dist(column, dist):
    length = len(column)
    for dist in zip(range(length), column):
        if (column == data["dist"] and dist in data.loc[(data["dist"] > 25)]):
            return dist / 5820
        else:
            return dist

data = data.apply(lambda x: km_dist(data["dist"], x), axis=1)
The dataset I'm working with looks something like this:
past_score dist income lab score gender race income_bucket plays_sports student_id lat long
0 8.091553 11.586920 67111.784934 0 7.384394 male H 3 0 1 0.0 0.0
1 8.091553 11.586920 67111.784934 0 7.384394 male H 3 0 1 0.0 0.0
2 7.924539 7858.126614 93442.563796 1 10.219626 F W 4 0 2 0.0 0.0
3 7.924539 7858.126614 93442.563796 1 10.219626 F W 4 0 2 0.0 0.0
4 7.726480 11.057883 96508.386987 0 8.544586 M W 4 0 3 0.0 0.0
With my code above, I'm trying to loop through all the "dist" values and if those values are in the right column ("data["dist"]") and greater than 25, divide those values by 5820 (the number of feet in a kilometer). More generally, I'd like to find a way to operate on specific elements of dataframes. I'm sure this is at least a somewhat common question, I just haven't been able to find an answer for it. If someone could point me towards somewhere with an answer, I would be just as happy.
Instead of your solution, filter the rows with a boolean mask and divide the dist column by 5820:
data.loc[data["dist"] > 25, 'dist'] /= 5820
It works the same as:
data.loc[data["dist"] > 25, 'dist'] = data.loc[data["dist"] > 25, 'dist'] / 5820
data.loc[data["dist"] > 25, 'dist'] /= 5820
print (data)
past_score dist income lab score gender race \
0 8.091553 11.586920 67111.784934 0 7.384394 male H
1 8.091553 11.586920 67111.784934 0 7.384394 male H
2 7.924539 1.350194 93442.563796 1 10.219626 F W
3 7.924539 1.350194 93442.563796 1 10.219626 F W
4 7.726480 11.057883 96508.386987 0 8.544586 M W
income_bucket plays_sports student_id lat long
0 3 0 1 0.0 0.0
1 3 0 1 0.0 0.0
2 4 0 2 0.0 0.0
3 4 0 2 0.0 0.0
4 4 0 3 0.0 0.0
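
For reference, the same conditional update can also be written in one expression with numpy.where, a sketch equivalent to the .loc assignment above under the same dist > 25 rule:

import numpy as np

# keep values <= 25 unchanged, divide the rest by 5820
data['dist'] = np.where(data['dist'] > 25, data['dist'] / 5820, data['dist'])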

Get n rows before specific value in pandas

Say I have the following dataframe:
import pandas as pd

data = {'val': [3.2, 2.4, -2.3, -4.9, 3.2, 2.4, -2.3, -4.9, 2.4, -2.3, -4.9],
        'label': [0, 2, 1, -1, 1, 2, -1, -1, 1, 1, -1]}
df = pd.DataFrame(data)
df
val label
0 3.2 0
1 2.4 2
2 -2.3 1
3 -4.9 -1
4 3.2 1
5 2.4 2
6 -2.3 -1
7 -4.9 -1
8 2.4 1
9 -2.3 1
10 -4.9 -1
I want to take the n (for example 2) rows before each -1 value in the label column. In the given df the first -1 appears at index 3, so we keep the 2 rows before it and drop index 3 itself; the next -1 appears at index 6, we again keep the 2 rows before it, and so on. The desired output is the following:
val label
1 2.4 2
2 -2.3 1
4 3.2 1
5 2.4 2
6 -2.3 -1
8 2.4 1
9 -2.3 1
Thanks for any ideas!
You can get the index values and then get the previous two row index values:
idx = df[df.label == -1].index
filtered_idx = (idx - 1).union(idx - 2)
filtered_idx = filtered_idx[filtered_idx >= 0]  # keep index 0 if a -1 sits at position 2
df_new = df.iloc[filtered_idx]
output:
val label
1 2.4 2
2 -2.3 1
4 3.2 1
5 2.4 2
6 -2.3 -1
8 2.4 1
9 -2.3 1
Speed comparison with a for-loop solution:
# create large df:
import numpy as np
df = pd.DataFrame(np.random.random((20000000, 2)), columns=["val", "label"])
df.loc[df.sample(frac=0.01).index, "label"] = -1

def vectorized_filter(df):
    idx = df[df.label == -1].index
    filtered_idx = (idx - 1).union(idx - 2)
    df_new = df.iloc[filtered_idx]
    return df_new

def loop_filter(df):
    filter = df.loc[df['label'] == -1].index
    req_idx = []
    for idx in filter:
        if idx == 0:
            continue
        elif idx == 1:
            req_idx.append(idx - 1)
        else:
            req_idx.append(idx - 2)
            req_idx.append(idx - 1)
    req_idx = list(set(req_idx))
    df2 = df.loc[df.index.isin(req_idx)]
    return df2

%timeit vectorized_filter(df)
%timeit loop_filter(df)
vectorized runs ~20x faster on my machine
Here's a solution:
new_df = pd.DataFrame()
markers = df[df.label.eq(-1)].index
for marker in markers:
    new_df = pd.concat([new_df, df[marker - 2:marker]])
new_df.reset_index().drop_duplicates().set_index("index")
Result:
val label
index
1 2.4 2
2 -2.3 1
4 3.2 1
5 2.4 2
6 -2.3 -1
8 2.4 1
9 -2.3 1
filter = df.loc[df['label'] == -1].index
req_idx = []
for idx in filter:
    if idx == 0:
        continue
    elif idx == 1:
        req_idx.append(idx - 1)
    else:
        req_idx.append(idx - 2)
        req_idx.append(idx - 1)
req_idx = list(set(req_idx))
df2 = df.loc[df.index.isin(req_idx)]
print(df2)
Output:
val label
1 2.4 2
2 -2.3 1
4 3.2 1
5 2.4 2
6 -2.3 -1
8 2.4 1
9 -2.3 1
This also works if a -1 label appears in one of the first two rows.
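
For an arbitrary window size n, the index arithmetic from the first answer can be generalized with a small helper (a sketch; take_before is a hypothetical name, and duplicates and negative positions are handled explicitly):

import numpy as np
import pandas as pd

def take_before(df, n=2, col="label", value=-1):
    """Return the n rows preceding each occurrence of `value` in `col`."""
    marker_pos = np.flatnonzero(df[col].to_numpy() == value)
    # positions 1..n before each marker, flattened, deduplicated, clipped to the frame
    before = (marker_pos[:, None] - np.arange(1, n + 1)).ravel()
    before = np.unique(before[before >= 0])
    return df.iloc[before]

# usage on the question's df:
# print(take_before(df, n=2))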

Convert string decimal numbers in column to float in a Pandas DataFrame

I have a dataset like this (with many more columns):
FLAG__DC SEXECOND CRM_TAUX
0 N M 0,9
1 N M 0,9
2 N M 1,2
3 O M 1
4 N M 1
5 N M 0,9
6 O M 1
7 N M 0,9
I want to convert the CRM_TAUX column to float. Please help!
I have tried this, but it doesn't work:
df['CRM_TAUX'] = df.CRM_TAUX.replace(',','.')
df['CRM_TAUX'] = df.CRM_TAUX.apply(pd.to_numeric)
This is the error I get (and many more):
Unable to parse string "1,2" at position 0
Thanks in advance!
Use str.replace
df.CRM_TAUX.str.replace(',' , '.')
Out[2246]:
0 0.9
1 0.9
2 1.2
3 1
4 1
5 0.9
6 1
7 0.9
Name: CRM_TAUX, dtype: object
Next, calling pd.to_numeric on it should work:
s = df.CRM_TAUX.str.replace(',' , '.')
df['CRM_TAUX'] = pd.to_numeric(s)
Out[2250]:
FLAG__DC SEXECOND CRM_TAUX
0 N M 0.9
1 N M 0.9
2 N M 1.2
3 O M 1.0
4 N M 1.0
5 N M 0.9
6 O M 1.0
7 N M 0.9
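
If the data is read from a CSV file in the first place, the comma decimals can also be handled at load time with read_csv's decimal parameter (a sketch; the file name and the semicolon separator are assumptions, not from the question):

import pandas as pd

# decimal=',' makes the parser treat commas as the decimal separator,
# so CRM_TAUX is parsed straight into float64.
df = pd.read_csv("data.csv", sep=";", decimal=",")  # hypothetical file name and separator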

How to groupby and then weight values according to size of each group

I would like to give each employee a pro rata share after a sale has been made. Therefore I first need to count the number of contacts per customer that lead to a sale and then split the reward among the employees involved in the process.
import pandas as pd
df = pd.DataFrame({"Cust_ID":[1,1,1,2,3,3], "Employee": ["A","B","B","C","B","A"], "Purchase":[0,0,1,1,0,1]})
df
Cust_ID Employee Purchase
0 1 A 0
1 1 B 0
2 1 B 1
3 2 C 1
4 3 B 0
5 3 A 1
When it takes 3 (or more) steps to reach the final sale (Cust_ID = 1), the rewards are distributed as 50%, 30% and 20% (and 0% for any earlier steps).
For 2 steps it is 70% and 30%; a single step gets 100%.
The result should look like this:
Cust_ID Employee Purchase Reward
0 1 A 0 0.2
1 1 B 0 0.3
2 1 B 1 0.5
3 2 C 1 1.0
4 3 B 0 0.3
5 3 A 1 0.7
I tried using df["Reward"] = df.groupby("Cust_ID").Purchase.transform("xxx") but this didn't produce the distributed reward.
Thanks in advance!
First let's augment the DataFrame:
df['Touch'] = df.groupby('Cust_ID').cumcount()
df['Touches'] = df.groupby('Cust_ID').Employee.count()[df.Cust_ID].values
df['Reward'] = 0.0
Now we have the basic setup:
Cust_ID Employee Purchase Touch Touches Reward
0 1 A 0 0 3 0.0
1 1 B 0 1 3 0.0
2 1 B 1 2 3 0.0
3 2 C 1 0 1 0.0
4 3 B 0 0 2 0.0
5 3 A 1 1 2 0.0
Finally, apply the reward rules:
df.loc[df.Touches == 1, 'Reward'] = 1.0
df.loc[(df.Touches == 2) & (df.Touch == 0), 'Reward'] = 0.3
df.loc[(df.Touches == 2) & (df.Touch == 1), 'Reward'] = 0.7
df.loc[(df.Touches == 3) & (df.Touch == 0), 'Reward'] = 0.2
df.loc[(df.Touches == 3) & (df.Touch == 1), 'Reward'] = 0.3
df.loc[(df.Touches == 3) & (df.Touch == 2), 'Reward'] = 0.5
This last part could be done more cleverly using np.select(). This is an exercise for the reader.
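
For reference, a sketch of what that np.select() version might look like, reusing the Touch/Touches columns built above (the default of 0.0 covers any touch positions not listed in the reward rules):

import numpy as np

conditions = [
    df.Touches == 1,
    (df.Touches == 2) & (df.Touch == 0),
    (df.Touches == 2) & (df.Touch == 1),
    (df.Touches == 3) & (df.Touch == 0),
    (df.Touches == 3) & (df.Touch == 1),
    (df.Touches == 3) & (df.Touch == 2),
]
choices = [1.0, 0.3, 0.7, 0.2, 0.3, 0.5]

# np.select picks the choice of the first matching condition per row.
df['Reward'] = np.select(conditions, choices, default=0.0)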
