I'm trying to convert kilometer values in one column of a dataframe to mile values. I've tried various things and this is what I have now:
def km_dist(column, dist):
    length = len(column)
    for dist in zip(range(length), column):
        if (column == data["dist"] and dist in data.loc[(data["dist"] > 25)]):
            return dist / 5820
        else:
            return dist

data = data.apply(lambda x: km_dist(data["dist"], x), axis=1)
The dataset I'm working with looks something like this:
past_score dist income lab score gender race income_bucket plays_sports student_id lat long
0 8.091553 11.586920 67111.784934 0 7.384394 male H 3 0 1 0.0 0.0
1 8.091553 11.586920 67111.784934 0 7.384394 male H 3 0 1 0.0 0.0
2 7.924539 7858.126614 93442.563796 1 10.219626 F W 4 0 2 0.0 0.0
3 7.924539 7858.126614 93442.563796 1 10.219626 F W 4 0 2 0.0 0.0
4 7.726480 11.057883 96508.386987 0 8.544586 M W 4 0 3 0.0 0.0
With my code above, I'm trying to loop through all the "dist" values and if those values are in the right column ("data["dist"]") and greater than 25, divide those values by 5820 (the number of feet in a kilometer). More generally, I'd like to find a way to operate on specific elements of dataframes. I'm sure this is at least a somewhat common question, I just haven't been able to find an answer for it. If someone could point me towards somewhere with an answer, I would be just as happy.
Instead of your solution, filter the rows with a boolean mask and divide the dist column by 5820 in place:
data.loc[data["dist"] > 25, 'dist'] /= 5820
This works the same as:
data.loc[data["dist"] > 25, 'dist'] = data.loc[data["dist"] > 25, 'dist'] / 5820
print (data)
past_score dist income lab score gender race \
0 8.091553 11.586920 67111.784934 0 7.384394 male H
1 8.091553 11.586920 67111.784934 0 7.384394 male H
2 7.924539 1.350194 93442.563796 1 10.219626 F W
3 7.924539 1.350194 93442.563796 1 10.219626 F W
4 7.726480 11.057883 96508.386987 0 8.544586 M W
income_bucket plays_sports student_id lat long
0 3 0 1 0.0 0.0
1 3 0 1 0.0 0.0
2 4 0 2 0.0 0.0
3 4 0 2 0.0 0.0
4 4 0 3 0.0 0.0
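An equivalent, non-mutating sketch of the same fix uses Series.where (keeping values at or below 25 and dividing the rest); the toy dist values here are taken from the sample above:

```python
import pandas as pd

df = pd.DataFrame({"dist": [11.586920, 7858.126614, 11.057883]})
# where() keeps values where the condition is True, replaces the rest
df["dist"] = df["dist"].where(df["dist"] <= 25, df["dist"] / 5820)
print(df["dist"].round(6).tolist())  # [11.58692, 1.350194, 11.057883]
```

The loc-based version mutates in place; where() returns a new Series, which can be handy when you want to keep the original column around.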
I have a pandas DataFrame:
import numpy as np
import pandas as pd

np.random.seed(0)
df = pd.DataFrame({'Close': np.random.uniform(0, 100, size=10)})
lbound, ubound = 0, 1
change = df["Close"].diff()
df["Change"] = change
df["Result"] = np.select([np.isclose(change, 1) | np.isclose(change, 0) | np.isclose(change, -1),
                          # The other conditions
                          (change > 0) & (change > ubound),
                          (change < 0) & (change < lbound),
                          change.between(lbound, ubound)],
                         [0, 1, -1, 0])
Close Change Result
0 54.881350 NaN 0
1 71.518937 16.637586 1
2 60.276338 -11.242599 -1
3 54.488318 -5.788019 -1
4 42.365480 -12.122838 -1
5 64.589411 22.223931 1
6 43.758721 -20.830690 -1
7 89.177300 45.418579 1
8 96.366276 7.188976 1
9 38.344152 -58.022124 -1
Problem statement - I want the majority vote over rows 1,2,3,4 of the Result column assigned to index 0, rows 2,3,4,5 assigned to index 1, and so on for all subsequent indexes.
I tried:
df['Voting'] = df['Result'].rolling(window = 4,min_periods=1).apply(lambda x: x.mode()[0]).shift()
But this doesn't give the result I intend: it builds each rolling window from the current row and the 3 rows before it, then applies the mode function.
Close Change Result Voting
0 54.881350 NaN 0 NaN
1 71.518937 16.637586 1 0.0
2 60.276338 -11.242599 -1 0.0
3 54.488318 -5.788019 -1 -1.0
4 42.365480 -12.122838 -1 -1.0
5 64.589411 22.223931 1 -1.0
6 43.758721 -20.830690 -1 -1.0
7 89.177300 45.418579 1 -1.0
8 96.366276 7.188976 1 -1.0
9 38.344152 -58.022124 -1 1.0
Result I intend - a rolling window of 4 (rows 1,2,3,4) should be set and the mode function applied, with the result assigned to index 0; then the next rolling window (rows 2,3,4,5) with its result assigned to index 1, and so on.
You have to reverse your DataFrame, apply the rolling mode, and then shift by 1 (because you don't want the current index included in its own result):
majority = lambda x: 0 if len((m := x.mode())) > 1 else m[0]
df['Voting'] = (df[::-1].rolling(4, min_periods=1)['Result']
.apply(majority).shift())
print(df)
# Output
Close Change Result Voting
0 54.881350 NaN 0 -1.0
1 71.518937 16.637586 1 -1.0
2 60.276338 -11.242599 -1 -1.0
3 54.488318 -5.788019 -1 0.0
4 42.365480 -12.122838 -1 1.0
5 64.589411 22.223931 1 0.0
6 43.758721 -20.830690 -1 1.0
7 89.177300 45.418579 1 0.0
8 96.366276 7.188976 1 -1.0
9 38.344152 -58.022124 -1 NaN
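As a quick sanity check (a sketch using only the Result column shown above), the vote for index 0 should be the mode of rows 1 through 4:

```python
import pandas as pd

result = pd.Series([0, 1, -1, -1, -1, 1, -1, 1, 1, -1])  # the Result column
window = result.iloc[1:5]   # rows 1,2,3,4 vote for index 0
print(window.mode()[0])     # -1, matching Voting at index 0 above
```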
I want to replace all missing values in the dataset with the average of the two nearest neighbours, except for the first and last cells, and for cases where a neighbour is 0 (those I fix manually). I coded this and it works, but the solution is not very smart. Is there a faster way to do it? Is the interpolate method suitable for this? I'm not quite sure how it works.
Input:
0 1 2 3 4 5
0 0.0 1596.0 1578.0 1567.0 1580.0 1649.0
1 1554.0 1506.0 0.0 1466.0 1469.0 1503.0
2 1588.0 1510.0 1495.0 1485.0 1489.0 0.0
3 1592.0 0.0 0.0 1571.0 1647.0 0.0
Output:
0 1 2 3 4 5
0 0.0 1596.0 1578.0 1567.0 1580.0 1649.0
1 1554.0 1506.0 1486.0 1466.0 1469.0 1503.0
2 1588.0 1510.0 1495.0 1485.0 1489.0 1540.5
3 1592.0 0.0 0.0 1571.0 1647.0 0.0
Code:
data_len = len(df)
first_col = str(df.columns[0])
last_col = str(df.columns[len(df.columns) - 1])
d = df.apply(lambda s: pd.to_numeric(s, errors="coerce"))
m = d.eq(0) | d.isna()
s = m.stack()
missing = s[s].index.tolist()  # indices of missing values; avoid shadowing built-in `list`
for el in missing:
    if el == ('0', first_col) or el == (str(data_len - 1), last_col):
        continue
    # `nxt` instead of `next` to avoid shadowing the built-in
    nxt = df.at[str(int(el[0]) + 1), first_col] if el[1] == last_col else df.at[el[0], str(int(el[1]) + 1)]
    prev = df.at[str(int(el[0]) - 1), last_col] if el[1] == first_col else df.at[el[0], str(int(el[1]) - 1)]
    if prev == 0 or nxt == 0:
        continue
    df.at[el[0], el[1]] = (prev + nxt) / 2
JSON of example:
{"0":{"0":0.0,"1":1554.0,"2":1588.0,"3":0.0},"1":{"0":1596.0,"1":1506.0,"2":1510.0,"3":0.0},"2":{"0":1578.0,"1":0.0,"2":1495.0,"3":1561.0},"3":{"0":1567.0,"1":1466.0,"2":1485.0,"3":1571.0},"4":{"0":1580.0,"1":1469.0,"2":1489.0,"3":1647.0},"5":{"0":1649.0,"1":1503.0,"2":0.0,"3":0.0}}
Here's one approach using shift to average the neighbour's values and slice assigning back to the dataframe:
m = df==0
r = (df.shift(axis=1)+df.shift(-1,axis=1))/2
df.iloc[1:-1,1:-1] = df.mask(m,r)
print(df)
0 1 2 3 4 5
0 0.0 1596.0 1578.0 1567.0 1580.0 1649.0
1 1554.0 1506.0 1486.0 1466.0 1469.0 1503.0
2 1588.0 1510.0 1495.0 1485.0 1489.0 0.0
3 0.0 0.0 1561.0 1571.0 1647.0 0.0
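To answer the interpolate question: a sketch of that approach is to turn the zeros into NaN and interpolate row-wise. Note this is not an exact drop-in for the manual rule above: linear interpolation also bridges runs of several missing cells instead of skipping them, so results differ where two zeros are adjacent.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[0.0, 1596.0, 1578.0, 1567.0, 1580.0, 1649.0],
                   [1554.0, 1506.0, 0.0, 1466.0, 1469.0, 1503.0]])

# Replace 0 with NaN, then interpolate along each row; limit_area="inside"
# leaves leading/trailing gaps (like the first cell of row 0) untouched.
filled = df.replace(0, np.nan).interpolate(axis=1, limit_area="inside")
print(filled.iloc[1, 2])  # (1506.0 + 1466.0) / 2 = 1486.0
```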
I have two columns which I want to compare every nth row. When it reaches an nth row, it should compare the two values and put the result of the if statement in a new column.
When I tried the enumerate function, it always ends up in the true part of the if statement. Somehow this piece of the code is always true:
if (count % 3)== 0:
for count, factors in enumerate(df.index):
    if (count % 3) == 0:  # every 3rd row
        df['Signal'] = np.where(df['Wind Ch'] >= df['Rain Ch'], '1', '-1')
    else:
        df['Signal'] = 0
In the 'Signal' column I am expecting '1' or '-1' every 3rd row and '0' on all other rows. However, I am getting '1' or '-1' on every row.
Now I am getting:
Date Wind CH Rain CH Signal
0 5/10/2005 -1.85% -3.79% 1
1 5/11/2005 1.51% -1.66% 1
2 5/12/2005 0.37% 0.88% -1
3 5/13/2005 -0.81% 3.83% -1
4 5/14/2005 -0.28% 4.05% -1
5 5/15/2005 3.93% 1.79% 1
6 5/16/2005 6.23% 0.94% 1
7 5/17/2005 -0.08% 4.43% -1
8 5/18/2005 -2.69% 4.02% -1
9 5/19/2005 6.40% 1.33% 1
10 5/20/2005 -3.41% 2.38% -1
11 5/21/2005 3.27% 5.46% -1
12 5/22/2005 -4.40% -4.15% -1
13 5/23/2005 3.27% 4.48% -1
But I want to get:
Date Wind CH Rain CH Signal
0 5/10/2005 -1.85% -3.79% 0.0
1 5/11/2005 1.51% -1.66% 0.0
2 5/12/2005 0.37% 0.88% -1.0
3 5/13/2005 -0.81% 3.83% 0.0
4 5/14/2005 -0.28% 4.05% 0.0
5 5/15/2005 3.93% 1.79% 1.0
6 5/16/2005 6.23% 0.94% 0.0
7 5/17/2005 -0.08% 4.43% 0.0
8 5/18/2005 -2.69% 4.02% -1.0
9 5/19/2005 6.40% 1.33% 0.0
10 5/20/2005 -3.41% 2.38% 0.0
11 5/21/2005 3.27% 5.46% -1.0
12 5/22/2005 -4.40% -4.15% 0.0
13 5/23/2005 3.27% 4.48% 0.0
What am I missing here?
You can go about it like this, using np.vectorize to avoid loops:
import numpy as np
def calcSignal(x, y, i):
return 0 if (i + 1) % 3 != 0 else 1 if x >= y else -1
func = np.vectorize(calcSignal)
df['Signal'] = func(df['Wind CH'], df['Rain CH'], df.index)
df
Date Wind CH Rain CH Signal
0 5/10/2005 -1.85% -3.79% 0
1 5/11/2005 1.51% -1.66% 0
2 5/12/2005 0.37% 0.88% -1
3 5/13/2005 -0.81% 3.83% 0
4 5/14/2005 -0.28% 4.05% 0
5 5/15/2005 3.93% 1.79% 1
6 5/16/2005 6.23% 0.94% 0
7 5/17/2005 -0.08% 4.43% 0
8 5/18/2005 -2.69% 4.02% -1
9 5/19/2005 6.40% 1.33% 0
10 5/20/2005 -3.41% 2.38% 0
11 5/21/2005 3.27% 5.46% -1
12 5/22/2005 -4.40% -4.15% 0
13 5/23/2005 3.27% 4.48% 0
In general you don't want to loop over pandas objects. This case is no exception.
In [12]: df = pd.DataFrame({'x': [1,2,3], 'y': [10, 20, 30]})
In [13]: df
Out[13]:
x y
0 1 10
1 2 20
2 3 30
In [14]: df.loc[df.index % 2 == 0, 'x'] = 5
In [15]: df
Out[15]:
x y
0 5 10
1 2 20
2 5 30
There is no need to use the enumerate function as I see it. Also, your logic is faulty: you are rewriting the complete column in every iteration of the loop instead of just the ith row. You could simply do this:
for count in range(len(df.index)):
    if (count % 3) == 0:  # every 3rd row
        df['Signal'].iloc[count] = np.where(df['Wind Ch'].iloc[count] >= df['Rain Ch'].iloc[count], '1', '-1')
    else:
        df['Signal'].iloc[count] = 0
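A fully vectorized sketch of the same idea (assuming the Wind CH / Rain CH columns have already been converted from '%' strings to floats, and using (i + 1) % 3 == 0 so the signal lands on rows 2, 5, 8, ... as in the expected output):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Wind CH': [-1.85, 1.51, 0.37, -0.81, -0.28, 3.93],
                   'Rain CH': [-3.79, -1.66, 0.88, 3.83, 4.05, 1.79]})

every_3rd = (df.index + 1) % 3 == 0
df['Signal'] = np.where(every_3rd,
                        np.where(df['Wind CH'] >= df['Rain CH'], 1, -1),
                        0)
print(df['Signal'].tolist())  # [0, 0, -1, 0, 0, 1]
```

This avoids both the Python-level loop and the chained-assignment pitfalls of writing through `.iloc` on a column slice.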
I would like to give each employee a pro rata share after a sale has been made. Therefore I first need to count the number of contacts per customer that led to a sale and then split the reward among the employees involved in the process.
import pandas as pd
df = pd.DataFrame({"Cust_ID":[1,1,1,2,3,3], "Employee": ["A","B","B","C","B","A"], "Purchase":[0,0,1,1,0,1]})
df
Cust_ID Employee Purchase
0 1 A 0
1 1 B 0
2 1 B 1
3 2 C 1
4 3 B 0
5 3 A 1
When it takes 3 (or more) steps to reach the final sale (Cust_ID = 1), the reward shall be distributed as 50%, 30% and 20% (and 0% for any earlier steps).
For 2 steps: 70% and 30%. One step = 100%.
The result should look like this:
Cust_ID Employee Purchase Reward
0 1 A 0 0.2
1 1 B 0 0.3
2 1 B 1 0.5
3 2 C 1 1.0
4 3 B 0 0.3
5 3 A 1 0.7
I tried using df["Reward"] = df.groupby("Cust_ID").Purchase.transform("xxx"), but this didn't produce the distributed reward.
Thanks in advance!
First let's augment the DataFrame:
df['Touch'] = df.groupby('Cust_ID').cumcount()
df['Touches'] = df.groupby('Cust_ID').Employee.count()[df.Cust_ID].values
df['Reward'] = 0.0
Now we have the basic setup:
Cust_ID Employee Purchase Touch Touches Reward
0 1 A 0 0 3 0.0
1 1 B 0 1 3 0.0
2 1 B 1 2 3 0.0
3 2 C 1 0 1 0.0
4 3 B 0 0 2 0.0
5 3 A 1 1 2 0.0
Finally, apply the reward rules:
df.loc[df.Touches == 1, 'Reward'] = 1.0
df.loc[(df.Touches == 2) & (df.Touch == 0), 'Reward'] = 0.3
df.loc[(df.Touches == 2) & (df.Touch == 1), 'Reward'] = 0.7
df.loc[(df.Touches == 3) & (df.Touch == 0), 'Reward'] = 0.2
df.loc[(df.Touches == 3) & (df.Touch == 1), 'Reward'] = 0.3
df.loc[(df.Touches == 3) & (df.Touch == 2), 'Reward'] = 0.5
This last part could be done more cleverly using np.select(). This is an exercise for the reader.
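A sketch of that np.select() version (same Touch/Touches setup and the same reward rules as above):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Cust_ID": [1, 1, 1, 2, 3, 3],
                   "Employee": ["A", "B", "B", "C", "B", "A"],
                   "Purchase": [0, 0, 1, 1, 0, 1]})
df['Touch'] = df.groupby('Cust_ID').cumcount()
df['Touches'] = df.groupby('Cust_ID').Employee.count()[df.Cust_ID].values

# One condition per (Touches, Touch) combination, paired with its reward share
conditions = [df.Touches == 1,
              (df.Touches == 2) & (df.Touch == 0),
              (df.Touches == 2) & (df.Touch == 1),
              (df.Touches == 3) & (df.Touch == 0),
              (df.Touches == 3) & (df.Touch == 1),
              (df.Touches == 3) & (df.Touch == 2)]
rewards = [1.0, 0.3, 0.7, 0.2, 0.3, 0.5]
df['Reward'] = np.select(conditions, rewards, default=0.0)
print(df['Reward'].tolist())  # [0.2, 0.3, 0.5, 1.0, 0.3, 0.7]
```

The default=0.0 covers touches earlier than the last three, matching the "(0%..)" rule in the question.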
I am calculating Z score and P value for different sub segments within a data frame.
The data frame has two columns, here are the top 5 values in my data frame:
df[["Engagement_score", "Performance"]].head()
Engagement_score Performance
0 6 0.0
1 5 0.0
2 7 66.3
3 3 0.0
4 11 0.0
Here are the distributions of the engagement score and of performance: [histograms omitted]
I am grouping my dataframe by engagement score and then I calculate these three statistics for those groups:
1) Average performance score (sub_average) and the number of values within that group (sub_bookings)
2) Average performance score for the rest of the groups (rest_average) and the number of values in the rest of the groups (rest_bookings)
The overall performance score and overall bookings are calculated for the whole data frame.
Here's my code to do that.
def stats_comparison(i):
    df.groupby(i)['Performance'].agg({
        'average': 'mean',
        'bookings': 'count'
    }).reset_index()
    cat = df.groupby(i)['Performance'].agg({
        'sub_average': 'mean',
        'sub_bookings': 'count'
    }).reset_index()
    cat['overall_average'] = df['Performance'].mean()
    cat['overall_bookings'] = df['Performance'].count()
    cat['rest_bookings'] = cat['overall_bookings'] - cat['sub_bookings']
    cat['rest_average'] = (cat['overall_bookings'] * cat['overall_average']
                           - cat['sub_bookings'] * cat['sub_average']) / cat['rest_bookings']
    cat['z_score'] = (cat['sub_average'] - cat['rest_average']) / \
        np.sqrt(cat['overall_average'] * (1 - cat['overall_average'])
                * (1 / cat['sub_bookings'] + 1 / cat['rest_bookings']))
    cat['prob'] = np.around(stats.norm.cdf(cat.z_score), decimals=10)  # this is the p value
    cat['significant'] = [(lambda x: 1 if x > 0.9 else -1 if x < 0.1 else 0)(i) for i in cat['prob']]
    # if the p value is less than 0.1 then I can confidently say that the 2 samples are different
    print(cat)
stats_comparison('Engagement_score')
I get the following output when I execute my code:
Engagement_score sub_average sub_bookings overall_average \
0 3 57.281118 1234 34.405373
1 4 56.165374 722 34.405373
2 5 52.896404 890 34.405373
3 6 50.275880 966 34.405373
4 7 43.475344 1018 34.405373
5 8 37.693290 1222 34.405373
6 9 30.418053 1695 34.405373
7 10 16.458142 2874 34.405373
8 11 25.604145 1375 34.405373
9 12 10.910013 789 34.405373
overall_bookings rest_bookings rest_average z_score prob significant
0 12785 11551 31.961544 NaN NaN 0
1 12785 12063 33.102984 NaN NaN 0
2 12785 11895 33.021850 NaN NaN 0
3 12785 11819 33.108233 NaN NaN 0
4 12785 11767 33.620702 NaN NaN 0
5 12785 11563 34.057900 NaN NaN 0
6 12785 11090 35.014797 NaN NaN 0
7 12785 9911 39.609727 NaN NaN 0
8 12785 11410 35.465995 NaN NaN 0
9 12785 11996 35.950709 NaN NaN 0
I don't know why I am getting NaN values in the z_score and prob columns. There are no negative values in my data set.
I also get the following warning when I run the code in Jupyter Notebook:
C:\Users\User\Anaconda3\lib\site-packages\ipykernel_launcher.py:4: FutureWarning: using a dict on a Series for aggregation
is deprecated and will be removed in a future version
after removing the cwd from sys.path.
C:\Users\User\Anaconda3\lib\site-packages\ipykernel_launcher.py:8: FutureWarning: using a dict on a Series for aggregation
is deprecated and will be removed in a future version
C:\Users\User\Anaconda3\lib\site-packages\ipykernel_launcher.py:15: RuntimeWarning: invalid value encountered in sqrt
from ipykernel import kernelapp as app
C:\Users\User\Anaconda3\lib\site-packages\scipy\stats\_distn_infrastructure.py:879: RuntimeWarning: invalid value encountered in greater
return (self.a < x) & (x < self.b)
C:\Users\User\Anaconda3\lib\site-packages\scipy\stats\_distn_infrastructure.py:879: RuntimeWarning: invalid value encountered in less
return (self.a < x) & (x < self.b)
C:\Users\User\Anaconda3\lib\site-packages\scipy\stats\_distn_infrastructure.py:1738: RuntimeWarning: invalid value encountered in greater_equal
cond2 = (x >= self.b) & cond0
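The RuntimeWarning: invalid value encountered in sqrt is the key clue. As a sketch of the diagnosis (not part of the original post): the term overall_average * (1 - overall_average) comes from the z-test for proportions, where the mean must lie between 0 and 1. Here overall_average is about 34.4, so the value under the square root is negative, np.sqrt returns NaN, and the NaN propagates into z_score and prob:

```python
import numpy as np

overall_average = 34.405373                        # mean Performance from the output above
radicand = overall_average * (1 - overall_average)
print(radicand < 0)       # True: p * (1 - p) is only non-negative for 0 <= p <= 1
print(np.sqrt(radicand))  # nan (with the same "invalid value encountered in sqrt" warning)
```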