I'm trying to convert kilometer values in one column of a dataframe to mile values. I've tried various things and this is what I have now:
def km_dist(column, dist):
    length = len(column)
    for dist in zip(range(length), column):
        if (column == data["dist"] and dist in data.loc[(data["dist"] > 25)]):
            return dist / 5820
        else:
            return dist

data = data.apply(lambda x: km_dist(data["dist"], x), axis=1)
The dataset I'm working with looks something like this:
past_score dist income lab score gender race income_bucket plays_sports student_id lat long
0 8.091553 11.586920 67111.784934 0 7.384394 male H 3 0 1 0.0 0.0
1 8.091553 11.586920 67111.784934 0 7.384394 male H 3 0 1 0.0 0.0
2 7.924539 7858.126614 93442.563796 1 10.219626 F W 4 0 2 0.0 0.0
3 7.924539 7858.126614 93442.563796 1 10.219626 F W 4 0 2 0.0 0.0
4 7.726480 11.057883 96508.386987 0 8.544586 M W 4 0 3 0.0 0.0
With my code above, I'm trying to loop through all the "dist" values and if those values are in the right column ("data["dist"]") and greater than 25, divide those values by 5820 (the number of feet in a kilometer). More generally, I'd like to find a way to operate on specific elements of dataframes. I'm sure this is at least a somewhat common question, I just haven't been able to find an answer for it. If someone could point me towards somewhere with an answer, I would be just as happy.
Instead of your solution, filter the rows with a boolean mask and divide the dist column by 5820 in place:
data.loc[data["dist"] > 25, 'dist'] /= 5820
This works the same as:
data.loc[data["dist"] > 25, 'dist'] = data.loc[data["dist"] > 25, 'dist'] / 5820
print (data)
past_score dist income lab score gender race \
0 8.091553 11.586920 67111.784934 0 7.384394 male H
1 8.091553 11.586920 67111.784934 0 7.384394 male H
2 7.924539 1.350194 93442.563796 1 10.219626 F W
3 7.924539 1.350194 93442.563796 1 10.219626 F W
4 7.726480 11.057883 96508.386987 0 8.544586 M W
income_bucket plays_sports student_id lat long
0 3 0 1 0.0 0.0
1 3 0 1 0.0 0.0
2 4 0 2 0.0 0.0
3 4 0 2 0.0 0.0
4 4 0 3 0.0 0.0
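An equivalent, non-mutating sketch of the same fix uses Series.where (keeping values at or below 25 and dividing the rest); the toy dist values here are taken from the sample above:

```python
import pandas as pd

df = pd.DataFrame({"dist": [11.586920, 7858.126614, 11.057883]})
# where() keeps values where the condition is True, replaces the rest
df["dist"] = df["dist"].where(df["dist"] <= 25, df["dist"] / 5820)
print(df["dist"].round(6).tolist())  # [11.58692, 1.350194, 11.057883]
```

The loc-based version mutates in place; where() returns a new Series, which can be handy when you want to keep the original column around.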
I have a pandas DataFrame:
import numpy as np
import pandas as pd

np.random.seed(0)
df = pd.DataFrame({'Close': np.random.uniform(0, 100, size=10)})
lbound, ubound = 0, 1
change = df["Close"].diff()
df["Change"] = change
df["Result"] = np.select([np.isclose(change, 1) | np.isclose(change, 0) | np.isclose(change, -1),
                          # The other conditions
                          (change > 0) & (change > ubound),
                          (change < 0) & (change < lbound),
                          change.between(lbound, ubound)],
                         [0, 1, -1, 0])
Close Change Result
0 54.881350 NaN 0
1 71.518937 16.637586 1
2 60.276338 -11.242599 -1
3 54.488318 -5.788019 -1
4 42.365480 -12.122838 -1
5 64.589411 22.223931 1
6 43.758721 -20.830690 -1
7 89.177300 45.418579 1
8 96.366276 7.188976 1
9 38.344152 -58.022124 -1
Problem statement - I want the majority vote over rows 1,2,3,4 of the Result column assigned to index 0, rows 2,3,4,5 assigned to index 1, and so on for all subsequent indexes.
I tried:
df['Voting'] = df['Result'].rolling(window = 4,min_periods=1).apply(lambda x: x.mode()[0]).shift()
But this doesn't give the result I intend: it builds each rolling window from the current row and the 3 rows before it, then applies the mode function.
Close Change Result Voting
0 54.881350 NaN 0 NaN
1 71.518937 16.637586 1 0.0
2 60.276338 -11.242599 -1 0.0
3 54.488318 -5.788019 -1 -1.0
4 42.365480 -12.122838 -1 -1.0
5 64.589411 22.223931 1 -1.0
6 43.758721 -20.830690 -1 -1.0
7 89.177300 45.418579 1 -1.0
8 96.366276 7.188976 1 -1.0
9 38.344152 -58.022124 -1 1.0
Result I intend - a rolling window of 4 (rows 1,2,3,4) should be set and the mode function applied, with the result assigned to index 0; then the next rolling window (rows 2,3,4,5) with its result assigned to index 1, and so on.
You have to reverse your DataFrame, apply the rolling mode, and then shift by 1 (because you don't want the current index included in its own result):
majority = lambda x: 0 if len((m := x.mode())) > 1 else m[0]
df['Voting'] = (df[::-1].rolling(4, min_periods=1)['Result']
.apply(majority).shift())
print(df)
# Output
Close Change Result Voting
0 54.881350 NaN 0 -1.0
1 71.518937 16.637586 1 -1.0
2 60.276338 -11.242599 -1 -1.0
3 54.488318 -5.788019 -1 0.0
4 42.365480 -12.122838 -1 1.0
5 64.589411 22.223931 1 0.0
6 43.758721 -20.830690 -1 1.0
7 89.177300 45.418579 1 0.0
8 96.366276 7.188976 1 -1.0
9 38.344152 -58.022124 -1 NaN
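As a quick sanity check (a sketch using only the Result column shown above), the vote for index 0 should be the mode of rows 1 through 4:

```python
import pandas as pd

result = pd.Series([0, 1, -1, -1, -1, 1, -1, 1, 1, -1])  # the Result column
window = result.iloc[1:5]   # rows 1,2,3,4 vote for index 0
print(window.mode()[0])     # -1, matching Voting at index 0 above
```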
I want to replace all missing values in the dataset with the average of the two nearest neighbours, except for the first and last cells, and for cases where a neighbour is 0 (those I fix manually). I coded this and it works, but the solution is not very smart. Is there a faster way to do it? Is the interpolate method suitable for this? I'm not quite sure how it works.
Input:
0 1 2 3 4 5
0 0.0 1596.0 1578.0 1567.0 1580.0 1649.0
1 1554.0 1506.0 0.0 1466.0 1469.0 1503.0
2 1588.0 1510.0 1495.0 1485.0 1489.0 0.0
3 1592.0 0.0 0.0 1571.0 1647.0 0.0
Output:
0 1 2 3 4 5
0 0.0 1596.0 1578.0 1567.0 1580.0 1649.0
1 1554.0 1506.0 1486.0 1466.0 1469.0 1503.0
2 1588.0 1510.0 1495.0 1485.0 1489.0 1540.5
3 1592.0 0.0 0.0 1571.0 1647.0 0.0
Code:
data_len = len(df)
first_col = str(df.columns[0])
last_col = str(df.columns[len(df.columns) - 1])
d = df.apply(lambda s: pd.to_numeric(s, errors="coerce"))
m = d.eq(0) | d.isna()
s = m.stack()
missing = s[s].index.tolist()  # indices of missing values; avoid shadowing built-in `list`
for el in missing:
    if el == ('0', first_col) or el == (str(data_len - 1), last_col):
        continue
    # `nxt` instead of `next` to avoid shadowing the built-in
    nxt = df.at[str(int(el[0]) + 1), first_col] if el[1] == last_col else df.at[el[0], str(int(el[1]) + 1)]
    prev = df.at[str(int(el[0]) - 1), last_col] if el[1] == first_col else df.at[el[0], str(int(el[1]) - 1)]
    if prev == 0 or nxt == 0:
        continue
    df.at[el[0], el[1]] = (prev + nxt) / 2
JSON of example:
{"0":{"0":0.0,"1":1554.0,"2":1588.0,"3":0.0},"1":{"0":1596.0,"1":1506.0,"2":1510.0,"3":0.0},"2":{"0":1578.0,"1":0.0,"2":1495.0,"3":1561.0},"3":{"0":1567.0,"1":1466.0,"2":1485.0,"3":1571.0},"4":{"0":1580.0,"1":1469.0,"2":1489.0,"3":1647.0},"5":{"0":1649.0,"1":1503.0,"2":0.0,"3":0.0}}
Here's one approach using shift to average the neighbour's values and slice assigning back to the dataframe:
m = df==0
r = (df.shift(axis=1)+df.shift(-1,axis=1))/2
df.iloc[1:-1,1:-1] = df.mask(m,r)
print(df)
0 1 2 3 4 5
0 0.0 1596.0 1578.0 1567.0 1580.0 1649.0
1 1554.0 1506.0 1486.0 1466.0 1469.0 1503.0
2 1588.0 1510.0 1495.0 1485.0 1489.0 0.0
3 0.0 0.0 1561.0 1571.0 1647.0 0.0
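To answer the interpolate question: a sketch of that approach is to turn the zeros into NaN and interpolate row-wise. Note this is not an exact drop-in for the manual rule above: linear interpolation also bridges runs of several missing cells instead of skipping them, so results differ where two zeros are adjacent.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[0.0, 1596.0, 1578.0, 1567.0, 1580.0, 1649.0],
                   [1554.0, 1506.0, 0.0, 1466.0, 1469.0, 1503.0]])

# Replace 0 with NaN, then interpolate along each row; limit_area="inside"
# leaves leading/trailing gaps (like the first cell of row 0) untouched.
filled = df.replace(0, np.nan).interpolate(axis=1, limit_area="inside")
print(filled.iloc[1, 2])  # (1506.0 + 1466.0) / 2 = 1486.0
```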
I have two columns which I want to compare every nth row. When it reaches an nth row, it should compare the two values and put the result of the if statement in a new column.
When I tried the enumerate function, it always ends up in the true part of the if statement. Somehow this piece of the code is always true:
if (count % 3)== 0:
for count, factors in enumerate(df.index):
    if (count % 3) == 0:  # every 3rd row
        df['Signal'] = np.where(df['Wind Ch'] >= df['Rain Ch'], '1', '-1')
    else:
        df['Signal'] = 0
In the 'Signal' column I am expecting '1' or '-1' every 3rd row and '0' on all other rows. However, I am getting '1' or '-1' on every row.
Now I am getting:
Date Wind CH Rain CH Signal
0 5/10/2005 -1.85% -3.79% 1
1 5/11/2005 1.51% -1.66% 1
2 5/12/2005 0.37% 0.88% -1
3 5/13/2005 -0.81% 3.83% -1
4 5/14/2005 -0.28% 4.05% -1
5 5/15/2005 3.93% 1.79% 1
6 5/16/2005 6.23% 0.94% 1
7 5/17/2005 -0.08% 4.43% -1
8 5/18/2005 -2.69% 4.02% -1
9 5/19/2005 6.40% 1.33% 1
10 5/20/2005 -3.41% 2.38% -1
11 5/21/2005 3.27% 5.46% -1
12 5/22/2005 -4.40% -4.15% -1
13 5/23/2005 3.27% 4.48% -1
But I want to get:
Date Wind CH Rain CH Signal
0 5/10/2005 -1.85% -3.79% 0.0
1 5/11/2005 1.51% -1.66% 0.0
2 5/12/2005 0.37% 0.88% -1.0
3 5/13/2005 -0.81% 3.83% 0.0
4 5/14/2005 -0.28% 4.05% 0.0
5 5/15/2005 3.93% 1.79% 1.0
6 5/16/2005 6.23% 0.94% 0.0
7 5/17/2005 -0.08% 4.43% 0.0
8 5/18/2005 -2.69% 4.02% -1.0
9 5/19/2005 6.40% 1.33% 0.0
10 5/20/2005 -3.41% 2.38% 0.0
11 5/21/2005 3.27% 5.46% -1.0
12 5/22/2005 -4.40% -4.15% 0.0
13 5/23/2005 3.27% 4.48% 0.0
What am I missing here?
You can go about it like this, using np.vectorize to avoid loops:
import numpy as np
def calcSignal(x, y, i):
return 0 if (i + 1) % 3 != 0 else 1 if x >= y else -1
func = np.vectorize(calcSignal)
df['Signal'] = func(df['Wind CH'], df['Rain CH'], df.index)
df
Date Wind CH Rain CH Signal
0 5/10/2005 -1.85% -3.79% 0
1 5/11/2005 1.51% -1.66% 0
2 5/12/2005 0.37% 0.88% -1
3 5/13/2005 -0.81% 3.83% 0
4 5/14/2005 -0.28% 4.05% 0
5 5/15/2005 3.93% 1.79% 1
6 5/16/2005 6.23% 0.94% 0
7 5/17/2005 -0.08% 4.43% 0
8 5/18/2005 -2.69% 4.02% -1
9 5/19/2005 6.40% 1.33% 0
10 5/20/2005 -3.41% 2.38% 0
11 5/21/2005 3.27% 5.46% -1
12 5/22/2005 -4.40% -4.15% 0
13 5/23/2005 3.27% 4.48% 0
In general you don't want to loop over pandas objects. This case is no exception.
In [12]: df = pd.DataFrame({'x': [1,2,3], 'y': [10, 20, 30]})
In [13]: df
Out[13]:
x y
0 1 10
1 2 20
2 3 30
In [14]: df.loc[df.index % 2 == 0, 'x'] = 5
In [15]: df
Out[15]:
x y
0 5 10
1 2 20
2 5 30
There is no need to use the enumerate function as I see it. Also, your logic is faulty: you are rewriting the complete column in every iteration of the loop instead of just the ith row. You could simply do this:
for count in range(len(df.index)):
    if (count % 3) == 0:  # every 3rd row
        df['Signal'].iloc[count] = np.where(df['Wind Ch'].iloc[count] >= df['Rain Ch'].iloc[count], '1', '-1')
    else:
        df['Signal'].iloc[count] = 0
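A fully vectorized sketch of the same idea (assuming the Wind CH / Rain CH columns have already been converted from '%' strings to floats, and using (i + 1) % 3 == 0 so the signal lands on rows 2, 5, 8, ... as in the expected output):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Wind CH': [-1.85, 1.51, 0.37, -0.81, -0.28, 3.93],
                   'Rain CH': [-3.79, -1.66, 0.88, 3.83, 4.05, 1.79]})

every_3rd = (df.index + 1) % 3 == 0
df['Signal'] = np.where(every_3rd,
                        np.where(df['Wind CH'] >= df['Rain CH'], 1, -1),
                        0)
print(df['Signal'].tolist())  # [0, 0, -1, 0, 0, 1]
```

This avoids both the Python-level loop and the chained-assignment pitfalls of writing through `.iloc` on a column slice.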
I would like to give each employee a pro rata share after a sale has been made. Therefore I first need to count the number of contacts per customer that led to a sale and then split the reward among the employees involved in the process.
import pandas as pd
df = pd.DataFrame({"Cust_ID":[1,1,1,2,3,3], "Employee": ["A","B","B","C","B","A"], "Purchase":[0,0,1,1,0,1]})
df
Cust_ID Employee Purchase
0 1 A 0
1 1 B 0
2 1 B 1
3 2 C 1
4 3 B 0
5 3 A 1
When it takes 3 (or more) steps to reach the final sale (Cust_ID = 1), the reward shall be distributed as 50%, 30% and 20% (and 0% for any earlier steps).
For 2 steps: 70% and 30%. One step = 100%.
The result should look like this:
Cust_ID Employee Purchase Reward
0 1 A 0 0.2
1 1 B 0 0.3
2 1 B 1 0.5
3 2 C 1 1.0
4 3 B 0 0.3
5 3 A 1 0.7
I tried using df["Reward"] = df.groupby("Cust_ID").Purchase.transform("xxx"), but this didn't produce the distributed reward.
Thanks in advance!
First let's augment the DataFrame:
df['Touch'] = df.groupby('Cust_ID').cumcount()
df['Touches'] = df.groupby('Cust_ID').Employee.count()[df.Cust_ID].values
df['Reward'] = 0.0
Now we have the basic setup:
Cust_ID Employee Purchase Touch Touches Reward
0 1 A 0 0 3 0.0
1 1 B 0 1 3 0.0
2 1 B 1 2 3 0.0
3 2 C 1 0 1 0.0
4 3 B 0 0 2 0.0
5 3 A 1 1 2 0.0
Finally, apply the reward rules:
df.loc[df.Touches == 1, 'Reward'] = 1.0
df.loc[(df.Touches == 2) & (df.Touch == 0), 'Reward'] = 0.3
df.loc[(df.Touches == 2) & (df.Touch == 1), 'Reward'] = 0.7
df.loc[(df.Touches == 3) & (df.Touch == 0), 'Reward'] = 0.2
df.loc[(df.Touches == 3) & (df.Touch == 1), 'Reward'] = 0.3
df.loc[(df.Touches == 3) & (df.Touch == 2), 'Reward'] = 0.5
This last part could be done more cleverly using np.select(). This is an exercise for the reader.
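A sketch of that np.select() version (same Touch/Touches setup and the same reward rules as above):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Cust_ID": [1, 1, 1, 2, 3, 3],
                   "Employee": ["A", "B", "B", "C", "B", "A"],
                   "Purchase": [0, 0, 1, 1, 0, 1]})
df['Touch'] = df.groupby('Cust_ID').cumcount()
df['Touches'] = df.groupby('Cust_ID').Employee.count()[df.Cust_ID].values

# One condition per (Touches, Touch) combination, paired with its reward share
conditions = [df.Touches == 1,
              (df.Touches == 2) & (df.Touch == 0),
              (df.Touches == 2) & (df.Touch == 1),
              (df.Touches == 3) & (df.Touch == 0),
              (df.Touches == 3) & (df.Touch == 1),
              (df.Touches == 3) & (df.Touch == 2)]
rewards = [1.0, 0.3, 0.7, 0.2, 0.3, 0.5]
df['Reward'] = np.select(conditions, rewards, default=0.0)
print(df['Reward'].tolist())  # [0.2, 0.3, 0.5, 1.0, 0.3, 0.7]
```

The default=0.0 covers touches earlier than the last three, matching the "(0%..)" rule in the question.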
I am calculating Z score and P value for different sub segments within a data frame.
The data frame has two columns, here are the top 5 values in my data frame:
df[["Engagement_score", "Performance"]].head()
Engagement_score Performance
0 6 0.0
1 5 0.0
2 7 66.3
3 3 0.0
4 11 0.0
Here are the distributions of the engagement score and of performance: [histograms omitted]
I am grouping my dataframe by engagement score and then I calculate these three statistics for those groups:
1) Average performance score (sub_average) and the number of values within that group (sub_bookings)
2) Average performance score for the rest of the groups (rest_average) and the number of values in the rest of the groups (rest_bookings)
The overall performance score and overall bookings are calculated for the whole data frame.
Here's my code to do that.
def stats_comparison(i):
    df.groupby(i)['Performance'].agg({
        'average': 'mean',
        'bookings': 'count'
    }).reset_index()
    cat = df.groupby(i)['Performance'].agg({
        'sub_average': 'mean',
        'sub_bookings': 'count'
    }).reset_index()
    cat['overall_average'] = df['Performance'].mean()
    cat['overall_bookings'] = df['Performance'].count()
    cat['rest_bookings'] = cat['overall_bookings'] - cat['sub_bookings']
    cat['rest_average'] = (cat['overall_bookings'] * cat['overall_average']
                           - cat['sub_bookings'] * cat['sub_average']) / cat['rest_bookings']
    cat['z_score'] = (cat['sub_average'] - cat['rest_average']) / \
        np.sqrt(cat['overall_average'] * (1 - cat['overall_average'])
                * (1 / cat['sub_bookings'] + 1 / cat['rest_bookings']))
    cat['prob'] = np.around(stats.norm.cdf(cat.z_score), decimals=10)  # this is the p value
    cat['significant'] = [(lambda x: 1 if x > 0.9 else -1 if x < 0.1 else 0)(i) for i in cat['prob']]
    # if the p value is less than 0.1 then I can confidently say that the 2 samples are different
    print(cat)
stats_comparison('Engagement_score')
I get the following output when I execute my code:
Engagement_score sub_average sub_bookings overall_average \
0 3 57.281118 1234 34.405373
1 4 56.165374 722 34.405373
2 5 52.896404 890 34.405373
3 6 50.275880 966 34.405373
4 7 43.475344 1018 34.405373
5 8 37.693290 1222 34.405373
6 9 30.418053 1695 34.405373
7 10 16.458142 2874 34.405373
8 11 25.604145 1375 34.405373
9 12 10.910013 789 34.405373
overall_bookings rest_bookings rest_average z_score prob significant
0 12785 11551 31.961544 NaN NaN 0
1 12785 12063 33.102984 NaN NaN 0
2 12785 11895 33.021850 NaN NaN 0
3 12785 11819 33.108233 NaN NaN 0
4 12785 11767 33.620702 NaN NaN 0
5 12785 11563 34.057900 NaN NaN 0
6 12785 11090 35.014797 NaN NaN 0
7 12785 9911 39.609727 NaN NaN 0
8 12785 11410 35.465995 NaN NaN 0
9 12785 11996 35.950709 NaN NaN 0
I don't know why I am getting NaN values in the z_score and prob columns. There are no negative values in my data set.
I also get the following warning when I run the code in Jupyter Notebook:
C:\Users\User\Anaconda3\lib\site-packages\ipykernel_launcher.py:4: FutureWarning: using a dict on a Series for aggregation
is deprecated and will be removed in a future version
after removing the cwd from sys.path.
C:\Users\User\Anaconda3\lib\site-packages\ipykernel_launcher.py:8: FutureWarning: using a dict on a Series for aggregation
is deprecated and will be removed in a future version
C:\Users\User\Anaconda3\lib\site-packages\ipykernel_launcher.py:15: RuntimeWarning: invalid value encountered in sqrt
from ipykernel import kernelapp as app
C:\Users\User\Anaconda3\lib\site-packages\scipy\stats\_distn_infrastructure.py:879: RuntimeWarning: invalid value encountered in greater
return (self.a < x) & (x < self.b)
C:\Users\User\Anaconda3\lib\site-packages\scipy\stats\_distn_infrastructure.py:879: RuntimeWarning: invalid value encountered in less
return (self.a < x) & (x < self.b)
C:\Users\User\Anaconda3\lib\site-packages\scipy\stats\_distn_infrastructure.py:1738: RuntimeWarning: invalid value encountered in greater_equal
cond2 = (x >= self.b) & cond0
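The RuntimeWarning: invalid value encountered in sqrt is the key clue. As a sketch of the diagnosis (not part of the original post): the term overall_average * (1 - overall_average) comes from the z-test for proportions, where the mean must lie between 0 and 1. Here overall_average is about 34.4, so the value under the square root is negative, np.sqrt returns NaN, and the NaN propagates into z_score and prob:

```python
import numpy as np

overall_average = 34.405373                        # mean Performance from the output above
radicand = overall_average * (1 - overall_average)
print(radicand < 0)       # True: p * (1 - p) is only non-negative for 0 <= p <= 1
print(np.sqrt(radicand))  # nan (with the same "invalid value encountered in sqrt" warning)
```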