Imagine the following dataset df:
Row  Population_density  Distance
1    400                 50
2    500                 30
3    300                 40
4    200                 120
5    500                 60
6    1000                50
7    3300                30
8    500                 90
9    700                 100
10   1000                110
11   900                 200
12   850                 30
How can I make a new dummy column that contains a 1 when df['Population_density'] is above the third quartile (the 75th percentile) AND df['Distance'] is < 100, and a 0 for the remainder of the data? Consequently, rows 6 and 7 should have a 1 while the other rows should have a 0.
Creating a dummy variable with only one criterion is fairly easy. For instance, the following condition works for creating a new dummy variable that contains a 1 when the Distance is < 100 and a 0 otherwise: df['Distance_Below_100'] = np.where(df['Distance'] < 100, 1, 0). However, I do not know how to combine conditions when one of the conditions involves a quantile selection (in this case, the upper 25% of the variable Population_density).
import pandas as pd
# assign data as lists
data = {'Row': range(1, 13),
        'Population_density': [400, 500, 300, 200, 500, 1000, 3300, 500, 700, 1000, 900, 850],
        'Distance': [50, 30, 40, 120, 60, 50, 30, 90, 100, 110, 200, 30]}
# Create DataFrame
df = pd.DataFrame(data)
You can use & (and) or | (or) to join the conditions:
import numpy as np
df['Distance_Below_100'] = np.where(df['Population_density'].gt(df['Population_density'].quantile(0.75)) & df['Distance'].lt(100), 1, 0)
print(df)
Row Population_density Distance Distance_Below_100
0 1 400 50 0
1 2 500 30 0
2 3 300 40 0
3 4 200 120 0
4 5 500 60 0
5 6 1000 50 1
6 7 3300 30 1
7 8 500 90 0
8 9 700 100 0
9 10 1000 110 0
10 11 900 200 0
11 12 850 30 0
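Equivalently, a combined boolean mask can be cast straight to int, without np.where (a minimal sketch; note the parentheses around each comparison, since & binds more tightly than < and >):

mask = (df['Population_density'] > df['Population_density'].quantile(0.75)) & (df['Distance'] < 100)
df['Distance_Below_100'] = mask.astype(int)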
To apply a function to a data frame row by row, I recommend using apply with a lambda. For example, given your function:

def myFunction(value):
    pass

to create a new column 'new_column', where pick_cell is whichever cell you want the function applied to (note the axis=1 so that apply works row-wise):

df['new_column'] = df.apply(lambda x: myFunction(x.pick_cell), axis=1)
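A concrete sketch of this pattern for the question above (the column name 'dummy' is illustrative):

threshold = df['Population_density'].quantile(0.75)
df['dummy'] = df.apply(
    lambda row: 1 if row['Population_density'] > threshold and row['Distance'] < 100 else 0,
    axis=1)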
I have a calculation that I need to make for a dataset that models a tank of liquids, and I would really like to do this without iterating manually over each row, but I just don't seem to be clever enough to figure it out.
The calculation is quite easy to do on a simple list of values, as shown:
import pandas as pd

inflow_1 = [100, 100, 90, 0, 20, 0, 20, 60, 30, 70]
inflow_2 = [0, 50, 30, 20, 50, 0, 90, 20, 70, 90]
outflow = [0, 10, 80, 70, 80, 50, 30, 100, 90, 10]

tank_volume1 = 0
tank_volume2 = 0
outflow_volume1 = 0
outflow_volume2 = 0
outflows_1 = []
outflows_2 = []

for in1, in2, out in zip(inflow_1, inflow_2, outflow):
    tank_volume1 += in1
    tank_volume2 += in2
    outflow_volume1 += out * (tank_volume1 / (tank_volume1 + tank_volume2))
    outflow_volume2 += out * (tank_volume2 / (tank_volume1 + tank_volume2))
    tank_volume1 -= outflow_volume1
    tank_volume2 -= outflow_volume2
    outflows_1.append(outflow_volume1)
    outflows_2.append(outflow_volume2)

df = pd.DataFrame({'inflow_1': inflow_1, 'inflow_2': inflow_2, 'outflow': outflow,
                   'outflow_1': outflows_1, 'outflow_2': outflows_2})
Which outputs:
   inflow_1  inflow_2  outflow   outflow_1  outflow_2
0       100         0        0    0.000000   0.000000
1       100        50       10    8.000000   2.000000
2        90        30       80   70.666667  19.333333
3         0        20       70  121.678161  38.321839
4        20        50       80  165.540230  74.459770
5         0         0       50  235.396552  54.603448
6        20        90       30  272.389498  47.610502
7        60        20      100  377.535391  42.464609
8        30        70       90  473.443834  36.556166
9        70        90       10  484.369943  35.630057
But I just don't see how to do this purely in PySpark, or even in pandas, without iterating through rows manually. I feel like it should be possible, since I basically just need access to the previously calculated value in each step, something similar to cumsum(), but no combination I can think of gets the job done.
If there's also just a better term for this type of calculation, I'd appreciate that input.
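This type of calculation is usually called a recurrence (a stateful, path-dependent computation): each step depends on the previous step's result, which is why vectorized primitives like cumsum alone do not cover it. As a minimal sketch, the same recurrence can at least be written against the DataFrame with itertuples, which avoids repeated .loc indexing even though it still iterates:

def simulate(df):
    # running state carried from one row to the next (the recurrence)
    tank1 = tank2 = out1 = out2 = 0.0
    outflows_1, outflows_2 = [], []
    for row in df.itertuples():
        tank1 += row.inflow_1
        tank2 += row.inflow_2
        out1 += row.outflow * tank1 / (tank1 + tank2)
        out2 += row.outflow * tank2 / (tank1 + tank2)
        tank1 -= out1
        tank2 -= out2
        outflows_1.append(out1)
        outflows_2.append(out2)
    return df.assign(outflow_1=outflows_1, outflow_2=outflows_2)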
I have the following dataset, which contains a column with the cluster number, the number of observations in that cluster and the maximum value of another variable x grouped by that cluster.
clust = np.arange(0, 10)
obs = np.array([1041, 544, 310, 1648, 1862, 2120, 2916, 5148, 12733, 1])
x_max = np.array([10, 20, 30, 40, 50, 60, 70, 80, 90, 100])
df = pd.DataFrame(np.c_[clust, obs, x_max], columns=['clust', 'obs', 'x_max'])
clust obs x_max
0 0 1041 10
1 1 544 20
2 2 310 30
3 3 1648 40
4 4 1862 50
5 5 2120 60
6 6 2916 70
7 7 5148 80
8 8 12733 90
9 9 1 100
My task is to combine the clust row values with adjacent rows, so that each cluster contains at least 1000 observations.
My current attempt gets stuck in an infinite loop because the last cluster has only 1 observation.
condition = True
while condition:
    condition = False
    for i in np.arange(0, len(df)):
        if df.loc[i, 'obs'] < 1000:
            df.loc[i, 'clust'] = df.loc[i, 'clust'] + 1
            df = df.groupby('clust', as_index=False).agg({'obs': 'sum', 'x_max': 'max'})
            condition = True
            break
Is there perhaps a more efficient way of doing this? I come from a background in SAS, where such situations would be solved with the if last.row condition, but it seems there is no such condition in Python.
The resulting table should look like this:
clust obs x_max
0 1041 10
1 2502 40
2 1862 50
3 2120 60
4 2916 70
5 5148 80
6 12734 100
Here is another way. A vectorized approach is difficult to implement here, but using a for loop over an array (or a list) will be faster than using loc at each iteration. Also, it is not good practice to change df within the loop; it can only bring problems.
# define variables
s = 0    # running sum of observations
gr = []  # final grouping values
i = 0    # group index

# loop over observations from an array
for obs in df['obs'].to_numpy():
    s += obs
    gr.append(i)
    # check that the size of the group is big enough
    if s > 1000:
        s = 0
        i += 1

# deal with the last rows if the last group is not big enough
if s != 0:
    gr = [i - 1 if val == i else val for val in gr]

# now create your new df
new_df = (
    df.groupby(gr).agg({'obs': 'sum', 'x_max': 'max'})
      .reset_index().rename(columns={'index': 'cluster'})
)
print(new_df)
# cluster obs x_max
# 0 0 1041 10
# 1 1 2502 40
# 2 2 1862 50
# 3 3 2120 60
# 4 4 2916 70
# 5 5 5148 80
# 6 6 12734 100
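A quick sanity check on the result (a sketch; it simply asserts that every merged cluster reached the threshold):

assert (new_df['obs'] >= 1000).all()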
I have a data frame which essentially looks like this:
number  value
200     0
201     1
202     2
..      ..
399     3
400     4
What I want to do is to create a new column which has the range of 3 consecutive numbers:
number  value  range
200     0      200 - 202
201     1      200 - 202
202     2      200 - 202
..      ..     ..
399     3      398 - 400
400     4      398 - 400
One thing I can do is to create my own function and write if statements like this:
def function(number):
    if number > 199 and number < 203:
        return "200-202"
    elif number > 202 and number < 206:
        return "203-205"
    # ... and so on
But this would require I write about 70 if statements. I'm sure there is an easier way to do this. Can someone please guide me?
You can determine the range from the number itself.
Assuming you want to start on the first value and use ranges of n=3, you can use:
n = 3
first = df['number'].iloc[0] # initial value (could be set to 0 to have fixed ranges)
start = (df['number']
         .sub(first).floordiv(n)
         .mul(n).add(first)
         )
df['range'] = start.astype(str)+'-'+start.add(n-1).astype(str)
Output:
number value range
0 200 0 200-202
1 201 1 200-202
2 202 2 200-202
3 399 3 398-400
4 400 4 398-400
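For instance, with first = 200 and n = 3, the number 399 maps to (399 - 200) // 3 * 3 + 200 = 66 * 3 + 200 = 398, so its label is 398-400.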
You should use pd.cut
import pandas as pd
import numpy as np
x = pd.DataFrame({
    'a': range(200, 400),
    'b': np.random.randint(0, 10, 200)
})
wks = 2
x.loc[:,'range'] = pd.cut(x.a, bins=range(x.a.min(), x.a.max()+wks, wks), right=False)
with pd.option_context('display.max_rows', 5):
    display(x)  # display() assumes an IPython/Jupyter session
Output:
a b range
0 200 7 [200, 202)
1 201 6 [200, 202)
... ... ... ...
198 398 1 [398, 400)
199 399 3 [398, 400)
After which I presume you want to do something like:
with pd.option_context('display.max_rows', 5):
    display(x.groupby('range').b.sum())
Output:
range
[200, 202) 13
[202, 204) 6
..
[396, 398) 6
[398, 400) 4
Name: b, Length: 100, dtype: int32
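If plain string labels are preferred over Interval objects (as in the question), the same bin edges can feed the labels argument of pd.cut; a sketch (with wks = 2 this yields labels like '200-201' for the half-open bins):

bins = list(range(x.a.min(), x.a.max() + wks, wks))
labels = [f'{lo}-{lo + wks - 1}' for lo in bins[:-1]]
x['range'] = pd.cut(x.a, bins=bins, labels=labels, right=False)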
I am trying to add a column with the weighted average of 4 columns, using 4 columns of weights:
df = pd.DataFrame.from_dict(dict([('A', [2000, 1000, 2509, 2145]),
                                  ('A_Weight', [37, 47, 33, 16]),
                                  ('B', [2100, 1500, 2000, 1600]),
                                  ('B_weights', [17, 21, 6, 2]),
                                  ('C', [2500, 1400, 0, 2300]),
                                  ('C_weights', [5, 35, 0, 40]),
                                  ('D', [0, 1600, 2100, 2000]),
                                  ('D_weights', [0, 32, 10, 5])]))
I want the weighted average to be in a new column named "WA", but every time I try it displays NaN.
The formula I used: ((A * A_Weight) + (B * B_weights) + (C * C_weights) + (D * D_weights)) / (sum of all weights)
The desired dataframe would have a new column with the following values:
df['WA'] = [2071.19, 1323.70, 2363.20, 2214.60]
Thank you
A straight-forward and simple way to do it is as follows:
(Since your weight columns are not consistently named, e.g. some with an 's' and some without, some with a capital 'W' and some with a lower-case 'w', it is not convenient to group columns, e.g. by .filter().)
df['WA'] = (
    (df['A'] * df['A_Weight']) + (df['B'] * df['B_weights'])
    + (df['C'] * df['C_weights']) + (df['D'] * df['D_weights'])
) / (df['A_Weight'] + df['B_weights'] + df['C_weights'] + df['D_weights'])
Result:
print(df)
A A_Weight B B_weights C C_weights D D_weights WA
0 2000 37 2100 17 2500 5 0 0 2071.186441
1 1000 47 1500 21 1400 35 1600 32 1323.703704
2 2509 33 2000 6 0 0 2100 10 2363.204082
3 2145 16 1600 2 2300 40 2000 5 2214.603175
The not-so-straightforward way:
Group columns by prefix via str.split.
Get the column-wise product via groupby prod.
Get the row-wise sum of the products with sum on axis 1.
Use filter + sum on axis 1 to get the sum of the "weights" columns.
Divide the group product sums by the weight sums.
# assumes df does not yet contain the 'WA' column from the first approach
df['WA'] = (
    df.groupby(df.columns.str.split('_').str[0], axis=1).prod().sum(axis=1)
    / df.filter(regex='_[wW]eight(s)?$').sum(axis=1)
)
A A_Weight B B_weights C C_weights D D_weights WA
0 2000 37 2100 17 2500 5 0 0 2071.186441
1 1000 47 1500 21 1400 35 1600 32 1323.703704
2 2509 33 2000 6 0 0 2100 10 2363.204082
3 2145 16 1600 2 2300 40 2000 5 2214.603175
Another option to an old question:
Split data into numerator and denominator:
numerator = df.filter(regex=r"[A-Z]$")
denominator = df.filter(like='_')
Convert denominator into a MultiIndex, comes in handy when computing with numerator:
denominator.columns = denominator.columns.str.split('_', expand = True)
Multiply numerator by denominator, and divide the sum of the outcome by the sum of the denominator:
outcome = numerator.mul(denominator, level=0, axis=1).sum(1)
outcome = outcome.div(denominator.sum(1))
df.assign(WA = outcome)
A A_Weight B B_weights C C_weights D D_weights WA
0 2000 37 2100 17 2500 5 0 0 2071.186441
1 1000 47 1500 21 1400 35 1600 32 1323.703704
2 2509 33 2000 6 0 0 2100 10 2363.204082
3 2145 16 1600 2 2300 40 2000 5 2214.603175
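For completeness, numpy can compute the same row-wise weighted average directly, since np.average accepts a weights array of the same shape as the values (a sketch, with the column lists spelled out to match this frame):

import numpy as np
vals = df[['A', 'B', 'C', 'D']].to_numpy()
wts = df[['A_Weight', 'B_weights', 'C_weights', 'D_weights']].to_numpy()
df['WA'] = np.average(vals, weights=wts, axis=1)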
I have a dataframe as below. I understand that df.groupby("degree").mean() provides me the mean of each column by degree. I would like to take those means and find the distance between each data point and each of those means. In this case, for each data point I would like to get 3 distances, one from each of the means (4, 40), (2, 80) and (4, 94) (the output of df.groupby("degree").mean()), and create 3 new columns. The distances should be calculated by the formulas BCA_mean = (name - 4)^3 + (score - 40)^3, M.Tech_mean = (name - 2)^3 + (score - 80)^3, and MBA_mean = (name - 4)^3 + (score - 94)^3.
import pandas as pd
# dictionary of lists
dict = {'name':[5, 4, 2, 3],
'degree': ["MBA", "BCA", "M.Tech", "MBA"],
'score':[90, 40, 80, 98]}
# creating a dataframe from a dictionary
df = pd.DataFrame(dict)
print (df)
name degree score
0 5 MBA 90
1 4 BCA 40
2 2 M.Tech 80
3 3 MBA 98
df.groupby("degree").mean()
        name  score
degree
BCA        4     40
M.Tech     2     80
MBA        4     94
Update 1
My real dataset has more than 100 columns, so I would prefer something that scales to that. The logic is still the same: for each mean, subtract the mean value from each column, take the cube of each cell, and add them up.
I found something like the below (note it uses squares rather than the cubes from my formula), but I am not sure if there is a more efficient way:
import numpy as np

y = df.groupby("degree").mean()
print(y)
df["mean0"] = (np.square(df[['name', 'score']].subtract(y.iloc[0, :], axis=1))).sum(axis=1)
df
import pandas as pd
# dictionary of lists
dict = {'degree': ["MBA", "BCA", "M.Tech", "MBA","BCA"],
'name':[5, 4, 2, 3,2],
'score':[90, 40, 80, 98,60],
'game':[100,200,300,100,400],
'money':[100,200,300,100,400],
'loan':[100,200,300,100,400],
'rent':[100,200,300,100,400],
'location':[100,200,300,100,400]}
# creating a dataframe from a dictionary
df = pd.DataFrame(dict)
print (df)
dfx=df.groupby("degree").mean()
print(dfx)
def fun(x):
    if x[0] == 'BCA':
        return x[1:] - dfx.iloc[0, :].tolist()
    if x[0] == 'M.Tech':
        return x[1:] - dfx.iloc[1, :].tolist()
    if x[0] == 'MBA':
        return x[1:] - dfx.iloc[2, :].tolist()

df_added = df.apply(fun, axis=1)
df_added
Result:
degree name score game money loan rent location
0 MBA 5 90 100 100 100 100 100
1 BCA 4 40 200 200 200 200 200
2 M.Tech 2 80 300 300 300 300 300
3 MBA 3 98 100 100 100 100 100
4 BCA 2 60 400 400 400 400 400
The means (dfx):
name score game money loan rent location
degree
BCA 3 50 300 300 300 300 300
M.Tech 2 80 300 300 300 300 300
MBA 4 94 100 100 100 100 100
df_added, the difference of each element from its group-mean column value:
name score game money loan rent location
0 1 -4 0 0 0 0 0
1 1 -10 -100 -100 -100 -100 -100
2 0 0 0 0 0 0 0
3 -1 4 0 0 0 0 0
4 -1 10 100 100 100 100 100
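A sketch of a more scalable version of the same idea, assuming the cubed distances from the stated formula (the attempt above used squares); it creates one new column per group mean, however many columns the frame has:

num_cols = list(df.columns.drop('degree'))       # all non-grouping columns
means = df.groupby('degree')[num_cols].mean()    # one mean row per degree

# one cubed-distance column per group mean, e.g. 'BCA_mean'
for degree, mean_row in means.iterrows():
    df[degree + '_mean'] = ((df[num_cols] - mean_row) ** 3).sum(axis=1)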