I've managed to successfully confuse myself with this problem. Here's a sample of my dataframe:
Model Rank Prediction Runtime
0.05 1 0.516267 250.500
0.05 2 0.504968 253.875
0.05 3 0.482915 310.875
0.05 4 0.470865 251.375
0.05 5 0.459580 277.250
. . . .
. . . .
. . . .
0.50 96 0.130696 250.500
0.50 97 0.130696 220.375
0.50 98 0.130696 314.625
0.50 99 0.130696 232.000
0.50 100 0.130696 258.000
And my use case is as follows:
I would, for each Model, like to calculate the total Runtime with respect to its Rank. By that I mean, the Runtime at Rank 1 should be the sum of all Runtimes (for its respective Model) and the Runtime at Rank 100 should be only the Runtime for Rank 100 (for its respective Model).
So for instance,
If the Rank is 1, the Runtime column at that row should represent the total sum of all Runtimes for Model 0.05
If the Rank is 2, it should be all of the Runtimes for Model 0.05 minus the Runtime for Model 0.05 at Rank 1
...
If the Rank is 100, it should be only the Runtime for Model 0.05 at Rank 100.
I have the idea in my head but I'm not sure how this is achieved in Pandas. I know how to sum the column, but not to sum based on a condition like this. If any more data or explanation is required, I'd be happy to attach it.
If I understand correctly, what you're asking for is essentially a reversed cumulative sum, which you can do by a reverse, cumsum, reverse operation:
In [4]: df["model_runtimes"] = df[::-1].groupby("Model")["Runtime"].cumsum()[::-1]
In [5]: df
Out[5]:
Model Rank Prediction Runtime model_runtimes
0 0.05 1 0.516267 250.500 1343.875
1 0.05 2 0.504968 253.875 1093.375
2 0.05 3 0.482915 310.875 839.500
3 0.05 4 0.470865 251.375 528.625
4 0.05 5 0.459580 277.250 277.250
5 0.50 96 0.130696 250.500 1275.500
6 0.50 97 0.130696 220.375 1025.000
7 0.50 98 0.130696 314.625 804.625
8 0.50 99 0.130696 232.000 490.000
9 0.50 100 0.130696 258.000 258.000
I would frame your problem in two steps.
First, each model is independent, so you can split the dataframe by the model field, and solve each one independently. This is a good place to use groupby().
Second, your problem is like a cumulative sum, except that a normal cumulative sum starts at the top and carries the sum down, and you want to do the opposite. You can solve this by reversing the dataframe, or by sorting in descending order.
With that in mind, here's how I would approach this problem. (The first block of code is just setting up a sample dataset.)
import pandas as pd
import numpy as np

np.random.seed(42)

# Only used for setting up the sample dataframe. Ignore.
def flatten(t):
    return [item for sublist in t for item in sublist]

df = pd.DataFrame({
    "Model": flatten([list(c * 20) for c in "ABCDE"]),
    "Rank": flatten([range(1, 21) for i in range(5)]),
    "Prediction": np.random.rand(100),
    "Runtime": np.random.rand(100),
})

def add_sum(d):
    d["Runtime"] = d["Runtime"].cumsum()
    return d

df = df.sort_values(by=["Model", "Rank"], ascending=False) \
    .groupby("Model") \
    .apply(add_sum) \
    .sort_index()
print(df)
I'm trying to create a list of timestamps from a column in a dataframe that resets to zero after a certain number of rows. So, if the limit was 4, I want the count to add up the values of the column up to position 4, then reset to zero, and continue adding the values of the column from position 5, and so forth until it reaches the length of the column. I used itertools.islice earlier in the script to create a counter, so I was wondering if I could use a combination of this and itertools.count to do something similar? So far, this is my code:
cycle_time = list(itertools.islice(itertools.count(0,raw_data['Total Time (s)'][lens]),range(0, block_cycles),lens))
Where raw_data['Total Time (s)'] contains the values I wish to add up, block_cycles is the number of rows I want to add up before resetting, and lens is the length of the column in the dataframe. Ideally, the output from my list would look like this:
print(cycle_time)
0
0.24
0.36
0.57
0
0.13
0.32
0.57
Which is calculated from this input:
print(raw_data['Total Time (s)'])
0
0.24
0.36
0.57
0.7
0.89
1.14
Which I would then append to a new column in a dataframe, interim_data_output['Cycle time (s)'], which details the time elapsed at that point in the 'cycle'. block_cycles is the number of iterations in each large 'cycle'. This is what I would do with the list:
interim_data_output['Cycle time (s)'] = cycle_time
I'm a bit lost here, is this even possible using these methods? I'd like to use itertools for performance reasons. Any help would be greatly appreciated!
Given the discussion in the comments, here is an example:
df = pd.DataFrame({'Total Time (s)':[0, 0.24, 0.36, 0.57, 0.7, 0.89, 1.14]})
Total Time (s)
0 0.00
1 0.24
2 0.36
3 0.57
4 0.70
5 0.89
6 1.14
You can do:
block_cycles = 4
# Calculate cycle times.
cycle_times = df['Total Time (s)'].diff().fillna(0).groupby(df.index // block_cycles).cumsum()
# Insert the desired zeros after all cycles.
for idx in range(block_cycles, cycle_times.index.max(), block_cycles):
    cycle_times.loc[idx - 0.5] = 0
cycle_times = cycle_times.sort_index().reset_index(drop=True)
print(cycle_times)
Which gives:
0 0.00
1 0.24
2 0.36
3 0.57
4 0.00
5 0.13
6 0.32
7 0.57
Name: Total Time (s), dtype: float64
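Since the question specifically mentions itertools, here is a plain-Python sketch of the same logic (my own variant, not from the answer above); it assumes the column has been pulled out into a list called values and uses islice to chunk it into blocks:
from itertools import islice

def blocks(seq, size):
    # Yield consecutive chunks of `size` items from `seq`.
    it = iter(seq)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk

values = [0, 0.24, 0.36, 0.57, 0.7, 0.89, 1.14]   # e.g. list(raw_data['Total Time (s)'])
block_cycles = 4

cycle_time = []
prev_end = values[0]
for i, chunk in enumerate(blocks(values, block_cycles)):
    if i > 0:
        cycle_time.append(0.0)                     # explicit reset at the block boundary
    cycle_time.extend(round(v - prev_end, 2) for v in chunk)
    prev_end = chunk[-1]

print(cycle_time)  # [0.0, 0.24, 0.36, 0.57, 0.0, 0.13, 0.32, 0.57]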
I am very new to asking questions on Stack Overflow. Please let me know if I have missed something.
I am trying to rearrange some data from Excel, like below:
Excel Data
To like:
Rearranged
I already tried an existing Stack Overflow answer: How to Rearrange Data
I just need to add one more column on top of that answer, but couldn't find a solution with my limited Python knowledge.
Could anyone suggest a way to do a rearrangement a little more complex than the one in the above link?
You will have to transform your data a little bit in order to get to the result you want, but here is my solution:
1. Imports
import pandas as pd
import numpy as np
2. Remove the merged title from your data ("Budget and Actual"). You may want to rename your columns as 1/31/2020 Actual and 1/31/2020 Budget. Otherwise, if you have the same column name twice, Pandas will bring in the columns with a differentiator like '.1'. Sample data below with only a couple of columns for demonstration purposes.
Item 1/31/2020 2/29/2020 1/31/2020.1 2/29/2020.1
0 A 0.01 0.02 0.03 0.04
1 B 0.20 0.30 0.40 0.50
2 C 0.33 0.34 0.35 0.36
3. Create two separate datasets for Actuals and Budget
#item name and all budget columns from your dataset
df_budget = df.iloc[:, 0:12]
# item name and the actuals columns
df_actuals = df.iloc[:, [0,13,14,15,16,17,18,19,20,21,22,23,24,25]]
4. Correct the names of the columns to remove the differentiator '.1' and reflect your dates
df_actuals.columns = ['Item', '1/31/2020', '2/29/2020', ...]  # and so on for the remaining dates
5. Transform the date columns into rows
df_actuals = df_actuals.melt(id_vars=['Item'], value_vars=['1/31/2020', '2/29/2020'], var_name = 'Date', value_name='Actual')
df_budget = df_budget.melt(id_vars=['Item'], value_vars=['1/31/2020', '2/29/2020'], var_name = 'Date', value_name='Budget')
You should see something like this at this point
Item Date Actual
0 A 1/31/2020 0.01
1 B 1/31/2020 0.20
Item Date Budget
0 A 1/31/2020 0.03
1 B 1/31/2020 0.40
6. Merge both datasets
pd.merge(df_actuals, df_budget, on=['Item', 'Date'], sort=True)
Result:
Item Date Actual Budget
0 A 1/31/2020 0.01 0.03
1 A 2/29/2020 0.02 0.04
2 B 1/31/2020 0.20 0.40
3 B 2/29/2020 0.30 0.50
4 C 1/31/2020 0.33 0.35
5 C 2/29/2020 0.34 0.36
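For completeness, here is a sketch of my own (not part of the answer above) that reaches the same merged result with a single melt, splitting the '.1' suffix back out into an Actual/Budget label. The small dataframe is a hypothetical stand-in mirroring the duplicated date headers:
import pandas as pd

df = pd.DataFrame({
    'Item': ['A', 'B', 'C'],
    '1/31/2020': [0.01, 0.20, 0.33],
    '2/29/2020': [0.02, 0.30, 0.34],
    '1/31/2020.1': [0.03, 0.40, 0.35],
    '2/29/2020.1': [0.04, 0.50, 0.36],
})

# Melt everything once, then turn the '.1' suffix into an Actual/Budget label.
long = df.melt(id_vars='Item', var_name='Date', value_name='value')
long['kind'] = long['Date'].str.endswith('.1').map({False: 'Actual', True: 'Budget'})
long['Date'] = long['Date'].str.replace(r'\.1$', '', regex=True)

result = long.pivot_table(index=['Item', 'Date'], columns='kind', values='value').reset_index()
print(result)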
I have a dataframe with survey data like so, with each row being a different respondent.
weight race Question_1 Question_2 Question_3
0.9 white 1 5 4
1.1 asian 5 4 3
0.95 white 2 1 5
1.25 black 5 4 3
0.80 other 4 5 2
Each question is on a scale from 1 to 5 (there are several more questions in the actual data). For each question, I am trying to calculate the percentage of respondents who responded with a 5, grouped by race and weighted by the weight column.
I believe that the code below works for calculating the percentage who responded with a 5 for each question, grouped by race. But I do not know how to weight it by the weight column.
df.groupby('race').apply(lambda x: ((x == 5).sum()) / x.count())
I am new to pandas. Could someone please explain how to do this? Thanks for any help.
Edit: The desired output for the above dataframe would look something like this. Obviously the real data has far more respondents (rows) and many more questions.
Question_1 Question_2 Question_3
white 0.00 0.49 0.51
black 1.00 0.00 0.00
asian 1.00 0.00 0.00
other 0.00 1.00 0.00
Thank you.
Here is a solution that defines a custom function and applies it to each question column, then concatenates the results into a dataframe:
def wavg(x, col):
    return (x['weight'] * (x[col] == 5)).sum() / x['weight'].sum()

grouped = df.groupby('race')
pd.concat([grouped.apply(wavg, col) for col in df.columns if col.startswith('Question')], axis=1) \
    .rename(columns={num: f'Question_{num+1}' for num in range(3)})
Output:
Question_1 Question_2 Question_3
race
asian 1.0 0.000000 0.000000
black 1.0 0.000000 0.000000
other 0.0 1.000000 0.000000
white 0.0 0.486486 0.513514
Here's how you could do it for question 1. You can easily generalize it for the other questions.
# Define a dummy indicating a '5 response'
df['Q1'] = np.where(df['Question_1']==5 ,1, 0)
# Create a weighted version of the above dummy
df['Q1_w'] = df['Q1'] * df['weight']
# Compute the sum by race
ds = df.groupby(['race'])[['Q1_w', 'weight']].sum()
# Compute the weighted average
ds['avg'] = ds['Q1_w'] / ds['weight']
Basically, you first take the sum of the weights and of the weighted 5 dummy by race and then divide by the sum of the weights.
This gives you the weighted average.
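If you want all questions at once without repeating the dummy step per column, here is a vectorized sketch of the same weighted-dummy idea (my own generalization, using hypothetical sample data matching the question):
import pandas as pd

df = pd.DataFrame({
    'weight': [0.9, 1.1, 0.95, 1.25, 0.80],
    'race': ['white', 'asian', 'white', 'black', 'other'],
    'Question_1': [1, 5, 2, 5, 4],
    'Question_2': [5, 4, 1, 4, 5],
    'Question_3': [4, 3, 5, 3, 2],
})

question_cols = [c for c in df.columns if c.startswith('Question')]

# Weighted share of 5-responses per race: sum(weight * is_five) / sum(weight).
weighted = (
    df[question_cols].eq(5)                             # dummies for a '5' response
      .mul(df['weight'], axis=0)                        # weight each respondent's dummies
      .groupby(df['race']).sum()                        # weighted count of 5s per race
      .div(df.groupby('race')['weight'].sum(), axis=0)  # divide by total weight per race
)
print(weighted)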
I have to make the following calculation (or similar) many times in my code and it takes a long time to run. I was wondering if it was possible to make the code more pythonic (reduce the time to run).
I am calculating the weighting of the "loan_size" proportional to all other loans that have the same origination month.
loan_plans['weighting'] = loan_plans.loan_size / loan_plans.apply(lambda S: loan_plans.loc[(loan_plans.origination_month == S.origination_month), 'loan_size'].sum(), axis=1)
The following is a set of example data with the desired result:
loan_size origination_month weighting
1000 01-2018 0.25
2000 02-2018 0.2
3000 01-2018 0.75
8000 02-2018 0.8
Update (per OP update):
There's nothing wrong with your approach; you might use groupby instead to get origination_month sums, and then do the weighting:
loan_plans = loan_plans.reset_index().merge(
loan_plans.groupby("origination_month").loan_size.sum().reset_index(), on="origination_month"
)
loan_plans["weighting"] = loan_plans.loan_size_x / loan_plans.loan_size_y
loan_plans.sort_values("index").set_index("index")
loan_size_x origination_month loan_size_y weighting
index
0 1000 01-2018 4000 0.25
1 2000 02-2018 10000 0.20
2 3000 01-2018 4000 0.75
3 8000 02-2018 10000 0.80
Cosmetics:
(loan_plans
.sort_values("index")
.set_index("index")
.rename(columns={"loan_size_x": "loan_size"})
.drop("loan_size_y", 1))
loan_size origination_month weighting
index
0 1000 01-2018 0.25
1 2000 02-2018 0.20
2 3000 01-2018 0.75
3 8000 02-2018 0.80
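If you want to skip the merge bookkeeping entirely, a transform-based sketch of the same groupby idea (my own variant, using the example data from the question) keeps everything aligned on the original index:
import pandas as pd

loan_plans = pd.DataFrame({
    'loan_size': [1000, 2000, 3000, 8000],
    'origination_month': ['01-2018', '02-2018', '01-2018', '02-2018'],
})

# Per-row sum of loan_size within the same origination_month.
month_totals = loan_plans.groupby('origination_month')['loan_size'].transform('sum')
loan_plans['weighting'] = loan_plans['loan_size'] / month_totals
print(loan_plans)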
Earlier answer
You can use div and sum, no need for apply:
loan_plans.loan_size.div(
loan_plans.loc[loan_plans.loan_number.eq(1), "loan_size"].sum()
)
Output:
0 0.024714
1 0.053143
2 0.012143
3 0.010929
4 0.039643
...
Data:
N = 100
data = {"loan_size": np.random.randint(100, 1000, size=N),
"loan_number": np.random.binomial(n=1, p=.3, size=N)}
loan_plans = pd.DataFrame(data)
I want to find out how many samples will be taken from each level using the proportional allocation method.
I have a total of 3 levels: [Small, Medium, Large].
First, I want to take the sum for these 3 levels.
Next, I want to find out the probability for these 3 levels.
Next, I want to multiply this probability by the number of villages in each level.
And the last step is: the sample will be selected as the top villages for each level.
Data :
Village Workers Level
Aagar 10 Small
Dhagewadi 32 Small
Sherewadi 34 Small
Shindwad 42 Small
Dhokari 84 Medium
Khanapur 65 Medium
Ambikanagar 45 Medium
Takali 127 Large
Gardhani 122 Large
Pi.Khand 120 Large
Pangri 105 Large
Let me explain; I am attaching an Excel photo.
In the first step I want to get the sum for each level (Small, Medium and Large), i.e. (10+32+34+42) = 118 for the Small level.
In the next step I want to find the probability for each level, rounded to 2 decimals,
i.e. (118/786) = 0.15 for the Small level.
Then, using the length (size) of each level multiplied by its probability, I find out how many samples (villages) are taken from each level.
i.e. for the Medium level we have probability 0.25, and we have 3 villages in the Medium level, so 0.25*3 = 0.75 samples will be taken from the Medium level.
This is rounded up to the next whole number, 0.75 ~ 1 sample taken from the Medium level, and the top village in this level is taken; so in the Medium level the "Dhokari" village will be selected.
I have done some work:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
df=pd.read_csv("/home/desktop/Desktop/t.csv")
df = df.sort_values('Workers', ascending=True)
df['level'] = pd.qcut(df['Workers'], 3, ['Small','Medium','Large'])
df
I use this command to get the sum for the levels; what to do next, I am confused about:
df=df.groupby(['level'])['Workers'].aggregate(['sum']).unstack()
Is it possible in Python to get the village names that I get using Excel?
You can use:
transform with 'sum' to get per-level sums with the same length as the column
div by the overall sum, then round
another transform, this time with 'size'
and last, a custom function
df['Sum_Level_wise'] = df.groupby('Level')['Workers'].transform('sum')
df['Probability'] = df['Sum_Level_wise'].div(df['Workers'].sum()).round(2)
df['Sample'] = df['Probability'] * df.groupby('Level')['Workers'].transform('size')
df['Selected villages'] = df['Sample'].apply(np.ceil).astype(int)
df['Selected village'] = df.groupby('Level') \
    .apply(lambda x: x['Village'].head(x['Selected villages'].iat[0])) \
    .reset_index(level=0)['Village']
df['Selected village'] = df['Selected village'].fillna('')
print (df)
Village Workers Level Sum_Level_wise Probability Sample \
0 Aagar 10 Small 118 0.15 0.60
1 Dhagewadi 32 Small 118 0.15 0.60
2 Sherewadi 34 Small 118 0.15 0.60
3 Shindwad 42 Small 118 0.15 0.60
4 Dhokari 84 Medium 194 0.25 0.75
5 Khanapur 65 Medium 194 0.25 0.75
6 Ambikanagar 45 Medium 194 0.25 0.75
7 Takali 127 Large 474 0.60 2.40
8 Gardhani 122 Large 474 0.60 2.40
9 Pi.Khand 120 Large 474 0.60 2.40
10 Pangri 105 Large 474 0.60 2.40
Selected villages Selected village
0 1 Aagar
1 1
2 1
3 1
4 1 Dhokari
5 1
6 1
7 3 Takali
8 3 Gardhani
9 3 Pi.Khand
10 3
You can try to debug with a custom function:
def f(x):
    a = x['Village'].head(x['Selected villages'].iat[0])
    print (x['Village'])
    print (a)
    if (len(x) < len(a)):
        print ('original village cannot be filled to Selected village, because length is higher')
    return a
df['Selected village'] = df.groupby('Level').apply(f).reset_index(level=0)['Village']
df['Selected village'] = df['Selected village'].fillna('')