How to rearrange an Excel data set with some complexity? - python

I am very new to asking questions on Stack Overflow, so please let me know if I have missed something.
I am trying to rearrange some data from Excel, shown in the first screenshot ("Excel Data"), into the layout shown in the second screenshot ("Rearranged").
I already tried the approach from another Stack Overflow question, How to Rearrange Data, but I need to add one more column on top of that answer and couldn't manage it with my limited Python knowledge.
Could anyone suggest a way to handle this slightly more complex rearrangement?

You will have to transform your data a little to get the result you want, but here is my solution:
1. Imports
import pandas as pd
import numpy as np
2. Remove the merged title from your data ("Budget and Actual"). You may want to rename your columns as "1/31/2020 Actual" and "1/31/2020 Budget"; otherwise, with duplicate column names, pandas will load the repeated columns with a differentiator like '.1'. Sample data below with only a couple of columns for demonstration purposes.
Item 1/31/2020 2/29/2020 1/31/2020.1 2/29/2020.1
0 A 0.01 0.02 0.03 0.04
1 B 0.20 0.30 0.40 0.50
2 C 0.33 0.34 0.35 0.36
3. Create two separate datasets for Budget and Actuals
# Item name and all budget columns from your dataset
df_budget = df.iloc[:, 0:12]
# Item name and the actuals columns
df_actuals = df.iloc[:, [0,13,14,15,16,17,18,19,20,21,22,23,24,25]]
4. Correct the column names to remove the differentiator '.1' and reflect your dates
df_actuals.columns = ['Item', '1/31/2020', '2/29/2020', ...]  # and so on for the remaining dates
5. Transform the date columns into rows with melt
df_actuals = df_actuals.melt(id_vars=['Item'], value_vars=['1/31/2020', '2/29/2020'], var_name='Date', value_name='Actual')
df_budget = df_budget.melt(id_vars=['Item'], value_vars=['1/31/2020', '2/29/2020'], var_name='Date', value_name='Budget')
At this point you should see something like this:
Item Date Actual
0 A 1/31/2020 0.01
1 B 1/31/2020 0.20
Item Date Budget
0 A 1/31/2020 0.03
1 B 1/31/2020 0.40
6. Merge both datasets
pd.merge(df_actuals, df_budget, on=['Item', 'Date'], sort=True)
Result:
Item Date Actual Budget
0 A 1/31/2020 0.01 0.03
1 A 2/29/2020 0.02 0.04
2 B 1/31/2020 0.20 0.40
3 B 2/29/2020 0.30 0.50
4 C 1/31/2020 0.33 0.35
5 C 2/29/2020 0.34 0.36
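For reference, the same reshape can also be done more compactly with a column MultiIndex and stack. This is just a minimal sketch using the small sample above; it assumes (as in the merged result) that the first two date columns hold the actuals:
import pandas as pd

# Same sample data as above; the duplicated date headers are replaced by a
# (measure, date) MultiIndex instead of pandas' '.1' suffix.
df = pd.DataFrame(
    [['A', 0.01, 0.02, 0.03, 0.04],
     ['B', 0.20, 0.30, 0.40, 0.50],
     ['C', 0.33, 0.34, 0.35, 0.36]],
    columns=['Item', '1/31/2020', '2/29/2020', '1/31/2020.1', '2/29/2020.1'],
).set_index('Item')

df.columns = pd.MultiIndex.from_product(
    [['Actual', 'Budget'], ['1/31/2020', '2/29/2020']])

# Stacking the date level turns the wide table into Item/Date rows with
# one Actual and one Budget column each.
result = (df.stack(level=1)
            .rename_axis(['Item', 'Date'])
            .reset_index())
print(result)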

Related

Can I use itertools.count to add values in a column, resetting at a certain point?

I'm trying to create a list of timestamps from a column in a dataframe that resets to zero after a certain point. So, if the limit were 4, I want the count to add up the values of the column up to position 4, then reset to zero and continue adding the values of the column from position 5, and so forth until it reaches the length of the column. I used itertools.islice earlier in the script to create a counter, so I was wondering if I could use a combination of this and itertools.count to do something similar? So far, this is my code:
cycle_time = list(itertools.islice(itertools.count(0,raw_data['Total Time (s)'][lens]),range(0, block_cycles),lens))
Where raw_data['Total Time (s)'] contains the values I wish to add up, block_cycles is the number I want to add up to in the dataframe column before resetting, and lens is the length of the column in the dataframe. Ideally, the output from my list would look like this:
print(cycle_time)
0
0.24
0.36
0.57
0
0.13
0.32
0.57
Which is calculated from this input:
print(raw_data['Total Time (s)'])
0
0.24
0.36
0.57
0.7
0.89
1.14
Which I would then append to a new column in a dataframe, interim_data_output['Cycle time (s)'], which details the time elapsed at that point in the 'cycle'. block_cycles is the number of iterations in each large 'cycle'. This is what I would do with the list:
interim_data_output['Cycle time (s)'] = cycle_time
I'm a bit lost here; is this even possible using these methods? I'd like to use itertools for performance reasons. Any help would be greatly appreciated!
Given the discussion in the comments, here is an example:
df = pd.DataFrame({'Total Time (s)':[0, 0.24, 0.36, 0.57, 0.7, 0.89, 1.14]})
Total Time (s)
0 0.00
1 0.24
2 0.36
3 0.57
4 0.70
5 0.89
6 1.14
You can do:
block_cycles = 4
# Calculate cycle times.
cycle_times = df['Total Time (s)'].diff().fillna(0).groupby(df.index // block_cycles).cumsum()
# Insert the desired zeros after all cycles.
for idx in range(block_cycles, cycle_times.index.max(), block_cycles):
    cycle_times.loc[idx - 0.5] = 0
cycle_times = cycle_times.sort_index().reset_index(drop=True)
print(cycle_times)
Which gives:
0 0.00
1 0.24
2 0.36
3 0.57
4 0.00
5 0.13
6 0.32
7 0.57
Name: Total Time (s), dtype: float64
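Since the question specifically asked about itertools: the same result can be sketched with itertools.accumulate over the consecutive differences. This assumes the column is already available as a plain list and that, as above, a zero marker is inserted at each block boundary:
import itertools

raw = [0, 0.24, 0.36, 0.57, 0.7, 0.89, 1.14]
block_cycles = 4

# Deltas between consecutive timestamps; the first element has no
# predecessor, so its delta is 0 (mirroring diff().fillna(0) above).
deltas = [0.0] + [b - a for a, b in zip(raw, raw[1:])]

cycle_time = []
for i in range(0, len(deltas), block_cycles):
    if i > 0:
        cycle_time.append(0.0)  # zero marker at each block boundary
    # The running total restarts for every block of block_cycles deltas.
    cycle_time.extend(itertools.accumulate(deltas[i:i + block_cycles]))

print([round(t, 2) for t in cycle_time])
# [0.0, 0.24, 0.36, 0.57, 0.0, 0.13, 0.32, 0.57]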

Pandas: Sum down a column based on other columns

I've managed to successfully confuse myself with this problem. Here's a sample of my dataframe:
Model Rank Prediction Runtime
0.05 1 0.516267 250.500
0.05 2 0.504968 253.875
0.05 3 0.482915 310.875
0.05 4 0.470865 251.375
0.05 5 0.459580 277.250
. . . .
. . . .
. . . .
0.50 96 0.130696 250.500
0.50 97 0.130696 220.375
0.50 98 0.130696 314.625
0.50 99 0.130696 232.000
0.50 100 0.130696 258.000
And my use case is as follows:
I would, for each Model, like to calculate the total Runtime with respect to its Rank. By that I mean, the Runtime at Rank 1 should be the sum of all Runtimes (for its respective Model) and the Runtime at Rank 100 should be only the Runtime for Rank 100 (for its respective Model).
So for instance,
If the Rank is 1, the Runtime column at that row should represent the total sum of all Runtimes for Model 0.05
If the Rank is 2, it should be all of the Runtimes for Model 0.05 minus the Runtime for Model 0.05 at Rank 1
...
If the Rank is 100, it should be only the Runtime for Model 0.05 at Rank 100.
I have the idea in my head but I'm not sure how this is achieved in Pandas. I know how to sum the column, but not to sum based on a condition like this. If any more data or explanation is required, I'd be happy to attach it.
If I understand correctly, what you're asking for is essentially a reversed cumulative sum, which you can do by a reverse, cumsum, reverse operation:
In [4]: df["model_runtimes"] = df[::-1].groupby("Model")["Runtime"].cumsum()[::-1]
In [5]: df
Out[5]:
Model Rank Prediction Runtime model_runtimes
0 0.05 1 0.516267 250.500 1343.875
1 0.05 2 0.504968 253.875 1093.375
2 0.05 3 0.482915 310.875 839.500
3 0.05 4 0.470865 251.375 528.625
4 0.05 5 0.459580 277.250 277.250
5 0.50 96 0.130696 250.500 1275.500
6 0.50 97 0.130696 220.375 1025.000
7 0.50 98 0.130696 314.625 804.625
8 0.50 99 0.130696 232.000 490.000
9 0.50 100 0.130696 258.000 258.000
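As a quick sanity check on that result (a sketch against the frame above), the first row of each Model should carry that Model's total Runtime:
import numpy as np

# The lowest Rank per Model should equal the Model's total Runtime.
assert np.allclose(df.groupby("Model")["model_runtimes"].first(),
                   df.groupby("Model")["Runtime"].sum())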
I would frame your problem in two steps.
First, each model is independent, so you can split the dataframe by the model field, and solve each one independently. This is a good place to use groupby().
Second, your problem is like a cumulative sum, except that a normal cumulative sum starts at the top and carries the sum down, and you want to do the opposite. You can solve this by reversing the dataframe, or by sorting in descending order.
With that in mind, here's how I would approach this problem. (The first half of the code below just sets up a sample dataset.)
import pandas as pd
import numpy as np

np.random.seed(42)

# Only used for setting up the sample dataframe. Ignore.
def flatten(t):
    return [item for sublist in t for item in sublist]

df = pd.DataFrame({
    "Model": flatten([list(c * 20) for c in "ABCDE"]),
    "Rank": flatten([range(1, 21) for i in range(5)]),
    "Prediction": np.random.rand(100),
    "Runtime": np.random.rand(100),
})

def add_sum(d):
    d["Runtime"] = d["Runtime"].cumsum()
    return d

df = df.sort_values(by=["Model", "Rank"], ascending=False) \
    .groupby("Model") \
    .apply(add_sum) \
    .sort_index()

print(df)

Pandas GroupBy to calculate weighted percentages meeting a certain condition

I have a dataframe with survey data like so, with each row being a different respondent.
weight race Question_1 Question_2 Question_3
0.9 white 1 5 4
1.1 asian 5 4 3
0.95 white 2 1 5
1.25 black 5 4 3
0.80 other 4 5 2
Each question is on a scale from 1 to 5 (there are several more questions in the actual data). For each question, I am trying to calculate the percentage of respondents who responded with a 5, grouped by race and weighted by the weight column.
I believe that the code below works for calculating the percentage who responded with a 5 for each question, grouped by race. But I do not know how to weight it by the weight column.
df.groupby('race').apply(lambda x: ((x == 5).sum()) / x.count())
I am new to pandas. Could someone please explain how to do this? Thanks for any help.
Edit: The desired output for the above dataframe would look something like this. Obviously the real data has far more respondents (rows) and many more questions.
Question_1 Question_2 Question_3
white 0.00 0.49 0.51
black 1.00 0.00 0.00
asian 1.00 0.00 0.00
other 0.00 1.00 0.00
Thank you.
Here is a solution that defines a custom function and applies it to each question column, then concatenates the resulting columns into a dataframe:
def wavg(x, col):
    return (x['weight'] * (x[col] == 5)).sum() / x['weight'].sum()

grouped = df.groupby('race')
pd.concat([grouped.apply(wavg, col) for col in df.columns if col.startswith('Question')], axis=1)\
    .rename(columns={num: f'Question_{num+1}' for num in range(3)})
Output:
Question_1 Question_2 Question_3
race
asian 1.0 0.000000 0.000000
black 1.0 0.000000 0.000000
other 0.0 1.000000 0.000000
white 0.0 0.486486 0.513514
Here's how you could do it for question 1. You can easily generalize it for the other questions.
import numpy as np

# Define a dummy indicating a '5' response
df['Q1'] = np.where(df['Question_1'] == 5, 1, 0)
# Create a weighted version of the above dummy
df['Q1_w'] = df['Q1'] * df['weight']
# Compute the sum by race
ds = df.groupby(['race'])[['Q1_w', 'weight']].sum()
# Compute the weighted average
ds['avg'] = ds['Q1_w'] / ds['weight']
Basically, you first take the sum of the weights and of the weighted 5 dummy by race and then divide by the sum of the weights.
This gives you the weighted average.
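The same per-question steps can also be vectorized across all question columns at once. Here is a sketch using the sample data from the question (the frame construction is only there to make it self-contained):
import pandas as pd

df = pd.DataFrame({
    'weight': [0.9, 1.1, 0.95, 1.25, 0.80],
    'race': ['white', 'asian', 'white', 'black', 'other'],
    'Question_1': [1, 5, 2, 5, 4],
    'Question_2': [5, 4, 1, 4, 5],
    'Question_3': [4, 3, 5, 3, 2],
})

question_cols = [c for c in df.columns if c.startswith('Question')]

# Weighted indicator for a '5' response, per question column.
weighted = df[question_cols].eq(5).mul(df['weight'], axis=0)

# Sum of weighted indicators divided by sum of weights, per race.
result = (weighted.groupby(df['race']).sum()
                  .div(df.groupby('race')['weight'].sum(), axis=0))
print(result)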

How can I repercentage a cell from several data points in Python Pandas?

I've been browsing different questions here on Stack Exchange but haven't figured out how to do what I need in Pandas. I think it'll ultimately be pretty simple!
I'm doing a task where a dataset has a bunch of products, and each product has a row for each of the stores it's located in. So, Product A will have individual lines for food, drugstore, Target, Walmart, etc. Then its availability and the importance of that outlet are multiplied, and I need to repercentage that result so that each product's figures sum to 100%.
Right now I'm doing it manually in Excel/Google Sheets, but that's annoying and tedious. I can tell how to get the sum total of column E per Product by using groupby, but I can't figure out how to then make that number appear on each product's rows so that each figure from column E can be divided by it.
Anyone have suggestions? Link to example of what the dataset looks like
To get the sum to show up on every row, you want .transform('sum').
In one line:
df['Repercentaged'] = df.groupby('Product').Multiplied.transform(lambda x: x/x.sum())
But if you want to keep the Sum Column...
import pandas as pd
df['Sum'] = df.groupby('Product').Multiplied.transform('sum')
# Location Multiplied Product Sum
#0 Food 0.09 A 0.88
#1 Drugstore 0.21 A 0.88
#2 Walmart 0.35 A 0.88
#3 Target 0.23 A 0.88
#4 Food 0.13 B 0.73
#5 Drugstore 0.13 B 0.73
#6 Walmart 0.25 B 0.73
#7 Target 0.22 B 0.73
df['Repercentaged'] = df['Multiplied']/df['Sum']
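Putting it together, here is a self-contained sketch of the transform approach built from the sample rows shown above:
import pandas as pd

df = pd.DataFrame({
    'Location': ['Food', 'Drugstore', 'Walmart', 'Target'] * 2,
    'Multiplied': [0.09, 0.21, 0.35, 0.23, 0.13, 0.13, 0.25, 0.22],
    'Product': ['A'] * 4 + ['B'] * 4,
})

# Broadcast each Product's total back onto its rows, then divide.
df['Repercentaged'] = (df['Multiplied']
                       / df.groupby('Product')['Multiplied'].transform('sum'))

# Each Product's Repercentaged values now sum to 1.0.
print(df.groupby('Product')['Repercentaged'].sum())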

Make a console-friendly string a usable pandas dataframe - python

A quick question, as I'm currently changing from R to pandas for some projects:
I get the following printed output from metrics.classification_report in scikit-learn:
precision recall f1-score support
0 0.67 0.67 0.67 3
1 0.50 1.00 0.67 1
2 1.00 0.80 0.89 5
avg / total 0.83 0.78 0.79 9
I want to use this (and similar ones) as a matrix/dataframe so that I can subset it to extract, say, the precision of class 0.
In R, I'd give the first "column" a name like 'outcome_class' and then subset it:
my_dataframe[my_dataframe$class_outcome == 1, 'precision']
And I can do this in pandas, but the dataframe that I want to use is simply a string (see scikit-learn's docs).
How can I make the table output here to a useable dataframe in pandas?
Assign it to a variable, s:
s = classification_report(y_true, y_pred, target_names=target_names)
Or directly:
s = '''
precision recall f1-score support
class 0 0.50 1.00 0.67 1
class 1 0.00 0.00 0.00 1
class 2 1.00 0.67 0.80 3
avg / total 0.70 0.60 0.61 5
'''
Use that as the string input for StringIO:
import io  # For Python 2.x, use: import StringIO
import pandas as pd

# The regex separator needs the python parser engine.
df = pd.read_table(io.StringIO(s), sep='\s{2,}', engine='python')  # Python 2.x: StringIO.StringIO(s)
df
Out:
precision recall f1-score support
class 0 0.5 1.00 0.67 1
class 1 0.0 0.00 0.00 1
class 2 1.0 0.67 0.80 3
avg / total 0.7 0.60 0.61 5
Now you can slice it like an R data.frame:
df.loc['class 2']['f1-score']
Out: 0.80000000000000004
Here, classes are the index of the DataFrame. You can use reset_index() if you want to use it as a regular column:
df = df.reset_index().rename(columns={'index': 'outcome_class'})
df.loc[df['outcome_class']=='class 1', 'support']
Out:
1 1
Name: support, dtype: int64
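If your scikit-learn version is 0.20 or newer, you can skip the string parsing entirely: classification_report accepts output_dict=True and returns a nested dict that feeds straight into a DataFrame. A minimal sketch with toy labels:
import pandas as pd
from sklearn.metrics import classification_report

y_true = [0, 1, 2, 2, 2]
y_pred = [0, 0, 2, 2, 1]
target_names = ['class 0', 'class 1', 'class 2']

# output_dict=True (scikit-learn 0.20+) returns a dict instead of a string.
report = classification_report(y_true, y_pred, target_names=target_names,
                               output_dict=True)
df = pd.DataFrame(report).transpose()

print(df.loc['class 0', 'precision'])  # 0.5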
