I would like to compare the sums of an original df and a rounded df.
If the sums differ, apply the delta (by subtraction or addition) to the last quarter.
The sum difference for AA is 4 (12 - 8 = 4).
The sum difference for BB is 2 (14 - 12 = 2).
Data
original_df
id q121 q221 q321 q421 sum
AA 1 0.5 0.5 6.1 8
BB 1 0.5 6.5 3.1 12
rounded_df
id q121 q221 q321 q421 sum
AA 2 2 2 6 12
BB 2 2 6 4 14
Desired
We've subtracted 4 from 12 to obtain 8 for AA.
We've subtracted 2 from 14 to obtain 12 for BB
(when comparing original to rounded).
The new final_df now matches the sums of the original_df.
final_df
id q121 q221 q321 q421 sum delta
AA 2 2 2 2 8 4
BB 2 2 6 2 12 2
Doing
Compare the sums and create the delta:
import numpy as np
final_df['delta'] = np.where(original_df['sum'] == rounded_df['sum'],
                             0, original_df['sum'] - rounded_df['sum'])
Apply the delta to the last quarter of the year:
I am still not sure how to complete the second half of the problem. I am still researching; any suggestions are appreciated.
using sub, filter, update, iloc
# create the delta as the difference between the sums of the two DataFrames
# (here df is original_df and df2 is rounded_df)
df2['delta'] = df2['sum'].sub(df['sum'])
# subtract the delta from the last quarter, obtained using filter
# create a placeholder df3
df3 = df2.filter(like='q').iloc[:, -1:].sub(df2.iloc[:, -1:].values)
# filter(like='q')        : keep only the columns whose names contain 'q'
# .iloc[:, -1:]           : of those, take the last column, i.e. the last quarter
# df2.iloc[:, -1:].values : the values of the last column of df2, i.e. the delta
# the subtraction produces df3
# update df2 in place: matching columns in df3 overwrite those in df2
df2.update(df3)
df2
id q121 q221 q321 q421 sum delta
0 AA 2 2 2 2 12 4
1 BB 2 2 6 2 14 2
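For reference, here is a minimal self-contained sketch of the same idea (data copied from the question; as an assumption, the sum column is also refreshed at the end so it matches original_df, as in the desired output):

import pandas as pd

# sample data from the question, including the given sum columns
original_df = pd.DataFrame({'id': ['AA', 'BB'],
                            'q121': [1, 1], 'q221': [0.5, 0.5],
                            'q321': [0.5, 6.5], 'q421': [6.1, 3.1],
                            'sum': [8, 12]})
rounded_df = pd.DataFrame({'id': ['AA', 'BB'],
                           'q121': [2, 2], 'q221': [2, 2],
                           'q321': [2, 6], 'q421': [6, 4],
                           'sum': [12, 14]})

final_df = rounded_df.copy()
# delta = rounded sum - original sum
final_df['delta'] = rounded_df['sum'] - original_df['sum']
# pick the last quarter column and subtract the delta from it
last_q = final_df.filter(like='q').columns[-1]
final_df[last_q] = final_df[last_q] - final_df['delta']
# refresh the sum column so it matches original_df (8 and 12)
final_df['sum'] = final_df.filter(like='q').sum(axis=1)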
Related
I have a dataframe, df, where I would like to transform and pivot select values.
I wish to groupby id and date, sum the 'pwr' values and then count the type values.
df
The 'type' values that will become column headers: 'hi', 'hey'
id date type pwr de_id de_date de_type de_pwr base base_pos
aa q1 hey 10 aa q1 hey 5 200 40
aa q1 hi 5 200 40
aa q1 hey 5 200 40
aa q2 hey 2 aa q2 hey 3 200 40
aa q2 hey 2 aa q2 hey 3 200 40
bb q1 0 bb q1 hi 6 500 10
bb q1 0 bb q1 hi 6 500 10
Desired
id date hey hi total sum hey hi total_de de_sum base base_pos
aa q1 2 1 3 20 1 0 1 5 200 40
aa q2 2 0 2 4 2 0 2 6 200 40
bb q1 0 0 0 0 0 2 2 12 500 10
Doing
sum1 = df.groupby(['id','date']).agg({'pwr': 'sum', 'type': 'count', 'de_pwr': 'sum', 'de_type': 'count'})
pd.pivot_table(df, values = '' , columns = 'type')
Any suggestions would be helpful.
So, this is definitely not a 'clean' way to go about it, but since you have 2 separate totals summed along columns, I don't know how much cleaner it could get (and the output seems accurate).
You don't mention what aggregation you use to get base and base_pos values, so I went with mean (might need to change it).
type_col = pd.crosstab(index = [df['id'], df['date']], columns = df['type'])
type_col['total'] = type_col.sum(axis = 1)
pwr_sum = df.groupby(['id','date'])['pwr'].sum()
de_type_col = pd.crosstab(index = [df['id'], df['date']], columns = df['de_type'])
de_type_col['total_de'] = de_type_col.sum(axis = 1)
pwr_de_sum = df.groupby(['id','date'])['de_pwr'].sum()
base_and_pos = df.groupby(['id','date'])[['base','base_pos']].mean()
out = pd.concat([type_col, pwr_sum, de_type_col, pwr_de_sum, base_and_pos], axis = 1).fillna(0).astype('int')
Essentially, use crosstab to get the value counts and sum them along the columns. The index of the resulting DataFrame is the same as that of groupby(['id','date']), so you can concatenate the groupby results without issue. Repeat the same process for the de columns, apply a groupby with your choice of aggregation to the base and base_pos columns, and concatenate all the results along axis = 1. Obviously, you can group some operations together (such as the pwr sum, de_pwr sum and base/base_pos aggregation), but you'll then need to reorder your columns to get the desired order; see the sketch after the output below.
Output:
id date hey hi total pwr hey hi total_de de_pwr base base_pos
aa q1 2 1 3 20 1 0 1 5 200 40
aa q2 2 0 2 4 2 0 2 6 200 40
bb q1 0 0 0 0 0 2 2 12 500 10
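As mentioned, the numeric aggregations can be folded into a single groupby. A sketch of that variant, reusing the df, type_col and de_type_col names from above (the mean for base/base_pos is still an assumption):

nums = df.groupby(['id', 'date']).agg(pwr=('pwr', 'sum'),
                                      de_pwr=('de_pwr', 'sum'),
                                      base=('base', 'mean'),
                                      base_pos=('base_pos', 'mean'))
# concatenating in this order reproduces the column layout shown above
out = pd.concat([type_col, nums[['pwr']], de_type_col,
                 nums[['de_pwr', 'base', 'base_pos']]], axis=1).fillna(0).astype('int')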
What I want to do is group by column A, take the sum of the first two rows, and assign that value to a new column. Example below:
DF:
ColA ColB
AA 2
AA 1
AA 5
AA 3
BB 9
BB 3
BB 2
BB 12
CC 0
CC 10
CC 5
CC 3
Desired DF:
ColA ColB NewCol
AA 2 3
AA 1 3
AA 5 3
AA 3 3
BB 9 12
BB 3 12
BB 2 12
BB 12 12
CC 0 10
CC 10 10
CC 5 10
CC 3 10
For AA, it looks at ColB, takes the sum of the first two rows, and assigns that summed value to NewCol. I've tried this by building a dictionary: looping through the unique ColA values, creating a subset dataframe of the first two rows, summing, and populating the dictionary, then mapping the dictionary back. But my dataframe is VERY big and it takes forever. Any ideas?
Thank you!
You can use transform with a lambda function to get a new value for each row. Inside the lambda, head(2) takes the first 2 rows of each group and sum() adds them up:
df.groupby('ColA')['ColB'].transform(lambda x: x.head(2).sum())
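A minimal runnable example on the sample data from the question:

import pandas as pd

df = pd.DataFrame({'ColA': ['AA'] * 4 + ['BB'] * 4 + ['CC'] * 4,
                   'ColB': [2, 1, 5, 3, 9, 3, 2, 12, 0, 10, 5, 3]})
# sum of the first two rows of each ColA group, broadcast back to every row
df['NewCol'] = df.groupby('ColA')['ColB'].transform(lambda x: x.head(2).sum())
# AA -> 3, BB -> 12, CC -> 10, matching the desired output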
Here is a snippet of a dataframe I'm trying to analyze. What I want to do is simply subtract the FP_FLOW FORMATTED_ENTRY values from the D8_FLOW FORMATTED_ENTRY values, but only when the X_LOT_NAME is the same. For example, in the X_LOT_NAME column you can see MPACZX2. The D8_FLOW FORMATTED_ENTRY is 12.3% and the FP_FLOW FORMATTED_ENTRY is 7.8%, so the difference between the two would be 4.5%. I want to apply this logic across the whole data set.
It is advisable to first convert your data into a format where the values to be added / subtracted are in the same row, and after that subtract / add the corresponding columns. You can do this using pd.pivot_table. The example below demonstrates this using a sample dataframe similar to what you've shared:
wanted_data
X_LOT_NAME SPEC_TYPE FORMATTED_ENTRY
0 a FP_FLOW 1
1 a D8_FLOW 2
2 c FP_FLOW 3
3 c D8_FLOW 4
pivot_data = pd.pivot_table(wanted_data,values='FORMATTED_ENTRY',index='X_LOT_NAME',columns='SPEC_TYPE')
pivot_data
SPEC_TYPE D8_FLOW FP_FLOW
X_LOT_NAME
a 2 1
c 4 3
After this step, the resultant pivot_data contains the same data, but the columns are D8_FLOW and FP_FLOW, with X_LOT_NAME as the index. Now you can get the intended value in a new column using:
pivot_data['DIFF'] = pivot_data['D8_FLOW'] - pivot_data['FP_FLOW']
Is this what you are looking for?
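If you then want X_LOT_NAME back as an ordinary column rather than the index, a plain reset_index on the pivoted frame does it:

result = pivot_data.reset_index()  # X_LOT_NAME becomes a regular column again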
df.groupby(['x_lot'])['value'].diff()
0 NaN
1 NaN
2 -5.0
3 8.0
4 -3.0
5 NaN
6 -10.0
Name: value, dtype: float64
This is the data I used to get the above results:
x_lot type value
0 mpaczw1 fp 21
1 mpaczw2 d8 12
2 mpaczw2 fp 7
3 mpaczw2 d8 15
4 mpaczw2 fp 12
5 mpaczw3 d8 21
6 mpaczw3 fp 11
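For reference, a runnable reconstruction of that sample data (row order as shown). Note that diff subtracts the previous row within each x_lot group, so each d8/fp pair must be adjacent, and the sign of the result depends on which type comes first within the group.

import pandas as pd

df = pd.DataFrame({'x_lot': ['mpaczw1', 'mpaczw2', 'mpaczw2', 'mpaczw2',
                             'mpaczw2', 'mpaczw3', 'mpaczw3'],
                   'type': ['fp', 'd8', 'fp', 'd8', 'fp', 'd8', 'fp'],
                   'value': [21, 12, 7, 15, 12, 21, 11]})
# difference with the previous row inside each x_lot group
df['diff'] = df.groupby('x_lot')['value'].diff()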
I have the following DataFrame:
name value count total_count
0 A 0 1 20
1 A 1 2 20
2 A 2 2 20
3 A 3 2 20
4 A 4 3 20
5 A 5 3 20
6 A 6 2 20
7 A 7 2 20
8 A 8 2 20
9 A 9 1 20
----------------------------------
10 B 0 10 75
11 B 5 30 75
12 B 6 20 75
13 B 8 10 75
14 B 9 5 75
I would like to pivot the data: group each row by the name value, then create columns based on the value and count columns aggregated into bins.
Explanation: there are 10 possible values, in the range 0-9, and not all of them are present in each group. In the example above, group B is missing the values 1, 2, 3, 4 and 7. I would like to create a histogram with 5 bins, ignore the missing values, and calculate the percentage of count for each bin. The result should look like this:
name 0-1 2-3 4-5 6-7 8-9
0 A 0.150000 0.2 0.3 0.2 0.150000
1 B 0.133333 0.0 0.4 0.4 0.066667
For example, for bin 0-1 of group A the calculation is the sum of count for the values 0 and 1 (1+2), divided by the total_count of group A:
name 0-1
0 A (1+2)/20 = 0.15
I was looking into the hist method and this StackOverflow question, but I'm still struggling to figure out the right approach.
Use pd.cut to bin your feature, then use df.groupby().count() and the .unstack() method to get the dataframe you are looking for. In the groupby you can use any aggregation function (.sum(), .count(), etc.). The code below is a worked example.
import pandas as pd
import numpy as np
df = pd.DataFrame(
    data={'name': ['Group A', 'Group B'] * 5,
          'number': np.arange(0, 10),
          'value': np.arange(30, 40)})
df['number_bin'] = pd.cut(df['number'], bins=np.arange(0,10))
# Option 1: Sums
df.groupby(['number_bin','name'])['value'].sum().unstack(0)
# Options 2: Counts
df.groupby(['number_bin','name'])['value'].count().unstack(0)
The null values in the original data will not affect the result.
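If you want each bin as a share of its group's total rather than a raw count (as in the desired output), one option is to normalize the unstacked counts row-wise. A sketch on the toy frame above; with your own data you would divide by the summed count column instead, as the next answer does:

counts = df.groupby(['number_bin', 'name'])['value'].count().unstack(0)
shares = counts.div(counts.sum(axis=1), axis=0)  # each bin as a fraction of its group's total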
To get the exact result you could try this.
bins=range(10)
res = df.groupby('name')['count'].sum()
intervals = pd.cut(df.value, bins=bins, include_lowest=True)
df1 = (df.groupby([intervals,"name"])['count'].sum()/res).unstack(0)
df1.columns = df1.columns.astype(str) # convert the cols to string
df1.columns = ['a','b','c','d','e','f','g','h','i'] # rename the cols
cols = ['a',"b","d","f","h"]
df1 = df1.add(df1.iloc[:,1:].shift(-1, axis=1), fill_value=0)[cols]
print(df1)
You can manually rename the cols later.
# Output:
a b d f h
name
A 0.150000 0.2 0.3 0.200000 0.15
B 0.133333 NaN 0.4 0.266667 0.20
You can replace the NaN values using df1.fillna(0).
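An alternative sketch that avoids the manual renaming: pass explicit bins and labels to pd.cut and divide by each group's total_count (column names taken from the question). This reproduces the same figures as above, with the empty bin showing 0 instead of NaN:

import pandas as pd

bins = [-0.5, 1.5, 3.5, 5.5, 7.5, 9.5]
labels = ['0-1', '2-3', '4-5', '6-7', '8-9']
df['bin'] = pd.cut(df['value'], bins=bins, labels=labels)
# sum the counts per (name, bin), spread bins into columns, divide by each group's total_count
out = (df.groupby(['name', 'bin'])['count'].sum()
         .unstack(fill_value=0)
         .div(df.groupby('name')['total_count'].first(), axis=0))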
In Python, I have a pandas data frame df.
ID Ref Dist
A 0 10
A 0 10
A 1 20
A 1 20
A 2 30
A 2 30
A 3 5
A 3 5
B 0 8
B 0 8
B 1 40
B 1 40
B 2 7
B 2 7
I want to group by ID and Ref, and take the first row of the Dist column in each group.
ID Ref Dist
A 0 10
A 1 20
A 2 30
A 3 5
B 0 8
B 1 40
B 2 7
And I want to sum up the Dist column in each ID group.
ID Sum
A 65
B 55
I tried this for the first step, but it gives me just the row index and Dist, so I cannot move on to the second step.
df.groupby(['ID', 'Ref'])['Dist'].head(1)
It'd be wonderful if somebody could help me with this.
Thank you!
I believe this is what you're looking for.
For the first step you need first(), since you want the first row of each group. Once you've done that, use reset_index() so you can apply another groupby and sum by ID.
df.groupby(['ID','Ref'])['Dist'].first()\
.reset_index().groupby(['ID'])['Dist'].sum()
ID
A 65
B 55
Just drop_duplicates before the groupby. The default behavior is to keep the first duplicate row, which is what you want.
df.drop_duplicates(['ID', 'Ref']).groupby('ID').Dist.sum()
#A 65
#B 55
#Name: Dist, dtype: int64
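For completeness, a quick runnable check of both approaches on the sample frame (data copied from the question):

import pandas as pd

df = pd.DataFrame({'ID': ['A'] * 8 + ['B'] * 6,
                   'Ref': [0, 0, 1, 1, 2, 2, 3, 3, 0, 0, 1, 1, 2, 2],
                   'Dist': [10, 10, 20, 20, 30, 30, 5, 5, 8, 8, 40, 40, 7, 7]})
# first row per (ID, Ref), then sum per ID
a = df.groupby(['ID', 'Ref'])['Dist'].first().reset_index().groupby('ID')['Dist'].sum()
# drop duplicate (ID, Ref) pairs, then sum per ID
b = df.drop_duplicates(['ID', 'Ref']).groupby('ID')['Dist'].sum()
# both give A: 65, B: 55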