I have a dataframe, df, where I would like to transform and pivot selected values.
I wish to group by id and date, sum the 'pwr' values, and then count the type values.
df

Values that will become column headers: 'hi', 'hey'

id  date  type  pwr  de_id  de_date  de_type  de_pwr  base  base_pos
aa  q1    hey   10   aa     q1       hey      5       200   40
aa  q1    hi    5                                     200   40
aa  q1    hey   5                                     200   40
aa  q2    hey   2    aa     q2       hey      3       200   40
aa  q2    hey   2    aa     q2       hey      3       200   40
bb  q1          0    bb     q1       hi       6       500   10
bb  q1          0    bb     q1       hi       6       500   10
Desired

id  date  hey  hi  total  pwr_sum  hey  hi  total_de  de_pwr_sum  base  base_pos
aa  q1    2    1   3      20       1    0   1         5           200   40
aa  q2    2    0   2      4        2    0   2         6           200   40
bb  q1    0    0   0      0        0    2   2         12          500   10
Doing
sum1 = df.groupby(['id','date']).agg({'pwr': 'sum', 'type': 'count', 'de_pwr': 'sum', 'de_type': 'count'})
pd.pivot_table(df, values = '' , columns = 'type')
Any suggestion will be helpful.
So, this is definitely not a 'clean' way to go about it, but since you have two separate totals summed along columns, I don't know how much cleaner it could get (and the output seems accurate).
You don't mention what aggregation you use to get the base and base_pos values, so I went with mean (you might need to change it).
type_col = pd.crosstab(index=[df['id'], df['date']], columns=df['type'])
type_col['total'] = type_col.sum(axis=1)
pwr_sum = df.groupby(['id', 'date'])['pwr'].sum()
de_type_col = pd.crosstab(index=[df['id'], df['date']], columns=df['de_type'])
de_type_col['total_de'] = de_type_col.sum(axis=1)
pwr_de_sum = df.groupby(['id', 'date'])['de_pwr'].sum()
base_and_pos = df.groupby(['id', 'date'])[['base', 'base_pos']].mean()
out = pd.concat([type_col, pwr_sum, de_type_col, pwr_de_sum, base_and_pos],
                axis=1).fillna(0).astype('int')
Essentially, use crosstab to get value counts and sum them along columns. The index of the resulting DataFrame is the same as that of groupby(['id','date']), so you can then concatenate the groupby results without issue. Repeat the same process for the de columns, apply groupby with your choice of aggregation to the base and base_pos columns, and concatenate all results along axis=1. Obviously, you can group some operations together (such as the pwr sum, de_pwr sum, and base/base_pos aggregation), but you'll then need to reorder your columns to get the desired order; a sketch of that consolidation follows the output below.
Output:

id  date  hey  hi  total  pwr  hey  hi  total_de  de_pwr  base  base_pos
aa  q1    2    1   3      20   1    0   1         5       200   40
aa  q2    2    0   2      4    2    0   2         6       200   40
bb  q1    0    0   0      0    0    2   2         12      500   10
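As a sketch of the consolidation mentioned above (reusing type_col and de_type_col from the code, with mean still assumed for base/base_pos):

nums = df.groupby(['id', 'date']).agg(pwr=('pwr', 'sum'),
                                      de_pwr=('de_pwr', 'sum'),
                                      base=('base', 'mean'),
                                      base_pos=('base_pos', 'mean'))
out = pd.concat([type_col, de_type_col, nums], axis=1).fillna(0).astype('int')
# the numeric block now sits after both count blocks, so reorder the
# columns afterwards to match the desired layout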
Related
I would like to compare the sums of an original df and a rounded df.
If there is a delta between the sums, apply this delta, whether by subtraction or addition, to the last quarter.
The first sum difference, for AA, is 4 (12 - 8 = 4).
The second sum difference, for BB, is 2 (14 - 12 = 2).
Data
original_df
id q121 q221 q321 q421 sum
AA 1 0.5 0.5 6.1 8
BB 1 0.5 6.5 3.1 12
rounded_df
id q121 q221 q321 q421 sum
AA 2 2 2 6 12
BB 2 2 6 4 14
Desired
We've subtracted 4 from 12 to obtain 8 for AA.
We've subtracted 2 from 14 to obtain 12 for BB
(when comparing original to rounded).
Now the new final_df matches the sums of the original_df.
final_df
id q121 q221 q321 q421 sum delta
AA 2 2 2 2 8 4
BB 2 2 6 2 12 2
Doing

Compare sums and create the delta:

final_df['delta'] = np.where(original_df['sum'] == rounded_df['sum'],
                             0, original_df['sum'] - rounded_df['sum'])

Apply the delta to the last quarter of the year: I am still not sure how to complete this second half of the problem. I am still researching; any suggestion is appreciated.
Using sub, filter, update and iloc:
# create the delta as the difference between the sums of the two DFs
df2['delta'] = df2['sum'].sub(df['sum'])

# subtract the delta from the last quarter, obtained using filter;
# create a placeholder df3
df3 = df2.filter(like='q').iloc[:, -1:].sub(df2.iloc[:, -1:].values)
# filter(like='q')        : keep the columns with 'q' in their name
# .iloc[:, -1:]           : of those, select the last column (q421)
# df2.iloc[:, -1:].values : the values of df2's last column (delta)
# the result of the subtraction is stored in df3

# update df2 based on df3
df2.update(df3)
df2
# update() overwrites the matching column of df2 with the values from df3
id q121 q221 q321 q421 sum delta
0 AA 2 2 2 2 12 4
1 BB 2 2 6 2 14 2
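For reference, a minimal end-to-end sketch built from the sample frames above; it also refreshes the sum column (which the output above leaves at the rounded values) so the result matches the desired final_df:

import pandas as pd

df = pd.DataFrame({'id': ['AA', 'BB'],            # original_df
                   'q121': [1, 1], 'q221': [0.5, 0.5],
                   'q321': [0.5, 6.5], 'q421': [6.1, 3.1],
                   'sum': [8, 12]})
df2 = pd.DataFrame({'id': ['AA', 'BB'],           # rounded_df
                    'q121': [2, 2], 'q221': [2, 2],
                    'q321': [2, 6], 'q421': [6, 4],
                    'sum': [12, 14]})

df2['delta'] = df2['sum'].sub(df['sum'])          # 4 and 2
last_q = df2.filter(like='q').columns[-1]         # 'q421'
df2[last_q] -= df2['delta']                       # push delta into the last quarter
df2['sum'] -= df2['delta']                        # sum now matches original_df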
I have a dataframe that looks like this:
Name rent sale
0 A 180 2
1 B 1 4
2 M 12 1
3 O 10 1
4 A 180 5
5 M 2 19
I want to apply a condition: if a row is duplicated and the value in a column is duplicated too, keep only one of them (without summing). For example:
the duplicated row A has the duplicated value 180 in the rent column, so rent stays 180.
Otherwise, make the sum. For example, the duplicated row A has the different values 2 and 5 in the sale column, and the duplicated row M has different values in both the rent and sale columns.
Expected output:
Name rent sale
0 A 180 7
1 B 1 4
2 M 14 20
3 O 10 1
I tried this code but it's not working as I want:
import pandas as pd

df = pd.DataFrame({'Name': ['A', 'B', 'M', 'O', 'A', 'M'],
                   'rent': [180, 1, 12, 10, 180, 2],
                   'sale': [2, 4, 1, 1, 5, 19]})
df2 = df.drop_duplicates().groupby('Name', sort=False, as_index=False).agg(
    Name=('Name', 'first'),
    rent=('rent', 'sum'),
    sale=('sale', 'sum'))
print(df2)
I got this output
Name rent sale
0 A 360 7
1 B 1 4
2 M 14 20
3 O 10 1
You can try summing only the unique values per group:
def sum_unique(s):
    # drop repeated values within the group, then sum what remains
    return s.unique().sum()

df2 = df.groupby('Name', sort=False, as_index=False).agg(
    Name=('Name', 'first'),
    rent=('rent', sum_unique),
    sale=('sale', sum_unique)
)
df2:
Name rent sale
0 A 180 7
1 B 1 4
2 M 14 20
3 O 10 1
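One caveat worth flagging: unique() collapses any repeated value within a group, so two genuinely distinct rows that happen to share the same rent would also be counted once. That fits the question, where an equal value always indicates a duplicate.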
You can first group by Name and rent, and then just by Name:
df2 = df.groupby(['Name', 'rent'], as_index=False).sum().groupby('Name', as_index=False).sum()
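For the sample frame above, this produces the expected output:

  Name  rent  sale
0    A   180     7
1    B     1     4
2    M    14    20
3    O    10     1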
What I want to do is group by column A, take the sum of the first two rows, and then assign that value as a new column. Example below:
DF:
ColA ColB
AA 2
AA 1
AA 5
AA 3
BB 9
BB 3
BB 2
BB 12
CC 0
CC 10
CC 5
CC 3
Desired DF:
ColA ColB NewCol
AA 2 3
AA 1 3
AA 5 3
AA 3 3
BB 9 12
BB 3 12
BB 2 12
BB 12 12
CC 0 10
CC 10 10
CC 5 10
CC 3 10
For AA, it looks at ColB, takes the sum of the first two rows, and assigns that summed value to NewCol. I've tried this by looping through the unique ColA values, creating a subset dataframe of the first two rows of each, summing, populating a dictionary with the values, and then mapping the dictionary back. But my dataframe is VERY big and it takes forever. Any ideas?
Thank you!
You can use transform to get a new value for each row, together with a lambda function. In the lambda you can use head(2) to get the first 2 rows of each group and sum() them:
df.groupby('ColA')['ColB'].transform(lambda x: x.head(2).sum())
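To attach the result as the new column (a minimal usage sketch, assuming df as shown above):

df['NewCol'] = df.groupby('ColA')['ColB'].transform(lambda x: x.head(2).sum())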
I have a dataframe that looks like the one below.
dataframe1 =
In AA BB CC
0 10 1 0
1 11 2 3
2 10 6 0
3 9 1 0
4 10 3 1
5 1 2 0
Now I want to create a dataframe that gives me the count of the mode for each column: for column AA the count is 3 for mode 10, and for column CC the count is 4 for mode 0. But BB has two modes, 1 and 2, so for BB I want the sum of the counts of the modes, i.e. 2 + 2 = 4.
Therefore the final dataframe that I want looks like below.
Columns Counts
AA 3
BB 4
CC 4
How to do it?
Another, slightly more scalable solution using a list comprehension:
pd.concat([df.eq(x) for _, x in df.mode().iterrows()]).sum()
[out]
AA 3
BB 4
CC 4
dtype: int64
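Unpacked step by step (a sketch, assuming df holds just the AA/BB/CC columns as in the output above):

modes = df.mode()  # one row per mode; NaN where a column has fewer modes
# compare df against each mode row (NaN matches nothing), stack the
# boolean frames, and count the True values per column
matches = pd.concat([df.eq(row) for _, row in modes.iterrows()])
counts = matches.sum()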
You can compare each column with its modes and count the matches with sum:

df = pd.DataFrame({'Columns': df.columns,
                   'Val': [df[x].isin(df[x].mode()).sum() for x in df]})
print(df)
Columns Val
0 AA 3
1 BB 4
2 CC 4
First we get the modes of the columns with DataFrame.mode.
Then we compare each column to its modes with Series.isin and sum the matches:
modes = df.iloc[:, 1:].mode()  # skip the 'In' column
data = {col: df[col].isin(modes[col]).sum() for col in df.iloc[:, 1:].columns}
df = pd.DataFrame.from_dict(data, orient='index', columns=['Counts'])
Counts
AA 3
BB 4
CC 4
This uses the pyjanitor module (import janitor), which brings in the groupby_agg transform function and returns a dataframe:
(df.melt(id_vars='In')
   .groupby('variable')
   .agg(numbers=('value', 'value_counts'))
   .groupby_agg(by='variable',
                # subtract the max of numbers (for each group) from
                # each number in the group
                agg=lambda x: x - x.max(),
                agg_column_name='numbers',
                new_column_name='test')
   .query('test == 0')
   .groupby('variable')
   .agg(count=('numbers', 'sum'))
)
count
variable
AA 3
BB 4
CC 4
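The same pipeline can be sketched in plain pandas, without the pyjanitor dependency (column names here mirror the chain above):

m = df.melt(id_vars='In')
counts = (m.groupby('variable')['value']
           .value_counts()
           .rename('numbers')   # avoid a name clash on reset_index
           .reset_index())
top = counts.groupby('variable')['numbers'].transform('max')
result = counts[counts['numbers'] == top].groupby('variable')['numbers'].sum()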
Derive a new pandas column based on the length of strings in other columns
I want to count the number of columns that have a value in each row and create a new column holding that number. For instance, if I have 3 columns and two of them have a value, the new column for that row will have the value 2.
import pandas as pd

df = pd.DataFrame({'ID': ['1', '2', '3'],
                   'J1': ['a', 'ab', ''],
                   'J2': ['22', '', '33']})
print(df)
The output should be like:
  ID  J1  J2  Count_of_cols_have_values
0  1   a  22                          2
1  2  ab                              1
2  3      33                          1
One way could be to check which cells are not equal (DataFrame.ne) to an empty string, and take the sum along the rows:
df['Count_of_cols_have_values'] = df.set_index('ID').ne('').sum(1).values
  ID  J1  J2  Count_of_cols_have_values
0  1   a  22                          2
1  2  ab                              1
2  3      33                          1
Or you can also replace the empty strings with NaNs and use count, which returns the amount of non-NA values:

import numpy as np

df['Count_of_cols_have_values'] = df.set_index('ID').replace('', np.nan).count(1).values
  ID  J1  J2  Count_of_cols_have_values
0  1   a  22                          2
1  2  ab                              1
2  3      33                          1
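If the 'empty' cells might contain whitespace rather than true empty strings (an assumption beyond the sample data), stripping first keeps the count honest:

df['Count_of_cols_have_values'] = (df.set_index('ID')
                                     .apply(lambda s: s.str.strip())
                                     .ne('').sum(1).values)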