Pandas grouping and summing just a certain column - python

Below is a minimal example showing the problem I am facing. Let the initial state be the following (I only use a list of dictionaries for demonstration purposes):
A = [{'D': '16.5.2013', 'A': 1, 'B': 0.0, 'C': 2},
     {'D': '16.5.2013', 'A': 1, 'B': 0.0, 'C': 4},
     {'D': '16.5.2013', 'A': 1, 'B': 0.5, 'C': 7}]
df = pd.DataFrame(A)
>>> df
   A    B  C          D
0  1  0.0  2  16.5.2013
1  1  0.0  4  16.5.2013
2  1  0.5  7  16.5.2013
How do I get from df to df_new which is:
A_new = [{'D': '16.5.2013', 'A': 1, 'B': 0.0, 'C': 6},
         {'D': '16.5.2013', 'A': 1, 'B': 0.5, 'C': 7}]
df_new = pd.DataFrame(A_new)
>>> df_new
   A    B  C          D
0  1  0.0  6  16.5.2013
1  1  0.5  7  16.5.2013
The first and second rows of column 'C' are summed because 'B' is the same for those two rows. Everything else is left as is: for instance, column 'A' is not summed and column 'D' is unchanged. How do I get df_new given only df? I would really like to find an elegant solution if possible.
Thanks in advance.

Assuming the other columns are always the same within a group and should not be treated specially:
First create df_new grouped by 'B', where for each column I take the first row in the group:
In [17]: df_new = df.groupby('B', as_index=False).first()
and then calculate the 'C' column specifically as the sum for each group:
In [18]: df_new['C'] = df.groupby('B', as_index=False)['C'].sum()['C']
In [19]: df_new
Out[19]:
     B  A  C          D
0  0.0  1  6  16.5.2013
1  0.5  1  7  16.5.2013
If you have a limited number of columns, you can also do this in one step by specifying the desired function for each column (though the approach above is handier, i.e. less manual, if you have many columns):
In [20]: df_new = df.groupby('B', as_index=False).agg({'A':'first', 'C':'sum', 'D':'first'})
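On newer pandas versions (0.25+), the same one-step aggregation can also be written with named aggregation, which makes the mapping from output column to (input column, function) explicit. A minimal sketch, rebuilding the same df as above:

```python
import pandas as pd

df = pd.DataFrame([
    {'D': '16.5.2013', 'A': 1, 'B': 0.0, 'C': 2},
    {'D': '16.5.2013', 'A': 1, 'B': 0.0, 'C': 4},
    {'D': '16.5.2013', 'A': 1, 'B': 0.5, 'C': 7},
])

# named aggregation: keyword is the output column, value is (input column, function)
df_new = df.groupby('B', as_index=False).agg(
    A=('A', 'first'),
    C=('C', 'sum'),
    D=('D', 'first'),
)
print(df_new)
```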

If A and D are always equal when grouping by B, then you can just group by A, B, and D, and sum C:
df.groupby(['A', 'B', 'D'], as_index=False).agg('sum')
Output:
   A    B          D  C
0  1  0.0  16.5.2013  6
1  1  0.5  16.5.2013  7
Alternatively:
You essentially want to aggregate the data grouped by column 'B'. To aggregate column 'C', you can just use the built-in sum function. For the other columns, you basically want to select a single value, since you believe they are always the same within a group. To do that, write a very simple function that aggregates those columns by taking the first value.
# will take the first value of the grouped data
sole_value = lambda x: list(x)[0]
# dictionary that maps columns to aggregation functions
agg_funcs = {'A': sole_value, 'C': sum, 'D': sole_value}
# group and aggregate
df.groupby('B', as_index=False).agg(agg_funcs)
Output:
     B  A  C          D
0  0.0  1  6  16.5.2013
1  0.5  1  7  16.5.2013
Of course, you really need to be sure that the values in columns 'A' and 'D' are identical within each group; otherwise you might be preserving the wrong data.
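One way to hedge against that risk (a sketch, not part of the original answer) is to count distinct values per group before aggregating; any count above 1 means 'first' would silently discard data:

```python
import pandas as pd

df = pd.DataFrame([
    {'D': '16.5.2013', 'A': 1, 'B': 0.0, 'C': 2},
    {'D': '16.5.2013', 'A': 1, 'B': 0.0, 'C': 4},
    {'D': '16.5.2013', 'A': 1, 'B': 0.5, 'C': 7},
])

# number of distinct values per B group; should be 1 everywhere
# for the columns we assume to be constant within each group
counts = df.groupby('B')[['A', 'D']].nunique()
assert (counts == 1).all().all(), "A or D varies within a B group"
```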

Related

Why won't the elements in the last row of 2 columns in my DF change?

I have letter ratings that I need to rank against each other and then pull the minimum ranking and median rating in a specific format. I've converted the minimum ranking and median rating to the correct specific format EXCEPT for the very last row in those columns. Could someone help me troubleshoot why my last row does not seem to be converting to the correct format?
For context, I have created a dataframe using 3 columns at the end of my csv:
df = pd.read_csv('csv path', usecols=['rating1', 'rating2', 'rating3'])
The data frame looks like this:
  rating1 rating2 rating3
0       D      Dd       c
1       C      Bb       A
2       B      Bb       b
And I created a mapping dictionary that looks like this:
ranking = {
    'D': 1, 'Dd': 1, 'd': 1, 'C': 2, 'Cc': 2, 'c': 2, 'B': 3, 'Bb': 3, 'b': 3, 'A': 4
}
The ranking rules are: 1. if all the ratings are the same, pull the minimum rating; 2. if two of the ratings are the same, pull the lowest rating; 3. if all three ratings differ, filter out the max rating and the min rating. Since I need to rank them and create two columns holding the minimum ranking and medium ranking, my data frame needs to look like this once ranked:
  rating1 rating2 rating3  medium_rating  minimum_rating
0       D      Dd       c              1               1
1       C      Bb       A              3               2
2       B      Bb       b              3               3
My final step is that once I've ranked the medium_rating and the minimum_rating, I need to convert the number rankings back to the rating1 format, so my data frame should FINALLY look like this:
  rating1 rating2 rating3 medium_rating minimum_rating
0       D      Dd       c             D              D
1       C      Bb       A             B              C
2       B      Bb       b             B              B
I created a mapping dictionary to convert the rankings to the rating1 format that looks like this:
rating1_rank = {
    1: 'D', 2: 'C', 3: 'B', 4: 'A'
}
When I try to convert my medium and minimum rating to the rating1_rank format, all rows in these columns are being converted except for the last one. See below for my code:
This is setting up the dataframe columns and the rating dictionaries:
df = pd.read_csv('csv path', usecols=['rating1', 'rating2', 'rating3'])
ranking = {
    'D': 1, 'Dd': 1, 'd': 1, 'C': 2, 'Cc': 2, 'c': 2, 'B': 3, 'Bb': 3, 'b': 3, 'A': 4
}
rating1_rank = {
    1: 'D', 2: 'C', 3: 'B', 4: 'A'
}
Below is the logic for creating the medium rating column and minimum rating:
df['Medium_Rating'] = np.where(df.replace(ranking).nunique(axis=1).isin([1, 2]),
                               df.replace(ranking).min(axis=1),
                               df.replace(ranking).median(axis=1))
df['Minimum_Rating'] = df.replace(ranking).min(axis=1)
This is meant to convert the medium rating and minimum rating elements to rating1_rank:
df[:-1] = df.replace(rating1_rank)
df[:-2] = df.replace(rating1_rank)
My result is:
  rating1 rating2 rating3 medium_rating minimum_rating
0       D      Dd       c             D              D
1       C      Bb       A             B              C
**2     B      Bb       b             3              3**
Why is the last row not converting??
In the present case you can simply use:
df = df.replace(rating1_rank)
print(df)
  rating1 rating2 rating3 Medium_Rating Minimum_Rating
0       D      Dd       c             D              D
1       C      Bb       A             B              C
2       B      Bb       b             B              B
To avoid the risk of inadvertently changing columns we don't want to change (though not really applicable here), select only the last two columns with df.iloc:
df.iloc[:, -2:] = df.iloc[:, -2:].replace(rating1_rank)
Or with df.loc:
cols = ['Medium_Rating', 'Minimum_Rating']
df.loc[:, cols] = df.loc[:, cols].replace(rating1_rank)
As to why your approach isn't working: I think you are misunderstanding the meaning of df[:-1] and df[:-2]. You are selecting rows here. Let's see a print:
print(df[:-1])
  rating1 rating2 rating3 Medium_Rating Minimum_Rating
0       D      Dd       c             D              D
1       C      Bb       A             B              C
So here df[:-1] selects rows 0 up to, but not including, the last row: rows 0 and 1 in this case, since we have rows 0, 1, 2. With df[:-2], along the same lines, you are selecting only the first row, row 0.
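The row-slicing behaviour is easy to see on a toy frame (a standalone sketch with a made-up single-column DataFrame):

```python
import pandas as pd

df = pd.DataFrame({'x': [10, 20, 30]})

# df[:-1] slices ROWS: everything except the last row
print(df[:-1]['x'].tolist())  # [10, 20]

# df[:-2] drops the last two rows, keeping only row 0 here
print(df[:-2]['x'].tolist())  # [10]
```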
Have a further look at the indexing documentation for all the different ways in which you can select data from DataFrames and Series.

use specific columns to map new column with json

I have a data frame with:
A B C
1 3 6
I want to take those 2 columns and create column D that reads {"A": "1", "C": "6"}.
The new dataframe output would be:
A  B  C  D
1  3  6  {"A": "1", "C": "6"}
I have the following code:
df['D'] = df.apply(lambda x: x.to_json(), axis=1)
but this takes all columns, while I only need columns A and C and want to leave B out of the JSON that is created.
Any tips on just targeting the two columns would be appreciated.
It's not exactly what you ask, but you can convert your 2 columns into a dict; then, if you want to export the data in JSON format, use df['D'].to_json():
df['D'] = df[['A', 'C']].apply(dict, axis=1)
print(df)
# Output
   A  B  C                 D
0  1  3  6  {'A': 1, 'C': 6}
For example, export the column D as JSON:
print(df['D'].to_json(orient='records', indent=4))
# Output
[
    {
        "A":1,
        "C":6
    }
]
Use a subset inside the lambda function:
df['D'] = df.apply(lambda x: x[['A', 'C']].to_json(), axis=1)
Or select the columns before apply:
df['D'] = df[['A', 'C']].apply(lambda x: x.to_json(), axis=1)
Or, if dictionaries are acceptable, create them directly:
df['D'] = df[['A','C']].to_dict(orient='records')
print(df)
   A  B  C                 D
0  1  3  6  {'A': 1, 'C': 6}

how to choose multiple columns in aggregate functions?

I have data like this :
A,B,C,D
1,50,1,3.9
2,20,22,1.5
3,10,10,2.3
2,15,11,1.8
1,16,13,4.2
and I want to group them by A, taking the mean for B and C and the sum for D.
the solution would be like this :
df = df.groupby(['A']).agg({
    'B': 'mean', 'C': 'mean', 'D': sum
})
I am asking whether there is a way to choose multiple columns for the same function, rather than repeating it as with B and C.
If you require at most one aggregation per column, you can store the aggregations in a dict {func: col_list}, then unpack it when you aggregate.
d = {'mean': ['B', 'C'], sum: ['D']}
df.groupby(['A']).agg({col: f for f,cols in d.items() for col in cols})
#       B     C    D
# A
# 1  33.0   7.0  8.1
# 2  17.5  16.5  3.3
# 3  10.0  10.0  2.3
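The dict comprehension above just inverts the {func: [columns]} mapping into the {column: func} shape that agg expects. A minimal sketch of that inversion on its own (using string function names throughout for uniformity):

```python
# one aggregation function per column, stored the short way round
d = {'mean': ['B', 'C'], 'sum': ['D']}

# invert {func: [cols]} into the {col: func} form agg expects
agg_map = {col: f for f, cols in d.items() for col in cols}
print(agg_map)  # {'B': 'mean', 'C': 'mean', 'D': 'sum'}
```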

Pandas: convert each row to a <column name,row value> dict and add as a new column

I have a df such that
   STATUS_ID STATUS_NM
0          1         A
1          2         B
2          3         C
3          4         D
I want to perform a row-by-row apply to get a key-value pair for each row in a separate column. The final df should be:
   STATUS
0  {STATUS_ID: 1, STATUS_NM: A}
1  {STATUS_ID: 2, STATUS_NM: B}
2  {STATUS_ID: 3, STATUS_NM: C}
3  {STATUS_ID: 4, STATUS_NM: D}
UPDATE:
I have tried df[cols].apply(pd.Series.to_dict, axis=1) and df[cols].apply(lambda x: x.to_dict(), axis=1) but instead of getting the actual dict, I get
<built-in method values of dict object at 0x00...
I believe it's my version of pandas that is causing the issue. This has been discussed here: https://github.com/pandas-dev/pandas/issues/8735
So the question is whether there's another way to perform the same operation, circumventing this issue. I cannot update my pandas version to 0.17.
df['STATUS'] = df.apply(pd.Series.to_dict, axis=1)
df
Out:
   STATUS_ID STATUS_NM                              STATUS
0          1         A  {'STATUS_NM': 'A', 'STATUS_ID': 1}
1          2         B  {'STATUS_NM': 'B', 'STATUS_ID': 2}
2          3         C  {'STATUS_NM': 'C', 'STATUS_ID': 3}
3          4         D  {'STATUS_NM': 'D', 'STATUS_ID': 4}
If in your real DataFrame you have other columns too, you may need to specify the columns you want to have in the dictionary.
cols = ['STATUS_ID', 'STATUS_NM']
df['STATUS'] = df[cols].apply(pd.Series.to_dict, axis=1)
An alternative would be iterating over the DataFrame:
lst = []
for _, row in df[cols].iterrows():
    lst.append({col: row[col] for col in cols})
This creates a list:
[{'STATUS_ID': 1, 'STATUS_NM': 'A'},
 {'STATUS_ID': 2, 'STATUS_NM': 'B'},
 {'STATUS_ID': 3, 'STATUS_NM': 'C'},
 {'STATUS_ID': 4, 'STATUS_NM': 'D'}]
You can directly assign this to your DataFrame:
df['STATUS'] = lst

Mean of repeated columns in pandas dataframe

I have a dataframe with repeated column names which account for repeated measurements.
from numpy.random import randn

df = pd.DataFrame({'A': randn(5), 'B': randn(5)})
df2 = pd.DataFrame({'A': randn(5), 'B': randn(5)})
df3 = pd.concat([df, df2], axis=1)
df3
          A         B         A         B
0 -0.875884 -0.298203  0.877414  1.282025
1  1.605602 -0.127038 -0.286237  0.572269
2  1.349540 -0.067487  0.126440  1.063988
3 -0.142809  1.282968  0.941925 -1.593592
4 -0.630353  1.888605 -1.176436 -1.623352
I'd like to take the mean of the 'A' columns and of the 'B' columns so that the dataframe shrinks to:
          A         B
0  0.000765  0.491911
1  0.659682  0.222616
2  0.737990  0.498251
3  0.399558 -0.155312
4 -0.903395  0.132627
If I do the typical
df3['A'].mean(axis=1)
I get a Series (with no column name), and I would then have to build a new dataframe from the means of each column group. Also, the .groupby() method apparently doesn't allow you to group by column name: you give it columns and it groups the index instead. Is there a fancy way to do this?
Side question: why does
df = pd.DataFrame({'A': randn(5), 'B': randn(5), 'A': randn(5), 'B': randn(5)})
not generate a 4-column dataframe but merge the same-name columns?
You can use the level keyword (regarding your columns as the first level, level 0, of a column index with only one level in this case):
In [11]: df3
Out[11]:
          A         B         A         B
0 -0.367326 -0.422332  2.379907  1.502237
1 -1.060848  0.083976  0.619213 -0.303383
2  0.805418 -0.109793  0.257343  0.186462
3  2.419282 -0.452402  0.702167  0.216165
4 -0.464248 -0.980507  0.823302  0.900429
In [12]: df3.mean(axis=1, level=0)
Out[12]:
          A         B
0  1.006291  0.539952
1 -0.220818 -0.109704
2  0.531380  0.038334
3  1.560725 -0.118118
4  0.179527 -0.040039
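Note that the level argument of DataFrame.mean was later deprecated and removed in pandas 2.0; on recent versions, the same per-name column mean can be obtained by grouping the transposed frame. A sketch using small deterministic numbers instead of randn:

```python
import numpy as np
import pandas as pd

# duplicate column names, as in df3 above, but with fixed values
df3 = pd.DataFrame(np.arange(8.0).reshape(2, 4), columns=['A', 'B', 'A', 'B'])

# transpose, group the rows (former columns) by name, average, transpose back
means = df3.T.groupby(level=0).mean().T
print(means)
```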
You've created df3 in a strange way; for this simple case, the following would work:
In [86]:
df = pd.DataFrame({'A': randn(5), 'B': randn(5)})
df2 = pd.DataFrame({'A': randn(5), 'B': randn(5)})
print(df)
print(df2)
          A         B
0 -0.732807 -0.571942
1 -1.546377 -1.586371
2  0.638258  0.569980
3 -1.017427  1.395300
4  0.666853 -0.258473

[5 rows x 2 columns]

          A         B
0  0.589185  1.029062
1 -1.447809 -0.616584
2 -0.506545  0.432412
3 -1.168424  0.312796
4  1.390517  1.074129

[5 rows x 2 columns]
In [87]:
(df+df2)/2
Out[87]:
          A         B
0 -0.071811  0.228560
1 -1.497093 -1.101477
2  0.065857  0.501196
3 -1.092925  0.854048
4  1.028685  0.407828

[5 rows x 2 columns]
To answer your side question: this has nothing to do with pandas and everything to do with the dict constructor:
In [88]:
{'A': randn(5), 'B': randn(5), 'A': randn(5), 'B': randn(5)}
Out[88]:
{'B': array([-0.03087831, -0.24416885, -2.29924624,  0.68849978,  0.41938536]),
 'A': array([ 2.18471335,  0.68051101, -0.35759988,  0.54023489,  0.49029071])}
Dict keys must be unique, so my guess is that the constructor just reassigns the values to the pre-existing keys.
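That guess is easy to confirm in plain Python, with no pandas involved: in a dict literal, a repeated key simply overwrites the earlier value:

```python
# duplicate keys in a dict literal: the last value for each key wins
d = {'A': 1, 'B': 2, 'A': 3, 'B': 4}
print(d)  # {'A': 3, 'B': 4}
```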
EDIT
If you insist on having duplicate columns, then you have to create a new dataframe from this, because if you were to update columns 'A' and 'B' in place, the mean would still be duplicated since the columns are repeated:
In [92]:
df3 = pd.concat([df, df2], axis=1)
new_df = pd.DataFrame()
new_df['A'] = df3['A'].sum(axis=1) / df3['A'].shape[1]
new_df['B'] = df3['B'].sum(axis=1) / df3['B'].shape[1]
new_df
Out[92]:
          A         B
0 -0.071811  0.228560
1 -1.497093 -1.101477
2  0.065857  0.501196
3 -1.092925  0.854048
4  1.028685  0.407828

[5 rows x 2 columns]
So the above works with df3, and in fact for an arbitrary number of repeated columns, which is why I am using shape; you could hard-code this to 2 if you knew the columns were only ever duplicated once.
