When I use the following code:
print(self.df.groupby(by=[2])[3].agg(['sum']))
On the following DataFrame:
0 1 2 3 4 5 6 7
0 15 LCU Test 1 308.02 170703 ALCU 4868 MS10
1 16 LCU Test 2 127.37 170703 ALCU 4868 MS10
The sum is not computed correctly because the value column (column 3) is concatenated as a string (308.02127.37) instead of being treated as individual numeric values.
It seems like your third column is of string type. Did you load your dataframe using dtype=str?
Furthermore, try not to hardcode column labels. You can use .astype or pd.to_numeric to cast and then apply sum:
self.df.groupby(self.df.columns[2])[self.df.columns[3]].agg(
lambda x: pd.to_numeric(x, errors='coerce').sum()
)
Or
self.df.groupby(self.df.columns[2])[self.df.columns[3]].agg(
lambda x: x.astype(float).sum()
)
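If the whole column is numeric but stored as text, it can be simpler to convert it once up front instead of inside each aggregation. A minimal runnable sketch, using a small two-column stand-in for the frame above:
import pandas as pd

# Stand-in data: column 2 is the group key, column 3 holds numbers stored as strings.
df = pd.DataFrame({2: ['LCU', 'LCU'], 3: ['308.02', '127.37']})

# Convert once; unparseable values become NaN instead of raising.
df[3] = pd.to_numeric(df[3], errors='coerce')
print(df.groupby(by=[2])[3].agg(['sum']))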
Related
I have dataframe like this:
source_ip dest_ip dest_ip_usage dest_ip_count
0 4:107:27:41 1:23:54:114 2028544 2
1 4:107:27:41 2:112:41:134 3145639 1
2 4:107:27:41 2:112:41:178 4145639 1
3 1:192:221:145 32:107:27:134 6358000 1
4 1:192:344:161 3:243:82:204 6341359 1
I am using syntax: df1 = df.groupby(['source_ip','dest_ip'])['dest_ip_usage'].nlargest(2)
But I am not getting the group indexes; the result is:
0 2028544
1 3145639
2 4145639
3 6358000
4 6341359
This is not possible with nlargest directly when grouping on multiple columns.
If you want to find nlargest with a multi-column groupby, you must use the apply method and call nlargest on the specific column inside it:
df.groupby(['source_ip'])[['dest_ip', 'dest_ip_usage']].apply(lambda x: x.nlargest(2, columns=['dest_ip_usage']))
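A self-contained sketch of this approach on the sample data (column names as in the question):
import pandas as pd

df = pd.DataFrame({
    'source_ip': ['4:107:27:41', '4:107:27:41', '4:107:27:41',
                  '1:192:221:145', '1:192:344:161'],
    'dest_ip': ['1:23:54:114', '2:112:41:134', '2:112:41:178',
                '32:107:27:134', '3:243:82:204'],
    'dest_ip_usage': [2028544, 3145639, 4145639, 6358000, 6341359],
    'dest_ip_count': [2, 1, 1, 1, 1],
})

# For each source_ip keep the two rows with the largest dest_ip_usage;
# the group key and the original row index are preserved in the result.
top2 = df.groupby('source_ip')[['dest_ip', 'dest_ip_usage']].apply(
    lambda g: g.nlargest(2, columns='dest_ip_usage')
)
print(top2)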
Solved Below
Issue: Cannot sort with .groupby() because a single value is a string object: the value 10 at index 5 in ColA is stored as a string, not a number. pd.to_numeric() sorts the column properly if sorting by that column alone.
Question: Can a single value in ColA be converted?
Method:
ind = pd.to_numeric(df['ColA'], errors='coerce').fillna(999).astype(int).argsort()
df = df.reindex(ind)
df = df.groupby(df.ColA).apply(pd.DataFrame.sort_values, 'ColB')
df = df.reset_index(drop=True)
Data in:
Index ColA ColB ColC
0 2 14-5 MumboJumbo
1 4 18-2 MumboJumbo2
2 2 24-5 MumboJumbo3
3 3 23-8 MumboJumbo4
4 2 13-6 MumboJumbo5
5 10 86-1 MumboJumbo6
6 10 42-1 MumboJumbo7
7 2 35-6 MumboJumbo8
8 Load NaN MumboJumbo9
Desired Output:
Index ColA ColB ColC
0 2 13-6 MumboJumbo5
1 2 14-5 MumboJumbo
2 2 24-5 MumboJumbo3
3 2 35-6 MumboJumbo8
4 3 23-8 MumboJumbo4
5 4 18-2 MumboJumbo2
6 10 42-1 MumboJumbo7
7 10 86-1 MumboJumbo6
8 Load NaN MumboJumbo9
Thanks!
I don't really understand the problem in the question, but you can select specific values in a DataFrame using iloc (positional index) or loc (label index). Since you are asking to replace the value at row index 5 in the first column of your dataset, we use iloc.
df.iloc[from_row:to_row,column_position]
To convert the value '10' in ColA at row 5 to the integer 10, simply select it and update it:
df.iloc[5:6,0] = 10
If you don't know the location of the value you need to convert, then iloc and loc are of no help.
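As an aside beyond the original answer: you can locate the non-numeric entries first with a pd.to_numeric coercion mask and convert only the parseable rows; a small sketch, assuming df holds the data above:
import pandas as pd

# Rows where ColA cannot be parsed as a number (here, only the string 'Load').
mask = pd.to_numeric(df['ColA'], errors='coerce').isna()

# Convert everything else in place; the genuine strings stay untouched.
df.loc[~mask, 'ColA'] = pd.to_numeric(df.loc[~mask, 'ColA'])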
There are several ways to convert all values in a column to a specific dtype. One way would be using a lambda-function.
df[column_name].apply(lambda x: int(x))
The lambda above will break because your data also contains the string Load, which can't be converted to an int. One way to solve this is to add a condition to your lambda.
df[column_name].apply(lambda x: int(x) if <condition> else <fallback>)
Given the data in your question the most straightforward way would be to check if x is not 'Load':
df[column_name].apply(lambda x: int(x) if x != 'Load' else x)
This becomes a hassle if you have a lot of actual strings in your column. If you want to use a lambda, you could make a list of the actual strings and then check whether x is in the list.
list_of_strings = ['Load', 'Road', 'Toad']
df[column_name].apply(lambda x: int(x) if x not in list_of_strings else x)
Another way would be to write a separate function that manages the conversion using try/except blocks.
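For instance, a minimal sketch of such a converter (the function name is illustrative, not from the original answer):
def to_int_if_possible(value):
    # Return the value as an int when possible; otherwise return it unchanged.
    try:
        return int(value)
    except (ValueError, TypeError):
        return value

df['ColA'] = df['ColA'].apply(to_int_if_possible)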
I have a dataframe that looks like this:
TF name
0 A
1 A
0 A
0 A
1 B
1 B
0 B
1 B
1 B
I need to produce a resulting dataframe that would count how many 0's and 1's each person in my dataframe has.
So the result for the above would be:
name True False
A 3 1
B 1 4
I don't think groupby would work in this instance. Any solution other than looping and counting?
You can perform groupby letting TF be the grouped key. Take the corresponding value_counts of the name column to get distinct counts.
Unstack level=0 of the multi-index series so that a dataframe object gets produced. Finally, rename the integer columns by type-casting them as boolean values.
df.groupby('TF')['name'].value_counts().unstack(0).rename(columns=bool)
To have the column names take on string values:
1) Use a lambda function:
<...operations on df...>.rename(columns=lambda x: str(bool(x)))
2) Or chain the syntaxes together:
<...operations on df...>.rename(columns=bool).rename(columns=str)
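Put together, a minimal self-contained sketch of the first approach on the sample data:
import pandas as pd

df = pd.DataFrame({'TF':   [0, 1, 0, 0, 1, 1, 0, 1, 1],
                   'name': ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'B']})

result = df.groupby('TF')['name'].value_counts().unstack(0).rename(columns=bool)
print(result)
# TF    False  True
# name
# A         3      1
# B         1      4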
I would first convert your TF column to boolean, then group by both name and TF, and finally unstack the boolean column TF.
df['TF']=df['TF'].astype(bool)
df.groupby(['name', 'TF']).size().unstack('TF')
TF False True
name
A 3 1
B 1 4
I have a dataframe generated by:
df = pd.DataFrame([[100, ' tes t ', 3], [100, np.nan, 2], [101, ' test1', 3 ], [101,' ', 4]])
It looks like
0 1 2
0 100 tes t 3
1 100 NaN 2
2 101 test1 3
3 101 4
I would like to fill column 1 "forward" with test and test1. I believe one approach would be to replace whitespace-only values with np.nan, but that is difficult since the words contain whitespace as well. I could also group by column 0 and then use the first element of each group to fill forward. Could you provide me with some code for both alternatives? I cannot get it coded.
Additionally, I would like to add a column that contains the group means; that is,
the final dataframe should look like this
0 1 2 3
0 100 tes t 3 2.5
1 100 tes t 2 2.5
2 101 test1 3 3.5
3 101 test1 4 3.5
Could you also please advise how to accomplish something like this?
Many thanks; please let me know in case you need further information.
IIUC, you could use str.strip and then check whether the stripped string is empty.
Then perform the groupby operations, filling the NaNs with a forward fill and calculating the means with the groupby.transform function, as shown:
df[1] = df[1].str.strip().dropna().apply(lambda x: np.nan if len(x) == 0 else x)
df[1] = df.groupby(0)[1].ffill()
df[3] = df.groupby(0)[2].transform('mean')
df
Note: if you must instead fill the NaN values with the first element of each group, do this:
df.groupby(0)[1].apply(lambda x: x.fillna(x.iloc[0]))
Breakup of steps:
Since we want to apply the function only to strings, we drop all the NaN values first; otherwise we would get a TypeError, because the column contains both floats and strings, and float has no len method.
df[1].str.strip().dropna()
0 tes t # operates only on indices where strings are present(empty strings included)
2 test1
3
Name: 1, dtype: object
The reindexing part isn't a necessary step, as the function only computes on the indices where strings are present.
Also, the reset_index(drop=True) part was unnecessary, as the groupby returns a series after the forward fill which can be assigned straight back to column 1.
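End to end, a minimal self-contained sketch of the whole transformation; it maps empty strings to NaN with replace, a compact equivalent of the dropna/apply step above:
import numpy as np
import pandas as pd

df = pd.DataFrame([[100, ' tes t ', 3], [100, np.nan, 2],
                   [101, ' test1', 3], [101, ' ', 4]])

# Strip whitespace and turn empty strings into NaN so they can be filled.
df[1] = df[1].str.strip().replace('', np.nan)

# Forward fill within each group of column 0, then add the group means.
df[1] = df.groupby(0)[1].ffill()
df[3] = df.groupby(0)[2].transform('mean')
print(df)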
I'm trying to create a column of microsatellite motifs in a pandas dataframe. I have one column that gives the length of the motif and another that has the whole microsatellite.
Here's an example of the columns of interest.
motif_len sequence
0 3 ATTATTATTATT
1 4 ATCTATCTATCT
2 3 ATCATCATCATC
I would like to slice the values in sequence using the values in motif_len to give a single repeat(motif) of each microsatellite. I'd then like to add all these motifs as a third column in the data frame to give something like this.
motif_len sequence motif
0 3 ATTATTATTATT ATT
1 4 ATCTATCTATCT ATCT
2 3 ATCATCATCATC ATC
I've tried a few things with no luck.
>>df['motif'] = df.sequence.str[:df.motif_len]
>>df['motif'] = df.sequence.str[:df.motif_len.values]
Both make the motif column but all the values are NaN.
I think I understand why these don't work: I'm passing a series/array as the upper index of the slice rather than a value from the motif_len column.
I also tried to create a series by iterating through each row.
Any ideas?
You can call apply on the df, passing axis=1 to apply row-wise, and use the column values to slice the str:
In [5]:
df['motif'] = df.apply(lambda x: x['sequence'][:x['motif_len']], axis=1)
df
Out[5]:
motif_len sequence motif
0 3 ATTATTATTATT ATT
1 4 ATCTATCTATCT ATCT
2 3 ATCATCATCATC ATC
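As an aside beyond the original answer, a plain list comprehension over the two columns performs the same per-row slice without apply; a minimal sketch:
import pandas as pd

df = pd.DataFrame({'motif_len': [3, 4, 3],
                   'sequence': ['ATTATTATTATT', 'ATCTATCTATCT', 'ATCATCATCATC']})

# Slice each sequence by its own motif length, pairing the columns row by row.
df['motif'] = [seq[:n] for seq, n in zip(df['sequence'], df['motif_len'])]
print(df)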