Counting instances in a pandas DF

I have a dataframe that looks like this:
TF name
0 A
1 A
0 A
0 A
1 B
1 B
0 B
1 B
1 B
I need to produce a dataframe that counts how many 0's and 1's each person in my dataframe has.
So the result for the above would be:
name False True
A 3 1
B 1 4
I don't think groupby would work in this instance. Any solution other than looping and counting?

You can perform groupby letting TF be the grouped key. Take the corresponding value_counts of the name column to get distinct counts.
Unstack level=0 of the multi-index series so that a dataframe object gets produced. Finally, rename the integer columns by type-casting them as boolean values.
df.groupby('TF')['name'].value_counts().unstack(0).rename(columns=bool)
To have the column names take on string values:
1) Use a lambda function:
<...operations on df...>.rename(columns=lambda x: str(bool(x)))
2) Or chain the two renames together:
<...operations on df...>.rename(columns=bool).rename(columns=str)
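For reference, here is the whole chain as a minimal runnable sketch on the sample data from the question. The variable name counts is just my choice, and I use the string-label variant so the columns come out as "False" and "True".

```python
import pandas as pd

# Sample data from the question: TF is 0/1, name is the person.
df = pd.DataFrame({
    "TF":   [0, 1, 0, 0, 1, 1, 0, 1, 1],
    "name": ["A", "A", "A", "A", "B", "B", "B", "B", "B"],
})

counts = (
    df.groupby("TF")["name"]
      .value_counts()                          # Series indexed by (TF, name)
      .unstack(0)                              # move the TF level into the columns
      .rename(columns=lambda c: str(bool(c)))  # 0 -> "False", 1 -> "True"
)
print(counts)
```

Each row is a name and each column holds the count for one TF value.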

I would first convert the TF column to boolean, then group by both name and TF, and finally unstack the boolean column TF.
df['TF'] = df['TF'].astype(bool)
df.groupby(['name', 'TF']).size().unstack('TF')
TF    False  True
name
A         3     1
B         1     4
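A different route, not used in either answer, is pd.crosstab, which builds the same kind of count table in one call. This is only a sketch on the question's sample data, and the variable name table is my own.

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    "TF":   [0, 1, 0, 0, 1, 1, 0, 1, 1],
    "name": ["A", "A", "A", "A", "B", "B", "B", "B", "B"],
})

# Rows come from name, columns from the boolean version of TF,
# and each cell is the number of matching rows.
table = pd.crosstab(df["name"], df["TF"].astype(bool))
print(table)
```

The column labels here are the booleans False and True, the same as in the groupby/unstack version.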

Related

Can't sort values after aggregation using Pandas dataframe

I have the following dataframe:
df[['ID','Team']].groupby(['Team']).agg([('total','count')]).reset_index("total").sort_values("count")
Basically, I need to count the number of IDs by Team and then sort by the total number of IDs.
The aggregation part is fine and gives me the expected result. But when I try the sort part I get this:
KeyError: 'Requested level (total) does not match index name (Team)'
What am I doing wrong?

Use named aggregation to specify the new column name in the aggregate function, and remove total from DataFrame.reset_index:
df = pd.DataFrame({
'ID':list('abcdef'),
'Team':list('aaabcb')
})
df = df.groupby('Team').agg(count=('ID','count')).reset_index().sort_values("count")
print (df)
Team count
2 c 1
1 b 2
0 a 3
Your solution can be fixed, starting again from the original df, by selecting the column to process after groupby, passing the new column name together with the aggregate function as a tuple, and removing total from reset_index:
df = df.groupby('Team')['ID'].agg([('count','count')]).reset_index().sort_values("count")
print (df)
Team count
2 c 1
1 b 2
0 a 3
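To show where the KeyError comes from, here is a small sketch on the same sample data. It assumes the output column is called total (my choice, to match the name in the question) and sorts in descending order for illustration.

```python
import pandas as pd

df = pd.DataFrame({"ID": list("abcdef"), "Team": list("aaabcb")})

# After groupby("Team") the only index level is "Team"; "total" is a column.
grouped = df.groupby("Team").agg(total=("ID", "count"))

# Asking reset_index for a level named "total" therefore fails.
try:
    grouped.reset_index("total")
    error = None
except KeyError as exc:
    error = exc
print(repr(error))

# Resetting the default level and sorting by the named column works.
result = grouped.reset_index().sort_values("total", ascending=False)
print(result)
```

The argument of reset_index names an index level to move into the columns, not a column to sort by, which is why "total" is rejected.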

Extract row data from dictionary of dataframes based on filter on a column value

The dictionary dict_set has dataframes as its values.
I'm trying to extract rows from this dictionary of dataframes by filtering each dataframe on the value in its 'A' column.
dict_set={}
dict_set['a']=pd.DataFrame({'A':[1,2,3],'B':[1,2,3]})
dict_set['b']=pd.DataFrame({'A':[1,4,5],'B':[1,5,6]})
df=pd.concat([dict_set[x][dict_set[x]['A']==1] for x in dict_set.keys()],axis=0)
The output is:
A B
0 1 1
0 1 1
But I would want the output to be
A B x
0 1 1 a
0 1 1 b
Basically, I want the value of x to appear in the new dataframe as a column, say column x, so that df['x'] would give me the x values. Is there a simple way to do this?

Try this:
pd.concat([df.query("A == 1") for df in dict_set.values()], keys=dict_set.keys())\
.reset_index(level=0)\
.rename(columns={'level_0':'x'})
Output:
x A B
0 a 1 1
0 b 1 1
Details:
Let's get the dataframes from the dictionary using a list comprehension and filter each dataframe. Here I chose to use query, but you could also use boolean indexing with df[df['A'] == 1]. Then call pd.concat with the keys parameter set to the dictionary keys. Lastly, reset_index with level=0 and rename the resulting column.
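The boolean-indexing variant mentioned above could look like the sketch below. Instead of keys plus reset_index, it tags each filtered frame with its dictionary key through assign, which is my own substitution; it puts x last, as in the layout the question asks for.

```python
import pandas as pd

dict_set = {
    "a": pd.DataFrame({"A": [1, 2, 3], "B": [1, 2, 3]}),
    "b": pd.DataFrame({"A": [1, 4, 5], "B": [1, 5, 6]}),
}

# Filter each frame with a boolean mask and add its dictionary key as column x.
df = pd.concat(
    [frame[frame["A"] == 1].assign(x=key) for key, frame in dict_set.items()]
)
print(df)
```

The original row labels are kept, so both rows still have index 0; pass ignore_index=True to pd.concat if you prefer a fresh index.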

Adding several columns at the same time with multiindex

I have a dataframe with a variable number of columns, which are organized in a MultiIndex on the columns. I'm trying to add several columns into the same MultiIndex structure.
I've tried to add the new columns the way I would if there were only one column, but it doesn't work.
I have tried this:
df = pd.DataFrame(np.random.rand(4,2), columns=pd.MultiIndex.from_tuples([('plus_zero', 'A'), ('plus_zero', 'B')]))
df['plus_one'] = df['plus_zero'] + 1
But I get ValueError: Wrong number of items passed 2, placement implies 1.
The original df should look like this:
  plus_zero
          A         B
0  0.602891  0.701130
1  0.395749  0.960206
2  0.268238  0.140606
3  0.165802  0.971707
And the result I want:
  plus_zero            plus_one
          A         B         A         B
0  0.602891  0.701130  1.602891  1.701130
1  0.395749  0.960206  1.395749  1.960206
2  0.268238  0.140606  1.268238  1.140606
3  0.165802  0.971707  1.165802  1.971707

Using pd.concat:
You must specify the names of the new column groups through keys and set axis=1 (or axis='columns'):
pd.concat([df.loc[:, 'plus_zero'], df.loc[:, 'plus_zero'] + 1],
          keys=['plus_zero', 'plus_one'],
          axis=1)
  plus_zero            plus_one
          A         B         A         B
0  0.049735  0.013907  1.049735  1.013907
1  0.782054  0.449790  1.782054  1.449790
2  0.148571  0.172844  1.148571  1.172844
3  0.875560  0.393258  1.875560  1.393258
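If the frame already has other top-level groups you want to keep, a variation is to build only the new block and append it. This is a sketch with an arbitrary random seed; the names rng, plus_one and result are mine. It passes a dict to pd.concat so the dict key becomes the outer column level.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seed chosen arbitrarily for repeatability
cols = pd.MultiIndex.from_tuples([("plus_zero", "A"), ("plus_zero", "B")])
df = pd.DataFrame(rng.random((4, 2)), columns=cols)

# Wrap the new columns under their own top-level label ...
plus_one = pd.concat({"plus_one": df["plus_zero"] + 1}, axis=1)

# ... then append that block to the existing frame column-wise.
result = pd.concat([df, plus_one], axis=1)
print(result)
```

Everything already in df is carried over untouched, and only the new group has to be named.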

Pandas GroupBy sum concatenates numbers instead of summing them

When I use the following code:
print(self.df.groupby(by=[2])[3].agg(['sum']))
On the following Dataframe:
0 1 2 3 4 5 6 7
0 15 LCU Test 1 308.02 170703 ALCU 4868 MS10
1 16 LCU Test 2 127.37 170703 ALCU 4868 MS10
The sum is not computed correctly: the value column (col 3) comes back as a concatenated string of the values (308.02127.37) instead of their numeric total.

It seems like column 3 holds strings rather than numbers. Did you load your dataframe using dtype=str?
Furthermore, try not to hardcode your column labels. You can use .astype or pd.to_numeric to cast the values and then apply sum:
self.df.groupby(self.df.columns[2])[self.df.columns[3]].agg(
    lambda x: pd.to_numeric(x, errors='coerce').sum()
)
Or
self.df.groupby(self.df.columns[2])[self.df.columns[3]].agg(
    lambda x: x.astype(float).sum()
)
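Another option is to convert the column once, before grouping, so that every later operation sees numbers. The sketch below uses a made-up three-row frame with the same column labels 2 and 3 as in the question; the values in the third row are hypothetical.

```python
import pandas as pd

# Hypothetical reduced frame: column 2 is the group, column 3 the value stored as text.
df = pd.DataFrame({
    2: ["Test 1", "Test 1", "Test 2"],
    3: ["308.02", "127.37", "10"],
})

# Convert once; text that cannot be parsed becomes NaN because of errors="coerce".
df[3] = pd.to_numeric(df[3], errors="coerce")

totals = df.groupby(2)[3].sum()
print(totals)
```

After the conversion the column has a numeric dtype, so sum adds the values instead of joining strings.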

pandas Series to Dataframe using Series indexes as columns

I have a Series, like this:
series = pd.Series({'a': 1, 'b': 2, 'c': 3})
I want to convert it to a dataframe like this:
a b c
0 1 2 3
pd.Series.to_frame() doesn't work; it gives a result like this:
0
a 1
b 2
c 3
How can I construct a DataFrame from Series, with index of Series as columns?

You can also try this:
df = pd.DataFrame(series).transpose()
The transpose() function swaps the index and the columns.
The output looks like this:
a b c
0 1 2 3

You don't need the transposition step, just wrap your Series inside a list and pass it to the DataFrame constructor:
pd.DataFrame([series])
a b c
0 1 2 3
Alternatively, call Series.to_frame, then transpose using the shortcut .T:
series.to_frame().T
a b c
0 1 2 3

You can also try this:
a = pd.Series.to_frame(series)
a['id'] = list(a.index)
Explanation:
The first line converts the Series into a single-column DataFrame.
The second line adds a column to this DataFrame holding the same values as the index.

Try reset_index. It will convert your index into a column in your dataframe.
df = series.to_frame().reset_index()

This
pd.DataFrame([series]) #method 1
produces a slightly different result than
series.to_frame().T #method 2
With method 1, the elements in the resulting dataframe retain their types, e.g. an int64 in the series stays an int64.
With method 2, the elements in the resulting dataframe become objects if there is an object-type element anywhere in the series, e.g. an int64 in the series becomes an object.
This difference may change the behavior of your subsequent operations, depending on the version of pandas.
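A quick way to check this on your own installation is to compare the dtypes directly. The mixed Series below is a made-up example, and the exact dtype reported for the string column may vary between pandas versions.

```python
import pandas as pd

# A Series mixing an int and a string is stored with dtype object.
series = pd.Series({"a": 1, "b": "x"})

method_1 = pd.DataFrame([series])   # list of Series, one row
method_2 = series.to_frame().T      # single column, then transposed

print(method_1.dtypes)
print(method_2.dtypes)
```

Comparing the two printed dtype listings shows whether column a kept its integer type.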
