I've heard in Pandas there's often multiple ways to do the same thing, but I was wondering –
If I'm trying to group data by a value within a specific column and count the number of items with that value, when does it make sense to use df.groupby('colA').count() and when does it make sense to use df['colA'].value_counts() ?
There is a difference. value_counts returns:
The resulting object will be in descending order so that the first element is the most frequently-occurring element.
but count does not; it sorts the output by the index (created from the column passed to groupby('col')).
df.groupby('colA').count()
aggregates all columns of df with the count function, so it counts values excluding NaNs.
So if you need to count only one column, use:
df.groupby('colA')['colA'].count()
Sample:
import pandas as pd
import numpy as np

df = pd.DataFrame({'colB':list('abcdefg'),
                   'colC':[1,3,5,7,np.nan,np.nan,4],
                   'colD':[np.nan,3,6,9,2,4,np.nan],
                   'colA':['c','c','b','a',np.nan,'b','b']})
print (df)
colA colB colC colD
0 c a 1.0 NaN
1 c b 3.0 3.0
2 b c 5.0 6.0
3 a d 7.0 9.0
4 NaN e NaN 2.0
5 b f NaN 4.0
6 b g 4.0 NaN
print (df['colA'].value_counts())
b 3
c 2
a 1
Name: colA, dtype: int64
print (df.groupby('colA').count())
colB colC colD
colA
a 1 1 1
b 3 2 2
c 2 2 1
print (df.groupby('colA')['colA'].count())
colA
a 1
b 3
c 2
Name: colA, dtype: int64
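As a quick cross-check (a minimal sketch, assuming the df built in the sample above): once the value_counts result is sorted by its index, the two approaches give the same counts.
vc = df['colA'].value_counts().sort_index()
gc = df.groupby('colA')['colA'].count()
# same counts in the same order once value_counts is re-sorted by index
print((vc.values == gc.values).all())
True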
groupby and value_counts are quite different tools. Historically you could not perform value_counts on a DataFrame at all (DataFrame.value_counts only arrived in pandas 1.1); it is primarily a Series method.
value_counts is limited to a single column or Series, and its sole purpose is to return a Series of value frequencies.
groupby returns a GroupBy object, so you can perform statistical computations over it. When you do df.groupby(col).count(), it returns the number of non-null values present in each column, per group of the column(s) passed to groupby.
When should value_counts be used, and when should groupby.count be used?
Let's take an example:
import pandas as pd

df = pd.DataFrame({'id': [1, 2, 3, 4, 2, 2, 4],
                   'color': ["r", "r", "b", "b", "g", "g", "r"],
                   'size': [1, 2, 1, 2, 1, 3, 4]})
Groupby count:
df.groupby('color').count()
id size
color
b 2 2
g 2 2
r 3 3
groupby count is generally used for getting the number of valid (non-NaN) values present in each of the columns, with respect to the column(s) specified in groupby; NaN values are excluded.
To find the frequency of the grouping column itself using groupby, you need to aggregate against that column, as #jez did above (perhaps value_counts was implemented precisely to avoid this and make developers' lives easier).
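As a small sketch of that pattern with the df above, aggregating the color column against itself gives the frequencies directly:
# frequency of each color, using the grouping column itself as the aggregated column
print(df.groupby('color')['color'].count())
color
b    2
g    2
r    3
Name: color, dtype: int64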
Value Counts:
df['color'].value_counts()
r 3
g 2
b 2
Name: color, dtype: int64
value_counts is generally used for finding the frequency of the values present in one particular column.
In conclusion:
.groupby(col).count() should be used when you want the number of valid (non-null) values present in each column, per group of the specified col.
.value_counts() should be used to find the frequencies of a Series.
In simple words: .value_counts() returns a Series containing counts of unique rows in the DataFrame, i.e. it counts each distinct combination of column values and reports how many times that combination occurs:
imagine we have a dataframe like:
import pandas as pd

df = pd.DataFrame({'first_name': ['John', 'Anne', 'John', 'Beth'],
                   'middle_name': ['Smith', pd.NA, pd.NA, 'Louise']})
first_name middle_name
0 John Smith
1 Anne <NA>
2 John <NA>
3 Beth Louise
then we apply value_counts on it:
df.value_counts()
first_name middle_name
Beth Louise 1
John Smith 1
dtype: int64
as you can see it didn't count rows with NA values.
count(), however, counts the non-NA cells for each column or row.
in our example:
df.count()
first_name 4
middle_name 2
dtype: int64
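If you do want those rows included, DataFrame.value_counts also takes a dropna flag (a minimal sketch; the dropna parameter needs pandas 1.3 or newer):
# count unique rows without dropping the ones that contain <NA>
print(df.value_counts(dropna=False))
# now all four rows are counted, e.g. ('John', <NA>) appears with count 1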
Related
Context: I have 5 years of weight data. The first column is the date (month and day); the succeeding columns are the years, with the corresponding weight for each day of the month. I want a full plot of all of my data, among other things, so I want to combine everything into just two columns: the first column is the dates from 2018 to 2022, and the second column is the weight corresponding to each date. I have managed the date part, but can't combine the weight data. In essence, I want to turn ...
0 1
0 1 4.0
1 2 NaN
2 3 6.0
Into ...
0
0 1
1 2
2 3
3 4
4 NaN
5 6.0
pd.concat only puts the year columns next to each other. .join, .merge, melt, stack, and agg don't work either. How do I do this?
sample code:
import pandas as pd
import numpy as np
df1 = pd.DataFrame({'2018': [1, 2, 3]})
df2 = pd.DataFrame({'2019': [4, np.nan, 6]})
merged_df = pd.concat([df1, df2], axis=1, ignore_index=True)
print(merged_df)
P.S. I particularly don't want to input any index names (like id_vars="2018") because I want this process to be automated as the years go by with more data.
I think np.ravel(merged_df,order='F') will do the job for you.
If you want it in the form of a dataframe then pd.DataFrame(np.ravel(merged_df,order='F')).
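Roughly what that looks like on the sample frames from the question (a sketch; order='F' walks the values column by column, which is the order wanted here):
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'2018': [1, 2, 3]})
df2 = pd.DataFrame({'2019': [4, np.nan, 6]})
merged_df = pd.concat([df1, df2], axis=1)

# flatten column by column into one long array, then wrap it back into a DataFrame
flat = pd.DataFrame(np.ravel(merged_df, order='F'))
print(flat)
     0
0  1.0
1  2.0
2  3.0
3  4.0
4  NaN
5  6.0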
It's not fully clear what your input/output is, but based on your first example, you can use concat like this:
pd.concat([df["0"], df["1"].rename("0")], ignore_index=True)
Output :
0 1.0
1 2.0
2 3.0
3 4.0
4 NaN
5 6.0
Name: 0, dtype: float64
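Since new year columns will keep arriving, the same idea can also be written without hard-coding any column name (a sketch that stacks whatever columns merged_df currently has, in order):
# concatenate every column of merged_df into a single Series
out = pd.concat([merged_df[c] for c in merged_df.columns], ignore_index=True)
print(out)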
I have a dataframe that tracks the changes of an object identified by an id. Instead of each row representing a change of state, I want one row per object, with all of its changes tracked across columns.
import pandas as pd
import numpy as np
df1 = pd.DataFrame({'ID': ['1','2','3','1','2','1','4'],
                    'Original_Status': ['Admitted','Admitted','Admitted','Probation','LateAdmission','Admitted','Admitted'],
                    'New_Status': ['Probation','LateAdmission','Pass','Admitted','Pass','Pass','Fail']})
df2 = pd.DataFrame({'ID': ['1','2','3','4'],
                    'Original_Status_1': ['Admitted','Admitted','Admitted','Admitted'],
                    'New_Status_1': ['Probation','LateAdmission','Pass','Fail'],
                    'Original_Status_2': ['Probation','LateAdmission',np.nan,np.nan],
                    'New_Status_2': ['Admitted','Pass',np.nan,np.nan],
                    'Original_Status_3': ['Admitted',np.nan,np.nan,np.nan],
                    'New_Status_3': ['Pass',np.nan,np.nan,np.nan]})
ID Original_Status New_Status
0 1 Admitted Probation
1 2 Admitted LateAdmission
2 3 Admitted Pass
3 1 Probation Admitted
4 2 LateAdmission Pass
5 1 Admitted Pass
6 4 Admitted Fail
Original Dataframe
Change to:
ID Original_Status_1 New_Status_1 Original_Status_2 New_Status_2 Original_Status_3 New_Status_3
0 1 Admitted Probation Probation Admitted Admitted Pass
1 2 Admitted LateAdmission LateAdmission Pass NaN NaN
2 3 Admitted Pass NaN NaN NaN NaN
3 4 Admitted Fail NaN NaN NaN NaN
New Dataframe
I was able to achieve this outcome using loops, but I'd prefer a more succinct solution if possible.
This adds a column to df1 that counts each occurrence of the 'ID', then uses pd.pivot to make the wide df with MultiIndex columns. The steps after the pivot flatten the column names and put them in the right order.
df1['occurrence'] = df1.groupby('ID').cumcount()
df2 = df1.pivot(
index='ID',
values=['Original_Status','New_Status'],
columns='occurrence',
)
df2.columns = [s+'_'+str(o+1) for s,o in df2.columns]
c_order = sorted(df2.columns, key = lambda c: c[-1]) #re-order the columns
df2 = df2[c_order]
df2
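If you also want ID back as a regular column, as in the target frame shown in the question, a reset_index at the end should do it (a small sketch):
# move ID from the index back into an ordinary column to match the target layout
df2 = df2.reset_index()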
I have two dataframes, df (with 15000 rows) and df1 (with 20000 rows), where df looks like:
Number Color Code Quantity
1 Red 12380 2
2 Bleu 14440 3
3 Red 15601 1
and df1 has two columns, Code and Quantity, where I want to fill the Quantity column under certain conditions using Python, in order to obtain this:
Code Quantity
12380 2
15601 1
15640 1
14400 0
The conditions that I want to take into consideration are:
If the last two characters of the Code column of df1 are both zero, I want to have 0 in the Quantity column of df1.
If I don't find the Code in df, I put 1 in the Quantity column of df1.
Otherwise I take the quantity value from df.
Let us try:
mask = df1['Code'].astype(str).str[-2:].eq('00')
mapped = df1['Code'].map(df.set_index('Code')['Quantity'])
df1['Quantity'] = mapped.mask(mask, 0).fillna(1)
Details:
Create a boolean mask specifying the condition where the last two characters of Code are both 0:
>>> mask
0 False
1 False
2 False
3 True
Name: Code, dtype: bool
Using Series.map, map the values in the Code column of df1 to the Quantity column in df based on the matching Code:
>>> mapped
0 2.0
1 1.0
2 NaN
3 NaN
Name: Code, dtype: float64
Mask the values in the mapped column above where the boolean mask is True (replacing them with 0), and lastly fill the remaining NaN values with 1:
>>> df1
Code Quantity
0 12380 2.0
1 15601 1.0
2 15640 1.0
3 14400 0.0
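Putting the three steps together on small frames rebuilt from the question (a self-contained sketch; only the columns the logic needs are included):
import pandas as pd

# frames rebuilt from the question
df = pd.DataFrame({'Code': [12380, 14440, 15601], 'Quantity': [2, 3, 1]})
df1 = pd.DataFrame({'Code': [12380, 15601, 15640, 14400]})

# 0 where Code ends in '00', the matching Quantity from df where it exists,
# and 1 for codes not found in df
mask = df1['Code'].astype(str).str[-2:].eq('00')
mapped = df1['Code'].map(df.set_index('Code')['Quantity'])
df1['Quantity'] = mapped.mask(mask, 0).fillna(1)
print(df1)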
I have a dataframe with values spread over several columns. I want to calculate the mean value of all items from specific columns.
All the solutions I looked up end up giving me either the separate means of each column or the mean of the means of the selected columns.
E.g. my Dataframe looks like this:
Name a b c d
Alice 1 2 3 4
Alice 2 4 2
Alice 3 2
Alice 1 5 2
Ben 3 3 1 3
Ben 4 1 2 3
Ben 1 2 2
And I want to see the mean of the values in columns b & c for each "Alice":
When I try:
df[df["Name"]=="Alice"][["b","c"]].mean()
The result is:
b 2.00
c 4.00
dtype: float64
In another post I found a suggestion to try a "double" mean, one for each axis, e.g.:
df[df["Name"]=="Alice"][["b","c"]].mean(axis=1).mean()
But the result was then:
3.00
which is the mean of the means of both columns.
I am expecting a way to calculate:
(2 + 3 + 4 + 5) / 4 = 3.50
Is there a way to do this in Python?
You can use numpy's np.nanmean here; it will simply see your section of the dataframe as an array and calculate the mean over the entire section by default:
>>> np.nanmean(df.loc[df['Name'] == 'Alice', ['b', 'c']])
3.5
Or if you want to group by name, you can first stack the dataframe, like:
>>> df[['Name','b','c']].set_index('Name').stack().reset_index().groupby('Name').agg('mean')
0
Name
Alice 3.500000
Ben 1.833333
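A pure-pandas equivalent for a single name is to stack the selected block and take the mean (a sketch, assuming the blank cells really are NaN rather than empty strings):
# stack() drops the NaN cells, so the mean runs over the remaining b/c values only
print(df.loc[df['Name'] == 'Alice', ['b', 'c']].stack().mean())
3.5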
You can groupby to sum all the values and count them, then divide to get the mean. This way you get the result for all Names at once.
g = df.groupby('Name')[['b', 'c']]
g.sum().sum(1)/g.count().sum(1)
Name
Alice 3.500000
Ben 1.833333
dtype: float64
PS: In your example, it looks like you have empty strings in some cells. That's not advisable, since your columns will end up with object dtype. Use NaNs instead, to take full advantage of vectorized operations.
Assuming all your columns are numeric and the empty cells are NaN, a simple set_index and stack, then a direct mean:
df.set_index('Name')[['b','c']].stack().mean(level=0)
Out[117]:
Name
Alice 3.500000
Ben 1.833333
dtype: float64
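On newer pandas versions, where Series.mean no longer accepts a level argument (deprecated in 1.3, removed in 2.0), the same idea would be written with a groupby on the index level (a sketch):
df.set_index('Name')[['b', 'c']].stack().groupby(level=0).mean()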
(no idea how to introduce a matrix here for readability)
I have two dataframes built with pandas in Python.
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'Index': ['0','1','2'], 'number': [3,'dd',1], 'people': [3,'s',3]})
df1 = df1.set_index('Index')
df2 = pd.DataFrame({'Index': ['0','1','2'], 'quantity': [3,2,'hi'], 'persons': [1,5,np.nan]})
I would like to sum the columns element-wise, based on Index. The columns do not have the same names and may contain strings (I have in fact 50 columns in each df). I want to treat NaN as 0. The result should look like:
df3
Index column 1 column 2
0 6 4
1 nan nan
2 nan nan
I was wondering how this could be done.
Note:
For sure a double while or for loop would do the trick, just not very elegantly...
indices = 0
while indices < len(df1.index):
    columna = 0
    while columna < numbercolumns:
        df3.iloc[indices, columna] = df1.iloc[indices, columna] + df2.iloc[indices, columna]
        columna += 1
    indices += 1
Thank you.
You can try concatenating both dataframes, then adding based on the index group:
df1.columns = df.columns
df1.people = pd.to_numeric(df1.people,errors='coerce')
pd.concat([df,df1]).groupby('Index').sum()
Out:
number people
Index
A 8 5.0
B 2 2.0
C 2 5.0
F 3 3.0