How to count ids per name in a dataframe - python

I have a list of names:
lst = ['Albert', 'Carl', 'Julian', 'Mary']
and I have a DF:
target   id   name
A        100  Albert
A        110  Albert
B        200  Carl
D        500  Mary
E        235  Mary
I want to make another dataframe counting how many ids there are per name in lst:
lst_names  Count
Albert     2
Carl       1
Julian     0
Mary       2
What's the most efficient way to do this considering the list of names has 12k unique names on it?
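For reference, a minimal reproducible setup matching the frames above (values copied straight from the question):

import pandas as pd

lst = ['Albert', 'Carl', 'Julian', 'Mary']
df = pd.DataFrame({
    'target': ['A', 'A', 'B', 'D', 'E'],
    'id': [100, 110, 200, 500, 235],
    'name': ['Albert', 'Albert', 'Carl', 'Mary', 'Mary'],
})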

Check with value_counts
pd.Categorical(df['name'], lst).value_counts()
Out[894]:
Albert 2
Carl 1
Julian 0
Mary 2
dtype: int64
Or
df['name'].value_counts().reindex(lst,fill_value=0)
Out[896]:
Albert 2
Carl 1
Julian 0
Mary 2
Name: name, dtype: int64

You can use value_counts, then create an empty Series with lst as the index, add the two together, and fill the NaN values with 0:
(df['name'].value_counts() + pd.Series(index=lst, dtype=int)).fillna(0).astype(int)
Output:
Albert 2
Carl 1
Julian 0
Mary 2
Name: count, dtype: int64
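All of these snippets return a Series; if you want a two-column dataframe matching the lst_names / Count layout from the question, one way (a sketch building on the reindex answer above) is:

out = (df['name'].value_counts()
         .reindex(lst, fill_value=0)
         .rename_axis('lst_names')
         .reset_index(name='Count'))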

Related

Count occurrence of a max value within aggregation

I have a table like this:
Column1   Column2
John      2
John      8
John      8
John      8
Robert    5
Robert    5
Robert    1
Carl      8
Carl      7
Now what I want is to aggregate this DataFrame by Column1 and get the max value as well as to count how many times does the given max value occurs for every group.
So the output should look like this:
Column1   Max   Count_of_Max
John      8     3
Robert    5     2
Carl      8     1
I've been trying to do something like this:
def Max_Count(x):
    a = df.loc[x.index]
    return a.loc[a['Column2'] == a['Column2'].max(), 'Column2'].count()

df.groupby(["Column1"]).agg({'Column2': ["max", Max_Count]}).reset_index()
But it's not really working :(
What would be the way get the desired result?
df.groupby('Column1').agg({
    'Column2': ['max', lambda x: (x == max(x)).sum()]
}).rename(columns={'max': 'Max', '<lambda_0>': 'Count_of_Max'})
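An equivalent spelling with named aggregation (my sketch, not part of the original answer) avoids the MultiIndex columns and the fragile '<lambda_0>' rename; sort=False keeps the group order shown in the question:

out = (df.groupby('Column1', sort=False)
         .agg(Max=('Column2', 'max'),
              Count_of_Max=('Column2', lambda x: x.eq(x.max()).sum()))
         .reset_index())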

Pandas - dense rank but keep current group numbers

I'm dealing with pandas dataframe and have a frame like:
data = {
    "name": ["Andrew", "Andrew", "James", "James", "Mary", "Andrew", "Michael"],
    "id": [3, 3, 1, 0, 0, 0, 2]
}
df = pd.DataFrame(data)
----------------------
name id
0 Andrew 3
1 Andrew 3
2 James 1
3 James 0
4 Mary 0
5 Andrew 0
6 Michael 2
I'm trying to write code that groups values by the "name" column while keeping the current group numbers.
If the value is 0, it means there is no assignment.
For the example above, assign 3 to each occurrence of Andrew and 1 to each occurrence of James. Mary has no assignment, so she should get the next unused number.
The expected output:
name id
0 Andrew 3
1 Andrew 3
2 James 1
3 James 1
4 Mary 4
5 Andrew 3
6 Michael 2
I've spent time already trying to figure this out. I managed to get to something like this:
df.loc[df["id"].eq(0), "id"] = df['name'].rank(method='dense').astype(int)
The issue with the above is that it ignores the records equal to 0, so the numbers are incorrect. I also tried removing the zero values first, but then the existing numbering is not preserved.
Can you please help me?
Replace the 0 values with missing values, use GroupBy.transform with 'first' to propagate the existing id within each name, and then fill the remaining missing values with Series.rank plus the maximal id, converting back to integers:
import numpy as np

df = df.replace({'id': {0: np.nan}})
df['id'] = df.groupby('name')['id'].transform('first')
s = df.loc[df['id'].isna(), 'name'].rank(method='dense') + df['id'].max()
df['id'] = df['id'].fillna(s).astype(int)
print(df)
name id
0 Andrew 3
1 Andrew 3
2 James 1
3 James 1
4 Mary 4
5 Andrew 3
6 Michael 2
IIUC you can first fill in the non-zero IDs with groupby.transform('max') to get the max existing ID per name, then assign the names without an ID the next available IDs on the masked data (you can use factorize or rank, as you wish):
# fill existing non-zero IDs
s = df.groupby('name')['id'].transform('max')
m = s.eq(0)
df['id'] = s.mask(m)
# add new ones
df.loc[m, 'id'] = pd.factorize(df.loc[m, 'name'])[0]+df['id'].max()+1
# or rank, although factorize is more appropriate for non numerical data
# df.loc[m, 'id'] = df.loc[m, 'name'].rank(method='dense')+df['id'].max()
# optional, if you want integers
df['id'] = df['id'].convert_dtypes()
output:
name id
0 Andrew 3
1 Andrew 3
2 James 1
3 James 1
4 Mary 4
5 Andrew 3
6 Michael 2
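Both answers assume each name carries at most one distinct non-zero id (otherwise 'first' and 'max' can disagree). A quick sanity check on the original frame, my addition rather than part of either answer:

# every name should map to at most one distinct non-zero id
assert df.loc[df['id'] != 0].groupby('name')['id'].nunique().le(1).all()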

Split Columns in pandas with str.split and keep values

So I am stuck with a problem here:
I have a pandas dataframe which looks like the following:
ID   Name     Value
0    Peter    21,2
1    Frank    24
2    Tom      23,21/23,60
3    Ismael   21,2/ 21,54
4    Joe      23,1
and so on...
What I am trying to do is split the "Value" column on the forward slash (/) but keep all the values that do not contain this pattern.
Like here:
ID   Name     Value
0    Peter    21,2
1    Frank    24
2    Tom      23,21
3    Ismael   21,2
4    Joe      23,1
How can I achieve this? I tried the str.split method, but it's not giving me the solution I want. Instead, it returns NaN, as can be seen in the following.
My code: df['Value'] = df['Value'].str.split('/', expand=True)[0]
Returns:
ID   Name     Value
0    Peter    NaN
1    Frank    NaN
2    Tom      23,21
3    Ismael   21,2
4    Joe      NaN
All I need is the very first Value before the '/' is coming.
Appreciate any kind of help!
Remove expand=True to return lists, and add str[0] to select the first value:
df['Value'] = df['Value'].str.split('/').str[0]
print (df)
ID Name Value
0 0 Peter 21,2
1 1 Frank 24
2 2 Tom 23,21
3 3 Ismael 21,2
4 4 Joe 23,1
If performance is important use list comprehension:
df['Value'] = [x.split('/')[0] for x in df['Value']]
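One caveat with the list comprehension (my note, not from the original answer): it assumes every value is a string. If the column can also hold NaN or plain numbers, a guarded variant avoids the AttributeError:

df['Value'] = [x.split('/')[0] if isinstance(x, str) else x for x in df['Value']]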
pandas.Series.str.replace with regex (note: in pandas 2.0+ the default is a literal match, so regex=True must be passed explicitly)
df.assign(Value=df.Value.str.replace('/.*', '', regex=True))
ID Name Value
0 0 Peter 21,2
1 1 Frank 24
2 2 Tom 23,21
3 3 Ismael 21,2
4 4 Joe 23,1
Optionally, you can assign the result directly back to the dataframe:
df['Value'] = df.Value.str.replace('/.*', '', regex=True)

Pandas: Cumulative count from two columns

winner   loser   winner_matches   loser_matches
Dave     Harry   1                1
Jim      Dave    1                2
Dave     Steve   3                1
I'm trying to build a running count of how many matches a player has participated in, based on the name's appearance in either the winner or loser column (i.e., Dave above has a running count of 3 since he's been in every match). I'm new to pandas and have tried a few combinations of cumcount and groupby, but I'm not sure whether I just need to loop over the dataset manually and store all the names myself.
EDIT: to clarify, I need the running totals in the dataframe as shown above and not just a Series printed out later on! Thanks
First create a MultiIndex Series with DataFrame.stack, then use GroupBy.cumcount; to get a DataFrame back, add unstack with add_suffix:
print (df)
winner loser
0 Dave Harry
1 Jim Dave
2 Dave Steve
s = df.stack()
#if multiple columns in original df
#s = df[['winner','loser']].stack()
df1 = s.groupby(s).cumcount().add(1).unstack().add_suffix('_matches')
print (df1)
winner_matches loser_matches
0 1 1
1 1 2
2 3 1
Last, append to the original DataFrame with join:
df = df.join(df1)
print (df)
winner loser winner_matches loser_matches
0 Dave Harry 1 1
1 Jim Dave 1 2
2 Dave Steve 3 1
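Why stack rather than concat? A note of my own: stack interleaves the two columns row by row, so cumcount sees appearances in match order; pd.concat([df['winner'], df['loser']]) would process every winner before any loser and give wrong running totals. A small check of the ordering:

s = df[['winner', 'loser']].stack()
# row-major order: winner then loser within each match
print(s.tolist())  # ['Dave', 'Harry', 'Jim', 'Dave', 'Dave', 'Steve']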
You can flatten the two columns and count with value_counts:
pd.Series(df[['winner','loser']].values.flatten()).value_counts()
[out]
Dave 3
Jim 1
Harry 1
Steve 1

Flatten a pandas dataframe column

I currently have the following column:
0 [Joe]
1 John
2 Mary
3 [Joey]
4 Harry
5 [Susan]
6 Kevin
I can't seem to remove the [] without turning the rows that contain [] into NaN.
To be clear, I want the column to look like this:
0 Joe
1 John
2 Mary
3 Joey
4 Harry
5 Susan
6 Kevin
Can anyone help?
Your title seems to imply that some elements of your series are lists.
setup
s = pd.Series([['Joe'], 'John', 'Mary', ['Joey'], 'Harry', ['Susan'], 'Kevin'])
s
0 [Joe]
1 John
2 Mary
3 [Joey]
4 Harry
5 [Susan]
6 Kevin
dtype: object
option 1
apply with pd.Series
s.apply(pd.Series).squeeze()
0 Joe
1 John
2 Mary
3 Joey
4 Harry
5 Susan
6 Kevin
Name: 0, dtype: object
Try this:
df['column_name'] = df['column_name'].apply(lambda x: str(x).strip("'[]") if isinstance(x, list) else x)
Why not just do s.astype(str).str.strip("'[]")
or
s.map(lambda x: x if not isinstance(x, list) else x[0])
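For list-valued entries specifically, a newer alternative (my addition; requires pandas 0.25+) is Series.explode, which unwraps one-element lists and passes scalars through unchanged:

s = pd.Series([['Joe'], 'John', 'Mary', ['Joey'], 'Harry', ['Susan'], 'Kevin'])
s = s.explode()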
