I have a Pandas dataframe. My question is: how do I group all the sellers (under sellerUserName) for each date? For example, for any date, e.g. 29/03/2018, I want to retrieve the count of unique sellers.
ScrapeDate sellerUserName
0 29/03/2018 BOB
1 29/03/2018 BOB
2 29/03/2018 BOB
3 29/03/2018 MARY
4 29/03/2018 IAN
5 29/03/2018 ANISA
6 30/03/2018 BOB
7 30/03/2018 BOB
8 30/03/2018 BOB
9 30/03/2018 KARL
10 30/03/2018 KARL
11 30/03/2018 IAN
12 01/04/2018 NGI
13 01/04/2018 NICEE
So the output dataframe should be
ScrapeDate No.of Sellers
0 29/03/2018 4
1 30/03/2018 3
2 01/04/2018 2
Just use nunique:
df.groupby('ScrapeDate')['sellerUserName'].nunique()
Out[38]:
ScrapeDate
01/04/2018 2
29/03/2018 4
30/03/2018 3
Name: sellerUserName, dtype: int64
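If you want the result as a dataframe matching the expected output above, a small follow-up sketch (the column label is taken from the question; note the dates are plain strings here, so the groups sort lexicographically, which is why 01/04/2018 comes first):
# nunique counts distinct sellers per date; reset_index turns the
# grouped Series back into a dataframe with the requested column name
out = (df.groupby('ScrapeDate')['sellerUserName']
         .nunique()
         .reset_index(name='No.of Sellers'))
# For chronological order, parse the dates first, e.g.
# df['ScrapeDate'] = pd.to_datetime(df['ScrapeDate'], format='%d/%m/%Y')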
I have the following dataframe, which is a list of athlete times:
Name Time Excuse Injured Margin
John 15 nan 0 1
John 18 nan 0 5
John 30 leg injury 1 11
John 16 nan 0 4
John 40 nan 0 18
John 15 nan 0 3
John 22 nan 0 6
I am then using a function to get the mean of the previous 5 times (shifted so the current row is excluded):
df['last5'] = df.groupby(['Name']).Time.apply(
lambda x: x.shift().rolling(5, min_periods=1).mean().fillna(.5))
This works, but I want to perform the same calculation while ignoring the Time whenever there is an Excuse, Injured = 1, or Margin > 10.
My Expected output would be:
Name  Time      Excuse  Injured  Margin  last5
John    15         NaN        0       1     .5
John    18         NaN        0       5     15
John    30  leg injury        1      11   16.5
John    16         NaN        0       4   16.5
John    40         NaN        0      18  16.33
John    15         NaN        0       3  16.33
John    22         NaN        0       6     16
Can I just add a condition onto the end of the original function? Thanks in advance!
You can filter the dataframe according to the criteria before applying the rolling calculation, then use bfill() to backward-fill the resulting NaN values as required:
# Compute last5 only on rows that pass all three criteria
df['last5'] = (df[(df['Excuse'].isnull()) & (df['Injured'] != 1) & (df['Margin'] <= 10)]
               .groupby(['Name']).Time.apply(lambda x: x.shift().rolling(5, min_periods=1)
               .mean().fillna(.5)))
# Rows excluded by the filter are left as NaN; backfill within each Name group
df['last5'] = df.groupby(['Name'])['last5'].bfill()
df
Out[1]:
Name Time Excuse Injured Margin last5
0 John 15 NaN 0 1 0.500000
1 John 18 NaN 0 5 15.000000
2 John 30 leg injury 1 11 16.500000
3 John 16 NaN 0 4 16.500000
4 John 40 NaN 0 18 16.333333
5 John 15 NaN 0 3 16.333333
6 John 22 NaN 0 6 16.000000
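For intuition (my note, not from the original answer): rows 2 and 4 fail the filter, so the first assignment leaves their last5 as NaN, and the bfill then copies the next valid value backwards within each Name group. A hypothetical check of the intermediate state, run between the two steps above:
# Same criteria as the filter above
mask = df['Excuse'].isnull() & (df['Injured'] != 1) & (df['Margin'] <= 10)
print(df.loc[~mask, 'last5'])  # rows 2 and 4: NaN before the bfill step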
I have the following Python Pandas Dataframe:
Name Sales Qty
0 JOHN BARNES 10
1 John Barnes 5
2 John barnes 4
3 Peter K. 4
4 Peter K 6
5 Peter Krammer 5
6 Charles 3
7 CHARLES 2
8 Julie Moore 3
9 Julie moore 7
And many more rows, with the same name spelling variations.
I would like to combine the rows with similar values, such that I have the following Dataframe:
Name Sales Qty
0 John Barnes 19
1 Peter Krammer 15
2 Charles 5
3 Julie Moore 10
and many more
How should I do this?
The requirements are vague, as you can see in the comments, but I've tabulated the totals as far as I can tell. I tallied the totals by lowercasing each name and removing the period, then converting the result back to title case with str.title().
import pandas as pd
import io
data = '''
Name Sales
0 "JOHN BARNES" 10
1 "John Barnes" 5
2 "John barnes" 4
3 "Peter K." 4
4 "Peter K" 6
5 "Peter Krammer" 5
6 "Charles" 3
7 "CHARLES" 2
8 "Julie Moore" 3
9 "Julie moore" 7
'''
df = pd.read_csv(io.StringIO(data), sep='\s+')
df['lower'] = df['Name'].str.lower()
# regex=False so the dot is treated literally, not as a regex wildcard
df['lower'] = df['lower'].str.replace('.', '', regex=False)
new = df.groupby('lower')['Sales'].sum().reset_index()
new['lower'] = new['lower'].str.title()
new
lower Sales
0 Charles 5
1 John Barnes 19
2 Julie Moore 10
3 Peter K 10
4 Peter Krammer 5
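As a final touch (a hypothetical addition, not in the original answer), you can rename the helper column back to Name. Note this approach only merges case and punctuation variants, which is why "Peter K" and "Peter Krammer" remain separate rows:
new = new.rename(columns={'lower': 'Name'})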
Let's say I have the following pandas DataFrame:
import pandas as pd
data = [['Alex', 10], ['Bob', 12], ['Clarke', 13], ['Bob', '#'], ['Bob', '#'], ['Bob', '#']]
df = pd.DataFrame(data, columns=['Name', 'Age'])
print(df)
Name Age
0 Alex 10
1 Bob 12
2 Clarke 13
3 Bob #
4 Bob #
5 Bob #
So there are odd rows in the DataFrame for Bob, namely rows 3, 4, and 5: their Age is consistently #, while row 1 shows that Bob's age should be 12.
In this example, it's straightforward to fix this with replace():
df = df.replace("#", 12)
print(df)
Name Age
0 Alex 10
1 Bob 12
2 Clarke 13
3 Bob 12
4 Bob 12
5 Bob 12
However, this wouldn't work for larger dataframes where different names need different replacement values, e.g.
Name Age
0 Alex 10
1 Bob 12
2 Clarke 13
3 Bob #
4 Bob #
5 Bob #
6 Clarke #
where row 6 should be Clarke 13.
How does one replace a # in Age with the correct integer, based on Name? That is, wherever # appears, look up other rows with the same Name value and use their Age.
Try this:
# Build a Name -> Age lookup from the rows whose Age is not '#'
d = df[df['Age'] != '#'].set_index('Name')['Age']
# Map every Name through the lookup to get its valid Age
df['Age'] = df['Name'].replace(d)
Output:
Name Age
0 Alex 10
1 Bob 12
2 Clarke 13
3 Bob 12
4 Bob 12
5 Bob 12
6 Clarke 13
You want to use the valid values to fill the invalid ones? In that case, use map:
# Coerce Age to numeric ('#' becomes NaN) and drop the invalid rows
v = df.assign(Age=pd.to_numeric(df['Age'], errors='coerce')).dropna()
# Map each Name to its valid Age
df['Age'] = df['Name'].map(v.set_index('Name').Age)
df
Name Age
0 Alex 10.0
1 Bob 12.0
2 Clarke 13.0
3 Bob 12.0
4 Bob 12.0
5 Bob 12.0
6 Clarke 13.0
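One caveat worth noting (my observation, not from the original answer): Series.map with a Series argument requires a unique lookup index. That holds here, because each name has exactly one valid age after dropna; if a name could carry several valid rows, deduplicate first, e.g.:
# Hypothetical safeguard: keep the first valid age per name
lookup = v.drop_duplicates('Name').set_index('Name')['Age']
df['Age'] = df['Name'].map(lookup)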
I have a list of names like this
names = ['Josh', 'Jon', 'Adam', 'Barsa', 'Fekse', 'Bravo', 'Talyo', 'Zidane']
and I have a dataframe like this:
Number Names
0 1 Josh
1 2 Jon
2 3 Adam
3 4 Barsa
4 5 Fekse
5 6 Barsa
6 7 Barsa
7 8 Talyo
8 9 Jon
9 10 Zidane
I want to create a dataframe that has all the names from the names list with their corresponding numbers from this dataframe grouped together; for the names that do not have corresponding numbers, there should be an asterisk, like below:
Names Number
Josh 1
Jon 2,9
Adam 3
Barsa 4,6,7
Fekse 5
Bravo *
Talyo 8
Zidane 10
Do we have any built-in functions to get this done?
You can use GroupBy with str.join, then reindex with your names list:
res = df.groupby('Names')['Number'].apply(lambda x: ','.join(map(str, x))).to_frame()\
.reindex(names).fillna('*').reset_index()
print(res)
Names Number
0 Josh 1
1 Jon 2,9
2 Adam 3
3 Barsa 4,6,7
4 Fekse 5
5 Bravo *
6 Talyo 8
7 Zidane 10
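A slightly shorter variant (a sketch, not from the original answer) casts Number to string up front, so GroupBy.agg can use ','.join directly, and lets reindex supply the asterisk via fill_value:
res = (df.assign(Number=df['Number'].astype(str))
         .groupby('Names')['Number'].agg(','.join)
         .reindex(names, fill_value='*')
         .reset_index())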
I have two pandas Series and would simply like to compare their string values, returning the strings (and maybe the indices too) of the values they have in common, e.g. Hannah, Frank, and Ernie in the example below:
print(x)
print(y)
0 Anne
1 Beth
2 Caroline
3 David
4 Ernie
5 Frank
6 George
7 Hannah
Name: 0, dtype: object
1 Hannah
2 Frank
3 Ernie
4 NaN
5 NaN
6 NaN
7 NaN
Doing
x == y
throws a
ValueError: Can only compare identically-labeled Series objects
as does
x.sort_index(axis=0) == y.sort_index(axis=0)
and
x.reindex_like(y) > y
does something, but not the right thing!
If you need only the common values, you can convert the first Series to a set and use intersection:
a = set(x).intersection(y)
print (a)
{'Hannah', 'Frank', 'Ernie'}
And for the indices, you need merge (an inner join by default) with reset_index to convert the indices to columns:
df = pd.merge(x.rename('a').reset_index(), y.rename('a').reset_index(), on='a')
print (df)
index_x a index_y
0 4 Ernie 3
1 5 Frank 2
2 7 Hannah 1
Detail:
print (x.rename('a').reset_index())
index a
0 0 Anne
1 1 Beth
2 2 Caroline
3 3 David
4 4 Ernie
5 5 Frank
6 6 George
7 7 Hannah
print (y.rename('a').reset_index())
index a
0 1 Hannah
1 2 Frank
2 3 Ernie
3 4 NaN
4 5 NaN
5 6 NaN
6 7 NaN
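An alternative one-liner (not from the original answer): Series.isin keeps the common values together with x's original indices in a single step:
print(x[x.isin(y)])
# 4    Ernie
# 5    Frank
# 7    Hannah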