Python Pandas Complex Grouping - python

I have a dataframe that looks like the following:
Name Status Date
1 Joe In 1/2/2003
2 Pete Out 1/2/2003
3 Mary In 1/2/2003
• • •
4 Joe In 3/4/2004
5 Pete In 3/5/2004
6 Mary Out 4/8/2004
If I do the following group-by action:
df.groupby(["Name", "Status"]).last()
I get the following:
Joe In 3/4/2004
Pete In 3/5/2004
Out 1/2/2003
Mary In 1/2/2003
Out 4/8/2004
Notice that Joe has no "out" grouping results because there are no "out" values for Joe in the dataframe.
I want to be able to select, from the dataframe or the subsequent groupby, the people who have only "In" status or only "Out" status across a date range, as opposed to the people who have both "In"s and "Out"s across that date range. I'm stumped as to how to approach this. I could proceed if the groupby result gave me something like:
Joe Out np.nan
But it doesn't.
Oh, and I do the groupby .last() because I need to get the last date for the people who have both "In" and "Out" status, like Pete and Mary. But I need to treat Joe, who only has "In" status and no "Out" status for the period, differently.
Any guidance appreciated.

Not sure exactly what you want, but you can try reindexing.
From
x = df.groupby(['Name', 'Status']).last()
Date
Name Status
Joe In 3/4/2004
Mary In 1/2/2003
Out 4/8/2004
Pete In 3/5/2004
Out 1/2/2003
You can make it
import numpy as np
import pandas as pd

size = x.index.levels[0].size        # number of distinct names
f = np.repeat(np.arange(size), 2)    # each name gets two rows
s = [0, 1] * size                    # ... alternating "In" (0) and "Out" (1)
# note: in newer pandas versions the `labels` argument is named `codes`
x.reindex(pd.MultiIndex(levels=x.index.levels, labels=[f, s]))
Date
Name Status
Joe In 3/4/2004
Out NaN
Mary In 1/2/2003
Out 4/8/2004
Pete In 3/5/2004
Out 1/2/2003

Related

Pandas Number of Unique Values from 2 Fields

I am trying to find the number of unique values across a combination of 2 fields. A typical example would be last name and first name. I have a data frame.
When I do the following, I just get the number of unique values for each column separately, in this case Last and First, not for the composite.
df[['Last Name','First Name']].nunique()
Thanks!
Groupby both columns first, and then use nunique
>>> df.groupby(['First Name', 'Last Name']).nunique()
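Note that this groupby gives you a per-group table rather than a single number. If all you need is how many unique (First Name, Last Name) pairs exist, the group count itself is enough; a sketch with the same assumed column names:
df.groupby(['First Name', 'Last Name']).ngroups
# or, equivalently
len(df.drop_duplicates(['First Name', 'Last Name']))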
IIUC, you could use value_counts() for that:
df[['Last Name','First Name']].value_counts().size
3
For another example, if you start with this extended data frame that contains some dups:
Last Name First Name
0 Smith Bill
1 Johnson Bill
2 Smith John
3 Curtis Tony
4 Taylor Elizabeth
5 Smith Bill
6 Johnson Bill
7 Smith Bill
Then value_counts() gives you the counts by unique composite last-first name:
df[['Last Name','First Name']].value_counts()
Last Name First Name
Smith Bill 3
Johnson Bill 2
Curtis Tony 1
Smith John 1
Taylor Elizabeth 1
Then the length of that Series gives you the number of unique composite last-first names:
df[['Last Name','First Name']].value_counts().size
5

Most elegant way to transform this type of table?

I have a dataframe that looks something like this:
id name last attribute_1_name attribute_1_rating attribute_2_name attribute_2_rating
1 Linda Smith Age 23 Hair Brown
3 Brian Lin Hair Black Job Barista
Essentially I'd like to transform this table to look like so:
id name last attribute_name attribute_rating
1 Linda Smith Age 23
1 Linda Smith Hair Brown
3 Brian Lin Hair Black
3 Brian Lin Job Barista
What's the most elegant and efficient way to perform this transformation in Python? Assuming there are many more rows and the attribute numbers go up to 13.
Assuming attribute columns are named coherently, you can do this:
import pandas as pd

result = pd.DataFrame()
# n is the number of attribute name/rating pairs (13 in your case)
for i in range(1, n + 1):
    attribute_name_col = f'attribute_{i}_name'
    attribute_rating_col = f'attribute_{i}_rating'
    # melt one name/rating pair at a time
    melted = pd.melt(
        df,
        id_vars=['id', 'name', 'last', attribute_name_col],
        value_vars=[attribute_rating_col]
    )
    melted = melted.rename(
        columns={attribute_name_col: 'attribute_name',
                 'value': 'attribute_rating'}
    )
    melted = melted.drop('variable', axis=1)
    result = pd.concat([result, melted])
where df is your original dataframe. Then printing result gives
id name last attribute_name attribute_rating
1 Linda Smith Age 23
3 Brian Lin Hair Black
1 Linda Smith Hair Brown
3 Brian Lin Job Barista
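An alternative that avoids the explicit loop is to melt everything once and then split the attribute number out of the column names. This is only a sketch, assuming every pair of columns follows the attribute_<i>_name / attribute_<i>_rating pattern:
import pandas as pd

long = df.melt(id_vars=['id', 'name', 'last'])
# split e.g. "attribute_1_name" into the attribute number and the field type
long[['attr_num', 'field']] = long['variable'].str.extract(r'attribute_(\d+)_(name|rating)')
result = (long.set_index(['id', 'name', 'last', 'attr_num', 'field'])['value']
              .unstack('field')
              .rename(columns={'name': 'attribute_name', 'rating': 'attribute_rating'})
              .reset_index()
              .drop(columns='attr_num'))
If some rows use fewer than 13 attribute slots, the leftover empty pairs can be removed afterwards with result.dropna(subset=['attribute_name']).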

How to display values of a column as separate columns

I want to display the values in a column along with their count in separate columns
Dataframe is
Date Name SoldItem
15-Jul Joe TV
15-Jul Joe Fridge
15-Jul Joe Washing Machine
15-Jul Joe TV
15-Jul Joe Fridge
15-Jul Mary Chair
15-Jul Mary Fridge
16-Jul Joe Fridge
16-Jul Joe Fridge
16-Jul Tim Washing Machine
17-Jul Joe Washing Machine
17-Jul Jimmy Washing Machine
17-Jul Joe Washing Machine
17-Jul Joe Washing Machine
And I get the output as
Date Name Count
15-Jul Joe 2
Mary 1
16-Jul Joe 2
I want the final output to be
Date Joe Mary
15-Jul 2 1
16-Jul 2
Below is the code:
fields = ['Date', 'Name', 'SoldItem']
df = pd.read_csv('data.csv', skipinitialspace=True, usecols=fields)
df_fridge = df.loc[df['SoldItem'] == 'Fridge']
df_fridge_grp = df_fridge.groupby(["Date", "Name"]).size()
print(df_fridge_grp)
I would appreciate any pointers. I am guessing it can be done with loc or iloc, but I am wondering whether my approach is wrong. Basically I want to count the values for certain types of items per person and then display that count against each name in its own column.
Does df_fridge_grp.unstack() work?
Code:
df_new = df[df['SoldItem'] == 'Fridge'].groupby(['Date', 'Name']).count()
df_new = df_new.unstack().fillna(0).astype(int)
print(df_new)
Output:
SoldItem
Name Joe Mary
Date
15-Jul 2 1
16-Jul 2 0
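If you only need the final counts table, pd.crosstab gets there in one step; a sketch reusing the df_fridge frame from the question:
# rows are dates, columns are names, values are how many fridges each person sold
counts = pd.crosstab(df_fridge['Date'], df_fridge['Name'])
print(counts)
Missing combinations come out as 0 automatically, so no fillna is needed.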

Python - Sort a Pandas Dataframe twice

I would like to sort a Pandas dataframe twice, the same way Excel does. Given the following df:
Name Date
John 13/01
Mike 13/01
John 15/01
John 14/01
Mike 12/01
When adding the following code:
df=df.sort_values(['Date','Name'], ascending=[True, True])
I would expect the following result:
Name Date
John 13/01
John 14/01
John 15/01
Mike 12/01
Mike 13/01
I'm getting nothing close to this result with the code above. Any idea where the mistake is?
Many thanks!
You need to swap the columns, because you want to sort by Name first and then by Date. The ascending=[True, True] can be removed, because it is the default:
df = df.sort_values(['Name','Date'])
print (df)
Name Date
0 John 13/01
3 John 14/01
2 John 15/01
4 Mike 12/01
1 Mike 13/01
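One thing to watch: with Date stored as strings like 13/01, sort_values compares them as text. That happens to give the right order within a single month, but it breaks once days have different digit widths or the dates span months, so it may be safer to parse them first. A sketch, assuming the strings are day/month:
df['Date'] = pd.to_datetime(df['Date'], format='%d/%m')   # assumed day/month format
df = df.sort_values(['Name', 'Date'])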

Pivot pandas dataframe with dates and showing counts per date

I have the following pandas DataFrame: (currently ~500 rows):
merged_verified =
Last Verified Verified by
0 2016-07-11 John Doe
1 2016-07-11 John Doe
2 2016-07-12 John Doe
3 2016-07-11 Mary Smith
4 2016-07-12 Mary Smith
I am attempting to pivot_table() it to receive the following:
Last Verified 2016-07-11 2016-07-12
Verified by
John Doe 2 1
Mary Smith 1 1
Currently I'm running
merged_verified = merged_verified.pivot_table(index=['Verified by'], values=['Last Verified'], aggfunc='count')
which gives me close to what I need, but not exactly:
Last Verified
Verified by
John Doe 3
Mary Smith 2
I've tried a variety of things with the parameters, but none of it worked. The result above is the closest I've come to what I need. I read somewhere that I would need to add an additional column of dummy values (1's) that I could then sum, but that seems counter-intuitive for what I believe to be a simple DataFrame layout.
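For reference, that dummy-column idea would look roughly like this; only a sketch, with count as a hypothetical helper column name:
merged_verified['count'] = 1   # hypothetical helper column of 1's
pivoted = merged_verified.pivot_table(index='Verified by',
                                      columns='Last Verified',
                                      values='count',
                                      aggfunc='sum',
                                      fill_value=0)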
You can add the columns parameter and aggregate by len:
merged_verified = merged_verified.pivot_table(index=['Verified by'],
columns=['Last Verified'],
values=['Last Verified'],
aggfunc=len)
print (merged_verified)
Last Verified 2016-07-11 2016-07-12
Verified by
John Doe 2 1
Mary Smith 1 1
Or you can also omit values:
merged_verified = merged_verified.pivot_table(index=['Verified by'],
columns=['Last Verified'],
aggfunc=len)
print (merged_verified)
Last Verified 2016-07-11 2016-07-12
Verified by
John Doe 2 1
Mary Smith 1 1
Use groupby, value_counts, and unstack:
merged_verified.groupby('Last Verified')['Verified by'].value_counts().unstack(0)
Timing
Example dataframe: a large dataframe with 1 million rows
import numpy as np
import pandas as pd
from string import ascii_uppercase as letters   # assumed character pool; `letters` was undefined in the original snippet

idx = pd.MultiIndex.from_product(
    [
        pd.date_range('2016-03-01', periods=100),
        # 10,000 random 10-letter "names"
        pd.DataFrame(np.random.choice(list(letters), (10000, 10))).sum(1)
    ], names=['Last Verified', 'Verified by'])
merged_verified = idx.to_series().reset_index()[idx.names]
