Hi all. Building on this past link, I am trying to consolidate columns of values into rows using groupby:
hp = hp[hp.columns[:]].groupby('LC_REF').apply(lambda x: ','.join(x.dropna().astype(str)))
#what I have
22     | 23     | 24      | LC_REF
TV     | WATCH  | HELLO   | 2C16
SCREEN | SOCCER | WORLD   | 2C16
TEST   | HELP   | RED     | 2C17
SEND   | PLEASE | PARFAIT | 2C17
#desired output
22 | TV,SCREEN
23 | WATCH, SOCCER
24 | HELLO, WORLD
25 | TEST, SEND
26 | HELP,PLEASE
27 | RED, PARFAIT
Or some sort of variation where columns 22, 23, and 24 are combined and grouped by LC_REF. My current code turns all of column 22 into one row, all of column 23 into one row, etc. I am so close I can feel it!! Any help is appreciated.
It seems you need:
df = (hp.groupby('LC_REF')
        .agg(lambda x: ','.join(x.dropna().astype(str)))
        .stack()
        .rename_axis(('LC_REF','a'))
        .reset_index(name='vals'))
print (df)
LC_REF a vals
0 2C16 22 TV,SCREEN
1 2C16 23 WATCH,SOCCER
2 2C16 24 HELLO,WORLD
3 2C17 22 TEST,SEND
4 2C17 23 HELP,PLEASE
5 2C17 24 RED,PARFAIT
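For completeness, here is a minimal reproducible sketch that rebuilds the sample frame from the question and produces the output above (the column labels 22-24 are assumed to be strings here):
import pandas as pd

hp = pd.DataFrame({
    '22': ['TV', 'SCREEN', 'TEST', 'SEND'],
    '23': ['WATCH', 'SOCCER', 'HELP', 'PLEASE'],
    '24': ['HELLO', 'WORLD', 'RED', 'PARFAIT'],
    'LC_REF': ['2C16', '2C16', '2C17', '2C17'],
})

df = (hp.groupby('LC_REF')
        .agg(lambda x: ','.join(x.dropna().astype(str)))
        .stack()
        .rename_axis(('LC_REF', 'a'))
        .reset_index(name='vals'))
print(df)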
I've a pandas dataframe as below:
driver_id | trip_id | pickup_from | drop_off_to | date
1 | 3 | stop 1 | city1 | 2018-02-04
1 | 7 | city 2 | city3 | 2018-02-04
1 | 4 | city 4 | stop1 | 2018-02-04
2 | 8 | stop 1 | city7 | 2018-02-06
2 | 9 | city 8 | stop1 | 2018-02-06
2 | 12 | stop 1 | city5 | 2018-02-07
2 | 10 | city 3 | city1 | 2018-02-07
2 | 1 | city 4 | city7 | 2018-02-07
2 | 6 | city 2 | stop1 | 2018-02-07
I want to calculate the longest trip for each driver between stop 1 in the pickup_from column and stop 1 in the drop_off_to column. For example, driver 1 started from stop 1, went to city 1, then city 2, then city 3, then city 4, and then back to stop 1, so the maximum for him is the number of cities he visited = 4 cities.
Driver 2 started from stop 1 and went to city 7 and city 8 before going back to stop 1, so he visited 2 cities. Then he started from stop 1 again and visited city 5, city 3, city 1, city 4, city 7 and city 2 before going back to stop 1, so the total number of cities he worked in on that sub-trip is 6. So for driver 2 the maximum number of cities visited = 6. The date doesn't matter in this calculation.
How can I do this using pandas?
Define the following function, which computes the longest trip for a driver:
def maxTrip(grp):
    # Flatten this driver's (pickup, drop-off) pairs into one sequence:
    # pickup, drop-off, pickup, drop-off, ...
    trip = pd.DataFrame({'city': grp[['pickup_from', 'drop_off_to']]
                        .values.reshape(1, -1).squeeze()})
    # Every place matching 'stop' starts a new sub-trip (cumsum of the flag);
    # count unique places per sub-trip and subtract 1 for the stop itself.
    return trip.groupby(trip.city.str.match('stop').cumsum())\
        .apply(lambda grp2: grp2.drop_duplicates().city.size).max() - 1
Then apply it:
result = df.groupby('driver_id').apply(maxTrip)
The result, for your data sample, is:
driver_id
1 4
2 6
dtype: int64
Note: It is up to you whether you want to eliminate repeated cities within one sub-trip (from leaving the stop until the return). I assumed they should be eliminated; if you don't want this, drop .drop_duplicates() from my code.
For your data sample this does not matter, since within each sub-trip the city names are unique. But it can happen that a driver visits a city, then goes to another city, and some time later (but before returning to the stop) visits the same city again.
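To make the grouping inside maxTrip concrete, here is the intermediate sequence it builds for driver 2 (a small sketch against the sample data; mixed spellings such as 'city 8' and 'stop1' are copied from the question):
grp = df[df.driver_id == 2]
seq = grp[['pickup_from', 'drop_off_to']].values.reshape(1, -1).squeeze()
# ['stop 1', 'city7', 'city 8', 'stop1', 'stop 1', 'city5',
#  'city 3', 'city1', 'city 4', 'city7', 'city 2', 'stop1']

labels = pd.Series(seq).str.match('stop').cumsum()
# 1 1 1 2 3 3 3 3 3 3 3 4 -> the largest group (label 3) holds 7 unique places
# including the stop itself, hence max() - 1 == 6.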
This would be best done using a function. Assuming the name of your dataframe above is df, consider this as a continuation:
pick_cities1 = list(df.loc[0:2, "pickup_from"])  # driver 1 pickups (rows 0-2)
drop_cities1 = list(df.loc[0:2, "drop_off_to"])  # driver 1 drop-offs
pick_cities2 = list(df.loc[3:, "pickup_from"])   # driver 2 pickups (rows 3 onwards)
drop_cities2 = list(df.loc[3:, "drop_off_to"])   # driver 2 drop-offs

def is_stop(place):
    # The sample data spells the depot both "stop 1" and "stop1".
    return place.replace(' ', '') == 'stop1'

def max_trips(pick_list, drop_list):
    """Largest number of distinct cities visited between two visits to stop 1."""
    best = 0
    cities = set()                      # cities seen on the current sub-trip
    for pick, drop in zip(pick_list, drop_list):
        if not is_stop(pick):
            cities.add(pick)
        if is_stop(drop):               # back at the stop: close the sub-trip
            best = max(best, len(cities))
            cities = set()
        else:
            cities.add(drop)
    return best

max_trips(pick_cities1, drop_cities1)   # 4
max_trips(pick_cities2, drop_cities2)   # 6
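If you'd rather not hard-code the row ranges for each driver, the same function can be applied per driver with a groupby (a small sketch reusing max_trips from above):
result = df.groupby('driver_id').apply(
    lambda g: max_trips(list(g['pickup_from']), list(g['drop_off_to'])))
print(result)   # driver_id 1 -> 4, 2 -> 6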
I have a pandas dataframe that looks like this:
df = pd.DataFrame({'name': ['bob', 'tim', 'jane', 'john', 'andy'],
                   'favefood': [['kfc', 'mcd', 'wendys'], ['mcd'], ['mcd', 'popeyes'],
                                ['wendys', 'kfc'], ['tacobell', 'innout']]})
-------------------------------
name | favefood
-------------------------------
bob | ['kfc', 'mcd', 'wendys']
tim | ['mcd']
jane | ['mcd', 'popeyes']
john | ['wendys', 'kfc']
andy | ['tacobell', 'innout']
For each person, I want to find out how many other people's favefood lists overlap with their own. I.e., for each person I want to find out how many other people have a non-empty intersection with them.
The resulting dataframe would look like this:
------------------------------
name | overlap
------------------------------
bob | 3
tim | 2
jane | 2
john | 1
andy | 0
The problem is that I have about 2 million rows of data. The only way I can think of doing this would be a nested for-loop, i.e. for each person, go through the entire dataframe to see what overlaps (this would be extremely inefficient). Would there be any way to do this more efficiently using pandas notation? Thanks!
Logic behind it
s = df['favefood'].explode().str.get_dummies().sum(level=0)
s.dot(s.T).ne(0).sum(axis=1) - 1
Out[84]:
0 3
1 2
2 2
3 1
4 0
dtype: int64
df['overlap'] = s.dot(s.T).ne(0).sum(axis=1) - 1
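To spell out the logic behind it: explode turns each list into one row per restaurant, get_dummies builds a person-by-restaurant 0/1 matrix, and the dot product of that matrix with its transpose counts shared restaurants for every pair of people. A small annotated sketch of the same steps (on newer pandas, where Series.sum(level=0) has been removed, the groupby form below is the equivalent):
# One row per person, one 0/1 column per restaurant ("does this person like it?").
s = df['favefood'].explode().str.get_dummies().groupby(level=0).sum()

# (i, j) entry = number of restaurants persons i and j have in common;
# the diagonal holds the length of each person's own list.
common = s.dot(s.T)

# Mark pairs with at least one shared restaurant, then subtract 1 for the self-match.
df['overlap'] = common.ne(0).sum(axis=1) - 1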
Method from sklearn
from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer()
s = pd.DataFrame(mlb.fit_transform(df['favefood']), columns=mlb.classes_, index=df.index)
s.dot(s.T).ne(0).sum(axis=1) - 1
0 3
1 2
2 2
3 1
4 0
dtype: int64
I cannot print components of a matched regex.
I am learning Python 3 and I need to verify that the output of my command matches my needs. I have the following short code:
#!/usr/bin/python3
import re
text_to_search = '''
1 | 27 23 8 |
2 | 21 23 8 |
3 | 21 23 8 |
4 | 21 21 21 |
5 | 21 21 21 |
6 | 27 27 27 |
7 | 27 27 27 |
'''
pattern = re.compile('(.*\n)*( \d \| 2[17] 2[137] [ 2][178] \|)')
matches = pattern.finditer(text_to_search)
for match in matches:
    print(match)
    print()
    print('matched to group 0:' + match.group(0))
    print()
    print('matched to group 1:' + match.group(1))
    print()
    print('matched to group 2:' + match.group(2))
and following output:
<_sre.SRE_Match object; span=(0, 140), match='\n 1 | 27 23 8 |\n 2 | 21 23 8 |\n 3 >
matched to group 0:
1 | 27 23 8 |
2 | 21 23 8 |
3 | 21 23 8 |
4 | 21 21 21 |
5 | 21 21 21 |
6 | 27 27 27 |
7 | 27 27 27 |
matched to group 1: 6 | 27 27 27 |
matched to group 2: 7 | 27 27 27 |
Please explain to me:
1) Why does print(match) print only the beginning of the match? Does it have some kind of limit that trims the output if it's bigger than some threshold?
2) Why is group(1) printed as " 6 | 27 27 27 |"? I was hoping (.*\n)* is as greedy as possible, so that it consumes lines 1-6 and leaves the last line of text_to_search to be matched against group(2), but it seems (.*\n)* kept only the 6th line. Why is that? Why are lines 1-5 not printed when printing group(1)?
3) I was trying to go through a regex tutorial but failed to understand the tricks with (?...). How do I verify that the numbers in the last row are equal (so 27 27 27 is OK, but 21 27 27 is not)?
1) print(match) only shows a summary of the object. match is an SRE_Match object, so in order to get information from it you need to call something like match.group(0), which accesses a value stored in the object.
2) To capture lines 1-6 you need to change (.*\n)* to ((?:.*\n)*); according to this regex tester:
A repeated capturing group will only capture the last iteration. Put a capturing group around the repeated group to capture all iterations, or use a non-capturing group instead if you're not interested in the data.
3) To match specific numbers you need to make the pattern more specific and include these numbers in a separate group at the end.
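As a concrete illustration of point 3, a backreference (rather than a lookaround) is enough to check that the three numbers in a row are identical. A minimal sketch, assuming the rows look like those in text_to_search (adjust the whitespace to your real output):
import re

# (\d+) captures the first number; each \1 forces the following number to be identical.
row_equal = re.compile(r'\s*\d+\s*\|\s*(\d+)\s+\1\s+\1\s*\|')

print(bool(row_equal.fullmatch(' 7 | 27 27 27 |')))   # True
print(bool(row_equal.fullmatch(' 7 | 21 27 27 |')))   # False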
I have a pandas data-frame of tickets raised on a group of servers like this:
a b c Users Problem
0 data data data User A Server Down
1 data data data User B Server Down
2 date data data User C Memory Full
3 date data data User C Swap Full
4 date data data User D Unclassified
5 date data data User E Unclassified
6 data data data User B RAM Failure
I need to create another dataframe with the data grouped by the type of ticket, with separate counts of tickets raised by two specific users, A and B, and a single column with the combined count for all other users.
Expected new Dataframe:
+---------------+--------+--------+-------------+
| Type Of Error | User A | User B | Other Users |
+---------------+--------+--------+-------------+
| Server Down | 50 | 60 | 150 |
+---------------+--------+--------+-------------+
| Memory Full | 40 | 50 | 20 |
+---------------+--------+--------+-------------+
| Swap Full | 10 | 20 | 15 |
+---------------+--------+--------+-------------+
| Unclassified | 10 | 20 | 50 |
+---------------+--------+--------+-------------+
I've tried .value_counts(), which provides the total count of each type. However, I need the counts broken down by user.
If the user is not User A or User B, change it to Other Users with Series.where, and then use crosstab:
df['Users'] = df['Users'].where(df['Users'].isin(['User A','User B']), 'Other Users')
df = pd.crosstab(df['Problem'], df['Users'])[['User A','User B','Other Users']]
print (df)
Users User A User B Other Users
Problem
Memory Full 0 0 1
RAM Failure 0 1 0
Server Down 1 1 0
Swap Full 0 0 1
Unclassified 0 0 2
You could use pivot_table which is great at using aggregate functions:
users = df.Users.copy()
users[~users.isin(['User A', 'User B'])] = 'Other Users'
df.pivot_table(index='Problem', columns=users, aggfunc='count', values='a',
fill_value=0).reindex(['User A', 'User B', 'Other Users'], axis=1)
It gives:
Users User A User B Other Users
Problem
Memory Full 0 0 1
RAM Failure 0 1 0
Server Down 1 1 0
Swap Full 0 0 1
Unclassified 0 0 2
First of all, I apologize! It's my first time using Stack Overflow, so I hope I'm doing it right! I searched but can't find what I'm looking for.
I'm also quite new to pandas and Python :)
I am going to try to use an example and I will try to be clear.
I have a dataframe with 30 columns that contains information about a shopping cart; one of the columns (Order) has two values, either completed or in progress.
And I have about 20 columns with items, let's say apple, orange, banana... I need to know how many times there is an apple in a completed order and how many in an in-progress order. I decided to use a pivot table with the aggregate function count.
This would be a small example of the dataframe:
Order | apple | orange | banana | pear | pineapple | ... |
-----------|-------|--------|--------|------|-----------|------|
completed | 2 | 4 | 10 | 5 | 1 | |
completed | 5 | 4 | 5 | 8 | 3 | |
iProgress | 3 | 7 | 6 | 5 | 2 | |
completed | 6 | 3 | 1 | 7 | 1 | |
iProgress | 10 | 2 | 2 | 2 | 2 | |
completed | 2 | 1 | 4 | 8 | 1 | |
I have the output I want but what I'm looking for is a more elegant way of selecting lots of columns without having to type them manually.
df.pivot_table(index=['Order'], values=['apple', 'bananas', 'orange', 'pear', 'strawberry',
'mango'], aggfunc='count')
But I want to select around 15 columns, so instead of typing them one by one 15 times, I'm sure there is an easy way of doing it by using column numbers or something. Let's say I want to select columns 6 through 15.
I have tried things like values=[df.columns[6:15]], and I have also tried using df.iloc, but as I said, I'm pretty new, so I'm probably using things wrong or doing silly things!
Is there also a way to get them in the order they already have? In my output they seem to have been ordered alphabetically, and I want to keep the order of the columns, so it should be apple, orange, banana...
Order Completed In progress
apple 92 221
banana 102 144
mango 70 55
I'm just looking for a way of improving my code and I hope I haven't made too much of a mess. Thank you!
I think you can use:
#if you need to select only a few columns - e.g. df.columns[1:3]
df = df.pivot_table(columns=['Order'], values=df.columns[1:3], aggfunc='count')
print (df)
Order completed iProgress
apple 4 2
orange 4 2
#if you need all columns, the values parameter can be omitted
df = df.pivot_table(columns=['Order'], aggfunc='count')
print (df)
Order completed iProgress
apple 4 2
banana 4 2
orange 4 2
pear 4 2
pineapple 4 2
See also: What is the difference between size and count in pandas?
df = df.pivot_table(columns=['Order'], aggfunc=len)
print (df)
Order completed iProgress
apple 4 2
banana 4 2
orange 4 2
pear 4 2
pineapple 4 2
#solution with groupby and transpose
df = df.groupby('Order').count().T
print (df)
Order completed iProgress
apple 4 2
orange 4 2
banana 4 2
pear 4 2
pineapple 4 2
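Regarding keeping the items in their original column order rather than alphabetical order: you can reindex the pivoted result by the original columns. A small sketch, assuming df is still the original frame and every column except Order is an item column:
items = [c for c in df.columns if c != 'Order']   # apple, orange, banana, ... in original order
out = df.pivot_table(columns='Order', aggfunc='count').reindex(items)
print(out)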
Your example doesn't show an item that is not in the cart; I'm assuming it would come up as None or 0. If this is correct, then I fill NA values and count how many entries are greater than 0:
df.set_index('Order').fillna(0).gt(0).groupby(level='Order').sum().T