Pandas - find max number of taxi stops before returning to the depot - python

I have a pandas dataframe as below:
driver_id | trip_id | pickup_from | drop_off_to | date
1         | 3       | stop 1      | city 1      | 2018-02-04
1         | 7       | city 2      | city 3      | 2018-02-04
1         | 4       | city 4      | stop 1      | 2018-02-04
2         | 8       | stop 1      | city 7      | 2018-02-06
2         | 9       | city 8      | stop 1      | 2018-02-06
2         | 12      | stop 1      | city 5      | 2018-02-07
2         | 10      | city 3      | city 1      | 2018-02-07
2         | 1       | city 4      | city 7      | 2018-02-07
2         | 6       | city 2      | stop 1      | 2018-02-07
I want to calculate the longest trip for each driver between stop 1 in the pickup_from column and stop 1 in the drop_off_to column. For example, driver 1 started from stop 1, went to city 1, then city 2, city 3 and city 4, and then came back to stop 1, so his max number of trips should be the number of cities he visited = 4 cities.
Driver 2 started from stop 1, went to city 7 and city 8, then came back to stop 1, so he visited 2 cities. He then started from stop 1 again and visited city 5, city 3, city 1, city 4, city 7 and city 2 before going back to stop 1, so the total number of cities he worked in is 6. So for driver 2 the max number of cities visited = 6. The date doesn't matter in this calculation.
How can I do this using pandas?

Define the following function, which computes the longest trip for a driver:
def maxTrip(grp):
    # put pickups and drop-offs into one interleaved sequence of visited places
    trip = pd.DataFrame({'city': grp[['pickup_from', 'drop_off_to']]
                        .values.reshape(1, -1).squeeze()})
    # each occurrence of 'stop ...' opens a new group; count unique places per
    # group and subtract 1 for the stop itself
    return trip.groupby(trip.city.str.match('stop').cumsum())\
        .apply(lambda grp2: grp2.drop_duplicates().city.size).max() - 1
Then apply it:
result = df.groupby('driver_id').apply(maxTrip)
The result, for your data sample, is:
driver_id
1    4
2    6
dtype: int64
Note: It is up to you whether you want to eliminate repeated cities within one sub-trip (from leaving the stop until returning to it).
I assumed that they are to be eliminated; if you don't want this, drop .drop_duplicates() from my code.
For your data sample this does not matter, since within each sub-trip the city names are unique. But it can happen that a driver visits a city, then goes to another city and some time later (but before the return to the stop) visits the same city again.
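For reference, here is a minimal sketch of building the sample frame from the question (the column names shown above are assumed) and applying the function:
import pandas as pd

df = pd.DataFrame({
    'driver_id':   [1, 1, 1, 2, 2, 2, 2, 2, 2],
    'trip_id':     [3, 7, 4, 8, 9, 12, 10, 1, 6],
    'pickup_from': ['stop 1', 'city 2', 'city 4', 'stop 1', 'city 8',
                    'stop 1', 'city 3', 'city 4', 'city 2'],
    'drop_off_to': ['city 1', 'city 3', 'stop 1', 'city 7', 'stop 1',
                    'city 5', 'city 1', 'city 7', 'stop 1'],
    'date': pd.to_datetime(['2018-02-04'] * 3 + ['2018-02-06'] * 2
                           + ['2018-02-07'] * 4),
})

result = df.groupby('driver_id').apply(maxTrip)
print(result)   # expected: driver 1 -> 4, driver 2 -> 6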

This would be best done using a function. Assuming the name of your dataframe above is df, consider this as a continuation:
drop_cities1 = list(df.loc[0:2, "drop_off_to"])   # all driver 1 drops
pick_cities1 = list(df.loc[0:2, "pickup_from"])   # all driver 1 pickups
drop_cities2 = list(df.loc[3:, "drop_off_to"])    # all driver 2 drops
pick_cities2 = list(df.loc[3:, "pickup_from"])    # all driver 2 pickups

def max_trips(pick_list, drop_list):
    max_cities = 0
    cities = set()                      # cities seen in the current sub-trip
    for pick, drop in zip(pick_list, drop_list):
        if pick == "stop 1":            # a new sub-trip starts at the depot
            cities = set()
        else:
            cities.add(pick)
        if drop == "stop 1":            # sub-trip ends back at the depot
            max_cities = max(max_cities, len(cities))
        else:
            cities.add(drop)
    return max_cities

print(max_trips(pick_cities1, drop_cities1))   # 4
print(max_trips(pick_cities2, drop_cities2))   # 6

Related

Reduce time complexity for double for loop

I am a student and need help reducing the time complexity of my code below. I have a pandas dataframe df which contains a customer id as customerid and the purchase date as createdon. In order to find the premium customer I want to count the max purchases for each customer in the last 48 hours, so I created a threshold date 48 hours back as last_48 and ranked the buy dates in ascending order as buy_rank for each customer id. I have written the code below, but it takes forever to run for 540k customer ids. Please help me optimize my piece of code.
%%time
c = []
for i in range(len(df)):
    k = 0
    print(i)
    b = df[df['customerid'] == df['customerid'].iloc[i]]
    for j in range(len(b)):
        if (df['last_48'].iloc[i] < b['createdon'].iloc[j]) & (df['buy_rank'].iloc[i] > b['buy_rank'].iloc[j]):
            k = k + 1
    c.append(k)
df['buy_count'] = c
|customerid |createdon|buy_rank|buy_count|last_48 |
|-----------|---------|--------|---------|---------|
|11 |3jan 1:30| 1 | 0 |1jan 1:30|
|11 |3jan 3:30| 2 | 1 |1jan 3:30|
|11 |6jan 1:30| 3 | 0 |4jan 1:30|
|11 |6jan 2:30| 4 | 1 |4jan 2:30|
|11 |6jan 3:30| 5 | 2 |4jan 3:30|
|11 |6jan 4:40| 6 | 3 |4jan 4:40|
|11 |9jan 5:30| 7 | 0 |7jan 5:30|
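A vectorized sketch that avoids scanning the full frame for every row: sort each customer's purchases by buy_rank and use searchsorted to count the earlier purchases falling after the 48-hour threshold. This assumes createdon and last_48 are already datetime columns and that buy_rank follows createdon order within each customer, as in the sample.
import numpy as np
import pandas as pd

def count_recent(grp):
    # createdon is ascending within the group once sorted by buy_rank
    grp = grp.sort_values('buy_rank')
    created = grp['createdon'].to_numpy()
    thresh = grp['last_48'].to_numpy()
    # number of earlier purchases at or before each row's threshold
    pos = np.searchsorted(created, thresh, side='right')
    counts = np.arange(len(grp)) - pos   # earlier purchases after the threshold
    return pd.Series(np.maximum(counts, 0), index=grp.index)

df['buy_count'] = df.groupby('customerid', group_keys=False).apply(count_recent)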

Calculating the percentage of values based on the values in other columns

I am trying to create a column that includes a percentage of values based on the values in other columns in python. For example, let's assume that we have the following dataset.
+------------------------------------+------------+--------+
| Teacher | grades | counts |
+------------------------------------+------------+--------+
| Teacher1 | 1 | 1 |
| | 2 | 2 |
| | 3 | 1 |
| Teacher2 | 2 | 1 |
| Teacher3 | 3 | 2 |
| Teacher4 | 2 | 2 |
| | 3 | 2 |
+------------------------------------+------------+--------+
As you can see we have teachers in the first column, the grades that a teacher gives (1, 2 and 3) in the second column, and the number of times the corresponding grade was given in the third column. I am trying to get the percentage of grades 1 and 2 out of the total grades given by each teacher. For instance, teacher 1 gave one grade 1, two grade 2s, and one grade 3; in this case, the percentage of grades 1 and 2 out of the total is 75%. Teacher 2 gave only one grade 2, so the percentage is 100%. Similarly, teacher 3 gave two grade 3s, so the percentage is 0% because he/she did not give any grades 1 or 2. These percentages should be added as a new column in the dataset. Honestly, I couldn't think of anything to try and I didn't find anything about it when I searched here. Could you please help me get this column?
I am not sure this is the most efficient way, but I find it quite readable and easy to follow.
percents = {}  # store Teacher:percent
for t, g in df.groupby('Teacher'):  # t, g is short for teacher, group
    total = g.counts.sum()
    one_two = g.loc[g.grades.isin([1, 2])].counts.sum()  # consider only 1 & 2
    percent = (one_two / total) * 100
    #print(t, percent)
    percents[t] = [percent]

xf = pd.DataFrame(percents).T.reset_index()  # make a df from the dict
xf.columns = ['Teacher', 'percent']  # rename columns
df = df.merge(xf)  # merge with initial df
print(df)
Teacher grades counts percent
0 Teacher1 1 1 75.0
1 Teacher1 2 2 75.0
2 Teacher1 3 1 75.0
3 Teacher2 2 1 100.0
4 Teacher3 3 2 0.0
5 Teacher4 2 2 50.0
6 Teacher4 3 2 50.0
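As a possibly more concise alternative, a sketch using groupby/transform (assuming the same column names Teacher, grades and counts) computes the same percentage without an explicit loop:
mask = df['grades'].isin([1, 2])
grade12 = df['counts'].where(mask, 0).groupby(df['Teacher']).transform('sum')
total = df.groupby('Teacher')['counts'].transform('sum')
df['percent'] = grade12 / total * 100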
I believe this will solve your query
y = 0
data['Percentage'] = 'None'
teachers = data['Teachers'].unique()
for teacher in teachers:
    x = data[data['Teachers'] == teacher]
    total = sum(x['Counts'])
    condition1 = 1 in set(x['Grades'])
    condition2 = 2 in set(x['Grades'])
    if condition1 or condition2:
        for i in range(y, y + len(x)):
            data.loc[data.index[i], 'Percentage'] = (data['Counts'].iloc[i] / total) * 100
    else:
        for i in range(y, y + len(x)):
            data.loc[data.index[i], 'Percentage'] = 0
    y = y + len(x)
Output:
Teachers Grades Counts Percentage
0 Teacher1 1 1 25
1 Teacher1 2 2 50
2 Teacher1 3 1 25
3 Teacher2 2 1 100
4 Teacher3 3 2 0
5 Teacher4 2 2 50
6 Teacher4 3 2 50
I have made use of boolean conditions to segregate the data on the basis of each teacher. Most of the code is self-explanatory. For any other clarification please feel free to leave a comment.

How to create a new dataframe based on value_counts of a column in another dataframe but with certain conditions on other columns?

I have a pandas dataframe of tickets raised on a group of servers, like this:
   a     b     c     Users   Problem
0  data  data  data  User A  Server Down
1  data  data  data  User B  Server Down
2  data  data  data  User C  Memory Full
3  data  data  data  User C  Swap Full
4  data  data  data  User D  Unclassified
5  data  data  data  User E  Unclassified
6  data  data  data  User B  RAM Failure
I need to create another dataframe like the one below, with the data grouped by the type of ticket, the count of tickets raised by users A and B in separate columns, and a single column with the count for all other users.
Expected new Dataframe:
+---------------+--------+--------+-------------+
| Type Of Error | User A | User B | Other Users |
+---------------+--------+--------+-------------+
| Server Down | 50 | 60 | 150 |
+---------------+--------+--------+-------------+
| Memory Full | 40 | 50 | 20 |
+---------------+--------+--------+-------------+
| Swap Full | 10 | 20 | 15 |
+---------------+--------+--------+-------------+
| Unclassified | 10 | 20 | 50 |
+---------------+--------+--------+-------------+
I've tried .value_counts(), which gives the total count of each type; however, I need it broken down by user.
If the user is not User A or User B, change it to Other Users with Series.where, and then use crosstab:
df['Users'] = df['Users'].where(df['Users'].isin(['User A','User B']), 'Other Users')
df = pd.crosstab(df['Problem'], df['Users'])[['User A','User B','Other Users']]
print (df)
Users User A User B Other Users
Problem
Memory Full 0 0 1
RAM Failure 0 1 0
Server Down 1 1 0
Swap Full 0 0 1
Unclassified 0 0 2
You could use pivot_table which is great at using aggregate functions:
users = df.Users.copy()
users[~users.isin(['User A', 'User B'])] = 'Other Users'
df.pivot_table(index='Problem', columns=users, aggfunc='count', values='a',
               fill_value=0).reindex(['User A', 'User B', 'Other Users'], axis=1)
It gives:
Users User A User B Other Users
Problem
Memory Full 0 0 1
RAM Failure 0 1 0
Server Down 1 1 0
Swap Full 0 0 1
Unclassified 0 0 2
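An equivalent groupby/size sketch (assuming the Users column has already been mapped to 'Other Users' as in the crosstab answer above) produces the same table:
out = (df.groupby(['Problem', 'Users']).size()
         .unstack(fill_value=0)
         .reindex(columns=['User A', 'User B', 'Other Users'], fill_value=0))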

I am not able to make an accurate pivot table

I have one dataframe which contains many columns, and I am trying to make a pivot table like this.
Data sample
program | InWappTable | InLeadExportTrack
VIC     | True        | 1
VIC     | True        | 1
VIC     | True        | 1
VIC     | True        | 1
Here is my code
rec.groupby(['InWappTable', 'InLeadExportTrack','program']).size()
And Expected Output is
IIUC, you can try this:
df_new = df.groupby(['program'])[['InWappTable', 'InLeadExportTrack']].count().reset_index()
total = df_new.sum()
total['program'] = 'Total'
df_new=df_new.append(total, ignore_index=True)
print(df_new)
I do not believe that you require a pivot_table here, though a pivot_table approach with aggfunc can also be used effectively.
Here is how I approached this:
Generate some data
a = [['program', 'InWappTable', 'InLeadExportTrack'],
     ['VIC', True, 1],
     ['Mall', False, 15],
     ['VIC', True, 101],
     ['VIC', True, 1],
     ['Mall', True, 74],
     ['Mall', True, 11],
     ['VIC', False, 44]]
df = pd.DataFrame(a[1:], columns=a[0])
print(df)
program InWappTable InLeadExportTrack
0 VIC True 1
1 Mall False 15
2 VIC True 101
3 VIC True 1
4 Mall True 74
5 Mall True 11
6 VIC False 44
First do GROUP BY with count aggregation
df_grouped = df.groupby(['program']).count()
print(df_grouped)
InWappTable InLeadExportTrack
program
Mall 3 3
VIC 4 4
Then to get the sum of all columns
num_cols = ['InWappTable','InLeadExportTrack']
df_grouped[num_cols] = df_grouped[num_cols].astype(int)
df_grouped.loc['Total']= df_grouped.sum(axis=0)
df_grouped.reset_index(drop=False, inplace=True)
print(df_grouped)
program InWappTable InLeadExportTrack
0 Mall 3 3
1 VIC 4 4
2 Total 7 7
EDIT
Based on the comments in the OP, df_grouped = df.groupby(['program']).count() could be replaced by df_grouped = df.groupby(['program']).sum(). In this case, the output is shown below
program InWappTable InLeadExportTrack
0 Mall 2 100
1 VIC 3 147
2 Total 5 247
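As mentioned above, a pivot_table with margins can also produce the totals row directly. A sketch, assuming the same sample data and the sum aggregation from the edit:
pt = df.pivot_table(index='program',
                    values=['InWappTable', 'InLeadExportTrack'],
                    aggfunc='sum',
                    margins=True, margins_name='Total').reset_index()
print(pt)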

Tallying number of times certain strings occur in Python

I'm working on a database of incidents affecting different sectors in different countries and want to create a table tallying the incident rate breakdown for each country.
The database looks like this at the moment:
Incident Name | Country Affected | Sector Affected
incident_1 | US,TW,CN | Engineering,Media
incident_2 | FR,RU,CN | Government
etc., etc.
My aim would be to build a table which looks like this:
Country | Engineering | Media | Government
CN | 3 | 0 | 5
etc.
Right now my method is basically to use an if statement to check whether the country column contains a specific string (for example 'CN') and, if it does, to run Counter from collections to create a dictionary of the initial tally, then save it.
My issue is how to scale this up so that it can be run across the entire database AND how to actually save the dictionary produced by Counter.
Use pd.Series.str.get_dummies and pd.DataFrame.dot: get_dummies turns each comma-separated list into indicator columns, and the matrix product counts how often each country/sector pair appears in the same incident.
c = df['Country Affected'].str.get_dummies(sep=',')
s = df['Sector Affected'].str.get_dummies(sep=',')
c.T.dot(s)
Engineering Government Media
CN 1 1 1
FR 0 1 0
RU 0 1 0
TW 1 0 1
US 1 0 1
Bigger example:
import numpy as np
import pandas as pd

np.random.seed([3, 1415])
countries = ['CN', 'FR', 'RU', 'TW', 'US', 'UK', 'JP', 'AU', 'HK']
sectors = ['Engineering', 'Government', 'Media', 'Commodidty']

def pick_rnd(x):
    i = np.random.randint(1, len(x))
    j = np.random.choice(x, i, False)
    return ','.join(j)

df = pd.DataFrame({
    'Country Affected': [pick_rnd(countries) for _ in range(10)],
    'Sector Affected': [pick_rnd(sectors) for _ in range(10)]
})
df
df
Country Affected Sector Affected
0 CN Government,Media
1 FR,TW,JP,US,UK,CN,RU,AU Commodidty,Government
2 HK,AU,JP Commodidty
3 RU,CN,FR,JP,UK Media,Commodidty,Engineering
4 CN,RU,FR,JP,TW,HK,US,UK Government,Media,Commodidty
5 FR,CN Commodidty
6 FR,HK,JP,TW,US,AU,CN Commodidty
7 CN,HK,RU,TW,UK,US,FR,JP Media,Commodidty
8 JP,UK,AU Engineering,Media
9 RU,UK,FR Media
Then
c = df['Country Affected'].str.get_dummies(sep=',')
s = df['Sector Affected'].str.get_dummies(sep=',')
c.T.dot(s)
Commodidty Engineering Government Media
AU 3 1 1 1
CN 6 1 3 4
FR 6 1 2 4
HK 4 0 1 2
JP 6 2 2 4
RU 4 1 2 4
TW 4 0 2 2
UK 4 2 2 5
US 4 0 2 2
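If you want the result in exactly the layout sketched in the question, with Country as a regular column, a small follow-up (assuming the frames c and s above):
result = c.T.dot(s).rename_axis('Country').reset_index()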
