Reduce time complexity for double for loop - python

I am a student and need help reducing the time complexity of the code below. I have a pandas DataFrame df that contains a customer id (customerid) and the purchase date (createdon). To find premium customers, I want to count, for each purchase, how many earlier purchases the same customer made in the preceding 48 hours. For that I created a threshold date 48 hours back (last_48) and ranked each customer's purchase dates in ascending order (buy_rank). The code below works, but it takes forever to run for 540k customer ids. Please help me optimize it.
%%time
c = []
for i in range(len(df)):
    k = 0
    print(i)
    b = df[df['customerid'] == df['customerid'].iloc[i]]
    for j in range(len(b)):
        if (df['last_48'].iloc[i] < b['createdon'].iloc[j]) & (df['buy_rank'].iloc[i] > b['buy_rank'].iloc[j]):
            k = k + 1
    c.append(k)
df['buy_count'] = c
|customerid |createdon|buy_rank|buy_count|last_48 |
|-----------|---------|--------|---------|---------|
|11 |3jan 1:30| 1 | 0 |1jan 1:30|
|11 |3jan 3:30| 2 | 1 |1jan 3:30|
|11 |6jan 1:30| 3 | 0 |4jan 1:30|
|11 |6jan 2:30| 4 | 1 |4jan 2:30|
|11 |6jan 3:30| 5 | 2 |4jan 3:30|
|11 |6jan 4:40| 6 | 3 |4jan 4:40|
|11 |9jan 5:30| 7 | 0 |7jan 5:30|
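One way to avoid the double loop entirely is a 48-hour rolling count per customer. This is only a sketch, not from the original post: it assumes createdon is already a datetime column and that no two purchases of the same customer share the exact same timestamp.

import pandas as pd

# sort so each customer's purchases are in time order
df = df.sort_values(['customerid', 'createdon']).reset_index(drop=True)

counts = (
    df.set_index('createdon')
      .groupby('customerid')['buy_rank']
      .rolling('48h')          # window is (t - 48h, t], matching last_48 < createdon
      .count()
      .to_numpy()
)

# subtract 1 to exclude the current purchase itself
df['buy_count'] = (counts - 1).astype(int)

For the sample above this should reproduce the buy_count column without visiting every pair of rows.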

Related

Pandas - find max number of taxi stops before back to depot

I've a pandas dataframe as below:
driver_id | trip_id | pickup_from | drop_off_to | date
1 | 3 | stop 1 | city1 | 2018-02-04
1 | 7 | city 2 | city3 | 2018-02-04
1 | 4 | city 4 | stop1 | 2018-02-04
2 | 8 | stop 1 | city7 | 2018-02-06
2 | 9 | city 8 | stop1 | 2018-02-06
2 | 12 | stop 1 | city5 | 2018-02-07
2 | 10 | city 3 | city1 | 2018-02-07
2 | 1 | city 4 | city7 | 2018-02-07
2 | 6 | city 2 | stop1 | 2018-02-07
I want to calculate the longest trip for each driver between 'stop 1' in the pickup_from column and 'stop 1' in the drop_off_to column. For example, driver 1 started from stop 1, went to city 1, then city 2, city 3 and city 4, and then came back to stop 1, so the max number of trips is the number of cities he visited = 4 cities.
Driver 2 started from stop 1, went to city 7 and city 8, and then returned to stop 1, so he visited 2 cities. He then started from stop 1 again and visited city 5, city 3, city 1, city 4, city 7 and city 2 before going back to stop 1, so on that round he worked in 6 cities. So for driver 2 the max number of cities visited = 6. The date doesn't matter in this calculation.
How can I do this using pandas?
Define the following function computing the longest trip
for a driver:
def maxTrip(grp):
    trip = pd.DataFrame({'city': grp[['pickup_from', 'drop_off_to']]
                         .values.reshape(1, -1).squeeze()})
    return trip.groupby(trip.city.str.match('stop').cumsum())\
               .apply(lambda grp2: grp2.drop_duplicates().city.size).max() - 1
Then apply it:
result = df.groupby('driver_id').apply(maxTrip)
The result, for your data sample, is:
driver_id
1 4
2 6
dtype: int64
Note: It is up to you whether you want to eliminate repeating cities during
one sub-trip (from leaving the stop to return).
I assumed that they are to be eliminated. If you don't want this, drop
.drop_duplicates() from my code.
For your data sample this does not matter, since in each sub-trip
city names are unique. But it can happen that a driver visits a city,
then goes to another city and sometimes later (but before the return
to the stop) visits the same city again.
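For illustration, here is a toy sub-trip (made-up city names, not from the question's data) showing the difference:

import pandas as pd

# one sub-trip: stop 1 -> cityA -> cityB -> cityA (revisited)
trip = pd.DataFrame({'city': ['stop 1', 'cityA', 'cityB', 'cityA']})

trip.city.size - 1                      # 3: the revisit is counted again
trip.drop_duplicates().city.size - 1    # 2: the revisit is counted only once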
This would be best done using a function. Assuming the dataframe above is named df, consider this as a continuation:
drop_cities1 = list(df.loc[0:2, "drop_off_to"])   # all driver 1 drop-offs (rows 0-2)
pick_cities1 = list(df.loc[0:2, "pickup_from"])   # all driver 1 pickups
drop_cities2 = list(df.loc[3:, "drop_off_to"])    # all driver 2 drop-offs
pick_cities2 = list(df.loc[3:, "pickup_from"])    # all driver 2 pickups
no_of_c1 = 0   # number of cities visited by driver 1
no_of_c2 = 0   # number of cities visited by driver 2

def max_trips(pick_list, drop_list):
    no_of_cities = 0
    for pick in pick_list:
        if pick == 'stop 1':          # 'stop 1' as spelled in the pickup_from column
            for drop in drop_list:
                no_of_cities += 1
                if drop == 'stop1':   # 'stop1' as spelled in the drop_off_to column
                    break
    return no_of_cities

no_of_c1 = max_trips(pick_cities1, drop_cities1)
no_of_c2 = max_trips(pick_cities2, drop_cities2)

Calculating the percentage of values based on the values in other columns

I am trying to create a column that includes a percentage of values based on the values in other columns in python. For example, let's assume that we have the following dataset.
+------------------------------------+------------+--------+
| Teacher | grades | counts |
+------------------------------------+------------+--------+
| Teacher1 | 1 | 1 |
| | 2 | 2 |
| | 3 | 1 |
| Teacher2 | 2 | 1 |
| Teacher3 | 3 | 2 |
| Teacher4 | 2 | 2 |
| | 3 | 2 |
+------------------------------------+------------+--------+
As you can see, we have teachers in the first column, the grades a teacher gives (1, 2 and 3) in the second column, and the number of times the corresponding grade was given in the third column. I am trying to get, for each teacher, the percentage of grades 1 and 2 out of all grades given. For instance, teacher 1 gave one grade 1, two grade 2, and one grade 3; in this case the percentage of grades 1 and 2 out of the total is 75%. Teacher 2 gave only one grade 2, so the percentage is 100%. Similarly, teacher 3 gave two grade 3, so the percentage is 0% because he/she did not give any grades 1 or 2. These percentages should be added as a new column in the dataset. Honestly, I couldn't even think of anything to try and didn't find anything about it when I searched here. Could you please help me get that column?
I am not sure this is the most efficient way, but I find it quite readable and easy to follow.
percents = {}  # store Teacher: percent
for t, g in df.groupby('Teacher'):  # t, g is short for teacher, group
    total = g.counts.sum()
    one_two = g.loc[g.grades.isin([1, 2])].counts.sum()  # consider only 1 & 2
    percent = (one_two / total) * 100
    #print(t, percent)
    percents[t] = [percent]
xf = pd.DataFrame(percents).T.reset_index()  # make a df from the dict
xf.columns = ['Teacher', 'percent']  # rename columns
df = df.merge(xf)  # merge with initial df
print(df)
Teacher grades counts percent
0 Teacher1 1 1 75.0
1 Teacher1 2 2 75.0
2 Teacher1 3 1 75.0
3 Teacher2 2 1 100.0
4 Teacher3 3 2 0.0
5 Teacher4 2 2 50.0
6 Teacher4 3 2 50.0
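A loop-free alternative (just a sketch using the question's column names, not necessarily faster on data this small) computes both sums per teacher with groupby and transform:

# total count of grades given, broadcast back to every row of that teacher
totals = df.groupby('Teacher')['counts'].transform('sum')

# counts belonging to grades 1 and 2 only, again summed per teacher
ones_twos = (df['counts']
             .where(df['grades'].isin([1, 2]), 0)
             .groupby(df['Teacher'])
             .transform('sum'))

df['percent'] = ones_twos / totals * 100

For the sample data this gives the same 75 / 100 / 0 / 50 percentages as above.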
I believe this will solve your query
y = 0
teachers = data['Teachers'].unique()   # unique teachers; this definition was missing from the snippet
data['Percentage'] = 'None'
for teacher in teachers:
    x = data[data['Teachers'] == teacher]
    total = sum(x['Counts'])
    condition1 = 1 in set(x['Grades'])
    condition2 = 2 in set(x['Grades'])
    if condition1 or condition2:
        for i in range(y, y + len(x)):
            data.loc[i, 'Percentage'] = (data['Counts'].iloc[i] / total) * 100   # assumes a default RangeIndex
    else:
        for i in range(y, y + len(x)):
            data.loc[i, 'Percentage'] = 0
    y = y + len(x)
Output:
Teachers Grades Counts Percentage
0 Teacher1 1 1 25
1 Teacher1 2 2 50
2 Teacher1 3 1 25
3 Teacher2 2 1 100
4 Teacher3 3 2 0
5 Teacher4 2 2 50
6 Teacher4 3 2 50
I have made use of boolean conditions to segregate the data on the basis of each teacher. Most of the code is self-explanatory. For any other clarification, please feel free to leave a comment.

How to create a new dataframe based on value_counts of a column in another dataframe but with certain conditions on other columns?

I have a pandas data-frame of tickets raised on a group of servers like this:
a b c Users Problem
0 data data data User A Server Down
1 data data data User B Server Down
2 date data data User C Memory Full
3 date data data User C Swap Full
4 date data data User D Unclassified
5 date data data User E Unclassified
6 data data data User B RAM Failure
I need to create another dataframe, like the one below, with the data grouped by the type of ticket and the count of tickets raised by only two users, A and B, in separate columns, plus a single column with the count for all other users.
Expected new Dataframe:
+---------------+--------+--------+-------------+
| Type Of Error | User A | User B | Other Users |
+---------------+--------+--------+-------------+
| Server Down | 50 | 60 | 150 |
+---------------+--------+--------+-------------+
| Memory Full | 40 | 50 | 20 |
+---------------+--------+--------+-------------+
| Swap Full | 10 | 20 | 15 |
+---------------+--------+--------+-------------+
| Unclassified | 10 | 20 | 50 |
+---------------+--------+--------+-------------+
I've tried .value_counts(), which provides the total count of each type. However, I need it to be broken down by user.
If the user is neither User A nor User B, change it to Other Users with Series.where, and then use crosstab:
df['Users'] = df['Users'].where(df['Users'].isin(['User A','User B']), 'Other Users')
df = pd.crosstab(df['Problem'], df['Users'])[['User A','User B','Other Users']]
print (df)
Users User A User B Other Users
Problem
Memory Full 0 0 1
RAM Failure 0 1 0
Server Down 1 1 0
Swap Full 0 0 1
Unclassified 0 0 2
You could use pivot_table which is great at using aggregate functions:
users = df.Users.copy()
users[~users.isin(['User A', 'User B'])] = 'Other Users'
df.pivot_table(index='Problem', columns=users, aggfunc='count', values='a',
               fill_value=0).reindex(['User A', 'User B', 'Other Users'], axis=1)
It gives:
Users User A User B Other Users
Problem
Memory Full 0 0 1
RAM Failure 0 1 0
Server Down 1 1 0
Swap Full 0 0 1
Unclassified 0 0 2

Searching through data base for partial and full match integers

I'm trying to search through a dataframe with a column that can have one or more integer values, to match one or more given integers.
The integers in the database have a '-' in between. For example:
--------------------------------------------------
| Customer 1 |1124 |
--------------------------------------------------
| Customer 2 |1124-1123 |
--------------------------------------------------
| Customer 3 |1124-1234-1642 |
--------------------------------------------------
| Customer 3 |1213-1234-1642 |
--------------------------------------------------
The objective here is to do a partial and full match, and be able to find out how many integers didn't match.
So for example, let's say I have to find all customers with 1124; the output would look like this (going off the example I provided):
--------------------------------------------------
| Customer 1 |1124 |None
--------------------------------------------------
| Customer 2 |1124-1123 |1
--------------------------------------------------
| Customer 3 |1124-1234-1642 |2
--------------------------------------------------
Thanks ahead of time!
Use set:
- define x as the test set
- make s a series of sets
- s - x creates a series of differences
- (s - x).str.len() gives the sizes of the differences
- s & x is a boolean series indicating whether there is an intersection, or in this case, whether x is in s
x = {'1124'}
s = df['col2'].str.split('-').apply(set)
df.assign(col3=(s - x).str.len())[s & x]
col1 col2 col3
0 Customer 1 1124 0
1 Customer 2 1124-1123 1
2 Customer 3 1124-1234-1642 2
Setup
df = pd.DataFrame({
'col1': ['Customer 1', 'Customer 2', 'Customer 3', 'Customer 3'],
'col2': ['1124', '1124-1123', '1124-1234-1642', '1213-1234-1642']
})

Pandas Pivot table, how to put a series of columns in the values attribute

First of all, I apologize! It's my first time using stack overflow so I hope I'm doing it right! I searched but can't find what I'm looking for.
I'm also quite new with pandas and python :)
I am going to try to use an example and I will try to be clear.
I have a dataframe with 30 columns that contains information about a shopping cart; one of the columns (Order) has 2 values, either completed or in progress.
And I have about 20 columns with items, let's say apple, orange, bananas... I need to know how many times there is an apple in a completed order and how many in an in-progress order. I decided to use a pivot table with the aggregate function count.
This would be a small example of the dataframe:
Order | apple | orange | banana | pear | pineapple | ... |
-----------|-------|--------|--------|------|-----------|------|
completed | 2 | 4 | 10 | 5 | 1 | |
completed | 5 | 4 | 5 | 8 | 3 | |
iProgress | 3 | 7 | 6 | 5 | 2 | |
completed | 6 | 3 | 1 | 7 | 1 | |
iProgress | 10 | 2 | 2 | 2 | 2 | |
completed | 2 | 1 | 4 | 8 | 1 | |
I have the output I want but what I'm looking for is a more elegant way of selecting lots of columns without having to type them manually.
df.pivot_table(index=['Order'], values=['apple', 'bananas', 'orange', 'pear', 'strawberry',
                                        'mango'], aggfunc='count')
But I want to select around 15 columns, so instead of typing one by one 15 times, I'm sure there is an easy way of doing it by using column numbers or something. Let's say I want to select columns from 6 till 15.
I have tried things like values=[df.columns[6:15]], and I have also tried using df.iloc, but as I said, I'm pretty new, so I'm probably using things wrong or doing something silly!
Is there also a way to keep the columns in their original order? In my output they seem to have been sorted alphabetically, and I want to keep the order of the columns, so it should be apple, orange, banana...
Order Completed In progress
apple 92 221
banana 102 144
mango 70 55
I'm just looking for a way of improving my code and I hope I have not made much mess. Thank you!
I think you can use:
# if you need to select only a few columns - df.columns[1:3]
df = df.pivot_table(columns=['Order'], values=df.columns[1:3], aggfunc='count')
print (df)
Order completed iProgress
apple 4 2
orange 4 2
# if you need all columns, the values parameter can be omitted
df = df.pivot_table(columns=['Order'], aggfunc='count')
print (df)
Order completed iProgress
apple 4 2
banana 4 2
orange 4 2
pear 4 2
pineapple 4 2
See also: What is the difference between size and count in pandas? The same result with len as the aggregate function:
df = df.pivot_table(columns=['Order'], aggfunc=len)
print (df)
Order completed iProgress
apple 4 2
banana 4 2
orange 4 2
pear 4 2
pineapple 4 2
#solution with groupby and transpose
df = df.groupby('Order').count().T
print (df)
Order completed iProgress
apple 4 2
orange 4 2
banana 4 2
pear 4 2
pineapple 4 2
Your example doesn't show an item that is absent from the cart. I'm assuming it comes up as None or 0. If this is correct, then I fill NA values and count how many are greater than 0:
df.set_index('Order').fillna(0).gt(0).groupby(level='Order').sum().T
