I have a dataframe with 100 columns filled with start dates. For each value, I am trying to find the next-greatest date in the same row, to produce another dataframe of end dates. If there is no next date, it should add 1 day.
Start dates:
|       | gas station | park      | beach     | store     |
| Car A | 1/1/2022    | 1/4/2021  | 1/2/2021  | 1/3/2021  |
| Car B | 2/14/2021   | 2/10/2021 | 2/21/2021 | 2/5/2021  |
Stop dates:
|       | gas station | park      | beach     | store     |
| Car A | 1/2/2022    | 1/5/2021  | 1/3/2021  | 1/4/2021  |
| Car B | 2/21/2021   | 2/14/2021 | 2/22/2021 | 2/10/2021 |
Explanation : The “start dates” is the current dataframe. Car A arrived to the column name locations on the dates shown. Same with car B. I want to create a new dataframe (“stop dates”) based on the start dates. Car A gas station start date compared against all other columns to find the next greatest date. That next greatest date will populate the “stop date” dataframe for car A gas station, etc
You can write a custom function that takes in a row as an input, and returns the desired row as a pd.Series, then apply this function to each row using df.apply with the argument axis=1. Also I believe you may have made a typo with your start dates and the first entry should be from the year 2021 as well. Otherwise, the next date after 1/4/2021 in the park column in the same row would be 1/1/2022.
For example:
import numpy as np
import pandas as pd
## recreate your start dates dataframe
df_start = pd.DataFrame(
    columns=['gas station', 'park', 'beach', 'store'],
    data=[
        ['1/1/2021', '1/4/2021', '1/2/2021', '1/3/2021'],
        ['2/14/2021', '2/10/2021', '2/21/2021', '2/5/2021']
    ],
    index=['Car A', 'Car B']
)
for col in df_start.columns:
    df_start[col] = pd.to_datetime(df_start[col])
## custom function that takes in a row as input
## and outputs a row as a series
def get_stop_dates(row):
    sorted_dates = row.unique()
    sorted_dates.sort()
    new_row = []
    for d in row.values:
        idx = np.where(sorted_dates == d)[0][0]
        if idx == len(sorted_dates) - 1:
            new_date = pd.to_datetime(d) + pd.Timedelta("1d")
        else:
            new_date = pd.to_datetime(sorted_dates[idx + 1])
        new_row.append(new_date)
    return pd.Series(new_row)
df_stop = df_start.apply(lambda row: get_stop_dates(row), axis=1)
df_stop.columns = df_start.columns
Input:
>>> df_start
gas station park beach store
Car A 2021-01-01 2021-01-04 2021-01-02 2021-01-03
Car B 2021-02-14 2021-02-10 2021-02-21 2021-02-05
Output:
>>> df_stop
gas station park beach store
Car A 2021-01-02 2021-01-05 2021-01-03 2021-01-04
Car B 2021-02-21 2021-02-14 2021-02-22 2021-02-10
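If the rows are wide, a vectorized variant of the same idea using np.searchsorted avoids the inner Python loop. A sketch, rebuilding the df_start from above (get_stop_dates_v2 is just an illustrative name):

```python
import numpy as np
import pandas as pd

df_start = pd.DataFrame(
    columns=['gas station', 'park', 'beach', 'store'],
    data=[
        ['1/1/2021', '1/4/2021', '1/2/2021', '1/3/2021'],
        ['2/14/2021', '2/10/2021', '2/21/2021', '2/5/2021']
    ],
    index=['Car A', 'Car B']
)
df_start = df_start.apply(pd.to_datetime)

def get_stop_dates_v2(row):
    dates = np.sort(row.unique())                           # sorted unique dates in this row
    idx = np.searchsorted(dates, row.values, side='right')  # index of the first strictly greater date
    nxt = np.where(idx < len(dates),
                   dates[np.minimum(idx, len(dates) - 1)],  # the next-greatest date
                   row.values + np.timedelta64(1, 'D'))     # no next date: add one day
    return pd.Series(nxt, index=row.index)

df_stop = df_start.apply(get_stop_dates_v2, axis=1)
```

Since the unique dates are sorted, `searchsorted(..., side='right')` lands exactly on the next-greatest entry, and the clamp with `np.minimum` only matters for the row maximum, which takes the "+1 day" branch anyway.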
I am being provided with a data set and I am writing a function.
My objective is quite simple: I have an Airbnb database with various columns. I am using a for loop over a neighbourhood group list (that I created) and I am trying to extract (append) the data related to each element into an empty dataframe.
Example:
import pandas as pd
import numpy as np
dict1 = {
    'id': [2539, 2595, 3647, 3831, 12937, 18198, 258838, 258876, 267535, 385824],
    'name': ['Clean & quiet apt home by the park', 'Skylit Midtown Castle',
             'THE VILLAGE OF HARLEM....NEW YORK !', 'Cozy Entire Floor of Brownstone',
             '1 Stop fr. Manhattan! Private Suite,Landmark Block', 'Little King of Queens',
             'Oceanview,close to Manhattan', 'Affordable rooms,all transportation',
             'Home Away From Home-Room in Bronx', 'New York City- Riverdale Modern two bedrooms unit'],
    'price': [149, 225, 150, 89, 130, 70, 250, 50, 50, 120],
    'neighbourhood_group': ['Brooklyn', 'Manhattan', 'Manhattan', 'Brooklyn', 'Queens',
                            'Queens', 'Staten Island', 'Staten Island', 'Bronx', 'Bronx']
}
df = pd.DataFrame(dict1)
df
I created a function as follows
nbd_grp = ['Bronx','Queens','Staten Islands','Brooklyn','Manhattan']
# Creating a function to find the cheapest place in neighbourhood group
dfdf = pd.DataFrame(columns=['id', 'name', 'price', 'neighbourhood_group'])
def cheapest_place(neighbourhood_group):
    for elem in nbd_grp:
        data = df.loc[df['neighbourhood_group'] == elem]
        cheapest = data.loc[data['price'] == min(data['price'])]
        dfdf = cheapest.copy()
cheapest_place(nbd_grp)
My expected output is:
| id     | name                                | price | neighbourhood_group |
| 267535 | Home Away From Home-Room in Bronx   | 50    | Bronx               |
| 18198  | Little King of Queens               | 70    | Queens              |
| 258876 | Affordable rooms,all transportation | 50    | Staten Island       |
| 3831   | Cozy Entire Floor of Brownstone     | 89    | Brooklyn            |
| 3647   | THE VILLAGE OF HARLEM....NEW YORK ! | 150   | Manhattan           |
My advice is that anytime you are working in a database or in a dataframe and you think "I need to loop", you should think again.
When in a dataframe you are in a world of set-based logic and there is likely a better set-based way of solving the problem. In your case you can groupby() your neighbourhood_group and get the min() of the price column and then merge or join that result set back to your original dataframe to get your id and name columns.
That would look something like:
df_min_price = df.groupby('neighbourhood_group').price.agg(min).reset_index().merge(df, on=['neighbourhood_group','price'])
+-----+---------------------+-------+--------+-------------------------------------+
| idx | neighbourhood_group | price | id | name |
+-----+---------------------+-------+--------+-------------------------------------+
| 0 | Bronx | 50 | 267535 | Home Away From Home-Room in Bronx |
| 1 | Brooklyn | 89 | 3831 | Cozy Entire Floor of Brownstone |
| 2 | Manhattan | 150 | 3647 | THE VILLAGE OF HARLEM....NEW YORK ! |
| 3 | Queens | 70 | 18198 | Little King of Queens |
| 4 | Staten Island | 50 | 258876 | Affordable rooms,all transportation |
+-----+---------------------+-------+--------+-------------------------------------+
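A shorter variant of the same set-based idea: groupby().idxmin() returns the row label of the minimum price per group, so a single .loc lookup replaces the merge. A sketch on a trimmed-down copy of the question's data:

```python
import pandas as pd

# trimmed-down version of the question's data
df = pd.DataFrame({
    'id': [2539, 2595, 3831, 18198, 267535],
    'name': ['Clean & quiet apt home by the park', 'Skylit Midtown Castle',
             'Cozy Entire Floor of Brownstone', 'Little King of Queens',
             'Home Away From Home-Room in Bronx'],
    'price': [149, 225, 89, 70, 50],
    'neighbourhood_group': ['Brooklyn', 'Manhattan', 'Brooklyn', 'Queens', 'Bronx'],
})

# index labels of the cheapest row per group, then one .loc lookup
cheapest = df.loc[df.groupby('neighbourhood_group')['price'].idxmin()]
```

One design difference: idxmin keeps exactly one row per group when prices tie, whereas the merge approach keeps all tied rows.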
I am attempting to solve the lab below and am having issues. The problem involves a CSV input, and there are criteria the solution needs to meet. Any help or tips at all would be appreciated. My code is at the end of the problem, along with my output.
Each row contains the title, rating, and all showtimes of a unique movie.
A space is placed before and after each vertical separator ('|') in each row.
Column 1 displays the movie titles and is left justified with a minimum of 44 characters.
If the movie title has more than 44 characters, output the first 44 characters only.
Column 2 displays the movie ratings and is right justified with a minimum of 5 characters.
Column 3 displays all the showtimes of the same movie, separated by a space.
This is the input:
16:40,Wonders of the World,G
20:00,Wonders of the World,G
19:00,End of the Universe,NC-17
12:45,Buffalo Bill And The Indians or Sitting Bull's History Lesson,PG
15:00,Buffalo Bill And The Indians or Sitting Bull's History Lesson,PG
19:30,Buffalo Bill And The Indians or Sitting Bull's History Lesson,PG
10:00,Adventure of Lewis and Clark,PG-13
14:30,Adventure of Lewis and Clark,PG-13
19:00,Halloween,R
This is the expected output:
Wonders of the World | G | 16:40 20:00
End of the Universe | NC-17 | 19:00
Buffalo Bill And The Indians or Sitting Bull | PG | 12:45 15:00 19:30
Adventure of Lewis and Clark | PG-13 | 10:00 14:30
Halloween | R | 19:00
My code so far:
import csv
rawMovies = input()
repeatList = []
with open(rawMovies, 'r') as movies:
    moviesList = csv.reader(movies)
    for movie in moviesList:
        time = movie[0]
        #print(time)
        show = movie[1]
        if len(show) > 45:
            show = show[0:44]
        #print(show)
        rating = movie[2]
        #print(rating)
        print('{0: <44} | {1: <6} | {2}'.format(show, rating, time))
My output doesn't have the rating aligned to the right and I have no idea how to filter for repeated movies without removing the time portion of the list:
Wonders of the World | G | 16:40
Wonders of the World | G | 20:00
End of the Universe | NC-17 | 19:00
Buffalo Bill And The Indians or Sitting Bull | PG | 12:45
Buffalo Bill And The Indians or Sitting Bull | PG | 15:00
Buffalo Bill And The Indians or Sitting Bull | PG | 19:30
Adventure of Lewis and Clark | PG-13 | 10:00
Adventure of Lewis and Clark | PG-13 | 14:30
Halloween | R | 19:00
You could collect the input data in a dictionary, with the title-rating-tuples as keys and the showtimes collected in a list, and then print the consolidated information. For example (you have to adjust the filename):
import csv
movies = {}
with open("data.csv", "r") as file:
    for showtime, title, rating in csv.reader(file):
        movies.setdefault((title, rating), []).append(showtime)
for (title, rating), showtimes in movies.items():
    print(f"{title[:44]: <44} | {rating: >5} | {' '.join(showtimes)}")
Output:
Wonders of the World | G | 16:40 20:00
End of the Universe | NC-17 | 19:00
Buffalo Bill And The Indians or Sitting Bull | PG | 12:45 15:00 19:30
Adventure of Lewis and Clark | PG-13 | 10:00 14:30
Halloween | R | 19:00
Since the input seems to come in connected blocks you could also use itertools.groupby (from the standard library) and print while reading:
import csv
from itertools import groupby
from operator import itemgetter
with open("data.csv", "r") as file:
    for (title, rating), group in groupby(
        csv.reader(file), key=itemgetter(1, 2)
    ):
        showtimes = " ".join(time for time, *_ in group)
        print(f"{title[:44]: <44} | {rating: >5} | {showtimes}")
For this, consider the maximum length of the rating string ("NC-17", 5 characters). Subtract the length of the rating from that value, make a string of spaces of that length, and prepend it to the rating.
So basically:
your_desired_str = ' ' * (5 - len(rating)) + rating
Also, just replace
'somestr {}'.format(value)
with f-strings, which are much easier to read:
f'somestr {value}'
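The manual padding, str.rjust, and the f-string format spec are all equivalent ways to right-justify. A quick sketch:

```python
for rating in ("G", "PG-13", "NC-17"):
    manual = ' ' * (5 - len(rating)) + rating  # pad to the longest rating (5 chars)
    # all three spellings produce the same right-justified string
    assert manual == rating.rjust(5) == f"{rating: >5}"

print(f"{'G': >5}|")  # prints '    G|'
```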
Below is what I ended up with after some tips from the community.
import csv

rawMovies = input()
outputList = []
with open(rawMovies, 'r') as movies:
    moviesList = csv.reader(movies)
    movieold = [' ', ' ', ' ']
    for movie in moviesList:
        if movieold[1] == movie[1]:
            # same title as the previous row: append this showtime to the last entry
            outputList[-1][2] += ' ' + movie[0]
        else:
            time = movie[0]
            show = movie[1]
            if len(show) > 45:
                show = show[0:44]
            rating = movie[2]
            outputList.append([show, rating, time])
        movieold = movie
for movie in outputList:
    print('{0: <44} | {1: >5} | {2}'.format(movie[0], movie[1], movie[2]))
I would use Python's groupby() function for this which helps you to group consecutive rows with the same value.
For example:
import csv
from itertools import groupby

with open('movies.csv') as f_movies:
    csv_movies = csv.reader(f_movies)
    for title, entries in groupby(csv_movies, key=lambda x: x[1]):
        movies = list(entries)
        showtimes = ' '.join(row[0] for row in movies)
        rating = movies[0][2]
        print(f"{title[:44]: <44} | {rating: >5} | {showtimes}")
Giving you:
Wonders of the World | G | 16:40 20:00
End of the Universe | NC-17 | 19:00
Buffalo Bill And The Indians or Sitting Bull | PG | 12:45 15:00 19:30
Adventure of Lewis and Clark | PG-13 | 10:00 14:30
Halloween | R | 19:00
So how does groupby() work?
When reading a CSV file you get one row at a time. What groupby() does is gather consecutive rows into mini-groups of rows that share the same value. The value it compares is given by the key parameter; here the lambda is passed a row at a time and returns x[1], which is the title. groupby() keeps reading rows until that value changes, then hands you the current group as the entries iterator.
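A tiny illustration of that behaviour (the data here is made up): groupby() only merges runs of consecutive equal keys, so a repeated key that is not adjacent starts a new group.

```python
from itertools import groupby

data = [("a", 1), ("a", 2), ("b", 3), ("a", 4)]
grouped = [(key, [v for _, v in grp]) for key, grp in groupby(data, key=lambda t: t[0])]
print(grouped)  # [('a', [1, 2]), ('b', [3]), ('a', [4])] -- the second 'a' run is a separate group
```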
This approach does assume that the rows you wish to group are consecutive in the file. You could even write your own kind of group-by generator function:
def group_by_title(rows):
    title = None
    entries = []
    for row in rows:
        if title and row[1] != title:
            yield title, entries
            entries = []
        title = row[1]
        entries.append(row)
    if entries:
        yield title, entries
with open('movies.csv') as f_movies:
    csv_movies = csv.reader(f_movies)
    for title, entries in group_by_title(csv_movies):
        showtimes = ' '.join(row[0] for row in entries)
        rating = entries[0][2]
        print(f"{title[:44]: <44} | {rating: >5} | {showtimes}")
I have a dataframe like the example below:
Type   | date
Apple  | 01/01/2021
Apple  | 10/02/2021
Orange | 05/01/2021
Orange | 20/20/2020
Is there an easy way to transform the data as below?
Type   | date
Apple  | 01/01/2021 | 10/02/2021
Orange | 05/01/2021 | 20/20/2020
The stack function does not match my requirement.
You could group by "type", collect the "date" values and make a new dataframe.
import pandas as pd

df = pd.DataFrame({'type': ['Apple', 'Apple', 'Orange', 'Orange'],
                   'date': ['01/01/2021', '10/02/2021', '05/01/2021', '20/20/2020']})
d = {}
for fruit, group in df.groupby('type'):
    d[fruit] = group.date.values
pd.DataFrame(d).T
0 1
Apple 01/01/2021 10/02/2021
Orange 05/01/2021 20/20/2020
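A loop-free variant of the same idea: number the dates within each type with cumcount, then pivot from long to wide (a sketch; column names follow the question):

```python
import pandas as pd

df = pd.DataFrame({'type': ['Apple', 'Apple', 'Orange', 'Orange'],
                   'date': ['01/01/2021', '10/02/2021', '05/01/2021', '20/20/2020']})

# number the dates within each type, then pivot from long to wide;
# groups of unequal size are padded with NaN
wide = (df.assign(n=df.groupby('type').cumcount())
          .pivot(index='type', columns='n', values='date'))
print(wide)
```

Unlike building the dict by hand, this also works when the groups have different numbers of dates.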
I've a pandas dataframe as below:
driver_id | trip_id | pickup_from | drop_off_to | date
1 | 3 | stop 1 | city1 | 2018-02-04
1 | 7 | city 2 | city3 | 2018-02-04
1 | 4 | city 4 | stop1 | 2018-02-04
2 | 8 | stop 1 | city7 | 2018-02-06
2 | 9 | city 8 | stop1 | 2018-02-06
2 | 12 | stop 1 | city5 | 2018-02-07
2 | 10 | city 3 | city1 | 2018-02-07
2 | 1 | city 4 | city7 | 2018-02-07
2 | 6 | city 2 | stop1 | 2018-02-07
I want to calculate the longest trip for each driver between "stop 1" in the pickup_from column and "stop 1" in the drop_off_to column. For example, driver 1 started from stop 1, went to city 1, then city 2, then city 3, then city 4, and then back to stop 1, so the max number of cities he visited is 4.
Driver 2 started from stop 1, went to city 7 and city 8, then returned to stop 1, so he visited 2 cities. He then started from stop 1 again and visited city 5, city 3, city 1, city 4, city 7 and city 2 before going back to stop 1, which is 6 cities. So for driver 2 the max number of cities visited is 6. The date doesn't matter for the calculation.
How can I do this using Pandas?
Define the following function, computing the longest trip for a driver:
def maxTrip(grp):
    # flatten the pickup/drop-off pairs into one city sequence
    trip = pd.DataFrame({'city': grp[['pickup_from', 'drop_off_to']]
                         .values.reshape(1, -1).squeeze()})
    # each 'stop...' row starts a new sub-trip; count unique cities per sub-trip
    return trip.groupby(trip.city.str.match('stop').cumsum())\
        .apply(lambda grp2: grp2.drop_duplicates().city.size).max() - 1
Then apply it:
result = df.groupby('driver_id').apply(maxTrip)
The result, for your data sample, is:
driver_id
1 4
2 6
dtype: int64
Note: It is up to you whether you want to eliminate repeated cities during one sub-trip (from leaving the stop to the return). I assumed that they are to be eliminated; if you don't want this, drop .drop_duplicates() from my code. For your data sample this does not matter, since within each sub-trip the city names are unique. But it can happen that a driver visits a city, then goes to another city, and sometime later (but before returning to the stop) visits the same city again.
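The key trick above is str.match('stop').cumsum(): every row whose city starts with "stop" bumps a counter, so all rows of one stop-to-stop leg share a group id. A small illustration on a made-up sequence:

```python
import pandas as pd

city = pd.Series(['stop 1', 'city7', 'city 8', 'stop1', 'stop 1', 'city5'])
labels = city.str.match('stop').cumsum()  # True at each stop row, cumulative sum labels the legs
print(list(labels))  # [1, 1, 1, 2, 3, 3]
```

Grouping on these labels then isolates each leg, and subtracting 1 in maxTrip removes the stop itself from the city count.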
This could also be done with a plain function. Assuming the name of your dataframe above is df, consider this a continuation (note that with the default RangeIndex, driver 1 occupies rows 0-2 and driver 2 rows 3-8):
drop_cities1 = list(df.loc[0:2, "drop_off_to"])  # all driver 1 drop-offs
pick_cities1 = list(df.loc[0:2, "pickup_from"])  # all driver 1 pickups
drop_cities2 = list(df.loc[3:, "drop_off_to"])
pick_cities2 = list(df.loc[3:, "pickup_from"])

def max_trips(pick_list, drop_list):
    longest = 0
    cities = set()  # unique cities seen on the current stop-to-stop leg
    for pick, drop in zip(pick_list, drop_list):
        for place in (pick, drop):
            if place.replace(' ', '') == 'stop1':  # matches both 'stop 1' and 'stop1'
                longest = max(longest, len(cities))
                cities = set()
            else:
                cities.add(place)
    return max(longest, len(cities))

max_trips(pick_cities1, drop_cities1)  # 4
max_trips(pick_cities2, drop_cities2)  # 6
I have a two different datasets:
users:
+-------+---------+--------+
|user_id| movie_id|timestep|
+-------+---------+--------+
| 100 | 1000 |20200728|
| 101 | 1001 |20200727|
| 101 | 1002 |20200726|
+-------+---------+--------+
movies:
+--------+---------+--------------------------+
|movie_id| title | genre |
+--------+---------+--------------------------+
| 1000 |Toy Story|Adventure|Animation|Chil..|
| 1001 | Jumanji |Adventure|Children|Fantasy|
| 1002 | Iron Man|Action|Adventure|Sci-Fi |
+--------+---------+--------------------------+
How can I get a dataset in the following format, so that I have each user's taste profile and can compare different users by a similarity score?
+-------+---------+--------+---------+---------+-----+
|user_id| Action |Adventure|Animation|Children|Drama|
+-------+---------+--------+---------+---------+-----+
| 100 | 0 | 1 | 1 | 1 | 0 |
| 101 | 1 | 1 | 0 | 1 | 0 |
+-------+---------+---------+---------+--------+-----+
Where df is the movies dataframe and dfu is the users dataframe
The 'genre' column needs to be split into a list with pandas.Series.str.split, and then using pandas.DataFrame.explode, transform each element of the list into a row, replicating index values.
pandas.merge the two dataframes on 'movie_id'
Use pandas.DataFrame.groupby on 'user_id' and 'genre' and aggregate by count.
Shape the final result:
.unstack converts the grouped dataframe from long to wide format
.fillna replaces NaN with 0
.astype casts the numeric values from float to int
Tested in python 3.10, pandas 1.4.3
import pandas as pd
# data
movies = {'movie_id': [1000, 1001, 1002],
          'title': ['Toy Story', 'Jumanji', 'Iron Man'],
          'genre': ['Adventure|Animation|Children', 'Adventure|Children|Fantasy',
                    'Action|Adventure|Sci-Fi']}
users = {'user_id': [100, 101, 101],
         'movie_id': [1000, 1001, 1002],
         'timestep': [20200728, 20200727, 20200726]}
# set up dataframes
df = pd.DataFrame(movies)
dfu = pd.DataFrame(users)
# split the genre column strings at '|' to make lists
df.genre = df.genre.str.split('|')
# explode the lists in genre
df = df.explode('genre', ignore_index=True)
# merge df with dfu
dfm = pd.merge(dfu, df, on='movie_id')
# groupby, count and unstack
final = dfm.groupby(['user_id', 'genre'])['genre'].count().unstack(level=1).fillna(0).astype(int)
# display(final)
genre Action Adventure Animation Children Fantasy Sci-Fi
user_id
100 0 1 1 1 0 0
101 1 2 0 1 1 1
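The question also asks about comparing users by a similarity score. One common choice is cosine similarity over the genre-count rows; a sketch, with the profile table recreated from the output above:

```python
import numpy as np
import pandas as pd

# genre-count profiles, as produced by the groupby/unstack above
final = pd.DataFrame(
    [[0, 1, 1, 1, 0, 0],
     [1, 2, 0, 1, 1, 1]],
    index=pd.Index([100, 101], name='user_id'),
    columns=['Action', 'Adventure', 'Animation', 'Children', 'Fantasy', 'Sci-Fi'])

# normalize each row to unit length; the dot product of unit rows is the cosine similarity
profiles = final.to_numpy(dtype=float)
unit = profiles / np.linalg.norm(profiles, axis=1, keepdims=True)
similarity = pd.DataFrame(unit @ unit.T, index=final.index, columns=final.index)
print(similarity.round(3))
```

Each diagonal entry is 1.0 (a user compared with themselves), and off-diagonal entries range from 0 (no shared genres) to 1 (identical taste direction).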