Best way to remove specific words from column in pandas dataframe? - python

I'm working with a huge set of data that I can't work with in excel so I'm using Pandas/Python, but I'm relatively new to it. I have this column of book titles that also include genres, both before and after the title. I only want the column to contain book titles, so what would be the easiest way to remove the genres?
Here is an example of what the column contains:
Book Labels
Science Fiction | Drama | Dune
Thriller | Mystery | The Day I Died
Thriller | Razorblade Tears | Family | Drama
Comedy | How To Marry Keanu Reeves In 90 Days | Drama
...
So above, the book titles would be Dune, The Day I Died, Razorblade Tears, and How To Marry Keanu Reeves In 90 Days, but as you can see the genres precede as well as succeed the titles.
I was thinking I could create a list of all the genres (as there are only so many) and remove those from the column along with the "|" characters, but if anyone has suggestions on a simpler way to remove the genres and "|" key, please help me out.

It is an enhancement to #tdy Regex solution. The original regex Family|Drama will match the words "Family" and "Drama" in the string. If the book title contains the words in gernes, the words will be removed as well.
Supposed that the labels are separated by " | ", there are three match conditions we want to remove.
Gerne at start of string. e.g. Drama | ...
Gerne in the middle. e.g. ... | Drama | ...
Gerne at end of string. e.g. ... | Drama
Use regex (^|\| )(?:Family|Drama)(?=( \||$)) to match one of three conditions. Note that | Drama | Family has 2 overlapped matches, here I use ?=( \||$) to avoid matching once only. See this problem [Use regular expressions to replace overlapping subpatterns] for more details.
>>> genres = ["Family", "Drama"]
>>> df
# Book Labels
# 0 Drama | Drama 123 | Family
# 1 Drama 123 | Drama | Family
# 2 Drama | Family | Drama 123
# 3 123 Drama 123 | Family | Drama
# 4 Drama | Family | 123 Drama
>>> re_str = "(^|\| )(?:{})(?=( \||$))".format("|".join(genres))
>>> df['Book Labels'] = df['Book Labels'].str.replace(re_str, "", regex=True)
# 0 | Drama 123
# 1 Drama 123
# 2 | Drama 123
# 3 123 Drama 123
# 4 | 123 Drama
>>> df["Book Labels"] = df["Book Labels"].str.strip("| ")
# 0 Drama 123
# 1 Drama 123
# 2 Drama 123
# 3 123 Drama 123
# 4 123 Drama

Related

How to write a Function in python pandas to append the rows in dataframe in a loop?

I am being provided with a data set and i am writing a function.
my objectice is quiet simple. I have a air bnb data base with various columns my onjective is simple. I am using a for loop over neighbourhood group list (that i created) and i am trying to extract (append) the data related to that particular element in a empty dataframe.
Example:
import pandas as pd
import numpy as np
dict1 = {'id' : [2539,2595,3647,3831,12937,18198,258838,258876,267535,385824],'name':['Clean & quiet apt home by the park','Skylit Midtown Castle','THE VILLAGE OF HARLEM....NEW YORK !','Cozy Entire Floor of Brownstone','1 Stop fr. Manhattan! Private Suite,Landmark Block','Little King of Queens','Oceanview,close to Manhattan','Affordable rooms,all transportation','Home Away From Home-Room in Bronx','New York City- Riverdale Modern two bedrooms unit'],'price':[149,225,150,89,130,70,250,50,50,120],'neighbourhood_group':['Brooklyn','Manhattan','Manhattan','Brooklyn','Queens','Queens','Staten Island','Staten Island','Bronx','Bronx']}
df = pd.DataFrame(dict1)
df
I created a function as follows
nbd_grp = ['Bronx','Queens','Staten Islands','Brooklyn','Manhattan']
# Creating a function to find the cheapest place in neighbourhood group
dfdf = pd.DataFrame(columns = ['id','name','price','neighbourhood_group'])
def cheapest_place(neighbourhood_group):
for elem in nbd_grp:
data = df.loc[df['neighbourhood_group']==elem]
cheapest = data.loc[data['price']==min(data['price'])]
dfdf = cheapest.copy()
cheapest_place(nbd_grp)
My Expected Output is :
id
name
Price
neighbourhood group
267535
Home Away From Home-Room in Bronx
50
Bronx
18198
Little King of Queens
70
Queens
258876
Affordable rooms,all transportation
50
Staten Island
3831
Cozy Entire Floor of Brownstone
89
Brooklyn
3647
THE VILLAGE OF HARLEM....NEW YORK !
150
Manhattan
My advice is that anytime you are working in a database or in a dataframe and you think "I need to loop", you should think again.
When in a dataframe you are in a world of set-based logic and there is likely a better set-based way of solving the problem. In your case you can groupby() your neighbourhood_group and get the min() of the price column and then merge or join that result set back to your original dataframe to get your id and name columns.
That would look something like:
df_min_price = df.groupby('neighbourhood_group').price.agg(min).reset_index().merge(df, on=['neighbourhood_group','price'])
+-----+---------------------+-------+--------+-------------------------------------+
| idx | neighbourhood_group | price | id | name |
+-----+---------------------+-------+--------+-------------------------------------+
| 0 | Bronx | 50 | 267535 | Home Away From Home-Room in Bronx |
| 1 | Brooklyn | 89 | 3831 | Cozy Entire Floor of Brownstone |
| 2 | Manhattan | 150 | 3647 | THE VILLAGE OF HARLEM....NEW YORK ! |
| 3 | Queens | 70 | 18198 | Little King of Queens |
| 4 | Staten Island | 50 | 258876 | Affordable rooms,all transportation |
+-----+---------------------+-------+--------+-------------------------------------+

Handle csv file with almost similar records but different times - need to group them as one record

I am attempting to resolve the below lab and having issues. This problem involves a csv input. There is criteria that the solution needs to meet. Any help or tips at all would be appreciated. My code is at the end of the problem along with my output.
Each row contains the title, rating, and all showtimes of a unique movie.
A space is placed before and after each vertical separator ('|') in each row.
Column 1 displays the movie titles and is left justified with a minimum of 44 characters.
If the movie title has more than 44 characters, output the first 44 characters only.
Column 2 displays the movie ratings and is right justified with a minimum of 5 characters.
Column 3 displays all the showtimes of the same movie, separated by a space.
This is the input:
16:40,Wonders of the World,G
20:00,Wonders of the World,G
19:00,End of the Universe,NC-17
12:45,Buffalo Bill And The Indians or Sitting Bull's History Lesson,PG
15:00,Buffalo Bill And The Indians or Sitting Bull's History Lesson,PG
19:30,Buffalo Bill And The Indians or Sitting Bull's History Lesson,PG
10:00,Adventure of Lewis and Clark,PG-13
14:30,Adventure of Lewis and Clark,PG-13
19:00,Halloween,R
This is the expected output:
Wonders of the World | G | 16:40 20:00
End of the Universe | NC-17 | 19:00
Buffalo Bill And The Indians or Sitting Bull | PG | 12:45 15:00 19:30
Adventure of Lewis and Clark | PG-13 | 10:00 14:30
Halloween | R | 19:00
My code so far:
import csv
rawMovies = input()
repeatList = []
with open(rawMovies, 'r') as movies:
moviesList = csv.reader(movies)
for movie in moviesList:
time = movie[0]
#print(time)
show = movie[1]
if len(show) > 45:
show = show[0:44]
#print(show)
rating = movie[2]
#print(rating)
print('{0: <44} | {1: <6} | {2}'.format(show, rating, time))
My output doesn't have the rating aligned to the right and I have no idea how to filter for repeated movies without removing the time portion of the list:
Wonders of the World | G | 16:40
Wonders of the World | G | 20:00
End of the Universe | NC-17 | 19:00
Buffalo Bill And The Indians or Sitting Bull | PG | 12:45
Buffalo Bill And The Indians or Sitting Bull | PG | 15:00
Buffalo Bill And The Indians or Sitting Bull | PG | 19:30
Adventure of Lewis and Clark | PG-13 | 10:00
Adventure of Lewis and Clark | PG-13 | 14:30
Halloween | R | 19:00
You could collect the input data in a dictionary, with the title-rating-tuples as keys and the showtimes collected in a list, and then print the consolidated information. For example (you have to adjust the filename):
import csv
movies = {}
with open("data.csv", "r") as file:
for showtime, title, rating in csv.reader(file):
movies.setdefault((title, rating), []).append(showtime)
for (title, rating), showtimes in movies.items():
print(f"{title[:44]: <44} | {rating: >5} | {' '.join(showtimes)}")
Output:
Wonders of the World | G | 16:40 20:00
End of the Universe | NC-17 | 19:00
Buffalo Bill And The Indians or Sitting Bull | PG | 12:45 15:00 19:30
Adventure of Lewis and Clark | PG-13 | 10:00 14:30
Halloween | R | 19:00
Since the input seems to come in connected blocks you could also use itertools.groupby (from the standard library) and print while reading:
import csv
from itertools import groupby
from operator import itemgetter
with open("data.csv", "r") as file:
for (title, rating), group in groupby(
csv.reader(file), key=itemgetter(1, 2)
):
showtimes = " ".join(time for time, *_ in group)
print(f"{title[:44]: <44} | {rating: >5} | {showtimes}")
For this consider the max length of the rating string. Subtract the length of the rating from that value. Make a string of spaces of that length and append the rating.
so basically
your_desired_str = ' '*(6-len(Rating))+Rating
also just replace
'somestr {value}'.format(value)
with f strings, much easier to read
f'somestr {value}'
Below is what I ended up with after some tips from the community.
rawMovies = input()
outputList = []
with open(rawMovies, 'r') as movies:
moviesList = csv.reader(movies)
movieold = [' ', ' ', ' ']
for movie in moviesList:
if movieold[1] == movie[1]:
outputList[-1][2] += ' ' + movie[0]
else:
time = movie[0]
# print(time)
show = movie[1]
if len(show) > 45:
show = show[0:44]
# print(show)
rating = movie[2]
outputList.append([show, rating, time])
movieold = movie
# print(rating)
#print(outputList)
for movie in outputList:
print('{0: <44} | {1: <5} | {2}'.format(movie[0], movie[1].rjust(5), movie[2]))
I would use Python's groupby() function for this which helps you to group consecutive rows with the same value.
For example:
import csv
from itertools import groupby
with open('movies.csv') as f_movies:
csv_movies = csv.reader(f_movies)
for title, entries in groupby(csv_movies, key=lambda x: x[1]):
movies = list(entries)
showtimes = ' '.join(row[0] for row in movies)
rating = movies[0][2]
print(f"{title[:44]: <44} | {rating: >5} | {showtimes}")
Giving you:
Wonders of the World | G | 16:40 20:00
End of the Universe | NC-17 | 19:00
Buffalo Bill And The Indians or Sitting Bull | PG | 12:45 15:00 19:30
Adventure of Lewis and Clark | PG-13 | 10:00 14:30
Halloween | R | 19:00
So how does groupby() work?
When reading a CSV file you will get a row at a time. What groupby() does is to group rows together into mini-lists containing rows which have the same value. The value it looks for is given using the key parameter. In this case the lambda function is passed a row at a time and it returns the current value of x[1] which is the title. groupby() keeps reading rows until that value changes. It then returns the current list as entries as an iterator.
This approach does assume that the rows you wish to group are in consecutive rows in the file. You could even write you own kind of group by generator function:
def group_by_title(csv):
title = None
entries = []
for row in csv:
if title and row[1] != title:
yield title, entries
entries = []
title = row[1]
entries.append(row)
if entries:
yield title, entries
with open('movies.csv') as f_movies:
csv_movies = csv.reader(f_movies)
for title, entries in group_by_title(csv_movies):
showtimes = ' '.join(row[0] for row in entries)
rating = entries[0][2]
print(f"{title[:44]: <44} | {rating: >5} | {showtimes}")

How to Insert several separate characters in a sqlite3 cell?

I want to insert several different values ​​in just one cell
E.g.
Friends' names
ID | Grade | Names
----+--------------+----------------------------
1 | elementary | Kai, Matthew, Grace
2 | guidance | Eli, Zoey, David, Nora, William
3 | High school | Emma, James, Levi, Sophia
Or as a list or dictionary:
ID | Grade | Names
----+--------------+------------------------------
1 | elementary | [Kai, Matthew, Grace]
2 | guidance | [Eli, Zoey, David, Nora, William]
3 | High school | [Emma, James, Levi, Sophia]
or
ID | Grade | Names
----+--------------+---------------------------------------------
1 | elementary | { a:Kai, b:Matthew, c:Grace}
2 | guidance | { a:Eli, b:Zoey, c:David, d:Nora, e:William}
3 | High school | { a:Emma, b:James, c:Levi, d:Sophia}
Is there a way?
Yes there is a way, but that doesn't mean you should do it this way.
You could for example save your values as a json string and save them inside the column. If you later want to add a value you can simply parse the json, add the value and put it back into the database. (Might also work with a BLOB, but I'm not sure)
However, I would not recommend saving a list inside of a column, as SQL is not meant to be used like that.
What I would recommend is that you have a table and for every grade with its own primary key. Like this:
ID
Grade
1
Elementary
2
Guidance
3
High school
And then another table containing all the names, having its own primary key and the gradeId as its secondary key. E.g:
ID
GradeID
Name
1
1
Kai
2
1
Matthew
3
1
Grace
4
2
Eli
5
2
Zoey
6
2
David
7
2
Nora
8
2
William
9
3
Emma
10
3
James
11
3
Levia
12
3
Sophia
If you want to know more about this, you should read about Normalization in SQL.

How can I split a text between commas and make each new part a new column?

So, I have a Pandas dataframe with food and cuisine some people like. I have to break those into columns, so each food or cuisine should become a column. Each food/cuisine comes afer a comma, but if i break my string only by commas, I'll lost the content inside the parenthesis, which should be there, close to the dish. I think I should use '),' as a separator, right? But I don't know how to do that. This is my DF:
>>> PD_FOODS
USER_ID | FOODS_I_LIKE |
_______________________________________________________________________________
0 100 | Pizza(without garlic, tomatos and onion),pasta |
1 101 | Seafood,veggies |
2 102 | Indian food (no pepper, no curry),mexican food(no pepper) |
3 103 | Texmex, african food, japanese food,italian food |
4 104 | Seafood(no shrimps, no lobster),italian food(no gluten, no milk)|
Is it possible to get a result like this bellow?
>>> PD_FOODS
USER_ID | FOODS_I_LIKE_1 | FOODS_I_LIKE_2 |
_______________________________________________________________________________
0 100 | Pizza(without garlic, tomatos and onion)| pasta |
Thank you!
Try this:
df=pd.DataFrame({"User_ID":[1000,1001,1002,1003,1004],
"FOODS_I_LIKE":['Pizza(without garlic, tomatos and onion),pasta',
'Seafood,veggies',
'Indian food (no pepper, no curry),mexican food(no pepper)',
'Texmex, african food, japanese food,italian food',
'Seafood(no shrimps, no lobster),italian food(no gluten, no milk)']})
def my_func(my_string, item_num):
try:
if ')' in my_string:
if item_num == 0:
return my_string.split('),')[item_num]+')'
else:
return my_string.split('),')[item_num]
else:
return my_string.split(',')[item_num]
except IndexError:
return np.nan
for k in range(0,4):
K=str(k+1)
df[f'FOODS_I_LIKE_{K}']=df.FOODS_I_LIKE.apply(lambda x: my_func(x, k))
df.drop(columns='FOODS_I_LIKE')
Output:
User_ID
FOODS_I_LIKE_1
FOODS_I_LIKE_2
FOODS_I_LIKE_3
FOODS_I_LIKE_4
1000
Pizza(without garlic, tomatos and onion)
pasta
NaN
NaN
1001
Seafood
veggies
NaN
NaN
1002
Indian food (no pepper, no curry)
mexican food(no pepper)
NaN
NaN
1003
Texmex
african food
japanese food
italian food
1004
Seafood(no shrimps, no lobster)
italian food(no gluten, no milk)
NaN
NaN
You could use a regex with a negative lookahead:
(df['FOODS_I_LIKE'].str.split(',\s*(?![^()]*\))', expand=True)
.rename(columns=lambda x: int(x)+1)
.add_prefix('FOODS_I_LIKE_')
)
output:
FOODS_I_LIKE_1 FOODS_I_LIKE_2 FOODS_I_LIKE_3 FOODS_I_LIKE_4
0 Pizza(without garlic, tomatos and onion) pasta None None
1 Seafood veggies None None
2 Indian food (no pepper, no curry) mexican food(no pepper) None None
3 Texmex african food japanese food italian food
4 Seafood(no shrimps, no lobster) italian food(no gluten, no milk) None None
You can test the regex here
NB. this won't work on nested parenthesis, you would need to use a parser

How to get all the unique words in the data frame?

I have a dataframe with a list of products and its respective review
+---------+------------------------------------------------+
| product | review |
+---------+------------------------------------------------+
| product_a | It's good for a casual lunch |
+---------+------------------------------------------------+
| product_b | Avery is one of the most knowledgable baristas |
+---------+------------------------------------------------+
| product_c | The tour guide told us the secrets |
+---------+------------------------------------------------+
How can I get all the unique words in the data frame?
I made a function:
def count_words(text):
try:
text = text.lower()
words = text.split()
count_words = Counter(words)
except Exception, AttributeError:
count_words = {'':0}
return count_words
And applied the function to the DataFrame, but that only gives me the words count for each row.
reviews['words_count'] = reviews['review'].apply(count_words)
Starting with this:
dfx
review
0 United Kingdom
1 The United Kingdom
2 Dublin, Ireland
3 Mardan, Pakistan
To get all words in the "review" column:
list(dfx['review'].str.split(' ', expand=True).stack().unique())
['United', 'Kingdom', 'The', 'Dublin,', 'Ireland', 'Mardan,', 'Pakistan']
To get counts of "review" column:
dfx['review'].str.split(' ', expand=True).stack().value_counts()
United 2
Kingdom 2
Mardan, 1
The 1
Ireland 1
Dublin, 1
Pakistan 1
dtype: int64 ​

Categories

Resources