Extracting year from a column of string movie names

I have the following data, having two columns, "name" and "gross" in table called train_df:
gross name
760507625.0 Avatar (2009)
658672302.0 Titanic (1997)
652270625.0 Jurassic World (2015)
623357910.0 The Avengers (2012)
534858444.0 The Dark Knight (2008)
532177324.0 Rogue One (2016)
474544677.0 Star Wars: Episode I - The Phantom Menace (1999)
459005868.0 Avengers: Age of Ultron (2015)
448139099.0 The Dark Knight Rises (2012)
436471036.0 Shrek 2 (2004)
424668047.0 The Hunger Games: Catching Fire (2013)
423315812.0 Pirates of the Caribbean: Dead Man's Chest (2006)
415004880.0 Toy Story 3 (2010)
409013994.0 Iron Man 3 (2013)
408084349.0 Captain America: Civil War (2016)
408010692.0 The Hunger Games (2012)
403706375.0 Spider-Man (2002)
402453882.0 Jurassic Park (1993)
402111870.0 Transformers: Revenge of the Fallen (2009)
400738009.0 Frozen (2013)
381011219.0 Harry Potter and the Deathly Hallows: Part 2 (2011)
380843261.0 Finding Nemo (2003)
380262555.0 Star Wars: Episode III - Revenge of the Sith (2005)
373585825.0 Spider-Man 2 (2004)
370782930.0 The Passion of the Christ (2004)
I would like to read and extract the date from "name" to create a new column that will be called "year", which I will then use to filter the data set by specific year.
The new table will look like the following:
year gross name
2009 760507625.0 Avatar (2009)
1997 658672302.0 Titanic (1997)
2015 652270625.0 Jurassic World (2015)
2012 623357910.0 The Avengers (2012)
2008 534858444.0 The Dark Knight (2008)
I tried the apply and lambda approach, but got no results:
train_df[train_df.apply(lambda row: row['name'].startswith('2014'),axis=1)]
Is there a way to use contains (as in C#) or "isin" to filter the strings in Python?

If you know for sure that your years are going to be at the end of the string, you can do
df['year'] = df['name'].str[-5:-1].astype(int)
This takes the column name, uses the str accessor to access the value of each row as a string, and takes the -5:-1 slice from it. Then, it converts the result to int, and sets it as the year column. This approach will be much faster than iterating over the rows if you have lots of data.
Alternatively, you could use a regex for more flexibility, via the .extract() method of the str accessor.
df['year'] = df['name'].str.extract(r'\((\d{4})\)', expand=False).astype(int)
This extracts the group matching the expression \((\d{4})\), which captures exactly four digits enclosed in a pair of parentheses, and it will match anywhere in the string. Passing expand=False makes extract return a Series rather than a one-column DataFrame. To anchor the match to the end of your string, add a $ at the end of your regex, like so: \((\d{4})\)$. On this data, the regex and the string slicing give the same result.
Now we have our new dataframe:
gross name year
0 760507625.0 Avatar (2009) 2009
1 658672302.0 Titanic (1997) 1997
2 652270625.0 Jurassic World (2015) 2015
3 623357910.0 The Avengers (2012) 2012
4 534858444.0 The Dark Knight (2008) 2008
5 532177324.0 Rogue One (2016) 2016
6 474544677.0 Star Wars: Episode I - The Phantom Menace (1999) 1999
7 459005868.0 Avengers: Age of Ultron (2015) 2015
8 448139099.0 The Dark Knight Rises (2012) 2012
9 436471036.0 Shrek 2 (2004) 2004
10 424668047.0 The Hunger Games: Catching Fire (2013) 2013
11 423315812.0 Pirates of the Caribbean: Dead Man's Chest (2006) 2006
12 415004880.0 Toy Story 3 (2010) 2010
13 409013994.0 Iron Man 3 (2013) 2013
14 408084349.0 Captain America: Civil War (2016) 2016
15 408010692.0 The Hunger Games (2012) 2012
16 403706375.0 Spider-Man (2002) 2002
17 402453882.0 Jurassic Park (1993) 1993
18 402111870.0 Transformers: Revenge of the Fallen (2009) 2009
19 400738009.0 Frozen (2013) 2013
20 381011219.0 Harry Potter and the Deathly Hallows: Part 2 (... 2011
21 380843261.0 Finding Nemo (2003) 2003
22 380262555.0 Star Wars: Episode III - Revenge of the Sith (... 2005
23 373585825.0 Spider-Man 2 (2004) 2004
24 370782930.0 The Passion of the Christ (2004) 2004
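Since the end goal in the question is to filter by a specific year, here is a minimal sketch of that step, using a four-row sample of the question's data. The variable names films_2012 and films_2012_text are just for illustration. It also shows str.contains, which is the closest pandas equivalent to the "contains" the question asks about.

```python
import pandas as pd

# Small sample taken from the question's data
df = pd.DataFrame({
    "gross": [760507625.0, 623357910.0, 448139099.0, 436471036.0],
    "name": ["Avatar (2009)", "The Avengers (2012)",
             "The Dark Knight Rises (2012)", "Shrek 2 (2004)"],
})

# Extract the four-digit year in parentheses at the end of each name
df["year"] = df["name"].str.extract(r"\((\d{4})\)$", expand=False).astype(int)

# Option 1: filter on the numeric year column
films_2012 = df[df["year"] == 2012]

# Option 2: filter on the raw text; regex=False treats "(2012)" as a literal substring
films_2012_text = df[df["name"].str.contains("(2012)", regex=False)]

print(films_2012["name"].tolist())
```

Filtering on the numeric column is probably the better choice once the year exists, because it also allows range comparisons such as df["year"] >= 2010.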

You can use a regular expression with pandas.Series.str.extract for this:
df["year"] = df["name"].str.extract(r"\((\d{4})\)$", expand=False)
df["year"] = pd.to_numeric(df["year"])
print(df.head())
gross name year
0 760507625.0 Avatar (2009) 2009
1 658672302.0 Titanic (1997) 1997
2 652270625.0 Jurassic World (2015) 2015
3 623357910.0 The Avengers (2012) 2012
4 534858444.0 The Dark Knight (2008) 2008
The regular expression:
\(: Find a literal opening parenthesis
(\d{4}): Then, find 4 digits appearing next to each other
The parentheses here mean that we're storing our 4 digits as a capture group (in this case, it's the group of digits we want to extract from the larger string)
\): Then, find a closing parenthesis
$: All of the above MUST occur at the end of the string
When all of the above criteria are met, get those 4 digits; if there is no match, return NaN for that row.
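To illustrate the NaN behaviour described above, here is a small sketch with one made-up title ("Untitled Project") that has no year. The conversion to the nullable "Int64" dtype is my own addition, not part of the answer; it keeps the years as integers even when some rows are missing.

```python
import pandas as pd

# "Untitled Project" is a hypothetical row with no year, added to show the no-match case
names = pd.Series(["Avatar (2009)", "Untitled Project", "Titanic (1997)"])

# Rows that do not match the pattern come back as missing values
year = names.str.extract(r"\((\d{4})\)$", expand=False)

# Convert to numbers; with a missing value present, a plain int dtype is not possible,
# so cast to the nullable integer dtype "Int64" to keep whole-number years
year_int = pd.to_numeric(year).astype("Int64")

print(year_int)
```

Rows without a year can then be dropped with year_int.dropna() or filled in by hand, depending on what the data needs.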

Try this.
df = ['Avatar (2009)', 'Titanic (1997)', 'Jurassic World (2015)','The Avengers (2012)', 'The Dark Knight (2008)', 'Rogue One (2016)','Star Wars: Episode I - The Phantom Menace (1999)','Avengers: Age of Ultron (2015)', 'The Dark Knight Rises (2012)','Shrek 2 (2004)', 'Boiling Point (1990)', 'Terror Firmer (1999)', "Adam's Apples (2005)", 'I Want You (1998)', 'Chalet Girl (2011)','Love, Honor and Obey (2000)', "Perrier's Bounty (2009)",'Into the White (2012)', 'The Decoy Bride (2011)','I Spit on Your Grave 2 (2013)']
for i in df:
    mov_title = i[:-7]  # everything before " (yyyy)"
    year = i[-5:-1]  # the four digits inside the trailing parentheses
    print(mov_title)  # do your actual extraction
    print(year)  # do your actual extraction

def getYear(val):
    startIndex = val.find('(')
    endIndex = val.find(')')
    return val[startIndex + 1:endIndex]
I'm not much of a Python dev, but I believe this will do. You just need to loop over the names, passing each one to the function above; each call returns the extracted year. Note that find returns the first match, so this assumes the year is in the first pair of parentheses.
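For the "loop through" part, pandas can do the looping for you with Series.map. The sketch below is a variant of the function above, not the original: it is renamed get_year and uses rfind instead of find, so that a title with earlier parentheses still yields the year. The second sample title is shortened from the related question further down the page.

```python
import pandas as pd

def get_year(val):
    # rfind searches from the right, so earlier parentheses in the title are skipped
    start = val.rfind("(")
    end = val.rfind(")")
    return val[start + 1:end]

df = pd.DataFrame({
    "name": ["Avatar (2009)", "Seven Samurai (The Magnificent Seven) (1954)"],
})

# map calls get_year once per row; astype(int) turns the text into numbers
df["year"] = df["name"].map(get_year).astype(int)

print(df)
```

On large tables the vectorised str.extract approach from the other answers is likely to be faster, since map calls a Python function for every row.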

Related

Why isn't replace working in pandas dataframe?

I'm trying to parse some data and I cannot seem to use .replace to remove the junk, i.e. [BluRay] and other data that is not the year or resolution. My end goal is to end up with these columns: Movie Name, Year and Resolution
,Movie Name,others
0,James Bond The Spy Who Loved Me ,1977) [1080p]
1,James Bond Live And Let Die ,1973) [1080p]
2,No Time To Die ,2021) [1080p] [BluRay] [5.1] [YTS.MX]
3,James Bond The Man With The Golden Gun ,1974) [1080p]
4,Casino Royale ,2006) [2160p] [4K] [BluRay] [5.1] [YTS.MX]
5,James Bond Moonraker ,1979) [1080p]
6,James Bond Licence To Kill ,1989) [1080p]
7,James Bond A View To A Kill ,1985) [1080p]
8,James Bond The Living Daylights ,1987) [1080p]
The code I'm using is:
df['others']=df['others'].replace(to_replace=[['','BluRay']],value='')
Can anyone see where I'm going wrong?

Given:
Movie Name others
0 James Bond The Spy Who Loved Me 1977) [1080p]
1 James Bond Live And Let Die 1973) [1080p]
2 No Time To Die 2021) [1080p] [BluRay] [5.1] [YTS.MX]
3 James Bond The Man With The Golden Gun 1974) [1080p]
4 Casino Royale 2006) [2160p] [4K] [BluRay] [5.1] [YTS.MX]
5 James Bond Moonraker 1979) [1080p]
6 James Bond Licence To Kill 1989) [1080p]
7 James Bond A View To A Kill 1985) [1080p]
8 James Bond The Living Daylights 1987) [1080p]
To clean this up I would do:
df['Movie Name'] = df['Movie Name'].str.strip()
df[['Year', 'Resolution']] = df['others'].str.extract(r'(\d{4})\).*\[(.*p)]')
print(df[['Movie Name', 'Year', 'Resolution']])
Output:
Movie Name Year Resolution
0 James Bond The Spy Who Loved Me 1977 1080p
1 James Bond Live And Let Die 1973 1080p
2 No Time To Die 2021 1080p
3 James Bond The Man With The Golden Gun 1974 1080p
4 Casino Royale 2006 2160p
5 James Bond Moonraker 1979 1080p
6 James Bond Licence To Kill 1989 1080p
7 James Bond A View To A Kill 1985 1080p
8 James Bond The Living Daylights 1987 1080p
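As for why the original .replace call appears to do nothing: without regex=True, Series.replace compares each whole cell value against the target, and no cell is exactly 'BluRay'. Substring work belongs to the str accessor. Below is a minimal sketch on two of the question's rows; the last pattern, which strips every bracketed tag except the resolution, is my own assumption about what counts as junk.

```python
import pandas as pd

others = pd.Series(["1977) [1080p]", "2021) [1080p] [BluRay] [5.1] [YTS.MX]"])

# Series.replace matches entire cell values by default, so nothing changes here
unchanged = others.replace("BluRay", "")

# str.replace works on substrings; regex=False treats "[BluRay]" literally
cleaned = others.str.replace("[BluRay]", "", regex=False)

# With a regex, drop every bracketed tag that is not a resolution such as [1080p]
no_tags = others.str.replace(r"\s*\[(?!\d+p\])[^\]]*\]", "", regex=True)

print(no_tags.tolist())
```

The extract approach above is still the more direct route to separate Year and Resolution columns; this sketch only covers the cleaning step the question asked about.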

str.findall returns all NA's

I have this df1 with a lot of different news articles. An example of a news article is this:
'Today is Monday Aug. 17 the 230th day of 2020 . There are 136 days left in the year . On August 17 2017 a van plowed through pedestrians along a packed promenade in the Spanish city of Barcelona killing 13 people and injuring 120 . A 14th victim died later from injuries . Another man was stabbed to death in a carjacking that night as the van driver made his getaway and a woman died early the next day in a vehicle-and-knife attack in a nearby coastal town . Six by police two more died when a bomb workshop exploded . In 1915 a mob in Cobb County Georgia lynched Jewish businessman Leo Frank 31 whose death sentence for the murder of 13-year-old Mary Phagan had been commuted to life imprisonment . Frank who d maintained his innocence was pardoned by the state of Georgia in 1986 . In 1960 the newly renamed Beatles formerly the Silver Beetles began their first gig in Hamburg West Germany Teamsters union president Jimmy Hoffa was sentenced in Chicago to five years in federal prison for defrauding his union s pension fund . Hoffa was released in 1971 after President Richard Nixon commuted his sentence for this conviction and jury tampering . In 1969 Hurricane Camille slammed into the Mississippi coast as a Category 5 storm that was blamed for 256 U.S. deaths three in Cuba . In 1978 the first successful trans-Atlantic balloon flight ended as Maxie Anderson Ben Abruzzo and Larry Newman landed In 1982 the first commercially produced compact discs a recording of ABBA s The Visitors were pressed at a Philips factory near Hanover West Germany .'
And I have this df2 with all the words from the news articles in the column "Word" with their corresponding LIWC category in the second column.
Data example:
data = {'Word': ['killing','even','guilty','brain'], 'Category': ['Affect', 'Adverb', 'Anx','Body']}
What I'm trying to do is: To calculate for each article in df1 how many words occur of each category in df2. So I want to create a column for each category mentioned in df2["category"].
And it should look like this in the end:
Content | Achieve | Affiliation | affect
article text here | 6 | 2 | 2
article text here | 2 | 43 | 2
article text here | 6 | 8 | 8
article text here | 2 | 13 | 7
Since it's all strings, I tried str.findall, but this returns NA for everything. This is what I tried:
from collections import Counter
liwc = df1['articles'].str.findall(fr"'({'|'.join(df2)})'") \
.apply(lambda x: pd.Series(Counter(x), index=df2["category"].unique())) \
.fillna(0).astype(int)
Both a pandas or r solution would be equally great.

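First flatten the df2 values into a dictionary mapping each word to its category. Then build a pattern with word boundaries (\b ... \b) and pass it to Series.str.extractall, map the matches to categories with Series.map, and turn the result into a DataFrame with reset_index. Finally, count with crosstab and append the counts to the original with DataFrame.join: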
df1 = pd.DataFrame({'articles':['Today is killing Aug. 17 the 230th day of 2020',
'Today is brain Aug. 17 the guilty day of 2020 ']})
print (df1)
articles
0 Today is killing Aug. 17 the 230th day of 2020
1 Today is brain Aug. 17 the guilty day of 2020
If the Word column contains lists of values:
data = {'Word': [['killing'],['even'],['guilty'],['brain']],
'Category': ['Affect', 'Adverb', 'Anx','Body']}
df2 = pd.DataFrame(data)
print (df2)
Word Category
0 [killing] Affect
1 [even] Adverb
2 [guilty] Anx
3 [brain] Body
d = {x: b for a, b in zip(df2['Word'], df2['Category']) for x in a}
print (d)
{'killing': 'Affect', 'even': 'Adverb', 'guilty': 'Anx', 'brain': 'Body'}
If the Word column contains plain strings instead:
data = {'Word': ['killing','even','guilty','brain'],
'Category': ['Affect', 'Adverb', 'Anx','Body']}
df2 = pd.DataFrame(data)
print (df2)
Word Category
0 killing Affect
1 even Adverb
2 guilty Anx
3 brain Body
d = dict(zip(df2['Word'], df2['Category']))
print (d)
{'killing': 'Affect', 'even': 'Adverb', 'guilty': 'Anx', 'brain': 'Body'}
import re
# thanks to Wiktor Stribiżew for improving this solution
pat = r"\b(?:{})\b".format("|".join(re.escape(x) for x in d))
df = df1['articles'].str.extractall(rf'({pat})')[0].map(d).reset_index(name='Category')
df = df1.join(pd.crosstab(df['level_0'], df['Category']))
print (df)
articles Affect Anx Body
0 Today is killing Aug. 17 the 230th day of 2020 1 0 0
1 Today is brain Aug. 17 the guilty day of 2020 0 1 1

You can craft a custom regex with named capturing groups and use str.extractall.
With your dictionary the custom regex would be '(?P<Affect>\\bkilling\\b)|(?P<Adverb>\\beven\\b)|(?P<Anx>\\bguilty\\b)|(?P<Body>\\bbrain\\b)'
Then groupby+max the notna results, convert to int and join to the original dataframe:
regex = '|'.join(fr'(?P<{k}>\b{v}\b)' for v,k in zip(*data.values()))
(df1.join(df1['articles'].str.extractall(regex, flags=2) # re.IGNORECASE
.notna().groupby(level=0).max()
.astype(int)
)
)
output:
articles Affect Adverb Anx Body
0 Today is killing Aug. 17 the 230th day of 2020 1 0 0 0
1 Today is brain Aug. 17 the guilty day of 2020 0 0 1 1
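Note that groupby+max above records whether a category occurs at all (0 or 1), while the question asks how many words occur per category. Here is a rough sketch of one way to get actual counts, using Series.str.count with one pattern per category. This is a different technique from the two answers above, and the two short articles are made-up test strings.

```python
import re
import pandas as pd

# Made-up articles; the first one contains "killing" twice
df1 = pd.DataFrame({"articles": ["killing time, even killing more", "guilty brain"]})
df2 = pd.DataFrame({"Word": ["killing", "even", "guilty", "brain"],
                    "Category": ["Affect", "Adverb", "Anx", "Body"]})

counts = {}
for category, words in df2.groupby("Category")["Word"]:
    # One alternation per category, escaped and wrapped in word boundaries
    pattern = r"\b(?:" + "|".join(re.escape(w) for w in words) + r")\b"
    counts[category] = df1["articles"].str.count(pattern, flags=re.IGNORECASE)

# One count column per category, aligned on the article index
result = df1.join(pd.DataFrame(counts))
print(result)
```

This scans each article once per category, so for a very large number of categories the single-pass extractall plus crosstab approach may scale better.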

How to count paragraphs from each article from dataframe?

I want to count paragraphs from data frames. However, it turns out that my result gets zero inside the list. Does anybody know how to fix it? Thank you so much.
Here is my code:
def count_paragraphs(df):
    paragraph_count = []
    linecount = 0
    for i in df.text:
        if i in ('\n','\r\n'):
            if linecount == 0:
                paragraphcount = paragraphcount + 1
    return paragraph_count
count_paragraphs(df)
df.text
0 On Saturday, September 17 at 8:30 pm EST, an e...
1 Story highlights "This, though, is certain: to...
2 Critical Counties is a CNN series exploring 11...
3 McCain Criticized Trump for Arpaio’s Pardon… S...
4 Story highlights Obams reaffirms US commitment...
5 Obama weighs in on the debate\n\nPresident Bar...
6 Story highlights Ted Cruz refused to endorse T...
7 Last week I wrote an article titled “Donald Tr...
8 Story highlights Trump has 45%, Clinton 42% an...
9 Less than a day after protests over the police...
10 I woke up this morning to find a variation of ...
11 Thanks in part to the declassification of Defe...
12 The Democrats are using an intimidation tactic...
13 Dolly Kyle has written a scathing “tell all” b...
14 The Haitians in the audience have some newswor...
15 The man arrested Monday in connection with the...
16 Back when the news first broke about the pay-t...
17 Chicago Environmentalist Scumbags\n\nLeftists ...
18 Well THAT’S Weird. If the Birther movement is ...
19 Former President Bill Clinton and his Clinton ...
Name: text, dtype: object

Use Series.str.count:
def count_paragraphs(df):
    return df.text.str.count(r'\n\n').tolist()
count_paragraphs(df)

This is my answer, and it works:
def count_paragraphs(df):
    paragraph_count = []
    for i in range(len(df)):
        # count the blank-line separators in each article
        paragraph_count.append(df.text[i].count('\n\n'))
    return paragraph_count
count_paragraphs(df)
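One caveat with both answers: counting '\n\n' gives the number of separators, so an article with three paragraphs reports 2 and a single-paragraph article reports 0. If the number of paragraphs itself is wanted, a sketch like the following might help. It assumes paragraphs are separated by blank lines, and the sample texts are made up.

```python
import re
import pandas as pd

# Made-up articles: three paragraphs, one paragraph, and an empty string
text = pd.Series([
    "First paragraph.\n\nSecond paragraph.\n\nThird.",
    "Only one paragraph.",
    "",
])

# What the answers above compute: the number of blank-line separators
separators = text.str.count(r"\n\n")

# Split on blank lines and count the non-empty pieces instead
paragraphs = text.apply(
    lambda t: sum(1 for p in re.split(r"\n\s*\n", t) if p.strip())
)

print(separators.tolist(), paragraphs.tolist())
```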

Python - Pandas: extract a number from column into new column

I've been working a lot with pandas in python to extract information. I have the following titles in one column of my dataframe:
0
In & Out (1997)
Simple Plan, A (1998)
Retro Puppetmaster (1999)
Paralyzing Fear: The Story of Polio in America, A (1998)
Old Man and the Sea, The (1958)
Body Shots (1999)
Coogan's Bluff (1968)
Seven Samurai (The Magnificent Seven) (Shichinin no samurai) (1954)
Search for One-eye Jimmy, The (1996)
Funhouse, The (1981)
I'd like to take the years of those titles and put into a new column. The issue I'm running into is if I do the split on '(' as the delimiter, as you see on row 8, it's split there. So how do I split at the (yyyy) to form a new column with that year to look like this?
0 1
In & Out 1997
Simple Plan, A 1998
Retro Puppetmaster 1999
Paralyzing Fear:... 1998
Old Man and the S... 1958
Body Shots 1999
Coogan's Bluff 1968
Seven Samurai (T... 1954
Search for One-ey... 1996
Funhouse, The 1981

You can use expand:
df['year'] = df.iloc[:, 0].str.extract(r'\((\d{4})\)', expand=False)
df
Out[381]:
0 year
0 In & Out (1997) 1997
1 Simple Plan, A (1998) 1998
2 Retro Puppetmaster (1999) 1999
3 Paralyzing Fear: The Story of Polio in America... 1998
4 Old Man and the Sea, The (1958) 1958
5 Body Shots (1999) 1999
6 Coogan's Bluff (1968) 1968
7 Seven Samurai (The Magnificent Seven) (Shichin... 1954
8 Search for One-eye Jimmy, The (1996) 1996
9 Funhouse, The (1981) 1981
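The question also wants the title without the year in its own column. A possible sketch, assuming the year is always the last parenthesised group in the string: capture both parts in one pattern with named groups (the group names title and year are my choice).

```python
import pandas as pd

titles = pd.Series([
    "In & Out (1997)",
    "Seven Samurai (The Magnificent Seven) (Shichinin no samurai) (1954)",
])

# Lazy title capture, then a four-digit year in parentheses anchored at the end
parts = titles.str.extract(r"^(?P<title>.*?)\s*\((?P<year>\d{4})\)\s*$")
parts["year"] = parts["year"].astype(int)

print(parts)
```

Because the pattern is anchored with $ and requires exactly four digits, the earlier parentheses in the Seven Samurai row stay in the title.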

You can try a string slicing operation.
The rindex() method of the string type returns the index of the last occurrence of a substring (in this case '('), searching from the right-hand end. With that index we can slice the string as needed.
For example:
>>> a = "Seven Samurai (The Magnificent Seven) (Shichinin no samurai) (1954)"
>>> print(a[:a.rindex('(')], a[a.rindex('(')+1:-1])
Seven Samurai (The Magnificent Seven) (Shichinin no samurai)  1954

Pandas Dataframe

I want to represent the scraped data in a pandas DataFrame with a column named Product Title, populated with the values of t.
For example:
Product Title
Marvel : Movies Collection
Marvel
Disney Movie, and so on.
import requests
from bs4 import BeautifulSoup
import csv
import pandas as pd
r= requests.get("http://www.walmart.com/search/?query=marvel&cat_id=4096_530598")
r.content
soup = BeautifulSoup(r.content)
g_data = soup.find_all("div", {"class" : "tile-conent"})
g_price = soup.find_all("div",{"class" : "item-price-container"})
g_star = soup.find_all("div",{"class" : "stars stars-small tile-row"})
for product_title in g_data:
    a_product_title = product_title.find_all("a","js-product-title")
    for text_product_title in a_product_title :
        t = text_product_title.text
        print(t)
Desired Output-
Product Title :
Marvel Heroes: Collection
Marvel: Guardians Of The Galaxy (Widescreen)
Marvel Complete Giftset (Widescreen)
Marvel's The Avengers (Widescreen)
Marvel Knights: Wolverine Versus Sabretooth - Reborn (Widescreen)
Superheroes Collection: The Incredible Hulk Returns / The Trial Of The Incredible Hulk / How To Draw Comics The Marvel Way (Widescreen)
Marvel: Iron Man & Hulk - Heroes United (Widescreen)
Marvel's The Avengers (DVD + Blu-ray) (Widescreen)
Captain America: The Winter Soldier (Widescreen)
Iron Man 3 (DVD + Digital Copy) (Widescreen)
Thor: The Dark World (Widescreen)
Spider-Man (2-Disc) (Special Edition) (Widescreen)
Elektra / Fantastic Four / Daredevil (Director's Cut) / Fantastic Four 2: Rise Of The Silver Surfer
Spider-Man / Spider-Man 2 / Spider-Man 3 (Widescreen)
Spider-Man 2 (Widescreen)
The Punisher (Extended Cut) (Widescreen)
DC Showcase: Superman / Shazam!: The Return Of The Black Adam
Ultimate Avengers: The Movie (Widescreen)
The Next Avengers: Heroes Of Tomorrow (Widescreen)
Ultimate Avengers 1 & 2 (Blu-ray) (Widescreen)
I tried the append function and join, but they didn't work. Is there a specific function for this in pandas?
The desired output should be outcome of using Pandas dataframe.

Well, this will get you started. It extracts all the titles into a dict (I use a defaultdict for convenience):
In [163]:
from collections import defaultdict
data=defaultdict(list)
for product_title in g_data:
    a_product_title = product_title.find_all("a","js-product-title")
    for text_title in a_product_title:
        data['Product title'].append(text_title.text)
df = pd.DataFrame(data)
df
Out[163]:
Product title
0 Marvel Heroes: Collection
1 Marvel: Guardians Of The Galaxy (Widescreen)
2 Marvel Complete Giftset (Widescreen)
3 Marvel's The Avengers (Widescreen)
4 Marvel Knights: Wolverine Versus Sabretooth - ...
5 Superheroes Collection: The Incredible Hulk Re...
6 Marvel: Iron Man & Hulk - Heroes United (Wides...
7 Marvel's The Avengers (DVD + Blu-ray) (Widescr...
8 Captain America: The Winter Soldier (Widescreen)
9 Iron Man 3 (DVD + Digital Copy) (Widescreen)
10 Thor: The Dark World (Widescreen)
11 Spider-Man (2-Disc) (Special Edition) (Widescr...
12 Elektra / Fantastic Four / Daredevil (Director...
13 Spider-Man / Spider-Man 2 / Spider-Man 3 (Wide...
14 Spider-Man 2 (Widescreen)
15 The Punisher (Extended Cut) (Widescreen)
16 DC Showcase: Superman / Shazam!: The Return Of...
17 Ultimate Avengers: The Movie (Widescreen)
18 The Next Avengers: Heroes Of Tomorrow (Widescr...
19 Ultimate Avengers 1 & 2 (Blu-ray) (Widescreen)
So you can modify this script to add the price and actors as keys to the data dict and then construct the df from the resulting dict. This will be better than appending a row at a time.
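To make that suggestion concrete, here is a small sketch with two keys. It does not touch the network: the scraped list is a hypothetical stand-in for whatever the parsing loop yields, and the price strings are placeholders, not real values. The "Price" key is likewise my own naming.

```python
from collections import defaultdict

import pandas as pd

# Hypothetical (title, price) pairs standing in for the parsed HTML;
# the prices are placeholders, not scraped values
scraped = [
    ("Marvel Heroes: Collection", "price-1"),
    ("Thor: The Dark World (Widescreen)", "price-2"),
]

data = defaultdict(list)
for title, price in scraped:
    # Append to every key in the same iteration so the lists stay the same length
    data["Product title"].append(title)
    data["Price"].append(price)

# Build the DataFrame once, after all rows have been collected
df = pd.DataFrame(data)
print(df)
```

Keeping the lists the same length matters: pd.DataFrame raises an error if the columns in the dict have different lengths, so a product with a missing price should still append a placeholder such as None.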
