Python - Pandas: extract a number from column into new column - python

I've been working a lot with pandas in python to extract information. I have the following titles in one column of my dataframe:
0
In & Out (1997)
Simple Plan, A (1998)
Retro Puppetmaster (1999)
Paralyzing Fear: The Story of Polio in America, A (1998)
Old Man and the Sea, The (1958)
Body Shots (1999)
Coogan's Bluff (1968)
Seven Samurai (The Magnificent Seven) (Shichinin no samurai) (1954)
Search for One-eye Jimmy, The (1996)
Funhouse, The (1981)
I'd like to take the years of those titles and put into a new column. The issue I'm running into is if I do the split on '(' as the delimiter, as you see on row 8, it's split there. So how do I split at the (yyyy) to form a new column with that year to look like this?
0 1
In & Out 1997
Simple Plan, A 1998
Retro Puppetmaster 1999
Paralyzing Fear:... 1998
Old Man and the S... 1958
Body Shots 1999
Coogan's Bluff 1968
Seven Samurai (T... 1954
Search for One-ey... 1996
Funhouse, The 1981

You can use str.extract with expand=False:
df['year'] = df.iloc[:, 0].str.extract(r'\((\d{4})\)', expand=False)
df
Out[381]:
0 year
0 In & Out (1997) 1997
1 Simple Plan, A (1998) 1998
2 Retro Puppetmaster (1999) 1999
3 Paralyzing Fear: The Story of Polio in America... 1998
4 Old Man and the Sea, The (1958) 1958
5 Body Shots (1999) 1999
6 Coogan's Bluff (1968) 1968
7 Seven Samurai (The Magnificent Seven) (Shichin... 1954
8 Search for One-eye Jimmy, The (1996) 1996
9 Funhouse, The (1981) 1981

You can also try string slicing.
The rindex() method of the str type returns the index of the last occurrence of a substring (in this case '('), searching from the right-hand end. With that index we can slice the string into the title and the year.
For example:
>>> a = "Seven Samurai (The Magnificent Seven) (Shichinin no samurai) (1954)"
>>> print(a[:a.rindex('(')], a[a.rindex('(')+1:-1])
Seven Samurai (The Magnificent Seven) (Shichinin no samurai)  1954
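To get both the title and the year as separate columns, as the question asks, one option is a single regular expression with two capture groups. This is only a minimal sketch: the column names `title`, `name` and `year` are assumptions made for illustration (the question's column is simply labelled 0), and it assumes every title ends with a four-digit year in parentheses.

```python
import pandas as pd

# Hypothetical sample frame; "title" is an assumed column name.
df = pd.DataFrame({"title": [
    "In & Out (1997)",
    "Seven Samurai (The Magnificent Seven) (Shichinin no samurai) (1954)",
    "Funhouse, The (1981)",
]})

# Group 1: everything before the final "(yyyy)"; group 2: the four digits.
# The trailing $ anchors the year to the end of the string, so earlier
# parenthesised parts of the title stay in the name.
parts = df["title"].str.extract(r"^(.*?)\s*\((\d{4})\)$")
parts.columns = ["name", "year"]
print(parts)
```

The year comes back as a string; pd.to_numeric can convert it if a numeric column is needed.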

Related

str.findall returns all NA's

I have this df1 with a lot of different news articles. An example of a news article is this:
'Today is Monday Aug. 17 the 230th day of 2020 . There are 136 days left in the year . On August 17 2017 a van plowed through pedestrians along a packed promenade in the Spanish city of Barcelona killing 13 people and injuring 120 . A 14th victim died later from injuries . Another man was stabbed to death in a carjacking that night as the van driver made his getaway and a woman died early the next day in a vehicle-and-knife attack in a nearby coastal town . Six by police two more died when a bomb workshop exploded . In 1915 a mob in Cobb County Georgia lynched Jewish businessman Leo Frank 31 whose death sentence for the murder of 13-year-old Mary Phagan had been commuted to life imprisonment . Frank who d maintained his innocence was pardoned by the state of Georgia in 1986 . In 1960 the newly renamed Beatles formerly the Silver Beetles began their first gig in Hamburg West Germany Teamsters union president Jimmy Hoffa was sentenced in Chicago to five years in federal prison for defrauding his union s pension fund . Hoffa was released in 1971 after President Richard Nixon commuted his sentence for this conviction and jury tampering . In 1969 Hurricane Camille slammed into the Mississippi coast as a Category 5 storm that was blamed for 256 U.S. deaths three in Cuba . In 1978 the first successful trans-Atlantic balloon flight ended as Maxie Anderson Ben Abruzzo and Larry Newman landed In 1982 the first commercially produced compact discs a recording of ABBA s The Visitors were pressed at a Philips factory near Hanover West Germany .'
And I have this df2 with all the words from the news articles in the column "Word" with their corresponding LIWC category in the second column.
Data example:
data = {'Word': ['killing','even','guilty','brain'], 'Category': ['Affect', 'Adverb', 'Anx','Body']}
What I'm trying to do is: To calculate for each article in df1 how many words occur of each category in df2. So I want to create a column for each category mentioned in df2["category"].
And it should look like this in the end:
Content | Achieve | Affiliation | affect
article text here | 6 | 2 | 2
article text here | 2 | 43 | 2
article text here | 6 | 8 | 8
article text here | 2 | 13 | 7
Since it's all strings, I tried str.findall, but this returns NA for everything. This is what I tried:
from collections import Counter
liwc = df1['articles'].str.findall(fr"'({'|'.join(df2)})'") \
.apply(lambda x: pd.Series(Counter(x), index=df2["category"].unique())) \
.fillna(0).astype(int)
Either a pandas or an R solution would be equally great.
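First flatten the df2 values into a dictionary that maps each word to its category. Build a pattern from the words, wrapped in word boundaries (\b...\b), and pass it to Series.str.extractall. Then use Series.map to turn each matched word into its category and reset_index to get a DataFrame. Finally, pass that to crosstab and append the counts to the original with DataFrame.join: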
df1 = pd.DataFrame({'articles':['Today is killing Aug. 17 the 230th day of 2020',
'Today is brain Aug. 17 the guilty day of 2020 ']})
print (df1)
articles
0 Today is killing Aug. 17 the 230th day of 2020
1 Today is brain Aug. 17 the guilty day of 2020
If the Word column holds lists of values:
data = {'Word': [['killing'],['even'],['guilty'],['brain']],
'Category': ['Affect', 'Adverb', 'Anx','Body']}
df2 = pd.DataFrame(data)
print (df2)
Word Category
0 [killing] Affect
1 [even] Adverb
2 [guilty] Anx
3 [brain] Body
d = {x: b for a, b in zip(df2['Word'], df2['Category']) for x in a}
print (d)
{'killing': 'Affect', 'even': 'Adverb', 'guilty': 'Anx', 'brain': 'Body'}
If the Word column holds plain strings instead:
data = {'Word': ['killing','even','guilty','brain'],
'Category': ['Affect', 'Adverb', 'Anx','Body']}
df2 = pd.DataFrame(data)
print (df2)
Word Category
0 killing Affect
1 even Adverb
2 guilty Anx
3 brain Body
d = dict(zip(df2['Word'], df2['Category']))
print (d)
{'killing': 'Affect', 'even': 'Adverb', 'guilty': 'Anx', 'brain': 'Body'}
import re
# thanks to Wiktor Stribiżew for improving this solution
pat = r"\b(?:{})\b".format("|".join(re.escape(x) for x in d))
df = df1['articles'].str.extractall(rf'({pat})')[0].map(d).reset_index(name='Category')
df = df1.join(pd.crosstab(df['level_0'], df['Category']))
print (df)
articles Affect Anx Body
0 Today is killing Aug. 17 the 230th day of 2020 1 0 0
1 Today is brain Aug. 17 the guilty day of 2020 0 1 1
You can craft a custom regex with named capturing groups and use str.extractall.
With your dictionary the custom regex would be '(?P<Affect>\\bkilling\\b)|(?P<Adverb>\\beven\\b)|(?P<Anx>\\bguilty\\b)|(?P<Body>\\bbrain\\b)'
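Then group the notna results by row and take the max, convert to int and join to the original dataframe. Note that max flags whether a category occurs in an article (1 or 0); use sum instead of max if you need the number of occurrences: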
regex = '|'.join(fr'(?P<{k}>\b{v}\b)' for v,k in zip(*data.values()))
(df1.join(df1['articles'].str.extractall(regex, flags=2) # re.IGNORECASE
.notna().groupby(level=0).max()
.astype(int)
)
)
output:
articles Affect Adverb Anx Body
0 Today is killing Aug. 17 the 230th day of 2020 1 0 0 0
1 Today is brain Aug. 17 the guilty day of 2020 0 0 1 1
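Since the question asks how many words of each category occur per article, here is a rough sketch of a counting variant built on plain re.findall and collections.Counter rather than extractall. The names `word_to_category`, `categories` and `count_categories` are made up for this illustration, and the first sample article repeats a word on purpose to show a count above 1.

```python
import re
from collections import Counter

import pandas as pd

df1 = pd.DataFrame({"articles": [
    "Today is killing Aug. 17 the killing day of 2020",
    "Today is brain Aug. 17 the guilty day of 2020",
]})
# Assumed word -> category mapping, same content as the dictionary d above.
word_to_category = {"killing": "Affect", "even": "Adverb",
                    "guilty": "Anx", "brain": "Body"}
categories = sorted(set(word_to_category.values()))

# One alternation of escaped words, bounded so "even" does not match "seven".
pattern = r"\b(" + "|".join(re.escape(w) for w in word_to_category) + r")\b"

def count_categories(text):
    """Return {category: number of matching words} for one article."""
    hits = re.findall(pattern, text, flags=re.IGNORECASE)
    tally = Counter(word_to_category[h.lower()] for h in hits)
    return {cat: tally[cat] for cat in categories}  # missing keys count as 0

counts = pd.DataFrame([count_categories(t) for t in df1["articles"]],
                      index=df1.index)
result = df1.join(counts)
print(result)
```

Unlike crosstab, this keeps a column for every category even when no article contains a word from it.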

Counting unique words in a pandas column

I am having some difficulties with the following data (from a pandas dataframe):
Text
0 Selected moments from Fifa game t...
1 What I learned is that I am ...
3 Bill Gates kept telling us it was comi...
5 scenario created a month before the...
... ...
1899 Events for May 19 – October 7 - October CTOvision.com
1900 Office of Event Services and Campus Center Ope...
1901 How the CARES Act May Affect Gift Planning in ...
1902 City of Rohnert Park: Home
1903 iHeartMedia, Inc.
I would need to extract the count of unique words per row (after removing punctuation). So, for example:
Unique
0 6
1 6
3 8
5 6
... ...
1899 8
1900 8
1901 9
1902 5
1903 2
I tried to do it as follows:
df["Unique"]=df['Text'].str.lower()
df["Unique"]==Counter(word_tokenize('\n'.join( file["Unique"])))
but I have not got any count, only a list of words (without their frequency in that row).
Can you please tell me what is wrong?
First remove all punctuation if you don't want it counted, then leverage sets. str.split().map(set) gives you a set of words for each row, and the length of that set is the number of unique words, because a set keeps only one copy of each element.
Chained
df['Text'].str.replace(r'[^\w\s]+', '', regex=True).str.split().map(set).str.len()
Stepwise
df['Text'] = df['Text'].str.replace(r'[^\w\s]+', '', regex=True)
df['New Text'] = df['Text'].str.split().map(set).str.len()
So, I'm just updating this as per the comments. This solution strips punctuation as well, and counts each distinct word once:
import string
df['Unique'] = df['Text'].apply(lambda x: x.translate(str.maketrans('', '', string.punctuation)).strip()).str.split().apply(lambda words: len(set(words)))
Try this:
import pandas as pd
from collections import Counter
# named "data" rather than "dict" to avoid shadowing the built-in
data = {'A': {0: 'John', 1: 'Bob'},
        'Desc': {0: 'Bill ,Gates Started Microsoft at 18 Bill', 1: 'Bill Gates, Again .Bill Gates and Larry Ellison'}}
df = pd.DataFrame(data)
df['Desc'] = df['Desc'].str.replace(r'[^\w\s]+', '', regex=True)
print(df.loc[:, "Desc"])
# word frequencies for the first row, then the number of unique words in it
print(Counter(" ".join(df.loc[0:0, "Desc"]).split()).items())
print(len(Counter(" ".join(df.loc[0:0, "Desc"]).split())))
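Putting the pieces from the answers above together, a per-row unique-word count might look like the sketch below. The sample rows are illustrative only, and lowercasing before counting is an assumption carried over from the question's own attempt (so "Bill" and "bill" count as the same word).

```python
import pandas as pd

# Illustrative rows only; "Text" matches the column name in the question.
df = pd.DataFrame({"Text": [
    "Bill ,Gates Started Microsoft at 18 Bill",
    "City of Rohnert Park: Home",
]})

df["Unique"] = (
    df["Text"]
    .str.lower()                                # case-insensitive counting (assumption)
    .str.replace(r"[^\w\s]+", " ", regex=True)  # punctuation becomes a space
    .str.split()                                # split on any run of whitespace
    .map(lambda words: len(set(words)))         # distinct words per row
)
print(df)
```

Replacing punctuation with a space rather than an empty string keeps words such as "Again .Bill" from being glued together.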

Extracting year from a column of string movie names

I have the following data, with two columns, "name" and "gross", in a table called train_df:
gross name
760507625.0 Avatar (2009)
658672302.0 Titanic (1997)
652270625.0 Jurassic World (2015)
623357910.0 The Avengers (2012)
534858444.0 The Dark Knight (2008)
532177324.0 Rogue One (2016)
474544677.0 Star Wars: Episode I - The Phantom Menace (1999)
459005868.0 Avengers: Age of Ultron (2015)
448139099.0 The Dark Knight Rises (2012)
436471036.0 Shrek 2 (2004)
424668047.0 The Hunger Games: Catching Fire (2013)
423315812.0 Pirates of the Caribbean: Dead Man's Chest (2006)
415004880.0 Toy Story 3 (2010)
409013994.0 Iron Man 3 (2013)
408084349.0 Captain America: Civil War (2016)
408010692.0 The Hunger Games (2012)
403706375.0 Spider-Man (2002)
402453882.0 Jurassic Park (1993)
402111870.0 Transformers: Revenge of the Fallen (2009)
400738009.0 Frozen (2013)
381011219.0 Harry Potter and the Deathly Hallows: Part 2 (2011)
380843261.0 Finding Nemo (2003)
380262555.0 Star Wars: Episode III - Revenge of the Sith (2005)
373585825.0 Spider-Man 2 (2004)
370782930.0 The Passion of the Christ (2004)
I would like to read and extract the date from "name" to create a new column that will be called "year", which I will then use to filter the data set by specific year.
The new table will look like the following:
year gross name
2009 760507625.0 Avatar (2009)
1997 658672302.0 Titanic (1997)
2015 652270625.0 Jurassic World (2015)
2012 623357910.0 The Avengers (2012)
2008 534858444.0 The Dark Knight (2008)
I tried the apply and lambda approach, but got no results:
train_df[train_df.apply(lambda row: row['name'].startswith('2014'),axis=1)]
Is there a way to use contains (as in C#) or "isin" to filter the strings in Python?
If you know for sure that your years are going to be at the end of the string, you can do
df['year'] = df['name'].str[-5:-1].astype(int)
This takes the column name, uses the str accessor to access the value of each row as a string, and takes the -5:-1 slice from it. Then, it converts the result to int, and sets it as the year column. This approach will be much faster than iterating over the rows if you have lots of data.
Alternatively, you could use regex for more flexibility using the .extract() method of the str accessor.
df['year'] = df['name'].str.extract(r'\((\d{4})\)', expand=False).astype(int)
This extracts the group matching the expression \((\d{4})\), which captures exactly four digits enclosed in a pair of parentheses, and it will work anywhere in the string. To anchor it to the end of your string, add a $ at the end of your regex like so: \((\d{4})\)$. For this data, the regex and the string slicing give the same result.
Now we have our new dataframe:
gross name year
0 760507625.0 Avatar (2009) 2009
1 658672302.0 Titanic (1997) 1997
2 652270625.0 Jurassic World (2015) 2015
3 623357910.0 The Avengers (2012) 2012
4 534858444.0 The Dark Knight (2008) 2008
5 532177324.0 Rogue One (2016) 2016
6 474544677.0 Star Wars: Episode I - The Phantom Menace (1999) 1999
7 459005868.0 Avengers: Age of Ultron (2015) 2015
8 448139099.0 The Dark Knight Rises (2012) 2012
9 436471036.0 Shrek 2 (2004) 2004
10 424668047.0 The Hunger Games: Catching Fire (2013) 2013
11 423315812.0 Pirates of the Caribbean: Dead Man's Chest (2006) 2006
12 415004880.0 Toy Story 3 (2010) 2010
13 409013994.0 Iron Man 3 (2013) 2013
14 408084349.0 Captain America: Civil War (2016) 2016
15 408010692.0 The Hunger Games (2012) 2012
16 403706375.0 Spider-Man (2002) 2002
17 402453882.0 Jurassic Park (1993) 1993
18 402111870.0 Transformers: Revenge of the Fallen (2009) 2009
19 400738009.0 Frozen (2013) 2013
20 381011219.0 Harry Potter and the Deathly Hallows: Part 2 (... 2011
21 380843261.0 Finding Nemo (2003) 2003
22 380262555.0 Star Wars: Episode III - Revenge of the Sith (... 2005
23 373585825.0 Spider-Man 2 (2004) 2004
24 370782930.0 The Passion of the Christ (2004) 2004
You can use a regular expression with pandas.Series.str.extract for this:
df["year"] = df["name"].str.extract(r"\((\d{4})\)$", expand=False)
df["year"] = pd.to_numeric(df["year"])
print(df.head())
gross name year
0 760507625.0 Avatar (2009) 2009
1 658672302.0 Titanic (1997) 1997
2 652270625.0 Jurassic World (2015) 2015
3 623357910.0 The Avengers (2012) 2012
4 534858444.0 The Dark Knight (2008) 2008
The regular expression:
\(: find a literal opening parenthesis
(\d{4}): then find 4 digits appearing next to each other
The parentheses here mean that we're storing our 4 digits as a capture group (in this case, the group of digits we want to extract from the larger string)
\): then find a closing parenthesis
$: all of the above MUST occur at the end of the string
When all of the above criteria are met, get those 4 digits; if there is no match, return NaN for that row.
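The question also asks how to filter by a specific year once the column exists. A small sketch of that step follows, using a shortened copy of the question's data; the variable names `films_2012` and `also_2012` are made up for illustration.

```python
import pandas as pd

# Four rows copied from the question's table.
train_df = pd.DataFrame({
    "gross": [760507625.0, 658672302.0, 623357910.0, 408010692.0],
    "name": ["Avatar (2009)", "Titanic (1997)",
             "The Avengers (2012)", "The Hunger Games (2012)"],
})

# Extract the trailing "(yyyy)" and convert it to a number.
year = train_df["name"].str.extract(r"\((\d{4})\)$", expand=False)
train_df["year"] = pd.to_numeric(year)

# Filter on the numeric column ...
films_2012 = train_df[train_df["year"] == 2012]

# ... or, closer to a "contains" check, match the literal text directly.
also_2012 = train_df[train_df["name"].str.contains("(2012)", regex=False)]
print(films_2012)
```

Passing regex=False makes str.contains treat "(2012)" as plain text, so the parentheses need no escaping.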
Try this.
titles = ['Avatar (2009)', 'Titanic (1997)', 'Jurassic World (2015)', 'The Avengers (2012)',
          'The Dark Knight (2008)', 'Rogue One (2016)', 'Star Wars: Episode I - The Phantom Menace (1999)',
          'Avengers: Age of Ultron (2015)', 'The Dark Knight Rises (2012)', 'Shrek 2 (2004)',
          'Boiling Point (1990)', 'Terror Firmer (1999)', "Adam's Apples (2005)", 'I Want You (1998)',
          'Chalet Girl (2011)', 'Love, Honor and Obey (2000)', "Perrier's Bounty (2009)",
          'Into the White (2012)', 'The Decoy Bride (2011)', 'I Spit on Your Grave 2 (2013)']
for title in titles:
    mov_title = title[:-7]  # drop the trailing " (yyyy)"
    year = title[-5:-1]     # the four digits inside the final parentheses
    print(mov_title)  # do your actual extraction
    print(year)  # do your actual extraction
def getYear(val):
    # rfind searches from the right, so parentheses inside the title are skipped
    startIndex = val.rfind('(')
    endIndex = val.rfind(')')
    return val[startIndex + 1:endIndex]
I'm not much of a Python dev, but I believe this will do. Loop through the titles, passing each one to the function above; each call returns the year extracted from that title.

Pandas group by but keep another column

Say that I have a dataframe that looks something like this
date location year
0 1908-09-17 Fort Myer, Virginia 1908
1 1909-09-07 Juvisy-sur-Orge, France 1909
2 1912-07-12 Atlantic City, New Jersey 1912
3 1913-08-06 Victoria, British Columbia, Canada 1912
I want to use pandas groupby function to create an output that shows the total number of incidents by year but also keep the location column that will display one of the locations that year. Any which one works. So it would look something like this:
total location
year
1908 1 Fort Myer, Virginia
1909 1 Juvisy-sur-Orge, France
1912 2 Atlantic City, New Jersey
Can this be done without doing funky joining? The furthest I can get is using the normal groupby
df = df.groupby(['year']).count()
But that only gives me something like this
location
year
1908 1 1
1909 1 1
1912 2 2
How can I display one of the locations in this dataframe?
You can use groupby.agg and use 'first' to extract the first location in each group:
res = df.groupby('year')['location'].agg(['first', 'count'])
print(res)
# first count
# year
# 1908 Fort Myer, Virginia 1
# 1909 Juvisy-sur-Orge, France 1
# 1912 Atlantic City, New Jersey 2
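If the output columns should be named total and location as in the question, pandas named aggregation is one way to get there. The sketch below rebuilds the question's frame by hand for illustration.

```python
import pandas as pd

# Data copied from the question (including year 1912 on the last row).
df = pd.DataFrame({
    "date": ["1908-09-17", "1909-09-07", "1912-07-12", "1913-08-06"],
    "location": ["Fort Myer, Virginia", "Juvisy-sur-Orge, France",
                 "Atlantic City, New Jersey",
                 "Victoria, British Columbia, Canada"],
    "year": [1908, 1909, 1912, 1912],
})

# Named aggregation: output_column=(input_column, aggregation function).
res = df.groupby("year").agg(
    total=("location", "count"),
    location=("location", "first"),
)
print(res)
```

Here "first" picks the first location seen in each year, which matches the "any one works" requirement.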

Error on removing strings from a pandas data frame - Python

I have a set and a function to remove the strings in that set from the variable 'nstandar' of my pandas data frame. The set, the function and the pandas data frame are the following:
setc={'adr','company','corporation','energy','etf','group','holdings','inc','international','ltd'}
def quitarc(x):
    x=''.join(a for a in x if a not in setc)
    return x
Company name nstandar
0 1-800-FLOWERS.COM 1800flowerscom
1 1347 PROPERTY INS HLDGS INC 1347 property ins hldgs inc
2 1ST CAPITAL BANK 1st capital bank
3 1ST CENTURY BANCSHARES INC 1st century bancshares inc
4 1ST CONSTITUTION BANCORP 1st constitution bancorp
5 1ST ENTERPRISE BANK 1st enterprise bank
6 1ST PACIFIC BANCORP 1st pacific bancorp
7 1ST SOURCE CORP 1st source corporation
8 1ST UNITED BANCORP INC 1st united bancorp inc
9 21ST CENTURY ONCOLOGY HLDGS 21st century oncology hldgs
However, when I create a new variable without the strings to remove, the new variable is just the same as 'nstandar'. The code is the following:
cemp['newnstandar']=cemp['nstandar'].apply(quitarc)
So, what is my error? How can I fix my code?
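Finally, I realized that the problem was with my function: iterating over a string yields single characters, and no single character is in the set, so nothing was removed. Splitting the string into words first fixes it. The modified function is:
def quitarc(x):
    x=''.join(a + " " for a in x.split() if a not in setc)
    x=x.strip()
    return x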
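For reference, the same fix can be written a little more compactly by joining the kept words with a space, which makes the final strip unnecessary. This is a sketch on three rows taken from the question's table, not a replacement for the function above.

```python
import pandas as pd

setc = {'adr', 'company', 'corporation', 'energy', 'etf', 'group',
        'holdings', 'inc', 'international', 'ltd'}

def quitarc(text):
    # Split into words, drop the ones listed in setc, and rejoin with spaces.
    return " ".join(word for word in text.split() if word not in setc)

cemp = pd.DataFrame({"nstandar": [
    "1347 property ins hldgs inc",
    "1st source corporation",
    "1st capital bank",
]})
cemp["newnstandar"] = cemp["nstandar"].apply(quitarc)
print(cemp)
```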
