How to count paragraphs from each article from dataframe? - python

I want to count paragraphs from data frames. However, it turns out that my result gets zero inside the list. Does anybody know how to fix it? Thank you so much.
Here is my code:
def count_paragraphs(df):
    paragraph_count = []
    linecount = 0
    for i in df.text:
        if i in ('\n','\r\n'):
            if linecount == 0:
                paragraphcount = paragraphcount + 1
    return paragraph_count
count_paragraphs(df)
df.text
0 On Saturday, September 17 at 8:30 pm EST, an e...
1 Story highlights "This, though, is certain: to...
2 Critical Counties is a CNN series exploring 11...
3 McCain Criticized Trump for Arpaio’s Pardon… S...
4 Story highlights Obams reaffirms US commitment...
5 Obama weighs in on the debate\n\nPresident Bar...
6 Story highlights Ted Cruz refused to endorse T...
7 Last week I wrote an article titled “Donald Tr...
8 Story highlights Trump has 45%, Clinton 42% an...
9 Less than a day after protests over the police...
10 I woke up this morning to find a variation of ...
11 Thanks in part to the declassification of Defe...
12 The Democrats are using an intimidation tactic...
13 Dolly Kyle has written a scathing “tell all” b...
14 The Haitians in the audience have some newswor...
15 The man arrested Monday in connection with the...
16 Back when the news first broke about the pay-t...
17 Chicago Environmentalist Scumbags\n\nLeftists ...
18 Well THAT’S Weird. If the Birther movement is ...
19 Former President Bill Clinton and his Clinton ...
Name: text, dtype: object

Use Series.str.count to count the blank-line separators ('\n\n') in each article:
def count_paragraphs(df):
    return df.text.str.count(r'\n\n').tolist()
count_paragraphs(df)

This is my answer, and it works:
def count_paragraphs(df):
    paragraph_count = []
    for i in range(len(df)):
        # count the blank-line separators in each article
        paragraph_count.append(df.text[i].count('\n\n'))
    return paragraph_count
count_paragraphs(df)
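Both answers above count the '\n\n' separators, so an article with three paragraphs yields 2. If the number of paragraphs itself is wanted, one possible variant is to split on runs of blank lines and count the non-empty pieces. This is only a sketch: the sample texts are made up, and it assumes that paragraphs are separated by one or more blank lines (either '\n' or '\r\n' line endings).

```python
import re

import pandas as pd

# Made-up sample articles; only the column name "text" comes from the question.
df = pd.DataFrame({"text": [
    "One paragraph only.",
    "First.\n\nSecond.\n\nThird.",
    "Title\r\n\r\nBody\n\n\n\nEnd",
    "",
]})

def count_paragraphs(text):
    # Split on two or more consecutive line breaks, then ignore empty pieces.
    blocks = re.split(r"(?:\r?\n){2,}", text)
    return sum(1 for block in blocks if block.strip())

paragraph_counts = df["text"].apply(count_paragraphs).tolist()
print(paragraph_counts)
```

An empty article gives 0 here, whereas simply adding 1 to the separator count would report 1.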

Related

Get all the times an items value comes up in a data frame

So I have a data frame with a list of teams and their goals. I have googled different ways to solve this, but I can't find one that does what I want.
I have tried value_counts(), and it seems to get all the goals for each team, but I can't find a way to add them together.
HomeTeam TeamsGoals
Liverpool 4
0
3
6
1
matchdata.groupby("HomeTeam")["FullTimeHomeTeamsGoals"].value_counts()
I have tried many different things, but I can't get the right output.
My data set looks something like this:
HOME AWAY HOMEGOALS AWAYGOALS
Liverpool Man City 5 3
Man u Man City 0 2
LiverPool Man u 6 2
Man u LiverPool 0 2
Man City Man U 7 4
Man City Liverpool 2 2
Wanted output:
HOME TotalScoreHome
Liverpool 11
Man City 9
Man u 0
Just use groupby and sum
matchdata.groupby('HOME')['HOMEGOALS'].sum()
HOME
LiverPool 6
Liverpool 5
Man City 9
Man u 0
Name: HOMEGOALS, dtype: int64
Or, if LiverPool is really just Liverpool written in a different case, then:
matchdata.groupby(matchdata['HOME'].str.lower())['HOMEGOALS'].sum()
HOME
liverpool 11
man city 9
man u 0
Name: HOMEGOALS, dtype: int64
This does not work as you would expect because LiverPool != Liverpool, so groupby treats them as two separate teams. Fix the spelling and try again:
df.replace({'LiverPool':'Liverpool'},inplace=True)
df.groupby('HOME',as_index=False)['HOMEGOALS'].sum()
HOME HOMEGOALS
0 Liverpool 11
1 Man City 9
2 Man u 0
Note that you might need to do the same for other values if they are spelled inconsistently.
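Putting the pieces together, here is a minimal, self-contained sketch built from the sample rows in the question. It assumes the only inconsistency in HOME is letter case and stray whitespace; real data may need an explicit mapping of misspellings instead.

```python
import pandas as pd

# Sample rows copied from the question.
matchdata = pd.DataFrame({
    "HOME": ["Liverpool", "Man u", "LiverPool", "Man u", "Man City", "Man City"],
    "AWAY": ["Man City", "Man City", "Man u", "LiverPool", "Man U", "Liverpool"],
    "HOMEGOALS": [5, 0, 6, 0, 7, 2],
    "AWAYGOALS": [3, 2, 2, 2, 4, 2],
})

# Normalise the team names before grouping so that "LiverPool" and
# "Liverpool" end up in the same group.
home_key = matchdata["HOME"].str.strip().str.lower()
home_totals = matchdata.groupby(home_key)["HOMEGOALS"].sum()
print(home_totals)
```

Grouping by a derived key leaves the original HOME column untouched, which is handy if the original spelling is still needed elsewhere.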

str.findall returns all NA's

I have this df1 with a lot of different news articles. An example of a news article is this:
'Today is Monday Aug. 17 the 230th day of 2020 . There are 136 days left in the year . On August 17 2017 a van plowed through pedestrians along a packed promenade in the Spanish city of Barcelona killing 13 people and injuring 120 . A 14th victim died later from injuries . Another man was stabbed to death in a carjacking that night as the van driver made his getaway and a woman died early the next day in a vehicle-and-knife attack in a nearby coastal town . Six by police two more died when a bomb workshop exploded . In 1915 a mob in Cobb County Georgia lynched Jewish businessman Leo Frank 31 whose death sentence for the murder of 13-year-old Mary Phagan had been commuted to life imprisonment . Frank who d maintained his innocence was pardoned by the state of Georgia in 1986 . In 1960 the newly renamed Beatles formerly the Silver Beetles began their first gig in Hamburg West Germany Teamsters union president Jimmy Hoffa was sentenced in Chicago to five years in federal prison for defrauding his union s pension fund . Hoffa was released in 1971 after President Richard Nixon commuted his sentence for this conviction and jury tampering . In 1969 Hurricane Camille slammed into the Mississippi coast as a Category 5 storm that was blamed for 256 U.S. deaths three in Cuba . In 1978 the first successful trans-Atlantic balloon flight ended as Maxie Anderson Ben Abruzzo and Larry Newman landed In 1982 the first commercially produced compact discs a recording of ABBA s The Visitors were pressed at a Philips factory near Hanover West Germany .'
And I have this df2 with all the words from the news articles in the column "Word" with their corresponding LIWC category in the second column.
Data example:
data = {'Word': ['killing','even','guilty','brain'], 'Category': ['Affect', 'Adverb', 'Anx','Body']}
What I'm trying to do is: To calculate for each article in df1 how many words occur of each category in df2. So I want to create a column for each category mentioned in df2["category"].
And it should look like this in the end:
Content | Achieve | Affiliation | affect
article text here | 6 | 2 | 2
article text here | 2 | 43 | 2
article text here | 6 | 8 | 8
article text here | 2 | 13 | 7
Since it's all strings, I tried str.findall, but this returns NA for everything. This is what I tried:
from collections import Counter
liwc = df1['articles'].str.findall(fr"'({'|'.join(df2)})'") \
.apply(lambda x: pd.Series(Counter(x), index=df2["category"].unique())) \
.fillna(0).astype(int)
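Either a pandas or an R solution would be equally great.
First flatten the df2 values into a dictionary that maps each word to its category. Then build a pattern with word boundaries (\b) around the words and pass it to Series.str.extractall, map the matches to categories with Series.map, and turn the result into a DataFrame with reset_index. Finally, pass it to crosstab and append the counts to the original with DataFrame.join: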
df1 = pd.DataFrame({'articles':['Today is killing Aug. 17 the 230th day of 2020',
'Today is brain Aug. 17 the guilty day of 2020 ']})
print (df1)
articles
0 Today is killing Aug. 17 the 230th day of 2020
1 Today is brain Aug. 17 the guilty day of 2020
If the Word column holds lists of values, as in the picture:
data = {'Word': [['killing'],['even'],['guilty'],['brain']],
'Category': ['Affect', 'Adverb', 'Anx','Body']}
df2 = pd.DataFrame(data)
print (df2)
Word Category
0 [killing] Affect
1 [even] Adverb
2 [guilty] Anx
3 [brain] Body
d = {x: b for a, b in zip(df2['Word'], df2['Category']) for x in a}
print (d)
{'killing': 'Affect', 'even': 'Adverb', 'guilty': 'Anx', 'brain': 'Body'}
If the Word column of df2 holds plain strings instead:
data = {'Word': ['killing','even','guilty','brain'],
'Category': ['Affect', 'Adverb', 'Anx','Body']}
df2 = pd.DataFrame(data)
print (df2)
Word Category
0 killing Affect
1 even Adverb
2 guilty Anx
3 brain Body
d = dict(zip(df2['Word'], df2['Category']))
print (d)
{'killing': 'Affect', 'even': 'Adverb', 'guilty': 'Anx', 'brain': 'Body'}
import re
# thanks to Wiktor Stribiżew for improving the solution
pat = r"\b(?:{})\b".format("|".join(re.escape(x) for x in d))
df = df1['articles'].str.extractall(rf'({pat})')[0].map(d).reset_index(name='Category')
df = df1.join(pd.crosstab(df['level_0'], df['Category']))
print (df)
articles Affect Anx Body
0 Today is killing Aug. 17 the 230th day of 2020 1 0 0
1 Today is brain Aug. 17 the guilty day of 2020 0 1 1
You can craft a custom regex with named capturing groups and use str.extractall.
With your dictionary the custom regex would be '(?P<Affect>\\bkilling\\b)|(?P<Adverb>\\beven\\b)|(?P<Anx>\\bguilty\\b)|(?P<Body>\\bbrain\\b)'
Then groupby+max the notna results, convert to int and join to the original dataframe:
regex = '|'.join(fr'(?P<{k}>\b{v}\b)' for v,k in zip(*data.values()))
(df1.join(df1['articles'].str.extractall(regex, flags=2) # re.IGNORECASE
             .notna().groupby(level=0).max()
             .astype(int)
         )
)
output:
articles Affect Adverb Anx Body
0 Today is killing Aug. 17 the 230th day of 2020 1 0 0 0
1 Today is brain Aug. 17 the guilty day of 2020 0 0 1 1
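Note that groupby+max only records whether a category occurs in an article (0 or 1). Since the question asks how many words of each category occur, a possible variant is to sum the matches instead. The sketch below makes a few assumptions: the sample articles are made up, every category has exactly one word (a named group cannot be repeated in a pattern, so several words per category would need the dictionary approach from the other answer), and the category names are valid Python identifiers.

```python
import re

import pandas as pd

# Made-up sample articles with repeated words, to make the counts visible.
df1 = pd.DataFrame({"articles": [
    "Today is killing Aug. 17 the killing day",
    "brain and guilty and Brain",
]})
data = {"Word": ["killing", "even", "guilty", "brain"],
        "Category": ["Affect", "Adverb", "Anx", "Body"]}

# One named group per category; re.escape protects special characters.
regex = "|".join(
    rf"(?P<{category}>\b{re.escape(word)}\b)"
    for word, category in zip(data["Word"], data["Category"])
)

# Each row of `matches` is one match; exactly one group is filled per row.
matches = df1["articles"].str.extractall(regex, flags=re.IGNORECASE)

# Summing the non-null flags per article gives counts instead of 0/1.
counts = (matches.notna()
                 .groupby(level=0).sum()
                 .reindex(df1.index, fill_value=0)  # articles with no match
                 .astype(int))
result = df1.join(counts)
print(result)
```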

I want to filter rows from data frame where the year is 2020 and 2021 using re.search and re.match functions

Data Frame:
Unnamed: 0 date target insult tweet year
0 1 2014-10-09 thomas-frieden fool Can you believe this fool, Dr. Thomas Frieden ... 2014
1 2 2014-10-09 thomas-frieden DOPE Can you believe this fool, Dr. Thomas Frieden ... 2014
2 3 2015-06-16 politicians all talk and no action Big time in U.S. today - MAKE AMERICA GREAT AG... 2015
3 4 2015-06-24 ben-cardin It's politicians like Cardin that have destroy... Politician #SenatorCardin didn't like that I s... 2015
4 5 2015-06-24 neil-young total hypocrite For the nonbeliever, here is a photo of #Neily... 2015
I want a data frame that contains only the rows where the year is 2020 or 2021, using the search and match methods.
df_filtered = df.loc[df.year.str.contains('2014|2015', regex=True) == True]
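One possible way to do this with re.match and re.search is to apply a compiled pattern to each value and keep the rows where it matches. This is only a sketch: the sample rows are made up (the question shows no 2020 or 2021 rows), and it assumes that year is stored as integers, in which case the .str accessor would not work until the column is converted to strings.

```python
import re

import pandas as pd

# Made-up sample rows; only the column names come from the question.
df = pd.DataFrame({
    "date": ["2014-10-09", "2020-01-05", "2021-07-19", "2015-06-24"],
    "year": [2014, 2020, 2021, 2015],
})

# re.match anchors at the start of the string; "$" pins the end as well.
year_re = re.compile(r"202[01]$")
by_match = df[df["year"].astype(str).apply(lambda s: year_re.match(s) is not None)]

# re.search scans the whole string, so it can find the year inside the date.
date_re = re.compile(r"\b202[01]\b")
by_search = df[df["date"].apply(lambda s: date_re.search(s) is not None)]

print(by_match)
print(by_search)
```

If regular expressions are not a hard requirement, df[df["year"].isin([2020, 2021])] is a simpler alternative for an integer column.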

Extracting year from a column of string movie names

I have the following data, having two columns, "name" and "gross" in table called train_df:
gross name
760507625.0 Avatar (2009)
658672302.0 Titanic (1997)
652270625.0 Jurassic World (2015)
623357910.0 The Avengers (2012)
534858444.0 The Dark Knight (2008)
532177324.0 Rogue One (2016)
474544677.0 Star Wars: Episode I - The Phantom Menace (1999)
459005868.0 Avengers: Age of Ultron (2015)
448139099.0 The Dark Knight Rises (2012)
436471036.0 Shrek 2 (2004)
424668047.0 The Hunger Games: Catching Fire (2013)
423315812.0 Pirates of the Caribbean: Dead Man's Chest (2006)
415004880.0 Toy Story 3 (2010)
409013994.0 Iron Man 3 (2013)
408084349.0 Captain America: Civil War (2016)
408010692.0 The Hunger Games (2012)
403706375.0 Spider-Man (2002)
402453882.0 Jurassic Park (1993)
402111870.0 Transformers: Revenge of the Fallen (2009)
400738009.0 Frozen (2013)
381011219.0 Harry Potter and the Deathly Hallows: Part 2 (2011)
380843261.0 Finding Nemo (2003)
380262555.0 Star Wars: Episode III - Revenge of the Sith (2005)
373585825.0 Spider-Man 2 (2004)
370782930.0 The Passion of the Christ (2004)
I would like to read and extract the date from "name" to create a new column that will be called "year", which I will then use to filter the data set by specific year.
The new table will look like the following:
year gross name
2009 760507625.0 Avatar (2009)
1997 658672302.0 Titanic (1997)
2015 652270625.0 Jurassic World (2015)
2012 623357910.0 The Avengers (2012)
2008 534858444.0 The Dark Knight (2008)
I tried the apply and lambda approach, but got no results:
train_df[train_df.apply(lambda row: row['name'].startswith('2014'),axis=1)]
Is there a way to use "contains" (as in C#) or "isin" to filter the strings in Python?
If you know for sure that your years are going to be at the end of the string, you can do
df['year'] = df['name'].str[-5:-1].astype(int)
This takes the column name, uses the str accessor to access the value of each row as a string, and takes the -5:-1 slice from it. Then, it converts the result to int, and sets it as the year column. This approach will be much faster than iterating over the rows if you have lots of data.
Alternatively, you could use regex for more flexibility using the .extract() method of the str accessor.
df['year'] = df['name'].str.extract(r'\((\d{4})\)').astype(int)
This extracts the group matching the expression \((\d{4})\), which captures exactly four digits enclosed in a pair of parentheses, and it works anywhere in the string. To anchor it to the end of your string, add a $ at the end of the regex, like so: \((\d{4})\)$. For this data, the regex and the string slicing give the same result.
Now we have our new dataframe:
gross name year
0 760507625.0 Avatar (2009) 2009
1 658672302.0 Titanic (1997) 1997
2 652270625.0 Jurassic World (2015) 2015
3 623357910.0 The Avengers (2012) 2012
4 534858444.0 The Dark Knight (2008) 2008
5 532177324.0 Rogue One (2016) 2016
6 474544677.0 Star Wars: Episode I - The Phantom Menace (1999) 1999
7 459005868.0 Avengers: Age of Ultron (2015) 2015
8 448139099.0 The Dark Knight Rises (2012) 2012
9 436471036.0 Shrek 2 (2004) 2004
10 424668047.0 The Hunger Games: Catching Fire (2013) 2013
11 423315812.0 Pirates of the Caribbean: Dead Man's Chest (2006) 2006
12 415004880.0 Toy Story 3 (2010) 2010
13 409013994.0 Iron Man 3 (2013) 2013
14 408084349.0 Captain America: Civil War (2016) 2016
15 408010692.0 The Hunger Games (2012) 2012
16 403706375.0 Spider-Man (2002) 2002
17 402453882.0 Jurassic Park (1993) 1993
18 402111870.0 Transformers: Revenge of the Fallen (2009) 2009
19 400738009.0 Frozen (2013) 2013
20 381011219.0 Harry Potter and the Deathly Hallows: Part 2 (... 2011
21 380843261.0 Finding Nemo (2003) 2003
22 380262555.0 Star Wars: Episode III - Revenge of the Sith (... 2005
23 373585825.0 Spider-Man 2 (2004) 2004
24 370782930.0 The Passion of the Christ (2004) 2004
You can use a regular expression with pandas.Series.str.extract for this:
df["year"] = df["name"].str.extract(r"\((\d{4})\)$", expand=False)
df["year"] = pd.to_numeric(df["year"])
print(df.head())
gross name year
0 760507625.0 Avatar (2009) 2009
1 658672302.0 Titanic (1997) 1997
2 652270625.0 Jurassic World (2015) 2015
3 623357910.0 The Avengers (2012) 2012
4 534858444.0 The Dark Knight (2008) 2008
The regular expression:
\(: Find a literal opening parenthesis
(\d{4}): Then, find 4 digits appearing next to each other
The parentheses here mean that we're storing our 4 digits as a capture group (in this case, it's the group of digits we want to extract from the larger string)
\): Then, find a literal closing parenthesis
$: All of the above MUST occur at the end of the string
When all of the above criteria are met, get those 4 digits; if there is no match, return NaN for that row.
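Once the year column exists, filtering by specific years (the original goal) is a plain comparison or an isin call. A minimal sketch, using a few rows from the question plus one made-up title without a year to show the NaN case:

```python
import pandas as pd

train_df = pd.DataFrame({
    "gross": [760507625.0, 658672302.0, 652270625.0, 1.0],
    # "Untitled" is a made-up row with no year, to show the NaN case.
    "name": ["Avatar (2009)", "Titanic (1997)", "Jurassic World (2015)", "Untitled"],
})

# Extract the four digits inside the trailing parentheses; rows without a
# match become NaN, so the resulting column is float rather than int.
year_text = train_df["name"].str.extract(r"\((\d{4})\)$", expand=False)
train_df["year"] = pd.to_numeric(year_text)

# Filter by one or several years.
selected = train_df[train_df["year"].isin([2009, 2015])]
print(selected)
```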
Try this.
titles = ['Avatar (2009)', 'Titanic (1997)', 'Jurassic World (2015)','The Avengers (2012)', 'The Dark Knight (2008)', 'Rogue One (2016)','Star Wars: Episode I - The Phantom Menace (1999)','Avengers: Age of Ultron (2015)', 'The Dark Knight Rises (2012)','Shrek 2 (2004)', 'Boiling Point (1990)', 'Terror Firmer (1999)', "Adam's Apples (2005)", 'I Want You (1998)', 'Chalet Girl (2011)','Love, Honor and Obey (2000)', "Perrier's Bounty (2009)",'Into the White (2012)', 'The Decoy Bride (2011)','I Spit on Your Grave 2 (2013)']
for i in titles:
    mov_title = i[:-7]  # everything before " (YYYY)"
    year = i[-5:-1]     # the four digits inside the parentheses
    print(mov_title)  # do your actual extraction
    print(year)  # do your actual extraction
def getYear(val):
    # rfind picks the last pair of parentheses, in case the title itself contains some
    startIndex = val.rfind('(')
    endIndex = val.rfind(')')
    return val[startIndex + 1:endIndex]
I'm not much of a Python dev, but I believe this will do. You just need to loop through the titles, passing each one to the function above. Each call returns the year extracted from that title.

KNN for name matching - unsupervised learning

I am working on a name matching problem using synthetic data such as
alertname custname
0 wlison wilson
1 dais said
2 4dams adams
3 ad4ms adams
4 ad48s adams
5 smyth smith
6 smythe smith
7 gillan gillan
8 gilen gillan
9 scott-smith scottsmith
10 scott smith scottsmith
11 perrson person
12 persson person
Now I want to apply KNN to this task in an unsupervised way, since I do not have any explicit labels. I want to output a matching score for each row. I have already used fuzzy matching and now want to explore KNN for some automation. I would really appreciate it if someone could provide a starting point.
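As a possible starting point, an unsupervised nearest-neighbour search needs only a vector-like representation of each name and a distance between them. The sketch below is a brute-force illustration in plain Python, not a tuned solution: it represents each name as a set of character bigrams and uses Jaccard similarity as the matching score. The helper names (char_ngrams, jaccard, nearest_names) are hypothetical, and the choice of bigrams and Jaccard is an assumption. For larger data, the same idea can be expressed with character n-gram features and scikit-learn's NearestNeighbors.

```python
def char_ngrams(name, n=2):
    # Pad with spaces so that the first and last letters form their own n-grams.
    padded = f" {name.lower()} "
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}

def jaccard(a, b):
    # Share of n-grams the two names have in common (1.0 means identical sets).
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def nearest_names(query, candidates, k=1):
    # Brute-force k-nearest-neighbour search: score every candidate, keep the top k.
    query_grams = char_ngrams(query)
    scored = [(jaccard(query_grams, char_ngrams(c)), c) for c in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]

# Customer and alert names taken from the sample data in the question.
customers = ["wilson", "said", "adams", "smith", "gillan", "scottsmith", "person"]
for alert in ["wlison", "smythe", "gilen", "persson"]:
    score, match = nearest_names(alert, customers, k=1)[0]
    print(f"{alert} -> {match} (score {score:.2f})")
```

The score of the nearest neighbour can serve as the per-row matching score; a threshold on it would then decide which pairs count as a match.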
