Counting phrase frequencies in an html file - python

I'm currently trying to get used to Python and have recently hit a block in my coding. I couldn't write code that counts the number of times a phrase appears in an HTML file. I've recently received some help constructing the code for counting the frequency in a text file, but I'm wondering whether there is a way to do this directly from the HTML file (to bypass the copy-and-paste alternative). Any advice will be sincerely appreciated. The previous code I have used is the following:
#!/usr/bin/env python3
import collections
import re
from itertools import tee

# Defining a function named "findWords".
def findWords(filepath):
    with open(filepath) as infile:
        for line in infile:
            words = re.findall(r'\w+', line.lower())
            yield from words

phcnt = collections.Counter()
phrases = {'central bank', 'high inflation'}

fw1, fw2 = tee(findWords('02.2003.BenBernanke.txt'))
next(fw2)
for w1, w2 in zip(fw1, fw2):
    phrase = ' '.join([w1, w2])
    if phrase in phrases:
        phcnt[phrase] += 1
print(phcnt)

You can use the some_str.count(some_phrase) method:
In [19]: txt = 'Text mining, also referred to as text data mining, Text mining,\
also referred to as text data mining,'
In [20]: txt.lower().count('data mining')
Out[20]: 2
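For an HTML file you could read the raw markup into a string first; a minimal sketch (the file name is just a placeholder):
with open('02.2003.BenBernanke.html', encoding='utf-8') as f:  # hypothetical file name
    html = f.read()

# Counts non-overlapping occurrences in the raw HTML, markup included.
print(html.lower().count('central bank'))
Note that str.count() will also match phrases that happen to appear inside tags or attributes, which is why stripping the markup first (see below) is usually cleaner.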

What about just stripping the html tags before doing the analysis? html2text does this job quite well.
import html2text
content = html2text.html2text(infile.read())
would give you the text content (somewhat formatted, but that's no problem for your approach, I think). There are additional options to ignore images and links, which you would use like this:
h = html2text.HTML2Text()
h.ignore_images = True
h.ignore_links = True
content = h.handle(infile.read())
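Putting the two pieces together, a sketch (untested, file name assumed) that strips the HTML with html2text and then reuses the phrase-counting approach from the question:
import collections
import re
from itertools import tee

import html2text

h = html2text.HTML2Text()
h.ignore_images = True
h.ignore_links = True

# Hypothetical HTML version of the speech file from the question.
with open('02.2003.BenBernanke.html', encoding='utf-8') as infile:
    content = h.handle(infile.read())

words = re.findall(r'\w+', content.lower())
phrases = {'central bank', 'high inflation'}
phcnt = collections.Counter()

# Pair each word with its successor and count the two-word phrases of interest.
w1, w2 = tee(words)
next(w2)
for pair in zip(w1, w2):
    phrase = ' '.join(pair)
    if phrase in phrases:
        phcnt[phrase] += 1
print(phcnt)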

Related

"Replace" from central file?

I am trying to extend the replace function. Instead of doing the replacements on individual lines or individual commands, I would like to use the replacements from a central text file.
That's the source:
import os
import feedparser
import pandas as pd
pd.set_option('max_colwidth', -1)
RSS_URL = "https://techcrunch.com/startups/feed/"
feed = feedparser.parse(RSS_URL)
entries = pd.DataFrame(feed.entries)
entries = entries[['title']]
entries = entries.to_string(index=False, header=False)
entries = entries.replace(' ', '\n')
entries = os.linesep.join([s for s in entries.splitlines() if s])
print(entries)
I want to be able to replace words from an RSS feed using a central "replacement" file. The source file should have two columns: old word, new word, like the replace function replace('old','new').
Output/Print Example:
truck
rental
marketplace
D’Amelio
family
launches
to
invest
up
to
$25M
...
In most cases I want to delete the words that are unnecessary for me, so e.g. replace('to',''). But I also want to be able to change special names, e.g. replace("D'Amelio","DAmelio"). The goal is to reduce the number of words and build up a kind of keyword radar.
Is this possible? I can't find any help by Googling, but it could well be that I don't know the right terms or can't formulate the question.
with open('<filepath>', 'r') as r:
    # if you remove the ' marks from around your words, you can remove the [1:-1] part of the below code
    words_to_replace = [word.strip()[1:-1] for word in r.read().split(',')]

def replace_words(original_text, words_to_replace):
    for word in words_to_replace:
        original_text = original_text.replace(word, '')
    return original_text
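A short usage sketch, assuming words_to_replace was loaded as above; the input text here is just a made-up example:
# Hypothetical input text; every word listed in the file is stripped out.
text = "truck rental marketplace to invest up to 25M"
print(replace_words(text, words_to_replace))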
I was unable to understand your question properly, but as far as I understand, you have strings like cat, dog, etc., and you have a file containing the data with which you want to replace those strings. If that is your requirement, I have given a solution below; try running it and see whether it satisfies your requirement.
If that's not what you meant, please comment below.
TXT file (don't use '' around the strings in the text file):
papa, papi
dog, dogo
cat, kitten
Python File:
your_string = input("Type a string here: ")  # string you want to replace
with open('textfile.txt', "r") as file1:  # open your file
    lines = file1.readlines()
    for line in lines:  # take the lines of the file one by one in a loop
        string1 = f'{line}'
        string1 = string1.split()  # split the line of the file into a list like ['cat,', 'kitten']
        if your_string == string1[0][:-1]:  # compare your string with the first word on the line
            your_string = your_string.replace(your_string, string1[1])  # if the input was cat, it is replaced with kitten
            print(your_string)
        else:
            pass
If this gives you the correct answer, please upvote it, as it took time to make and test the Python file.
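Tying this back to the question's RSS script, a sketch (the replacements.txt name and its comma-separated old,new format are assumptions) that applies a central file to the word-per-line entries string built in the question:
# Build an old -> new mapping from a hypothetical two-column replacements.txt
# (one "old, new" pair per line; map a word to '' to delete it entirely).
replacements = {}
with open('replacements.txt') as f:
    for line in f:
        if line.strip():
            old, new = [part.strip() for part in line.split(',', 1)]
            replacements[old] = new

# `entries` is the newline-separated word list from the question's script.
cleaned = []
for word in entries.splitlines():
    word = replacements.get(word, word)  # swap the word if it is listed
    if word:                             # drop words that were mapped to ''
        cleaned.append(word)
print('\n'.join(cleaned))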

How can I parse a text from an URL and put the clean text in a DataFrame?

I have an Excel file of 147 Toronto Star news articles that I've compiled and loaded into a dataframe. I have also written a Python script that can extract the text from one article at a time. However, I'd like to improve my script so that Python will cycle through all the URLs in the dataframe, scrape the text, append the scraped, stopworded text to the row (or perhaps to a linked text file?), and then leverage that dataframe for a classification algorithm and further exploration.
Can someone please help me with writing the loop? (I have no background in programming.. struggling!)
creating the dataframe
import pandas as pd

url_file = 'https://github.com/MarissaFosse/ryersoncapstone/raw/master/DailyNewsArticles.xlsx'
tstar_articles = pd.read_excel(url_file, "TorontoStar Articles", header=0)
nltk with one article
import requests
from bs4 import BeautifulSoup
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

URL = 'https://www.thestar.com/news/gta/2019/12/31/with-291-people-shot-2019-is-closing-as-torontos-bloodiest-year-on-record-for-overall-gun-violence.html'
page = requests.get(URL)
soup = BeautifulSoup(page.content, 'html.parser')
results = soup.find(class_='c-article-body__content')
results_text = [tag.get_text().strip() for tag in results]
sentence_list = [sentence for sentence in results_text if '\n' not in sentence]
sentence_list = [sentence for sentence in sentence_list if '.' in sentence]
article = ' '.join(sentence_list)

word_tokens = word_tokenize(article)
stop_words = set(stopwords.words('english'))
filtered_article = [w for w in word_tokens if w not in stop_words]

filtered_sentence = []
for w in word_tokens:
    if w not in stop_words:
        filtered_sentence.append(w)

clean_tokens = word_tokens[:]
for token in word_tokens:
    if token in stopwords.words('english'):
        clean_tokens.remove(token)
Firstly, most news sites have an RSS feed; for the www.thestar.com site, there's https://www.thestar.com/about/rssfeeds.html
Instead of parsing URLs from an Excel sheet, it's much more convenient to parse the RSS feed =)
Let's try the Toronto Star's Vancouver feed, from http://www.thestar.com/content/thestar/feed.RSSManagerServlet.articles.vancouver.rss
To get the data from a website, one can use the requests library
In code:
import requests
response = requests.get('http://www.thestar.com/content/thestar/feed.RSSManagerServlet.articles.vancouver.rss')
toronto_rss = response.content.decode('utf8')
To parse the XML feed, let's use the feedparser library:
import requests
import feedparser
from bs4 import BeautifulSoup
response = requests.get('http://www.thestar.com/content/thestar/feed.RSSManagerServlet.articles.vancouver.rss')
toronto_rss = response.content.decode('utf8')
feed = feedparser.parse(toronto_rss)
for item in feed.entries:
    print(item.link)
Now let's try to fetch the text from each of the links in the RSS feed using BeautifulSoup:
import requests
import feedparser
from bs4 import BeautifulSoup
response = requests.get('http://www.thestar.com/content/thestar/feed.RSSManagerServlet.articles.vancouver.rss')
toronto_rss = response.content.decode('utf8')
feed = feedparser.parse(toronto_rss)
for item in feed.entries:
    url = item.link
    response = requests.get(url)
    bsoup = BeautifulSoup(response.content.decode('utf8'))
And from the BeautifulSoup object, there is a nifty get_text() function that we can use to extract the text (sometimes this can get somewhat noisy).
Since you already did the hard work for finding the c-article-body__content tag that you need to extract the article's main text, we can get the text from:
import requests
import feedparser
from bs4 import BeautifulSoup
response = requests.get('http://www.thestar.com/content/thestar/feed.RSSManagerServlet.articles.vancouver.rss')
toronto_rss = response.content.decode('utf8')
feed = feedparser.parse(toronto_rss)
url_to_sents = {}
for item in feed.entries:
    url = item.link
    response = requests.get(url)
    bsoup = BeautifulSoup(response.content.decode('utf8'))
    article_sents = '\n'.join([p.text for p in bsoup.find(class_='c-article-body__content').find_all('p')])
    url_to_sents[url] = article_sents
That's all nice, the explanation and all, but you haven't told me how to put them into a dataframe.
Now the question is why do you need the dataframe? If you only need some keyword tokens per URL, then we have to do some processing.
Let's first define the steps needed to preprocess the text and get our keywords:
1. Sentence-tokenize the text, then
2. Word-tokenize each sentence, then
3. Remove the stop words
Now there are several options; we can use scikit-learn with nltk to do (1), (2) and (3), see https://www.kaggle.com/alvations/basic-nlp-with-nltk
But let's keep it simple and just use NLTK for now.
Since the nltk.word_tokenize() function implicitly calls sent_tokenize, just calling word_tokenize will do, so only (2) and (3) remain.
For now let's simply use nltk.corpus.stopwords as the stopword list for (3).
So we have this preprocess function:
from nltk import word_tokenize
from nltk.corpus import stopwords
from string import punctuation
stoplist = set(stopwords.words('english')) | set(punctuation)
def preprocess(text):
    return [word for word in word_tokenize(text) if word not in stoplist and not word.isdigit()]
text = url_to_sents['https://www.thestar.com/vancouver/2020/02/20/vancouver-fire-says-smoking-caused-the-citys-first-fatal-fire-of-2020.html']
preprocess(text)
Hey, I said, that's all nice and all, but I really want a DataFrame...
Okay, okay, there's a dataframe coming, but BTW, pandas.DataFrame is not the only DataFrame library in Python, see https://www.quora.com/Whats-the-difference-between-an-SFrame-and-a-DataFrame-in-Python
Alright, alright, here's pandas...
First we have the url_to_sents dictionary, which has the URLs as keys and the article text as values.
And let's say we want a dataframe that keeps:
a. the URL
b. the text in the article
c. the resulting tokens from the "cleaned" text
So here's a dataframe with (a) and (b):
import pandas as pd
urls, texts = zip(*url_to_sents.items())
data = {'urls':urls, 'text': texts}
df = pd.DataFrame.from_dict(data)
[out]:
urls text
0 https://www.thestar.com/vancouver/2020/03/26/p... VANCOUVER—British Columbia’s human rights comm...
1 https://www.thestar.com/vancouver/2020/03/08/d... VICTORIA—At the end of a stark news conference...
2 https://www.thestar.com/vancouver/2020/03/08/c... Teck Resources says it’s baffled over the virt...
3 https://www.thestar.com/vancouver/2020/02/29/t... SQUAMISH, B.C.—RCMP in Squamish, B.C., are inv...
4 https://www.thestar.com/vancouver/2020/02/26/v... VANCOUVER—The man who attempted to steal a flo...
5 https://www.thestar.com/vancouver/2020/02/22/g... VANCOUVER—Canada’s Governor General visited an...
6 https://www.thestar.com/vancouver/2020/02/20/v... Vancouver philanthropist and former chancellor...
7 https://www.thestar.com/vancouver/2020/02/20/v... VANCOUVER—A man with mobility challenges has d...
8 https://www.thestar.com/vancouver/2020/02/17/b... VICTORIA—British Columbia’s finance minister i...
Nice! How about the cleaned tokens?
Since we have a dataframe to work with and a function that we want to apply to all the values in the text column, we just need to use DataFrame.apply, i.e.
df['cleaned_tokens'] = df['text'].apply(preprocess)
Awesome!! Wait a minute, did you just put quotation marks around the "cleaned" text?
Yes, I did. Because what is "clean"? See https://www.kaggle.com/alvations/basic-nlp-with-nltk
Why do we need to clean the text?
Do we really need to clean the text?
What is the ultimate goal of preprocessing the text?
I guess the above questions are out of scope for the original post (OP), so I'm going to leave them as food for thought for you =)
Have fun with the code above!
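If you do want to stick with the Excel sheet from the question instead of the RSS feed, here is a sketch of the loop; the column name 'URL' is a guess (adjust it to whatever the sheet actually uses), and preprocess is the function defined above:
import requests
import pandas as pd
from bs4 import BeautifulSoup

def fetch_article_text(url):
    """Fetch one article and return the text of its body paragraphs."""
    page = requests.get(url)
    soup = BeautifulSoup(page.content, 'html.parser')
    body = soup.find(class_='c-article-body__content')
    if body is None:  # some pages may use a different layout
        return ''
    return '\n'.join(p.get_text().strip() for p in body.find_all('p'))

url_file = 'https://github.com/MarissaFosse/ryersoncapstone/raw/master/DailyNewsArticles.xlsx'
tstar_articles = pd.read_excel(url_file, "TorontoStar Articles", header=0)

# 'URL' is a hypothetical column name; replace it with the real one from the sheet.
tstar_articles['text'] = tstar_articles['URL'].apply(fetch_article_text)
tstar_articles['cleaned_tokens'] = tstar_articles['text'].apply(preprocess)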

how to extract the contextual words of a token in python

Actually I want to extract the contextual words of a specific word. For this purpose I can use n-grams in Python, but the drawback is that the window slides by one token, whereas I only need the contextual words of a specific word. E.g. my file is like this:
IL-2
gene
expression
and
NF-kappa
B
activation
through
CD28
requires
reactive
oxygen
production
by
5-lipoxygenase
.
I mean there is one token on each line. Now I want to extract the surrounding words of each token, e.g. "through" and "requires" are the surrounding words of "CD28". I wrote some Python code, but it did not work and generated the error ValueError: list.index(x): x not in list.
My code is
import re;
import nltk;
file=open("C:/Python26/test.txt");
contents= file.read()
tokens = nltk.word_tokenize(contents)
f=open("trigram.txt",'w');
for l in tokens:
    print tokens[l], tokens[l+1]
f.close();
First of all, list.index(x) : Return the index in the list of the first item whose value is x.
>>> ["foo", "bar", "baz"].index('bar')
1
In the code below, the variable 'word' is populated using a range of integers, not the actual contents, so we can't (and don't need to) use list.index() at all.
>>> print lines.index(1)
ValueError: 1 is not in list
change your code like this :
file="C:/Python26/tokens.txt";
f=open("trigram.txt",'w');
with open(file,'r') as rf:
lines = rf.readlines();
for word in range(1,len(lines)-1):
f.write(lines[word-1].strip()+"\t"+lines[word].strip()+"\t"+lines[word+1].strip())
f.close()
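If you only need the neighbours of one specific token, e.g. "CD28", rather than every trigram, a minimal sketch (Python 3) along the same lines:
# One token per line in tokens.txt; print the word before and after the target.
target = "CD28"

with open("C:/Python26/tokens.txt") as rf:
    tokens = [line.strip() for line in rf if line.strip()]

for i, tok in enumerate(tokens):
    if tok == target:
        left = tokens[i - 1] if i > 0 else None
        right = tokens[i + 1] if i < len(tokens) - 1 else None
        print(left, tok, right)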
I don't really understand what you want to do, but I'll do my best.
If you want to process words with python there is a library called NLTK which means Natural Language Toolkit.
You may need to tokenize a sentence or a document.
import nltk

def tokenize_query(query):
    return nltk.word_tokenize(query)

f = open('C:/Python26/tokens.txt')
raw = f.read()
tokenize_query(raw)
We can also read a file one line at a time using a for loop:
f = open('C:/Python26/tokens.txt', 'rU')
for line in f:
    print(line.strip())
r means 'read' and U means 'universal', if you are wondering.
strip() is just cutting '\n' from the text.
The context may be provided by wordnet and all its functions.
I guess you should use synsets with the word's pos (part of speech).
A synset is sort of a synonyms list in a semantic way.
NLTK can provide you some others nice features like sentiment analysis and similarity between synsets.
file="C:/Python26/tokens.txt";
f=open("trigram.txt",'w');
with open(file,'r') as rf:
lines = rf.readlines();
for word in range(1,len(lines)-1):
f.write(lines[word-1].strip()+"\t"+lines[word].strip()+"\t"+lines[word+1].strip())
f.write("\n")
f.close()
This code also gives the same result
import nltk;
from nltk.util import ngrams
from nltk import word_tokenize
file = open("C:/Python26/tokens.txt");
contents=file.read();
tokens = nltk.word_tokenize(contents);
f_tri = open("trigram.txt",'w');
trigram = ngrams(tokens,3)
for t in trigram:
    f_tri.write(str(t) + "\n")
f_tri.close()
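If you only care about the trigrams centred on one particular token, e.g. "CD28", you could filter the same generator; a small sketch reusing the tokens list from above:
# Keep only the trigrams whose middle element is the target token.
target = "CD28"
centred = [t for t in ngrams(tokens, 3) if t[1] == target]
print(centred)  # e.g. [('through', 'CD28', 'requires')]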

parsing a .srt file with regex

I am doing a small script in python, but since I am quite new I got stuck in one part:
I need to get timing and text from a .srt file. For example, from
1
00:00:01,000 --> 00:00:04,074
Subtitles downloaded from www.OpenSubtitles.org
I need to get:
00:00:01,000 --> 00:00:04,074
and
Subtitles downloaded from www.OpenSubtitles.org.
I have already managed to write the regex for the timing, but I am stuck on the text. I've tried to use a lookbehind that reuses my regex for the timing:
( ?<=(\d+):(\d+):(\d+)(?:\,)(\d+) --> (\d+):(\d+):(\d+)(?:\,)(\d+) )\w+
but with no effect. Personally, I think that using a lookbehind is the right way to solve this, but I am not sure how to write it correctly. Can anyone help me? Thanks.
Honestly, I don't see any reason to throw regex at this problem. .srt files are highly structured. The structure goes like:
an integer starting at 1, monotonically increasing
start --> stop timing
one or more lines of subtitle content
a blank line
... and repeat. Note the "one or more" part: you might have to capture 1, 2, or 20 lines of subtitle content after the time code.
So, just take advantage of the structure. In this way you can parse everything in just one pass, without needing to put more than one line into memory at a time and still keeping all the information for each subtitle together.
from itertools import groupby
# "chunk" our input file, delimited by blank lines
with open(filename) as f:
    res = [list(g) for b, g in groupby(f, lambda x: bool(x.strip())) if b]
For example, using the example on the SRT doc page, I get:
res
Out[60]:
[['1\n',
'00:02:17,440 --> 00:02:20,375\n',
"Senator, we're making\n",
'our final approach into Coruscant.\n'],
['2\n', '00:02:20,476 --> 00:02:22,501\n', 'Very good, Lieutenant.\n']]
And I could further transform that into a list of meaningful objects:
from collections import namedtuple
Subtitle = namedtuple('Subtitle', 'number start end content')
subs = []
for sub in res:
    if len(sub) >= 3:  # not strictly necessary, but better safe than sorry
        sub = [x.strip() for x in sub]
        number, start_end, *content = sub  # py3 syntax
        start, end = start_end.split(' --> ')
        subs.append(Subtitle(number, start, end, content))
subs
Out[65]:
[Subtitle(number='1', start='00:02:17,440', end='00:02:20,375', content=["Senator, we're making", 'our final approach into Coruscant.']),
Subtitle(number='2', start='00:02:20,476', end='00:02:22,501', content=['Very good, Lieutenant.'])]
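If you then want each cue's text as a single string rather than a list of lines, joining content is enough:
# Print each subtitle with its content collapsed to one line.
for sub in subs:
    print(sub.start, '-->', sub.end)
    print(' '.join(sub.content))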
Disagree with @roippi. Regex is a very nice solution to text matching, and the regex for this solution is not tricky.
import re

# Open the file and read its content
with open(yoursrtfile) as f:
    content = f.read()

# Find all results in the content.
# The first big (...) group retrieves the timing, \s+ matches the whitespace in between,
# and (.+) retrieves the text content that follows.
result = re.findall(r"(\d+:\d+:\d+,\d+ --> \d+:\d+:\d+,\d+)\s+(.+)", content)

# Just print out the result list. I recommend you do some formatting here.
print(result)
number: ^[0-9]+$
Time: ^[0-9][0-9]:[0-9][0-9]:[0-9][0-9],[0-9][0-9][0-9] --> [0-9][0-9]:[0-9][0-9]:[0-9][0-9],[0-9][0-9][0-9]$
string: *[a-zA-Z]+*
Hope this helps.
Thanks @roippi for this excellent parser.
It helped me a lot to write an srt-to-stl converter in less than 40 lines (in Python 2 though, as it has to fit into a larger project).
from __future__ import print_function, division
from itertools import groupby
from collections import namedtuple
# prepare - adapt to you needs or use sys.argv
inputname = 'FR.srt'
outputname = 'FR.stl'
stlheader = """
$FontName = Arial
$FontSize = 34
$HorzAlign = Center
$VertAlign = Bottom
"""
def converttime(sttime):
    "convert from srt time format (0...999) to stl one (0...25)"
    st = sttime.split(',')
    return "%s:%02d" % (st[0], round(25*float(st[1])/1000))

# load
with open(inputname, 'r') as f:
    res = [list(g) for b, g in groupby(f, lambda x: bool(x.strip())) if b]

# parse
Subtitle = namedtuple('Subtitle', 'number start end content')
subs = []
for sub in res:
    if len(sub) >= 3:  # not strictly necessary, but better safe than sorry
        sub = [x.strip() for x in sub]
        number, start_end, content = sub[0], sub[1], sub[2:]  # py 2 syntax
        start, end = start_end.split(' --> ')
        subs.append(Subtitle(number, start, end, content))

# write
with open(outputname, 'w') as F:
    F.write(stlheader)
    for sub in subs:
        F.write("%s , %s , %s\n" % (converttime(sub.start), converttime(sub.end), "|".join(sub.content)))
For the timing:
pattern = ("(\d{2}:\d{2}:\d{2},\d{3}?.*)")
None of the pure regex solutions above worked for the real-life srt files.
Let's take a look at the following SRT-patterned text:
1
00:02:17,440 --> 00:02:20,375
Some multi lined text
This is a second line
2
00:02:20,476 --> 00:02:22,501
as well as a single line
3
00:03:20,476 --> 00:03:22,501
should be able to parse unicoded text too
こんにちは
Take note that:
text may contain Unicode characters.
Text can consist of several lines.
Every cue starts with an integer value and ends with a blank new line; both Unix-style and Windows-style CR/LF line endings are accepted.
Here is the working regex:
\d+[\r\n](\d+:\d+:\d+,\d+ --> \d+:\d+:\d+,\d+)[\r\n]((.+\r?\n)+(?=(\r?\n)?))
https://regex101.com/r/qICmEM/1
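A small sketch of how that pattern could be used from Python with re.findall (the file name is a placeholder; note the pattern has four groups, so findall yields 4-tuples):
import re

srt_pattern = re.compile(
    r'\d+[\r\n](\d+:\d+:\d+,\d+ --> \d+:\d+:\d+,\d+)[\r\n]((.+\r?\n)+(?=(\r?\n)?))'
)

with open('subtitles.srt', encoding='utf-8') as f:  # hypothetical file name
    content = f.read()

# Group 1 is the timing line, group 2 is the (possibly multi-line) text block.
for timing, text, _, _ in srt_pattern.findall(content):
    print(timing)
    print(text.strip())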

Replace section of text with only knowing the beginning and last word using Python

In Python, is it possible to cut out a section of text in a document when you only know the beginning and end words?
For example, using the bill of rights as the sample document, search for "Amendment 3" and remove all the text until you hit "Amendment 4" without actually knowing or caring what text exists between the two end points.
The reason I'm asking is I would like to use this Python script to modify my other Python programs when I upload them to the client's computer -- removing sections of code that exists between a comment that says "#chop-begin" and "#chop-end". I do not want the client to have access to all of the functions without paying for the better version of the code.
You can use Python's re module.
I wrote this example script for removing the sections of code in a file:
import re
# Create regular expression pattern
chop = re.compile('#chop-begin.*?#chop-end', re.DOTALL)
# Open file
f = open('data', 'r')
data = f.read()
f.close()
# Chop text between #chop-begin and #chop-end
data_chopped = chop.sub('', data)
# Save result
f = open('data', 'w')
f.write(data_chopped)
f.close()
With data.txt
do_something_public()
#chop-begin abcd
get_rid_of_me() #chop-end
#chop-beginner this should stay!
#chop-begin
do_something_private()
#chop-end The rest of this comment should go too!
but_you_need_me() #chop-begin
last_to_go()
#chop-end
the following code
import re

class Chopper(object):
    def __init__(self, start='\\s*#ch'+'op-begin\\b', end='#ch'+'op-end\\b.*?$'):
        super(Chopper, self).__init__()
        self.re = re.compile('{0}.*?{1}'.format(start, end), flags=re.DOTALL+re.MULTILINE)

    def chop(self, s):
        return self.re.sub('', s)

    def chopFile(self, infname, outfname=None):
        if outfname is None:
            outfname = infname
        with open(infname) as inf:
            data = inf.read()
        with open(outfname, 'w') as outf:
            outf.write(self.chop(data))

ch = Chopper()
ch.chopFile('data.txt')
results in data.txt
do_something_public()
#chop-beginner this should stay!
but_you_need_me()
Use regular expressions:
import re
string = re.sub('#chop-begin.*?#chop-end', '', string, flags=re.DOTALL)
.*? will lazily match everything in between.
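A tiny sketch showing why the lazy quantifier matters here: with a greedy .* the match would stretch from the first #chop-begin to the last #chop-end and delete the code in between.
import re

code = "a()\n#chop-begin\nb()\n#chop-end\nkeep_me()\n#chop-begin\nc()\n#chop-end\n"

# Lazy: removes each block separately, keep_me() survives.
print(re.sub('#chop-begin.*?#chop-end', '', code, flags=re.DOTALL))
# Greedy: one huge match, keep_me() is removed too.
print(re.sub('#chop-begin.*#chop-end', '', code, flags=re.DOTALL))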
