I tried to write a simple Python program with the imdb package to extract movie information from their database, but I don't know why the code returns an empty list. My guess is that the way I extract the URL info from the website (by using (.*?)) is wrong. I want to extract a URL link from the webpage. Here's the code. Thanks!
import urllib
import re
import imdb

imdb_access = imdb.IMDb()
top_num = 5
movie_list = ["The Matrix", "The Matrix", "The Matrix", "The Matrix", "The Matrix"]

for x in xrange(0, top_num):
    contain = imdb_access.search_movie(movie_list[x])
    ID = contain[0].movieID  # str type
    htmltext = urllib.urlopen("http://www.imdb.com/title/tt0133093/?ref_=nv_sr_1").read()
    # a pattern in the website
    regex = '<img alt="The Matrix Poster" title="The Matrix Poster" src="(.*?)" itemprop="image">'
    pattern = re.compile(regex)
    #print (str((pattern)))
    result = re.findall(pattern, htmltext)
    print result
    #print type(htmltext)
I think the problem is with newlines. Can you try (.*\n*.*?) instead?
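If newlines really are the culprit, another way to test that theory is to compile the same pattern with re.DOTALL so that . can also match across line breaks; a minimal sketch, reusing the htmltext variable read inside the question's loop:
import re

# same pattern as in the question, but re.DOTALL lets '.' match newline characters too
regex = '<img alt="The Matrix Poster" title="The Matrix Poster" src="(.*?)" itemprop="image">'
pattern = re.compile(regex, re.DOTALL)
result = pattern.findall(htmltext)  # htmltext comes from the urlopen(...).read() call above
print(result)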
Good Morning,
I'm doing some HTML parsing in Python and I've run across the following which is a time & name pairing in a single table cell. I'm trying to extract each piece of info separately and have tried several different approaches to split the following string.
HTML String:
<span><strong>13:30</strong><br/>SecondWord</span></a>
My output would hopefully be:
text1 = 13:30
text2 = "SecondWord"
I'm currently using a loop through all the rows in the table, taking the text and splitting it by a newline. I noticed the HTML has a line break tag in between, so it renders separately on the web; I was trying to replace this with a newline and run my split on that, but my string.replace() and re.sub() approaches don't seem to be working.
I'd love to know what I'm doing wrong.
Latest Approach:
resub_pat = r'<br/>'
rows = list()
for row in table.findAll("tr"):
    a = re.sub(resub_pat, "\n", row.text).split("\n")
This is a bit hashed together, but I hope I've captured my problem! I wasn't able to find any similar issues.
You could try:
from bs4 import BeautifulSoup
import re
# the soup
soup = BeautifulSoup("<span><strong>13:30</strong><br/>SecondWord</span></a>", 'lxml')
# the regex object
rx = re.compile(r'(\d+:\d+)(.+)')
# time, text
text = soup.find('span').get_text()
x,y = rx.findall(text)[0]
print(x)
print(y)
Use recursive=False to get only the direct text, and strong.text to get the other piece.
Ex:
from bs4 import BeautifulSoup
soup = BeautifulSoup("<span><strong>13:30</strong><br/>SecondWord</span></a>", 'lxml')
# text1
print(soup.find("span").strong.text) # --> 13:30
# text2
print(soup.find("span").find(text=True, recursive=False)) # --> SecondWord
from bs4 import BeautifulSoup
txt = '''<span><strong>13:30</strong><br/>SecondWord</span></a>'''
soup = BeautifulSoup(txt, 'html.parser')
text1, text2 = soup.span.get_text(strip=True, separator='|').split('|')
print(text1)
print(text2)
Prints:
13:30
SecondWord
I have an Excel file of 147 Toronto Star news articles that I've compiled into a dataframe. I have also written a Python script that can extract the text from one article at a time. However, I'd like to improve my script so that Python will cycle through all the URLs in the dataframe, scrape the text, append the scraped, stopworded text to the row (or perhaps to a linked text file?), and then leverage that dataframe for a classification algorithm and more exploration.
Can someone please help me with writing the loop? (I have no background in programming.. struggling!)
creating the dataframe
import pandas as pd

url_file = 'https://github.com/MarissaFosse/ryersoncapstone/raw/master/DailyNewsArticles.xlsx'
tstar_articles = pd.read_excel(url_file, "TorontoStar Articles", header=0)
nltk with one article
import requests
from bs4 import BeautifulSoup
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

URL = 'https://www.thestar.com/news/gta/2019/12/31/with-291-people-shot-2019-is-closing-as-torontos-bloodiest-year-on-record-for-overall-gun-violence.html'
page = requests.get(URL)
soup = BeautifulSoup(page.content, 'html.parser')
results = soup.find(class_='c-article-body__content')
results_text = [tag.get_text().strip() for tag in results]
sentence_list = [sentence for sentence in results_text if not '\n' in sentence]
sentence_list = [sentence for sentence in sentence_list if '.' in sentence]
article = ' '.join(sentence_list)

word_tokens = word_tokenize(article)
stop_words = set(stopwords.words('english'))
filtered_article = [w for w in word_tokens if not w in stop_words]

filtered_sentence = []
for w in word_tokens:
    if w not in stop_words:
        filtered_sentence.append(w)

clean_tokens = word_tokens[:]
for token in word_tokens:
    if token in stopwords.words('english'):
        clean_tokens.remove(token)
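For reference, here is a minimal sketch of the loop the question asks for, built from the one-article snippet above; it assumes the dataframe's URL column is named 'URL' (a hypothetical name, substitute the actual column header from the Excel sheet) and appends the cleaned text as a new column:
import requests
from bs4 import BeautifulSoup
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))
cleaned_articles = []

# 'URL' is a hypothetical column name; replace it with the real header in DailyNewsArticles.xlsx
for url in tstar_articles['URL']:
    page = requests.get(url)
    soup = BeautifulSoup(page.content, 'html.parser')
    results = soup.find(class_='c-article-body__content')
    if results is None:  # some pages may not contain the expected article body tag
        cleaned_articles.append('')
        continue
    results_text = [tag.get_text().strip() for tag in results]
    sentence_list = [s for s in results_text if '\n' not in s and '.' in s]
    article = ' '.join(sentence_list)
    word_tokens = word_tokenize(article)
    filtered_article = [w for w in word_tokens if w not in stop_words]
    cleaned_articles.append(' '.join(filtered_article))

# attach the stopworded text to each row of the dataframe
tstar_articles['cleaned_text'] = cleaned_articles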
Firstly, most news sites have an RSS feed; for the www.thestar.com site, there's https://www.thestar.com/about/rssfeeds.html
Instead of parsing URLs from an Excel sheet, it's much more convenient to parse the RSS feed =)
Let's try the Toronto Star news, from http://www.thestar.com/content/thestar/feed.RSSManagerServlet.articles.vancouver.rss
To get the data from a website, one can use the requests library
In code:
import requests
response = requests.get('http://www.thestar.com/content/thestar/feed.RSSManagerServlet.articles.vancouver.rss')
toronto_rss = response.content.decode('utf8')
To parse the XML file, let's use the feedparser library:
import requests
import feedparser
from bs4 import BeautifulSoup
response = requests.get('http://www.thestar.com/content/thestar/feed.RSSManagerServlet.articles.vancouver.rss')
toronto_rss = response.content.decode('utf8')
feed = feedparser.parse(toronto_rss)
for item in feed.entries:
    print(item.link)
Now let's try to fetch the text from each of the links in the RSS feed using BeautifulSoup:
import requests
import feedparser
from bs4 import BeautifulSoup
response = requests.get('http://www.thestar.com/content/thestar/feed.RSSManagerServlet.articles.vancouver.rss')
toronto_rss = response.content.decode('utf8')
feed = feedparser.parse(toronto_rss)
for item in feed.entries:
    url = item.link
    response = requests.get(url)
    bsoup = BeautifulSoup(response.content.decode('utf8'))
And from the BeautifulSoup object, there is a nifty get_text() function that we can use to extract the text (sometimes this can get somewhat noisy).
Since you already did the hard work of finding the c-article-body__content tag needed to extract the article's main text, we can get the text with:
import requests
import feedparser
from bs4 import BeautifulSoup
response = requests.get('http://www.thestar.com/content/thestar/feed.RSSManagerServlet.articles.vancouver.rss')
toronto_rss = response.content.decode('utf8')
feed = feedparser.parse(toronto_rss)
url_to_sents = {}
for item in feed.entries:
    url = item.link
    response = requests.get(url)
    bsoup = BeautifulSoup(response.content.decode('utf8'))
    article_sents = '\n'.join([p.text for p in bsoup.find(class_='c-article-body__content').find_all('p')])
    url_to_sents[url] = article_sents
That's all nice, the explanation and all but you haven't told me how to put them into a dataframe.
Now the question is: why do you need the dataframe? If you only need some keyword tokens per URL, then we have to do some processing.
Let's first define the steps needed for preprocessing to get our keywords:
1. We want to sentence tokenize, then
2. Word tokenize each sentence
3. Remove the stop words
Now there are several options: we can use scikit-learn with nltk to do (1), (2) and (3), see https://www.kaggle.com/alvations/basic-nlp-with-nltk
But let's keep it simple and just use NLTK for now.
Since the nltk.word_tokenize() function implicitly calls sent_tokenize, we can just call word_tokenize, so (2) and (3) alone would do.
For now let's simply use nltk.corpus.stopwords as the stopword list for (3).
So we have this preprocess function:
from nltk import word_tokenize
from nltk.corpus import stopwords
from string import punctuation
stoplist = set(stopwords.words('english')) | set(punctuation)
def preprocess(text):
    return [word for word in word_tokenize(text) if word not in stoplist and not word.isdigit()]
text = url_to_sents['https://www.thestar.com/vancouver/2020/02/20/vancouver-fire-says-smoking-caused-the-citys-first-fatal-fire-of-2020.html']
preprocess(text)
Hey, I said, that's all nice and all but I really want a DataFrame...
Okay, okay, there's a dataframe coming. But BTW, pandas.DataFrame is not the only DataFrame library in Python, see https://www.quora.com/Whats-the-difference-between-an-SFrame-and-a-DataFrame-in-Python
Alright, alright, here's pandas...
First we have the url_to_sents dictionary, which has the URLs as keys and the text from each article as values.
And let's say we want a dataframe that keeps:
a. the URL
b. the text in the article
c. the resulting tokens from the "cleaned" text
So here's a dataframe with (a) and (b):
import pandas as pd
urls, texts = zip(*url_to_sents.items())
data = {'urls':urls, 'text': texts}
df = pd.DataFrame.from_dict(data)
[out]:
urls text
0 https://www.thestar.com/vancouver/2020/03/26/p... VANCOUVER—British Columbia’s human rights comm...
1 https://www.thestar.com/vancouver/2020/03/08/d... VICTORIA—At the end of a stark news conference...
2 https://www.thestar.com/vancouver/2020/03/08/c... Teck Resources says it’s baffled over the virt...
3 https://www.thestar.com/vancouver/2020/02/29/t... SQUAMISH, B.C.—RCMP in Squamish, B.C., are inv...
4 https://www.thestar.com/vancouver/2020/02/26/v... VANCOUVER—The man who attempted to steal a flo...
5 https://www.thestar.com/vancouver/2020/02/22/g... VANCOUVER—Canada’s Governor General visited an...
6 https://www.thestar.com/vancouver/2020/02/20/v... Vancouver philanthropist and former chancellor...
7 https://www.thestar.com/vancouver/2020/02/20/v... VANCOUVER—A man with mobility challenges has d...
8 https://www.thestar.com/vancouver/2020/02/17/b... VICTORIA—British Columbia’s finance minister i...
Nice! How about the cleaned tokens?
Since we have a dataframe to work with and a function that we want to apply to all the values in the text column, we just need to use DataFrame.apply, i.e.
df['cleaned_tokens'] = df['text'].apply(preprocess)
Awesome!! Wait a minute, did you just put quotation marks around the "cleaned" text?
Yes, I did. Because what is "clean"? See https://www.kaggle.com/alvations/basic-nlp-with-nltk
Why do we need to clean the text?
Do we really need to clean the text?
What is the ultimate goal of preprocessing the text?
I guess the above questions are out of the scope of the original post (OP), so I'm gonna leave them as food for thought for you =)
Have fun with the code above!
Currently I'm trying to write a program that will search for a tag and the characters in front of that tag (until a space or newline) in a local HTML file, but I don't know how. I wrote some code, but it isn't working: it only lists all the text in the HTML instead of looking for the 'PA' and the characters around it.
Here's my code so far:
from bs4 import BeautifulSoup
import re
ecj_data = open('output.html', 'r').read()
soup = BeautifulSoup(ecj_data, 'lxml')
d = 'PA'
soup_strings = [ l for l in list(soup.strings) if l.strip() != '' ]
for s in soup_strings:
    print(s)
Do you mean to search for words containing 'PA'? Please try the below.
for s in soup.strings:
    for i in s.split(' '):
        if 'PA' in i:
            print(i)
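An alternative, if you would rather not split on spaces yourself, is a regular expression that pulls every whitespace-delimited token containing 'PA' out of the document text; a minimal sketch, assuming the same soup object built from output.html:
import re

# \S*PA\S* matches any run of non-whitespace characters that contains 'PA'
matches = re.findall(r'\S*PA\S*', soup.get_text())
for m in matches:
    print(m)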
I have the following Python code running in my Jupyter notebook:
from lxml.html import parse
tree = parse('http://www.imdb.com/chart/top')
movies = tree.findall('.//table[@class="chart full-width"]//td[@class="titleColumn"]//a')
movies[0].text_content()
The above code gives me the following output:
'The Shawshank Redemption'
Basically, it is the content of the first row of the column named 'titleColumn' on that webpage. In that same table there is another column called 'posterColumn' which contains a thumbnail image.
Now I want my codes to retrieve those images and the output to also show that image.
Do I need to use another package to achieve this? Can the image be shown in Jupyter Notebook?
To get the associated images, you need to get the posterColumn. From this you can extract the img src entry and pull the jpg images. The file can then be saved based on the movie title, taking care to remove any non-valid filename characters such as ':'.
from lxml.html import parse
import requests
import string

valid_chars = "-_.() " + string.ascii_letters + string.digits

tree = parse('http://www.imdb.com/chart/top')
movies = tree.findall('.//table[@class="chart full-width"]//td[@class="titleColumn"]//a')
posters = tree.findall('.//table[@class="chart full-width"]//td[@class="posterColumn"]//a')

for p, m in zip(posters, movies):
    for element, attribute, link, pos in p.iterlinks():
        if attribute == 'src':
            print("{:50} {}".format(m.text_content(), link))
            poster_jpg = requests.get(link, stream=True)
            valid_filename = ''.join(c for c in m.text_content() if c in valid_chars)
            with open('{}.jpg'.format(valid_filename), 'wb') as f_jpg:
                for chunk in poster_jpg:
                    f_jpg.write(chunk)
So currently you would see something starting as:
The Shawshank Redemption https://images-na.ssl-images-amazon.com/images/M/MV5BODU4MjU4NjIwNl5BMl5BanBnXkFtZTgwMDU2MjEyMDE@._V1_UY67_CR0,0,45,67_AL_.jpg
The Godfather https://images-na.ssl-images-amazon.com/images/M/MV5BZTRmNjQ1ZDYtNDgzMy00OGE0LWE4N2YtNTkzNWQ5ZDhlNGJmL2ltYWdlL2ltYWdlXkEyXkFqcGdeQXVyNjU0OTQ0OTY@._V1_UY67_CR1,0,45,67_AL_.jpg
The Godfather: Part II https://images-na.ssl-images-amazon.com/images/M/MV5BMjZiNzIxNTQtNDc5Zi00YWY1LThkMTctMDgzYjY4YjI1YmQyL2ltYWdlL2ltYWdlXkEyXkFqcGdeQXVyNjU0OTQ0OTY@._V1_UY67_CR1,0,45,67_AL_.jpg
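As for showing the image in the Jupyter Notebook itself: IPython's display utilities can render an image either from a saved file or straight from a URL. A minimal sketch, assuming one of the .jpg files written by the loop above (the filename is just an example):
from IPython.display import Image, display

# render a poster that the download loop saved to disk
display(Image(filename='The Shawshank Redemption.jpg'))

# or render directly from the img src link without saving it first
display(Image(url='https://images-na.ssl-images-amazon.com/images/M/MV5BODU4MjU4NjIwNl5BMl5BanBnXkFtZTgwMDU2MjEyMDE@._V1_UY67_CR0,0,45,67_AL_.jpg'))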
I have been developing a python web-crawler to collect the used car stock data from this website. (http://www.bobaedream.co.kr/cyber/CyberCar.php?gubun=I&page=20)
First of all, I would like to collect only "BMW" from the list. So, I used the "search" function from the regular expression module, as in the code below. But it keeps returning "None".
Is there anything wrong in my code?
Please give me some advice.
Thanks.
from bs4 import BeautifulSoup
import urllib.request
import re
CAR_PAGE_TEMPLATE = "http://www.bobaedream.co.kr/cyber/CyberCar.php?gubun=I&page="
def fetch_post_list():
    for i in range(20, 21):
        URL = CAR_PAGE_TEMPLATE + str(i)
        res = urllib.request.urlopen(URL)
        html = res.read()
        soup = BeautifulSoup(html, 'html.parser')
        table = soup.find('table', class_='cyber')
        print("Page#", i)

        # 50 lists per each page
        lists = table.find_all('tr', itemtype="http://schema.org/Article")
        count = 0
        r = re.compile("[BMW]")
        for lst in lists:
            if lst.find_all('td')[3].find('em').text:
                lst_price = lst.find_all('td')[3].find('em').text
                lst_title = lst.find_all('td')[1].find('a').text
                lst_link = lst.find_all('td')[1].find('a')['href']
                lst_photo_url = ''
                if lst.find_all('td')[0].find('img'):
                    lst_photo_url = lst.find_all('td')[0].find('img')['src']
                count += 1
            else:
                continue
            print('#', count, lst_title, r.search("lst_title"))
    return lst_link

fetch_post_list()
r.search("lst_title")
This is searching inside the string literal "lst_title", not the variable named lst_title, that's why it never matches.
r=re.compile("[BMW]")
The square brackets indicate that you're looking for any one of those characters. So, for example, any string containing M will match. You just want "BMW". In fact, you don't even need regular expressions; you can just test:
"BMW" in lst_title