re.search and urlopen in Python

I have this script:
for url in urls:
    u = urlopen(url).read
    owner_id = re.search(r'ownerId: ([1-9]+)?,', u).group(1)
    id = re.search(r'id: ([1-9]+)?,', u).group(1)
    print(owner_id)
    print(id)
urls is a list of URLs.
The script fails with "TypeError: expected string or bytes-like object".
Do you have any idea how to fix that?

Not sure which version of Python you're using (the code below is for v3+; for v2, replace urllib.request with urllib2).
You need to import urllib.request and Beautiful Soup:
import urllib.request
from bs4 import BeautifulSoup
url = "url address"
html = urllib.request.urlopen(url).read()
soup = BeautifulSoup(html, "lxml")
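For what it's worth, the TypeError in the question most likely comes from passing `urlopen(url).read` (the method itself, without parentheses) to `re.search`. In Python 3, `read()` also returns bytes, so the result needs decoding before a str pattern can be applied. A minimal sketch of both points, using `io.BytesIO` as a stand-in for the HTTP response; the sample payload and the `\d+` pattern are assumptions for illustration, not the asker's real page:

```python
import io
import re

# Stand-in for urlopen(url): a file-like object whose read() returns bytes.
response = io.BytesIO(b"ownerId: 1203, id: 4507,")

# Passing the method itself (no parentheses) reproduces the error.
try:
    re.search(r"ownerId: (\d+),", response.read)
except TypeError as exc:
    error = str(exc)

# Calling read() and decoding gives a str that re.search accepts.
text = response.read().decode("utf-8")
owner_id = re.search(r"ownerId: (\d+),", text).group(1)
item_id = re.search(r"\bid: (\d+),", text).group(1)
print(owner_id, item_id)
```

Note that `[1-9]+` in the original pattern would stop at the first 0 in an id, which is why the sketch uses `\d+` instead.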

Related

Extracting json when web scraping

I was following a python guide on web scraping and there's one line of code that won't work for me. I'd appreciate it if anybody could help me figure out what the issue is, thanks.
from bs4 import BeautifulSoup
import json
import re
import requests
url = 'https://finance.yahoo.com/quote/AAPL/analysis?p=AAPL'
page = requests.get(url)
soup = BeautifulSoup(page.content, "lxml")
script = soup.find('script', text=re.compile('root\.App\.main'))
json_text = re.search(r'^\s*root\.App\.main\s*=\s*({.*?})\s*;\s*$',script.string, flags=re.MULTILINE).group(1)
Error Message:
json_text = re.search(r'^\s*root\.App\.main\s*=\s*({.*?})\s*;\s*$',script.string, flags=re.MULTILINE).group(1)
AttributeError: 'NoneType' object has no attribute 'string'
Link to the guide I was looking at: https://www.mattbutton.com/how-to-scrape-stock-upgrades-and-downgrades-from-yahoo-finance/
The main issue, in my opinion, is that you should add a user-agent to your request so that you get the expected HTML:
headers = {'user-agent':'Mozilla/5.0'}
page = requests.get(url, headers=headers)
Note: First of all, take a closer look at your soup to check whether the expected information is actually available.
Example
import re
import json
import requests
from bs4 import BeautifulSoup
url = 'https://finance.yahoo.com/quote/AAPL/analysis?p=AAPL'
headers = {'user-agent':'Mozilla/5.0'}
page = requests.get(url, headers=headers)
soup = BeautifulSoup(page.content, "lxml")
script = soup.find('script', text=re.compile(r'root\.App\.main'))
json_text = json.loads(re.search(r'^\s*root\.App\.main\s*=\s*({.*?})\s*;\s*$',script.string, flags=re.MULTILINE).group(1))
json_text
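The AttributeError in the question means that soup.find() returned None, i.e. no matching script tag was present in the HTML that came back. A minimal sketch of a guard around the extraction step, using only re and json; the sample script text and the `extract_app_main` helper name are made up for illustration:

```python
import json
import re

PATTERN = re.compile(r"^\s*root\.App\.main\s*=\s*({.*?})\s*;\s*$", re.MULTILINE)


def extract_app_main(script_text):
    """Return the parsed root.App.main object, or None if it is absent."""
    if script_text is None:  # e.g. soup.find(...) found no matching tag
        return None
    match = PATTERN.search(script_text)
    return json.loads(match.group(1)) if match else None


# Made-up script body in the same shape as the one the regex targets.
sample = '(function (root) {\nroot.App.main = {"context": {"ticker": "AAPL"}};\n}(this));'
print(extract_app_main(sample))
print(extract_app_main(None))
```

With a helper like this, a missing script tag shows up as a None result that can be checked, rather than as an exception deep inside a one-liner.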

Python: extracted email address from a webpage returning extra characters

Here is the example web address that contains an email address.
Here is the code that I am using:
from bs4 import BeautifulSoup
import requests
import re
url = 'https://viterbi.usc.edu/directory/faculty/Zadeh/Ali-Enayat'
page_response = requests.get(url, timeout=5)
soup = BeautifulSoup(page_response.content, "html.parser")
email = re.findall(r"[a-z0-9\.\-+_]+#[a-z0-9\.\-+_]+\.[a-z]+", soup.text)
print(email)
I expect it to return azadeh#usc.edu as the email address, but it returns 740-4694azadeh#usc.edu. What am I doing wrong, and how can this be solved so that the email extraction works for any webpage?
There is no need to use re when the full capabilities of bs4 are at your disposal:
from bs4 import BeautifulSoup as soup
import requests
d = soup(requests.get('https://viterbi.usc.edu/directory/faculty/Zadeh/Ali-Enayat').text, 'html.parser')
email = d.find('div', {'class':'contactInformation'}).find_all('ul')[-2].find_all('li')[-1].text
Output:
'azadeh#usc.edu'
Edit: a more generic approach is to apply the regular expression to the html content of the bs4 object:
re.findall(r"[a-z0-9\.\-+_]+#[a-z0-9\.\-+_]+\.[a-z]+", str(d))
Output:
['azadeh#usc.edu']
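As for why the phone number gets attached: soup.text joins the text of adjacent elements with no separator, so the digits that end one element most likely run straight into the address that starts the next. A minimal sketch of the effect with plain strings; the phone number, address, and room are made up, and the pattern is written with `@`:

```python
import re

EMAIL = re.compile(r"[a-z0-9.\-+_]+@[a-z0-9.\-+_]+\.[a-z]+")

# Text of three adjacent elements (phone, e-mail, office); all values are made up.
pieces = ["(213) 555-0123", "jdoe@example.edu", "Room 101"]

glued = "".join(pieces)    # what soup.text effectively produces
spaced = " ".join(pieces)  # what soup.get_text(separator=" ") would produce

print(EMAIL.findall(glued))
print(EMAIL.findall(spaced))
```

So besides matching against str(d), calling get_text(separator=" ") on the soup is another way to keep neighbouring elements from merging.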

How to use Beautiful Soup to extract function string in <script> tag from a website?

On a web page, how can I use Beautiful Soup to extract the "return" information under "function getData()" in the HTML source code?
I got an error like this:
print(pattern.search(script.text).group(1))
AttributeError: 'NoneType' object has no attribute 'text'
import os, sys, re, urllib, urllib2
from urllib2 import urlopen, Request
from bs4 import BeautifulSoup
url = "http://zipwho.com/?zip=91709&city=&filters=--_--_--_--&state=&mode=zip"
data = urlopen(url).read()
soup = BeautifulSoup(data, "html.parser")
pattern = re.compile(r'return "(.*?)";$', re.MULTILINE | re.DOTALL)
script = soup.find("script", text=pattern)
print(pattern.search(script.text).group(1))
I tried it on my computer (with requests, not urllib2) and got this:
print(script)
>>> None
This is why you get the
AttributeError: 'NoneType' object has no attribute 'text'
I'm not sure what your regex is trying to achieve, but check it again.
Maybe test it first on the string you expect to get.
Edit:
Try this:
url = "http://zipwho.com/?zip=91709&city=&filters=--_--_--_--&state=&mode=zip"
data = urlopen(url).read()
soup = BeautifulSoup(data, "html.parser")
script = soup.find("script")
print(script.text)
The output:
function getData()
{
return "zip,city,state,MedianIncome,MedianIncomeRank,CostOfLivingIndex,CostOfLivingRank,MedianMortgageToIncomeRatio,MedianMortgageToIncomeRank,OwnerOccupiedHomesPercent,OwnerOccupiedHomesRank,MedianRoomsInHome,MedianRoomsInHomeRank,CollegeDegreePercent,CollegeDegreeRank,ProfessionalPercent,ProfessionalRank,Population,PopulationRank,AverageHouseholdSize,AverageHouseholdSizeRank,MedianAge,MedianAgeRank,MaleToFemaleRatio,MaleToFemaleRank,MarriedPercent,MarriedRank,DivorcedPercent,DivorcedRank,WhitePercent,WhiteRank,BlackPercent,BlackRank,AsianPercent,AsianRank,HispanicEthnicityPercent,HispanicEthnicityRank\n91709,Chino Hills,CA,78336,96,260.8,93,25.6,92,84.9,81,6.4,90,37.5,87,44.9,88,66693,99,3.3,96,32.3,13,93.6,57,66.9,83,6.3,11,43.7,10,5.4,68,21.0,98,25.6,92";
}
function getResultsCount()
{
return "1";
}
It's a string:
type(script.text)
>>><class 'str'>
So now you can easily match a regex against it to get the result you want.
My code:
import requests
from bs4 import BeautifulSoup
url = "http://zipwho.com/?zip=91709&city=&filters=--_--_--_--&state=&mode=zip"
data = requests.get(url)
soup = BeautifulSoup(data.content, "html.parser")
script = soup.find('script')
print(script.text)
Notice that I'm using requests instead of urllib2 (go ahead and try it).
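Once the script text is available as a string, the getData() return value can be pulled out with a regex anchored on the function name. A minimal sketch, assuming the script looks like the output shown above; the sample is shortened to three columns, and the `extract_get_data` helper name is made up:

```python
import re

# Shortened copy of the script text; the \\n is a literal backslash + n,
# exactly as it appears inside the JavaScript string.
script_text = '''function getData()
{
return "zip,city,state\\n91709,Chino Hills,CA";
}
function getResultsCount()
{
return "1";
}'''

GET_DATA = re.compile(r'function getData\(\)\s*\{\s*return "(.*?)";', re.DOTALL)


def extract_get_data(text):
    """Return the string returned by getData(), or None if it is not found."""
    match = GET_DATA.search(text)
    return match.group(1) if match else None


raw = extract_get_data(script_text)
header, row = raw.split("\\n")  # split on the literal backslash-n sequence
record = dict(zip(header.split(","), row.split(",")))
print(record)
```

Anchoring on `function getData()` avoids accidentally picking up the `return "1";` from getResultsCount().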

Crawl a news website and getting the news content

I'm trying to download the text from a news website. The HTML is:
<div class="pane-content">
<div class="field field-type-text field-field-noticia-bajada">
<div class="field-items">
<div class="field-item odd">
<p>"My Text" target="_blank">www.injuv.cl</a></strong></p> </div>
The output should be: My Text
I'm using the following python code:
try:
    from BeautifulSoup import BeautifulSoup
except ImportError:
    from bs4 import BeautifulSoup
html = "My URL"
parsed_html = BeautifulSoup(html)
p = parsed_html.find("div", attrs={'class':'pane-content'})
print(p)
But the output of the code is "None". Do you know what is wrong with my code?
The problem is that you are not parsing the HTML, you are parsing the URL string:
html = "My URL"
parsed_html = BeautifulSoup(html)
Instead, you need to retrieve the page source first. For example, in Python 2:
from urllib2 import urlopen
html = urlopen("My URL")
parsed_html = BeautifulSoup(html)
In Python 3, it would be:
from urllib.request import urlopen
html = urlopen("My URL")
parsed_html = BeautifulSoup(html)
Or, you can use the third-party "for humans"-style requests library:
import requests
html = requests.get("My URL").content
parsed_html = BeautifulSoup(html)
Also note that you should not be using BeautifulSoup version 3 at all - it is not maintained anymore. Replace:
try:
    from BeautifulSoup import BeautifulSoup
except ImportError:
    from bs4 import BeautifulSoup
with just:
from bs4 import BeautifulSoup
BeautifulSoup accepts a string of HTML. You need to retrieve the HTML from the page using the URL.
Check out urllib for making HTTP requests. (Or requests for an even simpler way.) Retrieve the HTML and pass that to BeautifulSoup like so:
import urllib  # Python 2; in Python 3, use urllib.request.urlopen instead
from bs4 import BeautifulSoup
# Get the HTML
conn = urllib.urlopen("http://www.example.com")
html = conn.read()
# Give BeautifulSoup the HTML:
soup = BeautifulSoup(html, "html.parser")
From here, just parse as you attempted previously.
p = soup.find("div", attrs={'class':'pane-content'})
print(p)

'NoneType' object is not callable beautifulsoup error while using get_text

I wrote this code for extracting all text from a web page:
from BeautifulSoup import BeautifulSoup
import urllib2
soup = BeautifulSoup(urllib2.urlopen('http://www.pythonforbeginners.com').read())
print(soup.get_text())
The problem is I get this error:
print(soup.get_text())
TypeError: 'NoneType' object is not callable
Any idea about how to solve this?
The method is called soup.getText(), i.e. camelCased.
Why you get TypeError instead of AttributeError here is a mystery to me!
As Markku suggests in the comments, I would recommend breaking your code up.
from BeautifulSoup import BeautifulSoup
import urllib2
URL = "http://www.pythonforbeginners.com"
page = urllib2.urlopen(URL)
html = page.read()
soup = BeautifulSoup(html)
print(soup.getText())
If it's still not working, throw in some print statements to see what's going on.
from BeautifulSoup import BeautifulSoup
import urllib2
URL = "http://www.pythonforbeginners.com"
print("URL is {} and its type is {}".format(URL, type(URL)))
page = urllib2.urlopen(URL)
print("Page is {} and its type is {}".format(page, type(page)))
html = page.read()
print("html is {} and its type is {}".format(html, type(html)))
soup = BeautifulSoup(html)
print("soup is {} and its type is {}".format(soup, type(soup)))
print(soup.getText())
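On the TypeError-versus-AttributeError puzzle: as far as I know, BeautifulSoup treats an unknown attribute on a tag as shorthand for find() with that name, so soup.get_text on version 3 looks for a <get_text> tag, finds none, and evaluates to None, which then gets called. A toy sketch of that mechanism; the `ToyTag` class is hypothetical and is not the library's real implementation:

```python
class ToyTag:
    """Toy stand-in for a parsed tag; not the library's real implementation."""

    def __init__(self, children):
        self.children = children  # maps child tag name -> text

    def find(self, name):
        return self.children.get(name)  # None when there is no such child

    def __getattr__(self, name):
        # Called only for attributes that do not otherwise exist.
        return self.find(name)


soup = ToyTag({"title": "Python For Beginners"})
print(soup.title)     # attribute lookup falls through to find("title")
print(soup.get_text)  # no such child, so the lookup quietly gives None

try:
    soup.get_text()   # calling that None raises the TypeError
except TypeError as exc:
    message = str(exc)
    print(message)
```

With bs4 installed instead of BeautifulSoup 3, get_text() is a real method and the original one-liner works as written.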
