I'm trying to get data and export it to CSV. I have a main URL page and a second URL page, and I have imported the following:
from bs4 import BeautifulSoup
import urllib.request
from urllib.parse import urlparse, parse_qs
import csv
def get_page(url):
    request = urllib.request.Request(url)
    response = urllib.request.urlopen(request)
    mainpage = response.read().decode('utf-8')
    return mainpage

mainpage = get_page('www.website1.com')
mainpage_parser = BeautifulSoup(mainpage, 'html.parser')

secondpage = get_page('www.website2.com')
secondpage_parser = BeautifulSoup(secondpage, 'html.parser')
The data follow the same pattern on both pages, such as Title and Address; thus I use "find" or "find_all" on each class, for example:
try:
    name = page_parser.find("h1", {"class": "xxx"}).find("a").get_text()
    print(name)
except:
    print(name)
This works.
However, I couldn't get the "lat" and "lon" values from the URL in the src attribute of this HTML element:
<img class="aaa" alt="map" data-track-id="static-map" width="97" height="142" src="https://www.website.com/aaaaaaa;height=284&lat=18.111&lon=98.111&level=15&returnImage=true">
The code I'm trying to get latitude and longitude is:
for gps in secondpage_parser.find_all('img', {"class": "aaa"}, src=True):
    parsed_url = urlparse(gps['src'])
    mykeys = ['lat', 'lon']
    gpslocation = [parse_qs(parsed_url.query)[k][0] for k in mykeys]
    print(gpslocation)
But it raises a KeyError on the "gpslocation = [parse_qs(parsed_url.query)[k][0] for k in mykeys]" line: "KeyError: 'lat'".
I would like to know which part is wrong and how I should fix it. Please help.
This URL has no query string but does have parameters (see what is the difference between URL parameters and query strings). So when you try to parse the query string you get an empty dictionary, hence the KeyError.
"https://www.website.com/aaaaaaa;height=284&lat=18.111&lon=98.111&level=15&returnImage=true"
# ^--- semicolon, not question mark
Result of print(parsed_url)
ParseResult(
scheme='https',
netloc='www.website.com',
path='/aaaaaaa',
params='height=284&lat=18.111&lon=98.111&level=15&returnImage=true',
query='',
fragment='')
The key here is to parse the parameters. To fix your code, change parsed_url.query to parsed_url.params:
gpslocation = [parse_qs(parsed_url.params)[k][0] for k in mykeys]
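For context, a minimal sketch of the fixed loop, using the example src value from the question (the class name "aaa" and the URLs are placeholders):

from urllib.parse import urlparse, parse_qs

for gps in secondpage_parser.find_all('img', {"class": "aaa"}, src=True):
    parsed_url = urlparse(gps['src'])
    # the part after the ';' ends up in .params, not .query
    mykeys = ['lat', 'lon']
    gpslocation = [parse_qs(parsed_url.params)[k][0] for k in mykeys]
    print(gpslocation)  # e.g. ['18.111', '98.111']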
I was wondering if you can help.
I'm using beautifulsoup to write to Google Sheets.
I've created a crawler that runs through a series of URLs, scrapes the content and then updates a Google sheet.
What I now want to do is prevent a URL from being written to my sheet again if it already exists (in column C).
e.g. if I already had the URL https://www.bbc.co.uk/1 in my table, I wouldn't want it appearing in my table again.
Here is my code:
from cgitb import text
import requests
from bs4 import BeautifulSoup
import gspread
import datetime
import urllib.parse

gc = gspread.service_account(filename='creds.json')
sh = gc.open('scrapetosheets').sheet1

urls = ["https://www.ig.com/uk/trading-strategies", "https://www.ig.com/us/trading-strategies"]

for url in urls:
    my_url = requests.get(url)
    html = my_url.content
    soup = BeautifulSoup(html, 'html.parser')

    for item in soup.find_all('h3', class_="article-category-section-title"):
        date = datetime.datetime.now()
        title = item.find('a', class_='primary js_target').text.strip()
        url = item.find('a', class_='primary js_target').get('href')
        abs = "https://www.ig.com"
        rel = url
        info = {'date': date, 'title': title, 'url': urllib.parse.urljoin(abs, rel)}
        sh.append_row([str(info['date']), str(info['title']), str(info['url'])])
Thanks in advance.
Mark
I'd like to know what I can add to the end of my code to prevent duplicate URLs being entered into my Google Sheet.
I believe your goal is as follows.
You want to append the values of [str(info['date']), str(info['title']), str(info['url'])] only when the value of str(info['url']) does not already exist in column "C".
Modification points:
In this case, it is required to check column "C" of the existing sheet of sh = gc.open('scrapetosheets').sheet1. This has already been mentioned in TheMaster's comment.
In your script, append_row is used inside a loop. In this case, the process cost becomes high.
When these points are reflected in your script, how about the following modification?
Modified script:
from cgitb import text
import requests
from bs4 import BeautifulSoup
import gspread
import datetime
import urllib.parse

gc = gspread.service_account(filename='creds.json')
sh = gc.open('scrapetosheets').sheet1

urls = ["https://www.ig.com/uk/trading-strategies", "https://www.ig.com/us/trading-strategies"]

# I modified the below script.
obj = {r[2]: True for r in sh.get_all_values()}
ar = []
for url in urls:
    my_url = requests.get(url)
    html = my_url.content
    soup = BeautifulSoup(html, "html.parser")

    for item in soup.find_all("h3", class_="article-category-section-title"):
        date = datetime.datetime.now()
        title = item.find("a", class_="primary js_target").text.strip()
        url = item.find("a", class_="primary js_target").get("href")
        abs = "https://www.ig.com"
        rel = url
        info = {"date": date, "title": title, "url": urllib.parse.urljoin(abs, rel)}
        url = str(info["url"])
        if url not in obj:
            ar.append([str(info["date"]), str(info["title"]), url])

if ar != []:
    sh.append_rows(ar, value_input_option="USER_ENTERED")
When this script is run, the existing values are first retrieved from the sheet and used to build an object for looking up str(info["url"]). When the value of str(info["url"]) does not already exist in column "C" of the sheet, the row values are put into an array. The array is then appended to the sheet with a single API call.
Reference:
append_rows
I have a problem with my code (I use bs4):
elif 'temperature' in query:
    speak("where?")
    miejsce = takecommand().lower()
    search = f"Temperature in {miejsce}"
    url = f'https://www.google.com/search?q={search}'
    r = requests.get(url)
    data = BeautifulSoup(r.text, "html.parser")
    temp = data.find("div", class_="BNeawe").text
    speak(f"In {search} there is {temp}")
and the error is:
temp = data.find("div", class_="BNeawe").text
AttributeError: 'NoneType' object has no attribute 'text'
Could you help me, please?
data.find("div", class_="BNeawe") didnt return anything, so i believe google changed how it displays weather since you last ran this code successfully.
If you search for yourself 'Weather in {place}' then right click the weather widget and choose Inspect Element (browser dependent), you can look for yourself at where the data is in the page, and see which class the data is under.
It appears it was previously under the BNeawe class.
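As a side note, a minimal defensive sketch (assuming you keep scraping the search page) is to check for None before reading .text, so a missing element gives a clear message instead of an AttributeError:

el = data.find("div", class_="BNeawe")
if el is None:
    # element not found; the page layout or class name has changed
    speak("Sorry, I couldn't find the temperature on the page.")
else:
    speak(f"In {search} there is {el.text}")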
elif "temperature" in query or "temperatures" in query:
search = "Temperature in New York"
url = f"https://www.google.com/search?q={search}:"
r = requests.get(url)
data = BeautifulSoup(r.text, "html.parser")
temp = data.find("div", class_="BNeawe").text
speak(f"Currently, the temperature in your region is {temp}")
Try this one; you were experiencing your problem in line 5, which is '(r.text, "html.parser")'.
Try to avoid these comma/space mistakes in the code...
Best practice would be to use the Google / weather API directly. If you want to scrape, try to avoid selecting your elements by classes, because they are often dynamic.
Instead, focus on an id if possible, or use the HTML structure:
for p in list(soup.select_one('span:-soup-contains("weather.com")').parents):
    if '°' in p.text:
        print(p.next.get_text(strip=True))
        break
Example
from bs4 import BeautifulSoup
import requests

url = "https://www.google.com/search?q=temperature"

response = requests.get(url, headers={'User-Agent': 'Mozilla/5.0', 'Accept-Language': 'en-US,en;q=0.5'}, cookies={'CONSENT': 'YES+'})
soup = BeautifulSoup(response.text, 'html.parser')

for p in list(soup.select_one('span:-soup-contains("weather.com")').parents):
    if '°' in p.text:
        print(p.next.get_text(strip=True))
        break
I am fairly new to this and I am facing issues all around. Any help/guidance is really appreciated!
I have a dataframe in the following structure:
data:
LINK
<link_one>
<link_two>
<link_three>
The dataframe name is data and it has one column called LINK which contains few weblinks.
I am trying to take each link from the column LINK, do some scraping to return the text body contents of each link, and attach it to a column called CONTENT in the dataframe.
Here is what the outcome I am hoping for:
data:
LINK CONTENT
<link_one> <text_body_one>
<link_two> <text_body_two>
<link_three> <text_body_three>
This is what I have so far:
from bs4 import BeautifulSoup
import pandas as pd
import requests
import urllib3

urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)

data = pd.read_csv("~/Downloads/links.csv")

def body_content(val):
    url = val
    try:
        page = requests.get(url, verify=False).text
    except requests.ConnectionError:
        pass
    soup = BeautifulSoup(page, 'lxml')
    p_tags = soup.find_all('p')
    p_tags_text = [tag.get_text().strip() for tag in p_tags]
    sentence_list = [sentence for sentence in p_tags_text if not '\n' in sentence]
    sentence_list = [sentence for sentence in sentence_list if '.' in sentence]
    article = ' '.join(sentence_list)
    return article

data['CONTENT'] = zip(*data['LINK'].map(body_content))
While the function body_content works, I cannot get the contents to attach properly to the dataframe. I am getting the following error:
UnboundLocalError: local variable 'page' referenced before assignment
Thank you for your time!
The problem is probably that in the try/except part, the code goes to the except branch and thus never creates the variable page. You can do the following:
except requests.ConnectionError:
    return ''
So if it has a connection error, it will return an empty string.
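For context, a minimal sketch of body_content with that early return in place (everything else unchanged from the question); the plain .map() assignment at the end is my suggestion rather than part of the original code:

def body_content(val):
    url = val
    try:
        page = requests.get(url, verify=False).text
    except requests.ConnectionError:
        # bail out early so 'page' is never used uninitialised
        return ''
    soup = BeautifulSoup(page, 'lxml')
    p_tags = soup.find_all('p')
    p_tags_text = [tag.get_text().strip() for tag in p_tags]
    sentence_list = [sentence for sentence in p_tags_text if not '\n' in sentence]
    sentence_list = [sentence for sentence in sentence_list if '.' in sentence]
    return ' '.join(sentence_list)

# one string per link, assigned directly to the new column (no zip needed)
data['CONTENT'] = data['LINK'].map(body_content)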
I'm trying to snip an embedded JSON from a webpage and then pass the JSON object to json.loads(). The first URL is okay, but loading the second URL returns an error:
ValueError: Unterminated string starting at: line 1 column 2078 (char 2077)
Here is the code:
import requests, json
from bs4 import BeautifulSoup

urls = ['https://www.autotrader.co.uk/dealers/greater-manchester/manchester/williams-landrover-9994',
        'https://www.autotrader.co.uk/dealers/warwickshire/stratford-upon-avon/guy-salmon-land-rover-stratford-upon-avon-9965'
        ]

for url in urls:
    r = requests.get(url)
    soup = BeautifulSoup(r.content, 'lxml')
    scripts = soup.find_all('script')[0]
    data = scripts.text.split("window['AT_APOLLO_STATE'] = ")[1].split(';')[0]
    jdata = json.loads(data)
    print(jdata)
If you print out scripts.text.split("window['AT_APOLLO_STATE'] = ")[1], you will see the following, which includes a ; right after "and enthusiastic". So scripts.text.split("window['AT_APOLLO_STATE'] = ")[1].split(';')[0] gives you an invalid JSON string: the data ends at "and enthusiastic", which is not valid JSON.
"strapline":"In our state-of-the-art dealerships across the U.K, Sytner Group
represents the world’s most prestigious car manufacturers.
All of our staff are knowledgeable and enthusiastic; making every interaction
special by going the extra mile.",
The reason has been given. You could also use a regex to pull out the appropriate string:
import requests, json, re

urls = ['https://www.autotrader.co.uk/dealers/greater-manchester/manchester/williams-landrover-9994',
        'https://www.autotrader.co.uk/dealers/warwickshire/stratford-upon-avon/guy-salmon-land-rover-stratford-upon-avon-9965'
        ]

p = re.compile(r"window\['AT_APOLLO_STATE'\] =(.*?});", re.DOTALL)

for url in urls:
    r = requests.get(url)
    jdata = json.loads(p.findall(r.text)[0])
    print(jdata)
Missed a } in the original post.
I'm trying to parse Oxford Dictionary in order to obtain the etymology of a given word.
import lxml.html
from urllib.request import urlopen

class SkipException(Exception):
    def __init__(self, value):
        self.value = value

try:
    doc = lxml.html.parse(urlopen('https://en.oxforddictionaries.com/definition/%s' % "good"))
except SkipException:
    doc = ''

if doc:
    table = []
    trs = doc.xpath("//div[1]/div[2]/div/div/div/div[1]/section[5]/div/p")
I cannot seem to work out how to obtain the string of text I need. I know I am missing some lines of code beyond the ones I have copied, but I don't fully understand how HTML or lxml works. I would much appreciate it if someone could show me the correct way to solve this.
You don't want to do web scraping here, especially when practically every dictionary has an API. In the case of Oxford, create an account at https://developer.oxforddictionaries.com/. Get the API credentials from your account and do something like this:
import requests
import json

api_base = 'https://od-api.oxforddictionaries.com:443/api/v1/entries/{}/{}'
language = 'en'
word = 'parachute'

headers = {
    'app_id': '',
    'app_key': ''
}

url = api_base.format(language, word)
reply = requests.get(url, headers=headers)

if reply.ok:
    reply_dict = json.loads(reply.text)
    results = reply_dict.get('results')
    if results:
        headword = results[0]
        entries = headword.get('lexicalEntries')[0].get('entries')
        if entries:
            entry = entries[0]
            senses = entry.get('senses')
            if senses:
                sense = senses[0]
                print(sense.get('short_definitions'))
Here's a sample to get you started scraping Oxford dictionary pages:
import lxml.html as lh
from urllib.request import urlopen

url = 'https://en.oxforddictionaries.com/definition/parachute'
html = urlopen(url)
root = lh.parse(html)
body = root.find("body")
elements = body.xpath("//span[@class='ind']")

for element in elements:
    print(element.text)
To find the correct search string you need to format the html so you can see the structure. I used the html formatter at https://www.freeformatter.com/html-formatter.html. Looking at the formatted HTML, I could see the definitions were in the span elements with the 'ind' class attribute.