Invalid Argument in Open method for web scraping

Invalid Argument in Open method for web scraping - python

I am trying to scrape some data from the ancestry, I have a .net background but thought i'd try a bit of python for a project.
I'm falling at the first step, Firstly i am trying to open this page and then just print out the rows.
from requests import get
from requests.exceptions import RequestException
from contextlib import closing
from bs4 import BeautifulSoup
raw_html = open('https://www.ancestry.co.uk/search/collections/britisharmyservice/?
birth=_merthyr+tydfil-wales-united+kingdom_1651442').read()
html = BeautifulSoup(raw_html, 'html.parser')
for p in html.select('tblrow record'):
print(p)
I am getting an illegal argument on open.

According to documentation, open is used to:
Open [a] file and return a corresponding file object.
As such, you cannot use it for downloading the HTML contents of a webpage. You probably meant to use requests.get as follows:
raw_html = get('https://www.ancestry.co.uk/search/collections/britisharmyservice/?
birth=_merthyr+tydfil-wales-united+kingdom_1651442').text
# .text gets the raw text of the response
# (http://docs.python-requests.org/en/master/api/#requests.Response.text)
Here are a few recommendation to improve your code as well:
requests.get provides many useful parameters, one of them being params, which allows you to provide the URL parameters in the form of a Python dictionary.
If you need to verify whether the request was successful before accessing its text, then just check if the returned response.status_code == requests.codes.ok. This only covers status code 200, but if you need more codes, then response.raise_for_status should be helpful.

Related

scraping inside script tag with beautifulsoup

I'm scraping data from e-commerce site and I need model number of each laptops. But in div tags, there are no model numbers. I found model number inside script tag as "productCode". For this example its:
"productCode":"MGND3TU/A"
How can I gather the "productCode" data. I couldn't understand from other posts.
Edit: I find the ‘productCode’ inside script tag. But i don’t know how to get it. You can check from page source.

Since the JSON is hidden in the <head>, it can be parsed, but with some custom logic.
Unfortunately the script tags exports the JSON to a window var, so we'll need to remove that befor we can parse it.
Get url
Get all <script>
Check if PRODUCT_DETAIL_APP_INITIAL_STAT exist in the string (valid json)
Remove the prefix (hardcoded)
Find the index of the next key (hardcoded)
Remove after the suffix
Try to parse to json
Print json['product']['productCode'] if it exists
import json
from urllib.request import urlopen
from bs4 import BeautifulSoup
import requests
reqs = requests.get("https://www.trendyol.com/apple/macbook-air-13-m1-8gb-256gb-ssd-altin-p-67940132")
soup = BeautifulSoup(reqs.text, 'html.parser')
for sc in soup.findAll('script'):
if len(sc.contents) > 0 and "PRODUCT_DETAIL_APP_INITIAL_STAT" in sc.contents[0]:
withoutBegin = sc.contents[0][44:]
endIndex = withoutBegin.find('window.TYPageName=') - 1
withoutEnd = withoutBegin[:endIndex]
try:
j = json.loads(withoutEnd)
if j['product']['productCode']:
print(j['product']['productCode'])
except Exception as e:
print("Unable to parse JSON")
continue
Output:
MGND3TU/A

In this case beautifulsoup is not needed cause response could be searched directly with regex:
json.loads(re.search(r"window.__PRODUCT_DETAIL_APP_INITIAL_STATE__=({.*}});window", r).group(1))
Example
import requests, re, json
r = requests.get('https://www.trendyol.com/apple/macbook-air-13-m1-8gb-256gb-ssd-altin-p-67940132').text
json_data = json.loads(re.search(r"window.__PRODUCT_DETAIL_APP_INITIAL_STATE__=({.*}});window", r).group(1))
json_data['product']['productCode']
Output
MGND3TU/A

That's because those tags are generated using JavaScript. When you send a request to that URL, you will get back a response which has information for a JS script to build DOM for you. (technically JSON information):
To see what your returned response actually is, either print the value of r.text (r is returned from requests.get()) or manually see the "view page source" from the browser. (not inspect element section)
Now to solve it, you can either use something that can render JS, just like your browser. For example Selenium. requests module is not capable of rendering JS. It is just for sending and receiving requests.
Or manually extract that JSON text from the returned text (using Regex or,...) then create a Python dictionary from it.

Python Requests Library - Scraping separate JSON and HTML responses from POST request

I'm new to web scraping, programming, and StackOverflow, so I'll try to phrase things as clearly as I can.
I'm using the Python requests library to try to scrape some info from a local movie theatre chain. When I look at the Chrome developer tools response/preview tabs in the network section, I can see what appears to be very clean and useful JSON:
However, when I try to use requests to obtain this same info, instead I get the entire page content (pages upon pages of html). Upon further inspection of the cascade in the Chrome developer tools, I can see there are two events called GetNowPlayingByCity: One contains the JSON info while the other seems to be the HTML.
JSON Response
HTML Response
How can I separate the two and only obtain the JSON response using the Python requests library?
I have already tried modifying the headers within requests.post (the Chrome developer tools indicate this is a post method) to include "accept: application/json, text/plain, */*" but didn't see a difference in the response I was getting with requests.post. As it stands I can't parse any JSON from the response I get with requests.post and get the following error:
"json.decoder.JSONDecodeError: Expecting value: line 4 column 1 (char 3)"
I can always try to parse the full HTML, but it's so long and complex I would much rather work with friendly JSON info. Any help would be much appreciated!

This is probably because the javascript the page sends to your browser is making a request to an API to get the json info about the movies.
You could either try sending the request directly to their API (see edit 2), parse the html with a library like Beautiful Soup or you can use a dedicated scraping library in python. I've had great experiences with scrapy. It is much faster than requests
Edit:
If the page uses dynamically loaded content, which I think is the case, you'd have to use selenium with the PhantomJS browser instead of requests. here is an example:
from bs4 import BeautifulSoup
from selenium import webdriver
url = "your url"
browser = webdriver.PhantomJS()
browser.get(url)
html = browser.page_source
soup = BeautifulSoup(html, 'lxml')
# Then parse the html code here
Or you could load the dynamic content with scrapy
I recommend the latter if you want to get into scraping. It would take a bit more time to learn but it is a better solution.
Edit 2:
To make a request directly to their api you can just reproduce the request you see. Using google chrome, you can see the request if you click on it and go to 'Headers':
After that, you simply reproduce the request using the requests library:
import requests
import json
url = 'http://paste.the.url/?here='
response = requests.get(url)
content = response.content
# in my case content was byte string
# (it looks like b'data' instead of 'data' when you print it)
# if this is you case, convert it to string, like so
content_string = content.decode()
content_json = json.loads(content_string)
# do whatever you like with the data
You can modify the url as you see fit, for example if it is something like http://api.movies.com/?page=1&movietype=3 you could modify movietype=3 to movietype=2 to see a different type of movie, etc

How to loop through a vector of URLs and scrape some basic tags from each

I am trying to loop through a list of URLs and scrape some data from each link. Here is my code.
from bs4 import BeautifulSoup as bs
import webbrowser
import requests
url_list = ['https://corp-intranet.com/admin/graph?dag_id=emm1_daily_legacy',
'https://corp-intranet.com/admin/graph?dag_id=emm1_daily_legacy_history']
for link in url_list:
File = webbrowser.open(link)
File = requests.get(link)
data = File.text
soup = bs(data, "lxml")
tspans = soup.find_all("tspan")
tspans
I think this is pretty close, but I'm getting nothing for the 'tspans' variable. I get no error; 'tspans' just shows [].
This is an internal corporate intranet, so I can't share the exact details, but I think it's just a matter of grabbing all the HTML elements named 'tspans' and writing all of them to a text file or a CSV file. That's my ultimate goal. I want to collate everything into a large list and write it all to a file.
As an aside, I was going to use Selenium to log into this site, which requires creds, but it seem like the code I'm testing now allows you you open new tabs on a browser, and everything loads up fine, if you are already logged in. Is this the best practice, or should I use the full login creds + Selenium? I'm just trying to keep things simple.

Getting information inside script function from webpage

I am trying to get information from https://rosettacode.org/wiki/Category:Rascal and similar pages. The information that I am interested in is in the window on the right side on upper part of page that lists details of the language such as execution method, garbage collected etc. This information is contained in following line on the html source of the page:
<script type="8b5f853f8b614ed469e51514-">window.RLQ = window.RLQ || []; window.RLQ.push( function () {
mw.config.set({"wgCanonicalNamespace":"Category","wgCanonicalSpecialPageName":!1,"wgNamespaceNumber":14,"wgPageName":"Category:Rascal","wgTitle":"Rascal","wgCurRevisionId":137957,"wgRevisionId":137957,"wgArticleId":11663,"wgIsArticle":!0,"wgIsRedirect":!1,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],
"wgCategories":["Execution method/Interpreted","Garbage collection/Yes","Parameter passing/By value","Typing/Safe","Typing/Strong","Typing/Expression/Partially implicit","Typing/Checking/Dynamic","Impl needed","Programming Languages"],
"wgBreakFrames":!1,"wgPageContentLanguage":"en","wgPageContentModel":"wikitext","wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June","July","August","September","October","November","December"],"wgMonthNamesShort":["","Jan","Feb","Mar","Apr","May","Jun","Jul","Aug","Sep","Oct","Nov","Dec"],"wgRelevantPageName":"Category:Rascal"
,"wgRelevantArticleId":11663,"wgIsProbablyEditable":!0,"wgRestrictionEdit":[],"wgRestrictionMove":[],"sfgAutocompleteValues":[],"sfgAutocompleteOnAllChars":!1,"sfgFieldProperties":[],"sfgDependentFields":[],"sfgShowOnSelect":[],"sfgScriptPath":"/mw/extensions/SemanticForms","sdgDownArrowImage":"/mw/extensions/SemanticDrilldown/skins/down-arrow.png","sdgRightArrowImage":"/mw/extensions/SemanticDrilldown/skins/right-arrow.png"});mw.loader.implement("user.options",function($,jQuery){mw.user.options.set({"variant":"en"});});mw.loader.implement("user.tokens",function($,jQuery){mw.user.tokens.set({"editToken":"+\\","patrolToken":"+\\","watchToken":"+\\"});});mw.loader.load(["ext.smw.style","ext.smw.tooltips","mediawiki.page.startup","mediawiki.legacy.wikibits"]);
} );</script>
The main part is in "wgCategories" (shown in the middle of code above).
I have following code to get the page:
import requests, sys
lang_url = 'https://rosettacode.org/wiki/Category:Rascal'
rg = requests.get(lang_url)
if rg is None:
print("Could not obtain web page.")
sys.exit()
else: print("length of obtained page:", len(rg.text) )
from bs4 import BeautifulSoup
What function of BeautifulSoup can I use to get this information?
Edit: I checked about BeautifulSoup - I can get title, para by p and links by a and a['href'] and so on, but I cannot find a method to find and search inside a script function.

You can pass your requests object's content into the BeautifulSoup constructor, while specifying BeautifulSoup's HTML parser, html.parser, to get it in the correct format. Then, you can use BeautifulSoup's find_all() function, which has an element tag parameter and returns a list. See below:
import requests
r = requests.get('https://rosettacode.org/wiki/Category:Rascal')
from bs4 import BeautifulSoup as bs
soup = bs(r.content, 'html.parser')
print(soup.find_all('script'))
Another option is to use regex, if you're into that kind of thing.

It's not beautifulsoup, but you may want to use re for this, as html parsing will return the entire script block.
import re
wgcontent = re.findall('wgCategories":\[(.+?)]', rg.text)[0].replace('"', '').split(',')
this will return a list of:
Execution method/Interpreted
Garbage collection/Yes
Parameter passing/By value
Typing/Safe
Typing/Strong
Typing/Expression/Partially implicit
Typing/Checking/Dynamic
Impl needed
Programming Languages

urllib2 not returning full webpage

I'm just starting out in Python and I'm trying to request the html source code of a site using urllib2. However when I try and get the html content from a site I'm not getting the full html content - there are tags missing. I know they're missing as when I view the site in firebug the code shows up. Is this due to the way I'm requesting the data - or due to the site? If so is there a way in which I can get the full source code of the site in python, and then parse it?
Currently the code I'm using to request the content and the site I'm trying is:
import urllib2
url = 'http://marinetraffic.com/ais/'
response = urllib2.urlopen(url)
html = response.read()
print(html)
Specifically the content between the - div id="map_area" - is missing. Any help/pointers greatly appreciated!

You are getting incomplete data because most of the content on this page is dynamically generated via Javascript...

read on a descriptor returned by urlopen will only return what has already been downloaded. So you're liable to get a short read. You're better off using urllib.urlretrieve(), which tries to fetch the entire file, checks the Content-Length header, and raises an error if it fails.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Invalid Argument in Open method for web scraping - python

Related

scraping inside script tag with beautifulsoup

Python Requests Library - Scraping separate JSON and HTML responses from POST request

How to loop through a vector of URLs and scrape some basic tags from each

Getting information inside script function from webpage

urllib2 not returning full webpage

Categories

Resources