I tried to scrape the booking page for a movie on the BookMyShow site, using the requests library. My code is:
import requests
from bs4 import BeautifulSoup
# Create the session used below (s was never defined in the original snippet)
s = requests.Session()
rs = s.get("https://in.bookmyshow.com/serv/getData?cmd=GETSHOWINFO&vid=MOMV&ssid=1112")
print(rs.status_code)
rt = s.get("https://in.bookmyshow.com/serv/doSecureTrans.bms")
print(rt.status_code)
print(rs.text)
This is the code I am using to get the source of the page. Both the first and the second request return a 200 response, but when I print the source I get "invalid request" as output. What could be the error?
Related
I am trying to web scrape Instagram accounts to find out how many posts they have (visible on all public and private profiles). I am writing a script to do so.
I tried using the requests library to obtain the HTML code, but when running this below:
import requests
url = 'https://www.instagram.com/instagram/'
r = requests.get(url)
print(r)
print(r.text)
I keep getting the 429 Too Many Requests error even though I have sent fewer than 10 requests in total (I am running the script manually). I also tried following https://stackoverflow.com/questions/22786068/how-to-avoid-http-error-429-too-many-requests-python#:~:text=Writing%20this%20piece,your%20bot%200.1'%7D) and using requests.get(link, headers={'User-agent': 'your bot 0.1'}), but that does not work either.
I tried to get the title of a web page by scraping it with the Beautifulsoup4 Python module, and it returns the string "Not Acceptable!" as the title, but when I open the page in a browser the title is different. I also tried looping through a list of links and extracting the titles of all the pages, but it returns the same string "Not Acceptable!" for every link.
Here is the Python code:
from bs4 import BeautifulSoup
import requests
URL = 'https://insights.blackcoffer.com/how-is-login-logout-time-tracking-for-employees-in-office-done-by-ai/'
result = requests.get(URL)
doc = BeautifulSoup(result.text, 'html.parser')
tag = doc.title
print(tag.get_text())
The corresponding web page is the one at the URL used in the code above.
I don't know whether the problem is with Beautifulsoup4 or with the requests library. Is it because the site has bot protection enabled and does not return the HTML when it receives the request?
The server expects the User-Agent header. Interestingly, it is happy with any User-Agent, even a fictitious one:
result = requests.get(URL, headers = {'User-Agent': 'My User Agent 1.0'})
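If several pages are fetched in a loop, as in the question, it may be convenient to set the header once on a session instead of passing it to every call. The following is only a minimal sketch: the helper name make_session and the timeout value in the comment are assumptions for illustration, not part of the original answer. It builds the request without sending it, so the outgoing header can be checked offline.

```python
import requests

URL = "https://insights.blackcoffer.com/how-is-login-logout-time-tracking-for-employees-in-office-done-by-ai/"


def make_session(user_agent="My User Agent 1.0"):
    """Return a Session that sends the given User-Agent with every request.

    make_session is a hypothetical helper name used only for this sketch.
    """
    session = requests.Session()
    session.headers.update({"User-Agent": user_agent})
    return session


session = make_session()

# Build the request without sending it, to see which headers would go out.
prepared = session.prepare_request(requests.Request("GET", URL))
print(prepared.headers["User-Agent"])

# To actually fetch the page (needs network access; the timeout is an assumed value):
# result = session.get(URL, timeout=10)
# print(result.status_code)
```

Every session.get(...) call made through this session then carries the same User-Agent, so a loop over a list of links does not need to repeat the headers argument.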
An easy way to debug this kind of issue is to print (or write to a file) the response body, result.text. Some servers don't allow scraping, and some websites generate their HTML with JavaScript at runtime (e.g. YouTube). In both cases result.text can differ from the HTML we see in the browser. Here, the server returned the following text:
<head><title>Not Acceptable!</title></head><body><h1>Not Acceptable!</h1><p>An appropriate representation of the requested resource could not be found on this server. This error was generated by Mod_Security.</p></body></html>
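The debugging step described above (print the body or write it to a file) could be sketched roughly as follows. The helper name dump_response, the preview length, and the file name are assumptions made for this illustration; the sample body is the text quoted above, so the sketch runs without network access.

```python
import tempfile
from pathlib import Path

# The body the server returned in this thread, copied from the text above.
SAMPLE_BODY = (
    "<head><title>Not Acceptable!</title></head><body><h1>Not Acceptable!</h1>"
    "<p>An appropriate representation of the requested resource could not be "
    "found on this server. This error was generated by Mod_Security.</p></body></html>"
)


def dump_response(status_code, body, path, preview=60):
    """Save the full body to `path` and return a one-line summary for the console.

    dump_response is a hypothetical helper, not part of requests.
    """
    Path(path).write_text(body, encoding="utf-8")
    return f"status={status_code} length={len(body)} starts={body[:preview]!r}"


# A temporary directory keeps the demo from leaving files behind.
with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "response_dump.html"
    summary = dump_response(406, SAMPLE_BODY, target)
    saved = target.read_text(encoding="utf-8")

print(summary)
```

With a real response, the call would look like dump_response(result.status_code, result.text, "response_dump.html"), and the saved file can then be opened in a browser or editor to see what the server actually sent.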
Edit:
As pointed out by DYZ, this is a 406 error, and it occurred because the User-Agent header was missing from the request.
https://www.exai.com/blog/406-not-acceptable
The 406 Not Acceptable status code is a client-side error. It's part
of the HTTP response status codes in the 4xx category, which are
considered client error responses
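As a small illustration of the status-code categories mentioned in the quote, Python's standard library can map a numeric code to its reason phrase. The classify helper below is a hypothetical name introduced for this sketch, and it only distinguishes the ranges needed here.

```python
from http import HTTPStatus


def classify(status_code):
    """Return a coarse category for an HTTP status code (hypothetical helper)."""
    if 400 <= status_code < 500:
        return "client error"
    if 500 <= status_code < 600:
        return "server error"
    return "not an error"


status = HTTPStatus(406)
print(status.value, status.phrase, "->", classify(status))
```

In a scraper, checking result.status_code (or calling result.raise_for_status()) before parsing avoids treating an error page such as this one as if it were the real content.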
I am trying to use requests to get data from Twitter, but when I run my code I get this error: simplejson.errors.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
This is my code so far:
import requests
url = 'https://twitter.com/search?q=memes&src=typed_query'
results = requests.get(url)
better_results = results.json()
better_results['results'][1]['text'].encode('utf-8')
print(better_results)
This happens because you are making a request to a dynamic website.
When requesting a dynamic website, the HTML must be rendered first in order to receive all the content we expect.
Just making the request is not enough.
Libraries such as requests_html render the HTML and JavaScript in the background using a headless browser (Chromium).
You can try this code:
# pip install requests_html
from requests_html import HTMLSession
url = 'https://twitter.com/search?q=memes&src=typed_query'
session = HTMLSession()
response = session.get(url)
# rendering part
response.html.render(timeout=20)
# the rendered page is HTML, not JSON, so read it through response.html
# instead of calling response.json()
print(response.html.text)
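The "Expecting value: line 1 column 1 (char 0)" message means the body did not start with a JSON value, which is what happens when .json() is called on an HTML page. One way to guard against this, sketched here under the assumption that the server labels its responses with an accurate Content-Type header, is to check that header before parsing. The function name is_json_content_type is hypothetical.

```python
def is_json_content_type(content_type):
    """Return True if a Content-Type header value denotes JSON.

    Parameters such as "; charset=utf-8" are ignored, and structured
    suffixes such as "application/problem+json" are accepted.
    """
    media_type = content_type.split(";", 1)[0].strip().lower()
    return media_type == "application/json" or media_type.endswith("+json")


print(is_json_content_type("application/json; charset=utf-8"))
print(is_json_content_type("text/html; charset=utf-8"))
```

With requests, the header value would come from results.headers.get("Content-Type", ""), and results.json() would only be called when the check passes.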
I have been through a few web scraping tutorials and am now trying a basic API scraper.
This is my code:
from bs4 import BeautifulSoup
import requests
url = 'https://qships.tmr.qld.gov.au/webx/services/wxdata.svc/GetDataX'
response = requests.get(url, timeout=5)
content = BeautifulSoup(response.content, "html.parser")
print (content)
It comes up with "method not allowed" :(
I'm still learning, so any advice will be well received.
cheers
This is clearly a problem with your URL: the service does not allow you to retrieve information from it this way. You can check the following URL, though, where the steps for retrieving the metadata are described.
https://qships.tmr.qld.gov.au/webx/services/wxdata.svc
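"Method not allowed" is the reason phrase of HTTP status 405, and a 405 response normally carries an Allow header listing the methods the endpoint accepts. Whether this particular service sends that header, and which method it expects, is not known from this thread, so the following is only a sketch with made-up header values. The helper name allowed_methods is hypothetical.

```python
def allowed_methods(headers):
    """Parse an Allow header into a list of upper-case HTTP method names.

    `headers` is any mapping with a .get method; an absent header gives [].
    """
    allow = headers.get("Allow", "")
    return [method.strip().upper() for method in allow.split(",") if method.strip()]


# Made-up example headers, standing in for response.headers from requests.
example_headers = {"Allow": "POST, OPTIONS"}
print(allowed_methods(example_headers))
```

With a real response, allowed_methods(response.headers) can be called after seeing response.status_code == 405; the headers object in requests is case-insensitive, so the lookup works regardless of how the server capitalizes the header name.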
I am testing the Python library requests to see if it is suitable for my work. Here is my sample code for reference:
import requests
url = "http://www.genenetwork.org/webqtl/main.py?cmd=sch&gene=Grin2b&tissue=hip&format=text"
print(url)
print(requests.get(url))
My Output:
http://www.genenetwork.org/webqtl/main.py?cmd=sch&gene=Grin2b&tissue=hip&format=text
<Response [200]>
Output that I get from my browser & my expected result:
What causes the difference? How can I get my expected result? I want to process the data inside the web page.
Your code is currently printing the Response object itself, which only shows the status code of your GET request. You can access the content of the response via the text attribute of the Response object returned by the get method.
import requests
r = requests.get("http://www.genenetwork.org/webqtl/main.py?cmd=sch&gene=Grin2b&tissue=hip&format=text")
print(r.text)