Python requests - using twitter search - python

I am trying to use requests to get data from Twitter, but when I run my code I get this error: simplejson.errors.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
This is my code so far:
import requests
url = 'https://twitter.com/search?q=memes&src=typed_query'
results = requests.get(url)
better_results = results.json()
better_results['results'][1]['text'].encode('utf-8')
print(better_results)

This happens because you are making a request to a dynamic website. The search URL returns an HTML page rather than JSON, which is why results.json() fails with a JSONDecodeError.
When making a request to a dynamic website we must render the HTML first in order to receive all the content we were expecting; just making the request is not enough.
Other libraries, such as requests_html, render the HTML and JavaScript in the background using a lightweight browser.
You can try this code:
# pip install requests_html
from requests_html import HTMLSession

url = 'https://twitter.com/search?q=memes&src=typed_query'
session = HTMLSession()
response = session.get(url)
# rendering part: runs the page's JavaScript in a lightweight headless browser
response.html.render(timeout=20)
# the rendered result is HTML, not JSON, so read it from response.html
print(response.html.html)
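If you do want to stay with plain requests, here is a minimal sketch of guarding the .json() call by checking the Content-Type header first. This only avoids the exception; the search page itself still comes back as HTML:
import requests

url = 'https://twitter.com/search?q=memes&src=typed_query'
results = requests.get(url)

# only parse as JSON when the server actually says it is JSON
if 'application/json' in results.headers.get('Content-Type', ''):
    print(results.json())
else:
    print('Not JSON, got:', results.headers.get('Content-Type'))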

Related

Web Scraping with Requests - Python

I have been struggling to do web scraping with the code below and it shows me null records. If we print the output data, it doesn't show the requested output. This is the website I am scraping: https://coinmarketcap.com/. There are several pages which need to be taken into the data frame (64 pages).
import requests
import pandas as pd
url = "https://api.coinmarketcap.com/data-api/v3/topsearch/rank"
req = requests.post(url)
main_data = req.json()
Can anyone help me sort this out?
Instead of using a POST request, use GET in the request call and it will work!
import requests
res = requests.get("https://api.coinmarketcap.com/data-api/v3/topsearch/rank")
main_data = res.json()
data = main_data['data']['cryptoTopSearchRanks']
With all pages: you can find this URL from the browser's Network tab. Filter by XHR and reload, then go to the second page; the request URL will appear in the XHR tab. You can copy it and make a call to it directly (I have shortened the URL here):
res=requests.get("https://coinmarketcap.com/")
soup=BeautifulSoup(res.text,"html.parser")
last_page=soup.find_all("p",class_="sc-1eb5slv-0 hykWbK")[-1].get_text().split(" ")[-1]
res=requests.get(f"https://api.coinmarketcap.com/data-api/v3/cryptocurrency/listing?start=1&limit={last_page}&sortBy=market_cap&sortType=desc&convert=USD,BTC,ETH&cryptoType=all&tagType=all&audited=false&aux=ath")
Now use the json method:
data = res.json()['data']['cryptoCurrencyList']
print(len(data))
Output:
6304
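Since the question already imports pandas with the goal of building a data frame, here is a minimal sketch of loading that list into a DataFrame (assuming data is the cryptoCurrencyList returned above):
import pandas as pd

df = pd.json_normalize(data)  # flatten the nested JSON records into columns
print(df.shape)
print(df.head())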
For getting/reading the data you need to use the GET method, not POST:
import requests
import pandas as pd
import json
url = "https://api.coinmarketcap.com/data-api/v3/topsearch/rank"
req = requests.get(url)
main_data = req.json()
print(main_data) # without pretty printing
pretty_json = json.loads(req.text)
print(json.dumps(pretty_json, indent=4)) # with pretty print
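A small simplification: req.json() already returns the parsed Python object, so you can pass it straight to json.dumps instead of re-parsing req.text:
print(json.dumps(main_data, indent=4))  # same pretty output without the extra json.loads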
Their terms of use prohibit web scraping. The site provides a well-documented API that has a free tier. Register and get an API token:
from requests import Session

url = 'https://pro-api.coinmarketcap.com/v1/cryptocurrency/listings/latest'
parameters = {
    'start': '1',
    'limit': '5000',
    'convert': 'USD'
}
headers = {
    'Accepts': 'application/json',
    'X-CMC_PRO_API_KEY': HIDDEN_TOKEN,  # replace that with your API key
}

session = Session()
session.headers.update(headers)
response = session.get(url, params=parameters)
data = response.json()
print(data)

Method not allowed first API

I have been through a few web scraping tutorials and am now trying a basic API scraper.
This is my code:
from bs4 import BeautifulSoup
import requests
url = 'https://qships.tmr.qld.gov.au/webx/services/wxdata.svc/GetDataX'
response = requests.get(url, timeout=5)
content = BeautifulSoup(response.content, "html.parser")
print (content)
It comes up with "method not allowed" :(
I'm still learning, so any advice will be well received.
Cheers
It is a problem with how you are calling that URL: the service doesn't allow retrieving the information this way. But you can check this URL, where the steps for retrieving the metadata are described:
https://qships.tmr.qld.gov.au/webx/services/wxdata.svc
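"Method not allowed" is HTTP status 405, which usually means the endpoint expects a different HTTP verb than the one you sent. Here is a minimal sketch of retrying the same endpoint with POST; the empty JSON body is only a placeholder, and the real request body would need to be copied from the browser's Network tab:
import requests

url = 'https://qships.tmr.qld.gov.au/webx/services/wxdata.svc/GetDataX'
# placeholder body; inspect the site's own XHR call to see what it actually sends
response = requests.post(url, json={}, timeout=5)
print(response.status_code)
print(response.text)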

urlopen of urllib.request cannot open a page in python 3.7

I want to write a web scraper to collect titles of articles from the Medium.com webpage.
I am trying to write a Python script that will scrape headlines from the Medium.com website. I am using Python 3.7 and imported urlopen from urllib.request.
But it cannot open the site and shows an
"urllib.error.HTTPError: HTTP Error 403: Forbidden" error.
from bs4 import BeautifulSoup
from urllib.request import urlopen
webAdd = urlopen("https://medium.com/")
bsObj = BeautifulSoup(webAdd.read())
Result: urllib.error.HTTPError: HTTP Error 403: Forbidden
The expected result is that it will not show any error and just read the website.
But this error does not happen when I use the requests module.
import requests
from bs4 import BeautifulSoup
url = 'https://medium.com/'
response = requests.get(url, timeout=5)
This time it works without error.
Why?
Urllib is a pretty old and small module. For web scraping, the requests module is recommended.
You can check out this answer for additional information.
Many sites nowadays check which user agent a request is coming from, to try to deter bots. requests is the better module to use, but if you really want to use urllib, you can alter the header text to pretend to be Firefox or something else, so that it is not blocked. A quick example can be found here:
https://stackoverflow.com/a/16187955
import urllib.request
user_agent = 'Mozilla/5.0 (platform; rv:geckoversion) Gecko/geckotrail Firefox/firefoxversion'
url = "http://example.com"
request = urllib.request.Request(url)
request.add_header('User-Agent', user_agent)
response = urllib.request.urlopen(request)
You will need to alter the user_agent string with the appropriate versions of things too. Hope this helps.
This worked for me:
from urllib.request import urlopen

html = urlopen(MY_URL)  # MY_URL is a placeholder for the page you want to fetch
contents = html.read()
print(contents)
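Putting the two answers together, here is a minimal sketch of the original goal: fetching Medium's front page with requests, an explicit User-Agent, and BeautifulSoup. The h2 selector is only a guess and may need adjusting for Medium's current markup:
import requests
from bs4 import BeautifulSoup

headers = {'User-Agent': 'Mozilla/5.0'}
response = requests.get('https://medium.com/', headers=headers, timeout=5)
soup = BeautifulSoup(response.text, 'html.parser')

# print whatever headline-like elements are present (h2 is an assumption)
for heading in soup.find_all('h2'):
    print(heading.get_text(strip=True))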

Not able to get correct response using requests.session()

I tried to scrape the booking page of a movie on the BookMyShow site. I am using the requests library for this. My code is:
import requests
from bs4 import BeautifulSoup

s = requests.session()  # the session object used as s below
rs = s.get("https://in.bookmyshow.com/serv/getData?cmd=GETSHOWINFO&vid=MOMV&ssid=1112")
print(rs.status_code)
rt = s.get("https://in.bookmyshow.com/serv/doSecureTrans.bms")
print(rt.status_code)
print(rs.text)
This is the code I am using for getting the source code of the page. For the first page I am getting a 200 response, and the second page also gives a 200 response. But when I print the source code I get "invalid request" as output. What could be the error?

Urllib request throws a decode error when parsing from url

I'm trying to parse the JSON-formatted data from this URL: http://ws-old.parlament.ch/sessions?format=json. My browser copes nicely with the JSON data, but my request always throws the following error:
JSONDecodeError: Expecting value: line 3 column 1 (char 4)
I'm using Python 3.5. And this is my code:
import json
import urllib.request
connection = urllib.request.urlopen('http://ws-old.parlament.ch/affairs/20080062?format=json')
js = connection.read()
info = json.loads(js.decode("utf-8"))
print(info)
The site uses User-Agent filtering and only serves the JSON to known browsers. Luckily it is easily fooled; just set the User-Agent header to Mozilla:
import json
import urllib.request

request = urllib.request.Request(
    'http://ws-old.parlament.ch/affairs/20080062?format=json',
    headers={'User-Agent': 'Mozilla'})
connection = urllib.request.urlopen(request)
js = connection.read()
info = json.loads(js.decode("utf-8"))
print(info)
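For comparison, here is a sketch of the same call with the requests library, which lets you pass the header directly (same assumption: the server only needs a browser-like User-Agent):
import requests

response = requests.get(
    'http://ws-old.parlament.ch/affairs/20080062?format=json',
    headers={'User-Agent': 'Mozilla'})
info = response.json()
print(info)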
