I can't parse the transcript of a video from https://www.ted.com/talks/alexis_nikole_nelson_a_flavorful_field_guide_to_foraging/transcript
The response that requests returns doesn't contain the span class where the text actually is. What could be the problem?
import requests
url = 'https://www.ted.com/talks/alexis_nikole_nelson_a_flavorful_field_guide_to_foraging/transcript'
page = requests.get(url)
print(page.content)
Is there any way to reach the transcript? Thank you.
That's because the transcript is not loaded via the link you're using, but via a separate call to TED's GraphQL endpoint.
Using curl, you can fetch the data like so:
curl 'https://www.ted.com/graphql?operationName=Transcript&variables=%7B%22id%22%3A%22alexis_nikole_nelson_a_flavorful_field_guide_to_foraging%22%2C%22language%22%3A%22en%22%7D&extensions=%7B%22persistedQuery%22%3A%7B%22version%22%3A1%2C%22sha256Hash%22%3A%2218f8e983b84c734317ae9388c83a13bc98702921b141c2124b3ce4aeb6c48ef6%22%7D%7D' -H 'User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:99.0) Gecko/20100101 Firefox/99.0' -H 'Accept: */*' -H 'Accept-Language: en-US,en;q=0.5' -H 'Accept-Encoding: gzip, deflate, br' -H 'Referer: https://www.ted.com/talks/alexis_nikole_nelson_a_flavorful_field_guide_to_foraging/transcript' -H 'content-type: application/json' -H 'client-id: Zenith production' -H 'x-operation-name: Transcript' --output - | gzip -d
Note that the URL is percent-encoded. In Python you can use quote() from urllib.parse to URL-encode a string.
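For example, here's a minimal sketch of building that same GraphQL URL in Python instead of hard-coding the encoded string. The persisted-query hash is copied from the curl command above and only works for as long as TED keeps that query registered:
import json
from urllib.parse import urlencode

talk_id = "alexis_nikole_nelson_a_flavorful_field_guide_to_foraging"
params = {
    "operationName": "Transcript",
    # compact separators so the encoded JSON matches the curl URL exactly
    "variables": json.dumps({"id": talk_id, "language": "en"}, separators=(",", ":")),
    "extensions": json.dumps({
        "persistedQuery": {
            "version": 1,
            "sha256Hash": "18f8e983b84c734317ae9388c83a13bc98702921b141c2124b3ce4aeb6c48ef6",
        }
    }, separators=(",", ":")),
}
url = "https://www.ted.com/graphql?" + urlencode(params)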
So simply translate the above curl command to Python. There's no magic; you just need to set the correct headers. If you're lazy, you can also use an online curl-to-Python converter.
This produces:
import requests
from requests.structures import CaseInsensitiveDict
url = "https://www.ted.com/graphql?operationName=Transcript&variables=%7B%22id%22%3A%22alexis_nikole_nelson_a_flavorful_field_guide_to_foraging%22%2C%22language%22%3A%22en%22%7D&extensions=%7B%22persistedQuery%22%3A%7B%22version%22%3A1%2C%22sha256Hash%22%3A%2218f8e983b84c734317ae9388c83a13bc98702921b141c2124b3ce4aeb6c48ef6%22%7D%7D"
headers = CaseInsensitiveDict()
headers["User-Agent"] = "Mozilla/5.0 (X11; Linux x86_64; rv:99.0) Gecko/20100101 Firefox/99.0"
headers["Accept"] = "*/*"
headers["Accept-Language"] = "en-US,en;q=0.5"
headers["Accept-Encoding"] = "gzip, deflate, br"
headers["Referer"] = "https://www.ted.com/talks/alexis_nikole_nelson_a_flavorful_field_guide_to_foraging/transcript"
headers["content-type"] = "application/json"
headers["client-id"] = "Zenith production"
headers["x-operation-name"] = "Transcript"
resp = requests.get(url, headers=headers)
print(resp.content)
Output:
b'{"data":{"translation":{"id":"209255","language" ...
Related
The site that I am trying to scrape returns a 302 redirect, as observed in the browser's network tools. I've copied the network request as curl and it works fine, but when I convert it to Python requests it just returns 200.
Curl:
curl -v 'https://my-site.com/CardAuthentication.aspx' \
-H 'Connection: keep-alive' \
-H 'Pragma: no-cache' \
-H 'Cache-Control: no-cache' \
-H 'Upgrade-Insecure-Requests: 1' \
-H 'Origin: https://my-site.com' \
-H 'Content-Type: multipart/form-data; boundary=----WebKitFormBoundaryBN2XwBbcAwvRWZzk' \
-H 'User-Agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.45 Safari/537.36' \
-H 'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9' \
-H 'Sec-GPC: 1' \
-H 'Sec-Fetch-Site: same-origin' \
-H 'Sec-Fetch-Mode: navigate' \
-H 'Sec-Fetch-User: ?1' \
-H 'Sec-Fetch-Dest: document' \
-H 'Referer: https://my-site.com/CardAuthentication.aspx' \
-H 'Accept-Language: en-US,en;q=0.9' \
-H 'Cookie: ASP.NET_SessionId=...; TS0192fa71=...; ServiceProvider=UID=...; .ASPXAUTH=...' \
--data-raw $'------WebKitFormBoundaryBN2XwBbcAwvRWZzk\r\n DATA \r\n' \
--compressed
This returns:
* Trying IP:443...
........
........
< HTTP/1.1 302 Found
< Cache-Control: no-cache
< Pragma: no-cache
< Content-Type: text/html; charset=utf-8
< Expires: -1
< Location: my-site.com/MemberDetails.aspx
< Date: Mon, 07 Feb 2022 09:54:49 GMT
< Content-Length: 61211
< Set-Cookie: TS0192fa71=...; Path=/; Domain=.my-site.com; Secure
Python:
import requests
import logging
logging.basicConfig(level=logging.DEBUG)
cookies = {
    "ASP.NET_SessionId": ".....",
    "TS0192fa71": ".....",
    "ServiceProvider": ".....",
    ".ASPXAUTH": ".....",
}
headers = {
    ......
    "Content-Type": "multipart/form-data; boundary=----WebKitFormBoundaryBN2XwBbcAwvRWZzk",
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.45 Safari/537.36",
    "Referer": "https://my-site.com/CardAuthentication.aspx",
    "Accept-Language": "en-US,en;q=0.9",
}
data = {
    "------WebKitFormBoundaryBN2XwBbcAwvRWZzk\r\nContent-Disposition: form-data; name": '"__EVENTTARGET"\r\n\r\n\r\n-- DATA'
}
response = requests.post(
    "https://my-site.com/CardAuthentication.aspx",
    headers=headers,
    cookies=cookies,
    data=data,
)
And the return code is 200, with the response history empty.
Is there something wrong with the requests library, or with the way it processes the data? How can I solve this?
This is default requests behavior: redirects are followed automatically. You need to set allow_redirects to False if you don't want it to follow a 3xx response code. Example:
import requests
r = requests.get("http://github.com", allow_redirects=False)
print(r.status_code) # 301
If you want to know more, read the requests.request docs.
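Applied to your POST (same placeholders as in your code), a minimal sketch:
# Don't follow the redirect; surface the 302 and its Location header directly.
response = requests.post(
    "https://my-site.com/CardAuthentication.aspx",
    headers=headers,
    cookies=cookies,
    data=data,
    allow_redirects=False,
)
print(response.status_code)              # expect 302 if the request succeeded
print(response.headers.get("Location"))  # where the server redirects to
Also note that when redirects are left enabled, any 302 that was followed shows up in response.history. An empty history means the server genuinely answered 200, which would point at the request body (e.g. the hand-built multipart data) rather than at redirect handling.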
There is a website which I need to scrape. It has a long list of available job positions that are folded by default and unfold when a user clicks on them. When a user unfolds one, the page sends a POST request to the website with the position id.
I tried to imitate this request (see code below). It doesn't fail (status == 200) but doesn't return anything. I suspect that is because of CORS. Is there any way to still collect the data?
import requests

url = "https://econjobmarket.org/positions/recordClick"
payload = 'posid=7026'
headers = {
    'Accept': '*/*',
    'X-CSRF-TOKEN': HERE_GOES_THE_TOKEN,
    'X-Requested-With': 'XMLHttpRequest',
    'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8',
    'Cookie': HERE_GOES_THE_COOKIE
}
response = requests.post(url, headers=headers, data=payload)
print(response.text.encode('utf8'))
I don't see any additional requests sent to get the expanded data. All the data (both in folded and expanded states) is already in the page source:
response = requests.get('https://econjobmarket.org/positions').text
print("Post-Doc, Computational Marketing" in response)
# True
The recordClick URL you are seeing is simply for recording the click for web analytics. As Parolla said, what you are looking for is already in the page source. Your best bet is to do an HTTP GET on the website and parse the HTML with BeautifulSoup.
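A minimal sketch of that approach; the link filter below is an assumption about the markup, so inspect the page source and adjust the selector to match:
import requests
from bs4 import BeautifulSoup

# Fetch the listings page and parse it; all positions are already in this HTML.
html = requests.get("https://econjobmarket.org/positions").text
soup = BeautifulSoup(html, "html.parser")

# e.g. collect every link that points at an individual position page
for link in soup.find_all("a", href=True):
    if "/positions/" in link["href"]:
        print(link.get_text(strip=True), link["href"])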
You can reduce the site's ability to track you, and potentially to block your scraping, by dropping the token and cookies from the request headers.
A quick test in curl shows the response is still complete without them.
curl -i -s -k -X $'GET' \
-H $'Host: econjobmarket.org' -H $'Connection: close' -H $'Cache-Control: max-age=0' -H $'DNT: 1' -H $'Upgrade-Insecure-Requests: 1' -H $'User-Agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.101 Safari/537.36' -H $'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9' -H $'Sec-GPC: 1' -H $'Sec-Fetch-Site: cross-site' -H $'Sec-Fetch-Mode: navigate' -H $'Sec-Fetch-User: ?1' -H $'Sec-Fetch-Dest: document' -H $'Accept-Encoding: gzip, deflate' -H $'Accept-Language: en-GB,en-US;q=0.9,en;q=0.8' \
$'https://econjobmarket.org/positions'
J and Parolla are correct that the POST is just recording your actions on the website.
So I'm trying to consume this API. I got this URL: http://www.ventamovil.com.mx:9092/service.asmx?op=Check_Balance
There you can enter {"User":"6144135400","Password":"Prueba$$"} in the input field and you get a response (screenshot: https://i.stack.imgur.com/RTEii.png).
But when I try to consume this API in Python I just can't; I don't know exactly how to do it correctly. My code (posted as a screenshot) gets a different response; I should be getting the same response as in the "Response" image.
To save yourself some time, you can use their own request to build the Python code automatically. All you have to do is:
Just as you did at first, enter the JSON in the input field and invoke the method.
Then open the network tab and copy, as curl, the POST request the page made:
curl 'http://www.ventamovil.com.mx:9092/service.asmx/Check_Balance' -H 'Connection: keep-alive' -H 'Cache-Control: max-age=0' -H 'Upgrade-Insecure-Requests: 1' -H 'User-Agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.122 Safari/537.36' -H 'Origin: http://www.ventamovil.com.mx:9092' -H 'Content-Type: application/x-www-form-urlencoded' -H 'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9' -H 'Referer: http://www.ventamovil.com.mx:9092/service.asmx?op=Check_Balance' -H 'Accept-Language: en-US,en;q=0.9,ar;q=0.8,pt;q=0.7' --data 'jrquest=%7B%22User%22%3A6144135400%2C+%22Password%22%3A+%22Prueba%24%24%22%7D' --compressed --insecure
Go to Postman and import the curl command, then click Code and select Python, and there you go: you have all the right headers you need.
import requests

url = "http://www.ventamovil.com.mx:9092/service.asmx/Check_Balance"
payload = 'jrquest=%7B%22User%22%3A6144135400%2C+%22Password%22%3A+%22Prueba%24%24%22%7D'
headers = {
    'Upgrade-Insecure-Requests': '1',
    'Content-Type': 'application/x-www-form-urlencoded',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
}
response = requests.post(url, headers=headers, data=payload)
print(response.text.encode('utf8'))
As you can see, they accept their input as a form-encoded payload. You need to parameterize this request with the user/password you want each time you call it (see the sketch after the sample output below).
By the way, the output of this Python code is:
b'<?xml version="1.0" encoding="utf-8"?>\r\n<string xmlns="http://www.ventamovil.com.mx/ws/">{"Confirmation":"00","Saldo_Inicial":"10000","Compras":"9360","Ventas":"8416","Comision":"469","Balance":"10345.92"}</string>'
How to convert this curl request into a POST request compatible with the Python requests module?
curl 'http://sss.com' -H 'User-Agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.76 Safari/537.36' --data 'aq=%40syssource%3DProFind%20AND%20NOT%20%40title%3DCoveo%20AND%20NOT%20%40title%3Derror&searchHub=ProFind&xxx=yyy&xxx=yyy&xxx=yyy=10&xxx=yyy' --compressed
I am looking at the requests module docs here:
http://docs.python-requests.org/en/master/user/quickstart/
But they only show the data value as key/value pairs:
r = requests.post('http://httpbin.org/post', data={'key': 'value'})
So how could I convert the above curl POST request to the Python requests module?
The documentation you linked says:
There are many times that you want to send data that is not form-encoded. If you pass in a string instead of a dict, that data will be posted directly.
So just use
r = requests.post('http://sss.com', data = 'aq=%40syssource%3DProFind%20AND%20NOT%20%40title%3DCoveo%20AND%20NOT%20%40title%3Derror&searchHub=ProFind&xxx=yyy&xxx=yyy&xxx=yyy=10&xxx=yyy')
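If the endpoint also cares about the headers in your curl command, pass them the same way; here's the same call with the captured User-Agent (whether the server actually requires it is an assumption worth testing):
import requests

r = requests.post(
    'http://sss.com',
    data='aq=%40syssource%3DProFind%20AND%20NOT%20%40title%3DCoveo%20AND%20NOT%20%40title%3Derror&searchHub=ProFind&xxx=yyy&xxx=yyy&xxx=yyy=10&xxx=yyy',
    # User-Agent copied from the curl command above
    headers={'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.76 Safari/537.36'},
)
print(r.status_code)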
Why does requests not download a response for this webpage?
#!/usr/bin/python
import requests

headers = {
    'content-type': 'application/x-www-form-urlencoded; charset=UTF-8',
    'Accept-Encoding': 'gzip, deflate',
    'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:27.0) Gecko/20100101 Firefox/27.0',
    'Referer': 'http://sportsbeta.ladbrokes.com/football',
}
payload = {
    'N': '4294966750',
    'facetCount_156%23327': '12',
    'facetCount_157%23325': '8',
    'form-trigger': 'moreId',
    'moreId': '156%23327',
    'pageId': 'p_football_home_page',
    'pageType': 'EventClass',
    'type': 'ajaxrequest',
}
url = 'http://sportsbeta.ladbrokes.com/view/EventDetailPageComponentController'
r = requests.post(url, data=payload, headers=headers)
These are the parameters of the POST that I see in Firebug, and there the response received back contains a list of football leagues, yet when I run my Python script like this I get nothing back.
(You can see the request in Firefox by clicking See All in the competitions section of the left-hand nav bar and looking at the XHR in Firebug. The Firebug response shows the HTML body as expected.)
Anyone any ideas? Could my handling of the % symbols in the payload be causing any trouble?
EDIT: Attempt using session
from requests import Session

# turn a raw POST string into a dict:
def parsePOSTstring(POSTstr):
    paramList = POSTstr.split('&')
    paramDict = dict([param.split('=') for param in paramList])
    return paramDict

headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:27.0) Gecko/20100101 Firefox/27.0',
    'Referer': 'http://sportsbeta.ladbrokes.com/football',
}

# prep the data (POSTstr copied from Firebug raw source)
POSTstr = ("moreId=156%23327&facetCount_156%23327=12&event=&N=4294966750&pageType=EventClass&"
           "pageId=p_football_home_page&type=ajaxrequest&eventIDNav=&removedSelectionNav=&"
           "currentSelectedId=&form-trigger=moreId")
payload = parsePOSTstring(POSTstr)

# end url
url = 'http://sportsbeta.ladbrokes.com/view/EventDetailPageComponentController'

# start a session to manage cookies, and visit the football page first so the referer agrees
s = Session()
s.get('http://sportsbeta.ladbrokes.com/football')

# now visit the desired url with headers/data
r = s.post(url, data=payload, headers=headers)

# print output
print(r.text)  # this is empty
Working curl
curl 'http://sportsbeta.ladbrokes.com/view/EventDetailPageComponentController' \
  -H 'Cookie: JSESSIONID=DE93158F07E02DD3CC1CC32B1AA24A9E.ecomprodsw015; geoCode=FRA; FLAGS=en|en|uk|default|ODDS|0|GBP; ECOM_BETA_SPORTS=1; PLAYED=4%7C0%7C0%7C0%7C0%7C0%7C0' \
  -H 'Referer: http://sportsbeta.ladbrokes.com/football' \
  -H 'User-Agent: Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:27.0) Gecko/20100101 Firefox/27.0' \
  --data 'facetCount_157%23325=8&moreId=156%23327&facetCount_156%23327=12&event=&N=4294966750&pageType=EventClass&pageId=p_football_home_page&type=ajaxrequest&eventIDNav=&removedSelectionNav=&currentSelectedId=&form-trigger=moreId' \
  --compressed
Yet this curl works.
Here's the smallest working example that I can come up with:
from requests import Session

session = Session()

# HEAD requests ask for *just* the headers, which is all you need to grab the
# session cookie
session.head('http://sportsbeta.ladbrokes.com/football')

response = session.post(
    url='http://sportsbeta.ladbrokes.com/view/EventDetailPageComponentController',
    data={
        'N': '4294966750',
        'form-trigger': 'moreId',
        'moreId': '156#327',
        'pageType': 'EventClass',
    },
    headers={
        'Referer': 'http://sportsbeta.ladbrokes.com/football',
    },
)

print(response.text)
You just weren't decoding the percent-encoded POST data properly, so # was being represented as %23 in the actual POST data (e.g. 156%23327 should've been 156#327).
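If you'd rather keep the copy-the-raw-POST-string workflow from the question's edit, the standard library does the percent-decoding for you (urllib.parse in Python 3, urlparse in Python 2); a sketch:
from urllib.parse import parse_qsl

# parse_qsl splits on & and =, and percent-decodes values, so %23 becomes '#'.
# keep_blank_values preserves empty fields like event= from the raw string.
POSTstr = "moreId=156%23327&facetCount_156%23327=12&N=4294966750&form-trigger=moreId"
payload = dict(parse_qsl(POSTstr, keep_blank_values=True))
print(payload['moreId'])  # 156#327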