Web scraping in Python returns "None"

I'm trying to scrape something from a site using Python, for example the view count of this video (the URL below), but it always returns "None". What am I doing wrong? Here is the code:
from bs4 import BeautifulSoup
import requests
url = 'https://www.youtube.com/watch?v=1OfK8UmLMl0&ab_channel=HitraNtheUnnecessaryProgrammer'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'lxml')
views = soup.body.find(class_='view-count style-scope ytd-video-view-count-renderer')
print(views)
Thanks!
(btw when I try the code shown in the video it works fine)

The page is loaded dynamically, and requests doesn't support dynamically loaded pages. However, the data is available in JSON format embedded in the page source, so you can use the re and json modules to get the correct data.
For example, to get the view count:
import re
import json
import requests
from bs4 import BeautifulSoup

url = "https://www.youtube.com/watch?v=1OfK8UmLMl0&ab_channel=HitraNtheUnnecessaryProgrammer"
soup = BeautifulSoup(requests.get(url).text, "html.parser")

# Locate the embedded JSON using a regular-expression pattern.
# Note: re.search() expects a string, so convert the soup back to text.
data = re.search(r"var ytInitialData = ({.*?});", str(soup)).group(1)
data = json.loads(data)

print(
    data["contents"]["twoColumnWatchNextResults"]["results"]["results"]["contents"][0][
        "videoPrimaryInfoRenderer"
    ]["viewCount"]["videoViewCountRenderer"]["viewCount"]["simpleText"]
)
Output:
124 views
The variable data contains all of the page data as a Python dictionary (dict). To print all of it you can use:
print(json.dumps(data, indent=4))
Output (truncated):
{
    "responseContext": {
        "serviceTrackingParams": [
            {
                "service": "CSI",
                "params": [
                    {
                        "key": "c",
                        "value": "WEB"
                    },
                    {
                        "key": "cver",
                        "value": "2.20210701.07.00"
                    },
                    {
                        "key": "yt_li",
                        "value": "0"
                    },
                    {
                        "key": "GetWatchNext_rid",
                        "value": "0x1d62a299beac9e1f"
                    }
                ]
            },
            {
                "service": "GFEEDBACK",
                "params": [
                    {
                        "key": "logged_in",
                        "value": "0"
                    },
                    {
                        "key": "e",
                        "value": "24037443,24058293,24058128,24003103,24042870,23882685,24023960,23944779,24027649,24046896,24059898,24049577,23983296,23966208,24056265,23891346,1714258,24049575,24045412,24003105,23999405,24051884,23891344,23986022,24049573,24056839,24053866,24058240,23744176,23998056,24010336,24037586,23934970,23974595,23735348,23857950,24036947,24051353,24038425,23990875,24052245,24063702,24058380,23983813,24058812,24026834,23996830,23946420,24001373,24049820,24030040,24062848,23968386,24027689,24004644,23804281,24049569,23973490,24044110,23884386,24012512,24044124,24059521,23918597,24007246,24049567,24022729,24037794"
                    }
                ]
            },
            {
                "service": "GUIDED_HELP",
                "params": [
                    {
                        "key": "logged_in",
                        "value": "0"
                    }
                ]
            },
            {
                "service": "ECATCHER",
                "params": [
                    {
                        "key": "client.version",
                        "value": "2.20210701"
                    },
                    {
                        "key": "client.name",
                        "value": "WEB"
                    }
                ]
            }
        ],
        "mainAppWebResponseContext": {
            "loggedOut": true
        },
        "webResponseContextExtensionData": {
            "ytConfigData": {
                "visitorData": "CgtoanprT1pPbmtWTSjYk46HBg%3D%3D",
                "rootVisualElementType": 3832
            },

When a site is dynamically loaded, I usually try to view the API requests (from the Network tab in dev tools). I was successful with sites such as Udemy, Skillshare, and a few others, but not with YouTube. So in a case like this I would use the official YouTube Data API, which is quite easy to use and has plenty of code samples on GitHub. With it you just request your data and get a JSON response that you can convert to a dictionary with response.json(). Another option would be Selenium, which is not a solution I like; it's pretty resource- and time-consuming. Requesting from an API is faster than scraping or any other solution. When something doesn't provide an API, you need scraping.
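For reference, a minimal sketch of that approach with the YouTube Data API v3 (the API key and video ID below are placeholders; you need to create your own key in the Google Cloud console):
import requests

# Hypothetical placeholders: supply your own API key and video ID.
API_KEY = "YOUR_API_KEY"
VIDEO_ID = "1OfK8UmLMl0"

response = requests.get(
    "https://www.googleapis.com/youtube/v3/videos",
    params={"part": "statistics", "id": VIDEO_ID, "key": API_KEY},
)
data = response.json()  # the JSON response as a Python dictionary

# "statistics" holds viewCount, likeCount, etc. (as strings).
print(data["items"][0]["statistics"]["viewCount"])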

Related

How can I scrape the "Time" and other data in the advanced details section using Beautiful Soup

Here is the URL that I want to scrape the data from: 'https://www.blockchain.com/explorer/transactions/btc/43eebdc59c6c5ce948ccd9cf514b6c2ece9f1289f136c2e3d9d69dcd29304142'
I tried to scrape it using Beautiful Soup because it doesn't have to open a browser like Selenium does. So I tried to extract the data from the outer section
('section', {'class': 'sc-f9148dd7-2 irWxzm'})
and then tried to dig a little deeper to the targeted div tag, but I don't understand why the data extracted from that section stops before the advanced details, so I can't dive deeper to the desired div tag.
Here is the code that I wrote:
import requests
import bs4
from bs4 import BeautifulSoup
url = 'https://www.blockchain.com/explorer/transactions/btc/43eebdc59c6c5ce948ccd9cf514b6c2ece9f1289f136c2e3d9d69dcd29304142'
res = requests.get(url)
format = '%Y-%m-%d %H:%M'
soup = BeautifulSoup(res.content, 'html.parser')
soup2 = soup.find('section',{'class':'sc-f9148dd7-2 irWxzm'})
print(soup2)
I tried a lot, but under 'section', {'class': 'sc-f9148dd7-2 irWxzm'} it can't find any tags except class="sc-c907597a-0 MqlNG" and class="sc-c907597a-3 ctQMfW".
Could you help me find a way to get the data in the advanced details section, please?
Thank you so very much in advance.
The page loads the data from an external URL via JavaScript. To load the data you can use the next example:
from datetime import datetime
import requests
api_url = "https://www.blockchain.com/explorer/api/transaction?asset=btc&id=43eebdc59c6c5ce948ccd9cf514b6c2ece9f1289f136c2e3d9d69dcd29304142"
data = requests.get(api_url).json()
print(data)
Prints:
{
    "ticker": "btc",
    "transaction": {
        "txid": "43eebdc59c6c5ce948ccd9cf514b6c2ece9f1289f136c2e3d9d69dcd29304142",
        "size": 381,
        "version": 1,
        "locktime": 0,
        "fee": 170170,
        "inputs": [
            {
                "coinbase": False,
                "txid": "667f2825db6b03e349b5e4be7b4c4c5be266c242a6aaa0218480572ffc5a7b37",
                "output": 0,
                "sigscript": "47304402204ba063dca925f759777ed8818027c421cb4052ecf2e3b980c814bc528c73638e02206a3d58ec92d0be9915c14d6c4cef40a01d301286c90d82c1bcf166db0e94c3bb012103951bbeb5b73e530b6849fca68e470118f4b379ad9126015caf1355dc2a9e8480",
                "sequence": 4294967295,
                "pkscript": "76a9149c8ab044348d826b9ae88d698d575a45a6e8fc6988ac",
                "value": 207730,
                "address": "1FGiZB7K757EUixGcyeyME6Jp8qQZEiUUk",
                "witness": [],
            },
            {
                "coinbase": False,
                "txid": "3c2dc36fd0bebc46062362aff0c4f307d1c99900c5f358fdd37b436a15d37a5f",
                "output": 0,
                "sigscript": "4730440220322e489e971b2c651224c2e03bea408df8c67a0a1c18ddfd20e940d90a8e61990220707ba2431bde31500ebe6a2b3c4a7974b87c4b9ee33849e1453c0831318bed14012103951bbeb5b73e530b6849fca68e470118f4b379ad9126015caf1355dc2a9e8480",
                "sequence": 4294967295,
                "pkscript": "76a9149c8ab044348d826b9ae88d698d575a45a6e8fc6988ac",
                "value": 231716,
                "address": "1FGiZB7K757EUixGcyeyME6Jp8qQZEiUUk",
                "witness": [],
            },
        ],
        "outputs": [
            {
                "address": "1FGiZB7K757EUixGcyeyME6Jp8qQZEiUUk",
                "pkscript": "76a9149c8ab044348d826b9ae88d698d575a45a6e8fc6988ac",
                "value": 269276,
                "spent": True,
                "spender": {
                    "txid": "c7ed715e9f73b2792957af94d3143750525a29f6a62fd6f68d470e56e4bbef7b",
                    "input": 0,
                },
            },
            {
                "address": None,
                "pkscript": "6a208627c703aeac41df8acad1c643d9ee9c2370f9cace1af05a0ac41219116b5e0b",
                "value": 0,
                "spent": False,
                "spender": None,
            },
        ],
        "block": {"height": 756449, "position": 2},
        "deleted": False,
        "time": 1664582470,
        "rbf": False,
        "weight": 1524,
    },
    "rate": 16526.38,
    "latestBlock": 769853,
    "id": "43eebdc59c6c5ce948ccd9cf514b6c2ece9f1289f136c2e3d9d69dcd29304142",
    "description": False,
    "fiat": "USD",
    "labels": {},
}
To get the time:
print(datetime.fromtimestamp(data["transaction"]["time"]))
Prints:
2022-10-01 02:01:10
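Along the same lines, a few more of the advanced-details fields can be read straight from the same dictionary (field names taken from the dump above):
tx = data["transaction"]

print("Fee:", tx["fee"], "satoshi")            # 170170
print("Size:", tx["size"], "bytes")            # 381
print("Weight:", tx["weight"])                 # 1524
print("Block height:", tx["block"]["height"])  # 756449
print("RBF:", tx["rbf"])                       # False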

Is there any way to extract dataLayer information from a webpage with python?

I'm working on constructing a dataset made of dataLayer variable (object) information.
I want to automate a process of classifying pages with machine learning.
Yes, there is.
If the variable is statically assigned in e.g. a <script> block, then you can just parse the HTML with e.g. Beautiful Soup, find the script block and get the result.
More likely, though, the data is dynamically generated after the page loads (or in separate script blocks), so you'd need e.g. Playwright to automate a headless browser, and then read the variable from there.
Playwright example
from playwright.sync_api import sync_playwright, BrowserContext

def get_datalayer(ctx: BrowserContext, url: str):
    page = ctx.new_page()
    page.goto(url)
    page.wait_for_load_state("networkidle")
    return page.evaluate("window.dataLayer")

with sync_playwright() as p:
    browser = p.chromium.launch()
    with browser.new_context() as bcon:
        data_layer = get_datalayer(bcon, "https://www.berceaumagique.com/")
        print(data_layer)
This prints out
[
    {
        "UtmSource": "",
        "EmailHash": "...",
        "NewCustomer": "0",
        "AcceptFunctionalCookie": "",
        "AcceptTargetingCookie": "",
        "IdUser": "",
        "Page": "home",
        "RealPage": "home",
        "urlElitrack": "...",
    },
    {"google_tag_params": {"ecomm_pagetype": "home"}},
    {"PageType": "HomePage"},
    {"EffinityPage": "home", "Session": "0", "NewCustomer": "0"},
    {"gtm.start": 1658135295957, "event": "gtm.js", "gtm.uniqueEventId": 1},
    {
        "event": "axeptio_update",
        "axeptio_authorized_vendors": [],
        "gtm.uniqueEventId": 19,
    },
    {"event": "gtm.dom", "gtm.uniqueEventId": 22},
    {"event": "gtm.js", "gtm.uniqueEventId": 23},
    {
        "event": "promotionsView",
        "ecommerce": {
            "promoView": {
                "promotions": [
                    {
                        "id": "slider-1",
                        "name": "rentree-scolaire",
                        "creative": "home slider",
                        "position": "1",
                    }
                ]
            }
        },
        "gtm.uniqueEventId": 24,
    },
    ...
]
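For the static case mentioned at the top, a rough sketch of the Beautiful Soup route might look like the following; the URL and the regex are placeholders, and json.loads only works if the assigned literal happens to be valid JSON:
import json
import re

import requests
from bs4 import BeautifulSoup

# Hypothetical page whose HTML contains: <script>var dataLayer = [...];</script>
html = requests.get("https://example.com/").text
soup = BeautifulSoup(html, "html.parser")

for script in soup.find_all("script"):
    # Look for a literal assignment; the exact pattern depends on the site.
    match = re.search(r"dataLayer\s*=\s*(\[.*?\]);", script.text, re.DOTALL)
    if match:
        data_layer = json.loads(match.group(1))
        print(data_layer)
        break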

Parse complex JSON in Python

EDITED WITH LARGER JSON:
I have the following JSON and I need to get id element: 624ff9f71d847202039ec220
results": [
{
"id": "62503d2800c0d0004ee4636e",
"name": "2214524",
"settings": {
"dataFetch": "static",
"dataEntities": {
"variables": [
{
"id": "624ffa191d84720202e2ed4a",
"name": "temp1",
"device": {
"id": "624ff9f71d847202039ec220",
"name": "282c0240ea4c",
"label": "282c0240ea4c",
"createdAt": "2022-04-08T09:01:43.547702Z"
},
"chartType": "line",
"aggregationMethod": "last_value"
},
{
"id": "62540816330443111016e38b",
"device": {
"id": "624ff9f71d847202039ec220",
"name": "282c0240ea4c",
},
"chartType": "line",
}
]
}
...
Here is my code (EDITED):
url = "API_URL"
response = urllib.urlopen(url)
data = json.loads(response.read().decode("utf-8"))
print url
all_ids = []
for i in data['results']:  # i is a dictionary
    for variable in i['settings']['dataEntities']['variables']:
        print(variable['id'])
        all_ids.append(variable['id'])
But I have the following error:
for variable in i['settings']['dataEntities']['variables']:
KeyError: 'dataEntities'
Could you please help?
Thanks!!
What is it printing when you print(fetc)? If you format the JSON, it will be easier to read; the current nesting is very hard to comprehend.
fetc is a string, not a dict. If you want the dict, you have to use the key.
Try:
url = "API_URL"
response = urllib.urlopen(url)
data = json.loads(response.read().decode("utf-8"))
print url
for i in data['results']:
print(json.dumps(i['settings']))
print(i['settings']['dataEntities']
EDIT: To get to the id field, you'll need to dive further.
i['settings']['dataEntities']['variables'][0]['id']
So if you want all the ids, you'll have to loop over the variables (assuming the list has more than one entry), and if you want them for all the settings, you'll need to loop over that too.
Full solution for you to try (EDITED after you uploaded the full JSON):
url = "API_URL"
response = urllib.urlopen(url)
data = json.loads(response.read().decode("utf-8"))
print url
all_ids = []
for i in data['results']: # i is a dictionary
for variable in i['settings']['dataEntities']['variables']:
print(variable['id'])
all_ids.append(variable['id'])
all_ids.append(variable['device']['id']
Let me know if that works.
The shared JSON is not valid. A valid JSON similar to yours is:
{
    "results": [
        {
            "settings": {
                "dataFetch": "static",
                "dataEntities": {
                    "variables": [
                        {
                            "id": "624ffa191d84720202e2ed4a",
                            "name": "temp1",
                            "span": "inherit",
                            "color": "#2ccce4",
                            "device": {
                                "id": "624ff9f71d847202039ec220"
                            }
                        }
                    ]
                }
            }
        }
    ]
}
In order to get a list of ids from your JSON you need a double for loop. A Pythonic way to do that is:
all_ids = [y["device"]["id"] for x in my_json["results"] for y in x["settings"]["dataEntities"]["variables"]]
Where my_json is your initial JSON.
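Spelled out as explicit loops, the same traversal (collecting the device ids) looks like this:
all_ids = []
for result in my_json["results"]:
    for variable in result["settings"]["dataEntities"]["variables"]:
        all_ids.append(variable["device"]["id"])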

Python post request, problem with posting

I'm trying to write a typeform bot, but I am a total beginner, so I have problems with requests.post.
I am trying to fill in this typeform: https://typeformtutorial.typeform.com/to/aA7Vx9
with this code:
import requests

token = requests.get("https://typeformtutorial.typeform.com/app/form/result/token/aA7Vx9/default")
data = {"42758279": "true",
        "42758410": "text",
        "token": token}
r = requests.post("https://typeformtutorial.typeform.com/app/form/submit/aA7Vx9", data)
print(r)
I think that something is wrong with "data" and I am not sure if I use token in a good way. Could you help me?
So, first of all, you need to get another field along with the token. To do that, you should pass the header 'accept': 'application/json' in your first request. In the response, you'll get a JSON object with the token and landed_at parameters. You should use them in the next step.
Then, the POST data should be different from what you're passing. See the Network tab in the browser's developer tools to find out the actual template. It has a structure like this:
{
    "signature": <YOUR_SIGNATURE>,
    "form_id": "aA7Vx9",
    "landed_at": <YOUR_LANDED_AT_TIME>,
    "answers": [
        {
            "field": {
                "id": "42758279",
                "type": "yes_no"
            },
            "type": "boolean",
            "boolean": True
        },
        {
            "field": {
                "id": "42758410",
                "type": "short_text"
            },
            "type": "text",
            "text": "1"
        }
    ]
}
And finally, you should convert that JSON to text so the server will successfully parse it.
Working example:
import requests
import json

token = json.loads(requests.post(
    "https://typeformtutorial.typeform.com/app/form/result/token/aA7Vx9/default",
    headers={'accept': 'application/json'}
).text)
signature = token['token']
landed_at = int(token['landed_at'])

data = {
    "signature": signature,
    "form_id": "aA7Vx9",
    "landed_at": landed_at,
    "answers": [
        {
            "field": {
                "id": "42758279",
                "type": "yes_no"
            },
            "type": "boolean",
            "boolean": True
        },
        {
            "field": {
                "id": "42758410",
                "type": "short_text"
            },
            "type": "text",
            "text": "1"
        }
    ]
}

json_data = json.dumps(data)
r = requests.post("https://typeformtutorial.typeform.com/app/form/submit/aA7Vx9", data=json_data)
print(r.text)
Output:
{"message":"success"}

Issues decoding Collections+JSON in Python

I've been trying to decode a JSON response in Collections+JSON format using Python for a while now but I can't seem to overcome a small issue.
First of all, here is the JSON response:
{
    "collection": {
        "href": "http://localhost:8000/social/messages-api/",
        "items": [
            {
                "data": [
                    {
                        "name": "messageID",
                        "value": 19
                    },
                    {
                        "name": "author",
                        "value": "mike"
                    },
                    {
                        "name": "recipient",
                        "value": "dan"
                    },
                    {
                        "name": "pm",
                        "value": "0"
                    },
                    {
                        "name": "time",
                        "value": "2015-03-31T15:04:01.165060Z"
                    },
                    {
                        "name": "text",
                        "value": "first message"
                    }
                ]
            }
        ],
        "version": "1.0",
        "links": []
    }
}
And here is how I am attempting to extract data:
response = urllib2.urlopen('myurl')
responseData = response.read()
jsonData = json.loads(responseData)
test = jsonData['collection']['items']['data']
When I run this code I get the error:
list indices must be integers, not str
If I use an integer, e.g. 0, instead of a string it merely shows 'data' instead of any useful information, unlike if I were to simply output 'items'. Similarly, I can't seem to access the data within a data child, for example:
test = jsonData['collection']['items'][0]['name']
This will argue that there is no element called 'name'.
What is the proper method of accessing JSON data in this situation? I would also like to iterate over the collection, if that helps.
I'm aware of a package that can be used to simplify working with Collections+JSON in Python, collection-json, but I'd rather be able to do this without using such a package.
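For what it's worth, here is a minimal sketch of the indexing this structure calls for, reusing the jsonData variable from the code above: 'items' is a list (hence the "list indices must be integers" error), and each item's 'data' value is itself a list of name/value pairs that can be folded into a plain dict:
# "items" is a list, so it takes an integer index first:
first_item = jsonData['collection']['items'][0]

# Each item's "data" is a list of {"name": ..., "value": ...} pairs;
# fold it into a single dict for convenient access, then iterate:
for item in jsonData['collection']['items']:
    fields = {d['name']: d['value'] for d in item['data']}
    print(fields['author'], '->', fields['recipient'], ':', fields['text'])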
