How to convert values into nested JSON in python? - python

I scrape certain values using BeautifulSoup and want to convert them into a nested JSON format.
This is the structure of my values:
category = development
heading = Complete Python Bootcamp | Deep Learning Into Python Coding
image = https://i.udemycdn.com/course/750x422/3871500_3d01_3.jpg
link = https://www.udemy.com/course/complete-python-bootcamp-deep-learning-into-python-coding
category = development
heading = C++ Complete Course For Beginners
image = https://i.udemycdn.com/course/750x422/3847698_8547_2.jpg
link = https://www.udemy.com/course/c-complete-course-for-beginners/?couponCode=FREE2021
category = it-software
heading = TB0-116 TIBCO Enterprise Message Service 6 Practice Exam
image = https://i.udemycdn.com/course/750x422/2931054_d555.jpg
link = https://www.udemy.com/course/tb0-116-tibco-enterprise-message-service-6-practice-exam-t
Expected JSON output:
[
    {
        "development": [
            {
                "heading": " Complete Python Bootcamp | Deep Learning Into Python Coding",
                "image": "https://i.udemycdn.com/course/750x422/3871500_3d01_3.jpg",
                "courselink": "https://www.udemy.com/course/complete-python-bootcamp-deep-learning"
            },
            {
                "heading": "C++ Complete Course For Beginners",
                "image": "https://i.udemycdn.com/course/750x422/3871500_3d01_3.jpg",
                "courselink": "https://www.udemy.com/course/complete-python-bootcamp-deep-learning"
            }
        ],
        "it-software": [
            {
                "heading": "TB0-116 TIBCO Enterprise Message Service 6 Practice Exam",
                "image": "https://i.udemycdn.com/course/750x422/2931054_d555.jpg",
                "courselink": "https://www.udemy.com/course/tb0-116-tibco-enterprise-message-service"
            }
        ]
    }
]
Below is my scraping code:
def scrapeData(category):
    base_url = "https://udemycoupon.learnviral.com/coupon-category/"+category+"/"
    print(base_url)
    source=requests.get(base_url,headers=headers).text
    soup = BeautifulSoup(source,'lxml')
    contents = soup.find_all('div',class_="item-holder")
    print()
    # print(contents)
    for item in contents:
        print(category)
        heading=item.find("h3",{"class":"entry-title"}).text.replace("[Free]","")
        print(heading)
        image=item.find("div",{"class":"store-image"}).find("img")['src']
        imagelink = image.replace('240x135', '750x422')
        print(imagelink)
        courselink = item.find("a", {"class":"coupon-code-link btn promotion"})
Can anyone help me convert this into my expected format in Python? Thanks in advance.

def scrape_category(name):
    base_url = 'https://udemycoupon.learnviral.com/coupon-category/' + name + '/'
    source = requests.get(base_url).text
    soup = BeautifulSoup(source, 'lxml')
    contents = soup.find_all('div', class_='item-holder')
    courses = []
    for item in contents:
        heading = item.find('h3', {'class': 'entry-title'}).text.replace('[Free]', '')
        image = item.find('div', {'class': 'store-image'}).find('img')['src']
        course_link = item.find('a', {'class': 'coupon-code-link btn promotion'})
        courses.append({
            'heading': heading,
            'image': image.replace('240x135', '750x422'),
            'courselink': course_link['href'],
        })
    return courses

result = {}
for category in ('development', 'it-software', ):
    result[category] = scrape_category(category)
print(result)  # or print([result])

You can use defaultdict and update the scraping code to build a dictionary for every course:
from collections import defaultdict

main_d = defaultdict(list)

for item in contents:
    print(category)
    heading=item.find("h3",{"class":"entry-title"}).text.replace("[Free]","")
    print(heading)
    image=item.find("div",{"class":"store-image"}).find("img")['src']
    imagelink = image.replace('240x135', '750x422')
    print(imagelink)
    courselink = item.find("a", {"class":"coupon-code-link btn promotion"})
    # store the resized image URL and the link's href rather than the Tag object
    d = {"heading": heading, "image": imagelink, "courselink": courselink["href"]}
    main_d[category].append(d)
main_d will be a dictionary with the following structure:
{
    "development": [
        {
            "heading": " Complete Python Bootcamp | Deep Learning Into Python Coding",
            "image": "https://i.udemycdn.com/course/750x422/3871500_3d01_3.jpg",
            "courselink": "https://www.udemy.com/course/complete-python-bootcamp-deep-learning"
        },
        {
            "heading": "C++ Complete Course For Beginners",
            "image": "https://i.udemycdn.com/course/750x422/3871500_3d01_3.jpg",
            "courselink": "https://www.udemy.com/course/complete-python-bootcamp-deep-learning"
        }
    ],
    "it-software": [
        {
            "heading": "TB0-116 TIBCO Enterprise Message Service 6 Practice Exam",
            "image": "https://i.udemycdn.com/course/750x422/2931054_d555.jpg",
            "courselink": "https://www.udemy.com/course/tb0-116-tibco-enterprise-message-service"
        }
    ]
}
Note: This code is untested and may need some modifications to work correctly.
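The grouping step on its own can be sketched without any scraping. This is a minimal illustration, assuming the scraped values have already been collected as flat dictionaries; the `records` list, its placeholder URLs, and the helper name `group_by_category` are made up for the example and do not come from the question.

```python
import json
from collections import defaultdict

# Hypothetical flat records, shaped like the values printed by the scraper.
records = [
    {"category": "development", "heading": "Course A",
     "image": "https://example.com/a.jpg", "courselink": "https://example.com/a"},
    {"category": "development", "heading": "Course B",
     "image": "https://example.com/b.jpg", "courselink": "https://example.com/b"},
    {"category": "it-software", "heading": "Course C",
     "image": "https://example.com/c.jpg", "courselink": "https://example.com/c"},
]


def group_by_category(rows):
    """Group flat course records into {category: [course, ...]}."""
    grouped = defaultdict(list)
    for row in rows:
        # Copy every field except the category, which becomes the outer key.
        course = {key: value for key, value in row.items() if key != "category"}
        grouped[row["category"]].append(course)
    return dict(grouped)


nested = group_by_category(records)
# Wrap the dict in a list to match the expected output in the question.
print(json.dumps([nested], indent=2))
```

json.dumps produces the JSON text; while working in Python, the nested dict itself is usually the more convenient form.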

Related

Scrape eBay Sold Items Using Selenium Returns []

I have almost no webscraping experience, and wasn't able to solve this using BeautifulSoup, so I'm trying selenium (installed it today). I'm trying to scrape sold items on eBay. I'm trying to scrape:
https://www.ebay.com/sch/i.html?_from=R40&_nkw=oakley+sunglasses&_sacat=0&Brand=Oakley&rt=nc&LH_Sold=1&LH_Complete=1&_ipg=200&_oaa=1&_fsrp=1&_dcat=79720
Here is my code where I load in html code and convert to selenium html:
ebay_url = 'https://www.ebay.com/sch/i.html?_from=R40&_nkw=oakley+sunglasses&_sacat=0&Brand=Oakley&rt=nc&LH_Sold=1&LH_Complete=1&_ipg=200&_oaa=1&_fsrp=1&_dcat=79720'
html = requests.get(ebay_url)
#print(html.text)
driver = wd.Chrome(executable_path=r'/Users/mburley/Downloads/chromedriver')
driver.get(ebay_url)
Which correctly opens a new chrome session at the correct url. I'm working on getting the titles, prices, and date sold and then loading it into a csv file. Here is the code I have for those:
# Find all div tags and set equal to main_data
all_items = driver.find_elements_by_class_name("s-item__info clearfix")[1:]
#print(main_data)
# Loop over main_data to extract div classes for title, price, and date
for item in all_items:
    date = item.find_element_by_xpath("//span[contains(@class, 'POSITIVE']").text.strip()
    title = item.find_element_by_xpath("//h3[contains(@class, 's-item__title s-item__title--has-tags']").text.strip()
    price = item.find_element_by_xpath("//span[contains(@class, 's-item__price']").text.strip()
    print('title:', title)
    print('price:', price)
    print('date:', date)
    print('---')
    data.append( [title, price, date] )
This just returns []. I think eBay may be blocking my IP, but the HTML loads and looks correct. Hopefully someone can help! Thanks!
Selenium is not necessary for scraping eBay: the data is not rendered by JavaScript, so it can be extracted from the plain HTML with the BeautifulSoup web scraping library.
Keep in mind that parsing problems may arise when you request a site many times, because eBay may decide that the requests come from a bot rather than a real user.
One way to avoid this is to send headers containing a user-agent with the request, so the site treats you as a regular user and returns the information.
An additional step is to rotate those user-agents. The ideal scenario is to use proxies in combination with rotated user-agents (plus a CAPTCHA solver).
from bs4 import BeautifulSoup
import requests, json, lxml

# https://requests.readthedocs.io/en/latest/user/quickstart/#custom-headers
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/106.0.0.0 Safari/537.36"
}

params = {
    '_nkw': 'oakley+sunglasses',  # search query
    'LH_Sold': '1',               # shows sold items
    '_pgn': 1                     # page number
}

data = []

while True:
    page = requests.get('https://www.ebay.com/sch/i.html', params=params, headers=headers, timeout=30)
    soup = BeautifulSoup(page.text, 'lxml')

    print(f"Extracting page: {params['_pgn']}")
    print("-" * 10)

    for products in soup.select(".s-item__info"):
        title = products.select_one(".s-item__title span").text
        price = products.select_one(".s-item__price").text
        link = products.select_one(".s-item__link")["href"]

        data.append({
            "title" : title,
            "price" : price,
            "link" : link
        })

    # keep going while the page has a "next" pagination button
    if soup.select_one(".pagination__next"):
        params['_pgn'] += 1
    else:
        break

print(json.dumps(data, indent=2, ensure_ascii=False))
Example output
Extracting page: 1
----------
[
{
"title": "Shop on eBay",
"price": "$20.00",
"link": "https://ebay.com/itm/123456?hash=item28caef0a3a:g:E3kAAOSwlGJiMikD&amdata=enc%3AAQAHAAAAsJoWXGf0hxNZspTmhb8%2FTJCCurAWCHuXJ2Xi3S9cwXL6BX04zSEiVaDMCvsUbApftgXEAHGJU1ZGugZO%2FnW1U7Gb6vgoL%2BmXlqCbLkwoZfF3AUAK8YvJ5B4%2BnhFA7ID4dxpYs4jjExEnN5SR2g1mQe7QtLkmGt%2FZ%2FbH2W62cXPuKbf550ExbnBPO2QJyZTXYCuw5KVkMdFMDuoB4p3FwJKcSPzez5kyQyVjyiIq6PB2q%7Ctkp%3ABlBMULq7kqyXYA"
},
{
"title": "Oakley X-metal Juliet Men's Sunglasses",
"price": "$280.00",
"link": "https://www.ebay.com/itm/265930582326?hash=item3deab2a936:g:t8gAAOSwMNhjRUuB&amdata=enc%3AAQAHAAAAoH76tlPncyxembf4SBvTKma1pJ4vg6QbKr21OxkL7NXZ5kAr7UvYLl2VoCPRA8KTqOumC%2Bl5RsaIpJgN2o2OlI7vfEclGr5Jc2zyO0JkAZ2Gftd7a4s11rVSnktOieITkfiM3JLXJM6QNTvokLclO6jnS%2FectMhVc91CSgZQ7rc%2BFGDjXhGyqq8A%2FoEyw4x1Bwl2sP0viGyBAL81D2LfE8E%3D%7Ctkp%3ABk9SR8yw1LH9YA"
},
{
"title": " Used Oakley PROBATION Sunglasses Polished Gold/Dark Grey (OO4041-03)",
"price": "$120.00",
"link": "https://www.ebay.com/itm/334596701765?hash=item4de7847e45:g:d5UAAOSw4YtjTfEE&amdata=enc%3AAQAHAAAAoItMbbzfQ74gNUiinmOVnzKlPWE%2Fc54B%2BS1%2BrZpy6vm5lB%2Bhvm5H43UFR0zeCU0Up6sPU2Wl6O6WR0x9FPv5Y1wYKTeUbpct5vFKu8OKFBLRT7Umt0yxmtLLMWaVlgKf7StwtK6lQ961Y33rf3YuQyp7MG7H%2Fa9fwSflpbJnE4A9rLqvf3hccR9tlWzKLMj9ZKbGxWT17%2BjyUp19XIvX2ZI%3D%7Ctkp%3ABk9SR8yw1LH9YA"
},
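The user-agent rotation suggested in this answer could be sketched roughly as follows. This is only an illustration: the `USER_AGENTS` pool and the `header_cycle` helper are hypothetical names, and the second user-agent string is just an example value.

```python
import itertools

# Hypothetical pool of user-agent strings; replace with the ones you want to send.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/106.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/16.0 Safari/605.1.15",
]


def header_cycle(user_agents):
    """Yield one headers dict per request, cycling through the pool forever."""
    for agent in itertools.cycle(user_agents):
        yield {"User-Agent": agent}


headers_iter = header_cycle(USER_AGENTS)
first = next(headers_iter)
second = next(headers_iter)
third = next(headers_iter)  # wraps around to the first user-agent again
```

Inside the pagination loop, each request would then take a fresh headers dict, for example requests.get(url, params=params, headers=next(headers_iter), timeout=30).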
As an alternative, you can use the eBay Organic Results API from SerpApi. It's a paid API with a free plan that handles blocks and parsing on their backend.
Example code that paginates through all pages:
from serpapi import EbaySearch
import os, json

params = {
    "api_key": os.getenv("API_KEY"),  # serpapi api key
    "engine": "ebay",                 # search engine
    "ebay_domain": "ebay.com",        # ebay domain
    "_nkw": "oakley+sunglasses",      # search query
    "_pgn": 1,                        # page number
    "LH_Sold": "1"                    # shows sold items
}

search = EbaySearch(params)           # where data extraction happens

page_num = 0
data = []

while True:
    results = search.get_dict()       # JSON -> Python dict

    if "error" in results:
        print(results["error"])
        break

    for organic_result in results.get("organic_results", []):
        link = organic_result.get("link")
        price = organic_result.get("price")

        data.append({
            "price" : price,
            "link" : link
        })

    page_num += 1
    print(page_num)

    if "next" in results.get("pagination", {}):
        params['_pgn'] += 1
    else:
        break

print(json.dumps(data, indent=2))
Output:
[
{
"price": {
"raw": "$68.96",
"extracted": 68.96
},
"link": "https://www.ebay.com/itm/125360598217?epid=20030526224&hash=item1d3012ecc9:g:478AAOSwCt5iqgG5&amdata=enc%3AAQAHAAAA4Ls3N%2FEH5OR6w3uoTlsxUlEsl0J%2B1aYmOoV6qsUxRO1d1w3twg6LrBbUl%2FCrSTxNOjnDgIh8DSI67n%2BJe%2F8c3GMUrIFpJ5lofIRdEmchFDmsd2I3tnbJEqZjIkWX6wXMnNbPiBEM8%2FML4ljppkSl4yfUZSV%2BYXTffSlCItT%2B7ZhM1fDttRxq5MffSRBAhuaG0tA7Dh69ZPxV8%2Bu1HuM0jDQjjC4g17I3Bjg6J3daC4ZuK%2FNNFlCLHv97w2fW8tMaPl8vANMw8OUJa5z2Eclh99WUBvAyAuy10uEtB3NDwiMV%7Ctkp%3ABk9SR5DKgLD9YA"
},
{
"price": {
"raw": "$62.95",
"extracted": 62.95
},
"link": "https://www.ebay.com/itm/125368283608?epid=1567457519&hash=item1d308831d8:g:rnsAAOSw7PJiqMQz&amdata=enc%3AAQAHAAAA4AwZhKJZfTqrG8VskZL8rtfsuNtZrMdWYpndpFs%2FhfrIOV%2FAjLuzNzaMNIvTa%2B6QUTdkOwTLRun8n43cZizqtOulsoBLQIwy3wf19N0sHxGF5HaIDOBeW%2B2sobRnzGdX%2Fsmgz1PRiKFZi%2BUxaLQpWCoGBf9n8mjcsFXi3esxbmAZ8kenO%2BARbRBzA2Honzaleb2tyH5Tf8%2Bs%2Fm5goqbon%2FcEsR0URO7BROkBUUjDCdDH6fFi99m6anNMMC3yTBpzypaFWio0u2qu5TgjABUfO1wzxb4ofA56BNKjoxttb7E%2F%7Ctkp%3ABk9SR5DKgLD9YA"
},
# ...
]
Disclaimer, I work for SerpApi.
You can use the code below to scrape the details. You can also use pandas to store the data in a CSV file.
Code:
ebay_url = 'https://www.ebay.com/sch/i.html?_from=R40&_nkw=oakley+sunglasses&_sacat=0&Brand=Oakley&rt=nc&LH_Sold=1&LH_Complete=1&_ipg=200&_oaa=1&_fsrp=1&_dcat=79720'
html = requests.get(ebay_url)
# print(html.text)
driver = wd.Chrome(executable_path=r'/Users/mburley/Downloads/chromedriver')
driver.maximize_window()
driver.implicitly_wait(30)
driver.get(ebay_url)
wait = WebDriverWait(driver, 20)
sold_date = []
title = []
price = []
i = 1
for item in driver.find_elements(By.XPATH, "//div[contains(@class,'title--tagblock')]/span[@class='POSITIVE']"):
    sold_date.append(item.text)
    title.append(driver.find_element_by_xpath(f"(//div[contains(@class,'title--tagblock')]/span[@class='POSITIVE']/ancestor::div[contains(@class,'tag')]/following-sibling::a/h3)[{i}]").text)
    price.append(item.find_element_by_xpath(f"(//div[contains(@class,'title--tagblock')]/span[@class='POSITIVE']/ancestor::div[contains(@class,'tag')]/following-sibling::div[contains(@class,'details')]/descendant::span[@class='POSITIVE'])[{i}]").text)
    i = i + 1

print(sold_date)
print(title)
print(price)

data = {
    'Sold_date': sold_date,
    'title': title,
    'price': price
}
df = pd.DataFrame.from_dict(data)
df.to_csv('out.csv', index = 0)
Imports :
import requests
import pandas as pd
from selenium import webdriver as wd
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
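If pandas is not available, the CSV step can also be done with the standard library. The following is a small sketch under that assumption; the `rows` list, its sold-date values, and the helper name `rows_to_csv` are placeholders invented for the example.

```python
import csv
import io

# Hypothetical rows in the same shape as the scraped eBay data.
rows = [
    {"title": "Oakley X-metal Juliet Men's Sunglasses", "price": "$280.00", "date": "Sold Oct 10, 2022"},
    {"title": "Used Oakley PROBATION Sunglasses", "price": "$120.00", "date": "Sold Oct 11, 2022"},
]


def rows_to_csv(records, fieldnames):
    """Serialize a list of dicts to CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


csv_text = rows_to_csv(rows, ["title", "price", "date"])
print(csv_text)
```

To write to disk instead, the same DictWriter can be pointed at a file opened with open("out.csv", "w", newline="").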

How can I extract values from Tableau on this webpage

I am trying to extract the "mobility index" values for each state and county from this webpage:
https://www.cuebiq.com/visitation-insights-mobility-index/
The preferred output would be a panel data of place (state/county) by date for all available places and dates.
There is another thread (How can I scrape tooltips value from a Tableau graph embedded in a webpage) with a similar question. I tried to follow the solution there but it doesn't seem to work for my case.
Thanks a lot in advance.
(A way that I have tried is to download PDF files generated from Tableau, which would contain all counties' value on a specific date. However, I still need to find a way to make request for each date in the data. Anyway, let me know if you have a better idea than this route).
The Tableau data URL doesn't return any data. In fact, it only renders images of the values (probably on a canvas), and I'm guessing it detects clicks based on coordinates. It is probably built this way so the values can be cached and rendered quickly.
When you click on a state, it does return data, but it doesn't always return the result for the state itself (it does work for the individual counties).
The solution I've found is to use the tooltip to get the data for the state. When you click the state, it generates a request like this:
POST https://public.tableau.com/{path}/{session_id}/commands/tabsrv/render-tooltip-server
with the following form param :
worksheet: US Map - State - CMI
dashboard: CMI
tupleIds: [18]
vizRegionRect: {"r":"viz","x":496,"y":148,"w":0,"h":0,"fieldVector":null}
allowHoverActions: false
allowPromptText: true
allowWork: false
useInlineImages: true
where tupleIds: [18] refers to the index of the state in a list of states in reverse alphabetical order like this :
stateNames = ["Wyoming","Wisconsin","West Virginia","Washington","Virginia","Vermont","Utah","Texas","Tennessee","South Dakota","South Carolina","Rhode Island","Pennsylvania","Oregon","Oklahoma","Ohio","North Dakota","North Carolina","New York","New Mexico","New Jersey","New Hampshire","Nevada","Nebraska","Montana","Missouri","Mississippi","Minnesota","Michigan","Massachusetts","Maryland","Maine","Louisiana","Kentucky","Kansas","Iowa","Indiana","Illinois","Idaho","Georgia","Florida","District of Columbia","Delaware","Connecticut","Colorado","California","Arkansas","Arizona","Alabama"]
It returns JSON containing the HTML of the tooltip, which holds the CMI and YoY values you want to extract:
{
"vqlCmdResponse": {
"cmdResultList": [{
"commandName": "tabsrv:render-tooltip-server",
"commandReturn": {
"tooltipText": "{\"htmlTooltip\": \"<HTML HERE WITH THE VALUES>\"}]},\"overlayAnchors\":[]}"
}
}]
}
}
The only caveat is that you'll have to make one request per state:
import requests
from bs4 import BeautifulSoup
import json
import time
data_host = "https://public.tableau.com"
r = requests.get(
    f"{data_host}/views/CMI-2_0/CMI",
    params= {
        ":showVizHome":"no",
    }
)
soup = BeautifulSoup(r.text, "html.parser")

tableauData = json.loads(soup.find("textarea",{"id": "tsConfigContainer"}).text)

dataUrl = f'{data_host}{tableauData["vizql_root"]}/bootstrapSession/sessions/{tableauData["sessionid"]}'

r = requests.post(dataUrl, data= {
    "sheet_id": tableauData["sheetId"],
})
data = []
stateNames = ["Wyoming","Wisconsin","West Virginia","Washington","Virginia","Vermont","Utah","Texas","Tennessee","South Dakota","South Carolina","Rhode Island","Pennsylvania","Oregon","Oklahoma","Ohio","North Dakota","North Carolina","New York","New Mexico","New Jersey","New Hampshire","Nevada","Nebraska","Montana","Missouri","Mississippi","Minnesota","Michigan","Massachusetts","Maryland","Maine","Louisiana","Kentucky","Kansas","Iowa","Indiana","Illinois","Idaho","Georgia","Florida","District of Columbia","Delaware","Connecticut","Colorado","California","Arkansas","Arizona","Alabama"]
for stateIndex, state in enumerate(stateNames):
    time.sleep(0.5) #for throttling
    r = requests.post(f'{data_host}{tableauData["vizql_root"]}/sessions/{tableauData["sessionid"]}/commands/tabsrv/render-tooltip-server',
        data = {
            "worksheet": "US Map - State - CMI",
            "dashboard": "CMI",
            "tupleIds": f"[{stateIndex+1}]",
            "vizRegionRect": json.dumps({"r":"viz","x":496,"y":148,"w":0,"h":0,"fieldVector":None}),
            "allowHoverActions": "false",
            "allowPromptText": "true",
            "allowWork": "false",
            "useInlineImages": "true"
        })
    tooltip = json.loads(r.json()["vqlCmdResponse"]["cmdResultList"][0]["commandReturn"]["tooltipText"])["htmlTooltip"]
    soup = BeautifulSoup(tooltip, "html.parser")
    rows = [
        t.find("tr").find_all("td")
        for t in soup.find_all("table")
    ]
    entry = { "state": state }
    for row in rows:
        if (row[0].text == "Mobility Index:"):
            entry["CMI"] = "".join([t.text.strip() for t in row[1:]])
        if row[0].text == "YoY (%):":
            entry["YoY"] = "".join([t.text.strip() for t in row[1:]])
    print(entry)
    data.append(entry)
print(data)
Try this on repl.it
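The tooltip response is encoded twice: the outer body is JSON, and its tooltipText field is itself a JSON string that holds the HTML. The decoding step can be looked at in isolation with a short sketch; the `sample` response and its table contents are invented placeholders that only mirror the shape described above, and `extract_tooltip_html` is a hypothetical helper name.

```python
import json


def extract_tooltip_html(response_json):
    """Return the tooltip HTML from a render-tooltip-server style response."""
    command_return = response_json["vqlCmdResponse"]["cmdResultList"][0]["commandReturn"]
    # tooltipText is a JSON document stored as a string, so decode it again.
    tooltip = json.loads(command_return["tooltipText"])
    return tooltip["htmlTooltip"]


# Placeholder response with made-up values, shaped like the structure shown above.
sample = {
    "vqlCmdResponse": {
        "cmdResultList": [{
            "commandName": "tabsrv:render-tooltip-server",
            "commandReturn": {
                "tooltipText": json.dumps({
                    "htmlTooltip": "<table><tr><td>Mobility Index:</td><td>3.5</td></tr></table>"
                })
            }
        }]
    }
}

tooltip_html = extract_tooltip_html(sample)
print(tooltip_html)
```

The resulting HTML string is what the answer's code then hands to BeautifulSoup to read the table cells.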
To get the county information, use the select endpoint as in the post you linked in your question; it returns the data in the same format as described there.
The following extracts the data for every county and state:
import requests
from bs4 import BeautifulSoup
import json
import time
data_host = "https://public.tableau.com"
worksheet = "US Map - State - CMI"
dashboard = "CMI"
r = requests.get(
    f"{data_host}/views/CMI-2_0/CMI",
    params= {
        ":showVizHome":"no",
    }
)
soup = BeautifulSoup(r.text, "html.parser")

tableauData = json.loads(soup.find("textarea",{"id": "tsConfigContainer"}).text)

dataUrl = f'{data_host}{tableauData["vizql_root"]}/bootstrapSession/sessions/{tableauData["sessionid"]}'

r = requests.post(dataUrl, data= {
    "sheet_id": tableauData["sheetId"],
})
data = []
stateNames = ["Wyoming","Wisconsin","West Virginia","Washington","Virginia","Vermont","Utah","Texas","Tennessee","South Dakota","South Carolina","Rhode Island","Pennsylvania","Oregon","Oklahoma","Ohio","North Dakota","North Carolina","New York","New Mexico","New Jersey","New Hampshire","Nevada","Nebraska","Montana","Missouri","Mississippi","Minnesota","Michigan","Massachusetts","Maryland","Maine","Louisiana","Kentucky","Kansas","Iowa","Indiana","Illinois","Idaho","Georgia","Florida","District of Columbia","Delaware","Connecticut","Colorado","California","Arkansas","Arizona","Alabama"]
for stateIndex, state in enumerate(stateNames):
    time.sleep(0.5) #for throttling
    r = requests.post(f'{data_host}{tableauData["vizql_root"]}/sessions/{tableauData["sessionid"]}/commands/tabsrv/render-tooltip-server',
        data = {
            "worksheet": worksheet,
            "dashboard": dashboard,
            "tupleIds": f"[{stateIndex+1}]",
            "vizRegionRect": json.dumps({"r":"viz","x":496,"y":148,"w":0,"h":0,"fieldVector":None}),
            "allowHoverActions": "false",
            "allowPromptText": "true",
            "allowWork": "false",
            "useInlineImages": "true"
        })
    tooltip = json.loads(r.json()["vqlCmdResponse"]["cmdResultList"][0]["commandReturn"]["tooltipText"])["htmlTooltip"]
    soup = BeautifulSoup(tooltip, "html.parser")
    rows = [
        t.find("tr").find_all("td")
        for t in soup.find_all("table")
    ]
    entry = { "state": state }
    for row in rows:
        if (row[0].text == "Mobility Index:"):
            entry["CMI"] = "".join([t.text.strip() for t in row[1:]])
        if row[0].text == "YoY (%):":
            entry["YoY"] = "".join([t.text.strip() for t in row[1:]])
    # select the state to get the data segments for its counties
    r = requests.post(f'{data_host}{tableauData["vizql_root"]}/sessions/{tableauData["sessionid"]}/commands/tabdoc/select',
        data = {
            "worksheet": worksheet,
            "dashboard": dashboard,
            "selection": json.dumps({
                "objectIds":[stateIndex+1],
                "selectionType":"tuples"
            }),
            "selectOptions": "select-options-simple"
        })
    entry["county_data"] = r.json()["vqlCmdResponse"]["layoutStatus"]["applicationPresModel"]["dataDictionary"]["dataSegments"]
    print(entry)
    data.append(entry)
print(data)

Grabbing nested attributes with slimit - Python

I'm trying to get URLs from a site with a nested structure like this:
<script>
var.model = {
data: {
"alist": [{
'url': 'http://www.here.org'
}]
}
}
</script>
The problem is, I can't seem to find a way to get through the tree that far. I have experience using BeautifulSoup4 but none using slimit, which I am trying to learn now.
def grabLinks(base, limit):
    base_url = requests.get(base)
    soup = BeautifulSoup(base_url.content, "html.parser")
    for script in soup.find_all("script", {'src': False}):
        if isinstance(script, NavigableString): continue
        tree = Parser().parse(script.text)
        for node in nodevisitor.visit(tree):
            if isinstance(node, ast.Assign) and getattr(node.left, 'value', '') == "data":
                print(getattr(node.right, 'properties'))
I can go as far as getting to the name "alist" (with a whole lot of getattrs), but I cannot access the dictionaries of values inside it. If you need more information, please let me know.
EDIT: Fixed! Updated code:
def grabLinks(base, limit):
    base_url = requests.get(base)
    soup = BeautifulSoup(base_url.content, "html.parser")
    for script in soup.find_all("script", {'src': False}):
        if isinstance(script, NavigableString): continue
        tree = Parser().parse(script.text)
        for node in nodevisitor.visit(tree):
            if isinstance(node, a.Assign) and getattr(node.left, 'value', '') == "data":
                sibnode = getattr(node.right, 'properties')[0]
                print(sibnode.right.items[0].properties[0].right.value)

Web scrape Google search results using BeautifulSoup

My goal is to web scrape Google search results using BeautifulSoup. I am using Anaconda Python with IPython as the console. Why don't I get any output when I run the following code?
def google_scrape(query):
    address = "http://www.google.com/search?q=%s&num=100&hl=en&start=0" % (urllib.quote_plus(query))
    request = urllib2.Request(address, None, {'User-Agent':'Mosilla/5.0 (Macintosh; Intel Mac OS X 10_7_4) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11'})
    urlfile = urllib2.urlopen(request)
    page = urlfile.read()
    soup = BeautifulSoup(page)
    linkdictionary = {}
    for li in soup.findAll('li', attrs={'class':'g'}):
        sLink = li.find('a')
        print sLink['href']
        sSpan = li.find('span', attrs={'class':'st'})
        print sSpan
    return linkdictionary

if __name__ == '__main__':
    links = google_scrape('english')
You are never adding anything to linkdictionary:
def google_scrape(query):
    address = "http://www.google.com/search?q=%s&num=100&hl=en&start=0" % (urllib.quote_plus(query))
    request = urllib2.Request(address, None, {'User-Agent':'Mosilla/5.0 (Macintosh; Intel Mac OS X 10_7_4) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11'})
    urlfile = urllib2.urlopen(request)
    page = urlfile.read()
    soup = BeautifulSoup(page)
    linkdictionary = {}
    for li in soup.findAll('li', attrs={'class':'g'}):
        sLink = li.find('a')
        sSpan = li.find('span', attrs={'class':'st'})
        linkdictionary['href'] = sLink['href']
        linkdictionary['sSpan'] = sSpan
    return linkdictionary

if __name__ == '__main__':
    links = google_scrape('english')
The problem, as Cody Bouche mentioned, is that nothing is ever added to the dict.
In my opinion, you'll have a hard time collecting every result unless you change the {} (dict) to a [] (list).
Appending to a list is much simpler (note: I could be wrong here, it's just a personal opinion from previous experience).
To make it work in a simple manner, change the dict to a list ({} --> []) and then use .append({}) to add each result to it.
Code and example in the online IDE:
import json
import requests
from bs4 import BeautifulSoup

# user-agent header so the request looks like it comes from a browser
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/106.0.0.0 Safari/537.36"
}

def google_scrape(query):
    html = requests.get(f'https://www.google.com/search?q={query}', headers=headers).text
    soup = BeautifulSoup(html, 'lxml')

    data = []

    for container in soup.findAll('div', class_='tF2Cxc'):
        title = container.select_one('.DKV0Md').text
        link = container.find('a')['href']

        data.append({
            'title': title,
            'link': link,
        })
        print(f'{title}\n{link}')

    print(json.dumps(data, indent=2))

google_scrape('english')
# part of the outputs:
'''
English language - Wikipedia
https://en.wikipedia.org/wiki/English_language
[
{
"title": "English language - Wikipedia",
"link": "https://en.wikipedia.org/wiki/English_language"
},
]
'''
If you still want to append to dict() then this is one of the ways of approaching this (only part of the for loop shown):
for container in soup.findAll('div', class_='tF2Cxc'):
    data_dict = {}
    title = container.select_one('.DKV0Md').text
    link = container.find('a')['href']
    # creates title key and assigns title value
    data_dict['title'] = title
    # creates link key and assigns link value
    data_dict['link'] = link
    print(json.dumps(data_dict, indent = 2))
# part of the output:
'''
{
"title": "Minecraft Official Site | Minecraft",
"link": "https://www.minecraft.net/en-us/"
}
'''
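The difference between reusing one dict and appending to a list can be seen without any scraping. This is a small illustration; the `scraped` pairs are stand-in values, not real search results fetched by the code.

```python
# Stand-in (title, link) pairs, in place of what the scraping loop would find.
scraped = [
    ("English language - Wikipedia", "https://en.wikipedia.org/wiki/English_language"),
    ("Minecraft Official Site | Minecraft", "https://www.minecraft.net/en-us/"),
]

# Reusing a single dict: every iteration overwrites the same two keys,
# so only the last result survives.
single = {}
for title, link in scraped:
    single["title"] = title
    single["link"] = link

# Appending a fresh dict per result keeps every entry.
collected = []
for title, link in scraped:
    collected.append({"title": title, "link": link})

print(single)
print(len(collected))
```

This is why the list version returns all results, while a dict created once outside the loop ends up holding only the final one.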
Alternatively, you can do the same thing using the Google Search Engine Results API from SerpApi. It's a paid API with a free plan.
Essentially, it does the same thing as the code above, but you don't have to figure out how to scrape particular elements: that is already done for you, and the result comes back as JSON, so all that is left is to iterate over it and pick out the fields you need.
Code to integrate:
from serpapi import GoogleSearch
import json
params = {
"api_key": "YOUR_API_KEY",
"engine": "google",
"q": "minecraft",
}
search = GoogleSearch(params)
results = search.get_dict()
for result in results['organic_results']:
    print(json.dumps(result, indent = 2, ensure_ascii = False))
# part of the json output:
'''
{
"position": 1,
"title": "Minecraft - Aplikasi di Google Play",
"link": "https://play.google.com/store/apps/details?id=com.mojang.minecraftpe&hl=in&gl=US",
"displayed_link": "https://play.google.com › store › apps › details › id=co...",
"rich_snippet": {
"top": {
"detected_extensions": {
"skor": 46,
"suara": 4144655,
"us": 749
},
"extensions": [
"Skor: 4,6",
"‎4.144.655 suara",
"‎US$7,49",
"‎Android",
"‎Game"
]
}
}
'''
Disclaimer, I work for SerpApi.

Go to next page with Beautifulsoup

I am new to Python and have been looking at some existing example scripts.
I started my own script and it works fine, but now I want to add some extras and would appreciate your help.
def listar_videos(url):
    codigo_fonte = abrir_url(url)
    soup = BeautifulSoup(abrir_url(url))
    content = BeautifulSoup(soup.find("div", { "id" : "dle-content" }).prettify())
    filmes = content("div", { "class" : "short-film" })
    for filme in filmes:
        nome_filme = filme.img["alt"]
        url = filme.a["href"].replace('Assistir ','')
        img = filme.img["src"]
        addDir(nome_filme.encode('utf8'),url,4,img,False,len(filmes))
    pagina = BeautifulSoup(soup.find('div', { "class" : "pages" }).prettify())("div", { "class" : "pnext" })[0]["href"]
    addDir('Página Seguinte >>',pagina,2,artfolder + 'prox.png')
I get IndexError: list index out of range on this line:
pagina = BeautifulSoup(soup.find('div', { "class" : "pages" }).prettify())("div", { "class" : "pnext" })[0]["href"]
I have already tried different code, but with no success.
It's weird, because I already used this code on a WordPress website and it worked fine, but this new site gives me this error.
I need your help guys.
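The IndexError means the search for the "pnext" div returned an empty list, so indexing [0] fails. Why it is empty on this site is not known from the post (the markup may simply differ from the WordPress site; that is an assumption). A defensive sketch that avoids the crash could look like this, where `first_href` is a hypothetical helper name:

```python
def first_href(matches):
    """Return the href of the first match, or None when nothing matched."""
    if not matches:
        return None
    return matches[0].get("href")


# Plain dicts stand in for BeautifulSoup tags here; both support .get().
print(first_href([]))
print(first_href([{"href": "/page/2/"}]))
```

In the script, the "next page" entry would then be added only when the helper returns something other than None.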
