Parsing json Objects From Text File With Much Other Stuff - Python - python

I have an html page.
I read with requests and parsed a script tag with beautifulsoup, now this tag has loads of text, and some of it is json objects.
How can I read all the json objects from this text?
What I want to achieve is to get the products with prices from amazon daily deals and this is what I wrote for now:
from bs4 import BeautifulSoup
import json
import requests
def FindRightScriptTag(soup):
for tag in soup.find_all('script', type="text/javascript"):
if 'sortedDealIDs' and 'dealDetails' in tag.text:
return tag
url = "https://www.amazon.co.uk/gp/deals/"
r = requests.get(url)
data = r.text
soup = BeautifulSoup(data,"html.parser")
tag = FindRightScriptTag(soup)
print (tag)

Would be good if you shared some of your code. In general if you know how to navigate down your beautiful soup xml tree you can pass the string you know to be json into the json module.
json.loads() is what you are looking for as it takes a json string to turn it into a Python object dict for you to use.

Related

Is there a way I can extract a list from a javascript document?

There is a website where I need to obtain the owners of this item from an online-game item and from research, I need to do some 'web scraping' to get this data. But, the information is in a Javascript document/code, not an easily parseable HTML document like bs4 shows I can easily extract information from. So, I need to get a variable in this javascript document (contains a list of owners of the item I'm looking at) and make it into a usable list/json/string I can implement in my program. Is there a way I can do this? if so, how can I?
I've attached an image of the variable I need when viewing the page source of the site I'm on.
My current code:
from bs4 import BeautifulSoup
html = requests.get('https://www.rolimons.com/item/1029025').content #the item webpage
soup = BeautifulSoup(html, "lxml")
datas = soup.find_all("script")
print(data) #prints the sections of the website content that have ja
IMAGE LINK
To scrape javascript variable, can't use only BeautifulSoup. Regular expression (re) is required.
Use ast.literal_eval to convert string representation of dict to a dict.
from bs4 import BeautifulSoup
import requests
import re
import ast
html = requests.get('https://www.rolimons.com/item/1029025').content #the item webpage
soup = BeautifulSoup(html, "lxml")
ownership_data = re.search(r'ownership_data\s+=\s+.*;', soup.text).group(0)
ownership_data_dict = ast.literal_eval(ownership_data.split('=')[1].strip().replace(';', ''))
print(ownership_data_dict)
Output:
> {'id': 1029025, 'num_points': 1616, 'timestamps': [1491004800,
> 1491091200, 1491177600, 1491264000, 1491350400, 1491436800,
> 1491523200, 1491609600, 1491696000, 1491782400, 1491868800,
> 1491955200, 1492041600, 1492128000, 1492214400, 1492300800,
> 1492387200, 1492473600, 1492560000, 1492646400, 1492732800,
> 1492819200, ...}
import requests
import json
import re
r = requests.get('...')
m = re.search(r'var history_data\s+=\s+(.*)', r.text)
print(json.loads(m.group(1)))

Python: BeautifulSoup Pulling/Parsing data from within html tag

I'm attempting to pull sporting data from a url using Beautiful Soup in Python code. The issue I'm having with this data source is the data appears within the html tag. Specifically this tag is titled ""
I'm after the players data - which seems to be in XML format. However this data is appearing within the "match" tag rather that as the content within the start/end tag.
So like this:
print(soup.match)
Returns: (not going to include all the text):
<match :matchdata='{"match":{"id":"5dbb8e20-6f37-11eb-924a-1f6b8ad68.....ALL DATA HERE....>
</match>
Because of this when I try to output the contents as text it returns empty.
print(soup.match.text)
Returns: nothing
How would I extract this data from within the "" html tag. After this I would like to either save as an XML file or even better a CSV file would be ideal.
My python program from the beginning is:
from bs4 import BeautifulSoup
import requests
url="___MY_URL_HERE___"
# Make a GET request for html content
html_content = requests.get(url).text
# Parse the html content
soup = BeautifulSoup(html_content, "lxml")
## type(soup)
## <class 'bs4.BeautifulSoup'>
print(soup.match)
Thanks a lot!
A tag may have any number of attributes. The tag has an attribute “id” whose value is “boldest”. You can access a tag’s attributes by treating the tag like a dictionary
https://www.crummy.com/software/BeautifulSoup/bs4/doc/#attributes
So in your case
print(soup.match[":matchdata"])

How to get CData from html using beautiful soup

I am trying to get a value from a webpage. In the source code of the webpage, the data is in CDATA format and also comes from a jQuery. I have managed to write the below code which gets a large amount of text, where the index 21 contains the information I need. However, this output is large and not in a format I understand. Within the output I need to isolate and output "redshift":"0.06" but dont know how. what is the best way to solve this.
import requests
from bs4 import BeautifulSoup
link = "https://wis-tns.weizmann.ac.il/object/2020aclx"
html = requests.get(link).text
soup = BeautifulSoup(html, "html.parser")
res = soup.findAll('b')
print soup.find_all('script')[21]
It can be done using the current approach you have. However, I'd advise against it. There's a neater way to do it by observing that the redshift value is present in a few convenient places on the page itself.
The following approach should work for you. It looks for tables on the page with the class "atreps-results-table" -- of which there are two. We take the second such table and look for the table cell with the class "cell-redshift". Then, we just print out its text content.
from bs4 import BeautifulSoup
import requests
link = 'https://wis-tns.weizmann.ac.il/object/2020aclx'
html = requests.get(link).text
soup = BeautifulSoup(html, 'html.parser')
tab = soup.find_all('table', {'class': 'atreps-results-table'})[1]
redshift = tab.find('td', {'class': 'cell-redshift'})
print(redshift.text)
Try simply:
soup.select_one('div.field-redshift > div.value>b').text
If you view the Page Source of the URL, you will find that there are two script elements that are having CDATA. But the script element in which you are interested has jQuery in it. So you have to select the script element based on this knowledge. After that, you need to do some cleaning to get rid of CDATA tags and jQuery. Then with the help of json library, convert JSON data to Python Dictionary.
import requests
from bs4 import BeautifulSoup
import json
page = requests.get('https://wis-tns.weizmann.ac.il/object/2020aclx')
htmlpage = BeautifulSoup(page.text, 'html.parser')
scriptelements = htmlpage.find_all('script')
for script in scriptelements:
if 'CDATA' in script.text and 'jQuery' in script.text:
scriptcontent = script.text.replace('<!--//--><![CDATA[//>', '').replace('<!--', '').replace('//--><!]]>', '').replace('jQuery.extend(Drupal.settings,', '').replace(');', '')
break
jsondata = json.loads(scriptcontent)
print(jsondata['objectFlot']['plotMain1']['params']['redshift'])

How to extract ids and classes from a webpage using python?

This is my code so far :
import urllib2
with urllib2.urlopen("https://quora.com") as response:
html = response.read()
I am new to Python and somehow I am successful in fetching the webpage, now how to extract ids and classes from the webpage?
A better way to do so would be using the BeautifulSoup (bs4) web-scraping library, and requests.
After having installed both using pip, you can start as so:
import requests
from bs4 import BeautifulSoup
r = requests.get("http://quora.com")
soup = BeautifulSoup(r.content, "html.parser")
To find an element with a specific id:
soup.find(id="your_id")
To find all elements with the "Answer" class:
soup.find_all(class_="Answer")
You can then use .get_text() to remove the html tags and use python string operations to organize your data.
You may try to parse the html code using dedicated libraries, for instance BeautifulSoup.
you can do it easly by xml parsing
from lxml import html
import requests
page = requests.get('http://google.com')
with open('/home/Desktop/test.txt','wb') as f :
f.write(page.content)

Python XMl Parser with BeautifulSoup. How do I remove tags?

For a project I decided to make an app that helps people find friends on Twitter.
I have been able to grab usernames from xml pages. So for example with my current code I can get <uri>http://twitter.com/username</uri> from an XML page, but I want to remove the <uri> and </uri> tags using Beautiful Soup.
Here is my current code:
import urllib
import BeautifulSoup
doc = urllib.urlopen("http://search.twitter.com/search.atom?q=travel").read()
soup = BeautifulStoneSoup(''.join(doc))
data = soup.findAll("uri")
Don't use BeautifulSoup to parse twitter, use their API (also don't use BeautifulSoup, use lxml). To answer your question:
import urllib
from BeautifulSoup import BeautifulSoup
resp = urllib.urlopen("http://search.twitter.com/search.atom?q=travel")
soup = BeautifulSoup(resp.read())
for uri in soup.findAll('uri'):
uri.extract()
To answer your question about BeautifulSoup, text is what you need to grab the contents of each <uri> tag. Here I extract the information into a list comprehension:
>>> uris = [uri.text for uri in soup.findAll('uri')]
>>> len(uris)
15
>>> print uris[0]
http://twitter.com/MarieJeppesen
But, as zeekay says, Twitter's REST API is a better approach for querying Twitter.

Categories

Resources