I have parsed XML from website and i found that it has two branches (children),
How to Separate the two branches into two lists of dictionaries,
here's my code so far:
import pandas as pd
import xml.etree.ElementTree as ET
import requests
url = "http://cs.stir.ac.uk/~soh/BD2spring2022/assignmentdata.php"
params = {'data':'spurpyr'}
response = requests.get (url, params)
tree = response.content
#extract the root element as separate variable, and display the root tag.
root = ET.fromstring(tree)
print(root.tag)
#Get attributes of root
root_attr = root.attrib
print(root_attr)
#Finding children of root
for child in root:
print(child.tag, child.attrib)
#extract the two children of the root element into another two separate variables, and display their tags as well
child_dict = []
for child in root:
child_dict.append(child.tag)
tweets_branch = child_dict[0]
cities_branch = child_dict[1]
#the elements in the entire tree
[elem.tag for elem in root.iter()]
#specify both the encoding and decoding of the document you are displaying as the string
print(ET.tostring(root, encoding='utf8').decode('utf8'))
Using beautifulsoup module. To parse tweets and cities to list of dictionaries you can use this example:
import requests
from bs4 import BeautifulSoup
url = "http://cs.stir.ac.uk/~soh/BD2spring2022/assignmentdata.php"
params = {"data": "spurpyr"}
soup = BeautifulSoup(requests.get(url, params=params).content, "xml")
tweets = []
for t in soup.select("tweets > tweet"):
tweets.append({"id": t["id"], **{x.name: x.text for x in t.find_all()}})
cities = []
for c in soup.select("cities > city"):
cities.append({"id": c["id"], **{x.name: x.text for x in c.find_all()}})
print(tweets)
print(cities)
Prints:
[
{
"id": "16620625 5686",
"Name": "Kenyon Conley",
"Phone": "0327 103 9485",
"Email": "malesuada#lobortisClassaptent.edu",
"Location": "45.5333, -73.2833",
"GenderID": "male",
"Tweet": "#FollowFriday #DanielleMorrill - She's with #Seattle20 and #Twilio. Also fun to talk to. #entrepreneur",
"City": "Saint-Basile-le-Grand",
"Country": "Canada",
"Age": "34",
},
{
"id": "16310427-5502",
"Name": "Griffin Norton",
"Phone": "0306 178 7917",
"Email": "in.dolor.Fusce#necmalesuadaut.ca",
"Location": "52.0000, 84.9833",
"GenderID": "male",
"Tweet": "!!!Veryy Bored!!! ~~Craving Million's Of MilkShakes~~",
"City": "Belokurikha",
"Country": "Russia",
"Age": "33",
},
...
Related
I have the following code:
import requests
from bs4 import BeautifulSoup
import urllib, json
url = "https://exec85.pythonanywhere.com/static/data_xbox.JSON"
response = urllib.request.urlopen(url)
json_data_xbox = json.loads(response.read())
name_var = "Salah"
rating_var = "96"
if name in json_data_xbox["player"].__str__(): # here I don't know what to write to check if name + rating are matching
print("matching")
else:
print("not matching")
This is the corresponding JSON file:
{"player": [{"name": "Salah", "rating": "96", "rarity": "TOTS", "prices": 5380000}, {"name": "Salah", "rating": "93", "rarity": "FOF PTG", "prices": 956000}]
As you see, I have two entries with the same name but different rating and prices.
I would like to check if two variables can be found within one object in my json dictionary to match the right one.
So in this example I want to check if name_varand rating_var are matching to then get the correct "prices" value.
What would I need to write to get that check done?
You would probably want to loop over each dictionary in the response and check that the key values are equal to your variables. Could do something like this:
json_data_xbox = json.loads(
'{"player": [{"name": "Salah", "rating": "96", "rarity": "TOTS", "prices": 5380000}, {"name": "Salah", "rating": "93", "rarity": "FOF PTG", "prices": 956000}]}')
name = "Salah"
rating_var = "96"
price = ""
for i in json_data_xbox["player"]:
if i["name"] == name and i["rating"] == rating_var:
price = i["prices"]
print(price)
I am trying to scrape the content of this particular website : https://www.cineatlas.com/
I tried scraping the date part as shown in the print screen :
I used this basic beautifulsoup code
from bs4 import BeautifulSoup
soup = BeautifulSoup(response.text,'html.parser')
type(soup)
time = soup.find('ul',class_='slidee')
This is what I get instead of the list of elements
<ul class="slidee">
<!-- adding dates -->
</ul>
The site creates HTML elements dynamically from the Javascript content. You can get the JS content by using re for example:
import re
import json
import requests
from ast import literal_eval
url = 'https://www.cineatlas.com/'
html_data = requests.get(url).text
movieData = re.findall(r'movieData = ({.*?}), movieDataByReleaseDate', html_data, flags=re.DOTALL)[0]
movieData = re.sub(r'\s*/\*.*?\*/\s*', '', movieData) # remove comments
movieData = literal_eval(movieData) # in movieData you have now the information about the current movies
print(json.dumps(movieData, indent=4)) # print data to the screen
Prints:
{
"2019-08-06": [
{
"url": "fast--furious--hobbs--shaw",
"image-portrait": "https://d10u9ygjms7run.cloudfront.net/dd2qd1xaf4pceqxvb41s1xpzs0/1562603443098_891497ecc8b16b3a662ad8b036820ed1_500x735.jpg",
"image-landscape": "https://d10u9ygjms7run.cloudfront.net/dd2qd1xaf4pceqxvb41s1xpzs0/1562603421049_7c233477779f25725bf22aeaacba469a_700x259.jpg",
"title": "FAST & FURIOUS : HOBBS & SHAW",
"releaseDate": "2019-08-07",
"endpoint": "ST00000392",
"duration": "120 mins",
"rating": "Classification TOUT",
"director": "",
"actors": "",
"times": [
{
"time": "7:00pm",
"bookingLink": "https://ticketing.eu.veezi.com/purchase/8388?siteToken=b4ehk19v6cqkjfwdsyctqra72m",
"attributes": [
{
"_id": "5d468c20f67cc430833a5a2b",
"shortName": "VF",
"description": "Version Fran\u00e7aise"
},
{
"_id": "5d468c20f67cc430833a5a2a",
"shortName": "3D",
"description": "3D"
}
]
},
{
"time": "9:50pm",
"bookingLink": "https://ticketing.eu.veezi.com/purchase/8389?siteToken=b4ehk19v6cqkjfwdsyctqra72m",
... and so on.
lis = time.findChildren()
This returns a list of child nodes
While I was practicing some web-scraping on a webpage (param cookies required), I found myself having problems to scrape out JSON data embedded in the HTML. The following was what I did:
import requests from bs4
import BeautifulSoup as soup
import json
my_url = 'https://www.lazada.sg/catalog/?spm=a2o42.home.search.1.488d46b5mJGzEu&q=switch%20games&_keyori=ss&from=search_history&sugg=switch%20games_0_1'
cookies = {
"Hm_lpvt_7cd4710f721b473263eed1f0840391b4": "1548175412",
"Hm_lvt_7cd4710f721b473263eed1f0840391b4": "1548140525",
"x5sec":"7b22617365727665722d6c617a6164613b32223a223832333339343739626466613939303562613535386138333266383365326132434c4b516e65494645495474764a322b706f6d6f6941453d227d", }
ret = requests.get(my_url, cookies=cookies)
print("New Super Mario Bros" in ret.text) # True
page_soup = soup(ret.text, 'html.parser')
data = page_soup.findAll('script', {'type':'application/ld+json'})
The output is as follows:
[
<script type="application/ld+json">{
"#context": "https://schema.org",
"#type": "BreadcrumbList",
"itemListElement": [
{
"item": {
"name": "Home",
"#id": "https://www.lazada.sg/"
},
"#type": "ListItem",
"position": "1"
}
]
}</script>,
<script type="application/ld+json">{
"#context": "https://schema.org",
"#type": "ItemList",
"itemListElement": [
{
"offers": {
"priceCurrency": "SGD",
"#type": "Offer",
"price": "71.00",
"availability": "https://schema.org/InStock"
},
"image": "https://sg-test-11.slatic.net/p/670a73a9613c36b2bb01555ab4092ba2.jpg",
"#type": "Product",
"name": "Switch: Super Mario Party [Available in Stock! Immediate Shipping]",
"url": "https://www.lazada.sg/products/switch-super-mario-party-available-in-stock-immediate-shipping-i278269540-s429667097.html?search=1"
},
...
I tried to follow an existing thread Extract json from html in python beautifulsoup but found myself stuck, probably due to the different JSON formatting in the HTML soup. The part which I scrape out contains all the different products in that page, is there a way where I further scrape out each product's details (eg. Title, price, rating, etc) and count the number of products present? Thanks!
You can loop parsing out from the json after loading with json.loads. All the product info for those containers is listed in one script tag so you can just grab that.
import requests
from bs4 import BeautifulSoup as soup
import json
import pandas as pd
my_url = 'https://www.lazada.sg/catalog/?spm=a2o42.home.search.1.488d46b5mJGzEu&q=switch%20games&_keyori=ss&from=search_history&sugg=switch%20games_0_1'
cookies = {
"Hm_lpvt_7cd4710f721b473263eed1f0840391b4": "1548175412",
"Hm_lvt_7cd4710f721b473263eed1f0840391b4": "1548140525",
"x5sec":"7b22617365727665722d6c617a6164613b32223a223832333339343739626466613939303562613535386138333266383365326132434c4b516e65494645495474764a322b706f6d6f6941453d227d", }
ret = requests.get(my_url, cookies=cookies)
print("New Super Mario Bros" in ret.text) # True
page_soup = soup(ret.text, 'lxml')
data = page_soup.select("[type='application/ld+json']")[1]
oJson = json.loads(data.text)["itemListElement"]
numProducts = len(oJson)
results = []
for product in oJson:
results.append([product['name'], product['offers']['price'], product['offers']['availability'].replace('https://schema.org/', '')]) # etc......
df = pd.DataFrame(results)
print(df)
i have json file that containd the metadata of 900 articles and i want to extract the Urls from it. my file start like this
[
{
"title": "The histologic phenotypes of …",
"authors": [
{
"name": "JE Armes"
},
],
"publisher": "Wiley Online Library",
"article_url": "https://onlinelibrary.wiley.com/doi/abs/10.1002/(SICI)1097-0142(19981201)83:11%3C2335::AID-CNCR13%3E3.0.CO;2-N",
"cites": 261,
"use": true
},
{
"title": "Comparative epidemiology of pemphigus in ...",
"authors": [
{
"name": "S Bastuji-Garin"
},
{
"name": "R Souissi"
}
],
"year": 1995,
"publisher": "search.ebscohost.com",
"article_url": "http://search.ebscohost.com/login.aspx?direct=true&profile=ehost&scope=site&authtype=crawler&jrnl=0022202X&AN=12612836&h=B9CC58JNdE8SYy4M4RyVS%2FrPdlkoZF%2FM5hifWcv%2FwFvGxUCbEaBxwQghRKlK2vLtwY2WrNNl%2B3z%2BiQawA%2BocoA%3D%3D&crl=c",
"use": true
},
.........
I want to inspect the file with objectpath to create json.tree for the extraxtion of the url. this is the code i want to execute
1. import json
2. import objectpath
3. with open("Data_sample.json") as datafile: data = json.load(datafile)
4. jsonnn_tree = objectpath.Tree(data['name of data'])
5. result_tuple = tuple(jsonnn_tree.execute('$..article_url'))
But in the step 4 for the creation of the tree, I have to insert the name of the data whitch i think that i haven't in my file. How can i replace this line?
You can get all the article urls using a list comprehension.
import json
with open("Data_sample.json") as fh:
articles = json.load(fh)
article_urls = [article['article_url'] for article in articles]
You can instantiate the tree like this:
tobj = op.Tree(your_data)
results = tobj.execute("$.article_url")
And in the end:
results = [x for x in results]
will yield:
["url1", "url2", ...]
Did you try removing the reference and just using:
jsonnn_tree = objectpath.Tree(data)
I am new to Python Scrapy and I am trying to create JSON file from 3 levels of nested pages. I have following structure:
Page 1 (start): contains links of second page (called Mangas)
Page 2: Contains nested Volumes and Chapters
Page 3: Each Chapter contains multiple images
My Code
import scrapy
import time
import items
import json
class GmangaSpider(scrapy.Spider):
name = "gmanga"
start_urls = [
"http://gmanga.me/mangas"
]
def parse(self, response):
# mangas = []
for manga in response.css('div.manga-item'):
link = manga.css('a.manga-item-content').xpath('#href').extract_first()
if link:
page_link = "http://gmanga.me%s" % link
mangas = items.Manga()
mangas['cover'] = manga.css('a.manga-item-content .manga-cover-container img').xpath('#src').extract_first()
mangas['title'] = manga.css('a.manga-item-content .manga-cover-container img').xpath('#alt').extract_first()
mangas['link'] = page_link
mangas['volumes'] = []
yield scrapy.Request(page_link, callback=self.parse_volumes, meta = {"mangas": mangas})
def parse_volumes(self, response):
mangas = response.meta['mangas']
for manga in response.css('div.panel'):
volume = items.Volume()
volume['name'] = manga.css('div.panel-heading .panel-title a::text').extract_first()
volume['chapters'] = []
for tr in manga.css('div.panel-collapse .panel-body table tbody tr'):
chapter = items.Chapter()
chapter['name'] = tr.css('td:nth_child(1) div::text').extract_first()
chapter_link = tr.css('td:nth_child(3) a::attr("href")').extract_first()
chapter['link'] = chapter_link
request = scrapy.Request("http://gmanga.me%s" % chapter_link, callback = self.parse_images, meta = {"chapter": chapter})
yield request
volume['chapters'].append(chapter)
mangas['volumes'].append(volume)
yield mangas
def parse_images(self, response):
chapter = response.meta['chapter']
data = response.xpath("//script").re("alphanumSort\((.*])")
if data:
images = json.loads(data[0])
chapter['images'] = images
return chapter
My Items.py
from scrapy import Item, Field
class Manga(Item):
title = Field()
cover = Field()
link = Field()
volumes = Field()
class Volume(Item):
name = Field()
chapters = Field()
class Chapter(Item):
name = Field()
images = Field()
link = Field()
Now I am bit confused in parse_volumes function where to yield or return to get following structure in json file.
Expected Result:
[{
"cover": "http://media.gmanga.me/uploads/manga/cover/151/medium_143061.jpg",
"link": "http://gmanga.me/mangas/gokko",
"volumes": [{
"name": "xyz",
"chapters": [{
"link": "/mangas/gokko/4/3asq",
"name": "4",
"images": ["img1.jpg", "img2.jpg"]
}, {
"link": "/mangas/gokko/3/3asq",
"name": "3",
"images": ["img1.jpg", "img2.jpg"]
}]
}],
"title": "Gokko"
}]
But I am getting images node as separate node it must be within chapters node of volume:
[{
"cover": "http://media.gmanga.me/uploads/manga/cover/10581/medium_I2.5HFzVh7e.png",
"link": "http://gmanga.me/mangas/godess-creation-system",
"volumes": [{
"name": "\u0627\u0644\u0645\u062c\u0644\u062f ",
"chapters": [{
"link": "/mangas/godess-creation-system/1/ayou-cahn",
"name": "1"
}]
}],
"title": "Godess Creation System"
},
{
"images": ["http://media.gmanga.me/uploads/releases/lolly-pop/047-20160111235059UXYGJACW/01.jpg?ak=p0skml", "http://media.gmanga.me/uploads/releases/lolly-pop/047-20160111235059UXYGJACW/02.jpg?ak=p0skml", "http://media.gmanga.me/uploads/releases/lolly-pop/047-20160111235059UXYGJACW/03.jpg?ak=p0skml", "http://media.gmanga.me/uploads/releases/lolly-pop/047-20160111235059UXYGJACW/04.jpg?ak=p0skml"],
"link": "/mangas/reversal/1/Lolly-Pop",
"name": "1"
}]
Each function is individually fetching data properly, the only issue is JSON formation. It is not writing to json file properly. Please lead me where I am wrong.