I'm attempting to use BeautifulSoup to compose a webpage.
When I set a tag's inner content via its string property, the string is automatically escaped. I haven't found a way, such as an html method or attribute, to stop BeautifulSoup from escaping everything.
from bs4 import BeautifulSoup
f = open("template.html", "r")
soup = BeautifulSoup(f.read(), 'html.parser')
f.close()
x = soup.find("div", id="example")
x.string = "<div>example</div>"
# x's contents...
# <div id="example">&lt;div&gt;example&lt;/div&gt;</div>
It's apparent that BS is more often used for scraping HTML than for building it – is there a common library for building HTML instead?
You should try Jinja. Then you can render templates like this:
from jinja2 import Template
t = Template('<div id="example">{{example_div}}</div>')
t.render(example_div='<div>example</div>')
Resulting in:
'<div id="example"><div>example</div></div>'
Of course, you can also read the template from a file:
with open('template.html', 'r') as f:
t = Template(f.read())
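Then render it just as before; a minimal sketch putting the two steps together (assuming template.html contains the {{example_div}} placeholder from above):
from jinja2 import Template
# Assumes template.html contains: <div id="example">{{example_div}}</div>
with open('template.html', 'r') as f:
    t = Template(f.read())
print(t.render(example_div='<div>example</div>'))
# <div id="example"><div>example</div></div>
Note that a plain jinja2.Template does not autoescape by default, which is exactly why the inner HTML survives intact here.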
There is a website for an online-game item where I need to obtain the item's owners, and from research it seems I need to do some 'web scraping' to get this data. But the information is in JavaScript code, not an easily parseable HTML document like the ones bs4 can easily extract information from. So I need to get a variable from this JavaScript (it contains a list of owners of the item I'm looking at) and turn it into a usable list/JSON/string I can use in my program. Is there a way I can do this? If so, how?
I've attached an image of the variable I need when viewing the page source of the site I'm on.
My current code:
from bs4 import BeautifulSoup
import requests

html = requests.get('https://www.rolimons.com/item/1029025').content  # the item webpage
soup = BeautifulSoup(html, "lxml")
datas = soup.find_all("script")
print(datas)  # prints the sections of the website content that contain javascript
To scrape a JavaScript variable, you can't use BeautifulSoup alone; a regular expression (re) is also required.
Use ast.literal_eval to convert the string representation of a dict into an actual dict.
from bs4 import BeautifulSoup
import requests
import re
import ast
html = requests.get('https://www.rolimons.com/item/1029025').content #the item webpage
soup = BeautifulSoup(html, "lxml")
ownership_data = re.search(r'ownership_data\s+=\s+.*;', soup.text).group(0)
ownership_data_dict = ast.literal_eval(ownership_data.split('=')[1].strip().replace(';', ''))
print(ownership_data_dict)
Output:
{'id': 1029025, 'num_points': 1616, 'timestamps': [1491004800,
1491091200, 1491177600, 1491264000, 1491350400, 1491436800,
1491523200, 1491609600, 1491696000, 1491782400, 1491868800,
1491955200, 1492041600, 1492128000, 1492214400, 1492300800,
1492387200, 1492473600, 1492560000, 1492646400, 1492732800,
1492819200, ...}
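A second answer takes the same approach but parses the captured value as JSON instead (history_data is presumably another variable embedded in the same page; the URL was elided in the original):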
import requests
import json
import re
r = requests.get('...')
m = re.search(r'var history_data\s+=\s+(.*)', r.text)
print(json.loads(m.group(1)))
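A side note on the two approaches: ast.literal_eval accepts Python literal syntax (single quotes, True/False/None), while json.loads requires strict JSON (double quotes, true/false/null), so which one works depends on how the variable is serialized in the page.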
I am using both selenium and BeautifulSoup in order to do some web scraping. I have managed to put together the following piece of code:
from selenium.webdriver import Chrome
from bs4 import BeautifulSoup
url = 'https://www.renfe.com/content/renfe/es/es/cercanias/cercanias-valencia/lineas/jcr:content/root/responsivegrid/rftabdetailline/item_1591014181985.html'
driver = Chrome()  # missing in the original snippet; needed before driver.get
driver.get(url)
soup = BeautifulSoup(driver.page_source, 'lxml')
The output soup produces has the following structure:
<html>
<head>
</head>
<body>
<rf-list-detail line-color="245,150,40" line-number="C2" line-text="Línea C2"
 list="[{..., &quot;direction&quot;:&quot;Place1&quot;},
 ...,
 {..., &quot;direction&quot;:&quot;Place2&quot;}, ...]">
Note that both the text and the output structure have been modified for readability. I attach an image of the actual output just in case it is more convenient.
Does anyone know how I could obtain every PlaceN (in the image, Moixent would be Place1) in a list? Something like
places = [Place1,...,PlaceN]
I have tried parsing it, but as it has no tags (or at least my HTML knowledge, which is barely any, says so) I obtain nothing. I have also tried using regular expressions, which I have just found out are a thing, but I am not sure how to use them properly.
Any thoughts?
Thank you in advance!!
[image: output of soup]
This site responds with a non-HTML structure, so you don't need an HTML parser like BeautifulSoup or lxml for this task.
Here is an example using the requests library. You can install it like this:
pip install requests
import requests
import html
import json
url = 'https://www.renfe.com/content/renfe/es/es/cercanias/cercanias-valencia/lineas/jcr:content/root/responsivegrid/rftabdetailline/item_1591014181985.html'
response = requests.get(url)
data = response.text # get data from site
raw_list = data.split("'")[1] # extract rf-list-detail.list attribute
json_list = html.unescape(raw_list) # decode html symbols
parsed_list = json.loads(json_list) # parse json
print(parsed_list) # printing result
directions = []
for item in parsed_list:
    directions.append(item["direction"])
print(directions)  # extracting directions
# ['Moixent', 'Vallada', 'Montesa', "L'Alcudia de Crespins", 'Xàtiva', "L'Enova-Manuel", 'La Pobla Llarga', 'Carcaixent', 'Alzira', 'Algemesí', 'Benifaió-Almussafes', 'Silla', 'Catarroja', 'Massanassa', 'Alfafar-Benetússer', 'València Nord']
I'm trying to remove all script tags from HTML files using BeautifulSoup. The problem is that in some cases the HTML files have no opening tags for table rows (there are only </tr> tags at the end of each row), and BeautifulSoup seems to be removing them since they are incomplete. This messes up the formatting of the table as a result. Is there any other way to remove these script tags without messing up the formatting?
import os
from pathlib import Path
from bs4 import BeautifulSoup
root_dir = os.path.join(Path().absolute(), 'newfolder\\')
for path in Path(root_dir).iterdir():
    if path.is_file():
        htmlfile = open(path, encoding="utf-8").read()
        soup = BeautifulSoup(htmlfile)
        to_be_removed = soup.find_all("script")
        for x in to_be_removed:
            x.extract()
        html = soup.prettify("utf-8")
        with open(path, "wb") as file:
            file.write(html)
This is happening due to the parser used by BeautifulSoup to read your HTML document, at soup = BeautifulSoup(htmlfile).
When no parser is specified, BeautifulSoup picks lxml if it is installed, and lxml assumes that your HTML is valid. When that is not the case, I would suggest looking into a more lenient parser - the documentation is a great source of information for this.
To be specific, you could use html.parser, as it's more lenient, and finally try writing the output without entity substitution, by using:
html = soup.prettify(formatter=None)
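Applied to the loop above, a minimal sketch of both suggestions together (html.parser plus formatter=None; the directory handling mirrors the original):
from pathlib import Path
from bs4 import BeautifulSoup

for path in Path('newfolder').iterdir():
    if path.is_file():
        htmlfile = path.read_text(encoding="utf-8")
        # html.parser tolerates the missing opening <tr> tags better
        soup = BeautifulSoup(htmlfile, "html.parser")
        for script in soup.find_all("script"):
            script.extract()
        # formatter=None writes the markup back without entity substitution
        path.write_text(soup.prettify(formatter=None), encoding="utf-8")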
I'm trying to programmatically extract text from this webpage which describes a genome assembly in the public archive:
http://www.ebi.ac.uk/ena/data/view/ERS019623
I have thousands of assemblies that I want to track down and extract the study accession, which is the code on the far left of the table beginning with "PRJ". The URL for each of these assemblies is of the same format as the one above, i.e. "http://www.ebi.ac.uk/ena/data/view/ERS******". I have the ERS code for each of my assemblies so I can construct the URL for each one.
I've tried a few different methods. Firstly, if you add "&display=XML" to the end of the URL it prints the XML (or at least I'm presuming that it's printing the XML for the entire page; the problem is that the study accession "PRJ******" is nowhere to be seen in it). I had used this to extract another code that I needed from the same webpage, the run accession, which is always of the format "ERR******", using the code below:
import urllib2
from bs4 import BeautifulSoup
import re
import csv
with open('/Users/bj5/Desktop/web_scrape_test.csv','rb') as f:
    reader = csv.reader(f)  # opens csv containing list of ERS numbers
    for row in reader:
        sample = row[0]  # reads index 0 (first column of the row)
        ERSpage = "http://www.ebi.ac.uk/ena/data/view/" + sample + "&display=xml"  # creates URL using the ERS number
        page = urllib2.urlopen(ERSpage)  # opens url and assigns it to variable page
        soup = BeautifulSoup(page, "html.parser")  # parses the html/xml from page into soup
        page_text = soup.text  # returns text from soup, i.e. no tags
        ERS = re.search('ERS......', page_text, flags=0).group(0)  # returns first ERS followed by six wildcards
        ERR = re.search('ERR......', page_text, flags=0).group(0)  # returns first ERR followed by six wildcards
        print ERS + ',' + ERR + ',' + "http://www.ebi.ac.uk/ena/data/view/" + sample  # prints ERS,ERR,URL
This worked very well, but as the study accession is not in the XML I can't use it to access this.
I also attempted to use BeautifulSoup again to download the HTML by doing this:
from bs4 import BeautifulSoup
from urllib2 import urlopen
BASE_URL = "http://www.ebi.ac.uk/ena/data/view/ERS019623"
def get_category_links(section_url):
    html = urlopen(section_url).read()
    soup = BeautifulSoup(html, "lxml")
    print soup

get_category_links(BASE_URL)
But again I can't see the study accession in the output from this either...
I have also attempted to use a different python module, lxml, to parse the XML and HTML but haven't had any luck there either.
When I right click and inspect element on the page I can find the study accession by doing ctrl+F -> PRJ.
So my question is this: what is the code that I'm looking at in inspect element, XML or HTML (or something else)? Why does it look different to the code that prints in my console when I try and use BeautifulSoup to parse HTML? And finally how can I scrape the study accessions (PRJ******) from these webpages?
(I've only been coding for a couple of months and I'm entirely self-taught so apologies for the slightly confused nature of this question but I hope I've got across what it is that I'm trying to do. Any suggestions or advice would be much appreciated.)
from bs4 import BeautifulSoup
import requests
import re
r = requests.get('http://www.ebi.ac.uk/ena/data/view/ERS019623&display=xml')
soup = BeautifulSoup(r.text, 'lxml')
ERS = soup.find('primary_id').text
ERR = soup.find('id', text=re.compile(r'^ERR')).text
url = 'http://www.ebi.ac.uk/ena/data/view/{}'.format(ERS)
print(ERS, ERR, url)
Output:
ERS019623 ERR048142 http://www.ebi.ac.uk/ena/data/view/ERS019623
bs4 can parse XML files too; just treat them like HTML, it all works the same, so there is no need to use a regex to extract the info.
I found a TEXT download link:
http://www.ebi.ac.uk/ena/data/warehouse/filereport?accession=ERS019623&result=read_run&fields=study_accession,sample_accession,secondary_sample_accession,experiment_accession,run_accession,tax_id,scientific_name,instrument_model,library_layout,fastq_ftp,fastq_galaxy,submitted_ftp,submitted_galaxy,sra_ftp,sra_galaxy,cram_index_ftp,cram_index_galaxy&download=txt
this link's fields can be changed to get the data you want, like this:
http://www.ebi.ac.uk/ena/data/warehouse/filereport?accession=ERS019623&result=read_run&fields=study_accession&download=txt
By doing so, you can get all your data in a text file.
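For instance, a minimal sketch fetching just the study accession for one ERS code (that the response is a small tab-separated table with a header line is an assumption about this endpoint):
import requests

# Request only the study_accession field for a single ERS accession
url = ("http://www.ebi.ac.uk/ena/data/warehouse/filereport"
       "?accession=ERS019623&result=read_run"
       "&fields=study_accession&download=txt")
response = requests.get(url)
print(response.text)  # expected: a header line, then the PRJ accession(s)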
In your sample, soup is a BeautifulSoup object: a representation of the parsed document.
If you want to print the entire HTML of the document, you can call print(soup.prettify()), or if you want the text within it, print(soup.get_text()).
The soup object has other ways to access the parts of the document you are interested in: navigating the parsed tree, searching in it, and so on.
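For instance (the tag names here are purely illustrative):
print(soup.title)  # the first <title> tag, if any
for link in soup.find_all('a'):  # search the whole parsed tree
    print(link.get('href'))  # attribute access; None if absent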
I am trying to do the following: wrangle my own Facebook page for all my friends' names. I initially stored my webpage as a local file and then parsed it using BeautifulSoup, and I used the find_all method to look for all the elements of the following pattern: <div class="name">friend name</div>. However, when I run the code, I get an empty list containing no matches. So what do you think is the problem?
Thanks
Code snippet:
from bs4 import BeautifulSoup
page = "myFacebook.html" with open(page, "r") as html:
page_html = BeautifulSoup(html, "lxml")
print page_html.find_all(attrs = {"class" : "name"})
I was able to extract the required information using Python's regular expression library, re.
The following is a simple script to pull friend names from a Facebook page stored locally as an HTML file:
import re

page = "friends.html"
f = open(page, 'r')
s = f.read()
f.close()

regX = r'<div class="fsl fwb fcb">.*?</div>'  # the ? makes the match less greedy
for elem in re.findall(regX, s):
    print re.findall(r'<a.*>(.*)</a>', elem)[0]  # getting the name text out of the anchor tag
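For comparison, a minimal BeautifulSoup sketch of the same extraction, reusing the class names from the regex above (whether those classes match your saved page is an assumption):
from bs4 import BeautifulSoup

with open("friends.html", "r") as f:
    soup = BeautifulSoup(f, "lxml")

# CSS selector: divs carrying all three classes used in the regex
for div in soup.select("div.fsl.fwb.fcb"):
    a = div.find("a")
    if a:
        print(a.get_text())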