beautifulsoup delete elements other than required element in xml tree - python

I am using BeautifulSoup to delete an element from an XML document. It deletes the required tag, but it also removes some other information from the document that is not related to that element. How do I stop this?
Code to reproduce:
from bs4 import BeautifulSoup

with open(r'C:\Ashok\sample.xml', 'r') as text_file:
    s = text_file.read()

soup = BeautifulSoup(s, 'xml')
u = soup.find('Version', text='29.2.3')  # the <Version> tag holding this text
fed = u.find_parent()                    # its parent element
fed.decompose()                          # remove the parent together with its children

with open(r'C:\Ashok\sample.xml', 'w') as f:
    f.write(str(soup))
Find the comparison attached; the deleted extra information is shown in red rectangles.
The code is also changing the Header and Footer tags, which I did not ask it to do.

What happens?
The empty elements are not deleted; only their notation is transformed.
Empty elements in XML
An element with no content is empty, and in XML you can indicate an empty element like this:
<element></element>
An alternative notation is the so-called self-closing tag:
<element />
Both forms are treated identically by XML readers and parsers.
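For illustration, a quick way to see this with BeautifulSoup's XML parser (requires lxml; the exact serialization may vary slightly):
from bs4 import BeautifulSoup

xml = "<root><element></element></root>"
soup = BeautifulSoup(xml, "xml")  # lxml's XML parser

# The re-serialized output typically writes the empty element as <element/>,
# which is the same XML, just in self-closing notation:
print(soup)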

Related

Python: BeautifulSoup Pulling/Parsing data from within html tag

I'm attempting to pull sporting data from a URL using Beautiful Soup in Python. The issue I'm having with this data source is that the data appears within the HTML tag itself. Specifically, this tag is titled "match".
I'm after the players data, which seems to be in XML format. However, this data is appearing within the "match" tag itself rather than as the content between the start and end tags.
So like this:
print(soup.match)
Returns: (not going to include all the text):
<match :matchdata='{"match":{"id":"5dbb8e20-6f37-11eb-924a-1f6b8ad68.....ALL DATA HERE....>
</match>
Because of this, when I try to output the contents as text, it returns nothing.
print(soup.match.text)
Returns: nothing
How would I extract this data from within the "match" HTML tag? After this I would like to save it as an XML file, or even better, a CSV file would be ideal.
My Python program from the beginning is:
from bs4 import BeautifulSoup
import requests
url="___MY_URL_HERE___"
# Make a GET request for html content
html_content = requests.get(url).text
# Parse the html content
soup = BeautifulSoup(html_content, "lxml")
## type(soup)
## <class 'bs4.BeautifulSoup'>
print(soup.match)
Thanks a lot!
A tag may have any number of attributes. The tag <b id="boldest"> has an attribute "id" whose value is "boldest". You can access a tag's attributes by treating the tag like a dictionary:
https://www.crummy.com/software/BeautifulSoup/bs4/doc/#attributes
So in your case
print(soup.match[":matchdata"])
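If the attribute value is JSON, as the snippet in the question suggests, a hedged sketch for turning it into Python data (the markup below is a made-up stand-in for the real page):
import json
from bs4 import BeautifulSoup

# Stand-in markup; the real page carries the full match data in this attribute.
html_content = '<match :matchdata=\'{"match": {"id": "5dbb8e20", "players": []}}\'></match>'

soup = BeautifulSoup(html_content, "html.parser")
raw = soup.match[":matchdata"]   # attribute values come back as plain strings
data = json.loads(raw)           # assumes the value is valid JSON
print(data["match"]["id"])       # -> 5dbb8e20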

Parsing the html of the child element [BeautifulSoup]

I have only been learning Python for two weeks.
I'm scraping an XML file, and one of the elements in the loop [item->description] has HTML inside. How could I get the text inside the p tags?
url="https://www.milenio.com/rss"
source=requests.get(url)
soup=BeautifulSoup(source.content, features="xml")
items=soup.findAll('item')
for item in items:
html_text=item.description
# This returns HTML code: <p>Paragraph 1</p> <p>Paragraph 2</p>
This next line could work, BUT I get some internal and external links and images, which aren't required.
desc=item.description.get_text()
So if I make a loop trying to get all the p tags, it doesn't work.
for p in html_text.find_all('p'):
    print(p)
AttributeError: 'NoneType' object has no attribute 'find_all'
Thank you so much!
The issue is how bs4 processes CData (it's pretty well documented, but not really solved).
You'll need to import CData from bs4, which helps extract the CData as a string, and use the html.parser library. From there, create a new bs4 object from that string so it has a findAll attribute, and iterate over its contents.
from bs4 import BeautifulSoup, CData
import requests

url = "https://www.milenio.com/rss"
source = requests.get(url)
soup = BeautifulSoup(source.content, 'html.parser')
items = soup.findAll('item')
for item in items:
    html_text = item.description
    findCdata = html_text.find(text=lambda tag: isinstance(tag, CData))
    newSoup = BeautifulSoup(findCdata, 'html.parser')
    paragraphs = newSoup.findAll('p')
    for p in paragraphs:
        print(p.get_text())
Edit:
OP needed to extract the link text and found that to only be possible inside the item loop, using link = item.link.nextSibling, because the link content was jumping outside of its tag, like so: </link>http://www.... In XML tree view this particular XML doc showed a drop-down for the link element, which is likely the cause.
To get content from other tags inside the document that don't show a drop-down in XML tree view and don't have nested CData, convert the tag name to lowercase and get the text as usual:
item.pubdate.get_text() # Gets contents of the tag <pubDate>
item.author.get_text() # Gets contents of the tag <author>
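A minimal, self-contained sketch of the workaround described in the edit above (same feed URL as the question; the nextSibling step assumes the URL really does end up just outside the emptied <link> tag, as described):
from bs4 import BeautifulSoup
import requests

source = requests.get("https://www.milenio.com/rss")
soup = BeautifulSoup(source.content, "html.parser")

for item in soup.findAll("item"):
    # html.parser treats <link> as a void tag, so the URL text lands
    # just after it and is reachable as the tag's next sibling.
    link_text = item.link.nextSibling
    pub_date = item.pubdate.get_text()   # contents of <pubDate>
    print(link_text, pub_date)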
Alternatively, the loop should look like this:
for item in items:
    html_text = item.description  # ??
    # !! don't use html_text.find_all !!
    for p in item.find_all('p'):
        print(p)

Writing URLs via Beautifulsoup to a csv file vertically

I have a project for one of my college classes that requires me to pull all URLs from a page on the U.S. Census Bureau website and store them in a CSV file. For the most part I've figured out how to do that, but for some reason when the data gets appended to the CSV file, all the entries are being inserted horizontally. I would expect the data to be arranged vertically, meaning row 1 has the first item in the list, row 2 has the second item, and so on. I have tried several approaches, but the data always ends up as a horizontal representation. I am new to Python and obviously don't have a firm enough grasp on the language to figure this out. Any help would be greatly appreciated.
I am parsing the website using BeautifulSoup4 and the requests library. Pulling all the 'a' tags from the website was easy enough, and getting the URLs from those 'a' tags into a list was pretty clear as well. But when I append the list to my CSV file with a writerow call, all the data ends up in one row as opposed to a separate row for each URL.
import requests
import csv
from bs4 import BeautifulSoup
from pprint import pprint

page = requests.get('https://www.census.gov/programs-surveys/popest.html')
soup = BeautifulSoup(page.text, 'html.parser')

## Create list to append web data to
links = []
# Pull text from all instances of <a> tag within BodyText div
AllLinks = soup.find_all('a')
for link in AllLinks:
    links.append(link.get('href'))

with open("htmlTable.csv", "w") as f:
    writer = csv.writer(f)
    writer.writerow(links)

pprint(links)
Try this:
import requests
import csv
from bs4 import BeautifulSoup

page = requests.get('https://www.census.gov/programs-surveys/popest.html')
soup = BeautifulSoup(page.text, 'html.parser')

## Create list to append web data to
links = []
# Pull text from all instances of <a> tag within BodyText div
AllLinks = soup.find_all('a')
for link in AllLinks:
    links.append(link.get('href'))

with open("htmlTable.csv", "w") as f:
    writer = csv.writer(f)
    for link in links:
        if isinstance(link, str):
            f.write(link + "\n")
I changed it to check whether a given link was indeed a string and if so, add a newline after it.
Try making a list of lists by appending each URL inside its own list:
links.append([link.get('href')])
Then the CSV writer will put each inner list on its own row with writerows:
writer.writerows(links)
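Put together, a minimal sketch of that list-of-lists approach (same URL as the question; untested against the live page):
import csv
import requests
from bs4 import BeautifulSoup

page = requests.get('https://www.census.gov/programs-surveys/popest.html')
soup = BeautifulSoup(page.text, 'html.parser')

# One-element list per href so writerows puts each URL on its own row
links = [[a.get('href')] for a in soup.find_all('a') if a.get('href')]

with open("htmlTable.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerows(links)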

Delete a certain tag with a certain id content from an HTML using python BeautifulSoup

I got a suggestion to use BeautifulSoup to delete a tag with a certain id from an HTML file. For example, deleting <div id=needDelete>...</div>. Below is my code, but it doesn't seem to be working correctly:
import os, re
from bs4 import BeautifulSoup

cwd = os.getcwd()
print ('Now you are at this directory: \n' + cwd)

# find files that have an HTML extension
Files = os.listdir(cwd)
print Files

def func(file):
    for file in os.listdir(cwd):
        if file.endswith('.html'):
            print ('HTML files are \n' + file)
            f = open(file, "r+")
            soup = BeautifulSoup(f, 'html.parser')
            matches = str(soup.find_all("div", id="jp-post-flair"))
            # The soup.find_all part should be correct, as I tested it by
            # printing the matches and the result matches the text I want to delete.
            f.write(f.read().replace(matches, ''))
            # maybe the above line isn't correct
            f.close()

func(file)
Would you help check which part of the code is wrong, and maybe how I should approach it?
Thank you very much!!
You can use the .decompose() method to remove the element/tag:
f = open(file, "r+")
soup = BeautifulSoup(f, 'html.parser')

elements = soup.find_all("div", id="jp-post-flair")
for element in elements:
    element.decompose()

f.write(str(soup))
It's also worth mentioning that you can probably just use the .find() method because an id attribute should be unique within a document (which means that there will likely only be one element in most cases):
f = open(file, "r+")
soup = BeautifulSoup(f, 'html.parser')

element = soup.find("div", id="jp-post-flair")
if element:
    element.decompose()

f.write(str(soup))
As an alternative, based on the comments below:
If you only want to parse and modify part of the document, BeautifulSoup has a SoupStrainer class that allows you to selectively parse parts of the document.
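For illustration, a minimal SoupStrainer sketch (the id matches the question; the sample markup is made up):
from bs4 import BeautifulSoup, SoupStrainer

html_doc = '<html><body><p>keep</p><div id="jp-post-flair">sharing widgets</div></body></html>'

# Only elements matching the strainer are parsed into the soup at all.
only_flair = SoupStrainer("div", id="jp-post-flair")
partial = BeautifulSoup(html_doc, "html.parser", parse_only=only_flair)
print(partial)   # -> <div id="jp-post-flair">sharing widgets</div>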
You mentioned that the indentation and formatting in the HTML file were being changed. Rather than just converting the soup object directly into a string, you can check out the relevant output formatting section in the documentation.
Depending on the desired output, here are a few potential options:
soup.prettify(formatter="minimal")
soup.prettify(formatter="html")
soup.prettify(formatter=None)

How to extract text from a webpage using python 2.7?

I'm trying to programmatically extract text from this webpage which describes a genome assembly in the public archive:
http://www.ebi.ac.uk/ena/data/view/ERS019623
I have thousands of assemblies that I want to track down and extract the study accession, which is the code on the far left of the table beginning with "PRJ". The URL for each of these assemblies is of the same format as the one above, i.e. "http://www.ebi.ac.uk/ena/data/view/ERS******". I have the ERS code for each of my assemblies so I can construct the URL for each one.
I've tried a few different methods. Firstly, if you add "&display=XML" to the end of the URL, it prints the XML (or at least I'm presuming that it's printing the XML for the entire page); the problem is that the study accession "PRJ******" is nowhere to be seen there. I had utilised this to extract another code that I needed from the same webpage, the run accession, which is always of the format "ERR******", using the code below:
import urllib2
from bs4 import BeautifulSoup
import re
import csv

with open('/Users/bj5/Desktop/web_scrape_test.csv', 'rb') as f:
    reader = csv.reader(f)  # opens csv containing list of ERS numbers
    for row in reader:
        sample = row[0]  # reads index 0 (1st column)
        ERSpage = "http://www.ebi.ac.uk/ena/data/view/" + sample + "&display=xml"  # creates URL using ERS number
        page = urllib2.urlopen(ERSpage)  # opens url and assigns it to variable page
        soup = BeautifulSoup(page, "html.parser")  # parses the html/xml from page into variable soup
        page_text = soup.text  # returns text from soup, i.e. no tags
        ERS = re.search('ERS......', page_text, flags=0).group(0)  # returns first ERS followed by six wildcards
        ERR = re.search('ERR......', page_text, flags=0).group(0)  # returns first ERR followed by six wildcards
        print ERS + ',' + ERR + ',' + "http://www.ebi.ac.uk/ena/data/view/" + sample  # prints ERS,ERR,URL
This worked very well, but as the study accession is not in the XML I can't use it to access this.
I also attempted to use BeautifulSoup again to download the HTML by doing this:
from bs4 import BeautifulSoup
from urllib2 import urlopen

BASE_URL = "http://www.ebi.ac.uk/ena/data/view/ERS019623"

def get_category_links(section_url):
    html = urlopen(section_url).read()
    soup = BeautifulSoup(html, "lxml")
    print soup

get_category_links(BASE_URL)
But again I can't see the study accession in the output from this either...
I have also attempted to use a different python module, lxml, to parse the XML and HTML but haven't had any luck there either.
When I right click and inspect element on the page I can find the study accession by doing ctrl+F -> PRJ.
So my question is this: what is the code that I'm looking at in inspect element, XML or HTML (or something else)? Why does it look different to the code that prints in my console when I try and use BeautifulSoup to parse HTML? And finally how can I scrape the study accessions (PRJ******) from these webpages?
(I've only been coding for a couple of months and I'm entirely self-taught so apologies for the slightly confused nature of this question but I hope I've got across what it is that I'm trying to do. Any suggestions or advice would be much appreciated.)
from bs4 import BeautifulSoup
import requests
import re
r = requests.get('http://www.ebi.ac.uk/ena/data/view/ERS019623&display=xml')
soup = BeautifulSoup(r.text, 'lxml')
ERS = soup.find('primary_id').text
ERR = soup.find('id', text=re.compile(r'^ERR')).text
url = 'http://www.ebi.ac.uk/ena/data/view/{}'.format(ERS)
print(ERS, ERR, url)
out:
ERS019623 ERR048142 http://www.ebi.ac.uk/ena/data/view/ERS019623
bs4 can parse an XML file; just treat it like HTML, and it is handled the same way, so there is no need to use a regex to extract the info.
I found a TEXT download link:
http://www.ebi.ac.uk/ena/data/warehouse/filereport?accession=ERS019623&result=read_run&fields=study_accession,sample_accession,secondary_sample_accession,experiment_accession,run_accession,tax_id,scientific_name,instrument_model,library_layout,fastq_ftp,fastq_galaxy,submitted_ftp,submitted_galaxy,sra_ftp,sra_galaxy,cram_index_ftp,cram_index_galaxy&download=txt
This link's fields can be changed to get the data you want, like this:
http://www.ebi.ac.uk/ena/data/warehouse/filereport?accession=ERS019623&result=read_run&fields=study_accession&download=txt
By doing so, you can get all your data in a text file.
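For illustration, a hedged sketch of reading the study accession out of that report (assuming the endpoint returns a tab-separated table with a header row):
import csv
import requests

url = ("http://www.ebi.ac.uk/ena/data/warehouse/filereport"
       "?accession=ERS019623&result=read_run"
       "&fields=study_accession&download=txt")

response = requests.get(url)
# The report is a tab-separated table; the first line is the header.
reader = csv.DictReader(response.text.splitlines(), delimiter="\t")
for row in reader:
    print(row["study_accession"])   # expected to be the PRJ... accession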
In your sample, soup is a BeautifulSoup object: a representation of the parsed document.
If you want to print the entire HTML of the document, you can call print(soup.prettify()), or if you want the text within it, print(soup.get_text()).
The soup object offers other ways to access the parts of the document you are interested in: navigating the parsed tree, searching in it, and so on.
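A few of those access patterns as a minimal sketch (the markup here is illustrative, not the ENA page):
from bs4 import BeautifulSoup

html = "<html><body><h1>Title</h1><p class='x'>first</p><p>second</p></body></html>"
soup = BeautifulSoup(html, "html.parser")

print(soup.get_text())                              # all of the text, no tags
print(soup.find("p", class_="x"))                   # first <p class="x"> element
print([p.get_text() for p in soup.find_all("p")])   # text of every <p>
print(soup.body.h1.get_text())                      # navigate the tree by attribute access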
