How do you get the text from an HTML 'datacell' using BeautifulSoup - python

I have been trying to extract some data from HTML files. I have the logic coded to get the right cells, but now I am struggling to get the actual contents of a cell.
Here is my HTML snippet:
headerRows[0][10].contents
[<font size="+0"><font face="serif" size="1"><b>Apples Produced</b><font size="3">
</font></font></font>]
Note that this is a Python list ([]) containing a single item.
I need the value Apples Produced but can't get to it.
Any suggestions would be appreciated.
Suggestions on a good book that explains this would earn my eternal gratitude.
Thanks for that answer. However, isn't there a more general answer? What happens if my cell doesn't have a bold tag?
Say it is:
[<font size="+0"><font face="serif" size="1"><I>Apples Produced</I><font size="3">
</font></font></font>]
Apples Produced
I am trying to learn to read/understand the documentation and your response will help
I really appreciate this help. The best thing about these answers is that it is a lot easier to generalize from them than it has been from the BeautifulSoup documentation. I learned to program in the Fortran era and now I am learning Python, and I am amazed at its power. BeautifulSoup is an example. Making a coherent whole of the documentation is tough for me.
Cheers

The BeautifulSoup documentation should cover everything you need - in this case it looks like you want to use findNext:
headerRows[0][10].findNext('b').string
A more generic solution which doesn't rely on the <b> tag would be to use the text argument to findAll, which allows you to search only for NavigableString objects:
>>> s = BeautifulSoup(u'<p>Test 1 <span>More</span> Test 2</p>')
>>> u''.join([s.string for s in s.findAll(text=True)])
u'Test 1 More Test 2'
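The idea behind findAll(text=True) is to collect every text node and ignore which tags wrap it. Below is a minimal sketch of that same idea using only the standard library's html.parser, so it is a swapped-in technique rather than BeautifulSoup itself. The names TextCollector and cell_text are made up for illustration. In BeautifulSoup 4 the closest built-in is get_text().

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collects all non-blank text nodes, whatever tags surround them."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Skip whitespace-only nodes such as the newline inside <font size="3">
        if data.strip():
            self.chunks.append(data.strip())


def cell_text(html):
    # Hypothetical helper: returns the visible text of an HTML fragment
    parser = TextCollector()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


bold = ('<font size="+0"><font face="serif" size="1"><b>Apples Produced</b>'
        '<font size="3">\n</font></font></font>')
italic = bold.replace("<b>", "<I>").replace("</b>", "</I>")

print(cell_text(bold))
print(cell_text(italic))
```

Both calls give the same text, because nothing in the code depends on whether the cell uses b or I.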

headerRows[0][10].contents[0].find('b').string

I have a base class, which I use to extend all Beautiful Soup classes, with a bunch of methods that help me get at the text within a group of elements whose structure I don't necessarily want to rely on. One of those methods is the following:
import re
def clean(self, val):
    if not isinstance(val, str): val = str(val)
    val = re.sub(r'<.*?>', '', val) # remove tags
    val = re.sub(r"\s+", " ", val) # collapse internal whitespace
    return val.strip() # remove leading & trailing whitespace

Related

python3.6 How do I regex a url from a .txt?

I need to grab a url from a text file.
The URL is stored in a string like so: 'URL=http://example.net'.
Is there any way I could grab everything after the = char up until the . in '.net'?
Could I use the re module?
text = """A key feature of effective analytics infrastructure in healthcare is a metadata-driven architecture. In this article, three best practice scenarios are discussed: https://www.healthcatalyst.com/clinical-applications-of-machine-learning-in-healthcare Automating ETL processes so data analysts have more time to listen and help end users , https://www.google.com/, https://www.facebook.com/, https://twitter.com
code below catches all urls in text and returns urls in list."""
import re
urls = re.findall(r'(?:(?:https?|ftp)://)?[\w/\-?=%.]+\.[\w/\-?=%.]+', text)
print(urls)
output:
[
'https://www.healthcatalyst.com/clinical-applications-of-machine-learning-in-healthcare',
'https://www.google.com/',
'https://www.facebook.com/',
'https://twitter.com'
]
I don't have much information, but I will try to help with what I've got. I'm assuming that URL= is part of the string; in that case you can do this:
re.findall(r'URL=(.*?)\.', STRINGNAMEHERE)
Let me go into more detail about (.*?): the dot means any character (except a newline), the star means zero or more occurrences, and the ? placed after the star makes the match non-greedy, so it consumes as few characters as possible. The parentheses place it all into a group. The trailing \. is an escaped dot, so it matches a literal '.' rather than any character. All together, this finds everything in between URL= and the first '.'.
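To make the difference between the lazy, greedy and unescaped variants concrete, here is a small sketch. The sample line with a www. prefix is an assumption added for illustration, since it is the case where the variants disagree.

```python
import re

line = "URL=http://www.example.net"  # illustrative input, not from the question

# Lazy group + escaped dot: stops at the FIRST literal '.'
lazy = re.findall(r"URL=(.*?)\.", line)

# Greedy group + escaped dot: runs up to the LAST literal '.'
greedy = re.findall(r"URL=(.*)\.", line)

# Unescaped dot: '.' matches any character, so the lazy group stays empty
unescaped = re.findall(r"URL=(.*?).", line)

print(lazy)       # ['http://www']
print(greedy)     # ['http://www.example']
print(unescaped)  # ['']
```

If the goal is "everything up to the dot in .net", the greedy form is the one that survives host names containing more than one dot.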
You don't need RegEx'es (the re module) for such a simple task.
If the string you have is of the form:
'URL=http://example.net'
Then you can solve this using basic Python in numerous ways, one of them being:
file_line = 'URL=http://example.net'
start_position = file_line.find('=') + 1 # this gives you the first position after =
end_position = file_line.find('.')
# this extracts from the start_position up to but not including end_position
url = file_line[start_position:end_position]
Of course, this only extracts one URL. Assuming that you're working with a large text from which you want to extract all URLs, you'll want to put this logic into a function so that you can reuse it and build around it (iterate with a while or for loop and, depending on how you're iterating, keep track of the position of the last extracted URL, and so on).
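One way the "put it into a function and iterate" advice could look is sketched below. The function name extract_urls, the URL= prefix default and the sample lines are assumptions for illustration. It uses str.rpartition to cut at the last dot, which differs from find('.') above when the host name contains several dots.

```python
def extract_urls(lines, prefix="URL="):
    """Return the part between the prefix and the last '.' for each matching line."""
    results = []
    for line in lines:
        line = line.strip()
        if not line.startswith(prefix):
            continue  # ignore lines that do not hold a URL entry
        value = line[len(prefix):]
        head, dot, _suffix = value.rpartition(".")
        # If there is no dot at all, keep the whole value instead of an empty string
        results.append(head if dot else value)
    return results


sample = ["URL=http://example.net", "# a comment line", "URL=http://www.example.org"]
print(extract_urls(sample))
```

With a real file, passing the open file object works the same way, since iterating over it yields one line at a time.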
Word of advice
This question has been answered quite a lot on this forum, by very skilled people, in numerous ways, for instance: here, here, here and here, to a level of detail that you'd be amazed. And these are not all, I just picked the first few that popped up in my search results.
Given that (at the time of posting this question) you're a new contributor to this site, my friendly advice would be to invest some effort into finding such answers. It's a crucial skill, that you can't do without in the world of programming.
Remember, that whatever problem it is that you are encountering, there is a very high chance that somebody on this forum had already encountered it, and received an answer, you just need to find it.
Please try this. It worked for me.
import re
s='url=http://example.net'
print(re.findall(r"=(.*)\.",s)[0])

How to print only specific words of a string

I want to only print all the "words" that start with "/watch" from the string, and then add all the '/watch...' to a list. Thanks in advance!
# Take a random video from my youtube recommended and add it to watch2gether
import requests
from bs4 import BeautifulSoup as BS
import time
import random
# Importing libraries
num = random.randint(1, 20)
recommended = requests.get('https://www.youtube.com/results?search_query=svenska+youtube+klassiker&sp=EgIQAQ%253D%253D')
recommended_soup = BS(recommended.content, features='lxml')
recommended_vid = recommended_soup.find_all('a', href=True)
for links in recommended_vid:
    print(links['href'])
Output:
/
//www.youtube.com/upload
/
/feed/trending
/feed/history
/premium
/channel/UC-9-kyTW8ZkZNDHQJ6FgpwQ
/channel/UCEgdi0XIXXZ-qJOFPf4JSKw
/gaming
/feed/guide_builder
/watch?v=PbVt_O1kFpA
/watch?v=PbVt_O1kFpA
/user/thedjdoge
/watch?v=1lcksCjvuSs
/watch?v=1lcksCjvuSs
/channel/UCn-puiDqHNMhRvq6wsU3nsQ
/watch?v=AKj_pxp2l1c
/watch?v=AKj_pxp2l1c
/watch?v=QNnEqTQD6DM
/watch?v=QNnEqTQD6DM
/channel/UCDuOAYzgiZzqqlXd2G3GAwg
....
Maybe I can use something like .remove or .replace, don't know what to do so I appreciate all help.
yea re is definitely overkill here. this is a perfect use case for filter
a_list = ["/watch/blah", "not/watch"]
new_list = filter(lambda x: x.startswith("/watch"), a_list)
print(list(new_list))
['/watch/blah']
Just be aware that in Python 3 filter returns a lazy iterator (a filter object), not a list, so wrap it in list() if you want a list.
http://book.pythontips.com/en/latest/map_filter.html is good if you want more information on functions that do this kind of data cleaning. If you need to get really fancy with your data cleaning look into using pandas. It has a steep learning curve, but it's fantastic for complicated data cleaning.
You can do the following:
for links in recommended_vid:
    if "/watch" in links['href']:
        print(links['href'])
This should help you find all the /watch links.
import re
pattern = re.compile(r"/watch")
# pattern = re.compile(r"/watch\?v=[a-zA-Z_0-9]{11}") -- This pattern is to find all the links as well
matches = pattern.finditer(<your_string>)
for m in matches:
    print(m) # prints each match object, including the span where /watch occurs
You can collect all the URLs in a list and proceed. Good Luck!!
Looking at your code, a simple if statement with str.startswith() should suffice to get what you want.
Assuming the links['href'] contains a str, then:
for links in recommended_vid:
    href = links['href'] # I think 'href' will be of type 'str'
    if href.startswith('/watch'):
        print(href)
Note: .startswith() will only work if /watch is really at the start of the href; you could also try if '/watch' in href:, which will match if that string appears anywhere in href.
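Since the question also asks to add the /watch links to a list, and the output shows each link twice, here is a small sketch that collects them and drops duplicates while keeping their order. The hrefs list is a shortened copy of the output above standing in for the scraped values.

```python
# Stand-in for [link['href'] for link in recommended_vid]
hrefs = [
    "/", "/feed/trending", "/watch?v=PbVt_O1kFpA", "/watch?v=PbVt_O1kFpA",
    "/user/thedjdoge", "/watch?v=1lcksCjvuSs", "/watch?v=1lcksCjvuSs",
]

# Keep only the hrefs that begin with /watch
watch_links = [h for h in hrefs if h.startswith("/watch")]

# dict keys are unique and preserve insertion order, so this de-duplicates in order
unique_watch_links = list(dict.fromkeys(watch_links))

print(unique_watch_links)
```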

Get num of page with beautifulsoup

I want to get the page numbers from the following HTML:
<span id="viewns_Z7_AVEQAI930OBRD02JPMTPG21004_:form1:textfooterInfoNumPagMAQ" class="outputText marginLeft0punto5">1</span>
<span id="viewns_Z7_AVEQAI930OBRD02JPMTPG21004_:form1:textfooterInfoTotalPaginaMAQ" class="outputText marginLeft0punto5">37</span>
<span id="viewns_Z7_AVEQAI930OBRD02JPMTPG21004_:form1:textfooterTotalTotalMAQ" class="outputText marginLeft0punto5">736</span>
The goal is to get the numbers 1, 37 and 736.
My problem is that I don't know how to write the line that extracts the numbers. For example, for the number 1:
req = requests.get(url)
soup = BeautifulSoup(req.text, "lxml")
first_page = int(soup.find('span', {'id': 'viewns_Z7_AVEQAI930OBRD02JPMTPG21004_:form1:textfooterInfoNumPagMAQ'}).getText())
Thanks so much
EDIT: I finally found a solution with Selenium:
numpag = int(driver.find_element_by_xpath('//*[@id="viewns_Z7_AVEQAI930OBRD02JPMTPG21004_:form1:textfooterInfoNumPagMAQ"]').text)
pagtotal = int(driver.find_element_by_xpath('//*[@id="viewns_Z7_AVEQAI930OBRD02JPMTPG21004_:form1:textfooterInfoTotalPaginaMAQ"]').text)
totaltotal = int(driver.find_element_by_xpath('//*[@id="viewns_Z7_AVEQAI930OBRD02JPMTPG21004_:form1:textfooterTotalTotalMAQ"]').text)
Thanks @abarnert, and sorry for the chaos in my question; it was my first post =)
The code you provided already works for the example you provided.
My guess is that your problem is that it doesn't work for any other page, probably because those id values are different each time.
If that's the case, you need to look at (or show us) multiple different outputs to figure out if there's a recognizable pattern that you can match with a regular expression or a function full of string operations or whatever. See Searching the tree in the docs for the different kinds of filters you can use.
As a wild guess, that Z7 and AVEQAI930OBRD02JPMTPG21004 are replaced by different strings of capitals and digits each time, but the rest of the format is always the same? If so, there are some pretty obvious regular expressions you can use:
rnumpag = re.compile(r'.*:form1:textfooterInfoNumPagMAQ')
rtotalpagina = re.compile(r'.*:form1:textfooterInfoTotalPaginaMAQ')
rtotaltotal = re.compile(r'.*:form1:textfooterTotalTotalMAQ')
numpag = int(soup.find('span', id=rnumpag).string)
totalpagina = int(soup.find('span', id=rtotalpagina).string)
totaltotal = int(soup.find('span', id=rtotaltotal).string)
This works on your provided example, and would also work on a different page that had different strings of characters within the part we're matching with .*.
And, even if my wild guess was wrong, this should show you how to write a search for whatever you actually do have to search for.
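The regular expressions above can be tried on their own, without a page to scrape. The sketch below checks one pattern against a few id strings. The second id has a made-up prefix to stand in for "a different page". As far as I know, Beautiful Soup applies a compiled pattern with search(), so the leading .* is optional, and a trailing $ guards against ids that merely start with the same suffix.

```python
import re

# search() looks anywhere in the string; '$' pins the match to the end of the id
numpag_id = re.compile(r":form1:textfooterInfoNumPagMAQ$")

ids = [
    "viewns_Z7_AVEQAI930OBRD02JPMTPG21004_:form1:textfooterInfoNumPagMAQ",
    "viewns_Z9_HYPOTHETICAL0000000000000_:form1:textfooterInfoNumPagMAQ",  # made-up prefix
    "viewns_Z7_AVEQAI930OBRD02JPMTPG21004_:form1:textfooterInfoTotalPaginaMAQ",
]

matching = [i for i in ids if numpag_id.search(i)]
print(len(matching))
```

Only the two NumPag ids match; the TotalPagina id is rejected regardless of its prefix.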
As a side note, you were using the undocumented legacy function getText(). This implies that you're copying and pasting ancient BS3 code. Don't do that. Some of it will work with BS4, even when it isn't documented to (as in this case), but it's still a bad idea. It's like trying to run Python 2 source code with Python 3 without understanding the differences.
What you want here is either get_text(), string, or text, and you should look at what all three of these mean in the docs to understand the difference—but here, the only thing within the tag is a text string, so they all happen to do the same thing.

How to do parsing in python?

I'm kind of new to Python, and I'm trying to find out how to do parsing in Python.
I've got a task: parse some data made up of symbols that are unfamiliar to me and put it into a DB. I guess I can create the DB and tables with the help of SQLAlchemy, but I have no idea how to do the parsing or what all the symbols below mean.
http://joxi.ru/YmEVXg6Iq3Q426
http://joxi.ru/E2pvG3NFxYgKrY
$$HDRPUBID 112701130020011127162536
H11127011300UNIQUEPONUMBER120011127
D11127011300UNIQUEPONUMBER100001112345678900000001
D21127011300UNIQUEPONUMBER1000011123456789AR000000001
D11127011300UNIQUEPONUMBER200002123456987X000000001
D21127011300UNIQUEPONUMBER200002123456987XIR000000000This item is inactive. 9781605600000
$$EOFPUBID 1127011300200111271625360000005
Thanks in advance to anyone who can give me some advice on where to start and how the parsing should work.
The best approach is to first figure out where each token begins and ends, and write a regular expression to capture these. The site RegexPal might help you design the regex.
As others suggest, take a look at some regex tutorials, and also at the re module help.
You're probably looking for something like this:
import re
# 1-based, inclusive column ranges for each header field
headerMapping = {'type': (1,5), 'pubid': (6,11), 'batchID': (12,21),
                 'batchDate': (22,29), 'batchTime': (30,35)}
poaBatchHeaders = re.findall(r'\$\$HDR\d{30}', text)
parsedBatchHeaders = []
for poaHeader in poaBatchHeaders:
    batchHeaderDict = {} # a fresh dict per header, so list entries don't share state
    for key in headerMapping:
        start = headerMapping[key][0]-1
        end = headerMapping[key][1]
        batchHeaderDict.update({key: poaHeader[start:end]})
    parsedBatchHeaders.append(batchHeaderDict)
Then you have a list of dicts, where each dict contains the data for each attribute. I assume that you have your data file in text, which is a string. Each dict is made for one found structure (the POA Batch Header in this example).
If you want to parse it further, you have to write a function to parse the data in each attribute.
def batchDate(batch):
    return (batch[0:2]+'-'+batch[2:4]+'-20'+batch[4:])
for header in parsedBatchHeaders:
    header.update({'batchDate': batchDate( header['batchDate'] )})
Remember, that's only an example, and I don't know the documentation for your data! I guess it works like that, but the rest is up to you.
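As a rough check of the column mapping, here is a sketch that slices the $$HDR line from the question directly instead of matching it with a regex (the sample line contains the letters PUBID, so a digits-only pattern would not match it). The helper name parse_fixed_width is made up, and reading the date and time fields as YYYYMMDD and HHMMSS is an assumption based on how the sample looks, not on the format's documentation.

```python
from datetime import datetime

# 1-based, inclusive column ranges, taken from the mapping in the answer above
HEADER_FIELDS = {
    "type": (1, 5), "pubid": (6, 11), "batchID": (12, 21),
    "batchDate": (22, 29), "batchTime": (30, 35),
}


def parse_fixed_width(line, fields):
    """Cut a fixed-width record into named fields."""
    return {name: line[start - 1:end] for name, (start, end) in fields.items()}


line = "$$HDRPUBID 112701130020011127162536"  # header line from the question
header = parse_fixed_width(line, HEADER_FIELDS)

# Assumption: batchDate is YYYYMMDD and batchTime is HHMMSS
stamp = datetime.strptime(header["batchDate"] + header["batchTime"], "%Y%m%d%H%M%S")

print(header)
print(stamp.isoformat())
```

The same helper could be reused for the H and D records once their column layouts are known; those layouts are not given in the question, so they are left out here.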

Interpreting nested HTML <blockquote>s in Python?

I have a web app that reads from the Tumblr API and reformats the way that "reblog chains" are formatted.
With Tumblr, commentary for a post is stored as HTML blockquotes. As users respond to the commentary above, another level gets added to the blockquote chain, eventually resulting in many nested reblog chains.
Here is an example of how a "reblog chain" looks in plain HTML:
<p><a class="tumblr_blog" href="http://chainsaw-police.tumblr.com/post/96158438802/example-tumblr-post">chainsaw-police</a>:</p><blockquote>
<p><a class="tumblr_blog" href="http://example-blog-domain.tumblr.com/post/96158384215/example-tumblr-post">example-blog-domain</a>:</p><blockquote>
<p>Here is an example of a Tumblr post.</p> <p>It can have multiple <p> elements sometimes. It may only have one, though, at other times.</p>
</blockquote>
<p>This is an example of a user “reblogging” a post. As you can see, the previous comment is stored above as a <blockquote>.</p>
</blockquote>
<p>This is another reblog. As you can see, all of the previous comments are stored as blockquotes, with earlier ones being residing deeper in the nest of blockquotes.</p>
And this is what it looks like when rendered.
I want to be able to reformat the reblog chain so that it looks more like this:
example-blog-domain:
Here is an example of a Tumblr post.
It can have multiple <p> elements sometimes. It may only have one, though, at other times.
chainsaw-police:
This is an example of a user “reblogging” a post. As you can see, the previous comment is stored above as a <blockquote>.
example-blog-domain:
This is another reblog. As you can see, all of the previous comments are stored as blockquotes, with earlier ones being residing deeper in the nest of blockquotes.
I know, It's an incredibly confusing structure, hence why I'm trying to write something to make it more readable.
Is there any way to interpret the HTML and split the reblogs up into individual "comments"? For example, having an array or dict that has the username and the commentary would be more than enough. However, after messing with lxml and BeautifulSoup for months, I'm at my wits' end.
If there was even a way to do it in CSS, which I highly doubt, that would be fine.
Thanks in advance, everyone!
I guess CSS does not have such functionality.
You need to parse the HTML into a structure with lxml (or similar) and then render it; that is the easier way. You can also create a filter using a regexp that rejects the unwanted parts of the HTML.
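A minimal sketch of the "parse into a structure" approach is below. It uses the standard library's html.parser rather than lxml, and the names ReblogParser and split_reblogs are made up for illustration. It assumes every attribution is an a tag with class tumblr_blog immediately followed by a blockquote, and that comment text lives in p elements. The sample is a shortened version of the chain from the question.

```python
from html.parser import HTMLParser


class ReblogParser(HTMLParser):
    """Turns nested Tumblr blockquotes into (author, text) pairs, oldest first."""

    def __init__(self):
        super().__init__()
        self.stack = [{"author": None, "text": []}]  # bottom frame = newest comment
        self.comments = []
        self.pending_author = None   # name seen in an attribution, waiting for its blockquote
        self.in_attribution = False
        self.buffer = []

    def handle_starttag(self, tag, attrs):
        if tag == "a" and dict(attrs).get("class") == "tumblr_blog":
            self.in_attribution = True
            self.pending_author = ""
        elif tag == "blockquote":
            # Everything inside this blockquote belongs to the pending author
            self.stack.append({"author": self.pending_author, "text": []})
            self.pending_author = None
        elif tag == "p":
            self.buffer = []

    def handle_data(self, data):
        if self.in_attribution:
            self.pending_author += data
        else:
            self.buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "a":
            self.in_attribution = False
        elif tag == "p":
            text = "".join(self.buffer).strip()
            self.buffer = []
            # Skip the "name:" paragraph itself; keep real comment paragraphs
            if text and self.pending_author is None:
                self.stack[-1]["text"].append(text)
        elif tag == "blockquote":
            frame = self.stack.pop()
            self.comments.append((frame["author"], "\n".join(frame["text"])))


def split_reblogs(html, latest_author="Latest"):
    parser = ReblogParser()
    parser.feed(html)
    parser.close()
    newest = parser.stack[0]
    return parser.comments + [(latest_author, "\n".join(newest["text"]))]


sample = (
    '<p><a class="tumblr_blog" href="#">chainsaw-police</a>:</p><blockquote>'
    '<p><a class="tumblr_blog" href="#">example-blog-domain</a>:</p><blockquote>'
    '<p>Here is an example of a Tumblr post.</p>'
    '</blockquote>'
    '<p>This is an example of a reblog.</p>'
    '</blockquote>'
    '<p>This is another reblog.</p>'
)

for author, text in split_reblogs(sample):
    print(author + ":", text)
```

Because inner blockquotes close before outer ones, the comments come out oldest first without any reversing. Text that contains literal, unescaped tags (as in the question's sample) would need escaping before it reaches a parser like this.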
reddit user /u/joyeusenoelle has answered my question over at /r/LearnPython using a tonne of convoluted regexes that end up looking more like a voodoo magic spell than a text manipulation script.
Lots of regexes later, I think I've solved this for an
arbitrarily-deep comment chain.
import re
with open("tcomment.txt","r") as tf:
    text = ""
    for line in tf:
        text += line
text = text.replace("\n","")
text = text.replace(">",">\n")
text = text.replace("<","\n<")
text = re.sub(r"</p>\s*<p>","<br><br>", text)
text = text.replace("<p>\n", "")
text = text.replace("</p>\n","\n")
text = re.sub(r"<[/]{0,1}blockquote>","<chunk>",text)
text = re.sub(r'<a class="tumblr_blog"[^>]+?>',"<chunk>",text)
text = text.replace("</a>","")
text = re.sub(r"\n+","", text)
text = re.sub(r"\s{2,}"," ", text)
text = re.sub(r"<chunk>\s*<chunk>","<chunk>",text)
bits = text.split("<chunk>")
bits[0] = "Latest:"
comments = []
for i in range(len(bits)):
    temp = ""
    j = 0 - (i+1) # index of the comment text that pairs with author bits[i]
    if (len(bits)-i) > i:
        temp = "<b>" + bits[i] + "</b> " + bits[j]
        comments.append(temp)
comments.reverse()
for comment in comments:
    print("<p>%s</p>" % (comment))
    print()
The line bits[0] = "Latest:" can be changed to whatever you want the
most recent comment to display, and you'll probably want to change how
the text comes into the script.
For the text you provided, this gives me:
<p><b>example-blog-domain:</b> Here is an example of a Tumblr post.<br><br>It can have multiple <p> elements sometimes. It may
only have one, though, at other times.
<p><b>chainsaw-police:</b> This is an example of a user "reblogging" a post. As you can see, the previous comment is stored
above as a <blockquote>.
<p><b>Latest:</b> This is another reblog. As you can see, all of the previous comments are stored as blockquotes, with earlier ones
being residing deeper in the nest of blockquotes.
e: Some thoughts: this is in Python 3, but everything but the print
statements should work in Python 2, I think. I used text.split()
whenever possible because direct string manipulation is typically
faster than regular expressions are, but that may not be appropriate
here. And finally, it's possible that I'm making more work for myself
than I need to in the substitutions section, but at this point I've
looked at the code too long to figure out if it could be slimmed down.
