Regular Expressions or BeautifulSoup - Varying Cases - python

I have 3 strings I am looking to retrieve that are characterized by the presence of two words: section and front. I'm terrible with regex.
contentFrame wsj-sectionfront economy_sf
contentFrame wsj-sectionfront business_sf
section-front markets
How can I match both of these words using one regular expression? This will be used to match the contents of an HTML page parsed by BeautifulSoup.
UPDATE:
I want to extract the main body of a webpage (https://www.wsj.com/news/business), the div labeled "Main Content Housing". For some reason, BeautifulSoup isn't recognizing the highlighted class attribute using:
wsj_soup.find('div', attrs={'class': 'contentFrame wsj-sectionfront business_sf'})
# Returns []
I'm trying to stay in BeautifulSoup as much as possible, but if regex is the way to go I will use that. From there I will most likely use the contents attribute to search for relevant keywords, but if anyone has a better idea of how to approach it, please share.

One way to handle this would be to use two separate lookaheads which check for each of these words:
^(?=.*section)(?=.*front).*$
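To use that pattern from BeautifulSoup, one option is a tag filter function that joins each tag's class list back into a single string before testing it. A minimal sketch, with sample markup invented from the class strings in the question:

import re
from bs4 import BeautifulSoup

# invented markup echoing the three class strings from the question
html = """
<div class="contentFrame wsj-sectionfront economy_sf">Economy</div>
<div class="contentFrame wsj-sectionfront business_sf">Business</div>
<div class="section-front markets">Markets</div>
<div class="header nav">Nav</div>
"""
soup = BeautifulSoup(html, "html.parser")

# the two-lookahead pattern, tested against the full class string
pattern = re.compile(r"^(?=.*section)(?=.*front)")

def section_front(tag):
    return tag.name == "div" and bool(pattern.search(" ".join(tag.get("class", []))))

for div in soup.find_all(section_front):
    print(div["class"])  # matches the first three divs, skips the last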

Related

How to find a string pattern that isn't in a tag in a BeautifulSoup object

Problem: My program processes HTML test reports and lists header information and pass/fail results. Some of the information needed for the header is text that is not in a tag, so the find() and find_all() methods aren't applicable.
When Python prints a BeautifulSoup object, it looks like text but you can't do simple pattern matches on it. And all the BeautifulSoup methods seem to assume that everything is a tag of some kind.
I won't list all the things I tried because I resolved the issue.
Resolution: Cast the BeautifulSoup object to a string first; then ordinary Python pattern matching works.
I'm posting this because I spent a lot of time searching for an answer to this problem and didn't find any information. And it would be interesting to know of other ways to do this.
# testReport type is BeautifulSoup
s = str(testReport)
if '_Tray5_' in s:
    Tray = '5'
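As for other ways to do this: BeautifulSoup 4.4+ can also search bare text directly, since find_all(string=...) matches NavigableString objects rather than tags. A minimal sketch, with an invented report fragment:

import re
from bs4 import BeautifulSoup

# invented report fragment; the serial text sits outside any inner tag
testReport = BeautifulSoup("<h1>Report</h1>Serial: ABC_Tray5_001", "html.parser")

# string= matches NavigableStrings, i.e. text that is not itself a tag
if testReport.find_all(string=re.compile("_Tray5_")):
    Tray = '5'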

Using regex to find something in the middle of a href while looping

For "extra credit" in a beginners class in Python that I am taking I wanted to extract data out of a URL using regex. I know that there are other ways I could probably do this, but my regex desperately needs work so...
Given a URL to start at, find the xth occurrence of a href on the page, and use that link to go down a level. Rinse and repeat until I have found the required link on the page at the requested depth on the site.
I am using Python 3.7 and Beautiful Soup 4.
At the beginning of the program, after all of the house-keeping is done, I have:
starting_url = 'http://blah_blah_blah_by_Joe.html'
extracted_name = re.findall('(?<=by_)([a-zA-Z0-9]+)[^.html]*', starting_url)
selected_names.append(extracted_name)
# Just for testing purposes
print(selected_name)
# [['Joe']]
Hmm, a bit odd; I didn't expect a nested list, but I know how to flatten a list, so OK. Let's go on.
I work my way through a couple of loops, opening each url for the next level down by using:
html = urllib.request.urlopen(url, context=ctx).read()
soup = BeautifulSoup(html, 'html.parser')
tags = soup('a')
Continue processing and, in the loop where the program should have found the href I want:
# Testing to check I have found the correct href
print(desired_link)
# <a href="http://blah_blah_blah_by_Mary.html">blah blah</a>
type(desired_link)
# bs4.element.Tag
Correct link, but a "type" new to me and not something I can use re.findall on. So, after more research, I found:
for link in soup.find_all('a'):
    tags = link.get('href')  # type(tags) is str
    print(tags)

# Output:
# http://blah_blah_blah_by_George.html
# http://blah_blah_blah_by_Bill.html
# http://blah_blah_blah_by_Mary.html
# etc.
Right type, but when I look at what was printed, I think I'm looking at one long string? And I need a way to assign just the third href to a variable that I can use in re.findall('regex expression', desired_link).
Time to ask for help, I think.
And, while we are at it, any ideas about why I get the nested list the first time I used re.findall with the regex?
Please let me know how to improve this question so it is clearer what I've done and what I'm looking for (I KNOW you guys will, without me even asking).
You've printed every link on the page, but on each pass through the loop tags contains only one of them (you can print len(tags) to verify this easily).
Also, I suggest replacing [a-zA-Z0-9]+ with \w+ - it matches letters, digits, and underscores, and is much cleaner.
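On the two remaining questions: re.findall always returns a list (e.g. ['Joe']), and appending that list to selected_names is what produced the nested [['Joe']]. To get just the third href, collect the hrefs into a list and index it. A sketch, reusing soup and selected_names from the question:

import re

# collect every href, then pick the third one (index 2)
hrefs = [a.get('href') for a in soup.find_all('a')]
desired_link = hrefs[2]

# take the first match out of the list re.findall returns,
# instead of appending the whole list
extracted_name = re.findall(r'(?<=by_)\w+', desired_link)[0]
selected_names.append(extracted_name)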

How to use BeautifulSoup to get only strings from tags that have specific start?

I am scraping usernames; each of them is in an a tag and their hrefs all start the same way, like this:
<a class="link5" href="http://lolprofile.net/summoner/...">Sadastyczny</a>
I tried matching on the class link5, but other elements that I don't want to scrape have that class too. So is there a way to search for all tags whose href starts with
http://lolprofile.net/summoner
while ignoring the rest of the URL, since that part is obviously different for every username?
As the BeautifulSoup documentation shows, you can pass a regular expression as an attribute filter to match these links. If you have never heard of regular expressions, you can use this (the pattern anchors at the start of the href and escapes the literal dot):
soup.find_all(href=re.compile(r"^http://lolprofile\.net/summoner/"))
Don't forget to import the re module!
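Putting it together, a minimal sketch; the markup (including the region path segment) is invented for illustration:

import re
from bs4 import BeautifulSoup

html = '''
<a class="link5" href="http://lolprofile.net/summoner/EUNE/Sadastyczny">Sadastyczny</a>
<a class="link5" href="http://lolprofile.net/news/some-article">Some article</a>
'''
soup = BeautifulSoup(html, 'html.parser')

# only anchors whose href starts with the summoner prefix match
for a in soup.find_all('a', href=re.compile(r'^http://lolprofile\.net/summoner/')):
    print(a.get_text())  # Sadastyczny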

Finding urls containing a specific string

I haven't used RegEx before, and everyone seems to agree that it's bad for web scraping and HTML in particular, but I'm not really sure how to solve my little challenge without it.
I have a small Python scraper that opens 24 different webpages. On each webpage there are links to other webpages. I want to make a simple solution that gets the links I need, and even though the webpages are somewhat similar, the links I want are not.
The only common thing between the urls seems to be a specific string: 'uge' or 'Uge' (uge means week in Danish - and the week number changes every week, duh). It's not like the urls have a common ID or something like that I could use to target the correct ones each time.
I figure it would be possible using RegEx to go through the webpage and find all urls that have 'uge' or 'Uge' in them and then open them. But is there a way to do that using BS? And if I do it using RegEx, what would a possible solution look like?
For example, here are two of the urls I want to grab in different webpages:
http://www.domstol.dk/KobenhavnsByret/retslister/Pages/Uge45-Tvangsauktioner.aspx
http://www.domstol.dk/esbjerg/retslister/Pages/Straffesageruge32.aspx
This should work... The RegEx uge\d\d? tells it to find "uge" followed by a digit, and possibly another one.
import re

for item in listofurls:
    matches = re.findall(r"uge\d\d?", item, re.IGNORECASE)
    if matches:
        print item  # just do whatever you want to do when it finds it
Yes, you can do this with BeautifulSoup.
import re
from BeautifulSoup import BeautifulSoup

soup = BeautifulSoup(html_string)
# To find just 'Uge##' or 'uge##', as specified in the question:
urls = [el["href"] for el in soup.findAll("a", href=re.compile(r"[Uu]ge\d+"))]
# To find without regard to case at all:
urls = [el["href"] for el in soup.findAll("a", href=re.compile(r"(?i)uge\d+"))]
Or just use a simple for loop:
list_of_urls = ["""LIST GOES HERE"""]
for url in list_of_urls:
    if 'uge' in url.lower():
        pass  # Code to execute
The regex would look something like: uge\d\d

Removing unnecessary inner tags

We are converting DOCX to HTML through some external converter tool.
The generated HTML for tables contains something like this:
<td><div><span><b>Patienten</b></span></div></td>
The <div> and <span> tags inside TD are completely superfluous here.
The expected result is
<td><b>Patienten</b></td>
Is there some chance to remove them in a sane way using BeautifulSoup?
Well, the <div> and <span> tags have a structural meaning that cannot automatically be written off as "superfluous".
Your problem looks very similar to AST (Abstract Syntax Tree) optimization done in compilers. You could try to define some rules and build a SoupOptimizer to take a tree (your document) and produce an optimized output tree. Rules could be:
span(content) -> content, if span.attributes is empty
div(content) -> content, if div.attributes is empty
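A minimal sketch of those two rules using bs4's unwrap(), which replaces a tag with its own contents (the answers below use the older BeautifulSoup 3 API):

from bs4 import BeautifulSoup

soup = BeautifulSoup("<td><div><span><b>Patienten</b></span></div></td>", "html.parser")

# rule: a <div>/<span> carrying no attributes is replaced by its own content
for tag in soup.find_all(["div", "span"]):
    if not tag.attrs:
        tag.unwrap()

print(soup)  # <td><b>Patienten</b></td>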
Note that tree transformations on XML dialects can be done with XSLT. Just be ready to have your brain turned inside out before you see the light!
The way we do it is to use lxml and determine the parents and children of every element. If there is no difference in text content between a parent and its child, we have a set of rules for retaining certain children while tossing the parents, and then forcing the appropriate block elements. In your case, b is a child of span, div, and td; we know that td is the relevant structuring element, so we get rid of the others. Again, this requires testing the text content of each of the nested elements.
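If you take the lxml route, etree.strip_tags does the mechanical part: it removes the named tags while keeping their text and children. A simplified sketch that strips div and span unconditionally, skipping the text-content comparison described above:

from lxml import etree

td = etree.fromstring("<td><div><span><b>Patienten</b></span></div></td>")

# strip_tags drops the tags themselves, keeping their children and text
etree.strip_tags(td, "div", "span")

print(etree.tostring(td))  # b'<td><b>Patienten</b></td>'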
You could use the strip_tags function from Jesse Dhillon's answer to this question.
You could rearrange the parse tree like this:
from BeautifulSoup import BeautifulSoup

soup = BeautifulSoup("<td><div><span><b>Patienten</b></span></div></td>")
td = soup.td
b = soup.td.div.span.b
td.insert(0, b)   # move the <b> tag to the front of the <td>
td.div.extract()  # then remove the now-redundant <div> (and its <span>)
print soup
I like the approach suggested by #Daren Thomas, but be aware that removing those "useless" tags could drastically affect the rendered appearance of the document thanks to JavaScript (less likely) or CSS (much more likely, possibly even probable) that relies on the resulting HTML to follow certain structural patterns, even if they are wasteful.
This makes the life of the tool writer much easier. Assume that some given construct in the DOCX has two possible variations. One of these requires a lot of boilerplate so you can attach a few special attributes (say a text-align or some such). The other doesn't. It's way easier to just always generate the boilerplate and write your CSS or what-have-you with that fact in mind.
If Beautiful Soup alone isn't sufficient, you can resort to regular expressions.
import re

ch = 'sunny day<td><div><span><b>Patienten</b></span></div></td>rainy week'
# desired result: <td><b>Patienten</b></td>
RE = '(<td>)<div><span>(<b>.*?</b>)</span></div>(</td>)'
pat = re.compile(RE)
print ch
print pat.sub(r'\1\2\3', ch)
Result:
sunny day<td><div><span><b>Patienten</b></span></div></td>rainy week
sunny day<td><b>Patienten</b></td>rainy week
Easy, isn't it?
A preliminary inspection can be done to determine whether the replacement really needs to be made.
