BeautifulSoup String Search - python

I have been googling and looking at other questions here about searching for a string in a BeautifulSoup object.
Based on my searching, the following should detect the string, but it doesn't:
strings = soup.find_all(string='Results of Operations and Financial Condition')
However, the following detects the string:
tags = soup.find_all('div',{'class':'info'})
for tag in tags:
    if re.search('Results of Operations and Financial Condition', tag.text):
        ''' Do Something'''
Why does one work and the other not?

You might want to use:
strings = soup.find_all(string=lambda x: 'Results of Operations and Financial Condition' in x)
This happens because find_all with a plain string argument only matches text nodes whose entire content equals that string exactly. I suppose you have some other text next to 'Results of Operations and Financial Condition' in the node you are after.
If you check the docs here you can see that you can pass a function to the string parameter, and the following two lines are equivalent:
soup.find_all(string='Results of Operations and Financial Condition')
soup.find_all(string=lambda x: x == 'Results of Operations and Financial Condition')
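For a self-contained illustration of the difference, here is a minimal sketch (the HTML snippet and variable names are made up for this example):
import re
from bs4 import BeautifulSoup

html = '<div class="info">Results of Operations and Financial Condition (Q3)</div>'
soup = BeautifulSoup(html, 'html.parser')

# Exact match: the text node has extra text around the phrase, so nothing is found.
print(soup.find_all(string='Results of Operations and Financial Condition'))  # []

# Substring match via a function, or via a compiled regular expression: both find the node.
print(soup.find_all(string=lambda x: 'Results of Operations and Financial Condition' in x))
print(soup.find_all(string=re.compile('Results of Operations and Financial Condition')))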

For this code
import re
import urllib.request
import bs4
page = urllib.request.urlopen('https://en.wikipedia.org/wiki/Alloxylon_pinnatum')
sp = bs4.BeautifulSoup(page, 'html.parser')
print(sp.find_all(string=re.compile('The pinkish-red compound flowerheads')))  # use a regex to search within text nodes
print(sp.find_all(string='The pinkish-red compound flowerheads, known as'))
print(sp.find_all(string='The pinkish-red compound flowerheads, known as '))  # notice the space at the end of the string
Results are -
['The pinkish-red compound flowerheads, known as ']
[]
['The pinkish-red compound flowerheads, known as ']
It looks like the string argument searches for an exact, full-string match: it compares against the entire value of the HTML text node, not whether the node contains that string. You can, however, use a regular expression to test whether a text node contains some string, as shown in the code above.


Split String based on multiple Regex matches

First of all, I checked these previous posts, and they did not help me: 1 & 2 & 3
I have this string (or a similar case) that needs to be handled with regex:
"Text Table 6-2: Management of children study and actions"
What I am supposed to do is:
1. detect the word Table and the word(s) before it, if present
2. detect the numbers that follow, which can be in this format: 6 or 6-2 or 66-22 or 66-2
3. finally, the rest of the string (in this case: Management of children study and actions)
After doing so, the return value must be: parts 1 and 2 as one string, and the rest as another string.
e.g. the returned value must look like this: Text Table 6-2, Management of children study and actions
Below is my code:
mystr = "Text Table 6-2: Management of children study and actions"
if re.match("([a-zA-Z0-9]+[ ])?(figure|list|table|Figure|List|Table)[ ][0-9]([-][0-9]+)?", mystr):
print("True matched")
parts_of_title = re.search("([a-zA-Z0-9]+[ ])?(figure|list|table|Figure|List|Table)[ ][0-9]([-][0-9]+)?", mystr)
print(parts_of_title)
print(" ".join(parts_of_title.group().split()[0:3]), parts_of_title.group().split()[-1])
The first requirement returns True as it should, but the second does not, so I changed the code to use compile; however, the regex behaviour changed. The code is like this:
mystr = "Text Table 6-2: Management of children study and actions"
if re.match("([a-zA-Z0-9]+[ ])?(figure|list|table|Figure|List|Table)[ ][0-9]([-][0-9]+)?", mystr):
print("True matched")
parts_of_title = re.compile("([a-zA-Z0-9]+[ ])?(figure|list|table|Figure|List|Table)[ ][0-9]([-][0-9]+)?").split(mystr)
print(parts_of_title)
Output:
True matched
['', 'Text ', 'Table', '-2', ':\tManagement of children study and actions']
So, based on this, how can I achieve this while keeping the code clean and readable? And why does using compile change the matching?
The matching changes because:
In the first part, you call .group().split() where .group() returns the full match which is a string.
In the second part, you call re.compile("...").split() where re.compile returns a regular expression object.
In the pattern, the part [a-zA-Z0-9]+[ ] will match only a single word, and if the number should be in a capture group, note that in [0-9]([-][0-9]+)? the first (single) digit is currently not part of the capture group.
You could write the pattern writing 4 capture groups:
^(.*? )?((?:[Ll]ist|[Tt]able|[Ff]igure))\s+(\d+(?:-\d+)?):\s+(.+)
See a regex demo.
import re
pattern = r"^(.*? )?((?:[Ll]ist|[Tt]able|[Ff]igure))\s+(\d+(?:-\d+)?):\s+(.+)"
s = "Text Table 6-2: Management of children study and actions"
m = re.match(pattern, s)
if m:
    print(m.groups())
Output
('Text ', 'Table', '6-2', 'Management of children study and actions')
If you want points 1 and 2 as one string, then you can use 2 capture groups instead.
^((?:.*? )?(?:[Ll]ist|[Tt]able|[Ff]igure)\s+\d+(?:-\d+)?):\s+(.+)
Regex demo
The output will be
('Text Table 6-2', 'Management of children study and actions')
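For completeness, a minimal sketch running that 2-group pattern on the same input string:
import re

pattern = r"^((?:.*? )?(?:[Ll]ist|[Tt]able|[Ff]igure)\s+\d+(?:-\d+)?):\s+(.+)"
s = "Text Table 6-2: Management of children study and actions"
m = re.match(pattern, s)
if m:
    print(m.groups())
    # ('Text Table 6-2', 'Management of children study and actions')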
You already have answers, but I wanted to try your problem to train myself, so here is what I found all the same, in case you are interested:
((?:[a-zA-Z0-9]+)? ?(?:[Ll]ist|[Tt]able|[Ff]igure)).*?((?:[0-9]+\-[0-9]+)|(?<!-)[0-9]+): (.*)
And here is the link to my tests: https://regex101.com/r/7VpPM2/1
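A quick sketch of how that pattern could be used, joining the first two groups (the sample string is the one from the question):
import re

pattern = r"((?:[a-zA-Z0-9]+)? ?(?:[Ll]ist|[Tt]able|[Ff]igure)).*?((?:[0-9]+\-[0-9]+)|(?<!-)[0-9]+): (.*)"
s = "Text Table 6-2: Management of children study and actions"
m = re.search(pattern, s)
if m:
    print(m.group(1) + " " + m.group(2), m.group(3), sep=", ")
    # Text Table 6-2, Management of children study and actions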

Hyphen at beginning of regex causes it to stop matching (python 2.7) - but at the end it's fine?

I'm writing a simple script to dump the tracks, artists, and times of a bandcamp album (https://nihonkizuna.bandcamp.com/album/nihon-kizuna), but I'm having trouble with the regex. For context, the track titles are in the format "Artist - Title". I'm trying to separate the dumped track titles so that I have the artist in one list and the title in another, then writing these and the time to a csv.
For some reason, the expression:
(.*) -
Finds the artist correctly, but:
- (.*)
Fails to find the title correctly. Instead I get:
AttributeError: 'NoneType' object has no attribute 'group'
I've tried escaping the hyphen, but Python returns None for the match as long as the hyphen is the first character in the pattern. I've also tried testing it against an actual title, "- 9 Samurai", and it still fails.
import pandas as pd
from lxml import html
import re
import requests
page = requests.get("https://nihonkizuna.bandcamp.com/album/nihon-kizuna")
tree = html.fromstring(page.content)
tracks = tree.xpath('//table[@id="track_table"]//td[@class="title-col"]/div[@class="title"]/a/span/text()')
time = tree.xpath('//table[@id="track_table"]//td[@class="title-col"]/div[@class="title"]/span/text()')
newtimes = []
artists = []
newtracks = []
for item in time:
    newitem = item.strip()
    newtimes.append(newitem)
for item in tracks:
    track_item = re.match("(.*) -", item)
    artists.append(track_item.group(1))
    newitem2 = re.match("- (.*)", item)
    newtracks.append(newitem2.group(1))
raw_data = {"track": newtracks, "artist": artists, "time": newtimes}
df = pd.DataFrame(raw_data, columns = ["track", "artist", "time"])
df.index += 1
df.to_csv(raw_input("Input the csv path."))
As the documentation for re.match states:
If zero or more characters at the beginning of string match the regular expression pattern, (...).
Use re.search instead.
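To see the difference concretely, here is a minimal sketch with a made-up title in the "Artist - Title" format:
import re

item = "Some Artist - 9 Samurai"   # artist name is made up for the example

print(re.match("- (.*)", item))               # None: match only considers the start of the string
print(re.search("- (.*)", item).group(1))     # '9 Samurai'
print(re.match("(.*) -", item).group(1))      # 'Some Artist' (works because this pattern matches from the start)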
Why not use a regular str.split()?
artists, newtracks = zip(*[item.split(" - ") for item in tracks])
The zip(*[...]) here unzips the list of 2-item lists into two separate sequences, allowing us to separate artists and newtracks.
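In case the zip(*[...]) idiom is unfamiliar, here is what it does on a tiny, made-up example:
pairs = [["Artist A", "Track 1"], ["Artist B", "Track 2"]]
artists, newtracks = zip(*pairs)
print(artists)    # ('Artist A', 'Artist B')
print(newtracks)  # ('Track 1', 'Track 2')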
Note that both solutions are vulnerable if a dash can be part of an artist or track name. On this particular page, artist and track names always appear together, joined with "-". If you are worried about cases like these and can sacrifice performance in exchange for quality and robustness, follow the individual track pages, where the artist and the song are defined separately. If you do that, make sure to use a web-scraping requests session while you crawl the website.

Regex Python firstname lastname tag the keywords in json dump

Python Regex :
I have a JSON file and a list of keywords.
I need to match the keywords in the JSON file dump.
I have a set of keywords, Data Filter Terms:
Candidate Names
Hillary Clinton
Bernie Sanders
Jeb Bush
Donald Trump
John Kasich
Marco Rubio
Scott Walker
I need to match these keywords in such a way that it searches for
'Scott Walker' as well as 'Scott' and 'Walker' independently,
and I need to tag these in the JSON dump.
Can anyone help me out with this?
I wrote some pseudocode for this:
import re

json_pages = open('/home/Desktop/arti.json', 'r')
filterd_pages = []
for page in json_pages:
    text = page['text']
    matches = re.search('Hillary Clinton', text)
    if matches:
        page['matched_keyword'] = matches.group()
        filterd_pages.append(page)
dump_json(filterd_pages)  # dump_json is a placeholder
import json

f = open('/home/soundarya/Documents/synapsifyone.json')
json_response = json.loads(f.read())
keywords = ['Hillary Clinton', 'Bernie Sanders', 'Jeb Bush', 'Donald Trump', 'John Kasich', 'Marco Rubio', 'Scott Walker']
for k, v in json_response.iteritems():
    if k in keywords:
        print(v)
        break
How to tag the keywords in the JSON dump?
I have crawled many posts from nearly 30 URLs using the Diffbot tool and got JSON as the output file. From this JSON file I have to match the keywords (First Name, Last Name, First Name Last Name) and tag them at the end of each dict in the list, or return the sentences that contain, for example, 'hillary', 'clinton' or 'Hillary Clinton'.
You can create a list of regular expressions, one for each term. The idea is to construct a regular expression that matches either the whole term or any word in it.
term_regexes = []
term_tags = []
for term in term_file:
    term_matchers = [term] + term.split()
    term_regexes.append('|'.join(term_matchers))
    term_tags.append(term)
We are creating a list to hold the regular expressions and another for holding the tags.
term_file contains each term. For each of them, we construct the regular expression that matches either the term or any of its parts. This corresponds to the union of the expressions matching each one of them, using the union ("or") regex operator ("|"). For instance, the expression "Hillary Clinton|Hillary|Clinton" would do the job for your example.
Finally, we iterate the list of dictionaries, search for a match of any of the terms and tag when found:
for d in dict_list:
    # Search each term.
    for term_regex, tag in zip(term_regexes, term_tags):
        if re.search(term_regex, d['text'], re.IGNORECASE):
            d['tag'] = tag
            break
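Putting the two pieces together on a small, made-up sample (term_file here is just a list of terms from the question, and dict_list holds a couple of fake records):
import re

term_file = ['Hillary Clinton', 'Bernie Sanders', 'Scott Walker']
dict_list = [{'text': 'Walker spoke in Iowa today.'},
             {'text': 'The debate featured Bernie Sanders and others.'}]

term_regexes = []
term_tags = []
for term in term_file:
    term_regexes.append('|'.join([term] + term.split()))
    term_tags.append(term)

for d in dict_list:
    for term_regex, tag in zip(term_regexes, term_tags):
        if re.search(term_regex, d['text'], re.IGNORECASE):
            d['tag'] = tag
            break

print(dict_list)
# [{'text': 'Walker spoke in Iowa today.', 'tag': 'Scott Walker'},
#  {'text': 'The debate featured Bernie Sanders and others.', 'tag': 'Bernie Sanders'}]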

beautifulsoup, Find th with text 'price', then get price from next th

My html looks like:
<td>
<table ..>
<tr>
<th ..>price</th>
<th>$99.99</th>
</tr>
</table>
</td>
So I am in the current table cell, how would I get the 99.99 value?
I have so far:
td[3].findChild('th')
But I need to do:
Find th with text 'price', then get next th tag's string value.
Think about it in "steps"... given that some x is the root of the subtree you're considering,
x.findAll(text='price')
is the list of all items in that subtree containing text 'price'. The parents of those items then of course will be:
[t.parent for t in x.findAll(text='price')]
and if you only want to keep those whose "name" (tag) is 'th', then of course
[t.parent for t in x.findAll(text='price') if t.parent.name=='th']
and you want the "next siblings" of those (but only if they're also 'th's), so
[t.parent.nextSibling for t in x.findAll(text='price')
if t.parent.name=='th' and t.parent.nextSibling and t.parent.nextSibling.name=='th']
Here you see the problem with using a list comprehension: too much repetition, since we can't assign intermediate results to simple names. Let's therefore switch to a good old loop...:
Edit: added tolerance for a string of text between the parent th and the "next sibling" as well as tolerance for the latter being a td instead, per OP's comment.
for t in x.findAll(text='price'):
    p = t.parent
    if p.name != 'th': continue
    ns = p.nextSibling
    if ns and not ns.name: ns = ns.nextSibling
    if not ns or ns.name not in ('td', 'th'): continue
    print ns.string
I've added ns.string, which will give the next sibling's contents if and only if they're just text (no further nested tags) -- of course you can instead analyze further at this point, depending on your application's needs!-). Similarly, I imagine you won't be doing just print but something smarter, but I'm giving you the structure.
Talking about the structure, notice that twice I use if...: continue: this reduces nesting compared to the alternative of inverting the if's condition and indenting all the following statements in the loop -- and "flat is better than nested" is one of the koans in the Zen of Python (import this at an interactive prompt to see them all and meditate;-).
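As a side note, with the current BeautifulSoup 4 API the same lookup can be written more directly; a minimal sketch, assuming bs4 and that the cell text is exactly 'price':
from bs4 import BeautifulSoup

html = "<table><tr><th>price</th><th>$99.99</th></tr></table>"
soup = BeautifulSoup(html, 'html.parser')

th = soup.find('th', string='price')                # the <th> whose text is exactly 'price'
if th:
    print(th.find_next_sibling('th').get_text())    # $99.99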
With pyparsing, it's easy to reach into the middle of some HTML for a tag pattern like this:
from pyparsing import makeHTMLTags, Combine, Word, nums

th, thEnd = makeHTMLTags("TH")
floatnum = Combine(Word(nums) + "." + Word(nums))
priceEntry = (th + "price" + thEnd +
              th + "$" + floatnum("price") + thEnd)
tokens, startloc, endloc = priceEntry.scanString(html).next()
print tokens.price
Pyparsing's makeHTMLTags helper returns a pair of pyparsing expressions, one for the start tag and one for the end tag. The start tag pattern is much more than just adding "<>"s around the given string; it also allows for extra whitespace, variable case, and the presence or absence of tag attributes. For instance, note that even though I specified "TH" as the table head tag, it will also match "th", "Th", "tH" and "TH". Pyparsing's default whitespace-skipping behavior will also handle extra spaces, between tag and "$", between "$" and the numeric price, etc., without having to sprinkle "zero or more whitespace chars could go here" indicators. Lastly, by assigning the results name "price" (following floatnum in the definition of priceEntry), it makes it very simple to access that specific value from the full list of tokens matching the overall priceEntry expression.
(Combine is used for 2 purposes: it disallows whitespace between the components of the number; and returns a single combined token "99.99" instead of the list ["99", ".", "99"].)

Python parsing

I'm trying to parse the title tag in an RSS 2.0 feed into three different variables for each entry in that feed. Using ElementTree I've already parsed the RSS so that I can print each title [minus the trailing )] with the code below:
feed = getfeed("http://www.tourfilter.com/dallas/rss/by_concert_date")
for item in feed:
    print repr(item.title[0:-1])
I include that because, as you can see, the item.title is a repr() data type, which I don't know much about.
A particular repr(item.title[0:-1]) printed in the interactive window looks like this:
'randy travis (Billy Bobs 3/21'
'Michael Schenker Group (House of Blues Dallas 3/26'
The user selects a band and I hope to, after parsing each item.title into 3 variables (one each for band, venue, and date... or possibly an array or I don't know...) select only those related to the band selected. Then they are sent to Google for geocoding, but that's another story.
I've seen some examples of regex and I'm reading about them, but it seems very complicated. Is it? I thought maybe someone here would have some insight as to exactly how to do this in an intelligent way. Should I use the re module? Does it matter that the output is currently repr()s? Is there a better way? I was thinking I'd use a loop like this (this is my pseudo-Python, just the kind of notes I'm writing):
list = bandRaw,venue,date,latLong
for item in feed:
    parse item.title for bandRaw, venue, date
    if bandRaw == str(band)
        send venue name + ", Dallas, TX" to google for geocoding
        return lat,long
        list = list + return character + bandRaw + "," + venue + "," + date + "," + lat + "," + long
    else
In the end, I need to have the chosen entries in a .csv (comma-delimited) file looking like this:
band,venue,date,lat,long
randy travis,Billy Bobs,3/21,1234.5678,1234.5678
Michael Schenker Group,House of Blues Dallas,3/26,4321.8765,4321.8765
I hope this isn't too much to ask. I'll be looking into it on my own, just thought I should post here to make sure it got answered.
So, the question is, how do I best parse each repr(item.title[0:-1]) in the feed into the 3 separate values that I can then concatenate into a .csv file?
Don't let regex scare you off... it's well worth learning.
Given the examples above, you might try putting the trailing parenthesis back in, and then using this pattern:
import re

s = 'Michael Schenker Group (House of Blues Dallas 3/26)'  # trailing ")" put back
pat = re.compile(r'([\w\s]+)\(([\w\s]+)(\d+/\d+)\)')
info = pat.match(s)
print info.groups()
('Michael Schenker Group ', 'House of Blues Dallas ', '3/26')
To get at each group individually, just call group() on the info object:
print info.group(1) # or info.groups()[0]
print '"%s","%s","%s"' % (info.group(1), info.group(2), info.group(3))
"Michael Schenker Group","House of Blues Dallas","3/26"
The hard thing about regex in this case is making sure you know all the known possible characters in the title. If there are non-alpha chars in the 'Michael Schenker Group' part, you'll have to adjust the regex for that part to allow them.
The pattern above breaks down as follows, which is parsed left to right:
([\w\s]+) : Match any word or space characters (the plus symbol indicates that there should be one or more such characters). The parentheses mean that the match will be captured as a group. This is the "Michael Schenker Group " part. If there can be numbers and dashes here, you'll want to modify the pieces between the square brackets, which are the possible characters for the set.
\( : A literal parenthesis. The backslash escapes the parenthesis, since otherwise it counts as a regex command. This is the "(" part of the string.
([\w\s]+) : Same as the one above, but this time matches the "House of Blues Dallas " part. In parentheses so they will be captured as the second group.
(\d+/\d+) : Matches the digits 3 and 26 with a slash in the middle. In parentheses so they will be captured as the third group.
\) : Closing parenthesis for the above.
The python intro to regex is quite good, and you might want to spend an evening going over it http://docs.python.org/library/re.html#module-re. Also, check Dive Into Python, which has a friendly introduction: http://diveintopython3.ep.io/regular-expressions.html.
EDIT: See zacherates below, who has some nice edits. Two heads are better than one!
Regular expressions are a great solution to this problem:
>>> import re
>>> s = 'Michael Schenker Group (House of Blues Dallas 3/26'
>>> re.match(r'(.*) \((.*) (\d+/\d+)', s).groups()
('Michael Schenker Group', 'House of Blues Dallas', '3/26')
As a side note, you might want to look at the Universal Feed Parser for handling the RSS parsing as feeds have a bad habit of being malformed.
Edit
In regards to your comment... The strings occasionally being wrapped in "s rather than 's has to do with the fact that you're using repr. The repr of a string is usually delimited with 's, unless that string contains one or more 's, in which case it uses "s so that the 's don't have to be escaped:
>>> "Hello there"
'Hello there'
>>> "it's not its"
"it's not its"
Notice the different quote styles.
Regarding the repr(item.title[0:-1]) part, not sure where you got that from but I'm pretty sure you can simply use item.title. All you're doing is removing the last char from the string and then calling repr() on it, which does nothing.
Your code should look something like this:
import geocoders # from GeoPy
us = geocoders.GeocoderDotUS()
import feedparser # from www.feedparser.org
feedurl = "http://www.tourfilter.com/dallas/rss/by_concert_date"
feed = feedparser.parse(feedurl)
lines = []
for entry in feed.entries:
m = re.search(r'(.*) \((.*) (\d+/\d+)\)', entry.title)
if m:
bandRaw, venue, date = m.groups()
if band == bandRaw:
place, (lat, lng) = us.geocode(venue + ", Dallas, TX")
lines.append(",".join([band, venue, date, lat, lng]))
result = "\n".join(lines)
EDIT: replaced list with lines as the var name. list is a builtin and should not be used as a variable name. Sorry.
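If the goal is the .csv file from the question, one way to finish this off is with the csv module rather than joining strings by hand; a sketch (Python 3, with a sample row taken from the question):
import csv

# rows collected as (band, venue, date, lat, lng) tuples instead of pre-joined strings
rows = [("randy travis", "Billy Bobs", "3/21", 1234.5678, 1234.5678)]

with open("concerts.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["band", "venue", "date", "lat", "long"])
    writer.writerows(rows)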
