reverse regex summarization [duplicate] - python

Is there a library that can take a text (such as an HTML document) and a list of strings (such as product names), find a pattern in where those strings occur, and generate a regular expression that extracts every string in the text that matches that pattern?
For example, given the following html:
<table>
<tr>
<td>Product 1</td>
<td>Product 2</td>
<td>Product 3</td>
<td>Product 4</td>
<td>Product 5</td>
<td>Product 6</td>
<td>Product 7</td>
<td>Product 8</td>
</tr>
</table>
and the following list of strings:
['Product 1', 'Product 2', 'Product 3']
I'd like a function that would build a regex like the following:
'<td>(.*?)</td>'
and then extract everything in the html that matches the regex.
In this case, the output would be:
['Product 1', 'Product 2', 'Product 3', 'Product 4', 'Product 5', 'Product 6', 'Product 7', 'Product 8']
CLARIFICATION:
I'd like the function to look at the surroundings of the samples, not at the samples themselves.
So, for example, if the html was:
<tr>
<td>Word</td>
<td>More words</td>
<td>101</td>
<td>-1-0-1-</td>
</tr>
and the samples ['Word', 'More words'] I'd like it to extract:
['Word', 'More words', '101', '-1-0-1-']

Your requirement is at the same time very specific and very general.
I don't think you'll ever find a library for this purpose; you'd have to write your own.
On the other hand, if you spend a lot of time writing regexes, you could use a GUI tool to help you build them, like:
http://www.regular-expressions.info/regexmagic.html
However, if you only need to extract data from HTML documents, you should consider using an HTML parser instead; it makes things a lot easier.
I recommend BeautifulSoup for parsing HTML documents in Python:
https://pypi.python.org/pypi/beautifulsoup4/4.2.1
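If you just want every <td> cell from the example in the question, a BeautifulSoup version is only a few lines. A minimal sketch, assuming bs4 and the sample table above:
from bs4 import BeautifulSoup

html = """<table><tr>
<td>Product 1</td><td>Product 2</td><td>Product 3</td><td>Product 4</td>
<td>Product 5</td><td>Product 6</td><td>Product 7</td><td>Product 8</td>
</tr></table>"""

soup = BeautifulSoup(html, "html.parser")
# No regex needed: collect the text of every <td> cell
print([td.get_text() for td in soup.find_all("td")])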

I'm pretty sure the answer to this question in the general case (without being pedantic) is no. The problem is that an arbitrary text, together with an arbitrary set of substrings of that text, do not rigorously define a single regular expression.
As a couple of people have mentioned, a function could simply return .* for every set of inputs. Or it could return, for input strings ['desired', 'input', 'strings'], the regex
'(desired)+|(input)+|(strings)+'
Or plenty of other trivially correct but wholly useless results.
The issue you're facing is that in order to build a regex, you need to rigorously define it. And to do that, you need to describe the desired expression in language as expressive as the regex language you're working in... a string plus a list of substrings is not sufficient (just look at all the options a tool like RegexMagic needs in order to compute regular expressions in a limited environment!). In practical terms, this means you'd need to already know the regular expression you want in order to compute it efficiently.
Of course, you could always go the million-monkeys route and attempt to evolve an appropriate regex somehow, but you're still going to have the problem of requiring a huge sample input of text + expected output in order to get a viable expression. Plus it'll take ages to run and probably be bloated six ways from Sunday with useless detritus. You'd likely be better off writing it yourself.

I had a similar problem. Pyparsing is a great tool to do exactly as you said.
https://github.com/pyparsing/pyparsing
It allows you to build expressions much like a regex, but much more flexible. The site has some good examples.
Here is a quick script for the problem you posed above:
from pyparsing import makeHTMLTags, SkipTo

results = []
text_string = """<table>
<tr>
<td>Product 1</td>
<td>Product 2</td>
<td>Product 3</td>
<td>Product 4</td>
<td>Product 5</td>
<td>Product 6</td>
<td>Product 7</td>
<td>Product 8</td>
</tr>
</table>"""

# Build the <td>...</td> matcher once, then scan each line
anchorStart, anchorEnd = makeHTMLTags("td")
table_cell = anchorStart + SkipTo(anchorEnd).setResultsName("contents") + anchorEnd

for line in text_string.splitlines():
    for tokens, start, end in table_cell.scanString(line):
        results.append(''.join(tokens.contents))

for i in results:
    print i
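Run against the sample HTML above, this prints Product 1 through Product 8, one per line.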

Try this:
https://github.com/noprompt/frak
It's written in Clojure, and there's no guarantee that what it outputs is the most concise expression, but it seems to have some potential.

Perhaps it would be better to use a Python HTML parser that supports XPaths (see this related question), look for the bits of interest in the HTML code, and then record their XPaths - or at least the ones shared by more than one of the examples?
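For instance, a quick sketch with lxml (one parser with XPath support), using the clarification example from the question:
from lxml import html

doc = html.fromstring("<table><tr><td>Word</td><td>More words</td>"
                      "<td>101</td><td>-1-0-1-</td></tr></table>")
# The samples all share the XPath //td, so every cell matches:
print(doc.xpath("//td/text()"))
# ['Word', 'More words', '101', '-1-0-1-']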

const table = document.querySelector("table");
const rows = table.querySelectorAll("tr");
let array = [];
for (const row of rows) {
  const cells = row.querySelectorAll("td");
  let rowArray = [];
  for (const cell of cells) {
    rowArray.push(cell.textContent);
  }
  array.push(rowArray);
}
console.log(array);

Rather than generating a regex, how about using a more general regex? If your data is constrained to the inner text of an element that does not itself contain elements, then this regex used with re.findall will yield a list of tuples where each tuple is (tagname, text):
r'<(?P<tag>[^>]*)>([^<>]+?)</(?P=tag)>'
You could then extract just the text from each tuple easily.
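For example, a quick sketch against the clarification sample from the question:
import re

html = "<td>Word</td><td>More words</td><td>101</td><td>-1-0-1-</td>"
pattern = r'<(?P<tag>[^>]*)>([^<>]+?)</(?P=tag)>'
# re.findall returns (tagname, text) tuples; keep just the text
print([text for tag, text in re.findall(pattern, html)])
# ['Word', 'More words', '101', '-1-0-1-']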

Related

How to search for the text within an "a href" tag and click?

I've got a table with a bunch of links. The IDs are all unique but do not correspond to the actual text that is displayed so I'm having some trouble.
Ex.
<tr>
<td><a id="011" href="/link">Project 1</a></td>
</tr>
<tr>
<td><a id="235" href="/link">Project 2</a></td>
</tr>
<tr>
<td><a id="033" href="/link">Project 3</a></td>
</tr>
<tr>
<td><a id="805" href="/link">Project 4</a></td>
</tr>
I only know the text within the a element (i.e. Project 1) and I want to search for it and click it. I haven't been able to figure this out and I've been playing around with find_element_by_xpath for a while.
I've been using
selectproject = browser.find_element_by_xpath("//a[contains(.,projectname)]").click();
(projectname is a variable that changes every iteration)
I think it finds the element, since the script runs, but it doesn't click. I think it's because I'm not actually searching for the a element, just for the text?
Here is the answer to your question:
If you want to click the link with text Project 1 you can use the following line of code:
browser.find_element_by_xpath("//a[contains(text(),'Project 1')]").click()
or
browser.find_element_by_xpath("//a[@id='011'][contains(text(),'Project 1')]").click()
Update:
As you mentioned, the Project 1 part is dynamic, so you can construct a separate function for clicking these links. Call the function with each of the project names one by one as follows (the function is in Java; convert it to your required language binding):
public void clickProject(String projectName)
{
    browser.findElement(By.xpath("//a[.='" + projectName + "']")).click();
}
Now you can call it from your main() class as: clickProject("Project 1")
Let me know if this answers your question.
If your requirement is to "click on the link Project 1", then you should use that as the locator. No need to mess around with XPath.
browser.find_element_by_link_text("Project 1").click()
# or the more flexible
browser.find_element_by_partial_link_text("Project 1").click()
The .find_element_by_partial_link_text() locator strategy should account for any extra whitespace padding due to the extra span element.
Note: I write Java, so the above Python syntax may be off. But those methods must exist.

Using BeautifulSoup to find first string which comes after certain string

How can I find the first string that comes after a certain string using BeautifulSoup?
I have this text within an HTML file:
<tr>
<th scope="row">Continent:</th>
<td>North America</td>
</tr>
<tr>
I'd like to fetch "North America" out of it by getting the first string that comes after the 'Continent:' string.
How can I do that?
BTW, I found another way to get it, but I'm looking for a simpler way:
continent_tag = soup.find('th', string='Continent:')
print continent_tag.parent.contents[3].contents[0]
Thanks,
Moty
Since the elements are siblings, another option would be to use the .find_next_sibling() method in order to select the adjacent td sibling element:
print(soup.find('th', string='Continent:').find_next_sibling('td').text)
# North America

Iteratively reading a specific element from a <table> with Selenium for Python

I am trying to read in information from this table that changes periodically. The HTML looks like this:
<table class="the_table_im_reading">
<thead>...</thead>
<tbody>
<tr id="uc_6042339">
<td class="expansion">...</td>
<td>
<div id="card_6042339_68587" class="cb">
TEXT I NEED TO READ
</td>
<td>...</td>
more td's
</tr>
<tr id="uc_6194934">...</tr>
<td class="expansion">...</td>
similar as the first <tr id="uc...">
I was able to get to the table using:
table_xpath = '//*[@id="content-wrapper"]/div[5]/table'
table_element = driver.find_element_by_xpath(table_xpath)
And I am trying to read the TEXT I NEED TO READ part for each unique <tr id="uc_unique number">. The id=uc_unique number changes periodically, so I cannot use find element by id.
Is there a way to reach that element and read that specific text?
Looks like you can search via the anchor-element link (href-attribute), since I guess this will not change.
via xpath:
yourText = table_element.find_element_by_xpath(".//a[@href='/blahsomelink']").text
UPDATE
OP mentioned that his link is also changing (with each call?), which means that the first approach is not for him.
if you want the text of the first row-element you can try this:
yourText = table_element.find_element_by_xpath(".//tr[1]//a[@class='cl']").text
if you know for example that the link element is always in the second data-element of the first row and there is only one link-element, then you can do this:
yourText = table_element.find_element_by_xpath(".//tr[1]/td[2]//a").text
Unless you provide more detailed requirements as to what you are really searching for, this will have to suffice so far...
Another UPDATE
OP gave more info regarding his requirement:
I am trying to get the text in each row.
Given there is only one anchor-element with class cl in each tr element you can do the following:
elements = table_element.find_elements_by_xpath(".//tr//a[@class='cl']")
for element in elements:
    row_text = element.text
Now you can do whatever you need with all these texts...
It looks like you have a few options.
If all you want is the first A, it might be as simple as
table_element.find_element_by_css_selector("a.cl").text
or the little more specific
table_element.find_element_by_css_selector("div.cb > a.cl").text
If you want all the As, try the find_elements_* versions of the above.
I managed to find the elements I needed using .get_attribute("textContent") instead of .text, a tip from Get Text from Span returns empty string
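A minimal sketch of that fix, continuing from the table_element and a[@class='cl'] examples above (.text can come back empty when the driver considers an element hidden, while the DOM textContent property still holds the raw text):
elements = table_element.find_elements_by_xpath(".//tr//a[@class='cl']")
texts = [el.get_attribute("textContent") for el in elements]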

Can't see the HTML in the element

I am able to log on and access my account page, here is a sample of the HTML (modified for brevity and to not exceed the URL limit):
<div class='table m_t_4'>
<table class='data' border=0 width=100% cellpadding=0 cellspacing=0>
<tr class='title'>
<td align='center' width='15'><a></a></td>
<td align='center' width='60'></td>
</tr>
<TR bgcolor=>
<td valign='top' align='center'>1</TD>
<td valign='top' align='left'><img src='/images/sale_small.png' alt='bogo sale' />Garden Escape Planters</TD>
<td valign='top' align='right'>13225</TD>
<td valign='top' align='center'>2012-01-17 11:34:32</TD>
<td valign='top' align='center'>FILLED</TD>
<td valign='top' align='center'><A HREF='https://www.daz3d.com/i/account/orderdetail?order=7886745'>7886745</A></TD>
<td valign='top' align='center'><A HREF='https://www.daz3d.com/i/account/req_dlreset?oi=18087292'>Reset</A>
</TR>
Note that the only item I really need is the first HREF with the "order=7886745'>7886745<"...
And there are several of the TR blocks that I need to read.
I am using the following xpath coding:
browser.get('https://www.daz3d.com/i/account/orderitem_hist?')
account_history = browser.find_element_by_xpath("//div[@class='table m_t_4']")
print account_history
product_block = account_history.find_element_by_xpath("//TR[contains(@bgcolor, '')]")
print product_block
product_link = product_block.find_element_by_xpath("//TR/td/A@HREF")
print product_link
I am using the Python FireFox version of webdriver.
When I run this, the account_history and product_block xpaths seem to work fine (they print as "none" so I assume they worked), but I get a "the expression is not a legal expression" error on the product_link.
I have 2 questions:
1: Why doesn't the "//TR/td/A@HREF" xpath work? It is supposed to be using the product_block - which it (should be) just the TR segment, so it should start with the TR, then look for the first td that has the HREF...correct?
I tried using the exact case used in the HTML, but I think it shouldn't matter...
2: What coding do I need to use to see the content (HTML/text) of the elements?
I need to be able to do this to get the URL I need for the next page to call.
I would also like to see for sure that the correct HTML is being read here...that should be a normal part of debugging, IMHO.
How is the element data stored? Is it in an array or table that I can read using Python? It has to be available somewhere, in order to be of any use in testing - doesn't it?
I apologize for being so confused, but I see a lot of info on this on the web, and yet much of it either doesn't do anything, or it causes an error.
There do not seem to be any "standard" coding rules available...and so I am a bit desperate here...
I really like what I have seen in Selenium up to this point, but I need to get past it in order to make this work!
Edited!
OK, after getting some sleep the first answer provided the clue - find_elements_by_xpath creates a list... so I used that to find all of the "//a[contains(@href,'https://www.daz3d.com/i/account/orderdetail?order=')]" elements in the entire history, then accessed the list it created... and wrote it to a file to be sure of what I was seeing.
The revised code:
links = open("listlinks.txt", "w")
browser.get('https://www.daz3d.com/i/account/orderitem_hist?')
account_history = browser.find_element_by_xpath("//div[@class='table m_t_4']")
print account_history.get_attribute("div")
product_links = account_history.find_elements_by_xpath("//a[contains(@href,'https://www.daz3d.com/i/account/orderdetail?order=')]")
print str(len(product_links)) + ' elements'
for index, item in enumerate(product_links):
    link = item.get_attribute("href")
    links.write(str(index) + '\t' + str(link) + '\n')
And this gives me the file with the links I need...
0 https://www.daz3d.com/i/account/orderdetail?order=7905687
1 https://www.daz3d.com/i/account/orderdetail?order=7886745
2 https://www.daz3d.com/i/account/orderdetail?order=7854456
3 https://www.daz3d.com/i/account/orderdetail?order=7812189
So simple I couldn't see it for tripping over it...
Thanks!
1: Why doesn't the "//TR/td/A@HREF" xpath work? It is supposed to be
using the product_block - which it (should be) just the TR segment, so
it should start with the TR, then look for the first td that has the
HREF...correct?
WebDriver only returns elements, not attributes of said elements, thus:
"//TR/td/A"
works, but
"//TR/td/A@HREF"
or
"//TR/td/A@ANYTHING"
does not.
2: What coding do I need to use to see the content (HTML/text) of the
elements?
To retrieve the innertext:
string innerValue = element.Text;
To retrieve the innerhtml:
This is a little harder, you would need to iterate through each of the child elements and reconstruct the html based on that - or you could process the html with a scraping tool.
To retrieve an attribute:
string hrefValue = element.GetAttribute("href");
(C#, hopefully you can make the translation to Python)
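A possible Python translation of the above (my assumption, not part of the original answer):
inner_value = element.text                       # inner text
href_value = element.get_attribute("href")       # attribute value
inner_html = element.get_attribute("innerHTML")  # many drivers also expose the raw markup this way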
There are other ways to access an element besides browser.find_element_by_xpath.
You can access it by, for example, id or class:
browser.find_element_by_id
browser.find_element_by_link_text
browser.find_element
browser.find_element_by_class_name
browser.find_element_by_css_selector
browser.find_element_by_name
browser.find_element_by_partial_link_text
browser.find_element_by_xpath
browser.find_element_by_tag_name
Each of the above has a similar function that returns a list (just replace element with elements).
Note: I have separated the top two rows as I think they might help you.
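For example, the plural form returns a (possibly empty) list instead of raising an exception when nothing matches:
links = browser.find_elements_by_partial_link_text("Project")
for link in links:
    print(link.text)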

Regex try and match until hitting end tag in python

I'm looking for a bit of help with a regex in Python, and Google is failing me. I'm searching some HTML for a certain type of table: specifically, any table that includes a background attribute (i.e. BGCOLOR). Some tables have this attribute and some do not. Could someone help me write a regex that searches from the start of a table for BGCOLOR, but stops and moves on if it hits the end of that table first?
Here's a very simplified example that will serve the purpose:
<TABLE>
<B>Item 1.</B>
</TABLE>
<TABLE>
BGCOLOR
</TABLE>
<TABLE>
<B>Item 2.</B>
</TABLE>
So we have three tables but I'm only interested in finding the middle table that contains 'BGCOLOR'
The problem with my regex at the moment is that it searches for the starting table tag then looks for 'BGCOLOR' and doesn't care if it passes the table end tag:
tables = re.findall('\<table.*?BGCOLOR=".*?".*?\<\/table\>', text, re.I|re.S)
So it would find the first two tables instead of just the second table. Let me know if anyone knows how to handle this situation.
Thanks,
Michael
Don't use a regular expression to parse HTML. Use lxml or BeautifulSoup.
Don't use regular expressions to parse HTML -- use an HTML parser, such as BeautifulSoup.
Specifically, your situation is basically one of having to deal with "nested parentheses" (where an open "parens" is an opening <table> tag and the corresponding closed parens is the matching </table>) -- exactly the kind of parsing tasks that regular expressions can't perform well. Lots of the work in parsing HTML is exactly connected with this "matched parentheses" issue, which makes regular expressions a perfectly horrible choice for the purpose.
You mention in a comment to another answer that you've had unspecified problems with BS -- I suspect you were trying the latest, 3.1 release (which has gone downhill) instead of the right one; try 3.0.8 instead, as BS's own docs recommend, and you could be better off.
If you've made some kind of pact with Evil never to use the right tool for the job, your task might not be totally impossible if you don't need to deal with nesting (just matching), i.e., there is never a table inside another table. In this case you can identify one table with r'<\s*TABLE(.*?)<\s*/\s*TABLE' (with suitable flags such as re.DOTALL and re.I); loop over all such matches with the finditer method of regular expressions; and in the loop's body check whether BGCOLOR (in a case-insensitive sense) happens to be inside the body of the current match. It's still going to be more fragile, and more work, than using an HTML parser, but while definitely an inferior choice it needs not be a desperate situation.
If you do have nested tables to contend with, then it is a desperate situation.
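For the non-nested case, a minimal sketch of the finditer approach described above, using the sample tables from the question:
import re

html = """<TABLE><B>Item 1.</B></TABLE>
<TABLE>BGCOLOR</TABLE>
<TABLE><B>Item 2.</B></TABLE>"""

table_re = re.compile(r'<\s*TABLE(.*?)<\s*/\s*TABLE', re.DOTALL | re.I)
for match in table_re.finditer(html):
    body = match.group(1)
    if 'bgcolor' in body.lower():
        print(body)  # only the middle table mentions BGCOLOR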
If your task is just this simple, here's a way: split on </TABLE>, then iterate over the chunks and find the pattern you want.
myhtml="""
<TABLE>
<B>Item 1.</B>
</TABLE>
some text1
some text2
some text3
<TABLE>
blah
BGCOLOR
blah
</TABLE>
some texet
<TABLE>
<B>Item 2.</B>
</TABLE>
"""
for tab in myhtml.split("</TABLE>"):
    if "<TABLE>" in tab and "BGCOLOR" in tab:
        print ''.join(tab.split("<TABLE>")[1:])
output
$ ./python.py
blah
BGCOLOR
blah
Here's the code that ended up working for me. It finds the correct table and adds more tagging around it so that it is identified from the group with open and close tags of 'realTable'.
import re
from BeautifulSoup import BeautifulSoup, NavigableString, Tag  # BeautifulSoup 3

soup = BeautifulSoup(''.join(text))
for p in soup.findAll('table'):
    pattern = '.*BGCOLOR.*'
    if re.match(pattern, str(p), re.S | re.I):
        tags = Tag(soup, "realTable")
        p.replaceWith(tags)
        text = NavigableString(str(p))
        tags.insert(0, text)
print soup
prints this out:
<table><b>Item 1.</b></table>
<realTable><table>blah BGCOLOR blah</table></realTable>
<table><b>Item 2.</b></table>
