Python: regex to extract content between two strings

I want a Python regex that can pull the contents between script[" and "], but there are other "] occurrences in the string, which worries me.
expected:
{bunch of javascript here. [\"apple\"] test}
my attempt:
javascript\[\"(.*)"]
target string:
//url//script["{bunch of javascript here. [\"apple\"] test}"]|//*[#attribute="eggs"]

You can't match nested brackets with the re module since it doesn't have the recursion feature to do that. However, in your example you can skip the innermost square brackets if you choose to ignore all brackets enclosed between double quotes.
Try something like this:
p = re.compile(r'script\["([^\\"]*(?:\\.[^\\"]*)*)"]', re.S)
Note: I assumed here that the predicate relates only to the text content of the script node (and not an attribute, a position index, or an axis).
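For reference, a quick sanity check of that pattern against the target string from the question (the raw-string assignment is my addition, so the \" escapes survive as literal text):
import re

target = r'//url//script["{bunch of javascript here. [\"apple\"] test}"]|//*[#attribute="eggs"]'
p = re.compile(r'script\["([^\\"]*(?:\\.[^\\"]*)*)"]', re.S)
m = p.search(target)
print(m.group(1) if m else None)
# {bunch of javascript here. [\"apple\"] test}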

It's very hard to understand exactly what you want to achieve because of the way you have written the question. However, if you are looking for the first instance of "] after a }, then try this:
\["([^}]+}.*?)"\]
This also would work:
\["(.*?}.*?)"\]

Related

Python regex doesn't work on string

I have an HTML file that I process using lxml and BeautifulSoup (convert from HTML to text). Somehow, the ill-formed HTML below makes it into the text and I'd like to remove it. I tried matching something like "<.+>" in the text string, but it doesn't work. The string I want to remove is this:
string = """ .trb_m_b:befoe{ctent:'Hide comments'}.trb_c_so{padding-top:10px;min-height:500px}||<div class="trb_c_so" data-role=c_container><div class="s_comments" data-sitename="ffff" data-content-id="jksjkj7878787" data-type=promo-comment data-publisher="ronctt"></div></div>"""
The exact code I tried on it is:
pattern = re.compile(r'<.+>')
if pattern.search(string):
    print("Found")
However, that regex doesn't match the string, although it should.
Why would that be?
Thanks.
EDIT. It looks like the problem is not with the regular expressions, but with something very bizarre. I have this string in a list, it's the last item. When I loop through it the first time, for some reason, the program never hits it. The second time, however, it does. I don't understand the reason for it.
EDIT2. It turns out the problem was that I was trying to remove elements in a loop (if they matched the regex), which is not permitted. I rewrote the code to use a list comprehension, and now it works fine.
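For reference, a minimal sketch of the list-comprehension fix described in EDIT2 (the list contents are made up for illustration):
import re

pattern = re.compile(r'<.+>')
lines = ['keep me', '<div class="x">drop me</div>', 'keep me too']

# Removing items from a list while looping over it skips elements;
# building a new list with a comprehension avoids that entirely.
cleaned = [line for line in lines if not pattern.search(line)]
print(cleaned)
# ['keep me', 'keep me too']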
I believe what you want is this:
import re
data = re.findall(r"<(.*?)>", string)
Your HTML is not a complete tag. If you really want to match the exact string you gave, you can use this:
re.findall(r"\.trb_m_b.*?></div></div>", string)

Replace tweet URLs with string

I am trying to delete all URLs in a large CSV file and replace them with the string "URL" (a so-called equivalence token). The code does what I want, but it clumps/concatenates some rows together into one row.
That means the original CSV has 63,000 rows but the output CSV only 55,000. That is not what I want. How can I replace links with this token while keeping every row and column separated?
# Replace links with the token "URL"
import re
with open('data_feat1.csv', "r", encoding="utf-8") as oldfile2, open('data_feat2.csv', 'w', encoding="utf-8") as newfile2:
    for line in oldfile2:
        line = re.sub(r"http\S+", r"URL", line)  # replaces links with "URL"
        newfile2.write(line)
# no explicit close() needed: the with statement closes both files
The solution was to add a " to "URL":
line=re.sub(r"http\S+", r'URL"', line) #replaces links with "URL"
I do not know why it worked, but it did!
The reason it works is that the regular expression searches for http and everything after it. The re module handles regular expressions; re.sub replaces whatever the pattern matches with the second argument (URL" in your case). The "everything after it" part is \S+, which keeps matching until whitespace. Since a double quote is not whitespace, the original replacement most likely consumed the closing quote of quoted CSV fields as well, which is what glued rows together; putting the " back in the replacement restores it.
Take a look at Pythex. It's a good place to learn how regular expressions work in Python.
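A more robust approach (a sketch, assuming the file is a standard comma-separated CSV) is to let the csv module manage the quoting and run the substitution on each field, so a URL can never swallow a field's closing quote:
import csv
import re

url_re = re.compile(r"http\S+")

with open('data_feat1.csv', 'r', encoding='utf-8', newline='') as oldfile2, \
     open('data_feat2.csv', 'w', encoding='utf-8', newline='') as newfile2:
    reader = csv.reader(oldfile2)
    writer = csv.writer(newfile2)
    for row in reader:
        # substitute inside each field; the writer re-quotes as needed
        writer.writerow([url_re.sub('URL', field) for field in row])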

How to regex split, but keep the split string?

I have the following URL pattern:
http://www.hulu.jp/watch/589851/supernatural-dub-hollywood-babylon/en
I would like to get everything up until and inclusive of /watch/\d+/.
So far I have:
>>> re.split(r'watch/\d+/', 'http://www.hulu.jp/watch/589851/supernatural-dub-hollywood-babylon/en')
['http://www.hulu.jp/', 'supernatural-dub-hollywood-babylon/en']
But this does not include the split string (the string which appears between the domain and the path). The end answer I want to achieve is:
http://www.hulu.jp/watch/589851
You need to use a capture group:
>>> re.split(r'(watch/\d+/)', 'http://www.hulu.jp/watch/589851/supernatural-dub-hollywood-babylon/en')
['http://www.hulu.jp/', 'watch/589851/', 'supernatural-dub-hollywood-babylon/en']
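To get from those pieces to the expected output, a short follow-up (the reassembly step is my addition, not part of the original answer):
import re

url = 'http://www.hulu.jp/watch/589851/supernatural-dub-hollywood-babylon/en'
parts = re.split(r'(watch/\d+/)', url, maxsplit=1)
print((parts[0] + parts[1]).rstrip('/'))
# http://www.hulu.jp/watch/589851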
As mentioned in the other answer, you need to use groups to capture the "glue" between the split strings.
I wonder, though: is what you want here a split() or a search()? It looks (from the sample) like you're trying to extract from a URL everything from the first occurrence of /watch/XXX/ (where XXX is one or more digits) to the end of the string. If that's the case, then a match/search might be more suitable, because with a split, if the regex can match multiple times you'll split into multiple pieces. Ex:
re.split(r'(watch/\d+/)', 'http://www.hulu.jp/watch/589851/supernatural-dub-hollywood-babylon/watch/2342/fdsaafsdf')
['http://www.hulu.jp/', 'watch/589851/', 'supernatural-dub-hollywood-babylon/', 'watch/2342/', 'fdsaafsdf']
Which doesn't look like what you want. Instead perhaps:
result = re.search(r'(watch/\d+/)(.*)', 'http://www.hulu.jp/watch/589851/supernatural-dub-hollywood-babylon/watch/2342/fdsaafsdf')
result.groups() if result else []
which gives:
('watch/589851/', 'supernatural-dub-hollywood-babylon/watch/2342/fdsaafsdf')
You could also use this approach combined with named groups to get extra fancy:
result = re.search(r'(?P<watchId>watch/\d+/)(?P<path>.*)', 'http://www.hulu.jp/watch/589851/supernatural-dub-hollywood-babylon/watch/2342/fdsaafsdf')
result.groupdict() if result else {}
giving:
{'path': 'supernatural-dub-hollywood-babylon/watch/2342/fdsaafsdf', 'watchId': 'watch/589851/'}
If you're set on the split() approach, you can also set the maxsplit parameter to ensure it's only split once:
re.split(r'(watch/\d+/)', 'http://www.hulu.jp/watch/589851/supernatural-dub-hollywood-babylon/watch/2342/fdsaafsdf', maxsplit=1)
giving:
['http://www.hulu.jp/', 'watch/589851/', 'supernatural-dub-hollywood-babylon/watch/2342/fdsaafsdf']
Personally though, I find that when parsing URL's into constituent parts the search() with named groups approach works extremely well as it allows you to name the various parts in the regex itself, and via groupdict() get a nice dictionary you can use for working with those parts.
You've surely seen the Stack Overflow don't-parse-HTML-with-regex post, yes?
You can't parse [X]HTML with regex. Because HTML can't be parsed by regex. Regex is not a tool that can be used to correctly parse HTML. As I have answered in HTML-and-regex questions here so many times before, the use of regex will not allow you to consume HTML.
Well, regex can parse URLs, but trying to do so when there's a plethora of better tools is foolish.
This is what a regex for URLs looks like:
^(?:(?:https?|ftp):\/\/)(?:\S+(?::\S*)?@)?(?:(?!10(?:\.\d{1,3}){3})(?!127(?:\.\d{1,3}){3})(?!169\.254(?:\.\d{1,3}){2})(?!192\.168(?:\.\d{1,3}){2})(?!172\.(?:1[6-9]|2\d|3[0-1])(?:\.\d{1,3}){2})(?:[1-9]\d?|1\d\d|2[01]\d|22[0-3])(?:\.(?:1?\d{1,2}|2[0-4]\d|25[0-5])){2}(?:\.(?:[1-9]\d?|1\d\d|2[0-4]\d|25[0-4]))|(?:(?:[a-z\x{00a1}-\x{ffff}0-9]+-?)*[a-z\x{00a1}-\x{ffff}0-9]+)(?:\.(?:[a-z\x{00a1}-\x{ffff}0-9]+-?)*[a-z\x{00a1}-\x{ffff}0-9]+)*(?:\.(?:[a-z\x{00a1}-\x{ffff}]{2,})))(?::\d{2,5})?(?:\/[^\s]*)?$ (+ caseless flag)
It's just a mess of characters, right? Exactly!
Don't parse URLs with regex... almost.
There is one simple thing:
A path-relative URL must be zero or more path segments separated from each other by a "/".
Splitting the URL should be as simple as url.split("/").
from urllib.parse import urlparse, urlunparse
myurl = "http://www.hulu.jp/watch/589851/supernatural-dub-hollywood-babylon/en"
# Run a parser over it
parts = urlparse(myurl)
# Keep at most the first two path segments (the leading '' preserves the initial slash)
new_path = "/".join(parts.path.split("/")[:3])
# Unparse
urlunparse(parts._replace(path=new_path))
#>>> 'http://www.hulu.jp/watch/589851'
You can try the following regex:
.*\/watch\/\d+
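As a usage sketch (my addition; the escaped slashes are unnecessary in Python, so they are dropped here): re.match anchors at the start of the string, and group(0) returns everything the greedy pattern consumed, which is exactly the prefix you want:
import re

url = 'http://www.hulu.jp/watch/589851/supernatural-dub-hollywood-babylon/en'
m = re.match(r'.*/watch/\d+', url)
print(m.group(0) if m else None)
# http://www.hulu.jp/watch/589851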

Python - regular expressions - find every word except in tags

How do I find all words except the ones in tags using the re module?
I know how to search for something, but how do I do it the opposite way? That is, instead of writing a pattern for what I want, I actually want to match every word except everything inside tags, and the tags themselves.
So far I managed this:
f = open(filename, 'r')
data = re.findall(r"<.+?>", f.read())
Well, that finds everything inside <> tags, but how do I make it find every word except what's inside those tags?
I tried ^ at the start of a pattern inside [], but then symbols such as . are treated literally, without their special meaning.
I also managed to solve this by splitting the string on the characters \= <>", then checking the whole string for words that appear inside <> tags (like align, right, td, etc.) and appending the words that are not inside <> tags to another list. But that's a bit of an ugly solution.
Is there some simple way to search for every word except anything that's inside <> and these tags themselves?
So, for example, for the string 'hello 123 <b>Bold</b> <p>end</p>', re.findall would return:
['hello', '123', 'Bold', 'end']
Using a regex for this kind of task is not the best idea, as you cannot make it work for every case.
One solution that should catch most such words is the regex pattern
\b\w+\b(?![^<]*>)
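Applied to the sample string from the question, that pattern produces exactly the expected list:
import re

s = 'hello 123 <b>Bold</b> <p>end</p>'
print(re.findall(r'\b\w+\b(?![^<]*>)', s))
# ['hello', '123', 'Bold', 'end']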
If you want to avoid using a regular expression, BeautifulSoup makes it very easy to get just the text from an HTML document:
from BeautifulSoup import BeautifulSoup  # BeautifulSoup 3; with BeautifulSoup 4 use: from bs4 import BeautifulSoup
soup = BeautifulSoup(html_string)
text = "".join(soup.findAll(text=True))
From there, you can get the list of words with split:
words = text.split()
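A sketch of the same idea against the question's sample string, using the current BeautifulSoup 4 API (find_all(string=True) is the modern spelling of findAll(text=True)):
from bs4 import BeautifulSoup

soup = BeautifulSoup('hello 123 <b>Bold</b> <p>end</p>', 'html.parser')
print(''.join(soup.find_all(string=True)).split())
# ['hello', '123', 'Bold', 'end']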
Something like re.compile(r'<[^>]+>').sub('', string).split() should do the trick.
You might want to read this post about processing context-free languages using regular expressions.
Strip out all the tags (using your original regex), then match words.
The only weakness is if there are <s in the strings other than as tag delimiters, or the HTML is not well formed. In that case, it is better to use an HTML parser.

python regex to find any link that contains the text 'abc123'

I am using Beautiful Soup to find all href tags.
links = myhtml.findAll('a', href=re.compile('????'))
I need to find all links that have 'abc123' in the href text.
I need help with the regex; see the ???? in my code snippet.
If 'abc123' is literally what you want to search for, anywhere in the href, then re.compile('abc123') as suggested by other answers is correct. If the actual string you want to match contains punctuation, e.g. 'abc123.com', then use instead
re.compile(re.escape('abc123.com'))
The re.escape part will "escape" any punctuation so that it's taken literally, just like alphanumerics are; without it, some punctuation gets interpreted in various ways by the RE engine. For example, the dot ('.') in the above example would be taken as "any single character whatsoever", so re.compile('abc123.com') would match, e.g., 'abc123zcom' (and many other strings of a similar nature).
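A quick demonstration of the difference (my example, in the spirit of the answer):
import re

print(re.escape('abc123.com'))
# abc123\.com
print(bool(re.search(re.escape('abc123.com'), 'abc123zcom')))  # False: the dot is literal
print(bool(re.search('abc123.com', 'abc123zcom')))             # True: the dot matches 'z'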
"abc123" should give you what you want
if that doesn't work, than BS is probably using re.match in which case you would want ".*abc123.*"
If you want all the links containing 'abc123', you can simply put:
links = myhtml.findAll('a', href=re.compile('abc123'))
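Putting it together, a minimal runnable sketch (the sample HTML is made up for illustration; with BeautifulSoup 4 the import is from bs4):
import re
from bs4 import BeautifulSoup

html = ('<a href="http://example.com/abc123/page">hit</a>'
        '<a href="http://example.com/other">miss</a>')
soup = BeautifulSoup(html, 'html.parser')

links = soup.findAll('a', href=re.compile('abc123'))
print([a['href'] for a in links])
# ['http://example.com/abc123/page']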
