Replace tweet URLs with a string - Python

I am trying to replace every URL in a big CSV file with the string "URL" (a so-called equivalence token). The code does what I want, but it clumps/concatenates some rows together into one row.
That means the original CSV has 63,000 rows and the output CSV only 55,000. That is not what I want. How can I replace links with this token while keeping every row separate?
# links are replaced with the token "URL"
import re
with open('data_feat1.csv', 'r', encoding='utf-8') as oldfile2, open('data_feat2.csv', 'w', encoding='utf-8') as newfile2:
    for line in oldfile2:
        line = re.sub(r"http\S+", "URL", line)  # replaces links with "URL"
        newfile2.write(line)
# no explicit close() is needed; the with statement closes both files

The solution was to append a double quote to "URL" in the replacement:
line = re.sub(r"http\S+", r'URL"', line)  # replaces links with URL"
I do not know why it worked, but it did! (Most likely the URLs sat at the end of quoted CSV fields, so http\S+ also swallowed each field's closing quote; appending " to the replacement puts it back.)

The reason it works is that it uses a regular expression to search for http.
The re module handles regular expressions. re.sub replaces whatever the pattern matches with the second argument (URL" in your case).
The pattern searches for http and everything after it; "everything after it" is expressed by \S+, which means "keep matching until the next whitespace".
Take a look at Pythex. It's a good place to learn how regular expressions work in Python.
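As an aside (not from the original thread): because http\S+ can eat the closing quote of a quoted CSV field, a more robust approach is to let Python's csv module parse each row and apply the substitution per field, so quoting and row structure survive. A minimal sketch, reusing the filenames from the question:
import csv
import re

url_re = re.compile(r"http\S+")

with open('data_feat1.csv', 'r', encoding='utf-8', newline='') as oldfile2, \
     open('data_feat2.csv', 'w', encoding='utf-8', newline='') as newfile2:
    reader = csv.reader(oldfile2)
    writer = csv.writer(newfile2)
    for row in reader:
        # substitute inside each field, leaving the row structure intact
        writer.writerow([url_re.sub('URL', field) for field in row])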

Related

How to return everything in a string that is not matched by a regex?

I have a string and a regular expression that matches portions of the string. I want to return a string representing what's left of the original string after all matches have been removed.
import re
string="<font size="2px" face="Tahoma"><br>Good Morning, </font><div><br></div><div>As per last email"
pattern = r'<[a-zA-Z0-9 ="/\-:;.]*>'
re.findall(pattern, string)
['<font size="2px" face="Tahoma">',
'<br>',
'</font>',
'<div>',
'<br>',
'</div>',
'<div>']
desired_string = "Good Morning, As per last email"
Instead of re.findall, use re.sub to replace each match with an empty string.
re.sub(pattern, "", string)
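On the sample string this yields (my own check; note the &nbsp; entity survives, since it is text rather than a tag):
>>> re.sub(pattern, "", string)
'Good Morning,&nbsp;As per last email'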
While that's the literal answer to your general question about removing patterns from a string, it appears that your specific problem is related to manipulating HTML. It's generally a bad idea to try to manipulate HTML with regular expressions. For more information see this answer to a similar question: https://stackoverflow.com/a/1732454/7432
Instead of a regular expression, use an HTML parser like BeautifulSoup. It looks like you are trying to strip the HTML elements and get the underlying text.
from bs4 import BeautifulSoup
string="""<font size="2px" face="Tahoma"><br>Good Morning, </font><div><br></div><div>As per last email"""
soup = BeautifulSoup(string, 'lxml')
print(soup.get_text())
This outputs:
Good Morning, As per last email
One thing to notice is that the &nbsp; was changed to a regular space using this method.

python: regex to extract content between two strings

I want a Python regex that can pull the contents between script[" and "], but there are other "] occurrences, which worries me.
expected:
{bunch of javascript here. [\"apple\"] test}
my attempt:
javascript\[\"(.*)"]
target string:
//url//script["{bunch of javascript here. [\"apple\"] test}"]|//*[#attribute="eggs"]
You can't match nested brackets with the re module since it doesn't have the recursion feature to do that. However, in your example you can skip the innermost square brackets if you choose to ignore all brackets enclosed between double quotes.
Try something like this:
p = re.compile(r'script\["([^\\"]*(?:\\.[^\\"]*)*)"]', re.S)
Note: I assumed here that the predicate relates only to the "text" content of the script node (and not to an attribute, an item number, or an axis).
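A quick check of this pattern against the target string from the question (my own test; the target is written as a raw string so the backslashes survive):
import re

p = re.compile(r'script\["([^\\"]*(?:\\.[^\\"]*)*)"]', re.S)
target = r'//url//script["{bunch of javascript here. [\"apple\"] test}"]|//*[#attribute="eggs"]'
print(p.search(target).group(1))
# {bunch of javascript here. [\"apple\"] test}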
It's very hard to understand exactly what you want to achieve from the way the question is written. However, if you are looking for the first instance of "] after a }, then try this:
\["([^}]+}.*?)"\]
This would also work:
\["(.*?}.*?)"\]
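Both patterns capture the expected text from the question's target string; a quick check of the first (my own demo):
>>> target = r'//url//script["{bunch of javascript here. [\"apple\"] test}"]|//*[#attribute="eggs"]'
>>> print(re.search(r'\["([^}]+}.*?)"\]', target).group(1))
{bunch of javascript here. [\"apple\"] test}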

How to regex split, but keep the split string?

I have the following URL pattern:
http://www.hulu.jp/watch/589851/supernatural-dub-hollywood-babylon/en
I would like to get everything up until and inclusive of /watch/\d+/.
So far I have:
>>> re.split(r'watch/\d+/', 'http://www.hulu.jp/watch/589851/supernatural-dub-hollywood-babylon/en')
['http://www.hulu.jp/', 'supernatural-dub-hollywood-babylon/en']
But this does not include the split string (the string which appears between the domain and the path). The end answer I want to achieve is:
http://www.hulu.jp/watch/589851
You need to use a capture group:
>>> re.split(r'(watch/\d+/)', 'http://www.hulu.jp/watch/589851/supernatural-dub-hollywood-babylon/en')
['http://www.hulu.jp/', 'watch/589851/', 'supernatural-dub-hollywood-babylon/en']
As mentioned in the other answer, you need to use groups to capture the "glue" between the split strings.
I wonder, though: is what you want here a split() or a search()? It looks (from the sample) like you're trying to extract from a URL everything from the first occurrence of /watch/XXX/, where XXX is one or more digits, to the end of the string. If that's the case, then a match/search might be more suitable, because with a split, if the search regex can match multiple times, you'll split into multiple groups. Ex:
re.split(r'(watch/\d+/)', 'http://www.hulu.jp/watch/589851/supernatural-dub-hollywood-babylon/watch/2342/fdsaafsdf')
['http://www.hulu.jp/', 'watch/589851/', 'supernatural-dub-hollywood-babylon/', 'watch/2342/', 'fdsaafsdf']
Which doesn't look like what you want. Instead perhaps:
result = re.search(r'(watch/\d+/)(.*)', 'http://www.hulu.jp/watch/589851/supernatural-dub-hollywood-babylon/watch/2342/fdsaafsdf')
result.groups() if result else []
which gives:
('watch/589851/', 'supernatural-dub-hollywood-babylon/watch/2342/fdsaafsdf')
You could also use this approach combined with named groups to get extra fancy:
result = re.search(r'(?P<watchId>watch/\d+/)(?P<path>.*)', 'http://www.hulu.jp/watch/589851/supernatural-dub-hollywood-babylon/watch/2342/fdsaafsdf')
result.groupdict() if result else {}
giving:
{'path': 'supernatural-dub-hollywood-babylon/watch/2342/fdsaafsdf', 'watchId': 'watch/589851/'}
If you're set on the split() approach, you can also set the maxsplit parameter to ensure it's only split once:
re.split(r'(watch/\d+/)', 'http://www.hulu.jp/watch/589851/supernatural-dub-hollywood-babylon/watch/2342/fdsaafsdf', maxsplit=1)
giving:
['http://www.hulu.jp/', 'watch/589851/', 'supernatural-dub-hollywood-babylon/watch/2342/fdsaafsdf']
Personally though, I find that when parsing URL's into constituent parts the search() with named groups approach works extremely well as it allows you to name the various parts in the regex itself, and via groupdict() get a nice dictionary you can use for working with those parts.
You've surely seen the Stack Overflow don't-parse-HTML-with-regex post, yes?
You can't parse [X]HTML with regex. Because HTML can't be parsed by regex. Regex is not a tool that can be used to correctly parse HTML. As I have answered in HTML-and-regex questions here so many times before, the use of regex will not allow you to consume HTML.
Well, regex can parse URLs, but trying to do so when there's a plethora of better tools is foolish.
This is what a regex for URLs looks like:
^(?:(?:https?|ftp):\/\/)(?:\S+(?::\S*)?#)?(?:(?!10(?:\.\d{1,3}){3})(?!127(?:\.\d{1,3}){3})(?!169\.254(?:\.\d{1,3}){2})(?!192\.168(?:\.\d{1,3}){2})(?!172\.(?:1[6-9]|2\d|3[0-1])(?:\.\d{1,3}){2})(?:[1-9]\d?|1\d\d|2[01]\d|22[0-3])(?:\.(?:1?\d{1,2}|2[0-4]\d|25[0-5])){2}(?:\.(?:[1-9]\d?|1\d\d|2[0-4]\d|25[0-4]))|(?:(?:[a-z\x{00a1}-\x{ffff}0-9]+-?)*[a-z\x{00a1}-\x{ffff}0-9]+)(?:\.(?:[a-z\x{00a1}-\x{ffff}0-9]+-?)*[a-z\x{00a1}-\x{ffff}0-9]+)*(?:\.(?:[a-z\x{00a1}-\x{ffff}]{2,})))(?::\d{2,5})?(?:\/[^\s]*)?$ (+ caseless flag)
It's just a mess of characters, right? Exactly!
Don't parse URLs with regex... almost.
There is one simple thing:
A path-relative URL must be zero or more path segments separated from each other by a "/".
Splitting the URL should be as simple as url.split("/").
from urllib.parse import urlparse, urlunparse
myurl = "http://www.hulu.jp/watch/589851/supernatural-dub-hollywood-babylon/en"
# Run a parser over it
parts = urlparse(myurl)
# Keep at most the first two path segments (plus the leading empty string from the split)
new_path = "/".join(parts.path.split("/")[:3])
# Unparse
urlunparse(parts._replace(path=new_path))
#>>> 'http://www.hulu.jp/watch/589851'
You can try the following regex:
.*\/watch\/\d+
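A quick check (my own demo), using re.match so matching starts at the beginning of the URL:
>>> re.match(r'.*\/watch\/\d+', 'http://www.hulu.jp/watch/589851/supernatural-dub-hollywood-babylon/en').group(0)
'http://www.hulu.jp/watch/589851'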

Removing TAGS in a document

I need to find all the tags in a .txt-format SEC filing and remove them from the filing.
As a beginner in Python, I used the following code to find the tags, but it returns None, None, ..., and I don't know how to remove them. My question is: how do I find all the tags <....> and remove them so that the document contains everything but the tags?
import re
tags = [re.search(r'<.+>', line) for line in mylist]
# mylist is the list of lines from open(filename, 'rU').readlines()
Thanks for your time.
Use something like this:
re.sub(r'<[^>]+>', '', open(filename, 'r').read())
Your current code is getting a None for each line that does not include angle-bracketed tags.
You probably want to use [^>] to make sure it matches only up to the first >.
re.sub(r'<.*?>', '', line)
Use re.sub with the <.*?> expression.
Well, for starters, you're going to need a different regex. The one you have will select everything between the first '<' and the last '>'. So the string:
I can type in <b>BOLD</b>
would render the match:
BOLD
The way to fix this is to use a lazy quantifier, which makes the match as short as possible. That is why you should be using
<.+?>
to match HTML tags. And ultimately, you should be substituting, so:
re.sub(r'<.+?>', '', line)
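To see the difference between the greedy and lazy versions side by side (my own comparison):
>>> line = 'I can type in <b>BOLD</b>'
>>> re.sub(r'<.+>', '', line)   # greedy: eats from the first < to the last >
'I can type in '
>>> re.sub(r'<.+?>', '', line)  # lazy: removes each tag separately
'I can type in BOLD'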
Though, I suspect what you'd actually like to match is between the tags. Here's where a good lookahead can do wonders!
(?<=>).+?(?=<)
Looks crazy, but it breaks down pretty easily. Let's start with what you know:
.+?
matches a string of arbitrary length. The ? means it will match the shortest string possible (the laziness we added before).
(?<=...)
is a lookbehind. It literally looks behind itself without capturing the expression.
(?=...)
is a lookahead. It's the same idea as a lookbehind, but it looks forward instead. Then with a little findall:
re.findall(r'(?<=>).+?(?=<)', line)
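On the bold example from earlier, this gives:
>>> re.findall(r'(?<=>).+?(?=<)', 'I can type in <b>BOLD</b>')
['BOLD']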
Now, you can iterate over the array and trim any unnecessary spaces that got left behind, making for some really nice output! Or, if you'd really like to use a substitution method (I know I would):
re.sub(r'\s*(?:<.+?>\s*)+', ' ', line)
the
\s*
will match any amount of whitespace attached to a tag, which you can then replace with one space, whittling down those unnerving double and triple spaces that often result from overly careful tagging. As a bonus, the
(?: ... )
is known as a non-capturing group (it won't give you smaller sub-matches in your result). It's not strictly necessary here, but groups are always useful to think about, and it's good practice to capture only the ones you need. Tacking a + onto the end (as I did) makes it consume as many adjacent tags as appear in a row, collapsing them into a single space. So if the file has
This is <b> <i> overemphasized </b> </i>!
you'd get
This is overemphasized !
instead of
This is   overemphasized   !

python regex to find any link that contains the text 'abc123'

I am using BeautifulSoup to find all href tags.
links = myhtml.findAll('a', href=re.compile('????'))
I need to find all links that have 'abc123' in the href text.
I need help with the regex; see ???? in my code snippet.
If 'abc123' is literally what you want to search for, anywhere in the href, then re.compile('abc123') as suggested by other answers is correct. If the actual string you want to match contains punctuation, e.g. 'abc123.com', then use instead
re.compile(re.escape('abc123.com'))
The re.escape part will "escape" any punctuation so that it's taken literally, just like alphanumerics are; without it, some punctuation gets interpreted in various ways by RE's engine, for example the dot ('.') in the above example would be taken as "any single character whatsoever", so re.compile('abc123.com') would match, e.g. 'abc123zcom' (and many other strings of a similar nature).
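A short illustration (my own) of what escaping changes:
>>> import re
>>> re.escape('abc123.com')
'abc123\\.com'
>>> bool(re.compile('abc123.com').search('abc123zcom'))
True
>>> bool(re.compile(re.escape('abc123.com')).search('abc123zcom'))
False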
"abc123" should give you what you want
if that doesn't work, than BS is probably using re.match in which case you would want ".*abc123.*"
If you want all the links containing 'abc123' you can simply put:
links = myhtml.findAll('a', href=re.compile('abc123'))
