How to eliminate the ☎ unicode? - python

During web scraping and after getting rid of all html tags, I got the black telephone character \u260e in unicode (☎). But unlike this response I do want to get rid of it too.
I used the following regular expressions in Scrapy to eliminate html tags:
pattern = re.compile("<.*?>| |&",re.DOTALL|re.M)
Then I tried to match \u260e and I think I got caught by the backslash plague. I tried these patterns without success:
pattern = re.compile("<.*?>| |&|\u260e",re.DOTALL|re.M)
pattern = re.compile("<.*?>| |&|\\u260e",re.DOTALL|re.M)
pattern = re.compile("<.*?>| |&|\\\\u260e",re.DOTALL|re.M)
None of these worked and I still have \u260e in the output.
How can I make this disappear?

Using Python 2.7.3, the following works fine for me:
import re
pattern = re.compile(u"<.*?>| |&|\u260e",re.DOTALL|re.M)
s = u"bla ble \u260e blo"
re.sub(pattern, "", s)
Output:
u'bla ble blo'
As pointed out by @Zack, this works because the string is now unicode: the escape sequence \u260e has already been converted into the single character ☎ (code point U+260E), rather than staying as the six characters \, u, 2, 6, 0, e.
Once both the string to be searched and the regular expression contain the black phone itself, and not the sequence of characters \u260e, they match.
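For anyone on Python 3, here is a minimal sketch of the same idea. It assumes the scraped text is a regular str (always Unicode in Python 3), and the helper name strip_tags_and_phone is made up for illustration. The pattern only covers tags and the phone sign, not the other alternatives from the question.

```python
import re

PHONE = "\u260e"  # BLACK TELEPHONE: one code point in a Python 3 str

# Strip HTML tags and the telephone sign in a single pass.
pattern = re.compile("<.*?>|" + PHONE, re.DOTALL)

def strip_tags_and_phone(text):
    """Hypothetical helper: remove tags and U+260E from text."""
    return pattern.sub("", text)

cleaned = strip_tags_and_phone("<b>bla</b> ble \u260e blo")
print(repr(cleaned))
```

Note that the spaces on either side of the phone are left in place, so you may still want to collapse whitespace afterwards.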

If your string is already unicode, there are two easy ways. The second one will affect more than just the ☎, obviously.
>>> import string
>>> foo = u"Lorum ☎ Ipsum"
>>> foo.replace(u'☎', '')
u'Lorum Ipsum'
>>> "".join(s for s in foo if s in string.printable)
u'Lorum Ipsum'
See "Remove non-ascii characters but leave periods and spaces" for more information about string.printable, and "The SHORTEST way to remove multiple spaces in a string in Python" if you don't want runs of multiple spaces.
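Putting those two steps together, a small Python 3 sketch might look like this. The function names are invented for the example, and it assumes you really do want to drop every non-ASCII character, not just the phone.

```python
import string

def keep_printable_ascii(text):
    """Drop every character that is not in string.printable."""
    return "".join(ch for ch in text if ch in string.printable)

def collapse_whitespace(text):
    """Replace each run of whitespace with a single space."""
    return " ".join(text.split())

raw = "Lorum \u260e Ipsum"
step1 = keep_printable_ascii(raw)   # the two spaces around the phone remain
step2 = collapse_whitespace(step1)  # now a single space
print(repr(step1), repr(step2))
```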

You may try BeautifulSoup, as explained here, with something like
soup = BeautifulSoup(html.decode('utf-8', 'ignore'))

Related

Python regular expression help needed, multiple lines regex

I was trying to scrape a link out of an .eml file, but I always get None back from my search. I don't even get the link together with the surrounding confirm brackets; extracting the valid link would be no problem once that string is pulled.
One problem I see is that the string the regex should find spans multiple lines, but the regex itself seems to be valid.
Code/regex I use:
import re
def get_url(raw):
    # get rid of whitespace
    raw = raw.replace(' ', '')
    # search for the link
    url = re.search(r'href=3D(.*?)token([^\s]+)\W([^\s]+)\W([^\s]+)\W([^\s]+)\W([^\s]+)', raw).group(1)
    return url
First, the .eml is encoded in MIME quoted-printable (the hint is the = signs at the end of the lines). You should decode this first, instead of dealing with the encoded raw text.
Second, regex is overkill. Some nice string.split() usage will work just as well. Regex is extremely useful in its proper usage scenarios, but some simple Python can usually do the same without having to use regex's flavor of magic, which can be confusing as [REDACTED].
Note that if you're building a regex, it's always advised to use one of the gazillion regex editors, as these will help you build it. My personal favorite is regex101.
EDIT: added regex way to do it.
import quopri
import re
def get_url_by_regex(raw):
    decoded = quopri.decodestring(raw).decode("utf-8")
    return re.search('(<a href=")(.*?)(")', decoded).group(2)
def get_url(raw):
    decoded = quopri.decodestring(raw).decode("utf-8")
    for line in decoded.split('\n'):
        if 'token=' in line:
            return line.split('<a href="')[1].split('"')[0]
    return None # just in case this is needed
print(get_url(raw_email))
print(get_url_by_regex(raw_email))
The result is:
https://app.rule.io/subscriber/optIn?token=eyJ0eXAiOiJKV1QiLCJhbGciOiJIUzI1NiJ9.eyJzd[REST_OF_TOKEN_REDACTED]
https://app.rule.io/subscriber/optIn?token=eyJ0eXAiOiJKV1QiLCJhbGciOiJIUzI1NiJ9.eyJzd[REST_OF_TOKEN_REDACTED]
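If you want to try the decoding step without a real .eml at hand, here is a self-contained sketch. The sample bytes and the example.com URL are made up; only quopri.decodestring and re.search are real library calls. It also returns None instead of raising when nothing matches, which is probably what you want given the original "None" problem.

```python
import quopri
import re

# Made-up quoted-printable fragment: "=3D" encodes "=", and a "=" at the
# end of a line is a soft line break that joins the two physical lines.
sample_email = (
    b'<p>Please confirm:</p>\n'
    b'<a href=3D"https://example.com/optIn?token=3Dabc=\n'
    b'def">Confirm</a>\n'
)

def first_href(raw):
    """Hypothetical helper: decode quoted-printable bytes, return the first href or None."""
    decoded = quopri.decodestring(raw).decode("utf-8")
    match = re.search(r'<a href="([^"]*)"', decoded)
    return match.group(1) if match else None

print(first_href(sample_email))
```

Once decoded, the link is back on a single line, so the multi-line issue from the question goes away by itself.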

How to remove all unicode representations in python

I am trying to remove all representations of special characters in my document. For example, part of the document says "world\u2019s"; when I split this it gives ['world', '\u2019', 's'], but I need only the word (with the unicode character and the 's' removed).
I am already removing all punctuation, and this works on punctuation that is displayed normally, but not on these unicode representations.
I have also tried to use a regex to match everything that begins with a '\', but that doesn't seem to work either.
import re
string = "world\u2019s"
print (re.sub(r"\b([^\s]+)\\([^\s]+)\b",r'\1',str(string.encode('ascii', 'backslashreplace'), 'ascii')))
Output:
world
You can apply this to your whole document string; it should work.
import re
string = "world\u2019s h\u2018e"
print (re.sub(r"\b([^\s]+)\\([^\s]+)\b",r'\1',str(string.encode('ascii', 'backslashreplace'), 'ascii')))
Output:
world h
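An alternative sketch that skips the backslashreplace round trip: match from the first non-ASCII character up to the next whitespace and delete it. This assumes Python 3 and that the string holds the real characters (such as ’) rather than a literal backslash followed by u2019; the names below are made up for the example.

```python
import re

# From the first non-ASCII character up to (not including) the next whitespace.
NON_ASCII_TAIL = re.compile(r"[^\x00-\x7F]\S*")

def drop_non_ascii_tails(text):
    """Hypothetical helper: cut each word at its first non-ASCII character."""
    return NON_ASCII_TAIL.sub("", text)

result = drop_non_ascii_tails("world\u2019s h\u2018e")
print(result)
```

Be aware that this also truncates any legitimately accented word, so it only makes sense if the document is expected to be plain ASCII.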

Regex in python, repeated fragment finding

I am trying to use a regex to find elements like these in a text: abs=abs, 1=1, etc.
I wrote it this way:
opis="Some text abs=abs sfsdvc"
wyn=re.search('([\w]*)=\1',opis)
print(wyn.group(0))
This finds nothing, although the same pattern worked correctly when I tried it on websites like www.regexr.com.
Am I doing something wrong with Python's re?
You must specify the regex as a raw string r'..'
>>> opis="Some text abs=abs sfsdvc"
>>> wyn=re.search(r'([\w]*)=\1',opis)
>>> print wyn.group(0)
abs=abs
From re documentation
Raw string notation (r"text") keeps regular expressions sane. Without it, every backslash ('\') in a regular expression would have to be prefixed with another one to escape it. For example, the two following lines of code are functionally identical:
Meaning, if you are not planning to use a raw string, then every \ in the string must be escaped:
>>> opis="Some text abs=abs sfsdvc"
>>> wyn=re.search('([\\w]*)=\\1',opis)
>>> print wyn.group(0)
abs=abs
Change your regex to:
re.search(r'(\w+)=\1', opis).group()
Note the r prefix, which makes the pattern a raw string. You also don't really need a character class here: the [ and ] are redundant. It's also better to use \w+ so that a lonely equal sign "=" does not match.
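To see what actually goes wrong, here is a short Python 3 sketch. In a normal string literal, "\1" is turned into the control character U+0001 before re ever sees the pattern, which is why online testers (which receive the pattern text directly) behave differently.

```python
import re

opis = "Some text abs=abs sfsdvc"

# In a normal string literal, "\1" is an octal escape for the character U+0001,
# so the regex engine never sees a backreference.
assert "\1" == "\x01"
no_match = re.search("(\\w+)=\1", opis)

# In a raw string the backslash survives, and \1 refers back to group 1.
match = re.search(r"(\w+)=\1", opis)

print(no_match)
print(match.group())
```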

Regex Matching Unicode Characters acting oddly with different strings

Ok, I am doing a unicode regex match on some strings.
These are the strings in question. Not two separate lines, but two separate strings.
\u2018Mummy\u2019 Reboot May Get \u2018Mama\u2019 Director
\u2018Glee\u2019 Star Grant Gustin to Play The Flash in \u2018Arrow\u2019 Season 2
And I am using this regex to parse out the titles surrounded by unicode quotes.
regex = re.compile("\\u2018[^(?!\\u2018$)]*\\u2019",re.UNICODE)
Using regex.findall() returns
['u2018Mama\\u2019']
and
['u2018Glee\\u2019', 'u2018Arrow\\u2019']
This brings up two questions that I couldn't figure out. First, why isn't it returning \u2018? Where is the initial \?
Second, what is different between the two strings? I can't see it. Finally, I replaced \u2018 and \u2019 with '.
Then using this regex.
re.compile("'[^']*'")
It matches both in both strings. What is the difference here? What am I missing in the unicode regex?
Thank you in advance.
#coding=utf8
import re
s=u'''\u2018Mummy\u2019 Reboot May Get \u2018Mama\u2019 Director
\u2018Glee\u2019 Star Grant Gustin to Play The Flash in \u2018Arrow\u2019 Season 2'''
print s
regex = re.compile(ur"‘[^(?!‘$)]*’",re.UNICODE)
m = regex.findall(s)
print m
[u'\u2018Mummy\u2019', u'\u2018Mama\u2019', u'\u2018Glee\u2019', u'\u2018Arrow\u2019']
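A note on the pattern, plus a Python 3 sketch (the snippet above is Python 2 syntax: ur"..." literals and print statements). The [^(?!‘$)] part is a negated character class, so the (?!...) inside it is not a lookahead; it just excludes the literal characters (, ?, !, ‘, $ and ). In the question's byte-string version the class likely excluded the characters u, 2, 0, 1 and 8 as well, which would be consistent with "Mummy" (it contains a u) being skipped. A simpler class that excludes only the closing quote seems to be enough here; the variable names are made up for the example.

```python
import re

s = (
    "\u2018Mummy\u2019 Reboot May Get \u2018Mama\u2019 Director\n"
    "\u2018Glee\u2019 Star Grant Gustin to Play The Flash in \u2018Arrow\u2019 Season 2"
)

# Opening quote, then anything that is not a closing quote, then the closing quote.
quoted = re.compile("\u2018([^\u2019]*)\u2019")

# With a single capture group, findall returns just the text between the quotes.
titles = quoted.findall(s)
print(titles)
```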

Python split by regular expression

In Python, I am extracting emails from a string like so:
split = re.split(" ", string)
emails = []
pattern = re.compile(r"^[a-zA-Z0-9_\.-]+@[a-zA-Z0-9-]+.[a-zA-Z0-9-\.]+$")
for bit in split:
    result = pattern.match(bit)
    if result is not None:
        emails.append(bit)
And this works, as long as there is a space between the emails. But this might not always be the case. For example:
Hello, foo@foo.com
would return:
foo@foo.com
but take the following string:
I know my best friend mailto:foo@foo.com!
This would return nothing. So the question is: how can I use a regex as the delimiter to split on? I want to get
foo@foo.com
in all cases, regardless of punctuation next to it. Is this possible in Python?
By "splitting by regex" I mean that if the program encounters the pattern in a string, it will extract that part and put it into a list.
I'd say you're looking for re.findall:
>>> email_reg = re.compile(r'[a-zA-Z0-9_.-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+')
>>> email_reg.findall('I know my best friend mailto:foo@foo.com!')
['foo@foo.com']
Notice that findall can handle more than one email address:
>>> email_reg.findall('Text text foo@foo.com, text text, baz@baz.com!')
['foo@foo.com', 'baz@baz.com']
Use re.search or re.findall.
You also need to escape your expression properly (. needs to be escaped outside of character classes, not inside) and remove or replace the anchors ^ and $ (for example with \b), e.g.:
r"\b[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+\b"
The problem I see in your regex is your use of ^, which matches the start of the string, and $, which matches the end of the string. If you remove them and run it with your sample test case, it will work:
>>> re.findall(r"[A-Za-z0-9\._-]+@[A-Za-z0-9-]+.[A-Za-z0-9-\.]+","I know my best friend mailto:foo@foo.com!")
['foo@foo.com']
>>> re.findall(r"[A-Za-z0-9\._-]+@[A-Za-z0-9-]+.[A-Za-z0-9-\.]+","Hello, foo@foo.com")
['foo@foo.com']
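On the point about escaping the dot: a quick Python 3 sketch of the difference, using ordinary @-style addresses. The sample text and the names strict and loose are made up for the example, and neither pattern is meant as a complete email validator.

```python
import re

# Dot escaped: a literal "." must separate the host from the rest of the domain.
strict = re.compile(r"[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9.-]+")
# Dot unescaped: "." matches any single character at that position.
loose = re.compile(r"[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+.[a-zA-Z0-9.-]+")

text = "I know my best friend mailto:foo@foo.com! Not an address: id@hostXcom"

strict_hits = strict.findall(text)
loose_hits = loose.findall(text)
print(strict_hits)
print(loose_hits)
```

Both patterns pick up the real address, but only the unescaped version also accepts the dotless id@hostXcom.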
