Erase duplicate emails - python

I'm trying to use regex in scrapy to find all email addresses on a page.
I'm using this code:
item["email"] = re.findall('[\w\.-]+#[\w\.-]+', response.body)
Which works almost perfectly: it grabs all the emails and gives them to me. However what I want is this: that it doesn't give me a repeat before it actually parses, even if there are more than one of the same email address.
I'm getting responses like this (which is correct):
{'email': ['billy666#stanford.edu',
'cantorfamilies#stanford.edu',
'cantorfamilies#stanford.edu',
'cantorfamilies#stanford.edu',
'footer-stanford-logo#2x.png']}
However I want to only show the unique addresses which would be
{'email': ['billy666#stanford.edu',
'cantorfamilies#stanford.edu',
'footer-stanford-logo#2x.png']}
If you want to throw in how to only collect the email and not that
'footer-stanford-logo#2x.png'
that is helpful also.
Thanks everyone!

Here is how you can get rid of the dupes and 'footer-stanford-logo#2x.png'-like thingies in your output:
import re
p = re.compile(r'[\w.-]+#(?![\w.-]*\.(?:png|jpe?g|gif)\b)[\w.-]+\b')
test_str = "{'email': ['billy666#stanford.edu',\n 'cantorfamilies#stanford.edu',\n 'cantorfamilies#stanford.edu',\n 'cantorfamilies#stanford.edu',\n 'footer-stanford-logo#2x.png']}"
print(set(p.findall(test_str)))
See the Python demo
The regex will look like
[\w.-]+#(?![\w.-]*\.(?:png|jpe?g|gif)\b)[\w.-]+\b
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ ^^
See demo
The negative lookahead (?![\w.-]*\.(?:png|jpe?g|gif)\b) will disallow all matches with png, jpg, etc. extensions at the end of the word (\b is a word boundary, and in this case, it is a trailing word boundary).
Dupes can easily be removed with a set - it is the least troublesome part here.
FINAL SOLUTION:
item["email"] = set(re.findall(r'[\w.-]+#(?![\w.-]*\.(?:png|jpe?g|gif)\b)[\w.-]+\b', response.body))

item["email"] = set(re.findall('[\w\.-]+#[\w\.-]+', response.body))

Can't you just use a set instead of a list?
item["email"] = set(re.findall('[\w\.-]+#[\w\.-]+', response.body))
And if you really want a list then:
item["email"] = list(set(re.findall('[\w\.-]+#[\w\.-]+', response.body)))

Related

How to use Regex to extract a string from a specific string until a specific symbol in python?

Question
Assume that I have a string like this:
example_text = 'b\'\\x08\\x13"\\\\https://www.example.com/link_1.html\\xd2\\x01`https://www.example.com/link_2.html\''
Expectation
And I want to only extract the first url, which is
output = "https://www.example.com/link_1.html"
I think using regex to find the url start from "https" and end up '\' will be a good solution.
If so, how can I write the regex pattern?
I try something like this:
`
re.findall("https://([^\\\\)]+)", example_text)
output = ['www.example.com/link_1.html', 'www.example.com/link_2.html']
But then, I need to add "https://" back and choose the first item in the return.
Is there any other solution?
You need to tweak your regex a bit.
What you were doing before:
https://([^\\\\)]+) this matches your link but only captures the part after https:// since you used the capturing token after that.
Updated Regex:
(https\:\/\/[^\\\\)]+) this matches the link and also captures the whole token (escaped special characters to avoid errors)
In Code:
import re
input = 'b\'\\x08\\x13"\\\\https://www.example.com/link_1.html\\xd2\\x01`https://www.example.com/link_2.html\''
print(re.findall("(https\:\/\/[^\\\\)]+)", input))
Output:
['https://www.example.com/link_1.html', "https://www.example.com/link_2.html'"]
You could also use (https\:\/\/([^\\\\)]+).html) to get the link with https:// and without it as a tuple. (this also avoids the ending ' that you might get in some links)
If you want only the first one, simply do output[0].
Try:
match = re.search(r"https://[^\\']+", example_text)
url = match.group()
print(url)
output:
https://www.example.com/link_1.html

Problem with re.split() , data extraction from a string (splitting a string)

I have been trying to split this string but it only gives me the last character of the username I want. for example
in this dataset I want to separate the username from the actual message but after doing this code-
#how can we separate users from messages
users = []
messages = []
for message in df['user_message']:
entry = re.split('([a-zA-Z]|[0-9])+#[0-9]+\\n', message)
if entry[1:]:
users.append(entry[1])
messages.append(entry[2])
else:
users.append('notif')
messages.append(entry[0])
df['user'] = users
df['message'] = messages
df.drop(columns=['user_message'], inplace = True)
df.head(30)
I only get
Could someone please tell me why it only gives me the last character of the string i want to split and how I can fix it? thanks a lot. This means a lot
Splitting is not really the string operation you want here. Instead, just use str.extract directly on the user_message column:
df["username"] = df["user_message"].str.extract(r'^([^#]+)')
The above logic will extract the leading part of the user message, from the beginning, until reaching the first hash symbol.
You could do this a lot simpler, by just using string.split() and setting the maxsplit to 1. See the example below.
Note that regex is very useful, but it's very easy to get incorrect results with it. I advise to use a online regex validator if you really need to use it. As for the actual regex, your + is in the wrong place. You need move it inside the group. I used regex101.com for testing...
([a-zA-Z0-9]+)#[0-9]+\\n
string.split() example:
line = "keikeo#2720\nAdded a recipient.\n\n\n"
user, message = line.split('\n', maxsplit=1)
print(user)
print(message)

Regex to find text after linebreak in URL

I want to use regex to get a part of the string. I want to remove the kerberos and everything after it and get the Username
import re
text = 'Kerberos://DME.DMS.WORLD.DMSHEN/Username'
reg1 = re.compile(r"^((Kerberos?|ftp):\/)?\/?([^:\/\s]+)((\/\w+)*\/)([\w\-\.]+[^#?\s]+)(.*)?(#[\w\-]+)?$",text)
print(reg1)
Output
Username
I am new to regex and tried this regex but it doesn't seem to work
Your regex works just fine, but I am assuming you would like to make most of the groups non-capturing (you can do that by adding ?: to each group.
It will give you the following:
re.match(r"^(?:(?:Kerberos?|ftp):\/)?\/?(?:[^:\/\s]+)(?:(\/\w+)*\/)(?P<u>[\w\-\.]+[^#?\s]+)(?:.*)?(?:#[\w\-]+)?$",t).group('u')
Also, for future reference, try using https://regex101.com/ , it has an easy way to test your regex + explanations on each part.
How about this simple one:
import re
text = 'Kerberos://DME.DMS.WORLD.DMSHEN/Username'
reg1 = re.findall(r"//.*/(.*)", text)
print(''.join(reg1))
# Username
If you want you can use split instead of regex
text = 'Kerberos://DME.DMS.WORLD.DMSHEN/Username'
m = text.split('/')[-1]
print m

How to get all emails from raw strings

I tried this code:
contents = 'alokm.014#gmail.yahoo.com.....thankyou'
match = re.findall(r'[\w\.-]+#[\w\.-]+', contents)
print match
Result:
alokm.014#gmail.yahoo.com.....thankyou
I want to remove ....thankyou from my email
Is it possible to obtain only alok.014#gmail.yahoo.com
and one more thing the content list is bigger so I want some changes in
re.findall(r'[\w\.-]+#[\w\.-]+', contents)
if it is possible.
I don't know about python, but languages like Java have libraries that help validate URLs and email addresses. Alternately, you can use a well-vetted regex expression.
My suggestion would be to keep removing the end of the string based on dots until the string validates. So test the string, and if it doesn't validate as an email, read the string from the right until you encounter a period, then drop the period and everything to the right and start again.
So you'd loop through like this
alokm.014#gmail.yahoo.com.....thankyou
alokm.014#gmail.yahoo.com....
alokm.014#gmail.yahoo.com...
alokm.014#gmail.yahoo.com..
alokm.014#gmail.yahoo.com.
alokm.014#gmail.yahoo.com
At which point it would validate as a real email address. Yes, it's slow. Yes, it can be tricked. But it will work most of the time based on the little info (possible strings) given.
Interesting question! And, here's a Python Regex program to help make extraction of email from the contents possible:
import re
contents = 'alokm.014#gmail.yahoo.com.....thankyou'
emailRegex = re.compile(r'''
[a-zA-Z0-9.]+ # username
# # # symbol
[a-zA-Z0-9.]+\.com # domain
''', re.VERBOSE) # re.VERBOSE helps make Regex multi-line with comments for better readability
extractEmail = emailRegex.findall(contents)
print(extractEmail)
Output will be:
['alokm.014#gmail.yahoo.com']
I will now suggest that you refer to this Regex-HowTo doc to understand what's happening in this program and to come up with a better version that could extract all the emails from your larger text.

extract URL from string in python

I want to extract a full URL from a string.
My code is:
import re
data = "ahahahttp://www.google.com/a.jpg>hhdhd"
print re.match(r'(ftp|http)://.*\.(jpg|png)$', data)
Output:
None
Expected Output
http://www.google.com/a.jpg
I found so many questions on StackOverflow, but none worked for me.
I have seen many posts and this is not a duplicate. Please help me! Thanks.
You were close!
Try this instead:
r'(ftp|http)://.*\.(jpg|png)'
You can visualize this here.
I would also make this non-greedy like this:
r'(ftp|http)://.*?\.(jpg|png)'
You can visualize this greedy vs. non-greedy behavior here and here.
By default, .* will match as much text as possible, but you want to match as little text as possible.
Your $ anchors the match at the end of the line, but the end of the URL is not the end of the line, in your example.
Another problem is that you're using re.match() and not re.search(). Using re.match() starts the match at the beginning of the string, and re.search() searches anywhere in the string. See here for more information.
You should use search instead of match.
import re
data = "ahahahttp://www.google.com/a.jpg>hhdhd"
url=re.search('(ftp|http)://.*\.(jpg|png)', data)
if url:
print url.group(0)
Find the start of the url by using find(http:// , ftp://) . Find the end of url using find(jpg , png). Now get the substring
data = "ahahahttp://www.google.com/a.jpg>hhdhd"
start = data.find('http://')
kk = data[start:]
end = kk.find('.jpg')
print kk[0:end+4]

Categories

Resources