I have been trying to split this string, but it only gives me the last character of the username I want. For example, in this dataset I want to separate the username from the actual message, but after running this code:
# how can we separate users from messages
users = []
messages = []
for message in df['user_message']:
    entry = re.split('([a-zA-Z]|[0-9])+#[0-9]+\\n', message)
    if entry[1:]:
        users.append(entry[1])
        messages.append(entry[2])
    else:
        users.append('notif')
        messages.append(entry[0])

df['user'] = users
df['message'] = messages
df.drop(columns=['user_message'], inplace=True)
df.head(30)
I only get the last character of each username in the new user column. Could someone please tell me why it only gives me the last character of the string I want to split, and how I can fix it? Thanks a lot.
Splitting is not really the string operation you want here. Instead, just use str.extract directly on the user_message column:
df["username"] = df["user_message"].str.extract(r'^([^#]+)')
This extracts the leading part of user_message, from the start of the string up to (but not including) the first # character.
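As a quick sanity check, here is what that looks like on a tiny throwaway frame (the sample message follows the same username#discriminator format as in the question):

import pandas as pd

# One-row toy frame using the question's column name
df = pd.DataFrame({"user_message": ["keikeo#2720\nAdded a recipient.\n\n\n"]})

df["username"] = df["user_message"].str.extract(r'^([^#]+)')
print(df["username"].iloc[0])  # keikeo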
You could do this a lot more simply by just using string.split() and setting maxsplit to 1; see the example below.
Note that regex is very useful, but it is also very easy to get incorrect results with it. I advise using an online regex validator if you really need it. As for your actual regex, the + is in the wrong place: you need to move it inside the group, because a repeated capturing group only keeps its last repetition (which is why you only see the last character). I used regex101.com for testing.
([a-zA-Z0-9]+)#[0-9]+\\n
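To see the difference, here is a minimal check of the corrected pattern with re.split (using the same sample line as the string.split() example below):

import re

line = "keikeo#2720\nAdded a recipient.\n\n\n"

# '+' inside the group: the whole username is captured, not just its last character
entry = re.split(r'([a-zA-Z0-9]+)#[0-9]+\n', line)
print(entry)     # ['', 'keikeo', 'Added a recipient.\n\n\n']
print(entry[1])  # keikeo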
string.split() example:
line = "keikeo#2720\nAdded a recipient.\n\n\n"
user, message = line.split('\n', maxsplit=1)
print(user)
print(message)
Okay, so basically I want the user to be able to input something like "quote python syntax and semantics", remove the word 'quote' and any other filler (for example, the command could be 'could you quote for me Python syntax and semantics'), then format it so that I can pass it to the Wikipedia article URL (in this case 'https://en.wikipedia.org/wiki/Python_syntax_and_semantics'), request it, and scrape the element(s) I want.
Any answer would be greatly appreciated.
Here's a simple example of doing this:
import re
msg = input() # Here give as input "quote python syntax and semantics"
repMsg = re.sub("quote", "", msg).strip() # Erase "quote" and space at the start
repMsg = re.sub(" ", "_", repMsg) # Replace spaces with _
print(repMsg) # prints "python_syntax_and_semantics"
The Python re module is very handy for this sort of thing. Note that you'll probably need to fine-tune your code, e.g. decide whether to replace the first occurrence or all occurrences, and at which point to strip whitespace, etc.
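For instance, here is a rough sketch along those lines that removes only the first 'quote' (via count=1), strips the surrounding filler, and builds the URL (the base URL is the one from the question; the filler-word handling is an assumption you will need to adapt to your own command format):

import re

msg = "could you quote for me Python syntax and semantics"

# Drop everything up to and including the first "quote", plus an optional
# "for me" filler - this filler handling is just an assumption for the demo.
topic = re.sub(r'^.*?quote\s+(?:for me\s+)?', '', msg, count=1).strip()

url = "https://en.wikipedia.org/wiki/" + re.sub(r'\s+', '_', topic)
print(url)  # https://en.wikipedia.org/wiki/Python_syntax_and_semantics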
Before I go on: there are many previously asked questions like this one, but none I have found are exactly like this.
I want to write a function that handles these cases correctly and gives the outputs shown:
assert get_mail("") == []
assert get_mail("rektor@kth.se") == ["rektor@kth.se"]
assert get_mail("Private mail is foo@gmail.com work mail is bar@corp.com.") == ["foo@gmail.com", "bar@corp.com"]
That is, it extracts the email addresses from the string given to the function and returns them in a list.
What I have come up with is this:
import re

def get_mail(text):
    email_list = []
    s = text
    match = re.findall(r'[\w\.-]+@[\w\.-]+', text)
    email_list.append(match)
    return email_list
My reasoning is this: I start with an empty list that the addresses should later be appended to. I call the text passed into get_mail() s, scan s to find only email addresses, and return the list they were appended to, but obviously something is wrong or missing.
There are two problems with your code. The first is that you have overcomplicated your function and it is returning a list within a list, which makes the assertions fail. A simpler get_mail() like the one below fixes your second assertion error:
def get_mail(text):
    match = re.findall(r'[\w\.-]+@[\w\.-]+', text)
    return match
The second problem is that your e-mail regex pattern is not optimal; in your third assertion test it extracts:
['foo@gmail.com', 'bar@corp.com.']
which is not the same as what you want:
['foo@gmail.com', 'bar@corp.com']
Do you see the difference? It is the single dot at the end of 'bar@corp.com.' that is causing the problem. It is not recommended that you write your own email regex pattern; find a well-tested one online instead. That will solve your third assertion error.
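For illustration only (this is a sketch that happens to satisfy the three assertions above, not a vetted email regex), you could require at least one dot-separated label after the @, which stops the match before a trailing dot:

import re

def get_mail(text):
    # "local@label(.label)+" - the trailing '.' after corp.com is not consumed,
    # because each extra label must contain at least one word character.
    return re.findall(r'[\w.+-]+@[\w-]+(?:\.[\w-]+)+', text)

assert get_mail("") == []
assert get_mail("rektor@kth.se") == ["rektor@kth.se"]
assert get_mail("Private mail is foo@gmail.com work mail is bar@corp.com.") == ["foo@gmail.com", "bar@corp.com"]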
I tried this code:
import re

contents = 'alokm.014@gmail.yahoo.com.....thankyou'
match = re.findall(r'[\w\.-]+@[\w\.-]+', contents)
print match
Result:
alokm.014@gmail.yahoo.com.....thankyou
I want to remove the .....thankyou from my email.
Is it possible to obtain only alokm.014@gmail.yahoo.com?
And one more thing: the contents are bigger than this, so I would like some changes to
re.findall(r'[\w\.-]+@[\w\.-]+', contents)
if that is possible.
I don't know about Python, but languages like Java have libraries that help validate URLs and email addresses. Alternatively, you can use a well-vetted regex.
My suggestion would be to keep removing the end of the string, dot by dot, until the string validates. So test the string, and if it doesn't validate as an email, read the string from the right until you encounter a period, then drop the period and everything to its right, and start again.
So you'd loop through like this:
alokm.014@gmail.yahoo.com.....thankyou
alokm.014@gmail.yahoo.com....
alokm.014@gmail.yahoo.com...
alokm.014@gmail.yahoo.com..
alokm.014@gmail.yahoo.com.
alokm.014@gmail.yahoo.com
At which point it would validate as a real email address. Yes, it's slow. Yes, it can be tricked. But it will work most of the time based on the little info (possible strings) given.
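In Python, that loop might look roughly like this (the validation regex here is only a hypothetical placeholder; swap in any well-vetted validator you prefer):

import re

# Hypothetical, deliberately simple validator: local part, one @, and a
# dot-separated domain ending in a 2+ letter TLD, anchored to the whole string.
VALID_EMAIL = re.compile(r'^[\w.+-]+@[\w-]+(?:\.[\w-]+)*\.[A-Za-z]{2,}$')

def trim_until_valid(candidate):
    # Keep dropping the last '.' and everything to its right until it validates
    while candidate and not VALID_EMAIL.match(candidate):
        if '.' not in candidate:
            return None  # nothing left to trim; give up
        candidate = candidate.rsplit('.', 1)[0]
    return candidate or None

print(trim_until_valid('alokm.014@gmail.yahoo.com.....thankyou'))
# alokm.014@gmail.yahoo.com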
Interesting question! Here's a Python regex program that makes it possible to extract the email from the contents:
import re

contents = 'alokm.014@gmail.yahoo.com.....thankyou'

emailRegex = re.compile(r'''
    [a-zA-Z0-9.]+       # username
    @                   # @ symbol
    [a-zA-Z0-9.]+\.com  # domain
    ''', re.VERBOSE)    # re.VERBOSE allows a multi-line pattern with comments for better readability

extractEmail = emailRegex.findall(contents)
print(extractEmail)
Output will be:
['alokm.014@gmail.yahoo.com']
I will now suggest that you refer to this Regex-HowTo doc to understand what's happening in this program and to come up with a better version that could extract all the emails from your larger text.
I'm trying to use regex in scrapy to find all email addresses on a page.
I'm using this code:
item["email"] = re.findall('[\w\.-]+#[\w\.-]+', response.body)
This works almost perfectly: it grabs all the emails and gives them to me. However, what I want is for it not to give me repeats, even when the same email address appears more than once on the page.
I'm getting responses like this (which is correct):
{'email': ['billy666@stanford.edu',
           'cantorfamilies@stanford.edu',
           'cantorfamilies@stanford.edu',
           'cantorfamilies@stanford.edu',
           'footer-stanford-logo@2x.png']}
However, I want to show only the unique addresses, which would be:
{'email': ['billy666@stanford.edu',
           'cantorfamilies@stanford.edu',
           'footer-stanford-logo@2x.png']}
If you could also throw in how to collect only the emails and not that
'footer-stanford-logo@2x.png'
that would be helpful too.
Thanks everyone!
Here is how you can get rid of the dupes and the 'footer-stanford-logo@2x.png'-like entries in your output:
import re
p = re.compile(r'[\w.-]+@(?![\w.-]*\.(?:png|jpe?g|gif)\b)[\w.-]+\b')
test_str = "{'email': ['billy666@stanford.edu',\n 'cantorfamilies@stanford.edu',\n 'cantorfamilies@stanford.edu',\n 'cantorfamilies@stanford.edu',\n 'footer-stanford-logo@2x.png']}"
print(set(p.findall(test_str)))
See the Python demo
The regex will look like
[\w.-]+@(?![\w.-]*\.(?:png|jpe?g|gif)\b)[\w.-]+\b
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^       ^^
See demo
The negative lookahead (?![\w.-]*\.(?:png|jpe?g|gif)\b) will disallow all matches with png, jpg, etc. extensions at the end of the word (\b is a word boundary, and in this case, it is a trailing word boundary).
Dupes can easily be removed with a set - it is the least troublesome part here.
FINAL SOLUTION:
item["email"] = set(re.findall(r'[\w.-]+#(?![\w.-]*\.(?:png|jpe?g|gif)\b)[\w.-]+\b', response.body))
item["email"] = set(re.findall('[\w\.-]+#[\w\.-]+', response.body))
Can't you just use a set instead of a list?
item["email"] = set(re.findall('[\w\.-]+#[\w\.-]+', response.body))
And if you really want a list then:
item["email"] = list(set(re.findall('[\w\.-]+#[\w\.-]+', response.body)))
This might be a silly question, but I'm just trying to learn!
I'm trying to build a simple email search tool to learn more about Python. I'm modifying some open-source code to parse the email addresses:
emails = re.findall(r'([A-Za-z0-9\.\+_-]+@[A-Za-z0-9\._-]+\.[a-zA-Z]*)', html)
Then I'm writing the results into a spreadsheet using the csv module.
Since I'd like to keep the domain extension open to almost anything, my results also include image files with an email-like format, for example:
forbes@2x-302019213j32.png
How can I exclude strings ending in "png" from the re.findall results?
Code:
def scrape(self, page):
    try:
        request = urllib2.Request(page.url.encode("utf8"))
        html = urllib2.urlopen(request).read()
    except Exception, e:
        return
    emails = re.findall(r'([A-Za-z0-9\.\+_-]+@[A-Za-z0-9\._-]+\.[a-zA-Z]*)', html)
    for email in emails:
        if email not in self.emails:  # if not a duplicate
            self.csvwriter.writerow([page.title.encode('utf8'), page.url.encode("utf8"), email])
            self.emails.append(email)
You are already acting only inside an if; just make it part of the if check. That will be much, much easier than trying to exclude it from the regex:
if email not in self.emails and not email.endswith("png"):  # if not a duplicate and not a .png
    self.csvwriter.writerow([page.title.encode('utf8'), page.url.encode("utf8"), email])
    self.emails.append(email)
I know Joran already gave you an answer, but here's another way to do it with Python regex that I find cool.
There is a (?!...) pattern, a negative lookahead, which essentially says: "at this point in the string, if this pattern would match, then the overall match fails."
If that was a bad explanation, the Python documentation does a much better job: https://docs.python.org/2/howto/regex.html#lookahead-assertions
Also, here is a working example:
import re

y = r'([A-Za-z0-9\.\+_-]+@[A-Za-z0-9\._-]+\.(?!png)[a-zA-Z]*)'
s = 'forbes@2x-302019213j32.png'
re.findall(y, s)   # Will return an empty list
s2 = 'myname@email2018529391230.net'
re.findall(y, s2)  # Will return a list containing the s2 string
s3 = s + ' ' + s2  # Concatenate the two email-formatted strings
re.findall(y, s3)  # Will only return the s2 string in the list
Lots of ways to do this, but my favorite is:
pat = re.compile(r'''
    [A-Za-z0-9\.\+_-]+  # 1+ word chars, dots, pluses, underscores or hyphens
    @[A-Za-z0-9\._-]+   # literal @ followed by more of the same
    \.png               # if it ends in .png, DON'T CAPTURE
    |([A-Za-z0-9\.\+_-]+@[A-Za-z0-9\._-]+\.[a-zA-Z]*)
                        # if not png, CAPTURE''', flags=re.X)
Since regexes are evaluated left-to-right, if a string starts to match, it will match the left side of the | first. If the string ends in .png, the left alternative consumes it but does NOT capture it. If it DOESN'T end in .png, the right side of the | consumes it and DOES capture it. For a more in-depth discussion of this trick, see here. To use this, do:
matches = filter(None,pat.findall(html))
Any string matched by the left side (e.g. all the png files that are matched but NOT part of a capturing group) will show up as an empty string in your findall. filter(None, iterable) removes all the empty strings from your iterable, leaving you with only the data you want.
Alternatively, you can filter after you grab everything
pat = re.compile(r'''[A-Za-z0-9\.\+_-]+@[A-Za-z0-9\._-]+\.[a-zA-Z]*''')
# same regex you have currently
matches = filter(lambda x: not x.endswith('png'), pat.findall(html))
Note that further on, you should really make self.emails a set. It doesn't seem to need to keep its ordering, and set lookup is WAY faster than list lookup. Remember to use set.add instead of list.append though.
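A tiny standalone sketch of what that change looks like (the sample data and the print are just stand-ins for your self.emails, csvwriter, and scraped results):

seen = set()                      # replaces the list used for self.emails
found = ['a@x.com', 'b@y.org', 'a@x.com', 'logo@2x.png']

for email in found:
    if email not in seen and not email.endswith('png'):
        print(email)              # stands in for csvwriter.writerow(...)
        seen.add(email)           # set.add instead of list.append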