Right now I'm "removing" emails from a list by mapping a new list excluding the things I don't want. This looked like:
pattern = re.compile(r'b\.com')
emails = ['user#a.com', 'user#b.com', 'user#c.com', 'user#d.com']
emails = [e for e in emails if pattern.search(e) is None]
# resulting list: ['user#a.com', 'user#c.com']
However, now I need to filter out multiple domains, so I have a list of domains that need to be filtered out.
pattern_list = ['b.com', 'c.com']
Is there a way to do this still in list comprehension form or am I going to have to revert back to nested for loops?
Note: splitting the string at the # and doing word[1] in pattern_list won't work because c.com needs to catch sub.c.com as well.
There are a few ways to do this, even without using a regex. One is:
[e for e in emails if not any(pat in e for pat in pattern_list)]
This will also exclude emails like user#crumb.com and bob.com#bob.com, but so does your original solution. It does not, however, exclude cases like user#bocom, which your existing solution does. Again, it's not clear if your existing solution actually does what you think it does.
Another possibility is to combine your patterns into one with rx = '|'.join(pattern_list) and then match on that regex. Again, though, you'll need to use a more complex regex if you want to only match b.com as a full domain (not as just part of the domain or as part of the username).
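A rough sketch of that combined-regex idea, assuming the addresses use # as the separator exactly as in the question and that a domain should also catch its subdomains. The sample addresses and the names blocked and kept are made up for illustration:

```python
import re

pattern_list = ['b.com', 'c.com']  # domains to exclude, as in the question
# Illustrative sample data (not from the question)
emails = ['user#a.com', 'user#b.com', 'user#sub.c.com',
          'user#abc.com', 'b.com#d.com']

# re.escape makes each '.' literal. The domain must either follow the '#'
# directly or follow a subdomain ending in '.', and must end the string.
alternatives = '|'.join(re.escape(d) for d in pattern_list)
blocked = re.compile(r'#(?:[^#]*\.)?(?:' + alternatives + r')$')

kept = [e for e in emails if blocked.search(e) is None]
print(kept)
```

Here user#abc.com and b.com#d.com are kept, because the blocked domain has to be the whole domain or a parent of it, not just a substring.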
import re
# escape the dots so they only match a literal '.'
pattern = re.compile(r'b\.com$|c\.com$')
emails = ['user#a.com', 'user#b.com', 'user#c.com', 'user#d.com']
emails = [e for e in emails if pattern.search(e) is None]
print(emails)
What about this?
I'm working with a Rest Api for finding address details. I pass it an address and it passes back details for that address: lat/long, suburb etc. I'm using the requests library with the json() method on the response and adding the json response to a list to analyse later.
What I'm finding is that when there is a single match for an address the 'FoundAddress' key in the json response contains a dictionary but when more than one match is found the 'FoundAddress' key contains a list of dictionaries.
The returned json looks something like:
For a single match:
{
'FoundAddress': {AddressDetails...}
}
For multiple matches:
{
'FoundAddress': [{Address1Details...}, {Address2Details...}]
}
I don't want to write code to handle a single match and then multiple matches.
How can I modify the 'FoundAddress' so that when there is a single match it changes it to a list with a single dictionary entry? Such that I get something like this:
{
'FoundAddress': [{AddressDetails...}]
}
If it's the external API sending responses in that format then you can't really change FoundAddress itself, since it will always arrive in that format.
You can change the response if you want to, since you have full control over what you've received:
r = response.json()  # response is the object returned by requests
fixed = r['FoundAddress'] if isinstance(r['FoundAddress'], list) else [r['FoundAddress']]
r['FoundAddress'] = fixed
Alternatively you can do the distinction at address usage time:
def func(foundAddress):
    # work with a single dictionary instance here
    ...
then:
result = list(map(func, r['FoundAddress'])) if isinstance(r['FoundAddress'], list) else [func(r['FoundAddress'])]
But honestly I'd take a clear:
if isinstance(r['FoundAddress'], list):
    result = list(map(func, r['FoundAddress']))
else:
    result = [func(r['FoundAddress'])]
or the response fix-up over the a if b else c one-liner any day.
If you can, I would just change the API. If you can't there's nothing magical you can do. You just have to handle the special case. You could probably do this in one place in your code with a function like:
def handle_found_addresses(found_addresses):
    if not isinstance(found_addresses, list):
        found_addresses = [found_addresses]
    ...
and then proceed from there to do whatever you do with found addresses as if the value is always a list with one or more items.
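For what it's worth, a minimal sketch of that normalisation applied to both response shapes. The helper name as_list and the 'suburb' values are placeholders, not part of the real API:

```python
def as_list(value):
    """Return value unchanged if it is already a list, else wrap it in one."""
    return value if isinstance(value, list) else [value]

# Placeholder responses standing in for the two shapes described above
single = {'FoundAddress': {'suburb': 'A'}}
multiple = {'FoundAddress': [{'suburb': 'A'}, {'suburb': 'B'}]}

for response in (single, multiple):
    response['FoundAddress'] = as_list(response['FoundAddress'])

print(single)    # 'FoundAddress' is now a one-item list
print(multiple)  # already a list, left as it was
```

After this step the rest of the code can loop over response['FoundAddress'] without caring how many matches came back.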
I'm creating a chat bot for Twitch. More specifically, I'm trying to keep a list that can be added to while the bot is running and that can also be read from within the channel chat. This is the overall code:
https://pastebin.com/maCbceaB
I'm focused on this portion of the code however:
clist = ["!add", ]
if message.strip() == "!add":
    chat(s, "Syntax: !add !<command> <what the command does>")
if message.strip().startswith("!add"):
    clist.append(message[5:])
    chat(s, "The command has been added!")
EDIT: I'm mostly focused on how to add to the list while the code is running, because clist will be used in:
if message.strip() == "!commands":
    chat(s, clist)
Currently this code only outputs ['!add'] when !commands is used.
All the options I've researched are typically for massive lists, and mine will consist mostly of strings, so I need something fairly simple.
If you want to check whether a string starts with a substring you can use the str.startswith method:
if message.startswith('!add'):
and then you can grab the command by removing the '!add ' part using a string slice:
message[5:]
Your code will be as follows:
>>> clist = []
>>> message = '!add testcommand'
>>> if message.startswith('!add'):
...     clist.append(message[5:])
...
>>> clist
['testcommand']
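One guess, which the posted snippet doesn't confirm: if clist = [...] sits inside the loop that reads chat messages, it gets reset on every message, which would explain why !commands only ever shows ['!add']. A small sketch with the list created once and passed in; handle_message is a made-up helper that returns the reply text instead of calling the bot's chat(s, ...):

```python
def handle_message(message, clist):
    """Return the reply for one chat message, updating clist in place."""
    text = message.strip()
    if text == "!add":
        return "Syntax: !add !<command> <what the command does>"
    if text.startswith("!add "):
        clist.append(text[5:])  # everything after '!add '
        return "The command has been added!"
    if text == "!commands":
        return ", ".join(clist)
    return None  # not a bot command

clist = ["!add"]  # created once, before any messages are processed
replies = [handle_message(m, clist)
           for m in ["!add !hello says hello", "!commands"]]
print(replies)
```

Checking for "!add " with the trailing space also keeps a bare !add from being stored as an empty command.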
I have data as follows:
data['url']
http://hostname.com/aaa/uploads/2013/11/a-b-c-d.jpg https://www.aaa.com/
http://hostname.com/bbb/uploads/2013/11/e-f-g-h.gif https://www.aaa.com/
http://hostname.com/ccc/uploads/2013/11/e-f-g-h.png http://hostname.com/ccc/uploads/2013/11/a-a-a-a.html
http://hostname.com/ddd/uploads/2013/11/w-e-r-t.ico
http://hostname.com/ddd/uploads/2013/11/r-t-y-u.aspx https://www.aaa.com/
http://hostname.com/bbb/uploads/2013/11/t-r-w-q.jpeg https://www.aaa.com/
I want to find the formats such as .jpg, .gif, .png, .ico, .aspx, .html, .jpeg and parse backwards from each one until a "/" is found. I also want to check for several occurrences throughout the string. My output should be:
data['parsed']
a-b-c-d
e-f-g-h
e-f-g-h a-a-a-a
w-e-r-t
r-t-y-u
t-r-w-q
Instead of writing individual commands for each of the formats, is there a way to write everything under a single command?
Can anybody help me write these commands? I am new to regex and any help would be appreciated.
This builds a list of (name, extension) pairs:
import re
results = []
for link in data:
    # re.search only finds the first file name in each string
    matches = re.search(r'/(\w-\w-\w-\w)\.(\w{2,})\b', link)
    if matches:
        results.append((matches.group(1), matches.group(2)))
This pattern returns the file names. I have used just one of your URLs to demonstrate; for more, you could simply append the matches to a list of results:
import re
url = "http://hostname.com/ccc/uploads/2013/11/e-f-g-h.png http://hostname.com/ccc/uploads/2013/11/a-a-a-a.html"
p = r'((?:[a-z]-){3}[a-z])\.'
matches = re.findall(p, url)
>>> print('\n'.join(matches))
e-f-g-h
a-a-a-a
There is the assumption that the urls all have the general form you provided.
You might try this:
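data['parse'] = re.findall(r'[^/\s]+\.[a-z]+(?=\s|$)', data['url'])
That will pick out all of the file names with their extensions from a single URL string. The lookahead (?=\s|$) requires whitespace or the end of the string after the extension, so the last file name is found too and host names followed by "/" are skipped. If you want to remove the extensions, the code above returns a list which you can then process with a list comprehension and re.sub like so:
[re.sub(r'\.[a-z]+$', '', exp) for exp in data['parse']]
Use the .join function to create a string, as demonstrated in Totem's answer.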
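To put the pieces together, here is a small sketch that handles all the listed formats with one pattern. It assumes the rows are plain Python strings; the names urls, pattern and parsed are made up here, and the extension list is simply the one from the question:

```python
import re

urls = [
    "http://hostname.com/aaa/uploads/2013/11/a-b-c-d.jpg https://www.aaa.com/",
    "http://hostname.com/ccc/uploads/2013/11/e-f-g-h.png "
    "http://hostname.com/ccc/uploads/2013/11/a-a-a-a.html",
    "http://hostname.com/ddd/uploads/2013/11/w-e-r-t.ico",
    "http://hostname.com/bbb/uploads/2013/11/t-r-w-q.jpeg https://www.aaa.com/",
]

# Capture everything between the last '/' and one of the known extensions.
pattern = re.compile(r'/([^/\s]+)\.(?:jpg|jpeg|gif|png|ico|aspx|html)\b')

# findall returns every captured name in a row; join them with a space.
parsed = [' '.join(pattern.findall(u)) for u in urls]
print(parsed)
```

Because only the listed extensions are accepted, hostname.com and www.aaa.com are not picked up as file names.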
I am attempting to isolate TLDs from giant lists of FQDNs using regex, without importing third-party modules, and am trying to determine whether there is a more elegant way of doing this. My way works but is a bit cumbersome for my liking.
Sample code:
import re
domains = ['x.sample1.com', 'y.sample2.org', 'z.sample3.biz']
temp = []
for domain in domains:
    temp.append(re.findall(r'\.[a-z0-9]+', domain, re.I))
tlds = []
for item in temp:
    for tld in item:
        tlds.append(tld)
It is inconvenient that re.findall returns a list, since it makes the iteration a whole level deeper than desired, but I am unsure how to get around this.
The "quick fix" is either to take the last item of each split domain:
domain.split('.')[-1]
Or, if you really don't care about the first matches, then don't match them at all:
re.search(r'\.[a-z0-9]+$', domain, re.I)
(Note the use of $ to match the end of string.)
HOWEVER, note that it's impossible to solve this problem properly with regex. For example, how can you know that the TLD for google.co.uk is co.uk, and not just uk?
The only full solution to this problem, unfortunately, is by using a library that implements the public suffix list - which is basically just a very long (manually updated) list of all TLDs. For example, in python: https://pypi.python.org/pypi/publicsuffix/
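As a sketch of the quick fix only (it has exactly the co.uk limitation described above), both the last-label approach and a flattened findall fit in a single comprehension each. The names last_labels, label_re and all_labels are just illustrative:

```python
import re

domains = ['x.sample1.com', 'y.sample2.org', 'z.sample3.biz']

# Last dot-separated label of each name; no regex needed.
last_labels = [d.rsplit('.', 1)[-1] for d in domains]

# If every dotted label is wanted, a nested comprehension flattens the
# per-domain lists that findall returns, replacing the two explicit loops.
label_re = re.compile(r'\.[a-z0-9]+', re.I)
all_labels = [label for d in domains for label in label_re.findall(d)]

print(last_labels)
print(all_labels)
```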
I have this regex for extracting emails which works fine:
([a-zA-Z][\w\.-]*[a-zA-Z0-9])#([a-zA-Z0-9][\w\.-]*[a-zA-Z0-9]\.[a-zA-Z][a-zA-Z\.]*[a-zA-Z])
however there are some e-mails I don't want to include like:
server#example.com
noreply#example.com
name#example.com
I've been trying to add things like ^(?!server|noreply|name) but it isn't working.
Also, will using parentheses as above affect the tuples of (name, domain)?
Just check for those email addresses after you extract them...
bad_addresses = ['server#example.com', 'noreply#example.com', 'name#example.com']
# with two groups in the pattern, re.findall returns (name, domain) tuples
emails = re.findall(r'([a-zA-Z][\w\.-]*[a-zA-Z0-9])#([a-zA-Z0-9][\w\.-]*[a-zA-Z0-9]\.[a-zA-Z][a-zA-Z\.]*[a-zA-Z])', contentwithemails)
for item in emails[:]:
    if '#'.join(item) in bad_addresses:
        emails.remove(item)
You have to loop over a slice of emails (emails[:]), because removing items from a list while iterating over it makes the loop skip elements. The slice creates a copy that can be read while the real list is modified.
Check the results from your regex for any emails that match the bad emails list.
results = list_from_your_regex
invalids = ['info', 'server', 'noreply', ...]
valid_emails = [good for good in results if good.split('#')[0] not in invalids]
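A self-contained sketch of the same idea that keeps the (name, domain) tuples and filters on the name part. The sample text and the names EMAIL, blocked_names and pairs are invented for illustration; the pattern is the one from the question, with # as the separator:

```python
import re

EMAIL = re.compile(
    r'([a-zA-Z][\w.-]*[a-zA-Z0-9])#'
    r'([a-zA-Z0-9][\w.-]*[a-zA-Z0-9]\.[a-zA-Z][a-zA-Z.]*[a-zA-Z])'
)

# Illustrative input, not from the question
text = 'contact alice#example.com or noreply#example.com, server#example.com'
blocked_names = {'server', 'noreply', 'name'}

# findall yields one (name, domain) tuple per address because the pattern
# has two groups; drop the tuples whose name part is blocked.
pairs = [(name, domain) for name, domain in EMAIL.findall(text)
         if name.lower() not in blocked_names]
print(pairs)
```

Filtering after extraction leaves the regex itself unchanged, so the tuples keep their (name, domain) shape.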