Stripping URLs with regex - Python

I am trying to isolate the domain name for a database full of URLs, but I'm running into some regex problems.
Starting example:
examples = ['www2.chccs.k12.nc.us', 'wwwsco.com', 'www-152.aig.com', 'www.google.com']
Desired goal:
['chccs.k12.nc.us', 'sco.com', 'aig.com', 'google.com']
I've been trying a two stage process where I add in a "." before "www", then replace the "www.", but that doesn't quite lead to the results I'd like.
Any regex wizards out there able to help?
Thanks in advance!

import re

def extract(domain):
    # Strip a leading "www", any digits/hyphens after it, and the following dot
    return re.sub(r'^www[\d-]*\.?', '', domain)

examples = ['www2.chccs.k12.nc.us', 'wwwsco.com', 'www-152.aig.com', 'www.google.com']
result = [extract(d) for d in examples]
assert result == ['chccs.k12.nc.us', 'sco.com', 'aig.com', 'google.com'], result
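For reference, here is how the pattern behaves on each shape of input; the mail.google.com case is an extra example added here to show that hosts not starting with "www" pass through untouched:

```python
import re

# The pattern strips a leading "www", any run of digits or hyphens
# after it (www2, www-152), and an optional following dot.
pattern = re.compile(r'^www[\d-]*\.?')

hosts = ['www2.chccs.k12.nc.us', 'wwwsco.com', 'www-152.aig.com',
         'www.google.com', 'mail.google.com']
stripped = [pattern.sub('', h) for h in hosts]
print(stripped)
```

Note that 'wwwsco.com' becomes 'sco.com' because the dot after "www" is optional; if your data contains domains that legitimately begin with "www" (e.g. "wwwidgets.com"), this pattern would over-strip them.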

Related

Python Regex re.compile query

I'm trying to get a list of the required names from a list of names using a regex query.
FYI: I converted the countries from capital to small letters.
searchList:
['AU.LS1_james.aus',
'AU.LS1_scott.aus',
'AP.LS1_amanda.usa',
'AP.LS1_john.usa',
'LA.LS1_harsha.ind',
'LA.LS1_vardhan.ind',
'IECAU13_peter-tu.can',
'LONSA13_smith.gbp']
Format of the searchList: [(region)(Category)]_[EmployeeName].[country]
(region)(Category) is concatenated.
I'm trying to get a list of each group like this,
[
['AU.LS1_james.aus', 'AU.LS1_scott.aus'],
['AP.LS1_amanda.usa', 'AP.LS1_john.usa'],
['LA.LS1_harsha.ind', 'LA.LS1_vardhan.ind']
]
Using the following regex query: \<({region}).*\{country}\>
for region, country in regionCountry:
    query = f"\<({region}).*\{country}\>"
    r = re.compile(query)
    group = list(filter(r.match, searchList))
I tried re.search as well, but the group is always None
FYI: I also tried this query with the regex find functionality in Notepad++.
Can anyone tell where it's going wrong in my script? Thank you.
Without regex: use split and a dictionary to group the entries.
Data:
entries = ['AU.LS1_james.aus', 'AU.LS1_scott.aus', 'AP.LS1_amanda.usa', 'AP.LS1_john.usa', 'LA.LS1_harsha.ind', 'LA.LS1_vardhan.ind']
Solution 1: simple dict and setdefault
d = {}
for entry in entries:
    d.setdefault(entry.split('.', 1)[0], []).append(entry)
Solution 2: defaultdict
from collections import defaultdict

d = defaultdict(list)
for entry in entries:
    d[entry.split('.', 1)[0]].append(entry)
Result is in d.values()
>>> list(d.values())
[['AU.LS1_james.aus', 'AU.LS1_scott.aus'],
['AP.LS1_amanda.usa', 'AP.LS1_john.usa'],
['LA.LS1_harsha.ind', 'LA.LS1_vardhan.ind']]
I thank you all for trying to assist with my question. This answer worked out well for my usage. For some reason Python doesn't like \< and \>, so I just removed them and it worked fine. I didn't expect that there could be such differences between regex flavors in the re library.
Answer:
f"({region}).*{country}"
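For completeness, the regex route from the question can also be made to work in Python by replacing the \< and \> anchors (a Vim/Boost-style word-boundary syntax that Python's re does not support) with ^ and $. This is a sketch; the regionCountry pairs are inferred from the sample data:

```python
import re

searchList = ['AU.LS1_james.aus', 'AU.LS1_scott.aus',
              'AP.LS1_amanda.usa', 'AP.LS1_john.usa',
              'LA.LS1_harsha.ind', 'LA.LS1_vardhan.ind']
regionCountry = [('AU', 'aus'), ('AP', 'usa'), ('LA', 'ind')]

groups = []
for region, country in regionCountry:
    # ^ and $ replace \< and \>; the dots are escaped so they match literally
    r = re.compile(rf"^{region}\..*\.{country}$")
    groups.append([s for s in searchList if r.match(s)])
print(groups)
```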

Regular Expression, extract the number with a decimal place from API input

I was trying to extract the number with 2 decimal places from my API input. The data comes with text and commas, but I only need the number with a decimal point. I'm pretty sure this isn't the right way of using regex101. I'm a beginner in coding, so I don't have much knowledge about regular expressions.
1: {"symbol":"BTCUSDT","price":"34592.99000000"}
Attempt to extract: 34592.99000000 using regex101 "\d+........"
2: {"THB_BTC":{"id":1,"last":1102999.13,"lowestAsk":1102999.08,"highestBid":1100610.1,"percentChange":2.94,"baseVolume":202.54340749,"quoteVolume":221380256.57,"isFrozen":0,"high24hr":1108001,"low24hr":1061412.72,"change":31496.06,"prevClose":1102999.13,"prevOpen":1071503.07}}
Attempt to extract: 1102999.13 using regex101 "\d\d....."
These attempts only get me close but not 100% to the target, I believe there is a right way of doing this.
Here's my code:
import requests
import re
result = requests.get("https://api.binance.com/api/v3/ticker/price?symbol=BTCUSDT")
result1 = requests.get("https://api.bitkub.com/api/market/ticker/?sym=THB_BTC&lmt=10")
result.text
result1.text
api0 = re.compile(r"\d+........").findall(result.text)[0]
api1 = re.compile(r"\d\d.....").findall(result1.text)[0]
print(result.text)
print(result1.text)
If you have any advice, please share. I'd be highly appreciative in advance.
An easier and better way to do this, without regex
import requests
import re
result = requests.get("https://api.binance.com/api/v3/ticker/price?symbol=BTCUSDT").json()
result1 = requests.get("https://api.bitkub.com/api/market/ticker/?sym=THB_BTC&lmt=10").json()
data_1 = format(float(result['price']), '.2f')
data_2 = format(float(result1['THB_BTC']['last']), '.2f')
print(data_1, data_2)
34602.98 1101999.95
You can try something like this: change your regex to \d+\.\d+
import requests
import re
result = requests.get("https://api.binance.com/api/v3/ticker/price?symbol=BTCUSDT")
result1 = requests.get("https://api.bitkub.com/api/market/ticker/?sym=THB_BTC&lmt=10")
api0 = re.compile(r"\d+\.\d+").findall(result.text)[0]
api1 = re.compile(r"\d+\.\d+").findall(result1.text)[0]
print(result.text)
print(result1.text)
print(api0)
print(api1)
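Since both answers depend on live API responses, here is an offline check of the \d+\.\d+ pattern against the sample payloads quoted in the question (the second payload is truncated here for brevity):

```python
import re

payload0 = '{"symbol":"BTCUSDT","price":"34592.99000000"}'
payload1 = ('{"THB_BTC":{"id":1,"last":1102999.13,"lowestAsk":1102999.08,'
            '"highestBid":1100610.1}}')

# \d+\.\d+ requires digits on both sides of a literal dot, so integers
# like "id":1 are skipped and the first full decimal is returned.
decimal = re.compile(r"\d+\.\d+")
print(decimal.findall(payload0)[0])  # 34592.99000000
print(decimal.findall(payload1)[0])  # 1102999.13
```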

Python and regex: create a template

I need to find a lot of substrings in a string, but it takes a lot of time, so I need to combine them into one pattern:
I should find strings like
003.ru/%[KEYWORD]%
1click.ru/%[KEYWORD]%
3dnews.ru/%[KEYWORD]%
where % is any run of symbols
and [KEYWORD] can be one of ['sony%xperia', 'iphone', 'samsung%galaxy', 'lenovo_a706']
I tried to do the search with:
keywords = ['sony%xperia', 'iphone', 'samsung%galaxy', 'lenovo_a706']
for i, key in enumerate(keywords):
    coding['keyword_url'] = coding.url.apply(lambda x: x.replace('[KEYWORD]', key).replace('%', '[a-zA-Z0-9-_\.\?!##$%^&*+=]+') if '[KEYWORD]' in x else x.replace('%', '[a-zA-Z0-9-_\.\?!##$%^&*+=]+'))
    for (domain, keyword_url) in zip(coding.domain.values.tolist(), coding.keyword_url.values.tolist()):
        df.loc[df.event_address.str.contains(keyword_url), 'domain'] = domain
Where df contains only event_address (urls)
coding
domain url
003.ru 003.ru/%[KEYWORD]%
1CLICK 1click.ru/%[KEYWORD]%
33033.ru 33033.ru/%[KEYWORD]%
3D NEWS 3dnews.ru/%[KEYWORD]%
96telefonov.ru 96telefonov.ru/%[KEYWORD]%
How can I improve my pattern to do it faster?
First, you should consider using the re module. Look at the re.compile function to build your patterns once, and then you can match them repeatedly.
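The answer above can be sketched concretely: join all keywords into a single alternation, so each URL template compiles to one pattern instead of one pattern per (template, keyword) pair. This is a sketch under the question's own "% = any symbols" convention; the templates, keywords, and character class are taken from the question, and the sample URLs at the end are made up for illustration:

```python
import re

keywords = ['sony%xperia', 'iphone', 'samsung%galaxy', 'lenovo_a706']
templates = ['003.ru/%[KEYWORD]%', '1click.ru/%[KEYWORD]%',
             '3dnews.ru/%[KEYWORD]%']

# "% = any symbols", mirroring the character class used in the question
ANY = r'[a-zA-Z0-9_.?!#$%^&*+=-]+'

# All keywords collapse into one alternation group
alt = '(?:' + '|'.join(k.replace('%', ANY) for k in keywords) + ')'

# Substitute % first, then [KEYWORD], so the % inside ANY is untouched
patterns = [re.compile(t.replace('%', ANY).replace('[KEYWORD]', alt))
            for t in templates]

print(bool(patterns[0].match('003.ru/xx_sony_z_xperia_yy')))  # True
print(bool(patterns[0].match('003.ru/nokia_3310')))           # False
```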

Multiple distinct replaces using RegEx

I am trying to write some Python code that will replace some unwanted string using RegEx. The code I have written has been taken from another question on this site.
I have a text:
text_1=u'I\u2019m \u2018winning\u2019, I\u2019ve enjoyed none of it. That\u2019s why I\u2019m withdrawing from the market,\u201d wrote Arment.'
I want to remove all the \u2019m, \u2019s, \u2019ve, etc.
The code that I've written is given below:
rep={"\n":" ","\n\n":" ","\n\n\n":" ","\n\n\n\n":" ",u"\u201c":"", u"\u201d":"", u"\u2019[a-z]":"", u"\u2013":"", u"\u2018":""}
rep = dict((re.escape(k), v) for k, v in rep.items())
pattern = re.compile("|".join(rep.keys()))
text = pattern.sub(lambda m: rep[re.escape(m.group(0))], text_1)
The code works perfectly for:
u"\u201c":"", u"\u201d":"", u"\u2013":"", and u"\u2018":""
However, it doesn't work that great for:
u"\u2019[a-z]": the presence of [a-z] means re.escape turns it into \[a\-z\], which matches those literal characters instead of a letter range.
The output I am looking for is:
text_1=u'I winning, I enjoyed none of it. That why I withdrawing from the market,wrote Arment.'
How do I achieve this?
The information about the newlines completely changes the answer. For this, I think building the expression using a loop is actually less legible than just using better formatting in the pattern itself.
replacements = {'newlines': ' ',
                'deletions': ''}

pattern = re.compile(u'(?P<newlines>\n+)|'
                     u'(?P<deletions>\u201c|\u201d|\u2019[a-z]*|\u2013|\u2018)')

def lookup(match):
    return replacements[match.lastgroup]

text = pattern.sub(lookup, text_1)
The problem here is actually the escaping; this code does what you want more directly:
remove = (u"\u201c", u"\u201d", u"\u2019[a-z]*", u"\u2013", u"\u2018")
pattern = re.compile("|".join(remove))
text = pattern.sub("", text_1)
I've made the letters after \u2019 optional with [a-z]*, so two-letter suffixes like \u2019ve are removed whole, as your test string requires.
For completeness, I think I should also link to the Unidecode package which may actually be more closely what you're trying to achieve by removing these characters.
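A runnable check of the tuple-based approach against the asker's sample, using [a-z]* so multi-letter suffixes like \u2019ve are removed whole. Note the space that survives before "wrote" (the expected output in the question elides it):

```python
import re

text_1 = ('I\u2019m \u2018winning\u2019, I\u2019ve enjoyed none of it. '
          'That\u2019s why I\u2019m withdrawing from the market,\u201d '
          'wrote Arment.')

# Remove curly quotes, dashes, and a right quote plus any trailing letters
remove = ('\u201c', '\u201d', '\u2019[a-z]*', '\u2013', '\u2018')
text = re.compile('|'.join(remove)).sub('', text_1)
print(text)
# I winning, I enjoyed none of it. That why I withdrawing from the market, wrote Arment.
```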
The simplest way is this regex:
X = re.compile(r'(\\)(.*?) ')
text = re.sub(X, ' ', text_1)

python regex get first part of an email address

I am quite new to Python and regex, and I was wondering how to extract the first part of an email address up to the domain name. So for example if:
s='xjhgjg876896@domain.com'
I would like the regex result to be (taking into account all "sorts" of email ids i.e including numbers etc..):
xjhgjg876896
I get the idea of regex - as in I know I need to scan till "@" and then store the result - but I am unsure how to implement this in Python.
Thanks for your time.
You should just use the split method of strings:
s.split("@")[0]
As others have pointed out, the better solution is to use split.
If you're really keen on using regex, then this should work:
import re

regexStr = r'^([^@]+)@[^@]+$'
emailStr = 'foo@bar.baz'
matchobj = re.search(regexStr, emailStr)
if matchobj is not None:
    print(matchobj.group(1))
else:
    print("Did not match")
and it prints out
foo
NOTE: This is going to work only with email strings of the form SOMEONE@SOMETHING.TLD. If you want to match emails of the form NAME<SOMEONE@SOMETHING.TLD>, you need to adjust the regex.
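For the NAME<SOMEONE@SOMETHING.TLD> case mentioned in the note, the standard library's email.utils.parseaddr handles the angle-bracket form without a handwritten regex (a sketch, not part of the original answer; the sample address is made up):

```python
from email.utils import parseaddr

# parseaddr splits a display-name form into (name, address)
name, addr = parseaddr('Foo Bar <foo@bar.baz>')
print(addr.split('@')[0])  # foo
```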
You shouldn't use a regex or split.
local, at, domain = 'john.smith@example.org'.rpartition('@')
You have to use a proper RFC 5322 parser.
"@@@@@"@example.com is a valid email address, and semantically its local part ("@@@@@") is different from its username (@@@@@).
As of python3.6, you can use email.headerregistry:
from email.headerregistry import Address
s = 'xjhgjg876896@domain.com'
Address(addr_spec=s).username  # => 'xjhgjg876896'
#!/usr/bin/python3.6

def email_splitter(email):
    username = email.split('@')[0]
    domain = email.split('@')[1]
    domain_name = domain.split('.')[0]
    domain_type = domain.split('.')[1]
    print('Username : ', username)
    print('Domain : ', domain_name)
    print('Type : ', domain_type)

email_splitter('foo.goo@bar.com')
Output :
Username : foo.goo
Domain : bar
Type : com
Here is another way, using the index method.
s = 'xjhgjg876896@domain.com'
# Now let's find the location of the "@" sign
index = s.index("@")
# Next, get the string from the beginning up to the location of the "@" sign
s_id = s[:index]
print(s_id)
And the output is
xjhgjg876896
You need to install the package first:
pip install email_split
from email_split import email_split
email = email_split("ssss@ggh.com")
print(email.domain)
print(email.local)
Below should help you do it :
fromAddr = message.get('From').split('@')[1].rstrip('>')
fromAddr = fromAddr.split(' ')[0]
Good answers have already been posted, but I want to add mine anyway.
If I have an email john@gmail.com or john.joe@gmail.com, I want to get only "john".
So this is what I did:
name = recipient.split("@")[0]
name = name.split(".")[0]
print(name)
cheers
You can also try to use email_split.
from email_split import email_split
email = email_split('xjhgjg876896@domain.com')
email.local   # xjhgjg876896
email.domain  # domain.com
You can find more here https://pypi.org/project/email_split/ . Good luck :)
The following will return the contiguous text before the @:
re.findall(r'(\S+)@', s)
You can find all the words in the email and then return the first word.
import re

def returnUserName(email):
    return re.findall(r"\w*", email)[0]

print(returnUserName("johns123.ss@google.com"))   # Output is - johns123
print(returnUserName('xjhgjg876896@domain.com'))  # Output is - xjhgjg876896
