From the following link, I am trying to extract the longitude and latitude. I found similar posts but not one with the same format. I'm new to regex/text manipulation and would appreciate any guidance on how I might do this using Python. The output I'd like to have from this example is
latitude = 40.744221
longitude = -73.982854
Many thanks in advance.
https://maps.googleapis.com/maps/api/staticmap?scale=1&center=40.744221%2C-73.982854&language=en&zoom=15&markers=scale%3A1%7Cicon%3Ahttps%3A%2F%2Fyelp-images.s3.amazonaws.com%2Fassets%2Fmap-markers%2Fannotation_32x43.png%7C40.744221%2C-73.982854&client=gme-yelp&sensor=false&size=315x150&signature=OjixVjNCwF7yLR5tsYw2fDRZ7bw
Python has a module for parsing URLs in the standard library
from urllib import parse
# Split off the query string (urlsplit is the documented API; splitquery is deprecated)
query_string = parse.urlsplit("https://maps.googleapis.com/maps/api/staticmap?scale=1&center=40.744221%2C-73.982854&language=en&zoom=15&markers=scale%3A1%7Cicon%3Ahttps%3A%2F%2Fyelp-images.s3.amazonaws.com%2Fassets%2Fmap-markers%2Fannotation_32x43.png%7C40.744221%2C-73.982854&client=gme-yelp&sensor=false&size=315x150&signature=OjixVjNCwF7yLR5tsYw2fDRZ7bw").query
# Parse the query into a dict
query = parse.parse_qs(query_string)
# You can now access the query using a dict lookup
latlng = query["center"]
# And to get the values (selecting 0 as it is valid for a query string to contain the same key multiple times).
latitude, longitude = latlng[0].split(",")
For this use case I would avoid regular expressions. The urllib module is more explicit, handles all aspects of URL encoding, and is well tested.
Another good third-party module for handling URLs is YARL.
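As a rough sketch of how the urllib approach might be wrapped up for reuse: the helper name extract_center is made up for illustration, the URL is a shortened version of the one in the question, and the code assumes the coordinates always sit in a center parameter of the form lat,lng.

```python
from urllib.parse import parse_qs, urlsplit


def extract_center(url):
    """Return (latitude, longitude) as floats from a URL's center parameter.

    Hypothetical helper: assumes center holds "lat,lng".
    """
    query = parse_qs(urlsplit(url).query)
    # parse_qs decodes %2C to "," and returns a list of values per key
    lat, lng = query["center"][0].split(",")
    return float(lat), float(lng)


# Shortened version of the URL from the question
url = (
    "https://maps.googleapis.com/maps/api/staticmap?scale=1"
    "&center=40.744221%2C-73.982854&language=en&zoom=15"
)
latitude, longitude = extract_center(url)
print(latitude, longitude)
```

A missing center parameter raises KeyError here, which is arguably easier to debug than a regex that silently fails to match.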
I'm guessing that this expression might return our desired output:
center=(-?\d+\.\d+)%2C(-?\d+\.\d+)
Test with re.findall
import re
regex = r"center=(-?\d+\.\d+)%2C(-?\d+\.\d+)"
test_str = "https://maps.googleapis.com/maps/api/staticmap?scale=1&center=40.744221%2C-73.982854&language=en&zoom=15&markers=scale%3A1%7Cicon%3Ahttps%3A%2F%2Fyelp-images.s3.amazonaws.com%2Fassets%2Fmap-markers%2Fannotation_32x43.png%7C40.744221%2C-73.982854&client=gme-yelp&sensor=false&size=315x150&signature=OjixVjNCwF7yLR5tsYw2fDRZ7bw"
print(re.findall(regex, test_str))
Test with re.finditer
import re
regex = r"center=(-?\d+\.\d+)%2C(-?\d+\.\d+)"
test_str = "https://maps.googleapis.com/maps/api/staticmap?scale=1&center=40.744221%2C-73.982854&language=en&zoom=15&markers=scale%3A1%7Cicon%3Ahttps%3A%2F%2Fyelp-images.s3.amazonaws.com%2Fassets%2Fmap-markers%2Fannotation_32x43.png%7C40.744221%2C-73.982854&client=gme-yelp&sensor=false&size=315x150&signature=OjixVjNCwF7yLR5tsYw2fDRZ7bw"
matches = re.finditer(regex, test_str, re.MULTILINE)
for matchNum, match in enumerate(matches, start=1):
    print("Match {matchNum} was found at {start}-{end}: {match}".format(matchNum=matchNum, start=match.start(), end=match.end(), match=match.group()))
    for groupNum in range(0, len(match.groups())):
        groupNum = groupNum + 1
        print("Group {groupNum} found at {start}-{end}: {group}".format(groupNum=groupNum, start=match.start(groupNum), end=match.end(groupNum), group=match.group(groupNum)))
The expression is explained on the top right panel of this demo, if you wish to explore/simplify/modify it, and in this link, you can watch how it would match against some sample inputs step by step, if you like.
Use a simple re.search on the string with tuple unpacking:
latitude, longitude = re.search(r'center=(.*?)%2C(.*?)&', s).groups()
where s is your string (link).
Example:
import re
s = 'https://maps.googleapis.com/maps/api/staticmap?scale=1&center=40.744221%2C-73.982854&language=en&zoom=15&markers=scale%3A1%7Cicon%3Ahttps%3A%2F%2Fyelp-images.s3.amazonaws.com%2Fassets%2Fmap-markers%2Fannotation_32x43.png%7C40.744221%2C-73.982854&client=gme-yelp&sensor=false&size=315x150&signature=OjixVjNCwF7yLR5tsYw2fDRZ7bw'
latitude, longitude = re.search(r'center=(.*?)%2C(.*?)&', s).groups()
print(latitude) # 40.744221
print(longitude) # -73.982854
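One caveat with calling .groups() directly on the result: re.search returns None when nothing matches, so a URL without a center parameter raises AttributeError. A minimal guarded sketch follows; the function name find_center and the example.com URLs are made up for illustration.

```python
import re

CENTER_RE = re.compile(r"center=(-?\d+\.\d+)%2C(-?\d+\.\d+)")


def find_center(url):
    """Return (lat, lng) as strings, or None if no center parameter is found."""
    match = CENTER_RE.search(url)
    if match is None:
        return None
    return match.groups()


# Hypothetical URLs, used only to show both outcomes
print(find_center("https://example.com/staticmap?center=40.744221%2C-73.982854&zoom=15"))
print(find_center("https://example.com/staticmap?zoom=15"))
```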
I want to use this regex
r"Summe\d+\W\d+"
to match this string
150,90‡50,90‡8,13‡Summe50,90•50,90•8,13•Kreditkartenzahlung
but I want to only filter out this specific part
Summe50,90
I can select the entire string with this regex but I'm not sure how to filter out only the matching part
Here is the function it is in, where I am trying to get the amount from a PDF:
def get_amount(url):
    data = requests.get(url)
    with open('/Users/derricdonehoo/code/derric-d/price-processor/exmpl.pdf', 'wb') as f:
        f.write(data.content)
    pdfFileObj = open('exmpl.pdf', 'rb')
    pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
    pageObj = pdfReader.getPage(0)
    text = pageObj.extractText().split()
    regex = re.compile(r"Summe\d+\W\d+")
    matches = list(filter(regex.search, text))
    for i in range(len(matches)):
        matchString = '\n'.join(matches)
        print(matchString)
As described above, I would like guidance on how best to extract just the matching portion of this string, preferably allowing for varying numbers of characters on either side, but that's not a priority.
thanks!!
My guess is that this simple expression might work here:
(Summe.+?)•
Test
import re
regex = r"(Summe.+?)•"
test_str = "150,90‡50,90‡8,13‡Summe50,90•50,90•8,13•Kreditkartenzahlung"
matches = re.finditer(regex, test_str, re.MULTILINE)
for matchNum, match in enumerate(matches, start=1):
    print("Match {matchNum} was found at {start}-{end}: {match}".format(matchNum=matchNum, start=match.start(), end=match.end(), match=match.group()))
    for groupNum in range(0, len(match.groups())):
        groupNum = groupNum + 1
        print("Group {groupNum} found at {start}-{end}: {group}".format(groupNum=groupNum, start=match.start(groupNum), end=match.end(groupNum), group=match.group(groupNum)))
What you are missing is a convenient way to "grab" your match.
import re
text = "150,90‡50,90‡8,13‡Summe50,90•50,90•8,13•Kreditkartenzahlung"
match = re.search(r"Summe\d+\W\d+", text)
if match:
    res = match.group()
    print(res)  # Summe50,90
Note that group accepts an index to return a group from inside your regex, but since this one doesn't use groups (which are surrounded by (...) in your regex) you simply call it without arguments.
If you want to find all occurrences of said pattern, use re.findall:
import re
text = "150,90‡50,90‡8,13‡Summe50,90•50,90•Summe8,13•Kreditkartenzahlung"
matches = re.findall(r"Summe\d+\W\d+", text)
print(matches)  # ['Summe50,90', 'Summe8,13']
In this case a list with all matches (already strings, not Match objects) will be returned. Again, if you use capture groups, a list of tuples will be returned where each tuple contains all the groups for a match.
Read about the methods - re.search and re.findall
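To make the point about capture groups concrete, here is a small sketch on the same sample text. The second pattern is only an illustrative variant of the original, with parentheses added around the two numeric parts.

```python
import re

text = "150,90‡50,90‡8,13‡Summe50,90•50,90•Summe8,13•Kreditkartenzahlung"

# No capture groups: findall returns the whole matches as strings
whole = re.findall(r"Summe\d+\W\d+", text)

# Two capture groups: findall returns one tuple of groups per match
parts = re.findall(r"Summe(\d+)\W(\d+)", text)

print(whole)  # ['Summe50,90', 'Summe8,13']
print(parts)  # [('50', '90'), ('8', '13')]
```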
This is what you want. Your regex is correct, but you must get the match after searching for it.
import re
regex = re.compile(r"Summe\d+\W\d+")
text = ["150,90‡50,90‡8,13‡Summe50,90•50,90•8,13•Kreditkartenzahlung"]
matches = []
for t in text:
    m = regex.search(t)
    if m:
        matches.append(m.group(0))
print(matches)
re.search returns a Match object on success, None on failure, and that object contains all the information about your matching regex. To get the whole match you call Match.group().
\W matches any single non-word character, so it would also accept separators such as • or ‡ in place of the comma. To match the comma explicitly,
regex = r'Summe\d+,\d{2}'
should match the first 50,90 after Summe.
If the separating comma is too specific (because it might come as a dot) you can use a character set:
regex = r'Summe\d+[,.]\d{2}'
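If the goal is the amount itself rather than the text Summe50,90, one possible sketch is to capture the number and convert it. The capture group and the use of Decimal are assumptions for illustration, not part of the original question.

```python
import re
from decimal import Decimal

text = "150,90‡50,90‡8,13‡Summe50,90•50,90•8,13•Kreditkartenzahlung"

# Capture only the number that follows "Summe"
match = re.search(r"Summe(\d+[,.]\d{2})", text)
amount = None
if match:
    # Normalise a decimal comma to a dot before converting
    amount = Decimal(match.group(1).replace(",", "."))

print(amount)  # 50.90
```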
I am trying to get all the substrings that are enclosed by certain delimiters. My issue is that I also need the delimiter that ends one match to start the next one. The strings need to be between any of these characters: . / ? = - _
I have tried this regular expression
pattern = re.compile(r"""[./?=\-_][^./?=\-_]+[./?=\-_]""")
In this example:
-facebook=chat.messenger?
I am not able to get the substring =chat.
I am only getting -facebook= and .messenger?
Looks like the overlap is what's causing the trouble: re.findall only returns non-overlapping matches. If you use the third-party regex module, you can do
import regex
delimiters = r'[./?=\-_]'
pattern = delimiters + r'[a-z]+' + delimiters
s = '-facebook=chat.messenger?'
print(regex.findall(pattern, s, overlapped=True))
# ['-facebook=', '=chat.', '.messenger?']
Notice that this assumes all characters are lowercase with [a-z], and that [./?=\-_] is the list of delimiters you specified.
Hope this helps!
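If installing a third-party package is not an option, a common alternative with the standard re module is to wrap the pattern in a lookahead with a capture group. This is a sketch of that technique using the asker's original pattern, not something from the answer above.

```python
import re

s = "-facebook=chat.messenger?"

# The lookahead consumes no characters, so the scan can start the next
# match on the delimiter that closed the previous one; the capture
# group is what findall returns.
pattern = r"(?=([./?=\-_][^./?=\-_]+[./?=\-_]))"
result = re.findall(pattern, s)

print(result)  # ['-facebook=', '=chat.', '.messenger?']
```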
My guess is that this expression might be what we might want to start with:
((?:[/?=_–.-])([a-z]+)(?:[/?=_–.-]))|([a-z]+)
Test
# coding=utf8
# the above tag defines encoding for this document and is for Python 2.x compatibility
import re
regex = r"((?:[/?=_–.-])([a-z]+)(?:[/?=_–.-]))|([a-z]+)"
test_str = "-facebook=chat.messenger?"
matches = re.finditer(regex, test_str, re.MULTILINE)
for matchNum, match in enumerate(matches, start=1):
    print("Match {matchNum} was found at {start}-{end}: {match}".format(matchNum=matchNum, start=match.start(), end=match.end(), match=match.group()))
    for groupNum in range(0, len(match.groups())):
        groupNum = groupNum + 1
        print("Group {groupNum} found at {start}-{end}: {group}".format(groupNum=groupNum, start=match.start(groupNum), end=match.end(groupNum), group=match.group(groupNum)))
# Note: for Python 2.7 compatibility, use ur"" to prefix the regex and u"" to prefix the test string and substitution.
I have a regex in my django code but I don't know what it means actually. Here is my regex :
r'^email/(?P<email>[^#\s]+#[^#\s]+\.[^#\s]+)/$',
Could you give me some examples which match with this regex?
You can visualize your expressions in jex.im, and test or modify them in regex101.com.
Basically, your expression would match:
email/<part1>#<part2>.<part3>/
where each part is one or more characters other than # and whitespace.
If you wish to match:
myurl/email/blabla#blabla.com
You can modify it to:
myurl\/email\/([^#\s]+#[^#\s]+\.[^#\s]+)
Python Test
# coding=utf8
# the above tag defines encoding for this document and is for Python 2.x compatibility
import re
regex = r"myurl\/email\/([^#\s]+#[^#\s]+\.[^#\s]+)"
test_str = "myurl/email/blabla#blabla.com"
matches = re.finditer(regex, test_str, re.MULTILINE)
for matchNum, match in enumerate(matches, start=1):
    print("Match {matchNum} was found at {start}-{end}: {match}".format(matchNum=matchNum, start=match.start(), end=match.end(), match=match.group()))
    for groupNum in range(0, len(match.groups())):
        groupNum = groupNum + 1
        print("Group {groupNum} found at {start}-{end}: {group}".format(groupNum=groupNum, start=match.start(groupNum), end=match.end(groupNum), group=match.group(groupNum)))
# Note: for Python 2.7 compatibility, use ur"" to prefix the regex and u"" to prefix the test string and substitution.
In addition:
r'^email/(?P<email>[^#\s]+#[^#\s]+\.[^#\s]+)/$'
This regex is used in a Django URL pattern.
URL example: email/test#gmail.com/
^email/ = the constant prefix at the start of your URL
[^#\s] = any character except # and whitespace (\s)
#[^#\s] = a literal #, followed by any character except # and whitespace (\s)
\. = matches a literal "."
[^#\s] = any character except # and whitespace (\s)
+ = one or more of the preceding character class
/$ = a trailing slash at the end of the URL
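A quick way to check this breakdown is to try the pattern on a few strings with plain re, outside Django. The candidate strings below are made up for illustration.

```python
import re

pattern = re.compile(r"^email/(?P<email>[^#\s]+#[^#\s]+\.[^#\s]+)/$")

# Hypothetical inputs: only the first one should match
candidates = [
    "email/test#gmail.com/",  # matches
    "email/test#gmail.com",   # rejected: no trailing slash
    "email/test gmail.com/",  # rejected: no '#', and contains a space
    "email/a#b#c.com/",       # rejected: a second '#' before the dot
]

results = {c: pattern.match(c) is not None for c in candidates}
for candidate, matched in results.items():
    print(candidate, "->", matched)

# The named group holds the captured address
captured = pattern.match("email/test#gmail.com/").group("email")
print(captured)  # test#gmail.com
```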
I'm still new at Regex, and I've been trying to implement a Gmail validation algorithm in my Python program.
This is my Regex
mail_address = "hello.89#gmail.com"
result = re.findall(r'\w+[\w.]+(#gmail.com){1}', mail_address)
print (str(result))
The first char must be alphanumeric (\w+), from there it catches every set of chars ([\w.]+), followed by only one instance of #gmail.com
This is what it prints:
['#gmail.com']
But it should print
['hello.89#gmail.com']
What am I doing wrong?
EDIT: Here's the Regex I chose:
\A(\w+[\w.]+#gmail\.com)\Z
Just alter the parentheses so that it includes all of your desired output:
result = re.findall(r'(\w+[\w.]+#gmail.com)', mail_address)
I have slightly altered your expression so that the #gmail.com part is now a plain literal rather than a group of its own. Additionally, you don't need to convert the result to a string, and you don't need to repeat a group just once.
That being said, in the end, you'd end up having:
import re
mail_address = "hello.89#gmail.com"
result = re.findall(r'(\w+[\w.]+#gmail.com)', mail_address)
print (result)
# ['hello.89#gmail.com']
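The underlying behaviour is that re.findall returns the captured group instead of the whole match whenever the pattern contains a capture group. The sketch below shows this side by side; the non-capturing variant and the re.fullmatch check are additions for illustration, not taken from the answers.

```python
import re

mail_address = "hello.89#gmail.com"

# One capturing group: findall returns only that group's text
with_group = re.findall(r"\w+[\w.]+(#gmail.com){1}", mail_address)

# Non-capturing group (?:...): findall returns the whole match
without_group = re.findall(r"\w+[\w.]+(?:#gmail\.com)", mail_address)

# For validation, fullmatch requires the entire string to match
is_valid = re.fullmatch(r"\w+[\w.]+#gmail\.com", mail_address) is not None

print(with_group)     # ['#gmail.com']
print(without_group)  # ['hello.89#gmail.com']
print(is_valid)       # True
```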
The problem is the parentheses, as Jan mentioned. Your regex can also be simplified to this:
result = re.findall(r'(\w+[\w.]+#gmail.com)', mail_address)
Demo: https://regex101.com/r/Z5EGbZ/1
The {1} quantifier after #gmail.com is redundant.
This should work, using only your regex:
import re
regex = r"\w+[\w.]+(#gmail.com){1}"
test_str = "hello.89#gmail.com"
matches = re.finditer(regex, test_str)
for matchNum, match in enumerate(matches, start=1):
    print("Match {matchNum} was found at {start}-{end}: {match}".format(matchNum=matchNum, start=match.start(), end=match.end(), match=match.group()))
I have a list of files matching the patterns sub-*_task-XYZabc_run-*_bold.json and sub-*_task-PQRghu_bold.json, for example:
sub-03_task-dis_run-01_bold.json
sub-03_task-dis_run-02_bold.json
sub-03_task-dis_run-03_bold.json
sub-03_task-dis_run-04_bold.json
sub-03_task-dis_run-05_bold.json
sub-03_task-dis_run-06_bold.json
sub-03_task-fb_run-01_bold.json
sub-03_task-fb_run-02_bold.json
sub-03_task-fb_run-03_bold.json
sub-03_task-fb_run-04_bold.json
I intend to find all different task names from the filename. In the above example, dis and fb are the two tasks.
What kind of regex should I use to find TASKNAME from task-TASKNAME in a given filename?
The following regex should do it:
(?<=task-).*?(?=_)
In Python:
import re
regex = r"(?<=task-).*?(?=_)"
# Named test_str rather than str, to avoid shadowing the built-in
test_str = """sub-03_task-dis_run-01_bold.json
sub-03_task-dis_run-02_bold.json
sub-03_task-dis_run-03_bold.json
sub-03_task-dis_run-04_bold.json
sub-03_task-dis_run-05_bold.json
sub-03_task-dis_run-06_bold.json
sub-03_task-fb_run-01_bold.json
sub-03_task-fb_run-02_bold.json
sub-03_task-fb_run-03_bold.json
sub-03_task-fb_run-04_bold.json"""
matches = re.finditer(regex, test_str)
for match in matches:
    print(match.group())
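Since the question asks for the different task names, a set is one way to drop the duplicates. This is a minimal sketch: it uses a capture group instead of lookarounds (an equivalent alternative, not the answer's exact regex) and only a few of the filenames from the question.

```python
import re

# A few of the filenames from the question
filenames = [
    "sub-03_task-dis_run-01_bold.json",
    "sub-03_task-dis_run-02_bold.json",
    "sub-03_task-fb_run-01_bold.json",
    "sub-03_task-fb_run-02_bold.json",
]

# Capture everything after "task-" up to the next underscore
task_re = re.compile(r"task-([^_]+)")

tasks = set()
for name in filenames:
    match = task_re.search(name)
    if match:
        tasks.add(match.group(1))

print(sorted(tasks))  # ['dis', 'fb']
```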