How to extract certain pattern from a url using regex in Python?

I have a bunch of URLs like the ones below:
https://data.hova.com/strap/nik/sql_output1574414532.89.zip
https://data.hova.com/strap/asr/sql_output1574414532.89.zip
https://data.hova.com/strap/olr/sql_output1574414532.89.zip
I want to extract just the zip file name from each one, i.e. sql_output1574414532.89.zip in all three cases.
I could have used a simple split to get the file names, but notice that the directory name before the zip file changes (nik, asr, olr, and so on).
So I want to use a regex that only looks at whatever starts with sql and ends with zip.
This is what I did:
import re
string = "https://data.hova.com/strap/nik/sql_output1574414532.89.zip"
pattern = r'^sql\.zip$'
match = re.search(pattern, string)
print(match)
But the match comes as None. What am I doing wrong?

The pattern r'^sql\.zip$' matches only one string: "sql.zip". The ^ and $ anchors pin it to the whole input, and nothing between sql and \.zip allows for the rest of the file name.
For your purpose you need something like sql.+zip$ or, if sql may also occur in the URL before the file name, sql[^/]+zip$.
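As a minimal sketch, here is the second pattern applied to the URLs from the question. The helper name zip_name is made up for illustration, and the dot before zip is escaped here (an addition to the pattern above) so that it only matches a literal ".":

```python
import re

urls = [
    "https://data.hova.com/strap/nik/sql_output1574414532.89.zip",
    "https://data.hova.com/strap/asr/sql_output1574414532.89.zip",
    "https://data.hova.com/strap/olr/sql_output1574414532.89.zip",
]

# "sql", then one or more non-slash characters, then a literal ".zip" at the end.
zip_pattern = re.compile(r"sql[^/]+\.zip$")


def zip_name(url):
    """Return the trailing sql...zip file name, or None if there is none."""
    match = zip_pattern.search(url)  # search, because the name is not at the start
    return match.group(0) if match else None


names = [zip_name(url) for url in urls]
print(names)
```

Because [^/] cannot cross a slash, an sql that appears in an earlier path segment cannot be the start of the match.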

Related

Match text between parenthesis that end with .md

I need to get the text inside the parentheses when that text ends with .md, using a regex in Python (if you know another way, feel free to suggest it).
Original string:
[Romanian (Romania)](books/free-programming-books-ro.md)
Expected result:
books/free-programming-books-ro.md

This should work:
import re
s = '[Romanian (Romania)](books/free-programming-books-ro.md)'
result = re.findall(r'[^\(]+\.md(?=\))', s)
print(result)
Output:
['books/free-programming-books-ro.md']

Regex filter containing word at beginning but not containing another word

Suppose I have the following string:
GPH_EPL_GK_FIN
I want a regex, to be used in Python, that looks through a CSV file (not relevant to this question) for records that start with GPH but DON'T contain EPL.
I know the caret ^ is used for matching at the beginning,
so I have something like this:
^GPH_.*
I want to include the NOT-contain part as well. How do I chain the regex?
i.e.
(^GPH_.*)(?!EPL)
Eventually I would like to take this a step further: for any records that are returned without EPL, i.e.
GPH_ABC_JKL_OPQ
I want to insert the EPL part AFTER GPH_,
i.e. the desired result is
GPH_EPL_ABC_JKL_OPQ

To cover both requirements:
compose a pattern that matches lines that start with GPH but DON'T contain EPL;
insert the EPL_ part into the matched line at a particular position.
import re
# sample string containing lines
s = '''GPH_EPL_GK_FIN
GPH_ABC_JKL_OPQ'''
pat = re.compile(r'^(GPH_)(?!.*EPL.*)')
for line in s.splitlines():
    print(pat.sub('\\1EPL_', line))
The output:
GPH_EPL_GK_FIN
GPH_EPL_ABC_JKL_OPQ

This should do it, I think:
^GPH_(?!EPL).*
This matches any string that starts with GPH_ and does not have EPL immediately after GPH_. It does not reject strings where EPL appears later in the line.
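To make the difference between "EPL right after GPH_" and "EPL anywhere after GPH_" concrete, here is a small sketch that compares the two lookaheads. The third sample string and the variable names are made up for illustration:

```python
import re

# Rejects EPL only when it directly follows GPH_.
immediately_after = re.compile(r"^GPH_(?!EPL).*")
# Rejects EPL anywhere in the rest of the line.
anywhere_after = re.compile(r"^GPH_(?!.*EPL).*")

samples = ["GPH_EPL_GK_FIN", "GPH_ABC_JKL_OPQ", "GPH_ABC_EPL_OPQ"]

results = {}
for s in samples:
    results[s] = (
        bool(immediately_after.match(s)),
        bool(anywhere_after.match(s)),
    )
    print(s, results[s])
```

Only the last sample is treated differently: the first pattern accepts it and the second rejects it.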

I'm just guessing that one option would be,
(?<=^GPH_(?!EPL))
and re.sub with,
EPL_
Test
import re
print(re.sub(r"(?<=^GPH_(?!EPL))", "EPL_", "GPH_ABC_JKL_OPQ"))
Output
GPH_EPL_ABC_JKL_OPQ

Simply use this:
https://regex101.com/r/GwBsg2/2
pattern: ^(?!^(?:[^_\n]+_)*EPL_?(?:[^_\n]+_?)*)(.*)GPH
substitute: \1GPH_EPL
flags: gm

Extract file name with a regular expression

I want to create a regular expression to extract the file name from a URL:
https://example.net/img/src/img.jpg
I want to extract img.jpg
I use urlparse from Python, but it extracts the path like this:
img/src/img.jpg
How can I extract the file name with a regular expression?

Using str.split and negative indexing
url = "https://example.net/img/src/img.jpg"
print(url.split("/")[-1])
Output:
img.jpg
or using os.path.basename
import os
from urllib.parse import urlparse  # Python 3; in Python 2 the module was called urlparse
url = "https://example.net/img/src/img.jpg"
a = urlparse(url)
print(os.path.basename(a.path))  # ---> img.jpg

You can either split on / and select the last element of the returned list (the best solution in my opinion),
or, if you really want to use a regex, you can use the following one:
(?<=\/)(?:(?:\w+\.)*\w+)$
Note that this only accepts file names made up of word characters (\w) separated by dots.
You can replace \w with a wider character class to accept other characters if necessary.
Explanations:
(?<=\/) is a positive lookbehind on /, and $ adds the constraint that the file name is the last element of the path.
(?:(?:\w+\.)*\w+) matches words composed of letters, digits and possibly underscores followed by a dot; this group can be repeated as many times as necessary (an xxx.tar.gz file, for example) and is then followed by the final extension.
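A short sketch of how this regex behaves, including a file name it does not accept. The helper name extract_filename and the second and third URLs are assumptions added for illustration:

```python
import re

# A "/" must come right before the name; the name is dot-separated \w+ runs up to the end.
filename_re = re.compile(r"(?<=/)(?:(?:\w+\.)*\w+)$")


def extract_filename(url):
    """Return the last path element if it fits the pattern, else None."""
    match = filename_re.search(url)
    return match.group(0) if match else None


for url in (
    "https://example.net/img/src/img.jpg",
    "https://example.net/dl/archive.tar.gz",
    "https://example.net/img/my-photo.jpg",  # "-" is not matched by \w
):
    print(url, "->", extract_filename(url))
```

The hyphenated name yields None, which is the case where \w would need to be widened, for example to [\w-].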

If your URL pattern is static, you can use a positive lookahead:
import re
pattern =r'\w+(?=\.jpg)'
text="""https://example.net/img/src/img.jpg
"""
print(re.findall(pattern,text)[0])
output:
img

Python - Parsing JSON formatted text file with regex

I have a text file formatted like a JSON file however everything is on a single line (could be a MongoDB File). Could someone please point me in the direction of how I could extract values using a Python regex method please?
The text shows up like this:
{"d":{"__type":"WikiFileNodeContent:http:\/\/samplesite.com.au\/ns\/business\/wiki","author":null,"description":null,"fileAssetId":"034b9317-60d9-45c2-b6d6-0f24b59e1991","filename":"Reports.pdf"},"createdBy":1531,"createdByUsername":"John Cash","icon":"\/Assets10.37.5.0\/pix\/16x16\/page_white_acrobat.png","id":3041,"inheritedPermissions":false,"name":"map","permissions":[23,87,35,49,65],"type":3,"viewLevel":2},{"__type":"WikiNode:http:\/\/samplesite.com.au\/ns\/business\/wiki","children":[],"content":
I want to get the "fileAssetId" and "filename" values. I've tried to load the line with Python's json module, but I get an error.
For the fileAssetId I tried this regex:
regex = re.compile(r"([0-9a-f]{8})\S*-\S*([0-9a-f]{4})\S*-\S*([0-9a-f]{4})\S*-\S*([0-9a-f]{4})\S*-\S*([0-9a-f]{12})")
But I get the following: 034b9317, 60d9, 45c2, b6d6, 0f24b59e1991
I'm not too sure how to get the data as it's displayed.

How about using positive lookahead and lookbehind:
(?<=\"fileAssetId\":\")[a-fA-F0-9-]+?(?=\")
captures the fileAssetId and
(?<=\"filename\":\").+?(?=\")
matches the filename.
For a detailed explanation of the regex, have a look at the Regex101 example. (Note: I combined both patterns in the example with the OR operator | to show both matches at once.)
To get a list of all matches, use re.findall or re.finditer instead of re.match.
re.findall(pattern, string) returns a list of matching strings.
re.finditer(pattern, string) returns an iterator of match objects.
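As a rough sketch, the two lookaround patterns can be used with re.findall and re.finditer like this. The sample text is a shortened, made-up version of the line from the question, and the backslashes before the quotes are dropped because they are not needed inside a single-quoted raw string:

```python
import re

text = (
    '{"d":{"author":null,"fileAssetId":"034b9317-60d9-45c2-b6d6-0f24b59e1991",'
    '"filename":"Reports.pdf"},"createdBy":1531}'
)

# The lookbehind pins the key, the lookahead stops at the closing quote.
asset_id_re = re.compile(r'(?<="fileAssetId":")[a-fA-F0-9-]+?(?=")')
filename_re = re.compile(r'(?<="filename":").+?(?=")')

asset_ids = asset_id_re.findall(text)  # list of matched strings
filenames = filename_re.findall(text)
print(asset_ids)
print(filenames)

# finditer yields match objects, which also carry the position of each match.
for match in asset_id_re.finditer(text):
    print(match.group(0), match.span())
```

Since the lookarounds consume nothing, each match is just the value itself, without the key or the surrounding quotes.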

You can use python's walk method and check each entry with re.match.
In case that the string you got is not convertable to a python dict, you can use just regex:
print(re.match(r'.*fileAssetId\":\"([^\"]+)\".*', your_string).group(1))
Solution for your example:
import re
example_string = '{"d":{"__type":"WikiFileNodeContent:http:\/\/samplesite.com.u\/ns\/business\/wiki","author":null,"description":null,"fileAssetId":"034b9317-60d9-45c2-b6d6-0f24b59e1991","filename":"Reports.pdf"},"createdBy":1531,"createdByUsername":"John Cash","icon":"\/Assets10.37.5.0\/pix\/16x16\/page_white_acrobat.png","id":3041,"inheritedPermissions":false,"name":"map","permissions":[23,87,35,49,65],"type":3,"viewLevel":2},{"__type":"WikiNode:http:\/\/samplesite.com.au\/ns\/business\/wiki","children":[],"content"'
regex_pattern = r'.*fileAssetId\":\"([^\"]+)\".*'
match = re.match(regex_pattern, example_string)
fileAssetId = match.group(1)
print('fileAssetId: {}'.format(fileAssetId))
Executing this yields:
fileAssetId: 034b9317-60d9-45c2-b6d6-0f24b59e1991

Try adding \n to the string that you are entering in to the file (\n means new line)

Based on the idea given here https://stackoverflow.com/a/3845829 and by following the JSON standard https://www.json.org/json-en.html, we can use Python + regex https://pypi.org/project/regex/ and do the following:
import regex  # third-party module from PyPI, not the standard-library re

json_pattern = (
r'(?(DEFINE)'
r'(?P<whitespace>( |\n|\r|\t)*)'
r'(?P<boolean>true|false)'
r'(?P<number>-?(0|([1-9]\d*))(\.\d*[1-9])?([eE][+-]?\d+)?)'
r'(?P<string>"([^"\\]|\\("|\\|/|b|f|n|r|t|u[0-9a-fA-F]{4}))*")'
r'(?P<array>\[((?&whitespace)|(?&value)(,(?&value))*)\])'
r'(?P<key>(?&whitespace)(?&string)(?&whitespace))'
r'(?P<value>(?&whitespace)((?&boolean)|(?&number)|(?&string)|(?&array)|(?&object)|null)(?&whitespace))'
r'(?P<object>\{((?&whitespace)|(?&key):(?&value)(,(?&key):(?&value))*)\})'
r'(?P<document>(?&object)|(?&array))'
r')'
r'(?&document)'
)
json_regex = regex.compile(json_pattern)
match = json_regex.match(json_document_text)
You can change the last line of json_pattern to match individual objects rather than a whole document by replacing (?&document) with (?&object). The regex turned out to be simpler than I expected, though I have not run extensive tests on it. It works fine for me, and I have tested it on hundreds of files. I will try to improve my answer if I find any issue when running it.

Python regular expressions matching within set

While testing on http://gskinner.com/RegExr/ (an online regex tester), the regex [jpg|bmp] returns results when either jpg or bmp exists. However, when I run this regex in Python, it only returns j or b. How do I make the regex match the whole word "jpg" or "bmp" inside the set? This may have been asked before, but I was not sure how to phrase the question to find the answer. Thanks!
Here is the whole regex, if it helps:
"http://www\S*(?i)\\.(jpg|bmp|png|gif|img|jng|jpeg|jpe|gif|giff)"
It's basically just to look for pictures in a URL.

Use (jpg|bmp) instead of square brackets.
Square brackets mean - match a character from the set in the square brackets.
Edit - you might want something like that: [^ ].*?(jpg|bmp) or [^ ].*?\.(jpg|bmp)
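A quick way to see the difference is to run both forms over the same text. This is only an illustrative sketch; the sample text and variable names are made up:

```python
import re

text = "photo.jpg and scan.bmp"

# Square brackets: a set of single characters (j, p, g, |, b, m).
char_class = re.findall(r"[jpg|bmp]", text)
# Parentheses with |: whole alternatives (non-capturing here, so findall returns the full match).
alternation = re.findall(r"(?:jpg|bmp)", text)

print(char_class)
print(alternation)
```

The first call returns single characters only, including the p of "photo", while the second returns the whole extensions.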

When you use [] you are creating a character class that contains all the characters between the brackets.
So you are not matching jpg or bmp; you are matching either a j or a p or a g or a | ...
You should add an anchor for the end of the string to your regex
http://www\S*(?i)\\.(jpg|bmp|png|gif|img|jng|jpeg|jpe|gif|giff)$
or, if you need double escaping, apply it everywhere in your pattern
http://www\\S*(?i)\\.(jpg|bmp|png|gif|img|jng|jpeg|jpe|gif|giff)$
to ensure that it checks for the file extension at the very end of the string.
Note that since Python 3.11 an inline flag such as (?i) must be placed at the very start of the pattern (or be replaced by the re.IGNORECASE flag).
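A hedged sketch of the anchored pattern in current Python, written as a raw string (so no double escaping) and with re.IGNORECASE passed as a flag instead of an inline (?i). The helper name image_extension and the example URLs are assumptions for illustration, and the duplicate gif alternative is dropped:

```python
import re

image_url = re.compile(
    r"http://www\S*\.(jpg|bmp|png|gif|img|jng|jpeg|jpe|giff)$",
    re.IGNORECASE,
)


def image_extension(url):
    """Return the image extension if the URL ends with one, else None."""
    match = image_url.search(url)
    return match.group(1) if match else None


for url in (
    "http://www.example.com/pics/cat.JPG",
    "http://www.example.com/pics/cat.jpeg",
    "http://www.example.com/pics/cat.jpg.txt",  # extension is not at the end
):
    print(url, "->", image_extension(url))
```

Without the $ anchor the last URL would also match, because .jpg occurs in the middle of it.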

If you are searching a list of URLs
urls = [
    'http://some.link.com/path/to/file.jpg',
    'http://some.link.com/path/to/another.png',
    'http://and.another.place.com/path/to/not-image.txt',
]
to find ones that match a given pattern you can use:
import re
for url in urls:
    if re.match(r'http://.*(jpg|png|gif)$', url):
        print(url)
which will output
http://some.link.com/path/to/file.jpg
http://some.link.com/path/to/another.png
re.match() tests for a match at the start of the string; it returns a match object for the first two links and None for the third.
If you want just the extension, you can use the following:
for url in urls:
    m = re.match(r'http://.*(jpg|png|gif)$', url)
    if m:
        print(m.groups())
which will print
('jpg',)
('png',)
You get just the extensions because that is the only part defined as a group.
If you need to find the URL in a long string of text (such as the output returned from wget), you need to use re.search() and enclose the parts you are interested in with ( )'s. For example,
response = """dlkjkd dkjfadlfjkd fkdfl kadfjlkadfald ljkdskdfkl adfdf
kjakldjflkhttp://some.url.com/path/to/file.jpgkaksdj fkdjakjflakdjfad;kadj af
kdlfjd dkkf aldfkaklfakldfkja df"""
reg = re.search(r'(http:.*/(.*\.(jpg|png|gif)))', response)
print(reg.groups())
will print
('http://some.url.com/path/to/file.jpg', 'file.jpg', 'jpg')
Alternatively, you can use re.findall or re.finditer in place of re.search to get all of the URLs in the long response; re.search only returns the first one.
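A minimal sketch of the re.finditer variant follows. The sample response is made up, and the pattern is an assumption that differs from the one above: it uses \S and [^/\s] instead of .* so that several URLs on one line are not merged into a single greedy match:

```python
import re

response = """see http://some.url.com/path/to/file.jpg and also
http://other.example.org/images/logo.png plus http://some.url.com/notes.txt"""

# Group 1: file name, group 2: extension; group 0 is the whole URL.
image_url = re.compile(r"http://\S*/([^/\s]+\.(jpg|png|gif))")

found = []
for match in image_url.finditer(response):
    found.append((match.group(0), match.group(1), match.group(2)))

for item in found:
    print(item)
```

re.findall(image_url, response) would return only the two captured groups per match, which is why finditer is used here to get the full URL as well.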
