extract URL from string in python

I want to extract a full URL from a string.
My code is:
import re
data = "ahahahttp://www.google.com/a.jpg>hhdhd"
print re.match(r'(ftp|http)://.*\.(jpg|png)$', data)
Output:
None
Expected Output
http://www.google.com/a.jpg
I found so many questions on StackOverflow, but none worked for me.
I have seen many posts and this is not a duplicate. Please help me! Thanks.

You were close!
Try this instead:
r'(ftp|http)://.*\.(jpg|png)'
I would also make this non-greedy, like this:
r'(ftp|http)://.*?\.(jpg|png)'
By default, .* matches as much text as possible, but here you want to match as little text as possible.
Your $ anchors the match at the end of the string, but in your example the URL does not end where the string ends.
Another problem is that you're using re.match() and not re.search(). re.match() only matches at the beginning of the string, whereas re.search() looks for a match anywhere in the string.
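
To make the three points above concrete, here is a minimal sketch that runs the original attempt next to the corrected pattern. The two_urls string and its example hosts are made up for illustration; everything else comes from the question.

```python
import re

data = "ahahahttp://www.google.com/a.jpg>hhdhd"

# re.match only tries at position 0, and $ demands that the string end
# right after the extension, so the original attempt finds nothing.
original = re.match(r'(ftp|http)://.*\.(jpg|png)$', data)

# re.search scans the whole string; .*? stops at the first image extension.
found = re.search(r'(ftp|http)://.*?\.(jpg|png)', data)
url = found.group(0) if found else None

# Greedy vs. non-greedy on a (hypothetical) string holding two URLs.
two_urls = "see http://a.example/x.jpg and http://b.example/y.png here"
greedy = re.search(r'(ftp|http)://.*\.(jpg|png)', two_urls).group(0)
lazy = re.search(r'(ftp|http)://.*?\.(jpg|png)', two_urls).group(0)

print(original)  # None
print(url)       # http://www.google.com/a.jpg
print(greedy)    # http://a.example/x.jpg and http://b.example/y.png
print(lazy)      # http://a.example/x.jpg
```

With a single URL in the string both variants return the same text; the difference only shows once a second matching extension appears later in the input.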

You should use search instead of match.
import re
data = "ahahahttp://www.google.com/a.jpg>hhdhd"
url = re.search(r'(ftp|http)://.*\.(jpg|png)', data)
if url:
    print(url.group(0))

Find the start of the URL with find('http://') (or find('ftp://')) and the end of the URL with find('.jpg') (or find('.png')), then take the substring:
data = "ahahahttp://www.google.com/a.jpg>hhdhd"
start = data.find('http://')
kk = data[start:]
end = kk.find('.jpg')
print(kk[0:end+4])

Related

How to use Regex to extract a string from a specific string until a specific symbol in python?

Question
Assume that I have a string like this:
example_text = 'b\'\\x08\\x13"\\\\https://www.example.com/link_1.html\\xd2\\x01`https://www.example.com/link_2.html\''
Expectation
And I want to only extract the first url, which is
output = "https://www.example.com/link_1.html"
I think using a regex to find the URL, starting at "https" and ending at '\', would be a good solution.
If so, how can I write the regex pattern?
I tried something like this:
re.findall("https://([^\\\\)]+)", example_text)
output = ['www.example.com/link_1.html', 'www.example.com/link_2.html']
But then I need to add "https://" back and pick the first item in the returned list.
Is there any other solution?

You need to tweak your regex a bit.
What you were doing before:
https://([^\\\\)]+) this matches your link but only captures the part after https:// since you used the capturing token after that.
Updated Regex:
(https\:\/\/[^\\\\)]+) this matches the link and also captures the whole token (escaped special characters to avoid errors)
In Code:
import re
input = 'b\'\\x08\\x13"\\\\https://www.example.com/link_1.html\\xd2\\x01`https://www.example.com/link_2.html\''
print(re.findall("(https\:\/\/[^\\\\)]+)", input))
Output:
['https://www.example.com/link_1.html', "https://www.example.com/link_2.html'"]
You could also use (https\:\/\/([^\\\\)]+)\.html) to get the link both with and without https:// as a tuple (this also avoids the trailing ' that you might get in some links).
If you want only the first one, take the first element of the result, e.g. output[0].

Try:
match = re.search(r"https://[^\\']+", example_text)
url = match.group()
print(url)
output:
https://www.example.com/link_1.html
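
The difference between the attempts above comes down to where the capturing group sits. The following sketch, using the example_text from the question, shows how re.findall returns only the group when one is present and the whole match otherwise, and how re.search yields just the first URL. The None check is an addition for inputs that contain no URL.

```python
import re

example_text = 'b\'\\x08\\x13"\\\\https://www.example.com/link_1.html\\xd2\\x01`https://www.example.com/link_2.html\''

# With a capturing group, findall returns only the captured part.
captured = re.findall(r"https://([^\\']+)", example_text)

# Without a group, findall returns each whole match, scheme included.
whole = re.findall(r"https://[^\\']+", example_text)

# search stops at the first match; guard against "no match" before .group().
match = re.search(r"https://[^\\']+", example_text)
first = match.group() if match else None

print(captured)  # ['www.example.com/link_1.html', 'www.example.com/link_2.html']
print(whole)     # ['https://www.example.com/link_1.html', 'https://www.example.com/link_2.html']
print(first)     # https://www.example.com/link_1.html
```

Excluding both the backslash and the single quote in the character class is what keeps the trailing ' off the second link.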

Python - Parsing JSON formatted text file with regex

I have a text file formatted like a JSON file however everything is on a single line (could be a MongoDB File). Could someone please point me in the direction of how I could extract values using a Python regex method please?
The text shows up like this:
{"d":{"__type":"WikiFileNodeContent:http:\/\/samplesite.com.‌au\/ns\/business\/wi‌ki","author":null,"d‌escription":null,"fi‌leAssetId":"034b9317‌-60d9-45c2-b6d6-0f24‌b59e1991","filename"‌:"Reports.pdf"},"cre‌atedBy":1531,"create‌dByUsername":"John Cash","icon":"\/Assets10.37.5.0\/pix\/16x16\/page_white_acro‌bat.png","id":3041,"‌inheritedPermissions‌":false,"name":"map"‌,"permissions":[23,8‌7,35,49,65],"type":3‌,"viewLevel":2},{"__‌type":"WikiNode:http‌:\/\/samplesite.com.‌au\/ns\/business\/wi‌ki","children":[],"c‌ontent":
I want to get the "fileAssetId" and "filename" values. I've tried to load the file with Python's json module, but I get an error.
For the fileAssetId I tried this regex:
regex = re.compile(r"([0-9a-f]{8})\S*-\S*([0-9a-f]{4})\S*-\S*([0-9a-f]{4})\S*-\S*([0-9a-f]{4})\S*-\S*([0-9a-f]{12})")
But I get the following: 034b9317, 60d9, 45c2, b6d6, 0f24b59e1991
I'm not too sure how to get the data as it's displayed.

How about using positive lookahead and lookbehind:
(?<=\"fileAssetId\":\")[a-fA-F0-9-]+?(?=\")
captures the fileAssetId and
(?<=\"filename\":\").+?(?=\")
matches the filename.
For a detailed explanation of the regex have a look at the Regex101-Example. (Note: I combined both in the example with an OR-Operator | to show both matches at once)
To get a list of all matches use re.findall or re.finditer instead of re.match.
re.findall(pattern, string) returns a list of matching strings.
re.finditer(pattern, string) returns an iterator of match objects.
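
As a rough illustration of the lookbehind/lookahead approach, the sketch below applies both patterns with re.findall. The text value is a shortened, cleaned-up stand-in for the data in the question (an assumption, since the real file is not available).

```python
import re

# Shortened stand-in for the single-line data from the question.
text = ('{"d":{"author":null,"description":null,'
        '"fileAssetId":"034b9317-60d9-45c2-b6d6-0f24b59e1991",'
        '"filename":"Reports.pdf"},"createdBy":1531}')

# The lookbehind pins the match right after the key; the lookahead stops
# it before the closing quote, so neither is part of the result.
asset_ids = re.findall(r'(?<="fileAssetId":")[a-fA-F0-9-]+?(?=")', text)
filenames = re.findall(r'(?<="filename":").+?(?=")', text)

print(asset_ids)  # ['034b9317-60d9-45c2-b6d6-0f24b59e1991']
print(filenames)  # ['Reports.pdf']
```

Python's re module requires a fixed-width lookbehind, which holds here because the key text is a literal string.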

You can use python's walk method and check each entry with re.match.
If the string you have cannot be converted to a Python dict, you can use a regex alone:
print(re.match(r'.*fileAssetId\":\"([^\"]+)\".*', your_string).group(1))
Solution for your example:
import re
example_string = '{"d":{"__type":"WikiFileNodeContent:http:\/\/samplesite.com.u\/ns\/business\/wiki","author":null,"description":null,"fileAssetId":"034b9317-60d9-45c2-b6d6-0f24b59e1991","filename":"Reports.pdf"},"createdBy":1531,"createdByUsername":"John Cash","icon":"\/Assets10.37.5.0\/pix\/16x16\/page_white_acrobat.png","id":3041,"inheritedPermissions":false,"name":"map","permissions":[23,87,35,49,65],"type":3,"viewLevel":2},{"__type":"WikiNode:http:\/\/samplesite.com.au\/ns\/business\/wiki","children":[],"content"'
regex_pattern = r'.*fileAssetId\":\"([^\"]+)\".*'
match = re.match(regex_pattern, example_string)
fileAssetId = match.group(1)
print('fileAssetId: {}'.format(fileAssetId))
Executing this yields:
fileAssetId: 034b9317-60d9-45c2-b6d6-0f24b59e1991

Try adding \n to the string that you are entering in to the file (\n means new line)

Based on the idea given here https://stackoverflow.com/a/3845829 and by following the JSON standard https://www.json.org/json-en.html, we can use Python + regex https://pypi.org/project/regex/ and do the following:
import regex

json_pattern = (
    r'(?(DEFINE)'
    r'(?P<whitespace>( |\n|\r|\t)*)'
    r'(?P<boolean>true|false)'
    # In the JSON grammar a fraction is '.' followed by one or more digits.
    r'(?P<number>-?(0|([1-9]\d*))(\.\d+)?([eE][+-]?\d+)?)'
    r'(?P<string>"([^"\\]|\\("|\\|/|b|f|n|r|t|u[0-9a-fA-F]{4}))*")'
    r'(?P<array>\[((?&whitespace)|(?&value)(,(?&value))*)\])'
    r'(?P<key>(?&whitespace)(?&string)(?&whitespace))'
    r'(?P<value>(?&whitespace)((?&boolean)|(?&number)|(?&string)|(?&array)|(?&object)|null)(?&whitespace))'
    r'(?P<object>\{((?&whitespace)|(?&key):(?&value)(,(?&key):(?&value))*)\})'
    r'(?P<document>(?&object)|(?&array))'
    r')'
    r'(?&document)'
)
json_regex = regex.compile(json_pattern)
match = json_regex.match(json_document_text)
To match individual objects rather than a whole document, replace (?&document) in the last line of json_pattern with (?&object). The regex turned out simpler than I expected, though I did not run extensive tests on it. It works fine for me, and I have tested it on hundreds of files. I will try to improve my answer if I find any issue when running it.

Regex multiline syntax help (python)

I'm struggling to do multiline regex with multiple matches.
I have data separated by newlines, like below. My pattern matches each of these lines if I test it separately. How can I match all the occurrences (specifically the numbers)?
I've read that I could/should use DOTALL somehow (possibly with MULTILINE). This seems to make . match any character (newlines included), but I'm not sure about side effects. I don't want it to match a stray integer or something and give me malformed data in the end.
Any info on this would be great.
What I really need, though, is some assistance in making this example code work. I only need to fetch the numbers from the data.
I used re.fullmatch when I only needed one specific match in a previous case, and I'm not entirely sure which function I should use now (finditer, findall, search, etc.).
Thank you for any and all help :)
data = """http://store.steampowered.com/app/254060/
http://www.store.steampowered.com/app/254061/
https://www.store.steampowered.com/app/254062
store.steampowered.com/app/254063
254064"""
regPattern = '^\s*(?:https?:\/\/)?(?:www\.)?(?:store\.steampowered\.com\/app\/)?([0-9]+)\/?\s*$'
evaluateData = re.search(regPattern, data, re.DOTALL | re.MULTILINE)
if evaluateData is not None:
    print('do stuff')
else:
    print('found no match')

import re
p = re.compile(r'^\s*(?:https?:\/\/)?(?:www\.)?(?:store\.steampowered\.com\/app\/)?([0-9]+)\/?\s*$', re.MULTILINE)
test_str = "http://store.steampowered.com/app/254060/\nhttp://www.store.steampowered.com/app/254061/\nhttps://www.store.steampowered.com/app/254062\nstore.steampowered.com/app/254063\n254064"
print(re.findall(p, test_str))
https://regex101.com/r/rC9rI0/1
This gives ['254060', '254061', '254062', '254063', '254064'].
Are you trying to return those specific integers?

re.search stops at the first occurrence.
You should use this instead:
re.findall(regPattern, data, re.MULTILINE)
['254060', '254061', '254062', '254063', '254064']
Note: search was not working for me (Python 2.7.9). It just returned the first line of data.

/ has no special meaning in a regex, so you do not have to escape it (and in non-raw strings you would have to escape every \).
Try this:
regPattern = r'^\s*(?:https?://)?(?:www\.)?(?:store\.steampowered\.com/app/)?([0-9]+)/?\s*$'
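
Putting the answers together, here is a minimal sketch that uses the raw-string pattern with re.MULTILINE only (no DOTALL, since the pattern contains no dot that needs to cross lines). The names pattern, first and app_ids are illustrative, and converting to int is an assumption about what the numbers are used for.

```python
import re

data = """http://store.steampowered.com/app/254060/
http://www.store.steampowered.com/app/254061/
https://www.store.steampowered.com/app/254062
store.steampowered.com/app/254063
254064"""

# MULTILINE makes ^ and $ match at every line boundary, not just the
# start and end of the whole string.
pattern = re.compile(
    r'^\s*(?:https?://)?(?:www\.)?(?:store\.steampowered\.com/app/)?([0-9]+)/?\s*$',
    re.MULTILINE,
)

# search returns only the first match ...
first = pattern.search(data).group(1)

# ... while finditer (or findall) walks through every line.
app_ids = [int(m.group(1)) for m in pattern.finditer(data)]

print(first)    # 254060
print(app_ids)  # [254060, 254061, 254062, 254063, 254064]
```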

Regex to capture url until a certain character

With a url such as
https://search.yahoo.com/search?p=Fetty+Wap&fr=fp-tts&
I am using
pat = re.compile('<a href="(https?://.*?)".*',re.DOTALL)
as a search pattern.
I want to pick up any URL like the Yahoo URL above, but capture it only up to the literal ? in the actual URL.
In other words, I want to extract the URL up to the ?, knowing that not all the URLs I'm parsing have the ? character. In that case I need to capture the whole URL.
The above regex works and extracts the URL, but it goes to the end of the URL. How can I get it to stop at the first ? it encounters, and keep going to the end if it doesn't encounter a ?

Regex is really the wrong tool for the job. Doing a basic string split will get you exactly what you want.
def beforeQuestionMrk(inputStr):
    return inputStr.split("?")[0]
url = "https://search.yahoo.com/sometext"
url2 = "https://search.yahoo.com/search?p=Fetty+Wap&fr=fp-tts&"
print(beforeQuestionMrk(url))
print(beforeQuestionMrk(url2))
#https://search.yahoo.com/sometext
#https://search.yahoo.com/search
If you really wanted to use a regex, I suppose you could do the following:
import re
def getBeforeQuestRegex(inputStr):
    return re.search(r"(.+?\?|.+)", inputStr).group(0)
print(getBeforeQuestRegex("https://search.yahoo.com/search?p=Fetty+Wap&fr=fp-tts&"))
print(getBeforeQuestRegex("https://search.yahoo.com/sometext"))
#https://search.yahoo.com/search?
#https://search.yahoo.com/sometext
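
Note that the regex version above keeps the trailing ? in its result. If the ? should be left out, one possible variant is a negated character class, sketched below. The function name before_question_mark is made up for this example.

```python
import re

def before_question_mark(url):
    # [^?]* consumes characters up to, but not including, the first "?".
    # If there is no "?", it simply runs to the end of the string.
    return re.match(r'[^?]*', url).group(0)

with_query = before_question_mark("https://search.yahoo.com/search?p=Fetty+Wap&fr=fp-tts&")
without_query = before_question_mark("https://search.yahoo.com/sometext")

print(with_query)     # https://search.yahoo.com/search
print(without_query)  # https://search.yahoo.com/sometext
```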

Bobble bubble's solution above worked very well for me:
"You can try like this by use of negated class: ]*?href="(http[^"?]+)" <- Bobble bubble's answer.
The URL looks like this:
https://search.yahoo.com/search?p=Justin+Bieber&fr=fp-tts&fr2=p:fp,m:tn,ct:all......
or it could be something like this:
https://www.yahoo.com/style/5-joyful-bob-ross-tees-202237009.html
The objective was to extract the full URL if there was no literal ? in it, and otherwise to stop just before the literal ?.
Bobble bubble's answer works very cleanly and does what I wanted. Again, thank you to everyone for participating in this discussion; I really appreciate it.
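
A sketch of how that negated-class idea might look in full. The leading <a[^>] part of the pattern is an assumption (the quoted pattern above appears to have lost its beginning), and the html snippet is made up for the example.

```python
import re

# Assumed full form of the quoted pattern: any <a ...> tag, then href=",
# then capture everything up to the first " or ?.
HREF = re.compile(r'<a[^>]*?href="(http[^"?]+)', re.IGNORECASE)

# Hypothetical markup with one URL that has a query string and one without.
html = (
    '<p><a class="r" href="https://search.yahoo.com/search?p=Fetty+Wap&fr=fp-tts&">one</a>\n'
    '<a href="https://www.yahoo.com/style/5-joyful-bob-ross-tees-202237009.html">two</a></p>'
)

links = HREF.findall(html)
for link in links:
    print(link)
# https://search.yahoo.com/search
# https://www.yahoo.com/style/5-joyful-bob-ross-tees-202237009.html
```

Because ? is excluded from the character class, the capture ends before the query string when there is one and at the closing quote when there is not.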

I agree with the other answer that a regexp is not the right solution here, especially because there may be any number of attributes between the opening of the <a> tag and the href attribute, and there can be a newline in between too.
But, to answer the initial question:
The '*', '+', and '?' quantifiers are all greedy: they match as much text as possible.
That's why there are non-greedy versions of them:
'*?', '+?' and '??'
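
A small sketch of each quantifier next to its non-greedy counterpart; the input strings are arbitrary examples.

```python
import re

# * vs *?  (zero or more)
star_greedy = re.match(r'<.*>', '<a><b>').group(0)
star_lazy = re.match(r'<.*?>', '<a><b>').group(0)

# + vs +?  (one or more)
plus_greedy = re.match(r'a+', 'aaa').group(0)
plus_lazy = re.match(r'a+?', 'aaa').group(0)

# ? vs ??  (zero or one)
opt_greedy = re.match(r'ab?', 'ab').group(0)
opt_lazy = re.match(r'ab??', 'ab').group(0)

print(star_greedy, star_lazy)  # <a><b> <a>
print(plus_greedy, plus_lazy)  # aaa a
print(opt_greedy, opt_lazy)    # ab a
```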

How can I only take the regex without using re.split()?

I'm sorry for the title, I don't know how to ask this properly.
There's an HTML file I want to parse, using python2.7's re library. I know that I should use BeautifulSoup or stuff like that for parsing, but that doesn't really matter here.
There's a part in this HTML file: <br>(40%)</TD>. In this example, the number (here, it's 40) changes every now and then.
I want to use this number as a variable. To get that done, I'm using re.split() to split off everything preceding 40 and everything following 40 :
html = getHtmlTextFromWebsite()
very_important_number = int(re.split("%\)</TD>", (re.split("<br>\(", html))[1])[0])
I find this a very cumbersome way to do this and I'm wondering whether there are better ways.

If all you need is the number, you should use search (first match) or finditer (all matches) instead.
import re
p = re.compile(r"<br>\((\d+)%\)</TD>", re.IGNORECASE)
for m in p.finditer(html_code):
    print(m.start(1), m.group(1))
It will also give you the position of each number found.

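This is not the regex you asked for, but it strips everything before and after the number. Note that str.strip only removes the listed characters from the two ends, so this works only when html holds just the fragment shown above:
very_important_number = int(html.strip("abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ%()<>br/"))
For a regex, you can use the re.search method.
Hope this helps!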

What about
r = re.compile(r'<br>\((\d+)%\)</TD>', re.I)
m = r.search(text)
print(m.group(1))
Here you search for this exact pattern throughout the text and get it along with its "environment" (the match). From the match, you then extract group #1 (the part enclosed in ()).
You can also search without a compiled re object:
m = re.search(r'<br>\((\d+)%\)</TD>', text, re.I)
print(m.group(1))

You can try this:
import re
html = getHtmlTextFromWebsite()
number = re.findall(r'<br>\((\d+)%\)</TD>', html)
To print the number:
print(number[0])
which prints 40 with the example text.
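
The capture-group approach from the answers above can be wrapped into a small helper that replaces the nested re.split call. This is only a sketch: the name extract_percentage, the None return for "not found", and the sample strings are all assumptions made for the example.

```python
import re

# The parentheses around \d+ form group 1; \( and \) match literal brackets.
PERCENT = re.compile(r'<br>\((\d+)%\)</TD>', re.IGNORECASE)

def extract_percentage(html):
    """Return the number inside '<br>(NN%)</TD>' as an int, or None."""
    m = PERCENT.search(html)
    return int(m.group(1)) if m else None

# Hypothetical page fragments standing in for getHtmlTextFromWebsite().
very_important_number = extract_percentage('<TR><TD>Progress<br>(40%)</TD></TR>')
missing = extract_percentage('<TR><TD>no percentage here</TD></TR>')

print(very_important_number)  # 40
print(missing)                # None
```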
