Retrieving a string using REGEX in Python 2.7.2 - python

I have the following code snippet from page source:
var myPDF = new PDFObject({
url: "http://www.site.com/doc55.pdf",
id: "pdfObjectContainer",
width: "100%",
height: "700px",
pdfOpenParams: {
navpanes: 0,
statusbar: 1,
toolbar: 1,
view: "FitH"
}
}).embed("pdf_placeholder");
The string 'PDFObject(' is unique on the page. I want to retrieve the url value using a regex. In this case I need to get
http://www.site.com/doc55.pdf
Please help.

Here is an alternative for solving your problem without using regex:
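url, in_object = None, False
with open('input') as f:
    for line in f:
        # Start looking for the url only after the unique 'PDFObject(' marker has been seen
        in_object = in_object or 'PDFObject(' in line
        if in_object and 'url:' in line:
            # The URL is the first double-quoted value on the line
            url = line.split('"')[1]
            break
print(url)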

To find something that appears on a line after something else, the pattern has to match across newlines. For this you use the DOTALL modifier, a flag passed when the pattern is compiled.
Thus the following code works:
import re
r = re.compile(r'(?<=PDFObject).*?url:.*?(http.*?)"', re.DOTALL)
s = '''var myPDF = new PDFObject({
url: "http://www.site.com/doc55.pdf",
id: "pdfObjectContainer",
width: "100%",
height: "700px",
pdfOpenParams: {
navpanes: 0,
statusbar: 1,
toolbar: 1,
view: "FitH"
}
}).embed("pdf_placeholder"); '''
print(r.findall(s))
Explanation:
r = re.compile(    compile the regular expression
r'                 raw string literal, so backslashes reach the regex engine unchanged
(?<=PDFObject)     the match I want happens right after PDFObject
.*?                then there may be some other characters...
url:               followed by the string url:
.*?                then match as few characters as possible (`?` makes the match non-greedy) up to the first instance of
(http.*?)"         the string http and everything after it up to (but not including) the first "
',                 end of regex string, but there's more...
re.DOTALL)         set the DOTALL flag - this means the dot matches all characters
                   including newlines. This allows the match to continue from one line
                   to the next in the .*? right after the lookbehind

Using a combination of look-behind and look-ahead assertions (s is the string holding the page source):
import re
re.search(r'(?<=url:).*?(?=",)', s).group().strip('" ')
'http://www.site.com/doc55.pdf'

This works:
import re
src='''\
var myPDF = new PDFObject({
url: "http://www.site.com/doc55.pdf",
URL: "http://www.site.com/doc52.PDF",
id: "pdfObjectContainer",
width: "100%",
height: "700px",
pdfOpenParams: {
navpanes: 0,
statusbar: 1,
toolbar: 1,
view: "FitH"
}
}).embed("pdf_placeholder"); '''
print([m.group(1).strip('"') for m in
       re.finditer(r'^url:\s*(.*)[\W]$',
                   re.search(r'PDFObject\(\{(.*)', src, re.M | re.S | re.I).group(1),
                   re.M | re.I)])
prints:
['http://www.site.com/doc55.pdf', 'http://www.site.com/doc52.PDF']

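Regex:
new\s+PDFObject\(\{\s*url:\s*"[^"]+"
This matches the whole span from new PDFObject({ through the closing quote of the URL. To extract only the URL, put a capture group around [^"]+.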
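A minimal sketch of how this pattern could be used from Python to pull out only the URL, assuming Python 3 and the page source held in a string. The names page_source, URL_PATTERN and pdf_url are illustrative and not part of the original answer; the only change to the pattern is the capture group around [^"]+.

```python
import re

# Shortened copy of the snippet from the question (assumed to be the page source)
page_source = '''var myPDF = new PDFObject({
url: "http://www.site.com/doc55.pdf",
id: "pdfObjectContainer"
}).embed("pdf_placeholder");'''

# Same pattern as above, with a capture group around the URL.
# \s also matches newlines, so no DOTALL flag is needed here.
URL_PATTERN = re.compile(r'new\s+PDFObject\(\{\s*url:\s*"([^"]+)"')

match = URL_PATTERN.search(page_source)
pdf_url = match.group(1) if match else None  # None when the marker is absent
print(pdf_url)
```

Checking the match for None before calling group() avoids an AttributeError on pages that do not contain the PDFObject call.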

If 'PDFObject(' is the unique identifier in the page, you only have to match the first quoted string that follows it.
Using the DOTALL flag (re.DOTALL or re.S) and the non-greedy star (*?), you can write:
import re
snippet = '''
var myPDF = new PDFObject({
url: "http://www.site.com/doc55.pdf",
id: "pdfObjectContainer",
width: "100%",
height: "700px",
pdfOpenParams: {
navpanes: 0,
statusbar: 1,
toolbar: 1,
view: "FitH"
}
}).embed("pdf_placeholder");
'''
# First version using unnamed groups
RE_UNNAMED = re.compile(r'PDFObject\(.*?"(.*?)"', re.S)
# Second version using named groups
RE_NAMED = re.compile(r'PDFObject\(.*?"(?P<url>.*?)"', re.S)
# A compiled pattern's search() takes (string, pos, endpos); the flags were already passed to re.compile
RE_UNNAMED.search(snippet).group(1)
RE_NAMED.search(snippet).group('url')
# result for both: 'http://www.site.com/doc55.pdf'
If you don't want to compile your regex because it's used only once, simply use this syntax:
re.search(r'PDFObject\(.*?"(.*?)"', snippet, re.S).group(1)
re.search(r'PDFObject\(.*?"(?P<url>.*?)"', snippet, re.S).group('url')
Four choices; one should match your needs and taste!

Although the other answers may appear to work, most do not take into account that the only unique thing on the page is 'PDFObject('. A much better regular expression would be the following:
PDFObject\({\surl: "(http[s]?://(?:[a-zA-Z]|[0-9]|[$-_#.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+)",
It takes into account that 'PDFObject(' is unique and contains some basic URL verification.
Below is an example of how this regex could be used in python
>>> import re
>>> strs = """var myPDF = new PDFObject({
... url: "http://www.site.com/doc55.pdf",
... id: "pdfObjectContainer",
... width: "100%",
... height: "700px",
... pdfOpenParams: {
... navpanes: 0,
... statusbar: 1,
... toolbar: 1,
... view: "FitH"
... }
... }).embed("pdf_placeholder");"""
>>> re.search(r'PDFObject\({\surl: "(http[s]?://(?:[a-zA-Z]|[0-9]|[$-_#.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+)",',strs).group(1)
'http://www.site.com/doc55.pdf'
A pure python (no regex) alternative would be:
>>> unique = 'PDFObject({\nurl: "'
>>> start = strs.find(unique) + len(unique)
>>> end = start + strs[start:].find('"')
>>> strs[start:end]
'http://www.site.com/doc55.pdf'
No regex oneliner:
>>> (lambda u:(lambda s:(lambda e:strs[s:e])(s+strs[s:].find('"')))(strs.find(u)+len(u)))('PDFObject({\nurl: "')
'http://www.site.com/doc55.pdf'

Related

How to replace URL parts with regex in Python?

I have a JSON file with URLs that looks something like this:
{
"a_url": "https://foo.bar.com/x/env_t/xyz?asd=dfg",
"b_url": "https://foo.bar.com/x/env_t/xyz?asd=dfg",
"some other property": "blabla",
"more stuff": "yep",
"c_url": "https://blabla.com/somethingsomething?maybe=yes"
}
In this JSON, I want to look up all URLs that have a specific format, and then replace some parts in it.
In URLs that have the format of the first 2 URLs, I want to replace "foo" by "fooa" and "env_t" by "env_a", so that the output looks like this:
{
"a_url": "https://fooa.bar.com/x/env_a/xyz?asd=dfg",
"b_url": "https://fooa.bar.com/x/env_a/xyz?asd=dfg",
"some other property": "blabla",
"more stuff": "yep",
"c_url": "https://blabla.com/somethingsomething?maybe=yes"
}
I can't figure out how to do this. I came up with this regex:
https://foo([a-z]?)\.bar\.com/x/(.+)/.+\"
In regex101 this matches my URLs and captures the groups that I'm seeking to replace, but I can't figure out how to do this with Python's regex.sub().
1. Using regex
import regex
url = 'https://foo.bar.com/x/env_t/xyz?asd=dfg'
new_url = regex.sub(
    '(env_t)',
    'env_a',
    regex.sub('(foo)', 'fooa', url)
)
print(new_url)
output:
https://fooa.bar.com/x/env_a/xyz?asd=dfg
2. Using str.replace
with open('./your-json-file.json', 'r+') as f:
    content = f.read()
    new_content = content \
        .replace('foo', 'fooa') \
        .replace('env_t', 'env_a')
    f.seek(0)
    f.write(new_content)
    f.truncate()
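Both approaches above replace every occurrence of "foo" and "env_t", wherever it appears. If only URLs of the first format should change, one option is a single re.sub whose pattern matches that URL shape and whose replacement reuses the captured middle part. This is a sketch, assuming Python 3; the pattern is narrower than the one in the question, and the names raw, URL_PATTERN and rewrite_url are hypothetical.

```python
import json
import re

# Shortened copy of the JSON from the question
raw = '''{
  "a_url": "https://foo.bar.com/x/env_t/xyz?asd=dfg",
  "some other property": "blabla",
  "c_url": "https://blabla.com/somethingsomething?maybe=yes"
}'''

# Group 1 keeps the part between "foo" and "env_t" so it can be reused in the replacement
URL_PATTERN = re.compile(r'https://foo(\.bar\.com/x/)env_t/')

def rewrite_url(value):
    """Rewrite matching URLs; leave non-string values and other URLs untouched."""
    if not isinstance(value, str):
        return value
    return URL_PATTERN.sub(r'https://fooa\g<1>env_a/', value)

data = json.loads(raw)
updated = {key: rewrite_url(value) for key, value in data.items()}
print(json.dumps(updated, indent=2))
```

Parsing the file with json first means the substitution runs on individual values, so keys and unrelated strings are never rewritten by accident.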

Python: How to get full match with RegEx

I'm trying to filter out a link from some JavaScript. The JavaScript part isn't relevant anymore because I transformed it into a string (text).
Here is the script part:
<script>
setTimeout("location.href = 'https://airdownload.adobe.com/air/win/download/30.0/AdobeAIRInstaller.exe';", 2000);
$(function() {
$("#whats_new_panels").bxSlider({
controls: false,
auto: true,
pause: 15000
});
});
setTimeout(function(){
$("#download_messaging").hide();
$("#next_button").show();
}, 10000);
</script>
Here is what I do:
import re
def get_link_from_text(text):
    text = text.replace('\n', '')
    text = text.replace('\t', '')
    text = re.sub(' +', ' ', text)
    search_for = re.compile("href[ ]*=[ ]*'[^;]*")
    debug = re.search(search_for, text)
    return debug
What I want is the href link and I kind of get it, but for some reason only like this
<_sre.SRE_Match object; span=(30, 112), match="href = 'https://airdownload.adobe.com/air/win/dow>
and not like I want it to be
<_sre.SRE_Match object; span=(30, 112), match="href = 'https://airdownload.adobe.com/air/win/download/30.0/AdobeAIRInstaller.exe'">
So my question is how to get the full link and not only a part of it.
Might the problem be that re.search isn't returning longer strings? I tried altering the regex, and I even tried matching the link one character at a time, but it still returns only the part I called out earlier.
I've modified it slightly, but for me it returns the complete string you desire now.
import re
text = """
<script>
setTimeout("location.href = 'https://airdownload.adobe.com/air/win/download/30.0/AdobeAIRInstaller.exe';", 2000);
$(function() {
$("#whats_new_panels").bxSlider({
controls: false,
auto: true,
pause: 15000
});
});
setTimeout(function(){
$("#download_messaging").hide();
$("#next_button").show();
}, 10000);
</script>
"""
def get_link_from_text(text):
    text = text.replace('\n', '')
    text = text.replace('\t', '')
    text = re.sub(' +', ' ', text)
    search_for = re.compile("href[ ]*=[ ]*'[^;]*")
    debug = search_for.findall(text)
    print(debug)
get_link_from_text(text)
Output:
["href = 'https://airdownload.adobe.com/air/win/download/30.0/AdobeAIRInstaller.exe'"]
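The output shown in the question is the printed representation of the match object, which appears to shorten the matched text for display (it breaks off after 50 characters), so the match itself is likely complete. A minimal sketch, assuming Python 3, that reads the text with group() instead of printing the object. The capture group and the names match, full_match and link are additions for illustration.

```python
import re

text = ("setTimeout(\"location.href = "
        "'https://airdownload.adobe.com/air/win/download/30.0/AdobeAIRInstaller.exe';\", 2000);")

# Capture everything between the single quotes that follow "href ="
match = re.search(r"href\s*=\s*'([^']*)'", text)

full_match = match.group(0)  # the whole matched text, not shortened
link = match.group(1)        # only the URL inside the quotes
print(full_match)
print(link)
```

group(0) returns the complete matched string, and group(1) returns only the captured URL, which avoids stripping the href prefix afterwards.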

Dynamically double-quote "keys" in text to form valid JSON string in python

I'm working with text contained in JS variables on a webpage and extracting strings using regex, then turning it into JSON objects in python using json.loads().
The issue I'm having is the unquoted "keys". Right now, I'm doing a series of replacements (code below) to "" each key in each string, but what I want is to dynamically identify any unquoted keys before passing the string into json.loads().
Example 1 with no space after : character
json_data1 = '[{storeName:"testName",address:"12345 Road",address2:"Suite 500",city:"testCity",storeImage:"http://www.testLink.com",state:"testState",phone:"999-999-9999",lat:99.9999,lng:-99.9999}]'
Example 2 with space after : character
json_data2 = '[{storeName: "testName",address: "12345 Road",address2: "Suite 500",city: "testCity",storeImage: "http://www.testLink.com",state: "testState",phone: "999-999-9999",lat: 99.9999,lng: -99.9999}]'
Example 3 with space after ,: characters
json_data3 = '[{storeName: "testName", address: "12345 Road", address2: "Suite 500", city: "testCity", storeImage: "http://www.testLink.com", state: "testState", phone: "999-999-9999", lat: 99.9999, lng: -99.9999}]'
Example 4 with space after : character and newlines
json_data4 = '''[
{
storeName: "testName",
address: "12345 Road",
address2: "Suite 500",
city: "testCity",
storeImage: "http://www.testLink.com",
state: "testState",
phone: "999-999-9999",
lat: 99.9999, lng: -99.9999
}]'''
I need to create pattern that identifies which are keys and not random string values containing characters such as the string link in storeImage. In other words, I want to dynamically find keys and double-quote them to use json.loads() and return a valid JSON object.
I'm currently replacing each key in the text this way
content = re.sub('storeName:', '"storeName":', content)
content = re.sub('address:', '"address":', content)
content = re.sub('address2:', '"address2":', content)
content = re.sub('city:', '"city":', content)
content = re.sub('storeImage:', '"storeImage":', content)
content = re.sub('state:', '"state":', content)
content = re.sub('phone:', '"phone":', content)
content = re.sub('lat:', '"lat":', content)
content = re.sub('lng:', '"lng":', content)
Returned as string representing valid JSON
json_data = [{"storeName": "testName", "address": "12345 Road", "address2": "Suite 500", "city": "testCity", "storeImage": "http://www.testLink.com", "state": "testState", "phone": "999-999-9999", "lat": 99.9999, "lng": -99.9999}]
I'm sure there is a better way of doing this but I haven't been able to find or come up with a regex pattern to handle these. Any help is greatly appreciated!
Something like this should do the job: ([{,]\s*)([^"':]+)(\s*:)
Replace for: \1"\2"\3
Example: https://regex101.com/r/oV0udR/1
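A sketch of how that pattern and replacement could be applied with re.sub before calling json.loads, assuming Python 3. The helper name quote_keys and the shortened sample strings are hypothetical. The pattern treats anything between a "{" or "," and the next ":" as a key, so a string value that itself contains a comma followed later by a colon could be mis-quoted.

```python
import json
import re

# Pattern from the answer above: group 1 is "{" or "," plus whitespace,
# group 2 is the bare key, group 3 is the colon with optional leading whitespace
KEY_PATTERN = re.compile(r'''([{,]\s*)([^"':]+)(\s*:)''')

def quote_keys(js_text):
    """Wrap unquoted object keys in double quotes (illustrative helper)."""
    return KEY_PATTERN.sub(r'\1"\2"\3', js_text)

# Shortened versions of Example 1 and Example 4 from the question
json_data1 = '[{storeName:"testName",storeImage:"http://www.testLink.com",lat:99.9999,lng:-99.9999}]'
json_data4 = '''[
{
storeName: "testName",
storeImage: "http://www.testLink.com",
lat: 99.9999, lng: -99.9999
}]'''

stores_compact = json.loads(quote_keys(json_data1))
stores_multiline = json.loads(quote_keys(json_data4))
print(stores_compact[0]["storeImage"])
```

The colon inside "http://..." is left alone because it is preceded by a double quote rather than by "{" or ",".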
That repetition is of course unnecessary. You could put everything into a single regex:
content = re.sub(r"\b(storeName|address2?|city|storeImage|state|phone|lat|lng):", r'"\1":', content)
\1 contains the match within the first (in this case, only) set of parentheses, so "\1": surrounds it with quotes and adds back the colon.
Note the use of a word boundary anchor to make sure we match only those exact words.
Regex: (\w+)\s?:\s?("?[^",]+"?,?)
import re
text = 'storeName: "testName", '
text = re.sub(r'(\w+)\s?:\s?("?[^",]+"?,?)', r'"\g<1>":\g<2>', text)
print(text)
Output: "storeName":"testName",

Python regular expression matching a multiline javascript code

I'm having a bit of trouble getting a Python regex to work when matching against text that spans multiple lines. My example is:
function initialize()
{
var myLatlng = new google.maps.LatLng(23.800567,5.942068);
var myOptions =
{
panControl: true,
zoomControl: true,
scaleControl: false,
streetViewControl: true,
zoom: 11,
center: myLatlng,
mapTypeId: google.maps.MapTypeId.HYBRID
}
var map = new google.maps.Map(document.getElementById("map"), myOptions);
var bounds = new google.maps.LatLngBounds();
var locations = [
['<div CLASS="Tekst"><B>tss fsdf<\/B><BR>hopp <BR><\/div>', 54.538665,24.885818, 1, 'text']
,
['<div CLASS="Tekst"><\/div>', 24.465462,24.966919, 1, 'text']
]
What I want to extract is the content of locations. The result should look like:
- '<div CLASS="Tekst"><B>tss fsdf<\/B><BR>hopp <BR><\/div>',
54.538665,24.885818, 1, 'text'
- '<div CLASS="Tekst"><\/div>', 24.465462,24.966919, 1, 'text'
I tried a regex like this:
regex = r"var locations =\[\[(.+?)\]\]"
But it doesn't work.
Hello, you can try this regex:
re.findall("(<div.+)\]",toparse)
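The pattern in the question expects "[[" directly after "=" with no whitespace, and without the DOTALL flag the dot does not cross line breaks, which would explain why it finds nothing. A two-step sketch, assuming Python 3 and that each location row sits on its own line as in the example: first isolate the locations block, then take the content of each bracketed row. The names source, block and rows are illustrative.

```python
import re

# Excerpt of the page source from the question (raw string keeps the backslashes)
source = r"""var locations = [
['<div CLASS="Tekst"><B>tss fsdf<\/B><BR>hopp <BR><\/div>', 54.538665,24.885818, 1, 'text']
,
['<div CLASS="Tekst"><\/div>', 24.465462,24.966919, 1, 'text']
]"""

# Step 1: everything between "var locations = [" and a "]" that starts a line.
# re.DOTALL lets .*? run across line breaks.
block = re.search(r"var locations\s*=\s*\[(.*?)\n\s*\]", source, re.DOTALL).group(1)

# Step 2: each row is a line of the form [ ... ]; capture what is inside the brackets.
rows = re.findall(r"^[ \t]*\[(.*)\][ \t]*$", block, re.MULTILINE)

for row in rows:
    print("-", row)
```

Splitting the job in two keeps each pattern simple; a single pattern would have to cope with the nested brackets and the separator lines at once.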

re.compile regex assistance (python, beautifulsoup)

Using this code from a different thread
import re
import requests
from bs4 import BeautifulSoup
data = """
<script type="text/javascript">
window._propertyData =
{ *** a lot of random code and other data ***
"property": {"street": "21st Street", "apartment": "2101", "available": false}
*** more data ***
}
</script>
"""
soup = BeautifulSoup(data, "xml")
pattern = re.compile(r'\"street\":\s*\"(.*?)\"', re.MULTILINE | re.DOTALL)
script = soup.find("script", text=pattern)
print(pattern.search(script.text).group(1))
This gets me the desired result:
21st Street
However, I was trying to get the whole object by trying different variations of the regex, and couldn't get the output to be:
{"street": "21st Street", "apartment": "2101", "available": false}
I have tried the following:
pattern = re.compile(r'\"property\":\s*\"(.*?)\{\"', re.MULTILINE | re.DOTALL)
It's not producing the desired result.
Your help is appreciated!
Thanks.
As commented above, correct your typo and use this:
r"property\W+({.*?})"
property : look for the exact string
\W+ : matches one or more non-word characters
({.*?}) : capture group one
.* matches any characters inside the braces {}
? makes the match as short as possible
You can try this:
\"property\":\s*(\{.*?\})
Capture group 1 contains your desired data.
Sample code:
import re
regex = r"\"property\":\s*(\{.*?\})"
test_str = ("window._propertyData = \n"
" { *** a lot of random code and other data ***\n"
" \"property\": {\"street\": \"21st Street\", \"apartment\": \"2101\", \"available\": false}\n"
" *** more data ***\n"
" }")
matches = re.finditer(regex, test_str, re.MULTILINE | re.DOTALL)
for matchNum, match in enumerate(matches):
    print(match.group(1))
Try this. It may be long, but it works fine:
\"property\"\:\s*(\{((?:\"\w+\"\:\s*\"?[\w\s]+\"?\,?\s?)+?)\})
https://regex101.com/r/7KzzRV/3
import re
import ast
data = """
<script type="text/javascript">
window._propertyData =
{ *** a lot of random code and other data ***
"property": {"street": "21st Street", "apartment": "2101", "available": false}
*** more data ***
}
</script>
"""
property = re.search(r'"property": ({.+?})', data)
str_form = property.group(1)
print('str_form: ' + str_form)
dict_form = ast.literal_eval(str_form.replace('false', 'False'))
print('dict_form: ', dict_form)
out:
str_form: {"street": "21st Street", "apartment": "2101", "available": false}
dict_form: {'available': False, 'street': '21st Street', 'apartment': '2101'}
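Since the captured text is already valid JSON, json.loads can parse it directly, without replacing false by hand. A minimal sketch, assuming Python 3; the names match and property_data are illustrative. The non-greedy pattern stops at the first closing brace, so it only suits a flat object like this one, not one with nested braces.

```python
import json
import re

data = """
window._propertyData =
{ *** a lot of random code and other data ***
"property": {"street": "21st Street", "apartment": "2101", "available": false}
*** more data ***
}
"""

# Capture the object that follows "property": up to the first closing brace
match = re.search(r'"property":\s*(\{.*?\})', data, re.DOTALL)

# json.loads converts JSON false/true/null to Python False/True/None
property_data = json.loads(match.group(1)) if match else None
print(property_data)
```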
