Extract string with more than one URL

I have a string with several URLs inside it. I have managed to use regex to extract the first URL, but I really need them all. My script so far below:
data = ['https://images-na.ssl-images-amazon.com/images/I/41M9WbK3MDL._SX342_.jpg":[355,342],"https://images-na.ssl-images-amazon.com/images/I/41M9WbK3MDL._SX425_.jpg":[441,425],"https://images-na.ssl-images-amazon.com/images/I/41M9WbK3MDL.jpg":[500,482],"https://images-na.ssl-images-amazon.com/images/I/41M9WbK3MDL._SX466_.jpg":[483,466],"https://images-na.ssl-images-amazon.com/images/I/41M9WbK3MDL._SX385_.jpg":[399,385]}']
s = data[0]  # data is a one-element list; work on the string inside it
url = s[s.find("https://"):]
url[:url.find('"')]
Sorry - above script didn't use regex, but was another way I tried to do this. My regex script below which pretty much does the same thing. I don't really mind what we use, just want to try get all the URLs, since both my scripts only extract the first URL.
import re
url = re.search(r'(https)://.*?\.(jpg)', data[0])
if url:
    print(url.group(0))
I am scraping Amazon products; that is the context. I've also updated the string to one of the actual examples. Thanks, everyone, for the comments and help.

Maybe this way:
URL_list = [i for i in data[0].split('"') if 'http' in i]
It doesn't use regex, but I don't see a need for regex here. (data[0] is used because data is a one-element list holding the string.)
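If a regex is still preferred, the reason the question's attempt returns only one URL is that re.search stops at the first match. A minimal sketch with re.findall, assuming the URLs always end in .jpg and never contain a double quote (the text variable is a shortened stand-in for data[0]):

```python
import re

# Shortened stand-in for the string in the question (assumed to be data[0]).
text = ('https://images-na.ssl-images-amazon.com/images/I/41M9WbK3MDL._SX342_.jpg":[355,342],'
        '"https://images-na.ssl-images-amazon.com/images/I/41M9WbK3MDL.jpg":[500,482]}')

# findall returns every non-overlapping match, not just the first one.
# [^"]*? keeps each match inside a single quoted key.
urls = re.findall(r'https://[^"]*?\.jpg', text)
print(urls)
```

If other image types can appear, the pattern would need an alternation such as (?:jpg|png) in place of the fixed extension.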

Your new example string (from data[0]) is missing an opening curly brace and a double quote but after adding that, you can read it as JSON using the standard library. You might have simply copy/pasted it incorrectly.
In[2]: data = ['https://images-na.ssl-images-amazon.com/images/I/41M9WbK3MDL._SX342_.jpg":[355,342],"https://images-na.ssl-images-amazon.com/images/I/41M9WbK3MDL._SX425_.jpg":[441,425],"https://images-na.ssl-images-amazon.com/images/I/41M9WbK3MDL.jpg":[500,482],"https://images-na.ssl-images-amazon.com/images/I/41M9WbK3MDL._SX466_.jpg":[483,466],"https://images-na.ssl-images-amazon.com/images/I/41M9WbK3MDL._SX385_.jpg":[399,385]}']
In[3]: import json
In[4]: d = json.loads('{"%s' % data[0])
In[5]: d
Out[5]:
{'https://images-na.ssl-images-amazon.com/images/I/41M9WbK3MDL._SX342_.jpg': [355,
342],
'https://images-na.ssl-images-amazon.com/images/I/41M9WbK3MDL._SX425_.jpg': [441,
425],
'https://images-na.ssl-images-amazon.com/images/I/41M9WbK3MDL.jpg': [500,
482],
'https://images-na.ssl-images-amazon.com/images/I/41M9WbK3MDL._SX466_.jpg': [483,
466],
'https://images-na.ssl-images-amazon.com/images/I/41M9WbK3MDL._SX385_.jpg': [399,
385]}
In[6]: list(d.keys())
Out[6]:
['https://images-na.ssl-images-amazon.com/images/I/41M9WbK3MDL._SX342_.jpg',
'https://images-na.ssl-images-amazon.com/images/I/41M9WbK3MDL._SX425_.jpg',
'https://images-na.ssl-images-amazon.com/images/I/41M9WbK3MDL.jpg',
'https://images-na.ssl-images-amazon.com/images/I/41M9WbK3MDL._SX466_.jpg',
'https://images-na.ssl-images-amazon.com/images/I/41M9WbK3MDL._SX385_.jpg']

Related

How do I create a dynamic url based on user input using python?

New to python here so this is mostly a syntax & library question. I'm looking for the most efficient way to create a dynamic url within python that is built on user input to later search and parse.
Example of the code I have so far:
collection = "MERRA"
version = "5.12.4"
url = "https://misc.gov/search/granules.umm_json?short_name=" , {collection}, "&" , "Version=" ,{version}, "&pretty=true'"
Output:
>>> print(url)
('https://misc.gov/search/granules.umm_json?short_name=', {'M2I1NXASM'}, '&', 'Version=', {'5.12.4'}, "&pretty=true'")
Goal Output:
>>> print(url)
https://misc.gov/search/granules.umm_json?short_name=M2I1NXASM&Version=5.12.4&pretty=true
collection and version are manually defined by the user for now.
Which python libraries (if needed) are best to use for this? How can I fix my syntax so that my output doesn't contain spaces, quotes, or curly brackets?

Try string concatenation, substituting the variable values:
collection = "MERRA"
version = "5.12.4"
url = "https://misc.gov/search/granules.umm_json?short_name="+str(collection)+"&"+"Version="+str(version)+"&pretty=true"
print('#'*100)
print(url)
####################################################################################################
https://misc.gov/search/granules.umm_json?short_name=MERRA&Version=5.12.4&pretty=true
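As for which library to use: the standard library's urllib.parse.urlencode can build the query string and also escapes characters that are not URL-safe. A minimal sketch, assuming Python 3.7+ (so the dict keeps its insertion order); build_url and BASE_URL are names made up for this example:

```python
from urllib.parse import urlencode

BASE_URL = "https://misc.gov/search/granules.umm_json"  # endpoint from the question

def build_url(collection, version):
    # urlencode joins the pairs with "&" and percent-encodes unsafe characters.
    params = {"short_name": collection, "Version": version, "pretty": "true"}
    return BASE_URL + "?" + urlencode(params)

print(build_url("M2I1NXASM", "5.12.4"))
```

For values known to be URL-safe, an f-string such as f"{BASE_URL}?short_name={collection}&Version={version}&pretty=true" gives the same result without the import.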

Removing text from response body

I have some Python code that uses requests to send a POST request and retrieve data from a server. It should return JSON, but the response starts with the string /*-secure-, followed by the normal JSON structure, and then ends with something else that doesn't belong to JSON: */.
How can I get rid of this stuff which leads the JSON decoder to generate a traceback? Thank you!

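You can use the strip() function. Note that strip() treats its argument as a set of characters to remove from both ends, not as a literal prefix, so this works here because the JSON itself starts with { and ends with }.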
In [1]: x = "/*-secure-{'test': 'yes'}-secure-*/"
In [2]: y = x.strip("/*-secure-")
In [3]: y
Out[3]: "{'test': 'yes'}"
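A more explicit variant removes the wrapper as a literal prefix and suffix and then hands the remainder to json.loads. This is only a sketch: the prefix /*-secure- and suffix */ are taken from the question's description, the sample body is invented, and parse_wrapped_json is a hypothetical helper name.

```python
import json

PREFIX = "/*-secure-"  # as described in the question
SUFFIX = "*/"          # as described in the question

def parse_wrapped_json(body):
    """Remove the comment-style wrapper, if present, and decode the JSON inside."""
    body = body.strip()  # plain whitespace trim
    if body.startswith(PREFIX):
        body = body[len(PREFIX):]
    if body.endswith(SUFFIX):
        body = body[:-len(SUFFIX)]
    return json.loads(body)

# Invented sample body; with requests this would be response.text.
print(parse_wrapped_json('/*-secure-{"test": "yes"}*/'))
```

Unlike strip() with a character set, this cannot eat into the payload if the JSON happens to start or end with one of the wrapper's characters.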

This is ugly and I would personally go with @wpercy's answer, but I've not posted a Python answer for a while.
>>> x = "/*-secure-{'test': 'yes'}-secure-*/"
>>> x.split("-secure-")[1]
"{'test': 'yes'}"

Do I dare mention this? (Yes, I do.)
>>> x = "/*-secure-{'test': 'yes'}-secure-*/"
>>> x[10:-10]
"{'test': 'yes'}"

Regex to parse out a part of URL using python

I have data as follows:
data['url']
http://hostname.com/aaa/uploads/2013/11/a-b-c-d.jpg https://www.aaa.com/
http://hostname.com/bbb/uploads/2013/11/e-f-g-h.gif https://www.aaa.com/
http://hostname.com/ccc/uploads/2013/11/e-f-g-h.png http://hostname.com/ccc/uploads/2013/11/a-a-a-a.html
http://hostname.com/ddd/uploads/2013/11/w-e-r-t.ico
http://hostname.com/ddd/uploads/2013/11/r-t-y-u.aspx https://www.aaa.com/
http://hostname.com/bbb/uploads/2013/11/t-r-w-q.jpeg https://www.aaa.com/
I want to find file extensions such as .jpg, .gif, .png, .ico, .aspx, .html and .jpeg, and extract the text that precedes each one, back to the nearest "/". I also want to catch multiple occurrences within the same string. My output should be:
data['parsed']
a-b-c-d
e-f-g-h
e-f-g-h a-a-a-a
w-e-r-t
r-t-y-u
t-r-w-q
Instead of writing an individual command for each format, is there a way to handle them all with a single command?
Can anybody help me write this? I am new to regex and any help would be appreciated.

This builds a list of (name, extension) pairs:
import re
results = []
for link in data:
    # re.search returns only the first match in each string, or None
    matches = re.search(r'/(\w-\w-\w-\w)\.(\w{2,})\b', link)
    if matches:
        results.append((matches.group(1), matches.group(2)))

This pattern returns the file names. I have used just one of your URLs to demonstrate; for more, you could simply append the matches to a list of results:
import re
url = "http://hostname.com/ccc/uploads/2013/11/e-f-g-h.png http://hostname.com/ccc/uploads/2013/11/a-a-a-a.html"
p = r'((?:[a-z]-){3}[a-z])\.'
matches = re.findall(p, url)
>>> print('\n'.join(matches))
e-f-g-h
a-a-a-a
This assumes that the URLs all have the general form you provided.

You might try this:
data['parse'] = re.findall(r'[^/\s]+\.[a-z]+(?=\s|$)', data['url'])
That will pick out all of the file names with their extensions (the lookahead requires whitespace or the end of the string after the extension, so host names such as hostname.com are skipped). If you want to remove the extensions, the code above returns a list which you can then process with a list comprehension and re.sub like so:
[re.sub(r'\.[a-z]+$', '', exp) for exp in data['parse']]
Use the .join function to create a string, as demonstrated in Totem's answer.
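To handle every format with a single command, the extensions can go into one alternation. The sketch below assumes each file name is followed by whitespace or the end of the string, as in the sample data; EXTENSIONS, PATTERN and file_stems are names made up for the example.

```python
import re

# Extensions listed in the question.
EXTENSIONS = ("jpg", "jpeg", "gif", "png", "ico", "aspx", "html")

# "/" + name + "." + one of the extensions, followed by whitespace or end of string.
PATTERN = re.compile(r"/([^/\s]+)\.(?:%s)(?=\s|$)" % "|".join(EXTENSIONS))

def file_stems(text):
    """Return all matching file names, without extension, joined by spaces."""
    return " ".join(PATTERN.findall(text))

row = ("http://hostname.com/ccc/uploads/2013/11/e-f-g-h.png "
       "http://hostname.com/ccc/uploads/2013/11/a-a-a-a.html")
print(file_stems(row))
```

If data is a pandas DataFrame (an assumption; the question does not say), the function could be applied per row with data['url'].apply(file_stems).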

How to do parsing in python?

I'm kind of new to Python, and I'm trying to find out how to do parsing in Python.
I've got a task: parse some data made up of symbols I don't understand and put it into a DB. I guess I can create the DB and tables with the help of SQLAlchemy, but I have no idea how to do the parsing or what all the symbols below mean.
http://joxi.ru/YmEVXg6Iq3Q426
http://joxi.ru/E2pvG3NFxYgKrY
$$HDRPUBID 112701130020011127162536
H11127011300UNIQUEPONUMBER120011127
D11127011300UNIQUEPONUMBER100001112345678900000001
D21127011300UNIQUEPONUMBER1000011123456789AR000000001
D11127011300UNIQUEPONUMBER200002123456987X000000001
D21127011300UNIQUEPONUMBER200002123456987XIR000000000This item is inactive. 9781605600000
$$EOFPUBID 1127011300200111271625360000005
Thanks in advance to anyone who can give me some advice on where to start and how the parsing should work.

The best approach is to first figure out where each token begins and ends, and write a regular expression to capture these. The site RegexPal might help you design the regex.

As others suggest, take a look at some regex tutorials, and also at the re module help.
You are probably looking for something like this:
import re
# 1-based, inclusive (start, end) column positions of each header field
headerMapping = {'type': (1,5), 'pubid': (6,11), 'batchID': (12,21),
                 'batchDate': (22,29), 'batchTime': (30,35)}
poaBatchHeaders = re.findall(r'\$\$HDR\d{30}', text)
parsedBatchHeaders = []
for poaHeader in poaBatchHeaders:
    batchHeaderDict = {}  # a fresh dict per header, so list entries are not shared
    for key in headerMapping:
        start = headerMapping[key][0]-1
        end = headerMapping[key][1]
        batchHeaderDict.update({key: poaHeader[start:end]})
    parsedBatchHeaders.append(batchHeaderDict)
Then you have a list of dicts, one per structure found (the POA batch header in this example), each holding the value of every attribute. I assume your data file has been read into text as a string.
If you want to parse it further, you have to write a function that parses the data in each attribute.
def batchDate(batch):
    return (batch[0:2]+'-'+batch[2:4]+'-20'+batch[4:])
for header in parsedBatchHeaders:
    header.update({'batchDate': batchDate( header['batchDate'] )})
Remember, that's an example and I don't know the documentation of your data! I guess it works like that, but the rest is up to you.
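The same column mapping can be tried directly on the header line from the question by slicing, without a regex. This is only a sketch: the layout is the guessed mapping above, and reading the date as YYYYMMDD and the time as HHMMSS is an assumption, not something taken from the format's documentation. parse_fixed_width and HEADER_LAYOUT are hypothetical names.

```python
from datetime import datetime

# Hypothetical layout: 1-based, inclusive column ranges, following the mapping above.
HEADER_LAYOUT = {"type": (1, 5), "pubid": (6, 11), "batchID": (12, 21),
                 "batchDate": (22, 29), "batchTime": (30, 35)}

def parse_fixed_width(line, layout):
    """Slice one fixed-width record into a dict of named fields."""
    return {name: line[start - 1:end] for name, (start, end) in layout.items()}

header = parse_fixed_width("$$HDRPUBID 112701130020011127162536", HEADER_LAYOUT)

# Assumption: the date is YYYYMMDD and the time is HHMMSS.
stamp = datetime.strptime(header["batchDate"] + header["batchTime"], "%Y%m%d%H%M%S")
print(header)
print(stamp)
```

The other record types (lines starting with H1, D1, D2 and $$EOF) would each need their own layout dict, chosen by looking at the first characters of the line.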

Slicing URL with Python

I am working with a huge list of URLs. Just a quick question: I am trying to slice out part of a URL, see below:
http://www.domainname.com/page?CONTENT_ITEM_ID=1234&param2&param3
How could I slice out:
http://www.domainname.com/page?CONTENT_ITEM_ID=1234
Sometimes there are more than two parameters after CONTENT_ITEM_ID, and the ID is different each time. I am thinking it can be done by finding the first & and keeping only the characters before it, but I'm not quite sure how to do this.
Cheers

Use the urlparse module. Check this function:
import urlparse  # Python 2; in Python 3 this module is urllib.parse
def process_url(url, keep_params=('CONTENT_ITEM_ID=',)):
    parsed = urlparse.urlsplit(url)
    # keep only the query items that start with one of the wanted prefixes
    filtered_query = '&'.join(
        qry_item
        for qry_item in parsed.query.split('&')
        if qry_item.startswith(keep_params))
    return urlparse.urlunsplit(parsed[:3] + (filtered_query,) + parsed[4:])
In your example:
>>> process_url(a)
'http://www.domainname.com/page?CONTENT_ITEM_ID=1234'
This function has the added bonus that it's easier to use if you decide that you also want some more query parameters, or if the order of the parameters is not fixed, as in:
>>> url='http://www.domainname.com/page?other_value=xx&param3&CONTENT_ITEM_ID=1234&param1'
>>> process_url(url, ('CONTENT_ITEM_ID', 'other_value'))
'http://www.domainname.com/page?other_value=xx&CONTENT_ITEM_ID=1234'
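On Python 3 the same idea can be sketched with urllib.parse, letting parse_qsl split the query into key/value pairs instead of splitting on & by hand. keep_query_params is a hypothetical name, and exact key matching (rather than the prefix matching used above) is a choice made for this example.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

def keep_query_params(url, keep=("CONTENT_ITEM_ID",)):
    """Return url with only the query parameters whose names are in keep."""
    parts = urlsplit(url)
    # keep_blank_values=True so bare parameters such as "param2" are parsed, not dropped early
    pairs = parse_qsl(parts.query, keep_blank_values=True)
    kept = [(key, value) for key, value in pairs if key in keep]
    return urlunsplit(parts._replace(query=urlencode(kept)))

print(keep_query_params(
    "http://www.domainname.com/page?CONTENT_ITEM_ID=1234&param2&param3"))
```

Because the parameters are matched by name, the result does not depend on where CONTENT_ITEM_ID appears in the query string.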

The quick and dirty solution is this:
>>> "http://something.com/page?CONTENT_ITEM_ID=1234&param3".split("&")[0]
'http://something.com/page?CONTENT_ITEM_ID=1234'

Another option would be to use the split function, with & as a parameter. That way, you'd extract both the base url and both parameters.
url.split("&")
returns a list with
['http://www.domainname.com/page?CONTENT_ITEM_ID=1234', 'param2', 'param3']

I figured it out; below is what I needed to do:
url = "http://www.domainname.com/page?CONTENT_ITEM_ID=1234&param2&param3"
url = url[: url.find("&")]
print(url)
'http://www.domainname.com/page?CONTENT_ITEM_ID=1234'
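One caveat with the find() approach: find() returns -1 when the URL contains no &, and url[:-1] then silently drops the last character. A small sketch using str.partition, which leaves such URLs untouched (first_param_only is a name made up for the example):

```python
def first_param_only(url):
    """Return everything before the first '&', or the whole URL if there is none."""
    head, _sep, _tail = url.partition("&")
    return head

print(first_param_only("http://www.domainname.com/page?CONTENT_ITEM_ID=1234&param2&param3"))
print(first_param_only("http://www.domainname.com/page?CONTENT_ITEM_ID=1234"))
```

Like the original, this still assumes CONTENT_ITEM_ID is the first parameter in the query string.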

Parsing URLs is never as simple as it seems to be; that's why there are the urlparse and urllib modules.
E.g.:
import urllib  # Python 2
url ="http://www.domainname.com/page?CONTENT_ITEM_ID=1234&param2&param3"
query = urllib.splitquery(url)
result = "?".join((query[0], query[1].split("&")[0]))
print(result)
'http://www.domainname.com/page?CONTENT_ITEM_ID=1234'
This is still not 100% reliable, but it is much more reliable than splitting the URL yourself, because there are a lot of valid URL formats that you and I don't know about and will only discover one day in error logs.

import re
url = 'http://www.domainname.com/page?CONTENT_ITEM_ID=1234&param2&param3'
# lazily capture everything up to the first '&'
m = re.search(r'(.*?)&', url)
print(m.group(1))

Look at the urllib2 file name question for some discussion of this topic.
Also see the "Python Find Question" question.

This method isn't dependent on the position of the parameter within the url string. This could be refined, I'm sure, but it gets the point across.
url = 'http://www.domainname.com/page?CONTENT_ITEM_ID=1234&param2&param3'
parts = url.split('?')
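# partition() tolerates parameters without '=' (such as param2), where split('=') would make dict() fail
params = dict(i.partition('=')[::2] for i in parts[1].split('&'))
new_url = parts[0] + '?CONTENT_ITEM_ID=' + params['CONTENT_ITEM_ID']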

An ancient question, but still, I'd like to remark that query string parameters can also be separated by ';', not only '&'.

Besides urlparse, there is also furl, which IMHO has a better API.
