Extracting URLs of a certain pattern from text - python

I have the following URLs that need to be fetched from a page. This is the attempt I made, but it skips part of the string if it contains =. Different kinds of URLs are given below:
AjaxRender.htm?encparams=2~2586506573108327708~9SpSI_aPBiryk3VIKwmkjN-FD4jkS5GoDobsBCN6pRnZOhBsmrEOgT8vg5KciKjOmt25k3kEDZ00r7f48bIsRPSZWTHJbSCpS815cCNyQrsobBzLZlao8ww-rWwg0lLDIb10gJ3vWUl3zLIAQi5vBGLglJKXcSEg7wCXZUEm5aVHCQiGChz5f8oeiBtPXAV_A9XQ7xU5HUzyzTzyEMJICw==&rwebid=8347449&rhost=1
AjaxRender.htm?encparams=7~502085760588479881~-8dtDO_8-jpTBfqALerDcDLkIIRWnom8BG9WdtIVgqGTlRDn37waNvbaM_VHLrcntsGabZPzMiFlNxsrmqx4VpCZrtJmjyCcOBr9AY1B2GxnTlh3ngYfIYbhnDi-W6Hpb8V77OSS-WviMKsgF87gcWvjGzEd02a7Q_3XQ2FvdZ2rvwDlwG4izypuO7Ob63Gh&rwebid=8347449&rhost=1
AjaxRender.htm?encparams=5~6917684668841666406~70K0Ijfg4OhPeKlzLP8aQV4JjBq9WuXnpC3enGYXfyWj5-28RyHRnjGRJypZBi0knr3io-9UdjdlOWuLqisI_pkZ0hQzFA5bhlRkX7siC6uMUA6A_MntiLDNGTrKN47TvrAxRd_JpQQUprReVHYwSdUEQvVUtpKn1_Ku5WG_zyWe_0Sd7FLftU1ti6pYf_tfMyNiDalQzyrPDQ35sAXcYIDyhSYI08uZCmTq5vrjSNkQChnMSW73MKri42rVM3JVP8j5LfCf3Zrws54M8KkFRnvfsyYeYd-hATgywsv9i2rtU3A-KPP6lSrL6jqbkAXVTezFRYV00ZNUhvX8NrL8Ew==&rwebid=8347449&rhost=1
AjaxRender.htm?encparams=2~3180584448022130058~v_d5bPfBCJINSmPxaUaByy3S5n5h5UbQ53k5QKhqYbz7KXeHku95HjcqE2MnU18rRhcdnBghBW90u-GS3tqZc5FBGt6Z9-mNBnr6RPTiAlIdlG9vO8QDW7e7vMS5H2Yue3sRQ5ANzNKGoAXe3Z5GpC1HWW9DA55OGRkLRsGdNRbN3VkqiVpObCQNGHyDYhfrh_WF8uPpAb5WE2s9sDvrSVDkUfuvHclAarXoua9OYsDQtYaDGxaquDkZrIO-VEYgjv-CPKwCkOQyOVqdq--QQ-GQNvi8vHk05uoiU6-9Kg4=&rwebid=8347449&rhost=1
AjaxRender.htm?encparams=5~7279828915011575224~9KhHzCPV9FXMYfGNPF7W0MNL_4Ljv3YFdCr_JVtQN0GhUhD7ohGtUTYCzRJvS4sI6uyoM3TTrNmHaMsidk_BiN9qXRKpdEhJHGgfHHLzU1vtAXejIwnQUxB5Oexjkt74WeBnEfVSrxVfvhRM3LoB076SYiK5x92bA8WqJg62YtsUWV7vqtsCpvKyn9ssF7nnjlTmUqIWpBkqC9ZtcfN7-A==&rwebid=8347449&rhost=1
AjaxRender.htm?encparams=3~4781276045400603393~duZpRpWJA0naDjmpXNSp__ILjEXoOrwiv9SVBUjldBK4ebRdYWlzxwRudeyHrXoCC-XM_xEKr475_ViwwaHlnqFgEqteM3N6bDAgOxWEc8Y5Klh5d3Ivb_6qG6VsfMmp8oaT3nLnuALjX8vfqBN72WsNlwWeGMR3lOTuQnHgbl2betlejT6KsRx7ycVv71mxe8BP7oDIdI29Baetjlv1YA==&rwebid=8347449&rhost=1
AjaxRender.htm?encparams=6~7112793196313446100~IVBMr0jpuDOH9HKclY47FtAJQXrgqOsD6P7mbOwJOcbDWAbviVmEg1HZScYqiKL5svd6BGA7jm22V6uEvquNb_-cZEyfDIFGbNxF3WNTwXcGX13GWcVi6tg7Acgdw8SHEEvhJzw1U01lvMS-Ptks6eeWj0cDdM_Al9hS5WkUA4ZR7rQK5CU9Uovn9WWF5I-6Ot0zcXZKaJMNIndiPYdIq0rpcpehlB8k&rwebid=8347449&rhost=1
AjaxRender.htm?encparams=10~1438958542856547329~OUrqnIrSPt0QON_7Q12RhcKfwyc22cFvE0xIIobEoUIFu91yWu5SK_jSW59wazXcfcxjpZnQ9YTWAH5kxu8H2B-lu2vO9J47cqg9ThA6AvDFRhj-6moF1_6ymrCKqhbcJdQddN24hShw9IwJOs2uDYJ2bECVJlnoraak4PGtBLHV4TnoVy9eZJxVPNB3XbIumIivk84XZyg=&rwebid=8347449&rhost=1
The issue I am facing is that the string before &rwebid occasionally contains - or = characters (base64?), which breaks the match.
Updated
https://regex101.com/r/pudx92/2

Why not just stop at the string delimiter " ?
AjaxRender[.]htm[?]encparams=[^\"]*
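A quick sketch of that approach in Python (the HTML snippet below is a shortened, hypothetical stand-in for the real page source):

```python
import re

# shortened, hypothetical snippet of the page source
html = '<a href="AjaxRender.htm?encparams=2~123~abc-def=&rwebid=8347449&rhost=1">x</a>'

# match from the literal prefix up to (not including) the closing quote,
# so '-' and '=' inside the base64-like blob cannot break the match
urls = re.findall(r'AjaxRender\.htm\?encparams=[^"]*', html)
print(urls)
```

Because the pattern only stops at the quote, nothing inside the encparams blob needs to be anticipated.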

Related

python3.6 How do I regex a url from a .txt?

I need to grab a url from a text file.
The URL is stored in a string like so: 'URL=http://example.net'.
Is there any way I could grab everything after the = char up until the . in '.net'?
Could I use the re module?
text = """A key feature of effective analytics infrastructure in healthcare is a metadata-driven architecture. In this article, three best practice scenarios are discussed: https://www.healthcatalyst.com/clinical-applications-of-machine-learning-in-healthcare Automating ETL processes so data analysts have more time to listen and help end users , https://www.google.com/, https://www.facebook.com/, https://twitter.com
code below catches all urls in text and returns urls in list."""
import re

urls = re.findall(r'(?:(?:https?|ftp)://)?[\w/\-?=%.]+\.[\w/\-?=%.]+', text)
print(urls)
output:
[
'https://www.healthcatalyst.com/clinical-applications-of-machine-learning-in-healthcare',
'https://www.google.com/',
'https://www.facebook.com/',
'https://twitter.com'
]
I don't have much information, but I will try to help with what I've got. I'm assuming that URL= is part of the string; in that case you can do this:
re.findall(r'URL=(.*?)\.', STRINGNAMEHERE)
Let me go into more detail about (.*?): the dot means any character (except the newline character), the star means zero or more occurrences, and the trailing ? makes the repetition non-greedy, so it matches as little as possible before the next part of the pattern. The parentheses place it all into a group. All this together basically means it will capture everything in between URL= and the first . (note the dot has to be escaped as \. so it matches a literal period rather than any character).
You don't need regexes (the re module) for such a simple task.
If the string you have is of the form:
'URL=http://example.net'
Then you can solve this using basic Python in numerous ways, one of them being:
file_line = 'URL=http://example.net'
start_position = file_line.find('=') + 1 # this gives you the first position after =
end_position = file_line.find('.')
# this extracts from the start_position up to but not including end_position
url = file_line[start_position:end_position]
Of course, this is just going to extract one URL. Assuming that you're working with a large text where you'd want to extract all URLs, you'll want to put this logic into a function so that you can reuse it and build around it (achieve iteration via while or for loops and, depending on how you're iterating, keep track of the position of the last extracted URL, and so on).
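A rough sketch of that idea (extract_urls is a hypothetical helper; it mirrors the find/slice approach above in a loop):

```python
def extract_urls(text, marker='URL='):
    """Collect every substring between a 'URL=' marker and the next '.'."""
    urls = []
    pos = 0
    while True:
        start = text.find(marker, pos)
        if start == -1:          # no more markers
            break
        start += len(marker)
        end = text.find('.', start)
        if end == -1:            # marker without a terminating dot
            break
        urls.append(text[start:end])
        pos = end                # continue scanning after this match
    return urls

print(extract_urls("URL=http://example.net and URL=http://other.org"))
```

Keeping track of pos is what lets the loop resume after each extracted URL instead of finding the same one repeatedly.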
Word of advice
This question has been answered quite a lot on this forum, by very skilled people, in numerous ways, for instance: here, here, here and here, to a level of detail that you'd be amazed. And these are not all, I just picked the first few that popped up in my search results.
Given that (at the time of posting this question) you're a new contributor to this site, my friendly advice would be to invest some effort into finding such answers. It's a crucial skill that you can't do without in the world of programming.
Remember that whatever problem you are encountering, there is a very high chance that somebody on this forum has already encountered it and received an answer; you just need to find it.
Please try this. It worked for me.
import re
s='url=http://example.net'
print(re.findall(r"=(.*)\.",s)[0])

How to extract long URL from email with Python?

I need to extract a very long URL (example below) from an email message that I grab using Gmail's IMAP.
https://example.com/account/resetpassword?code=e8EkT%2B48uMCHr3Sq4QZVr0%2FVHrTBwQvhYwubjeaKozn29I7VGvWSYNO6VNRLXCK230P%2FklDrFC6BpPI7OF%2F5yawHlux80jqTBhTq2QRS4r7sEnSM9qKV1mIXkTzx%2B5tjakgElg%3D%3D&returnUrl=example.com
However, when I try to print the grabbed message, I notice that my long URL has some extra things like =\r\n and 3D inside it (see examples below), or it is split across several lines by =.
https://example.com/account/resetpa=\r\nssword?code=3De8EkT%2B48uMCHr3Sq4QZVr0%2FVHrTBwQvhYwubjeaKozn29I7VGvWSYNO6V=\r\nNRLXCK230P%2FklDrFC6BpPI7OF%2F5yawHlux80jqTBhTq2QRS4r7sEnSM9qKV1mIXkTzx%2B5=\r\ntjakgElg%3D%3D&returnUrl=3Dexample.com
https://example.com/account/resetpa=
ssword?code=3De8EkT%2B48uMCHr3Sq4QZVr0%2FVHrTBwQvhYwubjeaKozn29I7VGvWSYNO6V=
NRLXCK230P%2FklDrFC6BpPI7OF%2F5yawHlux80jqTBhTq2QRS4r7sEnSM9qKV1mIXkTzx%2B5=
tjakgElg%3D%3D&returnUrl=3Dexample.com
How can I make sure that nothing is added to the long URL so that I could use it later to open?
I believe that format with = and 3D is called quoted printable. https://en.wikipedia.org/wiki/Quoted-printable
You could try using quopri.decodestring(string). https://docs.python.org/2/library/quopri.html
"=\r\n" is a quoted-printable soft line break, which you can get rid of using urlstring.replace("=\r\n", ""). "=3D" is the quoted-printable encoding of = itself, so it also needs decoding, otherwise code=3D... will carry a stray 3D. The soft line breaks are what print your URL on different lines.
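A minimal sketch using the quopri module mentioned above (the raw string is a shortened, made-up version of the wrapped link):

```python
import quopri

# shortened, hypothetical version of the wrapped link from the email body
raw = "https://example.com/account/resetpa=\r\nssword?code=3Dabc%2Fdef"

# decodestring removes soft line breaks ("=\r\n") and turns "=3D" back
# into "="; the %2F URL-escape is untouched, since % is not special
# in quoted-printable
decoded = quopri.decodestring(raw.encode()).decode()
print(decoded)
```

Decoding the whole message body this way is usually safer than hand-stripping the artifacts with replace().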

How to place the iterated string variable in another string

Quick overview: I am writing a very simple script using Python and Selenium to view Facebook Metrics for multiple Facebook pages.
I am trying to find a clean way to loop through the pages and output their results (it's only one number that I am collecting).
Here is what I have right now but it is not working.
# Navigate to metrics page
pages = ["page_example_1", "page_example_2", "page_example_3"]
for link in pages:
    browser.get(('https://www.facebook.com/{link}/insights/?section=navVideos'))
# Navigate to metrics page
pages = ["page_example_1", "page_example_2", "page_example_3"]
for link in pages:
    browser.get('https://www.facebook.com/' + link + '/insights/?section=navVideos')
It's just string concatenation.
Or, if you are so inclined to use that syntax, have a look at the comment by #heather.
It didn't work for you because you aimed to use Python 3.6's f-strings but forgot to prepend your string with the f character, which is crucial for this syntax. E.g. your code should be (only the relevant part):
browser.get(f'https://www.facebook.com/{link}/insights/?section=navVideos')
Alternatively, you could use string formatting (the established approach before 3.6):
browser.get('https://www.facebook.com/{}/insights/?section=navVideos'.format(link))
In general, string concatenation - 'string1' + variable + 'string2' - is discouraged in Python for performance and readability reasons.
BTW, in your sample code you had double parentheses around get()'s argument - browser.get((arg)). A single parenthesized expression is not a tuple (that would need a trailing comma), so the extra pair is harmless but redundant; as you can see, I and the other responders have removed it.
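For comparison, a small self-contained sketch showing that all three forms produce the same URL (browser.get is omitted since it needs a live Selenium session):

```python
pages = ["page_example_1", "page_example_2", "page_example_3"]

for link in pages:
    # note the f prefix on the first form - without it, {link} stays literal
    url_f = f'https://www.facebook.com/{link}/insights/?section=navVideos'
    url_format = 'https://www.facebook.com/{}/insights/?section=navVideos'.format(link)
    url_concat = 'https://www.facebook.com/' + link + '/insights/?section=navVideos'
    assert url_f == url_format == url_concat
    print(url_f)
```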

Python - How to read a CSV file, separated by commas, which has commas within the values?

The file has a URL which contains commas within it. For example:
~oref=https://tuclothing.tests.co.uk/c/Girls/Girls_Underwear_Socks&Tights?INITD=GNav-CW-GrlsUnderwear&title=Underwear,+Socks+&+Tights
Between Underwear and +Socks there is a comma which is making my life not easy.
Is there a way to indicate to the reader (pandas, csv reader, etc.) that the whole URL is just one value?
This is a bigger sample with columns and values:
Event Time,User ID,Advertiser ID,TRAN Value,Other Data,ORD Value,Interaction Time,Conversion ID,Segment Value 1,Floodlight Configuration,Event Type,Event Sub-Type,DBM Auction ID,DBM Request Time,DBM Billable Cost (Partner Currency),DBM Billable Cost (Advertiser Currency),
1.47E+15,CAESEKoMzQamRFTrkbdTDT5F-gM,2934701,,~oref=https://tuclothing.tests.co.uk/c/NewIn/NewIn_Womens?q=%3AnewArrivals&page=2&size=24,4.60E+12,1.47E+15,1,0,940892,CONVERSION,POSTCLICK,,,0,0,
1.47E+15,CAESEKQhGXdLq0FitBKF5EPPfgs,2934701,,~oref=https://tuclothing.tests.co.uk/c/Women/Women_Accessories?INITD=GNav-WW-Accesrs&q=%3AnewArrivals&title=Accessories&mkwid=sv5biFf2y_dm&pcrid=90361315613&pkw=leather%20bag&pmt=e&med=Search&src=Google&adg=Womens_Accessories&kw=leather+bag&cmp=TU_Women_Accessories&adb_src=4,4.73E+12,1.47E+15,1,0,940892,CONVERSION,POSTCLICK,,,0,0,
1.47E+15,CAESEEpNRaLne21k6juip9qfAos,2934701,,num=16512910;~oref=https://tuclothing.tests.co.uk/,1,1.47E+15,1,0,940892,CONVERSION,POSTCLICK,,,0,0,
1.47E+15,CAESEJ3a2YRrPSSeeRUFHDSoXNQ,2934701,,~oref=https://tuclothing.tests.co.uk/c/Girls/Girls_Underwear_Socks&Tights?INITD=GNav-CW-GrlsUnderwear&title=Underwear,+Socks+&+Tights,8.12E+12,1.47E+15,1,0,940892,CONVERSION,POSTCLICK,,0,0,0
1.47E+15,CAESEGmwaNjTvIrQ3MoIvqiRC8U,2934701,,~oref=https://tuclothing.tests.co.uk/login/checkout,1.75E+12,1.47E+15,1,0,940892,CONVERSION,POSTCLICK,,,0,0,
1.47E+15,CAESEM3G-Nh6Q0OhboLyOhtmtiI,2934701,,~oref=https://3984747.fls.doubleclick.net/activityi;~oref=http%3A%2F%2Fwww.tests.co.uk%2Fshop%2Fgb%2Fgroceries%2Ffrozen-%2Fbeef--pork---lamb,3.74E+12,1.47E+15,1,0,940892,CONVERSION,POSTCLICK,,,0,0,
1.47E+15,CAESENlK7oc-ygl637Y2is3a90c,2934701,,~oref=https://tuclothing.tests.co.uk/,5.10E+12,1.47E+15,1,0,940892,CONVERSION,POSTCLICK,,,0,0,
It looks like, in this case, the only comma which you are having issues with is located in a URL. You could run your csv file through a preprocessor method which strips out commas in your URLs or URL encode them.
Personally, I would opt for the URL-encoding method, which will convert the comma to %2C; this way you don't have a comma in your URL when you start reading your CSV row values, yet the URL still retains its working link to the reference/destination page.
If you had this issue with other fields (not a URL), or in other unknown/random locations in the csv row, then the solution would not be easy at all. But since you know exactly where the issue is occurring each time, you could perform a static lookup for that character and replace if found in that particular field.
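Another option, since the row layout is fixed and only the Other Data column is free-form: split a fixed number of commas from the left and from the right, and keep whatever remains in the middle as the URL field. parse_row below is a hypothetical helper (the toy row is shortened; n_left and n_right would be the real column counts on either side of the URL field):

```python
def parse_row(row, n_left, n_right):
    """Split a CSV row where only one middle field may contain commas.

    n_left / n_right are the numbers of comma-separated fields before and
    after the free-form field (here, the column holding the ~oref= URL).
    """
    left = row.split(",", n_left)       # n_left clean fields + the rest
    rest = left.pop()
    right = rest.rsplit(",", n_right)   # n_right clean fields from the end
    middle = right.pop(0)               # whatever is left is the URL field
    return left + [middle] + right

# toy row: 2 fields, then a URL containing a comma, then 2 fields
row = "1.47E+15,abc,~oref=https://example.com/x?title=Underwear,+Socks,CONVERSION,0"
print(parse_row(row, 2, 2))
```

This only works because the ambiguous commas are confined to a single known column, which matches the situation in the question.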

Escaping characters for instance query matching in webpy

(The title may be in error here, but I believe that the problem is related to escaping characters)
I'm using webpy to create a VERY simple todo list using peewee with Sqlite to store simple, user submitted todo list items, such as "do my taxes" or "don't forget to interact with people", etc.
What I've noticed is that the DELETE request fails on certain inputs that contain specific symbols. For example, while I can add the following entries to my Sqlite database that contains all the user input, I cannot DELETE them:
what?
test#
test & test
This is a test?
Any other user input with any other symbols I'm able to DELETE with no issues. Here's the webpy error message I get in the browser when I try to DELETE the inputs list above:
<class 'peewee.UserInfoDoesNotExist'> at /del/test
Instance matching query does not exist: SQL: SELECT "t1"."id", "t1"."title" FROM "userinfo" AS t1 WHERE ("t1"."title" = ?) PARAMS: [u'test']
Python /Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/peewee.py in get, line 2598
Web POST http://0.0.0.0:7700/del/test
When I view the database file (called todoUserList.db) in sqlitebrowser, I can see that these entries do exist with the symbols, they're all there.
In my main webpy app script, I'm using a regex to search through the db to make a DELETE request, it looks like this:
urls = (
    '/', 'Index',
    '/del/(.*?)', 'Delete'
)
I've tried variations of the regex, such as '/del/(.*)', but still get the same error, so I don't think that's the problem.
Given the error message above, is webpy not "seeing" certain symbols in the user input because they're not being escaped properly?
I'm confused as to why it seems to happen only with the specific symbols listed above.
Depending on how the URL escaping is functioning, it could be an issue with how "?" and "&" are interpreted by the browser (in a typical GET-style request, & and ? are special characters used to separate query-string parameters).
Instead of passing those in as part of the URL itself, you should pass them in as an escaped query string. As far as I know, no web server is going to respect unescaped values like that as part of a URL path. If they are escaped and put in the query string (or POST body), you'll be fine, though.
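A minimal sketch of that with urllib.parse.quote (the titles are the problem inputs from the question; safe='' also encodes slashes, so the whole title stays a single path segment):

```python
import urllib.parse

# percent-encode the title before building the DELETE URL, so "?", "&"
# and "#" survive the trip instead of being treated as URL metacharacters
for title in ["what?", "test#", "test & test"]:
    path = "/del/" + urllib.parse.quote(title, safe="")
    print(path)
```

On the server side, webpy decodes the captured path segment back to the original title, so the database lookup sees the same string the user typed.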