Get substring from a string with specific pattern - python

I am working with Twitter links and I have extracted the links from Twitter's API.
https://twitter.com/shiromiru36/status/1302713597521403904/photo/1
The above is an example of the links I have. I would like to keep only the front part of the link by deleting
/photo/1
So that I can get
https://twitter.com/shiromiru36/status/1302713597521403904
Currently I extract the link by counting the number of characters that I don't want:
url = 'https://twitter.com/shiromiru36/status/1302713597521403904/photo/1'
url[:-8]
I would like to ask if there is any way to extract the link by finding a pattern instead. Since the unwanted part comes after the second-to-last "/", I am thinking I could find that "/" first and then delete everything after it.
Thank you.

You could do something like this:
'/'.join(s.split('/')[:-2])
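As a runnable sketch (the variable name url is assumed; the answer above calls it s):

```python
url = 'https://twitter.com/shiromiru36/status/1302713597521403904/photo/1'

# Split on every '/', drop the last two segments ('photo' and '1'),
# and join the rest back together.
trimmed = '/'.join(url.split('/')[:-2])
print(trimmed)  # https://twitter.com/shiromiru36/status/1302713597521403904
```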

Try this:
url = 'https://twitter.com/shiromiru36/status/1302713597521403904/photo/1'
# Reverse the string, split off the last two segments, then reverse back.
url = (((url[::-1]).split('/', 2))[-1])[::-1]
print(url)  # https://twitter.com/shiromiru36/status/1302713597521403904
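An equivalent, arguably more direct one-liner (not from the answer above) uses str.rsplit, which splits from the right and avoids the double reversal:

```python
url = 'https://twitter.com/shiromiru36/status/1302713597521403904/photo/1'

# rsplit with maxsplit=2 peels off at most two segments from the right;
# element 0 is everything before the second-to-last '/'.
trimmed = url.rsplit('/', 2)[0]
print(trimmed)  # https://twitter.com/shiromiru36/status/1302713597521403904
```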

Using regex to find something in the middle of a href while looping

For "extra credit" in a beginners class in Python that I am taking I wanted to extract data out of a URL using regex. I know that there are other ways I could probably do this, but my regex desperately needs work so...
Given a URL to start at, find the xth occurrence of a href on the page, and use that link to go down a level. Rinse and repeat until I have found the required link on the page at the requested depth on the site.
I am using Python 3.7 and Beautiful Soup 4.
At the beginning of the program, after all of the house-keeping is done, I have:
starting_url = 'http://blah_blah_blah_by_Joe.html'
extracted_name = re.findall('(?<=by_)([a-zA-Z0-9]+)[^.html]*', starting_url)
selected_names.append(extracted_name)
# Just for testing purposes
print(selected_names)  # [['Joe']]
Hmm, a bit odd; I didn't expect a nested list, but I know how to flatten a list, so OK. Let's go on.
I work my way through a couple of loops, opening each url for the next level down by using:
html = urllib.request.urlopen(url, context=ctx).read()
soup = BeautifulSoup(html, 'html.parser')
tags = soup('a')
Continue processing and, in the loop where the program should have found the href I want:
# Testing to check I have found the correct href
print(desired_link)  # <a href="http://blah_blah_blah_by_Mary.html">blah blah</a>
type(desired_link)   # bs4.element.Tag
Correct link, but a "type" new to me and not something I can use re.findall on. So more research and I have found:
for link in soup.find_all('a'):
    tags = link.get('href')

type(tags)  # str
print(tags)
# http://blah_blah_blah_by_George.html
# http://blah_blah_blah_by_Bill.html
# http://blah_blah_blah_by_Mary.html
# etc.
Right type, but when I look at what was printed, I think what I am looking at is maybe just one long string? And I need a way to assign just the third href to a variable that I can use in re.findall('regex expression', desired_link).
Time to ask for help, I think.
And, while we are at it, any ideas about why I get the nested list the first time I used re.findall with the regex?
Please let me know how to improve this question so it is clearer what I've done and what I'm looking for (I KNOW you guys will, without me even asking).
You've printed every link on the page, but on each pass through the loop tags contains only one of them, as a plain string (you can print(len(tags)) to validate this easily).
Also, I suggest replacing [a-zA-Z0-9]+ with \w+: it matches letters, digits and underscores, and is much cleaner.
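A sketch of how the hrefs could be collected into a list so the third one can be picked out and searched (the URLs are the placeholders from the question, standing in for link.get('href') results). As for the nested list: re.findall already returns a list (['Joe']), so appending it to selected_names produces [['Joe']].

```python
import re

# Placeholder hrefs standing in for link.get('href') values from the loop.
hrefs = [
    'http://blah_blah_blah_by_George.html',
    'http://blah_blah_blah_by_Bill.html',
    'http://blah_blah_blah_by_Mary.html',
]
# In the real loop this would be: hrefs.append(link.get('href'))

desired_link = hrefs[2]  # the third href (lists are zero-indexed)

# findall returns a list of the captured groups, so take element 0.
name = re.findall(r'(?<=by_)(\w+)', desired_link)[0]
print(name)  # Mary
```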

How can I extract the title from a URL in Python without using any external module?

I'd like to create a variable whose data is the title extracted from a URL, without using any external module.
I'm new to Python, so if you can, please explain what each part of the code does.
Thanks.
PD: I'm using Python 3.
PD2: I mean the title tag of its HTML.
Let html be an HTML string (say, the HTML source of this particular page). You can find the opening and closing tags with str.find(). The string is converted to the lower case to allow case-insensitive search.
start = html.lower().find('<title>') + len('<title>')
end = html.lower().find('</title>')
You can then extract the part of the HTML string between the tags:
html[start:end]
#'How can I extract the title from a URL in Python without using any...'
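As a self-contained sketch (a hard-coded HTML string stands in for the page source, which in practice would come from urllib.request):

```python
# A small HTML string standing in for the downloaded page source.
html = '<html><head><TITLE>My page title</TITLE></head><body>...</body></html>'

lower = html.lower()  # lower-cased copy for case-insensitive search
start = lower.find('<title>') + len('<title>')
end = lower.find('</title>')

# Slice the original string so the title keeps its capitalization.
title = html[start:end]
print(title)  # My page title
```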
Assuming that by "title" you mean the title of a resource: take a URL like https://www.foo.com/bar/baz/resource.jpg. You need to split it into a list on the "/" characters, then take the last item in that list. The code
url = "https://www.foo.com/bar/baz/resource.jpg"
print(url.split('/')[-1])
gives the output
resource.jpg

how to make django url ignore query string in urls.py

I'm working on a website where the user chooses from a list of similar elements on a list page, and I want it to open a separate page based on the element they choose in the list.
I have the list set up, and each element in the list is identifiable by a random string of 15 characters containing [0-9], [a-z] and [A-Z].
Example of a URL for an element of the list: http://127.0.0.1:8000/view?s=fkiscl49gtisceg
where s is the identifier (kind of like how YouTube videos have a separate link).
However, I can't figure out how to make Django ignore the ?s=fkiscl49gtisceg part of the URL. I've written the path like this now:
path('view/(?P<s>[\w]{15})', element_display, name='s'),
Django however tells me that the page was not found... How do I fix this?
The principle is simply that Django does not take the query string into account when matching URLs. Therefore, you should not have it in your pattern: the pattern should just be path('view/', ...), and the value of s is then read from the query string in the view.
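A sketch of how this could look. The view name element_display comes from the question; the query-string handling itself is illustrated with the standard-library urllib.parse, since Django hands the same data to the view as request.GET.

```python
# urls.py (sketch):
#     path('view/', element_display, name='s')
# Inside element_display, the identifier is read from the query string:
#     s = request.GET.get('s')

# The equivalent parsing, shown with the standard library:
from urllib.parse import parse_qs, urlsplit

url = 'http://127.0.0.1:8000/view?s=fkiscl49gtisceg'
query = parse_qs(urlsplit(url).query)
print(query['s'][0])  # fkiscl49gtisceg
```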

Finding urls containing a specific string

I haven't used RegEx before, and everyone seems to agree that it's bad for web scraping and HTML in particular, but I'm not really sure how to solve my little challenge without it.
I have a small Python scraper that opens 24 different webpages. In each webpage there are links to other webpages. I want to make a simple solution that gets the links I need, and even though the webpages are somewhat similar, the links that I want are not.
The only common thing between the URLs seems to be a specific string: 'uge' or 'Uge' (uge means week in Danish, and the week number changes every week, duh). It's not like the URLs have a common ID or something like that I could use to target the correct ones each time.
I figure it would be possible using RegEx to go through the webpage and find all URLs that have 'uge' or 'Uge' in them, and then open them. But is there a way to do that using BS? And if I do it using RegEx, what would a possible solution look like?
For example, here are two of the urls I want to grab in different webpages:
http://www.domstol.dk/KobenhavnsByret/retslister/Pages/Uge45-Tvangsauktioner.aspx
http://www.domstol.dk/esbjerg/retslister/Pages/Straffesageruge32.aspx
This should work... The RegEx uge\d\d? tells it to find "uge" followed by a digit, and possibly another one.
import re

for item in listofurls:
    matches = re.findall(r"uge\d\d?", item, re.IGNORECASE)
    if matches:
        print(item)  # just do whatever you want to do when it finds it
Yes, you can do this with BeautifulSoup.
import re
from bs4 import BeautifulSoup

soup = BeautifulSoup(html_string, 'html.parser')
# To find just 'Uge##' or 'uge##', as specified in the question:
urls = [el["href"] for el in soup.find_all("a", href=re.compile(r"[Uu]ge\d+"))]
# To find without regard to case at all:
urls = [el["href"] for el in soup.find_all("a", href=re.compile(r"(?i)uge\d+"))]
Or just use a simple for loop:
list_of_urls = ["""LIST GOES HERE"""]
for url in list_of_urls:
    if 'uge' in url.lower():
        # Code to execute
        pass
The regex expression would look something like uge\d\d.

Parsing html for domain links

I have a script that parses an HTML page for all the links within it. I am getting all of them fine, but I have a list of domains I want to compare them against. So a sample list contains
list=['www.domain.com', 'sub.domain.com']
But I may have a list of links that look like
http://domain.com
http://sub.domain.com/some/other/page
I can strip off the http:// just fine, but the two example links I just posted should both match: the first against www.domain.com, and the second against the subdomain in the list.
Right now I am using urllib2 for parsing the HTML. What are my options in completing this task?
You might consider stripping 'www.' from the list and doing something as simple as:
url = 'domain.com/'
for domain in list:
    if url.startswith(domain):
        ... do something ...
Or trying both won't hurt either, I suppose:
url = 'domain.com/'
for domain in list:
    domain_minus_www = domain
    if domain_minus_www.startswith('www.'):
        domain_minus_www = domain_minus_www[4:]
    if url.startswith(domain) or url.startswith(domain_minus_www):
        ... do something ...
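An alternative sketch that compares hostnames instead of string prefixes, using the standard-library urllib.parse (the domain list and links are the examples from the question):

```python
from urllib.parse import urlsplit

domains = ['www.domain.com', 'sub.domain.com']
links = ['http://domain.com', 'http://sub.domain.com/some/other/page']

def normalize(host):
    # Treat 'www.domain.com' and 'domain.com' as the same host.
    return host[4:] if host.startswith('www.') else host

wanted = {normalize(d) for d in domains}

# urlsplit(link).netloc is the hostname part, e.g. 'sub.domain.com'.
matches = [link for link in links if normalize(urlsplit(link).netloc) in wanted]
print(matches)  # both example links match
```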
