I'm new to python and trying to figure this out, so sorry if this has been asked. I couldn't find it and don't know what this may be called.
So, the short of it: I want to take a link like:
http://www.somedomainhere.com/embed-somekeyhere-650x370.html
and turn it into this:
http://www.somedomainhere.com/somekeyhere
The long of it: I have been working on an addon for XBMC that goes to a website, grabs a URL, then goes to that URL to find another URL. Basically, a URL resolver.
So the program searches the site and comes up with somekeyhere-650x370.html. But that page is rendered with JavaScript and is unusable to me, whereas when I go to com/somekeyhere the code is usable. So I need to grab the first URL, change it to the usable page, and then scrape that page.
So far the code I have is
if 'somename' in name:
    try:
        n = re.compile('<iframe title="somename" type="text/html" frameborder="0" scrolling="no" width=".+?" height=".+?" src="(.+?)"', re.DOTALL).findall(net().http_GET(url).content)[0]
        # TODO: CONVERT URL to .com/somekeyhere SO THE LINE BELOW CAN READ IT
        na = re.compile("'file=(.+?)&.+?'", re.DOTALL).findall(net().http_GET(n).content)[0]
Any suggestions on how I can accomplish converting the url?
I really didn't get the long of your question.
However, answering the short
Assumptions:
somekey is alphanumeric
import re

a = 'http://www.domain.com/embed-somekey-650x370.html'
p = re.match(r'^http://www\.domain\.com/embed-(?P<key>[0-9A-Za-z]+)-650x370\.html$', a)
somekey = p.group('key')
requiredString = "http://www.domain.com/" + somekey  #comment1
I have really provided a very specific answer here for just the domain name.
You should modify the regex as required. I see your code in the question uses regex, and hence I assume you can frame a regex to match your requirement better.
EDIT 1: also see urlparse, documented here:
https://docs.python.org/2/library/urlparse.html?highlight=urlparse#module-urlparse
It provides an easy way to parse your URL.
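For example, a minimal sketch (in Python 3 the module lives at urllib.parse; the embed-/-650x370.html wrappers are taken from the question):

from urllib.parse import urlparse  # Python 2: from urlparse import urlparse

a = 'http://www.domain.com/embed-somekey-650x370.html'
parts = urlparse(a)

# parts.path is '/embed-somekey-650x370.html'; strip the wrappers to get the key
key = parts.path.rsplit('/', 1)[-1].replace('embed-', '').replace('-650x370.html', '')
requiredString = '%s://%s/%s' % (parts.scheme, parts.netloc, key)
print(requiredString)  # http://www.domain.com/somekey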
Also, in line with "#comment1" you can actually save the domain name to a variable and reuse it here
Related
Is it possible to download a video with controlslist="nodownload", and if so, how? There is a poster attribute and a src attribute with URLs, but when I tried to open them it only said "Bad URL hash".
the whole thing looks like this: <video controlslist="nodownload" loop="" poster="https://scontent-fra3-1.xx.fbcdn.net/v/t39.35426-6/306972082_627455715702176_816739884095398058_n.jpg?_nc_cat=106&ccb=1-7&_nc_sid=cf96c8&_nc_ohc=gGkqXkxok9sAX-lD2Df&_nc_ht=scontent-fra3-1.xx&oh=00_AfAwpoDJdXRX_30nbuDBub38X9EcpUWJnI4yRPZ2PI1WUA&oe=63D59017" src="https://video-fra3-1.xx.fbcdn.net/v/t42.1790-2/307608880_755324925546967_7278828413698270618_n.?_nc_cat=106&ccb=1-7&_nc_sid=cf96c8&_nc_ohc=2s1fEESN7NUAX_El4Hb&_nc_oc=AQmgHTnQH8pGCmL7kHQnvHKKzkDFJc-6kTQazbteeA1cA21gUhBHplAVKoQmgAfQa2n1lKhOdkZlAXTbObUQycEp&_nc_ht=video-fra3-1.xx&oh=00_AfAOYW5RS8PEp52dlofPE3OtHjgd2SM0dxvk-dhnBIK8BQ&oe=63D18E33" width="100%" height="175"></video>
any help is appreciated <3
You can fix the "Bad URL hash" by replacing every &amp; in the copied URL with a plain & (the attribute value is HTML-escaped). Use string functions in your programming language to do this.
Fixing that issue brings up a new error about an expired/invalid signature, because the URL changes slightly after some hours. The solution is to always extract the latest version of the link first, and then attempt the file save. If you're lucky, you won't need a step three.
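In Python, for example, the standard library can do the replacement for you (a minimal sketch; the src value is shortened with "..." and assumes the copied attribute still contains HTML-escaped &amp; entities):

import html

# src value copied out of the <video> tag, with &amp; still escaped
copied_src = "https://video-fra3-1.xx.fbcdn.net/v/...?_nc_cat=106&amp;ccb=1-7&amp;oe=63D18E33"
real_url = html.unescape(copied_src)  # decodes &amp; back to &
print(real_url)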
Web-scraping adjacent question about URLs acting whacky.
If I go to the Glassdoor job search and enter six fields (Austin, "engineering manager", fulltime, exact city, etc.), I get a results page with 38 results. This is the link I get; ideally I'd like to save this link with its search criteria and reference it later.
https://www.glassdoor.com/Job/jobs.htm?sc.generalKeyword=%22engineering+manager%22&sc.locationSeoString=austin&locId=1139761&locT=C?jobType=fulltime&fromAge=30&radius=0&minRating=4.00
However, if I copy that exact link and paste it into a new tab, it doesn't act as desired.
It redirects to the different link below, maintaining some of the criteria but losing the location criteria, bringing up thousands of results from around the country instead of just Austin.
https://www.glassdoor.com/Job/jobs.htm?sc.generalKeyword=%22engineering+manager%22&fromAge=30&radius=0&minRating=4.0
I understand I could use selenium to select all 6 fields, I'd just like to understand what's going on here and know if there is a solution involving just using a URL.
The change of URL seems to happen on the server that is handling the request. I would think the server-side endpoint is configured to trim out extra parameters and redirect you to another URL. There's nothing you can do about this: however you pass it, it will always resolve into the second URL format.
I have also tried a URL shortener, but the same behavior persists.
The only way around this is to use automation such as Selenium to select the fields and display the results from the first URL.
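If you want to see exactly which parameters the server keeps, here is a minimal sketch with requests (the URL is the one from the question; Glassdoor may additionally require browser-like headers):

import requests
from urllib.parse import urlparse, parse_qs

url = "https://www.glassdoor.com/Job/jobs.htm?sc.generalKeyword=%22engineering+manager%22&sc.locationSeoString=austin&locId=1139761&locT=C?jobType=fulltime&fromAge=30&radius=0&minRating=4.00"
resp = requests.get(url, allow_redirects=True)

# Compare the query parameters you sent with the ones that survived the redirect
print(parse_qs(urlparse(url).query))
print(parse_qs(urlparse(resp.url).query))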
I'm making a system, mostly in Python with Scrapy, in which I can, basically, find information about a specific product. The thing is that the request URL is massive: I got a clue that I should replace some parts of it with variables to reach the specific product I want to search for, but the URL has so many fields that I don't know, for sure, how to build it.
e.g: "https://www.amazon.com.br/s?k=demi+lovato+365+dias+do+ano&adgrpid=86887777368&hvadid=392971063429&hvdev=c&hvlocphy=9047761&hvnetw=g&hvpos=1t1&hvqmt=e&hvrand=11390662277799676774&hvtargid=kwd-597187395757&hydadcr=5658_10696978&tag=hydrbrgk-20&ref=pd_sl_21pelgocuh_e%2Frobot.txt"
"demi+lovato+365+dias+do+ano" is the book title, but I can see a lot of information in the URL that I simply can't supply, and of course it changes from title to title. One solution I thought could be possible was to POST the title I'm looking for to the search bar and find it on the results page, but I don't know if that's the best approach, since this is in fact the first time I'll be working with web scraping.
Does someone have a tip for how I can do that? All I could find was how to scrape all products for price comparison, scrape specific information about all those products, and things like that, but nothing about searching for specific products.
Thanks for any contributions; this is very important to me, and sorry about anything, I'm not a very active user and not a native English speaker.
Feel free to give me any advice about user behavior; getting better is always something I aim for.
You should use the Rule class available in the Scrapy framework. It will help you define how to navigate the site and its sub-sites. Additionally, you can configure tags other than anchor tags, such as span or div, to look for the URL of the link. This way, the additional query params in the link will be populated by the Scrapy session as it emulates clicks on the hyperlinks. If you skip the additional query params in the URL, there is a high chance that you will be blocked.
How does scrapy use rules?
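A minimal sketch of what that looks like (the link patterns and CSS selectors here are assumptions for illustration, not Amazon's real markup):

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class ProductSpider(CrawlSpider):
    name = "products"
    allowed_domains = ["amazon.com.br"]
    start_urls = ["https://www.amazon.com.br/s?k=demi+lovato+365+dias+do+ano"]

    rules = (
        # Follow pagination links so every results page is visited
        Rule(LinkExtractor(restrict_css=".s-pagination-container"), follow=True),
        # Product detail pages live under /dp/<ASIN>; parse those
        Rule(LinkExtractor(allow=r"/dp/"), callback="parse_product"),
    )

    def parse_product(self, response):
        yield {
            "title": response.css("#productTitle::text").get(default="").strip(),
            "url": response.url,
        }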
You don't need to follow that long link at all; often the extra parameters are associated with your current session or settings/filters, and you can keep only what you need.
Here is what I meant:
You can generate the same result using these two URLs:
https://www.amazon.com.br/s?k=demi+lovato+365+dias+do+ano
https://www.amazon.com.br/s?k=demi+lovato+365+dias+do+ano&adgrpid=86887777368&hvadid=392971063429&hvdev=c&hvlocphy=9047761&hvnetw=g&hvpos=1t1&hvqmt=e&hvrand=11390662277799676774&hvtargid=kwd-597187395757&hydadcr=5658_10696978&tag=hydrbrgk-20&ref=pd_sl_21pelgocuh_e%2Frobot.txt
If both links generate the same results, then that's it. Otherwise you will definitely have to play with the different parameters; you can't predict website behavior without actually doing the test. If having a lot of parameters is an issue, then try something like:
from urllib.parse import quote_plus

base_url = "https://www.amazon.com.br"
title = "demi lovato 365 dias do ano"
link = base_url + "/s?k=%s&adgrpid=%s&hvadid=%s" % (quote_plus(title), '86887777368', '392971063429')
I am trying to read review data from the alexaskillstore.com website using BeautifulSoup. For this, I am specifying the target URL as https://www.alexaskillstore.com/Business-Leadership-Series/B078LNGS5T, where the string after Business-Leadership-Series/ keeps changing for the different skills.
I want to know how I can input a regular expression or similar code to my input URL so that I am able to read every link that starts with https://www.alexaskillstore.com/Business-Leadership-Series/.
You can't. The web is client-server based, so unless the server is kind enough to map the content for you, you have no way to know which URLs will be responsive and which won't.
You may be able to scrape some index page(s) to find the keys (B078LNGS5T and the like) you need. Once you have them all, actually generating the URLs is a simple matter of string substitution.
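For example, a rough sketch of that two-step approach (it assumes the series page itself links to the individual skills; the index URL and selector are guesses):

import re
import requests
from bs4 import BeautifulSoup

BASE = "https://www.alexaskillstore.com"
index_html = requests.get(BASE + "/Business-Leadership-Series").text  # hypothetical index page
soup = BeautifulSoup(index_html, "html.parser")

# Collect every link pointing into the series and pull out the trailing key
keys = {a["href"].rstrip("/").rsplit("/", 1)[-1]
        for a in soup.find_all("a", href=re.compile(r"/Business-Leadership-Series/."))}

urls = ["%s/Business-Leadership-Series/%s" % (BASE, key) for key in keys]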
I am having some unknown trouble when using XPath to retrieve text from an HTML page with the lxml library.
The page url is www.mangapanda.com/one-piece/1/1
I want to extract the selected chapter name from the drop-down select tag. For now I just want the first option, so the XPath to find it is pretty easy:
.//*[@id='chapterMenu']/option[1]/text()
I verified the above using FirePath and it gives the correct data, but when I try to use lxml for the purpose, I get no data at all.
from lxml import html
import requests
r = requests.get("http://www.mangapanda.com/one-piece/1/1")
page = html.fromstring(r.text)
name = page.xpath(".//*[@id='chapterMenu']/option[1]/text()")
But nothing is stored in name. I even tried other XPaths, like:
//div/select[@id='chapterMenu']/option[1]/text()
//select[@id='chapterMenu']/option[1]/text()
The above were also verified using FirePath. I am unable to figure out what the problem could be; I would appreciate some assistance with this.
But it is not that nothing works. An XPath that does work with lxml here is:
.//img[@id='img']/@src
Thank you.
I've had a look at the HTML source of that page, and the content of the element with the id chapterMenu is empty.
I think your problem is that it is filled using JavaScript, and JavaScript will not be automatically evaluated just by reading the HTML with lxml.html.
You might want to have a look at this:
Evaluate javascript on a local html file (without browser)
Maybe you're able to trick it, though... In the end, JavaScript also needs to fetch the information using a GET request. In this case it requests: http://www.mangapanda.com/actions/selector/?id=103&which=191919
That response is JSON and can easily be turned into a Python dict/list using the json library.
But you have to find out how to get the id and the which parameter if you want to automate this.
The id is part of the HTML: look for document['mangaid'] within one of the script tags. The which parameter can maybe stay 191919, or be 0; I couldn't find it in any source, but when it is 0 you will be redirected to the proper URL.
So there you go ;)
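Putting that together, a rough sketch (the mangaid regex and the shape of the JSON are assumptions based on the notes above):

import re
import requests

base = "http://www.mangapanda.com"
page = requests.get(base + "/one-piece/1/1").text

# The page embeds the id in a script tag, e.g. document['mangaid'] = 103;
manga_id = re.search(r"document\['mangaid'\]\s*=\s*(\d+)", page).group(1)

resp = requests.get(base + "/actions/selector/", params={"id": manga_id, "which": "191919"})
chapters = resp.json()  # the chapter list, per the answer above
print(chapters[:3])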
The source document of the page you are requesting is in a default namespace:
<html xmlns="http://www.w3.org/1999/xhtml">
even if FirePath does not tell you about this. The proper way to deal with namespaces is to declare them in your code, which means associating them with a prefix and then prefixing element names in XPath expressions.
name = page.xpath("//*[@id='chapterMenu']/xhtml:option[1]/text()",
                  namespaces={'xhtml': 'http://www.w3.org/1999/xhtml'})
Then, the piece of the document the path expression above is concerned with is:
<select id="chapterMenu" name="chapterMenu"></select>
As you can see, there is no option element inside it. Please tell us what exactly you'd like to find.
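For illustration, here is a self-contained sketch of namespace-aware XPath with lxml.etree (the XHTML snippet is made up, since the real select element is empty):

from lxml import etree

doc = etree.fromstring(
    '<html xmlns="http://www.w3.org/1999/xhtml">'
    '<body><select id="chapterMenu"><option>Chapter 1</option></select></body></html>'
)

# Element names live in the XHTML namespace, so bind a prefix for XPath
ns = {'xhtml': 'http://www.w3.org/1999/xhtml'}
print(doc.xpath("//xhtml:select[@id='chapterMenu']/xhtml:option[1]/text()", namespaces=ns))
# ['Chapter 1']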