Parsing HTML code to find a link - python

I need to get a link that is buried in a page's HTML code (it does not show up on the rendered website). I've tried parsing the page with BeautifulSoup, but it only finds the links that appear on the webpage. Is there a way to parse the HTML source to find the link?

You can find everything that is in the source code of an HTML page; in your case the link probably looks like http://someurl. PHP can do the job for you. If you give me more details about the link, which website you are trying to find it on, and what you want to do with it, I might look into some of my own code for extracting URLs (links) from any given website.
I found a nice code example for you (it does more than extract URLs) that will hopefully help you on your way: http://www.web-max.ca/PHP/misc_23.php
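Since the question itself is about Python, here is a minimal sketch with requests and BeautifulSoup that prints every href found in the raw source, whether or not it is rendered on the page (the URL is a placeholder):

import requests
from bs4 import BeautifulSoup

html = requests.get("http://someurl").text          # placeholder URL
soup = BeautifulSoup(html, "html.parser")
for a in soup.find_all("a", href=True):             # every anchor carrying an href
    print(a["href"])

If the link lives inside a script block rather than an anchor tag, you would need to search the raw text instead, for example with a regular expression.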

Related

Beautiful Soup object's HTML does not match the HTML from the web browser

I am scraping the links for each of the properties that appear on https://www.firstmallorca.com/en/search, so that I can scrape those pages further and collect more detailed data.
My problem is that the parsed HTML (I am using the html5lib parser) from which I scrape the data seems to differ in some areas from the HTML I see in the browser's DevTools. To demonstrate this:
1. This is the last link I select. In the browser, its href="/en/sales/penthouse-in-santa-ponsa/102512". (screenshot)
2. I print the parsed HTML from the Beautiful Soup object with bs4Object.prettify() and copy the whole output into Notepad++.
3. In Notepad++ I look for the same element as in point 1. I find it, but its href="/en/sales/finca-in-portocolom/159515", which is different from what I see on the actual webpage. (screenshot)
I do not understand the nature of what's happening. In point 3, I was expecting to see href="/en/sales/penthouse-in-santa-ponsa/102512" instead of href="/en/sales/finca-in-portocolom/159515".
It seems to me like I am scraping some other, similar webpage, not the one I see through the browser.
The website loads its content via JavaScript, which your parser does not execute.
This is a task for Selenium: the selenium package is used to automate interaction with a web browser from Python.
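A minimal sketch of that approach, assuming a local ChromeDriver; the href prefix used in the selector is taken from the links quoted in the question:

from selenium import webdriver
from bs4 import BeautifulSoup

driver = webdriver.Chrome()                          # assumes ChromeDriver is installed
driver.get("https://www.firstmallorca.com/en/search")
soup = BeautifulSoup(driver.page_source, "html.parser")
# the property links in the question look like /en/sales/<slug>/<id>
for a in soup.select('a[href^="/en/sales/"]'):
    print(a["href"])
driver.quit()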

Beautiful Soup not picking up some data from the website

I have been trying to scrape some data from https://www.eia.gov/coal/markets/ with Beautiful Soup. However, when I parse the contents, some of the data does not show up at all. Those data fields are visible in the Chrome inspector but not in the soup, and they do not seem to be text elements; I think they are fed from an external database. I have attached the screenshots below. Is there any other way to scrape that data?
Thanks in advance.
Google inspector: (screenshot)
Beautiful Soup parsed content: (screenshot)
#DMart is correct. The data you are looking for is populated by JavaScript; have a look at line 1629 in the page source. Beautiful Soup doesn't act as a client browser, so there is nowhere for the JS to execute. It looks like Selenium is your best bet.
See this thread for more information.
There is not enough detail in your question, but this information is probably dynamically loaded and you're not fetching the entire page source.
Without your code it's tough to tell whether you're using selenium to fetch the page (you tagged the question as such); if so, you may be reading page_source, which does not guarantee you the entire completed source of the page you're looking at (see the sketch below).
If you're using requests, it's even more unlikely that you're capturing the entire page's completed source code.
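One common remedy, sketched here on the assumption that you are driving the page with selenium, is to wait explicitly for the element you need before reading page_source (the table id snl_dpst is taken from the accepted solution further down):

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get("https://www.eia.gov/coal/markets/")
# block for up to 10 seconds until the data table is present in the DOM
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.ID, "snl_dpst"))
)
html = driver.page_source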
The data is loaded via AJAX, so it is not available in the initial document. If you go to the Network tab in Chrome DevTools, you will see that the site reaches out to https://www.eia.gov/coal/markets/coal_markets_json.php. I searched for some of the numbers in the response, and it looks like the data you are looking for is there.
This is a direct JSON response from the backend. It's better than Selenium if you can get it to work.
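A minimal sketch of that idea with requests; the endpoint URL comes from the answer above, but the shape of the JSON is not documented here, so inspect the response before relying on any particular keys:

import requests

# the endpoint observed in the Network tab of Chrome DevTools
url = "https://www.eia.gov/coal/markets/coal_markets_json.php"
data = requests.get(url).json()
print(type(data))      # inspect the structure before picking out fields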
Thank you all!
Opening the page with Selenium and then parsing the page source with Beautiful Soup worked:

from selenium import webdriver
from bs4 import BeautifulSoup

driver = webdriver.Chrome()                      # any WebDriver will do
driver.get('https://www.eia.gov/coal/markets/')
html = driver.page_source                        # the source after the JS has run
soup = BeautifulSoup(html, 'html.parser')
table = soup.find('table', {'id': 'snl_dpst'})
rows = table.find_all('tr')

CSS selector for Instagram post alongside comments not working

In my example code below I have navigated to Obama's first Instagram post. I am trying to point to the portion of the page that contains his post and the comments beside it.
driver.get("https://www.instagram.com/p/B-Sj7CggmHt/")
element = driver.find_element_by_css_selector("div._97aPb")
I want this to work for any post by any Instagram user, but it seems that the XPath for the post alongside the comments changes. How can I find the post image + comments combined block regardless of which post it is? I would appreciate any help, thank you.
I would also like to be able to point to the image and to the comments individually. I have gone through multiple user profiles and multiple posts, but both the XPaths and the CSS selectors seem to change. I would also appreciate guidance on any reading or resources where I can learn how to properly point to different HTML elements.
You could try selecting based on the top-level structure. Looking more closely, there is always an article tag, and the photo is in the 4th div, right under the header.
You can do this with BeautifulSoup with something like this:
from bs4 import BeautifulSoup

soup = BeautifulSoup(driver.page_source, 'html.parser')   # parse the page selenium loaded
article = soup.find('article')
divs_in_article = article.find_all('div')
divs_in_article[3] should have the data you are looking for. If BeautifulSoup grabs divs under that first header tag, you may have to get creative and skip that tag first. I would test it myself, but I don't have ChromeDriver running right now.
Alternatively, you could try:
images = soup.find_all('img')
to grab all image tags in the page. This may work too.
BeautifulSoup has a lot of handy methods for getting at tags based on structure. Take a look at going back and forth, going sideways, going down and going up in its documentation. You should be able to discern the structure using the developer tools in your browser and then come up with a way to select the collections you care about for the comments.
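As a small illustration of that kind of structural navigation (the HTML snippet here is made up, not Instagram's actual markup):

from bs4 import BeautifulSoup

html = "<article><header>user</header><div><img src='p.jpg' alt='caption'></div></article>"
soup = BeautifulSoup(html, "html.parser")
img = soup.find("img")                            # going down to a tag
article = img.find_parent("article")              # going up to its enclosing article
header = img.parent.find_previous_sibling()       # going sideways to the header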

Get full link from page with Scrapy

I want to get torrent links from a page. In the Chrome source viewer I see that the link is:
href="browse.php?search=Brooklyn+Nine-Nine&page=1"
But when I scrape this link with Scrapy, I only get:
href="browse.php?page=1"
The "search=Brooklyn+Nine-Nine&" part is not in the link.
In the page's torrent search form I enter "Brooklyn Nine-Nine", and it shows all the search results.
So my question is: is this Chrome's automatic link formatting at work, and how can I get the link with Scrapy just as Chrome shows it?
I think I could fill in the missing part myself, for example by replacing spaces with plus signs in the search text, as sketched below.
Or maybe there is a more elegant solution...
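That replacement is exactly what URL encoding does; a minimal sketch using only the standard library (the search term comes from the question):

from urllib.parse import quote_plus

search = "Brooklyn Nine-Nine"
url = "browse.php?search=" + quote_plus(search) + "&page=1"
print(url)      # browse.php?search=Brooklyn+Nine-Nine&page=1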
It's all okay... I made a mistake in my script. My search text was empty, so the links were missing the search part as well.

Unable to find exact source code of my blog

I am working on a project where I parse the HTML of web pages. So I took my blog (a Blogger blog with a Dynamic Template) and tried to read its content. Unfortunately, I failed to get at the "actual" source of the blog's webpage.
Here is what I observed:
1. I clicked "view source" on a random article of my blog and tried to find the content in it, and I couldn't find any: it was all JavaScript.
2. So I saved the webpage to my laptop and checked the source again; this time I found the content.
3. I also checked the source using the developer tools in the browser, and again found the content in it.
4. Then I tried the Python way:
from urllib.request import urlopen      # the original used urllib.urlopen (Python 2)
from bs4 import BeautifulSoup

soup = BeautifulSoup(urlopen("my-webpage-address"), "html.parser")
print(soup.prettify())
I didn't find the content in that HTML code either.
Finally: why am I unable to find the content in the source code in cases 1 and 4?
How should I get the actual HTML code? I would be glad to hear of any Python library that would do the job.
The content is loaded via JavaScript (AJAX). It's not in the "source".
In step 2, you are saving the resulting page, not the original source. In step 3, you're seeing what's being rendered by the browser.
Steps 1 and 4 "don't work" because you're getting the page's source (which doesn't contain the content). You need to actually run the JavaScript, which isn't easy for a screen scraper to do.
