Hi, I would like to scrape a page with Beautiful Soup. Normally the iframe src is a link to an HTML page, but this time it is a WordPress URL that is basically the folder path leading to a PHP file.
I was wondering if there is any way to scrape the table inside that file?
The table's DIV tags exist when I inspect the elements in Chrome; however, when I load the link with BeautifulSoup, the content within the iframe (the table) disappears.
Please help
When the content is loaded by JavaScript after the page arrives (PHP itself runs on the server, so its output reaches you as ordinary HTML), the Selenium library can be more useful for extracting the required data.
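Before reaching for Selenium, it may be worth checking whether the iframe's src can be fetched directly: an iframe is a separate document, and BeautifulSoup only sees the outer page it was given. The following is a minimal sketch using only the standard library; `IframeSrcParser`, `iframe_urls`, and the example path are hypothetical names made up for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class IframeSrcParser(HTMLParser):
    """Collect the src attribute of every <iframe> in a document."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == "iframe":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


def iframe_urls(outer_html, page_url):
    """Return absolute URLs for all iframes found in outer_html."""
    parser = IframeSrcParser()
    parser.feed(outer_html)
    # Relative paths are resolved against the URL of the outer page.
    return [urljoin(page_url, src) for src in parser.sources]


# Hypothetical outer page; the path below is invented for this example.
outer = '<div><iframe src="/wp-content/plugins/example/table.php?id=3"></iframe></div>'
urls = iframe_urls(outer, "https://example.com/some/page/")
print(urls)
```

Each resulting URL can then be requested on its own and the response handed to BeautifulSoup. If the PHP endpoint in turn builds the table with JavaScript, a browser driven by Selenium is probably still needed.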
I am scraping the links from this website https://www.firstmallorca.com/en/search, for each of the properties that appear on it, so I can further scrape them and collect more detailed data.
My problem is that the parsed HTML (I am using the html5lib parser) from which I scrape the data seems to differ in some areas from the HTML I see in the browser's DevTools. To demonstrate this:
1. This is the last link I select. In the browser, its href="/en/sales/penthouse-in-santa-ponsa/102512".
2. I print the parsed HTML from the Beautiful Soup object for the webpage with bs4Object.prettify() and copy the whole output into Notepad++.
3. Then, in Notepad++, I look for the same element as in step 1. I find it, and its href="/en/sales/finca-in-portocolom/159515", which is different from what I see on the actual webpage.
I do not understand what is happening. In step 3, I was expecting to see href="/en/sales/penthouse-in-santa-ponsa/102512" instead of href="/en/sales/finca-in-portocolom/159515".
It seems to me that I am scraping some other, similar webpage rather than the one I see in the browser.
The website loads content via JavaScript, which your parser does not execute.
This is a task for Selenium.
The selenium package is used to automate interaction with a web browser from Python.
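A minimal sketch of that idea follows, assuming Selenium 4 and a local Chrome installation. The helper names `rendered_hrefs` and `scrape_with_chrome` are hypothetical; the driver is passed in as a parameter so that the parsing step can be tried without a browser.

```python
from html.parser import HTMLParser


class HrefParser(HTMLParser):
    """Collect the href of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def rendered_hrefs(driver, url):
    """Load url in the browser and return the hrefs found in its current DOM."""
    driver.get(url)
    parser = HrefParser()
    # page_source is the DOM as the browser sees it at this moment,
    # which includes changes made by scripts that have already run.
    parser.feed(driver.page_source)
    return parser.hrefs


def scrape_with_chrome(url):
    """Assumes `pip install selenium` and a local Chrome; not called here."""
    from selenium import webdriver

    driver = webdriver.Chrome()
    try:
        return rendered_hrefs(driver, url)
    finally:
        driver.quit()
```

The same `driver.page_source` string could equally be passed to BeautifulSoup. If the listing is filled in some time after the initial load, an explicit wait before reading `page_source` is likely to be necessary.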
I've been trying to scrape the HTML tags, their contents, and the CSS styles applied to them from an external website. I am currently using a combination of BeautifulSoup and Selenium. The HTML tags and contents can be found by using
soup.find_all() or driver.find_elements_by_xpath()
However, for the CSS styles, I haven't figured out a way.
Sample Inspect Screenshot from Chrome
For instance, as you can see in the above screenshot (ignore the Korean), when I select
<div class="ab_sub_heading">
I want to be able to reach and scrape, at the same time, the styles shown on the right in the Inspect panel:
.ab_sub_heading { width: 80%; margin-top: 3.3333em; ...}.
Is there any fast way to do this on BeautifulSoup / Selenium?
I'm open to using other libraries or frameworks if needed as well. Thank you.
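One option that may be worth trying is Selenium's `WebElement.value_of_css_property`, which asks the browser for the value of a single CSS property on an element. The sketch below assumes Selenium 4; `computed_styles` and `styles_for_selector` are hypothetical helper names, not part of either library.

```python
def computed_styles(element, properties):
    """Return {property name: value} for a Selenium WebElement."""
    return {name: element.value_of_css_property(name) for name in properties}


def styles_for_selector(driver, css_selector, properties):
    """Look up one element by CSS selector and read its styles.

    Assumes `pip install selenium`; the import is kept local so the
    helper above can be used without Selenium installed.
    """
    from selenium.webdriver.common.by import By

    element = driver.find_element(By.CSS_SELECTOR, css_selector)
    return computed_styles(element, properties)
```

A call such as `styles_for_selector(driver, "div.ab_sub_heading", ["width", "margin-top"])` would then return a small dictionary. The values are the ones the browser computed, so they typically come back in resolved units (pixels) rather than as the authored `80%` or `3.3333em`, and the original rule text from the stylesheet is not recovered this way.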
I need to crawl a website: get some of its pages and store them with all of their CSS files and images, exactly like saving the pages in a browser.
I have tried Selenium, but with Selenium I can only save the HTML, not the full page, so it does not seem possible to do this with Selenium.
Can I do this using Scrapy?
If it is not possible with Scrapy, what else can I use?
Yes, you should be able to do this in Scrapy.
Inside the <head> tag of the HTML you should see URLs to JavaScript files in <script> tags, and <link> tags that give you the URLs of the CSS files.
Once you have a URL, it's a simple matter to make a request for it in Scrapy. The Scrapy tutorial shows this:
https://doc.scrapy.org/en/latest/intro/tutorial.html#a-shortcut-for-creating-requests
These URLs return the raw CSS or JavaScript, and you can either download each file separately or construct a new single HTML document.
One thing to note is that a <script> tag may contain the full JavaScript inline rather than a URL reference. In that case you already have the code when you get the HTML.
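The URL-collection step described above can be sketched with the standard library alone. `AssetParser` and the sample page are hypothetical; inside a Scrapy callback, each collected URL would presumably be passed to `response.follow`, the shortcut covered in the linked tutorial section.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class AssetParser(HTMLParser):
    """Collect absolute URLs of stylesheets, external scripts and images."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.assets = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "link" and "stylesheet" in (attrs.get("rel") or "").lower():
            ref = attrs.get("href")
        elif tag in ("script", "img"):
            # An inline <script> has no src attribute and is skipped here;
            # its code is already part of the HTML.
            ref = attrs.get("src")
        else:
            ref = None
        if ref:
            self.assets.append(urljoin(self.base_url, ref))


# Made-up page used only to exercise the parser.
page = """<html><head>
<link rel="stylesheet" href="/static/site.css">
<script src="js/app.js"></script>
<script>console.log("inline");</script>
</head><body><img src="logo.png"></body></html>"""

parser = AssetParser("https://example.com/shop/index.html")
parser.feed(page)
print(parser.assets)
```

This only covers assets referenced directly in the HTML. Files pulled in from inside a stylesheet (fonts, background images) would need a second pass over the downloaded CSS.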
I am using the BeautifulSoup Python package to scrape a table of data from a webpage. The table has many pages that can be clicked through. I was hoping I could extract each page of the table by running through adjusted URLs that identify the page, but this particular site updates the table in place using JavaScript, which changes the page source without giving each page its own URL.
Does anyone know a workaround? I am new to BeautifulSoup and do not know if there is a way to do this.
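One possible workaround, sketched under assumptions, is to drive a real browser with Selenium, click the table's "next" control, and read the page source after each click. `collect_pages` is a hypothetical helper, and the `a.next` selector in the usage note is invented; the real locator depends on the site.

```python
import time


def collect_pages(driver, next_locator, max_pages=50, pause=1.0):
    """Return the rendered HTML of each table page, clicking 'next' in between.

    next_locator is a Selenium locator tuple, e.g. (By.CSS_SELECTOR, "a.next").
    The driver is expected to have the first page loaded already.
    """
    pages = [driver.page_source]
    for _ in range(max_pages - 1):
        buttons = driver.find_elements(*next_locator)
        if not buttons or not buttons[0].is_enabled():
            break  # no usable 'next' control, so this was the last page
        buttons[0].click()
        time.sleep(pause)  # crude wait for the JavaScript update to finish
        pages.append(driver.page_source)
    return pages
```

With Selenium 4 this might be called as `collect_pages(driver, (By.CSS_SELECTOR, "a.next"))` after `driver.get(...)`, and each returned string parsed with BeautifulSoup as usual. The fixed `time.sleep` is a simplification; an explicit wait on some change in the table would be more robust, and `max_pages` is only a guard against sites whose "next" control never disappears.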
I have a link I want to scrape the content from that looks like this:
https://www.whatever.com/getDescModuleAjax.htm?productId=32663684002&t=1478698394335
But when I try to open it with Selenium, it won't work. When I load it in a normal browser, it opens as plain text with the HTML wrapped in quotes, like this:
window.productDescription='<div style="clea....
#I want this
....n.jpg" width="950"/></p></div>'";
I was thinking I would download the source code as plain text and extract the content I need using bs4. But this can't be the best solution. Is there a way to ignore the tags and load the web page normally using Selenium and Python?
If all the source code is inside a JS variable, such as
window.variable="<div>...</div>"
then you probably can't use bs4 on it directly, since bs4 works on pure HTML DOM nodes.
"Is there a way to ignore the tags and load the web page normally using selenium and python?"
Most likely Selenium can get the on-page JS executed and load the variable's content into the page's DOM. Try searching for where the window.productDescription or productDescription expression is used (in which of the loaded .js files).
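The plain-text route mentioned in the question can also be sketched: pull the quoted string out of the response with a regular expression, after which it is ordinary HTML that a parser such as bs4 can handle. This assumes the response really is a single-quoted string assigned to `window.productDescription`; `extract_js_string` and the sample text are hypothetical.

```python
import re


def extract_js_string(source, variable):
    """Return the single-quoted string assigned to window.<variable>, or None."""
    pattern = (
        r"window\." + re.escape(variable)
        + r"\s*=\s*'((?:\\.|[^'\\])*)'"  # body: escaped chars or anything but ' and \
    )
    match = re.search(pattern, source, re.DOTALL)
    if match is None:
        return None
    # Undo only the simplest JavaScript escapes (\' \/ \\); others are left as-is.
    return re.sub(r"\\(['/\\])", r"\1", match.group(1))


# Made-up response body in the same shape as the one in the question.
sample = "window.productDescription='<div style=\"clear:both\"><p>It\\'s a demo</p></div>';"
html = extract_js_string(sample, "productDescription")
print(html)
```

The resulting `html` string could then be passed to `BeautifulSoup(html, "html.parser")`. A regular expression is a shortcut rather than a real JavaScript parser, so unusual escapes or nested quoting in the live response may need extra handling.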