I'm parsing webpages with Selenium and beautifulsoup4,
and I have a problem parsing one specific webpage.
The HTML I get when parsing the page with Selenium or bs4 differs from what I see when I view the page source in the browser.
The difference is the presence of a form and an input.
When I parse that page, I get HTML containing
<form action="" method="post" name="fmove">
<input name="goAction" style="display:none" type="submit"/>
</form>
I can't find what to input or submit.
Please help me understand this problem.
Thanks!
I'm going to concentrate on '[finding] what to input or submit' without touching on wider questions. Even so, what I tell you is not guaranteed to yield answers if code associated with that page does not arrange to fill the form's action attribute and/or some of its input elements with name and value pairs.
First, open the page in the Chrome browser. Use the 'Inspect' item in the context menu on the element to find the JavaScript that finally submits that form. Put a breakpoint on the line in the code where this happens. Now reload the page (F5) and exercise the form. Execution should stop at the breakpoint. You should then be able to see the properties of the form element, including action and the name-value pairs, in the rightmost portion of the screen, where you can copy them for use in your own code.
PS: I really must mention that it's difficult to be sure of a lot of this without knowing what site you're scraping. Good luck!
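As a complement to the breakpoint approach, here is a minimal sketch that lists every form and its input name/value pairs from a page source string (for example, Selenium's `driver.page_source`). It uses only the standard library's `html.parser` rather than bs4, and the names `FormFieldCollector`, `list_forms` and `SAMPLE_HTML` are made up for illustration. It only reports what is in the HTML at the moment you capture it, so values that JavaScript fills in at submit time will not show up.

```python
from html.parser import HTMLParser

# The form from the question, used here as sample input.
SAMPLE_HTML = """<form action="" method="post" name="fmove">
<input name="goAction" style="display:none" type="submit"/>
</form>"""


class FormFieldCollector(HTMLParser):
    """Collect every <form> and the attributes of the <input> tags inside it."""

    def __init__(self):
        super().__init__()
        self.forms = []       # one dict per <form> found
        self._current = None  # the form currently being parsed, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form":
            self._current = {
                "name": attrs.get("name"),
                "action": attrs.get("action"),
                "method": attrs.get("method"),
                "inputs": [],
            }
            self.forms.append(self._current)
        elif tag == "input" and self._current is not None:
            self._current["inputs"].append({
                "name": attrs.get("name"),
                "type": attrs.get("type"),
                "value": attrs.get("value"),
            })

    def handle_endtag(self, tag):
        if tag == "form":
            self._current = None


def list_forms(html):
    """Return a list of dicts describing the forms found in `html`."""
    collector = FormFieldCollector()
    collector.feed(html)
    return collector.forms


if __name__ == "__main__":
    for form in list_forms(SAMPLE_HTML):
        print(form)
```

For the form in the question this shows an empty action and a single hidden submit input with no value, which suggests that whatever gets submitted is decided by script rather than by the markup.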
Related
I am automating a process using Selenium and Python. Right now, I am trying to click on a button in a webpage (sorry, I cannot share the link, since it requires credentials to log in), but my code cannot find this button element. I have tried every selector (by id, CSS selector, XPath, etc.) and done a lot of googling, but with no success.
Here is the source content from the web page:
<button onclick="javascript: switchTabs('public');" aria-selected="false" itemcount="-1" type="button" title="Public Reports" dontactassubmit="false" id="public" aria-label="" class="col-xs-12 text-left list-group-item tabbing_class active"> Public Reports </button>
I also added a sleep command before this to make sure the page is fully loaded, but it does not work.
Can anyone help me select this onclick button?
Please let me know if you need more info.
Edit: you can take a look at this picture to get more insight (https://ibb.co/cYXWkL0). The yellow arrow indicates the button I want to click on.
The element you are trying to click is inside an iframe. So, you need to switch the driver into the iframe before accessing elements inside it.
I can't give you a specific code solution since you didn't share a link to that page, or even the whole HTML block. Solutions for similar questions can be found with a Google search.
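Still, the general pattern looks roughly like the sketch below. It assumes `driver` is a Selenium WebDriver with the page already loaded; the function name `click_inside_iframe` and the iframe selector are hypothetical, since the question does not show the iframe's markup. The locator strategies are passed as the plain strings `"css selector"` and `"id"`, which are the values behind Selenium's `By.CSS_SELECTOR` and `By.ID`.

```python
def click_inside_iframe(driver, frame_css, element_id):
    """Switch into an iframe, click an element by id, then switch back.

    `frame_css` is a CSS selector for the <iframe> element itself
    (an assumption: the real page may need a more specific selector).
    """
    frame = driver.find_element("css selector", frame_css)
    driver.switch_to.frame(frame)
    try:
        driver.find_element("id", element_id).click()
    finally:
        # Always return to the top-level document, even if the click fails.
        driver.switch_to.default_content()
```

With the button from the question this would be called as `click_inside_iframe(driver, "iframe", "public")`, where `"iframe"` simply matches the first iframe on the page. If the page contains several iframes, or nested ones, you would need to pick the right one (and switch once per nesting level).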
I am trying to scrape:
https://www.lanebryant.com/
My crawler starts from a URL and then follows all the links found on that page. I have scraped another site, where my logic works by checking whether the URL contains the string "products" and then downloading the product's information. On this site there is no such string. How do I distinguish between a product page and a regular page? (All it requires is an if statement.) I hope my question is clear. For the record, here is a product page from this site:
https://www.lanebryant.com/faux-wrap-maxi-dress/prd-358414#color/0000081590
Something that might be helpful in this case is to go through several product pages (visually at first) and look for similarities in their HTML. If you're new to this, just go to the page, right-click, and choose "View Page Source" (this is the way to do it in Chrome). For the example page you gave, one element that is probably relevant is: <input type="submit"
class="cta-btn btn btn--full mar-add-to-bag asc-bag-action grid__item"
value="Add to Bag">, which corresponds to the "Add to Bag" button.
Then you might look into how to use BeautifulSoup to go through the HTML elements of the page and do your filtering based on this.
Hope that helps!
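To make the "if statement" concrete, here is a minimal sketch of that check. It assumes the `mar-add-to-bag` class from the button above appears in the downloaded HTML of product pages and nowhere else, which you would need to verify on several pages. It uses the standard library's `html.parser` instead of BeautifulSoup so that it runs without extra installs; the names `is_product_page` and `_AddToBagFinder` are made up for illustration.

```python
from html.parser import HTMLParser


class _AddToBagFinder(HTMLParser):
    """Set `found` when an <input> whose class list includes the marker appears."""

    def __init__(self, marker_class):
        super().__init__()
        self.marker_class = marker_class
        self.found = False

    def handle_starttag(self, tag, attrs):
        if tag == "input":
            classes = (dict(attrs).get("class") or "").split()
            if self.marker_class in classes:
                self.found = True


def is_product_page(html, marker_class="mar-add-to-bag"):
    """Return True if the page HTML contains the 'Add to Bag' style input."""
    finder = _AddToBagFinder(marker_class)
    finder.feed(html)
    return finder.found
```

In the crawler this becomes `if is_product_page(page_html): ...` in place of the old URL check. If the button turns out to be inserted by JavaScript, it will not be in the raw HTML and this check would need a rendered page instead.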
I am trying to automate some actions on a website. My script fills out a form and clicks post and then the website essentially asks if you are sure you want to post and you need to click the post button a second time. The problem is, while the old post button is no longer visible to the user, Selenium can only find the old post button and insists that the new post button does not exist.
Here is what the HTML looks like for the new post button.
<span id="__w2__w927tml1_submit_question_anon">
<a class="modal_cancel modal action" href="#"
id="__w2__w927tml1_add_original">Ask Original Question</a>
</span>
I have tried every different locator I can think of but it always locates the old post button. The closest I've got is locating the parent class shown above, but when trying to parse through its children it says there are none. I am at a loss for what to do here. Thanks for your help.
So I wrote code that reads and prints everything between specified text in HTML code; for example, it reads everything between paragraph tags, and this gets printed.
This was from sentdex lesson - here
There is no problem with the code itself, but rather with its output.
I filtered with very specific criteria
paragraphs = re.findall(r'<div style="font-size: 23px; margin-top: 20px;" class="jsdfx-sentiment-present">(.*?)</div>',str(respData))
So, as already mentioned, it works. The content is then printed, and it prints
&nbsp;
As I understand it, this is a non-breaking space in HTML. Instead of a space I expected to see numbers. On this website, the numbers in this location update every few seconds.
How can I get these numbers instead of receiving &nbsp;?
Regards!
It depends on how exactly you're downloading the page, and from where, but because you say the value changes constantly when looking at it in a web browser, I'd wager that when you download the page, that &nbsp; is exactly what's inside that div, and the page changes it on the fly via JavaScript or something similar while you're actually viewing the page. Your tutorial uses a static tag, one that's the same every time you load the page, rather than one that gets dynamically set after the page is already active.
It's fairly common to do this in web development for dynamic values: put a placeholder value in a div, and then dynamically edit the content as appropriate. Of course, if you just take a snapshot of the page (and even more so if you take that snapshot before the JavaScript code that would have filled in that value has had a chance to run), you're not going to see the change, and you get only the default value, without the number being filled in.
Based on the tutorial you linked, you're probably using urllib. If you want to get dynamic content from an HTML page, that's probably not the best tool to use; you should look into Selenium and BeautifulSoup. This StackOverflow Answer goes into a lot more detail on effective solutions to this problem.
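As a rough illustration of the Selenium route, the sketch below polls an element until its text is no longer blank. It assumes `driver` is a Selenium WebDriver that has already loaded the page with `driver.get(...)`, and that the div can be selected by the class from your regex (`div.jsdfx-sentiment-present`). The function name `wait_for_filled_text` is made up for this example; Selenium also ships its own `WebDriverWait` helper, which you may prefer.

```python
import time


def wait_for_filled_text(driver, css_selector, timeout=10.0, poll=0.5):
    """Return the element's text once it is no longer blank, or None on timeout.

    The locator strategy "css selector" is the string value behind
    Selenium's By.CSS_SELECTOR.
    """
    deadline = time.monotonic() + timeout
    while True:
        text = driver.find_element("css selector", css_selector).text
        # str.strip() also removes a non-breaking space (U+00A0) placeholder.
        if text.strip():
            return text.strip()
        if time.monotonic() >= deadline:
            return None
        time.sleep(poll)
```

A call such as `wait_for_filled_text(driver, "div.jsdfx-sentiment-present")` would then give you the number once the page's script has written it, assuming the script runs in the automated browser the same way it does in yours.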
I'm fairly new to Python, and this is my first post to stackoverflow, and as a starting project I'm trying to write a program that will gather the prices of board games from different websites that sell them. As part of this I'm trying to write a function that will use a website's built-in search function to find the webpage I want for a game that I input.
The code I'm using so far is:
import requests
body = {'keywords':'galaxy trucker'}
con = requests.post('http://www.thirstymeeples.co.uk/', data=body)
print(con.content)
My problem is that the webpage it returns is not the webpage I get when I manually input and search for 'galaxy trucker' on the website itself.
The html for the search form in question is
<form method="post" action="http://www.thirstymeeples.co.uk/">
<input type="search" name="keywords" id="keywords" class="searchinput" value>
</form>
I have read this, but the difference seems to me to be that in that case the search actually appears on the webpage, whereas in mine the web address given in the action attribute does not itself display a search bar. Also, in that example there is no id in the HTML, whereas in mine there is. Does this make a difference?
There is no search form on the index page, but if you do a "manual" search from the "games" page (which does have a form), you end up on a page with this URL:
http://www.thirstymeeples.co.uk/games/search/q?collection=search_games|search_designers|search_publishers&loose_ends=right&search_mode=all&keywords=galaxy+trucker
Notice that this page does take GET params, and that if you change the keywords from "galaxy+trucker" to anything else you get an updated result page. In other words, you want to send a GET request to http://www.thirstymeeples.co.uk/games/search/q:
r = requests.get("http://www.thirstymeeples.co.uk/games/search/q", params={"keywords": "galaxy trucker"})
print(r.content)
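To see how the params dict relates to the URL in the browser, here is a small standard-library sketch that builds the same kind of query string by hand. The helper name `build_search_url` is made up for illustration, and it sends only `keywords`, assuming (as the request above does) that the other parameters in the browser URL are optional.

```python
from urllib.parse import urlencode

SEARCH_URL = "http://www.thirstymeeples.co.uk/games/search/q"


def build_search_url(keywords):
    """Return the search URL with `keywords` encoded as a GET parameter."""
    # urlencode turns spaces into "+", matching the URL seen in the browser.
    return SEARCH_URL + "?" + urlencode({"keywords": keywords})


if __name__ == "__main__":
    print(build_search_url("galaxy trucker"))
```

This is essentially the encoding that requests does for you when you pass `params=`, so in practice you can keep using the dict and let the library build the URL.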