I want to scrape all articles by this author from this website with Python (the requests or Selenium libraries) and put them in a PDF file.
However, the "Show More" button at the bottom stops displaying more articles after 8 clicks, so I can't access them all (the idea was to automate Selenium to click it until all articles are shown, and then scrape them all). Is there a workaround? Are there alternative ways I can access all the articles chronologically and scrape them?
My idea was to somehow analyze whether the links come from an alternative source, but I'm clueless there. I did, however, successfully scrape the articles that are displayed.
Thanks in advance!
Use findElements and search for <h2 class="css-1j9dxys e1xfvim30">...</h2>, which will give you a list of all the titles. Each time you click Show More, the list grows by 10 or so. So the idea is to simply click the button until the size of the list stops changing. Use a while loop. Something like:
import java.util.ArrayList;
import java.util.List;
import org.openqa.selenium.By;
import org.openqa.selenium.WebElement;

List<WebElement> oldList = new ArrayList<>();
List<WebElement> newList = driver.findElements(By.cssSelector("h2.css-1j9dxys.e1xfvim30"));
WebElement button = driver.findElement(By.xpath("//button[text()='Show More']"));
while (newList.size() != oldList.size()) {
    oldList = newList;
    button.click();
    // you may need an explicit wait here so the next batch has time to load
    newList = driver.findElements(By.cssSelector("h2.css-1j9dxys.e1xfvim30"));
}
I might have some mistakes in the code but the idea is there. Good luck!
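Since the question is about Python, here is a rough Python equivalent of the same idea; the URL is a placeholder and the selectors are the same assumptions as above:
import time
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get("https://example.com/author-page")  # placeholder for the real author URL

TITLE_SELECTOR = "h2.css-1j9dxys.e1xfvim30"
old_count = -1
new_count = len(driver.find_elements(By.CSS_SELECTOR, TITLE_SELECTOR))
while new_count != old_count:
    old_count = new_count
    # re-find the button each pass in case the page re-renders it
    driver.find_element(By.XPATH, "//button[text()='Show More']").click()
    time.sleep(2)  # crude pause; a WebDriverWait on the count would be more robust
    new_count = len(driver.find_elements(By.CSS_SELECTOR, TITLE_SELECTOR))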
I want to find downloadable content on a webpage, but I don't know in advance what the webpage looks like. Right now, I am looking at all links
links = driver.find_elements(By.XPATH, "//a[@href]")
and buttons
buttons = driver.find_elements(By.TAG_NAME, "button")
For each link (which has an href attribute, per the XPath above), I check whether it points to a machine-readable file of the kind I am looking for (either .csv or .json); if the link does not end in one of these extensions, I assume it does not reference a machine-readable file.
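A minimal sketch of that extension check, reusing the links list from the snippet above:
# Keep only hrefs that end in a machine-readable extension.
candidates = []
for link in links:
    href = link.get_attribute("href") or ""
    if href.lower().endswith((".csv", ".json")):
        candidates.append(href)
print(candidates)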
As for the buttons, I know of no way to check what they may contain other than naively clicking on them (button.click()). While this is clearly dangerous, especially because this function will be applied to thousands of websites, I don't know how else to do it.
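One hedged way to narrow things down before clicking is to inspect each button's visible text and onclick attribute; the keywords checked here are guesses, not guaranteed markers:
# Flag buttons whose metadata hints at a download, reusing the buttons list above.
for button in buttons:
    label = (button.text or "").lower()
    onclick = button.get_attribute("onclick") or ""
    if "download" in label or ".csv" in onclick or ".json" in onclick:
        print("possible download button:", label or onclick)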
Is there any other way I could check for downloadable content? Additionally, are there any other page elements I should be looking for, besides links and buttons, and are there any more efficient methods of doing what I want?
Any help is greatly appreciated. Thanks!
I am using Selenium with Python to scrape some pages. I have many web pages that represent the same type of object (football player information), but each of them has a slightly different HTML layout. In particular, my main issue is that the div class identifiers change when refreshing or changing pages, in an unpredictable way.
In this specific case I would like to get the data in the div with class identifier "jss176", but when I get to another player this changes to, for example, "jss450", with no meaningful pattern to be found.
Is there a way around this? I was thinking of navigating through the children starting from the div with id="root", but I can't seem to find a good piece of code to achieve this.
Thank you very much!
If only the class names change, but not the page structure, you could locate the info by XPath.
https://www.tutorialspoint.com/what-is-xpath-in-selenium-with-python
You can right-click the div you want in Chrome's developer tools and select the "Copy XPath" option.
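For instance, a sketch of the asker's own idea of walking down from the stable root div; the URL and the positional steps are placeholders, so copy the real path from DevTools:
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get("https://example.com/player")  # placeholder URL

# Positional path from the stable #root div; the indices below are
# placeholders and must match the actual page structure.
stats_div = driver.find_element(By.XPATH, "//div[@id='root']/div[1]/div[2]")
print(stats_div.text)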
I should start by saying that I'm not a web developer but I have been using Selenium with Python for a few weeks and think I've gotten the basics down.
What I have is a page with a weekly calendar on it. [Screenshot: weekly calendar with coloured boxes for each class]
You can click on any of the coloured boxes, which brings up a register for a class. The register features items that you can click on to bring up new information. The problem I have is that I can't click on the items; or, rather, I don't know how to automate the click. I have automated clicking with Selenium before in other situations.
[Screenshot: the register overlay, with the calendar still visible behind it]
As you can see, the calendar still appears in the background. That is to say, the item we click on doesn't take us to a new page but, I suppose, runs some kind of method which shows the register and populates it with data.
My problem is this: I want to automate clicking on each of these items so that I can scrape the information each one brings up.
So what have I done? When I've done this before, I've searched through the page HTML for the relevant part and then grabbed the XPath to the element I needed. But here I can't do that. Why not? Well, firstly, I just can't find the element!
Take a look at this close-up of the first column: [Screenshot: close-up of the first column of the calendar]
It's divided into columns, but I'd expect the clickable area to be an element within that column; as you can see, it's not. Furthermore, the clickable area is just the coloured box itself, yet the element extends outside that area. I have moved my mouse cursor around carefully to see exactly what's clickable, and it definitely is just the coloured box.
So I've not been able to get the element at all.
I thought I might be able to find out where we go after clicking, but when I checked the link address, it was just the same page, with no differences.
I appreciate I'm asking quite a broad question here, but the problem is that I don't really know where to start. If someone could give me at least that much, I would be grateful. Like if I could just click on each of these one at a time... I've found where the populated data is so I could grab that without a problem.
Well, here's to hoping.
Edit: I should add that there are some script tags (type='text/javascript') on the page. I presume the answer is in there somewhere, but there is a lot of JavaScript and I'm not adept at reading it; it's hard for me to tell which script does what. If I could at least figure out which script runs when I click the item, I think I'd be onto something. Even that would help me.
I encountered a similar problem when scraping Instagram followers in mobile view, where the followers' names were shown in a floating dialog box. The approach I took was to identify the floating dialog box by name and click the elements inside it. The HTML may differ in your case.
Try looking at this link: Selenium Scroll inside of popup div
Hard to say without the HTML. Maybe try Katalon Recorder (a Chrome extension) and see if that can detect the XPath for you. It might also be that you have to use some kind of JavaScript to invoke the element's click handler.
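As a rough illustration of the JavaScript route: one hedged option is to ask the browser which element sits at a given point on the screen and click that. The coordinates below are made up and would need to be read off the actual coloured box; an existing driver is assumed:
# document.elementFromPoint is a standard DOM API; Selenium converts the
# returned DOM node into a WebElement that can be clicked.
x, y = 120, 240  # placeholder coordinates of one coloured box
element = driver.execute_script(
    "return document.elementFromPoint(arguments[0], arguments[1]);", x, y
)
element.click()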
[ Ed: Maybe I'm just asking this? Not sure -- Capture JSON response through Selenium ]
I'm trying to use Selenium (Python) to navigate via hyperlinks to pages in a web database. One page returns a table with hyperlinks that I want Selenium to follow, but the links do not appear in the page's source. The only HTML that corresponds to the table of interest is a div indicating that the site is pulling results from a facet search. Within that div is a <script type="application/json"> tag and a handful of search options. Nothing else.
Again, I can view the hyperlinks in Firefox, but not using "View Page Source" or Selenium's selenium.webdriver.Firefox().page_source call. Instead, that call outputs not the <script> tag but a series of <div> tags that appear to define the results' format.
Is Selenium unable to navigate output from JSON applications? Or is there another way to capture the output of such applications? Thanks, and apologies for the lack of code/reproducibility.
Try using execute_script() and get the links by running JavaScript, something like:
driver.execute_script("document.querySelector('div#your-link-to-follow').click();")
Note: if the divs are generated dynamically by scripts, you may want to wait a few seconds before executing the script.
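A hedged sketch of that wait, reusing the placeholder div#your-link-to-follow selector from the snippet above and assuming an existing driver:
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Wait up to 10 seconds for the dynamically generated div to appear,
# then trigger its click handler via JavaScript.
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, "div#your-link-to-follow"))
)
driver.execute_script("document.querySelector('div#your-link-to-follow').click();")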
I've confronted a similar situation on a website with JavaScript (http://ledextract.ces.census.gov to be specific). I had pretty good luck just using Selenium's find_element() methods. The key is that even if not everything about the hyperlinks appears in the page's source, Selenium can usually find them after navigating to the website, since doing so engages the JavaScript that produces the additional links.
Thus, for example, you could try mousing over the links, finding their titles, and then using:
driver.find_element_by_xpath("//*[@title='Link Title']").click()
Based on whatever title appears by the link when you mouse over it.
Or, you may be able to find the links based on the text that appears on them:
driver.find_element_by_partial_link_text('Link Text').click()
Or, if you have a sense of the id for the links, you could use:
driver.find_element_by_id('Link_ID').click()
If you are at a loss for what the text, title, ID, etc. would be for the links you want, a somewhat blunt approach is to pull the id, text, and title of every element on the page and save them to a file you can search through to identify likely candidates. That should show you a lot more (in some respects) than the site's source code would:
AllElements = driver.find_elements_by_xpath('//*')
for Element in AllElements:
    print 'ID = %s TEXT = %s TITLE = %s' % (Element.get_attribute("id"), Element.text, Element.get_attribute("title"))
Note: if you have, or suspect you have, multiple links with the same title, text, etc., then you may want to use the find_elements (plural) methods to get lists of everything satisfying your criteria, specify the XPath more explicitly, and so on.
I'm using Python 2.7 with BeautifulSoup and urllib2, and I'm trying to scrape this page: angel.co/companies
As you can see, it shows a list of companies that ends with a "More" button to show the rest. When you click the button, more companies appear, and a new tag is created with the new list of results. The button is in this div: <div class="more" data-page="2">More</div>, and each time you click it the data-page value increases.
I'd like to know if it's possible to scrape this page completely (so that it clicks the "More" button each time it reaches the end). I suppose it involves scraping the CSS and changing it, but I've never done that and haven't found any information about it.
Depending on what you want to do, you could use their API for this. If you are not sure what it is and how to use it, try googling around for an answer. Here's one for starters.
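If the API route doesn't pan out, here is a minimal sketch of replaying the paginated request that the "More" button fires. The endpoint, parameter name, and row selector are all assumptions, so copy the real request from the browser's network tab (requests stands in for urllib2 here):
import requests
from bs4 import BeautifulSoup

page = 1
while True:
    # Assumed endpoint and parameter; mirror the actual XHR the
    # "More" button sends (visible in the network tab).
    resp = requests.get("https://angel.co/companies", params={"page": page})
    soup = BeautifulSoup(resp.text, "html.parser")
    rows = soup.select("div.company")  # hypothetical row selector
    if not rows:
        break
    for row in rows:
        print(row.get_text(strip=True))
    page += 1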