How to access a table in mediawiki - python

I am using the MediaWiki API and the requests module to pull certain information from a sort of table on a Wikipedia page. As an example, take the song Zombie: there is a 'table' on the right of the page that lists the album, the author, the release date and so forth. The issue is that I don't know how to query this data. I'm using this link as the endpoint: https://en.wikipedia.org/w/api.php?format=json&formatversion=2&action=query&titles=Zombie_(song)&prop=extracts
to search for what I need, but it returns the text of the page. I've tried the API sandbox and couldn't find a query that gives me the information I need. I appreciate any advice and input, thanks.

For that sort of metadata you'd be best off using Wikidata. In the sidebar on Wikipedia there's a link to the Wikidata item, and you can use an API query such as https://www.wikidata.org/w/api.php?action=wbgetclaims&entity=Q169298 to get the data in a structured way. For information about what those results mean, see the Wikibase API docs.
[Edit:] To get the entity ID, you can use wbgetentities with a Wikipedia title (titles) and wiki ID (sites); e.g.: https://www.wikidata.org/w/api.php?action=wbgetentities&sites=enwiki&titles=Zombie_(song)
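The two calls above can be chained from Python. The following is only a minimal sketch using the standard library: the helper names (build_url, fetch_json, entity_id_from_response, claim_values, main) and the User-Agent string are made up for illustration, and P577 is assumed to be the Wikidata property for "publication date"; check the property IDs on the item page before relying on them.

```python
import json
import urllib.parse
import urllib.request

WIKIDATA_API = "https://www.wikidata.org/w/api.php"


def build_url(params):
    """Return a Wikidata API URL for the given parameters, asking for JSON output."""
    query = dict(params, format="json")
    return WIKIDATA_API + "?" + urllib.parse.urlencode(query)


def fetch_json(url):
    """Download a URL and decode the JSON body."""
    # The User-Agent value is a placeholder (assumption); use your own contact details.
    request = urllib.request.Request(url, headers={"User-Agent": "infobox-example/0.1"})
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def entity_id_from_response(data):
    """Pick the first item ID (Q...) out of a wbgetentities response, or None."""
    for entity_id in data.get("entities", {}):
        if entity_id.startswith("Q"):
            return entity_id
    return None


def claim_values(claims, property_id):
    """Return the raw values stored under one property in a wbgetclaims response."""
    values = []
    for claim in claims.get("claims", {}).get(property_id, []):
        snak = claim.get("mainsnak", {})
        # Claims marked "no value" or "unknown value" carry no datavalue, so skip them.
        if snak.get("snaktype") == "value":
            values.append(snak["datavalue"]["value"])
    return values


def main():
    # Step 1: resolve the Wikipedia title to a Wikidata entity ID.
    lookup = fetch_json(build_url({
        "action": "wbgetentities",
        "sites": "enwiki",
        "titles": "Zombie_(song)",
        "props": "info",
    }))
    qid = entity_id_from_response(lookup)
    if qid is None:
        print("No Wikidata item found for that title")
        return
    # Step 2: fetch the claims for that entity and read one property.
    claims = fetch_json(build_url({"action": "wbgetclaims", "entity": qid}))
    print(qid, claim_values(claims, "P577"))  # P577: assumed "publication date"
```

Calling main() performs the two requests. Values that refer to other items (an album or a performer, for example) come back as entity IDs, so turning them into names would need a further wbgetentities lookup.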

Related

Trying to parse all results from z-lib to build a database of book titles

I am trying to scrape a list of all available books in z-library, but results are only provided through a search term and I want the titles of all books.
Also, each query returns at most 10 pages of 50 results per page, 500 in total. Searching for a single space returns the 500 most popular books.
I intend to use Selenium and Python, but I can't find a way to access the entire list of books.
https://book4you.org/
Any ideas?
Thanks
You cannot get the data for all the books with Selenium or any other web scraper on a site like this, since those tools only see the results the GUI displays for a specific search query.
It might be possible with an API GET request that returns all the data from the database, but we cannot know whether that is possible until we know which API requests that site supports.

Scraping data from a webpage based on VIEWSTATES

I'm attempting to scrape the details of all documents on this link.
The problem I'm facing is that the site is built with ASP.NET and the VIEWSTATEs aren't allowing me to access the data directly. I tried a mixture of BeautifulSoup, Scrapy and Selenium, but to no avail. The data consists of 12782 documents; for each entry in the results returned on the aforementioned page, I need to extract the PDF download link from the page that the entry redirects to.
The site also has an API here, but the catch is that it only returns 2000 data points at a time, so getting all ~12k data points that way is out of the question.
Can someone help me with ANY ONE of the following:
Create a scraper to get the pdf links
Generate a query to get all the data from the API
Any recurrence relation that helps me generate links to get the queries for the API
Using the requests section in the API to get all the records at the same time delivered to your email
Ideally, a solution in python would be great, but if you can help me get a csv file of all the links, that would also work. Thanks in advance!
I ended up solving the problem by using the request functionality, which is located here.
It takes a particular query and an email address, and it sent me the entire data dump I needed. From that data dump, I could extract all the PDF links.

How to find the correct URL when you made some choices on the web page?

I'm very new to web scraping. I am trying to use an XPath selector to get data from this webpage: https://seffaflik.epias.com.tr/transparency/uretim/planlama/kgup.xhtml
The problem is that whenever you change the date or the power plant name, the URL does not change, so when you fetch the response you always get the same, wrong result. Is there a way to find the correct URL, or anything else in the HTML markup that would help?
For a scraping operation like this, you'll need to do a bit more than just load the document and then grab the content. The document in question relies on JavaScript to load new information from another resource after the user has set the search parameters and submitted the form.
After loading the document, you'll need to define your search parameters. You can do this via JavaScript injection or via your browser's console. For example, if you were trying to define the value for the first date field, you could use
document.querySelectorAll('#j_idt199 input')[1].value = "Some/New/Date";
Repeat this process for the other fields you wish to define in your search, and then run the following code to programmatically execute your search:
document.querySelector('#j_idt199 button').click();
After that, you can either grab the information you want using plain JS query selectors, or you can implement a scraping library like artoo.js to help you interpret the data and export it.
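Since the question is about scraping from a script rather than from the browser console, the same two JavaScript steps can be driven from Python. The following is a rough sketch, assuming Selenium is installed and a Firefox driver is available; build_search_script, run_search and demo are hypothetical helper names, and the #j_idt199 selector is taken from the answer above and may well change, because IDs of that form are usually auto-generated.

```python
import json


def build_search_script(form_selector, field_index, value):
    """Build the JavaScript from the answer: set one input, then click the form's button."""
    # json.dumps turns the Python strings into safely quoted JavaScript string literals.
    return (
        f"document.querySelectorAll({json.dumps(form_selector + ' input')})"
        f"[{int(field_index)}].value = {json.dumps(value)};"
        f"document.querySelector({json.dumps(form_selector + ' button')}).click();"
    )


def run_search(driver, form_selector, field_index, value):
    """Run the generated JavaScript in the page currently loaded by the driver."""
    # execute_script is the Selenium WebDriver method for running JavaScript in the page.
    driver.execute_script(build_search_script(form_selector, field_index, value))


def demo():
    # Third-party dependency, imported here so the helpers above work without it.
    from selenium import webdriver

    driver = webdriver.Firefox()
    try:
        driver.get("https://seffaflik.epias.com.tr/transparency/uretim/planlama/kgup.xhtml")
        run_search(driver, "#j_idt199", 1, "Some/New/Date")
        # The results load asynchronously, so a real script would wait for the
        # table to update before reading the page source.
        print(driver.page_source[:500])
    finally:
        driver.quit()
```

Calling demo() opens the page and submits the form once. The date string is the placeholder from the answer, not a format the site is known to accept.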

getting links from table in web page

I am trying to go to a website, use their search tool to query a database, and grab all of the links from the table of search results displayed below the search tool. The problem is, the source for the website only shows the html for the search tool. Can anyone help me figure out how to get the links from the table? The address of the search tool is:
https://wagyu.digitalbeef.com/
I was hoping to use BeautifulSoup and Python 3.6 on a Windows 10 machine to read the pages associated with those links and grab the name of each cow and its parents, to create a more advanced pedigree chart than what is available on the site. Thanks for the help.
Just to clarify, I can manually grab a single link, use bs to grab the html for that page, and pull out the pedigree info. I just don't know how to grab the links from the search results page.

Scrapy: Get data on page and following link

I have been using scrapy for a personal project. My problem is very similar to the question asked on the following page:
Scrapy: Follow link to get additional Item data?
The page I am scraping is the following:
http://www.tennisinsight.com/player_activity.php?player_id=51
This page has a list of matches in this form for eg:
Round of 16 Def. Ivan Dodig(37,2.41) (CRO) 6-3 6-3 Recap Match Stats $1.043
I have written Scrapy code that opens every "Match Stats" link on the page and scrapes the data on that page into an individual record.
In addition to this, I want to scrape the "Odds" column (which is the $1.043 above) and add this data to the record.
I have searched for an answer and it seems that I have to use the Request meta field and pass this data along to the parse method. However, I have a problem because I am struggling to incorporate it into my code. The answer from the stackoverflow link I linked above is "To scrape additional fields which are on other pages, in a parse method extract URL of the page with additional info, create and return from that parse method a Request object with that URL and pass already extracted data via its meta parameter."
This makes perfect sense. However, the URLs that I scrape are defined in the rules, so I don't know how to extract the required data.
Here is part of my code so far which will hopefully better explain my problem.
rules = (
    # Follow every "Match Stats" popup link found inside the match cells
    Rule(SgmlLinkExtractor(allow=r"match_stats_popup.php\?matchID=\d+",
                           restrict_xpaths='//td[@class="matchStyle"]',
                           tags='a', attrs='href', process_value=getPopupLink),
         callback='parse_match', follow=True),
)
The parse_match function parses the match stats into one item.
So what happens is that each of these match stats links is opened, and there is no way for me to access the Odds column of the main page.
Any help will be much appreciated.
