Scrape multiple table pages with BeautifulSoup and Python

Scrape multiple table pages with BeautifulSoup and Python - python

http://www.indymini.com/p/mini-marathon/miniresults
I want to scrap table available on this url with python BS4 but when i change the table size or change page, url does not chang.

When navigating through the table, the URL does not change because the table seems to be implemented using javascript (DataTables library in particular) - and uses AJAX to get relevant data to display.
So, basically, I don't see a way you could scrape the page using BS4 and get data other than those displayed by default, when the page loads.
On the other hand, as the data is retrieved using AJAX, you could try to figure out the format of the AJAX request (what parameter does what with respect to the results you want, for example using Firebug) and retrieve the data directly in JSON format by calling the AJAX URL that supplies the data table.
But, depending on your intended use of the data, you might want to consider asking the owner of the website for permission to download and use the data. And, who knows - they might be willing to help.

Well its a ajax call that is sent to server via GET, here is quick and dirty scrapping code in python
ajax url is
import requests,time
c=0
data=list()
for i in range(1,2278):
url='http://results.xacte.com/json/search?eventId=1387&callback=jQuery18309972286304579958_1494520029659&sEcho=8&iColumns=13&sColumns=&iDisplayStart='+str(c)+'&iDisplayLength=10&mDataProp_0=&mDataProp_1=bib&mDataProp_2=firstname&mDataProp_3=lastname&mDataProp_4=sex&mDataProp_5=age&mDataProp_6=city&mDataProp_7=state&mDataProp_8=country&mDataProp_9=&mDataProp_10=&mDataProp_11=&mDataProp_12=&sSearch=&bRegex=false&sSearch_0=&bRegex_0=false&bSearchable_0=false&sSearch_1=&bRegex_1=false&bSearchable_1=true&sSearch_2=&bRegex_2=false&bSearchable_2=true&sSearch_3=&bRegex_3=false&bSearchable_3=true&sSearch_4=&bRegex_4=false&bSearchable_4=true&sSearch_5=&bRegex_5=false&bSearchable_5=true&sSearch_6=&bRegex_6=false&bSearchable_6=true&sSearch_7=&bRegex_7=false&bSearchable_7=true&sSearch_8=&bRegex_8=false&bSearchable_8=true&sSearch_9=&bRegex_9=false&bSearchable_9=true&sSearch_10=&bRegex_10=false&bSearchable_10=true&sSearch_11=&bRegex_11=false&bSearchable_11=false&sSearch_12=&bRegex_12=false&bSearchable_12=false&iSortCol_0=0&sSortDir_0=asc&iSortingCols=1&bSortable_0=false&bSortable_1=true&bSortable_2=true&bSortable_3=true&bSortable_4=true&bSortable_5=true&bSortable_6=true&bSortable_7=true&bSortable_8=true&bSortable_9=false&bSortable_10=false&bSortable_11=false&bSortable_12=false&_='+str(time.time())
r=requests.get(url)
c+=1
print (r.text,'-------------',)
#do whatever you want to do with it, r.text will give the raw data.

Related

How to scrape a react table when there are no table tags

I am trying to scrape the table from this link. There are no table tags on the page so I am trying to access it using the class "rt-table". When I inspect the table in developer tools, I can see the html I need in the elements section (I am using Chrome). However, when I view the source code using requests, this part of the code is now missing. Does anyone know what the problem could be?

If you just want those stats you can access the hidden api like this:
import requests
import pandas as pd
url = f'https://api.nhle.com/stats/rest/en/team/summary?isAggregate=false&isGame=true&sort=%5B%7B%22property%22:%22points%22,%22direction%22:%22DESC%22%7D,%7B%22property%22:%22wins%22,%22direction%22:%22DESC%22%7D,%7B%22property%22:%22teamId%22,%22direction%22:%22ASC%22%7D%5D&start=0&limit=100&factCayenneExp=gamesPlayed%3E=1&cayenneExp=gameTypeId=2%20and%20seasonId%3C=20072008%20and%20seasonId%3E=20072008'
data = requests.get(url).json()
df = pd.DataFrame(data['data'])
df.to_csv('nj devils data.csv',index=False)
To see where I got that URL go to the page you are trying to scrape in your browser, then open Developer Tools - Network - fetch/XHR and hit refresh, you'll see a bunch of network requests happen if you click on them you can explore the data returns in the "preview" tab. This can be recreated like above. This is not always the case, sometimes you need to send certain headers or a payload of information to authenticate a request to an api, but in this case it works quite easily.

Web scraping using python, how to deal with ngif?

I'm trying to read the price of a fund which is not available through an API. The fund is listed here https://bors.e24.no/#!/instrument/KL-AFMI2.OSE
At first I thought this would be a simple task so I looked at beautifulsoup, but realized that what I wanted was not returned. A far as I can tell due to the:
<-- ngIf: $root.allowStreamingToggle -->
I'm a beginner so hoping someone can help me with an easy way to get this value.

I see json being returned from the following endpoint in network tab
import requests
headers = {'user-agent': 'Mozilla/5.0'}
r = requests.get('https://bors.e24.no/server/components/graphdata/(PRICE)/DAY/KL-AFMI2.OSE?points=500&stop=2019-07-30&period=1weeks', headers = headers).json()
Price is then
r['rows'][0]['values']['series']['c1']['data'][3][1]

The tag "ngIf" almost certainly means that the website you are attempting to scrape is an AngularJS app... in which case, the data is almost certainly NOT in the HTML page you are pulling and attempting to parse with BeautifulSoup.
Rather, the page is probably pulling the data later -- say, via AJAX -- and the rendering it IN to the page via Angular's client-side code.
If all that is right... then BeautifulSoup is not the right tool.
You might have some hope if you can identify the AJAX call that the page is calling, then call THAT directly. Inspect it to see the data structure; if you are lucky maybe it is JSON and then super easy to parse. If that looks promising, then you can probably simply use the requests library, and skip BeautifulSoup. But you have to do the reverse engineering to figure out WHAT you should be calling.
Here, try this: I did a little snooping with the browser console. Is this the data you are looking for? get info for KL-AFMI2.OSE
If so.. then just use that URL directly in requests.

How i post data on search-Bar website using python script?

As all we know in web application we have get method and post data method.
Here my problem appear with post data.
For example i want to make my python code that access for search bar of website by insert same values and submit (the website button), then check for the page.
How the code gonna be then if there any documentation about this python concepts!
I am totally confused
Note : i am just beginner in python.

If the website relies on javascript, you're going to need to use something like Selenium which will emulate a typical browser and allow you to insert information onto a page and execute javascript commands.
If, however, the search bar simply posts data to a URL. You can determine that URL and then use requests to post the data and retrieve the result.
resp = requests.post('http://website/search', data = {'term':'value'})

Beautiful soup - data not in HTML file

I am new to Python. I am trying to scrape data from a website and the data I want can not be seen on view > source in the browser. It comes from another file. It is possible to scrape the actual data on the screen with Beautifulsoup and Python?
example site www[dot]catleylakeman[dot]co(dot)uk/cds_banks.php
If not, is this possible using another route?
Thanks

The "other file" is http://www.catleylakeman.co.uk/bankCDS.php?ignoreMe=1369145707664 - you can find this out (and I suspect you already have) by using chrome's developer tools, network tab (or the equivalent in your browser).
This format is easier to parse than the final html would be; generally HTML scrapers should be used as a last resort if the website does not publish raw data like the above.

My guess is, the url you are actually looking for is:
http://www.catleylakeman.co.uk/bankCDS.php?ignoreMe=1369146012122
I found it using the developer toolbar and looking at the network traffic (builtin to chrome and firefox, also using firebug). It gets called in with Ajax. You do not even need beatiful soup to parse that one as it seems to be a long string separated with *| and sometimes **|. The following should get you initial access to that data:
import urllib2
f = urllib2.urlopen('http://www.catleylakeman.co.uk/bankCDS.php?ignoreMe=1369146012122')
try:
data = f.read().split('*|')
finally:
f.close()
print data

Scraping Ajax using Python

I am trying to get the data in the table at this website which is updated via jquery after the page loads (I have permission) :
http://whichchart.com/
I currently use selenium and beautifulsoup to get data, however because this data is not visible in the html source, I can't access it. I have tried PyQt4 but it likewise does not get the updated html source.
The values are visible in firebug and chrome developer, so are there any python packages out there which can exploit this and feed it to beautifulsoup?
I'm not a massive techie so ideally I would like a solution which would work in Python or the next easiest software type.
I'm aware I can get it via proprietary "screen-scraper" software, but that is expensive.

Page is making AJAX call to get a data to http://whichchart.com/service.php?action=NewcastleCoal which returns values in JSON. So you can do the following:
Use urllib to get data using HTTP
Parse that data with json library reads method
Now you have a python object to process
If you need to process HTML page content I would suggest to use library like BeautifulSoup or scrapy

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Scrape multiple table pages with BeautifulSoup and Python - python

http://www.indymini.com/p/mini-marathon/miniresults I want to scrap table available on this url with python BS4 but when i change the table size or change page, url does not chang.

Related

How to scrape a react table when there are no table tags

Web scraping using python, how to deal with ngif?

How i post data on search-Bar website using python script?

Beautiful soup - data not in HTML file

Scraping Ajax using Python

Categories

Resources