I am just starting out learning Python and have taken some online courses in my free time.
I am trying to find the data source for this website so that I can make a daily count of departures from the airport and eventually build a flights-vs-date plot.
I have spent two weeks investigating the page source but am unable to find the JSON source. Would a kind soul please show me where it is? Thanks!
https://www.changiairport.com/en/flights/departures.html
https://www.changiairport.com/cag-web/flights/departures?lang=en&callback=JSON_CALLBACK&date=today
There you go. That will give you the flight schedule for today. The date parameter can probably take other values too, but I don't know what the options are. Any normal GET request should work; it seems to be publicly accessible.
To find requests like this yourself, right-click the page, choose "Inspect", open the "Network" tab, and browse through the different requests the page makes.
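If you want to pull it from Python, a minimal sketch with the requests library would be something like this (I'm assuming the endpoint behaves the same outside the browser, and that dropping the callback parameter, or stripping the JSONP wrapper, leaves you with plain JSON):

import json
import requests

# Endpoint and parameter names are taken from the URL above.
url = "https://www.changiairport.com/cag-web/flights/departures"
params = {"lang": "en", "date": "today"}

resp = requests.get(url, params=params, timeout=30)
resp.raise_for_status()

text = resp.text
# If the response comes back wrapped as JSONP, e.g. JSON_CALLBACK({...}),
# strip the wrapper before parsing.
if text.startswith("JSON_CALLBACK("):
    text = text[len("JSON_CALLBACK("):].rstrip(");\n ")

data = json.loads(text)
print(type(data))  # poke around the structure before counting departures

From there you can count the departures per day and build your flights-vs-date plot.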
Just a note: for the record, this is called scraping, and it often sits in a legal gray area. As long as you aren't using it too extensively or making a profit off of it, you probably won't get in any trouble, but make sure you have permission from the company if you plan to make a lot of calls to an open API like this. It's usually against their terms of service, though typically as an unenforced clause they will only invoke if you become a nuisance.
Related
I work for a power company and have been tasked with building a database. I have a beginner-to-intermediate understanding of Python and can muddle through MSSQL decently. They have procured Azure for this project, and I am completely lost as to how to start this task.
Here is one of the sources of data that I want to scrape every minute.
http://ets.aeso.ca/ets_web/docroot/tradingPage.html - this is a complete overview of the Alberta power market in real time.
Ideally, I would want to be able to scrape this data and other sources, modify it to fit a certain format, and push it onto the SQL server.
Do I need virtual machines that just loop over Python scripts? Or do I need managed instances? This data also needs to be queryable right after it is scraped. Eventually this data may feed machine learning algorithms (I don't know jack about that either, but I have been told it should play nicely with that type of environment).
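Conceptually, the minute-by-minute loop I have in mind looks something like the sketch below; the connection string, table, column names, and parsing function are all placeholders, and I haven't tested any of it:

import time
import pyodbc
import requests

# Placeholder connection string -- not real values.
CONN_STR = "DRIVER={ODBC Driver 17 for SQL Server};SERVER=...;DATABASE=...;UID=...;PWD=..."
SOURCE_URL = "http://ets.aeso.ca/ets_web/docroot/tradingPage.html"

def parse_rows(html):
    # The real parsing depends on the page layout (BeautifulSoup, pandas.read_html, etc.);
    # it should return a list of (timestamp, metric, value) tuples.
    raise NotImplementedError

def scrape_once():
    resp = requests.get(SOURCE_URL, timeout=30)
    resp.raise_for_status()
    return parse_rows(resp.text)

def store(rows):
    conn = pyodbc.connect(CONN_STR)
    try:
        conn.cursor().executemany(
            "INSERT INTO market_snapshot (ts, metric, value) VALUES (?, ?, ?)", rows)
        conn.commit()
    finally:
        conn.close()

while True:
    try:
        store(scrape_once())
    except Exception as exc:
        print("scrape failed:", exc)
    time.sleep(60)  # run once a minute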
Just looking to see if anyone has any insight in how you would approach this, and can tell me what I clearly don't know and haven't thought of. Any insight is truly appreciated.
Thanks!
Currently I am working on a project that will scrape content from various similarly designed websites which contain dynamic content. My end goal is to aggregate all this data into one application or report of sorts. I made some progress in terms of pulling the needed data from one page, but my lack of experience and knowledge in this realm has left me thinking I went down the wrong path.
https://dutchie.com/embedded-menu/revolutionary-clinics-somerville/menu
The above link is the perfect example of the type of page I will be pulling from.
In my initial attempt I was able to have the page scroll to the bottom, all the while collecting data from the various elements (in addition to the manual scroll) using:
# Grab every product card container on the page
cards = driver.find_elements_by_css_selector("div[class^='product-card__Content']")
This allowed me to pull all the data points I needed on the fly, minus the overarching category, which happens to be a parent element. That is something I can map manually in Excel, but I would prefer to have it pulled alongside everything else.
This got me thinking that maybe I should have taken a top-down approach, rather than what I am seeing as a bottom-up approach. But no matter how hard I tried, based on advice from others, I could not get it working as intended, where I could pull the category from the parent div, due to my lack of understanding.
Based on input from others I was able to make a pivot of sorts, and using the code below I was able to get the category as well as the product name, without any need to scroll the page at all, which went against every experience I have had with this project so far; I am unclear how or why this is possible.
for product_group_name in driver.find_elements_by_css_selector("div[class^='products-grid__ProductGroupTitle']"):
    for product in driver.find_elements_by_xpath("//div[starts-with(@class,'products-grid__ProductGroup')][./div[starts-with(@class,'products-grid__ProductGroupTitle')][text()='" + product_group_name.text + "']]//div[starts-with(@class,'consumer-product-card__InViewContainer')]"):
        print(product_group_name.text, product.text)
The problem with this code, which is much quicker since it does not rely on scrolling, is that no matter how I approach it I am unable to pull the additional data points of brand and price. Obviously it is something in my approach, but it is outside my knowledge level at the moment.
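For reference, the kind of relative lookup I have been attempting for brand and price looks roughly like this; the Brand and Price class prefixes are guesses based on the naming pattern above, so they may well be wrong:

for card in driver.find_elements_by_css_selector("div[class^='consumer-product-card__InViewContainer']"):
    # Searching from the card element keeps the lookup inside that one card.
    brand = card.find_elements_by_css_selector("div[class^='consumer-product-card__Brand']")
    price = card.find_elements_by_css_selector("div[class^='consumer-product-card__Price']")
    print(card.text,
          brand[0].text if brand else "brand not found",
          price[0].text if price else "price not found")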
Any specific or general advice would be appreciated, as I would like to scale this into something a bit more robust as my knowledge builds. I would like to be able to have this scanning multiple different URLs at set points in the day; that is a long way off, but I want to make sure I start on the right path if possible. Based on what I have provided, is the top-down approach better in this case? Bottom-up? Is this subjective?
I have noticed comments about pulling the entire source code of the page and working with that. Would that be a valid approach, and possibly better suited to my needs? Would it even be possible, given the dynamic nature of the page?
Thank you.
I just deployed my first ever web app, and I am curious if there is an easy way to track every time someone visits my website. Well, I am sure there is, but how?
Easy as pie: use Google Analytics. You just have to include a tiny script in your app's pages.
http://www.google.com/analytics/
PythonAnywhere dev here. You also have your access log; you can click through to it from your web app tab. It shows you the raw data about your visitors. I would personally also use something like Google Analytics, but you don't need to do anything to see your raw visitor data. It's already there.
I know from my own experience that people are obsessed with traffic, statistics, and looking at other sites, tracking their stats, and so on. And where there is enough demand, of course there are sites to satisfy you. I wanted to put those sites and tools together in one list, because, at least for me, this field was really unclear: I didn't know what Google PageRank, Alexa, Compete, or Technorati rankings meant, and I could go on. I must say these stats are not always precise, but they at least give an overview of how popular a certain page is and how many visitors a site gets, and if you compare those stats with your own site's statistics, you can get pretty accurate results.
http://www.stuffedweb.com/3-tools-to-track-your-website-visitors/
http://www.1stwebdesigner.com/design/10-ways-how-to-track-site-traffic-popularity-statistics/
I am a huge fan of Cloudflare's analytics. It is super easy to set up, and you don't have to worry about adding a JavaScript blurb to each page. Cloudflare is also able to track everything that visits your page without loading JavaScript.
http://www.cloudflare.com
I am a social scientist and a complete newbie/noob when it comes to coding. I have searched through the other questions and tutorials but am unable to get the gist of how to crawl a news website, targeting the comments section specifically. Ideally, I'd like to tell Python to crawl a number of pages and return all the comments as a .txt file. I've tried
from bs4 import BeautifulSoup
import urllib2
url = "http://www.xxxxxx.com"
and that's as far as I can go before I get an error message saying bs4 is not a module. I'd appreciate any kind of help on this, and please, if you decide to respond, DUMB IT DOWN for me!
I can run wget in the terminal and get all kinds of text from websites, which is awesome IF I could actually figure out how to save the individual output HTML files into one big .txt file. I will take a response to either question.
Try Scrapy. It is a fast high-level screen scraping and web crawling framework, used to crawl websites and extract structured data from their pages. It can be used for a wide range of purposes, from data mining to monitoring and automated testing.
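A minimal spider looks something like this; the start URL and the CSS selector are placeholders that you would point at the actual article pages and whatever element wraps a comment on the site:

import scrapy

class CommentsSpider(scrapy.Spider):
    name = "comments"
    # Placeholder URLs -- replace with the article pages you want to crawl.
    start_urls = ["http://www.xxxxxx.com/article-1"]

    def parse(self, response):
        # Adjust the selector to match the site's comment markup.
        for text in response.css("div.comment ::text").getall():
            yield {"comment": text.strip()}

You can run it with scrapy runspider comments_spider.py -o comments.json and then dump the results into a .txt file, or write the lines out yourself inside parse().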
You will most likely encounter this as you go, but in some cases, if the site is employing 3rd party services for comments, like Disqus, you will find that you will not be able to pull the comments down in this manner. Just a heads up.
I've gone down this route before and have had to tailor the script to a particular site's layout/design/etc.
I've found libcurl to be extremely handy, if you don't mind doing the post-processing with Python's string-handling functions.
If you don't need to implement it purely in Python, you can make use of wget's recursive mirroring option to handle the content pull, then write your python code to parse the downloaded files.
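For example, once wget has mirrored the pages into a directory, a few lines of Python can fold all the downloaded HTML into one big .txt file (the directory and file names here are just examples):

from pathlib import Path
from bs4 import BeautifulSoup

# Assumes wget dumped its mirror into ./mirror; adjust to your setup.
with open("all_pages.txt", "w", encoding="utf-8") as out:
    for html_file in Path("mirror").rglob("*.html"):
        soup = BeautifulSoup(html_file.read_text(errors="ignore"), "html.parser")
        out.write(soup.get_text(separator="\n"))
        out.write("\n\n")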
I'll add my two cents here as well.
The first things to check are that you actually installed Beautiful Soup (typically with pip install beautifulsoup4) and that it lives somewhere Python can find it. There are all kinds of things that can go wrong here.
My experience is similar to yours: I work at a web startup, and we have a bunch of users who register, but give us no information about their job (which is actually important for us). So my idea was to scrape the homepage and the "About us" page from the domain in their email address, and try to put a learning algorithm around the data that I captured to predict their job. The results for each domain are stored as a text file.
Unfortunately (for you...sorry), the code I ended up with was a bit complicated. The problem is that you'll end up getting a lot of garbage when you do the scraping, and you'll have to filter it out. You'll also end up with encoding issues, and (assuming you want to do some learning here) you'll have to get rid of low-value words. The total code is about 1000 lines, and I'll post some important pieces that may help you out here, if you're interested.
I need to get the number of unique visitors (say, for the last 5 minutes) that are currently looking at an article, so I can display that number and sort the articles by most popular.
e.g., similar to how most forums display "There are n people viewing this thread".
How can I achieve this on Google App Engine? I am using Python 2.7.
Please try to explain in a simple way because I recently started learning programming and I am working on my first project. I don't have lots of experience. Thank you!
Create a counter (a property within an entity) and increase it transactionally for every page view. If you have more than a few page views a second, you need to look into sharded counters.
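A bare-bones version with the ndb datastore library might look like this; the model and names are just illustrative:

from google.appengine.ext import ndb

class ViewCounter(ndb.Model):
    # One entity per article, keyed by the article's id.
    count = ndb.IntegerProperty(default=0)

@ndb.transactional
def record_view(article_id):
    key = ndb.Key(ViewCounter, article_id)
    counter = key.get() or ViewCounter(key=key)
    counter.count += 1
    counter.put()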
There is no way to tell when someone stops viewing a page unless you use Javascript to inform the server when that happens. Forums etc typically assume that someone has stopped viewing a page after n minutes of inactivity, and base their figures on that.
For minimal resource use, I would suggest using memcache exclusively here. If the value gets evicted, the count will be incorrect, but the consequences of that are minimal, and other solutions will use a lot more resources.
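Roughly something like this; the key names are just illustrative, and the five-minute expiry gives an approximate figure rather than a true rolling window:

from google.appengine.api import memcache

def record_view(article_id):
    key = "views:%s" % article_id
    # Create the counter with a 5-minute expiry if it isn't there yet, then bump it.
    memcache.add(key, 0, time=300)
    memcache.incr(key, initial_value=0)

def get_views(article_id):
    return memcache.get("views:%s" % article_id) or 0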
Did you consider the Google Analytics service for getting statistics? Read this article about real-time monitoring with that service. Please note: a special script must be embedded on every page you want to monitor.