Test how my website appears to a program [closed] - python

A website may be accessed not only by a user in a browser, but also by programs, bots and crawlers. I have a website running on Google App Engine with Python whose non-static HTML pages are generated by a Python program that combines and loops over strings. However, they are also not dynamic pages in the sense that no user input is required to generate them: the Python generation is solely for convenience, brevity and ease of maintenance, and the output is determined entirely by the URL.
Some search engines cannot index dynamic pages. I would like to know whether these pages qualify as 'dynamic', i.e. whether such bots can crawl and index them for the usual metadata and content. More generally, I would like a way to check how any URL appears to a bot or crawler like the ones search engines use, so that I can tell when a certain URL is uncrawlable.
If anyone knows of any resources or techniques available, it'd be really helpful.

Some search engines cannot index dynamic pages.
Not true. Clients cannot know and do not care if the server got the content by executing a script or just reading a static file.
Most search engines won't execute client-side JavaScript. Most search engines will not submit forms.
If your content is accessible by following links (that are in the HTML), then search engines can get the pages.
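You can check this yourself by fetching the raw HTML the way a crawler does and listing the links in it. Here is a minimal sketch using only the Python standard library; the URL and the bot User-Agent string are placeholders:

    import urllib.request
    from html.parser import HTMLParser

    class LinkCollector(HTMLParser):
        """Collects href attributes from <a> tags -- roughly the set of
        links a non-JavaScript crawler could follow."""
        def __init__(self):
            super().__init__()
            self.links = []

        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for name, value in attrs:
                    if name == "href" and value:
                        self.links.append(value)

    # Hypothetical URL; the User-Agent mimics a crawler rather than a browser.
    req = urllib.request.Request(
        "http://example.com/some-page",
        headers={"User-Agent": "Mozilla/5.0 (compatible; MyTestBot/1.0)"},
    )
    with urllib.request.urlopen(req) as resp:
        html = resp.read().decode("utf-8", errors="replace")

    collector = LinkCollector()
    collector.feed(html)
    print(collector.links)  # what a crawler can discover without running JS

If the content you care about shows up in that raw HTML, a search engine can index it, regardless of how the server produced it.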

Lynx is a text-based browser that gives you a pretty good idea of how a searchbot would see your page. Ancient, tried and true.
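If you'd rather script the check than run Lynx interactively, you can shell out to its text-dump mode from Python (a small sketch, assuming lynx is installed and on your PATH; the URL is a placeholder):

    import subprocess

    # "lynx -dump" renders the page to plain text, much as a searchbot
    # would see it; "-nolist" suppresses the numbered link list.
    result = subprocess.run(
        ["lynx", "-dump", "-nolist", "http://example.com/some-page"],
        capture_output=True, text=True, check=True,
    )
    print(result.stdout)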

How to best mix HTML development with server-side Python programming? [closed]

Are there efficient practices for leveraging an HTML IDE with Python (no framework), instead of the typical approach of emitting hand-coded HTML from Python programs, which HTML IDEs can't work with?
I loved the way PHP, JSP, Classic ASP, and .NET allow you to include server-side code in HTML with <% tags. I know some think this is poor form, but I personally could stay highly organized with include files while leveraging WYSIWYG HTML IDEs for presentation polish and experimentation, as well as code IntelliSense.
FYI: I have gone the IIS ASP route, but it just isn't working anymore for anyone I could find online using the latest version of IIS (8.0). I'm completely open to other web servers/approaches, just so long as it's something efficient that would be supportable by a reputable remote web hosting provider.
Thank You!
Tim
Django (https://www.djangoproject.com/) and Flask (http://flask.pocoo.org/) both let you use template languages to manipulate and customize HTML pages. However, these differ from PHP-style systems in that the code in the HTML page relates only to how the data is presented; the bulk of the processing happens in the Python code, i.e. the controller in MVC terms.
Django's template language: https://docs.djangoproject.com/en/dev/topics/templates/
Flask's template language (Jinja2): http://jinja.pocoo.org/docs/
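For a sense of the style, here is a minimal standalone Jinja2 sketch; the template and data are made up (pip install Jinja2 first):

    from jinja2 import Template

    # The template controls presentation only; Python supplies the data.
    page = Template("<ul>{% for item in items %}<li>{{ item }}</li>{% endfor %}</ul>")
    print(page.render(items=["one", "two", "three"]))
    # -> <ul><li>one</li><li>two</li><li>three</li></ul>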
Have a look at the Mako template library. It is very easy to use, yet powerful (usage documentation).
Also, I believe that other popular template libraries can be used outside of a framework as well.
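As an illustration of that standalone usage, here is a tiny Mako sketch with made-up data (pip install Mako):

    from mako.template import Template

    # Mako templates render outside any framework; the data here is made up.
    tmpl = Template("""\
    <h1>Hello, ${name}!</h1>
    <ul>
    % for item in items:
      <li>${item}</li>
    % endfor
    </ul>""")

    print(tmpl.render(name="Tim", items=["a", "b"]))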

MongoDB aggregation REST [closed]

I'm reading around and see that it is a bad idea to have a remote application talk directly to my MongoDB, e.g. installing a MongoDB driver in a phone app. The better way is to have a REST interface on a server mediate between the database and the end user. But what about the aggregation framework?
I see Sleepy.mongoose and Eve but I cannot see anything about aggregation.
Is there any way/or REST interface which allows you to make aggregation calls (I'm interested in subdocuments)?
E.g. requesting $ curl 'http://localhost:27080/customFunction/Restaurant' and getting back all the subdocuments whose shop.kind matches Restaurant.
I'm familiar with Python and Java; is there an API framework that allows you to do that?
Before you get flagged as off-topic, as you likely will for asking for opinions rather than a specific programming question, I'll just say one bit. Hopefully it's on-topic.
I highly doubt that most such projects will go beyond being a basic CRUD adaptor giving you access to collection objects and sometimes (badly) database objects. As with their various ORM-backed counterparts, they will no doubt allow a similar query syntax to be executed from the client, so queries could be composed and sent through as JSON, which, not surprisingly, will look much like (if not identical to) the standard query syntax for MongoDB.
For myself, I prefer to roll my own, largely because you may want to implement a lot of custom behavior and actions, and in some way abstract away from having a lot of CRUD code in the client. Let's face it: you're probably just passing JSON through into the native structures you're using anyway, so it's not hard really. Anyhow, each to his own I suppose. A rough sketch of that approach follows below.
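A minimal sketch of rolling your own, using Flask and PyMongo; the database, collection and field names here are assumptions based on the question's example, not a real API:

    from flask import Flask, jsonify
    from pymongo import MongoClient

    app = Flask(__name__)
    # Hypothetical database and collection names.
    shops = MongoClient("mongodb://localhost:27017")["mydb"]["shops"]

    @app.route("/customFunction/<kind>")
    def subdocs_by_kind(kind):
        # Unwind the (assumed) "shop" subdocument array, then match on
        # shop.kind -- mirroring the curl example in the question.
        pipeline = [
            {"$unwind": "$shop"},
            {"$match": {"shop.kind": kind}},
            {"$project": {"_id": 0, "shop": 1}},
        ]
        return jsonify(results=list(shops.aggregate(pipeline)))

    if __name__ == "__main__":
        app.run(port=27080)

A client could then call it exactly as in the question: curl 'http://localhost:27080/customFunction/Restaurant'.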
There is a listing of other implementations available here:
http://docs.mongodb.org/ecosystem/tools/http-interfaces/

Scraping HTML WITHOUT unique identifiers using Python [closed]

I would like to design an algorithm, using Python, that scrapes thousands of pages like this one and this one, gathers all the data and inserts it into a MySQL database. The script will be run on a weekly or bi-weekly basis to update the database with any new information added to each individual page.
Ideally I would like a scraper that is easy to work with for table-structured data, but also for data that does not have unique identifiers (i.e. id and class attributes).
Which scraping library should I use: BeautifulSoup, Scrapy or Mechanize?
Are there any particular tutorials/books I should be looking at for this desired result?
In the long-run I will be implementing a mobile app that works with all this data through querying the database.
First thought:
(In order to save some time) Have you seen the Wayback Machine? http://archive.org/web/
Second thought:
If you are going to develop a mobile app, then the layout of this site doesn't lend itself to being put on handheld devices easily. I would suggest not bothering with the webpage portion of this. You are just going to have to dig all the information out eventually, and change your scrapers each time they change some little thing on their website.
You can get the data from their developer API in JSON or CSV format.
From the raw data you can turn it into whatever format you want (for personal use only, according to their site).
Caveats:
Pay attention to the robots.txt file on the site.
http://www.robotstxt.org/robotstxt.html
If they don't want to be scraped, they will tell you so. You can do this for personal use, but if you try making money from their content you will find yourself sued.
You could use lxml, which can take XPath specifiers. It takes a while to get used to the XPath syntax, but it's useful in cases like this.
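For example, here is a small lxml sketch that pulls cells out of a table with no id or class attributes, using positional XPath; the URL and paths are made up (pip install lxml):

    import lxml.html

    # lxml can fetch and parse a page directly from a URL (hypothetical here).
    doc = lxml.html.parse("http://example.com/stats-page").getroot()

    # Positional XPath: second cell of every row in the first table on
    # the page -- no id or class attributes required.
    values = doc.xpath("//table[1]//tr/td[2]/text()")
    print(values)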

Building a website that interacts with DB and/or XML [closed]

I'm looking to get into web development. My goal is to create some interactive webpages that interact with MS SQL databases (read/insert/updates), and also possibly sites that interact with XML files.
I've got some basic understanding of Python and Perl scripting. Can someone point me in the right direction in either of those languages to accomplish what I'm looking to do? Or, if it's easier to accomplish in another language, what would that be?
Apologies if my stated goal is too broad.
I'd strongly suggest you look into some of the web development frameworks. They take care of many of the low-level tasks needed to build a solid web page. I'm not very familiar with Perl, so I can only suggest Python frameworks, especially one of my favourites, Django. It has very good documentation, which is essential for a first-timer. I believe you should be fine as long as you follow the official documentation.
Good luck
You can use SQLAlchemy in Python, and lxml or the default ElementTree XML module for simple cases.
I have used both for a web service I maintain, and they work nicely.
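A compact sketch of both pieces; the connection string, table and XML layout are assumptions, and SQLAlchemy needs a driver such as pyodbc for MS SQL:

    import xml.etree.ElementTree as ET
    from sqlalchemy import create_engine, text

    # Hypothetical MS SQL connection string (requires the pyodbc driver).
    engine = create_engine("mssql+pyodbc://user:password@my_dsn")

    with engine.connect() as conn:
        for row in conn.execute(text("SELECT id, name FROM customers")):
            print(row.id, row.name)

    # Reading a simple XML file with the stdlib ElementTree module.
    root = ET.parse("customers.xml").getroot()
    for customer in root.findall("customer"):
        print(customer.get("id"), customer.findtext("name"))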
You can also use a web development framework. I personally suggest Flask, on the grounds that it is a lightweight framework, as opposed to Django, for instance. However, depending on your exact use case, the latter might be better.
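To show how lightweight, this is roughly the canonical minimal Flask app (pip install Flask):

    from flask import Flask

    app = Flask(__name__)

    @app.route("/")
    def index():
        return "Hello from Flask!"

    if __name__ == "__main__":
        app.run(debug=True)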

Tools for crawling popular forum/bulletin board software [closed]

I've started writing a crawler to crawl vBulletin boards. However, I am not a web programmer (JSON APIs I can do, but that isn't really web crawling), and as such I do not know what the best way to crawl is or what tools are available.
I am more than capable of writing the crawler, but I find the underlying HTML very irregular, and so I don't want to be a victim of the HTML structure changing in a newer version of vBulletin.
I'm writing an interface using pycurl and Beautiful Soup. However, is there a better way to do this? Are there any good crawlers already available for vBulletin? (Language is not a concern.) A meta forum crawler, one that works with more than one forum type, would be even better.
If you cannot suggest one, could you advise me, if you have the experience, on what I should expect from the stability of the underlying HTML? Should I worry about a new version of vBulletin breaking my crawler?
Perhaps there is a better way to extract a vBulletin dataset?
Having the HTML change is an inherent issue with web crawling. That is why it should only be an absolute last resort. Maintaining crawlers can be a huge task, as you have seen, because HTML can change daily and there are no guarantees.
Because the data usually being searched for is uniform, Scrapy is an excellent choice.
http://doc.scrapy.org/en/0.14/index.html
It uses XPath to select elements, which is relatively easy to maintain, IMO.
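A minimal Scrapy spider might look like the sketch below; the URL and the XPath are placeholders, not real vBulletin selectors:

    import scrapy  # pip install Scrapy

    class ForumSpider(scrapy.Spider):
        # Hypothetical forum URL; adjust the XPath to the board's markup.
        name = "forum"
        start_urls = ["http://forum.example.com/forumdisplay.php?f=1"]

        def parse(self, response):
            # Keeping extraction in one XPath expression means a markup
            # change after a forum upgrade is a one-line fix.
            for title in response.xpath(
                "//a[contains(@href, 'showthread')]/text()"
            ).getall():
                yield {"thread_title": title.strip()}

Run it with scrapy runspider forum_spider.py -o threads.json to get the scraped items as JSON.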
Even if there were a vBulletin-specific scraper, it would still depend on HTML, which can break at will. Because vBulletin is a platform, you are probably pretty well off scraping it; I would expect the HTML to change only on version updates, which shouldn't be that often.
Does the mobile API provide you with any functionality you need?
https://www.vbulletin.com/forum/content.php/367-API-Overview. I guess this depends on the per-site vBulletin setup.
