I have a few longer articles on my blog and would like to split them into multiple pages, e.g. by inserting manual page breaks.
Is there any plugin for this or any default template mechanism?
There are at least two related plugins in the Pelican Plugins repository:
Series
Sub-parts
I imagine you will find that one — or both in conjunction — will suit your needs.
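If it helps, enabling a plugin in Pelican is normally just a settings change. Here is a minimal sketch of the relevant pelicanconf.py lines, assuming a local checkout of the pelican-plugins repository in a plugins/ directory; the module names series and subparts are my guess, so check the repository for the exact names:

# pelicanconf.py (sketch)
PLUGIN_PATHS = ['plugins']        # path to your pelican-plugins checkout
PLUGINS = ['series', 'subparts']  # enable one or both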
Is HTML or CSS open source like PHP or Python?
For example, PHP itself:
PHP source on GitHub
You must understand the difference between the specification (documentation, idea) and the implementation (interpreter, compiler, engine). Refer to https://softwareengineering.stackexchange.com/questions/238724/what-license-is-html-released-under for more detailed answers.
HTML
HTML only provides tags around text; other than that, HTML files are practically just text files with extra markers. These tags are then parsed by your browser's parser and displayed in a certain way. The easiest example of this is probably the <strong></strong>, <i></i> and <b></b> tags, whose only purpose is to make the text within them look different. But the strong tag looks different on different browsers: some browsers make the text bold while others apply other changes. This shows you that the browser really is the interpreter, and there are not always strict rules on what tags must do. The current standard (HTML5) was created by a group called the WHATWG. See https://www.w3.org/html/ and their own website, https://html.spec.whatwg.org/multipage/. They have a GitHub repository, https://github.com/whatwg/html, and I believe you can make changes if they accept your pull requests.
CSS
CSS is also not really a language. CSS is specified by the CSS Working Group, whose members are publicly known: https://www.w3.org/Style/CSS/members. If you scroll through the list you will see that most of these people are engineers from big tech companies. There is also a GitHub repository for CSS, used for posting issues: https://github.com/w3c/csswg-drafts. I believe you can open a pull request there as well, although I am unsure.
Summary
In short, yes, they are basically open source. However, changes to either of these repositories, even when accepted and merged, will not take effect until browsers have implemented them.
Note
I am not an expert by any means; I googled most of my information, and I could be wrong about multiple things.
I am building a blog with Django. This blog is made up of posts whose content would benefit a lot from using Markdown syntax. But some of these posts also include SageMath cells, which process some Python/Sage code enclosed in special divs, as in this tutorial. The problem is that every Markdown engine parses the Python code inside those divs, so the two tools (Markdown and SageMath) come into conflict.
Which would be the smartest approach for implementing these posts?
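One common way around this kind of clash is to pull the Sage divs out before Markdown runs and put them back afterwards. A minimal sketch, assuming the python-markdown package and that cells are wrapped in a div class like "sage-cell" (both the class name and the helper names here are assumptions, adjust to your actual markup):

# Shield Sage cells from the Markdown parser, then restore them.
import re
import markdown  # pip install markdown

SAGE_CELL_RE = re.compile(r'<div class="sage-cell">.*?</div>', re.DOTALL)

def render_post(source):
    cells = []

    def stash(match):
        cells.append(match.group(0))
        return '@@SAGE_CELL_%d@@' % (len(cells) - 1)  # token Markdown leaves alone

    shielded = SAGE_CELL_RE.sub(stash, source)
    html = markdown.markdown(shielded)
    for i, cell in enumerate(cells):
        html = html.replace('@@SAGE_CELL_%d@@' % i, cell)
    return html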
I'm trying to investigate whether I can use Pelican to generate my site, and I'm having trouble seeing how to present two different lists of articles, each rendered with its own template.
My site has blog pieces, feature articles, and white papers. The index page for each of these needs to be rendered in a different style.
After a little playing around, it appears to me that I would probably need a category for each article type. Is it possible to tell the engine to render each category list using a different template?
Alternatively, and perhaps better for flexibility, is it possible to create a page that loops through a category so that I control the display of the article list? I could then solve the problem by creating a page of my own for each category. Does a page have access to the Pelican context (i.e. the catalog of content built by the generators' generate_context methods)?
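For what it's worth, the underlying idea of mapping each category to its own template is easy to express with plain Jinja2 (the engine Pelican uses internally). A minimal standalone sketch, with hypothetical template names and category mapping, not Pelican's actual machinery:

# Standalone Jinja2 illustration of "one template per category".
from jinja2 import Environment, DictLoader

TEMPLATES = {
    'blog.html': '{% for a in articles %}<p>{{ a }}</p>{% endfor %}',
    'whitepapers.html': '<ul>{% for a in articles %}<li>{{ a }}</li>{% endfor %}</ul>',
}
CATEGORY_TEMPLATES = {'blog': 'blog.html', 'white-papers': 'whitepapers.html'}
env = Environment(loader=DictLoader(TEMPLATES))

def render_category(name, articles):
    # fall back to the blog template for unmapped categories
    template = env.get_template(CATEGORY_TEMPLATES.get(name, 'blog.html'))
    return template.render(articles=articles)

print(render_category('white-papers', ['Paper A', 'Paper B']))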
I am planning to develop a web-based application which could crawl Wikipedia to find relations and store them in a database. By relations, I mean searching for a name, say 'Bill Gates', finding his page, downloading it, and pulling out the various pieces of information from the page to store in a database. Information may include his date of birth, his company, and a few other things. But I need to know whether there is any way to find these unique data on the page, so that I can store them in a database. Any specific books or algorithms would be greatly appreciated, and mentions of good open-source libraries would be helpful.
Thank you.
If you haven't already, you should have a look at DBpedia. Many categories of wiki articles have "Infoboxes" for the kinds of information you describe, and they've made a database out of it:
http://en.wikipedia.org/wiki/DBpedia
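To give a flavor of what DBpedia buys you, here is a small sketch that asks its public SPARQL endpoint for Bill Gates's date of birth (requires the requests package; the endpoint and the dbo:birthDate property are real DBpedia URIs, the rest is illustrative):

import requests

QUERY = '''
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX dbr: <http://dbpedia.org/resource/>
SELECT ?birthDate WHERE { dbr:Bill_Gates dbo:birthDate ?birthDate }
'''

resp = requests.get(
    'https://dbpedia.org/sparql',
    params={'query': QUERY, 'format': 'application/sparql-results+json'},
)
for row in resp.json()['results']['bindings']:
    print(row['birthDate']['value'])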
You might also leverage some of the information in Metaweb's Freebase (which overlaps with, and I believe may even integrate, the info from DBpedia). They have an API for querying their graph database, and there's a Python wrapper for it called freebase-python.
UPDATE: Freebase is no more; they were acquired by Google and eventually folded into the Google Knowledge Graph. There is an API, but I don't think they have anything like the formal syncing Freebase had with public sources like Wikipedia. I'm personally disappointed in how this looks to have turned out. :-/
As for the natural language processing bit, if you do make headway on that problem you might consider these databases as repositories for any information you do mine.
You mention Python and open source, so I would investigate NLTK (the Natural Language Toolkit). Text mining and natural language processing is one of those areas where you can do a lot with a dumb algorithm (e.g. pattern matching), but if you want to go a step further and do something more sophisticated, i.e. try to extract information that is stored in a flexible manner, or find information that might be interesting but is not known a priori, then natural language processing should be investigated.
NLTK is intended for teaching, so it is organized as a toolkit, an approach that suits Python very well. There are a couple of books for it as well; the O'Reilly book is also published online under an open license. See NLTK.org.
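As a taste of what NLTK gives you out of the box, a small sketch that pulls person and organization names out of a sentence (the download names match the classic NLTK model packages; newer releases may rename them):

import nltk

for pkg in ('punkt', 'averaged_perceptron_tagger', 'maxent_ne_chunker', 'words'):
    nltk.download(pkg, quiet=True)  # fetch tokenizer/tagger/chunker models once

sentence = 'Bill Gates founded Microsoft in 1975.'
tree = nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(sentence)))

for subtree in tree.subtrees():
    if subtree.label() in ('PERSON', 'ORGANIZATION'):
        print(subtree.label(), ' '.join(word for word, tag in subtree.leaves()))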
Jvc, there are existing Python modules that can do everything you mentioned above.
For pulling information from webpages, I like to use Selenium, http://seleniumhq.org/projects/ide/. Basically, you can locate and retrieve information on any webpage using a number of identifiers (id, XPath, etc.).
However, as winwaed said, it can be inflexible if you are simply "pattern matching", especially since some websites use dynamic code, meaning the identifiers can change with each subsequent reload of the page. But this problem can be solved by adding regular expressions, i.e. (.*), to your code. Check out this YouTube video: http://www.youtube.com/watch?v=Ap_DlSrT-iE. Even though he is using BeautifulSoup to scrape the website, you can see how he uses regular expressions to pull the information from the page.
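To make the regular-expression idea concrete, here is a tiny sketch; the HTML snippet is invented for illustration:

import re

html = '<div id="post_message_48291">Born: October 28, 1955</div>'

# The numeric id suffix changes per page, so match the stable prefix
# and capture the field that follows it.
match = re.search(r'id="post_message_\d+">Born:\s*(.*?)</div>', html)
if match:
    print(match.group(1))  # October 28, 1955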
Also, I'm not sure what type of database you are working with, but pyodbc, http://code.google.com/p/pyodbc/, can work with SQL databases, and also mainstream databases like Microsoft Access.
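A minimal pyodbc sketch, assuming a SQL Server driver and a facts table that you would replace with your own:

import pyodbc

# connection string and table are assumptions, adjust to your database
conn = pyodbc.connect('DRIVER={SQL Server};SERVER=localhost;'
                      'DATABASE=scrape;Trusted_Connection=yes')
cursor = conn.cursor()
cursor.execute('INSERT INTO facts (name, birth_date) VALUES (?, ?)',
               'Bill Gates', 'October 28, 1955')
conn.commit()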
So, my advice is to look into Selenium for finding the info on the webpage, pyodbc to store and retrieve it, and regular expressions for when the identifiers are dynamic.
I am working on a data-mining project for which I need to analyse the progress of discussion in a thread of a forum. I am interested in extracting information like time of post, stats of post's author (no. of posts, joining date, etc.), text of the post, etc.
However, while using standard scraping tools (like Scrapy in Python), I need to write regular expressions to detect these fields in the page's HTML source. As these tags vary with the type of forum, it is becoming a major problem to maintain the regular expressions for every forum. Is there a standard bank of such regular expressions available, so that they can be used based on the type of forum?
Or is there any other technique to extract these fields from a forum's pages?
I wrote some configuration files for some major forums. Hope you can decipher and infer how to parse them; a parsing sketch follows the examples below.
For VBulletin:
enclosed_section=tag:table,attributes:id;threadslist
thread=tag:a,attributes:id;REthread_title_
list_next_page=type:next_page,attributes:anchor_text;>
post=tag:div,attributes:id;REpost_message_
thread_next_page=type:next_page,attributes:anchor_text;>
enclosed_section is the div that contains links to all the threads.
thread is where you'll find the link to each thread.
list_next_page is the link to the next page with the list of threads.
post is the div with the post text.
thread_next_page is the link to the next page of the thread.
For Invision:
enclosed_section=tag:table,attributes:id;forum_table
thread=tag:a,attributes:class;topic_title
list_next_page=tag:a,attributes:rel;next,inside_tag_attribute:href
post=tag:div,attributes:class;post entry-content |
thread_next_page=tag:a,attributes:rel;next,inside_tag_attribute:href
post_count_section=tag:td,attributes:class;stats
post_count=tag:li,attributes:,reg_exp:(\d+) Repl
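In case the format is not obvious, here is a rough sketch of how such lines could be parsed; the grammar is inferred from the examples above, so treat it as an assumption rather than a spec:

def parse_config(text):
    # name=field:value,field:value,...  (values may embed ';' separators)
    rules = {}
    for line in text.strip().splitlines():
        name, _, spec = line.partition('=')
        fields = {}
        for part in spec.split(','):
            key, _, value = part.partition(':')
            fields[key] = value
        rules[name] = fields
    return rules

vbulletin = parse_config('''
enclosed_section=tag:table,attributes:id;threadslist
thread=tag:a,attributes:id;REthread_title_
post=tag:div,attributes:id;REpost_message_
''')
print(vbulletin['thread'])  # {'tag': 'a', 'attributes': 'id;REthread_title_'}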
You'll still have to create several approaches per forum. But as Henley suggests, there are also a lot of forums that share their structure.
As for easily parsing the dates of the forum's threads: dateparser was born from this specific requirement, and it could be of great help.
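For example, dateparser copes with both the relative and the absolute date strings forums typically emit:

import dateparser

print(dateparser.parse('12 minutes ago'))             # relative timestamp
print(dateparser.parse('January 12, 2012 10:00 PM'))  # absolute timestamp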