I want to extract the introduction part of a Wikipedia article (ignoring everything else, including tables, images and other parts). I looked at the HTML source of the articles, but I don't see any special tag that this part is wrapped in.
Can anyone give me a quick solution to this? I'm writing Python scripts.
Thanks.
You may want to check mwlib to parse the Wikipedia source.
Alternatively, use the wikidump lib.
HTML screen scraping through BeautifulSoup
Ah, there is a question already on SO on this topic:
Parsing a Wikipedia dump
How to parse/extract data from a mediawiki marked-up article via python
I think you can often get to the intro text by taking the full page, stripping out all the tables, and then looking for the first sequence of <p>...</p> blocks after the <!-- bodytext --> marker. That last bit would be this regex:
/<!-- bodytext -->.*?(<p>.*?<\/p>\s*)+/
with the re.S flag to make . match newlines.
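In Python, that approach might look something like the rough sketch below. The <!-- bodytext --> marker comes from older Wikipedia HTML and may not appear in current pages, so treat this as illustrative rather than a working extractor:

import re
import urllib.request

url = "https://en.wikipedia.org/wiki/Python_(programming_language)"
html = urllib.request.urlopen(url).read().decode("utf-8")

# Strip the tables first, then grab the first run of <p>...</p> blocks
# after the marker. re.S makes "." match newlines as well.
html = re.sub(r"<table.*?</table>", "", html, flags=re.S)
match = re.search(r"<!-- bodytext -->.*?((?:<p>.*?</p>\s*)+)", html, flags=re.S)
if match:
    print(match.group(1))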
Is HTML or CSS open source like PHP or Python?
For example, PHP itself:
PHP source link on GitHub
You must understand the difference between the specification (documentation, idea) and the implementation (interpreter, compiler, engine). Refer to https://softwareengineering.stackexchange.com/questions/238724/what-license-is-html-released-under for more detailed answers.
HTML
HTML only provides tags around text; other than that, HTML files are practically just text files with extra markers. These tags are then parsed by your browser's parser and displayed in a certain way. The easiest example of this is probably the <strong></strong>, <i></i> and <b></b> tags, whose only purpose is to make the text within them look different. But the strong tag looks different in different browsers: some browsers make the text bold while others make other changes. This shows you that the browser really is the interpreter, and there are not always strict rules on what tags must do. The current standard (HTML5) was created by a group called the WHATWG. See https://www.w3.org/html/ and their own website https://html.spec.whatwg.org/multipage/ for more information. They have a GitHub repository at https://github.com/whatwg/html, and I believe you can make changes if they accept your merge requests.
CSS
CSS is also not really a language. CSS is prescribed by the CSS Working Group, whose members are publicly known: https://www.w3.org/Style/CSS/members. If you scroll through the list you will see that most of these people are engineers from big tech companies. They do have a GitHub repository for CSS, which people use to post issues: https://github.com/w3c/csswg-drafts. I believe you can create a merge request, although I am unsure.
Summary
In short, yes, they are basically open source. However, changes to either of these repositories, even when accepted and merged, will not take effect until browsers have implemented them.
Note
I am not an expert by any means; I googled most of my information, so I could be wrong about multiple things.
Thank you for taking the time to read this.
I wanted to know if there's any way I can get a specific bit of markup from different links that are all on the same domain. I mean, if I put in many Facebook page links, it would get all their names into a text file, each one on a different line.
I think, if I understood correctly, you need the user's name from the link.
facebook.com/zuck
facebook.com/moskov
You can fetch each page and extract the pageTitle element, though this may not always be accurate:
> <title id="pageTitle">Mark Zuckerberg</title>
> <title id="pageTitle">Dustin Moskovitz</title>
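A hedged sketch of that idea, assuming the pages can be fetched without logging in and that the pageTitle element is still present (Facebook may block plain requests or change its markup):

import re
import urllib.request

links = ["https://facebook.com/zuck", "https://facebook.com/moskov"]
with open("names.txt", "w", encoding="utf-8") as out:
    for link in links:
        html = urllib.request.urlopen(link).read().decode("utf-8", "ignore")
        m = re.search(r'<title id="pageTitle">(.*?)</title>', html)
        if m:
            out.write(m.group(1) + "\n")  # one name per line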
html2text is a Python script that converts a page of HTML into clean, easy-to-read plain ASCII text. Better yet, that ASCII also happens to be valid Markdown (a text-to-HTML format).
https://github.com/Alir3z4/html2text
If you want to read the HTML from a URL, check the explanation below:
How to read html from a url in python 3
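A small sketch combining the two suggestions, fetching a page and converting it to Markdown-ish plain text (the URL is just a placeholder):

import urllib.request
import html2text

html = urllib.request.urlopen("https://example.com").read().decode("utf-8")
print(html2text.html2text(html))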
I'm fetching webpages with a bunch of JavaScript on them, and I'm interested in parsing through the JavaScript portion of the pages for certain relevant info. Right now I have the following code in Python/BeautifulSoup/regex:
scriptResults = soup('script', {'type': 'text/javascript'})
which yields an array of scripts, of which I can use a for loop to search for text I'd like:
for script in scriptResults:
    for block in script:
        if *patterniwant* in block:
            **extract pattern from line using regex**
(Text in asterisks is pseudocode, of course.)
I was wondering if there was a better way for me to just use regex to find the pattern in the soup itself, searching only through the scripts themselves? My implementation works, but it just seems really clunky so I wanted something more elegant and/or efficient and/or Pythonic.
Thanks in advance!
A lot of websites have client-side data in JSON format. In that case I would suggest extracting the JSON part from the JavaScript code and parsing it using Python's json module (e.g. json.loads). As a result you will get a standard dictionary object.
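For example, something along these lines; the variable name pageData and the pattern are assumptions you would adapt to the actual script on the site:

import json
import re

script_text = 'var pageData = {"title": "Example", "id": 42};'

# Pull out the JSON literal and hand it to the json module.
m = re.search(r"pageData\s*=\s*(\{.*?\})\s*;", script_text, flags=re.S)
if m:
    data = json.loads(m.group(1))
    print(data["title"])  # -> Example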
Another option is to check with your browser what sort of AJAX requests the application makes. Quite often these also return structured data in JSON.
I would also check whether the page has any structured data already available (e.g. OpenGraph, microformats, RDFa, RSS feeds). A lot of websites include this to improve their pages' SEO and to integrate better with social network sharing.
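OpenGraph data, for instance, lives in <meta> tags and is easy to read once the page is parsed. A sketch, assuming the page actually includes the tags:

from bs4 import BeautifulSoup

html = '<html><head><meta property="og:title" content="Example"/></head></html>'
soup = BeautifulSoup(html, "html.parser")

tag = soup.find("meta", property="og:title")
if tag:
    print(tag["content"])  # -> Example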
Can anyone point me towards a ready made RSS screen scraper, preferably in Python in order to get full text RSS feeds?
There's a good list of them here, which mentions Feed Parser, which you use like this:
import feedparser
python_wiki_rss_url = "http://www.python.org/cgi-bin/moinmoin/" \
                      "RecentChanges?action=rss_rc"
feed = feedparser.parse(python_wiki_rss_url)
You can then do things like:
for item in feed["items"]:
    print(item["title"])
feedparser.org is great
Sorry, but it doesn't exist in Python, though such scrapers do exist in PHP. You are more than welcome to use and improve the one I made, named scraped. Though it does not handle all sites, it is a recipe-based system that currently only handles the NYT, WSJ and the Economist. I am working on an all-inclusive algorithm, but it's a major undertaking. It includes a ton of analysis of the different types of HTML and XML. Even the three sites mentioned above have vastly different algorithms for scraping them, WSJ being the most complex by far. They screw their HTML up with so much useless crap, mainly just to stop you.
Here is the program I was talking about; it requires lxml, but everything is explained in the readme. It reads the config files, parses partial RSS feeds, takes the links and then scrapes those links, formulating in the end an RSS 2.0 XML file, which I mainly convert into an ebook for my Kindle. I use lxml, BeautifulSoup and feedparser.
http://tinyurl.com/yh3s9pa
You can also look at the Calibre project, which uses a similar recipe-based method to the way I do it.
Because regular expressions scare me, I'm trying to find a way to remove all HTML tags and resolve HTML entities from a string in Python.
Use lxml, which is the best XML/HTML library for Python:
import lxml.html
t = lxml.html.fromstring("...")
t.text_content()
And if you just want to sanitize the HTML, look at the lxml.html.clean module.
Use BeautifulSoup! It's perfect for this, where you have incoming markup of dubious virtue and need to get something reasonable out of it. Just pass in the original text, extract all the strings, and join them.
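Something like this, for instance; get_text() resolves the entities along the way:

from bs4 import BeautifulSoup

html = "<p>Hello, <b>world</b> &amp; friends</p>"
soup = BeautifulSoup(html, "html.parser")

print(soup.get_text())        # Hello, world & friends
print("".join(soup.strings))  # same idea, joining the strings yourself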
While I agree with Lucas that regular expressions are not all that scary, I still think you should go with a specialized HTML parser. This is because the HTML standard is hairy enough (especially if you want to parse arbitrary "HTML" pages taken off the Internet) that you would need to write a lot of code to handle the corner cases. It seems that Python includes one out of the box.
You should also check out the python bindings for TidyLib which can clean up broken HTML, making the success rate of any HTML parsing much higher.
How about parsing the HTML and extracting the data with the help of the parser?
I'd try something like what the author describes in chapter 8.3 of the Dive Into Python book.
If you use Django, you might also use
http://docs.djangoproject.com/en/dev/ref/templates/builtins/#striptags
;)
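Outside a template, the same filter is available as a plain function (assuming Django is installed; note it strips tags but does not resolve entities):

from django.utils.html import strip_tags

print(strip_tags("<p>5 &lt; 7</p>"))  # prints: 5 &lt; 7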
You might need something more complicated than a regular expression. Web pages often have angle brackets that aren't part of a tag, like this:
<div>5 < 7</div>
Stripping the tags with a regex will return the string "5 " because it treats
< 7</div>
as a single tag and strips it out.
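A quick demonstration of that failure mode with a naive tag-stripping pattern:

import re

# The <[^>]*> pattern swallows "< 7</div>" as if it were one tag.
print(re.sub(r"<[^>]*>", "", "<div>5 < 7</div>"))  # prints "5 "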
I suggest looking for already-written code that does this for you. I did a search and found this: http://zesty.ca/python/scrape.html. It can also resolve HTML entities.
Regular expressions are not scary, but writing your own regexes to strip HTML is a sure path to madness (and it won't work, either). Follow the path of wisdom, and use one of the many good HTML-parsing libraries.
Lucas' example is also broken because "sub" is not a method of a Python string. You'd have to "import re", then call re.sub(pattern, repl, string). But that's neither here nor there, as the correct answer to your question does not involve writing any regexes.
Looking at the amount of sense people are demonstrating in other answers here, I'd say that using a regex probably isn't the best idea for your situation. Go for something tried and tested, and treat my previous answer as a demonstration that regexes need not be that scary.