Python - sending unicode characters (prefixed with \u) in an HTTP POST request

I'm writing a program which fetches and edits articles on Wikipedia, and I'm having a bit of trouble handling Unicode characters prefixed with \u. I've tried .encode("utf8"), but it doesn't seem to do the trick here. How can I properly encode these values prefixed with \u to POST to Wikipedia? See this edit for my problem.
Here is some code:
To get the page:
url = "http://en.wikipedia.org/w/api.php?action=query&format=json&titles="+urllib.quote(name)+"&prop=revisions&rvprop=content"
articleContent = ClientCookie.urlopen(url).read().split('"*":"')[1].split('"}')[0].replace("\\n", "\n").decode("utf-8")
Before I POST the page:
data = dict([(key, value.encode('utf8')) for key, value in data.iteritems()])
data["text"] = data["text"].replace("\\", "")
editInfo = urllib2.Request("http://en.wikipedia.org/w/api.php", urllib.urlencode(data))

You are downloading JSON data without decoding it. Use the json library for that:
import json
articleContent = ClientCookie.urlopen(url)
data = json.load(articleContent)
JSON encoded data looks a lot like Python; it uses \u escaping as well, but it is in fact a subset of JavaScript.
The data variable now holds a deeply nested data structure. Judging by the string splitting, you wanted this piece:
articleContent = data['query']['pages'].values()[0]['revisions'][0]['*']
Now articleContent is an actual unicode() instance; it is the revision text of the page you were looking for:
>>> print u'\n'.join(data['query']['pages'].values()[0]['revisions'][0]['*'].splitlines()[:20])
{{For|the game|100 Bullets (video game)}}
{{GOCEeffort}}
{{italic title}}
{{Supercbbox <!--Wikipedia:WikiProject Comics-->
| title =100 Bullets
| image =100Bullets vol1.jpg
| caption = Cover to ''100 Bullets'' vol. 1 "First Shot, Last Call". Cover art by Dave Johnson.
| schedule = Monthly
| format =
|complete=y
|Crime = y
| publisher = [[Vertigo (DC Comics)|Vertigo]]
| date = August [[1999 in comics|1999]] – April [[2009 in comics|2009]]
| issues = 100
| main_char_team = [[Agent Graves]] <br/> [[Mr. Shepherd]] <br/> The Minutemen <br/> [[List of characters in 100 Bullets#Dizzy Cordova (also known as "The Girl")|Dizzy Cordova]] <br/> [[List of characters in 100 Bullets#Loop Hughes (also known as "The Boy")|Loop Hughes]]
| writers = [[Brian Azzarello]]
| artists = [[Eduardo Risso]]<br>Dave Johnson
| pencillers =
| inkers =
| colorists = Grant Goleash<br>[[Patricia Mulvihill]]
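On the sending side the same idea applies in reverse: urllib.urlencode needs byte strings, so encode each unicode value to UTF-8 first, just as the question's dict comprehension already does. A minimal Python 2 sketch (the payload fields are illustrative, not the exact MediaWiki edit parameters):
import urllib
import urllib2

# Hypothetical edit payload; articleContent is the unicode() text from above.
data = {"action": "edit", "title": name, "text": articleContent}
encoded = dict((key, value.encode('utf-8')) for key, value in data.iteritems())
request = urllib2.Request("http://en.wikipedia.org/w/api.php",
                          urllib.urlencode(encoded))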

Selecting Custom Field from RSS feed

So I'm trying to select the information within a custom field from an RSS file. So far I'm making use of feedparser to access the more common fields within the file. I was wondering if anyone knows how to achieve this given that there are numerous custom fields. The specific custom field I'm trying to select is the venue address.
<item>
<title>First 5 Forever babies, books and rhymes</title>
<description>Thursday, June 9, 2022, 9:30&nbsp;&ndash;&nbsp;10am <br/><br/><img src="https://www.trumba.com/i/DgBXwGE0qzQ0Bl7V7ynz3DTV.jpg" title="First 5 Forever babies, books and rhymes"
<link>https://www.brisbane.qld.gov.au/trumba?trumbaEmbed=view%3devent%26eventid%3d159467519</link>
<x-trumba:ealink>https://eventactions.com/eventactions/brisbane-city-council#/actions/030ma0wxb5wympxw4nud1vrr72</x-trumba:ealink>
<category>2022/06/09 (Thu)</category>
<guid isPermaLink="false">http://uid.trumba.com/event/159467519</guid>
<x-trumba:masterid isPermaLink="false">http://uid.trumba.com/master/159467517</x-trumba:masterid>
<xCal:summary>First 5 Forever babies, books and rhymes</xCal:summary>
<xCal:location>Nundah Library</xCal:location>
<xCal:dtstart>2022-06-08T23:30:00Z</xCal:dtstart>
<x-trumba:localstart tzAbbr="EAST" tzCode="260">2022-06-09T09:30:00</x-trumba:localstart>
<x-trumba:formatteddatetime>Thursday, June 9, 2022, 9:30 - 10am</x-trumba:formatteddatetime>
<xCal:dtend>2022-06-09T00:00:00Z</xCal:dtend>
<x-trumba:localend tzAbbr="EAST" tzCode="260">2022-06-09T10:00:00</x-trumba:localend>
<x-microsoft:cdo-alldayevent>false</x-microsoft:cdo-alldayevent>
<x-trumba:customfield name="Event Type" id="21" type="number">Library events</x-trumba:customfield>
<x-trumba:customfield name="Venue" id="22542" type="text">Nundah Library</x-trumba:customfield>
<x-trumba:customfield name="Venue address" id="22505" type="text">Nundah Library, 1 Bage Street (via Primrose Lane), Nundah</x-trumba :customfield>
<x-trumba:customfield name="Parent event" id="42212" type="text">First 5 Forever children's literacy sessions</x-trumba:customfield>
<x-trumba:customfield name="Age range" id="21858" type="text">Infants and toddlers</x-trumba:customfield>
<x-trumba:customfield name="Cost" id="22177" type="text">Free</x-trumba:customfield>
<x-trumba:customfield name="Event type" id="21859" type="text">Free</x-trumba:customfield>
<x-trumba:customfield name="Library event types" id="22496" type="text">Babies, books & rhymes,Children's literacy</x-trumba:customfield>
<x-trumba:customfield name="Event image" id="40" type="uri" imageWidth="1290" imageHeight="775">https://www.trumba.com/i/DgBXwGE0qzQ0Bl7V7ynz3DTV.jpg</x-trumba:customfield>
<x-trumba:customfield name="Age" id="23562" type="text">0-1 year olds</x-trumba:customfield>
<x-trumba:categorycalendar>Brisbane's calendar|Library events</x-trumba:categorycalendar>
Examples of the code I have used previously to retrieve information can be seen below:
blog_feed = feedparser.parse(url)
posts = blog_feed.entries
for post in posts:
    # collecting the title for each individual item in the RSS file
    title = post.title
    # selecting the entire item as "word"
    word = posts[counter]
    counter = counter + 1
    # we know that the date of the event is always stored after the code (category)
    date = word.category
After attempting to use BS4 I can successfully retrieve the address, but I am still unsure whether it is possible to use this method within a loop to find the address of each item in the RSS file and then append the address to a main list when another condition is true.
with open("brisbane-city-council.rss") as fp:
    soup = BeautifulSoup(fp, "html.parser")
address = soup.find("x-trumba:customfield", id="22505")
print(address)
Below is the for loop I am using.
for post in posts:
    # collecting the title for each individual item in the RSS file
    title = post.title
    # selecting the entire item as "word"
    word = posts[counter]
    counter = counter + 1
    # we know that the date of the event is always stored after the code (category)
    date = word.category
    # pulling down the link as it is unique for each event
    link = word.link
    # formatting the date for ease of use and to allow functionality to be completed
    date = date.split(' (')
    date = date[0]
    date = datetime.datetime.strptime(date, "%Y/%m/%d").date()
    if date > start_date and date < end_date:
        post_list.append(title)
        description = post.summary
        h = html2text.HTML2Text()
        h.ignore_links = True
        description = h.handle(description)
        description_list.append(description)
        link_list.append(link)
    else:
        continue
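A minimal sketch of how the single BeautifulSoup lookup could be moved inside a per-item loop, under the same assumptions as the snippet above (same file, html.parser, and id 22505 for the "Venue address" field); an XML-aware parser such as lxml may be more robust for RSS:
from bs4 import BeautifulSoup

with open("brisbane-city-council.rss") as fp:
    soup = BeautifulSoup(fp, "html.parser")

address_list = []
for item in soup.find_all("item"):
    # id 22505 identifies the "Venue address" custom field in this feed
    field = item.find("x-trumba:customfield", id="22505")
    if field is not None:
        address_list.append(field.get_text())
Each address could then be appended conditionally, mirroring the date check in the loop above.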

Django: Select object contains all keywords

Here is what the db looks like:
id | Post | tag
1 | Post(1) | 'a'
2 | Post(1) | 'b'
3 | Post(2) | 'a'
4 | Post(3) | 'b'
And here is the code of the model:
class PostMention(models.Model):
    tag = models.CharField(max_length=200)
    post = models.ForeignKey(Post, on_delete=models.CASCADE)
Here is the search code:
def findPostTag(tag):
    keywords = tag.split(' ')
    keyQs = [Q(tag=x) for x in keywords]
    keyQ = keyQs.pop()
    for i in keyQs:
        keyQ &= i
    a = PostMention.objects.filter(keyQ).order_by('-id')
    if not a:
        a = []
    return a
(this code does not work correctly)
I extract the tags and save each one as a row in the database. Now I want to make a search function where the user can input more than one keyword at a time, like 'a b', and have it return Post(1). I searched for some similar situations, but they all seem to be about matching multiple keywords against one row at the same time: using Q(tag='a') & Q(tag='b') searches for a tag that equals both 'a' and 'b' (as I understand it), which is not what I want (and obviously returns no result). So is there any solution to this? Thanks.
For cases like this, Django provides ManyToManyField. To make it work correctly you can use:
class Tags(models.Model):
    tag = models.CharField(max_length=200, unique=True, verbose_name='Tags')

class Post(models.Model):  # your model
    title = models.CharField(max_length=200, verbose_name='Title')
    post_tags = models.ManyToManyField(Tags, verbose_name='Choose your tags')
This way you can attach many tags to each post.
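Whichever model layout you use, the "contains all keywords" search itself is a counting problem rather than an AND of Q objects over a single row: filter the mentions whose tag is in the keyword list, group by post, and keep the posts whose distinct match count equals the number of keywords. A sketch against the question's original PostMention model (the function name is mine):
from django.db.models import Count

def find_posts_with_all_tags(tag_string):
    keywords = tag_string.split()
    # A post qualifies only if it has a distinct matching PostMention
    # row for every keyword.
    return (Post.objects
            .filter(postmention__tag__in=keywords)
            .annotate(matched=Count('postmention__tag', distinct=True))
            .filter(matched=len(keywords)))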

How to retrieve well formatted JSON from AWS Lambda using Python

I have a function in AWS Lambda that connects to the Twitter API and returns the tweets which match a specific search query I provide via the event. A simplified version of the function is below. There are a few helper functions I use, like get_secret to manage API keys and process_tweet, which limits what data gets sent back and does things like convert the created-at date to a string. The net result is that I should get back a list of dictionaries.
def lambda_handler(event, context):
    twitter_secret = get_secret("twitter")
    auth = tweepy.OAuthHandler(twitter_secret['api-key'],
                               twitter_secret['api-secret'])
    auth.set_access_token(twitter_secret['access-key'],
                          twitter_secret['access-secret'])
    api = tweepy.API(auth)
    cursor = tweepy.Cursor(api.search,
                           q=event['search'],
                           include_entities=True,
                           tweet_mode='extended',
                           lang='en')
    tweets = list(cursor.items())
    tweets = [process_tweet(t) for t in tweets if not t.retweeted]
    return json.dumps({"tweets": tweets})
From my desktop then, I have code which invokes the lambda function.
aws_lambda = boto3.client('lambda', region_name="us-east-1")
payload = {"search": "paint%20protection%20film filter:safe"}
lambda_response = aws_lambda.invoke(FunctionName="twitter-searcher",
                                    InvocationType="RequestResponse",
                                    Payload=json.dumps(payload))
results = lambda_response['Payload'].read()
tweets = results.decode('utf-8')
The problem is that somewhere between json.dumpsing the output in Lambda and reading the payload in Python, the data has gotten screwy. For example, a line break which should be \n becomes \\\\n, all of the double quotes are stored as \\" and Unicode characters are all prefixed by \\. So everything that was escaped arrived on my desktop with the escape characters themselves escaped. Consider this element of the list that was returned (with manual formatting).
'{\\"userid\\": 190764134,
\\"username\\": \\"CapitalGMC\\",
\\"created\\": \\"2018-09-02 15:00:00\\",
\\"tweetid\\": 1036267504673337344,
\\"text\\": \\"Protect your vehicle\'s paint! Find out how on this week\'s blog.
\\\\ud83d\\\\udc47\\\\n\\\\nhttps://url/XYMxPhVhdH https://url/mFL2Zv8nWW\\"}'
I can use regex to fix some problems (\\" and \\\\n) but the Unicode is tricky because even if I match it, how do I replace it with a properly escaped character? When I do this in R, using the aws.lambda package, everything is fine, no weird escaped escapes.
What am I doing wrong on my desktop with the response from AWS Lambda that's garbling the data?
Update
The process tweet function is below. It literally just pulls out the bits I care to keep, formats the datetime object to be a string and returns a dictionary.
def process_tweet(tweet):
    bundle = {
        "userid": tweet.user.id,
        "username": tweet.user.screen_name,
        "created": str(tweet.created_at),
        "tweetid": tweet.id,
        "text": tweet.full_text
    }
    return bundle
Just for reference, in R the code looks like this.
payload = list(search="paint%20protection%20film filter:safe")
results = aws.lambda::invoke_function("twitter-searcher"
,payload = jsonlite::toJSON(payload
,auto_unbox=TRUE)
,type = "RequestResponse"
,key = creds$key
,secret = creds$secret
,session_token = creds$session_token
,region = creds$region)
tweets = jsonlite::fromJSON(results)
str(tweets)
#> 'data.frame': 133 obs. of 5 variables:
#> $ userid : num 2231994854 407106716 33553091 7778772 782310 ...
#> $ username: chr "adaniel_080213" "Prestige_AdamL" "exclusivedetail" "tedhu" ...
#> $ created : chr "2018-09-12 14:07:09" "2018-09-12 11:31:56" "2018-09-12 10:46:55" "2018-09-12 07:27:49" ...
#> $ tweetid : num 1039878080968323072 1039839019989983232 1039827690151444480 1039777586975526912 1039699310382931968 ...
#> $ text : chr "I liked a #YouTube video https://url/97sRShN4pM Tesla Model 3 - Front End Package - Suntek Ultra Paint Protection Film" "Another #Corvette #ZO6 full body clearbra wrap completed using #xpeltech ultimate plus PPF ... Paint protection"| __truncated__ "We recently protected this Tesla Model 3 with Paint Protection Film and Ceramic Coating.#teslamodel3 #charlotte"| __truncated__ "Tesla Model 3 - Front End Package - Suntek Ultra Paint Protection Film https://url/AD1cl5dNX3" ...
tweets[131,]
#> userid username created tweetid
#> 131 190764134 CapitalGMC 2018-09-02 15:00:00 1036267504673337344
#> text
#> 131 Protect your vehicle's paint! Find out how on this week's blog.👇\n\nhttps://url/XYMxPhVhdH https://url/mFL2Zv8nWW
In your lambda function you should return a response object with a JSON object in the response body.
# Lambda Function
import json

def get_json(event, context):
    """Retrieve JSON from server."""
    # Business logic goes here.
    response = {
        "statusCode": 200,
        "headers": {},
        "body": json.dumps({
            "message": "This is the message in a JSON object."
        })
    }
    return response
Don't use json.dumps()
I had a similar issue, and when I just returned "body": content instead of "body": json.dumps(content), I could easily access and manipulate my data. Before that, I got that weird form that looks like JSON but isn't.
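Putting the two halves together: if the handler returns the Python structure itself and lets the Lambda runtime serialize it once, the client can decode the payload with a single json.loads. A minimal sketch of the desktop side (function name and search string taken from the question):
import json
import boto3

aws_lambda = boto3.client('lambda', region_name="us-east-1")
payload = {"search": "paint%20protection%20film filter:safe"}
lambda_response = aws_lambda.invoke(FunctionName="twitter-searcher",
                                    InvocationType="RequestResponse",
                                    Payload=json.dumps(payload))
# Payload is a stream of JSON bytes; decode it exactly once.
tweets = json.loads(lambda_response['Payload'].read())
If the handler still returns json.dumps(...), the payload is a JSON string containing JSON, so it needs a second json.loads, which is where the doubled escapes in the question come from.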

Removing the default content in nested expression

I am using the Pyparsing module and its nestedExpr function.
I want to supply a delimiter instead of the default whitespace-delimited content via the content argument of the nestedExpr function.
If I have a text such as the following
text = "{{Infobox | birth_date = {{birth date and age|mf=yes|1981|1|31}}| birth_place = ((Memphis, Tennessee|Memphis)), ((Tennessee)), U.S.| instrument = ((Beatboxing)), guitar, keyboards, vocalsprint expr.parse| genre = ((Pop music|Pop)), ((contemporary R&B|R&B))| occupation = Actor, businessman, record producer, singer| years_active = 1992–present| label = ((Jive Records|Jive)), ((RCA Records|RCA)), ((Zomba Group of Companies|Zomba))| website = {{URL|xyz.com|Official website}} }}"
When I give nestedExpr('{{','}}').parseString(text), I need the output to be the following list:
['Infobox | birth_date =' ,['birth date and age|mf=yes|1981|1|31'],'| birth_place = ((Memphis, Tennessee|Memphis)), ((Tennessee)), U.S.| instrument = ((Beatboxing)), guitar, keyboards, vocalsprint expr.parse| genre = ((Pop music|Pop)), ((contemporary R&B|R&B))| occupation = Actor, businessman, record producer, singer| years_active = 1992–present| label = ((Jive Records|Jive)), ((RCA Records|RCA)), ((Zomba Group of Companies|Zomba))| website =',[ 'URL|xyz.com|Official website' ]]
How can I give ',' or '|' as the delimiter instead of the whitespace-delimited characters? I tried giving the characters but it didn't work.
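For reference, nestedExpr accepts a content argument, so one possible sketch (untested against the full text above, and quoted strings may still be consumed by the default ignoreExpr) is to make the content expression match any run of non-brace characters, so the text is only split at nested {{...}} boundaries rather than at whitespace:
from pyparsing import CharsNotIn, nestedExpr

# Sketch: treat everything that is not a curly brace as a single content
# token, so splitting happens only at {{ ... }} nesting.
content = CharsNotIn('{}')
parser = nestedExpr('{{', '}}', content=content)

sample = "{{Infobox | birth_date = {{birth date and age|mf=yes|1981|1|31}} | website = {{URL|xyz.com}} }}"
print(parser.parseString(sample).asList())
# [['Infobox | birth_date = ', ['birth date and age|mf=yes|1981|1|31'],
#   ' | website = ', ['URL|xyz.com'], ' ']]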

Content of infobox of Wikipedia

I need to get the content of an infobox of any movie. I know the name of the movie. One way is to get the complete content of a Wikipedia page and then parse it until I find {{Infobox and then get the content of the infobox.
Is there any other way for the same using some API or parser?
I am using Python and the pywikipediabot API.
I am also familiar with the wikitools API, so if someone has a solution using wikitools instead of pywikipedia, please mention that as well.
Another great MediaWiki parser is mwparserfromhell.
In [1]: import mwparserfromhell
In [2]: import pywikibot
In [3]: enwp = pywikibot.Site('en','wikipedia')
In [4]: page = pywikibot.Page(enwp, 'Waking Life')
In [5]: wikitext = page.get()
In [6]: wikicode = mwparserfromhell.parse(wikitext)
In [7]: templates = wikicode.filter_templates()
In [8]: templates?
Type: list
String Form:[u'{{Use mdy dates|date=September 2012}}', u"{{Infobox film\n| name = Waking Life\n| im <...> critic film|waking-life|Waking Life}}', u'{{Richard Linklater}}', u'{{DEFAULTSORT:Waking Life}}']
Length: 31
Docstring:
list() -> new empty list
list(iterable) -> new list initialized from iterable's items
In [10]: templates[:2]
Out[10]:
[u'{{Use mdy dates|date=September 2012}}',
u"{{Infobox film\n| name = Waking Life\n| image = Waking-Life-Poster.jpg\n| image_size = 220px\n| alt =\n| caption = Theatrical release poster\n| director = [[Richard Linklater]]\n| producer = [[Tommy Pallotta]]<br />[[Jonah Smith]]<br />Anne Walker-McBay<br />Palmer West\n| writer = Richard Linklater\n| starring = [[Wiley Wiggins]]\n| music = Glover Gill\n| cinematography = Richard Linklater<br />[[Tommy Pallotta]]\n| editing = Sandra Adair\n| studio = [[Thousand Words]]\n| distributor = [[Fox Searchlight Pictures]]\n| released = {{Film date|2001|01|23|[[Sundance Film Festival|Sundance]]|2001|10|19|United States}}\n| runtime = 101 minutes<!--Theatrical runtime: 100:40--><ref>{{cite web |title=''WAKING LIFE'' (15) |url=http://www.bbfc.co.uk/releases/waking-life-2002-3|work=[[British Board of Film Classification]]|date=September 19, 2001|accessdate=May 6, 2013}}</ref>\n| country = United States\n| language = English\n| budget =\n| gross = $3,176,880<ref>{{cite web|title=''Waking Life'' (2001)|work=[[Box Office Mojo]] |url=http://www.boxofficemojo.com/movies/?id=wakinglife.htm|accessdate=March 20, 2010}}</ref>\n}}"]
In [11]: infobox_film = templates[1]
In [12]: for param in infobox_film.params:
   ....:     print param.name, param.value
name Waking Life
image Waking-Life-Poster.jpg
image_size 220px
alt
caption Theatrical release poster
director [[Richard Linklater]]
producer [[Tommy Pallotta]]<br />[[Jonah Smith]]<br />Anne Walker-McBay<br />Palmer West
writer Richard Linklater
starring [[Wiley Wiggins]]
music Glover Gill
cinematography Richard Linklater<br />[[Tommy Pallotta]]
editing Sandra Adair
studio [[Thousand Words]]
distributor [[Fox Searchlight Pictures]]
released {{Film date|2001|01|23|[[Sundance Film Festival|Sundance]]|2001|10|19|United States}}
runtime 101 minutes<!--Theatrical runtime: 100:40--><ref>{{cite web |title=''WAKING LIFE'' (15) |url=http://www.bbfc.co.uk/releases/waking-life-2002-3|work=[[British Board of Film Classification]]|date=September 19, 2001|accessdate=May 6, 2013}}</ref>
country United States
language English
budget
gross $3,176,880<ref>{{cite web|title=''Waking Life'' (2001)|work=[[Box Office Mojo]] |url=http://www.boxofficemojo.com/movies/?id=wakinglife.htm|accessdate=March 20, 2010}}</ref>
Don't forget that params are mwparserfromhell objects too!
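For instance, to pull a single parameter back out (a small sketch using Template.get and the parameter's Wikicode value; Python 3 print syntax):
director = infobox_film.get('director').value   # a Wikicode object
print(director.filter_wikilinks()[0].title)     # -> Richard Linklater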
Instead of reinventing the wheel, check out DBPedia, which has already extracted all Wikipedia infoboxes into an easily parsable database format.
Any infobox is a template transcluded with curly brackets. Let's have a look at a template and how it is transcluded in wikitext:
Infobox film
{{Infobox film
| name = Actresses
| image = Actrius film poster.jpg
| alt =
| caption = Catalan language film poster
| native_name = ([[Catalan language|Catalan]]: '''''Actrius''''')
| director = [[Ventura Pons]]
| producer = Ventura Pons
| writer = [[Josep Maria Benet i Jornet]]
| screenplay = Ventura Pons
| story =
| based_on = {{based on|(stage play) ''E.R.''|Josep Maria Benet i Jornet}}
| starring = {{ubl|[[Núria Espert]]|[[Rosa Maria Sardà]]|[[Anna Lizaran]]|[[Mercè Pons]]}}
| narrator = <!-- or: |narrators = -->
| music = Carles Cases
| cinematography = Tomàs Pladevall
| editing = Pere Abadal
| production_companies = {{ubl|[[Canal+|Canal+ España]]|Els Films de la Rambla S.A.|[[Generalitat de Catalunya|Generalitat de Catalunya - Departament de Cultura]]|[[Televisión Española]]}}
| distributor = [[Buena Vista International]]
| released = {{film date|df=yes|1997|1|17|[[Spain]]}}
| runtime = 100 minutes
| country = Spain
| language = Catalan
| budget =
| gross = <!--(please use condensed and rounded values, e.g. "£11.6 million" not "£11,586,221")-->
}}
There are two high-level Page methods in Pywikibot to parse the content of any template inside the wikitext content. Both use mwparserfromhell if it is installed; otherwise a regex is used, but the regex may fail for nested templates with depth > 3:
raw_extracted_templates
raw_extracted_templates is a Page property which returns a list of tuples with two items each. The first item is the template identifier as a str, 'Infobox film' for example. The second item is an OrderedDict with the template parameter identifiers as keys and their assignments as values. For example the template fields
| name = FILM TITLE
| image = FILM TITLE poster.jpg
| caption = Theatrical release poster
results in an OrderedDict as
OrderedDict([('name', 'FILM TITLE'), ('image', 'FILM TITLE poster.jpg'), ('caption', 'Theatrical release poster')])
Now how do we get it with Pywikibot?
from pprint import pprint
import pywikibot
site = pywikibot.Site('wikipedia:en') # or pywikibot.Site('en', 'wikipedia') for older Releases
page = pywikibot.Page(site, 'Actrius')
all_templates = page.raw_extracted_templates
for tmpl, params in all_templates:
    if tmpl == 'Infobox film':
        pprint(params)
This will print
OrderedDict([('name', 'Actresses'),
('image', 'Actrius film poster.jpg'),
('alt', ''),
('caption', 'Catalan language film poster'),
('native_name',
"([[Catalan language|Catalan]]: '''''Actrius''''')"),
('director', '[[Ventura Pons]]'),
('producer', 'Ventura Pons'),
('writer', '[[Josep Maria Benet i Jornet]]'),
('screenplay', 'Ventura Pons'),
('story', ''),
('based_on',
"{{based on|(stage play) ''E.R.''|Josep Maria Benet i Jornet}}"),
('starring',
'{{ubl|[[Núria Espert]]|[[Rosa Maria Sardà]]|[[Anna '
'Lizaran]]|[[Mercè Pons]]}}'),
('narrator', ''),
('music', 'Carles Cases'),
('cinematography', 'Tomàs Pladevall'),
('editing', 'Pere Abadal'),
('production_companies',
'{{ubl|[[Canal+|Canal+ España]]|Els Films de la Rambla '
'S.A.|[[Generalitat de Catalunya|Generalitat de Catalunya - '
'Departament de Cultura]]|[[Televisión Española]]}}'),
('distributor', '[[Buena Vista International]]'),
('released', '{{film date|df=yes|1997|1|17|[[Spain]]}}'),
('runtime', '100 minutes'),
('country', 'Spain'),
('language', 'Catalan'),
('budget', ''),
('gross', '')])
templatesWithParams()
This is similar to the raw_extracted_templates property, but the method returns a list of tuples, again with two items each. The first item is the template as a Page object. The second item is a list of template parameters. Have a look at the sample:
Sample code
from pprint import pprint
import pywikibot
site = pywikibot.Site('wikipedia:en') # or pywikibot.Site('en', 'wikipedia') for older Releases
page = pywikibot.Page(site, 'Actrius')
all_templates = page.templatesWithParams()
for tmpl, params in all_templates:
    if tmpl.title(with_ns=False) == 'Infobox film':
        pprint(params)
This will print the list:
['alt=',
"based_on={{based on|(stage play) ''E.R.''|Josep Maria Benet i Jornet}}",
'budget=',
'caption=Catalan language film poster',
'cinematography=Tomàs Pladevall',
'country=Spain',
'director=[[Ventura Pons]]',
'distributor=[[Buena Vista International]]',
'editing=Pere Abadal',
'gross=',
'image=Actrius film poster.jpg',
'language=Catalan',
'music=Carles Cases',
'name=Actresses',
'narrator=',
"native_name=([[Catalan language|Catalan]]: '''''Actrius''''')",
'producer=Ventura Pons',
'production_companies={{ubl|[[Canal+|Canal+ España]]|Els Films de la Rambla '
'S.A.|[[Generalitat de Catalunya|Generalitat de Catalunya - Departament de '
'Cultura]]|[[Televisión Española]]}}',
'released={{film date|df=yes|1997|1|17|[[Spain]]}}',
'runtime=100 minutes',
'screenplay=Ventura Pons',
'starring={{ubl|[[Núria Espert]]|[[Rosa Maria Sardà]]|[[Anna '
'Lizaran]]|[[Mercè Pons]]}}',
'story=',
'writer=[[Josep Maria Benet i Jornet]]']
You can get the wiki page content with pywikipediabot, and then you can search for the infobox with a regex, with a parser like mwlib [0], or even stick with pywikipediabot and use one of its template tools. For example, in textlib you'll find some functions for dealing with templates (hint: search for "# Functions dealing with templates"). [1]
[0] - http://pypi.python.org/pypi/mwlib
[1] - http://svn.wikimedia.org/viewvc/pywikipedia/trunk/pywikipedia/pywikibot/textlib.py?view=markup
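As a concrete example, textlib exposes extract_templates_and_params, which (as far as I can tell) is also what raw_extracted_templates builds on. A sketch against current pywikibot rather than the old SVN layout linked above:
import pywikibot
from pywikibot import textlib

site = pywikibot.Site('wikipedia:en')
page = pywikibot.Page(site, 'Actrius')

# Returns a list of (template name, params OrderedDict) tuples.
for name, params in textlib.extract_templates_and_params(page.text):
    if name == 'Infobox film':
        print(params.get('director'))  # [[Ventura Pons]]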
