I need to get the content of an infobox of any movie. I know the name of the movie. One way is to get the complete content of a Wikipedia page and then parse it until I find {{Infobox and then get the content of the infobox.
Is there any other way for the same using some API or parser?
I am using Python and the pywikipediabot API.
I am also familiar with the wikitools API. So if someone has a solution that uses wikitools instead of pywikipedia, please mention that as well.
Another great MediaWiki parser is mwparserfromhell.
In [1]: import mwparserfromhell
In [2]: import pywikibot
In [3]: enwp = pywikibot.Site('en','wikipedia')
In [4]: page = pywikibot.Page(enwp, 'Waking Life')
In [5]: wikitext = page.get()
In [6]: wikicode = mwparserfromhell.parse(wikitext)
In [7]: templates = wikicode.filter_templates()
In [8]: templates?
Type: list
String Form:[u'{{Use mdy dates|date=September 2012}}', u"{{Infobox film\n| name = Waking Life\n| im <...> critic film|waking-life|Waking Life}}', u'{{Richard Linklater}}', u'{{DEFAULTSORT:Waking Life}}']
Length: 31
Docstring:
list() -> new empty list
list(iterable) -> new list initialized from iterable's items
In [10]: templates[:2]
Out[10]:
[u'{{Use mdy dates|date=September 2012}}',
u"{{Infobox film\n| name = Waking Life\n| image = Waking-Life-Poster.jpg\n| image_size = 220px\n| alt =\n| caption = Theatrical release poster\n| director = [[Richard Linklater]]\n| producer = [[Tommy Pallotta]]<br />[[Jonah Smith]]<br />Anne Walker-McBay<br />Palmer West\n| writer = Richard Linklater\n| starring = [[Wiley Wiggins]]\n| music = Glover Gill\n| cinematography = Richard Linklater<br />[[Tommy Pallotta]]\n| editing = Sandra Adair\n| studio = [[Thousand Words]]\n| distributor = [[Fox Searchlight Pictures]]\n| released = {{Film date|2001|01|23|[[Sundance Film Festival|Sundance]]|2001|10|19|United States}}\n| runtime = 101 minutes<!--Theatrical runtime: 100:40--><ref>{{cite web |title=''WAKING LIFE'' (15) |url=http://www.bbfc.co.uk/releases/waking-life-2002-3|work=[[British Board of Film Classification]]|date=September 19, 2001|accessdate=May 6, 2013}}</ref>\n| country = United States\n| language = English\n| budget =\n| gross = $3,176,880<ref>{{cite web|title=''Waking Life'' (2001)|work=[[Box Office Mojo]] |url=http://www.boxofficemojo.com/movies/?id=wakinglife.htm|accessdate=March 20, 2010}}</ref>\n}}"]
In [11]: infobox_film = templates[1]
In [12]: for param in infobox_film.params:
print param.name, param.value
name Waking Life
image Waking-Life-Poster.jpg
image_size 220px
alt
caption Theatrical release poster
director [[Richard Linklater]]
producer [[Tommy Pallotta]]<br />[[Jonah Smith]]<br />Anne Walker-McBay<br />Palmer West
writer Richard Linklater
starring [[Wiley Wiggins]]
music Glover Gill
cinematography Richard Linklater<br />[[Tommy Pallotta]]
editing Sandra Adair
studio [[Thousand Words]]
distributor [[Fox Searchlight Pictures]]
released {{Film date|2001|01|23|[[Sundance Film Festival|Sundance]]|2001|10|19|United States}}
runtime 101 minutes<!--Theatrical runtime: 100:40--><ref>{{cite web |title=''WAKING LIFE'' (15) |url=http://www.bbfc.co.uk/releases/waking-life-2002-3|work=[[British Board of Film Classification]]|date=September 19, 2001|accessdate=May 6, 2013}}</ref>
country United States
language English
budget
gross $3,176,880<ref>{{cite web|title=''Waking Life'' (2001)|work=[[Box Office Mojo]] |url=http://www.boxofficemojo.com/movies/?id=wakinglife.htm|accessdate=March 20, 2010}}</ref>
Don't forget that params are mwparserfromhell objects too!
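For example, a small sketch of my own continuing the session above: each parameter's value is itself a parsed Wikicode object, so you can keep drilling into it.
# Sketch: pull a single parameter apart with mwparserfromhell methods.
director = infobox_film.get('director')           # a Parameter object
print director.value.strip_code().strip()         # plain text, e.g. Richard Linklater
for link in director.value.filter_wikilinks():    # the [[...]] links inside the value
    print link.title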
Instead of reinventing the wheel, check out DBpedia, which has already extracted all Wikipedia infoboxes into an easily parsable database format.
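For example, here is a rough sketch of my own that pulls the extracted facts for a film from DBpedia's JSON data endpoint; treat it as illustrative, since the exact property URIs returned vary per resource.
import requests

# DBpedia publishes the data extracted from each article (including its infobox) as Linked Data.
data = requests.get('http://dbpedia.org/data/Waking_Life.json').json()

# Facts about the film live under its resource URI; each value is a list of {'value': ...} dicts.
film = data['http://dbpedia.org/resource/Waking_Life']
for prop, values in film.items():
    print(prop, [v.get('value') for v in values])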
Any infobox is a template, transcluded within double curly brackets. Let's have a look at a template and how it is transcluded in wikitext:
Infobox film
{{Infobox film
| name = Actresses
| image = Actrius film poster.jpg
| alt =
| caption = Catalan language film poster
| native_name = ([[Catalan language|Catalan]]: '''''Actrius''''')
| director = [[Ventura Pons]]
| producer = Ventura Pons
| writer = [[Josep Maria Benet i Jornet]]
| screenplay = Ventura Pons
| story =
| based_on = {{based on|(stage play) ''E.R.''|Josep Maria Benet i Jornet}}
| starring = {{ubl|[[Núria Espert]]|[[Rosa Maria Sardà]]|[[Anna Lizaran]]|[[Mercè Pons]]}}
| narrator = <!-- or: |narrators = -->
| music = Carles Cases
| cinematography = Tomàs Pladevall
| editing = Pere Abadal
| production_companies = {{ubl|[[Canal+|Canal+ España]]|Els Films de la Rambla S.A.|[[Generalitat de Catalunya|Generalitat de Catalunya - Departament de Cultura]]|[[Televisión Española]]}}
| distributor = [[Buena Vista International]]
| released = {{film date|df=yes|1997|1|17|[[Spain]]}}
| runtime = 100 minutes
| country = Spain
| language = Catalan
| budget =
| gross = <!--(please use condensed and rounded values, e.g. "£11.6 million" not "£11,586,221")-->
}}
There are two high-level Page methods in Pywikibot for parsing the content of any template inside a page's wikitext. Both use mwparserfromhell if it is installed; otherwise a regex is used, which may fail for nested templates more than three levels deep:
raw_extracted_templates
raw_extracted_templates is a Page property which returns a list of tuples with two items each. The first item is the template identifier as a str, 'Infobox film' for example. The second item is an OrderedDict with the template parameter identifiers as keys and their assignments as values. For example, the template fields
| name = FILM TITLE
| image = FILM TITLE poster.jpg
| caption = Theatrical release poster
results in an OrderedDict as
OrderedDict([('name', 'FILM TITLE'), ('image', 'FILM TITLE poster.jpg'), ('caption', 'Theatrical release poster')])
Now how do we get this with Pywikibot?
from pprint import pprint
import pywikibot
site = pywikibot.Site('wikipedia:en')  # or pywikibot.Site('en', 'wikipedia') for older releases
page = pywikibot.Page(site, 'Actrius')

all_templates = page.raw_extracted_templates
for tmpl, params in all_templates:
    if tmpl == 'Infobox film':
        pprint(params)
This will print
OrderedDict([('name', 'Actresses'),
('image', 'Actrius film poster.jpg'),
('alt', ''),
('caption', 'Catalan language film poster'),
('native_name',
"([[Catalan language|Catalan]]: '''''Actrius''''')"),
('director', '[[Ventura Pons]]'),
('producer', 'Ventura Pons'),
('writer', '[[Josep Maria Benet i Jornet]]'),
('screenplay', 'Ventura Pons'),
('story', ''),
('based_on',
"{{based on|(stage play) ''E.R.''|Josep Maria Benet i Jornet}}"),
('starring',
'{{ubl|[[Núria Espert]]|[[Rosa Maria Sardà]]|[[Anna '
'Lizaran]]|[[Mercè Pons]]}}'),
('narrator', ''),
('music', 'Carles Cases'),
('cinematography', 'Tomàs Pladevall'),
('editing', 'Pere Abadal'),
('production_companies',
'{{ubl|[[Canal+|Canal+ España]]|Els Films de la Rambla '
'S.A.|[[Generalitat de Catalunya|Generalitat de Catalunya - '
'Departament de Cultura]]|[[Televisión Española]]}}'),
('distributor', '[[Buena Vista International]]'),
('released', '{{film date|df=yes|1997|1|17|[[Spain]]}}'),
('runtime', '100 minutes'),
('country', 'Spain'),
('language', 'Catalan'),
('budget', ''),
('gross', '')])
templatesWithParams()
This is similar to the raw_extracted_templates property, but the method returns a list of tuples, again with two items each. The first item is the template as a Page object. The second item is a list of template parameters. Have a look at the sample:
Sample code
from pprint import pprint
import pywikibot
site = pywikibot.Site('wikipedia:en')  # or pywikibot.Site('en', 'wikipedia') for older releases
page = pywikibot.Page(site, 'Actrius')

all_templates = page.templatesWithParams()
for tmpl, params in all_templates:
    if tmpl.title(with_ns=False) == 'Infobox film':
        pprint(params)
This will print the list:
['alt=',
"based_on={{based on|(stage play) ''E.R.''|Josep Maria Benet i Jornet}}",
'budget=',
'caption=Catalan language film poster',
'cinematography=Tomàs Pladevall',
'country=Spain',
'director=[[Ventura Pons]]',
'distributor=[[Buena Vista International]]',
'editing=Pere Abadal',
'gross=',
'image=Actrius film poster.jpg',
'language=Catalan',
'music=Carles Cases',
'name=Actresses',
'narrator=',
"native_name=([[Catalan language|Catalan]]: '''''Actrius''''')",
'producer=Ventura Pons',
'production_companies={{ubl|[[Canal+|Canal+ España]]|Els Films de la Rambla '
'S.A.|[[Generalitat de Catalunya|Generalitat de Catalunya - Departament de '
'Cultura]]|[[Televisión Española]]}}',
'released={{film date|df=yes|1997|1|17|[[Spain]]}}',
'runtime=100 minutes',
'screenplay=Ventura Pons',
'starring={{ubl|[[Núria Espert]]|[[Rosa Maria Sardà]]|[[Anna '
'Lizaran]]|[[Mercè Pons]]}}',
'story=',
'writer=[[Josep Maria Benet i Jornet]]']
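Each entry of that list is a plain 'name=value' string, so inside the loop above you can turn it into a mapping yourself; a tiny sketch of my own:
params_dict = dict(p.split('=', 1) for p in params)
print(params_dict['director'])   # [[Ventura Pons]]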
You can get the wikipage content with pywikipediabot and then search for the infobox with a regex, a parser like mwlib [0], or even stick with pywikipediabot and use one of its template tools. For example, in textlib you'll find some functions for dealing with templates (hint: search for "# Functions dealing with templates"). [1]
[0] - http://pypi.python.org/pypi/mwlib
[1] - http://svn.wikimedia.org/viewvc/pywikipedia/trunk/pywikipedia/pywikibot/textlib.py?view=markup
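For reference, here is a sketch of my own of that textlib route with a current Pywikibot release; I believe the relevant helper is textlib.extract_templates_and_params, but double-check the version you have installed:
import pywikibot
from pywikibot import textlib

site = pywikibot.Site('en', 'wikipedia')
page = pywikibot.Page(site, 'Waking Life')

# Yields (template name, params OrderedDict) pairs for every template on the page.
for name, params in textlib.extract_templates_and_params(page.get()):
    if name == 'Infobox film':
        print(params.get('director'))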
Related
I have a script that parses an XML file looking for certain attributes. However, when I try to access an attribute that doesn't exist, it throws an error. What is the best way to resolve this?
For example, this code looks for all works given by an API.
elif respCode == '4':
    works = xdoc.getElementsByTagName('works')[0]
    print " " + 'Works found: ' + str(len(works.getElementsByTagName('work'))) + ' different works'
    for work in works.getElementsByTagName('work'):
        author = work.attributes["author"].value
        title = work.attributes["title"].value
        editionCount = work.attributes["editions"].value
        date = work.attributes["lyr"].value
        format = work.attributes["format"].value
        owi = work.attributes["owi"].value
        wi = work.attributes["wi"].value
When fed the following XML file
<classify xmlns="http://classify.oclc.org">
<response code="4"/>
<!-- Classify is a product of OCLC Online Computer Library Center: http://classify.oclc.org -->
<workCount>7</workCount>
<start>0</start>
<maxRecs>25</maxRecs>
<orderBy>thold desc</orderBy>
<input type="isbn">1</input>
<works>
<work author="Barlow, Alfred E. (Alfred Ernest), 1861-1914 | Geological Survey of Canada" editions="33" format="eBook" holdings="270" hyr="2018" itemtype="itemtype-book-digital" lyr="1904" owi="12532881" schemes="DDC LCC" title="Reprint of a report on the origin, geological relations and composition of the nickel and copper deposits of the Sudbury Mining District, Ontario, Canada" wi="9090518"/>
<work author="Skillen, James W." editions="2" format="Book" holdings="237" hyr="2014" itemtype="itemtype-book" lyr="2014" owi="1361997817" schemes="DDC LCC" title="The good of politics : a biblical, historical, and contemporary introduction" wi="849787504"/>
<work author="Buchori, Binny | Buchori, Binny [Contributor] | Husain, Thamrin, 1974- | Salampessy, Zairin, 1968-" editions="4" format="Book" holdings="21" hyr="2011" itemtype="itemtype-book" lyr="2001" owi="475047565" schemes="DDC LCC" title="Ketika semerbak cengkih tergusur asap mesiu : tragedi kemanusiaan Maluku di balik konspirasi militer, kapitalis birokrat, dan kepentingan elit politik" wi="48642781"/>
<work author="Bauman, Amy" editions="3" format="Book" holdings="11" hyr="2009" itemtype="itemtype-book" lyr="2009" owi="481071496" schemes="DDC" title="Pirate's treasure : a peek-a-boo adventure" wi="615048025"/>
<work author="Stanton, Geoffrey | CfBT Education Trust" editions="3" format="eBook" holdings="9" hyr="2015" itemtype="itemtype-book-digital" lyr="2008" owi="4889708365" schemes="DDC" title="Learning matters : making the 14-19 reforms work for learners : by emphasising learning programmes as well as qualifications : by learning from previous initiatives" wi="751807280"/>
<work author="Ide, Arthur Frederick" editions="2" format="Book" holdings="5" hyr="1985" itemtype="itemtype-book" lyr="1985" owi="64427876" schemes="DDC LCC" title="Idol worshippers in America : Phyllis Schlafly, Ronald Reagan, Jerry Falwell, and the Moral Majority on women, work, and homosexuality : with a parallel translation and critical commentary on Genesis 19" wi="79169264"/>
<work editions="3" format="Book" holdings="5" hyr="2020" itemtype="itemtype-book" lyr="2020" owi="10209736909" schemes="DDC" title="52 weeks of socks" wi="1142963815"/>
</works>
</classify>
The code trips on the last element because the author attribute is not defined. How can I set my author variable to a default value if the attribute is missing from the XML?
Thanks!
You can get around this problem by using try/except blocks; your code will look something like this:
try:
    author = work.attributes["author"].value
except KeyError:
    author = 'defaultValue'
You can find more information on how try/except blocks work here.
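As an alternative sketch of my own (not part of the answer above): minidom elements also have getAttribute, which returns an empty string when the attribute is absent, so you can supply a default without an exception handler:
# getAttribute returns '' for missing attributes instead of raising
author = work.getAttribute("author") or "defaultValue"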
I have a paragraph that needs to be separated by a certain list of keywords.
Here is the text (a single string):
"Evaluation Note: Suspected abuse by own mother. Date 3/13/2019 ID: #N/A Contact: Not Specified Name: Cecilia Valore Address: 189 West Moncler Drive Home Phone: 353 273 400 Additional Information: Please tell me when the mother arrives, we will have a meeting with her next Monday, 3/17/2019 Author: social worker"
So I want to separate this paragraph based on the variable names using Python. "Evaluation Note", "Date", "ID", "Contact", "Name", "Address", "Home Phone", "Additional Information" and "Author" are the variable names. Using regex seems like a good fit, but I don't have much experience with regex.
Here is what I am trying to do:
import re
regex = r"Evaluation Note(?:\:)? (?P<note>\D+) Date(?:\:)? (?P<date>\D+)
ID(?:\:)? (?P<id>\D+) Contact(?:\:)? (?P<contact>\D+)Name(?:\:)? (? P<name>\D+)"
test_str = "Evaluation Note: Suspected abuse by own mother. Date 3/13/2019
ID: #N/A Contact: Not Specified Name: Cecilia Valore "
matches = re.finditer(regex, test_str, re.MULTILINE)
But doesn't find any patterns.
You can probably generate that regex on the fly, so long as the order of the params is fixed.
Here is my try at it; it does do the job. The actual regex it builds is something like Some Key(?P<some_key>.*)Some Other Key(?P<some_other_key>.*), and so on.
import re
test_str = r'Evaluation Note: Suspected abuse by own mother. Date 3/13/2019 ID: #N/A Contact: Not Specified Name: Cecilia Valore '
keys = ['Evaluation Note', 'Date', 'ID', 'Contact', 'Name']
def find(keys, string):
    keys = [(key, key.replace(' ', '_')) for key in keys]  # spaces aren't valid group names
    pattern = ''.join([f'{key}(?P<{name}>.*)' for key, name in keys])  # generate the actual regex
    for match in re.findall(pattern, string):
        for item in match:
            yield item.strip(':').strip()  # clean up the result

for result in find(keys, test_str):
    print(result)
Which returns:
Suspected abuse by own mother.
3/13/2019
#N/A
Not Specified
Cecilia Valore
You can use search to get the locations of the variable names and then slice the text accordingly. You can customize it easily.
import re

# 'text' holds the paragraph string from the question
en = re.compile('Evaluation Note:').search(text)
print(en.group())
d = re.compile('Date').search(text)
print(text[en.end()+1: d.start()-1])
print(d.group())
i_d = re.compile('ID:').search(text)
print(text[d.end()+1: i_d.start()-1])
print(i_d.group())
c = re.compile('Contact:').search(text)
print(text[i_d.end()+1: c.start()-1])
print(c.group())
n = re.compile('Name:').search(text)
print(text[c.end()+1: n.start()-1])
print(n.group())
ad = re.compile('Address:').search(text)
print(text[n.end()+1: ad.start()-1])
print(ad.group())
p = re.compile('Home Phone:').search(text)
print(text[ad.end()+1: p.start()-1])
print(p.group())
ai = re.compile('Additional Information:').search(text)
print(text[p.end()+1: ai.start()-1])
print(ai.group())
aut = re.compile('Author:').search(text)
print(text[ai.end()+1: aut.start()-1])
print(aut.group())
print(text[aut.end()+1:])
this will output:
Evaluation Note: Suspected abuse by own mother.
Date: 3/13/2019
ID: #N/A
Contact: Not Specified
Name: Cecilia Valore
Address: 189 West Moncler Drive
Home Phone: 353 273 400
Additional Information: Please tell me when the mother arrives, we will have a meeting with her next Monday, 3/17/2019
Author: social worker
I hope this helps
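For what it's worth, the same slicing idea can be written as a loop over the key list. This is a sketch of my own, not part of the answer above; it assumes text holds the paragraph and that every key occurs exactly once, in order:
import re

keys = ['Evaluation Note:', 'Date', 'ID:', 'Contact:', 'Name:',
        'Address:', 'Home Phone:', 'Additional Information:', 'Author:']

# Record where each key starts, then slice the text between consecutive keys.
positions = [(key, re.search(re.escape(key), text).start()) for key in keys]
positions.append(('', len(text)))
for (key, start), (_, next_start) in zip(positions, positions[1:]):
    value = text[start + len(key):next_start].strip()
    print(key, value)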
Background: I have a dataframe with individuals' names and addresses. I'm trying to catalog people associated with each person in my dataframe, so I'm running each row/record in the dataframe through an external API that returns a list of people associated with the individual. The idea is to write a series of functions that calls the API, returns the list of relatives, and appends each name in the list to a distinct column in the original dataframe. The code will eventually be parallelized.
The dataframe:
import pandas as pd
df = pd.DataFrame({
'first_name': ['Kyle', 'Ted', 'Mary', 'Ron'],
'last_name': ['Smith', 'Jones', 'Johnson', 'Reagan'],
'address': ['123 Main Street', '456 Maple Street', '987 Tudor Place', '1600 Pennsylvania Avenue']},
columns = ['first_name', 'last_name', 'address'])
The first function, which calls the API and returns a list of names:
import requests
import json
import numpy as np
from multiprocessing import Pool
def API_call(row):
    api_key = '123samplekey'
    first_name = str(row['First_Name'])
    last_name = str(row['Last_Name'])
    address = str(row['Street_Address'])
    url = ('https://apiaddress.com/' + '?first_name=' + first_name + '&last_name=' + last_name +
           '&address=' + address + '&api_key=' + api_key)
    response = requests.get(url)
    JSON = response.json()
    name_list = []
    for index, person in enumerate(JSON['people']):
        name = person.get('name')
        name_list.append(name)
    return name_list
This function works well. For each person in the dataframe, a list of family/friends is returned. So, for Kyle Smith, the function returns [Heather Smith, Dan Smith], for Ted Jones the function returns [Al Jones, Karen Jones, Tiffany Jones, Natalie Jones], and so on for each row/record in the dataframe.
Problem: I'm struggling to write a subsequent function that will iterate through the returned list and append each value to a unique column that corresponds to the searched name in the dataframe. I want the function to return a database that looks like this:
First_Name | Last_Name | Street_Address | relative1_name | relative2_name | relative3_name | relative4_name
-----------------------------------------------------------------------------------------------------------------------------
Kyle | Smith | 123 Main Street | Heather Smith | Dan Smith | |
Ted | Jones | 456 Maple Street | Al Jones | Karen Jones | Tiffany Jones | Natalie Jones
Mary | Johnson | 987 Tudor Place | Kevin Johnson | | |
Ron | Reagan | 1600 Pennsylvania Avenue | Nancy Reagan | Patti Davis | Michael Reagan | Christine Reagan
NOTE: The goal is to vectorize everything, so that I can use the apply method and eventually run the whole thing in parallel. Something along the lines of the following code has worked for me in the past, when the "API_call" function was returning a single object instead of a list that needed to be iterated/mapped:
def API_call(row):
    # all API parameters
    url = 'https://api.com/parameters'
    response = requests.get(url)
    JSON = response.json()
    single_object = JSON['key1']['key2'].get('key3')
    return single_object

def second_function(data):
    data['single_object'] = data.apply(API_call, axis=1)
    return data

def parallelize(dataframe, function):
    df_splits = np.array_split(dataframe, 10)
    pool = Pool(4)
    df_whole = pd.concat(pool.map(function, df_splits))
    pool.close()
    pool.join()
    return df_whole
parallelize(df, second_function)
The problem is I just can't write a vectorizable function (second_function) that maps names from the list returned by the API to unique columns in the original dataframe. Thanks in advance for any help!
import pandas as pd
def make_relatives_frame(relatives):
    return pd.DataFrame(data=[relatives],
                        columns=["relative%i_name" % x for x in range(1, len(relatives) + 1)])
# example output from an API call
df_names = pd.DataFrame(data=[["Kyle", "Smith"]], columns=["First_Name", "Last_Name"])
relatives = ["Heather Smith", "Dan Smith"]
df_relatives = make_relatives_frame(relatives)
df_names[df_relatives.columns] = df_relatives
# example output from another API Call with more relatives
df_names2 = pd.DataFrame(data=[["John", "Smith"]], columns=["First_Name", "Last_Name"])
relatives2 = ["Heath Smith", "Daryl Smith", "Scott Smith"]
df_relatives2 = make_relatives_frame(relatives2)
df_names2[df_relatives2.columns] = df_relatives2
# example of stacking the outputs
total_df = df_names.append(df_names2)
print total_df
The above code should get you started. Obviously it is just a representative example, but you should be able to refactor it to fit your specific use case.
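As a rough follow-on sketch of my own (not from the answer above), one way to wire this into the questioner's apply-style workflow is to build one relatives frame per row and concatenate the results; API_call here is the questioner's function and is assumed to return a list of names:
import pandas as pd

def add_relatives(data):
    # Call the API for every row, expand each returned list of names into
    # relativeN_name columns, and join the result back onto the original frame.
    relatives_frames = [make_relatives_frame(API_call(row)) for _, row in data.iterrows()]
    relatives = pd.concat(relatives_frames, ignore_index=True)
    return pd.concat([data.reset_index(drop=True), relatives], axis=1)

# result = parallelize(df, add_relatives)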
I have a text file, a few snippets of which look like this:
Page 1 of 515
Closing Report for Company Name LLC
222 N 9th Street, #100 & 200, Las Vegas, NV, 89101
File number: Jackie Grant Status: Fell Thru Primary closing party: Seller
Acceptance: 01/01/2001 Closing date: 11/11/2011 Property type: Commercial Lease
MLS number: Sale price: $200,000 Commission: $1,500.00
Notes: 08/15/2000 02:30PM by Roger Lodge This property is a Commercial Lease handled by etc..
Seller: Company Name LLC
Company name: Company Name LLC
Address: 222 N 9th Street, #100 & 200, Las Vegas, NV, 89101
Home: Pager:
Business: Fax:
Mobile: Email:
Buyer: Tomlinson, Ladainian
Address: 222 N 9th Street, #100 & 200, Las Vegas, NV, 89101
Home: Pager:
Business: 555-555-5555 Fax:
Mobile: Email:
Lessee Agent: Blank, Arthur
Company name: Sprockets Inc.
Address: 5001 Old Man Dr, North Las Vegas, NV, 89002
Home: (575) 222-3455 Pager:
Business: Fax: 999-9990
Mobile: (702) 600-3492 Email: sprockets#yoohoo.com
Leasing Agent: Van Uytnyck, Chameleon
Company name: Company Name LLC
Address:
Home: Pager:
Business: Fax: 909-222-2223
Mobile: 595-595-5959 Email:
(should be 2 spaces here.. this is not in normal text file)
Printed on Friday, June 12, 2015
Account owner: Roger Goodell
Page 2 of 515
Report for Adrian (Allday) Peterson
242 N 9th Street, #100 & 200
File number: Soap Status: Closed/Paid Primary closing party: Buyer
Acceptance: 01/10/2010 Closing date: 01/10/2010 Property type: RRR
MLS number: Sale price: $299,000 Commission: 33.00%
Seller: SOS, Bank
Address: 242 N 9th Street, #100 & 200
Home: Pager:
Business: Fax:
Mobile: Email:
Buyer: Sabel, Aaron
Address:
Home: Pager:
Business: Fax:
Mobile: Email: sia#yoohoo.com
Escrow Co: Schneider, Patty
Company name: National Football League
Address: 242 N 9th Street, #100 & 200
Home: Pager:
Business: 800-2009 Fax: 800-1100
Mobile: Email:
Buyers Agent: Munchak, Mike
Company name: Commission Group
Address:
Home: Pager:
Business: Fax:
Mobile: 483374-3892 Email: donation#yoohoo.net
Listing Agent: Ricci, Christina
Company name: Other Guys
Address:
Home: Pager:
Business: Fax:
Mobile: 888-333-3333 Email: general.adama#cylon.net
Here's my code:
import re

file = open('file-path.txt', 'r')

# if there are two or more consecutive blank lines, then we start a new entry
entries = []
curr = []
prev_blank = False
for line in file:
    line = line.rstrip('\n').strip()
    if line == '':
        if prev_blank:
            # end of the entry, append the entry
            if len(curr) > 0:
                entries.append(curr)
                print curr
            curr = []
            prev_blank = False
        else:
            prev_blank = True
    else:
        prev_blank = False
        # we need to parse the line
        line_list = line.split()
        str = ''
        start = False
        for item in line_list:
            if re.match('[a-zA-Z\s]+:.*', item):
                if len(str) > 0:
                    curr.append(str)
                str = item
                start = True
            elif start:
                str = str + ' ' + item
Here is the output:
['number: Jackie Grant', 'Status: Fell Thru Primary closing', 'Acceptance: 01/01/2001 Closing', 'date: 11/11/2011 Property', 'number: Sale', 'price: $200,000', 'Home:', 'Business:', 'Mobile:', 'Home:', 'Business: 555-555-5555', 'Mobile:', 'Home: (575) 222-3455', 'Business:', 'Mobile: (702) 600-3492', 'Home:', 'Business:', 'Mobile: 595-595-5959']
My issues are as follows:
First, there should be 2 records as output, and I'm only outputting one.
In the top block of text, my script has trouble knowing where the previous value ends, and the new one starts: 'Status: Fell Thru' should be one value, 'Primary closing party:', 'Buyer
Acceptance: 01/10/2010', 'Closing date: 01/10/2010', 'Property type: RRR', 'MLS number:', 'Sale price: $299,000', 'Commission: 33.00%' should be caught.
Once this is parsed correctly, I will need to parse again to separate keys from values (ie. 'Closing date':01/10/2010), ideally in a list of dicts.
I can't think of a better way other than using regex to pick out keys, and then grabbing the snippets of text that follow.
When complete, I'd like a csv w/a header row filled with keys, that I can import into pandas w/read_csv. I've spent quite a few hours on this one..
(This isn't a complete answer, but it's too long for a comment).
Field names can have spaces (e.g. MLS number)
Several fields can appear on each line (e.g. Home: Pager:)
The Notes field has the time in it, with a : in it
These mean you can't take your approach of identifying the fieldnames by regex. It's impossible for it to know whether "MLS" is part of the previous data value or the subsequent fieldname.
Some of the Home: Pager: lines refer to the Seller, some to the Buyer or the Lessee Agent or the Leasing Agent. This means the naive line-by-line approach I take below doesn't work either.
This is the code I was working on; it runs against your test data but gives incorrect output due to the above. It's here as a reference for the approach I was taking:
replaces = [
('Closing Report for', 'Report_for:')
,('Report for', 'Report_for:')
,('File number', 'File_number')
,('Primary closing party', 'Primary_closing_party')
,('MLS number', 'MLS_number')
,('Sale Price', 'Sale_Price')
,('Account owner', 'Account_owner')
# ...
# etc.
]
def fix_linemash(data):
    # splits many fields on one line into several lines
    results = []
    mini_collection = []
    for token in data.split(' '):
        if ':' not in token:
            mini_collection.append(token)
        else:
            results.append(' '.join(mini_collection))
            mini_collection = [token]
    return [line for line in results if line]

def process_record(data):
    # takes a collection of lines
    # fixes them, and builds a record dict
    record = {}
    for old, new in replaces:
        data = data.replace(old, new)
    for line in fix_linemash(data):
        print line
        name, value = line.split(':', 1)
        record[name.strip()] = value.strip()
    return record

records = []
collection = []
blank_flag = False
for line in open('d:/lol.txt'):
    # Read through the file collecting lines and
    # looking for double blank lines;
    # every pair of blank lines, process the stored ones and reset
    line = line.strip()
    if line.startswith('Page '): continue
    if line.startswith('Printed on '): continue
    if not line and blank_flag:  # record finished
        records.append(process_record(' '.join(collection)))
        blank_flag = False
        collection = []
    elif not line:  # maybe end of record?
        blank_flag = True
    else:  # false alarm, record continues
        blank_flag = False
        collection.append(line)

for record in records:
    print record
I'm now thinking it would be a much better idea to do some pre-processing tidyup steps over the data:
Strip out "Page n of n" and "Printed on ..." lines, and similar
Identify all valid field names, then break up the combined lines, meaning every line has one field only, fields start at the start of a line.
Run through and just process the Seller/Buyer/Agents blocks, replacing fieldnames with an identifying prefix, e.g. Email: -> Seller Email:.
Then write a record parser, which should be easy - check for two blank lines, split the lines at the first colon, use the left bit as the field name and the right bit as the value. Store however you want (nb. that dictionary keys are unordered).
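A minimal sketch of that record parser, my own illustration, assuming the pre-processing above has already produced one 'Field: value' pair per line:
def parse_record(lines):
    # lines is the list of cleaned-up lines belonging to one report
    record = {}
    for line in lines:
        if ':' not in line:
            continue
        name, value = line.split(':', 1)
        record[name.strip()] = value.strip()
    return record

# parse_record(['File number: Jackie Grant', 'Status: Fell Thru'])
# -> {'File number': 'Jackie Grant', 'Status': 'Fell Thru'}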
I suppose it is easier to start a new record by hitting the word "Page".
Just to share a little bit of my own experience: it is just too difficult to write a generalized parser.
The situation isn't that bad given the data here. Instead of using a simple list to store an entry, use an object. Add all other fields as attributes/values to the object.
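For illustration only (my own sketch, not from the answer), such an entry object might look like this:
class Entry(object):
    """One closing report; every parsed field becomes an attribute."""
    def __init__(self, fields):
        for name, value in fields.items():
            setattr(self, name.lower().replace(' ', '_'), value)

# Entry({'File number': 'Jackie Grant', 'Status': 'Fell Thru'}).file_number  ->  'Jackie Grant'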
I'm writing a program which fetches and edits articles on Wikipedia, and I'm having a bit of trouble handling Unicode characters prefixed with \u. I've tried .encode("utf8") and it doesn't seem to do the trick here. How can I properly encode these values prefixed with \u to POST to Wikipedia? See this edit for my problem.
Here is some code:
To get the page:
url = "http://en.wikipedia.org/w/api.php?action=query&format=json&titles="+urllib.quote(name)+"&prop=revisions&rvprop=content"
articleContent = ClientCookie.urlopen(url).read().split('"*":"')[1].split('"}')[0].replace("\\n", "\n").decode("utf-8")
Before I POST the page:
data = dict([(key, value.encode('utf8')) for key, value in data.iteritems()])
data["text"] = data["text"].replace("\\", "")
editInfo = urllib2.Request("http://en.wikipedia.org/w/api.php", urllib.urlencode(data))
You are downloading JSON data without decoding it. Use the json library for that:
import json
articleContent = ClientCookie.urlopen(url)
data = json.load(articleContent)
JSON-encoded data looks a lot like Python and uses \u escaping as well, but it is in fact a subset of JavaScript.
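For instance (my own illustration), json.loads decodes those \u escapes for you:
import json
json.loads('"Caf\\u00e9"')   # -> u'Café'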
The data variable now holds a deep datastructure. Judging by the string splitting, you wanted this piece:
articleContent = data['query']['pages'].values()[0]['revisions'][0]['*']
Now articleContent is an actual unicode() instance; it is the revision text of the page you were looking for:
>>> print u'\n'.join(data['query']['pages'].values()[0]['revisions'][0]['*'].splitlines()[:20])
{{For|the game|100 Bullets (video game)}}
{{GOCEeffort}}
{{italic title}}
{{Supercbbox <!--Wikipedia:WikiProject Comics-->
| title =100 Bullets
| image =100Bullets vol1.jpg
| caption = Cover to ''100 Bullets'' vol. 1 "First Shot, Last Call". Cover art by Dave Johnson.
| schedule = Monthly
| format =
|complete=y
|Crime = y
| publisher = [[Vertigo (DC Comics)|Vertigo]]
| date = August [[1999 in comics|1999]] – April [[2009 in comics|2009]]
| issues = 100
| main_char_team = [[Agent Graves]] <br/> [[Mr. Shepherd]] <br/> The Minutemen <br/> [[List of characters in 100 Bullets#Dizzy Cordova (also known as "The Girl")|Dizzy Cordova]] <br/> [[List of characters in 100 Bullets#Loop Hughes (also known as "The Boy")|Loop Hughes]]
| writers = [[Brian Azzarello]]
| artists = [[Eduardo Risso]]<br>Dave Johnson
| pencillers =
| inkers =
| colorists = Grant Goleash<br>[[Patricia Mulvihill]]