Separating text/text processing using regex - python

I have a paragraph that needs to be separated by a certain list of keywords.
Here is the text (a single string):
"Evaluation Note: Suspected abuse by own mother. Date 3/13/2019 ID: #N/A Contact: Not Specified Name: Cecilia Valore Address: 189 West Moncler Drive Home Phone: 353 273 400 Additional Information: Please tell me when the mother arrives, we will have a meeting with her next Monday, 3/17/2019 Author: social worker"
I want to split this paragraph by field name using Python. The field names are "Evaluation Note", "Date", "ID", "Contact", "Name", "Address", "Home Phone", "Additional Information" and "Author". Regex seems like a good fit, but I don't have much experience with it.
Here is what I am trying to do:
import re
regex = r"Evaluation Note(?:\:)? (?P<note>\D+) Date(?:\:)? (?P<date>\D+) ID(?:\:)? (?P<id>\D+) Contact(?:\:)? (?P<contact>\D+)Name(?:\:)? (?P<name>\D+)"
test_str = "Evaluation Note: Suspected abuse by own mother. Date 3/13/2019 ID: #N/A Contact: Not Specified Name: Cecilia Valore "
matches = re.finditer(regex, test_str, re.MULTILINE)
But it doesn't find any matches.

You can probably generate that regex on the fly, as long as the order of the fields is fixed.
Here is my attempt, which does the job. The regex it generates looks like Some Key(?P<Some_Key>.*)Some Other Key(?P<Some_Other_Key>.*), and so on.
import re

test_str = r'Evaluation Note: Suspected abuse by own mother. Date 3/13/2019 ID: #N/A Contact: Not Specified Name: Cecilia Valore '
keys = ['Evaluation Note', 'Date', 'ID', 'Contact', 'Name']

def find(keys, string):
    # spaces aren't valid in group names, so derive a group name for each key
    named = [(key, key.replace(' ', '_')) for key in keys]
    # generate the actual regex; re.escape keeps the keys literal
    pattern = ''.join(f'{re.escape(key)}(?P<{name}>.*)' for key, name in named)
    for groups in re.findall(pattern, string):
        for item in groups:
            yield item.strip(':').strip()  # clean up the result

for value in find(keys, test_str):
    print(value)
Which returns:
Suspected abuse by own mother.
3/13/2019
#N/A
Not Specified
Cecilia Valore

You can use search to find the position of each field name and slice the text between them. This is easy to customize.
import re
# text is the paragraph from the question, as a single string
en = re.compile('Evaluation Note:').search(text)
print(en.group())
d = re.compile('Date').search(text)
print(text[en.end()+1: d.start()-1])
print(d.group())
i_d = re.compile('ID:').search(text)
print(text[d.end()+1: i_d.start()-1])
print(i_d.group())
c = re.compile('Contact:').search(text)
print(text[i_d.end()+1: c.start()-1])
print(c.group())
n = re.compile('Name:').search(text)
print(text[c.end()+1: n.start()-1])
print(n.group())
ad = re.compile('Address:').search(text)
print(text[n.end()+1: ad.start()-1])
print(ad.group())
p = re.compile('Home Phone:').search(text)
print(text[ad.end()+1: p.start()-1])
print(p.group())
ai = re.compile('Additional Information:').search(text)
print(text[p.end()+1: ai.start()-1])
print(ai.group())
aut = re.compile('Author:').search(text)
print(text[ai.end()+1: aut.start()-1])
print(aut.group())
print(text[aut.end()+1:])
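This will output the following (each label and its value are printed on separate lines; they are joined here for brevity):
Evaluation Note: Suspected abuse by own mother.
Date 3/13/2019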
ID: #N/A
Contact: Not Specified
Name: Cecilia Valore
Address: 189 West Moncler Drive
Home Phone: 353 273 400
Additional Information: Please tell me when the mother arrives, we will have a meeting with her next Monday, 3/17/2019
Author: social worker
I hope this helps
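Both answers above depend on the fields appearing in a fixed order. A more general sketch, assuming only that the list of field names is known in advance and that no value contains one of those names as a whole word, is to split on an alternation of the names with re.split. The helper name split_fields is made up for this example:

```python
import re

text = ("Evaluation Note: Suspected abuse by own mother. Date 3/13/2019 ID: #N/A "
        "Contact: Not Specified Name: Cecilia Valore Address: 189 West Moncler Drive "
        "Home Phone: 353 273 400 Additional Information: Please tell me when the mother "
        "arrives, we will have a meeting with her next Monday, 3/17/2019 Author: social worker")

keys = ["Evaluation Note", "Date", "ID", "Contact", "Name", "Address",
        "Home Phone", "Additional Information", "Author"]

def split_fields(text, keys):
    # Longest names first, so a longer name wins over a shorter one that is its prefix
    alternation = "|".join(re.escape(k) for k in sorted(keys, key=len, reverse=True))
    # A name counts as a label only as a whole word; the colon after it is optional
    label = re.compile(rf"\b({alternation})\b:?\s*")
    # With one capturing group, re.split returns [prefix, name1, value1, name2, value2, ...]
    parts = label.split(text)
    return {name: value.strip() for name, value in zip(parts[1::2], parts[2::2])}

fields = split_fields(text, keys)
for name, value in fields.items():
    print(f"{name}: {value}")
```

Because the result is a dict keyed by field name, the order of the fields in the text no longer matters. The trade-off is the assumption stated above: a value that happens to contain a word such as "Date" would be split at that word.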

Related

How to extract certain paragraph from text file

def extract_book_info(self):
    books_info = []
    for file in os.listdir(self.book_folder_path):
        title = "None"
        author = "None"
        release_date = "None"
        last_update_date = "None"
        language = "None"
        producer = "None"
        with open(self.book_folder_path + file, 'r', encoding = 'utf-8') as content:
            book_info = content.readlines()
            for lines in book_info:
                if lines.startswith('Title'):
                    title = lines.strip().split(': ')
                elif lines.startswith('Author'):
                    try:
                        author = lines.strip().split(': ')
                    except IndexError:
                        author = 'Empty'
                elif lines.startswith('Release date'):
                    release_date = lines.strip().split(': ')
                elif lines.startswith('Last updated'):
                    last_update_date = lines.strip().split(': ')
                elif lines.startswith('Produce by'):
                    producer = lines.strip().split(': ')
                elif lines.startswith('Language'):
                    language = lines.strip().split(': ')
                elif lines.startswith('***'):
                    pass
        books_info.append(Book(title, author, release_date, last_update_date, producer, language, self.book_folder_path))
    with open(self.book_info_path, 'w', encoding="utf-8") as book_file:
        for book_info in books_info:
            book_file.write(book_info.__str__() + "\n")
I am using this code to extract the book title, author, release date, last update date, language and producer (plus the book path).
This is the output I get:
['Title', 'The Adventures of Sherlock Holmes'];;;['Author', 'Arthur Conan Doyle'];;;None;;;None;;;None;;;['Language', 'English'];;;data/books_data/;;;
What should I change to get the following output instead?
The Adventures of Sherlock Holmes;;;Arthur Conan Doyle;;;November29,2002;;;May20,2019;;;English;;;
This is the example of input:
Title: The Adventures of Sherlock Holmes
Author: Arthur Conan Doyle
Release Date: November 29, 2002 [eBook #1661]
[Most recently updated: May 20, 2019]
Language: English
Character set encoding: UTF-8
Produced by: an anonymous Project Gutenberg volunteer and Jose Menendez
*** START OF THE PROJECT GUTENBERG EBOOK THE ADVENTURES OF SHERLOCK HOLMES ***
cover

str.split returns a list, but you are assigning that whole list to a variable that should hold a single value.
'Title: Sherlock Holmes'.split(': ') # => ['Title', 'Sherlock Holmes']
From your requirement, I gather that you want the second element of the split result every time. You can get it like this:
...
for lines in book_info:
    if lines.startswith('Author'):
        _, author = lines.strip().split(': ', 1)
    elif...
Be careful: the unpacking raises a ValueError if the split does not produce exactly two elements, for example when the line contains no ': ' at all. (Guarding against a missing second element is presumably why there is a try around the author field in your code.)
Also, avoid calling __str__ directly. The str() function calls it for you, so use str(book_info) instead.
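As a rough sketch of the same idea applied to the whole header, the key/value pairs can be collected into a dict instead of one variable per field. The function name parse_header is hypothetical, and the bracket handling is an assumption based only on the sample input in the question:

```python
def parse_header(lines):
    """Collect 'Key: value' pairs until the '***' marker line."""
    fields = {}
    for line in lines:
        line = line.strip()
        if line.startswith('***'):
            break  # the header ends where the book text starts
        # Some lines are wrapped in square brackets, e.g. "[Most recently updated: ...]"
        if line.startswith('[') and line.endswith(']'):
            line = line[1:-1]
        # partition never raises; sep is empty when ': ' is absent
        key, sep, value = line.partition(': ')
        if sep:
            fields[key] = value
    return fields

sample = """Title: The Adventures of Sherlock Holmes
Author: Arthur Conan Doyle
Release Date: November 29, 2002 [eBook #1661]
[Most recently updated: May 20, 2019]
Language: English
Character set encoding: UTF-8
Produced by: an anonymous Project Gutenberg volunteer and Jose Menendez
*** START OF THE PROJECT GUTENBERG EBOOK THE ADVENTURES OF SHERLOCK HOLMES ***
cover""".splitlines()

info = parse_header(sample)
# Missing fields fall back to the same "None" placeholder used in the question
print(';;;'.join(info.get(k, 'None') for k in ('Title', 'Author', 'Language')))
```

Looking fields up with info.get(key, 'None') also sidesteps the mismatch between the startswith strings in the question ('Release date', 'Produce by') and the actual labels in the file ('Release Date', 'Produced by').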

Split Australian addresses into street_address, suburb, state and postcode

I have scraped addresses from a website, but their format is not consistent. For instance:
address = '139 McKinnon Road, PINELANDS, NT, 829'
address_2 = '108 East Point Road, Fannie Bay, NT, 820'
address_3 = '3-11 Hamilton Street, Townsville City, QLD, 4810'
I have tried to split them by space ' ' but couldn't get the desired result.
I have tried:
if "," in address:
    raw_address = address.split(",")
    splitted_address = [
        adr for adr in raw_address if not adr.islower() and not adr.isupper()
    ]
    splitted_suburb = [adr for adr in raw_address if adr.isupper()]
    item["Street_Address"] = splitted_address[0].strip()
    item["Suburb"] = splitted_address[1].strip()
    item["State"] = splitted_suburb[0].strip()
    item["Postcode"] = splitted_address[2].strip()
else:
    raw_address = address.split(" ")
    splitted_address = [
        adr for adr in raw_address if not adr.islower() and not adr.isupper()
    ]
    splitted_suburb = [adr for adr in raw_address if adr.isupper()]
    item["Street_Address"] = " ".join(splitted_address[:-1])
    item["Suburb"] = splitted_suburb[0]
    item["State"] = splitted_suburb[1]
    item["Postcode"] = splitted_address[-1]
And my desired output should be like this:
Street_Address,Suburb,State,Postcode
Units 1-14, 29 Wiltshire Lane, DELACOMBE, VIC, 3356
How can I split the full address into these specific fields?
Update:
I have parsed out the desired fields using regex pattern:
regex_str = "(^.*?(?:Lane|Street|Boulevard|Crescent|Place|Road|Highway|Avenue|Drive|Circuit|Parade|Telopea|Nicklin Way|Terrace|Square|Court|Close|Endeavour Way|Esplanade|East|The Centreway|Mall|Quay|Gateway|Low Way|Point|Rd|Morinda|Way|Ave|St|South Steyne|Broadway|HQ|Expressway|Strett|Castlereagh|Meadow Way|Track|Kulkyne Way|Narabang Way|Bank)),? ?(.*?),? ?([A-Z]{3}),? ?(\d{,4})$"
matches = re.search(regex_str, full_address)
street, suburb, state, postcode = matches.groups()
item["Street_Address"] = street
item["Suburb"] = suburb
item["State"] = state
item["Postcode"] = postcode
It works for some addresses, such as address_3, but for address and address_2 the pattern does not match and I get a NoneType error:
File "colliers_sale.py", line 164, in parse_details
street, suburb, state, postcode = matches.groups()
AttributeError: 'NoneType' object has no attribute 'groups'
How can I fix this?

You can use regular expressions, but you will probably need more than one pattern. Something like this:
import re
match = None
if (match := re.search(r'(.*?\d+-\d+),? (.+?) ([A-Z ]+) ([A-Z]+) (\d+)$', address)):
    pass  # this matches address, address_3, address_4
elif (match := re.search(r'(\d+-\d+) (.+?), (.+?), ([A-Z]+), (\d+)$', address)):
    pass  # this matches address_2
# elif (...another pattern...)
if match:
    print(match[1], match[2], match[3], match[4], match[5], sep=' # ')
else:
    print('nothing matched')

Try the re package. You can do it using regular expressions like this:
import re
address = 'Units 1-14, 29 Wiltshire Lane DELACOMBE VIC 3356'
address_2 = '3-11 Hamilton Street, Townsville City, QLD, 4810'
address_3 = '6-10 Mount Street MOUNT DRUITT NSW 2770'
address_4 = '34-36 Fairfield Street FAIRFIELD EAST NSW 2165'
addresses = [address, address_2, address_3, address_4]
for add in addresses:
    print(', '.join(re.findall(r"(.*\d+-\d+)[, ]+(\w*\s*\w+\s+\w+)[, ]+(\w*\s*\w+)[, ]+(\w+)[, ]+(\d+)", add)[0]))
The parentheses in the pattern make re.findall capture the wanted parts.
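One likely reason the pattern in the question returns None for the two NT addresses is that ([A-Z]{3}) requires a three-letter state, while NT has only two letters. For the fully comma-separated format shown in the question, a simpler sketch is to split from the right, so that extra commas inside the street part do no harm. The function name split_address and the validation rules (2 to 3 capital letters, 3 to 4 digits) are assumptions made for this example:

```python
import re

def split_address(address):
    # Split at most three times from the right: street, suburb, state, postcode
    parts = [p.strip() for p in address.rsplit(",", 3)]
    if len(parts) != 4:
        return None  # not in the comma-separated format
    street, suburb, state, postcode = parts
    # Sanity checks: states like NT or QLD, postcodes like 829 or 4810
    if not re.fullmatch(r"[A-Z]{2,3}", state):
        return None
    if not re.fullmatch(r"\d{3,4}", postcode):
        return None
    return {
        "Street_Address": street,
        "Suburb": suburb,
        "State": state,
        "Postcode": postcode,
    }

for address in [
    "139 McKinnon Road, PINELANDS, NT, 829",
    "3-11 Hamilton Street, Townsville City, QLD, 4810",
    "Units 1-14, 29 Wiltshire Lane, DELACOMBE, VIC, 3356",
]:
    print(split_address(address))
```

Addresses without commas, such as '6-10 Mount Street MOUNT DRUITT NSW 2770', return None here and still need one of the regex patterns from the answers above.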

Use regex to match multiple words in sequence

I am trying to create a function that will return a string from the text based on these conditions:
If 'recurring payment authorized on' in the string, get the 1st text after 'on'
If 'recurring payment' in the string, get everything before
Currently I have written the following:
# will be used in an apply statement for a column in a dataframe
def parser(x):
    x_list = x.split()
    if " recurring payment authorized on " in x and x_list[-1] != "on":
        return x_list[x_list.index("on")+1]
    elif " recurring payment" in x:
        return ' '.join(x_list[:x_list.index("recurring")])
    else:
        return None
However this code looks awkward and is not robust. I want to use regex to match those strings.
Here are some examples of what this function should return:
recurring payment authorized on usps abc should return usps
usps recurring payment abc should return usps
Any help on writing a regex for this function will be appreciated. The input string will only contain text; there will be no digits or special characters.

Using regex with lookahead and lookbehind pattern matching
import re
def parser(x):
    # Patterns to search
    pattern_on = re.compile(r'(?<= authorized on )(.*?)(\s+)')
    pattern_recur = re.compile(r'^(.*?)\s(?=recurring payment)')
    m = pattern_on.search(x)
    if m:
        return m.group(1)  # group(1) excludes the trailing whitespace
    m = pattern_recur.search(x)
    if m:
        return m.group(1)
    return None
tests = ["recurring payment authorized on usps abc", "usps recurring payment abc", "recurring payment authorized on att xxx xxx", "recurring payment authorized on 25.05.1980 xxx xxx", "att recurring payment xxxxx", "12.14.14. att recurring payment xxxxx"]
for t in tests:
    found = parser(t)
    if found:
        print("In text: {}\n Found: {}".format(t, found))
Output
In text: recurring payment authorized on usps abc
Found: usps
In text: usps recurring payment abc
Found: usps
In text: recurring payment authorized on att xxx xxx
Found: att
In text: recurring payment authorized on 25.05.1980 xxx xxx
Found: 25.05.1980
In text: att recurring payment xxxxx
Found: att
In text: 12.14.14. att recurring payment xxxxx
Found: 12.14.14. att
Explanation
Lookahead and Lookbehind pattern matching
Regex Lookbehind
(?<=foo) Lookbehind Asserts that what immediately precedes the current
position in the string is foo
So in pattern: r'(?<= authorized on )(.*?)(\s+)'
foo is " authorized on "
(.*?) - matches any character (? causes it not to be greedy)
(\s+) - matches at least one whitespace
So the above causes (.*?) to capture all characters after " authorized on " until the first whitespace character.
Regex Lookahead
(?=foo) Lookahead Asserts that what immediately follows the current position in the string is foo
So with: r'^(.*?)\s(?=recurring payment)'
foo is 'recurring payment'
^ - means at beginning of the string
(.*?) - matches any character (non-greedy)
\s - matches white space
Thus, (.*?) will match all characters from beginning of string until we get whitespace followed by "recurring payment"
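A small check of the lookbehind pattern makes the difference between the whole match and the first capturing group visible. The variable names whole, token and no_trailing are only for this illustration:

```python
import re

pattern_on = re.compile(r'(?<= authorized on )(.*?)(\s+)')

m = pattern_on.search('recurring payment authorized on usps abc')
whole = m.group(0)  # everything the pattern consumed; the lookbehind text is not included
token = m.group(1)  # only what (.*?) captured
print(repr(whole), repr(token))

# (\s+) demands whitespace after the token, so a token at the very end does not match
no_trailing = pattern_on.search('recurring payment authorized on usps')
print(no_trailing)
```

The whole match carries the trailing whitespace consumed by (\s+), while group 1 does not, so group 1 is the cleaner value to return. If the token can also appear at the end of the string, a pattern such as (\S+) after the lookbehind is one way to handle both cases.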
Better Performance
This is desirable since you're applying the function to a DataFrame column, which may have lots of rows.
Take the pattern compilation out of the parser and place it at module level (33% reduction in time).
def parser(x):
    # Use predefined patterns (pattern_on, pattern_recur) from globals
    m = pattern_on.search(x)
    if m:
        return m.group(1)
    m = pattern_recur.search(x)
    if m:
        return m.group(1)
    return None
# Define patterns to search
pattern_on = re.compile(r'(?<= authorized on )(.*?)(\s+)')
pattern_recur = re.compile(r'^(.*?)\s(?=recurring payment)')
tests = ["recurring payment authorized on usps abc", "usps recurring payment abc", "recurring payment authorized on att xxx xxx", "recurring payment authorized on 25.05.1980 xxx xxx", "att recurring payment xxxxx", "12.14.14. att recurring payment xxxxx"]

I am not sure that this level of complexity requires regex.
If regex is not a strict requirement for you, here's a solution that does not use it:
examples = [
    'stuff more stuff recurring payment authorized on ExampleA useless data',
    'other useless data ExampleB recurring payment',
    'Not matching phrase payment example authorized'
]
def extract_data(phrase):
    result = None
    if "recurring payment authorized on" in phrase:
        result = phrase.split("recurring payment authorized on")[1].split()[0]
    elif "recurring payment" in phrase:
        result = phrase.split("recurring payment")[0]
    return result
for example in examples:
    print(extract_data(example))
Output
ExampleA
other useless data ExampleB
None

Not sure if this is any faster, but Python's regex syntax supports conditionals:
If authorized on is present then
    match the next substring of non-space characters
else
    match everything that occurs before recurring
Note that the result will be in capturing group 2 or 3 depending on which matched.
import re
def xstr(s):
    if s is None:
        return ''
    return str(s)
def parser(x):
    # Pattern to search
    pattern = re.compile(r"(authorized\son\s)?(?(1)(\S+)|(^.*) recurring)")
    m = pattern.search(x)
    if m:
        return xstr(m.group(2)) + xstr(m.group(3))
    return None
tests = ["recurring payment authorized on usps abc", "usps recurring payment abc", "recurring payment authorized on att xxx xxx", "recurring payment authorized on 25.05.1980 xxx xxx", "att recurring payment xxxxx", "12.14.14. att recurring payment xxxxx"]
for t in tests:
    found = parser(t)
    if found:
        print("In text: {}\n Found: {}".format(t, found))

You can do this with a single regex and without explicit lookahead or lookbehind.
Please let me know if this works, and how it performs against @DarryIG's solution.
import re
from collections import namedtuple
ResultA = namedtuple('ResultA', ['s'])
ResultB = namedtuple('ResultB', ['s'])
RX = re.compile(r'((?P<init>.*) )?recurring payment ((authorized on (?P<authorized>\S+))|(?P<rest>.*))')
def parser(x):
    '''https://stackoverflow.com/questions/59600852/use-regex-to-match-multiple-words-in-sequence
    >>> parser('recurring payment authorized on usps abc')
    ResultB(s='usps')
    >>> parser('usps recurring payment abc')
    ResultA(s='usps')
    >>> parser('recurring payment authorized on att xxx xxx')
    ResultB(s='att')
    >>> parser('recurring payment authorized on 25.05.1980 xxx xxx')
    ResultB(s='25.05.1980')
    >>> parser('att recurring payment xxxxx')
    ResultA(s='att')
    >>> parser('12.14.14. att recurring payment xxxxx')
    ResultA(s='12.14.14. att')
    '''
    m = RX.match(x)
    if m is None:
        return None  # consider ValueError
    recurring = m.groupdict()['init'] or m.groupdict()['rest']
    authorized = m.groupdict()['authorized']
    if (recurring is None) == (authorized is None):
        raise ValueError('invalid input')
    if recurring is not None:
        return ResultA(recurring)
    else:
        return ResultB(authorized)

Parsing unstructured text file with Python

I have a text file, a few snippets of which look like this:
Page 1 of 515
Closing Report for Company Name LLC
222 N 9th Street, #100 & 200, Las Vegas, NV, 89101
File number: Jackie Grant Status: Fell Thru Primary closing party: Seller
Acceptance: 01/01/2001 Closing date: 11/11/2011 Property type: Commercial Lease
MLS number: Sale price: $200,000 Commission: $1,500.00
Notes: 08/15/2000 02:30PM by Roger Lodge This property is a Commercial Lease handled by etc..
Seller: Company Name LLC
Company name: Company Name LLC
Address: 222 N 9th Street, #100 & 200, Las Vegas, NV, 89101
Home: Pager:
Business: Fax:
Mobile: Email:
Buyer: Tomlinson, Ladainian
Address: 222 N 9th Street, #100 & 200, Las Vegas, NV, 89101
Home: Pager:
Business: 555-555-5555 Fax:
Mobile: Email:
Lessee Agent: Blank, Arthur
Company name: Sprockets Inc.
Address: 5001 Old Man Dr, North Las Vegas, NV, 89002
Home: (575) 222-3455 Pager:
Business: Fax: 999-9990
Mobile: (702) 600-3492 Email: sprockets@yoohoo.com
Leasing Agent: Van Uytnyck, Chameleon
Company name: Company Name LLC
Address:
Home: Pager:
Business: Fax: 909-222-2223
Mobile: 595-595-5959 Email:
(should be 2 spaces here.. this is not in normal text file)
Printed on Friday, June 12, 2015
Account owner: Roger Goodell
Page 2 of 515
Report for Adrian (Allday) Peterson
242 N 9th Street, #100 & 200
File number: Soap Status: Closed/Paid Primary closing party: Buyer
Acceptance: 01/10/2010 Closing date: 01/10/2010 Property type: RRR
MLS number: Sale price: $299,000 Commission: 33.00%
Seller: SOS, Bank
Address: 242 N 9th Street, #100 & 200
Home: Pager:
Business: Fax:
Mobile: Email:
Buyer: Sabel, Aaron
Address:
Home: Pager:
Business: Fax:
Mobile: Email: sia@yoohoo.com
Escrow Co: Schneider, Patty
Company name: National Football League
Address: 242 N 9th Street, #100 & 200
Home: Pager:
Business: 800-2009 Fax: 800-1100
Mobile: Email:
Buyers Agent: Munchak, Mike
Company name: Commission Group
Address:
Home: Pager:
Business: Fax:
Mobile: 483374-3892 Email: donation@yoohoo.net
Listing Agent: Ricci, Christina
Company name: Other Guys
Address:
Home: Pager:
Business: Fax:
Mobile: 888-333-3333 Email: general.adama@cylon.net
Here's my code:
import re
file = open('file-path.txt','r')
# if there are two or more consecutive blank lines, then we start a new Entry
entries = []
curr = []
prev_blank = False
for line in file:
    line = line.rstrip('\n').strip()
    if (line == ''):
        if prev_blank == True:
            # end of the entry, append the entry
            if(len(curr) > 0):
                entries.append(curr)
                print(curr)
                curr = []
                prev_blank = False
        else:
            prev_blank = True
    else:
        prev_blank = False
        # we need to parse the line
        line_list = line.split()
        str = ''
        start = False
        for item in line_list:
            if re.match('[a-zA-Z\s]+:.*',item):
                if len(str) > 0:
                    curr.append(str)
                str = item
                start = True
            elif start == True:
                str = str + ' ' + item
Here is the output:
['number: Jackie Grant', 'Status: Fell Thru Primary closing', 'Acceptance: 01/01/2001 Closing', 'date: 11/11/2011 Property', 'number: Sale', 'price: $200,000', 'Home:', 'Business:', 'Mobile:', 'Home:', 'Business: 555-555-5555', 'Mobile:', 'Home: (575) 222-3455', 'Business:', 'Mobile: (702) 600-3492', 'Home:', 'Business:', 'Mobile: 595-595-5959']
My issues are as follows:
First, there should be 2 records as output, and I'm only outputting one.
Second, in the top block of text, my script has trouble knowing where the previous value ends and the new one starts: 'Status: Fell Thru' should be one value, and 'Primary closing party: Buyer', 'Acceptance: 01/10/2010', 'Closing date: 01/10/2010', 'Property type: RRR', 'MLS number:', 'Sale price: $299,000' and 'Commission: 33.00%' should each be caught.
Once this is parsed correctly, I will need to parse again to separate keys from values (ie. 'Closing date':01/10/2010), ideally in a list of dicts.
I can't think of a better way other than using regex to pick out keys, and then grabbing the snippets of text that follow.
When complete, I'd like a csv w/a header row filled with keys, that I can import into pandas w/read_csv. I've spent quite a few hours on this one..

(This isn't a complete answer, but it's too long for a comment).
Field names can have spaces (e.g. MLS number)
Several fields can appear on each line (e.g. Home: Pager:)
The Notes field has the time in it, with a : in it
These mean you can't take your approach to identifying the fieldnames by regex. It's impossible for it to know whether "MLS" is part of the previous data value or the subsequent fieldname.
Some of the Home: Pager: lines refer to the Seller, some to the Buyer or the Lessee Agent or the Leasing Agent. This means the naive line-by-line approach I take below doesn't work either.
This is the code I was working on. It runs against your test data but gives incorrect output because of the issues above. It is here as a reference for the approach I was taking:
replaces = [
    ('Closing Report for', 'Report_for:'),
    ('Report for', 'Report_for:'),
    ('File number', 'File_number'),
    ('Primary closing party', 'Primary_closing_party'),
    ('MLS number', 'MLS_number'),
    ('Sale price', 'Sale_price'),
    ('Account owner', 'Account_owner'),
    # ...
    # etc.
]
def fix_linemash(data):
    # splits many fields on one line into several lines
    results = []
    mini_collection = []
    for token in data.split(' '):
        if ':' not in token:
            mini_collection.append(token)
        else:
            results.append(' '.join(mini_collection))
            mini_collection = [token]
    results.append(' '.join(mini_collection))  # flush the last field
    return [line for line in results if line]
def process_record(data):
    # takes a collection of lines
    # fixes them, and builds a record dict
    record = {}
    for old, new in replaces:
        data = data.replace(old, new)
    for line in fix_linemash(data):
        print(line)
        name, value = line.split(':', 1)
        record[name.strip()] = value.strip()
    return record
records = []
collection = []
blank_flag = False
for line in open('d:/lol.txt'):
    # Read through the file collecting lines and
    # looking for double blank lines
    # every pair of blank lines, process the stored ones and reset
    line = line.strip()
    if line.startswith('Page '): continue
    if line.startswith('Printed on '): continue
    if not line and blank_flag:  # record finished
        records.append(process_record(' '.join(collection)))
        blank_flag = False
        collection = []
    elif not line:  # maybe end of record?
        blank_flag = True
    else:  # false alarm, record continues
        blank_flag = False
        collection.append(line)
for record in records:
    print(record)
I'm now thinking it would be a much better idea to do some pre-processing tidyup steps over the data:
Strip out "Page n of n" and "Printed on ..." lines, and similar
Identify all valid field names, then break up the combined lines, meaning every line has one field only, fields start at the start of a line.
Run through and just process the Seller/Buyer/Agents blocks, replacing fieldnames with an identifying prefix, e.g. Email: -> Seller Email:.
Then write a record parser, which should be easy: check for two blank lines, split each line at the first colon, and use the left part as the field name and the right part as the value. Store the result however you want (note that dictionaries only preserve insertion order from Python 3.7 onwards).
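The second tidy-up step, breaking a combined line into one field each, could be sketched as follows once the valid field names are known. The FIELDS list is an assumption covering only the top block of the sample report, and split_line is a made-up helper name:

```python
import re

# Assumed field names, taken from the top block of the sample report
FIELDS = ["File number", "Status", "Primary closing party", "Acceptance",
          "Closing date", "Property type", "MLS number", "Sale price", "Commission"]

# Longest names first; a name only counts as a field when a colon follows it
FIELD_RE = re.compile(
    "(" + "|".join(re.escape(f) for f in sorted(FIELDS, key=len, reverse=True)) + "):"
)

def split_line(line):
    # re.split with one capturing group gives [prefix, name1, value1, name2, value2, ...]
    parts = FIELD_RE.split(line)
    return {name: value.strip() for name, value in zip(parts[1::2], parts[2::2])}

print(split_line("File number: Jackie Grant Status: Fell Thru Primary closing party: Seller"))
print(split_line("MLS number: Sale price: $200,000 Commission: $1,500.00"))
```

Because only listed names are treated as fields, the colon inside a time such as 02:30PM in the Notes value is left alone, and an empty field such as MLS number comes back as an empty string.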

I suppose it is easier to start a new record whenever a line begins with the word "Page".
To share a bit of my own experience: it is just too difficult to write a generalized parser.
The situation isn't that bad given the data here. Instead of using a simple list to store an entry, use an object, and add all other fields to it as attributes.
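A minimal sketch of the "new record at every Page line" idea, assuming each record in the file starts with a line of the form "Page N of M" as in the sample. The names PAGE_RE and split_records are invented for this example:

```python
import re

PAGE_RE = re.compile(r"^Page \d+ of \d+$")

def split_records(lines):
    """Group non-blank lines into records, starting a new one at each 'Page N of M' line."""
    records, current = [], []
    for line in lines:
        line = line.strip()
        if PAGE_RE.match(line):
            if current:
                records.append(current)
            current = []
        elif line:
            current.append(line)
    if current:  # do not lose the last record
        records.append(current)
    return records

sample = [
    "Page 1 of 515",
    "Closing Report for Company Name LLC",
    "Seller: Company Name LLC",
    "",
    "",
    "Printed on Friday, June 12, 2015",
    "Page 2 of 515",
    "Report for Adrian (Allday) Peterson",
]
for record in split_records(sample):
    print(record)
```

This does not rely on counting blank lines, which is the part of the original script that produced one record instead of two.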

Getting facts form an RDF Graph in a way that I can use using RDFlib

I am trying to learn RDF and, as a learning exercise, am pulling a set of facts out of DBpedia. The following code sample sort of works, but for properties such as spouse it also returns the person themselves.
QUESTIONS:
get_name_from_uri() pulls out the last part of the URI and replaces the underscores with spaces. There has got to be a better way to get a person's name.
The results for spouse include the spouse but also the data subject. I am not sure what is going on there.
Some results come back both in URI format and as text items.
This is the output from the code block. It shows some of the odd results I am getting: the mixed output in the properties, the fact that he is married to himself, and the mangled name of Josephine.
Accessing facts for Napoleon held at http://dbpedia.org/resource/Napoleon
There are 800 facts about Napoleon stored at the URI
http://dbpedia.org/resource/Napoleon
Here are a few:-
Ontology:deathdate
Napoleon died on 1821-05-05
Ontology:birthdate
Napoleon was born on 1769-08-15
Property:spouse returns the person themselves twice!
Napoleon was married to Marie Louise, Duchess of Parma
Napoleon was married to Napoleon
Napoleon was married to Jos%C3%A9phine de Beauharnais
Napoleon was married to Napoleon
Property:title returns text and URIs
Napoleon Held the title: "The Death of Napoleon"
Napoleon Held the title: http://dbpedia.org/resource/Emperor_of_the_French
Napoleon Held the title: http://dbpedia.org/resource/King_of_Italy
Napoleon Held the title: First Consul of France
Napoleon Held the title: Provisional Consul of France
Napoleon Held the title: http://dbpedia.org/resource/Napoleon
Napoleon Held the title: Emperor of the French
Napoleon Held the title: http://dbpedia.org/resource/Co-Princes_of_Andorra
Napoleon Held the title: from the Memoirs of Bourrienne, 1831
Napoleon Held the title: Protector of the Confederation of the Rhine
Ontology birth place returns three records
Napoleon was born in Ajaccio
Napoleon was born in Corsica
Napoleon was born in Early modern France
This is the python that produces the output above, it requires rdflib and is very much a work in progress.
import rdflib
from rdflib import Graph, URIRef, RDF
######################################
# A quick test of a python library rdflib to get data from an rdf graph
# D Moore 15/3/2014
# needs rdflib > version 3.0
# CHANGE THE URI BELOW TO A DIFFERENT PERSON AND SEE WHAT HAPPENS
# COULD DO WITH A WEB FORM
# NOTES:
#
#URI_ref = 'http://dbpedia.org/resource/Richard_Nixon'
#URI_ref = 'http://dbpedia.org/resource/Margaret_Thatcher'
#URI_ref = 'http://dbpedia.org/resource/Isaac_Newton'
#URI_ref = 'http://dbpedia.org/resource/Richard_Nixon'
URI_ref = 'http://dbpedia.org/resource/Napoleon'
#URI_ref = 'http://dbpedia.org/resource/apple'
##########################################################
def get_name_from_uri(dbpedia_uri):
    # pulls the last part of a uri out and removes underscores
    # got to be an easier way but it works
    output_string = ""
    s = dbpedia_uri
    # chop the url into bits divided by the /
    tokens = s.split("/")
    # because the name of our person is in the last section iterate through each token
    # and replace the underscore with a space
    for i in tokens :
        str = ''.join([i])
        output_string = str.replace('_',' ')
    # returns the name of the person without underscores
    return(output_string)
def is_person(uri):
    ##### SPARQL way to do this
    uri = URIRef(uri)
    person = URIRef('http://dbpedia.org/ontology/Person')
    g= Graph()
    g.parse(uri)
    resp = g.query(
        "ASK {?uri a ?person}",
        initBindings={'uri': uri, 'person': person}
    )
    print uri, "is a person?", resp.askAnswer
    return resp.askAnswer
URI_NAME = get_name_from_uri(URI_ref)
NAME_LABEL = ''
if is_person(URI_ref):
    print "Accessing facts for", URI_NAME, " held at ", URI_ref
    g = Graph()
    g.parse(URI_ref)
    print "Person Extract for", URI_NAME
    print "There are ",len(g)," facts about", URI_NAME, "stored at the URI ",URI_ref
    print "Here are a few:-"
    # Ok so lets get some facts for our person
    for stmt in g.subject_objects(URIRef("http://dbpedia.org/ontology/birthName")):
        print URI_NAME, "was born " + str(stmt[1])
    for stmt in g.subject_objects(URIRef("http://dbpedia.org/ontology/deathDate")):
        print URI_NAME, "died on", str(stmt[1])
    for stmt in g.subject_objects(URIRef("http://dbpedia.org/ontology/birthDate")):
        print URI_NAME, "was born on", str(stmt[1])
    for stmt in g.subject_objects(URIRef("http://dbpedia.org/ontology/eyeColor")):
        print URI_NAME, "had eyes coloured", str(stmt[1])
    for stmt in g.subject_objects(URIRef("http://dbpedia.org/property/spouse")):
        print URI_NAME, "was married to ", get_name_from_uri(str(stmt[1]))
    for stmt in g.subject_objects(URIRef("http://dbpedia.org/ontology/reigned")):
        print URI_NAME, "reigned ", get_name_from_uri(str(stmt[1]))
    for stmt in g.subject_objects(URIRef("http://dbpedia.org/ontology/children")):
        print URI_NAME, "had a child called ", get_name_from_uri(str(stmt[1]))
    for stmt in g.subject_objects(URIRef("http://dbpedia.org/property/profession")):
        print URI_NAME, "(PROPERTY profession) was trained as a ", get_name_from_uri(str(stmt[1]))
    for stmt in g.subject_objects(URIRef("http://dbpedia.org/property/child")):
        print URI_NAME, "PROPERTY child ", get_name_from_uri(str(stmt[1]))
    for stmt in g.subject_objects(URIRef("http://dbpedia.org/property/deathplace")):
        print URI_NAME, "(PROPERTY death place) died at: ", str(stmt[1])
    for stmt in g.subject_objects(URIRef("http://dbpedia.org/property/title")):
        print URI_NAME, "(PROPERTY title) Held the title: ", str(stmt[1])
    for stmt in g.subject_objects(URIRef("http://dbpedia.org/ontology/sex")):
        print URI_NAME, "was a ", str(stmt[1])
    for stmt in g.subject_objects(URIRef("http://dbpedia.org/ontology/knownfor")):
        print URI_NAME, "was known for ", str(stmt[1])
    for stmt in g.subject_objects(URIRef("http://dbpedia.org/ontology/birthPlace")):
        print URI_NAME, "was born in ", get_name_from_uri(str(stmt[1]))
else:
    print "ERROR - "
    print "Resource", URI_ref, 'does not look to be a person or there is no record in dbpedia'

Getting names
get_name_from_uri works on the text of the URI itself. Since DBpedia data has rdfs:labels on almost everything, it's probably a better idea to ask for the rdfs:label and use that as the name. E.g., look at the results of this SPARQL query run against the DBpedia SPARQL endpoint:
select ?spouse ?spouseName where {
dbpedia:Napoleon dbpedia-owl:spouse ?spouse .
?spouse rdfs:label ?spouseName .
filter( langMatches(lang(?spouseName),"en") )
}
spouse                                                        spouseName
http://dbpedia.org/resource/Jos%C3%A9phine_de_Beauharnais     "Joséphine de Beauharnais"@en
http://dbpedia.org/resource/Marie_Louise,_Duchess_of_Parma    "Marie Louise, Duchess of Parma"@en
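When no label is available and the name has to come from the URI after all, the mangled "Jos%C3%A9phine" can be repaired by percent-decoding the last path segment. This is a Python 3 sketch using only the standard library; name_from_uri is a hypothetical replacement for get_name_from_uri, not part of rdflib:

```python
from urllib.parse import unquote, urlparse

def name_from_uri(uri):
    # Take the last path segment, e.g. "Jos%C3%A9phine_de_Beauharnais"
    last_segment = urlparse(uri).path.rsplit("/", 1)[-1]
    # Decode percent-escapes (UTF-8 by default), then turn underscores into spaces
    return unquote(last_segment).replace("_", " ")

print(name_from_uri("http://dbpedia.org/resource/Jos%C3%A9phine_de_Beauharnais"))
print(name_from_uri("http://dbpedia.org/resource/Marie_Louise,_Duchess_of_Parma"))
```

The rdfs:label is still the better source, since a resource name in a URI is an identifier and is not guaranteed to read like a display name.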
Unexpected Spouses
The documentation for subject_objects says that
subject_objects(self, predicate=None)
A generator of (subject, object) tuples for the given predicate
You're seeing, correctly, that there are four triples in DBpedia that have the predicate dbpprop:spouse (by the way, is there a reason you're not using dbpedia-owl:spouse?) and have Napoleon as a subject or object:
Napoleon spouse Marie Louise, Duchess of Parma
Marie Louise, Duchess of Parma spouse Napoleon
Napoleon spouse Jos%C3%A9phine de Beauharnais
Jos%C3%A9phine de Beauharnais spouse Napoleon
For each one of those, you're printing out
"Napoleon was married to X"
where X is the object of the triple. Perhaps you should use objects instead:
objects(self, subject=None, predicate=None)
A generator of objects with the given subject and predicate
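To see why the first call prints "Napoleon was married to Napoleon", here is a toy model in plain Python. It uses a list of tuples in place of an rdflib Graph, and the two functions only imitate the behaviour described in the documentation quoted above; they are not rdflib code:

```python
# The four spouse triples listed above, as (subject, predicate, object) tuples
triples = [
    ("Napoleon", "spouse", "Marie Louise, Duchess of Parma"),
    ("Marie Louise, Duchess of Parma", "spouse", "Napoleon"),
    ("Napoleon", "spouse", "Joséphine de Beauharnais"),
    ("Joséphine de Beauharnais", "spouse", "Napoleon"),
]

def subject_objects(triples, predicate):
    # Every (subject, object) pair for the predicate, whoever the subject is
    return [(s, o) for s, p, o in triples if p == predicate]

def objects(triples, subject, predicate):
    # Only the objects of triples that have the given subject
    return [o for s, p, o in triples if s == subject and p == predicate]

# Printing the object of every pair includes Napoleon himself twice
all_objects = [o for _, o in subject_objects(triples, "spouse")]
print(all_objects)

# Restricting the subject leaves just the two spouses
spouses = objects(triples, "Napoleon", "spouse")
print(spouses)
```

In the real code the equivalent change would be to call objects with the person's URIRef as the subject and the spouse property as the predicate, following the signature quoted above.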
URI vs. text (literal) results
The data described by DBpedia ontology properties (those whose URIs begin with http://dbpedia.org/ontology/, typically abbreviated dbpedia-owl:) is much “cleaner” than the data described by the DBpedia raw data properties (those whose URIs begin with http://dbpedia.org/property/, typically abbreviated dbpprop:). E.g., when you're looking at the titles, you're using the property dbpprop:title, and there are both URIs and literals as values. It doesn't look like there's a dbpedia-owl:title, though, so in this case you'll just have to deal with it. It's easy enough to filter out one or the other though:
select ?title where {
dbpedia:Napoleon dbpprop:title ?title
filter isLiteral(?title)
}
title
================================================
"Emperor of the French"@en
"Protector of the Confederation of the Rhine"@en
"First Consul of France"@en
"Provisional Consul of France"@en
""The Death of Napoleon""@en
"from the Memoirs of Bourrienne, 1831"@en
select ?title where {
dbpedia:Napoleon dbpprop:title ?title
filter isURI(?title)
}
title
=================================================
http://dbpedia.org/resource/Co-Princes_of_Andorra
http://dbpedia.org/resource/Emperor_of_the_French
http://dbpedia.org/resource/King_of_Italy
