Making a (hopefully simple) wiki parser with Python

With the help of joksnet's programs here I've managed to get plaintext Wikipedia articles that I'm looking for.
The text returned includes Wiki markup for the headings, so for example, the sections of the Albert Einstein article are returned like this:
==Biography==
===Early life and education===
blah blah blah
What I'd really like to do is feed the retrieved text to a function and wrap all the top level sections in bold html tags and the second level sections in italics, like this:
<b>Biography</b>
<i>Early life and education</i>
blah blah blah
But I'm afraid I don't know how to even start, at least not without making the function dangerously naive. Do I need to use regular expressions?
Any suggestions greatly appreciated.
PS Sorry if "parsing" is too strong a word for what I'm trying to do here.

I think the best way here would be to let MediaWiki take care of the parsing. I don't know the library you're using, but basically this is the difference between
http://en.wikipedia.org/w/api.php?action=query&prop=revisions&titles=Albert%20Einstein&rvprop=content
which returns the raw wikitext and
http://en.wikipedia.org/w/api.php?action=query&prop=revisions&titles=Albert%20Einstein&rvprop=content&rvparse
which returns the parsed HTML.
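As a rough sketch of the same idea from Python: the snippet below only builds a request URL and does not fetch anything. It assumes the `action=parse` module of the MediaWiki API, which is a different module from the `rvparse` flag shown above and may be the safer choice on current MediaWiki versions. The helper name `build_parse_url` is made up for this example.

```python
from urllib.parse import urlencode

API_ENDPOINT = "https://en.wikipedia.org/w/api.php"


def build_parse_url(title):
    """Return a URL asking MediaWiki to render the page `title` as HTML."""
    params = {
        "action": "parse",  # render a page instead of returning raw wikitext
        "page": title,
        "prop": "text",     # only the rendered HTML body
        "format": "json",
    }
    return API_ENDPOINT + "?" + urlencode(params)


print(build_parse_url("Albert Einstein"))
```

The resulting URL can be passed to whatever HTTP client you already use; the JSON response then has to be unpacked to get at the HTML.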

You can use regex and scraping modules like Scrapy and BeautifulSoup to parse and scrape wiki pages.
Now that you have clarified your question, I suggest you use the py-wikimarkup module, which is hosted on GitHub. The link is https://github.com/dcramer/py-wikimarkup/ . I hope that helps.

I ended up doing this:
def parseWikiTitles(x):
    # Replace each pair of '===' markers with <i>...</i>, alternating open/close.
    counter = 1
    while '===' in x:
        if counter == 1:
            x = x.replace('===', '<i>', 1)
            counter = 2
        else:
            x = x.replace('===', '</i>', 1)
            counter = 1
    # Then do the same for the remaining '==' markers with <b>...</b>.
    counter = 1
    while '==' in x:
        if counter == 1:
            x = x.replace('==', '<b>', 1)
            counter = 2
        else:
            x = x.replace('==', '</b>', 1)
            counter = 1
    # Trim spaces just inside the tags (first 50 occurrences of each).
    x = x.replace('<b> ', '<b>', 50)
    x = x.replace(' </b>', '</b>', 50)
    x = x.replace('<i> ', '<i>', 50)
    x = x.replace(' </i>', '</i>', 50)
    return x
I pass the string of text with wiki titles to that function and it returns the same text with the == and === replaced with bold and italics HTML tags. The last four replace calls remove the spaces just inside the tags, so for example == title == gets converted to <b>title</b> instead of <b> title </b>.
Has worked without problem so far.
Thanks for the help guys,
Alex
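For comparison, here is a minimal regex-based sketch of the same conversion. It is not the code from the answer above. It assumes every heading sits on its own line and only handles levels two and three; deeper levels such as ==== are left untouched. The names `H2`, `H3` and `convert_headings` are made up for this example.

```python
import re

# '===Title===' on a line of its own (exactly three '=' at the start).
H3 = re.compile(r"^===(?!=)[ \t]*(.+?)[ \t]*===[ \t]*$", re.MULTILINE)
# '==Title==' on a line of its own (exactly two '=' at the start).
H2 = re.compile(r"^==(?!=)[ \t]*(.+?)[ \t]*==[ \t]*$", re.MULTILINE)


def convert_headings(text):
    """Wrap level-3 headings in <i> and level-2 headings in <b>."""
    text = H3.sub(r"<i>\1</i>", text)
    text = H2.sub(r"<b>\1</b>", text)
    return text


sample = "==Biography==\n=== Early life and education ===\nblah blah blah"
print(convert_headings(sample))
```

Because the patterns are anchored to whole lines, a stray == inside body text is not touched, and the surrounding spaces are trimmed in the same step.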

Related

Regex not specific enough

So I wrote a program for my Kindle e-reader that searches my highlights and deletes repetitive text (it's usually information about the book title, author, page number, etc.). I thought it was functional, but sometimes there would randomly be periods (.) on certain lines of the output. At first I thought the program was just buggy, but then I realized that the regex I'm using to match the book's title and author was also matching any sentence that ended in brackets.
This is the code for the regex that I'm using to detect the book's title and author:
titleRegex = re.compile(r'(.+)\((.+)\)')
Example
Desired book title and author match: Book title (Author name)
What would also get matched: *I like apples because they are green (they are sometimes red as well). *
In this case it would delete everything and leave just the period at the end of the sentence. This is obviously not ideal because it deletes the text I highlighted
Here is the unformatted text file that goes into my program
The program works by finding all of the matches for the regexes I wrote, looping through those matches and one by one replacing them with empty strings.
Would there be any ways to make my title regex more specific so that it only picks up author titles and not full sentences that end in brackets? If not, what steps would I have to take to restructure this program?
I've attached my code to the bottom of this post. I would greatly appreciate any help as I'm a total coding newbie. Thanks :)
import re
titleRegex = re.compile(r'(.+)\((.+)\)')
titleRegex2 = re.compile(r'\ufeff (.+)\((.+)\)')
infoRegex = re.compile(r'(.) ([a-zA-Z]+) (Highlight|Bookmark|Note) ([a-zA-Z]+) ([a-zA-Z]+) ([0-9]+) (\|)')
locationRegex = re.compile(r' Location (\d+)(-\d+)? (\|)')
dateRegex = re.compile(r'([a-zA-Z]+) ([a-zA-Z]+) ([a-zA-Z]+), ([a-zA-Z]+) ([0-9]+), ([0-9]+)')
timeRegex = re.compile(r'([0-9]+):([0-9]+):([0-9]+) (AM|PM)')
newlineRegex = re.compile(r'\n')
sepRegex = re.compile('==========')
regexList = [titleRegex, titleRegex2, infoRegex, locationRegex, dateRegex, timeRegex, sepRegex, newlineRegex]
string = open("/Users/devinnagami/myclippings.txt").read()
for x in range(len(regexList)):
    newString = re.sub(regexList[x], ' ', string)
    string = newString
finalText = newString.split(' ')
with open('booknotes.txt', 'w') as f:
    for item in finalText:
        f.write('%s\n' % item)
There isn't enough information to tell if "Book title (Book Author)" is different than something like "I like Books (Good Ones)" without context. Thankfully, the text you showed has plenty of context. Instead of creating several different regular expressions, you can combine them into one expression to encode that context.
For instance:
quoteInfoRegex = re.compile(
    r"^=+\n(?P<title>.*?) \((?P<author>.*?)\)\n" +
    r"- Your Highlight on page (?P<page>[\d]+) \| Location (?P<location>[\d-]+) \| Added on (?P<added>.*?)\n" +
    r"\n" +
    r"(?P<quote>.*?)\n", flags=re.MULTILINE)
# data is the full text of the clippings file
for m in quoteInfoRegex.finditer(data):
    print(m.groupdict())
This will pull out each clipping entry and parse it, relying on the fact that the book title is the first line after the row of equals signs and the quote itself comes after the blank line below the metadata.
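To make that concrete, here is a small self-contained sketch that runs the same kind of pattern over a made-up sample and keeps only the quotes. The sample text, the name `ENTRY` and the helper `extract_quotes` are assumptions for illustration; a real clippings file may differ in layout (notes, bookmarks, entries without page numbers), so treat the pattern as a starting point.

```python
import re

# Hypothetical sample in the layout the pattern above assumes.
SAMPLE = (
    "==========\n"
    "Book title (Author name)\n"
    "- Your Highlight on page 12 | Location 180-181 | Added on Monday, January 1, 2018 10:00:00 AM\n"
    "\n"
    "I like apples because they are green (they are sometimes red as well).\n"
    "==========\n"
    "Book title (Author name)\n"
    "- Your Highlight on page 15 | Location 220-221 | Added on Monday, January 1, 2018 10:05:00 AM\n"
    "\n"
    "A second highlight.\n"
)

ENTRY = re.compile(
    r"^=+\n(?P<title>.*?) \((?P<author>.*?)\)\n"
    r"- Your Highlight on page (?P<page>\d+) \| Location (?P<location>[\d-]+) \| Added on (?P<added>.*?)\n"
    r"\n"
    r"(?P<quote>.*?)\n",
    flags=re.MULTILINE,
)


def extract_quotes(data):
    """Return only the highlighted text of every entry the pattern matches."""
    return [m.group("quote") for m in ENTRY.finditer(data)]


for quote in extract_quotes(SAMPLE):
    print(quote)
```

Because the title line is only recognised in its position directly after the separator, the sentence ending in brackets survives intact inside the quote.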

Regex with counting variable in Python?

I want to adjust unmarked chapter numbers in a text without changing unmarked page numbers (so I can later remove them). I intend to use a regex statement including a variable made up of an incrementally increasing chapter number to do this.
The text string is like this:
"Introduction text
1
Chapter 1 title
Chapter 1 text
192 (page number)
Chapter 1 text some more
193
2
Chapter 2 title
Chapter 2 text
194
And so on"
Using the variable chapter_fix = 1, incremented with chapter_fix += 1 after each replacement,
I want to take the patterns '\n1\n', '\n2\n', '\n3\n'...up to '\n365\n' and convert them to the strings '\nChapter 1\n', '\nChapter 2\n', '\nChapter 3\n'...up to '\nChapter 365\n' without modifying the page numbers, which are not in the same number series.
Unfortunately, I am an amateur coder and despite working on this problem for several hours and having some basis in similar regex problems I just really can't figure it out. Frankly I don't even really have any usable code to present as an example but I would really appreciate help in figuring this out to learn this better.
I have included my current, profoundly incorrect, code below. I am just going to make the adjustment manually this time but I'd still like to learn to do it right if anybody can help.
###Module imports###
import re
text = '''
Intro Text
SECTION I
Section I Title
118
1
Chapter 1 Title
119
text
121
text
122
2
Chapter 2 Title
text
'''
###Function Definitions###
def chapter_correction (text):
    chapter_correction = 1
    pattern = r'\n{}\n'.format(chapter_correction)
    print(pattern)
    text = re.sub(pattern, text, chapter_correction, flags=re.IGNORECASE)
    chapter_correction += 1
    return text
text = chapter_correction (text)
print(text)
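One possible approach, sketched under assumptions: make a single pass with re.sub and a replacement function that keeps a running counter, and rewrite a number-only line only when it equals the next expected chapter number. The names `label_chapters` and `expected` are made up for this example. Note the limitation: a page number that happens to equal the next expected chapter number would be relabelled too, so this only works while the two series do not collide.

```python
import re


def label_chapters(text, start=1):
    """Prefix bare chapter numbers with 'Chapter ', leaving other numbers alone."""
    expected = start

    def repl(match):
        nonlocal expected
        if int(match.group(1)) == expected:
            expected += 1
            return "Chapter " + match.group(1)
        # Not the next chapter number (e.g. a page number): keep it unchanged.
        return match.group(0)

    # With re.MULTILINE, ^ and $ match at the start and end of every line.
    return re.sub(r"^(\d+)$", repl, text, flags=re.MULTILINE)


sample = "Intro Text\n118\n1\nChapter 1 Title\n119\ntext\n122\n2\nChapter 2 Title\ntext\n"
print(label_chapters(sample))
```

re.sub calls the replacement function once per match, from left to right, which is what lets the counter advance in step with the text.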

remove words starting with "#" in a column from a dataframe

I have a dataframe called tweetscrypto and I am trying to remove all the words from the column "text" starting with the character "#" and gather the result in a new column "clean_text". The rest of the words should stay exactly the same:
tweetscrypto['clean_text'] = tweetscrypto['text'].apply(filter(lambda x:x[0]!='#', x.split()))
It does not seem to work. Can somebody help?
Thanks in advance
Use str.replace to remove the strings starting with #.
Sample Data
text
0 News via #livemint: #RBI bars banks from links
1 Newsfeed from #oayments_source: How Africa
2 is that bitcoin? not my thing
tweetscrypto['clean_text'] = tweetscrypto['text'].str.replace(r'(\#\w+.*?)', "", regex=True)
You can also match # without escaping it, as noted by @baxx:
tweetscrypto['clean_text'] = tweetscrypto['text'].str.replace(r'(#\w+.*?)', "", regex=True)
clean_text
0 News via : bars banks from links
1 Newsfeed from : How Africa
2 is that bitcoin? not my thing
In this case it might be better to define a function rather than use a lambda, mainly for readability.
def clean_text(X):
    X = X.split()
    X_new = [x for x in X if not x.startswith("#")]
    return ' '.join(X_new)
tweetscrypto['clean_text'] = tweetscrypto['text'].apply(clean_text)
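The two approaches above do not give identical results, which a plain-Python sketch (no pandas needed) makes visible. The function names below are made up for this example. The regex version removes only the # and the word characters after it, so trailing punctuation such as ":" stays behind; the token version drops the whole whitespace-separated word.

```python
import re


def strip_hashtags_regex(text):
    """Remove '#word' runs, then squeeze the spaces left behind."""
    without_tags = re.sub(r"#\w+", "", text)
    return re.sub(r"\s+", " ", without_tags).strip()


def strip_hashtags_tokens(text):
    """Drop every whitespace-separated token that starts with '#'."""
    return " ".join(word for word in text.split() if not word.startswith("#"))


sample = "News via #livemint: #RBI bars banks from links"
print(strip_hashtags_regex(sample))
print(strip_hashtags_tokens(sample))
```

Which behaviour is right depends on whether punctuation attached to a hashtag should survive in the cleaned text.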

Python3 replace tags based on condition of the type of tag

I want all the tags in a text that look like <Car:1234|Bob Alice> or <Bus:5678|Nelson Mandela> to be replaced with <a my-inner-type="CR:1234">Bob Alice</a> and <a my-inner-type="BS:5678">Nelson Mandela</a> respectively. So basically, depending on the type (for example TypeA or TypeB), I want to replace the tag accordingly in a text string using Python 3 and regex.
I tried doing the following in python but not sure if that's the right approach to go forward:
import re
def my_replace():
    re.sub(r'\<(.*?)\>', replace_function, data)
With the above, I am trying to match every < > tag with a regex and pass each tag I find to a function called replace_function, which splits the text inside the tag, determines whether it is a TypeA or a TypeB, and returns the replacement tag dynamically. I am not sure if this is even possible using re.sub, but any leads would help. Thank you.
Examples:
<Car:1234|Bob Alice> becomes <a my-inner-type="CR:1234">Bob Alice</a>
<Bus:5678|Nelson Mandela> becomes <a my-inner-type="BS:5678">Nelson Mandela</a>
This is perfectly possible with re.sub, and you're on the right track with using a replacement function (which is designed to allow dynamic replacements). See below for an example that works with the examples you give - probably have to modify to suit your use case depending on what other data is present in the text (ie. other tags you need to ignore)
import re
def replace_function(m):
    # note: to not modify the text (ie if you want to ignore this tag),
    # simply do (return the entire original match):
    # return m.group(0)
    inner = m.group(1)
    t, name = inner.split('|')
    # process type here - the following will only work if types always follow
    # the pattern given in the question
    typename = t[4:]
    # EDIT: based on your edits, you will probably need more processing here
    # eg:
    kind, ident = t.split(':')
    if kind == 'Car':
        # keep the identifier so the result is e.g. CR:1234
        typename = 'CR:' + ident
    # etc
    return '<a my-inner-type="{}">{}</a>'.format(typename, name)
def my_replace(data):
    return re.sub(r'\<(.*?)\>', replace_function, data)
# let's just test it
data = 'I want all the tags in a text that look like <TypeA:1234|Bob Alice> or <TypeB:5678|Nelson Mandela> to be replaced with'
print(my_replace(data))
Warning: if this text is actually full html, regex matching will not be reliable - use an html processor like beautifulsoup. ;)
Probably an extension to @swalladge's answer, but here we take advantage of a dictionary, if we know the mapping. (Think of replacing the dictionary with a custom mapping function.)
import re
d = {'TypeA': 'A',
     'TypeB': 'B',
     'Car': 'CR',
     'Bus': 'BS'}
def repl(m):
    return '<a my-inner-type="' + d[m.group(1)] + m.group(2) + '">' + m.group(3) + '</a>'
s = '<TypeA:1234|Bob Alice> or <TypeB:5678|Nelson Mandela>'
print(re.sub(r'<(.*?)(:\d+)\|(.*?)>', repl, s))
print()
s = '<Bus:1234|Bob Alice> or <Car:5678|Nelson Mandela>'
print(re.sub(r'<(.*?)(:\d+)\|(.*?)>', repl, s))
OUTPUT
<a my-inner-type="A:1234">Bob Alice</a> or <a my-inner-type="B:5678">Nelson Mandela</a>
<a my-inner-type="BS:1234">Bob Alice</a> or <a my-inner-type="CR:5678">Nelson Mandela</a>
Working example here.
regex
We capture what we need in 3 groups and refer to them through the match object. The three parenthesised groups in the regex are the parts we capture:
<(.*?)(:\d+)\|(.*?)>
We use these 3 groups in our repl function to return the right string.
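A hedged variation on the dictionary idea: with a plain `d[...]` lookup, a tag whose type is missing from the mapping raises a KeyError. The sketch below uses `dict.get` and returns the original match unchanged for unknown types, and it tightens the pattern so ordinary HTML tags are not picked up. The names `TYPE_CODES`, `TAG` and `replace_tags` are made up for this example.

```python
import re

# Assumed mapping from tag type to short code, taken from the question's examples.
TYPE_CODES = {"Car": "CR", "Bus": "BS"}

# <Type:digits|name>, where the name contains no '<', '>' or '|'.
TAG = re.compile(r"<(\w+):(\d+)\|([^<>|]+)>")


def repl(match):
    code = TYPE_CODES.get(match.group(1))
    if code is None:
        # Unknown type: leave the tag exactly as it was.
        return match.group(0)
    return '<a my-inner-type="{}:{}">{}</a>'.format(code, match.group(2), match.group(3))


def replace_tags(text):
    return TAG.sub(repl, text)


print(replace_tags("<Car:1234|Bob Alice> or <Bus:5678|Nelson Mandela>"))
print(replace_tags("<Boat:1|Someone> and <b>bold</b>"))
```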
Sorry this isn't a complete answer (I'm falling asleep at the computer), but this is a regex that will match either of the strings you provided: (<Type)(\w:)(\d+\|)(\w+\s\w+>). Check out https://pythex.org/ for testing your regex stuff.
Try with:
import re
def get_tag(match):
    base = '<a my-inner-type="{}">{}</a>'
    inner_type = match.group(1).upper()
    my_inner_type = '{}{}:{}'.format(inner_type[0], inner_type[-1], match.group(2))
    return base.format(my_inner_type, match.group(3))
print(re.sub(r'\<(\w+):(\d+)\W([^\>]+).*', get_tag, '<Bus:1234|Bob Alice>'))
print(re.sub(r'\<(\w+):(\d+)\W([^\>]+).*', get_tag, '<Car:5678|Nelson Mandela>'))
This code will work if you have it in the form <Type:num|name>:
def replaceupdate(tag):
    i = 1  # skip the opening '<'
    typex = ''
    ident = ''
    name = ''
    # Collect the type, keeping the trailing ':' as part of it.
    while tag[i] != ':':
        typex += tag[i]
        i += 1
    typex += tag[i]
    i += 1
    # Collect the identifier up to the '|' separator.
    while tag[i] != '|':
        ident += tag[i]
        i += 1
    i += 1
    # Collect the name, stopping before the closing '>'.
    while tag[i] != '>':
        name += tag[i]
        i += 1
    return '<a my-inner-type="{}{}">{}</a>'.format(typex, ident, name)
I know it does not use regex and it has to split the text some other way, but this is the main bulk.

How to replace digits in string?

Ok say I have a string in python:
str="martin added 1 new photo to the <a href=''>martins photos</a> album."
the string contains a lot more css/html in real world use
What is the fastest way to change the 1 ('1 new photo') to say '2 new photos'. of course later the '1' may say '12'.
Note, I don't know what the number is, so doing a replace is not acceptable.
I also need to change 'photo' to 'photos' but I can just do a .replace(...).
Unless there is a neater, easier solution to modify both?
Update
Never mind. From the comments it is evident that the OP's requirement is more complicated than it appears in the question. I don't think it can be solved by my answer.
Original Answer
You can convert the string to a template and store it. Use placeholders for the variables.
template = """%(user)s added %(count)s new %(l_object)s to the
<a href='%(url)s'>%(text)s</a> album."""
url = ''  # placeholder; the original snippet left url undefined
options = dict(user="Martin", count=1, l_object='photo',
               url=url, text="Martin's album")
print(template % options)
This expects the object of the sentence to be pluralized externally. If you want this logic (or more complex conditions) in your template(s) you should look at a templating engine such as Jinja or Cheetah.
It sounds like this is what you want (although why is another question :^)
import re
def add_photos(s, n):
    def helper(m):
        # add n to the current count and pick singular or plural
        num = int(m.group(1)) + n
        plural = '' if num == 1 else 's'
        return 'added %d new photo%s' % (num, plural)
    return re.sub(r'added (\d+) new photo(s?)', helper, s)
s = "martin added 0 new photos to the <a href=''>martins photos</a> album."
s = add_photos(s, 1)
print(s)
s = add_photos(s, 5)
print(s)
s = add_photos(s, 7)
print(s)
Output
martin added 1 new photo to the <a href=''>martins photos</a> album.
martin added 6 new photos to the <a href=''>martins photos</a> album.
martin added 13 new photos to the <a href=''>martins photos</a> album.
Since you're not parsing HTML, just use a regular expression:
import re
exp = "{0} added ([0-9]*) new photo".format(name)
number = int(re.findall(exp, strng)[0])
This assumes that you will always pass it a string with the number in it. If not, you'll get an IndexError.
I would store the number and the format string though, in addition to the formatted string. When the number changes, remake the formatted string and replace your stored copy of it. This will be much mo' bettah than trying to parse a string to get the count.
In response to your question about the html mattering, I don't think so. You are not trying to extract information that the html is encoding so you are not parsing html with regular expressions. This is just a string as far as that concern goes.
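The "store the number and the format string" suggestion can be sketched like this. The template text, the `render` helper and the pluralisation rule are assumptions made for this example; the point is only that the count lives in a variable and the sentence is rebuilt from it, so nothing ever has to be parsed back out.

```python
TEMPLATE = "{user} added {count} new {noun} to the <a href=''>{user}s photos</a> album."


def render(user, count):
    """Rebuild the sentence from the stored count instead of editing the old string."""
    noun = "photo" if count == 1 else "photos"
    return TEMPLATE.format(user=user, count=count, noun=noun)


count = 1
print(render("martin", count))
count += 1  # a new photo was added: update the number, then re-render
print(render("martin", count))
```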
