How to replace digits in string? - python

Ok say I have a string in python:
str="martin added 1 new photo to the <a href=''>martins photos</a> album."
The string contains a lot more CSS/HTML in real-world use.
What is the fastest way to change the 1 ('1 new photo') to, say, '2 new photos'? Of course, later the '1' may be '12'.
Note: I don't know what the number is, so a plain replace is not acceptable.
I also need to change 'photo' to 'photos' but I can just do a .replace(...).
Unless there is a neater, easier solution to modify both?

Update
Never mind. From the comments it is evident that the OP's requirement is more complicated than it appears in the question. I don't think it can be solved by my answer.
Original Answer
You can convert the string to a template and store it. Use placeholders for the variables.
template = """%(user)s added %(count)s new %(l_object)s to the
<a href='%(url)s'>%(text)s</a> album."""
url = ''  # placeholder; the question's example has an empty href
options = dict(user="Martin", count=1, l_object='photo',
               url=url, text="Martin's album")
print(template % options)
This expects the object of the sentence to be pluralized externally. If you want this logic (or more complex conditions) in your template(s) you should look at a templating engine such as Jinja or Cheetah.
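For simple cases, the "pluralized externally" step can be a small helper that runs before the template is filled in. This is only a sketch: the `pluralize` and `render` names and the naive add-an-"s" rule are assumptions for illustration, not part of the original answer.

```python
def pluralize(count, singular, plural=None):
    """Return the singular form for a count of 1, otherwise the plural form."""
    if plural is None:
        plural = singular + "s"  # naive rule; pass `plural` for irregular nouns
    return singular if count == 1 else plural

template = ("%(user)s added %(count)d new %(l_object)s to the "
            "<a href='%(url)s'>%(text)s</a> album.")

def render(user, count, url, text):
    # The noun is chosen here, outside the template, based on the count.
    options = dict(user=user, count=count,
                   l_object=pluralize(count, "photo"),
                   url=url, text=text)
    return template % options

print(render("Martin", 1, "", "Martin's album"))
print(render("Martin", 12, "", "Martin's album"))
```

Anything beyond this (irregular plurals, other languages) is probably better left to a templating or i18n library.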

It sounds like this is what you want (although why is another question :^)
import re
def add_photos(s,n):
    def helper(m):
        num = int(m.group(1)) + n
        plural = '' if num == 1 else 's'
        return 'added %d new photo%s' % (num,plural)
    return re.sub(r'added (\d+) new photo(s?)',helper,s)
s = "martin added 0 new photos to the <a href=''>martins photos</a> album."
s = add_photos(s,1)
print(s)
s = add_photos(s,5)
print(s)
s = add_photos(s,7)
print(s)
Output
martin added 1 new photo to the <a href=''>martins photos</a> album.
martin added 6 new photos to the <a href=''>martins photos</a> album.
martin added 13 new photos to the <a href=''>martins photos</a> album.

Since you're not parsing HTML, just use a regular expression:
import re
exp = "{0} added ([0-9]*) new photo".format(name)
number = int(re.findall(exp, strng)[0])
This assumes that you will always pass it a string with the number in it. If not, you'll get an IndexError.
I would store the number and the format string, though, in addition to the formatted string. When the number changes, regenerate the formatted string and replace your stored copy of it. This will be much better than trying to parse a string to get the count.
In response to your question about the html mattering, I don't think so. You are not trying to extract information that the html is encoding so you are not parsing html with regular expressions. This is just a string as far as that concern goes.
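A rough sketch of the "store the number and the format string" idea, assuming a small class is an acceptable place to keep them. The `PhotoNotice` name, its fields and the hard-coded "photo"/"photos" choice are all made up for illustration.

```python
class PhotoNotice:
    """Keeps the count and the format string; the text is rebuilt on demand."""

    FORMAT = "{name} added {count} new {noun} to the <a href=''>{album}</a> album."

    def __init__(self, name, album, count=0):
        self.name = name
        self.album = album
        self.count = count

    def add(self, n):
        # Update the stored number, then regenerate the formatted string.
        self.count += n
        return self.render()

    def render(self):
        noun = "photo" if self.count == 1 else "photos"
        return self.FORMAT.format(name=self.name, count=self.count,
                                  noun=noun, album=self.album)

notice = PhotoNotice("martin", "martins photos")
print(notice.add(1))
print(notice.add(11))
```

With this layout the count never has to be recovered from the rendered text.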

Related

R Regex, get string between quotations marks

So, I'm trying to extract "Document is original" from the string below.
c:1:{s:7:"note";s:335:"Document is original-no need to register again";}
Two thoughts:
A little bit of work, get most components of that structure:
string <- 'c:1:{s:7:"note";s:335:"Document is original-no need to register again";}'
strcapture("(.*):(.*):(.*)",
           strsplit(regmatches(string, gregexpr('(?<={)[^}]+(?=})', string, perl = TRUE))[[1]], ";")[[1]],
           proto = list(s="", len=1L, x=""))
#   s len                                                  x
# 1 s   7                                             "note"
# 2 s 335 "Document is original-no need to register again"
A simpler approach, perhaps a little more hard-coded:
regmatches(string, gregexpr('(?<=")([^;"]+)(?=")', string, perl = TRUE))[[1]]
# [1] "note"
# [2] "Document is original-no need to register again"
From here, you need to figure out how to dismiss "note" and then perhaps strsplit(.., "-") to get the substring you want.

String Manipulation for a text extracted with a css selector

I wrote some code to extract several pieces of information about movies. I have a problem manipulating one of the strings:
'\n'
'\t\t\t\t85\xa0mins \xa0\n'
'\t\t\t\t\n'
'\t\t\t\t\tMore details at\n'
'\t\t\t\t\t'
I want to extract the duration of the movie, which, in this case, is the number 85.
I don't really know how to extract it since the format is pretty weird. My web scraping program yields items as dictionaries, for example:
{'film_director': ['Alfred Hitchcock'],
'film_rating': ['4.0'],
'film_time': '\n'
'\t\t\t\t81\xa0mins \xa0\n'
'\t\t\t\t\n'
'\t\t\t\t\tMore details at\n'
'\t\t\t\t\t',
'film_title': ['Rope'],
'film_year': ['1948']}
I have tried splitting it, but it doesn't seem to work. Any other ideas?
film_dict = {'film_director': ['Alfred Hitchcock'],
             'film_rating': ['4.0'],
             'film_time': '\n'
                          '\t\t\t\t81\xa0mins \xa0\n'
                          '\t\t\t\t\n'
                          '\t\t\t\t\tMore details at\n'
                          '\t\t\t\t\t',
             'film_title': ['Rope'],
             'film_year': ['1948']}
film_time = film_dict['film_time'].replace('\t', '')[:7]
print(film_time)
The film_time assignment takes the film time value, removes the tab characters and then truncates the result. The replace method replaces each '\t' with nothing, which just removes it. [:7] keeps the first seven characters of the modified string, here '\n81\xa0min', so the leading newline and the text after the digits still have to be stripped before the value can be used as a number.
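Since the goal is the number itself, a regular expression may be a more direct route than slicing at a fixed position. A minimal sketch, assuming the duration is always written as digits followed by "min"; the `extract_minutes` name is made up for this example.

```python
import re

film_time = ('\n'
             '\t\t\t\t81\xa0mins \xa0\n'
             '\t\t\t\t\n'
             '\t\t\t\t\tMore details at\n'
             '\t\t\t\t\t')

def extract_minutes(text):
    """Return the first run of digits followed by 'min' as an int, or None."""
    # In Python 3, \s also matches the non-breaking space '\xa0'.
    match = re.search(r'(\d+)\s*min', text)
    return int(match.group(1)) if match else None

print(extract_minutes(film_time))  # 81
```

This does not depend on how many digits the duration has or on how much whitespace surrounds it.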

Python3 replace tags based on condition of the type of tag

I want all the tags in a text that look like <Car:1234|Bob Alice> or <Bus:5678|Nelson Mandela> to be replaced with <a my-inner-type="CR:1234">Bob Alice</a> and <a my-inner-type="BS:5678">Nelson Mandela</a> respectively. So basically, depending on the type (TypeA or TypeB), I want to replace the tag accordingly in a text string using Python3 and regex.
I tried doing the following in python but not sure if that's the right approach to go forward:
import re
def my_replace():
    re.sub(r'\<(.*?)\>', replace_function, data)
With the above, I am trying to match the < > tags with a regex and pass every tag I find to a function called replace_function, which splits the text inside the tag, determines whether it is a TypeA or a TypeB, computes the result and returns the replacement tag dynamically. I am not sure whether this is even possible using re.sub, but any leads would help. Thank you.
Examples:
<Car:1234|Bob Alice> becomes <a my-inner-type="CR:1234">Bob Alice</a>
<Bus:5678|Nelson Mandela> becomes <a my-inner-type="BS:5678">Nelson Mandela</a>
This is perfectly possible with re.sub, and you're on the right track with using a replacement function (which is designed to allow dynamic replacements). See below for an example that works with the examples you give. You will probably have to modify it to suit your use case, depending on what other data is present in the text (i.e. other tags you need to ignore).
import re
def replace_function(m):
    # note: to not modify the text (ie if you want to ignore this tag),
    # simply do (return the entire original match):
    # return m.group(0)
    inner = m.group(1)
    t, name = inner.split('|')
    # process type here - the following will only work if types always follow
    # the pattern given in the question
    typename = t[4:]
    # EDIT: based on your edits, you will probably need more processing here
    # eg:
    if t.split(':')[0] == 'Car':
        # keep the ':1234' part so the id is not lost
        typename = 'CR:' + t.split(':')[1]
    # etc
    return '<a my-inner-type="{}">{}</a>'.format(typename, name)
def my_replace(data):
    return re.sub(r'\<(.*?)\>', replace_function, data)
# let's just test it
data = 'I want all the tags in a text that look like <TypeA:1234|Bob Alice> or <TypeB:5678|Nelson Mandela> to be replaced with'
print(my_replace(data))
Warning: if this text is actually full html, regex matching will not be reliable - use an html processor like beautifulsoup. ;)
This is probably an extension to @swalladge's answer, but here we take advantage of a dictionary, assuming we know the mapping. (You could replace the dictionary with a custom mapping function.)
import re
d={'TypeA':'A',
   'TypeB':'B',
   'Car':'CR',
   'Bus':'BS'}
def repl(m):
    return '<a my-inner-type="'+d[m.group(1)]+m.group(2)+'">'+m.group(3)+'</a>'
s='<TypeA:1234|Bob Alice> or <TypeB:5678|Nelson Mandela>'
print(re.sub(r'<(.*?)(:\d+)\|(.*?)>',repl,s))
print()
s='<Bus:1234|Bob Alice> or <Car:5678|Nelson Mandela>'
print(re.sub(r'<(.*?)(:\d+)\|(.*?)>',repl,s))
OUTPUT
<a my-inner-type="A:1234">Bob Alice</a> or <a my-inner-type="B:5678">Nelson Mandela</a>
<a my-inner-type="BS:1234">Bob Alice</a> or <a my-inner-type="CR:5678">Nelson Mandela</a>
regex
We capture what we need in three groups and refer to them through the match object. The three parenthesized parts of the regex below are the groups we capture.
<(.*?)(:\d+)\|(.*?)>
We use these 3 groups in our repl function to return the right string.
Sorry, this isn't a complete answer because I'm falling asleep at the computer, but this is a regex that will match either of the strings you provided: (<Type)(\w:)(\d+\|)(\w+\s\w+>). Check out https://pythex.org/ for testing your regex.
Try with:
import re
def get_tag(match):
    base = '<a my-inner-type="{}">{}</a>'
    inner_type = match.group(1).upper()
    my_inner_type = '{}{}:{}'.format(inner_type[0], inner_type[-1], match.group(2))
    return base.format(my_inner_type, match.group(3))
print(re.sub(r'\<(\w+):(\d+)\W([^\>]+).*', get_tag, '<Bus:1234|Bob Alice>'))
print(re.sub(r'\<(\w+):(\d+)\W([^\>]+).*', get_tag, '<Car:5678|Nelson Mandela>'))
This code will work if you have it in the form <Type:num|name>:
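def replaceupdate(tag):
    replace = ''
    t = ''
    i = 1
    ident = ''
    name = ''
    typex = ''
    # read the type; typex keeps the trailing ':'
    while t != ':':
        typex += tag[i]
        t = tag[i]
        i += 1
    t = ''
    # read the number up to (not including) '|'
    while t != '|':
        if tag[i] == '|':
            break
        ident += tag[i]
        t = tag[i]
        i += 1
    t = ''
    i += 1
    # read the name
    while t != '>':
        if tag[i] == '>':
            # stop before the closing '>' so it does not end up in the name
            break
        name += tag[i]
        t = tag[i]
        i += 1
    replace = '<a my-inner-type="{}{}">{}</a>'.format(typex, ident, name)
    return replace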
I know it does not use regex and it has to split the text some other way, but this is the main bulk.
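For comparison, the same non-regex idea can probably be written more compactly with str.partition. This is only a sketch under the same assumption that every tag has exactly the form <Type:num|name>; the `replace_tag` name is made up here. Like the loop version, it leaves the type name unchanged rather than mapping it to CR or BS.

```python
def replace_tag(tag):
    """Convert '<Type:num|name>' to an anchor tag using str methods only."""
    inner = tag.strip()[1:-1]              # drop the surrounding '<' and '>'
    head, _, name = inner.partition('|')   # e.g. 'Bus:1234' and 'Bob Alice'
    typex, _, ident = head.partition(':')  # e.g. 'Bus' and '1234'
    return '<a my-inner-type="{}:{}">{}</a>'.format(typex, ident, name)

print(replace_tag('<Bus:1234|Bob Alice>'))
```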

Making a (hopefully simple) wiki parser with python

With the help of joksnet's programs here I've managed to get plaintext Wikipedia articles that I'm looking for.
The text returned includes Wiki markup for the headings, so for example, the sections of the Albert Einstein article are returned like this:
==Biography==
===Early life and education===
blah blah blah
What I'd really like to do is feed the retrieved text to a function and wrap all the top level sections in bold html tags and the second level sections in italics, like this:
<b>Biography</b>
<i>Early life and education</i>
blah blah blah
But I'm afraid I don't know how to even start, at least not without making the function dangerously naive. Do I need to use regular expressions?
Any suggestions greatly appreciated.
PS Sorry if "parsing" is too strong a word for what I'm trying to do here.
I think the best way here would be to let MediaWiki take care of the parsing. I don't know the library you're using, but basically this is the difference between
http://en.wikipedia.org/w/api.php?action=query&prop=revisions&titles=Albert%20Einstein&rvprop=content
which returns the raw wikitext and
http://en.wikipedia.org/w/api.php?action=query&prop=revisions&titles=Albert%20Einstein&rvprop=content&rvparse
which returns the parsed HTML.
You can use regex and scraping modules like Scrapy and Beautifulsoup to parse and scrape wiki pages.
Now that you clarified your question I suggest you use the py-wikimarkup module that is hosted on github. The link is https://github.com/dcramer/py-wikimarkup/ . I hope that helps.
I ended up doing this:
def parseWikiTitles(x):
    counter = 1
    while '===' in x:
        if counter == 1:
            x = x.replace('===','<i>',1)
            counter = 2
        else:
            x = x.replace('===',r'</i>',1)
            counter = 1
    counter = 1
    while '==' in x:
        if counter == 1:
            x = x.replace('==','<b>',1)
            counter = 2
        else:
            x = x.replace('==',r'</b>',1)
            counter = 1
    x = x.replace('<b> ', '<b>', 50)
    x = x.replace(r' </b>', r'</b>', 50)
    x = x.replace('<i> ', '<i>', 50)
    # was r'<i>': the closing tag must stay a closing tag
    x = x.replace(r' </i>', r'</i>', 50)
    return x
I pass the string of text with wiki titles to that function and it returns the same text with the == and === replaced with bold and italics HTML tags. The last thing removes spaces before and after titles, for example == title == gets converted to <b>title</b> instead of <b> title </b>
Has worked without problem so far.
Thanks for the help guys,
Alex
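For anyone who does want the regular-expression route asked about in the question, here is a minimal sketch. It assumes each heading sits on its own line and that only == and === levels need handling; the function name `wiki_titles_to_html` is made up for this example.

```python
import re

def wiki_titles_to_html(text):
    """Wrap ==Title== lines in <b> and ===Title=== lines in <i>."""
    # Handle the deeper level first, otherwise the '==' pattern would also
    # match the '===' lines.
    text = re.sub(r'^===[ \t]*(.+?)[ \t]*===[ \t]*$', r'<i>\1</i>',
                  text, flags=re.MULTILINE)
    text = re.sub(r'^==[ \t]*(.+?)[ \t]*==[ \t]*$', r'<b>\1</b>',
                  text, flags=re.MULTILINE)
    return text

sample = "==Biography==\n===Early life and education===\nblah blah blah"
print(wiki_titles_to_html(sample))
```

Anchoring on line starts and ends means stray == sequences inside ordinary text are left alone, and spaces around the title are dropped in the same step.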

Python parsing

I'm trying to parse the title tag in an RSS 2.0 feed into three different variables for each entry in that feed. Using ElementTree I've already parsed the RSS so that I can print each title [minus the trailing )] with the code below:
feed = getfeed("http://www.tourfilter.com/dallas/rss/by_concert_date")
for item in feed:
    print(repr(item.title[0:-1]))
I include that because, as you can see, the item.title is a repr() data type, which I don't know much about.
A particular repr(item.title[0:-1]) printed in the interactive window looks like this:
'randy travis (Billy Bobs 3/21'
'Michael Schenker Group (House of Blues Dallas 3/26'
The user selects a band and I hope to, after parsing each item.title into 3 variables (one each for band, venue, and date... or possibly an array or I don't know...) select only those related to the band selected. Then they are sent to Google for geocoding, but that's another story.
I've seen some examples of regex and I'm reading about them, but it seems very complicated. Is it? I thought maybe someone here would have some insight as to exactly how to do this in an intelligent way. Should I use the re module? Does it matter that the output is currently repr()s? Is there a better way? I was thinking I'd use a loop like this (this is my pseudoPython, just kind of notes I'm writing):
list = bandRaw,venue,date,latLong
for item in feed:
    parse item.title for bandRaw, venue, date
    if bandRaw == str(band)
        send venue name + ", Dallas, TX" to google for geocoding
        return lat,long
        list = list + return character + bandRaw + "," + venue + "," + date + "," + lat + "," + long
    else
In the end, I need to have the chosen entries in a .csv (comma-delimited) file looking like this:
band,venue,date,lat,long
randy travis,Billy Bobs,3/21,1234.5678,1234.5678
Michael Schenker Group,House of Blues Dallas,3/26,4321.8765,4321.8765
I hope this isn't too much to ask. I'll be looking into it on my own, just thought I should post here to make sure it got answered.
So, the question is, how do I best parse each repr(item.title[0:-1]) in the feed into the 3 separate values that I can then concatenate into a .csv file?
Don't let regex scare you off... it's well worth learning.
Given the examples above, you might try putting the trailing parenthesis back in, and then using this pattern:
import re
s = 'Michael Schenker Group (House of Blues Dallas 3/26)'  # title with the trailing parenthesis restored
pat = re.compile(r'([\w\s]+)\(([\w\s]+)(\d+/\d+)\)')
info = pat.match(s)
print(info.groups())
('Michael Schenker Group ', 'House of Blues Dallas ', '3/26')
To get at each group individually, just call group() on the info object:
print(info.group(1)) # or info.groups()[0]
print('"%s","%s","%s"' % (info.group(1), info.group(2), info.group(3)))
"Michael Schenker Group ","House of Blues Dallas ","3/26"
The hard thing about regex in this case is making sure you know all the possible characters in the title. If there are non-alpha chars in the 'Michael Schenker Group' part, you'll have to adjust the regex for that part to allow them. Note also that the first two groups keep their trailing space, so you may want to strip() them.
The pattern above breaks down as follows, reading left to right:
([\w\s]+) : Match any word or space characters (the plus symbol indicates that there should be one or more such characters). The parentheses mean that the match will be captured as a group. This is the "Michael Schenker Group " part. If there can be numbers and dashes here, you'll want to modify the pieces between the square brackets, which are the possible characters for the set.
\( : A literal parenthesis. The backslash escapes the parenthesis, since otherwise it counts as a regex command. This is the "(" part of the string.
([\w\s]+) : Same as the one above, but this time matches the "House of Blues Dallas " part. In parentheses so they will be captured as the second group.
(\d+/\d+) : Matches the digits 3 and 26 with a slash in the middle. In parentheses so they will be captured as the third group.
\) : Closing parenthesis for the above.
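The breakdown above can be tried out with named groups, which may make the three parts easier to pick apart. This is a sketch, not the answer's pattern: it assumes titles look like 'band (venue m/d)', the `parse_title` name is invented, and the separators are matched outside the groups so no trailing spaces are captured. One thing to watch with the original pattern: \w also matches digits, so with a two-digit month the greedy venue group would keep the first digit. The lazy venue group plus required whitespace before the date here is meant to avoid that.

```python
import re

TITLE = re.compile(
    r'(?P<band>[^(]+?)\s*'    # band: everything up to the opening parenthesis
    r'\((?P<venue>.+?)\s+'    # venue: lazy, so it stops before the date
    r'(?P<date>\d+/\d+)\)?$'  # date: digits/digits, closing ')' optional
)

def parse_title(title):
    """Return a dict with band, venue and date, or None if the title doesn't fit."""
    m = TITLE.match(title)
    return m.groupdict() if m else None

print(parse_title('Michael Schenker Group (House of Blues Dallas 3/26)'))
print(parse_title('randy travis (Billy Bobs 3/21'))
```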
The Python documentation for the re module is quite good, and you might want to spend an evening going over it: http://docs.python.org/library/re.html#module-re. Also check Dive Into Python, which has a friendly introduction: http://diveintopython3.ep.io/regular-expressions.html.
EDIT: See zacherates below, who has some nice edits. Two heads are better than one!
Regular expressions are a great solution to this problem:
>>> import re
>>> s = 'Michael Schenker Group (House of Blues Dallas 3/26'
>>> re.match(r'(.*) \((.*) (\d+/\d+)', s).groups()
('Michael Schenker Group', 'House of Blues Dallas', '3/26')
As a side note, you might want to look at the Universal Feed Parser for handling the RSS parsing as feeds have a bad habit of being malformed.
Edit
In regards to your comment... The strings occasionally being wrapped in "s rather than 's has to do with the fact that you're using repr. The repr of a string is usually delimited with 's, unless that string contains one or more 's, where instead it uses "s so that the 's don't have to be escaped:
>>> "Hello there"
'Hello there'
>>> "it's not its"
"it's not its"
Notice the different quote styles.
Regarding the repr(item.title[0:-1]) part, not sure where you got that from but I'm pretty sure you can simply use item.title. All you're doing is removing the last char from the string and then calling repr() on it, which does nothing.
Your code should look something like this:
import re
import geocoders # from GeoPy
us = geocoders.GeocoderDotUS()
import feedparser # from www.feedparser.org
feedurl = "http://www.tourfilter.com/dallas/rss/by_concert_date"
feed = feedparser.parse(feedurl)
lines = []
for entry in feed.entries:
    m = re.search(r'(.*) \((.*) (\d+/\d+)\)', entry.title)
    if m:
        bandRaw, venue, date = m.groups()
        if band == bandRaw:  # band is the name the user selected
            place, (lat, lng) = us.geocode(venue + ", Dallas, TX")
            # join() needs strings, so convert the coordinates
            lines.append(",".join([band, venue, date, str(lat), str(lng)]))
result = "\n".join(lines)
EDIT: replaced list with lines as the var name. list is a builtin and should not be used as a variable name. Sorry.
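If a venue or band name can contain a comma, joining with "," would produce a broken file, so the standard csv module may be the safer way to build the output. A minimal sketch, assuming the rows have already been collected as (band, venue, date, lat, lng) tuples; the `rows_to_csv` name is made up, and the coordinates are the placeholder values from the question.

```python
import csv
import io

def rows_to_csv(rows):
    """Serialize (band, venue, date, lat, lng) tuples to CSV text with a header."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["band", "venue", "date", "lat", "long"])
    writer.writerows(rows)  # fields containing commas are quoted automatically
    return buffer.getvalue()

rows = [("randy travis", "Billy Bobs", "3/21", 1234.5678, 1234.5678)]
print(rows_to_csv(rows))
```

To write a file instead of a string, the same writer can be given a file opened with newline="".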
