Remove whitespace before a specific character in Python?

I was wondering if you knew the best way to do this.
This program uses OCR to read text. Occasionally, spaces appear before a decimal point like so:
{'MORTON BASSET BLK SESAME SEE': '$6.89'}
{"KELLOGG'S RICE KRISPIES": '$3.49'}
{'RAID FLY RIBBON 4PK': '$1 .49'}
As you can see, a space appears before the decimal point in the last entry. Any ideas on how to strip JUST this whitespace?
Thank you :)
EDIT: the contents before the decimal point may contain a varying amount of whitespace, like:
$1 .49
$1 .49
$1 .49

Use regular expressions (escaping the dot, since an unescaped . matches any character):
import re
a_list = ["1 .49", "1 .49", "1 .49"]
for a in a_list:
    print re.sub(r' +\.', '.', a)
The result will be:
1.49
1.49
1.49
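Applied to the OCR dicts from the question, a minimal sketch of the same idea (removing any run of whitespace that sits directly before a dot):
import re
items = {'RAID FLY RIBBON 4PK': '$1 .49', "KELLOGG'S RICE KRISPIES": '$3.49'}
# strip whitespace immediately before a literal '.' in every value
cleaned = {k: re.sub(r'\s+\.', '.', v) for k, v in items.items()}
print cleaned  # values become '$1.49' and '$3.49'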

You can just strip out all whitespace from the string, assuming that they follow the same format. Something like this:
for item in items:
    for key in item.keys():
        item[key] = item[key].replace(" ", "")
The key part is replacing the whitespace with no whitespace.
If you just want the whitespace before the ".", then you could use:
.replace(" .", ".") instead.
This only removes one space per occurrence, so several consecutive spaces before the dot would survive a single pass. To handle that, you could use a while loop like this:
while ' .' in item[key]:
    item[key] = item[key].replace(' .', '.')
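Putting that together for the OCR output from the question (assuming the results are a list of dicts like the ones shown), a minimal sketch:
items = [{'RAID FLY RIBBON 4PK': '$1 .49'}, {"KELLOGG'S RICE KRISPIES": '$3.49'}]
for item in items:
    for key in item:
        # keep collapsing ' .' until no space is left before the dot
        while ' .' in item[key]:
            item[key] = item[key].replace(' .', '.')
print items  # [{'RAID FLY RIBBON 4PK': '$1.49'}, {"KELLOGG'S RICE KRISPIES": '$3.49'}]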

For your dict object:
>>> d = {'RAID FLY RIBBON 4PK': '$1 .49'}
>>> d['RAID FLY RIBBON 4PK'] = d['RAID FLY RIBBON 4PK'].replace(' ','')
>>> d
{'RAID FLY RIBBON 4PK': '$1.49'}
Even if there is a varying amount of space, replace works fine. See this:
>>> d = {'RAID FLY RIBBON 4PK': '$1   .49'}
>>> d['RAID FLY RIBBON 4PK'] = d['RAID FLY RIBBON 4PK'].replace(' ','')
>>> d
{'RAID FLY RIBBON 4PK': '$1.49'}

This is trivial with split and join:
"".join("1 .49".split())
This works because split() with no arguments splits on runs of whitespace. To do this for each value in a dictionary:
{k: "".join(v.split()) for k, v in dict_.items()}

I think that maybe you want something more generic, not only for that key:
for key, value in d.items():
    d[key] = value.replace(" ", "")
This way, regardless of the key or the number of spaces, the result will contain no whitespace.

Sure:
string.replace(' .', '.')


How do I use string.replace() to replace only when the string is an exact match?

I have a dataframe with a list of poorly spelled clothing types. I want them all in the same format; for example, I have "trous", "trouse" and "trousers", and I would like to replace the first two with "trousers".
I have tried using string.replace, and it changes the first "trous" to "trousers" as it should, and "trouse" works too, but when it gets to "trousers" it produces "trousersersers"! I think it is matching every string that contains trous, trouse or trousers and changing them all.
Is there a way I can limit string.replace to look for exactly "trous"?
Here's what I've tried so far. As you can see, I have a good few changes to make; most of them work OK, but it's the likes of trousers and t-shirts, which have several similar changes to be made, that cause the upset.
newTypes = []
for string in types:
    underwear = string.replace(('UNDERW'), 'UNDERWEAR').replace('HANKY', 'HANKIES').replace('TIECLI', 'TIECLIPS').replace('FRAGRA', 'FRAGRANCES').replace('ROBE', 'ROBES').replace('CUFFLI', 'CUFFLINKS').replace('WALLET', 'WALLETS').replace('GIFTSE', 'GIFTSETS').replace('SUNGLA', 'SUNGLASSES').replace('SCARVE', 'SCARVES').replace('TROUSE ', 'TROUSERS').replace('SHIRT', 'SHIRTS').replace('CHINO', 'CHINOS').replace('JACKET', 'JACKETS').replace('KNIT', 'KNITWEAR').replace('POLO', 'POLOS').replace('SWEAT', 'SWEATERS').replace('TEES', 'T-SHIRTS').replace('TSHIRT', 'T-SHIRTS').replace('SHORT', 'SHORTS').replace('ZIP', 'ZIP-TOPS').replace('GILET ', 'GILETS').replace('HOODIE', 'HOODIES').replace('HOODZIP', 'HOODIES').replace('JOGGER', 'JOGGERS').replace('JUMP', 'SWEATERS').replace('SWESHI', 'SWEATERS').replace('BLAZE ', 'BLAZERS').replace('BLAZER ', 'BLAZERS').replace('WC', 'WAISTCOATS').replace('TTOP', 'T-SHIRTS').replace('TROUS', 'TROUSERS').replace('COAT', 'COATS').replace('SLIPPE', 'SLIPPERS').replace('TRAINE', 'TRAINERS').replace('DECK', 'SHOES').replace('FLIP', 'SLIDERS').replace('SUIT', 'SUITS').replace('GIFTVO', 'GIFTVOUCHERS')
    newTypes.append(underwear)
types = newTypes
Assuming you're okay with not using string.replace(), you can simply do this:
lst = ["trousers", "trous", "trouse"]
for i in range(len(lst)):
    if "trous" in lst[i]:
        lst[i] = "trousers"
print(lst)
# Prints ['trousers', 'trousers', 'trousers']
This checks if the shortest substring, trous, is part of the string, and if so converts the entire string to trousers.
Use a dict for the strings to be replaced:
d = {
    'trous': 'trousers',
    'trouse': 'trousers',
    # ...
}
newtypes = [d.get(string, string) for string in types]
d.get(string, string) returns string unchanged if string is not a key of d, so exact matches are replaced and everything else is left alone.
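Since the question mentions a dataframe, here is a sketch of the same dict idea with pandas, under the assumption that the clothing types live in a column named 'type' (the column name is hypothetical):
import pandas as pd

df = pd.DataFrame({'type': ['TROUS', 'TROUSE', 'TROUSERS', 'TSHIRT']})

# Series.replace with a dict swaps exact cell values only, so 'TROUSERS'
# stays as it is instead of becoming 'TROUSERSERSERS'.
mapping = {'TROUS': 'TROUSERS', 'TROUSE': 'TROUSERS', 'TSHIRT': 'T-SHIRTS'}
df['type'] = df['type'].replace(mapping)
print(df['type'].tolist())  # ['TROUSERS', 'TROUSERS', 'TROUSERS', 'T-SHIRTS']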

How to join list elements into a single line separated by spaces

I have a list in Python:
values = ['Maths\n', 'English\n', 'Hindi\n', 'Science\n', 'Physical_Edu\n', 'Accounts\n', '\n']
print("".join(values))
I want the output to be:
Subjects: Maths English Hindi Science Physical_Edu Accounts
I am new to Python; I used the join() method but was unable to get the expected output.
You could map the str.strip function to every element in the list and join them afterwards.
values = ['Maths\n', 'English\n', 'Hindi\n', 'Science\n', 'Physical_Edu\n', 'Accounts\n', '\n']
print("Subjects:", " ".join(map(str.strip, values)))
Using a regular expression approach:
import re
lst = ['Maths\n', 'English\n', 'Hindi\n', 'Science\n', 'Physical_Edu\n', 'Accounts\n', '\n']
rx = re.compile(r'.*')
print("Subjects: {}".format(" ".join(match.group(0) for item in lst for match in [rx.match(item)])))
# Subjects: Maths English Hindi Science Physical_Edu Accounts
But better use strip() (or even better: rstrip()) as provided in other answers like:
string = "Subjects: {}".format(" ".join(map(str.rstrip, lst)))
print(string)
strip() each element of the string and then join() with a space in between them.
a = ['Maths\n', 'English\n', 'Hindi\n', 'Science\n', 'Physical_Edu\n', 'Accounts\n', '\n']
print("Subjects: " +" ".join(map(lambda x:x.strip(), a)))
Output:
Subjects: Maths English Hindi Science Physical_Edu Accounts
As pointed out by @miindlek, you can also achieve the same thing by using map(str.strip, a) in place of map(lambda x: x.strip(), a).
What you can do is strip the newline from each element first and then join the stripped elements using:
joined_string = " ".join(stripped_array)
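A minimal end-to-end version of the same idea, which also drops the empty final element:
values = ['Maths\n', 'English\n', 'Hindi\n', 'Science\n', 'Physical_Edu\n', 'Accounts\n', '\n']
# strip() each element and keep only the non-empty results
stripped = [v.strip() for v in values if v.strip()]
print("Subjects: " + " ".join(stripped))
# Subjects: Maths English Hindi Science Physical_Edu Accounts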

splitting strings using re.split

I have multiple strings (>1000) of the form:
\r\nSenor Sisig\nThe Chairman\nCupkates\nLittle Green Cyclo\nSanguchon\nSeoul on Wheels\nKasa Indian\n\nGo Streatery\nWhip Out!\nLiba Falafel\nGrilled Cheese Bandits\r\n
The strings may have whitespace before the '\n'.
How do I split these strings (in an efficient way) so as to avoid getting any empty or duplicate (the whitespace case) elements?
I was using:
re.split(r'\r|\n', str)
EDIT:
some more examples:
\r\nThe Creme Brulee Cart \r\nCurry Up Now\r\nKoJa Kitchen\r\nAn the Go\r\nPacific Puffs\r\nEbbett's Good to Go\r\nFiveten Burger\r\nGo Streatery\r\nHiyaaa\r\nSAJJ\r\nKinder's Truck\r\nBlue Saigon\r
\r\nThe Chairman\r\nSanguchon\r\nSeoul on Wheels\r\nGo Streatery\r\nStreet Dog Truck\r\nKinder's Truck\r\nYummi BBQ\r\nLexie's Frozen Custard\r\nDrewski's Hot Rod Kitchen\r
\n An the Go \n Cheese Gone Wild \n Cupkates \n Curry Up Now \n Fins on the Hoof\n KoJa Kitchen\n Lobsta Truck \n Oui Chef \n Sanguchon\n Senor Sisig \n The Chairman \n The Rib Whip
thanks!
Your example doesn't show any "whitespace before the \n" except for a single optional \r.
If that's all you're trying to handle, instead of splitting on either \r or \n, split on a possible \r and a definite \n:
re.split(r"\r?\n", s)
Of course that's assuming you don't have any bare \r without \n to handle. If you need to handle \r, \r\n, and \n all equally (similar to Python's universal newline support…):
re.split(r"\r\n|\r|\n", s)
Or, more simply:
re.split(r"[\r\n]+", s)
If you want to remove leading spaces, tabs, multiple \r, etc., you could do that in the regexp, or just call lstrip on each result:
map(str.lstrip, re.split(r"\r|\n", s))
… but that can leave you with empty elements. You could filter those out, but it's probably better to just split on any run of whitespace that ends with a \n instead:
re.split(r"\s*\n", s)
That will still leave empty elements at the start and end, because your string starts and ends with newlines, and that's what re.split is supposed to do. If you want to eliminate them, you can either strip the string before parsing, or toss the end values after parsing:
re.split(r"\s*\n", s.strip())
re.split(r"\s*\n", s)[1:-1]
I think one of these last two is exactly what you want… but that's really just a guess based on the limited information you gave. If not, then one of the others (along with its explanation) should hopefully be enough for you to write what you really want.
From your new examples, it looks like what you really want to split on is any run of whitespace that includes at least one \n. And your input may or may not have newlines at the start and end (your first example has both, your second has \r\n at the start but nothing at the end…), and you want to ignore them if it does. So:
re.split(r"\s*\n\s*", s.strip())
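For instance, on a trimmed-down version of the third example:
>>> import re
>>> s = "\n An the Go \n Cheese Gone Wild \n Cupkates \n"
>>> re.split(r"\s*\n\s*", s.strip())
['An the Go', 'Cheese Gone Wild', 'Cupkates']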
However, at this point, it might be worth asking why you're trying to parse this as a string instead of as a text file. Assuming you got these from some file or file-like object, instead of this:
with open(path, 'r') as f:
    s = f.read()
results = re.split(regexpr, s.strip())
… something like this might be a lot more readable, and more than fast enough (maybe not as fast as the optimal regexp, but still so fast that any wasted string-processing time is swamped by the actual file reading time anyway):
with open(path, 'r') as f:
    results = filter(None, map(str.strip, f))
Especially if you just want to iterate over this list once, in which case (assuming either Python 3.x, or using ifilter and imap from itertools if 2.x) this version doesn't have to read the whole file into memory and process it before you start doing your actual work.
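For example, a sketch of that lazy, one-pass style (process here is just a placeholder for whatever you do with each name):
with open(path, 'r') as f:
    for line in f:
        name = line.strip()
        if name:              # skip blank and whitespace-only lines
            process(name)     # placeholder for the real per-name work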
re.split(r'[\s\n\r]+', str.strip())
>>> s = "\r\nSenor Sisig\nThe Chairman\nCupkates\nLittle Green Cyclo\nSanguchon\nSeoul on Wheels\nKasa Indian\n\nGo Streatery\nWhip Out!\nLiba Falafel\nGrilled Cheese Bandits\r\n"
>>> [x for x in s.strip("\r\n").split("\n") if x]
['Senor Sisig', 'The Chairman', 'Cupkates', 'Little Green Cyclo', 'Sanguchon', 'Seoul on Wheels', 'Kasa Indian', 'Go Streatery', 'Whip Out!', 'Liba Falafel', 'Grilled Cheese Bandits']
If you insist on regex
>>> import re
>>> re.split(r"[\r\n]+", s.strip("\r\n"))
['Senor Sisig', 'The Chairman', 'Cupkates', 'Little Green Cyclo', 'Sanguchon', 'Seoul on Wheels', 'Kasa Indian', 'Go Streatery', 'Whip Out!', 'Liba Falafel', 'Grilled Cheese Bandits']
Just filter out the empty values (ifilter lives in itertools on Python 2; on Python 3 use the built-in filter):
list(ifilter(None, re.split(r"\r|\n", your_string)))
Python's regular expressions offer the \s character class, which matches any whitespace in [ \t\n\r\f\v] (unless the UNICODE flag is set, in which case it depends on the character database in use).
As mentioned in the other answers (@abarnert), your regex could be \s*\n, which is zero or more whitespace characters ending with a \n. Below is an example.
In [1]: import re
In [2]: from itertools import ifilter
In [3]: my_string = """\r\nSenor Sisig \nThe Chairman\nCupkates\nLittle Green Cyclo\nSanguchon\nSeoul on Wheels\nKasa Indian\n\nGo Streatery\nWhip Out!\nLiba Falafel\nGrilled Cheese Bandits\r\n"""
In [4]: list(ifilter(None, re.split(r"\s*\n", my_string)))
Out[4]:
['Senor Sisig',
'The Chairman',
'Cupkates',
'Little Green Cyclo',
'Sanguchon',
'Seoul on Wheels',
'Kasa Indian',
'Go Streatery',
'Whip Out!',
'Liba Falafel',
'Grilled Cheese Bandits']
Note that I'm using ifilter from the itertools package. You could use filter or a list comp.
Like so:
[x for x in re.split("\s*\n", my_string) if x]

Python, re.search / re.split for phrases which look like a title, i.e. start with an upper-case letter

I have a list of phrases (input by the user) that I'd like to locate in a text file, for example:
titles = ['Blue Team', 'Final Match', 'Best Player',]
text = 'In today Final match, The Best player is Joe from the Blue Team and the second best player is Jack from the Red team.'
1. I can find all the occurrences of these phrases like so:
titre = re.compile(r'(?P<title>%s)' % '|'.join(titles), re.M)
list = [ t for t in titre.split(text) if titre.search(t) ]
(For simplicity, I am assuming a perfect spacing.)
2. I can also find variants of these phrases, e.g. 'Blue team', 'final Match', 'best player' ..., using re.I, if they ever appear in the text.
But I want to restrict matching to variants of the input phrases with their first letter upper-cased, e.g. 'Blue team' in the text, regardless of how they were entered as input, e.g. 'bluE tEAm'.
Is it possible to write something to "block" the re.I flag for a portion of a phrase? In pseudo-code, I imagine generating something like '[B]lue Team|[F]inal Match'.
Note: My primary goal is not, for example, calculating frequency of the input phrases in the text but extracting and analyzing the text fragments between or around them.
I would use re.I and modify the list-comp to:
l = [ t for t in titre.split(text) if titre.search(t) and t[0].isupper() ]
I think regular expressions won't let you specify just a region where the ignore-case flag is applicable. However, you can generate a new version of the text in which every character is lower-cased except the first one of each word:
new_text = ' '.join([word[0] + word[1:].lower() for word in text.split()])
This way, a regular expression without the ignore flag will match taking into account the casing only for the first character of each word.
How about modifying the input so that it is in the correct case before you use it in the regular expression?
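One way to realize the '[B]lue Team|[F]inal Match' idea from the question without re.I at all is to generate the per-character alternatives yourself: force the first letter to be upper-case and accept either case for the rest. (Python 3.6+ also accepts scoped inline flags such as (?i:...) inside a pattern, but the generated pattern below works on any version.) A minimal sketch; the helper name first_upper_pattern is mine, not from the answers above:
import re

def first_upper_pattern(phrase):
    # First letter must be upper-case; remaining letters may be either case.
    first, rest = phrase[0], phrase[1:]
    parts = [re.escape(first.upper())]
    for c in rest:
        parts.append('[%s%s]' % (c.lower(), c.upper()) if c.isalpha() else re.escape(c))
    return ''.join(parts)

titles = ['bluE tEAm', 'final Match', 'best player']
text = ('In today Final match, The Best player is Joe from the Blue Team '
        'and the second best player is Jack from the Red team.')

titre = re.compile('|'.join(first_upper_pattern(t) for t in titles))
print(titre.findall(text))   # ['Final match', 'Best player', 'Blue Team']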

Python parsing

I'm trying to parse the title tag in an RSS 2.0 feed into three different variables for each entry in that feed. Using ElementTree I've already parsed the RSS so that I can print each title [minus the trailing )] with the code below:
feed = getfeed("http://www.tourfilter.com/dallas/rss/by_concert_date")
for item in feed:
    print repr(item.title[0:-1])
I include that because, as you can see, the item.title is a repr() data type, which I don't know much about.
A particular repr(item.title[0:-1]) printed in the interactive window looks like this:
'randy travis (Billy Bobs 3/21'
'Michael Schenker Group (House of Blues Dallas 3/26'
The user selects a band and I hope to, after parsing each item.title into 3 variables (one each for band, venue, and date... or possibly an array or I don't know...) select only those related to the band selected. Then they are sent to Google for geocoding, but that's another story.
I've seen some examples of regex and I'm reading about them, but it seems very complicated. Is it? I thought maybe someone here would have some insight as to exactly how to do this in an intelligent way. Should I use the re module? Does it matter that the output is currently repr()s? Is there a better way? I was thinking I'd use a loop like this (and this is my pseudo-Python, just the kind of notes I'm writing):
list = bandRaw, venue, date, latLong
for item in feed:
    parse item.title for bandRaw, venue, date
    if bandRaw == str(band)
        send venue name + ", Dallas, TX" to google for geocoding
        return lat, long
        list = list + return character + bandRaw + "," + venue + "," + date + "," + lat + "," + long
    else
In the end, I need to have the chosen entries in a .csv (comma-delimited) file looking like this:
band,venue,date,lat,long
randy travis,Billy Bobs,3/21,1234.5678,1234.5678
Michael Schenker Group,House of Blues Dallas,3/26,4321.8765,4321.8765
I hope this isn't too much to ask. I'll be looking into it on my own, just thought I should post here to make sure it got answered.
So, the question is, how do I best parse each repr(item.title[0:-1]) in the feed into the 3 separate values that I can then concatenate into a .csv file?
Don't let regex scare you off... it's well worth learning.
Given the examples above, you might try putting the trailing parenthesis back in, and then using this pattern:
import re
s = 'Michael Schenker Group (House of Blues Dallas 3/26)'  # trailing ')' put back in
pat = re.compile(r'([\w\s]+)\(([\w\s]+)(\d+/\d+)\)')
info = pat.match(s)
print info.groups()
('Michael Schenker Group ', 'House of Blues Dallas ', '3/26')
To get at each group individually, just call group() on the info object:
print info.group(1) # or info.groups()[0]
print '"%s","%s","%s"' % (info.group(1), info.group(2), info.group(3))
"Michael Schenker Group","House of Blues Dallas","3/26"
The hard thing about regex in this case is making sure you know all the known possible characters in the title. If there are non-alpha chars in the 'Michael Schenker Group' part, you'll have to adjust the regex for that part to allow them.
The pattern above breaks down as follows, which is parsed left to right:
([\w\s]+) : Match any word or space characters (the plus symbol indicates that there should be one or more such characters). The parentheses mean that the match will be captured as a group. This is the "Michael Schenker Group " part. If there can be numbers and dashes here, you'll want to modify the pieces between the square brackets, which are the possible characters for the set.
\( : A literal parenthesis. The backslash escapes the parenthesis, since otherwise it counts as a regex command. This is the "(" part of the string.
([\w\s]+) : Same as the one above, but this time matches the "House of Blues Dallas " part. In parentheses so they will be captured as the second group.
(\d+/\d+) : Matches the digits 3 and 26 with a slash in the middle. In parentheses so they will be captured as the third group.
\) : Closing parenthesis for the above.
The python intro to regex is quite good, and you might want to spend an evening going over it http://docs.python.org/library/re.html#module-re. Also, check Dive Into Python, which has a friendly introduction: http://diveintopython3.ep.io/regular-expressions.html.
EDIT: See zacherates below, who has some nice edits. Two heads are better than one!
Regular expressions are a great solution to this problem:
>>> import re
>>> s = 'Michael Schenker Group (House of Blues Dallas 3/26'
>>> re.match(r'(.*) \((.*) (\d+/\d+)', s).groups()
('Michael Schenker Group', 'House of Blues Dallas', '3/26')
As a side note, you might want to look at the Universal Feed Parser for handling the RSS parsing as feeds have a bad habit of being malformed.
Edit
In regards to your comment... The strings occasionally being wrapped in "s rather than 's has to do with the fact that you're using repr. The repr of a string is usually delimited with 's, unless that string contains one or more 's, where instead it uses "s so that the 's don't have to be escaped:
>>> "Hello there"
'Hello there'
>>> "it's not its"
"it's not its"
Notice the different quote styles.
Regarding the repr(item.title[0:-1]) part, not sure where you got that from, but I'm pretty sure you can simply use item.title. All you're doing is removing the last char from the string and then calling repr() on it, which only adds the quotes you see in the printed output.
Your code should look something like this:
import re
import geocoders  # from GeoPy
import feedparser  # from www.feedparser.org

us = geocoders.GeocoderDotUS()
feedurl = "http://www.tourfilter.com/dallas/rss/by_concert_date"
feed = feedparser.parse(feedurl)

lines = []
for entry in feed.entries:
    m = re.search(r'(.*) \((.*) (\d+/\d+)\)', entry.title)
    if m:
        bandRaw, venue, date = m.groups()
        if band == bandRaw:  # 'band' is the band the user selected
            place, (lat, lng) = us.geocode(venue + ", Dallas, TX")
            lines.append(",".join([band, venue, date, str(lat), str(lng)]))
result = "\n".join(lines)
EDIT: replaced list with lines as the var name. list is a builtin and should not be used as a variable name. Sorry.
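To end up with the comma-delimited file described in the question, a short follow-up sketch (the filename shows.csv is just an example):
# write the header row plus the collected lines to a CSV file
with open("shows.csv", "w") as out:
    out.write("band,venue,date,lat,long\n")
    out.write(result + "\n")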
