splitting strings using re.split - python

I have multiple strings (>1000) of the form:
\r\nSenor Sisig\nThe Chairman\nCupkates\nLittle Green Cyclo\nSanguchon\nSeoul on Wheels\nKasa Indian\n\nGo Streatery\nWhip Out!\nLiba Falafel\nGrilled Cheese Bandits\r\n
The strings may have a whitespace before the '\n'
How do I split these strings (in an efficient way) so as to avoid getting any empty or duplicate (the whitespace case) elements?
I was using:
re.split(r'\r|\n', str)
EDIT:
some more examples:
\r\nThe Creme Brulee Cart \r\nCurry Up Now\r\nKoJa Kitchen\r\nAn the Go\r\nPacific Puffs\r\nEbbett's Good to Go\r\nFiveten Burger\r\nGo Streatery\r\nHiyaaa\r\nSAJJ\r\nKinder's Truck\r\nBlue Saigon\r
\r\nThe Chairman\r\nSanguchon\r\nSeoul on Wheels\r\nGo Streatery\r\nStreet Dog Truck\r\nKinder's Truck\r\nYummi BBQ\r\nLexie's Frozen Custard\r\nDrewski's Hot Rod Kitchen\r
\n An the Go \n Cheese Gone Wild \n Cupkates \n Curry Up Now \n Fins on the Hoof\n KoJa Kitchen\n Lobsta Truck \n Oui Chef \n Sanguchon\n Senor Sisig \n The Chairman \n The Rib Whip
thanks!

Your example doesn't show any "whitespace before the \n" except for a single optional \r.
If that's all you're trying to handle, instead of splitting on either \r or \n, split on a possible \r and a definite \n:
re.split(r"\r?\n", s)
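Of course that's assuming you don't have any bare \r without \n to handle. If you need to handle \r, \r\n, and \n all equally (similar to Python's universal newline support…), put \r\n first in the alternation so it is consumed as a single separator:
re.split(r"\r\n|\r|\n", s)
Or, more simply, if any run of line breaks should count as one separator:
re.split(r"[\r\n]+", s)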
If you want to remove leading spaces, tabs, multiple \r, etc., you could do that in the regexp, or just call lstrip on each result:
map(str.lstrip, re.split(r"\r|\n", s))
… but that can leave you with empty elements. You could filter those out, but it's probably better to just split on any run of whitespace that ends with a \n instead:
re.split(r"\s*\n", s)
That will still leave empty elements at the start and end, because your string starts and ends with newlines, and that's what re.split is supposed to do. If you want to eliminate them, you can either strip the string before parsing, or toss the end values after parsing:
re.split(r"\s*\n", s.strip())
re.split(r"\s*\n", s)[1:-1]
I think one of these last two is exactly what you want… but that's really just a guess based on the limited information you gave. If not, then one of the others (along with its explanation) should hopefully be enough for you to write what you really want.
From your new examples, it looks like what you really want to split on is any run of whitespace that includes at least one \n. And your input may or may not have newlines at the start and end (your first example has both, your second has \r\n at the start but nothing at the end…), and you want to ignore them if it does. So:
re.split(r"\s*\n\s*", s.strip())
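A quick check of that last pattern, as a minimal sketch. The helper name split_names is made up for illustration, and the sample strings are shortened versions of the ones in the question:

```python
import re

def split_names(s):
    """Split on any run of whitespace that contains at least one newline."""
    s = s.strip()
    # Guard the empty case: re.split on "" would return [""]
    return re.split(r"\s*\n\s*", s) if s else []

first = "\r\nSenor Sisig\nThe Chairman\nKasa Indian\n\nGo Streatery\r\n"
padded = "\n An the Go \n Cheese Gone Wild \n Cupkates"

print(split_names(first))
print(split_names(padded))
```

Spaces inside a name are left alone because the separator must contain a \n; only whitespace touching a line break is consumed.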
However, at this point, it might be worth asking why you're trying to parse this as a string instead of as a text file. Assuming you got these from some file or file-like object, instead of this:
with open(path, 'r') as f:
    s = f.read()
results = re.split(regexpr, s.strip())
… something like this might be a lot more readable, and more than fast enough (maybe not as fast as the optimal regexp, but still so fast that any wasted string-processing time is swamped by the actual file reading time anyway):
with open(path, 'r') as f:
    results = filter(None, map(str.strip, f))
Especially if you just want to iterate over this list once, in which case (assuming either Python 3.x, or using ifilter and imap from itertools if 2.x) this version doesn't have to read the whole file into memory and process it before you start doing your actual work.
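As a rough Python 3 sketch of that line-by-line idea: iter_names is a hypothetical helper, and io.StringIO only stands in for a real open file so the example runs on its own.

```python
import io

def iter_names(lines):
    """Yield each non-blank line with surrounding whitespace removed."""
    for line in lines:
        name = line.strip()
        if name:
            yield name

# io.StringIO stands in for an open text file here (an assumption for the demo)
fake_file = io.StringIO("\n An the Go \n\n Cupkates \nCurry Up Now\n")
names = list(iter_names(fake_file))
print(names)
```

With a real file you would loop over iter_names(f) inside the with block, so the lazy generator is consumed before the file is closed.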

re.split(r'\s*[\r\n]+\s*', s.strip())

>>> s = "\r\nSenor Sisig\nThe Chairman\nCupkates\nLittle Green Cyclo\nSanguchon\nSeoul on Wheels\nKasa Indian\n\nGo Streatery\nWhip Out!\nLiba Falafel\nGrilled Cheese Bandits\r\n"
>>> [x for x in s.strip("\r\n").split("\n") if x]
['Senor Sisig', 'The Chairman', 'Cupkates', 'Little Green Cyclo', 'Sanguchon', 'Seoul on Wheels', 'Kasa Indian', 'Go Streatery', 'Whip Out!', 'Liba Falafel', 'Grilled Cheese Bandits']
If you insist on regex
>>> import re
>>> re.split(r"[\r\n]+", s.strip("\r\n"))
['Senor Sisig', 'The Chairman', 'Cupkates', 'Little Green Cyclo', 'Sanguchon', 'Seoul on Wheels', 'Kasa Indian', 'Go Streatery', 'Whip Out!', 'Liba Falafel', 'Grilled Cheese Bandits']
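If a regex is not required at all, str.splitlines is another option worth a look, since it already treats \r, \n, and \r\n as line boundaries. A small sketch (clean_lines is an illustrative name, and the sample string is made up from names in the question):

```python
def clean_lines(s):
    """Split on any line boundary, strip each piece, and drop the empty ones."""
    stripped = (line.strip() for line in s.splitlines())
    return [line for line in stripped if line]

mixed = "\r\nThe Chairman \r\nSanguchon\rSeoul on Wheels\n\nGo Streatery\r"
print(clean_lines(mixed))
```

Be aware that splitlines also breaks on a few less common separators (form feed, vertical tab, and some Unicode line separators), which may or may not matter for this data.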

Just filter out the empty values
list(ifilter(None, re.split(r"\r|\n", your_string)))
Python's regular expressions offer the \s character class, which matches any whitespace in [ \t\n\r\f\v] (unless the UNICODE flag is set, in which case it depends on the character database in use).
As mentioned in the other answers (@abarnert), your regex could be \s*\n, which is zero or more whitespace characters ending with a \n. Below is an example.
In [1]: import re
In [2]: from itertools import ifilter
In [3]: my_string = """\r\nSenor Sisig \nThe Chairman\nCupkates\nLittle Green Cyclo\nSanguchon\nSeoul on Wheels\nKasa Indian\n\nGo Streatery\nWhip Out!\nLiba Falafel\nGrilled Cheese Bandits\r\n"""
In [4]: list(ifilter(None, re.split(r"\s*\n", my_string)))
Out[4]:
['Senor Sisig',
'The Chairman',
'Cupkates',
'Little Green Cyclo',
'Sanguchon',
'Seoul on Wheels',
'Kasa Indian',
'Go Streatery',
'Whip Out!',
'Liba Falafel',
'Grilled Cheese Bandits']
Note that I'm using ifilter from the itertools package. You could use filter or a list comp.
Like so:
[x for x in re.split(r"\s*\n", my_string) if x]

Related

Regex for matching alphabet, numbers and special characters while looping in python

I am trying to find words and print them using the code below. Everything works perfectly; the only issue is that I am unable to print the last word (which is a number).
words = ['Town of','Block No.','Lot No.','Premium (if any) Paid ']
import re
for i in words:
    y = re.findall('{} ([^ ]*)'.format(i), textfile)
    print(y)
The text I am working with:
textfile = """1, REBECCA M. ROTH , COLLECTOR OF TAXES of the taxing district of the
township of MORRIS for Six Hundred Sixty Seven dollars andFifty Two cents, the land
in said taxing district described as Block No. 10303 Lot No. 10 :
and known as 239 E HANOVER AVE , on the tax Taxes For: 2012
Sewer
Assessments For Improvements
Total Cost of Sale 35.00
Total
Premium (if any) Paid 1,400.00 """
I would like to know where I am making a mistake.
Any suggestion is appreciated.
A couple of issues:
As others have mentioned, you need to escape special characters like parentheses ( ) and dots .. Very simply, you can use re.escape
Another issue is the trailing space in Premium \(if any\) Paid (it's trying to match two spaces instead of one as you're also checking for a space in your regex {} ([^ ]*))
You should instead change your code to the following:
words = ['Town of','Block No.','Lot No.','Premium (if any) Paid']
import re
for i in words:
    y = re.findall('{} ([^ ]*)'.format(re.escape(i)), textfile)
    print(y)
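To see why the escaping matters, here is a small side-by-side sketch. The text variable is a shortened, made-up excerpt of the asker's document, not the full input:

```python
import re

label = "Premium (if any) Paid"
# Shortened sample text, assumed for illustration
text = "Total Cost of Sale 35.00 Premium (if any) Paid 1,400.00 "

# Unescaped: "(if any)" becomes a capturing group that matches the bare
# text "if any", so the literal parentheses in the document never match.
unescaped = re.findall("{} ([^ ]*)".format(label), text)

# Escaped: the parentheses are treated as literal characters.
escaped = re.findall("{} ([^ ]*)".format(re.escape(label)), text)

print(unescaped)
print(escaped)
```

The same applies to the dot in 'Lot No.': unescaped, it matches any character rather than only a literal period, which happens to work here but is not what was meant.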
Two problems:
Your current 'Premium (if any) Paid ' string ends on a space, and '{} ([^ ]*)' also has a space after {}, which adds them together. Delete the trailing space in 'Premium (if any) Paid '.
You need to escape the parentheses, so if you want to keep your regular expression unchanged, the string in the list should be ['Premium \(if any\) Paid']. You can also use re.escape instead.
For your particular cases, this seems to be an optimal solution:
words = ['Town of','Block No.','Lot No.','Premium (if any) Paid']
import re
for i in words:
    # \s+ tolerates any amount of whitespace after the label; re.I ignores case
    y = re.findall(r'{}\s+(\S*)'.format(re.escape(i)), textfile, re.I)
    print(y)

String separation by commas, but with a condition (ignore comma separated single word)

With the following code (a bit messy, I acknowledge) I separate a string by commas, but the condition is that it doesn't separate when the string contains comma separated single words, for example:
It doesn't separate "Yup, there's a reason why you want to hit the sack just minutes after climax" but it separates "The increase in heart rate, which you get from masturbating, is directly beneficial to the circulation, and can reduce the likelihood of a heart attack" into ['The increase in heart rate', 'which you get from masturbating', 'is directly beneficial to the circulation', 'and can reduce the likelihood of a heart attack']
The problem is that the code fails its purpose when it encounters a string like this: "When men ejaculate, it releases a slew of chemicals including oxytocin, vasopressin, and prolactin, all of which naturally help you hit the pillow." I don't want a separation after oxytocin, but I do want one after prolactin. I need a regex to do that.
import os
import textwrap
import re
import io
from textblob import TextBlob
string = str(input_string)
listy= [x.strip() for x in string.split(',')]
listy = [x.replace('\n', '') for x in listy]
listy = [re.sub(r'(?<!\d)\.(?!\d)', '', x) for x in listy]
listy = list(filter(None, listy)) # Remove any empty strings
newstring= []
for segment in listy:
    wc = TextBlob(segment).word_counts
    if listy[len(listy)-1] != segment:
        if len(wc) > 3: # len(segment.split(' ')) > 7:
            newstring.append(segment+"&&")
        else:
            newstring.append(segment+",")
    else:
        newstring.append(segment)
sep = [x.strip() for x in (' '.join(newstring)).split('&&')]
Consider the following:
import re
mystr="When men ejaculate, it releases a slew of chemicals including oxytocin, vasopressin, and prolactin, all of which naturally help you hit the pillow."
rExp=r",(?!\s+(?:and\s+)?\w+,)"
mylst=re.compile(rExp).split(mystr)
print(mylst)
This should give the following output:
['When men ejaculate', ' it releases a slew of chemicals including oxytocin, vasopressin, and prolactin', ' all of which naturally help you hit the pillow.']
Let's look at how we split the string...
,(?!\s+\w+,)
Split on every comma that is not followed by \s+\w+, (whitespace, a single word, and then another comma). The (?! ... ) construct is a negative lookahead.
On its own, that would still split after vasopressin, because the next word, and, is not directly followed by a comma. So add an optional and\s+ inside the lookahead:
,(?!\s+(?:and\s+)?\w+,)
You might prefer the following variant, which also allows or before the last item:
,(?!\s+(?:(?:and|or)\s+)?\w+,)
In essence consider replacing your line
listy= [x.strip() for x in string.split(',')]
with
listy= [x.strip() for x in re.split(r",(?!\s+(?:and\s+)?\w+,)",string)]
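Here is the and/or variant wrapped in a small function, as a sketch only. The names LIST_AWARE_COMMA and split_clauses, and the two sample sentences, are made up for illustration:

```python
import re

# A comma is a split point unless it is followed by a single word
# (optionally preceded by "and"/"or") and then another comma.
LIST_AWARE_COMMA = re.compile(r",(?!\s+(?:(?:and|or)\s+)?\w+,)")

def split_clauses(sentence):
    """Split on clause commas while keeping short comma-separated lists intact."""
    return [part.strip() for part in LIST_AWARE_COMMA.split(sentence)]

print(split_clauses("We sell apples, pears, and plums, all of which are fresh."))
print(split_clauses("Pick red, green, or blue, then press start."))
```

Note that this only protects commas inside a list of single words. A lone leading word such as "Yup," is still split off, so that case from the question would need separate handling.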

Removing white space from txt with python

I have a .txt file (scraped as pre-formatted text from a website) where the data looks like this:
B, NICKOLAS                       CT144531X       D1026    JUDGE ANNIE WHITE JOHNSON
ANDREWS VS BALL                   JA-15-0050      D0015    JUDGE EDWARD A ROBERTS
I'd like to remove all extra spaces (they're actually different number of spaces, not tabs) in between the columns. I'd also then like to replace it with some delimiter (tab or pipe since there's commas within the data), like so:
ANDREWS VS BALL|JA-15-0050|D0015|JUDGE EDWARD A ROBERTS
Looked around and found that the best options are using regex or shlex to split. Two similar scenarios:
Python Regular expression must strip whitespace except between quotes,
Remove white spaces from dict : Python.
You can apply the regex '\s{2,}' (two or more whitespace characters) to each line and substitute the matches with a single '|' character.
>>> import re
>>> line = 'ANDREWS VS BALL     JA-15-0050  D0015   JUDGE EDWARD A ROBERTS   '
>>> re.sub(r'\s{2,}', '|', line.strip())
'ANDREWS VS BALL|JA-15-0050|D0015|JUDGE EDWARD A ROBERTS'
Stripping any leading and trailing whitespace from the line before applying re.sub ensures that you won't get '|' characters at the start and end of the line.
Your actual code should look similar to this:
import re
with open(filename) as f:
    for line in f:
        subbed = re.sub(r'\s{2,}', '|', line.strip())
        # do something here
What about this?
import re
your_string = 'ANDREWS VS BALL     JA-15-0050  D0015   JUDGE EDWARD A ROBERTS'
print(re.sub(r'\s{2,}', '|', your_string.strip()))
Output:
ANDREWS VS BALL|JA-15-0050|D0015|JUDGE EDWARD A ROBERTS
Explanation:
I've used re.sub(), which takes three parameters: a pattern, the replacement string, and the string you want to work on.
Here the pattern matches any run of at least two whitespace characters, and each run is replaced with a single |.
s = """B, NICKOLAS    CT144531X   D1026    JUDGE ANNIE WHITE JOHNSON
ANDREWS VS BALL     JA-15-0050  D0015   JUDGE EDWARD A ROBERTS
"""
# Update
re.sub(r"(\S)\ {2,}(\S)(\n?)", r"\1|\2\3", s)
In [71]: print re.sub(r"(\S)\ {2,}(\S)(\n?)", r"\1|\2\3", s)
B, NICKOLAS|CT144531X|D1026|JUDGE ANNIE WHITE JOHNSON
ANDREWS VS BALL|JA-15-0050|D0015|JUDGE EDWARD A ROBERTS
Considering there are at least two spaces separating the columns, you can use this:
lines = [
    'B, NICKOLAS    CT144531X   D1026    JUDGE ANNIE WHITE JOHNSON  ',
    'ANDREWS VS BALL     JA-15-0050  D0015   JUDGE EDWARD A ROBERTS   '
]
for line in lines:
    parts = []
    for part in line.split('  '):  # split on two spaces
        part = part.strip()
        if part: # checking if stripped part is a non-empty string
            parts.append(part)
    print('|'.join(parts))
Output for your input:
B, NICKOLAS|CT144531X|D1026|JUDGE ANNIE WHITE JOHNSON
ANDREWS VS BALL|JA-15-0050|D0015|JUDGE EDWARD A ROBERTS
It looks like your data is in a "text-table" format.
I recommend using the first row to figure out the start point and length of each column (either by hand or write a script with regex to determine the likely columns), then writing a script to iterate the rows of the file, slice the row into column segments, and apply strip to each segment.
If you use a regex, you must keep track of the number of columns and raise an error if any given row has more than the expected number of columns (or a different number than the rest). Splitting on two-or-more spaces will break if a column's value has two-or-more spaces, which is not just entirely possible, but also likely. Text-tables like this aren't designed to be split on a regex, they're designed to be split on the column index positions.
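If you do go the regex route, the column-count check could look roughly like this. It is a sketch: split_row and the expected_columns default of 4 are assumptions based on the sample rows.

```python
import re

def split_row(line, expected_columns=4):
    """Split on runs of 2+ whitespace and fail loudly on an unexpected column count."""
    fields = re.split(r"\s{2,}", line.strip())
    if len(fields) != expected_columns:
        raise ValueError(
            "expected %d columns, got %d: %r" % (expected_columns, len(fields), line)
        )
    return fields

row = "ANDREWS VS BALL    JA-15-0050   D0015    JUDGE EDWARD A ROBERTS  "
print(split_row(row))
```

A row whose value contains a double space, or where two columns are separated by a single space, raises instead of silently shifting data into the wrong column.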
In terms of saving the data, you can use the csv module to write/read into a csv file. That will let you handle quoting and escaping characters better than specifying a delimiter. If one of your columns has a | character as a value, unless you're encoding the data with a strategy that handles escapes or quoted literals, your output will break on read.
Parsing the text above would look something like this (I nested a list comprehension with brackets instead of the traditional format so it's easier to understand):
cols = ((0, 34),
        (34, 50),
        (50, 59),
        (59, None),
        )
for line in lines:
    cleaned = [i.strip() for i in [line[s:e] for (s, e) in cols]]
    print(cleaned)
then you can write it with something like:
import csv
# 'wb' is for Python 2; on Python 3 use open('output.csv', 'w', newline='')
with open('output.csv', 'wb') as csvfile:
    spamwriter = csv.writer(csvfile, delimiter='|',
                            quotechar='"', quoting=csv.QUOTE_MINIMAL)
    for line in lines:
        spamwriter.writerow([line[col_start:col_end].strip()
                             for (col_start, col_end) in cols
                             ])
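Putting the slicing and the csv writer together, a self-contained Python 3 sketch might look like this. The WIDTHS values are hypothetical (they are not the real column widths of the scraped file), the sample lines are padded in code so the layout is explicit, and io.StringIO stands in for the output file:

```python
import csv
import io

WIDTHS = (20, 12, 8)  # hypothetical widths of the first three columns

# Turn the widths into (start, end) slices; the last column runs to end of line.
COLS = []
start = 0
for width in WIDTHS:
    COLS.append((start, start + width))
    start += width
COLS.append((start, None))

def parse_row(line):
    """Slice one fixed-width line into stripped column values."""
    return [line[s:e].strip() for s, e in COLS]

records = [
    ("B, NICKOLAS", "CT144531X", "D1026", "JUDGE ANNIE WHITE JOHNSON"),
    ("ANDREWS VS BALL", "JA-15-0050", "D0015", "JUDGE EDWARD A ROBERTS"),
]
# Pad the sample data to the assumed widths to build fixed-width lines.
lines = [
    "".join(field.ljust(width) for field, width in zip(rec, WIDTHS)) + rec[-1]
    for rec in records
]

buffer = io.StringIO()  # stands in for open("output.csv", "w", newline="")
writer = csv.writer(buffer, delimiter="|", quotechar='"', quoting=csv.QUOTE_MINIMAL)
for line in lines:
    writer.writerow(parse_row(line))

print(buffer.getvalue())
```

Because the csv module does the writing, a value that happens to contain | would be quoted on output and read back correctly with the same dialect settings.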
Looks like this library can solve this quite nicely:
http://docs.astropy.org/en/stable/io/ascii/fixed_width_gallery.html#fixed-width-gallery
Impressive...

Python string grouping?

Basically, I print a long message but I want to group all of those words into 5 character long strings.
For example "iPhone 6 isn’t simply bigger — it’s better in every way. Larger, yet dramatically thinner." I want to make that
"iPhon 6isn' tsimp lybig ger-i t'sbe terri never yway. Large r,yet drama tical lythi nner. "
As suggested by @vaultah, this is achieved by splitting the string on whitespace and joining the pieces back together without spaces, then slicing the result into fixed-size chunks. An elegant way to do the slicing is a list comprehension.
text = "iPhone 6 isn’t simply bigger — it’s better in every way. Larger, yet dramatically thinner."
joined_text = ''.join(text.split())
chunks = [joined_text[start:start+5] for start in range(0, len(joined_text), 5)]
' '.join(chunks)
I'm sure you can use the re module to get back dashes and apostrophes as they're meant to be
Simply do the following.
import re
sentence="iPhone 6 isn't simply bigger - it's better in every way. Larger, yet dramatically thinner."
sentence = re.sub(' ', '', sentence)
count=0
new_sentence=''
for i in sentence:
    if(count%5==0 and count!=0):
        new_sentence=new_sentence+' '
    new_sentence=new_sentence+i
    count=count+1
print(new_sentence)
Output:
iPhon e6isn 'tsim plybi gger- it'sb etter ineve ryway .Larg er,ye tdram atica llyth inner .
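The same grouping can be written as a reusable function. This is just a sketch, with group_chars as an illustrative name and the chunk size as a parameter:

```python
def group_chars(text, size=5):
    """Remove all whitespace, then re-insert a space after every `size` characters."""
    compact = "".join(text.split())
    return " ".join(compact[i:i + size] for i in range(0, len(compact), size))

print(group_chars("iPhone 6 isn't simply bigger - it's better in every way."))
# iPhon e6isn 'tsim plybi gger- it'sb etter ineve ryway .
```

Using text.split() with no argument drops tabs and newlines as well as spaces, so multi-line messages are handled the same way.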

Python parsing

I'm trying to parse the title tag in an RSS 2.0 feed into three different variables for each entry in that feed. Using ElementTree I've already parsed the RSS so that I can print each title [minus the trailing )] with the code below:
feed = getfeed("http://www.tourfilter.com/dallas/rss/by_concert_date")
for item in feed:
    print(repr(item.title[0:-1]))
I include that because, as you can see, the item.title is a repr() data type, which I don't know much about.
A particular repr(item.title[0:-1]) printed in the interactive window looks like this:
'randy travis (Billy Bobs 3/21'
'Michael Schenker Group (House of Blues Dallas 3/26'
The user selects a band and I hope to, after parsing each item.title into 3 variables (one each for band, venue, and date... or possibly an array or I don't know...) select only those related to the band selected. Then they are sent to Google for geocoding, but that's another story.
I've seen some examples of regex and I'm reading about them, but it seems very complicated. Is it? I thought maybe someone here would have some insight as to exactly how to do this in an intelligent way. Should I use the re module? Does it matter that the output is currently a set of repr()s? Is there a better way? I was thinking I'd use a loop like this (this is my pseudo-Python, just notes I'm writing):
list = bandRaw,venue,date,latLong
for item in feed:
    parse item.title for bandRaw, venue, date
    if bandRaw == str(band)
        send venue name + ", Dallas, TX" to google for geocoding
        return lat,long
        list = list + return character + bandRaw + "," + venue + "," + date + "," + lat + "," + long
    else
In the end, I need to have the chosen entries in a .csv (comma-delimited) file looking like this:
band,venue,date,lat,long
randy travis,Billy Bobs,3/21,1234.5678,1234.5678
Michael Schenker Group,House of Blues Dallas,3/26,4321.8765,4321.8765
I hope this isn't too much to ask. I'll be looking into it on my own, just thought I should post here to make sure it got answered.
So, the question is, how do I best parse each repr(item.title[0:-1]) in the feed into the 3 separate values that I can then concatenate into a .csv file?
Don't let regex scare you off... it's well worth learning.
Given the examples above, you might try putting the trailing parenthesis back in, and then using this pattern:
import re
s = 'Michael Schenker Group (House of Blues Dallas 3/26)'
pat = re.compile(r'([\w\s]+)\(([\w\s]+)(\d+/\d+)\)')
info = pat.match(s)
print(info.groups())
('Michael Schenker Group ', 'House of Blues Dallas ', '3/26')
To get at each group individually, just call them on the info object; strip() removes the trailing space the pattern captures:
print(info.group(1)) # or info.groups()[0]
print('"%s","%s","%s"' % tuple(g.strip() for g in info.groups()))
"Michael Schenker Group","House of Blues Dallas","3/26"
The hard thing about regex in this case is making sure you know all the known possible characters in the title. If there are non-alpha chars in the 'Michael Schenker Group' part, you'll have to adjust the regex for that part to allow them.
The pattern above breaks down as follows, which is parsed left to right:
([\w\s]+) : Match any word or space characters (the plus symbol indicates that there should be one or more such characters). The parentheses mean that the match will be captured as a group. This is the "Michael Schenker Group " part. If there can be numbers and dashes here, you'll want to modify the pieces between the square brackets, which are the possible characters for the set.
\( : A literal parenthesis. The backslash escapes the parenthesis, since otherwise it counts as a regex command. This is the "(" part of the string.
([\w\s]+) : Same as the one above, but this time matches the "House of Blues Dallas " part. In parentheses so they will be captured as the second group.
(\d+/\d+) : Matches the digits 3 and 26 with a slash in the middle. In parentheses so they will be captured as the third group.
\) : Closing parenthesis for the above.
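The same breakdown can be kept next to the pattern itself by using re.VERBOSE and named groups. This is a variation on the pattern above rather than the exact one: it uses lazy quantifiers so the captured names come back without trailing spaces, and the names TITLE, band, venue, and date are chosen for illustration.

```python
import re

TITLE = re.compile(r"""
    (?P<band>[\w\s]+?)      # band name: word characters and spaces (lazy)
    \s*\(                   # optional spaces, then a literal opening parenthesis
    (?P<venue>[\w\s]+?)     # venue name (lazy, so it stops before the date)
    \s+                     # whitespace between venue and date
    (?P<date>\d+/\d+)       # month/day, for example 3/26
    \)                      # literal closing parenthesis
""", re.VERBOSE)

m = TITLE.match("Michael Schenker Group (House of Blues Dallas 3/26)")
print(m.group("band"), m.group("venue"), m.group("date"), sep=" | ")
```

As with the original, band or venue names containing characters outside [\w\s] (apostrophes, ampersands, and so on) would need the character classes widened.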
The Python documentation's intro to regex is quite good, and you might want to spend an evening going over it: http://docs.python.org/library/re.html#module-re. Also, check Dive Into Python, which has a friendly introduction: http://diveintopython3.ep.io/regular-expressions.html.
EDIT: See zacherates below, who has some nice edits. Two heads are better than one!
Regular expressions are a great solution to this problem:
>>> import re
>>> s = 'Michael Schenker Group (House of Blues Dallas 3/26'
>>> re.match(r'(.*) \((.*) (\d+/\d+)', s).groups()
('Michael Schenker Group', 'House of Blues Dallas', '3/26')
As a side note, you might want to look at the Universal Feed Parser for handling the RSS parsing as feeds have a bad habit of being malformed.
Edit
In regards to your comment... The strings occasionally being wrapped in "s rather than 's has to do with the fact that you're using repr. The repr of a string is usually delimited with 's, unless that string contains one or more 's, where instead it uses "s so that the 's don't have to be escaped:
>>> "Hello there"
'Hello there'
>>> "it's not its"
"it's not its"
Notice the different quote styles.
Regarding the repr(item.title[0:-1]) part, not sure where you got that from but I'm pretty sure you can simply use item.title. All you're doing is removing the last char from the string and then calling repr() on it, which does nothing.
Your code should look something like this:
import re
from geopy import geocoders # from GeoPy
us = geocoders.GeocoderDotUS()
import feedparser # from www.feedparser.org
feedurl = "http://www.tourfilter.com/dallas/rss/by_concert_date"
feed = feedparser.parse(feedurl)
lines = []
for entry in feed.entries:
    m = re.search(r'(.*) \((.*) (\d+/\d+)\)', entry.title)
    if m:
        bandRaw, venue, date = m.groups()
        if band == bandRaw:  # band is the name the user selected
            place, (lat, lng) = us.geocode(venue + ", Dallas, TX")
            # str() in case lat/lng come back as floats; join() needs strings
            lines.append(",".join([band, venue, date, str(lat), str(lng)]))
result = "\n".join(lines)
EDIT: replaced list with lines as the var name. list is a builtin and should not be used as a variable name. Sorry.
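For the .csv output itself, the csv module may be safer than joining with commas by hand, since it quotes any venue or band name that contains a comma. A minimal sketch without the feed or geocoding steps: rows_for_band is a hypothetical helper, the titles are the two from the question with the closing parenthesis restored, and io.StringIO stands in for the output file.

```python
import csv
import io
import re

TITLE_RE = re.compile(r"(.*) \((.*) (\d+/\d+)\)")

def rows_for_band(titles, band):
    """Yield (band, venue, date) for each title whose band matches."""
    for title in titles:
        m = TITLE_RE.search(title)
        if m and m.group(1) == band:
            yield m.groups()

titles = [
    "randy travis (Billy Bobs 3/21)",
    "Michael Schenker Group (House of Blues Dallas 3/26)",
]

out = io.StringIO()  # stands in for open("concerts.csv", "w", newline="")
writer = csv.writer(out)
writer.writerow(["band", "venue", "date"])
writer.writerows(rows_for_band(titles, "randy travis"))
print(out.getvalue())
```

The lat and long columns would be appended to each row after the geocoding call, which is left out here.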
