I am extracting records from a file in which the information of interest spans three or more lines. The information is in sequence and follows a reasonable pattern, but it can
have some boilerplate text in between.
Since this is a text file converted from PDF, it is also possible that there is a page number or some other simple control element in between.
Pattern consists of:
starting line: last name and first name separated by comma, and nothing else
next line will have two long numbers (>=7 digits) followed by two dates
last line of interest will have 4-digit number followed by a date
(Pattern of interest is marked in BOLD):
LAST NAME ,FIRST NAME
... nothing or possibly some junk text
999999999 9999999 MM/DD/YY MM/DD/YY junk text
... nothing or possibly some junk text
9999 MM/DD/YY junk
I don't care
My target text by default looks something like:
SOME IRRELEVANT TEXT
DOE ,JOHN
200000002 100000070 04/04/13 12/12/12 XYZ IJK ABC SOMETHING SOMETHING
0999 12/22/12 0 1 0 SOMETHING ELSE
MORE OF SOMETHING ELSE
but it is possible to encounter something in between so it would look like:
SOME IRRELEVANT TEXT
DOE ,JOHN
Page 13 Header
200000002 100000070 04/04/13 12/12/12 XYZ IJK ABC SOMETHING SOMETHING
0999 12/22/12 0 1 0 SOMETHING ELSE
MORE OF SOMETHING ELSE
I don't really need to validate much here, so I am catching the three lines separately.
I know that this pattern will occur as a substring, but with possible insertions.
So far, I have been catching these elements with the following three regular expressions:
(([A-Z]+\s+)+,[A-Z]+)
(\d{7,}\s+\d{7,}\s+(\d{2}/\d{2}/\d{2}\s+){2})
(\d{4}\s+\d{2}/\d{2}/\d{2})
but I would like to extract the whole data of interest.
Is that possible and if so, how?
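Here I have added the regular expressions to a list and tried finding a match one after the other. Is this what you were looking for?
import re
f = open("C:\\Users\\mridulp\\Desktop\\temp\\file1.txt")
# One pattern per line of interest, in the order they are expected to appear.
# The leading .*? is lazy so that it cannot swallow the first digits of a number.
regexpList = [re.compile(r"(([A-Z]+\s+)+,[A-Z]+)"),
              re.compile(r"^.*?(\d{7,}\s+\d{7,}\s+(\d{2}/\d{2}/\d{2}\s+){2})"),
              re.compile(r"^.*?(\d{4}\s+\d{2}/\d{2}/\d{2}).*")]
lines = f.readlines()
i = 0  # index of the pattern we are currently waiting for
for l in lines:
    mObj = regexpList[i].match(l)
    if mObj:
        print(mObj.group(1))
        # Move on to the next pattern, wrapping around after the third
        i = i + 1
        if i > 2:
            i = 0
f.close()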
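The same line-by-line idea can be sketched without a file, collecting the three matched pieces into one tuple per record. This is only an illustration under assumptions: the helper name extract_records and the PATTERNS list are hypothetical, the patterns are lightly adapted from the question, and the sample text is the one given in the question.

```python
import re

# Patterns adapted from the question; PATTERNS and extract_records are
# illustrative names, not part of the original answer.
PATTERNS = [
    re.compile(r"^((?:[A-Z]+\s+)+,[A-Z]+)\s*$"),   # LAST NAME ,FIRST NAME and nothing else
    re.compile(r"(\d{7,}\s+\d{7,}\s+\d{2}/\d{2}/\d{2}\s+\d{2}/\d{2}/\d{2})"),
    re.compile(r"^\s*(\d{4}\s+\d{2}/\d{2}/\d{2})"),
]

def extract_records(lines):
    """Return one (name, numbers_and_dates, code_and_date) tuple per record."""
    records, parts = [], []
    for line in lines:
        # Only look for the pattern we are currently waiting for;
        # lines that do not match it (page headers, junk) are skipped.
        m = PATTERNS[len(parts)].search(line)
        if m:
            parts.append(m.group(1))
            if len(parts) == len(PATTERNS):
                records.append(tuple(parts))
                parts = []
    return records

sample = """SOME IRRELEVANT TEXT
DOE ,JOHN
Page 13 Header
200000002 100000070 04/04/13 12/12/12 XYZ IJK ABC SOMETHING SOMETHING
0999 12/22/12 0 1 0 SOMETHING ELSE
MORE OF SOMETHING ELSE""".splitlines()

records = extract_records(sample)
```

Keeping the partial record in a list means the index of the next expected pattern is simply the number of pieces found so far.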
This should pull all instances of the desired substrings from the larger string for you:
re.findall(r'([A-Z]+\s+,[A-Z]+).+?(\d+\s+\d+\s+\d{2}/\d{2}/\d{2}\s+\d{2}/\d{2}/\d{2}).+?(\d+\s+\d{2}/\d{2}/\d{2})', x, re.S)
The resulting list of tuples can be stitched together if needed to get a list of desired substrings with the junk text removed.
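As a rough sketch of that stitching step, assuming x holds the whole file as one string (here the sample with a page header from the question), the tuples can be joined with a single space. The names RECORD and stitched are made up for the illustration.

```python
import re

# Same pattern as in the answer, written as a raw string and split for readability.
RECORD = re.compile(
    r"([A-Z]+\s+,[A-Z]+)"
    r".+?(\d+\s+\d+\s+\d{2}/\d{2}/\d{2}\s+\d{2}/\d{2}/\d{2})"
    r".+?(\d+\s+\d{2}/\d{2}/\d{2})",
    re.S,  # let '.' cross line breaks so junk lines in between are skipped
)

x = """SOME IRRELEVANT TEXT
DOE ,JOHN
Page 13 Header
200000002 100000070 04/04/13 12/12/12 XYZ IJK ABC SOMETHING SOMETHING
0999 12/22/12 0 1 0 SOMETHING ELSE
MORE OF SOMETHING ELSE"""

matches = RECORD.findall(x)                       # list of 3-tuples
stitched = [" ".join(parts) for parts in matches]  # one clean string per record
```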
Related
I have this string and I'm basically trying to get the numbers after the "$" shows up. For example, I would want an output like:
>>> 100, 654, 123, 111.654
The variable and string:
file = """| $100 on the first line
| $654 on the second line
| $123 on the third line
| $111.654 on the fourth line"""
And as of right now, I have this bit of code that I think helps me separate the numbers. But I can't figure out why it's only separating the fourth line. It only prints out 111.654
txt = io.StringIO(file).getvalue()
idx = txt.rfind('$')
print(txt[idx+1:].split()[0])
Is there an easier way to do this or am I just forgetting something?
Your code finds only the last $ because that's exactly what you programmed it to do.
You take the entire input, find the last $, and then split the rest of the string. This specifically ignores any other $ in the input.
You cite "line" as if it's a unit of your program, but you've done nothing to iterate through lines. I recommend that you quit fiddling with io and simply use standard file operations. You can find this in any tutorial on Python files.
In the meantime, here's how you handle the input you have:
by_lines = txt.split('\n') # Split on newline characters
for line in by_lines:
    idx = line.rfind('$')
    print(line[idx+1:].split()[0])
Output:
100
654
123
111.654
Does that get you moving?
Regular expressions yay:
import re
matches = re.findall(r'\$(\d+\.\d+|\d+)', file)
Finds all integer and float amounts, ensures trailing '.' fullstops are not incorrectly captured.
This should do it! For every character in txt: if it is '$' then continue until you find a space.
print(*[txt[i+1: i + txt[i:].find(' ')] for i in range(0, len(txt)) if txt[i]=='$'])
Output:
100 654 123 111.654
Your whole sequence appears to be a single string. Try using the split function to break it into separate lines. Then, I believe you need to iterate through the entire list, searching for $ at each iteration.
I'm not the most fluent in python, but maybe something like this:
for line in txt.split('\n'):
    idx = line.rfind('$')            # search the current line, not the whole text
    print(line[idx+1:].split()[0])   # slice from just after the '$'
How about this?
re.findall(r'\$(\d+\.?\d*)', file)
# ['100', '654', '123', '111.654']
The regex looks for the dollar sign \$ then grabs the maximum sized group available () containing one or more digits \d+ and zero or one decimal points \.? and zero or more digits \d* after that.
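The two regex answers in this thread differ only when an amount is followed directly by a full stop. A small sketch on a made-up sentence (the text variable is an assumption, not the asker's data) shows the difference:

```python
import re

# Hypothetical input where one amount ends a sentence.
text = "Pay $100. Then $654, then $111.654 more."

# Alternation: a decimal part is only captured if digits follow the '.'.
strict = re.findall(r"\$(\d+\.\d+|\d+)", text)

# Optional '.': a trailing full stop is captured along with the digits.
loose = re.findall(r"\$(\d+\.?\d*)", text)

# The strict results convert cleanly to numbers.
amounts = [float(s) for s in strict]
```

On the asker's four-line string both patterns give the same list, since no amount there is followed by a '.'.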
I'm hoping to get some regex assistance. I've got lines of columnar text that I'd like to split with regexes. Each column can be phrases of arbitrary characters, separated by a whitespace or maybe even two. Columns are separated by a larger number of whitespaces, perhaps at least 4.
Ultimately, I need to match a date if it's in the second column.
Here's an example. I need the date in this column to be the group important_date
Rain in Spain 11/01/2000 90 Days
important_date should not match the date in this next line:
Another line of text 10/15/1990
# EXAMPLE:
import re
regex = r"(.*)\s(?P<important_date>\d{1,2}\/\d{1,2}\/\d{4}).*"
match_this = " Rain in Spain 11/01/2000 90 Days"
not_this = " Another line of text 10/15/1990"
print(f"Finding this date is good:{re.search(regex, match_this).group('important_date')}" )
print(f"But this one should throw an error:{re.search(regex,not_this).group('important_date')}")
I'm also comparing these regexes against lots of other lines of text with various structures, so this is why I don't want to just split on a string of " ". To know I've got the important_date, I need to know that the whole line looks like: one-column, second column is date, maybe another column after the date too.
Doing this with a single regex would also just fit much more easily into the rest of the application. I'm worried that line.split(" ") and checking the resulting list would interfere with other checks going on in this app.
I have not been able to figure out how to write the first part of the regex that captures words with no-more-than-2 spaces between them. Can I use lookaheads for this somehow?
Thank you!
Try this: (?m)^\s*(\w+\s)+\s+(?P<important_date>\d\d/\d\d/\d\d\d\d).*$ (https://regex101.com/r/PnIU3e/3).
I assume that the first column consists of words separated by single spaces, and is separated from the second column by more than one space.
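If the first column may contain gaps of up to two spaces and columns are separated by four or more, a single regex along the following lines might work. This is a sketch under assumptions: the exact spacing of the sample lines is invented for the illustration (with the date in the third column of the second line), and COLUMN_DATE and important_date are hypothetical names.

```python
import re

COLUMN_DATE = re.compile(
    r"^\s*(?P<first>\S+(?:\s{1,2}\S+)*)"                  # column 1: tokens separated by 1-2 whitespace chars
    r"\s{4,}(?P<important_date>\d{1,2}/\d{1,2}/\d{4})"    # a gap of 4+, then the date as column 2
    r"(?:\s{4,}.*)?\s*$"                                  # optionally more columns after another wide gap
)

def important_date(line):
    """Return the date if it sits in the second column, else None."""
    m = COLUMN_DATE.match(line)
    return m.group("important_date") if m else None

# Assumed layouts; the original spacing is not known.
match_this = "    Rain in Spain        11/01/2000     90 Days"
not_this = "    Another line    of text        10/15/1990"

found = important_date(match_this)
missing = important_date(not_this)
```

No lookahead is needed here: because a wide gap must be followed immediately by the date, a line whose second column is ordinary text cannot match.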
You can split on fields of 2 or more spaces and only use the data if it is the second field:
for x in (match_this, not_this):
    te=re.split(r'[ \t]{2,}',x)
    if re.match(r'\d{1,2}\/\d{1,2}\/\d{4}', te[2]):
        # you have an important date
        print(te[2])
    else:
        # you don't
        print('no match')
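Note that index 2 above relies on the line starting with whitespace, which makes re.split return an empty first field. A variant that strips the line first and guards against short lines could look like this; the function name second_column_date and the sample spacing are assumptions made for the sketch.

```python
import re

DATE = re.compile(r"\d{1,2}/\d{1,2}/\d{4}$")

def second_column_date(line):
    """Split on runs of 2+ spaces/tabs and return field 2 if it is a date."""
    fields = re.split(r"[ \t]{2,}", line.strip())
    # After strip(), fields[0] is the first column and fields[1] the second.
    if len(fields) >= 2 and DATE.match(fields[1]):
        return fields[1]
    return None

# Assumed layouts; the original spacing is not known.
match_this = "    Rain in Spain        11/01/2000     90 Days"
not_this = "    Another line    of text        10/15/1990"

results = [second_column_date(x) for x in (match_this, not_this, "short line")]
```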
I've used BeautifulSoup and pandas to create a csv with columns that contain error codes and corresponding error messages.
Before formatting, the columns look something like this
-132456ErrorMessage
-3254Some other Error
-45466You've now used 3 different examples. 2 more to go.
-10240 This time there was a space.
-1232113That was a long number.
I've successfully isolated the text of the codes like this:
dfDSError['text'] = dfDSError['text'].map(lambda x: x.lstrip('-0123456789'))
This returns just what I want.
But I've been struggling to come up with a solution for the codes.
I tried this:
dfDSError['codes'] = dfDSError['codes'].replace(regex=True,to_replace= r'\D',value=r'')
But that will append numbers from the error message to the end of the code number. So for the third example above instead of 45466 I would get 4546632. Also I would like to keep the leading minus sign.
I thought maybe that I could somehow combine rstrip() with a regex to find where there was a nondigit or a space next to a space and remove everything else, but I've been unsuccessful.
for_removal = re.compile(r'\d\D*')
dfDSError['codes'] = dfDSError['codes'].map(lambda x: x.rstrip(re.findall(for_removal,x)))
TypeError: rstrip arg must be None, unicode or str
Any suggestions? Thanks!
You can use extract:
dfDSError[['code','text']] = dfDSError.text.str.extract('([-0-9]+)(.*)', expand=True)
print (dfDSError)
                                                text      code
0                                       ErrorMessage   -132456
1                                   Some other Error     -3254
2  You've now used 3 different examples. 2 more t...    -45466
3                       This time there was a space.    -10240
4                            That was a long number.  -1232113
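To see what the two capture groups pick up without needing pandas, the same idea can be tried with the plain re module. This sketch uses a slightly tightened pattern, which is my own variation rather than the answer's: an optional leading minus, then digits, then optional whitespace before the message. A pattern string with two groups like this could also be passed to str.extract.

```python
import re

# Variation on the answer's '([-0-9]+)(.*)': the minus sign is only allowed
# in front, and whitespace between code and message is dropped.
CODE_TEXT = re.compile(r"(-?\d+)\s*(.*)")

rows = [
    "-132456ErrorMessage",
    "-3254Some other Error",
    "-45466You've now used 3 different examples. 2 more to go.",
    "-10240 This time there was a space.",
    "-1232113That was a long number.",
]

# match() anchors at the start, so digits inside the message are never
# appended to the code.
parsed = [CODE_TEXT.match(row).groups() for row in rows]
codes = [code for code, _ in parsed]
```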
Super NOOB to Python (2.4.3): I am executing a function containing a regular expression which searches through a txt file that I'm importing. I am able to read and run re.search on the text file and the output is correct. I need to run this for multiple occurrences (the regex occurs 48 times in the text). The code is as follows:
#!/usr/bin/python
import re
dataRead = open('pd_usage_14-04-23.txt', 'r')
dataWrite = open('test_write.txt', 'w')
text = (dataRead.read()) #reads and initializes text for conversion to string
s = str(text) #converts text to string for reading
def user(str):
    re1='((?:[a-z][a-z]+))' # Word 1
    re2='(\\s+)' # White Space 1
    re3='((?:[a-z][a-z]+))' # Word 2
    re4='(\\s+)' # White Space 2
    re5='((?:[a-z][a-z]*[0-9]+[a-z0-9]*))' # Alphanum 1
    rg = re.compile(re1+re2+re3+re4+re5,re.IGNORECASE|re.DOTALL)
    #alphanum1=rg.group(5)
    re.findall(rg, s, flags=0)
    #print "("+alphanum1+")"+"\n"
    #if m:
    #    word1=m.group(1)
    #    ws1=m.group(2)
    #    word2=m.group(3)
    #    ws2=m.group(4)
    #    alphanum1=m.group(5)
    #    print "("+alphanum1+")"+"\n"
    return
user(s)
dataRead.close()
dataWrite.close()
OUTPUT: g706454
THIS OUTPUT IS CORRECT! BUT...!
I need to run it multiple times, reading text that's further down.
I have 2 other definitions that also need to be run multiple times. I need all 3 to run consecutively, and then run again but starting with the next line or something, to search and output newer data. All the logic I tried to implement returns the same output.
So I have something like this:
for count in range (0,47):
    if stop_read:
        date(s)
        usage(s)
        user(s)
stop_read is a definition that finds the next line after the data that I'm looking for (date, usage, user). I figured I could call this to say If you hit stop_read, read the next line and run definitions all over again.
Any help is greatly appreciated!
Here is what I do for a regex in Python 3; it should be similar in Python 2. This is for a multiline search.
regex = re.compile("\\w+-\\d+\\b", re.MULTILINE)
Then later on in code I have something like:
myset.update([m.group(0) for m in regex.finditer(logmsg.text)])
Maybe you might want to update your Python if you can, 2.4 is old, old, and stale.
Looks like re.findall would solve your problem:
re.findall(pattern, string, flags=0)
Return a list of all non-overlapping matches in the string.
If one or more groups are present in the pattern, return a
list of groups; this will be a list of tuples if the pattern
has more than one group.
Empty matches are included in the result.
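A minimal sketch of how that could apply to the question's pattern, assuming the log contains lines of the form "word word id": one pass over the whole string returns every occurrence, so no loop over range(0, 47) is needed. The sample text and the names USER and user_ids are invented for the illustration.

```python
import re

# Two words followed by an alphanumeric ID, as in the question's re1..re5,
# with the whitespace left uncaptured.
USER = re.compile(r"([a-z]{2,})\s+([a-z]{2,})\s+([a-z]+[0-9]+[a-z0-9]*)", re.IGNORECASE)

# Hypothetical file contents.
text = """usage report
john smith g706454
12 GB used
jane doe k123456
"""

# findall returns one tuple of groups per match, in order of appearance.
all_matches = USER.findall(text)

# finditer gives match objects instead, if individual groups are wanted.
user_ids = [m.group(3) for m in USER.finditer(text)]
```

Each later match starts where the previous one ended, which is the "continue further down the text" behaviour the question asks for.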
How do I remove the +4 from zipcodes, in python?
I've got data like
85001
52804-3233
Winston-Salem
And I want that to become
85001
52804
Winston-Salem
>>> zip = '52804-3233'
>>> zip[:5]
'52804'
...and of course when you parse your lines from the original data you should insert some kind of rule to distinguish between zipcodes to fix and other strings, but I don't know what your data looks like, so I can't help much (you could check if they are only digits and the '-' symbol, maybe?).
>>> import re
>>> s = "52804-3233"
>>> # regex to remove a dash and 4 digits after the dash after 5 digits:
>>> re.sub(r'(\d{5})-\d{4}', '\\1', s)
'52804'
The \\1 is a so-called back reference and gets replaced by the first group, which would be the 5-digit zipcode in this case.
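Applied to the three sample values from the question, the substitution could be sketched like this (data and cleaned are illustrative names; r"\1" is the raw-string spelling of the same back reference):

```python
import re

ZIP_PLUS4 = re.compile(r"(\d{5})-\d{4}")

data = ["85001", "52804-3233", "Winston-Salem"]

# Strings without the 5-digits-dash-4-digits shape are returned unchanged.
cleaned = [ZIP_PLUS4.sub(r"\1", item) for item in data]
```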
You could try something like this:
cleaned = []
for item in inputs:
    if item[:5].isnumeric():
        # Take the first 5 characters from the string
        item = item[:5]
    cleaned.append(item)
Just keep the first 5 characters of anything that has digits in its first 5 positions.
re.sub(r'-\d{4}$', '', zipcode)
This grabs all items of the format 00000-0000 with a space or other word boundary before and after the number and replaces each with its first five digits. The other regexes posted will match some other number formats that you might not want. Note the raw strings: in a normal string literal, '\b' is a backspace character rather than a word boundary.
re.sub(r'\b(\d{5})-\d{4}\b', r'\1', zipcode)
Or without regex:
output = [line[:5] if line[:5].isnumeric() and line[6:].isnumeric() else line for line in text if line]