Regex to split phrases separated into columns by many whitespaces - python

I'm hoping to get some regex assistance. I've got lines of columnar text that I'd like to split with regexes. Each column can be phrases of arbitrary characters, separated by a whitespace or maybe even two. Columns are separated by a larger number of whitespaces, perhaps at least 4.
Ultimately, I need to match a date if it's in the second column.
Here's an example. I need the date in this line to be the group important_date:
Rain in Spain 11/01/2000 90 Days
important_date should not match the date in this next line:
Another line of text 10/15/1990
# EXAMPLE:
import re
regex = r"(.*)\s(?P<important_date>\d{1,2}\/\d{1,2}\/\d{4}).*"
match_this = " Rain in Spain 11/01/2000 90 Days"
not_this = " Another line of text 10/15/1990"
print(f"Finding this date is good:{re.search(regex, match_this).group('important_date')}" )
print(f"But this one should throw an error:{re.search(regex,not_this).group('important_date')}")
I'm also comparing these regexes against lots of other lines of text with various structures, so this is why I don't want to just split on a string of " ". To know I've got the important_date, I need to know that the whole line looks like: one-column, second column is date, maybe another column after the date too.
Doing this with a single regex would also just fit much more easily into the rest of the application. I'm worried that line.split(" ") and checking the resulting list would interfere with other checks going on in this app.
I have not been able to figure out how to write the first part of the regex that captures words with no-more-than-2 spaces between them. Can I use lookaheads for this somehow?
Thank you!

Try this: (?m)^\s*(\w+\s)+\s+(?P<important_date>\d\d/\d\d/\d\d\d\d).*$ (https://regex101.com/r/PnIU3e/3).
I assume that the first column consists of words separated by single spaces, and is separated from the second column by more than one space.
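A quick way to check a pattern like this is to run it against both kinds of line. The sketch below assumes the columns are separated by four spaces; the exact spacing of the question's samples is not preserved here, so the two strings and the helper name find_important_date are illustrative only.

```python
import re

# Pattern from the answer above: single-spaced words, extra whitespace, then the date.
PATTERN = re.compile(
    r"(?m)^\s*(\w+\s)+\s+(?P<important_date>\d\d/\d\d/\d\d\d\d).*$"
)

# Assumed spacing: four spaces between columns (hypothetical sample data).
match_this = "    Rain in Spain    11/01/2000    90 Days"
not_this = "    Another line    of text    10/15/1990"


def find_important_date(line):
    """Return the date in the second column, or None if there isn't one."""
    m = PATTERN.search(line)
    return m.group("important_date") if m else None


print(find_important_date(match_this))
print(find_important_date(not_this))
```

Checking the result of search() for None before calling group() avoids the AttributeError that the snippet in the question would raise on a non-matching line.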

You can split on fields of 2 or more spaces and only use the data if it is the second field:
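for x in (match_this, not_this):
    # Fields are separated by runs of two or more spaces or tabs.
    te = re.split(r'[ \t]{2,}', x)
    # If the line starts with a run of whitespace, te[0] is empty and the second column is te[2].
    if len(te) > 2 and re.match(r'\d{1,2}/\d{1,2}/\d{4}', te[2]):
        # you have an important date
        print(te[2])
    else:
        # you don't
        print('no match')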

Related

Is there a way for the split method to split depending on what the user has entered in Python?

user_lst = list(map(int,input("Numbers: ").split(" ")))
So that line just maps the list to the corresponding int values, but is there a way for the split method to detect whether it is splitting on a , or " " or any other character?
For example, if the user enters
1,2,3,4
or
1 2 3 4
or
1-2-3-4
I want it to split the list by the specified split character.
I tried this:
user_lst = list(map(int,input("Numbers: ").split(" ","-",",")))
but obviously split only takes two arguments at most. But even with only two arguments it still gives that error.
I know you can use a few if statements and indexing to first check what the user has used to separate it with, but I just want it done on one line or two at most.
You can just use the regex split function instead of the base .split method:
import re
PATTERN = r"[-, ]"
user_lst = list(map(int, re.split(PATTERN, input("Numbers: "))))
You can change PATTERN to any regex pattern that matches the symbols you want to split on, and you can try out your own patterns on regex101.
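As a slightly more forgiving sketch of the same idea, the pattern below splits on runs of delimiters, so input such as "1, 2, 3, 4" also works. The function name parse_numbers and the choice of delimiters are assumptions for illustration.

```python
import re

# Split on runs of hyphens, commas, or spaces (assumed delimiter set).
DELIMITERS = re.compile(r"[-, ]+")


def parse_numbers(text):
    """Split text on any of the delimiters and convert the pieces to int."""
    tokens = DELIMITERS.split(text.strip())
    # Skip empty tokens, e.g. from a leading or trailing delimiter.
    return [int(tok) for tok in tokens if tok]


print(parse_numbers("1,2,3,4"))
print(parse_numbers("1 2 3 4"))
print(parse_numbers("1-2-3-4"))
print(parse_numbers("1, 2, 3, 4"))
```

Because the hyphen is treated as a delimiter, this sketch cannot read negative numbers; drop it from the character class if you need them.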

How to replace these values in lines of text

I have several rows of text. The first row is a header row, and each subsequent line holds the data fields, with values separated by commas. Each line contains one to three dollar values, ranging from single-digit amounts ($4.50) to six-figure amounts ($100,000.34). They are also surrounded by quotes.
206360941,5465685679,"$4,073.77",567845676547,"$88,457.21",34589309683
I need to eliminate the quotations and dollar sign for the money values, as well as the comma inside. The period separator for the decimal value needs to stay, so "$6,801.56" becomes 6801.56
I've used regex to eliminate the dollar sign as well as quotations--
import re
with open("datafile.csv", "r") as file:
    data = file.readlines()
for i in data:
    i = re.sub('[$"]', '', i)
which then makes the data look like 7545245,6,801.56,3545647
so if I split by a comma, it cuts larger values in two.
['206360941,5465685679,4,073.77,567845676547,88,457.21,34589309683']
I thought about splitting by quotations, doing some more regex and rejoining with .join() but it turns out that only the currency values with a comma contain quotations, the smaller values with no comma do not.
Also, I know I can use re.findall(r'\$\d{1,3}\,\d\d\d\.\d\d', i) to draw out the number format, if I print it, it will output a list like [$100,351.35]
I am just not sure what to do with it after that.
This seems to work:
>>> data = '206360941,5465685679,"$4,073.77",567845676547,"$88,457.21",34589309683'
>>> re.findall(r'"\$((\d+),)*(\d+)(\.\d+)"', data)
[('4,', '4', '073', '.77'), ('88,', '88', '457', '.21')]
>>> re.sub(r'"\$((\d+),)*(\d+)(\.\d+)"', r'\2\3\4', data)
'206360941,5465685679,4073.77,567845676547,88457.21,34589309683'
The idea is to capture the digits before and after the thousands comma, plus the decimal part. Because the first group is just the second group with its trailing comma, the replacement uses every group except the first. If a value has more than one comma, only the last repetition of the repeated group is kept, so you'll need a more dynamic approach.
That's why you need this ((\d+),)* group, which captures a subgroup and the comma. You should replace this whole group with the subgroup.
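One possible "more dynamic" variant is to capture the whole amount and remove the commas in a replacement function, which copes with any number of thousands separators. This is only a sketch; the names MONEY and strip_money_formatting are made up for the example.

```python
import re

# A quoted dollar amount: digits and commas, with an optional decimal part.
MONEY = re.compile(r'"\$([\d,]+(?:\.\d+)?)"')


def strip_money_formatting(line):
    """Replace each quoted "$1,234.56" with the bare number 1234.56."""
    return MONEY.sub(lambda m: m.group(1).replace(",", ""), line)


data = '206360941,5465685679,"$4,073.77",567845676547,"$88,457.21",34589309683'
print(strip_money_formatting(data))
print(strip_money_formatting('1,"$1,234,567.89",2'))
```

Unquoted amounts such as $4.50 are left untouched by this pattern, so their dollar sign would still need a separate substitution.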
I'd recommend using csv.reader (or csv.DictReader if you want to do other processing on each column) to read the file as this will parse each column automatically. Once you read the file, you can do your regex on each column so no need to split the line yourself. The default delimiter and quotechar for csv.reader is as you would need, I believe.
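A minimal sketch of the csv.reader approach, using io.StringIO as a stand-in for the real file so the example runs on its own. The helper clean_field is hypothetical.

```python
import csv
import io

# Stand-in for the contents of datafile.csv (one sample row from the question).
sample = '206360941,5465685679,"$4,073.77",567845676547,"$88,457.21",34589309683\n'


def clean_field(field):
    """Strip the dollar sign and thousands separators from money fields."""
    if field.startswith("$"):
        return field[1:].replace(",", "")
    return field


# csv.reader removes the quotes and keeps "$4,073.77" together as one field.
rows = [[clean_field(f) for f in row] for row in csv.reader(io.StringIO(sample))]
print(rows[0])
```

With a real file you would pass the open file object to csv.reader instead of the StringIO wrapper.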
Did you try the module locale? As in How do I use Python to convert a string to a number if it has commas in it as thousands separators?
It'll be easier than regex.
First of all you could go about deleting all commas that are inside of quotes.
A Python sketch might look like:
def remove_commas_in_quotes(s):
    inside_quotes = False
    kept = []
    for c in s:
        if c == '"':
            # a quote toggles whether we are inside a quoted value
            inside_quotes = not inside_quotes
        elif inside_quotes and c == ',':
            # drop commas that appear inside quotes
            continue
        kept.append(c)
    return ''.join(kept)
Now that there are no more commas inside the quotes, you only need to remove the dollar signs and the quotes themselves!
Hope it helps!

Reading financial statements using REGEX

I'm working on a project where I have to read scanned images of financial statements. I used tesseract 4 to convert the image to a text output, which looks as such (here is a snippet):
REVENUE 9,000,000 900,000
COST OF SALES 900,000 900,000
GROSS PROFIT (90%; 2016 - 90%) 900,000 900,000
I would like to break the above into a list of three entries, where the first entry is the text, then the second and third entries would be the numbers. For example the first row would look something like this:
[[REVENUE], [9,000,000], [9,000,000]]
I came across this Stack Overflow post where someone uses re.match() with the .groups() method to find the pattern: How to split strings into text and number?
I'm just being introduced to regex and I'm struggling to properly understand the syntax and documentation. I'm trying to use a cheat sheet for now, but I'm having a tough time figuring out how to go about this, please help.
I wrote this regex based on your first expected output, but I am not sure what your desired output is for your third line.
([A-Za-z ]+)(?=\d|\S) matches the name until we find a number or symbol.
.*? skips the text we do not care about.
([\d,]+)\s([\d,]+|(?=-\n|-$)) matches one or two groups of numbers; if there is only one, it must be followed by a dash at the end of the line or of the text.
Test code (edited):
import re
regex = r"([A-Za-z ]+)(?=\d|\S).*?([\d,]+)\s([\d,]+|(?=-\n|-$))"
text = """
REVENUE 9,000,000 900,000
COST OF SALES 900,000 900,000
GROSS PROFIT (90%; 2016 - 90%) 900,000 900,000
Business taxes 999 -
"""
print(re.findall(regex,text))
# [('REVENUE ', '9,000,000', '900,000'), ('COST OF SALES ', '900,000', '900,000'), ('GROSS PROFIT ', '900,000', '900,000'), ('Business taxes ', '999', '')]
Regexes are overkill for this problem as you've stated it.
text.split() and a join of the items before the last two is better suited to this.
lines = ["REVENUE 9,000,000 900,000",
         "COST OF SALES 900,000 900,000",
         "GROSS PROFIT (90%; 2016 - 90%) 900,000 900,000"]
out = []
for line in lines:
    parts = line.split()
    if len(parts) < 3:
        # a label and two numbers are required
        raise ValueError("expected a label and two numbers: %r" % line)
    if len(parts) == 3:
        out.append(parts)
    else:
        # everything before the last two items is the label
        out.append([' '.join(parts[0:len(parts)-2]), parts[-2], parts[-1]])
out will contain
[['REVENUE', '9,000,000', '900,000'],
['COST OF SALES', '900,000', '900,000'],
['GROSS PROFIT (90%; 2016 - 90%)', '900,000', '900,000']]
If the label text needs further extraction, you could use regexes, or you could simply look at the items in parts[0:len(parts)-2] and process them based on the words and numbers there.
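The same split-from-the-right idea can be written more compactly with str.rsplit and a maximum of two splits, which leaves the label in one piece. This is a sketch; the function name split_label_and_numbers is made up for the example.

```python
lines = [
    "REVENUE 9,000,000 900,000",
    "COST OF SALES 900,000 900,000",
    "GROSS PROFIT (90%; 2016 - 90%) 900,000 900,000",
]


def split_label_and_numbers(line):
    """Split a line into [label, number, number], working from the right."""
    parts = line.rsplit(None, 2)  # at most two splits, on any whitespace
    if len(parts) < 3:
        raise ValueError("expected a label and two numbers: %r" % line)
    return parts


out = [split_label_and_numbers(line) for line in lines]
print(out)
```

Unlike split() followed by join(), rsplit keeps whatever spacing the label originally had, which may or may not be what you want for OCR output.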
To detect the string
rev_str = "[[REVENUE], [9,000,000], [9,000,000]]"
and extract the values
("REVENUE", "9,000,000", "9,000,000")
you would do
import re
x = re.match(r"\[\[([A-Z]+)\], \[([0-9,]+)\], \[([0-9,]+)\]\]", rev_str)
x.groups()
# ('REVENUE', '9,000,000', '9,000,000')
Let's unpack this big ol' string.
Square brackets define a character class: a set of characters, any one of which may match. For example, [A-Z] matches any single letter from A to Z, whereas [0-9,] matches any digit 0 through 9 or the character ,. The - inside square brackets denotes a range of characters.
The + operator means one or more occurrences of whatever immediately precedes it. For example, the expression [A-Z]+ matches one or more of the letters A through Z. You can use the * operator instead to match zero or more occurrences of whatever precedes it.
The round brackets (i.e. parentheses) signify a group to be extracted from the regex. Whenever the pattern matches, whatever was matched by each expression in parentheses is extracted and returned as a group. For example, ([A-Z]+) means to look for one or more of the letters A through Z, and then save whatever that turns out to be. We access this by calling x.groups() after assigning the result of the regex match to a variable x.
Otherwise, it's straightforward: the pattern follows the shape [[TEXT], [NUMBER], [NUMBER]]. The literal square brackets are escaped with the \ character, because we want to interpret them literally rather than as a character class.
Overall, re.match() checks whether the pattern matches at the beginning of rev_str, keeps track of the groups within that match, and returns those groups when you call x.groups(). If the pattern does not match, re.match() returns None.
This is a fairly simple example, but you've gotta start somewhere, right? You should be able to use this as a starting point for writing a more complicated regex to process more of your text.
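Since re.match() only tries the pattern at the start of the string, it helps to see how it differs from re.search() and re.findall(). The short sketch below reuses rev_str with a smaller pattern for a single bracketed number; the variable name number is chosen for the example.

```python
import re

rev_str = "[[REVENUE], [9,000,000], [9,000,000]]"
number = r"\[([0-9,]+)\]"  # one bracketed number, e.g. [9,000,000]

# re.match only tries the pattern at position 0, where rev_str has "[[REVENUE]".
print(re.match(number, rev_str))

# re.search scans forward and stops at the first bracketed number.
print(re.search(number, rev_str).group(1))

# re.findall returns the captured group for every match in the string.
print(re.findall(number, rev_str))
```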

Python 2.7: How to split on first occurrence?

I am trying to split a string I extract on the first occurrence of a comma. I have tried using the split, but something is wrong, as it doesn't split.
for i in range(len(items)):
    alldata = items[i].getText().encode('utf-8').split(',', 1)
    csvfile.writerow(alldata)
The variable items contains the data I extract from a URL. The output in the CSV file is put in one column, but I want it in two columns. An example of the data (alldata) I get in the CSV file looks like this:
['\n\n\n1958\n\n\nGeorge Lees\n']
Using this data as an example, I need the year 1958 to be on one column, and the name George Lees to be on another column instead of the new lines.
EDIT
Forgot to mention what I meant with the commas. The reason why I mentioned the commas is that I also tried splitting on whitespaces. When I did that I got the data:
['1958', 'George', 'Lees']
So what I tried to achieve was to split the data on the first comma occurrence. That's why I did split(',', 1), forgetting that I also need to split on whitespace. So my problem is that I don't know how to split so that the year is in one column and the whole name is in another column. I got
['\n\n\n1958\n\n\nGeorge Lees\n']
When I tried to split with split(',', 1)
You can use strip to remove all whitespace at the start and end, and then split on "\n" to get the required output. I have also used filter to remove any empty strings.
Ex:
A = ['\n\n\n1958\n\n\nGeorge Lees\n']
print filter(None, A[0].strip().split("\n"))
Output:
['1958', 'George Lees']
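The snippet above is Python 2, as in the question. In Python 3, filter returns an iterator and print is a function, so a list comprehension is one way to get the same result. A small sketch, with raw and fields as illustrative names:

```python
raw = '\n\n\n1958\n\n\nGeorge Lees\n'

# Keep only the non-empty lines; the name stays together because we
# split on line breaks rather than on every whitespace character.
fields = [part.strip() for part in raw.splitlines() if part.strip()]
print(fields)
```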

How can I collect certain characters from a string separated by new lines?

I have a list of time strings followed by phone numbers:-
00:12:23, 0712313412352
01:14:52, 0712312341256
What's the easiest way to get the time duration only?
duration = S[0:8] # duration is first 8 characters
If you know that all three parts of the time will always be formatted as two digits, meaning the entire time will always be exactly 8 characters long, then I think your way is easiest: duration = S[:8].
Otherwise, if you know that your time will always be followed by a comma, you could split on the comma and take the first element: duration = S.split(',')[0].
Otherwise, if you don't know that the time will always be 8 characters long or that it will be followed by a comma, you could use a regex: r'(\d\d?:\d\d?:\d\d?)'
Edit:
In your comment you say you want to read through all the lines. If you have a string containing all the lines separated by newlines, first split the string into individual lines, then iterate through them and take the time from each one:
# Assume the text is stored in text_string
lines = text_string.split('\n')
times = []  # make an empty list to hold the times
for line in lines:
    time = line[:8]
    times.append(time)  # add the time to our list
print(times)  # this will print our list of times
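If the parts of the time are not always two digits, a regex with one or two digits per part can replace the fixed slice. The sketch below uses the two sample lines from the question as text_string and skips lines that do not start with a time; the name TIME is chosen for the example.

```python
import re

# Sample data from the question; text_string is the name used above.
text_string = "00:12:23, 0712313412352\n01:14:52, 0712312341256"

# One or two digits in each of the three parts; match() anchors at the line start.
TIME = re.compile(r"\d\d?:\d\d?:\d\d?")

times = []
for line in text_string.splitlines():
    m = TIME.match(line.strip())
    if m:  # skip blank or malformed lines
        times.append(m.group())

print(times)
```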
Assuming lines.txt contains your lines:
>>> [ x[:8] for x in open('lines.txt').readlines() ]
['00:12:23', '00:12:23', '00:12:23']
Or this, if the first field is variable length:
>>> [ x.split(',')[0] for x in open('lines.txt').readlines() ]
['00:12:23', '00:12:23', '00:12:23']
One of the best ways is to use a regex and create a suitable pattern to find the part of the string you need:
import re
string = "00:12:23, 0712313412352"
request = re.match(r"(^\d*....\d*)", string)
print(request.group())
00:12:23
You can try different regex patterns on https://regex101.com/, or experiment with them in the Python interpreter.
