Python Conditional Split - python

Given this string:
s = '01/03/1988 U/9 Mi\n08/19/1966 ABC\nDEF\n12/31/1999 YTD ABC'
I want to split it on each new record (which starts with a date) like this:
['01/03/1988 U/9 Mi', '08/19/1966 ABC\nDEF', '12/31/1999 YTD ABC']
Notice the extra new line delimiter between ABC and DEF? That's the challenge I'm having. I want to preserve it without a split there.
I'm thinking I need to conditionally split on any delimiter of these:
['01/', '02/','03/', '04/', '05/', '06/', '07/', '08/', '09/', '10/', '11/', '12/']
Is there an easy way to use re.findall this way or is there a better approach?
Thanks in advance!

You could split on the new line that is followed by a date with a lookahead. Something like:
import re
s = '01/03/1988 U/9 Mi\n08/19/1966 ABC\nDEF\n12/31/1999 YTD ABC'
re.split(r'\n(?=\d{2}/\d{2}/\d{4})', s)
# ['01/03/1988 U/9 Mi', '08/19/1966 ABC\nDEF', '12/31/1999 YTD ABC']
You may be able to simplify to just a newline followed by 2 digits depending on your data: r'\n(?=\d{2})'

Use regex instead.
code
import re
s = '01/03/1988 U/9 Mi\n08/19/1966 ABC\nDEF\n12/31/1999 YTD ABC'
chunks = re.compile(r'[\n](?=\d\d/\d\d/\d\d\d\d)').split(s)
print(chunks)
output
['01/03/1988 U/9 Mi', '08/19/1966 ABC\nDEF', '12/31/1999 YTD ABC']

You can also match a more specific date like format without lookarounds.
^(?:0[1-9]|1[012])/(?:0[1-9]|[12]\d|3[01])/(?:19|20)\d\d\b.*$
^ Start of string
(?:0[1-9]|1[012]) Match a month number from 01 - 12
/ Match literally
(?:0[1-9]|[12]\d|3[01]) Match a number 01 - 31
/ Match literally
(?:19|20)\d\d Match either 19 or 20 and 2 digits (Or just 4 digits \d{4})
\b.* A word boundary and match the rest of the line
$ End of string
Regex demo | Python demo
Example code
import re
s = '01/03/1988 U/9 Mi\n08/19/1966 ABC\nDEF\n12/31/1999 YTD ABC'
regex = r'^(?:0[1-9]|1[012])/(?:0[1-9]|[12]\d|3[01])/(?:19|20)\d\d\b.*$'
print(re.findall(regex, s, re.MULTILINE))
Output
['01/03/1988 U/9 Mi', '08/19/1966 ABC', '12/31/1999 YTD ABC']

Related

Find values using regex (includes brackets)

it's my first time with regex and I have some issues, which hopefully you will help me find answers. Let's give an example of data:
chartData.push({
date: newDate,
visits: 9710,
color: "#016b92",
description: "9710"
});
var newDate = new Date();
newDate.setFullYear(
2007,
10,
1 );
Want I want to retrieve is to get the date which is the last bracket and the corresponding description. I have no idea how to do it with one regex, thus I decided to split it into two.
First part:
I retrieve the value after the description:. This was managed with the following code:[\n\r].*description:\s*([^\n\r]*) The output gives me the result with a quote "9710" but I can fairly say that it's alright and no changes are required.
Second part:
Here it gets tricky. I want to retrieve the values in brackets after the text newDate.setFullYear. Unfortunately, what I managed so far, is to only get values inside brackets. For that, I used the following code \(([^)]*)\) The result is that it picks all 3 brackets in the example:
"{
date: newDate,
visits: 9710,
color: "#016b92",
description: "9710"
}",
"()",
"2007,
10,
1 "
What I am missing is an AND operator for REGEX with would allow me to construct a code allowing retrieval of data in brackets after the specific text.
I could, of course, pick every 3rd result but unfortunately, it doesn't work for the whole dataset.
Does anyone of you know the way how to resolve the second part issue?
Thanks in advance.
You can use the following expression:
res = re.search(r'description: "([^"]+)".*newDate.setFullYear\((.*)\);', text, re.DOTALL)
This will return a regex match object with two groups, that you can fetch using:
res.groups()
The result is then:
('9710', '\n2007,\n10,\n1 ')
You can of course parse these groups in any way you want. For example:
date = res.groups()[1]
[s.strip() for s in date.split(",")]
==>
['2007', '10', '1']
import re
test = r"""
chartData.push({
date: 'newDate',
visits: 9710,
color: "#016b92",
description: "9710"
})
var newDate = new Date()
newDate.setFullYear(
2007,
10,
1);"""
m = re.search(r".*newDate\.setFullYear(\(\n.*\n.*\n.*\));", test, re.DOTALL)
print(m.group(1).rstrip("\n").replace("\n", "").replace(" ", ""))
The result:
(2007,10,1)
The AND part that you are referring to is not really an operator. The pattern matches characters from left to right, so after capturing the values in group 1 you cold match all that comes before you want to capture your values in group 2.
What you could do, is repeat matching all following lines that do not start with newDate.setFullYear(
Then when you do encounter that value, match it and capture in group 2 matching all chars except parenthesis.
\r?\ndescription: "([^"]+)"(?:\r?\n(?!newDate\.setFullYear\().*)*\r?\nnewDate\.setFullYear\(([^()]+)\);
Regex demo | Python demo
Example code
import re
regex = r"\r?\ndescription: \"([^\"]+)\"(?:\r?\n(?!newDate\.setFullYear\().*)*\r?\nnewDate\.setFullYear\(([^()]+)\);"
test_str = ("chartData.push({\n"
"date: newDate,\n"
"visits: 9710,\n"
"color: \"#016b92\",\n"
"description: \"9710\"\n"
"});\n"
"var newDate = new Date();\n"
"newDate.setFullYear(\n"
"2007,\n"
"10,\n"
"1 );")
print (re.findall(regex, test_str))
Output
[('9710', '\n2007,\n10,\n1 ')]
There is another option to get group 1 and the separate digits in group 2 using the Python regex PyPi module
(?:\r?\ndescription: "([^"]+)"(?:\r?\n(?!newDate\.setFullYear\().*)*\r?\nnewDate\.setFullYear\(|\G)\r?\n(\d+),?(?=[^()]*\);)
Regex demo

Match words only if preceded by specific pattern

I have a string from a NWS bulletin:
LTUS41 KCAR 141558 AAD TMLB Forecast for the National Parks
KHNX 141001 RECHNX Weather Service San Joaquin Valley
My aim is to extract a couple fields with regular expressions. In the first string I want "AAD" and from the second string I want "RECHNX". I have tried:
( )\w{3} #for the first string
and
\w{6} #for the 2nd string
But these find all 3 and 6 character strings leading up to the string I want.
Assuming the fields you want to extract are always in capital letters and preceded by 6 digits and a space, this regular expression would do the trick:
(?<=\d{6}\s)[A-Z]+
Demo: https://regex101.com/r/dsDHTs/1
Edit: if you want to match up to two alpha-numeric uppercase words preceded by 6 digits, you can use:
(?<=\d{6}\s)([A-Z0-9]+\b)\s(?:([A-Z0-9]+\b))*
Demo: https://regex101.com/r/dsDHTs/5
If you have a specific list of valid fields, you could also simply use:
(AAD|TMLB|RECHNX|RR4HNX)
https://regex101.com/r/dsDHTs/3
Since the substring you want to extract is a word that follows a number, separated by a space, you can use re.search with the following regex (given your input stored in s):
re.search(r'\b\d+ (\w+)', s).group(1)
To read first groups of word chars from each line, you can use a pattern like
(\w+) (\w+) (\w+) (\w+).
Then, from the first line read group No 4 and from the second line read group No 3.
Look at the following program. It prints four groups from each source line:
import re
txt = """LTUS41 KCAR 141558 AAD TMLB Forecast for the National Parks
KHNX 141001 RECHNX Weather Service San Joaquin Valley"""
n = 0
pat = re.compile(r'(\w+) (\w+) (\w+) (\w+)')
for line in txt.splitlines():
n += 1
print(f'{n:2}: {line}')
mtch = pat.search(line)
if mtch:
gr = [ mtch.group(i) for i in range(1, 5) ]
print(f' {gr}')
The result is:
1: LTUS41 KCAR 141558 AAD TMLB Forecast for the National Parks
['LTUS41', 'KCAR', '141558', 'AAD']
2: KHNX 141001 RECHNX Weather Service San Joaquin Valley
['KHNX', '141001', 'RECHNX', 'Weather']

regex the text string in python and split into arrays

I need to split a text like:
//string
s = CS -135IntrotoComputingCS -154IntroToWonderLand...
in array like
inputarray[0]= CS -135 Intro to computing
inputarray[1]= CS -154 Intro to WonderLand
.
.
.
and so on;
I am trying something like this:
re.compile("[CS]+\s").split(s)
But it's just not ready to even break, even if I try something like
re.compile("[CS]").split(s)
If anyone can throw some light on this?
You may use findall with a lookahead regex as this:
>>> s = 'CS -135IntrotoComputingCS -154IntroToWonderLand'
>>> print re.findall(r'.+?(?=CS|$)', s)
['CS -135IntrotoComputing', 'CS -154IntroToWonderLand']
Regex: .+?(?=CS|$) matches 1+ any characters that has CS at next position or end of line.
Although findall is more straightforward but finditer can also be used here
s = 'CS -135IntrotoComputingCS -154IntroToWonderLand'
x=[i.start() for i in re.finditer('CS ',s)] # to get the starting positions of 'CS'
count=0
l=[]
while count+1<len(x):
l.append(s[x[count]:x[count+1]])
count+=1
l.append(s[x[count]:])
print(l) # ['CS -135IntrotoComputing', 'CS -154IntroToWonderLand']

Using Regular expressions to match a portion of the string?(python)

What regular expression can i use to match genes(in bold) in the gene list string:
GENE_LIST: F59A7.7; T25D3.3; F13B12.4; cysl-1; cysl-2; cysl-3; cysl-4; F01D4.8
I tried : GENE_List:((( \w+).(\w+));)+* but it only captures the last gene
Given:
>>> s="GENE_LIST: F59A7.7; T25D3.3; F13B12.4; cysl-1; cysl-2; cysl-3; cysl-4; F01D4.8"
You can use Python string methods to do:
>>> s.split(': ')[1].split('; ')
['F59A7.7', 'T25D3.3', 'F13B12.4', 'cysl-1', 'cysl-2', 'cysl-3', 'cysl-4', 'F01D4.8']
For a regex:
(?<=[:;]\s)([^\s;]+)
Demo
Or, in Python:
>>> re.findall(r'(?<=[:;]\s)([^\s;]+)', s)
['F59A7.7', 'T25D3.3', 'F13B12.4', 'cysl-1', 'cysl-2', 'cysl-3', 'cysl-4', 'F01D4.8']
You can use the following:
\s([^;\s]+)
Demo
The captured group, ([^;\s]+), will contain the desired substrings followed by whitespace (\s)
>>> s = 'GENE_LIST: F59A7.7; T25D3.3; F13B12.4; cysl-1; cysl-2; cysl-3; cysl-4; F01D4.8'
>>> re.findall(r'\s([^;\s]+)', s)
['F59A7.7', 'T25D3.3', 'F13B12.4', 'cysl-1', 'cysl-2', 'cysl-3', 'cysl-4', 'F01D4.8']
UPDATE
It's in fact much simpler:
[^\s;]+
however, first use substring to take only the part you need (the genes, without GENELIST )
demo: regex demo
string = "GENE_LIST: F59A7.7; T25D3.3; F13B12.4; cysl-1; cysl-2; cysl-3; cysl-4; F01D4.8"
re.findall(r"([^;\s]+)(?:;|$)", string)
The output is:
['F59A7.7',
'T25D3.3',
'F13B12.4',
'cysl-1',
'cysl-2',
'cysl-3',
'cysl-4',
'F01D4.8']

Regex not grabbing all groups, not working in multiline

I have the following regex pattern:
pattern = r'''
(?P<name>.+?)\n
SKU\s#\s+(?P<sku_hidden>\d+)\n
Quantity:\s+(?P<quantity>\d+)\n
Gift\sWrap:\s+(?P<gift_wrap>.+?)\n
Shipping\sMethod:.+?\n
Price:.+?\n
Total:\s+(?P<total_price>\$[\d.]+)
'''
I retrieve them using:
re.finditer(pattern, plain, re.M | re.X)
Yet using re.findall yields the same result.
It should match texts like this:
Red Retro Citrus Juicer
SKU # 403109
Quantity: 1
Gift Wrap: No
Shipping Method:Standard
Price: $24.99
Total: $24.99
The first thing that is happening is that using re.M and re.X it doesn't work, but if I put it all in one line it does. The other thing is that when it does work only the first group is caught and the rest ignored. Any thoughts?
ADDITIONAL INFORMATION:
If I change my pattern to be just:
pattern = r'''
(?P<name>.+?)\n
SKU\s#\s+(?P<sku_hidden>\d+)\n
'''
My output comes out like this: [u'Red Retro Citrus Juicer'] it matches yet the SKU does not appear. If I put everything on the same line, like so:
pattern = r'(?P<name>.+?)\nSKU\s#\s+(?P<sku_hidden>\d+)\n'
It does match and grab everything.
When using the X flag, you need to escape the #, which start the comments.
Right now your two-line regex is equivalent to
(?P<name>.+?)\n
SKU\s
What you want is
pattern = r'''
(?P<name>.+?)\n
SKU\s\#\s+(?P<sku_hidden>\d+)\n
Quantity:\s+(?P<quantity>\d+)\n
Gift\sWrap:\s+(?P<gift_wrap>.+?)\n
Shipping\sMethod:.+?\n
Price:.+?\n
Total:\s+(?P<total_price>\$[\d.]+)
'''
Notice the \#...

Categories

Resources