How to split a multi-line string using regular expressions?

How to split a multi-line string using regular expressions? - python

I have been banging my beginner head for most of the day trying various things.
Here is the string
1 default active Eth2/45, Eth2/46, Eth2/47
Eth3/41, Eth3/42, Eth3/43
Eth4/41, Eth4/42, Eth4/43
47 Production active Po1, Po21, Po23, Po25, Po101
Po102, Eth2/1, Eth2/2, Eth2/3
Eth2/4, Eth3/29, Eth3/30
Eth3/31, Eth3/32, Eth3/33
Eth3/34, Eth3/35, Eth3/36
Eth3/37, Eth3/38, Eth3/39
Eth3/40, Eth3/44, Eth4/29
Eth4/30, Eth4/31, Eth4/32
Eth4/33, Eth4/34, Eth4/35
Eth4/36, Eth4/37, Eth4/38
Eth4/39, Eth4/40, Eth4/44
128 Test active Po1, Eth1/13, Eth2/1, Eth2/2
Eth2/3, Eth2/4
129 Backup active Po1, Eth1/14, Eth2/1, Eth2/2
Eth2/3, Eth2/4
What I need is to split like below. I have tried to use regex101.com to simulate various regex but I did not have much luck. I managed to isolate the delimiters with (\n\d+) and then I wanted to use lookbehind but I got an error saying that I need fixed string length.
Here is a link to the regex101 section:
1 default active Eth2/45, Eth2/46, Eth2/47
Eth3/41, Eth3/42, Eth3/43
Eth4/41, Eth4/42, Eth4/43
47 VLAN047 active Po1, Po21, Po23, Po25, Po101
Po102, Eth2/1, Eth2/2, Eth2/3
Eth2/4, Eth3/29, Eth3/30
Eth3/31, Eth3/32, Eth3/33
Eth3/34, Eth3/35, Eth3/36
Eth3/37, Eth3/38, Eth3/39
Eth3/40, Eth3/44, Eth4/29
Eth4/30, Eth4/31, Eth4/32
Eth4/33, Eth4/34, Eth4/35
Eth4/36, Eth4/37, Eth4/38
Eth4/39, Eth4/40, Eth4/44
128 Rogers-Refresh-MGT active Po1, Eth1/13, Eth2/1, Eth2/2
Eth2/3, Eth2/4
129 ManagementSegtNorthW active Po1, Eth1/14, Eth2/1, Eth2/2
Eth2/3, Eth2/4
Update: I update the regex101 example but it is not selecting what I want. The python code works. I wonder what is the problem with regex101

That's pretty simple - use lookahead instead of lookbehind:
parsed = re.split(r'\n(?=\d)', data)

In python there is always more than one way to skin a cat. Multiline regexes are usually very hard. The following is a lot simpler, and more importantly readable
for line in data.split("\n"):
if line[0].isdigit():
if section:
sections.append("\n".join(section))
section=[]
section.append(line)
sections.append("\n".join(section)) # grab the last one
print(sections)
Performance wise, I think this would probably be better, because we are not looking for a pattern in the entire string. we are only looking at the first character in a line.

Related

Regex that matches all German and Austrian mobile phone numbers

I need a Python regex which matches to mobile phone numbers from Germany and Austria.
In order to do so, we first have to understand the structure of a phone number:
a mobile number can be written with a country calling code in the beginning. However, this code is optional!
if we use the country calling code the trunk prefix is redundant!
The prefix is composed out of the trunk prefix and the company code
The prefix is followed by an individual and unique number with 7 or 8 digits, respectivley.
List of German prefixes:
0151, 0160, 0170, 0171, 0175, 0152, 0162, 0172, 0173, 0174, 0155, 0157, 0159, 0163, 0176, 0177, 0178, 0179, 0164, 0168, 0169
List of Austrian prefixes:
0664, 0680, 0688, 0681, 0699, 0664, 0667, 0650, 0678, 0650, 0677, 0676, 0660, 0699, 0690, 0665, 0686, 0670
Now that we know all rules to build a regex, we have to consider, that humans sometimes write numbers in a very strange ways with multiple whitespaces, / or (). For example:
0176 98 600 18 9
+49 17698600189
+(49) 17698600189
0176/98600189
0176 / 98600189
many more ways to write the same number
I am looking for a Python regex which can match all Austian and German mobile numbers.
What I have so far is this:
^(?:\+4[39]|004[39]|0|\+\(49\)|\(\+49\))\s?(?=(?:[^\d\n]*\d){10,11}(?!\d))(\()?[19][1567]\d{1,2}(?(1)\))\s?\d(?:[ /-]?\d)+

You can use
(?x)^ # Free spacing mode on and start of string
(?: # A container group:
(\+49|0049|\+\(49\)|\(\+49\))? [ ()\/-]* # German: country code
(?(1)|0)1(?:5[12579]|6[023489]|7[0-9]) # trunk prefix and company code
| # or
(\+43|0043|\+\(43\)|\(\+43\))? [ ()\/-]* # Austrian: country code
(?(2)|0)6(?:64|(?:50|6[0457]|7[0678]|8[0168]|9[09])) # trunk prefix and company code
)
[ ()\/-]* # zero or more spaces, parens, / and -
\d(?:[ \/-]*\d){6,7} # a digit and then six or seven occurrences of space, / or - and a digit
\s* # zero or more whites
$ # end of string
See the regex demo.
A one-line version of the pattern is
^(?:(\+49|0049|\+\(49\)|\(\+49\))?[ ()\/-]*(?(1)|0)1(?:5[12579]|6[023489]|7[0-9])|(\+43|0043|\+\(43\)|\(\+43\))?[ ()\/-]*(?(2)|0)6(?:64|(?:50|6[0457]|7[0678]|8[0168]|9[09])))[ ()\/-]*\d(?:[ \/-]*\d){6,7}\s*$
See this demo.
How to create company code regex
Go to the Optimize long lists of fixed string alternatives in regex
Click the Run code snippet button at the bottom of the answer to run the last code snippet
Re-size the input box if you wish
Get the list of your supported numbers, either comma or linebreak separated and paste it into the field
Click Generate button, and grab the pattern that will appear below.

Python, pandas replace entire column with regex match of string

I'm using pandas to analyze data from 3 different sources, which are imported into dataframes and require modification to account for human error, as this data was all entered by humans and contains errors.
Specifically, I'm working with street names. Until now, I have been using .str.replace() to remove street types (st., street, blvd., ave., etc.), as shown below. This isn't working well enough, and I decided I would like to use regex to match a pattern, and transform that entire column from the original street name, to the pattern matched by regex.
df['street'] = df['street'].str.replace(r' avenue+', '', regex=True)
I've decided I would like to use regex to identify (and remove all other characters from the address column's fields): any number of integers, followed by a space, and then the first 3 number of alphabetic characters.
For example, "3762 pearl street" might become "3762 pea" if x is 3 with the following regex:
(\d+ )+\w{0,3}
How can I use panda's .str.replace to do this? I don't want to specify WHAT I want to replace with the second argument. I want to replace the original string with the pattern matched from regex.
Something that, in my mind, might work like this:
df['street'] = df['street'].str.replace(ORIGINAL STRING, r' (\d+ )+\w{0,3}, regex=True)
which might make 43 milford st. into "43 mil".
Thank you, please let me know if I'm being unclear.

you could use the extract method to overwrite the column with its own content
pat = r'(\d+\s[a-zA-Z]{3})'
df['street'] = df['street'].str.extract(pat)
Just an observation: The regex you shared (\d+ )+\w{0,3} matches the following patterns and returns some funky stuff as well
1131 1313 street
121 avenue
1 1 1 1 1 1 avenue
42
I've changed it up a bit based on what you described, but i'm not sure if that works for all your datapoints.

How do I print everything between two sentences?

Im very new to coding and only know the very basics. I am using python and trying to print everything between two sentences in a text. I only want the content between, not before or after. It`s probably very easy, but i couldnt figure it out.
Ev 39 Fursetfjellet (Oppdøl - Batnfjordsøra) No reports. Ev 134 Haukelifjell (Liamyrane bom - Fjellstad bom) Ev 134 Haukelifjell Hordaland / Telemark — Icy. 10 o'clock 1 degree. Valid from: 05.01.2020 13:53 Rv 3 Kvikne (Tynset (Motrøa) - Ulsberg)
I want to collect the bold text to use in website later. Everything except the italic text(the sentence before and after) is dynamic if that has anything to say.

You can use split to cut the string and access the parts that you are interested in.
If you know how to get the full text already, it's easy to get the bold sentence by removing the two constant sentences before and after.
full_text = "Ev 39 Fursetfjellet (Oppdøl - Batnfjordsøra) No reports. Ev 134 Haukelifjell (Liamyrane bom - Fjellstad bom) Ev 134 Haukelifjell Hordaland / Telemark — Icy. 10 o'clock 1 degree. Valid from: 05.01.2020 13:53 Rv 3 Kvikne (Tynset (Motrøa) - Ulsberg)"
s1 = "Ev 39 Fursetfjellet (Oppdøl - Batnfjordsøra) No reports. Ev 134 Haukelifjell (Liamyrane bom - Fjellstad bom)"
s2 = "Rv 3 Kvikne (Tynset (Motrøa) - Ulsberg)"
bold_text = full_text.split(s1)[1] # Remove the left part.
bold_text = bold_text.split(s2)[0] # Remove the right part.
bold_text = bold_text.strip() # Clean up spaces on each side if needed.
print(bold_text)

It looks like a job for regular expressions, there is the re module in Python.
You should:
Open the file
Read its content in a variable
Use search or match function in the re module
In particular, in the last step you should use your "surrounding" strings as "delimiters" and capture everything between them. You can achieve this using a regex pattern like str1 + "(.*)" + str2.
You can give a look at regex documentation, but just to give you an idea:
".*" captures everything
"()" allows you actually capture the content inside them and access it later with an index (e.g. re.search(pattern, original_string).group(1))

regex to find image sequences from list of filenames

I need some help with a regex string to pull any filename that looks like it might be part of a frame sequence out of a previously generated list of filenames.
Frames in a sequence will generally have a minimum padding of 3 and will be preceeded by either a '.' or a '_' An exception is: if the filename is only made up of a number and the .jpg extension (e.g 0001.jpg, 0002.jpg, etc.). I'd like to capture all these in one line of regex, if possible.
Here's what I have so far:
(.*?)(.|_)(\d{3,})(.*)\.jpg
Now I know this doesn't do the "preceeded by . or _" bit and instead just finds a . or _ anywhere in the string to return a positive. I've tried a bit of negative lookbehind testing, but can't get the syntax to work.
A sample of data is:
test_canon_shot02.jpg
test_shot01-04.jpg
test_shot02-03.jpg
test_shot02-02.jpg
test_shot01-03.jpg
test_canon_shot03.jpg
test_shot01-02.jpg
test_shot02.jpg
test_canon_shot02.jpg
test_shot01.jpg
GameAssets_.00024.jpg
GameAssets_.00023.jpg
GameAssets_.00022.jpg
GameAssets_.00021.jpg
GameAssets_.00020.jpg
GameAssets_.00019.jpg
GameAssets_.00018.jpg
GameAssets_.00017.jpg
GameAssets_.00016.jpg
GameAssets_.00015.jpg
GameAssets_.00014.jpg
GameAssets_.00013.jpg
GameAssets_.00012.jpg
GameAssets_.00011.jpg
GameAssets_.00010.jpg
GameAssets_.00009.jpg
GameAssets_.00008.jpg
GameAssets_.00007.jpg
GameAssets_.00006.jpg
GameAssets_.00005.jpg
GameAssets_.00004.jpg
GameAssets_.00003.jpg
GameAssets_.00002.jpg
GameAssets_.00001.jpg
GameAssets_.00000.jpg
OrangeXmas2015_Print_A ct2.jpg
sh120_HF_V01-01.jpg
sh120_HF_V01-02.jpg
sh200_DMP_v04.jpg
sh120_HF_V04.jpg
sh120_HF_V03.jpg
sh120_HF_V02.jpg
blah_v02.jpg
blah_v01.jpg
blah_Capture0 4.jpg
blah_Capture03 .jpg
blah_Capture01. jpg
blah_Capture02.jpg
Wall_GraniteBlock_G rey_TC041813.jpg
Renders10_wire.jpg
Renders10.jpg
Renders09_wire.jpg
Renders09.jpg
Renders08_wire.jpg
Renders08.jpg
Renders07_wire.jpg
Renders07.jpg
Renders06_wire.jpg
Renders06.jpg
Renders05_wire.jpg
Renders05.jpg
Renders04_wire.jpg
Renders04.jpg
Renders03_wire.jpg
Renders03.jpg
Renders02_wire.jpg
Renders02.jpg
Renders01_wire.jpg
Renders01.jpg
archmodels58_057_carpinusbetulus_leaf_diffuse.jpg
archmodels58_042_bark_bump.jpg
archmodels58_023_leaf_diffuse.jpg
WINDY TECHNICZNE-reflect00.jpg
archmodels58_057_leaf_opacity.jpg
archmodels58_057_bark_reflect.jpg
archmodels58_057_bark_bump.jpg
blahC-00-oknaka.jpg
bed
debt
cab
00018.jpg
00017.jpg
00016.jpg
00015.jpg
00014.jpg
00013.jpg
00012.jpg
00011.jpg
00010.jpg
00009.jpg
00008.jpg
00007.jpg
00006.jpg
00005.jpg
00004.jpg
00003.jpg
00002.jpg
00001.jpg
00000.jpg
The result I'm after is 2 sequences identified:
GameAssets_.00000.jpg to GameAssets_.00024.jpg
00000.jpg to 00018.jpg

Based on the rules you specified in your question, this pattern should accomplish what you need:
(^|\r?\n|.*_|.*\.)\d{3,}.*\.jpg

for item in re.findall(r'.*?[._]?0{3,}.*',data):
print(item)
GameAssets_.00024.jpg
GameAssets_.00023.jpg
GameAssets_.00022.jpg
GameAssets_.00021.jpg
GameAssets_.00020.jpg
GameAssets_.00019.jpg
GameAssets_.00018.jpg
GameAssets_.00017.jpg
GameAssets_.00016.jpg
GameAssets_.00015.jpg
GameAssets_.00014.jpg
GameAssets_.00013.jpg
GameAssets_.00012.jpg
GameAssets_.00011.jpg
GameAssets_.00010.jpg
GameAssets_.00009.jpg
GameAssets_.00008.jpg
GameAssets_.00007.jpg
GameAssets_.00006.jpg
GameAssets_.00005.jpg
GameAssets_.00004.jpg
GameAssets_.00003.jpg
GameAssets_.00002.jpg
GameAssets_.00001.jpg
GameAssets_.00000.jpg
00018.jpg
00017.jpg
00016.jpg
00015.jpg
00014.jpg
00013.jpg
00012.jpg
00011.jpg
00010.jpg
00009.jpg
00008.jpg
00007.jpg
00006.jpg
00005.jpg
00004.jpg
00003.jpg
00002.jpg
00001.jpg
00000.jpg

Try
(.*?)(\.|_?)(000\d{0,})(.*)\.jpg
Notice that I had to escape the '.' in the second group. Also, I had to make the search for '.' and '_' optional in the second group. Finally, I had to add the minimum padding to the third group.
I used regex101.com to test and refine the regex: regex101

US-format phone numbers to links in Python

I'm working a piece of code to turn phone numbers into links for mobile phone - I've got it but it feels really dirty.
import re
from string import digits
PHONE_RE = re.compile('([(]{0,1}[2-9]\d{2}[)]{0,1}[-_. ]{0,1}[2-9]\d{2}[-_. ]{0,1}\d{4})')
def numbers2links(s):
result = ""
last_match_index = 0
for match in PHONE_RE.finditer(s):
raw_number = match.group()
number = ''.join(d for d in raw_number if d in digits)
call = '%s' % (number, raw_number)
result += s[last_match_index:match.start()] + call
last_match_index = match.end()
result += s[last_match_index:]
return result
>>> numbers2links("Ghost Busters at (555) 423-2368! How about this one: 555 456 7890! 555-456-7893 is where its at.")
'Ghost Busters at (555) 423-2368! How about this one: 555 456 7890! 555-456-7893 is where its at.'
Is there anyway I could restructure the regex or the the regex method I'm using to make this cleaner?
Update
To clarify, my question is not about the correctness of my regex - I realize that it's limited. Instead I'm wondering if anyone had any comments on the method of substiting in links for the phone numbers - is there anyway I could use re.replace or something like that instead of the string hackery that I have?

Nice first take :) I think this version is a bit more readable (and probably a teensy bit faster). The key thing to note here is the use of re.sub. Keeps us away from the nasty match indexes...
import re
PHONE_RE = re.compile('([(]{0,1}[2-9]\d{2}[)]{0,1}[-_. ]{0,1}[2-9]\d{2}[-_. ]{0,1}\d{4})')
NON_NUMERIC = re.compile('\D')
def numbers2links(s):
def makelink(mo):
raw_number = mo.group()
number = NON_NUMERIC.sub("", raw_number)
return '%s' % (number, raw_number)
return PHONE_RE.sub(makelink, s)
print numbers2links("Ghost Busters at (555) 423-2368! How about this one: 555 456 7890! 555-456-7893 is where its at.")
A note: In my practice, I've not noticed much of a speedup pre-compiling simple regular expressions like the two I'm using, even if you're using them thousands of times. The re module may have some sort of internal caching - didn't bother to read the source and check.
Also, I replaced your method of checking each character to see if it's in string.digits with another re.sub() because I think my version is more readable, not because I'm certain it performs better (although it might).

Your regexp only parses a specific format, which is not the international standard. If you limit yourself to one country, it may work.
Otherwise, the international standard is ITU E.123 : "Notation for national and international telephone numbers,
e-mail addresses and Web addresses"

Why not re-use the work of others - for example, from RegExpLib.com?
My second suggestion is to remember there are other countries besides the USA, and quite a few of them have telephones ;-) Please don't forget us during your software development.
Also, there is a standard for the formatting of telephone numbers; the ITU's E.123. My recollection of the standard was that what it describes doesn't match well with popular usage.
Edit: I mixed up G.123 and E.123. Oops. Props Bortzmeyer

First off, reliably capturing phone numbers with a single regular expression is notoriously difficult with a strong tendency to being impossible. Not every country has a definition of a "phone number" that is as narrow as it is in the U.S. Even in the U.S., things are more complicated than they seem (from the Wikipedia article on the North American Numbering Plan):
A) Country code: optional prefix ("1" or "+1" or "001")
((00|\+)?1)?
B) Numbering Plan Area code (NPA): cannot begin with 1, digit 2 cannot be 9
[2-9][0-8][0-9]
C) Exchange code (NXX): cannot begin with 1, cannot end with "11", optional parentheses
\(?[2-9](00|[2-9]{2})\)?
D) Station Code: four digits, cannot all be 0 (I suppose)
(?!0{4})\d{4}
E) an optional extension may follow
([x#-]\d+)?
S) parts of the number are separated by spaces, dashes, dots (or not)
[. -]?
So, the basic regex for the U.S. would be:
((00|\+)?1[. -]?)?\(?[2-9][0-8][0-9]\)?[. -]?[2-9](00|[2-9]{2})[. -]?(?!0{4})\d{4}([. -]?[x#-]\d+)?
| A |S | | B | S | C | S | D | S | E |
And that's just for the relatively trivial numbering plan of the U.S., and even there it certainly is not covering all subtleties. If you want to make it reliable you have to develop a similar beast for all expected input languages.

A few things that will clean up your existing regex without really changing the functionality:
Replace {0,1} with ?, [(] with (, [)] with ). You also might as well just make your [2-9] b e a \d as well, so you can make those patterns be \d{3} and \d{4} for the last part. I doubt it will really increase the rate of false positives.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

How to split a multi-line string using regular expressions? - python

That's pretty simple - use lookahead instead of lookbehind: parsed = re.split(r'\n(?=\d)', data)

Related

Regex that matches all German and Austrian mobile phone numbers

Python, pandas replace entire column with regex match of string

How do I print everything between two sentences?

regex to find image sequences from list of filenames

US-format phone numbers to links in Python

Categories

Resources