using regex to identify characters and digits in python

I have phone numbers that might look like:
927-6847
611-6701p3715ou264-5435
869-6289fillemichelinemoisan
613-5000p4238soirou570-9639cel
and so on...
I want to identify and break them into:
9276847
6116701
2645435
8696289
6135000
5709639
String to store somewhere else:
611-6701p3715ou264-5435
869-6289fillemichelinemoisan
613-5000p4238soirou570-9639cel
When there is a p between digits, the number after the p is an extension: get the number before the p and save the whole string somewhere else.
When there is ou, another number starts after it.
When there is cel or any other random string, get the number part and save the whole string somewhere else.
Edit: This is what I have tried:
import re
phNumber = '928-4612cel'
if not re.match(r'^\d*$', phNumber):
    res = re.match(r"(.*?)[a-z]", re.sub(r'[^\d\w]', '', phNumber)).group(1)
I want to handle these cases and also identify which strings had extra characters before they were chopped off by the regex.

First, let me confirm your request:
Find every number with the pattern "xxx-xxxx", where x is any digit from 0 to 9, and save it in the form "xxxxxxx".
If the text contains any other characters, also save the whole string.
import re
# list of all the strings to test
EXAMPLE = [
    "927-6847",
    "9276847",
    "927.6847",
    "611-6701p3715ou264-5435",
    "6116701p3715ou264-5435",
    "869-6289fillemichelinemoisan",
    "869.6289fillemichelinemoisan",
    "8696289fillemichelinemoisan",
    "613-5000p4238soirou570-9639cel",
]
def save_phone_number(test_string, output_file_name):
    number_to_save = []
    # regex pattern "xxx-xxxx" where each x is a digit
    regex_pattern = r"[0-9]{3}-[0-9]{4}"
    phone_numbers = re.findall(regex_pattern, test_string)
    # remove the "-"
    for item in phone_numbers:
        number_to_save.append(item.replace("-", ""))
    # save to file
    with open(output_file_name, "a") as file_object:
        for item in number_to_save:
            file_object.write(item + "\n")
def save_somewhere_else(test_string, output_file_name):
    # regex pattern that matches if there is any letter in the string
    # (.*) means any characters, of any length
    # [a-zA-Z] means one lowercase or uppercase letter
    regex_pattern = r"(.*)[a-zA-Z](.*)"
    if re.match(regex_pattern, test_string) is not None:
        with open(output_file_name, "a") as file_object:
            file_object.write(test_string + "\n")
if __name__ == "__main__":
    phone_number_file = "phone_number.txt"
    somewhere_file = "somewhere.txt"
    for each_string in EXAMPLE:
        save_phone_number(each_string, phone_number_file)
        save_somewhere_else(each_string, somewhere_file)
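The rules from the question can also be sketched in a single helper, under a few assumptions: every number has seven digits with an optional "-" or "." after the third digit, "ou" only matters when a number follows it, and anything other than one plain number means the whole string should be kept for review. The names split_phone_field and needs_review are made up for this illustration and do not come from the question.

```python
import re

# Seven digits with an optional "-" or "." after the third one (assumed format).
NUMBER = re.compile(r"(\d{3})[-.]?(\d{4})")

def split_phone_field(raw):
    """Return (numbers, needs_review) for one raw input string."""
    numbers = []
    # "ou" introduces another number, so look at the start of each piece.
    for part in raw.split("ou"):
        m = NUMBER.match(part)
        if m:
            # Anything after the number (p + extension, "cel", ...) is ignored.
            numbers.append(m.group(1) + m.group(2))
    # Keep the whole string for review unless it is exactly one plain number.
    needs_review = NUMBER.fullmatch(raw) is None
    return numbers, needs_review

for sample in ["927-6847",
               "611-6701p3715ou264-5435",
               "869-6289fillemichelinemoisan",
               "613-5000p4238soirou570-9639cel"]:
    print(sample, split_phone_field(sample))
```

Because only the start of each piece is inspected, a four-digit extension after p is never mistaken for a phone number, and an "ou" that happens to sit inside a word is harmless as long as no number follows it.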

Related

I have a problem with the task of reversing words and removing parentheses

Task
Write a program that will decode the secret message by reversing text
between square brackets. The message may contain nested brackets (that
is, brackets within brackets, such as One[owT[Three[ruoF]]]). In
this case, innermost brackets take precedence, similar to parentheses
in mathematical expressions, e.g. you could decode the aforementioned
example like this:
One[owT[Three[ruoF]]]
One[owT[ThreeFour]]
One[owTruoFeerhT]
OneThreeFourTwo
In order to make your own task slightly easier and less tricky, you
have already replaced all whitespaces in the original text with
underscores (“_”) while copying it from the paper version.
Input description
The first and only line of the standard input
consists of a non-empty string of up to 2 · 10^6 characters which may
be letters, digits, basic punctuation (“,.?!’-;:”), underscores (“_”)
and square brackets (“[]”). You can safely assume that all square
brackets are paired correctly, i.e. every opening bracket has exactly
one closing bracket matching it and vice versa.
Output description
The standard output should contain one line – the
decoded secret message without any square brackets.
Example
For sample input:
A[W_[y,[]]oh]o[dlr][!]
the correct output is:
Ahoy,_World!
Explanation
This example contains empty brackets. Of course, an empty string, when
reversed, remains empty, so we can simply ignore them. Then, as
previously, we can decode this example in stages, first reversing the
innermost brackets to obtain A[W_,yoh]o[dlr][!]. Afterwards, there
are no longer any nested brackets, so the remainder of the task is
trivial.
Below is my program that doesn't quite work
word = input("print something: ")
word_reverse = word[::-1]
while("[" in word and "]" in word):
    open_brackets_index = word.index("[")
    close_brackets_index = word_reverse.index("]")*(-1)-1
    # print(word)
    # print(open_brackets_index)
    # print(close_brackets_index)
    reverse_word_into_quotes = word[open_brackets_index+1:close_brackets_index:][::-1]
    word = word[:close_brackets_index]
    word = word[:open_brackets_index]
    word = word+reverse_word_into_quotes
    word = word.replace("[","]").replace("]","[")
    print(word)
print(word)
Unfortunately, my code only works with one pair of brackets, and I don't know how to fix it.
Thank you in advance for your help

Assuming the re module can be used, this code does the job:
import re
text = 'A[W_[y,[]]oh]o[dlr][!]'
# This scary regular expression does all the work:
# it finds a sequence that starts with [ and ends with ] and
# contains anything BUT [ and ]
pattern = re.compile(r'\[([^\[\]]*)\]')
while True:
    m = re.search(pattern, text)
    if m:
        # Here a single pattern like [String], if any, is replaced with gnirtS
        text = re.sub(pattern, m[1][::-1], text, count=1)
    else:
        break
print(text)
Which prints this line:
Ahoy,_World!

I realize that my previous answer has been accepted but, for completeness, I'm submitting a second solution that does NOT use the re module:
text = 'A[W_[y,[]]oh]o[dlr][!]'
def find_pattern(text):
    # Find [...] and return the locations of [ (start) ] (end)
    # and the in-between str (content)
    content = ''
    for i,c in enumerate(text):
        if c == '[':
            content = ''
            start = i
        elif c == ']':
            end = i
            return start, end, content
        else:
            content += c
    return None, None, None
while True:
    start, end, content = find_pattern(text)
    if start is None:
        break
    # Replace the content between [] with its reverse
    text = "".join((text[:start], content[::-1], text[end+1:]))
print(text)
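Both answers above rescan the text from the beginning after every replacement, which may get slow for inputs near the stated size limit. As an alternative sketch (not taken from either answer), the message can be decoded in one left-to-right pass with a stack, where each open bracket gets its own buffer. The function name decode is an assumption made for this example.

```python
def decode(message):
    # stack[0] collects text outside any brackets; every "[" pushes a new
    # buffer for the characters inside that bracket pair.
    stack = [[]]
    for ch in message:
        if ch == '[':
            stack.append([])
        elif ch == ']':
            # Close the innermost open pair: reverse its content and
            # append it to the enclosing level.
            inner = stack.pop()
            inner.reverse()
            stack[-1].extend(inner)
        else:
            stack[-1].append(ch)
    return ''.join(stack[0])

print(decode('A[W_[y,[]]oh]o[dlr][!]'))  # Ahoy,_World!
print(decode('One[owT[Three[ruoF]]]'))   # OneThreeFourTwo
```

This relies on the task's guarantee that all brackets are paired correctly. Deeply nested input still causes the same characters to be reversed once per nesting level, so this is a sketch of the idea rather than a tuned solution.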

Python - Regex - combination of letters and numbers (undefined length)

I am trying to get a File-ID from a text file. In the above example the filename is d735023ds1.htm which I want to get in order to build another url. Those filenames differ however in their length and I would need a universal regex expression to cover all possibilities.
Example filenames
d804478ds1a.htm.
d618448ds1a.htm.
d618448.htm
My code
for cik in leftover_cik_list:
    r = requests.get(filing.url)
    content = str(r.content)
    fileID = None
    for line in content.split("\n"):
        if fileID is None:
            fileIDIndex = line.find("<FILENAME>")
            if fileIDIndex != -1:
                trimmedText = line[fileIDIndex:]
                result = RegEx.search(r"^[\w\d.htm]*$", trimmedText)
                if result:
                    fileID = result.group()
    print("fileID", fileID)
    document_link = "https://www.sec.gov/Archives/edgar/data/{0}/{1}/{2}.htm".format(cik, accession_number, fileID)
    print("Document Link to S-1:", document_link)

import re
...
result = re.search(r'^d\d{1,6}.+\.htm$', trimmedText)
if result:
    fileID = result.group()
^d = Start with a d
\d{1,6} = Look for 1-6 digits; if there could be an unlimited number of digits, replace it with \d+
.+ = Wild card (one or more of any character)
\.htm$ = End in .htm

You should try re.match(), which looks for a pattern only at the beginning of the input string. Also, your regex is not correct: you have to put a backslash before the ., because a dot means "any character" in regex.
import re
result = re.match(r'\w+\.htm', trimmedText)

Try this regex:
import re
files = [
    "d804478ds1a.htm",
    "d618448ds1a.htm",
    "d618448.htm"
]
for f in files:
    match = re.search(r"d\w+\.htm", f)
    print(match.group())
d804478ds1a.htm
d618448ds1a.htm
d618448.htm
The assumptions in the above are that the file name starts with a d, ends with .htm and contains only letters, digits and underscores.
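Since trimmedText in the question starts at the <FILENAME> tag, another option is to capture the name directly after the tag. The following is a minimal sketch, assuming the line looks like <FILENAME>d735023ds1.htm and that the ID should exclude the .htm extension (the question's URL template appends .htm itself). The helper name extract_file_id is hypothetical.

```python
import re

# Capture the word characters between the tag and the ".htm" extension.
FILENAME_RE = re.compile(r"<FILENAME>\s*(\w+)\.htm")

def extract_file_id(text):
    """Return the file ID after the first <FILENAME> tag, or None."""
    m = FILENAME_RE.search(text)
    return m.group(1) if m else None

sample = "<TYPE>S-1\n<FILENAME>d735023ds1.htm\n<DESCRIPTION>FORM S-1"
print(extract_file_id(sample))  # d735023ds1
```

Using search on the whole text avoids the manual line splitting and index arithmetic, and the None return makes the "no filename found" case explicit.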

Python Regex: Remove the parts of the string that does not match regex pattern

I want to remove the parts of the string that do not match the format that I want. Example:
import re
string = 'remove2017abcdremove'
pattern = re.compile("((20[0-9]{2})([a-zA-Z]{4}))")
result = pattern.search(string)
if result:
    print('1')
else:
    print('0')
It prints "1", so I can find the matching format inside the string; however, I also want to remove the parts that say "remove".
I want it to return:
desired_output = '2017abcd'

You need to extract the matched text from the search result, which is done by calling group():
import re
string = 'remove2017abcdremove'
pattern = re.compile("(20[0-9]{2}[a-zA-Z]{4})")
string = pattern.search(string).group()
# 2017abcd
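One caveat with the snippet above: search returns None when nothing matches, so calling group() directly would raise an AttributeError. A small sketch that guards against this, with keep_match as a made-up helper name:

```python
import re

PATTERN = re.compile(r"20[0-9]{2}[a-zA-Z]{4}")

def keep_match(s):
    """Return only the part of s that matches PATTERN, or None."""
    m = PATTERN.search(s)
    return m.group() if m else None

print(keep_match('remove2017abcdremove'))  # 2017abcd
print(keep_match('nothing to keep'))       # None
```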

How would I match a string that may or may not span multiple lines?

I have a document that when converted to text splits the phone number onto multiple lines like this:
(xxx)-xxx-
xxxx
For a variety of reasons related to my project I can't simply join the lines.
If I know the phonenumber="(555)-555-5555" how can I compile a regex so that if I run it over
(555)-555-
5555
it will match?
EDIT:
To help clarify my question here it is in a more abstract form.
test_string = "xxxx xx x xxxx"
text = """xxxx xx
x
xxxx"""
I need the test string to be found in the text. Newlines can be anywhere in the text and characters that need to be escaped should be taken into consideration.

A simple workaround would be to replace all the \n characters in the document text before you search it:
pat = re.compile(r'\(\d{3}\)-\d{3}-\d{4}')
numbers = pat.findall(text.replace('\n',''))
# ['(555)-555-5555']
If this cannot be done for any reasons, the obvious answer, though unsightly, would be to handle a newline character between each search character:
pat = re.compile(r'\(\n*5\n*5\n*5\n*\)\n*-\n*5\n*5\n*5\n*-\n*5\n*5\n*5\n*5')
If you needed to handle any format, you can pad the format like so:
phonenumber = '(555)-555-5555'
pat = re.compile('\n*'.join(['\\'+i if not i.isalnum() else i for i in phonenumber]))
# pat
# re.compile(r'\(\n*5\n*5\n*5\n*\)\n*\-\n*5\n*5\n*5\n*\-\n*5\n*5\n*5\n*5', re.UNICODE)
Test case:
import random
def rndinsert(s):
    i = random.randrange(len(s)-1)
    return s[:i] + '\n' + s[i:]
for i in range(10):
    print(pat.findall(rndinsert('abc (555)-555-5555 def')))
# ['(555)-555-5555']
# ['(555)-5\n55-5555']
# ['(555)-5\n55-5555']
# ['(555)-555-5555']
# ['(555\n)-555-5555']
# ['(5\n55)-555-5555']
# ['(555)\n-555-5555']
# ['(555)-\n555-5555']
# ['(\n555)-555-5555']
# ['(555)-555-555\n5']

You can search for a possible \n existing in the string:
import re
nums = ["(555)-555-\n5555", "(555)-555-5555"]
new_nums = [i for i in nums if re.findall(r'\([\d\n]+\)[\n-][\d\n]+-[\d\n]+', i)]
Output:
['(555)-555-\n5555', '(555)-555-5555']

import re
data = ["(555)-555-\n5555", "(55\n5)-555-\n55\n55", "(555\n)-555-\n5555", "(555)-555-5555"]
number = '(555)-555-5555'
# allow optional newlines between every pair of characters
number = re.sub(r'(?!^|$)', r'\\n*', number)
# escape the parentheses ()
number = re.sub(r'(?=[()])', r'\\', number)
r = re.compile(number)
match = list(filter(r.match, data))
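For the abstract form in the question's EDIT, where spaces in the search string may themselves appear as newlines in the text, the same idea can be sketched with re.escape. This assumes that stray newlines may appear between any two characters and that each whitespace character in the search string may match any run of whitespace. The helper name multiline_pattern is invented for this example.

```python
import re

def multiline_pattern(literal):
    """Compile a regex matching literal even when newlines are interleaved."""
    pieces = []
    for ch in literal:
        if ch.isspace():
            # A space in the search string may show up as a newline in the text.
            pieces.append(r"\s+")
        else:
            # re.escape takes care of characters such as ( ) and -
            pieces.append(re.escape(ch))
    # Allow any number of newlines between consecutive characters.
    return re.compile(r"\n*".join(pieces))

pat = multiline_pattern("(555)-555-5555")
print(repr(pat.search("call (555)-555-\n5555 today").group()))

pat = multiline_pattern("xxxx xx x xxxx")
print(pat.search("xxxx xx\nx\nxxxx") is not None)
```

Letting re.escape decide what needs a backslash avoids hand-maintaining a list of special characters.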

Python test if string matches a template value

I am trying to iterate through a list of strings, keeping only those that match a naming template I have specified. I want to accept any list entry that matches the template exactly, other than having an integer in a variable <SCENARIO> field.
The check needs to be general. Specifically, the string structure could change such that there is no guarantee <SCENARIO> always shows up at character X (to use list comprehensions, for example).
The code below shows an approach that works using split, but there must be a better way to make this string comparison. Could I use regular expressions here?
template = 'name_is_here_<SCENARIO>_20131204.txt'
testList = ['name_is_here_100_20131204.txt', # should accept
'name_is_here_100_20131204.txt.NEW', # should reject
'other_name.txt'] # should reject
acceptList = []
for name in testList:
    print(name)
    acceptFlag = True
    splitTemplate = template.split('_')
    splitName = name.split('_')
    # if lengths do not match, name cannot possibly match template
    if len(splitTemplate) == len(splitName):
        print(list(zip(splitTemplate, splitName)))
        # compare records in the split
        for t, n in zip(splitTemplate, splitName):
            if t != n and not t == '<SCENARIO>':
                # reject if any of the "other" fields are not identical
                # (would also check that '<SCENARIO>' field is numeric - not shown here)
                print('reject: ' + name)
                acceptFlag = False
    else:
        acceptFlag = False
    # keep name if it passed checks
    if acceptFlag == True:
        acceptList.append(name)
print(acceptList)
# correctly prints --> ['name_is_here_100_20131204.txt']

Try with the re module for regular expressions in Python:
import re
template = re.compile(r'^name_is_here_(\d+)_20131204\.txt$')
testList = ['name_is_here_100_20131204.txt', #accepted
'name_is_here_100_20131204.txt.NEW', #rejected!
'name_is_here_aabs2352_20131204.txt', #rejected!
'other_name.txt'] #rejected!
acceptList = [item for item in testList if template.match(item)]

This should do it. I understand that name_is_here is just a placeholder for alphanumeric characters?
import re
testList = ['name_is_here_100_20131204.txt', # should accept
'name_is_here_100_20131204.txt.NEW', # should reject
'other_name.txt',
'name_is_44ere_100_20131204.txt',
'name_is_here_100_2013120499.txt',
'name_is_here_100_something_2013120499.txt',
'name_is_here_100_something_20131204.txt']
def find(scenario):
    begin = '[a-z_]+100_'  # any combination of lowercase letters and underscores followed by 100
    end = r'_[0-9]{8}\.txt$'  # exactly eight digits followed by .txt at the end
    pattern = re.compile("".join([begin, scenario, end]))
    result = []
    for word in testList:
        if pattern.match(word):
            result.append(word)
    return result
find('something') # returns ['name_is_here_100_something_20131204.txt']
EDIT: scenario is now in a separate variable; the regex only matches characters followed by 100, then the scenario, then eight digits followed by .txt.
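To keep the check general, as the question asks, the regex can also be built from the template string itself instead of being written by hand. This is a minimal sketch, assuming the <SCENARIO> field must be one or more digits; the helper name template_to_regex is hypothetical.

```python
import re

def template_to_regex(template, field="<SCENARIO>"):
    # Escape the literal parts so characters like "." match only themselves,
    # then put a digits-only group wherever the placeholder was.
    parts = [re.escape(part) for part in template.split(field)]
    return re.compile(r"(\d+)".join(parts))

template = 'name_is_here_<SCENARIO>_20131204.txt'
pattern = template_to_regex(template)
testList = ['name_is_here_100_20131204.txt',      # should accept
            'name_is_here_100_20131204.txt.NEW',  # should reject
            'name_is_here_aabs2352_20131204.txt', # should reject
            'other_name.txt']                     # should reject
# fullmatch requires the whole name to fit the template
acceptList = [name for name in testList if pattern.fullmatch(name)]
print(acceptList)  # ['name_is_here_100_20131204.txt']
```

Because the placeholder is located by splitting the template, it does not matter where in the string <SCENARIO> appears.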
