How to print matching strings in python with regex? - python

I am working on a Python script that would go through a directory with a bunch of files and extract the strings that match a certain pattern.
More specifically, I'm trying to extract the values of serial number and a max-limit, and the lines look something like this:
#serial number = 642E0523D775
max-limit=50M/50M
I've got the script to go through the files, but I'm having an issue with it actually printing the values that I want it to. Instead of it printing the values, I just get the 'Nothing fount' output.
I'm thinking that it probably has something to do with the regex I'm using, but I can't for the life of me figure out how formulate this.
The script I've come up with so far:
import os
import re
#Where I'm searching
user_input = "/path/to/files/"
directory = os.listdir(user_input)
#What I'm looking for
searchstring = ['serial number', 'max-limit']
re_first = re.compile ('serial.\w.*')
re_second = re.compile ('max-limit=\w*.\w*')
#Regex combine
regex_list = [re_first, re_second]
#Looking
for fname in directory:
if os.path.isfile(user_input + os.sep + fname):
# Full path
f = open(user_input + os.sep + fname, 'r')
f_contents = f.read()
content = fname + f_contents
files = os.listdir(user_input)
lines_seen = set()
for f in files:
print(f)
if f not in lines_seen: # not a duplicate
for regex in regex_list:
matches = re.findall(regex, content)
if matches != None:
for match in matches:
print(match)
else:
print('Nema')
f.close()

Per the documentation, the regex module's match() searches for "characters at the beginning of a string [that] match the regular expression pattern". Since you are prepending your file contents with the file name in the line:
content=fname + f_contents
and then matching your pattern against the content in the line:
result=re.match(regex, content)
there will never be a match.
Since you want to locate a match anywhere in string, use search() instead.
See also: search() vs match()
Edit
The pattern ^[\w&.\-]+$ provided would match neither serial number = 642E0523D775 as it contains a space (" "), nor max-limit=50M/50M as it contains a forward slash ("/"). Both also contain an equals sign ("=") which cannot be matched by your pattern.
Additionally, the character class in this pattern matches the backslash (""), so you may want to remove it (the dash ("-") should not be escaped when it is at the end of the character class).
A pattern to match both these strings as well could be:
^[\w&. \/=\-]+$
Try it out here

Related

Search for sentences containing characters using Python regular expressions

I am searching for sentences containing characters using Python regular expressions.
But I can't find the sentence I want.
Please help me
regex.py
opfile = open(file.txt, 'r')
contents = opfile.read()
opfile.close()
index = re.findall(r'\[start file\](?:.|\n)*\[end file\]', contents)
item = re.search(r'age.*', str(index))
file.txt(example)
[start file]
name: steve
age: 23
[end file]
result
<re.Match object; span=(94, 738), match='age: >
The age is not printed
There are several issues here:
The str(index) returns the string literal representation of the string list, and it makes it difficult to further process the result
(?:.|\n)* is a very resource consuming construct, use a mere . with the re.S or re.DOTALL option
If you plan to find a single match, use re.search, not re.findall.
Here is a possible solution:
match = re.search(r'\[start file].*\[end file]', contents, re.S)
if match:
match2 = re.search(r"\bage:\s*(\d+)", match.group())
if match2:
print(match2.group(1))
Output:
23
If you want to get age in the output, use match2.group().
If you want to match the age only once between the start and end file markers, you could use a single pattern with a capture group and in between match all lines that do not start with age: or the start or end marker.
^\[start file](?:\n(?!age:|\[(?:start|end) file]).*)*\nage: (\d+)(?:\n(?!\[(?:start|end) file]).*)*\n\[end file]
Regex demo
Example
import re
regex = r"^\[start file](?:\n(?!age:|\[(?:start|end) file]).*)*\nage: (\d+)(?:\n(?!\[(?:start|end) file]).*)*\n\[end file]"
s = ("[start file]\n" "name: steve \n" "age: 23\n" "[end file]")
m = re.search(regex, s)
if m:
print(m.group(1))
Output
23
The example input looks like a list of key, value pairs enclosed between some start/end markers. For this use-case, it might be more efficient and readable to write the parsing stage as:
re.search to locate the document
splitlines() to isolate individual records
split() to extract the key and value of each record
Then, in a second step, access the extracted records.
Doing this allows to separate the parsing and exploitation parts and makes the code easier to maintain.
Additionally, a good practice is to wrap access to a file in a "context manager" (the with statement) to guarantee all resources are correctly cleaned on error.
Here is a full standalone example:
import re
# 1: Load the raw data from disk, in a context manager
with open('/tmp/file.txt') as f:
contents = f.read()
# 2: Parse the raw data
fields = {}
if match := re.search(r'\[start file\]\n(.*)\[end file\]', contents, re.S):
for line in match.group(1).splitlines():
k, v = line.split(':', 1)
fields[k.strip()] = v.strip()
# 3: Actual data exploitation
print(fields['age'])

regex: pattern fails to match what I am looking for

I have the following code that tries to retrieve the name of a file from a directory based on a double \ character:
import re
string = 'I:/Etrmtest/PZMALIo4/ETRM841_FX_Deals_Restructuring/FO_PRE\\abo_st_gas_dtd.csv'
pattern = r'(?<=*\\\\)*'
re.findall(pattern,string)
The reasoning behind is that the name of the file is always after a double \ , so I try to look any string which is preceeded by any text that finishes with \ .
Neverthless, when I apply this code I get the following error:
error: nothing to repeat at position 4
What am I doing wrong?
Edit: The concrete output I am looking for is getting the string 'abo_st_gas_dtd_csv' as a match.
There's a couple of things going on:
You need to declare your string definition using the same r'string' notation as for the pattern; right now your string only has a single backslash, since the first one of the two is escaped.
I'm not sure you're using * correctly. It means "repeat immediately preceding group", and not just "any string" (as, e.g., in the usual shell patterns). The first * in parentheses does not have anything preceding it, meaning that the regex is invalid. Hence the error you see. I think, what you want is .*, i.e., repeating any character 0 or more times. Furthermore, it is not needed in the parentheses. A more correct regexp would be r'(?<=\\\\).*':
import re
string = r'I:/Etrmtest/PZMALIo4/ETRM841_FX_Deals_Restructuring/FO_PRE\\abo_st_gas_dtd.csv'
pattern = r'(?<=\\\\).*'
re.findall(pattern,string)
Your pattern is just a lookabehind, which, by itself, can't match anything. I would use this re.findall approach:
string = 'I:/Etrmtest/PZMALIo4/ETRM841_FX_Deals_Restructuring/FO_PRE\\abo_st_gas_dtd.csv'
filename = re.findall(r'\\([^.]+\.\w+)$', string)[0]
print(filename) # abo_st_gas_dtd.csv
files = 'I:E\\trm.csvest/PZMALIo4\ETRM841_FX_.csvDeals_Restructuring/FO_PRE\\abo_st_gas_dtd.csv'
counter = -1
my_files = []
for f in files:
counter += 1
if ord(f) == 92:#'\'
temp = files[counter+1:len(files)]
temp_file = ""
for f1 in temp:
temp_file += f1
# [0-len(temp_file)] => if [char after . to num index of type file]== csv
if f1 == '.' and temp[len(temp_file):len(temp_file)+3] == "csv":
my_files.append(temp_file + "csv")
break
print(my_files)#['trm.csv', 'ETRM841_FX_.csv', 'abo_st_gas_dtd.csv']

How can I find some words in files with regex?

I have many files and need to categorize them into the words that come up there.
ex) [..murder..murderAttempted..] or [murder, murderAttempted] etc..
I tried this code. but not all came out. so I want "murder" and "murderAttmpted" in files surrounded by "[ ]".
def func(root_dir):
for files in os.listdir(root_dir):
pattern = r'\[.+murder.+murderAttempted.+'
if "txt" in files:
f = open(root_dir + files, 'rt', encoding='UTF8')
for i, line in enumerate(f):
for match in re.finditer(pattern, line):
print(match.group())
This appears to work for me: pattern = r'\[.*murder.*murderAttempted.*\]' instead of pattern = r'\[.+murder.+murderAttempted.+'. I believe it returns all occurrences of "murder" and "murderAttempted" in files surrounded by "[]". The + requires 1 or more occurrence whereas * could have 0. Also note the addition of the end \]. This ensures you only capture strings that are enclosed in brackets.

Python - Regex - combination of letters and numbers (undefined length)

I am trying to get a File-ID from a text file. In the above example the filename is d735023ds1.htm which I want to get in order to build another url. Those filenames differ however in their length and I would need a universal regex expression to cover all possibilities.
Example filenames
d804478ds1a.htm.
d618448ds1a.htm.
d618448.htm
My code
for cik in leftover_cik_list:
r = requests.get(filing.url)
content = str(r.content)
fileID = None
for line in content.split("\n"):
if fileID == None:
fileIDIndex = line.find("<FILENAME>")
if fileIDIndex != -1:
trimmedText = line[fileIDIndex:]
result = RegEx.search(r"^[\w\d.htm]*$", trimmedText)
if result:
fileID = result.group()
print ("fileID",fileID)
document_link = "https://www.sec.gov/Archives/edgar/data/{0}/{1}/{2}.htm".format(cik, accession_number, fileID)
print ("Document Link to S-1:", document_link)
import re
...
result = re.search('^d\d{1,6}.+\.htm$', trimmedText)
if result:
fileID = result.group()
^d = Start with a d
\d{1,6} = Look for 1-6 digits, if there could be an unlimited amount of digits replace with \d{1,}
.+ = Wild card
\.htm$ = End in .htm
You should try re.match() which searches for a pattern at the beginning of the input string. Also, your regex is not good, you have to add an anti-shash before ., as point means "any character" in regex.
import re
result = re.match('[\w]+\.htm', trimmedText)
Try this regex:
import re
files = [
"d804478ds1a.htm",
"d618448ds1a.htm",
"d618448.htm"
]
for f in files:
match = re.search(r"d\w+\.htm", f)
print(match.group())
d804478ds1a.htm
d618448ds1a.htm
d618448.htm
The assumptions in the above are that the file name starts with a d, ends with .htm and contains only letters, digits and underscores.

Stripping Hex code from a plain text file in Python [duplicate]

I have a string. How do I remove all text after a certain character? (In this case ...)
The text after will ... change so I that's why I want to remove all characters after a certain one.
Split on your separator at most once, and take the first piece:
sep = '...'
stripped = text.split(sep, 1)[0]
You didn't say what should happen if the separator isn't present. Both this and Alex's solution will return the entire string in that case.
Assuming your separator is '...', but it can be any string.
text = 'some string... this part will be removed.'
head, sep, tail = text.partition('...')
>>> print head
some string
If the separator is not found, head will contain all of the original string.
The partition function was added in Python 2.5.
S.partition(sep) -> (head, sep, tail)
Searches for the separator sep in S, and returns the part before it,
the separator itself, and the part after it. If the separator is not
found, returns S and two empty strings.
If you want to remove everything after the last occurrence of separator in a string I find this works well:
<separator>.join(string_to_split.split(<separator>)[:-1])
For example, if string_to_split is a path like root/location/child/too_far.exe and you only want the folder path, you can split by "/".join(string_to_split.split("/")[:-1]) and you'll get
root/location/child
Without a regular expression (which I assume is what you want):
def remafterellipsis(text):
where_ellipsis = text.find('...')
if where_ellipsis == -1:
return text
return text[:where_ellipsis + 3]
or, with a regular expression:
import re
def remwithre(text, there=re.compile(re.escape('...')+'.*')):
return there.sub('', text)
import re
test = "This is a test...we should not be able to see this"
res = re.sub(r'\.\.\..*',"",test)
print(res)
Output: "This is a test"
The method find will return the character position in a string. Then, if you want remove every thing from the character, do this:
mystring = "123⋯567"
mystring[ 0 : mystring.index("⋯")]
>> '123'
If you want to keep the character, add 1 to the character position.
From a file:
import re
sep = '...'
with open("requirements.txt") as file_in:
lines = []
for line in file_in:
res = line.split(sep, 1)[0]
print(res)
This is in python 3.7 working to me
In my case I need to remove after dot in my string variable fees
fees = 45.05
split_string = fees.split(".", 1)
substring = split_string[0]
print(substring)
Yet another way to remove all characters after the last occurrence of a character in a string (assume that you want to remove all characters after the final '/').
path = 'I/only/want/the/containing/directory/not/the/file.txt'
while path[-1] != '/':
path = path[:-1]
another easy way using re will be
import re, clr
text = 'some string... this part will be removed.'
text= re.search(r'(\A.*)\.\.\..+',url,re.DOTALL|re.IGNORECASE).group(1)
// text = some string

Categories

Resources