I am trying to get a file ID from a text file. In the above example the filename is d735023ds1.htm, which I want to extract in order to build another URL. The filenames differ in length, however, so I need a single regex that covers all possibilities.
Example filenames
d804478ds1a.htm.
d618448ds1a.htm.
d618448.htm
My code
for cik in leftover_cik_list:
    r = requests.get(filing.url)
    content = str(r.content)
    fileID = None
    for line in content.split("\n"):
        if fileID == None:
            fileIDIndex = line.find("<FILENAME>")
            if fileIDIndex != -1:
                trimmedText = line[fileIDIndex:]
                result = RegEx.search(r"^[\w\d.htm]*$", trimmedText)
                if result:
                    fileID = result.group()
                    print ("fileID",fileID)
    document_link = "https://www.sec.gov/Archives/edgar/data/{0}/{1}/{2}.htm".format(cik, accession_number, fileID)
    print ("Document Link to S-1:", document_link)
import re
...
result = re.search(r'^d\d{1,6}.+\.htm$', trimmedText)
if result:
    fileID = result.group()
^d = Start with a d
\d{1,6} = Look for 1-6 digits; if the number of digits could be unlimited, replace with \d{1,}
.+ = One or more of any character (wildcard)
\.htm$ = End in .htm
You should try re.match(), which looks for a pattern at the beginning of the input string. Also, your regex is off: you have to put a backslash before the ., since a dot means "any character" in regex.
import re
result = re.match(r'[\w]+\.htm', trimmedText)
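To illustrate why the dot needs escaping, here is a small sketch. The two sample strings are made up for the demonstration and are not taken from the original data.

```python
import re

# An unescaped dot matches any character, so a name without a real
# ".htm" extension can still slip through.
unescaped = re.compile(r"\w+.htm")
escaped = re.compile(r"\w+\.htm")

good = "d618448ds1a.htm"   # made-up sample with a literal dot
bad = "d618448xhtm"        # made-up sample without a dot

print(bool(unescaped.match(good)), bool(unescaped.match(bad)))
print(bool(escaped.match(good)), bool(escaped.match(bad)))
```

Only the escaped version rejects the second string.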
Try this regex:
import re
files = [
    "d804478ds1a.htm",
    "d618448ds1a.htm",
    "d618448.htm"
]
for f in files:
    match = re.search(r"d\w+\.htm", f)
    print(match.group())
d804478ds1a.htm
d618448ds1a.htm
d618448.htm
The assumptions in the above are that the file name starts with a d, ends with .htm and contains only letters, digits and underscores.
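Putting the pieces together, the following is a minimal sketch of pulling the name straight out of the text after the <FILENAME> tag. The sample text and the helper name extract_file_id are assumptions made for illustration; the real filing content may be laid out differently.

```python
import re

# Hypothetical excerpt of the downloaded text; the real layout may differ.
sample = "<TYPE>S-1\n<SEQUENCE>1\n<FILENAME>d735023ds1.htm\n<DESCRIPTION>FORM S-1\n"

# Capture letters, digits and underscores followed by a literal ".htm",
# but only when they come right after the <FILENAME> tag.
FILENAME_RE = re.compile(r"<FILENAME>\s*(\w+\.htm)")

def extract_file_id(text):
    """Return the first file name after a <FILENAME> tag, or None."""
    match = FILENAME_RE.search(text)
    return match.group(1) if match else None

print(extract_file_id(sample))
```

Anchoring on the tag instead of on ^ avoids having to trim the line before matching.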
I have the following code that tries to retrieve the name of a file from a directory based on a double \ character:
import re
string = 'I:/Etrmtest/PZMALIo4/ETRM841_FX_Deals_Restructuring/FO_PRE\\abo_st_gas_dtd.csv'
pattern = r'(?<=*\\\\)*'
re.findall(pattern,string)
The reasoning behind this is that the file name always comes after a double \, so I try to match any string that is preceded by text ending in \.
Nevertheless, when I run this code I get the following error:
error: nothing to repeat at position 4
What am I doing wrong?
Edit: The concrete output I am looking for is the string 'abo_st_gas_dtd.csv' as a match.
There's a couple of things going on:
You need to declare your string using the same r'string' notation as for the pattern; right now your string contains only a single backslash, since the first of the two escapes the second.
I'm not sure you're using * correctly. It means "repeat the immediately preceding item zero or more times", not "any string" (as it does, e.g., in the usual shell patterns). The first * in the parentheses has nothing preceding it, so the regex is invalid; hence the error you see. I think what you want is .*, i.e., any character repeated 0 or more times. Furthermore, it is not needed inside the parentheses. A more correct regexp would be r'(?<=\\\\).*':
import re
string = r'I:/Etrmtest/PZMALIo4/ETRM841_FX_Deals_Restructuring/FO_PRE\\abo_st_gas_dtd.csv'
pattern = r'(?<=\\\\).*'
re.findall(pattern,string)
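The effect of the r prefix on the string can be checked directly. This is a small sketch using a shortened, made-up path so the difference is easy to see.

```python
import re

plain = 'FO_PRE\\abo_st_gas_dtd.csv'   # normal literal: ONE backslash
raw = r'FO_PRE\\abo_st_gas_dtd.csv'    # raw literal: TWO backslashes

print(len(plain), len(raw))

# Lookbehind for two backslashes only succeeds on the raw string.
print(re.findall(r'(?<=\\\\).*', raw))
print(re.findall(r'(?<=\\\\).*', plain))

# For the normal literal, look behind for a single backslash instead.
print(re.findall(r'(?<=\\).*', plain))
```

If the path comes from a file or from user input rather than from a literal in the source code, it contains exactly the backslashes that were typed, so the single-backslash lookbehind is usually the one that applies.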
Your pattern is just a lookbehind, which, by itself, only asserts a position and never consumes any text. I would use this re.findall approach:
import re
string = 'I:/Etrmtest/PZMALIo4/ETRM841_FX_Deals_Restructuring/FO_PRE\\abo_st_gas_dtd.csv'
filename = re.findall(r'\\([^.]+\.\w+)$', string)[0]
print(filename) # abo_st_gas_dtd.csv
files = 'I:E\\trm.csvest/PZMALIo4\\ETRM841_FX_.csvDeals_Restructuring/FO_PRE\\abo_st_gas_dtd.csv'
counter = -1
my_files = []
for f in files:
    counter += 1
    if ord(f) == 92:  # backslash
        temp = files[counter+1:len(files)]
        temp_file = ""
        for f1 in temp:
            temp_file += f1
            # check whether the three characters after the '.' are "csv"
            if f1 == '.' and temp[len(temp_file):len(temp_file)+3] == "csv":
                my_files.append(temp_file + "csv")
                break
print(my_files)  # ['trm.csv', 'ETRM841_FX_.csv', 'abo_st_gas_dtd.csv']
I am working on a Python script that would go through a directory with a bunch of files and extract the strings that match a certain pattern.
More specifically, I'm trying to extract the values of serial number and a max-limit, and the lines look something like this:
#serial number = 642E0523D775
max-limit=50M/50M
I've got the script to go through the files, but I'm having an issue getting it to print the values I want. Instead of printing the values, it just gives me the 'Nothing found' output.
I think it probably has something to do with the regex I'm using, but I can't for the life of me figure out how to formulate it.
The script I've come up with so far:
import os
import re
#Where I'm searching
user_input = "/path/to/files/"
directory = os.listdir(user_input)
#What I'm looking for
searchstring = ['serial number', 'max-limit']
re_first = re.compile ('serial.\w.*')
re_second = re.compile ('max-limit=\w*.\w*')
#Regex combine
regex_list = [re_first, re_second]
#Looking
for fname in directory:
    if os.path.isfile(user_input + os.sep + fname):
        # Full path
        f = open(user_input + os.sep + fname, 'r')
        f_contents = f.read()
        content = fname + f_contents
        files = os.listdir(user_input)
        lines_seen = set()
        for f in files:
            print(f)
            if f not in lines_seen: # not a duplicate
                for regex in regex_list:
                    matches = re.findall(regex, content)
                    if matches != None:
                        for match in matches:
                            print(match)
                    else:
                        print('Nema')
        f.close()
Per the documentation, the re module's match() looks for "characters at the beginning of a string [that] match the regular expression pattern". Since you are prepending your file contents with the file name in the line:
content=fname + f_contents
and then matching your pattern against the content in the line:
result=re.match(regex, content)
there will never be a match.
Since you want to locate a match anywhere in string, use search() instead.
See also: search() vs match()
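The difference is easy to reproduce. The sketch below uses a hypothetical file name and the two sample lines from the question, glued together the same way the script does; the pattern is only an illustration.

```python
import re

# Hypothetical file name and contents, concatenated as in the question.
fname = "router1.cfg"
f_contents = "#serial number = 642E0523D775\nmax-limit=50M/50M\n"
content = fname + f_contents

pattern = re.compile(r"serial number = (\w+)")

at_start = pattern.match(content)    # only tries position 0
anywhere = pattern.search(content)   # scans the whole string

print(at_start)
print(anywhere.group(1))
```

match() returns None because the string starts with the file name, while search() finds the serial number further in.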
Edit
The pattern ^[\w&.\-]+$ provided would match neither serial number = 642E0523D775 as it contains a space (" "), nor max-limit=50M/50M as it contains a forward slash ("/"). Both also contain an equals sign ("=") which cannot be matched by your pattern.
Additionally, the character class in this pattern matches the backslash ("\"), so you may want to remove it (the dash ("-") does not need to be escaped when it is at the end of the character class).
A pattern to match both these strings as well could be:
^[\w&. \/=\-]+$
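A quick way to check both patterns against the two sample lines is sketched below. The variable names narrow and wide are just labels chosen for this example.

```python
import re

lines = ["serial number = 642E0523D775", "max-limit=50M/50M"]

narrow = re.compile(r"^[\w&.\-]+$")       # pattern from the question
wide = re.compile(r"^[\w&. \/=\-]+$")     # pattern with space, / and = added

for line in lines:
    # Each line is tested as a whole because of the ^ and $ anchors.
    print(line, bool(narrow.search(line)), bool(wide.search(line)))
```

The narrow pattern rejects both lines, and the wider one accepts both.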
I have phone numbers that might look like:
927-6847
611-6701p3715ou264-5435
869-6289fillemichelinemoisan
613-5000p4238soirou570-9639cel
and so on...
I want to identify and break them into:
9276847
6116701
2645435
8696289
6135000
5709639
String to store somewhere else:
611-6701p3715ou264-5435
869-6289fillemichelinemoisan
613-5000p4238soirou570-9639cel
When there is a p between digits, the number after the p is an extension: get the number before the p and save the whole string somewhere else.
When there is ou, another number starts after it.
When there is cel or any other random string, get the number part and save the whole string somewhere else.
Edit: This is what I have tried:
phNumber='928-4612cel'
if not re.match('^[\d]*$', phNumber):
res = re.match("(.*?)[a-z]",re.sub('[^\d\w]', '', phNumber)).group(1)
I am looking to handle cases and identify which of the strings had more characters before they were chopped off through regex
First, let me restate your request:
Find the numbers with the pattern "xxx-xxxx", where x is any digit from 0-9, and then save them in the form "xxxxxxx".
If there is any random string in the text, save the whole string.
import re
# make a list of all the strings we want to test
EXAMPLE = [
    "927-6847",
    "9276847",
    "927.6847",
    "611-6701p3715ou264-5435",
    "6116701p3715ou264-5435",
    "869-6289fillemichelinemoisan",
    "869.6289fillemichelinemoisan",
    "8696289fillemichelinemoisan",
    "613-5000p4238soirou570-9639cel",
]
def save_phone_number(test_string,output_file_name):
    number_to_save = []
    # regex pattern of "xxx-xxxx" where x is a digit
    regex_pattern = r"[0-9]{3}-[0-9]{4}"
    phone_numbers = re.findall(regex_pattern,test_string)
    # remove the "-"
    for item in phone_numbers:
        number_to_save.append(item.replace("-",""))
    # save to file
    with open(output_file_name,"a") as file_object:
        for item in number_to_save:
            file_object.write(item+"\n")
def save_somewhere_else(test_string,output_file_name):
    string_to_save = []
    # regex pattern that checks for any letter in the string
    # (.*) means any characters, of any length
    # [a-zA-Z] means one lowercase or uppercase letter
    regex_pattern = r"(.*)[a-zA-Z](.*)"
    if re.match(regex_pattern,test_string) is not None:
        with open(output_file_name,"a") as file_object:
            file_object.write(test_string+"\n")
if __name__ == "__main__":
    phone_number_file = "phone_number.txt"
    somewhere_file = "somewhere.txt"
    for each_string in EXAMPLE:
        save_phone_number(each_string,phone_number_file)
        save_somewhere_else(each_string,somewhere_file)
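For checking the logic without writing any files, here is a compact sketch of the same idea that returns its results instead. The function name split_entry is made up for this example, and it assumes numbers are always written as xxx-xxxx, as in the question's samples.

```python
import re

# Seven-digit number written as xxx-xxxx (assumed to be the only format).
NUMBER_RE = re.compile(r"(\d{3})-(\d{4})")

def split_entry(entry):
    """Return (numbers, keep_whole): the numbers found in the entry and
    whether the entry holds anything besides one plain number."""
    numbers = [first + last for first, last in NUMBER_RE.findall(entry)]
    keep_whole = re.fullmatch(r"\d{3}-\d{4}", entry) is None
    return numbers, keep_whole

samples = [
    "927-6847",
    "611-6701p3715ou264-5435",
    "869-6289fillemichelinemoisan",
    "613-5000p4238soirou570-9639cel",
]
for entry in samples:
    print(entry, split_entry(entry))
```

The extension after p is skipped simply because it is not in xxx-xxxx form; if extensions could also look like that, the pattern would need an extra check.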
For example, if the string is "abbacdeffel" and the pattern is "xyyx", to be replaced with "1234",
the result would change "abbacdeffel" into "1234cd1234l".
I have tried to think this through but couldn't come up with anything. At first I thought a dictionary might help, but nothing came to mind.
What you're looking to do can be accomplished with regex, more formally known as regular expressions. Regular expressions let you extract exactly what you want from a string.
In your case, you want to match the pattern abba, so use the following regex:
(\w)(\w)\2\1
https://regex101.com/r/hP8lA3/1
This captures two word characters in separate groups and uses backreferences to require that the second group repeats first, followed by the first group.
So implementing this in python code looks like this:
First, import the re module in Python
import re
Then, declare your variable
text = "abbacdeffel"
re.finditer returns an iterator, so you can loop over all the matches
matches = re.finditer(r"(\w)(\w)\2\1", text)
Go through all the matches that the regex found and replace each one with "1234"
for match in matches:
    text = text.replace(match.group(0), "1234")
For debugging:
print(text)
Complete Code:
import re
text = "abbacdeffel"
matches = re.finditer(r"(\w)(\w)\2\1", text)
for match in matches:
    text = text.replace(match.group(0), "1234")
print(text)
You can learn more about Regular Expressions here: https://regexone.com/references/python
New version of code (there was a bug):
def replace_with_pattern(pattern, line, replace):
    from collections import OrderedDict
    set_of_chars_in_pattern = set(pattern)
    indice_start_pattern = 0
    output_line = ""
    while indice_start_pattern < len(line):
        potential_end_pattern = indice_start_pattern + len(pattern)
        subline = line[indice_start_pattern:potential_end_pattern]
        print(subline)
        set_of_chars_in_subline = set(subline)
        if len(set_of_chars_in_subline) != len(set_of_chars_in_pattern):
            output_line += line[indice_start_pattern]
            indice_start_pattern += 1
            continue
        map_of_chars = OrderedDict()
        liste_of_chars_in_pattern = []
        for char in pattern:
            if char not in liste_of_chars_in_pattern:
                liste_of_chars_in_pattern.append(char)
        print(liste_of_chars_in_pattern)
        for subline_char in subline:
            if subline_char not in map_of_chars.values():
                map_of_chars[liste_of_chars_in_pattern.pop(0)] = subline_char
        print(map_of_chars)
        wanted_subline = ""
        for char_of_pattern in pattern:
            wanted_subline += map_of_chars[char_of_pattern]
        print("wanted_subline =" + wanted_subline)
        if subline == wanted_subline:
            output_line += replace
            indice_start_pattern += len(pattern)
        else:
            output_line += line[indice_start_pattern]
            indice_start_pattern += 1
    return output_line
Some tests:
test1 = replace_with_pattern("xyyx", "abbacdeffel", "1234")
test2 = replace_with_pattern("abbacdeffel", "abbacdeffel", "1234")
print(test1, test2)
=> 1234cd1234l 1234
Here goes my attempt:
([a-zA-Z])(?!\1)([a-zA-Z])\2\1
Assuming you want to match letters only (for other ranges, change both [a-zA-Z] as appropriate), we have:
([a-zA-Z])
Find the first character, and note it so we can later refer to it with \1.
(?!\1)
Check to see if the next character is not the same as the first, but without advancing the search pointer. This is to prevent aaaa being accepted. If aaaa is OK, just remove this subexpression.
([a-zA-Z])
Find the second character, and note it so we can later refer to it with \2.
\2\1
Now find the second again, then the first again, so we match the full abba pattern.
And finally, to do a replace operation, the full command would be:
import re
re.sub(r'([a-zA-Z])(?!\1)([a-zA-Z])\2\1',
'1234',
'abbacdeffelzzzz')
The r at the start of the regex pattern is to prevent Python processing the backslashes. Without it, you would need to do:
import re
re.sub('([a-zA-Z])(?!\\1)([a-zA-Z])\\2\\1',
'1234',
'abbacdeffelzzzz')
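To see what the (?!\1) lookahead changes in practice, the two variants can be compared on the same input. This is only a side-by-side sketch; the names strict and loose are labels chosen here.

```python
import re

strict = r'([a-zA-Z])(?!\1)([a-zA-Z])\2\1'   # second letter must differ
loose = r'([a-zA-Z])([a-zA-Z])\2\1'          # second letter may be the same

text = 'abbacdeffelzzzz'

with_lookahead = re.sub(strict, '1234', text)
without_lookahead = re.sub(loose, '1234', text)

print(with_lookahead)      # zzzz is left alone
print(without_lookahead)   # zzzz is replaced as well
```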
Now, I see the spec has expanded to a user-defined pattern; here is some code that will build that pattern:
import re
def make_re(pattern, charset):
    result = ''
    seen = []
    for c in pattern:
        # Is this a letter we've seen before?
        if c in seen:
            # Yes, so we want to match the captured pattern
            result += '\\' + str(seen.index(c)+1)
        else:
            # No, so match a new character from the charset,
            # but first exclude already matched characters
            for i in range(len(seen)):
                result += '(?!\\' + str(i + 1) + ')'
            result += '(' + charset + ')'
            # Note we have seen this letter
            seen.append(c)
    return result
print(re.sub(make_re('xzzx', '\\d'), 'abba', 'abba1221b99999889'))
print(re.sub(make_re('xyzxyz', '[a-z]'), '123123', 'abcabc zyxzyyx zyzzyz'))
Outputs:
abbaabbab9999abba
123123 zyxzyyx zyzzyz
I have a file with lines of this form:
ClientsName(0) = "SUPERBRAND": ClientsName(1) = "GREATSTUFF": cClientsNames.Add Key:="SUPER", Item:=ClientsName
and I would like to capture the names in quotes "" after ClientsName(0) = and ClientsName(1) =.
So far, I came up with this code
import re
f = open('corrected_clients_data.txt', 'r')
result = ''
re_name = "ClientsName\(0\) = (.*)"
for line in f:
    name = re.search(line, re_name)
    print (name)
which is returning None at each line...
Two sources of error can be: the backslashes and the capture sequence (.*)...
You can do that more easily using re.findall and using \d instead of 0 to make it more general:
>>> import re
>>> s = '''ClientsName(0) = "SUPERBRAND": ClientsName(1) = "GREATSTUFF": cClientsNames.Add Key:="SUPER", Item:=ClientsName'''
>>> print(re.findall(r'ClientsName\(\d\) = "([^"]*)"', s))
['SUPERBRAND', 'GREATSTUFF']
Another thing you must note is that your order of arguments to search() or findall() is wrong. It should be as follows: re.search(pattern, string)
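With the arguments in the right order, the search from the question works once the quotes are part of the pattern. The sketch below also captures the index so each name can be looked up by number; the names dictionary is an addition made for illustration.

```python
import re

line = ('ClientsName(0) = "SUPERBRAND": ClientsName(1) = "GREATSTUFF": '
        'cClientsNames.Add Key:="SUPER", Item:=ClientsName')

# Pattern first, string second.
first = re.search(r'ClientsName\(0\) = "([^"]*)"', line)
print(first.group(1))

# Capture the index too and build an index -> name lookup.
pairs = re.findall(r'ClientsName\((\d+)\) = "([^"]*)"', line)
names = {int(index): name for index, name in pairs}
print(names)
```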
You can use re.findall and just take the first two matches:
>>> s = '''ClientsName(0) = "SUPERBRAND": ClientsName(1) = "GREATSTUFF": cClientsNames.Add Key:="SUPER", Item:=ClientsName'''
>>> re.findall(r'\"([^"]+)\"' , s)[:2]
['SUPERBRAND', 'GREATSTUFF']
Try this:
import re
text_file = open("corrected_clients_data.txt", "r")
text = text_file.read()
matches = re.findall(r'\"(.+?)\"', text)
text_file.close()
Note that the question mark (?) makes the quantifier non-greedy, so the match stops at the first closing double quote it encounters.
Hope this helps.
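The effect of the question mark is easiest to see by running both versions on the same text. The sample string here is a shortened, made-up line in the same style as the question's data.

```python
import re

s = 'a = "X": b = "Y"'

greedy = re.findall(r'"(.+)"', s)    # runs on to the LAST double quote
lazy = re.findall(r'"(.+?)"', s)     # stops at the NEXT double quote

print(greedy)
print(lazy)
```

The greedy version returns a single match spanning both quoted values, while the non-greedy one returns each value separately.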
Use a lookbehind to get the values of ClientsName(0) and ClientsName(1) with the re.findall function:
>>> import re
>>> str = '''ClientsName(0) = "SUPERBRAND": ClientsName(1) = "GREATSTUFF": cClientsNames.Add Key:="SUPER", Item:=ClientsName'''
>>> m = re.findall(r'(?<=ClientsName\(0\) = \")[^"]*|(?<=ClientsName\(1\) = \")[^"]*', str)
>>> m
['SUPERBRAND', 'GREATSTUFF']
Explanation:
(?<=ClientsName\(0\) = \") A positive lookbehind places the match position just after the string ClientsName(0) = "
[^"]* Then matches any character other than " zero or more times, so it matches the first value, i.e. SUPERBRAND
| The alternation (logical OR) operator combines the two regexes.
(?<=ClientsName\(1\) = \")[^"]* Matches everything just after the string ClientsName(1) = " up to the next ", so it matches the second value, i.e. GREATSTUFF