Regex Python capture string in quotes - python

I have a file with lines of this form:
ClientsName(0) = "SUPERBRAND": ClientsName(1) = "GREATSTUFF": cClientsNames.Add Key:="SUPER", Item:=ClientsName
and I would like to capture the names in quotes "" after ClientsName(0) = and ClientsName(1) =.
So far, I came up with this code
import re
f = open('corrected_clients_data.txt', 'r')
result = ''
re_name = "ClientsName\(0\) = (.*)"
for line in f:
    name = re.search(line, re_name)
    print (name)
which is returning None at each line...
Two possible sources of error are the backslashes and the capture sequence (.*)...

You can do that more easily using re.findall and using \d instead of 0 to make it more general:
import re
s = '''ClientsName(0) = "SUPERBRAND": ClientsName(1) = "GREATSTUFF": cClientsNames.Add Key:="SUPER", Item:=ClientsName'''
>>> print(re.findall(r'ClientsName\(\d\) = "([^"]*)"', s))
['SUPERBRAND', 'GREATSTUFF']
Also note that the order of arguments in your call to search() is wrong. The pattern comes first, for search() and findall() alike: re.search(pattern, string)
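Putting both fixes together in the asker's line-by-line loop might look like the sketch below. It uses io.StringIO as a stand-in for corrected_clients_data.txt so that it runs without the file, and \d+ instead of \d is an assumption to allow indices above 9.

```python
import io
import re

# Stand-in for open('corrected_clients_data.txt', 'r'); the line is the sample from the question.
fake_file = io.StringIO(
    'ClientsName(0) = "SUPERBRAND": ClientsName(1) = "GREATSTUFF": '
    'cClientsNames.Add Key:="SUPER", Item:=ClientsName\n'
)

# Pattern first, string second. \d+ also accepts multi-digit indices (assumption).
re_name = re.compile(r'ClientsName\(\d+\) = "([^"]*)"')

names = []
for line in fake_file:
    # findall returns the contents of the single capture group for every match
    names.extend(re_name.findall(line))

print(names)
```

With a real file, replacing fake_file with the opened file object should be the only change needed.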

You can use re.findall and just take the first two matches:
>>> s = '''ClientsName(0) = "SUPERBRAND": ClientsName(1) = "GREATSTUFF": cClientsNames.Add Key:="SUPER", Item:=ClientsName'''
>>> re.findall(r'\"([^"]+)\"' , s)[:2]
['SUPERBRAND', 'GREATSTUFF']

Try this:
import re
text_file = open("corrected_clients_data.txt", "r")
text = text_file.read()
matches = re.findall(r'\"(.+?)\"', text)
text_file.close()
The question mark (?) after .+ makes the match non-greedy, so each match stops at the first closing double quote it encounters.
Hope this is helpful.
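To see what the question mark changes, here is a small comparison of the greedy and non-greedy forms on a shortened version of the sample line (the variable names are only for illustration):

```python
import re

text = 'ClientsName(0) = "SUPERBRAND": ClientsName(1) = "GREATSTUFF"'

# Greedy: .+ runs from the first quote to the last quote on the line.
greedy = re.findall(r'"(.+)"', text)

# Non-greedy: .+? stops at the nearest closing quote.
lazy = re.findall(r'"(.+?)"', text)

print(greedy)
print(lazy)
```

The greedy version returns a single long match spanning both names, while the non-greedy one returns each quoted name separately.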

Use a lookbehind to get the value of ClientsName(0) and ClientsName(1) through re.findall function,
>>> import re
>>> str = '''ClientsName(0) = "SUPERBRAND": ClientsName(1) = "GREATSTUFF": cClientsNames.Add Key:="SUPER", Item:=ClientsName'''
>>> m = re.findall(r'(?<=ClientsName\(0\) = \")[^"]*|(?<=ClientsName\(1\) = \")[^"]*', str)
>>> m
['SUPERBRAND', 'GREATSTUFF']
Explanation:
(?<=ClientsName\(0\) = \") A positive lookbehind places the match position just after the string ClientsName(0) = "
[^"]* Matches any character other than " zero or more times, so it matches the first value, SUPERBRAND.
| Alternation (logical OR), used to combine the two regexes.
(?<=ClientsName\(1\) = \")[^"]* Matches everything just after the string ClientsName(1) = " up to the next ", which gives the second value, GREATSTUFF.
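One caveat worth knowing: Python's re module only accepts fixed-width lookbehinds, which is why the pattern above spells out each index separately. The sketch below (with a made-up index 12) shows that a lookbehind containing \d+ is rejected, whereas a capture group handles any index.

```python
import re

# Index 12 is hypothetical, used to show a multi-digit index.
s = 'ClientsName(0) = "SUPERBRAND": ClientsName(12) = "GREATSTUFF"'

# A variable-width lookbehind is rejected when the pattern is compiled.
try:
    re.compile(r'(?<=ClientsName\(\d+\) = ")[^"]*')
    lookbehind_ok = True
except re.error:
    lookbehind_ok = False

# A capture group has no such restriction.
names = re.findall(r'ClientsName\(\d+\) = "([^"]*)"', s)

print(lookbehind_ok, names)
```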

Related

How to output only the match string using Python?

I want to match a string and then print the string that matched.
I need to find the string mapping = "C111" among the items of the list below.
Here is what I tried. I can find the match, but I cannot print the whole string that matched.
import re
AllString = ["123A","B456","AGHF\C111\B321","3FEW/D654"]
print(type(AllString))
for str in AllString:
    mapping = "C111"
    findid = [re.match(mapping, str)]
    for f in findid:
        if f is not None:
            print(f)
The output is like this:
<re.Match object; span=(0, 4), match='C111'>
My expected result is the whole string, "AGHF\C111\B321".
Can anyone help, please? Thank you so much.
import re
AllString = ["123A","B456","C111\B321","3FEW/D654"]
print(type(AllString))
for str in AllString:
    mapping = "C111"
    findid = [re.match(mapping, str)]
    for f in findid:
        if f is not None:
            print(f.string) # output: C111\B321 and It makes sense
OR:
import re
AllString = ["123A","B456","C111\B321","3FEW/D654"]
print(type(AllString))
for str in AllString:
    mapping = "C111"
    findid = [re.match(mapping, str)]
    for f in findid:
        if f is not None:
            print(mapping) # It meets your requirement but looks weird
NEW UPDATED:
import re
AllString = ["123A","B456","AGHF\C111\B321","3FEW/D654"]
print(type(AllString))
for str in AllString:
    mapping = r".+C111.+" # method 'match' should be used with regex
    findid = [re.match(mapping, str)]
    for f in findid:
        if f is not None:
            print(f.string)
One problem with the code is re.match must match at the start of the string. You could use re.search instead, but there is no need for a regular expression in this case. Use in:
strings = ['123A','B456','AGHF\C111\B321','3FEW/D654']
for s in strings:
    if 'C111' in s:
        print(s)
AGHF\C111\B321
If you need to match the exact alphanumeric sequence with no extra letters or digits around it, use re.search with \b (word boundaries):
import re
strings = ["123A","B456","C111\B321","3FEW/ABC111DEF/D654","ABC\C111/DEF"]
for s in strings:
    if re.search(r'\bC111\b',s):
        print(s)
C111\B321
ABC\C111/DEF
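To make the difference between the three approaches concrete, here is a short side-by-side sketch on the list from the question. The strings are written as raw strings so the backslashes stay literal; the variable names are only for illustration.

```python
import re

# Raw strings keep the backslashes literal.
strings = [r"123A", r"B456", r"AGHF\C111\B321", r"3FEW/D654"]
target = "C111"

# re.match only succeeds if the pattern matches at the very start of the string.
matched_at_start = [s for s in strings if re.match(target, s)]

# re.search scans the whole string for the first match.
matched_anywhere = [s for s in strings if re.search(target, s)]

# A plain substring test needs no regex at all.
contains = [s for s in strings if target in s]

print(matched_at_start)
print(matched_anywhere)
print(contains)
```

Only re.match comes back empty here, because C111 is not at the start of any item.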

Python - Regex - combination of letters and numbers (undefined length)

I am trying to get a File-ID from a text file. In the above example the filename is d735023ds1.htm, which I want to extract in order to build another URL. The filenames differ in length, however, so I need a general regular expression that covers all possibilities.
Example filenames
d804478ds1a.htm.
d618448ds1a.htm.
d618448.htm
My code
for cik in leftover_cik_list:
    r = requests.get(filing.url)
    content = str(r.content)
    fileID = None
    for line in content.split("\n"):
        if fileID == None:
            fileIDIndex = line.find("<FILENAME>")
            if fileIDIndex != -1:
                trimmedText = line[fileIDIndex:]
                result = RegEx.search(r"^[\w\d.htm]*$", trimmedText)
                if result:
                    fileID = result.group()
                    print ("fileID",fileID)
    document_link = "https://www.sec.gov/Archives/edgar/data/{0}/{1}/{2}.htm".format(cik, accession_number, fileID)
    print ("Document Link to S-1:", document_link)
import re
...
result = re.search(r'^d\d{1,6}.+\.htm$', trimmedText)
if result:
    fileID = result.group()
^d = Start with a d
\d{1,6} = Look for 1-6 digits; if there could be any number of digits, replace with \d+
.+ = One or more of any character
\.htm$ = End in .htm
You should try re.match(), which looks for the pattern at the beginning of the input string. Also, your regex needs a backslash before the ., because an unescaped dot means "any character" in a regex.
import re
result = re.match(r'[\w]+\.htm', trimmedText)
Try this regex:
import re
files = [
"d804478ds1a.htm",
"d618448ds1a.htm",
"d618448.htm"
]
for f in files:
    match = re.search(r"d\w+\.htm", f)
    print(match.group())
d804478ds1a.htm
d618448ds1a.htm
d618448.htm
The assumptions in the above are that the file name starts with a d, ends with .htm and contains only letters, digits and underscores.
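Since the question's URL template appends .htm itself, it may be convenient to capture the name without the extension. The sketch below assumes the file name directly follows a <FILENAME> tag on the same line; extract_file_id and the sample lines are hypothetical, not part of the original code.

```python
import re

# Hypothetical helper; assumes the ID directly follows a <FILENAME> tag.
FILENAME_RE = re.compile(r"<FILENAME>(\w+)\.htm")

def extract_file_id(line):
    """Return the file name without .htm, or None if the line has no match."""
    match = FILENAME_RE.search(line)
    return match.group(1) if match else None

# Sample lines are made up for illustration.
lines = ["<TYPE>S-1", "<FILENAME>d735023ds1.htm", "<DESCRIPTION>FORM S-1"]
file_ids = [extract_file_id(line) for line in lines]

print(file_ids)
```

Using re.search with the tag inside the pattern avoids the anchoring problem, since the trimmed text starts with <FILENAME> and not with the file name.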

How to remove characters from a str in python?

I have the following str I want to delete characters.
For example:
from str1 = "A.B.1912/2013(H-0)02322"
to 1912/2013
from str2 = "I.M.1591/2017(I-299)17529"
to 1591/2017
from str3 = "I.M.C.15/2017(I-112)17529"
to 15/2017
I'm trying this way, but I need to remove the rest from ( to the right
newStr = str1.strip('A.B.')
'1912/2013(H-0)02322'
For the moment I'm doing it with slice notation
str1 = "A.B.1912/2013(H-0)02322"
str1 = str1[4:13]
'1912/2013'
But not all have the same length.
Any ideas or suggestions?
With some (modest) assumptions about the format of the strings, here's a solution without using regex:
First split the string on the ( character, keeping the substring on the left:
left = str1.split( '(' )[0] # "A.B.1912/2013"
Then, split the result on the last . (i.e. split from the right just once), keeping the second component:
cut = left.rsplit('.', 1)[1] # "1912/2013"
or combining the two steps into a function:
def extract(s):
    return s.split('(')[0].rsplit('.', 1)[1]
Use a regex instead:
import re
regex = re.compile(r'\d+/\d+')
print(regex.search(str1).group())
print(regex.search(str2).group())
print(regex.search(str3).group())
Output:
1912/2013
1591/2017
15/2017
We can try using re.sub here with a capture group:
import re
str1 = "A.B.1912/2013(H-0)02322"
output = re.sub(r'.*\b(\d+/\d+)\b.*', '\\1', str1)
print(output)
1912/2013
You can use a regular expression to solve this problem.
import re
pattern = r'\d+/\d+'
str1 = "A.B.1912/2013(H-0)02322"
str2 = "I.M.1591/2017(I-299)17529"
str3 = "I.M.C.15/2017(I-112)17529"
print(*re.findall(pattern, str1))
print(*re.findall(pattern, str2))
print(*re.findall(pattern, str3))
Output:
1912/2013
1591/2017
15/2017
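If some inputs might not contain a digits/digits part at all, calling .group() on the result of search() would raise an AttributeError. A minimal sketch of a helper that returns None in that case (extract_number is a made-up name, and the last sample is invented to show the no-match path):

```python
import re

NUMBER_PAIR = re.compile(r"\d+/\d+")

def extract_number(s):
    """Return the first 'digits/digits' run in s, or None if there is none."""
    m = NUMBER_PAIR.search(s)
    return m.group() if m else None

samples = [
    "A.B.1912/2013(H-0)02322",
    "I.M.1591/2017(I-299)17529",
    "I.M.C.15/2017(I-112)17529",
    "no number here",  # invented input without a match
]
results = [extract_number(s) for s in samples]

print(results)
```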

How can I replace part of a string with a pattern

For example, if the string is "abbacdeffel" and the pattern is "xyyx", each occurrence of the pattern should be replaced with "1234",
so "abbacdeffel" would become "1234cd1234l".
I have tried to think this through but I couldn't come up with anything. At first I thought a dictionary might help, but nothing came to mind.
What you're looking to do can be accomplished with regular expressions (regex), which let you match and extract specific parts of a string.
In your case, you want to match substrings of the form abba, so use the following regex:
(\w)(\w)\2\1
https://regex101.com/r/hP8lA3/1
This captures two word characters in separate groups, then uses backreferences to require the second group again, followed by the first.
So implementing this in python code looks like this:
First, import the regex module in python
import re
Then, declare your variable
text = "abbacdeffel"
The re.finditer returns an iterable so you can iterate through all the groups
matches = re.finditer(r"(\w)(\w)\2\1", text)
Go through all the matches that the regexp found and replace the pattern with "1234"
for match in matches:
    text = text.replace(match.group(0), "1234")
For debugging:
print(text)
Complete Code:
import re
text = "abbacdeffel"
matches = re.finditer(r"(\w)(\w)\2\1", text)
for match in matches:
    text = text.replace(match.group(0), "1234")
print(text)
You can learn more about Regular Expressions here: https://regexone.com/references/python
New version of code (there was a bug):
def replace_with_pattern(pattern, line, replace):
    from collections import OrderedDict
    set_of_chars_in_pattern = set(pattern)
    indice_start_pattern = 0
    output_line = ""
    while indice_start_pattern < len(line):
        potential_end_pattern = indice_start_pattern + len(pattern)
        subline = line[indice_start_pattern:potential_end_pattern]
        print(subline)
        set_of_chars_in_subline = set(subline)
        if len(set_of_chars_in_subline) != len(set_of_chars_in_pattern):
            output_line += line[indice_start_pattern]
            indice_start_pattern += 1
            continue
        map_of_chars = OrderedDict()
        liste_of_chars_in_pattern = []
        for char in pattern:
            if char not in liste_of_chars_in_pattern:
                liste_of_chars_in_pattern.append(char)
        print(liste_of_chars_in_pattern)
        for subline_char in subline:
            if subline_char not in map_of_chars.values():
                map_of_chars[liste_of_chars_in_pattern.pop(0)] = subline_char
        print(map_of_chars)
        wanted_subline = ""
        for char_of_pattern in pattern:
            wanted_subline += map_of_chars[char_of_pattern]
        print("wanted_subline =" + wanted_subline)
        if subline == wanted_subline:
            output_line += replace
            indice_start_pattern += len(pattern)
        else:
            output_line += line[indice_start_pattern]
            indice_start_pattern += 1
    return output_line
Some tests:
test1 = replace_with_pattern("xyyx", "abbacdeffel", "1234")
test2 = replace_with_pattern("abbacdeffel", "abbacdeffel", "1234")
print(test1, test2)
=> 1234cd1234l 1234
Here goes my attempt:
([a-zA-Z])(?!\1)([a-zA-Z])\2\1
Assuming you want to match letters only (for other ranges, change both [a-zA-Z] as appropriate), we have:
([a-zA-Z])
Find the first character, and note it so we can later refer to it with \1.
(?!\1)
Check to see if the next character is not the same as the first, but without advancing the search pointer. This is to prevent aaaa being accepted. If aaaa is OK, just remove this subexpression.
([a-zA-Z])
Find the second character, and note it so we can later refer to it with \2.
\2\1
Now find the second again, then the first again, so we match the full abba pattern.
And finally, to do a replace operation, the full command would be:
import re
re.sub(r'([a-zA-Z])(?!\1)([a-zA-Z])\2\1',
       '1234',
       'abbacdeffelzzzz')
The r at the start of the regex pattern is to prevent Python processing the backslashes. Without it, you would need to do:
import re
re.sub('([a-zA-Z])(?!\\1)([a-zA-Z])\\2\\1',
       '1234',
       'abbacdeffelzzzz')
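As a quick check of the two points above, the following sketch runs the substitution with and without the (?!\1) lookahead on the same input, and confirms that the raw and the escaped spellings denote the same pattern text. The variable names are only for illustration.

```python
import re

text = 'abbacdeffelzzzz'

# With the lookahead, the run 'zzzz' is left alone.
with_lookahead = re.sub(r'([a-zA-Z])(?!\1)([a-zA-Z])\2\1', '1234', text)

# Without it, 'zzzz' also counts as an xyyx match (x and y are both 'z').
without_lookahead = re.sub(r'([a-zA-Z])([a-zA-Z])\2\1', '1234', text)

# The raw string and the doubled-backslash string are the same characters.
same_pattern = r'(?!\1)\2\1' == '(?!\\1)\\2\\1'

print(with_lookahead)
print(without_lookahead)
print(same_pattern)
```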
Now, I see the spec has expanded to a user-defined pattern; here is some code that will build that pattern:
import re
def make_re(pattern, charset):
    result = ''
    seen = []
    for c in pattern:
        # Is this a letter we've seen before?
        if c in seen:
            # Yes, so we want to match the captured pattern
            result += '\\' + str(seen.index(c)+1)
        else:
            # No, so match a new character from the charset,
            # but first exclude already matched characters
            for i in range(len(seen)):
                result += '(?!\\' + str(i + 1) + ')'
            result += '(' + charset + ')'
            # Note we have seen this letter
            seen.append(c)
    return result
print(re.sub(make_re('xzzx', '\\d'), 'abba', 'abba1221b99999889'))
print(re.sub(make_re('xyzxyz', '[a-z]'), '123123', 'abcabc zyxzyyx zyzzyz'))
Outputs:
abbaabbab9999abba
123123 zyxzyyx zyzzyz

How to parse values appear after the same string in python?

I have an input text like this (the actual text file also contains tons of garbage characters surrounding these two strings):
(random_garbage_char_here)**value=xxx**;(random_garbage_char_here)**value=yyy**;(random_garbage_char_here)
I am trying to parse the text to store something like this:
value1="xxx" and value2="yyy".
I wrote python code as follows:
value1_start = content.find('value')
value1_end = content.find(';', value1_start)
value2_start = content.find('value')
value2_end = content.find(';', value2_start)
print(content[value1_start:value1_end])
print(content[value2_start:value2_end])
But it always returns:
value=xxx
value=xxx
Could anyone tell me how can I parse the text so that the output is:
value=xxx
value=yyy
Use a regex approach:
re.findall(r'\bvalue=[^;]*', s)
Or - if value can be any 1+ word (letter/digit/underscore) chars:
re.findall(r'\b\w+=[^;]*', s)
Details:
\b - word boundary
value= - a literal char sequence value=
[^;]* - zero or more chars other than ;.
See the Python demo:
import re
rx = re.compile(r"\bvalue=[^;]*")
s = "$%$%&^(&value=xxx;$%^$%^$&^%^*value=yyy;%$#^%"
res = rx.findall(s)
print(res)
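Since the question asks for the values to be stored separately, a capture group around the part after value= may be closer to what is wanted. This is a sketch on the sample string from this answer; the names value1 and value2 come from the question, and the unpacking assumes exactly two matches.

```python
import re

s = "$%$%&^(&value=xxx;$%^$%^$&^%^*value=yyy;%$#^%"

# The group captures only what follows 'value=' up to the next ';'.
values = re.findall(r"\bvalue=([^;]*)", s)

# Assumes exactly two matches, as in the question.
value1, value2 = values

print(value1, value2)
```

Note that with the question's literal **value=xxx**; format the captured text would still include the trailing asterisks, so the character class would need adjusting for that input.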
Use regex to filter the data you want from the "junk characters":
>>> import re
>>> _input = '#4#5%value=xxx38u952035983049;3^&^*(^%$3value=yyy#%$#^&*^%;$#%$#^'
>>> matches = re.findall(r'[a-zA-Z0-9]+=[a-zA-Z0-9]+', _input)
>>> matches
['value=xxx', 'value=yyy']
>>> for match in matches:
...     print(match)
...
value=xxx
value=yyy
Summary of the regular expression:
[a-zA-Z0-9]+: One or more alphanumeric characters
=: literal equal sign
[a-zA-Z0-9]+: One or more alphanumeric characters
For this input:
content = '(random_garbage_char_here)**value=xxx**;(random_garbage_char_here)**value=yyy**;(random_garbage_char_here)'
use a simple regex and manually strip off the first and last two characters:
import re
values = [x[2:-2] for x in re.findall(r'\*\*value=.*?\*\*', content)]
for value in values:
    print(value)
Output:
value=xxx
value=yyy
Here the assumption is that there are always two leading and two trailing * as in **value=xxx**.
You already have good answers based on the re module. That would certainly be the simplest way.
If for any reason (performance?) you prefer to use str methods, that is indeed possible. But you must start the search for the second string past the end of the first one:
value2_start = content.find('value', value1_end)
value2_end = content.find(';', value2_start)
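The same idea extends to any number of occurrences by looping and always resuming the search where the previous match ended. A minimal sketch, where find_values, its parameters, and the sample content are all made up for illustration:

```python
def find_values(content, key="value", terminator=";"):
    """Collect every 'key...' run up to (not including) the next terminator."""
    found = []
    pos = 0
    while True:
        start = content.find(key, pos)
        if start == -1:
            break  # no more occurrences
        end = content.find(terminator, start)
        if end == -1:
            end = len(content)  # no terminator left: take the rest of the text
        found.append(content[start:end])
        # Resume past this match; the max() guards against not advancing.
        pos = max(end, start + len(key))
    return found

content = "(garbage)value=xxx;(garbage)value=yyy;(garbage)"
result = find_values(content)

print(result)
```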
