I'm trying to extract the strings from a file that start with ${ and end with } using Python. I am using the code below, but I don't get the expected result.
My input file looks like this:
Click ${SWIFT_TAB}
Click ${SEARCH_SWIFT_CODE}
and I want to get a list as below:
${SWIFT_TAB}
${SEARCH_SWIFT_CODE}
My current code looks like this:
def findStringFromFile(file):
    import os,re
    with open(file) as f:
        ans = []
        for line in f:
            matches = re.findall(r'\b\${\S+}\b', line)
            ans.extend(matches)
    print (ans)
I am expecting a list of strings that start with ${ and end with }, but all I currently get is an empty list.
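The problem is that your regexp doesn't match the strings you want to extract. Specifically, there are two things to change:
{ and } are regexp metacharacters, just like $, and are best escaped when you want to match them literally. (Python's re module happens to treat a { that does not start a valid {m,n} quantifier as a literal character, so this point is about robustness rather than the cause of the empty list.)
\b matches a word boundary, i.e. a position between a "word character" (a letter, a number or an underscore) and a "non-word character" (anything else) or the beginning/end of the string. It does not match between, say, a space and $, nor between } and the end of the line, and that is why nothing is found.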
To fix these issues, change your line:
matches = re.findall(r'\b\${\S+}\b', line)
to:
matches = re.findall(r'\$\{\S+\}', line)
and it should work.
See the Python regular expressions documentation for more details.
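As a minimal sketch of the fix, the snippet below runs a corrected pattern over the sample lines from the question. The helper name find_variables is hypothetical, and the pattern uses [^}]+ in place of \S+ (a deliberate swap, not part of the answer above) so that two variables on one line are returned separately instead of as one greedy match.

```python
import re

# Escaped braces and no \b anchors; [^}]+ stops at the first closing brace.
VARIABLE_RE = re.compile(r'\$\{[^}]+\}')

def find_variables(lines):
    """Return every ${...} token found in an iterable of text lines."""
    found = []
    for line in lines:
        found.extend(VARIABLE_RE.findall(line))
    return found

sample = [
    "Click ${SWIFT_TAB}\n",
    "Click ${SEARCH_SWIFT_CODE}\n",
    "Copy ${A} to ${B}\n",  # extra line, not from the question
]
print(find_variables(sample))
# ['${SWIFT_TAB}', '${SEARCH_SWIFT_CODE}', '${A}', '${B}']
```

Because an open file object is also an iterable of lines, the same helper can be called as find_variables(f) inside a with open(...) block.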
I want to change 'datasets/4/image-3.jpg' to 'datasets/4/image-1.jpg'. Is there a way to do it with re.sub, or should I try something else like .split("/")[-1]? I tried the code below, but it gives 'datasets/1/image-1.jpg', and I want to keep /4/ rather than get /1/.
My Code
import re
employee_name = 'datasets/4/image-3.jpg'
label = re.sub('[0-9]', "1", employee_name)
print(label)
Output
datasets/1/image-1.jpg
Expected Input
datasets/4/image-3.jpg
Expected Output
datasets/4/image-1.jpg
You can use
re.sub(r'-\d+\.jpg', '-1.jpg', text)
Note: If the match is always at the end of string, append the $ anchor at the end of the regex. If there can be other text after and you need to make sure there are no word chars after jpg, add a word boundary, \b. If you want to match other extensions, use a group, e.g. (?:jpe?g|png).
Regex details
-\d+ - a hyphen followed by one or more digits
\.jpg - .jpg string.
See the regex demo (code generator link).
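The variations mentioned in the note can be combined into one small sketch. The function name renumber_image and its new_index parameter are hypothetical; the pattern assumes the number to replace always sits right before the extension at the end of the string.

```python
import re

# Hyphen, digits, then a captured extension, anchored at the end of the string.
IMAGE_INDEX_RE = re.compile(r'-\d+(\.(?:jpe?g|png))$')

def renumber_image(path, new_index=1):
    """Replace the trailing image number, leaving directory digits alone."""
    # A function replacement avoids any ambiguity between the new number
    # and the \1 backreference in a replacement string.
    return IMAGE_INDEX_RE.sub(lambda m: f"-{new_index}{m.group(1)}", path)

print(renumber_image('datasets/4/image-3.jpg'))       # datasets/4/image-1.jpg
print(renumber_image('datasets/4/image-12.png', 7))   # datasets/4/image-7.png
```

Paths that do not end in a matching extension are returned unchanged.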
I am looking to remove the last statement in a rule used for parsing. The statements are encapsulated with # characters, and the rule itself is encapsulated with pattern tags.
What I want to do is just remove the last rule statement.
My current idea to achieve this goes like this:
Opens the rules file, saves each line as an element into a list.
Selects the line that contains the correct rule-id and then saves the rule pattern as a new string.
Reverses the saved rule pattern.
Removes the last rule statement.
Re-reverses the rule pattern.
Adds in the trailing pattern tag.
So the input will look like:
<pattern>#this is a statement# #this is also a statement#</pattern>
Output will look like:
<pattern>#this is a statement# </pattern>
My current attempt goes like this:
with open(rules) as f:
    lines = f.readlines()
    string = ""
    for line in lines:
        if ruleid in line:
            position = lines.index(line)
            string = lines[position + 2] # the rule pattern will be two lines down
                                         # from where the rule-id is located, hence
                                         # the position + 2
def reversed_string(a_string): #reverses the string
    return a_string[::-1]
def remove_at(x): #removes everything until the # character
    return re.sub('^.*?#','',x)
print(reversed_string(remove_at(remove_at(reversed_string(string)))))
This will reverse the string but not remove the last rule statement once it has been reversed.
Running just the reversed_string() function will successfully reverse the string, but trying to run that same string through the remove_at() function will not work at all.
But, if you manually create the input string (to the same rule pattern), and forgo opening and grabbing the rule pattern, it will successfully remove the trailing rule statement.
The successful code looks like this:
string = '<pattern>#this is a statement# #this is also a statement#</pattern>'
def reversed_string(a_string): #reverses the string
    return a_string[::-1]
def remove_at(x): #removes everything until the # character
    return re.sub('^.*?#','',x)
print(reversed_string(remove_at(remove_at(reversed_string(string)))))
As well, how would I add in the pattern tag after the removal is complete?
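The lines you are reading probably have a \n at the end and that's why your replacement is not working: once the string is reversed, the newline is its first character, and since . does not match a newline by default, ^.*?# cannot match at all. This question can guide you about reading the file without new lines.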
Among the options, one could be removing the \n using rstrip() like this:
string = lines[position + 2].rstrip("\n")
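To see the effect of the trailing newline in isolation, here is a small sketch that reuses the remove_at helper from the question on a shortened, made-up pattern line (the variable names are assumptions for illustration).

```python
import re

def remove_at(x):
    # Same helper as in the question: drop everything up to the first '#'.
    return re.sub('^.*?#', '', x)

line = '<pattern>#a# #b#</pattern>\n'       # as returned by readlines()

reversed_raw = line[::-1]                    # starts with '\n'
reversed_clean = line.rstrip("\n")[::-1]     # newline removed first

print(repr(remove_at(reversed_raw)))    # unchanged: '.' cannot cross the newline
print(repr(remove_at(reversed_clean)))  # 'b# #a#>nrettap<'
```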
Now, about the replacement, I think you could simplify it by using this regular expression:
#[^#]+#(?!.*#)
It consists of the following parts:
#[^#]+# matches one # followed by one or more characters that are not an # and then another #.
(?!.*#) is a negative lookahead to check that no # is found ahead, preceded by zero or more occurrences of any other character.
Here you can see a demo of this regular expression.
This expression should match the last statement and you would not need to reverse the string:
re.sub("#[^#]+#(?!.*#)", "", string)
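Put together, the two suggestions could look like the sketch below. The function name drop_last_statement is hypothetical, and the input is the example line from the question with a newline appended to mimic readlines(). Since the closing tag is never touched, there is nothing to add back afterwards.

```python
import re

# A '#...#' statement that has no further '#' after it on the line.
LAST_STATEMENT_RE = re.compile(r'#[^#]+#(?!.*#)')

def drop_last_statement(pattern_line):
    """Remove the last #...# statement from a single rule pattern line."""
    line = pattern_line.rstrip("\n")
    return LAST_STATEMENT_RE.sub("", line, count=1)

raw = '<pattern>#this is a statement# #this is also a statement#</pattern>\n'
print(drop_last_statement(raw))
# <pattern>#this is a statement# </pattern>
```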
I am trying to find a way to parse a string of a transcript into speaker segments (as a list). Speaker labels are denoted by the upper-casing of the speaker's name followed by a colon. The problem I am having is some names have a number of non upper-case characters. Examples might include the following:
OBAMA: said something
O'MALLEY: said something else
GOV. HICKENLOOPER: said something else entirely
I have written the following regex, but I am struggling to get it to work:
mystring = "OBAMA: said something \nO'MALLEY: said something else \nGOV. HICKENLOOPER: said something else entirely"
parse_turns = re.split(r'\n(?=[A-Z]+(\ |\.|\'|\d)*[A-Z]*:)', mystring)
What I think I have written (and ideally what I want to do) is a command to split the string based on:
1. Find a newline
2. Use positive look-ahead for one or more uppercase characters
3. If upper-case characters are found look for optional characters from the list of periods, apostrophes, single spaces, and digits
4. If these optional characters are found, look for additional uppercase characters.
5. Crucially, find a colon symbol at the end of this sequence.
EDIT: In many cases, the content of the speech will have newline characters contained within it, and possibly colon symbols. As such, the only thing separating the speaker label from the content of speech is the sequence mentioned above.
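The capturing group is the problem: re.split also returns the text matched by any capturing group in the pattern, so extra fragments end up in the list. Just change (\ |\.|\'|\d) to the character class [\ .\'\d] or to the non-capturing group (?:\ |\.|\'|\d):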
import re
mystring = "OBAMA: said something \nO'MALLEY: said something else \nGOV. HICKENLOOPER: said something else entirely"
parse_turns = re.split(r'\n(?=[A-Z]+[\ \.\'\d]*[A-Z]*:)', mystring)
print(parse_turns)
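The difference between the two patterns can be checked side by side. This is only a sketch on the sample string from the question; the variable names with_group and turns are made up for the comparison.

```python
import re

mystring = ("OBAMA: said something \nO'MALLEY: said something else \n"
            "GOV. HICKENLOOPER: said something else entirely")

# Capturing group: re.split also returns whatever the group last matched.
with_group = re.split(r"\n(?=[A-Z]+(\ |\.|'|\d)*[A-Z]*:)", mystring)

# Character class: nothing is captured, so only the turns come back.
TURN_RE = re.compile(r"\n(?=[A-Z]+[ .'\d]*[A-Z]*:)")
turns = TURN_RE.split(mystring)

print(len(with_group), len(turns))  # 5 3
for turn in turns:
    print(repr(turn))
```

Because the split only happens at a newline that is followed by a speaker label, newlines and colons inside a speech stay within their turn as long as the text after them does not look like an upper-case label.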
If it's true that the speaker's name and what they said are separated by a colon, then it might be simpler to move away from regex to do your splitting.
list_of_things = []
mystring = "OBAMA: Hi\nO'MALLEY: True Dat\nHUCK FINN: Sure thing\n"
lines = mystring.split("\n")  # first split the string into lines on the \n character
for line in lines:
    if ":" not in line:  # skip blank lines (e.g. after the final \n) and lines without a label
        continue
    colon_pos = line.find(":")  # position of the first colon in the line
    speaker, utterance = line[:colon_pos].strip(), line[colon_pos + 1:].strip()
    list_of_things.append((speaker, utterance))
At the end, you should have a neat list of tuples containing speakers, and the things they said.
I have some config file from which I need to extract only some values. For example, I have this:
PART
{
title = Some Title
description = Some description here. // this 2 params are needed
tags = qwe rty // don't need this param
...
}
I need to extract value of certain param, for example description's value. How do I do this in Python3 with regex?
Here is the regex, assuming that the file text is in txt:
import re
m = re.search(r'^\s*description\s*=\s*(.*?)(?=(//)|$)', txt, re.M)
print(m.group(1))
Let me explain.
^ matches at beginning of line.
Then \s* means zero or more spaces (or tabs)
description is your anchor for finding the value part.
After that we expect = sign with optional spaces before or after by denoting \s*=\s*.
Then we capture everything after the = and optional spaces, by denoting (.*?). This expression is captured by the parentheses. Inside the parentheses we say: match anything (the dot), any number of times (the asterisk), in a non-greedy manner (the question mark), that is, take as few characters as possible and stop as soon as the following expression matches.
The following expression is a lookahead expression, starting with (?= which matches the thing right after the (?=.
And that thing is actually two options, separated by the vertical bar |.
The first option, to the left of the bar, is // (the parentheses around it are optional here, since | already separates the two alternatives inside the lookahead), that is, the start of the comment, which, I suppose, you don't want to capture.
The second option is $, meaning the end of the line, which will be reached if there is no comment // on the line.
So we look for everything we can after the first = sign, until either we meet a // pattern, or we meet the end of the line. This is the essence of the (?=(//)|$) part.
We also need the re.M flag, to tell the regex engine that we want ^ and $ to match the start and end of lines, respectively. Without the flag they match the start and end of the entire string, which isn't what we want in this case.
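The explanation above can be turned into a reusable sketch. The helper name get_param is hypothetical, and the pattern is a slight variant of the one in the answer: it uses [ \t]* instead of \s* so that whitespace matching never runs on to the next line, and it trims the spaces before a // comment. The sample text is a trimmed copy of the config from the question.

```python
import re

txt = """PART
{
    title = Some Title
    description = Some description here. // this 2 params are needed
    tags = qwe rty // don't need this param
}
"""

def get_param(name, text):
    """Return the value of `name`, without any trailing // comment, or None."""
    pattern = rf'^[ \t]*{re.escape(name)}[ \t]*=[ \t]*(.*?)[ \t]*(?=//|$)'
    m = re.search(pattern, text, re.M)
    return m.group(1) if m else None

print(get_param("description", txt))  # Some description here.
print(get_param("title", txt))        # Some Title
print(get_param("missing", txt))      # None
```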
The better approach would be to use an established configuration file system. Python has built-in support for INI-like files in the configparser module.
However, if you just desperately need to get the string of text in that file after the description, you could do this:
def get_value_for_key(key, file):
    with open(file) as f:
        lines = f.readlines()
        for line in lines:
            line = line.lstrip()
            if line.startswith(key + " ="):
                return line.split("=", 1)[1].lstrip()
You can use it with a call like: get_value_for_key("description", "myfile.txt"). The method will return None if nothing is found. It is assumed that your file will be formatted where there is a space and the equals sign after the key name, e.g. key = value.
This avoids regular expressions altogether and preserves any whitespace on the right side of the value. (If that's not important to you, you can use strip instead of lstrip.)
Why avoid regular expressions? They're expensive and really not ideal for this scenario. Use simple string matching. This avoids importing a module and simplifies your code. But really I'd say to convert to a supported configuration file format.
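For the configparser route suggested above, a minimal sketch might look like this. It assumes the file has first been converted to INI syntax (a [PART] section header instead of PART { ... }), which is not the format shown in the question, and it treats // as an inline comment marker through the inline_comment_prefixes option.

```python
import configparser

# Assumed INI rewrite of the config from the question.
ini_text = """
[PART]
title = Some Title
description = Some description here. // this 2 params are needed
tags = qwe rty // don't need this param
"""

parser = configparser.ConfigParser(inline_comment_prefixes=("//",))
parser.read_string(ini_text)

print(parser["PART"]["description"])  # Some description here.
```

To read from disk instead, parser.read("myfile.ini") takes the place of read_string; the file name here is only a placeholder.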
This is a pretty simple regex: you just need a positive lookbehind. To drop a trailing comment as well, make the match lazy and end it with a lookahead, i.e. replace .* with .*?(?=\s*//|$) and pass the re.M flag.
r"(?<=description = ).*"
Regex101 demo
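A quick sketch of the lookbehind approach with the comment stripped. It assumes the key is written exactly as "description = " (one space on each side of the equals sign), because a lookbehind in Python's re must have a fixed width. The sample text is a trimmed copy of the config from the question.

```python
import re

txt = """PART
{
    title = Some Title
    description = Some description here. // this 2 params are needed
    tags = qwe rty // don't need this param
}
"""

# Lazy match after the key, stopping before an optional run of spaces and //,
# or at the end of the line (re.M makes $ match at line ends).
m = re.search(r'(?<=description = ).*?(?=\s*//|$)', txt, re.M)
print(m.group(0) if m else None)  # Some description here.
```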
I have an auto generated bibliography file which stores my references. The citekey in the generated file is of the form xxxxx:2009tb. Is there a way to make the program to detect such a pattern and change the citekey form to xxxxx:2009?
It's not quite clear to me which expression you want to match, but you can build everything with regex, using import re and re.sub as shown. [0-9]{4} matches exactly 4 digits, and the parentheses capture the part of the citekey that is kept.
(Edit, to incorporate suggestions)
import re
inf = 'temp.txt'
outf = 'out.txt'
with open(inf) as f, open(outf, 'w') as o:
    text = f.read()
    text = re.sub(r"(xxxxx:[0-9]{4})tb", r"\1", text) # match your regex here
    o.write(text)
You actually just want to remove the two letters after the year in a reference. Supposing we can uniquely identify a reference as a colon followed by four digits and two letters, then the following regular expression works (at least in this example code):
import re
s = """
according to some works (newton:2009cb), gravity is not the same that
severity (darwin:1873dc; hampton:1956tr).
"""
new_s = re.sub(r'(:[0-9]{4})\w{2}', r'\1', s)
print(new_s)
Explanation: match a colon : followed by four digits [0-9]{4} followed by any two "word" characters \w{2}. The parentheses capture just the part you want to keep, and r'\1' replaces each whole match with the contents of the first (and only) group. The r prefix makes it a raw string, so \1 reaches the regex engine as a backreference instead of being interpreted by Python as an escape sequence.
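If the bibliography contains other colon-and-digit sequences, a somewhat stricter pattern may be safer. The sketch below assumes citekeys consist of letters, a colon, a four-digit year and exactly two lower-case letters; both that assumption and the function name strip_citekey_suffix are illustrative only.

```python
import re

# Letters, colon, four digits (captured), then exactly two lower-case letters.
CITEKEY_RE = re.compile(r'\b([A-Za-z]+:\d{4})[a-z]{2}\b')

def strip_citekey_suffix(text):
    """Turn keys like newton:2009cb into newton:2009."""
    return CITEKEY_RE.sub(r'\1', text)

s = "some works (newton:2009cb) and others (darwin:1873dc; hampton:1956tr)"
print(strip_citekey_suffix(s))
# some works (newton:2009) and others (darwin:1873; hampton:1956)
```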
Hope this helps!