How to match a pattern and add a character to it - Python

I have something like:
GCF_002904975:2.6672e-05):2.6672e-05.
and I would like to add the string '_S' right after any GCF(any number) entry, before the next colon.
In other words, I would like my text to become:
GCF_002904975_S:2.6672e-05):2.6672e-05.
This pattern repeats throughout my text.

This can easily be done with the re.sub function. A working example would look like this:
import re
inp_string='(((GCF_001297375:2.6671e-05,GCF_002904975:2.6672e-05)0.924:0.060046136,(GCF_000144955:0.036474926,((GCF_001681075:0.017937143,...'
if __name__ == "__main__":
    # Capture the digits in a named group and re-insert them, followed by _S, before the colon.
    outp_string = re.sub(r'GCF_(?P<gfc_number>\d+):', r'GCF_\g<gfc_number>_S:', inp_string)
    print(outp_string)
This code gives the following result, which is hopefully what you need:
(((GCF_001297375_S:2.6671e-05,GCF_002904975_S:2.6672e-05)0.924:0.060046136,(GCF_000144955_S:0.036474926,((GCF_001681075_S:0.017937143,...
For more info take a look at the docs:
https://docs.python.org/3/library/re.html

You can use regular expressions with a function substitution. The solution below depends on the numbers always being 9 digits, but could be modified to work with other cases.
import re
test_str = '(((GCF_001297375:2.6671e-05,GCF_002904975:2.6672e-05)0.924:0.060046136,GCF_000144955:0.036474926,((GCF_001681075:0.017937143,...'
new_str = re.sub(r"GCF_\d{9}", lambda x: x.group(0) + "_S", test_str)
print(new_str)
#(((GCF_001297375_S:2.6671e-05,GCF_002904975_S:2.6672e-05)0.924:0.060046136,GCF_000144955_S:0.036474926,((GCF_001681075_S:0.017937143,...
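As a possible way to drop the 9-digit assumption, the sketch below uses \d+ for any number of digits and a lookahead so that only entries directly followed by a colon are changed. The shortened input string and the GCF_12 entry are made up for illustration.

```python
import re

# Hypothetical shortened input; GCF_12 is included only to show a non-9-digit ID.
test_str = "((GCF_001297375:2.6671e-05,GCF_12:2.6672e-05)0.924:0.060046136)"

# \d+ accepts any number of digits; (?=:) requires a colon to follow
# without consuming it, so the colon stays in place.
new_str = re.sub(r"GCF_\d+(?=:)", lambda m: m.group(0) + "_S", test_str)
print(new_str)
```

The colon after 0.924 is left alone because the pattern only matches text that begins with GCF_.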

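Why not just do a replace? Shortening your example string to make it easier to read:
"(((GCF_001297375:2.6671e-05,GCF_002904975:2.6672e-05)...".replace(":","_S:")
Note that this appends _S before every colon, so in the full string 0.924:0.060046136 would also become 0.924_S:0.060046136.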

Related

Updating a string using regular expressions in Python

I'm pretty sure that my question is very straightforward but I cannot find the answer to it. Let's say we have an input string like:
input = "This is an example"
Now, I want to replace every word in the input with another string that includes the original match. ("Word" is just an example here; more generally, I mean every substring matched by a regular expression.) For instance, I want to add a # to the left and right of every word in input. The output would be:
output = "#This# #is# #an# #example#"
What is the solution? I know how to use re.sub or replace, but I do not know how I can use them in a way that I can update the original matched strings and not completely replace them with something else.
You can use capture groups for that.
import re
input = "This is an example"
output = re.sub(r"(\w+)", r"#\1#", input)
A capture group is something that you can later reference, for example in the substitution string. In this case, I'm matching a word, putting it into a capture group and then replacing it with the same word, but with # added as a prefix and a suffix.
You can read more about regular expressions in Python in the docs.
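To make the back-reference idea a bit more concrete, here is a small sketch of three equivalent ways to refer to the matched text in the replacement string. The variable names are only illustrative.

```python
import re

text = "This is an example"

# Numbered group: \1 refers to the first pair of parentheses.
numbered = re.sub(r"(\w+)", r"#\1#", text)

# Named group: \g<word> refers to the group called "word".
named = re.sub(r"(?P<word>\w+)", r"#\g<word>#", text)

# No explicit group at all: \g<0> refers to the whole match.
whole = re.sub(r"\w+", r"#\g<0>#", text)

print(numbered)
print(named)
print(whole)
```

All three calls should produce the same string; which form to use is mostly a matter of readability.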
Here is an option using re.sub with lookarounds:
import re
input = "This is an example"
output = re.sub(r'(?<!\w)(?=\w)|(?<=\w)(?!\w)', '#', input)
print(output)
#This# #is# #an# #example#
This works without the re library:
a = "This is an example"
l=[]
for i in a.split(" "):
    l.append('#'+i+'#')
print(" ".join(l))
You can match only word boundaries with \b:
import re
input = "This is an example"
output = re.sub(r'\b', '#', input)
print(output)
#This# #is# #an# #example#

Regex Get String Subset

How can we get a substring based on full stops using regex? We only wish to get the text after the full stop.
Str = "i like cows. I also like camels"
// Regex Code here
Output : "I also like camels"
No need to use regex for that. Use the split() method.
splitted = Str.split('.')
# splitted[0] will be 'i like cows'
# splitted[1] will be ' I also like camels' (note the leading space; call .strip() to remove it)
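If only the text after the first full stop is needed, str.partition is another option. This is a minimal sketch, and the variable names are just for illustration.

```python
text = "i like cows. I also like camels"

# partition splits at the first "." only and always returns three parts,
# so there is no IndexError when the string contains no full stop.
before, sep, after = text.partition(".")
print(after.strip())
```

When the input has no full stop, sep and after are both empty strings, which can be checked before using the result.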
You can use this approach:
str1 = 'i like cows. I also like camels'
print(str1.split('.')[1].strip())
output:
I also like camels
Try this split. In Java, String.split takes a regular expression, so the dot must be escaped:
String dataIWant = mydata.split("\\.")[1].trim();
Result : I also like camels
Using split('.') and selecting the last element is generally better but for fun this is a RegEx solution:
import re
Str = "i like .cows. I also like camels"
pattern = r"([^\.]*$)"
results = re.search(pattern, Str)
print(results.group(1).strip())
This pattern, (?:[.]\s([A-Z].+)), captures "I also like camels" in group 1.
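A quick way to try that pattern in Python might look like the following. It assumes the sentence after the full stop starts with a capital letter, as the pattern requires.

```python
import re

text = "i like cows. I also like camels"

# Group 1 holds everything from the capital letter after ". " to the end.
match = re.search(r"(?:[.]\s([A-Z].+))", text)
after_stop = match.group(1) if match else None
print(after_stop)
```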

Python Regular Expression Lookahead multiple conditions

My string looks like this:
string = "*[EQ](#[Type],'A,B,C',#[Type],*[EQ](#[Type],D,E,F))"
The ideal output list is:
['#[Type]', 'A,B,C', '#[Type]', '*[EQ](#[Type],D,E,F)']
So I can parse the string as:
if #[Type] in ('A,B,C') then #[Type] else *[EQ](#[Type],D,E,F)
The challenge is to find all the commas followed by #, ' or *. I've tried the following code but it doesn't work:
interM = re.search(r"\*\[EQ\]\((.+)(?=,#|,\*|,\')+,(.+)\)", string)
print(interM.groups())
Edit:
The ultimate goal is to parse out the 4 components of the input string:
*[EQ](Value, Target, ifTrue, ifFalse)
>>> import re
>>> string = "*[EQ](#[Type],'A,B,C',#[Type],*[EQ](#[Type],D,E,F))"
>>> re.split(r"^\*\[EQ\]\(|\)$|,(?=[#'*])", string)[1:-1]
['#[Type]', "'A,B,C'", '#[Type]', '*[EQ](#[Type],D,E,F)']
If you are looking for a more robust solution, though, I'd highly recommend a lexical analyzer such as flex.
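Short of a full lexer, one more robust alternative is a small hand-written splitter that tracks parenthesis depth and quotes. The sketch below is only an illustration: split_top_level is a hypothetical helper, and it assumes single quotes are never escaped and parentheses are balanced.

```python
def split_top_level(args):
    """Split on commas that are outside quotes and outside parentheses."""
    parts, current = [], []
    depth, in_quote = 0, False
    for ch in args:
        if ch == "'":
            in_quote = not in_quote
        elif not in_quote:
            if ch == "(":
                depth += 1
            elif ch == ")":
                depth -= 1
            elif ch == "," and depth == 0:
                # Top-level comma: close the current argument.
                parts.append("".join(current))
                current = []
                continue
        current.append(ch)
    parts.append("".join(current))
    return parts


string = "*[EQ](#[Type],'A,B,C',#[Type],*[EQ](#[Type],D,E,F))"
# Strip the leading "*[EQ](" and the final ")" to get the argument list.
inner = string[len("*[EQ]("):-1]
print(split_top_level(inner))
```

Because nesting is tracked explicitly, the same function could be applied again to the fourth component to parse the inner *[EQ](...) call.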
import re
x="*[EQ](#[Type],'A,B,C',#[Type],*[EQ](#[Type],D,E,F))"
print(re.findall(r"#[^,]+|'[^']+'|\*.*?\([^\)]*\)",re.findall(r"\*\[EQ\]\((.*?)\)$",x)[0]))
Output:
['#[Type]', "'A,B,C'", '#[Type]', '*[EQ](#[Type],D,E,F)']
You can try something of this sort. You have not described the logic, so I am not sure whether this can be scaled.

In Python how to strip dollar signs and commas from dollar related fields only

I'm reading in a large text file with lots of columns, dollar related and not, and I'm trying to figure out how to strip the dollar fields ONLY of $ and , characters.
so say I have:
a|b|c
$1,000|hi,you|$45.43
$300.03|$MS2|$55,000
where a and c are dollar-fields and b is not.
The output needs to be:
a|b|c
1000|hi,you|45.43
300.03|$MS2|55000
I was thinking that regex would be the way to go, but I can't figure out how to express the replacement:
f=open('sample1_fixed.txt','wb')
for line in open('sample1.txt', 'rb'):
    new_line = re.sub(r'(\$\d+([,\.]\d+)?k?)',????, line)
    f.write(new_line)
f.close()
Anyone have an idea?
Thanks in advance.
Unless you are really tied to the idea of using a regex, I would suggest doing something simple, straight-forward, and generally easy to read:
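def convert_money(inval):
    # Only fields that start with '$' are candidates for conversion.
    if inval.startswith('$'):
        test_val = inval[1:].replace(",", "")
        try:
            float(test_val)  # check that what remains is numeric
        except ValueError:
            pass  # not a number (e.g. '$MS2'), so leave the field unchanged
        else:
            inval = test_val
    return inval

def convert_string(s):
    # Convert each pipe-separated field independently.
    return "|".join(map(convert_money, s.split("|")))

a = '$1,000|hi,you|$45.43'
b = '$300.03|$MS2|$55,000'
print(convert_string(a))
print(convert_string(b))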
OUTPUT
1000|hi,you|45.43
300.03|$MS2|55000
A simple approach:
>>> import re
>>> exp = r'\$\d+(,|\.)?\d+'
>>> s = '$1,000|hi,you|$45.43'
>>> '|'.join(i.translate(str.maketrans('', '', '$,')) if re.match(exp, i) else i for i in s.split('|'))
'1000|hi,you|45.43'
It sounds like you are addressing the entire line of text at once. I think your first task would be to break up your string by columns into an array or some other variables. Once you've done that, your solution for converting strings of currency into numbers doesn't have to worry about the other fields.
Once you've done that, I think there is probably an easier way to do this task than with regular expressions. You could start with this SO question.
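One way to sketch the column-first idea is with the csv module and a pipe delimiter. The DOLLAR_COLUMNS set is an assumption (the question says a and c are the dollar fields), and the sample data is read from a string here rather than from sample1.txt.

```python
import csv
import io

# Sample data from the question, held in memory for illustration.
sample = "a|b|c\n$1,000|hi,you|$45.43\n$300.03|$MS2|$55,000\n"

# Assumption: the dollar columns are known up front by header name.
DOLLAR_COLUMNS = {"a", "c"}

reader = csv.DictReader(io.StringIO(sample), delimiter="|")
out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=reader.fieldnames,
                        delimiter="|", lineterminator="\n")
writer.writeheader()
for row in reader:
    for name in DOLLAR_COLUMNS:
        # Strip "$" and "," only from the dollar columns.
        row[name] = row[name].replace("$", "").replace(",", "")
    writer.writerow(row)

result = out.getvalue()
print(result)
```

With real files, the two StringIO objects would be replaced by files opened in text mode with newline=''.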
If you really want to use regex though, then this pattern should work for you:
/[$,]/g
Demo on regex101
Replace matches with empty strings. The pattern gets a little more complicated if you have other kinds of currency present.
Try this regex, adapting it if necessary:
\$(\d+)[\,]*([\.]*\d*)
SEE DEMO : http://regex101.com/r/wM0zB6/2
Use the regex
((?<=\d),(?=\d))|(\$(?=\d))
e.g.
>>> import re
>>> x="$1,000|hi,you|$45.43"
>>> re.sub( r'((?<=\d),(?=\d))|(\$(?=\d))', r'', x)
'1000|hi,you|45.43'
Try the below regex and then replace the matched strings with \1\2\3
\$(\d+(?:\.\d+)?)(?:(?:,(\d{2}))*(?:,(\d{3})))?
DEMO
Defining a blacklist and checking whether each character is in it is an easy way to do this. Note that it strips $ and , from every field, not only the dollar fields:
blacklist = ("$", ",") # define characters to remove
with open('sample1_fixed.txt','w') as f:
    for line in open('sample1.txt', 'r'):
        clean_line = "".join(c for c in line if c not in blacklist)
        f.write(clean_line)
\$(?=(?:[^|]+,)|(?:[^|]+\.))
Try this. Replace the matches with an empty string and use the re.M option. See the demo:
http://regex101.com/r/gT6kI4/6

Regular expression split

I have inputs similar to the following:
TV-12VX
TV-14JW
TV-2JIS
VC-224X
I need to remove everything after the numbers after the dash. The result would be:
TV-12
TV-14
TV-2
TV-224
How would I do this split via regular expressions?
The following code shows how to match strings of the form "TV-" + (some number):
>>> import re
>>> re.match('TV-[0-9]+','TV-12VX').group(0)
'TV-12'
(Note that, because I'm using match, this only works if the string starts with the bit you want to extract.)
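The difference between match and search can be seen in a short sketch. The input with a leading "model " prefix is made up for illustration.

```python
import re

pattern = r"TV-[0-9]+"
text = "model TV-12VX"  # hypothetical input where the code is not at the start

# re.match only tries at the beginning of the string, so it finds nothing here.
at_start = re.match(pattern, text)
# re.search scans the whole string for the first occurrence.
anywhere = re.search(pattern, text)

print(at_start)
print(anywhere.group(0))
```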
I think this regex is appropriate for you: (.+?-\d+?)[a-zA-Z]. You can use it with re.findall, or re.match.
import re
p = re.match(r'(\w{2}-\d+)', 'TV-12VX')
print(p.group(0))
Outputs
TV-12
You can remove everything after the digits with this:
re.sub(r"^(\w+-\d+).*", r"\1", input)
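Applied to the inputs from the question, that substitution could be used as in the sketch below. The list name codes is only illustrative.

```python
import re

codes = ["TV-12VX", "TV-14JW", "TV-2JIS", "VC-224X"]

# Keep the prefix, the dash and the digits; drop everything after them.
trimmed = [re.sub(r"^(\w+-\d+).*", r"\1", code) for code in codes]
print(trimmed)
```

Because the prefix is matched with \w+ rather than a literal TV, the VC entry keeps its own prefix.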
