Regex Get String Subset - python

How can we get a substring based on full stops using regex? We only want the text after the full stop.
Str = "i like cows. I also like camels"
# Regex code here
Output: "I also like camels"

There is no need to use regex for that. Use the split() method.
splitted = Str.split('.')
# splitted[0] will be 'i like cows'
# splitted[1] will be ' I also like camels' (note the leading space; call .strip() to remove it)

You can use this approach:
str1 = 'i like cows. I also like camels'
print(str1.split('.')[1].strip())
output:
I also like camels

Try this split:
data_i_want = mydata.split(".")[1].strip()
Result: I also like camels

Using split('.') and selecting the last element is generally better but for fun this is a RegEx solution:
import re
Str = "i like .cows. I also like camels"
pattern = r"([^\.]*$)"
results = re.search(pattern, Str)
print(results.group(1).strip())

The pattern (?:[.]\s([A-Z].+)) captures "I also like camels" in group 1.
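For comparison, here is a minimal sketch of the non-regex and regex routes side by side. It assumes the text of interest is everything after the first full stop; the helper names after_first_stop and after_stop_regex are made up for this example.

```python
import re

def after_first_stop(text):
    # str.partition splits on the first '.' only and never raises
    _, sep, tail = text.partition('.')
    return tail.strip() if sep else ''

def after_stop_regex(text):
    # Capture everything after the first full stop that is followed by whitespace
    match = re.search(r'[.]\s+(.+)', text)
    return match.group(1) if match else ''

print(after_first_stop('i like cows. I also like camels'))
print(after_stop_regex('i like cows. I also like camels'))
```

Both helpers return an empty string when there is no full stop, which avoids the IndexError that indexing the result of split() would raise in that case.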

How to match a pattern and add a character to it

I have something like:
GCF_002904975:2.6672e-05):2.6672e-05.
and I would like to add the suffix '_S' right after any GCF_(any number) entry, before the next colon.
In other words, I would like my text to become:
GCF_002904975_S:2.6672e-05):2.6672e-05.
This pattern repeats throughout my text.

This can easily be done with the re.sub function. A working example would look like this:
import re
inp_string='(((GCF_001297375:2.6671e-05,GCF_002904975:2.6672e-05)0.924:0.060046136,(GCF_000144955:0.036474926,((GCF_001681075:0.017937143,...'
if __name__ == "__main__":
    # Capture the digits after GCF_ and re-emit them followed by _S
    outp_string = re.sub(r'GCF_(?P<gfc_number>\d+):', r'GCF_\g<gfc_number>_S:', inp_string)
    print(outp_string)
This code gives the following result, which is hopefully what you need:
(((GCF_001297375_S:2.6671e-05,GCF_002904975_S:2.6672e-05)0.924:0.060046136,(GCF_000144955_S:0.036474926,((GCF_001681075_S:0.017937143,...
For more info take a look at the docs:
https://docs.python.org/3/library/re.html

You can use regular expressions with a function substitution. The solution below depends on the numbers always being 9 digits, but could be modified to work with other cases.
import re
test_str = '(((GCF_001297375:2.6671e-05,GCF_002904975:2.6672e-05)0.924:0.060046136,GCF_000144955:0.036474926,((GCF_001681075:0.017937143,...'
new_str = re.sub(r"GCF_\d{9}", lambda x: x.group(0) + "_S", test_str)
print(new_str)
#(((GCF_001297375_S:2.6671e-05,GCF_002904975_S:2.6672e-05)0.924:0.060046136,GCF_000144955_S:0.036474926,((GCF_001681075_S:0.017937143,...
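One way to drop the nine-digit assumption, as a rough sketch: match any run of digits after GCF_ and use a lookahead for the colon. The function name tag_gcf and the short tree string (including the GCF_12 entry) are invented for illustration and are not taken from the original data.

```python
import re

def tag_gcf(text):
    # \d+ accepts any number of digits; (?=:) requires a colon right after
    # the entry without consuming it, so other colons are left alone
    return re.sub(r'GCF_\d+(?=:)', lambda m: m.group(0) + '_S', text)

tree = '((GCF_001297375:2.6671e-05,GCF_12:2.6672e-05)0.924:0.060046136)'
print(tag_gcf(tree))
```

Because the pattern is anchored on the GCF_ prefix, the colon after the 0.924 label is not touched.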
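
Why not just do a replace? Shortening your example string to make it easier to read:
"(((GCF_001297375:2.6671e-05,GCF_002904975:2.6672e-05)...".replace(":","_S:")
Note that this inserts _S before every colon, so it only works if colons never follow anything other than a GCF entry. In the full string it would also change 0.924:0.060046136.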

How to lowercase a portion of text using python

I have the following text:
ABC=ABC.2016.001.02.Yomama.01234
How can I lowercase just the Yomama part? I'd like it to look like this:
ABC.2016.001.02.yomama.01234
How can I accomplish this with python?
Any help would be appreciated. Thanks.

Assuming that you want a generic solution (otherwise you could just use str.replace() with a hard-coded string), you can split the string on the '.' character, lowercase the string in the appropriate field, and then stitch it back together with str.join():
s = 'ABC=ABC.2016.001.02.Yomama.01234'
fields = s.split('.')
fields[4] = fields[4].lower()
print('.'.join(fields))

An alternative solution, provided the target text does not occur anywhere else in the string:
ABC = 'ABC=ABC.2016.001.02.Yomama.01234'
tmp = ABC.split('.')[-2]  # 'Yomama'
ABC = ABC.replace(tmp, tmp.lower())
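If the position of the field is not fixed, a regex with a function replacement is another option. This is only a sketch: it assumes the part to lowercase is the only dot-delimited field made purely of letters, and the name lowercase_word_fields is hypothetical.

```python
import re

def lowercase_word_fields(text):
    # Lowercase any field that sits between two dots and contains only letters
    return re.sub(r'(?<=\.)[A-Za-z]+(?=\.)', lambda m: m.group(0).lower(), text)

print(lowercase_word_fields('ABC=ABC.2016.001.02.Yomama.01234'))
```

The leading ABC=ABC part is left alone because it is not preceded by a dot, which sidesteps the repeated-text problem of str.replace().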

Complex regex in Python

I am trying to write a generic pattern using regex so that it fetches only particular things from the string. Let's say we have strings like GigabitEthernet0/0/0/0 or FastEthernet0/4 or Ethernet0/0.222. The regex should fetch the first 2 characters and all the numerals. Therefore, the fetched result should be something like Gi0000 or Fa04 or Et00222 depending on the above cases.
x = 'GigabitEthernet0/0/0/2'
m = re.search('([\w+]{2}?)[\\\.(\d+)]{0,}',x)
I cannot work out how to write the regular expression. The values can also be returned as a list. I have tried a few more patterns, but none of them helped.

With regex, you can use the re.findall function.
>>> import re
>>> s = 'GigabitEthernet0/0/0/0 '
>>> s[:2]+''.join(re.findall(r'\d', s))
'Gi0000'
OR
>>> ''.join(re.findall(r'^..|\d', s))
'Gi0000'
>>> ''.join(re.findall(r'^..|\d', 'Ethernet0/0.222'))
'Et00222'
OR
>>> s = 'GigabitEthernet0/0/0/0 '
>>> s[:2]+''.join([i for i in s if i.isdigit()])
'Gi0000'
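The first variant above can be wrapped in a small helper and checked against all three interface names from the question. This is a sketch that assumes the first two characters are never digits (otherwise they would be counted twice); the name abbreviate is made up here.

```python
import re

def abbreviate(interface):
    # First two characters of the name, then every digit in order
    return interface[:2] + ''.join(re.findall(r'\d', interface))

for name in ('GigabitEthernet0/0/0/0', 'FastEthernet0/4', 'Ethernet0/0.222'):
    print(name, '->', abbreviate(name))
```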

You can try this. It makes sure that only digits towards the end of the string are used:
import re
z = "Ethernet0/0.222."
print(z[:2] + "".join(re.findall(r"(\d+)(?=[\d\W]*$)", z)))

Here is another option:
import re
s = 'Ethernet0/0.222'
"".join(re.findall(r'^\w{2}|\d+', s))

Python Regular Expression Lookahead multiple conditions

My string looks like this:
string = "*[EQ](#[Type],'A,B,C',#[Type],*[EQ](#[Type],D,E,F))"
The ideal output list is:
['#[Type]', 'A,B,C', '#[Type]', '*[EQ](#[Type],D,E,F)']
So I can parse the string as:
if #[Type] in ('A,B,C') then #[Type] else *[EQ](#[Type],D,E,F)
The challenge is to find all the commas followed by #, ' or *. I've tried the following code but it doesn't work:
interM = re.search(r"\*\[EQ\]\((.+)(?=,#|,\*|,\')+,(.+)\)", string)
print(interM.groups())
Edit:
The ultimate goal is to parse out the 4 components of the input string:
*[EQ](Value, Target, ifTrue, ifFalse)

>>> import re
>>> string = "*[EQ](#[Type],'A,B,C',#[Type],*[EQ](#[Type],D,E,F))"
>>> re.split(r"^\*\[EQ\]\(|\)$|,(?=[#'*])", string)[1:-1]
['#[Type]', "'A,B,C'", '#[Type]', '*[EQ](#[Type],D,E,F)']
Although, if you are looking for a more robust solution I'd highly recommend a Lexical Analyzer such as flex.

You can try something of this sort. You have not described the general logic, so I am not sure whether this scales to other inputs.
import re
x="*[EQ](#[Type],'A,B,C',#[Type],*[EQ](#[Type],D,E,F))"
print(re.findall(r"#[^,]+|'[^']+'|\*.*?\([^\)]*\)", re.findall(r"\*\[EQ\]\((.*?)\)$", x)[0]))
Output:
['#[Type]', "'A,B,C'", '#[Type]', '*[EQ](#[Type],D,E,F)']
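If the nesting can go deeper than in the example, a small hand-written splitter that tracks quotes and parentheses may be easier to reason about than a single regex. This is only a sketch: it assumes single quotes are never escaped and that the outer call always has the form *[EQ](...); the name split_top_level is hypothetical.

```python
def split_top_level(args):
    """Split on commas that are outside quotes and outside parentheses."""
    parts, current = [], []
    depth, in_quote = 0, False
    for ch in args:
        if ch == "'":
            in_quote = not in_quote
        elif not in_quote:
            if ch == '(':
                depth += 1
            elif ch == ')':
                depth -= 1
        if ch == ',' and depth == 0 and not in_quote:
            # A top-level comma ends the current argument
            parts.append(''.join(current))
            current = []
        else:
            current.append(ch)
    parts.append(''.join(current))
    return parts

expr = "*[EQ](#[Type],'A,B,C',#[Type],*[EQ](#[Type],D,E,F))"
inner = expr[len('*[EQ]('):-1]  # drop the outer "*[EQ](" and the final ")"
print(split_top_level(inner))
```

The same function can then be applied again to the nested *[EQ](...) argument to parse it recursively.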

In Python how to strip dollar signs and commas from dollar related fields only

I'm reading in a large text file with lots of columns, dollar related and not, and I'm trying to figure out how to strip the dollar fields ONLY of $ and , characters.
so say I have:
a|b|c
$1,000|hi,you|$45.43
$300.03|$MS2|$55,000
where a and c are dollar-fields and b is not.
The output needs to be:
a|b|c
1000|hi,you|45.43
300.03|$MS2|55000
I was thinking that regex would be the way to go, but I can't figure out how to express the replacement:
f=open('sample1_fixed.txt','wb')
for line in open('sample1.txt', 'rb'):
    new_line = re.sub(r'(\$\d+([,\.]\d+)?k?)',????, line)
    f.write(new_line)
f.close()
Anyone have an idea?
Thanks in advance.

Unless you are really tied to the idea of using a regex, I would suggest doing something simple, straightforward, and generally easy to read:
def convert_money(inval):
    # Only fields that start with '$' are candidates
    if inval.startswith('$'):
        test_val = inval[1:].replace(",", "")
        try:
            float(test_val)  # raises ValueError for non-numeric text such as 'MS2'
        except ValueError:
            pass
        else:
            inval = test_val
    return inval

def convert_string(s):
    return "|".join(map(convert_money, s.split("|")))

a = '$1,000|hi,you|$45.43'
b = '$300.03|$MS2|$55,000'
print(convert_string(a))
print(convert_string(b))
OUTPUT
1000|hi,you|45.43
300.03|$MS2|55000

A simple approach:
>>> import re
>>> exp = r'\$\d+(,|\.)?\d+'
>>> s = '$1,000|hi,you|$45.43'
>>> '|'.join(i.translate(str.maketrans('', '', '$,')) if re.match(exp, i) else i for i in s.split('|'))
'1000|hi,you|45.43'

It sounds like you are addressing the entire line of text at once. I think your first task would be to break up your string by columns into an array or some other variables. Once you've done that, your solution for converting strings of currency into numbers doesn't have to worry about the other fields.
At that point, I think there is probably an easier way to do this task than with regular expressions. You could start with this SO question.
If you really want to use regex though, then this pattern should work for you:
/[$,]/g
Replace matches with empty strings. The pattern gets a little more complicated if you have other kinds of currency present.

Try this regex and adapt it if necessary.
\$(\d+)[\,]*([\.]*\d*)
SEE DEMO : http://regex101.com/r/wM0zB6/2

Use the regex
((?<=\d),(?=\d))|(\$(?=\d))
e.g.
>>> import re
>>> x = "$1,000|hi,you|$45.43"
>>> re.sub(r'((?<=\d),(?=\d))|(\$(?=\d))', r'', x)
'1000|hi,you|45.43'
Try the below regex and then replace the matched strings with \1\2\3
\$(\d+(?:\.\d+)?)(?:(?:,(\d{2}))*(?:,(\d{3})))?

Defining a blacklist and checking whether each character is in it is an easy way to do this. Note that it strips the characters from every field, including non-dollar ones such as hi,you and $MS2:
blacklist = ("$", ",")  # define characters to remove
with open('sample1_fixed.txt', 'w') as f:
    for line in open('sample1.txt', 'r'):
        clean_line = "".join(c for c in line if c not in blacklist)
        f.write(clean_line)

Try this. Replace matches with an empty string and use the re.M option. Note that it only strips the $ sign, not the commas.
\$(?=(?:[^|]+,)|(?:[^|]+\.))
See demo:
http://regex101.com/r/gT6kI4/6
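Putting the per-column idea together, here is a rough sketch that decides field by field whether the whole field looks like a dollar amount before stripping anything. The MONEY pattern is an assumption about what counts as a dollar field ($ followed by digits, optional thousands separators, optional decimals), and clean_line is a made-up name.

```python
import re

# A field counts as money only if the WHOLE field matches this shape
MONEY = re.compile(r'\$\d{1,3}(?:,\d{3})*(?:\.\d+)?|\$\d+(?:\.\d+)?')

def clean_line(line, sep='|'):
    fields = line.rstrip('\n').split(sep)
    cleaned = [
        # Strip '$' and ',' from money fields; leave everything else as-is
        f.replace('$', '').replace(',', '') if MONEY.fullmatch(f) else f
        for f in fields
    ]
    return sep.join(cleaned)

for row in ('a|b|c', '$1,000|hi,you|$45.43', '$300.03|$MS2|$55,000'):
    print(clean_line(row))
```

Using fullmatch rather than search is the design choice that keeps fields such as hi,you and $MS2 intact.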
