Cutting string in python

I did some searching but couldn't find any useful information.
s = ['33PM']
My aim is to cut 'PM' from s[0] and append it as s[1].

You can use re.findall to extract contiguous runs of digits and of other word characters: \d+ matches each run of digits, and \w+ matches each remaining run of word characters.
>>> import re
>>> s = re.findall(r'\d+|\w+', s[0])
>>> s
['33', 'PM']
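One caveat worth knowing: \w also matches digits, so the pattern above relies on \d+ being tried first and on the digits coming first in the string. A minimal sketch, assuming the text part consists only of ASCII letters (the helper name split_digits_letters is hypothetical), uses an explicit letter class instead:

```python
import re

def split_digits_letters(text):
    """Return runs of digits and runs of ASCII letters, in order of appearance."""
    return re.findall(r'\d+|[A-Za-z]+', text)

s = ['33PM']
s = split_digits_letters(s[0])
print(s)                             # ['33', 'PM']
print(split_digits_letters('PM33'))  # ['PM', '33']
```

With the original \d+|\w+ pattern, an input such as 'PM33' would come back as a single item, because \w+ would swallow the digits too.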

Here is a method that uses simple Python code, avoiding the complications of regular expressions. It is designed for when you know that 'PM' is in the string; if there is any text in the string after it, that text is moved to the second list item together with the 'PM'. This code also assumes that you care only about the first item in the list: any later items will be dropped.
s = ['33PM']
string0 = s[0]
loc = string0.find('PM')
s = [string0[:loc], string0[loc:]]
If you now print s the result is
['33', 'PM']
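If 'PM' might be missing, str.find returns -1 and the slices above would quietly produce the wrong pieces. A hedged sketch of one way to guard against that, using str.partition (the helper name cut_suffix is made up for illustration):

```python
def cut_suffix(text, marker='PM'):
    """Split text at the first occurrence of marker; leave it whole if marker is absent."""
    head, sep, tail = text.partition(marker)
    if not sep:                 # marker not found
        return [text]
    return [head, sep + tail]   # marker and anything after it go into the second item

print(cut_suffix('33PM'))   # ['33', 'PM']
print(cut_suffix('33'))     # ['33']
```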

Related

How to split string in Python to take only middle characters?

I have string
['tick_calculated_2_2020-05-27T11-59-06.json.gz']
I want to get only 59-06
>>> f.split('_')
['tick', 'calculated', '2', '2020-05-27T11-59-06.json.gz']
>>> f.split('_')[3]
'2020-05-27T11-59-06.json.gz'
>>> f.split('_')[3].split('.')[0]
'2020-05-27T11-59-06'
What should be the next step?
You are going in the right direction.
Contrary to other answers, I feel regex is a bit of an overkill, apart from being slower and harder to understand and maintain.
Once you have the string x = '2020-05-27T11-59-06', you can do x.split('-') to get a list lst = ['2020', '05', '27T11', '59', '06'].
You can then take the last two elements of this list, lst[-2] and lst[-1], and join them with a hyphen to get what you want.
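Put together, the split-only approach might look like the sketch below (variable names are illustrative):

```python
f = 'tick_calculated_2_2020-05-27T11-59-06.json.gz'

stamp = f.split('_')[3].split('.')[0]   # '2020-05-27T11-59-06'
lst = stamp.split('-')                  # ['2020', '05', '27T11', '59', '06']
result = '-'.join(lst[-2:])             # last two fields, re-joined with '-'
print(result)                           # 59-06
```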
You could try using re (regex).
import re
f = "tick_calculated_2_2020-05-27T11-59-06.json.gz"
res = re.search(r"T\d+\-([\d\-]+)\.json\.gz", f)
print(res.groups()[0])
output:
59-06
Assuming you don't know regular expressions, try searching for Python string slicing. You had the right idea splitting on '_'; continue by splitting on '.', then slice the last 5 characters of the resulting string.
f = 'tick_calculated_2_2020-05-27T11-59-06.json.gz'
splitted = f.split('_')
print(splitted)
date = splitted[3].split('.')[0]
specialNum = date[-5:]
print(specialNum)
You can use str.rfind this way:
index = s.rfind('-')
s[index - 2:index + 3]
Or use a regexp this way:
import re
re.search(r'.{5}(?=\.json)', s).group()
This uses a positive lookbehind and a positive lookahead to anchor the match between the hour part (T11-) and the .json extension.
import re
string = 'tick_calculated_2_2020-05-27T11-59-06.json.gz'
re.search(r'(?<=T\d{2}-)\d{2}-\d{2}(?=\.json)', string).group()
Output:
59-06

Regular expression to retrieve string parts within parentheses separated by commas

I have a String from which I want to take the values within the parenthesis. Then, get the values that are separated from a comma.
Example: x(142,1,23ERWA31)
I would like to get:
142
1
23ERWA31
Is it possible to get everything with one regex?
I have found a method to do so, but it is ugly.
This is how I did it in python:
import re
string = "x(142,1,23ERWA31)"
firstResult = re.search(r"\((.*?)\)", string)
secondResult = re.search(r"(?<=\()(.*?)(?=\))", firstResult.group(0))
finalResult = [x.strip() for x in secondResult.group(0).split(',')]
for i in finalResult:
    print(i)
142
1
23ERWA31
This works for your example string:
import re
string = "x(142,1,23ERWA31)"
l = re.findall (r'([^(,)]+)(?!.*\()', string)
print (l)
Result: a plain list
['142', '1', '23ERWA31']
The expression matches a run of characters that are none of "(", "," or ")" and, to prevent the leading x from being picked up, must not be followed by a ( anywhere later in the string. This also makes it work if your preamble x consists of more than a single character.
findall rather than search makes sure all items are found, and as a bonus it returns a plain list of the results.
You can make this a lot simpler. You are running your first Regex but then not taking the result. You want .group(1) (inside the brackets), not .group(0) (the whole match). Once you have that you can just split it on ,:
import re
string = "x(142,1,23ERWA31)"
firstResult = re.search(r"\((.*?)\)", string)
for e in firstResult.group(1).split(','):
    print(e)
A little wonky looking, and also assuming there's always going to be a grouping of 3 values in the parenthesis - but try this regex
\((.*?),(.*?),(.*?)\)
To extract all the group matches to a single object - your code would then look like
import re
string = "x(142,1,23ERWA31)"
firstResult = re.search(r"\((.*?),(.*?),(.*?)\)", string).groups()
You can then index the resulting firstResult tuple like a list
>>> print(firstResult[2])
23ERWA31

Python - How to use regex to find multiple words and extract them at the same time

Using Regular Expression, I want to find all the match words in a sentence and extract the wanted part in the matches words at the same time.
I use the API "findall" from "re" module to find the match words and plus the brackets to extract the parts I want.
For example I have a string "0xQQ1A, 0xWW2B, 0xEE3C, 0xQQ4C".
I only want the two characters that follow "0xQQ" or "0xWW", which should result in the list ["1A", "2B", "4C"].
Here is my code:
import re
MyString = "0xQQ1A, 0xWW2B, 0xEE3C, 0xQQ4C"
MySearch = re.compile(r"0xQQ(\w{2})|0xWW(\w{2})")
MyList = MySearch.findall(MyString)
print(MyList)
So my expected result is ["1A", "2B", "4C"].
But the actual result is [('1A', ''), ('', '2B'), ('4C', '')]
I think I might have used the combination of "()" and "|" in the wrong way.
Thx for the help!
Two different capturing groups will result in two items in the output (whatever matched each).
Instead, use a single capturing group and put your | (OR) earlier:
re.compile(r"0x(?:QQ|WW)(\w{2})")
((?:...) is a non-capturing group that matches ... - used to limit the effects of the | to only the QQ/WW split, without adding another capture to the output.)
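As a quick check, here is a short sketch that runs the single-group pattern against the string from the question (the variable names are just for illustration):

```python
import re

my_string = "0xQQ1A, 0xWW2B, 0xEE3C, 0xQQ4C"
my_search = re.compile(r"0x(?:QQ|WW)(\w{2})")

# With exactly one capturing group, findall returns a flat list of strings.
my_list = my_search.findall(my_string)
print(my_list)   # ['1A', '2B', '4C']
```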
You can try this:
import re
string = "0xQQ1A, 0xWW2B, 0xEE3C, 0xQQ4C"
pattern = re.compile(r"(0xQQ|0xWW)(\w{2})")
result = [match[2] for match in pattern.finditer(string)]
result will be:
['1A', '2B', '4C']

Cut within a pattern using Python regex

Objective: I am trying to perform a cut in Python RegEx where split doesn't quite do what I want. I need to cut within a pattern, but between characters.
What I am looking for:
I need to recognize the pattern below in a string, and split the string at the location of the pipe. The pipe isn't actually in the string, it just shows where I want to split.
Pattern: CDE|FG
String: ABCDEFGHIJKLMNOCDEFGZYPE
Results: ['ABCDE', 'FGHIJKLMNOCDE', 'FGZYPE']
What I have tried:
It seems like using split with parentheses is close, but it doesn't keep the search pattern attached to the results like I need it to.
re.split('CDE()FG', 'ABCDEFGHIJKLMNOCDEFGZYPE')
Gives,
['AB', '', 'HIJKLMNO', '', 'ZYPE']
When I actually need,
['ABCDE', 'FGHIJKLMNOCDE', 'FGZYPE']
Motivation:
Practicing with RegEx, and wanted to see if I could use RegEx to make a script that would predict the fragments of a protein digestion using specific proteases.
A non regex way would be to replace the pattern with the piped value and then split.
>>> pattern = 'CDE|FG'
>>> s = 'ABCDEFGHIJKLMNOCDEFGZYPE'
>>> s.replace('CDEFG',pattern).split('|')
['ABCDE', 'FGHIJKLMNOCDE', 'FGZYPE']
You can solve it with re.split() and positive "look arounds":
>>> re.split(r"(?<=CDE)(\w+)(?=FG)", s)
['ABCDE', 'FGHIJKLMNOCDE', 'FGZYPE']
Note that if one of the cut sequences is an empty string, you would get an empty string inside the resulting list. You can handle that manually, for example (I admit it is not that pretty):
import re
s = "ABCDEFGHIJKLMNOCDEFGZYPE"
cut_sequences = [
    ["CDE", "FG"],
    ["FGHI", ""],
    ["", "FGHI"]
]
for left, right in cut_sequences:
    items = re.split(r"(?<={left})(\w+)(?={right})".format(left=left, right=right), s)
    if not left:
        items = items[1:]
    if not right:
        items = items[:-1]
    print(items)
Prints:
['ABCDE', 'FGHIJKLMNOCDE', 'FGZYPE']
['ABCDEFGHI', 'JKLMNOCDEFGZYPE']
['ABCDE', 'FGHIJKLMNOCDEFGZYPE']
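The greedy (\w+) in the pattern above happens to work when the string contains exactly two cut sites. A hedged alternative, assuming Python 3.7 or later (where re.split accepts patterns that match an empty string) and assuming both halves of the cut sequence are non-empty, is to split on the zero-width position itself. The helper name cut is hypothetical:

```python
import re

def cut(text, left, right):
    """Split text at every position preceded by left and followed by right."""
    # Both lookarounds are zero-width, so no characters are consumed by the split.
    pattern = r"(?<={})(?={})".format(re.escape(left), re.escape(right))
    return re.split(pattern, text)

s = "ABCDEFGHIJKLMNOCDEFGZYPE"
print(cut(s, "CDE", "FG"))
# ['ABCDE', 'FGHIJKLMNOCDE', 'FGZYPE']

print(cut("XCDEFGYCDEFGZCDEFGW", "CDE", "FG"))
# ['XCDE', 'FGYCDE', 'FGZCDE', 'FGW']
```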
To keep the splitting pattern when you split with re.split, or parts of it, enclose them in parentheses.
>>> data
'ABCDEFGHIJKLMNOCDEFGZYPE'
>>> pieces = re.split(r"(CDE)(FG)", data)
>>> pieces
['AB', 'CDE', 'FG', 'HIJKLMNO', 'CDE', 'FG', 'ZYPE']
Easy enough. All the parts are there, but as you can see they have been separated. So we need to reassemble them. That's the trickier part. Look carefully and you'll see you need to join the first two pieces, the last two pieces, and the rest in triples. I simplify the code by padding the list, but you could do it with the original list (and a bit of extra code) if performance is a problem.
>>> pieces = [""] + pieces
>>> [ "".join(pieces[i:i+3]) for i in range(0,len(pieces), 3) ]
['ABCDE', 'FGHIJKLMNOCDE', 'FGZYPE']
re.split() guarantees a piece for every capturing (parenthesized) group, plus a piece for what's between. With more complex regular expressions that need their own grouping, use non-capturing groups to keep the format of the returned data the same. (Otherwise you'll need to adapt the reassembly step.)
PS. I also like Bhargav Rao's suggestion to insert a separator character in the string. If performance is not an issue, I guess it's a matter of taste.
Edit: Here's a (less transparent) way to do it without adding an empty string to the list:
pieces = re.split(r"(CDE)(FG)", data)
result = [ "".join(pieces[max(i-3,0):i]) for i in range(2,len(pieces)+2, 3) ]
A safer non-regex solution could be this:
def split(string, pattern):
    """Split the given string in the place indicated by a pipe (|) in the pattern"""
    safe_splitter = "####SPLIT_HERE####"
    safe_pattern = pattern.replace("|", safe_splitter)
    string = string.replace(pattern.replace("|", ""), safe_pattern)
    return string.split(safe_splitter)
s = "ABCDEFGHIJKLMNOCDEFGZYPE"
print(split(s, "CDE|FG"))
print(split(s, "|FG"))
print(split(s, "FGH|"))
https://repl.it/C448

Python - Parse strings with variable repeating substring

I am trying to do something which I thought would be simple (and probably is), however I am hitting a wall. I have a string that contains document numbers. In most cases the format is ######-#-### however in some cases, where the single digit should be, there are multiple single digits separated by a comma (i.e. ######-#,#,#-###). The number of single digits separated by a comma is variable. Below is an example:
For the string below:
('030421-1,2-001 & 030421-1-002,030421-1,2,3-002, 030421-1-003')
I need to return:
['030421-1-001', '030421-2-001', '030421-1-002', '030421-1-002', '030421-2-002', '030421-3-002', '030421-1-003']
I have only gotten as far as returning the strings that match the ######-#-### pattern:
import re
p = re.compile(r'\d{6}-\d{1}-\d{3}')
m = p.findall('030421-1,2-001 & 030421-1-002,030421-1,2,3-002, 030421-1-003')
print(m)
Thanks in advance for any help!
Matt
Perhaps something like this:
>>> import re
>>> s = '030421-1,2-001 & 030421-1-002,030421-1,2,3-002, 030421-1-003'
>>> it = re.finditer(r'(\b\d{6}-)(\d(?:,\d)*)(-\d{3})\b', s)
>>> for m in it:
...     a, b, c = m.groups()
...     for x in b.split(','):
...         print(a + x + c)
...
030421-1-001
030421-2-001
030421-1-002
030421-1-002
030421-2-002
030421-3-002
030421-1-003
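Or using a list comprehension (re-create the iterator first, because the loop above has already exhausted it):
>>> it = re.finditer(r'(\b\d{6}-)(\d(?:,\d)*)(-\d{3})\b', s)
>>> [a+x+c for a, b, c in (m.groups() for m in it) for x in b.split(',')]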
['030421-1-001', '030421-2-001', '030421-1-002', '030421-1-002', '030421-2-002', '030421-3-002', '030421-1-003']
Use '\d{6}-\d(,\d)*-\d{3}'.
* means "as many as you want (0 included)".
It is applied to the previous element, here '(,\d)'.
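One thing to watch for when trying this pattern with findall: because (,\d) is a capturing group, findall returns the group's content rather than the whole match. The sketch below contrasts it with a non-capturing variant, (?:,\d), which is an adjustment for illustration and not part of the suggestion above:

```python
import re

s = '030421-1,2-001 & 030421-1-002,030421-1,2,3-002, 030421-1-003'

# Capturing group: findall yields only the last repetition of the group
# (or '' when the group did not take part in the match).
captured = re.findall(r'\d{6}-\d(,\d)*-\d{3}', s)
print(captured)
# [',2', '', ',3', '']

# Non-capturing group: findall yields the whole match.
whole = re.findall(r'\d{6}-\d(?:,\d)*-\d{3}', s)
print(whole)
# ['030421-1,2-001', '030421-1-002', '030421-1,2,3-002', '030421-1-003']
```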
I wouldn't use a single regular expression to try and parse this. Since it is essentially a list of strings, you might find it easier to replace the "&" with a comma globally in the string and then use split() to put the elements into a list.
Looping over the list lets you write a single function that parses and fixes each item; you can then append each result to a new list and display the joined string.
# myfunction is a placeholder for the per-item fix-up you still need to write
string = string.replace('&', ',')
initialList = string.split(',')
newList = []
for item in initialList:
    newItem = myfunction(item)
    newList.append(newItem)
newstring = ','.join(newList)
(\d{6}-)((?:\d,?)+)(-\d{3})
We use three capturing groups. The first and last parts are matched directly. The middle part is a digit optionally followed by a comma, repeated one or more times. A repeated capturing group would keep only its last repetition, so the inner group is made non-capturing with ?: and the outer group captures the whole run. What we're left with is the following result:
>>> p = re.compile(r'(\d{6}-)((?:\d,?)+)(-\d{3})')
>>> m = p.findall('030421-1,2-001 & 030421-1-002,030421-1,2,3-002, 030421-1-003')
>>> m
[('030421-', '1,2', '-001'), ('030421-', '1', '-002'), ('030421-', '1,2,3', '-002'), ('030421-', '1', '-003')]
You'll have to manually process the 2nd term to split them up and join them, but a list comprehension should be able to do that.
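A minimal sketch of that last step, assuming the same pattern and input string as above (the names prefix, digits, suffix and expanded are illustrative):

```python
import re

p = re.compile(r'(\d{6}-)((?:\d,?)+)(-\d{3})')
s = '030421-1,2-001 & 030421-1-002,030421-1,2,3-002, 030421-1-003'

# For every (prefix, digits, suffix) tuple, emit one document number per digit.
expanded = [
    prefix + digit + suffix
    for prefix, digits, suffix in p.findall(s)
    for digit in digits.split(',')
]
print(expanded)
```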
