Split string multiple times in Python

This should be a simple thing to do but I can't get it to work.
Say I have this string.
I want this string to be splitted into smaller strings.
And, well, I want to split it into smaller strings, but only take what is between a T and a S.
So, the result should yield
this, to be s, to s, trings
So far I've tried splitting on every S, and then up to every T (backwards). However, it will only get the first "this", and stop. How can I make it continue and get all the things that are between T's and S's?
(In this program I export the results to another text file)
matches = open('string.txt', 'r')
with open('test.txt', 'a') as file:
    for line in matches:
        test = line.split("S")
        file.write(test[0].split("T")[-1] + "\n")
matches.close()
Maybe using Regular Expressions would be better, though I don't know how to work with them too well?

You want a re.findall() call instead:
re.findall(r't[^s]*s', line, flags=re.I)
Demo:
>>> import re
>>> sample = 'I want this string to be splitted into smaller strings.'
>>> re.findall(r't[^s]*s', sample, flags=re.I)
['t this', 'tring to be s', 'tted into s', 'trings']
Note that this matches 't this' and 'tted into s'; your rules need clarification as to why those first t characters shouldn't match when 'trings' does.
It sounds as if you want to match only text between t and s without including any other t:
>>> re.findall(r't[^ts]*s', sample, flags=re.I)
['this', 'to be s', 'to s', 'trings']
Here tring in the second result and tted in the third are no longer included, because each of them is followed by a later t before the next s.
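To tie this back to the file handling in the question, here is a minimal sketch that applies the second pattern to every line of an input file and appends each match to an output file. The helper names find_t_to_s and export_matches are made up for this example, and the file names are whatever you pass in (the question uses string.txt and test.txt).

```python
import re

# Same pattern as in the answer: a t, then anything except t or s, then an s.
T_TO_S = re.compile(r't[^ts]*s', flags=re.I)

def find_t_to_s(line):
    """Return every run of text that starts at a t and ends at the next s."""
    return T_TO_S.findall(line)

def export_matches(src_path, dst_path):
    """Append each match found in src_path to dst_path, one match per line."""
    with open(src_path, 'r') as src, open(dst_path, 'a') as dst:
        for line in src:
            for match in find_t_to_s(line):
                dst.write(match + '\n')

print(find_t_to_s('I want this string to be splitted into smaller strings.'))
```

Opening both files in a single with statement means neither needs an explicit close() call.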

Related

Replace Items with regex (kind of)

I am handed a bunch of data and trying to get rid of certain characters. The data contains multiple instances of "^{number}" → "^0", "^1", "^2", etc.
I am trying to set all of these instances to an empty string, "", is there a better way to do this than
string.replace("^0", "").replace("^1", "").replace("^2", "")
I understand you can use a dictionary, but it seems a little overkill considering each item will be replaced with "".

Assuming the digits are always at the end of the string, have a look at the solutions below.
With regex:
import re
text = 'xyz125'
s = re.sub(r"\d+$", '', text)
print(s)
This prints:
xyz
Without regex; keep in mind that this solution removes all digits, not only the ones at the end of the string:
text = 'xyz125'
result = ''.join(i for i in text if not i.isdigit())
print(result)
This prints:
xyz
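If the goal is what the question literally describes, removing every "^" followed by a number wherever it occurs, a single re.sub call with an escaped caret may be enough. This is only a sketch; the sample text below is invented for illustration.

```python
import re

# Invented sample data containing ^0, ^1 and ^2 markers.
text = "abc^0def^1ghi^2"

# \^ matches a literal caret; \d+ matches the number that follows it.
cleaned = re.sub(r"\^\d+", "", text)
print(cleaned)
```

Use \d instead of \d+ if only a single digit should be removed after each caret.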

Split sentences based on different patterns in Python 3

I need to split strings based on a sequence of regex patterns. I can apply each split individually, but the issue is applying the next pattern to the sentences produced by the previous one.
For example I have this sentence:
"I want to be splitted using different patterns. It is a complex task, and not easy to solve; so, I would need help."
I would need to split the sentence based on ",", ";" and ".".
The result should be 5 sentences like:
"I want to be splitted using different patterns."
"It is a complex task,"
"and not easy to solve;"
"so,"
"I would need help."
My code so far:
import re
sample_sentence = "I want to be splitted using different patterns. It is a complex task, and not easy to solve; so, I would need help."
patterns = [re.compile(r'(?<=\.) '),
            re.compile(r'(?<=,) '),
            re.compile(r'(?<=;) ')]
for pattern in patterns:
    splitted_sentences = pattern.split(sample_sentence)
    print(f'Pattern used: {pattern}')
How can I apply the different patterns without losing the results and get the expected result?
Edit: I need to run each pattern one by one, as I need to do some checks in the result of every pattern, so running it in some sort of tree algorithm. Sorry for not explaining entirely, in my head it was clear, but I did not think it would have side effects.

You can join each pattern with |:
import re
s = "I want to be splitted using different patterns. It is a complex task, and not easy to solve; so, I would need help."
result = re.split(r'(?<=\.)\s|,\s*|;\s*', s)
Output:
['I want to be splitted using different patterns.', 'It is a complex task', 'and not easy to solve', 'so', 'I would need help.']

Python has this in the re module.
Try (note that the period has to be escaped, otherwise it matches any character):
re.split(r'; |, |\. ', ourString)

I can't think of a single regex to do this. So, what you can do is replace all the different types of delimiters with a custom-defined delimiter, say $DELIMITER$, and then split your sentence on this delimiter.
new_sent = re.sub('[.,;]', '$DELIMITER$', sent)
new_sent.split('$DELIMITER$')
This will result in the following:
['I want to be splitted using different patterns',
' It is a complex task',
' and not easy to solve',
' so',
' I would need help',
'']
NOTE: The above output has an additional empty string. This is because there is a period at the end of the sentence. To avoid this, you can either remove that empty element from the list or you can substitute the custom defined delimiter if it occurs at the end of the sentence.
new_sent = re.sub('[.,;]', '$DELIMITER$', sent)
new_sent = re.sub(r'\$DELIMITER\$$', '', new_sent)
new_sent.split('$DELIMITER$')
If you have a list of delimiters, you can build the regex pattern with the following code:
delimiter_list = [',', '.', ':', ';']
pattern = '[' + ''.join(delimiter_list) + ']'  # results in [,.:;]
new_sent = re.sub(pattern, '$DELIMITER$', sent)
new_sent = re.sub(r'\$DELIMITER\$$', '', new_sent)
new_sent.split('$DELIMITER$')
I hope this helps!!!
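One caveat when building a character class from a list: delimiters such as ] or - have a special meaning inside [...], so it is probably safer to pass each one through re.escape. The sketch below also splits directly with re.split instead of going through a placeholder, and drops empty pieces. The delimiter list and the sample string are invented for illustration.

```python
import re

# Invented delimiter list, including two characters that are special
# inside a character class.
delimiter_list = [',', '.', ':', ';', '-', ']']

# Escape each delimiter before joining them into one character class.
pattern = '[' + ''.join(re.escape(d) for d in delimiter_list) + ']'

sent = "a-b]c, d."
# Strip surrounding whitespace and discard empty pieces
# (for example the one after a trailing period).
parts = [p.strip() for p in re.split(pattern, sent) if p.strip()]
print(parts)
```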

Use a lookbehind with a character class:
import re
s = "I want to be splitted using different patterns. It is a complex task, and not easy to solve; so, I would need help."
result = re.split(r'(?<=[.,;])\s', s)
print(result)
Output:
['I want to be splitted using different patterns.',
'It is a complex task,',
'and not easy to solve;',
'so,',
'I would need help.']
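If the patterns really have to run one at a time, as the edit to the question asks, one possible approach is to keep a list of fragments and re-split every fragment on each pass. This is a rough sketch using the patterns from the question; the name fragments is introduced here, and the per-pattern checks are left as a comment.

```python
import re

sample_sentence = ("I want to be splitted using different patterns. "
                   "It is a complex task, and not easy to solve; so, I would need help.")

patterns = [re.compile(r'(?<=\.) '),
            re.compile(r'(?<=,) '),
            re.compile(r'(?<=;) ')]

# Start with the whole sentence and refine it one pattern at a time.
fragments = [sample_sentence]
for pattern in patterns:
    # Apply the current pattern to every fragment left by the previous pass.
    fragments = [piece
                 for fragment in fragments
                 for piece in pattern.split(fragment)]
    # Any checks on the intermediate result for this pattern could go here.

print(fragments)
```

Because each pass works on the output of the previous one, no earlier split is lost.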

How to capture specific character arrangements or a word from a line using python

I need to read the following sample line and grab specific words from it.
Sample line:
#apple (orange3ball/345-35:;bat9cap/253-43) school=(book,pen,bottle)
Let's say I want to grab the words 'orange3ball' (between '(' and '/'), 'bat9cap' and 'bottle'. What is the best way to do it?
I tried the split() function but couldn't get it to work properly.
If that is too difficult, can I search for a specific arrangement of characters in a line?
For example, can I find the 'bat9cap' character arrangement in the above line?

This is a job for the interactive shell! Make a variable containing the line in question and experiment away. Here I did it for you to show you one slightly convoluted way to "grab" the word between ( and /.
>>> line = "#apple (orange3ball/345-35:;bat9cap/253-43) school=(book,pen,bottle)"
>>> line.split()
['#apple', '(orange3ball/345-35:;bat9cap/253-43)', 'school=(book,pen,bottle)']
>>> line.split()[1]
'(orange3ball/345-35:;bat9cap/253-43)'
>>> line.split()[1].split("/")
['(orange3ball', '345-35:;bat9cap', '253-43)']
>>> line.split()[1].split("/")[0]
'(orange3ball'
>>> line.split()[1].split("/")[0].strip("(")
'orange3ball'
Notice that I just pressed uparrow to get the code I used last and appended some stuff to it. The last line is rather unreadable though, so after finding something that works you may want to break it into several lines and use some nicely named variables to store the intermediate results.
The ideal way to do it depends on which aspects of the line you can depend on always being like they are here. (E.g. if the #apple part is optional so that it may not be there at all.) You may need to split on different characters or index into the resulting lists from the end of the list using negative indices (e.g. mylist[-1] to get the last item).
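For comparison, the same words can probably be pulled out with regular expressions. The sketch below assumes the layout of the sample line holds in general: the wanted words directly follow a ( or a ; and end at a /, and the last word sits inside the parentheses after school=. The variable names are invented for this example.

```python
import re

line = "#apple (orange3ball/345-35:;bat9cap/253-43) school=(book,pen,bottle)"

# Words that directly follow '(' or ';' and run up to the next '/'.
before_slash = re.findall(r'[(;](\w+)/', line)
print(before_slash)

# Contents of the parentheses after 'school=', split on commas.
school_match = re.search(r'school=\(([^)]*)\)', line)
school_items = school_match.group(1).split(',') if school_match else []
print(school_items[-1] if school_items else None)
```

If the real lines deviate from this layout, the patterns will need adjusting.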

Use in to test membership:
>>> s='#apple (orange3ball/345-35:;bat9cap/253-43) school=(book,pen,bottle)'
>>> 'orange3ball' in s
True
>>> 'orange4ball' in s
False
>>> 'bat9cap' in s
True
>>> 'bat9ball' in s
False
You can also use a regex to break apart on word boundaries:
>>> import re
>>> re.findall(r'(?:\W*(\w+))', s)
['apple', 'orange3ball', '345', '35', 'bat9cap', '253', '43', 'school', 'book', 'pen', 'bottle']
The advantage of the second method is that only whole words end up in the resulting list, so a substring of a word does not count as a match:
>>> 'or' in s
True
>>> 'or' in re.findall(r'(?:\W*(\w+))', s)
False
Or just use a single regex to test for the whole word:
>>> re.search(r'\borange3ball\b', s)
<_sre.SRE_Match object; span=(8, 19), match='orange3ball'>
>>> re.search(r'\borange\b', s)
>>>
(A returned match object means the pattern matched; when nothing matches, re.search returns None and the shell prints nothing.)

Difference between re.search() and re.findall()

The following code is very strange:
>>> words = "4324324 blahblah"
>>> print(re.findall(r'(\s)\w+', words))
[' ']
>>> print(re.search(r'(\s)\w+', words).group())
 blahblah
The () operator seems to behave poorly with findall. Why is this? I need it for a csv file.
Edit for clarity: I want to display blahblah using findall.
I discovered that re.findall(r'\s(\w+)', words) does what I want, but have no idea why findall treats groups in this way.

One character off:
>>> print(re.search(r'(\s)\w+', words).groups())
(' ',)
>>> re.search(r'(\s)\w+', words).group(1)
' '
When the pattern contains a capturing group, findall returns a list of the captured groups instead of the whole matches. You're getting a space back because that's what you capture. Stop capturing, and it works fine:
>>> print(re.findall(r'\s\w+', words))
[' blahblah']
For the CSV file itself, use the csv module.

If you prefer to keep the capturing groups in your regex, but you still want to find the entire contents of each match instead of the groups, you can use the following:
[m.group() for m in re.finditer(r'(\s)\w+', words)]
For example:
>>> [m.group() for m in re.finditer(r'(\s)\w+', '4324324 blahblah')]
[' blahblah']
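To make the rule concrete, here is a small side-by-side of what re.findall returns depending on how many capturing groups the pattern has. It reuses the string from the question; the variable names are only for illustration.

```python
import re

words = "4324324 blahblah"

# No groups: findall returns the whole match.
no_groups = re.findall(r'\s\w+', words)

# One capturing group: findall returns only that group.
one_group = re.findall(r'(\s)\w+', words)

# Several capturing groups: findall returns one tuple per match.
two_groups = re.findall(r'(\s)(\w+)', words)

# A non-capturing group (?:...) groups without capturing,
# so the whole match comes back again.
non_capturing = re.findall(r'(?:\s)\w+', words)

print(no_groups, one_group, two_groups, non_capturing)
```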

Parsing and reformatting CSV/text data using Python

Sorry if this is a bit of a beginner's question, but I haven't had much experience with Python and could really use some help figuring this out. If there is a better programming language for tackling this, I'd be more than open to hearing it.
I'm working on a small project, and I have two blocks of data, formatted differently from each other. They're all spreadsheets saved as CSV files, and I'd really like to make one group match the other without having to manually edit all the data.
What I need to do is go through a CSV, and format any data saved like this:
10W
20E
15-16N
17-18S
To a format like this (respective line to respective format):
10,W
20,E
,,15,16,N
,,17,18,S
That way they can just be copied over when opened as spreadsheets.
I'm able to get the files into a string in Python, but I'm unsure how to properly write something that searches for a number-hyphen-number-letter format.
I'd be immensely grateful for any help I can get. Thanks.

This sounds like a good use case for regular expressions. Once you've split the lines up into individual strings and stripped the whitespace (using s.strip()), these should work. (I'm assuming those are cardinal directions; you'll need to change [NESW] to something else if that assumption is incorrect.)
>>> import re
>>> re.findall(r'\A(\d+)([NESW])', '16N')
[('16', 'N')]
>>> re.findall(r'\A(\d+)([NESW])', '15-16N')
[]
>>> re.findall(r'\A(\d+)-(\d+)([NESW])', '15-16N')
[('15', '16', 'N')]
>>> re.findall(r'\A(\d+)-(\d+)([NESW])', '16N')
[]
The first regex '\A(\d+)([NESW])' matches only a string that begins with a sequence of digits followed by a capital letter N, E, S, or W. The second matches only a string that begins with a sequence of digits followed by a hyphen, followed by another sequence of digits, followed by a capital letter N, E, S, or W. Forcing it to match at the beginning ensures that these regexes don't match a suffix of a longer string.
Then you can do something like this:
>>> vals = re.findall(r'\A(\d+)([NESW])', '16N')[0]
>>> ','.join(vals)
'16,N'
>>> vals = re.findall(r'\A(\d+)-(\d+)([NESW])', '15-16N')[0]
>>> ',,' + ','.join(vals)
',,15,16,N'

This is a whole solution that uses regexes. @senderle has beaten me to the answer, so feel free to tick his response. This is just added here as I know how difficult it was to wrap my head around re in my code at first.
import re
dash = re.compile(r'(\d{2})-(\d{2})([WENS])')
no_dash = re.compile(r'(\d{2})([WENS])')
raw = '''10W
20E
15-16N
17-18S'''
lines = raw.split('\n')
data = []
for l in lines:
    if '-' in l:
        match = dash.search(l).groups()
        data.append(',,%s,%s,%s' % (match[0], match[1], match[2]))
    else:
        match = no_dash.search(l).groups()
        data.append('%s,%s' % (match[0], match[1]))
print('\n'.join(data))

In your case, I think the quick solution would involve regexps.
You can either use the match method to extract the different tokens when they match a given regular expression, or the split method to split your string into tokens given a separator.
However, since the separator in your case would be a single character, you can also use the split method of the str class.
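A rough sketch of that str.split idea, assuming the separator meant here is the hyphen and that every value ends in exactly one direction letter. The helper name reformat_cell is made up for this example.

```python
def reformat_cell(cell):
    """Turn '10W' into '10,W' and '15-16N' into ',,15,16,N'."""
    cell = cell.strip()
    # Assumption: the last character is always the direction letter.
    numbers, direction = cell[:-1], cell[-1]
    parts = numbers.split('-')
    fields = ','.join(parts + [direction])
    # Ranges are shifted two columns to the right, as in the question.
    return ',,' + fields if len(parts) == 2 else fields

for value in ['10W', '20E', '15-16N', '17-18S']:
    print(reformat_cell(value))
```

This does no validation, so a regex-based version is the safer choice when the input may contain other formats.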
