Splitting text on ':' but excluding the timestamp in Python

05-23 14:14:53.275 A:B:C
In the above case I am trying to split the text on ':' using line.split(':'), and the output should be
['05-23 14:14:53.275','A','B','C']
but instead the output is
['05-23 14','14','53.275','A','B','C']
It is also splitting the timestamp.
How do I exclude the timestamp from the split?

line.split(':') splits on every colon, including the ones inside the timestamp. An easy solution is to split on the last space first and then split the second group on ':':
s = '05-23 14:14:53.275 A:B:C'
front, back = s.rsplit(maxsplit=1)
[front] + back.split(':')
# ['05-23 14:14:53.275', 'A', 'B', 'C']

Split the line on whitespaces once, starting from the right:
parts = line.rsplit(maxsplit=1)
Combine the first part with the last part split on the colons:
parts[:1] + parts[-1].split(":")
['05-23 14:14:53.275', 'A', 'B', 'C']
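
If the part after the timestamp can itself contain spaces, splitting from the right would cut in the wrong place. A minimal sketch of the same idea from the left, assuming the timestamp always consists of exactly two whitespace-separated tokens (date and time); the name split_line is only illustrative:

```python
def split_line(line):
    # Assumes the line starts with "<date> <time> " followed by the payload.
    date, time, rest = line.split(maxsplit=2)
    # Re-join the timestamp and split only the payload on colons.
    return [f"{date} {time}", *rest.split(":")]


print(split_line("05-23 14:14:53.275 A:B:C"))
# ['05-23 14:14:53.275', 'A', 'B', 'C']
print(split_line("05-23 14:14:53.275 A:B C:D"))
# ['05-23 14:14:53.275', 'A', 'B C', 'D']
```

A line with fewer than three tokens raises ValueError in this sketch, which may or may not be what you want for malformed input.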

Just for the fun of using the walrus operator (Python 3.8+):
>>> s = '05-23 14:14:53.275 A:B:C'
>>> [(temp := s.rsplit(maxsplit=1))[0], *temp[1].split(':')]
['05-23 14:14:53.275', 'A', 'B', 'C']

I would suggest you use regex to split this.
([-:\s\d.]*)\s([\w:]*)
Try it in an online regex tester to see how it splits. Once you get your regex right, you can use the groups to select the part you want and work on that.
import re
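line = '05-23 14:14:53.275 A:B:C'  # avoid naming a variable str, which shadows the built-in
regex = r'([-:\s\d.]*)\s([\w:]*)'  # raw string so the backslashes reach the regex engine unchanged
groups = re.compile(regex).match(line).groups()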
timestamp = groups[0]
restofthestring = groups[1]
# Now you can split the second part using split
splits = restofthestring.split(':')
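
The pattern above is fairly permissive. A stricter sketch, assuming every timestamp has the exact shape MM-DD HH:MM:SS.mmm seen in the sample line; the names LINE_RE and split_log_line are hypothetical, and lines that do not match return None instead of raising:

```python
import re

# Assumed timestamp layout: "MM-DD HH:MM:SS.mmm", then whitespace, then the payload.
LINE_RE = re.compile(r'^(\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3})\s+(.*)$')


def split_log_line(line):
    match = LINE_RE.match(line)
    if match is None:
        # Not a line in the assumed format.
        return None
    timestamp, rest = match.groups()
    # Only the payload is split on colons; the timestamp stays intact.
    return [timestamp, *rest.split(':')]


print(split_log_line('05-23 14:14:53.275 A:B:C'))
# ['05-23 14:14:53.275', 'A', 'B', 'C']
```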

Related

Regex multiple parenthesis and remove one with specific pattern

I have multiple parentheses and want to remove the parentheses that have at least one number in.
I have tried the following. However, since the match is greedy, it removes everything from the first open parenthesis to the last close parenthesis. I have also tried to defeat the greediness by excluding an open parenthesis, but that did not work either.
import pandas as pd
names = ['d((123))', 'd(1a)(ab)', 'd(1a)(ab)(123)']
data = pd.DataFrame(names, columns = ['name'])
print(data.name.str.replace(r"\(.*?\d+.*?\)", "", regex=True))
# Output: ['d)', 'd(ab)', 'd']
print(data.name.str.replace(r"\((?!\().*[\d]+(?!\().*\)", "", regex=True))
# Output: ['d(', 'd', 'd']
# desired output: ['d', 'd(ab)', 'd(ab)']
This regex seems valid: \([^)\d]*?\d+[^)]*?\)+
>>> import re
>>> pattern = r'\([^)\d]*?\d+[^)]*?\)+'
>>> names = ['d((123))', 'd(1a)(ab)', 'd(1a)(ab)(123)']
>>> [re.sub(pattern, '', x) for x in names]
['d', 'd(ab)', 'd(ab)']
I don't know if there are more complex cases but for those that you've supplied and similar, it should do the trick.
Although Python's built-in re module does not support recursive patterns, the third-party regex module does. Install it with:
pip install regex
Then you can write something like:
import regex
names = ['d((123))', 'd(1a)(ab)', 'd(1a)(ab)(123)']
pattern = r'\((?:[^()]*?\d[^()]*?|(?R))+\)'
print ([regex.sub(pattern, '', x) for x in names])
Output:
['d', 'd(ab)', 'd(ab)']
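
If installing a third-party module is not an option, the same result can be approximated without any regex by tracking nesting depth. This is only a sketch under one assumption: a top-level parenthesized group is dropped whenever it contains a digit anywhere inside it, nested or not. The function name drop_groups_with_digits is hypothetical.

```python
def drop_groups_with_digits(text):
    out = []
    depth = 0   # current parenthesis nesting depth
    start = 0   # index where the current top-level group began
    for i, ch in enumerate(text):
        if ch == '(':
            if depth == 0:
                start = i
            depth += 1
        elif ch == ')' and depth > 0:
            depth -= 1
            if depth == 0:
                group = text[start:i + 1]
                # Keep the whole group only if it contains no digit.
                if not any(c.isdigit() for c in group):
                    out.append(group)
        elif depth == 0:
            # Characters outside any group are copied through unchanged.
            out.append(ch)
    return ''.join(out)


names = ['d((123))', 'd(1a)(ab)', 'd(1a)(ab)(123)']
print([drop_groups_with_digits(x) for x in names])
# ['d', 'd(ab)', 'd(ab)']
```

An unclosed parenthesis causes the rest of the string to be dropped in this sketch, so balanced input is assumed.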

Splitting based on particular pattern and editing string

I am trying to split a string based on a particular pattern in an effort to rejoin it later after adding a few characters.
Here's a sample of my string: "123\babc\b:123" which I need to convert to "123\babc\\"b\":123". I need to do it several times in a long string. I have tried variations of the following:
regex = r"(\\b[a-zA-Z]+)\\b:"
test_str = "123\\babc\\b:123"
x = re.split(regex, test_str)
but it doesn't split at the right positions for me to join. Is there another way of doing this/another way of splitting and joining?
You're right, you can do it with re.split as suggested. You can split on \b and then rebuild your output with a specific separator (keeping the \b where you want it).
Here is an example:
# Import module
import re
string = "123\\babc\\b:123"
# Split on the literal two-character sequence "\b"
list_sliced = re.split(r'\\b', string)
print(list_sliced)
# ['123', 'abc', ':123']
# Define your custom separator
custom_sep = '\\\\"b\\"'
# Build your new output
output = list_sliced[0]
# Iterate over each word
for i, word in enumerate(list_sliced[1:]):
    # Choose the separator according to parity (since we don't want to change the first "\b")
    sep = "\\\\b"
    if i % 2 == 1:
        sep = custom_sep
    # Update output
    output += sep + word
print(output)
# 123\\babc\\"b\":123
Maybe, the following expression,
^([\\]*)([^\\]+)([\\]*)([^\\]+)([\\]*)([^:]+):(.*)$
and a replacement of,
\1\2\3\4\5\\"\6\\":\7
with a re.sub might return our desired output.
The expression is explained on the top right panel of this demo if you wish to explore/simplify/modify it.
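
As a sketch of how that expression and replacement could be passed to re.sub in Python, assuming the input holds literal backslash characters as in the question's test_str (the variable names are only illustrative):

```python
import re

# Pattern and replacement copied from the answer above, written as raw strings.
pattern = r'^([\\]*)([^\\]+)([\\]*)([^\\]+)([\\]*)([^:]+):(.*)$'
replacement = r'\1\2\3\4\5\\"\6\\":\7'

test_str = "123\\babc\\b:123"   # the characters 123\babc\b:123

result = re.sub(pattern, replacement, test_str)
print(result)
# 123\babc\\"b\":123
```

Because the pattern is anchored with ^ and $, it rewrites one occurrence per string; applying it several times within a long string would need an unanchored variant.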

Removing trailing white spaces after text in python

Let's say that I have a file with several SQL queries, all separated by ";". If I put the contents of the file in a string and do:
with open(query_file) as f:
    query = f.read()
# no f.close() needed: the with block closes the file
queries = query.split(';')
I'll get a list where each item is one of the queries, which is my final objective. However, how do I remove the last item if it is only spaces, newlines, tabs or an empty string? Can it be done via .split()? I want to avoid things like this:
>>> a = 'a;b;'
>>> a.split(';')
['a', 'b', '']
Or this (this is a bad example, but you get the idea):
>>> a = '''a;b;\n'''
>>> a.split(';')
['a', 'b', '\n']
Thanks!
EDIT: I'm open to other approaches as well, basically separating the string into all the individual queries.
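
str.split() alone cannot drop whitespace-only items, but a list comprehension over its result can. A minimal sketch, assuming no query contains a ";" inside a string literal or comment (a naive split would cut those too); split_queries is a hypothetical helper name:

```python
def split_queries(text):
    # Strip surrounding whitespace and skip fragments that are empty afterwards.
    return [q.strip() for q in text.split(";") if q.strip()]


print(split_queries("a;b;"))
# ['a', 'b']
print(split_queries("a;b;\n"))
# ['a', 'b']
```

This also removes empty fragments in the middle of the text (for example from ";;"), not only the trailing one.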

Splitting String with Multiple Delimiters in a Particular Order

I am dealing with a type of ASCII file that effectively holds 4 columns of data, with each row on its own line of the file. Below is an example of a row of data from this file
'STOP.F 11966.0000:STOP DEPTH'
The data is always structured so that the delimiter between the first and second column is a period, the delimiter between the second and third column is a space and the delimiter between the third and fourth column is a colon.
Ideally, I would like to find a way to return the following result from the string above
['STOP', 'F', '11966.0000', 'STOP DEPTH']
I tried using a regular expression with the period, space and colon as delimiters, but it breaks down (see the example below): I don't know how to specify the order in which the delimiters apply, or whether the regular expression itself can limit the number of splits per delimiter. I want to split on the delimiters in that specific order, each at most once.
import re
line = 'STOP.F 11966.0000:STOP DEPTH'
re.split("[. :]", line)
# ['STOP', 'F', '11966', '0000', 'STOP', 'DEPTH']
Any suggestions on a tidy way to do this?
This may work. Credit to Juan
import re
pattern = re.compile(r'^(.+)\.(.+) (.+):(.+)$')
line = 'STOP.F 11966.0000:STOP DEPTH'
pattern.search(line).groups()
Out[6]: ('STOP', 'F', '11966.0000', 'STOP DEPTH')
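A split-based solution with a specific pattern. The lookbehinds here are variable-width, which the standard re module rejects ("look-behind requires fixed-width pattern"), so this needs the third-party regex module (pip install regex):
import regex
s = 'STOP.F 11966.0000:STOP DEPTH'
result = regex.split(r'(?<=^[^.]+)\.|(?<=^[^ ]+) |:', s)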
print(result)
The output:
['STOP', 'F', '11966.0000', 'STOP DEPTH']
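
The "each delimiter at most once, in a fixed order" requirement can also be met with the standard library alone by chaining str.partition, which splits on the first occurrence only. A minimal sketch; the function name split_row and the local variable names are hypothetical, chosen just for readability:

```python
def split_row(line):
    # Each partition call splits on the first occurrence of its delimiter only.
    mnemonic, _, rest = line.partition('.')
    unit, _, rest = rest.partition(' ')
    value, _, description = rest.partition(':')
    return [mnemonic, unit, value, description]


print(split_row('STOP.F 11966.0000:STOP DEPTH'))
# ['STOP', 'F', '11966.0000', 'STOP DEPTH']
```

If a delimiter is missing, partition returns empty strings for the remaining parts rather than raising, so malformed rows would need an explicit check.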

How to split a string by tabs but only once per occurrence

I have a string structured like this:
"I\thave\ta\t\tstring"
And in order to split by tabs I used this method:
text = [splits for splits in row.split("\t") if splits != ""]
This removes every tab from the string, but I want to remove only the first tab after each word, so that it ends up like this:
"Ihavea\tstring"
Is there a way of doing this?
Using re.split with a negative lookbehind assertion should do:
import re
s = ''.join(re.split(r'(?<!\t)\t', row))
print(s)
# 'Ihavea\tstring'
The assertion (?<!\t) prevents a split on a \t which was preceded by another \t.
You can use re.sub if you do not actually need the items from the split:
s = re.sub(r'(?<!\t)\t', '', row)
print(s)
# 'Ihavea\tstring'
A list comprehension is also an option if you want to avoid importing the re module:
row = "I\thave\ta\t\tstring"
text = [splits if splits else "\t" for splits in row.split("\t")]
"".join(text)
#'Ihavea\tstring'
An empty string is false in a boolean context, and split produces an empty list element between every pair of consecutive split characters ("\t" in this case), so each empty element stands for one extra tab.
To keep it simple you can use re.split. Note that this drops every tab, including the extra one in a run, so it yields the words only:
from re import split
text = "I\thave\ta\t\tstring"
split_string = split(r'\t+', text) #Gives ['I', 'have', 'a', 'string']
The regular expression r'\t+' treats each run of one or more consecutive tabs as a single separator.
