I am dealing with a type of ASCII file that effectively contains 4 columns of data, with each row assigned to one line of the file. Below is an example of a row of data from this file:
'STOP.F 11966.0000:STOP DEPTH'
The data is always structured so that the delimiter between the first and second column is a period, the delimiter between the second and third column is a space and the delimiter between the third and fourth column is a colon.
Ideally, I would like to find a way to return the following result from the string above
['STOP', 'F', '11966.0000', 'STOP DEPTH']
I tried using a regular expression with the period, space and colon as delimiters, but it breaks down (see example below) because I don't know how to specify the order in which to split the string, or whether there is a way to cap the number of splits per delimiter within the regular expression itself. I want it to split on the delimiters in that specific order, each delimiter at most once.
import re
line = 'STOP.F 11966.0000:STOP DEPTH'
re.split("[. :]", line)
>>> ['STOP', 'F', '11966', '0000', 'STOP', 'DEPTH']
Any suggestions on a tidy way to do this?
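For reference, the exact result above can also be produced without regular expressions by splitting on each delimiter once, in order, using str.split with maxsplit=1 (a sketch assuming every row follows the period/space/colon structure):

```python
line = 'STOP.F 11966.0000:STOP DEPTH'

# Split on each delimiter at most once, in the fixed order: '.', then ' ', then ':'
first, rest = line.split('.', 1)    # maxsplit=1 caps the number of splits
second, rest = rest.split(' ', 1)
third, fourth = rest.split(':', 1)

result = [first, second, third, fourth]
print(result)
# ['STOP', 'F', '11966.0000', 'STOP DEPTH']
```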
This may work. Credit to Juan
import re
pattern = re.compile(r'^(.+)\.(.+) (.+):(.+)$')
line = 'STOP.F 11966.0000:STOP DEPTH'
pattern.search(line).groups()
Out[6]: ('STOP', 'F', '11966.0000', 'STOP DEPTH')
A re.split() solution with a specific pattern. Note that the built-in re module only supports fixed-width lookbehinds, so the variable-width lookbehinds in this pattern require the third-party regex module (pip install regex):
import regex
s = 'STOP.F 11966.0000:STOP DEPTH'
result = regex.split(r'(?<=^[^.]+)\.|(?<=^[^ ]+) |:', s)
print(result)
The output:
['STOP', 'F', '11966.0000', 'STOP DEPTH']
Related
Let's assume I have a string as follows:
s = '23092020_indent.xlsx'
I want to extract only 'indent' from the above string. Now there are many approaches:
#Via re.split() operation
s_f = re.split('_ |. ', s) <--- This returns the whole string 's' unchanged. Not the desired output
#Via re.findall() operation
s_f = re.findall(r'[A-Za-z]', s, re.I)
s_f
['i','n','d','e','n','t','x','l','s','x']
s_f = ''.join(s_f) <---- This returns 'indentxlsx'. Not the desired output
Am I missing out anything? Or do I need to use regex at all?
P.S. In s, only the '.' delimiter is guaranteed to be constant; all the other delimiters can change.
Use os.path.splitext and then str.split:
import os
name, ext = os.path.splitext(s)
name.split("_")[1] # If the position is always fixed
Output:
"indent"
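A pathlib variant of the same idea, in case you prefer it; Path.stem strips the final extension (this still assumes the underscore position is fixed):

```python
from pathlib import Path

s = '23092020_indent.xlsx'
# .stem drops the '.xlsx' suffix, leaving '23092020_indent'
name = Path(s).stem.split("_")[1]
print(name)
# indent
```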
I LOVE regex's, so that's definitely the way I'd go.
The exactly right answer requires more information as to all possible input strings and what the right thing to extract is for each of them. Here's a solution that assumes:
one or more digits, then
a single underscore, then
a group of chars not containing a '.', then
a '.', then
anything besides a '.', but at least one char
The third part (the group of chars not containing a '.') is captured.
import re
s = '23092020_indent.xlsx'
exp = re.compile(r"^\d+_(.*?)\.[^.]+$")
m = exp.match(s)
if m:
    print(m.group(1))
Result:
indent
05-23 14:14:53.275 A:B:C
In the above case I am trying to split the text on ':' using line.split(':'), and the following output should come out:
['05-23 14:14:53.275','A','B','C']
but instead the output that comes out is
['05-23 14','14','53.275','A','B','C']
It is also splitting the timestamp. How do I exclude that from the splitting?
You are also splitting on the last space. An easy solution is to split on the last space and then split the second group:
s = '05-23 14:14:53.275 A:B:C'
front, back = s.rsplit(maxsplit=1)
[front] + back.split(':')
# ['05-23 14:14:53.275', 'A', 'B', 'C']
Split the line on whitespaces once, starting from the right:
parts = line.rsplit(maxsplit=1)
Combine the first part with the last part split on the colons:
parts[:1] + parts[-1].rsplit(":")
['05-23 14:14:53.275', 'A', 'B', 'C']
Just for fun of using walrus:
>>> s = '05-23 14:14:53.275 A:B:C'
>>> [(temp := s.rsplit(maxsplit=1))[0], *temp[1].split(':')]
['05-23 14:14:53.275', 'A', 'B', 'C']
I would suggest you use regex to split this.
([-:\s\d.]*)\s([\w:]*)
Try it in an online regex tester to see how the string is split. Once you get your regex right, you can use the groups to select which part you want and work on that.
import re
s = '05-23 14:14:53.275 A:B:C'
pattern = r'([-:\s\d.]*)\s([\w:]*)'
groups = re.compile(pattern).match(s).groups()
timestamp = groups[0]
restofthestring = groups[1]
# Now you can split the second part using split
splits = restofthestring.split(':')
I have a csv file that contains some text, among other columns. I want to tokenize (split into a list of words) this text, and I am having problems with how pd.read_csv interprets escape characters.
My csv file looks like this:
text, number
one line\nother line, 12
and the code is like follows:
df = pd.read_csv('test.csv')
word_tokenize(df.iloc[0,0])
output is:
['one', 'line\\nother', 'line']
while what I want is:
['one', 'line', 'other', 'line']
The problem is pd.read_csv() is not interpreting the \n as a newline character but as two characters (\ and n).
I've tried setting the escapechar argument to '\' and to '\\', but both just remove the slash from the string without interpreting a newline character, i.e. the string becomes 'one linenother line'.
If I explicitly set df.iloc[0,0] = 'one line\nother line', word_tokenize works just fine, because \n is actually interpreted as a newline character this time.
Ideally I would do this simply changing the way pd.read_csv() interprets the file, but other solutions are also ok.
The question is a bit confusingly worded. I guess pandas escaping the \ in the string is confusing nltk.word_tokenize. pandas.read_csv can only use one separator (or a regex, but I doubt you want that), so it will always read the text column as "one line\nother line" and escape the backslash to preserve it. If you want to further parse and format that column, you can use converters. Here's an example:
import pandas as pd
import re
df = pd.read_csv(
    "file.csv", converters={"text": lambda s: re.split("\\\\n| ", s)}
)
The above results in:
                       text  number
0  [one, line, other, line]      12
Edit: In case you need to use nltk to do the splitting (say the splitting depends on the language model), you would need to unescape the string before passing on to word_tokenize; try something like this:
lambda s: word_tokenize(s.encode('utf-8').decode('unicode_escape'))
Note: Matching lists in queries is incredibly tricky, so you might want to convert them to tuples by altering the lambda like this:
lambda s: tuple(re.split("\\\\n| ", s))
You can simply try this
import pandas as pd
df = pd.read_csv("test.csv", header=None)
df = df.apply(lambda x: x.str.replace('\\n', ' ', regex=False))
print(df.iloc[1, 0])
# output: one line other line
In your case simply use:
data = pd.read_csv('test.csv', sep='\\,', names=['c1', 'c2', 'c3', 'c4'], engine='python')
I would like to create a single regular expression in Python that extracts two interleaved portions of text from a filename as named groups. An example filename is given below:
CM00626141_H12.d4_T0001F003L01A02Z03C02.tif
The part of the filename I'd like to extract is contained between the underscores, and consists of the following:
An uppercase letter: [A-H]
A zero-padded two-digit number: 01 to 12
A period
A lowercase letter: [a-d]
A single digit: 1 to 4
For the example above, I would like one group ('Row') to contain H.d, and the other group ('Column') to contain 12.4. However, I don't know how to do this when the text is separated as it is here.
EDIT: A constraint which I omitted: it needs to be a single regex to handle the string. I've updated the text/title to reflect this point.
Regexp capturing groups (whether numbered or named) do not actually capture text - they capture starting/ending indices within the original text. Thus, it is impossible for them to capture non-contiguous text. Probably the best thing to do here is have four separate groups, and combine them into your two desired values manually.
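As a sketch of that manual combination, using four named groups (the names row1/col1/row2/col2 are just illustrative):

```python
import re

s = 'CM00626141_H12.d4_T0001F003L01A02Z03C02.tif'
pattern = re.compile(
    r'_(?P<row1>[A-H])(?P<col1>0[1-9]|1[0-2])'   # e.g. 'H' and '12'
    r'\.(?P<row2>[a-d])(?P<col2>[1-4])_'         # e.g. 'd' and '4'
)
m = pattern.search(s)
# Join the interleaved pieces into the two desired values
row = '{}.{}'.format(m.group('row1'), m.group('row2'))
column = '{}.{}'.format(m.group('col1'), m.group('col2'))
print(row, column)
# H.d 12.4
```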
You may do it in two steps using re.findall():
Step 1: Extract substring from the main string following your pattern as:
>>> import re
>>> my_file = 'CM00626141_H12.d4_T0001F003L01A02Z03C02.tif'
>>> my_content = re.findall(r'_([A-H])(0[0-9]|1[0-2])\.([a-d])([1-4])_', my_file)
# where content of my_content is: [('H', '12', 'd', '4')]
Step 2: Join tuples to get the value of row and column:
>>> row = ".".join(my_content[0][::2])
>>> row
'H.d'
>>> column = ".".join(my_content[0][1::2])
>>> column
'12.4'
I do not believe there is any way to capture everything you want in exactly two named capture groups and one regex call. The most straightforward way I see is to do the following:
>>> import re
>>> source = 'CM00626141_H12.d4_T0001F003L01A02Z03C02.tif'
>>> match = re.search(r'_([A-H])(0[0-9]|1[0-2])\.([a-d])([1-4])_', source)
>>> row, column = '.'.join(match.groups()[0::2]), '.'.join(match.groups()[1::2])
>>> row
'H.d'
>>> column
'12.4'
Alternatively, you might find it more appealing to handle the parsing almost completely in the regex:
>>> row, column = re.sub(
r'^.*_([A-H])(0[0-9]|1[0-2])\.([a-d])([1-4])_.*$',
r'\1.\3,\2.\4',
source).split(',')
>>> row, column
('H.d', '12.4')
I have a string structured like this:
"I\thave\ta\t\tstring"
And in order split by tabs I used this method:
text = [splits for splits in row.split("\t") if splits != ""]
Now this method removes all tabs from the string but I want it to remove only the first occurrence of a tab after a word so it would end up like this:
"Ihavea\tstring"
Is there a way of doing this?
Using re.split on a negative look behind assertion should do:
import re
s = ''.join(re.split(r'(?<!\t)\t', row))
print(s)
# 'Ihavea\tstring'
The assertion (?<!\t) prevents a split on a \t which was preceded by another \t.
You can use re.sub if you do not actually need the items from the split:
s = re.sub(r'(?<!\t)\t', '', row)
print(s)
# 'Ihavea\tstring'
List comprehension is also a way to go if you want to avoid to import the re module:
row = "I\thave\ta\t\tstring"
text = [splits if splits else "\t" for splits in row.split("\t")]
"".join(text)
#'Ihavea\tstring'
An empty string is falsy in a boolean context, and split() produces an empty list element for every pair of consecutive split characters ("\t" in this case).
To keep it simple you can use re.split
from re import split
text = "I\thave\ta\t\tstring"
split_string = split(r'\t+', text) #Gives ['I', 'have', 'a', 'string']
The regular expression r'\t+' basically just groups all consecutive tabs together.