Python split string on quotes

Python split string on quotes - python

I'm a python learner. If I have a lines of text in a file that looks like this
"Y:\DATA\00001\SERVER\DATA.TXT" "V:\DATA2\00002\SERVER2\DATA2.TXT"
Can I split the lines around the inverted commas? The only constant would be their position in the file relative to the data lines themselves. The data lines could range from 10 to 100+ characters (they'll be nested network folders). I cannot see how I can use any other way to do those markers to split on, but my lack of python knowledge is making this difficult.
I've tried
optfile=line.split("")
and other variations but keep getting valueerror: empty seperator. I can see why it's saying that, I just don't know how to change it. Any help is, as always very appreciated.
Many thanks

You must escape the ":
input.split("\"")
results in
['\n',
'Y:\\DATA\x0001\\SERVER\\DATA.TXT',
' ',
'V:\\DATA2\x0002\\SERVER2\\DATA2.TXT',
'\n']
To drop the resulting empty lines:
[line for line in [line.strip() for line in input.split("\"")] if line]
results in
['Y:\\DATA\x0001\\SERVER\\DATA.TXT', 'V:\\DATA2\x0002\\SERVER2\\DATA2.TXT']

I'll just add that if you were dealing with lines that look like they could be command line parameters, then you could possibly take advantage of the shlex module:
import shlex
with open('somefile') as fin:
for line in fin:
print shlex.split(line)
Would give:
['Y:\\DATA\\00001\\SERVER\\DATA.TXT', 'V:\\DATA2\\00002\\SERVER2\\DATA2.TXT']

No regex, no split, just use csv.reader
import csv
sample_line = '10.0.0.1 foo "24/Sep/2015:01:08:16 +0800" www.google.com "GET /" -'
def main():
for l in csv.reader([sample_line], delimiter=' ', quotechar='"'):
print l
The output is
['10.0.0.1', 'foo', '24/Sep/2015:01:08:16 +0800', 'www.google.com', 'GET /', '-']

shlex module can help you.
import shlex
my_string = '"Y:\DATA\00001\SERVER\DATA.TXT" "V:\DATA2\00002\SERVER2\DATA2.TXT"'
shlex.split(my_string)
This will spit
['Y:\\DATA\x0001\\SERVER\\DATA.TXT', 'V:\\DATA2\x0002\\SERVER2\\DATA2.TXT']
Reference: https://docs.python.org/2/library/shlex.html

Finding all regular expression matches will do it:
input=r'"Y:\DATA\00001\SERVER\DATA.TXT" "V:\DATA2\00002\SERVER2\DATA2.TXT"'
re.findall('".+?"', # or '"[^"]+"', input)
This will return the list of file names:
["Y:\DATA\00001\SERVER\DATA.TXT", "V:\DATA2\00002\SERVER2\DATA2.TXT"]
To get the file name without quotes use:
[f[1:-1] for f in re.findall('".+?"', input)]
or use re.finditer:
[f.group(1) for f in re.finditer('"(.+?)"', input)]

The following code splits the line at each occurrence of the inverted comma character (") and removes empty strings and those consisting only of whitespace.
[s for s in line.split('"') if s.strip() != '']
There is no need to use regular expressions, an escape character, some module or assume a certain number of whitespace characters between the paths.
Test:
line = r'"Y:\DATA\00001\SERVER\DATA.TXT" "V:\DATA2\00002\SERVER2\DATA2.TXT"'
output = [s for s in line.split('"') if s.strip() != '']
print(output)
>>> ['Y:\\DATA\\00001\\SERVER\\DATA.TXT', 'V:\\DATA2\\00002\\SERVER2\\DATA2.TXT']

I think what you want is to extract the filepaths, which are separated by spaces. That is you want to split the line about items contained within quotations. I.e with a line
"FILE PATH" "FILE PATH 2"
You want
["FILE PATH","FILE PATH 2"]
In which case:
import re
with open('file.txt') as f:
for line in f:
print(re.split(r'(?<=")\s(?=")',line))
With file.txt:
"Y:\DATA\00001\SERVER\DATA MINER.TXT" "V:\DATA2\00002\SERVER2\DATA2.TXT"
Outputs:
>>>
['"Y:\\DATA\\00001\\SERVER\\DATA MINER.TXT"', '"V:\\DATA2\\00002\\SERVER2\\DATA2.TXT"']

This was my solution. It parses most sane input exactly the same as if it was passed into the command line directly.
import re
def simpleParse(input_):
def reduce_(quotes):
return '' if quotes.group(0) == '"' else '"'
rex = r'("[^"]*"(?:\s|$)|[^\s]+)'
return [re.sub(r'"{1,2}',reduce_,z.strip()) for z in re.findall(rex,input_)]
Use case: Collecting a bunch of single shot scripts into a utility launcher without having to redo command input much.
Edit:
Got OCD about the stupid way that the command line handles crappy quoting and wrote the below:
import re
tokens = list()
reading = False
qc = 0
lq = 0
begin = 0
for z in range(len(trial)):
char = trial[z]
if re.match(r'[^\s]', char):
if not reading:
reading = True
begin = z
if re.match(r'"', char):
begin = z
qc = 1
else:
begin = z - 1
qc = 0
lc = begin
else:
if re.match(r'"', char):
qc = qc + 1
lq = z
elif reading and qc % 2 == 0:
reading = False
if lq == z - 1:
tokens.append(trial[begin + 1: z - 1])
else:
tokens.append(trial[begin + 1: z])
if reading:
tokens.append(trial[begin + 1: len(trial) ])
tokens = [re.sub(r'"{1,2}',lambda y:'' if y.group(0) == '"' else '"', z) for z in tokens]

I know this got answered a million year ago, but this works too:
input = '"Y:\DATA\00001\SERVER\DATA.TXT" "V:\DATA2\00002\SERVER2\DATA2.TXT"'
input = input.replace('" "','"').split('"')[1:-1]
Should output it as a list containing:
['Y:\\DATA\x0001\\SERVER\\DATA.TXT', 'V:\\DATA2\x0002\\SERVER2\\DATA2.TXT']

My question Python - Error Caused by Space in argv Arument was marked as a duplicate of this one. We have a number of Python books doing back to Python 2.3. The oldest referred to using a list for argv, but with no example, so I changed things to:-
repoCmd = ['Purchaser.py', 'task', repoTask, LastDataPath]
SWCore.main(repoCmd)
and in SWCore to:-
sys.argv = args
The shlex module worked but I prefer this.

Related

Python: how to avoid the space in dna calculated?

I am using python 2.7.
I want to find the DNA length. I have no idea where is the mistake.....The length of DNA supposed to be 283, but it comes up with 345.
The sequence in a single line is nothing wrong but just the length have some problem.....
I think the spaces are calculated too. May I know how to get the length of the DNA without including the spaces?
Thank you.
import re
singleSeq = ""
fh = open("seq.embl.txt")
lines = fh.readlines()
for line in lines:
lines = line.strip()
m = re.match(r"\s+(.[^\d]+)\s+\d+", line)
if m:
print(m.group(0))
seqline = m.group(1)
print(seqline)
singleSeq += seqline
print("\nSequence in a single line: ")
# print(line.strip(singleSeq))
print(singleSeq)
print("\nSequence length: ", len(singleSeq))
Output
Sequence in a single line:
cccatgtccc agcggcgtat tgctttgcat cgcgaacgca ctttcaatgt cccagcggcg tattgcttct attttataag taccagctaa attttttttt tttttttata agtaccagct aaaatttttt tttttttttt ttataagtac cagctaaaat tttttttttt tttttttata agtaccagct aaaatttttt ttttttttta taagttccag cggcgtattg ctttctgaaa tttaaaaaaa aaaaaaaatt tttttttaat aatatattat ata
Sequence length: 345

This should do the trick
# Python3 code to remove whitespace
def remove(string):
return string.replace(" ", "")
# Driver Program
string = ' t e s t '
print(remove(string))

it seems you are reinventing the wheel her. i strongly suggest you try BioPython for this
from Bio import SeqIO
record = SeqIO.read("seq.embl.txt", "embl")
print("\nSequence length: ", len(record))

Python - Error Caused by Space in argv Arument [duplicate]

I'm a python learner. If I have a lines of text in a file that looks like this
"Y:\DATA\00001\SERVER\DATA.TXT" "V:\DATA2\00002\SERVER2\DATA2.TXT"
Can I split the lines around the inverted commas? The only constant would be their position in the file relative to the data lines themselves. The data lines could range from 10 to 100+ characters (they'll be nested network folders). I cannot see how I can use any other way to do those markers to split on, but my lack of python knowledge is making this difficult.
I've tried
optfile=line.split("")
and other variations but keep getting valueerror: empty seperator. I can see why it's saying that, I just don't know how to change it. Any help is, as always very appreciated.
Many thanks

You must escape the ":
input.split("\"")
results in
['\n',
'Y:\\DATA\x0001\\SERVER\\DATA.TXT',
' ',
'V:\\DATA2\x0002\\SERVER2\\DATA2.TXT',
'\n']
To drop the resulting empty lines:
[line for line in [line.strip() for line in input.split("\"")] if line]
results in
['Y:\\DATA\x0001\\SERVER\\DATA.TXT', 'V:\\DATA2\x0002\\SERVER2\\DATA2.TXT']

I'll just add that if you were dealing with lines that look like they could be command line parameters, then you could possibly take advantage of the shlex module:
import shlex
with open('somefile') as fin:
for line in fin:
print shlex.split(line)
Would give:
['Y:\\DATA\\00001\\SERVER\\DATA.TXT', 'V:\\DATA2\\00002\\SERVER2\\DATA2.TXT']

No regex, no split, just use csv.reader
import csv
sample_line = '10.0.0.1 foo "24/Sep/2015:01:08:16 +0800" www.google.com "GET /" -'
def main():
for l in csv.reader([sample_line], delimiter=' ', quotechar='"'):
print l
The output is
['10.0.0.1', 'foo', '24/Sep/2015:01:08:16 +0800', 'www.google.com', 'GET /', '-']

shlex module can help you.
import shlex
my_string = '"Y:\DATA\00001\SERVER\DATA.TXT" "V:\DATA2\00002\SERVER2\DATA2.TXT"'
shlex.split(my_string)
This will spit
['Y:\\DATA\x0001\\SERVER\\DATA.TXT', 'V:\\DATA2\x0002\\SERVER2\\DATA2.TXT']
Reference: https://docs.python.org/2/library/shlex.html

Finding all regular expression matches will do it:
input=r'"Y:\DATA\00001\SERVER\DATA.TXT" "V:\DATA2\00002\SERVER2\DATA2.TXT"'
re.findall('".+?"', # or '"[^"]+"', input)
This will return the list of file names:
["Y:\DATA\00001\SERVER\DATA.TXT", "V:\DATA2\00002\SERVER2\DATA2.TXT"]
To get the file name without quotes use:
[f[1:-1] for f in re.findall('".+?"', input)]
or use re.finditer:
[f.group(1) for f in re.finditer('"(.+?)"', input)]

The following code splits the line at each occurrence of the inverted comma character (") and removes empty strings and those consisting only of whitespace.
[s for s in line.split('"') if s.strip() != '']
There is no need to use regular expressions, an escape character, some module or assume a certain number of whitespace characters between the paths.
Test:
line = r'"Y:\DATA\00001\SERVER\DATA.TXT" "V:\DATA2\00002\SERVER2\DATA2.TXT"'
output = [s for s in line.split('"') if s.strip() != '']
print(output)
>>> ['Y:\\DATA\\00001\\SERVER\\DATA.TXT', 'V:\\DATA2\\00002\\SERVER2\\DATA2.TXT']

I think what you want is to extract the filepaths, which are separated by spaces. That is you want to split the line about items contained within quotations. I.e with a line
"FILE PATH" "FILE PATH 2"
You want
["FILE PATH","FILE PATH 2"]
In which case:
import re
with open('file.txt') as f:
for line in f:
print(re.split(r'(?<=")\s(?=")',line))
With file.txt:
"Y:\DATA\00001\SERVER\DATA MINER.TXT" "V:\DATA2\00002\SERVER2\DATA2.TXT"
Outputs:
>>>
['"Y:\\DATA\\00001\\SERVER\\DATA MINER.TXT"', '"V:\\DATA2\\00002\\SERVER2\\DATA2.TXT"']

This was my solution. It parses most sane input exactly the same as if it was passed into the command line directly.
import re
def simpleParse(input_):
def reduce_(quotes):
return '' if quotes.group(0) == '"' else '"'
rex = r'("[^"]*"(?:\s|$)|[^\s]+)'
return [re.sub(r'"{1,2}',reduce_,z.strip()) for z in re.findall(rex,input_)]
Use case: Collecting a bunch of single shot scripts into a utility launcher without having to redo command input much.
Edit:
Got OCD about the stupid way that the command line handles crappy quoting and wrote the below:
import re
tokens = list()
reading = False
qc = 0
lq = 0
begin = 0
for z in range(len(trial)):
char = trial[z]
if re.match(r'[^\s]', char):
if not reading:
reading = True
begin = z
if re.match(r'"', char):
begin = z
qc = 1
else:
begin = z - 1
qc = 0
lc = begin
else:
if re.match(r'"', char):
qc = qc + 1
lq = z
elif reading and qc % 2 == 0:
reading = False
if lq == z - 1:
tokens.append(trial[begin + 1: z - 1])
else:
tokens.append(trial[begin + 1: z])
if reading:
tokens.append(trial[begin + 1: len(trial) ])
tokens = [re.sub(r'"{1,2}',lambda y:'' if y.group(0) == '"' else '"', z) for z in tokens]

I know this got answered a million year ago, but this works too:
input = '"Y:\DATA\00001\SERVER\DATA.TXT" "V:\DATA2\00002\SERVER2\DATA2.TXT"'
input = input.replace('" "','"').split('"')[1:-1]
Should output it as a list containing:
['Y:\\DATA\x0001\\SERVER\\DATA.TXT', 'V:\\DATA2\x0002\\SERVER2\\DATA2.TXT']

My question Python - Error Caused by Space in argv Arument was marked as a duplicate of this one. We have a number of Python books doing back to Python 2.3. The oldest referred to using a list for argv, but with no example, so I changed things to:-
repoCmd = ['Purchaser.py', 'task', repoTask, LastDataPath]
SWCore.main(repoCmd)
and in SWCore to:-
sys.argv = args
The shlex module worked but I prefer this.

Python - regex parsing file

I have a file like this
module modulename(wire1, wire2, \wire3[0], \wire3[1], \wire3[2], wire4, wire5,wire6, wire7, \wire8[0], wire9); nonmodule modulename(wire1, wire2, \wire3[0], \wire3[1], \wire3[2], wire4, wire5,wire6, wire7, \wire8[0], wire9)
i want to change this string to
module modulename(wire1, wire2, wire3[0:2],wire4, wire5, wire6, wire7,wire8[0],wire9) ; nonmodule modulename(wire1, wire2, wire3[0], wire3[1], wire3[2], wire4, wire5,wire6, wire7, wire8[0], wire9)
so basically remove \ and delete individual copies of wires and change size to [start:stop] when the starting keyword is module and just removing slashes when starting keyword after ";" is not module
If i can parse it with regex i can do the rest, i am trying the code below but its not matching anything. the code is modified from -pattern to dictionary of lists Python
lines=f.read()
d = defaultdict(list)
module_pattern = r'(\w+)\s(\w+)\(([^;]+)'
mod_rex = re.compile(module_pattern)
wire_pattern = r'(\w+)\s[\\]?(\w+)['
wire_rex = re.compile(wire_pattern)
for match in mod_rex.finditer(lines):
#print '\n'.join(match.groups())
module, instance, wires = match.groups()
for match in wire_rex.finditer(wires):
wire, connection = match.groups()
#print '\t', wire, connection
d[wire].append((module, instance, connection))
for k, v in d.items():
print k, ':', v
Help is appreciated , havent been able to identify the tokens.

This should get you started. I'm not sure of what assumptions you can make about your file format, but it should be straightforward enough to modify this code to suit your needs.
Also, I assumed that the ordering of the ports was strict, so they have been left unmodified. This is also the reason I didn't use dicts.
This code will strip out all backslashes and collapse adjacent bits into vectors. This will also handle vectors that do not start at 0 (for example someport[3:8]). I also chose to make single bit vectors say [0:0] rather than [0].
import re
import sys
mod_re = re.compile(r'module\s*([^(]+)\(([^)]*)\);(.*)')
wire_re = re.compile(r'([^[]+)\[([0-9]+)\]')
def process(line):
# Get rid of all backslashes. You can make this more selective if you want
clean = line.replace('\\', '')
m = mod_re.search(clean)
if m:
ports = []
mod_name, wires, remaining = m.groups()
for wire in wires.split(','):
wire = wire.replace(' ', '')
m = wire_re.search(wire)
if m:
# Found a vector
n = int(m.group(2))
prev_wire, _ = ports[-1]
# If previous port was a vector, tack on next value
if prev_wire == m.group(1):
ports[-1][1][1] = n
else:
ports.append((m.group(1), [n, n]))
else:
# Found a scalar
ports.append((wire, None))
# Stringify ports
out = []
for port in ports:
name, val = port
if val is None:
out.append(name)
else:
start, end = val
out.append('%s[%s:%s]' % (name, start, end))
print 'module %s(%s); %s' % (mod_name, ', '.join(out), remaining)
f = open(sys.argv[1], 'r')
if f:
for l in f.readlines():
process(l)
f.close()
Output:
module modulename(wire1, wire2, wire3[0:2], wire4, wire5, wire6, wire7, wire8[0:0], wire9); nonmodule modulename(wire1, wire2, wire3[0], wire3[1], wire3[2], wire4, wire5,wire6, wire7, wire8[0], wire9)
PS: I don't know what exactly you are trying to do, but changing the module definition will also require changing the instantiation as well.
EDIT: Removed with keyword when opening file for Python2.5 support.

it seems like you're removing \ regardless if where it is, so replace them after performing this pattern
(\bmodule\b[^()]+\([^;]*?)(\\wire(\d+)\[(\d+)\][^;]*\wire\3\[(\d+)\])
and replace w/ \1wire\3[\4:\5]
Demo
per comment try new pattern
(\bmodule\b[^\\;]+)\\([^[]+)\[(\d+)\][^;]+\2\[(\d+)\]
Demo

Counting chars in a file | python 3x

I'm wondering, how can I count for example all "s" characters and print their number in a text file that I'm importing? Tried few times to do it by my own but I'm still doing something wrong. If someone could give me some tips I would really appreciate that :)

Open the file, the "r" means it is opened as readonly mode.
filetoread = open("./filename.txt", "r")
With this loop, you iterate over all the lines in the file and counts the number of times the character chartosearch appears. Finally, the value is printed.
total = 0
chartosearch = 's'
for line in filetoread:
total += line.count(chartosearch)
print("Number of " + chartosearch + ": " + total)

I am assuming you want to read a file, find the number of s s and then, store the result at the end of the file.
f = open('blah.txt','r+a')
data_to_read = f.read().strip()
total_s = sum(map(lambda x: x=='s', data_to_read ))
f.write(str(total_s))
f.close()
I did it functionally just to give you another perspective.

You open the file with an open("myscript.txt", "r") with the mode as "r" because you are reading. To remove whitespaces and \n's, we do a .read().split(). Then, using a for loop, we loop over each individual character and check if it is an 'S' or an 's', and each time we find one, we add one to the scount variable (scount is supposed to mean S-count).
filetoread = open("foo.txt").read().split()
scount = 0
for k in ''.join(filetoread):
if k.lower() == 's':
scount+=1
print ("There are %d 's' characters" %(scount))

Here's a version with a reasonable time performance (~500MB/s on my machine) for ascii letters:
#!/usr/bin/env python3
import sys
from functools import partial
byte = sys.argv[1].encode('ascii') # s
print(sum(chunk.count(byte)
for chunk in iter(partial(sys.stdin.buffer.read, 1<<14), b'')))
Example:
$ echo baobab | ./count-byte b
3
It could be easily changed to support arbitrary Unicode codepoints:
#!/usr/bin/env python3
import sys
from functools import partial
char = sys.argv[1]
print(sum(chunk.count(char)
for chunk in iter(partial(sys.stdin.read, 1<<14), '')))
Example:
$ echo ⛄⛇⛄⛇⛄ | ./count-char ⛄
3
To use it with a file, you could use a redirect:
$ ./count-char < input_file

matching and dispalying specific lines through python

I have 15 lines in a log file and i want to read the 4th and 10 th line for example through python and display them on output saying this string is found :
abc
def
aaa
aaa
aasd
dsfsfs
dssfsd
sdfsds
sfdsf
ssddfs
sdsf
f
dsf
s
d
please suggest through code how to achieve this in python .
just to elaborate more on this example the first (string or line is unique) and can be found easily in logfile the next String B comes within 40 lines of the first one but this one occurs at lots of places in the log file so i need to read this string withing the first 40 lines after reading string A and print the same that these strings were found.
Also I cant use with command of python as this gives me errors like 'with' will become a reserved keyword in Python 2.6. I am using Python 2.5

You can use this:
fp = open("file")
for i, line in enumerate(fp):
if i == 3:
print line
elif i == 9:
print line
break
fp.close()

def bar(start,end,search_term):
with open("foo.txt") as fil:
if search_term in fil.readlines()[start,end]:
print search_term + " has found"
>>>bar(4, 10, "dsfsfs")
"dsfsfs has found"

#list of random characters
from random import randint
a = list(chr(randint(0,100)) for x in xrange(100))
#look for this
lookfor = 'b'
for element in xrange(100):
if lookfor==a[element]:
print a[element],'on',element
#b on 33
#b on 34
is one easy to read and simple way to do it. Can you give part of your log file as an example? There are other ways that may work better :).
after edits by author:
The easiest thing you can do then is:
looking_for = 'findthis' i = 1 for line in open('filename.txt','r'):
if looking_for == line:
print i, line
i+=1
it's efficient and easy :)

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Python split string on quotes - python

shlex module can help you. import shlex my_string = '"Y:\DATA\00001\SERVER\DATA.TXT" "V:\DATA2\00002\SERVER2\DATA2.TXT"' shlex.split(my_string) This will spit ['Y:\\DATA\x0001\\SERVER\\DATA.TXT', 'V:\\DATA2\x0002\\SERVER2\\DATA2.TXT'] Reference: https://docs.python.org/2/library/shlex.html

Related

Python: how to avoid the space in dna calculated?

Python - Error Caused by Space in argv Arument [duplicate]

Python - regex parsing file

Counting chars in a file | python 3x

matching and dispalying specific lines through python

Categories

Resources