My code
token = open('out.txt', 'r')
linestoken = token.readlines()
tokens_column_number = 1
r = []
for x in linestoken:
    r.append(x.split()[tokens_column_number])
token.close()
print(r)
Output
'"tick2/tick_calculated_2_2020-05-27T11-59-06.json.gz",'
Desired output
"tick2/tick_calculated_2_2020-05-27T11-59-06.json.gz"
How do I get rid of the " and , characters?
It would be nice to see your input data. I have created an input file which is (I hope) similar to yours.
My test file content:
example1 "tick2/tick_calculated_2_2020-05-27T11-59-06.json.gz", aaaa
example2 "tick2/tick_calculated_2_2020-05-27T11-59-07.json.gz", bbbb
You need to remove the " and , characters.
You can do it with replace (https://docs.python.org/3/library/stdtypes.html#str.replace):
r.append(x.split()[tokens_column_number].replace('"', "").replace(",", ""))
You can do it with strip (https://docs.python.org/3/library/stdtypes.html#str.strip):
r.append(x.split()[tokens_column_number].strip('",'))
You can do it with re.sub (https://docs.python.org/3/library/re.html#re.sub):
import re
...
...
for x in linestoken:
    x = re.sub('[",]', "", x.split()[tokens_column_number])
    r.append(x)
...
...
Output in all three cases:
>>> python3 test.py
['tick2/tick_calculated_2_2020-05-27T11-59-06.json.gz', 'tick2/tick_calculated_2_2020-05-27T11-59-07.json.gz']
As you can see above, the output (r) is a list. If you want the result as a string instead, use str.join (https://docs.python.org/3/library/stdtypes.html#str.join).
Output with print(",".join(r)):
>>> python3 test.py
tick2/tick_calculated_2_2020-05-27T11-59-06.json.gz,tick2/tick_calculated_2_2020-05-27T11-59-07.json.gz
Output with print("\n".join(r)):
>>> python3 test.py
tick2/tick_calculated_2_2020-05-27T11-59-06.json.gz
tick2/tick_calculated_2_2020-05-27T11-59-07.json.gz
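Putting it together, a minimal end-to-end sketch using the strip variant; it writes a small out.txt with the sample content shown above so it runs on its own:

```python
# Minimal end-to-end sketch: writes a sample out.txt (contents assumed
# to match the test file above), then extracts and cleans column 1.
sample = (
    'example1 "tick2/tick_calculated_2_2020-05-27T11-59-06.json.gz", aaaa\n'
    'example2 "tick2/tick_calculated_2_2020-05-27T11-59-07.json.gz", bbbb\n'
)
with open('out.txt', 'w') as f:
    f.write(sample)

tokens_column_number = 1
with open('out.txt') as token:
    # strip('",') removes leading/trailing quote and comma characters
    r = [line.split()[tokens_column_number].strip('",') for line in token]

print("\n".join(r))
```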
Related
I have a file which contains something similar to the following lines:
[<data_type0>,<data_type1>] name(data)
"DATA_VALUE0"|"DATA_VALUE1" name(data)
I am looking to split each line into two. The first part is whatever lies between the '<' and '>', the '[' and ']', or the " and "; the rest is the trailing name(data).
So the output from the desired split would be something like:
valueA[0] = [data_type0,data_type1]
valueA[1] = [name(data)]
valueB[0] = [DATA_VALUE0,DATA_VALUE1]
valueB[1] = [name(data)]
One snag is that the data values are of an unknown length, so some lines could read:
<date_type0> name(data)
and others could be:
<data_type0>,<data_type1>,<data_type2>...<data_type8> name(data)
Any ideas how?
What you are looking for is rsplit():
Code:
lines = (
    '"[ < data_type0 >, < data_type1 >] name(data)',
    '"DATA_VALUE0" | "DATA_VALUE1" name(data)',
)
for line in lines:
    print(line.rsplit(' ', 1))
Results:
['"[ < data_type0 >, < data_type1 >]', 'name(data)']
['"DATA_VALUE0" | "DATA_VALUE1"', 'name(data)']
It looks like you could just split on whitespace:
>>> data = """[<data_type0>,<data_type1>] name(data)
... "DATA_VALUE0"|"DATA_VALUE1" name(data)"""
>>> for line in data.split("\n"):
... print(line.split())
...
['[<data_type0>,<data_type1>]', 'name(data)']
['"DATA_VALUE0"|"DATA_VALUE1"', 'name(data)']
There is also a general approach to finding stuff in strings and breaking them apart.
a = '<data_1>,<data_2> name(data)'
division = a.find('name(')
b = a[:division-1]
c = a[division:]
Results:
>>> b
'<data_1>,<data_2>'
>>> c
'name(data)'
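Note that the rsplit(' ', 1) approach from the first answer also covers the variable-length case raised in the question, since it only splits once, at the last space; a quick sketch:

```python
# rsplit with maxsplit=1 splits only at the last space, so the number
# of <data_type> entries on the left does not matter.
lines = [
    '<date_type0> name(data)',
    '<data_type0>,<data_type1>,<data_type2> name(data)',
]
parts = [line.rsplit(' ', 1) for line in lines]
print(parts)
```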
I'm a python learner. If I have a lines of text in a file that looks like this
"Y:\DATA\00001\SERVER\DATA.TXT" "V:\DATA2\00002\SERVER2\DATA2.TXT"
Can I split the lines around the inverted commas? The only constant is their position in the file relative to the data lines themselves. The data lines could range from 10 to 100+ characters (they will be nested network folders). I cannot see any other markers to split on, and my lack of Python knowledge is making this difficult.
I've tried
optfile=line.split("")
and other variations, but I keep getting ValueError: empty separator. I can see why it's saying that; I just don't know how to change it. Any help is, as always, very appreciated.
Many thanks
You must escape the " character (or wrap it in single quotes):
input.split("\"")
results in
['\n',
'Y:\\DATA\x0001\\SERVER\\DATA.TXT',
' ',
'V:\\DATA2\x0002\\SERVER2\\DATA2.TXT',
'\n']
To drop the resulting empty lines:
[line for line in [line.strip() for line in input.split("\"")] if line]
results in
['Y:\\DATA\x0001\\SERVER\\DATA.TXT', 'V:\\DATA2\x0002\\SERVER2\\DATA2.TXT']
I'll just add that if you were dealing with lines that look like they could be command line parameters, then you could possibly take advantage of the shlex module:
import shlex

with open('somefile') as fin:
    for line in fin:
        print(shlex.split(line))
Would give:
['Y:\\DATA\\00001\\SERVER\\DATA.TXT', 'V:\\DATA2\\00002\\SERVER2\\DATA2.TXT']
No regex, no split, just use csv.reader
import csv

sample_line = '10.0.0.1 foo "24/Sep/2015:01:08:16 +0800" www.google.com "GET /" -'

def main():
    for l in csv.reader([sample_line], delimiter=' ', quotechar='"'):
        print(l)
The output is
['10.0.0.1', 'foo', '24/Sep/2015:01:08:16 +0800', 'www.google.com', 'GET /', '-']
The shlex module can help you.
import shlex
my_string = '"Y:\DATA\00001\SERVER\DATA.TXT" "V:\DATA2\00002\SERVER2\DATA2.TXT"'
shlex.split(my_string)
This will output:
['Y:\\DATA\x0001\\SERVER\\DATA.TXT', 'V:\\DATA2\x0002\\SERVER2\\DATA2.TXT']
Reference: https://docs.python.org/2/library/shlex.html
Finding all regular expression matches will do it:
import re

input = r'"Y:\DATA\00001\SERVER\DATA.TXT" "V:\DATA2\00002\SERVER2\DATA2.TXT"'
re.findall('".+?"', input)  # or '"[^"]+"'
This will return the list of quoted file names:
['"Y:\\DATA\\00001\\SERVER\\DATA.TXT"', '"V:\\DATA2\\00002\\SERVER2\\DATA2.TXT"']
To get the file name without quotes use:
[f[1:-1] for f in re.findall('".+?"', input)]
or use re.finditer:
[f.group(1) for f in re.finditer('"(.+?)"', input)]
The following code splits the line at each occurrence of the inverted comma character (") and removes empty strings and those consisting only of whitespace.
[s for s in line.split('"') if s.strip() != '']
There is no need to use regular expressions, an escape character, some module or assume a certain number of whitespace characters between the paths.
Test:
line = r'"Y:\DATA\00001\SERVER\DATA.TXT" "V:\DATA2\00002\SERVER2\DATA2.TXT"'
output = [s for s in line.split('"') if s.strip() != '']
print(output)
>>> ['Y:\\DATA\\00001\\SERVER\\DATA.TXT', 'V:\\DATA2\\00002\\SERVER2\\DATA2.TXT']
I think what you want is to extract the file paths, which are separated by spaces. That is, you want to split the line around items contained within quotations, i.e. with a line
"FILE PATH" "FILE PATH 2"
You want
["FILE PATH","FILE PATH 2"]
In which case:
import re

with open('file.txt') as f:
    for line in f:
        print(re.split(r'(?<=")\s(?=")', line))
With file.txt:
"Y:\DATA\00001\SERVER\DATA MINER.TXT" "V:\DATA2\00002\SERVER2\DATA2.TXT"
Outputs:
>>>
['"Y:\\DATA\\00001\\SERVER\\DATA MINER.TXT"', '"V:\\DATA2\\00002\\SERVER2\\DATA2.TXT"']
This was my solution. It parses most sane input exactly the same as if it was passed into the command line directly.
import re

def simpleParse(input_):
    def reduce_(quotes):
        return '' if quotes.group(0) == '"' else '"'
    rex = r'("[^"]*"(?:\s|$)|[^\s]+)'
    return [re.sub(r'"{1,2}', reduce_, z.strip()) for z in re.findall(rex, input_)]
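A quick sanity check of simpleParse on a sample string (the function is reproduced here so the snippet is self-contained); single quotes are stripped and bare words pass through:

```python
import re

def simpleParse(input_):
    def reduce_(quotes):
        return '' if quotes.group(0) == '"' else '"'
    rex = r'("[^"]*"(?:\s|$)|[^\s]+)'
    return [re.sub(r'"{1,2}', reduce_, z.strip()) for z in re.findall(rex, input_)]

# Quoted tokens keep their internal spaces; surrounding quotes are removed.
print(simpleParse('"foo bar" baz "qux"'))  # ['foo bar', 'baz', 'qux']
```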
Use case: Collecting a bunch of single shot scripts into a utility launcher without having to redo command input much.
Edit:
Got OCD about the stupid way that the command line handles crappy quoting and wrote the below:
import re

# 'trial' is assumed to hold the input string to tokenize
tokens = list()
reading = False
qc = 0
lq = 0
begin = 0
for z in range(len(trial)):
    char = trial[z]
    if re.match(r'[^\s]', char):
        if not reading:
            reading = True
            begin = z
            if re.match(r'"', char):
                begin = z
                qc = 1
            else:
                begin = z - 1
                qc = 0
            lc = begin
        else:
            if re.match(r'"', char):
                qc = qc + 1
                lq = z
    elif reading and qc % 2 == 0:
        reading = False
        if lq == z - 1:
            tokens.append(trial[begin + 1: z - 1])
        else:
            tokens.append(trial[begin + 1: z])
if reading:
    tokens.append(trial[begin + 1: len(trial)])
tokens = [re.sub(r'"{1,2}', lambda y: '' if y.group(0) == '"' else '"', z) for z in tokens]
I know this got answered a million years ago, but this works too:
input = '"Y:\DATA\00001\SERVER\DATA.TXT" "V:\DATA2\00002\SERVER2\DATA2.TXT"'
input = input.replace('" "','"').split('"')[1:-1]
Should output it as a list containing:
['Y:\\DATA\x0001\\SERVER\\DATA.TXT', 'V:\\DATA2\x0002\\SERVER2\\DATA2.TXT']
My question Python - Error Caused by Space in argv Argument was marked as a duplicate of this one. We have a number of Python books going back to Python 2.3. The oldest referred to using a list for argv, but with no example, so I changed things to:-
repoCmd = ['Purchaser.py', 'task', repoTask, LastDataPath]
SWCore.main(repoCmd)
and in SWCore to:-
sys.argv = args
The shlex module worked but I prefer this.
I have lots of lines in a text file. They looks like, for example:
562: DEBUG, CIC, Parameter(Auto_Gain_ROI_Size) = 4
711: DEBUG, VSrc, Parameter(Auto_Contrast) = 0
I want to extract the string inside the parentheses; for example, the output in this case should be
"Auto_Gain_ROI_Size" and "Auto_Contrast".
Note that the string is always enclosed by "Parameter()". Thanks.
You can use regex:
>>> import re
>>> s = "562: DEBUG, CIC, Parameter(Auto_Gain_ROI_Size) = 4"
>>> t = "711: DEBUG, VSrc, Parameter(Auto_Contrast) = 0 "
>>> myreg = re.compile(r'Parameter\((.*?)\)')
>>> print(myreg.search(s).group(1))
Auto_Gain_ROI_Size
>>> print(myreg.search(t).group(1))
Auto_Contrast
Or, without regex (albeit a bit messier):
>>> print(s.split('Parameter(')[1].split(')')[0])
Auto_Gain_ROI_Size
>>> print(t.split('Parameter(')[1].split(')')[0])
Auto_Contrast
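To apply this across many lines at once, the compiled pattern can drive a list comprehension (sample lines taken from the question); a small sketch:

```python
import re

log_lines = [
    '562: DEBUG, CIC, Parameter(Auto_Gain_ROI_Size) = 4',
    '711: DEBUG, VSrc, Parameter(Auto_Contrast) = 0',
]
myreg = re.compile(r'Parameter\((.*?)\)')
# Keep only lines where the pattern actually matched.
names = [m.group(1) for m in map(myreg.search, log_lines) if m]
print(names)  # ['Auto_Gain_ROI_Size', 'Auto_Contrast']
```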
I'm working with a JSON file that has values for rows and columns, and I need to create a unique id based on those values. We decided to combine the values row and column, and make sure each is represented by four digits (add 1000 to each).
For example:
"Col_Row": "1 - 145" needs to be something like "Geo_ID": "10011145"
I thought maybe I could do this with Python and regex, because I have to search for "Col_Row".
Here's what I have:
output = re.sub(r'"Col_Row: "(.*?)",', r'\1', test);
output = output.split(' - ')
[1000+int(v) for v in output]
So I can get the values, but now I'm stumped on how to search/replace a very large JSON file with those values.
use this regex (?<=\D|^)(\d{5,}|1[1-9]\d\d|1\d[1-9]\d|1\d\d[1-9]|[2-9]\d{3})(?=\D|$)
(?<=\D|^) requires a non-digit character (or the start of the string) before the number
(\d{5,}|1[1-9]\d\d|1\d[1-9]\d|1\d\d[1-9]|[2-9]\d{3}) matches numbers greater than 1000 (five or more digits, or any four-digit number other than 1000)
(?=\D|$) requires a non-digit character (or the end of the string) after the number
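One caveat: Python's built-in re module requires fixed-width look-behinds, so the alternation (?<=\D|^) may fail to compile there (the third-party regex module accepts it). An equivalent sketch using negative lookarounds, which also match at the string boundaries:

```python
import re

# Negative lookarounds (?<!\d) and (?!\d) replace the variable-width
# (?<=\D|^) ... (?=\D|$) assertions; the middle alternation is unchanged.
pattern = re.compile(
    r'(?<!\d)(\d{5,}|1[1-9]\d\d|1\d[1-9]\d|1\d\d[1-9]|[2-9]\d{3})(?!\d)'
)

# 999 and 1000 are excluded; 1001 and longer numbers match.
print(pattern.findall('row 999, col 1000, id 1001, geo 10011145'))
# ['1001', '10011145']
```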
Found out how to do calculations on regex backreferences through a callback function, as described in this question: Python re.sub question
So here's the Python that worked for me:
Example:
import re

test = '"Properties" : { "Col_Row": "1 - 145", ... "Col_Row": "130 - 240" ... }}'

def repl(m):
    num = "%d%d" % (int(m.group(1)) + 1000, int(m.group(2)) + 1000)
    string = '"Geo_ID": "%s", "Col_Row": "%s - %s",' % (num, m.group(1), m.group(2))
    return string

output = re.sub(r'"Col_Row": "(.*?) - (.*?)",', repl, test)
outputs: '"Properties" : { "Geo_ID": "10011145", "Col_Row": "1 - 145", ... "Geo_ID": "11301240", "Col_Row": "130 - 240" ... }}'
And now the real thing (manipulating a file):
input = open('fishnet.json', 'r')
input_list = input.readlines()
output = open('updated.json', 'w')
for line in input_list:
    updated = re.sub(r'"Col_Row": "(.*?) - (.*?)",', repl, line)
    output.write(updated)
output.close()
input.close()