I'm working with a JSON file that has values for rows and columns, and I need to create a unique id based on those values. We decided to combine the values row and column, and make sure each is represented by four digits (add 1000 to each).
For example:
"Col_Row": "1 - 145" needs to be something like "Geo_ID": "10011145"
I thought maybe I could do this with Python and regex, because I have to search for "Col_Row".
Here's what I have:
output = re.sub(r'"Col_Row": "(.*?)",', r'\1', test)
output = output.split(' - ')
[1000+int(v) for v in output]
So I can get the values, but now I'm stumped on how to search/replace a very large JSON file with those values.
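Use this regex:
(?<=\D|^)(\d{5,}|1[1-9]\d\d|1\d[1-9]\d|1\d\d[1-9]|[2-9]\d{3})(?=\D|$)
(?<=\D|^) requires a non-digit (or the start of the string) before the number
(\d{5,}|1[1-9]\d\d|1\d[1-9]\d|1\d\d[1-9]|[2-9]\d{3}) matches numbers greater than 1000
(?=\D|$) requires a non-digit (or the end of the string) after the number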
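The pattern above is not tied to a particular language. Python's re module only accepts fixed-width lookbehinds, so (?<=\D|^) is rejected there. A minimal sketch of an equivalent, assuming negative lookarounds are acceptable (the name OVER_1000 and the sample string are made up for illustration):

```python
import re

# Same alternation as above; (?<!\d) and (?!\d) stand in for
# (?<=\D|^) and (?=\D|$): "no digit directly before / after".
OVER_1000 = re.compile(
    r'(?<!\d)(\d{5,}|1[1-9]\d\d|1\d[1-9]\d|1\d\d[1-9]|[2-9]\d{3})(?!\d)'
)

found = OVER_1000.findall("999 1000 1001 2500 12345 1100")
print(found)
```

Only the numbers above 1000 should come back; 999 and 1000 are skipped.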
I found out how to do calculations on regex group references through a callback function, as shown in this question: Python re.sub question
So here's the python that worked for me:
Example:
import re
test = '"Properties" : { "Col_Row": "1 - 145", ... "Col_Row": "130 - 240" ... }}'
def repl(m):
    num = "%d%d" % (int(m.group(1))+1000,int(m.group(2))+1000)
    string = '"Geo_ID": "%s", "Col_Row": "%s - %s",' % (num,m.group(1),m.group(2))
    return string
output = re.sub(r'"Col_Row": "(.*?) - (.*?)",', repl, test)
outputs: '"Properties" : { "Geo_ID": "10011145", "Col_Row": "1 - 145", ... "Geo_ID": "11301240", "Col_Row": "130 - 240" ... }}'
And now the real thing (manipulating a file):
# The with statement closes both files automatically, even on errors
with open('fishnet.json','r') as infile, open('updated.json','w') as outfile:
    for line in infile:
        updated = re.sub(r'"Col_Row": "(.*?) - (.*?)",', repl, line)
        outfile.write(updated)
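A different route from the regex approach above is to parse the file with the standard json module and add the key on the parsed data. This is only a sketch: the function names make_geo_id and add_geo_ids are made up here, and it assumes every "Col_Row" value has the form "<col> - <row>" and that the whole file fits in memory.

```python
import json

def make_geo_id(col_row):
    """Turn '1 - 145' into '10011145' by adding 1000 to each part."""
    col, row = (int(part) for part in col_row.split(" - "))
    return "%d%d" % (col + 1000, row + 1000)

def add_geo_ids(node):
    """Recursively add a Geo_ID next to every Col_Row key (modifies in place)."""
    if isinstance(node, dict):
        if "Col_Row" in node:
            node["Geo_ID"] = make_geo_id(node["Col_Row"])
        for value in node.values():
            add_geo_ids(value)
    elif isinstance(node, list):
        for item in node:
            add_geo_ids(item)
    return node

# Hypothetical structure, only to show the call; the real file layout may differ.
data = json.loads('{"features": [{"Properties": {"Col_Row": "1 - 145"}},'
                  ' {"Properties": {"Col_Row": "130 - 240"}}]}')
add_geo_ids(data)
print(json.dumps(data))
```

With a real file you would json.load the input, call add_geo_ids on the result, and json.dump it to the output file.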
I am using Python to process pcap files and write the processed values to a text file. The text file has around 8000 rows, and sometimes it contains a string such as 7.70.582. In my further processing of the text file I split the file into lines and extract each of the float values in every line. Then I get this error:
ValueError: invalid literal for float(): 7.70.582
In such cases I am interested only in 7.70, and I need to drop everything from the second decimal point onward. Is there any trick to extract only the part of the string before the second decimal point?
I was searching for an answer for this and it seems no such situation has been asked about before.
Or is there a method where I can skip the lines where this kind of error happens?
I'm not a huge fan of this approach, but the simplest might be something like:
strs = [
    "7",
    "7.70",
    "7.70.582",
    "7.70.582.123"
]
def parse(s):
    s += ".."
    return float(s[:s.index(".", s.index(".")+1)])
for s in strs:
    print(s, parse(s))
A more legible approach might be something like:
def parse(s):
    if s.count('.') <= 1: return float(s)
    return float(s[:s.index(".", s.index(".")+1)])
Or, based off Ajax1234's answer:
def parse(s):
    return float('.'.join(s.split('.')[:2]))
All versions output:
7 7.0
7.70 7.7
7.70.582 7.7
7.70.582.123 7.7
You can use a regular expression, like this one:
https://pythex.org/?regex=%5E(%5B0-9%5D%2B%5C.%5B0-9%5D%2B).*&test_string=7.70.582&ignorecase=0&multiline=0&dotall=0&verbose=0
If your line is like '7.70.582' this regex will extract the 7.70 into the first group:
^([0-9]+\.[0-9]+).*
https://docs.python.org/2/library/re.html
import re
line = "7654 16.317 8.651 7.70.582 17.487"
val = line.split(" ")[3]
m = re.search(r'^([0-9]+\.[0-9]+).*', val)
m.group(1)
'7.70'
float(m.group(1))
7.7
You can use str.split() and '.'.join:
s = "7654 16.317 8.651 7.70.582 17.487"
final_data = list(map(float, ['.'.join(i.split('.')[:2]) for i in s.split()]))
Output:
[7654.0, 16.317, 8.651, 7.7, 17.487]
Regarding the single string:
s = ["7.70.582"]
final_data = list(map(float, ['.'.join(i.split('.')[:2]) for i in s]))
Output:
[7.7]
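The question also asks whether the offending values can simply be skipped. A minimal sketch of that option, assuming it is acceptable to drop any token that float() rejects (the function name parse_floats is made up for illustration):

```python
def parse_floats(tokens):
    """Convert tokens to floats, setting aside any that float() rejects."""
    values = []
    skipped = []
    for token in tokens:
        try:
            values.append(float(token))
        except ValueError:
            # e.g. '7.70.582' lands here instead of stopping the program
            skipped.append(token)
    return values, skipped

values, skipped = parse_floats("7654 16.317 8.651 7.70.582 17.487".split())
print(values, skipped)
```

Keeping the skipped tokens in a second list makes it easy to check afterwards how many values were dropped.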
I'm a Python learner. If I have lines of text in a file that look like this
"Y:\DATA\00001\SERVER\DATA.TXT" "V:\DATA2\00002\SERVER2\DATA2.TXT"
Can I split the lines around the inverted commas? The only constant would be their position in the file relative to the data lines themselves. The data lines could range from 10 to 100+ characters (they'll be nested network folders). I cannot see any other markers to split on, but my lack of Python knowledge is making this difficult.
I've tried
optfile=line.split("")
and other variations, but I keep getting ValueError: empty separator. I can see why it's saying that, I just don't know how to change it. Any help is, as always, very appreciated.
Many thanks
You must escape the ":
input.split("\"")
results in
['\n',
'Y:\\DATA\x0001\\SERVER\\DATA.TXT',
' ',
'V:\\DATA2\x0002\\SERVER2\\DATA2.TXT',
'\n']
To drop the resulting empty lines:
[line for line in [line.strip() for line in input.split("\"")] if line]
results in
['Y:\\DATA\x0001\\SERVER\\DATA.TXT', 'V:\\DATA2\x0002\\SERVER2\\DATA2.TXT']
I'll just add that if you were dealing with lines that look like they could be command line parameters, then you could possibly take advantage of the shlex module:
import shlex
with open('somefile') as fin:
    for line in fin:
        print(shlex.split(line))
Would give:
['Y:\\DATA\\00001\\SERVER\\DATA.TXT', 'V:\\DATA2\\00002\\SERVER2\\DATA2.TXT']
No regex, no split, just use csv.reader
import csv
sample_line = '10.0.0.1 foo "24/Sep/2015:01:08:16 +0800" www.google.com "GET /" -'
def main():
    for l in csv.reader([sample_line], delimiter=' ', quotechar='"'):
        print(l)
main()
The output is
['10.0.0.1', 'foo', '24/Sep/2015:01:08:16 +0800', 'www.google.com', 'GET /', '-']
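The same csv.reader call can be pointed at the quoted paths from the question. This is a sketch under the assumption that the paths are separated by a single space and never contain a double quote; the variable names are made up for illustration.

```python
import csv

# Raw string so the backslashes in the paths are kept literally.
path_line = r'"Y:\DATA\00001\SERVER\DATA.TXT" "V:\DATA2\00002\SERVER2\DATA2.TXT"'

# csv.reader accepts any iterable of lines; a one-element list is enough here.
paths = next(csv.reader([path_line], delimiter=' ', quotechar='"'))
print(paths)
```

Because no escapechar is set, the reader leaves the backslashes alone and only removes the surrounding quotes.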
The shlex module can help you.
import shlex
my_string = '"Y:\DATA\00001\SERVER\DATA.TXT" "V:\DATA2\00002\SERVER2\DATA2.TXT"'
shlex.split(my_string)
This will output
['Y:\\DATA\x0001\\SERVER\\DATA.TXT', 'V:\\DATA2\x0002\\SERVER2\\DATA2.TXT']
Reference: https://docs.python.org/2/library/shlex.html
Finding all regular expression matches will do it:
import re
input=r'"Y:\DATA\00001\SERVER\DATA.TXT" "V:\DATA2\00002\SERVER2\DATA2.TXT"'
re.findall('".+?"', input)  # or use '"[^"]+"' as the pattern
This will return the list of file names, each still wrapped in its quotes:
['"Y:\\DATA\\00001\\SERVER\\DATA.TXT"', '"V:\\DATA2\\00002\\SERVER2\\DATA2.TXT"']
To get the file name without quotes use:
[f[1:-1] for f in re.findall('".+?"', input)]
or use re.finditer:
[f.group(1) for f in re.finditer('"(.+?)"', input)]
The following code splits the line at each occurrence of the inverted comma character (") and removes empty strings and those consisting only of whitespace.
[s for s in line.split('"') if s.strip() != '']
There is no need to use regular expressions, an escape character, some module or assume a certain number of whitespace characters between the paths.
Test:
line = r'"Y:\DATA\00001\SERVER\DATA.TXT" "V:\DATA2\00002\SERVER2\DATA2.TXT"'
output = [s for s in line.split('"') if s.strip() != '']
print(output)
>>> ['Y:\\DATA\\00001\\SERVER\\DATA.TXT', 'V:\\DATA2\\00002\\SERVER2\\DATA2.TXT']
I think what you want is to extract the file paths, which are separated by spaces. That is, you want to split the line around the items contained within quotation marks. I.e., with a line
"FILE PATH" "FILE PATH 2"
You want
["FILE PATH","FILE PATH 2"]
In which case:
import re
with open('file.txt') as f:
    for line in f:
        # rstrip drops the trailing newline so it does not end up in the last item
        print(re.split(r'(?<=")\s(?=")',line.rstrip('\n')))
With file.txt:
"Y:\DATA\00001\SERVER\DATA MINER.TXT" "V:\DATA2\00002\SERVER2\DATA2.TXT"
Outputs:
>>>
['"Y:\\DATA\\00001\\SERVER\\DATA MINER.TXT"', '"V:\\DATA2\\00002\\SERVER2\\DATA2.TXT"']
This was my solution. It parses most sane input exactly the same as if it was passed into the command line directly.
import re
def simpleParse(input_):
    def reduce_(quotes):
        return '' if quotes.group(0) == '"' else '"'
    rex = r'("[^"]*"(?:\s|$)|[^\s]+)'
    return [re.sub(r'"{1,2}',reduce_,z.strip()) for z in re.findall(rex,input_)]
Use case: Collecting a bunch of single shot scripts into a utility launcher without having to redo command input much.
Edit:
Got OCD about the stupid way that the command line handles crappy quoting and wrote the below:
import re
tokens = list()
reading = False
qc = 0
lq = 0
begin = 0
# trial is the input string to tokenize (it is not defined in this snippet)
for z in range(len(trial)):
    char = trial[z]
    if re.match(r'[^\s]', char):
        if not reading:
            reading = True
            begin = z
            if re.match(r'"', char):
                begin = z
                qc = 1
            else:
                begin = z - 1
                qc = 0
            lc = begin
        else:
            if re.match(r'"', char):
                qc = qc + 1
                lq = z
    elif reading and qc % 2 == 0:
        reading = False
        if lq == z - 1:
            tokens.append(trial[begin + 1: z - 1])
        else:
            tokens.append(trial[begin + 1: z])
if reading:
    tokens.append(trial[begin + 1: len(trial) ])
tokens = [re.sub(r'"{1,2}',lambda y:'' if y.group(0) == '"' else '"', z) for z in tokens]
I know this got answered a million years ago, but this works too:
input = '"Y:\DATA\00001\SERVER\DATA.TXT" "V:\DATA2\00002\SERVER2\DATA2.TXT"'
input = input.replace('" "','"').split('"')[1:-1]
Should output it as a list containing:
['Y:\\DATA\x0001\\SERVER\\DATA.TXT', 'V:\\DATA2\x0002\\SERVER2\\DATA2.TXT']
My question Python - Error Caused by Space in argv Arument was marked as a duplicate of this one. We have a number of Python books going back to Python 2.3. The oldest referred to using a list for argv, but with no example, so I changed things to:
repoCmd = ['Purchaser.py', 'task', repoTask, LastDataPath]
SWCore.main(repoCmd)
and in SWCore to:-
sys.argv = args
The shlex module worked but I prefer this.
I am fairly new to Python. I have a text file containing many blocks of data in the following format, along with other unnecessary blocks.
NOT REQUIRED :: 123
Connected Part-1:: A ~$
Connected Part-3:: B ~$
Connector Location:: 100 200 300 ~$
NOT REQUIRED :: 456
Connected Part-2:: C ~$
I wish to extract the info (A, B, C, 100 200 300) corresponding to each property (Connected Part-1, Connector Location) and store it as a list to use later. I have prepared the following code, which reads the file, cleans each line and stores it as a list.
import fileinput
with open('C:/Users/file.txt') as f:
    content = f.readlines()
    for line in content:
        if 'Connected Part-1' in line or 'Connected Part-3' in line:
            if 'Connected Part-1' in line:
                connected_part_1 = [s.strip(' \n ~ $ Connected Part -1 ::') for s in content]
                print ('PART_1:',connected_part_1)
            if 'Connected Part-3' in line:
                connected_part_3 = [s.strip(' \n ~ $ Connected Part -3 ::') for s in content]
                print ('PART_3:',connected_part_3)
        if 'Connector Location' in line:
            # removing unwanted characters and converting into the list
            content_clean_1 = [s.strip('\n ~ $ Connector Location::') for s in content]
            #converting a single string item in list to a string
            s = " ".join(content_clean_1)
            # splitting the string and converting into a list
            weld_location= s.split(" ")
            print ('POSITION',weld_location)
Here is the output:
PART_1: ['A', '\t\tConnector Location:: 100.00 200.00 300.00', '\t\tConnected Part-3:: C~\t']
POSITION ['d', 'Part-1::', 'A', '\t\tConnector', 'Location::', '100.00', '200.00', '300.00', '\t\tConnected', 'Part-3::', 'C~\t']
PART_3: ['1:: A', '\t\tConnector Location:: 100.00 200.00 300.00', '\t\tConnected Part-3:: C~\t']
From the output of this program, I conclude that, since 'content' holds all the text in the file, the program is not processing an individual line. Instead it is treating all the text as a single string. Could anyone please help in this case?
I am expecting following output:
PART_1: ['A']
PART_3: ['C']
POSITION: ['100.00', '200.00','300.00']
(Note) When I use individual files containing a single line of data, it works fine. Sorry for such a long question.
I will try to make it clear, and show how I would do it without regex. First of all, the biggest issue with the code presented is that when using the string.strip function the entire content list is being read:
connected_part_1 = [s.strip(' \n ~ $ Connected Part -1 ::') for s in content]
content holds every line of the file, so I think you simply want something like:
connected_part_1 = [line.strip(' \n ~ $ Connected Part -1 ::')]
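One more thing to watch with this strip call: str.strip treats its argument as a set of characters to remove from both ends, not as a prefix. A small sketch of the difference, using a made-up value "Cat" that is not in the original file:

```python
# Hypothetical line: the value 'Cat' consists only of characters that also
# appear in the strip argument, so strip() eats it together with the label.
line = "Connected Part-1:: Cat ~$\n"
chars = " \n ~ $ Connected Part -1 ::"

stripped = line.strip(chars)

# Splitting on '::' first and stripping only the markers keeps the value.
value = line.partition("::")[2].strip(" \n~$")

print(repr(stripped), repr(value))
```

With the values from the question ('A', 'B', 'C') the strip call happens to work, because uppercase 'A' and 'B' are not in the character set and a lone 'C' has nothing after it but markers.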
How to parse the file is a bit subjective, but given the file format posted as input, I would do it like this:
templatestr = "{}: {}"
with open('inputreadlines.txt') as f:
    content = f.readlines()
    for line in content:
        label, value = line.split('::')
        ltokens = label.split()
        if ltokens[0] == 'Connected':
            print(templatestr.format(
                ltokens[-1], #The last word on the label
                value.split()[:-1])) #the split value without the last word '~$'
        elif ltokens[0] == 'Connector':
            print(value.split()[:-1]) #the split value without the last word '~$'
        else: #NOT REQUIRED
            pass
You can use the string.strip function to remove the funny characters '~$' instead of removing the last token as in the example.
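Since the question asks for the values to be stored for later use rather than printed, here is a sketch that collects them in a dictionary keyed by label. The function name parse_blocks is made up, and it assumes each wanted line has the form "label:: values ~$" as in the posted sample.

```python
def parse_blocks(lines):
    """Collect the value tokens of each 'Connected'/'Connector' line by label."""
    result = {}
    for line in lines:
        label, sep, value = line.partition("::")
        label = label.strip()
        if not sep or not label.startswith(("Connected", "Connector")):
            continue  # blank lines and 'NOT REQUIRED' blocks are skipped
        # Strip whitespace and the '~$' marker from both ends, then split.
        result[label] = value.strip(" \t\n~$").split()
    return result

sample = [
    "NOT REQUIRED :: 123\n",
    "Connected Part-1:: A ~$\n",
    "Connected Part-3:: B ~$\n",
    "Connector Location:: 100 200 300 ~$\n",
    "\n",
]
parsed = parse_blocks(sample)
print(parsed)
```

Using partition instead of split avoids an unpacking error on lines that contain no '::' at all.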
I have a folder with about 50 .txt files containing data in the following format.
=== Predictions on test data ===
inst# actual predicted error distribution (OFTd1_OF_Latency)
1 1:S 2:R + 0.125,*0.875 (73.84)
I need to write a program that combines the following: my index number (i), the letter of the true class (R or S), the letter of the predicted class, and each of the distribution predictions (the decimals less than 1.0).
I would like it to look like the following when finished, but preferably as a .csv file.
ID True Pred S R
1 S R 0.125 0.875
2 R R 0.105 0.895
3 S S 0.945 0.055
. . . . .
. . . . .
. . . . .
n S S 0.900 0.100
I'm a beginner and a bit fuzzy on how to get all of that parsed and then concatenated and appended. Here's what I was thinking, but feel free to suggest another direction if that would be easier.
for i in range(1, n):
    s = str(i)
    readin = open('mydata/output/output'+s+'out','r')
    #The files are all named the same but with different numbers associated
    output = open("mydata/summary.csv", "a")
    storage = []
    for line in readin:
        #data extraction/concatenation here
        if line.startswith('1'):
            id = i
            true = # split at the ':' and take the letter after it
            pred = # split at the second ':' and take the letter after it
            #some have error '+'s and some don't so I'm not exactly sure what to do to get the distributions
            ds = # split at the ',' and take the string of 5 digits before it
            if pred == 'R':
                dr = #skip the character after the comma but take the have characters after
            else:
                #take the five characters after the comma
            lineholder = id+' , '+true+' , '+pred+' , '+ds+' , '+dr
        else: continue
        output.write(lineholder)
I think using the indexes would be another option, but it might complicate things if the spacing is off in any of the files and I haven't checked this for sure.
Thank you for your help!
Well, first of all, if you want to use CSV, you should use the csv module that comes with Python. More about this module here: https://docs.python.org/2.7/library/csv.html I won't demonstrate how to use it, because it's pretty simple.
As for reading the input data, here's my suggestion how to break down every line of the data itself. I assume that lines of data in the input file have their values separated by spaces, and each value cannot contain a space:
def process_line(id_, line):
    pieces = line.split() # Now we have a list of values
    true = pieces[1].split(':')[1] # split at the ':' and take the letter after it
    pred = pieces[2].split(':')[1] # same for the predicted class
    if len(pieces) == 6: # There was an error, the + is there
        p4 = pieces[4]
    else: # There was no '+' only spaces
        p4 = pieces[3]
    ds = p4.split(',')[0].lstrip('*') # value before the ',' without a leading '*'
    if pred == 'R':
        dr = p4.split(',')[1][1:] # skip the '*' after the comma and take the rest
    else:
        dr = p4.split(',')[1]
    return str(id_)+' , '+true+' , '+pred+' , '+ds+' , '+dr # str() so an integer id works too
What I mainly used here was the split function of strings: https://docs.python.org/2/library/stdtypes.html#str.split and in one place the simple syntax str[1:] to skip the first character of the string (strings are sequences after all, so we can use this slicing syntax).
Keep in mind that my function won't handle any errors or lines formatted differently than the one you posted as an example. If the values in every line are separated by tabs and not spaces you should replace this line: pieces = line.split() with pieces = line.split('\t').
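The csv step is left out above, so here is a minimal sketch of it. It assumes the distribution is always the second-to-last field, that its two values are in S, R order as in the question's desired header, and that '*' only ever appears as a leading marker. The names parse_prediction and write_summary and the second sample line are made up for illustration.

```python
import csv
import io

def parse_prediction(line):
    """Return (true, predicted, dist_S, dist_R) from one prediction line."""
    pieces = line.split()
    true = pieces[1].split(":")[1]
    pred = pieces[2].split(":")[1]
    # Second-to-last field holds the distribution, with or without the '+'.
    dist_s, dist_r = (v.lstrip("*") for v in pieces[-2].split(","))
    return true, pred, dist_s, dist_r

def write_summary(lines, out):
    """Write a header plus one numbered row per prediction line to out."""
    writer = csv.writer(out)
    writer.writerow(["ID", "True", "Pred", "S", "R"])
    for i, line in enumerate(lines, start=1):
        writer.writerow([i, *parse_prediction(line)])

# An in-memory buffer stands in for open('mydata/summary.csv', 'w', newline='').
buffer = io.StringIO()
write_summary(
    ["1 1:S 2:R + 0.125,*0.875 (73.84)",
     "2 1:S 1:S *0.945,0.055 (12.5)"],  # second line is hypothetical
    buffer,
)
summary_text = buffer.getvalue()
print(summary_text)
```

Letting csv.writer build each row avoids hand-assembling the separators and handles quoting if a field ever contains a comma.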
I think you can separate the floats and then combine them with the strings with the help of the re module, as follows:
import re
with open('sample.txt') as f:
    lines = f.readlines()
nums = [re.findall(r'\d+\.\d+', line) for line in lines]
print(nums)
labels = [re.findall(r'\w+:\w+', line) for line in lines]
print(labels)
s = labels + nums
print(s)  # [['1:S', '2:R'], ['0.125', '0.875', '73.84']] for a one-line file
The comprehensions loop over every line, so this works for a multi-line file as well.
Contents of sample.txt:
1 1:S 2:R + 0.125,*0.875 (73.84)
2 1:S 2:R + 0.15,*0.85 (69.4)
When you run the program the result will be:
[['1:S', '2:R'], ['1:S', '2:R'], ['0.125', '0.875', '73.84'], ['0.15', '0.85', '69.4']]
From there, simply concatenate the pieces you need.
This uses regular expressions and the CSV module.
import re
import csv
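# [ \t]* replaces the POSIX class [[:blank:]], which Python's re does not support
matcher = re.compile(r'[ \t]*1.*:(.).*:(.).* ([^ ]*),[^0-9]?(.*) ')
filenametemplate = 'mydata/output/output%iout'
output = csv.writer(open('mydata/summary.csv', 'w'))
for i in range(1, n):  # n is the file count from the question
    for line in open(filenametemplate % i):
        m = matcher.match(line)
        if m:
            output.writerow([i] + list(m.groups()))  # csv writers use writerow, not write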