Using sed through subprocess.call in Python to conduct in-file replacements

I've got a column in one file that I'd like to replace with a column in another file. I'm trying to use sed to do this within python, but I'm not sure I'm doing it correctly. Maybe the code will make things more clear:
for line in infile1.readlines()[1:]:
    element = re.split("\t", line)
    IID.append(element[1])
    FID.append(element[0])

os.chdir(binary_dir)

for files in os.walk(binary_dir):
    for file in files:
        for name in file:
            if name.endswith(".fam"):
                infile2 = open(name, 'r+')

for line in infile2.readlines():
    parts = re.split(" ", line)
    Part1.append(parts[0])
    Part2.append(parts[1])

for i in range(len(Part2)):
    if Part2[i] in IID:
        regex = '"s/\.*' + Part2[i] + '/' + Part1[i] + ' ' + Part2[i] + '/"' + ' ' + phenotype
        print regex
        subprocess.call(["sed", "-i.orig", regex], shell=True)
This is what print regex outputs. The system appears to hang during the sed call: it sits there for quite some time without doing anything.
"s/\.*131006/201335658-01 131006/" /Users/user1/Desktop/phenotypes2
Thanks for your help, and let me know if you need further clarification!

You don't need sed if you have Python and the re module. Here is an example of how to use re to replace a given pattern in a string.
>>> import re
>>> line = "abc def ghi"
>>> new_line = re.sub("abc", "123", line)
>>> new_line
'123 def ghi'
>>>
Of course this is only one way to do that in Python. I feel that for you str.replace() will do the job too.

The first issue is shell=True that is used together with a list argument. Either drop shell=True or use a string argument (the complete shell command) instead:
from subprocess import check_call
check_call(["sed", "-i.orig", regex])
otherwise the arguments ('-i.orig' and regex) are passed to /bin/sh instead of sed.
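A minimal sketch of the list form, in Python 3 syntax. The helper name build_sed_command is made up for illustration, the pattern is simplified to the bare ID, and the IDs are assumed to contain no characters that are special to sed (such as "/" or "&"). Each sed argument is its own list element, with no shell quoting around the script and the target file passed separately:

```python
def build_sed_command(old_id, new_prefix, path):
    """Return an argv list for an in-place sed substitution (hypothetical helper)."""
    # No surrounding double quotes: without a shell, sed would see them literally.
    script = "s/{0}/{1} {0}/".format(old_id, new_prefix)
    # The file to edit is a separate argument, not part of the script string.
    return ["sed", "-i.orig", script, path]


cmd = build_sed_command("131006", "201335658-01", "/Users/user1/Desktop/phenotypes2")
print(cmd)

# To actually run it (needs sed on PATH and the file to exist):
# import subprocess
# subprocess.check_call(cmd)
```

Building the list in a separate step makes it easy to print and inspect the exact arguments before anything touches the file.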
The second issue is that you haven't provided the input file as a separate argument, so sed expects data from stdin; that is why it appears to hang.
If you want to make changes in files in place, you could use the fileinput module:
#!/usr/bin/env python
import fileinput
import re

files = ['/Users/user1/Desktop/phenotypes2']  # if it is None it behaves like sed
for line in fileinput.input(files, backup='.orig', inplace=True):
    print re.sub(r'\.*131006', '201335658-01 131006', line),
With inplace=True, fileinput.input() redirects stdout to the current file, i.e., print changes the file.
The trailing comma stops print from appending its own newline, so the newline already at the end of each line is not duplicated.
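For readers on Python 3, the same idea can be sketched as follows. The function name replace_in_file and the demo file contents are assumptions made for illustration; the fileinput and re calls are the standard-library ones used above:

```python
import fileinput
import os
import re
import tempfile


def replace_in_file(path, pattern, replacement, backup=".orig"):
    """Rewrite path in place, keeping the original at path + backup."""
    with fileinput.input(files=[path], inplace=True, backup=backup) as lines:
        for line in lines:
            # While inplace=True, print() writes into the file being rewritten.
            # end="" avoids adding a second newline to each line.
            print(re.sub(pattern, replacement, line), end="")


# Demo on a throwaway file with made-up contents.
demo_dir = tempfile.mkdtemp()
demo_path = os.path.join(demo_dir, "phenotypes2")
with open(demo_path, "w") as handle:
    handle.write("131006 1.5\n999999 2.0\n")

replace_in_file(demo_path, r"\.*131006", "201335658-01 131006")

with open(demo_path) as handle:
    result = handle.read()
```

In Python 3, end="" plays the role that the trailing comma plays in the Python 2 version.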

Related

Python - Error Caused by Space in argv Arument [duplicate]

I'm a Python learner. If I have lines of text in a file that look like this
"Y:\DATA\00001\SERVER\DATA.TXT" "V:\DATA2\00002\SERVER2\DATA2.TXT"
Can I split the lines around the inverted commas? The only constant would be their position in the file relative to the data lines themselves. The data lines could range from 10 to 100+ characters (they'll be nested network folders). I cannot see how I can use any other way to do those markers to split on, but my lack of python knowledge is making this difficult.
I've tried
optfile=line.split("")
and other variations but keep getting ValueError: empty separator. I can see why it's saying that, I just don't know how to change it. Any help is, as always, very appreciated.
Many thanks
You must escape the ":
input.split("\"")
results in
['\n',
'Y:\\DATA\x0001\\SERVER\\DATA.TXT',
' ',
'V:\\DATA2\x0002\\SERVER2\\DATA2.TXT',
'\n']
To drop the resulting empty lines:
[line for line in [line.strip() for line in input.split("\"")] if line]
results in
['Y:\\DATA\x0001\\SERVER\\DATA.TXT', 'V:\\DATA2\x0002\\SERVER2\\DATA2.TXT']
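The \x0001 in the results above is a side effect of typing the sample into an ordinary string literal, where \000 is read as an octal escape for a NUL character. Lines read from a file are not affected. A small sketch of the difference, using a raw literal for the sample (variable names are illustrative only):

```python
plain = "Y:\\DATA\00001"   # "\000" is an octal escape: a NUL followed by "01"
raw = r"Y:\DATA\00001"     # raw literal: every backslash is kept as typed

print(repr(plain))
print(repr(raw))

line = r'"Y:\DATA\00001\SERVER\DATA.TXT" "V:\DATA2\00002\SERVER2\DATA2.TXT"'
# Split on the quote character and drop the whitespace-only pieces.
paths = [part for part in line.split('"') if part.strip()]
print(paths)
```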
I'll just add that if you were dealing with lines that look like they could be command line parameters, then you could possibly take advantage of the shlex module:
import shlex
with open('somefile') as fin:
    for line in fin:
        print(shlex.split(line))
Would give:
['Y:\\DATA\\00001\\SERVER\\DATA.TXT', 'V:\\DATA2\\00002\\SERVER2\\DATA2.TXT']
No regex, no split, just use csv.reader:
import csv

sample_line = '10.0.0.1 foo "24/Sep/2015:01:08:16 +0800" www.google.com "GET /" -'

def main():
    for l in csv.reader([sample_line], delimiter=' ', quotechar='"'):
        print(l)

main()
The output is
['10.0.0.1', 'foo', '24/Sep/2015:01:08:16 +0800', 'www.google.com', 'GET /', '-']
shlex module can help you.
import shlex
my_string = '"Y:\DATA\00001\SERVER\DATA.TXT" "V:\DATA2\00002\SERVER2\DATA2.TXT"'
shlex.split(my_string)
This will output
['Y:\\DATA\x0001\\SERVER\\DATA.TXT', 'V:\\DATA2\x0002\\SERVER2\\DATA2.TXT']
Reference: https://docs.python.org/2/library/shlex.html
Finding all regular expression matches will do it:
import re
input = r'"Y:\DATA\00001\SERVER\DATA.TXT" "V:\DATA2\00002\SERVER2\DATA2.TXT"'
re.findall('".+?"', input)  # or use '"[^"]+"' as the pattern
This will return the list of file names, each still wrapped in its quotes:
['"Y:\\DATA\\00001\\SERVER\\DATA.TXT"', '"V:\\DATA2\\00002\\SERVER2\\DATA2.TXT"']
To get the file name without quotes use:
[f[1:-1] for f in re.findall('".+?"', input)]
or use re.finditer:
[f.group(1) for f in re.finditer('"(.+?)"', input)]
The following code splits the line at each occurrence of the inverted comma character (") and removes empty strings and those consisting only of whitespace.
[s for s in line.split('"') if s.strip() != '']
There is no need to use regular expressions, an escape character, some module or assume a certain number of whitespace characters between the paths.
Test:
line = r'"Y:\DATA\00001\SERVER\DATA.TXT" "V:\DATA2\00002\SERVER2\DATA2.TXT"'
output = [s for s in line.split('"') if s.strip() != '']
print(output)
>>> ['Y:\\DATA\\00001\\SERVER\\DATA.TXT', 'V:\\DATA2\\00002\\SERVER2\\DATA2.TXT']
I think what you want is to extract the file paths, which are separated by spaces. That is, you want to split the line between the quoted items. I.e., with a line
"FILE PATH" "FILE PATH 2"
You want
["FILE PATH","FILE PATH 2"]
In which case:
import re
with open('file.txt') as f:
    for line in f:
        print(re.split(r'(?<=")\s(?=")',line))
With file.txt:
"Y:\DATA\00001\SERVER\DATA MINER.TXT" "V:\DATA2\00002\SERVER2\DATA2.TXT"
Outputs:
>>>
['"Y:\\DATA\\00001\\SERVER\\DATA MINER.TXT"', '"V:\\DATA2\\00002\\SERVER2\\DATA2.TXT"']
This was my solution. It parses most sane input exactly the same as if it was passed into the command line directly.
import re

def simpleParse(input_):
    def reduce_(quotes):
        return '' if quotes.group(0) == '"' else '"'
    rex = r'("[^"]*"(?:\s|$)|[^\s]+)'
    return [re.sub(r'"{1,2}',reduce_,z.strip()) for z in re.findall(rex,input_)]
Use case: Collecting a bunch of single shot scripts into a utility launcher without having to redo command input much.
Edit:
Got OCD about the stupid way that the command line handles crappy quoting and wrote the below:
import re

# trial is the input string to tokenize (it is not defined in the original post)
tokens = list()
reading = False
qc = 0  # number of quote characters seen in the current token
lq = 0  # index of the most recent quote character
begin = 0
for z in range(len(trial)):
    char = trial[z]
    if re.match(r'[^\s]', char):
        if not reading:
            reading = True
            begin = z
            if re.match(r'"', char):
                begin = z
                qc = 1
            else:
                begin = z - 1
                qc = 0
            lc = begin
        else:
            if re.match(r'"', char):
                qc = qc + 1
                lq = z
    elif reading and qc % 2 == 0:
        reading = False
        if lq == z - 1:
            tokens.append(trial[begin + 1: z - 1])
        else:
            tokens.append(trial[begin + 1: z])
if reading:
    tokens.append(trial[begin + 1: len(trial) ])
tokens = [re.sub(r'"{1,2}',lambda y:'' if y.group(0) == '"' else '"', z) for z in tokens]
I know this got answered a million years ago, but this works too:
input = '"Y:\DATA\00001\SERVER\DATA.TXT" "V:\DATA2\00002\SERVER2\DATA2.TXT"'
input = input.replace('" "','"').split('"')[1:-1]
Should output it as a list containing:
['Y:\\DATA\x0001\\SERVER\\DATA.TXT', 'V:\\DATA2\x0002\\SERVER2\\DATA2.TXT']
My question Python - Error Caused by Space in argv Arument was marked as a duplicate of this one. We have a number of Python books going back to Python 2.3. The oldest referred to using a list for argv, but with no example, so I changed things to:-
repoCmd = ['Purchaser.py', 'task', repoTask, LastDataPath]
SWCore.main(repoCmd)
and in SWCore to:-
sys.argv = args
The shlex module worked but I prefer this.

bash 4.4 inside python with os.system

I have problems with running a bash script inside a python script script.py:
import os
bashCommand = """
sed "s/) \['/1, color=\"#ffcccc\", label=\"/g" list.txt | sed 's/\[/ GraphicFeature(start=/g' | sed 's/\:/, end=/g' | sed 's/>//g' | sed 's/\](/, strand=/g' | sed "s/'\]/\"),/g" >list2.txt"""
os.system("bash %s" % bashCommand)
When I run this as python script.py, no list2.txt is written, but on the terminal I see that I am inside bash-4.4 instead of the native macOS bash.
Any ideas what could cause this?
The script I posted above is part of a bigger script, where first it reads in some file and outputs list.txt.
edit: here comes some more description
In a first python script, I parsed a file (genbank file, to be specific), to write out a list with items (location, strand, name) into list.txt.
This list.txt has to be transformed to be parsable by a second python script, therefore the sed.
list.txt
[0:2463](+) ['bifunctional aspartokinase/homoserine dehydrogenase I']
[2464:3397](+) ['Homoserine kinase']
[3397:4684](+) ['Threonine synthase']
all the brackets, :, ' have to be replaced to look like desired output list2.txt
GraphicFeature(start=0, end=2463, strand=+1, color="#ffcccc", label="bifunctional aspartokinase/homoserine dehydrogenase I"),
GraphicFeature(start=2464, end=3397, strand=+1, color="#ffcccc", label="Homoserine kinase"),
GraphicFeature(start=3397, end=4684, strand=+1, color="#ffcccc", label="Threonine synthase"),
Read the file in Python, parse each line with a single regular expression, and output an appropriate line constructed from the captured pieces.
import re
import sys

#                         1     2               3
#                        ---   ---              --
regex = re.compile(r"^\[(\d+):(\d+)\]\(\+\) \['(.*)'\]$")
# 1 - start value
# 2 - end value
# 3 - text value

with open("list2.txt", "w") as out:
    for line in sys.stdin:
        line = line.strip()
        m = regex.match(line)
        if m is None:
            print(line, file=out)
        else:
            print('GraphicFeature(start={}, end={}, strand=+1, color="#ffcccc", label="{}"),'.format(*m.groups()), file=out)
I output lines that don't match the regular expression unmodified; you may want to ignore them altogether or report an error instead.
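The script above reads list.txt from standard input (for example, python script.py < list.txt). To check the regular expression against the sample lines without piping a file, the per-line logic can be pulled into a function. This is only a sketch: convert_line, FEATURE_RE and TEMPLATE are names invented here, and like the script above it handles only the (+) strand:

```python
import re

FEATURE_RE = re.compile(r"^\[(\d+):(\d+)\]\(\+\) \['(.*)'\]$")
TEMPLATE = 'GraphicFeature(start={}, end={}, strand=+1, color="#ffcccc", label="{}"),'


def convert_line(line):
    """Return the GraphicFeature form of a list.txt line, or the line unchanged."""
    line = line.strip()
    match = FEATURE_RE.match(line)
    if match is None:
        # Lines that do not match are passed through unmodified.
        return line
    return TEMPLATE.format(*match.groups())


sample = "[2464:3397](+) ['Homoserine kinase']"
print(convert_line(sample))
```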

Increase String by Sequential Index

In a file dealing with climatological variables involving a running mean with hours, the hours progress in sequence.
Is there a sed/awk command that would take that hour (string) in the file and then change it by two, so the next time the file is read it is (202), then (204), and so on?
See the number being added to 'i' below.
timeprime = i + 569
'define climomslp = prmslmsl(t = 'timeprime' )
My goal is to increase the number in this case, 569, by one each time the file runs through other commands involved in processing the data.
The next desired number next to i would be
timeprime = i + 570 (where 569 is increased by one)
after that...
timeprime = i + 571 (where 570 is increased by one)
If there isn't a sed/awk command to do such a thing, is there such a thing in any other method?
Thank you for any answers.
You can definitely do this in Python (or Perl, Ruby, or whatever other scripting language you like, but you included a Python tag). For example:
#!/usr/bin/env python
import re
import sys

def replace(m):
    return '{}{}'.format(m.group(1), int(m.group(2))+2)

for line in sys.stdin:
    sys.stdout.write(re.sub(r'(timeprime = i \+ )(\d+)', replace, line))
Hopefully the regex itself is trivial to understand:
(timeprime = i \+ )(\d+)
The sub function can take a function to be applied to the match object, instead of a string, as the "replacement". So, lines that don't match will be printed unchanged; lines that do will have the match substituted with the same two parts, but with the second part replaced by int(number)+2.
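Since the question asks for an increase of two in one place and of one in another, here is a sketch of the same callback technique with the increment as a parameter. The names bump and step are assumptions made for this example:

```python
import re

PATTERN = re.compile(r"(timeprime = i \+ )(\d+)")


def bump(line, step=1):
    """Add step to the number that follows 'timeprime = i + '."""
    def replace(match):
        # group(1) is the fixed prefix, group(2) is the number as text.
        return "{}{}".format(match.group(1), int(match.group(2)) + step)
    return PATTERN.sub(replace, line)


print(bump("timeprime = i + 569"))           # timeprime = i + 570
print(bump("timeprime = i + 569", step=2))   # timeprime = i + 571
print(bump("'define climomslp = prmslmsl(t = 'timeprime' )"))  # printed unchanged
```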
Here is an alternative using awk:
awk '/^timeprime = i [+]/{$5+=2} 1' file
Starting with this file:
$ cat file
timeprime = i + 569
'define climomslp = prmslmsl(t = 'timeprime' )
We can use the awk command to create a new file:
$ awk '/^timeprime = i [+]/{$5+=2} 1' file
timeprime = i + 571
'define climomslp = prmslmsl(t = 'timeprime' )
To overwrite the original file with the new one, use:
awk '/^timeprime = i [+]/{$5+=2} 1' file >file.tmp && mv file.tmp file
How it works
/^timeprime = i [+]/{$5+=2}
This looks for lines that start with ^timeprime = i + and, on those lines, the fifth field is incremented by 2.
1
This is awk's cryptic shorthand for print the line.

String portion after variable in .write() being put on new line in Python

I am relatively new to Python. So please excuse my naivety. While trying to write a string to a file, the portion of the string after the variable is put on a new line and it should not be. I am using python 2.6.5 btw
arch = subprocess.Popen("info " + agent + " | grep '\[arch\]' | awk '{print $3}'", shell=True, stdout=subprocess.PIPE)
arch, err = arch.communicate()
strarch = str(arch)
with open("agentInfo", "a") as info:
    info.write("Arch Bits: " + strarch + " bit")
    info.close()
os.system("cat agentInfo")
Desired output:
"Arch Bits: 64 bit"
Actual output:
"Arch Bits: 64
bits"
Looks like str(arch) has a trailing new line, you can remove that using str.strip or str.rstrip:
strarch = str(arch).strip() #removes all types of white-space characters
or:
strarch = str(arch).rstrip('\n') #removes only trailing '\n'
And you can also use string formatting here (with explicit field indices, because Python 2.6 does not accept auto-numbered {} fields):
strarch = str(arch).rstrip('\n')
info.write("{0}: {1} {2}".format("Arch Bits", strarch, "bits"))
Note that there's no need for info.close(); the with statement automatically closes the file for you.
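One caveat if this code is ever moved to Python 3 (the question uses 2.6.5): there, communicate() returns bytes by default, and str() on bytes produces text like b'64\n' rather than 64. A hedged sketch that handles both cases, where format_arch is a made-up helper name:

```python
def format_arch(raw_output):
    """Build the 'Arch Bits' line from raw command output (bytes or str)."""
    if isinstance(raw_output, bytes):
        # On Python 3, Popen.communicate() returns bytes unless text mode is requested.
        raw_output = raw_output.decode()
    # strip() removes the trailing newline left by the command.
    return "Arch Bits: {0} bit".format(raw_output.strip())


print(format_arch(b"64\n"))
print(format_arch("64\n"))
```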
