I have a bunch of files from a LAMMPS simulation, each made of a 9-line header and an Nx5 data array (N is large, on the order of 10000). A file looks like this:
ITEM: TIMESTEP
1700000
ITEM: NUMBER OF ATOMS
40900
ITEM: BOX BOUNDS pp pp pp
0 59.39
0 59.39
0 59.39
ITEM: ATOMS id type xu yu zu
1 1 -68.737755560980844 1.190046376093027 122.754819323806714
2 1 -68.334493269859621 0.365731265115530 122.943111038981527
3 1 -68.413018326512173 -0.456802254452782 123.436843456292138
4 1 -68.821350328206080 -1.360098170077123 123.314784135612115
5 1 -67.876948635447775 -1.533699833382506 123.072964235308660
6 1 -67.062910322675322 -2.006415676993953 123.431518511867381
7 1 -67.069984116148134 -2.899068427170739 123.057125785834685
8 1 -66.207325578729183 -3.292545155979909 123.377770523297343
...
I would like to open every file, perform a certain operation on the numerical data and save the file with a different name leaving the header unchanged. My script is:
for f in files:
    filename=path+"/"+f
    with open(filename) as myfile:
        header = ' '.join([next(myfile) for x in xrange(9)])
    data=np.loadtxt(filename,skiprows=9)
    data[:,2:5]%=L #Put everything inside the box...
    savetxt(filename.replace("lammpstrj","fold.lammpstrj"),data,header=header,comments="",fmt="%d %d %.15f %.15f %.15f")
The output, though, looks like this:
ITEM: TIMESTEP
1700000
ITEM: NUMBER OF ATOMS
40900
ITEM: BOX BOUNDS pp pp pp
0 59.39
0 59.39
0 59.39
ITEM: ATOMS id type xu yu zu
1 1 50.042244439019157 1.190046376093027 3.974819323806713
2 1 50.445506730140380 0.365731265115530 4.163111038981526
3 1 50.366981673487828 58.933197745547218 4.656843456292137
4 1 49.958649671793921 58.029901829922878 4.534784135612114
5 1 50.903051364552226 57.856300166617494 4.292964235308659
6 1 51.717089677324680 57.383584323006048 4.651518511867380
7 1 51.710015883851867 56.490931572829261 4.277125785834684
8 1 52.572674421270818 56.097454844020092 4.597770523297342
...
The header is not exactly the same: there is a space at the beginning of every line except the first, and an extra newline after the last line of the header. I need to get rid of those, but I don't know how.
What am I doing wrong?
The issue is in the ' '.join(a):
a = ['sadf\n', 'sdfg\n']
' '.join(a)
>>>'sadf\n sdfg\n' # Note the space at the start of the second line.
Instead:
''.join(a)
>>>'sadf\nsdfg\n'
You will also need to trim the last '\n' in your header to prevent the empty line:
''.join(a).rstrip()
>>>'sadf\nsdfg'
np.savetxt writes a newline after the header automatically, so the last '\n' of the header you read is redundant and can be stripped:
header = header.rstrip('\n')
The leading spaces occur because you join the lines with an extra space character. You can fix that by joining with an empty string instead:
header = ''.join([next(myfile) for x in xrange(9)])
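Putting the two fixes together, the header can also be copied to the output verbatim instead of going through the header argument. The following is a minimal sketch in plain Python, not the original script: it skips NumPy entirely, and the function name fold_dump, the in-memory streams and the two-atom sample are assumptions made for illustration.

```python
import io
from itertools import islice

def fold_dump(src, dst, box_length, header_lines=9):
    """Copy a dump from src to dst, wrapping xu, yu, zu into [0, box_length)."""
    # Each header line already ends with a newline, so join with the empty
    # string and write it unchanged: no leading spaces, no extra blank line.
    dst.write("".join(islice(src, header_lines)))
    for line in src:
        fields = line.split()
        if len(fields) != 5:
            continue  # skip blank or malformed rows
        x, y, z = (float(v) % box_length for v in fields[2:5])
        dst.write("%s %s %.15f %.15f %.15f\n" % (fields[0], fields[1], x, y, z))

# Hypothetical two-atom sample in the same layout as the question's files.
header = (
    "ITEM: TIMESTEP\n1700000\nITEM: NUMBER OF ATOMS\n2\n"
    "ITEM: BOX BOUNDS pp pp pp\n0 59.39\n0 59.39\n0 59.39\n"
    "ITEM: ATOMS id type xu yu zu\n"
)
rows = (
    "1 1 -68.737755560980844 1.190046376093027 122.754819323806714\n"
    "2 1 -68.334493269859621 0.365731265115530 122.943111038981527\n"
)
out = io.StringIO()
fold_dump(io.StringIO(header + rows), out, 59.39)
result = out.getvalue()
print(result, end="")
```

With real files you would pass two open file objects (input and output) instead of the StringIO streams.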
I have a program which converts a simple image (black lines on white background) into 2 character ASCII art ("x" is black and "-" is white).
I want to read each line and print the number of identical characters in a row (the run lengths) at the end of each line. Do you know how I can do this?
for example:
---x--- 3 1 3
--xxx-- 2 3 2
-xxxxx- 1 5 1
In the top row there are 3 dashes, 1 'x' and 3 dashes, and so on.
I would like these numbers to be saved to the ASCII text document.
Thank you!
You can use itertools.groupby:
from itertools import groupby
with open("art.txt", 'r') as f:
    for line in map(lambda l: l.strip(), f):
        runs = [sum(1 for _ in g) for _, g in groupby(line)]
        print(f"{line} {' '.join(map(str, runs))}")
# ---x--- 3 1 3
# --xxx-- 2 3 2
# -xxxxx- 1 5 1
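The snippet above only prints the result, while the question asks for the counts to be saved with the art. One possible way to do that, sketched here with a hypothetical helper annotate_runs, is to build each annotated line as a string first and then write the strings to whatever output file you choose.

```python
from itertools import groupby

def annotate_runs(line):
    """Return the line followed by the lengths of its runs of equal characters."""
    runs = [sum(1 for _ in group) for _, group in groupby(line)]
    return "{} {}".format(line, " ".join(map(str, runs)))

# Sample rows from the question; in practice these would come from the art file.
art = ["---x---", "--xxx--", "-xxxxx-"]
annotated = [annotate_runs(row) for row in art]
for row in annotated:
    print(row)
```

Writing the list out is then a matter of opening the target file in "w" mode and writing "\n".join(annotated) followed by a final newline.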
I am working on a Linux system using Python 3 with a file in .psl format, which is common in genetics. This is a tab-separated file in which some cells contain comma-separated values. A small example file with some of the features of a .psl file is below.
input.psl
1 2 3 x read1 8,9, 2001,2002,
1 2 3 mt read2 8,9,10 3001,3002,3003
1 2 3 9 read3 8,9,10,11 4001,4002,4003,4004
1 2 3 9 read4 8,9,10,11 4001,4002,4003,4004
I need to filter this file to extract only regions of interest. Here, I extract only rows with a value of 9 in the fourth column.
import csv
def read_psl_transcripts():
    psl_transcripts = []
    with open("input.psl") as input_psl:
        csv_reader = csv.reader(input_psl, delimiter='\t')
        for line in csv_reader:
            #Extract only rows matching chromosome of interest
            if '9' == line[3]:
                psl_transcripts.append(line)
    return psl_transcripts
I then need to be able to print or write these selected lines in a tab-delimited format matching the format of the input file, with no additional quotes or commas added. I can't seem to get this part right: additional brackets, quotes and commas are always added. Below is an attempt using print().
outF = open("output.psl", "w")
for line in read_psl_transcripts():
    print(str(line).strip('"\''), sep='\t')
Any help is much appreciated. Below is the desired output.
1 2 3 9 read3 8,9,10,11 4001,4002,4003,4004
1 2 3 9 read4 8,9,10,11 4001,4002,4003,4004
You might be able to solve your problem with a simple awk statement.
awk '$4 == 9' input.psl > output.psl
But with Python you could solve it like this:
with open("input.psl") as infile, open("output.psl", "w") as outfile:
    for line in infile:
        splitted_line = line.split()
        # Keep only rows whose fourth column is '9'
        if len(splitted_line) > 3 and splitted_line[3] == '9':
            out_line = '\t'.join(splitted_line)
            outfile.write(out_line + "\n")
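If you would rather stay with the csv module from the question, the brackets and quotes disappear once the rows are written back with csv.writer instead of being passed through str(). The following is a rough sketch under assumptions: filter_rows is a hypothetical helper, and the in-memory sample stands in for input.psl.

```python
import csv
import io

def filter_rows(src, dst, column, value):
    """Copy tab-separated rows from src to dst when row[column] == value."""
    reader = csv.reader(src, delimiter="\t")
    writer = csv.writer(dst, delimiter="\t", lineterminator="\n")
    for row in reader:
        if len(row) > column and row[column] == value:
            writer.writerow(row)

# In-memory stand-in for input.psl, with real tab characters between columns.
sample = (
    "1\t2\t3\tx\tread1\t8,9,\t2001,2002,\n"
    "1\t2\t3\tmt\tread2\t8,9,10\t3001,3002,3003\n"
    "1\t2\t3\t9\tread3\t8,9,10,11\t4001,4002,4003,4004\n"
    "1\t2\t3\t9\tread4\t8,9,10,11\t4001,4002,4003,4004\n"
)
out = io.StringIO()
filter_rows(io.StringIO(sample), out, 3, "9")
filtered = out.getvalue()
print(filtered, end="")
```

Because the delimiter is a tab, the commas inside a cell are ordinary characters and the writer leaves them unquoted.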
What my text is
$TITLE = XXXX YYYY
1 $SUBTITLE= XXXX YYYY ANSA
2 $LABEL = first label
3 $DISPLACEMENTS
4 $MAGNITUDE-PHASE OUTPUT
5 $SUBCASE ID = 30411
What I want
$TITLE = XXXX YYYY
1 $SUBTITLE= XXXX YYYY ANSA
2 $LABEL = new label
3 $DISPLACEMENTS
4 $MAGNITUDE-PHASE OUTPUT
5 $SUBCASE ID = 30411
The code I am using
import re
fo=open("test5.txt", "r+")
num_lines = sum(1 for line in open('test5.txt'))
count=1
while (count <= num_lines):
    line1=fo.readline()
    j= line1[17 : 72]
    j1=re.findall('\d+', j)
    k=map(int,j1)
    if (k==[30411]):
        count1=count-4
        line2=fo.readlines()[count1]
        r1=line2[10:72]
        r11=str(r1)
        r2="new label"
        r22=str(r2)
        newdata = line2.replace(r11,r22)
        f1 = open("output7.txt",'a')
        lines=f1.writelines(newdata)
    else:
        f1 = open("output7.txt",'a')
        lines=f1.writelines(line1)
    count=count+1
The problem is in writing the lines. Once 30411 is found, the script has to go 3 lines back and change the label to the new one. The output file should contain the same lines as before, except for the label line, but it is not being written properly. Can anyone help?
Apart from many blood-curdling but noncritical problems, you are calling readlines() in the middle of an iteration using readline(), causing you to read lines not from the beginning of the file but from the current position of the fo handle, i.e. after the line containing 30411.
You need to open the input file again with a separate handle or (better) store the last 4 lines in memory instead of rereading the one you need to change.
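The second suggestion can be sketched roughly as follows: read all lines into a list, find the subcase line, and edit the line a fixed distance before it. This is only an illustration under assumptions: the function name replace_label, the regular expressions, and the premise that the label text runs to the end of its line are not taken from the original script.

```python
import re

def replace_label(lines, subcase_id, new_label, lookback=3):
    """Return a copy of lines with the $LABEL line `lookback` lines before
    the matching $SUBCASE ID line rewritten to carry new_label."""
    result = list(lines)
    subcase = re.compile(r"\$SUBCASE ID\s*=\s*(\d+)")
    label = re.compile(r"(\$LABEL\s*=\s*).*")
    for i, line in enumerate(result):
        match = subcase.search(line)
        if match and int(match.group(1)) == subcase_id:
            j = i - lookback
            if j >= 0 and label.search(result[j]):
                # Keep everything up to and including '=', swap the rest.
                result[j] = label.sub(lambda m: m.group(1) + new_label, result[j])
    return result

# The sample text from the question, one list entry per line.
text = [
    "$TITLE = XXXX YYYY\n",
    "1 $SUBTITLE= XXXX YYYY ANSA\n",
    "2 $LABEL = first label\n",
    "3 $DISPLACEMENTS\n",
    "4 $MAGNITUDE-PHASE OUTPUT\n",
    "5 $SUBCASE ID = 30411\n",
]
updated = replace_label(text, 30411, "new label")
print("".join(updated), end="")
```

For a real file, the list would come from readlines() on the input and the result would be written once with writelines() on the output, so no handle is read twice.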
So I want to count the occurrences of certain words, per line, in a text file. How many times each specific word occurred doesn't matter, just how many times any of them occurred per line. I have a file containing a list of words, delimited by newline characters. It looks like this:
amazingly
astoundingly
awful
bloody
exceptionally
frightfully
.....
very
I then have another text file containing lines of text. Let's say for example:
frightfully frightfully amazingly Male. Don't forget male
green flag stops? bloody bloody bloody bloody
I'm biased.
LOOKS like he was headed very
green flag stops?
amazingly exceptionally exceptionally
astoundingly
hello world
I want my output to look like:
3
4
0
1
0
3
1
Here's my code:
def checkLine(line):
    count = 0
    with open("intensifiers.txt") as f:
        for word in f:
            if word[:-1] in line:
                count += 1
    print count
for line in open("intense.txt", "r"):
    checkLine(line)
Here's my actual output:
4
1
0
1
0
2
1
0
Any ideas?
How about this:
def checkLine(line):
    with open("intensifiers.txt") as fh:
        line_words = line.rstrip().split(' ')
        check_words = [word.rstrip() for word in fh]
        print(sum(line_words.count(w) for w in check_words))
for line in open("intense.txt", "r"):
    checkLine(line)
Output:
3
4
0
1
0
3
1
0
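The original version goes wrong for two reasons: `word in line` is a substring test on the whole line, and each intensifier is counted at most once per line however often it appears. Splitting the line into words and counting every match avoids both. A possible variant, sketched below with a hypothetical count_intensifiers helper, also keeps the word list in a set so it is loaded only once; the set shown contains just the words visible in the question.

```python
def count_intensifiers(line, intensifiers):
    """Count how many whitespace-separated words of line are in intensifiers."""
    return sum(1 for word in line.split() if word in intensifiers)

# Only the words visible in the question; the real list is longer.
intensifiers = {"amazingly", "astoundingly", "awful", "bloody",
                "exceptionally", "frightfully", "very"}

lines = [
    "frightfully frightfully amazingly Male. Don't forget male",
    "green flag stops? bloody bloody bloody bloody",
    "I'm biased.",
    "LOOKS like he was headed very",
    "green flag stops?",
    "amazingly exceptionally exceptionally",
    "astoundingly",
    "hello world",
]
counts = [count_intensifiers(line, intensifiers) for line in lines]
print(counts)
```

Note that matching is exact, so a word followed by punctuation (for example "very.") would not be counted unless the punctuation is stripped first.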
I have previously found a way to count the prefixes, as shown below, so is there a way similar to this which is so obvious I'm missing it completely?
for i in range (0, len(hardprefix)):
    if len(word) > len(hardprefix[i]):
        if word.startswith(hardprefix[i]):
            hardprefixcount += 1
            break
I need this code to use the first column of the file and count how many of its words end with one of a fixed set of suffixes.
This is what I have so far:
for i in range (0, len(easysuffix)):
    if len (word) > len(easysuffix[i]):
        if word.endswith(easysuffix[i]):
            easysuffixcount += 1
            break
Below is a sample of my data from the CSV file, followed by the suffix arrays:
on 1
only 4
our 1
own 1
part 7
piece 4
pieces 4
place 1
pressed 1
riot 1
september 1
shape 3
hardsuffix = ['ism']
easysuffix = ['ity', 'esome', 'ece']
Your input file is tab delimited CSV so you can use the csv module to process it.
import csv
suffixes = ['ity', 'esome', 'ece']
with open('input.csv') as words:
    suffix_count = 0
    reader = csv.reader(words, delimiter='\t')
    for word, _ in reader:
        # Count the word once if it ends with any of the suffixes
        if any(word.endswith(suffix) for suffix in suffixes):
            suffix_count += 1
print("Found {} suffix(es)".format(suffix_count))
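The code above drops the length check from the question's loop, so a word that consists of nothing but a suffix would also be counted. If that check matters, it can be kept inside the any() test. The sketch below does that in a reusable function so the same code serves both suffix lists; count_suffixed and the tab-separated in-memory sample are hypothetical stand-ins for your file.

```python
import csv
import io

def count_suffixed(rows, suffixes):
    """Count rows whose first column ends with a suffix and is longer than it."""
    count = 0
    for row in rows:
        if not row:
            continue  # skip empty lines
        word = row[0]
        if any(len(word) > len(s) and word.endswith(s) for s in suffixes):
            count += 1
    return count

# A few rows from the question's sample, assumed to be tab-separated.
sample = "on\t1\nonly\t4\npiece\t4\npieces\t4\nplace\t1\nshape\t3\n"
rows = list(csv.reader(io.StringIO(sample), delimiter="\t"))

easy_count = count_suffixed(rows, ["ity", "esome", "ece"])
hard_count = count_suffixed(rows, ["ism"])
print(easy_count, hard_count)
```

In this sample only "piece" matches (it ends with "ece"), while "pieces" does not because the suffix has to sit at the very end of the word.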