I'm still new to Python.
I have a text file with a list of numbers, and each number has two 'attributes' along with it:
250 121 6000.654
251 8472 650.15614
252 581 84.2
I want to search the 1st column for a value and return the 2nd and 3rd columns as separate variables so I can use them later.
cmd = """ cat new.txt | nawk '/'251'/{print $2}' """
os.system(cmd)
This works in that it prints the $2 column, but I want to assign that output to a variable, something like this (however, AFAIK this only returns the command's exit status, not its output):
cmdOutput = os.system(cmd)
I would also like to change the value nawk searches for based on a variable, something like this:
cmd = """ cat new.txt | nawk '/'$input'/{print $2}' """
If anyone can help, thanks.
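For what it's worth, if you do want to capture a shell command's output from Python, the subprocess module can do it; os.system only gives you the exit status. A minimal sketch, assuming Python 2.7+ and letting nawk read the file itself instead of going through cat:
import subprocess

target = '251'  # the value you would otherwise splice into the nawk pattern
cmd = "nawk -v key=%s '$1 == key {print $2}' new.txt" % target
cmdOutput = subprocess.check_output(cmd, shell=True).strip()  # bytes on Python 3
print(cmdOutput)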
Don't use cat and nawk. Please.
Just use Python
import sys
target = raw_input('target: ')  # or target = sys.argv[1]
with open('new.txt', 'r') as source:
    for columns in (raw.strip().split() for raw in source):
        if columns[0] == target: print columns[1]
No cat. No nawk.
First of all, to format the cmd string, use
input = '251'
cmd = """ cat new.txt | nawk '/'{input}'/{{print $2}}' """.format(input=input)
But really, you don't need an external command at all.
input = '251'
with open('new.txt', 'r') as f:
    for line in f:
        lst = line.split()
        if lst[0] == input:
            column2, column3 = int(lst[1]), float(lst[2])
            break
    else:  # the input wasn't found
        column2, column3 = None, None
print(column2, column3)
I think what you're looking for is:
subprocess.Popen(["cat", "new.txt","|","nawk","'/'$input/{print $2}'"], stdout=subprocess.PIPE).stdout
Related
I'm trying to write a script in Python. I have a .txt file containing a column of numbers in this format (example: 198900.000). The idea is to read these numbers one by one and, one by one, use them as input for a command:
Python reads the first number;
Python use the number as input in the command;
The command writes an output named, let's say, 'output_number'
The command should run iteratively until the end of the column (the number of rows is unknown).
Can you please help me? Thank you!
list.txt:
600.000
630.000
640.000
650.000
660.000
680.000
690.000
720.000
740.000
750.000
770.000
780.000
800.000
810.000
820.000
830.000
840.000
850.000
860.000
3310.000
Perhaps:
print(open("list.txt").read().split())
OUTPUT:
['600.000', '630.000', '640.000', '650.000', '660.000', '680.000', '690.000', '720.000', '740.000', '750.000', '770.000', '780.000', '800.000', '810.000', '820.000', '830.000', '840.000', '850.000', '860.000', '3310.000']
OR
with open("list.txt","r") as f:
lines = f.readlines()
for x in lines:
print("output_number: {}".format(x))
OUTPUT:
output_number: 600.000
output_number: 630.000
output_number: 640.000
output_number: 650.000
output_number: 660.000
.
.
.
EDIT:
OP: Python has to read the first number and, for the first number, do a command. For example: Python reads the first number '600.00' and then does (I write the command so it's clear) ' gmx trjconv -dump 600.00 -output dump_600.00 '. Then Python has to repeat the same for all the numbers that are present in the column.
commands_list = {'clear': ' gmx trjconv -dump XxX -output dump_XxX '}

def callCommand(x):
    cmd = input("Enter command:")
    if cmd in commands_list:
        print(commands_list[cmd].replace("XxX", x))
    else:
        print("Command does not exist, quitting")
        exit()

with open("list.txt", "r") as f:
    lines = f.readlines()
    for x in lines:
        callCommand(x)
        print("output_number: {}".format(x))
OUTPUT:
Enter command:clear
gmx trjconv -dump 600.000
-output dump_600.000
output_number: 600.000
Enter command:blahhh
Command does not exist, quitting
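If the goal is to actually run the command rather than just print it, here is a minimal sketch using subprocess (the gmx trjconv flags are copied verbatim from the OP's comment and are not verified here; assumes Python 3.5+ for subprocess.run):
import subprocess

with open("list.txt") as f:
    for line in f:
        number = line.strip()
        if not number:
            continue
        # builds e.g.: gmx trjconv -dump 600.000 -output dump_600.000
        cmd = ["gmx", "trjconv", "-dump", number, "-output", "dump_" + number]
        subprocess.run(cmd, check=True)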
Assuming that you only have that one column with the numbers.
If you have more than one column, you can use the fieldnames attribute to pinpoint your column.
import csv

output_number = list()
with open(r'yourfiledirectory\yourfile.txt', 'r') as f:
    reader = csv.reader(f)
    for row in reader:
        output_number.append(row)
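For the multi-column case mentioned above, a hedged sketch with csv.DictReader (this assumes the file has a header row; the column name 'number' is made up for illustration):
import csv

with open('yourfile.txt', 'r') as f:
    reader = csv.DictReader(f)
    print(reader.fieldnames)                           # column names taken from the header row
    output_number = [row['number'] for row in reader]  # 'number' is a hypothetical header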
I mostly use one-liners in shell scripting.
If I have a file with contents as below:
1
2
3
and want it to be pasted like:
1 1
2 2
3 3
how can I do it from a shell script using a Python one-liner?
PS: I tried the following:
python -c "file = open('array.bin','r' ) ; cont=file.read ( ) ; print cont*3;file.close()"
but it printed the contents like:
1
2
3
1
2
3
file = open('array.bin', 'r')
cont = file.readlines()
for line in cont:
    print line.strip(), line.strip()
file.close()
You could replace your print cont*3 with the following:
print '\n'.join(' '.join(ch * n) for ch in cont.strip().split())
Here n is the number of columns. (Note that this relies on each entry being a single character: ch * n repeats the string, and ' '.join then inserts a space between every character of the result.)
You need to break up the lines and then reassemble:
One Liner:
python -c "file=open('array.bin','r'); cont=file.readlines(); print '\n'.join([' '.join([c.strip()]*2) for c in cont]); file.close()"
Long form:
file=open('array.bin', 'r')
cont=file.readlines()
print '\n'.join([' '.join([c.strip()]*2) for c in cont])
file.close()
With array.bin having:
1
2
3
Gives:
1 1
2 2
3 3
Unfortunately, you can't use a simple for statement for a one-liner solution (as suggested in a previous answer). As this answer explains, "as soon as you add a construct that introduces an indented block (like if), you need the line break."
Here's one possible solution that avoids this problem:
Open file and read lines into a list
Modify the list (using a list comprehension). For each item:
Remove the trailing new line character
Repeat it n times, joined by spaces (n = number of columns)
Join the modified list using the new line character as separator
Print the joined list and close the file
Detailed/long form (n = number of columns):
f = open('array.bin', 'r')
n = 5
original = list(f)
modified = [' '.join([line.strip()] * n) for line in original]
print('\n'.join(modified))
f.close()
One-liner:
python -c "f = open('array.bin', 'r'); n = 5; print('\n'.join([line.strip()*n for line in list(f)])); f.close()"
REPEAT_COUNT=3 && cat contents.txt | python -c "print('\n'.join(' '.join([w.strip()] * ${REPEAT_COUNT}) for w in open('/dev/stdin').readlines()))"
First, test from the command prompt:
paste -d" " array.bin array.bin
EDIT:
OP wants to use a variable n to set how many columns are needed.
There are different ways to repeat a command 10 times, such as
for i in {1..10}; do echo array.bin; done
seq 10 | xargs -I -- echo "array.bin"
source <(yes echo "array.bin" | head -n10)
yes "array.bin" | head -n10
Other ways are given by https://superuser.com/a/86353 and I will use a variation of
printf -v spaces '%*s' 10 ''; printf '%s\n' ${spaces// /ten}
My solution is
paste -d" " $(printf "%*s" $n " " | sed 's/ /array.bin /g')
I need to remove the trailing zeros from an export.
The code reads the original tempFile; I need columns 2 and 6, which contain:
12|9781624311390|1|1|0|0.0000
13|9781406273687|1|1|0|99.0000
14|9781406273717|1|1|0|104.0000
15|9781406273700|1|1|0|63.0000
The awk command changes the format to comma-separated and dumps columns 2 and 6 into tempFile2, and I need to remove the trailing zeros from column 6 so the end result looks like this:
9781624311390,0
9781406273687,99
9781406273717,104
9781406273700,63
I believe this should do the trick but have had no luck implementing it:
awk '{sub("\\.*0+$",""); print}'
Below is the code I need to adjust; $6 is the column to remove zeros from:
if not isError:
    print "Translating SQL output to tab delimited format"
    awkRunSuccess = os.system(
        "awk -F\"|\" '{print $2 \"\\,\" $6}' %s > %s" %
        (tempFile, tempFile2)
    )
    if awkRunSuccess != 0: isError = True
You can use gsub("\\.*0+$","",$2) to do this, as per the following transcript:
pax> echo '9781624311390|0.0000
9781406273687|99.0000
9781406273717|104.0000
9781406273700|63.0000' | awk -F'|' '{gsub("\\.*0+$","",$2);print $1","$2}'
9781624311390,0
9781406273687,99
9781406273717,104
9781406273700,63
However, given you're already within Python (and it's no slouch when it comes to regexes), you'd probably want to use it natively rather than start up an awk process.
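A rough native equivalent of that awk step done directly in Python (a sketch only, assuming Python 2.7+/3 for the multi-file with statement; tempFile and tempFile2 are the same path variables already used in your script, and the regex mirrors the gsub pattern above):
import re

with open(tempFile) as src, open(tempFile2, 'w') as dst:
    for line in src:
        fields = line.rstrip('\n').split('|')
        value = re.sub(r'\.*0+$', '', fields[5])  # strip the trailing .0000, like the gsub above
        dst.write('%s,%s\n' % (fields[1], value))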
Try this awk command
awk -F '[|.]' '{print $2","$(NF-1)}' FileName
Output:
9781624311390,0
9781406273687,99
9781406273717,104
9781406273700,63
I have a single attribute file that has two columns. The string in column 1 matches strings in the files that need to be changed, and each of those strings needs to be replaced by the corresponding string from column 2 of File 1.
I'm not sure of the best way to approach this: sed? awk? There is only a single File 1, which has every key/value pair, and they are all unique. There are over 10,000 File 2s, each different but in the same format, that I need to change from the numbers to the names. Every number in any of the File 2s will be in File 1.
File 1
1000079541 ALBlai_CCA27168
1000079542 ALBlai_CCA27169
1000082614 PHYsoj_128987
1000082623 PHYsoj_128997
1000112581 PHYcap_Phyca_508162
1000112588 PHYcap_Phyca_508166
1000112589 PHYcap_Phyca_508170
1000112592 PHYcap_Phyca_549547
1000120087 HYAara_HpaP801280
1000134210 PHYinf_PITG_01218T0
1000134213 PHYinf_PITG_01223T0
1000134221 PHYinf_PITG_01231T0
1000144497 PHYinf_PITG_13921T0
1000153541 PYTultPYU1_T002777
1000162512 PYTultPYU1_T013706
1000163504 PYTultPYU1_T014907
1000168326 PHYram_79731
1000168327 PHYram_79730
1000168332 PHYram_79725
1000168335 PHYram_79722
...
File 2
(1000079542:0.60919245567850022205,((1000162512:0.41491233674846345059,(1000153541:0.39076742568979516701,1000163504:0.52813999143574519302):0.14562273102476630537):0.28880212838980307000,(((1000144497:0.20364901110426453235,1000168327:0.22130795712572320921):0.35964649479701132906,((1000120087:0.34990382691181332042,(1000112588:0.08084123331549526725,(1000168332:0.12176200773214326811,1000134213:0.09481932223544080329):0.00945982345360765406):0.01846847662360769429):0.19758412044470402558,((1000168326:0.06182031367986642878,1000112589:0.07837371928562210377):0.03460740736793390532,(1000134210:0.13512192366876615846,(1000082623:0.13344777464787777044,1000112592:0.14943677128375676411):0.03425386814075986885):0.05235436818005634318):0.44112430521695145114):0.21763784827666701749):0.22507080810857052477,(1000112581:0.02102132893524749635,(1000134221:0.10938436290969000275,(1000082614:0.05263067805665807425,1000168335:0.07681947209386902342):0.03562545894572662769):0.02623229853693959113):0.49114147006852687527):0.23017851954961116023):0.64646763541457552549,1000079541:0.90035900920746847476):0.0;
Desired Result
(ALBlai_CCA27169:0.60919245567850022205,((PYTultPYU1_T013706:0.41491233674846345059, ...
Python:
import re

# Build a dictionary of replacements:
with open('File 1') as f:
    repl = dict(line.split() for line in f)

# Read in the file and make the replacements:
with open('File 2') as f:
    data = f.read()
data = re.sub(r'(\d+):', lambda m: repl[m.group(1)] + ':', data)

# Write it back out:
with open('File 2', 'w') as f:
    f.write(data)
Full running awk solution. Hope it helps.
awk -F":" 'BEGIN {
while (getline < "file1")
{
split($0,dat," ");
a[dat[1]]=dat[2];
}
}
{
gsub(substr($1,2,length($1)),a[substr($1,2,length($1))],$0); print
}' file2
I'd do something like this in bash:
while read -r key value
do
    echo "s/$key:/$value:/g" >> sedtmpfile
done < file1
sed -f sedtmpfile file2 > result
rm sedtmpfile
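With the first line of File 1, for example, this writes the following command into sedtmpfile:
s/1000079541:/ALBlai_CCA27168:/g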
I have a set of 1000 text files with names in_s1.txt, in_s2.txt and so on. Each file contains millions of rows and each row has 7 columns like:
ccc245 1 4 5 5 3 -12.3
The most important things for me are the values in the first and seventh columns, i.e. pairs like ccc245, -12.3.
What I need to do is find, across all the in_sXXXX.txt files, the 10 cases with the lowest values in the seventh column, and I also need to know where each value is located, i.e. in which file. I need something like:
FILE 1st_col 7th_col
in_s540.txt ccc3456 -9000.5
in_s520.txt ccc488 -723.4
in_s12.txt ccc34 -123.5
in_s344.txt ccc56 -45.6
I was thinking about using Python and bash for this purpose, but so far I haven't found a practical approach. All I know how to do is:
concatenate all the in_ files into IN.TXT
find the lowest values there using: for i in IN.TXT ; do sort -k6n $i | head -n 10; done
given the 1st_col and 7th_col values of the top-ten list, use them to filter the in_s files with grep -n VALUE in_s*, so that for each value I get the name of the file
It works, but it is a bit tedious. I wonder about a faster approach using only bash or Python, or both. Or another, better language for this.
Thanks
In python, use the nsmallest function in the heapq module -- it's designed for exactly this kind of task.
Example (tested) for Python 2.5 and 2.6:
import heapq, glob

def my_iterable():
    for fname in glob.glob("in_s*.txt"):
        f = open(fname, "r")
        for line in f:
            items = line.split()
            yield fname, items[0], float(items[6])
        f.close()

result = heapq.nsmallest(10, my_iterable(), lambda x: x[2])
print result
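For what it's worth, on Python 3 the only changes needed would be the print call and (optionally) spelling out the key argument, e.g.:
result = heapq.nsmallest(10, my_iterable(), key=lambda x: x[2])
print(result)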
Update after above answer accepted
Looking at the source code for Python 2.6, it appears that there's a possibility that it does list(iterable) and works on that ... if so, that's not going to work with a thousand files each with millions of lines. If the first answer gives you MemoryError etc, here's an alternative which limits the size of the list to n (n == 10 in your case).
Note: 2.6 only; if you need it for 2.5 use a conditional heapreplace() as explained in the docs. Uses heappush() and heappushpop() which don't have the key arg :-( so we have to fake it.
import glob
from heapq import heappush, heappushpop
from pprint import pprint as pp

def my_iterable():
    for fname in glob.glob("in_s*.txt"):
        f = open(fname, "r")
        for line in f:
            items = line.split()
            yield -float(items[6]), fname, items[0]
        f.close()

def homegrown_nlargest(n, iterable):
    """Ensures heap never has more than n entries"""
    heap = []
    for item in iterable:
        if len(heap) < n:
            heappush(heap, item)
        else:
            heappushpop(heap, item)
    return heap

result = homegrown_nlargest(10, my_iterable())
result = sorted(result, reverse=True)
result = [(fname, fld0, -negfld6) for negfld6, fname, fld0 in result]
pp(result)
I would:
take the first 10 items,
sort them, and then
for every line read from the files, insert the element into that top 10
in case its value is lower than the highest one in the current top 10
(keeping the list sorted for performance).
I wouldn't post the complete program here, as it looks like homework.
Yes, if it weren't ten, this would not be optimal.
Try something like this in Python:
import os

min_values = []

def add_to_min(file_name, one, seven):
    # checks to see if the 7th column is lower than the existing values
    if len(min_values) < 10 or seven < max(min_values)[0]:
        if len(min_values) >= 10:
            # let's remove the biggest value
            min_values.sort()
            min_values.pop()
        # and add the new value tuple
        min_values.append((seven, file_name, one))

# loop through all the files
for file_name in os.listdir(<dir>):
    f = open(file_name)
    for line in f:
        columns = line.split()
        add_to_min(file_name, columns[0], float(columns[6]))

# print answers
for (seven, file_name, one) in min_values:
    print file_name, one, seven
Haven't tested it, but it should get you started.
Version 2, just runs the sort a single time (after a prod by S. Lott):
values = []

# loop through all the files and make a long list of all the rows
for file_name in os.listdir(<dir>):
    f = open(file_name)
    for line in f:
        columns = line.split()
        values.append((float(columns[6]), file_name, columns[0]))

# sort values, print the 10 smallest
values.sort()
for (seven, file_name, one) in values[:10]:
    print file_name, one, seven
Just re-read your question: with millions of rows, you might run out of RAM....
A small improvement of your shell solution:
$ cat in.txt
in_s1.txt
in_s2.txt
...
$ cat in.txt | while read i
do
cat $i | sed -e "s/^/$i /" # add filename as first column
done |
sort -n -k8 | head -10 | cut -d" " -f1,2,8
This might be close to what you're looking for:
for file in *; do sort -k7n "$file" | head -n 10 | cut -f1,7 -d " " | sed "s/^/$file /" > "${file}.out"; done
cat *.out | sort -k3n | head -n 10 > final_result.out
If your files are millions of lines long, you might want to consider using "buffering". The script below goes through those millions of lines, each time comparing field 7 with the values in the buffer. If a value is smaller than one in the buffer, that buffer entry is replaced by the new, lower value.
for file in in_*.txt
do
awk 'NR<=10{
c=c+1
val[c]=$7
tag[c]=$1
}
NR>10{
for(o=1;o<=c;o++){
if ( $7 <= val[o] ){
val[o]=$7
tag[o]=$1
break
}
}
}
END{
for(i=1;i<=c;i++){
print val[i], tag[i] | "sort"
}
}' $file
done