I'm using 'wc -l' from within my Python code to count the lines of a file with 50 columns and 3000 records:
cmd = 'wc -l /path/to/file'
status, output = commands.getstatusoutput(cmd)
I also tried the pure-Python approach below:
row_count = sum(1 for line in open(file_path))
I timed both, and wc -l is faster; I just don't know the reasons behind this. Could you explain why?
Example timings:
wc -l : 0.005s
python : 0.54s
Try this one:
with open("inp.txt", "r") as inpt:
    print(len(inpt.readlines()))
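One reason for the gap: wc is a small C program that counts newline bytes in large buffered reads, while the Python versions decode each line into an object (and readlines() additionally holds the whole file in memory as a list). A rough sketch of a buffered count in Python that narrows the gap (the file name inp.txt is just the placeholder used above):

def count_lines(path, bufsize=1024 * 1024):
    # Read raw bytes in large chunks and count newline characters,
    # which is essentially what wc -l does.
    count = 0
    with open(path, "rb") as f:
        chunk = f.read(bufsize)
        while chunk:
            count += chunk.count(b"\n")
            chunk = f.read(bufsize)
    return count

print(count_lines("inp.txt"))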
I have a large tab-delimited file that has been gzipped, and I would like to know how many columns it has. For small files I can just unzip and read it into Python, but for large files this is slow. Is there a way to count the columns quickly without loading the whole file into Python?
"Efficiently counting number of columns of text file" is almost identical, but since my files are gzipped, just reading the first line won't work. Is there a way to make Python efficiently unzip just enough to read the first line?
... but since my files are gzipped just reading the first line won't work.
Yes it will.
import csv
import gzip
with gzip.open('file.tsv.gz', 'rt') as gzf:
    reader = csv.reader(gzf, dialect=csv.excel_tab)
    print(len(next(reader)))
This can be done with traditional unix command line tools. For example:
$ zcat file.tsv.gz | head -n 1 | tr $'\t' '\n' | wc -l
zcat (or gunzip -c) unzips and writes to standard output without modifying the file. 'head -n 1' reads exactly one line and outputs it. The 'tr' command replaces tabs with newlines, and 'wc -l' counts the resulting lines. Because 'head -n 1' exits after one line, it terminates the zcat command as well, so this is quite fast. If the first line of the file is a header, simply omit the 'wc -l' to see what the headers are.
I have a Python script A.py that takes as arguments a target file containing a list of IPs and outputs a CSV file with the information found about those IPs from some sources (run method: python A.py Input.txt -c Output.csv).
It took ages to get the work done. Later, I split the input file (split -l 1000 Input.txt), created 10 directories, and ran the script in parallel in screen mode on the split inputs in those 10 directories.
How can I do this kind of job efficiently? Any suggestions?
Try this:
parallel --round --pipepart -a Input.txt --cat python A.py {} -c {#}.csv
If A.py can read from a fifo then this is more efficient:
parallel --round --pipepart -a Input.txt --fifo python A.py {} -c {#}.csv
If your disk has long seek times then it might be faster to use --pipe instead of --pipepart.
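The question doesn't show A.py, but for --cat or --fifo to work it only needs to treat its first argument as a path to read IPs from and honour the -c output option. A purely hypothetical sketch of that interface (the lookup function and field names are invented for illustration):

import argparse
import csv

def lookup(ip):
    # stand-in for whatever sources A.py actually queries
    return {"ip": ip, "info": "..."}

parser = argparse.ArgumentParser()
parser.add_argument("input")                  # file (or fifo) with one IP per line
parser.add_argument("-c", dest="csv", required=True)
args = parser.parse_args()

with open(args.input) as src, open(args.csv, "w") as dst:
    writer = csv.DictWriter(dst, fieldnames=["ip", "info"])
    writer.writeheader()
    for line in src:
        ip = line.strip()
        if ip:
            writer.writerow(lookup(ip))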
I am new to this site so hopefully this is the correct location to place this question.
I am trying to write a Python script for Linux that:
creates a file file.txt
appends the output of the 'lsof' command to file.txt
reads each line of the output and appends it to an array.
then prints each line.
I'm basically just doing this to familiarize myself with using Python for shell-style tasks; I'm new to this area, so any help would be great, and I'm not sure where to go from here. Also, if there is a better way to do this, I'm open to that!
#!/usr/bin/env python
import subprocess
touch = "touch file.txt"
subprocess.call(touch, shell=True)
xfile = "file.txt"
connection_count = "lsof -i tcp | grep ESTABLISHED | wc -l"
count = subprocess.call(connection_count, shell=True)
if count > 0:
    connection_lines = "lsof -i tcp | grep ESTABLISHED >> file.txt"
    subprocess.call(connection_lines, shell=True)

with open(subprocess.call(xfile, shell=True), "r") as ins:
    array = []
    for line in ins:
        array.append(line)

for i in array:
    print i
subprocess.call returns the return code for the process that was started ($? in bash). This is almost certainly not what you want -- and explains why this line almost certainly fails:
with open(subprocess.call(xfile, shell=True), "r") as ins:
(you can't open a number).
Likely, you want to be using subprocess.Popen with stdout=subprocess.PIPE. Then you can read the output from the pipe. e.g. to get the count, you probably want something like:
connection_count = "lsof -i tcp | grep ESTABLISHED"
proc = subprocess.Popen(connection_count, shell=True, stdout=subprocess.PIPE)
# line counting moved to python :-)
count = sum(1 for unused_line in proc.stdout)
(you could also use Popen.communicate here)
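For reference, a minimal sketch of the communicate() variant, which collects all of the output at once and waits for the process to exit:

proc = subprocess.Popen(connection_count, shell=True, stdout=subprocess.PIPE)
out, _ = proc.communicate()          # entire stdout as one string/bytes object
count = len(out.splitlines())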
Note: excessive use of shell=True is always a bit scary for me... It's much better to chain your pipes together, as demonstrated in the documentation.
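A sketch of what that chaining might look like for the lsof | grep pipeline above, following the "replacing shell pipeline" pattern from the subprocess documentation:

import subprocess

lsof = subprocess.Popen(["lsof", "-i", "tcp"], stdout=subprocess.PIPE)
grep = subprocess.Popen(["grep", "ESTABLISHED"],
                        stdin=lsof.stdout, stdout=subprocess.PIPE)
lsof.stdout.close()  # allow lsof to receive SIGPIPE if grep exits first
established = [line for line in grep.stdout]
print(len(established))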
I have the following Java command line working fine on Mac OS.
java -cp stanford-ner.jar edu.stanford.nlp.process.PTBTokenizer file.txt > output.txt
Multiple files can be passed as input, separated by spaces, as follows.
java -cp stanford-ner.jar edu.stanford.nlp.process.PTBTokenizer file1.txt file2.txt > output.txt
Now I have 100 files in a folder, and all of these files have to be passed as input to this command. I used Python's os.system in a for loop over the directory as follows:
for i,f in enumerate(os.listdir(filedir)):
    os.system('java -cp "stanford-ner.jar" edu.stanford.nlp.process.PTBTokenizer "%s" > "annotate_%s.txt"' %(f,i))
This works fine only for the first file; for all the other outputs, like annotate_1 and annotate_2, it creates the file with nothing inside it. I thought of looping over the files and passing each one to subprocess.Popen(), but that seems hopeless.
Now I am thinking of passing the files one by one in a loop from a bash script, executing the command sequentially. I am also wondering whether I can execute at least 10 files in parallel, in different terminals, at a time. Any solution is fine, but I think this question will help me gain some insight into the different approaches.
If you want to do this from the shell instead of Python, the xargs tool can almost do everything you want.
You give it a command with a fixed list of arguments, and feed it input with a bunch of filenames, and it'll run the command multiple times, using the same fixed list plus a different batch of filenames from its input. The --max-args option sets the size of the biggest group. If you want to run things in parallel, the --max-procs option lets you do that.
But that's not quite there, because it doesn't do the output redirection. But… do you really need 10 separate files instead of 1 big one? Because if 1 big one is OK, you can just redirect all of them to it:
ls | xargs --max-args=10 --max-procs=10 java -cp stanford-ner.jar \
    edu.stanford.nlp.process.PTBTokenizer >> output.txt
To pass all .txt files in the current directory at once to the java subprocess:
#!/usr/bin/env python
from glob import glob
from subprocess import check_call
cmd = 'java -cp stanford-ner.jar edu.stanford.nlp.process.PTBTokenizer'.split()
with open('output.txt', 'wb', 0) as file:
    check_call(cmd + glob('*.txt'), stdout=file)
It is similar to running the shell command but without running the shell:
$ java -cp stanford-ner.jar edu.stanford.nlp.process.PTBTokenizer *.txt > output.txt
To run no more than 10 subprocesses at a time, passing no more than 100 files at a time, you could use multiprocessing.pool.ThreadPool:
#!/usr/bin/env python
from glob import glob
from multiprocessing.pool import ThreadPool
from subprocess import call
try:
    from threading import get_ident  # Python 3.3+
except ImportError: # Python 2
    from thread import get_ident
cmd = 'java -cp stanford-ner.jar edu.stanford.nlp.process.PTBTokenizer'.split()
def run_command(files):
    with open('output%d.txt' % get_ident(), 'ab', 0) as file:
        return files, call(cmd + files, stdout=file)
all_files = glob('*.txt')
file_groups = (all_files[i:i+100] for i in range(0, len(all_files), 100))
for _ in ThreadPool(10).imap_unordered(run_command, file_groups):
    pass
It is similar to this xargs command (suggested by #abarnert):
$ ls *.txt | xargs --max-procs=10 --max-args=100 java -cp stanford-ner.jar edu.stanford.nlp.process.PTBTokenizer >>output.txt
except that each thread in the Python script writes to its own output file to avoid corrupting the output due to parallel writes.
If you have 100 files, and you want to kick off 10 processes, each handling 10 files, all in parallel, that's easy.
First, you want to group them into chunks of 10. You can do this with slicing or with zipping iterators; in this case, since we definitely have a list, let's just use slicing:
files = os.listdir(filedir)
groups = [files[i:i+10] for i in range(0, len(files), 10)]
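For completeness, here is a sketch of the "zipping iterators" alternative mentioned above, using zip_longest so a final partial group is not silently dropped (on Python 2 the same function is itertools.izip_longest):

from itertools import zip_longest  # izip_longest on Python 2

it = iter(files)
# zip the same iterator with itself 10 times, pad the last group, then strip the padding
groups = [[f for f in group if f is not None]
          for group in zip_longest(*[it] * 10)]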
Now, you want to kick off a process for each group, and then wait for all of the processes, instead of waiting for each one to finish before kicking off the next. This is impossible with os.system, which is one of the many reasons the os.system documentation says "The subprocess module provides more powerful facilities for spawning new processes…"
procs = [subprocess.Popen(…) for group in groups]
for proc in procs:
    proc.wait()
So, what do you pass on the command line to give it 10 filenames instead of 1? If none of the names have spaces or other special characters, you can just ' '.join them. But otherwise, it's a nightmare. Another reason subprocess is better: you can just pass a list of arguments:
procs = [subprocess.Popen(['java', '-cp', 'stanford-ner.jar',
                           'edu.stanford.nlp.process.PTBTokenizer'] + group)
         for group in groups]
But now how do you get all of the results?
One way is to go back to using a shell command line with the > redirection. But a better way is to do it in Python:
procs = []
files = []
for i, group in enumerate(groups):
    file = open('output_{}'.format(i), 'w')
    files.append(file)
    procs.append(subprocess.Popen([…same as before…], stdout=file))
for proc in procs:
    proc.wait()
for file in files:
    file.close()
(You might want to use a with statement with ExitStack, but I wanted to make sure this didn't require Python 2.7/3.3+, so I used explicit close.)
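For readers on Python 3.3+ (or with the contextlib2 backport on Python 2), a sketch of that ExitStack variant, which closes every output file even if something raises partway through:

import subprocess
from contextlib import ExitStack  # Python 3.3+

cmd = ['java', '-cp', 'stanford-ner.jar', 'edu.stanford.nlp.process.PTBTokenizer']

with ExitStack() as stack:
    procs = []
    for i, group in enumerate(groups):  # groups as built above
        # every file entered on the stack is closed when the with-block exits
        out = stack.enter_context(open('output_{}'.format(i), 'w'))
        procs.append(subprocess.Popen(cmd + group, stdout=out))
    for proc in procs:
        proc.wait()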
Inside your input file directory you can do the following in bash:
#!/bin/bash
for file in *.txt
do
    input=$input" \"$file\""
done
java -cp stanford-ner.jar edu.stanford.nlp.process.PTBTokenizer $input > output.txt
If you want to run it as a script, save the file with some name, e.g. my_exec.bash:
#!/bin/bash
if [ $# -ne 2 ]; then
echo "Invalid Input. Enter a directory and a output file"
exit 1
fi
if [ ! -d $1 ]; then
echo "Please pass a valid directory"
exit 1
fi
for file in "$1"/*.txt
do
    input=$input" \"$file\""
done
java -cp stanford-ner.jar edu.stanford.nlp.process.PTBTokenizer $input > $2
Make it an executable file
chmod +x my_exec.bash
USAGE:
./my_exec.bash <folder> <output_file>
Is there a better / simpler way to accomplish this in Python?
I have a bash script that calculates CPS (calls per second). It runs fine on small files but poorly on large ones. It basically takes the file we are calculating the CPS for, extracts field 7, which is the INVITING time, sorts it, and keeps only the unique values; these all go into a temp file. The script then cats the original file, greps for each of the values in the temp file, counts them, and outputs the time and count to a final file.
#!/bin/bash
cat $1 |cut -d "," -f 7 | sort |uniq > /tmp/uniq.time.txt;
list="/tmp/uniq.time.txt";
while read time
do
    VALUE1=`cat $1 |grep "$time" |wc -l`;
    echo $VALUE1 >> /tmp/cps.tmp;
done < $list;
rm /tmp/cps.tmp;
I think what you're trying to do is simply:
cat $1 |cut -d "," -f 7 | sort | uniq -c
note: if you want to swap the order of the fields:
| awk -F " *" '{print $3, $2}'
This can certainly be done easier and more efficiently in Python:
import sys
from itertools import groupby
with open(sys.argv[1]) as f:
    times = [line.split(",")[6] for line in f]
times.sort()
for time, occurrences in groupby(times):
    print time, len(list(occurrences))
The problem with your approach is that you have to search the whole file for each unique time. You could write this more efficiently even in bash, but I think it's more convenient to do it in Python.
Reading CSV files:
http://docs.python.org/library/csv.html
Uniquifying:
set(nonUniqueItems)
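Putting those two pointers together, a sketch that uses the csv module plus collections.Counter (a close cousin of the set idea above, but it keeps the counts as well), assuming the time is the 7th comma-separated field as in the question:

import csv
import sys
from collections import Counter

with open(sys.argv[1]) as f:
    # column 7 (index 6) holds the INVITING time, as in the original script
    counts = Counter(row[6] for row in csv.reader(f))

for time in sorted(counts):
    print("%s %d" % (time, counts[time]))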