Simple command line handling equivalent of Perl in Python

Simple command line handling equivalent of Perl in Python - python

I have done some basic Perl coding but never something in python. I would like to do the equivalent of sending the file to be read from in the command line option. This file is tab delimited, so split each column and then be able to perform some operation in those columns.
The perl code for doing this is
#!/usr/bin/perl
use warnings;
use strict;
while(<>) {
chomp;
my #H = split /\t/;
my $col = $H[22];
if($H[30] eq "Good") {
some operation in col...
}
else {
do something else
}
}
What would be the python equivalent of this task?
Edit: I need the H[22] column to be a unicode character. How do I make col variable to be so?

#file: process_columns.py
#!/usr/bin/python
import fileinput
for line in fileinput.input():
cols = l.split('\t')
# do something with the columns
The snippet above can be used this way
./process_columns.py < data
or just
./process_columns.py data

Related to: Python equivalent of Perl's while (<>) {...}?
#!/usr/bin/env python
import fileinput
for line in fileinput.input():
line = line.rstrip("\r\n") # equiv of chomp
H = line.split('\t')
if H[30]=='Good':
# some operation in col
# first - what do you get from this?
print repr(H[22])
# second - what do you get from this?
print unicode(H[22], "Latin-1")
else:
# do something else
pass # only necessary if no code in this section
Edit: at a guess, you are reading a byte-string and must properly encode it to a unicode string; the proper way to do this depends on what format the file is saved in and what your localization settings are. Also see Character reading from file in Python

Related

Writing a hashtag to a file

I am using a python script to create a shell script that I would ideally like to annotate with comments. If I want to add strings with hashtags in them to a code section like this:
with open(os.path.join("location","filename"),"w") as f:
file = f.read()
file += """my_function() {{
if [ $# -eq 0 ]
then
echo "Please supply an argument"
return
fi
echo "argument is $1"
}}
"""
with open(os.path.join("location","filename"),"w") as f:
f.write(file)
what is the best way I can accomplish this?

You already have a # character in that string literal, in $#, so I'm not sure what the problem is.
Python considers a """ string literal as one big string, newlines, comment-esque sequences and all, as you've noticed, until the ending """.
To also pass escape characters (e.g. \n as \n, not a newline) through raw, you'd use r"""...""".
In other words, with
with open("x", "w") as f:
f.write("""x
hi # hello world
""")
you end up with a file containing
x
hi # hello world

In terms of your wider goal, to write a file with a bash function file from a Python script seems a little wayward.
This is not really a reliable practise, if your use case specifically requires you to define a bash function via script, please explain your use case further. A cleaner way to do this would be:
Define an .sh file and read contents in from there:
# function.sh
my_function() {{
# Some code
}}
Then in your script:
with open('function.sh', 'r') as function_fd:
# Opened in 'append' mode so that content is automatically appended
with open(os.path.join("location","filename"), "a") as target_file:
target_file.write(function_fd.read())

more efficient way to extract lines from a file whose first column matches another file

I have two files, target and clean.
Target has some 1055772 lines, each of which has 3000 columns, tab separated. (size is 7.5G)
Clean is slightly shorter at 806535. Clean only has one column, that matches the format of the first column of Target. (size is 13M)
I want to extract the lines of target that has a matching first column to a line in clean.
I wrote a grep based loop to do this but its painfully slow. Speedups will be rewarded with upvotes and/or smileys.
clean = "/path/to/clean"
target = "/path/to/target"
oFile = "/output/file"
head -1 $target > $oFile
cat $clean | while read snp; do
echo $snp
grep $snp $target >> $oFile
done
$ head $clean
1_111_A_G
1_123_T_A
1_456_A_G
1_7892_C_G
Edit: Wrote a simple python script to do it.
clean_variants_file = "/scratch2/vyp-scratch2/cian/UCLex_August2014/clean_variants"
allChr_file = "/scratch2/vyp-scratch2/cian/UCLex_August2014/allChr_snpStats"
outfile = open("/scratch2/vyp-scratch2/cian/UCLex_August2014/results.tab","w")
clean_variant_dict = {}
for line in open(clean_variants_file):
clean_variant_dict[line.strip()] = 0
for line in open(allChr_file):
ll = line.strip().split("\t")
id_ = ll[0]
if id_ in clean_variant_dict:
outfile.write(line)
outfile.close()

This Perl solution would use quite a lot of memory (because we load the entire file into memory), but would save you from looping twice. It uses a hash for duplicate checking, where each line is stored as a key. Note that this code is not thoroughly tested, but seems to work on a limited set of data.
use strict;
use warnings;
my ($clean, $target) = #ARGV;
open my $fh, "<", $clean or die "Cannot open file '$clean': $!";
my %seen;
while (<$fh>) {
chomp;
$seen{$_}++;
}
open $fh, "<", $target
or die "Cannot open file '$target': $!"; # reuse file handle
while (<$fh>) {
my ($first) = /^([^\t]*)/;
print if $seen{$first};
}
If your target file is proper tab separated CSV data, you could use Text::CSV_XS which reportedly is very fast.

python solution:
with open('/path/to/clean', 'r') as fin:
keys = set(fin.read().splitlines())
with open('/path/to/target', 'r') as fin, open('/output/file', 'w') as fout:
for line in fin:
if line[:line.index('\t')] in keys:
fout.write(line)

Using a perl one-liner:
perl -F'\t' -lane '
BEGIN{ local #ARGV = pop; #s{<>} = () }
print if exists $s{"$F[0]\n"}
' target clean
Switches:
-F: Alternate pattern for -a switch
-l: Enable line ending processing
-a: Splits the line on space and loads them in an array #F
-n: Creates a while(<>){...} loop for each “line” in your input file.
-e: Tells perl to execute the code on command line.
Or as a perl script:
use strict;
use warnings;
die "Usage: $0 target clean\n" if #ARGV != 2;
my %s = do {
local #ARGV = pop;
map {$_ => 1} (<>)
};
while (<>) {
my ($f) = split /\t/;
print if $s{"$f\n"}
}

For fun, I thought I would convert a solution or two into Perl6.
Note: These are probably going to be slower than the originals until Rakudo/NQP gets more optimizations, which really only started in earnest fairly recently at the time of posting.
First is TLP's Perl5 answer converted nearly one-to-one into Perl6.
#! /usr/bin/env perl6
# I have a link named perl6 aliased to Rakudo on MoarVM-jit
use v6;
multi sub MAIN ( Str $clean, Str $target ){ # same as the Perl5 version
MAIN( :$clean, :$target ); # call the named version
}
multi sub MAIN ( Str :$clean!, Str :$target! ){ # using named arguments
note "Processing clean file";
my %seen := SetHash.new;
for open( $clean, :r ).lines -> $line {
next unless $line.chars; # skip empty lines
%seen{$line}++;
}
note "Processing target file";
for open( $target, :r ).lines -> $line {
$line ~~ /^ $<first> = <-[\t]>+ /;
say $line if %seen{$<first>.Str};
}
}
I used MAIN subroutines so that you will get a Usage message if you don't give it the correct arguments.
I also used a SetHash instead of a regular Hash to reduce memory use since we don't need to know how many we have found, only that they were found.
Next I tried to combine all of the lines in the clean file into one regex.
This is similar to the sed and grep answer from Cyrus, except instead of many regexes there is only one.
I didn't want to change the subroutine that I had already written, so I added one that is differentiated by adding --single-regex or -s to the command line. ( All of the examples are in the same file )
multi sub MAIN ( Str :$clean!, Str :$target!, Bool :single-regex(:s($))! ){
note "Processing clean file";
my $regex;
{
my #regex = open( $clean, :r ).lines.grep(*.chars);
$regex = /^ [ | #regex ] /;
} # throw away #regex
note "Processing target file";
for open( $target, :r ).lines -> $line {
say $line if $line ~~ $regex;
}
}
I will say that I took quite a bit longer to write this than it would have taken me to write it in Perl5. Most of the time was taken up searching for some idioms online, and looking over the source files for Rakudo. I don't think it would take much effort to get better at Perl6 than Perl5.

Removing a small number of lines from a large file

I have a very large text file, where most of the lines are composed of ASCII characters, but a small fraction of lines have non-ASCII characters. What is the fastest way to create a new text file containing only the ASCII lines? Right now I am checking each character in each line to see if it's ASCII, and writing each line to the new file if all the characters are ASCII, but this method is rather slow. Also, I am using Python, but would be open to using other languages in the future.
Edit: updated with code
#!/usr/bin/python
import string
def isAscii(s):
for c in s:
if ord(c) > 127 or ord(c) < 0:
return False
return True
f = open('data.tsv')
g = open('data-ASCII-only.tsv', 'w')
linenumber = 1
for line in f:
if isAscii(line):
g.write(line)
linenumber += 1
f.close()
g.close()

You can use grep: "-v" keeps the opposite, -P uses perl regex syntax, and [\x80-\xFF] is the character range for non-ascii.
grep -vP "[\x80-\xFF]" data.tsv > data-ASCII-only.tsv
See this question How do I grep for all non-ASCII characters in UNIX for more about search for ascii characters with grep.

The following suggestion uses a command-line filter (ie, you would use it on the shell command line), this example works in a shell on linux or unix systems, maybe OSX too (I've heard OSX is BSDish):
$ cat big_file | tr -dc '\000-\177' > big_file_ascii_only
It uses the "tr" (translate) filter. In this case, we are telling tr to "delete" all characters which are outside the range octal-000 to octal-177. You may wish to tweak the charcter set - check the man page for tr to get some ideas on other ways to specify the characters you want to keep (or delete)

The other approaches given will work if, and only if, the file is
encoded in such a way that "non-ASCII" is equivalent to "high bit
set", such as Latin-1 or UTF-8. Here's a program in Python 3 that will
work with any encoding.
#!/usr/bin/env python3
import codecs
in_fname = "utf16file"
in_encoding = "utf-16"
out_fname = "ascii_lines"
out_encoding = "ascii"
def is_ascii(s):
try:
s.encode("ascii")
except UnicodeEncodeError:
return False
return True
f_in = codecs.open(in_fname, "r", in_encoding)
f_out = codecs.open(out_fname, "w", out_encoding)
for s in f_in:
if is_ascii(s):
f_out.write(s)
f_in.close()
f_out.close()

Are there a set of simple scripts to manipulate csv files available somewhere?

I am looking for a few scripts which would allow to manipulate generic csv files...
typically something like:
add-row FILENAME INSERT_ROW
get-row FILENAME GREP_ROW
replace-row FILENAME GREP_ROW INSERT_ROW
delete-row FILENAME GREP_ROW
where
FILENAME the name of a csv file, with the first row containing headers, "" used to delimit strings which might contain ','
GREP_ROW a string of pairs field1=value1[,fieldN=valueN,...] used to identify a row based on its fields values in a csv file
INSERT_ROW a string of pairs field1=value1[,fieldN=valueN,...] used to replace(or add) the fields of a row.
peferably in python using the csv package...
ideally leveraging python to associate each field as a variable and allowing more advanced GREP rules like fieldN > XYZ...

Perl has a tradition of in-place editing derived from the unix philosophy.
We could for example write simple add-row-by-num.pl command as follows :
#!/usr/bin/perl -pi
BEGIN { $ln=shift; $line=shift; }
print "$line\n" if $ln==$.;
close ARGV if eof;
Replace the third line by $_="$line\n" if $ln==$.; to replace lines. Eliminate the $line=shift; and replace the third line by $_ = "" if $ln==$.; to delete lines.
We could write a simple add-row-by-regex.pl command as follows :
#!/usr/bin/perl -pi
BEGIN { $regex=shift; $line=shift; }
print "$line\n" if /$regex/;
Or simply the perl command perl -pi -e 'print "LINE\n" if /REGEX/'; FILES. Again, we may replace the print $line by $_="$line\n" or $_ = "" for replace or delete, respectively.
We do not need the close ARGV if eof; line anymore because we need not rest the $. counter after each file is processed.
Is there some reason the ordinary unix grep utility does not suffice? Recall the regular expression (PATERN){n} matches PATERN exactly n times, i.e. (\s*\S+\s*,){6}{\s*777\s*,) demands a 777 in the 7th column.
There is even a perl regular expression to transform your fieldN=value pairs into this regular expression, although I'd use split, map, and join myself.
Btw, File::Inplace provides inplace editing for file handles.

Perl has the DBD::CSV driver, which lets you access a CSV file as if it were an SQL database. I've played with it before, but haven't used it extensively, so I can't give a thorough review of it. If your needs are simple enough, this may work well for you.

App::CCSV does some of that.

The usual way in Python is to use the csv.reader to load the data into a list of tuples, then do your add/replace/get/delete operations on that native python object, and then use csv.writer to write the file back out.
In-place operations on CSV files wouldn't make much sense anyway. Since the records are not typically of fixed length, there is no easy way to insert, delete, or modify a record without moving all the other records at the same time.
That being said, Python's fileinput module has a mode for in-place file updates.

How can I parse a C header file with Perl?

I have a header file in which there is a large struct. I need to read this structure using some program and make some operations on each member of the structure and write them back.
For example I have some structure like
const BYTE Some_Idx[] = {
4,7,10,15,17,19,24,29,
31,32,35,45,49,51,52,54,
55,58,60,64,65,66,67,69,
70,72,76,77,81,82,83,85,
88,93,94,95,97,99,102,103,
105,106,113,115,122,124,125,126,
129,131,137,139,140,149,151,152,
153,155,158,159,160,163,165,169,
174,175,181,182,183,189,190,193,
197,201,204,206,208,210,211,212,
213,214,215,217,218,219,220,223,
225,228,230,234,236,237,240,241,
242,247,249};
Now, I need to read this and apply some operation on each of the member variable and create a new structure with different order, something like:
const BYTE Some_Idx_Mod_mul_2[] = {
8,14,20, ...
...
484,494,498};
Is there any Perl library already available for this? If not Perl, something else like Python is also OK.
Can somebody please help!!!

Keeping your data lying around in a header makes it trickier to get at using other programs like Perl. Another approach you might consider is to keep this data in a database or another file and regenerate your header file as-needed, maybe even as part of your build system. The reason for this is that generating C is much easier than parsing C, it's trivial to write a script that parses a text file and makes a header for you, and such a script could even be invoked from your build system.
Assuming that you want to keep your data in a C header file, you will need one of two things to solve this problem:
a quick one-off script to parse exactly (or close to exactly) the input you describe.
a general, well-written script that can parse arbitrary C and work generally on to lots of different headers.
The first case seems more common than the second to me, but it's hard to tell from your question if this is better solved by a script that needs to parse arbitrary C or a script that needs to parse this specific file. For code that works on your specific case, the following works for me on your input:
#!/usr/bin/perl -w
use strict;
open FILE, "<header.h" or die $!;
my #file = <FILE>;
close FILE or die $!;
my $in_block = 0;
my $regex = 'Some_Idx\[\]';
my $byte_line = '';
my #byte_entries;
foreach my $line (#file) {
chomp $line;
if ( $line =~ /$regex.*\{(.*)/ ) {
$in_block = 1;
my #digits = #{ match_digits($1) };
push #digits, #byte_entries;
next;
}
if ( $in_block ) {
my #digits = #{ match_digits($line) };
push #byte_entries, #digits;
}
if ( $line =~ /\}/ ) {
$in_block = 0;
}
}
print "const BYTE Some_Idx_Mod_mul_2[] = {\n";
print join ",", map { $_ * 2 } #byte_entries;
print "};\n";
sub match_digits {
my $text = shift;
my #digits;
while ( $text =~ /(\d+),*/g ) {
push #digits, $1;
}
return \#digits;
}
Parsing arbitrary C is a little tricky and not worth it for many applications, but maybe you need to actually do this. One trick is to let GCC do the parsing for you and read in GCC's parse tree using a CPAN module named GCC::TranslationUnit.
Here's the GCC command to compile the code, assuming you have a single file named test.c:
gcc -fdump-translation-unit -c test.c
Here's the Perl code to read in the parse tree:
use GCC::TranslationUnit;
# echo '#include <stdio.h>' > stdio.c
# gcc -fdump-translation-unit -c stdio.c
$node = GCC::TranslationUnit::Parser->parsefile('stdio.c.tu')->root;
# list every function/variable name
while($node) {
if($node->isa('GCC::Node::function_decl') or
$node->isa('GCC::Node::var_decl')) {
printf "%s declared in %s\n",
$node->name->identifier, $node->source;
}
} continue {
$node = $node->chain;
}

Sorry if this is a stupid question, but why worry about parsing the file at all? Why not write a C program that #includes the header, processes it as required and then spits out the source for the modified header. I'm sure this would be simpler than the Perl/Python solutions, and it would be much more reliable because the header would be being parsed by the C compilers parser.

You don't really provide much information about how what is to be modified should be determined, but to address your specific example:
$ perl -pi.bak -we'if ( /const BYTE Some_Idx/ .. /;/ ) { s/Some_Idx/Some_Idx_Mod_mul_2/g; s/(\d+)/$1 * 2/ge; }' header.h
Breaking that down, -p says loop through input files, putting each line in $_, running the supplied code, then printing $_. -i.bak enables in-place editing, renaming each original file with a .bak suffix and printing to a new file named whatever the original was. -w enables warnings. -e'....' supplies the code to be run for each input line. header.h is the only input file.
In the perl code, if ( /const BYTE Some_Idx/ .. /;/ ) checks that we are in a range of lines beginning with a line matching /const BYTE Some_Idx/ and ending with a line matching /;/.
s/.../.../g does a substitution as many times as possible. /(\d+)/ matches a series of digits. The /e flag says the result ($1 * 2) is code that should be evaluated to produce a replacement string, instead of simply a replacement string. $1 is the digits that should be replaced.

If all you need to do is to modify structs, you can directly use regex to split and apply changes to each value in the struct, looking for the declaration and the ending }; to know when to stop.
If you really need a more general solution you could use a parser generator, like PyParsing

There is a Perl module called Parse::RecDescent which is a very powerful recursive descent parser generator. It comes with a bunch of examples. One of them is a grammar that can parse C.
Now, I don't think this matters in your case, but the recursive descent parsers using Parse::RecDescent are algorithmically slower (O(n^2), I think) than tools like Parse::Yapp or Parse::EYapp. I haven't checked whether Parse::EYapp comes with such a C-parser example, but if so, that's the tool I'd recommend learning.

Python solution (not full, just a hint ;)) Sorry if any mistakes - not tested
import re
text = open('your file.c').read()
patt = r'(?is)(.*?{)(.*?)(}\s*;)'
m = re.search(patt, text)
g1, g2, g3 = m.group(1), m.group(2), m.group(3)
g2 = [int(i) * 2 for i in g2.split(',')
out = open('your file 2.c', 'w')
out.write(g1, ','.join(g2), g3)
out.close()

There is a really useful Perl module called Convert::Binary::C that parses C header files and converts structs from/to Perl data structures.

You could always use pack / unpack, to read, and write the data.
#! /usr/bin/env perl
use strict;
use warnings;
use autodie;
my #data;
{
open( my $file, '<', 'Some_Idx.bin' );
local $/ = \1; # read one byte at a time
while( my $byte = <$file> ){
push #data, unpack('C',$byte);
}
close( $file );
}
print join(',', #data), "\n";
{
open( my $file, '>', 'Some_Idx_Mod_mul_2.bin' );
# You have two options
for my $byte( #data ){
print $file pack 'C', $byte * 2;
}
# or
print $file pack 'C*', map { $_ * 2 } #data;
close( $file );
}

For the GCC::TranslationUnit example see hparse.pl from http://gist.github.com/395160
which will make it into C::DynaLib, and the not yet written Ctypes also.
This parses functions for FFI's, and not bare structs contrary to Convert::Binary::C.
hparse will only add structs if used as func args.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Simple command line handling equivalent of Perl in Python - python

#file: process_columns.py #!/usr/bin/python import fileinput for line in fileinput.input(): cols = l.split('\t') # do something with the columns The snippet above can be used this way ./process_columns.py < data or just ./process_columns.py data

Related

Writing a hashtag to a file

more efficient way to extract lines from a file whose first column matches another file

Removing a small number of lines from a large file

Are there a set of simple scripts to manipulate csv files available somewhere?

How can I parse a C header file with Perl?

Categories

Resources