How do you simulate awk in Python, for multiline output?

I am used to using awk to retrieve a column from a file.
I need to do something similar now in Python. At the moment I use a subprocess and save the result in a variable.
Is it possible to run something similar to awk in Python, without writing a lot of code? I was looking at split, but I don't get how you parse through multiple lines.
The input that I have is similar to a simple ls -la or netstat -r output. I would like to get the 3rd column, so I can do what I would do with
awk '{print $3}'
Example of the source:
a b c d e
1 2 4 5 2
X Y Z S R
The shortest approach I can think of is a loop over each line, splitting the line into strings and printing string[2]. But I am not sure how to write this in the simplest and shortest way, as short as writing the awk command in a subprocess.

In bash, using pythonpy
rtb@bartek-laptop ~ $ cat tmp
a b c d e
1 2 4 5 2
X Y Z S R
rtb@bartek-laptop ~ $ cat tmp | py -x "x.split()[2]"
c
4
Z
Or in a script:
with open('tmp') as f:
    result = [line.split()[2] for line in f]
# now result contains the list ['c', '4', 'Z']
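If you want something reusable without the pythonpy dependency, the same logic can be wrapped in a small helper; this is just a sketch (the function name is mine, not from the answer):

```python
def third_column(lines):
    """Yield the 3rd whitespace-separated field of each line,
    mirroring awk '{print $3}'; shorter lines are skipped."""
    for line in lines:
        fields = line.split()
        if len(fields) >= 3:
            yield fields[2]

# pipe-friendly usage, e.g. `cat tmp | python this_script.py`:
# import sys
# for value in third_column(sys.stdin):
#     print(value)
```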

Related

access multiple output array of python in bash

I have a python script that prints out 3 different lists. How can I access them in bash? For example:
python out:
[1,2,3,4][a,b,c,d][p,q,r,s]
Now in bash I want to access them as:
list1=[1,2,3,4]
list2=[a,b,c,d]
list3=[p,q,r,s]
So far, I tried something like:
x=$(python myscript.py input.csv)
Now, If I use echo $x I can see the above mentioned list: [1,2,3,4][a,b,c,d][p,q,r,s]
How could I get 3 different lists? Thanks for help.
The Python output does not match the bash syntax. If you cannot print the bash syntax directly from the Python script, you will need to parse the output first.
I suggest using the sed command for parsing the output into bash arrays:
echo $x | sed 's|,| |g; s|\[|list1=(|; s|\[|list2=(|; s|\[|list3=(|;s|\]|)\n|g;'
Command explanation
sed 's|,| |g; # replaces `,` by blank space
s|\[|list1=(|; # replaces the 1st `[` by `list1=(`
s|\[|list2=(|; # replaces the 2nd `[` by `list2=(`
s|\[|list3=(|; # replaces the 3rd `[` by `list3=(`
s|\]|)\n|g;' # replaces all `]` by `)`
The output would be something like:
list1=(1 2 3 4)
list2=(a b c d)
list3=(p q r s)
At this point, the output is just text, not actual bash arrays. To turn it into bash commands, you can wrap the whole pipeline in eval $(...), so the output is evaluated as bash commands.
Putting all together:
$ eval $(echo $x | sed 's|,| |g; s|\[|list1=(|; s|\[|list2=(|; s|\[|list3=(|;s|\]|)\n|g;')
$ echo ${list1[@]}
1 2 3 4
$ echo ${list2[@]}
a b c d
$ echo ${list3[@]}
p q r s
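If you can change myscript.py, another option (as the first sentence of this answer hints) is to print the bash syntax directly from Python and skip sed entirely; this is a sketch, with the list1/list2/list3 names assumed from the question:

```python
def to_bash_arrays(lists, prefix="list"):
    """Render Python lists as bash array assignments, e.g.
    [[1, 2], ['a', 'b']] -> 'list1=(1 2)\nlist2=(a b)'."""
    return "\n".join(
        "%s%d=(%s)" % (prefix, i, " ".join(str(x) for x in items))
        for i, items in enumerate(lists, start=1)
    )

print(to_bash_arrays([[1, 2, 3, 4], ["a", "b", "c", "d"], ["p", "q", "r", "s"]]))
```

In bash you would then just run eval "$(python myscript.py input.csv)".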
Here is one approach using bash.
#!/usr/bin/env bash
##: This line is a simple test that it works.
##: IFS='][' read -ra main_list <<< [1,2,3,4][a,b,c,d][p,q,r,s]
IFS='][' read -ra main_list < <(python myscript.py input.csv)
n=1
while read -r list; do
    [[ $list ]] || continue
    read -ra list$((n++)) <<< "${list//,/ }"
done < <(printf '%s\n' "${main_list[@]}")
declare -p list1 list2 list3
Output
declare -a list1=([0]="1" [1]="2" [2]="3" [3]="4")
declare -a list2=([0]="a" [1]="b" [2]="c" [3]="d")
declare -a list3=([0]="p" [1]="q" [2]="r" [3]="s")
As per Philippe's comment, a for loop is also an option.
IFS='][' read -ra main_list < <(python myscript.py input.csv)
n=1
for list in "${main_list[@]}"; do
    [[ $list ]] || continue
    read -ra list$((n++)) <<< "${list//,/ }"
done
declare -p list1 list2 list3

Fill missing line numbers into file using sed / awk / bash

I have a (tab-delimited) file where the first "word" on each line is the line number. However, some line numbers are missing. I want to insert new lines (with corresponding line number) so that throughout the file, the number printed on the line matches the actual line number. (This is for later consumption into readarray with cut/awk to get the line after the line number.)
I've written this logic in python and tested it works, however I need to run this in an environment that doesn't have python. The actual file is about 10M rows. Is there a way to represent this logic using sed, awk, or even just plain shell / bash?
import re
import sys

linenumre = re.compile(r"^\d+")
i = 0
for line in sys.stdin:
    i = i + 1
    linenum = int(linenumre.findall(line)[0])
    while i < linenum:
        print(i)
        i = i + 1
    print(line, end='')
test file looks like:
1 foo 1
2 bar 1
4 qux 1
6 quux 1
9 2
10 fun 2
expected output like:
1 foo 1
2 bar 1
3
4 qux 1
5
6 quux 1
7
8
9 2
10 fun 2
Like this, with awk:
awk '{while(++ln!=$1){print ln}}1' input.txt
Explanation, as a multiline script:
{
    # Loop as long as the variable ln (line number)
    # is not equal to the first column, and print the
    # missing line numbers.
    # Note: awk auto-initializes an integer variable
    # with 0 upon its first usage
    while (++ln != $1) {
        print ln
    }
}
1 # this always evaluates to true, making awk print the input lines
I've written this logic in python and tested it works, however I need to run this in an environment that doesn't have python.
In case you want to run Python code where Python is not installed, you might freeze your code. The Hitchhiker's Guide to Python has an overview of tools which are able to do it. I suggest first trying pyinstaller, as it supports various operating systems and seems easy to use.
This might work for you (GNU join, seq and sed):
join -a1 -t' ' <(seq $(sed -n '$s/ .*//p' file)) file 2>/dev/null
This joins file with a sequence of line numbers generated by seq; the sed call extracts the last line number from file to tell seq how far to count.

How can I paste contents of 2 files or single file multiple times?

I am using mostly one liners in shell scripting.
If I have a file with contents as below:
1
2
3
and want it to be pasted like:
1 1
2 2
3 3
how can I do it in shell scripting using a python one-liner?
PS: I tried the following:
python -c "file = open('array.bin','r'); cont = file.read(); print cont*3; file.close()"
but it printed the contents like:
1
2
3
1
2
3
file = open('array.bin', 'r')
cont = file.readlines()
for line in cont:
    print line.strip(), line.strip()
file.close()
You could replace your print cont*3 with the following:
print '\n'.join(' '.join(ch * n) for ch in cont.strip().split())
Here n is the number of columns.
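To see how the pieces fit together, here is the same expression run on the sample data (cont stands in for file.read(), with n = 3; this demo is mine, not from the answer):

```python
cont = "1\n2\n3\n"  # stands in for cont = file.read()
n = 3               # number of columns

# ch * n repeats each token; ' '.join(...) then puts a space
# between every character of the repeated string
result = '\n'.join(' '.join(ch * n) for ch in cont.strip().split())
print(result)
```

Note that ' '.join(ch * n) inserts a space between every character of the repeated token, so this only behaves as intended when each item is a single character.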
You need to break up the lines and then reassemble:
One Liner:
python -c "file=open('array.bin','r'); cont=file.readlines(); print '\n'.join([' '.join([c.strip()]*2) for c in cont]); file.close()"
Long form:
file=open('array.bin', 'r')
cont=file.readlines()
print '\n'.join([' '.join([c.strip()]*2) for c in cont])
file.close()
With array.bin having:
1
2
3
Gives:
1 1
2 2
3 3
Unfortunately, you can't use a simple for statement for a one-liner solution (as suggested in a previous answer). As this answer explains, "as soon as you add a construct that introduces an indented block (like if), you need the line break."
Here's one possible solution that avoids this problem:
Open file and read lines into a list
Modify the list (using a list comprehension). For each item:
Remove the trailing new line character
Multiply by the number of columns
Join the modified list using the new line character as separator
Print the joined list and close the file
Detailed/long form (n = number of columns):
f = open('array.bin', 'r')
n = 5
original = list(f)
modified = [line.strip() * n for line in original]
print('\n'.join(modified))
f.close()
One-liner:
python -c "f = open('array.bin', 'r'); n = 5; print('\n'.join([line.strip()*n for line in list(f)])); f.close()"
REPEAT_COUNT=3 && cat contents.txt| python -c "print('\n'.join(w.strip() * ${REPEAT_COUNT} for w in open('/dev/stdin').readlines()))"
First test from the command prompt:
paste -d" " array.bin array.bin
EDIT:
The OP wants to use a variable n to specify how many columns are needed.
There are different ways to repeat a command 10 times, such as
for i in {1..10}; do echo array.bin; done
seq 10 | xargs -I -- echo "array.bin"
source <(yes echo "array.bin" | head -n10)
yes "array.bin" | head -n10
Other ways are given by https://superuser.com/a/86353 and I will use a variation of
printf -v spaces '%*s' 10 ''; printf '%s\n' ${spaces// /ten}
My solution is
paste -d" " $(printf "%*s" $n " " | sed 's/ /array.bin /g')

How to get os.system() output as a string and not a set of characters? [duplicate]

This question already has an answer here:
How can I make a for-loop loop through lines instead of characters in a variable?
(1 answer)
Closed 6 years ago.
I'm trying to get output from os.system using the following code:
p = subprocess.Popen([some_directory], stdout=subprocess.PIPE, shell=True)
ls = p.communicate()[0]
when I print the output I get:
>>> print(ls)
file1.txt
file2.txt
The output somehow displays as two separate strings. However, when I try to print out the strings of filenames using a for loop, I get a list of characters instead:
>>> for i in range(len(ls)):
...     print i, ls[i]
Output:
0 f
1 i
2 l
3 e
4 1
5 .
6 t
7 x
8 t
9 f
10 i
11 l
12 e
13 2
14 .
15 t
16 x
17 t
I need help ensuring the os.system() output returns as strings and
not a set of characters.
p.communicate returns a string. It may look like a list of filenames, but it is just a string. You can convert it to a list of filenames by splitting on the newline character:
s = p.communicate()[0]
for line in s.split("\n"):
print "line:", line
Are you aware that there are built-in functions to get a list of files in a directory?
for i in range(len(...)): is usually a code smell in Python. If you want to iterate over the numbered elements of a collection, the canonical method is for i, element in enumerate(...):.
The code you quote clearly isn't the code you ran, since when you print ls you see two lines separated by a newline, but when you iterate over the characters of the string the newline doesn't appear.
The bottom line is that you are getting a string back from communicate()[0], but you are then iterating over it, giving you the individual characters. I suspect what you would like to do is use the .split() or .splitlines() method on ls to get the individual file names, but you are trying to run before you can walk. First of all, get a clear handle on what the communicate method is returning to you.
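A minimal sketch of the .splitlines() suggestion, with the filenames hard-coded for illustration:

```python
ls = "file1.txt\nfile2.txt\n"  # stands in for ls = p.communicate()[0]

# splitlines() breaks the string on newlines, so iterating over
# names gives whole filenames instead of individual characters
names = ls.splitlines()
```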
Apparently, in Python 3.6, p.communicate returns a bytes object:
In [16]: type(ls)
Out[16]: bytes
The following seems to work better:
In [22]: p = subprocess.Popen([some_directory], stdout=subprocess.PIPE, shell=True)
In [23]: ls = p.communicate()[0].split()
In [25]: for i in range(len(ls)):
...: print(i, ls[i])
...:
0 b'file1.txt'
1 b'file2.txt'
But I would rather use os.listdir() instead of subprocess:
import os
for line in os.listdir():
print line

Trying to merge files after removing duplicate content

Here is my problem.
I have n files and they all have overlapping and common text in them. I want to create a file using these n files such that the new file only contains unique lines in it that exist across all of the n files.
I am looking for a bash command or a python API that can do it for me. If there is an algorithm, I can also attempt to code it myself.
If the order of the lines is not important, you could do this:
sort -u file1 file2 ...
This will (a) sort all the lines in all the files, and then (b) remove duplicates, giving you the set of distinct lines across all the files.
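Since the question also asks for a python API, here is a sketch of the same sort -u behaviour in Python (the function name is made up):

```python
def sorted_unique(*line_sources):
    """Merge several iterables of lines, then sort and deduplicate,
    mirroring `sort -u file1 file2 ...`."""
    merged = set()
    for lines in line_sources:
        merged.update(lines)
    return sorted(merged)

# with real files (the names are placeholders):
# print("".join(sorted_unique(open("file1"), open("file2"))))
```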
For testing common data you can use comm:
DESCRIPTION
The comm utility reads file1 and file2, which should be sorted lexically,
and produces three text columns as output: lines only in file1; lines only in
file2; and lines in both files.
Another useful tool would be merge:
DESCRIPTION
merge incorporates all changes that lead from file2 to file3 into file1.
The result ordinarily goes into file1. merge is useful for combining separate
changes to an original.
sort might mess up your order. You can try the following awk command. It hasn't been tested, so make sure you back up your files. :)
awk ' !x[$0]++' big_merged_file
This will remove all duplicate lines from your file.
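The Python equivalent of that awk idiom, if you prefer to stay in Python, is an order-preserving dedupe (a sketch; the function name is mine):

```python
def unique_in_order(lines):
    """Keep only the first occurrence of each line, preserving
    the original order, like awk '!x[$0]++'."""
    seen = set()
    result = []
    for line in lines:
        if line not in seen:
            seen.add(line)
            result.append(line)
    return result
```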
This might work for you:
# ( seq 1 5; seq 3 7; )
1
2
3
4
5
3
4
5
6
7
# ( seq 1 5; seq 3 7; ) | sort -nu
1
2
3
4
5
6
7
# ( seq 1 5; seq 3 7; ) | sort -n | uniq -u
1
2
6
7
# ( seq 1 5; seq 3 7; ) | sort -n | uniq -d
3
4
5
You need to merge everything first, then sort, and finally remove the duplicates:
#!/bin/bash
for file in test/*
do
    cat "$file" >> final
done
sort final > final2
uniq final2 final
rm -f final2
