python script to search from a list of keywords

I have a list of keywords and I want to build a Python script that iterates through each keyword, searches (greps?) for it in a given file, and writes the output to a file.
I know my answer is somewhere in the world of:
for words in keywords
grep |word -o foundkeywords.txt
Maybe I should stay more in bash? Either way, pardon the noob question; any guidance is much appreciated.

Python does not have much to do with bash; that said, your script in Python could look like this (assuming there is one keyword per line in your keyword file):
# read the keywords, one per line
with open('keywords.txt') as kf:
    keywords = kf.read().splitlines()

# read each line of the file to search
with open('inputfile.txt') as f:
    for line in f.read().split('\n'):
        # if any keyword is found in the line, print it
        for word in keywords:
            if line.find(word) != -1:
                print(line)
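If you want the matches written to foundkeywords.txt (as in the question) instead of printed, a minimal variant of the same idea (file names assumed, as above):

# read the keywords, one per line, skipping blanks
with open('keywords.txt') as kf:
    keywords = [w for w in kf.read().splitlines() if w]

# write every matching line of the searched file to foundkeywords.txt
with open('inputfile.txt') as f, open('foundkeywords.txt', 'w') as out:
    for line in f:
        if any(word in line for word in keywords):
            out.write(line)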

Using bash only:
If you want to grep all the keywords at once, you can do
grep -f keywords inputfile
If you want to grep it sequentially, you can do
while read -r line; do
    grep "$line" inputfile
done < keywords
Of course, this can be done in Python too. But I don't see how this would facilitate the process.


How to concatenate sequences in the same multiFASTA files and then print result to a new FASTA file?

I have a folder with over 50 FASTA files, each with anywhere from 2-8 FASTA sequences within them; here's an example:
testFOR.id_AH004930.fasta
>AH004930|2:1-128_Miopithecus_talapoin
ATGA
>AH004930|2:237-401_Miopithecus_talapoin
GGGT
>AH004930|2:502-580_Miopithecus_talapoin
CTTTGCT
>AH004930|2:681-747_Miopithecus_talapoin
GGTG
testFOR.id_M95099.fasta
>M95099|1:1-90_Homo_sapien
TCTTTGC
>M95099|1:100-243_Homo_sapien
ATGGTCTTTGAA
They're all grouped based on their ID number (in this case AH004930 and M95099), which I've managed to extract from the original raw multiFASTA file using the very handy seqkit code found HERE.
What I am aiming to do is:
Use cat to put these sequences together within the file like this:
>AH004930|2:1-128_Miopithecus_talapoin
ATGAGGGTCTTTGCTGGTG
>M95099|1:1-90_Homo_sapien
TCTTTGCATGGTCTTTGAA
(I'm not fussed about the nucleotide position, I'm fussed about the ID and species name!)
Print this result out into a new FASTA file.
Ideally I'd really like to have all of these 50 files condensed into 1 FASTA that I can then go ahead and filter/align:
GENE_L.fasta
>AH004930|2:1-128_Miopithecus_talapoin
ATGAGGGTCTTTGCTGGTG
>M95099|1:1-90_Homo_sapien
TCTTTGCATGGTCTTTGAA
....
So far I have found a way to achieve what I want, but only one file at a time (using this code: cat myfile.fasta | sed -e '1!{/^>.*/d;}' | sed ':a;N;$!ba;s/\n//2g' > output.fasta, for which I've sadly lost the link to give credit). A lot of these file names are very similar, so if I did it manually it's inevitable that I'd miss some, or it would be way too slow.
I have tried to put this into a loop and it's kind of there! But what it does is cat each FASTA file and put it into a new one, BUT it only keeps the first header, leaving me with one massive stitched-together sequence:
for FILE in *; do cat *.fasta| sed -e '1!{/^>.*/d;}'| sed ':a;N;$!ba;s/\n//2g' > output.fasta; done
output.fasta
>AH004930|2:1-128_Miopithecus_talapoin
ATGAGGGTCTTTGCTGGTGTCTTTGCATGGTCTTTGAAGGTCTTTGAAATGAGTGGT...
I wondered if making a loop similar to the one HERE would be any good, but I am really unsure how to get it to print each header once it opens a new file.
How can I cat these sequences, print them into a new file and still keep these headers?
I would really appreciate any advice on where I've gone wrong in the loop, and any solutions suitable for a zsh shell! I'm open to any Python or Linux solution. Thank you kindly in advance.
This might work for you (GNU sed):
sed -s '1h;/>/d;H;$!d;x;s/\n//2g' file1 file2 file3 ...
Set -s to treat each file separately.
Copy the first line to the hold space.
Delete all lines containing > (the first is already saved).
Append every remaining line to the hold space.
Delete these lines except for the last.
On the last line, swap in the held copy and remove all newlines except the first.
Repeat for each file.
Alternative for non-GNU seds:
for file in *.fasta; do sed '1h;/>/d;H;$!d;x;s/\n//2g' "$file"; done
N.B. macOS sed may need the script to be put into a file and invoked using the -f option, or split into several pieces using the -e option (minus the ; separators); your mileage may vary.
Or perhaps:
for file in file?; do sed $'1h;/>/d;H;$!d;x;s/\\n/#/;s/\\n//g;s/#/\\n/' "$file"; done
Not sure I understand your issue exactly, but if you simply want to concatenate the contents of many files into a single file, I believe the (Python) code below should work:
import os

input_folder = 'path/to/your/folder/with/fasta/files'
output_file = 'output.fasta'

with open(output_file, 'w') as outfile:
    for file_name in os.listdir(input_folder):
        if not file_name.endswith('.fasta'):  # skip non-FASTA files
            continue
        file_path = os.path.join(input_folder, file_name)
        with open(file_path, 'r') as inpfile:
            outfile.write(inpfile.read())
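That said, plain concatenation keeps every header. If you also want each file collapsed to its first header plus the joined sequence (the GENE_L.fasta layout from the question), here is a minimal Python sketch, assuming one output record per input file:

import glob

# one record per input file: first header, then all sequence lines joined
with open('GENE_L.fasta', 'w') as out:
    for path in sorted(glob.glob('*.fasta')):
        header = None
        seq_parts = []
        with open(path) as fh:
            for line in fh:
                line = line.strip()
                if line.startswith('>'):
                    if header is None:
                        header = line  # keep only the first header per file
                else:
                    seq_parts.append(line)
        if header is not None:
            out.write(header + '\n' + ''.join(seq_parts) + '\n')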

Input a string in grep command in python [duplicate]

This question already has answers here: Passing variables to a subprocess call.
I have a list of strings in Python and want to run a recursive grep on each string in the list. I am using the following code:
import subprocess as sp

for python_file in python_files:
    out = sp.getoutput("grep -r python_file . | wc -l")
    print(out)
The output I am getting is the grep count for the literal string "python_file". What mistake am I committing, and what should I do to correct this?
Your code has several issues. The immediate answer to what you seem to be asking was given in a comment, but there are more things to fix here.
If you want to pass in a variable instead of a static string, you have to use some sort of string interpolation.
grep already knows how to report how many lines matched; use grep -c. Or just ask Python to count the number of output lines. Trimming off the pipe to wc -l allows you to also avoid invoking a shell, which is a good thing; see also Actual meaning of shell=True in subprocess.
grep already knows how to search for multiple expressions. Try passing in the whole list as an input file with grep -f -.
import subprocess as sp

out = sp.check_output(
    ["grep", "-r", "-f", "-", "."],
    input="\n".join(python_files), text=True)
print(len(out.splitlines()))
If you want to speed up your processing and the patterns are all static strings, try also adding the -F option to grep.
Of course, all of this is relatively easy to do natively in Python, too. You should easily be able to find examples with os.walk().
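For instance, a minimal native sketch, under the assumption that python_files holds plain substrings to count across the tree (os.walk standing in for grep -r):

import os

python_files = ["foo", "bar"]  # hypothetical list of plain substrings

count = 0
for root, dirs, files in os.walk("."):
    for name in files:
        path = os.path.join(root, name)
        try:
            with open(path, errors="ignore") as fh:
                for line in fh:
                    # count the line if any substring appears in it
                    if any(p in line for p in python_files):
                        count += 1
        except OSError:
            continue  # skip unreadable files
print(count)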
Your intent isn't totally clear from the way you've written your question, but the first argument to grep is the pattern (python_file in your example), and the second is the file(s) to search (. in your example).
You could write this in native Python or just use grep directly, which is probably easier than using both!
grep args
--count will report just the number of matching lines
--file Read one or more newline separated patterns from file. (manpage)
grep --count --file patterns.txt -r .
import re
from pathlib import Path

for pattern in patterns:
    count = 0
    for path_file in Path(".").iterdir():
        if not path_file.is_file():
            continue  # skip directories
        with open(path_file) as fh:
            for line in fh:
                # re.search matches anywhere in the line, like grep
                # (re.match would only match at the start of the line)
                if re.search(pattern, line):
                    count += 1
    print(count)
NOTE that the behavior in your question gives a separate count for each pattern, while you may really want a single total; a sketch of that variant follows.
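A minimal single-total sketch, counting each line once no matter how many patterns it matches (patterns is assumed, as above):

import re
from pathlib import Path

patterns = ["foo", "bar"]  # hypothetical patterns

total = 0
for path_file in Path(".").iterdir():
    if not path_file.is_file():
        continue  # skip directories
    with open(path_file, errors="ignore") as fh:
        for line in fh:
            # count the line once if any pattern matches it
            if any(re.search(p, line) for p in patterns):
                total += 1
print(total)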

How can I run this shell script inside python?

I want to run a bash script from a python program. The script has a command like this:
find . -type d -exec bash -c 'cd "$0" && gunzip -c *.gz | cut -f 3 >> ../mydoc.txt' {} \;
Normally I would run a subprocess call like:
subprocess.call('ls | wc -l', shell=True)
But that's not possible here because of the nested quoting. Any suggestions?
Thanks!
While the question is answered already, I'll still jump in, because I assume you want to execute that bash script only because you do not have the functionally equivalent Python code (which is less than 40 lines, basically; see below).
Why do this instead of the bash script?
Your script now is able to run on any OS that has a Python interpreter
The functionality is a lot easier to read and understand
If you need anything special, it is always easier to adapt your own code
More Pythonic :-)
Please bear in mind that this (like your bash script) comes without any kind of error checking, and that the output file is a global variable, but that can be changed easily.
import gzip
import os

# create our output file
outfile = open('/tmp/output.txt', mode='w', encoding='utf-8')

def process_line(line):
    """
    get the third column (delimiter is tab char) and write to output file
    """
    columns = line.split('\t')
    if len(columns) >= 3:
        # columns[2] is the third column, matching `cut -f 3`
        outfile.write(columns[2] + '\n')

def process_zipfile(filename):
    """
    read gzip file content (we assume text) and split into lines for processing
    """
    print('Reading {0} ...'.format(filename))
    with gzip.open(filename, mode='rb') as f:
        lines = f.read().decode('utf-8').split('\n')
        for line in lines:
            process_line(line.strip())

def process_directory(dirtuple):
    """
    loop thru the list of files in that directory and process any .gz file
    """
    print('Processing {0} ...'.format(dirtuple[0]))
    for filename in dirtuple[2]:
        if filename.endswith('.gz'):
            process_zipfile(os.path.join(dirtuple[0], filename))

# walk the directory tree from current directory downward
for dirtuple in os.walk('.'):
    process_directory(dirtuple)

outfile.close()
Escape the ' marks with a \.
i.e. for every ', replace it with \'
Triple quotes or triple double quotes ('''some string''' or """some other string""") are handy as well. See here (yes, it's the Python 3 documentation, but it all works 100% in Python 2):
mystring = """how many 'cakes' can you "deliver"?"""
print(mystring)
how many 'cakes' can you "deliver"?
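Applied to the command from the question, a minimal sketch; a raw triple-quoted string lets both quote styles and the \; survive unescaped:

import subprocess

# the raw triple-quoted string preserves the inner quotes and the \; verbatim
subprocess.call(
    r"""find . -type d -exec bash -c 'cd "$0" && gunzip -c *.gz | cut -f 3 >> ../mydoc.txt' {} \;""",
    shell=True)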

How to use awk if statement and for loop in subprocess.call

Trying to print the filenames of files that don't have 12 columns.
This works at the command line:
for i in *dim*; do awk -F',' '{if (NR==1 && NF!=12)print FILENAME}' $i; done;
When I try to embed this in subprocess.call in a python script, it doesn't work:
subprocess.call("""for %i in (*dim*.csv) do (awk -F, '{if ("NR==1 && NF!=12"^) {print FILENAME}}' %i)""", shell=True)
The first error I received was "Print is unexpected at this time", so I googled and added ^ within the parentheses. The next error was "unexpected newline or end of string", so I googled again and added the quotes around NR==1 && NF!=12. With the current code it's printing many lines in each file, so I suspect something is wrong with the if statement. I've used awk and for loops in this style in subprocess.call before, but not combined and with an if statement.
Multiple input files in AWK
In the string you are passing to subprocess.call(), your if statement is evaluating a string (probably not the comparison you want). It might be easier to just simplify the shell command by doing everything in AWK. You are executing AWK for every $i in the shell's for loop. Since you can give multiple input files to AWK, there is really no need for this loop.
You might want to scan through each entire file until you find any line that has other than 12 fields, rather than checking only the first line (NR==1). In that case, the condition would be just NF!=12.
If you want to check only the first line of each file, then NR==1 becomes FNR==1 when using multiple files. NR is the "number of records" (across all input files) and FNR is "file number of records" for the current input file only. These are special built-in variables in AWK.
Also, the syntax of AWK allows for the blocks to be executed only if the line matches some condition. Giving no condition (as you did) runs the block for every line. For example, to scan through all files given to AWK and print the name of a file with other than 12 fields on the first line, try:
awk -F, 'FNR==1 && NF!=12{print FILENAME; nextfile}' *dim*.csv
I have added the .csv to your wildcard *dim* as you had in the Python version. The -F, of course changes the field separator to a comma from the default space. For every line in each file, AWK checks if the number of fields NF is 12, if it's not, it executes the block of code, otherwise it goes on to the next line. This block prints the FILENAME of the current file AWK is processing, then skips to the beginning of the next file with nextfile.
Try running this AWK version with your subprocess module in Python:
subprocess.call("""awk -F, 'FNR==1 && NF!=12{print FILENAME; nextfile}' *dim*.csv""", shell=True)
The triple quotes make it a literal string. The output of AWK goes to stdout; capturing that in Python with the subprocess module is sketched below.
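A minimal capture sketch (assuming Python 3.7+ for capture_output):

import subprocess

# run AWK and capture its stdout instead of letting it print to the terminal
result = subprocess.run(
    "awk -F, 'FNR==1 && NF!=12{print FILENAME; nextfile}' *dim*.csv",
    shell=True, capture_output=True, text=True)
bad_files = result.stdout.splitlines()
print(bad_files)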
Using only Python
Don't forget that Python is itself an expressive and powerful language. If you are already using Python, it may be simpler, easier, and more portable to use only Python instead of a mixture of Python, bash, and AWK.
You can find the names of files (selected from *dim*.csv) with the first line of each file having other than 12 comma-separated fields with:
import glob

files_found = []
for filename in glob.glob('*dim*.csv'):
    with open(filename, 'r') as f:  # the with block closes the file for us
        firstline = f.readline()
        if len(firstline.split(',')) != 12:
            files_found.append(filename)
print(files_found)
The glob module gives the listing of files matching the wildcard pattern *dim*.csv. The first line of each of these files is read and split into fields separated by commas. If the number of these fields is not 12, it is added to the list files_found.

Bash while loop calling a Python script

I would like to call a Python script from within a Bash while loop. However, I do not understand very well how to use Bash's while loop (and variable) syntax appropriately. The behaviour I am looking for is that, while a file still contains lines (DNA sequences), I call a Python script to extract groups of sequences so that another program (dialign2) can align them. Finally, I add the alignments to a result file. Note: I am not trying to iterate over the file. What should I change for the Bash while loop to work? I also want to be sure that the while loop re-checks the changing file.txt on each iteration. Here is my attempt:
#!/bin/bash
# Call a python script as many times as needed to treat a text file
c=1
while [ `wc -l file.txt` > 0 ] ; # Stop when file.txt has no more lines
do
    echo "Python script called $c times"
    python script.py # Uses file.txt and removes lines from it
    # The Python script also returns a temp.txt file containing DNA sequences
    c=$c + 1
    dialign -f temp.txt # aligns DNA sequences
    cat temp.fa >> results.txt # append DNA alignments to result file
done
Thanks!
No idea why you want to do this.
c=1
while [[ -s file.txt ]] # Stop when file.txt has no more lines
do
    echo "Python script called $c times"
    python script.py # Uses file.txt and removes lines from it
    c=$(($c + 1))
done
try -gt to eliminate the shell metacharacter > (and redirect the file into wc so it prints only the number, without the file name):
while [ `wc -l < file.txt` -gt 0 ]
do
    ...
    c=$((c + 1))
done
#OP if you want to loop through a file, just use a while read loop. Also, you are not using the variable $c, nor the line. Are you passing each line to your Python script? Or are you just calling your Python script whenever a line is encountered? (Your script is going to be slow if you do that.)
while true
do
    while read -r line
    do
        # if you are taking STDIN in myscript.py, then something must be passed to
        # myscript.py; if not, I really don't understand what you are doing.
        echo "$line" | python myscript.py > temp.txt
        dialign -f temp.txt # aligns DNA sequences
        cat temp.txt >> results.txt
    done < "file.txt"
    if [ ! -s "file.txt" ]; then break; fi
done
Lastly, you could have done everything in Python. The way to iterate over "file.txt" in Python is simply:
f = open("file.txt")
for line in f:
    print("do something with line")
    print("or bring what you have in myscript.py here")
f.close()
The following should do what you say you want:
#!/bin/bash
c=1
while read line
do
    echo "Python script called $c times"
    # $line contains a line of text from file.txt
    python script.py
    c=$((c + 1))
done < file.txt
However, there is no need to use bash to iterate over the lines in a file. You can do that quite easily without ever leaving Python:
myfile = open('file.txt', 'r')
for count, line in enumerate(myfile):
    print('%i lines in file' % (count + 1,))
    # the variable "line" contains the line of text from file.txt
    # Do your thing here.
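To fold the rest of the loop body in as well, a hedged sketch that also runs dialign per line and appends its output (file names taken from the question; how script.py groups the lines is assumed, so each line is handed over individually here):

import subprocess

with open('file.txt') as myfile, open('results.txt', 'a') as results:
    for count, line in enumerate(myfile, start=1):
        print('Python script called %i times' % count)
        # hypothetical: write the current line out for dialign to align
        with open('temp.txt', 'w') as tmp:
            tmp.write(line)
        subprocess.call(['dialign', '-f', 'temp.txt'])  # aligns DNA sequences
        with open('temp.fa') as aligned:  # dialign writes temp.fa, per the question
            results.write(aligned.read())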
