Grep a string in python - python

Friends,
I have a situation where I need to grep a word from a string:
[MBeanServerInvocationHandler]com.bea:Name=itms2md01,Location=hello,Type=ServerRuntime
What I want to grep is the value assigned to Name in the above string, which is itms2md01.
In my case I have to grab whatever string is assigned to Name=, so there is no fixed string I can search for.
Tried:
import re
import sys
file = open(sys.argv[2], "r")
for line in file:
    if re.search(sys.argv[1], line):
        print line,

Deak is right. Since I don't have enough reputation to comment, I am showing it here instead. I am leaving the file handling aside and working on a single string as an example:
import re
str1 = "[MBeanServerInvocationHandler]com.bea:Name=itms2md01,Location=hello,Type=ServerRuntime"
pat = r'(?<=Name=)\w+(?=,)'
print re.search(pat, str1).group()
You can then apply the same pattern to each line of your file.
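A minimal sketch of the same lookbehind pattern in Python 3 syntax, applied to the sample string from the question. The helper name extract_name is made up for illustration, and returning None for a line without a Name= field is an assumption about the behavior you want.

```python
import re

# Same idea as the pattern above: a word that follows "Name=" and precedes a comma.
NAME_PATTERN = re.compile(r"(?<=Name=)\w+(?=,)")


def extract_name(line):
    """Return the value assigned to Name=, or None if the line has none."""
    match = NAME_PATTERN.search(line)
    return match.group() if match else None


sample = "[MBeanServerInvocationHandler]com.bea:Name=itms2md01,Location=hello,Type=ServerRuntime"
print(extract_name(sample))                # itms2md01
print(extract_name("Type=ServerRuntime"))  # None
```

Note that the (?=,) lookahead requires a comma after the value, so a Name= field at the very end of a line would not match.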

I like to use named groups, because I'm often searching for more than one thing. But even for one item in the search, it still works nicely, and I can remember very easily what I was searching for.
I'm not certain that I fully understand the question, but if you are saying that the user can pass a key to look up and a file to search, you can do it like this:
import re
import sys
file = open(sys.argv[2], "r")
for line in file:
    match = re.search(r"%s=(?P<item>[^,]+)" % sys.argv[1], line)
    if match is not None:
        print match.group('item')
I am assuming that is the purpose, as you have included sys.argv[1] into the search, though you didn't mention why you did so in your question.
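As a rough sketch of the same named-group idea in Python 3, here it is wrapped in a function so it can be tried on a single string. The function name find_value is hypothetical, and escaping the key with re.escape is an added precaution that the code above does not take.

```python
import re


def find_value(key, line):
    """Return the text after 'key=' up to the next comma, or None."""
    # re.escape keeps characters such as '.' in the key from acting as regex syntax.
    pattern = r"%s=(?P<item>[^,]+)" % re.escape(key)
    match = re.search(pattern, line)
    return match.group("item") if match else None


line = "[MBeanServerInvocationHandler]com.bea:Name=itms2md01,Location=hello,Type=ServerRuntime"
for key in ("Name", "Location", "Type", "Missing"):
    print(key, "->", find_value(key, line))
```

When reading lines from a file, the last field would include the trailing newline, so stripping each line first is probably a good idea.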

Related

concatenate file contents into a list with python

I wrote a small Python script to concatenate some lines from different files into one file, but the function I wrote doesn't print what I expect. I have tried to spot the problem for an evening and a morning and still can't find it. Could somebody help me please? Thanks a lot!
I have a folder containing thousands of .fa files. From each .fa file, I would like to extract the lines starting with ">" and reformat them to keep only the information I need. In the end, I would like to combine the information extracted from one file into one line, and then write the lines for all the .fa files into a single .txt file.
So the folder:
% ls
EstimatedSpeciesTree.nwk HOG9998.fa concatenate_gene_list_HOG.py
HOG9997.fa HOG9999.fa output
One .fa file for example
>BnaCnng71140D [BRANA]
MTSSFKLSDLEEVTTNAEKIQNDLLKEILTLNAKTEYLRQFLHGSSDKTFFKKHVPVVSYEDMKPYIERVADGEPSEIIS
GGPITKFLRRYSF
>Cadbaweo98702.t [THATH]
MTSSFKLSDLEEVTTNAEKIQNDLLKEILTLNAKTEYLRQFLHGSSDKTFFKKHVPVVSYEDMKPYIERVADGEPSEIIS
GGPITKFLRRYSF
What I would like to have is one file like this
HOG9997.fa BnaCnng71140D:BRANA Cadbaweo98702.t:THATH
HOG9998.fa Bkjfnglks098:BSFFE dsgdrgg09769.t
HOG9999.fa Dsdfdfgs1937:XDGBG Cadbaweo23425.t:THATH Dkkjewe098.t:THUGB
# NOTE: the number of lines in each .fa file is not fixed. Also, some header lines have [ ], but some do not.
So my code is
#!/usr/bin/env python3
import os
import re
import csv
def concat_hogs(a_file):
    output = []
    for row in a_file: # select the gene names in each HOG fasta file
        if row.startswith(">"):
            trans = row.partition(" ")[0].partition(">")[2]
            if row.partition(" ")[2].startswith("["):
                species = re.search(r"\[(.*?)\]", row).group(1)
            else:
                species = ""
            output.append(trans + ":" + species)
    return '\t'.join(output)
folder = "/tmp/Fasta/"
print("Concatenate names in " + folder)
for file in os.listdir(folder):
    if file.endswith('.fa'):
        reader = csv.reader(file, delimiter="\t")
        print(file + concat_hogs(reader))
But the output only contains the file name, without the part that should be generated by concat_hogs(file). I don't understand why.
The error comes from passing the name of the file (wrapped in csv.reader) to your concat_hogs function instead of an iterable file handle. You never actually open the file for reading.
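A minimal sketch of that fix, keeping the parsing approach from the question: open each file and hand the open handle, not the name, to the function. The names_in_folder helper and its folder argument are illustrative, and the extra strip() is an assumption to cope with header lines that have no space before the newline.

```python
import os
import re


def concat_hogs(a_file):
    """Collect 'gene:species' from every header line of an open FASTA file."""
    output = []
    for row in a_file:
        if row.startswith(">"):
            # strip() removes the newline left over when the header has no space.
            trans = row.partition(" ")[0].partition(">")[2].strip()
            found = re.search(r"\[(.*?)\]", row)
            species = found.group(1) if found else ""
            output.append(trans + ":" + species)
    return "\t".join(output)


def names_in_folder(folder):
    """Yield one 'filename<TAB>names' line per .fa file in folder."""
    for name in sorted(os.listdir(folder)):
        if name.endswith(".fa"):
            # os.listdir returns bare names, so build the full path and open it.
            with open(os.path.join(folder, name)) as handle:
                yield name + "\t" + concat_hogs(handle)
```

Calling print("\n".join(names_in_folder("/tmp/Fasta/"))) would then print one line per file.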
I agree with Jay M that your code can be simplified drastically, not least by using regular expressions more efficiently. Also pathlib is awesome.
But I think it can be even more concise and expressive. Here is my suggestion:
#!/usr/bin/env python3
import re
from pathlib import Path
GENE_PATTERN = re.compile(
    r"^>(?P<trans>[\w.]+)\s+(?:\[(?P<species>\w+)])?"
)
def extract_gene(string: str) -> str:
    match = re.search(GENE_PATTERN, string)
    return ":".join(match.groups(default=""))
def concat_hogs(file_path: Path) -> str:
    with file_path.open("r") as file:
        return '\t'.join(
            extract_gene(row)
            for row in file
            if row.startswith(">")
        )
def main() -> None:
    folder = Path("/tmp/Fasta/")
    print("Concatenate names in", folder)
    for element in folder.iterdir():
        if element.is_file() and element.suffix == ".fa":
            print(element.name, concat_hogs(element))
if __name__ == '__main__':
    main()
I am using named capturing groups for the regular expression because I prefer it for readability and usability later on.
Also I assume that the first group can only contain letters, digits and dots. Adjust the pattern, if there are more options.
PS
Just to add a few additional explanations:
The pathlib module is a great tool for any basic filesystem-related task. Among a few other useful methods you can look up there, I use the Path.iterdir method, which just iterates over elements in that directory instead of creating an entire list of them in memory first the way os.listdir does.
The RegEx Match.groups method returns a tuple of the matched groups, the default parameter allows setting the value when a group was not matched. I put an empty string there, so that I can always simply str.join the groups, even if the species-group was not found. Note that this .groups call will result in an AttributeError, if no match was found because then the match variable will be set to None. It may or may not be useful for you to catch this error.
For a few additional pointers about using regular expressions in Python, there is a great How-To-Guide in the docs. In addition I can only agree with Jay M about how useful regex101.com is, regardless of language specifics. Also, I think I would recommend using his approach of reading the entire file into memory as a single string first and then using re.findall on it to grab all matches at once. That is probably much more efficient than going line-by-line, unless you are dealing with gigantic files.
In concat_hogs I pass a generator expression to str.join. This is more efficient than first creating a list and passing that to join because no additional memory needs to be allocated. This is possible because str.join accepts any iterable of strings and that generator expression (... for ... in ...) returns a Generator, which inherits from Iterator and thus from Iterable. For more insight about the container inheritance structures I always refer to the collections.abc docs.
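To make the points about Match.groups(default=...) and the None case concrete, here is a small sketch that reuses the pattern from this answer. The None guard and the sample rows are additions for illustration; whether you want to skip non-matching rows or let the error surface is up to you.

```python
import re

GENE_PATTERN = re.compile(r"^>(?P<trans>[\w.]+)\s+(?:\[(?P<species>\w+)])?")


def extract_gene(row):
    """Return 'trans:species', or None when the row is not a header line."""
    match = GENE_PATTERN.search(row)
    if match is None:
        # Without this guard, match.groups() would raise AttributeError.
        return None
    # default="" stands in for the unmatched species group, so join always gets strings.
    return ":".join(match.groups(default=""))


rows = [">BnaCnng71140D [BRANA]\n", ">dsgdrgg09769.t\n", "MTSSFKLSDLEEV\n"]
# str.join accepts any iterable of strings, including a generator expression.
line = "\t".join(g for g in map(extract_gene, rows) if g is not None)
print(repr(line))
```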
Use standard Python libraries
In this case
regex (use a site such as regex101 to test your regex)
pathlib to encapsulate paths in a platform independent way
collections.namedtuple to make data more structured
A breakdown of the regex used here:
>([a-z0-9A-Z\.]+?)\s*(\n|\[([A-Z]+)\]?\n)
> The start-of-record character
(regex1) First matching block
\s* Any amount of whitespace (zero is OK)
(regex2|regex3) A choice between two possible regexes
regex1: [a-z0-9A-Z\.]+? = One or more characters from the class (any letter a to z or A to Z, any digit 0 to 9, or a dot), matched lazily
regex2: \n = A newline that immediately follows the whitespace
regex3: \[([A-Z]+)\]?\n = One or more upper-case letters after an opening square bracket (the closing bracket is optional), followed by a newline
Note 1: The parentheses create capture groups, which we later use to split out the fields.
Note 2: The regex allows zero or more whitespace characters between the first and second part of the text line, which makes it more resilient.
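To see what the capture groups hold, here is a quick check of that regex against two header styles from the question. The sample text is shortened and is only meant to illustrate the shape of the findall result.

```python
import re

regex = re.compile(r">([a-z0-9A-Z\.]+?)\s*(\n|\[([A-Z]+)\]?\n)")

text = (
    ">BnaCnng71140D [BRANA]\n"
    "MTSSFKLSDLEEVTTNAEKIQNDLL\n"
    ">dsgdrgg09769.t\n"
    "GGPITKFLRRYSF\n"
)

# findall returns one tuple per header: (name, whole second part, species).
for name, _, species in regex.findall(text):
    print(name, species or "(no species)")
```

The first and last elements of each tuple are the two fields of interest; the last one is an empty string when the header has no bracketed species.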
import re
from collections import namedtuple
from pathlib import Path
import os
class HOG(namedtuple('HOG', ['filepath', 'filename', 'key', 'text'], defaults=[None])):
    __slots__ = ()
    def __str__(self):
        return f"{self.key}:{self.text}"
regex = re.compile(r">([a-z0-9A-Z\.]+?)\s*(\n|\[([A-Z]+)\]?\n)")
path = Path(os.path.abspath("."))
wildcard = "*.fa"
files = list(path.glob("*.fa"))
print(f"Searching {path}/{wildcard} => found {len(files)} files")
data = {}
for file in files:
    print(f"Processing {file}")
    with file.open() as hf:
        text = hf.read(-1)
    matches = regex.findall(text)
    for match in matches:
        key = match[0].strip()
        text = match[-1].strip()
        if file.name not in data:
            data[file.name] = []
        data[file.name].append(HOG(path, file.name, key, text))
print("Now you have the data you can process it as you like")
for file, entries in data.items():
    line = "\t".join(list(str(e) for e in entries))
    print(file, line)
# e.g. Write the output as desired
with Path("output").open("w") as fh:
    for file, entries in data.items():
        line = "\t".join(list(str(e) for e in entries))
        fh.write(f"{file}\t{line}\n")

Searching multiple files to define variable

With Python, I need to search a file for a string and use it to define a variable. If there are no matches in that file, it searches another file. I only have 2 files for now, but handling more is a plus. Here is what I currently have:
regex = re.compile(r'\b[01] [01] '+dest+r'\b')
dalt=None
with open(os.path.join('path','to','file','file.dat'), 'r') as datfile:
    for line in datfile:
        if regex.search(line):
            params=line.split()
            dalt=int(params[1])
            break
if dalt is None:
    with open(os.path.join('different','file','path','file.dat'), 'r') as fdatfile:
        for line in fdatfile:
            if regex.search(line):
                params=line.split()
                dalt=int(params[1])
                break
if dalt is None:
    print "Not found, giving up"
    dalt=0
Is there a better way to do this? I feel like a loop would work but I'm not sure how exactly. I'm sure there are also ways to make the code more "safe", suggestions in addition to answers are appreciated.
I'm coding for Python 2.7.3
As requested, here is an example of what I am searching for:
The string I will have to search with is "KBFI" (dest), and I want to find this line:
1 21 1 0 KBFI Boeing Field King Co Intl
Previously I had if dest in line, but in some cases dest can appear in other lines. So I switched to a regex that also matches the two digits before dest, which can be 0 or 1. This seems to be working fine at least most of the time (haven't identified any bad cases yet). Although based on the spec, supposedly the right line will start with a 1, so maybe the right search is:
r'^1\s.*'+dest
But I haven't tested that. I suppose a fairly exact search would be:
r'^1\s+\d{,5}\s+[01]\s+[01]\s+'+dest+r'\b'
Since the fields are 1, up to five digit number (this is what I need to return), 0 or 1, 0 or 1, then the string I'm searching for. (I haven't done much regex so I'm learning)
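As a quick check of that stricter pattern, here is a sketch run against the sample line above. Two things are assumptions rather than part of the question: a capture group is added so the number can be read straight from the match, and \d{1,5} is used instead of \d{,5} so that at least one digit is required. The helper name build_pattern is made up.

```python
import re


def build_pattern(dest):
    """Regex for: 1, a 1-5 digit number (captured), 0/1, 0/1, then dest."""
    return re.compile(r"^1\s+(\d{1,5})\s+[01]\s+[01]\s+" + re.escape(dest) + r"\b")


line = "1 21 1 0 KBFI Boeing Field King Co Intl"
match = build_pattern("KBFI").search(line)
dalt = int(match.group(1)) if match else None
print(dalt)  # 21
```

Because of the ^ anchor and the trailing \b, a line that merely mentions KBFI somewhere, or a longer identifier that starts with KBFI, does not match.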
fileinput can take a list of files:
import fileinput
import re
regex = re.compile(regexstring)
dir1 = "path_to_dir/file.dat"
dir2 = "path_to_dir2/file.dat"
for line in fileinput.input([dir1, dir2]): # pass all files to check
    if regex.search(line):
        params = line.split()
        dalt = int(params[1])
        print(dalt)
        break # found it so leave the loop
else: # if we get here no file had what we want
    print "Not found, giving up"
If you want all the files from certain directories with similar names use glob and whatever pattern you want to match:
import glob
dir1 = "path_to_dir/"
dir2 = "path_to_dir2/"
path1_files = glob.glob(dir1+"file*.dat")
path2_files = glob.glob(dir2+"file*.dat")
You might not actually need a regex either, a simple in line may be enough.
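If you would rather keep an explicit loop over the files, a sketch along these lines handles any number of paths. It is written in Python 3 syntax; the function name find_dalt, the default of 0, and the choice to skip unreadable files are all assumptions, not something the question specifies.

```python
import re


def find_dalt(paths, dest, default=0):
    """Search each file in order and return the number from the first matching line."""
    regex = re.compile(r"\b[01] [01] " + re.escape(dest) + r"\b")
    for path in paths:
        try:
            with open(path) as datfile:
                for line in datfile:
                    if regex.search(line):
                        # Second whitespace-separated field, as in the question.
                        return int(line.split()[1])
        except (IOError, OSError):
            # Assumption: a missing or unreadable file is simply skipped.
            continue
    return default
```

Returning from inside the loop replaces both the break and the repeated "if dalt is None" blocks.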

python pattern match and process

I am trying to parse a log with a bunch of lines.
The line I am trying to parse from a live trace (something like tail on the file) is the one that starts with "Contact".
I need to use whatever is between the brackets
[2a00:c30:7141:230:1066:4f46:7243:a6d2] and the number after the colon that follows the brackets (56791)
as variables.
I have tried a regex search, but I do not know how to go about it.
Contact: "200" <sip:200#[2a00:c30:7141:230:1066:4f46:7243:a6d2]:56791;transport=udp;registering_acc=example_com>;expires=600
If the format is always the same:
for line in logfile:
    if "Contact" in line:
        myIPAddress=line.split('[')[1].split(']')[0]
        myPort=line.split(']:')[1].split(';')[0]
Use a regex to do so:
import re
logfile = open('xxx.log')
p = r'\[([a-f0-9:]+)\]:([0-9]+)'
pattern = re.compile(p)
for line in logfile:
    if line.startswith('Contact:'):
        print pattern.search(line).groups()
logfile.close()
If you are getting new entries through something like tail -f $logfile, you can pipe the output of that to this:
import re
import sys
for line in sys.stdin:
    m = re.match(r'Contact: .*?\[(.*?)\]:(\d+)', line)
    if m is not None:
        address, port = m.groups()
        print address, port
Basically, this reads each line that comes in on standard input and tries to find the items you are interested in. If a line does not match, it prints nothing.
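The same pattern can be tried on the sample line from the question without any piping. This is a sketch in Python 3 syntax; the function name parse_contact is hypothetical, and converting the port to int is a choice made here, not something the question asks for.

```python
import re

CONTACT = re.compile(r"Contact: .*?\[(.*?)\]:(\d+)")


def parse_contact(line):
    """Return (address, port) from a Contact line, or None for any other line."""
    m = CONTACT.match(line)
    if m is None:
        return None
    address, port = m.groups()
    return address, int(port)


sample = ('Contact: "200" <sip:200#[2a00:c30:7141:230:1066:4f46:7243:a6d2]:56791;'
          'transport=udp;registering_acc=example_com>;expires=600')
print(parse_contact(sample))
```

Since re.match anchors at the start of the string, lines that do not begin with "Contact: " are ignored.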
match = re.search(r'Contact: .*?\[(.*?)\]:(\d+)', line_in_file)
if match:
    temp = line_in_file.split('[')
    temp1 = temp[1].split(';')
    hexValues = re.findall('[a-f0-9]', temp1[0])

Rename Files Based on File Content

Using Python, I'm trying to rename a series of .txt files in a directory according to a specific phrase in each given text file. Put differently and more specifically, I have a few hundred text files with arbitrary names but within each file is a unique phrase (something like No. 85-2156). I would like to replace the arbitrary file name with that given phrase for every text file. The phrase is not always on the same line (though it doesn't deviate that much) but it always is in the same format and with the No. prefix.
I've looked at the os module and I understand how
os.listdir
os.path.join
os.rename
could be useful but I don't understand how to combine those functions with intratext manipulation functions like linecache or general line reading functions.
I've thought through many ways of accomplishing this task, but it seems like the easiest and most efficient way would be to create a loop that finds the unique phrase in a file, assigns it to a variable, and uses that variable to rename the file before moving on to the next file.
This seems like it should be easy, so much so that I feel silly writing this question. I've spent the last few hours reading documentation and searching through StackOverflow, but it doesn't seem like anyone has had quite this issue before, or at least they haven't asked about it.
Can anyone point me in the right direction?
EDIT 1: When I create the regex pattern using this website, it creates bulky but seemingly workable code:
import re
txt='No. 09-1159'
re1='(No)' # Word 1
re2='(\\.)' # Any Single Character 1
re3='( )' # White Space 1
re4='(\\d)' # Any Single Digit 1
re5='(\\d)' # Any Single Digit 2
re6='(-)' # Any Single Character 2
re7='(\\d)' # Any Single Digit 3
re8='(\\d)' # Any Single Digit 4
re9='(\\d)' # Any Single Digit 5
re10='(\\d)' # Any Single Digit 6
rg = re.compile(re1+re2+re3+re4+re5+re6+re7+re8+re9+re10,re.IGNORECASE|re.DOTALL)
m = rg.search(txt)
name = m.group(0)
print name
When I manipulate that to fit the glob.glob structure, and make it like this:
import glob
import os
import re
re1='(No)' # Word 1
re2='(\\.)' # Any Single Character 1
re3='( )' # White Space 1
re4='(\\d)' # Any Single Digit 1
re5='(\\d)' # Any Single Digit 2
re6='(-)' # Any Single Character 2
re7='(\\d)' # Any Single Digit 3
re8='(\\d)' # Any Single Digit 4
re9='(\\d)' # Any Single Digit 5
re10='(\\d)' # Any Single Digit 6
rg = re.compile(re1+re2+re3+re4+re5+re6+re7+re8+re9+re10,re.IGNORECASE|re.DOTALL)
for fname in glob.glob("\file\structure\here\*.txt"):
    with open(fname) as f:
        contents = f.read()
    tname = rg.search(contents)
    print tname
Then this prints out the byte location of the pattern, which suggests that the regex pattern is correct. However, when I add the line nname = tname.group(0) after the original tname = rg.search(contents) and change the print statement to reflect the change, it gives me the following error: AttributeError: 'NoneType' object has no attribute 'group'. When I tried copying and pasting @joaquin's code line for line, it came up with the same error. I was going to post this as a comment on @spatz's answer, but I wanted to include so much code that this seemed a better way to express the 'new' problem. Thank you all for the help so far.
Edit 2: This is for @joaquin's answer below:
import glob
import os
import re
for fname in glob.glob("/directory/structure/here/*.txt"):
    with open(fname) as f:
        contents = f.read()
    tname = re.search('No\. (\d\d\-\d\d\d\d)', contents)
    nname = tname.group(1)
    print nname
Last Edit: I got it to work using mostly the code as written. What was happening is that some files didn't contain a match for the regex, and I had assumed Python would skip them. Silly me. So I spent three days learning to write two lines of code (I know the lesson is more than that). I also used the error-catching method recommended here. I wish I could accept all of your answers, but I bothered @joaquin the most so I gave it to him. This was a great learning experience. Thank you all for being so generous with your time. The final code is below.
import os
import re
pat3 = "No\. (\d\d-\d\d)"
ext = '.txt'
mydir = '/directory/files/here'
for arch in os.listdir(mydir):
    archpath = os.path.join(mydir, arch)
    with open(archpath) as f:
        txt = f.read()
    s = re.search(pat3, txt)
    if s is None:
        continue
    name = s.group(1)
    newpath = os.path.join(mydir, name)
    if not os.path.exists(newpath):
        os.rename(archpath, newpath + ext)
    else:
        print '{} already exists, passing'.format(newpath)
Instead of providing you with some code which you will simply copy-paste without understanding, I'd like to walk you through the solution so that you will be able to write it yourself, and more importantly gain enough knowledge to be able to do it alone next time.
The code which does what you need is made up of three main parts:
Getting a list of all filenames you need to iterate
For each file, extract the information you need to generate a new name for the file
Rename the file from its old name to the new one you just generated
Getting a list of filenames
This is best achieved with the glob module. This module allows you to specify shell-like wildcards and it will expand them. This means that in order to get a list of .txt files in a given directory, you call glob.iglob("/path/to/directory/*.txt") and iterate over its result (for filename in ...:).
Generate new name
Once we have our filename, we need to open() it, read its contents using read() and store it in a variable where we can search for what we need. That would look something like this:
with open(filename) as f:
    contents = f.read()
Now that we have the contents, we need to look for the unique phrase. This can be done using regular expressions. Store the new filename you want in a variable, say newfilename.
Rename
Now that we have both the old and the new filenames, we need to simply rename the file, and that is done using os.rename(filename, newfilename).
If you want to move the files to a different directory, use os.rename(filename, os.path.join("/path/to/new/dir", newfilename)). Note that we need os.path.join here to construct the new path for the file using a directory path and newfilename.
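Put together, the three steps might look roughly like the sketch below (Python 3 syntax). The function name rename_by_phrase, the .txt suffix on the new name, and the decisions to skip files without a match and never overwrite an existing file are all assumptions made for illustration. It is still worth backing up the directory before trying anything like this.

```python
import glob
import os
import re

# "No. " followed by two digits, a dash, and four digits, e.g. "No. 85-2156".
PHRASE = re.compile(r"No\. (\d\d-\d\d\d\d)")


def rename_by_phrase(directory):
    """Rename each .txt file in directory after the 'No. NN-NNNN' phrase inside it."""
    renamed = []
    # glob.glob returns a full list up front, so renaming does not disturb the loop.
    for filename in sorted(glob.glob(os.path.join(directory, "*.txt"))):
        with open(filename) as f:
            contents = f.read()
        found = PHRASE.search(contents)
        if found is None:
            continue  # no phrase in this file: leave it alone
        newfilename = os.path.join(directory, found.group(1) + ".txt")
        if os.path.exists(newfilename):
            continue  # assumption: never overwrite an existing file
        os.rename(filename, newfilename)
        renamed.append(newfilename)
    return renamed
```

The returned list makes it easy to print or log what was actually renamed.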
There is no checking or protection against failures (checking whether archpath is a file, whether newpath already exists, whether the search is successful, etc.), but this should work:
import os
import re
pat = "No\. (\d\d\-\d\d\d\d)"
mydir = 'mydir'
for arch in os.listdir(mydir):
    archpath = os.path.join(mydir, arch)
    with open(archpath) as f:
        txt = f.read()
    s = re.search(pat, txt)
    name = s.group(1)
    newpath = os.path.join(mydir, name)
    os.rename(archpath, newpath)
Edit: I tested the regex to show how it works:
>>> import re
>>> pat = "No\. (\d\d\-\d\d\d\d)"
>>> txt='nothing here or whatever No. 09-1159 you want, does not matter'
>>> s = re.search(pat, txt)
>>> s.group(1)
'09-1159'
>>>
The regex is very simple:
\. -> a dot
\d -> a decimal digit
\- -> a dash
So, it says: search for the string "No. " followed by 2+4 decimal digits separated by a dash.
The parentheses are to create a group that I can recover with s.group(1) and that contains the code number.
And that is what you get, before and after:
Text of files one.txt, two.txt and three.txt is always the same, only the number changes:
this is the first
file with a number
nothing here or whatever No. 09-1159 you want, does not matter
the number is
Create a backup of your files, then try something like this:
import glob
import os
def your_function_to_dig_out_filename(lines):
    import re
    # i'll let you attempt this yourself
for fn in glob.glob('/path/to/your/dir/*.txt'):
    with open(fn) as f:
        spam = f.readlines()
    new_fn = your_function_to_dig_out_filename(spam)
    if not os.path.exists(new_fn):
        os.rename(fn, new_fn)
    else:
        print '{} already exists, passing'.format(new_fn)

How to Find a String in a Text File And Replace Each Time With User Input in a Python Script?

I am new to python so excuse my ignorance.
Currently, I have a text file with some words marked as <<word>>.
My goal is to essentially build a script which runs through a text file with such marked words. Each time the script finds such a word, it would ask the user for what it wants to replace it with.
For example, if I had a text file:
Today was a <<feeling>> day.
The script would run through the text file so the output would be:
Running script...
feeling? great
Script finished.
And generate a text file which would say:
Today was a great day.
Advice?
Edit: Thanks for the great advice! I have made a script that works for the most part like I wanted. Just one thing: I am now working on the case where several placeholders share the same name (for instance, "I am <<feeling>>. Bob is also <<feeling>>."). The script should prompt feeling? only once and fill in every placeholder with that name.
Thanks so much for your help again.
import re
with open('in.txt') as infile:
    text = infile.read()
search = re.compile('<<([^>]*)>>')
text = search.sub(lambda m: raw_input(m.group(1) + '? '), text)
with open('out.txt', 'w') as outfile:
    outfile.write(text)
Basically the same solution as that offered by @phihag, but in script form
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import argparse
import re
from os import path
pattern = '<<([^>]*)>>'
def user_replace(match):
    return raw_input('%s? ' % match.group(1))
def main():
    parser = argparse.ArgumentParser()
    parser.add_argument('infile', type=argparse.FileType('r'))
    parser.add_argument('outfile', type=argparse.FileType('w'))
    args = parser.parse_args()
    matcher = re.compile(pattern)
    for line in args.infile:
        new_line = matcher.sub(user_replace, line)
        args.outfile.write(new_line)
    args.infile.close()
    args.outfile.close()
if __name__ == '__main__':
    main()
Usage: python script.py input.txt output.txt
Note that this script does not account for non-ascii file encoding.
To open a file and loop through it:
Use raw_input to get input from user
Now, put this together and update your question if you run into problems :-)
I understand you want advice on how to structure your script, right? Here's what I would do:
Read the file at once and close it (I personally don't like to have open file objects, especially if my filesystem is remote).
Use a regular expression (phihag has suggested one in his answer, so I won't repeat it) to match the pattern of your placeholders. Find all of your placeholders and store them in a dictionary as keys.
For each word in the dictionary, ask the user with raw_input (not just input). And store them as values in the dictionary.
When done, parse your text substituting any instance of a given placeholder (key) with the user word (value). This is also done with regex.
The reason for using a dictionary is that a given placeholder could occur more than once and you probably don't want to make the user repeat the entry over and over again...
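The dictionary approach described in this answer could be sketched as follows in Python 3 syntax (on Python 2 you would pass raw_input as the ask argument). The function name fill_placeholders and the ask parameter are made up for illustration; the parameter only exists so the prompting can be swapped out when trying the function.

```python
import re

PLACEHOLDER = re.compile(r"<<([^>]*)>>")


def fill_placeholders(text, ask=input):
    """Replace every <<name>> in text, asking only once per distinct name."""
    answers = {}
    for name in PLACEHOLDER.findall(text):
        if name not in answers:
            answers[name] = ask(name + "? ")
    # Second pass: substitute each placeholder with the stored answer.
    return PLACEHOLDER.sub(lambda m: answers[m.group(1)], text)
```

With the text "I am <<feeling>>. Bob is also <<feeling>>." the user is prompted with feeling? a single time and both occurrences receive the same answer.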
Try something like this
lines = []
with open(myfile, "r") as infile:
    lines = infile.readlines()
outlines = []
for line in lines:
    index = line.find("<<")
    if index >= 0: # find() returns -1 when there is no "<<"
        word = line[index+2:line.find(">>")]
        input = raw_input(word+"? ")
        outlines.append(line.replace("<<"+word+">>", input))
    else:
        outlines.append(line)
with open(outfile, "w") as output:
    for line in outlines:
        output.write(line)
Disclaimer: I haven't actually run this, so it might not work, but it looks about right and is similar to something I've done in the past.
How it works:
It parses the file in as a list where each element is one line of the file.
It builds the output list of lines. It iterates through the lines in the input, checking whether the string << exists. If it does, it pulls out the word between the << and >> markers and uses it as the prompt for a raw_input query. It takes the input from that query and replaces the word inside the arrows (and the arrows themselves) with the input. It then appends the result to the list. If it doesn't see the arrows, it simply appends the line unchanged.
After running through all the lines, it writes them to the output file. You can make this whatever file you want.
Some issues:
As written, this will work for only one arrow statement per line. So if you had <<firstname>> <<lastname>> on the same line, it would ignore the lastname portion. Fixing this wouldn't be too hard: you could turn the index test into a while loop that keeps the body of the if statement. Just remember to update the index again if you do that!
It iterates through the list three times. You could likely reduce this to two, but if you have a small text file this shouldn't be a huge problem.
It could be sensitive to encoding - I'm not entirely sure about that however. Worst case there you need to cast as a string.
Edit: Moved the +2 to fix the broken if statement.
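The while-loop fix mentioned under "Some issues" could look something like this sketch, written in Python 3 syntax. The function name fill_line and the ask parameter are hypothetical (pass raw_input on Python 2), and leaving an unmatched "<<" untouched is an assumption.

```python
def fill_line(line, ask=input):
    """Replace every <<word>> marker in one line, working left to right."""
    index = line.find("<<")
    while index >= 0:
        end = line.find(">>", index)
        if end < 0:
            break  # unmatched "<<": leave the rest of the line untouched
        word = line[index + 2:end]
        answer = ask(word + "? ")
        line = line[:index] + answer + line[end + 2:]
        # Resume searching after the inserted answer so it is never re-scanned.
        index = line.find("<<", index + len(answer))
    return line
```

Each line of the input file can then be passed through fill_line before it is appended to the output list.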
