Python isn't printing my regex expression results (or anything)? - python

I'm trying to extract the first number that appears in the first line of my text file. I'm new to Python, so I'm experimenting with regex. The problem is that nothing prints, so I'm not sure whether it's my code or something else.
I've tried printing my file names too, and nothing happens there either, so I'm not sure what's going on.
work_dir = "User/...my folder of 9 text files"
for path in glob.glob(os.path.join(work_dir, "*.txt")):
    with io.open(path, mode="r", encoding="utf-8") as file:
        first_line = file.readline()
        for line[34:] in first_line:
            if "LOCUS" in line[0:34]:
                matches = int(re.search(r"(\d+)", first_line).group(0))
                print(matches)
    name = os.path.basename(path).replace(".gbff", "")
    print(name)
Here's the head of an example of the type of file I'm working with.
It's a text file even though it looks like a table here.
LOCUS AE017334 *5227419* bp DNA circular BCT 03-DEC-2015
DEFINITION Bacillus anthracis str. 'Ames Ancestor', complete genome.
ACCESSION AE017334
VERSION AE017334.2
DBLINK BioProject: PRJNA10784
BioSample: SAMN02603433
I need the number I've put ** around

I got output with your regex and text format, and it works fine with the slicing and the other steps you mentioned, so the problem is not the regex or the for loop. Since you say nothing is printed, and I assume no errors are printed either, I think it has something to do with your path or how the directory is read.
Anyway, here's your regex part:
import re
# Note: the [34:] offset assumes the file's original column spacing, which is collapsed in the line shown here
first_line='LOCUS AE017334 *5227419* bp DNA circular BCT 03-DEC-2015'
matches = int(re.search(r"(\d+)", first_line[34:]).group(0))
print(matches)
output:
5227419
Posting this so others trying to answer can skip these steps and look into the other parts of your code.
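A fixed slice offset such as [34:] only works while the column spacing stays exactly the same, and a plain \d+ search over the whole line would first hit the digits inside the accession (AE017334). A minimal sketch of an offset-independent alternative follows, assuming the line always has the shape "LOCUS, name, length, bp". The names LOCUS_LENGTH and locus_length are made up for this example, and the optional asterisks only cover the markers added in the question's sample.

```python
import re

# Assumed layout: "LOCUS", an accession name, then the sequence length followed by "bp".
# The optional asterisks only cover the markers added in the question's sample.
LOCUS_LENGTH = re.compile(r"^LOCUS\s+\S+\s+\*?(\d+)\*?\s+bp\b")

def locus_length(first_line):
    """Return the sequence length from a LOCUS line, or None if the line does not match."""
    m = LOCUS_LENGTH.match(first_line)
    return int(m.group(1)) if m else None

print(locus_length("LOCUS AE017334 *5227419* bp DNA circular BCT 03-DEC-2015"))
print(locus_length("LOCUS       AE017334             5227419 bp    DNA     circular"))
print(locus_length("DEFINITION Bacillus anthracis str. 'Ames Ancestor', complete genome."))
```

Returning None for a non-matching line lets the calling loop skip files whose first line is not a LOCUS line instead of raising an AttributeError.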

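# Read the whole file; a.txt is the name I saved your sample under
with open('a.txt', 'r') as f:
    text = f.read()
start = '*'
end = '*'
# Print everything between the first and the last asterisk
print(text[text.find(start) + len(start):text.rfind(end)])
print("\n")
I saved your file as a.txt; replace that with your file name. If you only need the LOCUS value, rearrange the text before using regex.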

Related

Python - Regex pattern from a log file

I previously raised question where I wanted to find the latest file from a set of servers. This was answered thanks to this community.
Now another problem I am facing is that when I try to find a regex pattern match within the latest log file, I am getting a blank result i.e. the program cannot find the match even though notepad++ matches my pattern with the string I want [I just wanted to check if my pattern was at fault].
I want to search for 3 different patterns within the same file and I have tried the below code but still no output:
import glob
import os
import re
paths = [r'\\Server1\Logs\*.log',
r'\\Server2\Logs\*.log',
.....
r'\\Server16\Logs\*.log']
list_of_files = []
for path in paths:
    list_of_files.extend(glob.glob(path))
#Find the latest log file from all the servers:
if list_of_files:
    latest_file = max(list_of_files, key=os.path.getctime)
    f = open(os.path.join(latest_file), "r")
    s = f.read()
    #Search for the required pattern:
    pattern1 = re.search(r"[A-Z\s]{4}\_[A-Z\s]{8}\_[A-Z\s]{4}\_[0-9.]{4}[A-Z\s]{3}\"\]\s\-\s[a-z\s]{8}", s)
    pattern2 = re.search(r"[A-Z]{4}\_[A-Z]{5}\_[A-Z]{4}\_[0-9]{1}\_[0-9]{8}\_[0-9.]{7}[A-Z]{3}\"\]\s\-\s[a-z]{8}", s)
    pattern3 = re.search(r"[A-Z]{4}\_[A-Z]{4}\_[0-9]{1}\_[0-9]{8}\_[0-9.]{7}[A-Z]{3}\"\]\s\-\s[a-z]{8}", s)
    print(pattern1)
    print(pattern2)
    print(pattern3)
    print(latest_file)
else:
    print("No log files found!")
Please note that I have tried re.findall and other re methods as well, but without success.
I have also tried to use for line in s and then pattern search in line but again, no success.
Apologies if this has a simple solution that I am not able to grasp but since I am new to the concept of programming itself, any help is really appreciated.
Thank you in advance!
As advised by the community, below is the sample of the log file that I am trying to find my pattern in:
Full line of random data that is not important to me
Another full line of random data that is not important to me
.
.
.
.
.
Yet another full line of random data that is not important to me
Upload of ["\\DATA01-ABC\companyname.projectname.appname\Production\WORD-Outbound\ProcessNameAndFile\WORD_ACTIVITY_FILE_010.DAT"] - complete.
The part I want to match is: WORD_ACTIVITY_FILE_010.DAT"] - complete
As this log is something that I don't think I can paste on the internet, I have replaced the desired output with random words that hopefully make sense.
The issue was resolved by passing "utf-16" as the encoding when opening the .log file (the f variable).
Before that, the code was reading the log file as if it were ASCII, which made every character in the file appear spaced out, so the patterns could not match.
This fixed the issue for me:
import codecs
f = codecs.open(os.path.join(latest_file), "r", "utf-16")
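To make the effect of the encoding visible, here is a small self-contained sketch. It writes a throwaway UTF-16 file, then searches it twice: once decoded as a single-byte encoding (latin-1 stands in for the "wrong" reading) and once as UTF-16. The log line is shortened from the sample above, the pattern is a trimmed version of pattern1, and search_log is a hypothetical helper. On Python 3 the built-in open() accepts the encoding directly, so codecs is not needed.

```python
import os
import re
import tempfile

# Hypothetical log line, shortened from the sample in the question.
LINE = 'Upload of ["WORD_ACTIVITY_FILE_010.DAT"] - complete.\n'
PATTERN = re.compile(r'[A-Z]{4}_[A-Z]{8}_[A-Z]{4}_[0-9.]{4}[A-Z]{3}"\]\s-\s[a-z]{8}')

def search_log(path, encoding):
    """Read the whole file with the given encoding and search it once."""
    with open(path, "r", encoding=encoding) as f:
        return PATTERN.search(f.read())

# Write a UTF-16 file to stand in for the real log.
with tempfile.NamedTemporaryFile("w", encoding="utf-16", suffix=".log", delete=False) as tmp:
    tmp.write(LINE)
    log_path = tmp.name

wrong = search_log(log_path, "latin-1")  # NUL characters end up between the letters
right = search_log(log_path, "utf-16")
os.remove(log_path)

print(wrong)
print(right.group(0))
```

Decoded with the wrong encoding, every letter is followed by a NUL character, so a run such as [A-Z]{4} can never match even though the text looks correct in an editor.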

Trying to understand how to get import re to work in pycharm

I'm going through a course at work for Python. We're using Pycharm, and I'm not sure if that's what the problem is.
Basically, I have to read in a text file, scrub it, then count the frequency of specific words. The counting is not an issue. (I looped through a scrubbed list, checked the scrubbed list for the specific words, then added the specific words to a dictionary as I looped through the list. It works fine).
My issue is really about scrubbing the data. I ended up doing successive scrubs to get to a final clean list. But when I read the documentation, I should be able to use regex or re and scrub my file with one line of code. No matter what I do, importing re, or regex I get errors that stop my code.
How can I write the below code pythonically?
# Open the file in read mode
with open('chocolate.txt', 'r') as file:
    input_col = file.read().replace(',', '')
text3 = input_col.replace('.', '')
text2 = text3.replace('"', '')
text = text2.split()
You could try using a regular expression, something like this:
import re
# The dot is escaped because an unescaped . matches any character
result = re.sub(r'("|\.|,)', "", text)
print(result)
Here text is the string you read from the text file.
Hope this helps!
x = re.sub(r'("|\.|,)', "", text)
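As a sketch of how the whole scrub could collapse into one pass, the version below uses a character class instead of alternation and then splits on whitespace. The sample sentence and the function name scrub are invented for illustration, and collections.Counter is only one possible way to do the counting the question mentions.

```python
import re
from collections import Counter

def scrub(raw_text):
    """Remove commas, periods and double quotes in one pass, then split on whitespace."""
    return re.sub(r'[",.]', "", raw_text).split()

sample = 'Dark chocolate, milk chocolate. "White" chocolate.'
words = scrub(sample)
counts = Counter(words)
print(words)
print(counts["chocolate"])
```

Inside a character class the dot is literal, so it needs no backslash there.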

Write a single poly-linear string to multiple lines in .txt

I have encountered a strange problem which I am struggling to resolve. When I run re.findall() over a .txt file and then try to print and write the results, all of the results I expect appear, but in different formats.
The code (modified from a similar thread I found earlier):
import re

with open ('test.txt') as text:
    text = text.read()
    match = re.findall(r'[\w\.-]+#[\w\.-]+', text)
    for i in match:
        with open ('list.txt', 'a') as dest:
            i = str(i)
            print(i)
            dest.write(i)
The interpreter then produces the result:
a#a
b#b
c#c
which is exactly what I would expect it to do, given the contents of test.txt.
However, list.txt reads:
(generic existing text goes here)
a#ab#bc#c
while I want it to (and believe it should) read
(generic existing text goes here)
a#a
b#b
c#c
I've tried using writelines() in place of write(), but this did not help. What difference between print() and write() causes this, and how would one avoid it?
N.B. I am 99% sure that line 8 i = str(i) serves no purpose, but I've left it in because it's what I've been doing. Not really sure why...
I'll start with your last comment. What str(i) does is it converts i to its string representation (which is defined in i's class's __str__ method). If you call str(4) you get '4', for example. This is unnecessary in this case because re.findall returns a list of strings as per the documentation.
As for your actual issue: you're missing the newlines. I would also prefer to open the file fewer times than you are.
Perhaps try:
import re
with open ('test.txt') as text:
    text = text.read()
    match = re.findall(r'[\w\.-]+#[\w\.-]+', text)
with open('list.txt', 'a') as dest:
    for i in match:
        print(i)
        dest.write(i + '\n')
(You can also remove the print(i) line if you don't want to see the output in the console every time a write is done.)
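The difference between the two calls can be shown without touching any files. In this sketch io.StringIO objects stand in for the destination file, and the match list is copied from the output in the question.

```python
import io

matches = ["a#a", "b#b", "c#c"]

# write() emits exactly the string it is given, so nothing separates the items.
joined = io.StringIO()
for m in matches:
    joined.write(m)

# print() appends end="\n" by default; file= redirects it to any writable object.
separated = io.StringIO()
for m in matches:
    print(m, file=separated)

print(repr(joined.getvalue()))
print(repr(separated.getvalue()))
```

So print(i, file=dest) would be another way to get one match per line, equivalent to dest.write(i + '\n').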

Rename Files Based on File Content

Using Python, I'm trying to rename a series of .txt files in a directory according to a specific phrase in each given text file. Put differently and more specifically, I have a few hundred text files with arbitrary names but within each file is a unique phrase (something like No. 85-2156). I would like to replace the arbitrary file name with that given phrase for every text file. The phrase is not always on the same line (though it doesn't deviate that much) but it always is in the same format and with the No. prefix.
I've looked at the os module and I understand how os.listdir, os.path.join and os.rename could be useful, but I don't understand how to combine those functions with intratext manipulation functions like linecache or general line reading functions.
I've thought through many ways of accomplishing this task but it seems like easiest and most efficient way would be to create a loop that finds the unique phrase in a file, assigns it to a variable and use that variable to rename the file before moving to the next file.
This seems like it should be easy, so much so that I feel silly writing this question. I've spent the last few hours reading documentation and searching through StackOverflow, but it doesn't seem like anyone has quite had this issue before, or at least they haven't asked about it.
Can anyone point me in the right direction?
EDIT 1: When I create the regex pattern using this website, it creates bulky but seemingly workable code:
import re
txt='No. 09-1159'
re1='(No)' # Word 1
re2='(\\.)' # Any Single Character 1
re3='( )' # White Space 1
re4='(\\d)' # Any Single Digit 1
re5='(\\d)' # Any Single Digit 2
re6='(-)' # Any Single Character 2
re7='(\\d)' # Any Single Digit 3
re8='(\\d)' # Any Single Digit 4
re9='(\\d)' # Any Single Digit 5
re10='(\\d)' # Any Single Digit 6
rg = re.compile(re1+re2+re3+re4+re5+re6+re7+re8+re9+re10,re.IGNORECASE|re.DOTALL)
m = rg.search(txt)
name = m.group(0)
print(name)
When I manipulate that to fit the glob.glob structure, and make it like this:
import glob
import os
import re
re1='(No)' # Word 1
re2='(\\.)' # Any Single Character 1
re3='( )' # White Space 1
re4='(\\d)' # Any Single Digit 1
re5='(\\d)' # Any Single Digit 2
re6='(-)' # Any Single Character 2
re7='(\\d)' # Any Single Digit 3
re8='(\\d)' # Any Single Digit 4
re9='(\\d)' # Any Single Digit 5
re10='(\\d)' # Any Single Digit 6
rg = re.compile(re1+re2+re3+re4+re5+re6+re7+re8+re9+re10,re.IGNORECASE|re.DOTALL)
for fname in glob.glob(r"\file\structure\here\*.txt"):
    with open(fname) as f:
        contents = f.read()
    tname = rg.search(contents)
    print(tname)
This prints the byte location of the pattern, which suggests that the regex pattern is correct. However, when I add the line nname = tname.group(0) after tname = rg.search(contents) and change the print call to match, I get the following error: AttributeError: 'NoneType' object has no attribute 'group'. When I tried copying and pasting @joaquin's code line for line, I got the same error. I was going to post this as a comment on @spatz's answer, but I wanted to include so much code that this seemed a better way to express the 'new' problem. Thank you all for the help so far.
Edit 2: This is for @joaquin's answer below:
import glob
import os
import re
for fname in glob.glob("/directory/structure/here/*.txt"):
    with open(fname) as f:
        contents = f.read()
    tname = re.search(r'No\. (\d\d\-\d\d\d\d)', contents)
    nname = tname.group(1)
    print(nname)
Last Edit: I got it to work using mostly the code as written. Some files did not contain a match for the regex, and I had assumed Python would simply skip them. Silly me. So I spent three days learning to write two lines of code (I know the lesson is more than that). I also used the error catching method recommended here. I wish I could mark all of your answers as accepted, but I bothered @Joaquin the most so I gave it to him. This was a great learning experience. Thank you all for being so generous with your time. The final code is below.
import os
import re
pat3 = r"No\. (\d\d-\d\d)"
ext = '.txt'
mydir = '/directory/files/here'
for arch in os.listdir(mydir):
    archpath = os.path.join(mydir, arch)
    with open(archpath) as f:
        txt = f.read()
    s = re.search(pat3, txt)
    if s is None:
        continue
    name = s.group(1)
    # Include the extension so the existence check tests the real target path
    newpath = os.path.join(mydir, name + ext)
    if not os.path.exists(newpath):
        os.rename(archpath, newpath)
    else:
        print('{} already exists, passing'.format(newpath))
Instead of providing you with some code which you will simply copy-paste without understanding, I'd like to walk you through the solution so that you will be able to write it yourself, and more importantly gain enough knowledge to be able to do it alone next time.
The code which does what you need is made up of three main parts:
1. Getting a list of all filenames you need to iterate
2. For each file, extract the information you need to generate a new name for the file
3. Rename the file from its old name to the new one you just generated
Getting a list of filenames
This is best achieved with the glob module. This module allows you to specify shell-like wildcards and it will expand them. This means that in order to get a list of .txt file in a given directory, you will need to call the function glob.iglob("/path/to/directory/*.txt") and iterate over its result (for filename in ...:).
Generate new name
Once we have our filename, we need to open() it, read its contents using read() and store it in a variable where we can search for what we need. That would look something like this:
with open(filename) as f:
contents = f.read()
Now that we have the contents, we need to look for the unique phrase. This can be done using regular expressions. Store the new filename you want in a variable, say newfilename.
Rename
Now that we have both the old and the new filenames, we need to simply rename the file, and that is done using os.rename(filename, newfilename).
If you want to move the files to a different directory, use os.rename(filename, os.path.join("/path/to/new/dir", newfilename)). Note that we need os.path.join here to construct the new path for the file using a directory path and newfilename.
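Put together, the three steps could look roughly like the sketch below. It is only one possible arrangement: the function name rename_by_phrase, the decision to keep the .txt extension, and the choice to skip files without a match (or whose target name is already taken) are assumptions made for the example, and the pattern assumes the "No. 85-2156" format from the question.

```python
import glob
import os
import re

# Assumed phrase format, taken from the question: "No. 85-2156".
PHRASE = re.compile(r"No\. (\d\d-\d\d\d\d)")

def rename_by_phrase(directory):
    """Rename each .txt file in `directory` after the phrase found inside it.

    Returns a list of (old_path, new_path) pairs for the files that were renamed.
    """
    renamed = []
    # Materialise the listing first so renames cannot disturb the iteration.
    for filename in list(glob.iglob(os.path.join(directory, "*.txt"))):
        with open(filename) as f:
            contents = f.read()
        match = PHRASE.search(contents)
        if match is None:
            continue  # no phrase in this file, leave it alone
        newfilename = os.path.join(directory, match.group(1) + ".txt")
        if os.path.exists(newfilename):
            continue  # never overwrite an existing file
        os.rename(filename, newfilename)
        renamed.append((filename, newfilename))
    return renamed
```

Calling rename_by_phrase("/path/to/directory") on a backup copy first is a cheap way to check the result before touching the real files.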
There is no checking or protection for failures (checking whether archpath is a file, whether newpath already exists, whether the search is successful, etc.), but this should work:
import os
import re
pat = r"No\. (\d\d\-\d\d\d\d)"
mydir = 'mydir'
for arch in os.listdir(mydir):
    archpath = os.path.join(mydir, arch)
    with open(archpath) as f:
        txt = f.read()
    s = re.search(pat, txt)
    name = s.group(1)
    newpath = os.path.join(mydir, name)
    os.rename(archpath, newpath)
Edit: I tested the regex to show how it works:
>>> import re
>>> pat = "No\. (\d\d\-\d\d\d\d)"
>>> txt='nothing here or whatever No. 09-1159 you want, does not matter'
>>> s = re.search(pat, txt)
>>> s.group(1)
'09-1159'
>>>
The regex is very simple:
\. -> a dot
\d -> a decimal digit
\- -> a dash
So, it says: search for the string "No. " followed by 2+4 decimal digits separated by a dash.
The parentheses are to create a group that I can recover with s.group(1) and that contains the code number.
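Because re.search returns None when nothing matches, calling .group on the result directly is what produces the AttributeError mentioned in the question. A small sketch of a guarded lookup follows; extract_code is a hypothetical helper, and the pattern is the same one rewritten as a raw string with {n} repeat counts.

```python
import re

# Same pattern as in the answer, written as a raw string with {n} repeat counts.
PAT = re.compile(r"No\. (\d{2}-\d{4})")

def extract_code(text):
    """Return the captured code number, or None when the text has no match."""
    m = PAT.search(text)
    return m.group(1) if m else None

print(extract_code("nothing here or whatever No. 09-1159 you want"))
print(extract_code("this file has no case number"))
```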
And that is what you get, before and after:
Text of files one.txt, two.txt and three.txt is always the same, only the number changes:
this is the first
file with a number
nothing here or whatever No. 09-1159 you want, does not matter
the number is
Create a backup of your files, then try something like this:
import glob
import os
def your_function_to_dig_out_filename(lines):
    import re
    # i'll let you attempt this yourself
for fn in glob.glob('/path/to/your/dir/*.txt'):
    with open(fn) as f:
        spam = f.readlines()
    new_fn = your_function_to_dig_out_filename(spam)
    if not os.path.exists(new_fn):
        os.rename(fn, new_fn)
    else:
        print('{} already exists, passing'.format(new_fn))

How to Find a String in a Text File And Replace Each Time With User Input in a Python Script?

I am new to python so excuse my ignorance.
Currently, I have a text file with some words marked as <<word>>.
My goal is to essentially build a script which runs through a text file with such marked words. Each time the script finds such a word, it would ask the user for what it wants to replace it with.
For example, if I had a text file:
Today was a <<feeling>> day.
The script would run through the text file so the output would be:
Running script...
feeling? great
Script finished.
And generate a text file which would say:
Today was a great day.
Advice?
Edit: Thanks for the great advice! I have made a script that works for the most part like I wanted. Just one thing: I am now working on the case where several variables have the same name (for instance, "I am <<feeling>>. Bob is also <<feeling>>."). The script should prompt feeling? only once and fill in all the variables with the same name.
Thanks so much for your help again.
import re
with open('in.txt') as infile:
    text = infile.read()
search = re.compile('<<([^>]*)>>')
# raw_input is Python 2; on Python 3 use input() instead
text = search.sub(lambda m: raw_input(m.group(1) + '? '), text)
with open('out.txt', 'w') as outfile:
    outfile.write(text)
Basically the same solution as that offered by @phihag, but in script form:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import argparse
import re
from os import path
pattern = '<<([^>]*)>>'
def user_replace(match):
    return raw_input('%s? ' % match.group(1))
def main():
    parser = argparse.ArgumentParser()
    parser.add_argument('infile', type=argparse.FileType('r'))
    parser.add_argument('outfile', type=argparse.FileType('w'))
    args = parser.parse_args()
    matcher = re.compile(pattern)
    for line in args.infile:
        new_line = matcher.sub(user_replace, line)
        args.outfile.write(new_line)
    args.infile.close()
    args.outfile.close()
if __name__ == '__main__':
    main()
Usage: python script.py input.txt output.txt
Note that this script does not account for non-ascii file encoding.
To open a file and loop through it:
Use raw_input to get input from user
Now, put this together and update your question if you run into problems :-)
I understand you want advice on how to structure your script, right? Here's what I would do:
Read the file at once and close it (I personally don't like to have open file objects, especially if my filesystem is remote).
Use a regular expression (phihag has suggested one in his answer, so I won't repeat it) to match the pattern of your placeholders. Find all of your placeholders and store them in a dictionary as keys.
For each word in the dictionary, ask the user with raw_input (not just input), and store the answers as values in the dictionary.
When done, parse your text substituting any instance of a given placeholder (key) with the user word (value). This is also done with regex.
The reason for using a dictionary is that a given placeholder could occur more than once and you probably don't want to make the user repeat the entry over and over again...
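The dictionary approach described above might be sketched as follows. This is written for Python 3, so it uses input() where the answers in this thread use Python 2's raw_input; the ask parameter is a hypothetical hook added so the prompt can be replaced by canned replies, and the function name fill_placeholders is invented for the example.

```python
import re

PLACEHOLDER = re.compile(r"<<([^>]*)>>")

def fill_placeholders(text, ask=input):
    """Prompt once per distinct placeholder name, then substitute every occurrence.

    `ask` is a hypothetical hook (defaulting to input) so the prompt can be faked in tests.
    """
    answers = {}
    for name in PLACEHOLDER.findall(text):
        if name not in answers:
            answers[name] = ask(name + "? ")
    return PLACEHOLDER.sub(lambda m: answers[m.group(1)], text)

# Non-interactive demonstration with canned replies instead of real prompts.
canned = {"feeling? ": "great"}
result = fill_placeholders("I am <<feeling>>. Bob is also <<feeling>>.", ask=canned.__getitem__)
print(result)
```

With the default ask=input, the user would be asked feeling? once and both occurrences would be filled with the same reply.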
Try something like this
myfile = "in.txt"    # placeholder input path
outfile = "out.txt"  # placeholder output path
with open(myfile, "r") as infile:
    lines = infile.readlines()
outlines = []
for line in lines:
    index = line.find("<<")
    if index >= 0:  # find() returns -1 when "<<" is absent
        word = line[index+2:line.find(">>")]
        answer = raw_input(word+"? ")
        outlines.append(line.replace("<<"+word+">>", answer))
    else:
        outlines.append(line)
with open(outfile, "w") as output:
    for line in outlines:
        output.write(line)
Disclaimer: I haven't actually run this, so it might not work, but it looks about right and is similar to something I've done in the past.
How it works:
It parses the file in as a list where each element is one line of the file.
It builds the output list of lines. It iterates through the lines in the input, checking whether the string << exists. If it does, it pulls out the word between the << and >> brackets and uses it as the question for a raw_input query. It takes the input from that query and replaces the placeholder (the word and the arrows) with the input. It then appends the resulting line to the list. If it didn't see the arrows, it simply appends the line unchanged.
After running through all the lines, it writes them to the output file. You can make this whatever file you want.
Some issues:
As written, this works for only one placeholder per line. So if you had <<firstname>> <<lastname>> on the same line, it would ignore the lastname portion. This wouldn't be too hard to fix: you could turn the index test into a while loop and keep the replacement code inside it. Just remember to update the index again if you do that!
It iterates through the list three times. You could likely reduce this to two, but if you have a small text file this shouldn't be a huge problem.
It could be sensitive to encoding - I'm not entirely sure about that however. Worst case there you need to cast as a string.
Edit: Moved the +2 to fix the broken if statement.
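The while-loop variant mentioned under "Some issues" could be sketched like this. It is an untested-in-production illustration written for Python 3 (input() instead of raw_input); fill_line and the ask parameter are hypothetical names, and the replies are canned so the example runs without a user.

```python
def fill_line(line, ask=input):
    """Replace every <<word>> in one line, asking for each occurrence in turn."""
    index = line.find("<<")
    while index >= 0:
        end = line.find(">>", index)
        if end < 0:
            break  # unmatched "<<": leave the rest of the line untouched
        word = line[index + 2:end]
        answer = ask(word + "? ")
        line = line[:index] + answer + line[end + 2:]
        # Resume after the inserted answer so a reply containing "<<" is not rescanned.
        index = line.find("<<", index + len(answer))
    return line

replies = iter(["Ada", "Lovelace"])
print(fill_line("Dear <<firstname>> <<lastname>>,\n", ask=lambda prompt: next(replies)))
```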
