Python regular expression search not finding underscore in path

I have a long file which follows some structure, and I want to parse this file to extract an object called sample.
The file, named paths_text.txt, looks like this:
/groups/cgsd/javed/validation_set/LeungSY_Targeted_SS-190528-01a/Clean/pfg001G_1_Clean.fastq.gz
/groups/cgsd/javed/validation_set/LeungSY_Targeted_SS-190528-01a/Clean/pfg001G_2_Clean.fastq.gz
/groups/cgsd/javed/validation_set/LeungSY_Targeted_SS-190528-01a/Clean/pfg001T_1_Clean.fastq.gz
/groups/cgsd/javed/validation_set/LeungSY_Targeted_SS-190528-01a/Clean/pfg001T_2_Clean.fastq.gz
My code runs fine like this:

import os
os.chdir('/groups/cgsd/alexandre/python_code')
import re

with open('./src/paths_text.txt') as f:
    for line in f:
        sample = re.search(r'pfg\d+', line)
        print(sample)
But when I search for an underscore, I get None as the result of my match. Why?

import os
os.chdir('/groups/cgsd/alexandre/python_code')
import re

with open('./src/paths_text.txt') as f:
    for line in f:
        sample = re.search(r'pfg\d+_', line)
        print(sample)

Because there's a G or T between pfg001 and the underscore, and \d+ matches digits only.
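The fix follows directly: the pattern must allow for that letter before the underscore. A minimal sketch, assuming the sample suffix is always a single uppercase letter (as in the G/T examples above):

```python
import re

line = ("/groups/cgsd/javed/validation_set/"
        "LeungSY_Targeted_SS-190528-01a/Clean/pfg001G_1_Clean.fastq.gz")

# \d+ matches digits only, so it stops before the G and the match fails
assert re.search(r'pfg\d+_', line) is None

# Allowing one uppercase letter (the G/T sample suffix) before the
# underscore makes the search succeed
sample = re.search(r'pfg\d+[A-Z]_', line)
print(sample.group())  # pfg001G_
```

If the letter should be captured without the trailing underscore, `r'pfg\d+[A-Z]'` works the same way.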

Related

Define type for fileinput.input file object

I am trying to find & replace in a file using the code snippet below:

from pathlib import Path
import fileinput
import re

TEXT_FILE_PATH = Path.cwd() / "temp.txt"
FIND_PUBLIC_PATH_REGEX = r"(publicPath: ').*(',)"
REPLACE_PUBLIC_PATH_REGEX = r"\1/\2"

with fileinput.input(TEXT_FILE_PATH, inplace=True) as file:
    for line in file:
        print(re.sub(FIND_PUBLIC_PATH_REGEX, REPLACE_PUBLIC_PATH_REGEX, line), end="")
The problem is that I am getting a type error on the last line, which says:
No overloads for "sub" match the provided arguments
Argument types: (Literal['(publicPath: \').*(\',)'], Literal['\\1/\\2'], AnyStr#input)Pylance(reportGeneralTypeIssues)
I can't understand why I am getting this error.
Regards.
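The complaint comes from the static type stubs rather than the runtime: in some typeshed versions fileinput.input is typed over AnyStr, so Pylance cannot prove each line is a str when re.sub expects one. One workaround that may help, sketched under the assumption that passing the path as str and spelling out mode="r" lets the checker pick the str overload (not a guaranteed fix for every stub version):

```python
import fileinput
import re
from pathlib import Path

TEXT_FILE_PATH = Path("temp.txt")
TEXT_FILE_PATH.write_text("publicPath: '/old/path',\n")  # throwaway sample file

FIND_PUBLIC_PATH_REGEX = r"(publicPath: ').*(',)"
REPLACE_PUBLIC_PATH_REGEX = r"\1/\2"

# Passing the path as str and making mode="r" explicit narrows the
# iterator to str lines, which matches an overload of re.sub
with fileinput.input(str(TEXT_FILE_PATH), inplace=True, mode="r") as file:
    for line in file:
        print(re.sub(FIND_PUBLIC_PATH_REGEX, REPLACE_PUBLIC_PATH_REGEX, line), end="")

result = TEXT_FILE_PATH.read_text()  # "publicPath: '/',\n"
```

At runtime the original code already works; the change only affects what the type checker can infer.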

How to open a LaTeX file in Python that starts with a comment

Code:

import os
import re
import time
import csv
from TexSoup import TexSoup

path = os.getcwd()
texFile = path + '\\Paper16.tex'
print(texFile)
soup = TexSoup(open(texFile, 'r'))
This returns no output when I try to print(soup), and I believe it is because the first line starts with %.
I think this is some sort of bug in TexSoup.
Namely, if you remove the first line, or comment out the second line instead, TexSoup is able to parse the file and print(soup) gives some output.
In addition, if you terminate the first line by adding braces:

%{\documentstyle[aps,epsf,rotate,epsfig,preprint]{revtex}}

TexSoup is also able to parse the file.
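A workaround along those lines is to drop full-line % comments before handing the source to TexSoup (i.e. call TexSoup(cleaned)). A sketch on an inline string, where the \documentstyle line is a stand-in for the real first line of Paper16.tex:

```python
# Stand-in for the contents of Paper16.tex
tex_source = """%\\documentstyle[aps,epsf,rotate,epsfig,preprint]{revtex}
\\begin{document}
Hello
\\end{document}
"""

# Drop lines that are pure comments so the leading '%' line cannot
# derail the parser; lines with inline comments are left untouched
cleaned = "".join(
    line for line in tex_source.splitlines(keepends=True)
    if not line.lstrip().startswith("%")
)
print(cleaned.startswith("\\begin{document}"))  # True
```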

unable to print the print into textfile from another textfile in python

I am attempting to use a script to pull all .xml file names from a particular directory in a repo, then filter out the files which contain a certain keyword. The part of the code which pulls the file names and writes them into the first text file works properly. The second part, which reads the first text file and filters it into a second text file, does not work as anticipated. I am not receiving an error, but the list of lines is empty, which is odd because the text file is not. I am wondering if anyone sees anything obvious that I am missing; I have been at this for a long time, so it's easy to miss simple things. I have looked at other examples, and the way they did it is logically similar to mine, so I am not sure where I went wrong. Thank you in advance.
This is the code:

#!/usr/bin/python
import glob
import re
import os
import fnmatch
from pprint import pprint
import xml.etree.ElementTree as ET
import cPickle as pickle

# open a text output file
text_file = open("TestCases.txt", "w")
matches = []
# fill matches with all the xml files in the selected directories (bbap, bbsc, bbtest, and bbrtm)
for bbdir in ['bbap/nr', 'bbsc/nr', 'bbtest/nr', 'bbrtm/nr']:
    for root, dirnames, filenames in os.walk('/repo/bob/ebb/' + bbdir):
        for filename in fnmatch.filter(filenames, '*.xml'):
            matches.append(os.path.join(root, filename))
# test each listing in matches against the desired filter to get the wanted tests
for each_xml in matches:
    if each_xml.find('dlL1NrMdbfUPe') != -1:
        tree = ET.parse(each_xml)
        root = tree.getroot()
        for child in root:
            for test_suite in child:
                for test_case in test_suite:
                    text_file.write(pickle.dumps(test_case.attrib))
# modify the text so it is easy to read
with open("TestCases.txt") as f:
    with open("Formatted_File", "w") as f1:
        for line in f:
            if "DLL1NRMDBFUPE" in line:
                f1.write(line)
Just as I had anticipated, the error was one made from looking at code for too long: I simply forgot to close text_file before opening f.
The fix:
import glob
import re
import os
import fnmatch
from pprint import pprint
import xml.etree.ElementTree as ET
import cPickle as pickle

# open a text output file
text_file = open("TestCases.txt", "w")
matches = []
# fill matches with all the xml files in the selected directories (bbap, bbsc, bbtest, and bbrtm)
for bbdir in ['bbap/nr', 'bbsc/nr', 'bbtest/nr', 'bbrtm/nr']:
    for root, dirnames, filenames in os.walk('/repo/bob/ebb/' + bbdir):
        for filename in fnmatch.filter(filenames, '*.xml'):
            matches.append(os.path.join(root, filename))
# test each listing in matches against the desired filter to get the wanted tests
for each_xml in matches:
    if each_xml.find('dlL1NrMdbfUPe') != -1:
        tree = ET.parse(each_xml)
        root = tree.getroot()
        for child in root:
            for test_suite in child:
                for test_case in test_suite:
                    text_file.write(pickle.dumps(test_case.attrib))
text_file.close()  # the missing close: flush the output before re-reading it
# modify the text so it is easy to read
with open("TestCases.txt") as f:
    with open("Formatted_File", "w") as f1:
        for line in f:
            if "DLL1NRMDBFUPE" in line:
                f1.write(line)
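The same bug cannot happen if the writing phase lives in its own with block: the file is flushed and closed when the block exits, before the second pass reads it back. A minimal sketch with made-up records standing in for the pickled XML data:

```python
# Made-up records standing in for the pickled test-case attributes
records = ["DLL1NRMDBFUPE case one\n", "other case\n"]

with open("TestCases.txt", "w") as text_file:
    text_file.writelines(records)
# text_file is guaranteed closed here, so the data is on disk

with open("TestCases.txt") as f, open("Formatted_File", "w") as f1:
    for line in f:
        if "DLL1NRMDBFUPE" in line:
            f1.write(line)

result = open("Formatted_File").read()
print(result)  # DLL1NRMDBFUPE case one
```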

How to get data from txt file in python log analysis?

I am a beginner in Python. I am trying to do log analysis, but I do not know how to read from the txt file.
This is the code for outputting dates, but the dates must be taken from the txt file:

import sys
import re

file = open('desktop/trail.txt')
for line_string in iter(sys.stdin.readline, ''):
    line = line_string.rstrip()
    date = re.search(r'date=[0-9]+\-[0-9]+\-[0-9]+', line)
    date = date.group()
    print date
You can use the with statement to open a file safely and read each line with the readlines method; readlines returns a list of strings.
The code below should work in your case:

import re

with open('desktop/trail.txt', 'r') as f:
    for line in f.readlines():
        line = line.rstrip()
        date = re.search(r'date=[0-9]+\-[0-9]+\-[0-9]+', line)
        if date:
            print date.group()
You can do something like:

for line in file.readlines():
    ...

And don't forget about closing the file! You can do it with file.close().
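One detail worth guarding in either version: re.search returns None when a line contains no date, so calling .group() unconditionally raises AttributeError. A small sketch with made-up sample lines standing in for trail.txt:

```python
import re

# Made-up lines standing in for the contents of trail.txt
lines = [
    "date=2020-01-15 action=login",
    "no date on this line",
]

dates = []
for line in lines:
    match = re.search(r'date=[0-9]+-[0-9]+-[0-9]+', line)
    if match:  # search returns None on non-matching lines
        dates.append(match.group())
print(dates)  # ['date=2020-01-15']
```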

How to get filename from stdin

I am writing a script and I am running it from the console like this:

cat source_text/* | ./mapper.py

I would like to get the filename of each file as it is being read. The source_text folder contains a bunch of text files whose filenames I need to extract as well in my mapper script.
Is that possible?
import sys
import re
import os

# re is for regular expressions
pattern = re.compile("[a-zA-Z][a-zA-Z0-9]*",
                     re.MULTILINE | re.DOTALL | re.IGNORECASE)
# Read pairs as lines of input from STDIN
for line in sys.stdin:
    ....
You cannot do that directly, but the fileinput module can help you.
You just have to call your script this way:

./mapper.py source_text/*

And change it this way:

import fileinput
...

# Read pairs as lines of input from STDIN
for line in fileinput.input():
    ...

Then the name of the file being processed is available as fileinput.filename(), and you can also access the number of the current line in that file as fileinput.filelineno(), and still other goodies...
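Putting the pieces together, a runnable sketch of the fileinput approach (the two throwaway files stand in for source_text/*):

```python
import fileinput
from pathlib import Path

# Throwaway files standing in for source_text/*
Path("a.txt").write_text("alpha\n")
Path("b.txt").write_text("beta\n")

seen = []
with fileinput.input(files=["a.txt", "b.txt"]) as fi:
    for line in fi:
        # filename() reports which file the current line came from
        seen.append((fileinput.filename(), line.strip()))
print(seen)  # [('a.txt', 'alpha'), ('b.txt', 'beta')]
```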
That is not possible. You can modify your program to read directly from the files like this:

import sys
import re

# re is for regular expressions
pattern = re.compile("[a-zA-Z][a-zA-Z0-9]*",
                     re.MULTILINE | re.DOTALL | re.IGNORECASE)
for filename in sys.argv[1:]:
    with open(filename, "rU") as f:
        for line in f.readlines():
            if pattern.search(line) is not None:
                print filename, line,
Then you can call it with:
$ ./grep_files.py source_text/*
If you use this instead of cat:

grep -r '' source_text/ | ./mapper.py

the input for mapper.py will look like:

source_text/answers.txt:42
source_text/answers.txt:42
source_text/file1.txt:Hello world

You can then retrieve the filename with:

for line in sys.stdin:
    filename, line = line.split(':', 1)
    ...
However, Python is more than capable of iterating over the files in a directory and reading them line by line, for example:

import os

for filename in os.listdir(path):
    for line in open(os.path.join(path, filename)):
        ...
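A detail that often bites with os.listdir: it returns bare file names, so they must be joined back onto the directory before opening. A self-contained sketch (the source_text directory and demo file are created here just for illustration):

```python
import os

# Create a throwaway directory and file just for this demonstration
os.makedirs("source_text", exist_ok=True)
with open(os.path.join("source_text", "demo.txt"), "w") as out:
    out.write("hello\n")

collected = []
for filename in os.listdir("source_text"):
    full_path = os.path.join("source_text", filename)  # join, not the bare name
    with open(full_path) as fh:
        for line in fh:
            collected.append((filename, line.strip()))
print(collected)
```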
