Finding strings in Text Files in Python


I need a program to find a string (S) in a file (P) and return the number of times it appears in the file. To do this, I decided to create a function:
def file_reading(P, S):
    file1= open(P, 'r')
    pattern = S
    match1 = "re.findall(pattern, P)"
    if match1 != None:
        print (pattern)
I know it doesn't look very good, but for some reason it's not outputting anything, let alone the right answer.

There are multiple problems with your code.
First of all, calling open() returns a file object. It does not read the contents of the file. For that you need to use read() or iterate through the file object.
Secondly, if your goal is to count the number of matches of a string, you don't need regular expressions. You can use the string function count(). Even still, it doesn't make sense to put the regular expression call in quotes.
match1 = "re.findall(pattern, file1.read())"
Assigns the string "re.findall(pattern, file1.read())" to the variable match1.
Here is a version that should work for you:
def file_reading(file_name, search_string):
    # this will put the contents of the file into a string
    file1 = open(file_name, 'r')
    file_contents = file1.read()
    file1.close() # close the file
    # return the number of times the string was found
    return file_contents.count(search_string)
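One detail worth knowing about count(): it counts non-overlapping occurrences only. A minimal sketch, using a hypothetical helper name (count_occurrences) and an in-memory string in place of a real file:

```python
def count_occurrences(text, search_string):
    """Return how many non-overlapping times search_string appears in text."""
    return text.count(search_string)

sample = "spam eggs spam\nspam and more spam\n"
print(count_occurrences(sample, "spam"))  # 4
print(count_occurrences("aaaa", "aa"))    # 2, because matches do not overlap
```

If overlapping matches matter for your data, count() is not the right tool and a manual scan would be needed.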

You can read the file line by line instead of reading it all at once, count how many times the pattern appears in each line, and add that to the total count c:
def file_reading(file_name, pattern):
    c = 0
    with open(file_name, 'r') as f:
        for line in f:
            c += line.count(pattern)
    if c:
        print(c)

There are a few errors; let's go through them one by one:
Anything in quotes is a string. Putting "re.findall(pattern, file1.read())" in quotes just makes a string. If you actually want to call the re.findall function, no quotes are needed :)
You check whether match1 is None or not, which is really great, but then you should return the matches, not the initial pattern.
The if-statement should not be indented.
Also:
Always close a file once you have opened it! Since most people forget to do this, it is better to use the with open(filename, action) syntax.
So, taken together, it would look like this (I've changed some variable names for clarity):
import re
def file_reading(input_file, pattern):
    with open(input_file, 'r') as text_file:
        data = text_file.read()
    matches = re.findall(pattern, data)
    if matches:
        print(matches) # prints a list of all strings found
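If the search text can contain characters that are special in regular expressions (such as . or *), re.findall treats them as part of a pattern. A hedged sketch, with count_literal as a hypothetical helper, that uses re.escape to search for the literal text and returns the count instead of printing it:

```python
import re

def count_literal(text, literal):
    """Count non-overlapping occurrences of literal, treating it as plain text."""
    return len(re.findall(re.escape(literal), text))

text = "price: 3.50, was 3x50, now 3.50"
unescaped = len(re.findall("3.50", text))  # '.' matches any character here
escaped = count_literal(text, "3.50")      # only the literal "3.50"
print(unescaped, escaped)  # 3 2
```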

Related

Why is this Python Regex Expression not working?

I'm new to regular expressions and to Python, but I created a script that uses multiple regular expressions. Two of them work when run through Regexpal.com but not when I run the script, although the script works fine with my other expressions. Here are the two that are not working. Can someone explain why they do not work and give me the correct expressions?
I tested these three different ones, none work. I have a line with
Patient: Höler, Adam* 10.07.1920 ID-Nr: 1118111111
And I want to extract Patient: Höler, Adam.
Patient:\s.*\*
Patient:.*?([*])
Patient:.*\*
I have another line with
VCI-exsp = 20mm;
And I'm trying to extract VCI-exsp=20mm (get rid of the ';'). This is the regular expression I made; it also works on regexpal.com (and in Atom), but not when I run the script.
VCI-exsp =[^;]*
Here is the script I have. regexText is a text file full of my regular expressions, and Realthingnotaphony is the text file with the text I'm trying to extract data from. If the problem is that I'm not including r, how would I inject it into the expressions?
regexarr = []
with open("regexText.txt") as fw:
    for line in fw:
        regexarr.append(re.compile(line))
matchs = []
count = 1
with open('Realthingnotaphony.txt') as f:
    for line in f:
        for regexp in regexarr:
            test = re.search(regexp, line)
        if test != None:
            matchs.append(test)
            print(test.group(0))

You are reading in from a text file but you are not stripping the newlines. This means your search criteria are not what you think they are. You can check this by using print(regexarr) after loading the first file.
[re.compile('Patient:\\s.*\\*\n'), re.compile('Patient:.*?([*])\n'), re.compile('Patient:.*\\*\n')]
Change your code to:
import re
with open("regexText.txt") as fw:
    # This removes the newline character
    regexarr = fw.read().splitlines()
# print(regexarr)
matchs = []
count = 1
with open('Realthingnotaphony.txt') as f:
    for line in f:
        for regexp in regexarr:
            test = re.search(regexp, line)
        if test != None:
            matchs.append(test)
            print(test.group(0))
Then your search terms Patient:\s.*\* and VCI-exsp =[^;]* will work.
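The effect of the stray newline can be reproduced without any files. In this small sketch the sample line from the question is inlined, and pattern_from_file is a stand-in for what iterating over regexText.txt would hand to re.compile:

```python
import re

line = "Patient: Höler, Adam* 10.07.1920 ID-Nr: 1118111111\n"

# What a line read from the pattern file looks like: the pattern plus "\n"
pattern_from_file = "Patient:\\s.*\\*\n"
stripped_pattern = pattern_from_file.rstrip("\n")

with_newline = re.search(pattern_from_file, line)
without_newline = re.search(stripped_pattern, line)

print(with_newline)              # None: '*' is not directly followed by a newline
print(without_newline.group(0))  # Patient: Höler, Adam*
```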
Note:
You have a logic error in adding the entries to your match list because you are looping over each search term and resetting the result. This means you can only ever get a result on the last search term!
You can fix this by testing your output or by moving the regex loop. Note you can't just swap it with the for line in f because that is an iterator and you will exhaust the iterator on the first loop.
This would make your code:
import re
with open("regexText.txt") as fw:
    regexarr = fw.read().splitlines()
# print(regexarr)
matchs = []
count = 1
for regexp in regexarr:
    with open('Realthingnotaphony.txt') as f:
        for line in f:
            test = re.search(regexp, line)
            if test != None:
                matchs.append(test)
                print(test.group(0))
You can also fix this by loading the entire file instead of reading it line by line and using the re.findall method rather than re.search. This will return a list of strings that you can then unpack.
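A rough sketch of that re.findall variant, assuming the patterns contain no capturing groups (with groups, findall returns the groups rather than the whole match). The text and the patterns are inlined here in place of the two files:

```python
import re

text = (
    "Patient: Höler, Adam* 10.07.1920 ID-Nr: 1118111111\n"
    "VCI-exsp = 20mm;\n"
)
patterns = [r"Patient:\s.*\*", r"VCI-exsp =[^;]*"]

matches = []
for pattern in patterns:
    # findall scans the whole text and returns every match as a string
    matches.extend(re.findall(pattern, text))

print(matches)
```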

Concatenate lines with previous line based on number of letters in first column

New to coding and trying to figure out how to fix a broken CSV file so that I can work with it properly.
So the file has been exported from a case management system and contains fields for username, casenr, time spent, notes and date.
The problem is that occasional notes have newlines in them, and when exporting the CSV the tool does not add quotation marks to mark the note as a single string within the field.
See the example below:
user;case;hours;note;date;
tnn;123;4;solved problem;2017-11-27;
tnn;124;2;random comment;2017-11-27;
tnn;125;3;I am writing a comment
that contains new lines
without quotation marks;2017-11-28;
HJL;129;8;trying to concatenate lines to re form the broken csv;2017-11-29;
I would like to concatenate lines 3,4 and 5 to show the following:
tnn;125;3;I am writing a comment that contains new lines without quotation marks;2017-11-28;
Since every line starts with a username (always 3 letters) I thought I would be able to iterate the lines to find which lines do not start with a username and concatenate that with the previous line.
It is not really working as expected though.
This is what I have got so far:
import re
with open('Rapp.txt', 'r') as f:
    for line in f:
        previous = line #keep current line in variable to join next line
        if not re.match(r'^[A-Za-z]{3}', line): #regex to match 3 letters
            print(previous.join(line))
The script shows no output and just finishes silently. Any thoughts?

I think I would go a slightly different way:
import re
all_the_data = ""
with open('Rapp.txt', 'r') as f:
    for line in f:
        if not re.search(r"\d{4}-\d{1,2}-\d{1,2};\n", line):
            line = re.sub("\n", "", line)
        all_the_data = "".join([all_the_data, line])
print (all_the_data)
There are several ways to do this, each with pros and cons, but I think this keeps it simple.
Loop over the file as you have done, and if the line doesn't end in a date and a ;, take off the carriage return and stuff it into all_the_data. That way you don't have to play with looking back 'up' the file. Again, there are lots of ways to do this. If you would rather use the logic of "starts with 3 letters and a ;" and looking back, this works:
import re
all_the_data = ""
with open('Rapp.txt', 'r') as f:
    for line in f:
        if not re.search(r"^[A-Za-z]{3};", line):
            all_the_data = re.sub(r"\n$", "", all_the_data)
        all_the_data = "".join([all_the_data, line])
print ("results:")
print (all_the_data)
Pretty much what was asked for. The logic being if the current line doesn't start right, take out the previous line's carriage return from all_the_data.
If you need help playing with the regex itself, this site is great: http://regex101.com
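The "starts with 3 letters and a ;" logic can also be written as a small function that returns a list of repaired records. This is only a sketch: merge_continuation_lines is a hypothetical name, the sample rows are taken from the question, and joining with a single space is an assumption based on the desired output shown there.

```python
import re

START_OF_RECORD = re.compile(r"^[A-Za-z]{3};")

def merge_continuation_lines(lines):
    """Append lines that do not start a new record to the previous record."""
    records = []
    for raw in lines:
        line = raw.rstrip("\n")
        if records and not START_OF_RECORD.match(line):
            records[-1] = records[-1] + " " + line
        else:
            # The first line (the header) is always kept as its own record
            records.append(line)
    return records

sample = [
    "user;case;hours;note;date;\n",
    "tnn;124;2;random comment;2017-11-27;\n",
    "tnn;125;3;I am writing a comment\n",
    "that contains new lines\n",
    "without quotation marks;2017-11-28;\n",
    "HJL;129;8;trying to concatenate lines to re form the broken csv;2017-11-29;\n",
]
for record in merge_continuation_lines(sample):
    print(record)
```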

The regex in your code matches every line in your sample file, including the continuation lines, because they also start with three letters. The if condition is therefore never true, and nothing prints.
with open('./Rapp.txt', 'r') as f:
    join_words = []
    for line in f:
        line = line.strip()
        if len(line) > 3 and ";" in line[0:4] and len(join_words) > 0:
            print(';'.join(join_words))
            join_words = []
            join_words.append(line)
        else:
            join_words.append(line)
    print(";".join(join_words))
I've tried not to use regex here to keep it a little clearer, if possible. But regex is a better option.

A simple way would be to use a generator that acts as a filter on the original file. That filter would concatenate a line to the previous one if its 4th character is not a semicolon (;). Code could be:
def preprocess(fd):
    previous = next(fd)
    for line in fd:
        if line[3] == ';':
            yield previous
            previous = line
        else:
            previous = previous.strip() + " " + line
    yield previous # don't forget last line!
You could then use:
import csv
with open('test.txt') as fd:
    # the sample data is semicolon-separated, so pass the delimiter explicitly
    rd = csv.DictReader(preprocess(fd), delimiter=';')
    for row in rd:
        ...
The trick here is that the csv module only requires an object that returns a line each time the next function is applied to it, so a generator is appropriate.
But this is only a workaround, and the correct fix would be for the exporting step to produce a valid CSV file directly.
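To try the generator idea end to end without touching the disk, here is a sketch that feeds an in-memory sample (io.StringIO standing in for the real file) through a slightly more defensive copy of the filter and into csv.DictReader. The length check and delimiter=';' are additions made for this example.

```python
import csv
import io

def preprocess(lines):
    """Yield complete records, merging lines whose 4th character is not ';'."""
    lines = iter(lines)
    previous = next(lines)
    for line in lines:
        if len(line) > 3 and line[3] == ';':
            yield previous
            previous = line
        else:
            previous = previous.strip() + " " + line
    yield previous  # the last record

sample = io.StringIO(
    "user;case;hours;note;date;\n"
    "tnn;123;4;solved problem;2017-11-27;\n"
    "tnn;125;3;I am writing a comment\n"
    "that contains new lines\n"
    "without quotation marks;2017-11-28;\n"
)

rows = list(csv.DictReader(preprocess(sample), delimiter=';'))
for row in rows:
    print(row['user'], row['case'], row['note'])
```

Because every line in the sample ends with a ;, the reader also sees an empty trailing column; it can simply be ignored.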

In Python, how do I efficiently check if a string has been found in a file yet?

In a Python function I'm writing, I go through a text file line by line to replace each occurrence of a certain string with a (numerical) value. Once I reach the end of the file, I would like to know whether this string appeared in the file at all.
The method str.replace() does not tell you whether anything has been replaced, so I find myself having to go over each line twice: once to look for the string and again to replace it.
So far, I've come up with 2 ways to do this.
For each line:
use line.find(...) to look for the string, if it hasn't been found before
if the string is found, mark it as found
newLine = line.replace(...)
(do sth. with newLine ...)
For each line:
do newLine = line.replace(...) first
if newLine != line mark the string as found
(do sth. with newLine ...)
Here's my question:
Is there a better, i.e., more efficient or more pythonic way to do this?
If not, which of the above ways is faster?

I'd do something roughly like
found = False
newlines = []
for line in f:
    if oldstring in line:
        found = True
        newlines.append(line.replace(oldstring, newstring))
    else:
        newlines.append(line)
Because that's the most easily understandable to me, I think.
There may be faster ways, but the best way depends on how often the string will occur in lines. Almost every line or almost no lines, that makes a big difference.
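Another option, not taken from the answers here, is re.subn, which returns both the new string and the number of substitutions made, so each line is scanned only once. A hedged sketch with a hypothetical helper name and made-up sample lines:

```python
import re

def replace_and_count(line, old, new):
    """Replace every literal `old` in `line`; return (new_line, number_replaced)."""
    # re.escape makes `old` literal; the lambda keeps `new` literal as well
    return re.subn(re.escape(old), lambda match: new, line)

found = False
new_lines = []
for line in ["value = XXX\n", "nothing here\n", "XXX and XXX\n"]:
    new_line, n = replace_and_count(line, "XXX", "42")
    found = found or n > 0
    new_lines.append(new_line)

print(found)
print(new_lines)
```

Whether this is faster than `in` plus replace() depends on the data, so it is worth timing on a realistic file before switching.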

This example will work with multiple replacements:
replacements = {'string': [1, 0], 'string2': [2, 0]}
with open('somefile.txt') as f:
    for line in f:
        new_line = line
        for key, value in replacements.items():
            if key in new_line:
                # replace() needs a string, so convert the numeric value
                new_line = new_line.replace(key, str(value[0]))
                replacements[key][1] += 1
# At the end
for key, value in replacements.items():
    print('Replaced {} with {} {} times'.format(key, *value))

Since we have to go through the string twice anyway, I'd do it as follows:
import re
with open('yourfile.txt', 'r', encoding='utf-8') as f: # check encoding
    s = f.read()
oldstr, newstr = 'XXX', 'YYY'
# re.escape so that oldstr is matched literally, not as a pattern
count = len(list(re.finditer(re.escape(oldstr), s)))
s_new = s.replace(oldstr, newstr)
print(oldstr, 'has been found and replaced by', newstr, count, 'times')
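When the search text is a plain string rather than a pattern, the regular expression can be left out entirely. A small sketch, with replace_all as a hypothetical helper that works on text already read from the file:

```python
def replace_all(text, old, new):
    """Return (new_text, count) for a literal, non-overlapping replacement."""
    count = text.count(old)
    return text.replace(old, new), count

new_text, count = replace_all("XXX one\ntwo XXX XXX\n", "XXX", "YYY")
print(count)     # 3
print(new_text)
```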

Referring to a list of names using Python

I am new to Python, so please bear with me.
I can't get this little script to work properly:
genome = open('refT.txt','r')
datafile - a reference genome with a bunch (2 million) of contigs:
Contig_01
TGCAGGTAAAAAACTGTCACCTGCTGGT
Contig_02
TGCAGGTCTTCCCACTTTATGATCCCTTA
Contig_03
TGCAGTGTGTCACTGGCCAAGCCCAGCGC
Contig_04
TGCAGTGAGCAGACCCCAAAGGGAACCAT
Contig_05
TGCAGTAAGGGTAAGATTTGCTTGACCTA
The file is opened:
cont_list = open('dataT.txt','r')
a list of contigs that I want to extract from the dataset listed above:
Contig_01
Contig_02
Contig_03
Contig_05
My hopeless script:
for line in cont_list:
    if genome.readline() not in line:
        continue
    else:
        a=genome.readline()
        s=line+a
        data_out = open ('output.txt','a')
        data_out.write("%s" % s)
        data_out.close()
input('Press ENTER to exit')
The script successfully writes the first three contigs to the output file, but for some reason it doesn't seem able to skip "contig_04", which is not in the list, and move on to "Contig_05".
I might seem a lazy bastard for posting this, but I've spent all afternoon on this tiny bit of code -_-

I would first try to generate an iterable which gives you a tuple (contig, genome):
def pair(file_obj):
    for line in file_obj:
        # strip the trailing newlines so the names compare cleanly
        yield line.strip(), next(file_obj).strip()
Now, I would use that to get the desired elements:
wanted = {'Contig_01', 'Contig_02', 'Contig_03', 'Contig_05'}
with open('filename') as fin:
    pairs = pair(fin)
    while wanted:
        p = next(pairs)
        if p[0] in wanted:
            # write to output file, store in a list, or dict, ...
            wanted.remove(p[0])

I would recommend several things:
Try using with open(filename, 'r') as f instead of f = open(...)/f.close(). with will handle the closing for you. It also encourages you to handle all of your file IO in one place.
Try to read in all the contigs you want into a list or other structure. It is a pain to have many files open at once. Read all the lines at once and store them.
Here's some example code that might do what you're looking for
from itertools import zip_longest  # named izip_longest in Python 2
# Read in contigs from file and store in list
contigs = []
with open('dataT.txt', 'r') as contigfile:
    for line in contigfile:
        contigs.append(line.rstrip()) #rstrip() removes '\n' from EOL
# Read through genome file, open up an output file
with open('refT.txt', 'r') as genomefile, open('out.txt', 'w') as outfile:
    # Nifty way to sort through fasta files 2 lines at a time
    for name, seq in zip_longest(*[genomefile]*2, fillvalue=''):
        # compare the contig name to your list of contigs
        if name.rstrip() in contigs:
            outfile.write(name) #optional. remove if you only want the seq
            outfile.write(seq)
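The two-lines-at-a-time pairing can be tried without any files. This sketch assumes the genome data strictly alternates name and sequence lines, keeps the wanted names in a set so each lookup is fast, and uses io.StringIO with a shortened sample in place of the real files:

```python
import io

genome = io.StringIO(
    "Contig_01\nTGCAGGTAAAAAACTGTCACCTGCTGGT\n"
    "Contig_02\nTGCAGGTCTTCCCACTTTATGATCCCTTA\n"
    "Contig_04\nTGCAGTGAGCAGACCCCAAAGGGAACCAT\n"
    "Contig_05\nTGCAGTAAGGGTAAGATTTGCTTGACCTA\n"
)
wanted = {"Contig_01", "Contig_05"}

selected = []
# zip(genome, genome) pulls two consecutive lines from the same iterator
for name, seq in zip(genome, genome):
    if name.strip() in wanted:
        selected.append((name.strip(), seq.strip()))

print(selected)
```

Unlike zip_longest, plain zip silently drops a trailing name line that has no sequence after it, which may or may not be what you want.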

Here's a pretty compact approach to get the sequences you'd like.
def get_sequences(data_file, valid_contigs):
    sequences = []
    with open(data_file) as cont_list:
        for line in cont_list:
            if line.startswith(valid_contigs):
                sequence = next(cont_list).strip()
                sequences.append(sequence)
    return sequences
if __name__ == '__main__':
    valid_contigs = ('Contig_01', 'Contig_02', 'Contig_03', 'Contig_05')
    # refT.txt is the genome file from the question
    sequences = get_sequences('refT.txt', valid_contigs)
    print(sequences)
This utilizes the ability of startswith() to accept a tuple as a parameter and check for any matches. If the line matches what you want (a desired contig), it will grab the next line and append it to sequences after stripping out the unwanted whitespace characters.
From there, writing the sequences grabbed to an output file is pretty straightforward.
Example output:
['TGCAGGTAAAAAACTGTCACCTGCTGGT',
'TGCAGGTCTTCCCACTTTATGATCCCTTA',
'TGCAGTGTGTCACTGGCCAAGCCCAGCGC',
'TGCAGTAAGGGTAAGATTTGCTTGACCTA']

Writing items to file on separate lines without blank line at the end

I have a file with a bunch of text that I want to tear through, match a bunch of things and then write these items to separate lines in a new file.
This is the basics of the code I have put together:
f = open('this.txt', 'r')
g = open('that.txt', 'w')
text = f.read()
matches = re.findall('', text) # do some re matching here
for i in matches:
    a = i[0] + '\n'
    g.write(a)
f.close()
g.close()
My issue is I want each matched item on a new line (hence the '\n') but I don't want a blank line at the end of the file.
I guess I need to not have the last item in the file being trailed by a new line character.
What is the Pythonic way of sorting this out? Also, is the way I have set this up in my code the best way of doing this, or the most Pythonic?

If you want to write out a sequence of lines with newlines between them, but no newline at the end, I'd use str.join. That is, replace your for loop with this:
output = "\n".join(i[0] for i in matches)
g.write(output)
In order to avoid having to close your files explicitly, especially if your code might be interrupted by exceptions, you can use the with statement to make things simpler. The following code replaces the entire code in your question:
with open('this.txt') as f, open('that.txt', 'w') as g:
    text = f.read()
    matches = re.findall('', text) # do some re matching here
    g.write("\n".join(i[0] for i in matches))
or, since you don't need both files open at the same time:
with open('this.txt') as f:
    text = f.read()
matches = re.findall('', text) # do some re matching here
with open('that.txt', 'w') as g:
    g.write("\n".join(i[0] for i in matches))
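To see what the join produces, here is a tiny self-contained sketch. The pattern and input text are made up for illustration; they stand in for the placeholder re.findall('', text) call above and use two groups so that each match is a tuple, as the i[0] indexing implies.

```python
import re

text = "alpha=1 beta=2 gamma=3"
# With two groups, findall returns a list of tuples: [('alpha', '1'), ...]
matches = re.findall(r"(\w+)=(\d+)", text)

output = "\n".join(m[0] for m in matches)
print(repr(output))  # 'alpha\nbeta\ngamma'
```

The separator only goes between items, so the last item is not followed by a newline.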
