I want my code to accept input that is a string (for example: "ABC") and then read through a txt file and find that string; once it finds the string, it should output the closest integer (for example: 456) to the string in the file. Is this possible?
So far, I've found code that can print "related lines," so lines that are 2 lines away from my string. This code is:
f = open("textfile.txt", "r")
searchlines = f.readlines()
f.close()
for i, line in enumerate(searchlines):
    if "string" in line:
        for l in searchlines[i:i+2]:
            print(l, end="")
        print()
This code outputs the two lines in front of and behind my string. However, I need to print a specific integer, so I'm not sure how to proceed from there.
For my purposes, I need the "closest integer" to the right of the string.
You could read the file into one string with f.read() and then get the integer via a regular expression that captures the integer into a group:
import re
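with open("textfile.txt", "r") as f:
    content = f.read()
find = 'hello' # the string to find
# re.escape treats the search string literally; \D* skips non-digits up to the number
result = re.search(re.escape(find) + r'\D*(\d+)', content)
if result:
    print(result.group(1))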
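Because \D also matches line breaks, the pattern above will happily pick up a number from a later line. If "closest integer to the right" is meant to stay on the same line as the string, a variant along these lines might be closer to what is wanted. This is only a sketch under that assumption; the helper name int_on_same_line is hypothetical and not part of the original answer.

```python
import re

def int_on_same_line(text, needle):
    """Return the first integer after `needle` on the same line, or None."""
    # [^\d\n]* skips non-digit characters but refuses to cross a line break
    pattern = re.escape(needle) + r'[^\d\n]*(\d+)'
    match = re.search(pattern, text)
    return int(match.group(1)) if match else None

print(int_on_same_line("ABC some words 456 and 789\n", "ABC"))
print(int_on_same_line("ABC no number here\n456\n", "ABC"))
```

The second call returns None because the only number sits on the following line.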
I have the following data: 100,000 entries, each a dn and link pair.
dn:id=2150fccc-beb8-42f8-b201-182a6bf5ddfe,ou=test,dc=com
link:545214569
dn:id=ffa55959-457d-49e6-b4cf-a34eff8bbfb7,ou=test,dc=com
link:32546897
dn:id=3452a4c3-b768-43f5-8f1e-d33c14787b9b,ou=test,dc=com
link:6547896541
I am trying to write a program in Python 2.7 that left-pads the link value with zeros if it has fewer than 10 digits.
Eg:
545214569 --> 0545214569
32546897 --> 0032546897
Can you please tell me what I am doing wrong in the following program:
with open("test.txt", "r") as f:
    line=f.readline()
    line1=f.readline()
    wordcheck = "link"
    wordcheck1= "dn"
    for wordcheck1 in line1:
        with open("pad-link.txt", "a") as ff:
            for wordcheck in line:
                with open("pad-link.txt", "a") as ff:
                    key, val = line.strip().split(":")
                    val1 = val.strip().rjust(10,'0')
                    line = line.replace(val,val1)
                    print (line)
                    print (line1)
                    ff.write(line1 + "\n")
                    ff.write('%s:%s \n' % (key, val1))
The usual Pythonic way to pad values is to use string formatting and the Format Specification Mini-Language:
link = 545214569
print('{:0>10}'.format(link))
Your for wordcheck1 in line1: and for wordcheck in line: loops aren't doing what you think. They iterate over the lines one character at a time and assign that character to the loop variable (wordcheck1 or wordcheck), overwriting the "dn" and "link" strings you stored there.
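To make the point about character-by-character iteration concrete, here is a small sketch. It assumes the loops were meant to test whether a line starts with "link"; the sample line is taken from the question's data, and the variable names chars and padded are illustrative only.

```python
line = "link:545214569"

# Iterating over a string yields single characters, not words or lines
chars = [ch for ch in line][:4]
print(chars)

# Testing the prefix is presumably what the loops were meant to do
if line.startswith("link:"):
    key, val = line.split(":", 1)
    padded = "{}:{:0>10}".format(key, val)
    print(padded)
```

The first print shows the individual letters l, i, n, k; the second shows the link value padded to ten digits.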
If you only want to change the input file to have leading zeroes, this can be simplified as:
import re
# Read the whole file into memory
with open('input.txt') as f:
    data = f.read()
# Replace all instances of "link:<digits>", passing the digits to a function that
# formats the replacement as a width-10 field, right-justified with zeros as padding.
data = re.sub(r'link:(\d+)', lambda m: 'link:{:0>10}'.format(m.group(1)), data)
with open('output.txt','w') as f:
    f.write(data)
output.txt:
dn:id=2150fccc-beb8-42f8-b201-182a6bf5ddfe,ou=test,dc=com
link:0545214569
dn:id=ffa55959-457d-49e6-b4cf-a34eff8bbfb7,ou=test,dc=com
link:0032546897
dn:id=3452a4c3-b768-43f5-8f1e-d33c14787b9b,ou=test,dc=com
link:6547896541
I don't know why you open the output file so many times. Open it once; then, for each line, split on ":". The last element of the resulting list is the number. Since you know how long the digit string should consistently be (say 10), use zfill to pad it with zeros, then put the line back together with join:
for line in f.readlines():
    words = line.rstrip('\n').split(':')
    if words[0] == 'link':
        # zfill takes the total width, not the number of zeros to add
        words[-1] = words[-1].zfill(10)
    newline = ':'.join(words)
    # write this line to file
I need a program to find a string (S) in a file (P) and return the number of times it appears in the file. To do this, I decided to create a function:
def file_reading(P, S):
    file1= open(P, 'r')
    pattern = S
    match1 = "re.findall(pattern, P)"
    if match1 != None:
        print (pattern)
I know it doesn't look very good, but for some reason it's not outputting anything, let alone the right answer.
There are multiple problems with your code.
First of all, calling open() returns a file object. It does not read the contents of the file. For that you need to use read() or iterate through the file object.
Secondly, if your goal is to count the number of matches of a string, you don't need regular expressions. You can use the string function count(). Even still, it doesn't make sense to put the regular expression call in quotes.
match1 = "re.findall(pattern, file1.read())"
assigns the string "re.findall(pattern, file1.read())" to the variable match1; re.findall is never actually called.
Here is a version that should work for you:
def file_reading(file_name, search_string):
    # this will put the contents of the file into a string
    file1 = open(file_name, 'r')
    file_contents = file1.read()
    file1.close() # close the file
    # return the number of times the string was found
    return file_contents.count(search_string)
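For comparison, the two counting approaches mentioned above can be sketched side by side on an in-memory string. This assumes a literal (non-regex) search string and non-overlapping matches; the function names are made up for illustration.

```python
import re

def count_occurrences(text, search_string):
    """Count non-overlapping literal occurrences of search_string in text."""
    return text.count(search_string)

def count_with_regex(text, search_string):
    """Same count via re.findall; re.escape makes the pattern literal."""
    return len(re.findall(re.escape(search_string), text))

text = "spam and eggs, spam and ham\nmore spam"
print(count_occurrences(text, "spam"))
print(count_with_regex(text, "spam"))
```

Both calls should agree for plain strings; the regex version only becomes worthwhile once the search term is a real pattern.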
You can read the file line by line instead of reading it all at once, count how many times the pattern occurs in each line, and add that to the total count c:
def file_reading(file_name, pattern):
    c = 0
    with open(file_name, 'r') as f:
        for line in f:
            c += line.count(pattern)
    if c: print(c)
    return c
There are a few errors; let's go through them one by one:
Anything in quotes is a string. Putting "re.findall(pattern, file1.read())" in quotes just makes a string. If you actually want to call the re.findall function, no quotes are needed :)
You check whether match1 is None or not, which is really great, but then you should return the matches, not the initial pattern.
The if-statement should not be indented.
Also:
Always close a file once you have opened it! Since most people forget to do this, it is better to use the with open(filename, action) syntax.
So, taken together, it would look like this (I've changed some variable names for clarity):
import re
def file_reading(input_file, pattern):
    with open(input_file, 'r') as text_file:
        data = text_file.read()
    matches = re.findall(pattern, data)
    if matches:
        print(matches) # prints a list of all strings found
Is there a way to detect the new line character after I've read from a file and stored the results into a string? Here is the code:
with open("text.txt") as file:
    content_string = file.read()
    file.close()
re.search("\n", content_string)
The content_string looks like this:
Hello world!
Hello WORLD!!!!!
I want to extract the new line character after the first line "Hello world!". Does this character even exist at that point?
As per Jongware's comment, the regex search you perform does find the newline. You just need to use the result.
From the re module documentation:
re.search(pattern, string, flags=0)
Scan through string looking for the first location where the regular
expression pattern produces a match, and return a corresponding
MatchObject instance. Return None if no position in the string matches
the pattern; note that this is different from finding a zero-length
match at some point in the string.
In terms of code, checking that translates into:
import re
with open("text.txt") as file:
    content_string = file.read()
m = re.search("\n", content_string)
if m:
    print("Found a newline")
else:
    print("No newline found")
Now, your file might very well contain "\r" rather than "\n": they will likely look the same when printed, but the regex would not match. In that case, also give this test a try, replacing this line in the code:
m = re.search("\n", content_string)
with:
m = re.search("[\r\n]", content_string)
which will look for either.
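The match object also says where the line break is and which character it was. A minimal sketch, using the two-line string from the question; the helper name first_line_break is hypothetical.

```python
import re

def first_line_break(text):
    """Return (index, character) of the first CR or LF in text, or None."""
    m = re.search(r"[\r\n]", text)
    return (m.start(), m.group()) if m else None

content_string = "Hello world!\nHello WORLD!!!!!"
print(first_line_break(content_string))
```

Here the break is reported at index 12, right after "Hello world!". Note that Python 3 opens text files with universal newlines by default, so "\r\n" and "\r" are normally translated to "\n" when reading.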
Is there a way to detect the new line character after I've read from a
file and stored the results into a string?
If I understand you correctly, you want to concatenate multiple lines into one string.
Input:
Hello world!
Hello WORLD!!!!!
test.py:
result = []
with open("text.txt", "r") as inputs:
    for line in inputs:
        result.append(line.strip()) # strip() removes the newline character
print(" ".join(result))
output:
Hello world! Hello WORLD!!!!!
How about if I have more lines, and I want to detect the first
newline? For some reason, in my text it won't detect it.
with open("text.txt") as f:
    first_line = f.readline()
print(first_line)
I have a BLAST output in default format. I want to parse and extract only the info I need using regex. However, in the line below
Query= contig1
There is a space there between '=' and 'contig1'. So in my output it prints a space in front. How to avoid this? Below is a piece of my code,
import re
output = open('out.txt','w')
with open('in','r') as f:
    for line in f:
        if re.search('Query=\s', line) != None:
            line = line.strip()
            line = line.rstrip()
            line = line.strip('Query=\s')
            line = line.rstrip('\s/')
            query = line
            print >> output,query
output.close()
Output should look like this,
contig1
You could actually use the returned match to extract the value you want:
for line in f:
    match = re.search(r'Query=\s?(.*)', line)
    if match is not None:
        query = match.groups()[0]
        print >> output,query
What we do here is: we search for a Query= followed (or not) by a space character and extract any other characters (with match.groups()[0], because we have only one group in the regular expression).
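The same extraction can be tried on a single line without any files. This is a sketch only: the helper name extract_query is made up, and it uses \s* plus strip() on the assumption that any amount of whitespace around the value should be dropped.

```python
import re

def extract_query(line):
    """Return the text after 'Query=' with surrounding whitespace removed, or None."""
    match = re.search(r'Query=\s*(.*)', line)
    return match.group(1).strip() if match else None

print(extract_query("Query= contig1\n"))
```

Lines that do not contain Query= yield None, so the caller can skip them.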
Also depending on the data nature you might want to do only simple string prefix matching like in the following example:
output = open('out.txt','w')
with open('in.txt','r') as f:
    for line in f:
        if line.startswith('Query='):
            query = line.replace('Query=', '').strip()
            print >> output,query
output.close()
In this case you don't need the re module at all.
If you are just looking for lines like tag=value, do you need regex?
# partition() never raises, even if the line has no '=' or more than one
tag, _, value = line.partition('=')
if tag == 'Query':
    print(value.strip())
a='Query= conguie'
print("".join(a.split('Query=')).strip())
#output conguie
Comma in print statement adds space between parameters. Change
print output,query
to
print "%s%s"%(output,query)
def regexread():
    import re
    result = ''
    savefileagain = open('sliceeverfile3.txt','w')
    #text=open('emeverslicefile4.txt','r')
    text='09,11,14,34,44,10,11, 27886637, 0\n561, Tue, 5,Feb,2013, 06,25,31,40,45,06,07, 19070109, 0\n560, Fri, 1,Feb,2013, 05,21,34,37,38,01,06, 13063500, 0\n559, Tue,29,Jan,2013,'
    pattern='\d\d,\d\d,\d\d,\d\d,\d\d,\d\d,\d\d'
    #with open('emeverslicefile4.txt') as text:
    f = re.findall(pattern,text)
    for item in f:
        print(item)
    savefileagain.write(item)
    #savefileagain.close()
The above function, as written, parses the text and finds groups of seven numbers. I have three problems.
Firstly, when I use the 'read' file, which contains exactly the same text as text='09,...etc', I get a TypeError: expected string or buffer, which I cannot solve even after reading some of the posts.
Secondly, when I try to write the results to the 'write' file, nothing is written.
Thirdly, I am not sure how to get the same output in the file that I get with the print statement: three lines of seven numbers each, which is the output that I want.
This should do the trick:
import re
# File names taken from the question: read from one file, write to the other
in_filename = 'emeverslicefile4.txt'
out_filename = 'sliceeverfile3.txt'
pattern = r'\d\d,\d\d,\d\d,\d\d,\d\d,\d\d,\d\d'
new_file = []
# Make sure file gets closed after being iterated
with open(in_filename, 'r') as f:
    # Read the file contents and generate a list with each line
    lines = f.readlines()
# Iterate each line
for line in lines:
    # Regex applied to each line
    match = re.search(pattern, line)
    if match:
        # Make sure to add \n to display correctly when we write it back
        new_line = match.group() + '\n'
        print(new_line)
        new_file.append(new_line)
with open(out_filename, 'w') as f:
    # actually write the lines
    f.writelines(new_file)
You're sort of on the right track...
You'll iterate over the file:
How to iterate over the file in python
and apply the regex to each line. The link above should really answer all 3 of your questions when you realize you're trying to write 'item', which doesn't exist outside of that loop.
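Applying the regex line by line can be sketched on the sample text from the question, without touching any files. The function name extract_groups is hypothetical, and the pattern is just a more compact spelling of the question's seven two-digit groups.

```python
import re

# Seven two-digit numbers separated by commas, as in the question's pattern
PATTERN = re.compile(r'\d\d(?:,\d\d){6}')

def extract_groups(lines):
    """Return the first seven-number group found on each line, if any."""
    found = []
    for line in lines:
        match = PATTERN.search(line)
        if match:
            found.append(match.group())
    return found

text = ('09,11,14,34,44,10,11, 27886637, 0\n'
        '561, Tue, 5,Feb,2013, 06,25,31,40,45,06,07, 19070109, 0\n'
        '560, Fri, 1,Feb,2013, 05,21,34,37,38,01,06, 13063500, 0\n'
        '559, Tue,29,Jan,2013,')
groups = extract_groups(text.splitlines())
print('\n'.join(groups))
```

Writing '\n'.join(groups) + '\n' to the output file, inside a with block so the file is closed, would give the same three lines that print shows.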