I have a text file, and for each `=` in it I need to print the word that comes before it. The text file contains:
Sparrow=beak
Hen=nest Honey=comb
I need the output to be:
Sparrow
Hen
Honey
Code:
import re
with open('qwert.txt', 'r') as f:
    for line in f:
        res = re.findall(r'(?:=(\w-))', line)
        if res:
            print res
I am not getting any output. Please help!
Using a positive lookahead assertion (?=...):
import re
with open('qwert.txt', 'r') as f:
    for line in f:
        # Match word characters (`\w+`) that are followed by `=` and a word character.
        for res in re.findall(r'\w+(?==\w)', line):
            print(res)
If the string format is fairly strict, you can use split, which gets you not only the keys but also the values.
line = "Hen=nest Honey=comb"
keys = [key for key, value in [token.split('=') for token in line.split(' ')]]
Then just print keys out.
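A minimal sketch of the split approach applied to several lines at once. The helper name `keys_from_lines` is hypothetical, and the sketch assumes that every whitespace-separated token has the form key=value; tokens without `=` are skipped.

```python
def keys_from_lines(lines):
    """Return the text before '=' for every key=value token in lines."""
    keys = []
    for line in lines:
        for token in line.split():
            if '=' in token:
                # partition splits on the first '=' only
                key, _, _value = token.partition('=')
                keys.append(key)
    return keys


# Sample data taken from the question
sample = ["Sparrow=beak", "Hen=nest Honey=comb"]
for key in keys_from_lines(sample):
    print(key)
```

Passing an open file object instead of the list should work the same way, since both yield one line at a time.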
I want to read particular lines from a text file, e.g. everything between the two lines that read "This contents information".
I have created a script to perform the task, but it's not a good method. Is there a better way to do it?
readText=open("test.txt","r")
wanted_lines = [4,5,6,7]
count = 1
with open('test.txt', 'r') as infile:
    for line in infile:
        line = line.strip()
        if count in wanted_lines:
            print(line)
        else:
            pass
        count += 1
You can compare each line to the sentinel, start outputting once it matches, and stop outputting once it matches again:
with open('test.txt') as infile:
    for output in False, True:
        for line in map(str.rstrip, infile):
            if line == 'This contents information':
                break
            if output:
                print(line)
Demo: https://replit.com/#blhsing/TroubledMysteriousMonitors
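The sentinel idea can also be sketched as a small generator that works on any iterable of lines. The name `lines_between` is hypothetical, and the sample data is an assumption (the marker and the City/Country/Postcode lines are taken from this thread; the header and footer lines are made up).

```python
def lines_between(lines, sentinel):
    """Yield the lines found between the first two occurrences of sentinel."""
    inside = False
    for line in lines:
        line = line.rstrip()
        if line == sentinel:
            if inside:
                return  # second sentinel: stop
            inside = True  # first sentinel: start collecting
        elif inside:
            yield line


# Assumed sample content, standing in for test.txt
sample = [
    "header\n",
    "This contents information\n",
    "City:LK\n",
    "Country:LL\n",
    "Postcode:123\n",
    "This contents information\n",
    "footer\n",
]
print(list(lines_between(sample, "This contents information")))
```

Because it only looks at one line at a time, the same function can be handed an open file object without reading the whole file into memory.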
You could consider reading the entire text file into a string, and then using a regular expression to extract the contents you want:
import re
with open('test.txt', 'r') as file:
    data = file.read()
# Capture everything between the two marker lines
contents = re.search(r'^This contents information\n(.*?)\nThis contents information\b', data, flags=re.M|re.S).group(1)
print(contents)
This prints:
City:LK
Country:LL
Postcode:123
You can use split, with "This contents information" as the delimiter.
In the example above, the file will be split into 3 sections, of which we only need to grab the second one (index=1). You can then use .strip() to remove unwanted space.
Code:
with open('test.txt', 'r') as infile:
    text = infile.read()
required_info = text.split("This contents information")[1].strip()
print(required_info)
Output:
City:LK
Country: LL
Postcode:123
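A hedged variant of the split approach that does not raise an IndexError when the marker is missing. The function name `section_between` and the in-memory sample text are assumptions made for illustration.

```python
def section_between(text, marker):
    """Return the stripped text between the first two markers, or None."""
    parts = text.split(marker)
    if len(parts) < 3:
        # Fewer than two markers: there is no enclosed section
        return None
    return parts[1].strip()


# Assumed sample content, standing in for the contents of test.txt
sample = (
    "intro\n"
    "This contents information\n"
    "City:LK\nCountry:LL\nPostcode:123\n"
    "This contents information\n"
    "outro\n"
)
print(section_between(sample, "This contents information"))
```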
Instead of hard-coding the line numbers, use a conditional statement that checks for the data you want.
printline = False  # becomes True once the start marker has been seen
with open('test.txt', 'r') as infile:
    for line in infile:
        line = line.strip()
        if line == "text to look for":
            printline = True
        elif line == "text to end content":
            printline = False
        elif printline:
            print(line)
I think the best method would be to use a regex.
import re
with open('test.txt', 'r') as infile:
    text = infile.read()
# Replace "This contents information" below with the marker text you want to search between.
# The pattern finds everything between two occurrences of the marker,
# e.g. test(.*?)test applied to 'test 123asda test' captures ' 123asda '.
regex = re.compile(r'This contents information(.*?)\nThis contents information', re.DOTALL)
matches = [m.group(1) for m in regex.finditer(text)]
for m in matches:
    print(m.strip())
import re
with open("file.txt","r") as f:
    data = f.readlines()
string = "".join(data)  # join the lines into one string
ls = re.split(r"(\n*?)This contents information\n", string)  # split the string at each marker line
for part in ls:  # print each piece produced by the split
    print(part)
I am trying to filter and extract one word from a line.
The pattern is: GR.C.24 GRCACH GRALLDKD GR_3AD etc.
Input will be: the data is GRCACH got from server.
Output: GRCACH
Problem: the pattern starts with GR<can be anything> and ends when whitespace is encountered.
I am able to find the pattern, but I cannot make the match stop when a space is encountered.
The code is:
import re
fp_data = []
with open("output", "r") as fp:
    fp_data = fp.readlines()
for da in fp_data:
    match = re.search("\sGR.*", da)
    print da
    if match:
        print dir(match)
        print match.group()
Output: GRCACH got from server
Expected: GRCACH (or any word that starts with GR)
Use:
(?:\s|^)(GR\S*)
(?:\s|^) matches whitespace or start of string
(GR\S*) matches GR followed by 0 or more non-whitespace characters and places match in Group 1
No need to read the entire file into memory (what if the file were very large?). You can iterate the file line by line.
import re
with open("output", "r") as fp:
    for line in fp:
        matches = re.findall(r"(?:\s|^)(GR\S*)", line)
        print(line, matches)
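To see what the pattern captures without needing a file, here is a small sketch on in-memory strings. The helper name `find_gr_words` is hypothetical; the first sample line comes from the question and the second is made up from the example tokens.

```python
import re

# Whitespace or start of string, then GR followed by non-whitespace characters
GR_WORD = re.compile(r"(?:\s|^)(GR\S*)")


def find_gr_words(line):
    """Return every whitespace-delimited word in line that starts with GR."""
    return GR_WORD.findall(line)


print(find_gr_words("the data is GRCACH got from server."))
print(find_gr_words("GR.C.24 and GR_3AD arrived"))
```

Because the group uses `\S*`, punctuation such as `.` or `_` inside the word is kept, and the match ends at the first whitespace character.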
The readlines() method leaves the trailing newline character "\n" on each line, so I used a list comprehension that strips it with rstrip() and skips blank lines using isspace().
import re
fp_data = []
with open("output", "r") as fp:
    fp_data = [line.rstrip() for line in fp if not line.isspace()]
for line in fp_data:
    match = re.search(r"\sGR.*", line)
    print(line)
    if match:
        print(match)
        print(match.group())
Not sure if I understood your answer and your edit after my question about the desired output correctly, but assuming that you want to list all occurrences of words that start with GR, here is a suggestion:
import re
fp_data = []
with open("output", "r") as fp:
    fp_data = fp.readlines()
for da in fp_data:
    print(da)
    match = re.findall(r'\b(GR\S*)\b', da)
    if match:
        print(match)
Using word boundaries (\b) has the benefit of also matching at the beginning and end of a line.
Using Python, I am trying to search a file for a token and then count the whitespace characters between the start of the line and that token.
So if the file is like this:
<index>
   <scm>
   </scm>
</index>
I want to find the number of spaces which precede <scm>
The solution for a single-line mode:
import itertools
with open('yourfile.txt', 'r') as f:
    txt = f.read()
print(len(list(itertools.takewhile(lambda c: c.isspace(), txt[txt.index('<scm>')-1::-1]))))
The output:
5
txt[txt.index('<scm>')-1::-1] - a "reversed" slice from the position just before the string <scm> back to the beginning of the text
itertools.takewhile(func, iterable) - takes characters from the input string (the iterable) for as long as each one is whitespace (c.isspace()) and stops at the first character that is not
If you meant just the single-line case, this would get you the preceding spaces for that line:
import re
def get_preceeding_spaces(file_name, tag):
    with open(file_name, 'r') as f:
        for line in f:
            if tag in line:
                prefix = line.split(tag)[0]
                # Only count the prefix if it consists entirely of whitespace
                if re.fullmatch(r'\s*', prefix):
                    return len(prefix)
print(get_preceeding_spaces('test.html', '<scm>'))
returns for your file:
3
You could use a regular expression. The number of spaces would be:
import re
with open('input.txt') as f_input:
    r = re.search('( +)' + re.escape("<scm>"), f_input.read(), re.S)
print(len(r.groups()[0]))
Which would be 3. Or the number of whitespace characters:
with open('input.txt') as f_input:
    r = re.search(r'(\s+)' + re.escape("<scm>"), f_input.read(), re.S)
print(len(r.groups()[0]))
Which would be 5
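For comparison, a regex-free sketch that counts only the spaces directly in front of the token on its own line. The name `spaces_before` is hypothetical, and the sample assumes that <scm> is indented by three spaces, as the answers reporting 3 suggest.

```python
def spaces_before(lines, token):
    """Count the spaces directly before token on the first line containing it."""
    for line in lines:
        pos = line.find(token)
        if pos != -1:
            prefix = line[:pos]
            # Length difference = number of trailing spaces in the prefix
            return len(prefix) - len(prefix.rstrip(' '))
    return None  # token not found


# Assumed sample content with a three-space indent
sample = ["<index>\n", "   <scm>\n", "   </scm>\n", "</index>\n"]
print(spaces_before(sample, "<scm>"))
```

Since the count is taken per line, the newline that ends the previous line is never included, which is the difference between the two results reported above.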
I have a file with some lines. Out of those lines I want only the ones which start with xxx. The lines which start with xxx have the following pattern:
xxx:(12:"pqrs",223,"rst",-90)
xxx:(23:"abc",111,"def",-80)
I want to extract only the string inside the first pair of double quotes,
i.e., "pqrs" and "abc".
Any help using regex is appreciated.
My code is as follows:
with open("log.txt","r") as f:
f = f.readlines()
for line in f:
line=line.rstrip()
for phrase in 'xxx:':
if re.match('^xxx:',line):
c=line
break
This code is giving me an error.
Your code is incorrectly indented. Your f = f.readlines() has 9 spaces in front while for line in f: has 4 spaces. It should look like below.
import re
list_of_prefixes = ["xxx","aaa"]
resulting_list = []
with open("raw.txt","r") as f:
    f = f.readlines()
    for line in f:
        line = line.rstrip()
        for phrase in list_of_prefixes:
            # Capture the word characters after the first double quote
            m = re.match(phrase + r':\(\d+:"(\w+)', line)
            if m is not None:
                resulting_list.append(m.group(1))
Well you are heading in the right direction.
If the input is this simple, you can use regex groups.
import re
with open("log.txt","r") as f:
    f = f.readlines()
for line in f:
    line = line.rstrip()
    m = re.match(r'^xxx:\(\d*:("[^"]*")', line)
    if m is not None:
        print(m.group(1))
All the magic is in the regular expression.
^xxx:\(\d*:("[^"]*") means
Start from the beginning of the line and match "xxx:(<any number of digits>:"<anything but ">"
and because the sequence "<anything but ">" is enclosed in round brackets, it will be available as a group (by calling m.group(1)).
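A small sketch of the same idea on the two sample lines from the question. As a variation that is not part of the answer above, the group here sits inside the quotes so that the result comes back without them; the name `first_quoted` is hypothetical.

```python
import re

# Same structure as the pattern explained above, but the capturing group
# is placed inside the double quotes so the quotes are not returned.
PATTERN = re.compile(r'^xxx:\(\d*:"([^"]*)"')


def first_quoted(line):
    """Return the first quoted string of an xxx line, or None if no match."""
    m = PATTERN.match(line)
    return m.group(1) if m else None


for line in ['xxx:(12:"pqrs",223,"rst",-90)', 'xxx:(23:"abc",111,"def",-80)']:
    print(first_quoted(line))
```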
PS: next time make sure to include the exact error you are getting
results = []
with open("log.txt","r") as f:
    f = f.readlines()
for line in f:
    if line.startswith("xxx"):
        parts = line.split(":")  # parts[2] is what comes after the second ':'
        result = parts[2].split(",")[0][1:-1]  # e.g. pqrs, with the quotes removed
        results.append(result)
You want to look for lines that start with xxx,
then split each line on the :. The text after the second : is what you want, up to the first comma. Your result is that string with the surrounding quotes removed. There is no need for regex; Python string methods will be fine.
To check if a line starts with xxx do
line.startswith('xxx')
To find the text in first double-quotes do
re.search(r'"(.*?)"', line).group(1)
(as match.group(1) is the first parenthesized subgroup)
So the code will be
import re
with open("file") as f:
    for line in f:
        if line.startswith('xxx'):
            print(re.search(r'"(.*?)"', line).group(1))
I have a task to replace "O" (capital O) with "0" in a text file using Python. One condition is that I have to preserve other words such as Over, NATO etc. I have to replace it only in words like 9OO (to 900), 2OO6 (to 2006) and so on. I have tried a lot but have not succeeded yet. My code is given below. Any help is appreciated. Thanks in advance.
import re
srcpatt = 'O'
rplpatt = '0'
cre = re.compile(srcpatt)
with open('myfile.txt', 'r') as file:
    content = file.read()
wordlist = re.findall(r'(\d+O|O\d+)',str(content))
print(wordlist)
for word in wordlist:
    subcontent = cre.sub(rplpatt, word)
    newrep = re.compile(word)
    newcontent = newrep.sub(subcontent,content)
with open('myfile.txt', 'w') as file:
    file.write(newcontent)
print('"',srcpatt,'" is successfully replaced by "',rplpatt,'"')
re.sub can take in a replacement function, so we can pare this down pretty nicely:
import re
with open('myfile.txt', 'r') as file:
    content = file.read()
with open('myfile.txt', 'w') as file:
    file.write(re.sub(r'\d+[\dO]+|[\dO]+\d+', lambda m: m.group().replace('O', '0'), content))
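The replacement-function approach can be tried on a string before touching any file. This is a minimal sketch that uses the same pattern as the answer above; the wrapper name `fix_letter_o` and the sample sentence are assumptions for illustration.

```python
import re

# A run of digits and capital O's that starts or ends with at least one digit
DIGIT_O_RUN = re.compile(r'\d+[\dO]+|[\dO]+\d+')


def fix_letter_o(text):
    """Replace capital O with 0 inside runs that also contain digits."""
    return DIGIT_O_RUN.sub(lambda m: m.group().replace('O', '0'), text)


print(fix_letter_o("Over 9OO NATO troops in 2OO6"))
```

Words such as Over and NATO are left alone because the pattern requires at least one digit next to the run of O's.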
import re
srcpatt = 'O'
rplpatt = '0'
cre = re.compile(srcpatt)
reg = r'\b(\d*)O(O*\d*)\b'
with open('input', 'r') as f:
    for line in f:
        # re.search (not re.match) so that matches anywhere in the line are found
        while re.search(reg, line):
            line = re.sub(reg, r'\g<1>0\2', line)
        print(line, end='')
print('"',srcpatt,'" is successfully replaced by "',rplpatt,'"')
You can probably get away with matching just a leading digit followed by one or more O characters. This won't handle OO7, but it will work nicely with 8080, for example, which none of the answers here that match trailing digits will. If you want to handle that case as well, you need to use a lookahead match.
re.sub(r'(\d)(O+)', lambda m: m.groups()[0] + '0'*len(m.groups()[1]), content)
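One possible reading of the lookahead suggestion, as a hedged sketch rather than the answerer's own code: replace a run of O's when it directly follows a digit (lookbehind) or directly precedes one (lookahead). The names `O_NEXT_TO_DIGIT` and `fix_o_near_digits` are hypothetical.

```python
import re

# A run of capital O's that touches a digit on its left or on its right
O_NEXT_TO_DIGIT = re.compile(r'(?<=\d)O+|O+(?=\d)')


def fix_o_near_digits(text):
    """Replace runs of capital O adjacent to a digit with the same number of zeros."""
    return O_NEXT_TO_DIGIT.sub(lambda m: '0' * len(m.group()), text)


print(fix_o_near_digits("9OO 2OO6 OO7 Over NATO 8080"))
```

One caveat under this assumption: a word that ends in O and is immediately followed by a digit, such as NATO2, would also be changed, so the input should be checked for such tokens first.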