regex Extract word and end with space in string - python

I am trying to filter a line and extract one word from it.
Pattern is: GR.C.24 GRCACH GRALLDKD GR_3AD etc.
Input will be: the data is GRCACH got from server.
Output: GRCACH
Problem: the pattern starts with GR<can be anything> and ends when whitespace is encountered.
I am able to find the pattern, but not able to stop when a space is encountered.
Code is:
import re
fp_data = []
with open("output", "r") as fp:
    fp_data = fp.readlines()
for da in fp_data:
    match = re.search(r"\sGR.*", da)
    print(da)
    if match:
        print(dir(match))
        print(match.group())
Output: GRCACH got from server
Expected: GRCACH (or any word that starts with GR)

Use:
(?:\s|^)(GR\S*)
(?:\s|^) matches whitespace or start of string
(GR\S*) matches GR followed by 0 or more non-whitespace characters and places match in Group 1
No need to read the entire file into memory (what if the file were very large?). You can iterate the file line by line.
import re
with open("output", "r") as fp:
    for line in fp:
        matches = re.findall(r"(?:\s|^)(GR\S*)", line)
        print(line, matches)
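As a quick check without a file, the same pattern can be run against the sample line from the question. This is only a minimal sketch; the second string (`extra`) is made up for illustration and is not from the question.

```python
import re

# Whitespace (or start of string), then GR plus any run of non-whitespace.
pattern = re.compile(r"(?:\s|^)(GR\S*)")

sample = "the data is GRCACH got from server."
extra = "GR.C.24 and GR_3AD"  # hypothetical line, not from the question

print(pattern.findall(sample))  # ['GRCACH']
print(pattern.findall(extra))   # ['GR.C.24', 'GR_3AD']
```

Because the only capture group is `(GR\S*)`, `findall` returns just the words and not the leading whitespace.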

readlines() leaves the trailing newline character "\n" on each line, so I used a list comprehension that removes it with rstrip() and skips blank lines with isspace().
import re
fp_data = []
with open("output", "r") as fp:
    fp_data = [line.rstrip() for line in fp if not line.isspace()]
for line in fp_data:
    match = re.search(r"\sGR.*", line)
    print(line)
    if match:
        print(match)
        print(match.group())

Not sure if I understood your answer and your edit after my question about the desired output correctly, but assuming that you want to list all occurrences of words that start with GR, here is a suggestion:
import re
fp_data = []
with open("output", "r") as fp:
    fp_data = fp.readlines()
for da in fp_data:
    print(da)
    match = re.findall(r'\b(GR\S*)\b', da)
    if match:
        print(match)
The usage of word boundaries (\b) has the benefit of matching at beginning of line and end of line as well.
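To see how the word-boundary version differs from a whitespace-anchored one, here is a small comparison. It is a sketch on a made-up line (`line` is hypothetical); the two patterns are the ones quoted in the answers above.

```python
import re

boundary = re.compile(r"\b(GR\S*)\b")
whitespace = re.compile(r"(?:\s|^)(GR\S*)")

line = "GRCACH, then x-GR_3AD"  # hypothetical input

# The trailing \b makes \S* give back the comma, and the leading \b
# also matches after a hyphen, not only after whitespace.
print(boundary.findall(line))    # ['GRCACH', 'GR_3AD']

# The whitespace version keeps the comma and skips the hyphenated word.
print(whitespace.findall(line))  # ['GRCACH,']
```

Which behavior is preferable depends on whether trailing punctuation and words glued to other characters should count.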

Related

How to separate captured text when using regex?

I want to extract text strings from a file using regex and add them to a list to create a new file with the extracted text, but I'm not able to separate the text I want to capture from the surrounding regex stuff that gets included
Example text:
#女
&10"「信号が乱れているみたい。聞こえる? アナタ?」"100("",3,2,0,0)
100("se003",0,2,0,0)
#男
&11"「──ポン太、もっと近づ────すぐ直す」"100("",4,2,0,0)
#女
&12"「……了解」"&13"またガニメデステーションに送られた通信信号と混線してしまったのだろう。別段慌てるような事ではなかった。"&14"作業船の方を確認した後、女はやるべき事を進めようとカプセルに視線を戻す。"52("_BGMName","bgm06")
42("BGM","00Sound.dat")
52("_GRPName","y12r1")42("DrawBG","00Draw.dat")#女
&15"「!?」"&16"睡眠保存カプセルは確かに止まっていたのに、その『中身』は止まっていなかった。"&17"スーツの外は真空状態で何も聞こえない。だが、その『中身』が元気よく泣いている事は見ればわかる。"100("",3,2,0,0)
100("se003",0,2,0,0)
#男
&18"「お──信号がまた──どうした!」"#女
&19"「信じられない。赤ちゃんよ。しかもこの子は……生きている。生きようとしてる!!」"100("",4,2,0,0)
I want to extract what is between &00"text to capture" and only keep what's between the quotation marks.
I've tried various ways of writing the regex using non capturing groups, lookahead/behind but python will always capture everything.
What I've currently got in the code below would work if it only occurred once per line, but sometimes there are multiple per line so I can't just add group 1 to the list like in #2 below.
In the code below #1 will append the corresponding string found on the line including the stuff I want to remove:
&10"「信号が乱れているみたい。聞こえる? アナタ?」"100("",3,2,0,0)
#2 will output what I actually want:
「信号が乱れているみたい。聞こえる? アナタ?」
but it only works if it occurs once per line so &13, &14 and &16, &17 disappear.
How can I add only the part I want to extract especially when it occurs multiple times per line?
# Code:
import re

def extract(filename):
    words = []
    with open(filename, 'r', encoding="utf8") as f:
        for line in f:
            if (re.search(r'(?<=&\d")(.+?"*)(?=")|(?<=&\d\d")(.+?"*)(?=")|(?<=&\d\d\d")(.+?"*)(?=")|(?<=&\d\d\d\d")(.+?"*)(?=")|(?<=&\d\d\d\d")(.+?"*)(?=")', line)):
                words.append(line)  #1
                #2 words.append(re.split(r'(?<=&)\d+"(.+?)(?=")', line)[1])
    for line in words:
        print(line +"\n")
You can shorten the pattern and match & followed by 1+ digits and capture what is between double quotes in group 1.
Read the whole file at once and use re.findall to get the capture group values.
&\d+"([^"]*)"
The pattern matches:
&\d+ Match & and 1+ digits
" Match opening double quote
([^"]*) Capture group 1, match any char except " (including newlines)
" Match closing double quote
import re

def extract(filename):
    with open(filename, 'r', encoding="utf8") as f:
        return re.findall(r'&\d+"([^"]*)"', f.read())
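The multiple-captures-per-line case from the question can be checked on a string directly. A minimal sketch, using a made-up ASCII line (`line`) in the same shape as the example text:

```python
import re

PATTERN = re.compile(r'&\d+"([^"]*)"')

# Hypothetical line: two &NN"..." entries followed by an unrelated call.
line = '&12"first"&13"second"52("_BGMName","bgm06")'

# findall returns only group 1, once per match, so both entries survive.
print(PATTERN.findall(line))  # ['first', 'second']
```

The trailing `52("_BGMName","bgm06")` is ignored because it is not preceded by `&` and digits.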
import re
filename = "D:\\a.txt"
words = []
# Only works
with open(filename, 'r', encoding="utf8") as f:
    for line in f:
        line = line.strip('\n')
        grps = re.findall(r'&[0-9]{1,2}(\".*?\")', line)
        if grps:
            #print(len(grps.groups()))
            words.append(grps)
for line in words:
    pass
    print(line)
This is the output that the above code snippet produced.

Can't replace string with variable

I came up with the below which finds a string in a row and copies that row to a new file. I want to replace Foo23 with something more dynamic (i.e. [0-9], etc.), but I cannot get this, or variables or regex, to work. It doesn't fail, but I also get no results. Help? Thanks.
with open('C:/path/to/file/input.csv') as f:
    with open('C:/path/to/file/output.csv', "w") as f1:
        for line in f:
            if "Foo23" in line:
                f1.write(line)
Based on your comment, you want to match lines whenever any three letters followed by two digits are present, e.g. foo12 and bar54. Use regex!
import re
pattern = r'([a-zA-Z]{3}\d{2})\b'
for line in f:
    if re.findall(pattern, line):
        f1.write(line)
This will match lines like 'some line foo12' and 'another foo54 line', but not 'a third line foo' or 'something bar123'.
Breaking it down:
pattern = r'( # start capture group, not needed here, but nice if you want the actual match back
[a-zA-Z]{3} # any three letters in a row, any case
\d{2} # any two digits
) # end capture group
\b # a word boundary (e.g. before white space, punctuation, or end of line)
'
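The annotated breakdown above can be turned into a pattern that actually compiles by using `re.VERBOSE`, which ignores whitespace and `#` comments inside the pattern. The following is a sketch; the `samples` list simply reuses the four example strings mentioned in this answer.

```python
import re

pattern = re.compile(r'''
    (               # start capture group
        [a-zA-Z]{3} # any three letters in a row, any case
        \d{2}       # any two digits
    )               # end capture group
    \b              # word boundary: no further letter or digit may follow
''', re.VERBOSE)

samples = [
    'some line foo12',
    'another foo54 line',
    'a third line foo',
    'something bar123',
]

kept = [s for s in samples if pattern.search(s)]
print(kept)  # ['some line foo12', 'another foo54 line']
```

`bar123` is rejected because the third digit follows `bar12` with no word boundary in between.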
If all you really need is to write all of the matches in the file to f1, you can use:
matches = re.findall(pattern, f.read()) # finds all matches in f
f1.write('\n'.join(matches)) # writes each match to a new line in f1
In essence, your question boils down to: "I want to determine whether the string matches pattern X, and if so, output it to the file". The best way to accomplish this is to use a reg-ex. In Python, the standard reg-ex library is re. So,
import re
matches = re.findall(r'([a-zA-Z]{3}\d{2})', line)
Combining this with file IO operations, we have:
data = []
with open('C:/path/to/file/input.csv', 'r') as f:
    data = list(f)
data = [ x for x in data if re.findall(r'([a-zA-Z]{3}\d{2})\b', x) ]
with open('C:/path/to/file/output.csv', 'w') as f1:
    for line in data:
        f1.write(line)
Notice that I split up your file IO operations to reduce nesting. I also moved the filtering outside of your IO. In general, each portion of your code should do "one thing" for ease of testing and maintenance.
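One way to push the "one thing" idea a little further is to put the filtering into its own function that knows nothing about files. This is a sketch only: `matching_lines` and the `rows` sample are hypothetical names and data, not part of the original answer.

```python
import re

PATTERN = re.compile(r'[a-zA-Z]{3}\d{2}\b')

def matching_lines(lines, pattern=PATTERN):
    """Yield only the lines that contain a match for pattern."""
    for line in lines:
        if pattern.search(line):
            yield line

# Made-up rows standing in for the CSV content.
rows = ["id,Foo23,ok\n", "id,none,skip\n", "id,bar54,ok\n"]
kept = list(matching_lines(rows))
print(kept)  # ['id,Foo23,ok\n', 'id,bar54,ok\n']
```

Since an open file iterates line by line, the same function could be fed the input file object and its result passed to `f1.writelines(...)`, while a plain list is enough for testing.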

How do I find a token in a file and then count the number of spaces that precede the token?

Using python, I am trying to search a file for a token, and then count the number of white spaces which precede that token to the start of the line.
So if the file is like this:
<index>
   <scm>
   </scm>
</index>
I want to find the number of spaces which precede <scm>
The solution for a single-line mode:
import itertools
with open('yourfile.txt', 'r') as f:
    txt = f.read()
print(len(list(itertools.takewhile(lambda c: c.isspace(), txt[txt.index('<scm>')-1::-1]))))
The output:
5
txt[txt.index('<scm>')-1::-1] - "reversed" slice from the position of string <scm> back to the beginning of the text
itertools.takewhile(func, iterable) - accumulates characters from the input string (iterable) for as long as each character is whitespace (c.isspace()), stopping at the first non-whitespace character
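The same idea can be tried on an in-memory string. This sketch assumes a file in which `<scm>` is indented by three spaces and lines end in `\n`; it uses `reversed()` on the text before the token, which is a small variation on the slice above and also behaves sensibly when the token is at position 0.

```python
import itertools

# Assumed file content: <scm> indented by three spaces, "\n" line endings.
txt = "<index>\n   <scm>\n   </scm>\n</index>\n"

before = txt[:txt.index('<scm>')]  # everything ahead of the token

# Walk backwards from the token while characters are whitespace.
whitespace_run = list(itertools.takewhile(str.isspace, reversed(before)))
# Same walk, but stop at anything that is not a plain space.
space_run = list(itertools.takewhile(lambda c: c == ' ', reversed(before)))

print(len(whitespace_run))  # 4: three spaces plus the newline
print(len(space_run))       # 3: only the indentation
```

Note that `isspace()` also counts the line break before the indentation, which is why the two numbers differ.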
If you meant just the single-line case, this would get you the preceding spaces for that line:
import re

def get_preceeding_spaces(file_name, tag):
    with open(file_name, 'r') as f:
        for line in f.readlines():
            if tag in line:
                prefix = line.split(tag)[0]
                # Only count the prefix if it consists solely of whitespace
                if re.fullmatch(r'\s*', prefix):
                    return len(prefix)
print(get_preceeding_spaces('test.html', '<scm>'))
returns for your file:
3
You could use a regular expression. The number of spaces would be:
import re
with open('input.txt') as f_input:
    r = re.search('( +)' + re.escape("<scm>"), f_input.read(), re.S)
    print(len(r.groups()[0]))
Which would be 3. Or the number of whitespace characters:
with open('input.txt') as f_input:
    r = re.search(r'(\s+)' + re.escape("<scm>"), f_input.read(), re.S)
    print(len(r.groups()[0]))
Which would be 5.
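The difference between the two patterns is easy to reproduce on a string. The sketch below assumes a file with a three-space indent and plain `\n` line endings, so the whitespace count comes out as 4 rather than 5; the original file presumably contained one more whitespace character, though that is a guess.

```python
import re

# Assumed content, not the asker's actual file.
txt = "<index>\n   <scm>\n   </scm>\n</index>\n"
token = re.escape("<scm>")

spaces = re.search('( +)' + token, txt)         # plain spaces only
whitespace = re.search(r'(\s+)' + token, txt)   # spaces, tabs, newlines, ...

print(len(spaces.group(1)))      # 3
print(len(whitespace.group(1)))  # 4 (newline + three spaces)
```

The `re.S` flag is left out here because it only changes how `.` matches, and neither pattern uses `.`.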

extracting certain strings from a file using python

I have a file with some lines. Out of those lines I will choose only the lines which start with xxx. The lines which start with xxx have the following pattern:
xxx:(12:"pqrs",223,"rst",-90)
xxx:(23:"abc",111,"def",-80)
I want to extract only the strings inside the first pair of double quotes,
i.e., "pqrs" and "abc".
Any help using regex is appreciated.
My code is as follows:
with open("log.txt","r") as f:
         f = f.readlines()
    for line in f:
        line=line.rstrip()
        for phrase in 'xxx:':
            if re.match('^xxx:',line):
                c=line
                break
This code is giving me an error.
Your code is wrongly indented. Your f = f.readlines() has 9 spaces in front while for line in f: has 4 spaces. It should look like below.
import re
list_of_prefixes = ["xxx","aaa"]
resulting_list = []
with open("raw.txt","r") as f:
    f = f.readlines()
for line in f:
    line=line.rstrip()
    for phrase in list_of_prefixes:
        if re.match(phrase + r':\(\d+:\"(\w+)',line) is not None:
            resulting_list.append(re.findall(phrase + r':\(\d+:\"(\w+)',line)[0])
Well you are heading in the right direction.
If the input is this simple, you can use regex groups.
import re
with open("log.txt","r") as f:
    f = f.readlines()
for line in f:
    line=line.rstrip()
    m = re.match(r'^xxx:\(\d*:("[^"]*")',line)
    if m is not None:
        print(m.group(1))
All the magic is in the regular expression.
^xxx:\(\d*:("[^"]*") means
Start from the beginning of the line, match on "xxx:(<any number of digits>:"<anything but ">"
and because the sequence "<anything but ">" is enclosed in round brackets it will be available as a group (by calling m.group(1)).
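Here is a small sketch of that group on the two sample lines from the question, plus one made-up line with a different prefix. It also shows a variant (an assumption about what may be wanted, not part of the answer) where the parentheses sit inside the quotes so that the quotes are not captured.

```python
import re

lines = [
    'xxx:(12:"pqrs",223,"rst",-90)',
    'xxx:(23:"abc",111,"def",-80)',
    'yyy:(1:"skip",0,"no",0)',  # hypothetical line that should be ignored
]

with_quotes = re.compile(r'^xxx:\(\d*:("[^"]*")')
without_quotes = re.compile(r'^xxx:\(\d*:"([^"]*)"')  # group moved inside the quotes

quoted = [m.group(1) for m in map(with_quotes.match, lines) if m]
bare = [m.group(1) for m in map(without_quotes.match, lines) if m]

print(quoted)  # ['"pqrs"', '"abc"']
print(bare)    # ['pqrs', 'abc']
```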
PS: next time make sure to include the exact error you are getting
results = []
with open("log.txt","r") as f:
    f = f.readlines()
for line in f:
    if line.startswith("xxx"):
        line = line.split(":") # line[2] is what comes after the second :
        result = line[2].split(",")[0][1:-1] # will be pqrs
        results.append(result)
You want to look for lines that start with xxx,
then split the line on the :. The first thing after the second : is what you want, up to the comma. Then your result is that string with the quotes removed. There is no need for regex; Python string functions will be fine.
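Because the sample lines contain two colons, it helps to look at what `split(":")` actually returns before picking an index. A minimal sketch on the first sample line from the question:

```python
line = 'xxx:(12:"pqrs",223,"rst",-90)'

parts = line.split(":")
print(parts)  # ['xxx', '(12', '"pqrs",223,"rst",-90)']

# The quoted string is the first comma-separated field of the last part.
first_field = parts[2].split(",")[0]
print(first_field)        # "pqrs" (still quoted)
print(first_field[1:-1])  # pqrs
```

The quoted value sits in `parts[2]`, after the second colon; `parts[1]` only holds `(12`.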
To check if a line starts with xxx do
line.startswith('xxx')
To find the text in first double-quotes do
re.search(r'"(.*?)"', line).group(1)
(as match.group(1) is the first parenthesized subgroup)
So the code will be
import re
with open("file") as f:
    for line in f:
        if line.startswith('xxx'):
            print(re.search(r'"(.*?)"', line).group(1))
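If some xxx lines might contain no double quotes at all, `re.search` returns `None` and calling `.group(1)` on it raises an `AttributeError`. The following sketch wraps the same two steps in a helper with a guard; `first_quoted` is a hypothetical name introduced here for illustration.

```python
import re

def first_quoted(line):
    """Return the text in the first pair of double quotes of an xxx line, else None."""
    if not line.startswith('xxx'):
        return None
    m = re.search(r'"(.*?)"', line)
    return m.group(1) if m else None

print(first_quoted('xxx:(12:"pqrs",223,"rst",-90)'))  # pqrs
print(first_quoted('xxx: nothing quoted here'))        # None
print(first_quoted('yyy:(1:"skip")'))                  # None
```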

python regex to print previous words in text file

I have a text file from which I need to print the word that comes before each = sign. The text file contains:
Sparrow=beak
Hen=nest Honey=comb
I need to output as:
Sparrow
Hen
Honey
Coding:
import re
with open('qwert.txt', 'r') as f:
    for line in f:
        res = re.findall(r'(?:=(\w-))', line)
        if res:
            print(res)
I am not getting any output, please help!
Using positive lookahead assertion (?=...):
import re
with open('qwert.txt', 'r') as f:
    for line in f:
        for res in re.findall(r'\w+(?==\w)', line):
            # Match word characters (`\w+`) followed by `=` and word character.
            print(res)
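The lookahead can be checked against the question's sample content without creating a file. A minimal sketch, with the two sample lines joined into one string:

```python
import re

text = "Sparrow=beak\nHen=nest Honey=comb\n"

# \w+ is kept only when it is immediately followed by "=" and a word
# character; the lookahead itself consumes nothing.
keys = re.findall(r'\w+(?==\w)', text)
print(keys)  # ['Sparrow', 'Hen', 'Honey']
```

The values (`beak`, `nest`, `comb`) are skipped because none of them is followed by `=`.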
If the string format is kinda strict, you can use split, which gets you not only the keys but also the values.
line = "Hen=nest Honey=comb"
keys = [key for key, value in [token.split('=') for token in line.split(' ')]]
Then just print keys out.
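If the values are wanted as well, the same split can feed a dictionary. This is a sketch that assumes every whitespace-separated token contains an `=`; a token without one would raise a `ValueError`.

```python
line = "Hen=nest Honey=comb"

# Split on whitespace, then split each token once on "=".
pairs = dict(token.split('=', 1) for token in line.split())

print(pairs)        # {'Hen': 'nest', 'Honey': 'comb'}
print(list(pairs))  # ['Hen', 'Honey']
```

Using `split('=', 1)` keeps any further `=` characters inside the value instead of producing extra pieces.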
