I want to extract text strings from a file using regex, add them to a list, and write a new file containing only the extracted text, but I can't separate the text I want to capture from the surrounding markup that gets included.
Example text:
#女
&10"「信号が乱れているみたい。聞こえる? アナタ?」"100("",3,2,0,0)
100("se003",0,2,0,0)
#男
&11"「──ポン太、もっと近づ────すぐ直す」"100("",4,2,0,0)
#女
&12"「……了解」"&13"またガニメデステーションに送られた通信信号と混線してしまったのだろう。別段慌てるような事ではなかった。"&14"作業船の方を確認した後、女はやるべき事を進めようとカプセルに視線を戻す。"52("_BGMName","bgm06")
42("BGM","00Sound.dat")
52("_GRPName","y12r1")42("DrawBG","00Draw.dat")#女
&15"「!?」"&16"睡眠保存カプセルは確かに止まっていたのに、その『中身』は止まっていなかった。"&17"スーツの外は真空状態で何も聞こえない。だが、その『中身』が元気よく泣いている事は見ればわかる。"100("",3,2,0,0)
100("se003",0,2,0,0)
#男
&18"「お──信号がまた──どうした!」"#女
&19"「信じられない。赤ちゃんよ。しかもこの子は……生きている。生きようとしてる!!」"100("",4,2,0,0)
I want to extract the &00"text to capture" parts and keep only what's between the quotation marks.
I've tried various ways of writing the regex using non-capturing groups and lookahead/lookbehind, but Python always captures everything.
What I currently have in the code below would work if the pattern only occurred once per line, but sometimes there are several matches per line, so I can't just append group 1 to the list as in #2 below.
In the code below, #1 appends the whole line that matched, including the parts I want to remove:
&10"「信号が乱れているみたい。聞こえる? アナタ?」"100("",3,2,0,0)
#2 will output what I actually want:
「信号が乱れているみたい。聞こえる? アナタ?」
but it only works when the pattern occurs once per line, so the &13, &14 and &16, &17 strings disappear.
How can I append only the parts I want to extract, especially when they occur multiple times per line?
# Code:
def extract(filename):
    words = []
    with open(filename, 'r', encoding="utf8") as f:
        for line in f:
            if (re.search(r'(?<=&\d")(.+?"*)(?=")|(?<=&\d\d")(.+?"*)(?=")|(?<=&\d\d\d")(.+?"*)(?=")|(?<=&\d\d\d\d")(.+?"*)(?=")|(?<=&\d\d\d\d")(.+?"*)(?=")', line)):
                #1 words.append(line)
                #2 words.append(re.split(r'(?<=&)\d+"(.+?)(?=")', line)[1])
    for line in words:
        print(line + "\n")
You can shorten the pattern: match & followed by one or more digits, and capture what is between the double quotes in group 1.
Read the whole file at once and use re.findall to get the capture group values.
&\d+"([^"]*)"
The pattern matches:
&\d+ Match & and 1+ digits
" Match opening double quote
([^"]*) Capture group 1, match any char except " (including newlines)
" Match closing double quote
import re

def extract(filename):
    with open(filename, 'r', encoding="utf8") as f:
        return re.findall(r'&\d+"([^"]*)"', f.read())
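A minimal usage sketch, assuming you then want each captured string on its own line of a new file (script.txt and extracted.txt are just placeholder names):

words = extract("script.txt")                               # hypothetical input file
with open("extracted.txt", "w", encoding="utf8") as out:    # hypothetical output file
    out.write("\n".join(words))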
import re

filename = "D:\\a.txt"
words = []

# Only works
with open(filename, 'r', encoding="utf8") as f:
    for line in f:
        line = line.strip('\n')
        grps = re.findall(r'&[0-9]{1,2}(\".*?\")', line)
        if grps:
            #print(len(grps.groups()))
            words.append(grps)

for line in words:
    print(line)
Related
I am trying to filter and extract one word from each line.
The pattern is: GR.C.24, GRCACH, GRALLDKD, GR_3AD, etc.
Input will be: the data is GRCACH got from server.
Output: GRCACH
Problem: the pattern starts with GR<can be anything> and ends when whitespace is encountered.
I am able to find the pattern but not able to stop at the space.
My code is:

import re

fp_data = []
with open("output", "r") as fp:
    fp_data = fp.readlines()

for da in fp_data:
    match = re.search(r"\sGR.*", da)
    print(da)
    if match:
        print(dir(match))
        print(match.group())
Output: GRCACH got from server
Expected: GRCACH (or any word starting with GR)
Use:
(?:\s|^)(GR\S*)
(?:\s|^) matches whitespace or start of string
(GR\S*) matches GR followed by 0 or more non-whitespace characters and places match in Group 1
No need to read the entire file into memory (what if the file were very large?). You can iterate the file line by line.
import re

with open("output", "r") as fp:
    for line in fp:
        matches = re.findall(r"(?:\s|^)(GR\S*)", line)
        print(line, matches)
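As a quick check (using the sample line from the question), the pattern returns just the GR word:

import re

line = "the data is GRCACH got from server."
print(re.findall(r"(?:\s|^)(GR\S*)", line))  # ['GRCACH']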
The readlines() method leaves a trailing newline character "\n", so I used a list comprehension to remove it with rstrip() and to skip empty lines with isspace().
import re

fp_data = []
with open("output", "r") as fp:
    fp_data = [line.rstrip() for line in fp if not line.isspace()]

for line in fp_data:
    match = re.search(r"\sGR.*", line)
    print(line)
    if match:
        print(match)
        print(match.group())
Not sure if I understood your question and your edit after my comment about the desired output correctly, but assuming that you want to list all occurrences of words that start with GR, here is a suggestion:
import re

fp_data = []
with open("output", "r") as fp:
    fp_data = fp.readlines()

for da in fp_data:
    print(da)
    match = re.findall(r'\b(GR\S*)\b', da)
    if match:
        print(match)
Using word boundaries (\b) has the benefit of matching at the beginning and end of the line as well.
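For example (the test strings below are made up from the patterns listed in the question), the \b version also picks up a GR word at the very start of a line:

import re

print(re.findall(r'\b(GR\S*)\b', "the data is GRCACH got from server."))       # ['GRCACH']
print(re.findall(r'\b(GR\S*)\b', "GR_3AD appears at the start of this line"))  # ['GR_3AD']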
After defining two keywords, my goal is to:
read full contents of an unstructured text file (1000+ lines of text)
loop through contents, fetch 60 characters to the left of keyword each time it is hit
append each 60 character string in a separate line of a new text file
I have the code to read the unstructured text file and write to the new text file.
I am having trouble creating the code that will seek each keyword, fetch the contents, and then loop through to the end of the file.
Very simply, here is what I have so far:
# read file, store in variable
content = open("demofile.txt", "r")

# seek "KW1" or "KW2", take 60 characters to the left, append to text file, loop

# open a text file, write variable contents, close file
file = open("output.txt", "w")
file.writelines(content)
file.close()
I need help with the middle portion of this code. For example, if source text file says:
"some text, some text, some text, KEYWORD"
I would like to return:
"some text, some text, some text, "
In a new row for each keyword found.
Thank you.
result = []
# Open the file
with open('your_file') as f:
    # Iterate through lines
    for line in f.readlines():
        # Find the start of the word
        index = line.find('your_word')
        # If the word is inside the line
        if index != -1:
            if index < 60:
                result.append(line[:index])
            else:
                result.append(line[index-60:index])
After that, you can write result to a file.
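A minimal sketch of that last step (output.txt is just an example name):

with open('output.txt', 'w') as out:
    for snippet in result:
        out.write(snippet + '\n')  # one captured chunk per line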
If you have several words, you can modify your code like this:
words = ['waka1', 'waka2', 'waka3']
result = []
# Open the file
with open('your_file') as f:
    # Iterate through lines
    for line in f.readlines():
        for word in words:
            # Find the start of the word
            index = line.find(word)
            # If the word is inside the line
            if index != -1:
                if index < 60:
                    result.append(line[:index])
                else:
                    result.append(line[index-60:index])
You could go for a regex based solution as well!
import re

# r before the string makes it a raw string, so the \'s aren't used as escape chars.
# \b indicates a word boundary to regex: a position next to a new line, space, tab, punctuation, etc.
kwords = [r"\bparameter\b", r"\bpointer\b", r"\bfunction\b"]
in_file = "YOUR_IN_FILE"
out_file = "YOUR_OUT_FILE"

patterns = [r"([\s\S]{{0,60}}?){}".format(i) for i in kwords]
# patterns is now a list of regex pattern strings which will match between 0-60
# characters (as many as possible), followed by a word boundary, followed by your
# keyword, and finally followed by another word boundary. If you don't care about
# the word boundaries then remove both \b from each string. The actual text
# captured will only be the 0-60 characters before your keyword and not the
# keyword itself.
# This WILL include newlines when scanning backwards 60 characters.
# If you DON'T want to include newlines, change the `[\s\S]` in patterns to `.`

with open(in_file, "r") as f:
    data = f.read()

with open(out_file, "w") as f:
    for pattern in patterns:
        matches = re.findall(pattern, data)
        # The above will find all occurrences of your pattern and return a list of
        # occurrences, as strings.
        matches = [i.replace("\n", " ") for i in matches]
        # The above replaces any newlines we found with a space.
        # Now we can print the matches for you to see
        print("Matches for " + pattern + ":", end="\n\t")
        for match in matches:
            print(match, end="\n\t")
            # and write them to a file
            f.write(match + "\r\n")
        print("\n")
Depending on the specifics of what you need captured, you should have enough information here to adapt it to your problem. Leave a comment if you have any questions about regex.
I came up with the code below, which finds a string in a row and copies that row to a new file. I want to replace Foo23 with something more dynamic (e.g. [0-9], etc.), but I cannot get this to work with variables or regex. It doesn't fail, but I also get no results. Help? Thanks.
with open('C:/path/to/file/input.csv') as f:
    with open('C:/path/to/file/output.csv', "w") as f1:
        for line in f:
            if "Foo23" in line:
                f1.write(line)
Based on your comment, you want to match lines whenever any three letters followed by two numbers are present, e.g. foo12 and bar54. Use regex!
import re

pattern = r'([a-zA-Z]{3}\d{2})\b'
for line in f:
    if re.findall(pattern, line):
        f1.write(line)
This will match lines like 'some line foo12' and 'another foo54 line', but not 'a third line foo' or 'something bar123'.
Breaking it down:
pattern = r'''(        # start capture group, not needed here, but nice if you want the actual match back
    [a-zA-Z]{3}        # any three letters in a row, any case
    \d{2}              # any two digits
    )                  # end capture group
    \b                 # a word boundary (e.g. whitespace, punctuation, or end of line)
'''                    # (written out like this, the pattern would need the re.VERBOSE flag)
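A quick check of those claims with re.findall (the test strings are the ones from above, not part of the original answer):

import re

pattern = r'([a-zA-Z]{3}\d{2})\b'
for s in ['some line foo12', 'another foo54 line', 'a third line foo', 'something bar123']:
    print(s, '->', re.findall(pattern, s))
# some line foo12 -> ['foo12']
# another foo54 line -> ['foo54']
# a third line foo -> []
# something bar123 -> []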
If all you really need is to write all of the matches in the file to f1, you can use:
matches = re.findall(pattern, f.read()) # finds all matches in f
f1.write('\n'.join(matches)) # writes each match to a new line in f1
In essence, your question boils down to: "I want to determine whether the string matches pattern X, and if so, output it to the file". The best way to accomplish this is to use a regex. In Python, the standard regex library is re. So,
import re
matches = re.findall(r'([a-zA-Z]{3}\d{2})', line)
Combining this with file IO operations, we have:
data = []
with open('C:/path/to/file/input.csv', 'r') as f:
    data = list(f)

data = [x for x in data if re.findall(r'([a-zA-Z]{3}\d{2})\b', x)]

with open('C:/path/to/file/output.csv', 'w') as f1:
    for line in data:
        f1.write(line)
Notice that I split up your file IO operations to reduce nesting. I also removed the filtering outside of your IO. In general, each portion of your code should do "one thing" for ease of testing and maintenance.
I have a file with some lines. Out of those lines I will choose only the lines that start with xxx. The lines that start with xxx follow this pattern:
xxx:(12:"pqrs",223,"rst",-90)
xxx:(23:"abc",111,"def",-80)
I want to extract only the string that is inside the first pair of double quotes,
i.e., "pqrs" and "abc".
Any help using regex is appreciated.
My code is as follows:
with open("log.txt","r") as f:
f = f.readlines()
for line in f:
line=line.rstrip()
for phrase in 'xxx:':
if re.match('^xxx:',line):
c=line
break
This code is giving me an error.
Your code is wrongly indented: f = f.readlines() has 9 spaces in front of it while for line in f: has 4. It should look like the code below.
import re

list_of_prefixes = ["xxx", "aaa"]
resulting_list = []

with open("raw.txt", "r") as f:
    f = f.readlines()
    for line in f:
        line = line.rstrip()
        for phrase in list_of_prefixes:
            if re.match(phrase + r':\(\d+:\"(\w+)', line) is not None:
                resulting_list.append(re.findall(phrase + r':\(\d+:\"(\w+)', line)[0])
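Run against the two sample lines from the question, that pattern captures just the first quoted word (a small sanity check, not part of the original code):

import re

samples = ['xxx:(12:"pqrs",223,"rst",-90)', 'xxx:(23:"abc",111,"def",-80)']
for line in samples:
    print(re.findall(r'xxx:\(\d+:\"(\w+)', line))  # ['pqrs'] then ['abc']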
Well you are heading in the right direction.
If the input is this simple, you can use regex groups.
with open("log.txt","r") as f:
f = f.readlines()
for line in f:
line=line.rstrip()
m = re.match('^xxx:\(\d*:("[^"]*")',line)
if m is not None:
print(m.group(1))
All the magic is in the regular expression.
^xxx:\(\d*:("[^"]*") means:
start from the beginning of the line and match xxx:(, then any number of digits, then :, then a quoted sequence "<anything but ">",
and because that quoted sequence is enclosed in round brackets it is available as a group (by calling m.group(1)).
PS: next time make sure to include the exact error you are getting
results = []
with open("log.txt", "r") as f:
    f = f.readlines()
    for line in f:
        if line.startswith("xxx"):
            parts = line.split(":", 2)              # parts[2] is what comes after the second ':'
            result = parts[2].split(",")[0][1:-1]   # '"pqrs"' -> 'pqrs'
            results.append(result)
You want to look for lines that start with xxx,
then split the line on the :. The part after the second colon is what you want -- up to the first comma. Your result is then that string with the surrounding quotes removed. There is no need for regex; Python string functions will be fine.
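To make the slicing concrete, here is a quick trace with one of the sample lines (not part of the original answer):

line = 'xxx:(12:"pqrs",223,"rst",-90)'
parts = line.split(":", 2)            # ['xxx', '(12', '"pqrs",223,"rst",-90)']
print(parts[2].split(",")[0])         # '"pqrs"'
print(parts[2].split(",")[0][1:-1])   # 'pqrs'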
To check if a line starts with xxx do
line.startswith('xxx')
To find the text in first double-quotes do
re.search(r'"(.*?)"', line).group(1)
(as match.group(1) is the first parenthesized subgroup)
So the code will be
with open("file") as f:
for line in f:
if line.startswith('xxx'):
print(re.search(r'"(.*?)"', line).group(1))
See the re module docs for more details.
I have a file with a bunch of text that I want to tear through, match a bunch of things and then write these items to separate lines in a new file.
This is the basics of the code I have put together:
import re

f = open('this.txt', 'r')
g = open('that.txt', 'w')

text = f.read()
matches = re.findall('', text)  # do some re matching here

for i in matches:
    a = i[0] + '\n'
    g.write(a)

f.close()
g.close()
My issue is that I want each matched item on a new line (hence the '\n'), but I don't want a blank line at the end of the file.
I guess the last item in the file should not be followed by a newline character.
What is the Pythonic way of sorting this out? Also, is the way I have set this up in my code the best way of doing this, or the most Pythonic?
If you want to write out a sequence of lines with newlines between them, but no newline at the end, I'd use str.join. That is, replace your for loop with this:
output = "\n".join(i[0] for i in matches)
g.write(output)
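As a small illustration (the strings are made up), join only puts the separator between items, so there is no trailing newline:

lines = ["first", "second", "third"]
print(repr("\n".join(lines)))  # 'first\nsecond\nthird' -- nothing after the last item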
In order to avoid having to close your files explicitly, especially if your code might be interrupted by exceptions, you can use the with statement to make things simpler. The following code replaces the entire code in your question:
with open('this.txt') as f, open('that.txt', 'w') as g:
    text = f.read()
    matches = re.findall('', text)  # do some re matching here
    g.write("\n".join(i[0] for i in matches))
or, since you don't need both files open at the same time:
with open('this.txt') as f:
    text = f.read()

matches = re.findall('', text)  # do some re matching here

with open('that.txt', 'w') as g:
    g.write("\n".join(i[0] for i in matches))