I came up with the below which finds a string in a row and copies that row to a new file. I want to replace Foo23 with something more dynamic (i.e. [0-9], etc.), but I cannot get this, or variables or regex, to work. It doesn't fail, but I also get no results. Help? Thanks.
with open('C:/path/to/file/input.csv') as f:
    with open('C:/path/to/file/output.csv', "w") as f1:
        for line in f:
            if "Foo23" in line:
                f1.write(line)
Based on your comment, you want to match lines whenever any three letters followed by two numbers are present, e.g. foo12 and bar54. Use regex!
import re
pattern = r'([a-zA-Z]{3}\d{2})\b'
for line in f:
    if re.findall(pattern, line):
        f1.write(line)
This will match lines like 'some line foo12' and 'another foo54 line', but not 'a third line foo' or 'something bar123'.
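As a quick check of that claim, here is a minimal sketch that runs the same pattern (without the capture group, which is not needed for filtering) over the four example lines. The `samples` list and the `matching` name are just for illustration.

```python
import re

# Same pattern as in the answer: three letters, two digits, then a word boundary.
pattern = re.compile(r'[a-zA-Z]{3}\d{2}\b')

samples = [
    'some line foo12',
    'another foo54 line',
    'a third line foo',
    'something bar123',
]

# Keep only the lines in which the pattern is found somewhere.
matching = [s for s in samples if pattern.search(s)]
print(matching)
```

Only the first two lines survive: the third has no digits, and in the fourth the two digits are followed by another digit, so `\b` cannot match there.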
Breaking it down:
pattern = re.compile(r'''
    (             # start capture group; not needed here, but nice if you want the actual match back
    [a-zA-Z]{3}   # any three letters in a row, any case
    \d{2}         # any two digits
    )             # end capture group
    \b            # a word boundary (e.g. before white space, punctuation, or the end of the line)
''', re.VERBOSE)
If all you really need is to write all of the matches in the file to f1, you can use:
matches = re.findall(pattern, f.read()) # finds all matches in f
f1.write('\n'.join(matches)) # writes each match to a new line in f1
In essence, your question boils down to: "I want to determine whether the string matches pattern X, and if so, output it to the file". The best way to accomplish this is to use a reg-ex. In Python, the standard reg-ex library is re. So,
import re
matches = re.findall(r'([a-zA-Z]{3}\d{2})', line)
Combining this with file IO operations, we have:
data = []
with open('C:/path/to/file/input.csv', 'r') as f:
    data = list(f)
data = [x for x in data if re.findall(r'([a-zA-Z]{3}\d{2})\b', x)]
with open('C:/path/to/file/output.csv', 'w') as f1:
    for line in data:
        f1.write(line)
Notice that I split up your file IO operations to reduce nesting. I also moved the filtering outside of your IO. In general, each portion of your code should do "one thing" for ease of testing and maintenance.
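To make the "one thing per piece" idea concrete, here is a rough sketch that splits reading, filtering and writing into three small functions. The function names (`read_lines`, `filter_lines`, `write_lines`) and the temporary files are my own choices for the demo, not part of the original code.

```python
import os
import re
import tempfile

# Three letters followed by two digits and a word boundary, as above.
PATTERN = re.compile(r'[a-zA-Z]{3}\d{2}\b')

def read_lines(path):
    """Return every line of the file, newlines included."""
    with open(path, 'r') as f:
        return list(f)

def filter_lines(lines, pattern=PATTERN):
    """Keep the lines in which the pattern occurs; no file IO here."""
    return [line for line in lines if pattern.search(line)]

def write_lines(path, lines):
    """Write the lines unchanged to the file."""
    with open(path, 'w') as f:
        f.writelines(lines)

# Demo on throwaway files (hypothetical data).
tmp_dir = tempfile.mkdtemp()
src = os.path.join(tmp_dir, 'input.csv')
dst = os.path.join(tmp_dir, 'output.csv')
write_lines(src, ['a,foo12\n', 'b,nothing\n', 'c,bar54,x\n'])
write_lines(dst, filter_lines(read_lines(src)))
result = read_lines(dst)
print(result)
```

Because `filter_lines` never touches a file, it can be tested with a plain list of strings.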
I want to extract text strings from a file using regex and add them to a list to create a new file with the extracted text, but I'm not able to separate the text I want to capture from the surrounding regex stuff that gets included
Example text:
#女
&10"「信号が乱れているみたい。聞こえる? アナタ?」"100("",3,2,0,0)
100("se003",0,2,0,0)
#男
&11"「──ポン太、もっと近づ────すぐ直す」"100("",4,2,0,0)
#女
&12"「……了解」"&13"またガニメデステーションに送られた通信信号と混線してしまったのだろう。別段慌てるような事ではなかった。"&14"作業船の方を確認した後、女はやるべき事を進めようとカプセルに視線を戻す。"52("_BGMName","bgm06")
42("BGM","00Sound.dat")
52("_GRPName","y12r1")42("DrawBG","00Draw.dat")#女
&15"「!?」"&16"睡眠保存カプセルは確かに止まっていたのに、その『中身』は止まっていなかった。"&17"スーツの外は真空状態で何も聞こえない。だが、その『中身』が元気よく泣いている事は見ればわかる。"100("",3,2,0,0)
100("se003",0,2,0,0)
#男
&18"「お──信号がまた──どうした!」"#女
&19"「信じられない。赤ちゃんよ。しかもこの子は……生きている。生きようとしてる!!」"100("",4,2,0,0)
I want to extract what is between &00"text to capture" and only keep what's between the quotation marks.
I've tried various ways of writing the regex using non capturing groups, lookahead/behind but python will always capture everything.
What I've currently got in the code below would work if it only occurred once per line, but sometimes there are multiple per line so I can't just add group 1 to the list like in #2 below.
In the code below #1 will append the corresponding string found on the line including the stuff I want to remove:
&10"「信号が乱れているみたい。聞こえる? アナタ?」"100("",3,2,0,0)
#2 will output what I actually want:
「信号が乱れているみたい。聞こえる? アナタ?」
but it only works if it occurs once per line so &13, &14 and &16, &17 disappear.
How can I add only the part I want to extract especially when it occurs multiple times per line?
# Code:
def extract(filename):
    words = []
    with open(filename, 'r', encoding="utf8") as f:
        for line in f:
            if (re.search(r'(?<=&\d")(.+?"*)(?=")|(?<=&\d\d")(.+?"*)(?=")|(?<=&\d\d\d")(.+?"*)(?=")|(?<=&\d\d\d\d")(.+?"*)(?=")|(?<=&\d\d\d\d")(.+?"*)(?=")', line)):
                #1 words.append(line)
                #2 words.append(re.split(r'(?<=&)\d+"(.+?)(?=")', line)[1])
    for line in words:
        print(line +"\n")
You can shorten the pattern: match & followed by one or more digits, and capture what is between the double quotes in group 1.
Read the whole file at once and use re.findall to get the capture group values.
&\d+"([^"]*)"
The pattern matches:
&\d+ Match & and 1+ digits
" Match opening double quote
([^"]*) Capture group 1, match any char except " (including newlines)
" Match closing double quote
import re

def extract(filename):
    with open(filename, 'r', encoding="utf8") as f:
        return re.findall(r'&\d+"([^"]*)"', f.read())
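To see why this handles several occurrences on one line, here is a small sketch on an in-memory string instead of a file. The sample is a shortened, partly made-up variant of the text in the question ("second" and "third" are placeholders).

```python
import re

# Shortened sample in the same format; three quoted strings share one line.
sample = (
    '#女\n'
    '&12"「……了解」"&13"second"&14"third"52("_BGMName","bgm06")\n'
    '42("BGM","00Sound.dat")\n'
)

# findall returns only group 1 for every match, however many there are per line.
captured = re.findall(r'&\d+"([^"]*)"', sample)
print(captured)

# The same thing line by line: extend() adds each captured string separately.
words = []
for line in sample.splitlines():
    words.extend(re.findall(r'&\d+"([^"]*)"', line))
```

The quoted arguments of `52(...)` and `42(...)` are skipped because they are not preceded by `&` and digits.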
import re
filename = "D:\\a.txt"
words = []
# Only works
with open(filename, 'r', encoding="utf8") as f:
    for line in f:
        line = line.strip('\n')
        grps = re.findall(r'&[0-9]{1,2}(\".*?\")', line)
        if grps:
            #print(len(grps.groups()))
            words.append(grps)
for line in words:
pass
print(line)
This is the output that the above code snippet produced.
I want to separate this file into lines, each ending with a sentence terminator (period, question mark, exclamation point, etc.), in order to make it easier to work with later on.
I attempted to use nltk, but to no avail:
from nltk.tokenize import sent_tokenize

text = r'你在哪里? 我想看到你的狗!我很喜欢你。'
tokenized_text = sent_tokenize(text)
print(tokenized_text)
Actual result:
['你在哪里? 我想看到你的狗!我很喜欢你。']
Expected result:
['你在哪里?
我想看到你的狗!
我很喜欢你。']
seeing as how no one has responded...
import re
text = r'你在哪里? 我想看到你的狗!我很喜欢你。'
text_tokens = re.findall(r'(.+?[?!。])\s?', text)  # put all the separating tokens between the []
print("\n".join(text_tokens))
outputs
你在哪里?
我想看到你的狗!
我很喜欢你。
Explanation:
.+? lazily matches one or more characters, stopping at the first occurrence of
[?!。] any one of the tokens you want to split on.
\s? after the group consumes one trailing space if it exists; since only the text and its token are inside the capture group, that space is left out of the result.
"\n".join(text_tokens) joins the list with newlines, so each match ends up on its own line.
If you were reading from one file and writing to another, a really simple program could look like this:
import re
text_tokens = []
with open("example.txt", 'r', encoding="utf8") as text:
    text_tokens = re.findall(r'(.+?[?!。])\s?', text.read())
with open("output.txt", 'w', encoding="utf8") as out:
    out.write("\n".join(text_tokens))
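Real Chinese text often uses the full-width forms of the question and exclamation marks, so here is a hedged variation of the findall approach that accepts both. The extra characters in `ENDINGS` and the `split_sentences` helper are my additions, not something from the question.

```python
import re

# Sentence-ending characters; the full-width ? and ! are an assumed addition.
ENDINGS = '?!。?!'

# Lazily match up to the first ending character, then drop trailing whitespace.
SENTENCE = re.compile(r'(.+?[' + ENDINGS + r'])\s*')

def split_sentences(text):
    """Return the sentences found in text, each with its ending character."""
    return SENTENCE.findall(text)

sentences = split_sentences('你在哪里? 我想看到你的狗!我很喜欢你。')
print("\n".join(sentences))
```

One thing to be aware of: any trailing text that does not end in one of those characters is silently dropped, because it never completes a match.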
I actually want to do a search and replace but ignore all my commented lines, and I also just want to replace only the first found...
input-file.txt
#replace me
#replace me
replace me
replace me
...like with:
text = text.replace("replace me", "replaced!", 1) # with max. 1 rep.
But I'm not sure how to skip those comments, so that I get:
#replace me
#replace me
replaced!
replace me
As I see it, the existing solutions have one or more of several problems:
Incomplete (e.g. requiring match on start of line)
Incomplete (e.g. requiring match not containing \n)
Clunky (e.g. looong file-based solutions)
I'm pretty sure a pure-regex solution would require variable-width lookbehinds, which the re module doesn't support (though I think the regex module does). With a small tweak though, regex can still provide a fairly clean answer.
import re
i = re.search(r'^([^#\n]?)+replace me', string_to_replace, re.M).start()
replaced_string = ''.join([
    string_to_replace[:i],
    re.sub(r'replace me', 'replaced!', string_to_replace[i:], count=1, flags=re.M),
])
The idea is that you find the first uncommented line containing the start of your match, and then you replace the first instance of 'replace me' that you find starting on that line. The ^([^#\n]?)+ bit in the regex says
^ -- Find the start of a line.
([^#\n]?)+ -- Repeat ([^#\n]?) one or more times, i.e. consume whatever comes before the match on that line, as long as none of it is # or \n.
([^#\n]?) -- Find 0 or 1 of [^#\n].
[^#\n] -- Find anything that's not # or \n.
Note that we're using raw strings r'' to prevent double escaping things like backslashes when creating our regex expressions, and we're using re.M so that ^ matches at the start of every line rather than only at the start of the string.
Note that the behavior is a bit weird if the string you're trying to replace contains the pattern \n#. In that case, you'll wind up replacing part or all of one or more commented lines, which may not be what you want. Considering the problems with the alternatives, I'd be inclined to say the alternatives are all wrong approaches.
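For what it's worth, the search-then-replace idea can be wrapped in a small function that also copes with the case where nothing matches (re.search returns None there, and calling .start() on it raises). This is only a sketch: the function name is made up, and I've swapped in the simpler `[^#\n]*` prefix, which I believe finds the same line start as `([^#\n]?)+`.

```python
import re

def replace_first_uncommented(text, old, new):
    """Replace the first `old` that starts on a line with no '#' before it."""
    # Find the start of the first line where `old` begins and nothing
    # before it on that line is a '#'.
    found = re.search(r'^[^#\n]*' + re.escape(old), text, flags=re.M)
    if found is None:
        return text  # nothing to replace outside of comments
    start = found.start()
    # Replace only the first occurrence from that line onwards.
    return text[:start] + text[start:].replace(old, new, 1)

sample = '#replace me\n#replace me\nreplace me\nreplace me\n'
print(replace_first_uncommented(sample, 'replace me', 'replaced!'))
```

`re.escape` is there so that `old` is treated as literal text even if it contains regex metacharacters.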
If that's not what you want, excluding all commented lines gets doubly weird because of some uncertainty in how they'd get merged back together. For example, consider the following input file.
#comment 1
replace
#comment 2
me
replace
me
What happens if you want to replace the string replace\nme? Do you exclude the first match because \n#comment 2 is stuck in between? If you use the first match, where does \n#comment 2 go? Does it go before or after the replacement? Is the replacement multiple lines as well so that it can still get sandwiched in? Do you just delete it?
Have a flag that marks whether you have completed the replacement yet, and only replace while that flag is true and the line is not a comment:
not_yet_replaced = True
with open('input-file.txt') as f:
    for l in f:
        if not_yet_replaced and not l.startswith('#') and 'replace me' in l:
            l = l.replace('replace me', 'replaced!', 1)
            not_yet_replaced = False
        print(l, end='')
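The same flag idea can be written as a generator, so it works on any iterable of lines (an open file, a list, ...) without loading everything into memory. A minimal sketch, with `replace_first` being a name I made up:

```python
def replace_first(lines, old, new):
    """Yield lines unchanged, except for the first non-comment line containing old."""
    done = False
    for line in lines:
        if not done and not line.startswith('#') and old in line:
            line = line.replace(old, new, 1)  # only the first hit on that line
            done = True
        yield line

lines = ['#replace me\n', '#replace me\n', 'replace me\n', 'replace me\n']
result = list(replace_first(lines, 'replace me', 'replaced!'))
print(''.join(result), end='')
```

With a file you could pass the file object itself as `lines` and write each yielded line to the output file.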
You can use a break after the first occurrence like so:
with open('input.txt', 'r') as f:
    content = f.read().split('\n')
for i in range(len(content)):
    if content[i] == 'replace me':
        content[i] = 'replaced'
        break
with open('input.txt', 'w') as f:
    content = ('\n').join(content)
    f.write(content)
Output :
(xenial)vash@localhost:~/python/stack_overflow$ cat input.txt
#replace me
#replace me
replaced
replace me
If the input file is not very big, you can read it into memory as a list of lines. Then iterate over the lines and replace the first matching one. Then write the lines back to the file:
with open('input-file.txt', 'r+') as f:
    lines = f.readlines()
    substr = 'replace me'
    for i in range(len(lines)):
        if lines[i].startswith('#'):
            continue
        if substr in lines[i]:
            lines[i] = lines[i].replace(substr, 'replaced!', 1)
            break
    f.seek(0)
    f.truncate()
    f.writelines(lines)
I'm not sure whether or not you have managed to get the text out of the file, so you can do that by doing
f = open("input-file.txt", "r")
text = f.read()
f.close()
Then the way I would do this is first split the text into lines like so
lines = text.split("\n")
then do the replacement on each line, checking it does not start with a "#"
for index, line in enumerate(lines):
    if len(line) > 0 and line[0] != "#" and "replace me" in line:
        lines[index] = line.replace("replace me", "replaced!", 1)
        break
then stitch the lines back together.
new_text = "\n".join(lines)
hope this helps :)
Easiest way is to use a multiline regex along with its sub() method and giving it a count of 1:
import re
r = re.compile("^replace me$", re.M)
s = """
#replace me
#replace me
replace me
replace me
"""
r.sub("replaced!", s, 1)
Gives
#replace me
#replace me
replaced!
replace me
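The anchored pattern above only fires when a whole line is exactly replace me. If the text can sit anywhere on an uncommented line, one possible variation (my own sketch, not part of the answer) is to capture whatever precedes it on the line and put it back:

```python
import re

text = '#replace me\n  please replace me now, replace me later\nreplace me\n'

# ^ anchors at a line start (re.M); group 1 lazily takes the characters before
# the first "replace me" on that line and refuses '#' and newlines.
pattern = re.compile(r'^([^#\n]*?)replace me', re.M)

# \1 restores the prefix; count=1 limits the substitution to the first match.
result = pattern.sub(r'\1replaced!', text, count=1)
print(result)
```

Like the original, this treats a line as commented only if a `#` appears before the text, so a trailing comment after the match is left alone.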
How can I reduce multiple blank lines in a text file to a single line at each occurrence?
I have read the entire file into a string, because I want to do some replacement across line endings.
with open(sourceFileName, 'rt') as sourceFile:
    sourceFileContents = sourceFile.read()
This doesn't seem to work
while '\n\n\n' in sourceFileContents:
    sourceFileContents = sourceFileContents.replace('\n\n\n', '\n\n')
and nor does this
sourceFileContents = re.sub('\n\n\n+', '\n\n', sourceFileContents)
It's easy enough to strip them all, but I want to reduce multiple blank lines to a single one, each time I encounter them.
I feel that I'm close, but just can't get it to work.
This is a reach, but perhaps some of the lines aren't completely blank (i.e. they have only whitespace characters that give the appearance of blankness). You could try removing all possible whitespace between newlines.
re.sub(r'(\n\s*)+\n+', '\n\n', sourceFileContents)
Edit: realized the second '+' was superfluous, as the \s* will catch newlines between the first and last. We just want to make sure the last character is definitely a newline so we don't remove leading whitespace from a line with other content.
re.sub(r'(\n\s*)+\n', '\n\n', sourceFileContents)
Edit 2
re.sub(r'\n\s*\n', '\n\n', sourceFileContents)
This should be an even simpler solution. We really just want to catch any possible whitespace (which includes intermediate newlines) between the two anchor newlines that make up the single blank line, and collapse it down to just those two newlines.
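Here is a small sketch of that last pattern in action, on a made-up string with whitespace-only lines and an indented line after the gap. The wrapper function name is mine.

```python
import re

def collapse_blank_lines(text):
    """Collapse any run of blank or whitespace-only lines to one blank line."""
    return re.sub(r'\n\s*\n', '\n\n', text)

# Three "blank" lines (one with a space, one with a tab) before an indented line.
messy = 'first\n \n\t\n\n    indented\nlast\n'
tidy = collapse_blank_lines(messy)
print(repr(tidy))
```

The indentation of the following line survives because the match has to end on a newline, so `\s*` gives the leading spaces back.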
Your code works for me. Maybe there are carriage returns (\r) present in the file.
re.sub(r'[\r\n][\r\n]{2,}', '\n\n', sourceFileContents)
You can use just str methods split and join:
text = "some text\n\n\n\nanother line\n\n"
print("\n".join(item for item in text.split('\n') if item))
Very simple approach using re module
import re
text = 'Abc\n\n\ndef\nGhijk\n\nLmnop'
text = re.sub('[\n]+', '\n', text) # Replacing one or more consecutive newlines with single \n
Result:
'Abc\ndef\nGhijk\nLmnop'
If the lines are completely empty, you can use regex positive lookahead to replace them with single lines:
sourceFileContents = re.sub(r'\n+(?=\n)', '\n', sourceFileContents)
If you replace your read statement with the following, then you don't have to worry about whitespace or carriage returns:
with open(sourceFileName, 'rt') as sourceFile:
    sourceFileContents = ''.join([l.rstrip() + '\n' for l in sourceFile])
After doing this, both of the methods you tried in the OP work.
OR
Just write it out in a simple loop.
with open(sourceFileName, 'rt') as sourceFile:
    lines = ['']
    for line in (l.rstrip() for l in sourceFile):
        if line != '' or lines[-1] != '\n':
            lines.append(line + '\n')
    sourceFileContents = "".join(lines)
I guess another option which is longer, but maybe prettier?
with open(sourceFileName, 'rt') as sourceFile:
    last_line = None
    lines = []
    for line in sourceFile:
        # if you want to skip lines with only whitespace, you could add something like:
        # line = line.lstrip(" \t")
        # keep the line unless it is a blank line directly after another blank line
        if line != "\n" or last_line != "\n":
            lines.append(line)
        last_line = line
    contents = "".join(lines)
I was trying to find some clever generator function way of writing this, but it's been a long week so I can't.
Code untested, but I think it should work?
(edit: One upside is I removed the need for regular expressions which fixes the "now you have two problems" problem :) )
(another edit based on Marc Chiesa's suggestion of lingering whitespace)
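Since a generator version was mentioned, here is one possible sketch of it. The name `squeeze_blank_lines` is made up, and treating whitespace-only lines as blank is an assumption based on the lingering-whitespace suggestion.

```python
def squeeze_blank_lines(lines):
    """Yield lines, keeping only the first blank line of each run of blanks."""
    previous_blank = False
    for line in lines:
        blank = not line.strip()  # empty or whitespace-only
        if blank and previous_blank:
            continue  # skip the 2nd, 3rd, ... blank line of a run
        yield '\n' if blank else line
        previous_blank = blank

source = ['a\n', '\n', '  \n', '\n', 'b\n', '\n', 'c\n']
contents = ''.join(squeeze_blank_lines(source))
print(repr(contents))
```

With a real file, `source` would simply be the open file object.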
For someone who can't do regex like me, if the code to process is python:
import autopep8
autopep8.fix_code('your_code')
Another quick solution, just in case your code isn't Python:
for x in range(100):
    content = content.replace("  ", " ")  # reduce runs of multiple spaces
# then
for x in range(20):
    content = content.replace("\n\n\n", "\n\n")  # reduce runs of multiple blank lines
Note that if you have more than 100 consecutive whitespaces or 20 consecutive new lines, you'll want to increase the repetition times.
If decoding from unicode, watch out for non-breaking spaces which show up in cat -vet as M-BM-:
sourceFileContents = sourceFile.read()
# once decoded to text the non-breaking space is "\u00a0"; in raw UTF-8 bytes it is b"\xc2\xa0"
sourceFileContents = re.sub(r'\n(\s*\n)+','\n\n',sourceFileContents.replace("\u00a0"," "))
I have a file with a bunch of text that I want to tear through, match a bunch of things and then write these items to separate lines in a new file.
This is the basics of the code I have put together:
f = open('this.txt', 'r')
g = open('that.txt', 'w')
text = f.read()
matches = re.findall('', text) # do some re matching here
for i in matches:
    a = i[0] + '\n'
    g.write(a)
f.close()
g.close()
My issue is I want each matched item on a new line (hence the '\n') but I don't want a blank line at the end of the file.
I guess I need to not have the last item in the file being trailed by a new line character.
What is the Pythonic way of sorting this out? Also, is the way I have set this up in my code the best way of doing this, or the most Pythonic?
If you want to write out a sequence of lines with newlines between them, but no newline at the end, I'd use str.join. That is, replace your for loop with this:
output = "\n".join(i[0] for i in matches)
g.write(output)
In order to avoid having to close your files explicitly, especially if your code might be interrupted by exceptions, you can use the with statement to make things simpler. The following code replaces the entire code in your question:
with open('this.txt') as f, open('that.txt', 'w') as g:
    text = f.read()
    matches = re.findall('', text) # do some re matching here
    g.write("\n".join(i[0] for i in matches))
or, since you don't need both files open at the same time:
with open('this.txt') as f:
    text = f.read()
matches = re.findall('', text) # do some re matching here
with open('that.txt', 'w') as g:
    g.write("\n".join(i[0] for i in matches))
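As a small end-to-end check of the join approach, the sketch below uses a made-up input string and a hypothetical two-group pattern (so that findall returns tuples and `m[0]` makes sense), writes to a temporary file, and reads it back.

```python
import os
import re
import tempfile

# Made-up input; the pattern has two groups, so findall returns tuples.
text = 'id=1 name=foo\nid=2 name=bar\n'
matches = re.findall(r'id=(\d+) name=(\w+)', text)

# Throwaway output file standing in for that.txt.
out_path = os.path.join(tempfile.mkdtemp(), 'that.txt')
with open(out_path, 'w') as g:
    # join puts '\n' between the items only, so nothing trails the last one
    g.write('\n'.join(m[0] for m in matches))

with open(out_path) as g:
    written = g.read()
print(repr(written))
```

The file ends right after the last match, with no blank line at the end.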