I have a text file with two lines:
<BLAHBLAH>483920349<FOOFOO>
<BLAHBLAH>4493<FOOFOO>
That's the only thing in the text file. Using Python, I want to write to the text file so that I can take away BLAHBLAH and FOOFOO from each line. It seems like a simple task, but after brushing up on my file manipulation I can't seem to find a way to do it.
Help is greatly appreciated :)
Thanks!
If it's a text file as you say, and not HTML/XML/something else, just use replace:
for line in infile.readlines():
    cleaned_line = line.replace("BLAHBLAH", "")
    cleaned_line = cleaned_line.replace("FOOFOO", "")
and write cleaned_line to an output file.
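For example, a minimal sketch of the whole thing (input.txt and output.txt are just placeholder names):

with open("input.txt") as infile, open("output.txt", "w") as outfile:
    for line in infile:
        cleaned_line = line.replace("BLAHBLAH", "")
        cleaned_line = cleaned_line.replace("FOOFOO", "")
        outfile.write(cleaned_line)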
f = open(path_to_file, "r+")
text = f.read()  # read the original contents first
f.seek(0)
f.write(text.replace("<BLAHBLAH>", "").replace("<FOOFOO>", ""))
f.truncate()  # drop any leftover characters from the longer original
f.close()
Update (saving to another file):
f = open(path_to_input_file, "r")
output = open(path_to_output_file, "w")
output.write(f.read().replace("<BLAHBLAH>","").replace("<FOOFOO>",""))
f.close()
output.close()
Consider the regular expressions module re.
result_text = re.sub(r'<(.|\n)*?>', replacement_text, source_text)
The strings within < and > are identified. The pattern is non-greedy, i.e. it will match the shortest possible substring. For example, if you have "<1> text <2> more text", a greedy pattern would take in "<1> text <2>", but a non-greedy one takes in "<1>" and "<2>".
And of course, your replacement_text would be '' and source_text would be each line from the file.
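Applied line by line to the file above, a minimal sketch (input.txt and output.txt are placeholder names):

import re

with open("input.txt") as infile, open("output.txt", "w") as outfile:
    for line in infile:
        outfile.write(re.sub(r'<(.|\n)*?>', '', line))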
This is what I got (edited after blhsing's answer):
import re
File1 = open('text.txt', 'r')
regex = re.compile(r'\b(?:[12]?\d{1,4}|30{4})#[^#]+#')
string = File1.read()
itemdesc = regex.findall(string)
for word in itemdesc:
    print(word)
By using: \b(?:[12]?\d{1,4}|30{4})#[^#]+# I could find:
5173#bunch of text here
of, bunch here, text
text here, bunch of
#
After finding this text I would like to replace it in another file where a similar one exists.
At the current stage, I still need to implement something like:
\b(?:number)#[^#]+#
in order to find a block by its number and replace the one with the same number in another file, checking beforehand whether there are multiple occurrences.
After that, I will have another problem: saving those multiple occurrences to another text file so I can handle the rest manually.
Hope you guys can help; any help is appreciated, it doesn't need to be a full solution. :)
The problem here is that you're reading the file and matching the regex line by line when you actually want to match the regex over multiple lines. You should therefore read the entire file into one string before matching it against the regex:
import re
File1 = open('text.txt', 'r')
regex = re.compile(r'\b(?:[12]?\d{1,4}|30{4})#[^#]+#')
string = File1.read()
itemdesc = regex.findall(string)
for word in itemdesc:
    print(word)
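For the follow-up step the question describes (replacing the block with the same leading number in a second file), one possible sketch, assuming both files use the same number#...# layout and the hypothetical names source.txt and target.txt; the number is made a capturing group here:

import re

pattern = re.compile(r'\b([12]?\d{1,4}|30{4})#[^#]+#')

with open('source.txt') as f:
    source = f.read()
with open('target.txt') as f:
    target = f.read()

# map each leading number to its full block in the source file
# (if a number appears more than once, the last occurrence wins here)
blocks = {m.group(1): m.group(0) for m in pattern.finditer(source)}

# swap each block in the target for the source block with the same number,
# leaving it unchanged when that number does not exist in the source
updated = pattern.sub(lambda m: blocks.get(m.group(1), m.group(0)), target)

with open('target.txt', 'w') as f:
    f.write(updated)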
I am trying to delete comments that start on their own lines in a Python code file, using Python and regular expressions. For example, for this input:
first line
#description
hello my friend
I would like to get this output:
first line
hello my friend
Unfortunately this code didn't work for some reason:
with open(input_file, "r+") as f:
    string = re.sub(re.compile(r'\n#.*'), "", f.read())
    f.seek(0)
    f.write(string)
For some reason the output I get is the same as the input.
1) There is no reason to call re.compile unless you save the result. You can always just use the regular expression text.
2) Seeking to the beginning of the file and writing there may cause problems for you if your replacement text is shorter than your original text. It is easier to re-open the file and write the data.
Here is how I would fix your program:
import re

input_file = 'in.txt'

with open(input_file, "r") as f:
    data = f.read()

data = re.sub(r'\n#.*', "", data)

with open(input_file, "w") as f:
    f.write(data)
It doesn't seem right to start the regular expression with \n, and I don't think you need to use re.compile here.
In addition to that, you have to use the re.M flag to make the search multiline.
This will delete all lines that start with # and empty lines.
import re

with open(input_file, "r+") as f:
    text = f.read()
    string = re.sub(r'^(#.*)|(\s*)$', '', text, flags=re.M)
    f.seek(0)      # go back to the start; read() left the pointer at the end
    f.write(string)
    f.truncate()   # drop leftover characters from the longer original
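To see what re.M changes, here is a small sketch with a simplified pattern (not the exact regex above): without the flag, ^ only anchors at the very start of the whole string; with it, ^ anchors at the start of every line.

import re

sample = "first line\n#description\nhello my friend"

# without re.M, ^ only matches at the very start of the string, so nothing changes
print(re.sub(r'^#.*\n?', '', sample))
# with re.M, ^ matches at the start of every line, so the comment line is removed
print(re.sub(r'^#.*\n?', '', sample, flags=re.M))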
I am trying to find an explicit sub-string on a specific line in a text file, '"swp_pt", "3"', double quotes and all. I want to change the number to any other number; specifically, I need to go to the first integer after the quoted swp_pt variable and change only it. So far I am just trying to find the correct swp_pt call in the text file and have not managed even that yet.
Here is my code so far:
import re

ddsFile = open('Product_FD_TD_SI_s8p.dds')
for line in ddsFile:
    print(line)
    marker = re.search('("swp_pt", ")[0-9]+', line)
    print(marker)
    if marker:  # .group() raises AttributeError on lines with no match
        print(marker.group())
ddsFile.close()
If anyone has a clue how to do this, I would very much appreciate your help.
Mike
Do you really need to do this in Python? sed -i will do what you want and is considerably simpler.
But if you need it, I would do something like:
import re

def replace_swp_pt(line):
    regex = r'"swp_pt", "(\d+)"'
    replacement = '"swp_pt", "4"'
    return re.sub(regex, replacement, line)

def transform_file(file_name, transform_line_func):
    with open(file_name, 'r') as f:
        # Buffer the full contents in memory. This only works if your file
        # fits in memory; otherwise you will need to use a temporary file.
        file_contents = f.read()
    with open(file_name, 'w') as f:
        for line in file_contents.split('\n'):
            transformed_line = transform_line_func(line)
            f.write(transformed_line + '\n')

if __name__ == '__main__':
    transform_file('Product_FD_TD_SI_s8p.dds', replace_swp_pt)
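If the new number isn't fixed, note that re.sub also accepts a function as the replacement, so the matched value can be computed instead of hard-coded. A small sketch (set_swp_pt and the value 7 are just illustrative):

import re

def set_swp_pt(line, new_value):
    # keep the surrounding text, swap only the digits inside the second quotes
    return re.sub(r'("swp_pt", ")(\d+)(")',
                  lambda m: m.group(1) + str(new_value) + m.group(3),
                  line)

print(set_swp_pt('"swp_pt", "3"', 7))  # "swp_pt", "7"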
I am trying to write a Python script to read in a large text file from some modeling results, grab the useful data, and save it as a new array. The text file is written so that every line that is not useful starts with ##. I need a way to search through and grab all the lines that do not include the ##. I am used to using grep -v in this situation and piping to a file. I want to do it in Python!
Thanks a lot.
-Tyler
I would use something like this:
fh = open(r"C:\Path\To\File.txt", "r")
raw_text = fh.readlines()
fh.close()
clean_text = []
for line in raw_text:
    if not line.startswith("##"):
        clean_text.append(line)
Or you could also clean the newline and carriage return non-printing characters at the same time with a small modification:
for line in raw_text:
    if not line.startswith("##"):
        clean_text.append(line.rstrip("\r\n"))
You would be left with a list that contains one line of the required text per element. You could split each element into individual words with the .split() string method, which gives you a nested list (one list of words per original line) that is easy to index (assuming your text is whitespace-separated, of course). After that split, for example,
clean_text[4][7]
would return the 5th line, 8th word.
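A minimal sketch of that split step:

# split each cleaned line into a list of words
clean_text = [line.split() for line in clean_text]
print(clean_text[4][7])  # 5th line, 8th word (assuming the file has that many)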
Hope this helps.
[Edit: corrected indentation in loop]
My suggestion would be to do the following:
listoflines = []

with open("file.txt", "r") as f:  # "file.txt" = your file, "r" = read mode
    for line in f:
        if line[:2] != "##":  # compare the first two characters
            listoflines.append(line)

print(listoflines)
If you're feeling brave, you can also do the following (credit goes to Alex Thornton):
listoflines = [l for l in f if not l.startswith('##')]
The other answer is great as well, especially for teaching the .startswith function, but I think this is the more Pythonic way, and combined with the with statement it also has the advantage of closing the file automatically as soon as you're done with it.
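Put together, a minimal sketch (results.txt is a placeholder name):

with open("results.txt", "r") as f:
    listoflines = [l for l in f if not l.startswith("##")]

print(listoflines)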
I'm a newbie who is just starting to learn Python.
I wanted to make a script that counts occurrences of a letter pattern in a text file. The problem is that my text file has multiple lines, and I couldn't find some of the patterns because they ran over onto the next line.
My file and the pattern are a DNA sequence.
Example:
'attctcgatcagtctctctagtgtgtgagagactctagctagatcgtccactcactgac**ga
tc**agtcagt**gatc**tctcctactacaaggtgacatgagtgtaaattagtgtgagtgagtgaa'
I'm looking for 'gatc'. The second one was counted, but the first wasn't.
So, how can I turn this file into a one-line text file?
You can join the lines when you read the text from the file:
fd = open('dna.txt', 'r')
dnatext = ''.join(line.strip() for line in fd)  # strip the newlines so the pieces join up
fd.close()
print(dnatext.count('gatc'))
# assuming 'text' already holds the contents of the file
dnatext = text.replace('\n', '')    # join the text lines
gatc_count = dnatext.count('gatc')  # count 'gatc' occurrences
This should do the trick:
dnatext = "".join(dnatext.split("\n"))