I'm a newbie who is just starting to learn Python.
I wanted to make a script that counts occurrences of a letter pattern in a text file. The problem is that my text file has multiple lines, and I couldn't find some of the patterns because they run over onto the next line.
My file and the pattern are a DNA sequence.
Example:
'attctcgatcagtctctctagtgtgtgagagactctagctagatcgtccactcactgac**ga
tc**agtcagt**gatc**tctcctactacaaggtgacatgagtgtaaattagtgtgagtgagtgaa'
I'm looking for 'gatc'. The second one was counted, but the first wasn't.
So, how can I turn this file into a one-line text file?
You can join the lines when you read the text from the file. Note that readlines() keeps the trailing newline on each line, so it is simpler to read the whole file and strip the newlines directly:
fd = open('dna.txt', 'r')
dnatext = fd.read().replace('\n', '')
print(dnatext.count('gatc'))
fd.close()
dnatext = text.replace('\n', '')  # join text lines
gatc_count = dnatext.count('gatc')  # count 'gatc' occurrences
This should do the trick :
dnatext = "".join(dnatext.split("\n"))
I am fairly new to Python. I have a text file full of common misspellings. The correct spelling of each word is prefixed with a $ character, and all misspelled versions of that word follow it, one per line.
mispelling.txt:
$year
eyar
yera
$years
eyars
eyasr
yeasr
yeras
yersa
I want to create a new text file, based on mispelling.txt, where the format appears as this:
new_mispelling.txt:
eyar->year
yera->year
eyars->years
eyasr->years
yeasr->years
yeras->years
yersa->years
The correct spelling of the word is on the right-hand side of its misspelling, separated by ->; on the same line.
Question:
How do I read in the file, treat a line starting with $ as a new word (and thus a new group of lines in my output file), populate the output file, and save it to disk?
The purpose of this is to make my collected data match the format of this open-source Wikipedia dataset of "all" commonly misspelled words, which doesn't contain my own entries of words and misspellings.
As you process the file line-by-line, if you find a word that starts with $, set that as the "currently active correct spelling". Then each subsequent line is a misspelling for that word, so format that into a string and write it to the output file.
current_word = ""
with open("mispelling.txt") as f_in, open("new_mispelling.txt", "w") as f_out:
    for line in f_in:
        line = line.strip()  # Remove whitespace at start and end
        if line.startswith("$"):
            # If the line starts with $, slice it from the second
            # character to the end and save it as current_word
            current_word = line[1:]
        else:
            # If it doesn't start with $, create the string we want
            # and write it.
            f_out.write(f"{line}->{current_word}\n")
With your input file, this gives:
eyar->year
yera->year
eyars->years
eyasr->years
yeasr->years
yeras->years
yersa->years
The f"{line}->{current_word}\n" construct is called an f-string and is used for string interpolation in Python 3.6+.
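As a quick, self-contained illustration of the f-string (the values here are made up):

```python
# f-strings (Python 3.6+) evaluate the expressions inside {} and
# substitute their values into the string
word = "eyar"
correct = "year"
line = f"{word}->{correct}"
print(line)
```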
A regex solution:
You can use the pattern r'\$(\w+)(.*?)(?=\$|$)': for each match, group 1 captures the correct word after $ and group 2 captures the block of misspellings that follows. Join each misspelling with the captured word using ->, then join the resulting lines with \n. Make sure to use the re.DOTALL flag, since it's a multi-line string:
import re

txt = '''$year
eyar
yera
$years
eyars
eyasr
yeasr
yeras
yersa'''

print('\n'.join('\n'.join('->'.join((v, m.group(1)))
                          for v in m.group(2).strip('\n').split('\n'))
                for m in re.finditer(r'\$(\w+)(.*?)(?=\$|$)', txt, re.DOTALL)))
OUTPUT:
eyar->year
yera->year
eyars->years
eyasr->years
yeasr->years
yeras->years
yersa->years
I'm leaving the file reading/writing to you, assuming that's not the part you're asking about.
Sorry if this question looks a bit dumb to some of you, but I'm a total beginner at programming in Python, so I'm quite bad and still have a lot to learn.
Basically, I have a long text file separated into paragraphs. Sometimes the blank line between paragraphs is doubled or tripled to make the task harder, so I added a little check for that, and it looks like it's working fine; I have a variable called "paragraph" that tells me which paragraph I'm currently in.
Now I need to scan this text file and search for certain sequences of words in it, but the newline character is the worst enemy here. For example, if I have the string "dummy text" and I'm looking into this:
"random questions about files with a dummy
text and strings
hey look a new paragraph here"
As you can see, there is a newline between "dummy" and "text", so reading the file line by line doesn't work. I was wondering whether I could load an entire paragraph directly into a string; that way I could remove punctuation and such more easily and check directly whether those sequences of words are contained in it.
All this must be done without libraries.
However, my paragraph counter runs while the file is being read, so if loading a whole paragraph into a string is possible, should I basically use something like "".join until the paragraph count increases by 1, because we're on the next paragraph? Any idea?
This should do the trick. It is very short and elegant:
with open('dummy text.txt') as file:
    data = file.read().replace('\n', ' ')  # use a space so words at line breaks don't run together
print(data)  # prints out the file
The output is:
"random questions about files with a dummy text and strings hey look a new paragraph here"
I don't think you need to overcomplicate this. Here is a very commonly used pattern for this kind of problem.
paragraphs = []
lines = []
for line in open('text.txt'):
    if not line.strip():  # empty line
        if lines:
            paragraphs.append("".join(lines))
            lines = []
    else:
        lines.append(line)
if lines:
    paragraphs.append("".join(lines))
When a stripped line is empty, you have hit the \n that ends a paragraph, which means you have to join the previous lines into a paragraph.
If you then encounter a 3rd \n, you must not join again, so the previous lines are cleared (lines = []); this way the same paragraph is never joined twice.
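To see the pattern in isolation, the same loop can be run over a small in-memory string (the sample text and its double/triple newlines are invented):

```python
# Sample text with a double and a triple newline between paragraphs
text = "first paragraph line one\nline two\n\nsecond paragraph\n\n\nthird paragraph\n"

paragraphs = []
lines = []
for line in text.splitlines(keepends=True):
    if not line.strip():  # blank line: end of a paragraph
        if lines:
            paragraphs.append("".join(lines))
            lines = []    # extra blank lines after this are ignored
    else:
        lines.append(line)
if lines:  # flush the last paragraph
    paragraphs.append("".join(lines))

print(paragraphs)
```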
To check the last line, try this pattern.
f = open('text.txt')
line0 = f.readline()
while True:
    # do what you have to do with the previous line, `line0`
    line = f.readline()
    if not line:  # `line0` was the last line
        # do what you have to do with the last line
        break
    line0 = line
You can strip the newline character. Here is an example from a different problem.
data = open('resources.txt', 'r')
book_list = []
for line in data:
    new_line = line.rstrip('\n')
    book_list.append(new_line)
This is what I have (edited after blhsing's answer):
import re

File1 = open('text.txt', 'r')
regex = re.compile(r'\b(?:[12]?\d{1,4}|30{4})#[^#]+#')
string = File1.read()
itemdesc = regex.findall(string)
for word in itemdesc:
    print(word)
By using: \b(?:[12]?\d{1,4}|30{4})#[^#]+# I could find:
5173#bunch of text here
of, bunch here, text
text here, bunch of
#
After finding this text, I would like to replace it in another file where a similar block exists.
At the current stage, I still need to implement something like:
\b(?:number)#[^#]+#
in order to find a text block and move/replace it in another file where one with the same number is located, first checking whether there are multiple occurrences.
After doing that, I will have another problem: saving the multiple occurrences and storing them in another text file so I can do the rest manually.
Hope you guys can help; any help is appreciated, it doesn't need to be a complete solution. :)
The problem here is that you're reading the file and matching the regex line by line when you actually want to match the regex over multiple lines. You should therefore read the entire file into one string before matching it against the regex:
import re

File1 = open('text.txt', 'r')
regex = re.compile(r'\b(?:[12]?\d{1,4}|30{4})#[^#]+#')
string = File1.read()
itemdesc = regex.findall(string)
for word in itemdesc:
    print(word)
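For the replace-in-another-file part of the question, a rough sketch is possible; everything below (the simplified number pattern, the sample strings, and the swap helper) is a hypothetical illustration, not the asker's actual data or final pattern:

```python
import re

# Simplified stand-in for the number pattern; capture the number so
# blocks can be matched up across the two texts
block_re = re.compile(r'\b(\d+)#[^#]+#')

source = "5173#new bunch of text#  999#other text#"
target = "intro 5173#old bunch of text# outro"

# Map each number to its full replacement block from the source text
replacements = {m.group(1): m.group(0) for m in block_re.finditer(source)}

def swap(match):
    # Keep the original block if the source has no entry for this number
    return replacements.get(match.group(1), match.group(0))

updated = block_re.sub(swap, target)
print(updated)
```

Checking for multiple occurrences of the same number would amount to counting how many times each key appears before building the dictionary.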
I am trying to write a python script to read in a large text file from some modeling results, grab the useful data and save it as a new array. The text file is output in a way that has a ## starting each line that is not useful. I need a way to search through and grab all the lines that do not include the ##. I am used to using grep -v in this situation and piping to a file. I want to do it in python!
Thanks a lot.
-Tyler
I would use something like this:
fh = open(r"C:\Path\To\File.txt", "r")
raw_text = fh.readlines()
clean_text = []
for line in raw_text:
    if not line.startswith("##"):
        clean_text.append(line)
Or you could also clean the newline and carriage return non-printing characters at the same time with a small modification:
for line in raw_text:
    if not line.startswith("##"):
        clean_text.append(line.rstrip("\r\n"))
You would be left with a list object that contains one line of required text per element. You could split each line into individual words using str.split(), which would give you a nested list (one list of words per original line) that you can easily index, assuming your text is whitespace-separated of course. After splitting, e.g. with split_text = [line.split() for line in clean_text],
split_text[4][7]
would return the 5th line's 8th word.
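A minimal sketch of that split-and-index step (the sample lines are invented):

```python
# Hypothetical cleaned lines standing in for clean_text
clean_text = ["alpha beta gamma", "one two three four"]

# Split each line on whitespace: a list of word-lists
split_text = [line.split() for line in clean_text]

# split_text[1][2] is the 2nd line's 3rd word
print(split_text[1][2])
```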
Hope this helps.
[Edit: corrected indentation in loop]
My suggestion would be to do the following:
listoflines = []
with open("file.txt", "r") as f:  # "r" = read; use your own filename here
    for line in f:
        if line[:2] != "##":  # check the first two characters
            listoflines.append(line)
print(listoflines)
If you're feeling brave, you can also do the following (credit goes to Alex Thornton):
with open("file.txt", "r") as f:
    listoflines = [l for l in f if not l.startswith('##')]
The other answer is great as well, especially for teaching the .startswith function, but I think this is the more Pythonic way, and the with statement also has the advantage of automatically closing the file as soon as you're done with it.
I have a text file with two lines:
<BLAHBLAH>483920349<FOOFOO>
<BLAHBLAH>4493<FOOFOO>
That's the only thing in the text file. Using Python, I want to rewrite the text file so that BLAHBLAH and FOOFOO are removed from each line. It seems like a simple task, but after refreshing my memory of file manipulation I can't seem to find a way to do it.
Help is greatly appreciated :)
Thanks!
If it's a text file as you say, and not HTML/XML/something else, just use replace:
for line in infile.readlines():
    cleaned_line = line.replace("BLAHBLAH", "")
    cleaned_line = cleaned_line.replace("FOOFOO", "")
and write cleaned_line to an output file.
f = open(path_to_file, "r+")  # "w+" would truncate the file before it could be read
text = f.read().replace("<BLAHBLAH>", "").replace("<FOOFOO>", "")
f.seek(0)      # rewind to the start before writing back
f.write(text)
f.truncate()   # drop leftover bytes from the longer original
f.close()
Update (saving to another file):
f = open(path_to_input_file, "r")
output = open(path_to_output_file, "w")
output.write(f.read().replace("<BLAHBLAH>","").replace("<FOOFOO>",""))
f.close()
output.close()
Consider the regular expressions module re.
result_text = re.sub(r'<(.|\n)*?>', replacement_text, source_text)
The pattern matches strings enclosed in < and >. It is non-greedy, i.e. it matches the shortest possible substring. For example, given "<1> text <2> more text", a greedy pattern would match "<1> text <2>", but the non-greedy pattern matches "<1>" and "<2>" separately.
And of course, your replacement_text would be '' and source_text would be each line from the file.
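A quick demonstration of the greedy versus non-greedy difference on one of the sample lines (using r'<.*?>', which behaves like the pattern above on single-line input):

```python
import re

source_text = "<BLAHBLAH>4493<FOOFOO>"

# Non-greedy: each <...> tag is matched separately and removed
stripped = re.sub(r'<.*?>', '', source_text)

# Greedy: <.*> swallows everything from the first < to the last >
greedy = re.sub(r'<.*>', '', source_text)

print(stripped)
print(greedy)
```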