Load a text file paragraph into a string without libraries - python

Sorry if this question looks a bit dumb to some of you, but I'm a total beginner at programming in Python and I've still got a lot to learn.
I have a long text file separated into paragraphs. Sometimes the separating newline is doubled or tripled to make the task harder, so I added a small check that seems to work fine: a variable called "paragraph" tells me which paragraph I am currently in.
Now I need to scan this text file and search for certain sequences of words, but the newline character is the worst enemy here. For example, if I have the string "dummy text" and I'm looking in this:
"random questions about files with a dummy
text and strings
hey look a new paragraph here"
As you can see, there is a newline between "dummy" and "text", so reading the file line by line doesn't work. So I was wondering whether I could load an entire paragraph into a string; that way I could also remove punctuation more easily and check directly whether those sequences of words are contained in it.
All this must be done without libraries.
However, my paragraph counter works while the file is being read, so if loading a whole paragraph into a string is possible, should I use something like "".join until the paragraph counter increases by 1, meaning we're in the next paragraph? Any ideas?

This should do the trick. It is short and simple:
with open('dummy text.txt') as file:
    # replace each newline with a space so words on adjacent lines stay separated
    data = file.read().replace('\n', ' ')
print(data)  # prints the file contents on a single line
For the sample above, the output is:
"random questions about files with a dummy text and strings hey look a new paragraph here"

I don't think you need to overcomplicate this. Here is a very commonly used pattern for this kind of problem.
paragraphs = []
lines = []
for line in open('text.txt'):
    if not line.strip():  # empty line
        if lines:
            paragraphs.append("".join(lines))
            lines = []
    else:
        lines.append(line)  # the line keeps its trailing '\n'
if lines:  # the last paragraph may not be followed by an empty line
    paragraphs.append("".join(lines))
If a stripped line is empty, you have hit the second \n in a row, which means the previous lines must be joined into a paragraph.
If you then hit a third \n, you must not join again, so the collected lines are cleared (lines = []) right after joining. This way the same paragraph is never added twice.
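Since the original question is about finding a phrase that is broken across lines, here is a minimal sketch of the same grouping idea that also replaces line breaks with single spaces. The function name `split_paragraphs` and the sample data are made up for illustration; no imports are needed.

```python
def split_paragraphs(lines):
    """Group lines into paragraphs; one or more blank lines separate them."""
    paragraphs = []
    current = []  # words collected for the paragraph being built
    for line in lines:
        words = line.split()  # no arguments: split on any whitespace
        if words:
            current.extend(words)
        elif current:
            # blank line after some text: the paragraph is complete
            paragraphs.append(" ".join(current))
            current = []
    if current:  # last paragraph without a trailing blank line
        paragraphs.append(" ".join(current))
    return paragraphs


sample_lines = [
    "random questions about files with a dummy\n",
    "text and strings\n",
    "\n",
    "\n",
    "hey look a new paragraph here\n",
]
paragraphs = split_paragraphs(sample_lines)
for number, paragraph in enumerate(paragraphs, start=1):
    if "dummy text" in paragraph:
        print("found in paragraph", number)
```

An open file object can be passed instead of the list, because iterating over a file yields its lines.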
To handle the last line specially, try this pattern.
f = open('text.txt')
line0 = f.readline()
while True:
    # do what you have to do with the previous line, `line0`
    line = f.readline()
    if not line:  # `line0` was the last line
        # do what you have to do with the last line
        break
    line0 = line
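The one-line lookahead above can also be wrapped in a generator, so the calling loop only sees a flag. This is a sketch under the assumption that the input is any iterable of lines; the name `with_last_flag` is invented for this example.

```python
def with_last_flag(lines):
    """Yield (line, is_last) pairs using one line of lookahead."""
    iterator = iter(lines)
    try:
        previous = next(iterator)
    except StopIteration:
        return  # empty input: nothing to yield
    for line in iterator:
        # another line follows, so `previous` was not the last one
        yield previous, False
        previous = line
    yield previous, True


flags = list(with_last_flag(["a\n", "b\n", "c\n"]))
for line, is_last in flags:
    print(repr(line), is_last)
```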

You can strip the newline character. Here is an example from a different problem.
data = open('resources.txt', 'r')
book_list = []
for line in data:
    new_line = line.rstrip('\n')
    book_list.append(new_line)

Related

List the first words per line from a text file in Python

I need to select the first word on each line and make a list from them from a text file:
I would copy the text here, but the formatting is quite screwed up. I will try.
All the other text is unnecessary.
I have tried
string=[]
for line in f:
    string.append(line.split(None, 1)[0]) # add only first word
from another solution, but it keeps returning an "Index out of bounds" error.
I can get the first word from the first line using string=text.partition(' ')[0]
but I do not know how to repeat this for the other lines.
I am still new to python and to the site, I hope my formatting is bearable! (when opened, I encode the text to accept symbols, like so
wikitxt=open('racinesPrefixesSuffixes.txt', 'r', encoding='utf-8')
could this be the issue?)
The reason it's raising an IndexError is that the offending line is empty, so split() returns an empty list and there is no element [0].
You can do this:
words = []
for line in f:
    if line.strip():
        words.append(line.split(maxsplit=1)[0])
Here line.strip() checks whether anything is left after removing whitespace. If the line consists only of whitespace, it is skipped.
Or, if you like list comprehension:
words = [line.split(maxsplit=1)[0] for line in f if line.strip()]
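A small variant, offered as a sketch rather than a correction: split each line once and test the result, instead of calling strip() and split() separately. The helper name `first_words` and the sample lines are illustrative only.

```python
def first_words(lines):
    """Return the first whitespace-separated word of every non-blank line."""
    words = []
    for line in lines:
        parts = line.split(maxsplit=1)  # [] for empty or whitespace-only lines
        if parts:
            words.append(parts[0])
    return words


sample = ["alpha beta gamma\n", "\n", "   \n", "delta\n"]
print(first_words(sample))
```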

Iterative regex in Python: find and replace

This is one of those "I'd know how to do it in C" type questions. :p
I'm asking this as similar questions in SO don't have a particular aspect I'm looking for.
I'm essentially looking to find and replace items that also have possessive forms. So if there is a "rabbit" in the list, and also a "rabbit's", then replace "rabbit" with a series of asterisks.
Something along the lines of:
#!/usr/bin/env python
import re
text = open("list.txt", "w")
for line in text:
    test = line
    if re.match(test+"'", line) or re.match(test+"'s", line):
        line = "****"
However, this clearly won't work, as the for-each mechanism makes line be used for both iteration and pattern matching.
with open('file.txt') as f:
    # Remove the \n characters at the end of each line
    all_lines = [x.strip() for x in f.readlines()]
for line in all_lines:
    # Check for presence of word' or word's
    if line+"'" in all_lines or line+"'s" in all_lines:
        print('****')
    else:
        print(line)
It's worth noting that this is quite a brute-force approach: it loads the whole file into memory and, for huge lists, will take a while because every line is checked against the whole list. But it should give you the idea.
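For larger lists, one possible speed-up is to keep the words in a set, so each membership test no longer scans the whole list. This is only a sketch of that idea; `mask_possessed` and the sample words are hypothetical.

```python
def mask_possessed(words):
    """Replace a word with **** if its possessive form is also in the list."""
    present = set(words)  # set lookups do not scan every element
    result = []
    for word in words:
        if word + "'" in present or word + "'s" in present:
            result.append("****")
        else:
            result.append(word)
    return result


masked = mask_possessed(["rabbit", "rabbit's", "dog", "cats", "cats'"])
print(masked)
```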
You can use str.endswith:
text = open("list.txt", "r")
for line in text:
    test = line.strip()
    if test.endswith("'s"):
        line = "****"
Here is why your code is not going to work.
Replace this:
test = line
with:
test = line.strip() # to remove the newline character
Otherwise test will be "rabbit\n", because the newline character is not removed.
You also need to open the file in read mode:
text = open("list.txt",'r')
Your match is not going to work either. Think of it:
suppose test="rabbit's"
test+"'" will give you `rabbit's'`

Searching a text file and grabbing all lines that do not include ## in python

I am trying to write a python script to read in a large text file from some modeling results, grab the useful data and save it as a new array. The text file is output in a way that has a ## starting each line that is not useful. I need a way to search through and grab all the lines that do not include the ##. I am used to using grep -v in this situation and piping to a file. I want to do it in python!
Thanks a lot.
-Tyler
I would use something like this:
fh = open(r"C:\Path\To\File.txt", "r")
raw_text = fh.readlines()
clean_text = []
for line in raw_text:
    if not line.startswith("##"):
        clean_text.append(line)
Or you could also strip the newline and carriage-return characters at the same time with a small modification:
for line in raw_text:
    if not line.startswith("##"):
        clean_text.append(line.rstrip("\r\n"))
You would be left with a list that contains one line of required text per element. You could split each line into individual words using line.split(), which would give you a nested list that you can easily index (assuming your text is whitespace-separated, of course). After splitting,
clean_text[4][7]
would return the 8th word of the 5th line.
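To make the nested-list indexing concrete, here is a short sketch that filters and splits in one pass. The function name `useful_words` and the sample lines are assumptions made for the example, not part of the original modeling output.

```python
def useful_words(lines):
    """Skip lines starting with ## and split the rest into words."""
    rows = []
    for line in lines:
        if not line.startswith("##"):
            rows.append(line.split())  # split() also drops the trailing newline
    return rows


sample = ["## header\n", "1.0 2.0 3.0\n", "## note\n", "4.0 5.0 6.0\n"]
rows = useful_words(sample)
print(rows[1][2])  # second kept line, third word
```

Note that a blank line would produce an empty list in `rows`, so indexing into it would fail; filter those out first if the file may contain them.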
Hope this helps.
[Edit: corrected indentation in loop]
My suggestion would be to do the following:
listoflines = []
with open("file.txt", "r") as f:  # "file.txt" stands in for your file name, "r" = read
    for line in f:
        if line[:2] != "##":  # compare the first two characters
            listoflines.append(line)
print(listoflines)
If you're feeling brave, you can also do the following inside the with block (credit goes to Alex Thornton):
listoflines = [l for l in f if not l.startswith('##')]
The other answer is great as well, especially for teaching the .startswith method, but I think this is the more Pythonic way, and it also has the advantage of automatically closing the file as soon as you're done with it.
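For completeness, a runnable sketch of the list-comprehension form. The comprehension has to run while the file is still open, so it sits inside the with block. The temporary file and its contents are invented here purely so the example runs on its own.

```python
import os
import tempfile

# Write a small throwaway file so the sketch is self-contained.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("## skip me\nkeep 1\n## skip too\nkeep 2\n")
    path = tmp.name

with open(path, "r") as f:
    # the comprehension reads from f, so it must run before the block ends
    listoflines = [l for l in f if not l.startswith("##")]

os.remove(path)  # clean up the throwaway file
print(listoflines)
```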

how to : multiline to oneline by removing newlines

I'm a newbie in python who is starting to learn it.
I wanted to make a script that counts occurrences of a letter pattern in a text file. The problem is that my text file has multiple lines, and I couldn't find some of my patterns because they continue onto the next line.
My file and the pattern are a DNA sequence.
Example:
'attctcgatcagtctctctagtgtgtgagagactctagctagatcgtccactcactgac**ga
tc**agtcagt**gatc**tctcctactacaaggtgacatgagtgtaaattagtgtgagtgagtgaa'
I'm looking for 'gatc'. The second one was counted, but the first wasn't.
So, how can I turn this file into a one-line text file?
You can join the lines when you read the pattern from the file, then remove the newlines before counting:
fd = open('dna.txt', 'r')
text = ''.join(fd.readlines())  # whole file as one string, newlines still included
dnatext = text.replace('\n', '')  # join text lines
gatc_count = dnatext.count('gatc')  # count 'gatc' occurrences
This should do the trick :
dnatext = "".join(dnatext.split("\n"))
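A tiny worked example may help show why the newlines matter. This is a sketch with a made-up sequence; `count_pattern` is a hypothetical helper, and note that str.count counts non-overlapping occurrences.

```python
def count_pattern(text, pattern):
    """Count pattern in text after removing line breaks."""
    joined = text.replace("\r", "").replace("\n", "")
    return joined.count(pattern)


sample = "aga\ntcgatc\n"  # the first 'gatc' is split across two lines
raw_count = sample.count("gatc")             # misses the split occurrence
joined_count = count_pattern(sample, "gatc")  # sees both occurrences
print(raw_count, joined_count)
```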

Jython code for deleting spaces from Text file

I am trying to write Jython code for deleting spaces from a text file. I have the following scenario.
I have a text file like
STARTBUR001 20120416
20120416MES201667 20120320000000000201203210000000002012032200000000020120323000000000201203240000000002012032600000000020120327000000000201203280000000002012032900000000020120330000000000
20120416MES202566 2012030500000000020120306000000000201203070000000002012030800000000020120309000000000201203100000000002012031100000000020120312000000000201203130000000002012031400000000020
20120416MES275921 20120305000000000201203060000000002012030700000000020120308000000000201203090000000002012031000000000020120311000000000201203120000000002012031300000000020120314000000000
END 0000000202
Here all lines are single lines.
But what i want is like
STARTBUR001 20120416
20120416MES201667 20120320000000000201203210000000002012032200000000020120323000000000201203240000000002012032600000000020120327000000000201203280000000002012032900000000020120330000000000
20120416MES202566 2012030500000000020120306000000000201203070000000002012030800000000020120309000000000201203100000000002012031100000000020120312000000000201203130000000002012031400000000020
20120416MES275921 20120305000000000201203060000000002012030700000000020120308000000000201203090000000002012031000000000020120311000000000201203120000000002012031300000000020120314000000000
END 0000000202
So in all, I want to start checking from the second line until I encounter END, and delete all spaces at the end of each line.
Can someone guide me in writing this code?
I tried this:
srcfile=open('d:/BUR001.txt','r')
trgtfile=open('d:/BUR002.txt','w')
readfile=srcfile.readline()
while readfile:
    trgtfile.write(readfile.replace('\s',''))
    readfile=srcfile.readline()
srcfile.close()
trgtfile.close()
Thanks,
Mahesh
You can use the fact that those special lines start with special values:
line = srcfile.readline()
while line:
    line2 = line
    if not line2.startswith('START') and not line2.startswith('END'):
        line2 = line2.replace(' ','')
    trgtfile.write(line2)
    line = srcfile.readline()
Also note that strings returned by readline() end with \n (or are empty at the end of the input file), and that this code removes all spaces from the line, not only those at the end of the line.
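If only the trailing spaces should go, as the question asks, rstrip is one option. The sketch below assumes the START and END lines are to be left untouched and that every other line should end in a single \n; the function name and sample records are made up. It uses only string methods, which should also be available in Jython.

```python
def strip_trailing_spaces(lines):
    """Remove trailing spaces from every line except START/END lines."""
    cleaned = []
    for line in lines:
        if line.startswith("START") or line.startswith("END"):
            cleaned.append(line)  # header and trailer are kept as they are
        else:
            # drop trailing spaces and the line ending, then add one newline back
            cleaned.append(line.rstrip(" \r\n") + "\n")
    return cleaned


sample = [
    "STARTBUR001 20120416  \n",
    "20120416MES201667 2012032000   \n",
    "END 0000000202\n",
]
cleaned = strip_trailing_spaces(sample)
print(cleaned)
```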
If I understood your example correctly and all you want is to remove empty lines, then instead of reading the file line by line, read it all at once:
content = srcfile.read()
and then remove empty lines from content:
while '\n\n' in content:
    content = content.replace('\n\n', '\n')
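The loop is needed because a single replace pass can leave pairs behind (three newlines in a row become two). An alternative sketch, which behaves slightly differently because it also drops whitespace-only lines and any leading empty line, filters the lines directly. `drop_empty_lines` is a name chosen for this example.

```python
def drop_empty_lines(content):
    """Return content without empty or whitespace-only lines."""
    kept = [line for line in content.split("\n") if line.strip()]
    # end with a single newline unless nothing is left
    return "\n".join(kept) + "\n" if kept else ""


sample = "a\n\n\nb\n \nc\n"
result = drop_empty_lines(sample)
print(repr(result))
```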
