How to read from text file into array paragraph by paragraph?

I'm making a text-based game and want to read the story text file paragraph by paragraph, rather than printing a fixed number of characters at a time. The story file looks like this:
You wake up from a dazed slumber to find yourself in a deep dank cave with moonlight casting upon the entrance...
You see a figure approaching towards you... Drawing nearer you hear him speak...

You want this: my_list = my_string.splitlines()
https://docs.python.org/3/library/stdtypes.html#str.splitlines

As @martineau suggested, you need a delimiter that separates the paragraphs.
This can simply be the newline character (\n). Once you have chosen one, read the whole content of the file and split it on that delimiter.
This gives you a list in which each element is a paragraph.
Some example code:
delimiter = "\n"
with open("paragraphs.txt", "r") as paragraphs_file:
    all_content = paragraphs_file.read()  # reading all the content in one step
#using the string methods we split it
paragraphs = all_content.split(delimiter)
The drawback of this approach is that it reads the whole content at once, so a big file fills memory with text that the story does not need yet.
Looking at your example text, and given that you will print the text as you retrieve it, reading one line at a time could be a better solution:
with open("paragraphs.txt", "r") as paragraphs_file:
    for paragraph in paragraphs_file:  # one line at a time until the end of the file
        if paragraph != "\n":
            print(paragraph)
Obviously add some logic control where you need it.
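If the story file separates paragraphs with blank lines rather than single newlines (an assumption, since the question does not say), splitting on blank lines keeps multi-line paragraphs together. A minimal sketch, where split_paragraphs and the sample story text are hypothetical names and data used only for illustration:

```python
import re


def split_paragraphs(text):
    """Split text into paragraphs separated by one or more blank lines."""
    # \n\s*\n matches a newline, optional whitespace (including further
    # newlines), and another newline, i.e. at least one blank line.
    parts = re.split(r"\n\s*\n", text.strip())
    return [part.strip() for part in parts if part.strip()]


# Sample text standing in for the contents of the story file.
story = (
    "You wake up in a deep dank cave.\n"
    "Moonlight falls on the entrance.\n"
    "\n"
    "You see a figure approaching.\n"
)

paragraphs = split_paragraphs(story)
for paragraph in paragraphs:
    print(paragraph)
    print("---")
```

For a real file, the same function can be applied to the string returned by read().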

Related

Python regex fullmatch doesn't work as expected

I have a text file that contains some sentences. I'm checking whether they are valid sentences based on some rules and writing "valid" or "not valid" to a separate text file. My main problem is that when I press Ctrl+F and enter my regex in the search bar, it matches the strings I want to match, but in code it behaves differently. Here is my code:
import re
pattern = re.compile('(([A-Z])[a-z\s,]*)((: ["‘][a-z,!?\.\s]*["’][.,!?])|(; [a-zA-Z\s]*[!.?])|(\s["‘][a-z,.;!?\s]*["’])|([\.?!]))')
text=open('validSentences',"w+")
with open('sentences.txt',encoding='utf8') as file:
    lines = file.readlines()
    for line in lines:
        matches = pattern.fullmatch(line)
        if(matches==None):
            text.write("not valid"+"\n")
        else:
            text.write("valid"+"\n")
file.close()
The documentation says that fullmatch only matches when the whole string matches, and that's what I'm trying to do, but this code writes "not valid" for every sentence I have. The text file that I have:
How can you say that to me?
As he looked at his reflection in the mirror, he took a deep breath.
He nodded at himself and, feeling braver, he stepped outside the bathroom. He bumped straight into the
extremely tall man, who was waiting by the door.
David said ‘Oh, sorry!’.
The happy pair discussed their future life 2gether and shared sweet words of admiration.
We will not stop you; I promise!
Come here ASAP!
He pushed his chair back and went to the kitchen at 2 pM.
I do not know...
The main character in the movie said: "Play hard. Work harder."
When I enter my regex in VS Code with Ctrl+F, the whole first, second, fourth, seventh and eighth lines are highlighted, so according to the fullmatch() function they should be written as "valid", but they aren't. I need help with this issue.

First, you don't need lines = file.readlines(); you can iterate over the file object directly. More importantly, keep in mind that when using for line in lines: (or for line in file:), the line variable has a trailing newline, which your pattern does not match, so fullmatch fails. So
Either use line=line.rstrip() to remove the trailing whitespace before running the regex, or
Ensure your pattern ends in \n? (an optional newline), or even \s* (any zero or more whitespace).
A possible solution looks like
with open('sentences.txt',encoding='utf8') as file:
    for line in file:
        matches = pattern.fullmatch(line.rstrip('\n'))
        ...
Or,
pattern = re.compile(r'([A-Z][a-z\s,]*)(?:: ["‘][a-z,!?\.\s]*["’][.,!?]|; [a-zA-Z\s]*[!.?]|\s["‘][a-z,.;!?\s]*["’]|[.?!])\s*')
#...
with open('sentences.txt',encoding='utf8') as file:
    for line in file:
        ....
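The effect of the trailing newline on fullmatch can be seen in isolation. The sketch below uses a deliberately simplified pattern (an assumption for illustration, not the pattern from the question) and one sentence from the sample file:

```python
import re

# Simplified, hypothetical pattern: a capital letter, then lowercase
# letters, spaces or commas, then one closing punctuation mark.
simple = re.compile(r"[A-Z][a-z ,]*[.?!]")

# A line as it comes out of a file: the newline is still attached.
line = "How can you say that to me?\n"

raw_match = simple.fullmatch(line)
stripped_match = simple.fullmatch(line.rstrip("\n"))

print(raw_match is None)           # the newline prevents a full match
print(stripped_match is not None)  # without it, the whole string matches
```

An editor's search box highlights the sentence without its line ending, which is why the same regex appears to work there.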

python/regex copy paragraphs in order to another txt document

I'm working on what I initially thought would be a pretty simple program. Essentially, it should find key words then copy that paragraph to another document. What I want to do is take content from document 1 (both are .txt files) and re-order the paragraphs into a desired order.
I think I've written my python part correctly, as it works with other snippets (or seems to just fine), but the regex part (admittedly I'm very new to this) for some reason does not work.
I've tried a number of things and searched all through stack overflow. What I have currently "catches" almost the entire txt file instead of just the paragraph. This may be obvious but in addition to it catching most of the document, it's catching paragraphs without the target term (in this case, discussing) in it.
I appreciate all help in advance.
import re
def write_function():
    with open('minnar.txt','r') as rf, open('regexoutput.txt', 'a') as wf:
        content = rf.read()
        matches = target.findall(content)
        print(matches)
        for match in matches:
            wf.write(match + '\n \n')
target = re.compile('([^\']*(?=discussing)[^\']*)')
write_function()

If by "paragraph" you mean the text between quotes, then the regex should be as follows:
\'([^']+)\'
https://pythex.org/?regex=%5C%27(%5B%5E%27%5D%2B)%5C%27&test_string=This%20is%20%27the%20thing%27%20that%20I%20talked%20about.%20And%20I%20think%20this%20%27should%20be%20the%20one%20that%20they%20expected%27&ignorecase=0&multiline=0&dotall=1&verbose=0
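If instead the paragraphs in the document are separated by blank lines (an assumption, since the question does not show the file), a single regex is not strictly needed: the text can be split into paragraphs first and then filtered by keyword. A minimal sketch, where paragraphs_containing and the sample text are hypothetical:

```python
import re


def paragraphs_containing(text, keyword):
    """Return the blank-line-separated paragraphs that contain keyword."""
    paragraphs = re.split(r"\n\s*\n", text.strip())
    return [p for p in paragraphs if keyword in p]


# Sample text standing in for the contents of the source document.
sample = (
    "We met on Monday.\n"
    "\n"
    "We were discussing the budget\n"
    "for hours.\n"
    "\n"
    "Then we left.\n"
)

matches = paragraphs_containing(sample, "discussing")
print(matches)
```

Calling the function once per keyword, in the desired order, gives the reordering described in the question.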

How to extract subset of data in groups before and after a string

I have a text file. Based on a specific word, the data should be split into two groups: everything before the word as one group and everything after it as another group.
The text file looks something like this:
hello every one
Is any space here?
CHAIN
every thing of the
file lies here
Based on CHAIN, we separate the text file into two groups:
group 1
hello every one
Is any space here?
group 2
every thing of the
file lies here

You can try a solution that uses split and accesses each part by index, as shown below.
a = """
hello every one
Is any space here?
CHAIN
every thing of the
file lies here
"""
print(a.split("CHAIN")[0])
print(a.split("CHAIN")[1])

You mentioned you have a text file, say test.txt.
The code:
with open("test.txt", "r") as f:
    data = f.readlines()
part1, part2 = "".join(data).split("CHAIN")
print(part1)
print(part2)
Gives me:
hello every one
Is any space here?
every thing of the
file lies here
The other solution works well too.

Just for completeness (the other answers work as well):
If you have a text file:
file = open('file.txt', 'r').read()
print(file.split('CHAIN'))
# if you want to remove the surrounding whitespace, including newlines (\n)
print([text.strip() for text in file.split('CHAIN')])
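Another option, not mentioned in the answers above, is str.partition, which splits at the first occurrence of the separator only and always returns three parts, so it does not fail on unpacking when the marker is missing or repeated. A minimal sketch using the sample text from the question:

```python
# The file contents from the question, written out as a string.
text = (
    "hello every one\n"
    "Is any space here?\n"
    "CHAIN\n"
    "every thing of the\n"
    "file lies here\n"
)

# partition returns (text before, separator, text after).
before, separator, after = text.partition("CHAIN")

group1 = before.strip()
group2 = after.strip()
found = bool(separator)  # the separator part is empty if CHAIN was absent

print(group1)
print(group2)
```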

split() not splitting all white spaces?

I am trying to take a text document and write each word separately into another text document. My only issue is that, with the code I have, sometimes the words aren't all split on the white space, and I'm wondering if I'm just using .split wrong. If so, could you explain why or what to do better?
Here's my code:
list_of_words = []
with open('ExampleText.txt', 'r') as ExampleText:
    for line in ExampleText:
        for word in line.split(''):
            list_of_words.append(word)
print("Done!")
print("Also done!")
with open('TextTXT.txt', 'w') as EmptyTXTdoc:
    for word in list_of_words:
        EmptyTXTdoc.write("%s\n" % word)
EmptyTXTdoc.close()
This is the first line in the ExampleText text document as it is written in the newly created EmptyTXTdoc:
Submit
a personal
statement
of
research
and/or
academic
and/or
career
plans.

Use .split() (or .split(' ') for only spaces) instead of .split('').
Also, consider cleaning each line with .strip() on every iteration over the file, since each line is read with a newline (\n) at its end.

.split('') will not split on a space because there isn't a space between the two apostrophes. You're telling it to split on, well, nothing. In fact, Python raises ValueError: empty separator for it.
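The difference between the three calls can be checked on a single line. The sample line below is made up for illustration and contains a double space, a tab and a trailing newline:

```python
# Hypothetical sample line with mixed whitespace.
line = "Submit a  personal\tstatement\n"

# No argument: split on any run of whitespace and drop empty strings.
any_whitespace = line.split()

# A single space: split on each individual space only, so the double
# space yields an empty string and the tab and newline are kept.
single_space = line.split(" ")

# An empty separator is rejected outright.
try:
    line.split("")
    empty_separator_error = None
except ValueError as error:
    empty_separator_error = str(error)

print(any_whitespace)
print(single_space)
print(empty_separator_error)
```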

Need help finding the correct regex pattern for my string pattern

I'm terrible with RegEx patterns, and I'm writing a simple python program that requires splitting lines of a file into a 'content' part and a 'tags' part, and then further splitting the tags parts into individual tags. Here's a simple example of what one line of my file might look like:
The Beatles <music,rock,60s,70s>
I've opened my file and begun reading lines like this:
def Load(self, filename):
    file = open(filename, 'r')
    for line in file:
        #Ignore comments and empty lines..
        if not line.startswith('#') and line.strip():
            #...
Forgive my likely terrible Python, it's my first few days with the language. Anyway, next I was thinking it would be useful to use a regex to break my string into sections - with a variable to store the 'content' (for example, "The Beatles"), and a list/set to store each of the tags. As such, I need a regex (or two?) that can:
Split the raw part from the <> part.
And split the tags part into a list based on the commas.
Finally, I want to make sure that the content part retains its capitalization and inner spacing. But I want to make sure the tags are all lower-case and without white space.
I'm wondering if any of the regex experts out there can help me find the correct pattern(s) to achieve my goals here?

This is a solution that gets around the problem without using a regex, by relying on multiple splits.
# This separates the string into the content and the remainder
content, tagStr = line.split('<')
# This splits the tagStr into individual tags. [:-1] is used to remove trailing '>'
tags = tagStr[:-1].split(',')
print(content)
print(tags)
The problem with this is that it leaves trailing whitespace after the content.
You can remove it with:
content = content[:-1]
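For comparison, here is a regex-based sketch of the same task. It assumes every line has the form content <tag,tag,...> with a single pair of angle brackets at the end; LINE_PATTERN and parse_line are hypothetical names used for illustration:

```python
import re

# Content is everything before '<'; tags are everything between '<' and '>'.
LINE_PATTERN = re.compile(r"^(?P<content>[^<]*)<(?P<tags>[^>]*)>\s*$")


def parse_line(line):
    """Return (content, tags) for a line, or None if it has no <tags> part."""
    match = LINE_PATTERN.match(line)
    if match is None:
        return None
    # strip() removes only the outer whitespace, so capitalization and
    # inner spacing of the content are preserved.
    content = match.group("content").strip()
    # Lower-case each tag and remove all whitespace inside it.
    cleaned = ("".join(tag.split()).lower() for tag in match.group("tags").split(","))
    tags = [tag for tag in cleaned if tag]
    return content, tags


result = parse_line("The Beatles <Music, rock,60s,70s>\n")
print(result)
```

Returning None for lines that do not match leaves it to the caller to decide whether to skip them or report an error.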
