How to use text.split() and retain blank (empty) lines - python

I'm new to Python and need some help with my program. My code takes in an unformatted text document, does some formatting (sets the page width and the margins), and outputs a new text document. Everything works fine except for this function, which produces the final output.
Here is the problem code:
def process(document, pagewidth, margins, formats):
    res = []
    onlypw = []
    pwmarg = []
    count = 0
    marg = 0
    for segment in margins:
        for i in range(count, segment[0]):
            res.append(document[i])
        text = ''
        foundmargin = -1
        for i in range(segment[0], segment[1]+1):
            marg = segment[2]
            text = text + '\n' + document[i].strip(' ')
        words = text.split()
Note: in case you are wondering about the ranges, segment[0] means the beginning of the document and segment[1] means the end of the document. My problem is that when I copy text to words (in words = text.split()), my blank lines are not retained. The output I should be getting is:
This is my substitute for pistol and ball. With a
philosophical flourish Cato throws himself upon his sword; I
quietly take to the ship. There is nothing surprising in
this. If they but knew it, almost all men in their degree,
some time or other, cherish very nearly the same feelings
towards the ocean with me.

There now is your insular city of the Manhattoes, belted
round by wharves as Indian isles by coral reefs--commerce
surrounds it with her surf.
And what my current output looks like:
This is my substitute for pistol and ball. With a
philosophical flourish Cato throws himself upon his sword; I
quietly take to the ship. There is nothing surprising in
this. If they but knew it, almost all men in their degree,
some time or other, cherish very nearly the same feelings
towards the ocean with me. There now is your insular city of
the Manhattoes, belted round by wharves as Indian isles by
coral reefs--commerce surrounds it with her surf.
I know the problem happens when I copy text to words, since it doesn't keep the blank lines. How can I make sure it copies the blank lines plus the words?
Please let me know if I should add more code or more detail!

First split on at least 2 newlines, then split on words:
import re
paragraphs = re.split('\n\n+', text)
words = [paragraph.split() for paragraph in paragraphs]
You now have a list of lists, one per paragraph; process these per paragraph, after which you can rejoin the whole thing into new text with double newlines inserted back in.
I've used re.split() to support paragraphs being delimited by more than 2 newlines; you could use a simple text.split('\n\n') if there are ever only going to be exactly 2 newlines between paragraphs.
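As a rough sketch of the split-then-rejoin idea, assuming paragraphs are separated by one or more blank lines (the helper names split_paragraphs and rejoin and the sample text are made up for illustration, and re-wrapping to a page width is left out):

```python
import re

def split_paragraphs(text):
    """Return one list of words per paragraph (paragraphs are separated by blank lines)."""
    # Drop leading/trailing newlines so they do not create empty paragraphs.
    paragraphs = re.split(r'\n\n+', text.strip('\n'))
    return [paragraph.split() for paragraph in paragraphs]

def rejoin(paragraph_words, separator='\n\n'):
    """Join the words of each paragraph with spaces and the paragraphs with a blank line."""
    return separator.join(' '.join(words) for words in paragraph_words)

sample = "towards the ocean\nwith me.\n\nThere now is your\ninsular city"
words = split_paragraphs(sample)
print(words)
print(rejoin(words))
```

Any per-paragraph formatting would go between the two calls, working on one inner list at a time.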

Use a regular expression to find both the words and the blank lines, rather than splitting:
import re
m = re.compile(r'(\S+|\n\n)')
words = m.findall(text)
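To see what this token approach produces, here is a small sketch on a made-up sample string. The grouping loop at the end is an assumption about how the tokens might then be consumed, not part of the answer above:

```python
import re

token_pattern = re.compile(r'\S+|\n\n')

text = "with me.\n\nThere now"
tokens = token_pattern.findall(text)
print(tokens)  # words, with a '\n\n' token marking the paragraph break

# Group the words back into paragraphs, starting a new one at each '\n\n' token.
paragraphs, current = [], []
for token in tokens:
    if token == '\n\n':
        paragraphs.append(current)
        current = []
    else:
        current.append(token)
paragraphs.append(current)
print(paragraphs)
```

Note that a single newline inside a paragraph matches neither alternative, so it is simply skipped, just as text.split() would skip it.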

Related

Splitting elements within a list and separate strings, then counting the length

If I have several lines of text, such as
"Jane, I don't like cavillers or questioners; besides, there is something truly forbidding in a child taking up her elders in that manner.
Be seated somewhere; and until you can speak pleasantly, remain silent."
I mounted into the window- seat: gathering up my feet, I sat cross-legged, like a Turk; and, having drawn the red moreen curtain nearly close, I was shrined in double retirement.
and I want to split the string (the sentences) on each line at the ";" punctuation, I would do
for line in open("jane_eyre_sentences.txt"):
    words = line.strip("\n")
    words_split = words.split(";")
However, now I get lists of strings such as
["Jane, I don't like cavillers or questioners", 'besides, there is something truly forbidding in a child taking up her elders in that manner.']
['Be seated somewhere', 'and until you can speak pleasantly, remain silent."']
['I mounted into the window- seat: gathering up my feet, I sat cross-legged, like a Turk', 'and, having drawn the red moreen curtain nearly close, I was shrined in double retirement.']
So each line has now become a list with two separate elements.
How would I actually separate this list into individual strings?
I know I need a 'for' loop because it has to process all the lines. I assume I need another 'split' call, but I have tried "\n" as well as ',' and neither produces an answer; Python reports "AttributeError: 'list' object has no attribute 'split'". What does this mean?
Once I have separate strings, I want to calculate the length of each string, so I would use len(), etc.
You can iterate through the list of sentence parts like this:
for line in open("jane_eyre_sentences.txt"):
    words = line.strip("\n")
    for sentence_part in words.split(";"):
        print(sentence_part)  # prints each element of the list
        print(len(sentence_part))  # prints the length of that sentence part
Alternatively, if you just need the length of each of the parts:
for line in open("jane_eyre_sentences.txt"):
    words = line.strip("\n")
    sentence_part_lengths = [len(sentence_part) for sentence_part in words.split(";")]
Edit: With further information from your second post.
for count, line in enumerate(open("jane_eyre_sentences.txt")):
    words = line.strip("\n")
    if ";" in words:
        wordssplit = words.split(";")
        number_of_words_per_split = [(x, len(x.split())) for x in wordssplit]
        print("Line {}: ".format(count), number_of_words_per_split)
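On the AttributeError itself: str.split() returns a list, and a list has no split() method, so the second split has to be applied to each string inside the list. A minimal sketch of that, using two shortened in-memory lines instead of the file (the sample lines and the name part_lengths are made up for illustration):

```python
lines = [
    "Be seated somewhere; and until you can speak pleasantly, remain silent.",
    "I sat cross-legged, like a Turk; and I was shrined in double retirement.",
]

part_lengths = []
for line in lines:
    parts = line.split(";")   # parts is a list; calling parts.split(...) would raise AttributeError
    for part in parts:        # each element of the list is a str again
        part = part.strip()   # drop the space left behind after the ";"
        # Record the part, its length in characters, and its length in words.
        part_lengths.append((part, len(part), len(part.split())))

for entry in part_lengths:
    print(entry)
```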

Accumulating Characters in Python

So I have this textfile, and in that file it goes like this... (just a bit of it)
"The truest love that ever heart
Felt at its kindled core
Did through each vein in quickened start
The tide of being pour
Her coming was my hope each day
Her parting was my pain
The chance that did her steps delay
Was ice in every vein
I dreamed it would be nameless bliss
As I loved loved to be
And to this object did I press
As blind as eagerly
But wide as pathless was the space
That lay our lives between
And dangerous as the foamy race
Of ocean surges green
And haunted as a robber path
Through wilderness or wood
For Might and Right and Woe and Wrath
Between our spirits stood
I dangers dared I hindrance scorned
I omens did defy
Whatever menaced harassed warned
I passed impetuous by
On sped my rainbow fast as light
I flew as in a dream
For glorious rose upon my sight
That child of Shower and Gleam"
Now I want to calculate the total length of the words without the letter 'e' in each line of text. So the first line should give 4, then 5, then 17, etc.
My current code is
for line in open("textname.txt"):
    line_strip = line.strip()
    line_strip_split = line_strip.split()
    for word in line_strip_split:
        if "e" not in word:
            word_e = word
            print (len(word_e))
My explanation is: strip each line and split it into words, so it becomes ['Felt','at','its','kindled','core'], etc. We split into words so that each one can be checked individually for the letter 'e'. Then, for the words without an 'e', we print the length of the string.
HOWEVER, this prints the length of each word on a separate line instead of adding together the lengths of all the words in each line, so the answer becomes "4 / 2 / 3".
Try this:
for line in open("textname.txt"):
    line_strip = line.strip()
    line_strip_split = line_strip.split()
    words_with_no_e = []
    for word in line_strip_split:
        if "e" not in word:
            # Add words without an 'e' to a new list
            words_with_no_e.append(word)
    # ''.join() returns all the elements of the list concatenated
    # len() counts the length of the result
    print(len(''.join(words_with_no_e)))
For each line, this appends all the words without an 'e' to a new list, concatenates them, and prints the length of the result.
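The same per-line total can also be computed without building the intermediate list, by summing the word lengths directly. This is only a sketch on the first three lines of the poem (held in memory rather than read from textname.txt), and the helper name length_without_e is made up:

```python
def length_without_e(line):
    """Total number of characters in the words of `line` that contain no lowercase 'e'."""
    # The check is case-sensitive: a word containing only an uppercase 'E' still counts.
    return sum(len(word) for word in line.split() if "e" not in word)

poem = [
    "The truest love that ever heart",
    "Felt at its kindled core",
    "Did through each vein in quickened start",
]
lengths = [length_without_e(line) for line in poem]
print(lengths)  # [4, 5, 17]
```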

How to trim a string in a specific way

Say I have a sentence such as:
The bird flies at night and has a very large wing span.
My goal is to split the string so that the result comes out to be:
and has a very large wing
I've tried using split(), however, my efforts have not been successful. How can I split the string into pieces, and delete the beginning part of the string and the end part?
import re
text = 'The bird flies at night and has a very large wing span.'
l = re.split(r'.+?(?=and)|(?<=wing).+?', text)[1]
out:
and has a very large wing
I guess this is the best way to do what you want:
s = "The bird flies at night and has a very large wing span."
and_position = s.find("and")  # index of the first occurrence of "and" in the string
wing_position = s.find("wing")  # likewise for "wing"
result = s[and_position:wing_position + len("wing")]  # slice from "and" through the end of "wing"
If you're not familiar with Python slices, read more about them in the Python documentation.
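One thing the find-and-slice approach does not cover is a missing marker: str.find() returns -1 in that case and the slice silently gives a wrong result. A hedged sketch of a small helper that checks for this (the name slice_between is made up, and returning None for a missing marker is just one possible choice):

```python
def slice_between(text, start_marker, end_marker):
    """Return the part of text from start_marker through end_marker, or None if either is missing."""
    start = text.find(start_marker)
    if start == -1:
        return None
    # Search for the end marker only after the start marker.
    end = text.find(end_marker, start + len(start_marker))
    if end == -1:
        return None
    return text[start:end + len(end_marker)]

sentence = "The bird flies at night and has a very large wing span."
print(slice_between(sentence, "and", "wing"))
```

Like the original, this matches substrings, so "and" would also be found inside a word such as "band".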

Replace a specific word given its position in a text file (Python)

I have a list of tuples, each of which contains a word to be replaced and its line and column position in a given text file (e.g. [('word1', 1, 1), ('word2', 1, 9), ... ]). I want to go through the text file and replace each word at its specific position with a character.
In other words, given a specific word, its line and column numbers inside a text file I am trying to find and replace that word with a character, for example:
given that the text file contains the following (assuming its position is as it is displayed -not written- here)
Excited him now natural saw passage offices you minuter. At by stack
being court hopes. Farther so friends am to detract. Forbade concern
do private be. Offending residence but men engrossed shy. Pretend am
stack earnest arrived company so on. Felicity informed yet had to is
admitted strictly how stack you.
and given that the word to replace is stack with position in the text to be line 3 and column 16, to replace it with the character *,
so, after the replace takes place, the text file would now have the contents:
Excited him now natural saw passage offices you minuter. At by stack
being court hopes. Farther so friends am to detract. Forbade concern
do private be. Offending residence but men engrossed shy. Pretend am
* earnest arrived company so on. Felicity informed yet had to is
admitted strictly how stack you.
I have considered linecache but it seems very inefficient for large text files. Also, given the fact that I already have the line and column numbers, I hoped there was a way to go directly to that position and perform the replace.
Does anyone know a way to do this in Python?
EDIT
The initial solution proposed using numpy's genfromtxt is (most likely) not suitable following the discussion in the follow-up issue since there is a need for every line of the text file to be present and not skipped (e.g. empty lines, strings beginning with 'w' and strings inside '/*.. /').
Try a recipe like this:
import numpy as np
import os
def changethis(pos):
    # Notice file is in global scope
    appex = file[pos[1]-1][:pos[2]] + '*' + file[pos[1]-1][pos[2]+len(pos[0]):]
    file[pos[1]-1] = appex
pos = ('stack', 3, 16)
file = np.array([i for i in open('in.txt','r')]) #BEFORE EDIT: np.genfromtxt('in.txt',dtype='str',delimiter=os.linesep)
changethis(pos)
print(file)
The result is this:
[ 'Excited him now natural saw passage offices you minuter. At by stack being court hopes. Farther'
'so friends am to detract. Forbade concern do private be. Offending residence but men engrossed'
'shy. Pretend am * earnest arrived company so on. Felicity informed yet had to is admitted'
'strictly how stack you.']
Notice this is a bit of a hack: it puts a bunch of long strings into a numpy array and then changes them in place, but it should be efficient when called in a longer loop over position tuples.
EDIT: As @user2357112 made me realize, the choice of file reader was not the most appropriate (although it worked for the exercise in question), so I've edited this answer to provide the same solution given in the follow-up question.
Consider a single line:
word1 a word2 a word3 a word4
If you have these changes:
[('word1', 1, 1), ('word2', 1, 9), ... ]
And you process them in order:
* a word2 a word3 a word4
You will fail, because you are changing the positions of the words when you replace 'word1' with '*', a shorter string.
Instead, you will have to sort the list of changes by line, reversed by column:
changes = sorted(changes, key=lambda t: (t[1], -t[2]))
You can then process the changes as you iterate through the file, shown in the link referenced by @JRajan:
with open("file", "r") as fp:
    fpline_text = enumerate(fp)
    fpline,text = next(fpline_text)
    for edit in changes:
        word,line,offset = edit
        line -=1 # 0 based
        while fpline < line:
            print(text)
            fpline,text = next(fpline_text)
        offset -= 1 # 0-based
        cand = text[offset:offset+len(word)]
        if cand != word:
            print("OOPS! Word '{}' not found at ({}, {})".format(*edit))
        else:
            text = text[0:offset]+'*'+text[offset+len(word):]
    # Rest of file
    try:
        while True:
            print(text)
            fpline,text = next(fpline_text)
    except StopIteration:
        pass
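The reason for sorting by line and then by descending column can be checked on the single-line example. The sketch below works on a list of lines already in memory instead of a file; the function name apply_changes and the choice to raise ValueError on a mismatch are assumptions made for illustration:

```python
def apply_changes(lines, changes, replacement="*"):
    """Return a copy of lines with each (word, line_no, column) replaced; positions are 1-based."""
    result = list(lines)
    # Process each line right to left so earlier columns are still valid after a replacement.
    for word, line_no, column in sorted(changes, key=lambda t: (t[1], -t[2])):
        text = result[line_no - 1]
        start = column - 1
        if text[start:start + len(word)] != word:
            raise ValueError("word {!r} not found at ({}, {})".format(word, line_no, column))
        result[line_no - 1] = text[:start] + replacement + text[start + len(word):]
    return result

lines = ["word1 a word2 a word3 a word4"]
changes = [("word1", 1, 1), ("word2", 1, 9)]
print(apply_changes(lines, changes))
```

Processed left to right instead, the second change would look for 'word2' at column 9 of the already shortened line and fail to find it.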

Python parsing

I'm trying to parse the title tag in an RSS 2.0 feed into three different variables for each entry in that feed. Using ElementTree I've already parsed the RSS so that I can print each title [minus the trailing )] with the code below:
feed = getfeed("http://www.tourfilter.com/dallas/rss/by_concert_date")
for item in feed:
    print repr(item.title[0:-1])
I include that because, as you can see, the item.title is a repr() data type, which I don't know much about.
A particular repr(item.title[0:-1]) printed in the interactive window looks like this:
'randy travis (Billy Bobs 3/21'
'Michael Schenker Group (House of Blues Dallas 3/26'
The user selects a band and I hope to, after parsing each item.title into 3 variables (one each for band, venue, and date... or possibly an array or I don't know...) select only those related to the band selected. Then they are sent to Google for geocoding, but that's another story.
I've seen some examples of regex and I'm reading about them, but it seems very complicated. Is it? I thought maybe someone here would have some insight as to exactly how to do this in an intelligent way. Should I use the re module? Does it matter that the output is currently repr()s? Is there a better way? I was thinking I'd use a loop like this (this is my pseudoPython, just kind of notes I'm writing):
list = bandRaw,venue,date,latLong
for item in feed:
    parse item.title for bandRaw, venue, date
    if bandRaw == str(band)
        send venue name + ", Dallas, TX" to google for geocoding
        return lat,long
        list = list + return character + bandRaw + "," + venue + "," + date + "," + lat + "," + long
    else
In the end, I need to have the chosen entries in a .csv (comma-delimited) file looking like this:
band,venue,date,lat,long
randy travis,Billy Bobs,3/21,1234.5678,1234.5678
Michael Schenker Group,House of Blues Dallas,3/26,4321.8765,4321.8765
I hope this isn't too much to ask. I'll be looking into it on my own, just thought I should post here to make sure it got answered.
So, the question is, how do I best parse each repr(item.title[0:-1]) in the feed into the 3 separate values that I can then concatenate into a .csv file?
Don't let regex scare you off... it's well worth learning.
Given the examples above, you might try putting the trailing parenthesis back in, and then using this pattern:
import re
s = 'Michael Schenker Group (House of Blues Dallas 3/26)'
pat = re.compile(r'([\w\s]+)\(([\w\s]+)(\d+/\d+)\)')
info = pat.match(s)
print info.groups()
('Michael Schenker Group ', 'House of Blues Dallas ', '3/26')
To get at each group individually, just call them on the info object (strip() removes the trailing space that the first two groups capture):
print info.group(1) # or info.groups()[0]
print '"%s","%s","%s"' % (info.group(1).strip(), info.group(2).strip(), info.group(3))
"Michael Schenker Group","House of Blues Dallas","3/26"
The hard thing about regex in this case is making sure you know all the characters that can possibly appear in the title. If there are non-alpha chars in the 'Michael Schenker Group' part, you'll have to adjust the regex for that part to allow them.
The pattern above breaks down as follows, reading left to right:
([\w\s]+) : Match any word or space characters (the plus symbol indicates that there should be one or more such characters). The parentheses mean that the match will be captured as a group. This is the "Michael Schenker Group " part. If there can be numbers and dashes here, you'll want to modify the pieces between the square brackets, which are the possible characters for the set.
\( : A literal parenthesis. The backslash escapes the parenthesis, since otherwise it counts as a regex command. This is the "(" part of the string.
([\w\s]+) : Same as the one above, but this time matches the "House of Blues Dallas " part. In parentheses so they will be captured as the second group.
(\d+/\d+) : Matches the digits 3 and 26 with a slash in the middle. In parentheses so they will be captured as the third group.
\) : Closing parenthesis for the above.
The python intro to regex is quite good, and you might want to spend an evening going over it http://docs.python.org/library/re.html#module-re. Also, check Dive Into Python, which has a friendly introduction: http://diveintopython3.ep.io/regular-expressions.html.
EDIT: See zacherates below, who has some nice edits. Two heads are better than one!
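One case worth checking with the character-class pattern is a two-digit month. Because \w also matches digits, the venue group can swallow the first digit of the date. The sketch below uses a made-up title and Python 3 print calls, and compares that pattern with one that requires a space before the date:

```python
import re

# Pattern from the answer above, and a variant that anchors the date on the preceding space.
loose = re.compile(r'([\w\s]+)\(([\w\s]+)(\d+/\d+)\)')
anchored = re.compile(r'(.*) \((.*) (\d+/\d+)\)')

title = 'some band (Some Venue 12/21)'  # hypothetical title with a two-digit month

loose_groups = loose.match(title).groups()
anchored_groups = anchored.match(title).groups()
print(loose_groups)     # the venue group ends in '1' and the date comes out as '2/21'
print(anchored_groups)  # the date is kept whole
```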
Regular expressions are a great solution to this problem:
>>> import re
>>> s = 'Michael Schenker Group (House of Blues Dallas 3/26'
>>> re.match(r'(.*) \((.*) (\d+/\d+)', s).groups()
('Michael Schenker Group', 'House of Blues Dallas', '3/26')
As a side note, you might want to look at the Universal Feed Parser for handling the RSS parsing as feeds have a bad habit of being malformed.
Edit
In regards to your comment... The strings occasionally being wrapped in "s rather than 's has to do with the fact that you're using repr. The repr of a string is usually delimited with 's, unless that string contains one or more 's, where instead it uses "s so that the 's don't have to be escaped:
>>> "Hello there"
'Hello there'
>>> "it's not its"
"it's not its"
Notice the different quote styles.
Regarding the repr(item.title[0:-1]) part, I'm not sure where you got that from, but I'm pretty sure you can simply use item.title. All you're doing is removing the last char from the string and then calling repr() on it, which only affects how the string is displayed, not the string itself.
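A short illustration of the difference between a string and its repr(), written with Python 3 print calls (the variable names are made up):

```python
plain = "Hello there"
with_apostrophe = "it's not its"

print(plain)                  # Hello there
print(repr(plain))            # 'Hello there'
print(repr(with_apostrophe))  # "it's not its"

# repr() returns a new string that includes the quote characters;
# the original string is unchanged and has no quotes in it.
shown = repr(plain)
print(len(plain), len(shown))
```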
Your code should look something like this:
import re
from geopy import geocoders # from GeoPy
us = geocoders.GeocoderDotUS()
import feedparser # from www.feedparser.org
feedurl = "http://www.tourfilter.com/dallas/rss/by_concert_date"
feed = feedparser.parse(feedurl)
lines = []
for entry in feed.entries:
    m = re.search(r'(.*) \((.*) (\d+/\d+)\)', entry.title)
    if m:
        bandRaw, venue, date = m.groups()
        if band == bandRaw:  # band is the name the user selected
            place, (lat, lng) = us.geocode(venue + ", Dallas, TX")
            # lat and lng are numbers, so convert them before joining
            lines.append(",".join([band, venue, date, str(lat), str(lng)]))
result = "\n".join(lines)
EDIT: replaced list with lines as the var name. list is a builtin and should not be used as a variable name. Sorry.
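For the .csv output, the standard csv module takes care of quoting if a band or venue name ever contains a comma. The following is only a sketch in Python 3 that runs without network access: the titles are the two from the question with the closing parenthesis restored, and the coords dictionary is a hypothetical stand-in for the geocoder, filled with the placeholder coordinates from the question.

```python
import csv
import io
import re

TITLE_PATTERN = re.compile(r'(.*) \((.*) (\d+/\d+)\)')

def titles_to_csv(titles, band, coords):
    """Build CSV text for the entries matching band; coords maps venue -> (lat, lng)."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["band", "venue", "date", "lat", "long"])
    for title in titles:
        m = TITLE_PATTERN.match(title)
        if not m:
            continue  # skip titles that do not look like "band (venue m/d)"
        band_raw, venue, date = m.groups()
        if band_raw == band:
            lat, lng = coords[venue]  # stand-in for the geocoding call
            writer.writerow([band_raw, venue, date, lat, lng])
    return out.getvalue()

titles = [
    "randy travis (Billy Bobs 3/21)",
    "Michael Schenker Group (House of Blues Dallas 3/26)",
]
coords = {"Billy Bobs": (1234.5678, 1234.5678)}  # placeholder values, not real coordinates
csv_text = titles_to_csv(titles, "randy travis", coords)
print(csv_text)
```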
