Python: Replace, rstrip() not able to remove newlines - python

I've got a spreadsheet of information (a UTF-8 CSV file read in by the csv module) that contains information for a large number of products that need to go into an inventory db. I'm trying to convert descriptions made of newline-separated rows of text into HTML list tags.
The issue I'm having is that the following lines fail to replace the newline character in the string:
line[2] = "<ul><li>" + line[2]
line[2].replace('\n', '</li><li>')
line[2] += "</li></ul>"
The string continues to contain newline characters even when the second line is replaced by:
line[2] = line[2].rstrip()
What is going on, and what am I messing up? =)

From python manual
str.replace(old, new[, count])
Return a copy of the string with all occurrences of substring old replaced by new. If the optional argument count is given, only the first count occurrences are replaced.
As you can see, it doesn't replace "in place". Instead, try:
line[2] = "<ul><li>" + line[2]
line[2] = line[2].replace('\n', '</li><li>')
line[2] += "</li></ul>"

Replace isn't in place.
So do
line[2] = "<ul><li>" + line[2].replace('\n', '</li><li>') + "</li></ul>"

Don't forget to escape!
import html
# cgi.escape() was removed in Python 3.8; html.escape() replaces it
escaped = html.escape(line[2].rstrip()).replace("\n", "</li><li>")
line[2] = "<ul><li>%s</li></ul>" % escaped
str.replace returns a copy instead of modifying the string in place, and rstrip with no argument strips all trailing whitespace, not just newlines. Since this is for HTML and the trailing whitespace probably won't include something like "\n \n ", that probably doesn't matter to you, but it is something to be aware of.
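The fixes above can be combined into one small helper. This is only a sketch: the function name `description_to_html_list` and the sample text are made up for illustration, and it uses `html.escape` from the standard library for the escaping step.

```python
import html


def description_to_html_list(text):
    """Turn newline-separated text into an HTML unordered list (hypothetical helper)."""
    # Strings are immutable: every method returns a new string, so keep the result.
    escaped = html.escape(text.rstrip())
    items = escaped.replace("\n", "</li><li>")
    return "<ul><li>" + items + "</li></ul>"


sample = "Red & blue\nSize: 10\nIn stock\n"
sample.replace("\n", " ")  # the result is discarded, so sample is unchanged
print(repr(sample))
print(description_to_html_list(sample))
```

The bare `sample.replace(...)` call is the same mistake as in the question: it builds a new string and throws it away.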

Related

Replace all newline characters using python

I am trying to read a pdf using python and the content has many newline (crlf) characters. I tried removing them using below code:
from tika import parser
filename = 'myfile.pdf'
raw = parser.from_file(filename)
content = raw['content']
content = content.replace("\r\n", "")
print(content)
But the output remains unchanged. I also tried using double backslashes, which didn't fix the issue. Can someone please advise?
content = content.replace("\\r\\n", "")
You need to double escape them.
I don't have access to your pdf file, so I processed one on my system. I also don't know if you need to remove all new lines or just double new lines. The code below removes double new lines, which makes the output more readable.
Please let me know if this works for your current needs.
from tika import parser
filename = 'myfile.pdf'
# Parse the PDF
parsedPDF = parser.from_file(filename)
# Extract the text content from the parsed PDF
pdf = parsedPDF["content"]
# Convert double newlines into single newlines
pdf = pdf.replace('\n\n', '\n')
#####################################
# Do something with the PDF
#####################################
print(pdf)
If you are having issues with different forms of line break, try the str.splitlines() function and then re-join the result using the string you're after. Like this:
content = "".join(l for l in content.splitlines() if l)
Then, you just have to change the value within the quotes to what you need to join on.
This handles every line boundary that str.splitlines() recognizes, including \n, \r and \r\n.
Be aware though that str.splitlines() returns a list, not an iterator. So, for large strings, this will blow out your memory usage.
In those cases, you are better off using the file stream or io.StringIO and reading line by line.
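As a rough sketch of the line-by-line approach, the generator below strips the line ending from each line of a text stream and skips empty lines. The name `iter_nonempty_lines` and the sample text are assumptions for illustration; an `io.StringIO` stands in for a real file, and a second one stands in for the output file.

```python
import io


def iter_nonempty_lines(stream):
    """Yield each line of a text stream without its line ending, skipping empty lines."""
    for line in stream:
        stripped = line.rstrip("\r\n")
        if stripped:
            yield stripped


# io.StringIO stands in for an open text file here.
source = io.StringIO("first\r\nsecond\n\nthird\n")
out = io.StringIO()
for piece in iter_nonempty_lines(source):
    out.write(piece)  # write each piece as it arrives instead of building a list

result = out.getvalue()
print(repr(result))
```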
print(open('myfile.txt').read().replace('\n', ''))
When you write something like t.replace("\r\n", "") python will look for a carriage-return followed by a new-line.
Python will not replace carriage returns by themselves or replace new-line characters by themselves.
Consider the following:
t = "abc abracadabra abc"
t.replace("abc", "x")
Will t.replace("abc", "x") replace every occurrence of the letter a with the letter x? No
Will t.replace("abc", "x") replace every occurrence of the letter b with the letter x? No
Will t.replace("abc", "x") replace every occurrence of the letter c with the letter x? No
What will t.replace("abc", "x") do?
t.replace("abc", "x") will replace the entire string "abc" with the letter "x"
Consider the following:
test_input = "\r\nAPPLE\rORANGE\nKIWI\n\rPOMEGRANATE\r\nCHERRY\r\nSTRAWBERRY"
t = test_input
for _ in range(0, 3):
    t = t.replace("\r\n", "")
    print(repr(t))
result2 = "".join(test_input.split("\r\n"))
print(repr(result2))
The output sent to the console is as follows:
'APPLE\rORANGE\nKIWI\n\rPOMEGRANATECHERRYSTRAWBERRY'
'APPLE\rORANGE\nKIWI\n\rPOMEGRANATECHERRYSTRAWBERRY'
'APPLE\rORANGE\nKIWI\n\rPOMEGRANATECHERRYSTRAWBERRY'
'APPLE\rORANGE\nKIWI\n\rPOMEGRANATECHERRYSTRAWBERRY'
Note that:
str.replace() replaces every occurrence of the target string, not just the left-most occurrence.
str.replace() replaces the target string, but not every character of the target string.
If you want to delete all new-line and carriage returns, something like the following will get the job done:
in_string = "\r\n-APPLE-\r-ORANGE-\n-KIWI-\n\r-POMEGRANATE-\r\n-CHERRY-\r\n-STRAWBERRY-"
out_string = "".join(filter(lambda ch: ch not in "\n\r", in_string))
print(repr(out_string))
# prints '-APPLE--ORANGE--KIWI--POMEGRANATE--CHERRY--STRAWBERRY-'
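A different way to get the same effect, shown here only as an alternative sketch, is `str.translate` with a table built by `str.maketrans`, whose third argument lists characters to delete. The shortened sample string is made up for illustration.

```python
in_string = "\r\n-APPLE-\r-ORANGE-\n-KIWI-\n\r-POMEGRANATE-"

# The third argument of str.maketrans lists the characters to delete.
delete_line_breaks = str.maketrans("", "", "\r\n")
out_string = in_string.translate(delete_line_breaks)
print(repr(out_string))
```

Unlike `replace("\r\n", "")`, this treats `\r` and `\n` as individual characters, so their order in the input does not matter.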
You can also just use
text = '''
As she said these words her foot slipped, and in another moment, splash! she
was up to her chin in salt water. Her first idea was that she had somehow
fallen into the sea, “and in that case I can go back by railway,”
she said to herself.”'''
text = ' '.join(text.splitlines())
print(text)
# As she said these words her foot slipped, and in another moment, splash! she was up to her chin in salt water. Her first idea was that she had somehow fallen into the sea, “and in that case I can go back by railway,” she said to herself.”
# write a file
write_File = open("sample.txt", "w")
write_File.write("line1\nline2\nline3\nline4\nline5\nline6\n")
write_File.close()
# read the file back and replace the newline characters with spaces
open_file = open("sample.txt", "r")
open_new_File = open_file.read()
replace_string = open_new_File.replace("\n", " ")
print(replace_string, end=" ")
open_file.close()
OUTPUT
line1 line2 line3 line4 line5 line6

Python Code to Replace First Letter of String: String Index Error

Currently, I am working on parsing resumes to remove "-" only when it is used at the beginning of each line. I've tried identifying the first character of each string after the text has been split. Below is my code:
for line in text.split('\n'):
    if line[0] == "-":
        line[0] = line.replace('-', ' ')
line is a string. This is my way of thinking, but every time I run this I get the error IndexError: string index out of range. I'm unsure why, because since it is a string, the first element should be recognized. Thank you!
The error occurs because some lines are empty, so line[0] does not exist.
Your replacement is also wrong, for three reasons:
first, it assigns to the first character of the line, but strings are immutable, so you cannot change one in place
second, the replacement value is the whole string with every dash replaced, not just the first character
third, line is rebound at the next iteration, so the result is lost. The original list of lines is lost too, by the way.
If you want to remove the first character of a string, there is no need for replace: just slice the string (that way you don't risk removing other, similar characters).
A working solution is to test with startswith and build a new list of strings, then join it back together:
text = """hello
-yes--
who are you"""
new_text = []
for line in text.splitlines():
    if line.startswith("-"):
        line = line[1:]
    new_text.append(line)
print("\n".join(new_text))
result:
hello
yes--
who are you
With more experience, you can pack this code into a list comprehension:
new_text = "\n".join([line[1:] if line.startswith("-") else line for line in text.splitlines()])
Finally, the regular expression module is also a nice alternative:
import re
print(re.sub("^-","",text,flags=re.MULTILINE))
This removes the dash on all lines starting with a dash. The multiline flag tells the regex engine to treat ^ as the start of each line, not the start of the buffer.
This could be due to empty lines. You could just check that the line is non-empty before taking the index.
new_text = []
text="-testing\nabc\n\n\nxyz"
for line in text.split("\n"):
    if line and line[0] == '-':
        line = line[1:]
    new_text.append(line)
print("\n".join(new_text))

pandas read_table with regex header definition

For a data file formatted like this:
("Time Step" "courantnumber_max" "courantnumber_avg" "flow-time")
0 0.55432343242 0.34323443432242 0.00001
I can use pd.read_table(filename, sep=' ', header=0) and it will get everything correct except for the very first header, "Time Step".
Is there a way to specify a regex string for read_table() to use to parse out the header names?
I know a way to solve the issue is to just use regex to create a list of names for the read_table() function to use, but I figured there might/should be a way to directly express that in the import itself.
Edit: Here's what it returns as headers:
['("Time', 'Step"', 'courantnumber_max', 'courantnumber_avg', 'flow-time']
So it doesn't appear to be actually possible to do this inside the pandas.read_table() function. Below is posted the actual solution I ended up using to fix the problem:
import re
def get_headers(file, headerline, regexstring, exclude):
    # Get string of selected headerline
    with file.open() as f:
        for i, line in enumerate(f):
            if i == headerline-1:
                headerstring = line
            elif i > headerline-1:
                break
    # Parse headerstring
    reglist = re.split(regexstring, headerstring)
    # Filter entries in reglist
    # filter out blank strs
    filteredlist = list(filter(None, reglist))
    # filter out items in exclude list (an empty exclude list keeps everything)
    headerslist = []
    for entry in filteredlist:
        if not exclude or entry not in exclude:
            headerslist.append(entry)
    return headerslist
get_headers(filename, 3, r'(?:" ")|["\)\(]', ['\n'])
Code explanation:
get_headers():
Arguments: file is an object with an open() method (for example a pathlib.Path) that points to the file containing the header. headerline is the line number (starting at 1) on which the header names appear. regexstring is the pattern that will be fed into re.split(). It is highly recommended that you write the pattern as a raw string (prefix it with r). exclude is a list of miscellaneous strings that you want removed from the header list.
The regex pattern I used:
First up we have the pipe (|) symbol. It separates the "normal" split pattern (the " ") from the other characters that need to be removed (namely the parentheses and the leftover quotes).
Starting with the first group: (?:" "). We have the (...) since we want to match those characters in order. The " " is what we want to match as the stuff to split around. The ?: says not to capture the contents of the group. This is important, as otherwise re.split() will return the text matched by the group as a separate item. See re.split() in the documentation.
The second group is simply the other characters. Without them, the first and last items would be '("Time Step' and 'flow-time")\n'. Note that this causes \n to be treated as a separate entry in the list. This is why we use the exclude argument to fix that up after the fact.
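To see what the pattern does in isolation, here is a small sketch that applies it directly to the header line from the question, without reading a file. The variable names are made up for illustration.

```python
import re

header_line = '("Time Step" "courantnumber_max" "courantnumber_avg" "flow-time")\n'
pattern = r'(?:" ")|["\)\(]'

# re.split() leaves empty strings where two separators are adjacent,
# and the trailing newline survives as its own item.
parts = re.split(pattern, header_line)
headers = [p for p in parts if p and p != "\n"]
print(parts)
print(headers)
```

A list like `headers` could then be handed to `pd.read_table()` through its `names` parameter, with the original header row skipped.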

How to fetch whole string from start of comma to next comma or newline when "/" occurs in string

I have a string in this format:
,xys=2/3,
d=e,
b*y,
b/e
I want to fetch xys=2/3 and b/e.
Right now I have a regular expression which just picks 2/3 and b/e.
pattern = r'(\S+)\s*(?<![;|<|#])/\s*(\S+)'
regex = re.compile(pattern,re.DOTALL)
for result in regex.findall(data):
    f.write("Division " + str(result)+ "\n\n\n")
How can I modify it to pick what I intend?
Match anything but , (or newlines) up until the first slash /: [^,/\n]*/
Match the remaining text up to the next comma: [^,\n]*
Put the two together: [^,/\n]*/[^,\n]*
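A quick check of the combined pattern against the sample from the question (a sketch; the variable names are assumptions):

```python
import re

data = ",xys=2/3,\nd=e,\nb*y,\nb/e"

# Anything except comma, slash or newline, then a slash,
# then anything except comma or newline.
pattern = re.compile(r"[^,/\n]*/[^,\n]*")
matches = pattern.findall(data)
print(matches)
```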
No need for regular expressions.
s = """,xys=2/3,
d=e,
b*y,
b/e
"""
l = s.split("\n")
for line in l:
    if '/' in line:
        print(line.strip(","))
Will this work:
x.split(",")[1].split('\n')[0] if "," in x[:-1] else None
It evaluates to None unless a , is present somewhere other than the last position. Otherwise it extracts the part between the first , and the next , (or the end of the string), and then cuts that at the first newline, if any.

Removing \n from myFile

I am trying to create a dictionary of lists in which each key is a sorted string of letters and the value (a list) contains all the words that are anagrams of those letters.
So my dict should contain something like this
{'aaelnprt': ['parental', 'paternal', 'prenatal'], 'ailrv': ['rival']}
The possible words are inside a .txt file, where every word is separated by a newline. Example
Sad
Dad
Fruit
Pizza
Which leads to a problem when I try to code it.
with open("word_list.txt") as myFile:
    for word in myFile:
        if word[0] == "v": ##Interested in only word starting with "v"
            word_sorted = ''.join(sorted(word)) ##Get the anagram
            for keys in list(dictonary.keys()):
                if keys == word_sorted: ##Heres the problem, it doesn't get inside here as theres extra characters in <word_sorted> possible "\n" due to the linebreak of myfi
                    print(word_sorted)
                    dictonary[word_sorted].append(word)
If every word in "word_list.txt" is followed by '\n' then you can just use slicing to get rid of the last char of the word.
word_sorted = ''.join(sorted(word[:-1]))
But if the last word in "word_list.txt" isn't followed by '\n', then you should use rstrip().
word_sorted = ''.join(sorted(word.rstrip()))
The slice method is slightly more efficient, but for this application I doubt you'll notice the difference, so you might as well just play safe & use rstrip().
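The difference between the two approaches shows up on a final line that has no trailing newline. A small sketch with made-up sample lines:

```python
# The last line of a file often has no trailing newline.
lines = ["rival\n", "viral"]

results = []
for word in lines:
    by_slice = "".join(sorted(word[:-1]))       # always drops the last character
    by_rstrip = "".join(sorted(word.rstrip()))  # drops only trailing whitespace
    results.append((by_slice, by_rstrip))
    print(repr(by_slice), repr(by_rstrip))
```

For the second line, slicing cuts off the final letter of the word, while rstrip() leaves the word intact.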
Use rstrip(), it removes the \n character.
...
...
keys == word_sorted.rstrip()
...
You should try to use the .rstrip() function in your code; it will remove the "\n".
See the documentation for str.rstrip() for details.
strip only removes characters from the beginning or end of a string.
Use rstrip() to remove \n character
Also you can use replace syntax, to replace newline with something else.
str2 = str.replace("\n", "")
I see a few problems here. How is anything getting into the dictionary? I see no assignments. Obviously you've only provided us a snippet, so maybe that's elsewhere.
You're also using a loop when you could be using the in operator (it's more efficient, truly it is).
with open("word_list.txt") as myFile:
    for word in myFile:
        if word[0] == "v": ##Interested in only word starting with "v"
            word = word.rstrip() ##Drop the trailing newline before storing the word
            word_sorted = ''.join(sorted(word)) ##Get the anagram
            if word_sorted in dictionary:
                print(word_sorted)
                dictionary[word_sorted].append(word)
            else:
                # The case where we don't find an anagram in our dict
                dictionary[word_sorted] = [word,]
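The if/else can also be avoided with `collections.defaultdict`, which creates the empty list on first access. The sketch below is an alternative, not the code from the answer: it reads from an `io.StringIO` holding a made-up word list in place of word_list.txt.

```python
import io
from collections import defaultdict

# io.StringIO stands in for open("word_list.txt"); the words are made up.
word_file = io.StringIO("vile\nveil\nvial\nevil\nviral\n")

anagrams = defaultdict(list)
for line in word_file:
    word = line.rstrip()          # remove the trailing newline first
    if word.startswith("v"):      # startswith() is safe on empty lines
        key = "".join(sorted(word))
        anagrams[key].append(word)

print(dict(anagrams))
```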
