I need to read a pipe(|)-separated text file.
One of the fields contains a description that may contain double-quotes.
I noticed that all lines that contain a " are missing from the receiving dict.
To avoid this, I tried to read the entire line and use string.replace() to remove them, as shown below, but it looks like the presence of those quotes creates a problem at the line-reading stage, i.e. before the string.replace() call.
The code is below, and the question is: how do I force Python not to use any separator and keep the line whole?
with open(fileIn) as txtextract:
    readlines = csv.reader(txtextract, delimiter="µ")
    for line in readlines:
        (...)
        LI_text = newline[107:155]
        LI_text.replace("|", "/")
        LI_text.replace("\"", "")  # use of escape char don't work.
Note: I am using Python 3.6.
You may use a regex:
In [1]: import re
In [2]: re.sub(r"\"", "", '"remove all "double quotes" from text"')
Out[2]: 'remove all double quotes from text'
In [3]: re.sub(r"(^\"|\"$)", "", '"remove all "only surrounding quotes" from text"')
Out[3]: 'remove all "only surrounding quotes" from text'
or add the quotechar='"' and quoting=csv.QUOTE_MINIMAL options to csv.reader(), like:
with open(fileIn) as txtextract:
    readlines = csv.reader(txtextract, delimiter="µ", quotechar='"', quoting=csv.QUOTE_MINIMAL)
    for line in readlines:
        (...)
Lesson: the string.replace() method does not change the string itself. The modified text must be stored back (string = string.replace(...)).
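Putting both points together, here is a minimal alternative sketch, not the original code: it splits on the actual '|' delimiter instead of slicing fixed character positions, the column index is hypothetical, and csv.QUOTE_NONE tells the reader to treat every '"' as an ordinary character instead of a quote:

import csv

with open(fileIn, newline="") as txtextract:
    reader = csv.reader(txtextract, delimiter="|", quoting=csv.QUOTE_NONE)
    for row in reader:
        description = row[3]                         # hypothetical description column
        description = description.replace('"', "")  # assign the result back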
I need a regex that can reliably find a " that occurs right before a "", but there are a lot of " characters before it as well.
For example:
{"Field":"String data "Other String Data""}
I need to fix an error I'm getting in the raw JSON string. I need to turn that "" into " and remove the extra " inside the value pair. If I don't remove these, I can't turn the string into an object so I can iterate through it.
I am importing this string into Python.
I have tried to figure out some lookbehinds and lookarounds, but they don't seem to be working.
For example, I tried this: (?=(?=(")).*"")
Have you tried just finding all "" and replacing them with a single "?
re.sub('""', '"', s)
Though this will work for your example, it can cause issues if a doubled double quote is intentional in a string.
You could use re.split to break down your string into parts that are between quotes, then replace the non-escaped inside quotes with properly escaped ones.
To break the string apart, you can use an expression that will find quoted character sequences that are followed by one of the JSON delimiters that can appear after a closing quote (i.e.: : , ] }):
s='{"Field":"String data "Other String Data""}'
import re
parts = re.split(r'(".*?"(?=[:,}\]]))',s)
fixed = "".join(re.sub(r'(?<!^)"(?!$)',r'\"',p) for p in parts)
print(parts) # ['{', '"Field"', ':', '"String data "Other String Data""', '}']
print(fixed) # {"Field":"String data \"Other String Data\""}
Obviously this will not cover all possible edge cases (otherwise JSON wouldn't need to escape quotes as it does) but, depending on your data, it may be sufficient.
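As a quick sanity check (using the fixed string from the snippet above), the repaired text now loads as JSON:

import json

obj = json.loads(fixed)
print(obj)  # {'Field': 'String data "Other String Data"'}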
I'm trying to remove the apostrophe from a string in python.
Here is what I am trying to do:
source = 'weatherForecast/dataRAW/2004/grib/tmax/'
destination= 'weatherForecast/csv/2004/tmax'
for file in sftp.listdir(source):
    filepath = source + str(file)
    subprocess.call(['degrib', filepath, '-C', '-msg', '1', '-Csv', '-Unit', 'm', '-namePath', destination, '-nameStyle', '%e_%R.csv'])
filepath currently comes out as the path wrapped in apostrophes.
i.e.
subprocess.call(['', 'weatherForecast/.../filename'])
and I want to get the path without the apostrophes
i.e.
subprocess.call(['', weatherForecast/.../filename])
I have tried source.strip(" ' ", ""), but it doesn't really do anything.
I have tried putting in print(filepath) or return(filepath), since these will remove the apostrophes, but they gave me syntax errors.
filepath = print(source + str(file))
^
SyntaxError: invalid syntax
I'm currently out of ideas. Any suggestions?
The strip method of a string object only removes matching characters from the ends of a string; it stops as soon as it encounters a character that is not in the set it was asked to remove.
To remove characters, replace them with the empty string.
s = s.replace("'", "")
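A quick illustration of the difference, using a throwaway sample string:
>>> "'weather'Forecast'".strip("'")
"weather'Forecast"
>>> "'weather'Forecast'".replace("'", "")
'weatherForecast'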
The accepted answer to this question is actually wrong and can cause lots of trouble. The strip method removes leading/trailing characters, so you use it when you have a character to remove from the start and the end.
If you use replace instead, you will change every occurrence in the string. Here is a quick example.
my_string = "'Hello rokman's iphone'"
my_string.replace("'", "")
The above code returns Hello rokmans iphone. As you can see, you lost the apostrophe before the s, which is not something you would want. However, I believe you are only parsing a path that does not contain that character, which is why replace was OK to use in that case.
For the solution, you are doing just one thing wrong: when you call the strip method you include spaces before and after the quote character. The right way to use it is like this.
my_string = "'Hello world'"
my_string.strip("'")
However, this assumes that you get '; if you get " from the response, you can change the quotes like this.
my_string = '"Hello world"'
my_string.strip('"')
Let's say doc.txt contains
a
b
c
d
and that my code is
f = open('doc.txt')
doc = f.read()
doc = doc.rstrip('\n')
print doc
why do I get the same values?
str.rstrip() removes the trailing newline, not all the newlines in the middle. You have one long string, after all.
Use str.splitlines() to split your document into lines without newlines; you can rejoin it if you want to:
doclines = doc.splitlines()
doc_rejoined = ''.join(doclines)
but now doc_rejoined will have all lines running together without a delimiter.
Because you read the whole document into one string that looks like:
'a\nb\nc\nd\n'
When you do a rstrip('\n') on that string, only the rightmost \n will be removed, leaving all the others untouched, so the string would look like:
'a\nb\nc\nd'
The solution would be to split the file into lines and then right strip every line. Or just replace all the newline characters with nothing: s.replace('\n', ''), which gives you 'abcd'.
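For the first option, one possible sketch (assuming the doc.txt from the question) reads and strips line by line:
lines = [line.rstrip('\n') for line in open('doc.txt')]  # ['a', 'b', 'c', 'd']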
rstrip strips trailing characters from the end of the whole string. If you were expecting it to work on individual lines, you'd need to split the string into lines first using something like doc.split('\n').
Try this instead:
with open('doc.txt') as f:
    for line in f:
        print line,
Explanation:
The recommended way to open a file is using with, which takes care of closing the file at the end
You can iterate over each line in the file using for line in f
There's no need to call rstrip() now, because we're reading and printing one line at a time
Consider using replace and replacing each instance of '\n' with ''. This would get rid of all the new line characters in the input text.
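For example (assuming doc holds the file contents read earlier):
doc = doc.replace('\n', '')  # 'abcd'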
Hey, I would like to declare a string in Python that is a long text (with line breaks and paragraphs). Is this possible?
If I just copy-paste the text into quotation marks, Python only recognizes the first line, and I have to manually remove all the line breaks if I want the entire text. If this is possible, would it still work if the text contains quotation marks ("")?
Use triple quotes:
mytext = """Some text
Some more text
etc...
"""
Surround your string content with """ to indicate a multi-line string.
>>> a = """
... this
... is
... a
... multi-line
... string
... """
>>> a
'\nthis\nis\na\nmulti-line\nstring\n'
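As for the second part of the question: double quotes are fine inside a triple-quoted string, as long as three quote characters in a row never appear. A small illustration:
>>> quoted = """He said "hello",
... then "goodbye"."""
>>> quoted
'He said "hello",\nthen "goodbye".'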
I have been using python with regex to clean up a text file. I have been using the following method and it has generally been working:
mystring = compiledRegex.sub("replacement",mystring)
The string in question is an entire text file that includes many embedded newlines. Some of the compiled regex's cover multiple lines using the re.DOTALL option. If the last character in the compiled regex is a \n the above command will substitute all matches of the regex except the match that ends with the final newline at the end of the string. In fact, I have had several other no doubt related problems dealing with newlines and multiple newlines when they appear at the very end of the string. Can anyone give me a pointer as to what is going on here? Thanks in advance.
If I understood you correctly and all you need is to get the text without a newline at the end of each line, and then iterate over this text in order to find a required word, then you can try the following:
data = (line for line in text.split('\n') if line.strip())  # gives you all non-empty lines without '\n' at the end
Now you can either search/replace any text you need using list slicing or regex functionality.
Or you can use replace in order to change all '\n' to whatever you want:
text = text.replace('\n', '')
My bet is that your file does not end with a newline...
>>> content = open('foo').read()
>>> print content
TOTAL:.?C2
abcTOTAL:AC2
defTOTAL:C2
>>> content
'TOTAL:.?C2\nabcTOTAL:AC2\ndefTOTAL:C2'
...so the last line does not match the regex:
>>> regex = re.compile('TOTAL:.*?C2\n', re.DOTALL)
>>> regex.sub("XXX", content)
'XXXabcXXXdefTOTAL:C2'
If that is the case, the solution is simple: just match either a newline or the end of the file (with $):
>>> regex = re.compile('TOTAL:.*?C2(\n|$)', re.DOTALL)
>>> regex.sub("XXX", content)
'XXXabcXXXdefXXX'
I can't get a good handle on what is going on from your explanation, but you may be able to fix it by replacing all multiple newlines with a single newline as you read in the file. Another option might be to just trim the regex, removing the \n at the end, unless you need it for something.
Is the question mark there to prevent the regex matching more than one line at a time? If so, then you probably want to be using the MULTILINE flag instead of the DOTALL flag. The ^ sign will now match just after a newline or the beginning of the string, and the $ sign will now match just before a newline character or the end of the string.
eg.
regex = re.compile('^TOTAL:.*$', re.MULTILINE)
content = regex.sub('', content)
However, this still leaves you with the problem of empty lines. But why not just run one additional regex at the end that removes blank lines?
regex = re.compile('\n{2,}')
content = regex.sub('\n', content)
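For illustration, a small end-to-end sketch of the two-step approach; the sample text below is made up and not from the original question:

import re

# Hypothetical sample text (not from the original question)
content = 'TOTAL: 1 C2\nkeep this line\nTOTAL: 2 C2\nkeep that line\n'

# Step 1: remove whole lines that start with TOTAL (MULTILINE makes ^ and $ work per line)
regex = re.compile('^TOTAL:.*$', re.MULTILINE)
content = regex.sub('', content)    # '\nkeep this line\n\nkeep that line\n'

# Step 2: collapse the blank lines left behind
regex = re.compile('\n{2,}')
content = regex.sub('\n', content)  # '\nkeep this line\nkeep that line\n'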