Strip "\n" from a multiline string, except when inline - python

If I have a multiline string that contains "\n" as part of the text itself, for example:
python_text = """
#commands.command()
def multi():
    print("Three\nLines\nHere")
"""
How can I remove all newlines except those in the text itself? I've tried creating a list with splitlines() but the multiline string will also be split on the \n's in the text itself. Using splitlines(True) in combination with strip() doesn't work for the same reasons.
print(python_text.splitlines())
Output (formatted):
[
    '#commands.command()',
    'def multi():',
    '    print("Three',
    'Lines',
    'Here")'
]
Whilst my desired output is:
[
    '#commands.command()',
    'def multi():',
    '    print("Three\nLines\nHere")'
]
(Or a multiline string print instead of list print, but in the same format)
Is there any way I can only strip the 'trailing' newline characters from a multiline string?
If anything is unclear please let me know and I'll try to explain further.
Edit: Corrected explanation about splitlines() and strip().

You have to escape your newlines. In this case you can just make it a raw string literal:
python_text = r"""
#commands.command()
def multi():
    print("Three\nLines\nHere")
"""
print(python_text.splitlines())
>>> ['', '#commands.command()', 'def multi():', '    print("Three\\nLines\\nHere")']
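To see why the raw string helps, here is a minimal sketch comparing the two kinds of literal. It assumes the text is written as a literal in your own source; the names `plain` and `raw` are only illustrative, and `strip("\n")` is added to drop the leading and trailing blank lines.

```python
# Without the r prefix, Python turns each \n inside the literal into a real
# newline before splitlines() ever sees it.
plain = """
def multi():
    print("Three\nLines\nHere")
"""

# With the r prefix, the backslash and the n stay as two ordinary characters.
raw = r"""
def multi():
    print("Three\nLines\nHere")
"""

# strip("\n") removes only the newlines at the very start and end.
plain_lines = plain.strip("\n").splitlines()
raw_lines = raw.strip("\n").splitlines()

print(len(plain_lines))  # 4: the print call is split over three lines
print(len(raw_lines))    # 2: the print call stays on one line
print(raw_lines[1])
```

If the text comes from a file rather than a literal, the backslash and `n` are already two separate characters, so `splitlines()` should leave them alone without any extra work.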

Related

Deleting certain characters from a string

I'm trying to figure out how to delete certain characters from a string. Unfortunately, my attempt doesn't work. I would appreciate any help.
def delete_char(string):
    string = list(string)
    string.remove("\n")
    return ''.join(string)

delete_char("I want \n to test \n if you \n work")
How about using replace, instead?
def delete_char(string, target_char, replacement_char=""):
    return string.replace(target_char, replacement_char)

print(delete_char("I want \n to test \n if you \n work", "\n"))
You need to re-assign the string to the result of the removal. Additionally, I would suggest using replace instead of remove here, replacing the newline with an empty string. Something like this should work:
def delete_char(string):
    string = string.replace("\n", "")
    return string
You could use str.split and str.join:
>>> ' '.join("I want \n to test \n if you \n work".split())
'I want to test if you work'
This isn't the same as just removing the newline character, but it will ensure there is only one space between words.
Otherwise just replace the newline with nothing (note that the spaces on either side of each newline remain):
>>> "I want \n to test \n if you \n work".replace('\n', '')
'I want  to test  if you  work'
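To put the approaches from these answers side by side, here is a small sketch. The variable names are only illustrative; the input is the string from the question.

```python
text = "I want \n to test \n if you \n work"

# list.remove() deletes only the first matching element, so the
# original attempt leaves the other two newlines in place.
chars = list(text)
chars.remove("\n")
after_remove = "".join(chars)

# str.replace() returns a new string with every occurrence replaced.
after_replace = text.replace("\n", "")

# split() with no arguments splits on any run of whitespace, so joining
# with a single space also normalises the spacing.
after_split_join = " ".join(text.split())

print(after_remove.count("\n"))   # 2
print(repr(after_replace))        # 'I want  to test  if you  work'
print(repr(after_split_join))     # 'I want to test if you work'
```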

Capture ALL strings within a Python script with regex

This question was inspired by my failed attempts after trying to adapt this answer: RegEx: Grabbing values between quotation marks
Consider the following Python script (t.py):
print("This is also an NL test")
variable = "!\n"
print('And this has an escaped quote "don\'t" in it ', variable,
      "This has a single quote ' but doesn\'t end the quote as it" + \
      " started with double quotes")
if "Foo Bar" != '''Another Value''':
    """
    This is just nonsense
    """
aux = '?'
print("Did I \"failed\"?", f"{aux}")
I want to capture all strings in it, as:
This is also an NL test
!\n
And this has an escaped quote "don\'t" in it
This has a single quote ' but doesn\'t end the quote as it
started with double quotes
Foo Bar
Another Value
This is just nonsense
?
Did I \"failed\"?
{aux}
I wrote another Python script using the re module. Of all my regex attempts, this is the one that finds the most strings:
import re
pattern = re.compile(r"""(?<=(["']\b))(?:(?=(\\?))\2.)*?(?=\1)""")
with open('t.py', 'r') as f:
    msg = f.read()
x = pattern.finditer(msg, re.DOTALL)
for i, s in enumerate(x):
    print(f'[{i}]',s.group(0))
with the following result:
[0] And this has an escaped quote "don\'t" in it
[1] This has a single quote ' but doesn\'t end the quote as it started with double quotes
[2] Foo Bar
[3] Another Value
[4] Did I \"failed\"?
To add to my failures, I also couldn't fully replicate what I found with regex101.com.
I'm using Python 3.6.9, by the way, and I'm asking for more insights into regex to crack this one.
Because you want to match ''' or """ or ' or " as the delimiter, put all of that into the first group:
('''|"""|["'])
Don't put \b after it, because then it won't match strings when those strings start with something other than a word character.
Because you want to make sure that the final delimiter isn't treated as a starting delimiter when the engine starts the next iteration, you'll need to fully match it (not just lookahead for it).
The middle part to match anything but the delimiter can be:
((?:\\.|.)*?)
Put it all together:
('''|"""|["'])((?:\\.|.)*?)\1
and the result you want will be in the second capture group:
import re
pattern = re.compile(r"""(?s)('''|\"""|["'])((?:\\.|.)*?)\1""")
with open('t.py', 'r') as f:
    msg = f.read()
x = pattern.finditer(msg)
for i, s in enumerate(x):
    print(f'[{i}]',s.group(2))
https://regex101.com/r/dvw0Bc/1
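For a version that runs without t.py on disk, here is a small sketch applying the same pattern to an inline sample. The `source` string is a shortened stand-in for the script in the question, not the full file, and `found` is just an illustrative name.

```python
import re

# Same pattern as above; (?s) lets "." match newlines so triple-quoted
# strings spanning several lines are captured too.
pattern = re.compile(r"""(?s)('''|\"""|["'])((?:\\.|.)*?)\1""")

# A small stand-in for t.py, kept inline so the sketch runs on its own.
source = r'''
variable = "!\n"
print('escaped quote "don\'t" here', variable)
if "Foo Bar" != """Another Value""":
    aux = '?'
'''

# Group 1 is the delimiter; group 2 is the text between the delimiters.
found = [m.group(2) for m in pattern.finditer(source)]
for i, s in enumerate(found):
    print(f"[{i}]", s)
```

Keep in mind that a regex like this cannot tell a real string literal from a quote character that happens to sit inside a comment. If that matters, the standard library's tokenize module may be a more robust alternative.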

How to find/replace non printable / non-ascii characters using Python 3?

I have a file, some lines in a .csv file that are jamming up a database import because of funky characters in some field in the line.
I have searched, found articles on how to replace non-ascii characters in Python 3, but nothing works.
When I open the file in vi and do :set list, there is a $ at the end of a line where there should not be, and ^I^I at the beginning of the next line. The two lines should be one joined line and no ^I there. I know that $ is end of line '\n' and have tried to replace those, but nothing works.
I don't know what the ^I represents, possibly a tab.
I have tried this function to no avail:
def remove_non_ascii(text):
    new_text = re.sub(r"[\n\t\r]", "", text)
    new_text = ''.join(new_text.split("\n"))
    new_text = ''.join([i if ord(i) < 128 else ' ' for i in new_text])
    new_text = "".join([x for x in new_text if ord(x) < 128])
    new_text = re.sub(r'[^\x00-\x7F]+', ' ', new_text)
    new_text = new_text.rstrip('\r\n')
    new_text = new_text.strip('\n')
    new_text = new_text.strip('\r')
    new_text = new_text.strip('\t')
    new_text = new_text.replace('\n', '')
    new_text = new_text.replace('\r', '')
    new_text = new_text.replace('\t', '')
    new_text = filter(lambda x: x in string.printable, new_text)
    new_text = "".join(list(new_text))
    return new_text
Is there some tool that will show me exactly what this offending character is, so that I can then find a method to replace it?
I am opening the file like so (the .csv was saved as UTF-8)
f_csv_in = open(csv_in, "r", encoding="utf-8")
Below are two lines that should be one with the problem non-ascii characters visible.
These two lines should be one line. Notice the $ at the end of line 37, and line 38 begins with ^I^I.
Part of the problem, that vi is showing, is that there is a new line $ on line 37 where I don't want it to be. This should be one line.
37 Cancelled,01-19-17,,basket,00-00-00,00-00-00,,,,98533,SingleSource,,,17035 Cherry Hill Dr,"L/o 1-19-17 # 11:45am$
38 ^I^IVictorville",SAN BERNARDINO,CA,92395,,,,,0,,,,,Lock:6111 ,,,No,No,,0.00,0.00,No,01-19-17,0.00,0.00,,01-19-17,00-00-00,,provider,,,Unread,00-00-00,,$
A simple way to remove non-ASCII characters is:
new_text = "".join([c for c in text if c.isascii()])
NB: If you are reading this text from a file, make sure you read it with the correct encoding. Note that str.isascii() requires Python 3.7 or later.
For non-printable characters, the str.isprintable() method and the string.printable constant from the built-in string module can be used to detect or filter them.
A concise way of filtering the whole string at once is presented below:
>>> import string
>>>
>>> str1 = '\nsomestring'
>>> str1.isprintable()
False
>>> str2 = 'otherstring'
>>> str2.isprintable()
True
>>>
>>> res = filter(lambda x: x in string.printable, '\x01mystring')
>>> "".join(list(res))
'mystring'
This question has had some discussion on SO in the past, but there are many ways to do this, from regular expressions to str.translate(), so I understand it may be confusing.
Another thing one could do is to take a look at Unicode Categories, and filter out your data based on the set of symbols you need.
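As for finding out exactly which character is causing trouble, here is a minimal sketch built on the standard unicodedata module. The helper name `describe_suspects` and the sample string are made up for illustration. For reference, vi's :set list displays a tab as ^I, so the ^I^I in the question is two tab characters.

```python
import unicodedata

def describe_suspects(text):
    """Return (index, repr, category) for each non-ASCII or control character."""
    suspects = []
    for index, ch in enumerate(text):
        category = unicodedata.category(ch)
        # Categories starting with "C" cover control and format characters,
        # which includes "\n", "\t" and "\r" (all category "Cc").
        if ord(ch) > 127 or category.startswith("C"):
            suspects.append((index, repr(ch), category))
    return suspects

sample = 'a\n\t\tb\u00e9'
for suspect in describe_suspects(sample):
    print(suspect)
```

Running it line by line over the file should show the position and Unicode category of anything unusual. Note that string.printable itself contains "\n" and "\t", so filtering against it will not remove those two.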
It looks as if you have a csv file that contains quoted values, that is, values containing embedded commas or newlines, which have to be surrounded with quotes so that csv readers handle them correctly.
If you look at the example data you can see there's an opening doublequote but no closing doublequote at the end of the first line, and a closing doublequote with no opening doublequote on the second line, indicating that the quotes contain a value with an embedded newline.
The fact that the lines are broken in two may be an artefact of the application used to view them, or the code that's processing them: if the software doesn't understand csv quoting it will assume each newline character denotes a new line.
It's not clear exactly what problem this is causing in the database, but it's quite likely that quote characters - especially unmatched quotes - could be causing a problem, particularly if the data isn't being properly escaped before insertion.
This snippet rewrites the file, removing embedded commas, newlines and tabs, and instructs the writer not to quote any values. It will fail with the error message _csv.Error: need to escape, but no escapechar set if it finds a value that needs to be escaped. Depending on your data, you may need to adjust the regex pattern.
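import csv
import re

# newline='' lets the csv module handle embedded newlines itself
with open('lines.csv', newline='') as f, open('fixed.csv', 'w', newline='') as out:
    reader = csv.reader(f)
    writer = csv.writer(out, quoting=csv.QUOTE_NONE)
    for line in reader:
        # replace tabs, newlines and commas inside each value with a space
        new_row = [re.sub(r'\t|\n|,', ' ', x) for x in line]
        writer.writerow(new_row)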
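To check that the csv module really treats the two physical lines as one record, here is a small in-memory sketch. The `raw` string is a shortened, made-up version of the sample lines from the question, and io.StringIO stands in for the file.

```python
import csv
import io
import re

# Two physical lines, but one CSV record: the quoted field contains a newline.
raw = 'Cancelled,"L/o 1-19-17 # 11:45am\n\t\tVictorville",CA\n'

rows = list(csv.reader(io.StringIO(raw, newline="")))
print(len(rows))   # 1

# Collapse each run of tabs/newlines inside a field into a single space.
cleaned = [re.sub(r"[\t\n]+", " ", field) for field in rows[0]]
print(cleaned)     # ['Cancelled', 'L/o 1-19-17 # 11:45am Victorville', 'CA']
```

Collapsing runs with `[\t\n]+` is a variation on the pattern above (which replaces each character separately); it avoids leaving three spaces where the newline and two tabs used to be.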
Another approach, using re, to filter out every character that is not printable ASCII:
import re
import string
string_with_printable = re.sub(f'[^{re.escape(string.printable)}]', '', original_string)
re.escape escapes special characters in the given pattern.

A way to remove all occurrences of words within brackets in a string?

I'm trying to find a way to delete all mentions of references in a text file.
I haven't tried much, as I am new to Python but thought that this is something that Python could do.
def remove_bracketed_words(text_from_file: string) -> string:
    """Remove all occurrences of words with brackets surrounding them,
    including the brackets.
    >>> remove_bracketed_words("nonsense (nonsense, 2015)")
    "nonsense "
    >>> remove_bracketed_words("qwerty (qwerty) dkjah (Smith, 2018)")
    "qwerty dkjah "
    """
    with open('random_text.txt') as file:
        wholefile = f.read()
        for '(' in
I have no idea where to go from here or if what I've done is right. Any suggestions would be helpful!
You'll have an easier time with a text editing program that handles regular expressions, like Notepad++, than learning Python for this one task (reading in a file, correcting fundamental errors like for '(' in..., etc.). You can even use tools available online for this, such as RegExr (a regular expression tester). In RegExr, write an appropriate expression into the "expression" field and paste your text into the "text" field. Then, in the "tools" area below the text, choose the "replace" option and remove the placeholder expression. Your cleaned-up text will appear there.
You're looking for a space, then a literal opening parenthesis, then some characters, then a comma, then a year (let's just call that 3 or 4 digits), then a literal closing parenthesis, so I'd suggest the following expression:
\(.*?, \d{3,4}\)
This will preserve non-citation parenthesized text and remove the leading space before a citation.
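If you do want to try that expression in Python, here is a small sketch. It assumes the expression starts with a literal space, as described above. The `strict` variant is my own assumption rather than part of the suggestion: it forbids parentheses inside the match, because the lazy `.*?` can otherwise start at an earlier, unrelated opening parenthesis.

```python
import re

text = "qwerty (qwerty) dkjah (Smith, 2018)"

# The expression described above, with its leading space.
lazy = re.compile(r" \(.*?, \d{3,4}\)")

# Variant (an assumption, not from the answer): disallow "(" and ")"
# inside the citation so the match cannot span two bracketed groups.
strict = re.compile(r" \([^()]*, \d{3,4}\)")

print(repr(lazy.sub("", text)))    # 'qwerty'
print(repr(strict.sub("", text)))  # 'qwerty (qwerty) dkjah'
```

On text where a plain parenthetical comes before a citation on the same line, the stricter class seems to be the safer choice.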
Try re
>>> import re
>>> re.sub(r'\(.*?\)', '', 'nonsense (nonsense, 2015)')
'nonsense '
>>> re.sub(r'\(.*?\)', '', 'qwerty (qwerty) dkjah (Smith, 2018)')
'qwerty  dkjah '
import re

def remove_bracketed_words(text_from_file: str) -> str:
    """Remove all occurrences of words with brackets surrounding them,
    including the brackets.
    >>> remove_bracketed_words("nonsense (nonsense, 2015)")
    'nonsense '
    >>> remove_bracketed_words("qwerty (qwerty) dkjah (Smith, 2018)")
    'qwerty  dkjah '
    """
    return re.sub(r'\(.*?\)', '', text_from_file)

with open('random_text.txt', 'r') as file:
    wholefile = file.read()
# Be careful with 'w': it truncates the file, so read the original contents first.
with open('random_text.txt', 'w') as file:
    file.write(remove_bracketed_words(wholefile))

Remove special characters (in list) in string

I have a bunch of special characters which are in a list like:
special=[r'''\\''', r'''+''', r'''-''', r'''&''', r'''|''', r'''!''', r'''(''', r''')''', r'''{''', r'''}''',\
r'''[''', r''']''', r'''^''', r'''~''', r'''*''', r'''?''', r''':''', r'''"''', r''';''', r''' ''']
And I have a string:
stringer="Müller my [ string ! is cool^&"
How do I make this replacement? I am expecting:
stringer = "Müller my string is cool"
Also, is there some builtin to replace these ‘special’ chars in Python?
This can be solved with a simple generator expression:
>>> ''.join(ch for ch in stringer if ch not in special)
'Müllermystringiscool'
Note that this also removes the spaces, since they're in your special list (the last element). If you don't want them removed, either don't include the space in special or modify the if check accordingly.
If you remove the space from your specials you can do it using re.sub() but note that first you need to escape the special regex characters.
In [58]: special=[r'''\\''', r'''+''', r'''-''', r'''&''', r'''|''', r'''!''', r'''(''', r''')''', r'''{''', r'''}''',\
r'''[''', r''']''', r'''^''', r'''~''', r'''*''', r'''?''', r''':''', r'''"''', r''';''']
In [59]: print(re.sub(r"[{}]".format(re.escape(''.join(special))), '', stringer, flags=re.U))
Müller my  string  is cool
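As for a builtin: in Python 3, str.translate with a table from str.maketrans is one option. Below is a minimal sketch. Note that `special` here is my own single-character version of the list from the question (the space is left out and the backslash is a single character), and `table`, `stripped` and `tidy` are illustrative names.

```python
# Assumed variant of the question's list: one character per entry, no space.
special = ['\\', '+', '-', '&', '|', '!', '(', ')', '{', '}',
           '[', ']', '^', '~', '*', '?', ':', '"', ';']

stringer = "Müller my [ string ! is cool^&"

# Third argument to str.maketrans: every character listed is deleted.
table = str.maketrans("", "", "".join(special))
stripped = stringer.translate(table)

# Removing characters can leave doubled spaces; split/join tidies them.
tidy = " ".join(stripped.split())

print(repr(stripped))  # 'Müller my  string  is cool'
print(tidy)            # Müller my string is cool
```

No regex escaping is needed with this approach, since the characters are used as plain dictionary keys rather than as part of a pattern.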
