Function to get rid of delimiters in Python - python

I have a function where the user passes in a file and a string, and the code should get rid of the specified delimiters. I am having trouble finishing the part where I loop through my code and get rid of each of the replacements. I will post the code below.
def forReader(filename):
    try:
        # Opens up the file
        file = open(filename, "r")
        # Reads the lines in the file
        read = file.readlines()
        # Closes the file
        file.close()
        # Loops through the lines in the file
        for sentence in read:
            # Splits each line by a space
            line = sentence.split()
            replacements = (',', '-', '!', '?', '(', ')', '<', ' = ', ';')
            # Will loop through the space-delimited line and get rid
            # of the replacements
            for sentences in line:
                pass  # this is the part I am stuck on
    # Exception thrown if the file does not exist
    except FileNotFoundError:
        print('File is not created yet')

forReader("mo.txt")
mo.txt:
for ( int i;
After running the file mo.txt, I would like the output to look like this:
for int i

Here's a way to do this using regex. First, we create a pattern consisting of all the delimiter characters, being careful to escape them, since several of those characters have special meaning in a regex. Then we can use re.sub to replace each delimiter with an empty string. This process can leave us with two or more adjacent spaces, which we then need to replace with a single space.
The Python re module allows us to compile patterns that are used frequently. Theoretically, this can make them more efficient, but it's a good idea to test such patterns against real data to see if it does actually help. :)
import re
delimiters = ',-!?()<=;'
# Make a pattern consisting of all the delimiters
pat = re.compile('|'.join(re.escape(c) for c in delimiters))
s = 'for ( int i;'
# Remove the delimiters
z = pat.sub('', s)
#Clean up any runs of 2 or more spaces
z = re.sub(r'\s{2,}', ' ', z)
print(z)
output
for int i
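If you want to plug this back into a file-reading function like the one in the question, a minimal sketch could look like this (it assumes mo.txt sits next to the script and simply prints each cleaned line):
import re

delimiters = ',-!?()<=;'
pat = re.compile('|'.join(re.escape(c) for c in delimiters))

def forReader(filename):
    try:
        with open(filename, 'r') as file:
            for sentence in file:
                cleaned = pat.sub('', sentence)            # drop the delimiters
                cleaned = re.sub(r'\s{2,}', ' ', cleaned)  # collapse runs of spaces
                print(cleaned.strip())
    except FileNotFoundError:
        print('File is not created yet')

forReader("mo.txt")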

Related

Reading a text until a matched string and printing it as a single line

I am trying to write a program as a python beginner.
This function scans the given text. Whenever I come across the symbols “#, &, % or $”, I have to take the text until that point and print it as a single line. Also, I have to skip the symbol, start the next line from the letter right after the symbol and print it until I come across another symbol.
I hope I told it well.
symbols = "#, &, %, $"
for symbol in text:
    if text.startswith(symbols):
        print(text)
I know it's not correct but that was all I could think. Any help is appreciated.
IIUC, you need to split the string by each of the delimiters, so you could do the following:
symbols = "#, &, %, $".split(', ')
print(symbols) # this is a list
text = "The first # The second & and here you have % and finally $"
# make a copy of text
replaced = text[:]
# unify delimiters
for symbol in symbols:
    replaced = replaced.replace(symbol, '#')
print(replaced) # now the string only has # in the place of the other symbols
for chunk in replaced.split('#'):
    if chunk: # avoid printing empty strings
        print(chunk)
Output
['#', '&', '%', '$'] # print(symbols)
The first # The second # and here you have # and finally # # print(replaced)
The first
The second
and here you have
and finally
The first step:
symbols = "#, &, %, $".split(', ')
print(symbols) # this is a list
converts your string to a list. The second step uses replace to unify all the symbols into a single one, because str.split only splits on a single separator string:
# unify delimiters
for symbol in symbols:
    replaced = replaced.replace(symbol, '#')
The third and final step is to split the string by the chosen symbol (i.e. #):
for chunk in replaced.split('#'):
    if chunk: # avoid printing empty strings
        print(chunk)
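As a side note, if you are comfortable with regular expressions, the split can also be done in a single pass with re.split; this is just an alternative sketch, not part of the approach above:
import re

text = "The first # The second & and here you have % and finally $"

# Split on any of the four symbols in one pass
for chunk in re.split(r'[#&%$]', text):
    chunk = chunk.strip()
    if chunk:  # skip empty pieces
        print(chunk)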
If I follow the instructions given to you literally, I believe it translates into nothing fancier than just replacing each of those 4 special characters in the text by a newline character:
text = """This is a line with # and & in it.
This line has % and $ in it.
This line has nothing interesting in it.
This line has just # in it.
The end.
"""
text = text.replace('#', '\n').replace('&', '\n').replace('%', '\n').replace('$', '\n')
print(text)
Prints:
This is a line with
and
in it.
This line has
and
in it.
This line has nothing interesting in it.
This line has just
in it.
The end.
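If you prefer not to chain replace calls, str.translate can do the same mapping in one pass; this is just an alternative sketch of the same idea, using a shortened sample string:
# Map each of the four symbols to a newline in a single pass
text = "This is a line with # and & in it."
table = str.maketrans({'#': '\n', '&': '\n', '%': '\n', '$': '\n'})
print(text.translate(table))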

Capture ALL strings within a Python script with regex

This question was inspired by my failed attempts after trying to adapt this answer: RegEx: Grabbing values between quotation marks
Consider the following Python script (t.py):
print("This is also an NL test")
variable = "!\n"
print('And this has an escaped quote "don\'t" in it ', variable,
"This has a single quote ' but doesn\'t end the quote as it" + \
" started with double quotes")
if "Foo Bar" != '''Another Value''':
"""
This is just nonsense
"""
aux = '?'
print("Did I \"failed\"?", f"{aux}")
I want to capture all strings in it, as:
This is also an NL test
!\n
And this has an escaped quote "don\'t" in it
This has a single quote ' but doesn\'t end the quote as it
started with double quotes
Foo Bar
Another Value
This is just nonsense
?
Did I \"failed\"?
{aux}
I wrote another Python script using the re module and, from my attempts at regex, the one that finds most of them is:
import re

pattern = re.compile(r"""(?<=(["']\b))(?:(?=(\\?))\2.)*?(?=\1)""")

with open('t.py', 'r') as f:
    msg = f.read()

x = pattern.finditer(msg, re.DOTALL)
for i, s in enumerate(x):
    print(f'[{i}]', s.group(0))
with the following result:
[0] And this has an escaped quote "don\'t" in it
[1] This has a single quote ' but doesn\'t end the quote as it started with double quotes
[2] Foo Bar
[3] Another Value
[4] Did I \"failed\"?
Adding to my failures, I also couldn't fully replicate locally what I found with regex101.com.
I'm using Python 3.6.9, by the way, and I'm asking for more insights into regex to crack this one.
Because you want to match ''' or """ or ' or " as the delimiter, put all of that into the first group:
('''|"""|["'])
Don't put \b after it, because then it won't match strings when those strings start with something other than a word character.
Because you want to make sure that the final delimiter isn't treated as a starting delimiter when the engine starts the next iteration, you'll need to fully match it (not just lookahead for it).
The middle part to match anything but the delimiter can be:
((?:\\.|.)*?)
Put it all together:
('''|"""|["'])((?:\\.|.)*?)\1
and the result you want will be in the second capture group:
import re

pattern = re.compile(r"""(?s)('''|\"""|["'])((?:\\.|.)*?)\1""")

with open('t.py', 'r') as f:
    msg = f.read()

x = pattern.finditer(msg)
for i, s in enumerate(x):
    print(f'[{i}]', s.group(2))
https://regex101.com/r/dvw0Bc/1

How to find/replace non printable / non-ascii characters using Python 3?

I have a .csv file in which some lines are jamming up a database import because of funky characters in some field of the line.
I have searched, found articles on how to replace non-ascii characters in Python 3, but nothing works.
When I open the file in vi and do :set list, there is a $ at the end of a line where there should not be, and ^I^I at the beginning of the next line. The two lines should be one joined line and no ^I there. I know that $ is end of line '\n' and have tried to replace those, but nothing works.
I don't know what the ^I represents, possibly a tab.
I have tried this function to no avail:
import re
import string

def remove_non_ascii(text):
    new_text = re.sub(r"[\n\t\r]", "", text)
    new_text = ''.join(new_text.split("\n"))
    new_text = ''.join([i if ord(i) < 128 else ' ' for i in new_text])
    new_text = "".join([x for x in new_text if ord(x) < 128])
    new_text = re.sub(r'[^\x00-\x7F]+', ' ', new_text)
    new_text = new_text.rstrip('\r\n')
    new_text = new_text.strip('\n')
    new_text = new_text.strip('\r')
    new_text = new_text.strip('\t')
    new_text = new_text.replace('\n', '')
    new_text = new_text.replace('\r', '')
    new_text = new_text.replace('\t', '')
    new_text = filter(lambda x: x in string.printable, new_text)
    new_text = "".join(list(new_text))
    return new_text
Is there some tool that will show me exactly what the offending character is, and then a method to replace it?
I am opening the file like so (the .csv was saved as UTF-8)
f_csv_in = open(csv_in, "r", encoding="utf-8")
Below are the two lines that should be one, with the problem non-ASCII characters visible. Notice the $ (the newline vi shows) at the end of line 37, where I don't want it to be, and the ^I^I at the beginning of line 38.
37 Cancelled,01-19-17,,basket,00-00-00,00-00-00,,,,98533,SingleSource,,,17035 Cherry Hill Dr,"L/o 1-19-17 # 11:45am$
38 ^I^IVictorville",SAN BERNARDINO,CA,92395,,,,,0,,,,,Lock:6111 ,,,No,No,,0.00,0.00,No,01-19-17,0.00,0.00,,01-19-17,00-00-00,,provider,,,Unread,00-00-00,,$
A simple way to remove non-ASCII chars could be:
new_text = "".join([c for c in text if c.isascii()])
NB: If you are reading this text from a file, make sure you read it with the correct encoding
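Applied to a whole file, a minimal sketch could look like this (the file names are made up; it assumes the input really is UTF-8 and that you're on Python 3.7+ for str.isascii()):
def remove_non_ascii(text):
    # Keep only characters in the ASCII range
    return "".join(c for c in text if c.isascii())

with open("csv_in.csv", "r", encoding="utf-8") as f_in, \
        open("csv_clean.csv", "w", encoding="utf-8") as f_out:
    for line in f_in:
        f_out.write(remove_non_ascii(line))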
In the case of non-printable characters, the str.isprintable() method and the built-in string module (with its string.printable constant) give you ways of filtering out non-printable or non-ASCII characters.
A concise way of filtering the whole string at once is presented below
>>> import string
>>>
>>> str1 = '\nsomestring'
>>> str1.isprintable()
False
>>> str2 = 'otherstring'
>>> str2.isprintable()
True
>>>
>>> res = filter(lambda x: x in string.printable, '\x01mystring')
>>> "".join(list(res))
'mystring'
This question has had some discussion on SO in the past, but there are many ways to do things, so I understand it may be confusing: you can use anything from regular expressions to str.translate().
Another thing one could do is to take a look at Unicode Categories, and filter out your data based on the set of symbols you need.
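As a sketch of that last idea (the sample string is made up), you could drop every code point in the Unicode control-character category:
import unicodedata

sample = 'bad\x00line\twith\x1bcontrol chars'

# Remove code points whose Unicode category is "Cc" (control characters)
cleaned = ''.join(c for c in sample if unicodedata.category(c) != 'Cc')
print(cleaned)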
It looks as if you have a csv file that contains quoted values, that is values such as embedded commas or newlines which have to be surrounded with quotes so that csv readers handle them correctly.
If you look at the example data you can see there's an opening doublequote but no closing doublequote at the end of the first line, and a closing doublequote with no opening doublequote on the second line, indicating that the quotes contain a value with an embedded newline.
The fact that the lines are broken in two may be an artefact of the application used to view them, or the code that's processing them: if the software doesn't understand csv quoting it will assume each newline character denotes a new line.
It's not clear exactly what problem this is causing in the database, but it's quite likely that quote characters - especially unmatched quotes - could be causing a problem, particularly if the data isn't being properly escaped before insertion.
This snippet rewrites the file, removing embedded commas, newlines and tabs, and instructs the writer not to quote any values. It will fail with the error message _csv.Error: need to escape, but no escapechar set if it finds a value that needs to be escaped. Depending on your data, you may need to adjust the regex pattern.
import csv
import re

with open('lines.csv') as f, open('fixed.csv', 'w') as out:
    reader = csv.reader(f)
    writer = csv.writer(out, quoting=csv.QUOTE_NONE)
    for line in reader:
        new_row = [re.sub(r'\t|\n|,', ' ', x) for x in line]
        writer.writerow(new_row)
Another approach, using re in Python, to filter out characters that are not printable ASCII:
import re
import string
string_with_printable = re.sub(f'[^{re.escape(string.printable)}]', '', original_string)
re.escape escapes special characters in the given pattern.
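For completeness, a runnable version of that snippet (the sample string is made up):
import re
import string

original_string = 'csv field\x00 with\x1b stray control chars and é'

# Keep only the characters that appear in string.printable
string_with_printable = re.sub(f'[^{re.escape(string.printable)}]', '', original_string)
print(string_with_printable)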

Python: Extract hashtags out of a text file

So, I've written the code below to extract hashtags and also tags with '@', then append them to a list and sort them in descending order. The thing is that the text might not be perfectly formatted and may lack spaces between individual hashtags, so the following problem can occur, as can be checked with the #print statement inside the for loop:
#socality#thisismycommunity#themoderndayexplorer#modernoutdoors#mountaincultureelevated
So, the .split() method doesn't deal with those. What would be the best practice to this issue?
Here is the .txt file
Grateful for your time.
name = input("Enter file:")
if len(name) < 1 : name = "tags.txt"
handle = open(name)
tags = dict()
lst = list()

for line in handle :
    hline = line.split()
    for word in hline:
        if word.startswith('#') : tags[word] = tags.get(word,0) + 1
        else :
            tags[word] = tags.get(word,0) + 1
            #print(word)

for k,v in tags.items() :
    tags_order = (v,k)
    lst.append(tags_order)

lst = sorted(lst, reverse=True)[:34]
print('Final Dictionary: ' , '\n')
for v,k in lst :
    print(k , v, '')
Use a regular expression. There are only a few limits; a tag must start with either @ or #, and it may not contain any spaces or other whitespace characters.
This code
import re

tags = []
with open('../Downloads/tags.txt', 'r') as file:
    for line in file:
        tags += re.findall(r'[@#][^\s@#]+', line)
creates a list of all tags in the file. You can easily adjust it to store the found tags in your dictionary; instead of storing the result straight away in tags, loop over it and do with each item as you please.
The regex is built up from these two custom character classes:
[@#] - either the single character @ or # at the start
[^\s@#]+ - a sequence of characters that are not any whitespace character (\s matches all whitespace such as space, tab, and newlines), not @, and not #; at least one, and as many as possible.
So findall starts matching at the start of any tag and then grabs as much as it can, stopping only when encountering any of the "not" characters.
findall returns a list of matching items, which you can immediately add to an existing list, or loop over the found items in turn:
for tag in re.findall(r'[@#][^\s@#]+', line):
    # process "tag" any way you want here
The source text file contains Windows-style \r\n line endings, and so I initially got a lot of empty "lines" on my Mac. Opening the text file in text mode (which handles universal newlines by default in Python 3) makes sure that is handled transparently by the line-reading part of Python.
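Putting that together with the counting from the question, a minimal sketch might look like this (it uses collections.Counter instead of a plain dict, and assumes the file is named tags.txt):
import re
from collections import Counter

tag_pattern = re.compile(r'[@#][^\s@#]+')

counts = Counter()
with open('tags.txt') as file:
    for line in file:
        counts.update(tag_pattern.findall(line))

# Print the 34 most common tags, most frequent first
for tag, count in counts.most_common(34):
    print(tag, count)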

Python re.sub with a string that contains a special character without modifying the string

My problem is that in order to do the sub I need to escape a special character, but I don't want to modify the string that I'm substituting in. Does Python have a way to handle this?
import re

fstring = r'C:\Temp\1_file.txt'  # this is the new data that I want to substitute in
old_data = r'Some random text\n .*'  # I'm looking for this in the file that I'll read
new_data = r'Some random text\n ' + fstring  # I want to change it to this

f = open(myfile, 'r')  # open the file
filedata = f.read()  # read the file
f.close()

newfiledata = re.sub(old_data, new_data, filedata)  # substitute the new data
An error is returned because the "\1" in fstring is seen as a group reference.
error: invalid group reference
Escape any backslashes in the new string:
new_data = r'Some random text\n ' + fstring.replace('\\', r'\\')
\1 would normally imply a backreference to group 1 of the regex match, but that is not what you intend here (in fact there is no group 1, which is the reason for the error); you'll therefore need to escape the string so re does not treat the \1 as a metacharacter:
fstring = r'C:\Temp\\1_file.txt'
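Another option, if you would rather not touch fstring at all, is to pass a function as the replacement: re.sub then uses the function's return value literally, so a \1 inside it is never interpreted as a group reference. A small sketch with a made-up input string:
import re

fstring = r'C:\Temp\1_file.txt'   # replacement text that contains "\1"
filedata = 'path = OLD_PATH'      # example input (made up)

# The function's return value is inserted verbatim, with no escape processing
newfiledata = re.sub(r'OLD_PATH', lambda m: fstring, filedata)
print(newfiledata)  # path = C:\Temp\1_file.txt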
