Python Search & Replace With Regex - python

I am trying to replace every occurrence of a Regex expression in a file using Python with this code:
import re
def cleanString(string):
    string = string.replace(" ", "_")
    string = string.replace('_"', "")
    string = string.replace('"', '')
    return string
test = open('test.t.txt', "w+")
test = re.sub(r':([\"])(?:(?=(\\?))\2.)*?\1', cleanString(r':([\"])(?:(?=(\\?))\2.)*?\1'), test)
However, when I run the script I am getting the following error:
Traceback (most recent call last):
File "C:/Python27/test.py", line 10, in <module>
test = re.sub(r':([\"])(?:(?=(\\?))\2.)*?\1', cleanString(r':([\"])(?:(?=(\\?))\2.)*?\1'), test)
File "C:\Python27\lib\re.py", line 155, in sub
return _compile(pattern, flags).sub(repl, string, count)
TypeError: expected string or buffer
I think it is reading the file incorrectly, but I'm not sure what the actual issue is here.

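If your cleanString function does not return anything, it returns None, which leads to a "NoneType" error, so make sure it ends with return string. The TypeError in the traceback has a second cause: open() returns a file object, while re.sub expects a string as its third argument, so read the file's contents before substituting. Note also that opening the file with "w+" truncates it.
You probably want to do something like: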
def cleanString(string):
    string = string.replace(" ", "_")
    string = string.replace('_"', "")
    string = string.replace('"', '')
    return string
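A minimal sketch of how the pieces might fit together, assuming the goal is to run cleanString over every match of the pattern. In the question, cleanString is called on the pattern text itself, so the replacement is a fixed string. Passing a callable to re.sub instead applies the cleanup to each match. The names clean_string, clean_quoted and clean_file are illustrative, not from the original post.

```python
import re

# Pattern from the question: a colon followed by a double-quoted value.
QUOTED = re.compile(r':([\"])(?:(?=(\\?))\2.)*?\1')


def clean_string(text):
    text = text.replace(" ", "_")
    text = text.replace('_"', "")
    return text.replace('"', "")


def clean_quoted(text):
    # A callable replacement receives each match object in turn,
    # so the cleanup runs on the matched text, not on the pattern.
    return QUOTED.sub(lambda m: clean_string(m.group(0)), text)


def clean_file(path):
    # Hypothetical helper: read the whole file as a string first,
    # then reopen it for writing once the new text is ready.
    with open(path) as f:
        text = f.read()
    with open(path, "w") as f:
        f.write(clean_quoted(text))
```

Calling clean_file('test.t.txt') would rewrite the file in place, so try it on a copy first.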

Related

Replace Carriage Return (CR) and Carriage Return and Line Feed (CRLF) python

I import a .csv file that looks like:
by using the following code:
filetoread = 'G:/input.csv'
filetowrite = 'G:/output.csv'
# https://stackoverflow.com/questions/17658055/how-can-i-remove-carriage-return-from-a-text-file-with-python/42442131
with open(filetoread, "rb") as inf:
    with open(filetowrite, "wb") as fixed:
        for line in inf:
            # line = line.replace('\r\r\n', 'r\n')
            fixed.write(line)
            print(line)
Which gives the output:
b'\xef\xbb\xbfHeader1;Header2;Header3;Header4;Header5;Header6;Header7;Header8;Header9;Header10;Header11;Header12\r\n'
b';;;1999-01-01;;;01;;;;;;\r\n'
b';;;2000-01-01;;;12;;"NY123456789\r\r\n'
b'";chapter;2020-01-01 00:00:00;PE\r\n'
b';;;2020-01-01;;;45;;"NY123456789\r\r\n'
b'";chapter;1900-01-01 00:00:00;PE\r\n'
b';;;1999-01-01;;;98;;;;;;\r\n'
I am having trouble replacing \r\r\n with \r\n, which I think I need to do to get my desired output.
The error I get when I try to replace \r\r\n is:
Traceback (most recent call last):
File "g:/till_format_2.py", line 10, in <module>
line = line.replace('\r\r\n', 'r\n')
TypeError: a bytes-like object is required, not 'str'
My desired output:
What do I need to add or change to the code to achieve my desired output?
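As the error message says, the file is opened in binary mode, so supply bytes objects:
line = line.replace(b'\r\r\n', b'\r\n')
To get the desired output, remove the stray sequence entirely so that the split record is joined with the line that follows it:
line = line.replace(b'\r\r\n', b'')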
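Both variants can be wrapped in one small function. This is only a sketch: repair_line and its join_records flag are hypothetical names, and the sample bytes are taken from the output shown in the question.

```python
def repair_line(line, join_records=False):
    # The file is read in binary mode, so both arguments to replace()
    # must be bytes, not str.
    replacement = b"" if join_records else b"\r\n"
    return line.replace(b"\r\r\n", replacement)


broken = b';;;2000-01-01;;;12;;"NY123456789\r\r\n'
print(repair_line(broken))                     # ends with a single \r\n
print(repair_line(broken, join_records=True))  # line break removed entirely
```

In the loop from the question, this would be used as fixed.write(repair_line(line, join_records=True)).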

Can't convert tuple to string what can I do?

I am trying to turn out_chars from a tuple into a string. It is proving troublesome because there is a while loop and states is defined as a tuple. What should I do?
I tried writing a convertString function, but it was not successful.
out_chars = []
string = ()
for i, char_token in enumerate(computer_response_generator):
    out_chars.append(chars[char_token])
    print(possibly_escaped_char(out_chars), end='', flush=True)
    states = forward_text(net, sess, states, relevance, vocab, chars[char_token])
    if i >= max_length:
        break
states = forward_text(net, sess, states, relevance, vocab, sanitize_text(vocab, "\n> "))
states = convertTuple(states)
string = convertTuple(out_chars)
print(Text_to_sp(string, states))
Traceback (most recent call last):
File "/Users/quanducduy/anaconda3/chatbot-rnn-master/chatbot.py", line 358, in <module>
main()
File "/Users/quanducduy/anaconda3/chatbot-rnn-master/chatbot.py", line 44, in main
sample_main(args)
File "/Users/quanducduy/anaconda3/chatbot-rnn-master/chatbot.py", line 92, in sample_main
args.relevance, args.temperature, args.topn, convertTuple)
File "/Users/quanducduy/anaconda3/chatbot-rnn-master/chatbot.py", line 169, in chatbot
print(Text_to_sp(string, states))
File "/Users/quanducduy/anaconda3/chatbot-rnn-master/Text_to_speech.py", line 28, in Text_to_sp
myobj.save("welcome.mp3")
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/gtts/tts.py", line 249, in save
self.write_to_fp(f)
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/gtts/tts.py", line 182, in write_to_fp
text_parts = self._tokenize(self.text)
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/gtts/tts.py", line 144, in _tokenize
text = text.strip()
AttributeError: 'tuple' object has no attribute 'strip'
Process finished with exit code 1
What does your tuple contain? Complex objects, or simple strings, numbers, and so on?
Your problem is hard to understand from what you have posted above, but if you want to convert a tuple to a string you can do it like this:
new_str = ''.join(yourtuple)
I'm not sure I understand your question correctly, but if you want to make a string from a tuple, it's really simple.
def convertTuple(tup):
    # join the elements into a single string
    return ''.join(tup)
# avoid naming variables str or tuple: that would shadow the built-ins
letters = ('g', 'e', 'e', 'k', 's')
result = convertTuple(letters)
print(result)
If you cannot ensure all the elements of the tuple are strings, you have to cast them.
''.join([str(elem) for elem in myTuple])
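The two answers above can be combined into one helper. This is a sketch under the assumption that every element has a sensible str() form; tuple_to_string and its sep parameter are illustrative names.

```python
def tuple_to_string(items, sep=""):
    # str() is applied to every element, so non-string items
    # (numbers, for example) do not make join() raise a TypeError.
    return sep.join(str(item) for item in items)


print(tuple_to_string(('g', 'e', 'e', 'k', 's')))
print(tuple_to_string(('a', 1, 2.5), sep='-'))
```

The result is a plain str, which is what text.strip() inside gTTS expects according to the traceback.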

Issue with list indexing while converting letters (devnagari to english)

I am currently trying to map Devanagari script to English letters, but once in a while I run into the error "list index out of range". I don't want to miss any list, which is why I would rather not use error handling unless it is necessary. Could you please look at my script and help me understand why this error occurs?
In my word file I have located the word that causes the error, but if I use only a couple of sentences above and below that word, the error does not occur. So I think the error happens at a specific string length.
clean=[]
dafuq=[]
clean_list = []
replacements = {'अ':'A','आ':'AA', 'इ':'I', 'ई':'II', 'उ':'U','ऊ':'UU', 'ए':'E', 'ऐ':'AI',
'ओ':'O','औ':'OU', 'क':'KA', 'ख':'KHA', 'ग':'GA', 'घ':'GHA', 'ङ':'NGA',
'च':'CA','छ':'CHHA', 'ज':'JA', 'झ':'JHA','ञ':'NIA', 'ट':'TA', 'ठ':'THA',
'ड':'DHA','ढ':'DHHA', 'ण':'NAE', 'त':'TA', 'थ':'THA','द':'DA', 'ध':'DHA',
'न':'NA','प':'PA', 'फ':'FA', 'ब':'B', 'भ':'BHA', 'म':'MA','य':'YA', 'र':'RA',
'ल':'L','व':'WA', 'स':'SA', 'ष':'SHHA', 'श':'SHA', 'ह':'HA', '्':'A',
'ऋ':'RI', 'ॠ':'RI','ऌ':'LI','ॐ':'OMS', 'ः':' ', 'ँ':'U',
'ं':'M', 'ृ':'RI', 'ा':'AA', 'ी':'II', 'ि':'I', 'े':'E', 'ै':'AI',
'ो':'O','ौ':'OU','ु' :'U','ू':'UU' }
import unicodedata
from functools import reduce
def reducer(r, v):
    if unicodedata.category(v) in ('Mc', 'Mn'):
        r[-1] = r[-1] + v
    else:
        r.append(v)
    return r
with open('words_original.txt', mode='r',encoding="utf-8") as f:
with open ('alphabeths.txt', mode='w+', encoding='utf-8') as d:
with open('only_words.txt', mode='w+', encoding="utf-8") as e:
chunk_size = 4096
f_chunk = f.read(chunk_size)
while len(f_chunk)>0:
for word in f_chunk.split():
for char in ['।', ',', '’', '‘', '?','#','1','2','3','4','0','5','6','7','8','9',
'१','२','३','४','५','.''६','७','८','९','०', '5','6','7','8','9','0','\ufeff']:
if char in word:
word = word.replace(char, '')
if word.strip():
clean_list.append(word)
f_chunk = f.read(chunk_size)
for clean_word in clean_list:
test_word= reduce(reducer,clean_word,[])
final_word= (''.join(test_word))
dafuq.append(final_word)
print (final_word)
f_chunk = f.read(chunk_size)
This is the file I am testing it on
words_original.txt
stacktrace error
Traceback (most recent call last):
File "C:\Users\KUSHAL\Desktop\EARTHQUAKE_PYTHON\test.py", line 82, in <module>
test_word= reduce(reducer,clean_word,[])
File "C:\Users\KUSHAL\Desktop\EARTHQUAKE_PYTHON\test.py", line 27, in reducer
r[-1] = r[-1] + v
IndexError: list index out of range
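The problem lay with some Unicode characters. It worked after removing them. The IndexError is raised when a word begins with a combining mark (category Mc or Mn), because r is still empty when reducer evaluates r[-1]. That can happen when a 4096-character chunk boundary falls between a base letter and its vowel sign, which would explain why the error shows up only at particular text lengths.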
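If removing the offending characters from the input is not an option, one possible alternative is to guard the reducer so that a leading combining mark starts its own entry. This is a sketch, not the asker's final code; split_clusters is a hypothetical wrapper name.

```python
import unicodedata
from functools import reduce


def reducer(r, v):
    # Attach a combining mark to the previous cluster only when one exists;
    # a mark at the very start of the word becomes its own entry.
    if r and unicodedata.category(v) in ('Mc', 'Mn'):
        r[-1] += v
    else:
        r.append(v)
    return r


def split_clusters(word):
    return reduce(reducer, word, [])


print(split_clusters("\u0915\u093f\u0924\u093e\u092c"))  # the word "किताब"
print(split_clusters("\u093f\u0915"))  # starts with a vowel sign, no IndexError
```

Reading the file line by line instead of in fixed-size chunks would also avoid splitting a word between a letter and its mark.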

Python - U.S. ZipCode Matching

I'm working with regex and I'm brand new to Python. I can't get the program to read from a file and go through the match case properly. I'm getting a traceback that looks like this:
Traceback (most recent call last):
File "C:\Users\Systematic\workspace\Project8\src\zipcode.py", line 18, in <module>
m = re.match(info, pattern)
File "C:\Python34\lib\re.py", line 160, in match
return _compile(pattern, flags).match(string)
File "C:\Python34\lib\re.py", line 282, in _compile
p, loc = _cache[type(pattern), pattern, flags]
TypeError: unhashable type: 'list'
zipin.txt:
3285
32816
32816-2362
32765-a234
32765-23
99999-9999
zipcode.py:
from pip._vendor.distlib.compat import raw_input
import re
userinput = raw_input('Please enter the name of the file containing the input zipcodes: ')
myfile = open(userinput)
info = myfile.readlines()
pattern = '^[0-9]{5}(?:-[0-9]{4})?$'
m = re.match(info, pattern)
if m is not None:
    print("Match found - valid U.S. zipcode: " , info, "\n")
else: print("Error - no match - invalid U.S. zipcode: ", info, "\n")
myfile.close()
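The problem is that readlines() returns a list, and re operates on string-like objects. The arguments are also swapped: the signature is re.match(pattern, string), so the list ended up being used as the pattern, hence "unhashable type: 'list'". Here is one way it could work:
import re
zip_re = re.compile('^[0-9]{5}(?:-[0-9]{4})?$')
m = None  # stays None if the file is empty
with open('zipin.txt', 'r') as f:
    for l in f:
        m = zip_re.match(l.strip())
        if m:
            print(l)  # first valid ZIP code found
            break
if m is None:
    print("Error - no match")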
The code now operates in a loop over the file lines, and attempts to match the re on a stripped version of each line.
Edit:
It's actually possible to write this in a much shorter, albeit less clear way:
next((l for l in open('zipin.txt', 'r') if zip_re.match(l.strip())), None)
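If the aim is to report on every line, as the print statements in the question suggest, a variation along these lines may be closer. It is a sketch: is_valid_zip and check_lines are hypothetical names, and the sample values come from zipin.txt above.

```python
import re

# fullmatch() requires the whole string to match, so no ^ or $ is needed.
ZIP_RE = re.compile(r'[0-9]{5}(?:-[0-9]{4})?')


def is_valid_zip(text):
    return ZIP_RE.fullmatch(text.strip()) is not None


def check_lines(lines):
    # Pair each stripped line with the result of the check.
    return [(line.strip(), is_valid_zip(line)) for line in lines]


sample = ["3285\n", "32816\n", "32816-2362\n", "32765-a234\n", "32765-23\n", "99999-9999\n"]
for code, ok in check_lines(sample):
    if ok:
        print("Match found - valid U.S. zipcode:", code)
    else:
        print("Error - no match - invalid U.S. zipcode:", code)
```

To run it on the real file, pass the open file object to check_lines in place of sample.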

Regular expressions in python unicode

I need to remove all the html tags from a given webpage data. I tried this using regular expressions:
import urllib2
import re
page = urllib2.urlopen("http://www.frugalrules.com")
from bs4 import BeautifulSoup, NavigableString, Comment
soup = BeautifulSoup(page)
link = soup.find('link', type='application/rss+xml')
print link['href']
rss = urllib2.urlopen(link['href']).read()
souprss = BeautifulSoup(rss)
description_tag = souprss.find_all('description')
content_tag = souprss.find_all('content:encoded')
print re.sub('<[^>]*>', '', content_tag)
But the syntax of the re.sub is:
re.sub(pattern, repl, string, count=0)
So, I modified the code as (instead of the print statement above):
for row in content_tag:
    print re.sub(ur"<[^>]*>",'',row,re.UNICODE)
But it gives the following error:
Traceback (most recent call last):
File "C:\beautifulsoup4-4.3.2\collocation.py", line 20, in <module>
print re.sub(ur"<[^>]*>",'',row,re.UNICODE)
File "C:\Python27\lib\re.py", line 151, in sub
return _compile(pattern, flags).sub(repl, string, count)
TypeError: expected string or buffer
What am I doing wrong?
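find_all returns a list-like ResultSet of Tag objects rather than a string, which is why re.sub raises the TypeError. Convert it to a string first. For the last line of your code, try: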
print(re.sub('<[^>]*>', '', str(content_tag)))
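A related detail: the fourth positional argument of re.sub is count, so re.UNICODE in the loop above is read as a count, not as a flag. A minimal sketch that sidesteps this by compiling the pattern with its flag, assuming Python 3 and a hypothetical helper name strip_tags:

```python
import re

# Flags belong to compile() (or to the flags= keyword of re.sub).
TAG_RE = re.compile(r'<[^>]*>', re.UNICODE)


def strip_tags(markup):
    # str() converts non-string input to text before substitution;
    # the assumption here is that the object's str() form is its markup.
    return TAG_RE.sub('', str(markup))


print(strip_tags('<p>Hello <b>world</b></p>'))
```

Applied per element, for example [strip_tags(row) for row in content_tag], this avoids the list brackets and commas that str(content_tag) would leave in the output.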
