How to remove special characters from txt files using Python - python

from glob import glob
pattern = "D:\\report\\shakeall\\*.txt"
filelist = glob(pattern)
def countwords(fp):
    with open(fp) as fh:
        return len(fh.read().split())
print "There are", sum(map(countwords, filelist)), "words in the files.", "From directory", pattern
import os
uniquewords = set([])
for root, dirs, files in os.walk("D:\\report\\shakeall"):
    for name in files:
        [uniquewords.add(x) for x in open(os.path.join(root,name)).read().split()]
print "There are", len(uniquewords), "unique words in the files.", "From directory", pattern
This is my code so far. It counts the number of unique words and the total number of words in D:\report\shakeall\*.txt
The problem is that this code treats, for example, code, code. and code! as different words, so it can't give an exact number of unique words.
I'd like to either remove the special characters from the 42 text files using a Windows text editor,
or add an exception rule that solves this problem.
If I go with the latter, how should I write my code?
Should it modify the text files directly, or should it add an exception so that special characters are not counted?

import re
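with open('a.txt') as src:
    text = src.read()
# Keep letters, digits, newlines and periods; replace everything else with a space.
new_str = re.sub(r'[^a-zA-Z0-9\n.]', ' ', text)
with open('b.txt', 'w') as dst:
    dst.write(new_str)
This replaces every character that is not a letter, a digit, a newline or a period with a space. Periods are kept, so remove the . from the character class if code. and code should count as the same word.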

I'm pretty new and I doubt this is very elegant, but one option would be to take your string(s) after reading them in and run them through string.translate() to strip out the punctuation. This is documented in the Python documentation for version 2.7 (which I think you're using).
As far as the actual code goes, it might be something like this (but maybe someone better than me can confirm or improve on it):
fileString.translate(None, string.punctuation)
where "fileString" is the string that your open(fp) read in. "None" is provided in place of a translation table (which would normally be used to actually change some characters into others), and the second parameter, string.punctuation (a Python string constant containing all the punctuation symbols) is a set of characters that will be deleted from your string.
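In the event that the above doesn't work, you could build an explicit translation table instead. maketrans needs two strings of equal length, so this version maps every punctuation character to a space rather than deleting it:
import string
inChars = string.punctuation
outChars = ' ' * len(inChars)
translateTable = string.maketrans(inChars, outChars)
fileString.translate(translateTable)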
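For readers on Python 3, where str.translate no longer takes a second "characters to delete" argument, a minimal sketch of the same idea uses the three-argument form of str.maketrans. The helper name strip_punctuation is made up for illustration and is not part of the answer above.

```python
import string

# Translation table that deletes every ASCII punctuation character.
_DELETE_PUNCTUATION = str.maketrans('', '', string.punctuation)

def strip_punctuation(text):
    """Return text with all characters in string.punctuation removed."""
    return text.translate(_DELETE_PUNCTUATION)

print(strip_punctuation("code, code. and code!"))  # code code and code
```

Note that this also removes punctuation inside words, so don't becomes dont.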
There are a couple of other answers to similar questions that I found via a quick search. I'll list them here, too, in case you can get more from them.
Removing Punctuation From Python List Items
Remove all special characters, punctuation and spaces from string
Strip Specific Punctuation in Python 2.x
Finally, if what I've said is completely wrong, please comment and I'll remove it so that others don't try it and become frustrated.

import re
Then replace
[uniquewords.add(x) for x in open(os.path.join(root,name)).read().split()]
with
[uniquewords.add(re.sub(r'[^a-zA-Z0-9]*$', '', x)) for x in open(os.path.join(root,name)).read().split()]
This strips all trailing non-alphanumeric characters from each word before adding it to the set. Leading punctuation and differences in case are left untouched.
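As a rough sketch of how the counting could look with punctuation stripped from both ends of each word, the following Python 3 snippet works on strings that have already been read from the files. The function names are hypothetical, and lower-casing the words is an extra assumption the question does not ask for.

```python
import string

def normalize(word):
    """Lower-case a token and strip ASCII punctuation from both ends."""
    return word.strip(string.punctuation).lower()

def unique_words(texts):
    """Collect the distinct normalized words found in an iterable of strings."""
    words = set()
    for text in texts:
        for token in text.split():
            cleaned = normalize(token)
            if cleaned:  # skip tokens that were pure punctuation
                words.add(cleaned)
    return words

sample = ["Code code. and code!", "More -- code?"]
print(sorted(unique_words(sample)))  # ['and', 'code', 'more']
```

In the loop from the question, each file's contents would be passed in instead of the sample strings.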

On Linux, some system files under /proc contain characters with ASCII value 0 (NUL). This reads the first line of a file and replaces each NUL character with a space:
full_file_path = 'test.txt'
result = []
with open(full_file_path, encoding='utf-8') as f:
    line = f.readline()
    for c in line:
        if ord(c) == 0:
            result.append(' ')
        else:
            result.append(c)
print(''.join(result))
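A shorter variant of the same idea, sketched under the assumption that the whole file fits in memory, uses str.replace instead of a character loop. Both function names are invented for this example.

```python
def replace_nuls(text, replacement=' '):
    """Replace every NUL character (code point 0) in text."""
    return text.replace('\x00', replacement)

def read_without_nuls(path, encoding='utf-8'):
    """Read a whole file and return its contents with NULs replaced by spaces."""
    with open(path, encoding=encoding) as f:
        return replace_nuls(f.read())
```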

Related

Python reading from file vs directly assigning literal

I asked a Python question a few minutes ago about how Python's newlines work, only to have it closed because of another question that is not similar and does not even involve Python.
I have text containing '\n' and '\t' in a file. I read it using
open().read()
and store the result in an identifier. My expectation is that text such as
I\nlove\tCoding
read from a file and assigned to an identifier should be the same as the string literal
"I\nlove\tCoding"
assigned directly to an identifier.
My assumption was wrong, though:
word = "I\nlove\tCoding"
ends up being different from
word = open(*.txt).read()
where the content of *.txt is exactly the same as the string "I\nlove\tCoding".
Edit:
I made a typo; I meant \t and \n. When I search for \t with the re module's search(), it returns None, even though \t is there. Why is this?
You need to differentiate between newlines/tabs and their corresponding escape sequences:
import re
for filename in ('test1.txt', 'test2.txt'):
    print(f"\n{filename} contains:")
    fileData = open(filename, 'r').read()
    print(fileData)
    for pattern in (r'\\n', r'\n'):
        # the first pattern matches the escape sequence (backslash + n), the second a real newline
        m = re.search(pattern, fileData)
        if m:
            print(f"found {pattern}")
Out:
test1.txt contains:
I\nlove\tCoding
found \\n
test2.txt contains:
I
love Coding
found \n
The string you get after reading from the file is I\\nlove\\tCoding, that is, it contains literal backslashes. If you want the string from a literal to equal the string from the file, you should use the r prefix, like this: word = r"I\nlove\tCoding"
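The difference can be checked without any files. This is a small sketch in which from_file stands in for what read() would return if the file literally contains backslash sequences. The last step, decoding with the unicode_escape codec, is one possible way to turn such text into real control characters and is only safe here because the text is plain ASCII.

```python
import re

literal = "I\nlove\tCoding"     # contains a real newline and a real tab
from_file = r"I\nlove\tCoding"  # backslash + letter pairs, as stored in the file

print(len(literal), len(from_file))              # 13 15
print(re.search(r"\t", literal) is not None)     # True: a real tab is present
print(re.search(r"\t", from_file) is not None)   # False: no real tab here
print(re.search(r"\\t", from_file) is not None)  # True: backslash followed by t

# Convert the escape sequences into the characters they stand for (ASCII text only).
decoded = from_file.encode("ascii").decode("unicode_escape")
print(decoded == literal)                        # True
```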

concatenate words in a text file

I have exported a pdf file as a .txt and I observed that many words were broken into two parts due to line breaks. So, in this program, I want to join the words that are separated in the text while maintaining the correct words in the sentence. In the end, I want to get a final .txt file (or at least a list of tokens) with all words properly spelt. Can anyone help me?
My current text looks like this:
I need your help be cause I am not a good progra mmer.
The result I need:
I need your help because I am not a good programmer.
from collections import defaultdict
import re
import string
import enchant
document_text=open('test-list.txt','r')
text_string=document_text.read().lower()
lst=[]
errors=[]
dic=enchant.Dict('en_UK')
d=defaultdict(int)
match_pattern = re.findall(r'\b[a-zA-Z0-9_]{1,15}\b', text_string)
for w in match_pattern:
    lst.append(w)
for i in lst:
    if dic.check(i) is True:
        continue
    else:
        a=list(map(''.join, zip(*([iter(lst)]*2))))
        if dic.check(a) is True:
            continue
        else:
            errors.append(a)
print (lst)
You have a bigger problem - how will your program know that:
be
cause
... should be treated as one word?
If you really wanted to, you could remove the newline characters by replacing them with empty strings:
import re
document_text = """
i need your help be
cause i am not a good programmer
""".lower().replace("\n", '')
print([w for w in re.findall(r'\b[a-zA-Z0-9_]{1,15}\b', document_text)])
This joins be and cause into because, which will pass the spellcheck, but it will fail in cases like:
Hello! My name is
Foo.
... because isFoo is not a word.
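One way to avoid gluing unrelated words together is to merge two neighbouring tokens only when their concatenation is a known word. The sketch below is a greedy version of that idea. The small VOCABULARY set is a hypothetical stand-in for a real dictionary lookup such as the dic.check method used in the question, and the approach will still make mistakes whenever two correct neighbours happen to form another word (a + part, in + to).

```python
def join_broken_words(tokens, is_word):
    """Greedily merge neighbouring tokens whose concatenation is a known word."""
    joined = []
    i = 0
    while i < len(tokens):
        if i + 1 < len(tokens) and is_word(tokens[i] + tokens[i + 1]):
            joined.append(tokens[i] + tokens[i + 1])
            i += 2  # both fragments were consumed
        else:
            joined.append(tokens[i])
            i += 1
    return joined

# Hypothetical vocabulary standing in for a real dictionary lookup.
VOCABULARY = {"i", "need", "your", "help", "be", "cause", "because",
              "am", "not", "a", "good", "programmer"}

tokens = "i need your help be cause i am not a good progra mmer".split()
print(" ".join(join_broken_words(tokens, VOCABULARY.__contains__)))
# i need your help because i am not a good programmer
```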

How can I write a regular expression to replace hash-like strings

There are some windows names and folders containing names like:
c:\windows\serviceprofiles\localservice\appdata\locallow\microsoft\cryptneturlcache\metadata\57c8edb95df3f0ad4ee2dc2b8cfd4157
c:\windows\system32\config\systemprofile\appdata\locallow\microsoft\cryptneturlcache\metadata\be7ffd2fd84d3b32fd43dc8f575a9f28
c:\windows\softwaredistribution\download\ab1b092b40dee3ba964e8305ecc7d0d9
Notice how they end with a string that looks like a hash:
57c8edb95df3f0ad4ee2dc2b8cfd4157, be7ffd2fd84d3b32fd43dc8f575a9f28,
ab1b092b40dee3ba964e8305ecc7d0d9
I am not good with regex and I would like to know if there is a way to write a regex that would replace these hash-like names within a path with something like
"##HASH##"
The paths do not necessarily end with these, as these are usually folders/subfolders containing other folders of their own.
So my goal is to essentially get a path looking like:
c:\windows\serviceprofiles\localservice\appdata\locallow\microsoft\cryptneturlcache\metadata\57c8edb95df3f0ad4ee2dc2b8cfd4157\some_subfolder\some_file.inf
to become:
c:\windows\serviceprofiles\localservice\appdata\locallow\microsoft\cryptneturlcache\metadata\##HASH##\some_subfolder\some_file.inf
Is there a way to do that in Python ?
Thanks in advance.
Notice that the "hashes" are all 32 characters long. If that is true for all of them, the regex is pretty straightforward.
For example, with the last string you posted:
import re
text = r'c:\windows\serviceprofiles\localservice\appdata\locallow\microsoft\cryptneturlcache\metadata\57c8edb95df3f0ad4ee2dc2b8cfd4157\some_subfolder\some_file.inf'
res = re.sub(r'\w{32}', '##HASH##', text)
print(res)
prints:
c:\windows\serviceprofiles\localservice\appdata\locallow\microsoft\cryptneturlcache\metadata\##HASH##\some_subfolder\some_file.inf
Notice the r prefix on the path: in a normal string literal Python would treat sequences such as \a and \57 as escape sequences and silently change the path, so either use a raw string or double every backslash.
The \w{32} regex means "match exactly 32 consecutive word characters" (letters, digits or underscores).
This might help:
import os
import re
uuid = re.compile(r'[0-9a-f]{32}\Z', re.I)
A = r"c:\windows\serviceprofiles\localservice\appdata\locallow\microsoft\cryptneturlcache\metadata\57c8edb95df3f0ad4ee2dc2b8cfd4157\sub_folder"
path = os.path.normpath(A)
path = path.split(os.sep)
path = "\\".join(["##"+i+"##" if uuid.match(i) else i for i in path])
print(path)
Result:
c:\windows\serviceprofiles\localservice\appdata\locallow\microsoft\cryptneturlcache\metadata\##57c8edb95df3f0ad4ee2dc2b8cfd4157##\sub_folder
Note: the path is a raw string so that backslash sequences such as \a and \57 are not interpreted as escapes, and splitting on os.sep assumes the code runs on Windows. The pattern matches path components of exactly 32 hex characters; you can modify that value in re.compile. To get the output requested in the question, use "##HASH##" in place of "##"+i+"##".
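A stricter variant, sketched here under the assumption that a "hash" is always a complete path component of exactly 32 hexadecimal characters, anchors the match between backslashes with lookarounds. The names HASH_COMPONENT and mask_hashes are made up for this example.

```python
import re

# A full path component of exactly 32 hex digits: preceded by a backslash and
# followed by either another backslash or the end of the string.
HASH_COMPONENT = re.compile(r'(?<=\\)[0-9a-f]{32}(?=\\|$)', re.IGNORECASE)

def mask_hashes(path):
    """Replace every hash-like path component with the marker ##HASH##."""
    return HASH_COMPONENT.sub('##HASH##', path)

print(mask_hashes(r'c:\windows\softwaredistribution\download\ab1b092b40dee3ba964e8305ecc7d0d9'))
# c:\windows\softwaredistribution\download\##HASH##
```

Unlike \w{32}, this leaves alone 32-character names that contain non-hex letters or underscores, and longer names that merely contain 32 hex digits.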

Python 2.7 File Handling saving username

OK, I'm doing some coding work making a username saver. I need to save multiple users to different text files, but I get an error when trying to set the file directory:
fileloc = "N:\Documents\1) Subjects, Word and Powerpoint\GCSE Computing\NEA\GCSE 2017\users\"
filename = fileloc+newname+".txt"
print filename
adduserfile = open(filename, "rw+")
I get the error "EOL while scanning string literal", and it highlights the last quotation mark at the end of line 1. I'm not sure how to fix this, please help.
Sorry for asking such a simple question. I understand now that it was my use of the address that caused it to break, because of the special characters (). Thanks for the time and help.
Using forward slashes will work under Windows.
fileloc = "N:/Documents/1) Subjects, Word and Powerpoint/GCSE Computing/NEA/GCSE 2017/users/"
Alternatively, you can use a raw string literal by prefixing an r. A raw string cannot end in a single backslash, so add the trailing separator separately:
fileloc = r"N:\Documents\1) Subjects, Word and Powerpoint\GCSE Computing\NEA\GCSE 2017\users" + "\\"
You need to be careful with the special character "\". The final backslash must be doubled so that it does not escape the closing quote, and \1 must be written as \\1 because \1 is an octal escape:
fileloc = "N:\Documents\\1) Subjects, Word and Powerpoint\GCSE Computing\NEA\GCSE 2017\users\\"
filename = fileloc+"newname"+".txt"
print filename
N:\Documents\1) Subjects, Word and Powerpoint\GCSE Computing\NEA\GCSE 2017\users\newname.txt
I used "newname" as a string here; you can replace it with a variable.
By ending your string with a \ you're escaping the next character (which is "), hence the string is not terminated.
Beware that you're also escaping the character after every other \ in the string.
What you probably want:
fileloc = "N:\\Documents\\1) Subjects, Word and Powerpoint\\GCSE Computing\\NEA\\GCSE 2017\\users\\"
Learn more about character escapes here: https://en.wikipedia.org/wiki/Escape_character
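Another way to sidestep the trailing-backslash problem is to keep the directory without its final separator and let a path-joining function add it. The sketch below uses ntpath, the Windows flavour of os.path, so that it behaves the same on any platform; on Windows itself os.path.join does the same job. The helper user_file and the sample name alice are invented for illustration.

```python
import ntpath  # Windows flavour of os.path, importable on any platform

# Raw string, no trailing backslash needed.
USERS_DIR = r"N:\Documents\1) Subjects, Word and Powerpoint\GCSE Computing\NEA\GCSE 2017\users"

def user_file(newname):
    """Build the full path of the text file that stores one user."""
    return ntpath.join(USERS_DIR, newname + ".txt")

print(user_file("alice"))
# N:\Documents\1) Subjects, Word and Powerpoint\GCSE Computing\NEA\GCSE 2017\users\alice.txt
```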

stemming problems in python

I want to find the stems of Persian verbs. First I made a file containing some current and exception stems. I want my code to search the file first: if the stem is there, it should return it; if not, it should go through the rest of the code and return the stem by deleting suffixes and prefixes. There are two problems. 1) The code ignores the file: it just goes through the rest of the code and outputs a wrong stem, because the exceptions are in the file. 2) Because I used "for" loops, the suffixes and prefixes of one verb influence other verbs and remove their suffixes and prefixes, which sometimes produces a wrong stem. How should I change the code so that each "for" loop works independently and doesn't affect the others? (I have to write just one function and call only it.)
I have left out some of the suffixes and prefixes.
def stemmer (verb, file):
    with open (file, encoding = "utf-8") as f:
        f = f.read().split()
        for i in f:
            if i in verb:
                return i
            else:
                for i in suffix1:
                    if verb.endswith(i):
                        verb = verb[:-len(i)]
                        return verb
You don't have to put all of your code, sara. We are only concerned with the snippet that causes the problem.
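My guess is that the problematic part is the check if i in verb, which might fail most of the time because of trailing characters left on the tokens. Normally, when you split text into tokens, you also need to trim trailing characters with the strip() method (note that split() with no arguments already discards surrounding whitespace, so this matters mainly when the tokens are read another way, for example with readlines()):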
>>> 'who\n'.strip() in 'who'
True
Conditionals like:
>>> "word\n" in "word"
False
>>> 'who ' in 'who'
False
will always evaluate to False, and that would be why the program doesn't check the exceptions at all.
I found the answer. The problem is caused by the "else:"; there is no need for it.
def stemmer (verb, file):
    with open (file, encoding = "utf-8") as f:
        f = f.read().split()
        for i in f:
            if i in verb:
                return i
        for i in suffix1: # remote past tense suffixes
            if verb.endswith(i):
                verb = verb[:-len(i)]
                break
    return verb
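The same two-stage logic can be sketched as a function that receives the exception stems and the suffixes as arguments, which makes it easy to try without a file. The name stem, its parameters and the English sample data are placeholders chosen for illustration, not the Persian data from the question. Because the function only works on its own arguments and returns after removing at most one suffix, one call cannot affect the next.

```python
def stem(verb, exception_stems, suffixes):
    """Return a known exception stem if one occurs in verb, else strip one suffix."""
    # Stage 1: the exception list always wins.
    for known in exception_stems:
        if known in verb:
            return known
    # Stage 2: remove at most one suffix, trying the longest suffixes first.
    for suffix in sorted(suffixes, key=len, reverse=True):
        if verb.endswith(suffix) and len(verb) > len(suffix):
            return verb[:-len(suffix)]
    return verb  # nothing matched: the verb is returned unchanged

# Placeholder data, for illustration only.
print(stem("underwent", ["went"], ["ing", "ed"]))  # went
print(stem("walking", ["went"], ["ing", "ed"]))    # walk
```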
