Trying to print a human-readable ASCII string - Python

I am trying to print a string which is human-readable ASCII, but I am not getting any output. What am I missing?
import string

file = open("file.txt", "r")
data = file.read()
data = data.split("\n")
for line in data:
    if line not in string.printable:
        continue
    else:
        print line

If your file's content is text, you should read files like this:
import string

with open("file.txt", "r") as file:
    for line in file:
        if all(c in string.printable for c in line):
            print line
You must check every character individually to see whether it is printable. There is another post about checking whether a string is printable: Test if a python string is printable
Also, you can read about context managers and the right way to open a file: What is the most pythonic way to open a file?
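If you are on Python 3, a minimal sketch of the same check using str.isprintable() (the trailing newline has to be stripped first, since a newline does not count as printable):

with open("file.txt", "r") as f:
    for line in f:
        line = line.rstrip("\n")
        if line.isprintable():  # True when every remaining character is printable
            print(line)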

Related

Searching for a string in a file is not working in Python

I am using this code to find a string in Python:
buildSucceeded = "Build succeeded."
datafile = r'C:\PowerBuild\logs\Release\BuildAllPart2.log'
with open(datafile, 'r') as f:
    for line in f:
        if buildSucceeded in line:
            print(line)
I am quite sure the string is in the file, but nothing is returned.
If I just print the file line by line, a lot of 'NUL' characters appear between each "valid" character.
EDIT 1:
The problem was the file's encoding on Windows. I changed the encoding following this post and it worked: Why doesn't Python recognize my utf-8 encoded source file?
Anyway the file looks like this:
Line 1.
Line 2.
...
Build succeeded.
0 Warning(s)
0 Error(s)
...
I am currently testing with the Sublime editor on Windows, which shows a 'NUL' character between each "real" character, which is very odd.
Using the Python command line, I get this output:
C:\Dev>python readFile.py
Traceback (most recent call last):
  File "readFile.py", line 7, in <module>
    print(line)
  File "C:\Program Files\Python35\lib\encodings\cp437.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_map)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\xfe' in position 1: character maps to <undefined>
Thanks for your help anyway...
If your file is not that big, you can do a simple find. Otherwise, I would check the file to see if the string is actually in it, check the path for any spelling mistakes, and try to narrow down the problem.
f = open(datafile, 'r')
lines = f.read()
answer = lines.find(buildSucceeded)
Also note that if it does not find the string, answer will be -1.
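For example, a quick illustration of str.find with made-up contents rather than the real log:

lines = "Line 1.\nLine 2.\nBuild succeeded.\n"
print(lines.find("Build succeeded."))  # 16 -- the index where the match starts
print(lines.find("Build failed."))     # -1 -- the substring is not present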
As explained, the problem was related to encoding. The website below has a very good explanation of how to convert a file from one encoding to another.
I used the last example (with Python 3, which is my case) and it worked as expected:
buildSucceeded = "Build succeeded."
datafile = 'C:\\PowerBuild\\logs\\Release\\BuildAllPart2.log'

# Open both input and output streams.
#input = open(datafile, "rt", encoding="utf-16")
input = open(datafile, "r", encoding="utf-16")
output = open("output.txt", "w", encoding="utf-8")

# Stream chunks of unicode data.
with input, output:
    while True:
        # Read a chunk of data.
        chunk = input.read(4096)
        if not chunk:
            break
        # Remove vertical tabs.
        chunk = chunk.replace("\u000B", "")
        # Write the chunk of data.
        output.write(chunk)

with open('output.txt', 'r') as f:
    for line in f:
        if buildSucceeded in line:
            print(line)
Source: http://blog.etianen.com/blog/2013/10/05/python-unicode-streams/
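As a side note, if the converted copy is not needed for anything else, it may be enough to open the original log directly with the right encoding. A minimal sketch, assuming the log really is UTF-16 (which the NUL bytes between characters suggest):

buildSucceeded = "Build succeeded."
datafile = 'C:\\PowerBuild\\logs\\Release\\BuildAllPart2.log'

# Decode the UTF-16 log on the fly instead of writing a UTF-8 copy first.
with open(datafile, 'r', encoding='utf-16') as f:
    for line in f:
        if buildSucceeded in line:
            print(line.strip())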

How to choose and upload a file in Python

I am writing a program that asks the user which text file to read and then reads whatever file name the user inputs. Here is what I have so far:
import sys
import os
import re

#CHOOSE FILE
print "Welcome to the Parsing Database"
raw_input=raw_input("enter file name to parse: ")

#ASSIGN HEADERS AND SEQUENCES
f=open("raw_input", "r")
header=[]
sequence=[]
string=""
for line in f:
    if ">" in line and string=="":
        header.append(line[:-2])
    elif ">" in line and string!="":
        sequence.append(string)
        header.append(line[:-2])
        string=""
    else:
        string=string+line[:-2]
sequence.append(string)
The first two lines work, but then it says it cannot find the file that I entered. Please help! Thanks.
Off the top of my head, I think that f = open("raw_input", "r") needs to be f = open(raw_input, "r"), because you want to reference the string contained in the variable raw_input, not open a file literally named "raw_input". Also, you should probably rename the variable to something more readable, because raw_input() is a function used in your code as well as a variable name, which makes the code hard to read. Are there any other specific problems you are having with your code?
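A minimal sketch of that fix, with the variable renamed so it no longer shadows the raw_input() built-in (the name filename is just an example):

filename = raw_input("enter file name to parse: ")
f = open(filename, "r")  # open the file the user named, not a file literally called "raw_input"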
f=open("raw_input", "r")
"raw_input" is a plain string. You have to referente to it as raw_input.
Also, there's no lines if you don't use .read() with open() method so you can't parse them. Read lines from a file given from raw_input can be done doing that:
import sys
import os
import re

#CHOOSE FILE
print "Welcome to the Parsing Database"
raw_input_file=raw_input("enter file name to parse: ")

#ASSIGN HEADERS AND SEQUENCES
testfile = open(raw_input_file, "r")
secuence = []
for line in testfile.read().splitlines():
    secuence.append(line)
for i in secuence:
    print i
testfile.close()

How to remove lines from a file that is being printed with certain conditions?

def codeOnly (file):
    '''Opens a file and prints the content excluding anything with a hash in it'''
    f = open('boring.txt','r')
    codecontent = f.read()
    print(codecontent)

codeOnly('boring.txt')
I want to open this file and print its contents, but I don't want to print any lines with hashes in them. Is there a way to prevent those lines from being printed?
The following script will print all lines which do not contain a #:
def codeOnly(file):
    '''Opens a file and prints the content excluding anything with a hash in it'''
    with open(file, 'r') as f_input:
        for line in f_input:
            if '#' not in line:
                print(line, end='')

codeOnly('boring.txt')
Using with will ensure that the file is automatically closed afterwards.
You can check whether a line contains a hash with not '#' in line (using in):
def codeOnly (file):
    '''Opens a file and prints the content excluding anything with a hash in it'''
    f = open('boring.txt','r')
    for line in f:
        if not '#' in line:
            print(line)

codeOnly('boring.txt')
If you really want to keep only code lines, you might want to keep the part of the line before the hash, because in languages such as Python you could have code before the hash, for example:
print("test") # comments
You can find the index of the hash and slice the line up to it:
for line in f:
    try:
        i = line.index('#')
        line = line[:i]
    except ValueError:
        pass  # don't change line
Now each of your lines will contain no text from the hash through the end of the line. A hash in the first position of a line will result in an empty string; you might want to handle that.
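Putting that idea together with the original function, a rough sketch using str.split instead of index/try-except could look like this (boring.txt as in the question):

def codeOnly(filename):
    '''Prints the file's contents with everything from a hash onward removed.'''
    with open(filename, 'r') as f:
        for line in f:
            code = line.split('#', 1)[0].rstrip()  # keep only the part before any hash
            if code:                               # skip lines that were blank or pure comment
                print(code)

codeOnly('boring.txt')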

Delete every non-UTF-8 symbol from a string

I have a large number of files and a parser. What I have to do is strip all non-UTF-8 symbols and put the data in MongoDB.
Currently I have code like this.
with open(fname, "r") as fp:
    for line in fp:
        line = line.strip()
        line = line.decode('utf-8', 'ignore')
        line = line.encode('utf-8', 'ignore')
Somehow I still get an error:
bson.errors.InvalidStringData: strings in documents must be valid UTF-8:
1/b62010montecassianomcir\xe2\x86\x90ta0\xe2\x86\x90008923304320733/290066010401040101506055soccorin
I don't get it. Is there some simple way to do it?
UPD: it seems like Python and Mongo don't agree on the definition of a valid UTF-8 string.
Try the line below instead of the last two lines. Hope it helps:
line=line.decode('utf-8','ignore').encode("utf-8")
For python 3, as mentioned in a comment in this thread, you can do:
line = bytes(line, 'utf-8').decode('utf-8', 'ignore')
The 'ignore' parameter prevents an error from being raised if any characters are unable to be decoded.
If your line is already a bytes object (e.g. b'my string') then you just need to decode it with decode('utf-8', 'ignore').
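A quick, self-contained illustration (the byte values are made up for the example):

# A bytes object containing a sequence that is not valid UTF-8.
raw = b'montecassiano\xff\xfemcir'

# errors='ignore' silently drops the undecodable bytes instead of raising.
print(raw.decode('utf-8', 'ignore'))  # -> montecassianomcir

# Without 'ignore' the same call raises UnicodeDecodeError.
try:
    raw.decode('utf-8')
except UnicodeDecodeError as exc:
    print(exc)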
Example of handling non-UTF-8 characters:
import string
test=u"\n\n\n\n\n\n\n\n\n\n\n\n\n\nHi <<First Name>>\nthis is filler text \xa325 more filler.\nadditilnal filler.\n\nyet more\xa0still more\xa0filler.\n\n\xa0\n\n\n\n\nmore\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nfiller.\x03\n\t\t\t\t\t\t almost there \n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nthe end\n\n\n\n\n\n\n\n\n\n\n\n\n"
print ''.join(x for x in test if x in string.printable)
with open(fname, "r") as fp:
    for line in fp:
        line = line.strip()
        line = line.decode('cp1252').encode('utf-8')

Python - need to append characters to the beginning and end of each line in a text file

I should preface this by saying that I am a complete Python newbie.
I'm trying to create a script that will loop through a directory and its subdirectories looking for text files. When it encounters a text file, it will parse the file, convert it to NITF XML, and upload it to an FTP directory.
At this point I am still working on reading the text file into variables so that they can be inserted into the XML document in the right places. An example of the text file is as follows.
Headline
Subhead
By A person
Paragraph text.
And here is the code I have so far:
with open("path/to/textFile.txt") as f:
#content = f.readlines()
head,sub,auth = [f.readline().strip() for i in range(3)]
data=f.read()
pth = os.getcwd()
print head,sub,auth,data,pth
My question is: how do I iterate through the body of the text file (data) and wrap each line in HTML P tags? For example:
<P>line of text in file</P> <P>Next line in text file</P>.
Something like
output_format = '<p>{}</p>\n'.format

with open('input') as fin, open('output', 'w') as fout:
    fout.writelines(output_format(line.strip()) for line in fin)
This assumes that you want to write the new content back to the original file:
with open('path/to/textFile.txt') as f:
    content = f.readlines()

with open('path/to/textFile.txt', 'w') as f:
    for line in content:
        f.write('<p>' + line.strip() + '</p>\n')
import shutil

with open('infile') as fin, open('outfile', 'w') as fout:
    for line in fin:
        fout.write('<P>{0}</P>\n'.format(line[:-1]))  # slice off the newline. Same as `line.rstrip('\n')`.

#Only do this once you're sure the script works :)
shutil.move('outfile', 'infile')  # Need to replace the input file with the output file
In your case, you should probably replace
data=f.read()
with:
data = '\n'.join("<p>%s</p>" % l.strip() for l in f)
Use data = f.readlines() here,
and then iterate over data and try something like this:
for line in data:
    line = "<p>" + line.strip() + "</p>"
    #write line+'\n' to a file or do something else
Append the <p> and </p> for each line, for example:
data_new=[]
data=f.readlines()
for lines in data:
    data_new.append("<p>%s</p>\n" % lines.strip().strip("\n"))
You could use the fileinput module to modify one or more files in-place, with optional backup file creation if desired (see its documentation for details). Here it is being used to process one file.
import fileinput

for line in fileinput.input('testinput.txt', inplace=1):
    print '<P>' + line[:-1] + '</P>'
The 'testinput.txt' argument could also be a sequence of two or more file names instead of just a single one, which could be useful especially if you're using os.walk() to generate the list of files in the directory and its subdirectories to process (as you probably should be doing).
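A minimal sketch of that combination (the top-level directory name and the .txt filter are assumptions for the example):

import fileinput
import os

# Collect every .txt file under the top directory and its subdirectories.
txt_files = []
for dirpath, dirnames, filenames in os.walk('path/to/topdir'):
    for name in filenames:
        if name.endswith('.txt'):
            txt_files.append(os.path.join(dirpath, name))

# Rewrite each file in place, wrapping every line in <P> tags.
for line in fileinput.input(txt_files, inplace=1):
    print '<P>' + line.rstrip('\n') + '</P>'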
