Python - dropping special unix characters

When viewing a log file on my (Solaris) machine, I can see it is formatted as "^A1=hello^A2=goodbye^A3=salut". I would like to use something like string.split("^A") so I can break the line into parts and use the key-value pairs.
However, when I loop through this same file in Python 2.6.4, it is loaded as "1=hello2=goodbye3=salut" and I cannot split it.
When I run the file command on it, I get:
file Tom.log
Tom.log: c program text
I have tried importing codecs (as below) and selecting 'UTF-8', but no joy. Any help is appreciated.
import codecs

f = codecs.open('Tom.log', 'r', 'UTF-8')
for line in f:
    print(line)
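Note that the ^A shown by the pager is normally not the two literal characters ^ and A but a single SOH control character (byte 0x01). A minimal sketch under that assumption, splitting each line on '\x01' instead:

# Sketch: assumes the delimiter is the SOH control byte (0x01),
# which pagers render as ^A; no codecs are needed for 8-bit text.
with open('Tom.log', 'r') as f:
    for line in f:
        parts = line.rstrip('\n').split('\x01')
        # keep only "key=value" fields and build a dict from them
        pairs = dict(part.split('=', 1) for part in parts if '=' in part)
        print(pairs)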

Related

Python script not finding value in log file when the value is in the file

The code below is meant to find any xls or csv file used in a process. The .log file contains full paths with extensions and definitely contains multiple values with "xls" or "csv". However, Python can't find anything... Any ideas? The weird thing is that when I copy the content of the log file, paste it into another Notepad file, and save it as a log, it works.
infile=r"C:\Users\me\Desktop\test.log"
important=[]
keep_words=["xls","csv"]
with open(infile,'r') as f:
for line in f:
for word in keep_words:
if word in line:
important.append(line)
print(important)
I was able to figure it out... it was an encoding issue. The fix was to open the file as UTF-16:
import io

with io.open(infile, encoding='utf16') as f:
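For completeness, a sketch of the whole corrected script under that assumption (the log saved as UTF-16, as Windows tools often do):

import io

infile = r"C:\Users\me\Desktop\test.log"
important = []
keep_words = ["xls", "csv"]

# decoding as UTF-16 lets "xls"/"csv" match the text itself rather
# than failing against the raw two-bytes-per-character file content
with io.open(infile, encoding='utf16') as f:
    for line in f:
        for word in keep_words:
            if word in line:
                important.append(line)
print(important)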
You must change the line
for line in f:
to
for line in f.readlines():
Otherwise Python searches the opened file object itself, not its content as a list of lines (which is what the readlines method returns).
I hope I was able to help (sorry about my bad English).

Syntax Error when importing a text file containing a dictionary

import ast

dict_from_file = []
with open('4.txt', 'r') as inf:
    dict_from_file = ast.literal_eval(inf.read())

  File "<unknown>", line 1
    ["hello":"work", "please":"work"]
            ^
SyntaxError: invalid character in identifier
Hi everyone! The above is my code and my error. I have a really complicated 40MB data file in the form of a dictionary to work on, but I couldn't get that import to work, so I tried a simple one.
I'm using the latest Jupyter Notebook from the latest version of Anaconda, on Windows 10. My dictionary is a txt file created using Windows Notepad. The complicated dictionary was originally a JSON file that I changed into a txt file, thinking it would be easier, but I may be wrong.
I think the above error is an encoding issue? But I'm not sure how to fix it.
Thanks!
If you are the owner/writer of the file (dict-formatted), save it as JSON:
import json

# To write
your_dict = {.....}
with open("file_name.txt", "w+") as f:
    f.write(json.dumps(your_dict))

# To read
with open("file_name.txt") as f:
    read_dict = json.load(f)
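Note that json.dump(your_dict, f) would write straight to the file handle, which avoids building the whole 40MB string in memory before writing it out.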
This is possibly a Python 3 "feature".
This code removes the unwanted characters at the start of the input file and returns the input data as type str:
import ast

with open('4.txt', 'r', encoding="utf-8-sig") as inf:
    dict_from_file = ast.literal_eval(inf.read())
The utf-8-sig codec strips the byte order mark (BOM) that Windows Notepad puts at the beginning of the file; that is what produced the strange characters at the start of the read data.
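Since the asker mentions the data was originally a JSON file, it may be simpler to load that file directly; a minimal sketch, assuming the content is still valid JSON (and possibly saved with a BOM by Notepad):

import json

# utf-8-sig transparently strips a leading BOM if one is present;
# '4.txt' is the file name from the question
with open('4.txt', 'r', encoding='utf-8-sig') as f:
    dict_from_file = json.load(f)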

Python failing to read lines properly

I'm supposed to open a file, read it line by line, and display the lines.
Here's the code:
#!/usr/bin/env python
# -*- coding: utf-8 -*-

import os
import re

in_path = "../vas_output/Glyph/20140623-FLYOUT_mins_cleaned.csv"
out_path = "../vas_gender/Glyph/"

csv_read_line = open(in_path, "rb").read().split("\n")

line_number = 0
for line in csv_read_line:
    line_number += 1
    print str(line_number) + line
Here's the contents of the input file:
12345^67890^abcedefg
random^test^subject
this^sucks^crap
And here's the result:
this^sucks^crapjectfg
Some weird combination of all three lines. In addition, the line_number output is missing. Printing len(csv_read_line) outputs 1, for some reason, no matter how many lines are in the input file. Changing the split character from \n to ^ gives the expected output, though, so I'm assuming the problem is with the input file.
I'm using a Mac, and wrote both the Python code and the input file (in Sublime Text) on the Mac itself.
Am I missing something?
You seem to be splitting on "\n", which isn't necessary and could be incorrect depending on the line terminators used in the input file. Python includes functionality to iterate over the lines of a file one at a time. The advantages are that it handles line terminators in a portable way and does not require the entire file to be held in memory at once.
Further, note that you are opening the file in binary mode (the b character in your mode string) when you actually intend to read the file as text. This can cause problems similar to the one you are experiencing.
Also, you do not close the file when you are done with it. In this case that isn't a problem, but you should get in the habit of using with blocks when possible to make sure the file gets closed at the earliest possible time.
Try this:
with open(in_path, "r") as f:
    line_number = 0
    for line in f:
        line_number += 1
        print str(line_number) + line.rstrip('\r\n')
So your example just works for me.
But then, I just copied your text into a text editor on Linux and did it that way, so any carriage returns will have been wiped out.
Try this code, though:
import os

in_path = "input.txt"

with open(in_path, "rb") as inputFile:
    for lineNumber, line in enumerate(inputFile):
        print lineNumber, line.strip()
It's a little cleaner, and the for line in file style deals with line breaks for you in a system-independent way - Python's open has universal newline support.
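One caveat: in Python 2, universal newline translation is only enabled when the mode string contains U (plain "r" or "rb" leaves bare \r untouched), which matters here because the input file appears to use \r line endings. A sketch:

# Python 2: the U flag enables universal newlines, so \r, \n and \r\n
# are all normalised to \n before the lines reach the loop
with open(in_path, "rU") as f:
    for line_number, line in enumerate(f, 1):
        print line_number, line.rstrip('\n')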
I'd try the following Pythonic code:
#!/usr/bin/env python

in_path = "../vas_output/Glyph/20140623-FLYOUT_mins_cleaned.csv"
out_path = "../vas_gender/Glyph/"

with open(in_path, 'rb') as f:
    for i, line in enumerate(f):
        print(str(i) + line)
There are several improvements that can be made here to make it more idiomatic Python.
import csv

in_path = "../vas_output/Glyph/20140623-FLYOUT_mins_cleaned.csv"
out_path = "../vas_gender/Glyph/"

# Let's open the file and make sure that it closes when we unindent
with open(in_path, "rb") as input_file:
    # Create a csv reader object that will parse the input for us
    reader = csv.reader(input_file, delimiter="^")
    # Enumerate over the rows (these will be lists of strings) and keep
    # track of the line number using Python's built-in enumerate function
    for line_num, row in enumerate(reader):
        # You can process whatever you would like here. But for now we
        # will just print out what you were originally printing
        print str(line_num) + "^".join(row)

Python failing to parse a txt file even though the file is confirmed to be a txt file

I have a piece of Python code that reads from a txt file properly, but my colleague gave me another set of files that appear to be txt files as well. When I run the same Python code, each line is read incorrectly.
For the new files, if the line is 240,022414114120,-500,Bauer_HS5,0
it is read as str:2[]4[]0 []0[]2[]2[]4..... All those little rectangles between the characters, and the leading question-mark characters, are invalid characters.
And it will further get converted to something like this:
[['\xff\xfe2\x004\x000\x00', '\x000\x002\x002\x004\x001\x004\x001\x001\x004\x001\x002\x000\x00', '\x00-\x005\x000\x000\x00',......
However, if I manually create a normal text file and copy/paste the content from the input file into it, the parser reads each line correctly. So I am thinking the input files are of a different type from a normal text file, even though the files' suffix is indeed 'txt'.
The files come from a device that regularly sends files to our server. This parser works fine for another device that does the same thing, and the files from both devices are all of type 'txt'.
Each line is read with for line in self._infile.xreadlines():
I am very confused about why it behaves this way.
My Python code follows.
def __init__(self, infile=sys.stdin, outfile=sys.stdout):
    if isinstance(infile, basestring):
        infile = open(infile)
    if isinstance(outfile, basestring):
        outfile = open(outfile, "w")
    self._infile = infile
    self._outfile = outfile

def sort(self):
    lines = []
    last_second = None
    for line in self._infile.xreadlines():
        line = line.replace('\r\n', '')
        fields = line.split(',')
        if len(fields) < 2:
            continue
        second = fields[1]
        if last_second and second != last_second:
            lines = sorted(lines, self._sort_lines)
            self._outfile.write("".join([','.join(x) for x in lines]))
            #self._outfile.write("\r\n")
            lines = []
        last_second = second
        lines.append(fields)
    if lines:
        lines = sorted(lines, self._sort_lines)
        self._outfile.write("".join([','.join(x) for x in lines]))
        #self._outfile.write("\r\n")
    self._infile.close()
    self._outfile.close()
The start of the file you described as coming from your colleague is "\xff\xfe". These two characters make up a "byte order mark" that indicates that the file is encoded with the "UTF-16-LE" encoding (that is, 16-bit Unicode with the lower byte first). Your Python script is reading with an 8-bit encoding (probably whatever your system's default encoding is), so you're seeing lots of extra null characters (the high bytes of the 16-bit characters).
I can't speak to how the file got a different encoding. Windows text editors (like notepad.exe) are somewhat notorious for silently reencoding files in unhelpful ways if you're not careful with them, so it may be that your colleague previewed the file in an editor and then saved it before forwarding it on to you.
Anyway, the simplest fix is probably to reencode the file. There are various utilities to do this on various OSs, or you could write your own easily enough. Here's a quick and dirty function to reencode a file in Python (which will hopefully raise an exception if the encoding parameters are wrong, but perhaps not always):
def reencode_file(filename, from_encoding="UTF-16-LE", to_encoding="ascii"):
    with open(filename, "rb") as f:
        in_bytes = f.read()  # read raw bytes
    text = in_bytes.decode(from_encoding)  # decode to unicode
    out_bytes = text.encode(to_encoding)  # reencode to the new encoding
    with open(filename, "wb") as f:
        f.write(out_bytes)  # write back to the file
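For example, to convert one of the problem files in place (hypothetical filename):

reencode_file("device_output.txt")  # hypothetical name; rewrites the file as ASCII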
If the file you get is going to always be encoded in UTF-16, you could change your regular script to decode it automatically. In Python 2.7, I'd suggest using the io module's open function for this (it is the same code that the regular open uses in Python 3). Note however that the file object returned won't support the xreadlines method which has been deprecated for a long time (just iterate over the file directly instead).
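A sketch of that approach, assuming the device's files are always UTF-16 with a leading BOM (the generic "utf-16" codec detects the byte order from the BOM):

import io

# "device_output.txt" is a hypothetical name; io.open decodes transparently
with io.open("device_output.txt", encoding="utf-16") as f:
    for line in f:  # iterate directly; xreadlines is not available here
        fields = line.rstrip(u'\r\n').split(u',')
        print fields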

Issue reading / writing specific lines within a log file in Python

I am attempting to print a line that contains a word from within a log file.
I have done some research and have not yet found a good way to implement this.
I currently have this code:
FileInput = open(FILE, "r", encoding='utf-8')
for line in FileInput:
    if "DATA: " in line:
        print line
After looking around, this seems to be the way most people do it, but I get the following error: TypeError: coercing to Unicode: need string or buffer, NoneType found.
I know the set length of "DATA: ", and each line ends with the hexadecimal value 0A (a line feed).
Either your FILE variable does not contain a proper string (can we see its value? Put print(FILE) before opening the file and paste the result here), or the file is not encoded in a way that is compatible with UTF-8. Try opening it in a good editor (like jEdit or Notepad++), see what encoding the editor reports, and then specify that encoding instead of utf-8.
It seems you need to use:
import codecs

f = codecs.open(FILE, encoding='utf-8', mode='r')
Take a look at the Unicode HOWTO.
try this:
FileInput = open(FILE, "r")
for line in FileInput:
    if "DATA: " in line:
        print(line)
