Python magic is not recognizing the correct content

I have read the content of a file into a variable that looks like this:
b'8,092436.csv,,20f85'
I would now like to find out what kind of file this data comes from, using:
print(magic.from_buffer(str(decoded, 'utf-8'), mime=True))
This prints:
application/octet-stream
Does anyone know how I could get a result saying 'csv'?

Use magic on the original file (for example with magic.from_file) rather than on the decoded string.
You also need to take into account that CSV is really just a text file that uses particular characters to delimit the content. There is no explicit identifier that marks a file as CSV. Even once you know it is CSV, the csv module needs to be configured to use the appropriate delimiters.
The delimiter of a CSV file is either defined by your program or needs to be configured (importing into Excel is an example: you are presented with a number of options to configure the type of CSV to import).
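Since CSV is identified by convention rather than by a signature, the closest thing to "detecting" it is a heuristic. A minimal sketch using the standard library's csv.Sniffer follows; the helper name guess_delimiter, the candidate delimiter set, and the sample data are assumptions made for illustration, and a short or irregular sample can easily fool it.

```python
import csv


def guess_delimiter(text, candidates=",;\t|"):
    """Return the delimiter csv.Sniffer guesses, or None if it cannot decide."""
    try:
        dialect = csv.Sniffer().sniff(text, delimiters=candidates)
    except csv.Error:
        # Sniffer raises csv.Error when no consistent delimiter is found.
        return None
    return dialect.delimiter


# Hypothetical sample: every line has the same number of commas.
sample = "id,name,score\n1,alice,9.5\n2,bob,7.0\n"
delimiter = guess_delimiter(sample)
print(repr(delimiter))
```

A guessed delimiter only says the text is consistent with CSV, not that it was produced as CSV.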

Related

Python csv package - issue with DictReader module

I'm having a curious issue with the csv package in Python 3.7.
I'm importing a csv file and can access all of the file as expected, with one exception: the header row, as stored in the "fieldnames" object, appears to have the first column header (first item in fieldnames) malformed.
This first field always has the format: 'xxx"header"'
where:
xxx are garbage characters that always seem to be the same
header is the correct header text
See the following screenshot of my table <csv.DictReader> object from my debug window:
My code to open the file follows. I added the line headers[0] = table.fieldnames[0].split('"')[1] in order to extract the correct header and place it back into fieldnames.
import csv
with self.inputfile.open() as self.inputfid:
    table = csv.DictReader(self.inputfid, delimiter=',')
    headers = table.fieldnames
    headers[0] = table.fieldnames[0].split('"')[1]
(Note: self.inputfile is a pathlib.Path object)
I didn't notice this for a long time because I wasn't using the first column (with the # header) - I've been happily parsing with the rest of the columns for a while on multiple files.
If I look directly at the csv, there doesn't appear to be any issue:
Questions:
Does anyone know what the issue is? Is there anything I can try to correct the import issue?
If there isn't a fix, is there a better way to parse the garbage? I realize this could clear up in the future, but I think the split will still work even with just bare double quotes (the header should still be the 2nd item in the split, right?). Is there a better solution?
It looks like your csv file is encoded as utf-8-sig - a version of utf-8 used by some Windows applications, but it's being decoded as cp1252 - another encoding in common use on Windows.
>>> print('"#"'.encode('utf-8-sig').decode('cp1252'))
ï»¿"#"
The "garbage" characters preceding the header are the byte-order-mark that utf-8-sig uses to tell Windows applications that a file is encoded as utf-8 rather than one of the historically more common 8-bit encodings.
To avoid the "garbage", specify utf-8-sig as the encoding when opening your file.
The code in the question could be modified to work like this:
import csv
encoding = 'utf-8-sig'
with self.inputfile.open(encoding=encoding, newline='') as self.inputfid:
    table = csv.DictReader(self.inputfid, delimiter=',')
    headers = table.fieldnames
    ...
If - as seems likely - the encoding of input files may vary, the value of encoding (or a best guess) must be determined by using a tool like chardet, as used in the comments.
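If the only variation you expect is "UTF-8 with a BOM" versus one legacy encoding, you can check for the BOM yourself before reaching for a detection library. A minimal sketch, assuming cp1252 as the fallback (that choice, the helper name pick_encoding, and the sample bytes are all hypothetical):

```python
import codecs
import csv
import io


def pick_encoding(raw, default="cp1252"):
    """Choose utf-8-sig when the bytes start with a UTF-8 BOM, else a default."""
    if raw.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"
    return default  # assumption: the legacy encoding is known in advance


# Simulate a file written by an application that adds a BOM.
raw = '"#","name"\r\n"1","alice"\r\n'.encode("utf-8-sig")

text = raw.decode(pick_encoding(raw))
reader = csv.DictReader(io.StringIO(text, newline=""))
print(reader.fieldnames)
```

Decoding with utf-8-sig strips the BOM, so the first header comes back clean and no split on '"' is needed.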

Can a text file and a json file be used interchangeably? And if so how can I use it in python?

Question: I was wondering if JSON and txt files could be used interchangeably in Python.
More Details: I found this on the internet and this on Stack Overflow to find out what a JSON file is, but neither said whether JSON and txt could be used interchangeably, i.e. using the same commands. For example, can both use the same code, with open('filename') as file:, or does JSON require different code? Also, if they can be used in the same general manner, is linking and using commands for a JSON file and a txt file the same process?
OS: windows 10
IDE: IDLE 64-bit
Version: Python 3.7
A .txt file can contain JSON data, and open() in Python can open any file, with any content and any file extension (provided the user running the code has permission to do so).
Problems start only when you try to load a non-JSON string or file using json.loads or json.load, respectively.
In other words, a file contains binary data. The data can be represented as a string; that string could be XHTML, JSON, CSV, YAML, whatever, and you must use the appropriate parser to extract the relevant data from that format (but it's not always the file extension that determines what to use).
"does JSON require a different code"
It requires another module:
import json
with open(name) as f:
    data = json.load(f)
You can read the raw data out of any file the same way; the difference is in reading the structure in the data.
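To make that concrete, here is a small sketch that writes JSON into a file with a .txt extension, then reads it once as raw text and once through the json module. The file name settings.txt and its contents are made up for the example.

```python
import json
import os
import tempfile

# Hypothetical file: JSON content stored under a .txt extension.
path = os.path.join(tempfile.mkdtemp(), "settings.txt")
with open(path, "w", encoding="utf-8") as f:
    f.write('{"name": "demo", "retries": 3}')

# Reading the raw data works the same for any text file.
with open(path, encoding="utf-8") as f:
    raw = f.read()

# Reading the structure needs the parser for that format.
with open(path, encoding="utf-8") as f:
    data = json.load(f)

# Text that is not JSON fails only at the parsing step.
try:
    json.loads("just some text")
    parse_failed = False
except json.JSONDecodeError:
    parse_failed = True

print(type(raw).__name__, type(data).__name__, parse_failed)
```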

How to change automatically the type of the excel file from Tab space separated Text to xls file?

I have an Excel file whose extension is .xls but whose type is Tab Space separated Text.
When I try to open the file in MS Excel, it tells me that the extension does not match the content, so I have to confirm that I trust the file before I can read it.
But my real problem is that when I try to read the file with the xlrd library, it gives me this message:
xlrd.biffh.XLRDError: Unsupported format, or corrupt file: Expected BOF record;
To resolve this problem, I go to Save As in MS Excel and change the type manually to .xls.
But my boss insists that I do this in code. I have 3 choices: a shell script under Linux, a .bat file under Windows, or Python.
So, how can I change the type of the Excel file from Tab Space separated Text to a real xls file with a shell script (command line), .bat file, or Python?
mv file.{xls,csv}
It's a csv file, stop treating it as an excel file and things will work a lot better. :) There are nice csv manipulation tools available in most languages. Do you really need the excel library?
The real type of the file is dictated by the contents of the file, not the name of it. xlrd doesn't care about the name at all, it cares about the contents, so xlrd is not your problem, and it's not even relevant to your task.
I don't know what you mean by "tab space separated text". Are the values separated by '\t ' (a tab character followed by a space character)? Sometimes tabs and sometimes spaces?
If the separator is constant, just use Python's csv module. If the separator is whitespace and the data does not contain whitespace, then you can use Python's split() string method. If the separator varies and can appear in the data, then you will have to write something fancier to parse it.
In any case, once you have read the data, to write out a real .xls file, your best Python option is the xlwt module.
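For the reading half, a minimal sketch of the two simple cases described above: a constant tab separator handled by the csv module, and arbitrary whitespace handled by split(). The sample data is invented, and the writing half (xlwt) is left out because it is a third-party package.

```python
import csv
import io

# Hypothetical contents of the mislabeled ".xls" file: tab-separated text.
text = "name\tqty\nwidget\t4\ngadget\t10\n"

# Constant separator: let the csv module do the parsing.
rows = list(csv.reader(io.StringIO(text, newline=""), delimiter="\t"))

# Separator is any run of whitespace and values contain none: split() is enough.
rows_split = [line.split() for line in text.splitlines()]

print(rows)
print(rows_split)
```

With a real file you would pass the open file object (opened with newline='') to csv.reader instead of the StringIO stand-in.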

Python's csv.writerow() is acting a tad funky

From what I've researched, csv.writerow should take in a list and then write it to the given csv file. Here's what I tried:
from csv import writer
with open('Test.csv', 'wb') as file:
    csvFile, count = writer(file), 0
    titles = ["Hello", "World", "My", "Name", "Is", "Simon"]
    csvFile.writerow(titles)
I'm just trying to write it so that each word is in a different column.
When I open the file that it creates, however, I get the following message:
After choosing to continue anyway, I get a message saying that the file is either corrupted or is a SYLK file. I can then open the file, but only after going through two error messages every time I open it.
Why is this?
Thanks!
It's a documented issue that Excel will assume a csv file is SYLK if the first two characters are 'ID'.
Venturing into the realm of opinion - it shouldn't, but Excel thinks it knows better than the extension. To be fair, people expect it to be able to figure out cases where the extension really is wrong, but in a case like this assuming the extension is wrong, and then further assuming the file is corrupt when it doesn't appear corrupt if interpreted according to the extension is just mind-boggling.
@John Y points out:
One thing to watch out for: The "workaround" given by the Microsoft issue linked to by @PeterDeGlopper is to (manually) prepend an apostrophe into the file. (This is also advice commonly found on the Web, including StackOverflow, to try to force CSV digits to be treated as strings rather than numbers.) This is not what I'd call good advice, as that injects a literal apostrophe into your data.
@DSM suggests using quoting=csv.QUOTE_NONNUMERIC on the writer. Excel is not confused by a file beginning with "ID" rather than ID, so if the other tools that are going to work with the CSV accept that quoting level this is probably the best solution other than just ignoring Excel's confusion.
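A short sketch of the QUOTE_NONNUMERIC suggestion, written for Python 3 and using an in-memory buffer instead of Test.csv so the result can be inspected directly. The column names and row values are made up for the example.

```python
import csv
import io

# In-memory stand-in for open('Test.csv', 'w', newline='').
buffer = io.StringIO(newline="")
writer = csv.writer(buffer, quoting=csv.QUOTE_NONNUMERIC)

# Every non-numeric field is quoted, so the file starts with "ID" (with quotes)
# instead of the bare characters ID.
writer.writerow(["ID", "Name", "Score"])
writer.writerow([1, "Simon", 9.5])

output = buffer.getvalue()
print(output)
```

Note that a reader using QUOTE_NONNUMERIC converts unquoted fields to float, so other tools reading the file should be checked against this quoting level.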

File Reading Options Enquiry (Python)

I am a programming student this semester. In class we have been learning about opening, reading and writing files.
We have used a_reader to open files. I have been reading our associated texts and noticed that there is also a CSV reader option, which I have been using.
I wanted to know if there are any more ways to open/read a file, as I am trying to grow my knowledge of Python.
EDIT:
I was referring to CSV specifically, as that is the type of file we use at the moment. We have learnt about CSV Reader and a_reader, and an example from one of our lectures is shown below.
def main():
    a_reader = open('IDCJAC0016_009225_1800_Data.csv', 'rU')
    file_data = a_reader.read()
    a_reader.close()
    print file_data
main()
It may seem overly broad, but I have no prior knowledge, which is why I am asking whether there are more than just the 2 ways above. If there are, can someone list them so I can read up on and research them?
If you're asking about places to store things, the first interfaces you'll meet are files and sockets (pretend a network connection is like a file, see http://docs.python.org/2/library/socket.html).
If you mean file formats (like csv), there are many! Probably you can think of many yourself, but besides csv there are html files, pictures (png, jpg, gif), archive formats (tar, zip), text files (.txt!), python files (.py). The list goes on.
There are many different ways to read files.
Just plain open will take a filename and open it as a sequence of lines. Or, you can just call read() on it, and it will read the whole file at once into one giant string.
codecs.open will take a filename and a character set, and decode each line to Unicode automatically (in Python 3, the built-in open does the same when given an encoding argument). Or, again, you can just call read() on it, and it will read and decode the whole file at once into one giant Unicode string.
csv.reader will take a file or file-like object, and read it as a sequence of CSV rows. There's no direct equivalent of read()—but you can turn any sequence into a list by just calling list on it, so list(my_reader) will give you a list of rows (each of which is, itself, a list).
zipfile.ZipFile will take a filename, or a file or file-like object, and read it as a ZIP archive. This doesn't go line by line, of course, but you can go archived file by archived file. Or you can do fancier things, like search for archived files by name.
There are modules for reading JSON and XML documents, different ways of handling binary files, and so on. Some of them work differently—for example, you can search an XML document as a tree with one module, or go element by element with a different one.
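As one illustration of "archived file by archived file", here is a small sketch with zipfile.ZipFile. It builds a tiny archive in memory first so it runs on its own; the member names and contents are invented for the example.

```python
import io
import zipfile

# Build a small archive in memory (a stand-in for a .zip file on disk).
archive_bytes = io.BytesIO()
with zipfile.ZipFile(archive_bytes, "w") as zf:
    zf.writestr("notes.txt", "first line\n")
    zf.writestr("data/table.csv", "x,y\n1,2\n")

# Read it back: list the archived files, then pull one out by name.
with zipfile.ZipFile(archive_bytes) as zf:
    names = zf.namelist()
    table_text = zf.read("data/table.csv").decode("utf-8")

print(names)
print(table_text)
```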
Python has a pretty extensive standard library, and you can find the documentation online. Every module that seems like it should be able to work on files, probably can.
And, beyond what comes in the standard library, PyPI, the Python Package Index has thousands of additional modules. Looking for a way to read YAML documents? Search PyPI for yaml and you'll find it.
Finally, Python makes it very easy to add things like this on your own. The skeleton of a function like csv.reader is as simple as this:
def reader(fileobj):
    for line in fileobj:
        yield parse_one_csv_line(line)
You can replace that parse_one_csv_line with anything you want, and you've got a custom reader. For example, here's an uppercase_reader:
def uppercase_reader(fileobj):
    for line in fileobj:
        yield line.upper()
In fact, you can even write the whole thing in one line:
shouts = (line.upper() for line in fileobj)
And the best thing is that, as long as your reader yields one line at a time, it is an iterable of lines, which is all csv.reader needs, so you can pass uppercase_reader(fileobj) to csv.reader and it works just fine.
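Here is that chaining as a runnable sketch. An io.StringIO object stands in for an open file, and the sample rows are made up.

```python
import csv
import io


def uppercase_reader(fileobj):
    """Yield each line of fileobj converted to upper case."""
    for line in fileobj:
        yield line.upper()


# Stand-in for open('some.csv', newline='').
fileobj = io.StringIO("name,city\nsimon,perth\n")

# csv.reader accepts any iterable that yields one line (string) at a time.
rows = list(csv.reader(uppercase_reader(fileobj)))
print(rows)
```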
