Python's csv.writerow() is acting a tad funky

From what I've researched, csv.writerow() should take in a list and then write it to the given csv file. Here's what I tried:
from csv import writer

with open('Test.csv', 'wb') as file:
    csvFile, count = writer(file), 0
    titles = ["Hello", "World", "My", "Name", "Is", "Simon"]
    csvFile.writerow(titles)
I'm just trying to write it so that each word is in a different column.
When I open the file that it creates in Excel, however, I get a warning message.
After pressing through to continue anyway, I get a message saying that the file is either corrupted or is a SYLK file. I can then open the file, but only after going through two error messages every time I open it.
Why is this?
Thanks!

It's a documented issue that Excel will assume a csv file is SYLK if the first two characters are 'ID'.
Venturing into the realm of opinion - it shouldn't, but Excel thinks it knows better than the extension. To be fair, people expect it to be able to figure out cases where the extension really is wrong, but in a case like this, assuming the extension is wrong, and then further assuming the file is corrupt when it doesn't appear corrupt if interpreted according to the extension, is just mind-boggling.
@John Y points out:
One thing to watch out for: the "workaround" given by the Microsoft issue linked to by @PeterDeGlopper is to (manually) prepend an apostrophe to the file. (This is also advice commonly found on the Web, including on Stack Overflow, to try to force CSV digits to be treated as strings rather than numbers.) This is not what I'd call good advice, as that injects a literal apostrophe into your data.
@DSM suggests using quoting=csv.QUOTE_NONNUMERIC on the writer. Excel is not confused by a file beginning with "ID" rather than ID, so if the other tools that are going to work with the CSV accept that quoting level, this is probably the best solution, other than just ignoring Excel's confusion.
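To make that concrete, here is a minimal sketch of the quoting workaround (Python 3 idiom, unlike the question's Python 2 'wb' mode; the 'ID' header is just the case from the linked issue that triggers the problem):

import csv

# With QUOTE_NONNUMERIC the file starts with '"ID"' rather than 'ID',
# so Excel no longer mistakes it for SYLK. Note that any field you want
# written unquoted must actually be a numeric type.
with open('Test.csv', 'w', newline='') as f:
    csvFile = csv.writer(f, quoting=csv.QUOTE_NONNUMERIC)
    csvFile.writerow(["ID", "Name"])
    csvFile.writerow([1, "Simon"])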

Related

Python - writing string to HTML file causes computational issues

I appreciate that this may be an issue with my computer/software, but I want to double check that my code isn't causing the problem before ruling it out.
I have written a fairly simple program. I have a short list of strings read in from a text file, then with a different text file open, I iterate over each word in the second text file, checking if the first two letters of the word are contained in the first list of strings.
Then, if that condition is fulfilled, I use string interpolation to insert that word into a string of HTML code. Finally, I append that string to an existing, empty .html file. When finished iterating, I close the HTML file.
with open("strings.txt", "r") as f:
strings = f.read().splitlines()
urlfile = open("links.html", "a")
with open("words.txt", "r") as f:
text = f.read().splitlines()
for word in text:
if word[:2] in strings:
html = '<a href="[URL]/{}">'.format(word)
urlfile.write(html)
urlfile.close()
So far there doesn't actually seem to be any issue with my code doing what I want: I am generating the right HTML code, it prints to the console quickly, and it is being appended to the html file.
The problem I have is that something I am doing must be computationally expensive or problematic, because Notepad++ freezes every time I try to check links.html for the results. I have managed to see that it looks correct, but Notepad++ then becomes unusable, and my computer is clearly straining. The only solution I have is to close anything related to the html file.
None of the lists used are long and all the operations should in theory be quite simple, so I feel as though I must be doing something wrong. Am I writing to files in an unsafe way? Am I doing something wildly expensive that I'm just missing? I am using Notepad++ v7.9.5, Python 3, and Anaconda prompt.
EDIT: I am now able to access the html file on my browser and on Notepad++ without issue. I think the source of the problem was some laptop software updating in the background without me noticing. I'll check that first next time!
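For what it's worth, the writing pattern itself can be tightened up. Here is a sketch (same file names assumed) that closes the output file automatically and writes one anchor per line, so an editor never has to render a single enormous line:

with open("strings.txt") as f:
    strings = set(f.read().splitlines())

# Opening both files in one with-statement guarantees they are closed
# even if an error occurs mid-loop.
with open("words.txt") as f, open("links.html", "a") as urlfile:
    for word in f.read().splitlines():
        if word[:2] in strings:
            # one anchor per line instead of one very long line
            urlfile.write('<a href="[URL]/{}">\n'.format(word))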

Python csv package - issue with DictReader module

I'm having a curious issue with the csv package in Python 3.7.
I'm importing a csv file and am able to access all of the file as expected, with one exception - the header row, as stored in the "fieldnames" object, appears to have the first column header (the first item in fieldnames) malformed.
This first field always has the format: 'xxx"header"'
where:
xxx are garbage characters that always seem to be the same
header is the correct header text
In my debug window, the table <csv.DictReader> object shows exactly this malformed first fieldname.
My code to open the file follows. I added the headers[0] = table.fieldnames[0].split('"')[1] line in order to extract the correct header and place it back into fieldnames.
import csv

with self.inputfile.open() as self.inputfid:
    table = csv.DictReader(self.inputfid, delimiter=',')
    headers = table.fieldnames
    headers[0] = table.fieldnames[0].split('"')[1]
(Note: self.inputfile is a pathlib.Path object)
I didn't notice this for a long time because I wasn't using the first column (the one with the # header) - I've been happily parsing the rest of the columns for a while, on multiple files.
If I look directly at the csv file, there doesn't appear to be any issue.
Questions:
Does anyone know what the issue is? Is there anything I can try to correct the import issue?
If there isn't a fix, is there a better way to parse the garbage? I realize this could clear up in the future, but I think the split will still work even with just bare double quotes (the header should still be the 2nd item in the split, right?). Is there a better solution?
It looks like your csv file is encoded as utf-8-sig - a version of utf-8 used by some Windows applications, but it's being decoded as cp1252 - another encoding in common use on Windows.
>>> print('"#"'.encode('utf-8-sig').decode('cp1252'))
"#"
The "garbage" characters preceding the header are the byte-order-mark that utf-8-sig uses to tell Windows applications that a file is encoded as utf-8 rather than one of the historically more common 8-bit encodings.
To avoid the "garbage", specify utf-8-sig as the encoding when opening your file.
The code in the question could be modified to work like this:
import csv

encoding = 'utf-8-sig'

with self.inputfile.open(encoding=encoding, newline='') as self.inputfid:
    table = csv.DictReader(self.inputfid, delimiter=',')
    headers = table.fieldnames
    ...
If - as seems likely - the encoding of input files may vary, the value of encoding (or a best guess) can be determined using a tool like chardet, as suggested in the comments.
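A sketch of that guessing step, assuming the third-party chardet package (any detector with a similar API would do); self.inputfile is the pathlib.Path from the question:

import csv
import chardet  # third-party: pip install chardet

# Read the raw bytes once, let the detector guess, and fall back to
# utf-8-sig if it can't decide.
raw = self.inputfile.read_bytes()
encoding = chardet.detect(raw)['encoding'] or 'utf-8-sig'

with self.inputfile.open(encoding=encoding, newline='') as fid:
    table = csv.DictReader(fid, delimiter=',')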

ValueError Reading large data set with pd.read_json

I am working through a set of code exercises that use a Yelp reviews dataset. At this point in the exercises I am supposed to read in review.json, which has one JSON record per line. I have made a smaller version of the JSON file, with only 100 records, for testing.
I can read the entire test file into a pandas dataframe and examine it.
The complete dataset file, however, has about 6 million lines. The recommendation is to use chunksize and build a json reader. I'm hitting errors, even with my test input.
My code currently looks like this
path = 'file://localhost/Users/.../DSC_Intro/'
filename = path + 'yelp_dataset/review_100.json'
# create a reader to read in chunks
review_reader = pd.read_json(StringIO(filename), lines=True, chunksize=10)
type(review_reader)
The type call returns
pandas.io.json.json.JsonReader
which looks good.
Then I try
for chunk in review_reader:
    print(chunk)
as shown in the pandas user guide,
and I get an error:
ValueError: Unexpected character found when decoding 'false'
Update: it has been suggested that the issue is caused by embedded (quoted) "\n" characters in the data file - that pandas is seeing the JSON records not as one per line, but as spanning multiple lines.
The error message is VERY opaque, if that's the case. Also, with 6 million lines, how should I tell pd.read_json to ignore "\n" and only look at actual newlines in the data?
Update
It's been suggested that I fix my typo (it was a typo in this post, not a typo in my code) and use a Unix file path instead of a URL (JSON doesn't care: see the docs).
When I do this but keep StringIO(), I get a different ValueError.
When I do this but remove StringIO(), the code works.
This seems to be very fragile. :-(
Note: The tutorial has an answer key. I've tried that code. The answer key uses
review_reader = pd.read_json(filename, lines=True, chunksize=10)
which throws the TypeError
sequence item 0: expected str instance, bytes found
Adding StringIO() seems to have solved that.
Input: a sample JSON record; there is one per line of the input file.
{"review_id":"Amo5gZBvCuPc_tZNpHwtsA","user_id":"DzZ7piLBF-WsJxqosfJgtA","business_id":"qx6WhZ42eDKmBchZDax4dQ","stars":5.0,"useful":1,"funny":0,"cool":0,"text":"Our family LOVES the food here. Quick, friendly, delicious, and a great restaurant to take kids to. 5 stars!","date":"2017-03-27 01:14:37"}
Firstly, your
path 'file://localhost/Users/.../DSC_Intro/'
is not valid python code. If you try to execute that as such, you will get an invalid syntax error. I assume, then, that this is just showing the value of the path variable. I don't know whether those ellipses are literal or the result of your environment truncating the display of path. I'll assume here that your path is a valid file URL for your system as it doesn't seem germane here to consider an incorrect path.
Either way, yes, read_json can read json from a file URL as you're specifying there (I learned something there) if you read it in one go:
pd.read_json(fname, lines=True)
But if you try to create a reader from this, by specifying
pd.read_json(fname, lines=True, chunksize=...)
then you get
TypeError: sequence item 0: expected str instance, bytes found
Secondly, yes, wrapping your file-like argument with StringIO makes this error go away, but it isn't helping for the reason you might think, and its use is based on a misreading of the pandas docs you point to.
I'll quote a couple of bits from the read_json doc here:
Signature: pd.read_json(
path_or_buf=None, ...
path_or_buf : a valid JSON string or file-like, default: None
The string could be a URL. Valid URL schemes include http, ftp, s3,
gcs, and file. For file URLs, a host is expected. For instance, a local
file could be file://localhost/path/to/table.json
So with read_json, you can either give it an actual string that is valid JSON, or you can give it a file-like object that points to a file that contains JSON.
Notice in the pandas docs that you cite:
In [258]: jsonl = '''
.....: {"a": 1, "b": 2}
.....: {"a": 3, "b": 4}
.....: '''
.....:
is JSON, not a path. When their example then does:
df = pd.read_json(jsonl, lines=True)
it is merely parsing the JSON in the string - no files are involved here.
When it then wants to demonstrate reading from a file in chunks, it does
# reader is an iterator that returns `chunksize` lines each iteration
In [262]: reader = pd.read_json(StringIO(jsonl), lines=True, chunksize=1)
In other words, they are wrapping a JSON string, not a path, by StringIO(). This is just for the purposes of the documented example, so you can see that if you treated the JSON string as if it were being read from a file you can read it in chunks. That's what StringIO() does. So when you wrap the string that describes your file URL in StringIO(), I expect that read_json is then trying to interpret that string as JSON that's being read from a file and parse it. It understandably falls over because it isn't JSON.
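To make that concrete, here is the docs' pattern as a self-contained snippet - note that the string being wrapped is JSON itself, not a path:

from io import StringIO
import pandas as pd

# Two JSON records in a plain string, wrapped in StringIO so read_json
# can chunk them as if it were reading from a file.
jsonl = '{"a": 1, "b": 2}\n{"a": 3, "b": 4}'

for chunk in pd.read_json(StringIO(jsonl), lines=True, chunksize=1):
    print(chunk)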
This brings us back to why read_json cannot read your file URL in chunks. I don't have an immediate good answer to that. I suspect it lies in the internals of how read_json opens file URLs, or what function underlies this. If you were intent upon, or forced to, do this chunking from a file URL then I suspect you'd be looking at controlling the mode in which the file is opened, or perhaps somehow providing explicit guidance to read_json how to interpret the bytestream it gets. Libraries such as urllib2 may be useful here, I'm not sure.
But let's cut to the best fix here. Why are we trying to specify the path as a file URL? Simply specify your path as an OS path, e.g.
path = '/path/to/my/data/'
and then
filename = path + 'yelp_dataset/review_100.json'
# create a reader to read in chunks
review_reader = pd.read_json(filename, lines=True, chunksize=10)
And I betcha it works as intended! (It does for me, as it always has).
Caveat: Windows doesn't use forward-slash path delimiters, and constructing paths by concatenating strings in the above fashion can be fragile, but usually if you use 'proper' forward-slash delimiters (smile), decent languages internally understand that. It's constructing paths using backslashes that is guaranteed to cause you pain. But just keep an eye on that.
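If you want to sidestep the delimiter question entirely, pathlib builds the path for you - a sketch, assuming a reasonably recent pandas that accepts path objects (the directory name is a placeholder for your own):

from pathlib import Path
import pandas as pd

# The / operator joins path components portably on any OS.
path = Path('/path/to/my/data')
filename = path / 'yelp_dataset' / 'review_100.json'

review_reader = pd.read_json(filename, lines=True, chunksize=10)
for chunk in review_reader:
    print(chunk)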

Find lines which contain a specific word

First of all, I have done a lot of research and tried many approaches, but maybe I am doing it wrong, because I was not able to find a solution.
My data:
https://knsim.com/test.txt
The problem is that I want to find every occurrence of 'Idea_Print' and write it to a new file (not just the name 'Idea_Print', but the complete line of the log).
I have tried many approaches but had not success.
My recent code:
file = open(filename, 'r')
for line in file:
    if 'Idea_Print' in line:
        print(line)
But this didn't work for me. I even tried using re, but with no success.
Thanks for help.
I tried running your code with my setup and it worked fine. Most likely, you are using the wrong filename in open().
In file = open(filename, 'r'), make sure that filename is exactly the same name as your saved data, including the extension. Ensure that the file you are trying to access is in the same folder as your Python file. If it is not in the same folder, try including the full path to the file.
Also, at the end of your program make sure you call file.close() to allow other programs to access it correctly.
Another thing that could be causing problems is that your variable is called file. This is a built-in name in Python, so you probably don't want to overwrite it. Try changing it to something like data_file.
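Putting that together, a minimal sketch of the loop with safer file handling - assuming the log is saved locally as 'test.txt' and the matches should go to a new file:

# The with-statement closes both files automatically, and each matching
# line is written out in full.
with open('test.txt') as infile, open('matches.txt', 'w') as outfile:
    for line in infile:
        if 'Idea_Print' in line:
            outfile.write(line)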
Edit: Looking at that file, it appears that there is a space between every character. That means you will need to search for 'I d e a _ P r i n t' instead of 'Idea_Print'. However, even that does not appear to be in the file, so maybe you should double-check that it really is the text you are looking for.

File Reading Options Enquiry (Python)

I am a programming student for the semester. In class we have been learning about file opening, reading and writing.
We have used a_reader to achieve such tasks as file opening. I have been reading our associated texts and I have noticed that there is a CSV reader option, which I have been using.
I wanted to know if there were any more possible ways to open/read a file, as I am trying to grow my knowledge base in Python and its associated contents.
EDIT:
I was referring to CSV more specifically, as that is the type of file we use at the moment. We have learnt about the CSV reader and a_reader, and an example from one of our lectures is shown below.
def main():
    a_reader = open('IDCJAC0016_009225_1800_Data.csv', 'rU')
    file_data = a_reader.read()
    a_reader.close()
    print file_data

main()
It may seem overly broad, but I have no knowledge of this area, which is why I am asking: is there more than just the two ways above? If there is, can someone who knows provide the types so I can read up on and research them?
If you're asking about places to store things, the first interfaces you'll meet are files and sockets (pretend a network connection is like a file, see http://docs.python.org/2/library/socket.html).
If you mean file formats (like csv), there are many! Probably you can think of many yourself, but besides csv there are html files, pictures (png, jpg, gif), archive formats (tar, zip), text files (.txt!), python files (.py). The list goes on.
There are also many different ways to read files; several are described below, with quick sketches after the list.
Just plain open will take a filename and open it as a sequence of lines. Or, you can just call read() on it, and it will read the whole file at once into one giant string.
codecs.open will take a filename and a character set, and decode each line to Unicode automatically. Or, again, you can just call read() on it, and it will read and decode the whole file at once into one giant Unicode string.
csv.reader will take a file or file-like object, and read it as a sequence of CSV rows. There's no direct equivalent of read()—but you can turn any sequence into a list by just calling list on it, so list(my_reader) will give you a list of rows (each of which is, itself, a list).
zipfile.ZipFile will take a filename, or a file or file-like object, and read it as a ZIP archive. This doesn't go line by line, of course, but you can go archived file by archived file. Or you can do fancier things, like search for archived files by name.
There are modules for reading JSON and XML documents, different ways of handling binary files, and so on. Some of them work differently—for example, you can search an XML document as a tree with one module, or go element by element with a different one.
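To make those concrete, here are quick sketches of the first few options (Python 3 idiom; all file names here are placeholders):

import codecs
import csv
import zipfile

with open('notes.txt') as f:            # plain open: iterate over lines
    lines = f.read().splitlines()

with codecs.open('notes.txt', encoding='utf-8') as f:
    text = f.read()                     # decoded to Unicode automatically

with open('table.csv', newline='') as f:
    rows = list(csv.reader(f))          # a list of rows (lists of strings)

with zipfile.ZipFile('archive.zip') as z:
    names = z.namelist()                # the archived files, by name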
Python has a pretty extensive standard library, and you can find the documentation online. Every module that seems like it should be able to work on files, probably can.
And, beyond what comes in the standard library, PyPI, the Python Package Index has thousands of additional modules. Looking for a way to read YAML documents? Search PyPI for yaml and you'll find it.
Finally, Python makes it very easy to add things like this on your own. The skeleton of a function like csv.reader is as simple as this:
def reader(fileobj):
    for line in fileobj:
        yield parse_one_csv_line(line)
You can replace that parse_one_csv_line with anything you want, and you've got a custom reader. For example, here's an uppercase_reader:
def uppercase_reader(fileobj):
    for line in fileobj:
        yield line.upper()
In fact, you can even write the whole thing in one line:
shouts = (line.upper() for line in fileobj)
And the best thing is that, as long as your reader only yields one line at a time, your reader is itself a file-like object, so you can pass uppercase_reader(fileobj) to csv.reader and it works just fine.
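A quick usage sketch of that last point, reusing the uppercase_reader defined above ('data.csv' is a placeholder filename):

import csv

# csv.reader only needs an iterable of lines, so the generator works
# as a drop-in replacement for the file object.
with open('data.csv', newline='') as f:
    for row in csv.reader(uppercase_reader(f)):
        print(row)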
