I am reading an unformatted sequential file output from a Fortran program. I am using the scipy.io.FortranFile class to do this, and am successfully extracting the information I need.
My problem: I do not know how long the input file is, and have no way of knowing how many records to read in. Currently, I am simply iteratively reading the file until an exception is raised (a TypeError, but I don't know if this is how it would always fail). I would prefer to do this more elegantly.
Is there any way to detect EOF using the FortranFile class? Or, alternatively, is there a better way to read in unformatted sequential files?
Some cursory research (I am not a Fortran programmer) indicates that when reading such a file with the Fortran READ statement, one can check the IOSTAT flag to determine whether the end of the file has been reached. I would be surprised if a similar capability weren't provided by the FortranFile class, but I don't see any mention of it in the documentation.
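For reference, here is a minimal sketch of the read-until-exception loop I'm describing (the file name and dtype are placeholders; newer SciPy versions appear to raise a dedicated FortranEOFError at end of file, which I believe subclasses TypeError, so catching TypeError should cover both cases):

from scipy.io import FortranFile

f = FortranFile("output.dat", "r")                     # placeholder file name
records = []
while True:
    try:
        records.append(f.read_reals(dtype="float64"))  # or read_ints/read_record
    except TypeError:                                  # raised once EOF is hit
        break
f.close()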
This is a question about programming methodology.
I'm building a Python programme to check files for keywords. The keywords may occur more than once in a file and, when one is found, I record what the keyword was, the line on which it is found, and the sentence in which it is found.
The issue I'm hitting is that I can't predict all the file types my programme may meet. So, in order to mitigate this, I've been writing code to handle each file type that my programme might encounter: for example, one block of code to handle .txt files, then another to handle .csv files. As I'm doing this, I've been thinking it's not very efficient. I'm wondering: is there a way in Python to check any file of any given type for a keyword and return the information I'm seeking to index, without having to predict all the possible file types and write a case for handling each? If so, how should I be approaching this?
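To make that concrete, the per-extension handling I described looks roughly like this (a simplified sketch; the readers and the keyword check are placeholders):

import csv

def search_txt(path, keywords):
    hits = []
    with open(path, encoding="utf-8", errors="ignore") as f:
        for lineno, line in enumerate(f, start=1):
            for kw in keywords:
                if kw in line:
                    hits.append((kw, lineno, line.strip()))
    return hits

def search_csv(path, keywords):
    hits = []
    with open(path, newline="", encoding="utf-8", errors="ignore") as f:
        for lineno, row in enumerate(csv.reader(f), start=1):
            text = " ".join(row)
            for kw in keywords:
                if kw in text:
                    hits.append((kw, lineno, text))
    return hits

# One branch per file type -- this is the part that doesn't feel efficient.
HANDLERS = {".txt": search_txt, ".csv": search_csv}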
Thanks.
I am trying to find out which functions are called by the command g.io('file.json').read().iterate().
I see that a 'read' step is put in the step_instructions, but I can't find the original function that imports the file into the graph.
This is because I want to import a lot of data without a file, using a Python object instead.
I see that io().read() imports a big file in a minute, and I want to recreate that without using a file.
Thanks a lot.
First of all, to be clear on the nomenclature: io() is a step, while read() and write() are step modulators, and those modulators can only be applied to the io() step to tell it to read or write, respectively. Therefore, as io() currently only works with a string file name, you can only read from and write to files.
If you want to send "a lot of data" with Python, I'd first consider what you mean by that in size. If you're talking millions of vertices and edges, you should first check if the graph database you are using has its own bulk loading tool. If it does, you should use that. You may also consider methods using Gremlin/Spark as described here in the case of JanusGraph. Finally, if you must use pure Gremlin to do your bulk loading, then parameterized traversal with your Python object (I assume a list/dict of some sort) is probably the approach to take. This blog post might offer some inspiration in that line of thinking.
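As a rough illustration of that last, pure-Gremlin route with gremlinpython (the connection URL, the data, and the chunk size below are placeholders you would adapt):

from gremlin_python.process.anonymous_traversal import traversal
from gremlin_python.driver.driver_remote_connection import DriverRemoteConnection

conn = DriverRemoteConnection("ws://localhost:8182/gremlin", "g")
g = traversal().withRemote(conn)

people = [{"name": "alice"}, {"name": "bob"}]      # your Python object

CHUNK = 100
for start in range(0, len(people), CHUNK):
    t = g
    for p in people[start:start + CHUNK]:
        t = t.addV("person").property("name", p["name"])
    t.iterate()                                    # one round trip per chunk

conn.close()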
I'm trying to anonymize SIP traces by replacing all the phone numbers with random ones. I'm able to read the file and the numbers from it. What I can't do, however, is modify the file without corrupting it.
I've tried different parsers (pyshark, dpkt & scapy) and they're all great for reading the file. Modifying, however, doesn't work.
What I've tried:
"Brute Force" by just reading in the file, modifying it and saving
it as .pcap again. This obviously didn't work at all and Wireshark
complained about the file being cut short (which it was (probably for character reasons?)).
All the parsers. The problem with these is that I can read the file but I can't write to it without turning it all into a string which again, breaks the file.
Is there some kind of function in one of the libraries where I could replace a pattern with another one? Or do any of you have an idea of how I could solve this differently?
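For context, the kind of in-place payload rewrite I'm after would look roughly like this with scapy (a sketch only; the numbers and file names are placeholders, and it assumes SIP over plain UDP with a same-length replacement so packet sizes don't change):

from scapy.all import rdpcap, wrpcap, IP, UDP, Raw

OLD = b"15551234567"    # number to replace (placeholder)
NEW = b"19998887777"    # same length, so record sizes stay consistent

packets = rdpcap("trace.pcap")
for pkt in packets:
    if pkt.haslayer(UDP) and pkt.haslayer(Raw):
        payload = bytes(pkt[Raw].load)
        if OLD in payload:
            pkt[Raw].load = payload.replace(OLD, NEW)
            # drop cached lengths/checksums so scapy recomputes them on write
            if pkt.haslayer(IP):
                del pkt[IP].len
                del pkt[IP].chksum
            del pkt[UDP].len
            del pkt[UDP].chksum
wrpcap("trace_anonymized.pcap", packets)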
Thank you for your answers
I'm building a chatbot database at the moment. I use data from pushshift.io. In order to deal with the big data file (I understand that json loads everything into RAM, so if you only have 16GB of RAM and are working with 30GB of data, that is a no-no), I wrote a bash script that splits the big file into smaller chunks of 3GB each so that I can run them through json.loads (or pd.read_json). The problem is that whenever I run my code it returns
JSONDecodeError: Expecting value: line 1 column 1 (char 0)
So I took a look at the temp JSON file I had just created, and I see this in my JSON file:
ink_id":"t3_2qyr1a","body":"Most of us have some family members like this. *Most* of my family is like this. ","downs":0,"created_utc":"1420070400","score":14,"author":"YoungModern","distinguished":null,"id":"cnas8zv","archived":false,"parent_id":"t3_2qyr1a","subreddit":"exmormon","author_flair_css_class":null,"author_flair_text":null,"gilded":0,"retrieved_on":1425124282,"ups":14,"controversiality":0,"subreddit_id":"t5_2r0gj","edited":false}
A correctly formatted record looks like this:
{"score_hidden":false,"name":"t1_cnas8zv","link_id":"t3_2qyr1a","body":"Most of us have some family members like this. *Most* of my family is like this. ","downs":0,"created_utc":"1420070400","score":14,"author":"YoungModern","distinguished":null,"id":"cnas8zv","archived":false,"parent_id":"t3_2qyr1a","subreddit":"exmormon","author_flair_css_class":null,"author_flair_text":null,"gilded":0,"retrieved_on":1425124282,"ups":14,"controversiality":0,"subreddit_id":"t5_2r0gj","edited":false}
I noticed that my bash script splits the file without paying attention to JSON object boundaries. So my question is: is there a way to write a function in Python that can detect JSON objects that are not correctly formatted and delete them?
There isn't a lot of information to go on, but I would challenge the frame a little.
There are several incremental JSON parsers available in Python. A quick search shows that ijson should allow you to traverse your very large data structure without exploding memory usage.
You should also consider another data format (or a real database), or you will easily find yourself spending time reimplementing much slower versions of features that already exist with the right tools.
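For example, something along these lines with ijson (a sketch that assumes the dump is one big JSON array; adjust the prefix if yours is structured differently):

import ijson

with open("comments.json", "rb") as f:
    for comment in ijson.items(f, "item"):   # streams one object at a time
        handle(comment)                      # placeholder for your processing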
If you are using the json standard library, then calling json.loads on badly formatted data will raise a JSONDecodeError. You can wrap your code in a try/except block and catch this exception to make sure you only process correctly formatted data.
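A sketch of that approach, assuming the split files end up with one JSON object per line (as in the pushshift dumps):

import json

records = []
with open("chunk.json", "r", encoding="utf-8") as f:
    for line in f:
        try:
            records.append(json.loads(line))
        except json.JSONDecodeError:
            continue    # skip objects that were cut in half by the split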
I'm creating a library that caches function return values to pkl files. However, sometimes when I terminate the program while writing to the pkl files, I wind up with corrupt pkl files (not always). I'm setting up the library to deal with these corrupt files (that lead mostly to an EOFError, but may also lead to an IOError). However, I need to create files that I know are corrupt to test this, and the method of terminating the program is not consistent. Is there some other way to write to a pkl file and be guaranteed an EOFError or IOError when I subsequently read from it?
Short answer: You don't need them.
Long answer: There's a better way to handle this, take a look below.
OK, let's start by understanding each of these exceptions separately:
An EOFError happens whenever the parser reaches the end of the file without a complete representation of an object and is therefore unable to rebuild the object.
An IOError represents a reading error; the file could be deleted or have its permissions revoked during the process.
Now, let's develop a strategy for testing it.
One common idiom is to wrap the offending call, pickle.Pickler for example, in a function that may randomly raise these exceptions. Here is an example:
import pickle
from random import random
def chaos_pickle(obj, file, io_error_chance=0, eof_error_chance=0):
    # Randomly fail before writing, to simulate an interrupted/corrupted dump.
    if random() < io_error_chance:
        raise IOError("Chaotic IOError")
    if random() < eof_error_chance:
        raise EOFError("Chaotic EOFError")
    # Pickler takes the file first; dump(obj) then writes the object to it.
    return pickle.Pickler(file).dump(obj)
Using this instead of the traditional pickle.Pickler ensures that your code randomly throws both of the exceptions (note the caveat, though: if you set io_error_chance to 1, it will never raise an EOFError).
This trick is quite useful when used alongside the mock library (unittest.mock) to create faulty objects for testing purposes.
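For instance, a rough sketch with pytest and unittest.mock (load_cache and the test are placeholders standing in for your library's own read path):

import pickle
from unittest import mock

def load_cache(path):                  # stand-in for the library's read function
    try:
        with open(path, "rb") as f:
            return pickle.load(f)
    except (EOFError, IOError):
        return None

def test_corrupt_cache_is_handled(tmp_path):
    path = tmp_path / "cache.pkl"
    path.write_bytes(b"not a pickle")
    with mock.patch("pickle.load", side_effect=EOFError):
        assert load_cache(str(path)) is None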
Enjoy!
Take a bunch of your old, corrupted pickles and use those. If you don't have any, take a bunch of working pickles, truncate them quasi-randomly, and see which ones give errors when you try to load them. Alternatively, if the "corrupt" files don't need to even resemble valid pickles, you could just unpickle random crap you wouldn't expect to work. For example, mash the keyboard and try to unpickle the result.
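A quick sketch of the truncation idea (the data is just a placeholder; which exception you get depends on where the cut lands, hence the broad except):

import pickle
import random

blob = pickle.dumps({"key": list(range(100))})   # any valid pickle will do

corrupt_samples = []
for _ in range(10):
    cut = random.randrange(1, len(blob))
    sample = blob[:cut]
    try:
        pickle.loads(sample)
    except Exception:                    # typically EOFError or UnpicklingError
        corrupt_samples.append(sample)   # keep the ones that reliably fail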
Note that the docs say
The pickle module is not intended to be secure against erroneous or
maliciously constructed data. Never unpickle data received from an
untrusted or unauthenticated source.