TDD in Python for JSON file read

I am using Python to read a JSON file and convert each record into a class instance. I already have a sample JSON file, but for the Python code, I am trying to use test-driven development (TDD) methods for the first time. Here's what the data in my (sample) JSON file look like:
[{"Name": "Max", "Breed": "Poodle", "Color": "White", "Age": 8},
{"Name": "Jack", "Breed": "Corgi", "Color": "Black", "Age": 4},
{"Name": "Lucy", "Breed": "Labrador Retriever", "Color": "Brown", "Age": 2},
{"Name": "Bear", "Breed": "German Shepherd", "Color": "Brown", "Age": 6}]
I know I want to test for valid entries in each of the arguments for all the instances. For example, I want to check the breed against a tuple of acceptable breeds, and check that age is always given as an integer. Given my total lack of TDD experience, it's not clear to me if the code checking the objects resulting from the JSON import code is itself the test, or if I should be using one set of tests for the JSON import code and a separate set to test the instances generated by the import code.

Those are two separate concerns. Testing the JSON load is completely different from testing the loaded JSON data. Loading the data should not be complex (it's essentially json.loads), but if you do need to test it, keep the tests as minimal and fine-grained as possible. Your tests should not affect each other.
In general, your test cases should exercise very specific portions of your code; that is, each one should test a specific piece of functionality of your program. Because you mentioned validating the JSON data you load (breed, for instance), your program should also have that validation functionality. For that, you would have test cases like the ones below.
import doggie

def test_validate_breed():
    # Positive test case -- everything here should pass (assuming you give good data).
    # Your load_json routine itself could be a test, but generally this either works or
    # it raises a json exception... At any rate, load_json returns a list of dictionaries
    # like those you described above.
    l = load_json()
    for d in l:
        assert doggie.validate_breed(d)

    # Generate an invalid dictionary entry
    d = { "Name": "Some Name", "Breed": "Invalid!", ... }
    assert not doggie.validate_breed(d)

def test_validate_age():
    l = load_json()
    for d in l:
        assert doggie.validate_age(d)

    # Generate an invalid dictionary entry
    d = { "Name": "Some Name", ... , "Age": 1000 }
    assert not doggie.validate_age(d)
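For reference, here is a minimal sketch of what the validation functions in doggie might look like; the ACCEPTABLE_BREEDS tuple and the age range are assumptions for illustration, not code from the question:

# doggie.py -- hypothetical validation helpers used by the tests above
ACCEPTABLE_BREEDS = ("Poodle", "Corgi", "Labrador Retriever", "German Shepherd")

def validate_breed(d):
    # A record is valid only if its breed appears in the accepted tuple.
    return d.get("Breed") in ACCEPTABLE_BREEDS

def validate_age(d):
    # Age must be an integer (not a bool) in a plausible range.
    age = d.get("Age")
    return isinstance(age, int) and not isinstance(age, bool) and 0 <= age < 30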
The beauty of testing is that it exposes flaws in your design. It is very good at exposing unnecessary coupling.
I recommend you check out nose for unit testing. It makes running tests a cinch and provides nice utility functions that better describe test failures. For instance:
>>> import nose.tools as nt
>>> nt.assert_equal(4, 5)
Traceback (most recent call last):
  File "<input>", line 1, in <module>
  File "C:\MSFBPY27\lib\unittest\case.py", line 509, in assertEqual
    assertion_func(first, second, msg=msg)
  File "C:\MSFBPY27\lib\unittest\case.py", line 502, in _baseAssertEqual
    raise self.failureException(msg)
AssertionError: 4 != 5

Related

Mixing dicts with non-series error when attempting json to csv with pandas, but only when the json is retrieved from url

Python 3.8.5 with Pandas 1.1.3
This code, which converts a Python list of JSON dicts to CSV using pandas, works without issue:
import csv
import pandas as pd
import json
data = [{"results": [{"type": "ID", "value": "1234", "normalized": "1234", "count": 1, "offsets": [{"start": 14, "end": 25}], "id_b": "10"}, {"type": "ID", "value": "5678", "normalized": "5678", "count": 1, "offsets": [{"start": 32, "end": 43}], "id_b": "11"}], "responseHeaders": {"Date": "Tue, 25 May 2021 14:41:28 GMT", "Content-Type": "application/json", "Content-Length": "350", "Connection": "keep-alive", "Server": "openresty", "X-StuffAPI-ProcessedLanguage": "eng", "X-StuffAPI-Request-Id": "abcdef", "Strict-Transport-Security": "max-age=63072000; includeSubDomains; preload", "X-StuffAPI-App-Id": "123456789", "X-StuffAPI-Concurrency": "1"}}]
pd.read_json(json.dumps(data)).to_csv('file.csv')
The value in the data variable above is pasted directly from the response of an API call to one of our services. The problem occurs when I attempt to do everything including the API call in one script. Let's first look at everything in the script that seems to be working fine:
import csv
import pandas as pd
import json
import stuff.api

def run(key, url):
    # Create an API instance
    api = API(user_key=key, service_url=url)

    # submit data from a text file to the API parser
    file1 = open("123.txt", "r")
    text_data = file1.read()
    params = DocumentParameters()
    params["content"] = text_data
    file1.close()

    try:
        return api.data(params)
    except StuffAPIException as exception:
        print(exception)

if __name__ == '__main__':
    result = run('1234', 'https://192.168.0.125:8100/rest/')
    y = json.dumps(result)
    t = type(y)
    print(y)
    print(t)
The print(y) statement above outputs exactly the data shown in the data variable in the first code block. The print(t) statement was there to capture the return type and help me diagnose the issue; it prints <class 'str'>.
So now we add this right under the print(t) line (exactly as in the first code block):
pd.read_json(json.dumps(result)).to_csv('file.csv')
And I get this error:
ValueError: Mixing dicts with non-Series may lead to ambiguous ordering.
I have seen the many threads about this error, but none of them seem to pertain exactly to what's happening here.
With my limited experience thus far, I am guessing this issue may be due to the return type being string? I'm not sure, but this troubleshooting step is just the first hurdle to overcome - I need to eventually be able to parse the data into separate columns of the csv file, but for now, I just need to get it into the csv file without errors.
I understand you won't be able to fully reproduce this without access to my server, but hoping that's not needed to figure this out.
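As a side note on that hunch: this is not from the original post, but it is easy to see in isolation what json.dumps() does when handed something that is already a JSON string -- it encodes the string itself rather than the data:

import json

data = [{"results": []}]            # what the API conceptually returns
once = json.dumps(data)             # a JSON string: '[{"results": []}]'
twice = json.dumps(once)            # encodes the *string* itself

print(type(once), type(twice))      # <class 'str'> <class 'str'>
print(json.loads(twice) == data)    # False -- you get the string back, not the list
print(json.loads(twice) == once)    # True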

How do you test your ncurses app in Python?

We've built a CLI app with Python. Some parts need ncurses, so we use npyscreen. We've successfully tested most parts of the app using pytest (with the help of mock and other things), but we are stuck on how to test the ncurses part of the code.
Take this part of our ncurses code that prompts the user for answers:
"""
Generate text user interface:
example :
fields = [
{"type": "TitleText", "name": "Name", "key": "name"},
{"type": "TitlePassword", "name": "Password", "key": "password"},
{"type": "TitleSelectOne", "name": "Role",
"key": "role", "values": ["admin", "user"]},
]
form = form_generator("Form Foo", fields)
print(form["role"].value[0])
print(form["name"].value)
"""
def form_generator(form_title, fields):
def myFunction(*args):
form = npyscreen.Form(name=form_title)
result = {}
for field in fields:
t = field["type"]
k = field["key"]
del field["type"]
del field["key"]
result[k] = form.add(getattr(npyscreen, t), **field)
form.edit()
return result
return npyscreen.wrapper_basic(myFunction)
We have tried many ways, but all failed:
- StringIO to capture the output: failed
- redirecting the output to a file: failed
- hecate: failed (I think it only works if we run the whole program)
- pyautogui: I think it only works if we run the whole program
This is the complete list of steps I have tried.
So the last thing I did was use patch: I patch those functions. But the downside is that the statements inside those functions remain untested, because the test just asserts a hard-coded return value.
I found the npyscreen docs for writing tests, but I don't completely understand them; there is just one example.
Thank you in advance.
I don't see it mentioned in the python docs, but you can use the screen-dump feature of the curses library to capture information for analysis.
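The Python curses module does not appear to expose scr_dump() directly, so as a rough sketch of the same idea (not npyscreen-specific, and the addstr() call merely stands in for whatever the real form draws), a test could read the rendered screen back with window.instr() and assert on it:

import curses

def draw_and_capture(stdscr):
    # Stand-in for the real UI: draw something, then read the screen back.
    stdscr.addstr(0, 0, "Name: Max")
    stdscr.refresh()
    rows, cols = stdscr.getmaxyx()
    return [stdscr.instr(y, 0, cols).decode("ascii", "replace").rstrip()
            for y in range(rows)]

if __name__ == "__main__":
    lines = curses.wrapper(draw_and_capture)
    assert lines[0].startswith("Name: Max")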

NodeJS stdout python multiple prints?

I have a Python script that prints JSON and a plain string:
# script.py
import sys

print('{"name": "bob", "height": 4, "weight": 145}')
print('hello')
sys.stdout.flush()
A Node.js app calls the Python script via child_process, but I'm getting an error on the output. How can I process the Python output in Node.js?
// nodejs
var process = spawn('python3', ["./script.py", toSend]);
process.stdout.on('data', function(data) {
    message = JSON.parse(data)
    console.log(message)
})
I'm getting a SyntaxError: Unexpected token when running this.
In your Python script, this line:
print('{"name": "bob", "height": 4, "weight": 145}')
should be changed to:
import json
print(json.dumps({"name": "bob", "height": 4, "weight": 145}))
That makes sure the JSON is formatted correctly so the string can be parsed by Node (though your current version should be fine). However, in this case the real problem is what follows...
You are ending your script with
print('hello')
which means JSON.parse() ends up trying to parse hello along with the JSON, because both prints go to stdout, and hello is not JSON formatted. So JSON.parse() is going to fail. Remove that line as well.
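Putting both changes together, the corrected script would look something like this sketch (only the two fixes described above, nothing else changed):

# script.py -- emit a single JSON payload and nothing else
import json
import sys

payload = {"name": "bob", "height": 4, "weight": 145}
print(json.dumps(payload))
sys.stdout.flush()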
If you have more than one JSON object to send, as you stated in your comments, you can either combine all the data into a single JSON object, e.g. my_object = {"data": "info", ...}, and json.dumps() that single larger object, or:
obj1 = {}
obj2 = {}
myobjects = [obj1, obj2]
print(json.dumps(myobjects))
and the Node side will receive a list of objects that can be iterated over.

Do file size requirements change when importing a CSV file to MongoDB?

Background:
I'm attempting to follow a tutorial in which I'm importing a CSV file that's approximately 324MB
to MongoLab's sandbox plan (capped at 500MB), via pymongo in Python 3.4.
The file holds ~ 770,000 records, and after inserting ~ 164,000 I hit my quota and received:
raise OperationFailure(error.get("errmsg"), error.get("code"), error)
OperationFailure: quota exceeded
Question:
Would it be accurate to say the JSON-like structure of NoSQL takes more space to hold the same data as a CSV file? Or am I doing something screwy here?
Further information:
Here are the database metrics (screenshot not included here):
Here's the Python 3.4 code I used:
import sys
import pymongo
import csv

MONGODB_URI = '***credentials removed***'

def main(args):
    client = pymongo.MongoClient(MONGODB_URI)
    db = client.get_default_database()
    projects = db['projects']
    with open('opendata_projects.csv') as f:
        records = csv.DictReader(f)
        projects.insert(records)
    client.close()

if __name__ == '__main__':
    main(sys.argv[1:])
Yes, JSON takes up much more space than CSV. Here's an example:
name,age,job
Joe,35,manager
Fred,47,CEO
Bob,23,intern
Edgar,29,worker
Translated to JSON, it would be:
[
    {
        "name": "Joe",
        "age": 35,
        "job": "manager"
    },
    {
        "name": "Fred",
        "age": 47,
        "job": "CEO"
    },
    {
        "name": "Bob",
        "age": 23,
        "job": "intern"
    },
    {
        "name": "Edgar",
        "age": 29,
        "job": "worker"
    }
]
Even with all whitespace removed, the JSON is 158 characters, while the CSV is only 69 characters.
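You can verify those counts with a quick snippet (minifying the JSON via separators=(",", ":") to strip all whitespace):

import json

rows = [
    {"name": "Joe", "age": 35, "job": "manager"},
    {"name": "Fred", "age": 47, "job": "CEO"},
    {"name": "Bob", "age": 23, "job": "intern"},
    {"name": "Edgar", "age": 29, "job": "worker"},
]

csv_text = "name,age,job\n" + "\n".join(
    "{name},{age},{job}".format(**r) for r in rows)
json_text = json.dumps(rows, separators=(",", ":"))

print(len(csv_text))   # 69
print(len(json_text))  # 158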
Not accounting for things like compression, a set of json documents would take up more space than a csv, because the field names are repeated in each record, whereas in the csv the field names are only in the first row.
The way files are allocated is another factor:
In the filesize section of the Database Metrics screenshot you attached, notice that it says that the first file allocated is 16MB, then the next one is 32MB, and so on. So when your data grew past 240MB total, you had 5 files, of 16MB, 32MB, 64MB, 128MB, and 256MB. This explains why your filesize total is 496MB, even though your data size is only about 317MB. The next file that would be allocated would be 512MB, which would put you way past the 500MB limit.
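Spelling out the arithmetic from that paragraph (the doubling sequence is taken from the description above; whether a given deployment preallocates exactly this way depends on the storage engine):

# Data files double in size: 16MB, 32MB, 64MB, 128MB, 256MB, then 512MB...
files = [16 * 2 ** i for i in range(5)]
print(files)                      # [16, 32, 64, 128, 256]
print(sum(files))                 # 496 -- the 496MB filesize total
print(sum(files) + 16 * 2 ** 5)   # 1008 -- the next 512MB file would blow past 500MB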

Python - how to avoid exec for batching?

I have an existing Python application (limited deployment) that requires the ability to run batches/macros (i.e. do foo 3 times, change x, do y). Currently I have this implemented as exec running through a text file which contains simple Python code to do all the required batching.
However, exec is messy (security issues, for one), and there are also some cases where it doesn't act exactly the same as having the same code in your file. How can I get around using exec? I don't want to write my own mini-macro language, and users need to use multiple different macros per session, so I can't set it up such that the macro is a Python file that calls the software and then runs itself, or something similar.
Is there a cleaner/better way to do this?
Pseudocode: in the software it has something like:
- when a macro gets called:
    for line in macrofile:
        exec line
and the macrofiles are Python, i.e. something like:
property_of_software_obj = "some str"
software_function(some args)
etc.
Have you considered using a serialized data format like JSON? It's lightweight, can easily translate to Python dictionaries, and all the cool kids are using it.
You could construct the data in a way that is meaningful, but doesn't require containing actual code. You could then read in that construct, grab the parts you want, and then pass it to a function or class.
Edit: Added a pass at a cheesy example of a possible JSON spec.
Your JSON:
{
    "macros": [
        {
            "function": "foo_func",
            "args": {
                "x": "y",
                "bar": null
            },
            "name": "foo",
            "iterations": 3
        },
        {
            "function": "bar_func",
            "args": {
                "x": "y",
                "bar": null
            },
            "name": "bar",
            "iterations": 1
        }
    ]
}
Then you parse it with Python's json lib:
import json

# Get JSON data from elsewhere and parse it
macros = json.loads(json_data)

# Do something with each macro in the list
for macro in macros["macros"]:
    run_macro(macro)  # For example
And the resulting Python data is almost identical syntactically to JSON aside from some of the keywords like True, False, None (true, false, null in JSON).
{
    'macros': [
        {
            'args': {
                'bar': None,
                'x': 'y'
            },
            'function': 'foo_func',
            'iterations': 3,
            'name': 'foo'
        },
        {
            'args': {
                'bar': None,
                'x': 'y'
            },
            'function': 'bar_func',
            'iterations': 1,
            'name': 'bar'
        }
    ]
}
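And a rough sketch of what run_macro could look like under this spec, assuming a dispatch table that maps the "function" names in the JSON to real callables (foo_func and bar_func below are placeholders, not functions from the original application):

import json

def foo_func(x=None, bar=None):
    print("foo_func called with x=%r, bar=%r" % (x, bar))

def bar_func(x=None, bar=None):
    print("bar_func called with x=%r, bar=%r" % (x, bar))

# Whitelist of callables a macro is allowed to invoke -- no exec involved.
DISPATCH = {"foo_func": foo_func, "bar_func": bar_func}

def run_macro(macro):
    func = DISPATCH[macro["function"]]            # KeyError for unknown names
    for _ in range(macro.get("iterations", 1)):   # repeat as requested
        func(**macro["args"])

# Usage with a spec like the one above:
json_data = '{"macros": [{"function": "foo_func", "args": {"x": "y", "bar": null}, "name": "foo", "iterations": 3}]}'
for macro in json.loads(json_data)["macros"]:
    run_macro(macro)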
