Recently, I started working with JSON (with Python 3.7 under Debian 9). This is the first (of probably many) JSON data sets I've had the pleasure of working with.
I have used Python's built-in json module to interpret arbitrary strings and files. I now have a database with ~5570 rows of information about a given list of servers. There are a lot of things in the pipeline, which I have devised a plan for, but I'm stuck on this particular sanitation step.
Here's the code I'm using to parse:
#!/usr/local/bin/python3.7
import json

def servers_from_json(file_name):
    with open(file_name, 'r') as f:
        data = json.loads(f.read())
    servers = [{'asn': item['data']['resource'], 'resource': item['data']['allocations'][0]['asn_name']} for item in data]
    return servers

servers = servers_from_json('working-things/working-format-for-parse')
print(servers)
My motive
I'm trying to match each one of these servers to its asn_name (a field ripped straight from RIPE's API), which gives me information about the physical DC each server is located in. Then, once that's done, I'll write them to an existing SQL table, next to a Boolean.
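For what it's worth, the eventual SQL step might look something like the minimal sketch below; the servers.db file and the servers table with its asn, asn_name and is_active columns are placeholder names, assuming SQLite:
import sqlite3

# placeholder database/table/column names; the real schema lives in the existing SQL table
conn = sqlite3.connect('servers.db')
with conn:  # commits on success
    conn.executemany(
        'INSERT INTO servers (asn, asn_name, is_active) VALUES (?, ?, ?)',
        [(s['asn'], s['resource'], True) for s in servers]
    )
conn.close()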
So, here's where it gets funky. If I run the whole dataset through this I get this error message:
Traceback (most recent call last):
File "./parse-test.py", line 12, in <module>
servers = servers_from_json('2servers.json')
File "./parse-test.py", line 7, in servers_from_json
data = json.loads(f.read())
File "/usr/local/lib/python3.7/json/__init__.py", line 348, in loads
return _default_decoder.decode(s)
File "/usr/local/lib/python3.7/json/decoder.py", line 340, in decode
raise JSONDecodeError("Extra data", s, end)
json.decoder.JSONDecodeError: Extra data: line 38 column 2 (char 1098)
I noticed that the problem with my initial data set was that each JSON object wasn't delimited by ,\n.
Did some cleaning, still no luck.
I then added the first 3(?) objects to a completely clean file and.. success. I can get the script to read and interpret them the way I want.
Here's the data set with the comma delimiter:
http://db.farnworth.site/servers.json
Here's the working data set:
http://db.farnworth.site/working-format.json
Anyone got any ideas?
I am assuming here that | will not be present as part of the data. The idea is to separate the chunks of information with |, split the result into a list, and load each list item with the json module. Hope it helps!
You can try:
import json
import re

with open("servers.json", 'r') as f:
    data = f.read()

# put a | between back-to-back objects (allowing whitespace/newlines between them),
# then split on it so each list item is a single JSON object
pattern = re.compile(r'\}\s*\{')
data = pattern.sub('}|{', data).split('|')

for item in data:
    server_info = json.loads(item)
    allocations = server_info['data']['allocations']
    for alloc in allocations:
        print(alloc['asn_name'])
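If you'd rather not assume | never appears in the data, json.JSONDecoder.raw_decode can walk through back-to-back objects directly; a minimal sketch against the same servers.json and asn_name fields as above:
import json

def iter_concatenated_json(text):
    # yield each top-level JSON object from a string of back-to-back objects
    decoder = json.JSONDecoder()
    idx = 0
    while idx < len(text):
        while idx < len(text) and text[idx].isspace():
            idx += 1  # skip whitespace between objects
        if idx >= len(text):
            break
        obj, idx = decoder.raw_decode(text, idx)
        yield obj

with open("servers.json") as f:
    for server_info in iter_concatenated_json(f.read()):
        for alloc in server_info['data']['allocations']:
            print(alloc['asn_name'])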
I could read the output.json like this
import json

with open("output.json", 'r') as f:
    data = f.read()

server_info = json.loads(data)
for item in server_info:
    allocations = item['data']['allocations']
    for alloc in allocations:
        print(alloc['asn_name'])
Related
I'd like to read in SwissProt data and get records in chunks instead of reading in the entire file.
So far, I've split the file into chunks as seen below
from io import StringIO
sprot_io = StringIO()
footer = b'CC ---------------------------------------------------------------------------\n'
with gzip.open(response['Body'], "r") as f:
for row in f:
if row == footer:
sprot_io.write(row.decode('utf-8'))
<now parse the record>
sprot_io = list()
else:
sprot_io.write(row.decode('utf-8'))
However, when I try to parse these chunks using Bio.SwissProt.parse, I get an unexpected end of file error
from Bio import SeqIO

def parse_record(file):
    seq = next(SeqIO.parse(file, format='swiss'))
    return seq
I use next because the function is actually returning a generator, but I should only be getting one record anyway.
I'm assuming there is something wrong with the format I'm giving to it, but I haven't been able to figure out what could be going wrong from looking at the base code
https://github.com/biopython/biopython/blob/master/Bio/SwissProt/KeyWList.py
This is the file I'm trying to parse, but warning... it is roughly 3 gigs
ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_*.dat.gz
Any help would be greatly appreciated.
I overcomplicated it by a lot. I ended up using smart-open to stream the data from S3, and then I passed that handle to the parser, which takes a <class '_io.BytesIOWrapper'> just fine.
import boto3
from Bio import SeqIO
from smart_open import open

session = boto3.Session()
tp = {'client': session.client('s3')}

# url is the s3:// URI of the .dat.gz file
handle = open(url, 'rt', transport_params=tp)
seqs = SeqIO.parse(handle, format='swiss')
record = next(seqs)
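If more than the first record is needed, the same handle can be iterated lazily; a small sketch using the standard SeqRecord attributes:
for record in seqs:
    # each item is a Bio.SeqRecord.SeqRecord parsed from the stream
    print(record.id, len(record.seq))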
I have a .txt file with 70k+ JSON objects obtained by pulling data from Twitter and dumping it into the file using:
with open("followers.txt", 'a') as f:
for follower in limit_handled(tweepy.Cursor(api.followers, screen_name=account_name).pages()):
for user_obj in follower:
json.dump(user_obj._json, f)
f.write("\n")
When I try to read this in python using the code below:
import json
with open('followers.txt') as json_data:
    follower_data = json.load(json_data)
I get error:
ValueError: Extra data: line 2 column 1 - line 2801 column 1 (char 1489 - 8679498)
It worked when I read a test file with one JSON object copied from the original file, using the same code above. Once I add a second JSON object to that file, the same code gives the error:
ValueError: Extra data: line 2 column 1 - line 2 column 2376 (char 1489 - 3864)
How do I read the file with more than one json object?
The issue arises when you write your JSON: you must write a single JSON object if you want to load it back as a single JSON object. Currently, you are writing multiple separate objects, which causes the error.
Modify your write code a bit:
json_data = []
with open("followers.txt", 'a') as f:
    for follower in limit_handled(tweepy.Cursor(api.followers, screen_name=account_name).pages()):
        for user_obj in follower:
            json_data.append(user_obj._json)
    # outside the loops
    json.dump(json_data, f)
Now, when reading, your existing code should work. You'll get a list of dictionaries.
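Reading it back is then a single call; a minimal sketch of the read side:
import json

with open("followers.txt") as f:
    follower_data = json.load(f)  # a list of dicts, one per follower
print(len(follower_data))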
Of course it is best to treat the problem from the root: write a single JSON object and read it back, as suggested by COLDSPEED.
However, if you already wrote multiple json objects into a single file, you can try the following code to use the already created file:
import json

follower_data = []  # a list of all objects
with open('followers.txt') as json_data:
    for line in json_data:
        follower_data.append(json.loads(line))
Assuming you did not indent your JSON objects when you wrote them to 'followers.txt', each line in the file is a JSON object that can be parsed independently.
I've been struggling with this simple problem for too long, so I thought I'd ask for help. I am trying to read a list of journal articles from the National Library of Medicine FTP site into Python 3.3.2 (on Windows 7). The journal articles are in a .csv file.
I have tried the following code:
import csv
import urllib.request
url = "ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/file_list.csv"
ftpstream = urllib.request.urlopen(url)
csvfile = csv.reader(ftpstream)
data = [row for row in csvfile]
It results in the following error:
Traceback (most recent call last):
File "<pyshell#4>", line 1, in <module>
data = [row for row in csvfile]
File "<pyshell#4>", line 1, in <listcomp>
data = [row for row in csvfile]
_csv.Error: iterator should return strings, not bytes (did you open the file in text mode?)
I presume I should be working with strings not bytes? Any help with the simple problem, and an explanation as to what is going wrong would be greatly appreciated.
The problem lies in urllib returning bytes. As proof, you can try downloading the csv file with your browser and opening it as a regular file; the problem is gone.
A similar problem was addressed here.
It can be solved by decoding the bytes to strings with the appropriate encoding. For example:
import csv
import urllib.request
url = "ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/file_list.csv"
ftpstream = urllib.request.urlopen(url)
csvfile = csv.reader(ftpstream.read().decode('utf-8').splitlines())  # decode with the appropriate encoding; splitlines() so the reader gets lines, not single characters
data = [row for row in csvfile]
The last line could also be: data = list(csvfile) which can be easier to read.
By the way, since the csv file is very big, this can be slow and memory-consuming. Maybe it would be preferable to use a generator.
EDIT:
Using codecs, as proposed by Steven Rumbalski, it's not necessary to read and decode the whole file at once. Memory consumption is reduced and speed increased.
import csv
import urllib.request
import codecs
url = "ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/file_list.csv"
ftpstream = urllib.request.urlopen(url)
csvfile = csv.reader(codecs.iterdecode(ftpstream, 'utf-8'))
for line in csvfile:
    print(line)  # do something with line
Note that the list is not created either for the same reason.
Even though there is already an accepted answer, I thought I'd add to the body of knowledge by showing how I achieved something similar using the requests package (which is sometimes seen as an alternative to urllib.request).
The basis of using codecs.iterdecode() to solve the original problem is still the same as in the accepted answer.
import codecs
from contextlib import closing
import csv
import requests

url = "ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/file_list.csv"

# note: requests has no ftp:// adapter of its own, so in practice this streaming
# pattern is used with http(s) URLs
with closing(requests.get(url, stream=True)) as r:
    reader = csv.reader(codecs.iterdecode(r.iter_lines(), 'utf-8'))
    for row in reader:
        print(row)
Here we also see the use of streaming provided through the requests package in order to avoid having to load the entire file over the network into memory first (which could take a long time if the file is large).
I thought it might be useful since it helped me, as I was using requests rather than urllib.request in Python 3.6.
Some of the ideas (e.g. using closing()) were picked up from this similar post
I had a similar problem using requests package and csv.
The response from the POST request was of type bytes.
In order to use the csv library, I first stored the content as an in-memory string file (in my case the size was small), decoded as utf-8.
import io
import csv
import requests
response = requests.post(url, data)
# response.content is something like:
# b'"City","Awb","Total"\r\n"Bucuresti","6733338850003","32.57"\r\n'
csv_bytes = response.content
# write in-memory string file from bytes, decoded (utf-8)
str_file = io.StringIO(csv_bytes.decode('utf-8'), newline='\n')
reader = csv.reader(str_file)
for row_list in reader:
    print(row_list)
# Once the file is closed,
# any operation on the file (e.g. reading or writing) will raise a ValueError
str_file.close()
Printed something like:
['City', 'Awb', 'Total']
['Bucuresti', '6733338850003', '32.57']
urlopen will return a urllib.response.addinfourl instance for an ftp request.
For ftp, file, and data urls and requests explicitly handled by legacy
URLopener and FancyURLopener classes, this function returns a
urllib.response.addinfourl object which can work as context manager...
>>> urllib2.urlopen(url)
<addinfourl at 48868168L whose fp = <addclosehook at 48777416L whose fp = <socket._fileobject object at 0x0000000002E52B88>>>
At this point, ftpstream is a file-like object; using .read() would return the contents, but csv.reader requires an iterable in this case:
Defining a generator like so:
def to_lines(f):
    line = f.readline()
    while line:
        yield line
        line = f.readline()
We can create our csv reader like so:
reader = csv.reader(to_lines(ftpstream))
And with a url
url = "http://pic.dhe.ibm.com/infocenter/tivihelp/v41r1/topic/com.ibm.ismsaas.doc/reference/CIsImportMinimumSample.csv"
The code:
for row in reader: print row
Prints
>>>
['simpleci']
['SCI.APPSERVER']
['SRM_SaaS_ES', 'MXCIImport', 'AddChange', 'EN']
['CI_CINUM']
['unique_identifier1']
['unique_identifier2']
Here is my json file format,
[{
"name": "",
"official_name_en": "Channel Islands",
"official_name_fr": "Îles Anglo-Normandes",
}, and so on......
While loading the above JSON from a file, I get this error:
json.decoder.JSONDecodeError: Expecting property name enclosed in double quotes:
Here is my Python code:
import json
data = []
with open('file') as f:
    for line in f:
        data.append(json.loads(line))
A trailing ,} is not allowed in JSON (I guess that's the problem, judging from the data given).
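If regenerating the file isn't an option, one workaround is to strip those trailing commas before parsing; a rough sketch (the regex is naive and would also touch a matching ,} inside a string value, so treat it as a quick fix rather than a general JSON repair):
import json
import re

with open('file') as f:
    raw = f.read()

# drop a comma that sits directly before a closing brace or bracket
cleaned = re.sub(r',\s*([}\]])', r'\1', raw)
data = json.loads(cleaned)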
You appear to be processing the entire file one line at a time. Why not simply use .read() to get the entire contents at once, then feed that to json?
with open('file') as f:
    contents = f.read()

data = json.loads(contents)
Better yet, why not use json.load() to pass the readable directly and let it handle the slurping?
with open('file') as f:
    data = json.load(f)
The problem is in your reading and decoding the file line by line. Any single line in your file (e.g., "[{") is not a valid JSON expression.
Your individual lines are not valid JSON. For instance, the first line '[{' by itself is not valid JSON. If your entire file is actually valid JSON and you want the individual items, first load the entire JSON document and then iterate through the resulting Python object.
import json
data = json.loads(open('file').read()) # this should be a list
for list_item in data:
    print(list_item['name'])
I'm trying to inspect my appengine backup files to work out when a data corruption occurred. I used gsutil to locate and download the file:
gsutil ls -l gs://my_backup/ > my_backup.txt
gsutil cp gs://my_backup/LongAlphaString.Mymodel.backup_info file://1.backup_info
I then created a small python program, attempting to read the file and parse it using the appengine libraries.
#!/usr/bin/python
APPENGINE_PATH='/Applications/GoogleAppEngineLauncher.app/Contents/Resources/GoogleAppEngine-default.bundle/Contents/Resources/google_appengine/'
ADDITIONAL_LIBS = [
    'lib/yaml/lib'
]

import sys
sys.path.append(APPENGINE_PATH)
for l in ADDITIONAL_LIBS:
    sys.path.append(APPENGINE_PATH+l)

import logging
from google.appengine.api import datastore
from google.appengine.api.files import records
import cStringIO

def parse_backup_info_file(content):
    """Returns entities iterator from a backup_info file content."""
    reader = records.RecordsReader(cStringIO.StringIO(content))
    version = reader.read()
    if version != '1':
        raise IOError('Unsupported version')
    return (datastore.Entity.FromPb(record) for record in reader)

INPUT_FILE_NAME='1.backup_info'
f=open(INPUT_FILE_NAME, 'rb')
f.seek(0)
content=f.read()
records = parse_backup_info_file(content)
for r in records:
    logging.info(r)
f.close()
The code for parse_backup_info_file was copied from
backup_handler.py
When I run the program, I get the following output:
./view_record.py
Traceback (most recent call last):
File "./view_record.py", line 30, in <module>
records = parse_backup_info_file(content)
File "./view_record.py", line 19, in parse_backup_info_file
version = reader.read()
File "/Applications/GoogleAppEngineLauncher.app/Contents/Resources/GoogleAppEngine-default.bundle/Contents/Resources/google_appengine/google/appengine/api/files/records.py", line 335, in read
(chunk, record_type) = self.__try_read_record()
File "/Applications/GoogleAppEngineLauncher.app/Contents/Resources/GoogleAppEngine-default.bundle/Contents/Resources/google_appengine/google/appengine/api/files/records.py", line 307, in __try_read_record
(length, len(data)))
EOFError: Not enough data read. Expected: 24898 but got 2112
I've tried with half a dozen different backup_info files, and they all show the same error (with different numbers).
I initially noticed that they all had the same expected length, but that was only because I was reviewing different versions of the same model when I made the observation; it's not true when I view the backup files of other modules.
EOFError: Not enough data read. Expected: 24932 but got 911
EOFError: Not enough data read. Expected: 25409 but got 2220
Is there anything obviously wrong with my approach?
I guess the other option is that the appengine backup utility is not creating valid backup files.
Anything else you can suggest would be very welcome.
Thanks in Advance
There are multiple metadata files created when an AppEngine Datastore backup is run:
LongAlphaString.backup_info is created once. This contains metadata about all of the entity types and backup files that were created in datastore backup.
LongAlphaString.[EntityType].backup_info is created once per entity type. This contains metadata about the specific backup files created for [EntityType] along with schema information for the [EntityType].
Your code works for interrogating the file contents of LongAlphaString.backup_info; however, it seems that you are trying to interrogate the file contents of LongAlphaString.[EntityType].backup_info. Here's a script that will print the contents in a human-readable format for each file type:
import cStringIO
import os
import sys

sys.path.append('/usr/local/google_appengine')

from google.appengine.api import datastore
from google.appengine.api.files import records
from google.appengine.ext.datastore_admin import backup_pb2

ALL_BACKUP_INFO = 'long_string.backup_info'
ENTITY_KINDS = ['long_string.entity_kind.backup_info']

def parse_backup_info_file(content):
    """Returns entities iterator from a backup_info file content."""
    reader = records.RecordsReader(cStringIO.StringIO(content))
    version = reader.read()
    if version != '1':
        raise IOError('Unsupported version')
    return (datastore.Entity.FromPb(record) for record in reader)

print "*****" + ALL_BACKUP_INFO + "*****"
with open(ALL_BACKUP_INFO, 'r') as myfile:
    parsed = parse_backup_info_file(myfile.read())
    for record in parsed:
        print record

for entity_kind in ENTITY_KINDS:
    print os.linesep + "*****" + entity_kind + "*****"
    with open(entity_kind, 'r') as myfile:
        backup = backup_pb2.Backup()
        backup.ParseFromString(myfile.read())
        print backup