Appending to an existing Avro file in Python - python

I'm exploring the Avro file format and am currently struggling to append data; I seem to overwrite the file on each run. I found an existing thread here saying I should not pass in a schema in order to "append" to an existing file without overwriting it. Even my linter gives this clue: "If the schema is not present, presume we're appending." However, if I try to declare the writer as DataFileWriter(open("users.avro", "wb"), DatumWriter(), None), the code won't run.
Simply put, how do I append values to an existing Avro file without overwriting the existing content?
import avro.schema
from avro.datafile import DataFileReader, DataFileWriter
from avro.io import DatumReader, DatumWriter

schema = avro.schema.parse(open("user.avsc", "rb").read())

# "wb" truncates the file, so every run starts from scratch
writer = DataFileWriter(open("users.avro", "wb"), DatumWriter(), schema)
print("start appending")
writer.append({"name": "Alyssa", "favorite_number": 256})
writer.append({"name": "Ben", "favorite_number": 12, "favorite_color": "blue"})
writer.close()
print("write successful!")

# Read data back from the avro file
with open("users.avro", "rb") as f:
    reader = DataFileReader(f, DatumReader())
    users = [user for user in reader]
    reader.close()
print(f"Schema: {schema}")
print(f"Users:\n {users}")

I'm not sure how to do it with the standard avro library, but if you use fastavro it can be done. See the example below:
from fastavro import parse_schema, writer, reader

schema = {
    "namespace": "example.avro",
    "type": "record",
    "name": "User",
    "fields": [
        {"name": "name", "type": "string"},
        {"name": "favorite_number", "type": ["int", "null"]},
        {"name": "favorite_color", "type": ["string", "null"]}
    ]
}
parsed_schema = parse_schema(schema)

records = [
    {"name": "Alyssa", "favorite_number": 256},
    {"name": "Ben", "favorite_number": 12, "favorite_color": "blue"},
]

# Write the initial 2 records
with open("users.avro", "wb") as fp:
    writer(fp, parsed_schema, records)

# Append a third record (binary append mode)
with open("users.avro", "a+b") as fp:
    writer(fp, parsed_schema, [{"name": "Chris", "favorite_number": 1}])

# Read all records back
with open("users.avro", "rb") as fp:
    for record in reader(fp):
        print(record)

The advice to skip the schema is correct, but only once the Avro file has been set up with the correct schema.
The actual mistake in your code is opening the file in wb mode instead of ab: wb truncates the file on every run. Passing None, or no schema argument at all, to DataFileWriter should not matter, and the code should run either way.
The reproducible code below initializes the file with the correct schema. For this first step it does not matter whether you use ab or wb mode; just write an empty file with a schema and close it:
writer = DataFileWriter(open("reproducible.avro", "ab+"), DatumWriter(), schema)
writer.close()
Now, to write the actual records in append mode (so, no re-reading of the file!), you can skip the schema while opening in ab mode:
for i in range(3):
    writer = DataFileWriter(open("reproducible.avro", "ab+"), DatumWriter())
    writer.append(db_entry)
    writer.close()
Finally, read the entire file:
reader = DataFileReader(open("reproducible.avro", "rb"), DatumReader())
for data in reader:
    print(data)
reader.close()
Works for me on Windows, with Python 3.9.13 and avro library 1.11.1.
For a fully reproducible example, begin with:
import avro.schema
from avro.datafile import DataFileReader, DataFileWriter
from avro.io import DatumReader, DatumWriter
import json

schema = {
    "type": "record",
    "name": "recordName",
    "fields": [
        {
            "name": "id",
            "type": "string"
        }
    ]
}
schema = avro.schema.parse(json.dumps(schema))

db_entry = {
    "id": "random_id"
}
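As a quick sanity check (my addition, not part of the original answer): if appending really works, re-running the append loop should make the record count grow rather than reset.

from avro.datafile import DataFileReader
from avro.io import DatumReader

# Count the records currently in the file; DataFileReader is iterable
reader = DataFileReader(open("reproducible.avro", "rb"), DatumReader())
count = sum(1 for _ in reader)
reader.close()
print(f"{count} records in file")  # should grow by 3 on each run of the append loop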

Related

How can I convert CSV to JSON the way I want

Hello, here is my problem: I wrote the code below to convert my CSV to JSON, but the result is not exactly what I want.
main.py
import csv

filename = "forcebrute.csv"

# opening the file using the "with" statement
with open(filename, 'r') as data:
    for line in csv.DictReader(data):
        print(line)
forcebrute.csv:
name;price;profit
Action-1;20;5
Action-2;30;10
Action-3;50;15
Action-4;70;20
Action-5;60;17
The result I get:
{'name;price;profit': 'Action-1;20;5'}
{'name;price;profit': 'Action-2;30;10'}
{'name;price;profit': 'Action-3;50;15'}
{'name;price;profit': 'Action-4;70;20'}
{'name;price;profit': 'Action-5;60;17'}
And I would like this result:
You will need to specify the column delimiter; then you can use json.dumps() to produce the required output format:
import csv
import json

with open('forcebrute.csv') as data:
    print(json.dumps([d for d in csv.DictReader(data, delimiter=';')], indent=2))
Output:
[
  {
    "name": "Action-1",
    "price": "20",
    "profit": "5"
  },
  {
    "name": "Action-2",
    "price": "30",
    "profit": "10"
  },
  {
    "name": "Action-3",
    "price": "50",
    "profit": "15"
  },
  {
    "name": "Action-4",
    "price": "70",
    "profit": "20"
  },
  {
    "name": "Action-5",
    "price": "60",
    "profit": "17"
  }
]
You will need to use DictReader from the csv library to read the contents of the CSV file, convert the contents to a list, and then use json.dumps to turn the data into JSON.
import csv
import json

filename = "forcebrute.csv"

# Open the CSV file and read the contents into a list of dictionaries
with open(filename, 'r') as f:
    reader = csv.DictReader(f, delimiter=';')
    csv_data = list(reader)

# Convert the data to a JSON string and print it to the console
json_data = json.dumps(csv_data)
print(json_data)
An easy approach would be to use pandas, which is also quite fast with large CSV files. It might need some tweaking, but you get the point:
import pandas as pd
import json

filename = "forcebrute.csv"
df = pd.read_csv(filename, sep=';')
data = json.dumps(df.to_dict('records'))
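As a side note (my addition, not from the original answer), pandas can also serialize straight to a file without going through json.dumps, assuming a reasonably recent pandas; the output path "data.json" is just an example:

import pandas as pd

df = pd.read_csv("forcebrute.csv", sep=";")
# orient="records" yields the same list-of-objects shape as the answers above
df.to_json("data.json", orient="records", indent=2)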

Is there a way to return or get an output of JSON contents without using the print statement in Python?

Is there a way to return or output JSON contents to the terminal, or somehow pass them along, without using the print statement?
I'm writing a Python script to create a custom sensor in PRTG. The main goal of the script is to retrieve JSON data from Rubrik's API, extract specific key values from the JSON, and write them to a new JSON file. The contents of the second JSON file then need to be output somehow, but PRTG requires that the script not use any print statements, as they may corrupt the JSON.
I tried various different methods but did not find any success.
This is the code that I'm using:
import rubrik_cdm
import urllib3
import json
import sys
from datetime import datetime, timedelta

# Disable warnings
urllib3.disable_warnings()

# Authenticate by providing the node_ip and api_token
NODE_IP = ""
API_TOKEN = ""

# Establish a connection to Rubrik
rubrik = rubrik_cdm.Connect(
    node_ip=NODE_IP,
    api_token=API_TOKEN)

# Get Rubrik's failed archives from the past 24 hours and write them to a JSON file
def get_rubrik_failed_archives():
    current_date = datetime.today() - timedelta(days=1)  # <-- Get the datetime
    datetime_to_string = current_date.strftime("%Y-%m-%dT%H:%M:%S")
    get_failed_archives = rubrik.get("v1", f"/event/latest?event_status=Failure&event_type=Archive&before_date={datetime_to_string}&limit=1000")
    with open("get_failed_archives.json", "w") as file:
        json.dump(get_failed_archives, file, indent=3)

# Write specific contents of the first JSON file to a new JSON file
def get_rubrik_failed_archives_main():
    with open("get_failed_archives.json") as json_file:
        json_data = json.load(json_file)
    failed_archives_data = []
    for archive_data in json_data["data"]:
        failed_archives_data.append({
            "objectName": archive_data["latestEvent"]["objectName"],
            "time": archive_data["latestEvent"]["time"],
            "eventType": archive_data["latestEvent"]["eventType"],
            "eventStatus": archive_data["latestEvent"]["eventStatus"],
        })
    with open("rubrik_failed_archives.json", "w") as file:
        json.dump(failed_archives_data, file, indent=4, sort_keys=True)
    return failed_archives_data

get_rubrik_failed_archives()
get_rubrik_failed_archives_main()
JSON file contents:
[
    {
        "eventStatus": "Failure",
        "eventType": "Archive",
        "objectName": "W12 BO Template",
        "time": "2022-08-23T10:09:33.092Z"
    },
    {
        "eventStatus": "Failure",
        "eventType": "Archive",
        "objectName": "W12 BO Template",
        "time": "2022-08-23T09:06:33.786Z"
    },
    {
        "eventStatus": "Failure",
        "eventType": "Archive",
        "objectName": "W12 BO Template",
        "time": "2022-08-23T08:03:35.118Z"
    },
    {
        "eventStatus": "Failure",
        "eventType": "Archive",
        "objectName": "W12 BO Template",
        "time": "2022-08-23T07:00:32.683Z"
    }
]
So, is there a way to return or get an output of JSON contents without using the print statement in Python?
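One hedged sketch, since no accepted approach is shown here (the stdout behavior is my assumption about what PRTG tolerates): get_rubrik_failed_archives_main() already returns the list, so you can serialize it exactly once and write it to sys.stdout yourself instead of scattering print calls:

import json
import sys

def emit_json(data):
    # Hypothetical helper: serialize once and write straight to stdout,
    # so nothing but the single JSON document is emitted
    sys.stdout.write(json.dumps(data, indent=4, sort_keys=True))

emit_json(get_rubrik_failed_archives_main())

Alternatively, simply return the data structure from the script's entry point and let whatever invokes the script decide how to serialize it.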

Separate large JSON object into many different files

I have a JSON file containing 10,000 data entries like the ones below.
{
    "1": {
        "name": "0",
        "description": "",
        "image": ""
    },
    "2": {
        "name": "1",
        "description": "",
        "image": ""
    },
    ...
}
I need to write each entry in this object into its own file.
For example, the output of each file looks like this:
1.json
{
    "name": "",
    "description": "",
    "image": ""
}
I have the following code, but I'm not sure how to proceed from here. Can anyone help with this?
import json

with open('sample.json', 'r') as openfile:
    # Reading from json file
    json_object = json.load(openfile)
You can use a for loop to iterate over all the fields in the outer object, and then create a new file for each inner object:
import json

with open('sample.json', 'r') as input_file:
    json_object = json.load(input_file)

for key, value in json_object.items():
    with open(f'{key}.json', 'w') as output_file:
        json.dump(value, output_file)
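With 10,000 entries this will flood the working directory with files; one small variation (my own; the directory name "entries" is an arbitrary example) writes them into a subfolder instead:

import json
import os

os.makedirs('entries', exist_ok=True)  # create the output folder once

with open('sample.json', 'r') as input_file:
    json_object = json.load(input_file)

for key, value in json_object.items():
    # e.g. entries/1.json, entries/2.json, ...
    with open(os.path.join('entries', f'{key}.json'), 'w') as output_file:
        json.dump(value, output_file)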

Convert CSV into JSON in Python. Format problem

I have written Python code to convert a CSV file into a JSON file, but the output is not what I expected. Please take a look and suggest modifications.
Below is the expected JSON file:
[
    {
        "id": "1",
        "MobileNo": "923002546363"
    },
    {
        "id": "2",
        "MobileNo": "923343676143"
    }
]
Below is the code that I have written in Python:
import csv, json

def csv_to_json(csvFilePath, jsonFilePath):
    jsonArray = []

    # read csv file
    with open(csvFilePath, encoding='utf-8') as csvf:
        # load csv file data using csv library's dictionary reader
        csvReader = csv.DictReader(csvf)
        # convert each csv row into python dict
        for row in csvReader:
            # add this python dict to json array
            jsonArray.append(row)

    # convert python jsonArray to JSON string and write to file
    with open(jsonFilePath, 'w', encoding='utf-8') as jsonf:
        jsonString = json.dumps(jsonArray, indent=4)
        jsonf.write(jsonString)

csvFilePath = r'my_csv_data.csv'
jsonFilePath = r'data.json'
csv_to_json(csvFilePath, jsonFilePath)
As your post doesn't show the current output, I created a CSV file to run your code:
id,MobileNo
1,923002546363
2,923343676143
3,214134367614
And it works just fine:
[
    {
        "id": "1",
        "MobileNo": "923002546363"
    },
    {
        "id": "2",
        "MobileNo": "923343676143"
    },
    {
        "id": "3",
        "MobileNo": "214134367614"
    }
]
Check that your CSV file isn't corrupted. And if possible, edit your post with the current output and your CSV file.
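One common cause worth ruling out (my guess, not something the poster confirmed): if the CSV is semicolon-separated, as in the earlier question above, DictReader needs an explicit delimiter, or every row collapses into a single column:

import csv

# Hypothetical: only needed if my_csv_data.csv uses ';' as its separator
with open('my_csv_data.csv', encoding='utf-8') as csvf:
    for row in csv.DictReader(csvf, delimiter=';'):
        print(row)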

JSON not getting saved correctly

So I'm pulling data from an API and want to save only specific dicts and lists from the JSON response. The problem is that when I dump the data inside the loop, it creates very weird-looking data in the file, which isn't actually valid JSON.
import json
import requests

r = requests.get(url, headers=header)
result = r.json()

with open('myfile.json', 'a+') as file:
    for log in result['logs']:
        hello = json.dump(log['log']['driver']['username'], file)
        hello = json.dump(log['log']['driver']['first_name'], file)
        hello = json.dump(log['log']['driver']['last_name'], file)
        for event in log['log']['events']:
            hello = json.dump(event['event']['id'], file)
            hello = json.dump(event['event']['start_time'], file)
            hello = json.dump(event['event']['type'], file)
            hello = json.dump(event['event']['location'], file)
The end goal here is to convert this data into a CSV; the only reason I'm saving it to a JSON file is so that I can load it back and then write it out as CSV. The API endpoint I'm targeting is Logs:
https://developer.keeptruckin.com/reference#get-logs
I think @GBrandt has the right idea as far as creating valid JSON output goes, but as I said in a comment, I don't think the JSON-to-JSON conversion step is really necessary, since you could just create the CSV file from the JSON you already have.
(Modified to also split start_time into two separate fields, as per your follow-on question.)
import csv

result = r.json()

with open('myfile.csv', 'w', newline='') as csvfile:
    writer = csv.writer(csvfile, quoting=csv.QUOTE_ALL)
    for log in result['logs']:
        username = log['log']['driver']['username']
        first_name = log['log']['driver']['first_name']
        last_name = log['log']['driver']['last_name']
        for event in log['log']['events']:
            id = event['event']['id']
            start_time = event['event']['start_time']
            date, time = start_time.split('T')  # Split time into two fields.
            _type = event['event']['type']  # Avoid using name of built-in.
            location = event['event']['location']
            if not location:
                location = "N/A"
            writer.writerow(
                (username, first_name, last_name, id, date, time, _type, location))
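A small optional addition of mine (not part of the original answer): if you also want a header row in the CSV, write it once, right after creating the writer and before the loops; the column labels are just my suggestions:

# Hypothetical header row matching the fields written above
writer.writerow(
    ('username', 'first_name', 'last_name', 'id', 'date', 'time', 'type', 'location'))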
It looks like you're just dumping individual JSON strings into the file in an unstructured way.
json.dump will not magically create a JSON dict-like object and save it into the file. See:
json.dump(log['log']['driver']['username'], file)
What it actually does is stringify the driver's username and dump it right into the file, so the file will contain only a string, not a JSON object (which I'm guessing is what you want). It is JSON, just not a very useful form of it.
What you're looking for is this:
r = requests.get(url, headers=header)
result = r.json()

with open('myfile.json', 'w+') as file:
    logs = []
    for log in result['logs']:
        logs.append({
            'username': log['log']['driver']['username'],
            'first_name': log['log']['driver']['first_name'],
            'last_name': log['log']['driver']['last_name'],
            # ...
            'events': [
                {
                    'id': event['event']['id'],
                    'start_time': event['event']['start_time'],
                    # ...
                } for event in log['log']['events']
            ]
        })
    json.dump(logs, file)
Also, I would recommend not using append mode on JSON files; a .json file is expected to hold a single JSON document (as far as I'm concerned).
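If you genuinely need append-style writing, one workaround (my suggestion, not part of the original answer) is the JSON Lines convention: one complete JSON object per line, appended freely and read back line by line with the standard json module:

import json

# Append one record per line; each line is a self-contained JSON object
with open('myfile.jsonl', 'a') as f:
    f.write(json.dumps({'username': 'demo_driver'}) + '\n')

# Read all records back
with open('myfile.jsonl') as f:
    records = [json.loads(line) for line in f]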
How about the code below? (A sample JSON is loaded from a file instead of via an HTTP call, in order to have data to work with.)
The sample JSON is taken from https://developer.keeptruckin.com/reference#get-logs
import json

with open('input.json', 'r') as f_in:
    data = json.load(f_in)

data_to_collect = []
logs = data['logs']

with open('output.json', 'w') as f_out:
    for log in logs:
        _log = log['log']
        data_to_collect.append({key: _log['driver'].get(key) for key in ['username', 'first_name', 'last_name']})
        data_to_collect[-1]['events'] = []
        for event in _log['events']:
            data_to_collect[-1]['events'].append(
                {key: event['event'].get(key) for key in ['id', 'start_time', 'type', 'location']})
    json.dump(data_to_collect, f_out)
Output file:
[
    {
        "username": "demo_driver",
        "first_name": "Demo",
        "last_name": "Driver",
        "events": [
            {
                "start_time": "2016-10-16T07:00:00Z",
                "type": "driving",
                "id": 221,
                "location": "Mobile, AL"
            },
            {
                "start_time": "2016-10-16T09:00:00Z",
                "type": "sleeper",
                "id": 474,
                "location": null
            },
            {
                "start_time": "2016-10-16T11:00:00Z",
                "type": "driving",
                "id": 475,
                "location": null
            }
        ]
    }
]
