Reading a JSON file in PySpark without changing the old schema - python

I receive a JSON file every day that normally has 10 attributes, but on some days, if an attribute has no value, they send only 9 attributes and the 10th is missing from the JSON entirely. How can I read the JSON file in PySpark without changing the old table schema?

It seems like you should enforce a schema when reading the files.
I'm assuming you have something like this:
df = spark.read.json(path_to_json_files)
In order to preserve all the attributes/fields, use the schema like so:
df = spark.read.schema(file_schema).json(path_to_json_files)
To get file_schema, you can infer it from an old file where you know every attribute is present:
file_schema = spark.read.json(full_json_file).schema
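A minimal sketch of the full pattern, assuming the file paths below are placeholders: infer the schema once from a file that contains all 10 attributes, optionally persist it as JSON so later jobs don't depend on that old file, and then enforce it when reading new data (missing attributes come back as null).
import json
from pyspark.sql.types import StructType

# Infer the schema once from a day known to contain every attribute
# (paths are placeholders).
file_schema = spark.read.json("path/to/complete_day.json").schema

# Optionally persist the schema so future jobs don't need the old file.
with open("schema.json", "w") as f:
    f.write(file_schema.json())

# Later: reload the saved schema and enforce it when reading new files;
# any attribute missing from the JSON becomes a null column.
with open("schema.json") as f:
    saved_schema = StructType.fromJson(json.load(f))

df = spark.read.schema(saved_schema).json("path/to/new_day.json")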

Related

how to read JSON data from database table into python pandas?

Instead of loading data from a JSON file, I need to retrieve JSON data from a database and apply business logic to it. To do that I used Python's "json" module to load the data, and I can see the data printed in my console. However, when I try to read that data into pandas to create a dataframe from it, nothing happens. Please see my code below:
def jsonRd():
    json_obj = json.loads("my table name")
    json_ead = pd.read_json(json_obj)
There is other confidential data in the function above that I cannot include here. When I print "json_obj", it shows the data, but when I try to print "json_ead", nothing seems to happen. I don't see any error either.
Please suggest a solution.
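A minimal sketch of one way this could work, assuming the database hands back a JSON string of records (the sample string below is a placeholder): json.loads() already returns Python dicts/lists, so pass the parsed objects to pandas directly instead of handing them to pd.read_json(), which expects a JSON string or file path.
import json
import pandas as pd

def json_rd(json_string):
    # json.loads() parses the JSON string into Python dicts/lists
    records = json.loads(json_string)
    # build the DataFrame from the parsed objects;
    # pd.read_json() would instead want the raw JSON string or a path
    return pd.json_normalize(records)

# hypothetical JSON string as it might come back from the database
sample = '[{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]'
print(json_rd(sample))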

Reading from CSV, converting to JSON and storing in MongoDB

I am trying to read a CSV file in pandas, convert each row into a JSON object, add them to a dict, and then store them in MongoDB.
Here is my code
data = pd.DataFrame(pd.read_csv('data/airports_test.csv'))
for i in data.index:
    json = data.apply(lambda x: x.to_json(), axis=1)
    json_dict = json.to_dict()
print(json_dict[5])
ins = collection.insert_many(json_dict)
# for i in json_dict:
#     ins = collection.insert_one(json_dict[i])
If I print elements of the dict I get the correct output (I think..). If I try to use collection.insert_many, I get the error 'documents must be a non-empty list'. If I try to loop through the dict and add the documents one at a time, I get the error:
document must be an instance of dict, bson.son.SON, bson.raw_bson.RawBSONDocument, or a type that inherits from collections.MutableMapping
I have Googled and Googled but I can't seem to find a solution! Any help would be massively appreciated.
You can skip processing the individual rows of the DataFrame via:
import json
import pandas
data = pandas.DataFrame(pandas.read_csv('test2.csv'))
data = data.to_dict(orient="records")
collection.insert_many(data)
As an aside, I would personally use the csv module and DictReader rather than pandas here, but this way is fine.
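For reference, a minimal sketch of that csv.DictReader approach, reusing the question's placeholder file name and assuming collection is an existing pymongo collection: DictReader yields one dict per row, which is exactly the shape insert_many() wants.
import csv

# Each row becomes a dict keyed by the CSV header names; note that DictReader
# returns every value as a string, so convert types first if needed.
with open('data/airports_test.csv', newline='') as f:
    rows = list(csv.DictReader(f))

if rows:  # insert_many() rejects an empty list
    collection.insert_many(rows)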

store pandas df as blob in oracle database

I want to detect the different dataframes in an Excel file, give each detected dataframe an ID, and store each dataframe as an object/BLOB in an Oracle database.
So in the DB table, it would look like:

DF_ID | DF_BLOB
1     | /blob string for df 1/
2     | /blob string for df 2/
I know how to store an entire Excel file as a BLOB in Oracle (basically store excelfile.read() directly),
but I cannot directly read() or open() a pandas df. So how can I store this df object as a BLOB?
The go-to library for storing Python objects in a binary format is pickle.
To get a byte string instead of writing to a file, use pickle.dumps():
pickle.dumps(obj, protocol=None, *, fix_imports=True, buffer_callback=None)
Return the pickled representation of the object obj as a bytes object, instead of writing it to a file.
Arguments protocol, fix_imports and buffer_callback have the same meaning as in the Pickler constructor.
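A minimal sketch of how that could be applied here, assuming a cx_Oracle/python-oracledb style cursor and connection and a hypothetical table DF_TABLE(DF_ID, DF_BLOB); the table and column names just mirror the layout above.
import pickle
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})  # stand-in for one detected dataframe

blob_bytes = pickle.dumps(df)  # bytes object suitable for a BLOB column

# Parameterized insert; DF_TABLE / DF_ID / DF_BLOB are the hypothetical names above.
cursor.execute(
    "INSERT INTO DF_TABLE (DF_ID, DF_BLOB) VALUES (:1, :2)",
    [1, blob_bytes],
)
connection.commit()

# Reading it back later: pickle.loads() on the fetched bytes restores the DataFrame.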

Is there a function to print JSON schema in Python?

I read the JSON file in as a pandas data frame. Now I want to print the JSON object schema. I looked around and mostly only found online tools for this, but my file is too big (almost 11k objects/lines). I'm new at this, so I was wondering: is there a function/code to do this in Python?
What I have so far...
import json
import pandas as pd
df = pd.read_json('/Users/name/Downloads/file.json', lines = True)
print(df)
I can't add a comment, but maybe try converting the df back to JSON in a variable and then printing the variable.
You can if you use the pydantic and datamodel-code-generator libraries:
use datamodel-code-generator to produce the Python model, and then use pydantic to print out the schema from the model.
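A rough sketch of that workflow, with several assumptions called out: the datamodel-codegen command is run from a shell first, the generated module is saved as model.py, and the top-level class is called Model (the real class name depends on your data).
# Generate the model from the JSON file (shell command, shown as a comment):
#   datamodel-codegen --input file.json --input-file-type json --output model.py

from model import Model  # hypothetical name of the generated top-level class

# pydantic v1 API; on pydantic v2 use Model.model_json_schema() instead
print(Model.schema_json(indent=2))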

How to transform CSV or DB record data for Kafka and how to get it back into a csv or DF on the other side

I've successfully set up a Kafka instance at my job and I've been able to pass simple 'Hello World' messages through it.
However, I'm not sure how to do more interesting things. I've got a CSV that contains four records from a DB that I'm trying to move through kafka, then take into a DF on the other side and save it as a CSV again.
producer = KafkaProducer(bootstrap_servers='my-server-id:443',
    ....
df = pd.read_csv('data.csv')
df = df.to_json()
producer.send(mytopic, df.encode('utf8'))
On the other side, this comes back in a tuple object (consumer record object, bool) that contains a list of my data. I can access the data as:
msg[0][0][6].decode('utf8')
But that comes in as a single string that I can't pass to a dataframe simply (it just merges everything into one thing).
I'm not sure if I even need a dataframe or a to_json() method or anything. I'm really just not sure how to organize the data to send it properly, then get it back and feed it into a dataframe so that I can either a) save it to a CSV or b) insert the dataframe into a DB with to_sql.
Kafka isn't really suited to sending entire matrices/dataframes around.
You can send a list of CSV rows, JSON arrays, or preferably some other compressible binary data format such as Avro or Protobuf as whole objects. If you are working exclusively in Python, you could pickle the data you send and receive.
When you read the data, you must deserialize it, but how you do that is ultimately your choice; there is no simple answer for any given application.
The solution, for this one case, would be json_normalize, then to_csv. I would also point out that Kafka isn't required for you to test that, as you should definitely be writing unit tests...
import json

df = pd.read_csv('data.csv')
jdf = df.to_json(orient='records')  # serialize the rows as a JSON array of objects
msg_value = jdf  # pretend you got a message from Kafka, as a JSON string
df = pd.json_normalize(json.loads(msg_value))  # parse the string, then back to a dataframe
df.to_csv()  # returns the CSV text; pass a path to write a file
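To make the Kafka leg concrete, here is a rough sketch using kafka-python's value_serializer/value_deserializer hooks; the broker address, topic name, and file names are placeholders, and the whole DataFrame is sent as a single JSON message of row records.
import json
import pandas as pd
from kafka import KafkaProducer, KafkaConsumer

TOPIC = 'my-topic'           # placeholder topic
BROKER = 'my-server-id:443'  # placeholder broker address

# Producer: serialize the list of row dicts to JSON bytes.
producer = KafkaProducer(
    bootstrap_servers=BROKER,
    value_serializer=lambda v: json.dumps(v).encode('utf-8'),
)
df = pd.read_csv('data.csv')
producer.send(TOPIC, df.to_dict(orient='records'))
producer.flush()

# Consumer: deserialize the JSON bytes back into row dicts, then a DataFrame.
consumer = KafkaConsumer(
    TOPIC,
    bootstrap_servers=BROKER,
    value_deserializer=lambda m: json.loads(m.decode('utf-8')),
    auto_offset_reset='earliest',
)
for message in consumer:
    out_df = pd.DataFrame(message.value)
    out_df.to_csv('out.csv', index=False)
    break  # only one message expected in this sketch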
