Reading from CSV, converting to JSON and storing in MongoDB - python

I am trying to read a CSV file with Pandas, convert each row into a JSON object, append them to a dict, and then store the result in MongoDB.
Here is my code:
data = pd.DataFrame(pd.read_csv('data/airports_test.csv'))
for i in data.index:
    json = data.apply(lambda x: x.to_json(), axis=1)
json_dict = json.to_dict()
print(json_dict[5])
ins = collection.insert_many(json_dict)
# for i in json_dict:
#     ins = collection.insert_one(json_dict[i])
If I print elements of the dict I get the correct output (I think..). If I try to use collection.insert_many, I get the error 'documents must be a non-empty list'. If I try to loop through the dict and add one document at a time, I get the error
document must be an instance of dict, bson.son.SON, bson.raw_bson.RawBSONDocument, or a type that inherits from collections.MutableMapping
I have Googled and Googled but I can't seem to find a solution! Any help would be massively appreciated.

You can skip processing the individual rows of the DataFrame via:
import pandas
# read_csv already returns a DataFrame, so no extra DataFrame(...) wrapper is needed
data = pandas.read_csv('test2.csv')
# to_dict(orient="records") produces a list with one dict per row,
# which is exactly the shape insert_many expects
data = data.to_dict(orient="records")
collection.insert_many(data)
As an aside, I think I would personally use the csv module and DictReader rather than pandas here (see the sketch below), but this way is fine.
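For reference, a minimal sketch of that csv-module alternative, assuming as above that collection is an existing pymongo collection:
import csv
# DictReader yields one dict per row, keyed by the header row;
# note that it returns every value as a string
with open('test2.csv', newline='') as f:
    records = list(csv.DictReader(f))
collection.insert_many(records)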

Related

Is there a function to print JSON schema in Python?

I read the JSON file in as a Pandas data frame. Now I want to print the JSON object's schema. I looked around and mostly saw links to do this online, but my file is too big (almost 11k objects/lines). I'm new at this, so I was wondering: is there a function/code that can do this in Python?
What I have so far...
import json
import pandas as pd
df = pd.read_json('/Users/name/Downloads/file.json', lines=True)
print(df)
I can't add a comment, but maybe try converting the df to JSON, storing it in a variable, and then printing that variable.
You can if you use the pydantic and datamodel-code-generator libraries: use datamodel-code-generator to produce the Python model, then use pydantic to print the schema from that model.
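A rough sketch of that two-step flow, assuming the model file was generated first with something like datamodel-codegen --input file.json --input-file-type json --output model.py, and that the generated root class is named Model (the generator's default):
import json
from model import Model  # the module datamodel-code-generator wrote out
# pydantic v1 exposes the JSON schema via .schema();
# on pydantic v2 the equivalent is Model.model_json_schema()
print(json.dumps(Model.schema(), indent=2))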

How to transform CSV or DB record data for Kafka and how to get it back into a csv or DF on the other side

I've successfully set up a Kafka instance at my job and I've been able to pass simple 'Hello World' messages through it.
However, I'm not sure how to do more interesting things. I've got a CSV containing four records from a DB that I'm trying to move through Kafka, then pull into a DataFrame on the other side and save as a CSV again.
producer = KafkaProducer(bootstrap_servers='my-server-id:443',
                         ....
df = pd.read_csv('data.csv')
df = df.to_json()
producer.send(mytopic, df.encode('utf8'))
On the consumer side, the message comes back in a tuple (consumer.record object, bool) that contains a list of my data. I can access the data as:
msg[0][0][6].decode('utf8')
But that comes in as a single string that I can't pass to a dataframe easily (it just merges everything into one thing).
I'm not sure if I even need a dataframe or a to_json() method or anything. I'm really just not sure how to organize the data to send it properly, then get it back and feed it into a dataframe so that I can either a) save it to a CSV or b) reinsert the dataframe into a DB with to_sql.
Kafka isn't really suited to sending entire matrices/dataframes around.
You can send a list of CSV rows, JSON arrays, or preferably some other compressible binary data format such as Avro or Protobuf as whole objects. If you are working exclusively in Python, you could pickle the data you send and receive.
When you read the data you must deserialize it, but how you do that is ultimately your choice; there is no simple answer for any given application.
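For illustration, a minimal per-row JSON sketch using kafka-python; the topic name, file names, and server address are placeholders carried over from the question:
import json
import pandas as pd
from kafka import KafkaConsumer, KafkaProducer

producer = KafkaProducer(
    bootstrap_servers='my-server-id:443',
    value_serializer=lambda v: json.dumps(v).encode('utf-8'),
)
# send one JSON object per CSV row rather than the whole dataframe at once
df = pd.read_csv('data.csv')
for record in df.to_dict(orient='records'):
    producer.send('mytopic', record)
producer.flush()

# on the other side, deserialize each message back into a row dict
consumer = KafkaConsumer(
    'mytopic',
    bootstrap_servers='my-server-id:443',
    auto_offset_reset='earliest',
    consumer_timeout_ms=10000,  # stop iterating after 10s of silence
    value_deserializer=lambda v: json.loads(v.decode('utf-8')),
)
rows = [msg.value for msg in consumer]
pd.DataFrame(rows).to_csv('data_out.csv', index=False)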
The solution, for this one case, would be json_normalize, then to_csv. However, I would like to point out that Kafka isn't required for you to test that, as you definitely should be writing unit tests...
import json
import pandas as pd

df = pd.read_csv('data.csv')
jdf = df.to_json(orient='records')  # a JSON array with one object per row
msg_value = jdf  # pretend you got a message from Kafka, as a JSON string
df = pd.json_normalize(json.loads(msg_value))  # parse the string first; json_normalize wants dicts, not a str
df.to_csv('data_out.csv', index=False)

CSV to JSON without needing it to save in another json file using Python

I am looking for a way to convert CSV data into JSON data without needing to save it to another JSON file. Is it possible?
So the following functionality needs to be carried out.
Sample Code:
df = pd.read_csv("file_xyz.csv").to_json("another_file.json")
data_json = pd.read_json("another_file.json")
Now, suppose I had to do the same thing without saving my data to "another_file.json". I want data_json to hold the JSON data produced by operating directly on the CSV data. Is it possible? How can we do that?
Use DataFrame.to_json without filename:
j = pd.read_csv("file_xyz.csv").to_json()
Or, if you want to convert the output to a dictionary for further processing, use DataFrame.to_dict:
d = pd.read_csv("file_xyz.csv").to_dict()
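If the goal is parsed JSON data rather than a JSON string, the two combine naturally; a small sketch reusing the file name from the question:
import json
import pandas as pd
# orient="records" gives a JSON array with one object per row
j = pd.read_csv("file_xyz.csv").to_json(orient="records")
data_json = json.loads(j)  # a list of dicts, ready for further processing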

Reading a JSON file in pyspark without changing the old schema

I receive a JSON file every day with 10 attributes, but on some days, if an attribute has no value, they send only 9 attributes and the 10th is missing from the JSON. How can I read the JSON file in pyspark without changing the old table schema?
It seems like you should enforce a schema when reading the files.
I'm assuming you have something like this:
df = spark.read.json(path_to_json_files)
In order to preserve all the attributes/fields, use the schema like so:
df = spark.read.schema(file_schema).json(path_to_json_files)
To get file_schema you can use an old file (or files) in which you know every attribute is present:
file_schema = spark.read.json(full_json_file).schema
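Put together, a minimal sketch (both paths are placeholders):
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
# infer the complete schema once, from a file known to contain all 10 attributes...
file_schema = spark.read.json("known_complete_file.json").schema
# ...then enforce it on the daily files; a missing attribute simply
# comes through as null instead of dropping out of the schema
df = spark.read.schema(file_schema).json("daily_json_files/")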

How to read API JSON data and store as Python dictionary

I am pulling in info from an API. The returned data is in JSON format. I have to iterate through and get the same data for multiple inputs. I want to save the JSON data for each input in a python dictionary for easy access. This is what I have so far:
import pandas
import requests
ddict = {}
read_input = pandas.read_csv('input.csv')
for d in read_input.values:
    print(d)
    url = "https://api.xyz.com/v11/api.json?KEY=123&LOOKUP={}".format(d)
    response = requests.get(url)
    data = response.json()
    ddict[d] = data
df = pandas.DataFrame.from_dict(ddict, orient='index')
with pandas.ExcelWriter('output.xlsx') as w:
    df.to_excel(w, 'output')
With the above code, I get the following output:
a.com
I also get an Excel output with the data from only this first line. My input CSV file has close to 400 rows, so I should be seeing more than one line in the output and in the output Excel file.
If you have a better way of doing this, that would be appreciated. In addition, the Excel output I get is very hard to understand. I want to read the JSON data using dictionaries and subdictionaries, but I don't completely understand the format of the underlying data; I think it looks closest to a JSON array.
I have looked at numerous other posts including Parsing values from a JSON file using Python? and How do I write JSON data to a file in Python? and Converting JSON String to Dictionary Not List and How do I save results of a "for" loop into a single variable? but none of the techniques have worked so far. I would prefer not to pickle, if possible.
I'm new to Python so any help is appreciated!
I'm not going to address your challenges with JSON here as I'll need more information on the issues you're facing. However, with respect to reading from CSV using Pandas, here's a great resource: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html.
Now, your output is being read the way it is because a.com is being considered the header (undesirable). Your read statement should be:
read_input = pandas.read_csv('input.csv', header=None)
Now, read_input is a DataFrame (documentation). So what you're really looking for is the values in the first column. You can easily get an array of values via read_input.values; this gives you a separate array for each row. So your for loop would be:
for d in read_input.values:
    print(d[0])
    get_info(d[0])
For JSON, I'd need to see a sample structure and your desired way of storing it.
I think there is an awkwardness in your program.
Try this:
ddict = {}
read_input = pandas.read_csv('input.csv')
for d in read_input.values:
    url = "https://api.xyz.com/v11/api.json?KEY=123&LOOKUP={}".format(d)
    response = requests.get(url)
    data = response.json()
    ddict[d] = data
Edit: iterate over read_input.values.
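Combining both answers into one hedged end-to-end sketch (header=None comes from the first answer; the scalar d[0] is used as the dict key, since a whole numpy row is not hashable):
import pandas
import requests

ddict = {}
read_input = pandas.read_csv('input.csv', header=None)
for d in read_input.values:
    lookup = d[0]  # the scalar lookup value from the first column
    url = "https://api.xyz.com/v11/api.json?KEY=123&LOOKUP={}".format(lookup)
    ddict[lookup] = requests.get(url).json()

df = pandas.DataFrame.from_dict(ddict, orient='index')
with pandas.ExcelWriter('output.xlsx') as w:
    df.to_excel(w, 'output')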
