I am trying to get field names from MongoDB using pymongo. Is there a way to do that?
Mongo Collection Format:
"_id" : ObjectId("5e7a773721ee63712e9d25a3"),
"effective_date" : "2020-03-24",
"data" : [
{
"Year" : 2020,
"month" : 1,
"Day" : 28,
"views" : 4994,
"clicks" : 3982
},
{
"Year" : 2020,
"month" : 1,
"Day" : 17,
"views" : 1987,
"clicks" : 3561
},
.
.
.
]
Is there a way I can get the field names?
I want to get: _id, effective_date, data.Year, data.month, data.Day, data.views, data.clicks
This is what I have:
from datetime import datetime, timedelta, date
import pymongo
from pymongo import MongoClient
from pymongo.read_preferences import ReadPreference
from pprint import pprint
from bson.son import SON
from bson import json_util
from bson.json_util import dumps, loads
import re
client = pymongo.MongoClient(host='mongodb://00.00.00.0:00000')
db = client.collection
pprint(db)
def get_results(filters):
    col = db.results
    res = col.find()
    res = list(res)
    return dumps(res, indent=4)
Is there a way for me to get just the field names using pymongo?
We are not really filtering or aggregating in this example; we are doing a big find() with no projection and then we want all the field names. So, assuming we are dragging all the data over to the client anyway, let the client side do the work. Here is something that captures unique field names, including through arrays, and also counts how many times each unique field name appears:
r = [
{"_id":0, "A":"A", "data":[
{"Y":2020,"day":3,"clicks":12},
{"Y":2020,"day":4,"clicks":192}
]} ,
{"_id":1, "B":{"foo":"bar"}, "data":[
{"Y":2020,"day":3,"clicks":888,"corn":"dog"},
{"Y":2020,"day":4,"clicks":999,"zing":"zap"}
]} ,
{"_id":2, "B":{"foo":"bit"} },
{"_id":3, "B":{"fin":"bar"} }
]
coll.insert_many(r)
fieldNames = {}

def addFldName(s):
    if s not in fieldNames:
        fieldNames[s] = 0
    fieldNames[s] += 1

def process(path, v):
    addFldName(path)
    if isinstance(v, dict):
        walkMap(path, v)
    elif isinstance(v, list):
        walkList(path, v)

def walkMap(path, doc):
    dot = "" if path == "" else "."
    for k, v in doc.items():
        s = path + dot + k
        process(s, v)

def walkList(path, array):
    dot = "" if path == "" else "."
    for n in range(len(array)):
        s = path + dot + str(n)
        process(s, array[n])

for doc in coll.find():
    walkMap("", doc)

print(fieldNames)
{'A': 1, 'data.1.clicks': 2, 'B': 3, 'data.0': 2, 'data.1': 2, 'data.0.Y': 2, 'data.1.zing': 1, 'data.0.day': 2, 'B.fin': 1, 'B.foo': 2, 'data.1.Y': 2, '_id': 4, 'data': 2, 'data.0.corn': 1, 'data.0.clicks': 2, 'data.1.day': 2}
It's a little weird, but yes, data.0.clicks is unique and shows up in 2 docs.
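If you do want the server to do part of the work, a rough aggregation sketch (assuming MongoDB 3.4.4+ for $objectToArray) can count the top-level field names; nested and array fields would still need the client-side walk above, or extra $map/$unwind stages:
pipeline = [
    {"$project": {"_id": 0, "kv": {"$objectToArray": "$$ROOT"}}},
    {"$unwind": "$kv"},
    {"$group": {"_id": "$kv.k", "count": {"$sum": 1}}},
]
for doc in coll.aggregate(pipeline):
    print(doc["_id"], doc["count"])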
Related
I am relatively new to MongoDB in Python, so kindly help.
I have created a collection called Waste:
class Waste(Document):
    meta = {'collection': 'Waste'}
    item_id = IntField(required=True)
    date_time_record = DateTimeField(default=datetime.utcnow)
    waste_id = IntField(unique=True, required=True)
    weight = FloatField(required=True)
I want to do a range query for a given start and end date:
I have tried the following query:
start = datetime(start_year, start_month, start_day)
end = datetime(end_year, end_month, end_day)
kwargs['date_time_record'] = {'$lte': end, '$gte': start}
reports = Waste.objects(**kwargs).get()
But I keep getting the error: DoesNotExist: Waste matching query does not exist.
The date values are being sent as:
{
"start_year": 2020,
"start_month" : 5,
"start_day" : 10,
"end_year": 2020,
"end_month" : 5,
"end_day" : 20
}
When I try to get the first object from the collection, the output in JSON is:
{"_id": {"$oid": "5ebbcf126fdbb9db9f74d24a"}, "item_id": 96387295, "date_time_record": {"$date": 1589366546870}, "waste_id": 24764942, "weight": 32546.0}
A $date is added and I am unable to decipher the number in the date field. But when I look at the data using MongoDB Compass it looks just fine.
There exists a record in the given date range, so I am unable to understand where I am going wrong.
I got this working by using Q:
The query I used is:
reports = Waste.objects((Q(date_time_record__gte=start) & Q(date_time_record__lte=end)))
The response is:
[{"_id": {"$oid": "5ebbcf126fdbb9db9f74d24a"}, "item_id": 96387295, "date_time_record": {"$date": 1589366546870}, "waste_id": 24764942, "weight": 32546.0}]
The source data is event logs from a device, and all of the data is in JSON format.
Sample of the raw JSON data:
{"sn": "123", "ip": null, "evt_name": "client_requestData", "evt_content": {"count": 1, "hour": 13, "dow": 0, "segmentation": {"requestService": "music", "requestData": "is_online", "requestOpcode": "get_state"}, "sum": 0}, "evt_ts": 1521350052, "app_key": "f6e7f4f8ec4b4d6dae6fa2b5ed8f90cb6a640759", "sdk_name": "countlysdk_0.0.9", "sdk_version": "17.05"}
{"sn": "123", "ip": null, "evt_name": "client_requestData2", "evt_content": {"count": 1, "hour": 13, "dow": 0, "segmentation": {"requestService": "fm", "requestData": "is_online", "requestOpcode": "get_state"}, "sum": 0}, "evt_ts": 1521350053, "app_key": "f6e7f4f8ec4b4d6dae6fa2b5ed8f90cb6a640759", "sdk_name": "countlysdk_0.0.9", "sdk_version": "17.05"}
{"sn": "123", "ip": null, "evt_name": "client_requestData3", "evt_content": {"count": 1, "hour": 13, "dow": 0, "segmentation": {"requestService": "video", "requestData": "is_online", "requestOpcode": "get_state"}, "sum": 0}, "evt_ts": 1521350054, "app_key": "f6e7f4f8ec4b4d6dae6fa2b5ed8f90cb6a640759", "sdk_name": "countlysdk_0.0.9", "sdk_version": "17.05"}
{"sn": "123", "ip": null, "evt_name": "client_requestData4", "evt_content": {"count": 1, "hour": 13, "dow": 0, "segmentation": {"requestService": "fm", "requestData": "is_online", "requestOpcode": "get_state"}, "sum": 0}
I have an event list (e.g. tar_task_list) with about 100 or more items. For each event,
I need to collect all matching records from the raw data and then save them to a per-event CSV file.
Below is my code:
# read source data
raw_data = sc.textFile("s3://xxx").map(lambda x: json.loads(x))

# TODO: NEED TO SPEED UP THIS COMPUTING
for tar_evt_name in evts:
    print("...")
    table_name = out_table_prefix + tar_evt_name
    evt_one_rdd = raw_data.filter(lambda x: x.get("evt_name") == tar_evt_name)
    evt_one_rdd.cache()

    evt_one_dict = evt_one_rdd.first()
    Evt_one = Row(*sorted(['{}'.format(k) for k, v in evt_one_dict.items()]))

    col_len = len(evt_one_rdd.first())
    evt_one_rdd2 = evt_one_rdd.map(lambda x: to_list(x, col_len)).filter(lambda x: len(x) != 0)
    evt_one_rdd2.cache()

    df = spark.createDataFrame(evt_one_rdd2.map(lambda x: Evt_one(*x)))

    out_csv_path = output + '/' + tar_evt_name + '/'  # add trailing '/' to avoid copy error
    df.write.csv(out_csv_path, mode='overwrite', header=True, sep='|', nullValue="NULL")
The output data looks like the following:
time : 2018-05-07 00:03|8dab4796-fa37-4114-0011-7637fa2b0001|f6e7f4f8ec4b4d6dae6fa2b5ed8f90cb6a640759|0.2.23|131074|2018-05-08 23:24:25|0|false|default|2.4.130
Here is my attempt. I have noticed a few issues here:
for tar_evt_name in evts is a native Python for loop, which incurs a performance penalty when it seems like you want a group-by operation;
.cache() is used, but seemingly for no reason;
I'm unsure what to_list is;
I don't think evt_one_rdd2.map(lambda x: Evt_one(*x)) works.
import json
from pyspark.sql import functions as F
from pyspark.sql import Row
from pyspark.sql import Window
raw_data = sc.textFile('test.txt')
df = raw_data.map(
    # Map the raw input to a python dict using `json.loads`
    json.loads,
).map(
    # Duplicate the evt_name and evt_ts for later use in a Row object
    lambda x: Row(evt_name=x['evt_name'], evt_ts=x.get('evt_ts', 1), data=x),
).toDF()  # Convert into a dataframe...
# ... (I am actually unsure if this is faster...
# ... but I am more comfortable with this)

filtered_df = df.withColumn(
    # NOTE: Assumed you want the first row, as you used `evt_one_rdd.first()`.
    # So we assign a row number (named rn) and then filter on rn = 1.
    # Here the evt_name and evt_ts become handy; you might want to set
    # your own evt_ts properly.
    'rn', F.row_number().over(
        Window.partitionBy(df['evt_name']).orderBy(df['evt_ts'])
    ),
).filter('rn = 1').where(
    # NOTE: Since you used `map(lambda x: to_list(x, col_len)).filter(lambda x: len(x) != 0)`,
    # I assume you meant data should have more than 0 keys,
    # but this should be almost always true,
    # since you are grouping by `evt_name`, which means
    # there is at least that key most of the time.
    F.size(F.col('data')) > 0
)
filtered_df.write(....)
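A further thought, not required for the fix above: if the end goal is simply one output folder per evt_name, a partitioned write may replace the Python-side loop entirely. This is only a sketch under the assumption that the fields you need are first pulled out of the nested data map into plain columns (the CSV writer cannot store map/struct columns):
# Placeholder selection: in practice, extract the needed fields from `data`
# into ordinary columns before writing.
flat_df = filtered_df.select('evt_name', 'evt_ts')

# One pass over the data: Spark creates a sub-directory per evt_name value.
flat_df.write.partitionBy('evt_name').csv(
    output,  # same base output path as in the question
    mode='overwrite', header=True, sep='|', nullValue='NULL',
)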
With my code, I read the values from the JSON data and insert them into an array:
def retrive_json():
    with open('t_v1.json') as json_data:
        d = json.load(json_data)
    array = []
    for i in d['ride']:
        origin_lat = i['origin']['lat']
        origin_lng = i['origin']['lng']
        destination_lat = i['destination']['lat']
        destination_lng = i['destination']['lng']
        array.append([origin_lat, origin_lng, destination_lat, destination_lng])
    return array
The resulting array is this:
[[39.72417, -104.99984, 39.77446, -104.9379], [39.77481, -104.93618, 39.6984, -104.9652]]
How can I write each element of each array into a specific field in the CSV?
I have tried this way:
wrt = csv.writer(open('t_.csv', 'w'), delimiter=',', lineterminator='\n')
for x in jjson:
    wrt.writerow([x])
But the values of each array are all stored in one field.
How can I solve this and write each value into its own field?
This is my JSON file:
{
"ride":[
{
"origin":{
"lat":39.72417,
"lng":-104.99984,
"eta_seconds":null,
"address":""
},
"destination":{
"lat":39.77446,
"lng":-104.9379,
"eta_seconds":null,
"address":null
}
},
{
"origin":{
"lat":39.77481,
"lng":-104.93618,
"eta_seconds":null,
"address":"10 Albion Street"
},
"destination":{
"lat":39.6984,
"lng":-104.9652,
"eta_seconds":null,
"address":null
}
}
]
}
Let's say we have this:
jsonstring = """{
"ride":[
{
"origin":{
"lat":39.72417,
"lng":-104.99984,
"eta_seconds":null,
"address":""
},
"destination":{
"lat":39.77446,
"lng":-104.9379,
"eta_seconds":null,
"address":null
}
},
{
"origin":{
"lat":39.77481,
"lng":-104.93618,
"eta_seconds":null,
"address":"10 Albion Street"
},
"destination":{
"lat":39.6984,
"lng":-104.9652,
"eta_seconds":null,
"address":null
}
}
]
}"""
Here is a pandas solution:
import pandas as pd
import json
# Load json to dataframe
df = pd.DataFrame(json.loads(jsonstring)["ride"])
# Create the new columns
df["o1"] = df["origin"].apply(lambda x: x["lat"])
df["o2"] = df["origin"].apply(lambda x: x["lng"])
df["d1"] = df["destination"].apply(lambda x: x["lat"])
df["d2"] = df["destination"].apply(lambda x: x["lng"])
#export
print(df.iloc[:,2:].to_csv(index=False, header=True))
#use below for file
#df.iloc[:,2:].to_csv("output.csv", index=False, header=True)
Returns:
o1,o2,d1,d2
39.72417,-104.99984,39.77446,-104.9379
39.77481,-104.93618,39.6984,-104.9652
Condensed answer:
import pandas as pd
import json
with open('data.json') as json_data:
d = json.load(json_data)
df = pd.DataFrame(d["ride"])
df["o1"],df["o2"] = zip(*df["origin"].apply(lambda x: (x["lat"],x["lng"])))
df["d1"],df["d2"] = zip(*df["destination"].apply(lambda x: (x["lat"],x["lng"])))
df.iloc[:,2:].to_csv("t_.csv",index=False,header=False)
Or, maybe the most readable solution:
import json
from pandas.io.json import json_normalize
with open('data.json') as json_data:
    d = json.load(json_data)
df = json_normalize(d["ride"])
cols = ["origin.lat","origin.lng","destination.lat","destination.lng"]
df[cols].to_csv("output.csv",index=False,header=False)
This might help:
import json
import csv
def retrive_json():
    with open('data.json') as json_data:
        d = json.load(json_data)
    array = []
    for i in d['ride']:
        origin_lat = i['origin']['lat']
        origin_lng = i['origin']['lng']
        destination_lat = i['destination']['lat']
        destination_lng = i['destination']['lng']
        array.append([origin_lat, origin_lng, destination_lat, destination_lng])
    return array

res = retrive_json()

csv_cols = ["origin_lat", "origin_lng", "dest_lat", "dest_lng"]
with open("output_csv.csv", 'w') as out:
    writer = csv.DictWriter(out, fieldnames=csv_cols)
    writer.writeheader()
    for each_list in res:
        d = dict(zip(csv_cols, each_list))
        writer.writerow(d)
The output CSV generated is:
origin_lat,origin_lng,dest_lat,dest_lng
39.72417,-104.99984,39.77446,-104.9379
39.77481,-104.93618,39.6984,-104.9652
To me it looks like you've got an array of arrays and you want the individual elements. Therefore you'll want to use a nested for loop. Your current for loop is getting each array; to then split each array into its elements, you'll want to loop through those as well. I'd suggest something like this:
for x in jjson:
    for y in x:
        wrt.writerow([y])
Obviously you might want to adjust the bracketing, etc.; this is just to give you an idea of how to solve your issue.
Let me know how it goes!
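If, instead, the goal is one CSV row per inner list (one coordinate per field), it should be enough to drop the extra brackets, something like:
for x in jjson:
    wrt.writerow(x)  # x is already a list of four values, one per CSV field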
Why the csv library at all?
array = [[1, 2, 3, 4], [5, 6, 7, 8]]
with open('test.csv', 'w') as csv_file:
    csv_file.write("# Header Info\n"
                   "# Value1, Value2, Value3, Value4\n")  # The header might be optional
    for row in array:
        csv_file.write(",".join(map(str, row)) + "\n")
I use openpyxl to read data from Excel files in order to produce a JSON file at the end. The problem is that I cannot figure out an algorithm to do a hierarchical organisation of the JSON (or Python dictionary).
The data form is like the following:
The output should be like this:
{
'id' : '1',
'name' : 'first',
'value' : 10,
'children': [ {
'id' : '1.1',
'name' : 'ab',
'value': 25,
'children' : [
{
'id' : '1.1.1',
'name' : 'abc' ,
'value': 16,
'children' : []
}
]
},
{
'id' : '1.2',
...
]
}
Here is what I have come up with, but I can't go beyond '1.1' because '1.1.1' and '1.1.1.1' and so on end up at the same level as '1.1'.
from openpyxl import load_workbook
import re
from json import dumps

wb = load_workbook('resources.xlsx')
sheet = wb.get_sheet_by_name(wb.get_sheet_names()[0])

resources = {}
prev_dict = {}
list_rows = [row for row in sheet.rows]

for nrow in range(len(list_rows)):
    id = str(list_rows[nrow][0].value)
    val = {
        'id': id,
        'name': list_rows[nrow][1].value,
        'value': list_rows[nrow][2].value,
        'children': []
    }
    if id[:-2] == str(list_rows[nrow - 1][0].value):
        prev_dict['children'].append(val)
    else:
        resources[nrow] = val
        prev_dict = resources[nrow]

print(dumps(resources))
You need to access your data by ID, so the first step is to create a dictionary where the IDs are the keys. For easier data manipulation, the string "1.2.3" is converted to the tuple ("1", "2", "3"). (Lists are not allowed as dict keys.) This makes the computation of a parent key very easy (key[:-1]).
With this preparation, we can simply populate the children list of each item's parent. But before doing that, a special ROOT element needs to be added; it is the parent of the top-level items.
That's all. The code is below.
Note #1: It expects that every item has a parent. That's why 1.2.2 was added to the test data. If it is not the case, handle the KeyError where noted.
Note #2: The result is a list.
import json
testdata="""
1 first 20
1.1 ab 25
1.1.1 abc 16
1.2 cb 18
1.2.1 cbd 16
1.2.1.1 xyz 19
1.2.2 NEW -1
1.2.2.1 poz 40
1.2.2.2 pos 98
2 second 90
2.1 ezr 99
"""
datalist = [line.split() for line in testdata.split('\n') if line]
datadict = {tuple(item[0].split('.')): {'id': item[0],
                                        'name': item[1],
                                        'value': item[2],
                                        'children': []}
            for item in datalist}

ROOT = ()
datadict[ROOT] = {'children': []}

for key, value in datadict.items():
    if key != ROOT:
        datadict[key[:-1]]['children'].append(value)
        # KeyError = parent does not exist
result = datadict[ROOT]['children']
print(json.dumps(result, indent=4))
I have the below data in JSON format. I have started with the code below, which throws a KeyError.
I am not sure how to get all the data listed in the headers section.
I know I am not doing it right in json_obj['offers'][0]['pkg']['Info'], but I am not sure how to do it correctly.
How can I get to the different nodes like Info, PricingInfo, flt_Info, etc.?
{
"offerInfo":{
"siteID":"1",
"language":"en_US",
"currency":"USD"
},
"offers":{
"pkg":[
{
"offerDateRange":{
"StartDate":[
2015,
11,
8
],
"EndDate":[
2015,
11,
14
]
},
"Info":{
"Id":"111"
},
"PricingInfo":{
"BaseRate":1932.6
},
"flt_Info":{
"Carrier":"AA"
}
}
]
}
}
import os
import json
import csv

f = open('api.csv', 'w')
writer = csv.writer(f, delimiter='~')
headers = ['Id', 'StartDate', 'EndDate', 'Id', 'BaseRate', 'Carrier']
default = ''
writer.writerow(headers)

string = open('data.json').read().decode('utf-8')
json_obj = json.loads(string)

for pkg in json_obj['offers'][0]['pkg']['Info']:
    row = []
    row.append(json_obj['id'])  # just to test, but I need the column values listed in the header section
    writer.writerow(row)
It looks like you're accessing the JSON incorrectly. After you have accessed json_obj['offers'], you accessed [0], but there is no array there; json_obj['offers'] gives you another dictionary.
For example, to get PricingInfo like you asked, access like this:
json_obj['offers']['pkg'][0]['PricingInfo']
or 11 from the StartDate like this:
json_obj['offers']['pkg'][0]['offerDateRange']['StartDate'][1]
And I believe you get the KeyError because you access [0] on the dictionary, and since that isn't a key, you get the error.
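Putting that access pattern together, a minimal sketch of a loop that fills every column from the headers list might look like the following (assuming each pkg entry carries Info, offerDateRange, PricingInfo and flt_Info, and reusing the writer from the question):
for pkg in json_obj['offers']['pkg']:
    # Join [2015, 11, 8] into "2015-11-8"
    start = '-'.join(str(part) for part in pkg['offerDateRange']['StartDate'])
    end = '-'.join(str(part) for part in pkg['offerDateRange']['EndDate'])
    row = [pkg['Info']['Id'], start, end, pkg['Info']['Id'],
           pkg['PricingInfo']['BaseRate'], pkg['flt_Info']['Carrier']]
    writer.writerow(row)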
Try substituting this piece of code:
for pkg in json_obj['offers'][0]['pkg']['Info']:
    row = []
    row.append(json_obj['id'])  # just to test, but I need the column values listed in the header section
    writer.writerow(row)
With this:
for pkg in json_obj['offers']['pkg']:
    row = []
    row.append(pkg['Info']['Id'])
    year = pkg['offerDateRange']['StartDate'][0]
    month = pkg['offerDateRange']['StartDate'][1]
    day = pkg['offerDateRange']['StartDate'][2]
    StartDate = "%d-%d-%d" % (year, month, day)
    print(StartDate)
    writer.writerow(row)
Try this:
import os
import json
import csv

string = open('data.json').read().decode('utf-8')
json_obj = json.loads(string)

pkg = json_obj["offers"]["pkg"][0]

print(pkg["Info"]["Id"])
print(str(pkg["offerDateRange"]["StartDate"][0]) + '-' + str(pkg["offerDateRange"]["StartDate"][1]) + '-' + str(pkg["offerDateRange"]["StartDate"][2]))
print(str(pkg["offerDateRange"]["EndDate"][0]) + '-' + str(pkg["offerDateRange"]["EndDate"][1]) + '-' + str(pkg["offerDateRange"]["EndDate"][2]))
print(pkg["Info"]["Id"])
print(pkg["PricingInfo"]["BaseRate"])
print(pkg["flt_Info"]["Carrier"])