I generated a dump from one of my tables and every entity looks like this:
{
"name": "¡Felicidades! Menos cel=más Yumi.",
"promotion": "Tu paciencia será recompensada con un 2x1 en un cono de yogurt de tu elección.",
"establishment_id": 1,
"quantity": 10,
"validity_date": "2017-09-22",
"terms": "aplica restricciones",
"code": "2x1yum",
"points": 500,
"active": 0,
"image": "../uploads/20170511/img-59149a2de8d3b",
"image_file_name": "2X1.jpg",
"game_level_id": 0
},
When I use the import option in Firebase, my database gives the child nodes a numerical ID (1, 2, 3, etc.), but I want the child nodes to have the auto-generated Firebase IDs you get when you push a new node to the DB, e.g. HW0vcmYU4MW0zpZX9KlEoVCq1Up2.
I'm trying to write a Python script that reads my table and gives every object in the JSON its own unique Firebase ID using the Admin SDK, but I'm stuck trying to iterate through the JSON.
Should I use a CSV instead?
Here is my Python script; if I could get some assistance iterating through it, I'd appreciate it:
import firebase_admin
import csv
from firebase_admin import credentials
from firebase_admin import db
cred = credentials.Certificate('./dev.json')
firebase_admin.initialize_app(cred, {
'databaseURL': 'https://url.com/'
})
ref = db.reference('mobile_user/')
# or csv
f = open('coupons.json')
data = []
for line in f:
    # iteration here
    pass
ref.push(data)
print(data)
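For what it's worth, here is a minimal sketch of how the iteration could look (assuming coupons.json contains a JSON array of objects like the one above, and that each object should become its own child of mobile_user with an auto-generated push key):
import json

import firebase_admin
from firebase_admin import credentials, db

cred = credentials.Certificate('./dev.json')
firebase_admin.initialize_app(cred, {
    'databaseURL': 'https://url.com/'
})

ref = db.reference('mobile_user/')

# Load the whole JSON array instead of reading the file line by line.
with open('coupons.json') as f:
    coupons = json.load(f)

# push() creates a child with an auto-generated Firebase key for each coupon.
for coupon in coupons:
    new_ref = ref.push(coupon)
    print(new_ref.key, coupon['code'])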
I have a dataset consisting of 250 rows that looks like the following:
In MongoDB Compass, I inserted the first row as follows:
db.employees.insertOne({"employee_id": 412153,
    "first_name": "Carrol",
    "last_name": "Dhin",
    "email": "carrol.dhin@company.com",
    "managing": [{"manager_id": 412153, "employee_id": 174543}],
    "department": [{"department_name": "Accounting", "department_budget": 500000}],
    "laptop": [{"serial_number": "CSS49745",
        "manufacturer": "Lenovo",
        "model": "X1 Gen 10",
        "date_assigned": ISODate("2022-01-15"),
        "installed_software": ["MS Office", "Adobe Acrobat", "Slack"]}]})
If I wanted to insert all 250 rows into the database using PyMongo in Python, how would I ensure that every row is entered following the format that I used when I inserted it manually in the Mongo shell?
from pymongo import MongoClient
import pandas as pd

client = MongoClient('localhost', 27017)
db = client.MD
collection = db.gammaCorp

df = pd.read_csv(' ')  # insert CSV name here

for i in df.index:
    data = {}
    data['employee_id'] = df['employee_id'][i]
    data['first_name'] = df['first_name'][i]
    data['last_name'] = df['last_name'][i]
    data['email'] = df['email'][i]
    data['managing'] = [{'manager_id': df['employee_id'][i]}, {'employee_id': df['managing'][i]}]
    data['department'] = [{'department_name': df['department'][i]}, {'department_budget': df['department_budget'][i]}]
    data['laptop'] = [{'serial_number': df['serial_number'][i]}, {'manufacturer': df['manufacturer'][i]}, {'model': df['model'][i]}, {'date_assigned': df['date_assigned'][i]}, {'installed_software': df['installed_software'][i]}]
    collection.insert_one(data)
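For reference, a minimal sketch of one way to mirror the nested structure of the manual insert (assuming the CSV has the columns referenced above; the file name employees.csv is a placeholder, and the 'managing' column is taken to hold the managed employee's id):
from pymongo import MongoClient
import pandas as pd

client = MongoClient('localhost', 27017)
collection = client.MD.gammaCorp

df = pd.read_csv('employees.csv')  # placeholder file name

docs = []
for _, row in df.iterrows():
    docs.append({
        'employee_id': int(row['employee_id']),
        'first_name': row['first_name'],
        'last_name': row['last_name'],
        'email': row['email'],
        # single dict holding both keys, as in the manual insertOne
        'managing': [{'manager_id': int(row['employee_id']),
                      'employee_id': int(row['managing'])}],
        'department': [{'department_name': row['department'],
                        'department_budget': int(row['department_budget'])}],
        'laptop': [{'serial_number': row['serial_number'],
                    'manufacturer': row['manufacturer'],
                    'model': row['model'],
                    'date_assigned': row['date_assigned'],
                    'installed_software': row['installed_software']}],
    })

collection.insert_many(docs)  # one round trip for all 250 rows
insert_many sends the whole batch in one call; calling insert_one inside the loop also works, just with more round trips.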
I need to populate a Neo4j database from a JSON file containing data about some processes: the name of each process, its parents, and its children (if any). Here is part of the JSON as an example:
[
{
"process": "IPTV_Subscriptions",
"parents": ["IPTV_Navigation","DeviceCertifications-insertion"],
"childs": ["villa_iptv", "villa_ott", "villa_calicux"]
},
{
"process": "IPTV_Navigation",
"parents": [],
"childs": ["IPTV_Subscriptions"],
},
{
"process": "DeviceCertifications-getter",
"parents": [],
"childs": ["DeviceCertifications-insertion"]
},
{
"process": "DeviceCertifications-insertion",
"parents": ["DeviceCertifications-getter"],
"childs": ["IPTV_Subscriptions"]
}
]
With the following Python code, I found that I can create a node for each process contained in the JSON in bulk:
import json
from py2neo import Graph
from py2neo.bulk import create_nodes, create_relationships
graph = Graph("bolt://localhost:7687", auth = ("yyyy", "xxxx"))
#Opening json
f = open('/app/conf/data.json',)
processs = json.load(f)
data=[]
for i in processs:
    proc = []
    proc.append(i["process"])
    data.append(proc)
keys = ["process"]
create_nodes(graph.auto(), data, labels={"process"}, keys=keys)
And checking in neo4j, I see that the nodes are already created.
But now I need to make the relationships. For each process, from the json I know which are the parents and children of that node.
I tried to use the example from the documentation:
from py2neo import Graph
from py2neo.bulk import create_relationships
g = Graph()
data = [
(("Alice", "Smith"), {"since": 1999}, "ACME"),
(("Bob", "Jones"), {"since": 2002}, "Bob Corp"),
(("Carol", "Singer"), {"since": 1981}, "The Daily Planet"),
]
create_relationships(g.auto(), data, "WORKS_FOR", start_node_key=("Person", "name", "family name"), end_node_key=("Company", "name"))
But it didn't work for me
Given that the JSON tells me the parents and children of each process, does anyone have an idea of how I can generate the relationships in bulk? Based on the JSON example, the relationship types would be ParentOf and ChildOf, but I have no idea how to generate them from Python.
Below is the script to create the relationships in bulk using py2neo. Let me know if it works for you or not. Another thing: please label your nodes Process rather than process (note the uppercase P). I then use the relationship :CHILD_OF. If you want :PARENT_OF, change the tuples in data and swap the first and third items.
import json
from py2neo import Graph
from py2neo.bulk import create_relationships
graph = Graph("neo4j://localhost:7687", auth = ("neo4j", "neo4jay"))
#Opening json
f = open('data2.json',)
processs = json.load(f)
data = []
for i in processs:
    for p in i["parents"]:
        data.append((i["process"], {}, p))
create_relationships(graph.auto(), data, "CHILD_OF", start_node_key=("Process", "process"), end_node_key=("Process", "process"))
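Concretely, the :PARENT_OF variant described above (swapping the first and third items of each tuple, with the same Process label and process key) would be a sketch like:
data = []
for i in processs:
    for p in i["parents"]:
        data.append((p, {}, i["process"]))  # swapped: parent first, child last
create_relationships(graph.auto(), data, "PARENT_OF", start_node_key=("Process", "process"), end_node_key=("Process", "process"))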
While working with small zip files (about 8 MB) containing 25 MB of CSV files, the code below works exactly as it should. As soon as I attempt to download larger files (a 45 MB zip file containing a 180 MB CSV), the code breaks and I get the following error message:
(venv) ufulu#ufulu awr % python get_awr_ranking_data.py
https://api.awrcloud.com/v2/get.php?action=get_topsites&token=REDACTED&project=REDACTED Client+%5Bw%5D&fileName=2017-01-04-2019-10-09
Traceback (most recent call last):
File "get_awr_ranking_data.py", line 101, in <module>
getRankingData(project['name'])
File "get_awr_ranking_data.py", line 67, in getRankingData
processRankingdata(rankDateData['details'])
File "get_awr_ranking_data.py", line 79, in processRankingdata
domain.append(row.split("//")[-1].split("/")[0].split('?')[0])
AttributeError: 'float' object has no attribute 'split'
My goal is to download data for 170 projects and save the data to a SQLite DB.
Please bear with me as I am a novice in the field of programming and Python. I would greatly appreciate any help with fixing the code below, as well as any other suggestions and improvements to make the code more robust and Pythonic.
Thanks in advance
from dotenv import dotenv_values
import requests
import pandas as pd
from io import BytesIO
from zipfile import ZipFile
from sqlalchemy import create_engine
# SQL Alchemy setup
engine = create_engine('sqlite:///rankingdata.sqlite', echo=False)
# Excerpt from the initial API Call
data = {'projects': [{'name': 'Client1',
'id': '168',
'frequency': 'daily',
'depth': '5',
'kwcount': '80',
'last_updated': '2019-10-01',
'keywordstamp': 1569941983},
{
"depth": "5",
"frequency": "ondemand",
"id": "194",
"kwcount": "10",
"last_updated": "2019-09-30",
"name": "Client2",
"timestamp": 1570610327
},
{
"depth": "5",
"frequency": "ondemand",
"id": "196",
"kwcount": "100",
"last_updated": "2019-09-30",
"name": "Client3",
"timestamp": 1570610331
}
]}
#setup
api_url = 'https://api.awrcloud.com/v2/get.php?action='
urls = [] # processed URLs
urlbacklog = [] # URLs that didn't return a downloadable File
# API call to receive URL containing downloadable zip and csv
def getRankingData(project):
    action = 'get_dates'
    response = requests.get(''.join([api_url, action]),
                            params=dict(token=dotenv_values()['AWR_API'],
                                        project=project))
    response = response.json()
    action2 = 'topsites_export'
    rankDateData = requests.get(''.join([api_url, action2]),
                                params=dict(token=dotenv_values()['AWR_API'],
                                            project=project,
                                            startDate=response['details']['dates'][0]['date'],
                                            stopDate=response['details']['dates'][-1]['date']))
    rankDateData = rankDateData.json()
    print(rankDateData['details'])
    urls.append(rankDateData['details'])
    processRankingdata(rankDateData['details'])

# API call to download and unzip csv data and process it in pandas
def processRankingdata(url):
    content = requests.get(url)
    # {"response_code":25,"message":"Export in progress. Please come back later"}
    if "response_code" not in content:
        f = ZipFile(BytesIO(content.content))
        # print(f.namelist()) to get all filenames in Zip
        with f.open(f.namelist()[0], 'r') as g:
            rankingdatadf = pd.read_csv(g)
        rankingdatadf = rankingdatadf[rankingdatadf['Search Engine'].str.contains("Google")]
        domain = []
        for row in rankingdatadf['URL']:
            domain.append(row.split("//")[-1].split("/")[0].split('?')[0])
        rankingdatadf['Domain'] = domain
        rankingdatadf['Domain'] = rankingdatadf['Domain'].str.replace('www.', '')
        rankingdatadf = rankingdatadf.drop(columns=['Title', 'Meta description', 'Snippet', 'Page'])
        print(rankingdatadf['Search Engine'][0])
        writeData(rankingdatadf)
    else:
        urlbacklog.append(url)
        pass

# Finally write the data to database
def writeData(rankingdatadf):
    table_name_from_file = project['name']
    check = engine.has_table(table_name_from_file)
    print(check)  # boolean
    if check == False:
        rankingdatadf.to_sql(table_name_from_file, con=engine)
        print(project['name'] + ' ...Done')
    else:
        print(project['name'] + ' ... already in DB')

for project in data['projects']:
    getRankingData(project['name'])
The problem seems to be the split call on a float and not necessarily the download. Try changing line 79
from
domain.append(row.split("//")[-1].split("/")[0].split('?')[0])
to
domain.append(str(str(str(row).split("//")[-1]).split("/")[0]).split('?')[0])
It looks like you're trying to parse the network location portion of the URL here; you can also use urllib.parse to make this easier instead of chaining all the splits:
from urllib.parse import urlparse
...
for row in rankingdatadf['URL']:
domain.append(urlparse(row).netloc)
I think a malformed URL is causing your issues; to diagnose which row it is, try:
for row in rankingdatadf['URL']:
    try:
        domain.append(urlparse(row).netloc)
    except Exception:
        exit(row)
Looks like you figured it out above: you have a database entry with a NULL value for the URL field. I'm not sure what your fidelity requirements for this data set are, but you might want to enforce database rules for the URL field, or use pandas to drop rows where URL is NaN.
rankingdatadf = rankingdatadf.dropna(subset=['URL'])
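Putting the two suggestions together, the relevant part of processRankingdata could look something like this (just a sketch, assuming the same 'URL' column as above):
from urllib.parse import urlparse

# Drop rows with a missing URL, then keep only the network location part.
rankingdatadf = rankingdatadf.dropna(subset=['URL'])
rankingdatadf['Domain'] = [urlparse(row).netloc for row in rankingdatadf['URL']]
rankingdatadf['Domain'] = rankingdatadf['Domain'].str.replace('www.', '')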
I have a .json.gz file that I wish to load into elastic search.
My first attempt involved using the json module to convert the JSON to a list of dicts.
import gzip
import json
from pprint import pprint
from elasticsearch import Elasticsearch
nodes_f = gzip.open("nodes.json.gz")
nodes = json.load(nodes_f)
Dict example:
pprint(nodes[0])
{u'index': 1,
u'point': [508163.122, 195316.627],
u'tax': u'fehwj39099'}
Using Elasticsearch:
es = Elasticsearch()
data = es.bulk(index="index",body=nodes)
However, this returns:
elasticsearch.exceptions.RequestError: TransportError(400, u'illegal_argument_exception', u'Malformed action/metadata line [1], expected START_OBJECT or END_OBJECT but found [VALUE_STRING]')
Beyond this, I wish to be able to find the tax for a given point query, in case this has an impact on how I should index the data with Elasticsearch.
Alfe pointed me in the right direction, but I couldn't get his code to work.
I found two solutions:
Line by line with a for loop:
es = elasticsearch.Elasticsearch()

for node in nodes:
    _id = node['index']
    es.index(index='nodes', doc_type='external', id=_id, body=node)
In bulk, using helper:
from elasticsearch import helpers

actions = [
    {
        "_index": "nodes_bulk",
        "_type": "external",
        "_id": str(node['index']),
        "_source": node
    }
    for node in nodes
]

helpers.bulk(es, actions)
Bulk was around 22 times faster for a list of 343724 dicts.
Here is my working code using bulk api:
Define a list of dicts:
from elasticsearch import Elasticsearch, helpers
es = Elasticsearch([{'host':'localhost', 'port': 9200}])
doc = [{'_id': 1,'price': 10, 'productID' : 'XHDK-A-1293-#fJ3'},
{'_id':2, "price" : 20, "productID" : "KDKE-B-9947-#kL5"},
{'_id':3, "price" : 30, "productID" : "JODL-X-1937-#pV7"},
{'_id':4, "price" : 30, "productID" : "QQPX-R-3956-#aD8"}]
helpers.bulk(es, doc, index='products',doc_type='_doc', request_timeout=200)
The ES bulk library showed several problems for us, including performance trouble and not being able to set specific _ids. But since the bulk API of ES is not very complicated, we did it ourselves:
import requests
headers = { 'Content-type': 'application/json',
'Accept': 'text/plain'}
jsons = []
for d in docs:
    _id = d.pop('_id')  # take _id out of dict
    jsons.append('{"index":{"_id":"%s"}}\n%s\n' % (_id, json.dumps(d)))
data = ''.join(jsons)
response = requests.post(url, data=data, headers=headers)
We needed to set a specific _id but I guess you can skip this part in case you want a random _id set by ES automatically.
Hope that helps.
I'm working with pygsheets (http://pygsheets.readthedocs.io/en/latest/index.html), a wrapper around the Google Sheets API v4.
I have a small script where I am trying to select a JSON file to upload to a Google Sheets worksheet. The file name is made up of:
spreadsheetname_year_month_xxx.json
The code:
import tkinter as tk
import tkFileDialog
import pygsheets
root = tk.Tk()
root.withdraw()
file_path = tkFileDialog.askopenfilename()
print file_path
file_name = file_path.split('/')[-1]
print file_name
file_name_segments = file_name.split('_')
spreadsheet = file_name_segments[0]
worksheet = file_name_segments[1]+'_'+file_name_segments[2]
gc = pygsheets.authorize(outh_file='client_secret_xxxxxx.apps.googleusercontent.com.json')
ssheet = gc.open(spreadsheet)
ws = ssheet.add_worksheet(worksheet(ssheet,str(raw_input(file_path))))
The file path leads to a generated json file that looks like:
{
"count": 12,
"results": [
{
"case": "abc1",
"case_name": "invalid",
"case_type": "invalid",
},
{
"case": "abc2",
"case_name": "invalid",
"case_type": "invalid",
},
............
I am getting :
File "upload_to_google_sheets.py", line 27, in <module>
ws = ssheet.add_worksheet(worksheet(ssheet,str(raw_input(file_path))))
TypeError: 'unicode' object is not callable
As you can see I'm trying to instantiate a worksheet with the json data. What am I doing wrong?
The JsonSheet param is not for the values of the worksheet; it is for specifying the properties of the worksheet, so it needs to follow that format.
As directly converting JSON to a spreadsheet is rather ambiguous, you will need to first convert it to a numpy array (matrix) or a pandas DataFrame. Then you can use set_as_df or just update_values to update the spreadsheet.
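For example, a rough sketch (assuming the "results" list in your JSON is what you want on the sheet, and reusing the file_path, spreadsheet, worksheet and gc variables from your script) could look like:
import json
import pandas as pd

with open(file_path) as f:
    payload = json.load(f)

# Flatten the "results" list of dicts into a DataFrame.
df = pd.DataFrame(payload["results"])

ssheet = gc.open(spreadsheet)
ws = ssheet.add_worksheet(worksheet)  # title only; sheet properties use defaults
ws.set_as_df(df)                      # write the DataFrame into the new worksheet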