I have developed a web application with Neo4j (py2neo), Flask and HTML. I build the Neo4j graph by reading a CSV file. The CSV file is quite large, so every time I start or restart the server the graph is rebuilt from scratch, and this setup takes a long time. Is there any way I can persist the graph on my machine and reuse it? My project structure is: models.py connects to the graph, creates the graph with its relationships, and returns data to routes.py.
My models.py looks like this
from py2neo import Graph, Node, Relationship, NodeMatcher
import pandas as pd

class Query:
    "Model the tags"
    print("I am coming here")
    graph = Graph("http://blah:blah#127.0.0.1/db/data")
    print("hi i am second")
    try:
        graph.run("Match () Return 1 Limit 1")
        df = pd.read_csv("data.csv")
        print("I am not reading till")
        graph.delete_all()
        matcher = NodeMatcher(graph)
        row_count = len(df)
        for i in range(row_count):
            tags = df['tags'][i]
            each_tag = tags.split('|')
            tag_data = [x.lower() for x in each_tag]
            for j, first_tag in enumerate(tag_data[:-1]):
                match_first_tag = matcher.match("May21tag", tagName=first_tag).first()
                graph.push(match_first_tag)
                for second_tag in tag_data[j+1:]:
                    my_tag = second_tag
                    second_tag = matcher.match("May21tag", tagName=second_tag).first()
                    graph.push(second_tag)
                    create_relationship = Relationship(match_first_tag, "tagged", second_tag)
                    graph.merge(create_relationship)
    except Exception as e:
        print('not ok', e)

    def __init__(self, tech):
        tech_word = tech
        print("in init")

    def fetch_nodes(self, tech_word):
        graph = Graph("http://blah:blah#127.0.0.1/db/data")
        matcher = NodeMatcher(graph)
        match_nodes = matcher.match(tagName=tech_word).first()
        return list(r.end_node["tagName"] for r in self.graph.match(nodes=(match_nodes,)).limit(6))
What is happening here is that whenever I run app.py, the graph and relationships are created all over again, which takes time. What I need is to reuse the existing graph and its relationships whenever I start or restart the server.
Try deleting everything in Query except the __init__ and fetch_nodes method definitions. Everything else in the class body runs as soon as the module is imported, which is why the graph and relationships are rebuilt on every start or restart.
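If it helps, here is a rough sketch of what is left once the loading code is removed (it assumes the graph has already been built once, e.g. by running the CSV load in a separate one-off script such as a hypothetical load_graph.py):

from py2neo import Graph, NodeMatcher

class Query:
    "Model the tags (read-only: the graph is assumed to have been built already)"

    def __init__(self, tech):
        self.tech_word = tech

    def fetch_nodes(self, tech_word):
        # Connect on demand; no CSV reading or delete_all() here, so restarts are cheap.
        graph = Graph("http://blah:blah#127.0.0.1/db/data")
        matcher = NodeMatcher(graph)
        match_nodes = matcher.match(tagName=tech_word).first()
        return [r.end_node["tagName"] for r in graph.match(nodes=(match_nodes,)).limit(6)]

Neo4j persists its data on disk, so the nodes and relationships created by the one-off load are still there after every Flask restart; the web app only needs to connect and query.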
First I'd like to check if the file exists, and I've used os.path for this:
import os
from os.path import exists

def check_db_exist():
    try:
        file_exists = exists('games.db')
        if file_exists:
            file_size = os.path.getsize('games.db')
            if file_size > 3000:
                return True, file_size
            else:
                return False, 'too small'
        else:
            return False, 'does not exist'
    except:
        return False, 'error'
I have a separate file for my models and for creating the database. My concern is that if I import the database class, it instantiates the SQLite file.
Moreover, pywebview wipes all variables when displaying my HTML.
If I were to run this check as I load my page, I couldn't access the true/false variable saying whether the SQLite file exists.
from peewee import SqliteDatabase, Model, CharField, IntegerField

db = SqliteDatabase('games.db')

class Game(Model):
    game = CharField()
    exe = CharField()
    path = CharField()
    longpath = CharField()
    i_d = IntegerField()

    class Meta:
        database = db
This creates the table, so checking whether the file exists is useless.
If I uncomment the first line in this file, the database gets created; otherwise all of my db variables are unusable. I must be missing a really obvious way to solve this.
# db = SqliteDatabase('games.db')

def add_game(game, exe, path, longpath, i_d):
    try:
        Game.create(game=game, exe=exe, path=path, longpath=longpath, i_d=i_d)
    except:
        pass

def loop_insert(lib):
    db.connect()
    for i in lib[0]:
        add_game(i.name, i.exe, i.path, i.longpath, i.id)
    db.close()

def initial_retrieve():
    db.connect()
    vals = ''
    for games in Game.select():
        val = js.Import.javascript(str(games.game), str(games.exe), str(games.path), games.i_d)
        vals = vals + val
    storage = vals
    db.close()
    return storage
Should I just import the file at a different point in the file, wherever it's convenient? I haven't seen that done often, so I didn't want to be improper in formatting.
Edit: Maybe more like this?
def db():
    db = SqliteDatabase('games.db')
    return db

class Game(Model):
    game = CharField()
    exe = CharField()
    path = CharField()
file 2:
from sqlmodel import db, Game

def add_game(game, exe, path, longpath, i_d):
    try:
        Game.create(game=game, exe=exe, path=path, longpath=longpath, i_d=i_d)
    except:
        pass

def loop_insert(lib):
    db.connect()
    for i in lib[0]:
        add_game(i.name, i.exe, i.path, i.longpath, i.id)
    db.close()
I am not sure if this answers your question, since it seems to involve multiple processes and/or processors, but in order to check for the existence of a database file, I have used the following:
import os

DATABASE = 'dbfile.db'

if os.path.isfile(DATABASE) is False:
    # Create the database file here
    pass
else:
    # connect to database here
    db.connect()
I would suggest using sqlite's user_version pragma:
db = SqliteDatabase('/path/to/db.db')

version = db.pragma('user_version')
if not version:  # Assume does not exist/newly-created.
    # do whatever.
    db.pragma('user_version', 1)  # Set user version.
From reddit:
me: To the original challenge, there's a reason I want to know whether the file exists. Maybe it's flawed at the premise; I'll explain and you can fill in from there.
This script will run on multiple machines I don't have access to. At the entry point of a first-time use case I will be porting data from a remote location; if it's the first time the script runs on that machine, it goes down a different workflow than a repeated opening.
Akin to grabbing all PC programs vs. appending to and reading from the last session. How would you suggest quickly understanding if that process has started and finished from a previous session?
Checking whether the SQLite file had been created made the most intuitive sense, and then adjusting for byte size. Let me know.
them:
This is a good question!
How would you suggest quickly understanding if that process has started and finished from a previous session?
If the first thing your program does on a new system is download some kind of fixture data, then the way I would approach it is to load the DB file as normal, have Peewee ensure the tables exist, and then do a no-clause SELECT on one of them (either through the model, or directly on the database through the connection if you want.) If it's empty (you get no results) then you know you're on a fresh system and you need to make the remote call. If you get results (you don't need to know what they are) then you know you're not on a fresh system.
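A minimal sketch of that approach with peewee (the model is an abridged copy of the Game model above, and is_fresh_install is just an illustrative name):

from peewee import SqliteDatabase, Model, CharField

db = SqliteDatabase('games.db')

class Game(Model):  # abridged version of the model shown above
    game = CharField()

    class Meta:
        database = db

def is_fresh_install():
    db.connect(reuse_if_open=True)
    db.create_tables([Game])  # idempotent: only creates missing tables
    # A no-clause SELECT: any row at all means a previous session already ran.
    return not Game.select().exists()

if is_fresh_install():
    ...  # first-time workflow: port data from the remote location
else:
    ...  # repeat workflow: read what the last session stored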
I currently have my Flask app running statically: I run the ETL job independently and then use the resulting dataset in Flask to display Chart.js line charts.
However, I'd like to integrate the ETL piece into my web framework so that my users can log in and submit the input parameters (the locations of the input files plus a version id) via an HTML form. These would then be used by my ETL job, and the resulting data would be used directly to display the charts on the same page.
Current Setup:
My custom ETL module has submodules that act together to form a simple pipeline process:
globals.py - has my globals, such as the S3 location etc. Ideally I'd like my users' form inputs to be stored here so that they can be used directly in all my submodules wherever necessary.
s3_bkt = 'abc'  # change bucket here
s3_loc = 's3://' + s3_bkt + '/'
ip_loc = 'alv-input/'
# Ideally, I'd like my users' form inputs to be sitting here
# ip1 = 'alv_ip.csv'
# ip2 = 'Input_product_ICL_60K_Seg15.xlsx'
# version = 'v1'
op_loc = 'alv-output/'
---main_module.py - main function
import module3 as m3
import globals as g

def main(ip1, ip2, version):
    data3, ip1, ip2, version = m3.module3(ip1, ip2, version)
    # ---- perform some actions on the data and return ----
    return res_data
---module3.py
import module2 as m2

def mod3(ip1, ip2, version):
    data2, ip1, ip2, version = m2.mod2(ip1, ip2, version)
    # ---- perform some actions on the data and return ----
    return data3
---module2.py
import module1 as m1
import globals as g
import pandas as pd

def mod2(ip1, ip2, version):
    data1, ip1, ip2, version = m1.mod1(ip1, ip2, version)
    data_cnsts = pd.read_csv(ip2)  # this is where I'll be using the user's input for ip2
    # ---- perform some actions on the datasets, write them out with the version id, then return ----
    data1.to_csv(g.op_loc + 'data2-' + version + '.csv', index=False)
    return data2
---module1.py
import globals as g
import pandas as pd

def mod1(ip1, ip2, version):
    # this is where the form input for the data location should actually be used
    data = pd.read_csv(g.s3_loc + g.ip_loc + ip1)
    # ---- perform some actions on the data and return ----
    return data1
Flask setup:
from flask import Flask, render_template, request
import main_module as mm

app = Flask(__name__)

# this is where the user first hits and submits the form
@app.route('/form')
def form():
    return render_template('form.html')

@app.route('/result/', methods=['GET', 'POST'])
def upload():
    msg = ''
    if request.method == 'GET':
        return f"The URL /data is accessed directly. Try going to '/upload' to submit form"
    if request.method == 'POST':
        ip1 = request.form['ip_file']
        ip2 = request.form['ip_sas_file']
        version = request.form['version']
        data = mm.main(ip1, ip2, version)
        grpby_vars = ['a', 'b', 'c']
        grouped_data = data.groupby(['mob'])[grpby_vars].mean().reset_index()

        # labels for the chart
        a = [val for val in grouped_data['a']]
        # values for the chart
        b = [round(val*100, 3) for val in grouped_data['b']]
        c = [round(val*100, 3) for val in grouped_data['c']]
        d = [val for val in grouped_data['d']]

        return render_template('results.html', title='Predictions', a=a, b=b, c=c, d=d)
The Flask setup works perfectly fine without any form inputs from the user (i.e. when the ETL job and Flask are decoupled: I run the ETL separately and supply the location of the result data to Flask directly).
Problem:
The problem after integration is that I'm not sure how to pass these user inputs down to all my submodules. Below is the error I get with the current setup.
data3,ip1,ip2,version = m3.module3(ip1,ip2,version)
TypeError: module3() missing 3 required positional arguments: 'ip1', 'ip2', and 'version'
So this is definitely due to how I'm passing parameters across my submodules.
So my question is: how do I use the data from the form as global variables across my submodules? Ideally I'd like them to be stored in my globals so that I don't have to pass them as parameters through every module.
Is there a standard way to achieve that? It might sound trivial, but I'm struggling to get to my end state.
Thanks for reading through :)
I realized my dumb mistake; I should have just used
data,ip1,ip2,version = mm.main(ip1,ip2,version)
I could instead use my globals file by initialising the inputs as empty strings, then import my globals in the Flask file and update the values there. This way I can avoid passing the params through all my submodules, as sketched below.
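A rough sketch of that globals-based variant, reusing the ip1/ip2/version names already present (commented out) in globals.py:

# globals.py - initialise the inputs with empty strings
ip1 = ''
ip2 = ''
version = ''

# in the Flask file shown above - update the module attributes before kicking off the ETL
import globals as g
import main_module as mm

@app.route('/result/', methods=['GET', 'POST'])
def upload():
    if request.method == 'POST':
        g.ip1 = request.form['ip_file']
        g.ip2 = request.form['ip_sas_file']
        g.version = request.form['version']
        data = mm.main(g.ip1, g.ip2, g.version)  # submodules can also read g.ip1 etc. directly
        ...

Keep in mind that module-level globals are shared by every request, so two users submitting forms at the same time can overwrite each other's inputs; passing the parameters explicitly, as in the one-line fix above, avoids that.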
I recently found out about sqlalchemy in Python. I'd like to use it for data science rather than website applications.
I've been reading about it and I like that you can translate the sql queries into Python.
The main thing I'm confused about is this:
Since I'm reading data from an already well established schema, I wish I didn't have to create the corresponding models myself.
I am able to get around that by reading the metadata for the tables and then just querying the tables and columns.
The problem is that when I want to join to other tables, this metadata reading takes too long each time, so I'm wondering if it makes sense to cache it in a pickled object, or if there's a built-in method for that.
Edit: Include code.
I noticed that the waiting time was due to an error in the loading function rather than to how the engine is used. Still leaving the code in case people comment something useful. Cheers.
The code I'm using is the following:
import os
import pickle as pkl
import pandas as pd
import sqlalchemy as alq
import sqlalchemy.orm  # makes alq.orm available below

def reflect_engine(engine, update):
    store = f'cache/meta_{engine.logging_name}.pkl'
    if update or not os.path.isfile(store):
        meta = alq.MetaData()
        meta.reflect(bind=engine)
        with open(store, "wb") as opened:
            pkl.dump(meta, opened)
    else:
        with open(store, "rb") as opened:  # binary mode is required to unpickle
            meta = pkl.load(opened)
    return meta

def begin_session(engine):
    session = alq.orm.sessionmaker(bind=engine)
    return session()
Then I use the metadata object to get my queries...
def get_some_cars(engine, metadata):
    session = begin_session(engine)

    Cars = metadata.tables['Cars']
    Makes = metadata.tables['CarManufacturers']

    cars_cols = [getattr(Cars.c, each_one) for each_one in [
        'car_id',
        'car_selling_status',
        'car_purchased_date',
        'car_purchase_price_car']] + [
        Makes.c.car_manufacturer_name]

    statuses = {
        'selling': ['AVAILABLE', 'RESERVED'],
        'physical': ['ATOURLOCATION']}

    inventory_conditions = alq.and_(
        Cars.c.purchase_channel == "Inspection",
        Cars.c.car_selling_status.in_(statuses['selling']),
        Cars.c.car_physical_status.in_(statuses['physical']),)

    the_query = (session.query(*cars_cols).
                 join(Makes, Cars.c.car_manufacturer_id == Makes.c.car_manufacturer_id).
                 filter(inventory_conditions).
                 statement)

    the_inventory = pd.read_sql(the_query, engine)
    return the_inventory
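For completeness, a hypothetical usage of the two helpers above (the connection URL and logging_name are placeholders, and the cache/ directory must already exist):

engine = alq.create_engine("mysql+pymysql://user:password@host/dbname", logging_name="mydb")
meta = reflect_engine(engine, update=False)  # first call reflects and caches; later calls load cache/meta_mydb.pkl
inventory = get_some_cars(engine, meta)
print(inventory.head())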
I was trying to implement a dashboard by following the instructions at https://github.com/talpor/django-dashing/ using django-dashing.
So far I have successfully customised my widget and displayed it with some random data on my own web server, but I have no clue where to start if I'd like to pull some real data from a database (MySQL) and display it (e.g. where to do the DB connection, etc.).
Could anyone show me the steps I should follow to implement this?
If it's still relevant, you can start by connecting to the database with sqlalchemy.
import sqlalchemy as sq
from sqlalchemy.engine import url as sq_url

db_connect_url = sq_url.URL(
    drivername='mysql+mysqldb',
    username=DB_username,
    password=DB_password,
    host=DB_hostname,
    port=DB_port,
    database=DB_name,
)

engine = sq.create_engine(db_connect_url)
From there you can manipulate the data by checking the available methods on engine. What I normally do is use pandas in situations like these.
import pandas as pd
df = pd.read_sql_table(table_name, engine)
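As a rough sketch of wiring that into a widget, assuming django-dashing's NumberWidget (the same base class the other answer here uses) and a placeholder games table:

import pandas as pd
from dashing.widgets import NumberWidget  # import path assumed

class TotalGamesWidget(NumberWidget):
    def get_value(self):
        # 'games' is a placeholder table name; engine is the sqlalchemy engine created above
        df = pd.read_sql_table('games', engine)
        return str(len(df))

The widget is then registered in urls.py and dashing-config.js as usual.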
I also had to do this recently... I managed to get it sorted, but it is a little clunky.
I created a separate CherryPy REST API in a separate project. The entry point looks like this:
@cherrypy.expose
def web_api_to_call(self, table, value):
    # Do SQL Query
    return str(sql_table_value)
Then in Django I created a new app and added a Widget.py. Inside the Widget.py I wrote something like this:
import requests
from dashing.widgets import NumberWidget

class webquery(NumberWidget):
    classparams = [("widget1", "web_api_to_call", "table", "values"),
                   ("widget2", "web_api_to_call", "table2", "values2"),
                   ("widget3", "web_api_to_call", "table3", "values3")]

    def my_get(self):
        for tup in self.classparams:
            if tup[0] == type(self).__name__:
                url = tup[1]
                table = tup[2]
                value = tup[3]
        url = "http://127.0.0.1:8000/" + url
        # Do Web Call (error checking omitted)
        return requests.get(url, params={"table": table, "values": value}).text

    def get_value(self):
        # Override default
        return self.my_get()

# Now create new Widgets as per the static definition at the top
class widget1(webquery):
    id = 1

class widget2(webquery):
    id = 1

class widget3(webquery):
    id = 1
You now just add your new widgets, as you normally would, in urls.py and then in dashing-config.js, and you are done.
Hope this assists someone.
I am using the cloudant python library to connect to my cloudant account.
Here is the code I have so far:
import cloudant

class WorkflowsCloudant(cloudant.Account):
    def __init__(self):
        super(WorkflowsCloudant, self).__init__(settings.COUCH_DB_ACCOUNT_NAME,
                                                auth=(settings.COUCH_PUBLIC_KEY,
                                                      settings.COUCH_PRIVATE_KEY))

@blueprint.route('/<workflow_id>')
def get_single_workflow(account_id, workflow_id):
    account = WorkflowsCloudant()
    db = account.database(settings.COUCH_DB_NAME)
    doc = db.document(workflow_id)
    resp = doc.get().json()

    if resp['account_id'] != account_id:
        return error_helpers.forbidden('Invalid Account')

    return jsonify(resp)
This Flask controller will have CRUD operations inside of it, but with the current implementation I will have to set the account and db variables in each method before performing operations on the document I want to view/manipulate. How can I clean up (or DRY up) my code so that I only have to call my main WorkflowsCloudant class?
I don't know cloudant, so I may be totally off base, but I believe this answers your question:
Delete the account, db, and doc lines from get_single_workflow.
Add the following lines to __init__ (which will then need to accept workflow_id as a parameter):
db = self.database(settings.COUCH_DB_NAME)
self.doc = db.document(workflow_id)
Change the resp line in get_single_workflow to:
resp = WorkflowsCloudant(workflow_id).doc.get().json()
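Putting that together, a sketch of the refactored code (settings, blueprint, error_helpers and jsonify come from your existing project and imports, which are assumed here):

import cloudant

class WorkflowsCloudant(cloudant.Account):
    def __init__(self, workflow_id):
        super(WorkflowsCloudant, self).__init__(settings.COUCH_DB_ACCOUNT_NAME,
                                                auth=(settings.COUCH_PUBLIC_KEY,
                                                      settings.COUCH_PRIVATE_KEY))
        # The account/db/doc setup now lives in one place.
        db = self.database(settings.COUCH_DB_NAME)
        self.doc = db.document(workflow_id)

@blueprint.route('/<workflow_id>')
def get_single_workflow(account_id, workflow_id):
    resp = WorkflowsCloudant(workflow_id).doc.get().json()
    if resp['account_id'] != account_id:
        return error_helpers.forbidden('Invalid Account')
    return jsonify(resp)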