rdflib graph not updated. Why? - python

I am trying to understand this behavior. It's definitely not what I expect. I have two programs, one reader and one writer. The reader opens an RDFLib graph store, then performs a query every 2 seconds:
import rdflib
import random
from rdflib import store
import time

default_graph_uri = "urn:uuid:a19f9b78-cc43-4866-b9a1-4b009fe91f52"

s = rdflib.plugin.get('MySQL', store.Store)('rdfstore')
config_string = "host=localhost,password=foo,user=foo,db=foo"
rt = s.open(config_string, create=False)
if rt != store.VALID_STORE:
    s.open(config_string, create=True)

while True:
    graph = rdflib.ConjunctiveGraph(s, identifier=rdflib.URIRef(default_graph_uri))
    rows = graph.query("SELECT ?id ?value { ?id <http://localhost#ha> ?value . }")
    for r in rows:
        print r[0], r[1]
    time.sleep(2)
    print " - - - - - - - - "
The second program is a writer that adds data to the triplestore:
import rdflib
import random
from rdflib import store

default_graph_uri = "urn:uuid:a19f9b78-cc43-4866-b9a1-4b009fe91f52"

s = rdflib.plugin.get('MySQL', store.Store)('rdfstore')
config_string = "host=localhost,password=foo,user=foo,db=foo"
rt = s.open(config_string, create=False)
if rt != store.VALID_STORE:
    s.open(config_string, create=True)

graph = rdflib.ConjunctiveGraph(s, identifier=rdflib.URIRef(default_graph_uri))
graph.add((
    rdflib.URIRef("http://localhost/" + str(random.randint(0, 100))),
    rdflib.URIRef("http://localhost#ha"),
    rdflib.Literal(str(random.randint(0, 100)))
))
graph.commit()
I would expect the number of results on the reader to grow as I add triples with the writer, but this does not happen. The reader keeps returning the same results it returned when it started. If, however, I stop the reader and restart it, the new results appear.
Does anybody know what I am doing wrong?

One easy fix is to put graph.commit() just after the line graph = rdflib.ConjunctiveGraph(...) in the reader.
I'm not sure exactly why committing before reading fixes this, but my guess is:
- When the MySQLdb connection is opened, a transaction is started automatically.
- That transaction does not see updates made by other, later transactions.
- graph.commit() bubbles down to a connection.commit() somewhere, which discards the stale transaction and starts a new one.

Related

How to iterate through csv rows, apply a function to those values and append to new column?

I have a Python script which calculates tree heights based on distance and angle from the ground; however, despite the script running with no errors, my heights column is left empty. Also, before anyone suggests going about it a different way, I don't want to use pandas and would like to keep to the 'with open' method if possible. Any help would be great, thanks. It seems that the whole script runs fine and does everything I need it to until the "for row in csvread:" block.
This is my current script:
#!/usr/bin/env python3
# Import any modules needed
import sys
import csv
import math
import os
import itertools

# Extract command line arguments, remove file extension and attach to output_filename
input_filename1 = sys.argv[1]
input_filename2 = os.path.splitext(input_filename1)[0]
filenames = (input_filename2, "treeheights.csv")
output_filename = "".join(filenames)

def TreeHeight(degrees, distance):
    """
    This function calculates the heights of trees given distance
    of each tree from its base and angle to its top, using the
    trigonometric formula.
    """
    radians = math.radians(degrees)
    height = distance * math.tan(radians)
    print("Tree height is:", height)
    return height

def main(argv):
    with open(input_filename1, 'r') as f:
        with open(output_filename, 'w') as g:
            csvread = csv.reader(f)
            print(csvread)
            csvwrite = csv.writer(g)
            header = csvread.__next__()
            header.append("Height.m")
            csvwrite.writerow(header)
            # Populating the output csv with the input data
            csvwrite.writerows(itertools.islice(csvread, 0, 121))
            for row in csvread:
                height = TreeHeight(csvread[:,2], csvread[:,1])
                row.append(height)
                csvwrite.writerow(row)
    return 0

if __name__ == "__main__":
    status = main(sys.argv)
    sys.exit(status)
Looking at your code, I think you're mostly there, but are a little confused on reading/writing rows:
# Populating the output csv with the input data
csvwrite.writerows(itertools.islice(csvread, 0, 121))
for row in csvread:
    height = TreeHeight(csvread[:,2], csvread[:,1])
    row.append(height)
    csvwrite.writerow(row)
It looks like you're reading rows 1 through 121 and writing them to your new file. Then, in a complete second pass, you're trying to iterate over your CSV reader again, compute the height, tack that computed value onto the end of each row, and write it to your CSV.
If that's true, then you need to understand that the CSV reader and writer are not designed to work "left-to-right" like that: read-write these columns, then read-write those columns... nope.
They both work "top-down", processing rows.
To get this working, I propose iterating over every row in a single loop, and for every row:
- read the values you need from the row to compute the height
- compute the height
- append the computed height to the original row
- write the row
...
header = next(csvread)
header.append("Height.m")
csvwrite.writerow(header)

for row in csvread:
    degrees = float(row[1])   # second column for degrees?
    distance = float(row[0])  # first column for distance?
    height = TreeHeight(degrees, distance)
    row.append(height)
    csvwrite.writerow(row)
Some changes I made:
- I replaced header = csvread.__next__() with header = next(csvread). Calling things that start with _ or __ is generally discouraged, at least in the standard library. next(<iterator>) is the built-in function that lets you properly and safely advance through <iterator>.
- Added float() conversion of the textual values read from the CSV.
- Also, as far as I can tell, csvread[:,2] / csvread[:,1] is incorrect subscripting/slice syntax for a csv reader. You didn't get any errors because the reader was already exhausted by the islice() call, so your program never actually stepped into the for row in csvread: loop.
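A quick way to convince yourself of that last point (a standalone snippet, not part of the original script): islice() pulls rows from the same underlying reader, so nothing is left for the for loop afterwards.

import csv, io, itertools

data = io.StringIO("angle,distance\n30,10\n45,20\n60,30\n")
reader = csv.reader(data)
next(reader)                                      # skip the header
copied = list(itertools.islice(reader, 0, 121))   # drains every remaining row
print(len(copied))                                # 3
print(list(reader))                               # [] - the reader is exhausted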

Effective way to create .csv file from MongoDB

I have a MongoDB database (media_mongo) with a collection main_hikari and a lot of data inside. I'm trying to make a function that creates a .csv file from this data as quickly as possible. I'm using this code, but it takes too much time and CPU:
import pymongo
import pandas as pd  # needed for pd.DataFrame / pd.Series below
from pymongo import MongoClient

mongo_client = MongoClient('mongodb://admin:password@localhost:27017')
db = mongo_client.media_mongo

def download_file(down_file_name="hikari"):
    docs = pd.DataFrame(columns=[])
    if down_file_name == "kokyaku":
        col = db.main_kokyaku
    if down_file_name == "hikari":
        col = db.main_hikari
    if down_file_name == "hikanshou":
        col = db.main_hikanshou
    cursor = col.find()
    mongo_docs = list(cursor)
    for num, doc in enumerate(mongo_docs):
        doc["_id"] = str(doc["_id"])
        doc_id = doc["_id"]
        series_obj = pd.Series(doc, name=doc_id)
        docs = docs.append(series_obj)
    csv_export = docs.to_csv("file.csv", sep=",")

download_file()
My database has data in this format (sorry for the Japanese :D):
_id:"ObjectId("5e0544c4f4eefce9ee9b5a8b")"
事業者受付番号:"data1"
開通区分/処理区分:"data2"
開通ST/処理ST:"data3"
申込日,顧客名:"data4"
郵便番号:"data5"
住所1:"data6"
住所2:"data7"
連絡先番号:"data8"
契約者電話番号:"data9"
And about 150000 entries like this
If you have a lot of data as you indicate, then this line is going to hurt you:
mongo_docs = list(cursor)
It basically means: read the entire collection into a client-side list at once. This will create a huge memory high-water mark.
It is better to use mongoexport (as noted above) or to walk the cursor yourself instead of letting list() slurp the whole thing, e.g.:
cursor = col.find()
for doc in cursor:
    # read docs one at a time
or to be very pythonic about it:
for doc in col.find():  # or find(expression of your choice)
    # read docs one at a time
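As a concrete illustration of walking the cursor, here is a sketch (not from the original answer) that streams the documents straight to a CSV file with the standard csv module instead of building a DataFrame. It assumes every document shares the fields of the first one and reuses the connection details from the question.

import csv
from pymongo import MongoClient

def stream_to_csv(collection, out_path):
    """Stream a MongoDB collection to CSV one document at a time."""
    cursor = collection.find()
    first = next(cursor, None)
    if first is None:
        return  # empty collection, nothing to write
    first["_id"] = str(first["_id"])
    with open(out_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=list(first.keys()), extrasaction="ignore")
        writer.writeheader()
        writer.writerow(first)
        for doc in cursor:
            doc["_id"] = str(doc["_id"])
            writer.writerow(doc)

client = MongoClient('mongodb://admin:password@localhost:27017')
stream_to_csv(client.media_mongo.main_hikari, "file.csv")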

How to use variables in query_to_pandas

Just got dumped into SQL with BigQuery and stuff, so I don't know a lot of the terms for this kind of thing. I'm currently trying to make a method where you input a string (the name of the dataset you want to pull from), but I can't seem to put a string into the variable I want without it returning errors.
I looked up how to put variables into SQL queries, but most of those solutions didn't fit my case. I ended up adding $s inside the query and an s before the """ string (which ended up with a syntax error):
import pandas as pd
import bq_helper
from bq_helper import BigQueryHelper
# Some code about using BQ_helper to get the data, if you need it lmk
# test = `data.patentsview.application`
query1 = s"""
SELECT * FROM $s
LIMIT
20;
"""
response1 = patentsview.query_to_pandas_safe(query1)
response1.head(20)
With the code above, it returns this error:
File "<ipython-input-63-6b07957ebb81>", line 8
"""
^
SyntaxError: invalid syntax
EDIT:
The original code that works, but which I would have to brute-force manually, is this:
query1 = """
SELECT * FROM `patents-public-data.patentsview.application`
LIMIT
20;
"""
response1 = patentsview.query_to_pandas_safe(query1)
response1.head(20)
If I understand you correctly, this may be what you're looking for:
# making up some variables:
vars = ['`patents-public-data.patentsview.application', '`patents-private-data.patentsview.application']

for var in vars:
    query = f"""SELECT * FROM {var}
LIMIT
20;
"""
    print(query)
Output:
SELECT * FROM `patents-public-data.patentsview.application
LIMIT
20;
SELECT * FROM `patents-private-data.patentsview.application
LIMIT
20;
I believe this should help: https://cloud.google.com/bigquery/docs/parameterized-queries#bigquery_query_params_named-python:
To specify a named parameter, use the @ character followed by an identifier, such as @param_name.
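Note that query parameters can only stand in for values, not for identifiers such as table names, so swapping the table itself still needs the f-string approach shown above. For a value, a sketch using the official google-cloud-bigquery client (not the bq_helper wrapper from the question; the country column is just a hypothetical example to show the syntax) could look like this:

from google.cloud import bigquery

client = bigquery.Client()

# The table name is interpolated (parameters cannot replace identifiers);
# the filter value is passed as a named parameter.
table = "`patents-public-data.patentsview.application`"
query = f"""
SELECT * FROM {table}
WHERE country = @country   -- 'country' is a hypothetical column, shown for syntax only
LIMIT 20
"""

job_config = bigquery.QueryJobConfig(
    query_parameters=[bigquery.ScalarQueryParameter("country", "STRING", "US")]
)
df = client.query(query, job_config=job_config).to_dataframe()
print(df.head(20))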

How to optimize retrieval of 10 most frequent words inside a json data object?

I'm looking for ways to make the code more efficient (runtime and memory complexity).
Should I use something like a max-heap?
Is the bad performance due to the string concatenation, to sorting the dictionary not in-place, or to something else?
Edit: I replaced the dictionary/map object with a Counter applied to a list of all retrieved names (with duplicates).
Minimal requirement: the script should take less than 30 seconds.
Current runtime: it takes 54 seconds.
# Try to implement the program efficiently (running the script should take less than 30 seconds)
import requests
# Requests is an elegant and simple HTTP library for Python, built for human beings.
# Requests is the only Non-GMO HTTP library for Python, safe for human consumption.
# Requests is not a built in module (does not come with the default python installation), so you will have to install it:
# http://docs.python-requests.org/en/v2.9.1/
# installing it for pyCharm is not so easy and takes a lot of troubleshooting (problems with pip's main version)
# use conda/pip install requests instead
import json
# dict subclass for counting hashable objects
from collections import Counter
#import heapq
import datetime

url = 'https://api.namefake.com'
# a "global" list object. TODO: try to make it "static" (local to the file)
words = []

#####################################################################################
# Calls the site http://www.namefake.com 100 times and retrieves random names
# Examples for the format of the names from this site:
# Dr. Willis Lang IV
# Lily Purdy Jr.
# Dameon Bogisich
# Ms. Zora Padberg V
# Luther Krajcik Sr.
# Prof. Helmer Schaden etc....
#####################################################################################
requests.packages.urllib3.disable_warnings()

t = datetime.datetime.now()
for x in range(100):
    # for each name, break it to first and last name
    # no need for authentication
    # http://docs.python-requests.org/en/v2.3.0/user/quickstart/#make-a-request
    responseObj = requests.get(url, verify=False)
    # Decoding JSON data from returned response object text
    # Deserialize ``s`` (a ``str``, ``bytes`` or ``bytearray`` instance
    # containing a JSON document) to a Python object.
    jsonData = json.loads(responseObj.text)
    x = jsonData['name']
    newName = ""
    for full_name in x:
        # make a string from the decoded python object concatenation
        newName += str(full_name)
    # split by whitespaces
    y = newName.split()
    # parse the first name (check first if header exists (Prof. , Dr. , Mr. , Miss)
    if "." in y[0] or "Miss" in y[0]:
        words.append(y[2])
    else:
        words.append(y[0])
        words.append(y[1])

# Return the top 10 words that appear most frequently, together with the number of times each word appeared.
# Output example: ['Weber', 'Kris', 'Wyman', 'Rice', 'Quigley', 'Goodwin', 'Lebsack', 'Feeney', 'West', 'Marlen']
# (We don't care whether the word was a first or a last name)
# list of tuples
top_ten = Counter(words).most_common(10)
top_names_list = [name[0] for name in top_ten]

print((datetime.datetime.now() - t).total_seconds())
print(top_names_list)
You are calling an endpoint of an API that generates dummy information one person at a time - that takes a considerable amount of time.
The rest of the code takes almost no time.
Change the endpoint you are using (the one you use offers no bulk name gathering) or use dummy data provided by Python modules such as faker.
You can clearly see that "counting and processing names" is not the bottleneck here:
from faker import Faker # python module that generates dummy data
from collections import Counter
import datetime
fake = Faker()
c = Counter()
# get 10.000 names, split them and add 1st part
t = datetime.datetime.now()
c.update( (fake.name().split()[0] for _ in range(10000)) )
print(c.most_common(10))
print((datetime.datetime.now()-t).total_seconds())
Output for 10000 names:
[('Michael', 222), ('David', 160), ('James', 140), ('Jennifer', 134),
('Christopher', 125), ('Robert', 124), ('John', 120), ('William', 111),
('Matthew', 111), ('Lisa', 101)]
in
1.886564 # seconds
General advice for code optimization: measure first, then optimize the bottlenecks.
If you need a code review, you can check https://codereview.stackexchange.com/help/on-topic and see whether your code fits the requirements of the Code Review Stack Exchange site. As on SO, some effort should be put into the question first, i.e. analyzing where the majority of your time is being spent.
Edit - with performance measurements:
import requests
import json
from collections import defaultdict
import datetime

# defaultdict is (in this case) better then Counter because you add 1 name at a time
# Counter is superiour if you update whole iterables of names at a time
d = defaultdict(int)

def insertToDict(n):
    d[n] += 1

url = 'https://api.namefake.com'
api_times = []
process_times = []
requests.packages.urllib3.disable_warnings()

for x in range(10):
    # for each name, break it to first and last name
    try:
        t = datetime.datetime.now()  # start time for API call
        # no need for authentication
        responseObj = requests.get(url, verify=False)
        jsonData = json.loads(responseObj.text)
        # end time for API call
        api_times.append((datetime.datetime.now() - t).total_seconds())
        x = jsonData['name']
        t = datetime.datetime.now()  # start time for name processing
        newName = ""
        for name_char in x:
            # make a string from the decoded python object concatenation
            newName = newName + str(name_char)
        # split by whitespaces
        y = newName.split()
        # parse the first name (check first if header exists (Prof. , Dr. , Mr. , Miss)
        if "." in y[0] or "Miss" in y[0]:
            insertToDict(y[2])
        else:
            insertToDict(y[0])
            insertToDict(y[1])
        # end time for name processing
        process_times.append((datetime.datetime.now() - t).total_seconds())
    except:
        continue

newA = sorted(d, key=d.get, reverse=True)[:10]
print(newA)
print(sum(api_times))
print(sum(process_times))
Output:
['Ruecker', 'Clare', 'Darryl', 'Edgardo', 'Konopelski', 'Nettie', 'Price',
'Isobel', 'Bashirian', 'Ben']
6.533625
0.000206
You can make the parsing part better; I did not, because it does not matter for the total runtime.
It is better to use timeit for performance testing (it calls the code multiple times and averages, smoothing out artifacts due to caching/lag/...) (thanks @bruno desthuilliers) - in this case I did not use timeit because I do not want to call the API 100000 times to average the results.
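If the endpoint has to stay, the only real lever is overlapping the HTTP calls. Here is a minimal sketch (not from the original answer) that parallelises the requests with a thread pool while keeping the question's parsing logic; the worker count of 20 is an arbitrary assumption:

import requests
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

URL = 'https://api.namefake.com'  # same endpoint as in the question
requests.packages.urllib3.disable_warnings()

def fetch_name(_):
    """Fetch one random name; return None on any network/JSON error."""
    try:
        return requests.get(URL, verify=False, timeout=10).json()['name']
    except Exception:
        return None

# Overlap the 100 slow HTTP round-trips instead of running them back to back.
with ThreadPoolExecutor(max_workers=20) as pool:
    names = [n for n in pool.map(fetch_name, range(100)) if n]

words = []
for full_name in names:
    y = full_name.split()
    # same honorific handling as in the question (Prof., Dr., Mr., Miss, ...)
    if "." in y[0] or "Miss" in y[0]:
        words.append(y[2])
    else:
        words.append(y[0])
        words.append(y[1])

print(Counter(words).most_common(10))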

Using Python gdata to clear the rows in worksheet before adding data

I have a Google Spreadsheet which I'm populating with values using a Python script and the gdata library. If I run the script more than once, it appends new rows to the worksheet; I'd like the script to first clear all the data from the rows before populating them, so that I have a fresh set of data every time I run the script. I've tried using:
UpdateCell(row, col, value, spreadsheet_key, worksheet_id)
but short of running two for loops like this, is there a cleaner way? Also, this loop seems to be horrendously slow:
for x in range(2, 45):
    for i in range(1, 5):
        self.GetGDataClient().UpdateCell(x, i, '',
                                         self.spreadsheet_key,
                                         self.worksheet_id)
Not sure if you got this sorted out or not, but regarding speeding up the clearing out of current data, try using a batch request. For instance, to clear out every single cell in the sheet, you could do:
cells = client.GetCellsFeed(key, wks_id)
batch_request = gdata.spreadsheet.SpreadsheetsCellsFeed()

# Iterate through every cell in the CellsFeed, replacing each one with ''
# Note that this does not make any calls yet - it all happens locally
for i, entry in enumerate(cells.entry):
    entry.cell.inputValue = ''
    batch_request.AddUpdate(cells.entry[i])

# Now send the entire batchRequest as a single HTTP request
updated = client.ExecuteBatch(batch_request, cells.GetBatchLink().href)
If you want to do things like save the column headers (assuming they are in the first row), you can use a CellQuery:
# Set up a query that starts at row 2
query = gdata.spreadsheet.service.CellQuery()
query.min_row = '2'

# Pull just those cells
no_headers = client.GetCellsFeed(key, wks_id, query=query)
batch_request = gdata.spreadsheet.SpreadsheetsCellsFeed()

# Iterate through every cell in the CellsFeed, replacing each one with ''
# Note that this does not make any calls yet - it all happens locally
for i, entry in enumerate(no_headers.entry):
    entry.cell.inputValue = ''
    batch_request.AddUpdate(no_headers.entry[i])

# Now send the entire batchRequest as a single HTTP request
updated = client.ExecuteBatch(batch_request, no_headers.GetBatchLink().href)
Alternatively, you could use this approach to update your cells as well (perhaps more in line with what you want). The documentation provides a basic way to do that, which is (copied from the docs in case the link ever changes):
import gdata.spreadsheet
import gdata.spreadsheet.service
client = gdata.spreadsheet.service.SpreadsheetsService()
# Authenticate ...
cells = client.GetCellsFeed('your_spreadsheet_key', wksht_id='your_worksheet_id')
batchRequest = gdata.spreadsheet.SpreadsheetsCellsFeed()
cells.entry[0].cell.inputValue = 'x'
batchRequest.AddUpdate(cells.entry[0])
cells.entry[1].cell.inputValue = 'y'
batchRequest.AddUpdate(cells.entry[1])
cells.entry[2].cell.inputValue = 'z'
batchRequest.AddUpdate(cells.entry[2])
cells.entry[3].cell.inputValue = '=sum(3,5)'
batchRequest.AddUpdate(cells.entry[3])
updated = client.ExecuteBatch(batchRequest, cells.GetBatchLink().href)
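For completeness, the "# Authenticate ..." placeholder above was typically filled in with the old ClientLogin flow of the gdata library (long since deprecated in favour of OAuth). A minimal sketch, with made-up credentials:

# Old-style ClientLogin authentication for the gdata SpreadsheetsService
# (deprecated; shown only to make the snippet above self-contained).
client.email = 'you@example.com'          # made-up credentials
client.password = 'your-password'
client.source = 'clear-worksheet-example'
client.ProgrammaticLogin()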
