I'm working on adding keywords to a collection of documents (company information records).
The keyword generation works perfectly but now I'm struggling to figure out why my update isn't working.
with open(file) as csv_file:
    csv_reader = csv.DictReader(csv_file, delimiter=',')
    line_count = 0
    for row in csv_reader:
        uuid = row['uuid']
        name = row['name'].lower()
        keywords = ""
        arrName = []
        for c in name:
            keywords += c
            arrName.append(keywords)
        print(uuid, arrName)
        data = {
            u'keywords': arrName}
        doc_ref = store.collection("org_test").where("uuid", "==", uuid).update(data)
I know the problem is in the last line, but I get the same error whether I use .update() or .set().
uuid is: e1393508-30ea-8a36-3f96-dd3226033abd keywords are: ['w', 'we', 'wet', 'wetp', 'wetpa', 'wetpai', 'wetpain', 'wetpaint']
Traceback (most recent call last):
File "keywords.py", line 107, in <module>
doc_ref = store.collection("org_test").where("uuid", "==", uuid).update(data)
AttributeError: 'Query' object has no attribute 'update'
I've also tried:
doc_ref = store.collection("org_test").where("uuid", "==", uuid).document.update(data)
But I get a similar error.
You are getting this error because you cannot fire an update call directly from a query.
What you are doing in this part of your code:
store.collection("org_test").where("uuid", "==", uuid)
is returning a Query object, which represents the query itself rather than any particular document, and that object does not have an update() method.
To update a document you need a DocumentReference, which you get by selecting the document with document(); that is the object that exposes update(). If your document IDs are the same as the uuid field, you can do this in your code:
store.collection("org_test").document(uuid).update(data)
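If the document IDs are auto-generated rather than equal to the uuid field, a minimal sketch (assuming the same google-cloud-firestore client already set up as store) is to run the query, stream the matches, and update each one through its reference:
# Sketch: stream the query results, then update each matching document
# through its DocumentReference (the Query object itself cannot be updated).
matches = store.collection("org_test").where("uuid", "==", uuid).stream()
for snapshot in matches:
    snapshot.reference.update({u'keywords': arrName})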
I realized that my data had different data types: most of the fields are strings, but one is an array.
This was my code and solution. It takes a file name as a command-line argument, parses the fields, creates an array of keywords based on the company name, and finally creates a new collection in my Firestore.
import csv
import argparse
import firebase_admin
import google.cloud
from google.cloud.firestore_v1 import ArrayUnion
from firebase_admin import credentials, firestore


def init_argparse() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(
        description='Parse through the fields, create an array of keywords based on the company name and upload to Firestore.',
        add_help=False, usage=get_usage())
    parser.add_argument('file', help='File to run against.')
    parser.add_argument('-q', '--quiet', help='Quiet mode', action='store_true')
    parser.add_argument('-v', '--version', action='version', version='%(prog)s 1.0.0', help='Version.')
    return parser


# Launch program
if __name__ == '__main__':
    parser = init_argparse()
    args = parser.parse_args()
    arg_quiet = args.quiet
    file = args.file

    cred = credentials.Certificate("./serviceAccountKey.json")
    app = firebase_admin.initialize_app(cred)
    db = firestore.client()

    with open(file) as csv_file:
        csv_reader = csv.DictReader(csv_file, delimiter=',')
        line_count = 0
        co = []
        for row in csv_reader:
            # Initialize variables
            uuid = row["uuid"]
            name = row["name"].lower()
            primary_role = row["primary_role"]
            cb_url = row["cb_url"]
            domain = row["domain"]
            homepage_url = row["homepage_url"]
            logo_url = row["logo_url"]
            facebook_url = row["facebook_url"]
            twitter_url = row["twitter_url"]
            linkedin_url = row["linkedin_url"]
            combined_stock_symbols = row["combined_stock_symbols"]
            city = row["city"]
            region = row["region"]
            country_code = row["country_code"]
            short_description = row["short_description"]
            keywords = ""
            arrName = []
            clean = []
            # Loop to create searchable keywords
            for c in name:
                keywords += c
                arrName.append(keywords)
            doc_ref = db.collection("orgs").document()
            doc_ref.set({
                u'uuid': uuid,
                u'name': name,
                u'keywords': ArrayUnion(arrName),
                u'primary_role': primary_role,
                u'cb_url': cb_url,
                u'domain': domain,
                u'homepage_url': homepage_url,
                u'logo_url': logo_url,
                u'facebook_url': facebook_url,
                u'twitter_url': twitter_url,
                u'linkedin_url': linkedin_url,
                u'combined_stock_symbols': combined_stock_symbols,
                u'city': city,
                u'region': region,
                u'country_code': country_code,
                u'short_description': short_description
            })
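As a usage note, and only a sketch assuming the same 'orgs' collection and db client as above: the keywords array can then drive a prefix search with an array_contains filter.
# Sketch: find organizations whose name starts with the typed prefix,
# using the pre-computed keywords array.
prefix = "wetp"
matches = db.collection("orgs").where("keywords", "array_contains", prefix).stream()
for doc in matches:
    print(doc.id, doc.to_dict().get("name"))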
Related
Y'all, I'm trying to figure out how to sort for a specific country's tweets using search_recent_tweets. I take a country name as input, use pycountry to get the 2-character country code, and then I can either put some sort of location filter in my query or in the search_recent_tweets params. Nothing I have tried so far in either has worked.
######
import tweepy
from tweepy import OAuthHandler
from tweepy import API
import pycountry as pyc
# upload token
BEARER_TOKEN='XXXXXXXXX'
# get tweets
client = tweepy.Client(bearer_token=BEARER_TOKEN)
# TAKE USER INPUT
countryQuery = input("Find recent tweets about travel in a certain country (input country name): ")
keyword = 'women safe' # gets tweets containing women and safe for that country (safe will catch safety)
# get country code to plug in as param in search_recent_tweets
country_code = str(pyc.countries.search_fuzzy(countryQuery)[0].alpha_2)
# get 100 recent tweets containing keywords and from location = countryQuery
query = str(keyword+' place_country='+str(countryQuery)+' -is:retweet') # search for keyword and no retweets
posts = client.search_recent_tweets(query=query, max_results=100, tweet_fields=['id', 'text', 'entities', 'author_id'])
# expansions=geo.place_id, place.fields=[country_code],
# filter posts to remove retweets
# export tweets to json
import json
with open('twitter.json', 'w') as fp:
    for tweet in posts.data:
        json.dump(tweet.data, fp)
        fp.write('\n')
        print("* " + str(tweet.text))
I have tried variations of:
query = str(keyword+' -is:retweet') # search for keyword and no retweets
posts = client.search_recent_tweets(query=query, place_fields=[str(countryQuery), country_code], max_results=100, tweet_fields=['id', 'text', 'entities', 'author_id'])
and:
query = str(keyword+' place.fields='+str(countryQuery)+','+country_code+' -is:retweet') # search for keyword and no retweets
posts = client.search_recent_tweets(query=query, max_results=100, tweet_fields=['id', 'text', 'entities', 'author_id'])
These attempts either returned nothing (NoneType instead of tweets) or caused an error like:
"The place.fields query parameter value [Germany] is not one of [contained_within,country,country_code,full_name,geo,id,name,place_type]"
The documentation for search_recent_tweets makes it look like place.fields / place_fields / place_country should be supported.
Any advice would help!!!
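For reference, a sketch of how the location filter could be expressed, assuming access to the place_country: search operator (which, as far as I know, requires an elevated/Academic Research access tier) and bearing in mind that only geo-tagged tweets will match:
# Sketch: place_country expects the ISO alpha-2 code inside the query string,
# not a separate parameter; expansions/place_fields only control returned fields.
query = f"{keyword} place_country:{country_code} -is:retweet"
posts = client.search_recent_tweets(
    query=query,
    max_results=100,
    tweet_fields=['id', 'text', 'entities', 'author_id'],
    expansions=['geo.place_id'],
    place_fields=['country_code'])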
I have a Dataflow pipeline that fetches data from Pub/Sub, prepares it for insertion into BigQuery, and then writes it to the database.
It works fine: it generates the schema automatically, recognises which data type to use, and so on.
However, the data we are using can vary vastly in format. For example, we can get both A and B for a single column:
A {"name":"John"}
B {"name":["Albert", "Einstein"]}
If the first message is the one that gets added, then adding the second one fails; the other way around, it works.
I always get the following error:
INFO:root:Error: 400 POST https://bigquery.googleapis.com/upload/bigquery/v2/project/projectname/jobs?uploadType=resumable: Provided Schema does not match Table project:test_dataset.test_table. Field cars has changed mode from NULLABLE to REPEATED with loading dataframe
ERROR:apache_beam.runners.direct.executor:Exception at bundle <apache_beam.runners.direct.bundle_factory._Bundle object at 0x7fcb9003f2c0>, due to an exception.
Traceback (most recent call last):
........
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
.....
Provided Schema does not match Table project.test_table. Field cars has changed mode from NULLABLE to REPEATED
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "apache_beam/runners/common.py", line 1233, in apache_beam.runners.common.DoFnRunner.process
File "apache_beam/runners/common.py", line 582, in apache_beam.runners.common.SimpleInvoker.invoke_process
File "newmain.py", line 211, in process
if load_job and load_job.errors:
UnboundLocalError: local variable 'load_job' referenced before assignment
Below is the code
class WriteDataframeToBQ(beam.DoFn):

    def __init__(self, bq_dataset, bq_table, project_id):
        self.bq_dataset = bq_dataset
        self.bq_table = bq_table
        self.project_id = project_id

    def start_bundle(self):
        self.client = bigquery.Client()

    def process(self, df):
        # table where we're going to store the data
        table_id = f"{self.bq_dataset}.{self.bq_table}"

        # function to help with the json -> bq schema transformations
        generator = SchemaGenerator(input_format='dict', quoted_values_are_strings=True, keep_nulls=True)

        # Get original schema to assist the deduce_schema function. If the table doesn't exist
        # proceed with empty original_schema_map
        try:
            table = self.client.get_table(table_id)
            original_schema = table.schema
            self.client.schema_to_json(original_schema, "original_schema.json")
            with open("original_schema.json") as f:
                original_schema = json.load(f)
                original_schema_map, original_schema_error_logs = generator.deduce_schema(input_data=original_schema)
        except Exception:
            logging.info(f"{table_id} table not exists. Proceed without getting schema")
            original_schema_map = {}

        # convert dataframe to dict
        json_text = df.to_dict('records')

        # generate the new schema, we need to write it to a file because schema_from_json only accepts json file as input
        schema_map, error_logs = generator.deduce_schema(input_data=json_text, schema_map=original_schema_map)
        schema = generator.flatten_schema(schema_map)

        schema_file_name = "schema_map.json"
        with open(schema_file_name, "w") as output_file:
            json.dump(schema, output_file)

        # convert the generated schema to a version that BQ understands
        bq_schema = self.client.schema_from_json(schema_file_name)

        job_config = bigquery.LoadJobConfig(
            source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
            schema_update_options=[
                bigquery.SchemaUpdateOption.ALLOW_FIELD_ADDITION,
                bigquery.SchemaUpdateOption.ALLOW_FIELD_RELAXATION
            ],
            write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
            schema=bq_schema
        )
        job_config.schema = bq_schema

        try:
            load_job = self.client.load_table_from_json(
                json_text,
                table_id,
                job_config=job_config,
            )  # Make an API request.
            load_job.result()  # Waits for the job to complete.
            if load_job.errors:
                logging.info(f"error_result = {load_job.error_result}")
                logging.info(f"errors = {load_job.errors}")
            else:
                logging.info(f'Loaded {len(df)} rows.')
        except Exception as error:
            logging.info(f'Error: {error} with loading dataframe')
            if load_job and load_job.errors:
                logging.info(f"error_result = {load_job.error_result}")
                logging.info(f"errors = {load_job.errors}")


def run(argv):
    parser = argparse.ArgumentParser()
    known_args, pipeline_args = parser.parse_known_args(argv)
    pipeline_options = PipelineOptions(pipeline_args, save_main_session=True, streaming=True)
    options = pipeline_options.view_as(JobOptions)

    with beam.Pipeline(options=pipeline_options) as pipeline:
        (
            pipeline
            | "Read PubSub Messages" >> beam.io.ReadFromPubSub(subscription=options.input_subscription)
            | "Write Raw Data to Big Query" >> beam.ParDo(WriteDataframeToBQ(project_id=options.project_id, bq_dataset=options.bigquery_dataset, bq_table=options.bigquery_table))
        )


if __name__ == "__main__":
    logging.getLogger().setLevel(logging.INFO)
    run(sys.argv)
Is there a way to change the restrictions of the table to make this work?
BigQuery isn't a document database but a column-oriented database. In addition, you can't change the schema of existing columns (you can only add or remove columns).
For your use case, and because you can't know or predict the most generic schema for each of your fields, the safer approach is to store the raw JSON as a string and then use BigQuery's JSON functions to post-process your data in SQL.
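A minimal sketch of that approach (table and column names here are hypothetical, reusing a bigquery client and the json import from the question): write each message as a raw JSON string, then parse it at query time with BigQuery's JSON functions.
# Sketch: keep the payload as a JSON string in a single STRING column...
rows = [{"raw": json.dumps(msg)} for msg in ({"name": "John"},
                                             {"name": ["Albert", "Einstein"]})]
client.load_table_from_json(rows, "test_dataset.raw_table").result()  # hypothetical table

# ...then post-process in SQL; JSON_EXTRACT works whether the value is a
# scalar or an array, so both message shapes fit the same column.
query = """
    SELECT JSON_EXTRACT(raw, '$.name') AS name
    FROM `test_dataset.raw_table`
"""
for row in client.query(query).result():
    print(row.name)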
I am trying to get all the members of an Active Directory group.
I have this code:
from ldap3 import Server, Connection, ALL, core
server = Server(address, get_info=ALL)
ad_conn = Connection(server, dn, password, auto_bind=True)
members = []
AD_GROUP_FILTER = '(&(objectClass=GROUP)(cn={group_name}))'
ad_filter = AD_GROUP_FILTER.replace('{group_name}', group_name)
result = ad_conn.search_s('OU details', ldap3.SCOPE_SUBTREE, ad_filter)
if result:
    if len(result[0]) >= 2 and 'member' in result[0][1]:
        members_tmp = result[0][1]['member']
        for m in members_tmp:
            email = get_email_by_dn(m, ad_conn)
            if email:
                members.append(email)
return members
But I am getting an error
'Connection' object has no attribute 'search_s'
Use search(), specify the attributes you need (it seems you build the email from the user DN, but you could fetch it directly if it is present in the directory), and fix the arguments in the call: the order is base, filter, then scope, and use the proper constant SUBTREE:
from ldap3 import Server, Connection, ALL, SUBTREE, core

server = Server(address, get_info=ALL)
ad_conn = Connection(server, dn, password, auto_bind=True)
members = []
AD_GROUP_FILTER = '(&(objectClass=GROUP)(cn={group_name}))'
ad_filter = AD_GROUP_FILTER.replace('{group_name}', group_name)
ad_conn.search('OU details', ad_filter, SUBTREE, attributes=['member', 'mail'])
if len(ad_conn.response):
    # To grab data, you might prefer the following - or use ad_conn.entries:
    for entry in ad_conn.response:
        print(entry['dn'], entry['attributes'])
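If the email should come from the member entries themselves rather than the group, a minimal sketch (assuming the 'mail' attribute is populated in the directory; the member DNs come from the group search above) to resolve each member DN to an email:
from ldap3 import BASE  # scope that reads only the entry at the given DN

member_dns = []
for group in ad_conn.entries:
    if 'member' in group.entry_attributes:
        member_dns.extend(group.member.values)

for member_dn in member_dns:
    # Look up each member entry and keep its mail attribute, if present
    ad_conn.search(member_dn, '(objectClass=*)', BASE, attributes=['mail'])
    mail = ad_conn.response[0]['attributes'].get('mail') if ad_conn.response else None
    if mail:
        members.append(mail[0] if isinstance(mail, list) else mail)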
In the __init__ method, when I try to access the result of get_movie_detail, it displays an error that the object has no attribute. This happens when I try to get data from a list. When I tried returning a single variable from get_movie_detail (e.g. return title), it successfully returned the title, but when I try the same with a list, the error below appears:
Stack trace:
674
Harry Potter and the Goblet of Fire
7.6
Traceback (most recent call last):
File "C:\Users\HOME\Desktop\movie trailer\entertainment.py", line 4, in <module>
movie_name_1=media.Data('Harry potter and goblet of fire')
File "C:\Users\HOME\Desktop\movie trailer\media.py", line 73, in __init__
self.title = detail_new.detail_list[0]
AttributeError: 'list' object has no attribute 'detail_list'
Below is my code:
""" stores requested movie data"""
import webbrowser
import urllib.request
import json
import tmdbsimple as tmdb

tmdb.API_KEY = ''


class Data():
    """Takes argument movie_name and searches the data in the tmdb database.
    tmdbsimple is a wrapper around the tmdb API.
    The movie name keyword should be precise so that the api can fetch data
    accurately. Incomplete keywords could lead to unwanted results.

    Attributes:
        Search(): search method in tmdbsimple.
        movie: movie method, takes a query and returns results.
        search_result: gets the result of the searched movies; includes
            data regarding movies in list format.
        movie_id: stores the unique movie ID for a particular movie.
        title: stores the title of the movie.
        imdb_rating: rating of the movie.
        movie_overview: short description of the movie.
        release_date: date on which the movie released.
        data_yt: searches the movie keyword on the youtube api for the trailer.
        youtube_key: gets the key from the search result in json format.
        trailer_youtube_url: complete youtube link to the trailer.
        poster_image_url: link to the poster of the movie."""

    def get_movie_detail(self, movie_name):
        print("I am in get_movie_detail")
        # search movie id related to query from tmdb api
        search = tmdb.Search()
        response_old = search.movie(query=movie_name)
        search_result = search.results
        s_find = search_result[0]  # gets the first id of the search result
        movie_id = s_find['id']
        print(movie_id)
        # uses movie id to get the details of movie
        movie = tmdb.Movies(movie_id)
        response = movie.info()  # stores the movie information in dict format
        title = response['title']  # if you try to print title inside this method it prints correctly
        print(title)
        vote = response['vote_average']
        print(vote)
        discript = response['overview']
        date = response['release_date']
        poster_id = response['poster_path']
        detail_list = [title, vote, discript, date, poster_id]
        return detail_list

    # Issue: if I try calling from __init__ it's not returning any attributes of get_movie_detail. Is there a scope issue?

    def get_trailer_link(self, movie_name):
        print("I am in get_trailer")
        query_new = urllib.parse.quote_plus(movie_name + 'trailer')
        data_yt = urllib.request.urlopen("https://www.googleapis.com/youtube/v3/search?part=snippet&maxResults=5&q=" + query_new + "&type=video&key=")
        trailer_data = data_yt.read()
        youtube_json = json.loads(trailer_data)
        youtube_key = str(youtube_json['items'][0]['id']['videoId'])

    def __init__(self, movie_name):
        """stores the attributes of instances"""
        detail_new = self.get_movie_detail(movie_name)
        self.title = detail_new.detail_list[0]  # ***while trying to get data from method it shows error: 'NoneType' object has no attribute 'title'***
        print(self.title)
        self.imdb_rating = detail_new.vote.detail_list[1]
        print(self.imdb_rating)
        self.movie_overview = detail_new.detail_list[2]
        print(self.movie_overview)
        self.release_date = detail_new.detail_list[3]
        print(self.release_date)
        # uses youtube api to get the url from the search result
        # gets movie image from the database tmdbsimple api
        self.trailer_youtube_url = "https://www.youtube.com/watch?v=" + str(self.get_trailer_link.youtube_key)
        self.poster_image_url = "http://image.tmdb.org/t/p/w185" + str(poster_id)

    def trailer(self):
        """opens youtube url"""
        webbrowser.open(self.trailer_youtube_url)
You don't need to try to access detail_list; your detail_new is that list.
You can get the title by accessing detail_new[0].
So that portion of your __init__() method becomes:
self.title = detail_new[0]
self.imdb_rating = detail_new[1]
self.movie_overview = detail_new[2]
self.release_date = detail_new[3]
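As a side note, and only a sketch based on the code in the question: get_trailer_link() never returns youtube_key, so self.get_trailer_link.youtube_key will also fail. Returning the key and calling the method fixes that part (the YouTube API key is still omitted, as in the original):
def get_trailer_link(self, movie_name):
    """Return the YouTube video key of the first trailer search result."""
    query_new = urllib.parse.quote_plus(movie_name + ' trailer')
    data_yt = urllib.request.urlopen(
        "https://www.googleapis.com/youtube/v3/search?part=snippet&maxResults=5"
        "&q=" + query_new + "&type=video&key=")
    youtube_json = json.loads(data_yt.read())
    return str(youtube_json['items'][0]['id']['videoId'])

# and in __init__():
self.trailer_youtube_url = "https://www.youtube.com/watch?v=" + self.get_trailer_link(movie_name)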
In this SO question I learned that I cannot delete a Cosmos DB document using SQL.
Using Python, I believe I need the DeleteDocument() method. This is how I'm getting the document IDs that are (I believe) required to then call DeleteDocument():
# set up the client
client = document_client.DocumentClient()

# use a SQL based query to get a bunch of documents
query = { 'query': 'SELECT * FROM server s' }
result_iterable = client.QueryDocuments('dbs/DB/colls/coll', query, options)
results = list(result_iterable)

for x in range(0, len(results)):
    docID = results[x]['id']
Now, at this stage I want to call DeleteDocument().
The inputs into which are document_link and options.
I can define document_link as something like
document_link = 'dbs/DB/colls/coll/docs/'+docID
And successfully call ReadAttachments() for example, which has the same inputs as DeleteDocument().
When I do, however, I get an error...
The partition key supplied in x-ms-partitionkey header has fewer
components than defined in the the collection
...and now I'm totally lost
UPDATE
Following on from Jay's help, I believe I'm missing the partitonKey element in the options.
In this example, I've created a testing database; it looks like this (screenshot omitted).
So I think my partition key is /testPART.
When I include the partitionKey in the options, however, no results are returned (and so print len(results) outputs 0).
Removing partitionKey means that results are returned, but the delete attempt fails as before.
# Query them in SQL
query = { 'query': 'SELECT * FROM c' }

options = {}
options['enableCrossPartitionQuery'] = True
options['maxItemCount'] = 2
options['partitionKey'] = '/testPART'

result_iterable = client.QueryDocuments('dbs/testDB/colls/testCOLL', query, options)
results = list(result_iterable)

# should be > 0
print len(results)

for x in range(0, len(results)):
    docID = results[x]['id']
    print docID
    client.DeleteDocument('dbs/testDB/colls/testCOLL/docs/' + docID, options=options)
    print 'deleted', docID
According to your description, I tried using the pydocumentdb module to delete a document in my Azure Document DB, and it works for me.
Here is my code:
import pydocumentdb
import pydocumentdb.document_client as document_client

config = {
    'ENDPOINT': 'Your url',
    'MASTERKEY': 'Your master key',
    'DOCUMENTDB_DATABASE': 'familydb',
    'DOCUMENTDB_COLLECTION': 'familycoll'
}

# Initialize the Python DocumentDB client
client = document_client.DocumentClient(config['ENDPOINT'], {'masterKey': config['MASTERKEY']})

# use a SQL based query to get a bunch of documents
query = { 'query': 'SELECT * FROM server s' }

options = {}
options['enableCrossPartitionQuery'] = True
options['maxItemCount'] = 2

result_iterable = client.QueryDocuments('dbs/familydb/colls/familycoll', query, options)
results = list(result_iterable)
print(results)

client.DeleteDocument('dbs/familydb/colls/familycoll/docs/id1', options)
print('delete success')
Console Result:
[{u'_self': u'dbs/hitPAA==/colls/hitPAL3OLgA=/docs/hitPAL3OLgABAAAAAAAAAA==/', u'myJsonArray': [{u'subId': u'sub1', u'val': u'value1'}, {u'subId': u'sub2', u'val': u'value2'}], u'_ts': 1507687788, u'_rid': u'hitPAL3OLgABAAAAAAAAAA==', u'_attachments': u'attachments/', u'_etag': u'"00002100-0000-0000-0000-59dd7d6c0000"', u'id': u'id1'}, {u'_self': u'dbs/hitPAA==/colls/hitPAL3OLgA=/docs/hitPAL3OLgACAAAAAAAAAA==/', u'myJsonArray': [{u'subId': u'sub3', u'val': u'value3'}, {u'subId': u'sub4', u'val': u'value4'}], u'_ts': 1507687809, u'_rid': u'hitPAL3OLgACAAAAAAAAAA==', u'_attachments': u'attachments/', u'_etag': u'"00002200-0000-0000-0000-59dd7d810000"', u'id': u'id2'}]
delete success
Please notice that you need to set the enableCrossPartitionQuery property to True in options if your documents are cross-partitioned.
Must be set to true for any query that requires to be executed across
more than one partition. This is an explicit flag to enable you to
make conscious performance tradeoffs during development time.
You can find the above description here.
Update Answer:
I think you misunderstand the meaning of the partitionKey property in the options.
For example, my container is created with 'name' as the partition key (screenshot omitted).
My documents are as below:
{
"id": "1",
"name": "jay"
}
{
"id": "2",
"name": "jay2"
}
My partition key is 'name', so here I have two partitions: 'jay' and 'jay2'.
So, here you should set the partitionKey property to 'jay' or 'jay2', not 'name'.
Please modify your code as below:
options = {}
options['enableCrossPartitionQuery'] = True
options['maxItemCount'] = 2
options['partitionKey'] = 'jay'  # please change this in your code

result_iterable = client.QueryDocuments('dbs/db/colls/testcoll', query, options)
results = list(result_iterable)
print(results)
Hope it helps you.
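Applied to the code in the question, a sketch (assuming the partition key path really is /testPART) is to query cross-partition and then pass each document's own testPART value when deleting it:
# Sketch: query across partitions, then delete each document using the value
# of its own partition key field (not the path '/testPART').
query = {'query': 'SELECT * FROM c'}
options = {'enableCrossPartitionQuery': True, 'maxItemCount': 2}

for doc in client.QueryDocuments('dbs/testDB/colls/testCOLL', query, options):
    doc_link = 'dbs/testDB/colls/testCOLL/docs/' + doc['id']
    client.DeleteDocument(doc_link, options={'partitionKey': doc['testPART']})
    print('deleted', doc['id'])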
Using the azure.cosmos library:
Install and import the azure-cosmos package:
from azure.cosmos import exceptions, CosmosClient, PartitionKey
Define a delete-items function, in this case using the partition key in the query:
def deleteItems(deviceid):
    client = CosmosClient(config.cosmos.endpoint, config.cosmos.primarykey)

    # Create a database if it does not exist
    database = client.create_database_if_not_exists(id='azure-cosmos-db-name')

    # Create a container
    # Using a good partition key improves the performance of database operations.
    container = database.create_container_if_not_exists(
        id='container-name',
        partition_key=PartitionKey(path='/your-partition-path'),
        offer_throughput=400)

    # fetch items
    query = f"SELECT * FROM c WHERE c.device.deviceid IN ('{deviceid}')"
    items = list(container.query_items(query=query, enable_cross_partition_query=False))

    for item in items:
        container.delete_item(item, 'partition-key')
usage:
deviceid = 10
deleteItems(deviceid)
github full example here: https://github.com/eladtpro/python-iothub-cosmos