How to find non-retweets in a MongoDB collection of tweets? - python

I have about 1.4 million tweets in a MongoDB collection. I want to find all that are NOT retweets, and I am using Python. A document is structured as follows:
{
'_id': ObjectId('59388c046b0c1901172555b9'),
'coordinates': None,
'created_at': datetime.datetime(2016, 8, 18, 17, 17, 12),
'geo': None,
'is_quote': False,
'lang': 'en',
'text': b'Adam Cole Praises Kevin Owens + A Preview For Next Week\xe2\x80\x99s',
'tw_id': 766323071976247296,
'user_id': 2231233110,
'user_lang': 'en',
'user_loc': 'main; #Kan1shk3',
'user_name': 'sheezy0',
'user_timezone': 'Chennai'
}
I can write a query that works to find the particular tweet from above:
twitter_mongo_collection.find_one({
'text': b'Adam Cole Praises Kevin Owens + A Preview For Next Week\xe2\x80\x99s'
})
But when I try to find retweets, my code doesn't work. For example, I try to find any tweets that start like this:
'text': b'RT some tweet'
Using this query:
find_one( {'text': {'$regex': "/^RT/" } } )
It doesn't return an error, but it doesn't find anything. I suspect it has something to do with that 'b' at the beginning before the text starts. I know I also need to put '$not:' in there somewhere but am not sure where.
Thanks!

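Your pattern is passed as a plain Python string, so the slashes in "/^RT/" are treated as literal characters rather than as regex delimiters. A pattern that requires a "/" before the start-of-string anchor can never match text like
b'RT some text afterwards'
Try dropping the slashes instead (a trailing .* is not needed for a prefix match):
find_one( {'text': {'$regex': "^RT" } } )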

I had to decode the 'text' field, which was stored as binary, because the regex did not match the binary values. Then I was able to use
import re
twitter_mongo_collection.find_one( {'text': { '$not': re.compile("^RT.*") } } )
to find a document that does not start with "RT" (pass the same filter to find() to get all of them).
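
A minimal sketch of the two steps described above, assuming the 'text' values are UTF-8 encoded bytes as in the sample document. The helper names decode_text and non_retweet_query are hypothetical, and the commented PyMongo calls assume twitter_mongo_collection is a pymongo collection; they were not run against a live database:

```python
import re


def decode_text(doc):
    """Return a copy of doc with a bytes 'text' field decoded to str (UTF-8 assumed)."""
    text = doc.get('text')
    if isinstance(text, bytes):
        return {**doc, 'text': text.decode('utf-8')}
    return doc


def non_retweet_query():
    """Build a filter for documents whose 'text' does not start with 'RT'."""
    # PyMongo accepts a compiled pattern as the operand of $not.
    return {'text': {'$not': re.compile(r'^RT')}}


# Hypothetical usage against a live collection (assumption, not run here):
# for doc in twitter_mongo_collection.find({'text': {'$type': 'binData'}}):
#     fixed = decode_text(doc)
#     twitter_mongo_collection.update_one(
#         {'_id': doc['_id']}, {'$set': {'text': fixed['text']}})
# originals = twitter_mongo_collection.find(non_retweet_query())
```

Decoding once and storing the result as a string keeps later regex queries simple, at the cost of one pass over the collection.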

Related

AWS bulk indexing gives ('illegal_argument_exception', 'explicit index in bulk is not allowed')

While trying to bulk index on AWS OpenSearch Service (ElasticSearch V 10.1) using opensearch-py, I get the error below:
RequestError: RequestError(400, 'illegal_argument_exception', 'explicit index in bulk is not allowed')
from opensearchpy.helpers import bulk
bulk(client, format_embeddings_for_es_indexing(embd_data, titles_, _INDEX_))
The format_embeddings_for_es_indexing() function yields
{
'_index': 'test_v1',
'_id': '208387',
'_source': {
'article_id': '208387',
'title': 'Battery and Performance',
'title_vector': [ 1.77665558e-02, 1.95874255e-02,.....],
......
}
}
I am able to index documents one by one using client.index():
failed = {}
for document in format_embeddings_for_es_indexing(embd_data, titles_, _INDEX_):
    res = client.index(
        **document,
        refresh=True
    )
    if res['_shards']['failed'] > 0:
        failed[document["body"]["article_id"]] = res['_shards']
# document body for open search index
{
'index': 'test_v1',
'id': '208387',
'body': {
'article_id': '208387',
'title': 'Battery and Performance',
'title_vector': [ 1.77665558e-02, 1.95874255e-02,.....],
......
}
}
please help
This may have something to do with what is documented here: https://docs.aws.amazon.com/opensearch-service/latest/developerguide/ac.html#ac-advanced
Please make sure that the value of rest.action.multi.allow_explicit_index in the Advanced cluster settings is true
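
If changing the cluster setting is not an option, one possible workaround is to leave '_index' out of each action and name the index once on the request. The sketch below assumes that opensearchpy.helpers.bulk forwards an index keyword to the underlying bulk request; strip_explicit_index is a hypothetical helper, and the commented call is untested against AWS:

```python
def strip_explicit_index(actions):
    """Yield copies of bulk actions without the per-document '_index' key."""
    for action in actions:
        yield {key: value for key, value in action.items() if key != '_index'}


# Hypothetical usage (assumes the helper accepts an index keyword):
# from opensearchpy.helpers import bulk
# actions = format_embeddings_for_es_indexing(embd_data, titles_, _INDEX_)
# bulk(client, strip_explicit_index(actions), index=_INDEX_)
```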

How do I get the document id in Marqo?

I added a document to Marqo with add_documents() but didn't pass an id, and now I am trying to get the document but don't know what its document id is.
Here is what my code looks like:
mq = marqo.Client(url='http://localhost:8882')
mq.index("my-first-index").add_documents([
{
"Title": title,
"Description": document_body
}]
)
I tried to check whether the document got added:
no_of_docs = mq.index("my-first-index").get_stats()
print(no_of_docs)
and got:
{'numberOfDocuments': 1}
meaning it was added.
If you don't include "_id" as one of the key/value pairs, Marqo generates a random id for you by default. To find it, you can search for the document by its Title:
doc = mq.index("my-first-index").search(title_of_your_document, searchable_attributes=['Title'])
You should get a dictionary as the result, something like this:
{'hits': [{'Description': your_description,
'Title': title_of_your_document,
'_highlights': relevant part of the doc,
'_id': 'ac14f87e-50b8-43e7-91de-ee72e1469bd3',
'_score': 1.0}],
'limit': 10,
'processingTimeMs': 122,
'query': 'The Premier League'}
The value under '_id' is the id of your document.
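
As a small sketch, the ids can be pulled out of a response shaped like the one above with a list comprehension. The function name hit_ids and the trimmed sample dictionary are illustrative assumptions, not part of the Marqo client:

```python
def hit_ids(search_result):
    """Collect the '_id' of every hit in a search response dictionary."""
    return [hit['_id'] for hit in search_result.get('hits', [])]


# Sample response trimmed to the fields that matter here.
sample = {
    'hits': [{
        'Title': 'some title',
        '_id': 'ac14f87e-50b8-43e7-91de-ee72e1469bd3',
        '_score': 1.0,
    }],
    'limit': 10,
}
print(hit_ids(sample))
```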

Send json between Python and PHP

I have this code in Python (this is the relevant part of the code):
elements = [{'id': 1, 'name': 'Alex'}, {'id': 2, 'name': 'Jessica'}]
elements_json = json.dumps(elements)
requests.post('http://localhost:8000/api/add-element', data = {'json': elements_json})
The list sometimes contains more elements.
In PHP (Laravel) I have this code:
public function Store(Request $request) {
$json = json_decode($request->json );
echo $json;
}
I want to store all the elements, but as a test I first want to print them, and it displays an empty result. I am testing this with Postman. Where is the problem?

String indices must be integers - Django

I have a pretty big dictionary which looks like this:
{
'startIndex': 1,
'username': 'myemail#gmail.com',
'items': [{
'id': '67022006',
'name': 'Adopt-a-Hydrant',
'kind': 'analytics#accountSummary',
'webProperties': [{
'id': 'UA-67522226-1',
'name': 'Adopt-a-Hydrant',
'websiteUrl': 'https://www.udemy.com/',
'internalWebPropertyId': '104343473',
'profiles': [{
'id': '108333146',
'name': 'Adopt a Hydrant (Udemy)',
'type': 'WEB',
'kind': 'analytics#profileSummary'
}, {
'id': '132099908',
'name': 'Unfiltered view',
'type': 'WEB',
'kind': 'analytics#profileSummary'
}],
'level': 'STANDARD',
'kind': 'analytics#webPropertySummary'
}]
}, {
'id': '44222959',
'name': 'A223n',
'kind': 'analytics#accountSummary',
And so on....
When I copy this dictionary into my Jupyter notebook and run the exact same function that I run in my Django code, it works as expected. Everything is literally the same: in my Django code I even print the dictionary out, copy it to the notebook, run it there, and get what I'm expecting.
Just for more info this is the function:
google_profile = gp.google_profile  # Get google_profile from DB
print(google_profile)
all_properties = []
for properties in google_profile['items']:
    all_properties.append(properties)
site_selection = []
for single_property in all_properties:
    single_propery_name = single_property['name']
    for single_view in single_property['webProperties'][0]['profiles']:
        single_view_id = single_view['id']
        single_view_name = single_view['name']
        selections = single_propery_name + ' (View: ' + single_view_name + ' ID: ' + single_view_id + ')'
        site_selection.append(selections)
print(site_selection)
So my guess is that my notebook has some sort of JSON parser installed or something like that? Is that possible? Why can't I access dictionaries in Django the same way I can in my IPython notebooks?
EDITS
More info:
The error is at the line: for properties in google_profile['items']:
Django debug is: TypeError at /gconnect/ string indices must be integers
Local Vars are:
all_properties =[]
current_user = 'myemail#gmail.com'
google_profile = `the above dictionary`
So, to make it clear for anyone who finds this question:
If you save a dictionary in a text-type model field, Django stores its string representation, so what you read back is a str and cannot be indexed by key.
To solve this, you can convert it back to a dictionary.
The answer from this post worked perfectly for me, in other words:
import json
s = "{'muffin' : 'lolz', 'foo' : 'kitty'}"
json_acceptable_string = s.replace("'", "\"")
d = json.loads(json_acceptable_string)
# d == {'muffin': 'lolz', 'foo': 'kitty'}
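There are many ways to convert a string to a dictionary; this is only one, and it breaks if any value contains a quote character or a Python-only literal such as None. If you run into this problem, you can quickly check whether the value is a string instead of a dictionary with: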
print(type(var))
In my case I had:
<class 'str'>
before converting it with the above method, and afterwards I got
<class 'dict'>
and everything worked as expected.
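
As an alternative sketch, assuming the stored string is the output of str() on a dictionary of plain Python literals, the standard library's ast.literal_eval parses it directly without rewriting quotes. The function name parse_dict_string is hypothetical:

```python
import ast


def parse_dict_string(s):
    """Parse the str() form of a dict of plain literals back into a dict."""
    value = ast.literal_eval(s)  # evaluates literals only, never arbitrary code
    if not isinstance(value, dict):
        raise TypeError("expected a dict literal, got " + type(value).__name__)
    return value


d = parse_dict_string("{'muffin': 'lolz', 'foo': \"kitty's\", 'geo': None}")
print(type(d))
```

Unlike the quote-replacement approach, this handles apostrophes inside values and literals such as None, True, and False. Storing the data as JSON in the first place (json.dumps on write, json.loads on read) avoids the conversion altogether.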

python Api output json/string

I have a problem with an API's output. The web page says it's JSON, but part of it suddenly behaves like a string?
The output looks like this
{'terms':
[{'start_date':'2013-09-30',
'finish_date': '2014-03-02',
'end_date': '2014-01-31',
'order_key': 420,
'name': {'pl': 'Semestr zimowy 2013/14', 'en': 'Winter Semester 2013/14'},
'id': '2013Z'},
.
.
.
{'start_date': '2017-09-25',
'finish_date': '2018-02-19',
'end_date': '2018-01-29',
'order_key': 540,
'name': {'pl': 'Semest zimowy 2017/2018', 'en': 'Winter Semester 2017/18'}, 'id': '2017Z'}],
and then a second part looks like this
'groups':
{'2015Z':
[{'relationship_type': 'participant',
'course_name': {'pl': 'Algorytmy i struktury danych', 'en': 'Algorithms and Data Structures'},
'term_id': '2015Z'},
.
.
.
{'relationship_type': 'participant',
'course_name': {'pl': 'Wychowanie fizyczne 1', 'en': 'Gymnastics 1'},
'term_id': '2015Z'}]
The whole output is over 1000 words, so I decided to show it this way. My problem is that I can get any data from terms, but when I try to get data from groups, PyCharm says those are strings. My code looks like this:
data = polaczenie.get('/services/groups/user',
                      fields='course_name|class_type|class_type_id|group_number', format='json')
mylist = []
mylist2 = []
for i in data['terms']:
    mylist.append(i['id'])
print(mylist)
for i in data['groups']:
    mylist2.append(i['course_name'])
print(mylist2)
The first loop gets the data fine; however, the second gives me the following error:
mylist2.append(i['term_id'])
TypeError: string indices must be integers
As I understand the error, my JSON suddenly became a string? I don't know how to fix it; my goal is to get course_name and term_id.
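data['groups'] is a dictionary keyed by term id, so iterating over it yields the keys (strings such as '2015Z'), not the group dictionaries. Index into the term first:
for i in data['groups']['2015Z']:
    mylist2.append(i['course_name'])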
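
The loop above covers a single term. A minimal sketch for walking every term, assuming data['groups'] maps each term id to a list of group dictionaries as in the output shown in the question (the function name collect_courses and the trimmed sample data are illustrative):

```python
def collect_courses(data):
    """Return (term_id, course_name) pairs for every group in every term."""
    pairs = []
    for term_id, groups in data['groups'].items():
        for group in groups:
            pairs.append((term_id, group['course_name']))
    return pairs


# Sample shaped like the API output in the question, trimmed for brevity.
sample = {
    'groups': {
        '2015Z': [
            {'relationship_type': 'participant',
             'course_name': {'pl': 'Algorytmy i struktury danych',
                             'en': 'Algorithms and Data Structures'},
             'term_id': '2015Z'},
            {'relationship_type': 'participant',
             'course_name': {'pl': 'Wychowanie fizyczne 1', 'en': 'Gymnastics 1'},
             'term_id': '2015Z'},
        ],
    },
}
for term_id, name in collect_courses(sample):
    print(term_id, name['en'])
```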
