Redis Timeseries Pipeline with Python

I am looking to use a pipeline to insert data into a Redis TimeSeries but cannot find a way to call ts.add via a pipeline.
I can do a basic example with get/set:
import redis
import json
redis_client = redis.Redis(host='xxx.xxx.xxx.xxx', port='xxxxx', password='xxxx')
pipe = redis_client.pipeline()
pipe.set(1,'apple')
pipe.set(2,'orange')
pipe.execute()
I can't find a way to insert into a timeseries:
import redis
import json
redis_client = redis.Redis(host='xxx.xxx.xxx.xxx', port='xxxxx', password='xxxx')
pipe = redis_client.pipeline()
pipe.ts.add(TS1,1652683016,55) #<----- this is what I want to do!
pipe.ts.add(TS1,1652683017,59) #<----- this is what I want to do!
pipe.execute()

As of this writing (redis-py 4.3.1) there exists another pipeline object on the timeseries class itself. The following will work:
import redis
r = redis.Redis()
pipe = r.ts().pipeline()
pipe.add("TS1", 1, 123123123123)
pipe.add("TS1", 2, 123123123451)
...
pipe.add("TS1", 15, 123123126957)
pipe.execute()
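A minimal sketch applying that to the samples from the question (the connection details are placeholders, and the TS.MADD variant at the end assumes a redis-py release that exposes the RedisTimeSeries module commands, as 4.3.x does):
import redis

# Connection details are placeholders.
r = redis.Redis(host='xxx.xxx.xxx.xxx', port=6379, password='xxxx')

# Pipeline off the timeseries command namespace, as in the answer above.
pipe = r.ts().pipeline()
pipe.add("TS1", 1652683016, 55)
pipe.add("TS1", 1652683017, 59)
pipe.execute()

# TS.MADD is another way to batch samples without an explicit pipeline.
r.ts().madd([("TS1", 1652683018, 61), ("TS1", 1652683019, 64)])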

Related

how to speed up a splunk export?

I am using the Python 3 Splunk API to export some massive logs.
My code essentially follows the Splunk API guidelines:
import splunklib.client as client
import splunklib.results as results
import pandas as pd
kwargs_export = {"earliest_time": "2019-08-19T12:00:00.000-00:00",
"latest_time": "2019-08-19T14:00:00.000-00:00",
"search_mode": "normal"}
exportsearch_results = service.jobs.export(mysearchquery, **kwargs_export)
reader = results.ResultsReader(exportsearch_results)
df = pd.DataFrame(list(reader))
But this is extremely slow...
Ultimately I want to store the output of the search as a CSV on disk. Is there any way to speed up the export?
Thanks!
Try this instead; it creates a blocking search job and streams the results out as CSV:
import time
kwargs_export = {"earliest_time": "-1d",
                 "latest_time": "now",
                 "search_mode": "normal"}
service = client.connect(**args)
job = service.jobs.create(query, **kwargs_export)
# results() can only be fetched once the job has finished
while not job.is_done():
    time.sleep(2)
with open(filename, 'wb') as out_f:
    try:
        job_results = job.results(output_mode="csv", count=0)
        for result in job_results:
            out_f.write(result)
    except Exception:
        print("Session timed out. Reauthenticating")

read from text file and send to AWS SQS FIFO queue

I have a little issue here. I want to read from a text file using Python, create a queue, and then send the lines from the text file to Amazon Web Services SQS (Simple Queue Service). First of all, I've actually managed to do this using boto, but the problem is that the lines don't arrive in order, just randomly: line 4, line 1, line 5, and so on.
Here is my code:
import boto.sqs
conn = boto.sqs.connect_to_region("us-east-2",
                                  aws_access_key_id='YOUR_ACCESS_KEY',
                                  aws_secret_access_key='YOUR_SECRET_KEY')
q = conn.create_queue('test')
with open('read.txt', 'r') as read_file:
    from boto.sqs.message import RawMessage
    for line in read_file:
        m = RawMessage()
        m.set_body(line)
        q.write(m)
So, what to do? Well, we need to create a FIFO queue (which I also managed to do using boto3 in Python), but now the problem is that I have trouble reading the text file. Here is the code I used to create a FIFO queue in SQS:
import boto3
AWS_ACCESS_KEY = 'YOUR_ACCESS_KEY'
AWS_SECRET_ACCESS_KEY = 'YOUR_SECRET_KEY'
sqs_client = boto3.resource(
    'sqs',
    aws_access_key_id=AWS_ACCESS_KEY,
    aws_secret_access_key=AWS_SECRET_ACCESS_KEY,
    region_name='us-east-2'
)
queue_name = 'demo_queue.fifo'
response = sqs_client.create_queue(
    QueueName=queue_name,
    Attributes={
        'FifoQueue': 'true',
        'ContentBasedDeduplication': 'true'
    }
)
with open('read.txt', 'r') as read_file:
    from boto.sqs.message import RawMessage
    for line in read_file:
        m = RawMessage()
        m.set_body(line)
        queue_name.write(m)
Does somebody know how to solve this? Thanks.
The problem is in
queue_name.write(m)
in your second piece of code: queue_name is just a string, so it cannot send anything. Use the actual queue object instead, either the one returned by create_queue (response above) or the one returned by get_queue_by_name, and call its send_message method, as sketched below.
Also, when only specifying MessageBody and MessageGroupId in boto3, make sure that content-based deduplication is enabled for the queue, or specify a MessageDeduplicationId string; otherwise the send will fail.
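A minimal boto3-only sketch of that advice, assuming the FIFO queue from the question already exists (the MessageGroupId value "lines" is made up; within a single message group, SQS preserves ordering):
import boto3

sqs = boto3.resource('sqs', region_name='us-east-2')
queue = sqs.get_queue_by_name(QueueName='demo_queue.fifo')

with open('read.txt', 'r') as read_file:
    for line in read_file:
        queue.send_message(
            MessageBody=line,
            MessageGroupId='lines'  # one group => strict FIFO ordering
            # No MessageDeduplicationId needed because the queue was created
            # with ContentBasedDeduplication enabled; note that identical
            # lines sent within the 5-minute window would then be deduplicated.
        )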

pySpark Kafka Direct Streaming update Zookeeper / Kafka Offset

Currently I'm working with Kafka/Zookeeper and pySpark (1.6.0).
I have successfully created a Kafka consumer, which uses KafkaUtils.createDirectStream().
The streaming itself works fine, but I noticed that my Kafka topics are not updated to the current offset after I have consumed some messages.
Since we need the topics kept up to date to have monitoring in place here, this is a problem.
In the documentation of Spark I found this comment:
offsetRanges = []

def storeOffsetRanges(rdd):
    global offsetRanges
    offsetRanges = rdd.offsetRanges()
    return rdd

def printOffsetRanges(rdd):
    for o in offsetRanges:
        print "%s %s %s %s" % (o.topic, o.partition, o.fromOffset, o.untilOffset)

directKafkaStream\
    .transform(storeOffsetRanges)\
    .foreachRDD(printOffsetRanges)
You can use this to update Zookeeper yourself if you want Zookeeper-based Kafka monitoring tools to show progress of the streaming application.
Here is the documentation:
http://spark.apache.org/docs/1.6.0/streaming-kafka-integration.html#approach-2-direct-approach-no-receivers
I found a solution in Scala, but I can't find an equivalent for Python.
Here is the Scala example: http://geeks.aretotally.in/spark-streaming-kafka-direct-api-store-offsets-in-zk/
Question
So the question is: how am I able to update Zookeeper from that point on?
I wrote some functions to save and read Kafka offsets with the Python kazoo library.
First, a function to get a singleton Kazoo client:
ZOOKEEPER_SERVERS = "127.0.0.1:2181"

def get_zookeeper_instance():
    from kazoo.client import KazooClient
    if 'KazooSingletonInstance' not in globals():
        globals()['KazooSingletonInstance'] = KazooClient(ZOOKEEPER_SERVERS)
        globals()['KazooSingletonInstance'].start()
    return globals()['KazooSingletonInstance']
Then functions to read and write offsets:
def read_offsets(zk, topics):
    from pyspark.streaming.kafka import TopicAndPartition
    from_offsets = {}
    for topic in topics:
        for partition in zk.get_children(f'/consumers/{topic}'):
            topic_partition = TopicAndPartition(topic, int(partition))
            offset = int(zk.get(f'/consumers/{topic}/{partition}')[0])
            from_offsets[topic_partition] = offset
    return from_offsets

def save_offsets(rdd):
    zk = get_zookeeper_instance()
    for offset in rdd.offsetRanges():
        path = f"/consumers/{offset.topic}/{offset.partition}"
        zk.ensure_path(path)
        zk.set(path, str(offset.untilOffset).encode())
Then, before starting the streaming, you can read the offsets from Zookeeper and pass them to createDirectStream
as the fromOffsets argument:
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

def main(brokers="127.0.0.1:9092", topics=['test1', 'test2']):
    sc = SparkContext(appName="PythonStreamingSaveOffsets")
    ssc = StreamingContext(sc, 2)
    zk = get_zookeeper_instance()
    from_offsets = read_offsets(zk, topics)
    directKafkaStream = KafkaUtils.createDirectStream(
        ssc, topics, {"metadata.broker.list": brokers},
        fromOffsets=from_offsets)
    directKafkaStream.foreachRDD(save_offsets)

if __name__ == "__main__":
    main()
I encountered a similar question.
You are right: using directStream means using the Kafka low-level API directly, which doesn't update the reader offset.
There are a couple of examples for Scala/Java around, but not for Python.
It's easy to do it yourself, though; what you need to do is:
read the offsets at the beginning
save the offsets at the end
For example, I save the offset for each partition in Redis by doing:
stream.foreachRDD(lambda rdd: save_offset(rdd))

def save_offset(rdd):
    ranges = rdd.offsetRanges()
    for rng in ranges:
        rng.untilOffset  # save this offset somewhere
Then at startup you can use:
fromoffset = {}
topic_partition = TopicAndPartition(topic, partition)
fromoffset[topic_partition] = int(value)  # the int value read from wherever you stored it previously
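A hedged sketch of how those two pieces could fit together with Redis as the offset store (the key naming scheme and the fixed partition list are assumptions for illustration):
import redis
from pyspark.streaming.kafka import TopicAndPartition

r = redis.StrictRedis(host='127.0.0.1', port=6379)

def save_offset(rdd):
    # Persist the last processed offset for every topic/partition in the batch.
    for rng in rdd.offsetRanges():
        r.set("offset:%s:%d" % (rng.topic, rng.partition), rng.untilOffset)

def load_offsets(topics, partitions):
    # Rebuild the fromOffsets dict expected by KafkaUtils.createDirectStream.
    fromoffset = {}
    for topic in topics:
        for partition in partitions:
            value = r.get("offset:%s:%d" % (topic, partition))
            if value is not None:
                fromoffset[TopicAndPartition(topic, partition)] = int(value)
    return fromoffset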
For tools that use Zookeeper to track offsets, it's better to save the offsets in Zookeeper.
This page:
https://community.hortonworks.com/articles/81357/manually-resetting-offset-for-a-kafka-topic.html
describes how to set the offset; basically, the Zookeeper node is:
/consumers/[consumer_name]/offsets/[topic name]/[partition id]
Since we are using directStream, you have to make up a consumer name.
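For that layout, a short kazoo-based sketch might look like this (the consumer name "my-direct-consumer" is made up, as the answer says you have to invent one):
from kazoo.client import KazooClient

zk = KazooClient(hosts="127.0.0.1:2181")
zk.start()

def commit_offset_to_zk(topic, partition, offset, consumer="my-direct-consumer"):
    # Mirrors the node layout used by Zookeeper-based monitoring tools.
    path = "/consumers/%s/offsets/%s/%d" % (consumer, topic, partition)
    zk.ensure_path(path)
    zk.set(path, str(offset).encode())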

cURL method in Python for JSON feed [duplicate]

This question already has answers here:
How to download a file over HTTP?
(30 answers)
Closed 7 years ago.
While building a Flask website, I'm using an external JSON feed to populate the local MongoDB with content. This feed is parsed and fed in while repurposing keys from the JSON as keys in Mongo.
One of the available keys from the feed is called "img_url" and contains, guess what, a URL to an image.
Is there a way, in Python, to mimic a PHP-style cURL? I'd like to grab that key, download the image, and store it somewhere locally while keeping the other associated keys, and have that as an entry to my db.
Here is my script up to now:
import json
import sys
import urllib2
from datetime import datetime
import pymongo
import pytz
from utils import slugify
# from utils import logger
client = pymongo.MongoClient()
db = client.artlogic
def fetch_artworks():
    # logger.debug("downloading artwork data from Artlogic")
    AL_artworks = []
    AL_artists = []
    url = "http://feeds.artlogic.net/artworks/artlogiconline/json/"
    while True:
        f = urllib2.urlopen(url)
        data = json.load(f)
        AL_artworks += data['rows']
        # logger.debug("retrieved page %s of %s of artwork data" % (data['feed_data']['page'], data['feed_data']['no_of_pages']))
        # Stop: we are at the last page
        if data['feed_data']['page'] == data['feed_data']['no_of_pages']:
            break
        url = data['feed_data']['next_page_link']
    # Now we have a list called 'AL_artworks' in which all the descriptions are stored.
    # We are going to put them into the mongoDB database,
    # making sure that if the artwork is already encoded (an object with the same id
    # is already in the database) we update the existing description instead of
    # inserting a new one ('upsert').
    # logger.debug("updating local mongodb database with %s entries" % len(AL_artworks))
    for artwork in AL_artworks:
        # Mongo does not like keys that have a dot in their name;
        # this property does not seem to be used anyway, so let us
        # delete it:
        if 'artworks.description2' in artwork:
            del artwork['artworks.description2']
        # upsert into the database:
        db.AL_artworks.update({"id": artwork['id']}, artwork, upsert=True)
        # artwork['artist_id'] is not functioning properly
        db.AL_artists.update({"artist": artwork['artist']},
                             {"artist_sort": artwork['artist_sort'],
                              "artist": artwork['artist'],
                              "slug": slugify(artwork['artist'])},
                             upsert=True)
    # db.meta.update({"subject": "artworks"}, {"updated": datetime.now(pytz.utc), "subject": "artworks"}, upsert=True)
    return AL_artworks

if __name__ == "__main__":
    fetch_artworks()
First, you might like the requests library.
Otherwise, if you want to stick to the stdlib, it will be something along the lines of:
import os
import urllib2
import uuid

def fetchfile(url, dst):
    fi = urllib2.urlopen(url)
    with open(dst, 'wb') as fo:
        while True:
            chunk = fi.read(4096)
            if not chunk:
                break
            fo.write(chunk)

fetchfile(
    data['feed_data']['next_page_link'],
    os.path.join('/var/www/static', uuid.uuid1().get_hex())
)
Add the appropriate exception handling around it (I can expand on that if you want, but I'm sure the documentation will be clear enough).
You could also put fetchfile() into a pool of async jobs to fetch many files at once; see the sketch after the links below.
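Since requests was recommended above but not shown, here is a hedged requests-based variant for the img_url use case from the question (the destination directory and the local_img key are illustrative):
import os
import uuid
import requests

def fetch_image(img_url, dst_dir='/var/www/static'):
    # Stream the image to disk and return the local path so it can be stored
    # in Mongo alongside the other keys from the feed.
    local_path = os.path.join(dst_dir, uuid.uuid1().hex)
    response = requests.get(img_url, stream=True, timeout=30)
    response.raise_for_status()
    with open(local_path, 'wb') as fo:
        for chunk in response.iter_content(chunk_size=4096):
            fo.write(chunk)
    return local_path

# e.g. inside the artwork loop of fetch_artworks():
# artwork['local_img'] = fetch_image(artwork['img_url'])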
https://docs.python.org/2/library/json.html
https://docs.python.org/2/library/urllib2.html
https://docs.python.org/2/library/tempfile.html
https://docs.python.org/2/library/multiprocessing.html
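And a sketch of the "pool of async jobs" idea with multiprocessing, reusing fetchfile() from above (the pool size and the url/destination pairs are illustrative):
from multiprocessing import Pool

def fetch_one(job):
    url, dst = job
    fetchfile(url, dst)

# Build (url, destination) pairs, here from the artworks list in the question.
jobs = [(artwork['img_url'],
         os.path.join('/var/www/static', uuid.uuid1().get_hex()))
        for artwork in AL_artworks]

pool = Pool(processes=8)  # fetch up to 8 images concurrently
pool.map(fetch_one, jobs)
pool.close()
pool.join()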

PyMongo/Mongoengine equivalent of mongodump

Is there an equivalent function in PyMongo or mongoengine to MongoDB's mongodump? I can't seem to find anything in the docs.
Use case: I need to periodically back up a remote Mongo database. The local machine is a production server that does not have Mongo installed, and I do not have admin rights, so I can't use subprocess to call mongodump. I could install the Mongo client locally in a virtualenv, but I'd prefer an API call.
Thanks a lot :-).
For my relatively small database, I eventually used the following solution. It's not really suitable for big or complex databases, but it suffices for my case. It dumps every collection's documents as JSON into the backup directory. It's clunky, but it does not rely on anything other than pymongo.
from os.path import join

import pymongo
from bson.json_util import dumps

def backup_db(backup_db_dir):
    client = pymongo.MongoClient(host=<host>, port=<port>)
    database = client[<db_name>]
    authenticated = database.authenticate(<uname>, <pwd>)
    assert authenticated, "Could not authenticate to database!"
    collections = database.collection_names()
    for i, collection_name in enumerate(collections):
        col = getattr(database, collections[i])
        collection = col.find()
        jsonpath = collection_name + ".json"
        jsonpath = join(backup_db_dir, jsonpath)
        with open(jsonpath, 'wb') as jsonfile:
            jsonfile.write(dumps(collection))
The accepted answer no longer works: database.authenticate has been removed from recent PyMongo versions. Here is revised code:
from os.path import join

import pymongo
from bson.json_util import dumps

def backup_db(backup_db_dir):
    client = pymongo.MongoClient(host=..., port=..., username=..., password=...)
    database = client[<db_name>]
    collections = database.collection_names()
    for i, collection_name in enumerate(collections):
        col = getattr(database, collections[i])
        collection = col.find()
        jsonpath = collection_name + ".json"
        jsonpath = join(backup_db_dir, jsonpath)
        with open(jsonpath, 'wb') as jsonfile:
            jsonfile.write(dumps(collection).encode())

backup_db('.')
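For completeness, a hedged sketch of the reverse operation; restore_db is not part of the answers above, and it mirrors their placeholder style. It reads each <collection>.json dump back in with bson.json_util.loads and insert_many:
import os
from os.path import join

import pymongo
from bson.json_util import loads

def restore_db(backup_db_dir):
    client = pymongo.MongoClient(host=..., port=..., username=..., password=...)
    database = client[<db_name>]
    for filename in os.listdir(backup_db_dir):
        if not filename.endswith('.json'):
            continue
        collection_name = filename[:-len('.json')]
        with open(join(backup_db_dir, filename), 'rb') as jsonfile:
            documents = loads(jsonfile.read())
        if documents:
            database[collection_name].insert_many(documents)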
