I have 5 GB of data serialized with Apache Thrift and a .thrift file describing the data format. I have tried using thriftpy and the official thrift package, but I can't wrap my head around how to open the files.
The data is the expanded dataset from http://www.iesl.cs.umass.edu/data/wiki-links
A description of the data format can be found here: https://code.google.com/p/wiki-link/wiki/ExpandedDataset
The Scala setup can be found in the ThriftSerializerFactory.scala file. Since the naming of most things is consistent across the Thrift libraries, you can more or less model your Python code on the Scala example:
package edu.umass.cs.iesl.wikilink.expanded.process
import org.apache.thrift.protocol.TBinaryProtocol
import org.apache.thrift.transport.TIOStreamTransport
import java.io.File
import java.io.BufferedOutputStream
import java.io.FileOutputStream
import java.io.BufferedInputStream
import java.io.FileInputStream
import java.util.zip.{GZIPOutputStream, GZIPInputStream}
object ThriftSerializerFactory {

  def getWriter(f: File) = {
    val stream = new BufferedOutputStream(new GZIPOutputStream(new FileOutputStream(f)), 2048)
    val protocol = new TBinaryProtocol(new TIOStreamTransport(stream))
    (stream, protocol)
  }

  def getReader(f: File) = {
    val stream = new BufferedInputStream(new GZIPInputStream(new FileInputStream(f)), 2048)
    val protocol = new TBinaryProtocol(new TIOStreamTransport(stream))
    (stream, protocol)
  }
}
You basically set up a stream transport and the binary protocol. If you leave the data compressed, you will have to add the gzip piece to the puzzle, but once the data are decompressed this should not be needed anymore.
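For reference, a rough Python equivalent of getReader() might look like the following minimal sketch, assuming the official thrift package; the gzip.open() call can be dropped once the files are decompressed:

# Sketch of a Thrift reader factory in Python (assumes the official `thrift`
# package). Mirrors the Scala getReader(): gzip stream -> buffered transport
# -> binary protocol.
import gzip
from thrift.protocol import TBinaryProtocol
from thrift.transport import TTransport

def get_reader(path):
    stream = gzip.open(path, 'rb')  # plain open() if the file is already decompressed
    transport = TTransport.TBufferedTransport(TTransport.TFileObjectTransport(stream))
    protocol = TBinaryProtocol.TBinaryProtocol(transport)
    return stream, protocol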
The code in WikiLinkItemIterator.scala shows how to read the data files using the factory class above.
class PerFileWebpageIterator(f: File) extends Iterator[WikiLinkItem] {

  var done = false
  val (stream, proto) = ThriftSerializerFactory.getReader(f)
  private var _next: Option[WikiLinkItem] = getNext()

  private def getNext(): Option[WikiLinkItem] = try {
    Some(WikiLinkItem.decode(proto))
  } catch {case _: TTransportException => {done = true; stream.close(); None}}

  def hasNext(): Boolean = !done && (_next != None || {_next = getNext(); _next != None})

  def next(): WikiLinkItem = if (hasNext()) _next match {
    case Some(wli) => {_next = None; wli}
    case None => {throw new Exception("Next on empty iterator.")}
  } else throw new Exception("Next on empty iterator.")
}
Steps to implement (a Python sketch follows below):
1. implement a Thrift protocol stack factory like the one above (a recommendable pattern, by the way)
2. instantiate the root element of each record, in our case a WikiLinkItem
3. call instance.read(proto) to read one record of data
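A minimal Python sketch putting those steps together, reusing the get_reader() helper sketched above. It assumes you have first generated Python classes from the .thrift file (e.g. with thrift --gen py), so the module and field names below are illustrative, not authoritative:

# Sketch only: the generated module name and record fields depend on your
# .thrift schema and generated code.
from thrift.transport.TTransport import TTransportException
from wikilink.ttypes import WikiLinkItem  # produced by `thrift --gen py`

def iter_items(path):
    stream, proto = get_reader(path)
    try:
        while True:
            item = WikiLinkItem()        # root element of each record
            try:
                item.read(proto)         # reads exactly one record
            except (EOFError, TTransportException):
                break                    # end of stream reached
            yield item
    finally:
        stream.close()

for item in iter_items('part-00000.gz'):
    print(item.doc_id)                   # field names come from the .thrift file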
Related
Here is part of my Python code for sending email via AWS SES (send_raw_email):
response = boto3.client('ses').send_raw_email(
    FromArn='arn:aws:ses:us-east-1:123456789012:identity/example.com',
    SourceArn='arn:aws:ses:us-east-1:123456789012:identity/example.com',
    RawMessage={
        'Data': msg
    },
)
This is working as expected. My problem is that I also need to do this in my Java code, but I'm confused about how to incorporate it. I've been trying, but it doesn't seem to work. This is the existing Java part below:
ByteArrayOutputStream outputStream = new ByteArrayOutputStream();
message.writeTo(outputStream);
RawMessage rawMessage = new RawMessage(ByteBuffer.wrap(outputStream.toByteArray()));
SendRawEmailRequest rawEmailRequest = new SendRawEmailRequest(rawMessage);
client.sendRawEmail(rawEmailRequest);
I think the FromArn and SourceArn should be incorporated into the rawMessage or rawEmailRequest, but I couldn't make it work. At the top of the code, there are values declared like this:
public class SESEMail {
    static final String FROM = "example@web.com";
    static final String key = Config.key;
    static final String privatekey = Config.privateKey;
    static Logger logger = Logger.getLogger(SESEMail.class);
    public static Variables variables;
I've been reading this, but I'm still confused about how the Java API works: http://javadox.com/com.amazonaws/aws-java-sdk-ses/1.10.29/com/amazonaws/services/simpleemail/model/SendRawEmailRequest.html#getSourceArn()
I'm trying to utilize a database from another program in a PHP-based website tool. Apparently the original was built in Python and puts some of its data into a Python tuple, which it serializes and stores as a blob in the SQL table.
I'm not a Python programmer, so I'm not even sure how to see what is in this blob, but I do know that some of the 'type' indicators for the data field are stored in there, and I want to extract them and anything else useful.
Is there any way to 'unserialize' a Python tuple in PHP?
The blob data turned out to be a pickled tuple (part of the reason I despise Python - both data types that only Python can read! Python programmers: 'Standardized conventions? Who needs standardized conventions?!?!')
I came up with a kludgy way to 'unpickle' the data and JSON-serialize it using a command-line call. To get the binary blob data into the command line, I base64-encode it. It's janky, but it works for what I need:
/**
 * use a python exec call to 'unpickle' the blob_data
 * to get the binary blob into a command line argument, base64 encode it
 * to get the data back out of python, json serialize it
 * @param string $blob binary blob data
 * @return mixed
 */
public static function unpickle($blob) {
    $cmd = sprintf("import pickle; import base64; import json; print(json.dumps(pickle.loads(base64.b64decode('%s'))))", base64_encode($blob));
    $pcmd = sprintf("python -c \"%s\"", $cmd);
    $result = exec($pcmd);
    $resdec = json_decode($result);
    return $resdec;
}
Playing with this concept a little more gave me a few more alternatives. First, I took the command-line version above and turned it into a slightly more capable Python script:
unpickle.py:
#!/usr/bin/env python3
import pickle
import json
import sys
import base64
import select

def isBase64(s):
    try:
        return s == base64.b64encode(base64.b64decode(s)).decode('ascii')
    except Exception:
        return False

bblob = None
if (len(sys.argv) > 1) and isBase64(sys.argv[1]):
    # base64-encoded blob passed as a command-line argument
    bblob = base64.b64decode(sys.argv[1])
elif select.select([sys.stdin, ], [], [], 0.0)[0]:
    # raw blob piped in on stdin
    try:
        with open(0, 'rb') as f:
            bblob = f.read()
    except Exception as e:
        sys.exit(str(e))  # report the error and exit non-zero

if bblob is not None:
    unpik = pickle.loads(bblob)
    jsout = json.dumps(unpik)
    print(jsout)
This script lets you either pass the pickled tuple's bytes as a base64-encoded string on the command line, or pipe the raw blob data into the script. Both variations output JSON if the data is valid and properly formatted (null if not).
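If you want to sanity-check the script, something like the following works; the tuple contents and file name here are made up purely for the test:

# Round-trip test for unpickle.py (illustrative data and paths only).
import base64
import pickle
import subprocess

blob = pickle.dumps(("widget", 42, ["a", "b"]))   # stand-in for the DB blob

# Variant 1: pass the blob as a base64-encoded command-line argument
arg = base64.b64encode(blob).decode("ascii")
print(subprocess.run(["python3", "unpickle.py", arg],
                     capture_output=True, text=True).stdout)

# Variant 2: pipe the raw bytes to stdin
out = subprocess.run(["python3", "unpickle.py"], input=blob, capture_output=True)
print(out.stdout.decode())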
You can convert this into a self-contained binary to drop onto systems without Python using pyinstaller -F if need be. Because I may be running it on a system with the PyInstaller binary, one with the Python script, or one with just Python, I created the following static methods in my Laravel model (I'll eventually move them into a service module):
/**
 * call either a pyinstaller binary or python script with raw blob data to be unpickled
 *
 * @param string $b binary data of blob
 * @return false|mixed
 */
public static function unpickle($b)
{
    $cmd = base_path(env('UNPICKLE_BINARY', 'bin/unpickle'));
    if (!(is_file($cmd) && is_executable($cmd))) { // make sure unpickle cmd exists
        // check for UNPICKLE_BINARY with .py after and python binary
        $pyExe = env('PYTHON_EXE', '/usr/bin/python');
        if (is_file($cmd.".py") && (is_file($pyExe) && is_executable($pyExe))) {
            $cmd = sprintf("%s %s.py", $pyExe, $cmd);
        } else {
            return static::unpyckle($b); // try direct python call
        }
    }
    $descriptorspec = [
        ["pipe", "r"],
        ["pipe", "w"],
        ["pipe", "w"]
    ];
    $cwd = dirname($cmd);
    $env = [];
    $process = proc_open($cmd, $descriptorspec, $pipes, $cwd, $env);
    if (is_resource($process)) {
        fwrite($pipes[0], $b);
        fclose($pipes[0]);
        $output = stream_get_contents($pipes[1]);
        fclose($pipes[1]);
        $return_value = proc_close($process);
        if (static::isJson($output)) {
            return json_decode($output);
        } else {
            return false;
        }
    }
    return false;
}
/**
 * use a python exec call to 'unpickle' the blob_data
 * to get the binary blob into a command line argument, base64 encode it
 * to get the data back out of python, json serialize it
 * @param string $blob binary blob data
 * @return mixed
 */
public static function unpyckle($blob) {
    $pyExe = env('PYTHON_EXE', '/usr/bin/python');
    if (!(is_file($pyExe) && is_executable($pyExe))) {
        throw new Exception('python executable not found!');
    }
    $bblob = base64_encode($blob);
    $cmd = sprintf("import pickle; import base64; import json; print(json.dumps(pickle.loads(base64.b64decode('%s'))))", $bblob);
    $pcmd = sprintf("%s -c \"%s\"", $pyExe, $cmd);
    $result = exec($pcmd);
    $resdec = json_decode($result);
    return $resdec;
}
/**
 * try to detect if a string is a json string
 *
 * @param $str
 * @return bool
 */
public static function isJson($str) {
    if (is_string($str) && !empty($str)) {
        json_decode($str);
        return (json_last_error() == JSON_ERROR_NONE);
    }
    return false;
}
example .env values:
UNPICKLE_BINARY=bin/unpickle
PYTHON_EXE=/usr/bin/python3
This basically shows three different ways to call Python to do essentially the same thing.
I have a snippet of code written in Python which I have to rewrite in C# with the help of Newtonsoft JSON:
def getconfig():
    with open("config.json") as f:
        return json.loads(f.read())
and I have to be able to access the elements of the JSON in code just as shown in this snippet:
def getcaptcha(proxy):
    while(True):
        ret = s.get("{}{}&pageurl={}"
                    .format(getconfig()['host'], getconfig()['key'], getconfig()['googlekey'], getconfig()['pageurl'])).text
So far I've come up with this idea:
public bool GetConfig()
{
    JObject config = JObject.Parse(File.ReadAllText("path/config.json"));
    using (StreamReader file = File.OpenText("path/config.json"))
    using (JsonTextReader reader = new JsonTextReader(file))
    {
        JObject getConfig = (JObject) JToken.ReadFrom(reader);
    }
    return true;
}
but it doesn't seem to work, since I can't access it in later code as
GetConfig()['host']
for example. I can't even Console.WriteLine my JSON.
First, define your GetConfig function as:
public JObject GetConfig()
{
    return JObject.Parse(File.ReadAllText("path/config.json"));
}
Then call that function and access the elements defined in the JSON file like:
JObject config = GetConfig();
string host = config.SelectToken("host").Value<string>();
string key = config.SelectToken("key").Value<string>();
Here is a config file that represents a protobuf object:
model {
  faster_rcnn {
    num_classes: 37
    image_resizer {
      keep_aspect_ratio_resizer {
        min_dimension: 600
        max_dimension: 1024
      }
    }
    ...
I wish to use python to read this into an object, make some changes to some of the values and then write it back to a config file.
A complicating factor is that this is a rather large object constructed from many .proto files.
I was able to accomplish the task by converting the protobuf to a dictionary, making edits and then converting back to a protobuf, like so:
import tensorflow as tf
from google.protobuf.json_format import MessageToDict
from google.protobuf.json_format import ParseDict
from google.protobuf import text_format
from object_detection.protos import pipeline_pb2

def get_configs_from_pipeline_file(pipeline_config_path, config_override=None):
    '''
    read .config and convert it to proto_buffer_object
    '''
    pipeline_config = pipeline_pb2.TrainEvalPipelineConfig()
    with tf.gfile.GFile(pipeline_config_path, "r") as f:
        proto_str = f.read()
        text_format.Merge(proto_str, pipeline_config)
    if config_override:
        text_format.Merge(config_override, pipeline_config)
    return pipeline_config

configs = get_configs_from_pipeline_file('faster_rcnn_resnet101_pets.config')
d = MessageToDict(configs)
d['model']['fasterRcnn']['numClasses'] = 999
config2 = pipeline_pb2.TrainEvalPipelineConfig()
c = ParseDict(d, config2)
s = text_format.MessageToString(c)

with open('/path/test.config', 'w+') as fh:
    fh.write(str(s))
I would like to be able to make edits to the protobuf object directly, without having to convert to a dictionary. The problem, however, is that it is not clear how to "walk the DOM" to discover the correct references to the variables whose values I would like to change. This is especially true when there are multiple .proto files involved.
I was able to make the edit like so, but I am hoping there is a better way:
configs.model.ListFields()[0][1].num_classes = 99
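One alternative, for what it's worth: the generated protobuf classes expose nested messages by their field names, so you can usually edit values in place and serialize straight back to text format. A sketch, assuming the same configs object loaded above:

# Sketch: edit fields in place via the generated attribute names, then write
# the message back out in text format (no dictionary round trip needed).
from google.protobuf import text_format

configs.model.faster_rcnn.num_classes = 99
configs.model.faster_rcnn.image_resizer.keep_aspect_ratio_resizer.min_dimension = 512

with open('/path/test.config', 'w') as fh:
    fh.write(text_format.MessageToString(configs))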
I followed this post:
Multiple Output Files for Hadoop Streaming with Python Mapper
for generating multiple output files in Hadoop streaming, and I am getting that.
I wanted my structure to look like this:
date=<date1>
    code=1
    code=2
    code=3
date=<date2>
    ...
But inside code=1 (and the other directories) everything is written to a single file, and since my data is very large the job takes a very long time to complete.
Is there any workaround for that?
package com.custom;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat;

public class CustomMultiOutputFormat extends MultipleTextOutputFormat<Text, Text> {

    @Override
    protected String generateFileNameForKeyValue(Text key, Text value, String leaf) {
        // The key is expected to look like "<date>/<code>"; build the output
        // path "date=<date>/code=<code>/<leaf>" from it.
        String key_temp, date, code, key_final;
        key_temp = key.toString();
        String[] arr = key_temp.split("/");
        date = "date=" + arr[0];
        code = "code=" + arr[1];
        key_final = date + "/" + code;
        Text t1 = new Text(key_final);
        return new Path(t1.toString(), leaf).toString();
    }

    @Override
    protected Text generateActualKey(Text key, Text value) {
        // Drop the key from the output records; only the value gets written.
        return null;
    }
}
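Since this output format splits the key on "/", the streaming mapper just needs to emit keys of the form <date>/<code>. A small sketch of such a mapper follows; which input columns hold the date and code is an assumption about your data layout:

#!/usr/bin/env python3
# Streaming mapper sketch: emit "<date>/<code>" keys so the output format
# above can route records into date=<date>/code=<code>/ directories.
import sys

for line in sys.stdin:
    fields = line.rstrip("\n").split("\t")
    date, code, payload = fields[0], fields[1], "\t".join(fields[2:])
    print("%s/%s\t%s" % (date, code, payload))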