How can I parse this tab-delimited report response? Amazon MWS - Python

I am trying to make a web app with Amazon MWS. Users can add, list, and delete their products in this app and also list their orders.
For listing their products I am trying to use the Reports API with "_GET_MERCHANT_LISTINGS_DATA_". But this method returns a rather unwieldy tab-delimited response. Also, when I make a request with the RequestReport operation, it sends the listings report to the shop owner.
This is the dummy response example:
b'item-name\titem-description\tlisting-id\tseller-sku\tprice\tquantity\topen-date\timage-url\titem-is-marketplace\tproduct-id-type\tzshop-shipping-fee\titem-note\titem-condition\tzshop-category1\tzshop-browse-path\tzshop-storefront-feature\tasin1\tasin2\tasin3\twill-ship-internationally\texpedited-shipping\tzshop-boldface\tproduct-id\tbid-for-featured-placement\tadd-delete\tpending-quantity\tfulfillment-channel\nPropars deneme urunu 2 CD-ROM CD-ROM\tThis is a development test product\t0119QL9BRT8\tPARS12344321\t0.5\t9\t2016-01-19 05:26:44 PST\t\ty\t4\t\t\t11\t\t\t\tB01ATBY2NA\t\t\t1\t\t\t8680925084020\t\t\t0\tDEFAULT\n'
Does anybody know another method for listing the products in a shop, or what do you suggest for getting better results from the "_GET_MERCHANT_LISTINGS_DATA_" report?
Or how can I parse this tab-delimited string?
Thanks.

You really have two options for obtaining bulk data from Amazon: tab-delimited and XML. Tab-delimited actually reads into Excel quite nicely, and the routines for splitting the values into a usable format are pretty straightforward. Unfortunately Amazon doesn't give you the option of XML or flat file on every report, so you have to use a mix of both under most circumstances.
First off, your title indicates you need to list all listings, active and inactive. That is going to take a combination of reports. If you want an 'all inclusive' view including problem listings, active listings, hidden listings and cancelled listings, you will need three reports:
_GET_MERCHANT_LISTINGS_DATA_
_GET_MERCHANT_CANCELLED_LISTINGS_DATA_
_GET_MERCHANT_LISTINGS_DEFECT_DATA_
All of these are in flat file format, so you will have a consistent method of reading the data in. In C#, you simply read a line, split it, and read each array value. There is a similar way of doing this in Python, which is most likely well documented here on SO. The C# version would look something like this:
while ((line = file.ReadLine()) != null)
{
    if (counter == 0)
    {
        string[] tempParts = line.Split(delimiters);
        for (int i = 0; i < tempParts.Length; i++)
        {
            tempParts[i] = tempParts[i].Trim(); // Clean up remaining whitespace.
        }
        // Try and verify headers have not changed.
        if (!isReportHeaderValid(tempParts))
        {
            reportStatus.IsError = true;
            reportStatus.Exception = new Exception("Report Column headers were not validated!");
            return;
        }
        counter++;
        continue;
    }
    counter++;
    string[] parts = line.Split(delimiters);
    for (int i = 0; i < parts.Length; i++)
    {
        parts[i] = parts[i].Trim(); // Clean up remaining whitespace.
    }
    // Do stuff with parts[1], parts[2] etc.
}
This is an example from one of my pieces of code working with the Amazon Inventory report. Basically, I verify that headers are what I would expect them to be (indicating the report format has not changed) and then I split, clean up white space, and work with each element from split.
In Python, the equivalent is the str.split method, or the csv module with a tab delimiter; a rough sketch is below.
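As an illustration only (my sketch, not tested against a live MWS response, and the iso-8859-1 encoding is an assumption about how the report body is encoded), the Python side could read the flat file like this:

import csv
import io

def parse_listing_report(report_text):
    # The first line of the report carries the column headers, so DictReader
    # yields one dict per listing keyed by those headers.
    reader = csv.DictReader(io.StringIO(report_text), delimiter='\t')
    return list(reader)

# rows = parse_listing_report(response_bytes.decode('iso-8859-1'))
# for row in rows:
#     print(row['seller-sku'], row['price'], row['quantity'])

The same header-validation step from the C# example is worth keeping: check that reader.fieldnames matches the columns you expect before trusting the rest of the file.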
Alternatively, you could probably take the entire stream and paste it straight into an Excel spreadsheet, since Excel understands how to delimit on tabs.
Edit
Just a note: in my code example I pass 'delimiters' into the split routine, but I never define it. It is defined as char[] delimiters = new char[] { '\t' };

Related

Deserializing a Streamed Protocol Buffer Message With Header and Repeated fields

I am working on deserializing a log file that has been serialized in C using protocol buffers (and NanoPB).
The log file has a short header composed of: entity, version, and identifier. After the header, the stream of data should be continuous, and it should log the fields from the sensors but not the header values (those should occur only once, at the beginning). The same .proto file was used to serialize the file; I do not have separate .proto files for the header and for the streamed data.
With my implementation, I expect the log to look like this:
firmware "1.0.0"
GUID "1231214211321" (example)
Timestamp 123123
Sens1 2343
Sens2 13123
Sens3 13443
Sens4 1231
Sens5 190
Timestamp 123124
Sens1 2345
Sens2 2312
...
I posted this question to figure out how to structure the .proto file initially, when I was implementing the serialization in C. In the end I used a similar approach but did not include the [(nanopb).max_count = 1]; option.
Finally I opted for the following .proto in Python (there can be more than 5 sensors):
syntax = "proto3";
import "timestamp.proto";
message SessionLogs {
int32 Entity = 1;
string Version = 2;
string GUID = 3;
repeated SessionLogsDetail LogDetail = 4;
}
message SessionLogsDetail
{
int32 DataTimestamp = 1; // internal counter to identify the order of session logs
// Sensor data, there can be X amount of sensors.
int32 sens1 = 2;
int32 sens2= 3;
int32 sens3= 4;
int32 sens4= 5;
}
At this point, I can serialize a message as I log with my device, and judging by the file size the logging seems to work, but I have not been able to deserialize it offline in Python to check whether my implementation is correct. I can't do it in C since it's an embedded application, and I want to do the post-processing offline with Python.
Also, I have checked an online protobuf deserializer where I can pass the serialized file and get it deserialized without needing the .proto file. In it I can see the header values (field 3 is empty, so it is not shown) and the logged information. This makes me think that the serialization is correct but that I am deserializing it incorrectly in Python.
This is my current code used to deserialize the message in Python:
import PSessionLogs_pb2

with open('$PROTOBUF_LOG_FILENAME$', 'rb') as f:
    read_metric = PSessionLogs_pb2.PSessionLogs()
    read_metric.ParseFromString(f.read())
Besides this, I've used protoc to generate the .py equivalent of the .proto file to deserialize offline.
It looks like you've serialized a header, then serialized some other data immediately afterwards; meaning: instead of serializing a SessionLogs that has some SessionLogsDetail records, you've serialized a SessionLogs, and then you've serialized (separately) a SessionLogsDetail. Does that sound about right? If so: yes, that will not work correctly. There are ways to do what you're after, but it isn't quite as simple as just serializing one after the other, because a root protobuf object is never terminated, so what actually happens is that the later fields overwrite the root object's fields by number.
There are two ways of addressing this, depending on the data volume. If the size (including all of the detail rows) is small, you can just change the code so that it is a true parent/child relationship, i.e. so that the rows are all inside the parent. When writing the data, this does not mean that you need to have all the rows before you start writing - there are ways of appending child rows so that you send data as it becomes available; however, when deserializing, it will want to load everything in one go, so this approach is only useful if you're OK with that, i.e. you don't have obscene, open-ended numbers of rows.
If you have large numbers of rows, you'll need to add your own framing, essentially. This is often done by adding a length prefix before each payload, so that you can read a single message at a time. Some of the libraries include helper methods for this; for example, in the Java API these are parseDelimitedFrom and writeDelimitedTo. However, my understanding is that the Python API does not currently support this utility, so you'd need to do the framing yourself :(
To summarize, you currently have:
{header - SessionLogs}
{row 0 - SessionLogsDetail}
{row 1 - SessionLogsDetail}
option 1 is:
{header - SessionLogs
{row 0 - SessionLogsDetail}
{row 1 - SessionLogsDetail}
}
option 2 is:
{length prefix of header}
{header - SessionLogs}
{length prefix of row0}
{row 0 - SessionLogsDetail}
{length prefix of row1}
{row 1 - SessionLogsDetail}
(where the length prefix is something simple like a raw varint, or just a 4-byte integer in some agreed endianness)
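As a rough sketch of option 2 on the Python side (my own illustration, not part of the protobuf API; the 4-byte little-endian prefix, the file name, and the generated class names SessionLogs/SessionLogsDetail inside PSessionLogs_pb2 are assumptions based on the question, and the nanopb writer would have to emit the same prefix before each message):

import struct
import PSessionLogs_pb2

def write_framed(stream, msg):
    # Length-prefix each serialized message with a 4-byte little-endian size.
    payload = msg.SerializeToString()
    stream.write(struct.pack('<I', len(payload)))
    stream.write(payload)

def read_framed(stream, msg_type):
    # Read one length-prefixed message; return None at end of file.
    prefix = stream.read(4)
    if len(prefix) < 4:
        return None
    (size,) = struct.unpack('<I', prefix)
    msg = msg_type()
    msg.ParseFromString(stream.read(size))
    return msg

# One header followed by any number of detail rows:
with open('session.log', 'rb') as f:
    header = read_framed(f, PSessionLogs_pb2.SessionLogs)
    while True:
        row = read_framed(f, PSessionLogs_pb2.SessionLogsDetail)
        if row is None:
            break
        print(row.DataTimestamp, row.sens1)

A raw varint prefix (as used by writeDelimitedTo in Java) would be slightly more compact, but a fixed 4-byte integer is the easiest thing to match on the embedded C side.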

How to make Chatfuel read JSON file stored in Zapier?

In my Chatfuel block I collect a {{user input}} and POST a JSON payload to a Zapier webhook. So far so good. After that, my local Python script reads this JSON from Zapier storage successfully:
import json
import urllib.request

url = 'https://store.zapier.com/api/records?secret=password'
response = urllib.request.urlopen(url).read().decode('utf-8')
data = json.loads(response)
and analyzes it, generating another JSON payload as output:
json0 = {
    "messages": [
        {"text": analysis_output}
    ]
}
Then Python 3 posts this JSON to a webhook in Zapier:
import requests
r = requests.post('https://hooks.zapier.com/hooks/catch/2843360/8sx1xl/', json=json0)
r.status_code
Zapier Webhook successfully gets the JSON and sends it to Storage.
Key-Value pairs are set and then Chatfuel tries to read from storage:
GET https://store.zapier.com/api/records?secret=password2
But the JSON structure obtained is wrong, which I verified with this code:
url = 'https://store.zapier.com/api/records?secret=password2'
response = urllib.request.urlopen(url).read().decode('utf-8')
data = json.loads(response)
data
that returns:
{'messages': "text: Didn't know I could order several items"}
when the right structure for Chatfuel to work would be:
{"messages": [{"text": "Didn't know I could order several items"}]}
That is, there are two main problems:
1) The " [ { " nesting is missing from the JSON.
2) New information is appended to the existing record instead of a brand new JSON being generated, which causes the JSON to have 5 different parts.
I am looking for possible solutions for this issue.
David here, from the Zapier Platform team.
First off, you don't need quotes around your keys; we take care of that for you. Currently, your JSON will look like:
{ "'messages'": { "'text'": "<DATA FROM STEP 1>" } }
So the first change is to take those out.
Next, if you want to store an array, use the Push Value Onto List action instead. It takes a top-level key and stores your values in a key in that object called list. Given a setup that pushes the value "5" onto a top-level key called demo, the resulting structure in JSON is
{ "demo": {"list": [ "5" ]} }
It seems like you want to store an extra level down: an array of JSON objects:
[ { "text": "this is text" } ]
That's not supported out of the box, as all list items are stored as strings. You can store JSON strings, though, and parse them back into objects when you need to access them as objects (see the sketch below).
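As a rough illustration of that workaround (my sketch, not official Zapier code; the demo key and the secret are placeholders from the examples above):

import json
import urllib.request

# Writing side: serialize each object to a string before pushing it onto the list.
message = {"text": "Didn't know I could order several items"}
value_to_push = json.dumps(message)

# Reading side: fetch the store and parse each list item back into an object.
url = 'https://store.zapier.com/api/records?secret=password2'
raw = urllib.request.urlopen(url).read().decode('utf-8')
data = json.loads(raw)
messages = [json.loads(item) for item in data.get('demo', {}).get('list', [])]
print({"messages": messages})  # the structure Chatfuel expects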
Does that answer your question?

Exporting DOORS Objects to CSV files with DXL won't write all objects?

I wrote a script for DOORS 9.5 that looks in a DOORS module for specific objects and writes them to a CSV file. However, after a certain number of lines it stops writing to the CSV file, and I only get about half of the objects I requested.
I'm using a string replacement function that I found on the internet, so maybe that is the problem - or is there some kind of maximum for what DXL can write to CSV files?
It would be very nice if anyone could help me with this, because I can't find any solution on the internet or understand why this won't work.
// String replacement function
string replace (string sSource, string sSearch, string sReplace)
{
    int iLen = length sSource
    if (iLen == 0) return ""
    int iLenSearch = length(sSearch)
    if (iLenSearch == 0)
    {
        print "search string must not be empty"
        return ""
    }
    // read the first char for later comparison -> speed optimization
    char firstChar = sSearch[0]
    Buffer s = create()
    int pos = 0, d1, d2;
    int i
    while (pos < iLen) {
        char ch = sSource[pos];
        bool found = true
        if (ch != firstChar) {pos++; s += ch; continue}
        for (i = 1; i < iLenSearch; i++)
            if (sSource[pos+i] != sSearch[i]) { found = false; break }
        if (!found) {pos++; s += ch; continue}
        s += sReplace
        pos += iLenSearch
    }
    string result = stringOf s
    delete s
    return result
}
Module m = read(modulePath, false)
Object o
string s
string eval
Stream outfile = write("D:\\Python\\Toolbeta\\data\\modules\\test.csv")

for o in m do
{
    eval = o."Evaluation Spec Filter"
    if (eval == "Evaluation Step Object")
    {
        s = o."Object Text"
        s = replace(s, "\n", "\\n")
        outfile << o."HierarchyNumber" ";" s "\n"
    }
}
close outfile
close outfile
I finally found the solution to my problem (I know the replace function was kinda crappy^^).
DXL scripts seem to have an internal timer. When the timer is up, the script ends itself automatically, even when processing is still running. So my script always stopped after x seconds, and that's the reason I never got all the data in my CSV files.
If you have the same problem, try pragma runLim, 0. It sets the timer to unlimited. You can also choose a limit by replacing 0 with any number (for my purposes 2000000 suits best).
Thanks for all the answers and the help.
There is no line limit that I know of for output to CSV.
But I do see something odd in your replace function. Your sSearch variable is always \n, which as far as DOORS is concerned is one character (a carriage return). But in the following line, i starts at 1:
if (sSource[pos+i] != sSearch[i]) { found = false; break }
sSearch doesn't have any character at position 1 because the string array starts at 0.
I think you need to change your for loop to:
for (i = 0; i < iLenSearch; i++)
My guess is that your script is failing the first time it finds an object with a carriage return in it.
Let me know if that helps, good luck!
There are a myriad of problems you have to take care of when outputting Object Text to a CSV. Are you taking care of replacing commas, or at least quoting the text? What if the Object Text has an OLE in it? Etc.
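For comparison only - this is Python rather than DXL, and just a sketch of mine - a CSV library handles those cases by quoting for you, which is the behaviour you would otherwise have to hand-roll in DXL:

import csv

# Object text containing the separator, quotes and line breaks - the cases that
# break naive CSV output.
rows = [
    ("1.2.3", 'Object text with a ; separator, a "quote" and\na line break'),
    ("1.2.4", "Plain object text"),
]

with open("test.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f, delimiter=";", quoting=csv.QUOTE_MINIMAL)
    writer.writerow(["HierarchyNumber", "Object Text"])
    for number, text in rows:
        writer.writerow([number, text])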
If you are doing this manually, I would suggest setting up a view with the attributes you want to see as columns and a filter to include only the objects you want to see, then using the native DOORS export to Excel (Module window: File > Export > Microsoft Office > Excel). If you really need a CSV, you can then save from Excel as CSV.
If you are doing this automatically with scripts, I would suggest using a DXL library for Excel, such as: http://www.baselinesinc.com/dxl-repository/
(But be aware there are problems using Excel in Windows Scheduled Tasks.)
If you don't have access to Excel, then you might look around the 'net for some C code used to write to a CSV and base your DXL on that.
Hope that helps.
EDIT: Also, here's a link for a good function for escaping: https://www.ibm.com/developerworks/community/forums/html/topic?id=77777777-0000-0000-0000-000014627043

Parsing an unknown data structure in Python

I have a file containing lots of data put in a form similar to this:
Group1 {
    Entry1 {
        Title1 [{Data1:Member1, Data2:Member2}]
        Title2 [{Data3:Member3, Data4:Member4}]
    }
    Entry2 {
        ...
    }
}
Group2 {
    DifferentEntry1 {
        DiffTitle1 {
            ...
        }
    }
}
The thing is, I don't know how many layers of braces there are or how the data is structured. I need to modify the data and delete entire entries, depending on conditions involving data members, before writing everything to a new file. What's the best way of reading in a file like this? Thanks!
The data structure basically seems to be a dict where the keys are strings and each value is either a string or another dict of the same type, so I'd recommend pulling it into that sort of Python structure,
eg:
{'Group1': {'Entry1': {'Title1': {'Data1': 'Member1', 'Data2': 'Member2'},
                       'Title2': {'Data3': 'Member3', 'Data4': 'Member4'}},
            'Entry2': {}}}
At the top level of the file you would create a blank dict, and then for each line you read, you use the identifier as a key, and then when you see a { you create the value for that key as a dict. When you see Key:Value, then instead of creating that key as a dict, you just insert the value normally. When you see a } you have to 'go back up' to the previous dict you were working on and go back to filling that in.
I'd think this whole parser, to put the file into a Python structure like this, could be done in one fairly short recursive function that just calls itself to fill in each sub-dict when it sees a { and returns to its caller upon seeing } - see the sketch below.
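A rough sketch of that recursive idea (mine, not the answerer's code, and it only handles the exact shapes shown in the question: identifiers, { } blocks, and [{Key:Value, ...}] lists):

import re

TOKEN = re.compile(r'[{}\[\],:]|[^\s{}\[\],:]+')

def parse(text):
    data, _ = parse_block(TOKEN.findall(text), 0)
    return data

def parse_block(tokens, pos):
    # Parse "Name { ... }" and "Name [ ... ]" pairs until a closing brace (or the end).
    result = {}
    while pos < len(tokens) and tokens[pos] != '}':
        name = tokens[pos]
        pos += 1
        if tokens[pos] == '{':
            value, pos = parse_block(tokens, pos + 1)
            pos += 1  # skip the closing '}'
        elif tokens[pos] == '[':
            value, pos = parse_list(tokens, pos + 1)
            pos += 1  # skip the closing ']'
        else:
            raise ValueError('unexpected token %r' % tokens[pos])
        result[name] = value
    return result, pos

def parse_list(tokens, pos):
    # Parse "[{Key:Value, Key:Value}, ...]" into a list of dicts.
    items = []
    while tokens[pos] != ']':
        if tokens[pos] == '{':
            item, pos = {}, pos + 1
            while tokens[pos] != '}':
                item[tokens[pos]] = tokens[pos + 2]  # Key ':' Value
                pos += 3
                if tokens[pos] == ',':
                    pos += 1
            items.append(item)
            pos += 1  # skip the closing '}'
        if tokens[pos] == ',':
            pos += 1
    return items, pos

sample = 'Group1 { Entry1 { Title1 [{Data1:Member1, Data2:Member2}] } }'
print(parse(sample))
# {'Group1': {'Entry1': {'Title1': [{'Data1': 'Member1', 'Data2': 'Member2'}]}}}

Once the data is in plain dicts and lists, deleting an entry is just del parsed['Group1']['Entry1'], and writing the structure back out is a small recursive formatter.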
Here is a grammar.
dict_content : NAME ':' NAME [ ',' dict_content ]?
             | NAME '{' [ dict_content ]? '}' [ dict_content ]?
             | NAME '[' [ list_content ]? ']' [ dict_content ]?
             ;
list_content : NAME [ ',' list_content ]?
             | '{' [ dict_content ]? '}' [ ',' list_content ]?
             | '[' [ list_content ]? ']' [ ',' list_content ]?
             ;
Top level is dict_content.
I'm a little unsure about the comma after dicts and lists embedded in a list, as you didn't provide any example of that.
If you have the grammar for the structure of your data file, or you can create it yourself, you could use a parser generator for Python, like YAPPS.
I have something similar, but written in Java. It parses a file with the same basic structure but slightly different syntax (no '{' and '}', only indentation, like in Python). It is a very simple scripting language.
Basically it works like this: it uses a stack to keep track of the innermost block of instructions (or, in your case, data) and appends every new instruction to the block on top. If it parses an instruction which expects a new block, that block is pushed onto the stack. When a block ends, one element is popped from the stack.
I do not want to post the entire source because it is big and it is available on Google Code (lizzard-entertainment, revision 405). There are a few things you need to know.
Instruction is an abstract class with a block_expected method to indicate whether the concrete instruction needs a block (like loops, etc.). In your case this is unnecessary; you only need to check for '{'.
Block extends Instruction. It contains a list of instructions and has an add method to add more.
indent_level returns how many spaces precede the instruction text. This is also unnecessary with '{}' signs.
The relevant parsing loop looks like this:
BufferedReader input = null;
try {
    input = new BufferedReader(new FileReader(inputFileName));
    // Stack of instruction blocks
    Stack<Block> stack = new Stack<Block>();
    // Push the root block
    stack.push(this.topLevelBlock);
    String line = null;
    Instruction prev = new Noop();
    while ((line = input.readLine()) != null) {
        // Difference between the indentation of the previous and this line
        // You do not need this, you will be using {} to specify block boundaries
        int level = indent_level(line) - stack.size();
        // Parse the line (returns an instruction object)
        Instruction inst = Instruction.parse(line.trim().split(" +"));
        // If the previous instruction expects a block (for example repeat)
        if (prev.block_expected()) {
            if (level != 1) {
                // TODO handle error
                continue;
            }
            // Push the previous instruction and add the current instruction
            stack.push((Block)(prev));
            stack.peek().add(inst);
        } else {
            if (level > 0) {
                // TODO handle error
                continue;
            } else if (level < 0) {
                // Pop the stack at the end of blocks
                for (int i = 0; i < -level; ++i)
                    stack.pop();
            }
            stack.peek().add(inst);
        }
        prev = inst;
    }
} finally {
    if (input != null)
        input.close();
}
That depends on how the data is structured, and what kind of changes you need to do.
One option might be to parse it into a Python data structure; the format looks similar, except that you don't have quotes around the strings. That makes complex manipulation easy.
On the other hand, if all you need to do is make changes that modify some entries to other entries, you can do it with search and replace.
So you need to understand the issue better before you can know what the best way is.
This is a pretty similar problem to XML processing, and there's a lot of Python code to do that. So if you could somehow convert the file to XML, you could just run it through a parser from the standard library. An XML version of your example would be something like this:
<group id="Group1">
<entry id="Entry1">
<title id="Title1"><data id="Data1">Member1</data> <data id="Data2">Member2</data></title>
<title id="Title2"><data id="Data3">Member3</data> <data id="Data4">Member4</data></title>
</entry>
<entry id="Entry2">
...
</entry>
</group>
Of course, converting to XML probably isn't the most straightforward thing to do. But your job is pretty similar to what's already been done with the XML parsers, you just have a different syntax to deal with. So you could take a look at some XML parsing code and write a little Python parser for your data file based on that. (Depending on how the XML parser is implemented, you might even be able to copy the code, just change a few regular expressions, and run it for your file)
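If you did convert it to something XML-like (the snippet above, for instance), the standard library can take it from there; a minimal sketch with xml.etree.ElementTree (mine, just to show the shape of the queries):

import xml.etree.ElementTree as ET

xml_text = '''<group id="Group1">
    <entry id="Entry1">
        <title id="Title1"><data id="Data1">Member1</data> <data id="Data2">Member2</data></title>
    </entry>
</group>'''

root = ET.fromstring(xml_text)
for entry in root.findall('entry'):
    # Inspect the data members and decide whether to keep, modify or drop the entry.
    for data in entry.iter('data'):
        print(entry.get('id'), data.get('id'), data.text)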

How to build "Tagging" support using CouchDB?

I'm using the following view function to iterate over all items in the database (in order to find a tag), but I think the performance is very poor if the dataset is large.
Any other approach?
def by_tag(tag):
    return '''
    function(doc) {
        if (doc.tags.length > 0) {
            for (var tag in doc.tags) {
                if (doc.tags[tag] == "%s") {
                    emit(doc.published, doc)
                }
            }
        }
    };
    ''' % tag
Disclaimer: I didn't test this and don't know if it can perform better.
Create a single permanent view:
function(doc) {
    for (var tag in doc.tags) {
        emit([tag, doc.published], doc)
    }
};
And query with
_view/your_view/all?startkey=['your_tag_here']&endkey=['your_tag_here', {}]
Resulting JSON structure will be slightly different but you will still get the publish date sorting.
You can define a single permanent view, as Bahadir suggests. When doing this sort of indexing, though, don't output the doc for each key. Instead, emit([tag, doc.published], null). In current release versions you'd then have to do a separate lookup for each doc, but SVN trunk now has support for specifying "include_docs=True" in the query string, and CouchDB will automatically merge the docs into your view for you, without the space overhead.
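As a rough sketch of what the query side could look like from couchdb-python (my illustration; the design document name by_tag, the database name, and the port are placeholders, and include_docs needs a CouchDB build that supports it):

from couchdb import Server  # http://code.google.com/p/couchdb-python/

server = Server("http://localhost:5984")
db = server['my_table']

def find_posts_by_tag(tag):
    # [tag] .. [tag, {}] covers every [tag, published] key for this tag, sorted by
    # publish date; include_docs fetches each doc without storing it in the view.
    rows = db.view('by_tag/all', startkey=[tag], endkey=[tag, {}], include_docs=True)
    return [row.doc for row in rows]

for doc in find_posts_by_tag("mytag"):
    print(doc['published'])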
You are very much on the right track with the view. A list of thoughts though:
View generation is incremental. If your read traffic is greater than your write traffic, then your views won't cause an issue at all. People who are concerned about this generally shouldn't be. As a frame of reference, you should be worried if you're dumping hundreds of records into the view without an update.
Emitting an entire document will slow things down. You should only emit what is necessary for use of the view.
Not sure what the val == "%s" performance would be, but you shouldn't overthink things. If there's a tags array, you should just emit the tags. Granted, if you expect a tags array that will contain non-strings, then ignore this.
# Works on CouchDB 0.8.0
from couchdb import Server  # http://code.google.com/p/couchdb-python/

byTag = """
function(doc) {
    if (doc.type == 'post' && doc.tags) {
        doc.tags.forEach(function(tag) {
            emit(tag, doc);
        });
    }
}
"""

def findPostsByTag(self, tag):
    server = Server("http://localhost:1234")
    db = server['my_table']
    return [row for row in db.query(byTag, key = tag)]
The byTag map function emits each tag as the "key" and each post carrying that tag as the value, so when you query with key = "mytag", it will retrieve all posts tagged "mytag".
I've tested it against about 10 entries and it seems to take about 0.0025 seconds per query; not sure how efficient it is with large data sets.
