To extract data between two strings - python

I have a file containing the following JSON, and I want to extract the data between "transcript" and "explain".
"1010320": {
"transcript": [
"1012220",
"to build. so three is not correct."
],
"explain": "Describing&Interpreting"
},
"1019660": {
"transcript": [
"1031920",
"The moment disturbance comes, if this control strategy is to be implemented properly, the moment disturbance comes, it is picked up immediately, and corrective action done immediately."
],
"explain": "Describing&Interpreting"
},
"1041600": {
"transcript": [
"1044860",
"this is also not correct because it will take some time."
],
"explain": "Describing&Interpreting"
},
"1053100": {
"transcript": [
"1073800",
],
"explain": "Describing&Interpreting"
},
"2082920": {
"transcript": [
"2089000",
"45 minutes i.e., whereas this taken around 15seconds or something. Is that ok?"
],
"explain": "Describing&Interpreting"
},
I want to separate the strings from the numbers.
The output should be:
"to build. so three is not correct."
"The moment disturbance comes, if this control strategy is to be implemented properly, the moment disturbance comes, it is picked up immediately, and corrective action done immediately."
"this is also not correct because it will take some time."
"45 minutes i.e., whereas this taken around 15seconds or something. Is that ok?"
Is it possible??

This might work for you (GNU sed):
sed -n '/^\s*"transcript": \[/,/^\s*\],/{/^\s*"[^"]*"\s*$/p}' file
This uses sed's grep-like mode and prints lines that start and end with double quotes within the transcript clauses.

sed -n -e '/",[[:blank:]]*$/,/^[[:blank:]]*],/ {
/^[[:blank:]]*".*"[[:blank:]]*$/ {
G;p
}
}' YourFile
Based on your sample structure: take the range between a line ending with ", and a line starting with ],, and only print lines that consist of nothing but a quoted string.
I just added an allowance for extra whitespace ([[:blank:]], which also covers extended space characters like tabs).
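Since the question is tagged python, an alternative is to parse the file with the json module instead of sed. A minimal sketch, assuming the posted snippet is a fragment of one valid JSON object keyed by the outer timestamps ("1010320", "1019660", ...); the file name is a placeholder:

import json

with open('file.json') as f:
    data = json.load(f)

# Each "transcript" value is [timestamp, text, ...]; skip the leading
# timestamp and print the remaining strings with their surrounding quotes,
# which matches the desired output.
for entry in data.values():
    for text in entry.get('transcript', [])[1:]:
        print(json.dumps(text))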

Related

How to have a default word in a vscode snippet (Python)

I'm new to VS Code and I'm adding a snippet for my Python code.
My snippet is this:
"for i in a range": {
"prefix": "for",
"body": [
"for i in range($1,$2):",
"\t$3"
],
In $3 I want to have "pass" as the default, and when I press Tab for the third time I want it to select the pass for me so I can change it if I want to.
What you are looking for is called a placeholder, see snippet placeholders.
So this code would work:
"for i in a range": {
"prefix": "for",
"body": [
"for i in range($1,$2):",
"\t${3:my default text}"
]
}
The default text will be selected when you get to that tabstop; you can either overwrite it or press Tab again to complete the snippet. For your case, use "\t${3:pass}" as the last body line so that pass is the pre-selected default.

How do I search for a JSON key or value within all objects of an array?

I'm making a JSON object that stores every word in a language that my friends and I are trying to make. I am using Python to manipulate the data. The JSON looks something like this:
{
"nouns": [
{
"english": "man",
"liniitsav": "gom"
},
{
"english": "word",
"liniitsav": "duukam"
},
{
"english": "day",
"liniitsav": "diam"
}
],
"proper-nouns": [
],
"pronouns": [
],
"adjectives": [
],
"verbs": [
],
"adverbs": [
],
"prepositions": [
],
"conjunctions": [
],
"miscellaneous": [
],
"idioms": [
]
}
My friends and I have an Excel document where we store all of the words. I want to be able to put words into the JSON dictionary for further reference. My method right now is to use computer automation to (slowly) go through one cell at a time and copy it. Then, I access my clipboard and add an object with the English-MadeUpLanguage Pair and place it into JSON. My problem comes when I run the script again, after adding some new words. It re-adds every word to the JSON. I want to add an if statement that looks something like this line I found somewhere on stackexchange:
if any(tag['english'] == copiedToClipboard for tag in data['nouns']):
However, the english key is inside each object, so I can't access it directly. I'd rather not loop through every object with a for loop, for efficiency, and in preparation for when we hopefully have thousands of words. Is there any way built into Python to access a key within all objects of an array; a depth-2 search?
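The any() line quoted above already performs that depth-2 check; here is a minimal self-contained sketch of it, plus a set-based index that avoids rescanning the list for every clipboard value (the sample data and names below are illustrative, not from the original script):

import json

# Sample data in the same shape as the dictionary file above.
data = json.loads('''{
  "nouns": [
    {"english": "man", "liniitsav": "gom"},
    {"english": "word", "liniitsav": "duukam"}
  ]
}''')

copied_to_clipboard = "man"

# The any() expression: a depth-2 check that scans the "nouns" list once
# and stops at the first match.
print(any(tag['english'] == copied_to_clipboard for tag in data['nouns']))  # True

# For many lookups (thousands of words), build a set of the known English
# words once; membership tests are then O(1) instead of a scan per word.
known_english = {tag['english'] for tag in data['nouns']}
print('man' in known_english)    # True
print('river' in known_english)  # False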

How to add json objects from file to another json array of objects in another file

In a *nix environment. I'm looking for a way to add some (not quite valid) JSON from one file to another (valid) JSON file. Let me elaborate and also cover some failed attempts I've tried so far.
This will run in a shell script in a loop which will grow quite large. It's making an api call which can only return 1000 at a time. However, there are 70,000,000+ total records. So, I will have to make this api call 70,000 times in order to get all of the desired records. The original json file I want to keep, it includes information outside of the actual data I want, such as result info and success messages, etc. Each time I iterate and call the next set, I'm trying to strip out that information and just append the main data records to the main data records of the first set.
I'm already 99% there. I'm attempting this using jq, sed, and python. The body of the data records is not technically valid JSON, so jq complains because it can only append valid data. My attempt looks like this: jq --argjson results "$(<new.json)" '.result[] += [$results]' original.json. But for that to work, new.json would have to be valid JSON, which it isn't.
I've already used grep -n to extract the line number where I want to start appending the new sets of records to the first set. So I've been trying to use sed but cannot figure out the right syntax, though I feel I'm close. I've been trying something like sed -i -e $linenumber '<a few sed things here> new.json' original.json, but with no success yet.
I've now tried to write a python script to do this. But I had never tried anything like this before. Just some string matching on readlines and string replacements. I didn't realize that there isn't a built in method for jumping to a specific line. I guess I could do some find statements to jump to that line in python but I've already done this in the bash script. Also, I realize I could read each line to memory in python but I fear that with this many records, it might get to be too much and become very slow.
I had some passing thoughts on trying some kind of head and tail and write in between since I know the exact line number. Any thoughts or solutions with any tools/languages are welcome. This is a devops project that is just to diagnose some logs, so I'm trying to not make this a full project, as once I produce the logs, I'll shift all my focus and efforts to running commands against this final produced json file and not really use this script ever again.
Example of original.json
{
"result": [
{
"id": "5b5915f4cdb39c7b",
"kind": "foo",
"source": "bar",
"action": "baz",
"matches": [
{
"id": "b298ee91704b489b8119c1d604a8308d",
"source": "blah",
"action": "buzz"
}
],
"occurred_at": "date"
},
{
"id": "5b5915f4cdb39c7b",
"kind": "foo",
"source": "bar",
"action": "baz",
"matches": [
{
"id": "b298ee91704b489b8119c1d604a8308d",
"source": "blah",
"action": "buzz"
}
],
"occurred_at": "date"
}
],
"result_info": {
"cursors": {
"after": "dlipU4c",
"before": "iLjx06u"
},
"scanned_range": {
"since": "date",
"until": "date"
}
},
"success": true,
"errors": [],
"messages": []
}
Example of new.json
{
"id": "5b5915f4cdb39c7b",
"kind": "foo",
"source": "bar",
"action": "baz",
"matches": [
{
"id": "b298ee91704b489b8119c1d604a8308d",
"source": "blah",
"action": "buzz"
}
],
"occurred_at": "date"
},
{
"id": "5b5915f4cdb39c7b",
"kind": "foo",
"source": "bar",
"action": "baz",
"matches": [
{
"id": "b298ee91704b489b8119c1d604a8308d",
"source": "blah",
"action": "buzz"
}
],
"occurred_at": "date"
}
Don't worry about the indentation or missing trailing commas, I already have that figured out and confirmed working.
You can turn the invalid JSON from the API response into a valid array by wrapping it in [...]. The resulting array can be imported and added directly to the result array.
jq --argjson results "[$(<new.json)]" '.result += $results' original.json
So first, to accumulate the results in a new file, create a JSON file with "[]" (an empty array) as its content; this ensures the file we load is valid JSON.
Next, run the following command for each file as input:
jq --argjson results "$(<new.json)" '.result | . += $results ' orig.json > new.json
The issue with your query was .result[]: it returns each element individually, as a stream of objects in the format
{}
{}
instead of
[
{},
{}
]
Based on the given new.json and its description, you seem to have comma-separated JSON objects with the JSON-separating commas on separate lines matching the regex '^}, *$'
If that's the case, the good news is you can achieve the result you want by simply removing the superfluous commas with:
sed 's/^}, *$/}/' new.txt
This produces a stream of objects, which can then be processed in any one of several well-known ways (e.g. by "slurping" it using the -s option).
"XY problem"?
In a side-comment, you wrote:
I did fix this with sed to add the commas, which are included in the question.
So it is beginning to sound as if the Q as posted is really a so-called "XY" problem. Anyway, if you were starting with a stream of JSON objects, then of course there would be no need to add the commas and deal with the consequences.
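Since the question also mentions a Python attempt, here is a minimal Python sketch of the same merge. It assumes new.json contains comma-separated objects exactly as shown and that both files fit in memory for each iteration:

import json

# Read the original, valid JSON document.
with open('original.json') as f:
    original = json.load(f)

# new.json holds comma-separated objects with no surrounding brackets;
# wrapping the text in [...] turns it into a valid JSON array
# (the same trick as the first jq answer above).
with open('new.json') as f:
    new_records = json.loads('[' + f.read() + ']')

# Append the new records to the existing "result" array and write back.
original['result'].extend(new_records)

with open('original.json', 'w') as f:
    json.dump(original, f, indent=2)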

Trying to convert a CSV into JSON in python for posting to REST API

I've got the following data in a CSV file (a few hundred lines) that I'm trying to massage into sensible JSON to post to a REST API.
I've gone with the bare minimum fields required, but here's what I've got:
dateAsked,author,title,body,answers.author,answers.body,topics.name,answers.accepted
13-Jan-16,Ben,Cant set a channel ,"Has anyone had any issues setting channels. it stays at '0'. It actually tells me there are '0' files.",Silvio,"I'm not sure. I think you can leave the cable out, because the control works. But you could try and switch two port and see if problem follows the serial port. maybe 'extended' clip names over 32 characters.
Please let me know if you find out!
Best regards.",club_k,TRUE
Here's a sample of JSON that is roughly like where I need to get to:
json_test = """{
"title": "Can I answer a question?",
"body": "Some text for the question",
"author": "Silvio",
"topics": [
{
"name": "club_k"
}
],
"answers": [
{
"author": "john",
"body": "I\'m not sure. I think you can leave the cable out. Please let me know if you find out! Best regards.",
"accepted": "true"
}
]
}"""
Pandas seems to import it into a dataframe okay(ish), but keeps telling me I can't serialize it to JSON. I also need to clean and sanitise the data, but that should be fairly easy to achieve within the script.
There must also be a way to do this in Pandas, but I'm beating my head against a wall here - as the columns for both answers and topics can't easily be merged together into a dict or a list in python.
You can use a csv.DictReader to process the CSV file as a dictionary for each row. Using the field names as keys, a new dictionary can be constructed that groups common keys into a nested dictionary keyed by the part of the field name after the .. The nested dictionary is held within a list, although it is unclear whether that is really necessary - the nested dictionary could probably be placed immediately under the top-level without requiring a list. Here's the code to do it:
import csv
import json

json_data = []
for row in csv.DictReader(open('/tmp/data.csv')):
    data = {}
    for field in row:
        key, _, sub_key = field.partition('.')
        if not sub_key:
            data[key] = row[field]
        else:
            if key not in data:
                data[key] = [{}]
            data[key][0][sub_key] = row[field]
    # print(json.dumps(data, indent=True))
    # print('---------------------------')
    json_data.append(json.dumps(data))
For your data, with the print() statements enabled, the output would be:
{
"body": "Has anyone had any issues setting channels. it stays at '0'. It actually tells me there are '0' files.",
"author": "Ben",
"topics": [
{
"name": "club_k"
}
],
"title": "Cant set a channel ",
"answers": [
{
"body": "I'm not sure. I think you can leave the cable out, because the control works. But you could try and switch two port and see if problem follows the serial port. maybe 'extended' clip names over 32 characters. \nPlease let me know if you find out!\n Best regards.",
"accepted ": "TRUE",
"author": "Silvio"
}
],
"dateAsked": "13-Jan-16"
}
---------------------------
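If the next step is to POST these records to the REST API, one option is the requests library. A hedged sketch, where the endpoint URL is a placeholder and the actual API and any authentication are not part of the question:

import requests

API_URL = 'https://example.com/api/questions'  # placeholder endpoint

# One JSON string per record, e.g. the json_data list built in the loop above.
json_data = ['{"title": "Can I answer a question?", "body": "Some text"}']

for payload in json_data:
    response = requests.post(
        API_URL,
        data=payload,
        headers={'Content-Type': 'application/json'},
    )
    response.raise_for_status()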

From JSON to JSON-LD without changing the source

There are 'duplicates' to my question but they don't answer my question.
Considering the following JSON-LD example as described in paragraph 6.13 - Named Graphs from http://www.w3.org/TR/json-ld/:
{
"@context": {
"generatedAt": {
"@id": "http://www.w3.org/ns/prov#generatedAtTime",
"@type": "http://www.w3.org/2001/XMLSchema#date"
},
"Person": "http://xmlns.com/foaf/0.1/Person",
"name": "http://xmlns.com/foaf/0.1/name",
"knows": "http://xmlns.com/foaf/0.1/knows"
},
"@id": "http://example.org/graphs/73",
"generatedAt": "2012-04-09",
"@graph":
[
{
"@id": "http://manu.sporny.org/about#manu",
"@type": "Person",
"name": "Manu Sporny",
"knows": "http://greggkellogg.net/foaf#me"
},
{
"@id": "http://greggkellogg.net/foaf#me",
"@type": "Person",
"name": "Gregg Kellogg",
"knows": "http://manu.sporny.org/about#manu"
}
]
}
Question:
What if you start with only the JSON part without the semantic layer:
[{
"name": "Manu Sporny",
"knows": "http://greggkellogg.net/foaf#me"
},
{
"name": "Gregg Kellogg",
"knows": "http://manu.sporny.org/about#manu"
}]
and you link the @context from a separate file or location using an HTTP Link header or rdflib parsing, then you are still left without the @id and @type in the rest of the document. Injecting those missing key-values into the JSON string is not a clean option. The idea is to go from JSON to JSON-LD without changing the original JSON part.
The way I see it, to define a triple subject one has to use an @id that maps to an IRI. It's very unlikely that plain JSON data has @id key-values. So does this mean all JSON files cannot be parsed as JSON-LD without adding the keys first? I wonder how they do it.
Does someone have an idea to point me in the right direction?
Thank you.
No, unfortunately that's not possible. There exist, however, libraries and tools that have been created exactly for that reason. JSON-LD Macros is such a library. It allows declarative transformations of JSON objects to make them usable as JSON-LD. So, effectively, all you need is a very thin layer on top of an off-the-shelf JSON-LD processor.
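To make the "thin layer" idea concrete, here is a hedged Python sketch of one possible transformation step; the context URL and @id scheme are invented for illustration, and JSON-LD Macros itself uses its own declarative rule syntax rather than code like this:

import json

# The plain JSON from the question, untouched.
plain = json.loads('''[
  {"name": "Manu Sporny", "knows": "http://greggkellogg.net/foaf#me"},
  {"name": "Gregg Kellogg", "knows": "http://manu.sporny.org/about#manu"}
]''')

# Wrap the data in a JSON-LD document: attach a context and mint an @id and
# @type for each node. The context URL and @id scheme are illustrative only.
doc = {
    "@context": "http://example.org/contexts/person.jsonld",
    "@graph": [
        {"@id": "http://example.org/people/%d" % i, "@type": "Person", **node}
        for i, node in enumerate(plain)
    ],
}

print(json.dumps(doc, indent=2))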
