Parse the grouped data using StringGrouper to json - python

I'm using StringGrouper to group similar strings together, and I want to see the grouped data in a JSON file. How can I write the grouped data out as JSON?
here is my code:
import pandas as pd
from string_grouper import match_strings, match_most_similar, \
    group_similar_strings, compute_pairwise_similarities, \
    StringGrouper

string_grouper = StringGrouper(data['name'], ignore_index=True, min_similarity=0.83)
string_grouper = string_grouper.fit()
data['deduplicated_name'] = string_grouper.get_groups()
An example of the current output (screenshot): https://i.stack.imgur.com/aWnvb.png
the expected output in json format:
[
[sql server
{
“id”: 0
“name”: “sql server ”
},
{
“id”: 1
“name”: “sql server management”
},
//another name in the same group
],
[
// another group
]
]

Try using the code below and modify it accordingly:
data[['id','name']].to_json(orient='records')
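That call emits a flat list of records. For the nested layout sketched in the question (each group as a list headed by its representative name), here is a minimal sketch building on the same idea; it assumes data has 'id' and 'name' columns and that 'deduplicated_name' holds each row's group representative, as in the code above:

import json

groups = []
for rep, grp in data.groupby('deduplicated_name'):
    # Round-trip through to_json so numpy types become plain JSON values
    members = json.loads(grp[['id', 'name']].to_json(orient='records'))
    groups.append([rep] + members)

with open('groups.json', 'w') as f:
    json.dump(groups, f, indent=2)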

Related

How to create bulk node relationships using py2neo

I need to populate a Neo4j database using a JSON file that contains data about some processes, including each process's name, its parents, and its children (if any). Here is part of the JSON as an example:
[
    {
        "process": "IPTV_Subscriptions",
        "parents": ["IPTV_Navigation", "DeviceCertifications-insertion"],
        "childs": ["villa_iptv", "villa_ott", "villa_calicux"]
    },
    {
        "process": "IPTV_Navigation",
        "parents": [],
        "childs": ["IPTV_Subscriptions"]
    },
    {
        "process": "DeviceCertifications-getter",
        "parents": [],
        "childs": ["DeviceCertifications-insertion"]
    },
    {
        "process": "DeviceCertifications-insertion",
        "parents": ["DeviceCertifications-getter"],
        "childs": ["IPTV_Subscriptions"]
    }
]
With the following Python code, I found that I can bulk-create a node for each process contained in the JSON:
import json
from py2neo import Graph
from py2neo.bulk import create_nodes, create_relationships

graph = Graph("bolt://localhost:7687", auth=("yyyy", "xxxx"))

# Open the JSON file
f = open('/app/conf/data.json')
processs = json.load(f)

data = []
for i in processs:
    proc = []
    proc.append(i["process"])
    data.append(proc)

keys = ["process"]
create_nodes(graph.auto(), data, labels={"process"}, keys=keys)
Checking in Neo4j, I can see that the nodes have been created.
But now I need to create the relationships. For each process, I know from the JSON which nodes are its parents and children.
I tried to follow the example from the documentation:
from py2neo import Graph
from py2neo.bulk import create_relationships

g = Graph()
data = [
    (("Alice", "Smith"), {"since": 1999}, "ACME"),
    (("Bob", "Jones"), {"since": 2002}, "Bob Corp"),
    (("Carol", "Singer"), {"since": 1981}, "The Daily Planet"),
]
create_relationships(g.auto(), data, "WORKS_FOR",
                     start_node_key=("Person", "name", "family name"),
                     end_node_key=("Company", "name"))
But it didn't work for me.
Given that the JSON tells me the parents and children of each process, does anyone have an idea of how I can generate the relationships in bulk? Based on my example JSON, the relationship types would be ParentOf and ChildOf, but I have no idea how to generate them from Python.
Below is a script that creates the relationships in bulk using py2neo; let me know whether it works for you. One other thing: please label your nodes Process rather than process (note the uppercase P). I use the relationship type :CHILD_OF; if you want :PARENT_OF instead, swap the first and third items of each tuple in data, as sketched after the script.
import json
from py2neo import Graph
from py2neo.bulk import create_relationships

graph = Graph("neo4j://localhost:7687", auth=("neo4j", "neo4jay"))

# Open the JSON file
f = open('data2.json')
processs = json.load(f)

# Build one (child, properties, parent) tuple per parent relationship
data = []
for i in processs:
    for p in i["parents"]:
        data.append((i["process"], {}, p))

create_relationships(graph.auto(), data, "CHILD_OF",
                     start_node_key=("Process", "process"),
                     end_node_key=("Process", "process"))
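For reference, a minimal sketch of the :PARENT_OF variant mentioned above, with the first and third tuple items swapped (same graph and processs as in the script):

# One (parent, {}, child) tuple per relationship
data = []
for i in processs:
    for p in i["parents"]:
        data.append((p, {}, i["process"]))

create_relationships(graph.auto(), data, "PARENT_OF",
                     start_node_key=("Process", "process"),
                     end_node_key=("Process", "process"))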

How to read JSON data in TXT file into Pandas

I have a ".txt" file which has JSON data in it. I want to read this file in Python and convert it into a dataframe.
The data in this text file looks like this:
{
    "_id" : "116b244599862fd2200",
    "met_id" : [
        612019,
        621295,
        725,
        622169,
        640014,
        250,
        350,
        640015,
        613689,
        650423
    ],
    "id" : "104",
    "name" : "Energy",
    "label" : "Peer Group",
    "display_type" : "Risky Peer Group",
    "processed_time" : ISODate("2019-04-18T11:17:05Z")
}
I tried reading it with the pd.read_json function, but it always shows me an error. I am quite new to JSON; how can I load this text file in Python?
Also, "processed_time" : ISODate("2019-04-18T11:17:05Z") is not JSON format.
We can check that in https://jsonlint.com/
I added python code.
import pandas as pd
import json

with open('rr.txt') as f:
    string = f.read()

# Strip 'ISODate(' and the closing ')'; for a more robust fix, use a regex
string = string.replace('ISODate(', '')
string = string.replace(')', '')

jsonData = json.loads(string)
print(pd.DataFrame(jsonData))
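As the comment notes, removing every ')' is fragile if any value itself contains parentheses; a regex that targets only the ISODate wrapper is safer. A minimal sketch:

import re

# Replace ISODate("...") with just the quoted timestamp string
string = re.sub(r'ISODate\((".*?")\)', r'\1', string)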

Import list of dicts or JSON file to elastic search with python

I have a .json.gz file that I wish to load into Elasticsearch.
My first attempt involved using the json module to convert the JSON to a list of dicts.
import gzip
import json
from pprint import pprint
from elasticsearch import Elasticsearch
nodes_f = gzip.open("nodes.json.gz")
nodes = json.load(nodes_f)
Dict example:
pprint(nodes[0])
{u'index': 1,
u'point': [508163.122, 195316.627],
u'tax': u'fehwj39099'}
Using Elasticsearch:
es = Elasticsearch()
data = es.bulk(index="index",body=nodes)
However, this returns:
elasticsearch.exceptions.RequestError: TransportError(400, u'illegal_argument_exception', u'Malformed action/metadata line [1], expected START_OBJECT or END_OBJECT but found [VALUE_STRING]')
Beyond this, I want to be able to look up the tax for a given point query, in case that affects how I should index the data in Elasticsearch.
Alfe pointed me in the right direction, but I couldn't get his code to work.
I found two solutions:
Line by line with a for loop:
es = elasticsearch.Elasticsearch()
for node in nodes:
    _id = node['index']
    es.index(index='nodes', doc_type='external', id=_id, body=node)
In bulk, using helper:
actions = [
    {
        "_index": "nodes_bulk",
        "_type": "external",
        "_id": str(node['index']),
        "_source": node
    }
    for node in nodes
]
helpers.bulk(es, actions)
Bulk indexing was around 22 times faster for a list of 343,724 dicts.
Here is my working code using bulk api:
Define a list of dicts:
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch([{'host': 'localhost', 'port': 9200}])

doc = [{'_id': 1, 'price': 10, 'productID': 'XHDK-A-1293-#fJ3'},
       {'_id': 2, 'price': 20, 'productID': 'KDKE-B-9947-#kL5'},
       {'_id': 3, 'price': 30, 'productID': 'JODL-X-1937-#pV7'},
       {'_id': 4, 'price': 30, 'productID': 'QQPX-R-3956-#aD8'}]

helpers.bulk(es, doc, index='products', doc_type='_doc', request_timeout=200)
The ES bulk library showed several problems, including performance trouble and not being able to set specific _ids. But since the bulk API of ES is not very complicated, we did it ourselves:
import json
import requests

headers = {'Content-type': 'application/json',
           'Accept': 'text/plain'}

jsons = []
for d in docs:
    _id = d.pop('_id')  # take _id out of the dict
    jsons.append('{"index":{"_id":"%s"}}\n%s\n' % (_id, json.dumps(d)))

data = ''.join(jsons)
response = requests.post(url, data=data, headers=headers)
We needed to set a specific _id but I guess you can skip this part in case you want a random _id set by ES automatically.
Hope that helps.
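For reference, a usage sketch assuming a local single-node cluster; the endpoint and index name below are placeholders, not from the original answer:

# Hypothetical endpoint: the _bulk API of the target index on a local cluster
url = 'http://localhost:9200/nodes/_bulk'

response = requests.post(url, data=data, headers=headers)
# The response body reports whether any individual action failed
print(response.json()['errors'])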

What is the data format returned by the AdWords API TargetingIdeaPage service?

When I query the AdWords API to get search volume data and trends through their TargetingIdeaSelector using the Python client library, the returned data looks like this:
(TargetingIdeaPage){
   totalNumEntries = 1
   entries[] =
      (TargetingIdea){
         data[] =
            (Type_AttributeMapEntry){
               key = "KEYWORD_TEXT"
               value =
                  (StringAttribute){
                     Attribute.Type = "StringAttribute"
                     value = "keyword phrase"
                  }
            },
            (Type_AttributeMapEntry){
               key = "TARGETED_MONTHLY_SEARCHES"
               value =
                  (MonthlySearchVolumeAttribute){
                     Attribute.Type = "MonthlySearchVolumeAttribute"
                     value[] =
                        (MonthlySearchVolume){
                           year = 2016
                           month = 2
                           count = 2900
                        },
                        ...
                        (MonthlySearchVolume){
                           year = 2015
                           month = 3
                           count = 2900
                        },
                  }
            },
      },
}
This isn't JSON and appears to just be a messy Python list. What's the easiest way to flatten the monthly data into a Pandas dataframe with a structure like this?
Keyword        | Year | Month | Count
keyword phrase | 2016 | 2     | 10
The output is a sudsobject. I found that this code does the trick:
import suds.sudsobject as sudsobject
import pandas as pd

# Convert each suds object in the response into a plain dict
a = [sudsobject.asdict(x) for x in output]
df = pd.DataFrame(a)
Addendum: This was once correct, but newer versions of the API (I tested 201802) return zeep objects instead. However, zeep.helpers.serialize_object should do the same trick.
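For those newer versions, a minimal sketch of the zeep equivalent (assuming output is the list of entries returned by the API):

from zeep.helpers import serialize_object
import pandas as pd

# serialize_object turns nested zeep objects into (ordered) dicts
a = [dict(serialize_object(x)) for x in output]
df = pd.DataFrame(a)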
Here's the complete code I used to query the TargetingIdeaSelector with requestType STATS, and the method I used to parse the output into a usable dataframe. Note the section starting "Parse results to pandas dataframe": it takes the output shown in the question above and converts it to a dataframe. Probably not the fastest or best approach, but it works! Tested with Python 2.7.
"""This code pulls trends for a set of keywords, and parses into a dataframe.
The LoadFromStorage method is pulling credentials and properties from a
"googleads.yaml" file. By default, it looks for this file in your home
directory. For more information, see the "Caching authentication information"
section of our README.
"""
from googleads import adwords
import pandas as pd

adwords_client = adwords.AdWordsClient.LoadFromStorage()

PAGE_SIZE = 10

# Initialize appropriate service.
targeting_idea_service = adwords_client.GetService(
    'TargetingIdeaService', version='v201601')

# Construct selector object and retrieve related keywords.
offset = 0
stats_selector = {
    'searchParameters': [
        {
            'xsi_type': 'RelatedToQuerySearchParameter',
            'queries': ['donald trump', 'bernie sanders']
        },
        {
            # Language setting (optional).
            # The ID can be found in the documentation:
            # https://developers.google.com/adwords/api/docs/appendix/languagecodes
            'xsi_type': 'LanguageSearchParameter',
            'languages': [{'id': '1000'}],
        },
        {
            # Location setting
            'xsi_type': 'LocationSearchParameter',
            'locations': [{'id': '1027363'}]  # Burlington, Vermont
        }
    ],
    'ideaType': 'KEYWORD',
    'requestType': 'STATS',
    'requestedAttributeTypes': ['KEYWORD_TEXT', 'TARGETED_MONTHLY_SEARCHES'],
    'paging': {
        'startIndex': str(offset),
        'numberResults': str(PAGE_SIZE)
    }
}
stats_page = targeting_idea_service.get(stats_selector)

##########################################################################
# Parse results to pandas dataframe
stats_pd = pd.DataFrame()
if 'entries' in stats_page:
    for stats_result in stats_page['entries']:
        stats_attributes = {}
        for stats_attribute in stats_result['data']:
            # print(stats_attribute)
            if stats_attribute['key'] == 'KEYWORD_TEXT':
                kt = stats_attribute['value']['value']
            else:
                for i, val in enumerate(stats_attribute['value'][1]):
                    data = {'keyword': kt,
                            'year': val['year'],
                            'month': val['month'],
                            'count': val['count']}
                    data = pd.DataFrame(data, index=[i])
                    stats_pd = stats_pd.append(data, ignore_index=True)
print(stats_pd)
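The selector above only requests the first PAGE_SIZE results. A minimal pagination sketch under the same setup, using the totalNumEntries field visible in the output above (this loop is an assumption, not part of the original script):

# Hypothetical pagination: advance startIndex until all entries are fetched
offset += PAGE_SIZE
while offset < int(stats_page['totalNumEntries']):
    stats_selector['paging']['startIndex'] = str(offset)
    stats_page = targeting_idea_service.get(stats_selector)
    # ...parse stats_page as above...
    offset += PAGE_SIZE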

JSON Parsing help in Python

I have the below data in JSON format. I started with the code below, which throws a KeyError.
I'm not sure how to get all the data listed in the headers section.
I know json_obj['offers'][0]['pkg']['Info'] is wrong, but I'm not sure how to do it correctly.
How can I get to the different nodes like Info, PricingInfo, flt_Info, etc.?
{
    "offerInfo": {
        "siteID": "1",
        "language": "en_US",
        "currency": "USD"
    },
    "offers": {
        "pkg": [
            {
                "offerDateRange": {
                    "StartDate": [2015, 11, 8],
                    "EndDate": [2015, 11, 14]
                },
                "Info": {
                    "Id": "111"
                },
                "PricingInfo": {
                    "BaseRate": 1932.6
                },
                "flt_Info": {
                    "Carrier": "AA"
                }
            }
        ]
    }
}
import os
import json
import csv

f = open('api.csv', 'w')
writer = csv.writer(f, delimiter='~')
headers = ['Id', 'StartDate', 'EndDate', 'Id', 'BaseRate', 'Carrier']
default = ''
writer.writerow(headers)

string = open('data.json').read().decode('utf-8')
json_obj = json.loads(string)

for pkg in json_obj['offers'][0]['pkg']['Info']:
    row = []
    row.append(json_obj['id'])  # just to test, but I need the column values listed in the headers section
    writer.writerow(row)
It looks like you're accessing the JSON incorrectly. After you access json_obj['offers'], you use [0], but there is no array there; json_obj['offers'] gives you another dictionary.
For example, to get PricingInfo like you asked, access it like this:
json_obj['offers']['pkg'][0]['PricingInfo']
or the 11 from StartDate like this:
json_obj['offers']['pkg'][0]['offerDateRange']['StartDate'][1]
I believe you get the KeyError because you use [0] on a dictionary, and since 0 isn't a key, you get the error.
Try substituting this piece of code:
for pkg in json_obj['offers'][0]['pkg']['Info']:
    row = []
    row.append(json_obj['id'])  # just to test, but I need column values listed in header section
    writer.writerow(row)
with this (note that row must be re-initialized inside the loop):
for pkg in json_obj['offers']['pkg']:
    row = []
    row.append(pkg['Info']['Id'])
    year = pkg['offerDateRange']['StartDate'][0]
    month = pkg['offerDateRange']['StartDate'][1]
    day = pkg['offerDateRange']['StartDate'][2]
    StartDate = "%d-%d-%d" % (year, month, day)
    print StartDate
    writer.writerow(row)
Try this:
import os
import json
import csv

string = open('data.json').read().decode('utf-8')
json_obj = json.loads(string)

pkg0 = json_obj["offers"]["pkg"][0]
print pkg0["Info"]["Id"]
print str(pkg0["offerDateRange"]["StartDate"][0]) + '-' + str(pkg0["offerDateRange"]["StartDate"][1]) + '-' + str(pkg0["offerDateRange"]["StartDate"][2])
print str(pkg0["offerDateRange"]["EndDate"][0]) + '-' + str(pkg0["offerDateRange"]["EndDate"][1]) + '-' + str(pkg0["offerDateRange"]["EndDate"][2])
print pkg0["Info"]["Id"]
print pkg0["PricingInfo"]["BaseRate"]
print pkg0["flt_Info"]["Carrier"]
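Putting the pieces together, a minimal sketch that fills every column in the headers list (same json_obj and writer as in the question; the Y-M-D date formatting is an assumption):

for pkg in json_obj['offers']['pkg']:
    start = pkg['offerDateRange']['StartDate']
    end = pkg['offerDateRange']['EndDate']
    # One value per header column: Id, StartDate, EndDate, Id, BaseRate, Carrier
    row = [pkg['Info']['Id'],
           "%d-%d-%d" % tuple(start),
           "%d-%d-%d" % tuple(end),
           pkg['Info']['Id'],
           pkg['PricingInfo']['BaseRate'],
           pkg['flt_Info']['Carrier']]
    writer.writerow(row)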
