UNWIND feature in py2neo not reading json fully - python

graph = Graph()
query2 = """
WITH {m} AS document
UNWIND document.lists AS s
UNWIND s.imageurl AS img
UNWIND s.youtubevideourl AS vid
RETURN s
"""
print (graph.cypher.execute(query2,m = m))
I am trying to use UNWIND to read through the full JSON file, but only the first part is read, so I cannot build a graph of the whole document.
It worked fine earlier, but since I added YouTube video links, page titles, weblinkurl and webtitle, I am facing the same problem again.
Here is an example of the JSON file I compiled from different links. Only the first part is read, but I want to read the full JSON.
The example shows only 2 parts of the JSON; I want to read the full file and create nodes from it.
Could anyone tell me how to do this, with UNWIND or otherwise?
[{'Topic': 'Virat_Kohli',
'imagetitle': 'Virat_Kohli_June_2016_(cropped).jpg?width=300',
'imageurl': 'http://commons.wikimedia.org/wiki/Special:FilePath/Virat_Kohli_June_2016_(cropped).jpg?width=300',
'webtitle': 'Virat Kohli Official Website',
'weburl': 'http://www.viratkohli.club/',
'youtubevideotitle': "Virat Kohli Finally Accepts Love For GIRLFRIEND Anushka Sharma On Aamir Khan's Secret Superstar Show - YouTube",
'youtubevideourl': 'https://www.youtube.com/watch?v=zmPh2OQzZqc'},
{'Topic': 'Virat_Kohli',
'webtitle': 'Virat Kohli profile 2017, News and images only on official website of RCB',
'weburl': 'https://www.royalchallengers.com/virat-kohli',
'youtubevideotitle': 'Virat Kohli after losing ICC champions trophy Final - India vs Pakistian - Press Conference 2017 - YouTube',
'youtubevideourl': 'https://www.youtube.com/watch?v=Yf38l1Kx2-I'},
What I am trying to do is this:
graph = Graph()
query2 = """
WITH {j} AS document
UNWIND document.lists AS s
UNWIND s.Topic AS top
UNWIND s.weburl AS url
UNWIND s.imageurl AS img
UNWIND s.youtubevideourl as y
MERGE (c:topicnames {name:s.Topic})
MERGE (sc:images{img:img, type : s.imagetitle})
MERGE (v:weblink{url:url, type : s.webtitle})
MERGE (g:videos{vid:y, type : s.youtubevideotitle})
MERGE (c)-[:IMAGE_LINKS]->(sc)
MERGE (c)-[:WEB_LINKS]->(v)
MERGE (c)-[:VIDEO_LINKS]->(g)
RETURN (c)
"""
print (graph.cypher.execute(query2,j = j))
So I should end up with a single topic node, 5 video link nodes, 5 weblink nodes and 1 image link node in Neo4j, but nodes are only drawn for one part of the JSON.
UNWIND does not seem to read the other values stored under the same keys (Topic, weburl, youtubevideourl). I want to know why it is not working and how to fix it.

The JSON file is itself a list of documents, so you don't need to wrap it in another list, and you don't need that many UNWIND clauses. Try the program below (and make sure every property is present in each document when parsing):
graph = Graph()
query2 = """
UNWIND {j} AS s
MERGE (c:topicnames {name:s.Topic})
MERGE (sc:images{img:s.imageurl, type : s.imagetitle})
MERGE (v:weblink{url:s.weburl, type : s.webtitle})
MERGE (g:videos{vid:s.youtubevideourl, type : s.youtubevideotitle})
MERGE (c)-[:IMAGE_LINKS]->(sc)
MERGE (c)-[:WEB_LINKS]->(v)
MERGE (c)-[:VIDEO_LINKS]->(g)
RETURN (c)
"""
print (graph.cypher.execute(query2,j = j))
Hope this helps!
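One likely reason the original query skipped documents: in Cypher, UNWIND on a missing (null) property produces no rows, so a document without imageurl is dropped before any MERGE runs, and MERGE itself rejects null property values. A minimal sketch of one way around this is to group the links per type in Python first, skipping keys that are absent. The helper split_links, the row keys (topic, url, title) and IMAGE_QUERY are hypothetical names for illustration; the commented execute call mirrors the one used in the question.

```python
def split_links(docs):
    """Group documents into per-type rows, skipping documents that lack a URL key."""
    fields = {
        "images": ("imageurl", "imagetitle"),
        "weblinks": ("weburl", "webtitle"),
        "videos": ("youtubevideourl", "youtubevideotitle"),
    }
    rows = {name: [] for name in fields}
    for doc in docs:
        for name, (url_key, title_key) in fields.items():
            if doc.get(url_key):  # skip absent or empty URLs
                rows[name].append({
                    "topic": doc["Topic"],
                    "url": doc[url_key],
                    # empty string instead of None, so MERGE never sees null
                    "title": doc.get(title_key, ""),
                })
    return rows


# One small query per link type, using the same {param} syntax as the question
# (newer Neo4j versions write parameters as $rows instead).
IMAGE_QUERY = """
UNWIND {rows} AS r
MERGE (c:topicnames {name: r.topic})
MERGE (i:images {img: r.url, type: r.title})
MERGE (c)-[:IMAGE_LINKS]->(i)
"""

docs = [
    {"Topic": "Virat_Kohli", "imageurl": "http://example.org/a.jpg",
     "imagetitle": "a.jpg", "weburl": "http://example.org/1",
     "webtitle": "Site 1", "youtubevideourl": "http://example.org/v1",
     "youtubevideotitle": "Video 1"},
    {"Topic": "Virat_Kohli", "weburl": "http://example.org/2",
     "webtitle": "Site 2", "youtubevideourl": "http://example.org/v2",
     "youtubevideotitle": "Video 2"},
]
rows = split_links(docs)
# With a live database, as in the question:
# graph.cypher.execute(IMAGE_QUERY, rows=rows["images"])
```

The weblink and video queries would follow the same pattern with their own labels and relationship types.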

Related

How can I use the Google Cloud Natural Language Processing API on a big query table or any other topic modelling resource?

As mentioned in the title, I have a BigQuery table with 18 million rows, nearly half of which are useless. I am supposed to assign a topic/niche to each row based on an important column (it holds details about a product or website). I tested the NLP API on a sample of 10,000 rows and it did wonders. My current approach, however, iterates over newarr (the important details column, obtained by querying my BigQuery table), sends one cell at a time, waits for the API response and appends it to the results array.
Ideally I want to run this on all 18 million rows in the minimum time. My per-minute quota has been increased to 3,000 API requests, so that is the most I can make, but I can't figure out how to send batches of 3,000 rows, one batch each minute.
for x in newarr:
    i += 1
    results.append(sample_classify_text(x))
sample_classify_text is a function taken straight from the documentation:
# This function returns the category for the text
from google.cloud import language_v1
def sample_classify_text(text_content):
    """
    Classifying Content in a String
    Args:
      text_content The text content to analyze. Must include at least 20 words.
    """
    client = language_v1.LanguageServiceClient()
    # text_content = 'That actor on TV makes movies in Hollywood and also stars in a variety of popular new TV shows.'
    # Available types: PLAIN_TEXT, HTML
    type_ = language_v1.Document.Type.PLAIN_TEXT
    # Optional. If not specified, the language is automatically detected.
    # For list of supported languages:
    # https://cloud.google.com/natural-language/docs/languages
    language = "en"
    document = {"content": text_content, "type_": type_, "language": language}
    response = client.classify_text(request = {'document': document})
    #return response.categories
    # Loop through classified categories returned from the API
    for category in response.categories:
        # Get the name of the category representing the document.
        # See the predefined taxonomy of categories:
        # https://cloud.google.com/natural-language/docs/categories
        x = format(category.name)
        return x
        # Get the confidence. Number representing how certain the classifier
        # is that this category represents the provided text.
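A minimal sketch of the batching described above, assuming the quota is a plain requests-per-minute cap: send up to per_minute requests concurrently, then wait out the rest of the minute before starting the next batch. The function classify_in_batches and its parameters are hypothetical; classify stands for any one-argument callable such as sample_classify_text, and the sleep and clock arguments exist only so the pacing can be tested without waiting.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def classify_in_batches(texts, classify, per_minute=3000, workers=32,
                        sleep=time.sleep, clock=time.monotonic):
    """Call classify on every text, starting at most per_minute calls per minute."""
    results = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for start in range(0, len(texts), per_minute):
            began = clock()
            batch = texts[start:start + per_minute]
            # pool.map keeps the results in the same order as the inputs
            results.extend(pool.map(classify, batch))
            remaining = 60 - (clock() - began)
            is_last_batch = start + per_minute >= len(texts)
            if not is_last_batch and remaining > 0:
                sleep(remaining)  # wait for the next quota window
    return results


# Demonstration with a stand-in classifier and a fake clock (no real waiting).
naps = []
demo = classify_in_batches(["a", "b", "c", "d", "e"], str.upper,
                           per_minute=2, workers=2,
                           sleep=naps.append, clock=lambda: 0.0)
```

In real use the call would look like classify_in_batches(newarr, sample_classify_text). Creating the API client once and reusing it across calls, rather than inside the function on every request, would probably also save time, though that is not shown here.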

How to solve a UnicodeReader problem when reading a CSV

I am totally new to Python. I am using a package called pyConTextNLP that takes medical text and annotates it with classifiers.
It basically takes some natural-language text, adds some 'modifiers' to it and classifies it while removing negative findings.
The problem I am having is how to add the list of modifiers as a CSV or YAML file. I have been following the basic setup instructions here:
The problem is this line:
modifiers = itemData.get_items("https://raw.githubusercontent.com/chapmanbe/pyConTextNLP/master/KB/lexical_kb_05042016.yml")
itemData.get_items doesn't appear to exist anymore; there is instead a function called itemData.get_fileobj(). As far as I understand, this takes a CSV file, and the result is passed to markup.markItems(modifiers, mode="modifier"), which looks at the text and 'marks up' any concepts in the raw text that match the modifiers.
The error I get when running the example code comes from this line:
if not item.getLiteral() in compiledRegExprs:
which raises:
AttributeError: 'UnicodeReader' object has no attribute 'getLiteral'
The whole code is here: but I have also written it below
import networkx as nx
import pyConTextNLP.itemData as itemData
import pyConTextNLP.pyConTextGraph as pyConText
reports = [
"""IMPRESSION: Evaluation limited by lack of IV contrast; however, no evidence of
bowel obstruction or mass identified within the abdomen or pelvis. Non-specific interstitial opacities and bronchiectasis seen at the right
base, suggestive of post-inflammatory changes.""",
"""DIAGNOSIS: NO SIGNIFICANT PATHOLOGY
MICRO These biopsies of large bowel mucosa show oedema of the lamina propriabut no architectural abnormality
There is no dysplasia or malignancy
There is no evidence of active inflammation
There is no increase in the inflammatory cell content of the lamina propria""" ,
"""IMPRESSION:
1. 2.0 cm cyst of the right renal lower pole. Otherwise, normal appearance
of the right kidney with patent vasculature and no sonographic evidence of
renal artery stenosis.
2. Surgically absent left kidney.""",
"""IMPRESSION: No definite pneumothorax""",
"""IMPRESSION: New opacity at the left lower lobe consistent with pneumonia."""
]
modifiers = itemData.get_fileobj("/Applications/anaconda3/lib/python3.7/site-packages/pyConTextNLP-0.6.2.0-py3.7.egg/pyConTextNLP/CSV_Modifiers.csv")
targets = itemData.get_fileobj("/Applications/anaconda3/lib/python3.7/site-packages/pyConTextNLP-0.6.2.0-py3.7.egg/pyConTextNLP/CSV_targets.csv")
def markup_sentence(s, modifiers, targets, prune_inactive=True):
    """
    """
    markup = pyConText.ConTextMarkup()
    markup.setRawText(s)
    markup.cleanText()
    markup.markItems(modifiers, mode="modifier")
    markup.markItems(targets, mode="target")
    markup.pruneMarks()
    markup.dropMarks('Exclusion')
    # apply modifiers to any targets within the modifiers scope
    markup.applyModifiers()
    markup.pruneSelfModifyingRelationships()
    if prune_inactive:
        markup.dropInactiveModifiers()
    return markup
reports[3]
markup = pyConText.ConTextMarkup()
isinstance(markup,nx.DiGraph)
markup.setRawText(reports[4].lower())
print(markup)
print(len(markup.getRawText()))
markup.cleanText()
print(markup)
print(len(markup.getText()))
markup.markItems(modifiers, mode="modifier")
print(markup.nodes(data=True))
print(type(list(markup.nodes())[0]))
markup.markItems(targets, mode="target")
for node in markup.nodes(data=True):
    print(node)
markup.pruneMarks()
for node in markup.nodes(data=True):
    print(node)
print(markup.edges())
markup.applyModifiers()
for edge in markup.edges():
    print(edge)
The markItems function is here:
def markItems(self, items, mode="target"):
    """tags the sentence for a list of items
    items: a list of contextItems"""
    if not items:
        return
    for item in items:
        self.add_nodes_from(self.markItem(item, ConTextMode=mode),
                            category=mode)
The question is, how can I get the code to read the list in the csv file without throwing this error?
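The traceback suggests that markItems is iterating over something other than a list of item objects: every element it receives must offer getLiteral(), and a CSV reader does not. A stdlib-only sketch of the general fix, turning CSV rows into objects before passing them on, is shown below. The Item class, the four-column layout and items_from_csv are hypothetical stand-ins for illustration, not pyConTextNLP's own classes or loader; the package may well ship its own function for building items from a file.

```python
import csv
import io


class Item:
    """Hypothetical stand-in for a context item (illustration only)."""

    def __init__(self, literal, category, regex, rule):
        self._literal = literal
        self.category = category
        self.regex = regex
        self.rule = rule

    def getLiteral(self):
        return self._literal


def items_from_csv(fileobj, header_rows=1):
    """Read CSV rows and wrap each one in an Item, skipping header rows."""
    reader = csv.reader(fileobj)
    for _ in range(header_rows):
        next(reader, None)
    # Assumed column order: literal, category, regex, rule
    return [Item(*row[:4]) for row in reader if len(row) >= 4]


sample = io.StringIO(
    "literal,category,regex,rule\n"
    "no evidence of,DEFINITE_NEGATED_EXISTENCE,,forward\n"
)
items = items_from_csv(sample)
# The reader itself has no getLiteral(); the wrapped rows do.
reader_has_method = hasattr(csv.reader(io.StringIO("")), "getLiteral")
```

A list built this way can be iterated by code written like markItems, because each element answers getLiteral().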

How to find textual differences between revisions on Wikipedia pages with mwclient?

I'm trying to find the textual differences between two revisions of a given Wikipedia page using mwclient. I have the following code:
import mwclient
import difflib
site = mwclient.Site('en.wikipedia.org')
page = site.pages['Bowdoin College']
texts = [rev for rev in page.revisions(prop='content')]
if not (texts[-1][u'*'] == texts[0][u'*']):
    ##show me the differences between the pages
Thank you!
It's not clear whether you want a difflib-generated diff or a MediaWiki-generated diff using mwclient.
In the first case, you have two strings (the text of two revisions) and you want to get the diff using difflib:
...
t1 = texts[-1][u'*']
t2 = texts[0][u'*']
print('\n'.join(difflib.unified_diff(t1.splitlines(), t2.splitlines())))
(difflib can also generate an HTML diff, refer to the documentation for more info.)
But if you want the MediaWiki-generated HTML diff using mwclient you'll need revision ids:
# TODO: Loading all revisions is slow,
# try to load only as many as required.
revisions = list(page.revisions(prop='ids'))
last_revision_id = revisions[-1]['revid']
first_revision_id = revisions[0]['revid']
Then use the compare action to compare the revision ids:
compare_result = site.get('compare', fromrev=last_revision_id, torev=first_revision_id)
html_diff = compare_result['compare']['*']
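For the difflib route, a self-contained sketch that runs without a network connection is shown below. The helper revision_diff and the two sample strings are made up for illustration; in practice the inputs would be the revision texts fetched with mwclient. Passing lineterm="" keeps the header lines free of trailing newlines when the output is joined.

```python
import difflib


def revision_diff(old_text, new_text, old_label="old", new_label="new"):
    """Return a unified diff of two revision texts as one string."""
    return "\n".join(difflib.unified_diff(
        old_text.splitlines(),
        new_text.splitlines(),
        fromfile=old_label,
        tofile=new_label,
        lineterm="",
    ))


old = "Bowdoin College is a college.\nIt is in Maine.\nFounded in 1794."
new = "Bowdoin College is a college.\nIt is in Brunswick, Maine.\nFounded in 1794."
diff = revision_diff(old, new)
print(diff)
```

Lines that start with - come from the older text and lines that start with + from the newer one.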

Add Header based on Condition

I'm using reportlab to generate a PDF document that contains two types of reports.
Call the reports r1 and r2. Each report may run to more than 2-3 pages, so I want to add header text starting from the second page of each report.
For example, on the pages of the r1 report add "r1 report continued..." and on the pages of the r2 report add "r2 report continued...". How can I do that?
Currently I'm creating a list of elements and passing it to the template's build function, so I cannot identify which report is being processed.
For example:
elements = []
elements.append(r1)
...
.....
elements.append(r2)
doc.build(elements)
I finally managed to resolve it, though I'm not sure it is the proper method.
A big thanks to grc, whose answer was the basis for my solution.
As in grc's answer, I created an afterFlowable callback function.
def afterFlowable(self,flowable):
    if hasattr(flowable, 'cReport'):
        cReport = getattr(flowable, 'cReport')
        self.cReport = cReport
Then while adding data for the r1 report a custom attribute will be created
elements.append(PageBreak())
elements[-1].cReport = 'r1'
Same code while adding data for r2 report
elements.append(PageBreak())
elements[-1].cReport = 'r2'
Then in the onPage function of the template
template = PageTemplate(id='test', frames=frame, onPage=headerAndFooter)
def headerAndFooter(canvas, doc):
    canvas.saveState()
    # afterFlowable stored the current report name on the doc template
    if getattr(doc, 'cReport', None) == 'r1':
        Ph = Paragraph("""<para>r1 Report (continued)</para>""",styleH5)
        w, h = Ph.wrap(doc.width, doc.topMargin)
        Ph.drawOn(canvas, doc.leftMargin, doc.height+doc.topMargin)
    canvas.restoreState()
Note that I'm just copying and pasting parts of my code...

Return values from a Python Entrez dictionary of dictionaries

I want to scrape the Interactions table from the Entrez Gene page.
The Interactions table is populated from a web server and when I tried to use the XML package in R, I could get the Entrez gene page, but the Interactions table body was empty (it had not been populated by the web server).
Dealing with the web server issue in R may be solvable (and I'd love to see how), but it seemed Biopython was an easier path.
I put together the following, which gives me what I want for an example gene:
# Pull the Entrez gene page for MAP1B using Biopython
from Bio import Entrez
Entrez.email = "jamayfie#vasci.umass.edu"
handle = Entrez.efetch(db="gene", id="4131", retmode="xml")
record = Entrez.read(handle)
handle.close()
PPI_Entrez = []
PPI_Sym = []
# Find the Dictionary that contains the Interaction table
for x in range(1, len(record[0]["Entrezgene_comments"])):
    if ('Gene-commentary_heading', 'Interactions') in record[0]["Entrezgene_comments"][x].items():
        for y in range(0, len(record[0]["Entrezgene_comments"][x]['Gene-commentary_comment'])):
            EntrezID = record[0]["Entrezgene_comments"][x]['Gene-commentary_comment'][y]['Gene-commentary_comment'][1]['Gene-commentary_source'][0]['Other-source_src']['Dbtag']['Dbtag_tag']['Object-id']['Object-id_id']
            PPI_Entrez.append(EntrezID)
            Sym = record[0]["Entrezgene_comments"][x]['Gene-commentary_comment'][y]['Gene-commentary_comment'][1]['Gene-commentary_source'][0]['Other-source_anchor']
            PPI_Sym.append(Sym)
# Return the desired values: I want the Entrez ID and Gene symbol for each interacting protein
PPI_Entrez # Returns the EntrezID
PPI_Sym # Returns the gene symbol
This code works and gives me what I want, but I think it's ugly, and I am concerned that the code will break if the format of the Entrez Gene page changes slightly. In particular, there must be a better way to extract the desired information than specifying the full path, as I do with:
record[0]["Entrezgene_comments"][x]['Gene-commentary_comment'][y]['Gene-commentary_comment'][1]['Gene-commentary_source'][0]['Other-source_anchor']
But I cannot figure out how to search through a dictionary of dictionaries without specifying each level I want to descend. When I try functions like find(), they operate on the next level down, but not all the way to the bottom.
Is there a wildcard symbol, a Python equivalent of "//", or a function I can use to get to ['Object-id_id'] without naming the full path? Other suggestions for cleaner code are also appreciated.
I'm not sure about XPath in Python, but if the code works, I would not worry about removing the full paths or about whether the Entrez Gene XML will change. Since you first tried R, you could get the XML using a system call to Entrez Direct, as below, or with a package like rentrez.
doc <- xmlParse( system("efetch -db=gene -id=4131 -format xml", intern=TRUE) )
Next, get the nodes corresponding to rows in the table at http://www.ncbi.nlm.nih.gov/gene/4131#interactions
x <- getNodeSet(doc, "//Gene-commentary_heading[.='Interactions']/../Gene-commentary_comment/Gene-commentary" )
length(x)
[1] 64
x[1]
x[50]
Try the easy stuff first
xmlToDataFrame(x[1:4])
Gene-commentary_type Gene-commentary_text Gene-commentary_refs Gene-commentary_source Gene-commentary_comment
1 18 Affinity Capture-MS 24457600 BioGRID110304BioGRID 255BioGRID110304255GeneID8726EEDBioGRID114265
2 18 Reconstituted Complex 20195357 BioGRID110304BioGRID 255BioGRID110304255GeneID2353FOSBioGRID108636
3 18 Reconstituted Complex 20195357 BioGRID110304BioGRID 255BioGRID110304255GeneID1936EEF1DBioGRID108256
4 18 Affinity Capture-MS 2345592220562859 BioGRID110304BioGRID 255BioGRID110304255GeneID6789STK4BioGRID112665
Gene-commentary_create-date Gene-commentary_update-date
1 2014461120 201410513330
2 201312810490 201410513330
3 201312810490 201410513330
4 20137710360 201410513330
Some tags like text, refs, source, and dates should be easy to parse
sapply(x, function(x) paste( xpathSApply(x, ".//PubMedId", xmlValue), collapse=", "))
I'm not sure about the comments or how Products, Interactants and Other Genes listed in the table are stored in the XML, but I get one or three symbols and three ids for each node here.
sapply(x, function(x) paste( xpathSApply(x, ".//Gene-commentary_comment//Other-source_anchor", xmlValue), collapse=" + "))
sapply(x, function(x) paste( xpathSApply(x, ".//Gene-commentary_comment//Object-id_id", xmlValue), collapse=" + "))
Finally, since I think Entrez Gene just copies IntAct and BioGrid, you could try those sites too. Biogrid has a really simple Rest service, but you have to register for a key.
url <- "http://webservice.thebiogrid.org/interactions?geneList=MAP1B&taxId=9606&includeHeader=TRUE&accesskey=[ your ACCESSKEY ]"
biogrid <- read.delim(url)
dim(biogrid)
[1] 58 24
head(biogrid[, c(8:9,12)])
Official.Symbol.Interactor.A Official.Symbol.Interactor.B Experimental.System
1 ANP32A MAP1B Two-hybrid
2 MAP1B ANP32A Two-hybrid
3 RASSF1 MAP1B Affinity Capture-Western
4 RASSF1 MAP1B Two-hybrid
5 ANP32A MAP1B Affinity Capture-Western
6 GAN MAP1B Affinity Capture-Western
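On the Python side of the question, a small recursive generator can serve as a rough equivalent of "//". This is a sketch that assumes the parsed record behaves like nested dicts and lists; find_key and the sample data below are hypothetical and only imitate the shape of the Entrez structure.

```python
def find_key(obj, key):
    """Yield every value stored under key at any depth of nested dicts and lists."""
    if isinstance(obj, dict):
        for k, v in obj.items():
            if k == key:
                yield v
            # keep descending, in case the same key appears deeper down
            yield from find_key(v, key)
    elif isinstance(obj, (list, tuple)):
        for item in obj:
            yield from find_key(item, key)


# Made-up data imitating the nesting of the Entrez record.
sample = {
    "Entrezgene_comments": [
        {"Gene-commentary_heading": "Interactions",
         "Gene-commentary_comment": [
             {"Other-source_anchor": "EED",
              "Dbtag_tag": {"Object-id": {"Object-id_id": "8726"}}},
             {"Other-source_anchor": "FOS",
              "Dbtag_tag": {"Object-id": {"Object-id_id": "2353"}}},
         ]},
    ]
}
ids = list(find_key(sample, "Object-id_id"))
symbols = list(find_key(sample, "Other-source_anchor"))
```

Because the search ignores position, it would return every Object-id_id in the record, so in practice it is probably best applied to the Interactions sub-dictionary rather than to the whole record.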
