I have a SpaCy dependency tree made by this code:
import spacy
from spacy import displacy

nlp = spacy.load('en_core_web_lg')  # assuming the same model the answer below uses
text = "We could say to them that if in fact that's all there is, then we could, Oh, we can do something."
displacy.render(nlp(text), style='dep', jupyter=True, options={'distance': 120})
That renders the dependency-tree visualization (not reproduced here).
SpaCy determines that this entire string is connected in a dependency tree. What I am trying to figure out is how to discern how direct or indirect the connection is between a word and the next word. For example, looking at the first 3 words:
'We' is connected to the next word 'could' because it is directly connected to 'say', which is directly connected to 'could'. Therefore, it is 2 connection points away from the next word.
'could' is directly connected to 'say'. Therefore it is 1 connection point away from the next word.
and so on.
Essentially, I want to make a df that would look like this:
word connection_points_to_next_word
We 2
could 1
say 1
...
I'm not sure how to achieve this. As SpaCy makes this graph, I'm sure there is some efficient way to calculate the number of vertices required to connect adjacent nodes, but all of SpaCy's tools I've found, such as:
token.lefts
token.rights
token.subtree
token.children
more here https://spacy.io/api/token
include connection information, but not how direct the connection is. Any ideas on how to get closer to solving this?
Using the networkx library, we can build an undirected graph from the edge list of head-to-child relationships. I am using the index of each token in the document as a unique identifier, so that repeated words are treated as separate nodes.
import spacy
import networkx as nx

nlp = spacy.load('en_core_web_lg')
text = "We could say to them that if in fact that's all there is, then we could, Oh, we can do something."
doc = nlp(text)

edges = []
for tok in doc:
    # one edge per head -> child relation in the parse
    edges.extend([(tok.i, child.i) for child in tok.children])

graph = nx.Graph(edges)  # build the undirected graph used below
The shortest path between neighboring tokens can be calculated as below:
for idx, _ in enumerate(doc):
    if idx < len(doc) - 1:
        print(doc[idx], doc[idx + 1], nx.shortest_path_length(graph, source=idx, target=idx + 1))
Output:
We could 2
could say 1
say to 1
to them 1
them that 4
that if 3
if in 2
in fact 1
fact that 3
that 's 1
's all 1
all there 2
there is 1
is , 4
, then 2
then we 2
we could 2
could , 2
, Oh 2
Oh , 2
, we 2
we can 2
can do 1
do something 1
something . 3
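If you want the result as a DataFrame like the one sketched in the question, a minimal sketch (assuming pandas is installed, and reusing doc and graph from above):
import pandas as pd

rows = [{'word': doc[idx].text,
         'connection_points_to_next_word': nx.shortest_path_length(graph, source=idx, target=idx + 1)}
        for idx in range(len(doc) - 1)]
df = pd.DataFrame(rows)
print(df.head())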
I have a use case in my project where I need to compare a key string against a large number of other strings for similarity. If the similarity is greater than a certain threshold, I consider those strings "similar" to my key and, based on that list, do some further calculations/processing.
I have been exploring fuzzy string matching, which uses edit-distance-based algorithms like Levenshtein, Jaro and Jaro-Winkler similarity.
Although they work fine, I want a higher similarity score if one string is an "abbreviation" of the other. Is there any algorithm/implementation I can use for this?
Note:
language: python3
packages explored: fuzzywuzzy, jaro-winkler
Example:
using jaro_winkler similarity:
>>> jaro.jaro_winkler_metric("wtw", "willis tower watson")
0.7473684210526316
>>> jaro.jaro_winkler_metric("wtw", "willistowerwatson")
0.7529411764705883
using levenshtein similarity:
>>> fuzz.ratio("wtw", "willis tower watson")
27
>>> fuzz.ratio("wtw", "willistowerwatson")
30
>>> fuzz.partial_ratio("wtw", "willistowerwatson")
67
>>> fuzz.QRatio("wtw", "willistowerwatson")
30
In these kinds of cases, I want the score to be higher (>90%) if possible. I'm OK with a few false positives as well, as they won't cause too much of an issue with my further calculations. But if we match s1 and s2 such that s1 is fully contained in s2 (or vice versa), their similarity score should be much higher.
Edit: Further Examples for my Use-Case
For me, spaces are redundant. That means wtw is considered an abbreviation of "willistowerwatson" and "willis tower watson" alike.
Also, stove is a valid abbreviation for "STack OVErflow" or "STandardOVErview".
A simple algorithm would be to start with the 1st character of the smaller string and check whether it is present in the larger one, then check the 2nd character, and so on, until the condition is satisfied that the 1st string is fully contained in the 2nd string. This is a 100% match for me.
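A minimal sketch of that check (the helper name is illustrative; it ignores case and spaces and only tests whether the shorter string is an in-order subsequence of the longer one):
def is_abbreviation(abbr, full):
    # True if every character of abbr appears in full, in the same order
    chars = iter(full.lower().replace(" ", ""))
    return all(ch in chars for ch in abbr.lower())

print(is_abbreviation("wtw", "willis tower watson"))  # True
print(is_abbreviation("stove", "STack OVErflow"))     # True
print(is_abbreviation("wtwx", "willistowerwatson"))   # False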
Further examples like wtwx to "willistowerwatson" could give a score of, say, 80% (this could be based on some edit-distance logic). Even a package that just returns True or False for abbreviation similarity would be helpful.
To detect abbreviations in strings, you can still use the fuzzywuzzy module, with the process() function:
from fuzzywuzzy import fuzz, process
s1 = ["willis tower watson", "stack overflow", "willistowerwatson", "international business machines"]
s2 = ['wtw', "so", "wtw", "ibz"]
queries = [''.join([i[0] for i in j.split()]) for j in s1]
for query, company in zip(queries, s1):
    print(company, '-', process.extractOne(query, s2, scorer=fuzz.partial_token_sort_ratio))
Output:
willis tower watson - ('wtw', 100)
stack overflow - ('so', 100)
willistowerwatson - ('wtw', 100)
international business machines - ('ibz', 67)
You can use a recursive algorithm, similar to sequence alignment. Just don't apply a penalty for shifts (they are expected in abbreviations), but do penalize a mismatch on the first character.
This one should work, for example:
def abbreviation(abr, word, penalty=1):
    # +1 per matched character; leftover/skipped abbreviation characters cost `penalty` each
    if len(abr) == 0:
        return 0
    elif len(word) == 0:
        return penalty * len(abr) * -1
    elif abr[0] == word[0]:
        if len(abr) > 1:
            return 1 + max(abbreviation(abr[1:], word[1:]),
                           abbreviation(abr[2:], word[1:]) - penalty)
        else:
            return 1 + abbreviation(abr[1:], word[1:])
    else:
        # shifting along the full word is free
        return abbreviation(abr, word[1:])

def compute_match(abbr, word, penalty=1):
    score = abbreviation(abbr.lower(), word.lower(), penalty)
    # a mismatch on the first character is penalized explicitly
    if abbr[0].lower() != word[0].lower():
        score -= penalty
    return score / len(abbr)
print(compute_match("wtw", "willis tower watson"))
print(compute_match("wtwo", "willis tower watson"))
print(compute_match("stove", "Stackoverflow"))
print(compute_match("tov", "Stackoverflow"))
print(compute_match("wtwx", "willis tower watson"))
The output is:
1.0
1.0
1.0
0.6666666666666666
0.5
Indicating that wtw and wtwo are perfectly valid abbreviations of "willis tower watson", and that stove is a valid abbreviation of Stackoverflow but tov is not, since it has the wrong first character.
And wtwx is only a partially valid abbreviation of "willis tower watson", because it ends with a character that does not occur in the full name.
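One caveat: the plain recursion above recomputes the same (abr, word) suffix pairs many times, so it can get slow on long inputs. A cheap fix, sketched here, is to memoize the helper with functools.lru_cache after it is defined (the recursive calls then go through the cached wrapper):
from functools import lru_cache

abbreviation = lru_cache(maxsize=None)(abbreviation)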
This is an extended question from this topic. I would like to search strings for whole and partial matches using the following keyword Series w:
rigour*
*demeanour*
centre*
*arbour
fulfil
This means I want to match words like rigour and rigours, endemeanour and endemeanours, centre and centres, harbour and arbour, and fulfil. So the keyword list I have is a mix of complete and partial strings to find. I would like to apply the search to this DataFrame df:
ID;name
01;rigour
02;rigours
03;endemeanour
04;endemeanours
05;centre
06;centres
07;encentre
08;fulfil
09;fulfill
10;harbour
11;arbour
12;harbours
What I tried so far is the following:
r = re.compile(r'.*({}).*'.format('|'.join(w.values)), re.IGNORECASE)
Then I built a mask to filter the DataFrame:
mask = [m.group(1) if m else None for m in map(r.search, df['name'])]
in order to get a new column with the Keyword found:
df['keyword'] = mask
What I'm expecting is the following resulting DataFrame:
ID;name;keyword
01;rigour;rigour
02;rigours;rigour
03;endemeanour;demeanour
04;endemeanours;demeanour
05;centre;centre
06;centres;centre
07;encentre;None
08;fulfil;fulfil
09;fulfill;None
10;harbour;arbour
11;arbour;arbour
12;harbours;None
This works when the w list contains no *. The issue I ran into is how to format the keyword list w when entries contain the * conditions, so that re.compile builds the pattern correctly.
Any help would be really appreciated.
It looks like your input Series w needs to be adjusted before it can be used as a regex pattern, like this:
rigour.*
.*demeanour.*
centre.*
\b.*arbour\b
\bfulfil\b
Note that * in regex goes after something; it does not work on its own. It means that whatever it follows can be repeated 0 or more times.
Note also that fulfil is a substring of fulfill, so if you want a strict match you need to tell the regex this, for example with the word-boundary anchor \b, so it only catches the string as a whole word.
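If your w Series still holds the original wildcard-style entries, one possible way to build these regex strings programmatically is sketched below. This encodes an assumption about your data: a leading or trailing * becomes .*, and a bare end is anchored with \b (the extra leading \b on rigour is harmless because str.match anchors at the start anyway):
import re
import pandas as pd

w_raw = pd.Series(['rigour*', '*demeanour*', 'centre*', '*arbour', 'fulfil'])

def to_regex(keyword):
    # escape the literal part, then translate the wildcard ends
    core = re.escape(keyword.strip('*'))
    left = '.*' if keyword.startswith('*') else r'\b'
    right = '.*' if keyword.endswith('*') else r'\b'
    return left + core + right

w = w_raw.map(to_regex)  # use these values when building the combined pattern below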
Here is how your combined regex might look to give you the results you need:
s = '({})'.format('|'.join(w.values))
r = re.compile(s, re.IGNORECASE)
r
re.compile(r'(rigour.*|.*demeanour.*|centre.*|\b.*arbour\b|\bfulfil\b)', re.IGNORECASE)
And the replacement in your code can be done with the pandas .where method like this:
df['keyword'] = df.name.where(df.name.str.match(r), None)
df
ID name keyword
0 1 rigour rigour
1 2 rigours rigours
2 3 endemeanour endemeanour
3 4 endemeanours endemeanours
4 5 centre centre
5 6 centres centres
6 7 encentre None
7 8 fulfil fulfil
8 9 fulfill None
9 10 harbour harbour
10 11 arbour arbour
11 12 harbours None
I want to scrape the Interactions table from the Entrez Gene page.
The Interactions table is populated from a web server, and when I tried to use the XML package in R, I could get the Entrez Gene page, but the body of the Interactions table was empty (it had not been populated by the web server).
Dealing with the web-server issue in R may be solvable (and I'd love to see how), but Biopython seemed like an easier path.
I put together the following, which gives me what I want for an example gene:
# Pull the Entrez gene page for MAP1B using Biopython
from Bio import Entrez
Entrez.email = "jamayfie#vasci.umass.edu"
handle = Entrez.efetch(db="gene", id="4131", retmode="xml")
record = Entrez.read(handle)
handle.close()
PPI_Entrez = []
PPI_Sym = []

# Find the dictionary that contains the Interactions table
for x in range(1, len(record[0]["Entrezgene_comments"])):
    if ('Gene-commentary_heading', 'Interactions') in record[0]["Entrezgene_comments"][x].items():
        for y in range(0, len(record[0]["Entrezgene_comments"][x]['Gene-commentary_comment'])):
            EntrezID = record[0]["Entrezgene_comments"][x]['Gene-commentary_comment'][y]['Gene-commentary_comment'][1]['Gene-commentary_source'][0]['Other-source_src']['Dbtag']['Dbtag_tag']['Object-id']['Object-id_id']
            PPI_Entrez.append(EntrezID)
            Sym = record[0]["Entrezgene_comments"][x]['Gene-commentary_comment'][y]['Gene-commentary_comment'][1]['Gene-commentary_source'][0]['Other-source_anchor']
            PPI_Sym.append(Sym)
# Return the desired values: I want the Entrez ID and Gene symbol for each interacting protein
PPI_Entrez # Returns the EntrezID
PPI_Sym # Returns the gene symbol
This code works, giving me what I want. But I think it's ugly, and I'm concerned that if the format of the Entrez Gene page changes slightly, the code will break. In particular, there must be a better way to extract the desired information than specifying the full path, as I do with:
record[0]["Entrezgene_comments"][x]['Gene-commentary_comment'][y]['Gene-commentary_comment'][1]['Gene-commentary_source'][0]['Other-source_anchor']
But I cannot figure out how to search through a dictionary of dictionaries without specifying each level I want to descend into. When I try functions like find(), they operate on the next level down, but not all the way to the bottom.
Is there a wildcard symbol, a Python equivalent of "//", or a function I can use to get to ['Object-id_id'] without naming the full path? Other suggestions for cleaner code are also appreciated.
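One generic way to avoid the hard-coded paths is a small recursive helper that plays the role of an XPath-style //. This is only a sketch (the helper name is illustrative, not part of Biopython); it walks the nested dicts and lists returned by Entrez.read and yields every value stored under a given key:
def iter_key(obj, key):
    # recursively yield every value stored under `key` in nested dicts/lists
    if isinstance(obj, dict):
        for k, v in obj.items():
            if k == key:
                yield v
            yield from iter_key(v, key)
    elif isinstance(obj, list):
        for item in obj:
            yield from iter_key(item, key)

# e.g. every Object-id_id below the Interactions commentary
# ids = list(iter_key(record[0]["Entrezgene_comments"][x], 'Object-id_id'))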
I'm not sure about XPath in Python, but if the code works, I would not worry about removing the full paths or about the Entrez Gene XML changing. Since you first tried R, you could get the XML using a system call to Entrez Direct, as below, or a package like rentrez.
doc <- xmlParse( system("efetch -db=gene -id=4131 -format xml", intern=TRUE) )
Next, get the nodes corresponding to rows in the table at http://www.ncbi.nlm.nih.gov/gene/4131#interactions
x <- getNodeSet(doc, "//Gene-commentary_heading[.='Interactions']/../Gene-commentary_comment/Gene-commentary" )
length(x)
[1] 64
x[1]
x[50]
Try the easy stuff first
xmlToDataFrame(x[1:4])
Gene-commentary_type Gene-commentary_text Gene-commentary_refs Gene-commentary_source Gene-commentary_comment
1 18 Affinity Capture-MS 24457600 BioGRID110304BioGRID 255BioGRID110304255GeneID8726EEDBioGRID114265
2 18 Reconstituted Complex 20195357 BioGRID110304BioGRID 255BioGRID110304255GeneID2353FOSBioGRID108636
3 18 Reconstituted Complex 20195357 BioGRID110304BioGRID 255BioGRID110304255GeneID1936EEF1DBioGRID108256
4 18 Affinity Capture-MS 2345592220562859 BioGRID110304BioGRID 255BioGRID110304255GeneID6789STK4BioGRID112665
Gene-commentary_create-date Gene-commentary_update-date
1 2014461120 201410513330
2 201312810490 201410513330
3 201312810490 201410513330
4 20137710360 201410513330
Some tags like text, refs, source, and dates should be easy to parse
sapply(x, function(x) paste( xpathSApply(x, ".//PubMedId", xmlValue), collapse=", "))
I'm not sure about the comments, or how the Products, Interactants and Other Genes listed in the table are stored in the XML, but I get either one or three symbols and three ids for each node here.
sapply(x, function(x) paste( xpathSApply(x, ".//Gene-commentary_comment//Other-source_anchor", xmlValue), collapse=" + "))
sapply(x, function(x) paste( xpathSApply(x, ".//Gene-commentary_comment//Object-id_id", xmlValue), collapse=" + "))
Finally, since I think Entrez Gene just copies IntAct and BioGRID, you could try those sites too. BioGRID has a really simple REST service, but you have to register for an access key.
url <- "http://webservice.thebiogrid.org/interactions?geneList=MAP1B&taxId=9606&includeHeader=TRUE&accesskey=[ your ACCESSKEY ]"
biogrid <- read.delim(url)
dim(biogrid)
[1] 58 24
head(biogrid[, c(8:9,12)])
Official.Symbol.Interactor.A Official.Symbol.Interactor.B Experimental.System
1 ANP32A MAP1B Two-hybrid
2 MAP1B ANP32A Two-hybrid
3 RASSF1 MAP1B Affinity Capture-Western
4 RASSF1 MAP1B Two-hybrid
5 ANP32A MAP1B Affinity Capture-Western
6 GAN MAP1B Affinity Capture-Western
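If you would rather stay in Python, the same BioGRID REST query can be read with pandas; a rough sketch (the URL parameters mirror the R example above, tab-delimited output is assumed, and you still need to substitute your own access key):
import pandas as pd

url = ("http://webservice.thebiogrid.org/interactions"
       "?geneList=MAP1B&taxId=9606&includeHeader=TRUE&accesskey=[ your ACCESSKEY ]")
biogrid = pd.read_csv(url, sep="\t")
print(biogrid.shape)
print(biogrid.iloc[:, [7, 8, 11]].head())  # interactor symbols and experimental system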
I have the following dataset:
firm_id_1 firm_id_2
1 2
1 4
1 5
2 1
2 3
3 2
3 6
4 1
4 5
4 6
5 4
5 7
6 3 ....
I would like to graph the network of firm_id = 1. In other words, I want a graph showing that firm 1 is directly connected to firms 2, 4 and 5, indirectly connected to 3 via firm 2, to 6 via firm 4, and to 7 via firm 5. In other words, I want to graph the shortest distance from firm_id = 1 to each node (firm_id). There are 3,000 nodes in my data, and I know that firm 1 reaches every node in fewer than 9 hops. How can I graph this in Python?
I would start with a library called NetworkX. I'm not sure I understand everything that you are looking for, but I think this should be close enough for you to modify it.
This program will load the data in from a text file graphdata.txt, split by whitespace, and add the pair as an edge.
It will then calculate the shortest paths to all nodes from 1, and then print if the distance is larger than 9... see the documentation for more details.
Lastly, it will render the graph using a spring layout to a file called mynetwork.png and to the screen.
Some optimization may / may not be needed for 3000 nodes.
Hope this helps!
import networkx as nx
import matplotlib.pyplot as plt

graph = nx.Graph()

with open('graphdata.txt') as f:
    for line in f:
        firm_id_1, firm_id_2 = line.split()
        graph.add_edge(firm_id_1, firm_id_2)

paths_from_1 = nx.shortest_path(graph, "1")

for node in paths_from_1:
    if len(paths_from_1[node]) > 9:
        print "Shortest path from 1 to", node, "is longer than 9"

pos = nx.spring_layout(graph, iterations=200)
nx.draw(graph, pos)
plt.savefig("mynetwork.png")
plt.show()
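If you only want to draw the part of the network that firm 1 actually reaches, nx.ego_graph can cut the graph down to the nodes within 9 hops of firm 1; a sketch reusing graph from above (node labels are strings here because they were read from a file):
sub = nx.ego_graph(graph, "1", radius=9)   # nodes within 9 hops of firm "1"
nx.draw(sub, nx.spring_layout(sub, iterations=200), with_labels=True)
plt.savefig("firm1_network.png")
plt.show()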
You can try the python-graph package. I am not sure about its scalability, but you can do something like...
from pygraph.classes.digraph import digraph
from pygraph.algorithms.minmax import shortest_path

gr = digraph()
gr.add_nodes(range(1, num_nodes))

for i in range(num_edges):
    gr.add_edge((edge_start, edge_end))

# shortest path from node 1 to all others
shortest_path(gr, 1)
I would like to find the node connectivity between node 1 and the rest of the nodes in a graph. The input text file format is as follows:
1 2 1
1 35 1
8 37 1
and so on, for 167 lines. The first column represents the source node, the second column the destination node, and the last column the weight of the edge.
I'm trying to read the source and destination nodes from the input file and form an edge between them. I then need to find out whether it is a connected network (only one component of the graph and no sub-components). Here is the code:
from numpy import *
import networkx as nx

G = nx.empty_graph()

for row in file('out40.txt'):
    row = row.split()
    src = row[0]
    dest = row[1]
    #print src
    G.add_edge(src, dest)
    print src, dest

for i in range(2, 41):
    if nx.bidirectional_dijkstra(G, 1, i): print "path exists from 1 to ", i
Manually adding the edges using
G.add_edge(1, 2)
works, but is tedious and not suitable for large input files such as mine. The if condition works when I add the edges manually, but the code above throws the following error:
in neighbors_iter
raise NetworkXError("The node %s is not in the graph."%(n,))
networkx.exception.NetworkXError: The node 2 is not in the graph.
Any help will be much appreciated!
In your code, you're adding nodes "1" and "2" et cetera as strings, since reading from a file gives you strings unless you explicitly convert them.
However, you're then trying to refer to nodes 1 and 2 as integers; I'm guessing that networkx does not think that 2 == "2".
Try changing this...
G.add_edge(src, dest)
to this:
G.add_edge(int(src), int(dest))
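Applied to the reading loop from the question, the fix is just (a sketch; the rest of the code stays the same):
for row in file('out40.txt'):
    row = row.split()
    G.add_edge(int(row[0]), int(row[1]))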
Not sure if that is an option for you, but are you aware of networkx's built-in support for multiple graph text formats?
The edge list format seems to fit your case pretty well. Specifically, the following method will read your input file without the need for custom code:
G = nx.read_weighted_edgelist(filename)
If you want to remove the weights (because you don't need them), you could subsequently do the following:
for e in G.edges_iter(data=True):
    e[2].clear()  # e[2] is the 3rd element of the tuple, which contains the dictionary with edge attributes
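Putting that together with the path check from the question, a minimal sketch (nodes read this way are strings, hence the str(i)):
import networkx as nx

G = nx.read_weighted_edgelist('out40.txt')
for i in range(2, 41):
    if nx.has_path(G, '1', str(i)):
        print("path exists from 1 to", i)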
Following the NetworkX documentation, you can add the nodes explicitly before adding the edge:
for row in file('out40.txt'):
    row = row.split()
    src = row[0]
    dest = row[1]
    G.add_nodes_from([src, dest])
    #print src
    G.add_edge(src, dest)
    print src, dest
The error message says that the graph G doesn't have the nodes you are trying to create an edge between.
You can also use "is_connected()" to make this a little simpler. e.g.
$ cat disconnected.edgelist
1 2 1
2 3 1
4 5 1
$ cat connected.edgelist
1 2 1
2 3 1
3 4 1
$ ipython
In [1]: import networkx as nx
In [2]: print(nx.is_connected(nx.read_weighted_edgelist('disconnected.edgelist')))
False
In [3]: print(nx.is_connected(nx.read_weighted_edgelist('connected.edgelist')))
True
Another option is to load the file into a pandas DataFrame and then iterate over it with iterrows:
import pandas as pd
import networkx as nx

cols = ["src", "des", "wei"]
df = pd.read_csv('out40.txt', sep=" ", header=None, names=cols)

G = nx.empty_graph()
for index, row in df.iterrows():
    G.add_edge(row["src"], row["des"])
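As a side note, newer networkx versions (2.x) can build the graph straight from the DataFrame in a single call; a sketch:
G = nx.from_pandas_edgelist(df, source="src", target="des", edge_attr="wei")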