Python dictionaries not copied properly causing repetitions, how to get this right?

I'm writing a function that compares lists (significant genes for each test) and lists the common elements (genes) for every possible combination of the selected lists.
These results are to be used for a Venn-diagram-style visualization.
The number of tests and genes is flexible.
The input JSON file looks something like this:
| test | genes |
|----------------- |--------------------------------------------------- |
| p-7trt_1/0con_1 | [ENSMUSG00000000031, ENSMUSG00000000049, ENSMU... |
| p-7trt_2/0con_1 | [ENSMUSG00000000031, ENSMUSG00000000037, ENSMU... |
| p-7trt_1/0con_2 | [ENSMUSG00000000037, ENSMUSG00000000049, ENSMU... |
| p-7trt_2/0con_2 | [ENSMUSG00000000028, ENSMUSG00000000031, ENSMU... |
| p-7trt_1/0con_3 | [ENSMUSG00000000088, ENSMUSG00000000094, ENSMU... |
| p-7trt_2/0con_3 | [ENSMUSG00000000028, ENSMUSG00000000031, ENSMU... |
So the function is as follows:
import pandas as pd

def get_venn_compiled_data(dir_loc):
    """return json of compiled data for the venn thing
    """
    data_frame = pd.read_json(dir_loc + "/venn.json", orient="records")
    number_of_tests = data_frame.shape[0]

    venn_data = []
    # list of genes which are common across listed tests
    venn_data_point = {"tests": [], "genes": []}
    binary = lambda x: bin(x)[2:]  # to directly get the binary number

    for dec_number in range(1, 2 ** number_of_tests):
        # resetting
        venn_data_point["tests"] = []
        venn_data_point["genes"] = []

        # using a binary number to get all the cases
        for index, state in enumerate(binary(dec_number)):
            if state == "0":
                continue

            # putting in all the genes from the first test
            if venn_data_point["tests"] == []:
                venn_data_point["genes"] = data_frame["data"][index].copy()
            # removing the ones which are not common in current genes state and this.tests
            else:
                for gene_index, gene in enumerate(venn_data_point["genes"]):
                    if gene not in data_frame["data"][index]:
                        venn_data_point["genes"].pop(gene_index)

            # putting the test in the tests list
            venn_data_point["tests"].append(data_frame["name"][index])

        venn_data.append(venn_data_point.copy())

    return venn_data
I'm basically exploiting the fact that counting in binary generates all possible combinations of 1s and 0s, so I map each place of the binary number to a test; for every binary number, if a place holds a 0, the list corresponding to that test is not taken into the comparison.
I tried my best to explain; please ask in the comments if I was not clear.
After running the function I get an output in which, at seemingly random places, test sets are repeated.
This is the test input file, and this is what came out as the output.
Any help is highly appreciated. Thank you.

I realized what error I was making: I assumed that the binary function would magically always generate a string with the number of places I needed, which it doesn't.
After updating the binary function to pad with those leading zeros, things work fine.
import pandas as pd

def get_venn_compiled_data(dir_loc):
    """return json of compiled data for the venn thing
    """
    # internal variables
    data_frame = pd.read_json(dir_loc + "/venn.json", orient="records")
    number_of_tests = data_frame.shape[0]
    venn_data = []

    # defining internal function
    def binary(dec_no, length=number_of_tests):
        """Just to convert decimal number to binary of specified length
        """
        bin_number = bin(dec_no)[2:]
        if len(bin_number) < length:
            bin_number = "0" * (length - len(bin_number)) + bin_number
        return bin_number

    # list of genes which are common across listed tests
    venn_data_point = {
        "tests": [],
        "genes": [],
    }

    for dec_number in range(1, 2 ** number_of_tests):
        # resetting
        venn_data_point["tests"] = []
        venn_data_point["genes"] = []

        # using a binary number to get all the cases
        for index, state in enumerate(binary(dec_number)):
            if state == "0":
                continue

            # putting in all the genes from the first test
            if venn_data_point["tests"] == []:
                venn_data_point["genes"] = data_frame["data"][index].copy()
            # removing the ones which are not common in current genes state and this.tests
            else:
                for gene_index, gene in enumerate(venn_data_point["genes"]):
                    if gene not in data_frame["data"][index]:
                        venn_data_point["genes"].pop(gene_index)

            # putting the test in the tests list
            venn_data_point["tests"].append(data_frame["name"][index])

        venn_data.append(venn_data_point.copy())

    return venn_data
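As an aside, the same zero-padding can be done in one step with Python's format spec; a minimal illustration (not part of the function above):

# Zero-padded binary via the format spec (illustration only).
number_of_tests = 3
for dec_number in range(1, 2 ** number_of_tests):
    mask = format(dec_number, "0{}b".format(number_of_tests))  # "001", "010", ...
    print(dec_number, mask)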
If anyone else has a more optimized algorithm for this, help is appreciated.
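One direction I have been considering (a sketch only, not tested against the real data): keep each test's genes as a set and intersect them with itertools.combinations. The column names ("name", "data") follow the function above. This also avoids popping from the gene list while iterating over it.

import itertools
import pandas as pd

def get_venn_compiled_data_sets(dir_loc):
    """Sketch: the same output built from set intersections."""
    data_frame = pd.read_json(dir_loc + "/venn.json", orient="records")
    gene_sets = [set(genes) for genes in data_frame["data"]]
    names = list(data_frame["name"])

    venn_data = []
    for size in range(1, len(names) + 1):
        for combo in itertools.combinations(range(len(names)), size):
            # genes present in every selected test
            common = set.intersection(*(gene_sets[i] for i in combo))
            venn_data.append({
                "tests": [names[i] for i in combo],
                "genes": sorted(common),
            })
    return venn_data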

Related

Speeding up fuzzy match on large list

I am working on a project that uses fuzzy logic on a list of names that could reach about 100,000 unique records. In a recent screening we conducted, the function we use takes about 2.20 seconds per name on average. This means that for a list of 10,000 names, the process could take 6 hours, which is really too long.
Is there a way we can speed up our process? Here's the snippet of the script that we use.
# Importing packages
import pandas as pd
import Levenshtein as lev

# Reading cleaned datasets
df_name_reference = pd.read_csv('path_to_file')
df_name_to_screen = pd.read_csv('path_to_file')

# Function used in name screening
def get_similarity_score(s1, s2):
    ''' Return match percentage between 2 strings disregarding name swapping

    Parameters
    -----------
    s1 : str : name from df_name_reference (to be used within pandas apply)
    s2 : str : name from df_name_to_screen (ref_name variable)

    Return
    -----------
    float
    '''
    # Get sorted names
    s1_sort = ' '.join(sorted(s1.split(' '))).strip() if type(s1)==str else ''
    s2_sort = ' '.join(sorted(s2.split(' '))).strip() if type(s2)==str else ''

    # Get ratios and return the max value
    # THIS COULD BE THE BOTTLENECK OF OUR SCRIPT: MORE DETAILS BELOW
    return max([
        lev.ratio(s1, s2),
        lev.ratio(s1_sort, s2),
        lev.ratio(s1, s2_sort),
        lev.ratio(s1_sort, s2_sort)
    ])

# Returning file
screening_results = []
for row in range(df_name_to_screen.shape[0]):
    # Get name to screen
    ref_name = df_name_to_screen.loc[row, 'fullname']
    # Get scores
    scores = df_name_reference.fullname.apply(lev.ratio, args=(ref_name,))
    # Append results
    screening_results.append(pd.DataFrame({'screened_name': ref_name, 'scores': scores}))
I took four scores from lev.ratio. This is to address variations in the arrangement of names, i.e. firstname-lastname and lastname-firstname formats. I know the fuzzywuzzy package has token_sort_ratio, but I've noticed that it just splits the name parts and sorts them alphabetically, which leads to lower scores. Plus, fuzzywuzzy is slower than Levenshtein. So I had to capture the similarity scores of sorted and unsorted names manually, as the small illustration below shows.
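A small illustration of why the sorted variants matter (hypothetical names):

import Levenshtein as lev

s1, s2 = "john smith", "smith john"   # same person, order swapped
s1_sort = ' '.join(sorted(s1.split(' ')))
s2_sort = ' '.join(sorted(s2.split(' ')))

print(lev.ratio(s1, s2))            # penalized by the swap
print(lev.ratio(s1_sort, s2_sort))  # 1.0 once both are sorted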
Can anyone give an approach that I could try? Thanks!
EDIT: Here's a sample dataset that you may try. This is in Google Drive.
In case you don't need scores for all entries in the reference data but just the top N, you can use difflib.get_close_matches to remove the others before calculating any scores:
import difflib

screening_results = []
for row in range(df_name_to_screen.shape[0]):
    ref_name = df_name_to_screen.loc[row, 'fullname']
    skimmed = pd.DataFrame({
        'fullname': difflib.get_close_matches(
            ref_name,
            df_name_reference.fullname,
            N_RESULTS,
            0
        )
    })
    scores = skimmed.fullname.apply(lev.ratio, args=(ref_name,))
    screening_results.append(pd.DataFrame({'screened_name': ref_name, 'scores': scores}))
This takes about 50ms per row using the file you provided.
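Another option worth trying (a hedged suggestion, not something I benchmarked on your file): the rapidfuzz package can score every query against the whole reference column in a single vectorized call, which is typically much faster than a row-by-row pandas apply. Note that its scores are on a 0-100 scale rather than lev.ratio's 0-1.

# Hedged sketch using rapidfuzz (assumes `pip install rapidfuzz`).
from rapidfuzz import fuzz, process

score_matrix = process.cdist(
    df_name_to_screen['fullname'].fillna('').tolist(),   # queries
    df_name_reference['fullname'].fillna('').tolist(),   # choices
    scorer=fuzz.ratio,
    workers=-1,                                           # use all available cores
)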

Trying to place strings into columns

There are 3 columns, levels 1-3. A file is read, and each line of the file contains various data, including the level to which it belongs, located at the end of the string.
Sample lines from file being read:
thing_1 - level 1
thing_17 - level 3
thing_22 - level 2
I want to assign each "thing" to its corresponding column. I have looked into pandas, but it would seem that DataFrame columns won't work as-is, since the passed data would need attributes matching the number of columns; in my case I need 3 columns, but each piece of data has only 1 data point.
How could I approach this problem?
Desired output:
level 1 level 2 level 3
thing_1 thing_22 thing_17
Edit:
Looking at the suggestions, I can refine my question further. I have up to 3 columns, and each line from the file needs to be assigned to one of the 3 columns. Most solutions seem to need something like:
data = [['Mary', 20], ['John', 57]]
columns = ['Name', 'Age']
This does not work for me, since there are 3 columns, and each piece of data goes into only one.
There's an additional wrinkle here that I didn't notice at first. If each of your levels has the same number of things, then you can build a dictionary and then use it to supply the table's columns to PrettyTable:
from prettytable import PrettyTable

# Create an empty dictionary.
levels = {}

with open('data.txt') as f:
    for line in f:
        # Remove trailing \n and split into the parts we want.
        thing, level = line.rstrip('\n').split(' - ')

        # If this is a new level, set it to a list containing its thing.
        if level not in levels:
            levels[level] = [thing]
        # Otherwise, add the new thing to the level's list.
        else:
            levels[level].append(thing)

# Create the table, and add each level as a column.
table = PrettyTable()
for level, things in levels.items():
    table.add_column(level, things)

print(table)
For the example data you showed, this prints:
+---------+----------+----------+
| level 1 | level 3 | level 2 |
+---------+----------+----------+
| thing_1 | thing_17 | thing_22 |
+---------+----------+----------+
The Complication
I probably wouldn't have posted an answer (believing it was covered sufficiently in this answer), except that I realized there's an unintuitive hurdle here. If your levels contain different numbers of things each, you get an error like this:
Exception: Column length 2 does not match number of rows 1!
Because none of the readily available solutions handle this automatically, here is a simple way to do it. Build the dictionary as before, then:
# Find the length of the longest list of things.
longest = max(len(things) for things in levels.values())

table = PrettyTable()
for level, things in levels.items():
    # Pad out the list if it's shorter than the longest.
    things += ['-'] * (longest - len(things))
    table.add_column(level, things)

print(table)
This will print something like this:
+---------+----------+----------+
| level 1 | level 3 | level 2 |
+---------+----------+----------+
| thing_1 | thing_17 | thing_22 |
| - | - | thing_5 |
+---------+----------+----------+
Extra
If all of that made sense and you'd like to know about a way part of it can be streamlined a little, take a look at Python's defaultdict. It can take care of the "check if this key already exists" process, providing a default (in this case a new list) if nothing's already there.
from collections import defaultdict

levels = defaultdict(list)

with open('data.txt') as f:
    for line in f:
        # Remove trailing \n and split into the parts we want.
        thing, level = line.rstrip('\n').split(' - ')
        # Automatically handles adding a new key if needed:
        levels[level].append(thing)
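And if you decide you want this in pandas after all, a dict of Series handles the unequal column lengths for you (shorter columns are padded with NaN); a minimal sketch reusing the levels dict built above:

import pandas as pd

# Shorter columns are padded with NaN; fill them to match the '-' padding above.
df = pd.DataFrame({level: pd.Series(things) for level, things in levels.items()})
print(df.fillna('-'))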

Sample RDD element(s) according to weighted probability [Spark]

In PySpark I have an RDD composed by (key;value) pairs where keys are sequential integers and values are floating point numbers.
I'd like to sample exactly one element from this RDD with probability proportional to value.
Naively, this task can be accomplished as follows:
pairs = myRDD.collect() #now pairs is a list of (key;value) tuples
K, V = zip(*pairs) #separate keys and values
V = numpy.array(V)/sum(V) #normalise probabilities
extractedK = numpy.random.choice(K,size=1,replace=True, p=V)
What concerns me is the collect() operation which, as you might know, loads the entire list of tuples in memory, which can be quite expensive. I'm aware of takeSample(), which is great when elements should be extracted uniformly, but what happens if elements should be extracted according to weighted probabilities?
Thanks!
Here is an algorithm I worked out to do this:
EXAMPLE PROBLEM
Assume we want to sample 10 items from an RDD on 3 partitions like this:
P1: ("A", 0.10), ("B", 0.10), ("C", 0.20)
P2: ("D": 0.25), ("E", 0.25)
P3: ("F", 0.10)
Here is the high-level algorithm:
INPUT: number of samples and a RDD of items (with weights)
OUTPUT: dataset sample on driver
For each partition, calculate the total probability of sampling from the partition, and aggregate those values to the driver.
This would give the probability distribution: Prob(P1) = 0.40, Prob(P2) = 0.50, Prob(P3) = 0.10
Generate a sample of the partitions (to determine the number of elements to select from each partition.)
A sample may look like this: [P1, P1, P1, P1, P2, P2, P2, P2, P2, P3]
This would give us 4 items from P1, 5 items from P2, and 1 item from P3.
On each separate partition, we locally generate a sample of the needed size using only the elements on that partition:
On P1, we would sample 4 items with the (re-normalized) probability distribution: Prob(A) = 0.25, Prob(B) = 0.25, Prob(C) = 0.50. This could yield a sample such as [A, B, C, C].
On P2, we would sample 5 items with probability distribution: Prob(D) = 0.5, Prob(E) = 0.5. This could yield a sample such as [D,D,E,E,E]
On P3: sample 1 item with probability distribution: P(F) = 1.0; this would generate the sample [F].
Collect the samples to the driver to yield your dataset sample [A,B,C,C,D,D,E,E,E,F].
Here is an implementation in scala:
case class Sample[T](weight: Double, obj: T)
/*
* Obtain a sample of size `numSamples` from an RDD `ar` using a two-phase distributed sampling approach.
*/
def sampleWeightedRDD[T:ClassTag](ar: RDD[Sample[T]], numSamples: Int)(implicit sc: SparkContext): Array[T] = {
// 1. Get total weight on each partition
var partitionWeights = ar.mapPartitionsWithIndex{case(partitionIndex, iter) => Array((partitionIndex, iter.map(_.weight).sum)).toIterator }.collect().toArray
//Normalize to 1.0
val Z = partitionWeights.map(_._2).sum
partitionWeights = partitionWeights.map{case(partitionIndex, weight) => (partitionIndex, weight/Z)}
// 2. Sample from partitions indexes to determine number of samples from each partition
val samplesPerIndex = sc.broadcast(sample[Int](partitionWeights, numSamples).groupBy(x => x).mapValues(_.size).toMap).value
// 3. On each partition, sample the number of elements needed for that partition
ar.mapPartitionsWithIndex{case(partitionIndex, iter) =>
val numSamplesForPartition = samplesPerIndex.getOrElse(partitionIndex, 0)
var ar = iter.map(x => (x.obj, x.weight)).toArray
//Normalize to 1.0
val Z = ar.map(x => x._2).sum
ar = ar.map{case(obj, weight) => (obj, weight/Z)}
sample(ar, numSamplesForPartition).toIterator
}.collect()
}
This code uses a simple weighted sampling function, sample:
// a very simple weighted sampling function
def sample[T:ClassTag](dist: Array[(T, Double)], numSamples: Int): Array[T] = {
  val probs = dist.zipWithIndex.map{case((elem,prob),idx) => (elem,prob,idx+1)}.sortBy(-_._2)
  val cumulativeDist = probs.map(_._2).scanLeft(0.0)(_+_).drop(1)
  (1 to numSamples).toArray.map(x => scala.util.Random.nextDouble).map{case(p) =>
    def findElem(p: Double, cumulativeDist: Array[Double]): Int = {
      for(i <- (0 until cumulativeDist.size-1))
        if (p <= cumulativeDist(i)) return i
      return cumulativeDist.size-1
    }
    probs(findElem(p, cumulativeDist))._1
  }
}
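For anyone working in PySpark, here is a rough sketch of the same two-phase idea (my own untested translation, not the code above), assuming an RDD of (key, weight) pairs and numpy available on the workers:

import numpy as np

def sample_weighted_rdd(rdd, num_samples, seed=None):
    """Two-phase weighted sampling: pick partitions first, then sample locally."""
    rng = np.random.default_rng(seed)

    # 1. Total weight per partition, collected to the driver.
    def partition_weight(index, it):
        yield index, sum(w for _, w in it)
    part_weights = dict(rdd.mapPartitionsWithIndex(partition_weight).collect())

    # 2. Decide how many samples to draw from each partition.
    indexes = list(part_weights)
    probs = np.array([part_weights[i] for i in indexes], dtype=float)
    probs /= probs.sum()
    chosen = rng.choice(indexes, size=num_samples, replace=True, p=probs)
    per_partition = {int(i): int((chosen == i).sum()) for i in indexes}

    # 3. On each partition, sample locally with re-normalized weights.
    def sample_partition(index, it):
        n = per_partition.get(index, 0)
        items = list(it)
        if n == 0 or not items:
            return iter([])
        keys = [k for k, _ in items]
        weights = np.array([w for _, w in items], dtype=float)
        weights /= weights.sum()
        local_rng = np.random.default_rng(((seed or 0) + 1) * (index + 1))
        return iter(local_rng.choice(keys, size=n, replace=True, p=weights).tolist())

    return rdd.mapPartitionsWithIndex(sample_partition).collect()

Like the Scala version, it only collects the per-partition totals and the final sample to the driver.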
This is basically doable, but you should really consider whether it makes sense to use Spark for this. If you need to draw random values, then you presumably need to do so many times over in a loop. Each iteration will require scanning through all the data (maybe more than once).
So fitting the data you need into memory and then drawing values from it randomly is almost certainly the right way to go. If your data really is too big to fit into memory, consider (a) collecting only the columns you need for this purpose and (b) whether your data can be binned in a way that makes sense.
Having said that, it is doable within Spark. Below is PySpark code to demonstrate the idea.
import random
import pyspark.sql.functions as F
from pyspark.sql.window import Window

# read some sample data (shown below)
df = spark.read.csv("prb.csv", sep='\t', inferSchema=True, header=True)

# find the sum of the value column
ss = df.groupBy().agg( F.sum("vl").alias("sum") ).collect()

# add a column to store the normalized values
q = df.withColumn("nrm_vl", (df["vl"] / ss[0].sum) )
w = Window.partitionBy().orderBy("nrm_vl")\
    .rowsBetween(Window.unboundedPreceding, Window.currentRow)
q = q.select("*", F.sum("nrm_vl").over(w).alias("cum_vl"))
q.show()
+---+---+-------------------+-------------------+
| ky| vl| nrm_vl| cum_vl|
+---+---+-------------------+-------------------+
| 2|0.8|0.07079646017699115|0.07079646017699115|
| 3|1.1|0.09734513274336283|0.16814159292035397|
| 4|1.7|0.15044247787610618| 0.3185840707964601|
| 0|3.2| 0.2831858407079646| 0.6017699115044247|
| 1|4.5| 0.3982300884955752| 0.9999999999999999|
+---+---+-------------------+-------------------+
def getRandVl(q):
    # choose a random number and find the row that is
    # less than and nearest to the random number
    # (analog to `std::lower_bound` in C++)
    chvl = q.where( q["cum_vl"] > random.random() ).groupBy().agg(
        F.min(q["cum_vl"]).alias("cum_vl") )
    return q.join(chvl, on="cum_vl", how="inner")

# seed the result with one sample so there is something to union onto
cdf = getRandVl(q)

# get 30 random samples.. this is already slow
# on a single machine.
for i in range(0, 30):
    x = getRandVl(q)
    # add this row. there's no reason to do this (it's slow)
    # except that it's convenient to count how often each
    # key was chosen, to check if this method works
    cdf = cdf.select(cdf.columns).union(x.select(cdf.columns))

# count how often we picked each key
cdf.groupBy("ky","vl").agg( F.count("*").alias("count") ).show()
+---+---+-----+
| ky| vl|count|
+---+---+-----+
| 4|1.7| 4|
| 2|0.8| 1|
| 3|1.1| 3|
| 0|3.2| 11|
| 1|4.5| 12|
+---+---+-----+
I think these counts are reasonable given the values. I'd rather test it with many more samples, but it's too slow.

python fastest way to match strings with huge data size

I have a huge table data (or record array) with elements:
tbdata[i]['a'], tbdata[i]['b'], tbdata[i]['c']
which are all integers, and i is a random number between 0 and 1 million (the size of the table).
I also have a list called Name whose elements are all names (900 names in total) of files, such as '/Users/Desktop/Data/spe-3588-55184-0228.jpg' (modified), all containing three numbers.
Now I want to select the rows of tbdata whose three elements above all match the three numbers in a name from the list Name. Here's the code I originally wrote:
Data = []
for k in range(0, len(tbdata)):
    for i in range(0, len(NameA5)):
        if Name[i][43:47] == str(tbdata[k]['a']) and\
           Name[i][48:53] == str(tbdata[k]['b']) and\
           Name[i][55:58] == str(tbdata[k]['c']):
            Data.append(tbdata[k])
Python ran for the whole night and still hadn't finished, since either the data size is huge or my algorithm is too slow... I'm wondering what the fastest way to complete such a task is. Thanks!
You can construct a lookup tree like this:
a2b2c2name = {}
for name in NameA5:
    a = int(name[43:47])
    b = int(name[48:53])
    c = int(name[55:58])
    if a not in a2b2c2name:
        a2b2c2name[a] = {}
    if b not in a2b2c2name[a]:
        a2b2c2name[a][b] = {}
    a2b2c2name[a][b][c] = True

Data = []
for k in range(len(tbdata)):
    a = tbdata[k]['a']
    b = tbdata[k]['b']
    c = tbdata[k]['c']
    if a in a2b2c2name and b in a2b2c2name[a] and c in a2b2c2name[a][b]:
        Data.append(tbdata[k])
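A flatter alternative (a sketch in the same spirit, not tested on the real data): a single set of (a, b, c) tuples gives the same constant-time membership test without the nested dicts. As above, the slice positions come from the question and the name fields are converted to int.

# Build a set of the (a, b, c) triples that appear in the file names.
wanted = set()
for name in NameA5:
    wanted.add((int(name[43:47]), int(name[48:53]), int(name[55:58])))

# Keep only the table rows whose triple appears in the set.
Data = [row for row in tbdata
        if (row['a'], row['b'], row['c']) in wanted]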

Merge two lists of objects containing lists

I have a directory tree containing html files called slides. Something like:
slides_root
|
|_slide-1
| |_slide-1.html
| |_slide-2.html
|
|_slide-2
| |
| |_slide-1
| | |_slide-1.html
| | |_slide-2.html
| | |_slide-3.html
| |
| |_slide-2
| |_slide-1.html
...and so on. They could go even deeper. Now imagine I have to replace some slides in this structure by merging it with another tree which is a subset of this.
WITH AN EXAMPLE: say that I want to replace slide-1.html and slide-3.html inside "slides_root/slide-2/slide-1" merging "slides_root" with:
slide_to_change
|
|_slide-2
|
|_slide-1
|_slide-1.html
|_slide-3.html
I would merge "slide_to_change" into "slides_root". The structure is the same so everything goes fine. But I have to do it in a python object representation of this scheme.
So the two trees are represented by two instances - slides1, slides2 - of the same "Slide" class which is structured as follows:
class Slide(object):
    def __init__(self, path):
        self.path = path
        self.slides = [Slide(path)]
Both slide1 and slide2 contain a path and a list of other Slide objects, each with its own path and its own list of Slide objects, and so on.
The rule is that if the relative path is the same, then I would replace the slide object in slide1 with the one in slide2.
How can I achieve this result? It is really difficult and I can see no way out. Ideally something like:
for slide_root in slide1.slides:
    for slide_dest in slide2.slides:
        if slide_root.path == slide_dest.path:
            slide_root = slide_dest
            # now restart the loop at a deeper level
            # repeat
Thanks everyone for any answer.
Sounds not so complicated.
Just use a recursive function for walking the to-be-inserted tree and keep a hold on the corresponding place in the old tree.
If the parts match:
    If the parts are both leafs (html thingies):
        Insert (overwrite) the value.
    If the parts are both nodes (slides):
        Call yourself with the subslides (here's the recursion).
I know this is just a hint, just a sketch of how to do it. But maybe you want to start from this. In Python it could look something like this (also not completely fleshed out):
def merge_slide(slide, old_slide):
    for sub_slide in slide.slides:
        sub_slide_position_in_old_slide = find_sub_slide_position_by_path(old_slide, sub_slide.path)
        if sub_slide_position_in_old_slide >= 0:  # we found a match!
            sub_slide_in_old_slide = old_slide.slides[sub_slide_position_in_old_slide]
            if sub_slide.slides:  # this is a node!
                merge_slide(sub_slide, sub_slide_in_old_slide)  # here we recurse
            else:  # this is a leaf! so we replace it:
                old_slide.slides[sub_slide_position_in_old_slide] = sub_slide
        else:  # nothing like this in old_slide
            pass  # ignore (you might want to consider this case!)
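The sketch above leans on a find_sub_slide_position_by_path helper that isn't defined anywhere; a minimal hypothetical version could look like this:

# Hypothetical helper: return the index of the sub-slide in old_slide.slides
# whose path matches, or -1 if there is no match.
def find_sub_slide_position_by_path(old_slide, path):
    for position, candidate in enumerate(old_slide.slides):
        if candidate.path == path:
            return position
    return -1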
Maybe that gives you an idea on how I would approach this.
