I have been thinking about this problem for a while, and I was hoping someone here could suggest a way to considerably speed up this analysis in Python.
I basically have two files. File (1) contains coordinates composed of a letter, a start and an end, e.g. "a 1000 1100", and file (2) contains a dataset in which each datapoint is composed of a letter and a coordinate, e.g. "p 1350". What I am trying to do with the script is count how many datapoints fall within the borders of each coordinate range, but only if the letter of the datapoint from file (2) and the letter of the coordinate from file (1) are equal. In real-life datasets, file (1) contains > 50K coordinates and file (2) > 50 million datapoints. Increasing the number of datapoints dramatically increases the time my script requires to run, so I wonder if someone could come up with a more time-efficient way.
Thanks!
My script starts at # script strategy, but I first simulate a minimal example dataset:
import numpy as np
import random
import string
# simulate data
c_size = 10
d_size = 1000000
# letters
letters = list(string.ascii_lowercase)
# coordinates
c1 = np.random.randint(low=100000, high=2000000, size=c_size)
c2 = np.random.randint(low=100, high=1000, size=c_size)
# data
data = np.random.randint(low=100000, high=2000000, size=d_size)
# script strategy
# create coordinates and count dict
c_dict = {}
count_dict = {}
for start,end in zip(c1,c2):
    end = start + end
    c_l = random.choice(letters)
    ID = c_l + '_' + str(start) + '_' + str(end)
    count_dict[ID] = 0
    if c_l not in c_dict:
        c_dict[c_l] = [[start,end]]
    else:
        c_dict[c_l].append([start,end])
# count how many datapoints (x) are within the borders of the coordinates
for i in range(d_size):
    d_l = random.choice(letters)
    x = data[i]
    if d_l in c_dict:
        # increasing speed by only comparing data and coordinates with identical letter identifier
        for coordinates in c_dict[d_l]:
            start = coordinates[0]
            end = coordinates[1]
            ID = d_l + '_' + str(start) + '_' + str(end)
            if x >= start and x <= end:
                count_dict[ID] += 1
# print output
for ID in count_dict:
    count = count_dict[ID]
    print(ID + '\t' + str(count))
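One possible way to avoid comparing every datapoint with every coordinate range is to sort the datapoints of each letter once and then locate the start and end of each range by binary search. The sketch below is not from the original post: `count_points_in_intervals` and the tuple layouts of its two arguments are assumptions made for illustration, and it uses the standard-library `bisect` module (NumPy's `np.searchsorted` does the same job on arrays).

```python
from bisect import bisect_left, bisect_right

def count_points_in_intervals(intervals, points):
    """Count, per interval, the points with the same letter inside [start, end].

    intervals: iterable of (letter, start, end) tuples  (assumed layout)
    points:    iterable of (letter, position) tuples    (assumed layout)
    """
    # Group the positions by letter, then sort each group once.
    by_letter = {}
    for letter, pos in points:
        by_letter.setdefault(letter, []).append(pos)
    for positions in by_letter.values():
        positions.sort()

    counts = {}
    for letter, start, end in intervals:
        key = f"{letter}_{start}_{end}"
        positions = by_letter.get(letter, [])
        # Index of the first position >= start and of the first position > end.
        lo = bisect_left(positions, start)
        hi = bisect_right(positions, end)
        counts[key] = hi - lo
    return counts

example = count_points_in_intervals(
    [("a", 1000, 1100), ("p", 1300, 1400), ("z", 1, 2)],
    [("a", 1000), ("a", 1100), ("a", 1101), ("p", 1350), ("a", 1350)],
)
print(example)
```

With this layout each range costs two binary searches instead of a pass over all datapoints, so the work after sorting grows with the number of ranges rather than with ranges times datapoints.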
I have this password generator, which computes combinations with a length of 2 to 6 characters from a list containing lowercase letters, capital letters and digits (without 0), 61 characters in total.
All I need is to show the percentage (in steps of 5) of the combinations already created. I tried to compute the number of all combinations of the selected length, derive a boundary value from that number (the 5 % step values), count each combination written to the text file, and, when the count of combinations meets the boundary value, print "xxx % completed", but this code doesn't seem to work.
Do you know how to easily show the percentage, please?
Sorry for my English, I'm not a native speaker.
Thank you all!
def pw_gen(characters, length):
    """generate all characters combinations with selected length and export them to a text file"""
    # counting number of combinations according to a formula in documentation
    k = length
    n = len(characters) + k - 1
    comb_numb = math.factorial(n)/(math.factorial(n-length)*math.factorial(length))
    x = 0
    # first value
    percent = 5
    # step of percent done to display
    step = 5
    # 'step' % of combinations
    boundary_value = comb_numb/(100/step)
    try:
        # output text file
        with open("password_combinations.txt", "a+") as f:
            for p in itertools.product(characters, repeat=length):
                combination = ''.join(p)
                # write each combination and create a new line
                f.write(combination + '\n')
                x += 1
                if boundary_value <= x <= comb_numb:
                    print("{} % complete".format(percent))
                    percent += step
                    boundary_value += comb_numb/(100/step)
                elif x > comb_numb:
                    break
First of all, I think you are using an incorrect formula for the number of combinations: itertools.product creates variations with repetition, so the correct formula is n^k (n to the power of k).
Also, you overcomplicated the percentage calculation a little bit. I just modified your code to work as expected.
import math
import itertools
def pw_gen(characters, length):
    """generate all characters combinations with selected length and export them to a text file"""
    k = length
    n = len(characters)
    comb_numb = n ** k
    x = 0
    next_percent = 5
    percent_step = 5
    with open("password_combinations.txt", "a+") as f:
        for p in itertools.product(characters, repeat=length):
            combination = ''.join(p)
            # write each combination and create a new line
            f.write(combination + '\n')
            x += 1
            percent = 100.0 * x / comb_numb
            if percent >= next_percent:
                print(f"{next_percent} % complete")
                # use <= so a boundary that is hit exactly is not printed twice
                while next_percent <= percent:
                    next_percent += percent_step
The tricky part is the while loop, which makes sure that everything works fine for very small sets (where a single combination accounts for more than one percentage step of the results).
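To see why n^k is the right count, the two formulas can be compared against what itertools.product actually produces. This is only a small check with a made-up three-character alphabet; the variable names are chosen here for illustration.

```python
import itertools
import math

characters = "abc"   # small hypothetical alphabet so the counts are easy to check
length = 2

# itertools.product yields every ordered tuple, repeats allowed: n ** k of them.
produced = sum(1 for _ in itertools.product(characters, repeat=length))
n_pow_k = len(characters) ** length

# The question's formula counts unordered selections with repetition instead.
n = len(characters) + length - 1
question_formula = math.factorial(n) // (math.factorial(n - length) * math.factorial(length))

print(produced, n_pow_k, question_formula)
```

Because the question's formula gives a smaller total than the number of lines actually written, the counter x passes comb_numb long before the loop finishes, which is why the percentages came out wrong.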
Removed try:, since you are not handling any errors with except.
Also removed elif:, since this condition is never met anyway.
Besides, your formula for comb_numb is not the right one, since you're generating combinations with repetition. With those changes, your code is good.
import math, itertools, string
def pw_gen(characters, length):
    """generate all characters combinations with selected length and export them to a text file"""
    # counting number of combinations: n ** k for products with repetition
    comb_numb = len(characters) ** length
    x = 0
    # first value
    percent = 5
    # step of percent done to display
    step = 5
    # 'step' % of combinations
    boundary_value = comb_numb/(100/step)
    # output text file
    with open("password_combinations.txt", "a+") as f:
        for p in itertools.product(characters, repeat=length):
            combination = ''.join(p)
            # write each combination and create a new line
            f.write(combination + '\n')
            x += 1
            if boundary_value <= x:
                print("{} % complete".format(percent))
                percent += step
                boundary_value += comb_numb/(100/step)
pw_gen(string.ascii_letters, 4)
For this problem on SPOJ, http://www.spoj.com/problems/ACPC10D/, I wrote the Python solution below:
count = 1
while True:
    no_rows = int(raw_input())
    if no_rows == 0:
        break
    grid = [[None for x in range(3)] for y in range(2)]
    input_arr = map(int, raw_input().split())
    grid[0][0] = 10000000
    grid[0][1] = input_arr[1]
    grid[0][2] = input_arr[1] + input_arr[2]
    r = 1
    for i in range(0, no_rows-1):
        input_arr = map(int, raw_input().split())
        _r = r ^ 1
        grid[r][0] = input_arr[0] + min(grid[_r][0], grid[_r][1])
        grid[r][1] = input_arr[1] + min(min(grid[_r][0], grid[r][0]), min(grid[_r][1], grid[_r][2]))
        grid[r][2] = input_arr[2] + min(min(grid[_r][1], grid[r][1]), grid[_r][2])
        r = _r
    print str(count) + ". " + str(grid[(no_rows -1) & 1][1])
    count += 1
The above code exceeds the time limit. However, when I change the line
grid[r][2] = input_arr[2] + min(min(grid[_r][1], grid[r][1]), grid[_r][2])
to
grid[r][2] = input_arr[2] + min(min(grid[_r][1], grid[_r][2]), grid[r][1])
the solution is accepted. Note the difference: the first line takes the minimum of grid[_r][1] and grid[r][1] first (i.e. the row indices differ), while the second line takes the minimum of grid[_r][1] and grid[_r][2] first (i.e. the row indices are the same).
This behaviour is consistent. I want to understand how Python processes those two lines, such that one results in exceeding the time limit while the other is fine.
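For reference, the recurrence that the posted code implements can be written as a small Python 3 function, which is handy for checking that both variants of the line compute the same value. This is only a sketch: `min_path_cost` and its list-of-rows argument are names introduced here, not part of the original submission, and it does not explain the timing difference asked about.

```python
def min_path_cost(rows):
    """Cheapest path from the top-middle to the bottom-middle vertex.

    rows: list of [left, middle, right] costs (assumed input layout).
    """
    INF = float("inf")
    # The walk starts at the top-middle vertex, so top-left is unreachable.
    prev = [INF, rows[0][1], rows[0][1] + rows[0][2]]
    for left, mid, right in rows[1:]:
        cur_left = left + min(prev[0], prev[1])
        cur_mid = mid + min(cur_left, prev[0], prev[1], prev[2])
        cur_right = right + min(cur_mid, prev[1], prev[2])
        prev = [cur_left, cur_mid, cur_right]
    return prev[1]

print(min_path_cost([[13, 7, 5], [7, 13, 6], [14, 3, 12], [15, 6, 16]]))  # 22
```

Since min is commutative and associative, the order of the arguments in the disputed line cannot change the result, only (possibly) the running time.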
I'm trying to create bins with the count of prices, to be used for a histogram.
I want the bins to be 0-1000, 1000-2000, 2000-3000 and so forth. If I just do a group by, I get way too many different bins.
The code I've written seems to end in an infinite loop (or at least the script is still running after an hour). I'm not sure how to do it correctly. Here is the code I wrote:
from itertools import zip_longest
from django.db.models import Count
def price_histogram(area_id, agency_id):
    # Get prices and total count for competitors
    query = HousePrice.objects.filter(area_id=area_id, cur_price__range=(1000,30000)).exclude(agency_id=agency_id)
    count = query.values('cur_price').annotate(count=Count('cur_price')).order_by('cur_price')
    total = query.count()
    # Get prices and total count for selected agency
    query_agency = HousePrice.objects.filter(area_id=area_id, agency_id=agency_id, cur_price__range=(1000,30000))
    count_agency = query_agency.values('cur_price').annotate(count=Count('cur_price')).order_by('cur_price')
    total_agency = query_agency.count()
    # Make list for x and y values
    x_comp = []
    y_comp = []
    x_agency = []
    y_agency = []
    bin_start = 0
    bin_end = 1000
    _count_comp = 0
    _count_agency = 0
    for row_comp, row_agency in zip_longest(count, count_agency, fillvalue={}):
        while bin_start < int(row_comp['cur_price']) < bin_end:
            _count_comp += row_comp['count']
            _count_agency += row_agency.get('count', 0)
        bin_start += 1000
        bin_end += 1000
        x_comp.append(str(bin_start) + "-" + str(bin_end) + " USD")
        x_agency.append(str(bin_start) + "-" + str(bin_end) + " USD")
        y_comp.append(_count_comp/total)
        y_agency.append(_count_agency/total_agency)
    return {'x_comp': x_comp, 'y_comp': y_comp, 'x_agency': x_agency, 'y_agency': y_agency}
I'm using Python 3.5 and Django 1.10.
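One way to get fixed-width bins without a nested loop is to map each price to the lower edge of its bin with integer division and count per edge. The sketch below does this in plain Python on a list of prices; `bin_counts` and its parameters are names made up for illustration, and in the Django setting the prices could come from something like query.values_list('cur_price', flat=True).

```python
from collections import Counter

def bin_counts(prices, width=1000):
    """Return {"1000-2000": n, "2000-3000": m, ...} for the bins that occur."""
    # price // width * width is the lower edge of the bin holding the price.
    edges = Counter(int(p) // width * width for p in prices)
    return {f"{lo}-{lo + width}": edges[lo] for lo in sorted(edges)}

print(bin_counts([1200, 1999, 2000, 2500, 29000]))
```

A price that sits exactly on an edge (2000 here) lands in the bin that starts at that edge. Bins with no prices are simply absent from the result, so they would need to be filled in with zeros before plotting.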
I'm a little late, but maybe the django-pivot library does what you want.
from django_pivot.histogram import histogram
query = HousePrice.objects.filter(area_id=area_id, cur_price__range=(1000,30000)).exclude(agency_id=agency_id)
hist = histogram(query, 'cur_price', bins=list(range(1000, 30000, 1000)))
I am trying to learn how to use for loops and where strings and ints differ.
I created a function that takes a dimension and a TV size, for example:
def TVDisplay(dimension, TVsize):
    final = "<==="
    for i in range(TVsize-2):
        final = final + "=="
    final = final + "==>\n"
    for corner in dimension:
        final = final + "< "
        for edge in corner:
            final = final + edge + " "
        final = final + ">\n"
    final = final + "<==="
    for i in range(TVsize-2):
        final = final + "=="
    final = final + "==>\n"
    return final
This function returns
<=====>
< 0 0 >
< 0 0 >
<=====>
This is based on a dimension of [['0','0'],['0','0']] and a TVsize of 2.
Now I am trying to use while loops to produce the same output, but I am running into problems with strings and ints.
My function looks like this:
def TVDisplay(dimension, TVsize):
    final="<==="
    i=0
    while i < TVsize-2:
        final = final + "=="
        ctr+=1
    final = final + "==>\n"
    corner=0
    while corner < dimension:
        edge = 0
        final = final + "< "
        while edge < corner:
            final = final + edge + " "
            edge+=1
        final = final + ">\n"
        corner+=1
    final = final + "<==="
    while i < TVsize-2:
        final = final + "=="
        i+=1
    final = final + "==>\n"
    return final
This function returns this:
<=====>
<>
< 0 >
<=====>
I think it has to do with the middle part of my code, which mixes up strs and ints.
Does anyone have any advice on how to fix this problem?
Thank you!!
EDIT:
corner=1
while corner < dimension:
    final = final + "< "
    edge = 0
    while edge < corner:
        final = final + edge + " "
        edge+=1
    final = final + ">\n"
    corner+=1
At the line
final = final + edge + " "
the error "cannot concatenate 'str' and 'int' objects" appears.
The purpose of the middle loop is to produce the middle part of the display:
< 0 0 >
< 0 0 >
The last loop closes it off.
So that's my issue.
dimension is a list of lists, right?
So when you write:
for corner in dimension:
    for edge in corner:
the loop variables refer to the actual objects. So corner is a list and edge is an object in that list.
It would be the same as saying:
while corner < len(dimension):
    edge = 0
    while edge < len(dimension[corner]):
        final += dimension[corner][edge] + " "
The difference is that when you say:
for x in y
you are actually referring to the object x which is in y. However, when you say:
while x < y:
    x+=1
x is only an integer (in your case it is the index of the object). To access the actual object you must use y[x]. A for ... in loop hands you the actual objects, whereas with a while loop you create a counter that keeps track of an index but not the actual object.
corner_index = 0
while corner_index < len(dimension):
    edge_index = 0
    corner = dimension[corner_index] #retrieve the list from dimension
    final = final + "< "
    while edge_index < len(corner):
        edge = corner[edge_index] #get the edge
        final += edge + " "
        edge_index+=1
    final = final + ">\n"
    corner_index+=1
To be even more succinct:
corner_index = 0
while corner_index < len(dimension):
    edge_index = 0
    final = final + "< "
    while edge_index < len(dimension[corner_index]):
        final += dimension[corner_index][edge_index] + " "
        edge_index+=1
    final = final + ">\n"
    corner_index+=1
As to your edit:
The way you are accessing edge (as an index integer) means you must first typecast to a string. So:
final += str(edge) + " "
You didn't have this issue initially because 'edge' referred to the actual string object '0' in your dimensions list. However, when you use while loops, 'edge' is an integer that you are using as a counter.
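Putting the pieces together, a complete while-loop version could look like the sketch below. The name `tv_display_while` and the reuse of one border string for top and bottom are choices made here for illustration; the original function builds the two borders separately.

```python
def tv_display_while(dimension, tv_size):
    """Build the framed display using only while loops.

    dimension: list of lists of strings, e.g. [['0', '0'], ['0', '0']]
    tv_size:   integer width parameter, as in the question
    """
    # Top and bottom borders are identical, so build the border once.
    border = "<==="
    i = 0
    while i < tv_size - 2:
        border += "=="
        i += 1
    border += "==>\n"

    final = border
    corner_index = 0
    while corner_index < len(dimension):
        corner = dimension[corner_index]   # the inner list, not an index
        final += "< "
        edge_index = 0
        while edge_index < len(corner):
            final += corner[edge_index] + " "   # a string from the list
            edge_index += 1
        final += ">\n"
        corner_index += 1
    return final + border

print(tv_display_while([['0', '0'], ['0', '0']], 2))
```

Only values taken out of dimension are concatenated, so no str() conversion is needed, and each counter is reset before the loop that uses it.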
I am writing a simple command-line program in Python 3.3 which reads a text file of xyz-coordinates and outputs the equivalent triangle faces in between. The export format is Wavefront obj (https://en.wikipedia.org/wiki/Wavefront_.obj_file). The algorithm is solely intended to work with regularly spaced points from high-resolution satellite scans of the earth. Currently I am using a set of about 340000 points and creating 2 triangles within each vertex quadruple. The outer iteration goes in the x-direction while the inner iteration goes in the y-direction. So pairs of triangle faces are created for every vertex in the y-direction until the loop moves on in the x-direction and repeats the process. Here is the basic pattern (the lines are the face edges):
v1--v5--v9
| \ | / |
v2--v6--v10
| / | \ |
v3--v7--v11
| \ | / |
v4--v8--v12
The code seems to work, as importing the file into Blender or MeshLab gives reasonable results, except for one thing: the stripes of face pairs do not seem to be connected with their neighbors along the x-axis. A rendered picture demonstrates the problem (image: unconnected stripes).
Normally, there shouldn't be a vertical offset between different face stripes because they share the same vertices along their interior border. Tests with fewer vertices and smaller coordinate values succeeded; the method worked perfectly fine. Maybe the problem lies not within my mesh generator but within the coordinate limitations of Blender, MeshLab, etc.
Here is the function which generates the faces and stitches everything together in a return string:
def simpleTriangMesh(verts):
    printAll("--creating simple triangulated mesh", "\n")
    maxCoords = [max(verts[0]), max(verts[1]), max(verts[2])]
    minCoords = [min(verts[0]), min(verts[1]), min(verts[2])]
    printAll("max. coordinates (xyz): \n", maxCoords, "\n")
    printAll("min. coordinates (xyz): \n", minCoords, "\n")
    xVerts = 0 # amount of vertices in x-direction
    yVerts = 0 # amount of vertices in y-direction
    faceAmount = 0 # amount of required faces to skin grid
    i = 0
    temp = verts[0][0]
    while(i < len(verts[0])):
        if(temp < verts[0][i]):
            yVerts = int(i)
            break
        temp = verts[0][i]
        i += 1
    xVerts = int(len(verts[0]) / float(yVerts))
    faceAmount = ((xVerts - 1) * (yVerts - 1)) * 2
    printAll("vertices in x direction: ", xVerts, "\n")
    printAll("vertices in y direction: ", yVerts, "\n")
    printAll("estimated amount of triangle faces: ",
             faceAmount, "\n")
    printAll("----generating vertex triangles representing the faces", "\n")
    # list of vertex-index quadrupels representing the faces
    faceList = [[0 for line in range(0, 3)] for face in range(0, int(faceAmount))]
    f = 0
    v = 0
    # rather to draw hypotenuse of the triangles from topleft to bottomright
    # or perpendicular to that (topright to bottomleft)
    tl = True # the one that changes in y-direction
    tl_rem = False # to remember the hypotenuse direction of the last topmost faces
    while(f < len(faceList)):
        # prevent creation of faces at the bottom line
        # + guarantees that v = 1 when creating the first face
        if(( v % yVerts ) == 0):
            v += 1
            tl = not tl_rem
            tl_rem = tl
        if(tl):
            faceList[f][0] = v
            faceList[f][1] = v + yVerts
            faceList[f][2] = v + yVerts + 1
            f += 1
            faceList[f][0] = v
            faceList[f][1] = v + yVerts + 1
            faceList[f][2] = v + 1
        else:
            faceList[f][0] = v
            faceList[f][1] = v + yVerts
            faceList[f][2] = v + 1
            f += 1
            faceList[f][0] = v + 1
            faceList[f][1] = v + yVerts
            faceList[f][2] = v + yVerts + 1
        f += 1
        v += 1
        tl = not tl
    printAll("----preparing obj-file-content for export", "\n")
    rectMesh_Obj = "" # string containing the mesh in obj-format (ascii)
    tempVerts = ""
    tempFaces = ""
    row = 0
    while(row < len(verts[0])):
        # temp = ("v" + " " + str(verts[0][row]) + " " + str(verts[1][row])
        #         + " " + str(verts[2][row]) + "\n")
        temp = ("v" + " " + str(verts[0][row]) + " " + str(verts[2][row])
                + " " + str(verts[1][row]) + "\n")
        tempVerts += temp
        row += 1
    row = 0
    while(row < len(faceList)):
        temp = ("f"
                + " " + str(int(faceList[row][0]))
                + " " + str(int(faceList[row][1]))
                + " " + str(int(faceList[row][2]))
                # + " " + str(int(faceList[row][3]))
                + "\n")
        tempFaces += temp
        row += 1
    rectMesh_Obj += tempVerts + tempFaces
    return(rectMesh_Obj)
The verts variable which is passed into the function is a 2-dimensional list, similar to:
# x y z
vertsExample = [[3334, 3333, 3332], [2555, 2554, 2553], [10.2, 5.2, 6.7]]
I hope some of you can help me out of the misery. If something requires more explanation, please let me know and I will add it to the first post.
I finally solved the issue. The problem wasn't in my mesh generator program. Blender and MeshLab (and most likely other 3D programs as well) do some weird things when the coordinates of vertices are too big. If I reduce the real-world, geographically projected coordinates to smaller relative coordinates, everything works just fine (https://dl.dropboxusercontent.com/u/13547611/meshGenWorking001.png).
My guess:
The Wavefront obj format has too limited byte sizes for its numbers. Or, to be more correct: common 3D programs do not expect the numbers to be as big as real-world coordinates (likely because they store vertex coordinates as single-precision floats, which keep only about 7 significant digits), so they interpret what they get in a confusing manner.
I hope this solution helps somebody in the future!
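The answer does not show how the coordinates were reduced. One simple possibility, assumed here, is to subtract the per-axis minimum before writing the obj file, so that every axis starts at 0. The helper name `to_relative` is made up for this sketch; it expects the same [xs, ys, zs] layout as vertsExample in the question.

```python
def to_relative(verts):
    """Shift each axis so that its smallest value becomes 0.

    verts: [xs, ys, zs], three equally long lists of numbers.
    Returns (shifted_verts, offsets); add the offsets back to
    recover the original coordinates.
    """
    offsets = [min(axis) for axis in verts]
    shifted = [[value - off for value in axis]
               for axis, off in zip(verts, offsets)]
    return shifted, offsets

shifted, offsets = to_relative(
    [[3334, 3333, 3332], [2555, 2554, 2553], [10.2, 5.2, 6.7]]
)
print(shifted[0], shifted[1], offsets)
```

Keeping the offsets makes it possible to convert positions measured in the 3D program back to the original projected coordinates.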