I am currently working on debugging some of the DirectWrite code I have written, as I have run into issues when testing with non-English characters, mostly with getting proper glyph indices back for multi-codepoint characters.
EDIT: After further research, I believe the issue is diacritics; the extra character should be combined somehow. The DWRITE_SHAPING_GLYPH_PROPERTIES field isDiacritic does return 1 for the last Unicode codepoint. However, it doesn't seem like the shaping process takes this into account at all: GetGlyphPlacements returns 0 for both the advance and the offset of the diacritic glyph. The LSB is around -5, but that's not enough to offset it to the correct position. Does anyone know where in the shaping process DirectWrite is supposed to take diacritics into account, and how?
Consider this character: œ̃
It is displayed as one character (through most text editors), but two codepoints: U+0153 U+0303
How do I account for this in GetGlyphs(), since they are separate codepoints? In my code, it is returning two different indices (177, 1123), and one cluster (0, 0).
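For reference, inspecting the string with Python's standard unicodedata module shows the two codepoints, and also that there is no precomposed codepoint for this combination, so NFC normalization still leaves two codepoints and the font has to position the combining tilde during shaping:
import unicodedata

text = "œ̃"  # U+0153 followed by U+0303 COMBINING TILDE
for ch in text:
    print(f"U+{ord(ch):04X}", unicodedata.name(ch))

# No precomposed form exists, so NFC leaves the two codepoints as they are.
print(len(unicodedata.normalize("NFC", text)))  # -> 2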
This is what ends up getting rendered:
This is consistent with both codepoints being rendered individually, but not with the actual combined character. The actual glyph count returned by GetGlyphs() is 2.
My questions are as follows:
Should this be returning one index from GetGlyphs()?
Should I even be getting one index, or is there some magic involved with two different indices, where at some stage in the process they are combined in the glyph run?
If I should be getting one index, at what stage/function do these indices get combined? Perhaps a bug in my ScriptAnalysis? I'm trying to narrow down where the issue may be.
Should I be using the length of the characters and not include codepoints?
I apologize as I am not super knowledgeable about fonts/Unicode and the inner workings of the whole shaping process.
Here is some of my code for the process I use to get the indices and advances:
text_length = len(text.encode('utf-16-le')) // 2
text_buffer = create_unicode_buffer(text, text_length)

self._text_analysis.GenerateResults(self._analyzer, text_buffer, len(text_buffer))

# Formula for text buffer size from Microsoft.
max_glyph_size = int(3 * text_length / 2 + 16)

length = text_length
clusters = (UINT16 * length)()
text_props = (DWRITE_SHAPING_TEXT_PROPERTIES * length)()
indices = (UINT16 * max_glyph_size)()
glyph_props = (DWRITE_SHAPING_GLYPH_PROPERTIES * max_glyph_size)()
actual_count = UINT32()

self._analyzer.GetGlyphs(text_buffer,
                         len(text_buffer),
                         self.font.font_face,
                         False,  # sideways
                         False,  # rtl
                         self._text_analysis.script,  # scriptAnalysis
                         None,  # localeName
                         None,  # numberSub
                         None,  # typo features
                         None,  # feature range length
                         0,  # feature range
                         max_glyph_size,  # max glyph size
                         clusters,  # cluster map
                         text_props,  # text props
                         indices,  # glyph indices
                         glyph_props,  # glyph props
                         byref(actual_count))  # glyph count

advances = (FLOAT * length)()
offsets = (DWRITE_GLYPH_OFFSET * length)()

self._analyzer.GetGlyphPlacements(text_buffer,
                                  clusters,
                                  text_props,
                                  text_length,
                                  indices,
                                  glyph_props,
                                  actual_count,
                                  self.font.font_face,
                                  self.font.font_metrics.designUnitsPerEm,
                                  False, False,
                                  self._text_analysis.script,
                                  self.font.locale,
                                  None,
                                  None,
                                  0,
                                  advances,
                                  offsets)
EDIT: Here is rendering code:
def render_single_glyph(self, font_face, indice, advance, offset, metrics):
    """Renders a single glyph using D2D DrawGlyphRun"""
    glyph_width, glyph_height, lsb, font_advance = metrics

    # Slicing an array turns it into a python object. Maybe a better way to keep it a ctypes value?
    new_indice = (UINT16 * 1)(indice)
    new_advance = (FLOAT * 1)(advance)

    run = self._get_single_glyph_run(font_face,
                                     self.font._real_size,
                                     new_indice,  # indice
                                     new_advance,  # advance
                                     pointer(offset),  # offset
                                     False,
                                     False)

    offset_x = 0
    if lsb < 0:
        # Negative LSB: we shift the layout rect to the right
        # Otherwise we will cut the left part of the glyph
        offset_x = math.ceil(abs(lsb))

    font_height = (self.font.font_metrics.ascent + self.font.font_metrics.descent) * self.font.font_scale_ratio

    # Create new bitmap.
    self._create_bitmap(int(math.ceil(glyph_width)),
                        int(math.ceil(font_height)))

    # This offsets the characters if needed.
    point = D2D_POINT_2F(offset_x, int(math.ceil(font_height)))

    self._render_target.BeginDraw()
    self._render_target.Clear(transparent)
    self._render_target.DrawGlyphRun(point,
                                     run,
                                     self.brush,
                                     DWRITE_MEASURING_MODE_NATURAL)
    self._render_target.EndDraw(None, None)

    image = wic_decoder.get_image(self._bitmap)

    glyph = self.font.create_glyph(image)
    glyph.set_bearings(self.font.descent, offset_x, round(advance * self.font.font_scale_ratio))  # baseline, lsb, advance
    return glyph
The shaping process is controlled by your input, which is (text, font, locale, script, user features). All of that affects the results you get. To answer your questions specifically:
Should this be returning one index from GetGlyphs()?
That's mostly defined by your font.
Should I even be getting one index, or is there some magic involved
with two different indices, where at some stage in the process they
are combined in the glyph run?
GetGlyphs() operates on a single run. Glyphs are free to form a cluster according to shaping rules defined per script, and according to transformations defined in the font.
If I should be getting one index, at what stage/function do these
indices get combined? Perhaps a bug in my ScriptAnalysis? I'm trying to
narrow down where the issue may be.
Basically, if your input arguments are correct, you get what you get as output; you can't really control the core of it. What you can do is test the output for the same text and font on Uniscribe, on CoreText (macOS), and on Chromium/Firefox (HarfBuzz) to see if they differ.
Should I be using the length of the characters and not include
codepoints?
I didn't get this one.
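As suggested above, one practical way to narrow this down is to shape the same text and font with HarfBuzz and compare the glyph IDs, clusters, advances, and offsets with what GetGlyphs()/GetGlyphPlacements() give you. A minimal sketch, assuming the uharfbuzz package is installed and that the font path below is replaced with the font you pass to DirectWrite:
import uharfbuzz as hb

# Placeholder path: use the same font file you load in DirectWrite.
with open("somefont.ttf", "rb") as f:
    face = hb.Face(hb.Blob(f.read()))
font = hb.Font(face)

buf = hb.Buffer()
buf.add_str("œ̃")  # U+0153 U+0303
buf.guess_segment_properties()
hb.shape(font, buf, {})

for info, pos in zip(buf.glyph_infos, buf.glyph_positions):
    # For a combining mark, a well-behaved font typically returns a zero
    # advance plus a negative x_offset that pulls the tilde back over the
    # base glyph; if both shapers agree, the font (not your code) decides.
    print(info.codepoint, info.cluster, pos.x_advance, pos.x_offset, pos.y_offset)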
So we can generate a unique id with str(uuid.uuid4()), which is 36 characters long.
Is there another method to generate a unique ID which is shorter in terms of characters?
EDIT:
If ID is usable as primary key then even better
Granularity should be better than 1ms
This code could be distributed, so we can't assume time independence.
If this is for use as a primary key field in a database, consider just using an auto-incrementing integer instead.
str(uuid.uuid4()) is 36 chars but it has four useless dashes (-) in it, and it's limited to 0-9 a-f.
Better uuid4 in 32 chars:
>>> uuid.uuid4().hex
'b327fc1b6a2343e48af311343fc3f5a8'
Or just b64 encode and slice some urandom bytes (up to you to guarantee uniqueness):
>>> base64.b64encode(os.urandom(32))[:8]
b'iR4hZqs9'
TLDR
Most of the time it's better to work with numbers internally and encode them to short IDs externally. So here's a function for Python 3, PowerShell, and VBA that will convert an int32 to an alphanumeric ID. Use it like this:
int32_to_id(225204568)
'F2AXP8'
For distributed code use ULIDs: https://github.com/mdipierro/ulid
They are much longer but unique across different machines.
How short are the IDs?
It will encode about half a billion IDs in 6 characters so it's as compact as possible while still using only non-ambiguous digits and letters.
How can I get even shorter IDs?
If you want even more compact IDs/codes/serial numbers, you can easily expand the character set by just changing the chars="..." definition. For example, if you allow all lower- and upper-case letters you can have 56 billion IDs within the same 6 characters. Adding a few symbols (like ~!@#$%^&*()_+-=) gives you 208 billion IDs.
So why didn't you go for the shortest possible IDs?
The character set I'm using in my code has an advantage: It generates IDs that are easy to copy-paste (no symbols so double clicking selects the whole ID), easy to read without mistakes (no look-alike characters like 2 and Z) and rather easy to communicate verbally (only upper case letters). Sticking to numeric digits only is your best option for verbal communication but they are not compact.
I'm convinced: show me the code
Python 3
def int32_to_id(n):
    if n == 0: return "0"
    chars = "0123456789ACEFHJKLMNPRTUVWXY"
    length = len(chars)
    result = ""
    remain = n
    while remain > 0:
        pos = remain % length
        remain = remain // length
        result = chars[pos] + result
    return result
PowerShell
function int32_to_id($n){
    $chars = "0123456789ACEFHJKLMNPRTUVWXY"
    $length = $chars.length
    $result = ""; $remain = [int]$n
    do {
        $pos = $remain % $length
        $remain = [int][Math]::Floor($remain / $length)
        $result = $chars[$pos] + $result
    } while ($remain -gt 0)
    $result
}
VBA
Function int32_to_id(n)
    Dim chars$, length, result$, remain, pos
    If n = 0 Then int32_to_id = "0": Exit Function
    chars$ = "0123456789ACEFHJKLMNPRTUVWXY"
    length = Len(chars$)
    result$ = ""
    remain = n
    Do While (remain > 0)
        pos = remain Mod length
        remain = Int(remain / length)
        result$ = Mid(chars$, pos + 1, 1) + result$
    Loop
    int32_to_id = result
End Function

Function id_to_int32(id$)
    Dim chars$, length, result, remain, pos, value, power
    chars$ = "0123456789ACEFHJKLMNPRTUVWXY"
    length = Len(chars$)
    result = 0
    power = 1
    For pos = Len(id$) To 1 Step -1
        result = result + (InStr(chars$, Mid(id$, pos, 1)) - 1) * power
        power = power * length
    Next
    id_to_int32 = result
End Function

Public Sub test_id_to_int32()
    Dim i
    For i = 0 To 28 ^ 3
        If id_to_int32(int32_to_id(i)) <> i Then Debug.Print "Error, i=", i, "int32_to_id(i)", int32_to_id(i), "id_to_int32('" & int32_to_id(i) & "')", id_to_int32(int32_to_id(i))
    Next
    Debug.Print "Done testing"
End Sub
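The VBA section above includes an id_to_int32 decoder; for completeness, here is a sketch of the same inverse in Python, using the same 28-character alphabet:
def id_to_int32(s):
    chars = "0123456789ACEFHJKLMNPRTUVWXY"
    result = 0
    for ch in s:
        # each character is one more base-28 digit
        result = result * len(chars) + chars.index(ch)
    return result

assert id_to_int32(int32_to_id(225204568)) == 225204568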
Yes. Just use the current UTC millis. This number never repeats, as long as you never need more than one ID per millisecond (see the edit below).
const uniqueID = new Date().getTime();
EDIT
If you have the (rather rare) requirement to produce more than one ID within the same millisecond, this method is of no use, as this number's granularity is 1 ms.
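If you do occasionally need more than one ID per millisecond, one common workaround is to combine a higher-resolution timestamp with a process-local counter. A sketch in Python (note this is still not unique across machines without adding a machine or process identifier):
import itertools
import time

_counter = itertools.count()

def unique_id():
    # nanosecond timestamp plus a process-local sequence number
    return f"{time.time_ns():x}-{next(_counter):x}"

print(unique_id())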
This question may be a little specialist, but hopefully someone can help. I normally use IDL, but for developing a pipeline I'm looking to use Python to improve running times.
My fits file handling setup is as follows:
import numpy as numpy
from astropy.io import fits
#Directory: /Users/UCL_Astronomy/Documents/UCL/PHASG199/M33_UVOT_sum/UVOTIMSUM/M33_sum_epoch1_um2_norm.img
with fits.open('...') as ima_norm_um2:
    #Open UVOTIMSUM file once and close it after extracting the relevant values:
    ima_norm_um2_hdr = ima_norm_um2[0].header
    ima_norm_um2_data = ima_norm_um2[0].data
    #Individual dimensions for number of x pixels and number of y pixels:
    nxpix_um2_ext1 = ima_norm_um2_hdr['NAXIS1']
    nypix_um2_ext1 = ima_norm_um2_hdr['NAXIS2']
    #Compute the size of the images (you can also do this manually rather than calling these keywords from the header):
    #Call the header and data from the UVOTIMSUM file with the relevant keyword extensions:
    corrfact_um2_ext1 = numpy.zeros((ima_norm_um2_hdr['NAXIS2'], ima_norm_um2_hdr['NAXIS1']))
    coincorr_um2_ext1 = numpy.zeros((ima_norm_um2_hdr['NAXIS2'], ima_norm_um2_hdr['NAXIS1']))
    #Check that the dimensions are all the same:
    print(corrfact_um2_ext1.shape)
    print(coincorr_um2_ext1.shape)
    print(ima_norm_um2_data.shape)
    # Make a new image file to save the correction factors:
    hdu_corrfact = fits.PrimaryHDU(corrfact_um2_ext1, header=ima_norm_um2_hdr)
    fits.HDUList([hdu_corrfact]).writeto('.../M33_sum_epoch1_um2_corrfact.img')
    # Make a new image file to save the corrected image to:
    hdu_coincorr = fits.PrimaryHDU(coincorr_um2_ext1, header=ima_norm_um2_hdr)
    fits.HDUList([hdu_coincorr]).writeto('.../M33_sum_epoch1_um2_coincorr.img')
I'm looking to then apply the following corrections:
# Define the variables from Poole et al. (2008) "Photometric calibration of the Swift ultraviolet/optical telescope":
alpha = 0.9842000
ft = 0.0110329
a1 = 0.0658568
a2 = -0.0907142
a3 = 0.0285951
a4 = 0.0308063
for i in range(nxpix_um2_ext1 - 1): #do begin
    for j in range(nypix_um2_ext1 - 1): #do begin
        if (numpy.less_equal(i, 4) | numpy.greater_equal(i, nxpix_um2_ext1-4) | numpy.less_equal(j, 4) | numpy.greater_equal(j, nxpix_um2_ext1-4)): #then begin
            #UVM2
            corrfact_um2_ext1[i,j] == 0
            coincorr_um2_ext1[i,j] == 0
        else:
            xpixmin = i-4
            xpixmax = i+4
            ypixmin = j-4
            ypixmax = j+4
            #UVM2
            ima_UVM2sum = total(ima_norm_um2[xpixmin:xpixmax,ypixmin:ypixmax])
            xvec_UVM2 = ft*ima_UVM2sum
            fxvec_UVM2 = 1 + (a1*xvec_UVM2) + (a2*xvec_UVM2*xvec_UVM2) + (a3*xvec_UVM2*xvec_UVM2*xvec_UVM2) + (a4*xvec_UVM2*xvec_UVM2*xvec_UVM2*xvec_UVM2)
            Ctheory_UVM2 = - alog(1-(alpha*ima_UVM2sum*ft))/(alpha*ft)
            corrfact_um2_ext1[i,j] = Ctheory_UVM2*(fxvec_UVM2/ima_UVM2sum)
            coincorr_um2_ext1[i,j] = corrfact_um2_ext1[i,j]*ima_sk_um2[i,j]
The above snippet is where it is messing up, as I have a mixture of IDL syntax and Python syntax. I'm just not sure how to convert certain aspects of IDL to Python. For example, I'm not quite sure how to handle the ima_UVM2sum = total(ima_norm_um2[xpixmin:xpixmax,ypixmin:ypixmax]) line.
I'm also missing the part where it would update the correction factor and coincidence correction image files. If anyone has the patience to go over it with a fine-tooth comb and suggest the necessary changes, that would be excellent.
The original normalised image can be downloaded here: Replace ... in above code with this file
One very important thing about numpy is that it applies every mathematical or comparison operation element-wise. So you probably don't need to loop through the arrays.
So maybe start where you convolve your image with a sum-filter. This can be done for 2D images by astropy.convolution.convolve or scipy.ndimage.filters.uniform_filter
I'm not sure exactly what you want, but I think you want a 9x9 sum filter. Note that uniform_filter computes the local mean, so multiply by 9*9 to turn it into a sum:
from scipy.ndimage.filters import uniform_filter
ima_UVM2sum = uniform_filter(ima_norm_um2_data, size=9) * 81
Since you want to discard any pixels that are at the borders (4 pixels), you can simply slice them away:
ima_UVM2sum_valid = ima_UVM2sum[4:-4,4:-4]
This ignores the first and last 4 rows and the first and last 4 columns (the "last" part is achieved by making the stop value negative).
Now you want to calculate the corrections:
xvec_UVM2 = ft*ima_UVM2sum_valid
fxvec_UVM2 = 1 + (a1*xvec_UVM2) + (a2*xvec_UVM2**2) + (a3*xvec_UVM2**3) + (a4*xvec_UVM2**4)
Ctheory_UVM2 = - np.log(1-(alpha*ima_UVM2sum_valid*ft))/(alpha*ft)
these are all arrays so you still do not need to loop.
But then you want to fill your two images. Be careful, because the correction array is smaller (we ignored the first and last rows/columns), so you have to fill the same region in the correction images:
corrfact_um2_ext1[4:-4,4:-4] = Ctheory_UVM2*(fxvec_UVM2/ima_UVM2sum_valid)
coincorr_um2_ext1[4:-4,4:-4] = corrfact_um2_ext1[4:-4,4:-4] * ima_sk_um2[4:-4,4:-4]
Still no loop, just numpy's mathematical functions. This means it is much faster (MUCH FASTER!) and does the same thing.
Maybe I have forgotten some slicing, and that would yield a "not broadcastable" error; if so, please report back.
Just a note about your loop: Python's first axis is the second axis in FITS and the second axis is the first FITS axis. So if you need to loop over the axis bear that in mind so you don't end up with IndexErrors or unexpected results.
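Putting the pieces together, a rough end-to-end sketch of the vectorized version could look like the following. The file paths are placeholders and the ima_sk_um2 sky image is an assumption based on your description (here it just reuses the normalised data), so adjust both to your actual files:
import numpy as np
from astropy.io import fits
from scipy.ndimage import uniform_filter

alpha, ft = 0.9842000, 0.0110329
a1, a2, a3, a4 = 0.0658568, -0.0907142, 0.0285951, 0.0308063

with fits.open('M33_sum_epoch1_um2_norm.img') as hdul:  # placeholder path
    hdr = hdul[0].header
    data = hdul[0].data.astype(float)

# 9x9 sliding sum (uniform_filter gives the local mean, so multiply by 81)
ima_UVM2sum = uniform_filter(data, size=9) * 81
valid = ima_UVM2sum[4:-4, 4:-4]

xvec = ft * valid
fxvec = 1 + a1*xvec + a2*xvec**2 + a3*xvec**3 + a4*xvec**4
Ctheory = -np.log(1 - alpha*valid*ft) / (alpha*ft)

corrfact = np.zeros_like(data)
coincorr = np.zeros_like(data)
corrfact[4:-4, 4:-4] = Ctheory * (fxvec / valid)

ima_sk_um2 = data  # assumption: replace with the actual sky image
coincorr[4:-4, 4:-4] = corrfact[4:-4, 4:-4] * ima_sk_um2[4:-4, 4:-4]

fits.HDUList([fits.PrimaryHDU(corrfact, header=hdr)]).writeto(
    'M33_sum_epoch1_um2_corrfact.img', overwrite=True)
fits.HDUList([fits.PrimaryHDU(coincorr, header=hdr)]).writeto(
    'M33_sum_epoch1_um2_coincorr.img', overwrite=True)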
I am working with data which I get in text files, and which has to be subsequently analysed. I'm currently using Excel for this task. The original file looks like this:
Contact Angle (deg) 86.20
Wetting Tension (dy/cm) 4.836
Wetting Tension Left (dy/cm) 39.44
Wetting Tension Right (dy/cm) 39.44
Base Tilt Angle (deg) 0.00
Base (mm) 1.6858
Base Area (mm2) 2.2322
Height (mm) 0.7888
Tip Width (mm) 0.9707
Wetted Tip Width (mm) 0.9581
Sessile Volume (ul) 1.1374
Sessile Surface Area (mm2) 4.1869
Contrast (cts) 245
Sharpness (cts) 161
Black Peak (cts) 10
White Peak (cts) 255
Edge Threshold (cts) 111
Base Left X (mm) 4.138
Base Right X (mm) 5.821
Base Y (mm) 2.980
RMS Fit Error (mm) 3.545E-3
#1600
I don't need the majority of this information, and for now, all I need is the Contact Angle at the top, and the time (prefixed by the '#' at the bottom). At the moment, I have a script which extracts the information I need and creates another text file for easy reading. The code used is below:
infile = "in.txt"
outfile = "newout.out"
measure_time = ""
with open(infile) as f, open(outfile, 'w') as f2:
for line in f:
if line.split():
if line.split()[0] == "Contact":
contact_angle = line.split()[-1].strip()
f2.write("Contact Angle (deg): " + contact_angle + '\n')
if line.split()[0][0] == '#':
for i in range(1,5):
measure_time += (line.split()[0][i])
f2.write("Measured at: " + measure_time[:2] + ":" + measure_time[2:] + '\n')
measure_time = ""
else:
continue
What I am looking for is a way to get my data nicely formatted in a spreadsheet for easy analysis. I would like the angles in the same row, in adjoining cells, and the measurement time in the cells below that, but I'm unsure what the best way to go about this is.
Can anyone with some more Python experience help me here?
EDIT: The image here shows what I tried to explain (poorly) above.
EDIT2: The solution posted below by @RonRosenfeld works, but I would still prefer to have a Python solution for this problem, as stated earlier. As I have no previous experience with Excel VBA, I would rather use something familiar to me.
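Since you said you would prefer a pure Python route, here is a rough sketch using only the standard library: it reads any number of your text files and writes a CSV that Excel opens directly, with the angles in the first row and the corresponding times in the second. The file pattern and output name are placeholders:
import csv
import glob

angles, times = [], []

for path in sorted(glob.glob("*.txt")):  # adjust the pattern to your files
    with open(path) as f:
        for line in f:
            parts = line.split()
            if not parts:
                continue
            if parts[0] == "Contact":
                angles.append(parts[-1])
            elif parts[0].startswith("#"):
                t = parts[0][1:5]  # e.g. "1600" -> "16:00"
                times.append(t[:2] + ":" + t[2:])

with open("results.csv", "w", newline="") as out:
    writer = csv.writer(out)
    writer.writerow(["Contact Angle (deg)"] + angles)
    writer.writerow(["Measured at"] + times)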
I would just read the original file or files into Excel, selecting only those lines that begin with the Contact Angle or # token. I'm not sure how much error checking you need to do. The following assumes that you will select multiple files, and that each file is formatted as you demonstrated in your original data. It will output the angles in row 1, and the corresponding times in row 2. It does NOT check for proper formatting, or that every Angle has a corresponding Time.
It also does NOT handle, and will give an error for, the case where you only select one file. That capability can be added, if necessary.
EDIT: modified to account for either TAB or SPACE as the separator; also added code to clear worksheet and autofit the columns
It should also be easy to modify if you want to select additional parameters.
Option Explicit
'Set Reference to Microsoft Scripting Runtime

Sub GetDataFromTextFiles()
    Dim FSO As FileSystemObject
    Dim TS As TextStream
    Dim F As File
    Dim sLines As Variant
    Dim I As Long, J As Long
    Dim sFilePath
    Dim S As String
    Dim vLines() As Variant
    Dim rExtract As Range

    'Hard Coded here but could also use a
    'User form to select multiple lines
    vLines = Array("#", "Contact Angle")
    Set rExtract = [b3]

    Cells.Clear
    [a3] = "Contact Angle (deg)"
    [a4] = "Measured At"

    sFilePath = Application.GetOpenFilename("Text Files (*.txt), *.txt", MultiSelect:=True)
    Set FSO = New FileSystemObject

    For J = LBound(sFilePath) To UBound(sFilePath)
        Set TS = FSO.OpenTextFile(sFilePath(J), ForReading)
        Do Until TS.AtEndOfStream = True
            S = Trim(Replace(TS.ReadLine, Chr(9), Chr(32)))
            For I = 0 To UBound(vLines)
                If InStr(1, S, vLines(I)) = 1 Then
                    Select Case I
                        Case 0 '#
                            With rExtract(2, 1)
                                .Value = TimeSerial(Int(Mid(S, 2) / 100), Mid(S, 2) Mod 100, 0)
                                .NumberFormat = "hh:mm"
                            End With
                        Case 1 'Contact Angle
                            rExtract(1, 1) = Mid(S, InStrRev(S, " ") + 1)
                            'advance to next column after outputting angle
                            Set rExtract = rExtract(1, 2)
                    End Select
                End If
            Next I
        Loop
    Next J

    Cells.EntireColumn.AutoFit
End Sub
Here is another macro that does NOT require setting a reference to Microsoft Scripting Runtime. It does not use the FileSystemObject, but rather uses built-in VBA routines to read the file. I have been told that it will run more quickly, but I've not tested it myself. In addition, there could be issues with certain types of data, but they do not seem to exist in your files, and it runs fine on your sample.
Option Explicit

Sub GetDataFromTextFiles()
    Dim sLines As Variant
    Dim I As Long, J As Long
    Dim sFilePath
    Dim S As String
    Dim vLines() As Variant
    Dim rExtract As Range

    'Hard Coded here but could also use a
    'User form to select multiple lines
    vLines = Array("#", "Contact Angle")
    Set rExtract = [b3]

    Cells.Clear
    [a3] = "Contact Angle (deg)"
    [a4] = "Measured At"

    sFilePath = Application.GetOpenFilename("Text Files (*.txt), *.txt", MultiSelect:=True)

    For J = LBound(sFilePath) To UBound(sFilePath)
        Open sFilePath(J) For Input As #1
        Do While Not EOF(1)
            Input #1, S
            S = Trim(Replace(S, Chr(9), Chr(32)))
            For I = 0 To UBound(vLines)
                If InStr(1, S, vLines(I)) = 1 Then
                    Select Case I
                        Case 0 '#
                            With rExtract(2, 1)
                                .Value = TimeSerial(Int(Mid(S, 2) / 100), Mid(S, 2) Mod 100, 0)
                                .NumberFormat = "hh:mm"
                            End With
                        Case 1
                            rExtract(1, 1) = Mid(S, InStrRev(S, " ") + 1)
                            'advance to next column after outputting angle
                            Set rExtract = rExtract(1, 2)
                    End Select
                End If
            Next I
        Loop
        Close #1
    Next J

    Cells.EntireColumn.AutoFit
End Sub
Thanks for the answers. I have not used Stack Overflow before, so I was surprised by the number of answers and the speed of them - it's fantastic.
I have not been through the answers properly yet, but thought I should add some information to the problem specification. See the image below.
I can't post an image in this question because I don't have enough points, but you can see an image at http://journal.acquitane.com/2010-01-20/image003.jpg
This image may describe more closely what I'm trying to achieve. The horizontal lines across the page are price points on the chart. Where you get a clustering of lines within 0.5% of each other, this is considered to be a good thing, which is why I want to identify those clusters automatically. You can see on the chart that there is a cluster at S2 & MR1, and at R2 & WPP1.
So every day I produce these price points and can then manually identify those that are within 0.5% of each other, but the purpose of this question is how to do it with a Python routine.
I have reproduced the list again (see below) with labels. Just be aware that the list price points don't match the price points in the image because they are from two different days.
[YR3,175.24,8]
[SR3,147.85,6]
[YR2,144.13,8]
[SR2,130.44,6]
[YR1,127.79,8]
[QR3,127.42,5]
[SR1,120.94,6]
[QR2,120.22,5]
[MR3,118.10,3]
[WR3,116.73,2]
[DR3,116.23,1]
[WR2,115.93,2]
[QR1,115.83,5]
[MR2,115.56,3]
[DR2,115.53,1]
[WR1,114.79,2]
[DR1,114.59,1]
[WPP,113.99,2]
[DPP,113.89,1]
[MR1,113.50,3]
[DS1,112.95,1]
[WS1,112.85,2]
[DS2,112.25,1]
[WS2,112.05,2]
[DS3,111.31,1]
[MPP,110.97,3]
[WS3,110.91,2]
[50MA,110.87,4]
[MS1,108.91,3]
[QPP,108.64,5]
[MS2,106.37,3]
[MS3,104.31,3]
[QS1,104.25,5]
[SPP,103.53,6]
[200MA,99.42,7]
[QS2,97.05,5]
[YPP,96.68,8]
[SS1,94.03,6]
[QS3,92.66,5]
[YS1,80.34,8]
[SS2,76.62,6]
[SS3,67.12,6]
[YS2,49.23,8]
[YS3,32.89,8]
I did make a mistake with the original list in that Group C is wrong and should not be included. Thanks for pointing that out.
Also the 0.5% is not fixed this value will change from day to day, but I have just used 0.5% as an example for spec'ing the problem.
Thanks Again.
Mark
PS. I will get cracking on checking the answers now.
Hi:
I need to do some manipulation of stock prices. I have just started using Python (but I think I would have trouble implementing this in any language). I'm looking for some ideas on how to implement this nicely in Python.
Thanks
Mark
Problem:
I have a list of lists (FloorLevels, see below) where each sublist has two items (stock price, weight). I want to put the stock prices into groups when they are within 0.5% of each other. A group's strength will be determined by its total weight. For example:
Group-A
115.93,2
115.83,5
115.56,3
115.53,1
-------------
TotalWeight:12
-------------
Group-B
113.50,3
112.95,1
112.85,2
-------------
TotalWeight:6
-------------
FloorLevels[
[175.24,8]
[147.85,6]
[144.13,8]
[130.44,6]
[127.79,8]
[127.42,5]
[120.94,6]
[120.22,5]
[118.10,3]
[116.73,2]
[116.23,1]
[115.93,2]
[115.83,5]
[115.56,3]
[115.53,1]
[114.79,2]
[114.59,1]
[113.99,2]
[113.89,1]
[113.50,3]
[112.95,1]
[112.85,2]
[112.25,1]
[112.05,2]
[111.31,1]
[110.97,3]
[110.91,2]
[110.87,4]
[108.91,3]
[108.64,5]
[106.37,3]
[104.31,3]
[104.25,5]
[103.53,6]
[99.42,7]
[97.05,5]
[96.68,8]
[94.03,6]
[92.66,5]
[80.34,8]
[76.62,6]
[67.12,6]
[49.23,8]
[32.89,8]
]
I suggest a repeated use of k-means clustering -- let's call it KMC for short. KMC is a simple and powerful clustering algorithm... but it needs to "be told" how many clusters, k, you're aiming for. You don't know that in advance (if I understand you correctly) -- you just want the smallest k such that no two items "clustered together" are more than X% apart from each other. So, start with k equal 1 -- everything bunched together, no clustering pass needed;-) -- and check the diameter of the cluster (a cluster's "diameter", from the use of the term in geometry, is the largest distance between any two members of a cluster).
If the diameter is > X%, set k += 1, perform KMC with k as the number of clusters, and repeat the check, iteratively.
In pseudo-code:
def markCluster(items, threshold):
    k = 1
    clusters = [items]
    maxdist = diameter(items)
    while maxdist > threshold:
        k += 1
        clusters = Kmc(items, k)
        maxdist = max(diameter(c) for c in clusters)
    return clusters
assuming of course we have suitable diameter and Kmc Python functions.
Does this sound like the kind of thing you want? If so, then we can move on to show you how to write diameter and Kmc (in pure Python if you have a relatively limited number of items to deal with, otherwise maybe by exploiting powerful third-party add-on frameworks such as numpy) -- but it's not worthwhile to go to such trouble if you actually want something pretty different, whence this check!-)
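For reference, a minimal pure-Python sketch of diameter and Kmc for this one-dimensional case might look like the following. The diameter is expressed as a percentage spread so it can be compared directly with your 0.5% threshold, and Kmc is a plain Lloyd-style k-means on the price component only (illustrative, not tuned):
def diameter(cluster):
    # percentage spread between the highest and lowest price in the cluster
    prices = [p for p, _w in cluster]
    return (max(prices) - min(prices)) / min(prices) * 100

def Kmc(items, k, iterations=50):
    # simple 1-D k-means on the price component of [price, weight] pairs
    prices = sorted(p for p, _w in items)
    # spread the initial centroids evenly over the observed price range
    centroids = [prices[int(i * (len(prices) - 1) / (k - 1))] for i in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for item in items:
            nearest = min(range(k), key=lambda c: abs(item[0] - centroids[c]))
            clusters[nearest].append(item)
        centroids = [sum(p for p, _w in c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return [c for c in clusters if c]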
A stock s belongs in a group G if for each stock t in G, s * 1.05 >= t and s / 1.05 <= t, right?
How do we add the stocks to each group? If we have the stocks 95, 100, 101, and 105, and we start a group with 100, then add 101, we will end up with {100, 101, 105}. If we did 95 after 100, we'd end up with {100, 95}.
Do we just need to consider all possible permutations? If so, your algorithm is going to be inefficient.
You need to specify your problem in more detail. Just what does "put the stockprices into groups when they are within 0.5% of each other" mean?
Possibilities:
(1) each member of the group is within 0.5% of every other member of the group
(2) sort the list and split it where the gap is more than 0.5%
Note that 116.23 is within 0.5% of 115.93 -- abs((116.23 / 115.93 - 1) * 100) < 0.5 -- but you have put one number in Group A and one in Group C.
Simple example: a, b, c = (0.996, 1, 1.004) ... Note that a and b fit, b and c fit, but a and c don't fit. How do you want them grouped, and why? Is the order in the input list relevant?
Possibility (1) produces ab,c or a,bc ... tie-breaking rule, please
Possibility (2) produces abc (no big gaps, so only one group)
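If possibility (2) is what you want, a small sketch of the sort-and-split approach (assuming the data is a list of [price, weight] pairs already sorted in descending order, as in your post) could be:
def split_on_gaps(levels, pct=0.5):
    groups = [[levels[0]]]
    for prev, cur in zip(levels, levels[1:]):
        # start a new group when the drop to the next price exceeds pct%
        if (prev[0] - cur[0]) / prev[0] * 100 > pct:
            groups.append([])
        groups[-1].append(cur)
    return groups

for g in split_on_gaps([[115.93, 2], [115.83, 5], [115.56, 3], [113.50, 3]]):
    print(g, "total weight:", sum(w for _p, w in g))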
You won't be able to classify them into hard "groups". If you have prices (1.0,1.05, 1.1) then the first and second should be in the same group, and the second and third should be in the same group, but not the first and third.
A quick, dirty way to do something that you might find useful:
def make_group_function(tolerance=0.05):
    from math import log10, floor
    # I forget why this works.
    tolerance_factor = -1.0/(-log10(1.0 + tolerance))
    # well ... since you might ask
    # we want: log(x)*tf - log(x*(1+t))*tf = -1,
    # so every 5% change has a different group. The minus is just so groups
    # are ascending .. it looks a bit nicer.
    #
    # tf = -1/(log(x)-log(x*(1+t)))
    # tf = -1/(log(x/(x*(1+t))))
    # tf = -1/(log(1/(1*(1+t)))) # solved .. but let's just be more clever
    # tf = -1/(0-log(1*(1+t)))
    # tf = -1/(-log((1+t))
    def group_function(value):
        # don't just use int - it rounds up below zero, and down above zero
        return int(floor(log10(value)*tolerance_factor))
    return group_function
Usage:
group_function = make_group_function()

import random
groups = {}
for i in range(50):
    v = random.random()*500 + 1000
    group = group_function(v)
    if group in groups:
        groups[group].append(v)
    else:
        groups[group] = [v]

for group in sorted(groups):
    print('Group', group)
    for v in sorted(groups[group]):
        print(v)
    print()
For a given set of stock prices, there is probably more than one way to group stocks that are within 0.5% of each other. Without some additional rules for grouping the prices, there's no way to be sure an answer will do what you really want.
Apart from the proper way to pick which values fit together, this is a problem where a little object orientation dropped in can make it a lot easier to deal with.
I made two classes here, with a minimum of desirable behaviors, but they can make the classification a lot easier -- you get a single point to play with it on the Group class.
I can see the code below is incorrect, in the sense that the limits for group inclusion vary as new members are added -- even if the separation criteria remain the same, you would have to rewrite the get_groups method to use a multi-pass approach. It should not be hard -- but the code would be too long to be helpful here, and I think this snippet is enough to get you going:
from copy import copy
class Group(object):
    def __init__(self, data=None, name=""):
        if data:
            self.data = data
        else:
            self.data = []
        self.name = name

    def get_mean_stock(self):
        return sum(item[0] for item in self.data) / len(self.data)

    def fits(self, item):
        if 0.995 < abs(item[0]) / self.get_mean_stock() < 1.005:
            return True
        return False

    def get_weight(self):
        return sum(item[1] for item in self.data)

    def __repr__(self):
        return "Group-%s\n%s\n---\nTotalWeight: %d\n\n" % (
            self.name,
            "\n".join("%.02f, %d" % tuple(item) for item in self.data),
            self.get_weight())

class StockGrouper(object):
    def __init__(self, data=None):
        if data:
            self.floor_levels = data
        else:
            self.floor_levels = []

    def get_groups(self):
        groups = []
        floor_levels = copy(self.floor_levels)
        name_ord = ord("A") - 1
        while floor_levels:
            seed = floor_levels.pop(0)
            name_ord += 1
            group = Group([seed], chr(name_ord))
            groups.append(group)
            to_remove = []
            for i, item in enumerate(floor_levels):
                if group.fits(item):
                    group.data.append(item)
                    to_remove.append(i)
            for i in reversed(to_remove):
                floor_levels.pop(i)
        return groups
testing:
floor_levels = [ [stock, weight], ... <paste the data above> ]
s = StockGrouper(floor_levels)
s.get_groups()
For the grouping element, could you use itertools.groupby()? As the data is sorted, a lot of the work of grouping it is already done, and then you could test whether the current value in the iteration differs from the last by less than 0.5%, and have itertools.groupby() break into a new group every time your function returned False.
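That would implement possibility (2) above (a gap-based split). One wrinkle: groupby() groups by key equality rather than by a boolean "break" function, so one way to sketch it is with a small stateful key object that bumps a group id whenever the gap to the previous price exceeds the tolerance. This assumes the data is already sorted in descending order:
import itertools

class GapKey:
    # stateful key: bump the group id whenever the drop from the previous
    # price exceeds the tolerance (0.5% here)
    def __init__(self, tolerance=0.005):
        self.tolerance = tolerance
        self.prev = None
        self.group = 0

    def __call__(self, row):
        price = row[0]
        if self.prev is not None and (self.prev - price) / self.prev > self.tolerance:
            self.group += 1
        self.prev = price
        return self.group

levels = [[115.93, 2], [115.83, 5], [115.56, 3], [113.50, 3], [112.95, 1], [112.85, 2]]
for gid, members in itertools.groupby(levels, key=GapKey()):
    members = list(members)
    print("Group", gid, members, "weight:", sum(w for _p, w in members))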