Playing around with numpy:
import numpy as np
l = [39, 54, 72, 46, 89, 53, 96, 64, 2, 75]
nl = np.array(l.append(3))
>> array(None, dtype=object)
Now, if I print l, I get the list: [39, 54, 72, 46, 89, 53, 96, 64, 2, 75, 3]
My question is, why doesn't numpy create that list as an array?
If I do something like this:
nl = np.array(l.extend([45])) I get the same thing.
But, if I try to concatenate without a method: nl = np.array(l+[45]) it works.
What is causing this behaviour?
The append function will always return None. You must do this in two different lines of code:
import numpy as np
l = [39, 54, 72, 46, 89, 53, 96, 64, 2, 75]
l.append(3)
nl = np.array(l)
append and extend are in-place methods and return None.
print(l.append(3)) # None
print(l.extend([3])) # None
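To put the three cases from the question side by side, here is a minimal sketch. The shortened list and the variable names returned, nl_concat and nl_append are only illustrative; np.append is shown as one alternative, not as the required way to do it.

```python
import numpy as np

l = [39, 54, 72]

# In-place methods mutate the list and return None.
returned = l.append(3)

# Concatenation with + builds and returns a new list,
# so the result can be passed straight to np.array.
nl_concat = np.array(l + [45])

# np.append returns a new array and leaves its input untouched.
nl_append = np.append(np.array(l), 45)

print(returned)   # None
print(l)          # the list itself was changed by append
print(nl_concat)
print(nl_append)
```

So np.array(l.append(3)) is really np.array(None), which is why the result has dtype=object.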
I have a time series with several descents. How do I slice a pandas DataFrame (or, to keep it simple, the array below) to get the values and the indexes of the descending parts of the time series?
import matplotlib.pyplot as plt
import numpy as np
b = np.asarray([ 1.3068586 , 1.59882279, 2.11291473, 2.64699527,
3.23948166, 3.81979878, 4.37630243, 4.97740025,
5.59247254, 6.18671493, 6.77414586, 7.43078595,
8.02243495, 8.59612224, 9.22302662, 9.83263379,
10.43125902, 11.0956864 , 11.61107838, 12.09616684,
12.63973254, 12.49437955, 11.6433792 , 10.61083269,
9.50534291, 8.47418827, 7.40571742, 6.56611512,
5.66963658, 4.89748187, 4.10543794, 3.44828054,
2.76866318, 2.24306623, 1.68034463, 1.26568186,
1.44548443, 2.01225076, 2.60715524, 3.21968562,
3.8622007 , 4.57035958, 5.14021305, 5.77879484,
6.42776897, 7.09397923, 7.71722028, 8.30860725,
8.96652218, 9.66157193, 10.23469208, 10.79889453,
10.5788411 , 9.38270646, 7.82070643, 6.74893389,
5.68200335, 4.73429009, 3.78358222, 3.05924946,
2.30428171, 1.78052369, 1.27897065, 1.16840532,
1.59452726, 2.13085096, 2.70989933, 3.3396291 ,
3.97318058, 4.62429262, 5.23997774, 5.91232803,
6.5906609 , 7.21099657, 7.82936331, 8.49636247,
9.15634983, 9.76450244, 10.39680729, 11.04659976,
11.69287237, 12.35692643, 12.99957563, 13.66228386,
14.31806385, 14.91871927, 15.57212978, 16.22288287,
16.84697357, 17.50502002, 18.15907842, 18.83068151,
19.50945548, 20.18020639, 20.84441358, 21.52792846,
22.17933087, 22.84614545, 23.51212887, 24.18308399,
24.8552263 , 25.51709528, 26.18724379, 26.84531493,
27.50690265, 28.16610365, 28.83394822, 29.49621179,
30.15118676, 30.8019521 , 31.46714114, 32.1213546 ,
32.79366952, 33.45233007, 34.12158193, 34.77502197,
35.4532211 , 36.11018053, 36.76540453, 37.41746323])
plt.plot(-b)
plt.show()
You can set every point whose diff is not negative to NaN and then plot what is left:
import pandas as pd
bb = pd.Series(-b)
bb[bb.diff().ge(0)] = np.nan
bb.plot()
To get the indexes of descending values, use:
bb.index[bb.diff().lt(0)]
Int64Index([ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,
14, 15, 16, 17, 18, 19, 20, 37, 38, 39, 40, 41, 42,
43, 44, 45, 46, 47, 48, 49, 50, 51, 65, 66, 67, 68,
69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81,
82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94,
95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107,
108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119],
dtype='int64')
Create a second DataFrame that holds the series shifted by one position, then subtract the two columns term by term. Keeping only the rows with a negative difference should give you what you want:
from pandas import DataFrame, concat
df = DataFrame(b)
df = concat([df.shift(1), df], axis=1)
df.columns = ['t-1', 't']
df = df.drop(df.index[0])  # the first row has no previous value
df['diff'] = df['t'] - df['t-1']
res = df[df['diff'] < 0]
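The shift-and-subtract idea can also be written on a Series, where it matches what Series.diff() computes. A small sketch on made-up values (not the series from the question); the names manual, builtin and falling are only illustrative:

```python
import pandas as pd

# Made-up values, not the series from the question.
s = pd.Series([1.0, 2.0, 3.0, 2.5, 1.5, 2.0])

# Manual version: subtract the series shifted down by one position.
manual = s - s.shift(1)

# Built-in equivalent.
builtin = s.diff()

# Rows whose value is lower than the previous one.
falling = s[builtin < 0]
print(falling)
```

Whether you test for < 0 or > 0 depends on whether you look at b itself or at the plotted -b.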
There is also an easy numpy-only solution using np.where (the question is tagged pandas, but the code uses only numpy). You want the points where the graph is descending. Because the plot shows -b, that means the points where the data b is ascending.
# the indices where the data is ascending.
ix, = np.where(np.diff(b) > 0)
# the values
c = b[ix]
Note that this will give you the first value in each ascending pair of consecutive values, while the pandas-based solution gives the second one. To get the same indices just add 1 to ix.
s = pd.Series(b)
assert np.all(s[s.diff() > 0].index == ix + 1)
assert np.all(s[s.diff() > 0] == b[ix + 1])
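If you want each run as its own slice rather than one flat index array, one possible sketch is to split the indices wherever they stop being consecutive. The array below is small and made up, the names runs and segments are only illustrative, and the comparison is written for falling data, so flip it to > 0 for the -b plot in the question.

```python
import numpy as np

# Small made-up series: up, down, up, down.
b = np.array([1.0, 2.0, 3.0, 2.5, 1.5, 2.0, 4.0, 3.0])

# Indices i where b[i] < b[i-1], i.e. the second point of each falling pair.
ix = np.where(np.diff(b) < 0)[0] + 1

# Split wherever consecutive indices are not adjacent: one array per run.
runs = np.split(ix, np.where(np.diff(ix) > 1)[0] + 1)

# Pair each run of indices with the corresponding values.
segments = [(run, b[run]) for run in runs]

for run, values in segments:
    print(run, values)
```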
I was trying to create my own hex editor that lists the statistics of a binary file generated by VeraCrypt. (I am still learning.)
File: Statistics.py
import Statistics
data = open('VERASHORT', 'rb').read()
print(list(data))
Anyway, the code above prints the contents of the binary file as a list, but it prints the list twice. It is only three lines of code, so I am wondering why it doesn't work. I modified the code from the author, so it should work. (Learning Python)
Here is the output when I run it with Python 3. (The list appears twice.)
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 102, 102, 62, 90, 121, 113, 111, 92, 85, 102, 102, 102, 102, 102, 102, 102, 102, 52, 32, 38, 92, 85, 102, 102, 102, 102, 102, 102, 102, 102]
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 102, 102, 62, 90, 121, 113, 111, 92, 85, 102, 102, 102, 102, 102, 102, 102, 102, 52, 32, 38, 92, 85, 102, 102, 102, 102, 102, 102, 102, 102]
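The "import Statistics" line is the cause.
Because the file itself is named Statistics.py, the script imports itself: Python runs the file once as the main script and once more as the imported module, so the code executes twice.
BTW, Python package and module names should be lowercase: https://www.python.org/dev/peps/pep-0008/#package-and-module-names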
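The double execution can be reproduced in isolation. The sketch below writes a throwaway script that imports itself into a temporary directory and runs it in a child interpreter; the file name selfimport_demo.py is hypothetical and stands in for Statistics.py.

```python
import subprocess
import sys
import tempfile
import textwrap
from pathlib import Path

# A two-line script that imports a module with its own name.
source = textwrap.dedent("""
    import selfimport_demo
    print("body executed as", __name__)
""")

with tempfile.TemporaryDirectory() as tmp:
    script = Path(tmp) / "selfimport_demo.py"   # hypothetical file name
    script.write_text(source)
    # Run the script the same way "python Statistics.py" would be run.
    result = subprocess.run(
        [sys.executable, str(script)],
        capture_output=True, text=True, cwd=tmp,
    )
    lines = result.stdout.splitlines()

print(lines)
```

The body runs once under the module name and once as __main__, which is the duplicated output seen in the question. Renaming the file or dropping the import removes the second run.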
Update: I have solved the issue.
I renamed Statistics.py to Stat.py, which means that the module no longer imports itself!!
An error occurred: the Statistics import on my first line of code should be LOWERCASE!! So I changed it.
list(data) does not require any imports!!
That is where I screwed up. Thanks for the help, guys. (The hints helped me reach a conclusion quickly!!)
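For reference, a minimal sketch of what list(data) gives you in Python 3 and how the same bytes could be shown as hex. The bytes here are made up and stand in for open('VERASHORT', 'rb').read(); the names as_ints and as_hex are only illustrative.

```python
# Stand-in for the file contents; these bytes are made up.
data = bytes([0, 1, 2, 15, 102, 255])

# Iterating over a bytes object yields integers, no import needed.
as_ints = list(data)

# Two-digit lowercase hex for each byte, as a hex editor would show it.
as_hex = [format(b, "02x") for b in data]

print(as_ints)
print(as_hex)
```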
I'm trying to draw a bar plot with vertical axis labels and an axis title.
The script below makes the graph, but it cuts off the x-axis label/title. Even if I make the picture bigger on my screen, it is still cut off a bit. Also, I have to run the script twice: the first time I get an error about the fontdict property, but the next time it works.
Does anyone know how to stop it from being cut off? Also, I am just saving the figure that pops up on the screen, because saving from the script is not working for some reason.
Thanks!
import numpy
import matplotlib
import matplotlib.pylab as pylab
import matplotlib.pyplot
import pdb
from collections import Counter
phenos = [128, 20, 0, 144, 4, 16, 160, 136, 192, 128, 20, 0, 4, 16, 144, 130, 136, 132, 22,
128, 160, 4, 0, 36, 132, 136, 130, 128, 22, 4, 0, 144, 160, 130, 132,
128, 4, 0, 136, 132, 68, 130, 192, 8, 128, 4, 0, 20, 22, 132, 144, 192, 130, 2,
128, 4, 0, 132, 20, 136, 144, 192, 64, 130, 128, 4, 0, 144, 132, 192, 20, 16, 136,
128, 4, 0, 130, 160, 132, 192, 2, 128, 4, 0, 132, 68, 160, 192, 36, 64,
128, 4, 0, 136, 192, 8, 160, 36, 128, 4, 0, 22, 20, 144, 132, 160,
128, 4, 0, 132, 20, 192, 144, 160, 68, 64, 128, 4, 0, 132, 160, 144, 136, 192, 68, 20]
from collections import Counter
import numpy as np
import matplotlib.pyplot as plt
from operator import itemgetter
c = Counter(phenos).items()
c.sort(key=itemgetter(1))
font = {'family' : 'sanserif',
'color' : 'black',
'weight' : 'normal',
'size' : 22,
}
font2 = {'family' : 'sansserif',
'color' : 'black',
'weight' : 'normal',
'size' : 18,
}
labels, values = zip(*c)
labels = ("GU", "IT", "AA", "SG", "A, IGI", "A, SG", "GU, A, AA", "D, GU", "D, IT", "A, AA", "D, IGI", "D, AA", "192", "D, A", "D, H", "H", "A")
pylab.show()
pylab.draw()
indexes = np.arange(0, 2*len(labels), 2)
width = 2
plt.bar(indexes, values, width=2, color="blueviolet")
plt.xlabel("Phenotype identifier", fontdict=font)
plt.ylabel("Number of occurances in top 10 \n phenotypes for cancerous tumours", fontdict=font)
#plt.title("Number of occurances for different phenotypes \n in top 10 subclones of a tumour", fontdict=font2)
plt.xticks(indexes + width * 0.5, labels, rotation='vertical', fontdict=font2)
plt.figure(figsize=(8.0, 7.0))
pictureFileName2 = "..\\Stats\\" + "Phenos2.png"
pylab.savefig(pictureFileName2, dpi=800)
#fig.set_size_inches(18.5,10.5)
#plt.savefig('test2png.png',dpi=100)
Three problems:
1, It is not really true that the code fails on the first run and works on the second. The reason is that you call .show() before making the plot. On the first run, the code stops at the line named in the exception message. On the second run, .show() is executed first, and the partially made plot left over from the previous run shows up.
2, fontdict=font2 etc. is not necessary and in fact wrong. You just need **font2 etc.
3, The truncated tick labels. There are many ways to fix this, but the basic idea is to increase the white space around the plot. Some alternatives:
plt.gcf().subplots_adjust(bottom=0.35, top=0.7) #adjusting the plotting area
plt.tight_layout() #may raise an exception, depends on which backend is in use
plt.savefig('test.png', bbox_inches='tight', pad_inches = 0.0) #use bbox and pad, if you only want to change the saved figure.
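Putting the three points together, a reordered version of the script might look like the sketch below. Several things are assumptions rather than part of the original: the shortened sample data, the Agg backend (so it runs without a display), the single font dictionary, and saving to an in-memory buffer instead of a file. Note that in the question plt.figure(figsize=...) is called after the bars are drawn, which starts a new, empty figure, and that the generic font family name is spelled 'sans-serif'.

```python
import io
from collections import Counter

import matplotlib
matplotlib.use("Agg")  # assumption: headless backend, no window needed
import matplotlib.pyplot as plt
import numpy as np

phenos = [128, 20, 0, 144, 4, 16, 128, 4, 0, 20]  # shortened, made-up sample
counts = sorted(Counter(phenos).items(), key=lambda kv: kv[1])
labels, values = zip(*counts)

font = {"family": "sans-serif", "color": "black", "weight": "normal", "size": 14}

# Create the figure first, then draw on it.
fig, ax = plt.subplots(figsize=(8.0, 7.0))
indexes = np.arange(len(labels))
ax.bar(indexes, values, color="blueviolet")

# Unpack the font dictionary as keyword arguments.
ax.set_xlabel("Phenotype identifier", **font)
ax.set_ylabel("Number of occurrences", **font)
ax.set_xticks(indexes)
ax.set_xticklabels([str(label) for label in labels], rotation="vertical", **font)

# Make room for the labels, then save; pass a file name here in practice.
fig.tight_layout()
buf = io.BytesIO()
fig.savefig(buf, format="png", bbox_inches="tight")
plt.close(fig)
```

If you also want the window, call plt.show() only after everything has been drawn and saved.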
I have a 1D numpy array, and some offset/length values. I would like to extract from this array all entries which fall within offset, offset+length, which are then used to build up a new 'reduced' array from the original one, that only consists of those values picked by the offset/length pairs.
For a single offset/length pair this is trivial with standard array slicing [offset:offset+length]. But how can I do this efficiently (i.e. without any loops) for many offset/length values?
Thanks,
Mark
>>> import numpy as np
>>> a = np.arange(100)
>>> ind = np.concatenate((np.arange(5),np.arange(10,15),np.arange(20,30,2),np.array([8])))
>>> a[ind]
array([ 0, 1, 2, 3, 4, 10, 11, 12, 13, 14, 20, 22, 24, 26, 28, 8])
There is the naive method; just doing the slices:
>>> import numpy as np
>>> a = np.arange(100)
>>>
>>> offset_length = [(3,10),(50,3),(60,20),(95,1)]
>>>
>>> np.concatenate([a[offset:offset+length] for offset,length in offset_length])
array([ 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 50, 51, 52, 60, 61, 62, 63,
64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 95])
The following might be faster, but you would have to test/benchmark.
It works by constructing a list of the desired indices, which is a valid method of indexing a numpy array.
>>> indices = [offset + i for offset,length in offset_length for i in range(length)]
>>> a[indices]
array([ 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 50, 51, 52, 60, 61, 62, 63,
64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 95])
It's not clear whether this is actually faster than the naive method, but it might be if you have a lot of very short intervals.
(This last method is basically the same as @fraxel's solution, just using a different method of making the index list.)
Performance testing
I've tested a few different cases: a few short intervals, a few long intervals, lots of short intervals. I used the following script:
import timeit
setup = 'import numpy as np; a = np.arange(1000); offset_length = %s'
for title, ol in [('few short', '[(3,10),(50,3),(60,10),(95,1)]'),
                  ('few long', '[(3,100),(200,200),(600,300)]'),
                  ('many short', '[(2*x,1) for x in range(400)]')]:
    print('**', title, '**')
    # each statement is timed over 10000 runs
    print('dbaupp 1st:', timeit.timeit('np.concatenate([a[offset:offset+length] for offset,length in offset_length])', setup % ol, number=10000))
    print('dbaupp 2nd:', timeit.timeit('a[[offset + i for offset,length in offset_length for i in range(length)]]', setup % ol, number=10000))
    print(' fraxel:', timeit.timeit('a[np.concatenate([np.arange(offset,offset+length) for offset,length in offset_length])]', setup % ol, number=10000))
This outputs:
** few short **
dbaupp 1st: 0.0474979877472
dbaupp 2nd: 0.190793991089
fraxel: 0.128381967545
** few long **
dbaupp 1st: 0.0416231155396
dbaupp 2nd: 1.58000087738
fraxel: 0.228138923645
** many short **
dbaupp 1st: 3.97210478783
dbaupp 2nd: 2.73584890366
fraxel: 7.34302687645
This suggests that my first method is the fastest when you have a few intervals (and it is significantly faster), and my second is the fastest when you have lots of intervals.
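All three timed variants still loop over the intervals in Python. For completeness, here is a sketch of a variant that builds the index array with array operations only, using np.repeat and np.cumsum. It is a different technique from the ones benchmarked above and has not been timed here; the names starts_out and within are only illustrative, and lengths are assumed to be non-negative.

```python
import numpy as np

a = np.arange(100)
offsets = np.array([3, 50, 60, 95])
lengths = np.array([10, 3, 20, 1])

# Position in the output where each interval begins.
starts_out = np.cumsum(lengths) - lengths

# For every output slot: its position inside its own interval.
within = np.arange(lengths.sum()) - np.repeat(starts_out, lengths)

# Interval offset plus position inside the interval gives the source index.
indices = np.repeat(offsets, lengths) + within
result = a[indices]

# Reference result from plain slicing, for comparison.
expected = np.concatenate([a[o:o + n] for o, n in zip(offsets, lengths)])
print(np.array_equal(result, expected))
```

Whether this beats the slicing approach depends on the number and size of the intervals, so benchmark it on your own data.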