How to calculate third central moment? - python

Description: I have a sample: sample = [100, 86, 51, 100, 95, 100, 12, 61, 0, 0, 12, 86, 0, 52, 62, 76, 91, 91, 62, 91, 65, 91, 9, 83, 67, 58, 56]. I need to calculate the third central moment of this sample.
My approach:
I'm making a table whose top row is the unique values from the sample and whose bottom row is the frequency of each value in the top row:
table = dict(Counter(sample))
Then I'm calculating empirical k-th central moment with this formula:
def empirical_central_moment(table: dict, k):
    mean = sum([value * frequency for value, frequency in table.items()]) / sum(list(table.values()))
    N = sum(list(table.values()))
    return sum([(value - mean)**k * frequency / N for value, frequency in table.items()])
Program:
from collections import Counter
def empirical_central_moment(table: dict, k):
    mean = sum([value * frequency for value, frequency in table.items()]) / sum(list(table.values()))
    N = sum(list(table.values()))
    return sum([(value - mean)**k * frequency / N for value, frequency in table.items()])
sample = [100, 86, 51, 100, 95, 100, 12, 61, 0, 0, 12, 86, 0, 52, 62, 76, 91, 91, 62, 91, 65, 91, 9, 83, 67, 58, 56]
table = dict(Counter(sample))
print(empirical_central_moment(table, 3))
Problem: Instead of the desired -545.33983 ... I'm getting -26721.65147589292, and I just can't wrap my head around why I'm getting the wrong value. Will appreciate any help; thanks in advance.

Your answer is correct. Not sure what other answer you might be looking for. In general, and unless the purpose of this code is to exercise programming the logic of it, you don't need to reinvent the wheel and you'll be much faster and safer by doing something as simple as:
from scipy.stats import moment
sample = [100, 86, 51, 100, 95, 100, 12, 61, 0, 0, 12, 86, 0, 52, 62, 76, 91, 91, 62, 91, 65, 91, 9, 83, 67, 58, 56]
print(moment(sample, moment=3, axis=0, nan_policy='propagate'))
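As a sanity check, the frequency-table formula can be compared against a direct computation over the raw sample; the two are mathematically identical, which confirms that the value the asker computed is the third central moment (a minimal sketch using only the standard library):

```python
from collections import Counter

def empirical_central_moment(table, k):
    # N is the total number of observations in the table.
    N = sum(table.values())
    mean = sum(value * freq for value, freq in table.items()) / N
    return sum((value - mean) ** k * freq / N for value, freq in table.items())

sample = [100, 86, 51, 100, 95, 100, 12, 61, 0, 0, 12, 86, 0, 52,
          62, 76, 91, 91, 62, 91, 65, 91, 9, 83, 67, 58, 56]

# Direct computation over the raw sample, no frequency table.
mean = sum(sample) / len(sample)
direct = sum((x - mean) ** 3 for x in sample) / len(sample)

table_based = empirical_central_moment(dict(Counter(sample)), 3)
print(abs(direct - table_based) < 1e-6)  # the two agree
```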


Why doesn't numpy create an array when executing a list method

Playing around with numpy:
import numpy as np
l = [39, 54, 72, 46, 89, 53, 96, 64, 2, 75]
nl = np.array(l.append(3))
# array(None, dtype=object)
Now, if I call on l, I'll get the list: [39, 54, 72, 46, 89, 53, 96, 64, 2, 75, 3]
My question is, why doesn't numpy create that list as an array?
If I do something like this:
nl = np.array(l.extend([45]))
I get the same thing. But if I concatenate without a method, nl = np.array(l + [45]), it works.
What is causing this behaviour?
The append method always returns None. You must do this in two separate lines of code:
import numpy as np
l = [39, 54, 72, 46, 89, 53, 96, 64, 2, 75]
l.append(3)
nl = np.array(l)
append and extend are in-place methods and return None.
print(l.append(3)) # None
print(l.extend([3])) # None
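By contrast, if you want an expression that yields a new list inline (so np.array sees the data rather than None), concatenation and unpacking both work (a small sketch):

```python
import numpy as np

l = [39, 54, 72, 46, 89, 53, 96, 64, 2, 75]

# Both expressions build a NEW list, unlike append/extend,
# so np.array receives the actual data.
nl1 = np.array(l + [3])
nl2 = np.array([*l, 3])

print(nl1[-1], nl2[-1])  # 3 3; l itself is unchanged
```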

Create a list of multiples of a number

Problem:
List of Multiples
Create a Python 3 function that takes two numbers (value, length) as arguments and returns a list of multiples of value until the size of the list reaches length.
Examples
list_of_multiples(value=7, length=5) ➞ [7, 14, 21, 28, 35]
list_of_multiples(value=12, length=10) ➞ [12, 24, 36, 48, 60, 72, 84, 96, 108, 120]
list_of_multiples(value=17, length=6) ➞ [17, 34, 51, 68, 85, 102]
My attempt, which doesn't build the list:
def multiples(value, length):
    """
    value is the number to be multiplied;
    length is the maximum number of multiples required.
    """
    for i in range(length):
        out = i
    return i
Most Pythonic way:
def multiples(value, length):
    return [*range(value, length*value + 1, value)]

print(multiples(7, 5))
# [7, 14, 21, 28, 35]
print(multiples(12, 10))
# [12, 24, 36, 48, 60, 72, 84, 96, 108, 120]
print(multiples(17, 6))
# [17, 34, 51, 68, 85, 102]
Pythonic way:
def multiples(value, length):
    return [value * i for i in range(1, length + 1)]

print(multiples(7, 5))
# [7, 14, 21, 28, 35]
print(multiples(12, 10))
# [12, 24, 36, 48, 60, 72, 84, 96, 108, 120]
print(multiples(17, 6))
# [17, 34, 51, 68, 85, 102]
def multiples(value, length):
    list_multiples = []
    i = 0
    while i < length:
        list_multiples.append(value * (i + 1))
        i += 1
    return list_multiples
The easy / not in-line way would be:
def multiples(value, length):
    l = []
    for i in range(1, length + 1):
        l.append(value * i)
    return l
The best answer for small values of length (< 100) is given by Nite Block.
However, in case length becomes bigger, using numpy is significantly faster than python loops:
numpy.arange(1, length+1) * value
With a length of 1000, python loops take almost 4 times longer than numpy. See code below:
import timeit

testcode_numpy = '''
import numpy
def multiples_numpy(value, length):
    return numpy.arange(1, length+1) * value
multiples_numpy(5, 1000)
'''
testcode = '''
def multiples(value, length):
    return [*range(value, length*value+1, value)]
multiples(5, 1000)
'''
print(timeit.timeit(testcode_numpy))
print(timeit.timeit(testcode))
# Result:
# with numpy: 2.4 s
# without numpy: 9.7 s
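For completeness, a quick equality check (a sketch reusing the two definitions from the timing code) confirms both versions produce the same multiples:

```python
import numpy as np

def multiples_numpy(value, length):
    # arange(1, length+1) gives [1..length]; multiply by value elementwise.
    return np.arange(1, length + 1) * value

def multiples(value, length):
    return [*range(value, length * value + 1, value)]

print(multiples_numpy(7, 5).tolist() == multiples(7, 5))  # True
```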

Python - cut only the descending part of the dataset

I have a timeseries with various downcasts. My question is: how do I slice a pandas dataframe (or, in this case, the array, just to keep it simple) to get the data and the indexes of the descending parts of the timeseries?
import matplotlib.pyplot as plt
import numpy as np
b = np.asarray([ 1.3068586 , 1.59882279, 2.11291473, 2.64699527,
3.23948166, 3.81979878, 4.37630243, 4.97740025,
5.59247254, 6.18671493, 6.77414586, 7.43078595,
8.02243495, 8.59612224, 9.22302662, 9.83263379,
10.43125902, 11.0956864 , 11.61107838, 12.09616684,
12.63973254, 12.49437955, 11.6433792 , 10.61083269,
9.50534291, 8.47418827, 7.40571742, 6.56611512,
5.66963658, 4.89748187, 4.10543794, 3.44828054,
2.76866318, 2.24306623, 1.68034463, 1.26568186,
1.44548443, 2.01225076, 2.60715524, 3.21968562,
3.8622007 , 4.57035958, 5.14021305, 5.77879484,
6.42776897, 7.09397923, 7.71722028, 8.30860725,
8.96652218, 9.66157193, 10.23469208, 10.79889453,
10.5788411 , 9.38270646, 7.82070643, 6.74893389,
5.68200335, 4.73429009, 3.78358222, 3.05924946,
2.30428171, 1.78052369, 1.27897065, 1.16840532,
1.59452726, 2.13085096, 2.70989933, 3.3396291 ,
3.97318058, 4.62429262, 5.23997774, 5.91232803,
6.5906609 , 7.21099657, 7.82936331, 8.49636247,
9.15634983, 9.76450244, 10.39680729, 11.04659976,
11.69287237, 12.35692643, 12.99957563, 13.66228386,
14.31806385, 14.91871927, 15.57212978, 16.22288287,
16.84697357, 17.50502002, 18.15907842, 18.83068151,
19.50945548, 20.18020639, 20.84441358, 21.52792846,
22.17933087, 22.84614545, 23.51212887, 24.18308399,
24.8552263 , 25.51709528, 26.18724379, 26.84531493,
27.50690265, 28.16610365, 28.83394822, 29.49621179,
30.15118676, 30.8019521 , 31.46714114, 32.1213546 ,
32.79366952, 33.45233007, 34.12158193, 34.77502197,
35.4532211 , 36.11018053, 36.76540453, 37.41746323])
plt.plot(-b)
plt.show()
You can just set the values where the series is not descending (non-negative diffs) to NaN and then plot:
import pandas as pd
bb = pd.Series(-b)
bb[bb.diff().ge(0)] = np.nan
bb.plot()
To get the indexes of descending values, use:
bb.index[bb.diff().lt(0)]
Int64Index([ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,
14, 15, 16, 17, 18, 19, 20, 37, 38, 39, 40, 41, 42,
43, 44, 45, 46, 47, 48, 49, 50, 51, 65, 66, 67, 68,
69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81,
82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94,
95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107,
108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119],
dtype='int64')
Create a second DataFrame where everything is shifted by one index, then subtract the two term by term; you should get what you want (keeping only the rows with a negative diff). Here:
from pandas import DataFrame, concat

df = DataFrame(b)
df = concat([df.shift(1), df], axis=1)
df.columns = ['t-1', 't']
df = df.drop(df.index[0])
df['diff'] = df['t'] - df['t-1']
res = df[df['diff'] < 0]
There is also an easy numpy-only solution (the question is tagged pandas but the code uses only numpy) using np.where. You want the points where the graph is descending which means the data is ascending.
# the indices where the data is ascending.
ix, = np.where(np.diff(b) > 0)
# the values
c = b[ix]
Note that this will give you the first value in each ascending pair of consecutive values, while the pandas-based solution gives the second one. To get the same indices just add 1 to ix.
import pandas as pd

s = pd.Series(b)
assert np.all(s[s.diff() > 0].index == ix + 1)
assert np.all(s[s.diff() > 0] == b[ix + 1])
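The flat ix array above mixes all the runs together; if you want each descending stretch as its own slice, one hedged sketch (on a small toy series, reusing the same diff idea) is to split the indices wherever they stop being consecutive:

```python
import numpy as np

# Toy series just for illustration (not the b from the question).
b = np.array([5.0, 4.0, 3.0, 4.5, 6.0, 5.5, 4.2, 4.8])

# Indices where b decreases from one point to the next.
ix, = np.where(np.diff(b) < 0)

# Split the flat index array wherever consecutive indices are not adjacent,
# giving one array per descending run.
segments = np.split(ix, np.where(np.diff(ix) != 1)[0] + 1)
print([seg.tolist() for seg in segments])  # [[0, 1], [4, 5]]
```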

Python: list(data) outputs list twice in terminal when reading a binary file. Is it a bug?

I was trying to create my own hex editor that lists the statistics of a binary file generated from Veracrypt. (I am still learning.)
File: Statistics.py
import Statistics
data = open('VERASHORT', 'rb').read()
print(list(data))
Anyway, the code above prints the bytes of the binary file as a list twice. It is only three lines of code, but I am wondering why it won't work. I have modified the code from the author, so it should work. (Learning Python.)
Here is the output after Python 3 is run. (The list appears twice.)
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 102, 102, 62, 90, 121, 113, 111, 92, 85, 102, 102, 102, 102, 102, 102, 102, 102, 52, 32, 38, 92, 85, 102, 102, 102, 102, 102, 102, 102, 102]
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 102, 102, 62, 90, 121, 113, 111, 92, 85, 102, 102, 102, 102, 102, 102, 102, 102, 52, 32, 38, 92, 85, 102, 102, 102, 102, 102, 102, 102, 102]
The "import Statistics" is the cause.
You load Statistics.py twice, so its top-level code executes two times.
By the way, Python module names should be lowercase: https://www.python.org/dev/peps/pep-0008/#package-and-module-names
Add: I have solved the issue.
I renamed Statistics.py to Stat.py, which means the module no longer imports itself!
An error occurred because the Statistics import in the first line of my code should be lowercase, so I changed it.
list(data) does not require any imports!
That is where I screwed up; thanks for the help, guys. (The hints helped me reach a quick conclusion!)
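A related safeguard, independent of the rename: guarding top-level code with `__name__` keeps it from running on import at all (a minimal sketch, not the asker's original file; `summarize` is a made-up name):

```python
# stat.py (hypothetical lowercase module name)
def summarize(data):
    # Return the byte values as a plain list.
    return list(data)

if __name__ == '__main__':
    # Runs only when the file is executed as a script, never on import,
    # so even an accidental self-import cannot print twice.
    print(summarize(b'\x00\x01\x02'))
```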

python, weighted linspace

Can anyone show me the best way to generate a (numpy) array containing values from 0 to 100, weighted by a (for example) normal distribution function with mean 50 and variance 5? So that there are more 50s and fewer (nearly no) zeros and hundreds. I think the problem should not be too hard to solve, but I'm stuck somehow...
I thought about something with np.linspace, but it seems that there is no weight option.
So just to be clear: I don't want a simple normal distribution from 0 to 100, but something like an array from 0 to 100 with a higher density of values in the middle.
Thanks
You can use scipy's stats distributions:
import numpy as np
from scipy import stats
# your distribution:
distribution = stats.norm(loc=50, scale=5)
# percentile point, the range for the inverse cumulative distribution function:
bounds_for_range = distribution.cdf([0, 100])
# Linspace for the inverse cdf:
pp = np.linspace(*bounds_for_range, num=1000)
x = distribution.ppf(pp)
# And just to check that it makes sense you can try:
from matplotlib import pyplot as plt
plt.hist(x)
plt.show()
Of course, I admit the start and end point is not quite exact like this due to numerical inaccuracies when going back and forth.
It is important to understand that your problem is not exactly solvable, since in general a finite discrete sample cannot exactly reproduce your distribution.
You can easily see this by asking trivial versions of your question, like a set of 3 values in [0, 1] with an equal distribution. Here the results [0, 0, 1] and [0, 1, 1] would both be reasonable.
However, you can solve the problem approximately. If you ask for an array with count elements out of [0, 1, ..., N], where the given probabilities p = [p0, p1, ..., pN] are normalized (p0 + ... + pN == 1), then the count c[k] of element k in your resulting array is theoretically
c[k] = p[k] * count
but these counts are floats. You have to decide on a way to "round" them while keeping their total sum. This is the freedom of choice arising from the under-definedness of your question.
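One common way to do that rounding while preserving the total is largest-remainder rounding (a sketch; the probabilities below are just an example):

```python
def rounded_counts(p, count):
    # Ideal (float) count for each value k.
    ideal = [pk * count for pk in p]
    # Start from the floor of each ideal count...
    counts = [int(c) for c in ideal]
    # ...then hand the remaining slots to the largest fractional remainders,
    # so the counts still sum exactly to `count`.
    remaining = count - sum(counts)
    by_remainder = sorted(range(len(p)), key=lambda k: ideal[k] - counts[k], reverse=True)
    for k in by_remainder[:remaining]:
        counts[k] += 1
    return counts

# Example: probabilities 0.5, 0.3, 0.2 over values 0, 1, 2 for 7 elements.
print(rounded_counts([0.5, 0.3, 0.2], 7))  # [4, 2, 1], sums to 7
```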
>>> import random
>>> sorted([int(random.gauss(50, 5)) for i in range(100)])
[33, 40, 40, 40, 40, 40, 42, 42, 42, 42, 43, 43, 43, 43, 44, 44, 44, 44, 44, 45, 45, 45, 46, 46, 46, 46, 46, 46, 46, 47, 47, 47, 47, 47, 47, 47, 47, 47, 47, 48, 48, 48, 48, 48, 48, 48, 49, 49, 50, 50, 50, 50, 50, 51, 51, 51, 51, 51, 51, 51, 51, 51, 51, 51, 51, 52, 52, 52, 52, 52, 52, 52, 52, 52, 52, 53, 53, 53, 54, 54, 54, 54, 54, 54, 54, 54, 54, 55, 55, 56, 56, 57, 57, 57, 57, 57, 57, 57, 58, 61]
