Python - cut only the descending part of the dataset

I have a time series with various downcasts. My question is: how do I slice a pandas DataFrame (or, in this case, just the array below, to keep it simple) to get the data and the indexes of the descending parts of the time series?
import matplotlib.pyplot as plt
import numpy as np
b = np.asarray([ 1.3068586 , 1.59882279, 2.11291473, 2.64699527,
3.23948166, 3.81979878, 4.37630243, 4.97740025,
5.59247254, 6.18671493, 6.77414586, 7.43078595,
8.02243495, 8.59612224, 9.22302662, 9.83263379,
10.43125902, 11.0956864 , 11.61107838, 12.09616684,
12.63973254, 12.49437955, 11.6433792 , 10.61083269,
9.50534291, 8.47418827, 7.40571742, 6.56611512,
5.66963658, 4.89748187, 4.10543794, 3.44828054,
2.76866318, 2.24306623, 1.68034463, 1.26568186,
1.44548443, 2.01225076, 2.60715524, 3.21968562,
3.8622007 , 4.57035958, 5.14021305, 5.77879484,
6.42776897, 7.09397923, 7.71722028, 8.30860725,
8.96652218, 9.66157193, 10.23469208, 10.79889453,
10.5788411 , 9.38270646, 7.82070643, 6.74893389,
5.68200335, 4.73429009, 3.78358222, 3.05924946,
2.30428171, 1.78052369, 1.27897065, 1.16840532,
1.59452726, 2.13085096, 2.70989933, 3.3396291 ,
3.97318058, 4.62429262, 5.23997774, 5.91232803,
6.5906609 , 7.21099657, 7.82936331, 8.49636247,
9.15634983, 9.76450244, 10.39680729, 11.04659976,
11.69287237, 12.35692643, 12.99957563, 13.66228386,
14.31806385, 14.91871927, 15.57212978, 16.22288287,
16.84697357, 17.50502002, 18.15907842, 18.83068151,
19.50945548, 20.18020639, 20.84441358, 21.52792846,
22.17933087, 22.84614545, 23.51212887, 24.18308399,
24.8552263 , 25.51709528, 26.18724379, 26.84531493,
27.50690265, 28.16610365, 28.83394822, 29.49621179,
30.15118676, 30.8019521 , 31.46714114, 32.1213546 ,
32.79366952, 33.45233007, 34.12158193, 34.77502197,
35.4532211 , 36.11018053, 36.76540453, 37.41746323])
plt.plot(-b)
plt.show()

You can mask the points where the series is not descending (i.e. where the diff is non-negative) with NaN and then plot:
import pandas as pd

bb = pd.Series(-b)
bb[bb.diff().ge(0)] = np.nan   # keep only the strictly descending points
bb.plot()
To get the indexes of descending values, use:
bb.index[bb.diff().lt(0)]
Int64Index([ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,
14, 15, 16, 17, 18, 19, 20, 37, 38, 39, 40, 41, 42,
43, 44, 45, 46, 47, 48, 49, 50, 51, 65, 66, 67, 68,
69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81,
82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94,
95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107,
108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119],
dtype='int64')
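If you also want each descending run as its own slice rather than one flat list of indexes, here is a possible follow-up sketch (the names desc_idx and runs are just illustrative): split the index wherever it is not consecutive.
import numpy as np
import pandas as pd

bb = pd.Series(-b)
desc_idx = np.asarray(bb.index[bb.diff().lt(0)])   # indexes of all descending points
# split wherever the index jumps by more than 1, giving one array per descending run
runs = np.split(desc_idx, np.where(np.diff(desc_idx) > 1)[0] + 1)
for run in runs:
    print(run[0], run[-1], bb[run].values)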

Create a second dataframe where everything is shifted by one index, then subtract the two term by term; keeping only the rows with a negative diff gives you what you want:
import pandas as pd

df = pd.DataFrame(b)
df = pd.concat([df.shift(1), df], axis=1)
df.columns = ['t-1', 't']
df = df.drop(df.index[0])            # the first row has no previous value
df['diff'] = df['t'] - df['t-1']
res = df[df['diff'] < 0]             # keep only the rows where the series decreases

There is also an easy numpy-only solution (the question is tagged pandas, but the code shown uses only numpy) using np.where. You want the points where the plotted graph is descending, which (since the plot shows -b) means the data b is ascending.
# the indices where the data is ascending.
ix, = np.where(np.diff(b) > 0)
# the values
c = b[ix]
Note that this will give you the first value in each ascending pair of consecutive values, while the pandas-based solution gives the second one. To get the same indices just add 1 to ix.
s = pd.Series(b)
assert np.all(s[s.diff() > 0].index == ix + 1)
assert np.all(s[s.diff() > 0] == b[ix + 1])

Related

How to calculate third central moment?

Description: I have a sample: sample = [100, 86, 51, 100, 95, 100, 12, 61, 0, 0, 12, 86, 0, 52, 62, 76, 91, 91, 62, 91, 65, 91, 9, 83, 67, 58, 56]. I need to calculate third central moment of this sample.
My approach:
I'm making a table whose top row is the unique values from the sample and whose bottom row is the frequency of each value from the top row:
table = dict(Counter(sample))
Then I'm calculating the empirical k-th central moment with this formula:
def empirical_central_moment(table: dict, k):
    mean = sum([value * frequency for value, frequency in table.items()]) / sum(list(table.values()))
    N = sum(list(table.values()))
    return sum([(value - mean)**k * frequency / N for value, frequency in table.items()])
Program:
from collections import Counter

def empirical_central_moment(table: dict, k):
    mean = sum([value * frequency for value, frequency in table.items()]) / sum(list(table.values()))
    N = sum(list(table.values()))
    return sum([(value - mean)**k * frequency / N for value, frequency in table.items()])

sample = [100, 86, 51, 100, 95, 100, 12, 61, 0, 0, 12, 86, 0, 52, 62, 76, 91, 91, 62, 91, 65, 91, 9, 83, 67, 58, 56]
table = dict(Counter(sample))
print(empirical_central_moment(table, 3))
Problem: Instead of the desired -545.33983 ... I'm getting -26721.65147589292, and I just can't wrap my head around why I'm getting it wrong. I will appreciate any help, thanks in advance.
Your answer is correct. Not sure what other answer you might be looking for. In general, unless the purpose of this code is to practice programming the logic yourself, you don't need to reinvent the wheel; you'll be much faster and safer doing something as simple as:
from scipy.stats import moment

sample = [100, 86, 51, 100, 95, 100, 12, 61, 0, 0, 12, 86, 0, 52, 62, 76, 91, 91, 62, 91, 65, 91, 9, 83, 67, 58, 56]
print(moment(sample, moment=3, axis=0, nan_policy='propagate'))
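As a quick cross-check (my own addition, not part of the original answer), the same number falls out of computing the third central moment directly with numpy:
import numpy as np

sample = np.asarray([100, 86, 51, 100, 95, 100, 12, 61, 0, 0, 12, 86, 0, 52, 62, 76, 91, 91, 62, 91, 65, 91, 9, 83, 67, 58, 56])
print(np.mean((sample - sample.mean())**3))   # should reproduce the -26721.65... value from the question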

Why doesn't numpy create an array when executing a list method

Playing around with numpy:
import numpy as np
l = [39, 54, 72, 46, 89, 53, 96, 64, 2, 75]
nl = np.array(l.append(3))
>> array(None, dtype=object)
Now, if I call on l, I'll get the list: [39, 54, 72, 46, 89, 53, 96, 64, 2, 75, 3]
My question is, why doesn't numpy create that list as an array?
If I do something like this:
nl = np.array(l.extend([45]))
I get the same thing.
But if I try to concatenate without a method, nl = np.array(l + [45]), it works.
What is causing this behaviour?
The append function will always return None. You must do this in two different lines of code:
import numpy as np
l = [39, 54, 72, 46, 89, 53, 96, 64, 2, 75]
l.append(3)
nl = np.array(l)
append and extend are in-place methods and return None.
print(l.append(3)) # None
print(l.extend([3])) # None
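A small illustrative contrast (my own addition, not from the answers above): expressions that build and return a new list can be passed to np.array directly, while in-place methods cannot:
import numpy as np

l = [39, 54, 72]
print(np.array(l + [45]))    # l + [45] returns a new list -> array([39, 54, 72, 45])
print(np.array(sorted(l)))   # sorted() also returns a new list -> array([39, 54, 72])
print(l.sort())              # list.sort() works in place and returns None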

calculate the df.describe() for each value in a column and recreate a dataframe

Imagine the following data frame:
d = {'cluster': [1,1,3,4,2,2],
     'Weight': [65, 70, 68, 75, 78, 62],
     'Height': [170, 173, 174, 180, 184, 167]}
df = pd.DataFrame(d)
Now, how can I use a for loop to return a dataframe that calculates the average weight and height for each value in cluster?
If I write it naively, it looks like this:
# creating subsets and concat
a = pd.DataFrame(df[df['cluster'] == 1].describe().loc['mean'])
b = pd.DataFrame(df[df['cluster'] == 2].describe().loc['mean'])
....
DF = pd.concat([a, b], axis=1)
This gets ridiculous when a column contains more clusters.
Thank you.
import pandas as pd
d={'cluster': [1,1,3,4,2,2],
'Weight': [65, 70, 68, 75, 78, 62],
'Height': [170, 173, 174, 180, 184, 167]}
df=pd.DataFrame(d)
df.groupby('cluster').agg(['mean'])
This implementation also has the benefit that you can add further aggregation-based functions (e.g. median) in the future if necessary.
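For instance (just an illustrative extension of the same call):
df.groupby('cluster').agg(['mean', 'median'])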
Try:
import pandas as pd
d={'cluster': [1,1,3,4,2,2],
'Weight': [65, 70, 68, 75, 78, 62],
'Height': [170, 173, 174, 180, 184, 167]}
df=pd.DataFrame(d)
newdf = df.groupby('cluster').describe().iloc[:,1]
print(newdf)
EDIT: WeNYoBen does it better if you want only the means/don't need to pick anything else from describe()

Python: list(data) outputs list twice in terminal when reading a binary file. Is it a bug?

I was trying to create my own hex editor that lists the statistics of a binary file generated from Veracrypt. (I am still learning.)
File: Statistics.py
import Statistics
data = open('VERASHORT', 'rb').read()
print(list(data))
Anyway, the code above prints the contents of the binary file as a list twice. It is only three lines of code, but I am wondering why it won't work as expected. I have modified the code from the author, so it should work. (Learning Python.)
Here is the output after Python 3 is run. (The list appears twice.)
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 102, 102, 62, 90, 121, 113, 111, 92, 85, 102, 102, 102, 102, 102, 102, 102, 102, 52, 32, 38, 92, 85, 102, 102, 102, 102, 102, 102, 102, 102]
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 102, 102, 62, 90, 121, 113, 111, 92, 85, 102, 102, 102, 102, 102, 102, 102, 102, 52, 32, 38, 92, 85, 102, 102, 102, 102, 102, 102, 102, 102]
The "import Statistics" is the cause.
You just load Statistics.py twice, then you execute that code two times.
BTW, Python packages needs lowercase https://www.python.org/dev/peps/pep-0008/#package-and-module-names
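For what it's worth, a minimal sketch (my own illustration; the show helper and the filename are just placeholders) of the usual way to keep a module's script code from re-running when it gets imported, independent of renaming the file:
# Statistics.py
def show(path):
    data = open(path, 'rb').read()
    print(list(data))

# this block runs only when the file is executed as a script,
# not when it is imported as a module
if __name__ == '__main__':
    show('VERASHORT')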
Edit: I have solved the issue.
I renamed Statistics.py to Stat.py, so the module no longer imports itself.
Also, the Statistics import in my first line of code should have been lowercase, so I changed it; in fact, list(data) does not require any imports at all.
That is where I screwed up. Thanks for the help, everyone. (The hints did help me reach a quick conclusion!)

python, weighted linspace

Can anyone show me the best way to generate a (numpy) array containing values from 0 to 100 that is weighted by (for example) a normal distribution with mean 50 and variance 5? So that there are more 50s and fewer (nearly no) zeros and hundreds. I think the problem should not be too hard to solve, but I'm stuck somehow...
I thought about something with np.linspace, but it seems there is no weight option.
So just to be clear: I don't want a simple normally distributed sample from 0 to 100, but something like an array from 0 to 100 with a higher density of values in the middle.
Thanks
You can use scipy's stats distributions:
import numpy as np
from scipy import stats
# your distribution:
distribution = stats.norm(loc=50, scale=5)
# percentile point, the range for the inverse cumulative distribution function:
bounds_for_range = distribution.cdf([0, 100])
# Linspace for the inverse cdf:
pp = np.linspace(*bounds_for_range, num=1000)
x = distribution.ppf(pp)
# And just to check that it makes sense you can try:
from matplotlib import pyplot as plt
plt.hist(x)
plt.show()
Of course, the start and end points are not quite exact this way, due to numerical inaccuracies when going back and forth between the cdf and the ppf.
It is important to understand that your problem is not exactly solvable, since in general a finite discrete sample cannot exactly reproduce your distribution.
You can easily see this by asking a trivial version of your question, like a set of 3 values in [0, 1] with an equal distribution: both [0, 0, 1] and [0, 1, 1] would be reasonable results.
However, you can solve the problem roughly. If you ask for an array with count elements out of [0, 1, ..., N], where the given probabilities p = [p0, p1, ..., pN] are normalized (p0 + ... + pN == 1), then the count c_k of element k in your resulting array is theoretically
c[k] = p[k] * count
but these counts are now floats. You have to decide on a way to "round" them while keeping their total sum fixed; this is the freedom of choice arising from the under-specification of your question. One possible rounding scheme is sketched below.
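For what it's worth, here is a minimal sketch of one such rounding scheme (a largest-remainder approach; the weighted_values helper and the use of scipy.stats.norm are my own illustrative choices, and "variance 5" is translated into a standard deviation of sqrt(5)):
import numpy as np
from scipy import stats

def weighted_values(count, probs):
    # ideal (fractional) counts for each value 0..N
    ideal = probs * count
    base = np.floor(ideal).astype(int)
    # hand the remaining slots to the values with the largest remainders,
    # so the counts still sum to `count`
    remainder = ideal - base
    missing = count - base.sum()
    base[np.argsort(remainder)[::-1][:missing]] += 1
    # repeat each value according to its rounded count
    return np.repeat(np.arange(len(probs)), base)

probs = stats.norm(loc=50, scale=np.sqrt(5)).pdf(np.arange(101))
probs /= probs.sum()                 # normalize so p0 + ... + pN == 1
arr = weighted_values(1000, probs)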
>>> import random
>>> sorted([int(random.gauss(50, 5)) for i in range(100)])
[33, 40, 40, 40, 40, 40, 42, 42, 42, 42, 43, 43, 43, 43, 44, 44, 44, 44, 44, 45, 45, 45, 46, 46, 46, 46, 46, 46, 46, 47, 47, 47, 47, 47, 47, 47, 47, 47, 47, 48, 48, 48, 48, 48, 48, 48, 49, 49, 50, 50, 50, 50, 50, 51, 51, 51, 51, 51, 51, 51, 51, 51, 51, 51, 51, 52, 52, 52, 52, 52, 52, 52, 52, 52, 52, 53, 53, 53, 54, 54, 54, 54, 54, 54, 54, 54, 54, 55, 55, 56, 56, 57, 57, 57, 57, 57, 57, 57, 58, 61]
