how to get pandas pct_change result in percents? - python

I'm using pandas's example to do what I want to do:
>>> s = pd.Series([90, 91, 85])
>>> s
0 90
1 91
2 85
dtype: int64
then the pct_change() is applied to this series:
>>> s.pct_change()
0 NaN
1 0.011111
2 -0.065934
dtype: float64
okay, fair enough, but Percentage Increase = [ (Final Value - Starting Value) / |Starting Value| ] × 100
so the results should actually be [NaN, 1.11111%, -6.59341%].
how would I get this ×100 part that pct_change() didn't do for me?

You can simply multiply the result by 100 to get what you want:
In [712]: s.pct_change().mul(100)
Out[712]:
0 NaN
1 1.111111
2 -6.593407
dtype: float64
If you want the result to be a list of these values, do this:
In [714]: l = s.pct_change().mul(100).tolist()
In [715]: l
Out[715]: [nan, 1.1111111111111072, -6.593406593406592]

Try chaining the methods:
.pct_change().multiply(100)
after the desired DataFrame operation. You can chain more methods before or after.
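If you also want a percent sign in the display, one option (a minimal sketch building on the answers above; the 5-decimal rounding is my own choice, not something the question specifies) is to format the scaled values as strings:
pct = s.pct_change().mul(100)
labels = pct.map(lambda x: '{:.5f}%'.format(x) if pd.notna(x) else x)
# 0          NaN
# 1    1.11111%
# 2    -6.59341%
Keep in mind that once the values are strings you can no longer do arithmetic on them, so this is best done only for display.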

Related

How to sum each table of values in each index produced by dataframe.rolling.sum()

I work with large data sheets, in which I am trying to correlate all of the columns. I achieve this using:
df = df.rolling(5).corr(pairwise = True)
This produces data like this:
477
s1 -0.240339 0.932141 1.000000 0.577741 0.718307 -0.518748 0.772099
s2 0.534848 0.626280 0.577741 1.000000 0.645064 -0.455503 0.447589
s3 0.384720 0.907782 0.718307 0.645064 1.000000 -0.831378 0.406054
s4 -0.347547 -0.651557 -0.518748 -0.455503 -0.831378 1.000000 -0.569301
s5 -0.315022 0.576705 0.772099 0.447589 0.406054 -0.569301 1.000000
for each row contained in the data set. 477 in this case being the row number or index, and s1 - s5 being the column titles.
The goal is to find when the sensors are highly correlated with each other. I want to achieve this by (a) calculating the correlation using a rolling window of 5 rows with the code above, and (b) for each row produced, i.e. i = 0 to i = 500 for a 500-row Excel sheet, summing the table that dataframe.rolling(5).corr() produces for each value of i, i.e. producing one value per unit time, as in the graph included at the bottom. I am new to Stack Overflow, so please let me know if there's more information I can provide.
Example code + data:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
d = {'col1': [-2587.944231, -1897.324231,-2510.304231,-2203.814231,-2105.734231,-2446.964231,-2963.904231,-2177.254231, 2796.354231,-2085.304231], 'col2': [-3764.468462,-3723.608462,-3750.168462,-3694.998462,-3991.268462,-3972.878462,3676.608462,-3827.808462,-3629.618462,-1841.758462,], 'col3': [-166.1357692,-35.36576923, 321.4157692,108.9257692,-123.2257692, -10.84576923, -100.7457692, 89.27423077, -211.0857692, 101.5342308]}
df = pd.DataFrame(data=d)
dfn = df.rolling(5).corr(pairwise = True)
MATLAB code which accomplishes what I want:
% move through the data and get a correlation for 5 data points
for i=1:ns-4 C(:,:,i)=corrcoef(X(i:i+4,:));
cact(i)=sum(C(:,:,i),'all')-nv; % subtracting nv removes the diagonals that are = 1 and don't change
end
For the original data, the following is the graph I am trying to produce in Python, where the x axis is time:
Correlation Graph
Sum the entire table in both directions, then subtract the diagonal of 1's, which is each sensor correlated with itself.
Using your dfn, row four is
>>> dfn.loc[4]
col1 col2 col3
col1 1.000000 -0.146977 -0.227059
col2 -0.146977 1.000000 0.435216
col3 -0.227059 0.435216 1.000000
You can sum the complete table using Numpy's ndarray.sum() on the underlying data
>>> dfn.loc[4].to_numpy().sum()
3.1223603416753103
Then assuming the correlation table is square you just need to subtract the number of columns/sensors. If there isn't already a variable you can use the shape of the underlying numpy array.
>>> v = dfn.loc[4].to_numpy()
>>> v.shape
(3, 3)
>>> v.sum() - v.shape[0]
0.12236034167531029
>>>
Without using the numpy array, you could sum the correlation table twice before subtracting.
>>> four = dfn.loc[4]
>>> four.sum().sum()
3.1223603416753103
>>> four.sum().sum() - four.shape[0]
0.12236034167531029
Get the numpy array of the whole rolling correlation result and reshape it to get a separate correlation table for each original row.
n_sensors = 3
v = dfn.to_numpy() # v.shape = (30,3)
new_dims = df.shape[0], n_sensors, n_sensors
v = v.reshape(new_dims) # shape = (10,3,3)
print(v[4])
[[ 1. -0.14697697 -0.22705934]
[-0.14697697 1. 0.43521648]
[-0.22705934 0.43521648 1. ]]
Sum across the last two dimensions and subtract the number of sensors
result = v.sum((1,2)) - n_sensors
print(result)
[nan, nan, nan, nan, 0.12236034, 0.25316027, -2.40763192, -1.9370202, -2.28023618, -2.57886457]
There is probably a way to do that in Pandas but I'd have to work on that to figure it out. Maybe someone will answer with an all Pandas solution.
The rolling correlation DataFrame has a MultiIndex:
>>> dfn.index
MultiIndex([(0, 'col1'),
(0, 'col2'),
(0, 'col3'),
(1, 'col1'),
(1, 'col2'),
(1, 'col3'),
(2, 'col1'),
(2, 'col2'),
(2, 'col3'),
...
With a quick review of the MultiIndex docs and a search for "pandas multi index sum on level 0 site:stackoverflow.com", I came up with: group by level 0 and sum, then sum again along the columns.
>>> four_five = dfn.loc[[4,5]]
>>> four_five
col1 col2 col3
4 col1 1.000000 -0.146977 -0.227059
col2 -0.146977 1.000000 0.435216
col3 -0.227059 0.435216 1.000000
5 col1 1.000000 0.191238 -0.644203
col2 0.191238 1.000000 0.579545
col3 -0.644203 0.579545 1.000000
>>> four_five.groupby(level=0).sum()
col1 col2 col3
4 0.625964 1.288240 1.208157
5 0.547035 1.770783 0.935343
>>> four_five.groupby(level=0).sum().sum(1)
4 3.12236
5 3.25316
dtype: float64
>>>
Then for the complete DataFrame.
>>> dfn.groupby(level=0).sum().sum(1) - n_sensors
0 -3.000000
1 -3.000000
2 -3.000000
3 -3.000000
4 0.122360
5 0.253160
6 -2.407632
7 -1.937020
8 -2.280236
9 -2.578865
dtype: float64
Reading a few more of the answers from that search (I should have looked at the DataFrame.sum docs more closely), the groupby can be replaced by summing on level 0 directly:
>>> dfn.sum(level=0).sum(1) - n_sensors
0 -3.000000
1 -3.000000
2 -3.000000
3 -3.000000
4 0.122360
5 0.253160
6 -2.407632
7 -1.937020
8 -2.280236
9 -2.578865
dtype: float64
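Putting the pieces of this answer together, here is a minimal end-to-end sketch. The random data and the variable names are my own stand-ins for the real sensor sheet, so treat it as an illustration rather than the exact original code:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

np.random.seed(0)
df = pd.DataFrame(np.random.randn(500, 5), columns=['s1', 's2', 's3', 's4', 's5'])

window = 5
n_sensors = df.shape[1]

# rolling pairwise correlation, summed per original row,
# minus the diagonal of 1's (each sensor with itself)
dfn = df.rolling(window).corr(pairwise=True)
corr_sum = dfn.groupby(level=0).sum().sum(axis=1) - n_sensors

corr_sum.plot()   # one value per unit time, as in the MATLAB loop
plt.show()
Note that the first window - 1 rows come out as -n_sensors, because their correlation tables are all NaN and groupby().sum() treats NaN as 0; you may want to drop or mask them before plotting.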

Is there a limit for zip() elements in loops in python?

Something weird happened to me today. I needed to create a list based on a sequence of if statements. My dataframe looks something like this:
prom_lect4b_rbd prom_lect2m_rbd prom_lect8b_rbd prom_lect6b_rbd
100 np.nan 80 200
np.nan np.nan 40 1000
np.nan np.nan np.nan 90
230 100 80 100
Columns are ordered according to their priority. The list (or column) I'm trying to create takes, from each row, the first value that is not NaN. So, in this case I want a column that looks like this:
simce_final_lect
100
40
90
230
I tried the following:
cols=[simces.prom_lect4b_rbd, simces.prom_lect2m_rbd, simces.prom_lect8b_rbd, simces.prom_lect6b_rbd]
simce_final_lect=[j if np.isnan(j)==False else k if np.isnan(k)==False
else l if np.isnan(l)==False else m if np.isnan(m)==False
else np.nan for j,k,l,m in zip(cols[0],cols[1],cols[2],cols[3])]
And that just copies two values (out of 8752) to the list. But if I limit my zip to just j,k,l, it works perfectly:
simce_final_lect=[j if np.isnan(j)==False else k if np.isnan(k)==False
else l if np.isnan(l)==False
else np.nan for j,k,l in zip(cols[0],cols[1],cols[2])]
Do you know what is happening? Otherwise, is there a more efficient solution to my problem?
You can use bfill(axis=1) and select the first col.
df.bfill(axis=1).iloc[:,0]
0 100.0
1 40.0
2 90.0
3 230.0
Name: prom_lect4b_rbd, dtype: float64
## For list
df.bfill(axis=1).iloc[:,0].tolist()
[100.0, 40.0, 90.0, 230.0]
Use first_valid_index():
df.apply(lambda x: x[x.first_valid_index()], axis=1)
Yields:
0 100.0
1 40.0
2 90.0
3 230.0
dtype: float64
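For reference, here is a small self-contained sketch of the bfill approach on a reconstruction of the example frame from the question (the NaN layout is copied from the table above, so treat it as an approximation):
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'prom_lect4b_rbd': [100, np.nan, np.nan, 230],
    'prom_lect2m_rbd': [np.nan, np.nan, np.nan, 100],
    'prom_lect8b_rbd': [80, 40, np.nan, 80],
    'prom_lect6b_rbd': [200, 1000, 90, 100],
})

# first non-NaN value per row, taking the columns in priority order left to right
simce_final_lect = df.bfill(axis=1).iloc[:, 0]
print(simce_final_lect.tolist())   # [100.0, 40.0, 90.0, 230.0]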

How do I print entire number in Python from describe() function?

I am doing some statistical work using Python's pandas and I have the following code to print out the data description (mean, count, median, etc).
data=pandas.read_csv(input_file)
print(data.describe())
But my data is pretty big (around 4 million rows) and each row holds very small values. So inevitably the count is big and the mean is pretty small, and Python prints them in scientific notation.
I just want to print these numbers in full, for ease of use and understanding; for example, 4393476 is better than 4.393476e+06. I have googled around and the most I can find is "Display a float with two decimal places in Python" and some other similar posts. But that only works if I already have the numbers in a variable, which isn't my case: the numbers are created by the describe() function, so I don't know in advance what numbers I will get.
Sorry if this seems like a very basic question, I am still new to Python. Any response is appreciated. Thanks.
Edit
I checked the docs and you should probably use the pandas.set_option API to do this:
In [13]: df
Out[13]:
a b c
0 4.405544e+08 1.425305e+08 6.387200e+08
1 8.792502e+08 7.135909e+08 4.652605e+07
2 5.074937e+08 3.008761e+08 1.781351e+08
3 1.188494e+07 7.926714e+08 9.485948e+08
4 6.071372e+08 3.236949e+08 4.464244e+08
5 1.744240e+08 4.062852e+08 4.456160e+08
6 7.622656e+07 9.790510e+08 7.587101e+08
7 8.762620e+08 1.298574e+08 4.487193e+08
8 6.262644e+08 4.648143e+08 5.947500e+08
9 5.951188e+08 9.744804e+08 8.572475e+08
In [14]: pd.set_option('float_format', '{:f}'.format)
In [15]: df
Out[15]:
a b c
0 440554429.333866 142530512.999182 638719977.824965
1 879250168.522411 713590875.479215 46526045.819487
2 507493741.709532 300876106.387427 178135140.583541
3 11884941.851962 792671390.499431 948594814.816647
4 607137206.305609 323694879.619369 446424361.522071
5 174424035.448168 406285189.907148 445616045.754137
6 76226556.685384 979050957.963583 758710090.127867
7 876261954.607558 129857447.076183 448719292.453509
8 626264394.999419 464814260.796770 594750038.747595
9 595118819.308896 974480400.272515 857247528.610996
In [16]: df.describe()
Out[16]:
a b c
count 10.000000 10.000000 10.000000
mean 479461624.877280 522785202.100082 536344333.626082
std 306428177.277935 320806568.078629 284507176.411675
min 11884941.851962 129857447.076183 46526045.819487
25% 240956633.919592 306580799.695412 445818124.696121
50% 551306280.509214 435549725.351959 521734665.600552
75% 621482597.825966 772901261.744377 728712562.052142
max 879250168.522411 979050957.963583 948594814.816647
End of edit
Suppose you have the following DataFrame:
In [7]: df
Out[7]:
a b c
0 4.405544e+08 1.425305e+08 6.387200e+08
1 8.792502e+08 7.135909e+08 4.652605e+07
2 5.074937e+08 3.008761e+08 1.781351e+08
3 1.188494e+07 7.926714e+08 9.485948e+08
4 6.071372e+08 3.236949e+08 4.464244e+08
5 1.744240e+08 4.062852e+08 4.456160e+08
6 7.622656e+07 9.790510e+08 7.587101e+08
7 8.762620e+08 1.298574e+08 4.487193e+08
8 6.262644e+08 4.648143e+08 5.947500e+08
9 5.951188e+08 9.744804e+08 8.572475e+08
In [8]: df.describe()
Out[8]:
a b c
count 1.000000e+01 1.000000e+01 1.000000e+01
mean 4.794616e+08 5.227852e+08 5.363443e+08
std 3.064282e+08 3.208066e+08 2.845072e+08
min 1.188494e+07 1.298574e+08 4.652605e+07
25% 2.409566e+08 3.065808e+08 4.458181e+08
50% 5.513063e+08 4.355497e+08 5.217347e+08
75% 6.214826e+08 7.729013e+08 7.287126e+08
max 8.792502e+08 9.790510e+08 9.485948e+08
You need to fiddle with the pandas.options.display.float_format attribute. Note, in my code I've used import pandas as pd. A quick fix is something like:
In [29]: pd.options.display.float_format = "{:.2f}".format
In [10]: df
Out[10]:
a b c
0 440554429.33 142530513.00 638719977.82
1 879250168.52 713590875.48 46526045.82
2 507493741.71 300876106.39 178135140.58
3 11884941.85 792671390.50 948594814.82
4 607137206.31 323694879.62 446424361.52
5 174424035.45 406285189.91 445616045.75
6 76226556.69 979050957.96 758710090.13
7 876261954.61 129857447.08 448719292.45
8 626264395.00 464814260.80 594750038.75
9 595118819.31 974480400.27 857247528.61
In [11]: df.describe()
Out[11]:
a b c
count 10.00 10.00 10.00
mean 479461624.88 522785202.10 536344333.63
std 306428177.28 320806568.08 284507176.41
min 11884941.85 129857447.08 46526045.82
25% 240956633.92 306580799.70 445818124.70
50% 551306280.51 435549725.35 521734665.60
75% 621482597.83 772901261.74 728712562.05
max 879250168.52 979050957.96 948594814.82
import numpy as np
import pandas as pd
np.random.seed(2016)
N = 4393476
df = pd.DataFrame(np.random.uniform(1e-4, 0.1, size=(N,3)), columns=list('ABC'))
desc = df.describe()
desc.loc['count'] = desc.loc['count'].astype(int).astype(str)
desc.iloc[1:] = desc.iloc[1:].applymap('{:.6f}'.format)
print(desc)
yields
A B C
count 4393476 4393476 4393476
mean 0.050039 0.050056 0.050057
std 0.028834 0.028836 0.028849
min 0.000100 0.000100 0.000100
25% 0.025076 0.025081 0.025065
50% 0.050047 0.050050 0.050037
75% 0.074987 0.075027 0.075055
max 0.100000 0.100000 0.100000
Under the hood, DataFrames are organized in columns. The values in a column can only have one data type (the column's dtype).
The DataFrame returned by df.describe() has columns of floating-point dtype:
In [116]: df.describe().info()
<class 'pandas.core.frame.DataFrame'>
Index: 8 entries, count to max
Data columns (total 3 columns):
A 8 non-null float64
B 8 non-null float64
C 8 non-null float64
dtypes: float64(3)
memory usage: 256.0+ bytes
DataFrames do not allow you to treat one row as integers and the other rows as floats.
However, if you change the contents of the DataFrame to strings, then you have full control over the way the values are displayed
since all the values are just strings.
Thus, to create a DataFrame in the desired format, you could use
desc.loc['count'] = desc.loc['count'].astype(int).astype(str)
to convert the count row to integers (by calling astype(int)), and then convert the integers to strings (by calling astype(str)). Then
desc.iloc[1:] = desc.iloc[1:].applymap('{:.6f}'.format)
converts the rest of the floats to strings using the str.format method to format the floats to 6 digits after the decimal point.
Alternatively, you could use
import numpy as np
import pandas as pd
np.random.seed(2016)
N = 4393476
df = pd.DataFrame(np.random.uniform(1e-4, 0.1, size=(N,3)), columns=list('ABC'))
desc = df.describe().T
desc['count'] = desc['count'].astype(int)
print(desc)
which yields
count mean std min 25% 50% 75% max
A 4393476 0.050039 0.028834 0.0001 0.025076 0.050047 0.074987 0.1
B 4393476 0.050056 0.028836 0.0001 0.025081 0.050050 0.075027 0.1
C 4393476 0.050057 0.028849 0.0001 0.025065 0.050037 0.075055 0.1
By transposing the desc DataFrame, the counts are now in their own column.
So now the problem can be solved by converting that column's dtype to int.
One advantage of doing it this way is that the values in desc remain numerical.
So further calculations based on the numeric values can still be done.
I think this solution is preferable, provided that the transposed format is acceptable.
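If the readable formatting is only needed for one print rather than globally, a variation on the set_option answers above (my own sketch, not part of the original answers) is pandas' option_context context manager, which restores the previous setting when the block exits:
import numpy as np
import pandas as pd

np.random.seed(2016)
df = pd.DataFrame(np.random.uniform(1e-4, 0.1, size=(1000, 3)), columns=list('ABC'))

with pd.option_context('display.float_format', '{:.6f}'.format):
    print(df.describe())   # fixed-point notation inside the block
print(df.describe())       # default (possibly scientific) notation outside it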

apply function on groups of k elements of a pandas Series

I have a pandas Series:
0 1
1 5
2 20
3 -1
Lets say I want to apply mean() on every two elements, so I get something like this:
0 3.0
1 9.5
Is there an elegant way to do this?
You can group by the index floor-divided by k=2:
k = 2
print (s.index // k)
Int64Index([0, 0, 1, 1], dtype='int64')
print (s.groupby([s.index // k]).mean())
name
0 3.0
1 9.5
You can do this:
(s.iloc[::2].values + s.iloc[1::2])/2
if you want you can also reset the index afterwards, so you have 0, 1 as the index, using:
((s.iloc[::2].values + s.iloc[1::2])/2).reset_index(drop=True)
If you are using this over large series and many times, you'll want to consider a fast approach. This solution uses all numpy functions and will be fast.
Use reshape and construct new pd.Series
consider the pd.Series s
s = pd.Series([1, 5, 20, -1])
generalized function
def mean_k(s, k):
    pad = (k - s.shape[0] % k) % k
    nan = np.repeat(np.nan, pad)
    val = np.concatenate([s.values, nan])
    return pd.Series(np.nanmean(val.reshape(-1, k), axis=1))
demonstration
mean_k(s, 2)
0 3.0
1 9.5
dtype: float64
mean_k(s, 3)
0 8.666667
1 -1.000000
dtype: float64
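As a quick check of the padding step (my own example, reusing mean_k and the imports from above), a series whose length is not a multiple of k still works, because the missing slot is padded with NaN and ignored by nanmean:
mean_k(pd.Series([1, 5, 20, -1, 7]), 2)
0    3.0
1    9.5
2    7.0
dtype: float64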

Funny results with pandas argsort

I think I have hit on a bug in pandas. I was hoping to get some help either verifying the bug or helping me figure out where my logic error is located in my code.
My code is as follows:
import pandas, numpy, StringIO
def sq_fixer(sr):
    sr = sr.where(sr != '20200229')
    ranks = sr.argsort().astype(float)
    ranks[ranks == -1] = numpy.nan
    return ','.join(ranks.astype(numpy.str))

def correct_date(sr):
    date_fixer = lambda x: pandas.datetime(x.year - 100, x.month, x.day) if x > pandas.datetime.now() else x
    sr = pandas.to_datetime(sr).apply(date_fixer).astype(pandas.datetime)
    return sr
txt = '''ID,RUN_START_DATE,PUSHUP_START_DATE,SITUP_START_DATE,PULLUP_START_DATE
1,2013-01-24,2013-01-02,,2013-02-03
2,2013-01-30,2013-01-21,2013-01-13,2013-01-06
3,2013-01-29,2013-01-28,2013-01-01,2013-01-29
4,2013-02-16,2013-02-12,2013-01-04,2013-02-11
5,2013-01-06,2013-02-07,2013-02-25,2013-02-12
6,2013-01-26,2013-01-28,2013-02-12,2013-01-10
7,2013-01-26,,2013-01-12,2013-01-30
8,2013-01-03,2013-01-24,2013-01-19,2013-01-02
9,2013-01-22,2013-01-13,2013-02-03,
10,2013-02-06,2013-01-16,2013-02-07,2013-01-11
3347,,2008-02-27,2008-04-10,2008-02-13
3588,2004-09-12,,2004-11-06,2004-09-06
3784,2003-02-22,,2003-06-21,2003-02-19
593,2009-04-03,,2009-06-01,2009-04-01
4148,2003-03-21,2002-09-20,2003-04-01,2003-01-01
4299,2004-05-24,2004-07-23,,2004-04-22
4590,2005-05-05,2005-12-05,2005-04-05,
4830,2001-06-12,2000-10-12,2001-07-28,2001-01-28
4941,2006-11-08,2006-12-19,2006-07-19,2007-02-24
1416,2004-04-03,2004-05-19,2004-02-06,
1580,2008-12-20,,2009-03-19,2008-12-19
1661,2005-10-03,2005-10-26,2005-09-12,2006-02-19
1759,2001-10-18,,2002-01-17,2001-10-17
1858,2003-04-14,2003-05-17,,2002-12-17
1972,2003-06-01,2003-07-14,2002-12-14,
5905,2000-11-18,2001-01-13,,2000-11-04
2052,2002-06-11,,2002-08-23,2001-12-12
2165,2006-10-01,,2007-02-27,2006-09-30
2218,2007-09-19,,2008-02-06,2007-09-09
2350,2000-08-08,,2000-09-22,2000-01-08
2432,2001-08-22,,2001-09-25,2000-12-16
2611,2005-05-07,,2005-06-05,2005-03-26
2612,2005-05-06,,2005-05-26,2005-04-11
7378,2009-08-07,2009-01-30,2010-01-20,2009-06-08
7550,2006-04-08,,2006-06-01,2006-04-01 '''
df = pandas.read_csv(StringIO.StringIO(txt))
sequence_array = ['RUN_START_DATE', 'PUSHUP_START_DATE', 'SITUP_START_DATE', 'PULLUP_START_DATE']
xsequence_array = ['X_RUN_START_DATE', 'X_PUSHUP_START_DATE', 'X_SITUP_START_DATE', 'X_PULLUP_START_DATE']
df[sequence_array] = df[sequence_array].apply(correct_date, axis=1)
fix_day = lambda x: x if x > 0 else 29
fix_month = lambda x: x if x > 0 else 02
fix_year = lambda x: x if x > 0 else 2020
for col in sequence_array:
    xcol = 'X_{0}'.format(col)
    df[xcol] = ['{0:04d}{1:02d}{2:02d}'.format(fix_year(c.year), fix_month(c.month), fix_day(c.day)) for c in df[col]]
df['X_AS_SEQUENCE'] = df[xsequence_array].apply(sq_fixer, axis=1)
When I run the code most of the results are correct. Take for example index 6:
In [31]: df.ix[6]
Out[31]:
ID 7
RUN_START_DATE 2013-01-26 00:00:00
PUSHUP_START_DATE NaN
SITUP_START_DATE 2013-01-12 00:00:00
PULLUP_START_DATE 2013-01-30 00:00:00
X_RUN_START_DATE 20130126
X_PUSHUP_START_DATE 20200229
X_SITUP_START_DATE 20130112
X_PULLUP_START_DATE 20130130
X_AS_SEQUENCE 1.0,nan,0.0,2.0
However, certain indices seem to throw pandas.argsort() for a loop. Take for example index 10:
In [32]: df.ix[10]
Out[32]:
ID 3347
RUN_START_DATE NaN
PUSHUP_START_DATE 2008-02-27 00:00:00
SITUP_START_DATE 2008-04-10 00:00:00
PULLUP_START_DATE 2008-02-13 00:00:00
X_RUN_START_DATE 20200229
X_PUSHUP_START_DATE 20080227
X_SITUP_START_DATE 20080410
X_PULLUP_START_DATE 20080213
X_AS_SEQUENCE nan,2.0,0.0,1.0
The argsort should return nan,1.0,2.0,0.0 instead of nan,2.0,0.0,1.0.
I have been on this for three days. At this point I am not sure if it is me or a bug. I am not sure how to backtrace it to get an answer. Any help would be most appreciated!
You might be interpreting the result of argsort incorrectly. argsort does not give the ranking of the values. Use the rank method if you want to rank the values.
The values in the Series returned by argsort give the corresponding positions of the original values after dropping the NaNs. In your case, since you convert 20200229 to NaN, you are argsorting NaN, 20080227, 20080410, 20080213. The non-NaN values are
nonnan = [20080227, 20080410, 20080213]
The result, NaN, 2, 0, 1 says:
argsort sorted values
NaN NaN
2 nonnan[2] = 20080213
0 nonnan[0] = 20080227
1 nonnan[1] = 20080410
So it looks OK to me.
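To make the rank suggestion concrete, here is a small sketch of my own using plain numbers in place of the date strings from index 10; subtracting 1 turns the 1-based ranks into the 0-based ordering the question expected:
>>> import numpy as np
>>> import pandas as pd
>>> vals = pd.Series([np.nan, 20080227, 20080410, 20080213])
>>> vals.rank() - 1
0    NaN
1    1.0
2    2.0
3    0.0
dtype: float64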
If you want to sort a Series, just use the sort_values() or rank() functions:
In [2]: a=pd.Series([3,2,1])
In [3]: a
Out[3]:
0 3
1 2
2 1
dtype: int64
In [4]: a.sort_values()
Out[4]:
2 1
1 2
0 3
dtype: int64
If you use argsort(), it gives, for each position in the sorted series, the index of the element that belongs there: 1 (at index 2) goes to position 0, 2 (at index 1) to position 1, and 3 (at index 0) to position 2.
In [5]: a.argsort()
Out[5]:
0 2
1 1
2 0
dtype: int64
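To connect argsort back to sorting (a small addition of mine, not from the original answer): indexing the series positionally with its own argsort result reproduces sort_values():
In [6]: a.iloc[a.argsort()]
Out[6]:
2 1
1 2
0 3
dtype: int64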
