Row-wise calculations (Python) - python

Trying to run the following code to create a new column 'Median_Rank':
N = data2.Rank.count()
for i in data2.Rank:
    data2['Median_Rank'] = i - 0.3/(N+0.4)
But I'm getting a constant value of 0.99802, even though my Rank column is as follows:
data2.Rank.head()
Out[464]:
4131 1.0
4173 3.0
4172 3.0
4132 3.0
5335 10.0
4171 10.0
4159 10.0
5079 10.0
4115 10.0
4179 10.0
4180 10.0
4147 10.0
4181 10.0
4175 10.0
4170 10.0
4116 24.0
4129 24.0
4156 24.0
4153 24.0
4160 24.0
5358 24.0
4152 24.0
Somebody please point out the errors in my code.

Your code isn't vectorised. Use this:
N = data2.Rank.count()
data2['Median_Rank'] = data2['Rank'] - 0.3 / (N+0.4)
The reason your code does not work is that you are assigning the entire column on every loop iteration, so only the last value of i sticks and all entries of data2['Median_Rank'] end up identical.
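A tiny self-contained demonstration of the effect (a hypothetical three-row frame, not the asker's data): each pass through the loop overwrites the whole column, so only the final i survives.
import pandas as pd

data2 = pd.DataFrame({'Rank': [1.0, 3.0, 10.0]})
N = data2.Rank.count()
for i in data2.Rank:
    data2['Median_Rank'] = i - 0.3 / (N + 0.4)  # assigns the same scalar to every row each pass
print(data2['Median_Rank'].nunique())  # 1 -- the whole column holds the value from the last i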

This occurs because every time you execute data2['Median_Rank']=i-0.3/(N+0.4) you are updating the entire column with the value calculated by the expression. The easiest way to do this doesn't need a loop at all:
N = data2.Rank.count()
data2['Median_Rank'] = data2.Rank - 0.3/(N+0.4)
This is possible because pandas supports element-wise operations on Series.
If you still want to use a for loop, you will need to use .at and iterate over the rows as follows:
for i, el in zip(data2.index, data2.Rank.values):
    data2.at[i, 'Median_Rank'] = el - 0.3/(N+0.4)

Related

ASC files not preserving empty columns when added to df Python

I have a load of ASC files to extract data from. The issue I am having is that some of the columns have empty rows where there is no data. When I load these files into a df, it populates the first columns with all the data and just adds NaNs to the end... like this:
a | b | c
1 | 2 | nan
when I want it to be:
a | b   | c
1 | nan | 2
(I can't figure out how to make a table here to save my life.) Where there is no data, I want it to preserve the space. Part of my code says the separator is any run of two or more whitespace characters, so I can preserve the headers that have a single space within them; I think this is causing the issue but I am not sure how to fix it. I've tried using astropy.io to open the files and determine the delimiter, but I get an error that the number of columns doesn't match the data columns.
Here's an image of the general look of the files I have, so you can see the lack of character delimiters and the empty columns.
import pandas as pd

starting_words = ['Core no.', 'Core No.', 'Core', 'Core no.']
data = []
file_paths = []
for file in filepaths:
    with open(file) as f:
        for i, l in enumerate(f):
            if l.startswith(tuple(starting_words)):
                df = pd.read_csv(file, sep='\\s{2,}', engine='python', skiprows=i)
                file_paths.append(file.stem + file.suffix)
                df.insert(0, 'Filepath', file)
                data += [df]
                break
This is the script I've used to open the files and keep the header words together. I never got the astropy stuff to run - I either get the 'columns don't match' error or it could not determine the file format. Also, this code has the skiprows part because the files all have random notes at the top that I don't want in my dataframe.
Your data looks well behaved; you could try making use of pandas read_fwf to read the files as fixed-width formatted lines. If the inference from read_fwf is not good enough for you, you can manually describe the extents of the fixed-width fields of each line using the colspecs parameter.
Sample
Core no. Depth Depth Perm Porosity Saturations Oil
ft m mD % %
1 5516.0 1681.277 40.0 1.0
2 5527.0 1684.630 39.0 16.0
3 5566.0 1696.517 508 37.0 4.0
5571.0 1698.041 105 33.0 8.0
6 5693.0 1735.226 44.0 16.0
5702.0 1737.970 4320 35.0 31.0
9 5686.0 1733.093 2420 33.0 26.0
df = pd.read_fwf('sample.txt', skiprows=2, header=None)
df.columns = ['Core no.', 'Depth ft', 'Depth m', 'Perm mD', 'Porosity%', 'Saturations Oil%']
print(df)
Output from df
Core no. Depth ft Depth m Perm mD Porosity% Saturations Oil%
0 1.0 5516.0 1681.277 NaN 40.0 1.0
1 2.0 5527.0 1684.630 NaN 39.0 16.0
2 3.0 5566.0 1696.517 508.0 37.0 4.0
3 NaN 5571.0 1698.041 105.0 33.0 8.0
4 6.0 5693.0 1735.226 NaN 44.0 16.0
5 NaN 5702.0 1737.970 4320.0 35.0 31.0
6 9.0 5686.0 1733.093 2420.0 33.0 26.0
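If the automatic width inference mis-detects a boundary, colspecs can be given explicitly; a hedged sketch where the character ranges are hypothetical and would need to be measured from the real files:
import pandas as pd

colspecs = [(0, 9), (9, 18), (18, 29), (29, 37), (37, 46), (46, 58)]  # hypothetical (start, end) extents
df = pd.read_fwf('sample.txt', skiprows=2, header=None, colspecs=colspecs)
df.columns = ['Core no.', 'Depth ft', 'Depth m', 'Perm mD', 'Porosity%', 'Saturations Oil%']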

Pandas aggregating and comparing across conditions

Say I have a dataframe df of many conditions
w_env numChances initial_cost ratio ev
0 0.5 1.0 4.0 1.2 6.800000
1 0.6 1.0 4.0 1.2 2.960000
... ... ... ... ... ...
1195 0.6 3.0 12.0 2.6 8.009467
1196 0.7 3.0 12.0 2.6 7.409467
My objective is to group the dataframe by initial_cost and ratio (averaging over w_env) and then calculate the difference in the ev column between numChances=3 and numChances=1.
Then, find the initial_cost and ratio that correspond to the maximum difference
(i.e. for what initial_cost and ratio is ev (when numChances==3) - ev (when numChances==1) the largest?).
I tried
df.groupby(["numChances", "initial_cost", "ratio"]).agg({"ev": "mean"})
and then pivoting so that I can line up the rows for numChances=1 and numChances=3, but this seems overly complicated.
Is there a simpler way to solve this problem?
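A hedged sketch of the unstack approach described above, assuming the column names shown in the frame: average ev per (numChances, initial_cost, ratio), move numChances into the columns, take the difference, and locate its maximum.
ev = (df.groupby(["numChances", "initial_cost", "ratio"])["ev"]
        .mean()
        .unstack("numChances"))        # columns become the numChances values (1.0 and 3.0)
diff = ev[3.0] - ev[1.0]               # ev difference per (initial_cost, ratio) pair
best_cost, best_ratio = diff.idxmax()  # pair with the largest difference
print(best_cost, best_ratio, diff.max())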

How do I sum on the outermost level of a multi-index (row)?

I am trying to figure out how to sum on the outermost level of my multi-index. So I want to sum the COUNTS column for each operator, across all the shops listed for it.
df=pd.DataFrame(data.groupby('OPERATOR').SHOP.value_counts())
df=df.rename(columns={'SHOP':'COUNTS'})
df['COUNTS'] = df['COUNTS'].astype(float)
df['percentage']=df.groupby(['OPERATOR'])['COUNTS'].sum()
df['percentage']=df.sum(axis=0, level=['OPERATOR', 'SHOP'])
df.head()
COUNTS percentage
OPERATOR SHOP
AVIANCA CC9 3.0 3.0
FF9 1.0 1.0
IHI 1.0 1.0
Aegean HA9 33.0 33.0
IN9 24.0 24.0
When I use the df.sum call, it lets me call it on both levels, but when I change it to df.sum(axis=0, level=['OPERATOR']), the percentage column ends up NaN. I originally had the COUNTS column as int, so I thought maybe that was the issue and converted it to float, but this didn't resolve the issue. This is the desired output:
COUNTS percentage
OPERATOR SHOP
AVIANCA CC9 3.0 5.0
FF9 1.0 5.0
IHI 1.0 5.0
Aegean HA9 33.0 57.0
IN9 24.0 57.0
(This is just a stepping stone on the way to calculating the percentage for each shop respective to the operator, i.e. the FINAL final output would be):
COUNTS percentage
OPERATOR SHOP
AVIANCA CC9 3.0 .6
FF9 1.0 .2
IHI 1.0 .2
Aegean HA9 33.0 .58
IN9 24.0 .42
So bonus points if you include the last step of that as well!! Please help me!!!
Group by OPERATOR and normalize your data:
df['percentage'] = df.groupby('OPERATOR')['COUNTS'] \
                     .transform(lambda x: x / x.sum()) \
                     .round(2)
>>> df
COUNTS percentage
OPERATOR SHOP
AVIANCA CC9 3.0 0.60
FF9 1.0 0.20
IHI 1.0 0.20
Aegean HA9 33.0 0.58
IN9 24.0 0.42
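If you also want the intermediate output shown in the question (each operator's total repeated on every row), transform('sum') gives it; a brief sketch:
df['percentage'] = df.groupby('OPERATOR')['COUNTS'].transform('sum')   # per-operator total on each row
df['percentage'] = (df['COUNTS'] / df['percentage']).round(2)          # then the final per-shop share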

python pandas dataframe equivalent function logic for nonzero value calculation

This is Pine Script code which I am trying to write in Python. What would an optimized equivalent in Python be?
Here kama[1] is the previous kama value. For the first calculation in the series, what should be done for this kama[1] value, since it does not exist the first time?
kama=nz(kama[1], close[1])+smooth*(close[1]-nz(kama[1], close[1]))
Pine Script info:
nz
Replaces NaN values with zeros (or given value) in a series.
nz(x, y) → integer
nz(sma(close, 100))
RETURNS
Two args version: returns x if it's a valid (not NaN) number, otherwise y
One arg version: returns x if it's a valid (not NaN) number, otherwise 0
ARGUMENTS
x (series) Series of values to process.
y (float) Value that will be inserted instead of all NaN values in x series.
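For reference, a hedged pandas analogue of nz (not part of the question's code, just an illustration) could look like:
def nz(x, y=0):
    # x is a pandas Series; y may be a scalar or an aligned Series.
    # Returns x where it is a valid (not NaN) number, otherwise y (0 by default).
    return x.where(x.notna(), y)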
Edit 1: something I tried, as below, that is not working:
stockdata['kama'] = stockdata['kama'](-1) if stockdata['kama'](-1) !=0 \
else stockdata['close'] + stockdata['smooth']*(stockdata['close'] - \
stockdata['kama'](-1) if stockdata['kama'](-1) !=0 else stockdata['close'])
Edit 2: the alternative I tried just to make sure at least one part is working, but that is also failing (nz(kama[1], close)):
stockdata['kama'] = np.where(stockdata['kama'][-1] != 0, stockdata['kama'][-1], stockdata['close'])
I am completely stuck now; if this line of Pine Script code
kama=nz(kama[1], close)+smooth*(close-nz(kama[1], close))
is not converted to Python, my whole logic will go for a toss. Any working solutions are greatly appreciated.
Edit 3: the DataFrame input of the series:
open high low close adjusted_close \
date
2002-07-01 5.2397 5.5409 5.2397 5.4127 0.0634
2002-07-02 5.5234 5.5370 5.4214 5.4438 0.0638
2002-07-03 5.5060 5.5458 5.3281 5.4661 0.0640
2002-07-04 5.5011 5.5720 5.4175 5.5283 0.0647
2002-07-05 5.5633 5.6566 5.4749 5.5905 0.0655
2002-07-08 5.5011 5.7187 5.5011 5.6255 0.0659
2002-07-09 5.5905 5.7586 5.5681 5.6167 0.0658
2002-07-10 5.4885 5.4885 5.1465 5.2222 0.0612
2002-07-11 4.9784 5.2135 4.9784 5.1863 0.0607
2002-07-12 5.5011 5.5011 5.2446 5.3194 0.0623
2002-07-15 5.3243 5.4797 5.1912 5.3330 0.0625
2002-07-16 5.1999 5.4389 5.1999 5.3155 0.0623
2002-07-17 4.7024 5.1377 4.6189 5.0445 0.0591
2002-07-18 4.8803 5.1465 4.8356 5.0804 0.0595
2002-07-19 5.0270 5.2038 5.0221 5.1513 0.0603
2002-07-22 5.0804 5.1465 4.9687 4.9735 0.0582
2002-07-23 4.8181 5.0843 4.8181 5.0619 0.0593
2002-07-24 5.0580 5.1290 4.9376 5.0619 0.0593
2002-07-25 5.0580 5.0580 4.7918 4.8492 0.0568
volume dividend_amount split_coefficient Om \
date
2002-07-01 21923 0.0 1.0 NaN
2002-07-02 61045 0.0 1.0 NaN
2002-07-03 34161 0.0 1.0 NaN
2002-07-04 27893 0.0 1.0 NaN
2002-07-05 58976 0.0 1.0 NaN
2002-07-08 48910 0.0 1.0 5.472433
2002-07-09 321846 0.0 1.0 5.530900
2002-07-10 138434 0.0 1.0 5.525083
2002-07-11 15027 0.0 1.0 5.437150
2002-07-12 24187 0.0 1.0 5.437150
2002-07-15 50330 0.0 1.0 5.397317
2002-07-16 24928 0.0 1.0 5.347117
2002-07-17 21357 0.0 1.0 5.199100
2002-07-18 27532 0.0 1.0 5.097733
2002-07-19 13380 0.0 1.0 5.105833
2002-07-22 21666 0.0 1.0 5.035717
2002-07-23 40161 0.0 1.0 4.951350
2002-07-24 34480 0.0 1.0 4.927700
2002-07-25 38185 0.0 1.0 4.986967
Hm Lm Cm vClose diff \
date
2002-07-01 NaN NaN NaN NaN 1669.8373
2002-07-02 NaN NaN NaN NaN 1669.8062
2002-07-03 NaN NaN NaN NaN 1669.7839
2002-07-04 NaN NaN NaN NaN 1669.7217
2002-07-05 NaN NaN NaN NaN 1669.6595
2002-07-08 5.595167 5.397117 5.511150 5.493967 1669.6245
2002-07-09 5.631450 5.451850 5.545150 5.539837 1669.6333
2002-07-10 5.623367 5.406033 5.508217 5.515675 1670.0278
2002-07-11 5.567983 5.347750 5.461583 5.453617 1670.0637
2002-07-12 5.556167 5.318933 5.426767 5.434754 1669.9306
2002-07-15 5.526683 5.271650 5.383850 5.394875 1669.9170
2002-07-16 5.480050 5.221450 5.332183 5.345200 1669.9345
2002-07-17 5.376567 5.063250 5.236817 5.218933 1670.2055
2002-07-18 5.319567 5.011433 5.213183 5.160479 1670.1696
2002-07-19 5.317950 5.018717 5.207350 5.162463 1670.0987
2002-07-22 5.258850 4.972733 5.149700 5.104250 1670.2765
2002-07-23 5.192950 4.910550 5.104517 5.039842 1670.1881
2002-07-24 5.141300 4.866833 5.062250 4.999521 1670.1881
2002-07-25 5.128017 4.895650 5.029700 5.010083 1670.4008
signal noise efratio smooth
date
2002-07-01 5.4127 1670.3373 0.003240 0.416113
2002-07-02 5.4438 1670.3062 0.003259 0.416113
2002-07-03 5.4661 1670.2839 0.003273 0.416114
2002-07-04 5.5283 1670.2217 0.003310 0.416115
2002-07-05 5.5905 1670.1595 0.003347 0.416116
2002-07-08 5.6255 1670.1245 0.003368 0.416116
2002-07-09 5.6167 1670.1333 0.003363 0.416116
2002-07-10 5.2222 1670.5278 0.003126 0.416110
2002-07-11 5.1863 1670.5637 0.003105 0.416109
2002-07-12 5.3194 1670.4306 0.003184 0.416111
2002-07-15 5.3330 1670.4170 0.003193 0.416111
2002-07-16 5.3155 1670.4345 0.003182 0.416111
2002-07-17 5.0445 1670.7055 0.003019 0.416107
2002-07-18 5.0804 1670.6696 0.003041 0.416107
2002-07-19 5.1513 1670.5987 0.003084 0.416109
2002-07-22 4.9735 1670.7765 0.002977 0.416106
2002-07-23 5.0619 1670.6881 0.003030 0.416107
2002-07-24 5.0619 1670.6881 0.003030 0.416107
2002-07-25 4.8492 1670.9008 0.002902 0.416104
What is expected for kama=nz(kama[1], close)+smooth*(close-nz(kama[1], close))?
stockdata['kama'] = nz(stockdata['kama'][-1], stockdata['close']) + stockdata['smooth']*(stockdata['close'] - nz(stockdata['kama'][-1], stockdata['close']))
In this case, for the first iteration there will not be any previous kama value, which needs to be taken care of. All the inputs are given in the DataFrame shown above.
You need to create the column kama first with the values of close:
import numpy as np
stockdata['kama'] = stockdata['close']
previous_kama = stockdata['kama'].shift()
previous_close = stockdata['close'].shift()
value = np.where(previous_kama.notnull(), previous_kama, previous_close)
stockdata['kama'] = value + stockdata['smooth'] * (previous_close - value)
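Because kama is defined recursively (each value depends on the previous kama), an explicit loop is another way to express it; a minimal sketch, assuming the close and smooth columns shown in the question:
import numpy as np

kama = np.empty(len(stockdata))
kama[0] = stockdata['close'].iloc[0]          # no previous kama on the first row: fall back to close
for t in range(1, len(stockdata)):
    prev = kama[t - 1]
    close = stockdata['close'].iloc[t]
    smooth = stockdata['smooth'].iloc[t]
    kama[t] = prev + smooth * (close - prev)  # kama = nz(kama[1], close) + smooth*(close - nz(kama[1], close))
stockdata['kama'] = kama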

Find the average for user-defined window in pandas

I have a pandas dataframe that has raw heart rate data with an index of time (in seconds).
I am trying to bin the data so that I can have the average of a user-defined window (e.g. 10s) - not a rolling average, just an average of the first 10s, then the 10s following, etc.
import pandas as pd
hr_raw = pd.read_csv('hr_data.csv', index_col='time')
print(hr_raw)
heart_rate
time
0.6 164.0
1.0 182.0
1.3 164.0
1.6 150.0
2.0 152.0
2.4 141.0
2.9 163.0
3.2 141.0
3.7 124.0
4.2 116.0
4.7 126.0
5.1 116.0
5.7 107.0
Using the example data above, I would like to be able to set a user-defined window size (let's use 2 seconds) and produce a new dataframe that has an index of 2-second increments and averages the 'heart_rate' values that fall into each window (continuing to the end of the dataframe).
For example:
heart_rate
time
2.0 162.40
4.0 142.25
6.0 116.25
I can only seem to find methods that bin the data based on a predetermined number of bins (e.g. making a histogram), and these only return the count/frequency.
Thanks.
A groupby should do it.
df.groupby((df.index // 2 + 1) * 2).mean()
heart_rate
time
2.0 165.00
4.0 144.20
6.0 116.25
Note that the reason for the slight difference between our answers is that the upper bound is excluded. That means a reading taken at 2.0s will be considered for the 4.0s time interval. This is how it is usually done; a similar solution with the TimeGrouper will yield the same result.
As coldspeed pointed out, a reading at exactly 2s will be counted in the 4s bucket; however, if you need it in the 2s bucket, you can do:
In [1038]: df.groupby(np.ceil(df.index/2)*2).mean()
Out[1038]:
heart_rate
time
2.0 162.40
4.0 142.25
6.0 116.25
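To make the window size truly user-defined rather than hard-coded to 2 seconds, the same idea can be parameterised; a small sketch where window is a hypothetical user-supplied value:
import numpy as np

window = 10  # seconds, chosen by the user
binned = df.groupby(np.ceil(df.index / window) * window).mean()
binned.index.name = 'time'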
