Searching values through Pandas columns - python

This is a sample of a pandas DataFrame I have. I need to find the row for a given bid. For instance, given bid = 5, I need to return the corresponding row from the table below. If I enter a missing bid, for instance bid = 6, then the row with the largest bid smaller than the input should be returned; in that case that is the row for bid = 5. How do I do this in pandas?
Bid Imp Click Spend
3 13 0.97 2
4 13 1.89 7
5 79 34.98 130
7 83 37.52 140
8 88 38.52 144

I think this could do the trick:
>>> df[(df['Bid']<=5)].iloc[-1,:]
Bid 5.00
Imp 79.00
Click 34.98
Spend 130.00
Name: 2, dtype: float64
If you want a pandas DataFrame rather than a Series, use df[(df['Bid']<=5)].iloc[-1,:].to_frame().T.
>>> df[(df['Bid']<=5)].iloc[-1,:].to_frame().T
Bid Imp Click Spend
2 5.0 79.0 34.98 130.0
For the case of the missing bid=6, df[(df['Bid']<=6)].iloc[-1,:].to_frame().T would return the nearest bid below 6, which is, again, 5.
>>> df[(df['Bid']<=6)].iloc[-1,:].to_frame().T
Bid Imp Click Spend
2 5.0 79.0 34.98 130.0
EDITED
To make sure that the DataFrame has Bid in ascending order, sort it first:
>>> df = df.sort_values(by='Bid',ascending=True)
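The boolean-mask lookups here raise an IndexError when the requested bid is smaller than every bid in the table. A minimal sketch of one way to handle that case, assuming the sample data from the question; the helper name row_at_or_below is hypothetical and the choice to return an empty frame is an assumption, not something the question specifies:

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    "Bid": [3, 4, 5, 7, 8],
    "Imp": [13, 13, 79, 83, 88],
    "Click": [0.97, 1.89, 34.98, 37.52, 38.52],
    "Spend": [2, 7, 130, 140, 144],
})


def row_at_or_below(frame, bid):
    """Return the row with the largest Bid <= bid, or an empty frame if none exists."""
    ordered = frame.sort_values("Bid")
    # side="right" gives the insertion point after any equal values,
    # so position - 1 is the last row whose Bid is <= bid.
    position = ordered["Bid"].searchsorted(bid, side="right")
    if position == 0:
        # Every bid in the table is larger than the requested one.
        return ordered.iloc[0:0]
    return ordered.iloc[[position - 1]]


exact = row_at_or_below(df, 5)      # bid present in the table
missing = row_at_or_below(df, 6)    # falls back to the row for bid 5
too_small = row_at_or_below(df, 2)  # nothing at or below 2
```

Because searchsorted uses binary search on the sorted column, this avoids building a boolean mask over the whole column for each lookup.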

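Here is a generator-based method. The generator yields every bid up to the limit, and max over enumerate picks the item with the highest position, which is the last one yielded and therefore the largest such bid once the frame is sorted.
df = df.sort_values('Bid')
df.loc[df['Bid'] == max(enumerate(i for i in df['Bid'] if i <= 6))[1]]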
Bid Imp Click Spend
2 5 79 34.98 130
The above method is slow for large dataframes and only marginally faster for small ones. As an alternative, you can use this pandas-based solution (.index[-1] is an index label, so it is looked up with .loc):
df.loc[df[df['Bid'] <= 6].index[-1]]

Try
def get_bid(val):
    # find the index of the maximum bid below or equal to val
    index = df.loc[df.Bid <= val, 'Bid'].idxmax()
    return df.loc[[index]]
Here is the result of calling the function with the values 6, 5 and 4 respectively:
In []: get_bid(6)
Out[]:
Bid Imp Click Spend
2 5 79 34.98 130
In []: get_bid(5)
Out[]:
Bid Imp Click Spend
2 5 79 34.98 130
In []: get_bid(4)
Out[]:
Bid Imp Click Spend
1 4 13 1.89 7
PS: if you prefer one-liners, you can change the code to In [1], which produces the same output as above, i.e. a DataFrame. Removing the double brackets (In [2]) changes the output to a Series, i.e.
In [1]: val = 6
df.loc[[df.loc[df.Bid <= val, 'Bid'].idxmax()]]
Out[1]:
Bid Imp Click Spend
2 5 79 34.98 130
In [2]: df.loc[df.loc[df.Bid <= val, 'Bid'].idxmax()]
Out[2]:
Bid 5.00
Imp 79.00
Click 34.98
Spend 130.00
Name: 2, dtype: float64
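When several bids need to be looked up at once, pandas.merge_asof performs the same "largest bid at or below" match for a whole column of queries. This is a sketch under assumptions: the queries frame and its column name query are made up for illustration, and both key columns must be sorted and share a dtype.

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    "Bid": [3, 4, 5, 7, 8],
    "Imp": [13, 13, 79, 83, 88],
    "Click": [0.97, 1.89, 34.98, 37.52, 38.52],
    "Spend": [2, 7, 130, 140, 144],
})

# Hypothetical batch of bids to look up; must be sorted for merge_asof.
queries = pd.DataFrame({"query": [2, 5, 6, 9]})

# direction="backward" matches each query with the last row whose Bid <= query.
matched = pd.merge_asof(
    queries,
    df.sort_values("Bid"),
    left_on="query",
    right_on="Bid",
    direction="backward",
)
```

Queries with no bid at or below them (2 in this sample) come back with NaN in the columns taken from df instead of raising an error.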

Related

How can I subtract two panda data frame columns without getting an index error? [duplicate]

In python, how can I reference previous row and calculate something against it? Specifically, I am working with dataframes in pandas - I have a data frame full of stock price information that looks like this:
Date Close Adj Close
251 2011-01-03 147.48 143.25
250 2011-01-04 147.64 143.41
249 2011-01-05 147.05 142.83
248 2011-01-06 148.66 144.40
247 2011-01-07 147.93 143.69
Here is how I created this dataframe:
import pandas
url = 'http://ichart.finance.yahoo.com/table.csv?s=IBM&a=00&b=1&c=2011&d=11&e=31&f=2011&g=d&ignore=.csv'
data = pandas.read_csv(url)
## now I sorted the data frame ascending by date
data = data.sort_values(by='Date')
Starting with row number 2, or in this case, I guess it's 250 (PS - is that the index?), I want to calculate the difference between 2011-01-03 and 2011-01-04, for every entry in this dataframe. I believe the appropriate way is to write a function that takes the current row, then figures out the previous row, and calculates the difference between them, then use the pandas apply function to update the dataframe with the value.
Is that the right approach? If so, should I be using the index to determine the difference? (note - I'm still in python beginner mode, so index may not be the right term, nor even the correct way to implement this)
I think you want to do something like this:
In [26]: data
Out[26]:
Date Close Adj Close
251 2011-01-03 147.48 143.25
250 2011-01-04 147.64 143.41
249 2011-01-05 147.05 142.83
248 2011-01-06 148.66 144.40
247 2011-01-07 147.93 143.69
In [27]: data.set_index('Date').diff()
Out[27]:
Close Adj Close
Date
2011-01-03 NaN NaN
2011-01-04 0.16 0.16
2011-01-05 -0.59 -0.58
2011-01-06 1.61 1.57
2011-01-07 -0.73 -0.71
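To make the "reference the previous row" part of the question explicit, here is a small sketch using shift, with the Close values from the sample typed in by hand instead of downloaded. The column names Prev Close and Change are made up for illustration.

```python
import pandas as pd

# Close prices copied from the sample in the question, already sorted by date.
data = pd.DataFrame(
    {
        "Date": ["2011-01-03", "2011-01-04", "2011-01-05", "2011-01-06", "2011-01-07"],
        "Close": [147.48, 147.64, 147.05, 148.66, 147.93],
    },
    index=[251, 250, 249, 248, 247],
)

# shift(1) moves every value down one row, so each row sees its predecessor.
data["Prev Close"] = data["Close"].shift(1)
data["Change"] = data["Close"] - data["Prev Close"]

# diff() computes the same subtraction in a single step.
same = data["Close"].diff()
```

Both operate on row position, not on index labels, so the descending index (251, 250, ...) does not matter once the rows are in date order.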
To calculate the difference of one column, here is what you can do.
df=
A B
0 10 56
1 45 48
2 26 48
3 32 65
We want to compute the row difference in A only and keep the rows where that difference is less than 15.
df['A_dif'] = df['A'].diff()
df=
A B A_dif
0 10 56 NaN
1 45 48 35.0
2 26 48 -19.0
3 32 65 6.0
df = df[df['A_dif']<15]
df=
A B A_dif
2 26 48 -19.0
3 32 65 6.0
Row 0 is dropped as well, because comparing NaN with 15 evaluates to False.
I don't know pandas, and I'm pretty sure it has something specific for this; however, I'll give you the pure-Python solution, that might be of some help even if you need to use pandas:
import csv
import io
import urllib.request
# Retrieve the CSV file and load it into a list, converting
# all numeric values to floats
url='http://ichart.finance.yahoo.com/table.csv?s=IBM&a=00&b=1&c=2011&d=11&e=31&f=2011&g=d&ignore=.csv'
response = urllib.request.urlopen(url)
reader = csv.reader(io.TextIOWrapper(response, encoding='utf-8'), delimiter=',')
# Skip the header row, then sort so the records are ordered by date
cleaned = sorted([r[0]] + [float(x) for x in r[1:]] for r in list(reader)[1:])
# Start at 1: row 0 has no previous row, and cleaned[-1] would silently
# wrap around to the last row instead of raising an IndexError
for i in range(1, len(cleaned)):
    row, previous = cleaned[i], cleaned[i - 1]
    # Difference of each numeric field with the same field in the row before
    print(row[0], [row[j] - previous[j] for j in range(1, len(row))])

Is there a way to do rolling rank in Pandas?

I am trying to rank some values in one column over a rolling period of N days instead of having the ranking done over the entire set. I have seen several methods here using rolling_apply, but I have read that this is no longer in pandas. For example, in the following table:
              A
01-01-2013  100
02-01-2013   85
03-01-2013  110
04-01-2013   60
05-01-2013   20
06-01-2013   40
For the column A above, how can I have the rank as below for N = 3;
              A  Ranked_A
01-01-2013  100       NaN
02-01-2013   85       NaN
03-01-2013  110         1
04-01-2013   60         3
05-01-2013   20         3
06-01-2013   40         2
Yes, there is a workaround: still use rolling, but combine it with apply and take the rank of the last value in each window:
df.A.rolling(3).apply(lambda x: pd.Series(x).rank(ascending=False).iloc[-1])
01-01-2013 NaN
02-01-2013 NaN
03-01-2013 1.0
04-01-2013 3.0
05-01-2013 3.0
06-01-2013 2.0
Name: A, dtype: float64
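A variant that avoids building a Series per window is sketched below. It assumes ties should receive the smallest rank ("min" style), which the question does not specify, and the helper name rank_of_last is made up for illustration.

```python
import numpy as np
import pandas as pd

# Column A from the question, indexed by the date strings shown there.
s = pd.Series(
    [100, 85, 110, 60, 20, 40],
    index=["01-01-2013", "02-01-2013", "03-01-2013",
           "04-01-2013", "05-01-2013", "06-01-2013"],
    name="A",
)


def rank_of_last(window):
    # Descending rank of the newest value in the window:
    # 1 + the number of values strictly greater than it.
    return 1 + np.sum(window > window[-1])


# raw=True hands each window to the function as a plain NumPy array.
ranked = s.rolling(3).apply(rank_of_last, raw=True)
```

The first two entries stay NaN because their windows hold fewer than three values.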

Problem iteration columns and rows Dataframe

Here is my problem :
Let’s say you have to buy and sell two objects under the following conditions:
You buy object A or B if its price goes below 150 (<150), assuming that you can buy a fraction of an object (so decimals are allowed)
If the following day the object is still below 150, then you just keep the object and do nothing
If the object is higher than or equal to 150, then you sell the object and take profits
You start the game with 10000$
Here is the DataFrame with all the prices
df=pd.DataFrame({'Date':['2017-05-19','2017-05-22','2017-05-23','2017-05-24','2017-05-25','2017-05-26','2017-05-29'],
'A':[153,147,149,155,145,147,155],
'B':[139,152,141,141,141,152,152],})
df['Date']=pd.to_datetime(df['Date'])
df = df.set_index('Date')
The goal is to return a DataFrame with the number of units of A and B you hold and the amount of cash you have left.
If the conditions are met, the allocation for each object is half of the cash you have if you don’t hold any object (weight = 1/2) and all the remaining cash if you already hold one object (weight = 1).
Let’s look at df first, I will also develop the new data frame that I’m trying to create (let’s call it df_end) :
On 2017-05-19, object A is 153$ and B is 139$ : You buy 35.97 object B (=5000/139) as the price is <150 —> You have 5000$ left in cash.
On 2017-05-22, object A is 147$ and B is 152$ : You buy 34.01 object A (=5000/147) as the price is <150 + You sell 35.97 object B at 152$ as it is >=150 --> You have now 5467.44$ left in cash thanks to the selling of B.
On 2017-05-23, object A is 149$ and B is 141$ : You keep your position on Object A (34.01 object) as it’s still below 150 and you buy 38.77 Object B (=5467.44/141) as the price is <150 —> You have now 0$ left in cash.
On 2017-05-24, object A is 155$ and B is 141$ : You sell 34.01 object A at 155$ as it’s above 150$ and you keep 38.77 Object B as it’s still below 150 —> You have now 5271.55$ left in cash thanks to the selling of A
On 2017-05-25, object A is 145$ and B is 141$: You buy 36.35 object A (5271.55/145) as it’s below 150 and you keep 38.77 Object B as it’s still below 150 —> You have now 0$ in cash
On 2017-05-26, object A is 147$ and B is 152$: You sell 38.77 object B at 152 as it’s above 150 and you keep 36.35 Object A as it’s still below 150 —> You have now 5893.04$ in cash thanks to the selling of Object B
On 2017-05-29, object A is 155$ and B is 152$: You sell 36.35 object A at 155 as it’s above 150 and you do nothing else as B is not below 150 —> You have now 11527.29$ in cash thanks to the selling of Object A.
Hence, the new dataframe df_end should look like this (this is the Result I am looking for)
A B Cash
Date
2017-05-19 0 35.97 5000
2017-05-22 34.01 0 5467.64
2017-05-23 34.01 38.77 0
2017-05-24 0 38.77 5272.11
2017-05-25 36.35 38.77 0
2017-05-26 36.35 0 5893.04
2017-05-29 0 0 11527.29
My principal problem is that we have to iterate over both rows and columns and this is the most difficult part.
It's been a week that I'm trying to find a solution but I still don't find any idea on that, that is why I tried to explain as clear as possible.
So if somebody has an idea on this issue, you are very welcome.
Thank you so much
You could try this:
import pandas as pd
df=pd.DataFrame({'Date':['2017-05-19','2017-05-22','2017-05-23','2017-05-24','2017-05-25','2017-05-26','2017-05-29'],
'A':[153,147,149,155,145,147,155],
'B':[139,152,141,141,141,152,152],})
df['Date']=pd.to_datetime(df['Date'])
df = df.set_index('Date')
print(df)
#Values before iterations
EntryCash=10000
newdata=[]
holding=False
#First iteration (Initial conditions)
firstrow=df.to_records()[0]
possibcash=EntryCash if holding else EntryCash/2
prevroa=possibcash/firstrow[1] if firstrow[1]<=150 else 0
prevrob=possibcash/firstrow[2] if firstrow[2]<=150 else 0
holding=any(i!=0 for i in [prevroa,prevrob])
newdata.append([df.to_records()[0][0],prevroa,prevrob,possibcash])
#others iterations
for row in df.to_records()[1:]:
    possibcash=possibcash if holding else possibcash/2
    a=row[1]
    b=row[2]
    if a>150:
        if prevroa>0:
            possibcash+=prevroa*a
            a=0
        else:
            a=prevroa
    else:
        if prevroa==0:
            a=possibcash/a
            possibcash=0
        else:
            a=prevroa
    if b>150:
        if prevrob>0:
            possibcash+=prevrob*b
            b=0
        else:
            b=prevrob
    else:
        if prevrob==0:
            b=possibcash/b
            possibcash=0
        else:
            b=prevrob
    prevroa=a
    prevrob=b
    newdata.append([row[0],a,b,possibcash])
    holding=any(i!=0 for i in [a,b])
df_end=pd.DataFrame(newdata, columns=[df.index.name]+list(df.columns)+['Cash']).set_index('Date')
print(df_end)
Output:
df
A B
Date
2017-05-19 153 139
2017-05-22 147 152
2017-05-23 149 141
2017-05-24 155 141
2017-05-25 145 141
2017-05-26 147 152
2017-05-29 155 152
df_end
A B Cash
Date
2017-05-19 0.000000 35.971223 5000.000000
2017-05-22 34.013605 0.000000 5467.625899
2017-05-23 34.013605 38.777489 0.000000
2017-05-24 0.000000 38.777489 5272.108844
2017-05-25 36.359371 38.777489 0.000000
2017-05-26 36.359371 0.000000 5894.178274
2017-05-29 0.000000 0.000000 11529.880831
If you want it rounded to two decimals, you can add:
df_end=df_end.round(decimals=2)
df_end:
A B Cash
Date
2017-05-19 0.00 35.97 5000.00
2017-05-22 34.01 0.00 5467.63
2017-05-23 34.01 38.78 0.00
2017-05-24 0.00 38.78 5272.11
2017-05-25 36.36 38.78 0.00
2017-05-26 36.36 0.00 5894.18
2017-05-29 0.00 0.00 11529.88
Slight differences in the final values
The result differs slightly from your desired output because you sometimes rounded intermediate values to two decimals and sometimes did not. For example:
In your second row you put:
#second row
2017-05-22 34.01 0 5467.64
That means you used the full-precision holding of object B from the first row, 35.971223, not the rounded 35.97:
35.97*152
Out[120]: 5467.44
35.971223*152
Out[121]: 5467.6258960000005 #---->closest to 5467.64
And at row 3 (counting from 0), you again used the unrounded value, not the rounded one:
#row 3
2017-05-24 0 38.77 5272.11
#Values
34.013605*155
Out[122]: 5272.108775
34.01*155
Out[123]: 5271.549999999999
And finally, at the last two rows you used the rounded value, I guess, because:
#last two rows
2017-05-26 36.35 0 5893.04
2017-05-29 0 0 11527.29
#cash values
#penultimate row, cash value
38.777489*152
Out[127]: 5894.178328
38.77*152
Out[128]: 5893.040000000001
#last row, cash value
5894.04+(155*36.35)
Out[125]: 11528.29 #---->closest to 11527.29
5894.04+(155*36.359371)
Out[126]: 11529.742505
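Since the question's main difficulty was iterating over both rows and columns, here is a sketch that loops over the columns generically instead of hard-coding A and B. It rests on assumptions the question does not state: the budget for each purchase is the cash at the start of the day divided by the number of objects not held at that point (which reproduces the 1/2 and 1 weights for two objects), and sale proceeds only become spendable the next day. The function name simulate is made up for illustration.

```python
import pandas as pd

prices = pd.DataFrame(
    {
        "A": [153, 147, 149, 155, 145, 147, 155],
        "B": [139, 152, 141, 141, 141, 152, 152],
    },
    index=pd.to_datetime(
        ["2017-05-19", "2017-05-22", "2017-05-23", "2017-05-24",
         "2017-05-25", "2017-05-26", "2017-05-29"]
    ),
)
prices.index.name = "Date"


def simulate(prices, cash=10000.0, threshold=150):
    """Replay the buy/hold/sell rules day by day and return holdings plus cash."""
    units = {name: 0.0 for name in prices.columns}
    rows = []
    for _, day in prices.iterrows():
        # Budget per purchase is fixed from the state at the start of the day.
        not_held = [name for name in units if units[name] == 0]
        budget = cash / len(not_held) if not_held else 0.0
        for name in prices.columns:
            price = day[name]
            if units[name] == 0 and price < threshold:
                # Buy with the pre-computed budget.
                units[name] = budget / price
                cash -= budget
            elif units[name] > 0 and price >= threshold:
                # Sell the whole position and take the proceeds.
                cash += units[name] * price
                units[name] = 0.0
        rows.append({**units, "Cash": cash})
    return pd.DataFrame(rows, index=prices.index)


df_end = simulate(prices).round(2)
```

On the sample prices this follows the same sequence of trades as the loop in the answer; with more than two columns the budget rule is only one possible reading of the allocation described in the question.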

Get maximum relative difference between row-values and row-mean in new pandas dataframe column

I want to have an extra column with the maximum relative difference [-] of the row-values and the mean of these rows:
The df is filled with energy use data for several years.
The theoretical formula that should get me this is as follows:
df['max_rel_dif'] = MAX [ ABS(highest energy use – mean energy use), ABS(lowest energy use – mean energy use)] / mean energy use
Initial dataframe:
ID y_2010 y_2011 y_2012 y_2013 y_2014
0 23 22631 21954.0 22314.0 22032 21843
1 43 27456 29654.0 28159.0 28654 2000
2 36 61200 NaN NaN 31895 1600
3 87 87621 86542.0 87542.0 88456 86961
4 90 58951 57486.0 2000.0 0 0
5 98 24587 25478.0 NaN 24896 25461
Desired dataframe:
ID y_2010 y_2011 y_2012 y_2013 y_2014 max_rel_dif
0 23 22631 21954.0 22314.0 22032 21843 0.02149
1 43 27456 29654.0 28159.0 28654 2000 0.91373
2 36 61200 NaN NaN 31895 1600 0.94931
3 87 87621 86542.0 87542.0 88456 86961 0.01179
4 90 58951 57486.0 2000.0 0 0 1.48870
5 98 24587 25478.0 NaN 24896 25461 0.02065
Code I tried:
import pandas as pd
import numpy as np

df = pd.DataFrame({"ID": [23,43,36,87,90,98],
"y_2010": [22631,27456,61200,87621,58951,24587],
"y_2011": [21954,29654,np.nan,86542,57486,25478],
"y_2012": [22314,28159,np.nan,87542,2000,np.nan],
"y_2013": [22032,28654,31895,88456,0,24896,],
"y_2014": [21843,2000,1600,86961,0,25461]})

print(df)

a = df.loc[:, ['y_2010','y_2011','y_2012','y_2013', 'y_2014']]


# calculate mean
mean = a.mean(1)
# calculate max_rel_dif
df['max_rel_dif'] = (((df.max(axis=1).sub(mean)).abs(),(df.min(axis=1).sub(mean)).abs()).max()).div(mean)
# AttributeError: 'tuple' object has no attribute 'max'
-> I'm obviously doing the wrong thing with the tuple; I just don't know how to get the maximum of the two values in the tuple
and then divide it by the mean in a proper Pythonic way
I think the whole computation can be written as
s=df.filter(like='y')
s.sub(s.mean(1),axis=0).abs().max(1)/s.mean(1)
0 0.021494
1 0.913736
2 0.949311
3 0.011800
4 1.488707
5 0.020653
dtype: float64
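For comparison, the original two-term formula (the larger of |max - mean| and |min - mean|, divided by the mean) can also be written out step by step. This sketch assumes the same sample data; the names years, upper and lower are made up for illustration. Note that the attempted code took max and min over the full df, which would also include the ID column.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "ID": [23, 43, 36, 87, 90, 98],
    "y_2010": [22631, 27456, 61200, 87621, 58951, 24587],
    "y_2011": [21954, 29654, np.nan, 86542, 57486, 25478],
    "y_2012": [22314, 28159, np.nan, 87542, 2000, np.nan],
    "y_2013": [22032, 28654, 31895, 88456, 0, 24896],
    "y_2014": [21843, 2000, 1600, 86961, 0, 25461],
})

# Restrict to the yearly columns so ID does not leak into max/min/mean.
years = df.filter(like="y_")
mean = years.mean(axis=1)

# Distance of the highest and lowest yearly value from the row mean.
upper = (years.max(axis=1) - mean).abs()
lower = (years.min(axis=1) - mean).abs()

# np.maximum compares the two Series element-wise, which is what the
# tuple-based .max() in the attempted code was meant to do.
df["max_rel_dif"] = np.maximum(upper, lower) / mean
```

NaN entries are skipped by max, min and mean by default, so rows with missing years are handled the same way as in the one-line version.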

