Here is my problem :
Let’s say you have to buy and sell two objects with those following conditions:
You buy object A or B if its price goes below 150 (<150) and assuming that you can buy fraction of the object (so decimals are allowed)
If the following day the object is still below 150, then you just keep the object and do nothing
If the object is higher or equal to 150, then you sell the object and take profits
You start the game with 10000$
Here is the DataFrame with all the prices
df=pd.DataFrame({'Date':['2017-05-19','2017-05-22','2017-05-23','2017-05-24','2017-05-25','2017-05-26','2017-05-29'],
'A':[153,147,149,155,145,147,155],
'B':[139,152,141,141,141,152,152],})
df['Date']=pd.to_datetime(df['Date'])
df = df.set_index('Date')
The goal is to return a DataFrame with the number of units of A and B you hold and the amount of cash you have left.
If the conditions are met, the allocation for an object is half of your cash if you don't hold any object yet (weight = 1/2), and all of the remaining cash if you already hold one object (weight = 1).
Let’s look at df first, I will also develop the new data frame that I’m trying to create (let’s call it df_end) :
On 2017-05-19, object A is 153$ and B is 139$ : You buy 35.97 object B (=5000/139) as the price is <150 —> You have 5000$ left in cash.
On 2017-05-22, object A is 147$ and B is 152$ : You buy 34.01 object A (=5000/147) as the price is <150 + You sell 35.97 object B at 152$ as it is >=150 --> You have now 5467.44$ left in cash thanks to the selling of B.
On 2017-05-23, object A is 149$ and B is 141$ : You keep your position on Object A (34.01 object) as it’s still below 150 and you buy 38.77 Object B (=5467.44/141) as the price is <150 —> You have now 0$ left in cash.
On 2017-05-24, object A is 155$ and B is 141$ : You sell 34.01 object A at 155$ as it’s above 150$ and you keep 38.77 Object B as it’s still below 150 —> You have now 5271.55$ left in cash thanks to the selling of A
On 2017-05-25, object A is 145$ and B is 141$: You buy 36.35 object A (5271.55/145) as it’s below 150 and you keep 38.77 Object B as it’s still below 150 —> You have now 0$ in cash
On 2017-05-26, object A is 147$ and B is 152$: You sell 38.77 object B at 152 as it’s above 150 and you keep 36.35 Object A as it’s still below 150 —> You have now 5893.04$ in cash thanks to the selling of Object B
On 2017-05-29, object A is 155$ and B is 152$: You sell 36.35 object A at 155 as it's above 150 and you do nothing else as B is not below 150 --> You have now 11527.29$ in cash thanks to the selling of Object A.
Hence, the new dataframe df_end should look like this (this is the Result I am looking for)
A B Cash
Date
2017-05-19 0 35.97 5000
2017-05-22 34.01 0 5467.64
2017-05-23 34.01 38.77 0
2017-05-24 0 38.77 5272.11
2017-05-25 36.35 38.77 0
2017-05-26 36.35 0 5893.04
2017-05-29 0 0 11527.29
My main difficulty is that the calculation has to iterate over both rows and columns, because each day's purchases depend on the cash and holdings left over from the previous day.
I have been trying to find a solution for a week without success, which is why I have tried to explain the problem as clearly as possible.
If anybody has an idea, it would be very welcome.
Thank you so much
You could try this:
import pandas as pd
df=pd.DataFrame({'Date':['2017-05-19','2017-05-22','2017-05-23','2017-05-24','2017-05-25','2017-05-26','2017-05-29'],
'A':[153,147,149,155,145,147,155],
'B':[139,152,141,141,141,152,152],})
df['Date']=pd.to_datetime(df['Date'])
df = df.set_index('Date')
print(df)
#Values before iterations
EntryCash=10000
newdata=[]
holding=False
#First iteration (Initial conditions)
firstrow=df.to_records()[0]
possibcash=EntryCash if holding else EntryCash/2
prevroa=possibcash/firstrow[1] if firstrow[1]<150 else 0
prevrob=possibcash/firstrow[2] if firstrow[2]<150 else 0
holding=any(i!=0 for i in [prevroa,prevrob])
newdata.append([df.to_records()[0][0],prevroa,prevrob,possibcash])
#other iterations
for row in df.to_records()[1:]:
    possibcash=possibcash if holding else possibcash/2
    a=row[1]
    b=row[2]
    #object A: sell at >=150, buy below 150 if not held yet
    if a>=150:
        if prevroa>0:
            possibcash+=prevroa*a
            a=0
        else:
            a=prevroa
    else:
        if prevroa==0:
            a=possibcash/a
            possibcash=0
        else:
            a=prevroa
    #object B: same rules
    if b>=150:
        if prevrob>0:
            possibcash+=prevrob*b
            b=0
        else:
            b=prevrob
    else:
        if prevrob==0:
            b=possibcash/b
            possibcash=0
        else:
            b=prevrob
    prevroa=a
    prevrob=b
    newdata.append([row[0],a,b,possibcash])
    holding=any(i!=0 for i in [a,b])
df_end=pd.DataFrame(newdata, columns=[df.index.name]+list(df.columns)+['Cash']).set_index('Date')
print(df_end)
Output:
df
A B
Date
2017-05-19 153 139
2017-05-22 147 152
2017-05-23 149 141
2017-05-24 155 141
2017-05-25 145 141
2017-05-26 147 152
2017-05-29 155 152
df_end
A B Cash
Date
2017-05-19 0.000000 35.971223 5000.000000
2017-05-22 34.013605 0.000000 5467.625899
2017-05-23 34.013605 38.777489 0.000000
2017-05-24 0.000000 38.777489 5272.108844
2017-05-25 36.359371 38.777489 0.000000
2017-05-26 36.359371 0.000000 5894.178274
2017-05-29 0.000000 0.000000 11529.880831
If you want it rounded to two decimals, you can add:
df_end=df_end.round(decimals=2)
df_end:
A B Cash
Date
2017-05-19 0.00 35.97 5000.00
2017-05-22 34.01 0.00 5467.63
2017-05-23 34.01 38.78 0.00
2017-05-24 0.00 38.78 5272.11
2017-05-25 36.36 38.78 0.00
2017-05-26 36.36 0.00 5894.18
2017-05-29 0.00 0.00 11529.88
Slight Differences in Final Values
It is slightly different from your desired output because you sometimes rounded the values to two decimals and sometimes did not. For example:
In your second row you put:
#second row
2017-05-22 34.01 0 5467.64
That means you used the unrounded holding of object B from the first row, that is 35.971223 rather than 35.97:
35.97*152
Out[120]: 5467.44
35.971223*152
Out[121]: 5467.6258960000005 #---->closest to 5467.64
And in the fourth row you again used the unrounded value, not the rounded one:
#fourth row
2017-05-24 0 38.77 5272.11
#Values
34.013605*155
Out[122]: 5272.108775
34.01*155
Out[123]: 5271.549999999999
And finally, at the last two rows you used the rounded value, I guess, because:
#last two rows
2017-05-26 36.35 0 5893.04
2017-05-29 0 0 11527.29
#cash values
#penultimate row, cash value
38.777489*152
Out[127]: 5894.178328
38.77*152
Out[128]: 5893.040000000001
#last row, cash value
5894.04+(155*36.35)
Out[125]: 11528.29 #---->closest to 11527.29
5894.04+(155*36.359371)
Out[126]: 11529.742505
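For reference, the same rules can also be written as one loop over any set of price columns. This is only a minimal sketch, not the code that produced the output above: the function name simulate and its parameters are made up for illustration, and it assumes (as in the walk-through in the question) that money from a sale only becomes available for buying on the following day and that the weights 1/2 and 1 apply to exactly two objects.

```python
import pandas as pd

prices = pd.DataFrame(
    {"A": [153, 147, 149, 155, 145, 147, 155],
     "B": [139, 152, 141, 141, 141, 152, 152]},
    index=pd.to_datetime(["2017-05-19", "2017-05-22", "2017-05-23", "2017-05-24",
                          "2017-05-25", "2017-05-26", "2017-05-29"]),
)
prices.index.name = "Date"


def simulate(prices, cash=10_000.0, threshold=150):
    """Hypothetical helper: buy below the threshold, sell at or above it."""
    units = dict.fromkeys(prices.columns, 0.0)
    rows = []
    for _, row in prices.iterrows():
        start_cash = cash
        # Assumption: weight 1/2 when nothing is held at the start of the day, else 1.
        weight = 1.0 if any(u > 0 for u in units.values()) else 0.5
        proceeds = 0.0
        for name in prices.columns:
            price = row[name]
            if units[name] > 0 and price >= threshold:
                # Sell the whole position; the money is usable from the next day on.
                proceeds += units[name] * price
                units[name] = 0.0
            elif units[name] == 0 and price < threshold:
                # Buy with the share of the cash that was available this morning.
                spend = start_cash * weight
                units[name] = spend / price
                cash -= spend
        cash += proceeds
        rows.append({**units, "Cash": cash})
    return pd.DataFrame(rows, index=prices.index)


df_end = simulate(prices)
print(df_end.round(2))
```

Because the holdings are kept in a dictionary keyed by column name, the loop does not need a separate branch per object.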
my dataset
name date record
A 2018-09-18 95
A 2018-10-11 104
A 2018-10-30 230
A 2018-11-23 124
B 2020-01-24 95
B 2020-02-11 167
B 2020-03-07 78
As you can see, there are several records per name, each with its own date.
For each name, I would like to find the record that rose the most compared to the previous record.
Desired output:
name record_before_date record_before record_increase_date record_increase increase_rate
A 2018-10-11 104 2018-10-30 230 121.15
B 2020-01-24 95 2020-02-11 167 75.79
I'm not comparing the lowest record to the highest. For each name, I want to find the record with the largest rate of increase over the record immediately before it, together with that rate.
increase rate formula = (record_increase - record_before) / record_before * 100
Any help would be appreciated.
Thanks for reading.
Use:
#get percent change per group
s = df.groupby("name")["record"].pct_change()
#get row with maximal percent change
df1 = df.loc[s.groupby(df['name']).idxmax()].add_suffix('_increase')
#get row preceding the maximal percent change
df2 = (df.loc[s.groupby(df['name'])
                .apply(lambda x: x.shift(-1).idxmax())].add_suffix('_before'))
#join together
df = pd.concat([df2.set_index('name_before'),
                df1.set_index('name_increase')], axis=1).rename_axis('name').reset_index()
#apply formula
df['increase_rate'] = (df['record_increase'].sub(df['record_before'])
                                            .div(df['record_before'])
                                            .mul(100))
print (df)
name date_before record_before date_increase record_increase \
0 A 2018-10-11 104 2018-10-30 230
1 B 2020-01-24 95 2020-02-11 167
increase_rate
0 121.153846
1 75.789474
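An alternative that may be easier to follow is to put the previous record next to each row with a grouped shift and then pick the row with the largest rate per name. This is a minimal sketch, assuming the rows are already sorted by date within each name; the column names record_before and date_before are chosen here for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["A", "A", "A", "A", "B", "B", "B"],
    "date": pd.to_datetime(["2018-09-18", "2018-10-11", "2018-10-30", "2018-11-23",
                            "2020-01-24", "2020-02-11", "2020-03-07"]),
    "record": [95, 104, 230, 124, 95, 167, 78],
})

grouped = df.groupby("name")
# Previous date and record within the same name (NaN/NaT for the first row of a group).
out = df.assign(
    date_before=grouped["date"].shift(),
    record_before=grouped["record"].shift(),
)
out["increase_rate"] = (out["record"] - out["record_before"]) / out["record_before"] * 100

# idxmax skips the NaN of each group's first row by default.
best = out.loc[out.groupby("name")["increase_rate"].idxmax()]
print(best)
```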
I want to add an extra column with the maximum relative difference [-] between the values in a row and the mean of that row.
The df is filled with energy use data for several years.
The formula that should give me this is as follows:
df['max_rel_dif'] = MAX [ ABS(highest energy use – mean energy use), ABS(lowest energy use – mean energy use)] / mean energy use
Initial dataframe:
ID y_2010 y_2011 y_2012 y_2013 y_2014
0 23 22631 21954.0 22314.0 22032 21843
1 43 27456 29654.0 28159.0 28654 2000
2 36 61200 NaN NaN 31895 1600
3 87 87621 86542.0 87542.0 88456 86961
4 90 58951 57486.0 2000.0 0 0
5 98 24587 25478.0 NaN 24896 25461
Desired dataframe:
ID y_2010 y_2011 y_2012 y_2013 y_2014 max_rel_dif
0 23 22631 21954.0 22314.0 22032 21843 0.02149
1 43 27456 29654.0 28159.0 28654 2000 0.91373
2 36 61200 NaN NaN 31895 1600 0.94931
3 87 87621 86542.0 87542.0 88456 86961 0.01179
4 90 58951 57486.0 2000.0 0 0 1.48870
5 98 24587 25478.0 NaN 24896 25461 0.02065
Code I tried:
import pandas as pd
import numpy as np
df = pd.DataFrame({"ID": [23,43,36,87,90,98],
"y_2010": [22631,27456,61200,87621,58951,24587],
"y_2011": [21954,29654,np.nan,86542,57486,25478],
"y_2012": [22314,28159,np.nan,87542,2000,np.nan],
"y_2013": [22032,28654,31895,88456,0,24896,],
"y_2014": [21843,2000,1600,86961,0,25461]})
print(df)
a = df.loc[:, ['y_2010','y_2011','y_2012','y_2013', 'y_2014']]
# calculate mean
mean = a.mean(1)
# calculate max_rel_dif
df['max_rel_dif'] = (((df.max(axis=1).sub(mean)).abs(),(df.min(axis=1).sub(mean)).abs()).max()).div(mean)
# AttributeError: 'tuple' object has no attribute 'max'
-> I'm obviously doing the wrong thing with the tuple, I just don't know how to get the maximum values
from the tuples and then divide them by the mean in the proper Pythonic way
I think the whole calculation can be reduced to:
s=df.filter(like='y')
s.sub(s.mean(1),axis=0).abs().max(1)/s.mean(1)
0 0.021494
1 0.913736
2 0.949311
3 0.011800
4 1.488707
5 0.020653
dtype: float64
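The short version works because the larger of |max - mean| and |min - mean| is simply the largest absolute deviation from the mean in the row. The sketch below checks this against the formula from the question; it selects only the year columns, since calling df.max(axis=1) on the whole frame would also include ID. The variable names are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"ID": [23, 43, 36, 87, 90, 98],
                   "y_2010": [22631, 27456, 61200, 87621, 58951, 24587],
                   "y_2011": [21954, 29654, np.nan, 86542, 57486, 25478],
                   "y_2012": [22314, 28159, np.nan, 87542, 2000, np.nan],
                   "y_2013": [22032, 28654, 31895, 88456, 0, 24896],
                   "y_2014": [21843, 2000, 1600, 86961, 0, 25461]})

years = df.filter(like="y_")   # only the energy-use columns, not ID
mean = years.mean(axis=1)      # NaN values are skipped by default

# Formula as written in the question: element-wise maximum of two Series.
high = (years.max(axis=1) - mean).abs()
low = (years.min(axis=1) - mean).abs()
explicit = np.maximum(high, low) / mean

# Short form: largest absolute deviation from the row mean.
short = years.sub(mean, axis=0).abs().max(axis=1) / mean

df["max_rel_dif"] = short
print(df)
```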
Stata .dta files include labels/descriptions for each column, which can be viewed in Stata using the describe command. For example, the adults and kids variables in this online dataset have the descriptions "number of adults in household" and "number of children in household", respectively:
clear
use http://www.principlesofeconometrics.com/stata/alcohol.dta
describe
Contains data from http://www.principlesofeconometrics.com/stata/alcohol.dta
obs: 1,000
vars: 4 10 Nov 2007 11:33
size: 5,000 (_dta has notes)
-------------------------------------------------------------------------------------------------------------------------------------
storage display value
variable name type format label variable label
-------------------------------------------------------------------------------------------------------------------------------------
adults byte %8.0g number of adults in household
kids byte %8.0g number of children in household
income int %8.0g weekly income
consume byte %8.0g =1 if consume alcohol, =0 otherwise
-------------------------------------------------------------------------------------------------------------------------------------
Sorted by:
Those descriptions do not show up in Pandas, for example with describe():
df = pd.read_stata('http://www.principlesofeconometrics.com/stata/alcohol.dta')
df
adults kids income consume
0 2 2 758 1
1 2 3 1785 1
2 3 0 1200 1
.. ... ... ... ...
997 2 0 1383 1
998 2 2 816 0
999 2 2 387 0
df.describe()
adults kids income consume
count 1000.000000 1000.000000 1000.000000 1000.000000
mean 2.012000 0.722000 649.528000 0.766000
std 0.815181 1.078833 460.657826 0.423584
min 1.000000 0.000000 12.000000 0.000000
25% 2.000000 0.000000 295.000000 1.000000
50% 2.000000 0.000000 562.500000 1.000000
75% 2.000000 1.000000 887.500000 1.000000
max 6.000000 5.000000 3846.000000 1.000000
Is there a way to view this information after loading it to a Pandas DataFrame using read_stata()?
Using Stata's toy dataset auto as an example:
sysuse auto, clear
describe
Contains data from auto.dta
obs: 74 1978 Automobile Data
vars: 12 13 Apr 2014 17:45
size: 3,182 (_dta has notes)
-------------------------------------------------------------------------------------------------------------------------------------
storage display value
variable name type format label variable label
-------------------------------------------------------------------------------------------------------------------------------------
make str18 %-18s Make and Model
price int %8.0gc Price
mpg int %8.0g Mileage (mpg)
rep78 int %8.0g Repair Record 1978
headroom float %6.1f Headroom (in.)
trunk int %8.0g Trunk space (cu. ft.)
weight int %8.0gc Weight (lbs.)
length int %8.0g Length (in.)
turn int %8.0g Turn Circle (ft.)
displacement int %8.0g Displacement (cu. in.)
gear_ratio float %6.2f Gear Ratio
foreign byte %8.0g origin Car type
-------------------------------------------------------------------------------------------------------------------------------------
Sorted by: foreign
The following works for me:
import pandas as pd
data = pd.read_stata('auto.dta', iterator = True)
labels = data.variable_labels()
labels
Out[5]:
{'make': 'Make and Model',
'price': 'Price',
'mpg': 'Mileage (mpg)',
'rep78': 'Repair Record 1978',
'headroom': 'Headroom (in.)',
'trunk': 'Trunk space (cu. ft.)',
'weight': 'Weight (lbs.)',
'length': 'Length (in.)',
'turn': 'Turn Circle (ft.) ',
'displacement': 'Displacement (cu. in.)',
'gear_ratio': 'Gear Ratio',
'foreign': 'Car type'}
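The reader returned with iterator=True can also be used as a context manager, which closes the file afterwards and lets you read the data and the labels in one go. The sketch below is self-contained: it first writes a tiny .dta file with to_stata (the file name sample.dta and the three rows are made up for illustration) and then reads the labels back.

```python
import os
import tempfile

import pandas as pd

# Write a small Stata file with variable labels so the example runs offline.
sample = pd.DataFrame({"adults": [2, 2, 3], "kids": [2, 3, 0]})
path = os.path.join(tempfile.mkdtemp(), "sample.dta")
sample.to_stata(
    path,
    write_index=False,
    variable_labels={"adults": "number of adults in household",
                     "kids": "number of children in household"},
)

# Read the labels and the data from the same reader.
with pd.read_stata(path, iterator=True) as reader:
    labels = reader.variable_labels()
    df = reader.read()

# Put dtype and label side by side, roughly like Stata's describe.
summary = pd.DataFrame({"dtype": df.dtypes.astype(str), "label": pd.Series(labels)})
print(summary)
```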
Edited:
OK, if I understand you correctly, you are looking for frequency counts?
If so, .value_counts() should do the trick.
df = pd.read_stata("http://www.principlesofeconometrics.com/stata/alcohol.dta")
adults_values = df.adults.value_counts().sort_index().to_frame()
print(adults_values)
adults
1 247
2 562
3 133
4 49
5 8
6 1
kids_values = df.kids.value_counts().sort_index()
print(kids_values)
kids
0 626
1 133
2 158
3 61
4 20
5 2
Variable Descriptions
.info() gives you information on the data types of the variables in each column (int8, int64, etc.).
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1000 entries, 0 to 999
Data columns (total 4 columns):
adults 1000 non-null int8
kids 1000 non-null int8
income 1000 non-null int16
consume 1000 non-null int8
dtypes: int16(1), int8(3)
memory usage: 12.7 KB
Hope this helps.
I am formatting a dataframe and need both a thousands separator and two decimal places. The problem is that when I combine the two formats, only the last one takes effect. I suspect many people have run into the same confusion, but I have searched a lot and found nothing.
I tried .map(lambda x:('%.2f')%x and format(x,',')) to combine the two required formats, but only the last one takes effect:
DF_T_1_EQUITY_CHANGE_Summary_ADE['Sum of EQUITY_CHANGE'].map(lambda x:format(x,',') and ('%.2f')%x)
DF_T_1_EQUITY_CHANGE_Summary_ADE['Sum of EQUITY_CHANGE'].map(lambda x:('%.2f')%x and format(x,','))
the first result is:
0 -2905.22
1 -6574.62
2 -360.86
3 -3431.95
Name: Sum of EQUITY_CHANGE, dtype: object
the second result is:
0 -2,905.2200000000003
1 -6,574.62
2 -360.86
3 -3,431.9500000000003
Name: Sum of EQUITY_CHANGE, dtype: object
I tried a new way, by using
DF_T_1_EQUITY_CHANGE_Summary_ADE.to_string(formatters={'style1': '${:,.2f}'.format})
the result is:
Row Labels Sum of EQUITY_CHANGE Sum of TRUE_PROFIT Sum of total_cost Sum of FOREX VOL Sum of BULLION VOL Oil Sum of CFD VOL Sum of BITCOIN VOL Sum of DEPOSIT Sum of WITHDRAW Sum of IN/OUT
0 ADE A BOOK USD -2,905.2200000000003 638.09 134.83 15.590000000000002 2.76 0.0 0.0 0 0.0 0.0 0.0
1 ADE B BOOK USD -6,574.62 -1,179.3299999999997 983.2099999999999 21.819999999999997 30.979999999999993 72.02 0.0 0 8,166.9 0.0 8,166.9
2 ADE A BOOK AUD -360.86 235.39 64.44 5.369999999999999 0.0 0.0 0.0 0 700.0 0.0 700.0
3 ADE B BOOK AUD -3,431.9500000000003 190.66 88.42999999999999 11.88 3.14 0.03 2.0 0 20,700.0 -30,000.0 -9,300.0
The result confuses me, because the .2f format I set has no effect.
Using the format specification mini-language, you can add thousands separators and set the number of decimals to two with the format spec ',.2f', for example f'{x:,.2f}'.
import pandas as pd
df = pd.DataFrame({'EQUITY_CHANGE': [-2905.219262257907,
-6574.619531995241,
-360.85959369471186,
-3431.9499712161164]}
)
df.EQUITY_CHANGE.apply(lambda x: f'{x:,.2f}')
# returns:
0 -2,905.22
1 -6,574.62
2 -360.86
3 -3,431.95
Name: EQUITY_CHANGE, dtype: object
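The same format spec can be used with to_string. The formatters argument is a mapping keyed by column name, so a key such as 'style1' that does not match any column leaves every column unformatted. This is a minimal sketch on a cut-down frame; the two column names are taken from the question and the values are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "Sum of EQUITY_CHANGE": [-2905.2200000000003, -6574.62],
    "Sum of DEPOSIT": [0.0, 8166.9],
})

# One spec does both jobs: ',' adds the thousands separator, '.2f' fixes two decimals.
fmt = "{:,.2f}".format

# Apply the formatter to every column by using the real column names as keys.
text = df.to_string(formatters={col: fmt for col in df.columns})
print(text)

# The same callable also works with map on a single column.
formatted = df["Sum of EQUITY_CHANGE"].map(fmt)
print(formatted)
```

The attempts with `and` could not work because `x and y` evaluates to `y` whenever `x` is truthy, so only the second formatted string was ever returned.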
The map method does not work in place; it does not modify the Series but returns a new one.
So just assign the result of map back to the original column.
Here is the documentation:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.map.html
This is a sample of a pandas dataframe I have. I need to find the row for a given bid. For instance, given bid = 5, I need to return the corresponding row of the following table. If I enter a missing bid, for instance bid = 6, then the row corresponding to the largest bid smaller than the input bid should be returned, so in that case the row for bid = 5. How do I do this in pandas?
Bid Imp Click Spend
3 13 0.97 2
4 13 1.89 7
5 79 34.98 130
7 83 37.52 140
8 88 38.52 144
I think this could do the trick:
>>> df[(df['Bid']<=5)].iloc[-1,:]
Bid 5.00
Imp 79.00
Click 34.98
Spend 130.00
Name: 2, dtype: float64
If you want a pandas DataFrame instead of a Series, just do df[(df['Bid']<=5)].iloc[-1,:].to_frame().T.
>>> df[(df['Bid']<=5)].iloc[-1,:].to_frame().T
Bid Imp Click Spend
2 5.0 79.0 34.98 130.0
For the case of the missing bid=6, df[(df['Bid']<=6)].iloc[-1,:].to_frame().T would return the nearest bid below 6, which is, again, 5.
>>> df[(df['Bid']<=6)].iloc[-1,:].to_frame().T
Bid Imp Click Spend
2 5.0 79.0 34.98 130.0
EDITED
To make sure that the dataframe has Bid in ascending order, sort it first:
>>> df = df.sort_values(by='Bid',ascending=True)
Here is a generator-based method. enumerate pairs each matching bid with its position, so max returns the last pair once the generator is exhausted, and [1] extracts its bid.
df = df.sort_values('Bid')
df.loc[df['Bid'] == max(enumerate(i for i in df['Bid'] if i <= 6))[1]]
Bid Imp Click Spend
2 5 79 34.98 130
The above method is slow for large dataframes and only marginally faster for small ones. As an alternative, you can use this pandas-based solution (index[-1] is a label, hence .loc):
df.loc[df[df['Bid'] <= 6].index[-1]]
Try
def get_bid(val):
    # find the index of the maximum bid below or equal to val
    index = df.loc[df.Bid <= val, 'Bid'].idxmax()
    return df.loc[[index]]
Here is the result of calling the function with the values 6, 5 and 4 respectively:
In []: get_bid(6)
Out[]:
Bid Imp Click Spend
2 5 79 34.98 130
In []: get_bid(5)
Out[]:
Bid Imp Click Spend
2 5 79 34.98 130
In []: get_bid(4)
Out[]:
Bid Imp Click Spend
1 4 13 1.89 7
PS: if you prefer one-liners, you can use the code in In [1], which produces the same output as above, i.e. a dataframe. Removing the double brackets (In [2]) changes the output to a Series, i.e.:
In [1]: val = 6
df.loc[[df.loc[df.Bid <= val, 'Bid'].idxmax()]]
Out[1]:
Bid Imp Click Spend
2 5 79 34.98 130
In [2]: df.loc[df.loc[df.Bid <= val, 'Bid'].idxmax()]
Out[2]:
Bid 5.00
Imp 79.00
Click 34.98
Spend 130.00
Name: 2, dtype: float64
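The answers above do not cover a bid smaller than every bid in the table. A binary search with Series.searchsorted makes that case explicit. This is a minimal sketch; the function name row_at_or_below is made up, and returning an empty frame when nothing qualifies is an assumption about the desired behaviour.

```python
import pandas as pd

df = pd.DataFrame({"Bid": [3, 4, 5, 7, 8],
                   "Imp": [13, 13, 79, 83, 88],
                   "Click": [0.97, 1.89, 34.98, 37.52, 38.52],
                   "Spend": [2, 7, 130, 140, 144]})


def row_at_or_below(frame, bid):
    """Return the row with the largest Bid <= bid as a one-row DataFrame."""
    ordered = frame.sort_values("Bid")
    # side="right" gives the position just after the last Bid <= bid.
    pos = ordered["Bid"].searchsorted(bid, side="right") - 1
    if pos < 0:
        # Assumption: no qualifying bid yields an empty frame instead of an error.
        return ordered.iloc[0:0]
    return ordered.iloc[[pos]]


print(row_at_or_below(df, 6))
print(row_at_or_below(df, 2))
```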