combine two formats together - python

I am formatting a dataframe. I need both a thousands separator and two decimal places. The problem is that when I combine the two formats, only the last one takes effect. I suspect many people have the same confusion, but I have googled a lot and found nothing.
I tried to use .map(lambda x: ('%.2f') % x and format(x, ',')) to combine the two required formats, but only the last one takes effect:
DF_T_1_EQUITY_CHANGE_Summary_ADE['Sum of EQUITY_CHANGE'].map(lambda x:format(x,',') and ('%.2f')%x)
DF_T_1_EQUITY_CHANGE_Summary_ADE['Sum of EQUITY_CHANGE'].map(lambda x:('%.2f')%x and format(x,','))
the first result is:
0 -2905.22
1 -6574.62
2 -360.86
3 -3431.95
Name: Sum of EQUITY_CHANGE, dtype: object
the second result is:
0 -2,905.2200000000003
1 -6,574.62
2 -360.86
3 -3,431.9500000000003
Name: Sum of EQUITY_CHANGE, dtype: object
I tried a new way, by using
DF_T_1_EQUITY_CHANGE_Summary_ADE.to_string(formatters={'style1': '${:,.2f}'.format})
the result is:
Row Labels Sum of EQUITY_CHANGE Sum of TRUE_PROFIT Sum of total_cost Sum of FOREX VOL Sum of BULLION VOL Oil Sum of CFD VOL Sum of BITCOIN VOL Sum of DEPOSIT Sum of WITHDRAW Sum of IN/OUT
0 ADE A BOOK USD -2,905.2200000000003 638.09 134.83 15.590000000000002 2.76 0.0 0.0 0 0.0 0.0 0.0
1 ADE B BOOK USD -6,574.62 -1,179.3299999999997 983.2099999999999 21.819999999999997 30.979999999999993 72.02 0.0 0 8,166.9 0.0 8,166.9
2 ADE A BOOK AUD -360.86 235.39 64.44 5.369999999999999 0.0 0.0 0.0 0 700.0 0.0 700.0
3 ADE B BOOK AUD -3,431.9500000000003 190.66 88.42999999999999 11.88 3.14 0.03 2.0 0 20,700.0 -30,000.0 -9,300.0
This result confuses me, as I set the .2f format but it is not in effect.

Using the format-spec mini-language you can add the thousands separator and round to 2 decimal places in a single step with f'{x:,.2f}'. Combining two formats with and cannot work: x and y simply evaluates to y whenever x is truthy (and a non-empty string always is), so only the last format in your lambda ever takes effect.
import pandas as pd

df = pd.DataFrame({'EQUITY_CHANGE': [-2905.219262257907,
                                     -6574.619531995241,
                                     -360.85959369471186,
                                     -3431.9499712161164]})
df.EQUITY_CHANGE.apply(lambda x: f'{x:,.2f}')
# returns:
0 -2,905.22
1 -6,574.62
2 -360.86
3 -3,431.95
Name: EQUITY_CHANGE, dtype: object
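As for the to_string attempt: the keys of the formatters dict must be actual column names, and since 'style1' is not a column it is silently ignored, which is why the .2f format had no effect there. A minimal sketch with the real column name from your frame:
DF_T_1_EQUITY_CHANGE_Summary_ADE.to_string(formatters={'Sum of EQUITY_CHANGE': '${:,.2f}'.format})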

The map method is not in-place; it doesn't modify the Series but instead returns a new one.
So just assign the result of map back to the original column.
Docs:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.map.html
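For example, a minimal sketch using the column from your post (one format spec handles both the comma and the two decimals):
DF_T_1_EQUITY_CHANGE_Summary_ADE['Sum of EQUITY_CHANGE'] = (
    DF_T_1_EQUITY_CHANGE_Summary_ADE['Sum of EQUITY_CHANGE'].map(lambda x: f'{x:,.2f}'))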

Related

How to calculate percentage of column using pandas pivot_table()

I am attempting to get the frequency of objects per version expressed as a percentage.
Input:
dfsof = pd.read_clipboard()
file version object
path1 1.0 name
path1 1.0 session
path1 1.0 sequence
path2 2.01 name
path2 2.01 session
path2 2.01 sequence
path3 2.01 name
path3 2.01 session
path3 2.01 earthworm
Using the following, I am able to get frequency of each file.
dfsof.pivot_table(index=['object'], values=['file'], columns=['version'], aggfunc=len, fill_value=0, margins=True)
file
version 1.0 2.01 All
object
earthworm 0 1 1
name 1 2 3
sequence 1 1 2
session 1 2 3
All 3 6 9
I want to divide each count per object/version by the total number of distinct files for that version. Using the expected return table as an example, earthworm shows up in the input only once for version 2.01, so I expect 0% for version 1.0 and 50% for version 2.01, since only one of the two files for that version has that value.
dfsof.groupby('version')['file'].nunique() returns the number of distinct files per version, which is the denominator for each object/version cell in the table above. What I am struggling with is how to apply these denominator values to the pivot_table. I have seen examples of this using grand totals and subtotals, but I can't figure out how to divide by the number of unique files per version. Any help would be greatly appreciated.
version
1.00 1
2.01 2
Expected return
path
version 1.0 2.01 All
object
earthworm 0% 50% 1
name 100% 100% 3
sequence 100% 50% 2
session 100% 100% 3
All 3 6 9
IIUC, you need to aggregate as 'nunique', then perform some recomputation:
# aggregate using nunique
out = dfsof.pivot_table(index=['object'], values=['file'], columns=['version'],
                        aggfunc='nunique', fill_value=0, margins=True)
# compute and save sum of nunique
All = out.iloc[:-1].sum()
# update values as percentage
out.update(out.iloc[:-1, :-1].div(out.iloc[-1, :-1]).mul(100))
# assign sum of nunique
out.loc['All'] = All
print(out)
Output:
file
version 1.0 2.01 All
object
earthworm 0.0 50.0 1
name 100.0 100.0 3
sequence 100.0 50.0 2
session 100.0 100.0 3
All 3.0 6.0 9
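If you also want the cells rendered with a % sign, as in the expected table, one optional cosmetic step (an assumption on my part, not part of the computation above) is to format the inner cells as strings afterwards:
# keep the 'All' row and column as plain counts, format the rest as percentages
disp = out.copy().astype(object)
disp.iloc[:-1, :-1] = out.iloc[:-1, :-1].applymap(lambda v: f'{v:.0f}%').to_numpy()
print(disp)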

Problem iterating over columns and rows of a DataFrame

Here is my problem:
Let's say you have to buy and sell two objects under the following conditions:
You buy object A or B if its price goes below 150 (< 150), and you can buy fractions of an object (so decimals are allowed).
If the following day the object is still below 150, you just keep the object and do nothing.
If the object is at or above 150, you sell the object and take the profits.
You start the game with 10000$.
Here is the DataFrame with all the prices
df = pd.DataFrame({'Date': ['2017-05-19', '2017-05-22', '2017-05-23', '2017-05-24', '2017-05-25', '2017-05-26', '2017-05-29'],
                   'A': [153, 147, 149, 155, 145, 147, 155],
                   'B': [139, 152, 141, 141, 141, 152, 152]})
df['Date'] = pd.to_datetime(df['Date'])
df = df.set_index('Date')
The goal is to return a DataFrame with the quantity of A and B you hold and the amount of cash you have left.
When the buy condition is met, the allocation for each object is half of the cash you have if you don't hold any object (weight = 1/2) and all of it if you already hold one object (weight = 1).
Let's look at df first; I will also walk through the new dataframe that I'm trying to create (let's call it df_end):
On 2017-05-19, object A is 153$ and B is 139$: you buy 35.97 of object B (= 5000/139) as the price is < 150 --> you have 5000$ left in cash.
On 2017-05-22, object A is 147$ and B is 152$: you buy 34.01 of object A (= 5000/147) as the price is < 150, and you sell the 35.97 of object B at 152$ as it is >= 150 --> you now have 5467.44$ left in cash thanks to the sale of B.
On 2017-05-23, object A is 149$ and B is 141$: you keep your position in object A (34.01) as it is still below 150, and you buy 38.77 of object B (= 5467.44/141) as the price is < 150 --> you now have 0$ left in cash.
On 2017-05-24, object A is 155$ and B is 141$: you sell the 34.01 of object A at 155$ as it is above 150$, and you keep the 38.77 of object B as it is still below 150 --> you now have 5271.55$ left in cash thanks to the sale of A.
On 2017-05-25, object A is 145$ and B is 141$: you buy 36.35 of object A (= 5271.55/145) as it is below 150, and you keep the 38.77 of object B as it is still below 150 --> you now have 0$ in cash.
On 2017-05-26, object A is 147$ and B is 152$: you sell the 38.77 of object B at 152 as it is above 150, and you keep the 36.35 of object A as it is still below 150 --> you now have 5893.04$ in cash thanks to the sale of B.
On 2017-05-29, object A is 155$ and B is 152$: you sell the 36.35 of object A at 155 as it is above 150 and do nothing else as B is not below 150 --> you now have 11527.29$ in cash thanks to the sale of A.
Hence, the new dataframe df_end should look like this (this is the result I am looking for):
A B Cash
Date
2017-05-19 0 35.97 5000
2017-05-22 34.01 0 5467.64
2017-05-23 34.01 38.77 0
2017-05-24 0 38.77 5272.11
2017-05-25 36.35 38.77 0
2017-05-26 36.35 0 5893.04
2017-05-29 0 0 11527.29
My main problem is that we have to iterate over both rows and columns, and that is the most difficult part.
I have been trying to find a solution for a week but still have no idea how to do it, which is why I tried to explain it as clearly as possible.
So if somebody has an idea on this issue, you are very welcome.
Thank you so much.
You could try this:
import pandas as pd

df = pd.DataFrame({'Date': ['2017-05-19', '2017-05-22', '2017-05-23', '2017-05-24', '2017-05-25', '2017-05-26', '2017-05-29'],
                   'A': [153, 147, 149, 155, 145, 147, 155],
                   'B': [139, 152, 141, 141, 141, 152, 152]})
df['Date'] = pd.to_datetime(df['Date'])
df = df.set_index('Date')
print(df)

# Values before iterations
EntryCash = 10000
newdata = []
holding = False

# First iteration (initial conditions)
firstrow = df.to_records()[0]
possibcash = EntryCash if holding else EntryCash / 2
prevroa = possibcash / firstrow[1] if firstrow[1] <= 150 else 0
prevrob = possibcash / firstrow[2] if firstrow[2] <= 150 else 0
holding = any(i != 0 for i in [prevroa, prevrob])
newdata.append([firstrow[0], prevroa, prevrob, possibcash])

# Other iterations
for row in df.to_records()[1:]:
    possibcash = possibcash if holding else possibcash / 2
    a = row[1]
    b = row[2]
    if a > 150:
        if prevroa > 0:
            possibcash += prevroa * a
            a = 0
        else:
            a = prevroa
    else:
        if prevroa == 0:
            a = possibcash / a
            possibcash = 0
        else:
            a = prevroa
    if b > 150:
        if prevrob > 0:
            possibcash += prevrob * b
            b = 0
        else:
            b = prevrob
    else:
        if prevrob == 0:
            b = possibcash / b
            possibcash = 0
        else:
            b = prevrob
    prevroa = a
    prevrob = b
    newdata.append([row[0], a, b, possibcash])
    holding = any(i != 0 for i in [a, b])

df_end = pd.DataFrame(newdata, columns=[df.index.name] + list(df.columns) + ['Cash']).set_index('Date')
print(df_end)
Output:
df
A B
Date
2017-05-19 153 139
2017-05-22 147 152
2017-05-23 149 141
2017-05-24 155 141
2017-05-25 145 141
2017-05-26 147 152
2017-05-29 155 152
df_end
A B Cash
Date
2017-05-19 0.000000 35.971223 5000.000000
2017-05-22 34.013605 0.000000 5467.625899
2017-05-23 34.013605 38.777489 0.000000
2017-05-24 0.000000 38.777489 5272.108844
2017-05-25 36.359371 38.777489 0.000000
2017-05-26 36.359371 0.000000 5894.178274
2017-05-29 0.000000 0.000000 11529.880831
If you want it rounded to two decimals, you can add:
df_end=df_end.round(decimals=2)
df_end:
A B Cash
Date
2017-05-19 0.00 35.97 5000.00
2017-05-22 34.01 0.00 5467.63
2017-05-23 34.01 38.78 0.00
2017-05-24 0.00 38.78 5272.11
2017-05-25 36.36 38.78 0.00
2017-05-26 36.36 0.00 5894.18
2017-05-29 0.00 0.00 11529.88
Slight differences in the final values
It is slightly different from your desired output because sometimes you rounded the values to two decimals and sometimes you didn't. For example:
In your second row you put:
#second row
2017-05-22 34.01 0 5467.64
That means you used the full value of object B from the first row, which is 35.971223, not 35.97:
35.97*152
Out[120]: 5467.44
35.971223*152
Out[121]: 5467.6258960000005 #---->closest to 5467.64
And at row 3, again you used the real value, not the rounded:
#row 3
2017-05-24 0 38.77 5272.11
#Values
34.013605*155
Out[122]: 5272.108775
34.01*155
Out[123]: 5271.549999999999
And finally, at the last two rows you used the rounded value, I guess, because:
#last two rows
2017-05-26 36.35 0 5893.04
2017-05-29 0 0 11527.29
#cash values
#penultimate row, cash value
38.777489*152
Out[127]: 5894.178328
38.77*152
Out[128]: 5893.040000000001
#last row, cash value
5893.04+(155*36.35)
Out[125]: 11527.29 #----> your value, built from the rounded numbers
5894.178274+(155*36.359371)
Out[126]: 11529.880779 #----> the unrounded result, matching df_end

How can I build a faster decaying average, comparing a data frame row's date field to the dates of other rows?

I am clumsy but adequate with python. I have referenced stack often, but this is my first question. I have built a decaying average function to act on a pandas data frame with about 10000 rows, but it takes 40 minutes to run. I would appreciate any thoughts on how to speed it up. Here is a sample of actual data, simplified a bit.
import pandas as pd

sub = pd.DataFrame({
    'user_id': [101, 101, 101, 101, 101, 102, 101],
    'class_section': ['Modern Biology - B', 'Spanish Novice 1 - D', 'Modern Biology - B', 'Spanish Novice 1 - D', 'Spanish Novice 1 - D', 'Modern Biology - B', 'Spanish Novice 1 - D'],
    'sub_skill': ['A', 'A', 'B', 'B', 'B', 'B', 'B'],
    'rating': [2.0, 3.0, 3.0, 2.0, 3.0, 2.0, 2.0],
    'date': ['2019-10-16', '2019-09-04', '2019-09-04', '2019-09-04', '2019-09-13', '2019-10-16', '2019-09-05']})
For this data frame:
sub
Out[716]:
user_id class_section sub_skill rating date
0 101 Modern Biology - B A 2.0 2019-10-16
1 101 Spanish Novice 1 - D A 3.0 2019-09-04
2 101 Modern Biology - B B 3.0 2019-09-04
3 101 Spanish Novice 1 - D B 2.0 2019-09-04
4 101 Spanish Novice 1 - D B 3.0 2019-09-13
5 102 Modern Biology - B B 2.0 2019-10-16
6 101 Spanish Novice 1 - D B 2.0 2019-09-05
A decaying average gives the most recent event that meets the conditions full weight and weights each previous event by a multiplier less than one; in this case the multiplier is 0.667, and already-weighted events are weighted again each time a newer event arrives.
So the decaying average for user 101's rating in Spanish sub_skill B is:
(2.0*0.667^2 + 2.0*0.667^1 + 3.0*0.667^0) / (0.667^2 + 0.667^1 + 0.667^0) = 2.4735
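A quick sanity check of that arithmetic in Python (events ordered oldest to newest):
weights = [0.667**2, 0.667**1, 0.667**0]
ratings = [2.0, 2.0, 3.0]
print(sum(w * r for w, r in zip(weights, ratings)) / sum(weights))  # ~2.4735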
Here is what I tried, after reading a helpful post on weighted averages
sub['date'] = pd.to_datetime(sub.date)

def func(date, user_id, class_section, sub_skill):
    return sub.apply(lambda row: row['date'] > date
                     and row['user_id'] == user_id
                     and row['class_section'] == class_section
                     and row['sub_skill'] == sub_skill, axis=1).sum()

# for some reason this next line of code took about 40 minutes to run on 9000 rows:
sub['decay_count'] = sub.apply(lambda row: func(row['date'], row['user_id'], row['class_section'], row['sub_skill']), axis=1)
# calculate decay factor:
sub['decay_weight'] = sub.apply(lambda row: 0.667**row['decay_count'], axis=1)
# calculate decay average contributors (still needs to be summed):
g = sub.groupby(['user_id', 'class_section', 'sub_skill'])
sub['decay_avg'] = sub.decay_weight / g.decay_weight.transform("sum") * sub.rating
# new dataframe with indicator/course summaries as decaying average (note the sum):
indicator_summary = g.decay_avg.sum().to_frame(name='DAvg').reset_index()
I frequently work in pandas and I am used to iterating through large datasets. I would have expected this to take rows-squared time, but it is taking much longer. A more elegant solution or some advice to speed it up would be really appreciated!
Some background on this project: I am trying to automate the conversion from proficiency-based grading into a classic course grade for my school. I have the process of data extraction from our Learning Management System into a spreadsheet that does the decaying average and then posts the information to teachers, but I would like to automate the whole process and extract myself from it. The LMS is slow to implement a proficiency-based system and is reluctant to provide a conversion - for good reason. However, we have to communicate both student proficiencies and our conversion to a traditional grade to parents and colleges since that is a language they speak.
Why not use groupby? The idea here is that you rank the dates within the group in descending order and subtract 1 (because rank starts with 1). That seems to mirror your logic in func above, without having to try to call apply with a nested apply.
sub['decay_count'] = sub.groupby(['user_id', 'class_section', 'sub_skill'])['date'].rank(method='first', ascending=False) - 1
sub['decay_weight'] = sub['decay_count'].apply(lambda x: 0.667 ** x)
Output:
sub.sort_values(['user_id', 'class_section', 'sub_skill', 'decay_count'])
user_id class_section sub_skill rating date decay_count decay_weight
0 101 Modern Biology - B A 2.0 2019-10-16 0.0 1.000000
2 101 Modern Biology - B B 3.0 2019-09-04 0.0 1.000000
1 101 Spanish Novice 1 - D A 3.0 2019-09-04 0.0 1.000000
3 101 Spanish Novice 1 - D B 2.0 2019-09-04 0.0 1.000000
6 101 Spanish Novice 1 - D B 2.0 2019-09-05 1.0 0.667000
4 101 Spanish Novice 1 - D B 3.0 2019-09-13 2.0 0.444889
5 102 Modern Biology - B B 2.0 2019-10-16 0.0 1.000000
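From there, the remaining steps from the question complete the decayed average; a minimal sketch reusing the asker's own groupby/transform logic:
g = sub.groupby(['user_id', 'class_section', 'sub_skill'])
sub['decay_avg'] = sub['decay_weight'] / g['decay_weight'].transform('sum') * sub['rating']
indicator_summary = (sub.groupby(['user_id', 'class_section', 'sub_skill'])['decay_avg']
                     .sum().to_frame(name='DAvg').reset_index())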

Select value from dataframe based on other dataframe

I am trying to calculate the position of an object based on a timestamp. For this I have two dataframes in pandas: one for the measurement data and one for the position. All the movement is straightforward acceleration.
Dataframe 1 contains the measurement data:
ms force ... ... ...
1 5 20
2 10 20
3 15 25
4 20 30
5 25 20
..... (~ 6000 lines)
Dataframe 2 contains "positioning data"
ms speed (m/s)
1 0 0.66
2 4500 0.66
3 8000 1.3
4 16000 3.0
5 20000 3.0
.....(~300 lines)
Now I want to calculate the position for the first dataframe using the data from the second dataframe.
In Excel I solved the problem with an array formula, but now I have to use Python/pandas and I can't find a way to select the correct row from dataframe 2.
My idea is to do something like this: if
In the end I want to display a graph of "force <-> distance" instead of "force <-> time".
Thank you in advance.
==========================================================================
Update:
In the meantime I have almost solved my issue. Now my data look like this:
Dataframe 2 (Speed Data):
pos v a t t-end t-start
0 -3.000 0.666667 0.000000 4.500000 4.500000 0.000000
1 0.000 0.666667 0.187037 0.071287 4.571287 4.500000
2 0.048 0.680000 0.650794 0.010244 4.581531 4.571287
3 0.055 0.686667 0.205432 0.064904 4.646435 4.581531
...
15 0.055 0.686667 0.5 0.064904 23.0 20.0
...
28 0.055 0.686667 0.6 0.064904 35.0 34.0
...
30 0.055 0.686667 0.9 0.064904 44.0 39.0
And Dataframe 1 (time based measurement):
Fx Fy Fz abs_t expected output ('a' from DF2)
0 -13.9 170.3 45.0 0.005 0.000000
1 -14.1 151.6 38.2 0.010 0.000000
...
200 -14.1 131.4 30.4 20.015 0.5
...
300 -14.3 111.9 21.1 34.01 0.6
...
400 -14.5 95.6 13.2 40.025
So I want to check the time (abs_t) from DF1 and look up the correct 'a' in DF2.
Something like this (pseudocode):
if DF1['abs_t'] is between DF2['t-start'] and DF2['t-end']:
    DF1['a'] = DF2['a']
I could write two for loops, but that looks like the wrong way and is very, very slow.
I hope you understand my problem; providing a running sample is very hard.
In Excel I did it like this:
I found a very slow solution, but at least it's working :(
df1['a'] = 0
for index, row in df2.iterrows():
    start = row['t-start']
    end = row['t-end']
    a = row['a']
    df1.loc[(df1['abs_t'] > start) & (df1['abs_t'] < end), 'a'] = a
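A vectorized alternative may be much faster. This is only a sketch, assuming the intervals in df2 do not overlap and using the column names shown in the posted tables ('abs_t', 't-start', 't-end', 'a'):
import pandas as pd
# build one interval per row of df2, then map each timestamp in df1 to the interval it falls into
intervals = pd.IntervalIndex.from_arrays(df2['t-start'], df2['t-end'], closed='left')
idx = intervals.get_indexer(df1['abs_t'])  # -1 where a timestamp matches no interval
df1['a'] = df2['a'].to_numpy()[idx]
df1.loc[idx == -1, 'a'] = 0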

How do I iterate over a DataFrame when apply won't work without a for loop?

I am trying to find the best way to apply my function to each individual row of a pandas DataFrame without using iterrows() or itertuples(). Note that I am pretty sure apply() will not work in this case.
Here are the first 5 rows of the DataFrame that I'm working with:
In [2470]: home_df.head()
Out[2470]:
GameId GameId_real team FTHG FTAG homeElo awayElo homeGame
0 0 -1 Charlton 1.0 2.0 1500.0 1500.0 1
1 1 -1 Derby 2.0 1.0 1500.0 1500.0 1
2 2 -1 Leeds 2.0 0.0 1500.0 1500.0 1
3 3 -1 Leicester 0.0 5.0 1500.0 1500.0 1
4 4 -1 Liverpool 2.0 1.0 1500.0 1500.0 1
Here is my function and the code that I am currently using:
def wt_goals_elo(df, game_id_row, team_row):
    wt_goals = (df[(df.GameId < game_id_row) & (df.team == team_row)]
                .pipe(lambda df:
                      (df.awayElo * df.FTHG).sum() / df.awayElo.sum()))
    return wt_goals

game_id_idx = home_df.columns.get_loc('GameId')
team_idx = home_df.columns.get_loc('team')
wt_goals = [wt_goals_elo(home_df, row[game_id_idx + 1], row[team_idx + 1])
            for row in home_df.itertuples()]
FTHG = Full time home goals.
I am basically trying to find the weighted average of full time home goals, weighted by away elo for previous games. I can do this using a for loop but am unable to do it using apply, as I need to refer to the original DataFrame to filter by GameId and team.
Any ideas?
Thanks so much in advance.
I believe you need:
def wt_goals_elo(game_id_row, team_row):
    print(game_id_row)
    wt_goals = (home_df[(home_df.GameId.shift() < game_id_row) &
                        (home_df.team.shift() == team_row)]
                .pipe(lambda df:
                      (df.awayElo * df.FTHG).sum() / df.awayElo.sum()))
    return wt_goals

home_df['w'] = home_df.apply(lambda x: wt_goals_elo(x['GameId'], x['team']), axis=1)
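This works because home_df is referenced from the enclosing scope inside wt_goals_elo, so apply only needs to pass each row's GameId and team. If the intent is strictly "all previous games", as in the question's original function, the same pattern works without the shift (a sketch reusing the question's own filter):
def wt_goals_elo(game_id_row, team_row):
    prev = home_df[(home_df.GameId < game_id_row) & (home_df.team == team_row)]
    # NaN for a team's first game, since there is nothing to average yet
    return (prev.awayElo * prev.FTHG).sum() / prev.awayElo.sum()

home_df['w'] = home_df.apply(lambda x: wt_goals_elo(x['GameId'], x['team']), axis=1)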
