pandas dataframe: perform calculations on columns - python

New to pandas and new to stackoverflow (really), any suggestions are highly appreciated!
I have this dataframe df:
col1 col2 col3
Date
2017-08-24 100 101 105
2017-08-23 102 102 107
2017-08-22 101 100 106
2017-08-21 103 99 106
2017-08-18 103 98 108
...
Now I'd like to perform some calculations with the values of each column, e.g. calculate the logarithm of each value.
I thought it's a good idea to loop over the columns and create a new temporary data frame with the resulting columns.
This new data frame should look like this e.g.:
col1 RN LOG
Date
2017-08-24 100 1 2
2017-08-23 102 2 2,008600
2017-08-22 101 3 2,004321
2017-08-21 103 4 2,012837
2017-08-18 103 5 2,012837
So I tried this for-loop:
for column in df:
    tmp_df = df[column]
    tmp_df['RN'] = range(1, len(tmp_df) + 1)  # to create a new column with the row number
    tmp_df['LOG'] = np.log(df[column])  # to create a new column with the LOG
However this doesn't print the new columns next to col1, but one below the other. The result looks like this:
Name: col1, Length: 86, dtype: object
Date
2017-08-24 00:00:00 100
2017-08-23 00:00:00 102
2017-08-22 00:00:00 101
2017-08-21 00:00:00 103
2017-08-18 00:00:00 103
RN,"range(1, 86)"
LOG,"Date
2017-08-24 2
2017-08-23 2,008600
2017-08-22 2,004321
2017-08-21 2,012837
2017-08-18 2,012837
00:00:00 was added to the date in the first part...
I also tried something with assign:
tmp_df = tmp_df.assign(LN=np.log(df[column]))
But this results in "AttributeError: 'Series' object has no attribute 'assign'".
It'd really be great if someone could point me in the right direction.
Thanks!

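Your for loop is a good idea, but df[column] is a Series, so tmp_df['RN'] = ... adds an entry labelled 'RN' to that Series instead of a new column. Write the new values into columns of the DataFrame itself:
for column in list(df.columns):
    # assign plain values; a pd.Series with a default 0..n-1 index would not align with the Date index and would give NaN
    df['RN ' + column] = range(1, len(df) + 1)
    # np.log10 matches the expected output (log10(100) = 2); np.log would give the natural logarithm
    df['Log ' + column] = np.log10(df[column])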

Now I figured it out. :)
import pandas as pd
import numpy as np
...
for column in df:
    tmp_res = pd.DataFrame(data=df[column])
    newcol = range(1, len(df) + 1)  # row numbers
    tmp_res = tmp_res.assign(RN=newcol)
    newcol2 = np.log10(df[column])  # base-10 log, which is what the LOG values in the output are
    tmp_res = tmp_res.assign(LOG=newcol2)
This prints all columns next to each other:
col1 RN LOG
Date
2017-08-24 100 1 2
2017-08-23 102 2 2.008600
2017-08-22 101 3 2.004321
2017-08-21 103 4 2.012837
2017-08-18 103 5 2.012837
Now I can go on processing them or put it all in a csv / excel file.
Thanks for all your suggestions!
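One thing to watch in the loop above is that tmp_res is overwritten on every pass, so only the last column's frame survives. A minimal sketch that keeps one frame per column in a dict; the helper name per_column_frames and the small sample frame are made up for illustration, and base-10 logs are assumed because that is what the expected output shows:

```python
import numpy as np
import pandas as pd

def per_column_frames(df):
    """Return {column name: DataFrame with the column, a row number and its log10}."""
    frames = {}
    for column in df.columns:
        # df[[column]] (double brackets) is a one-column DataFrame, so assign() is available
        frames[column] = df[[column]].assign(
            RN=np.arange(1, len(df) + 1),
            LOG=np.log10(df[column]),
        )
    return frames

# hypothetical sample data shaped like the frame in the question
df = pd.DataFrame(
    {"col1": [100, 102, 101], "col2": [101, 102, 100]},
    index=pd.to_datetime(["2017-08-24", "2017-08-23", "2017-08-22"]).rename("Date"),
)
frames = per_column_frames(df)
print(frames["col1"])
```

Each frame can then be processed further or written out, for example with frames["col1"].to_csv(...).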

Related

Drop Rows in Pandas DataFrame When Items in Column Match Items in List

I have a pandas df with 5181 rows and with a column of customer names and I have a separate list of 383 customer names from within that column whose corresponding rows I want to drop from the df. I tried to write a piece of code that would iterate through all the names in the customer column and drop each of the rows with customer names matching those on the list. My result is TypeError: 'NoneType' object is not subscriptable.
The list is called Retail_Customer_Tracking and the df is called df_final and looks like:
index Customer First_Order_Date Last_Order_Date
0 0 0 2022-09-15 2022-09-15
1 1 287 2018-02-19 2020-11-30
2 2 606 2017-10-31 2017-12-07
3 3 724 2021-12-28 2022-09-15
4 4 1025 2015-08-13 2015-08-13
... ... ... ... ...
5176 5176 tulips little pop up shop 2021-10-25 2022-10-08
5177 5177 unboxed 2021-06-24 2022-10-10
5178 5178 upMADE 2021-09-10 2022-03-31
5179 5179 victorias floral design 2021-07-12 2021-07-12
5180 5180 vintique marketplace 2021-03-16 2022-10-15
5181 rows × 4 columns
The code I wrote looks like this:
i = 0
for x in Retail_Customer_Tracking:
    while i < 5182:
        if df_final["Customer"].iloc[i] == x:
            df_final = df_final.drop(df_final[i], axis=0, inplace=True)
        else:
            i = i + 1
I was hoping that the revised df_final would not have the rows I wanted to drop...
I'm very new at coding and any help would be greatly appreciated. Thanks!
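The TypeError comes from combining inplace=True with an assignment: drop(..., inplace=True) returns None, so df_final becomes None and the next df_final["Customer"] lookup fails. A loop is not needed here. A minimal sketch using a boolean mask with isin, where the sample rows and the result name df_kept are made up for illustration:

```python
import pandas as pd

# small stand-in for the real 5181-row frame
df_final = pd.DataFrame({
    "Customer": ["unboxed", "upMADE", "vintique marketplace", "unboxed"],
    "Last_Order_Date": ["2022-10-10", "2022-03-31", "2022-10-15", "2022-10-11"],
})
Retail_Customer_Tracking = ["unboxed", "vintique marketplace"]

# True for every row whose customer appears in the list
to_drop = df_final["Customer"].isin(Retail_Customer_Tracking)

# keep only the rows that are NOT in the list, then renumber the index
df_kept = df_final[~to_drop].reset_index(drop=True)
print(df_kept)
```

If the names differ in case or surrounding whitespace, they would need normalising on both sides (for example with .str.strip().str.lower()) before the comparison.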

Column not found while renaming in pandas DataFrame

I have this pandas DataFrame
timestamp EG2021 EGH2021
2021-01-04 33 Nan
2021-02-04 45 65
I am trying to replace the column names with the new names mapped in an Excel file like this
OldId NewId
EG2021 LER_EG2021
EGH2021 LER_EGH2021
I tried the code below, but it does not work. I get this error:
KeyError: "None of [Index(['LER_EG2021', 'LER_EGH2021'], dtype='object', length=186)] are in the [columns]"
Code:
df = pd.ExcelFile('ids.xlsx').parse('Sheet1')
x=[]
x.append(df['external_ids'].to_list())
dtest_df = (my panda dataframe as mentioned above)
mapper = df.set_index(df['oldId'])[df['NewId']]
dtest_df.columns = dtest_df.columns.Series.replace(mapper)
Any idea what I am doing wrong?
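You need:
mapper = df.set_index('OldId')['NewId']
# fall back to the original name for columns that have no mapping (e.g. timestamp);
# columns.map(dict) would turn those into NaN
dtest_df.columns = [mapper.get(col, col) for col in dtest_df.columns]
Or:
dtest_df = dtest_df.rename(columns=df.set_index('OldId')['NewId'].to_dict())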
dtest_df output:
timestamp LER_EG2021 LER_EGH2021
0 2021-01-04 33 NaN
1 2021-02-04 45 65
Another way is to build a dict by zipping the old and new ids from df:
dtest_df.rename(columns=dict(zip(df['OldId'], df['NewId'])), inplace=True)
timestamp LER_EG2021 LER_EGH2021
0 2021-01-04 33 NaN
1 2021-02-04 45 65
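The rename approach can be tried end to end with in-memory stand-ins. In this sketch id_map is a hypothetical replacement for the sheet read from ids.xlsx, and the column names OldId and NewId are assumed to match the table shown in the question:

```python
import pandas as pd

dtest_df = pd.DataFrame({
    "timestamp": ["2021-01-04", "2021-02-04"],
    "EG2021": [33, 45],
    "EGH2021": [float("nan"), 65],
})

# stand-in for pd.read_excel('ids.xlsx', sheet_name='Sheet1')
id_map = pd.DataFrame({
    "OldId": ["EG2021", "EGH2021"],
    "NewId": ["LER_EG2021", "LER_EGH2021"],
})

# rename() leaves columns that are missing from the dict (here: timestamp) untouched
renamed = dtest_df.rename(columns=dict(zip(id_map["OldId"], id_map["NewId"])))
print(renamed.columns.tolist())
```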

How can I perform linear regression by groups of different sizes?

I have 2 tables.
Table A has 105 rows:
bbgid dt weekly_price_per_stock weekly_pct_change
0 BBG000J9HHN8 2018-12-31 13562.328 0.000000
1 BBG000J9HHN8 2019-01-07 34717.536 1.559851
2 BBG000J9HHN8 2019-01-14 28300.218 -0.184844
3 BBG000J9HHN8 2019-01-21 35370.134 0.249818
4 BBG000J9HHN8 2019-01-28 36104.512 0.020763
... ... ... ... ...
100 BBG000J9HHN8 2020-11-30 62065.827 0.278765
101 BBG000J9HHN8 2020-12-07 62145.445 0.001283
102 BBG000J9HHN8 2020-12-14 63516.146 0.022056
103 BBG000J9HHN8 2020-12-21 51283.187 -0.192596
104 BBG000J9HHN8 2020-12-28 51306.951 0.000463
Table B has 257970 rows:
bbgid dt weekly_price_per_stock weekly_pct_change
0 BBG000B9WJ55 2018-12-31 34.612737 0.000000
1 BBG000B9WJ55 2019-01-07 70.618471 1.040245
2 BBG000B9WJ55 2019-01-14 89.123337 0.262040
3 BBG000B9WJ55 2019-01-21 90.377643 0.014074
4 BBG000B9WJ55 2019-01-28 90.527678 0.001660
... ... ... ... ...
257965 BBG00YFR2NJ6 2020-12-21 30.825000 -0.251275
257966 BBG00YFR2NJ6 2020-12-28 40.960000 0.328792
257967 BBG00YM46B38 2020-12-14 0.155900 -0.996194
257968 BBG00YM46B38 2020-12-21 0.372860 1.391661
257969 BBG00YM46B38 2020-12-28 0.535650 0.436598
Table A contains only one group of stocks (CCPM), but table B contains many different stock groups. I want to run a linear regression of table B's pct_change against table A's (CCPM) pct_change, so that I can see how the stocks in table B move relative to the CCPM stocks over the period covered by the dt column. The problem is that table A has only 105 rows, and when I group table B by bbgid I always get more rows, so I get an error saying X and y must be the same size.
Both tables have already been grouped by week, and their pct_change has been calculated weekly. I need to compare the variations in pct_change from table B with those in table A by date, one group from table B at a time against the CCPM stocks' pct_change.
I would like to extract the slope from each regression, store the slopes in a column of the same table, and associate each one with its corresponding group.
I have tried the solutions in this post and this post without success.
Is there a workaround for this, or am I doing something wrong? Please help me fix this.
Thank you very much in advance.
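One possible approach, sketched under assumptions rather than taken from an answer: merge table B with table A on dt first, so that every group is compared only on the weeks both tables share, and then fit a straight line per bbgid. The function name add_slopes, the helper column bench_pct_change and the tiny sample tables are invented for illustration, and np.polyfit stands in for whichever regression routine is preferred:

```python
import numpy as np
import pandas as pd

def add_slopes(table_b, table_a):
    """Return a copy of table_b with a 'slope' column: the slope of each
    bbgid's weekly_pct_change regressed on table_a's weekly_pct_change."""
    bench = table_a[["dt", "weekly_pct_change"]].rename(
        columns={"weekly_pct_change": "bench_pct_change"}
    )
    # the inner merge aligns X and y by date, so both always have the same length
    merged = table_b.merge(bench, on="dt", how="inner")

    slopes = {}
    for bbgid, group in merged.groupby("bbgid"):
        x = group["bench_pct_change"].to_numpy()
        y = group["weekly_pct_change"].to_numpy()
        if len(x) < 2 or np.unique(x).size < 2:
            slopes[bbgid] = np.nan  # not enough distinct points to fit a line
        else:
            slopes[bbgid] = np.polyfit(x, y, 1)[0]  # first coefficient is the slope

    out = table_b.copy()
    out["slope"] = out["bbgid"].map(slopes)
    return out

# hypothetical miniature tables
dates = pd.to_datetime(["2019-01-07", "2019-01-14", "2019-01-21"])
table_a = pd.DataFrame({"bbgid": "CCPM", "dt": dates, "weekly_pct_change": [0.0, 1.0, 2.0]})
table_b = pd.DataFrame({
    "bbgid": ["X", "X", "X", "Y", "Y"],
    "dt": list(dates) + list(dates[1:]),
    "weekly_pct_change": [0.1, 2.1, 4.1, 1.0, 0.0],
})
result = add_slopes(table_b, table_a)
print(result)
```

Group Y has only two weeks of data, yet it still gets a slope because the regression uses only the dates it shares with table A.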

Replacing NaTs in a DataFrame using shift

I am importing a dataframe from an Excel spreadsheet where the Date column is incomplete:
Date Value
0 2020-04-29 144
1 NaT 158
2 NaT 134
3 2020-04-30 114
4 NaT 153
and I'd like to fill in the NaTs by replacing them with the date from the line above. The slow method works:
for i in range(0, df.shape[0]):
    if pd.isnull(df.iat[i,0]):
        df.iat[i, 0] = df.iat[i-1, 0]
but the methods I think ought to work, don't. Both of these replace the first NaT they can encounter but skip NaTs after that (are they working on copies of the data?)
df["Date"] = np.where(df["Date"].isnull(), df["Date"].shift(1), df["Date"])
df['Date'].mask(df['Date'].isnull(), df['Date'].shift(1), inplace=True)
Is there any quick way of doing this?
A
You can try ffill (it returns a new DataFrame, so assign the result back):
df = df.ffill()
If the "Date" values are strings, first convert "NaT" into actual NaN values using replace:
df = df.replace("NaT", np.nan).ffill()
Explanation
Use replace to turn the "NaT" strings into actual NaN values.
Fill every NaN cell from the previous non-NaN cell using ffill.
Code + illustration
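import pandas as pd
import numpy as np
# sample frame from the question, with the dates stored as strings
df = pd.DataFrame({"Date": ["2020-04-29", "NaT", "NaT", "2020-04-30", "NaT"],
                   "Value": [144, 158, 134, 114, 153]})
print(df.replace("NaT", np.nan))
# Date Value
# 0 2020-04-29 144
# 1 NaN 158
# 2 NaN 134
# 3 2020-04-30 114
# 4 NaN 153
print(df.replace("NaT", np.nan).ffill())
# Date Value
# 0 2020-04-29 144
# 1 2020-04-29 158
# 2 2020-04-29 134
# 3 2020-04-30 114
# 4 2020-04-30 153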
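As for why the shift attempts stop after the first NaT of each run: shift(1) is computed once from the original column, so a NaT that sits directly below another NaT just receives that NaT. A short sketch with a real datetime column shows the difference; the names one_step and the sample frame are only for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    "Date": pd.to_datetime(["2020-04-29", None, None, "2020-04-30", None]),
    "Value": [144, 158, 134, 114, 153],
})

# a single shift fills only NaTs whose direct predecessor was a real date
one_step = df["Date"].where(df["Date"].notna(), df["Date"].shift(1))
print(one_step.isna().sum())  # row 2 is still NaT

# ffill carries the last valid date forward across the whole run
df["Date"] = df["Date"].ffill()
print(df)
```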

Subtract between two pandas data frame with different length

I need to subtract values between two distinct data frames.
I tried the following code:
df_sw['Apropriacao_total'] = df_sw_ant.merge(df_sw, how='left', right_on=['Data posicao', 'Ativo', 'Data vencimento'],
left_on=['Data posicao', 'Ativo', 'Data vencimento'])
['Apropriacao_conjunta'].sub(['Apropriacao'], axis=1)
Below are samples of the two data frames:
df_sw Cols: 6 rows: 62
Data_posicao Ativo Data_vencimento Apropriacao Apropriacao_conjunta
0 2017-07-03 RXU7 2017-09-07 -631.17 -631.17
1 2017-07-04 RXU7 2017-09-07 -828.59 -828.59
...
22 2017-07-05 GCQ7 2017-07-31 1820.06 1820.06
...
53 2017-07-18 CNHBRL 2017-09-28 1431.82 1431.82
df_sw_ant Cols: 6 rows: 32
Data_swap Data_posicao Ativo Data_vencimento Apropriacao_swap
0 2017-07-03 2017-06-30 RXU7 2017-09-07 -333.66
1 2017-07-04 2017-07-03 RXU7 2017-09-07 -631.17
...
22 2017-07-05 2017-07-04 GCQ7 2017-07-31 720.06
...
29 2017-07-20 2017-07-19 CNHBRL 2017-09-28 -157.30
Question:
How to perform a subtraction (df_sw['Apropriacao_conjunta'] - df_sw_ant['Apropriacao_swap']) where:
df_sw['Data_posicao'] = df_sw_ant['Data_swap'] and df_sw['Ativo'] = df_sw_ant['Ativo'] and df_sw['Data_vencimento'] = df_sw_ant['Data_vencimento']
The subtraction will be done in the axis = 1
You can try following and see if it works for you:
# merge and save to new dataframe
df_merged = df_sw_ant.merge(df_sw, how='left', right_on=['Data posicao', 'Ativo', 'Data vencimento'],
left_on=['Data posicao', 'Ativo', 'Data vencimento'])
# save subtracted result to a new column
df_merged['Sub_Value'] = df_merged['Apropriacao_conjunta'] - df_merged['Apropriacao']
The Sub_Value column in df_merged will then hold the result of subtracting the two columns.
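The condition stated in the question pairs df_sw['Data_posicao'] with df_sw_ant['Data_swap'] and subtracts Apropriacao_swap, which differs from the keys used above. A sketch under that reading, assuming the underscore column names from the sample tables and using a few made-up rows taken from them:

```python
import pandas as pd

df_sw = pd.DataFrame({
    "Data_posicao": pd.to_datetime(["2017-07-03", "2017-07-04", "2017-07-05"]),
    "Ativo": ["RXU7", "RXU7", "GCQ7"],
    "Data_vencimento": pd.to_datetime(["2017-09-07", "2017-09-07", "2017-07-31"]),
    "Apropriacao_conjunta": [-631.17, -828.59, 1820.06],
})
df_sw_ant = pd.DataFrame({
    "Data_swap": pd.to_datetime(["2017-07-03", "2017-07-04"]),
    "Ativo": ["RXU7", "RXU7"],
    "Data_vencimento": pd.to_datetime(["2017-09-07", "2017-09-07"]),
    "Apropriacao_swap": [-333.66, -631.17],
})

# a left merge keeps every row of df_sw; rows without a match get NaN in Apropriacao_swap
merged = df_sw.merge(
    df_sw_ant,
    how="left",
    left_on=["Data_posicao", "Ativo", "Data_vencimento"],
    right_on=["Data_swap", "Ativo", "Data_vencimento"],
)
merged["Apropriacao_total"] = merged["Apropriacao_conjunta"] - merged["Apropriacao_swap"]
print(merged[["Data_posicao", "Ativo", "Apropriacao_total"]])
```

Rows of df_sw with no counterpart in df_sw_ant end up with NaN in Apropriacao_total; fillna(0) on Apropriacao_swap before subtracting would treat them as having no swap value instead.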
