How to multiply values from two different columns and rows - python

How can I reproduce in Python this calculation that I made in Excel?
I want to take the previous row's value of the column "Acumulado", multiply it by the current row's value of the column 'Selic por dia', store the result in that row of "Acumulado", and repeat this successively down the column.
import pandas as pd
# Creating the dataframe
df = pd.DataFrame({"Data":['06/03/2006','07/03/2006','08/03/2006','09/03/2006','10/03/2006','13/03/2006','14/03/2006','15/03/2006','16/03/2006','17/03/2006'],
"Taxa SELIC":[17.29,17.29,17.29,16.54,16.54,16.54,16.54,16.54,16.54,16.54,]})
df['Taxa Selic %'] = df['Taxa SELIC'] / 100
df['Selic por dia'] = (1 + df['Taxa Selic %'])**(1/252)  # daily factor from the annual rate (252 trading days)
[Image: example DataFrame]
[Image: the calculation done in Excel]
[Image: how I would like the result to look]

Not an efficient method, but you can try this:
import numpy as np

selic_per_dia = list(df['Selic por dia'].values)
# start the running total at 1,000,000 times the first daily factor
accumulado = [1000000 * selic_per_dia[0]]
for i, value in enumerate(selic_per_dia):
    if i == 0:
        continue
    else:
        # each row is the previous accumulated value times the current daily factor
        accumulado.append(accumulado[i - 1] * value)
df['Acumulado'] = accumulado
# prepend the initial 1,000,000 as a new first row
df.loc[-1] = [np.nan, np.nan, np.nan, np.nan, 1000000]
df.index = df.index + 1
df = df.sort_index()
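A vectorized alternative, in case performance matters: each row is just the running product of the daily factors scaled by the initial amount, so Series.cumprod gives the same result without an explicit loop. A minimal sketch (the initial-row handling mirrors the loop version above):
import numpy as np

# running product of the daily factors, scaled by the initial 1,000,000
df['Acumulado'] = 1000000 * df['Selic por dia'].cumprod()
# prepend the initial amount as a new first row, as in the loop version
df.loc[-1] = [np.nan, np.nan, np.nan, np.nan, 1000000]
df.index = df.index + 1
df = df.sort_index()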

querying a multiindex pandas dataframe with slices

Assuming I have the following multiindex DF
import pandas as pd
import numpy as np
from random import randint
input_id = np.array(['12345'])
docType = np.array(['pre','pub','app','dw'])
docId = np.array(['34455667'])
sec_type = np.array(['bib','abs','cl','de'])
sec_ids = np.array(['x-y','z-k'])
index = pd.MultiIndex.from_product([input_id,docType,docId,sec_type,sec_ids])
content= [str(randint(1,10))+ '##' + str(randint(1,10)) for i in range(len(index))]
df = pd.DataFrame(content, index=index, columns=['content'])
df.rename_axis(index=['input_id','docType','docId','secType','sec_ids'], inplace=True)
df
I know that I can query a multiindex DF as follows:
# querying a multiindex DF
idx = pd.IndexSlice
df.loc[idx[:,['pub','pre'],:,'de',:]]
Basically, with the help of pd.IndexSlice I can pass the values I want for each of the indexes. In the above case I want the resulting DF where the second index is 'pub' OR 'pre' and the 4th one is 'de'.
I am looking for a way to pass a range of values to the query, something like multiindex 3 being between 34567 and 45657. Assume those are integers.
pseudocode: df.loc[idx[:,['pub','pre'],XXXXX,'de',:]]
XXXX = ?
EDIT 1:
The docId column index is of text type; it is probably necessary to change it to int first.
Turns out query is very powerful:
df.query('docType in ["pub","pre"] and ("34455667" <= docId <= "3445568") and (secType=="de")')
Output:
                                           content
input_id docType docId    secType sec_ids
12345    pre     34455667 de      x-y        2##9
                                  z-k        6##1
         pub     34455667 de      x-y        6##5
                                  z-k        9##8
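If you would rather stay with pd.IndexSlice, one option is to pass a slice object for that level. A sketch, assuming you first convert the docId level to int and sort the index (ranged slicing on a MultiIndex requires a sorted index):
# convert the docId level to int so numeric ranges compare correctly
df.index = df.index.set_levels(df.index.levels[2].astype(int), level='docId')
df = df.sort_index()  # ranged slicing needs a lexsorted MultiIndex

idx = pd.IndexSlice
# a slice on the third level selects docIds in that closed range
result = df.loc[idx[:, ['pub', 'pre'], 34455660:34455670, 'de', :], :]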

How can I select every third column, and the last 480 columns, from a csv file

I would like to plot a set of columns for 2 different scenarios, selected by column index, from my dataset, preferably via pandas.DataFrame:
1st scenario: column indexes [2, 5, 8, ..., n+2]
2nd scenario: the last 480 columns, or column indexes [961-1439]
[Image: the dataset]
I've tried to play with the column indexes as follows:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

dft = pd.read_csv(r"D:\Test.csv", header=None)  # raw string so the backslash is not an escape
dft.head()
# every 4th row is an id; the 3 rows after it hold the A, B and C values
# (the original % 2 could never equal 2 or 3, so B and C were always empty)
id_set = dft[dft.index % 4 == 0].astype('int').values
A = dft[dft.index % 4 == 1].values
B = dft[dft.index % 4 == 2].values
C = dft[dft.index % 4 == 3].values
data = {'A': A[:, 0], 'B': B[:, 0], 'C': C[:, 0]}
df = pd.DataFrame(data, columns=['A', 'B', 'C'], index=id_set[:, 0])
#1st scenario
j = 0
index = []
for i in range(1439):
    if j == 2:
        j = 0
        continue
    else:
        index.append(i)
        j += 1
print(index)
#2nd scenario
last_480 = df[0:480][::-1]  # note: df.[...] was a syntax error, and this slices rows, not columns
I've found post1 and post2, but they didn't cover my case.
I would appreciate it if someone could help me.
1st scenario:
df.iloc[:, 2::3]
The slicing here means all rows, and every third column starting from the column at index 2.
2nd scenario:
df.iloc[:, :961:-1]
The slicing here means all rows, and the columns from the last one backwards down to (but not including) the column at index 961, i.e. the trailing columns in reverse order.
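A small runnable demonstration of the two slice patterns on a toy frame (the 8-column width here is just for illustration):
import numpy as np
import pandas as pd

toy = pd.DataFrame(np.arange(16).reshape(2, 8))  # 2 rows, columns 0..7
print(toy.iloc[:, 2::3])   # columns 2 and 5: every third column starting at index 2
print(toy.iloc[:, :4:-1])  # columns 7, 6, 5: from the end down to (not including) index 4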
EDIT:
import matplotlib.pyplot as plt
import seaborn as sns

scenario1 = df.iloc[:, 2::3].copy()
sns.lineplot(data=scenario1.T)
You can save a copy of the slice to another variable; then, since you want to graph row-wise, take the transpose of the sliced matrix (this turns your rows into columns).

How do I efficiently change a pd.Series of lists in a dataframe to a pd.Series of np.arrays

I have a PostgreSQL database that has data similar to:
date, character varying, character varying, integer[]
The integer array column stores a list of values: 1,2,3,4,5
I'm using pd.read_sql to read the data into a dataframe.
So I have a dataframe with a date column, several string columns, and then a column with a list of integers.
The array values are regularly used in numpy arrays to do vector math.
In the past I couldn't find a way to convert the list column to a numpy array column without looping and recreating the dataframe row by row.
As an example:
import pandas as pd
import numpy as np
col1 = ['String data'] * 4
col2 = [[1,2,3,4,5]] * 4
d = {'Description': col1, 'Measures':col2}
df = pd.DataFrame(d)
new_df = pd.DataFrame(columns=df.columns)
# rebuild the frame row by row, converting each list to an np.array
for i in range(len(df)):
    new_df.loc[i, ['Description', 'Measures']] = [df.at[i, 'Description'], np.array(df.at[i, 'Measures'])]
print(new_df)
This looping could be over a few thousand rows.
More recently I figured out that I could do a single-line conversion of Series -> list -> np.array -> list -> Series and achieve the result much more efficiently.
import pandas as pd
import numpy as np
col1 = ['String data'] * 4
col2 = [[1,2,3,4,5]] * 4
d = {'Description': col1, 'Measures':col2}
df = pd.DataFrame(d)
df['NParray'] = pd.Series(list(np.array(list(np.array(df['Measures'])))))  # Series -> list -> np.array -> list -> Series
df.drop(['Measures'], axis=1, inplace=True)
print(df)
print(type(df['NParray'][0]))
I read about and tried to use Series.array and Series.to_numpy, but they don't really achieve what I'm trying to do.
So, the question is:
Is there a method to convert a pd.Series of lists to a numpy array as I'm trying to do?
Is there any easier way to mass convert these lists to numpy arrays?
I was hoping for something simple like:
df['NParray'] =np.asarray(df['Measures'])
df['NParray'] =np.array(df['Measures'])
df['NParray'] =df['Measures'].array
df['NParray'] =df['Measures'].to_numpy()
But these have different functions and do not work for my purpose.
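For what it's worth, here is why those attempts don't do what you want (a small sketch of the behavior; exact dtypes can vary by pandas/numpy version). np.array and np.asarray on the Series just copy its object-dtype storage, leaving the inner lists as lists, and Series.array / Series.to_numpy likewise expose the existing values; none of them produce a column of per-row np.arrays:
import numpy as np
import pandas as pd

s = pd.Series([[1, 2, 3], [4, 5, 6]])
print(np.array(s).dtype, np.array(s).shape)  # object (2,): a 1-D object array of lists
print(type(s.to_numpy()[0]))                 # <class 'list'>: the inner lists are untouched
print(np.array(s.tolist()).shape)            # (2, 3): one 2-D array, which cannot be assigned to a single column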
------------Edited after testing------------------------------------------------
I setup a small test to see what the difference in timings and efficiency would be:
import pandas as pd
import numpy as np
from datetime import datetime  # pd.datetime is deprecated in newer pandas

def get_dataframe():
    col1 = ['String data'] * 10000
    col2 = [list(range(0, 5000))] * 10000
    d = {'Description': col1, 'Measures': col2}
    df = pd.DataFrame(d)
    return df

def old_looping(df):
    new_df = pd.DataFrame(columns=df.columns)
    starttime = datetime.now()
    for i in range(len(df)):
        new_df.loc[i, ['Description', 'Measures']] = [df.at[i, 'Description'], np.array(df.at[i, 'Measures'])]
    endtime = datetime.now()
    duration = endtime - starttime
    print('Looping', duration)

def series_transforms(df):
    starttime = datetime.now()
    df['NParray'] = pd.Series(list(np.array(list(np.array(df['Measures'])))))
    df.drop(['Measures'], axis=1, inplace=True)
    endtime = datetime.now()
    duration = endtime - starttime
    print('Transforms', duration)

def use_apply(df):
    starttime = datetime.now()
    df['Measures'] = df['Measures'].apply(np.array)
    endtime = datetime.now()
    duration = endtime - starttime
    print('Apply', duration)

def run_test(tests):
    for i in range(tests):
        construct_df = get_dataframe()
        old_looping(construct_df)
    for i in range(tests):
        construct_df = get_dataframe()
        series_transforms(construct_df)
    for i in range(tests):
        construct_df = get_dataframe()
        use_apply(construct_df)

run_test(5)
With 10,000 rows the results were:
Transforms 3.945816
Transforms 3.968821
Transforms 3.891866
Transforms 3.859437
Transforms 3.860590
Apply 4.218867
Apply 4.015742
Apply 4.046986
Apply 3.906360
Apply 3.890740
Looping 27.662418
Looping 27.814523
Looping 27.298895
Looping 27.565626
Looping 27.222970
Transforming through Series-List-NP Array-List-Series is negligibly faster than using Apply. Apply is definitely shorter code and possibly easier to understand.
Increasing the number of rows or array length will increase the times by the same magnitude.
Easiest might be to go with apply to convert to the np.array: df['Measures'].apply(np.array)
Full example:
import pandas as pd
import numpy as np
col1 = ['String data'] * 4
col2 = [[1,2,3,4,5]] * 4
d = {'Description': col1, 'Measures':col2}
df = pd.DataFrame(d)
display(df.Measures)  # display() is available in Jupyter/IPython; use print() elsewhere
df['NParray'] = df['Measures'].apply(np.array)
df.drop(['Measures'], axis=1, inplace=True)
print(df)
print(type(df['NParray'][0]))
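As a side note: if the lists are all the same length and the end goal is vector math, it may be simpler to skip per-row arrays entirely and build one 2-D array for the whole column. A sketch, reusing the Measures column from above:
import numpy as np
import pandas as pd

col2 = [[1, 2, 3, 4, 5]] * 4
df = pd.DataFrame({'Description': ['String data'] * 4, 'Measures': col2})
m = np.array(df['Measures'].tolist())  # shape (4, 5): one row per DataFrame row
print(m * 2)                           # vectorized math across every row at once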

Excel xlwings data input for Python Technical Indicators

I am trying to replicate a simple Technical-Analysis indicator using xlwings. However, the function does not seem to be able to read the Excel values as a list. Below is the code:
import pandas as pd
import datetime as dt
import numpy as np
import xlwings as xw

@xw.func
def EMA(df, n):
    # pd.ewma was removed from pandas; Series.ewm(...).mean() is the modern equivalent
    EMA = pd.Series(df['Close'].ewm(span=n, min_periods=n - 1).mean(), name='EMA_' + str(n))
    df = df.join(EMA)
    return df
When I enter a list of Excel data, =EMA({1,2,3,4,5}, 5), I get the following error message:
TypeError: list indices must be integers, not str
EMA = pd.Series(pd.ewma(df['Close'], span = n, min_periods = n - 1), name = 'EMA_' + str(n))
(Expert) help much appreciated! Thanks.
EMA() expects a DataFrame df and a scalar n, and it returns the EMA in a separate column of the source DataFrame. You are passing a simple list of values; this is not supposed to work.
Construct a DataFrame and assign the values to the Close column:
v = range(100) # use your list of values instead
df = pd.DataFrame(v, columns=['Close'])
Call EMA() with this DataFrame:
EMA(df, 5)
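If the goal is to call this directly from a worksheet, here is a minimal sketch of a self-contained UDF (the name ema_from_values is hypothetical; it builds the DataFrame internally so Excel can pass a plain range of numbers, and it assumes a current pandas with Series.ewm):
import pandas as pd
import xlwings as xw

@xw.func
def ema_from_values(values, n):
    # values arrives from Excel as a list of numbers; wrap it in a DataFrame
    df = pd.DataFrame(values, columns=['Close'])
    n = int(n)
    # exponentially weighted moving average over the Close column
    return df['Close'].ewm(span=n, min_periods=n - 1).mean().tolist()
How the returned list spills into the sheet depends on your xlwings return options.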

How to update dataframe value

I have a project where for each row in a table I need to iterate through rows from another table and update values in both. The changes need to stick for the next iteration. What is the best way to do that?
for invoice_line in invoices.itertuples():
    qty = invoice_line.SHIP_QTY
    for receipt_line in receipts[receipts.SKU == invoice_line.SKU].itertuples():
        if qty > receipt_line.REC_QTY:
            receipts.set_value(receipt_line.index, 'REC_QTY', 0)
            qty = qty - receipt_line.REC_QTY
        else:
            receipts.set_value(receipt_line.index, 'REC_QTY', receipt_line.REC_QTY - qty)
            qty = 0
        recd = receipt_line.REC_DATE
        if qty < 1: break
    invoices.set_value(invoice_line.index, 'REC_DATE', recd)
set_value does not seem to work.
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(50, 4), columns=list('ABCD'))
for row in df.itertuples():
    df.set_value(row.index, 'test', row.D)
print(df.head())
I think what you want is a capitalized Index:
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(50, 4), columns=list('ABCD'))
for row in df.itertuples():
    # itertuples exposes the row label as row.Index (capital I); row.index is the namedtuple's index() method
    df.set_value(row.Index, 'test', row.D)
print(df.head())
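Note that set_value was deprecated and later removed in modern pandas; the equivalent today is the .at indexer. The same loop, assuming the df above:
for row in df.itertuples():
    # .at is the scalar setter that replaced set_value
    df.at[row.Index, 'test'] = row.D
print(df.head())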
Not 100% sure if this is what you want, but I think you're trying to loop through a list and update the value of a cell in a dataframe. The syntax for that is:
for ix in df.index:
    df.loc[ix, 'Test'] = 'My New Value'
where ix is the row label and 'Test' is the column name that you want to update. If you need to add more logic, you could try something like:
for ix in df.index:
    row = df.loc[ix]
    if row.myVariable < 100:
        df.loc[ix, 'SomeColumn'] = 'Less than a hundred'
    else:
        df.loc[ix, 'SomeColumn'] = 'A hundred or more'
