Need to print a new column for a pandas DataFrame

I am using the following code to add a ratio column by applying a function and then print the result, but I am getting the following error.
Code
import investpy
import pandas as pd
import numpy as np
import sys
def main(stock1_name, stock2_name):
    stock1 = investpy.get_stock_historical_data(stock=stock1_name,country='india', from_date='01/01/2020',to_date='08/03/2021')
    stock2 = investpy.get_stock_historical_data(stock=stock2_name,country='india', from_date='01/01/2020',to_date='08/03/2021')
    new_df = pd.merge(stock1, stock2, on='Date')
    new_df = new_df.drop(['Open_x', 'High_x', 'Low_x', 'Volume_x', 'Currency_x', 'Low_y','Volume_y', 'Currency_y', 'Open_y', 'High_y'], axis = 1)
    new_df['ratio'] = np.log10(new_df['Close_x']/new_df['Close_y'])
    return new_df
    x = main("IOC","HPCL")
print(x)
Error
NameError Traceback (most recent call last)
<ipython-input-2-c17535375449> in <module>
12 return new_df
13 x = main("IOC","HPCL")
---> 14 print(x)
NameError: name 'x' is not defined

In your code, the line x = main("IOC","HPCL") is indented inside the function main, after the return statement, so it never runs.
Even if it did run, x would be defined only in the local scope of main.
When print(x) then runs outside the function, the interpreter raises a NameError, as it should, because no x is defined at that level.
Does this correction solve the issue?
import investpy
import pandas as pd
import numpy as np
import sys
def main(stock1_name, stock2_name):
    stock1 = investpy.get_stock_historical_data(stock=stock1_name,country='india', from_date='01/01/2020',to_date='08/03/2021')
    stock2 = investpy.get_stock_historical_data(stock=stock2_name,country='india', from_date='01/01/2020',to_date='08/03/2021')
    new_df = pd.merge(stock1, stock2, on='Date')
    new_df = new_df.drop(['Open_x', 'High_x', 'Low_x', 'Volume_x', 'Currency_x', 'Low_y','Volume_y', 'Currency_y', 'Open_y', 'High_y'], axis = 1)
    new_df['ratio'] = np.log10(new_df['Close_x']/new_df['Close_y'])
    return new_df
x = main("IOC","HPCL") # Edit: moved this line outside the function main()
print(x)
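
The scoping rule behind this error can be seen in isolation. The sketch below uses made-up names (build_ratio, unreachable) and plain numbers instead of investpy data, so it only illustrates the mechanism: an assignment placed after return inside a function never runs, and the name it would create is local to the function anyway.

```python
def build_ratio(a, b):
    result = a / b
    return result
    # Never executed: this line sits after return, and "unreachable"
    # would be a local name of build_ratio in any case.
    unreachable = build_ratio(a, b)


# Calling the function and storing the result at module level works.
value = build_ratio(10, 4)
print(value)

# Referring to the function-local name at module level raises NameError.
try:
    print(unreachable)
except NameError as err:
    message = str(err)
    print(message)
```

Moving the call out of the function body, as in the corrected code above, is what makes the result available to print.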

Related

Filling out a dataframe column using parallel processing in Python

I am trying to compute a value for each row of a dataframe in parallel using the following code, but I get errors whether I pass the individual input ranges or their combination:
#!pip install pyblaze
import itertools
import pyblaze
import pyblaze.multiprocessing as xmp
import pandas as pd
inputs = [range(2),range(2),range(3)]
inputs_list = list(itertools.product(*inputs))
Index = pd.MultiIndex.from_tuples(inputs_list,names={"a", "b", "c"})
df = pd.DataFrame(index = Index)
df['Output'] = 0
print(df)
def Addition(A,B,C):
    df.loc[A,B,C]['Output']=A+B+C
    return df
def parallel(inputs_list):
    tokenizer = xmp.Vectorizer(Addition, num_workers=8)
    return tokenizer.process(inputs_list)
parallel(inputs_list)
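
One way to fill the column without mutating df inside the workers is to have each call return its value and assign all results at once. The following is a minimal sketch, not a pyblaze solution: it swaps in ThreadPool from the standard library's multiprocessing.pool in place of xmp.Vectorizer, assumes the goal is Output = a + b + c, and passes names as a list because a set has no guaranteed order. The function name addition and the pool size of 4 are illustrative choices.

```python
import itertools
from multiprocessing.pool import ThreadPool

import pandas as pd


def addition(a, b, c):
    # Return the value instead of writing into a shared DataFrame.
    return a + b + c


inputs = [range(2), range(2), range(3)]
inputs_list = list(itertools.product(*inputs))

index = pd.MultiIndex.from_tuples(inputs_list, names=["a", "b", "c"])
df = pd.DataFrame(index=index)

# starmap unpacks each (a, b, c) tuple into the function's arguments
# and returns the results in input order.
with ThreadPool(4) as pool:
    results = pool.starmap(addition, inputs_list)

df["Output"] = results
print(df)
```

Threads usually do not speed up CPU-bound Python code. For that case, multiprocessing.Pool offers the same starmap call, with the pool created under an if __name__ == "__main__": guard.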

NameError: name 'TabularList' is not defined

from fastai import *
from fastai.tabular import *
from fastai.tabular.all import *
import pandas as pd
# set seed for reproducibility
custom_set_seed(42)
df = pd.read_csv('credit_card_default.csv', index_col=0, na_values='')
df.head()
DEP_VAR = 'default_payment_next_month'
num_features = list(df.select_dtypes('number').columns)
num_features.remove(DEP_VAR)
cat_features = list(df.select_dtypes('object').columns)
preprocessing = [FillMissing, Categorify, Normalize]
data = (TabularList.from_df(df, cat_names=cat_features, cont_names=num_features, procs=preprocessing).split_by_rand_pct(valid_pct=0.2, seed=42).label_from_df(cols=DEP_VAR).databunch())
I have been trying to run this piece of code but it keeps running into this error:
NameError Traceback (most recent call last)
<ipython-input-42-5ca7e57a8e36> in <module>
1 # Create a TabularDataBunch from the DataFrame
2
----> 3 data = (TabularList.from_df(df, cat_names=cat_features, cont_names=num_features, procs=preprocessing).split_by_rand_pct(valid_pct=0.2, seed=42).label_from_df(cols=DEP_VAR).databunch())
NameError: name 'TabularList' is not defined
I believe I have imported all the modules that were needed. Can someone suggest a solution for this?

Try importing it by its full path, as below:
from fastai.tabular.data import TabularList

I got this working by installing an older fastai release (TabularList belongs to the fastai v1 API):
pip install fastai==1.0.61
After that,
from fastai.tabular.data import TabularList
works with no problems.

Apply function to data frame in python

In the following code, I first define a function and then apply it to a dataframe to reset the geozone column.
import pandas as pd
testdata ={'country': ['USA','AUT','CHE','ABC'], 'geozone':[0,0,0,0]}
d =pd.DataFrame.from_dict(testdata, orient = 'columns')
def setgeozone(dataframe, dcountry, dgeozone):
    dataframe.loc[dataframe['dcountry'].isin(['USA','CAN']),'dgeozone'] =1
    dataframe.loc[dataframe['dcountry'].isin(['AUT','BEL']),'dgeozone'] =2
    dataframe.loc[dataframe['dcountry'].isin(['CHE','DNK']),'dgeozone'] =3
setgeozone(d, country, geozone)
I got an error message saying:
Traceback (most recent call last):
File "<ipython-input-56-98dad4781f73>", line 1, in <module>
setgeozone(d, country, geozone)
NameError: name 'country' is not defined
Can someone help me understand what I did wrong?
Many thanks.

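The NameError occurs because country and geozone are not variables in your script; they are column names, which must be written as strings. You don't need to pass anything other than the DataFrame itself to your function. Try this, where df stands for your DataFrame (d in the question):
def setgeozone(df):
    df.loc[df['country'].isin(['USA','CAN']),'geozone'] = 1
    df.loc[df['country'].isin(['AUT','BEL']),'geozone'] = 2
    df.loc[df['country'].isin(['CHE','DNK']),'geozone'] = 3
setgeozone(df)
Here are two other (arguably simpler) ways to accomplish what you need. Note that map produces NaN for countries missing from the dictionary, while numpy.select falls back to its default of 0.
Use map:
df["geozone"] = df["country"].map({"USA": 1, "CAN": 1, "AUT": 2, "BEL": 2, "CHE": 3, "DNK": 3})
Use numpy.select:
import numpy as np
df["geozone"] = np.select(
    [df["country"].isin(["USA", "CAN"]),
     df["country"].isin(["AUT", "BEL"]),
     df["country"].isin(["CHE", "DNK"])],
    [1, 2, 3],
)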
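
The two approaches differ in how they treat countries that are not listed, such as 'ABC' in the question's test data. Here is a small sketch on that data, assuming 0 is the desired fallback zone (the names zone_by_country, via_map and via_select are illustrative):

```python
import numpy as np
import pandas as pd

d = pd.DataFrame({"country": ["USA", "AUT", "CHE", "ABC"],
                  "geozone": [0, 0, 0, 0]})

zone_by_country = {"USA": 1, "CAN": 1, "AUT": 2, "BEL": 2, "CHE": 3, "DNK": 3}

# map leaves NaN for countries missing from the dictionary,
# so fill those with 0 and convert back to integers.
via_map = d["country"].map(zone_by_country).fillna(0).astype(int)

# np.select picks the first matching condition and uses default otherwise.
conditions = [
    d["country"].isin(["USA", "CAN"]),
    d["country"].isin(["AUT", "BEL"]),
    d["country"].isin(["CHE", "DNK"]),
]
via_select = np.select(conditions, [1, 2, 3], default=0)

d["geozone"] = via_map
print(d)
```

Both variants assign zone 0 to 'ABC' here; without the fillna step, the map version would leave that row as NaN.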

Doing task between Multiprocessing python

I want to use the multiprocessing library to parallelize the computation. If you comment out lines 5 and 9 and uncomment line 11, the code runs serially.
My dataframe is very big and processing it takes a lot of time, so I want to use multiprocessing.
This is what I am trying:
def do_something (df):
    return df
def main(df,df_hide,df_res):
    p = Pool() # comment to run normal way
    for i in range(0,df_hide.shape[0]):
        df = df.append(df_hide.iloc[i,:])
        df = p.map(do_something,df) # comment to run normal way
        #df = do_something(df) # uncomment to run normal way
        df_res.iloc[i,0] = df.iloc[-1,0]
    return df_res
if __name__ == '__main__':
    df = pd.DataFrame({'a':[1,2,3]})
    df_hide = pd.DataFrame({'a':[4,5,6]})
    df_res = pd.DataFrame({'z':[0,0,0]})
    df_res1 = main(df,df_hide,df_res)
    print(df_res1)
Expected output, which I get if I run it normally:
   z
0  4
1  5
2  6
With multiprocessing I get nothing; it freezes the cmd window. Even if it did run, I don't think I would get the expected results, because I have to do something after every process. Can you please suggest how to parallelize the above code using multiprocessing?

import numpy as np
import pandas as pd
def do_something (df):
    return df
def main(df,df_hide,df_res):
    for i in range(0,df_hide.shape[0]):
        df = df.append(df_hide.iloc[i,:])
        df_res.iloc[i,0] = df.iloc[-1,0]
    return df_res
if __name__ == '__main__':
    df = pd.DataFrame({'a':[1,2,3]})
    df_hide = pd.DataFrame({'a':[4,5,6]})
    df_res = pd.DataFrame({'z':[0,0,0]})
    df_res1 = main(df,df_hide,df_res)
    print(df_res1)
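
Two details are worth spelling out. Iterating over a DataFrame yields its column labels, so Pool.map(do_something, df) would call do_something on each column name rather than on the frame; and each loop iteration depends on the frame built by the previous one, so the loop is sequential by nature. The sketch below keeps the serial loop and, as an assumption about the installed version, uses pd.concat because DataFrame.append was removed in pandas 2.0.

```python
import pandas as pd


def do_something(df):
    # Placeholder for the real per-step computation.
    return df


def main(df, df_hide, df_res):
    for i in range(df_hide.shape[0]):
        # iloc[[i]] keeps row i as a one-row DataFrame, so concat appends it.
        df = pd.concat([df, df_hide.iloc[[i]]])
        df = do_something(df)
        df_res.iloc[i, 0] = df.iloc[-1, 0]
    return df_res


df = pd.DataFrame({'a': [1, 2, 3]})
df_hide = pd.DataFrame({'a': [4, 5, 6]})
df_res = pd.DataFrame({'z': [0, 0, 0]})

# Iterating a DataFrame gives column labels, not rows or the frame itself.
column_labels = list(df)

df_res1 = main(df, df_hide, df_res)
print(df_res1)
```

If do_something is the expensive part and can be split into independent pieces, those pieces are the natural unit to hand to a pool, rather than the loop over df_hide.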

Explain Function Mistake

I managed to write my first function; however, I do not understand it :-)
I approached my real problem with a simplified one. See the following code:
import pandas as pd
import matplotlib as plt
import numpy as np
from pyXSteam.XSteam import XSteam
steamTable = XSteam(XSteam.UNIT_SYSTEM_MKS)
T1_T_in = [398,397,395]
T1_p_in = [29,29,29]
T1_mPkt_in = [2.2,3,3.5]
def Power(druck,temp,menge):
    H = []
    Q = []
    for i in range(len(druck)):
        H.append(steamTable.h_pt(druck[i],temp[i]))
        Q.append(H[i]*menge[i])
    return Q
t1Q=Power(T1_p_in,T1_T_in,T1_mPkt_in)
t3Q = Power(T3_p_in,T3_T_in,T3_mPkt_in)
print(t1Q)
print(t3Q)
It works. The real problem differs in that I read the data from an Excel file. I got an error message and (according to what I learned from this good site :-)) I added ".tolist()" in the function, and now it works. I do not understand why I need to convert the data to a list. Can anybody explain it to me? Thank you for your help.
import pandas as pd
import matplotlib as plt
import numpy as np
from pyXSteam.XSteam import XSteam
steamTable = XSteam(XSteam.UNIT_SYSTEM_MKS)
pfad="XXX.xlsx"
df = pd.read_excel(pfad)
T1T_in = df.iloc[2:746,1]
T1p_in = df.iloc[2:746,2]
T1mPkt_in = df.iloc[2:746,3]
def Power(druck,temp,menge):
    H = []
    Q = []
    for i in range(len(druck)):
        H.append(steamTable.h_pt(druck.tolist()[i],temp.tolist()[i]))
        Q.append(H[i]*menge.tolist()[i])
    return Q
t1Q=Power(T1p_in,T1T_in,T1mPkt_in)
t1Q[0:10]

Your first example works because you pass plain Python lists into the function, for example T1_mPkt_in for the menge parameter:
T1_mPkt_in = [2.2,3,3.5]
Your second example does not work without .tolist() because you pass T1mPkt_in into the menge parameter as a pandas Series and not a list:
T1mPkt_in = df.iloc[2:746,3]
If you print out the type of T1mPkt_in, you will get:
<class 'pandas.core.series.Series'>
A Series sliced with df.iloc[2:746, ...] keeps its original index labels (2 to 745, assuming the default integer index from read_excel), and menge[i] on a Series with an integer index looks up the label i, not the position. There is no label 0, so the first lookup raises a KeyError. Calling .tolist() converts the Series into a list, which is indexed by position starting at 0.
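
The difference between label-based and position-based lookup can be reproduced on a small frame. The sketch below uses made-up numbers and a stand-in function fake_enthalpy in place of steamTable.h_pt (both are assumptions for illustration); it also shows that iterating with zip avoids index lookups altogether.

```python
import pandas as pd

df = pd.DataFrame({"p": [29, 29, 29, 29, 29],
                   "m": [1.0, 2.2, 3.0, 3.5, 4.0]})

# Slicing by position keeps the original labels 2, 3, 4.
menge = df.iloc[2:5, 1]

# menge[0] looks for the label 0, which does not exist in this slice.
try:
    menge[0]
    lookup_failed = False
except KeyError:
    lookup_failed = True

first_by_position = menge.iloc[0]   # position-based access on the Series
as_list = menge.tolist()            # plain list, indexed from 0


def fake_enthalpy(p, t):
    # Hypothetical stand-in for steamTable.h_pt, for illustration only.
    return p + t


def power(druck, temp, menge):
    # zip iterates over values, so lists and Series behave the same here.
    return [fake_enthalpy(p, t) * m for p, t, m in zip(druck, temp, menge)]


temp = pd.Series([398, 397, 395], index=[2, 3, 4])
print(power(df.iloc[2:5, 0], temp, menge))
```

Converting once with .tolist() before the loop, or using .iloc[i] inside it, also avoids calling .tolist() on every iteration as the question's code does.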
