Python - NaN return (pandas - resample function) - python

I'm doing a finance study based on the YouTube video linked below, and I would like to understand why I get NaN instead of the expected values. What do I need to change in this script to get the expected result?
YouTube case: https://www.youtube.com/watch?v=UpbpvP0m5d8
import investpy as env
import numpy as np
import pandas as pd
lt = ['ABEV3','CEAB3','ENBR3','FLRY3','IRBR3','ITSA4','JHSF3','STBP3']
prices = pd.DataFrame()
for i in lt:
    df = env.get_stock_historical_data(stock=i, from_date='01/01/2020', to_date='29/05/2020', country='brazil')
    df['Ativo'] = i
    prices = pd.concat([prices, df], sort=True)
pivoted = prices.pivot(columns='Ativo', values='Close')
e_r = pivoted.resample('Y').last().pct_change().mean()
e_r
Return:
Ativo
ABEV3 NaN
CEAB3 NaN
ENBR3 NaN
FLRY3 NaN
IRBR3 NaN
ITSA4 NaN
JHSF3 NaN
STBP3 NaN
dtype: float64

You need to change from_date so that the data spans more than one calendar year.
Your current script produces a single row after resample('Y').last() (the last close of 2020), and .pct_change() on one row returns NaN because there is no previous row to compare against.
When I changed from_date to '01/01/2018':
import investpy as env
import numpy as np
import pandas as pd
lt = ['ABEV3','CEAB3','ENBR3','FLRY3','IRBR3','ITSA4','JHSF3','STBP3']
prices = pd.DataFrame()
for i in lt:
    df = env.get_stock_historical_data(stock=i, from_date='01/01/2018', to_date='29/05/2020', country='brazil')
    df['Ativo'] = i
    prices = pd.concat([prices, df], sort=True)
pivoted = prices.pivot(columns='Ativo', values='Close')
e_r = pivoted.resample('Y').last().pct_change().mean()
e_r
I get the following output:
Ativo
ABEV3 -0.043025
CEAB3 -0.464669
ENBR3 0.180655
FLRY3 0.191976
IRBR3 -0.175084
ITSA4 -0.035767
JHSF3 1.283291
STBP3 0.223627
dtype: float64
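The same effect can be reproduced without investpy. The sketch below uses made-up prices (all values are assumptions for illustration) and groups by calendar year instead of calling resample('Y'), since recent pandas versions spell that alias 'YE'; the "last close per year, then percentage change" logic is otherwise the same.

```python
import pandas as pd


def mean_yearly_return(close: pd.Series) -> float:
    """Average year-over-year change of the last close in each calendar year."""
    yearly_last = close.groupby(close.index.year).last()
    # pct_change leaves the first row as NaN; mean() skips NaN values,
    # so the result is NaN only when no year has a predecessor.
    return yearly_last.pct_change().mean()


# Hypothetical closes that all fall inside 2020: a single yearly row.
one_year = pd.Series(
    [10.0, 11.0, 12.0],
    index=pd.to_datetime(["2020-01-02", "2020-03-02", "2020-05-29"]),
)

# Hypothetical closes spanning 2018 to 2020: three yearly rows.
three_years = pd.Series(
    [10.0, 12.0, 9.0],
    index=pd.to_datetime(["2018-12-28", "2019-12-30", "2020-05-29"]),
)

print(mean_yearly_return(one_year))     # NaN: nothing to compare 2020 against
print(mean_yearly_return(three_years))  # mean of +20% and -25%
```

With only 2020 in the data there is one yearly row and the result is NaN; once earlier years are included, each year has a predecessor and the mean is defined.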

Related

Convert yahoofinancials multidimensional dictionary output to dataframe

I'm creating a stock screener based on fundamental metrics using yahoofinancials module.
Below code gives output in multidimensional dictionary format that I'm not able to convert into dataframe format for further analysis.
import pandas as pd
from yahoofinancials import YahooFinancials
ticker = 'RELIANCE.NS'
yahoo_financials = YahooFinancials(ticker)
income_statement_data_qt = yahoo_financials.get_financial_stmts('quarterly', 'income')
income_statement_data_qt
Output:
Ideally, I'd like to have data in this format.
You can use list comprehension to iterate over the dictionaries from that particular ticker and use Pandas concat to concatenate the data along the columns axis (axis=1). Then, use rename_axis and reset_index to convert the index to a column with the desired name. Create a new column with the ticker name at the first position using insert.
import pandas as pd
from yahoofinancials import YahooFinancials
ticker = 'RELIANCE.NS'
yahoo_financials = YahooFinancials(ticker)
income_statement_data_qt = yahoo_financials.get_financial_stmts('quarterly', 'income')
dict_list = income_statement_data_qt['incomeStatementHistoryQuarterly'][ticker]
df = pd.concat([pd.DataFrame(i) for i in dict_list], axis=1)
df = df.rename_axis('incomeStatementHistoryQuarterly').reset_index()
df.insert(0, 'ticker', ticker)
print(df)
Output from df
         ticker incomeStatementHistoryQuarterly  ...     2021-03-31    2020-12-31
0   RELIANCE.NS                   costOfRevenue  ...   1.034690e+12  7.224900e+11
1   RELIANCE.NS          discontinuedOperations  ...            NaN           NaN
2   RELIANCE.NS                            ebit  ...   1.571800e+11  1.490100e+11
3   RELIANCE.NS       effectOfAccountingCharges  ...            NaN           NaN
..          ...                             ...  ...            ...           ...
18  RELIANCE.NS    sellingGeneralAdministrative  ...   3.976000e+10  4.244000e+10
19  RELIANCE.NS          totalOperatingExpenses  ...   1.338570e+12  1.029590e+12
20  RELIANCE.NS      totalOtherIncomeExpenseNet  ...  -1.330000e+09  2.020000e+09
21  RELIANCE.NS                    totalRevenue  ...   1.495750e+12  1.178600e+12
[22 rows x 6 columns]
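To see what each step does without calling the Yahoo API, here is a sketch with a hand-made dictionary. The ticker 'XYZ', the dates, and the numbers are invented, and the nesting is an assumption inferred from the indexing used in the code above.

```python
import pandas as pd

ticker = "XYZ"  # hypothetical ticker
# Assumed shape: {statement: {ticker: [{date: {metric: value}}, ...]}}
fake_stmts = {
    "incomeStatementHistoryQuarterly": {
        ticker: [
            {"2021-03-31": {"ebit": 3.0, "totalRevenue": 30.0}},
            {"2020-12-31": {"ebit": 2.0, "totalRevenue": 20.0}},
        ]
    }
}

dict_list = fake_stmts["incomeStatementHistoryQuarterly"][ticker]
# Each single-date dict becomes a one-column frame indexed by metric name.
frames = [pd.DataFrame(d) for d in dict_list]
# axis=1 places the dates side by side, aligned on the metric names.
df = pd.concat(frames, axis=1)
# Name the index, then move it into a regular column.
df = df.rename_axis("incomeStatementHistoryQuarterly").reset_index()
# Put the ticker in the first position.
df.insert(0, "ticker", ticker)
print(df)
```

The result has one row per metric and one column per reporting date, preceded by the ticker and metric-name columns.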

PANDAS dataframe concat and pivot data

I'm learning Python pandas and playing with some example data. I have a CSV file of a dataset with net worth by percentile of the US population, by quarter.
I've successfully subset the data by percentile to create three plots of net worth by year, one for each of three population sections. However, I'm trying to combine those three into one data frame so I can draw the lines on a single plot figure.
Data here:
https://www.federalreserve.gov/releases/z1/dataviz/download/dfa-income-levels.csv
Code thus far:
import pandas as pd
import matplotlib.pyplot as plt
# importing numpy as np
import numpy as np
df = pd.read_csv("dfa-income-levels.csv")
df99th = df.loc[df['Category']=="pct99to100"]
df99th.plot(x='Date',y='Net worth', title='Net worth by percentile')
dfmid = df.loc[df['Category']=="pct40to60"]
dfmid.plot(x='Date',y='Net worth')
dflow = df.loc[df['Category']=="pct00to20"]
dflow.plot(x='Date',y='Net worth')
data = dflow['Net worth'], dfmid['Net worth'], df99th['Net worth']
headers = ['low', 'mid', '99th']
newdf = pd.concat(data, axis=1, keys=headers)
And that yields the dataframe shown below, which is not what I want for plotting the data.
           low        mid        99th
0          NaN        NaN   3514469.0
3          NaN  2503918.0         NaN
5     585550.0        NaN         NaN
6          NaN        NaN   3602196.0
9          NaN  2518238.0         NaN
...        ...        ...         ...
747        NaN  8610343.0         NaN
749  3486198.0        NaN         NaN
750        NaN        NaN  32011671.0
753        NaN  8952933.0         NaN
755  3540306.0        NaN         NaN
Any recommendations for other ways to approach this?
# Filter your dataframe to only the categories you're interested in
filtered_df = df[df['Category'].isin(['pct99to100', 'pct00to20', 'pct40to60'])]
filtered_df = filtered_df[['Date', 'Category', 'Net worth']]
fig, ax = plt.subplots()  # ax is an Axes object that several plots can share
# Draw one line per category on the shared axes, labelled with the category name
for category, group in filtered_df.groupby('Category'):
    group.plot(x='Date', y='Net worth', ax=ax, label=category)
I don't see the categories mentioned in your code in the CSV file you shared. To combine dataframes side by side, you can use pd.concat with axis=1. It aligns rows by index label, which is why your result is full of NaN: the three subsets keep their original, non-overlapping row numbers. So first set the Date column as the index, then concat, and finally bring Date back as a dataframe column.
Set the Date column as the index of each dataframe: df1 = df1.set_index('Date') and df2 = df2.set_index('Date')
Concat the dataframes df1 and df2 using df_merge = pd.concat([df1,df2],axis=1) or df_merge = pd.merge(df1,df2,on='Date')
Bring Date back as a column with df_merge = df_merge.reset_index()
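The steps above can be sketched on a tiny made-up frame. The category names 'low' and 'high', the dates, and the values are placeholders, not values from the Federal Reserve file.

```python
import pandas as pd

# Hypothetical long-format data: one row per (Date, Category).
df = pd.DataFrame({
    "Date": ["2020:Q1", "2020:Q1", "2020:Q2", "2020:Q2"],
    "Category": ["low", "high", "low", "high"],
    "Net worth": [1.0, 10.0, 2.0, 20.0],
})

# Index each subset by Date so concat aligns on dates, not on row numbers.
low = df.loc[df["Category"] == "low"].set_index("Date")["Net worth"]
high = df.loc[df["Category"] == "high"].set_index("Date")["Net worth"]

# keys= supplies the column names; reset_index turns Date back into a column.
wide = pd.concat([low, high], axis=1, keys=["low", "high"]).reset_index()
print(wide)

# pivot reaches the same wide layout in a single call.
pivoted = df.pivot(index="Date", columns="Category", values="Net worth")
print(pivoted)
```

Either result has one row per date and one column per category, so a single wide.plot(x='Date') or pivoted.plot() call draws all the lines on one figure.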

Pandas reading data as NaT and nan

I am trying to read in some sample data that is in a CSV file which contains the following. However when I print out the data in the dataframe many of the columns are null. I manually set the column datatypes because I thought that might be the issue, however that didn't solve the problem. Some help would be appreciated.
Data:
People_id,datetime,First Name,Last Name,Utilization,Chargeability,Target,Employee Type,Business Unit,Business Group
2222,2020-05-03,FirstName,LastName,0.8,0.9,0.4,Employee,GGGG,G1
Code:
import pandas as pd
data = pd.read_csv (r'C:\Users\Name\Documents\testdata.csv')
df = pd.DataFrame(data, columns= ['People_id', 'WeekEnding', 'FirstName', 'LastName', 'Utilization', 'Chargeability', 'Target', 'EmployeeType', 'BusinessUnit', 'BusinessGroup'])
df.dropna(subset=['People_id', 'Utilization', 'Chargeability', 'Target'])
df.FirstName = df.FirstName.astype(str)
df.LastName = df.LastName.astype(str)
df.EmployeeType = df.EmployeeType.astype(str)
df.BusinessUnit = df.BusinessUnit.astype(str)
df.BusinessGroup = df.BusinessGroup.astype(str)
df['WeekEnding'] = pd.to_datetime(df['WeekEnding'])
for row in df.itertuples():
    print(row.People_id)
    print(row.WeekEnding)
    print(row.FirstName)
    print(row.LastName)
    print(row.Utilization)
    print(row.Chargeability)
    print(row.Target)
    print(row.EmployeeType)
    print(row.BusinessUnit)
    print(row.BusinessGroup)
Output:
2222
NaT
nan
nan
0.8
0.9
0.4
nan
nan
nan
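One likely cause, assuming the CSV headers are exactly as shown in the sample: pd.DataFrame(data, columns=[...]) selects columns by name, and any name that does not exist in data (for example 'WeekEnding' instead of 'datetime', or 'FirstName' instead of 'First Name') comes back as an all-NaN column. The sketch below reproduces this with a shortened copy of the sample row and shows renaming as one possible fix.

```python
import io

import pandas as pd

# Shortened copy of the sample file, with its headers left unchanged.
csv_text = (
    "People_id,datetime,First Name,Last Name,Utilization\n"
    "2222,2020-05-03,FirstName,LastName,0.8\n"
)
data = pd.read_csv(io.StringIO(csv_text))

# Asking for names that are not in the file creates empty (NaN) columns.
mismatched = pd.DataFrame(data, columns=["People_id", "WeekEnding", "FirstName"])
print(mismatched)

# Renaming the real headers keeps the values.
renamed = data.rename(columns={
    "datetime": "WeekEnding",
    "First Name": "FirstName",
    "Last Name": "LastName",
})
renamed["WeekEnding"] = pd.to_datetime(renamed["WeekEnding"])
print(renamed)
```

Only the columns whose names match the file (People_id, Utilization, and so on) keep their values in the mismatched frame, which is consistent with the output shown above.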

TypeError: Cannot compare type 'Timestamp' with type 'int'

I have some long-winded code here. When I attempt to join (or merge/concat) two datasets, I get this error: TypeError: Cannot compare type 'Timestamp' with type 'int'
Both datasets come from resampling the same initial dataset. The master_hrs df is produced by a change point algorithm from the Python package ruptures (pip install ruptures to run the code). The daily_summary df just uses pandas to resample daily mean and sum values. I get the error when I attempt to combine the two datasets. Would anyone have any tips to try?
Making up some fake data generates the same error as my real-world dataset. I think the issue is that I am somehow trying to compare datetime to numpy... Any tips greatly appreciated. Thanks
import ruptures as rpt
import calendar
import numpy as np
import pandas as pd
np.random.seed(11)
rows,cols = 50000,2
data = np.random.rand(rows,cols)
tidx = pd.date_range('2019-01-01', periods=rows, freq='H')
df = pd.DataFrame(data, columns=['Temperature','Value'], index=tidx)
def changPointDf(df):
    arr = np.array(df.Value)
    # Define Binary Segmentation search method
    model = "l2"
    algo = rpt.Binseg(model=model).fit(arr)
    my_bkps = algo.predict(n_bkps=5)
    # getting the timestamps of the change points
    bkps_timestamps = df.iloc[[0] + my_bkps[:-1] + [-1]].index
    # computing the durations between change points
    durations = (bkps_timestamps[1:] - bkps_timestamps[:-1])
    # hours calc
    d = durations.seconds/60/60
    d_f = pd.DataFrame(d)
    df2 = d_f.T
    return df2
master_hrs = pd.DataFrame()
for idx, days in df.groupby(df.index.date):
    changPoint_df = changPointDf(days)
    values = changPoint_df.values.tolist()
    master_hrs=master_hrs.append(values)
master_hrs.columns = ['overnight_AM_hrs', 'moring_startup_hrs', 'moring_ramp_hrs', 'high_load_hrs', 'evening_shoulder_hrs']
daily_summary = pd.DataFrame()
daily_summary['Temperature'] = df['Temperature'].resample('D').mean()
daily_summary['Value'] = df['Value'].resample('D').sum()
final_df = daily_summary.join(master_hrs)
The issue was the indexes themselves: master_hrs had an int64 index whereas daily_summary had a datetime index. Include this before joining the two dataframes together:
master_hrs.index = pd.to_datetime(master_hrs.index)
Just for clarity, here's my output of final_df:
Temperature Value ... high_load_hrs evening_shoulder_hrs
2019-01-01 0.417517 12.154527 ... NaN NaN
2019-01-02 0.521131 13.811842 ... NaN NaN
2019-01-03 0.583205 12.568966 ... NaN NaN
2019-01-04 0.448225 14.036136 ... NaN NaN
2019-01-05 0.542870 10.738192 ... NaN NaN
... ... ... ... ...
2024-09-10 0.470421 13.775528 ... NaN NaN
2024-09-11 0.384672 10.473930 ... NaN NaN
2024-09-12 0.527284 14.000231 ... NaN NaN
2024-09-13 0.555646 11.460867 ... NaN NaN
2024-09-14 0.426003 3.763975 ... NaN NaN
[2084 rows x 7 columns]
Hopefully this gets you what you need.
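Note that the joined columns in the output above are all NaN. A possible explanation, sketched here with made-up numbers: pd.to_datetime interprets integers as nanoseconds since 1970-01-01, so the converted index has the right type but never matches the 2019 dates. If master_hrs holds exactly one row per day in chronological order (an assumption about the real data), reusing the daily index aligns the rows instead.

```python
import pandas as pd

days = pd.date_range("2019-01-01", periods=3, freq="D")
daily_summary = pd.DataFrame({"Value": [1.0, 2.0, 3.0]}, index=days)

# Stand-in for master_hrs: one row per day, but with a plain integer index.
master_hrs = pd.DataFrame({"high_load_hrs": [5.0, 6.0, 7.0]}, index=[0, 1, 2])

# Integers become timestamps a few nanoseconds after 1970-01-01,
# so no row lines up with a 2019 date and the join fills in NaN.
converted = master_hrs.copy()
converted.index = pd.to_datetime(converted.index)
joined_nan = daily_summary.join(converted)
print(joined_nan)

# Assigning the daily dates as the index lines the rows up by position.
aligned = master_hrs.copy()
aligned.index = daily_summary.index
joined_ok = daily_summary.join(aligned)
print(joined_ok)
```

The second join keeps the hour values because both frames now share the same date labels.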

Removing text and keep digits in panda

I have a dataframe similar to the one below
I want to remove the text and keep only the digits in each column of that dataframe
The expected output is something like this
So far I have tried this
import json
import requests
import pandas as pd
URL = 'https://xxxxx.com'
req = requests.get(URL,auth=('xxx', 'xxx') )
text_data= req.text
json_dict= json.loads(text_data)
df = pd.DataFrame.from_dict(json_dict["measurements"])
cols_to_keep =['source','battery','c8y_TemperatureMeasurement','time','c8y_DistanceMeasurement']
df_final = df[cols_to_keep]
df_final = df_final.rename(columns={'c8y_TemperatureMeasurement': 'Temperature Or T','c8y_DistanceMeasurement':'Distance'})
for col in df_final:
    df_final[col] = [''.join(re.findall(r"\d*\.?\d+", item)) for item in df_final[col]]
Your code is missing import re, and the data cannot be accessed because it requires credentials.
You can use pandas.DataFrame.replace:
Example data:
df = pd.DataFrame({'a':['abc123abc', 'def456678'], 'b':['123a', 'b456']})
Dataframe:
a b
0 abc123abc 123a
1 def456678 b456
The pattern [^0-9.] matches every character that is not a digit or a period, and replace swaps each match for an empty string.
df.replace('[^0-9.]', '', regex=True)
Output:
a b
0 123 123
1 456678 456
Edit:
The problem here is actually about nested JSON and not about replacing values in a dataframe. The statement above does not work because the data is stored as dicts in the dataframe. But since the above-mentioned solution is generally correct, I won't edit it out.
Revised Answer:
import json
import requests
import pandas as pd
URL = 'https://wastemanagement.post-iot.lu/measurement/measurements?source=83512& pageSize=1000000000&dateFrom=2019-10-26&dateTo=2019-10-28'
req = requests.get(URL,auth=('xxxx', 'xxxx') )
text_data= req.text
json_dict= json.loads(text_data)
# pd.json_normalize (pandas 1.0+) replaces the older pandas.io.json.json_normalize import
df = pd.json_normalize(json_dict['measurements'])
# Rename the flattened columns on df itself (df_final does not exist yet at this point)
df = df.rename(columns={'source.id': 'source', 'battery.percent.value': 'battery', 'c8y_TemperatureMeasurement.T.value': 'Temperature Or T','c8y_DistanceMeasurement.distance.value':'Distance'})
cols_to_keep =['source' ,'battery', 'Temperature Or T', 'time', 'Distance']
df_final = df[cols_to_keep]
Output:
source battery Temperature Or T time Distance
0 83512 98.0 NaN 2019-10-26T00:00:06.494Z NaN
1 83512 NaN 23.0 2019-10-26T00:00:06.538Z NaN
2 83512 NaN NaN 2019-10-26T00:00:06.577Z 21.0
3 83512 98.0 NaN 2019-10-26T00:30:06.702Z NaN
4 83512 NaN 23.0 2019-10-26T00:30:06.743Z NaN
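To show how json_normalize flattens nested records without needing credentials, here is a sketch with two hand-written measurements. Their structure is an assumption modeled on the dotted column names used in the rename above, and the values are made up.

```python
import pandas as pd

# Hypothetical records shaped like the nested measurements.
records = [
    {
        "source": {"id": "83512"},
        "time": "2019-10-26T00:00:06.494Z",
        "battery": {"percent": {"value": 98.0}},
    },
    {
        "source": {"id": "83512"},
        "time": "2019-10-26T00:00:06.538Z",
        "c8y_TemperatureMeasurement": {"T": {"value": 23.0}},
    },
]

# Nested keys are joined with dots, e.g. 'battery.percent.value';
# keys missing from a record are filled with NaN.
flat = pd.json_normalize(records)
flat = flat.rename(columns={
    "source.id": "source",
    "battery.percent.value": "battery",
    "c8y_TemperatureMeasurement.T.value": "Temperature Or T",
})
print(flat[["source", "battery", "Temperature Or T", "time"]])
```

Each record contributes values only to the columns it contains, which is why the real output above shows NaN in most cells of every row.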
