Python: Get last row from pandas DataFrame

I'm trying to use some financial functions written in Python which return a pandas DataFrame.
The following function returns a pandas DataFrame:
from yahoo_fin import stock_info as si
data = si.get_data("ENEL.MI", start_date="01/21/2022 8:00", end_date="01/21/2022 16:30",index_as_date=False, interval="1d")
Here is what I get if I print data:
date open high low close adjclose volume ticker
0 2022-01-21 6.976 6.993 6.855 6.905 6.905 33639775 ENEL.MI
1 2022-01-21 6.976 6.993 6.855 6.905 6.905 35419140 ENEL.MI
I'd like to collect just the last row from the DataFrame (the row with number "1")
So, I've tried with:
lastrow = data.tail()
print(lastrow)
However I still get the same result (the whole DataFrame is printed).
I'm a bit puzzled. Is there a way to get just the last row?
Thanks a lot

With the iloc indexer, we can retrieve a particular row (or a single value at a row and column) by its integer position.
lastrow = data.iloc[-1]

You have to specify the number of rows you want to get. So for your case it would be data.tail(1) to print only the last row. The default is 5 rows, which is why you see your full DataFrame of 2 rows.
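Both answers can be sanity-checked on a toy frame (made-up values standing in for the ENEL.MI data):

```python
import pandas as pd

# Two rows, like the frame in the question (values made up).
data = pd.DataFrame({"close": [6.905, 6.905],
                     "volume": [33639775, 35419140]})

whole = data.tail()             # default n=5: returns all 2 rows unchanged
last_as_frame = data.tail(1)    # last row, still a 1-row DataFrame
last_as_series = data.iloc[-1]  # last row as a Series
```

tail(1) keeps the DataFrame shape, while iloc[-1] gives a Series you can index by column name, e.g. last_as_series['volume'].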

Related

Problem: all rows in a new column change to the result of the last calculation performed

I'm new to Python and have researched to find an answer; I am most likely not asking the right question. I am streaming data from an exchange into a DataFrame (and will later stream the data into a database). My problem is that when I do a calculation on a column to create a new column containing the result, all of the values in every row of the new column change to the last result.
I am streaming in the open, high, low, close of a stock. In one column I am calculating the range for a candle during the timeframe, like on a one hour chart.
# Assumption: TA comes from the finta library, and period is 21 (to match SMA_21/EMA_21).
from finta import TA

period = 21

src = candles.close
ohlc = candles
ohlc = ohlc.rename(columns=str.lower)
candles['SMA_21'] = TA.SSMA(ohlc, period)
candles['EMA_21'] = TA.EMA(ohlc, period)
candles['WMA'] = TA.WMA(ohlc, 10)
candles['Range'] = src - candles['open']
candles['AvgRange'] = candles['Range'].tail(21).mean()
The range column works and has correct information which is not changed by each calculation. But the column for 'AvgRange' ends up with all values changed with each new mean value calculated.
The following also writes the last data entry to the whole column stream['EMA_Dir']
if stream['EMA'].iloc[-1] > stream['EMA'].iloc[-2]:
    stream['EMA_Dir'] = "Ascending"
I only want the last entry in the last, most recent, row of the dataframe.
Tried several things, but the last calculation changes all values in 'AvgRange' column.
Thanks in advance. Sorry if I didn't ask the question correctly, but that is probably why I haven't found the answer.
candles['AvgRange'] = candles['Range'].rolling(
    window=3,
    center=False
).mean()
This will give you a 3-row rolling average (use window=21 to match your 21-period lookback).
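For the EMA_Dir part of the question: assigning a scalar to a column (stream['EMA_Dir'] = "Ascending") broadcasts it to every row, which is why the whole column changes. A minimal sketch, with made-up EMA values, that writes only into the most recent row:

```python
import pandas as pd

stream = pd.DataFrame({"EMA": [1.0, 1.2, 1.5]})  # made-up values

# .loc with the last index label touches only that one cell;
# the rows above are left as NaN when the column is created.
if stream["EMA"].iloc[-1] > stream["EMA"].iloc[-2]:
    stream.loc[stream.index[-1], "EMA_Dir"] = "Ascending"
```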

How do I sum the bottom X rows of a column in a Pandas dataframe?

Each hour I get current weather data via an API. Current weather data is appended to the bottom of the dataframe:
df = df.append([current_weather], sort=False, ignore_index=True)
Current weather data includes precipitation totals for the past hour (precipitation_1h).
I also have a column for 24 hour precipitation totals (precipitation_1d). I want to calculate this value for the appended row only, not the entire column (so I'm not looking to use df.rolling).
Here's what I've tried...
This code skips the bottom row (i.e. the most recent precipitation_1h value):
df.at[current_weather['timestamp'], 'precipitation_1d'] = df.iloc[-1:-25 , df.columns.get_loc('precipitation_1h')].sum()
This code includes too many rows (I think all rows are included and bottom 25 counted twice):
df.at[current_weather['timestamp'], 'precipitation_1d'] = df.iloc[0:-25 , df.columns.get_loc('precipitation_1h')].sum()
This code works if I flip the order and add the new data as the first row of the dataframe (but I'd prefer to keep the new data at the bottom of the dataframe):
df.at[current_weather['timestamp'], 'precipitation_1d'] = df.iloc[:24 , df.columns.get_loc('precipitation_1h')].sum()
Any ideas?
UPDATE: G. Anderson's suggestion in the comments worked perfectly. He said it'd be helpful to see the dataframe, so I've included a screenshot of the tail in case it helps anyone in the future. The screenshot shows the dataframe after applying his solution; before, all columns ending in 1d, 5d, 10d, and 20d had NaN values, and now they hold sums (precipitation) or means (humidity, temp).
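The suggestion itself isn't quoted above, but a vectorised sum over the bottom 24 rows, written into only the newest row, would typically look like this (column names from the question, data made up):

```python
import pandas as pd

# 30 hourly readings; the last row stands in for the freshly appended observation.
df = pd.DataFrame({"precipitation_1h": range(1, 31)})

# iloc[-24:] selects the last 24 rows (including the new one);
# .loc with the last index label writes the sum into that row only.
df.loc[df.index[-1], "precipitation_1d"] = df["precipitation_1h"].iloc[-24:].sum()
```

Here the last 24 values are 7 through 30, so the stored total is 444; the rest of the new column stays NaN.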

Python Pandas Splitting Strings and Storing the Remainder in New Row

I have a pandas dataframe where observations are broken out per every two days. The values in the 'Date' column each describe a range of two days (eg 2020-02-22 to 2020-02-23).
I want to split those Date values into individual days, with a row for each day. The closest I got was newdf = df_day.set_index(df_day.columns.drop('Date',1).tolist()).Date.str.split(' to ', expand=True).stack().reset_index().loc[:, df_day.columns]
The problem here is that the new date values are returned as NaNs. Is there a way to get this data broken out by individual day?
I might not be understanding, but based on the image it's a single date per row as is, just poorly labeled. I would manipulate the index strings; if I can't do that, I would create a new date column, or a new df with a clean date and merge it.
You should be able to chop off the first 14 characters with a lambda, leaving you with the second listed date in the index.
I can't reproduce this, so bear with me.
df.rename(index=lambda s: s[14:])
#should remove first 14 characters from each row label.
#leaving just '2020-02-23' in row 2.
#If you must skip row 1, idx = df.index[1:]
#or df.iloc[1:].rename(index=lambda s: s[1:])
Otherwise, I would just replace it with a new datetime index.
didx = pd.date_range(start='2000-01-10', freq='D', end='2020-02-26')
#Make sure same length as df
df = df.set_index(didx)
#Or
#df['new_date'] = didx.values
#df = df.set_index('new_date').drop(columns=['Date'])
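As an alternative to replacing the index, and assuming the Date column really holds strings like '2020-02-22 to 2020-02-23', splitting and exploding gives one row per day (DataFrame.explode is available since pandas 0.25):

```python
import pandas as pd

df = pd.DataFrame({"Date": ["2020-02-22 to 2020-02-23",
                            "2020-02-24 to 2020-02-25"],
                   "value": [10, 20]})  # made-up observation column

# Split each range into a two-element list, then explode to one row per day;
# the other columns are duplicated for both days of the range.
out = (df.assign(Date=df["Date"].str.split(" to "))
         .explode("Date")
         .reset_index(drop=True))
```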

How to deal with KeyError: 0 or KeyError: 1, etc.

I am new to Python and this data science world, and I am trying to play with different datasets.
In this case I am using the housing price index from Quandl, but unfortunately I get stuck when I need to take the abbreviation names from the wiki page, always getting the same KeyError.
import quandl
import pandas as pd
#pull every single housing price index from quandl
#quandl api key
api_key = 'xxxxxxxxxxxx'
#get stuff from quandl
df = quandl.get('FMAC/HPI_AK', authtoken=api_key)  # Alaska
##print(df.head())
#get 50 states using pandas read_html from wikipedia
fifty_states = pd.read_html('https://en.wikipedia.org/wiki/List_of_states_and_territories_of_the_United_States')
##print(fifty_states[0][1]) #first data frame is index 0, looking for column 1, from element 1 on
#get quandl frannymac query names for each of the 50 states
for abbv in fifty_states[0][1][2:]:
    print('FMAC/HPI_' + str(abbv))
So the problem I got in the following step:
#get 50 states using pandas read html from wikipedia
fifty_states = pd.read_html('https://en.wikipedia.org/wiki/List_of_states_and_territories_of_the_United_States')
##print(fifty_states[0][1]) #first data frame is index 0, #looking for column 1,#from element 1 on
I have tried different ways to get just the abbreviations, but it does not work:
for abbv in fifty_states[0][1][2:]:
    print('FMAC/HPI_' + str(abbv))
for abbv in fifty_states[0][1][1:]:
    print('FMAC/HPI_' + str(abbv))
Always KeyError: 0.
I just need this step to work, and to have the following output:
FMAC/HPI_AL,
FMAC/HPI_AK,
FMAC/HPI_AZ,
FMAC/HPI_AR,
FMAC/HPI_CA,
FMAC/HPI_CO,
FMAC/HPI_CT,
FMAC/HPI_DE,
FMAC/HPI_FL,
FMAC/HPI_GA,
FMAC/HPI_HI,
FMAC/HPI_ID,
FMAC/HPI_IL,
FMAC/HPI_IN,
FMAC/HPI_IA,
FMAC/HPI_KS,
FMAC/HPI_KY,
FMAC/HPI_LA,
FMAC/HPI_ME
for the 50 states from US and then proceed to make a data analysis from this data.
Can anybody tell me what I am doing wrong? Cheers
Note that fifty_states is a list of DataFrames, filled with
content of tables from the source page.
The first of them (at index 0 in fifty_states) is the table of US states.
If you don't know the column names in a DataFrame (e.g. df),
to get column 1 from it (numbering from 0), run:
df.iloc[:, 1]
So, since we want this column from fifty_states[0], run:
fifty_states[0].iloc[:, 1]
Your code failed because you attempted to apply [1] to this DataFrame,
but this DataFrame has no column named 1.
Note that e.g. fifty_states[0][('Cities', 'Capital')] gives a proper result,
because this DataFrame has a MultiIndex on its columns,
and one of the columns has Cities at the first MultiIndex level
and Capital at the second level.
And getting back to your code, run:
for abbv in fifty_states[0].iloc[:, 1]:
    print('FMAC/HPI_' + str(abbv))
Note that [2:] is not needed. You probably wanted to skip 2 initial rows
of the <table> HTML tag, containing column names,
but in Pandas they are actually kept in the MultiIndex on columns,
so to get all values, you don't need to skip anything.
If you want these strings as a list, for future use, the code can be:
your_list = ('FMAC/HPI_' + fifty_states[0].iloc[:, 1]).tolist()
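The label-versus-position point can be demonstrated without fetching the page, using a toy frame with a MultiIndex on columns like the ones read_html builds (toy labels, not the real Wikipedia table):

```python
import pandas as pd

cols = pd.MultiIndex.from_tuples([("Name", "Name"), ("Abbr", "Abbr")])
states = pd.DataFrame([["Alabama", "AL"], ["Alaska", "AK"]], columns=cols)

try:
    states[1]  # no column labelled 1 at the first level
    raised = False
except KeyError:
    raised = True  # KeyError, as in the question

# Positional access works regardless of the labels:
abbrs = ("FMAC/HPI_" + states.iloc[:, 1]).tolist()
```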

How do I replace one cell in a data frame with another data in a different data frame?

I am trying to replace a value in the data frame dh based on the data frame larceny.
If the date in larceny exists, I want to find the corresponding date in dh and replace the corresponding 5th column entry with 1.
I am currently (somewhat successfully) doing it with the code below, but it is taking forever. Any help on this?
When I try to compare the dates directly the code does not work, so I compare the .value of the dates, and this seems to work.
import pandas as pd
from datetime import datetime
for i, row in dh.iterrows():
    for j in range(45314):
        if dh.iat[i, 0].value == larceny.iat[j, 0].value:
            dh.iat[i, 5] = 1
            print("Larceny")
            print(i, j)
            print(dh.iat[i, 0], larceny.iat[j, 0])
            print(dh.iat[i, 0].value, larceny.iat[j, 0].value, '\n\n')
Basically, dh has a cell for each hour of each day for 4 years. I want to populate the cell for each hour with a 1 in the "Is_larceny" column, if that corresponding year-month-day-hour appears in the larceny data frame.
Please help. I tried some pandas search methods but I was having a problem with comparing dates and searching and replacing properly.
Thanks.
dh.loc[dh['col1'].isin(larceny['col2']), 'col1'] = 1
This looks for any value in the dh['col1'] that also appears in larceny['col2'], then sets those values in dh['col1'] to 1. You will have to replace col1 and col2 with your respective column names.
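Adapted to the names in the question (with 'Is_larceny' as the flag column and toy hourly timestamps), the vectorised version of the loop would look like:

```python
import pandas as pd

dh = pd.DataFrame({
    "date": pd.to_datetime(["2020-01-01 00:00", "2020-01-01 01:00", "2020-01-01 02:00"]),
    "Is_larceny": [0, 0, 0],
})
larceny = pd.DataFrame({"date": pd.to_datetime(["2020-01-01 01:00"])})

# One pass over dh instead of len(dh) * 45314 pairwise comparisons:
dh.loc[dh["date"].isin(larceny["date"]), "Is_larceny"] = 1
```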
