Make a list of a dataframe index - python

I'm trying to make a list of the index of my dataframe so I can use it as the X values in a plot.
I'm also trying to make a list of the rainfall so I can use it as the Y values in a plot. The dataframe is df and the index column is date.
df=pd.read_csv(data_source, sep=',', comment='#', header=None, names=['station', 'date', 'T_gem', 'T_min', 'T_max', 'rainfall'], parse_dates=[1])
df = df.set_index(['date'])
january = df.loc['2021-01-01':'2021-01-31']
I've tried using january = df.loc['2021-01-01':'2021-01-31', 'date'] but that raises a KeyError because I think it cannot find the column date as it is an index.

This should work:
january_df = df['2021-01-01':'2021-01-31']
After this, you can plot the rainfall column directly:
january_df['rainfall'].plot()
You don't have to reset the index and create a list.
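If you do still want plain Python lists for the X and Y values, a minimal sketch follows. It uses a small made-up dataframe in place of the CSV from the question, so the values are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical stand-in for the CSV data: five daily rows around New Year.
df = pd.DataFrame({
    'date': pd.date_range('2020-12-30', periods=5, freq='D'),
    'rainfall': [0.0, 1.2, 3.4, 0.0, 5.6],
}).set_index('date')

# Slice January by label; this works because the index holds datetimes.
january = df.loc['2021-01-01':'2021-01-31']

# The index is not a column, so read it from .index rather than via .loc.
x_values = january.index.tolist()        # list of pandas Timestamps
y_values = january['rainfall'].tolist()  # list of floats
```

Both lists can then be passed to a plotting call such as matplotlib's plt.plot(x_values, y_values).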

Related

Error when .loc() rows with a list of dates in pandas

I have the following code:
import pandas as pd
from pandas_datareader import data as web
df = web.DataReader('^GSPC', 'yahoo')
df['pct'] = df['Close'].pct_change()
dates_list = df.index[df['pct'].gt(0.002)]
df2 = web.DataReader('^GDAXI', 'yahoo')
df2['pct2'] = df2['Close'].pct_change()
I was trying to run this:
df2.loc[dates_list, 'pct2']
But I keep getting this error:
KeyError: 'Passing list-likes to .loc or [] with any missing labels is no longer supported,
I am guessing this is because data is missing for some of the dates in dates_list. To resolve this, I tried:
idx1 = df.index
idx2 = df2.index
missing = idx2.difference(idx1)
df.drop(missing, inplace = True)
df2.drop(missing, inplace = True)
However, I am still getting the same error. I don't understand why that is.
Note that dates_list was created from df, so it contains only dates that are present in df's index.
You then read df2 and attempt to retrieve pct2 from the rows on just these dates.
But the index of df2 may not contain all the dates in dates_list, and that is the cause of your exception.
To avoid it, retrieve only the rows for dates that are present in df2's index.
To narrow the row specification down to these "allowed" dates, pass:
dates_list[dates_list.isin(df2.index)]
Run this alone and you will see the "allowed" dates (some dates will be eliminated).
So change the offending instruction to:
df2.loc[dates_list[dates_list.isin(df2.index)], 'pct2']
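The filtering step can be tried without downloading anything. The sketch below uses small made-up frames (the dates and values are assumptions, not real index data) and also shows reindex as an alternative that keeps the missing dates as NaN instead of dropping them.

```python
import pandas as pd

# Hypothetical df2: one trading day (2021-01-06) is absent from its index.
df2 = pd.DataFrame(
    {'pct2': [0.1, 0.2, 0.3]},
    index=pd.to_datetime(['2021-01-04', '2021-01-05', '2021-01-07']),
)

# Hypothetical dates_list taken from another frame; 2021-01-06 is not in df2.
dates_list = pd.DatetimeIndex(['2021-01-04', '2021-01-06', '2021-01-07'])

# Keep only the labels that exist in df2 before passing them to .loc.
present = dates_list[dates_list.isin(df2.index)]
selected = df2.loc[present, 'pct2']

# Alternative: reindex keeps every requested date and fills the gaps with NaN.
aligned = df2['pct2'].reindex(dates_list)
```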

Pandas groupby(df.index) with indexes of varying size

I have an array of dataframes dfs = [df0, df1, ...]. Each one of them has a date column of varying size (some dates might be in one dataframe but not the other).
What I'm trying to do is this:
pd.concat(dfs).groupby("date", as_index=False).sum()
But with date no longer being a column but an index (dfs = [df.set_index("date") for df in dfs]).
I've seen you can pass df.index to groupby (.groupby(df.index)) but df.index might not include all the dates.
How can I do this?
The goal here is to call .sum() on the groupby, so I'm not tied to using groupby or concat if there's an alternative method to do so.
If I understand correctly, maybe you want something like this:
df = pd.concat(dfs)
df.groupby(df.index).sum()
Here's a small example:
tmp1 = pd.DataFrame({'date':['2019-09-01','2019-09-02','2019-09-03'],'value':[1,1,1]}).set_index('date')
tmp2 = pd.DataFrame({'date':['2019-09-01','2019-09-02','2019-09-04','2019-09-05'],'value':[2,2,2,2]}).set_index('date')
df = pd.concat([tmp1,tmp2])
df.groupby(df.index).sum()
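Because the grouping happens after the concat, the combined index already contains every date from every frame, so no date is lost. As a sketch under the same toy data, the index can also be grouped by level name, and an alternative without groupby is included for comparison; the variable names total and total_alt are illustrative.

```python
from functools import reduce

import pandas as pd

tmp1 = pd.DataFrame({'date': ['2019-09-01', '2019-09-02', '2019-09-03'],
                     'value': [1, 1, 1]}).set_index('date')
tmp2 = pd.DataFrame({'date': ['2019-09-01', '2019-09-02', '2019-09-04', '2019-09-05'],
                     'value': [2, 2, 2, 2]}).set_index('date')
dfs = [tmp1, tmp2]

# Group on the index level by name; equivalent to groupby(df.index) here.
total = pd.concat(dfs).groupby(level='date').sum()

# Alternative without groupby: add the frames pairwise, treating a date
# missing from one frame as 0. The result is float because of the fill.
total_alt = reduce(lambda a, b: a.add(b, fill_value=0), dfs)
```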

Sort dataframe by columns names if the columns are dates, pandas?

My df column names are dates in this format: dd-mm-yy. When I use sort_index(axis = 1), it sorts by the first two digits (which specify the days), so the result doesn't make sense chronologically. How can I sort it automatically while also taking the months into account?
My df headers:
submitted_at 06-05-18 13-05-18 29-04-18
I expected the output of:
submitted_at 29-04-18 06-05-18 13-05-18
Convert the columns to datetime and use argsort to find the correct ordering. This will put all non-dates to the left in the order they occur, followed by the sorted dates.
import pandas as pd
df = pd.DataFrame(columns=['submitted_at', '06-05-18', '13-05-18', '29-04-18'])
idx = pd.to_datetime(df.columns, errors='coerce', format='%d-%m-%y').argsort()
df.iloc[:, idx]
Empty DataFrame
Columns: [submitted_at, 29-04-18, 06-05-18, 13-05-18]
Convert the strings to datetime and sort the date columns by them, with something like this:
from datetime import datetime
date_cols = [c for c in df.columns if c != 'submitted_at']
df = df[['submitted_at'] + sorted(date_cols, key=lambda c: datetime.strptime(c, '%d-%m-%y'))]
If the dates are in a column rather than in the header, just convert that column to datetime
df['newdate']=pd.to_datetime(df.date,format='%d-%m-%y')
and then sort it using sort_values
df.sort_values(by='newdate')
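Where NaT values land after an argsort may depend on the pandas version, so a sketch that separates the non-date columns explicitly can be easier to reason about. The single data row below is made up for illustration.

```python
import pandas as pd

# Hypothetical one-row frame with the headers from the question.
df = pd.DataFrame([['a', 1, 2, 3]],
                  columns=['submitted_at', '06-05-18', '13-05-18', '29-04-18'])

# Parse each header; anything that is not a dd-mm-yy date becomes NaT.
parsed = pd.to_datetime(pd.Series(df.columns, index=df.columns),
                        format='%d-%m-%y', errors='coerce')

# Non-date columns keep their original order; date columns are sorted.
non_dates = parsed[parsed.isna()].index.tolist()
dates = parsed.dropna().sort_values().index.tolist()

result = df[non_dates + dates]
```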

Adding columns to a dataframe where all other columns are periods

I have a timeseries dataframe with a PeriodIndex. I would like to use its values as column names in another dataframe and add other columns, which are not Periods. The problem is that when I create the dataframe using only periods as the column index, adding a column whose label is a string raises an error. However, if I create the dataframe with a column index that contains both periods and strings, then I'm able to add columns with string labels.
import numpy as np
import pandas as pd
data = np.random.normal(size=(5,2))
idx = pd.Index(pd.period_range(2011,2012,freq='A'),name='year')
df = pd.DataFrame(data,columns=idx)
df['age'] = 0
This raises an error.
import numpy as np
import pandas as pd
data = np.random.normal(size=(5,2))
idx = pd.Index(pd.period_range(2011,2012,freq='A'),name='year')
df = pd.DataFrame(columns=idx.tolist()+['age'])
df = df.iloc[:,:-1]
df[:] = data
df['age'] = 0
This does not raise an error and gives my desired outcome, but doing it this way I can't assign the data in a convenient way when I create the dataframe. I would like a more elegant way of achieving the result. I wonder if this is a bug in Pandas?
Not really sure what you are trying to achieve, but here is one way to get what I understood you wanted:
import pandas as pd
idx = pd.Index(pd.period_range(2011,2015,freq='A'),name='year')
df = pd.DataFrame(index=idx)
df1 = pd.DataFrame({'age':['age']})
df1 = df1.set_index('age')
# DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
df = pd.concat([df, df1]).T
print(df)
Which gives:
Empty DataFrame
Columns: [2011, 2012, 2013, 2014, 2015, age]
Index: []
And it keeps your years as Periods:
df.columns[0]
Period('2011', 'A-DEC')
The same result most likely can be achieved using .merge.
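Another option, sketched here as an assumption rather than a confirmed fix, is to hold the Period labels in an object-dtype Index from the start. The data can then be assigned at construction time and a string-labelled column added afterwards. Whether a plain PeriodIndex still raises on df['age'] = 0 may depend on the pandas version.

```python
import numpy as np
import pandas as pd

data = np.random.normal(size=(5, 2))
periods = pd.period_range('2011', '2012', freq='Y')

# Store the Period objects in a generic object-dtype Index so that labels
# of other types (here a string) can be added to the columns later.
columns = pd.Index(list(periods), dtype=object, name='year')

df = pd.DataFrame(data, columns=columns)
df['age'] = 0
```

The first two column labels remain Period objects, so they can still be compared with other Periods.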

Pandas Re-indexing command

Re: Add missing dates to pandas dataframe (a previously asked question)
import pandas as pd
import numpy as np
idx = pd.date_range('09-01-2013', '09-30-2013')
df = pd.DataFrame(data = [2,10,5,1], index = ["09-02-2013","09-03-2013","09-06-2013","09-07-2013"], columns = ["Events"])
df.index = pd.DatetimeIndex(df.index); #question (1)
df = df.reindex(idx, fill_value=np.nan)
print(df)
In the above script what does the command noted as question one do? If you leave this
command out of the script, the df will be re-indexed but the data portion of the
original df will not be retained. As there is no reference to the df data in the
DatetimeIndex command, why is the data from the starting df lost?
Short answer: df.index = pd.DatetimeIndex(df.index); converts the string index of df to a DatetimeIndex.
You have to make the distinction between different types of indexes. In
df = pd.DataFrame(data = [2,10,5,1], index = ["09-02-2013","09-03-2013","09-06-2013","09-07-2013"], columns = ["Events"])
you have an index containing strings. When using
df.index = pd.DatetimeIndex(df.index);
you convert this standard index with strings to an index with datetimes (a DatetimeIndex). So the values of these two types of indexes are completely different.
Now, you reindex with
idx = pd.date_range('09-01-2013', '09-30-2013')
df = df.reindex(idx)
where idx is also an index with datetimes. When you reindex the original df with a string index, there are no matching index values, so no column values of the original df are retained. When you reindex the second df (after converting the index to a datetime index), there will be matching index values, so the column values at those index values are retained.
See also http://pandas.pydata.org/pandas-docs/dev/generated/pandas.DataFrame.reindex.html
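The difference between the two cases can be checked side by side. This is a minimal sketch using the data from the question; the names df_str, df_dt, lost and kept are illustrative.

```python
import pandas as pd

idx = pd.date_range('09-01-2013', '09-30-2013')
df_str = pd.DataFrame(
    data=[2, 10, 5, 1],
    index=['09-02-2013', '09-03-2013', '09-06-2013', '09-07-2013'],
    columns=['Events'],
)

# Case 1: string labels never equal Timestamp labels, so nothing matches.
lost = df_str.reindex(idx)

# Case 2: convert the labels to datetimes first, then reindex.
df_dt = df_str.copy()
df_dt.index = pd.to_datetime(df_dt.index, format='%m-%d-%Y')
kept = df_dt.reindex(idx)
```

In lost every Events value is NaN, while kept has the four original values on their dates and NaN elsewhere.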
