Adding columns to a dataframe where all other columns are periods - python

I have a timeseries dataframe with a PeriodIndex. I would like to use its values as column names in another dataframe and add other columns that are not Periods. The problem is that when I create the dataframe using only Periods as the column index, adding a column whose label is a string raises an error. However, if I create the dataframe with a column index that contains both Periods and strings, I am able to add columns with string labels.
import numpy as np
import pandas as pd
data = np.random.normal(size=(5,2))
idx = pd.Index(pd.period_range(2011,2012,freq='A'),name='year')
df = pd.DataFrame(data,columns=idx)
df['age'] = 0
This raises an error.
import numpy as np
import pandas as pd
data = np.random.normal(size=(5,2))
idx = pd.Index(pd.period_range(2011,2012,freq='A'),name='year')
df = pd.DataFrame(columns=idx.tolist()+['age'])
df = df.iloc[:,:-1]
df[:] = data
df['age'] = 0
This does not raise an error and gives my desired outcome, but doing it this way I can't assign the data in a convenient way when I create the dataframe. I would like a more elegant way of achieving the result. I wonder if this is a bug in Pandas?

Not really sure what you are trying to achieve, but here is one way to get what I understood you wanted:
import pandas as pd
idx = pd.Index(pd.period_range(2011,2015,freq='A'),name='year')
df = pd.DataFrame(index=idx)
df1 = pd.DataFrame({'age':['age']})
df1 = df1.set_index('age')
# DataFrame.append was removed in pandas 2.0; pd.concat is its replacement
df = pd.concat([df, df1]).T
print(df)
Which gives:
Empty DataFrame
Columns: [2011, 2012, 2013, 2014, 2015, age]
Index: []
And it keeps your years as Periods:
df.columns[0]
Period('2011', 'A-DEC')
The same result most likely can be achieved using .merge.
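Another possible workaround, sketched here under the assumption that the error comes from inserting a string label into a PeriodIndex, is to convert the column index to a plain object-dtype Index before adding the string column. The data can then be passed directly to the constructor. The 'Y' alias below is the annual frequency (the question spells it 'A'); whether the original error still occurs depends on the pandas version.

```python
import numpy as np
import pandas as pd

data = np.random.normal(size=(5, 2))

# Annual periods 2011 and 2012, used as column labels.
idx = pd.period_range("2011", "2012", freq="Y", name="year")
df = pd.DataFrame(data, columns=idx)

# Assumption: an object-dtype Index can hold Periods and strings side by side.
df.columns = df.columns.astype(object)
df["age"] = 0

print(df.columns.tolist())
```

The first two labels remain Period objects; only the dtype of the column index changes.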

Related

Make a list of a dataframe index

I'm trying to make a list of the index of my dataframe so I can use it as the X values in a plot.
I'm also trying to make a list of the rainfall so I can use it as the Y values in a plot. The dataframe is df and the index column is date.
df=pd.read_csv(data_source, sep=',', comment='#', header=None, names=['station', 'date', 'T_gem', 'T_min', 'T_max', 'rainfall'], parse_dates=[1])
df = df.set_index(['date'])
january = df.loc['2021-01-01':'2021-01-31']
I've tried using january = df.loc['2021-01-01':'2021-01-31', 'date'] but that raises a KeyError because I think it cannot find the column date as it is an index.
This should work:
january_df = df['2021-01-01':'2021-01-31']
After this, you can use the proposed solution.
january_df['rainfall'].plot()
You don't have to reset the index and create a list.
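If explicit lists are still wanted, for example to pass to another plotting library, they can be taken from the sliced frame directly. The sketch below uses a small made-up frame in place of the CSV from the question; the values are hypothetical.

```python
import pandas as pd

# Hypothetical sample data standing in for the CSV file in the question.
df = pd.DataFrame(
    {
        "date": pd.to_datetime(
            ["2020-12-31", "2021-01-01", "2021-01-15", "2021-02-01"]
        ),
        "rainfall": [0.0, 1.2, 3.4, 0.5],
    }
).set_index("date")

january = df.loc["2021-01-01":"2021-01-31"]

# The index is not a column, so it is read from .index rather than via .loc.
x_values = january.index.tolist()
y_values = january["rainfall"].tolist()
```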

Error when .loc() rows with a list of dates in pandas

I have the following code:
import pandas as pd
from pandas_datareader import data as web
df = web.DataReader('^GSPC', 'yahoo')
df['pct'] = df['Close'].pct_change()
dates_list = df.index[df['pct'].gt(0.002)]
df2 = web.DataReader('^GDAXI', 'yahoo')
df2['pct2'] = df2['Close'].pct_change()
I was trying to run this:
df2.loc[dates_list, 'pct2']
But I keep getting this error:
KeyError: 'Passing list-likes to .loc or [] with any missing labels is no longer supported,
I am guessing this is because the data is missing for some of the dates in dates_list. To resolve this:
idx1 = df.index
idx2 = df2.index
missing = idx2.difference(idx1)
df.drop(missing, inplace = True)
df2.drop(missing, inplace = True)
However, I am still getting the same error. I don't understand why that is.
Note that dates_list has been created from df, so it contains dates present in the index of df.
Then you read df2 and attempt to retrieve pct2 from rows on just these dates.
But the index of df2 may not contain all dates given in dates_list, and that is the cause of your exception.
To avoid it, retrieve only rows on dates present in the index. To narrow the row specification down to such "allowed" dates, you should pass:
dates_list[dates_list.isin(df2.index)]
Run this alone and you will see the "allowed" dates (some dates will
be eliminated).
So change the offending instruction to:
df2.loc[dates_list[dates_list.isin(df2.index)], 'pct2']
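The filtering step can be tried on a small self-contained example. The frames and dates below are made up for illustration and do not come from the Yahoo data in the question. `Index.intersection` is shown as an alternative way to get the same labels.

```python
import pandas as pd

# Hypothetical stand-in for df2: one trading day (2021-01-06) is missing.
df2 = pd.DataFrame(
    {"pct2": [0.01, -0.02, 0.03]},
    index=pd.to_datetime(["2021-01-04", "2021-01-05", "2021-01-07"]),
)

# Hypothetical dates selected from the other frame.
dates_list = pd.DatetimeIndex(["2021-01-04", "2021-01-06", "2021-01-07"])

# Keep only the dates that actually exist in df2's index.
common = dates_list[dates_list.isin(df2.index)]
result = df2.loc[common, "pct2"]

# Alternative spelling of the same filter.
same = dates_list.intersection(df2.index)
```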

Pandas groupby(df.index) with indexes of varying size

I have an array of dataframes dfs = [df0, df1, ...]. Each one of them has a date column of varying size (some dates might be in one dataframe but not the other).
What I'm trying to do is this:
pd.concat(dfs).groupby("date", as_index=False).sum()
But with date no longer being a column but an index (dfs = [df.set_index("date") for df in dfs]).
I've seen you can pass df.index to groupby (.groupby(df.index)) but df.index might not include all the dates.
How can I do this?
The goal here is to call .sum() on the groupby, so I'm not tied to using groupby or concat if there's an alternative method to do so.
If I understand correctly, you want something like this:
df = pd.concat(dfs)
df.groupby(df.index).sum()
Here's a small example:
tmp1 = pd.DataFrame({'date':['2019-09-01','2019-09-02','2019-09-03'],'value':[1,1,1]}).set_index('date')
tmp2 = pd.DataFrame({'date':['2019-09-01','2019-09-02','2019-09-04','2019-09-05'],'value':[2,2,2,2]}).set_index('date')
df = pd.concat([tmp1,tmp2])
df.groupby(df.index).sum()
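An equivalent spelling, shown here as a sketch, groups by the index level name instead of passing the index object. Because the grouping happens after the concat, every date from every frame is included.

```python
import pandas as pd

tmp1 = pd.DataFrame(
    {"date": ["2019-09-01", "2019-09-02", "2019-09-03"], "value": [1, 1, 1]}
).set_index("date")
tmp2 = pd.DataFrame(
    {
        "date": ["2019-09-01", "2019-09-02", "2019-09-04", "2019-09-05"],
        "value": [2, 2, 2, 2],
    }
).set_index("date")

dfs = [tmp1, tmp2]

# Group on the index level named "date" and sum the overlapping rows.
result = pd.concat(dfs).groupby(level="date").sum()
print(result)
```

Dates that appear in only one frame keep their single value; dates present in both are added together.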

How to coerce pandas dataframe column to be normal index

I create a DataFrame from a dictionary. I want the keys to be used as index and the values as a single column. This is what I managed to do so far:
import pandas as pd
my_counts = {"A": 43, "B": 42}
df = pd.DataFrame(pd.Series(my_counts, name=("count",)).rename_axis("letter"))
I get the following:
count
letter
A 43
B 42
The problem is I want to concatenate (with pd.concat) this with other dataframes, that have the same index name (letter), and seemingly the same single column (count), but I end up with an
AssertionError: invalid dtype determination in get_concat_dtype.
I discovered that the other dataframes have a different type for their columns: Index(['count'], dtype='object'). The above dataframe has MultiIndex(levels=[['count']], labels=[[0]]).
How can I ensure my dataframe has a normal index?
You can prevent the MultiIndex columns by removing the ',' so that the name is a plain string rather than a one-element tuple:
df = pd.DataFrame(pd.Series(my_counts, name=("count")).rename_axis("letter"))
df.columns
Output:
Index(['count'], dtype='object')
OR you can flatten your multiindex columns like this:
df = pd.DataFrame(pd.Series(my_counts, name=("count",)).rename_axis("letter"))
df.columns = df.columns.map(''.join)
df.columns
Output:
Index(['count'], dtype='object')
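A compact variant, given here as a sketch, passes a plain string as the Series name and uses `Series.to_frame`. The second frame, `other`, is hypothetical and only stands in for the "other dataframes" mentioned in the question.

```python
import pandas as pd

my_counts = {"A": 43, "B": 42}

# A string name (not a tuple) yields an ordinary, flat column index.
df = pd.Series(my_counts, name="count").rename_axis("letter").to_frame()

# Hypothetical second frame with the same index name and column.
other = pd.Series({"C": 7, "D": 9}, name="count").rename_axis("letter").to_frame()

combined = pd.concat([df, other])
print(combined)
```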

Pandas Re-indexing command

Re: "Add missing dates to pandas dataframe", a previously asked question.
import pandas as pd
import numpy as np
idx = pd.date_range('09-01-2013', '09-30-2013')
df = pd.DataFrame(data = [2,10,5,1], index = ["09-02-2013","09-03-2013","09-06-2013","09-07-2013"], columns = ["Events"])
df.index = pd.DatetimeIndex(df.index); #question (1)
df = df.reindex(idx, fill_value=np.nan)
print(df)
In the above script what does the command noted as question one do? If you leave this
command out of the script, the df will be re-indexed but the data portion of the
original df will not be retained. As there is no reference to the df data in the
DatetimeIndex command, why is the data from the starting df lost?
Short answer: df.index = pd.DatetimeIndex(df.index); converts the string index of df to a DatetimeIndex.
You have to make the distinction between different types of indexes. In
df = pd.DataFrame(data = [2,10,5,1], index = ["09-02-2013","09-03-2013","09-06-2013","09-07-2013"], columns = ["Events"])
you have an index containing strings. When using
df.index = pd.DatetimeIndex(df.index);
you convert this standard index with strings to an index with datetimes (a DatetimeIndex). So the values of these two types of indexes are completely different.
Now, you reindex with
idx = pd.date_range('09-01-2013', '09-30-2013')
df = df.reindex(idx)
where idx is also an index with datetimes. When you reindex the original df with a string index, there are no matching index values, so no column values of the original df are retained. When you reindex the second df (after converting the index to a datetime index), there will be matching index values, so the column values at those index labels are retained.
See also http://pandas.pydata.org/pandas-docs/dev/generated/pandas.DataFrame.reindex.html
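The difference can be checked side by side. The sketch below reindexes the same data twice, once with the string index left as-is and once after converting it, and uses `pd.to_datetime` as an equivalent way of building the DatetimeIndex.

```python
import pandas as pd

idx = pd.date_range("09-01-2013", "09-30-2013")
df = pd.DataFrame(
    data=[2, 10, 5, 1],
    index=["09-02-2013", "09-03-2013", "09-06-2013", "09-07-2013"],
    columns=["Events"],
)

# String labels never equal Timestamp labels, so nothing lines up.
without_conversion = df.reindex(idx)

# After conversion the four original dates match entries in idx.
converted = df.copy()
converted.index = pd.to_datetime(converted.index)
with_conversion = converted.reindex(idx)

print(without_conversion["Events"].notna().sum())
print(with_conversion["Events"].notna().sum())
```

Both results have 30 rows; only the second one carries over the original values.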
