How to get values from a timestamp indexed dataframe - python

I'm constructing a DataFrame like so:
dates = [datetime.datetime.today() + datetime.timedelta(days=x) for x in range(0, 2)]
d = DataFrame([[1,2],[3,4]], index = dates, columns = ['a', 'b'])
I want to get a value like so:
d[d.index[0]]['a']
But I get the following error:
KeyError: Timestamp('2018-04-26 16:08:16.120031')
How come?

If you are trying to get the first element from column 'a', you access it like this:
d.loc[d.index[0], 'a']
The way you have it written now, d[d.index[0]] is trying to get a column with name d.index[0].

It depends what you want to do.
If you just want to first row, you could access it with iloc do the following:
d.iloc[0]['a']
If you want to filter the dataframe for example by the year, you could do:
d.loc[d.index.year == 2018, 'a']

d['a'][d.index[0]]
My confusion came from the fact that DataFrame is column first, and not row first as one would expect from general multi-dimension data structures. So in order to get the value, one must switch indices.
dataFrame[coumn][row]
Thanks #Michael for the hint.

First of all, you have to always know with the type of data you are dealing with, in your case, you create a DatetimeIndex:
DatetimeIndex(['2020-08-25 11:00:00.000307403',
'2020-08-25 11:00:00.000558638',
'2020-08-25 11:00:00.002280412',
'2020-08-25 11:00:00.002440933'])
<class 'pandas.core.indexes.datetimes.DatetimeIndex'>
and inside the DatetimeIndex, each element of it is a Timestamp:
2020-08-25 11:00:00.000307403
<class 'pandas._libs.tslibs.timestamps.Timestamp'>
As you are working with DatetimeIndexes, you have to index the values by the actual timestamp ('2020-08-25 11:00:00.000307403') and not by the full timestamp variable (Timestamp('2020-08-25 11:00:00.000307403'))
So, instead of doing:
df[Timestamp('2020-08-25 11:00:00.000307403')]
you should do:
df['2020-08-25 11:00:00.000307403']
I lost like two hours to catch this error, since it is a bit stupid to include the type of data inside the variable, the easiest way to solve this is just to parse the DatetimeIndex to string.
For your solution:
d[str(d.index[0]),'a']
should work

Related

pandas - rename_axis doesn't work as expected afterwards - why?

i was reading through the pandas documentation (10 minutes to pandas) and came across this example:
dates = pd.date_range('1/1/2000', periods=8)
df = pd.DataFrame(np.random.randn(8, 4),
index=dates, columns=['A', 'B', 'C', 'D'])
s = df['A']
s[dates[5]]
# Out[5]: -0.6736897080883706
It's quite logic, but if I try it on my own and set the indexname afterwards (example follows), then i can't select data with s[dates[5]]. Does someone know why?
e.g.
df = pd.read_csv("xyz.csv").head(100)
s = df['price'] # series with unnamed int index + price
s = s.rename_axis('indexName')
s[indexName[5]] # NameError: name 'indexName' is not defined
Thanks in advance!
Edit: s.index.name returns indexName, despite not working with the call of s[indexName[5]]
You are confusing the name of the index, and the index values.
In your example, the first code chunk runs because dates is a variable, so when you call dates[5] it actually returns the 5th value from the dates object, which is a valid index value in the dataframe.
In your own attempt, you are referring to indexName inside your slice (ie. when you try to run s[indexName[5]]), but indexName is not a variable in your environment, so it will throw an error.
The correct way to subset parts of your series or dataframe, is to refer to the actual values of the index, not the name of the axis. For example, if you have a series as below:
s = pd.Series(range(5), index=list('abcde'))
Then the values in the index are a through e, therefore to subset that series, you could use:
s['b']
or:
s.loc['b']
Also note, if you prefer to access elements by location rather than index value, you can use the .iloc method. So to get the second element, you would use:
s.iloc[1] # locations 0 is the first element
Hope it helps to clarify. I would recommend you continue to work through some introductory pandas tutorials to build up a basic understanding.
First of all lets understand the example:
df[index] is used to select a row having that index.
This is the s dataframe:
The indexes are the dates.
The dates[5] is equal to '2000-01-06'which is the index of the 5th row of the s df. so, the result is the row having that index.
in your code:
indexName is not defined. so, indexName[5] is not representing an index of your df.

Does iloc use the indices or the place of the row

I have extracted few rows from a dataframe to a new dataframe. In this new dataframe old indices remain. However, when i want to specify range from this new dataframe i used it like new indices, starting from zero. Why did it work? Whenever I try to use the old indices it gives an error.
germany_cases = virus_df_2[virus_df_2['location'] == 'Germany']
germany_cases = germany_cases.iloc[:190]
This is the code. The rows that I extracted from the dataframe virus_df_2 have indices between 16100 and 16590. I wanted to take the first 190 rows. in the second line of code i used iloc[:190] and it worked. However, when i tried to use iloc[16100:16290] it gave an error. What could be the reason?
In pandas there are two attributes, loc and iloc.
The iloc is, as you have noticed, an indexing based on the order of the rows in memory, so there you can reference the nth line using iloc[n].
In order to reference rows using the pandas indexing, which can be manually altered and can not only be integers but also strings or other objects that are hashable (have the __hash__ method defined), you should use loc attribute.
In your case, iloc raises an error because you are trying to access a range that is outside the region defined by your dataframe. You can try loc instead and it will be ok.
At first it will be hard to grasp the indexing notation, but it can be very helpful in some circumstances, like for example sorting or performing grouping operations.
Quick example that might help:
df = pd.DataFrame(
dict(
France=[1, 2, 3],
Germany=[4, 5, 6],
UK=['x', 'y', 'z'],
))
df = df.loc[:,"Germany"].iloc[1:2]
Out:
1 5
Name: Germany, dtype: int64
Hope I could help.

Filters in Pandas - why doesn't this work?

This is a basic question so apologies in advance.
I am using Pandas and I am grouping data with the following line:
page_serp_df.groupby([page_serp_df.meta_keywords_1_length]).count()['keyword']
This is referencing the following:
The data frame: [page_serp_df]
Grouping by the column: meta_keywords_1_length
Counting with the filter: keyword column
What I don't understand is why does the filtering condition have to be ['keyword'] i.e. a string in quotes?
For example, this doesn't work and it is very counterintuituve to me:
page_serp_df.groupby([page_serp_df.meta_keywords_1_length]).count()[page_serp_df.keyword]
Thanks in advance!
I think there is a misunderstanding on what the .count() method returns.
Try to follow this example:
Create a sample data frame
df = pd.DataFrame({
'A':[0,1,0,1, 1],
'B':[100,200,300, 400, 500],
'C': [1,2,3,4,5]
})
This is what the count() method will return after groupby
# similarly to your example I am grouping by A and counting
df.groupby([df.A]).count()
As you can see, the count() method returns a dataframe itself, having the count of each other column values for the column where the grouped column has the same value.
After that, you can query for a specific column form the return of count() like this
df.groupby([df.A]).count()['C']
But the second case in your example, which in my example would correspond to df.groupby([df.A]).count()[df.C]
Will throw an error!
In fact, you would query a dataframe (in this case df.groupby([df.A]).count()) via a pandas Series but as you know you need a string or a column from df.columns.
You can check yourself that df.C and 'C' are two very different variable types.
print(type(df.C))
print(type('C'))
# <class 'pandas.core.series.Series'>
# <class 'str'>
If for some reason your code still works with the equivalent of df.C there might be some contingency like the only value of the df.C is a string with the same name of a column.. or something unintentional like that.

Get a KeyError in Pandas

I am trying to call a function from a different module as below:
module1 - func1: returns a dataframe
module1 - func2(p_df_in_fromfunc1)
function 2:
for i in range(0,len(p_df_in_fromfunc1):
# Trying to retrieve row values of individual columns and assign to variables
v_tmp = p_df_in_fromfunc1.loc[i,"Col1"]
When trying to run the above code, I get the error:
KeyError 0
Could the issue be because I don't have a zero numbered row?
Without knowing much of you're code, well my guess is, for positional indexing try using iloc instead of loc, if you're interesed in going index-wise.
Something like:
v_tmp = p_df_in_fromfunc1.iloc[i,"Col1"]
You may have a missed to close the quote in the loc function after Col1 ?
v_tmp = p_df_in_fromfunc1.loc[i,"Col1"]
For retrieving a row for specific columns do:
columns = ['Col1', 'Col2']
df[columns].iloc[index]
If you only want one column, you can simplify it to: df['Col1'].iloc(index)
As per your comment, you do not need to reset the index, you can iterate over the values of your index array: df.index

Why is 'series' type not a special case of 'dataframe' type in pandas? [duplicate]

In Pandas, when I select a label that only has one entry in the index I get back a Series, but when I select an entry that has more then one entry I get back a data frame.
Why is that? Is there a way to ensure I always get back a data frame?
In [1]: import pandas as pd
In [2]: df = pd.DataFrame(data=range(5), index=[1, 2, 3, 3, 3])
In [3]: type(df.loc[3])
Out[3]: pandas.core.frame.DataFrame
In [4]: type(df.loc[1])
Out[4]: pandas.core.series.Series
Granted that the behavior is inconsistent, but I think it's easy to imagine cases where this is convenient. Anyway, to get a DataFrame every time, just pass a list to loc. There are other ways, but in my opinion this is the cleanest.
In [2]: type(df.loc[[3]])
Out[2]: pandas.core.frame.DataFrame
In [3]: type(df.loc[[1]])
Out[3]: pandas.core.frame.DataFrame
The TLDR
When using loc
df.loc[:] = Dataframe
df.loc[int] = Dataframe if you have more than one column and Series if you have only 1 column in the dataframe
df.loc[:, ["col_name"]] = Dataframe if you have more than one row and Series if you have only 1 row in the selection
df.loc[:, "col_name"] = Series
Not using loc
df["col_name"] = Series
df[["col_name"]] = Dataframe
You have an index with three index items 3. For this reason df.loc[3] will return a dataframe.
The reason is that you don't specify the column. So df.loc[3] selects three items of all columns (which is column 0), while df.loc[3,0] will return a Series. E.g. df.loc[1:2] also returns a dataframe, because you slice the rows.
Selecting a single row (as df.loc[1]) returns a Series with the column names as the index.
If you want to be sure to always have a DataFrame, you can slice like df.loc[1:1]. Another option is boolean indexing (df.loc[df.index==1]) or the take method (df.take([0]), but this used location not labels!).
Use df['columnName'] to get a Series and df[['columnName']] to get a Dataframe.
You wrote in a comment to joris' answer:
"I don't understand the design
decision for single rows to get converted into a series - why not a
data frame with one row?"
A single row isn't converted in a Series.
It IS a Series: No, I don't think so, in fact; see the edit
The best way to think about the pandas data structures is as flexible
containers for lower dimensional data. For example, DataFrame is a
container for Series, and Panel is a container for DataFrame objects.
We would like to be able to insert and remove objects from these
containers in a dictionary-like fashion.
http://pandas.pydata.org/pandas-docs/stable/overview.html#why-more-than-1-data-structure
The data model of Pandas objects has been choosen like that. The reason certainly lies in the fact that it ensures some advantages I don't know (I don't fully understand the last sentence of the citation, maybe it's the reason)
.
Edit : I don't agree with me
A DataFrame can't be composed of elements that would be Series, because the following code gives the same type "Series" as well for a row as for a column:
import pandas as pd
df = pd.DataFrame(data=[11,12,13], index=[2, 3, 3])
print '-------- df -------------'
print df
print '\n------- df.loc[2] --------'
print df.loc[2]
print 'type(df.loc[1]) : ',type(df.loc[2])
print '\n--------- df[0] ----------'
print df[0]
print 'type(df[0]) : ',type(df[0])
result
-------- df -------------
0
2 11
3 12
3 13
------- df.loc[2] --------
0 11
Name: 2, dtype: int64
type(df.loc[1]) : <class 'pandas.core.series.Series'>
--------- df[0] ----------
2 11
3 12
3 13
Name: 0, dtype: int64
type(df[0]) : <class 'pandas.core.series.Series'>
So, there is no sense to pretend that a DataFrame is composed of Series because what would these said Series be supposed to be : columns or rows ? Stupid question and vision.
.
Then what is a DataFrame ?
In the previous version of this answer, I asked this question, trying to find the answer to the Why is that? part of the question of the OP and the similar interrogation single rows to get converted into a series - why not a data frame with one row? in one of his comment,
while the Is there a way to ensure I always get back a data frame? part has been answered by Dan Allan.
Then, as the Pandas' docs cited above says that the pandas' data structures are best seen as containers of lower dimensional data, it seemed to me that the understanding of the why would be found in the characteristcs of the nature of DataFrame structures.
However, I realized that this cited advice must not be taken as a precise description of the nature of Pandas' data structures.
This advice doesn't mean that a DataFrame is a container of Series.
It expresses that the mental representation of a DataFrame as a container of Series (either rows or columns according the option considered at one moment of a reasoning) is a good way to consider DataFrames, even if it isn't strictly the case in reality. "Good" meaning that this vision enables to use DataFrames with efficiency. That's all.
.
Then what is a DataFrame object ?
The DataFrame class produces instances that have a particular structure originated in the NDFrame base class, itself derived from the PandasContainer base class that is also a parent class of the Series class.
Note that this is correct for Pandas until version 0.12. In the upcoming version 0.13, Series will derive also from NDFrame class only.
# with pandas 0.12
from pandas import Series
print 'Series :\n',Series
print 'Series.__bases__ :\n',Series.__bases__
from pandas import DataFrame
print '\nDataFrame :\n',DataFrame
print 'DataFrame.__bases__ :\n',DataFrame.__bases__
print '\n-------------------'
from pandas.core.generic import NDFrame
print '\nNDFrame.__bases__ :\n',NDFrame.__bases__
from pandas.core.generic import PandasContainer
print '\nPandasContainer.__bases__ :\n',PandasContainer.__bases__
from pandas.core.base import PandasObject
print '\nPandasObject.__bases__ :\n',PandasObject.__bases__
from pandas.core.base import StringMixin
print '\nStringMixin.__bases__ :\n',StringMixin.__bases__
result
Series :
<class 'pandas.core.series.Series'>
Series.__bases__ :
(<class 'pandas.core.generic.PandasContainer'>, <type 'numpy.ndarray'>)
DataFrame :
<class 'pandas.core.frame.DataFrame'>
DataFrame.__bases__ :
(<class 'pandas.core.generic.NDFrame'>,)
-------------------
NDFrame.__bases__ :
(<class 'pandas.core.generic.PandasContainer'>,)
PandasContainer.__bases__ :
(<class 'pandas.core.base.PandasObject'>,)
PandasObject.__bases__ :
(<class 'pandas.core.base.StringMixin'>,)
StringMixin.__bases__ :
(<type 'object'>,)
So my understanding is now that a DataFrame instance has certain methods that have been crafted in order to control the way data are extracted from rows and columns.
The ways these extracting methods work are described in this page:
http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing
We find in it the method given by Dan Allan and other methods.
Why these extracting methods have been crafted as they were ?
That's certainly because they have been appraised as the ones giving the better possibilities and ease in data analysis.
It's precisely what is expressed in this sentence:
The best way to think about the pandas data structures is as flexible
containers for lower dimensional data.
The why of the extraction of data from a DataFRame instance doesn't lies in its structure, it lies in the why of this structure. I guess that the structure and functionning of the Pandas' data structure have been chiseled in order to be as much intellectually intuitive as possible, and that to understand the details, one must read the blog of Wes McKinney.
If the objective is to get a subset of the data set using the index, it is best to avoid using loc or iloc. Instead you should use syntax similar to this :
df = pd.DataFrame(data=range(5), index=[1, 2, 3, 3, 3])
result = df[df.index == 3]
isinstance(result, pd.DataFrame) # True
result = df[df.index == 1]
isinstance(result, pd.DataFrame) # True
every time we put [['column name']] it returns Pandas DataFrame object,
if we put ['column name'] we got Pandas Series object
If you also select on the index of the dataframe then the result can be either a DataFrame or a Series or it can be a Series or a scalar (single value).
This function ensures that you always get a list from your selection (if the df, index and column are valid):
def get_list_from_df_column(df, index, column):
df_or_series = df.loc[index,[column]]
# df.loc[index,column] is also possible and returns a series or a scalar
if isinstance(df_or_series, pd.Series):
resulting_list = df_or_series.tolist() #get list from series
else:
resulting_list = df_or_series[column].tolist()
# use the column key to get a series from the dataframe
return(resulting_list)

Categories

Resources