How to get value of index column in pandas? - python

I have a pandas data frame like the following.
colName
date
2020-06-02 03:00:00 39
I can get the value of each entry of colName using the following. How do I get the date value?
for index, row in max_items.iterrows():
    print(str(row['colName']))
    # How to get date??

Anti-pattern Warning
First, I want to highlight that this is an anti-pattern: iterating over a DataFrame row by row is highly counterproductive.
There are extremely rare cases where you genuinely need to iterate through a pandas DataFrame. Vectorized operations and methods such as map, apply and applymap achieve the same results far more efficiently, as sketched below.
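For instance, a minimal sketch (the frame mirrors the question's data; the doubling operation is just an illustration) contrasting the two approaches:
import pandas as pd

df = pd.DataFrame({'colName': [39, 41, 37]},
                  index=pd.to_datetime(['2020-06-02 03:00:00',
                                        '2020-06-02 04:00:00',
                                        '2020-06-02 05:00:00']))

# Iteration (the anti-pattern): one Python-level loop step per row
doubled_loop = [row['colName'] * 2 for _, row in df.iterrows()]

# Vectorized equivalent: a single pandas operation over the whole column
doubled_vec = df['colName'] * 2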
Coming to the issue at hand:
you need to convert your index to datetime, if it is not already of that type.
Simple example:
import pandas as pd

# Creating the dataframe
df1 = pd.DataFrame({'date': pd.date_range(start='1/1/2018', end='1/03/2018'),
                    'test_value_a': [5, 6, 9],
                    'test_value_b': [2, 5, 1]})
# Converting the date column into an index of type datetime
df1.index = pd.to_datetime(df1.date)
# Dropping the date column we had created (drop returns a copy, so reassign)
df1 = df1.drop(labels='date', axis='columns')
To print the date, month, month name, day or day name:
df1.index.date
df1.index.month
df1.index.month_name()
df1.index.day
df1.index.day_name()
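Tying this back to the original loop: the first item yielded by iterrows() is the index label itself, so with a DatetimeIndex the date is already at hand (a minimal sketch using the df1 built above):
for idx, row in df1.iterrows():
    # idx is the DatetimeIndex label (a Timestamp) for this row
    print(idx.date(), row['test_value_a'])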
I would suggest reading about loc and iloc in the pandas documentation; that should help (ix has been deprecated).
I hope I didn't veer off from the crux of the question.

Related

Python pandas.datetimeindex piecewise dataframe slicing

I have a dataframe with a pandas DatetimeIndex. I need to take many slices from it (for plotting a piecewise graph with matplotlib). In other words, I need a new DF which would be a subset of the first one.
More precisely I need to take all rows that are between 9 and 16 o'clock but only if they are within a date range. Fortunately I have only one date range and one time range to apply.
What would be a clean way to do that? Thanks.
The first step is to set the index of the dataframe to the column where you store time. Once the index is based on time, you can subset the dataframe easily.
df['time'] = pd.to_datetime(df['time'], format='%H:%M:%S.%f') # assuming you have a col called 'time'
df['time'] = df['time'].dt.strftime('%H:%M:%S.%f')
df.set_index('time', inplace=True)
new_df = df[startTime:endTime] # startTime and endTime are strings
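If the index is kept as an actual DatetimeIndex instead of time strings, another option is to combine a date-range slice with DataFrame.between_time (a sketch; the date range and variable names are illustrative, and between_time requires a DatetimeIndex):
# Restrict to the date range first, then keep only rows between 09:00 and 16:00
in_range = df.loc['2021-03-01':'2021-03-05']
piece = in_range.between_time('09:00', '16:00')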

Convert to datetime using column position/number in python pandas

Very simple query, but I did not find the answer on Google.
df with a timestamp in the Date column:
Date
22/11/2019 22:30:10
etc., which shows as dtype object when running df.dtypes.
Code:
df['Date']=pd.to_datetime(df['Date']).dt.date
Now I want the date to be converted to datetime using the column number rather than the column name. The column number in this case is 0 (I have very long column names and multiple similar files, so I want to convert the date column to datetime using its position, 0 in this case).
Can anyone help?
Use DataFrame.iloc to select the column (a Series) by position:
df.iloc[:, 0] = pd.to_datetime(df.iloc[:, 0]).dt.date
Or it is also possible to extract the column name by indexing:
df[df.columns[0]] = pd.to_datetime(df[df.columns[0]]).dt.date
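A minimal usage sketch (the sample data is made up; dayfirst=True is an assumption, based on the DD/MM/YYYY strings in the question, about how the timestamps should be parsed):
import pandas as pd

df = pd.DataFrame({'Date': ['22/11/2019 22:30:10', '23/11/2019 08:15:00'],
                   'value': [1, 2]})

# Convert the first column by position; dayfirst=True because the sample looks like DD/MM/YYYY
df.iloc[:, 0] = pd.to_datetime(df.iloc[:, 0], dayfirst=True).dt.date
print(df.head())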

How to manipulate your Data Set based on the values of your index?

I have this dataset, wind_modified. In this dataset the columns are locations, the index is the date, and the values in the columns are wind speeds.
Let's say I want to find the average wind speed in January for each location, how do I use groupby or any other method to find the average?
Would it be possible without resetting the index?
Edit - [This][2] is the actual dataset. I have combined the three columns "Yr", "Mo", "Dy" into one, i.e. "DATE", and made it the index.
I imported the dataset by using pd.read_fwf.
And "DATE" is of type datetime64[ns].
Sure, if you want all Januaries across all years, first filter them with boolean indexing and then take the mean:
#if necessary convert index to DatetimeIndex
#df.index = pd.to_datetime(df.index)
df1 = df[df.index.month == 1].mean().to_frame().T
Or, if you need each January of each year separately, after filtering use groupby with DatetimeIndex.year and aggregate with mean:
df2 = df[df.index.month == 1]
df3 = df2.groupby(df2.index.year).mean()
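A small end-to-end sketch with made-up wind speeds (the column names and values are illustrative, not the actual dataset):
import numpy as np
import pandas as pd

idx = pd.date_range('2018-01-01', '2019-12-31', freq='D')
df = pd.DataFrame({'RPT': np.random.rand(len(idx)) * 20,
                   'VAL': np.random.rand(len(idx)) * 20}, index=idx)

# Mean over all Januaries combined (a single row)
df1 = df[df.index.month == 1].mean().to_frame().T

# Mean per January of each year (one row per year)
df2 = df[df.index.month == 1]
df3 = df2.groupby(df2.index.year).mean()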

How to slice Pandas data frame by column header value when the column header is a date-time value?

I have an Excel file where the column names consist of date-time values.
The header values are in date-time format. I have loaded this into a pandas dataframe, and the header values are indeed stored as date-time values.
Now, if I need to query from pandas something like "pick all columns whose header is later than May-15", how can I do that?
I am aware that I can achieve this with df[df.columns[3:]], but I really want to slice based on the value of the column header and not based on the position of the column.
Please help.
Edit:
Based on the answers below, I figured out a way to query the column values. Adding it here for future reference.
from datetime import datetime
df[[col for col in df.columns if col not in ("Name", "Location")
    and col >= datetime(2015, 4, 1)
    and col <= datetime(2016, 3, 1)]]
or
from datetime import datetime
df.loc[:, [col for col in df.columns if col not in ("Name", "Location")
           and col >= datetime(2015, 4, 1)
           and col <= datetime(2016, 3, 1)]]
The first solution is the most elegant.
Conceptually, column slicing in pandas works when the intended columns are provided as a list. A list comprehension is used to select the columns based on the column label values (not the values within the columns). In the examples, I have filtered out the "Name" and "Location" columns, since I am comparing the remaining column labels against datetime values.
Querying works best to filter observations (rows) based on one or more variables (columns). The way your data is organized doesn't allow for a natural query (you're trying to filter columns as opposed to using them as criteria in the filter). You can read more about tidying dataframes here.
Of course you can come up with a contorted way to do what you want, but I'd strongly suggest you tidy your data like this:
name | location | date | value
--------------------------------
John | London | Apr-15 | 1000
John | London | May-15 | 800
...
Then you can easily query based on the Date, and make sure that column is of a date type so you can use e.g.
df.query('20150501 < date')
Then, when you're done and if you have to, you can always bring the dataframe back to its original format (if you can, it is better to avoid that and focus on organizing your data; it pays off in the long run).
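A minimal sketch of that reshaping (the identifier columns 'Name' and 'Location' are taken from the question above; the values are made up):
import pandas as pd
from datetime import datetime

wide = pd.DataFrame({'Name': ['John'], 'Location': ['London'],
                     datetime(2015, 4, 1): [1000], datetime(2015, 5, 1): [800]})

# Wide -> long ("tidy") format: one row per (Name, Location, date)
tidy = wide.melt(id_vars=['Name', 'Location'], var_name='date', value_name='value')

# Filtering on the date is now an ordinary row filter
later = tidy[tidy['date'] > datetime(2015, 4, 1)]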
One easy-fix method would be to replace the Month string with its equivalent number.
# Map month abbreviations to their numbers
dct = {'Jan': 1, 'Feb': 2, 'Mar': 3, 'Apr': 4, 'May': 5, 'Jun': 6,
       'Jul': 7, 'Aug': 8, 'Sep': 9, 'Oct': 10, 'Nov': 11, 'Dec': 12}
new = []
for item in df.columns:
    a = item.split('-')
    try:
        # e.g. 'Apr-15' -> '15.04' (year.month)
        b = '%02d.%02d' % (int(a[1]), dct[a[0]])
    except (KeyError, IndexError, ValueError):  # not a date-like header, e.g. 'Name'
        b = str(a[0])
    new.append(b)
df.columns = new
This should make your dates take the form 15.04, 15.05 ... 16.11, etc.
Alternatively, you can also convert your headers into date-times and query them that way:
from datetime import datetime

new = []
for item in df.columns:
    try:
        new.append(datetime.strptime(item, '%b-%y'))
    except (ValueError, TypeError):  # header is not a date, e.g. 'Name'
        new.append(item)
df.columns = new
df.loc[:, df.columns <= datetime(2015, 5, 1)]

Python Pandas Index Sorting/Grouping/DateTime

I am trying to combine 2 separate data series using one-minute data to create a ratio, then creating Open High Low Close (OHLC) files for the ratio for the entire day. I am bringing in two time series, then creating associated dataframes using pandas. The time series have missing data, so I am creating a datetime variable in each file, then merging the files using the pd.merge approach on the datetime variable. Up to this point everything is going fine.
Next I group the data by the date using groupby. I then feed the grouped data to a for loop that calculates the OHLC and feeds that into a new dataframe for each respective day. However, the newly populated dataframe uses the date (from the grouping) as the dataframe index and the sorting is off. The index data looks like this (even when sorted):
01/29/2013
01/29/2014
01/29/2015
12/2/2013
12/2/2014
In short, the sorting is being done only on the month, not on the whole date as a date, so it isn't chronological. My goal is to get it sorted by date so that it is chronological. Perhaps I need to create a new column in the dataframe referencing the index (not sure how), or maybe there is a way to tell pandas the index is a date and not just a value? I tried various sort approaches, including sort_index, but since the dates are the index and don't seem to be treated as dates, the sort functions sort by the month regardless of the year, and thus my output file is out of order. In more general terms, I am not sure how to reference/manipulate the unique identifier index in a pandas dataframe, so any associated material would be useful.
Thank you
Years later...
This fixes the problem.
Assuming df is your dataframe:
import pandas as pd

df.index = pd.to_datetime(df.index)  # convert the index to a DatetimeIndex
df = df.sort_index()                 # sort on the converted index
This should get the sorting back into chronological order.
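For example, with the index values from the question (a small sketch; the column data is made up):
import pandas as pd

df = pd.DataFrame({'close': [1, 2, 3, 4, 5]},
                  index=['01/29/2013', '01/29/2014', '01/29/2015',
                         '12/2/2013', '12/2/2014'])

df.index = pd.to_datetime(df.index)  # parse the string labels as real dates
df = df.sort_index()                 # 12/2/2013 now sorts before 01/29/2014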
