Indexing a Pandas Dataframe using the index of a Series


I have a TimeSeries and I want to extract its first three elements and use them to create a one-row Pandas DataFrame with three columns. I can do this easily using a dictionary, for example. The problem is that I would like the index of this one-row DataFrame to be the Datetime index of the first element of the Series. This is where I fail.
For a reproducible example:
CRM
Date
2018-08-30 0.000442
2018-08-29 0.005923
2018-08-28 0.004782
2018-08-27 0.003243
pd.DataFrame({'Reg_Coef_5_1' : ts1.iloc[0][0], 'Reg_Coef_5_2' : ts1.shift(-5).iloc[0][0], \
'Reg_Coef_5_3' : ts1.shift(-10).iloc[0][0]}, index = ts1.iloc[0].index )
I get:
Reg_Coef_5_1 Reg_Coef_5_2 Reg_Coef_5_3
CRM 0.000442 0.001041 -0.00035
Instead, I would like the index to be '2018-08-30', a datetime object.

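If I understand you correctly, you would like the index to be a date object instead of "CRM" as it is in your example. ts1.iloc[0] is the first row returned as a Series, so its index holds the column labels ("CRM"), not the date. Just set the index accordingly: index = [ts1.index[0]] instead of index = ts1.iloc[0].index.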
df = pd.DataFrame({'Reg_Coef_5_1' : ts1.iloc[0][0], 'Reg_Coef_5_2' : ts1.shift(-5).iloc[0][0], \
'Reg_Coef_5_3' : ts1.shift(-10).iloc[0][0]}, index = [ts1.index[0]] )
But as user10300706 has said, there might be a better way to do what you want, ultimately.
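The difference between the two index expressions can be checked on a small example. This is a minimal sketch, assuming ts1 is a one-column DataFrame indexed by date (the four rows are taken from the question; the shifts of -1 and -2 are stand-ins for -5 and -10, because the sample only has four rows):

```python
import pandas as pd

# Hypothetical reconstruction of ts1 from the sample rows in the question.
dates = pd.to_datetime(["2018-08-30", "2018-08-29", "2018-08-28", "2018-08-27"])
ts1 = pd.DataFrame({"CRM": [0.000442, 0.005923, 0.004782, 0.003243]}, index=dates)
ts1.index.name = "Date"

# ts1.iloc[0] is the first row as a Series; its index is the column labels.
wrong_index = ts1.iloc[0].index      # Index(['CRM'])
# ts1.index[0] is the label of the first row, i.e. the Timestamp.
right_index = [ts1.index[0]]

row = pd.DataFrame(
    {
        "Reg_Coef_5_1": ts1["CRM"].iloc[0],
        "Reg_Coef_5_2": ts1["CRM"].shift(-1).iloc[0],  # -1 instead of -5 (short sample)
        "Reg_Coef_5_3": ts1["CRM"].shift(-2).iloc[0],  # -2 instead of -10 (short sample)
    },
    index=right_index,
)
print(row)
```

Wrapping ts1.index[0] in a list matters: the dictionary values are scalars, so pandas needs a list-like index to know the frame has exactly one row.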

If you're simply trying to recover the index position then do:
index = ts1.index[0]
I would note that if you are shifting your dataframe up incrementally (by 5 and 10 respectively), the indexes won't align. I assume, however, that you're trying to build out some lagging indicator.

Related

How can I get the default index from the pandas dataframe [duplicate]

Ok, so this is confusing because of a lack of vocabulary.
Pandas series have an index and a value: so 'series[0]' contains (index,value).
How do I get the index (in my case it is a date), out of the series by indexing the series? This is really a very simple idea...it is just encrypted by the word "index." lol.
So, to rephrase,
I need the date of the first entry in my series and the last entry, when my series is indexed by date.
just to be clear, I have a series indexed by date, so when I print it out, it prints:
12-12-2008 1.2
12-13-2008 1.3
...
and calling
df.ix[0] -> 1.2
I need:
df.something[0] -> 12-12-2008

Got it.
df.index[0]
yields the label at position 0.

You can access the elements of your index just as you would a list. So df.index[0] will be the first element of your index and df.index[-1] will be the last.
Incidentally, if a series (or dataframe) has a non-integer index, df.ix[n] returns the n-th row, corresponding to the n-th element of your index. Note that .ix was deprecated and then removed in pandas 1.0; df.iloc[n] is the positional equivalent.
So df.iloc[0] will return the first row and df.iloc[-1] will return the last row. For a dataframe, an alternative way of getting the index values is therefore df.iloc[0].name and df.iloc[-1].name (for a series, .iloc[0] is a scalar with no .name, so use the index directly).
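As a quick illustration of getting labels versus values by position, here is a small sketch on a made-up date-indexed series (the dates and values are only loosely based on the question):

```python
import pandas as pd

# Hypothetical series indexed by date.
s = pd.Series(
    [1.2, 1.3, 1.4],
    index=pd.to_datetime(["2008-12-12", "2008-12-13", "2008-12-14"]),
)

first_label = s.index[0]    # date of the first entry
last_label = s.index[-1]    # date of the last entry
first_value = s.iloc[0]     # value of the first entry
last_value = s.iloc[-1]     # value of the last entry

# For a DataFrame, a positional row lookup carries its label in .name.
frame = s.to_frame("value")
first_row_label = frame.iloc[0].name
```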

min of all columns of the dataframe in a range

I want to find the min value of every row of a dataframe, restricted to only a few columns.
For example: consider a dataframe of size 10*100. I want the min of middle 5 rows and this becomes of size 10*5.
I know how to find the min using df.min(axis=0), but I don't know how to restrict the number of columns. Thanks for the help.
I use the pandas library.

You can start by selecting the slice of columns you are interested in and applying DataFrame.min() to only that selection:
df.iloc[:, start:end].min(axis=0)
If you want these to be the middle 5, simply find the integer indices which correspond to the start and end of that range:
start = int(n_columns/2 - 2.5)
end = start + 5
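Putting the slice and the reduction together, the following sketch uses a random 10*100 frame as a stand-in for the real data. It also shows both reduction directions, since axis=0 gives one minimum per selected column while axis=1 gives one minimum per row across the selected columns:

```python
import numpy as np
import pandas as pd

# Hypothetical 10 x 100 frame of random integers.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.integers(0, 100, size=(10, 100)))

n_columns = df.shape[1]
start = int(n_columns / 2 - 2.5)   # 47 for 100 columns
end = start + 5                    # 52 (exclusive)

middle = df.iloc[:, start:end]     # 10 rows x 5 columns

col_min = middle.min(axis=0)       # one value per selected column (length 5)
row_min = middle.min(axis=1)       # one value per row (length 10)
```

Which axis is wanted depends on whether the goal is the smallest value in each column or in each row of the restricted block.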

Following pciunkiewicz's logic:
First, select the columns you want. You can use either .loc[...] or .iloc[...].
With .loc you refer to columns by name. When it takes two arguments, the first selects the rows (by index label) and the second selects the columns.
df.loc[[rows], [columns]] # The labels to keep go inside the brackets.
df.loc[:, [columns]] # This will consider all rows.
You can also use .iloc. In this case, you have to use integers to locate the data, so you don't need to know the names of the columns, only their positions.
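A small sketch of the two selectors side by side, on a made-up three-column frame (the column names a, b, c are placeholders):

```python
import pandas as pd

# Hypothetical frame; column names are placeholders.
df = pd.DataFrame({"a": [3, 1, 2], "b": [6, 5, 4], "c": [9, 8, 7]})

by_name = df.loc[:, ["b", "c"]]    # label-based: columns named "b" and "c"
by_pos = df.iloc[:, 1:3]           # position-based: columns 1 and 2 (end exclusive)

same = by_name.equals(by_pos)      # both select the same block here

# Minimum of each row, restricted to the selected columns.
row_min = by_name.min(axis=1)
```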

df.set_index returns key error python pandas dataframe

I have this Pandas DataFrame and I have to convert some of the items into coordinates (meaning they have to be floats), but the conversion includes the indexes when it tries to turn them into floats. So I tried to set the index to the first thing in the DataFrame, but it doesn't work. I wonder if it has anything to do with the fact that this is only a part of the whole DataFrame: the section that holds "Latitude" and "Longitude".
df = df_volc.iloc(axis = 0)[0:, 3:5]
df.set_index("hello", inplace = True, drop = True)
df
and I get a really long error, but this is the last part of it:
KeyError: '34.50'
If I don't do the set_index part, I get:
Latitude Longitude
0 34.50 131.60
1 -23.30 -67.62
2 14.50 -90.88
I just wanna know if it's possible to get rid of the indexes or set them.

The parameter you need to pass to set_index() is keys: a column label, or a list of column labels or arrays. In your case, it seems that "hello" is not a column name.
You asked: "I just wanna know if its possible to get rid of the indexes or set them."
It is possible to replace the 0, 1, 2 index with something else, though it doesn't sound like that is necessary for your end goal, which is "to convert some of the items into [...] floats".
To achieve this, you could overwrite the existing values by using astype():
df['Latitude'] = df['Latitude'].astype('float')
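To make both points concrete, here is a minimal sketch on a made-up frame. The "Name" column and its values are hypothetical, and the coordinates are stored as strings on the assumption that this is why a conversion is needed at all:

```python
import pandas as pd

# Hypothetical data; "Name" and its values are invented for illustration.
df = pd.DataFrame(
    {
        "Name": ["A", "B", "C"],
        "Latitude": ["34.50", "-23.30", "14.50"],
        "Longitude": ["131.60", "-67.62", "-90.88"],
    }
)

# Convert both coordinate columns at once; the 0, 1, 2 index is not touched.
coords = df[["Latitude", "Longitude"]].astype(float)

# set_index() only accepts existing column labels (or arrays of matching length).
try:
    df.set_index("hello")
except KeyError as exc:
    message = str(exc)       # the message names the missing label

# With a real column name, the call succeeds.
indexed = df.set_index("Name")
```

The index never takes part in astype(), so there is no need to remove it before converting.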

Python Pandas: fill a column using values from rows at an earlier timestamps

I have a dataframe df where one column is timestamp and one is A. Column A contains decimals.
I would like to add a new column B and fill it with the current value of A divided by the value of A one minute earlier. That is:
df['B'] = df['A']_current / df['A'] _(current - 1 min)
NOTE: The data does not come in exactly every 1 minute so "the row one minute earlier" means the row whose timestamp is the closest to (current - 1 minute).
Here is how I do it:
First, I use the timestamp as index in order to use get_loc and create a new dataframe new_df starting from 1 minute after df. In this way I'm sure I have all the data when I go look 1 minute earlier within the first minute of data.
new_df = df.loc[df['timestamp'] > df.timestamp[0] + delta] # delta = 1 min timedelta
values = []
for index, row in new_df.iterrows():
    v = row.A / df.iloc[df.index.get_loc(row.timestamp-delta,method='nearest')]['A']
    values.append(v)
v_ser = pd.Series(values)
new_df['B'] = v_ser.values
I'm afraid this is not that great. It takes a long time for large dataframes. Also, I am not 100% sure the above is completely correct. Sometimes I get this message:
A value is trying to be set on a copy of a slice from a DataFrame. Try
using .loc[row_indexer,col_indexer] = value instead
What is the best / most efficient way to do the task above? Thank you.
PS. If someone can think of a better title please let me know. It took me longer to write the title than the post and I still don't like it.

You could try to use .asof() if the DataFrame has been indexed correctly by the timestamps (if not, use .set_index() first).
Simple example here
import pandas as pd
import numpy as np

n_vals = 50
# Create a DataFrame with random values and 'unusual times'.
# pd.date_range is used because the DatetimeIndex(start=..., freq=..., periods=...)
# constructor form was removed in pandas 1.0.
df = pd.DataFrame(data=np.random.randint(low=1, high=6, size=n_vals),
                  index=pd.date_range(start=pd.Timestamp.now(),
                                      freq='23s', periods=n_vals),
                  columns=['value'])
# Demonstrate how to use .asof() to get the value that was the 'state'
# one minute before each index entry. Note the .values call.
df['value_one_min_ago'] = df['value'].asof(df.index - pd.Timedelta('1min')).values
# Note that there will be some NaNs to deal with; consider .fillna()
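Note that .asof() returns the last value at or before the requested time, which is not quite the "closest timestamp" the question asks for. A different technique that does match on the nearest timestamp is pd.merge_asof with direction="nearest". The sketch below is only an illustration under assumptions: the five timestamps and values are invented, the helper column names ts_target, ts_prev and A_prev are hypothetical, and the frame is assumed to be sorted by timestamp.

```python
import pandas as pd

# Hypothetical irregularly spaced data, sorted by timestamp.
df = pd.DataFrame(
    {
        "timestamp": pd.to_datetime(
            [
                "2021-01-01 00:00:00",
                "2021-01-01 00:00:25",
                "2021-01-01 00:00:58",
                "2021-01-01 00:01:31",
                "2021-01-01 00:02:02",
            ]
        ),
        "A": [10.0, 20.0, 40.0, 50.0, 80.0],
    }
)
delta = pd.Timedelta("1min")

# Left side: each row plus the time we want to look up (one minute earlier).
targets = df.assign(ts_target=df["timestamp"] - delta)
# Right side: the same data under different column names.
lookup = df.rename(columns={"timestamp": "ts_prev", "A": "A_prev"})

# For every ts_target, pick the row of lookup whose ts_prev is closest.
merged = pd.merge_asof(
    targets, lookup, left_on="ts_target", right_on="ts_prev", direction="nearest"
)
merged["B"] = merged["A"] / merged["A_prev"]

# Rows in the first minute have no data one minute earlier; blank them out.
merged.loc[merged["ts_target"] < df["timestamp"].iloc[0], "B"] = float("nan")
```

This avoids the Python-level loop over iterrows(), and because B is assigned on a freshly built frame it does not trigger the "copy of a slice" warning.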

Computing the first non-missing value from each column in a DataFrame

I have a DataFrame which looks like this:
1125400 5430095 1095751
2013-05-22 105.24 NaN 6507.58
2013-05-23 104.63 NaN 6393.86
2013-05-26 104.62 NaN 6521.54
2013-05-27 104.62 NaN 6609.31
2013-05-28 104.54 87.79 6640.24
2013-05-29 103.91 86.88 6577.39
2013-05-30 103.43 87.66 6516.55
2013-06-02 103.56 87.55 6559.43
I would like to compute the first non-NaN value in each column.
As "Locate first and last non NaN values in a Pandas DataFrame" points out, first_valid_index can be used. Unfortunately, it returns the first row where at least one element is not NaN and does not work per-column.

You should use the apply function which applies a function on either each column (default) or each row efficiently:
>>> first_valid_indices = df.apply(lambda series: series.first_valid_index())
>>> first_valid_indices
1125400 2013-05-22 00:00:00
5430095 2013-05-28 00:00:00
1095751 2013-05-22 00:00:00
first_valid_indices will then be a Series containing the first valid index for each column.
You could also define the lambda function as a normal function outside:
def first_valid_index(series):
    return series.first_valid_index()
and then call apply like this:
df.apply(first_valid_index)

The built in function DataFrame.groupby().column.first() returns the first non null value in the column, while last() returns the last.
http://pandas.pydata.org/pandas-docs/stable/generated/pandas.core.groupby.GroupBy.first.html
If you don't wish to get the first value for each group, you can add a dummy column of 1s. Then get the first non null value using the groupby & first functions.
from pandas import DataFrame
df = DataFrame({'a':[None,1,None],'b':[None,2,None]})
df['dummy'] = 1
df.groupby('dummy').first()
df.groupby('dummy').last()

By compute I assume you mean access?
The simplest way to do this is probably with the pd.Series.first_valid_index() method inside a dict comprehension:
values = {col : DF.loc[DF[col].first_valid_index(), col] for col in DF.columns}
values
Just to be clear, each column in a pandas DataFrame is a Series. So the above is the same as doing:
values = {}
for column in DF.columns:
    First_Non_Null_Index = DF[column].first_valid_index()
    values[column] = DF.loc[First_Non_Null_Index, column]
So the operation in my one line solution is on a per column basis. I.e. it is not going to create the type of error you seem to be suggesting in the edit you made to the question. Let me know if it does not work as expected.
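For a quick check, the sketch below runs the per-column lookup on three rows taken from the question's table. It also shows a different technique, back-filling and taking the first row, which should give the same values here; that alternative is an addition for comparison, not part of the answers above.

```python
import numpy as np
import pandas as pd

# Three rows from the question's table (column labels kept as strings).
df = pd.DataFrame(
    {
        "1125400": [105.24, 104.63, 104.54],
        "5430095": [np.nan, np.nan, 87.79],
        "1095751": [6507.58, 6393.86, 6640.24],
    },
    index=pd.to_datetime(["2013-05-22", "2013-05-23", "2013-05-28"]),
)

# Per-column first valid index, then the value stored at that index.
first_idx = df.apply(lambda col: col.first_valid_index())
first_vals = pd.Series({col: df.loc[first_idx[col], col] for col in df.columns})

# Alternative: back-fill NaNs from below, then read the first row.
via_bfill = df.bfill().iloc[0]
```

One difference to keep in mind: for a column that is entirely NaN, first_valid_index() returns None, so the lookup approach needs a guard, while the back-fill approach simply leaves NaN.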
