How can I get the default index from the pandas dataframe [duplicate] - python

Ok, so this is confusing because of a lack of vocabulary.
A pandas Series has an index and values: each entry pairs an index label with a value, so 'series[0]' refers to an (index, value) pair.
How do I get the index label (in my case it is a date) out of the series by position? This is really a very simple idea... it is just obscured by the word "index." lol.
So, to rephrase,
I need the date of the first entry in my series and the last entry, when my series is indexed by date.
just to be clear, I have a series indexed by date, so when I print it out, it prints:
12-12-2008 1.2
12-13-2008 1.3
...
and calling
df.ix[0] -> 1.2
I need:
df.something[0] -> 12-12-2008

Got it.
df.index[0]
yields the label at index 0.

You can access the elements of your index just as you would a list. So df.index[0] will be the first element of your index and df.index[-1] will be the last.
Incidentally, if a series (or dataframe) has a non-integer index, df.ix[n] will return the n-th row, corresponding to the n-th element of your index (note that .ix has since been deprecated; .iloc[n] is the modern positional equivalent).
So df.ix[0] will return the first row and df.ix[-1] will return the last row. An alternative way of getting the index values is therefore df.ix[0].name and df.ix[-1].name.
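Since .ix has been removed from modern pandas, here is a minimal sketch of the same idea using .iloc, with a hypothetical date-indexed series mirroring the example above:

```python
import pandas as pd

# Hypothetical series indexed by date, mirroring the example above
s = pd.Series([1.2, 1.3],
              index=pd.to_datetime(["2008-12-12", "2008-12-13"]))

first_label = s.index[0]   # the first date label
last_label = s.index[-1]   # the last date label
first_value = s.iloc[0]    # positional lookup, the modern replacement for .ix
```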

Related

min of all columns of the dataframe in a range

I want to find the min value of every row of a dataframe, restricted to only a few columns.
For example: consider a dataframe of size 10*100. I want the min over the middle 5 columns, so the restricted dataframe becomes of size 10*5.
I know how to find the min using df.min(axis=0), but I don't know how to restrict the number of columns. Thanks for the help.
I use the pandas lib.
You can start by selecting the slice of columns you are interested in and applying DataFrame.min() to only that selection:
df.iloc[:, start:end].min(axis=0)
If you want these to be the middle 5, simply find the integer indices which correspond to the start and end of that range:
start = int(n_columns/2 - 2.5)
end = start + 5
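Putting the two pieces together, a sketch assuming a hypothetical 10x100 frame. Note that the question asks for the min of every row, which is axis=1; axis=0 gives the per-column minima instead:

```python
import numpy as np
import pandas as pd

# Hypothetical 10x100 frame, as in the question
df = pd.DataFrame(np.random.rand(10, 100))

n_columns = df.shape[1]
start = int(n_columns / 2 - 2.5)   # first of the middle 5 columns
end = start + 5

middle = df.iloc[:, start:end]   # restricted frame of size 10x5
row_mins = middle.min(axis=1)    # per-row minimum: one value per row
col_mins = middle.min(axis=0)    # per-column minimum: one value per column
```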
Following pciunkiewicz's logic:
First you should select the columns that you desire. You can use either .loc[..] or .iloc[..].
With the first one you use the names of the columns. When it takes two arguments, the first one selects rows and the second selects columns.
df.loc[[rows], [columns]] # The filter data should be inside the brackets.
df.loc[:, [columns]] # This will consider all rows.
You can also use .iloc. In this case, you have to use integers to locate the data, so you don't have to know the names of the columns, only their positions.
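As a sketch of the difference between the two accessors, with hypothetical column names:

```python
import pandas as pd

# Hypothetical frame with three named columns
df = pd.DataFrame({"a": [5, 1, 9], "b": [2, 8, 4], "c": [7, 3, 6]})

# Label-based selection with .loc: all rows, columns chosen by name
by_name = df.loc[:, ["a", "b"]].min(axis=1)

# Position-based selection with .iloc: the same two columns by position
by_pos = df.iloc[:, 0:2].min(axis=1)
```

Both produce the same per-row minima; .loc is more readable, .iloc works when only positions are known.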

Finding the index of the maximum number in a python matrix which includes strings

I understand that
np.argmax(np.max(x, axis=1))
returns the index of the row that contains the maximum value and
np.argmax(np.max(x, axis=0))
returns the index of the column that contains the maximum value.
But what if the matrix contained strings? How can I change the code so that it still finds the index of the largest value?
Also (if there's no way to do what I previously asked for), can I change the code so that the operation is only carried out on a sub-section of the matrix, for instance on the bottom-right 2x2 sub-matrix in this example:
array = [['D', 'F', 'J'],
         ['K', 3, 4],
         ['B', 3, 1]]
i.e. on:
[[3, 4],
 [3, 1]]
Can you try first converting the columns to a numeric dtype? If you take the min/max of a string-typed column, the values are compared as strings rather than as numbers.
Although not efficient, this could be one way to find the index of the maximum number in the original matrix by using slices:
newmax = 0
newmaxrow = 0
newmaxcolumn = 0
# Skip row 0 and column 0, which hold the strings
for row in [array[i][1:] for i in range(1, len(array))]:
    for num in row:
        if num > newmax:
            newmax = num
            newmaxcolumn = row.index(newmax) + 1
            newmaxrow = [array[i][1:] for i in range(1, len(array))].index(row) + 1
Note: this method would not work if the largest number lies within row 0 or column 0.
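For the example above, a shorter sketch with NumPy: slice off the string row and column, then let argmax/unravel_index recover the position. This assumes, as in the example, that strings appear only in row 0 and column 0:

```python
import numpy as np

# The example matrix (with the missing quote fixed)
array = [['D', 'F', 'J'],
         ['K', 3, 4],
         ['B', 3, 1]]

# Keep only the numeric bottom-right sub-matrix
sub = np.array([row[1:] for row in array[1:]], dtype=float)

# argmax works on the flattened array; unravel_index maps it back to 2-D
r, c = np.unravel_index(np.argmax(sub), sub.shape)

# Shift by 1 to get coordinates in the original matrix
max_row, max_col = r + 1, c + 1
```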

Indexing a Pandas Dataframe using the index of a Series

I have a TimeSeries and I want to extract the first three elements and with them create a row of a Pandas DataFrame with three columns. I can do this easily using a Dictionary, for example. The problem is that I would like the index of this one-row DataFrame to be the datetime index of the first element of the Series. Here I fail.
For a reproducible example:
CRM
Date
2018-08-30 0.000442
2018-08-29 0.005923
2018-08-28 0.004782
2018-08-27 0.003243
pd.DataFrame({'Reg_Coef_5_1': ts1.iloc[0][0],
              'Reg_Coef_5_2': ts1.shift(-5).iloc[0][0],
              'Reg_Coef_5_3': ts1.shift(-10).iloc[0][0]},
             index=ts1.iloc[0].index)
I get:
Reg_Coef_5_1 Reg_Coef_5_2 Reg_Coef_5_3
CRM 0.000442 0.001041 -0.00035
Instead I would like the index to be '2018-08-30' a datetime object.
If I understand you correctly, you would like the index to be a date object instead of "CRM" as it is in your example. Just set the index accordingly: index = [ts1.index[0]] instead of index = ts1.iloc[0].index.
df = pd.DataFrame({'Reg_Coef_5_1': ts1.iloc[0][0],
                   'Reg_Coef_5_2': ts1.shift(-5).iloc[0][0],
                   'Reg_Coef_5_3': ts1.shift(-10).iloc[0][0]},
                  index=[ts1.index[0]])
But as user10300706 has said, there might be a better way to do what you want, ultimately.
If you're simply trying to recover the index position then do:
index = ts1.index[0]
I would note that if you are shifting your dataframe up incrementally (5/10 rows respectively) the indexes won't align. I assume, however, that you're trying to build out some lagging indicator.
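A self-contained sketch of the accepted fix, using a synthetic stand-in for ts1 (descending daily dates, integer values; both hypothetical):

```python
import pandas as pd

# Synthetic stand-in for ts1: one column, newest date first
dates = pd.date_range("2018-08-16", periods=15)[::-1]
ts1 = pd.DataFrame({"CRM": range(15)}, index=dates)
ts1.index.name = "Date"

df = pd.DataFrame({"Reg_Coef_5_1": ts1.iloc[0, 0],
                   "Reg_Coef_5_2": ts1.shift(-5).iloc[0, 0],
                   "Reg_Coef_5_3": ts1.shift(-10).iloc[0, 0]},
                  index=[ts1.index[0]])   # datetime label instead of "CRM"
```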

Pandas, for each row getting value of largest column between two columns

I'd like to express the following on a pandas data frame, but I don't know how to other than by slow manual iteration over all cells.
For context: I have a data frame with two categories of columns, we'll call them the read_columns and the non_read_columns. Given a column name I have a function that can return true or false to tell you which category the column belongs to.
Given a specific read column A:
For each row:
1. Inspect the read column A to get the value X
2. Find the read column with the smallest value Y that is greater than X.
If no read column has a value greater than X, then substitute the largest value
found in all of the *non*-read columns, call it Z, and skip to step 4.
3. Find the non-read column with the greatest value between X and Y and call its value Z.
4. Compute Z - X
At the end I hope to have a series of the Z - X values with the same index as the original data frame. Note that the sort order of column values is not consistent across rows.
What's the best way to do this?
It's hard to give an answer without looking at the example DF, but you could do the following:
Separate your read columns with Y values into a new DF.
Transpose this new DF to get the Y values in columns, not in rows.
Use built-in vectorized functions on the Series of Y values instead of iterating the rows and columns manually. You could first filter the values greater than X, and then apply min() on the filtered Series.
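The per-row logic can at least be written once as a function and applied row-wise; not fully vectorized, but much clearer than manual cell iteration. Everything here (frame, column names, category split) is hypothetical:

```python
import pandas as pd

# Hypothetical frame: 'a', 'b', 'c' are read columns; 'p', 'q' are non-read
df = pd.DataFrame({"a": [1, 10], "b": [5, 2], "c": [9, 6],
                   "p": [3, 4], "q": [7, 1]})
read_cols, non_read_cols = ["a", "b", "c"], ["p", "q"]

def z_minus_x(row, a="a"):
    x = row[a]                                    # step 1: X from read column A
    greater = row[read_cols][row[read_cols] > x]
    if greater.empty:                             # no read value above X:
        return row[non_read_cols].max() - x       # fall back to largest non-read
    y = greater.min()                             # step 2: smallest read Y > X
    non_read = row[non_read_cols]
    z = non_read[(non_read > x) & (non_read < y)].max()  # step 3: largest in (X, Y)
    return z - x                                  # step 4

result = df.apply(z_minus_x, axis=1)   # one Z - X per row, original index kept
```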

pandas set one cell value equals to another

I want to set a cell of pandas dataframe equal to another. For example:
station_dim.loc[station_dim.nlc==573,'longitude']=station_dim.loc[station_dim.nlc==5152,'longitude']
However, when I checked
station_dim.loc[station_dim.nlc==573,'longitude']
It returns NaN
Besides directly setting station_dim.loc[station_dim.nlc==573,'longitude'] to a number, what other choice do I have? And why can't I use this method?
Take a look at get_value (since deprecated in favour of the .at[]/.iat[] accessors), or use .values:
station_dim.loc[station_dim.nlc==573,'longitude']=station_dim.loc[station_dim.nlc==5152,'longitude'].values[0]
For the assignment to work: .loc[] returns a pd.Series, and the index of that pd.Series would need to align with your df, which it probably doesn't. So either extract the value directly as a scalar - e.g. via .at[] with the index label - or use .values, which returns a np.array, and take the first value of that array.
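A minimal reproduction of the fix, with hypothetical station data:

```python
import pandas as pd

# Hypothetical station table: station 573 is missing its longitude
station_dim = pd.DataFrame({"nlc": [573, 5152],
                            "longitude": [float("nan"), -0.1]})

# Assigning the raw .loc[] Series would try to align on index and leave NaN;
# .values[0] extracts the scalar, so the assignment works
station_dim.loc[station_dim.nlc == 573, "longitude"] = \
    station_dim.loc[station_dim.nlc == 5152, "longitude"].values[0]
```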
