pandas dataframe, How do I select the N surrounding points by index - python

Let's say I have a pandas dataframe with 200 rows and a point X. I need to select the N nearest points by index. N can be an even or an odd number.
Let's say my N in this case is 20. How can I get the closest 20 points to X by index? This would have to work if X is, say, at index 5, so I can't just take the 10 points in either direction. Is there something in pandas where you say
df.getClosest(index=5, N=20)
and it will return the subset of the dataframe with the 20 points closest to the point at index 5?

Something like this might work:
# Set Index to Search
index_to_search = 5
# Create an Index Column
df['Index'] = df.index
df_new = df.iloc[(df['Index']-index_to_search).abs().argsort()[:20]]
print(df_new['Index'].tolist())
I assumed you didn't have an "Index" column on your data frame. If you already do, then you shouldn't need the df['Index'] = df.index line.
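I think if you have two IDs that are equally close (one below and one above, for example) it favours the smaller one, although argsort's default sort is not guaranteed to be stable, so pass kind='stable' if the tie-break matters. This potential issue only applies if you need an even number of indexes.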
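A minimal sketch of the same idea wrapped in a helper, assuming a numeric index. The name get_closest is hypothetical (pandas has no such method), and the stable sort is my addition so that ties go to the earlier row:

```python
import numpy as np
import pandas as pd


def get_closest(df, index_value, n):
    """Return the n rows whose index labels are closest to index_value.

    Hypothetical helper, not part of pandas. Assumes a numeric index.
    """
    # Distance of every index label from the target label.
    distance = np.abs(df.index.to_numpy() - index_value)
    # A stable sort keeps the original order among equal distances,
    # so on a tie the earlier row wins.
    positions = np.argsort(distance, kind="stable")[:n]
    # Sort the chosen positions so the rows come back in their original order.
    return df.iloc[np.sort(positions)]


df = pd.DataFrame({"value": np.arange(200) * 2})
near_start = get_closest(df, 5, 20)
near_middle = get_closest(df, 100, 4)
```

Near the start of the frame the window simply extends further to the right, because the rows to the left run out.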

Related

Create a matrix with a set of ranges in columns and a set of ranges in rows with Pandas

I have a data frame in which one column 'F' has values from 0 to 100 and a second column 'E' has values from 0 to 500. I want to create a matrix in which frequencies fall within ranges in both 'F' and 'E'. For example, I want to know the frequency in range 20 to 30 for 'F' and range 400 to 500 for 'E'.
What I expect to have is the following matrix:
[image: matrix of ranges]
I have tried to group ranges using pd.cut() and groupby() but I don't know how to join data.
I really appreciate your help in creating the matrix with pandas.
You can use the cut function to create the bin "tag/name" for each column.
After that, you can pivot the data frame.
df['rows'] = pd.cut(df['F'], 5)
df['cols'] = pd.cut(df['E'], 5)
df = df.groupby(['rows', 'cols']).agg('sum').reset_index() # your agg func here
df = df.pivot(columns='cols', index='rows')
So this is the way I found to create the matrix, which was obviously inspired by @usher's answer. I know it's more convoluted, but I wanted to share it. Thanks again @usher
E=df.E
F=df.F
# bins must be an integer number of bins, so convert the result of the division
bins_E=pd.cut(E, bins=int((max(E)-min(E))/100))
bins_F=pd.cut(F, bins=int((max(F)-min(F))/10))
bins_EF=bins_E.to_frame().join(bins_F)
freq_EF=bins_EF.groupby(['E', 'F']).size().reset_index(name="counts")
Mat_FE = freq_EF.pivot(columns='E', index='F')
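For comparison, here is a shorter sketch that gets the frequency matrix from pd.cut plus pd.crosstab instead of groupby and pivot. The sample data, the bin edges (steps of 10 for 'F', steps of 100 for 'E') and the label format are all assumptions made for illustration:

```python
import pandas as pd

# Made-up sample data; only the column names 'F' and 'E' come from the question.
df = pd.DataFrame({
    "F": [5, 15, 25, 25, 95],
    "E": [50, 150, 450, 450, 250],
})

# Assumed bin edges: 0-10, 10-20, ... for F and 0-100, 100-200, ... for E.
f_edges = list(range(0, 101, 10))
e_edges = list(range(0, 501, 100))
f_labels = [f"{lo}-{hi}" for lo, hi in zip(f_edges[:-1], f_edges[1:])]
e_labels = [f"{lo}-{hi}" for lo, hi in zip(e_edges[:-1], e_edges[1:])]

f_bins = pd.cut(df["F"], bins=f_edges, labels=f_labels, include_lowest=True)
e_bins = pd.cut(df["E"], bins=e_edges, labels=e_labels, include_lowest=True)

# Rows are the F ranges, columns are the E ranges, cells are counts.
freq = pd.crosstab(f_bins, e_bins)
print(freq)
```

The count for 'F' in 20 to 30 and 'E' in 400 to 500 can then be read with freq.loc["20-30", "400-500"].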

min of all columns of the dataframe in a range

I want to find the min value of every row of a dataframe, restricted to only a few columns.
For example: consider a dataframe of size 10*100. I want the min over the middle 5 columns, so the data considered is of size 10*5.
I know how to find the min using df.min(axis=0), but I don't know how to restrict the number of columns. Thanks for the help.
I use the pandas library.
You can start by selecting the slice of columns you are interested in and applying DataFrame.min() to only that selection:
df.iloc[:, start:end].min(axis=0)
If you want these to be the middle 5, simply find the integer indices which correspond to the start and end of that range:
start = int(n_columns/2 - 2.5)
end = start + 5
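Put together, the slice and the min could look like the sketch below. The 10*100 frame is made up to match the question. Note that axis=0 returns one minimum per selected column, while axis=1 returns one minimum per row, which may be what the question is actually after:

```python
import numpy as np
import pandas as pd

# Made-up 10 x 100 frame matching the size mentioned in the question.
df = pd.DataFrame(np.arange(1000).reshape(10, 100))

n_columns = df.shape[1]
start = int(n_columns / 2 - 2.5)  # 47 for 100 columns
end = start + 5

middle = df.iloc[:, start:end]    # shape (10, 5)

col_min = middle.min(axis=0)      # one value per selected column (5 values)
row_min = middle.min(axis=1)      # one value per row (10 values)
```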
Following pciunkiewicz's logic:
First you should select the columns that you want. You can use .loc[...] or .iloc[...].
With .loc you can use the names of the columns. When it takes two arguments, the first one selects rows by index label and the second one selects columns.
df.loc[[rows], [columns]] # The filter values go inside the brackets.
df.loc[:, [columns]] # This will consider all rows.
You can also use .iloc. In this case, you have to use integers to locate the data, so you don't need to know the names of the columns, only their positions.
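A small sketch of both selectors side by side. The column names a to d and the values are made up for illustration:

```python
import pandas as pd

# Made-up frame with named columns.
df = pd.DataFrame({
    "a": [9, 8, 7],
    "b": [1, 5, 3],
    "c": [4, 2, 6],
    "d": [0, 0, 0],
})

# Label-based: pick columns 'b' and 'c' by name, all rows.
by_name = df.loc[:, ["b", "c"]].min(axis=1)

# Position-based: the same two columns are at positions 1 and 2.
by_position = df.iloc[:, 1:3].min(axis=1)
```

Both give the per-row minimum over the same two columns, so the results are identical.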

Pandas, for each row getting value of largest column between two columns

I'd like to express the following on a pandas data frame, but I don't know how to other than slow manual iteration over all cells.
For context: I have a data frame with two categories of columns, we'll call them the read_columns and the non_read_columns. Given a column name I have a function that can return true or false to tell you which category the column belongs to.
Given a specific read column A:
For each row:
1. Inspect the read column A to get the value X
2. Find the read column with the smallest value Y that is greater than X.
If no read column has a value greater than X, then substitute the largest value
found in all of the *non*-read columns, call it Z, and skip to step 4.
3. Find the non-read column with the greatest value between X and Y and call its value Z.
4. Compute Z - X
At the end I hope to have a series of the Z - X values with the same index as the original data frame. Note that the sort order of column values is not consistent across rows.
What's the best way to do this?
It's hard to give an answer without looking at the example DF, but you could do the following:
Separate your read columns with Y values into a new DF.
Transpose this new DF to get the Y values in columns, not in rows.
Use built-in vectorized functions on the Series of Y values instead of iterating the rows and columns manually. You could first filter the values greater than X, and then apply min() on the filtered Series.
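One possible vectorized sketch of the four steps, using NumPy masks rather than a transposed frame. The function name, the column lists and the sample data are hypothetical. It also assumes "greater than" and "between" are strict comparisons, and it returns NaN when no non-read value lies between X and Y, since the question does not say what should happen in that case:

```python
import numpy as np
import pandas as pd


def z_minus_x(df, read_cols, non_read_cols, a):
    """Hypothetical helper: compute Z - X per row for read column `a`."""
    reads = df[read_cols].to_numpy(dtype=float)
    non_reads = df[non_read_cols].to_numpy(dtype=float)
    x = df[a].to_numpy(dtype=float)[:, None]  # shape (rows, 1)

    # Step 2: smallest read value strictly greater than X (inf if none).
    y = np.where(reads > x, reads, np.inf).min(axis=1, keepdims=True)

    # Step 3: largest non-read value strictly between X and Y.
    between = (non_reads > x) & (non_reads < y)
    z_between = np.where(between, non_reads, -np.inf).max(axis=1)
    # Assumption: NaN when nothing lies between X and Y.
    z_between = np.where(between.any(axis=1), z_between, np.nan)

    # Step 2 fallback: no read value above X, so use the largest non-read value.
    no_y = np.isinf(y[:, 0])
    z = np.where(no_y, non_reads.max(axis=1), z_between)

    # Step 4: Z - X, with the original index.
    return pd.Series(z - x[:, 0], index=df.index)


# Made-up data: A, B, C are read columns; p, q, r are non-read columns.
df = pd.DataFrame(
    {"A": [1, 9, 2], "B": [5, 2, 3], "C": [3, 4, 8],
     "p": [2, 1, 5], "q": [4, 7, 6], "r": [10, 3, 7]},
    index=["r1", "r2", "r3"],
)
result = z_minus_x(df, ["A", "B", "C"], ["p", "q", "r"], "A")
```

The lists of read and non-read columns can be built once with the classification function mentioned in the question, for example with a list comprehension over df.columns.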

Trying to divide a dataframe column by a float yields NaN

Background
I work with a CSV datasheet that contains columns of numbers. I am working on a program that takes the first column, asks the user for a time as a float (i.e. 45 and a half hours = 45.5), and then subtracts that number from the first column. I have been successful in that regard. Now I need to find the row index of the "zero" time point. I use min to find that index and then look it up in the following column A1. I need the reading at time 0 so that I can normalize A1 to it, so that on a graph the reading at the 0 time point is 1 in column A1 (and eventually in all subsequent columns, but baby steps for me).
time_zero = float(input("Which time would you like to be set to 0?"))
df['A1']= df['A1']-time_zero
This works fine so far to set the zero time.
zero_location_series = df[df['A1'] == df['A1'].min()]
r1 = zero_location_series[' A1.1']
df[' A1.1'] = df[' A1.1']/r1
Here's where I run into trouble. The first line will correctly identify a series that I can pull off of for all my other columns. Next r1 correctly identifies the proper A1.1 value and this value is a float when I use type(r1).
However when I divide df[' A1.1']/r1 it yields only one correct value and that value is where r1/r1 = 1. All other values come out NaN.
My Questions:
How do I divide a column by a float? Why am I getting NaN?
Is there a faster way to do this, as I need to do it for 16 columns (i.e. 'A2/r2', 'a3/r3', etc.)?
Do I need to use inplace=True anywhere to make the operations stick before re-saving the data, or is that only for adding/deleting rows?
Example
Dataframe that looks like this
!http://i.imgur.com/ObUzY7p.png
zero time sets properly (image not shown)
after dividing the column
!http://i.imgur.com/TpLUiyE.png
This should work:
df['A1.1']=df['A1.1']/df['A1.1'].min()
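I think the reason df[' A1.1'] = df[' A1.1']/r1 did not work is that r1 is a Series, not a scalar. Dividing one Series by another aligns them on the index, so only the row whose index label also appears in r1 gets a value, and every other row becomes NaN. Try r1? (in IPython) instead of type(r1) and pandas will tell you that r1 is a Series, not an individual float number.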
To do it in one attempt, you have to iterate over each column, like this:
for c in df:
    df[c] = df[c]/df[c].min()
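To make the alignment issue concrete, here is a small sketch with made-up data. The column names time, A1.1 and A2.1 are assumptions, as is locating the zero row by the smallest absolute time:

```python
import pandas as pd

# Made-up data: a time column already shifted so that one row is at 0.
df = pd.DataFrame({
    "time": [-1.0, 0.0, 1.0],
    "A1.1": [2.0, 4.0, 8.0],
    "A2.1": [1.0, 2.0, 6.0],
})

# Boolean filtering returns a one-row frame, so this is a one-element Series.
zero_row = df[df["time"].abs() == df["time"].abs().min()]
r1_series = zero_row["A1.1"]

# Series / Series aligns on index: only index 1 matches, the rest are NaN.
misaligned = df["A1.1"] / r1_series

# Pull out the scalar first and the division applies to every row.
r1 = r1_series.iloc[0]
normalized = df["A1.1"] / r1

# All reading columns at once: divide each column by its value in the zero row.
zero_label = df["time"].abs().idxmin()
value_cols = ["A1.1", "A2.1"]
normalized_all = df[value_cols].div(df.loc[zero_label, value_cols], axis=1)
```

None of these operations modify df in place. Assign the result back (for example df[value_cols] = normalized_all) before saving.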
If you want to divide every value in the column by r1 it's best to apply, for example:
import pandas as pd
df = pd.DataFrame([1,2,3,4,5])
# apply an anonymous function to the first column ([0]), dividing every value
# in the column by 3
result = df[0].apply(lambda x: x/3.0)
print(result)
So you'd probably want something like this:
# r1 is the one-element Series from the question, so take its single value first
df["A1.1"] = df["A1.1"].apply(lambda x: x/r1.iloc[0])
This really only answers part 2 of your question. apply is one way to run a function over the values of a column, although plain vectorized division (the column divided by a scalar) is usually faster. As for why you're getting NaNs when dividing by a float, is it possible the values in your columns are anything other than floats or integers?

In Python, how do I select the columns of a dataframe satisfying a condition on the number of NaN?

I hope someone could help me. I'm new to Python, and I have a dataframe with 111 columns and over 40 000 rows. All the columns contain NaN values (some columns contain more NaN's than others), so I want to drop those columns having at least 80% of NaN values. How can I do this?
To solve my problem, I tried the following code
df1=df.apply(lambda x : x.isnull().sum()/len(x) < 0.8, axis=0)
The function x.isnull().sum()/len(x) is to divide the number of NaN in the column x by the length of x, and the part < 0.8 is to choose those columns containing less than 80% of NaN.
The problem is that when I run this code I only get the names of the columns together with the boolean "True" but I want the entire columns, not just the names. What should I do?
You could do this:
filt = df.isnull().sum()/len(df) < 0.8
df1 = df.loc[:, filt]
You want to achieve two things. First, you have to find all columns that contain less than 80% NaNs. Second, you want to keep only those columns and discard the rest from your DataFrame.
To get a pandas Series indicating whether a column should be kept, you can do:
df1 = df.isnull().sum(axis=0) < 0.8*df.shape[0]
(Btw. you have a typo in your question. You should drop the ==True as it always tests whether 0.5==True)
This will give True for all columns to keep, as .isnull() gives True (or 1) if it is NaN and False (or 0) for a valid number for every element. Then .sum(axis=0) sums down each column, giving the number of NaNs in each column. The comparison then checks whether that number is smaller than 80% of the number of rows.
For the second task, you can use this to index your columns by using:
df = df[df.columns[df1]]
or as suggested in the comments by doing:
df.drop(df.columns[df1==False], axis=1, inplace=True)
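A compact sketch of the whole thing on made-up data. It uses .isna().mean(), which gives the NaN fraction per column directly, instead of dividing the sum by the length:

```python
import numpy as np
import pandas as pd

# Made-up frame with 5 rows and different amounts of NaN per column.
df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, 4.0, 5.0],              # 20% NaN
    "b": [np.nan, np.nan, np.nan, np.nan, 1.0],     # 80% NaN
    "c": [np.nan] * 5,                              # 100% NaN
    "d": [1.0, 2.0, np.nan, np.nan, 5.0],           # 40% NaN
})

# Fraction of NaN values in each column.
nan_fraction = df.isna().mean()

# Keep the columns with less than 80% NaN; 'b' (exactly 80%) is dropped.
kept = df.loc[:, nan_fraction < 0.8]
```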
