min of all columns of the dataframe in a range - python

I want to find the min value of every column of a dataframe, restricted to only a few columns.
For example: consider a dataframe of size 10*100. I want the min over the middle 5 columns, so the selection becomes of size 10*5.
I know how to find the min using df.min(axis=0), but I don't know how to restrict it to a subset of the columns. Thanks for the help.
I am using the pandas library.

You can start by selecting the slice of columns you are interested in and applying DataFrame.min() to only that selection:
df.iloc[:, start:end].min(axis=0)
If you want these to be the middle 5, simply find the integer indices which correspond to the start and end of that range:
start = int(n_columns/2 - 2.5)
end = start + 5
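Putting the pieces together, a minimal sketch (using a hypothetical random 10*100 frame in place of the asker's data):
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(10, 100))  # hypothetical 10x100 frame

n_columns = df.shape[1]
start = int(n_columns / 2 - 2.5)  # 47 for 100 columns
end = start + 5                   # 52, so columns 47..51 are selected

print(df.iloc[:, start:end].min(axis=0))  # one minimum per selected column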

Following pciunkiewicz's logic:
First, select the columns you want. You can use either of the accessors .loc[..] or .iloc[..].
With .loc you use the names of the columns. When it takes two arguments, the first one is the row index and the second is the columns:
df.loc[[rows], [columns]] # The filter data goes inside the brackets.
df.loc[:, [columns]] # This considers all rows.
You can also use .iloc. In this case, you use integers to locate the data, so you don't need to know the names of the columns, only their positions.
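A short sketch of both accessors on a made-up frame (the column names "a", "b", "c" are just for illustration):
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4], "c": [5, 6]})

# .loc is label-based: rows first, then column names
print(df.loc[:, ["a", "b"]])

# .iloc is position-based: the same two columns by integer position
print(df.iloc[:, 0:2])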

Related

How to select a range of columns when using the replace function on a large dataframe?

I have a large dataframe that consists of around 19,000 rows and 150 columns. Many of these columns contain values of -1 and -2. When I try to replace the -1s and -2s with 0 using the following code, Jupyter times out on me and says there is no memory left. So I am curious whether you can select a range of columns and apply the replace function to just that range. That way I can replace in batches, since I can't seem to replace everything in one pass with my available memory.
Here is the code I tried, which timed out when first replacing the -2s:
df.replace(to_replace=-2, value="0")
Thank you for any guidance!
Sean
Let's say you want to divide your columns into chunks of 10; then you could try something like this:
columns = your_df.columns
division_num = 10

# Walk the columns in steps of division_num so only a slice is touched at a time.
# replace() returns a new object rather than modifying in place, so the result
# has to be assigned back to those columns.
for start in range(0, len(columns), division_num):
    chunk = columns[start: start + division_num]
    your_df[chunk] = your_df[chunk].replace(to_replace=-2, value="0")
If your memory keeps overflowing, then you can instead divide the data by rows rather than columns, as in the sketch below.
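A minimal sketch of that row-wise variant, reusing the same hypothetical your_df (a numeric 0 is used here so the replacement value matches numeric column dtypes; that is an assumption about your data):
chunk_rows = 1000  # tune this to your available memory

for start in range(0, len(your_df), chunk_rows):
    stop = start + chunk_rows
    # Replace within one horizontal slice at a time and assign the result back
    your_df.iloc[start:stop] = your_df.iloc[start:stop].replace(to_replace=-2, value=0)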

Copy values from column X+2 (two to the right of X) into column X

I have a dataframe in which one of every three columns has a name (the others are unnamed 1, 2, 3, ...).
I want the values in each named column to be equal to the values of the column two positions to its right.
I was using df.columns.get_loc("X"), and I can use that to correctly select my desired column with df.iloc[:, X],
but I can't just do Y = X + 2 in pandas and then df.iloc[:, X] = df.iloc[:, Y], because X is not just an integer.
Any ideas on how to solve this? A different way to make column X hold the same values as the column two to its right would also work.
Thanks!
This would work; change the 8 to fit your columns, or use len(df.columns) // 3 * 3:
for n in range(0, 8, 3):
    df.iloc[:, n] = df.iloc[:, n + 2]
It doesn't seem we can assign multiple columns to multiple columns in one step; I'm not sure whether that is possible.
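For illustration, a self-contained sketch on a tiny made-up frame (the column names are hypothetical):
import pandas as pd

# Every third column is named; the ones in between stand in for the unnamed fillers
df = pd.DataFrame({
    "X": [0, 0], "u1": [1, 2], "u2": [3, 4],
    "Y": [0, 0], "u3": [5, 6], "u4": [7, 8],
})

for n in range(0, len(df.columns) // 3 * 3, 3):
    df.iloc[:, n] = df.iloc[:, n + 2]

print(df)  # "X" now holds u2's values and "Y" holds u4's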

pandas dataframe, How do I select the N surrounding points by index

Let's say I have a pandas dataframe with 200 rows and a point X. I need to select the N points nearest to X by index, where N can be an even or an odd number.
Let's say my N in this case is 20. How can I get the 20 points closest to X by index? This would have to work even if X is, say, at index 5, so I can't just take 10 points in either direction. Is there something in pandas where you can say
df.getClosest(index=5, N=20)
and it will return the subset of the dataframe with the 20 points closest to the point at index 5?
Something like this might work:
# Set Index to Search
index_to_search = 5
# Create an Index Column
df['Index'] = df.index
df_new = df.iloc[(df['Index']-index_to_search).abs().argsort()[:20]]
print(df_new['Index'].tolist())
I assumed you didn't have an "Index" column on your data frame; if you already do, you shouldn't need the df['Index'] = df.index line.
I think that if two indices are equally close (one below and one above, for example), this favours the smaller one, but that only matters when N is even.
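Since "closest by index" is always a contiguous block of rows, an alternative sketch is to clamp an N-row window at the frame's edges; get_closest is a hypothetical helper name, and a default RangeIndex is assumed:
def get_closest(df, index, n):
    # Centre the window on `index`, then clamp it so it never leaves the frame
    start = max(0, min(index - n // 2, len(df) - n))
    return df.iloc[start:start + n]

subset = get_closest(df, index=5, n=20)  # rows 0..19, since 5 is near the edge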

Pandas, for each row getting value of largest column between two columns

I'd like to express the following on a pandas data frame, but I don't know how to do it other than by slow manual iteration over all cells.
For context: I have a data frame with two categories of columns, we'll call them the read_columns and the non_read_columns. Given a column name I have a function that can return true or false to tell you which category the column belongs to.
Given a specific read column A:
For each row:
1. Inspect the read column A to get the value X
2. Find the read column with the smallest value Y that is greater than X.
If no read column has a value greater than X, then substitute the largest value
found in all of the *non*-read columns, call it Z, and skip to step 4.
3. Find the non-read column with the greatest value between X and Y and call its value Z.
4. Compute Z - X
At the end I hope to have a series of the Z - X values with the same index as the original data frame. Note that the sort order of column values is not consistent across rows.
What's the best way to do this?
It's hard to give an answer without seeing an example DF, but you could do the following:
1. Separate your read columns (the ones Y is picked from) into a new DF.
2. Transpose this new DF so the values you compare per row sit in columns rather than rows.
3. Use built-in vectorized functions on the resulting Series instead of iterating over rows and columns manually: first filter for the values greater than X, then apply min() on the filtered Series to get Y.
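As a minimal sketch of that vectorized step (the frame, the read_cols list, and the reference column "A" are all made up for illustration):
import pandas as pd

df = pd.DataFrame({"A": [1, 5], "B": [4, 2], "C": [7, 9]})
read_cols = ["A", "B", "C"]  # columns your classifier function marked as "read"

x = df["A"]
reads = df[read_cols]

# Mask out everything not greater than X, then take the row-wise min to get Y
y = reads.where(reads.gt(x, axis=0)).min(axis=1)
print(y)  # row 0 -> 4.0, row 1 -> 9.0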

Numpy: how to select full row based on some columns

import numpy as np

a = np.array([[1., 2., 3.],
              [3., 4., 2.],
              [8., 1., 3.]])
b = [8., 1.]
c = a[np.isclose(a[:, 0:2], b)]
print(c)
I want to select full rows of a based on a condition over only a few columns; my attempt is above.
It works if I also include the last column in the condition, but I don't care about the last column. How do I select full 3-column rows based on a condition on the first 2?
Compare against b with np.isclose using the sliced version of a, then look for rows where all entries match, for which we can use np.all or np.logical_and.reduce. Finally, index into the input array with that boolean mask for the output.
Hence, two solutions -
a[np.isclose(a[:, :2], b).all(axis=1)]
a[np.logical_and.reduce(np.isclose(a[:, :2], b), axis=1)]
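A self-contained run with the arrays from the question, for reference:
import numpy as np

a = np.array([[1., 2., 3.],
              [3., 4., 2.],
              [8., 1., 3.]])
b = [8., 1.]

mask = np.isclose(a[:, :2], b).all(axis=1)  # one boolean per row
print(a[mask])  # [[8. 1. 3.]]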
