Selecting subsets of data in Pandas - Python

I have a data set containing 5 rows × 1317 columns. Attached you can see what the data set looks like. The header contains numbers which are wavelengths, but I only want to select the columns for a specific range of wavelengths.
The wavelength values I am interested in are stored in an array (c) of size 1 × 235.
How can I extract the desired columns according to the wavelength values stored in c?

If your array c only has values that are also column headings (that is, c doesn't have any additional values), you may be able to just make it a list and use df[c], where c is that list.
For example, with what is shown in your current picture, you could do:
l = [102, 105]  # I am assuming that your column headings in the df are integers, not strings
df[l]
This will display those two columns. If you want them in a new dataframe, do something like df2 = pandas.DataFrame(df[l]). If l held 5 columns, it would show those 5 columns. So if you can pass in your array c (or make it into a list, probably with l = list(c)), you'll get your columns.
If your array has additional values that aren't necessarily columns in your dataframe, you'll need to make a sub-list of just those columns.
sub_c = list()  # create a blank list that we will add to
c_list = list(c)
for column in df.columns:
    if column in c_list:
        sub_c.append(column)
df[sub_c]
This will make a sublist that only has values that are column headers, so you won't be trying to view columns that don't exist.
Keep in mind that you'll need matching data-types between your c array and your column headers.
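Putting the pieces together, here is a minimal self-contained sketch; the wavelength columns and the array c below are made up for illustration:
import numpy as np
import pandas as pd

# Hypothetical data standing in for the 5 x 1317 spectrum frame
df = pd.DataFrame(np.random.rand(5, 4), columns=[100, 102, 105, 110])
c = np.array([102, 105, 999])  # 999 has no matching column

# Keep only the wavelengths in c that actually appear as column headers
sub_c = [col for col in df.columns if col in set(c)]
print(df[sub_c])  # the two matching columns, 102 and 105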

Related

Copy values from column X+2 (two to the right of X) into column X

I have a dataframe and one in every three columns has a name (the others are unnamed 1, 2, 3...).
I want the values in the named columns to be equal to the values two columns to the right of them.
I was using df.columns.get_loc("X"), and I can use this to correctly select my desired column with df.iloc[:, X],
but I can't do Y = X + 2 in pandas to then do df.iloc[:, X] = df.iloc[:, Y], because X is not just an integer.
Any ideas on how to solve this? It can be a different way to get column X to have the same values as two columns to the right of X.
Thanks!
This would work; change 8 to fit your columns, or use len(df.columns) // 3 * 3:
for n in range(0, 8, 3):
    df.iloc[:, n] = df.iloc[:, n + 2]
It doesn't seem we can assign multiple columns to multiple columns in one step; I'm not sure if that is possible.
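For reference, a runnable version of the idea on a small made-up frame (the column names here are hypothetical):
import pandas as pd

# Every third column is named; the ones in between are fillers
df = pd.DataFrame([[0, 1, 2, 0, 4, 5]],
                  columns=['X', 'u1', 'u2', 'W', 'u3', 'u4'])

# Copy each named column's values from two columns to its right
for n in range(0, len(df.columns) // 3 * 3, 3):
    df.iloc[:, n] = df.iloc[:, n + 2]

print(df)  # X becomes 2, W becomes 5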

min of all columns of the dataframe in a range

I want to find the min value of every row of a dataframe, restricting to only a few columns.
For example: consider a dataframe of size 10 × 100. Restricting to the middle 5 columns gives a 10 × 5 frame, and I want the row-wise min of that.
I know how to find the min using df.min(axis=0), but I don't know how to restrict the number of columns. Thanks for the help.
I use the pandas lib.
You can start by selecting the slice of columns you are interested in and applying DataFrame.min() row-wise to only that selection:
df.iloc[:, start:end].min(axis=1)
If you want these to be the middle 5, simply find the integer indices which correspond to the start and end of that range:
start = int(n_columns/2 - 2.5)
end = start + 5
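Putting it together as a sketch on a made-up 10 × 100 frame (n_columns is taken from df.shape):
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(10, 100))  # stand-in for the real data

n_columns = df.shape[1]
start = int(n_columns / 2 - 2.5)  # 47 for 100 columns
end = start + 5                   # columns 47..51, the middle 5

row_mins = df.iloc[:, start:end].min(axis=1)  # one value per row -> length 10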
Following pciunkiewicz's logic:
First you should select the columns that you desire. You can use the functions .loc[...] or .iloc[...].
With the first one you can use the names of the columns. When it takes 2 arguments, the first one is the row index and the second is the columns:
df.loc[[rows], [columns]]  # the filter data should be inside the brackets
df.loc[:, [columns]]       # this will consider all rows
You can also use .iloc. In this case, you have to use integers to locate the data, so you don't have to know the names of the columns, just their positions.
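A quick illustration of both selectors on a tiny made-up frame:
import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [3, 4], 'c': [5, 6]})

df.loc[:, ['a', 'c']]   # select by column names, all rows
df.iloc[:, [0, 2]]      # the same selection by integer position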

Efficient way of converting a numpy array of 2 dimensions into a list with no duplicates

I want to extract the values from two different columns of a pandas dataframe, put them in a list with no duplicate values.
I have tried the following:
arr = df[['column1', 'column2']].values
thelist = []
for ix, iy in np.ndindex(arr.shape):
    if arr[ix, iy] not in thelist:
        thelist.append(arr[ix, iy])
This works but it is taking too long. The dataframe contains around 30 million rows.
Example:
  column1 column2
1 adr1    adr2
2 adr1    adr2
3 adr3    adr4
4 adr4    adr5
Should generate the list with the values:
[adr1, adr2, adr3, adr4, adr5]
Can you please help me find a more efficient way of doing this, considering that the dataframe contains 30 million rows?
@ALollz gave the right answer. I'll extend from there. To convert into a list as expected, just use list(np.unique(df.values)).
You can use just np.unique(df) (maybe this is the shortest version).
Formally, the first parameter of np.unique should be an array_like object,
but as I checked, you can also pass just a DataFrame.
Of course, if you want a plain list, not an ndarray, write
np.unique(df).tolist().
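On the question's example data this looks like the following (note that np.unique returns the values sorted):
import numpy as np
import pandas as pd

df = pd.DataFrame({'column1': ['adr1', 'adr1', 'adr3', 'adr4'],
                   'column2': ['adr2', 'adr2', 'adr4', 'adr5']})

print(np.unique(df).tolist())
# ['adr1', 'adr2', 'adr3', 'adr4', 'adr5']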
Edit following your comment
If you want the list unique but in the order of appearance, write:
pd.DataFrame(df.values.reshape(-1,1))[0].drop_duplicates().tolist()
Operation order:
reshape changes the source array into a single column.
Then a DataFrame is created, with default column name = 0.
Then [0] takes just this (the only) column.
drop_duplicates does exactly what the name says.
And the last step: tolist converts to a plain list.
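Continuing the example from above, the order-preserving variant gives (row-major order of first appearance):
ordered = pd.DataFrame(df.values.reshape(-1, 1))[0].drop_duplicates().tolist()
print(ordered)  # ['adr1', 'adr2', 'adr3', 'adr4', 'adr5']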

Pandas, for each row getting value of largest column between two columns

I'd like to express the following on a pandas data frame, but I don't know how to do it other than by slow manual iteration over all cells.
For context: I have a data frame with two categories of columns, we'll call them the read_columns and the non_read_columns. Given a column name I have a function that can return true or false to tell you which category the column belongs to.
Given a specific read column A:
For each row:
1. Inspect the read column A to get the value X
2. Find the read column with the smallest value Y that is greater than X.
If no read column has a value greater than X, then substitute the largest value
found in all of the *non*-read columns, call it Z, and skip to step 4.
3. Find the non-read column with the greatest value between X and Y and call its value Z.
4. Compute Z - X
At the end I hope to have a series of the Z - X values with the same index as the original data frame. Note that the sort order of column values is not consistent across rows.
What's the best way to do this?
It's hard to give an answer without looking at the example DF, but you could do the following:
Separate your read columns with Y values into a new DF.
Transpose this new DF to get the Y values in columns, not in rows.
Use built-in vectorized functions on the Series of Y values instead of iterating the rows and columns manually. You could first filter the values greater than X, and then apply min() on the filtered Series.
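Without the example DF this can only be a sketch, but here is one way to vectorize the four steps, assuming a hypothetical classifier is_read_column() like the one the question describes:
import numpy as np
import pandas as pd

def is_read_column(name):
    return name.startswith('read_')  # assumption for this sketch

def z_minus_x(df, a):
    read_cols = [c for c in df.columns if is_read_column(c)]
    non_read_cols = [c for c in df.columns if not is_read_column(c)]
    x = df[a]

    # Step 2: smallest read value strictly greater than X, per row
    reads = df[read_cols]
    y = reads.where(reads.gt(x, axis=0)).min(axis=1)

    # Step 3: largest non-read value strictly between X and Y, per row
    non_reads = df[non_read_cols]
    between = non_reads.where(non_reads.gt(x, axis=0) & non_reads.lt(y, axis=0))
    z = between.max(axis=1)

    # Step 2 fallback: if no read value exceeds X, use the overall non-read
    # maximum instead (rows where step 3 finds nothing stay NaN)
    z = z.where(y.notna(), non_reads.max(axis=1))

    # Step 4
    return z - x

# Tiny made-up frame to exercise both paths:
df = pd.DataFrame({'read_A': [1.0, 5.0], 'read_B': [4.0, 2.0],
                   'other':  [3.0, 7.0]})
print(z_minus_x(df, 'read_A'))  # row 0 via step 3, row 1 via the fallback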

Fill numpy.ndarray with tuple, skipping columns

I have an m × n array created with np.zeros([m, n]) and I want to fill some row (for example row 0) with a tuple that is returned. However, I want to skip certain columns that should remain 0.
Right now I have to call the function twice (or store the result somewhere) and fill parts of the row separately.
Example with a function that returns a tuple of length 6:
A[0,0:2] = someClass.someFunc(var1, var2)[0:2]
A[0,4:8] = someClass.someFunc(var1, var2)[2:6]
I fill the first 2 columns with the first 2 elements of the tuple, skip 2 columns, and then fill the following 4 columns with the remaining part of the tuple.
Is there some way to achieve something like this:
A[0,0:2], A[0,4:8] = someClass.someFunc(var1, var2)
Skipping the need to repeat the function?
You could concatenate those ranges with np.r_ to simplify the left side:
A[0,np.r_[0:2,4:8]] = someClass.someFunc(var1, var2)
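A minimal demonstration (the tuple below stands in for someClass.someFunc's return value):
import numpy as np

A = np.zeros((1, 8))
vals = (1, 2, 3, 4, 5, 6)      # stand-in for someClass.someFunc(var1, var2)
A[0, np.r_[0:2, 4:8]] = vals   # fills columns 0-1 and 4-7 in one shot
print(A)                       # [[1. 2. 0. 0. 3. 4. 5. 6.]]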
