Fill numpy.ndarray with tuple, skipping columns - python

I have an m x n array created with np.zeros([m, n]) and I want to fill some row (for example row 0) with a returned tuple. However, I want to skip certain columns, which should remain 0.
Right now I have to repeat the function call (or store the result somewhere) and fill the relevant parts of the row separately.
Example with a function that returns a tuple of length 6:
A[0,0:2] = someClass.someFunc(var1, var2)[0:2]
A[0,4:8] = someClass.someFunc(var1, var2)[2:6]
I fill the first 2 columns with the first 2 elements of the tuple, skip 2 columns, and then fill the following 4 columns with the remaining part of the tuple.
Is there some way to achieve something like this:
A[0,0:2], A[0,4:8] = someClass.someFunc(var1, var2)
avoiding the need to repeat the function call?

You could concatenate those ranges with np.r_ to simplify the left side -
A[0,np.r_[0:2,4:8]] = someClass.someFunc(var1, var2)
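For example, a minimal runnable sketch (with a stand-in function in place of someClass.someFunc, which isn't shown in the question):

import numpy as np

def some_func():  # stand-in for someClass.someFunc(var1, var2)
    return (1, 2, 3, 4, 5, 6)

A = np.zeros([1, 8])
# np.r_[0:2, 4:8] concatenates the two ranges into one index array [0, 1, 4, 5, 6, 7]
A[0, np.r_[0:2, 4:8]] = some_func()
print(A)  # [[1. 2. 0. 0. 3. 4. 5. 6.]]

This way the function is called only once.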

Related

selecting subsets of data in Pandas

I have a data set containing 5 rows × 1317 columns. Attached you can see what the data set looks like. The header contains numbers which are wavelengths, but I only want to select the columns within a specific range of wavelengths.
The wavelength numbers I am interested in are stored in an array (c) of size 1 × 235.
How can I extract the desired columns according to the wavelength values stored in c?
If your array c only has values that are also column headings (that is, c doesn't have any additional values), you may be able to just make it a list and use df[c], where c is that list.
For example, with what is shown in your current picture, you could do:
l = [102,105] # I am assuming that your column headings in the df are integers, not strings
df[l]
This will display those two columns. If you want them in some new dataframe, then do something like df2 = pd.DataFrame(df[l]). If l had 5 columns, it would show those 5 columns. So if you can pass in your array c (or make it into a list, probably by l = list(c)), you'll get your columns.
If your array has additional values that aren't necessarily columns in your dataframe, you'll need to make a sub-list of just those columns.
sub_c = list()  # create a blank list that we will add to
c_list = list(c)
for column in df.columns:
    if column in c_list:
        sub_c.append(column)
df[sub_c]
This will build a sublist that only has values that are actual column headers, so you won't be trying to view columns that don't exist.
Keep in mind that you'll need matching data-types between your c array and your column headers.
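Putting it together, a small sketch with a toy dataframe (the integer column names here are made up to stand in for the wavelength headers in the picture):

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(5, 4), columns=[100, 102, 105, 110])
c = np.array([102, 105, 999])  # 999 is not a column in df

# keep only the values of c that are actual column headers
c_list = list(c)
sub_c = [col for col in df.columns if col in c_list]
print(df[sub_c])  # shows only columns 102 and 105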

min of all columns of the dataframe in a range

I want to find the min value of every row of a dataframe, restricted to only a few columns.
For example: consider a dataframe of size 10 × 100. I want the min over the middle 5 columns, so the restricted frame becomes of size 10 × 5.
I know how to find the min using df.min(axis=0), but I don't know how to restrict the number of columns. Thanks for the help.
I use the pandas lib.
You can start by selecting the slice of columns you are interested in and applying DataFrame.min() to only that selection:
df.iloc[:, start:end].min(axis=0)
If you want these to be the middle 5, simply find the integer indices which correspond to the start and end of that range:
start = int(n_columns/2 - 2.5)
end = start + 5
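A minimal sketch, assuming a 10 × 100 frame. Note that axis=0 gives one minimum per selected column; for the per-row minima the question asks about, use axis=1:

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(10, 100))
n_columns = df.shape[1]

start = int(n_columns / 2 - 2.5)  # 47 for 100 columns
end = start + 5                   # 52, so columns 47..51

middle = df.iloc[:, start:end]    # the middle 5 columns, shape 10 x 5
print(middle.min(axis=0))         # 5 values: one minimum per column
print(middle.min(axis=1))         # 10 values: one minimum per row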
Following pciunkiewicz's logic:
First you should select the columns you want. You can use either .loc[...] or .iloc[...].
With the first one you can use the names of the columns. When it takes two arguments, the first one is the row index and the second is the columns.
df.loc[[rows], [columns]] # The filter data should be inside the brackets.
df.loc[:, [columns]] # This will consider all rows.
You can also use .iloc. In this case, you have to use integers to locate the data, so you don't need to know the names of the columns, just their positions.
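For example, a quick sketch with made-up column names:

import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6], 'c': [7, 8, 9]})

# .loc selects by label: all rows, columns 'a' and 'c'
print(df.loc[:, ['a', 'c']])

# .iloc selects by integer position: all rows, the first and third columns
print(df.iloc[:, [0, 2]])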

based on condition=True in a column, filling random values to a particular column

I need to work on a column and, based on a condition (if it is True), fill in random numbers for the entry (not a constant string/number). I tried with a for loop and it's working, but is there a faster way, similar to np.select or np.where?
I have written a for loop and it's working:
The 'NUMBER' column here has a few entries greater than 1000; I need to replace each of them with a random float between 120 and 123, not the same one for every entry. I have used np.random.uniform and it's working too.
for i in range(0, len(data['NUMBER'])):
    if data['NUMBER'][i] >= 1000:
        data['NUMBER'][i] = np.random.uniform(120, 123)
The output of this code fills each matching entry with a different random value between 120 and 123; after the replacement the entries are:
0    7.139093
1    12.592815
2    12.712103
3    120.305773   <- replaced
4    11.941386
5    122.548703   <- replaced
6    6.357255
... etc.
But when I use np.select or np.where as shown below (since they should run faster), every entry satisfying the condition is replaced by the same single number. For example, instead of having different values at indexes 3 and 5 as shown above, all matching entries get the same value between 120 and 123. Please guide here.
data['NUMBER'] = np.where(data['NUMBER'] >= 1000, np.random.uniform(120, 123), data['NUMBER'])
data['NUMBER'] = np.select([data['NUMBER'] >= 1000], [np.random.uniform(120, 123)], [data['NUMBER']])
np.random.uniform(120, 123) is a single random number:
In [1]: np.random.uniform(120, 123)
Out[1]: 120.51317994772921
Use the size parameter to make an array of random numbers:
In [2]: np.random.uniform(120, 123, size=5)
Out[2]: array([122.22935075, 122.70963032, 121.97763459, 121.68375085, 121.13568039])
Passing this to np.where (as the second argument) allows np.where to select from this array when the condition is True:
data['NUMBER'] = np.where(data['NUMBER'] >= 1000,
                          np.random.uniform(120, 123, size=len(data)),
                          data['NUMBER'])
Use np.select when there is more than one condition. Since there is only one condition here, use np.where.
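For illustration, a sketch of how np.select might look with two conditions (the thresholds and ranges here are made up, not from the question):

import numpy as np

x = np.array([5.0, 1500.0, 40.0, 2500.0])
conditions = [x >= 2000, x >= 1000]  # checked in order, first match wins
choices = [np.random.uniform(200, 203, size=len(x)),
           np.random.uniform(120, 123, size=len(x))]

# where no condition matches, keep the original value
result = np.select(conditions, choices, default=x)
print(result)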

Efficient way of converting a numpy array of 2 dimensions into a list with no duplicates

I want to extract the values from two different columns of a pandas dataframe and put them in a list with no duplicate values.
I have tried the following:
arr = df[['column1', 'column2']].values
thelist = []
for ix, iy in np.ndindex(arr.shape):
    if arr[ix, iy] not in thelist:
        thelist.append(arr[ix, iy])
This works but it is taking too long. The dataframe contains around 30 million rows.
Example:
column1 column2
1 adr1 adr2
2 adr1 adr2
3 adr3 adr4
4 adr4 adr5
Should generate the list with the values:
[adr1, adr2, adr3, adr4, adr5]
Can you please help me find a more efficient way of doing this, considering that the dataframe contains 30 million rows?
@ALollz gave a right answer; I'll extend from there. To convert into a list as expected, just use list(np.unique(df.values)).
You can use just np.unique(df) (maybe this is the shortest version).
Formally, the first parameter of np.unique should be an array_like object,
but as I checked, you can also pass just a DataFrame.
Of course, if you want just a plain list, not an ndarray, write
np.unique(df).tolist().
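With the example data from the question, a quick sketch:

import numpy as np
import pandas as pd

df = pd.DataFrame({'column1': ['adr1', 'adr1', 'adr3', 'adr4'],
                   'column2': ['adr2', 'adr2', 'adr4', 'adr5']})
print(np.unique(df).tolist())
# ['adr1', 'adr2', 'adr3', 'adr4', 'adr5']

Note that np.unique also sorts the values.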
Edit following your comment
If you want the list unique but in the order of appearance, write:
pd.DataFrame(df.values.reshape(-1,1))[0].drop_duplicates().tolist()
Operation order:
reshape changes the source array into a single column.
Then a DataFrame is created, with default column name = 0.
Then [0] takes just this (the only) column.
drop_duplicates does exactly what the name says.
And the last step: tolist converts to a plain list.
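Checking the order-preserving version against the same toy frame (a sketch):

import pandas as pd

df = pd.DataFrame({'column1': ['adr1', 'adr1', 'adr3', 'adr4'],
                   'column2': ['adr2', 'adr2', 'adr4', 'adr5']})

# .values flattens row by row, so 'order of appearance' means reading
# each row left to right
result = pd.DataFrame(df.values.reshape(-1, 1))[0].drop_duplicates().tolist()
print(result)  # ['adr1', 'adr2', 'adr3', 'adr4', 'adr5']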

How to index numpy array with an array

Given an array defined below as:
a = np.arange(30).reshape((3, 10))
col_index = [[1,2,3,5], [3,4,5,7]]
row_index = [2,1]
Is it possible to index a[row_index, col_index], so I can do something like
a[row_index, col_index] = 1, so that a becomes
[[0,1,2,3,4,5,6,7,8,9], [10,11,12,1,1,1,16,1,18,19], [20,1,1,1,24,1,26,27,28,29]]
So to clarify: in row 2, columns 1, 2, 3, and 5 are set to one, and in row 1, columns 3, 4, 5, and 7 are also set to 1.
One way is to make row_index a column vector so it broadcasts against col_index:
a[np.array(row_index)[:, None], col_index] = 1
Or (if you don't like typing):
a[np.c_[row_index], col_index] = 1
Or, even shorter, but Python 2 only:
a[zip(row_index), col_index] = 1
What all these solutions do is make the row and column indices broadcastable against each other. np.c_ is the column-concatenation convenience object; it makes columns out of 1D objects.
zip used to do essentially the same. Only, since Python 3, it returns an iterator instead of a list, and numpy can't handle those. (One could do list(zip(row_index)), but that's not short.)
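A runnable sketch of the np.c_ version, reproducing the result from the question:

import numpy as np

a = np.arange(30).reshape((3, 10))
col_index = [[1, 2, 3, 5], [3, 4, 5, 7]]
row_index = [2, 1]

# np.c_[row_index] has shape (2, 1); col_index has shape (2, 4).
# Broadcasting pairs row 2 with columns 1, 2, 3, 5
# and row 1 with columns 3, 4, 5, 7.
a[np.c_[row_index], col_index] = 1
print(a)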
