Pandas iloc complex slice every nth row - python

I have a dataframe with a periodicity in the rows of 14 i.e. there are 14 lines of data per record (means, sdev etc.) and I want to extract the 2nd, 4th, 7th and 9th line, repeatedly for every record (14 lines). My code is:
Mean_df = df.iloc[[1,3,6,8]::14,:].copy()
which does not work:
TypeError: cannot do slice indexing on <class 'pandas.core.indexes.range.RangeIndex'> with these indexers [[1, 3, 6, 8]] of <class 'list'>
I got help with the code from here, which has been useful, but not on the multi-row selections --
Pandas every nth row
I can extract as several different slices and combine, but it feels like there may be a more elegant solution.
Any ideas?

Using:
df[np.isin(np.arange(len(df))%14,np.array([1,3,6,8]))]
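As a minimal sketch of how this mask works (assuming a hypothetical 28-row dataframe, i.e. two 14-line records): np.arange(len(df)) % 14 gives each row's position within its record, and np.isin turns that into a boolean mask.
import numpy as np
import pandas as pd

df = pd.DataFrame({'value': np.arange(28)})    # two records of 14 rows each

pos_in_record = np.arange(len(df)) % 14        # 0..13, 0..13
mask = np.isin(pos_in_record, [1, 3, 6, 8])    # True on the 2nd, 4th, 7th and 9th line of each record

Mean_df = df[mask].copy()                      # rows 1, 3, 6, 8, 15, 17, 20, 22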

You can build a tuple of slice objects (a generator expression passed to tuple) and hand it to np.r_:
arr = np.arange(14*3)
slices = tuple(slice(i, len(arr), 14) for i in (1, 3, 6, 8))
res = np.r_[slices]
print(res)
array([ 1, 15, 29, 3, 17, 31, 6, 20, 34, 8, 22, 36])
In this example, indexing dataframe rows with 1::14 is equivalent to indexing with slice(1, df.shape[0], 14).
This is fairly generic, you can define any tuple of slice objects and pass to np.r_.
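Applied to the dataframe itself, a minimal sketch (assuming a hypothetical 42-row df; np.sort keeps the selected rows in their original order):
import numpy as np
import pandas as pd

df = pd.DataFrame({'value': np.arange(14 * 3)})    # three records of 14 rows each

slices = tuple(slice(i, len(df), 14) for i in (1, 3, 6, 8))
rows = np.sort(np.r_[slices])                      # positional row indices, in order

Mean_df = df.iloc[rows, :].copy()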

Related

how do I properly use the result of argmin to slice out an array of the minimums?

I'm looking to slice out the minimum value along the first axis of an array.
For example, in the code below, I want to print out np.array([13, 0, 12, 3]).
However, the slicing isn't behaving as I would think it does.
(I do need the argmin array later and don't want to just use np.min(g, axis=1))
import numpy as np
g = np.array([[13, 23, 14], [12, 23, 0], [39, 12, 92], [19, 4, 3]])
min_ = np.argmin(g, axis=1)
print(g[:, min_])
What is happening here?
Why is my result from the code
[[13 14 23 14]
[12 0 23 0]
[39 92 12 92]
[19 3 4 3]]
Other details:
Python 3.10.2
Numpy 1.22.1
If you want to use np.argmin, you can try this:
For more explanation: min_ gives you array([0, 2, 1, 2]), but to index the array you need the row/column pairs ((0, 1, 2, 3), (0, 2, 1, 2)); that is why range is used for the row indices.
min_ = np.argmin(g, axis=1)
g[range(len(min_)), min_]  # same as np.min(g, axis=1)
Output:
array([13, 0, 12, 3])
Your code is printing the first, third, second, and third columns of the g array, in that order.
>>> np.argmin(g, axis=1)
array([0, 2, 1, 2]) # first, third, second, third
If you want to get the minimum value of each row, use np.min:
>>> np.min(g, axis=1)
array([13, 0, 12, 3])
When you write g[:, min_], you're saying: "give me all of the rows (shorthand :) for columns at indices min_ (namely 0, 2, 1, 2)".
What you wanted to say was: "give me the values at these rows and these columns" - in other words, you're missing the corresponding row indices to match the column indices in min_.
Since your desired row indices are simply a range of numbers from 0 to g.shape[0] - 1, you could technically write it as:
print(g[range(g.shape[0]), min_])
# output: [13 0 12 3]
But #richardec's solution is better overall if your goal is to extract the row-wise min value.
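For completeness, a sketch of an alternative (not from the answers above) using np.take_along_axis, which pairs each row with its own column index from min_:
import numpy as np

g = np.array([[13, 23, 14], [12, 23, 0], [39, 12, 92], [19, 4, 3]])
min_ = np.argmin(g, axis=1)                         # array([0, 2, 1, 2])

# For every row, take the column given by min_ for that row.
row_mins = np.take_along_axis(g, min_[:, None], axis=1).ravel()
print(row_mins)                                     # [13  0 12  3]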

Using Array or Series to select from multiple columns

I have a counter column which contains an integer. Based on that integer I would like to pick one of consecutive columns in my dataframe.
I tried using .apply(lambda x: ..., axis =1) but my solution there requires an extra if for each column I want to pick from.
df2 = pd.DataFrame(np.array([[1, 2, 3, 0 ], [4, 5, 6, 2 ], [7, 8, 9, 1]]),columns=['a', 'b', 'c','d'])
df2['e'] = df2.iloc[:,df2['d']]
This code doesn't work because iloc only wants one item in that position and not 3 (df2['d']= [0,2,1]).
What I would like it to do is give me the 0th item in the first row, the 2nd item in the second row, and the 1st item in the third row, so
df2['e'] = [1,6,8]
You are asking for something similar to fancy indexing in numpy. In pandas, it is lookup. Try this:
df2.lookup(df2.index, df2.columns[df2['d']])
Out[86]: array([1, 6, 8])
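Note that DataFrame.lookup was deprecated in pandas 1.2 and removed in 2.0. On newer versions, a rough equivalent using NumPy fancy indexing (assuming, as in the example, that 'd' holds integer column positions):
import numpy as np
import pandas as pd

df2 = pd.DataFrame(np.array([[1, 2, 3, 0], [4, 5, 6, 2], [7, 8, 9, 1]]),
                   columns=['a', 'b', 'c', 'd'])

# For each row, pick the value at the column position stored in 'd'.
df2['e'] = df2.to_numpy()[np.arange(len(df2)), df2['d'].to_numpy()]
# df2['e'] -> [1, 6, 8]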

How to sum a slice from a pandas dataframe

I'm trying to sum a portion of the sessions in my dictionary so I can get totals for the current and previous week.
I've converted the JSON into a pandas dataframe in one test. I'm summing the total of the sessions using the .sum() function in pandas. However, I also need to know the total sessions from this week and the week prior. I've tried a few methods to sum values (-1:-7) and (-8:-15), but I'm pretty sure I need to use .iloc.
IN:
response = requests.get("url")
data = response.json()
df=pd.DataFrame(data['DailyUsage'])
total_sessions = df['Sessions'].sum()
current_week= df['Sessions'].iloc[-1:-7]
print(current_week)
total_sessions =['current_week'].sum
OUT:
Series([], Name: Sessions, dtype: int64)
AttributeError 'list' object has no attribute 'sum'
Note: I've tried this with and without pd.to_numeric and also with variations on the syntax of the slice and sum methods. Pandas doesn't feel very Pythonic and I'm out of ideas as to what to try next.
Assuming that df['Sessions'] holds one value per day and you are comparing only the current and previous week, you can use reshape to build weekly sums from the last 14 values.
weekly_matrix = df['Sessions'][:-15:-1].values.reshape((2, 7))
Then, you can sum each row and get the weekly sum, most recent will be the first element.
import numpy as np
weekly_sum = np.sum(weekly_matrix, axis=1)
current_week = weekly_sum[0]
previous_week = weekly_sum[1]
EDIT: how the code works
Let's take the 1D array accessed via the values attribute of the pandas Series. It contains the last 14 days, ordered from most recent to oldest; I will call it x.
x = array([14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1])
The array's reshape function is then called on x to split this data into a 2D-array (matrix) with 2 rows and 7 columns.
The default behavior of the reshape function is to first fill all columns in a row before moving to the next row. Therefore, x[0] will be the element (1,1) in the reshaped array, x[1] will be the element (1,2), and so on. After the element (1,7) is filled with x[6] (ending the current week), the next element x[7] will then be placed in (2,1). This continues until finishing the reshape operation, with the placement of x[13] in (2,7).
This results in placing the first 7 elements of x (current week) in the first row, and the last 7 elements of x (previous week) in the second row. This was called weekly_matrix.
weekly_matrix = x.reshape((2, 7))
# weekly_matrix = array([[14, 13, 12, 11, 10, 9, 8],
# [ 7, 6, 5, 4, 3, 2, 1]])
Since now we have the values of each week organized in a matrix, we can use numpy.sum function to finish our operation. numpy.sum can take an axis argument, which will control how the value is computed:
if axis=None, all elements are added into a grand total.
if axis=0, all rows in each column are added. For weekly_matrix this gives a 7-element 1D array ([21, 19, 17, 15, 13, 11, 9]), which is not what we want, since it adds together the equivalent day of each week.
if axis=1 (as in the solution), all columns in each row are added, producing a 2-element 1D array for weekly_matrix. The order of this result follows the order of the rows in the matrix (element 0 is the total of the first row, element 1 the total of the second row). Since the first row is the current week and the second row is the previous week, we can extract each total by index:
# weekly_sum = array([77, 28])
current_week = weekly_sum[0] # sum of [14, 13, 12, 11, 10, 9, 8] = 77
previous_week = weekly_sum[1] # sum of [ 7, 6, 5, 4, 3, 2, 1] = 28
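Putting the pieces together, a minimal runnable sketch on synthetic data (assuming 'Sessions' is ordered oldest to newest, one row per day):
import numpy as np
import pandas as pd

df = pd.DataFrame({'Sessions': np.arange(1, 22)})                 # 21 hypothetical daily counts, oldest first

weekly_matrix = df['Sessions'][:-15:-1].values.reshape((2, 7))    # last 14 days, newest first
weekly_sum = np.sum(weekly_matrix, axis=1)

current_week = weekly_sum[0]     # 126 (sum of days 15-21)
previous_week = weekly_sum[1]    # 77  (sum of days 8-14)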
To group and sum by a fixed number of values, for instance with daily data and weekly aggregation, consider groupby. You can do this forwards or backwards by slicing your series as appropriate:
np.random.seed(0)
df = pd.DataFrame({'col': np.random.randint(0, 10, 21)})
print(df['col'].values)
# array([5, 0, 3, 3, 7, 9, 3, 5, 2, 4, 7, 6, 8, 8, 1, 6, 7, 7, 8, 1, 5])
# forwards groupby
res = df['col'].groupby(df.index // 7).sum()
# 0 30
# 1 40
# 2 35
# Name: col, dtype: int32
# backwards groupby
df['col'].iloc[::-1].reset_index(drop=True).groupby(df.index // 7).sum()
# 0 35
# 1 40
# 2 30
# Name: col, dtype: int32

Delete rows at select indexes from a numpy array

In my dataset I have close to 200 rows, but for a minimal working example let's assume the following array:
arr = np.array([[1,2,3,4], [5,6,7,8],
[9,10,11,12], [13,14,15,16],
[17,18,19,20], [21,22,23,24]])
I can take a random sampling of 3 of the rows as follows:
indexes = np.random.choice(np.arange(arr.shape[0]), int(arr.shape[0]/2), replace=False)
Using these indexes, I can select my test cases as follows:
testing = arr[indexes]
I want to delete the rows at these indexes and I can use the remaining elements for my training set.
From the post here, it seems that training = np.delete(arr, indexes) ought to do it. But I get a 1D array instead.
I also tried the suggestion here using training = arr[indexes.astype(np.bool)] but it did not give a clean separation. I get element [5,6,7,8] in both the training and testing sets.
training = arr[indexes.astype(np.bool)]
testing
Out[101]:
array([[13, 14, 15, 16],
[ 5, 6, 7, 8],
[17, 18, 19, 20]])
training
Out[102]:
array([[ 1, 2, 3, 4],
[ 5, 6, 7, 8],
[ 9, 10, 11, 12]])
Any idea what I am doing wrong? Thanks.
To delete indexed rows from numpy array:
arr = np.delete(arr, indexes, axis=0)
One approach would be to get the remaining row indices with np.setdiff1d and then use those row indices to get the desired output -
out = arr[np.setdiff1d(np.arange(arr.shape[0]), indexes)]
Or use np.in1d to leverage boolean indexing -
out = arr[~np.in1d(np.arange(arr.shape[0]), indexes)]
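A variant of the same idea with an explicit boolean mask, which keeps the train/test split symmetric (a sketch, assuming rows were sampled as in the question):
import numpy as np

arr = np.array([[1, 2, 3, 4], [5, 6, 7, 8],
                [9, 10, 11, 12], [13, 14, 15, 16],
                [17, 18, 19, 20], [21, 22, 23, 24]])

indexes = np.random.choice(np.arange(arr.shape[0]), arr.shape[0] // 2, replace=False)

mask = np.ones(arr.shape[0], dtype=bool)   # True everywhere...
mask[indexes] = False                      # ...except at the sampled rows

testing = arr[indexes]    # the sampled rows
training = arr[mask]      # the remaining rows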

Python, find the index of 1D array that is filled with arrays of tuple

I have a 1D array. Each element holds a unique value, i.e. [2013 12 16 1 10], so array[0,0] would be [2013], array[0,1] would be [12], and array[0,0:2] would be [2013 12].
When I try array.index(array[0,0:5]), it raises an error saying that list indices must be integers, not tuple. How do I find the index of a specific element if the element is [2013 12 16 1 10], a tuple...?
If you have a 1D array, then array[0,0] is invalid. Try this link:
http://www.thegeekstuff.com/2013/08/python-array/
Since this is a 1D array, you do:
my_array = [2013, 12, 16, 1, 10]
position for 2013 would be:
my_array[0]
If you want to do my_array[0, 0], then you need my_array = np.array([[2013, 12, 16, 1, 10]]); with a plain nested list you would write my_array[0][0] instead.
Also check the link posted by: Musfiqur rahman
If you have a 1-dimensional array, it is simply an array of values like you mentioned, where the array could equal [2013, 12, 16, 1, 10]. You access individual items in the array with array[index]. However, slice notation actually takes up to three parameters:
array[start:end:step]
array[0, 1] is invalid, as slice syntax uses colons, not commas; 0, 1 evaluates to a tuple of two values, (0, 1). If you want to get the value 12, you need to write array[1].
Because you are talking about a 1D array, you must do the following:
myArray = [44, 62, 2013, 2, 1, 10]
#the position for 62 would be:
myArray[1]
Then if you want to get a few values from your array, you should use ":" not ",". For example:
myArray = [44, 62, 2013, 2, 1, 10]
#the position for 62, 2013, 2, 1 would be:
myArray[1:4]
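If the data is actually a 2D NumPy array of records (one row per record, which is an assumption about the layout), a sketch of finding the row index of a full record:
import numpy as np

array = np.array([[2013, 12, 16, 1, 10],
                  [2014, 1, 5, 2, 7]])

target = array[0, 0:5]                             # [2013   12   16    1   10]
idx = np.where((array == target).all(axis=1))[0]   # rows that equal target in every column
print(idx)                                         # [0]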
