pandas dataframe look ahead value

What's the best way to do this with a pandas DataFrame? I want to loop through a DataFrame and compute the difference between the current value and the next value that differs from the current one.
For example:
[13, 13, 13, 14, 13, 12]
should produce a new column like this:
[-1, -1, -1, 1, 1]

How about using diff to calculate the difference to the next row, and then back-filling the zeros with the next nonzero value:
import pandas as pd
import numpy as np
df = pd.DataFrame({"S": [13, 13, 13, 14, 13, 12]})
df.S.diff(-1).replace(0, np.nan).bfill()  # replace zeros with NaN, then back-fill
# 0   -1.0
# 1   -1.0
# 2   -1.0
# 3    1.0
# 4    1.0
# 5    NaN
# Name: S, dtype: float64
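
As a minimal sketch of how the result could be stored as the new column the question asks for (the column name look_ahead is an assumption, not from the original post):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"S": [13, 13, 13, 14, 13, 12]})

# Difference between each value and the next row; 0 means "next value is the same".
step = df["S"].diff(-1)

# Turn the zeros into NaN, then pull the next nonzero difference backwards.
df["look_ahead"] = step.replace(0, np.nan).bfill()

print(df)
```

The last row stays NaN because no later value exists to compare against, which is presumably why the expected output in the question has only five entries.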

Related

change all values in array after max value

Is there a way to change all the values past the max value of a list to its own?
For example I have the given array arranged as followed:
values = [-10,-2,0,1,3,8,10,22,18,16,12,10]
where 22 is the max value of the list.
I have the following pseudocode:
max_value = max(values)
for i in range(len(values)):
    if values[i] == max_value:
        values[i + 1] = max_value
        values[i + 2] = max_value
        # ...etc.
        break
therefore:
values = [-10,-2,0,1,3,8,10,22,22,22,22,22]

Try this:
import numpy as np
myvalues = np.array([-10, -2, 0, 1, 3, 8, 10, 22, 18, 16, 12, 10])
max_index = np.argmax(myvalues)
myvalues[max_index:] = myvalues[max_index]
print(myvalues)
The result is
[-10 -2 0 1 3 8 10 22 22 22 22 22]

If the ascending-then-descending shape of your example array is not a coincidence but always holds for your data, you could just accumulate by max:
import itertools
values[:] = itertools.accumulate(values, max)
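
A small sketch of why the shape assumption matters. The second list, dipping, is a made-up input that drops before reaching its maximum:

```python
import itertools

values = [-10, -2, 0, 1, 3, 8, 10, 22, 18, 16, 12, 10]

# Running maximum: each element becomes the largest value seen so far.
running_max = list(itertools.accumulate(values, max))
print(running_max)
# [-10, -2, 0, 1, 3, 8, 10, 22, 22, 22, 22, 22]

# Hypothetical input that dips before its maximum.
dipping = [5, 1, 9, 3]
dipping_max = list(itertools.accumulate(dipping, max))
print(dipping_max)
# [5, 5, 9, 9]  (the 1 before the maximum is changed as well)
```

For data like dipping, the running maximum also rewrites values before the overall maximum, so it only matches the requested result when the values never decrease before the peak.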

See this post: Pythonic way to find maximum value and its index in a list?
import operator
values = [-10, -2, 0, 1, 3, 8, 10, 22, 18, 16, 12, 10]
# find the maximum value and its index
index, maximum_value = max(enumerate(values), key=operator.itemgetter(1))
# replace to maximum_value after the index
for i in range(index, len(values)):
    values[i] = maximum_value

How to concatenate columns and pivot keeping columns information in pandas

I have an input df:
input_ = pd.DataFrame.from_records(
[
['X_val', 'Y_val1', 'Y_val2', 'Y_val3'],
[1, 10, 11, 31],
[2, 20, 12, 21],
[3, 30, 13, 11],])
and want to concatenate every y-value while still distinguishing where each value came from, for plotting and analysis.
I have multiple files with a variable number of Y columns. I ended up concatenating them column by column and repeating the X values to match, but I was wondering if there is a better solution, because mine is terribly tedious.
expected_output_ = pd.DataFrame.from_records(
[
['X_val', 'Y_val', 'Y_type'],
[1, 10, 'Y_val1'],
[1, 11, 'Y_val2'],
[1, 31, 'Y_val3'],
[2, 20, 'Y_val1'],
[2, 12, 'Y_val2'],
[2, 21, 'Y_val3'],
[3, 30, 'Y_val1'],
[3, 13, 'Y_val2'],
[3, 11, 'Y_val3'],])

You can use pandas.DataFrame.melt :
input_.melt(
id_vars=['X_val'],
value_vars=['Y_val1', 'Y_val2', 'Y_val3'],
var_name='Y_type',
value_name='Y_val'
).sort_values(['X_val'], ignore_index=True)
Alternatively, as suggested by @Vishnudev, you can use the following variation, which is especially handy for a large number of similarly named Y_val* columns:
input_.melt(
id_vars=['X_val'],
value_vars=input_.filter(regex='Y_val').columns,
var_name='Y_type',
value_name='Y_val'
).sort_values(['X_val'], ignore_index=True)
Output:
   X_val  Y_type  Y_val
0      1  Y_val1     10
1      1  Y_val2     11
2      1  Y_val3     31
3      2  Y_val1     20
4      2  Y_val2     12
5      2  Y_val3     21
6      3  Y_val1     30
7      3  Y_val2     13
8      3  Y_val3     11
You can rearrange the column order if you like.
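
A self-contained sketch of the melt approach. Note that the from_records call in the question treats the header row as data, so this example assumes the names are passed via columns= instead. The name long_df and the extra sort key Y_type are choices made for this illustration:

```python
import pandas as pd

# Assumption: column names are supplied explicitly rather than as a data row.
input_ = pd.DataFrame(
    [[1, 10, 11, 31], [2, 20, 12, 21], [3, 30, 13, 11]],
    columns=["X_val", "Y_val1", "Y_val2", "Y_val3"],
)

# With value_vars omitted, melt unpivots every column not listed in id_vars.
long_df = input_.melt(
    id_vars=["X_val"],
    var_name="Y_type",
    value_name="Y_val",
).sort_values(["X_val", "Y_type"], ignore_index=True)

# Reorder the columns to match the layout requested in the question.
long_df = long_df[["X_val", "Y_val", "Y_type"]]
print(long_df)
```

Sorting on both X_val and Y_type keeps the row order deterministic within each X value.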

How to sum a slice from a pandas dataframe

I'm trying to sum a portion of the sessions in my dictionary so I can get totals for the current and previous week.
I've converted the JSON into a pandas dataframe in one test. I'm summing the total of the sessions using the .sum() function in pandas. However, I also need to know the total sessions from this week and the week prior. I've tried a few methods to sum values (-1:-7) and (-8:-15), but I'm pretty sure I need to use .iloc.
IN:
response = requests.get("url")
data = response.json()
df=pd.DataFrame(data['DailyUsage'])
total_sessions = df['Sessions'].sum()
current_week= df['Sessions'].iloc[-1:-7]
print(current_week)
total_sessions =['current_week'].sum
OUT:
Series([], Name: Sessions, dtype: int64)
AttributeError: 'list' object has no attribute 'sum'
Note: I've tried this with and without pd.to_numeric and also with variations on the syntax of the slice and sum methods. Pandas doesn't feel very Pythonic and I'm out of ideas as to what to try next.

Assuming that df['Sessions'] holds each day, and you are comparing current and previous week only, you can use reshape to create a weekly sum for the last 14 values.
weekly_matrix = df['Sessions'][:-15:-1].values.reshape((2, 7))
Then you can sum each row to get the weekly totals; the most recent week will be the first element.
import numpy as np
weekly_sum = np.sum(weekly_matrix, axis=1)
current_week = weekly_sum[0]
previous_week = weekly_sum[1]
EDIT: how the code works
Let's take the 1D-array which is accessed by the values attribute of the pandas Series. It contains the last 14 days, which is ordered from most recent to the oldest. I will call it x.
x = array([14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1])
The array's reshape function is then called on x to split this data into a 2D-array (matrix) with 2 rows and 7 columns.
The default behavior of the reshape function is to first fill all columns in a row before moving to the next row. Therefore, x[0] will be the element (1,1) in the reshaped array, x[1] will be the element (1,2), and so on. After the element (1,7) is filled with x[6] (ending the current week), the next element x[7] will then be placed in (2,1). This continues until finishing the reshape operation, with the placement of x[13] in (2,7).
This results in placing the first 7 elements of x (current week) in the first row, and the last 7 elements of x (previous week) in the second row. This was called weekly_matrix.
weekly_matrix = x.reshape((2, 7))
# weekly_matrix = array([[14, 13, 12, 11, 10, 9, 8],
# [ 7, 6, 5, 4, 3, 2, 1]])
Since now we have the values of each week organized in a matrix, we can use numpy.sum function to finish our operation. numpy.sum can take an axis argument, which will control how the value is computed:
if axis=None, all elements are added into a grand total.
if axis=0, all rows in each column are added. For weekly_matrix, this gives a 7-element 1D array ([21, 19, 17, 15, 13, 11, 9]), which is not the result we want, as it adds up the equivalent days of each week.
if axis=1 (as in the solution), all columns in each row are added, producing a 2-element 1D array for weekly_matrix. The order of this result follows the order of the rows in the matrix (i.e., element 0 is the total of the first row, and element 1 is the total of the second row). Since the first row is the current week and the second row is the previous week, we can extract the totals using these indexes:
# weekly_sum = array([77, 28])
current_week = weekly_sum[0] # sum of [14, 13, 12, 11, 10, 9, 8] = 77
previous_week = weekly_sum[1] # sum of [ 7, 6, 5, 4, 3, 2, 1] = 28
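
The question also mentions .iloc. A minimal sketch of that route, assuming one row per day ordered oldest to newest (the 14-row frame below is a hypothetical stand-in for data['DailyUsage']):

```python
import pandas as pd

# Hypothetical stand-in data: 14 days of sessions, oldest first.
df = pd.DataFrame({"Sessions": range(1, 15)})

# Negative slices count from the end; start must come before stop.
current_week = df["Sessions"].iloc[-7:].sum()       # last 7 rows
previous_week = df["Sessions"].iloc[-14:-7].sum()   # the 7 rows before those

print(current_week, previous_week)  # 77 28
```

The original attempt, .iloc[-1:-7], returns an empty Series because the slice starts at the last row and, with the default step of 1, has nowhere to go before reaching the stop position.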

To group and sum by a fixed number of values, for instance with daily data and weekly aggregation, consider groupby. You can do this forwards or backwards by slicing your series as appropriate:
import numpy as np
import pandas as pd
np.random.seed(0)
df = pd.DataFrame({'col': np.random.randint(0, 10, 21)})
print(df['col'].values)
# array([5, 0, 3, 3, 7, 9, 3, 5, 2, 4, 7, 6, 8, 8, 1, 6, 7, 7, 8, 1, 5])
# forwards groupby
res = df['col'].groupby(df.index // 7).sum()
# 0 30
# 1 40
# 2 35
# Name: col, dtype: int32
# backwards groupby
df['col'].iloc[::-1].reset_index(drop=True).groupby(df.index // 7).sum()
# 0 35
# 1 40
# 2 30
# Name: col, dtype: int32

Cumulative subtraction from first row

I have one series and one DataFrame, all integers.
s = [10,
10,
10]
m = [[0,0,0,0,3,4,5],
[0,0,0,0,1,1,1],
[10,0,0,0,0,5,5]]
I want to return a matrix containing the cumulative differences to take the place of the existing number.
Output:
n = [[10,10,10,10,7,3,-2],
[10,10,10,10,9,8,7],
[0,0,0,0,0,-5,-10]]

Calculate the row-wise cumsum of the DataFrame first, then subtract the Series from it and negate the result:
import pandas as pd
s = pd.Series(s)
df = pd.DataFrame(m)
-df.cumsum(1).sub(s, axis=0)
# 0 1 2 3 4 5 6
#0 10 10 10 10 7 3 -2
#1 10 10 10 10 9 8 7
#2 0 0 0 0 0 -5 -10
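
The same arithmetic can be sketched in plain NumPy to make the rule explicit: each output cell is the row's starting value from s minus the running total of that row of m. This is an illustrative variant, not the answer's own code:

```python
import numpy as np

s = [10, 10, 10]
m = [[0, 0, 0, 0, 3, 4, 5],
     [0, 0, 0, 0, 1, 1, 1],
     [10, 0, 0, 0, 0, 5, 5]]

# n[i][j] = s[i] - (m[i][0] + ... + m[i][j])
# s is reshaped to a column so it broadcasts across each row.
n = np.asarray(s)[:, None] - np.cumsum(m, axis=1)
print(n)
```

For the first row, the running totals are 0, 0, 0, 0, 3, 7, 12, so subtracting them from 10 gives 10, 10, 10, 10, 7, 3, -2.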

You can directly compute a cumulative difference using np.subtract.accumulate:
# make a copy
>>> import numpy as np
>>> n = np.array(m)
# replace first column
>>> n[:, 0] = s - n[:, 0]
# subtract in-place
>>> np.subtract.accumulate(n, axis=1, out=n)
array([[ 10, 10, 10, 10, 7, 3, -2],
[ 10, 10, 10, 10, 9, 8, 7],
[ 0, 0, 0, 0, 0, -5, -10]])

generate "category-intervals" from categories

I want to generate "category intervals" from categories.
for example, suppose I have the following :
>>> df['start'].describe()
count 259431.000000
mean 10.435858
std 5.504730
min 0.000000
25% 6.000000
50% 11.000000
75% 15.000000
max 20.000000
Name: start, dtype: float64
and unique value of my column are:
array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,
17, 18, 19, 20], dtype=int8)
but I want to use the following list of intervals:
>>> intervals
[[0, 2.2222222222222223],
[2.2222222222222223, 4.4444444444444446],
[4.4444444444444446, 6.666666666666667],
[6.666666666666667, 8.8888888888888893],
[8.8888888888888893, 11.111111111111111],
[11.111111111111111, 13.333333333333332],
[13.333333333333332, 15.555555555555554],
[15.555555555555554, 17.777777777777775],
[17.777777777777775, 20]]
I want to replace my column 'start' with values x, where x is the index of the interval that contains df['start'] (so x will range from 0 to 8 in my case).
Is there a reasonably simple way to do this using pandas/numpy?
In advance, thanks a lot for the help.
Regards.

You can use np.digitize:
import numpy as np
import pandas as pd
df = pd.DataFrame(dict(start=np.random.randint(0, 21, 10000)))  # integers in [0, 20]; random_integers is deprecated
# the left-hand edges of each "interval"
intervals = np.linspace(0, 20, 9, endpoint=False)
print(intervals)
# [ 0. 2.22222222 4.44444444 6.66666667 8.88888889
# 11.11111111 13.33333333 15.55555556 17.77777778]
df['start_idx'] = np.digitize(df['start'], intervals) - 1
print(df.head())
# start start_idx
# 0 8 3
# 1 16 7
# 2 0 0
# 3 7 3
# 4 0 0
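
As a possible alternative, pandas.cut can produce the same interval indexes directly. This is a sketch using a different technique from the answer above. It assumes integer data as in the question, so no value falls exactly on an interior edge, where the two methods close their intervals on opposite sides:

```python
import numpy as np
import pandas as pd

# One row per unique value from the question (0 through 20).
df = pd.DataFrame({"start": np.arange(0, 21)})

# 10 edges define 9 bins: 0, 2.22..., 4.44..., ..., 20
edges = np.linspace(0, 20, 10)

# labels=False returns the bin index; include_lowest=True keeps 0 in the first bin.
df["start_idx"] = pd.cut(df["start"], bins=edges, labels=False, include_lowest=True)

# Cross-check against np.digitize on the left-hand edges.
digitized = np.digitize(df["start"].to_numpy(), edges[:-1]) - 1
print(df.tail())
```

pandas.cut uses right-closed intervals by default, while np.digitize treats them as left-closed, so results would differ for values that sit exactly on an interior edge.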
