Python: Create some sort of cumsum between two columns

I am trying to figure out how to get a running total using multiple columns, and I can't figure out where to even start. I've used cumsum before, but only on a single column, and that won't work here.
I have this table:
Index  A   B   C
1      10  12  20
2      10  14  20
3      10   6  20
I am trying to build out this table that looks like this:
Index  A   B   C   D
1      10  12  20  10
2      10  14  20  18
3      10   6  20  24
The formula for D is as follows:
D1 = A1 (the first value of D comes from column A)
Dn = (Dn-1 - Bn-1) + Cn-1, e.g. D2 = (D1 - B1) + C1
Any ideas on how I could do this? I am totally out of ideas.

This should work:
df.loc[0, 'New_Inventory'] = df.loc[0, 'Inventory']
for i in range(1, len(df)):
    df.loc[i, 'New_Inventory'] = df.loc[i-1, 'Inventory'] - df.loc[i-1, 'Booked'] - abs(df.loc[i-1, 'New_Inventory'])
df.New_Inventory = df.New_Inventory.astype(int)
df
#       Index  Inventory  Booked  New_Inventory
# 0  1/1/2020         10      12             10
# 1  1/2/2020         10      14            -12
# 2  1/3/2020         10       6            -16

You can get your answer by using shift:
import pandas as pd
raw_data = {'Index': ['1/1/2020', '1/2/2020', '1/3/2020', '1/4/2020', '1/5/2020'],
            'Inventory': [10, 10, 10, 10, 10],
            'Booked': [12, 14, 6, 3, 5]}
df = pd.DataFrame(raw_data)
df['New_Inventory'] = 10  # need to initialize
df['New_Inventory'] = df['Inventory'] - df['Booked'].shift(1) - df['New_Inventory'].shift(1)
df
Your requested output seems wrong; the calculation for New_Inventory above is what was requested.
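For completeness, here is a minimal sketch that implements the question's formula literally (D1 = A1, then Dn = Dn-1 - Bn-1 + Cn-1), using the column names A, B, C from the question; it reproduces the requested 10, 18, 24:
import pandas as pd

df = pd.DataFrame({'A': [10, 10, 10], 'B': [12, 14, 6], 'C': [20, 20, 20]})

d = [df.loc[0, 'A']]  # D1 takes its value from column A
for i in range(1, len(df)):
    # Dn = D(n-1) - B(n-1) + C(n-1)
    d.append(d[-1] - df.loc[i - 1, 'B'] + df.loc[i - 1, 'C'])
df['D'] = d  # -> 10, 18, 24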

Related

Setting values in pandas df using location

I have two dataframes as such:
import numpy as np
import pandas as pd

df_pos = pd.DataFrame(
    data=[[5, 4, 3, 6, 0, 7, 1, 2], [2, 5, 3, 6, 4, 7, 1, 0]]
)
df_value = pd.DataFrame(
    data=[np.arange(10 + i, 50 + i, 5) for i in range(0, 2)]
)
and I want a new dataframe df_final where df_pos gives the target position and df_value the corresponding value.
I can do it like this:
df_value_copy = df_value.copy()
for i in range(len(df_pos)):
    df_value_copy.iloc[i, df_pos.iloc[i, :]] = df_value.iloc[i].values
df_final = df_value_copy
However, I have very large dataframes for which this would be way too slow, so I want to see whether there is a smarter way to do it.
We can also try np.put_along_axis to place df_value into df_final based on df_pos:
df_final = df_value.copy()
np.put_along_axis(
    arr=df_final.values,     # destination array
    indices=df_pos.values,   # indices
    values=df_value.values,  # source values
    axis=1                   # along axis
)
The arguments do not need to be keyword arguments; they can be positional:
df_final = df_value.copy()
np.put_along_axis(df_final.values, df_pos.values, df_value.values, 1)
df_final:
0 1 2 3 4 5 6 7
0 30 40 45 20 15 10 25 35
1 46 41 11 21 31 16 26 36
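Note that np.put_along_axis writes into df_final.values in place. A caveat worth flagging: this relies on .values exposing the frame's underlying array as a view, which holds for a single-dtype frame like this one; with mixed dtypes, .values may return a copy and the writes would be silently lost.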
You can try setting values with numpy advanced indexing:
df_final = df_value.copy()
df_final.values[np.arange(len(df_pos))[:,None], df_pos.values] = df_value.values
df_final
0 1 2 3 4 5 6 7
0 30 40 45 20 15 10 25 35
1 46 41 11 21 31 16 26 36
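As a quick sanity check (a sketch, reusing the loop from the question), both vectorized approaches should reproduce the original result:
df_value_copy = df_value.copy()
for i in range(len(df_pos)):
    df_value_copy.iloc[i, df_pos.iloc[i, :]] = df_value.iloc[i].values
assert (df_final.values == df_value_copy.values).all()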

Replicating data in the same DataFrame

I want to replicate the data from the same dataframe when a certain condition is fulfilled.
Dataframe:
Hour,Wage
1,15
2,17
4,20
10,25
15,26
16,30
17,40
19,15
I want to replicate rows while looping through the dataframe whenever there is a difference greater than 4 between consecutive row.Hour values.
Expected Output:
Hour,Wage
1,15
2,17
4,20
10,25
15,26
16,30
17,40
19,15
2,17
4,20
I want to replicate rows when iterating through all the rows and there is a difference greater than 4 in row.Hour.
For example, row.Hour[0] = 1 and row.Hour[1] = 2: here the difference is 1. But between row.Hour[2] = 4 and row.Hour[3] = 10 the difference is 6, which is greater than 4. I want to replicate the data above the index where this condition (greater than 4) is fulfilled.
I can replicate the data with df = pd.concat([df]*2, ignore_index=False), but it does not replicate when I run it with an if statement.
I tried the code below but nothing happens:
for i in range(0, len(df) - 1):
    if (df.iloc[i, 0] - df.iloc[i+1, 0]) > 4:
        df = pd.concat([df]*2, ignore_index=False)
My understanding is: you want to compare 'Hour' values for two successive rows.
If the difference is > 4 you want to add the previous row to the DF.
If that is what you want, try this:
Create a DF:
j = pd.DataFrame({'Hour': [1, 2, 4, 10, 15, 16, 17, 19],
                  'Wage': [15, 17, 20, 25, 26, 30, 40, 15]})
Define a function:
def f1(d):
    dn = d.copy()
    for x in range(len(d) - 2):
        if abs(d.iloc[x+1].Hour - d.iloc[x+2].Hour) > 4:
            idx = x + 0.5  # fractional label so the new row sorts into place
            dn.loc[idx] = d.iloc[x]['Hour'], d.iloc[x]['Wage']
    dn = dn.sort_index().reset_index(drop=True)
    return dn
Call the function passing your DF:
nd = f1(j)
Hour Wage
0 1 15
1 2 17
2 2 17
3 4 20
4 4 20
5 10 25
6 15 26
7 16 30
8 17 40
9 19 15
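For larger frames, a loop-free sketch of the same condition (assuming the j DataFrame above): flag row x when the Hour gap between rows x+1 and x+2 exceeds 4, then append the flagged rows at the end, matching the asker's expected output:
gap = (j['Hour'].shift(-2) - j['Hour'].shift(-1)).abs() > 4  # gap between the next two rows
result = pd.concat([j, j[gap]], ignore_index=True)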
In the line
if df.iloc[i,0] - df.iloc[i+1,0] > 4
you calculate 4 - 10 instead of 10 - 4, so you check -6 > 4 instead of 6 > 4.
You have to swap the items:
if df.iloc[i+1,0] - df.iloc[i,0] > 4
or use abs() if you want to replicate in both situations, > 4 and < -4:
if abs(df.iloc[i+1,0] - df.iloc[i,0]) > 4
If you used print(df.iloc[i,0] - df.iloc[i+1,0]) (or a debugger) you would see it.

Creating a dataframe from several lists of lists

I need to build a dataframe from 10 lists of lists. I did it manually, but it takes time. What is a better way to do it?
I tried doing it manually and it works fine (#1).
I tried code (#2) for better performance, but it returns only the last column.
#1
import pandas as pd
import numpy as np
a1T = [[7, 8, 9], [10, 11, 12], [13, 14, 15]]
a2T = [[1, 2, 3], [5, 0, 2], [3, 4, 5]]
print(a1T)
# Output: [[7, 8, 9], [10, 11, 12], [13, 14, 15]]
vis1 = np.array(a1T)
vis_1_1 = vis1.T
tmp2 = np.array(a2T)
tmp_2_1 = tmp2.T
X = np.column_stack([vis_1_1, tmp_2_1])
dataset_all = pd.DataFrame({"Visab1": X[:, 0], "Visab2": X[:, 1], "Visab3": X[:, 2],
                            "Temp1": X[:, 3], "Temp2": X[:, 4], "Temp3": X[:, 5]})
print(dataset_all)
Output: Visab1 Visab2 Visab3 Temp1 Temp2 Temp3
0 7 10 13 1 5 3
1 8 11 14 2 0 4
2 9 12 15 3 2 5
> Actually I have a varying number of columns in the dataframe (500-1500), which is why I need auto-generated column names. The extra index (1, 2, 3) after the names Visab, Temp and so on is constant for every case. See the code below.
For better performance I tried code #2:
n = 3  # varying parameter; affects the number of columns in the table
m = 2  # constant for every case; here it is 2 because we have "Visab" and "Temp"
mlist = ('Visab', 'Temp')
nlist = [range(1, n)]
for j in range(1, n):
    for i in range(1, m):
        col = i + (j - 1) * n
        dataset_all = pd.DataFrame({mlist[j] + str(i): X[:, col]})
I expect output like
Visab1 Visab2 Visab3 Temp1 Temp2 Temp3
0 7 10 13 1 5 3
1 8 11 14 2 0 4
2 9 12 15 3 2 5
but there is no result (only the error "expected an indented block").
Ok, so the number of columns n is the number of sublists in each list, right? You can measure that with len:
len(a1T)
#Output
3
I'll simplify the answer above so you don't need X and add automatic column-names creation:
my_lists = [a1T, a2T]
my_names = ["Visab", "Temp"]
dfs = []
for one_list, name in zip(my_lists, my_names):
    n_columns = len(one_list)
    col_names = [name + "_" + str(n) for n in range(n_columns)]
    df = pd.DataFrame(one_list).T
    df.columns = col_names
    dfs.append(df)
dataset_all = pd.concat(dfs, axis=1)
#Output
Visab_0 Visab_1 Visab_2 Temp_0 Temp_1 Temp_2
0 7 10 13 1 5 3
1 8 11 14 2 0 4
2 9 12 15 3 2 5
Now it's much clearer. So you have:
X=np.column_stack([vis_1_1, tmp_2_1])
Let's create a list with the names of the columns:
columns_names = ["Visab1","Visab2","Visab3","Temp1","Temp2","Temp3"]
Now you can directly make a dataframe like this:
dataset_all = pd.DataFrame(X,columns=columns_names)
#Output
Visab1 Visab2 Visab3 Temp1 Temp2 Temp3
0 7 10 13 1 5 3
1 8 11 14 2 0 4
2 9 12 15 3 2 5
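Since the real data has 500-1500 columns, the names list can be generated rather than typed out. A minimal sketch, assuming (as above) that X stacks all Visab columns first and then all Temp columns:
n = vis_1_1.shape[1]  # number of Visab columns
m = tmp_2_1.shape[1]  # number of Temp columns
columns_names = ["Visab" + str(i) for i in range(1, n + 1)] + ["Temp" + str(i) for i in range(1, m + 1)]
dataset_all = pd.DataFrame(X, columns=columns_names)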

Random sampling pandas based on column values

I have files (A, B, C, etc.), each having 12,000 data points. I have divided the files into batches of 1,000 points and computed a value for each batch. So for each file we have 12 values, which are loaded into a pandas DataFrame (shown below).
file value_1 value_2
0 A 1 43
1 A 1 89
2 A 1 22
3 A 1 87
4 A 1 43
5 A 1 89
6 A 1 22
7 A 1 87
8 A 1 43
9 A 1 89
10 A 1 22
11 A 1 87
12 A 1 83
13 B 0 99
14 B 0 23
15 B 0 29
16 B 0 34
17 B 0 99
18 B 0 23
19 B 0 29
20 B 0 34
21 B 0 99
22 B 0 23
23 B 0 29
24 B 0 34
25 C 1 62
- - - -
- - - -
Now, as the next step, I need to randomly select a file, and for that file randomly select a sequence of 4 batches for value_1. The latter, I believe, can be done with df.sample(), but I'm not sure how to randomly select the file. I tried to make it work with np.random.choice(data['file'].unique()), but it doesn't seem correct.
Thanks for the help in advance. I'm pretty new to pandas and python in general.
If I understand what you are trying to get at, the following should be of help:
# Test dataframe
import numpy as np
import pandas as pd
data = pd.DataFrame({'file': np.repeat(['A', 'B', 'C'], 12),
                     'value_1': np.repeat([1, 0, 1], 12),
                     'value_2': np.random.randint(20, 100, 36)})
# Select a file
data1 = data[data.file == np.random.choice(data['file'].unique())].reset_index(drop=True)
# Get a random start index from data1, leaving room for a window of 4
start_ix = np.random.choice(data1.index[:-3])
# Get the sequence of 4 rows starting at the random index
# (note: data1, not data, since data1 was re-indexed after filtering)
print(data1.loc[start_ix:start_ix+3])
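An equivalent, slightly more compact way to pick the window start (a sketch, reusing data1 from above) is np.random.randint:
start_ix = np.random.randint(0, len(data1) - 3)  # highest start still leaves room for 4 rows
window = data1.iloc[start_ix:start_ix + 4]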
Here's a rather long-winded answer that has a lot of flexibility and uses some random data I generated. I also added a field to the dataframe to denote whether that row has been used.
Generating Data
import pandas as pd
from string import ascii_lowercase
import random
random.seed(44)
files = [ascii_lowercase[i] for i in range(4)]
value_1 = random.sample(range(1, 10), 8)
files_df = files * len(value_1)
value_1_df = value_1 * len(files)
value_1_df.sort()
value_2_df = random.sample(range(100, 200), len(files_df))
df = pd.DataFrame({'file': files_df,
                   'value_1': value_1_df,
                   'value_2': value_2_df,
                   'used': 0})
Randomly Selecting Files
len_to_run = 3  # change to run for however long you'd like
batch_to_pull = 4
updated_files = df.loc[df.used == 0, 'file'].unique()
for i in range(len_to_run):  # not needed if you only want to run once
    file_to_pull = ''.join(random.sample(list(updated_files), 1))
    print('file ' + file_to_pull)
    for j in range(batch_to_pull):  # pulling 4 values
        updated_value_1 = df.loc[(df.used == 0) & (df.file == file_to_pull), 'value_1'].unique()
        value_1_to_pull = random.sample(list(updated_value_1), 1)
        print('value_1 ' + str(value_1_to_pull))
        df.loc[(df.file == file_to_pull) & (df.value_1 == value_1_to_pull[0]), 'used'] = 1
Output:
file a
value_1 [1]
value_1 [7]
value_1 [5]
value_1 [4]
file d
value_1 [3]
value_1 [2]
value_1 [1]
value_1 [5]
file d
value_1 [7]
value_1 [4]
value_1 [6]
value_1 [9]

Broadcasting a groupby scalar with a boolean filter in pandas

I have a data frame as below:
import pandas as pd
df = pd.DataFrame({'var1': list('a' * 3) + list('b' * 2) + list('c' * 4),
                   'var2': [i for i in range(9)],
                   'var3': [20, 40, 100, 10, 80, 12, 24, 53, 90]})
The end result I want is the following:
var1 var2 var3 var3_lt_50
0 a 0 20 60
1 a 1 40 60
2 a 2 100 60
3 b 3 10 10
4 b 4 80 10
5 c 5 12 36
6 c 6 24 36
7 c 7 53 36
8 c 8 90 36
I get this result in two steps, through a group-by and a merge, according to the code below:
df = df.merge(df[df.var3 < 50][['var1', 'var3']]
                .groupby('var1', as_index=False)
                .sum()
                .rename(columns={'var3': 'var3_lt_50'}),
              how='left',
              left_on='var1',
              right_on='var1')
Can someone show me a way of doing this kind of boolean filter plus broadcast of a per-group scalar without the groupby + merge steps I'm doing today? I want a smoother line of code.
Thanks in advance for the input,
/Swepab
You can use groupby.transform, which keeps the shape and index of the transformed variable, so you can assign the result straight back to the data frame:
df['var3_lt_50'] = df.groupby('var1').var3.transform(lambda g: g[g < 50].sum())
df
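A variant of the same idea (a sketch) avoids the Python-level lambda by zeroing out values >= 50 first and then using the built-in 'sum' transform, which tends to be faster on large frames:
df['var3_lt_50'] = df['var3'].where(df['var3'] < 50, 0).groupby(df['var1']).transform('sum')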
