Python: Create some sort of cumsum between two columns

I am trying to figure out how to get a running total using multiple columns, and I can't figure out where to even start. I've used cumsum before, but only on a single column, and that won't work here.
I have this table:
Index  A   B   C
1      10  12  20
2      10  14  20
3      10   6  20
I am trying to build out this table that looks like this:
Index  A   B   C   D
1      10  12  20  10
2      10  14  20  18
3      10   6  20  24
The formula for D is as follows:
D1 = A1 (the first value of D comes from column A)
Dn = (Dn-1 - Bn-1) + Cn-1, e.g. D2 = (D1 - B1) + C1
Any ideas on how I could do this? I am totally out of ideas.

This should work:
df.loc[0, 'New_Inventory'] = df.loc[0, 'Inventory']
for i in range(1, len(df)):
    df.loc[i, 'New_Inventory'] = df.loc[i-1, 'Inventory'] - df.loc[i-1, 'Booked'] - abs(df.loc[i-1, 'New_Inventory'])
df.New_Inventory = df.New_Inventory.astype(int)
df
#       Index  Inventory  Booked  New_Inventory
# 0  1/1/2020         10      12             10
# 1  1/2/2020         10      14            -12
# 2  1/3/2020         10       6            -16

You can get your answer by using shift:
import pandas as pd
raw_data = {'Index': ['1/1/2020', '1/2/2020', '1/3/2020', '1/4/2020', '1/5/2020'],
            'Inventory': [10, 10, 10, 10, 10],
            'Booked': [12, 14, 6, 3, 5]}
df = pd.DataFrame(raw_data)
df['New_Inventory'] = 10  # need to initialize
df['New_Inventory'] = df['Inventory'] - df['Booked'].shift(1) - df['New_Inventory'].shift(1)
df
Your requested output seems wrong; the calculation for New_Inventory above is what was requested.
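For completeness, here is a minimal sketch that implements the question's formula literally (D1 = A1, then Dn = Dn-1 - Bn-1 + Cn-1), using the column names A, B, C from the question; it reproduces the requested 10, 18, 24:
import pandas as pd

df = pd.DataFrame({'A': [10, 10, 10], 'B': [12, 14, 6], 'C': [20, 20, 20]})

d = [df.loc[0, 'A']]  # D1 takes its value from column A
for i in range(1, len(df)):
    # Dn = D(n-1) - B(n-1) + C(n-1)
    d.append(d[-1] - df.loc[i - 1, 'B'] + df.loc[i - 1, 'C'])
df['D'] = d  # -> 10, 18, 24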

Related

Setting values in pandas df using location

I have two dataframes as such:
import numpy as np
import pandas as pd

df_pos = pd.DataFrame(
    data=[[5, 4, 3, 6, 0, 7, 1, 2], [2, 5, 3, 6, 4, 7, 1, 0]]
)
df_value = pd.DataFrame(
    data=[np.arange(10 + i, 50 + i, 5) for i in range(0, 2)]
)
and I want a new dataframe df_final where df_pos gives the target position and df_value the corresponding value.
I can do it like this:
df_value_copy = df_value.copy()
for i in range(len(df_pos)):
    df_value_copy.iloc[i, df_pos.iloc[i, :]] = df_value.iloc[i].values
df_final = df_value_copy
However, I have very large dataframes for which this would be way too slow, so I want to see whether there is a smarter way to do it.
We can also try np.put_along_axis to place df_value into df_final based on df_pos:
df_final = df_value.copy()
np.put_along_axis(
    arr=df_final.values,     # destination array
    indices=df_pos.values,   # indices
    values=df_value.values,  # source values
    axis=1                   # along axis
)
The arguments do not need to be keyword arguments; they can be positional:
df_final = df_value.copy()
np.put_along_axis(df_final.values, df_pos.values, df_value.values, 1)
df_final:
0 1 2 3 4 5 6 7
0 30 40 45 20 15 10 25 35
1 46 41 11 21 31 16 26 36
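Note that np.put_along_axis writes into df_final.values in place. A caveat worth flagging: this relies on .values exposing the frame's underlying array as a view, which holds for a single-dtype frame like this one; with mixed dtypes, .values may return a copy and the writes would be silently lost.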
You can try setting values with numpy advanced indexing:
df_final = df_value.copy()
df_final.values[np.arange(len(df_pos))[:,None], df_pos.values] = df_value.values
df_final
0 1 2 3 4 5 6 7
0 30 40 45 20 15 10 25 35
1 46 41 11 21 31 16 26 36
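As a quick sanity check (a sketch, reusing the loop from the question), both vectorized approaches should reproduce the original result:
df_value_copy = df_value.copy()
for i in range(len(df_pos)):
    df_value_copy.iloc[i, df_pos.iloc[i, :]] = df_value.iloc[i].values
assert (df_final.values == df_value_copy.values).all()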

Replicating data in the same DataFrame

I want to replicate the data from the same dataframe when a certain condition is fulfilled.
Dataframe:
Hour,Wage
1,15
2,17
4,20
10,25
15,26
16,30
17,40
19,15
I want to replicate rows while looping through the dataframe whenever there is a difference greater than 4 between consecutive row.Hour values.
Expected Output:
Hour,Wage
1,15
2,17
4,20
10,25
15,26
16,30
17,40
19,15
2,17
4,20
I want to replicate rows when iterating through all the rows and there is a difference greater than 4 in row.Hour.
For example, row.Hour[0] = 1 and row.Hour[1] = 2: here the difference is 1. But between row.Hour[2] = 4 and row.Hour[3] = 10 the difference is 6, which is greater than 4. I want to replicate the data above the index where this condition (greater than 4) is fulfilled.
I can replicate the data with df = pd.concat([df]*2, ignore_index=False), but it does not replicate when I run it with an if statement.
I tried the code below but nothing happens:
for i in range(0, len(df) - 1):
    if (df.iloc[i, 0] - df.iloc[i+1, 0]) > 4:
        df = pd.concat([df]*2, ignore_index=False)
My understanding is: you want to compare 'Hour' values for two successive rows.
If the difference is > 4 you want to add the previous row to the DF.
If that is what you want, try this:
Create a DF:
j = pd.DataFrame({'Hour': [1, 2, 4, 10, 15, 16, 17, 19],
                  'Wage': [15, 17, 20, 25, 26, 30, 40, 15]})
Define a function:
def f1(d):
    dn = d.copy()
    for x in range(len(d) - 2):
        if abs(d.iloc[x+1].Hour - d.iloc[x+2].Hour) > 4:
            idx = x + 0.5  # fractional label so the new row sorts into place
            dn.loc[idx] = d.iloc[x]['Hour'], d.iloc[x]['Wage']
    dn = dn.sort_index().reset_index(drop=True)
    return dn
Call the function passing your DF:
nd = f1(j)
Hour Wage
0 1 15
1 2 17
2 2 17
3 4 20
4 4 20
5 10 25
6 15 26
7 16 30
8 17 40
9 19 15
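For larger frames, a loop-free sketch of the same condition (assuming the j DataFrame above): flag row x when the Hour gap between rows x+1 and x+2 exceeds 4, then append the flagged rows at the end, matching the asker's expected output:
gap = (j['Hour'].shift(-2) - j['Hour'].shift(-1)).abs() > 4  # gap between the next two rows
result = pd.concat([j, j[gap]], ignore_index=True)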
In the line
if df.iloc[i,0] - df.iloc[i+1,0] > 4
you calculate 4 - 10 instead of 10 - 4, so you check -6 > 4 instead of 6 > 4.
You have to swap the items:
if df.iloc[i+1,0] - df.iloc[i,0] > 4
or use abs() if you want to replicate in both situations, > 4 and < -4:
if abs(df.iloc[i+1,0] - df.iloc[i,0]) > 4
If you used print(df.iloc[i,0] - df.iloc[i+1,0]) (or a debugger) you would see it.

Creating a dataframe from several lists of lists

I need to build a dataframe from 10 lists of lists. I did it manually, but it takes time. What is a better way to do it?
I tried doing it manually and it works fine (#1).
I tried code (#2) for better performance, but it returns only the last column.
#1
import pandas as pd
import numpy as np
a1T = [[7, 8, 9], [10, 11, 12], [13, 14, 15]]
a2T = [[1, 2, 3], [5, 0, 2], [3, 4, 5]]
print(a1T)
# Output: [[7, 8, 9], [10, 11, 12], [13, 14, 15]]
vis1 = np.array(a1T)
vis_1_1 = vis1.T
tmp2 = np.array(a2T)
tmp_2_1 = tmp2.T
X = np.column_stack([vis_1_1, tmp_2_1])
dataset_all = pd.DataFrame({"Visab1": X[:, 0], "Visab2": X[:, 1], "Visab3": X[:, 2],
                            "Temp1": X[:, 3], "Temp2": X[:, 4], "Temp3": X[:, 5]})
print(dataset_all)
Output: Visab1 Visab2 Visab3 Temp1 Temp2 Temp3
0 7 10 13 1 5 3
1 8 11 14 2 0 4
2 9 12 15 3 2 5
> Actually I have a varying number of columns in the dataframe (500-1500), which is why I need auto-generated column names. The extra index (1, 2, 3) after the names Visab, Temp and so on is constant for every case. See the code below.
For better performance I tried code #2:
n = 3  # varying parameter; affects the number of columns in the table
m = 2  # constant for every case; here it is 2 because we have "Visab" and "Temp"
mlist = ('Visab', 'Temp')
nlist = [range(1, n)]
for j in range(1, n):
    for i in range(1, m):
        col = i + (j - 1) * n
        dataset_all = pd.DataFrame({mlist[j] + str(i): X[:, col]})
I expect output like
Visab1 Visab2 Visab3 Temp1 Temp2 Temp3
0 7 10 13 1 5 3
1 8 11 14 2 0 4
2 9 12 15 3 2 5
but there is no result (only the error "expected an indented block").
Ok, so the number of columns n is the number of sublists in each list, right? You can measure that with len:
len(a1T)
#Output
3
I'll simplify the answer above so you don't need X and add automatic column-names creation:
my_lists = [a1T, a2T]
my_names = ["Visab", "Temp"]
dfs = []
for one_list, name in zip(my_lists, my_names):
    n_columns = len(one_list)
    col_names = [name + "_" + str(n) for n in range(n_columns)]
    df = pd.DataFrame(one_list).T
    df.columns = col_names
    dfs.append(df)
dataset_all = pd.concat(dfs, axis=1)
#Output
Visab_0 Visab_1 Visab_2 Temp_0 Temp_1 Temp_2
0 7 10 13 1 5 3
1 8 11 14 2 0 4
2 9 12 15 3 2 5
Now it's much clearer. So you have:
X=np.column_stack([vis_1_1, tmp_2_1])
Let's create a list with the names of the columns:
columns_names = ["Visab1","Visab2","Visab3","Temp1","Temp2","Temp3"]
Now you can directly make a dataframe like this:
dataset_all = pd.DataFrame(X,columns=columns_names)
#Output
Visab1 Visab2 Visab3 Temp1 Temp2 Temp3
0 7 10 13 1 5 3
1 8 11 14 2 0 4
2 9 12 15 3 2 5
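Since the real data has 500-1500 columns, the names list can be generated rather than typed out. A minimal sketch, assuming (as above) that X stacks all Visab columns first and then all Temp columns:
n = vis_1_1.shape[1]  # number of Visab columns
m = tmp_2_1.shape[1]  # number of Temp columns
columns_names = ["Visab" + str(i) for i in range(1, n + 1)] + ["Temp" + str(i) for i in range(1, m + 1)]
dataset_all = pd.DataFrame(X, columns=columns_names)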

Random sampling pandas based on column values

I have files (A, B, C, etc.), each having 12,000 data points. I have divided the files into batches of 1,000 points and computed a value for each batch. So for each file we have 12 values, which are loaded into a pandas DataFrame (shown below).
file value_1 value_2
0 A 1 43
1 A 1 89
2 A 1 22
3 A 1 87
4 A 1 43
5 A 1 89
6 A 1 22
7 A 1 87
8 A 1 43
9 A 1 89
10 A 1 22
11 A 1 87
12 A 1 83
13 B 0 99
14 B 0 23
15 B 0 29
16 B 0 34
17 B 0 99
18 B 0 23
19 B 0 29
20 B 0 34
21 B 0 99
22 B 0 23
23 B 0 29
24 B 0 34
25 C 1 62
- - - -
- - - -
Now, as the next step, I need to randomly select a file, and for that file randomly select a sequence of 4 batches for value_1. The latter, I believe, can be done with df.sample(), but I'm not sure how to randomly select the file. I tried to make it work with np.random.choice(data['file'].unique()), but it doesn't seem correct.
Thanks for the help in advance. I'm pretty new to pandas and python in general.
If I understand what you are trying to get at, the following should be of help:
# Test dataframe
import numpy as np
import pandas as pd
data = pd.DataFrame({'file': np.repeat(['A', 'B', 'C'], 12),
                     'value_1': np.repeat([1, 0, 1], 12),
                     'value_2': np.random.randint(20, 100, 36)})
# Select a file
data1 = data[data.file == np.random.choice(data['file'].unique())].reset_index(drop=True)
# Get a random start index from data1, leaving room for a window of 4
start_ix = np.random.choice(data1.index[:-3])
# Get the sequence of 4 rows starting at the random index
# (note: data1, not data, since data1 was re-indexed after filtering)
print(data1.loc[start_ix:start_ix+3])
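An equivalent, slightly more compact way to pick the window start (a sketch, reusing data1 from above) is np.random.randint:
start_ix = np.random.randint(0, len(data1) - 3)  # highest start still leaves room for 4 rows
window = data1.iloc[start_ix:start_ix + 4]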
Here's a rather long-winded answer that has a lot of flexibility and uses some random data I generated. I also added a field to the dataframe to denote whether that row has been used.
Generating Data
import pandas as pd
from string import ascii_lowercase
import random
random.seed(44)
files = [ascii_lowercase[i] for i in range(4)]
value_1 = random.sample(range(1, 10), 8)
files_df = files * len(value_1)
value_1_df = value_1 * len(files)
value_1_df.sort()
value_2_df = random.sample(range(100, 200), len(files_df))
df = pd.DataFrame({'file': files_df,
                   'value_1': value_1_df,
                   'value_2': value_2_df,
                   'used': 0})
Randomly Selecting Files
len_to_run = 3  # change to run for however long you'd like
batch_to_pull = 4
updated_files = df.loc[df.used == 0, 'file'].unique()
for i in range(len_to_run):  # not needed if you only want to run once
    file_to_pull = ''.join(random.sample(list(updated_files), 1))
    print('file ' + file_to_pull)
    for j in range(batch_to_pull):  # pulling 4 values
        updated_value_1 = df.loc[(df.used == 0) & (df.file == file_to_pull), 'value_1'].unique()
        value_1_to_pull = random.sample(list(updated_value_1), 1)
        print('value_1 ' + str(value_1_to_pull))
        df.loc[(df.file == file_to_pull) & (df.value_1 == value_1_to_pull[0]), 'used'] = 1
Output:
file a
value_1 [1]
value_1 [7]
value_1 [5]
value_1 [4]
file d
value_1 [3]
value_1 [2]
value_1 [1]
value_1 [5]
file d
value_1 [7]
value_1 [4]
value_1 [6]
value_1 [9]

Broadcasting a groupby scalar with a boolean filter in pandas

I have a data frame as below:
import pandas as pd
df = pd.DataFrame({'var1': list('a' * 3) + list('b' * 2) + list('c' * 4),
                   'var2': [i for i in range(9)],
                   'var3': [20, 40, 100, 10, 80, 12, 24, 53, 90]})
The end result I want is the following:
var1 var2 var3 var3_lt_50
0 a 0 20 60
1 a 1 40 60
2 a 2 100 60
3 b 3 10 10
4 b 4 80 10
5 c 5 12 36
6 c 6 24 36
7 c 7 53 36
8 c 8 90 36
I get this result in two steps, through a group-by and a merge, according to the code below:
df = df.merge(df[df.var3 < 50][['var1', 'var3']]
                .groupby('var1', as_index=False)
                .sum()
                .rename(columns={'var3': 'var3_lt_50'}),
              how='left',
              left_on='var1',
              right_on='var1')
Can someone show me a way of doing this kind of boolean filter plus broadcast of a per-group scalar without the groupby + merge steps I'm doing today? I want a smoother line of code.
Thanks in advance for the input,
/Swepab
You can use groupby.transform, which keeps the shape and index of the transformed variable, so you can assign the result straight back to the data frame:
df['var3_lt_50'] = df.groupby('var1').var3.transform(lambda g: g[g < 50].sum())
df
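A variant of the same idea (a sketch) avoids the Python-level lambda by zeroing out values >= 50 first and then using the built-in 'sum' transform, which tends to be faster on large frames:
df['var3_lt_50'] = df['var3'].where(df['var3'] < 50, 0).groupby(df['var1']).transform('sum')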
