Pandas dataframe column-wise calculation - python

I have a dataframe with the following columns:
Index(['Location', 'Dec-2021_x', 'Jan-2022_x', 'Feb-2022_x', 'Mar-2022_x',
'Apr-2022_x', 'May-2022_x', 'Jun-2022_x', 'Jul-2022_x', 'Aug-2022_x',
'Sep-2022_x', 'Oct-2022_x', 'Nov-2022_x', 'Dec-2022_x', 'Jan-2023_x',
'Feb-2023_x', 'Mar-2023_x', 'Apr-2023_x', 'May-2023_x', 'Jun-2023_x',
'Jul-2023_x', 'Aug-2023_x', 'Sep-2023_x', 'Oct-2023_x', 'Nov-2023_x',
'Dec-2023_x', 'Jan-2024_x', 'Feb-2024_x', 'Mar-2024_x', 'Apr-2024_x',
'May-2024_x', 'Jun-2024_x', 'Jul-2024_x', 'Aug-2024_x', 'Sep-2024_x',
'Oct-2024_x', 'Nov-2024_x', 'Dec-2024_x',
'sum_val',
'Dec-2021_y', 'Jan-2022_y', 'Feb-2022_y',
'Mar-2022_y', 'Apr-2022_y', 'May-2022_y', 'Jun-2022_y', 'Jul-2022_y',
'Aug-2022_y', 'Sep-2022_y', 'Oct-2022_y', 'Nov-2022_y', 'Dec-2022_y',
'Jan-2023_y', 'Feb-2023_y', 'Mar-2023_y', 'Apr-2023_y', 'May-2023_y',
'Jun-2023_y', 'Jul-2023_y', 'Aug-2023_y', 'Sep-2023_y', 'Oct-2023_y',
'Nov-2023_y', 'Dec-2023_y', 'Jan-2024_y', 'Feb-2024_y', 'Mar-2024_y',
'Apr-2024_y', 'May-2024_y', 'Jun-2024_y', 'Jul-2024_y', 'Aug-2024_y',
'Sep-2024_y', 'Oct-2024_y', 'Nov-2024_y', 'Dec-2024_y'],
dtype='object')
Sample dataframe with reduced columns looks like this:
df:
Location Dec-2021_x Jan-2022_x sum_val Dec-2021_y Jan-2022_y
A 212 315 1000 12 13
B 312 612 1100 13 17
C 242 712 1010 15 15
D 215 382 1001 16 17
E 252 319 1110 17 18
I have to create a resultant dataframe which will be in the below format:
Index(['Location', 'Dec-2021', 'Jan-2022', 'Feb-2022', 'Mar-2022',
'Apr-2022', 'May-2022', 'Jun-2022', 'Jul-2022', 'Aug-2022',
'Sep-2022', 'Oct-2022', 'Nov-2022', 'Dec-2022', 'Jan-2023',
'Feb-2023', 'Mar-2023', 'Apr-2023', 'May-2023', 'Jun-2023',
'Jul-2023', 'Aug-2023', 'Sep-2023', 'Oct-2023', 'Nov-2023',
'Dec-2023', 'Jan-2024', 'Feb-2024', 'Mar-2024', 'Apr-2024',
'May-2024', 'Jun-2024', 'Jul-2024', 'Aug-2024', 'Sep-2024',
'Oct-2024', 'Nov-2024', 'Dec-2024'],
dtype='object')
The calculation for each month uses the formula:
'Dec-2021' = 'Dec-2021_x' * sum_val * 'Dec-2021_y' (these are all numeric columns)
and likewise for all the other months; there are 36 months to be precise. Is there a way to do this in a loop over each month-year column pair? There are around 65,000+ rows, so I do not want to overwhelm the system.

Use:
#sample data
import numpy as np
import pandas as pd

np.random.seed(2022)
c = ['Location', 'Dec-2021_x', 'Jan-2022_x', 'Feb-2022_x', 'Mar-2022_x',
'Apr-2022_x','sum_val', 'Dec-2021_y', 'Jan-2022_y', 'Feb-2022_y',
'Mar-2022_y', 'Apr-2022_y']
df = (pd.DataFrame(np.random.randint(10, size=(5, len(c))), columns=c)
.assign(Location=list('abcde')))
print (df)
Location Dec-2021_x Jan-2022_x Feb-2022_x Mar-2022_x Apr-2022_x \
0 a 1 1 0 7 8
1 b 8 0 3 6 8
2 c 1 7 5 5 4
3 d 0 7 5 5 8
4 e 8 0 3 9 5
sum_val Dec-2021_y Jan-2022_y Feb-2022_y Mar-2022_y Apr-2022_y
0 2 8 0 5 9 1
1 0 1 2 0 5 7
2 8 2 3 1 0 4
3 2 4 0 9 4 9
4 2 1 7 2 1 7
#remove the sum_val column (it is applied separately below)
df1 = df.drop(['sum_val'], axis=1)
#move Location into the index so it is kept in the output
df1 = df1.set_index('Location')
#split column names on the last _
df1.columns = df1.columns.str.rsplit('_', n=1, expand=True)
#select the x and y DataFrames by the second level and multiply
df2 = (df1.xs('x', axis=1, level=1).mul(df['sum_val'].to_numpy(), axis=0) *
       df1.xs('y', axis=1, level=1))
print (df2)
Dec-2021 Jan-2022 Feb-2022 Mar-2022 Apr-2022
Location
a 16 0 0 126 16
b 0 0 0 0 0
c 16 168 40 0 128
d 0 0 90 40 144
e 16 0 12 18 70
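If an explicit loop over the month columns is easier to follow, the per-column arithmetic is still vectorized, so 65,000+ rows should be fine. A minimal sketch, assuming every month column appears with both an _x and a _y suffix:

months = [c[:-2] for c in df.columns if c.endswith('_x')]

res = df[['Location']].copy()
for m in months:
    # 'Dec-2021' = 'Dec-2021_x' * sum_val * 'Dec-2021_y'
    res[m] = df[f'{m}_x'] * df['sum_val'] * df[f'{m}_y']

The loop runs only over the ~36 month names, not over rows, so it performs essentially the same work as the MultiIndex version above.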


Multidimensional array restructuring like in pandas.stack

Consider the following code to create a dummy dataset
import numpy as np
from scipy.stats import norm
import pandas as pd
np.random.seed(10)
n=3
space= norm(20, 5).rvs(n)
time= norm(10,2).rvs(n)
values = np.kron(space, time).reshape(n,n) + norm(1,1).rvs([n,n])
### Output
array([[267.39784458, 300.81493866, 229.19163206],
[236.1940266 , 266.49469945, 204.01294305],
[122.55912977, 140.00957047, 106.28339745]])
I can put these data in a pandas dataframe using
space_names = ['A','B','C']
time_names = [2000,2001,2002]
df = pd.DataFrame(values, index=space_names, columns=time_names)
df
### Output
2000 2001 2002
A 267.397845 300.814939 229.191632
B 236.194027 266.494699 204.012943
C 122.559130 140.009570 106.283397
This is considered a wide dataset, where each observation lies in a table with two variables acting as coordinates to identify it.
To make it a long (tidy) dataset we can use the .stack method of the pandas DataFrame:
df.columns.name = 'time'
df.index.name = 'space'
df.stack().rename('value').reset_index()
### Output
space time value
0 A 2000 267.397845
1 A 2001 300.814939
2 A 2002 229.191632
3 B 2000 236.194027
4 B 2001 266.494699
5 B 2002 204.012943
6 C 2000 122.559130
7 C 2001 140.009570
8 C 2002 106.283397
My question is: how do I do exactly the same thing, but for a 3-dimensional dataset?
Let's imagine I have 2 observations for each space-time pair:
s = 3
t = 4
r = 2
space_mus = norm(20, 5).rvs(s)
time_mus = norm(10,2).rvs(t)
values = np.kron(space_mus, time_mus)
values = values.repeat(r).reshape(s,t,r) + norm(0,1).rvs([s,t,r])
values
### Output
array([[[286.50322099, 288.51266345],
[176.64303485, 175.38175877],
[136.01675917, 134.44328617]],
[[187.07608546, 185.4068411 ],
[112.86398438, 111.983463 ],
[ 85.99035255, 86.67236986]],
[[267.66833894, 269.45295404],
[162.30044715, 162.50564386],
[124.6374401 , 126.2315447 ]]])
How can I obtain the same structure for the dataframe as above?
Ugly solution
Personally I don't like this solution, and I think one could do it in a more elegant and Pythonic way, but it might still be useful for someone else, so I will post it.
labels = ['{}{}{}'.format(i,j,k) for i in range(s) for j in range(t) for k in range(r)] #space, time, repetition
def flatten3d(k):
    return [i for l in k for s in l for i in s]
value_series = pd.Series(flatten3d(values)).rename('y')
split_labels = [[i for i in l] for l in labels]
df = pd.DataFrame(split_labels, columns=['s','t','r'])
pd.concat([df, value_series], axis=1)
### Output
s t r y
0 0 0 0 266.2408815208753
1 0 0 1 266.13662442609433
2 0 1 0 299.53178992512954
3 0 1 1 300.13941632567605
4 0 2 0 229.39037800681405
5 0 2 1 227.22227496248507
6 0 3 0 281.76357915411995
7 0 3 1 280.9639352062619
8 1 0 0 235.8137644198259
9 1 0 1 234.23202459516452
10 1 1 0 265.19681013560034
11 1 1 1 266.5462102589883
12 1 2 0 200.730100791878
13 1 2 1 199.83217739700535
14 1 3 0 246.54018839875374
15 1 3 1 248.5496308586532
16 2 0 0 124.90916276929234
17 2 0 1 123.64788669199066
18 2 1 0 139.65391860786775
19 2 1 1 138.08044561039517
20 2 2 0 106.45276370157518
21 2 2 1 104.78351933651582
22 2 3 0 129.86043618610572
23 2 3 1 128.97991481257253
This does not use stack, but maybe it is acceptable for your problem:
import numpy as np
import pandas as pd
values = np.arange(18).reshape(3, 3, 2) # Your values here
# one index level per array axis; the last axis here has length 2, so only the first two time labels are used
index = pd.MultiIndex.from_product([space_names, space_names, time_names[:2]], names=["space1", "space2", "time"])
df = pd.DataFrame({"value": values.ravel()}, index=index).reset_index()
# df:
# space1 space2 time value
# 0 A A 2000 0
# 1 A A 2001 1
# 2 A B 2000 2
# 3 A B 2001 3
# 4 A C 2000 4
# 5 A C 2001 5
# 6 B A 2000 6
# 7 B A 2001 7
# 8 B B 2000 8
# 9 B B 2001 9
# 10 B C 2000 10
# 11 B C 2001 11
# 12 C A 2000 12
# 13 C A 2001 13
# 14 C B 2000 14
# 15 C B 2001 15
# 16 C C 2000 16
# 17 C C 2001 17
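Applied to the question's own dimensions (s space labels, t time points, r repetitions per cell), the same idea might look like the sketch below; the level names and the 'rep' label are just illustrative choices, not part of the original answer:

import numpy as np
import pandas as pd
from scipy.stats import norm

s, t, r = 3, 4, 2
space_mus = norm(20, 5).rvs(s)
time_mus = norm(10, 2).rvs(t)
values = np.kron(space_mus, time_mus).repeat(r).reshape(s, t, r) + norm(0, 1).rvs([s, t, r])

# one MultiIndex level per array axis; values.ravel() walks the axes in the same C order
index = pd.MultiIndex.from_product(
    [['A', 'B', 'C'], range(t), range(r)],
    names=['space', 'time', 'rep'])
long_df = pd.DataFrame({'value': values.ravel()}, index=index).reset_index()
print(long_df.head())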

rewriting a column's cell values in a dataframe based on when the value changes, without using an if statement

I have a column with faulty values: it is supposed to count cycles, but the device the data comes from resets the count after 50, so I was left with, for example, [1,1,1,1,2,2,2,3,3,3,3,...,50,50,50,1,1,1,2,2,2,2,3,3,3,...,50,50,.....,50].
My attempt is below, and I can't even make it work (for simplicity I made the data reset after 10 cycles):
import pandas as pd

data = {'Cyc-Count':[1,1,2,2,2,3,4,5,6,7,7,7,8,9,10,1,1,1,2,3,3,3,3,
                     4,4,5,6,6,6,7,8,8,8,8,9,10]}
df = pd.DataFrame(data)
x = 0
count = 0
old_value = df.at[x,'Cyc-Count']
for x in range(x, len(df)-1):
    if df.at[x,'Cyc-Count'] == df.at[x+1,'Cyc-Count']:
        old_value = df.at[x+1,'Cyc-Count']
        df.at[x+1,'Cyc-Count'] = count
    else:
        old_value = df.at[x+1,'Cyc-Count']
        count += 1
        df.at[x+1,'Cyc-Count'] = count
I need to fix this, but preferably without using if statements.
The desired output for the example above should be:
data = {'Cyc-Count':[1,1,2,2,2,3,4,5,6,7,7,7,8,9,10,11,11,11,12,13,13,13,13,
14,14,15,16,16,16,17,18,18,18,18,19,20]}
Hint: my method has a big issue in that the last indexed value is hard to change, since the index+1 it would be compared against does not even exist.
IIUC, you want to continue the count when the counter decreases.
You can use vectorized code:
s = df['Cyc-Count'].shift()
df['Cyc-Count2'] = (df['Cyc-Count']
                    + s.where(s.gt(df['Cyc-Count']))
                       .fillna(0, downcast='infer')
                       .cumsum()
                    )
Or, to modify the column in place:
s = df['Cyc-Count'].shift()
df['Cyc-Count'] += (s.where(s.gt(df['Cyc-Count']))
                     .fillna(0, downcast='infer').cumsum()
                    )
output:
Cyc-Count Cyc-Count2
0 1 1
1 1 1
2 1 1
3 1 1
4 2 2
5 2 2
6 2 2
7 3 3
8 3 3
9 3 3
10 3 3
11 4 4
12 5 5
13 5 5
14 5 5
15 1 6
16 1 6
17 1 6
18 2 7
19 2 7
20 2 7
21 2 7
22 3 8
23 3 8
24 3 8
25 4 9
26 5 10
27 5 10
28 1 11
29 2 12
30 2 12
31 3 13
32 4 14
33 5 15
34 5 15
used input:
l = [1,1,1,1,2,2,2,3,3,3,3,4,5,5,5,1,1,1,2,2,2,2,3,3,3,4,5,5,1,2,2,3,4,5,5]
df = pd.DataFrame({'Cyc-Count': l})
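If the counter always completes a full cycle before resetting (in the question's example it always reaches 10), a simpler variant of the same idea is to count the resets and add a fixed offset per reset. A sketch, assuming a known cycle length of 10; note it would not handle the partial cycles in the input used above, which the cumulative approach does:

cycle_len = 10
resets = df['Cyc-Count'].diff().lt(0).cumsum()  # how many resets have happened so far
df['Cyc-Count2'] = df['Cyc-Count'] + resets * cycle_len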
You can use df.loc to access a group of rows and columns by label(s) or a boolean array.
syntax: df.loc[df['column name'] condition, 'column name or the new one'] = 'value if condition is met'
for example:
import pandas as pd
numbers = {'set_of_numbers': [1,2,3,4,5,6,7,8,9,10,0,0]}
df = pd.DataFrame(numbers,columns=['set_of_numbers'])
print (df)
df.loc[df['set_of_numbers'] == 0, 'set_of_numbers'] = 999
df.loc[df['set_of_numbers'] == 5, 'set_of_numbers'] = 555
print (df)
before: ‘set_of_numbers’: [1,2,3,4,5,6,7,8,9,10,0,0]
After: ‘set_of_numbers’: [1,2,3,4,555,6,7,8,9,10,999,999]

Assigning to dask series using positional indexing

I have two DataFrames, df1 and df2, that share an index. I would like to assign the values in df1 based on a value in df2.
The standard pandas code looks like this:
df1['column1'][df2['column2']==i] = j
This populates df1 correctly when run on all the inputs.
However, the same syntax on dask DataFrames returns an error:
TypeError: 'Series' object does not support item assignment
dd.where() and dd.mask() don't seem to work as they return the original value as well.
Is there a dask equivalent to the above pandas code?
To do your task, you should:
use mask to get the new column,
save it back to column1.
To test, I used the following source DataFrames:
df1:
column1 xxx
0 1 230
1 2 160
2 3 160
3 4 190
4 5 190
5 6 260
6 7 260
7 8 260
8 9 300
df2:
column2 yyy
0 11 402
1 12 349
2 13 336
3 14 369
4 15 402
5 16 209
6 17 492
7 18 455
8 19 387
Then I set variables:
i = 15
j = 100
I created both Dask DataFrames as follows:
dd1 = dd.from_pandas(df1, chunksize=5)
dd2 = dd.from_pandas(df2, chunksize=5)
And to do the actual processing, I ran:
dd1.column1 = dd1.column1.mask(dd2['column2'] == i, j)
result = dd1.compute()
The result is:
column1 xxx
0 1 230
1 2 160
2 3 160
3 4 190
4 100 190
5 6 260
6 7 260
7 8 260
8 9 300
So the value in df1.column1 for index == 4 (where df2.column2 == 15, i.e. i)
has been set to 100 (j).
I believe that you are looking for the dask.dataframe.Series.where method. It seems to work ok for me.
https://docs.dask.org/en/latest/dataframe-api.html#dask.dataframe.Series.where
In [1]: import pandas as pd
In [2]: import dask.dataframe as dd
In [3]: s = pd.Series(range(5))
In [4]: ds = dd.from_pandas(s, npartitions=2)
In [5]: ds.where(ds > 1, 10).compute()
Out[5]:
0 10
1 10
2 2
3 3
4 4
dtype: int64
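To mirror the original assignment df1['column1'][df2['column2']==i] = j, mask (the complement of where) replaces values where the condition is True. A self-contained sketch with made-up sample frames, assuming both share the same index so the condition aligns:

import pandas as pd
import dask.dataframe as dd

pdf1 = pd.DataFrame({'column1': [1, 2, 3, 4, 5]})
pdf2 = pd.DataFrame({'column2': [11, 12, 13, 14, 15]})
ddf1 = dd.from_pandas(pdf1, npartitions=1)
ddf2 = dd.from_pandas(pdf2, npartitions=1)

i, j = 15, 100
# replace column1 with j wherever column2 equals i
ddf1['column1'] = ddf1['column1'].mask(ddf2['column2'] == i, j)
print(ddf1.compute())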

How to add dataframe column data to a range of indexes in another dataframe?

I have a dataframe called df1:
Long_ID IndexBegin IndexEnd
0 10000001 0 3
1 10000002 3 6
2 10000003 6 10
I have a second dataframe called df2, which can be up to 1 million rows long:
Short_ID
0 1
1 2
2 3
3 10
4 20
5 30
6 100
7 101
8 102
9 103
I want to link Long_ID to Short_ID in such a way that if (IndexBegin:IndexEnd) is (0:3), then Long_ID gets inserted into df2 at indexes 0 through 2 (IndexEnd - 1). The starting index and ending index are determined using the last two columns of df1.
So that ultimately, my final dataframe looks like this: df3:
Short_ID Long_ID
0 1 10000001
1 2 10000001
2 3 10000001
3 10 10000002
4 20 10000002
5 30 10000002
6 100 10000003
7 101 10000003
8 102 10000003
9 103 10000003
First, I tried storing the index of df2 as a key and Short_ID as a value in a dictionary, then iterating row by row, but that was too slow. This led me to learn about vectorization.
Then, I tried using where(), but I got "ValueError: Can only compare identically-labeled Series objects."
df2 = df2.reset_index()
df2['Long_ID'] = df1['Long_ID'] [ (df2['index'] < df1['IndexEnd']) & (df2['index'] >= df1['IndexBegin']) ]
I am relatively new to programming, and I appreciate if anyone can give a better approach to solving this problem. I have reproduced the code below:
df1_data = [(10000001, 0, 3), (10000002, 3, 6), (10000003, 6, 10)]
df1 = pd.DataFrame(df1_data, columns = ['Long_ID', 'IndexBegin', 'IndexEnd'])
df2_data = [1, 2, 3, 10, 20, 30, 100, 101, 102, 103]
df2 = pd.DataFrame(df2_data, columns = ['Short_ID'])
You do not need "IndexEnd" as long as the ranges are contiguous. You may use pd.merge_asof:
(pd.merge_asof(df2.reset_index(), df1, left_on='index', right_on='IndexBegin')
   .reindex(['Short_ID', 'Long_ID'], axis=1))
Short_ID Long_ID
0 1 10000001
1 2 10000001
2 3 10000001
3 10 10000002
4 20 10000002
5 30 10000002
6 100 10000003
7 101 10000003
8 102 10000003
9 103 10000003
Here is one way using IntervalIndex
df1.index = pd.IntervalIndex.from_arrays(left=df1.IndexBegin, right=df1.IndexEnd, closed='left')
df2['New'] = df1.loc[df2.index, 'Long_ID'].values
You may do:
df3 = df2.copy()
df3['long_ID'] = df2.merge(df1, left_on=df2.index, right_on="IndexBegin", how='left').Long_ID.ffill().astype(int)
I created a function to solve your question. Hope it helps.
import numpy as np
import pandas as pd

df = pd.read_excel('C:/Users/me/Desktop/Sovrflw_data_2.xlsx')
df
Long_ID IndexBegin IndexEnd
0 10000001 0 3
1 10000002 3 6
2 10000003 6 10
df2 = pd.read_excel('C:/Users/me/Desktop/Sovrflw_data.xlsx')
df2
Short_ID
0 1
1 2
2 3
3 10
4 20
5 30
6 100
7 101
8 102
9 103
def convert_Short_ID(df1, df2):
    df2['Long_ID'] = None
    for i in range(len(df2)):
        for j in range(len(df)):
            if (df2.index[i] >= df.loc[j,'IndexBegin']) and (df2.index[i] < df.loc[j,'IndexEnd']):
                df2.loc[i,'Long_ID'] = df.loc[j, 'Long_ID']
                break
        else:
            # no interval matched this index
            df2.loc[i, 'Long_ID'] = np.nan
    df2['Long_ID'] = df2['Long_ID'].astype(str)
    return df2
convert_Short_ID(df,df2)
Short_ID Long_ID
0 1 10000001
1 2 10000001
2 3 10000001
3 10 10000002
4 20 10000002
5 30 10000002
6 100 10000003
7 101 10000003
8 102 10000003
9 103 10000003
Using Numpy to create the data before creating a Data Frame is a better approach since adding elements to a Data Frame is time-consuming. So:
import numpy as np
import pandas as pd
#Step 1: creating the first Data Frame
df1 = pd.DataFrame({'Long_ID':[10000001,10000002,10000003],
'IndexBegin':[0,3,6],
'IndexEnd':[3,6,10]})
#Step 2: creating the second chunk of data as a Numpy array
Short_ID = np.array([1,2,3,10,20,30,100,101,102,103])
#Step 3: creating a new column on df1 to count Long_ID occurrences
df1['Qt'] = df1['IndexEnd'] - df1['IndexBegin']
#Step 4: using append to create a Numpy array for the Long_ID column
Long_ID = np.array([])
for i in range(len(df1)):
    Long_ID = np.append(Long_ID, [df1['Long_ID'][i]]*df1['Qt'][i])
#Finally, create the second Data Frame using both previous Numpy arrays
df2 = pd.DataFrame(np.vstack((Short_ID, Long_ID)).T, columns=['Short_ID','Long_ID'])
df2
Short_ID Long_ID
0 1.0 10000001.0
1 2.0 10000001.0
2 3.0 10000001.0
3 10.0 10000002.0
4 20.0 10000002.0
5 30.0 10000002.0
6 100.0 10000003.0
7 101.0 10000003.0
8 102.0 10000003.0
9 103.0 10000003.0
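As a side note, the explicit loop in Step 4 can be replaced by a single np.repeat call, which avoids growing the array inside a loop. A sketch reusing the df1 and Short_ID defined above, assuming the index ranges are contiguous:

Long_ID = np.repeat(df1['Long_ID'].to_numpy(), df1['Qt'].to_numpy())
df2 = pd.DataFrame({'Short_ID': Short_ID, 'Long_ID': Long_ID})

Building the frame from a dict also keeps both columns as integers, whereas np.vstack upcasts them to floats as seen in the output above.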

easy tool to filtering columns with specific conditions using pandas

I'm wondering if there is a tool in Python to filter data between columns that follows a specific condition. I need to generate a clean dataframe where all the data in column 'A' have the same consecutive number in column 'E' (and this number is repeated at least twice). Here is an example:
df
Out[30]:
A B C D E
6 1 2.366 8.621 10.835 1
7 1 2.489 8.586 10.890 2
8 1 2.279 8.460 10.945 2
9 1 2.296 8.559 11.000 2
10 2 2.275 8.620 11.055 2
11 2 2.539 8.528 11.110 2
50 2 3.346 5.979 10.175 5
51 3 3.359 5.910 10.230 1
52 3 3.416 5.936 10.285 1
The output will be:
df
Out[31]:
A B C D E
7 1 2.489 8.586 10.890 2
8 1 2.279 8.460 10.945 2
9 1 2.296 8.559 11.000 2
10 2 2.275 8.620 11.055 2
11 2 2.539 8.528 11.110 2
51 3 3.359 5.910 10.230 1
52 3 3.416 5.936 10.285 1
What you are looking for is:
import numpy as np
df.groupby((df.E != df.E.shift(1)).cumsum()).filter(lambda x: np.size(x.E) >= 2)
# or
df[df.groupby((df.E != df.E.shift(1)).cumsum()).E.transform('size') >= 2]
Output:
A B C D E
7 1 2.489 8.586 10.890 2
8 1 2.279 8.460 10.945 2
9 1 2.296 8.559 11.000 2
10 2 2.275 8.620 11.055 2
11 2 2.539 8.528 11.110 2
51 3 3.359 5.910 10.230 1
52 3 3.416 5.936 10.285 1
Explanation:
You want to keep all records that belong to a consecutive group in E with a size of at least 2.
The first part, (df.E != df.E.shift(1)).cumsum(), labels the consecutive groups in column E; you then group by that label and filter the DataFrame, keeping only the groups whose size is 2 or more.
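To see what that grouping key produces on the sample data, a quick sketch (the helper column name E_run is only for illustration):

df['E_run'] = (df.E != df.E.shift(1)).cumsum()
# index:  6  7  8  9  10  11  50  51  52
# E:      1  2  2  2   2   2   5   1   1
# E_run:  1  2  2  2   2   2   3   4   4
# only runs 2 (five rows) and 4 (two rows) have at least two rows, so only those rows are kept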
You should be able to do something like the following:
mask = (df['E'] == df['E'].shift(1)) | (df['E'] == df['E'].shift(-1))
filtered_df = df[mask]
