Multidimensional array restructuring like in pandas.stack - python

Consider the following code to create a dummy dataset
import numpy as np
from scipy.stats import norm
import pandas as pd
np.random.seed(10)
n=3
space= norm(20, 5).rvs(n)
time= norm(10,2).rvs(n)
values = np.kron(space, time).reshape(n,n) + norm(1,1).rvs([n,n])
### Output
array([[267.39784458, 300.81493866, 229.19163206],
       [236.1940266 , 266.49469945, 204.01294305],
       [122.55912977, 140.00957047, 106.28339745]])
I can put these data in a pandas dataframe using
space_names = ['A','B','C']
time_names = [2000,2001,2002]
df = pd.DataFrame(values, index=space_names, columns=time_names)
df
### Output
2000 2001 2002
A 267.397845 300.814939 229.191632
B 236.194027 266.494699 204.012943
C 122.559130 140.009570 106.283397
This is considered a wide dataset: each observation lies in a table where two variables act as coordinates identifying it.
To make it a long (tidy) dataset we can use the .stack method of a pandas DataFrame:
df.columns.name = 'time'
df.index.name = 'space'
df.stack().rename('value').reset_index()
### Output
space time value
0 A 2000 267.397845
1 A 2001 300.814939
2 A 2002 229.191632
3 B 2000 236.194027
4 B 2001 266.494699
5 B 2002 204.012943
6 C 2000 122.559130
7 C 2001 140.009570
8 C 2002 106.283397
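As an aside, the same long format can be produced with df.melt; a minimal sketch, assuming the df with named index and columns built above (the row order differs, since melt goes column by column, but the content is identical):
df.reset_index().melt(id_vars='space', var_name='time', value_name='value')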
My question is: how do I do exactly this thing but for a 3-dimensional dataset?
Let's imagine I have 2 observations for each space-time pair:
s = 3
t = 4
r = 2
space_mus = norm(20, 5).rvs(s)
time_mus = norm(10,2).rvs(t)
values = np.kron(space_mus, time_mus)
values = values.repeat(r).reshape(s,t,r) + norm(0,1).rvs([s,t,r])
values
### Output
array([[[286.50322099, 288.51266345],
        [176.64303485, 175.38175877],
        [136.01675917, 134.44328617]],

       [[187.07608546, 185.4068411 ],
        [112.86398438, 111.983463  ],
        [ 85.99035255,  86.67236986]],

       [[267.66833894, 269.45295404],
        [162.30044715, 162.50564386],
        [124.6374401 , 126.2315447 ]]])
How can I obtain the same structure for the dataframe as above?
Ugly solution
Personally I don't like this solution, and I think it could be done in a more elegant and Pythonic way, but it might still be useful for someone else, so I will post it.
labels = ['{}{}{}'.format(i, j, k) for i in range(s) for j in range(t) for k in range(r)]  # space, time, repetition
def flatten3d(k):
    return [i for l in k for s in l for i in s]
value_series = pd.Series(flatten3d(values)).rename('y')
split_labels = [[i for i in l] for l in labels]
df = pd.DataFrame(split_labels, columns=['s', 't', 'r'])
pd.concat([df, value_series], axis=1)
### Output
s t r y
0 0 0 0 266.2408815208753
1 0 0 1 266.13662442609433
2 0 1 0 299.53178992512954
3 0 1 1 300.13941632567605
4 0 2 0 229.39037800681405
5 0 2 1 227.22227496248507
6 0 3 0 281.76357915411995
7 0 3 1 280.9639352062619
8 1 0 0 235.8137644198259
9 1 0 1 234.23202459516452
10 1 1 0 265.19681013560034
11 1 1 1 266.5462102589883
12 1 2 0 200.730100791878
13 1 2 1 199.83217739700535
14 1 3 0 246.54018839875374
15 1 3 1 248.5496308586532
16 2 0 0 124.90916276929234
17 2 0 1 123.64788669199066
18 2 1 0 139.65391860786775
19 2 1 1 138.08044561039517
20 2 2 0 106.45276370157518
21 2 2 1 104.78351933651582
22 2 3 0 129.86043618610572
23 2 3 1 128.97991481257253
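A tidier variant of the same idea, as a sketch: itertools.product yields the (space, time, repetition) coordinates directly, in the same C order in which values.ravel() flattens, so no string labels are needed.
import itertools
import pandas as pd
coords = list(itertools.product(range(s), range(t), range(r)))  # (space, time, repetition) triples
df = pd.DataFrame(coords, columns=['s', 't', 'r'])
df['y'] = values.ravel()  # C-order flattening matches product's enumeration order
df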

This does not use stack, but maybe it is acceptable for your problem:
import numpy as np
import pandas as pd
values = np.arange(18).reshape(3, 3, 2)  # Your values here
# The level lengths must match values.shape, i.e. 3 x 3 x 2,
# so only the first two time labels are used here.
index = pd.MultiIndex.from_product([space_names, space_names, time_names[:2]],
                                   names=["space1", "space2", "time"])
df = pd.DataFrame({"value": values.ravel()}, index=index).reset_index()
# df:
# space1 space2 time value
# 0 A A 2000 0
# 1 A A 2001 1
# 2 A B 2000 2
# 3 A B 2001 3
# 4 A C 2000 4
# 5 A C 2001 5
# 6 B A 2000 6
# 7 B A 2001 7
# 8 B B 2000 8
# 9 B B 2001 9
# 10 B C 2000 10
# 11 B C 2001 11
# 12 C A 2000 12
# 13 C A 2001 13
# 14 C B 2000 14
# 15 C B 2001 15
# 16 C C 2000 16
# 17 C C 2001 17
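Applied to the question's (s, t, r) example, the same pattern looks like this; a minimal sketch, assuming integer coordinates are acceptable as level labels:
import numpy as np
import pandas as pd
s, t, r = 3, 4, 2
values = np.arange(s * t * r).reshape(s, t, r)  # your (s, t, r) values here
index = pd.MultiIndex.from_product([range(s), range(t), range(r)],
                                   names=['space', 'time', 'rep'])
long_df = pd.DataFrame({'value': values.ravel()}, index=index).reset_index()
This works because ravel() flattens in C order (last axis fastest), which is exactly the order in which from_product enumerates the index.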

Related

Rewriting a column's cell values in a dataframe based on when the value changes, without using if statements

I have a column that is supposed to count cycles, but the device the data comes from resets the count after 50, so I was left with, for example, [1,1,1,1,2,2,2,3,3,3,3,...,50,50,50,1,1,1,2,2,2,2,3,3,3,...,50,50,.....,50]
My solution is below, and I can't even make it work (for simplicity I made the data reset after 10 cycles):
import pandas as pd

data = {'Cyc-Count': [1,1,2,2,2,3,4,5,6,7,7,7,8,9,10,1,1,1,2,3,3,3,3,
                      4,4,5,6,6,6,7,8,8,8,8,9,10]}
df = pd.DataFrame(data)
x = 0
count = 0
old_value = df.at[x, 'Cyc-Count']
for x in range(x, len(df) - 1):
    if df.at[x, 'Cyc-Count'] == df.at[x + 1, 'Cyc-Count']:
        old_value = df.at[x + 1, 'Cyc-Count']
        df.at[x + 1, 'Cyc-Count'] = count
    else:
        old_value = df.at[x + 1, 'Cyc-Count']
        count += 1
        df.at[x + 1, 'Cyc-Count'] = count
I need to fix this, but preferably without even using if statements.
The desired output for the above example should be:
data = {'Cyc-Count': [1,1,2,2,2,3,4,5,6,7,7,7,8,9,10,11,11,11,12,13,13,13,13,
                      14,14,15,16,16,16,17,18,18,18,18,19,20]}
hint" my method has a big issue is that the last indexed value will be hard to change since when comparing it with its index+1 > it dosnt even exist
IIUC, you want to continue the count when the counter decreases.
You can use vectorized code: wherever the shifted value s is greater than the current value, a reset just happened and s holds the pre-reset maximum, so its cumulative sum is exactly the offset to add from that row onward:
s = df['Cyc-Count'].shift()
df['Cyc-Count2'] = (df['Cyc-Count']
                    + s.where(s.gt(df['Cyc-Count']))
                       .fillna(0, downcast='infer')
                       .cumsum()
                    )
Or, to modify the column in place:
s = df['Cyc-Count'].shift()
df['Cyc-Count'] += (s.where(s.gt(df['Cyc-Count']))
                     .fillna(0, downcast='infer')
                     .cumsum()
                    )
output:
Cyc-Count Cyc-Count2
0 1 1
1 1 1
2 1 1
3 1 1
4 2 2
5 2 2
6 2 2
7 3 3
8 3 3
9 3 3
10 3 3
11 4 4
12 5 5
13 5 5
14 5 5
15 1 6
16 1 6
17 1 6
18 2 7
19 2 7
20 2 7
21 2 7
22 3 8
23 3 8
24 3 8
25 4 9
26 5 10
27 5 10
28 1 11
29 2 12
30 2 12
31 3 13
32 4 14
33 5 15
34 5 15
used input:
l = [1,1,1,1,2,2,2,3,3,3,3,4,5,5,5,1,1,1,2,2,2,2,3,3,3,4,5,5,1,2,2,3,4,5,5]
df = pd.DataFrame({'Cyc-Count': l})
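The same offset trick can be written in plain numpy; a sketch, assuming the df built from the used input above:
import numpy as np
a = df['Cyc-Count'].to_numpy()
resets = np.r_[False, a[:-1] > a[1:]]                      # True on the first row after each reset
offset = np.cumsum(np.where(resets, np.r_[0, a[:-1]], 0))  # carry the pre-reset maxima forward
df['Cyc-Count2'] = a + offset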
You can use df.loc to access a group of rows and columns by label(s) or a boolean array.
Syntax: df.loc[df['column name'] condition, 'column name or the new one'] = 'value if condition is met'
For example:
import pandas as pd
numbers = {'set_of_numbers': [1,2,3,4,5,6,7,8,9,10,0,0]}
df = pd.DataFrame(numbers, columns=['set_of_numbers'])
print(df)
df.loc[df['set_of_numbers'] == 0, 'set_of_numbers'] = 999
df.loc[df['set_of_numbers'] == 5, 'set_of_numbers'] = 555
print(df)
Before: 'set_of_numbers': [1,2,3,4,5,6,7,8,9,10,0,0]
After: 'set_of_numbers': [1,2,3,4,555,6,7,8,9,10,999,999]

Pandas dataframe column-wise calculation

I have the below dataframe columns:
Index(['Location', 'Dec-2021_x', 'Jan-2022_x', 'Feb-2022_x', 'Mar-2022_x',
'Apr-2022_x', 'May-2022_x', 'Jun-2022_x', 'Jul-2022_x', 'Aug-2022_x',
'Sep-2022_x', 'Oct-2022_x', 'Nov-2022_x', 'Dec-2022_x', 'Jan-2023_x',
'Feb-2023_x', 'Mar-2023_x', 'Apr-2023_x', 'May-2023_x', 'Jun-2023_x',
'Jul-2023_x', 'Aug-2023_x', 'Sep-2023_x', 'Oct-2023_x', 'Nov-2023_x',
'Dec-2023_x', 'Jan-2024_x', 'Feb-2024_x', 'Mar-2024_x', 'Apr-2024_x',
'May-2024_x', 'Jun-2024_x', 'Jul-2024_x', 'Aug-2024_x', 'Sep-2024_x',
'Oct-2024_x', 'Nov-2024_x', 'Dec-2024_x',
'sum_val',
'Dec-2021_y', 'Jan-2022_y', 'Feb-2022_y',
'Mar-2022_y', 'Apr-2022_y', 'May-2022_y', 'Jun-2022_y', 'Jul-2022_y',
'Aug-2022_y', 'Sep-2022_y', 'Oct-2022_y', 'Nov-2022_y', 'Dec-2022_y',
'Jan-2023_y', 'Feb-2023_y', 'Mar-2023_y', 'Apr-2023_y', 'May-2023_y',
'Jun-2023_y', 'Jul-2023_y', 'Aug-2023_y', 'Sep-2023_y', 'Oct-2023_y',
'Nov-2023_y', 'Dec-2023_y', 'Jan-2024_y', 'Feb-2024_y', 'Mar-2024_y',
'Apr-2024_y', 'May-2024_y', 'Jun-2024_y', 'Jul-2024_y', 'Aug-2024_y',
'Sep-2024_y', 'Oct-2024_y', 'Nov-2024_y', 'Dec-2024_y'],
dtype='object')
Sample dataframe with reduced columns looks like this:
df:
Location Dec-2021_x Jan-2022_x sum_val Dec-2021_y Jan-2022_y
A 212 315 1000 12 13
B 312 612 1100 13 17
C 242 712 1010 15 15
D 215 382 1001 16 17
E 252 319 1110 17 18
I have to create a resultant dataframe which will be in the below format:
Index(['Location', 'Dec-2021', 'Jan-2022', 'Feb-2022', 'Mar-2022',
'Apr-2022', 'May-2022', 'Jun-2022', 'Jul-2022', 'Aug-2022',
'Sep-2022', 'Oct-2022', 'Nov-2022', 'Dec-2022', 'Jan-2023',
'Feb-2023', 'Mar-2023', 'Apr-2023', 'May-2023', 'Jun-2023',
'Jul-2023', 'Aug-2023', 'Sep-2023', 'Oct-2023', 'Nov-2023',
'Dec-2023', 'Jan-2024', 'Feb-2024', 'Mar-2024', 'Apr-2024',
'May-2024', 'Jun-2024', 'Jul-2024', 'Aug-2024', 'Sep-2024',
'Oct-2024', 'Nov-2024', 'Dec-2024'],
dtype='object')
The way we do this is using the formula:
'Dec-2021' = 'Dec-2021_x' * sum_val * 'Dec-2021_y' (these are all numeric columns)
and similarly for all the months; there are 36 months to be precise. Is there any way to do this in a loop over each month-year column? There are around 65000+ rows here, so I do not want to overwhelm the system.
Use:
# sample data
np.random.seed(2022)
c = ['Location', 'Dec-2021_x', 'Jan-2022_x', 'Feb-2022_x', 'Mar-2022_x',
     'Apr-2022_x', 'sum_val', 'Dec-2021_y', 'Jan-2022_y', 'Feb-2022_y',
     'Mar-2022_y', 'Apr-2022_y']
df = (pd.DataFrame(np.random.randint(10, size=(5, len(c))), columns=c)
        .assign(Location=list('abcde')))
print(df)
Location Dec-2021_x Jan-2022_x Feb-2022_x Mar-2022_x Apr-2022_x \
0 a 1 1 0 7 8
1 b 8 0 3 6 8
2 c 1 7 5 5 4
3 d 0 7 5 5 8
4 e 8 0 3 9 5
sum_val Dec-2021_y Jan-2022_y Feb-2022_y Mar-2022_y Apr-2022_y
0 2 8 0 5 9 1
1 0 1 2 0 5 7
2 8 2 3 1 0 4
3 2 4 0 9 4 9
4 2 1 7 2 1 7
# remove unnecessary columns
df1 = df.drop(['sum_val'], axis=1)
# keep Location out of the computation by moving it to the index - if needed in the output
df1 = df1.set_index('Location')
# split column names by the last _
df1.columns = df1.columns.str.rsplit('_', n=1, expand=True)
# select the x and y DataFrames by the second level and multiply
df2 = (df1.xs('x', axis=1, level=1).mul(df['sum_val'].to_numpy(), axis=0) *
       df1.xs('y', axis=1, level=1))
print(df2)
Dec-2021 Jan-2022 Feb-2022 Mar-2022 Apr-2022
Location
a 16 0 0 126 16
b 0 0 0 0 0
c 16 168 40 0 128
d 0 0 90 40 144
e 16 0 12 18 70
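For comparison, the loop the question asks about is also viable here, since it iterates over the ~37 month columns rather than the 65000+ rows, and each iteration is a single vectorized multiply over the whole column. A sketch, assuming the month names can be derived from the _x columns:
months = [c[:-2] for c in df.columns if c.endswith('_x')]
out = df[['Location']].copy()
for m in months:
    out[m] = df[f'{m}_x'] * df['sum_val'] * df[f'{m}_y']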

Faster way of setting rows of DataFrame to NaN based on last index

I've got data frames that look like this:
import time
import pandas as pd
import numpy as np
N = 3
l = []
for i in range(N):
    n = np.random.choice(5) + 2
    l += [pd.DataFrame(dict(ID=np.repeat(i, n),
                            t=list(range(n)),
                            X=np.random.normal(size=n)))]
df = pd.concat(l)
df
Out[85]:
ID t X
0 0 0 0.992300
1 0 1 0.226487
2 0 2 -0.731178
3 0 3 0.748376
4 0 4 1.269106
0 1 0 0.512957
1 1 1 -1.274963
2 1 2 0.186314
3 1 3 1.243093
0 2 0 0.321971
1 2 1 0.233895
2 2 2 0.293439
I need to set X to NaN at the last value of t for each ID. Right now, I can do it one of two ways:
trimlast = df.groupby('ID').apply(lambda x: x.head(-1)).reset_index(drop=True)
df = df.drop(columns='X').merge(trimlast, how='left')
Or
def f(d):
    d.loc[d.t == d.t.max(), 'X'] = np.nan
    return d
df = df.groupby('ID').apply(f).reset_index(drop=True)
Both of which yield:
df
Out[87]:
ID t X
0 0 0 0.992300
1 0 1 0.226487
2 0 2 -0.731178
3 0 3 0.748376
4 0 4 NaN
5 1 0 0.512957
6 1 1 -1.274963
7 1 2 0.186314
8 1 3 NaN
9 2 0 0.321971
10 2 1 0.233895
11 2 2 NaN
They're too slow when the data gets big; time is approximately linear in N.
def sizetry(N, other_way=False):
    np.random.seed(0)
    l = []
    for i in range(N):
        n = np.random.choice(5) + 2
        l += [pd.DataFrame(dict(ID=np.repeat(i, n),
                                t=list(range(n)),
                                X=np.random.normal(size=n)))]
    df = pd.concat(l)
    start = time.time()
    if other_way:
        trimlast = df.groupby('ID').apply(lambda x: x.head(-1)).reset_index(drop=True)
        df = df.drop(columns='X').merge(trimlast, how='left')
    else:
        df = df.groupby('ID').apply(f).reset_index(drop=True)
    end = time.time()
    return end - start

tvec = [sizetry(2**i) for i in range(15)]
tvec_other = [sizetry(2**i, other_way=True) for i in range(15)]

import matplotlib.pyplot as plt
plt.plot(np.log2(tvec), label="merge way")
plt.plot(np.log2(tvec_other), label="other way")
plt.legend()
plt.show()
I suspect that the problem is groupby. Is there a faster way to do this?
First reset your index:
df = df.reset_index(drop=True)
Then use duplicated() with an inverted boolean mask:
import numpy as np
df.loc[~df.duplicated(subset=['ID'], keep='last'), 'X'] = np.nan
print(df)
ID t X
0 0 0 0.424902
1 0 1 1.597951
2 0 2 1.453884
3 0 3 NaN
4 1 0 0.534653
5 1 1 -0.318361
6 1 2 0.188290
7 1 3 1.157802
8 1 4 NaN
9 2 0 0.186005
10 2 1 0.036017
11 2 2 1.039822
12 2 3 -1.602205
13 2 4 -0.210601
14 2 5 NaN
If you want the row with the max t value per group to be changed, use idxmax() with a groupby:
df.loc[df.groupby('ID')['t'].idxmax(), 'X'] = np.nan
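A transform-based variant, as a sketch: unlike idxmax(), which flags only the first row at the maximum, this masks every row that ties for the max t within its ID:
mask = df.groupby('ID')['t'].transform('max').eq(df['t'])
df.loc[mask, 'X'] = np.nan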

grouping the index with the highest three values of another column using pandas python

I have a CSV file that contains columns like StateName, Population, CityName... Note that for every state you can have multiple city names, and thus multiple populations within the same state.
What I want is to group by StateName and take the highest three populations in each state.
(What I have and what I want to have were shown as images in the original post.)
my code is:
def answer_six():
    x = census_df['STNAME'].unique()
    census_df2 = df = pd.DataFrame()
    for a in x:
        census_dfcopy = census_df.copy()
        census_dfcopy = census_dfcopy.set_index(['STNAME'])
        census_dfcopy = census_dfcopy.loc[a]
        census_dfcopy = census_dfcopy.reset_index()
        census_dfcopy = census_dfcopy.set_index(['CENSUS2010POP'])
        census_dfcopy1 = census_dfcopy.sort_index(ascending=False)
        census_dfcopy1 = census_dfcopy1.append(census_dfcopy1)
        census_dfcopy1.groupby('STNAME')
    return census_dfcopy1.head(3)

answer_six()
I only get 3 values, from the last state.
to download the csv file please visit the link:
https://drive.google.com/open?id=1ptE6MRQ1NGrfRYBB7NKjqhOJZXlxScPo
You could do
census_df.groupby('STNAME').CENSUS2010POP.nlargest(3)
In action:
In [51]: df
Out[51]:
ctyname pop stname
0 0 10 a
1 1 9 a
2 2 1 a
3 3 3 a
4 4 12 b
5 5 12 b
6 6 13 b
7 7 14 b
8 8 4 c
9 9 3 c
10 10 2 c
11 11 1 c
In [68]: df.groupby('stname').pop.nlargest(3)
Out[68]:
stname
a 0 10
1 9
3 3
b 7 14
6 13
4 12
c 8 4
9 3
10 2
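If a flat DataFrame is preferred over the MultiIndexed Series, chaining reset_index() should do it; a sketch using the question's column names:
census_df.groupby('STNAME').CENSUS2010POP.nlargest(3).reset_index()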

Extract Number from Varying String

Given this data frame:
import numpy as np
import pandas as pd
df = pd.DataFrame({'ID': ['a','b','c','d','e','f','g','h','i','j','k'],
                   'value': ['None', np.nan, '6D', '7', '10D', 'NONE', 'x',
                             '10D aaa', '1 D', '10 D aa', 7]})
df
ID value
0 a None
1 b NaN
2 c 6D
3 d 7
4 e 10D
5 f NONE
6 g x
7 h 10D aaa
8 i 1 D
9 j 10 D aa
10 k i7D
I'd like to extract numbers where present, else return 0, for any mess of situations as shown above.
The desired result is:
ID value
0 a 0
1 b 0
2 c 6
3 d 7
4 e 10
5 f 0
6 g 0
7 h 10
8 i 1
9 j 10
10 k 7
Thanks in advance!
Here is my approach using re.findall and apply:
import re
df['value'].apply(lambda x: 0 if not re.findall(r'\d+', str(x)) else int(re.findall(r'\d+', str(x))[0]))
Try the following using Series.str.replace and fillna:
import numpy as np
import pandas as pd
df = pd.DataFrame({'ID': ['a','b','c','d','e','f','g','h','i','j','k'],
                   'value': ['None', np.nan, '6D', '7', '10D', 'NONE', 'x',
                             '10D aaa', '1 D', '10 D aa', 7]})
df['value'] = (df['value'].fillna(0).astype(str)
                 .str.replace(r'\D+', '', regex=True)  # strip non-digits
                 .replace('', 0)                       # all-text values become 0
                 .astype(int))
Alternatively, you can apply a function to the dataframe via applymap(), following the EAFP principle and catching multiple exceptions while extracting the digits:
import re

def get_number(item):
    try:
        return int(re.search(r"\d+", str(item)).group(0))
    except (AttributeError, ValueError, IndexError):
        return 0

print(df.applymap(get_number))
Prints:
ID value
0 0 0
1 0 0
2 0 6
3 0 7
4 0 10
5 0 0
6 0 0
7 0 10
8 0 1
9 0 10
10 0 7
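For reference, Series.str.extract does this in one vectorized step; a minimal sketch against the same 'value' column as above:
df['value'] = (df['value'].astype(str)
                 .str.extract(r'(\d+)', expand=False)  # first run of digits, NaN if none
                 .fillna(0)
                 .astype(int))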
