Multidimensional array restructuring like in pandas.stack - python

Consider the following code to create a dummy dataset
import numpy as np
from scipy.stats import norm
import pandas as pd
np.random.seed(10)
n=3
space= norm(20, 5).rvs(n)
time= norm(10,2).rvs(n)
values = np.kron(space, time).reshape(n,n) + norm(1,1).rvs([n,n])
### Output
array([[267.39784458, 300.81493866, 229.19163206],
       [236.1940266 , 266.49469945, 204.01294305],
       [122.55912977, 140.00957047, 106.28339745]])
I can put these data in a pandas dataframe using
space_names = ['A','B','C']
time_names = [2000,2001,2002]
df = pd.DataFrame(values, index=space_names, columns=time_names)
df
### Output
2000 2001 2002
A 267.397845 300.814939 229.191632
B 236.194027 266.494699 204.012943
C 122.559130 140.009570 106.283397
This is considered a wide dataset: each observation lies in a table where two variables act as coordinates identifying it.
To make it a long (tidy) dataset we can use the .stack method of a pandas DataFrame:
df.columns.name = 'time'
df.index.name = 'space'
df.stack().rename('value').reset_index()
### Output
space time value
0 A 2000 267.397845
1 A 2001 300.814939
2 A 2002 229.191632
3 B 2000 236.194027
4 B 2001 266.494699
5 B 2002 204.012943
6 C 2000 122.559130
7 C 2001 140.009570
8 C 2002 106.283397
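As an aside, the same long format can be produced with df.melt; a minimal sketch, assuming the df with named index and columns built above (the row order differs, since melt goes column by column, but the content is identical):
df.reset_index().melt(id_vars='space', var_name='time', value_name='value')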
My question is: how do I do exactly this thing but for a 3-dimensional dataset?
Let's imagine I have 2 observations for each space-time pair:
s = 3
t = 4
r = 2
space_mus = norm(20, 5).rvs(s)
time_mus = norm(10,2).rvs(t)
values = np.kron(space_mus, time_mus)
values = values.repeat(r).reshape(s,t,r) + norm(0,1).rvs([s,t,r])
values
### Output
array([[[286.50322099, 288.51266345],
        [176.64303485, 175.38175877],
        [136.01675917, 134.44328617]],

       [[187.07608546, 185.4068411 ],
        [112.86398438, 111.983463  ],
        [ 85.99035255,  86.67236986]],

       [[267.66833894, 269.45295404],
        [162.30044715, 162.50564386],
        [124.6374401 , 126.2315447 ]]])
How can I obtain the same structure for the dataframe as above?
Ugly solution
Personally I don't like this solution, and I think it could be done in a more elegant and Pythonic way, but it might still be useful for someone else, so I will post it.
labels = ['{}{}{}'.format(i, j, k) for i in range(s) for j in range(t) for k in range(r)]  # space, time, repetition
def flatten3d(k):
    return [i for l in k for s in l for i in s]
value_series = pd.Series(flatten3d(values)).rename('y')
split_labels = [[i for i in l] for l in labels]
df = pd.DataFrame(split_labels, columns=['s', 't', 'r'])
pd.concat([df, value_series], axis=1)
### Output
s t r y
0 0 0 0 266.2408815208753
1 0 0 1 266.13662442609433
2 0 1 0 299.53178992512954
3 0 1 1 300.13941632567605
4 0 2 0 229.39037800681405
5 0 2 1 227.22227496248507
6 0 3 0 281.76357915411995
7 0 3 1 280.9639352062619
8 1 0 0 235.8137644198259
9 1 0 1 234.23202459516452
10 1 1 0 265.19681013560034
11 1 1 1 266.5462102589883
12 1 2 0 200.730100791878
13 1 2 1 199.83217739700535
14 1 3 0 246.54018839875374
15 1 3 1 248.5496308586532
16 2 0 0 124.90916276929234
17 2 0 1 123.64788669199066
18 2 1 0 139.65391860786775
19 2 1 1 138.08044561039517
20 2 2 0 106.45276370157518
21 2 2 1 104.78351933651582
22 2 3 0 129.86043618610572
23 2 3 1 128.97991481257253
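A tidier variant of the same idea, as a sketch: itertools.product yields the (space, time, repetition) coordinates directly, in the same C order in which values.ravel() flattens, so no string labels are needed.
import itertools
import pandas as pd
coords = list(itertools.product(range(s), range(t), range(r)))  # (space, time, repetition) triples
df = pd.DataFrame(coords, columns=['s', 't', 'r'])
df['y'] = values.ravel()  # C-order flattening matches product's enumeration order
df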

This does not use stack, but maybe it is acceptable for your problem:
import numpy as np
import pandas as pd
values = np.arange(18).reshape(3, 3, 2)  # Your values here
# The level lengths must match values.shape, i.e. 3 x 3 x 2,
# so only the first two time labels are used here.
index = pd.MultiIndex.from_product([space_names, space_names, time_names[:2]],
                                   names=["space1", "space2", "time"])
df = pd.DataFrame({"value": values.ravel()}, index=index).reset_index()
# df:
# space1 space2 time value
# 0 A A 2000 0
# 1 A A 2001 1
# 2 A B 2000 2
# 3 A B 2001 3
# 4 A C 2000 4
# 5 A C 2001 5
# 6 B A 2000 6
# 7 B A 2001 7
# 8 B B 2000 8
# 9 B B 2001 9
# 10 B C 2000 10
# 11 B C 2001 11
# 12 C A 2000 12
# 13 C A 2001 13
# 14 C B 2000 14
# 15 C B 2001 15
# 16 C C 2000 16
# 17 C C 2001 17
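Applied to the question's (s, t, r) example, the same pattern looks like this; a minimal sketch, assuming integer coordinates are acceptable as level labels:
import numpy as np
import pandas as pd
s, t, r = 3, 4, 2
values = np.arange(s * t * r).reshape(s, t, r)  # your (s, t, r) values here
index = pd.MultiIndex.from_product([range(s), range(t), range(r)],
                                   names=['space', 'time', 'rep'])
long_df = pd.DataFrame({'value': values.ravel()}, index=index).reset_index()
This works because ravel() flattens in C order (last axis fastest), which is exactly the order in which from_product enumerates the index.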

Related

Rewriting a column's cell values in a dataframe based on when the value changes, without using if statements

I have a column that is supposed to count cycles, but the device the data comes from resets the count after 50, so I was left with, for example, [1,1,1,1,2,2,2,3,3,3,3,...,50,50,50,1,1,1,2,2,2,2,3,3,3,...,50,50,.....,50]
My solution is below, and I can't even make it work (for simplicity I made the data reset after 10 cycles):
import pandas as pd

data = {'Cyc-Count': [1,1,2,2,2,3,4,5,6,7,7,7,8,9,10,1,1,1,2,3,3,3,3,
                      4,4,5,6,6,6,7,8,8,8,8,9,10]}
df = pd.DataFrame(data)
x = 0
count = 0
old_value = df.at[x, 'Cyc-Count']
for x in range(x, len(df) - 1):
    if df.at[x, 'Cyc-Count'] == df.at[x + 1, 'Cyc-Count']:
        old_value = df.at[x + 1, 'Cyc-Count']
        df.at[x + 1, 'Cyc-Count'] = count
    else:
        old_value = df.at[x + 1, 'Cyc-Count']
        count += 1
        df.at[x + 1, 'Cyc-Count'] = count
I need to fix this, but preferably without even using if statements.
The desired output for the above example should be:
data = {'Cyc-Count': [1,1,2,2,2,3,4,5,6,7,7,7,8,9,10,11,11,11,12,13,13,13,13,
                      14,14,15,16,16,16,17,18,18,18,18,19,20]}
hint" my method has a big issue is that the last indexed value will be hard to change since when comparing it with its index+1 > it dosnt even exist
IIUC, you want to continue the count when the counter decreases.
You can use vectorized code: wherever the shifted value s is greater than the current value, a reset just happened and s holds the pre-reset maximum, so its cumulative sum is exactly the offset to add from that row onward:
s = df['Cyc-Count'].shift()
df['Cyc-Count2'] = (df['Cyc-Count']
                    + s.where(s.gt(df['Cyc-Count']))
                       .fillna(0, downcast='infer')
                       .cumsum()
                    )
Or, to modify the column in place:
s = df['Cyc-Count'].shift()
df['Cyc-Count'] += (s.where(s.gt(df['Cyc-Count']))
                     .fillna(0, downcast='infer')
                     .cumsum()
                    )
output:
Cyc-Count Cyc-Count2
0 1 1
1 1 1
2 1 1
3 1 1
4 2 2
5 2 2
6 2 2
7 3 3
8 3 3
9 3 3
10 3 3
11 4 4
12 5 5
13 5 5
14 5 5
15 1 6
16 1 6
17 1 6
18 2 7
19 2 7
20 2 7
21 2 7
22 3 8
23 3 8
24 3 8
25 4 9
26 5 10
27 5 10
28 1 11
29 2 12
30 2 12
31 3 13
32 4 14
33 5 15
34 5 15
used input:
l = [1,1,1,1,2,2,2,3,3,3,3,4,5,5,5,1,1,1,2,2,2,2,3,3,3,4,5,5,1,2,2,3,4,5,5]
df = pd.DataFrame({'Cyc-Count': l})
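The same offset trick can be written in plain numpy; a sketch, assuming the df built from the used input above:
import numpy as np
a = df['Cyc-Count'].to_numpy()
resets = np.r_[False, a[:-1] > a[1:]]                      # True on the first row after each reset
offset = np.cumsum(np.where(resets, np.r_[0, a[:-1]], 0))  # carry the pre-reset maxima forward
df['Cyc-Count2'] = a + offset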
You can use df.loc to access a group of rows and columns by label(s) or a boolean array.
Syntax: df.loc[df['column name'] condition, 'column name or the new one'] = 'value if condition is met'
For example:
import pandas as pd
numbers = {'set_of_numbers': [1,2,3,4,5,6,7,8,9,10,0,0]}
df = pd.DataFrame(numbers, columns=['set_of_numbers'])
print(df)
df.loc[df['set_of_numbers'] == 0, 'set_of_numbers'] = 999
df.loc[df['set_of_numbers'] == 5, 'set_of_numbers'] = 555
print(df)
Before: 'set_of_numbers': [1,2,3,4,5,6,7,8,9,10,0,0]
After: 'set_of_numbers': [1,2,3,4,555,6,7,8,9,10,999,999]

Pandas dataframe column-wise calculation

I have the below dataframe columns:
Index(['Location', 'Dec-2021_x', 'Jan-2022_x', 'Feb-2022_x', 'Mar-2022_x',
'Apr-2022_x', 'May-2022_x', 'Jun-2022_x', 'Jul-2022_x', 'Aug-2022_x',
'Sep-2022_x', 'Oct-2022_x', 'Nov-2022_x', 'Dec-2022_x', 'Jan-2023_x',
'Feb-2023_x', 'Mar-2023_x', 'Apr-2023_x', 'May-2023_x', 'Jun-2023_x',
'Jul-2023_x', 'Aug-2023_x', 'Sep-2023_x', 'Oct-2023_x', 'Nov-2023_x',
'Dec-2023_x', 'Jan-2024_x', 'Feb-2024_x', 'Mar-2024_x', 'Apr-2024_x',
'May-2024_x', 'Jun-2024_x', 'Jul-2024_x', 'Aug-2024_x', 'Sep-2024_x',
'Oct-2024_x', 'Nov-2024_x', 'Dec-2024_x',
'sum_val',
'Dec-2021_y', 'Jan-2022_y', 'Feb-2022_y',
'Mar-2022_y', 'Apr-2022_y', 'May-2022_y', 'Jun-2022_y', 'Jul-2022_y',
'Aug-2022_y', 'Sep-2022_y', 'Oct-2022_y', 'Nov-2022_y', 'Dec-2022_y',
'Jan-2023_y', 'Feb-2023_y', 'Mar-2023_y', 'Apr-2023_y', 'May-2023_y',
'Jun-2023_y', 'Jul-2023_y', 'Aug-2023_y', 'Sep-2023_y', 'Oct-2023_y',
'Nov-2023_y', 'Dec-2023_y', 'Jan-2024_y', 'Feb-2024_y', 'Mar-2024_y',
'Apr-2024_y', 'May-2024_y', 'Jun-2024_y', 'Jul-2024_y', 'Aug-2024_y',
'Sep-2024_y', 'Oct-2024_y', 'Nov-2024_y', 'Dec-2024_y'],
dtype='object')
Sample dataframe with reduced columns looks like this:
df:
Location Dec-2021_x Jan-2022_x sum_val Dec-2021_y Jan-2022_y
A 212 315 1000 12 13
B 312 612 1100 13 17
C 242 712 1010 15 15
D 215 382 1001 16 17
E 252 319 1110 17 18
I have to create a resultant dataframe which will be in the below format:
Index(['Location', 'Dec-2021', 'Jan-2022', 'Feb-2022', 'Mar-2022',
'Apr-2022', 'May-2022', 'Jun-2022', 'Jul-2022', 'Aug-2022',
'Sep-2022', 'Oct-2022', 'Nov-2022', 'Dec-2022', 'Jan-2023',
'Feb-2023', 'Mar-2023', 'Apr-2023', 'May-2023', 'Jun-2023',
'Jul-2023', 'Aug-2023', 'Sep-2023', 'Oct-2023', 'Nov-2023',
'Dec-2023', 'Jan-2024', 'Feb-2024', 'Mar-2024', 'Apr-2024',
'May-2024', 'Jun-2024', 'Jul-2024', 'Aug-2024', 'Sep-2024',
'Oct-2024', 'Nov-2024', 'Dec-2024'],
dtype='object')
The way we do this is using the formula:
'Dec-2021' = 'Dec-2021_x' * sum_val * 'Dec-2021_y' (these are all numeric columns)
and similarly for all the months; there are 36 months to be precise. Is there any way to do this in a loop over each month-year column? There are around 65000+ rows here, so I do not want to overwhelm the system.
Use:
# sample data
np.random.seed(2022)
c = ['Location', 'Dec-2021_x', 'Jan-2022_x', 'Feb-2022_x', 'Mar-2022_x',
     'Apr-2022_x', 'sum_val', 'Dec-2021_y', 'Jan-2022_y', 'Feb-2022_y',
     'Mar-2022_y', 'Apr-2022_y']
df = (pd.DataFrame(np.random.randint(10, size=(5, len(c))), columns=c)
        .assign(Location=list('abcde')))
print(df)
Location Dec-2021_x Jan-2022_x Feb-2022_x Mar-2022_x Apr-2022_x \
0 a 1 1 0 7 8
1 b 8 0 3 6 8
2 c 1 7 5 5 4
3 d 0 7 5 5 8
4 e 8 0 3 9 5
sum_val Dec-2021_y Jan-2022_y Feb-2022_y Mar-2022_y Apr-2022_y
0 2 8 0 5 9 1
1 0 1 2 0 5 7
2 8 2 3 1 0 4
3 2 4 0 9 4 9
4 2 1 7 2 1 7
# remove unnecessary columns
df1 = df.drop(['sum_val'], axis=1)
# keep Location out of the computation by moving it to the index - if needed in the output
df1 = df1.set_index('Location')
# split column names by the last _
df1.columns = df1.columns.str.rsplit('_', n=1, expand=True)
# select the x and y DataFrames by the second level and multiply
df2 = (df1.xs('x', axis=1, level=1).mul(df['sum_val'].to_numpy(), axis=0) *
       df1.xs('y', axis=1, level=1))
print(df2)
Dec-2021 Jan-2022 Feb-2022 Mar-2022 Apr-2022
Location
a 16 0 0 126 16
b 0 0 0 0 0
c 16 168 40 0 128
d 0 0 90 40 144
e 16 0 12 18 70
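For comparison, the loop the question asks about is also viable here, since it iterates over the ~37 month columns rather than the 65000+ rows, and each iteration is a single vectorized multiply over the whole column. A sketch, assuming the month names can be derived from the _x columns:
months = [c[:-2] for c in df.columns if c.endswith('_x')]
out = df[['Location']].copy()
for m in months:
    out[m] = df[f'{m}_x'] * df['sum_val'] * df[f'{m}_y']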

Faster way of setting rows of DataFrame to NaN based on last index

I've got data frames that look like this:
import time
import pandas as pd
import numpy as np
N = 3
l = []
for i in range(N):
    n = np.random.choice(5) + 2
    l += [pd.DataFrame(dict(ID=np.repeat(i, n),
                            t=list(range(n)),
                            X=np.random.normal(size=n)))]
df = pd.concat(l)
df
Out[85]:
ID t X
0 0 0 0.992300
1 0 1 0.226487
2 0 2 -0.731178
3 0 3 0.748376
4 0 4 1.269106
0 1 0 0.512957
1 1 1 -1.274963
2 1 2 0.186314
3 1 3 1.243093
0 2 0 0.321971
1 2 1 0.233895
2 2 2 0.293439
I need to set X to NaN at the last value of t for each ID. Right now, I can do it one of two ways:
trimlast = df.groupby('ID').apply(lambda x: x.head(-1)).reset_index(drop=True)
df = df.drop(columns='X').merge(trimlast, how='left')
Or
def f(d):
    d.loc[d.t == d.t.max(), 'X'] = np.nan
    return d
df = df.groupby('ID').apply(f).reset_index(drop=True)
Both of which yield:
df
Out[87]:
ID t X
0 0 0 0.992300
1 0 1 0.226487
2 0 2 -0.731178
3 0 3 0.748376
4 0 4 NaN
5 1 0 0.512957
6 1 1 -1.274963
7 1 2 0.186314
8 1 3 NaN
9 2 0 0.321971
10 2 1 0.233895
11 2 2 NaN
They're too slow when the data gets big; time is approximately linear in N.
def sizetry(N, other_way=False):
    np.random.seed(0)
    l = []
    for i in range(N):
        n = np.random.choice(5) + 2
        l += [pd.DataFrame(dict(ID=np.repeat(i, n),
                                t=list(range(n)),
                                X=np.random.normal(size=n)))]
    df = pd.concat(l)
    start = time.time()
    if other_way:
        trimlast = df.groupby('ID').apply(lambda x: x.head(-1)).reset_index(drop=True)
        df = df.drop(columns='X').merge(trimlast, how='left')
    else:
        df = df.groupby('ID').apply(f).reset_index(drop=True)
    end = time.time()
    return end - start

tvec = [sizetry(2**i) for i in range(15)]
tvec_other = [sizetry(2**i, other_way=True) for i in range(15)]

import matplotlib.pyplot as plt
plt.plot(np.log2(tvec), label="merge way")
plt.plot(np.log2(tvec_other), label="other way")
plt.legend()
plt.show()
I suspect that the problem is groupby. Is there a faster way to do this?
First reset your index:
df = df.reset_index(drop=True)
Then use duplicated() with an inverted boolean mask:
import numpy as np
df.loc[~df.duplicated(subset=['ID'], keep='last'), 'X'] = np.nan
print(df)
ID t X
0 0 0 0.424902
1 0 1 1.597951
2 0 2 1.453884
3 0 3 NaN
4 1 0 0.534653
5 1 1 -0.318361
6 1 2 0.188290
7 1 3 1.157802
8 1 4 NaN
9 2 0 0.186005
10 2 1 0.036017
11 2 2 1.039822
12 2 3 -1.602205
13 2 4 -0.210601
14 2 5 NaN
If you want the row with the max t value per group to be changed, use idxmax() with a groupby:
df.loc[df.groupby('ID')['t'].idxmax(), 'X'] = np.nan
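A transform-based variant, as a sketch: unlike idxmax(), which flags only the first row at the maximum, this masks every row that ties for the max t within its ID:
mask = df.groupby('ID')['t'].transform('max').eq(df['t'])
df.loc[mask, 'X'] = np.nan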

grouping the index with the highest three values of another column using pandas python

I have a CSV file that contains columns like StateName, Population, CityName... Note that for every state you can have multiple city names, and thus multiple populations within the same state.
What I want is to group by StateName and take the highest three populations in each state.
(What I have and what I want to have were shown as images in the original post.)
my code is:
def answer_six():
    x = census_df['STNAME'].unique()
    census_df2 = df = pd.DataFrame()
    for a in x:
        census_dfcopy = census_df.copy()
        census_dfcopy = census_dfcopy.set_index(['STNAME'])
        census_dfcopy = census_dfcopy.loc[a]
        census_dfcopy = census_dfcopy.reset_index()
        census_dfcopy = census_dfcopy.set_index(['CENSUS2010POP'])
        census_dfcopy1 = census_dfcopy.sort_index(ascending=False)
        census_dfcopy1 = census_dfcopy1.append(census_dfcopy1)
        census_dfcopy1.groupby('STNAME')
    return census_dfcopy1.head(3)

answer_six()
I only get 3 values, from the last state.
to download the csv file please visit the link:
https://drive.google.com/open?id=1ptE6MRQ1NGrfRYBB7NKjqhOJZXlxScPo
You could do
census_df.groupby('STNAME').CENSUS2010POP.nlargest(3)
In action:
In [51]: df
Out[51]:
ctyname pop stname
0 0 10 a
1 1 9 a
2 2 1 a
3 3 3 a
4 4 12 b
5 5 12 b
6 6 13 b
7 7 14 b
8 8 4 c
9 9 3 c
10 10 2 c
11 11 1 c
In [68]: df.groupby('stname').pop.nlargest(3)
Out[68]:
stname
a 0 10
1 9
3 3
b 7 14
6 13
4 12
c 8 4
9 3
10 2
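If a flat DataFrame is preferred over the MultiIndexed Series, chaining reset_index() should do it; a sketch using the question's column names:
census_df.groupby('STNAME').CENSUS2010POP.nlargest(3).reset_index()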

Extract Number from Varying String

Given this data frame:
import numpy as np
import pandas as pd
df = pd.DataFrame({'ID': ['a','b','c','d','e','f','g','h','i','j','k'],
                   'value': ['None', np.nan, '6D', '7', '10D', 'NONE', 'x',
                             '10D aaa', '1 D', '10 D aa', 7]})
df
ID value
0 a None
1 b NaN
2 c 6D
3 d 7
4 e 10D
5 f NONE
6 g x
7 h 10D aaa
8 i 1 D
9 j 10 D aa
10 k i7D
I'd like to extract numbers where present, else return 0, for any mess of situations as shown above.
The desired result is:
ID value
0 a 0
1 b 0
2 c 6
3 d 7
4 e 10
5 f 0
6 g 0
7 h 10
8 i 1
9 j 10
10 k 7
Thanks in advance!
Here is my approach using re.findall and apply:
import re
df['value'].apply(lambda x: 0 if not re.findall(r'\d+', str(x)) else int(re.findall(r'\d+', str(x))[0]))
Try the following using Series.str.replace and fillna:
import numpy as np
import pandas as pd
df = pd.DataFrame({'ID': ['a','b','c','d','e','f','g','h','i','j','k'],
                   'value': ['None', np.nan, '6D', '7', '10D', 'NONE', 'x',
                             '10D aaa', '1 D', '10 D aa', 7]})
df['value'] = (df['value'].fillna(0).astype(str)
                 .str.replace(r'\D+', '', regex=True)  # strip non-digits
                 .replace('', 0)                       # all-text values become 0
                 .astype(int))
Alternatively, you can apply a function to the dataframe via applymap(), following the EAFP principle and catching multiple exceptions while extracting the digits:
import re

def get_number(item):
    try:
        return int(re.search(r"\d+", str(item)).group(0))
    except (AttributeError, ValueError, IndexError):
        return 0

print(df.applymap(get_number))
Prints:
ID value
0 0 0
1 0 0
2 0 6
3 0 7
4 0 10
5 0 0
6 0 0
7 0 10
8 0 1
9 0 10
10 0 7
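For reference, Series.str.extract does this in one vectorized step; a minimal sketch against the same 'value' column as above:
df['value'] = (df['value'].astype(str)
                 .str.extract(r'(\d+)', expand=False)  # first run of digits, NaN if none
                 .fillna(0)
                 .astype(int))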
