I am using the pandas .qcut() function to divide a column 'AveragePrice' into 4 bins. I would like to assign each bin to a new variable. The reason for this is to do a separate analysis on each quartile. IE) I would like something like:
bin1 = quartile 1
bin2 = quartile 2
bin3 = quartile 3
bin4 = quartile 4
Here is what I'm working with.
`pd.qcut(data['AveragePrice'], q=4)`
2         (0.439, 1.1]
3         (0.439, 1.1]
              ...
17596      (1.1, 1.38]
17600      (1.1, 1.38]
Name: AveragePrice, Length: 14127, dtype: category
Categories (4, interval[float64]): [(0.439, 1.1] < (1.1, 1.38] < (1.38, 1.69] < (1.69, 3.25]]
If I understand correctly, you can "pivot" your quartile values into columns.
Toy example:
import pandas as pd
import numpy as np
df = pd.DataFrame({ 'AveragePrice': np.random.randint(0, 100, size=10) })
   AveragePrice
0            20
1            29
2            53
3            30
4             3
5             4
6            78
7            62
8            75
9             1
Create the Quartile column, pivot Quartile into columns, and rename the columns to something more reader-friendly:
df['Quartile'] = pd.qcut(df.AveragePrice, q=4)
pivot = df.reset_index().pivot_table(
    index='index',
    columns='Quartile',
    values='AveragePrice')
pivot.columns = ['Q1', 'Q2', 'Q3', 'Q4']
        Q1    Q2    Q3    Q4
index
0      NaN  20.0   NaN   NaN
1      NaN  29.0   NaN   NaN
2      NaN   NaN  53.0   NaN
3      NaN   NaN  30.0   NaN
4      3.0   NaN   NaN   NaN
5      4.0   NaN   NaN   NaN
6      NaN   NaN   NaN  78.0
7      NaN   NaN   NaN  62.0
8      NaN   NaN   NaN  75.0
9      1.0   NaN   NaN   NaN
Now you can analyze the bins separately, e.g., describe them:
pivot.describe()
             Q1         Q2         Q3         Q4
count  3.000000   2.000000   2.000000   3.000000
mean   2.666667  24.500000  41.500000  71.666667
std    1.527525   6.363961  16.263456   8.504901
min    1.000000  20.000000  30.000000  62.000000
25%    2.000000  22.250000  35.750000  68.500000
50%    3.000000  24.500000  41.500000  75.000000
75%    3.500000  26.750000  47.250000  76.500000
max    4.000000  29.000000  53.000000  78.000000
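If the goal is literally to end up with four separate variables, as in the question's bin1 ... bin4, another option is to split the column with groupby instead of pivoting. A minimal sketch, using randomly generated prices as a stand-in for the real AveragePrice data:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the real data: 100 random prices
rng = np.random.default_rng(0)
data = pd.DataFrame({'AveragePrice': rng.uniform(0.44, 3.25, size=100)})

# Label each row with its quartile, then split into one sub-frame per bin
quartiles = pd.qcut(data['AveragePrice'], q=4, labels=['Q1', 'Q2', 'Q3', 'Q4'])
bins = {label: group for label, group in data.groupby(quartiles, observed=True)}
bin1, bin2, bin3, bin4 = (bins[q] for q in ['Q1', 'Q2', 'Q3', 'Q4'])
```

Each of bin1 ... bin4 is then a DataFrame holding only the rows of one quartile, which sidesteps the NaN padding the pivot approach produces.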
I want to use every 5th row as a reference row (ref_row), and divide that row and the following 4 rows by it.
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randint(0,100,size=(100, 4)), columns=list('ABCD'))
n = df.shape[0]
for idx in range(0, n, 5):
    ref_row = df.iloc[idx:idx+1, :]
    for idx_next in range(idx, idx+5):
        df.iloc[idx_next:idx_next+1, :] = df.iloc[idx_next:idx_next+1, :].div(ref_row)
However, I get all NaN except in the ref_rows.
A B C D
0 1.0 1.0 1.0 1.0
1 NaN NaN NaN NaN
2 NaN NaN NaN NaN
3 NaN NaN NaN NaN
4 NaN NaN NaN NaN
... ... ... ... ...
95 1.0 1.0 1.0 1.0
96 NaN NaN NaN NaN
97 NaN NaN NaN NaN
98 NaN NaN NaN NaN
99 NaN NaN NaN NaN
Any idea what's wrong?
The problem with your code is that df.iloc[idx_next:idx_next+1,:] and df.iloc[idx:idx+1,:] select single rows as DataFrame objects. Dividing two DataFrames aligns them on the row index, and since the two one-row frames have different index labels, every cell comes out NaN. Replace
df.iloc[idx_next:idx_next+1,:]
with
df.iloc[idx_next]
and
df.iloc[idx:idx+1,:]
with
df.iloc[idx]
everywhere, and it will work as expected: rows selected this way are Series objects, so the division aligns on the column labels, which do match.
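Applying that fix, the corrected loop might look like this (a sketch on a smaller frame; the frame is cast to float so the divided values can be written back without dtype issues, and ref_row is copied before the reference row itself is overwritten):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(1, 100, size=(10, 4)),
                  columns=list('ABCD')).astype(float)

n = df.shape[0]
for idx in range(0, n, 5):
    ref_row = df.iloc[idx].copy()  # a Series indexed by the column labels
    for idx_next in range(idx, idx + 5):
        # Series / Series aligns on column labels, so no NaNs appear
        df.iloc[idx_next] = df.iloc[idx_next].div(ref_row)
```

After the loop, every reference row (0, 5, ...) is all 1.0 and the other rows hold ratios relative to their block's reference row.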
You can also build the divisor array by repeating every fifth row of the DataFrame with np.repeat along axis=0, then divide element-wise:
out = df.div(np.repeat(df[::5].to_numpy(), 5, axis=0))
Output:
A B C D
0 1.000000 1.000000 1.000000 1.000000
1 0.726190 0.359375 0.967742 1.644068
2 0.130952 0.046875 0.161290 0.406780
3 0.488095 0.312500 0.919355 0.305085
4 0.857143 0.203125 0.967742 0.525424
.. ... ... ... ...
95 1.000000 1.000000 1.000000 1.000000
96 0.061224 1.400000 0.518519 0.882353
97 1.510204 1.300000 1.740741 5.588235
98 0.224490 2.100000 1.407407 0.294118
99 1.061224 1.400000 1.388889 3.411765
[100 rows x 4 columns]
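A groupby-based variant (a sketch, not from the original answer) achieves the same result and also handles a trailing partial block, since it does not require the length to be a multiple of 5:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(1, 100, size=(12, 4)),
                  columns=list('ABCD')).astype(float)

# Rows 0-4 form block 0, rows 5-9 block 1, rows 10-11 a partial block 2;
# each row is divided by the first row of its block
out = df.div(df.groupby(df.index // 5).transform('first'))
```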
I have some data I would like to organize for visualization and statistics, but I don't know how to proceed.
The data is a pandas DataFrame with 3 columns (stimA, stimB and subjectAnswer) and 10 rows (one per pair), from a pairwise comparison test. Example:
stimA  stimB  subjectAnswer
1      2      36
3      1      55
5      3      98
...    ...    ...
My goal is to organize them as a matrix where each row and column corresponds to one stimulus, with the subjectAnswer values gathered below the matrix diagonal (in my example, the subjectAnswer 36 for stimA 1 and stimB 2 should go to index [2][1]), like this:
stimA/stimB     1     2     3     4     5
1             ...
2              36
3              55
4             ...
5             ...   ...    98
I succeeded in pivoting the first table into a matrix, but I could not arrange the data below the diagonal. Here is my code:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
session1 = pd.read_csv(filepath, names=['stimA', 'stimB', 'subjectAnswer'])
pivoted = session1.pivot('stimA','stimB','subjectAnswer')
Which gives :
session1 :
stimA stimB subjectAnswer
0 1 3 6
1 4 3 21
2 4 5 26
3 2 3 10
4 1 2 6
5 1 5 6
6 4 1 6
7 5 2 13
8 3 5 15
9 2 4 26
pivoted :
stimB 1 2 3 4 5
stimA
1 NaN 6.0 6.0 NaN 6.0
2 NaN NaN 10.0 26.0 NaN
3 NaN NaN NaN NaN 15.0
4 6.0 NaN 21.0 NaN 26.0
5 NaN 13.0 NaN NaN NaN
The expected output for pivoted :
stimB 1 2 3 4 5
stimA
1      NaN   NaN   NaN   NaN NaN
2      6.0   NaN   NaN   NaN NaN
3      6.0  10.0   NaN   NaN NaN
4      6.0  26.0  21.0   NaN NaN
5      6.0  13.0  15.0  26.0 NaN
Thanks a lot for your help !
If I understand you correctly, the stimuli A and B are interchangeable. So to get the matrix layout you want, you can swap A with B in those rows where A is smaller than B. In other words, you don't use the original A and B for the pivot table, but the maximum and minimum of A and B:
session1['stim_min'] = np.min(session1[['stimA', 'stimB']], axis=1)
session1['stim_max'] = np.max(session1[['stimA', 'stimB']], axis=1)
pivoted = session1.pivot(index='stim_max', columns='stim_min', values='subjectAnswer')
pivoted
stim_min 1 2 3 4
stim_max
2 6.0 NaN NaN NaN
3 6.0 10.0 NaN NaN
4 6.0 26.0 21.0 NaN
5 6.0 13.0 15.0 26.0
Sort the columns stimA and stimB along the columns axis and assign two temporary columns, x and y, to the DataFrame. Sorting is required to ensure that every pair becomes (smaller, larger), so the values land on one side of the diagonal.
Then pivot the DataFrame with y as the index, x as the columns and subjectAnswer as the values, and reindex the reshaped frame so that all available unique stim names are present in both the index and the columns of the matrix:
session1[['x', 'y']] = np.sort(session1[['stimA', 'stimB']], axis=1)
i = np.union1d(session1['x'], session1['y'])
session1.pivot(index='y', columns='x', values='subjectAnswer').reindex(index=i, columns=i)
x 1 2 3 4 5
y
1 NaN NaN NaN NaN NaN
2 6.0 NaN NaN NaN NaN
3 6.0 10.0 NaN NaN NaN
4 6.0 26.0 21.0 NaN NaN
5 6.0 13.0 15.0 26.0 NaN
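For reference, here is a self-contained version of the same approach using the question's data; note that in pandas >= 2.0, pivot only accepts keyword arguments:

```python
import numpy as np
import pandas as pd

session1 = pd.DataFrame({
    'stimA': [1, 4, 4, 2, 1, 1, 4, 5, 3, 2],
    'stimB': [3, 3, 5, 3, 2, 5, 1, 2, 5, 4],
    'subjectAnswer': [6, 21, 26, 10, 6, 6, 6, 13, 15, 26],
})

# Sort each (stimA, stimB) pair so x is the smaller and y the larger stimulus
session1[['x', 'y']] = np.sort(session1[['stimA', 'stimB']], axis=1)
i = np.union1d(session1['x'], session1['y'])

pivoted = (session1.pivot(index='y', columns='x', values='subjectAnswer')
                   .reindex(index=i, columns=i))
```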
I'd like to sort the output of pandas' describe method, first by the column data type and then if possible by column name... so that all the columns with dates show up together in one group, then another grouping with the ints, then the strings and so on. How can this be done?
This is as far as I've got; sort_values causes it to crash:
df.describe(include='all').sort_values(by=df.dtypes.astype(str)).transpose()
For me it works to first sort by index with Series.sort_index, then by values with Series.sort_values, and finally change the column order with DataFrame.reindex:
import pandas as pd

df = pd.DataFrame({
    'V': list('abcdef'),
    'B': [4., 5, 4, 5, 5, 4],
    'C': [7, 8, 9, 4, 2, 3],
    'D': [1., 3, 5, 7, 1, 0],
    'E': [5, 3, 6, 9, 2, 4],
    'F': list('aaabbb')
})
df1 = df.describe(include='all')
c = df.dtypes.astype(str).sort_index().sort_values()
print (c)
B float64
D float64
C int64
E int64
F object
V object
dtype: object
df2 = df1.reindex(columns=c.index)
print (df2)
B D C E F V
count 6.000000 6.000000 6.000000 6.000000 6 6
unique NaN NaN NaN NaN 2 6
top NaN NaN NaN NaN b c
freq NaN NaN NaN NaN 3 1
mean 4.500000 2.833333 5.500000 4.833333 NaN NaN
std 0.547723 2.714160 2.880972 2.483277 NaN NaN
min 4.000000 0.000000 2.000000 2.000000 NaN NaN
25% 4.000000 1.000000 3.250000 3.250000 NaN NaN
50% 4.500000 2.000000 5.500000 4.500000 NaN NaN
75% 5.000000 4.500000 7.750000 5.750000 NaN NaN
max 5.000000 7.000000 9.000000 9.000000 NaN NaN
An alternative solution: create a DataFrame from the dtypes Series and sort by both columns with DataFrame.sort_values:
df1 = df.describe(include='all')
c1 = (df.dtypes.astype(str)
        .rename_axis('a')
        .reset_index(name='b')
        .sort_values(['b', 'a']))
print (c1)
a b
1 B float64
3 D float64
2 C int64
4 E int64
5 F object
0 V object
df2 = df1.reindex(columns=c1['a'])
print (df2)
a B D C E F V
count 6.000000 6.000000 6.000000 6.000000 6 6
unique NaN NaN NaN NaN 2 6
top NaN NaN NaN NaN b c
freq NaN NaN NaN NaN 3 1
mean 4.500000 2.833333 5.500000 4.833333 NaN NaN
std 0.547723 2.714160 2.880972 2.483277 NaN NaN
min 4.000000 0.000000 2.000000 2.000000 NaN NaN
25% 4.000000 1.000000 3.250000 3.250000 NaN NaN
50% 4.500000 2.000000 5.500000 4.500000 NaN NaN
75% 5.000000 4.500000 7.750000 5.750000 NaN NaN
max 5.000000 7.000000 9.000000 9.000000 NaN NaN
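The same column order can also be obtained without intermediate objects, by sorting the column names with a plain Python key (a sketch, shown on the same example frame):

```python
import pandas as pd

df = pd.DataFrame({
    'V': list('abcdef'),
    'B': [4., 5, 4, 5, 5, 4],
    'C': [7, 8, 9, 4, 2, 3],
    'D': [1., 3, 5, 7, 1, 0],
    'E': [5, 3, 6, 9, 2, 4],
    'F': list('aaabbb')
})

# Sort column names by (dtype as string, name), then select in that order
cols = sorted(df.columns, key=lambda c: (str(df[c].dtype), c))
df2 = df.describe(include='all')[cols]
```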
I am adding a new function which converts the DataFrame to a lower triangle if it is an upper triangle, and vice versa. The data I am using always has the first two rows filled with the first index only.
I tried using the solution from this problem: Pandas: convert upper triangular dataframe by shifting rows to the left.
Data :
0 1 2 3
0 1.000000 NaN NaN NaN
1 0.421655 NaN NaN NaN
2 0.747064 5.000000 NaN NaN
3 0.357616 0.631622 8.000000 NaN
which should be turned into:
Data :
0 1 2 3
0 NaN 8.000000 0.631622 0.357616
1 NaN NaN 5.000000 0.747064
2 NaN NaN NaN 0.421655
3 NaN NaN NaN 1.000000
You just need to reverse the order of both the rows and the columns:
yourdf=df.iloc[::-1,::-1]
yourdf
Out[94]:
3 2 1 0
3 NaN 8.0 0.631622 0.357616
2 NaN NaN 5.000000 0.747064
1 NaN NaN NaN 0.421655
0 NaN NaN NaN 1.000000
Since pandas itself depends on numpy, numpy will already be installed, so using numpy.flip is another, arguably more readable option:
In [722]: df
Out[722]:
0 1 2 3
0 1.000000 NaN NaN NaN
1 0.421655 NaN NaN NaN
2 0.747064 5.000000 NaN NaN
3 0.357616 0.631622 8.0 NaN
In [724]: import numpy as np
In [725]: df_flip = pd.DataFrame(np.flip(df.values))
In [726]: df_flip
Out[726]:
0 1 2 3
0 NaN 8.0 0.631622 0.357616
1 NaN NaN 5.000000 0.747064
2 NaN NaN NaN 0.421655
3 NaN NaN NaN 1.000000
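Note that np.flip on the raw values drops the original labels (the result above is relabeled 0..3 from scratch). If you want the flipped data but the original index and columns, you can reattach them, as in this sketch:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1.0,      np.nan,   np.nan, np.nan],
                   [0.421655, np.nan,   np.nan, np.nan],
                   [0.747064, 5.0,      np.nan, np.nan],
                   [0.357616, 0.631622, 8.0,    np.nan]])

# Flip both axes of the data, but keep the forward-ordered labels
df_flip = pd.DataFrame(np.flip(df.to_numpy()),
                       index=df.index, columns=df.columns)
```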
I have a dataframe that looks like the following. There are one or more consecutive rows where y_l is populated and y_h is NaN, and vice versa.
When there is more than one consecutive populated row between the NaNs, we only want to keep the one with the lowest y_l (or the highest y_h).
E.g. in the df below, of the three consecutive y_l rows (97, 95, 98) we would keep only the second (95) and discard the other two.
What would be a smart way to implement that?
import numpy as np
import pandas as pd

df = pd.DataFrame({'y_l': [np.nan, 97, 95, 98, np.nan],
                   'y_h': [90, np.nan, np.nan, np.nan, 95]},
                  columns=['y_l', 'y_h'])
>>> df
y_l y_h
0 NaN 90.0
1 97.0 NaN
2 95.0 NaN
3 98.0 NaN
4 NaN 95.0
Desired result:
y_l y_h
0 NaN 90.0
1 95.0 NaN
2 NaN 95.0
You need to create a new column or Series that distinguishes each run of consecutive values, then use groupby and aggregate with agg; last, restore the column order with reindex:
a = df['y_l'].isnull()
b = a.ne(a.shift()).cumsum()
df = (df.groupby(b, as_index=False)
.agg({'y_l':'min', 'y_h':'max'})
.reindex(columns=['y_l','y_h']))
print (df)
y_l y_h
0 NaN 90.0
1 95.0 NaN
2 NaN 95.0
Detail:
print (b)
0    1
1    2
2    2
3    2
4    3
Name: y_l, dtype: int32
What if you had more columns?
for example
df = pd.DataFrame({'A': [np.nan, 15, 20, 25, np.nan],
                   'y_l': [np.nan, 97, 95, 98, np.nan],
                   'y_h': [90, np.nan, np.nan, np.nan, 95]},
                  columns=['A', 'y_l', 'y_h'])
>>> df
A y_l y_h
0 NaN NaN 90.0
1 15.0 97.0 NaN
2 20.0 95.0 NaN
3 25.0 98.0 NaN
4 NaN NaN 95.0
How could you keep the values in column A after filtering out the irrelevant rows as below?
A y_l y_h
0 NaN NaN 90.0
1 20.0 95.0 NaN
2 NaN NaN 95.0
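One possible way to keep the extra columns (a sketch, not from the original answer): instead of aggregating column by column, select the whole row that achieves the minimum y_l (or the maximum y_h) within each run:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [np.nan, 15, 20, 25, np.nan],
                   'y_l': [np.nan, 97, 95, 98, np.nan],
                   'y_h': [90, np.nan, np.nan, np.nan, 95]})

# Same run labels as in the original answer
a = df['y_l'].isnull()
b = a.ne(a.shift()).cumsum()

# Keep the full row with the lowest y_l in y_l-runs, the highest y_h otherwise
def pick(g):
    if g['y_l'].notna().any():
        return g.loc[[g['y_l'].idxmin()]]
    return g.loc[[g['y_h'].idxmax()]]

out = df.groupby(b, group_keys=False).apply(pick).reset_index(drop=True)
```

Because the whole winning row is kept, the A values travel along with y_l and y_h, giving the desired output above.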