Reorder dataframe groupby medians following custom order [duplicate] - python

This question already has answers here:
How to sort pandas dataframe by custom order on string index
(5 answers)
Closed 18 days ago.
I have a dataset containing a bunch of data in the columns params and value. I'd like to count how many values each params category contains (to use as labels in a boxplot), so I use mydf['params'].value_counts(), which shows this:
slidingwindow_250 11574
hotspots_1k_100 8454
slidingwindow_500 5793
slidingwindow_100 5366
hotspots_5k_500 3118
slidingwindow_1000 2898
hotspots_10k_1k 1772
slidingwindow_2500 1160
slidingwindow_5000 580
Name: params, dtype: int64
I have a list of all of the entries in params in the order I wish to display them in a boxplot. I try to use sort_index(level=myorder) to get them in my custom order, but the function ignores myorder and just sorts them alphabetically.
myorder = ["slidingwindow_100",
           "slidingwindow_250",
           "slidingwindow_500",
           "slidingwindow_1000",
           "slidingwindow_2500",
           "slidingwindow_5000",
           "hotspots_1k_100",
           "hotspots_5k_500",
           "hotspots_10k_1k"]
sizes_bp_log_df['params'].value_counts().sort_index(level=myorder)
hotspots_10k_1k 1772
hotspots_1k_100 8454
hotspots_5k_500 3118
slidingwindow_100 5366
slidingwindow_1000 2898
slidingwindow_250 11574
slidingwindow_2500 1160
slidingwindow_500 5793
slidingwindow_5000 580
Name: params, dtype: int64
How can I get the index of my value counts in the order I want them to be in?
In addition, I'll be using the median of each distribution as coordinates for the boxplot labels, which I retrieve with sizes_bp_log_df.groupby(['params']).median(); hopefully the suggested sort method will also work for that task.

Use reindex instead of sort_index. sort_index can only sort by the index values themselves (its level argument selects which index level to sort by; it is not a custom ordering), whereas reindex takes the exact list of labels you want, in the order you want them.
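A minimal sketch, assuming the sizes_bp_log_df frame and the myorder list from the question:
# value_counts() returns a Series indexed by the params labels;
# reindex() returns it reordered to match myorder exactly
counts = sizes_bp_log_df['params'].value_counts().reindex(myorder)
# the group medians are also indexed by the params labels, so the same call works there
medians = sizes_bp_log_df.groupby('params').median().reindex(myorder)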

Pandas: create columns based on unique values in column [duplicate]

This question already has answers here:
How can I pivot a dataframe?
(5 answers)
Closed 1 year ago.
I have a Pandas data frame with three columns, two of which have identifiers (date and id) and one with the values I actually care about (value). It looks like this:
,date,id,value
0,20210801,269277473,-1114389.6
1,20210802,269277473,-1658061.0
2,20210803,269277473,-1338010.2
3,20210804,269277473,-475779.6
4,20210805,269277473,-1417980.0
5,20210806,269277473,-1673400.6
6,20210807,269277473,-1438969.8
12,20210801,269277476,504300.0
13,20210802,269277476,519889.8
14,20210803,269277476,513899.4
15,20210804,269277476,526258.8
16,20210805,269277476,524730.0
17,20210806,269277476,548010.6
18,20210807,269277476,539031.0
24,20210801,269277480,477399.0
25,20210802,269277480,443499.0
26,20210803,269277480,394801.2
27,20210804,269277480,440100.0
28,20210805,269277480,455499.6
29,20210806,269277480,441100.2
30,20210807,269277480,438899.4
I want to roll the values into a table in which the date is the index, the columns are the ids, and the cells contain the values, like the following:
date,269277473,269277476,269277480
20210801,-1114389.6,504300.0,477399.0
20210802,-1658061.0,519889.8,443499.0
20210803,-1338010.2,513899.4,394801.2
20210804,-475779.6,526258.8,440100.0
20210805,-1417980.0,524730.0,455499.6
20210806,-1673400.6,548010.6,441100.2
20210807,-1438969.8,539031.0,438899.4
Given that my table is huge (hundreds of millions of values), what is the most efficient way of accomplishing this?
You need to apply a pivot:
df.pivot(*df)
which is equivalent in your case (iterating over a DataFrame yields its column names, so *df unpacks to 'date', 'id', 'value' in that order) to:
df.pivot(index='date', columns='id', values='value')
output:
id 269277473 269277476 269277480
date
20210801 -1114389.6 504300.0 477399.0
20210802 -1658061.0 519889.8 443499.0
20210803 -1338010.2 513899.4 394801.2
20210804 -475779.6 526258.8 440100.0
20210805 -1417980.0 524730.0 455499.6
20210806 -1673400.6 548010.6 441100.2
20210807 -1438969.8 539031.0 438899.4
Use pivot:
>>> df.pivot(index='date', columns='id', values='value')
id 269277473 269277476 269277480
date
20210801 -1114389.6 504300.0 477399.0
20210802 -1658061.0 519889.8 443499.0
20210803 -1338010.2 513899.4 394801.2
20210804 -475779.6 526258.8 440100.0
20210805 -1417980.0 524730.0 455499.6
20210806 -1673400.6 548010.6 441100.2
20210807 -1438969.8 539031.0 438899.4
If you have other columns, you can use pivot_table instead, which applies an aggregation function (mean, sum, ...) to the values of each column.
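A minimal sketch of that variant, assuming the same date/id/value columns; the mean is just an illustrative choice of aggregation:
# pivot_table aggregates the values that share a (date, id) pair instead of raising
wide = df.pivot_table(index='date', columns='id', values='value', aggfunc='mean')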

Is loc an optional attribute when searching dataframe? [duplicate]

This question already has answers here:
What is the difference between using loc and using just square brackets to filter for columns in Pandas/Python?
(4 answers)
Closed 1 year ago.
Both the following lines seem to give the same output:
df1 = df[df['MRP'] > 1500]
df1 = df.loc[df['MRP'] > 1500]
Is loc an optional attribute when searching dataframe?
From the pandas.DataFrame.loc documentation:
Access a group of rows and columns by label(s) or a boolean array.
.loc[] is primarily label based, but may also be used with a boolean
array.
When you are using a Boolean array to filter rows, .loc is optional: in your example df['MRP'] > 1500 produces a Boolean Series, so indexing with or without .loc gives the same result.
df[df['MRP']>15]
MRP cat
0 18 A
3 19 D
6 18 C
But if you want to select other columns for the rows where this Boolean Series is True, then you use .loc:
df.loc[df['MRP']>15, 'cat']
0 A
3 D
6 C
Or, if you want to change the values where the condition is True:
df.loc[df['MRP']>15, 'cat'] = 'found'
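A small self-contained sketch of the difference; the frame here is made up to be consistent with the output shown above:
import pandas as pd

df = pd.DataFrame({'MRP': [18, 12, 10, 19, 14, 11, 18],
                   'cat': list('ABCDEFC')})

mask = df['MRP'] > 15           # Boolean Series
df[mask]                        # row filtering: .loc is optional here
df.loc[mask]                    # same rows
df.loc[mask, 'cat']             # selecting rows and a column needs .loc
df.loc[mask, 'cat'] = 'found'   # as does assigning under a condition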

Not getting stats analysis of binary column pandas

I have a dataframe with 11 columns and 18k rows. The last column is either a 1 or a 0, but when I use .describe() all I get is
count 19020
unique 2
top 1
freq 12332
Name: Class, dtype: int64
as opposed to an actual statistical analysis with mean, std, etc.
Is there a way to do this?
If your numeric (0, 1) column is not being picked up automatically by .describe(), it might be because it's not actually encoded as an int dtype. You can see this in the documentation of the .describe() method, which tells you that the default include parameter is only for numeric types:
None (default) : The result will include all numeric columns.
My suggestion would be the following:
df.dtypes # check datatypes
df['num'] = df['num'].astype(int) # if it's not integer, cast it as such
df.describe(include=['object', 'int64']) # explicitly state the data types you'd like to describe
That is, first check the datatypes (I'm assuming the column is called num and the dataframe df, but feel free to substitute with the right ones). If this indicator/(0,1) column is indeed not encoded as int/integer type, then cast it as such by using .astype(int). Then, you can freely use df.describe() and perhaps even specify columns of which data types you want to include in the description output, for more fine-grained control.
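A quick sketch of that behaviour with a made-up column named Class, as in the output above:
import pandas as pd

df = pd.DataFrame({'Class': ['1', '0', '1', '1', '0']})   # stored as strings, i.e. object dtype

df['Class'].describe()             # only count / unique / top / freq

df['Class'] = df['Class'].astype(int)
df['Class'].describe()             # now count / mean / std / min / quartiles / max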
You could use
# percentile list
perc = [.20, .40, .60, .80]
# list of dtypes to include
include = ['object', 'float', 'int']
data.describe(percentiles=perc, include=include)
where data is your dataframe (important point).
Since you are new to Stack Overflow, I'd suggest including some actual code (i.e. something showing how and on what you are calling these methods); you'll get better answers.

Key Errors when accessing some indices of pandas series [duplicate]

This question already has answers here:
How are iloc and loc different?
(6 answers)
Closed 2 years ago.
I am working with the OULAD dataset in pandas, and I'm trying to view the labels of some specific rows. For some reason, some indices produce key errors and some do not.
code:
labels = info["final_result"].copy()
print(type(labels))
print(labels)
gives the result:
<class 'pandas.core.series.Series'>
21847 Fail
19351 Fail
10841 Withdrawn
4360 Withdrawn
8991 Withdrawn
...
29976 Distinction
629 Withdrawn
7329 Pass
25941 Pass
21098 Pass
Name: final_result, Length: 26074, dtype: object
And, for example,
print(labels[10])
prints out:
Pass
which is the correct label.
However,
print(labels[9])
for whatever reason, results in:
KeyError: 9
Any ideas?
That index label probably doesn't exist in the dataframe; with an integer index, labels[9] looks up the label 9, not the tenth position.
Look at this example:
import pandas as pd

tmp = pd.DataFrame({"a": [0, 1, 2, 3]})
tmp = tmp.drop(1)   # the row with label 1 is gone, but labels 0, 2 and 3 remain
tmp["a"][1]
KeyError: 1
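If you actually want "the tenth row" rather than "the row whose label is 9", use positional indexing with iloc instead; a sketch using the labels Series from the question:
labels.iloc[9]       # tenth element by position, whatever its index label is
labels.loc[21847]    # element whose index label is 21847 ('Fail' in the output above)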

Pandas/Python: Replace multiple values in multiple columns

All, I have an analytical csv file with 190 columns and 902 rows. I need to recode values in several columns (18 to be exact) from their current 1-5 Likert scaling to 0-4 Likert scaling.
I've tried using replace:
df.replace({'Job_Performance1': {1:0, 2:1, 3:2, 4:3, 5:4}}, inplace=True)
But that throws a ValueError: "Replacement not allowed with overlapping keys and values"
I can use map:
df['job_perf1'] = df.Job_Performance1.map({1:0, 2:1, 3:2, 4:3, 5:4})
But I know there has to be a more efficient way to accomplish this, since this use case is standard in statistical analysis and in statistical software, e.g. SPSS.
I've reviewed multiple questions on StackOverFlow but none of them quite fit my use case.
e.g. Pandas - replacing column values, pandas replace multiple values one column, Python pandas: replace values multiple columns matching multiple columns from another dataframe
Suggestions?
You can simply subtract a scalar value from your column, which is in effect what you're doing here:
df['job_perf1'] = df['job_perf1'] - 1
And as you need to do this on 18 columns, I'd construct a list of the 18 column names and subtract 1 from all of them at once:
df[col_list] = df[col_list] - 1
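For example, a sketch with hypothetical column names, assuming they follow the Job_Performance1 ... Job_Performance18 pattern:
# hypothetical names; substitute your actual 18 Likert columns
col_list = [f'Job_Performance{i}' for i in range(1, 19)]
df[col_list] = df[col_list] - 1   # recode 1-5 to 0-4 in one vectorised step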
No need for a mapping; effectively, what you're doing is subtracting 1 from each value, so it can be done as a vectorized subtraction. This works elegantly:
import numpy
df['job_perf1'] = df['Job_Performance1'] - numpy.ones(len(df['Job_Performance1']))
Or, without numpy:
df['job_perf1'] = df['Job_Performance1'] - [1] * len(df['Job_Performance1'])
