This question already has answers here:
How can I pivot a dataframe?
(5 answers)
Closed 1 year ago.
I have a Pandas data frame with three columns, two of which have identifiers (date and id) and one with the values I actually care about (value). It looks like this:
,date,id,value
0,20210801,269277473,-1114389.6
1,20210802,269277473,-1658061.0
2,20210803,269277473,-1338010.2
3,20210804,269277473,-475779.6
4,20210805,269277473,-1417980.0
5,20210806,269277473,-1673400.6
6,20210807,269277473,-1438969.8
12,20210801,269277476,504300.0
13,20210802,269277476,519889.8
14,20210803,269277476,513899.4
15,20210804,269277476,526258.8
16,20210805,269277476,524730.0
17,20210806,269277476,548010.6
18,20210807,269277476,539031.0
24,20210801,269277480,477399.0
25,20210802,269277480,443499.0
26,20210803,269277480,394801.2
27,20210804,269277480,440100.0
28,20210805,269277480,455499.6
29,20210806,269277480,441100.2
30,20210807,269277480,438899.4
I want to roll the values into a table in which the date is the index, the columns are the ids, and the content is the values, like the following:
date,269277473,269277476,269277480
20210801,-1114389.6,504300.0,477399.0
20210802,-1658061.0,519889.8,443499.0
20210803,-1338010.2,513899.4,394801.2
20210804,-475779.6,526258.8,440100.0
20210805,-1417980.0,524730.0,455499.6
20210806,-1673400.6,548010.6,441100.2
20210807,-1438969.8,539031.0,438899.4
Given that my table is huge (hundreds of millions of values), what is the most efficient way of accomplishing this?
You need to apply a pivot:
df.pivot(*df)
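which is equivalent in your case (as the columns are in order) to the explicit form below. Note that pandas 2.0 and later require keyword arguments for pivot, so only the explicit form works there: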
df.pivot(index='date', columns='id', values='value')
output:
id 269277473 269277476 269277480
date
20210801 -1114389.6 504300.0 477399.0
20210802 -1658061.0 519889.8 443499.0
20210803 -1338010.2 513899.4 394801.2
20210804 -475779.6 526258.8 440100.0
20210805 -1417980.0 524730.0 455499.6
20210806 -1673400.6 548010.6 441100.2
20210807 -1438969.8 539031.0 438899.4
Use pivot:
>>> df.pivot(index='date', columns='id', values='value')
id 269277473 269277476 269277480
date
20210801 -1114389.6 504300.0 477399.0
20210802 -1658061.0 519889.8 443499.0
20210803 -1338010.2 513899.4 394801.2
20210804 -475779.6 526258.8 440100.0
20210805 -1417980.0 524730.0 455499.6
20210806 -1673400.6 548010.6 441100.2
20210807 -1438969.8 539031.0 438899.4
If you have other columns, you can use pivot_table instead to apply an aggregation function (mean, sum, ...) to the values of each column.
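As a rough sketch of where pivot_table becomes necessary: the frame below is shaped like the one in the question, but the values are made up and one (date, id) pair is deliberately duplicated. With duplicates, pivot raises, while pivot_table aggregates them (sum is an arbitrary choice here).

```python
import pandas as pd

# Illustrative data only: the last row duplicates the (20210801, 269277473) pair.
df = pd.DataFrame({
    "date": [20210801, 20210802, 20210801, 20210802, 20210801],
    "id": [269277473, 269277473, 269277476, 269277476, 269277473],
    "value": [10.0, 20.0, 30.0, 40.0, 50.0],
})

# pivot requires unique (index, columns) pairs, so it raises ValueError here.
try:
    df.pivot(index="date", columns="id", values="value")
    pivot_failed = False
except ValueError:
    pivot_failed = True

# pivot_table aggregates the duplicated pair instead of failing.
wide = df.pivot_table(index="date", columns="id", values="value", aggfunc="sum")
print(wide)
```

If every (date, id) pair is unique, as in the question, plain pivot is enough and skips the aggregation step.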
This question already has answers here:
What is the difference between using loc and using just square brackets to filter for columns in Pandas/Python?
(4 answers)
Closed 1 year ago.
Both the following lines seem to give the same output:
df1 = df[df['MRP'] > 1500]
df1 = df.loc[df['MRP'] > 1500]
Is loc an optional attribute when searching dataframe?
From the pandas.DataFrame.loc documentation:
Access a group of rows and columns by label(s) or a boolean array.
.loc[] is primarily label based, but may also be used with a boolean
array.
When you filter rows with a Boolean array, .loc is optional. In your example, df['MRP'] > 1500 produces a Boolean Series (True where the condition holds), so .loc is not necessary in that case.
df[df['MRP']>15]
MRP cat
0 18 A
3 19 D
6 18 C
But if you want to access only certain columns for the rows where this Boolean Series is True, use .loc:
df.loc[df['MRP']>15, 'cat']
0 A
3 D
6 C
Or, if you want to change the values where the condition is True:
df.loc[df['MRP']>15, 'cat'] = 'found'
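The three cases above can be put together in one small sketch. The data is made up for illustration and only the column names MRP and cat come from the answer.

```python
import pandas as pd

# Made-up data for illustration.
df = pd.DataFrame({"MRP": [18, 12, 9, 19], "cat": ["A", "B", "C", "D"]})
mask = df["MRP"] > 15

# Row filtering: plain brackets and .loc return the same rows.
same = df[mask].equals(df.loc[mask])

# Selecting rows and a column in one step needs .loc.
cats = df.loc[mask, "cat"]

# Assignment through .loc modifies df itself.
df.loc[mask, "cat"] = "found"
print(df)
```

The assignment is where .loc matters most: chained indexing such as df[mask]['cat'] = 'found' may act on a temporary copy rather than on df.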
I have a dataframe, 11 columns 18k rows. The last column is either a 1 or 0, but when I use .describe() all I get is
count 19020
unique 2
top 1
freq 12332
Name: Class, dtype: int64
as opposed to an actual statistical analysis with mean, std, etc.
Is there a way to do this?
If your numeric (0, 1) column is not being picked up automatically by .describe(), it might be because it's not actually encoded as an int dtype. You can see this in the documentation of the .describe() method, which tells you that the default include parameter is only for numeric types:
None (default) : The result will include all numeric columns.
My suggestion would be the following:
df.dtypes # check datatypes
df['num'] = df['num'].astype(int) # if it's not integer, cast it as such
df.describe(include=['object', 'int64']) # explicitly state the data types you'd like to describe
That is, first check the datatypes (I'm assuming the column is called num and the dataframe df, but feel free to substitute with the right ones). If this indicator/(0,1) column is indeed not encoded as int/integer type, then cast it as such by using .astype(int). Then, you can freely use df.describe() and perhaps even specify columns of which data types you want to include in the description output, for more fine-grained control.
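A minimal sketch of the effect, assuming the 0/1 column was read in as strings (the column name num and the values are hypothetical):

```python
import pandas as pd

# Hypothetical frame: 'num' holds 0/1 flags stored as strings, not integers.
df = pd.DataFrame({"num": ["1", "0", "1", "1"]})

before = df["num"].describe()      # categorical-style summary: count, unique, top, freq
df["num"] = df["num"].astype(int)  # cast to an integer dtype
after = df["num"].describe()       # numeric summary: count, mean, std, min, quartiles, max

print(before.index.tolist())
print(after.index.tolist())
```

For a 0/1 indicator, the mean reported after the cast is simply the share of rows equal to 1.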
You could use
# percentile list
perc =[.20, .40, .60, .80]
# list of dtypes to include
include =['object', 'float', 'int']
data.describe(percentiles = perc, include = include)
where data is your dataframe (important point).
Since you are new to Stack Overflow, I'd suggest including some actual code (i.e. something showing how, and on what, you are calling your methods). You'll get better answers.
This question already has answers here:
How are iloc and loc different?
(6 answers)
Closed 2 years ago.
I am working with the OULAD dataset in pandas, and I'm trying to view the labels of some specific rows. For some reason, some indices produce a KeyError and some do not.
code:
labels = info["final_result"].copy()
print(type(labels))
print(labels)
gives the result:
<class 'pandas.core.series.Series'>
21847 Fail
19351 Fail
10841 Withdrawn
4360 Withdrawn
8991 Withdrawn
...
29976 Distinction
629 Withdrawn
7329 Pass
25941 Pass
21098 Pass
Name: final_result, Length: 26074, dtype: object
and, for example
print(labels[10])
prints out:
Pass
which is the correct label.
However,
print(labels[9])
for whatever reason, results in:
KeyError: 9
Any ideas?
Most likely the label 9 does not exist in the index. With an integer index, labels[9] looks up the index label 9, not the tenth element.
Look at this example:
import pandas as pd
tmp = pd.DataFrame({"a":[0,1,2,3]})
tmp = tmp.drop(1)  # remove the row whose index label is 1
tmp["a"][1]  # label 1 no longer exists
KeyError: 1
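To make the label/position distinction explicit, here is a small sketch using .loc and .iloc. The Series is made up and only mimics the shuffled index from the question.

```python
import pandas as pd

# Made-up labels with a shuffled, gappy integer index (there is no label 9).
labels = pd.Series(["Fail", "Pass", "Withdrawn"], index=[21847, 10, 629])

by_label = labels.loc[10]      # looks up the index label 10
by_position = labels.iloc[0]   # first element, whatever its label is
has_nine = 9 in labels.index   # check for a label before looking it up

print(by_label, by_position, has_nine)
```

If positional access is what you want throughout, labels.reset_index(drop=True) gives a fresh 0..n-1 index.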
All, I have an analytical csv file with 190 columns and 902 rows. I need to recode values in several columns (18 to be exact) from its current 1-5 Likert scaling to 0-4 Likert scaling.
I've tried using replace:
df.replace({'Job_Performance1': {1:0, 2:1, 3:2, 4:3, 5:4}}, inplace=True)
But that throws a ValueError: "Replacement not allowed with overlapping keys and values"
I can use map:
df['job_perf1'] = df.Job_Performance1.map({1:0, 2:1, 3:2, 4:3, 5:4})
But I know there has to be a more efficient way to accomplish this, since this use case is standard in statistical analysis and statistical software, e.g. SPSS.
I've reviewed multiple questions on Stack Overflow, but none of them quite fit my use case.
e.g. Pandas - replacing column values, pandas replace multiple values one column, Python pandas: replace values multiple columns matching multiple columns from another dataframe
Suggestions?
You can simply subtract a scalar value from your column, which is in effect what you're doing here:
df['job_perf1'] = df['Job_Performance1'] - 1
Since you need to do this on 18 columns, I'd build a list of the 18 column names and subtract 1 from all of them at once:
df[col_list] = df[col_list] - 1
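A minimal sketch of the multi-column version, assuming hypothetical column names and values (two columns stand in for the 18 in the real file):

```python
import pandas as pd

# Hypothetical column names and values; the real file has 18 such columns.
df = pd.DataFrame({
    "Job_Performance1": [1, 3, 5],
    "Job_Performance2": [2, 4, 5],
    "other": [10, 20, 30],
})
col_list = ["Job_Performance1", "Job_Performance2"]

# Shift the 1-5 scale to 0-4 for all listed columns in one vectorized step.
df[col_list] = df[col_list] - 1
print(df)
```

Columns not named in col_list are left untouched, and missing values (NaN) simply stay missing after the subtraction.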
No need for a mapping. This can be done as a vector subtraction, since what you're effectively doing is subtracting 1 from each value. With numpy imported, this works (note that numpy.ones produces floats, so the result is a float column):
df['job_perf1'] = df['Job_Performance1'] - numpy.ones(len(df['Job_Performance1']))
Or, without numpy:
df['job_perf1'] = df['Job_Performance1'] - [1] * len(df['Job_Performance1'])