Pandas: create columns based on unique values in column [duplicate] - python

This question already has answers here:
How can I pivot a dataframe?
(5 answers)
Closed 1 year ago.
I have a Pandas data frame with three columns, two of which have identifiers (date and id) and one with the values I actually care about (value). It looks like this:
,date,id,value
0,20210801,269277473,-1114389.6
1,20210802,269277473,-1658061.0
2,20210803,269277473,-1338010.2
3,20210804,269277473,-475779.6
4,20210805,269277473,-1417980.0
5,20210806,269277473,-1673400.6
6,20210807,269277473,-1438969.8
12,20210801,269277476,504300.0
13,20210802,269277476,519889.8
14,20210803,269277476,513899.4
15,20210804,269277476,526258.8
16,20210805,269277476,524730.0
17,20210806,269277476,548010.6
18,20210807,269277476,539031.0
24,20210801,269277480,477399.0
25,20210802,269277480,443499.0
26,20210803,269277480,394801.2
27,20210804,269277480,440100.0
28,20210805,269277480,455499.6
29,20210806,269277480,441100.2
30,20210807,269277480,438899.4
I want to roll the values into a table in which the date is the index, the columns are the ids, and the cells are the values, like the following:
date,269277473,269277476,269277480
20210801,-1114389.6,504300.0,477399.0
20210802,-1658061.0,519889.8,443499.0
20210803,-1338010.2,513899.4,394801.2
20210804,-475779.6,526258.8,440100.0
20210805,-1417980.0,524730.0,455499.6
20210806,-1673400.6,548010.6,441100.2
20210807,-1438969.8,539031.0,438899.4
Given that my table is huge (hundreds of millions of values), what is the most efficient way of accomplishing this?

You need to apply a pivot:
df.pivot(*df)
which, in your case (as the columns are already in index/columns/values order), is equivalent to:
df.pivot(index='date', columns='id', values='value')
(Note that in recent pandas versions pivot's arguments are keyword-only, so the explicit form is the safer one.)
output:
id 269277473 269277476 269277480
date
20210801 -1114389.6 504300.0 477399.0
20210802 -1658061.0 519889.8 443499.0
20210803 -1338010.2 513899.4 394801.2
20210804 -475779.6 526258.8 440100.0
20210805 -1417980.0 524730.0 455499.6
20210806 -1673400.6 548010.6 441100.2
20210807 -1438969.8 539031.0 438899.4

Use pivot:
>>> df.pivot(index='date', columns='id', values='value')
id 269277473 269277476 269277480
date
20210801 -1114389.6 504300.0 477399.0
20210802 -1658061.0 519889.8 443499.0
20210803 -1338010.2 513899.4 394801.2
20210804 -475779.6 526258.8 440100.0
20210805 -1417980.0 524730.0 455499.6
20210806 -1673400.6 548010.6 441100.2
20210807 -1438969.8 539031.0 438899.4
If you have other columns, you can use pivot_table instead to apply an aggregation function (mean, sum, ...) to the values of each column.
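For example, a minimal pivot_table sketch (aggfunc='mean' here is just an illustrative choice; pick whichever aggregation fits your data):
out = df.pivot_table(index='date', columns='id', values='value', aggfunc='mean')  # also tolerates duplicate (date, id) pairs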

Related

How to select the first column of datasets?

I am trying to get the first column of the dataset to calculate summary statistics such as mean, median, variance, stdev, etc.
This is how I read my csv file
wine_data = pd.read_csv('winequality-white.csv')
I tried to select the first columns two ways
first_col = wine_data[wine_data.columns[0]]
wine_data.iloc[:,0]
But I get this whole result
0 7;0.27;0.36;20.7;0.045;45;170;1.001;3;0.45;8.8;6
1 6.3;0.3;0.34;1.6;0.049;14;132;0.994;3.3;0.49;9...
2 8.1;0.28;0.4;6.9;0.05;30;97;0.9951;3.26;0.44;1...
4896 5.5;0.29;0.3;1.1;0.022;20;110;0.98869;3.34;0.3...
4897 6;0.21;0.38;0.8;0.02;22;98;0.98941;3.26;0.32;1...
Name: fixed acidity;"volatile acidity";"citric acid";"residual sugar";"chlorides";"free sulfur dioxide";"total sulfur dioxide";"density";"pH";"sulphates";"alcohol";"quality", Length: 4898, dtype: object
How can I select just the first column, with values such as 7, 6.3, 8.1, 5.5, 6.0?
You might use the following:
#to see all columns
df.columns
#Selecting one column
df['column_name']
#Selecting multiple columns
df[['column_one', 'column_two','column_four', 'column_seven']]
Or, if you prefer positional indexing, you might use df.iloc.
You can try this:
first_col = wine_data.iloc[:, 0]  # .ix has been removed from pandas; iloc is the positional equivalent
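Note that the printed output in the question shows every field crammed into a single column, which means the file is semicolon-delimited. The root fix is to tell read_csv about the separator (a sketch based on the delimiter visible in the question's output):
wine_data = pd.read_csv('winequality-white.csv', sep=';')  # split on ';' instead of the default ','
first_col = wine_data.iloc[:, 0]  # now returns just the 'fixed acidity' values: 7.0, 6.3, 8.1, ...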

create a new column which is a value_counts of another column in python [duplicate]

This question already has answers here:
pandas add column to groupby dataframe
(3 answers)
Closed 2 years ago.
I have a pandas dataframe df that contains a column, say x, and I would like to create another column out of x which is a value_count of each item in x.
Here is my approach
x_counts = []
for item in df['x']:
    item_count = len(df[df['x'] == item])
    x_counts.append(item_count)
df['x_count'] = x_counts
This works, but it is very inefficient. I am looking for a more efficient way to handle this. Your approaches and recommendations are highly appreciated.
It sounds like you are looking for the groupby function, since you are trying to get the count of items in x.
There are many other function-driven methods, but they may differ between pandas versions.
I suppose that you are looking to group the identical elements and find their count:
df.loc[:, 'x_count'] = 1  # add a new x_count column with value 1 in every row
aggregate_functions = {"x_count": "sum"}
# as_index=False and sort=False keep x as a regular column instead of turning it into the index
df = df.groupby(["x"], as_index=False, sort=False).aggregate(aggregate_functions)
Hope it helps.
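The approach above collapses the frame to one row per unique x. If you want the count as a new column while keeping every original row (as in your loop), a short sketch using groupby().transform:
df['x_count'] = df.groupby('x')['x'].transform('count')  # one count per row, same length as df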

Is there a way to check tail columns from a dataframe? [duplicate]

This question already has answers here:
Selecting last n columns and excluding last n columns in dataframe
(3 answers)
Closed 3 years ago.
I would like to see the last few columns of a dataframe, not the last rows.
Usually, when you want to check a dataframe, the following is the easiest way to do it:
print(df.head(5))
Au Ag Al As ... Zn Zr SAMPLE Alt
0 -0.657577 -1.154902 -0.638272 1.541579 ... 0.477121 0.462398 GH101 phy
1 -2.431798 0.149219 -0.537602 1.086360 ... 1.681241 -0.301030 GH102 ad-ag
2 -2.568636 0.178977 -0.346787 1.025306 ... 1.681241 -0.154902 GH103 ad-ag
3 -2.455932 0.722634 -0.568636 1.378398 ... 1.113943 -0.301030 GH104 ad-ag
4 -3.698970 -1.522879 -0.292430 1.045323 ... 1.556303 -0.154902 GH105 phy
However, I would like to check the columns on the right side of this print result (from "Alt" leftwards), for example the last 5 columns.
Is there any way to do so?
You can use iloc for this, which allows integer-based indexing of a DataFrame. It takes the syntax [i, j], where i indexes rows and j indexes columns, and allows slicing. (docs here)
In this case, you want all the rows and the last five columns, so you can do this:
df.iloc[:, -5:]
You can do:
df[df.columns[-5:]]
or:
df.T.iloc[-5:].T
to select the last five columns (the first transpose turns columns into rows for the slice, the second turns them back).
For rows you can do:
df.iloc[-5:]
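If the goal is just a quick peek, you can also combine both slices, e.g. the first 5 rows of the last 5 columns:
print(df.iloc[:5, -5:])  # head-like view restricted to the last five columns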

Python - Pandas - Groupby conditional on column values in group

I have a dataframe with the following structure with columns group_, vals_ and dates_.
I would like to perform a groupby operation on group_ and subsequently output for each group a statistic conditional on dates. For instance, the mean of all vals_ within a group whose associated date is below some date.
I tried
df_.groupby(group_).agg(lambda x: x[x['date_']< some_date][vals_].mean())
But this fails. I believe it is because x is not a dataframe but a series. Is this correct? Is it possible to achieve what I am trying to achieve here with groupby?
You can write it differently:
def summary(sub_df):
    bool_before = sub_df["date_"] < some_date
    bool_after = sub_df["date_"] > some_date
    before = sub_df.loc[bool_before, vals_].mean()
    after = sub_df.loc[bool_after, vals_].mean()
    overall = sub_df.loc[:, vals_].mean()
    return pd.Series({"before": before, "after": after, "overall": overall})

result = df_.groupby(group_).apply(summary)
The result is a data frame containing 3 mean values for before/after/overall.
If you require additional summary statistics, you can supply them within the summary function.
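If you only need the "before some_date" mean per group, a simpler alternative (a sketch assuming the same group_/vals_ name variables as in the question) is to filter first and then group:
before_means = df_[df_['date_'] < some_date].groupby(group_)[vals_].mean()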

Pandas/Python: Replace multiple values in multiple columns

All, I have an analytical csv file with 190 columns and 902 rows. I need to recode values in several columns (18 to be exact) from their current 1-5 Likert scaling to 0-4 Likert scaling.
I've tried using replace:
df.replace({'Job_Performance1': {1:0, 2:1, 3:2, 4:3, 5:4}}, inplace=True)
But that throws a Value Error: "Replacement not allowed with overlapping keys and values"
I can use map:
df['job_perf1'] = df.Job_Performance1.map({1:0, 2:1, 3:2, 4:3, 5:4})
But I know there has to be a more efficient way to accomplish this, since this use case is standard in statistical analysis and statistical software, e.g. SPSS.
I've reviewed multiple questions on StackOverFlow but none of them quite fit my use case.
e.g. Pandas - replacing column values, pandas replace multiple values one column, Python pandas: replace values multiple columns matching multiple columns from another dataframe
Suggestions?
You can simply subtract a scalar from your column, which is in effect what you're doing here:
df['job_perf1'] = df['Job_Performance1'] - 1
Also, as you need to do this on 18 columns, I'd construct a list of the 18 column names and subtract 1 from all of them at once:
df[col_list] = df[col_list] - 1
No need for a mapping. This can be done as a vectorized subtraction, since effectively what you're doing is subtracting 1 from each value. This works elegantly:
df['job_perf1'] = df['Job_Performance1'] - numpy.ones(len(df['Job_Performance1']))
Or, without numpy:
df['job_perf1'] = df['Job_Performance1'] - [1] * len(df['Job_Performance1'])
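For the 18-column case, a minimal sketch (the column names below are placeholders; substitute your actual 18 Likert columns):
likert_cols = ['Job_Performance1', 'Job_Performance2', 'Job_Satisfaction1']  # ... list all 18 names
df[likert_cols] = df[likert_cols] - 1  # recodes 1-5 to 0-4 in one vectorized step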
