Grouping data in a dataframe [duplicate] - python

This question already has an answer here:
Pandas long to wide (unmelt or similar?) [duplicate]
(1 answer)
Closed 2 months ago.
I have a pandas DataFrame with a few columns; let's say it looks like this:
Heading 1  Values
A          1
A          2
B          9
B          8
B          6
What I want is to "pivot" or group the table so it would look something like:
Heading 1  Value 1  Value 2  Value 3
A          1        2
B          9        8        6
I have tried grouping the table and pivoting/unpivoting it in several ways, but I cannot figure out how to do it properly.

You can derive a new column that holds a row number (so to speak) for each partition of heading 1:
import pandas as pd

df = pd.DataFrame({"heading 1": ['A', 'A', 'B', 'B', 'B'], "Values": [1, 2, 9, 8, 6]})
df['rn'] = df.groupby(['heading 1']).cumcount() + 1
  heading 1  Values  rn
0         A       1   1
1         A       2   2
2         B       9   1
3         B       8   2
4         B       6   3
Then you can pivot, using the newly derived column as your columns argument:
df = df.pivot(index='heading 1', columns='rn', values='Values').reset_index()
rn heading 1    1    2    3
0          A  1.0  2.0  NaN
1          B  9.0  8.0  6.0
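If you also want the columns labelled "Value 1", "Value 2", ... as in the desired output, here is a minimal sketch of the full pipeline (using the same sample data as above; the add_prefix step is just one way to tidy the labels):
import pandas as pd

df = pd.DataFrame({"Heading 1": ['A', 'A', 'B', 'B', 'B'],
                   "Values": [1, 2, 9, 8, 6]})

out = (df.assign(rn=df.groupby("Heading 1").cumcount() + 1)  # row number within each group
         .pivot(index="Heading 1", columns="rn", values="Values")
         .add_prefix("Value ")                               # columns 1, 2, 3 -> "Value 1", "Value 2", ...
         .reset_index())
out.columns.name = None                                      # drop the leftover "rn" columns label
print(out)
#   Heading 1  Value 1  Value 2  Value 3
# 0         A      1.0      2.0      NaN
# 1         B      9.0      8.0      6.0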

Related

How to create new column in Pandas dataframe where each row is product of previous rows

I have the following DataFrame dt:
   a
0  1
1  2
2  3
3  4
4  5
How do I create a new column where each row is a function of previous rows?
For instance, say the formula is:
B_row(t) = A_row(t-1)+A_row(t-2)+3
Such that:
a b
0 1 /
1 2 /
2 3 6
3 4 8
4 5 10
Also, I often hear that we shouldn't loop through rows in Pandas; however, it seems to me that I would have to go through each row and build the column up recursively, as I would in regular Python.
You could use cumprod:
dt['b'] = dt['a'].cumprod()
Output:
   a    b
0  1    1
1  2    2
2  3    6
3  4   24
4  5  120
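The worked example in the question (b(t) = a(t-1) + a(t-2) + 3) can also be vectorized without looping over rows; a minimal sketch using shifted copies of the column, assuming dt is the frame above:
import pandas as pd

dt = pd.DataFrame({'a': [1, 2, 3, 4, 5]})

# shift(1) and shift(2) hold the previous and second-previous values of 'a'
dt['b'] = dt['a'].shift(1) + dt['a'].shift(2) + 3
print(dt)
#    a     b
# 0  1   NaN
# 1  2   NaN
# 2  3   6.0
# 3  4   8.0
# 4  5  10.0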

float precision in Pandas df [duplicate]

This question already has answers here:
Set value for particular cell in pandas DataFrame using index
(23 answers)
Closed 2 years ago.
I am trying to change a number in the df, but pandas truncates it to an integer.
   A  B
0  1  4
1  2  5
2  3  6
I change a number:
df['B'][1] = 1.2
it gives:
   A  B
0  1  4
1  2  1
2  3  6
instead of:
   A    B
0  1    4
1  2  1.2
2  3    6
Two things are going on here. Chained indexing like df['B'][1] = 1.2 is unreliable (it can end up assigning to a copy, and pandas warns against it), and column B holds integers, so the float gets truncated to 1. Set the cell with .loc instead; pandas will upcast the column to float:
df.loc[1, "B"] = 1.2
result:
   A    B
0  1  4.0
1  2  1.2
2  3  6.0
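Alternatively, if you know the column will need fractional values, you can convert it to float up front so later element assignments are not truncated. A minimal sketch (the astype step is just one option):
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})

df['B'] = df['B'].astype(float)  # B is now float64
df.loc[1, 'B'] = 1.2
print(df)
#    A    B
# 0  1  4.0
# 1  2  1.2
# 2  3  6.0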

group by two columns and use third column as value without using pivot_table [duplicate]

This question already has answers here:
How can I pivot a dataframe?
(5 answers)
Closed 3 years ago.
I have the following dataframe.
   user  movie  rating
0     1      1       3
1     1      2       4
2     2      1       2
3     2      2       5
4     3      1       3
My desired output is
movie  1  2
user
1      3  4
2      2  5
3      3  0
If a user has not rated a movie, I need to have '0' in the corresponding output column, otherwise, the rating value should be present.
Note: I was able to achieve this with pivot_table, but the catch is my dataset contains more than 100000 columns, which gives me "Unstacked DataFrame is too big, causing int32 overflow". I am trying groupby as an alternative to bypass this error.
I am trying the following, but it doesn't include the values from the 'rating' column of my dataframe.
df.groupby(['user', 'movie']).size().unstack('movie', fill_value=0)
Try using pd.crosstab:
pd.crosstab(df.user, df.movie, values=df.rating, aggfunc='first').fillna(0)
# movie    1    2
# user
# 1      3.0  4.0
# 2      2.0  5.0
# 3      3.0  0.0
To get integer values, just append .astype(int), as follows:
pd.crosstab(df.user, df.movie, values=df.rating, aggfunc='first').fillna(0).astype(int)
# movie  1  2
# user
# 1      3  4
# 2      2  5
# 3      3  0
I'm not sure why you would expect movie 3, since it doesn't exist in the original data sample, but other than that this will work for you:
movie_ratings.set_index(['user', 'movie']).unstack('movie', fill_value=0)
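Since the question explicitly asks for a groupby-based alternative, here is a minimal sketch along those lines, assuming df is the frame shown above and each (user, movie) pair appears at most once (so 'first' is a safe aggregation):
import pandas as pd

df = pd.DataFrame({'user':   [1, 1, 2, 2, 3],
                   'movie':  [1, 2, 1, 2, 1],
                   'rating': [3, 4, 2, 5, 3]})

out = (df.groupby(['user', 'movie'])['rating']
         .first()                          # one rating per (user, movie) pair
         .unstack('movie', fill_value=0))  # missing pairs become 0
print(out)
# movie  1  2
# user
# 1      3  4
# 2      2  5
# 3      3  0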

expand pandas groupby results to initial dataframe

Say I have a dataframe df and group it by a few columns into dfg, taking the median of one of its columns. How could I then take those median values and expand them out into a new column of the original df, associated with the respective conditions? This will mean there are duplicates, but I will be using this column in a subsequent calculation, and having it in the original frame makes that possible.
Example data:
import pandas as pd
import numpy as np

data = {'idx': [1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2],
        'condition1': [1,1,2,2,3,3,4,4,1,1,2,2,3,3,4,4],
        'condition2': [1,2,1,2,1,2,1,2,1,2,1,2,1,2,1,2],
        'values': np.random.normal(0, 1, 16)}
df = pd.DataFrame(data)
dfg = df.groupby(['idx', 'condition2'], as_index=False)['values'].median()
example of desired result (note duplicates corresponding to correct conditions):
    idx  condition1  condition2    values   medians
0     1           1           1  0.350310  0.656355
1     1           1           2 -0.291736 -0.024304
2     1           2           1  1.593545  0.656355
3     1           2           2 -1.275154 -0.024304
4     1           3           1  0.075259  0.656355
5     1           3           2  1.054481 -0.024304
6     1           4           1  0.962400  0.656355
7     1           4           2  0.243128 -0.024304
8     2           1           1  1.717391  1.155406
9     2           1           2  0.788847  1.006583
10    2           2           1  1.145891  1.155406
11    2           2           2 -0.492063  1.006583
12    2           3           1 -0.157029  1.155406
13    2           3           2  1.224319  1.006583
14    2           4           1  1.164921  1.155406
15    2           4           2  2.042239  1.006583
I believe you need GroupBy.transform with 'median' for the new column:
df['medians'] = df.groupby(['idx', 'condition2'])['values'].transform('median')
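If you have already computed dfg (as in the question) and would rather attach it than recompute, a minimal sketch merging it back onto df (the 'medians' column name is just illustrative):
# assuming df and dfg are built as in the question
df = df.merge(dfg.rename(columns={'values': 'medians'}),
              on=['idx', 'condition2'],
              how='left')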

how to union two data frames so that every value in one data frame is linked to all values in another using python and pandas [duplicate]

This question already has answers here:
cartesian product in pandas
(13 answers)
Closed 4 years ago.
For example, the data is:
import pandas as pd

a = pd.DataFrame({'aa': [1, 2, 3]})
b = pd.DataFrame({'bb': [4, 5]})
What I want is to union these two data frames so the new frame is:
aa  bb
1   4
1   5
2   4
2   5
3   4
3   5
You can see that every value in a is linked to all the values in b in the new frame. I could probably use tile or repeat to do this, but I have multiple frames that need this done repeatedly, so I want to know if there is a better way.
Could anyone help me out here?
You can do it like this:
In [24]: a['key'] = 1
In [25]: b['key'] = 1
In [27]: pd.merge(a, b, on='key').drop('key', axis=1)
Out[27]:
   aa  bb
0   1   4
1   1   5
2   2   4
3   2   5
4   3   4
5   3   5
You can use pd.MultiIndex.from_product and then reset_index. It generates all the combinations between both sets of data (the same idea as itertools.product):
df_output = (pd.DataFrame(index=pd.MultiIndex.from_product([a.aa, b.bb], names=['aa', 'bb']))
               .reset_index())
and you get
   aa  bb
0   1   4
1   1   5
2   2   4
3   2   5
4   3   4
5   3   5
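In newer pandas versions (1.2 and later), merge also supports how='cross', which produces the same cartesian product without a dummy key. A minimal sketch using the frames from the question:
import pandas as pd

a = pd.DataFrame({'aa': [1, 2, 3]})
b = pd.DataFrame({'bb': [4, 5]})

out = a.merge(b, how='cross')  # every row of a paired with every row of b
print(out)
#    aa  bb
# 0   1   4
# 1   1   5
# 2   2   4
# 3   2   5
# 4   3   4
# 5   3   5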
