DataFrame.pivot(): different result from what I expected - python

I'm referring to
https://github.com/pandas-dev/pandas/tree/main/doc/cheatsheet.
As you can see there, after pivot() all values end up in rows 0 and 1.
But when I use pivot() myself, the result is different, as shown below.
DataFrame before pivot():
DataFrame after pivot():
Is this result intentional?

In your data, the grey column (index of the row) is missing:
df = pd.DataFrame({'variable': list('aaabbbccc'), 'value': range(9)})
print(df)
# Output
  variable  value
0        a      0
1        a      1
2        a      2
3        b      3
4        b      4
5        b      5
6        c      6
7        c      7
8        c      8
Add the grey column:
df['grey'] = df.groupby('variable').cumcount()
print(df)
# Output
  variable  value  grey
0        a      0     0
1        a      1     1
2        a      2     2
3        b      3     0
4        b      4     1
5        b      5     2
6        c      6     0
7        c      7     1
8        c      8     2
Now you can pivot:
df = df.pivot(index='grey', columns='variable', values='value')
print(df)
# Output
variable  a  b  c
grey
0         0  3  6
1         1  4  7
2         2  5  8
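For reference, the same result can be produced in one chained expression starting from the original two-column frame; a minimal sketch, assuming a pandas version where pivot() accepts keyword arguments:
import pandas as pd

df = pd.DataFrame({'variable': list('aaabbbccc'), 'value': range(9)})
out = (df.assign(grey=df.groupby('variable').cumcount())  # row number within each 'variable' group
         .pivot(index='grey', columns='variable', values='value'))
print(out)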
Take the time to read How can I pivot a dataframe?

Related

Pandas new column with values from same id, following condition

I have a DataFrame with multiple columns. I'll provide code for an artificial df for reproduction:
import pandas as pd
from itertools import product
df = pd.DataFrame(data=list(product([0,1,2], [0,1,2], [0,1,2])), columns=['A', 'B','C'])
df['D'] = range(len(df))
This results in the following dataframe:
   A  B  C  D
0  0  0  0  0
1  0  0  1  1
2  0  0  2  2
3  0  1  0  3
4  0  1  1  4
5  0  1  2  5
6  0  2  0  6
7  0  2  1  7
8  0  2  2  8
9  1  0  0  9
I want to get a new column new_D that takes the D value where C fulfills a condition and spreads it over all rows with matching values in columns A and B.
The following code does exactly that:
new_df = df[['A','B', 'D']].loc[df['C'] == 0]
new_df.columns = ['A', 'B','new_D']
df = df.merge(new_df, on=['A', 'B'], how='outer')
However, I strongly believe there is a better solution, one where I do not have to introduce a whole new DataFrame and merge it back in.
Preferably a one-liner.
Thanks in advance.
Desired Output:
   A  B  C  D  new_D
0  0  0  0  0      0
1  0  0  1  1      0
2  0  0  2  2      0
3  0  1  0  3      3
4  0  1  1  4      3
5  0  1  2  5      3
6  0  2  0  6      6
7  0  2  1  7      6
8  0  2  2  8      6
9  1  0  0  9      9
EDIT:
Adding another example:
   A  B    C  D
0  0  4  foo  0
1  0  4  bar  1
2  0  4  baz  2
3  0  5  foo  3
4  0  5  bar  4
5  0  5  baz  5
6  0  6  foo  6
7  0  6  bar  7
8  0  6  baz  8
9  1  4  foo  9
This should be turned into the following, with the condition being df['C'] == 'bar':
   A  B    C  D  new_D
0  0  4  foo  0      1
1  0  4  bar  1      1
2  0  4  baz  2      1
3  0  5  foo  3      4
4  0  5  bar  4      4
5  0  5  baz  5      4
6  0  6  foo  6      7
7  0  6  bar  7      7
8  0  6  baz  8      7
9  1  4  foo  9     10
Meaning all numbers are arbitrary. The order is also not necessarily the same; it just happens that taking the first number works here.
If you want to get a new baseline every time C equals zero, you can use:
df['new_D'] = df['D'].where(df['C'].eq(0)).ffill(downcast='infer')
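Note that newer pandas versions deprecate the downcast argument of ffill; if you see a warning, an equivalent sketch (assuming the first row satisfies the condition, so there is no leading NaN) is:
# forward-fill first, then cast, instead of downcast='infer'
df['new_D'] = df['D'].where(df['C'].eq(0)).ffill().astype(int)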
old answer
What you want is not fully clear, but it looks like you want to repeat the first item per group of A and B. You can easily achieve this with:
df['new_D'] = df.groupby(['A', 'B'])['D'].transform('first')
Even simpler, if your data is really composed of consecutive integers:
df['new_D'] = df['D'] // 3 * 3
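For the edited example with the condition df['C'] == 'bar', the same transform('first') idea can be combined with where(); a sketch, assuming each (A, B) group contains at most one row where C equals 'bar':
# keep D only on the rows matching the condition, then broadcast it per (A, B) group
mask = df['C'].eq('bar')
df['new_D'] = df['D'].where(mask).groupby([df['A'], df['B']]).transform('first')
Groups without a matching row end up with NaN in new_D.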

Pandas drop duplicates based on 2 columns, having different values

How can I drop duplicates in this specific way:
Index  B  C
1      2  1
2      2  0
3      3  1
4      3  1
5      4  0
6      4  0
7      4  0
8      5  1
9      5  0
10     5  1
Desired output:
Index  B  C
3      3  1
5      4  0
So I want to drop duplicates on B, but only when C has the same value in every row of the group, keeping one sample/record.
For example, B = 3 for index 3/4, but since C = 1 for both, I do not drop them all.
But for B = 5 at index 8/9/10, since C is a mix of 1 and 0, all of those rows get dropped.
Try this, using transform with nunique and drop_duplicates:
df[df.groupby('B')['C'].transform('nunique') == 1].drop_duplicates(subset='B')
Output:
       B  C
Index
3      3  1
5      4  0
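If the one-liner is hard to follow, here is the same logic split into steps, rebuilding the question's data so the snippet runs on its own:
import pandas as pd

df = pd.DataFrame({'B': [2, 2, 3, 3, 4, 4, 4, 5, 5, 5],
                   'C': [1, 0, 1, 1, 0, 0, 0, 1, 0, 1]},
                  index=pd.Index(range(1, 11), name='Index'))

# number of distinct C values per B group, broadcast back to every row
n_unique = df.groupby('B')['C'].transform('nunique')

# keep only groups where C is constant, then one representative row per B
print(df[n_unique == 1].drop_duplicates(subset='B'))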

How do I multiply a pandas column with a part of a multi index dataframe

I have a data frame with a multi index and one column.
The index fields are type and amount; the column is called count.
I would like to add a column that multiplies amount and count.
df2 = df.groupby(['type','amount']).count().copy()
# I then dropped all columns but one and renamed it to "count"
df2['total_amount'] = df2['count'].multiply(df2['amount'], axis='index')
This doesn't work; I get a KeyError on amount.
How do I access a part of the multi index to use it in calculations?
Use GroupBy.transform to get a Series of aggregated values with the same length as the original df, so the multiplication is possible:
count = df.groupby(['type','amount'])['type'].transform('count')
df['total_amount'] = df['amount'].multiply(count, axis='index')
print(df)
   A  amount  C  D  E type  total_amount
0  a       4  7  1  5    a             8
1  b       5  8  3  3    a             5
2  c       4  9  5  6    a             8
3  d       5  4  7  9    b            10
4  e       5  2  1  2    b            10
5  f       4  3  0  4    b             4
Or:
df = pd.DataFrame({'A': list('abcdef'),
                   'amount': [4, 5, 4, 5, 5, 4],
                   'C': [7, 8, 9, 4, 2, 3],
                   'D': [1, 3, 5, 7, 1, 0],
                   'E': [5, 3, 6, 9, 2, 4],
                   'type': list('aaabbb')})
print(df)
   A  amount  C  D  E type
0  a       4  7  1  5    a
1  b       5  8  3  3    a
2  c       4  9  5  6    a
3  d       5  4  7  9    b
4  e       5  2  1  2    b
5  f       4  3  0  4    b
df2 = df.groupby(['type','amount'])['type'].count().to_frame('count')
df2['total_amount'] = df2['count'].mul(df2.index.get_level_values('amount'))
print(df2)
             count  total_amount
type amount
a    4           2             8
     5           1             5
b    4           1             4
     5           2            10
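If working with the MultiIndex directly feels awkward, an alternative sketch is to move the index levels back into ordinary columns with reset_index() and multiply plain columns (df3 is just an illustrative name):
df3 = df.groupby(['type', 'amount'])['type'].count().to_frame('count').reset_index()
df3['total_amount'] = df3['count'] * df3['amount']
print(df3)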

Pad dataframe discontinuous column

I have the following dataframe:
  Name  B  C  D  E
1    A  1  2  2  7
2    A  7  1  1  7
3    B  1  1  3  4
4    B  2  1  3  4
5    B  3  1  3  4
What I'm trying to do is obtain a new dataframe in which, for rows with the same "Name", the elements in the "B" column are continuous. In this example, for rows with "Name" = A, the dataframe would have to be padded with B values ranging from 1 to 7, and the values for columns C, D, E in the padded rows should be 0.
   Name  B  C  D  E
1     A  1  2  2  7
2     A  2  0  0  0
3     A  3  0  0  0
4     A  4  0  0  0
5     A  5  0  0  0
6     A  6  0  0  0
7     A  7  0  0  0
8     B  1  1  3  4
9     B  2  1  5  4
10    B  3  4  3  6
What I've done so far is to turn the B column values for the same "Name" into continuous values:
new_idx = df_.groupby('Name').apply(lambda x: np.arange(x.index.min(), x.index.max() + 1)).apply(pd.Series).stack()
and then reindexing the original df (with B set as the index) using this new Series, but I'm having trouble reindexing with duplicates. Any help would be appreciated.
You can use:
def f(x):
    a = np.arange(x.index.min(), x.index.max() + 1)
    return x.reindex(a, fill_value=0)

new_idx = (df.set_index('B')
             .groupby('Name')
             .apply(f)
             .drop(columns='Name')
             .reset_index()
             .reindex(columns=df.columns))
print(new_idx)
  Name  B  C  D  E
0    A  1  2  2  7
1    A  2  0  0  0
2    A  3  0  0  0
3    A  4  0  0  0
4    A  5  0  0  0
5    A  6  0  0  0
6    A  7  1  1  7
7    B  1  1  3  4
8    B  2  1  3  4
9    B  3  1  3  4
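As an alternative sketch that avoids groupby.apply entirely, you could reindex each group in a plain loop and concatenate; the question's data is rebuilt here so the snippet is self-contained:
import numpy as np
import pandas as pd

df = pd.DataFrame({'Name': ['A', 'A', 'B', 'B', 'B'],
                   'B': [1, 7, 1, 2, 3],
                   'C': [2, 1, 1, 1, 1],
                   'D': [2, 1, 3, 3, 3],
                   'E': [7, 7, 4, 4, 4]})

parts = []
for name, group in df.groupby('Name'):
    full_b = np.arange(group['B'].min(), group['B'].max() + 1)
    padded = (group.set_index('B')
                   .reindex(full_b, fill_value=0)  # padded rows get 0 in C, D, E
                   .assign(Name=name)              # fill_value also hit Name, so restore it
                   .rename_axis('B')
                   .reset_index())
    parts.append(padded)

out = pd.concat(parts, ignore_index=True)[df.columns]
print(out)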

Add a new column to a pandas DataFrame with a value that depends on the previous row

I have an existing pandas DataFrame, and I want to add a new column, where the value of each row will depend on the previous row.
for example:
df1 = pd.DataFrame(np.random.randint(10, size=(4, 4)), columns=['a', 'b', 'c', 'd'])
df1
Out[31]:
   a  b  c  d
0  9  3  3  0
1  3  9  5  1
2  1  7  5  6
3  8  0  1  7
Now I want to create a column e, where for each row i the value would be: df1['e'][i] = df1['d'][i] - df1['d'][i-1]
desired output:
df1:
   a  b  c  d  e
0  9  3  3  0  0
1  3  9  5  1  1
2  1  7  5  6  5
3  8  0  1  7  1
How can I achieve this?
You can use sub with shift:
df1['e'] = df1['d'].sub(df1['d'].shift(), fill_value=0)
print(df1)
   a  b  c  d    e
0  9  3  3  0  0.0
1  3  9  5  1  1.0
2  1  7  5  6  5.0
3  8  0  1  7  1.0
If need convert to int:
df1['e'] = df1['d'].sub(df1['d'].shift(), fill_value=0).astype(int)
print(df1)
   a  b  c  d  e
0  9  3  3  0  0
1  3  9  5  1  1
2  1  7  5  6  5
3  8  0  1  7  1
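A closely related sketch uses diff(); its first row is NaN (there is no previous row), so fill it with d itself, which matches the d - 0 behaviour of fill_value=0:
df1['e'] = df1['d'].diff().fillna(df1['d']).astype(int)
print(df1)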
