Pandas: Convert DataFrame Column Values Into New Dataframe Indices and Columns - python

I have a dataframe that looks like this:
a  b  c
0  1  10
1  2  10
2  2  20
3  3  30
4  1  40
4  3  10
The dataframe above has default (0, 1, 2, 3, 4, ...) indices. I would like to convert it into a dataframe that looks like this:
    1   2   3
0  10   0   0
1   0  10   0
2   0  20   0
3   0   0  30
4  40   0  10
Where column 'a' in the first dataframe becomes the index in the second dataframe, the values of 'b' become the column names and the values of c are copied over, with 0 or NaN filling missing values. The original dataset is large and will result in a very sparse second dataframe. I then intend to add this dataframe to a much larger one, which is straightforward.
Can anyone advise the best way to achieve this please?

You can use the pivot method for this.
See the docs: http://pandas.pydata.org/pandas-docs/stable/reshaping.html#reshaping-by-pivoting-dataframe-objects
An example:
In [1]: import pandas as pd
In [2]: df = pd.DataFrame({'a': [0,1,2,3,4,4], 'b': [1,2,2,3,1,3], 'c': [10,10,20,30,40,10]})
In [3]: df
Out[3]:
   a  b   c
0  0  1  10
1  1  2  10
2  2  2  20
3  3  3  30
4  4  1  40
5  4  3  10
In [4]: df.pivot(index='a', columns='b', values='c')
Out[4]:
b     1     2     3
a
0    10   NaN   NaN
1   NaN    10   NaN
2   NaN    20   NaN
3   NaN   NaN    30
4    40   NaN    10
If you want zeros instead of NaNs, as in your example, you can use fillna:
In [5]: df.pivot(index='a', columns='b', values='c').fillna(0)
Out[5]:
b    1   2   3
a
0   10   0   0
1    0  10   0
2    0  20   0
3    0   0  30
4   40   0  10
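One caveat worth noting: pivot raises a ValueError if any (index, columns) pair occurs more than once. Since the original dataset is large, duplicates are plausible; pivot_table with an aggregation function handles them. This is a sketch assuming that summing duplicate entries is the desired behaviour:

```python
import pandas as pd

# (a=4, b=1) appears twice, so plain pivot would raise ValueError
df = pd.DataFrame({'a': [0, 1, 2, 3, 4, 4],
                   'b': [1, 2, 2, 3, 1, 1],
                   'c': [10, 10, 20, 30, 40, 10]})

# pivot_table aggregates duplicate (index, columns) pairs instead of failing
result = df.pivot_table(index='a', columns='b', values='c',
                        aggfunc='sum', fill_value=0)
print(result)
```

With aggfunc='sum', the two rows for (a=4, b=1) collapse into 40 + 10 = 50; fill_value=0 plays the role of fillna(0).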


SQL: Select Max(A), Min(B), C from Table group by C
I want to do the same operation in pandas on a dataframe. The closest I got was:
DF2 = DF1.groupby(by=['C']).max()
where I end up getting the max of both columns. How do I apply more than one operation while grouping?
You can use the agg function:
DF2 = DF1.groupby('C').agg({'A': max, 'B': min})
Sample:
print(DF1)
   A   B  C  D
0  1   5  a  a
1  7   9  a  b
2  2  10  c  d
3  3   2  c  c
DF2 = DF1.groupby('C').agg({'A': max, 'B': min})
print(DF2)
   A  B
C
a  7  5
c  3  2
GroupBy-fu: improvements in grouping and aggregating data in pandas - nice explanations.
Try the agg() function:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randint(0,5,size=(20, 3)), columns=list('ABC'))
print(df)
print(df.groupby('C').agg({'A': max, 'B':min}))
Output:
    A  B  C
0   2  3  0
1   2  2  1
2   4  0  1
3   0  1  4
4   3  3  2
5   0  4  3
6   2  4  2
7   3  4  0
8   4  2  2
9   3  2  1
10  2  3  1
11  4  1  0
12  4  3  2
13  0  0  1
14  3  1  1
15  4  1  1
16  0  0  0
17  4  0  1
18  3  4  0
19  0  2  4
   A  B
C
0  4  0
1  4  0
2  4  2
3  0  4
4  0  1
Alternatively, you may want to check the pandas.read_sql_query() function.
You can use the agg function:
import pandas as pd
import numpy as np
df.groupby('something').agg({'column1': np.max, 'column2': np.min})
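In newer pandas (0.25+), the same per-column aggregation can also be written with named aggregation, which lets you pick the output column names in the same step. A minimal sketch on a small frame mirroring the sample above:

```python
import pandas as pd

DF1 = pd.DataFrame({'A': [1, 7, 2, 3],
                    'B': [5, 9, 10, 2],
                    'C': ['a', 'a', 'c', 'c']})

# named aggregation: output_name=(source_column, aggregation)
DF2 = DF1.groupby('C').agg(A_max=('A', 'max'), B_min=('B', 'min'))
print(DF2)
```

This avoids the dict-of-functions style and makes the resulting column names (A_max, B_min here) explicit.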

Fast way to fill NaN in DataFrame

I have DataFrame object df with column like that:
[In]: df
[Out]:
    id  sum
0    1  NaN
1    1  NaN
2    1    2
3    1  NaN
4    1    4
5    1  NaN
6    2  NaN
7    2  NaN
8    2    3
9    2  NaN
10   2    8
11   2  NaN
...  ...  ...
[1810601 rows x 2 columns]
I have a lot of NaN values in this column and I want to fill them in the following way:
if the NaN is at the beginning of an id group (the first row for that id), it should be 0
otherwise, the NaN should take the value from the previous row with the same id
Output should be like that:
[In]: df
[Out]:
    id  sum
0    1    0
1    1    0
2    1    2
3    1    2
4    1    4
5    1    4
6    2    0
7    2    0
8    2    3
9    2    3
10   2    8
11   2    8
...  ...  ...
[1810601 rows x 2 columns]
I tried to do it "step by step" using a loop with iterrows(), but it is a very inefficient method. I believe it can be done faster with pandas methods.
Try ffill combined with groupby:
df['sum'] = df.groupby('id')['sum'].ffill().fillna(0)
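A quick check of that one-liner on a small frame mirroring the example above (a sketch, assuming the rows are already ordered by position within each id):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'id': [1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2],
                   'sum': [np.nan, np.nan, 2, np.nan, 4, np.nan,
                           np.nan, np.nan, 3, np.nan, 8, np.nan]})

# forward-fill within each id group, then replace the leading NaNs with 0
df['sum'] = df.groupby('id')['sum'].ffill().fillna(0)
print(df['sum'].tolist())
# -> [0.0, 0.0, 2.0, 2.0, 4.0, 4.0, 0.0, 0.0, 3.0, 3.0, 8.0, 8.0]
```

Because ffill is applied per group, the last value of id 1 never leaks into the first rows of id 2; those leading NaNs are the only ones left for fillna(0) to catch.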

Python Dataframe produce binary output when the value in a column changes

I came across this answer Determining when a column value changes in pandas dataframe about finding when the value in a data frame changes. I have a similar problem but want to produce a binary output.
My code:
df =
    A
0  10
1  20
2  20
3  50
4  50
5  30
df['B'] = df['A'].diff()
df =
    A    B
0  10  NaN
1  20   10
2  20    0
3  50   30
4  50    0
5  30  -20
I am expecting output something like this:
df =
    A    B  C
0  10  NaN  1
1  20   10  1
2  20    0  0
3  50   30  1
4  50    0  0
5  30  -20  1
You just need an additional step to check if B equals 0:
df['B'] = df.A.diff()
df['C'] = df.B.ne(0).view('i1')
print(df)
    A     B  C
0  10   NaN  1
1  20  10.0  1
2  20   0.0  0
3  50  30.0  1
4  50   0.0  0
5  30 -20.0  1
Not recommending it, but since you've asked, we can make it a one-liner with eval:
df['B'], df['C'] = df.assign(B=df.A.diff()).eval('B, B!=0')
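An equivalent and arguably more common spelling uses astype(int) instead of view('i1'); note that the NaN produced by the first diff compares not-equal to 0, so row 0 gets a 1, matching the expected output:

```python
import pandas as pd

df = pd.DataFrame({'A': [10, 20, 20, 50, 50, 30]})
df['B'] = df['A'].diff()          # NaN for the first row, then deltas
df['C'] = df['B'].ne(0).astype(int)  # 1 where the value changed, else 0
print(df['C'].tolist())
# -> [1, 1, 0, 1, 0, 1]
```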

How do I get dataframe values with multiindex where some value is NOT in multiindex?

Here is an example of my df:
                2000-02-01              2000-03-01              ...
                sub_col_one sub_col_two sub_col_one sub_col_two ...
idx_one idx_two
2       a                 5           2           3           3
0       b                 0           5           8           1
2       x                 0           0           6           1
0       d                 8           3           5           5
3       x                 5           6           5           9
2       e                 2           5           0           5
3       x                 1           7           4           4
The question:
How could I get all rows of that df, where idx_two is not equal to x?
I've tried get_level_values, but can't get what I need.
Use Index.get_level_values with the name of the level and boolean indexing:
df1 = df[df.index.get_level_values('idx_two') != 'x']
Or with the position of the level, here 1, because Python counts from 0:
df1 = df[df.index.get_level_values(1) != 'x']
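If you prefer string expressions, DataFrame.query can also reference index levels by name. A small self-contained sketch on a two-level index (the single value column is an assumption for brevity):

```python
import pandas as pd

idx = pd.MultiIndex.from_tuples(
    [(2, 'a'), (0, 'b'), (2, 'x'), (3, 'x')],
    names=['idx_one', 'idx_two'])
df = pd.DataFrame({'val': [5, 0, 0, 1]}, index=idx)

# query resolves idx_two as an index level, not a column
df1 = df.query("idx_two != 'x'")
print(df1)
```

This is equivalent to the boolean-indexing form with get_level_values, just expressed as a string.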

How to add rows into existing dataframe in pandas? - python

df = pd.DataFrame({'a':[1,2,3,4],'b':[5,6,7,8],'c':[9,10,11,12]})
How can I insert a new row of zeros at index 0 in one single line?
I tried pd.concat([pd.DataFrame([[0,0,0]]), df]) but it did not work.
The desired output:
   a  b   c
0  0  0   0
1  1  5   9
2  2  6  10
3  3  7  11
4  4  8  12
You can concat the temp df with the original df, but you need to pass the same column names so that it aligns in the concatenated df. Additionally, to get the index you desire, call reset_index with the drop=True param.
In [87]:
pd.concat([pd.DataFrame([[0,0,0]], columns=df.columns),df]).reset_index(drop=True)
Out[87]:
   a  b   c
0  0  0   0
1  1  5   9
2  2  6  10
3  3  7  11
4  4  8  12
Alternatively to EdChum's solution, you can do this (note that DataFrame.append was removed in pandas 2.0, so this only works on older versions):
In [163]: pd.DataFrame([[0,0,0]], columns=df.columns).append(df, ignore_index=True)
Out[163]:
   a  b   c
0  0  0   0
1  1  5   9
2  2  6  10
3  3  7  11
4  4  8  12
An answer more specific to the dataframe being prepended to:
pd.concat([df.iloc[[0], :] * 0, df]).reset_index(drop=True)
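One more alternative, a sketch assuming the default integer index: write the new row at the label -1, then sort and renumber so it lands at the top:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 4],
                   'b': [5, 6, 7, 8],
                   'c': [9, 10, 11, 12]})

df.loc[-1] = 0                               # broadcast 0 into every column at label -1
df = df.sort_index().reset_index(drop=True)  # -1 sorts first, then renumber 0..4
print(df)
```

This mutates df in place before the reassignment, so it is best suited to one-off scripts rather than shared dataframes.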
