Get matrix from list of tuples [duplicate] - python

This question already has answers here:
How can I pivot a dataframe?
(5 answers)
Closed 5 years ago.
I have a list of (x,y,z) tuples in a dataframe A.
How can I produce a dataframe B which represents the underlying matrix of A, using the existing values of x and y as index and columns values, respectively?
Example:
A:
   x  y    z
   1  1    1
   1  2   10
   2  1  100
B:
     1    2
1    1   10
2  100  NaN

For this data frame df:
   x  y    z
0  1  1    1
1  1  2   10
2  2  1  100
pivoting:
df.pivot(index='x', columns='y')
works:
       z
y      1     2
x
1    1.0  10.0
2  100.0   NaN
You can also clean the column and index names:
res = df.pivot(index='x', columns='y')
res.index.name = None
res.columns = res.columns.levels[1].values
print(res)
Output:
       1     2
1    1.0  10.0
2  100.0   NaN
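Passing values='z' explicitly keeps the result to a single column level, which sidesteps most of the column cleaning; a self-contained sketch of the whole approach, rebuilding the example data from the question:

```python
import pandas as pd

# The (x, y, z) tuples from the question
df = pd.DataFrame({'x': [1, 1, 2], 'y': [1, 2, 1], 'z': [1, 10, 100]})

# x becomes the index, y the columns, z the cell values;
# missing (x, y) combinations come out as NaN
res = df.pivot(index='x', columns='y', values='z')

# Clear the leftover axis names from the pivot
res.index.name = None
res.columns.name = None
print(res)
```

With values='z' the columns are the plain y values, so there is no MultiIndex to unpack via res.columns.levels[1].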

Add a column using calculations involving the first element of value's group [duplicate]

This question already has answers here:
Pandas: groupby and make a new column applying aggregate to two columns
(2 answers)
Closed 3 years ago.
Here is an example dataframe:
prop1 prop2 prop3 value
a x 1 2
a x 2 3
a y 1 4
a y 2 5
b x 1 6
b x 2 7
b y 1 8
b y 2 9
I need to add a calculated column where each value is, for example, divided by the first element of its group:
prop1  prop2  prop3  value  calculated
a      x      1      2      2/2
a      x      2      3      3/2
a      y      1      4      4/4
a      y      2      5      5/4
b      x      1      6      6/6
b      x      2      7      7/6
b      y      1      8      8/8
b      y      2      9      9/8
Honestly, I don't know how to implement this. I tried:
df['calculated'] = \
df['value'] / df.groupby(['prop1', 'prop2']).agg('first')['value']
but it gives me ValueError: cannot join with no level specified and no overlapping names.
How to calculate this column?
Try transform on the series groupby:
df['calculated'] = df['value'].div(
    df.groupby(['prop1', 'prop2'])['value'].transform('first')
)
Output:
prop1 prop2 prop3 value calculated
0 a x 1 2 1.000000
1 a x 2 3 1.500000
2 a y 1 4 1.000000
3 a y 2 5 1.250000
4 b x 1 6 1.000000
5 b x 2 7 1.166667
6 b y 1 8 1.000000
7 b y 2 9 1.125000
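The transform call above can be checked end to end; a runnable sketch rebuilding the question's data:

```python
import pandas as pd

df = pd.DataFrame({
    'prop1': list('aaaabbbb'),
    'prop2': list('xxyyxxyy'),
    'prop3': [1, 2, 1, 2, 1, 2, 1, 2],
    'value': [2, 3, 4, 5, 6, 7, 8, 9],
})

# transform('first') broadcasts each group's first value back onto the
# group's original rows, so the division aligns row by row on the index --
# unlike agg('first'), whose result is indexed by the group keys
first = df.groupby(['prop1', 'prop2'])['value'].transform('first')
df['calculated'] = df['value'] / first
print(df)
```

The ValueError in the question comes from dividing by the agg result, which is indexed by (prop1, prop2) rather than by the original row labels.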

Python: how to reshape a Pandas dataframe while keeping the information?

I have a dataframe containing the geographical information of points.
df:
A B ax ay bx by
0 x y 5 7 3 2
1 z w 2 0 7 4
2 k x 5 7 2 0
3 v y 2 3 3 2
I would like to create a dataframe with the geographical info of the unique points
df1:
ID x y
0 x 5 7
1 y 3 2
2 z 2 0
3 w 7 4
4 k 5 7
5 v 2 3
First flatten the values in the columns with numpy.ravel, create the DataFrame with the constructor, and finally add drop_duplicates (thanks @zipa):
a = df[['A','B']].values.ravel()
b = df[['ax','bx']].values.ravel()
c = df[['ay','by']].values.ravel()
df = pd.DataFrame({'ID':a, 'x':b, 'y':c}).drop_duplicates('ID').reset_index(drop=True)
print (df)
ID x y
0 x 5 7
1 y 3 2
2 z 2 0
3 w 7 4
4 k 5 7
5 v 2 3
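A self-contained version of the ravel approach, rebuilding the question's dataframe:

```python
import pandas as pd

df = pd.DataFrame({
    'A': ['x', 'z', 'k', 'v'], 'B': ['y', 'w', 'x', 'y'],
    'ax': [5, 2, 5, 2], 'ay': [7, 0, 7, 3],
    'bx': [3, 7, 2, 3], 'by': [2, 4, 0, 2],
})

# ravel() walks the 2-column arrays row by row, so the point ids and
# their coordinates stay paired up in the same order
ids = df[['A', 'B']].values.ravel()
xs = df[['ax', 'bx']].values.ravel()
ys = df[['ay', 'by']].values.ravel()

df1 = (pd.DataFrame({'ID': ids, 'x': xs, 'y': ys})
         .drop_duplicates('ID')
         .reset_index(drop=True))
print(df1)
```

Note that drop_duplicates('ID') keeps the first coordinates seen for each point; this assumes (as the question implies) that repeated points always carry the same coordinates.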

mapping a multi-index to existing pandas dataframe columns using separate dataframe

I have an existing data frame in the following format (let's call it df):
A B C D
0 1 2 1 4
1 3 0 2 2
2 1 5 3 1
The column names were extracted from a spreadsheet that has the following form (let's call it cat_df):
                 current category
broader category
X                A
Y                B
Y                C
Z                D
First I'd like to prepend a higher level index to make df look like so:
   X  Y     Z
   A  B  C  D
0  1  2  1  4
1  3  0  2  2
2  1  5  3  1
Lastly I'd like to 'roll up' the data into the meta-index by summing over subindices, to generate a new dataframe like so:
X Y Z
0 1 3 4
1 3 2 2
2 1 8 1
Using concat from this answer has gotten me close, but it seems like it'd be a very manual process picking out each subset. My true dataset has a more complex mapping, so I'd like to refer to it directly as I build my meta-index. I think once I get the meta-index settled, a simple groupby should get me to the summation, but I'm still stuck on the first step.
d = dict(zip(cat_df['current category'], cat_df.index))
cols = pd.MultiIndex.from_arrays([df.columns.map(d.get), df.columns])
df.set_axis(cols, axis=1, inplace=False)
X Y Z
A B C D
0 1 2 1 4
1 3 0 2 2
2 1 5 3 1
df_new = df.set_axis(cols, axis=1, inplace=False)
df_new.groupby(axis=1, level=0).sum()
X Y Z
0 1 3 4
1 3 2 2
2 1 8 1
IIUC, you can do it like this.
df.columns = pd.MultiIndex.from_tuples(
    cat_df.reset_index()[['broader category', 'current category']]
          .apply(tuple, axis=1)
          .tolist())
print(df)
Output:
X Y Z
A B C D
0 1 2 1 4
1 3 0 2 2
2 1 5 3 1
Sum level:
df.sum(level=0, axis=1)
Output:
X Y Z
0 1 3 4
1 3 2 2
2 1 8 1
You can use set_index to create the idx, then assign it to your df:
idx = df1.set_index('category', append=True).index
df.columns = idx
df
Out[1170]:
current X Y Z
category A B C D
0 1 2 1 4
1 3 0 2 2
2 1 5 3 1
df.sum(axis=1,level=0)
Out[1171]:
current X Y Z
0 1 3 4
1 3 2 2
2 1 8 1
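The answers above can be condensed into one runnable sketch, rebuilt from the question's data; the roll-up here transposes before grouping, which also works on newer pandas where groupby(axis=1) is deprecated:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 3, 1], 'B': [2, 0, 5],
                   'C': [1, 2, 3], 'D': [4, 2, 1]})
cat_df = pd.DataFrame({'broader category': ['X', 'Y', 'Y', 'Z'],
                       'current category': ['A', 'B', 'C', 'D']})

# Map each existing column to its broader category and build
# a two-level column header from the pair of arrays
mapping = dict(zip(cat_df['current category'], cat_df['broader category']))
df.columns = pd.MultiIndex.from_arrays([df.columns.map(mapping), df.columns])

# Roll up: sum the sub-columns within each top-level group
rolled = df.T.groupby(level=0).sum().T
print(rolled)
```

Grouping the transposed frame on level 0 and transposing back is equivalent to the axis=1 groupby shown above.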

pandas replace column with mean for values

I have a pandas dataframe and want to replace each value with the mean for its group.
ID X Y
1 a 1
2 a 2
3 a 3
4 b 2
5 b 4
How do I replace Y values with mean Y for every unique X?
ID X Y
1 a 2
2 a 2
3 a 2
4 b 3
5 b 3
Use transform:
df['Y'] = df.groupby('X')['Y'].transform('mean')
print (df)
ID X Y
0 1 a 2
1 2 a 2
2 3 a 2
3 4 b 3
4 5 b 3
For new column in another DataFrame use map with drop_duplicates:
df1 = pd.DataFrame({'X':['a','a','b']})
print (df1)
X
0 a
1 a
2 b
df1['Y'] = df1['X'].map(df.drop_duplicates('X').set_index('X')['Y'])
print (df1)
X Y
0 a 2
1 a 2
2 b 3
Another solution:
df1['Y'] = df1['X'].map(df.groupby('X')['Y'].mean())
print (df1)
X Y
0 a 2
1 a 2
2 b 3
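A runnable check of the transform call with the question's data (including the ID column):

```python
import pandas as pd

df = pd.DataFrame({'ID': [1, 2, 3, 4, 5],
                   'X': ['a', 'a', 'a', 'b', 'b'],
                   'Y': [1, 2, 3, 2, 4]})

# transform('mean') returns a series aligned with df's index, so every
# row receives the mean of its own X group
df['Y'] = df.groupby('X')['Y'].transform('mean')
print(df)
```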

pandas dataframe inserting null values

I have two dataframes:
index a b c d
1 x x x x
2 x nan x x
3 x x x x
4 x nan x x
index a b e
3 x nan x
4 x x x
5 x nan x
6 x x x
I want to make it into the following, where we simply get rid of the NaN values. An easier version of this question is where the second dataframe has no NaN values.
index a b c d e
1 x x x x x
2 x x x x x
3 x x x x x
4 x x x x x
5 x x x x x
6 x x x x x
You may use combine_first with fillna. From the docs:
DataFrame.combine_first(other): Combine two DataFrame objects and default to non-null values in the frame calling the method. The result index and columns will be the union of the respective indexes and columns.
See the pandas documentation for combine_first for details.
import pandas as pd
from numpy import nan
d1 = pd.DataFrame([[nan, 1, 1], [2, 2, 2], [3, 3, 3]], columns=['a', 'b', 'c'])
d1
a b c
0 NaN 1 1
1 2 2 2
2 3 3 3
d2 = pd.DataFrame([[1, nan, 1], [nan, 2, 2], [3, 3, nan]], columns=['b', 'd', 'e'])
d2
b d e
0 1 NaN 1
1 NaN 2 2
2 3 3 NaN
d2.combine_first(d1) # d2's non-null values take priority; d1 fills in where d2 has NaN
a b c d e
0 NaN 1 1 NaN 1
1 2 2 2 2 2
2 3 3 3 3 NaN
d2.combine_first(d1).fillna(5) # simply fill NaN with a value
a b c d e
0 5 1 1 5 1
1 2 2 2 2 2
2 3 3 3 3 5
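Applied to the question's shape (indexes 1-4 and 3-6, partially overlapping columns), a sketch with numeric stand-ins for the x values:

```python
import pandas as pd
import numpy as np

# First frame: index 1-4, columns a-d, NaN in b at rows 2 and 4
d1 = pd.DataFrame({'a': [1, 1, 1, 1], 'b': [1, np.nan, 1, np.nan],
                   'c': [1, 1, 1, 1], 'd': [1, 1, 1, 1]},
                  index=[1, 2, 3, 4])
# Second frame: index 3-6, columns a, b, e, NaN in b at rows 3 and 5
d2 = pd.DataFrame({'a': [2, 2, 2, 2], 'b': [np.nan, 2, np.nan, 2],
                   'e': [2, 2, 2, 2]},
                  index=[3, 4, 5, 6])

# Union of both index and columns; d1's non-null values win on the
# overlap, and positions missing in both frames stay NaN
out = d1.combine_first(d2)
print(out)
```

A cell absent from both frames (for example row 2, column b) is still NaN afterwards, which is where the fillna step from the answer comes in.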
Use nan_to_num to replace NaN with a number (zero by default):
http://docs.scipy.org/doc/numpy/reference/generated/numpy.nan_to_num.html
Just apply this:
from numpy import nan_to_num
df2 = df.apply(nan_to_num)
Then you can merge the arrays however you want.
