How to pivot table back to original entry format - python

import pandas as pd
import numpy as np
df=pd.DataFrame(np.array([['M',1, 1, 2, 3],
['F', 2, 4, 5, 6], ['M', 3, 7, 8, 9]]),columns=['SEX','AGE','A','B','C'])
dfm=pd.melt(df,id_vars=('SEX','AGE'),value_vars=list(df.columns[2:]),
var_name='LOCATION',value_name='DEATHS')
Based on the code provided, I can create a basic table and melt it from df to dfm using 'SEX' and 'AGE' as id variables.
Is there a simple way of reverting this table back to its original format?
That is, going from dfm back to df, assuming I do not have df.
Many thanks.

The pivot_table method should let you return to the original DataFrame. Note that the rows come back sorted by the index columns rather than in their original order.
# Change data types from object to integer
dfm[['AGE', 'DEATHS']] = dfm[['AGE', 'DEATHS']].astype(int)
# Pivot dataframe to "undo melt"
reshaped = dfm.pivot_table(index=['SEX', 'AGE'],columns=['LOCATION'],
values='DEATHS')
# Reset index to flatten dataframe
reshaped.reset_index(inplace=True)
# Change column name attribute to blank
reshaped.columns.rename('',inplace=True)
SEX AGE A B C
0 F 2 4 5 6
1 M 1 1 2 3
2 M 3 7 8 9
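One caveat: pivot_table aggregates its values (with the mean by default), so if two melted rows shared the same SEX, AGE and LOCATION they would be silently averaged. A minimal sketch of an alternative that does not aggregate, assuming each (SEX, AGE, LOCATION) combination occurs only once; the data is rebuilt here with real integer columns, which is an assumption rather than the question's exact setup:

```python
import pandas as pd

# Same values as the question, but built with integer columns from the start.
df = pd.DataFrame({"SEX": ["M", "F", "M"], "AGE": [1, 2, 3],
                   "A": [1, 4, 7], "B": [2, 5, 8], "C": [3, 6, 9]})
dfm = pd.melt(df, id_vars=["SEX", "AGE"], var_name="LOCATION", value_name="DEATHS")

# Move the id columns and LOCATION into the index, then spread LOCATION
# back out into columns. No aggregation takes place.
restored = dfm.set_index(["SEX", "AGE", "LOCATION"])["DEATHS"].unstack("LOCATION")
restored = restored.reset_index()
restored.columns.name = None  # drop the leftover "LOCATION" label

print(restored)
```

With duplicate combinations, unstack raises a ValueError instead of averaging, which makes the problem visible.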

Related

New column in dataset based on the previous value of an item

I have this dataset
In [4]: df = pd.DataFrame({'A':[1, 2, 3, 4, 5]})
In [5]: df
Out[5]:
A
0 1
1 2
2 3
3 4
4 5
I want to add a new column to the dataset based on the previous value of the item, like this:
A New Column
1
2 1
3 2
4 3
5 4
I tried to use apply with iloc, but it didn't work.
Can you help?
Thank you
With your shown samples, please try the following. You can use the shift function to build the new column: it moves every element of the given column down by one row, leaving NaN in the first element.
import pandas as pd
df['New_Col'] = df['A'].shift()
OR
If you would like to fill the NaN with zero, try the following; the approach is the same as above.
import pandas as pd
df['New_Col'] = df['A'].shift().fillna(0)
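Because shift() introduces a NaN, the new column becomes float in both variants above. If an integer column is preferred, a small sketch using the fill_value parameter of Series.shift (available in recent pandas versions) could look like this:

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3, 4, 5]})

# fill_value supplies the value for the vacated first row directly,
# so no NaN is ever created and the integer dtype is kept.
df["New_Col"] = df["A"].shift(fill_value=0)

print(df)
print(df.dtypes)
```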

Method to sort DataFrame's values

I don't understand this code:
d = {'col1': [5, 6,4, 1, 2, 9, 15, 11]}
df = pd.DataFrame(data=d)
df.head(10)
df['col1'] = df.sort_values('col1')['col1']
print(df.sort_values('col1')['col1'])
This is what is printed:
3 1
4 2
2 4
0 5
1 6
5 9
7 11
6 15
My df doesn't change at all.
Why doesn't this code, df.sort_values('col1')['col1'], rearrange my DataFrame?
Thanks
If you want to assign the sorted column back, you need to convert the result to a NumPy array to prevent index alignment. df.sort_values('col1')['col1'] does sort correctly and the index order changes with it, but in the assignment step pandas matches values by index label, which puts them back in the original order, so the values do not appear to move.
df['col1'] = df.sort_values('col1')['col1'].to_numpy()
If the DataFrame has a default index, another idea is to reset the index of the sorted column so it matches the original default index; alignment then assigns by the new index values:
df['col1'] = df.sort_values('col1')['col1'].reset_index(drop=True)
If you want to sort the whole DataFrame by the col1 column:
df = df.sort_values('col1')
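The alignment effect can be seen directly by comparing the two kinds of assignment side by side. This is a small sketch on the question's data; the names sorted_col, aligned and positional are only illustrative:

```python
import pandas as pd

df = pd.DataFrame({"col1": [5, 6, 4, 1, 2, 9, 15, 11]})
sorted_col = df.sort_values("col1")["col1"]

# The original row labels travel with the sorted values.
print(sorted_col.index.tolist())

aligned = df.copy()
aligned["col1"] = sorted_col               # matched by label: order unchanged

positional = df.copy()
positional["col1"] = sorted_col.to_numpy()  # matched by position: sorted

print(aligned["col1"].tolist())
print(positional["col1"].tolist())
```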

Numeric differences between two different dataframes in python

I would like to find the numeric difference between two or more columns of two different DataFrames.
The first table (Table 1) is the starting table.
The second table (Table 2) contains the single values that I need to subtract from Table 1.
I would like to get a third table with the numeric differences between each row of Table 1 and the single row of Table 2. Any help?
Try
df.subtract(df2.values)
with df being your starting table and df2 being Table 2.
Can you try this and see if this is what you need:
import pandas as pd
df = pd.DataFrame({'A':[5, 3, 1, 2, 2], 'B':[2, 3, 4, 2, 2]})
df2 = pd.DataFrame({'A':[1], 'B':[2]})
pd.DataFrame(df.values-df2.values, columns=df.columns)
Out:
A B
0 4 0
1 2 1
2 0 2
3 1 0
4 1 0
You can just do df1 - df2.values as shown below. This uses NumPy broadcasting to subtract the values of df2 from every row of df1, but df2 must have only one row.
Example:
import numpy as np
import pandas as pd
df1 = pd.DataFrame(np.arange(15).reshape(-1,3), columns="A B C".split())
df2 = pd.DataFrame(np.ones(3).reshape(-1,3), columns="A B C".split())
df1-df2.values
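The .values approaches above pair columns purely by position. If the two tables might list their columns in a different order, one possible variant is to subtract the single row as a Series, which pandas matches by column name. A sketch, with the column order of df2 swapped on purpose for illustration:

```python
import pandas as pd

df = pd.DataFrame({"A": [5, 3, 1, 2, 2], "B": [2, 3, 4, 2, 2]})
df2 = pd.DataFrame({"B": [2], "A": [1]})  # columns deliberately in a different order

# df2.iloc[0] is a Series indexed by column name; sub() lines it up with
# the columns of df by label before subtracting from every row.
diff = df.sub(df2.iloc[0], axis="columns")

print(diff)
```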

Pandas: select value from random column on each row

Suppose I have the following Pandas DataFrame:
df = pd.DataFrame({
'a': [1, 2, 3],
'b': [4, 5, 6],
'c': [7, 8, 9]
})
a b c
0 1 4 7
1 2 5 8
2 3 6 9
I want to generate a new pandas.Series so that the values of this series are selected, row by row, from a random column in the DataFrame. So, a possible output for that would be the series:
0 7
1 2
2 9
dtype: int64
(where in row 0 it randomly chose 'c', in row 1 it randomly chose 'a' and in row 2 it randomly chose 'c' again).
I know this can be done by iterating over the rows and using random.choice to choose each row, but iterating over the rows not only has bad performance but also is "unpandonic", so to speak. Also, df.sample(axis=1) would choose whole columns, so all of them would be chosen from the same column, which is not what I want. Is there a better way to do this with vectorized pandas methods?
Here is a fully vectorized solution. Note however that it does not use Pandas methods, but rather involves operations on the underlying numpy array.
import numpy as np
indices = np.random.choice(np.arange(len(df.columns)), len(df), replace=True)
An example output is [2, 1, 2], which corresponds to ['c', 'b', 'c'].
Then use this to slice the numpy array:
df['random'] = df.to_numpy()[np.arange(len(df)), indices]
Results:
a b c random
0 1 4 7 7
1 2 5 8 5
2 3 6 9 9
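For a reproducible version of the same idea, NumPy's Generator API can be seeded. This is a sketch rather than part of the answer above: the seed value is arbitrary, and picked and chosen are illustrative names.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6], "c": [7, 8, 9]})

rng = np.random.default_rng(seed=0)  # the seed value is arbitrary

# One random column position per row.
col_idx = rng.integers(0, df.shape[1], size=len(df))

# Pair each row position with its column position to pick one cell per row.
picked = pd.Series(df.to_numpy()[np.arange(len(df)), col_idx], index=df.index)
chosen = df.columns[col_idx]  # the column label picked for each row

print(picked)
print(list(chosen))
```

Returning a Series with the original index keeps the result aligned with df if it is later assigned as a column.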
Maybe something like:
pd.Series([np.random.choice(i,1)[0] for i in df.values])
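This does the job, using the built-in random module:
import random
ddf = df.apply(lambda row: random.choice(row.tolist()), axis=1)
or using pandas sample (taking the single sampled value with .iloc[0], since row.sample() on its own returns a one-element Series):
ddf = df.apply(lambda row: row.sample().iloc[0], axis=1)
Both have the same behaviour. ddf is your Series.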
pd.DataFrame(
df.values[range(df.shape[0]),
np.random.randint(
0, df.shape[1], size=df.shape[0])])
output
0
0 4
1 5
2 9
You're probably still going to need to iterate through each row while selecting a random value in each row, whether you do it explicitly with a for loop or implicitly with whatever function you decide to call.
You can, however, simplify it to a single line using a list comprehension, if it suits your style:
result = pd.Series([random.choice(df.iloc[i].tolist()) for i in range(len(df))])

Selecting multiple columns R vs python pandas

I am an R user who is currently learning Python and I am trying to replicate a method of selecting columns used in R into Python.
In R, I could select multiple columns like so:
df[,c(2,4:10)]
In Python, I know how iloc works, but I couldn't split between a single column number and a consecutive set of them.
This wouldn't work
df.iloc[:,[1,3:10]]
So, I'll have to drop the second column like so:
df.iloc[:,1:10].drop(df.iloc[:,1:10].columns[1] , axis=1)
Is there a more efficient way of replicating the method from R in Python?
You can use np.r_, which accepts a mix of slice notation and scalar indices and concatenates them into a 1-D array:
import numpy as np
df.iloc[:,np.r_[1, 3:10]]
df = pd.DataFrame([[1,2,3,4,5,6]])
df
# 0 1 2 3 4 5
#0 1 2 3 4 5 6
df.iloc[:, np.r_[1, 3:6]]
# 1 3 4 5
#0 2 4 5 6
As np.r_ produces:
np.r_[1, 3:6]
# array([1, 3, 4, 5])
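When translating from R, keep in mind that R indices are 1-based with an inclusive end, while Python positions are 0-based with an exclusive end. The R selection c(2, 4:10) therefore becomes np.r_[1, 3:10]. A small sketch on a hypothetical ten-column frame whose columns are named c1 to c10 to make the mapping visible:

```python
import numpy as np
import pandas as pd

# Hypothetical frame: one row, columns c1..c10 named after their R positions.
df = pd.DataFrame([list(range(1, 11))], columns=[f"c{i}" for i in range(1, 11)])

# R: df[, c(2, 4:10)]  ->  0-based position 1, plus positions 3 up to 9.
positions = np.r_[1, 3:10]
selected = df.iloc[:, positions]

print(positions)
print(list(selected.columns))
```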
To select multiple columns of a DataFrame by name instead, consider the DataFrame df:
import pandas as pd
df = pd.DataFrame({'A' : ['X', 'Y'],
'B' : 1,
'C' : [2, 3]})
To get the columns A and C, simply use
df[['A', 'C']]
A C
0 X 2
1 Y 3
Note that the selection returns a new object, so assign it to a variable if you want to use it later.
