Pandas: How to pivot rows into columns - python

I am looking for a way to pivot around 600 columns into rows. Here's a sample with only 4 of those columns (good, bad, ok, Horrible):
df:
RecordID good bad ok Horrible
A 0 0 1 0
B 1 0 0 1
desired output:
RecordID Column Value
A Good 0
A Bad 0
A Ok 1
A Horrible 0
B Good 1
B Bad 0
B Ok 0
B Horrible 1

You can use the melt function:
(df.melt(id_vars='RecordID', var_name='Column', value_name='Value')
.sort_values('RecordID')
.reset_index(drop=True)
)
Output:
RecordID Column Value
0 A good 0
1 A bad 0
2 A ok 1
3 A Horrible 0
4 B good 1
5 B bad 0
6 B ok 0
7 B Horrible 1

You can use .stack() as follows. Using .stack() is preferred here because it naturally produces rows already sorted in RecordID order, so you don't waste processing time sorting again; this is especially important when you have a large number of columns.
df = df.set_index('RecordID').stack().reset_index().rename(columns={'level_1': 'Column', 0: 'Value'})
Output:
RecordID Column Value
0 A good 0
1 A bad 0
2 A ok 1
3 A Horrible 0
4 B good 1
5 B bad 0
6 B ok 0
7 B Horrible 1
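For reference, here is a self-contained sketch of the .stack() approach, using the sample frame from the question (column names taken verbatim from the input):

```python
import pandas as pd

df = pd.DataFrame({'RecordID': ['A', 'B'],
                   'good': [0, 1],
                   'bad': [0, 0],
                   'ok': [1, 0],
                   'Horrible': [0, 1]})

# stack moves the column labels into the row index, already grouped by RecordID
long_df = (df.set_index('RecordID')
             .stack()
             .reset_index()
             .rename(columns={'level_1': 'Column', 0: 'Value'}))
print(long_df)
```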

Setting up the dataframe:
import pandas as pd
import numpy as np
data2 = {'RecordID': ['a', 'b', 'c'],
         'good': [0, 1, 1],
         'bad': [0, 0, 1],
         'horrible': [0, 1, 1],
         'ok': [1, 0, 0]}
# Convert the dictionary into DataFrame
df = pd.DataFrame(data2)
Melt data:
https://pandas.pydata.org/docs/reference/api/pandas.melt.html
melted = df.melt(id_vars='RecordID', var_name='Column', value_name='Value')
melted
Optionally, group by for sum or mean values:
df2 = melted.groupby(['Column']).sum()
df2

Related

How to create multiple columns in Pandas Dataframe?

I have data as you can see in the terminal. I need it converted to the Excel sheet format you can see in the Excel file, by creating multiple levels in the columns.
I researched this and tried many different things but could not achieve my goal. Then I found "transpose", which gave me the shape I need, but unfortunately it reshaped a column into a row, so I got the wrong data ordering.
Current result:
Desired result:
What can I try next?
You can use the pivot() function and reorder the multi-column levels.
Before that, index/group the data by repeated iterations/rounds:
import itertools
import pandas as pd

data = [
    (2, 0, 0, 1),
    (10, 2, 5, 3),
    (2, 0, 0, 0),
    (10, 1, 1, 1),
    (2, 0, 0, 0),
    (10, 1, 2, 1),
]
columns = ["player_number", "cel1", "cel2", "cel3"]
df = pd.DataFrame(data=data, columns=columns)
df_nbr_plr = df[["player_number"]].groupby("player_number").agg(cnt=("player_number", "count"))
# repeat each round number once per player: [0, 0, 1, 1, 2, 2]
df["round"] = list(itertools.chain.from_iterable(
    itertools.repeat(x, df_nbr_plr.shape[0]) for x in range(df_nbr_plr.iloc[0, 0])))
[Out]:
player_number cel1 cel2 cel3 round
0 2 0 0 1 0
1 10 2 5 3 0
2 2 0 0 0 1
3 10 1 1 1 1
4 2 0 0 0 2
5 10 1 2 1 2
Now, pivot and reorder the column levels:
df = df.pivot(index="round", columns="player_number").reorder_levels([1,0], axis=1).sort_index(axis=1)
[Out]:
player_number 2 10
cel1 cel2 cel3 cel1 cel2 cel3
round
0 0 0 1 2 5 3
1 0 0 0 1 1 1
2 0 0 0 1 2 1
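As a possible simplification (not part of the original answer), the round index can also be built with groupby().cumcount(), which numbers each player's successive appearances and avoids the itertools bookkeeping:

```python
import pandas as pd

data = [(2, 0, 0, 1), (10, 2, 5, 3),
        (2, 0, 0, 0), (10, 1, 1, 1),
        (2, 0, 0, 0), (10, 1, 2, 1)]
df = pd.DataFrame(data, columns=["player_number", "cel1", "cel2", "cel3"])

# a player's n-th appearance corresponds to round n
df["round"] = df.groupby("player_number").cumcount()

wide = (df.pivot(index="round", columns="player_number")
          .reorder_levels([1, 0], axis=1)
          .sort_index(axis=1))
print(wide)
```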
This can be done with unstack after adding player__number to the index. You then have to reorder the MultiIndex columns and fill missing values/delete duplicate rows though:
import pandas as pd
data = {"player__number": [2, 10, 2, 10, 2, 10],
        "cel1": [0, 2, 0, 1, 0, 1],
        "cel2": [0, 5, 0, 1, 0, 2],
        "cel3": [1, 3, 0, 1, 0, 1],
        }
df = pd.DataFrame(data).set_index('player__number', append=True)
df = df.unstack('player__number').reorder_levels([1, 0], axis=1).sort_index(axis=1)  # unstacking, reordering and sorting columns
df = df.ffill().iloc[1::2].reset_index(drop=True)  # filling values and keeping only every second row
df.to_excel('output.xlsx')
Output:

Removing certain Rows from subset of df

I have a pandas dataframe. All the columns right of column#2 may only contain the value 0 or 1. If they contain a value that is NOT 0 or 1, I want to remove that entire row from the dataframe.
So I created a subset of the dataframe to only contain columns right of #2
Then I found the indices of the rows that had values other than 0 or 1 and deleted it from the original dataframe.
See code below please
#reading data file:
data = pd.read_csv('MyData.csv')
#all the columns right of column#2 may only contain the value 0 or 1. So "prod" is a subset of the data df containing these columns:
prod = data.iloc[:, 2:]
index_prod = prod[(prod != 0) & (prod != 1)].dropna().index
data = data.drop(index_prod)
However when I run this, the index_prod vector is empty and so does not drop anything at all.
Okay, so my friend just told me that the data is not numeric, and he fixed it by making it numeric. Can anyone please advise how I can find that out myself? All the columns seemed numeric to me - all numbers.
You can check the dtypes with DataFrame.dtypes:
print (data.dtypes)
Or:
print (data.columns.difference(data.select_dtypes(np.number).columns))
And then convert all columns except the first two to numeric:
data.iloc[:,2:] = data.iloc[:,2:].apply(lambda x: pd.to_numeric(x, errors='coerce'))
Or all columns:
data = data.apply(lambda x: pd.to_numeric(x, errors='coerce'))
And lastly, apply the filtering solution:
subset = data.iloc[:,2:]
data1 = data[subset.isin([0,1]).all(axis=1)]
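A self-contained sketch combining the conversion and the filter (the column names and values here are made up for illustration; in the question the bad values were non-numeric strings):

```python
import pandas as pd

# hypothetical data: first two columns are metadata, the rest should be 0/1
data = pd.DataFrame({'id': [1, 2, 3],
                     'name': ['x', 'y', 'z'],
                     'f1': ['0', '1', '2'],      # strings, as in the question
                     'f2': ['1', '0', 'oops']})

# coerce everything right of column 2 to numeric; bad values become NaN
data.iloc[:, 2:] = data.iloc[:, 2:].apply(lambda s: pd.to_numeric(s, errors='coerce'))

# keep only the rows where every such column is 0 or 1 (NaN fails isin too)
subset = data.iloc[:, 2:]
clean = data[subset.isin([0, 1]).all(axis=1)]
print(clean)
```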
Let's say you have this dataframe:
data = {'A': [1, 2, 3, 4, 5], 'B': [0, 1, 4, 3, 1], 'C': [2, 1, 0, 3, 4]}
df = pd.DataFrame(data)
A B C
0 1 0 2
1 2 1 1
2 3 4 0
3 4 3 3
4 5 1 4
If you want to delete the rows where column B contains a value other than 0 or 1, you can do:
index = df[(df['B'] != 0) & (df['B'] != 1)].index
df = df.drop(index)
df
A B C
0 1 0 2
1 2 1 1
4 5 1 4
df = df.reset_index(drop=True)
df
A B C
0 1 0 2
1 2 1 1
2 5 1 4

Create dummy variable of multiple columns with python

I am working with a dataframe containing two columns with ID numbers. For further research I want to make a sort of dummy variables of these ID numbers, covering both ID columns. My code, however, does not merge the dummy columns generated from the two ID columns. How can I merge them and create the dummy variables?
Dataframe
import pandas as pd
import numpy as np
d = {'ID1': [1,2,3], 'ID2': [2,3,4]}
df = pd.DataFrame(data=d)
Current code
pd.get_dummies(df, prefix = ['ID1', 'ID2'], columns=['ID1', 'ID2'])
Desired output
p = {'1': [1,0,0], '2': [1,1,0], '3': [0,1,1], '4': [0,0,1]}
df2 = pd.DataFrame(data=p)
df2
If you need indicator values in the output use max; if you need counts use sum. Both follow get_dummies with adjusted parameters and the values cast to strings:
df = pd.get_dummies(df.astype(str), prefix='', prefix_sep='').max(level=0, axis=1)
#count alternative
#df = pd.get_dummies(df.astype(str), prefix='', prefix_sep='').sum(level=0, axis=1)
print (df)
1 2 3 4
0 1 1 0 0
1 0 1 1 0
2 0 0 1 1
Different ways of skinning a cat; here's how I'd do it, with an additional groupby:
# pd.get_dummies(df.astype(str)).groupby(lambda x: x.split('_')[1], axis=1).sum()
pd.get_dummies(df.astype(str)).groupby(lambda x: x.split('_')[1], axis=1).max()
1 2 3 4
0 1 1 0 0
1 0 1 1 0
2 0 0 1 1
Another option is stacking, if you like conciseness:
# pd.get_dummies(df.stack()).sum(level=0)
pd.get_dummies(df.stack()).max(level=0)
1 2 3 4
0 1 1 0 0
1 0 1 1 0
2 0 0 1 1
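A note of caution: the level=/axis=1 aggregation syntax used in the answers above was removed in recent pandas (2.0+). An equivalent that should work on current versions groups the stacked dummies by the original row label:

```python
import pandas as pd

df = pd.DataFrame({'ID1': [1, 2, 3], 'ID2': [2, 3, 4]})

# stack both ID columns into one series, dummy-encode, then collapse back per row
out = pd.get_dummies(df.stack()).groupby(level=0).max().astype(int)
print(out)
```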

`pandas.merge` not recognising same index

I have two dataframes with overlapping columns but identical indexes and I want to combine them. I feel like this should be straightforward, but I have worked through so many examples and SO questions and it's not working, and it also seems inconsistent with other examples.
import pandas as pd
# create test data
df = pd.DataFrame({'gen1': [1, 0, 0, 1, 1], 'gen3': [1, 0, 0, 1, 0], 'gen4': [0, 1, 1, 0, 1]}, index = ['a', 'b', 'c', 'd', 'e'])
df1 = pd.DataFrame({'gen1': [1, 0, 0, 1, 1], 'gen2': [0, 1, 1, 1, 1], 'gen3': [1, 0, 0, 1, 0]}, index = ['a', 'b', 'c', 'd', 'e'])
In [1]: df
Out[1]:
gen1 gen3 gen4
a 1 1 0
b 0 0 1
c 0 0 1
d 1 1 0
e 1 0 1
In [2]: df1
Out[2]:
gen1 gen2 gen3
a 1 0 1
b 0 1 0
c 0 1 0
d 1 1 1
e 1 1 0
After working through all the examples here (https://pandas.pydata.org/pandas-docs/stable/merging.html) I'm convinced I have found the correct example (the first and second example of the merges). The second example is this:
In [43]: result = pd.merge(left, right, on=['key1', 'key2'])
In their example they have two DFs (left and right) that have overlapping columns and identical indexes, and their resulting dataframe has one version of each column and the original indexes, but this is not what happens when I do that:
# get the intersection of columns (I need this to be general)
In [3]: column_intersection = list(set(df).intersection(set(df1)))
In [4]: pd.merge(df, df1, on=column_intersection)
Out[4]:
gen1 gen2 gen3 gen4
0 1 0 1 0
1 1 0 1 0
2 1 1 1 0
3 1 1 1 0
4 0 1 0 1
5 0 1 0 1
6 0 1 0 1
7 0 1 0 1
8 1 1 0 1
Here we see that merge has not recognised that the indexes are the same! I have fiddled around with the options but cannot get the result I want.
A similar but different question was asked here How to keep index when using pandas merge but I don't really understand the answers and so can't relate it to my problem.
Points for this specific example:
Index will always be identical.
Columns with the same name will always have identical entries (i.e. they are duplicates).
It would be great to have a solution for this specific problem but I would also really like to understand it because I find myself spending lots of time on combining dataframes from time to time. I love pandas and in general I find it very intuitive but I just can't seem to get comfortable with anything other than trivial combinations of dataframes.
Starting v0.23, you can specify an index name for the join key, if you have it.
df.index.name = df1.index.name = 'idx'
df.merge(df1, on=list(set(df).intersection(set(df1)) | {'idx'}))
gen1 gen3 gen4 gen2
idx
a 1 1 0 0
b 0 0 1 1
c 0 0 1 1
d 1 1 0 1
e 1 0 1 1
The assumption here is that your actual DataFrames do not have exactly the same values in overlapping columns. If they do, then your question is one of concatenation, and you can use pd.concat for that:
c = list(set(df).intersection(set(df1)))
pd.concat([df1, df.drop(columns=c)], axis=1)
gen1 gen2 gen3 gen4
a 1 0 1 0
b 0 1 0 1
c 0 1 0 1
d 1 1 1 0
e 1 1 0 1
In this special case, you can use assign.
Things in df take priority, but all other things in df1 are included.
df1.assign(**df)
gen1 gen2 gen3 gen4
a 1 0 1 0
b 0 1 0 1
c 0 1 0 1
d 1 1 1 0
e 1 1 0 1
**df unpacks df assuming a dictionary context. This unpacking delivers keyword arguments to assign with the names of columns as the keyword and the column as the argument.
It is the same as
df1.assign(gen1=df.gen1, gen3=df.gen3, gen4=df.gen4)

Extracting data from two dataframes to create a third

I am using Python Pandas for the following. I have three dataframes, df1, df2 and df3. Each has the same dimensions, index and column labels. I would like to create a fourth dataframe that takes elements from df1 or df2 depending on the values in df3:
df1 = pd.DataFrame(np.random.randn(4, 2), index=list('0123'), columns=['A', 'B'])
df1
Out[67]:
A B
0 1.335314 1.888983
1 1.000579 -0.300271
2 -0.280658 0.448829
3 0.977791 0.804459
df2 = pd.DataFrame(np.random.randn(4, 2), index=list('0123'), columns=['A', 'B'])
df2
Out[68]:
A B
0 0.689721 0.871065
1 0.699274 -1.061822
2 0.634909 1.044284
3 0.166307 -0.699048
df3 = pd.DataFrame({'A': [1, 0, 0, 1], 'B': [1, 0, 1, 0]})
df3
Out[69]:
A B
0 1 1
1 0 0
2 0 1
3 1 0
The new dataframe, df4, has the same index and column labels and takes an element from df1 if the corresponding value in df3 is 1. It takes an element from df2 if the corresponding value in df3 is a 0.
I need a solution that uses generic references (e.g. ix or iloc) rather than actual column labels and index values because my dataset has fifty columns and four hundred rows.
As your DataFrames happen to be numeric, and the selector matrix happens to be of indicator variables, you can do the following:
>>> pd.DataFrame(
...     df1.to_numpy() * df3.to_numpy() + df2.to_numpy() * (1 - df3.to_numpy()),
...     index=df1.index,
...     columns=df1.columns)
I tried it and it works. Strangely enough, @Yakym Pirozhenko's answer - which I think is superior - doesn't work for me.
df4 = df1.where(df3.astype(bool), df2) should do it.
import pandas as pd
import numpy as np
df1 = pd.DataFrame(np.random.randint(10, size = (4,2)))
df2 = pd.DataFrame(np.random.randint(10, size = (4,2)))
df3 = pd.DataFrame(np.random.randint(2, size = (4,2)))
df4 = df1.where(df3.astype(bool), df2)
print(df1, '\n')
print(df2, '\n')
print(df3, '\n')
print(df4, '\n')
Output:
0 1
0 0 3
1 8 8
2 7 4
3 1 2
0 1
0 7 9
1 4 4
2 0 5
3 7 2
0 1
0 0 0
1 1 0
2 1 1
3 1 0
0 1
0 7 9
1 8 4
2 7 4
3 1 2
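An equivalent formulation (an alternative sketch, not from the answers above) uses numpy.where directly, which returns a plain array you can wrap back into a DataFrame with the original index and columns:

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'A': [1., 2.], 'B': [3., 4.]})
df2 = pd.DataFrame({'A': [9., 8.], 'B': [7., 6.]})
df3 = pd.DataFrame({'A': [1, 0], 'B': [0, 1]})

# where df3 is 1 take the element from df1, otherwise take it from df2
df4 = pd.DataFrame(np.where(df3.astype(bool), df1, df2),
                   index=df1.index, columns=df1.columns)
print(df4)
```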
