How can I turn the following dataframe into a multi-index dataframe?

How can I achieve the following?
I have a table like so:
| Date | A | B | C | D |
|------|---|---|---|---|
| 2000 | 1 | 2 | 5 | 4 |
| 2001 | 2 | 2 | 7 | 4 |
| 2002 | 3 | 1 | 7 | 7 |
| 2003 | 4 | 1 | 5 | 7 |
and turn it into a multi-index type dataframe:
| Column Name | Date | Value | C | D |
|-------------|------|-------|---|---|
| A           | 2000 | 1     | 5 | 4 |
|             | 2001 | 2     | 7 | 4 |
|             | 2002 | 3     | 7 | 7 |
|             | 2003 | 4     | 5 | 7 |
| B           | 2000 | 2     | 5 | 4 |
|             | 2001 | 2     | 7 | 4 |
|             | 2002 | 1     | 7 | 7 |
|             | 2003 | 1     | 5 | 7 |
I have tried using the melt function on the dataframe, but could not figure out how to achieve this desired look. I think I would also have to apply a groupby to the melted dataframe.

You can use melt with set_index. By passing C and D as id_vars, those columns keep their structure; then set the columns of interest as the index to get a MultiIndex dataframe:
df.melt(id_vars=['Date', 'C', 'D']).set_index(['variable', 'Date'])
C D value
variable Date
A 2000 5 4 1
2001 7 4 2
2002 7 7 3
2003 5 7 4
B 2000 5 4 2
2001 7 4 2
2002 7 7 1
2003 5 7 1
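A self-contained sketch of the answer, reconstructing the sample table from the question:
import pandas as pd

# Sample data from the question
df = pd.DataFrame({'Date': [2000, 2001, 2002, 2003],
                   'A': [1, 2, 3, 4],
                   'B': [2, 2, 1, 1],
                   'C': [5, 7, 7, 5],
                   'D': [4, 4, 7, 7]})

# Melt A and B into long form, carrying C and D along as id_vars,
# then promote the melted column name and Date to a MultiIndex
result = df.melt(id_vars=['Date', 'C', 'D']).set_index(['variable', 'Date'])
print(result)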

Related

Increment rank each time flag changes

I have the following pandas dataframe, where the first column is the datetime index. I am trying to produce the desired_output column, which increments every time the flag changes from 0 to 1 or from 1 to 0. I have been able to do this in SQL, but since pandasql's sqldf strangely changes the values of the field being partitioned, I am now trying to achieve it with plain pandas.
Any help would be much appreciated.
+-------------+------+----------------+
| date(index) | flag | desired_output |
+-------------+------+----------------+
| 1/01/2020 | 0 | 1 |
| 2/01/2020 | 0 | 1 |
| 3/01/2020 | 0 | 1 |
| 4/01/2020 | 1 | 2 |
| 5/01/2020 | 1 | 2 |
| 6/01/2020 | 0 | 3 |
| 7/01/2020 | 1 | 4 |
| 8/01/2020 | 1 | 4 |
| 9/01/2020 | 1 | 4 |
| 10/01/2020 | 1 | 4 |
| 11/01/2020 | 1 | 4 |
| 12/01/2020 | 1 | 4 |
| 13/01/2020 | 0 | 5 |
| 14/01/2020 | 0 | 5 |
| 15/01/2020 | 0 | 5 |
| 16/01/2020 | 0 | 5 |
| 17/01/2020 | 1 | 6 |
| 18/01/2020 | 0 | 7 |
| 19/01/2020 | 0 | 7 |
| 20/01/2020 | 0 | 7 |
| 21/01/2020 | 0 | 7 |
| 22/01/2020 | 1 | 8 |
| 23/01/2020 | 1 | 8 |
+-------------+------+----------------+
Use diff and cumsum: diff is non-zero (NaN on the first row) exactly where the flag changes, ne(0) marks those change points as True, and the cumulative sum of those booleans numbers each run:
print(df["flag"].diff().ne(0).cumsum())
0 1
1 1
2 1
3 2
4 2
5 3
6 4
7 4
8 4
9 4
10 4
11 4
12 5
13 5
14 5
15 5
16 6
17 7
18 7
19 7
20 7
21 8
22 8
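A minimal runnable sketch of the same idea on a shortened flag column (the date index is omitted for brevity):
import pandas as pd

# Shortened version of the example flag column
df = pd.DataFrame({'flag': [0, 0, 0, 1, 1, 0, 1]})

# diff() is non-zero (NaN on the first row) wherever the flag changes;
# ne(0) turns the change points into True, and cumsum() numbers the runs
df['desired_output'] = df['flag'].diff().ne(0).cumsum()
print(df)
#    flag  desired_output
# 0     0               1
# 1     0               1
# 2     0               1
# 3     1               2
# 4     1               2
# 5     0               3
# 6     1               4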

I want to add a new column to cross tabulation data

I have cross tabulation data, which I have created using
x = pd.crosstab(a['Age Category'], a['Category'])
| Category | A | B | C | D |
|--------------|---|----|----|---|
| Age Category | | | | |
| 21-26 | 2 | 2 | 4 | 1 |
| 26-31 | 7 | 11 | 12 | 5 |
| 31-36 | 3 | 5 | 5 | 2 |
| 36-41 | 2 | 4 | 1 | 7 |
| 41-46 | 0 | 1 | 3 | 2 |
| 46-51 | 0 | 0 | 2 | 3 |
| Above 51 | 0 | 3 | 0 | 6 |
And I want to add a new column, Total, containing the row sums, like this:
| Category | A | B | C | D | Total |
|--------------|---|----|----|---|-------|
| Age Category | | | | | |
| 21-26 | 2 | 2 | 4 | 1 | 9 |
| 26-31 | 7 | 11 | 12 | 5 | 35 |
| 31-36 | 3 | 5 | 5 | 2 | 15 |
| 36-41 | 2 | 4 | 1 | 7 | 14 |
| 41-46 | 0 | 1 | 3 | 2 | 6 |
| 46-51 | 0 | 0 | 2 | 3 | 5 |
| Above 51 | 0 | 3 | 0 | 6 | 9 |
I tried x['Total'] = x.sum(axis=1), but this code gives me TypeError: cannot insert an item into a CategoricalIndex that is not already an existing category.
Thank you for your time and consideration.
Use CategoricalIndex.add_categories to append the new category to the columns first:
x.columns = x.columns.add_categories(['Total'])
x['Total'] = x.sum(axis=1)
print(x)
A B C D Total
Category
21-26 2 2 4 1 9
26-31 7 11 12 5 35
31-36 3 5 5 2 15
36-41 2 4 1 7 14
41-46 0 1 3 2 6
46-51 0 0 2 3 5
Above 51 0 3 0 6 9
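For context, a minimal reproduction sketch; the sample data here is made up, and the 'Category' column is stored as a pandas Categorical, which is what makes crosstab return a CategoricalIndex for the columns:
import pandas as pd

# Hypothetical sample data; 'Category' is categorical dtype
a = pd.DataFrame({'Age Category': ['21-26', '21-26', '26-31', '26-31'],
                  'Category': pd.Categorical(['A', 'B', 'B', 'A'])})
x = pd.crosstab(a['Age Category'], a['Category'])

# x['Total'] = x.sum(axis=1) would raise the TypeError at this point;
# registering 'Total' as a category first makes the assignment valid
x.columns = x.columns.add_categories(['Total'])
x['Total'] = x.sum(axis=1)
print(x)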

Not getting first column when writing my Pandas DataFrame to MySQL

I have a Pandas object created using the cross tabulation function
df = pd.crosstab(db['Age Category'], db['Category'])
| Age Category | A | B | C | D |
|--------------|---|----|----|---|
| 21-26 | 2 | 2 | 4 | 1 |
| 26-31 | 7 | 11 | 12 | 5 |
| 31-36 | 3 | 5 | 5 | 2 |
| 36-41 | 2 | 4 | 1 | 7 |
| 41-46 | 0 | 1 | 3 | 2 |
| 46-51 | 0 | 0 | 2 | 3 |
| Above 51 | 0 | 3 | 0 | 6 |
df.dtypes gives me
Age Category
A int64
B int64
C int64
D int64
dtype: object
But when I am writing this to MySQL, I am not getting the first column.
The output in MySQL is shown below:
| A | B | C | D |
|---|----|----|---|
| 2 | 2 | 4 | 1 |
| 7 | 11 | 12 | 5 |
| 3 | 5 | 5 | 2 |
| 2 | 4 | 1 | 7 |
| 0 | 1 | 3 | 2 |
| 0 | 0 | 2 | 3 |
| 0 | 3 | 0 | 6 |
I want to write it to MySQL with the first column included.
I have created the connection using SQLAlchemy and PyMySQL:
engine = create_engine('mysql+pymysql://[user]:[passwd]@[host]:[port]/[database]')
and I am writing using pd.to_sql()
df.to_sql(name='demo', con=engine, if_exists='replace', index=False)
but this is not giving me the first column in MySQL.
Thank you for your time and consideration.
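A likely cause: in a crosstab, the 'Age Category' labels live in the DataFrame index rather than in a regular column, and index=False tells to_sql to drop the index. A sketch of the two usual fixes, reusing the engine above:
# Keep the index and give it an explicit column name in MySQL
df.to_sql(name='demo', con=engine, if_exists='replace',
          index=True, index_label='Age Category')

# Or promote the index to an ordinary column first
df.reset_index().to_sql(name='demo', con=engine, if_exists='replace', index=False)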

Pandas merge two dataframes and drop extra rows

How can I merge/join these two dataframes ONLY on "sample_id" and drop the extra rows from the second dataframe when merging/joining?
Using pandas in Python.
First dataframe (fdf)
| sample_id | name |
|-----------|-------|
| 1 | Mark |
| 1 | Dart |
| 2 | Julia |
| 2 | Oolia |
| 2 | Talia |
Second dataframe (sdf)
| sample_id | salary | time |
|-----------|--------|------|
| 1 | 20 | 0 |
| 1 | 30 | 5 |
| 1 | 40 | 10 |
| 1 | 50 | 15 |
| 2 | 33 | 0 |
| 2 | 23 | 5 |
| 2 | 24 | 10 |
| 2 | 28 | 15 |
| 2 | 29 | 20 |
So the resulting df will be like -
| sample_id | name | salary | time |
|-----------|-------|--------|------|
| 1 | Mark | 20 | 0 |
| 1 | Dart | 30 | 5 |
| 2 | Julia | 33 | 0 |
| 2 | Oolia | 23 | 5 |
| 2 | Talia | 24 | 10 |
There are duplicate sample_id values, so you need a helper column for a correct DataFrame.merge; GroupBy.cumcount numbers the rows within each group, which pairs the two frames positionally:
df = (fdf.assign(g=fdf.groupby('sample_id').cumcount())
.merge(sdf.assign(g=sdf.groupby('sample_id').cumcount()), on=['sample_id', 'g'])
.drop('g', axis=1))
print (df)
sample_id name salary time
0 1 Mark 20 0
1 1 Dart 30 5
2 2 Julia 33 0
3 2 Oolia 23 5
4 2 Talia 24 10
An alternative, using the question's frame names (note that this keeps the row with the earliest time for every name, rather than pairing the rows positionally as above):
final_res = pd.merge(fdf, sdf, on=['sample_id'], how='left')
final_res.sort_values(['sample_id', 'name', 'time'], ascending=[True, True, True], inplace=True)
final_res.drop_duplicates(subset=['sample_id', 'name'], keep='first', inplace=True)
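For reference, a self-contained sketch reconstructing both frames and running the cumcount merge from the first answer:
import pandas as pd

# Reconstruction of the two sample frames from the question
fdf = pd.DataFrame({'sample_id': [1, 1, 2, 2, 2],
                    'name': ['Mark', 'Dart', 'Julia', 'Oolia', 'Talia']})
sdf = pd.DataFrame({'sample_id': [1, 1, 1, 1, 2, 2, 2, 2, 2],
                    'salary': [20, 30, 40, 50, 33, 23, 24, 28, 29],
                    'time': [0, 5, 10, 15, 0, 5, 10, 15, 20]})

# cumcount numbers the rows inside each sample_id group (0, 1, 2, ...),
# so the inner merge on ['sample_id', 'g'] pairs the frames row-by-row
# per group and silently drops sdf rows with no partner in fdf
df = (fdf.assign(g=fdf.groupby('sample_id').cumcount())
         .merge(sdf.assign(g=sdf.groupby('sample_id').cumcount()),
                on=['sample_id', 'g'])
         .drop('g', axis=1))
print(df)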

Append pandas dataframe to column

I'm stuck and need some help. I have the following dataframe:
|     | A | B |
|-----|---|---|
| 288 | 1 | 4 |
| 245 | 2 | 3 |
| 543 | 3 | 6 |
| 867 | 1 | 9 |
| 345 | 2 | 7 |
| 122 | 3 | 8 |
| 233 | 1 | 1 |
| 346 | 2 | 6 |
| 765 | 3 | 3 |
Column A has repeating values as shown. For each row, I want a new column C holding the B value from the next row that has the same A value, as shown below:
|     | A | B | C   |
|-----|---|---|-----|
| 288 | 1 | 4 | 9   |
| 245 | 2 | 3 | 7   |
| 543 | 3 | 6 | 8   |
| 867 | 1 | 9 | 1   |
| 345 | 2 | 7 | 6   |
| 122 | 3 | 8 | 3   |
| 233 | 1 | 1 | NaN |
| 346 | 2 | 6 | NaN |
| 765 | 3 | 3 | NaN |
Thanks.
Assuming that val is one of the repeated values,
sliced = df.loc[df.A == val, 'B'].shift(-1)
will create a Series in which each value is moved up to the index of the previous occurrence of val.
Since the index values of the different slices never overlap, you can use pandas.concat to stitch them together without losing data, then attach the result as a new column:
df['C'] = pd.concat([df.loc[df['A'] == x, 'B'].shift(-1) for x in [1, 2, 3]])
When the column is assigned, the index values will make everything line up:
A B C
0 1 4 9.0
1 2 3 7.0
2 3 6 8.0
3 1 9 1.0
4 2 7 6.0
5 3 8 3.0
6 1 1 NaN
7 2 6 NaN
8 3 3 NaN
Alternatively, reverse the dataframe, apply a groupby transform with shift, and reverse it back:
df = df[::-1]
df['C'] = df.groupby(df.columns[0]).transform('shift')
df = df[::-1]
df
A B C
0 1 4 9.0
1 2 3 7.0
2 3 6 8.0
3 1 9 1.0
4 2 7 6.0
5 3 8 3.0
6 1 1 NaN
7 2 6 NaN
8 3 3 NaN
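Both approaches amount to shifting B backwards within each A group; a more direct equivalent, as a sketch:
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 1, 2, 3, 1, 2, 3],
                   'B': [4, 3, 6, 9, 7, 8, 1, 6, 3]})

# shift(-1) inside each A group pulls up the B value of the
# next occurrence; the last occurrence in each group gets NaN
df['C'] = df.groupby('A')['B'].shift(-1)
print(df)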
