Pivot Table in Python

I am pretty new to Python, so I need your help with the following:
I have two tables (dataframes):
Table 1 has all the data and looks like this:
GenDate column has the generation day.
Date column has dates.
Columns D and onwards hold different values.
I also have the following table:
Column I has "keywords" that can be found in the headers of Table 1.
Column K has dates that appear in column C of Table 1.
My goal is to produce a table like the following:
I have omitted a few columns for illustration purposes.
Every column in Table 1 should be split based on the Type written in its header.
E.g. A_Weeks: the Weeks type corresponds to 3 splits: Week1, Week2 and Week3.
Each of these splits has a specific Date.
In the new table, 3 columns should be created, using A_ followed by the split name:
A_Week1, A_Week2 and A_Week3.
For each of these columns, the value that corresponds to the Date of that split should be used.
I hope the explanation is clear.
Thanks

You can get the desired table with the following code (follow the comments, and check the pandas API reference to learn about the functions used):
import numpy as np
import pandas as pd

# initial data
t_1 = pd.DataFrame(
    {'GenDate': [1, 1, 1, 2, 2, 2],
     'Date': [10, 20, 30, 10, 20, 30],
     'A_Days': [11, 12, 13, 14, 15, 16],
     'B_Days': [21, 22, 23, 24, 25, 26],
     'A_Weeks': [110, 120, 130, 140, np.nan, 160],
     'B_Weeks': [210, 220, 230, 240, np.nan, 260]})
# initial data
t_2 = pd.DataFrame(
    {'Type': ['Days', 'Days', 'Days', 'Weeks', 'Weeks'],
     'Split': ['Day1', 'Day2', 'Day3', 'Week1', 'Week2'],
     'Date': [10, 20, 30, 10, 30]})
# create a MultiIndex
t_1 = t_1.set_index(['GenDate', 'Date'])
# pivot the 'Date' level of the MultiIndex - unstack it from index to columns -
# and drop columns that contain any NaN value (dropna defaults to how='any')
tt_1 = t_1.unstack().dropna(axis=1)
# tt_1 is what you need, with multi-level column labels
# build a per-Type {Date: Split} mapping to rename the columns
t_2 = t_2.set_index(['Type'])
mapping = {
    type_: dict(zip(
        t_2.loc[type_, 'Date'],
        t_2.loc[type_, 'Split']))
    for type_ in t_2.index.unique()}
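# mapping is now:
# {'Days': {10: 'Day1', 20: 'Day2', 30: 'Day3'},
#  'Weeks': {10: 'Week1', 30: 'Week2'}}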
# new column names
new_columns = list()
for letter_type, date in tt_1.columns.values:
    letter, type_ = letter_type.split('_')
    new_columns.append('{}_{}'.format(letter, mapping[type_][date]))
tt_1.columns = new_columns
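For reference, printing the result should give roughly:
print(tt_1)
#          A_Day1  A_Day2  A_Day3  B_Day1  B_Day2  B_Day3  A_Week1  A_Week2  B_Week1  B_Week2
# GenDate
# 1            11      12      13      21      22      23    110.0    130.0    210.0    230.0
# 2            14      15      16      24      25      26    140.0    160.0    240.0    260.0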

Related

groupby using combination of column index and column name

I'm creating a function to filter many dataframes using groupby. The dataframes look like the one below; however, each dataframe does not always contain the same number of columns.
df = pd.DataFrame({
    'xyz CODE': [1, 2, 3, 3, 4, 5, 6, 7, 7, 8],
    'a': [4, 5, 3, 1, 2, 20, 10, 40, 50, 30],
    'b': [20, 10, 40, 50, 30, 4, 5, 3, 1, 2],
    'c': [25, 20, 5, 15, 10, 25, 20, 5, 15, 10]})
For each dataframe I always apply groupby to the first column - which are named differently across dataframes. All other columns are named consistently across all dataframes.
My question: Is it possible to run groupby using a combination of column location and column names? How can I do it?
I wrote the following function, and got the error TypeError: unhashable type: 'list':
def filter_all_df(df):
    df['max_c'] = df.groupby(df.columns[0])['a'].transform('max')
    newdf = df[df['a'] == df['max_c']].drop(['max_c'], axis=1)
    newdf['max_score'] = newdf.groupby([newdf.columns[0], 'a', 'b'])['c'].transform('max')
    newdf = newdf[newdf['c'] == newdf['max_score']]
    newdf = newdf.sort_values([newdf.columns[0]]).drop_duplicates([newdf.columns[0], 'a', 'b', 'c'], keep='last')
    newdf.to_csv('newdf_all.csv')
    return newdf
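Mixing a positional lookup with explicit column names does work, because df.columns[0] is simply the first column's label. A minimal sketch of the pattern (with hypothetical data):

import pandas as pd

df = pd.DataFrame({'xyz CODE': [1, 1, 2, 2],
                   'a': [4, 4, 3, 3],
                   'b': [20, 20, 40, 40],
                   'c': [25, 20, 5, 15]})
# df.columns[0] evaluates to the label 'xyz CODE', so it can be mixed
# with explicit names in the list of groupby keys
out = df.groupby([df.columns[0], 'a', 'b'])['c'].transform('max')
print(out)
# 0    25
# 1    25
# 2    15
# 3    15
# Name: c, dtype: int64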

Python get specific value from HDF table

I have two tables. The first contains 300 rows; each row represents a case, with 3 columns of which 2 hold constant values characterizing the case. The second table is my data table collected from sensors and has the same indicators as the first, except for the case column. The idea is to detect which case each line of the second table belongs to, knowing that its values are not identical to the first table's but fall within its range.
example:
First table:
[[1500, 22, 0], [1100, 40, 1], [2400, 19, 2]]
columns=['analog', 'temperature', 'case']
Second table:
[[1420, 20], [1000, 39], [2300, 29]]
columns=['analog', 'temperature']
I want to detect which case my first row (1420, 20) belongs to.
You can simply use a classifier; K-NN for instance...
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier

df = pd.DataFrame([[1500, 22, 0], [1100, 40, 1], [2400, 19, 2]],
                  columns=['analog', 'temperature', 'case'])
df1 = pd.DataFrame([[1420, 20], [1000, 39], [2300, 29]],
                   columns=['analog', 'temperature'])
# 1-nearest-neighbour with Euclidean distance (minkowski, p=2)
classifier = KNeighborsClassifier(n_neighbors=1, metric='minkowski', p=2)
classifier.fit(df[['analog', 'temperature']], df['case'])
df1['case'] = classifier.predict(df1)
Output of df1:
   analog  temperature  case
0    1420           20     0
1    1000           39     1
2    2300           29     2
So the first row (1420, 20) in df1 (the 2nd table) belongs to case 0.
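One caveat: with unscaled features, the nearest-neighbour distance is dominated by the wide-ranged 'analog' column. A minimal sketch of scaling first, assuming scikit-learn's StandardScaler and make_pipeline (reusing df and df1 from above):

from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# scale both features so 'temperature' also contributes to the distance
clf = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=1))
clf.fit(df[['analog', 'temperature']], df['case'])
df1['case'] = clf.predict(df1[['analog', 'temperature']])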
What do you mean by belong? [1420, 20] belongs to [?,?,?]?

Mapping a multiindex dataframe to another using row ID

I have two dataframes of different shapes.
The 'ANTENNA1' and 'ANTENNA2' columns in the bigger dataframe correspond to the ID column in the smaller dataframe. I want to merge the smaller dataframe into the bigger one, so that the bigger dataframe gains '(POSITION, col1)', '(POSITION, col2)' and '(POSITION, col3)' columns according to ANTENNA1 == ID.
Edit: I tried pd.merge, but it changes the original dataframe's column values.
Original:
df = pd.merge(df_main, df_sub, left_on='ANTENNA1', right_on='id', how='left')
Result:
I want to keep the original dataframe columns as they are.
Assuming your first dataframe (with positions) is called df1 and the second is called df2, you could just use pandas.DataFrame.merge (i.e. pd.merge(...)) with your loaded data:
df = pd.merge(df1,df2,left_on='id', right_on='ANTENNA1')
Then you can select the needed columns (col1, col2, ...) to get the desired result: df[["col1", "col2", ...]].
simple example:
import pandas as pd

# creating dataframes df1 and df2
df1 = pd.DataFrame({'ID': [1, 2, 3, 5, 7, 8],
                    'Name': ['Sam', 'John', 'Bridge',
                             'Edge', 'Joe', 'Hope']})
df2 = pd.DataFrame({'id': [1, 2, 4, 5, 6, 8, 9],
                    'Marks': [67, 92, 75, 83, 69, 56, 81]})
# merging df1 and df2 by ID,
# i.e. the rows with the common IDs {1, 2, 5, 8} get merged
df = pd.merge(df1, df2, left_on="ID", right_on="id")
print(df)
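This should print something like:
   ID  Name  id  Marks
0   1   Sam   1     67
1   2  John   2     92
2   5  Edge   5     83
3   8  Hope   8     56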

Yearly bar chart with quarterly data in the last column only

I aim to create a plot similar to the image where the '2020 Q4' data is in the same column as '2020'.
So far, I have only been able to place the 2020 Q4 data as an extra column.
The data is provided as a DataFrame like in the code below:
# DataFrame from lists
import pandas as pd

# initialize data of lists
data = {'A': [10, 15, 20, 26, 27, 35, 15],
        'B': [20, 25, 32, 33, 50, 52, 8],
        'C': [30, 35, 41, 49, 52, 53, 25]}
# create the pandas DataFrame
df = pd.DataFrame(data, index=['2015', '2016', '2017', '2018',
                               '2019', '2020', '2020 Q4'])
# plotting the data
df.plot(kind='bar', stacked=True)
Two problems have to be addressed here. First, the Q4 data has to be transposed into the row above. Second, the corresponding columns (A and A Q4, etc.) need similar colors to make it clear that they belong to the same category; matplotlib's tab20 colormap comes in handy. Here is one approach:
from matplotlib import pyplot as plt
import pandas as pd
import numpy as np

# initialize data of lists
data = {'A': [10, 15, 20, 26, 27, 35, 15],
        'B': [20, 25, 32, 33, 50, 52, 8],
        'C': [30, 35, 41, 49, 52, 53, 25]}
# create the pandas DataFrame
df = pd.DataFrame(data, index=['2015', '2016', '2017', '2018',
                               '2019', '2020', '2020 Q4'])
# get column names
columns = df.columns
# store the data of the last row and create a new dataframe without it
val_q4 = df.iloc[-1].values
df1 = df.iloc[:-1]
# alternatively, one can simply overwrite df; it doesn't matter to remove the Q4 row
# df = df.iloc[:-1]
# generate additional columns for the Q4 data
new_columns = [f(item) for item in columns for f in (lambda x: x, lambda x: x + " Q4")]
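# new_columns is now ['A', 'A Q4', 'B', 'B Q4', 'C', 'C Q4']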
df1 = df1.reindex(columns=new_columns)
# store the Q4 data in the last row (every second column)
df1.iloc[-1, range(1, 2 * len(columns), 2)] = val_q4
# create corresponding color pairs using the tab20 colormap
colors = plt.cm.tab20(np.linspace(0, 1, 20))
df1.plot(kind='bar', stacked=True, color=colors)
plt.show()
Sample output:
Restrictions: it relies on the rows of your data already being sorted, with the last two rows being "Year" and "Year Q4". tab20 limits the approach to 10 columns (A, B, C, ..., J), because beginning with K the colors will no longer be unique. However, stacked bar graphs with more than 10 categories should be outlawed anyhow.
You can introduce Q4 values for the A, B and C columns and initialise them to zero for all years except 2020. This should give you the desired result.
For example, see this updated code:
# DataFrame from lists
import pandas as pd

# initialize data of lists
data = {'A': [10, 15, 20, 26, 27, 35, 15],
        'B': [20, 25, 32, 33, 50, 52, 8],
        'C': [30, 35, 41, 49, 52, 53, 25]}
# additional data, initialised with zeros for all years
add_data = {'AQ4': [0] * 5,
            'BQ4': [0] * 5,
            'CQ4': [0] * 5}
# take the last element of lists A, B and C and append it to the add_data dict
add_data['AQ4'].append(data['A'].pop())
add_data['BQ4'].append(data['B'].pop())
add_data['CQ4'].append(data['C'].pop())
# concatenate the two dicts
data.update(add_data)
# create the pandas DataFrame
df = pd.DataFrame(data, index=['2015', '2016', '2017',
                               '2018', '2019', '2020'])
# plotting the data
df.plot(kind='bar', stacked=True)
Note that I have removed the 2020 Q4 index when creating the data frame.
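For reference, the resulting frame should look roughly like:
print(df)
#        A   B   C  AQ4  BQ4  CQ4
# 2015  10  20  30    0    0    0
# 2016  15  25  35    0    0    0
# 2017  20  32  41    0    0    0
# 2018  26  33  49    0    0    0
# 2019  27  50  52    0    0    0
# 2020  35  52  53   15    8   25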

Pandas convert multiple rows as header (multi level)

I want to convert 3 rows into a multi-level column header in a pandas dataframe.
A sample dataframe is:
df = pd.DataFrame({'a': ['foo_0', 'bar_0', 1, 2, 3],
                   'b': ['foo_0', 'bar_0', 11, 12, 13],
                   'c': ['foo_1', 'bar_1', 21, 22, 23],
                   'd': ['foo_1', 'bar_1', 31, 32, 33]})
The expected output looks like the following, wherein the yellow-colored part is the multi-level column header.
Thank you,
-Nilesh
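One way to build such a header, as a sketch (assuming the three levels are the original column names plus the first two rows): create a MultiIndex from those arrays, attach it as the columns, and drop the two label rows.

import pandas as pd

df = pd.DataFrame({'a': ['foo_0', 'bar_0', 1, 2, 3],
                   'b': ['foo_0', 'bar_0', 11, 12, 13],
                   'c': ['foo_1', 'bar_1', 21, 22, 23],
                   'd': ['foo_1', 'bar_1', 31, 32, 33]})
# three header levels: the existing column names plus rows 0 and 1
header = pd.MultiIndex.from_arrays([df.columns, df.iloc[0], df.iloc[1]])
# drop the two label rows and attach the MultiIndex as columns
df = df.iloc[2:].set_axis(header, axis=1).reset_index(drop=True)

The columns are then the three-level tuples ('a', 'foo_0', 'bar_0'), ('b', 'foo_0', 'bar_0'), ('c', 'foo_1', 'bar_1') and ('d', 'foo_1', 'bar_1').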
