Appending multi-index column headers to an existing dataframe - python

I'm looking to append multi-index column headers to an existing dataframe. This is my current dataframe:
Name = pd.Series(['John','Paul','Sarah'])
Grades = pd.Series(['A','A','B'])
HumanGender = pd.Series(['M','M','F'])
DogName = pd.Series(['Rocko','Oreo','Cosmo'])
Breed = pd.Series(['Bulldog','Poodle','Golden Retriever'])
Age = pd.Series([2,5,4])
DogGender = pd.Series(['F','F','F'])
SchoolName = pd.Series(['NYU','UCLA','UCSD'])
Location = pd.Series(['New York','Los Angeles','San Diego'])
df = pd.DataFrame({'Name': Name, 'Grades': Grades, 'HumanGender': HumanGender,
                   'DogName': DogName, 'Breed': Breed, 'Age': Age,
                   'DogGender': DogGender, 'SchoolName': SchoolName,
                   'Location': Location})
I want to add 3 labels on top of the existing columns. For example, columns [0,1,2,3] should be labeled 'People', columns [4,5,6] should be labeled 'Dogs', and columns [7,8] should be labeled 'Schools'. The final result should have 3 group labels on top of the 9 existing columns.
Thanks!

IIUC, you can do:
newlevel = ['People']*4 + ['Dogs']*3 + ['Schools']*2
df.columns = pd.MultiIndex.from_tuples([*zip(newlevel, df.columns)])
Note that [*zip(newlevel, df.columns)] is equivalent to
[(a, b) for a, b in zip(newlevel, df.columns)]
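With the MultiIndex in place, each top-level label selects its whole group of columns (standard MultiIndex selection), which is a quick way to verify the result:
print(df['People'].head())   # Name, Grades, HumanGender, DogName
print(df['Dogs'].head())     # Breed, Age, DogGender
print(df['Schools'].head())  # SchoolName, Location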

Related

How to concatenate a series to a pandas dataframe in python?

I would like to iterate through a dataframe's rows and concatenate each row to a different dataframe, basically building up a new dataframe from selected rows.
For example:
IPCSection and IPCClass are dataframes:
allcolumns = np.concatenate((IPCSection.columns, IPCClass.columns), axis=0)
finalpatentclasses = pd.DataFrame(columns=allcolumns)
for isec, secrow in IPCSection.iterrows():
    for icl, clrow in IPCClass.iterrows():
        if secrow[0] in clrow[0]:
            pdList = [finalpatentclasses, pd.DataFrame(secrow), pd.DataFrame(clrow)]
            finalpatentclasses = pd.concat(pdList, axis=0, ignore_index=True)
display(finalpatentclasses)
The output stacks everything with NaN padding. I want the NaN values to disappear and all the data to move under the correct columns. I tried axis=1, but that messes up the column names. append does not work either; all values are placed diagonally in the table, with NaN values as well.
Alright, I have figured it out. The idea is to create a new-row dataframe per iteration, concatenate the values into it, and then concat it with the final dataframe.
Here is the code:
allcolumns = np.concatenate((IPCSection.columns, IPCClass.columns), axis=0)
finalpatentclasses = pd.DataFrame(columns=allcolumns)
for isec, secrow in IPCSection.iterrows():
    for icl, clrow in IPCClass.iterrows():
        if secrow[0] in clrow[0]:  # same match condition as before
            newrow = pd.DataFrame(columns=allcolumns)
            values = np.concatenate((secrow.values, clrow.values), axis=0)  # clrow, not the undefined subclrow
            newrow.loc[len(newrow.index)] = values
            finalpatentclasses = pd.concat([finalpatentclasses, newrow], axis=0)
finalpatentclasses.reset_index(drop=True, inplace=True)  # drop=True discards the old index instead of keeping it as a column
display(finalpatentclasses)
Update: the code below is more efficient:
allcolumns = np.concatenate((IPCSection.columns, IPCClass.columns), axis=0)  # only the columns actually filled below
newList = []
for secrow in IPCSection.itertuples():
    for clrow in IPCClass.itertuples():
        if secrow[1] in clrow[1]:
            values = [secrow[1], secrow[2], clrow[1], clrow[2]]  # clrow, not the undefined subclrow
            newList.append(values)
finalpatentclasses = pd.DataFrame(newList, columns=allcolumns)
display(finalpatentclasses)
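For reference, a vectorized sketch of the same join, assuming pandas >= 1.2 (for how='cross') and that IPCSection and IPCClass each have two columns, as the loops above suggest: build the full cross product once, then filter on the substring condition.
cross = IPCSection.merge(IPCClass, how='cross', suffixes=('_sec', '_cl'))
# keep rows where the section key appears in the class key, mirroring `secrow[1] in clrow[1]`
mask = [sec in cl for sec, cl in zip(cross.iloc[:, 0], cross.iloc[:, 2])]
finalpatentclasses = cross[mask].reset_index(drop=True)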

How do I find the uniques and the count of rows for multiple columns?

I have a file with 136 columns. I was trying to find the unique values of each column, and from there, I need the number of rows for those unique values.
I tried using a dataframe and a dict for the unique values. However, when I export back to a csv file, the unique values end up as a list in one cell for each column.
Is there any way I can do to simplify the counting process of the unique values in each column?
df = pd.read_excel(filename)
column_headers = list(df.columns.values)
df_unique = {}
df_count = {}
def approach_1(data):
    count = 0
    for entry in data:
        if not entry == 'nan' and not entry == 'NaN':  # 'and', not 'or': the 'or' version is always true
            count += 1
    return count
for unique in column_headers:
    new = df.drop_duplicates(subset=unique, keep='first')
    df_unique[unique] = new[unique].tolist()
csv_unique = pd.DataFrame(df_unique.items(), columns=['Data Source Field', 'First Row'])
csv_unique.to_csv('Unique.csv', index=False)
for count in df_unique:
    not_nan = approach_1(df_unique[count])
    df_count[count] = not_nan
csv_count = pd.DataFrame(df_count.items(), columns=['Data Source Field', 'Count'])
.unique() is simpler: len(df[col].unique()) is the count.
import pandas as pd
records = [  # renamed from `dict` to avoid shadowing the built-in
    {"col1": "0", "col2": "a"},
    {"col1": "1", "col2": "a"},
    {"col1": "2", "col2": "a"},
    {"col1": "3", "col2": "a"},
    {"col1": "4", "col2": "a"},
    {"col2": "a"},
]
df = pd.DataFrame(records)
result_dict = {}
for col in df.columns:
    result_dict[col] = len(df[col].dropna().unique())
print(result_dict)
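Note that pandas has this built in: DataFrame.nunique() counts the distinct non-NaN values in every column at once, so the loop above reduces to a one-liner:
result_dict = df.nunique().to_dict()
print(result_dict)  # {'col1': 5, 'col2': 1}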

Reformatting a dataframe for sorting after concatenating two series

I've joined or concatenated two series into a dataframe. However, one of the issues I'm now facing is that I have no column headings on the actual data that would help me do a sort.
hist_a = pd.crosstab(category_a, category, normalize=True)
hist_b = pd.crosstab(category_b, category, normalize=True)
counts_a = pd.Series(np.diag(hist_a), index=[hist_a.index])
counts_b = pd.Series(np.diag(hist_b), index=[hist_b.index])
df_plots = pd.concat([counts_a, counts_b], axis=1).fillna(0)
The data looks like the following:
0 1
category
0017817703277 0.000516 5.384341e-04
0017817703284 0.000516 5.384341e-04
0017817731348 0.000216 2.856169e-04
0017817731355 0.000216 2.856169e-04
I'd like to do a sort, but there are no proper column headings:
df_plots = df_plots.sort_values(by=['0?'])
But the dataframe seems to be in two parts. How could I better structure the dataframe to have 'proper' columns such as '0' or 'plot a', rather than being indexable only by an integer, which is hard to work with? Something like:
category plot a plot b
0017817703277 0.000516 5.384341e-04
0017817703284 0.000516 5.384341e-04
0017817731348 0.000216 2.856169e-04
0017817731355 0.000216 2.856169e-04
Just rename the columns of the dataframe, for example:
df = pd.DataFrame({0:[1,23]})
df = df.rename(columns={0:'new name'})
If you have a lot of columns, you can rename all of them at once:
df = pd.DataFrame({0:[1,23]})
rename_dict = {key: f'Col {key}' for key in df.keys() }
df = df.rename(columns=rename_dict)
You can also define the series with the name, so you avoid changing the name afterwards:
counts_a = pd.Series(np.diag(hist_a), index=[hist_a.index], name = 'counts_a')
counts_b = pd.Series(np.diag(hist_b), index=[hist_b.index], name = 'counts_b')
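Putting it together as a sketch: dropping the list wrapper around the index is also worth doing, since passing a list of arrays as an index builds a (one-level) MultiIndex, which is likely why the frame appeared to be in two parts. With plain indexes and named series, the concat produces proper headers and the sort works directly:
counts_a = pd.Series(np.diag(hist_a), index=hist_a.index, name='plot a')
counts_b = pd.Series(np.diag(hist_b), index=hist_b.index, name='plot b')
df_plots = pd.concat([counts_a, counts_b], axis=1).fillna(0)
df_plots = df_plots.sort_values(by='plot a')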

pandas: equalize all leg column names of the dataframe

I have a dataframe with multiple leg column names like leg/1, leg/2, up to leg/24, but each leg has multiple strings attached to it, like leg/1/a1 and leg/1/a2.
For example, I have a dataframe with these columns:
leg/1/a1 leg/1/a2 leg/2/a1 leg/3/a2
I need every leg in the dataframe to have the same set of columns as leg/1.
For example, my required dataframe column names should be:
leg/1/a1 leg/1/a2 leg/2/a1 leg/2/a2 leg/3/a1 leg/3/a2
That should be the full set of columns in the output dataframe.
For that purpose, I first collected the leg/1 details in a list:
legs=['leg/1/a1','leg/1/a2']
I created this list to match against all the dataframe column names.
After that, I collected all the dataframe column names that start with leg:
cols = [col for col in df.columns if 'leg' in col]
but the problem is that I am unable to do the matching; any help would be appreciated.
column_list = ['leg/1/a1', 'leg/1/a2', 'leg/2/a1', 'leg/3/a2']  # replace with df.columns
col_end_list = set([e.split('/')[-1] for e in column_list])  # get all a1, a2, ..., an
# Loop through leg/1/a1 to leg/24/an
for i in range(1, 25):
    for c in col_end_list:
        check_str = 'leg/' + str(i) + '/' + c
        if check_str not in column_list:  # if the column doesn't exist, add it
            df[check_str] = 0  # adding the new column
Code to reproduce on a blank df:
import pandas as pd
df = pd.DataFrame([], columns=['leg/1/a1', 'leg/1/a2', 'leg/2/a1', 'leg/3/a2'])
column_list = df.columns
col_end_list = set([e.split('/')[-1] for e in column_list])  # get all a1, a2, ..., an
# Loop through leg/1/a1 to leg/24/an
for i in range(1, 25):
    for c in col_end_list:
        check_str = 'leg/' + str(i) + '/' + c
        if check_str not in column_list:  # if the column doesn't exist, add it
            df[check_str] = 0  # adding the new column
>>> df.columns
>>> Index(['leg/1/a1', 'leg/1/a2', 'leg/2/a1', 'leg/3/a2', 'leg/2/a2', 'leg/3/a1',
'leg/4/a1', 'leg/4/a2', 'leg/5/a1', 'leg/5/a2', 'leg/6/a1', 'leg/6/a2',
'leg/7/a1', 'leg/7/a2', 'leg/8/a1', 'leg/8/a2', 'leg/9/a1', 'leg/9/a2',
'leg/10/a1', 'leg/10/a2', 'leg/11/a1', 'leg/11/a2', 'leg/12/a1',
'leg/12/a2', 'leg/13/a1', 'leg/13/a2', 'leg/14/a1', 'leg/14/a2',
'leg/15/a1', 'leg/15/a2', 'leg/16/a1', 'leg/16/a2', 'leg/17/a1',
'leg/17/a2', 'leg/18/a1', 'leg/18/a2', 'leg/19/a1', 'leg/19/a2',
'leg/20/a1', 'leg/20/a2', 'leg/21/a1', 'leg/21/a2', 'leg/22/a1',
'leg/22/a2', 'leg/23/a1', 'leg/23/a2', 'leg/24/a1', 'leg/24/a2'],
dtype='object')
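As the output shows, the new columns are appended at the end rather than in order. If you want them sorted by leg number and then attribute, a small follow-up (assuming every column matches the leg/<n>/<attr> pattern):
df = df[sorted(df.columns, key=lambda c: (int(c.split('/')[1]), c.split('/')[2]))]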

How can I get the desired output below in Python?

How can I create a single row that holds the data type, maximum column length, and count for each column of a dataframe, as shown in the desired output section at the bottom?
import pandas as pd
table = 'sample_data'
idx = 0
# Create a dictionary of series
d = {'Name': pd.Series(['Tom', 'Ricky', 'Vin', 'Steve', 'Smith', 'Jack',
                        'Lee', 'David', 'Gasper', 'Betina', 'Andres']),
     'Age': pd.Series([25, 26, 25, 23, 30, 29, 23, 'NULL', 40, 30, 51, 46]),
     'Rating': pd.Series([4.23, 3.24, 3.98, 2.56, 3.20, 4.6, 3.8, 3.78, 2.98, 4.80, 4.10, 3.65]),
     'new_column': pd.Series([])}
# Create a DataFrame using the above data
sdf = pd.DataFrame(d)
# Create a summary description
desired_data = sdf.describe(include='all').T
desired_data = desired_data.rename(columns={'index': 'Variable'})
# Get data types
dtype = sdf.dtypes
# Get total count of records
counts = sdf.shape[0]  # number of rows
# Get maximum length of values
maxcollen = []
for col in range(len(sdf.columns)):
    maxcollen.append(max(sdf.iloc[:, col].astype(str).apply(len)))
# Construct the final data frame
desired_data = desired_data.assign(data_type=dtype.values)
desired_data = desired_data.assign(total_count=counts)
desired_data = desired_data.assign(max_col_length=maxcollen)
final_df = desired_data
final_df = final_df.reindex(columns=['data_type', 'max_col_length', 'total_count'])
final_df.insert(loc=idx, column='table_name', value=table)
final_df.to_csv('desired_data.csv')
#print(final_df)
The above code outputs one row per column of sdf, not the single row I need. The desired output I am looking for is:
In : sdf
Out:
table_name Name_data_type Name_total_count Name_max_col_length Age_data_type Age_total_count Age_max_col_length Rating_data_type Rating_total_count Rating_max_col_length
sample_data object 12 6 object 12 4 float64 12 4
As you may have noticed, I want to print a single row with columns named column_name_data_type, column_name_total_count, and column_name_max_col_length, holding the respective values.
Here's a solution:
df = final_df
df = df.drop("new_column").drop("table_name", axis=1)
df = df.reset_index()
df.melt(id_vars=["index"]).set_index(["index", "variable"]).sort_index().transpose()
The result is:
index Age Name \
variable data_type max_col_length total_count data_type max_col_length ...
value object 4 12 object 6 ...
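If you also want the flat column_name_attribute headers from the desired output rather than a two-level header, one option is to flatten the MultiIndex columns afterwards; a sketch building on the chain above (table_name re-inserted by hand, assuming `table` is still in scope):
out = df.melt(id_vars=["index"]).set_index(["index", "variable"]).sort_index().transpose()
out.columns = [f"{name}_{attr}" for name, attr in out.columns]  # e.g. 'Age_data_type'
out.insert(0, 'table_name', table)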
Can you try this?
The code below iterates over the entire dataframe, so it is not the optimal solution, but it is a working one for the problem above.
from collections import OrderedDict
# store the key-value pairs for the single output row
result_dic = OrderedDict()
unique_table_names = final_df["table_name"].unique()
# remove unwanted rows
final_df.drop("new_column", inplace=True)
cols_name = final_df.columns
# for every unique table name, generate one row
for table_name in unique_table_names:
    result_dic["table_name"] = table_name
    filtered_df = final_df[final_df["table_name"] == table_name]
    for row in filtered_df.iterrows():
        for cols in cols_name:
            if cols != "table_name":
                result_dic[row[0] + "_" + cols] = row[1][cols]
Convert the dict to a dataframe:
# build a one-row dataframe from the dict
result_df = pd.DataFrame([result_dic])
result_df
Expected output:
table_name Name_data_type Name_max_col_length Name_total_count Age_data_type Age_max_col_length Age_total_count Rating_data_type Rating_max_col_length Rating_total_count
0 sample_data object 6 12 object 4 12 float64 4 12
