How to obtain the content of a pandas multilevel index entry? - python

I set up a pandas dataframes that besides my data stores the respective units with it using a MultiIndex like this:
Name Relative_Pressure Volume_STP
Unit - ccm/g
Description p/p0
0 0.042691 29.3601
1 0.078319 30.3071
2 0.129529 31.1643
3 0.183355 31.8513
4 0.233435 32.3972
5 0.280847 32.8724
Now I can for example extract only the Volume_STP data by
Unit ccm/g
Description
0 29.3601
1 30.3071
2 31.1643
3 31.8513
4 32.3972
5 32.8724
With .values I can obtain a numpy array of the data. However how can I get the stored unit? I can't figure out what I need to do to receive the stored ccm/g string.
EDIT: Added example how data frame is generated
Let's say I have a string that looks like this:
Relative Volume # STP
Pressure
cc/g
4.26910e-02 29.3601
7.83190e-02 30.3071
1.29529e-01 31.1643
1.83355e-01 31.8513
2.33435e-01 32.3972
2.80847e-01 32.8724
3.34769e-01 33.4049
3.79123e-01 33.8401
I then use this function:
def read_result(contents, columns, units, descr):
df = pd.read_csv(StringIO(contents), skiprows=4, delim_whitespace=True,index_col=False,header=None)
df.drop(df.index[-1], inplace=True)
index = pd.MultiIndex.from_arrays((columns, units, descr))
df.columns = index
df.columns.names = ['Name','Unit','Description']
df = df.apply(pd.to_numeric)
return df
like this
def isotherm(contents):
columns = ['Relative_Pressure','Volume_STP']
units = ['-','ccm/g']
descr = ['p/p0','']
df = read_result(contents, columns, units, descr)
return df
to generate the DataFrame at the beginning of my question.

As df has a MultiIndex as columns, df.Volume_STP is still a pandas DataFrame. So you can still access its columns attribute, and the relevant item will be at index 0 because the dataframe contains only 1 Series.
So, you can extract the names that way:
print(df.Volume_STP.columns[0])
which should give: ('ccm/g', '')
At the end you extract the unit with .colums[0][0] and the description with .columns[0][1]

You can do something like this:
df.xs('Volume_STP', axis=1).columns.remove_unused_levels().get_level_values(0).tolist()[0]
Output:
'ccm/g'
Slice the dataframe from the 'Volume_STP' using xs, then select the columns remove the unused parts of the column headers, then get the value for the top most level of that slice which is the Units. Convert to a list as select the first value.

A generic way of accessing values on multi-index/columns is by using the index.get_level_values or columns.get_level_values functions of a data frame.
In your example, try df.columns.get_level_values(1) to access the second level of the multi-level column "Unit". If you have already selected a column, say "Volume_STP", then you have removed the top level and in this case, your units would be in the 0th level.

Related

Pandas, groupby by 2 non numeric columns

I have a dataframe with several columns that I only need to use 2 non numeric columns
1 is 'hashed_id' another is 'event name' with 10 unique names
I'm trying to do a groupby by 2 non numeric columns, so aggregation functions would not work here
my solution is:
df_events = df.groupby('subscription_hash', 'event_name')['event_name']
df_events = pd.DataFrame (df_events, columns = ["subscription_hash",
'event_name'])
I'm trying to get a format like:
subscription_hash event_name
0 (0000379144f24717a8d124d798008a0e672) AddToQueue
1 (0000379144f24717a8d124d798008a0e672) page_view
but instead getting:
subscription_hash event_name
0 (0000379144f24717a8d124d798008a0e672) 832433 AddToQueue
1 (0000379144f24717a8d124d798008a0e672) 245400 page_view
Please advise
Is your data clean ? what are those undesired numbers coming from ?
from the docs, I see groupby being used by providing the name of columns as a list and an aggregate function:
df.groupby(['col1','col2']).mean()
since your values are not numeric maybe try the pivot method:
df.pivot(columns=['col1','col2'])
so id try first putting [] around your colnames, then try the pivot.

Pandas:merge two dataframes with unequal row and column size and merge them based on explanation below

I have two dataframes of unequal size, one contains cuisine style along with its frequency in the dataset and another is the original dataset which has restaurant name and cuisine corresponding to it. I want to add a new column on the original dataset where the frequency value of each cuisine is displayed from the dataframe containing the frequency data. What is the best way to perform that? I have tried by using merge but that creates NaN values. Please suggest
I tried below code snippet suggested but it did not give me the required result. it generates freq for first row and excludes the other rows for the same 'name' column.
df = df.assign(freq=0)
# get all the cuisine styles in the cuisine df
for cuisine in np.unique(cuisine_df['cuisine_style']):
# get the freq
freq = cuisine_df.loc[cuisine_df['cuisine_style'] == cuisine,
'freq'].values
# update value in main df
df.loc[df['cuisine_style'] == cuisine, 'freq'] = freq
Result dataframe
I re ran the code on your data set and still got the same results. Here is the code I ran.
import pandas as pd
import numpy as np
# used to set 'Cuisine Style' to first 'style' in array of values
def getCusinie(row):
arr = row['Cuisine Style'].split("'")
return arr[1]
# read in data set. Used first col for index and drop nan for ease of use
csv = pd.read_csv('TA_restaurants_curated.csv', index_col=0).dropna()
# get cuisine values
cuisines = csv.apply(lambda row: getCusinie(row), axis=1)
# update dataframe
csv['Cuisine Style'] = cuisines
# json obj to quickly make a new data frame with meaningless frequencies
c = {'Cuisine Style' : np.unique(csv['Cuisine Style']), 'freq': range(113)}
cuisine_df = pd.DataFrame(c)
# add 'freq' column to original Data Frame
csv = csv.assign(freq=0)
# same loop as before
for cuisine in np.unique(cuisine_df['Cuisine Style']):
# get the freq
freq = cuisine_df.loc[cuisine_df['Cuisine Style'] == cuisine,
'freq'].values
# update value in main df
csv.loc[csv['Cuisine Style'] == cuisine, 'freq'] = freq
Output:
As you can see, every column, even duplicates, have been updated. If they still are not being updated I'd check to make sure that the names are actually equal i.e. make sure there isn't any hidden spaces or anything causing issues.
You can read up on selecting and indexing DataFrames here.
Its quite long but you can pick apart what you need, when you need it

How to sort a dataFrame in python pandas by two or more columns based on a list of values? [duplicate]

So I have a pandas DataFrame, df, with columns that represent taxonomical classification (i.e. Kingdom, Phylum, Class etc...) I also have a list of taxonomic labels that correspond to the order I would like the DataFrame to be ordered by.
The list looks something like this:
class_list=['Gammaproteobacteria', 'Bacteroidetes', 'Negativicutes', 'Clostridia', 'Bacilli', 'Actinobacteria', 'Betaproteobacteria', 'delta/epsilon subdivisions', 'Synergistia', 'Mollicutes', 'Nitrospira', 'Spirochaetia', 'Thermotogae', 'Aquificae', 'Fimbriimonas', 'Gemmatimonadetes', 'Dehalococcoidia', 'Oscillatoriophycideae', 'Chlamydiae', 'Nostocales', 'Thermodesulfobacteria', 'Erysipelotrichia', 'Chlorobi', 'Deinococci']
This list would correspond to the Dataframe column df['Class']. I would like to sort all the rows for the whole dataframe based on the order of the list as df['Class'] is in a different order currently. What would be the best way to do this?
You could make the Class column your index column
df = df.set_index('Class')
and then use df.loc to reindex the DataFrame with class_list:
df.loc[class_list]
Minimal example:
>>> df = pd.DataFrame({'Class': ['Gammaproteobacteria', 'Bacteroidetes', 'Negativicutes'], 'Number': [3, 5, 6]})
>>> df
Class Number
0 Gammaproteobacteria 3
1 Bacteroidetes 5
2 Negativicutes 6
>>> df = df.set_index('Class')
>>> df.loc[['Bacteroidetes', 'Negativicutes', 'Gammaproteobacteria']]
Number
Bacteroidetes 5
Negativicutes 6
Gammaproteobacteria 3
Alex's solution doesn't work if your original dataframe does not contain all of the elements in the ordered list i.e.: if your input data at some point in time does not contain "Negativicutes", this script will fail. One way to get past this is to append your df's in a list and concatenate them at the end. For example:
ordered_classes = ['Bacteroidetes', 'Negativicutes', 'Gammaproteobacteria']
df_list = []
for i in ordered_classes:
df_list.append(df[df['Class']==i])
ordered_df = pd.concat(df_list)

Wide to long returns empty output - Python dataframe

I have a dataframe which can be generated from the code as given below
df = pd.DataFrame({'person_id' :[1,2,3],'date1':
['12/31/2007','11/25/2009','10/06/2005'],'val1':
[2,4,6],'date2': ['12/31/2017','11/25/2019','10/06/2015'],'val2':[1,3,5],'date3':
['12/31/2027','11/25/2029','10/06/2025'],'val3':[7,9,11]})
I followed the below solution to convert it from wide to long
pd.wide_to_long(df, stubnames=['date', 'val'], i='person_id',
j='grp').sort_index(level=0)
Though this works with sample data as shown below, it doesn't work with my real data which has more than 200 columns. Instead of person_id, my real data has subject_ID which is values like DC0001,DC0002 etc. Does "I" always have to be numeric? Instead it adds the stub values as new columns in my dataset and has zero rows
This is how my real columns looks like
My real data might contains NA's as well. So do I have to fill them with default values for wide_to_long to work?
Can you please help as to what can be the issue? Or any other approach to achieve the same result is also helpful.
Try adding additional argument in the function which allows the strings suffix.
pd.long_to_wide(.......................,suffix='\w+')
The issue is with your column names, the numbers used to convert from wide to long need to be at the end of your column names or you need to specify a suffix to groupby. I think the easiest solution is to create a function that accepts regex and the dataframe.
import pandas as pd
import re
def change_names(df, regex):
# Select one of three column groups
old_cols = df.filter(regex = regex).columns
# Create list of new column names
new_cols = []
for col in old_cols:
# Get the stubname of the original column
stub = ''.join(re.split(r'\d', col))
# Get the time point
num = re.findall(r'\d+', col) # returns a list like ['1']
# Make new column name
new_col = stub + num[0]
new_cols.append(new_col)
# Create dictionary mapping old column names to new column names
dd = {oc: nc for oc, nc in zip(old_cols, new_cols)}
# Rename columns
df.rename(columns = dd, inplace = True)
return df
tdf = pd.DataFrame({'person_id' :[1,2,3],'h1date': ['12/31/2007','11/25/2009','10/06/2005'],'t1val': [2,4,6],'h2date': ['12/31/2017','11/25/2019','10/06/2015'],'t2val':[1,3,5],'h3date': ['12/31/2027','11/25/2029','10/06/2025'],'t3val':[7,9,11]})
# Change date columns
tdf = change_names(tdf, 'date$')
tdf = change_names(tdf, 'val$')
print(tdf)
person_id hdate1 tval1 hdate2 tval2 hdate3 tval3
0 1 12/31/2007 2 12/31/2017 1 12/31/2027 7
1 2 11/25/2009 4 11/25/2019 3 11/25/2029 9
2 3 10/06/2005 6 10/06/2015 5 10/06/2025 11
This is quite late to answer this question. But putting the solution here in case someone else find it useful
tdf = pd.DataFrame({'person_id' :[1,2,3],'h1date': ['12/31/2007','11/25/2009','10/06/2005'],'t1val': [2,4,6],'h2date': ['12/31/2017','11/25/2019','10/06/2015'],'t2val':[1,3,5],'h3date': ['12/31/2027','11/25/2029','10/06/2025'],'t3val':[7,9,11]})
## You can use m13op22 solution to rename your columns with numeric part at the
## end of the column name. This is important.
tdf = tdf.rename(columns={'h1date': 'hdate1', 't1val': 'tval1',
'h2date': 'hdate2', 't2val': 'tval2',
'h3date': 'hdate3', 't3val': 'tval3'})
## Then use the non-numeric portion, (in this example 'hdate', 'tval') as
## stubnames. The mistake you were doing was using ['date', 'val'] as stubnames.
df = pd.wide_to_long(tdf, stubnames=['hdate', 'tval'], i='person_id', j='grp').sort_index(level=0)
print(df)

Extracting specific columns from pandas.dataframe

I'm trying to use python to read my csv file extract specific columns to a pandas.dataframe and show that dataframe. However, I don't see the data frame, I receive Series([], dtype: object) as an output. Below is the code that I'm working with:
My document consists of:
product sub_product issue sub_issue consumer_complaint_narrative
company_public_response company state zipcode tags
consumer_consent_provided submitted_via date_sent_to_company
company_response_to_consumer timely_response consumer_disputed?
complaint_id
I want to extract :
sub_product issue sub_issue consumer_complaint_narrative
import pandas as pd
df=pd.read_csv("C:\\....\\consumer_complaints.csv")
df=df.stack(level=0)
df2 = df.filter(regex='[B-F]')
df[df2]
import pandas as pd
input_file = "C:\\....\\consumer_complaints.csv"
dataset = pd.read_csv(input_file)
df = pd.DataFrame(dataset)
cols = [1,2,3,4]
df = df[df.columns[cols]]
Here specify your column numbers which you want to select. In dataframe, column start from index = 0
cols = []
You can select column by name wise also. Just use following line
df = df[["Column Name","Column Name2"]]
A simple way to achieve this would be as follows:
df = pd.read_csv("C:\\....\\consumer_complaints.csv")
df2 = df.loc[:,'B':'F']
Hope that helps.
This worked for me, using slicing:
df=pd.read_csv
df1=df[n1:n2]
Where $n1<n2# are both columns in the range, e.g:
if you want columns 3-5, use
df1=df[3:5]
For the first column, use
df1=df[0]
Though not sure how to select a discontinuous range of columns.
We can also use i.loc. Given data in dataset2:
dataset2.iloc[:3,[1,2]]
Will spit out the top 3 rows of columns 2-3 (Remember numbering starts at 0)
Then dataset2.iloc[:3,[1,2]] spits out

Categories

Resources