Inaccessible first column in pandas dataframe? - python

I have a dataframe with multiple columns. When I execute the following code it assigns the header for the first column to the second column thereby making the first column inaccessible.
COLUMN_NAMES = ['id', 'diagnosis', 'radius_mean', 'texture_mean', 'perimeter_mean', 'area_mean',
'smoothness_mean', 'compactness_mean', 'concavity_mean',
'concave_points_mean', 'symmetry_mean', 'fractal_dimension_mean',
'radius_se', 'texture_se', 'perimeter_se', 'area_se', 'smoothness_se',
'compactness_se', 'concavity_se', 'concave_points_se', 'symmetry_se',
'fractal_dimension_se', 'radius_worst', 'texture_worst',
'perimeter_worst', 'area_worst', 'smoothness_worst',
'compactness_worst', 'concavity_worst', 'concave_points_worst',
'symmetry_worst']
TUMOR_TYPE = ['M', 'B']
path_to_file = list(files.upload().keys())[0]
data = pd.read_csv(path_to_file, names=COLUMN_NAMES, header=0)
print(data)
          id  diagnosis  ...  concave_points_worst  symmetry_worst
842302     M      17.99  ...                0.4601         0.11890
842517     M      20.57  ...                0.2750         0.08902
84300903   M      19.69  ...                0.3613         0.08758
The id header is supposed to label the first column, but it labels the second one instead, which also causes the last column header to be dropped.

Because you pass fewer names than the file has columns, pd.read_csv makes your first column the index rather than a regular column like the rest of what is on your list.
You could update it to be:
path_to_file = list(files.upload().keys())[0]
data = pd.read_csv(path_to_file, names=COLUMN_NAMES, header=0, index_col=False)
to make sure that first column isn't treated as the index.
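For what it's worth, COLUMN_NAMES above holds 31 names while the standard Wisconsin breast-cancer file has 32 columns (id, diagnosis, and 30 features), so fractal_dimension_worst may simply be the missing name; supplying a name for every column also removes the shift. A minimal sketch of both behaviours on a small inline CSV (made-up sample, not the real file):
import io
import pandas as pd

raw = "id,diagnosis,radius_mean\n842302,M,17.99\n842517,M,20.57\n"

# One name too few: the leftover first column silently becomes the index,
# so 'id' ends up labelling the diagnosis values.
shifted = pd.read_csv(io.StringIO(raw), names=['id', 'diagnosis'], header=0)
print(shifted)

# A name for every column keeps 'id' as a regular first column.
fixed = pd.read_csv(io.StringIO(raw), names=['id', 'diagnosis', 'radius_mean'], header=0)
print(fixed)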

Related

How to use wide_to_long (Pandas)

I have this code, which I thought would split the dataframe into two: one with the columns that are unique and one with the columns that share a common base name (the duplicates).
# Function that splits a dataframe into two separate dataframes: one with all
# unique columns and one with all duplicates
def sub_dataframes(dataframe):
    # Extract common prefix -> remove trailing digits
    columns = dataframe.columns.str.replace(r'\d*$', '', regex=True).to_series().value_counts()
    # Split columns
    unq_cols = columns[columns == 1].index
    # All columns from dataframe that are not in unq_cols
    dup_cols = dataframe.columns[~dataframe.columns.isin(unq_cols)]
    return dataframe[unq_cols], dataframe[dup_cols]

unq_df = sub_dataframes(df)[0]
dup_df = sub_dataframes(df)[1]
print("Unique columns:\n\n{}\n\nDuplicate
columns:\n\n{}".format(unq_df.columns.tolist(), dup_df.columns.tolist()))
Output:
Unique columns:
['total_tracks', 'popularity']
Duplicate columns:
['t_dur0', 't_dur1', 't_dur2', 't_dance0', 't_dance1', 't_dance2', 't_energy0', 't_energy1', 't_energy2',
't_key0', 't_key1', 't_key2', 't_speech0', 't_speech1', 't_speech2', 't_acous0', 't_acous1', 't_acous2',
't_ins0', 't_ins1', 't_ins2', 't_live0', 't_live1', 't_live2', 't_val0', 't_val1', 't_val2', 't_tempo0',
't_tempo1', 't_tempo2']
Then I tried to use wide_to_long to combine columns with the same name:
cols = unq_df.columns.tolist()
temp = pd.wide_to_long(dataset.reset_index(),
                       stubnames=['t_dur', 't_dance', 't_energy', 't_key', 't_mode',
                                  't_speech', 't_acous', 't_ins', 't_live', 't_val',
                                  't_tempo'],
                       i=['index'] + cols, j='temp',
                       sep='t_').reset_index().groupby(cols, as_index=False).mean()
temp
Which gave me an empty dataframe. I tried to look at this question, but the dataframe that's returned has "Nothing to show". What am I doing wrong here? How do I fix this?
EDIT
Here is an example of how I've done it "by-hand", but I am trying to do it more efficiently using the already defined built-in functions.
The desired output is the dataframe that is shown last.
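As a hedged guess at the bug: pd.wide_to_long looks for columns named stub + sep + suffix, so with stubname 't_dur' and sep='t_' it searches for 't_durt_0', 't_durt_1', and so on, matches nothing, and returns an empty frame; the column names above use no separator, so the default sep='' should match. A minimal sketch on two of the stubs (sample values are made up):
import pandas as pd

df = pd.DataFrame({'total_tracks': [10, 12], 'popularity': [55, 60],
                   't_dur0': [200, 210], 't_dur1': [205, 215],
                   't_dance0': [0.5, 0.6], 't_dance1': [0.7, 0.8]})

# Default sep='' matches t_dur0, t_dur1, ...; j collects the trailing digit
long_df = pd.wide_to_long(df.reset_index(),
                          stubnames=['t_dur', 't_dance'],
                          i=['index', 'total_tracks', 'popularity'],
                          j='take')

# Average the duplicates per track, as in the original groupby
print(long_df.reset_index()
             .groupby(['total_tracks', 'popularity'], as_index=False)[['t_dur', 't_dance']]
             .mean())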

How to fetch preceding ids on the fly using pandas

I have a data frame like the one shown below
df = pd.DataFrame({'subject_id': [11, 11, 11, 12, 12, 12],
                   'test_date': ['02/03/2012 10:24:21', '05/01/2019 10:41:21',
                                 '12/13/2011 11:14:21', '10/11/1992 11:14:21',
                                 '02/23/2002 10:24:21', '07/19/2005 10:24:21'],
                   'original_enc': ['A742', 'B963', 'C354', 'D563', 'J323', 'G578']})
hash_file = pd.DataFrame({'source_enc': ['A742', 'B963', 'C354', 'D563', 'J323', 'G578'],
                          'hash_id': [1, 2, 3, 4, 5, 6]})
cols = ["subject_id", "test_date", "enc_id", "previous_enc_id"]
test_df = pd.DataFrame(columns=cols)
test_df.head()
I would like to do two things here
Map original_enc to their corresponding hash_id and store it in enc_id
Find the previous hash_id for each subject based on their current hash_id and store it in previous_enc_id
I tried the below
test_df['subject_id'] = df['subject_id']
test_df['test_date'] = df['test_date']
# map() takes a Series/dict, so look hash_id up by source_enc
test_df['enc_id'] = df['original_enc'].map(hash_file.set_index('source_enc')['hash_id'])
test_df = test_df.sort_values(['subject_id', 'test_date'], ascending=True)
test_df['previous_enc_id'] = test_df.groupby(['subject_id', 'test_date'])['enc_id'].shift(1)
However, I don't get the expected output for the previous_enc_id column: it is all NA.
I expect my output to be like the one shown below. You see NA in the first row of every subject because that's their first encounter; there is no earlier one to look back to.
Use only one column for groupby:
test_df['previous_enc_id'] = test_df.groupby('subject_id')['enc_id'].shift()
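Grouping by both subject_id and test_date makes every group a single row, so shift(1) has nothing to shift within the group and returns all NA; grouping by subject_id alone shifts within each subject's date-sorted history. A quick self-contained check (enc_ids simplified to 1-6 as in the mapping above):
import pandas as pd

df = pd.DataFrame({'subject_id': [11, 11, 11, 12, 12, 12],
                   'test_date': pd.to_datetime(['02/03/2012', '05/01/2019', '12/13/2011',
                                                '10/11/1992', '02/23/2002', '07/19/2005']),
                   'enc_id': [1, 2, 3, 4, 5, 6]})

df = df.sort_values(['subject_id', 'test_date'])
df['previous_enc_id'] = df.groupby('subject_id')['enc_id'].shift()
print(df)
# Subject 11 in date order holds enc_ids 3, 1, 2 -> previous ids NA, 3, 1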

pandas equal all column names of the dataframe

I have a dataframe with multiple leg column names like leg/1, leg/2, leg/3, up to leg/24, but the problem is that each leg has multiple strings attached to it, like leg/1/a1 and leg/1/a2.
For example I have a dataframe like this
leg/1/a1 leg/1/a2 leg/2/a1 leg/3/a2
I need all leg names in the dataframe to have the same set of columns as leg/1.
For example my required pandas dataframe column names should be
leg/1/a1 leg/1/a2 leg/2/a1 leg/2/a2 leg/3/a1 leg/3/a2
This should be the output of the dataframe.
For that purpose I first collected the leg/1 details in a list:
legs=['leg/1/a1','leg/1/a2']
I created this list to match against all the dataframe column names.
After that I collected all the dataframe column names that start with leg:
cols = [col for col in df.columns if 'leg' in col]
but the problem is that I am unable to do the matching; any help would be appreciated.
column_list = ['leg/1/a1', 'leg/1/a2', 'leg/2/a1', 'leg/3/a2']  # replace with df.columns
col_end_list = set([e.split('/')[-1] for e in column_list])  # get all a1, a2, ..., an

# Loop through leg/1/a1 to leg/24/an
for i in range(1, 25):
    for c in col_end_list:
        check_str = 'leg/' + str(i) + '/' + c
        if check_str not in column_list:  # if the column doesn't exist, add it
            df[check_str] = 0  # adding new column
Code to reproduce on a blank df:
import pandas as pd

df = pd.DataFrame([], columns=['leg/1/a1', 'leg/1/a2', 'leg/2/a1', 'leg/3/a2'])
column_list = df.columns
col_end_list = set([e.split('/')[-1] for e in column_list])  # get all a1, a2, ..., an

# Loop through leg/1/a1 to leg/24/an
for i in range(1, 25):
    for c in col_end_list:
        check_str = 'leg/' + str(i) + '/' + c
        if check_str not in column_list:  # if the column doesn't exist, add it
            df[check_str] = 0  # adding new column
>>> df.columns
>>> Index(['leg/1/a1', 'leg/1/a2', 'leg/2/a1', 'leg/3/a2', 'leg/2/a2', 'leg/3/a1',
'leg/4/a1', 'leg/4/a2', 'leg/5/a1', 'leg/5/a2', 'leg/6/a1', 'leg/6/a2',
'leg/7/a1', 'leg/7/a2', 'leg/8/a1', 'leg/8/a2', 'leg/9/a1', 'leg/9/a2',
'leg/10/a1', 'leg/10/a2', 'leg/11/a1', 'leg/11/a2', 'leg/12/a1',
'leg/12/a2', 'leg/13/a1', 'leg/13/a2', 'leg/14/a1', 'leg/14/a2',
'leg/15/a1', 'leg/15/a2', 'leg/16/a1', 'leg/16/a2', 'leg/17/a1',
'leg/17/a2', 'leg/18/a1', 'leg/18/a2', 'leg/19/a1', 'leg/19/a2',
'leg/20/a1', 'leg/20/a2', 'leg/21/a1', 'leg/21/a2', 'leg/22/a1',
'leg/22/a2', 'leg/23/a1', 'leg/23/a2', 'leg/24/a1', 'leg/24/a2'],
dtype='object')
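The new columns are appended at the end, as the Index above shows. If a deterministic order matters, a hedged follow-up is to sort the columns numerically by leg number (this assumes every column follows the 'leg/<n>/<suffix>' pattern):
df = df[sorted(df.columns, key=lambda c: (int(c.split('/')[1]), c.split('/')[2]))]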

Pandas and stocks: From daily values (in columns) to monthly values (in rows)

I am having trouble reformatting a dataframe.
My input has daily-value rows and one column per symbol (each symbol has different dates with its values):
Input
code to generate input
data = [("01-01-2010", 15, 10), ("02-01-2010", 16, 11), ("03-01-2010", 16.5, 10.5)]
labels = ["date", "AAPL", "AMZN"]
df_input = pd.DataFrame.from_records(data, columns=labels)
The needed output has a new row for each month and stock:
Needed output
code to generate output
data = [("01-01-2010","29-01-2010", "AAPL", 15, 20), ("01-01-2010","29-01-2010", "AMZN", 10, 15),("02-02-2010","30-02-2010", "AAPL", 20, 32)]
labels = ['bd start month', 'bd end month','stock', 'start_month_value', "end_month_value"]
df = pd.DataFrame.from_records(data, columns=labels)
Meaning (pseudocode):
1. For each row, take only the non-NaN values to create a new "row" (maybe a dictionary with the date as the index and the [stock, value] pairs as the value).
2. Take only rows that fall on the business start or business end of a month.
3. Write those rows to a new dataframe.
I have read several posts like this and this and several more.
All deal with dataframes of the same "type" and just resample, while I need to change the structure...
My code so far
import collections
import pandas as pd
from pandas.tseries.offsets import CustomBusinessMonthBegin
from pandas.tseries.holiday import USFederalHolidayCalendar

# creating the new index with business days
df1 = pd.DataFrame(range(10000),
                   index=pd.date_range(df.iloc[0].name, periods=10000, freq='D'))
bmth_us = CustomBusinessMonthBegin(calendar=USFederalHolidayCalendar())
df2 = df1.resample(bmth_us).mean()
# creating the new index by intersecting my old (daily) index with the monthly one
new_index = df.index.intersection(df2.index)
# selecting only the rows I want
df = df.loc[new_index]
# creating a dict that will be my new dataset
new_dict = collections.OrderedDict()
# iterating over the rows and adding to the dictionary
for index, row in df.iterrows():
    date = df.loc[index].name
    # values are the non-null values
    values = df.loc[index][~df.loc[index].isnull().values]
    new_dict[date] = values
# from dict to list
data = []
for key, values in new_dict.items():  # iteritems() is Python 2; items() on Python 3
    for i in range(0, len(values)):
        stock_name = str(values.index[i])
        stock_value = values.iloc[i]
        row = (key, stock_name, stock_value)
        data.append(row)
# from the list to a df
labels = ['date', 'stock', 'value']
df = pd.DataFrame.from_records(data, columns=labels)
df.to_excel("migdal_format.xls")
Current output I get
One big problem:
I only get the value of the stock on the first business day of the month. I need both the start and the end so I can calculate the stock's gain for that month.
One smaller problem:
I am sure this is not the cleanest and fastest code :)
Thanks a lot!
So I have found a way:
loop through each column
group by month
take the first and last value I have in that month
calculate the return
import re

df_migdal = pd.DataFrame()
for col in df_input.columns[0:]:
    stock_position = df_input.loc[:, col]
    name = stock_position.name
    name = re.sub('[^a-zA-Z]+', '', name)
    name = name[0:-4]
    # pd.TimeGrouper is deprecated in newer pandas; pd.Grouper(freq='M') replaces it
    stock_position = stock_position.groupby([pd.TimeGrouper('M')]).agg(['first', 'last'])
    stock_position["name"] = name
    stock_position["return"] = ((stock_position["last"] / stock_position["first"]) - 1) * 100
    stock_position.dropna(inplace=True)
    df_migdal = df_migdal.append(stock_position)
df_migdal = df_migdal.round(decimals=2)
I tried a way cooler way, but did not know how to handle the MultiIndex I got... I needed, for each column, to take the two sub-columns and create a third one from some lambda function.
df_input.groupby([pd.TimeGrouper('M')]).agg(['first', 'last'])
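One hedged way to deal with that MultiIndex is to stack the stock level into rows, which turns the (stock, 'first'/'last') column pairs into plain first/last columns per (month, stock) row. A rough sketch, assuming df_input from above gets a DatetimeIndex first and a pandas version where pd.Grouper has replaced pd.TimeGrouper:
import pandas as pd

# dates like "01-01-2010" are day-first
df_input = df_input.set_index(pd.to_datetime(df_input['date'], dayfirst=True)).drop(columns='date')

monthly = df_input.groupby(pd.Grouper(freq='M')).agg(['first', 'last'])
# columns are now a MultiIndex: (AAPL, first), (AAPL, last), (AMZN, first), ...
monthly = monthly.stack(level=0)  # move the stock level into the row index
monthly.index.names = ['month', 'stock']
monthly['return'] = (monthly['last'] / monthly['first'] - 1) * 100
print(monthly.reset_index())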

Adding column to pandas DataFrame with new keys

I'm new to python.
I have this code:
import pandas as pd
import quandl

quandlApiKey = "XXX"  # Can't write the real one because it is forbidden.
d = {}
closing_data = pd.DataFrame()
indexes = {
    'SCF/CME_SP1_FW': 'Settle',
    'CHRIS/LIFFE_Z1': 'Settle',
    'CHRIS/EUREX_FMEU1': 'Settle'
}
for index in indexes.keys():
    d[index] = quandl.get(index, start_date="2013-12-31", end_date="2014-12-31",
                          api_key=quandlApiKey)
for index in indexes.keys():
    closing_data[index] = d[index]['Settle']
In the first iteration (SCF/CME_SP1_FW) it saves the first column just fine, and the row keys are 31/12/2013, 02/01/2014, 03/01/2014, ...
The date 01/01/2014 is intentionally missing.
On the second iteration the for loop adds the second column, but although d[index]['Settle'] has the date 01/01/2014, no row was added for it in closing_data.
Is there a way to join the rows of all the columns, so that when a row is missing in one of the columns I get a NaN or something similar?
Thank you!
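What's happening is that assigning a Series to a DataFrame column aligns on the DataFrame's existing index: the first assignment fixes the index, and every later Series is trimmed to it, so new dates are silently dropped. A hedged sketch of one way around it, replacing the second loop with a single pd.concat (which outer-joins the indexes by default):
# Put all 'Settle' series side by side; dates missing from one
# series show up as NaN instead of being dropped.
closing_data = pd.concat({name: frame['Settle'] for name, frame in d.items()}, axis=1)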
