Excel to Pandas with a multi-level index producing NaN - python

I'm using this dataset:
https://www.ons.gov.uk/employmentandlabourmarket/peopleinwork/employmentandemployeetypes/datasets/commutingtoworkbygenderukcountryandregion
Loaded thus:
commuting_data_xls = pd.ExcelFile(commuting_data_filename)
commuting_data_sheets = commuting_data_front['Table description '].dropna()
commuting_data_1 = pd.read_excel(commuting_data_xls, '1', header=4, usecols=range(1,13))
commuting_data_1.dropna().dropna(axis=1)
The resulting hierarchical index is only correct for rows where all of the index columns are specified.
How can I correct this and name the index columns?

Try the following steps:
Open the file with pd.read_excel(), reading just the sheet and range you want:
commuting_data_xls = pd.read_excel("commutingdata.xlsx", '1', header=4, usecols=range(1, 13))
Set the multi-index level names:
commuting_data_xls.index.names = ['Gender', 'Work_Region', 'Region']
Reset the index and then restrict the rows to eliminate the totals; I assume you want them gone? If not, just remove the iloc step.
commuting_data_xls = commuting_data_xls.reset_index().iloc[0:28]
Remove the 'Work_Region' column as this seems superfluous.
commuting_data_xls = commuting_data_xls.loc[:,commuting_data_xls.columns != 'Work_Region']
Fill down the Gender column to replace NaN:
commuting_data_xls['Gender'] = commuting_data_xls['Gender'].ffill()
Set the index if it suits your purposes:
commuting_data_xls = commuting_data_xls.set_index(['Gender', 'Region'])
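Putting the steps together, here is a minimal end-to-end sketch; the filename, sheet name, and the 28-row cutoff are taken from the steps above and are assumptions about the workbook's layout, not verified against the file:

import pandas as pd

commuting_data_xls = pd.read_excel("commutingdata.xlsx", '1', header=4, usecols=range(1, 13))
commuting_data_xls.index.names = ['Gender', 'Work_Region', 'Region']  # name the index levels
commuting_data_xls = commuting_data_xls.reset_index().iloc[0:28]  # drop the total rows
commuting_data_xls = commuting_data_xls.loc[:, commuting_data_xls.columns != 'Work_Region']  # drop the superfluous column
commuting_data_xls['Gender'] = commuting_data_xls['Gender'].ffill()  # fill down the merged-cell gaps
commuting_data_xls = commuting_data_xls.set_index(['Gender', 'Region'])  # rebuild a two-level index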

Related

How to remove column with number as index name?

I have the following dataframe:
I tried to drop the -1 column using
df = df.drop(columns=['-1'])
However, it is giving me the following error:
I was able to drop a column whose name is a text character using this same approach, but not a number. What am I doing wrong?
You can check the real column names by converting them to a list:
print (df.columns.tolist())
I think you need to drop the number -1 instead of the string '-1':
df = df.drop(columns=[-1])
Or another solution with the same output:
df = df.drop(-1, axis=1)
EDIT:
If you need to select all columns except the first, use DataFrame.iloc to select by position; the first : means select all rows, and 1: selects all columns while omitting the first:
df = df.iloc[:, 1:]
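To see the difference, here is a tiny self-contained demo; the frame itself is made up for illustration:

import pandas as pd

df = pd.DataFrame({-1: [1, 2], 0: [3, 4]})  # first column label is the integer -1
print(df.columns.tolist())  # [-1, 0] -- numbers, not strings
df = df.drop(columns=[-1])  # works; df.drop(columns=['-1']) would raise a KeyError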
If you are just trying to remove the first column, another approach that would be independent of the column name is this:
df = df[df.columns[1:]]
You can also do it with the following steps. First check the column names:
df.columns
If the output looks like:
Index(['-1', '0'], dtype='object')
use drop to delete the column:
df.drop(['-1'], axis=1, inplace=True)
Hopefully this helps in the future as well.

pandas df masking specific row by list

I have a pandas df of 7000 rows * 7 columns, and a list (row_list) containing the values I want to filter on.
What I want to do is select the rows of df that contain a corresponding value from the list.
This is what I got when I tried it:
"Empty DataFrame
Columns: [A,B,C,D,E,F,G]
Index: []"
df = pd.read_csv('filename.csv')
df1 = pd.read_csv('filename1.csv', names=['A'])
row_list = []
for index, rows in df1.iterrows():
    my_list = [rows.A]
    row_list.append(my_list)
boolean_series = df.D.isin(row_list)
filtered_df = df[boolean_series]
print(filtered_df)
Replace
boolean_series = df.RightInsoleImage.isin(row_list)
with
boolean_series = df.RightInsoleImage.isin(df1.A)
and let us know the result. If it doesn't work, show a sample of df and df1.A.
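For illustration, a minimal sketch with made-up stand-ins for df and df1:

import pandas as pd

df = pd.DataFrame({'D': ['a', 'b', 'c', 'a']})  # hypothetical data
df1 = pd.DataFrame({'A': ['a', 'c']})

# Passing the column itself avoids building a list of single-element lists,
# which is what made the original comparison match nothing.
filtered_df = df[df.D.isin(df1.A)]
print(filtered_df)  # keeps the rows where D is 'a' or 'c'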
(1) generating separate dfs for each condition, then concat and dedup (slow)
(2) a custom function that annotates a bool column (default False, set to True when the condition is fulfilled), then filtering on that column
(3) keeping a list of the indices of all rows with your row_list values, then filtering with iloc based on that list (a sketch follows this answer)
Without an MRE, sample data, or a reason why your method didn't work, it's difficult to provide a more specific answer.
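As a rough sketch of option (3), assuming the values of interest live in a column named 'D':

import pandas as pd

df = pd.DataFrame({'D': ['a', 'b', 'c', 'a']})  # made-up data
row_list = ['a', 'c']

# Collect the positions of matching rows, then slice with iloc
positions = [i for i, v in enumerate(df['D']) if v in row_list]
filtered_df = df.iloc[positions]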

Pandas replace column values with a list

I have a dataframe df where some of the columns are strings and some are numeric. I am trying to convert all of them to numeric. So what I would like to do is something like this:
col = df.ix[:,i]
le = preprocessing.LabelEncoder()
le.fit(col)
newCol = le.transform(col)
df.ix[:,i] = newCol
but this does not work. Basically, my question is: how do I delete a column from a data frame and then create a new column with the same name as the one I deleted, when I don't know the column name, only the column index?
This should do it for you:
# Find the name of the column by index
n = df.columns[1]
# Drop that column
df.drop(n, axis = 1, inplace = True)
# Put whatever series you want in its place
df[n] = newCol
...where [1] can be whatever the index is; axis=1 should not change.
This answers your question very literally where you asked to drop a column and then add one back in. But the reality is that there is no need to drop the column if you just replace it with newCol.
newcol = [..,..,.....]
df['colname'] = newcol
This will keep the colname intact while replacing its contents with newcol.
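If the goal is still the label-encoding loop from the question, a sketch of the same idea using .iloc (since .ix has been removed from modern pandas) might look like this; sklearn's LabelEncoder is taken from the question, and encode_column is a hypothetical helper name:

import pandas as pd
from sklearn import preprocessing

def encode_column(df, i):
    # Label-encode the column at positional index i and write it back in place
    le = preprocessing.LabelEncoder()
    df.iloc[:, i] = le.fit_transform(df.iloc[:, i])
    return df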

Iterating over multiIndex dataframe

I have a data frame as shown below
I have a problem iterating over the rows: for every row fetched I want to return the key value. For example, in the second row, for the 2016-08-31 00:00:01 entry, df1 & df3 have a compass value of 4.0, so I want to return the keys that have the same compass value, which is df1 & df3 in this case.
I have been iterating over the rows using
for index, row in df.iterrows():
Update
Okay, so now that I understand your question better, this will work for you.
First change the shape of your dataframe with
dfs = df.stack().swaplevel(axis=0)
This will make your dataframe look like:
Then you can iterate the rows like before and extract the information you want. I'm just using print statements for everything, but you can put this in some more appropriate data structure.
for index, row in dfs.iterrows():
    dup_filter = row.duplicated(keep=False)
    dfss = row[dup_filter].index.values
    print("Attribute:", index[0])
    print("Index:", index[1])
    print("Matches:", dfss, "\n")
which will print out something like
.....
Attribute: compass
Index: 5
Matches: ['df1' 'df3']
Attribute: gyro
Index: 5
Matches: ['df1' 'df3']
Attribute: accel
Index: 6
Matches: ['df1' 'df3']
....
You could also do it one attribute at a time by
dfs_compass = df.stack().swaplevel(axis=0).loc['compass']
and iterate through the rows with just the index.
Old
If I understand your question correctly, i.e. you want to return the indexes of rows which have matching values on the second level of your columns ('compass', 'accel', 'gyro'), the following will work.
compass_match_indexes = []
for index, row in df.iterrows():
    match_filter = row[:, 'compass'].duplicated()
    if len(row[:, 'compass'][match_filter]) > 0:
        compass_match_indexes.append(index)
You can then select from your dataframe with that list, e.g. df.loc[compass_match_indexes].
--
Another approach: you could take the transpose of your DataFrame with df.T and then use the duplicated function.
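A self-contained sketch of the stack/swaplevel approach above, using a toy frame shaped like the question's (the df1..df3 names and random values are assumptions):

import numpy as np
import pandas as pd

# Columns: top level df1..df3, second level the attributes
cols = pd.MultiIndex.from_product([['df1', 'df2', 'df3'],
                                   ['compass', 'gyro', 'accel']])
df = pd.DataFrame(np.random.randint(0, 5, size=(4, 9)), columns=cols)

dfs = df.stack().swaplevel(axis=0)  # index becomes (attribute, original row)
for index, row in dfs.iterrows():
    matches = row[row.duplicated(keep=False)].index.values
    if len(matches):
        print("Attribute:", index[0], "Row:", index[1], "Matches:", matches)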

Add values to bottom of DataFrame automatically with Pandas

I'm initializing a DataFrame:
columns = ['Thing','Time']
df_new = pd.DataFrame(columns=columns)
and then writing values to it like this:
for t in df.Thing.unique():
    df_temp = df[df['Thing'] == t]  # filtering the df
    df_new.loc[counter, 'Thing'] = t  # writing the filter value to df_new
    df_new.loc[counter, 'Time'] = df_temp['delta'].sum(axis=0)  # summing and adding that value to df_new
    counter += 1  # increment the row index
Is there a better way to add new values to the dataframe each time without explicitly incrementing the row index with 'counter'?
If I'm interpreting this correctly, I think this can be done in one line:
newDf = df.groupby('Thing')['delta'].sum().reset_index()
By grouping by 'Thing', you have the various "t-filters" from your for-loop. We then apply a sum() to 'delta', but only within the various "t-filtered" groups. At this point, the dataframe has the various values of "t" as the indices, and the sums of the "t-filtered deltas" as a corresponding column. To get to your desired output, we then bump the "t's" into their own column via reset_index().
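For instance, with a made-up frame:

import pandas as pd

df = pd.DataFrame({'Thing': ['a', 'a', 'b'], 'delta': [1, 2, 3]})
newDf = df.groupby('Thing')['delta'].sum().reset_index()
print(newDf)
#   Thing  delta
# 0     a      3
# 1     b      3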
