How to remove column with number as index name? - python

I have the following dataframe:
I tried to drop the data of -1 column by using
df = df.drop(columns=['-1'])
However, it is giving me the following error:
I was able to drop the column if the column name is some language character using this similar coding script, but not a number. What am I doing wrong?

You can test real columns names by converting them to list:
print (df.columns.tolist())
I think you need droping number -1 instead string '-1':
df = df.drop(columns=[-1])
Or another solution with same ouput:
df = df.drop(-1, axis=1)
EDIT:
If need select all columns without first use DataFrame.iloc for select by position, first : means select all rows and second 1: all columns with omit first:
df = df.iloc[:, 1:]

If you are just trying to remove the first column, another approach that would be independent of the column name is this:
df = df[df.columns[1:]]

you can do it simply by using the following code:
first check the name of the column by using following:
df.columns
then if the output is like:
Index(['-1', '0'], dtype='object')
use drop command to delete the column
df.drop(['-1'], axis =1, inplace = True)
guess this should help for future as well

Related

Exclude values in DF column

I have a problem, I want to exclude from a column and drop from my DF all my rows finishing by "99".
I tried to create a list :
filteredvalues = [x for x in df['XX'] if x.endswith('99')]
I have in this list all the concerned rows but how to apply to my DF and drop those rows :
I tried a few things but nothing works :
Lately I tried this :
df = df[df['XX'] not in filteredvalues]
Any help on this?
Use the .str attribute, with corresponding string methods, to select such items. Then use ~ to negate the result, and filter your dataframe with that:
df = df[~df['XX'].str.endswith('99')]

If dataframe.tail(1) is X, do something

I am trying to check if the last cell in a pandas data-frame column contains a 1 or a 2 (these are the only options). If it is a 1, I would like to delete the whole row, if it is a 2 however I would like to keep it.
import pandas as pd
df1 = pd.DataFrame({'number':[1,2,1,2,1], 'name': ['bill','mary','john','sarah','tom']})
df2 = pd.DataFrame({'number':[1,2,1,2,1,2], 'name': ['bill','mary','john','sarah','tom','sam']})
In the above example I would want to delete the last row of df1 (so the final row is 'sarah'), however in df2 I would want to keep it exactly as it is.
So far, I have thought to try the following but I am getting an error
if df1['number'].tail(1) == 1:
df = df.drop(-1)
DataFrame.drop removes rows based on labels (the actual values of the indices). While it is possible to do with df1.drop(df1.index[-1]) this is problematic with a duplicated index. The last row can be selected with iloc, or a single value with .iat
if df1['number'].iat[-1] == 1:
df1 = df1.iloc[:-1, :]
You can check if the value of number in the last row is equal to one:
check = df1['number'].tail(1).values == 1
# Or check entire row with
# check = 1 in df1.tail(1).values
If that condition holds, you can select all rows, except the last one and assign back to df1:
if check:
df1 = df1.iloc[:-1, :]
if df1.tail(1).number == 1:
df1.drop(len(df1)-1, inplace = True)
You can use the same tail function
df.drop(df.tail(n).index,inplace=True) # drop last n rows

Excel to Pandas with a multi-level index producing NaN

I'm using this dataset:
https://www.ons.gov.uk/employmentandlabourmarket/peopleinwork/employmentandemployeetypes/datasets/commutingtoworkbygenderukcountryandregion
Loaded thus:
commuting_data_xls = pd.ExcelFile(commuting_data_filename)
commuting_data_sheets = commuting_data_front['Table description '].dropna()
commuting_data_1 = pd.read_excel(commuting_data_xls, '1', header=4, usecols=range(1,13))
commuting_data_1.dropna().dropna(axis=1)
The resulting hierarchical index only gets the rows right where all index columns are specified.
How can I correct this and name the index columns?
Try the following steps:
Open using pd.read_excel(), just the sheet and range you want.
commuting_data_xls = pd.read_excel("commutingdata.xlsx",'1', header=4, usecols=range(1,13))
Reset the multi index names.
commuting_data_xls.index.names = ['Gender', 'Work_Region', 'Region']
Reset the index and then restrict the rows to elimiate the totals, I assume you want them gone? If not just remove the iloc step.
commuting_data_xls = commuting_data_xls.reset_index().iloc[0:28]
Remove the 'Work_Region' column as this seems superfluous.
commuting_data_xls = commuting_data_xls.loc[:,commuting_data_xls.columns != 'Work_Region']
Fill down the Gender column to replace NaN
commuting_data_xls['Gender'].fillna(method='ffill', inpldace=True)
Reset the index if it suits your purposes.
commuting_data_xls.set_index('Gender', 'Region')

Read dataframe in pandas skipping first column to read time series data

Question is quite self explanatory.Is there any way to read the csv file to read the time series data skipping first column.?
I tried this code:
df = pd.read_csv("occupancyrates.csv", delimiter = ',')
df = df[:,1:]
print(df)
But this is throwing an error:
"TypeError: unhashable type: 'slice'"
If you know the name of the column just do:
df = pd.read_csv("occupancyrates.csv") # no need to use the delimiter = ','
df = df.drop(['your_column_to_drop'], axis=1)
print(df)
df = pd.read_csv("occupancyrates.csv")
df.pop('column_name')
dataframe is like a dictionary, where column names are keys & values are the column items. For Ex
d = dict(a=1,b=2)
d.pop('a')
Now if you print d, the output will be
{'b': 2}
This is what I have done above to remove a column out of data frame.
By doing this way you need not to assign it back to dataframe like other answer(s)
df = df.iloc[:, 1:]
Or you don't even need to specify inplace=True anywhere
The simplest way to delete the first column should be:
del df[df.columns[0]]
or
df.pop(df.columns[0])

Extracting specific columns from pandas.dataframe

I'm trying to use python to read my csv file extract specific columns to a pandas.dataframe and show that dataframe. However, I don't see the data frame, I receive Series([], dtype: object) as an output. Below is the code that I'm working with:
My document consists of:
product sub_product issue sub_issue consumer_complaint_narrative
company_public_response company state zipcode tags
consumer_consent_provided submitted_via date_sent_to_company
company_response_to_consumer timely_response consumer_disputed?
complaint_id
I want to extract :
sub_product issue sub_issue consumer_complaint_narrative
import pandas as pd
df=pd.read_csv("C:\\....\\consumer_complaints.csv")
df=df.stack(level=0)
df2 = df.filter(regex='[B-F]')
df[df2]
import pandas as pd
input_file = "C:\\....\\consumer_complaints.csv"
dataset = pd.read_csv(input_file)
df = pd.DataFrame(dataset)
cols = [1,2,3,4]
df = df[df.columns[cols]]
Here specify your column numbers which you want to select. In dataframe, column start from index = 0
cols = []
You can select column by name wise also. Just use following line
df = df[["Column Name","Column Name2"]]
A simple way to achieve this would be as follows:
df = pd.read_csv("C:\\....\\consumer_complaints.csv")
df2 = df.loc[:,'B':'F']
Hope that helps.
This worked for me, using slicing:
df=pd.read_csv
df1=df[n1:n2]
Where $n1<n2# are both columns in the range, e.g:
if you want columns 3-5, use
df1=df[3:5]
For the first column, use
df1=df[0]
Though not sure how to select a discontinuous range of columns.
We can also use i.loc. Given data in dataset2:
dataset2.iloc[:3,[1,2]]
Will spit out the top 3 rows of columns 2-3 (Remember numbering starts at 0)
Then dataset2.iloc[:3,[1,2]] spits out

Categories

Resources