Pandas: Dataframe.Drop - ValueError: labels ['id'] not contained in axis - python

I am attempting to drop a column from a DataFrame in Pandas. The DataFrame is created from a text file.
import pandas as pd
df = pd.read_csv('sample.txt')
df.drop(['id'], 1, inplace=True)
However, this generates the following error:
ValueError: labels ['id'] not contained in axis
Here is a copy of the sample.txt file :
a,b,c,d,e
1,2,3,4,5
2,3,4,5,6
3,4,5,6,7
4,5,6,7,8
Thanks in advance.

The issue is that your "sample.txt" file doesn't actually include the column you are trying to remove.
Your line
df.drop(['id'], 1, inplace=True)
is attempting to take your DataFrame (which holds the data from your sample file), find the column whose label is 'id' (the second argument, 1, is the axis, and axis 1 means columns) and drop it in place. With inplace=True the call modifies the existing object and returns None, rather than returning a new object without that column.
The issue is that your sample data doesn't include a column with a header equal to 'id'.
With your current sample file, you can only drop the columns labelled 'a', 'b', 'c', 'd', or 'e'. Either correct your code to drop one of those labels or get a sample file with the correct header.
The documentation for Pandas isn't fantastic, but here is a good example of how to do a column drop in Pandas: http://chrisalbon.com/python/pandas_dropping_column_and_rows.html
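A quick way to confirm this diagnosis is to look at the column labels before dropping anything. The following is only a minimal sketch, not part of the original answer: it loads the sample data from an in-memory string (an assumption, so it runs without sample.txt) and uses the columns= and errors='ignore' arguments of DataFrame.drop. The names has_id, unchanged and without_a are made up for illustration.

```python
import io
import pandas as pd

# Same content as sample.txt, held in memory so the sketch is self-contained.
csv_text = "a,b,c,d,e\n1,2,3,4,5\n2,3,4,5,6\n3,4,5,6,7\n4,5,6,7,8\n"
df = pd.read_csv(io.StringIO(csv_text))

# Inspect the labels that drop() can actually see.
print(list(df.columns))

# 'id' is not a column label, so a plain drop of 'id' would raise an error.
has_id = 'id' in df.columns

# errors='ignore' skips labels that do not exist instead of raising.
unchanged = df.drop(columns=['id'], errors='ignore')

# Dropping a label that does exist returns a frame without that column.
without_a = df.drop(columns=['a'])
print(list(without_a.columns))
```

Checking 'id' in df.columns first is usually enough to tell whether the header or the code is at fault.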
** Below added in response to Answer Comment from @saar
Here is my example code:
Sample.txt:
a,b,c,d,e
1,2,3,4,5
2,3,4,5,6
3,4,5,6,7
4,5,6,7,8
Sample Code:
import pandas as pd
df = pd.read_csv('sample.txt')
print('Current DataFrame:')
print(df)
df.drop(['a'], axis=1, inplace=True)  # axis=1 means columns; recent pandas versions require the keyword
print('\nModified DataFrame:')
print(df)
Output:
>>python panda_test.py
Current DataFrame:
   a  b  c  d  e
0  1  2  3  4  5
1  2  3  4  5  6
2  3  4  5  6  7
3  4  5  6  7  8
Modified DataFrame:
   b  c  d  e
0  2  3  4  5
1  3  4  5  6
2  4  5  6  7
3  5  6  7  8

bad= pd.read_csv('bad_modified.csv')
A=bad.sample(n=10)
B=bad.drop(A.index,axis=0)
This is an example of dropping part of a DataFrame: it samples 10 random rows and then drops those rows by their index, in case you need it.
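To try the same idea without the CSV file, here is a small sketch. The frame below is a made-up stand-in for bad_modified.csv (its columns x and y are assumptions), and random_state is added only so the draw is repeatable.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for bad_modified.csv: 50 rows of made-up numbers.
bad = pd.DataFrame({'x': np.arange(50), 'y': np.arange(50) * 2})

# Draw 10 random rows; random_state makes the draw repeatable.
A = bad.sample(n=10, random_state=0)

# Drop those rows by index label, leaving the remaining 40 rows.
B = bad.drop(A.index, axis=0)

print(len(A), len(B))
```

A and B together cover every row of bad exactly once, which is handy for a quick random split.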

Related

Selecting first n columns and last n columns with pandas

I am trying to select the first 2 columns and the last 2 columns from a data frame by index with pandas and save the result in the same dataframe.
Is there a way to do that in one step?
You can use the iloc function to get the columns, and then pass in the indexes.
df.iloc[:,[0,1,-2,-1]]  # -2 before -1 keeps the last two columns in their original order
You are looking for iloc:
df = pd.DataFrame([[1,2,3,4,5], [2,3,4,5,6], [3,4,5,6,7]], columns=['a','b','c','d','e'])
df.iloc[:,:2] # Grabs all rows and first 2 columns
df.iloc[:,-2:] # Grabs all rows and last 2 columns
pd.concat([df.iloc[:,:2],df.iloc[:,-2:]],axis=1) # Puts them together column-wise (side by side)
df = pd.DataFrame([[1,2,3,4,5], [2,3,4,5,6], [3,4,5,6,7]], columns=['a','b','c','d','e'])
df[['a','b','d','e']]
result
   a  b  d  e
0  1  2  4  5
1  2  3  5  6
2  3  4  6  7
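If the number of columns to keep from each end varies, the position list for iloc can be built instead of typed out. This is just a sketch under the assumption that n is at most half the number of columns; the names n and positions are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame([[1, 2, 3, 4, 5], [2, 3, 4, 5, 6], [3, 4, 5, 6, 7]],
                  columns=['a', 'b', 'c', 'd', 'e'])

n = 2  # columns to keep from each end (assumed to be at most half the width)

# Positions 0..n-1 followed by -n..-1, so the original column order is kept.
positions = list(range(n)) + list(range(-n, 0))

# Select and assign back in one step, as asked in the question.
df = df.iloc[:, positions]
print(df)
```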

How to add a MultiIndex after loading csv data into a pandas dataframe?

I am trying to add additional index rows to an existing pandas dataframe after loading csv data into it.
So let's say I load my data like this:
columns = ['Relative_Pressure','Volume_STP']
df = pd.read_csv(StringIO(contents), skiprows=4, delim_whitespace=True,index_col=False,header=None)
df.columns = columns
where contents is a string in csv format. The resulting DataFrame might look something like this:
For clarity reasons I would now like to add additional index rows to the DataFrame as shown here:
However in the link these multiple index rows are generated right when the DataFrame is created. I would like to add e.g. rows for unit or descr to the columns.
How could I do this?
You can create a MultiIndex on the columns by specifically creating the index and then assigning it to the columns separately from reading in the data.
I'll use the example from the link you provided. The first method is to create the MultiIndex when you make the dataframe:
df = pd.DataFrame({('A',1,'desc A'):[1,2,3],('B',2,'desc B'):[4,5,6]})
df.columns.names=['NAME','LENGTH','DESCRIPTION']
df
NAME             A      B
LENGTH           1      2
DESCRIPTION desc A desc B
0                1      4
1                2      5
2                3      6
As stated, this is not what you are after. Instead, you can make the dataframe (from your file for example) and then make the MultiIndex from a set of lists and then assign it to the columns:
df = pd.DataFrame({'desc A':[1,2,3], 'desc B':[4,5,6]})
# Output
   desc A  desc B
0       1       4
1       2       5
2       3       6
# Create a multiindex from lists
index = pd.MultiIndex.from_arrays((['A', 'B'], [1, 2], ['desc A', 'desc B']))
# Assign to the columns
df.columns = index
# Output
       A      B
       1      2
  desc A desc B
0      1      4
1      2      5
2      3      6
# Name the columns
df.columns.names = ['NAME','LENGTH','DESCRIPTION']
# Output
NAME             A      B
LENGTH           1      2
DESCRIPTION desc A desc B
0                1      4
1                2      5
2                3      6
There are other ways to construct a MultiIndex, for example, from_tuples and from_product. You can read more about Multi Indexes in the documentation.
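For completeness, here is a rough sketch of the two other constructors mentioned, reusing the same toy columns as above. The variable name grid is made up for illustration; from_tuples takes one tuple per column, while from_product builds every combination of the level values.

```python
import pandas as pd

df = pd.DataFrame({'desc A': [1, 2, 3], 'desc B': [4, 5, 6]})

# from_tuples: one tuple per column, one entry per level.
df.columns = pd.MultiIndex.from_tuples(
    [('A', 1, 'desc A'), ('B', 2, 'desc B')],
    names=['NAME', 'LENGTH', 'DESCRIPTION'],
)
print(df)

# from_product: every combination of the given level values (2 x 2 = 4 entries).
grid = pd.MultiIndex.from_product([['A', 'B'], [1, 2]], names=['NAME', 'LENGTH'])
print(list(grid))
```

from_product suits regular grids of labels; for column headers read from a file, from_tuples or from_arrays is usually the closer fit because each column gets exactly one label per level.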

Ascending order of Excel rows

I have the following rows in Excel:
How can I put them in ascending order in Python? (Notice how the row starting with 12 comes before the one starting with 118.)
I think the Pandas library would be a starting point? Any clue is appreciated.
Thanks.
First read the excel file
df = pd.read_excel("your/file/path/file.xls")
df
data
0 1212.i.jpg
1 121.i.jpg
2 212.i.jpg
3 512.i.jpg
Then make a substring of the data, assuming the column name is "data":
df["sub"] = df["data"].str[:-6]
Just in case, convert the new column to type int
df["sub"] = df["sub"].astype(int)
Now sort the values, by that new column
df.sort_values("sub", inplace=True)
Finally, if you only want your original data:
df = df["data"]
1 121.i.jpg
2 212.i.jpg
3 512.i.jpg
0 1212.i.jpg
Using natsorted
from natsort import natsorted
df.data=natsorted(df.data)
df
Out[129]:
data
0 121.i.jpg
1 212.i.jpg
2 512.i.jpg
3 1212.i.jpg
Keep original data index
df.loc[natsorted(df.index,key=lambda x : df.data[x] )]
Out[138]:
data
1 121.i.jpg
2 212.i.jpg
3 512.i.jpg
0 1212.i.jpg
Or using argsort with split
import numpy as np
df.iloc[np.argsort(df.data.str.split('.').str[0].astype(int))]
Out[141]:
data
1 121.i.jpg
2 212.i.jpg
3 512.i.jpg
0 1212.i.jpg
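Another option, sketched here on the assumption that your pandas version is 1.1 or newer, is the key argument of sort_values. It avoids both the helper column and the extra natsort dependency. The name sorted_df is made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({'data': ['1212.i.jpg', '121.i.jpg', '212.i.jpg', '512.i.jpg']})

# key receives the whole column as a Series and must return a Series of sort keys.
# Here the key is the integer before the first dot.
sorted_df = df.sort_values(
    'data',
    key=lambda s: s.str.split('.').str[0].astype(int),
)
print(sorted_df)
```

The original index labels are kept, so the result matches the "keep original data index" variants above.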

Pandas Dataframe Reshaping

I have a dataframe as shown below
>> df
A 1
B 2
A 5
B 6
A 7
B 8
How do I reformat it to make it
A 1 5 7
B 2 6 8
Thanks
Given a data frame like this
df = pd.DataFrame(dict(one=list('ABABAB'), two=range(6)))
you can do
df.groupby('one').two.apply(lambda s: s.reset_index(drop=True)).unstack()
# 0 1 2
# one
# A 0 2 4
# B 1 3 5
or (slightly slower, and giving a slightly different result)
df.groupby('one').apply(lambda d: d.two.reset_index(drop=True))
# two 0 1 2
# one
# A 0 2 4
# B 1 3 5
The first approach works with a SeriesGroupBy (the column two is selected before apply), the second uses a DataFrameGroupBy.
You can grab the series and use np.reshape to keep the correct dimensions.
order='F' fills the result column by column (as Fortran does), while order='C' fills it row by row (as C does).
The reshaped array is then put back into a dataframe:
df = pd.DataFrame(data=np.arange(10), columns=['a'])  # name the column so df['a'] exists
data = df['a'].values.reshape((2, 5), order='F')
df = pd.DataFrame(data=data, index=['a', 'b'])
How did you generate this data frame? I think it should have been built from a dictionary, with the dataframe then created from that dict:
d = {'A': [1,5,7], 'B':[2,6,8]}
df = pd.DataFrame(data=d, index=['p1','p2','p3'])
and then you can use df.T to transpose your dataframe if you need to.
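Another way to get the layout asked for in the question is to number the rows within each group and then pivot. This is only a sketch: the column names key, value and pos are assumptions, since the question does not name its columns.

```python
import pandas as pd

# Column names 'key' and 'value' are assumptions; the question does not name them.
df = pd.DataFrame({'key': list('ABABAB'), 'value': [1, 2, 5, 6, 7, 8]})

# Number each row within its group: 0, 1, 2 for A and 0, 1, 2 for B.
df['pos'] = df.groupby('key').cumcount()

# One row per key, one column per position within the group.
wide = df.pivot(index='key', columns='pos', values='value')
print(wide)
```

Unlike the reshape approach, this does not rely on the A and B rows strictly alternating.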

Pandas indexing and accessing columns by names

I am trying to access pandas dataframe by column names after indexing the df with a specific column and it returns incorrect column values.
import pandas as pd
rs =pd.read_csv('rs.txt', header="infer", sep="\t", names=['id', 'exp','fov','cycle', 'color', 'values'], index_col=2)
rs.cycle.head()
I am indexing the df here with 'fov' and I want to access the 'cycle' column, it gives me the color column instead. I think I am missing something here?
EDIT
The first few lines of the input file are:
6 3 1 G 0.96593
6 3 1 O 0.88007
6 3 1 R 0.94305
6 3 2 B 0.90554
6 3 2 G 0.93146
I think the problem arises because your data file has 5 columns and your names list has 6 elements. To verify, check the first few values in the id column: these will all be set to 6 if I am right. The first few items in the exp column will have the value 3.
To fix this, read your input file like so:
rs =pd.read_csv('rs.txt', header="infer", sep="\t", names=['exp','fov','cycle', 'color', 'values'], index_col=1)  # 'fov' is now at position 1
Pandas will automatically insert row identifiers.
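A mismatch like this can be caught before reading by counting the fields in the first line. The sketch below assumes the file is tab-separated as in the question and keeps the five sample lines in memory, so the names raw and n_fields are made up for illustration. It selects the index column by label so that it does not depend on the position.

```python
import io
import pandas as pd

# The five sample lines from the question, assumed to be tab-separated.
raw = (
    "6\t3\t1\tG\t0.96593\n"
    "6\t3\t1\tO\t0.88007\n"
    "6\t3\t1\tR\t0.94305\n"
    "6\t3\t2\tB\t0.90554\n"
    "6\t3\t2\tG\t0.93146\n"
)
names = ['exp', 'fov', 'cycle', 'color', 'values']

# Count the fields in the first line and compare with the number of names.
n_fields = len(raw.splitlines()[0].split('\t'))
print(n_fields, len(names))

# Index by label rather than by position to avoid off-by-one mistakes.
rs = pd.read_csv(io.StringIO(raw), sep='\t', header=None,
                 names=names, index_col='fov')
print(rs['cycle'].head())
```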
