I'm trying to write a quick function but am struggling since I'm new to Pandas/Python. I'm trying to remove NAs from two of my columns, but I keep getting the error below. My code is the following:
def remove_na():
    df.dropna(subset=['Column 1', 'Column 2'])
    df.reset_index(drop=True)

df = remove_na()
df.head(3)
AttributeError: 'NoneType' object has no attribute 'dropna'
I want to use this function on different tables, which is why I thought it would make sense to create a method. However, I just don't understand why it's not working here when similar methods of mine seem fine. Thank you.
I believe you can specify whether you want to remove NAs from rows or columns with the axis parameter, where 0 is the index (rows) and 1 is the columns. This would drop every column that contains an NA:
df.dropna(axis=1, inplace=True)
I think you can use apply with dropna:
df = df.apply(lambda x: pd.Series(x.dropna().values))
print(df)
Or you can also try this:
df = df.dropna(axis=0, how='any')
You're getting the error because the dropna call here yields a new DataFrame as its output instead of modifying df in place; since remove_na never returns it, df ends up as None.
You can either save it to a dataframe:
df = df.dropna(subset=['Column 1', 'Column 2'])
or pass the argument inplace=True:
df.dropna(subset=['Column 1', 'Column 2'], inplace=True)
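To tie this back to the function in the question, a minimal sketch might look like this (the toy df here is hypothetical; the key change is capturing and returning the result):
import pandas as pd

def remove_na(df):
    # dropna returns a new DataFrame; capture it and return it
    df = df.dropna(subset=['Column 1', 'Column 2'])
    return df.reset_index(drop=True)

# hypothetical example data
df = pd.DataFrame({'Column 1': [1, None, 3], 'Column 2': [4, 5, None]})
df = remove_na(df)
print(df.head(3))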
In order to remove all the missing values from the data set at once using pandas, you can use the following (remember to specify the axis in the arguments so that the missing values are removed along the dimension you intend):
# making a new data frame with the NA values dropped
new_data = data.dropna(axis=0, how='any')
This question is probably very basic, but I'm stuck on dropping a column without a column name. I imported an Excel file into pandas and the data looked something like the below:
A B
0 24 -10
1 12 -3
2 17 5
3 63 45
I tried to get rid of the first column (supposed to be the index column) that has no column name, and I wish to have the dataframe with just the A and B columns.
When I ran
df.columns
I get the below
Index(['Unnamed: 0', 'A', 'B'], dtype='object')
I tried several ways
df = pd.read_excel(r"path", index_col = False)
and
df.reset_index(drop=True, inplace=True)
and
df = df.drop([''], axis=1)
The line below displays an error:
self.DATA.drop([""], axis=1, inplace=True)
The error for the above line is
name 'self' is not defined
I tried other possible ways, but nothing seems to work. What is the mistake that I'm making? Any help would be highly appreciated.
You can try
pd.read_excel('tmp.xlsx', index_col=0)
# Or
pd.read_excel('tmp.xlsx', usecols=lambda x: 'Unnamed' not in x)
# Or
pd.read_excel('tmp.xlsx', usecols=['A', 'B'])
This should work for the nth column in your dataframe: df.drop(columns=df.columns[n], inplace=True). If it's the first column, then n = 0.
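For example, a quick sketch with a hypothetical frame whose first column is the unnamed index column:
import pandas as pd

# hypothetical frame mimicking the imported Excel data
df = pd.DataFrame({'Unnamed: 0': [0, 1, 2], 'A': [24, 12, 17], 'B': [-10, -3, 5]})
df.drop(columns=df.columns[0], inplace=True)  # n = 0 drops 'Unnamed: 0'
print(df.columns)  # Index(['A', 'B'], dtype='object')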
I worked it out as I deliberated on @enke's comment and realized that a simple drop call giving the column name (as below) can actually solve the issue of removing the undesired index column:
df = df.drop(['Unnamed: 0'], axis=1)
Try this:
empty_index = [" " for i in range(len(d["A"]))]  # d is the data dict the frame is built from
df = pd.DataFrame(d, index=empty_index)
Here the index is a list of blank strings.
I am trying to extract 3 columns so that I can create a graph later.
newDF = df.loc[filt,['Dublin','Cork','Galway']]
print(newDF)
but unfortunately I get an error:
Passing list-likes to .loc or [] with any missing labels is no longer supported. The following labels were missing: Index(['Dublin'], dtype='object')
Thank you for your help...
new_df = df[['Dublin', 'Cork', 'Galway']]
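That error means at least one of the labels (here 'Dublin') is not actually a column of df. A small sketch of how you might check which requested labels are missing before selecting (assuming the same df and column names):
wanted = ['Dublin', 'Cork', 'Galway']
missing = [c for c in wanted if c not in df.columns]
print(missing)  # e.g. ['Dublin'] if that column is absent
new_df = df[[c for c in wanted if c in df.columns]]  # select only the labels that exist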
I want to create a modified dataframe with the specified columns.
I tried the following, but it throws the error "Passing list-likes to .loc or [] with any missing labels is no longer supported":
# columns to keep
filtered_columns = ['text', 'agreeCount', 'disagreeCount', 'id', 'user.firstName', 'user.lastName', 'user.gender', 'user.id']
tips_filtered = tips_df.loc[:, filtered_columns]
# display tips
tips_filtered
Thank you
It looks like Pandas has deprecated this method of indexing. According to their docs:
This behavior is deprecated and will show a warning message pointing
to this section. The recommended alternative is to use .reindex()
Using the new recommended method, you can filter your columns using:
tips_filtered = tips_df.reindex(columns=filtered_columns)
NB: To reindex rows, you would use reindex(index=...) (more information here).
Some of the columns in the list are not included in the dataframe. If you do want to do that, let us try reindex:
tips_filtered = tips_df.reindex(columns=filtered_columns)
I encountered the same error with missing row index labels rather than columns.
For example, I would have a dataset of products with the following ids: ['a','b','c','d']. I store those products in a dataframe with indices ['a','b','c','d']:
df=pd.DataFrame(['product a','product b','product c', 'product d'],index=['a','b','c','d'])
Now let's assume I have an updated product index:
row_indices=['b','c','d','e'] in which 'e' corresponds to a new product: 'product e'. Note that 'e' was not present in my original index ['a','b','c','d'].
If I try to pass this updated index to my df dataframe: df.loc[row_indices,:],
I'll get this nasty error message:
KeyError: "Passing list-likes to .loc or [] with any missing labels is no longer supported. The following labels were missing: Index(['e'], dtype='object').
To avoid this error I need to take the intersection of my updated index with the original index:
df.loc[df.index.intersection(row_indices),:]
This is in line with the recommendation of the pandas docs.
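Note the difference between the two fixes: reindex keeps every requested label and fills the missing ones with NaN, while the intersection silently drops them. A quick sketch with the product example above:
import pandas as pd

df = pd.DataFrame(['product a', 'product b', 'product c', 'product d'],
                  index=['a', 'b', 'c', 'd'])
row_indices = ['b', 'c', 'd', 'e']

print(df.reindex(row_indices))                     # 'e' appears as a row of NaN
print(df.loc[df.index.intersection(row_indices)])  # 'e' is dropped entirely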
This error pops up if you index on something which is not present. reset_index() worked for me, as I was indexing on a subset of the actual dataframe that still carried the original indices; in that case the label may not be present in the dataframe.
I had the same issue while trying to create new columns along with existing ones:
df = pd.DataFrame([[1,2,3]], columns=["a","b","c"])
def foobar(a, b):
    return a, b
df[["c","d"]] = df.apply(lambda row: foobar(row["a"], row["b"]), axis=1)
The solution was to add result_type="expand" as an argument of apply():
df[["c","d"]] = df.apply(lambda row: foobar(row["a"], row["b"]), axis=1, result_type="expand")
Quick question:
I have the following situation (table):
[screenshot: imported data frame]
Now what I would like to achieve is the following (or something along those lines; it does not have to be exactly that):
[screenshot: goal]
I do not want the following columns, so I drop them:
data.drop(data.columns[[0,5,6]], axis=1,inplace=True)
What I assumed is that the following line of code could solve it, but apparently I am missing something:
pivoted = data.pivot(index=["Intentional homicides and other crimes","Unnamed: 2"],columns='Unnamed: 3', values='Unnamed: 4')
produces
ValueError: Length of passed values is 3395, index implies 2
The difference to the linked question is that I do not want any aggregation functions; I just want to leave the values as they are.
Data can be found at: Data
The problem with the method pandas.DataFrame.pivot is that it does not handle duplicate values in the index. One way to solve this is to use the function pandas.pivot_table instead.
df = pd.read_csv('Crimes_UN_data.csv', skiprows=[0], encoding='latin1')
cols = list(df.columns)
cols[1] = 'Region'
df.columns = cols
pivoted = pd.pivot_table(df, values='Value', index=['Region', 'Year'], columns='Series', aggfunc=sum)
It should not sum anything, despite the aggfunc argument, but without that argument it was throwing pandas.core.base.DataError: No numeric types to aggregate.
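If the Value column was parsed as strings, which is what that DataError hints at, one workaround (an assumption about this particular CSV, not verified against the data) is to coerce it to numeric before pivoting; aggfunc='first' would also keep the values as they are when there are no true duplicates:
# assumption: 'Value' may have been read as strings
df['Value'] = pd.to_numeric(df['Value'], errors='coerce')  # non-numeric entries become NaN
pivoted = pd.pivot_table(df, values='Value', index=['Region', 'Year'],
                         columns='Series', aggfunc='first')  # 'first' avoids summing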
When using Python Pandas to read a CSV it is possible to specify the index column. Is this possible using Python Dask when reading the file, as opposed to setting the index afterwards?
For example, using pandas:
df = pandas.read_csv(filename, index_col=0)
Ideally using dask could this be:
df = dask.dataframe.read_csv(filename, index_col=0)
I have tried
df = dask.dataframe.read_csv(filename).set_index(?)
but the index column does not have a name (and this seems slow).
No, these need to be two separate methods. If you try this then Dask will tell you in a nice error message.
In [1]: import dask.dataframe as dd
In [2]: df = dd.read_csv('*.csv', index='my-index')
ValueError: Keyword 'index' not supported dd.read_csv(...).set_index('my-index') instead
But this won't be any slower or faster than doing it the other way.
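In other words, the two-step pattern from the error message is the way to go; a minimal sketch (assuming the CSV files actually contain a column named 'my-index'):
import dask.dataframe as dd

df = dd.read_csv('*.csv')      # read first, with no index keyword
df = df.set_index('my-index')  # then set the index as a separate step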
I know I'm a bit late, but this is the first result on Google, so it should get answered.
If you write your dataframe with:
# index=True is the default,
my_pandas_df.to_csv('path')
# so this is the same:
my_pandas_df.to_csv('path', index=True)
And import with Dask:
import dask.dataframe as dd
my_dask_df = dd.read_csv('path').set_index('Unnamed: 0')
It will use column 0 as your index (which is unnamed thanks to pandas.DataFrame.to_csv()).
How to figure it out:
my_dask_df = dd.read_csv('path')
my_dask_df.columns
which returns
Index(['Unnamed: 0', 'col 0', 'col 1',
...
'col n'],
dtype='object', length=...)
Now you can write: df = pandas.read_csv(filename, index_col='column_name') (where column_name is the name of the column you want to set as the index).
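As a sketch of the full round trip, naming the index before writing lets you read it back by that name instead of 'Unnamed: 0' (the file path here is hypothetical):
import pandas as pd

df = pd.DataFrame({'col 0': [1, 2], 'col 1': [3, 4]})
df.index.name = 'row_id'   # name the index so to_csv writes a header for it
df.to_csv('data.csv')      # index=True is the default

df2 = pd.read_csv('data.csv', index_col='row_id')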