I have a dataframe which looks like this
I tried to delete the matchId column for preprocessing, but no matter what method I use to delete it, it outputs this error:
KeyError: "['matchId'] not found in axis"
What you attempted (which you should have shown in the question) is probably failing because you assume that matchId is a regular column. It is actually the index of the DataFrame, so it cannot be accessed or dropped the way ordinary columns can.
Because of that, as suggested by anky_91, you should do
df = df.reset_index(drop=True)
if you want to discard the index from your table entirely; it will be replaced with the default integer index. If you instead just want to turn the index into an ordinary column (which you can then drop by name), remove the drop=True from the statement above.
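For example, a minimal sketch, assuming matchId really is the index of your frame (the sample data here is invented):

import pandas as pd

df = pd.DataFrame({'kills': [3, 5]},
                  index=pd.Index([101, 102], name='matchId'))

# Turn the index back into a regular column, then drop it by name
df = df.reset_index().drop(columns='matchId')

# Or discard the index (and matchId with it) in one step:
# df = df.reset_index(drop=True)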
Note that a DataFrame always has an index, so you cannot get rid of it completely.
You can, however, output it with
df.values
and this will ignore the index and show just the values as a NumPy array.
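A quick sketch (again with a made-up frame):

import pandas as pd

df = pd.DataFrame({'a': [1, 2]},
                  index=pd.Index([10, 20], name='matchId'))
print(df.values)   # [[1]
                   #  [2]] -- the matchId index is gone

On recent pandas versions, df.to_numpy() is the recommended equivalent.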
Normally, I would be able to call dataframe.columns for a list of all columns, but I don't want to include the very first column in my list. Writing each column manually is an option, but one I'd like to avoid, given the few hundred column headers I'm working with. I do need to use this column, though, so deleting it from the dataframe entirely wouldn't work. How can I put every column into a list except for the first one?
This should work:
list(df.columns[1:])
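For example, to select everything except the first column afterwards (with a hypothetical frame):

import pandas as pd

df = pd.DataFrame({'id': [1, 2], 'a': [3, 4], 'b': [5, 6]})
cols = list(df.columns[1:])   # ['a', 'b'] -- all column names except the first
print(df[cols])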
Using pandas, I have to modify a DataFrame so that it only has the indexes that are also present in a vector, which was acquired by performing operations in one of the df's columns. Here's the specific line of code used for that (please do not mind me picking the name 'dataset' instead of 'dataframe' or 'df'):
dataset = dataset.iloc[list(set(dataset.index).intersection(set(vector.index)))]
It worked, and the image attached here shows the df and some of its indexes. However, when I try to access a specific value by index in the new 'dataset', as in the line shown below, I get this error: single positional indexer is out-of-bounds
print(dataset.iloc[:, 21612])
Note: I've also tried the following, to make sure it isn't simply an issue with me not knowing how to use iloc:
print(dataset.iloc[21612, :])
and
print(dataset.iloc[21612])
Do I have to create another column to "mimic" the actual indexes? What am I doing wrong? Please note that the indexes must not change at all, even though the size of the DataFrame changes. E.g. if the DataFrame originally had 21000 rows and the new one only 15000, I still need to be able to use the number 20999 as an index if that row passed the intersection check shown in the first code snippet. Thanks in advance
Try this:
print(dataset.loc[21612, :])
iloc[] is purely positional: after you have eliminated some of the original rows, its first (row) argument must lie between 0 and len(dataset) - 1, regardless of what the index labels are. loc[], by contrast, selects by the index labels themselves, which is what you need since you want to keep the original labels.
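A small sketch of the difference, with made-up data:

import pandas as pd

dataset = pd.DataFrame({'val': range(5)})   # default index 0..4
dataset = dataset.loc[[0, 2, 4]]            # keep some rows; the labels 0, 2, 4 survive

print(dataset.loc[4, 'val'])   # 4 -- label-based lookup, works
print(dataset.iloc[2])         # position-based: the third row (which has label 4)
# dataset.iloc[4] would raise IndexError: single positional indexer is out-of-bounds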
I am looking to delete a row in a dataframe that is imported into python by pandas.
As you can see in the sheet below, the first column has the same name multiple times. The condition is: if the first column's value reappears in a later row, delete that row; otherwise keep the row in the dataframe.
My final output should look like the following:
Presently I am doing this by converting each column into a list and deleting rows by index values. I am hoping there is an easier way than this workaround.
df.drop_duplicates(subset=[df.columns[0]])
should do the trick.
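A runnable sketch with an invented frame; note that drop_duplicates() returns a new DataFrame, so assign the result:

import pandas as pd

df = pd.DataFrame({'name': ['a', 'a', 'b'], 'val': [1, 2, 3]})
df = df.drop_duplicates(subset=[df.columns[0]])   # keeps the first 'a' row and the 'b' row
print(df)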
Try the following code:
df.drop_duplicates(subset='columnName', keep='first', inplace=True)
I generate a grouped dataframe df = df.groupby(['X','Y']).max() which I then want to write (to csv, without indexes). So I need to convert 'X' and 'Y' back to regular columns; I tried using reset_index(), but the order of columns was wrong.
How to restore columns 'X' and 'Y' to their exact original column position?
Is the solution:
df.reset_index(level=0, inplace=True)
and then find a way to change the order of the columns?
(I also found this approach, for multiindex)
This solution keeps 'X' and 'Y' as regular columns instead of turning them into an index after grouping, so we don't need reset_index() or column reordering at the end:
df.groupby(['X','Y'],as_index=False).max()
(After testing a lot of different methods, the simplest one was the best solution (as always) and the one that eluded me the longest. Thanks to @maxymoo for pointing it out.)
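Putting it together, a minimal sketch (the column names beyond 'X' and 'Y' are invented):

import pandas as pd

df = pd.DataFrame({'X': [1, 1, 2], 'Y': ['a', 'a', 'b'], 'val': [3, 5, 7]})
out = df.groupby(['X', 'Y'], as_index=False).max()   # 'X' and 'Y' stay as ordinary columns, in place
out.to_csv('out.csv', index=False)                   # write to csv without the index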
I have a df:
How can I remove duplicates while ignoring one column? I have rows where all of the columns are the same except one; I want to ignore that column and get the unique rows based on the other columns.
This is how I tried it, but I get an error:
data.drop_duplicates('asn','first_seen','incident_type','ip','uri')
Any idea?
What version of pandas are you running? I believe that since 0.14 you need to pass the list of columns to drop_duplicates() via the subset keyword, so try
data.drop_duplicates(subset=['asn','first_seen','incident_type','ip','uri'])
Also note that if you are not using inplace=True you will need to assign the returned value to a new dataframe.
Depending on your needs, you may also want to call reset_index() after dropping the duplicate rows.
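For example, a sketch with a toy frame (using a few of the column names from the question):

import pandas as pd

data = pd.DataFrame({'asn': [1, 1], 'ip': ['x', 'x'], 'uri': ['/a', '/b']})
data = data.drop_duplicates(subset=['asn', 'ip']).reset_index(drop=True)   # ignore 'uri' when deduplicating
print(data)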