How to use pandas dataframe set_index() - python

Let us create a pandas dataframe with two columns:
import pandas as pd
lendf = pd.read_csv('/git/opencv-related/experiments/audio_and_text_files_lens.csv',
                    names=['path', 'duration'])
Here is the default numerically incrementing index:
Let's change the index to allow searching by the path attribute:
lendf.set_index(['path'])
But the index did not change??
How about invoking reindex() ?
lendf.reindex()
Still no change!
Note that I had been referring to the Sphinx documentation at https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.set_index.html; here is an excerpt:
So then what am I misunderstanding about pandas indexing - and how should the search/indexing by path be set up?

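set_index returns a new DataFrame rather than altering the existing one, and reindex() does something else entirely (it conforms the data to a given set of labels). Either assign the result back, lendf = lendf.set_index('path'), or pass inplace=True: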
lendf.set_index(['path'], inplace=True)
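To make the difference concrete, here is a minimal sketch. The file names and durations are made-up stand-ins, since the original CSV is not available; only the column names path and duration come from the question.

```python
import pandas as pd

# Hypothetical stand-in data for the CSV in the question.
lendf = pd.DataFrame({
    "path": ["a.wav", "b.wav", "c.txt"],
    "duration": [1.5, 2.25, 0.0],
})

# The return value is discarded here, so lendf keeps its default index.
lendf.set_index("path")
print(lendf.index)

# Assigning the result keeps the new index, which allows lookups by path.
indexed = lendf.set_index("path")
print(indexed.loc["b.wav", "duration"])
```

Assigning the result back is usually easier to follow than inplace=True, because it keeps the original frame available while you check the new one.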

Related

Is there a way to add regular index numbers to a dataframe with dates as the index?

I am working with dataframes for a uni assignment, but do not have a lot of experience with them. One of the datasets we use automatically puts the date as the index, as you can see in the screenshot of the dataframe. I have to work with if statements and for loops, which work better with a regular index. I can't find anywhere how to turn the date index into a regular column and add normal index numbers. Can anyone help me with this?
Try this:
df_sleep_2.reset_index()
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.reset_index.html
You can either set the parameter inplace=True to directly modify the dataframe, or assign it to a new variable e.g.
# modify dataframe in place
df_sleep_2.reset_index(inplace=True)
# or assign result to new variable
df_sleep_2_new_index = df_sleep_2.reset_index()
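A small sketch of what reset_index does to a date index. The frame below is invented for illustration (the real df_sleep_2 is only visible in the screenshot), so the column names date and hours are assumptions.

```python
import pandas as pd

# Hypothetical sleep data indexed by date.
df_sleep = pd.DataFrame(
    {"hours": [7.5, 6.0, 8.25]},
    index=pd.to_datetime(["2021-01-01", "2021-01-02", "2021-01-03"]),
)
df_sleep.index.name = "date"

# reset_index moves the dates into an ordinary column and
# numbers the rows 0, 1, 2, ...
df_flat = df_sleep.reset_index()

# The integer index can now be used in a plain for loop.
for i in range(len(df_flat)):
    print(i, df_flat.loc[i, "date"].date(), df_flat.loc[i, "hours"])
```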
Try resetting the index with reset_index:
df_sleep_2.reset_index()

How can I create index for python pandas dataframe?

I am importing several CSV files into Python using Jupyter notebook and pandas, and some are created without a proper index column. Instead, the first column, which is data that I need to manipulate, is used as the index. How can I create a regular index column as the first column? This seems like a trivial matter, but I can't find any useful help anywhere.
What my dataframe looks like
What my dataframe should look like
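Could you please try this:
df.reset_index(inplace=True, drop=True)
Note that drop=True discards the current index instead of turning it into a column, so leave it out if the index holds data you need. Let me know if this works.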
When you read in the CSV, pass index_col to pandas.read_csv to choose which column (by position or name) becomes the index. If the files don't have a proper index column, set index_col=False.
To change the index of an existing DataFrame df, try df = df.reset_index() or df = df.set_index(<column>).
When you imported your csv, did you use the index_col argument? It should default to None, according to the documentation. If you don't use the argument, you should be fine.
Either way, you can force it not to use a column by using index_col=False. From the docs:
Note: index_col=False can be used to force pandas to not use the first column as the index, e.g. when you have a malformed file with delimiters at the end of each line.
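The malformed-file case from that note can be sketched as follows. The CSV text is an invented example with a trailing delimiter on each data row, read from a string instead of a file; recent pandas versions may emit a ParserWarning for the second call.

```python
import io
import pandas as pd

# Hypothetical file contents: three header names, but each data row
# ends with an extra delimiter and so has four fields.
raw = "a,b,c\n4,apple,bat,\n8,orange,cow,\n"

# By default pandas treats the surplus leading field as the index.
default = pd.read_csv(io.StringIO(raw))
print(default)

# index_col=False keeps the first column as data and uses 0, 1, ... as index.
fixed = pd.read_csv(io.StringIO(raw), index_col=False)
print(fixed)
```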
Python 3.8.5
pandas==1.2.4
pd.read_csv('file.csv', header=None)
I found the solution in the documentation: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html

Pandas won't allow selecting a column if there's a multi-index?

I'm debugging some pandas code that accidentally created a MultiIndex instead of a regular index. Due to the multi-index, Pandas won't allow selecting a column. In this case, I can just get rid of the MultiIndex but if I did need that MultiIndex, how can you select a column?
Additional info -- I'm getting this error with pandas 0.25.1 but this code was in a notebook somebody wrote years ago so apparently it used to work with older versions?
import numpy as np
import pandas as pd
names = ['FirstColumn', 'SecondColumn']
data = np.array([[5,6],[7,8]])
df = pd.DataFrame(data, columns = [names]) #Bug: this "works" but isn't what you want.
#The brackets around "[names]" creates a multi-index but that was unintentional.
#But "df.head()" and "df.describe()" both look normal so you can't see anything is wrong.
df['FirstColumn'] #ERROR! works fine with a single index, but fails with multiindex
df.FirstColumn #ERROR! works fine with a single index, but fails with multiindex
df.loc[:,'FirstColumn'] #ERROR! works fine with a single index, but fails with multiindex
All three statements give a misleading error saying that only integer scalar arrays can be converted to a scalar index.
So how can you select the column when there's a multiindex? I know some tricks like unstack or changing the index, etc; but seems like there ought to be a simple way?
UPDATE: It turns out this worked fine in pandas 0.22.0 but fails in 0.25.1. It looks like a regression bug was introduced. I've reported it on the pandas GitHub.
Use DataFrame.xs function:
print (df.xs('FirstColumn', axis=1, level=0))
FirstColumn
0 5
1 7
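For comparison, here is a sketch of column selection on an intentional two-level column MultiIndex. The level values and data are made up; this is not the single-level case from the question, only an illustration of the usual selection idioms.

```python
import numpy as np
import pandas as pd

# Hypothetical two-level column index built on purpose.
cols = pd.MultiIndex.from_tuples(
    [("first", "a"), ("first", "b"), ("second", "a")]
)
df = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6]]), columns=cols)

# A full tuple selects exactly one column and returns a Series.
one_column = df[("first", "a")]

# A top-level label selects every column under it.
first_group = df["first"]

# xs selects by a label on an inner level and drops that level.
inner_a = df.xs("a", axis=1, level=1)

print(one_column.tolist())
print(list(first_group.columns))
print(list(inner_a.columns))
```

If the MultiIndex was accidental, as in the question, passing columns=names instead of columns=[names] avoids creating it in the first place.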

Pandas - Creating a New Column

I have always made new columns in pandas using the following:
df['new_column'] = value
I am using this method, however, am receiving the warning for setting a copy.
What is the way to make a new column without creating a copy?
Try using
df.loc[:, 'new_column'] = value
As piRSquared comments, df is probably a copy of another DataFrame, and setting values on it triggers what pandas calls chained indexing. Refer to the pandas docs for further information.
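If df was indeed produced by filtering another DataFrame, one common way to avoid the warning is to take an explicit copy before adding the column. This is a sketch under that assumption; the frame and the names original and subset are hypothetical.

```python
import pandas as pd

# Hypothetical source frame.
original = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

# .copy() makes it explicit that subset is an independent DataFrame,
# so adding a column to it is unambiguous.
subset = original[original["a"] > 1].copy()
subset["new_column"] = 0

print(subset)
print(original.columns.tolist())
```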

Python Pandas: How I can unique my table only based on certain columns?

I have a df :
How can I remove duplicates based on only some of the columns? I have rows in which every column is identical except one. I want to ignore that column and get the unique rows based on the other columns.
That is how I tried but I get an error on it:
data.drop_duplicates('asn','first_seen','incident_type','ip','uri')
Any idea?
What version of pandas are you running? I believe that since 0.14 you should pass the list of columns to drop_duplicates() through the subset keyword, so try
data.drop_duplicates(subset=['asn','first_seen','incident_type','ip','uri'])
Also note that if you are not using inplace=True you will need to assign the returned value to a new dataframe.
Depending on your needs, you may also want to call reset_index() after dropping the duplicate rows.
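Putting those points together, a minimal sketch with invented data. Only the column names asn and ip come from the question; last_seen is a hypothetical column that differs between otherwise identical rows.

```python
import pandas as pd

# Hypothetical data: the first two rows differ only in last_seen.
data = pd.DataFrame({
    "asn": [1, 1, 2],
    "ip": ["10.0.0.1", "10.0.0.1", "10.0.0.2"],
    "last_seen": ["mon", "tue", "wed"],
})

# Compare rows on asn and ip only, keep the first of each group,
# then renumber the remaining rows from 0.
deduped = data.drop_duplicates(subset=["asn", "ip"]).reset_index(drop=True)
print(deduped)
```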
