I am importing several csv files into python using Jupyter notebook and pandas and some are created without a proper index column. Instead, the first column, which is data that I need to manipulate is used. How can I create a regular index column as first column? This seems like a trivial matter, but I can't find any useful help anywhere.
What my dataframe looks like
What my dataframe should look like
Could you please try this:
df.reset_index(inplace = True, drop = True)
Let me know if this works.
When you are reading in the csv, use pandas.read_csv(index_col= #, * args). If they don't have a proper index column, set index_col=False.
To change indices of an existing DataFrame df, try the methods df = df.reset_index() or df=df.set_index(#).
When you imported your csv, did you use the index_col argument? It should default to None, according to the documentation. If you don't use the argument, you should be fine.
Either way, you can force it not to use a column by using index_col=False. From the docs:
Note: index_col=False can be used to force pandas to not use the first column as the index, e.g. when you have a malformed file with delimiters at the end of each line.
Python 3.8.5
pandas==1.2.4
pd.read_csv('file.csv', header=None)
I found the solution in the documentation: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html
Related
New to Pandas, so maybe I'm missing a big idea?
I have a Pandas DataFrame of register transactions with shape like (500,4):
Time datetime64[ns]
Net Total float64
Tax float64
Total Due float64
I'm working through my code in a Python3 Jupyter notebook. I can't get past sorting any column. Working through the different code examples for sort, I'm not seeing the output reorder when I inspect the df. So, I've reduced the problem to trying to order just one column:
df.sort_values(by='Time')
# OR
df.sort_values(['Total Due'])
# OR
df.sort_values(['Time'], ascending=True)
No matter which column title, or which boolean argument I use, the displayed results never change order.
Thinking it could be a Jupyter thing, I've previewed the results using print(df), df.head(), and HTML(df.to_html()) (the last example is for Jupyter notebooks). I've also rerun the whole notebook from import CSV to this code. And, I'm also new to Python3 (from 2.7), so I get stuck with that sometimes, but I don't see how that's relevant in this case.
Another post has a similar problem, Python pandas dataframe sort_values does not work. In that instance, the ordering was on a column type string. But as you can see all of the columns here are unambiguously sortable.
Why does my Pandas DataFrame not display new order using sort_values?
df.sort_values(['Total Due']) returns a sorted DF, but it doesn't update DF in place.
So do it explicitly:
df = df.sort_values(['Total Due'])
or
df.sort_values(['Total Due'], inplace=True)
My problem, fyi, was that I wasn't returning the resulting dataframe, so PyCharm wasn't bothering to update said dataframe. Naming the dataframe after the return keyword fixed the issue.
Edit:
I had return at the end of my method instead of
return df,
which the debugger must of noticed, because df wasn't being updated in spite of my explicit, in-place sort.
I am working with dataframes for a uni assignment, but do not have a lot of experience with it. One of the datasets we use automatically puts the date as the index, as you can see in the screenshot of the dataframe. I have to work with if- and for-loops, which works better with a regular index. I can't find anywhere how I can transform the date index into a regular column, and add normal index numbers. Can anyone help me with this?
Try this:
df_sleep_2.reset_index()
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.reset_index.html
You can either set the parameter inplace=True to directly modify the dataframe, or assign it to a new variable e.g.
# modify dataframe in place
df_sleep_2.reset_index(inplace=True)
# or assign result to new variable
df_sleep_2_new_index = df_sleep_2.reset_index()
try to reset the index using reset_index
df_sleep_2.reset_index()
I have been tasked with extracting data from a specific column of a csv file using numpy and loadtxt. This data is on column D of the attatched image. By my logic i should use the numpy paramter usecols=3 to only obtain the 4th column which is the one I want. But my output keeps telling me that the index is out of range when there is clearly a column there. I have done some prior searching and the general consensus seems to be that one of the rows doesn't have any data in the column. But i have checked and all the rows have data in the column. Here is the code Im using highlighted in green.Can anyone possibly tell me why this is happening?
data = open("suttonboningtondata_moodle.csv","r")
min_temp = loadtxt(data,usecols=(3),skiprows=5,dtype=str,delimiter=" ")
print(min_temp)
I will suggest you use another library to extract your data. The pandas library works well in this regard.
Here is a documentation link to guide you.
pandas docs
I added a comma instead of whitespace for the delimiter value and it worked. I have no idea why though
Let us create a pandas dataframe with two columns:
lendf = pd.read_csv('/git/opencv-related/experiments/audio_and_text_files_lens.csv',
names=['path','duration'])
Here is the default numerically incrementing index:
Let's change the index to allow searching by the path attribute:
lendf.set_index(['path'])
But the index did not change??
How about invoking reindex() ?
lendf.reindex()
Still no change!
Note that I had been referencing the source code sphinx https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.set_index.html: here is an excerpt:
So then what am I misunderstanding about pandas indexing - and how should the search/indexing by path be set up?
You need to pass inplace=True otherwise set_index will return a new dataframe not alter the existing one
lendf.set_index(['path'], inplace=True)
I have a tab separated file which I extracted in pandas dataframe as below:
import pandas as pd
data1 = pd.DataFrame.from_csv(r"C:\Users\Ashish\Documents\indeed_ml_dataset\train.tsv", sep="\t")
data1
Here is how the data1 looks like:
Now, I want to view the column name tags. I don't know whether I should call it a column or not, but I have tried accessing it using the norm:
data2=data1[['tags']]
but it errors out. I have tried several other things as well using index and loc, but all of them fails. Any suggestions?
To fix this you'll need to remove description from the index by resetting. Try the below:
data2 = data1.reset_index()
data2['tags']
You'll then be able to select by "tags".
Try reading your data using pd.read_csv instead of pd.DataFrame.from_csv as it takes first column as index by default.
For more info refer to this documentation on pandas website: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.from_csv.html