We have a data system that creates tables of data as Excel files. I'm trying to import this Excel file into a pandas dataframe.
In the Excel file, row 1 is metadata I don't want, and row 2 holds the column headers. By default, pandas correctly uses column 1 (a lot number) as the index. The second column is a production date, but for whatever reason it has no header in row 2.
pandas seems to create a multi-index by default. Is there a way to suppress this behavior? It appears to happen because there is no column header in row 2, column 2 (cell B2). If I manually edit the Excel file to add a label there, it imports as I want.
import pandas as pd
xlsx01 = pd.ExcelFile("C:/Users/maherp/Desktop/JunkFiles/Book1.xlsx")
df_01 = pd.read_excel(xlsx01, header=1)
I get an error that I cannot decipher when I try:
df_01 = pd.read_excel(xlsx01, header=1, index_col=0)
As suggested by @Peej1226, here is the final solution that worked:
df_01 = pd.read_excel(xlsx01, sheet_name='Discrete', skiprows=1, header=0, index_col=0)
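The same combination of arguments can be tried without the original workbook. What follows is a minimal sketch, assuming a small made-up table (the names Lot, Yield and ProductionDate are hypothetical) and using pd.read_csv on an in-memory string as a stand-in for pd.read_excel, since both accept skiprows, header and index_col. It shows the metadata row being skipped, the first column becoming the index, and the unlabeled column being renamed after import.

```python
import io
import pandas as pd

# Made-up stand-in for the workbook: line 1 is metadata, line 2 is the
# header row, and the second header cell (B2 in Excel) is empty.
csv_text = (
    "Report generated by data system,,\n"
    "Lot,,Yield\n"
    "A100,2020-01-05,0.91\n"
    "A101,2020-01-06,0.88\n"
)

# Skip the metadata line, use the next line as header, first column as index.
df_demo = pd.read_csv(io.StringIO(csv_text), skiprows=1, header=0, index_col=0)

# pandas labels the empty header cell "Unnamed: 1"; give it a real name.
df_demo = df_demo.rename(columns={"Unnamed: 1": "ProductionDate"})
print(df_demo)
```

Renaming after the read avoids having to edit the source file by hand.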
import pandas as pd
import re
file_name = "example.xlsx" #name of the excel file
sheet = "sheet" #name of the sheet
df = pd.read_excel(file_name, sheet_name = sheet, usecols = "A:F")
select_rows = df.iloc[516-2:] #specify rows
My question is: if I want to refer to row 516 onwards (by Excel's row numbering), why do I have to subtract 2, as in the code above? I know pandas indexing starts from zero, which means subtracting 1, not 2.
@Samuel You are already at 'minus one' because of the zero-based index in pandas. What isn't clear until you read the documentation for pd.read_excel is that there is a parameter called 'header' that is set to 0 by default (i.e. the first row, row 1 in Excel, is used as the header for column names), so that row is not counted among the data rows. To demonstrate, modify the line where you create 'df' by adding header=None (code snippet below), then run your code and inspect the results.
df = pd.read_excel(file_name, sheet_name = sheet, usecols = "A:F", header=None)
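The offset can be checked on a tiny example. This is a sketch only, assuming a made-up single-column sheet and using pd.read_csv on an in-memory string in place of pd.read_excel (the header parameter behaves the same way in both). Each cell's text records the Excel row it would sit in.

```python
import io
import pandas as pd

# Made-up single-column sheet: Excel row 1 is the header, rows 2-10 are data.
lines = ["name"] + [f"value_in_excel_row_{n}" for n in range(2, 11)]
csv_text = "\n".join(lines)

with_header = pd.read_csv(io.StringIO(csv_text))             # header=0 (default)
no_header = pd.read_csv(io.StringIO(csv_text), header=None)  # row 1 kept as data

excel_row = 5
# Header row consumed: subtract 1 for the header and 1 for zero-based indexing.
from_header_frame = with_header.iloc[excel_row - 2, 0]
# No header row consumed: only the zero-based shift remains.
from_plain_frame = no_header.iloc[excel_row - 1, 0]
print(from_header_frame, from_plain_frame)
```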
I have read in some data from a csv, and there were a load of spare columns and rows that were not needed. I've managed to get rid of most of them, but the first column is showing as NaN and will not drop despite several attempts. This means I cannot promote the titles in row 0 to headers. I have tried the below:
df = pd.read_csv("List of schools.csv")
df = df.iloc[3:]
df.dropna(how='all', axis=1, inplace=True)
df.head()
But I am still getting this returned:
Any help please? I'm a newbie
You can improve your read_csv() operation.
Avloss can tell your "columns" are indices because they are bold. Looking at your output, there are two things of note.
The "columns" are bold, implying that pandas read them in as part of the index of the DataFrame rather than as values.
There is no information above the horizontal line at the top, indicating that there are currently no column names. The top row of the csv file that contains the column names is being read in as values.
To solve your column deletion problem, you should first improve your read_csv() operation by being more explicit. Your current code is placing column headers in the data and placing some of the data in the indices. Since you have the operation df = df.iloc[3:] in your code, I'm assuming the data in your csv file doesn't start until the 4th row. Try this:
header_row = 3 #or 4 - I am bad at zero-indexing
df = pd.read_csv('List of schools.csv', header=header_row, index_col=False)
df.dropna(how='all', axis=1, inplace=True)
This code should read the column names in as column names and not index any of the columns, giving you a cleaner DataFrame to work from when dropping NA values.
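To see what that explicit read does, here is a minimal sketch on made-up data. The file contents, the school names and the position of the header line are all assumptions for illustration, not the asker's actual file. It reads from an in-memory string so that it runs on its own.

```python
import io
import pandas as pd

# Made-up export: three junk lines, then the header, then data.
# The third column is entirely empty.
csv_text = (
    "List of schools,,,\n"
    "Exported from a hypothetical system,,,\n"
    ",,,\n"
    "Name,Town,,Pupils\n"
    "Oak School,Leeds,,210\n"
    "Elm School,York,,180\n"
)

# header=3: the line at zero-based position 3 holds the column names.
# index_col=False: do not use any column as the index.
schools = pd.read_csv(io.StringIO(csv_text), header=3, index_col=False)

# With real column names in place, the all-NaN column can be dropped.
schools = schools.dropna(how="all", axis=1)
print(schools)
```

If the header sits on a different line in the real file, only the header value needs to change.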
Those aren't columns; those are indices. You can convert them to columns with:
df = df.reset_index()
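As a small illustration of that conversion, the sketch below builds a made-up frame (the names are hypothetical) whose first two "columns" are really index levels, then calls reset_index() to turn them into ordinary columns.

```python
import pandas as pd

# Made-up frame whose first two "columns" are really index levels.
frame = pd.DataFrame(
    {"Pupils": [210, 180]},
    index=pd.MultiIndex.from_tuples(
        [("Oak School", "Leeds"), ("Elm School", "York")],
        names=["Name", "Town"],
    ),
)
print(list(frame.columns))  # only the value column is a real column

flat = frame.reset_index()
print(list(flat.columns))   # the index levels are now ordinary columns
```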
I am trying to read an Excel file using pandas but my columns and index are changed:
df = pd.read_excel('Assignment.xlsx',sheet_name='Assignment',index_col=0)
Excel file:
Jupyter notebook:
By default, pandas treats the first row as the header. You need to tell it to use two rows as the header:
df = pd.read_excel("xyz.xlsx", header=[0,1], usecols = "A:I", skiprows=[0])
print(df)
Whether you pass skiprows depends on your file. If you remove skiprows, the first header row is shown without any unnamed entries.
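A two-row header produces MultiIndex columns. The sketch below shows this on made-up data (the subject and test names are hypothetical), again using pd.read_csv on an in-memory string in place of pd.read_excel, since both accept a list for header. It also shows one possible way to flatten the two levels into single names afterwards.

```python
import io
import pandas as pd

# Made-up table with two header rows and row labels in the first column.
csv_text = (
    ",Math,Math,Science\n"
    ",Test1,Test2,Test1\n"
    "Ann,90,85,70\n"
    "Bob,60,75,80\n"
)

scores = pd.read_csv(io.StringIO(csv_text), header=[0, 1], index_col=0)
print(scores.columns.nlevels)  # number of header levels

# Select a column with a (level 0, level 1) tuple.
ann_math_test2 = scores.loc["Ann", ("Math", "Test2")]

# Optional: join the two levels into flat column names.
flat_names = ["_".join(pair) for pair in scores.columns]
print(flat_names)
```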
I'm trying to read an Excel or CSV file into a pandas dataframe. Only the first two columns should be read, and the top row of those two columns supplies the column names. The problem arises when the first cell of the top row is empty in the Excel file.
IDs
2/26/2010 2
3/31/2010 4
4/31/2010 2
5/31/2010 2
Then, the last line of the following code fails:
uploaded_file = request.FILES['file-name']
if uploaded_file.name.endswith('.csv'):
    df = pd.read_csv(uploaded_file, usecols=[0,1])
else:
    df = pd.read_excel(uploaded_file, usecols=[0,1])
ref_date = 'ref_date'
regime_tag = 'regime_tag'
df.columns = [ref_date, regime_tag]
Apparently, it only reads one column (i.e. the IDs). With read_csv, however, it reads both columns, with the first column being unnamed. I want it to behave that way and read both columns regardless of whether the top cells are empty or filled. How do I go about doing that?
What's happening is the first "column" in the Excel file is being read in as an index, while in the CSV file it's being treated as a column / series.
I recommend you work the other way and amend pd.read_csv to read the first column as an index. Then use reset_index to elevate the index to a series:
if uploaded_file.name.endswith('.csv'):
    df = pd.read_csv(uploaded_file, usecols=[0,1], index_col=0)
else:
    df = pd.read_excel(uploaded_file, usecols=[0,1], index_col=0)
df = df.reset_index() # this will elevate index to a column called 'index'
This will give consistent output, i.e. the first series will have the label 'index' and the index of the dataframe will be the regular pd.RangeIndex.
You could potentially use a dispatcher to get rid of the unwieldy if / else construct:
file_flag = {True: pd.read_csv, False: pd.read_excel}
read_func = file_flag[uploaded_file.name.endswith('.csv')]
df = read_func(uploaded_file, usecols=[0,1], index_col=0).reset_index()
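A variation on the dispatcher keys the lookup on the file extension instead of a boolean. This is only a sketch: the helper name read_two_columns, the READERS mapping and the file name upload.csv are hypothetical, and only the CSV branch is exercised here, because the Excel branch needs an Excel engine to be installed.

```python
import io
import os
import pandas as pd

# Hypothetical mapping from file extension to reader function.
READERS = {".csv": pd.read_csv, ".xlsx": pd.read_excel, ".xls": pd.read_excel}

def read_two_columns(file_obj, filename):
    """Read the first two columns and return them under fixed names."""
    suffix = os.path.splitext(filename)[1].lower()
    read_func = READERS[suffix]
    # Read the first column as the index, then move it back to a column,
    # so both readers produce the same two-column layout.
    df = read_func(file_obj, usecols=[0, 1], index_col=0).reset_index()
    df.columns = ["ref_date", "regime_tag"]
    return df

# Sample shaped like the question's data: the first header cell is empty.
csv_text = ",IDs\n2/26/2010,2\n3/31/2010,4\n"
result = read_two_columns(io.StringIO(csv_text), "upload.csv")
print(result)
```

An unknown extension raises a KeyError in this sketch; real upload handling would likely want a clearer error message.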
I am reading a csv file, cleaning it up a little, and then saving it back to a new csv file. The problem is that the new csv file has a new column (the first column, in fact), labelled index. This is not the row index, as I have turned that off in the to_csv() function, as you can see in the code. Besides, the row index doesn't have a column label either.
df = pd.read_csv('D1.csv', na_values=0, nrows = 139) # Read csv, with 0 values converted to NaN
df = df.dropna(axis=0, how='any') # Delete any rows containing NaN
df = df.reset_index()
df.to_csv('D1Clean.csv', index=False)
Any ideas where this phantom column is coming from and how to get rid of it?
I think you need to add the parameter drop=True to reset_index:
df = df.reset_index(drop=True)
drop : boolean, default False
Do not try to insert index into dataframe columns. This resets the index to the default integer index.
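The difference shows up in the header line of the written file. The following sketch uses a made-up two-column frame (an assumption, not the asker's D1.csv) and calls to_csv() without a path, which returns the CSV text as a string, so the two results can be compared directly.

```python
import pandas as pd

# Made-up data with one missing value.
df = pd.DataFrame({"a": [1.0, None, 3.0], "b": [4.0, 5.0, 6.0]})
clean = df.dropna(axis=0, how="any")

# reset_index() keeps the old index as a new column named "index".
with_old_index = clean.reset_index().to_csv(index=False)

# reset_index(drop=True) discards the old index instead.
without_old_index = clean.reset_index(drop=True).to_csv(index=False)

print(with_old_index.splitlines()[0])
print(without_old_index.splitlines()[0])
```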