I have a dataframe from a .xls spreadsheet, and when I print the columns with print(df.columns.values), the output contains a column with the name: Poll Responses\n\t\t\t\t\t.
I look in the Excel sheet and in the cell column header there are no additional spaces or tabs.
So in order to get the data from that column, I have to use print(df['Poll Responses\n\t\t\t\t\t']).
Is this how it is, or am I doing something wrong?
Use .str.strip:
df.columns = df.columns.str.strip()
This strips leading and trailing whitespace from every column heading in the DataFrame.
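For example, a minimal sketch reproducing the symptom with a made-up frame:

import pandas as pd

df = pd.DataFrame({'Poll Responses\n\t\t\t\t\t': [10, 20]})
print(list(df.columns))  # ['Poll Responses\n\t\t\t\t\t']

df.columns = df.columns.str.strip()
print(list(df.columns))  # ['Poll Responses']
print(df['Poll Responses'])  # accessible by the clean name now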
I have read in some data from a csv, and there were a load of spare columns and rows that were not needed. I've managed to get rid of most of them, but the first column shows as NaN and will not drop despite several attempts. This means I cannot promote the titles in row 0 to headers. I have tried the below:
df = pd.read_csv("List of schools.csv")
df = df.iloc[3:]
df.dropna(how='all', axis=1, inplace=True)
df.head()
But I am still getting this returned:
Any help please? I'm a newbie
You can improve your read_csv() operation.
Avloss (in the other answer) can tell your "columns" are indices because they are bold. Looking at your output, there are two things of note:
1. The "columns" are bold, implying that pandas read them in as part of the index of the DataFrame rather than as values.
2. There is no information above the horizontal line at the top, indicating there are currently no column names. The top row of the csv file that contains the column names is being read in as values.
To solve your column deletion problem, you should first make your read_csv() operation more explicit. Your current code is placing the column headers in the data and placing some of the data in the index. Since you have the operation df = df.iloc[3:] in your code, I'm assuming the data in your csv file doesn't start until the 4th row. Try this:
header_row = 3  # or 4 - I am bad at zero-indexing
df = pd.read_csv('List of schools.csv', header=header_row, index_col=False)
df.dropna(how='all', axis=1, inplace=True)
This code should read the column names in as column names and not index any of the columns, giving you a cleaner DataFrame to work from when dropping NA values.
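To see what header= and index_col=False change, here is a minimal sketch using an inline, made-up csv (io.StringIO standing in for the real file):

import io
import pandas as pd

raw = "junk,,\njunk,,\njunk,,\nName,Town,Pupils\nSchool A,Leeds,240\nSchool B,York,310\n"

# Default read: the junk rows become data and pandas invents column labels
messy = pd.read_csv(io.StringIO(raw))
print(messy.columns.tolist())  # ['junk', 'Unnamed: 1', 'Unnamed: 2']

# Explicit read: zero-indexed row 3 supplies the real column names
clean = pd.read_csv(io.StringIO(raw), header=3, index_col=False)
print(clean.columns.tolist())  # ['Name', 'Town', 'Pupils']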
Those aren't columns, those are indices. You can convert them to columns by doing
df = df.reset_index()
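For instance, with a tiny made-up frame:

import pandas as pd

df = pd.DataFrame({'Pupils': [240, 310]}, index=['School A', 'School B'])
df = df.reset_index()  # the old index becomes an ordinary 'index' column
print(df.columns.tolist())  # ['index', 'Pupils']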
I have a Pandas DataFrame with a bunch of rows and labeled columns.

I also have an excel file which I prepared with one sheet which contains no data but only labeled columns in row 1, and each column is formatted as it should be: for example, if I expect percentages in one column then that column will automatically convert a raw number to percentage.

What I want to do is fill the raw data from my DataFrame into that Excel sheet in such a way that row 1 remains intact so the column names remain. The data from the DataFrame should fill the excel rows starting from row 2, and the pre-formatted columns should take care of converting the raw numbers to their appropriate type, hence filling the data should not override the column format.
I tried using openpyxl but it ended up creating a new sheet and overriding everything.
Any help?
If you're certain the column order is the same, you can try this after opening the sheet with openpyxl:
df.to_excel(writer, startrow=1, index=False, header=False)  # startrow is zero-indexed, so 1 targets Excel row 2
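A fuller sketch of the writer setup, assuming pandas >= 1.4 (for if_sheet_exists='overlay') and a hypothetical template.xlsx with a sheet named Sheet1:

import pandas as pd

df = pd.DataFrame({'Name': ['A', 'B'], 'Share': [0.25, 0.5]})

# mode='a' appends into the existing workbook instead of creating a new one;
# 'overlay' writes into the existing sheet without clearing its formatting
with pd.ExcelWriter('template.xlsx', engine='openpyxl', mode='a',
                    if_sheet_exists='overlay') as writer:
    df.to_excel(writer, sheet_name='Sheet1', startrow=1,
                index=False, header=False)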
If your number of columns and their order is the same, then you may try xlsxwriter, and also mention the name of the sheet you want to refresh:
df.to_excel('filename.xlsx', engine='xlsxwriter', sheet_name='sheetname', index=False)
Note that the xlsxwriter engine always writes a brand-new file, so any formatting already present in filename.xlsx will not be preserved.
I am working on concatenating many csv files together and want to take one column from a multicolumn csv and append it as a new column in a second csv. The problem is that the columns have different numbers of rows, so the new column that I am adding to the existing csv gets cut short once the row index from the existing csv is reached.
I have tried to read in the new column as a second dataframe and then add that dataframe as a new column to the existing csv.
df = pd.read_csv("Existing CSV.csv")
df2 = pd.read_csv("New CSV.csv", usecols=['Desired Column'])
df["New CSV"] = df2
"Existing CSV" has 1200 rows of data while "New CSV" has 1500 rows. When I run the code, the 'New CSV" column is added to "Existing CSV", however, only the first 1200 rows of data are included.
Ideally, all 1500 rows from "New CSV" will be included and the 300 rows missing from "Existing CSV" will be left blank.
By default, read_csv gives the resulting DataFrame an integer index, so I can think of a couple of options to try.
Setup
df = pd.read_csv("Existing CSV.csv")
df2 = pd.read_csv("New CSV.csv", usecols=['Desired Column'])
Method 1: join
df = df.join(df2['Desired Column'], how='right')
Method 2: reindex and assign
df = df.reindex(df2.index).assign(**{'Desired Column': df2['Desired Column']})
(Here reindex on the index alone is what we want; reindex_like(df2) would also try to match df2's columns and drop the existing ones.)
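A quick check with toy frames of unequal length (made-up data standing in for the two csvs):

import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3]})                       # "Existing CSV", 3 rows
df2 = pd.DataFrame({'Desired Column': [10, 20, 30, 40]})  # "New CSV", 4 rows

out = df.join(df2['Desired Column'], how='right')
print(len(out))               # 4 - every row from df2 is kept
print(out['A'].isna().sum())  # 1 - the extra row is blank in df's columns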
How can I read an excel file in pandas starting from a given row and column? I am looking to drop some rows and columns; say my excel file contains some random data in the starting rows and columns, so I would either like to begin reading at a given row and column, or drop a few rows and columns. How can I achieve this?
Typically I would like my data to start from B21, i.e. drop everything up to row 20 and column A.
Please help.
You can read your file as normal with the pd.read_excel command; to skip the first 20 rows you use the skiprows option, and then drop the columns that you do not want. In this case that column will be columnAname.
df = pd.read_excel('filename', skiprows=20).drop(columns=['columnAname'])
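Alternatively, read_excel can exclude column A at read time: its usecols parameter accepts Excel-style letter ranges. The 'B:F' range below is a made-up example; widen it to cover your real sheet:

import pandas as pd

# The header now lands on Excel row 21, and only columns B through F are read
df = pd.read_excel('filename.xlsx', skiprows=20, usecols='B:F')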
I was importing 400+ csv files with 50+ columns into a data frame, each file having different columns, but some column names contain a comma ','. I want to remove it; please help with this.
data = pd.read_csv('D://ABC//WID_AM_MacroData.csv', delimiter=';').loc[[6]]
You can use:
data.columns = [col.replace(',', '') for col in data.columns]
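Equivalently, pandas' vectorized string methods do the same thing directly on the Index (same result, without the explicit list comprehension):

data.columns = data.columns.str.replace(',', '')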