Pandas cannot load the proper column of the CSV File - python

I have been having trouble importing specific columns of a CSV file. I need to import the Longitude and Latitude columns of the dataset (Fig: 1).
But in Spyder, the variable explorer shows the wrong values for the variable (Fig: 2), and the values I expected seem to appear inside the Index column. How do I fix this, or how should I import the file?
However, when I click the resize button at the bottom of the variable explorer window, the index column expands and shows something like Fig: 3.
The code I am using:
import pandas as pd
import numpy as np
dataset = pd.read_csv('dataset.csv',error_bad_lines=False)
X=dataset.loc[:,['latitude','longitude']]

I suggest making an array of column names, and trying to read the csv like so:
colnames = ["latitude", "longitude",...]
dataset = pd.read_csv('dataset.csv', names=colnames, index_col=0)
# index_col=0 uses the first column of the file as the index
# and if you must use error_bad_lines (replaced by on_bad_lines='skip' in pandas 1.3+, removed in 2.0)...
dataset = pd.read_csv('dataset.csv', names=colnames, index_col=0, error_bad_lines=False)

When you set error_bad_lines=False you are telling pandas to skip malformed lines instead of raising an error. The error you got before was telling you exactly what is going wrong:
"Error tokenizing data. C error: Expected 62 fields in line 8, saw 65"
It means some lines have more fields than there are headers, which causes the misalignment once you tell pandas to ignore the problem. You should clean your data by removing the extra columns, or import only the specific columns you need using the headers, as the other answer suggests.
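To locate the lines behind an "Expected N fields, saw M" error, one option is to count the fields per line before handing the file to pandas. The following is only a sketch using the standard csv module; the helper name find_ragged_rows and the sample data are made up for illustration and are not from the original post.

```python
import csv
import io


def find_ragged_rows(handle, delimiter=","):
    """Return (line_number, field_count) for rows whose width differs from the header."""
    reader = csv.reader(handle, delimiter=delimiter)
    header = next(reader)
    expected = len(header)
    return [
        (reader.line_num, len(row))
        for row in reader
        if len(row) != expected
    ]


# Hypothetical sample: the third line carries one field too many.
sample = io.StringIO(
    "id,latitude,longitude\n"
    "1,23.7,90.4\n"
    "2,23.8,90.5,extra\n"
    "3,23.9,90.6\n"
)
bad_rows = find_ragged_rows(sample)
print(bad_rows)
```

With a real file, pass open('dataset.csv', newline='') instead of the StringIO object and inspect the reported lines before deciding how to clean them.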

Related

Renaming Column Not Working even Though Syntax is Correct in Python

I have been trying to rename columns in a CSV file that I am working on in Google Colab. The same line of code works for one column name but not for the other.
import pandas as pd
import numpy as np
data = pd.read_csv("Daily Bike Sharing.csv",
                   index_col="dteday",
                   parse_dates=True)
dataset = data.loc[:, ["cnt", "holiday", "workingday", "weathersit",
                       "temp", "atemp", "hum", "windspeed"]]
dataset = dataset.rename(columns={'cnt' : 'y'})
dataset = dataset.rename(columns={"dteday" : 'ds'})
dataset.head(1)
The Image below is the dataframe called data
The Image below is dataset
This image is the final output which I get when I try to rename the dataframe.
The column name "dteday" is not getting renamed, but "cnt" is replaced by "y" with the same code. Can someone help me out? I have been racking my brain on this for some time now.
That's because you're setting dteday as your index when reading in the csv, whereas cnt is simply a column, and rename(columns=...) only affects column labels. Avoid the index_col parameter in read_csv and instead perform dataset = dataset.set_index('ds') after renaming.
An alternative in which only your penultimate line (trying to rename the index) would need to be changed:
dataset.index.names = ['ds']
You can remove index_col from the read statement, include 'dteday' in your dataset, and then change the column name. You can make the column the index later using df.set_index.
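The difference between renaming a column and renaming the index can be sketched with a small in-memory frame. The data below is a hypothetical stand-in for the bike-sharing file, and rename_axis is shown as one more way to rename the index; it is not taken from the answers above.

```python
import pandas as pd

# Hypothetical stand-in for the bike-sharing data, with dteday as the index.
data = pd.DataFrame(
    {"cnt": [985, 801], "temp": [0.34, 0.36]},
    index=pd.to_datetime(["2011-01-01", "2011-01-02"]),
)
data.index.name = "dteday"

# rename(columns=...) only looks at column labels, so the index name is
# left untouched and the unknown key "dteday" is silently ignored.
renamed = data.rename(columns={"cnt": "y", "dteday": "ds"})

# rename_axis changes the name of the index itself.
fixed = renamed.rename_axis("ds")

print(renamed.index.name, fixed.index.name)
```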

Getting wrong readings when trying to plot CSV file using pandas

My csv file looks like the following:
As you can see, there are 7 comma-separated columns. I have spent hours trying to read and plot the first column, which starts with 31364, using the following code:
import matplotlib.pyplot as plt
import pandas as pd
df = pd.read_csv('test.csv', sep=',', header=None, names=['colA','colB','colC','colD','colE','colF','colG'])
y = df['colA']
plt.plot(y)
But the code outputs this plot which does not match the data at all:
I'm using Spyder with Anaconda. What could be the problem?
Is column A all values in the 31,000 range? You're not plotting the whole file.
edit: I don't know what result you're looking for. In your code, the first column in your csv is used as the index of the dataframe (after you read the csv, enter 'df', no quotes, at the Python prompt to see what your dataset looks like).
If you don't want the first column in the csv as an index, add 'index_col=False', no quotes, to the parameters when you read the csv in.
Also, it is not a good idea to end lines in a csv with the delimiter, a comma in this case.
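The effect of a trailing delimiter can be reproduced with a few lines of made-up data. This is a minimal sketch, assuming a file with a three-name header and a trailing comma on each data line; the original file is not available, so the values are invented.

```python
import io

import pandas as pd

# Hypothetical data: three header names, but every data line ends with a comma,
# so each line has one more field than the header.
raw = "a,b,c\n4,apple,bat,\n8,orange,cow,\n"

# Default behaviour: the surplus leading field is taken as the index,
# which shifts every value one column to the left.
shifted = pd.read_csv(io.StringIO(raw))

# index_col=False tells pandas not to use the first column as the index.
aligned = pd.read_csv(io.StringIO(raw), index_col=False)

print(shifted)
print(aligned)
```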

Why does PANDAS only see one column to csv dataset with numerous columns?

I am new to pandas and I am trying to work out why the shape of this csv dataset (https://www.kaggle.com/vfoufikos/airbnb-analysis-lisbon) is shown as (237, 1), when the dataset appears to have 20 columns.
import time
import pandas as pd
import numpy as np
df = pd.read_csv('airbnb_lisbon.csv', error_bad_lines=False)
print(df.shape)
Could anyone please explain why?
You could use the usecols option to select the columns you'd like to use. For example, if you wanted to store only certain dataset columns in 'df' you could use:
df = pd.read_csv(...., usecols=['col1', 'col2',..., 'coln'])
If you'd like to select all the data without specifying which columns, I'd look into specifying your delimiter, as that might be the problem you've run into.
You can specify the delimiter by passing sep=',' or sep=';' to your pd.read_csv() call. Let me know if either of these works!
I had the very same problem reported by LeoGER. I've tried the three solutions that you have suggested.
df = pd.read_csv(...., usecols=['col1', 'col2',..., 'coln']) DIDN'T WORK. Jupyter reported an error
sep ; DIDN'T WORK, as the dataset kept the same one-column layout
sep , IT WORKED, and finally I can see the whole set of columns! :D
It seems that my mistake was that I used delimiter ; instead of sep ,.
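How a mismatched delimiter collapses everything into one column can be seen with a small in-memory sample. This is only a sketch with invented semicolon-separated data, not the Airbnb file itself.

```python
import io

import pandas as pd

# Hypothetical semicolon-separated data.
raw = "id;name;price\n1;flat;50\n2;room;30\n"

# With the default sep=',' no comma is found, so each line becomes one field.
wrong = pd.read_csv(io.StringIO(raw))

# Passing the delimiter the file actually uses splits the columns properly.
right = pd.read_csv(io.StringIO(raw), sep=";")

print(wrong.shape, right.shape)
```

Checking df.shape and df.columns right after loading is a quick way to notice this kind of problem.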

Prevent Pandas read_Excel / read_CSV from assigning (i.e. inferring) an index automatically

Total newbie and this is my first ever question so apologies in advance for any inadvertent faux pas.
I have a large(ish) dataset in Excel xlsx format that I would like to import into a pandas dataframe. The data has column headers except for the first column which does not have a header label. Here is what the excel sheet looks like:
Raw data
I am using read_excel() in Pandas to read in the data. The code I am using is:
df = pd.read_excel('Raw_Data.xlsx', sheetname=0, labels=None, header=0, index_col=None)
(I have tried index_col = false or 0 but, for obvious reasons, it doesn't change anything)
The headers for the columns are picked up fine but the first column, circled in red in the image below, is assigned as the index.
wrong index
What I am trying to get from the read_excel command is as follows with the index circled in red:
correct index
I have other excel sheets that I have used read_excel() to import into pandas and pandas automatically adds in a numerical incremental index rather than inferring one of the columns as an index.
None of those Excel sheets had a missing label in the column header, though, which might be the issue here, although I am not sure.
I understand that I can use the reset_index() command after the import to get the correct index.
I am wondering if it can be done within the read_excel() command, without having to call reset_index(), i.e. is there any way to prevent an index from being inferred, or to force pandas to add the index column like it normally does?
Thank you in advance!
I don't think you can do it with only the read_excel function because of the missing value in cell A1. If you want to insert something into that cell prior to reading the file with pandas, you could consider using openpyxl as below.
from openpyxl import load_workbook as load

path = 'Raw_Data.xlsx'
col_name = 'not_index'
cell = 'A1'

def write_to_cell(path, col_name, cell):
    # Fill the empty header cell on every sheet, then save the workbook in place.
    wb = load(path)
    for sheet in wb.sheetnames:
        ws = wb[sheet]
        if ws[cell].value is None:
            ws[cell] = col_name
    wb.save(path)

write_to_cell(path, col_name, cell)
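For comparison, the reset_index() route mentioned in the question can be sketched on an in-memory frame. The frame and the column name 'label' are hypothetical and stand in for the Excel data, which is not available here.

```python
import pandas as pd

# Hypothetical frame whose first, unlabeled column ended up as the index.
df = pd.DataFrame({"value": [10, 20]}, index=["row_a", "row_b"])

# reset_index() moves the index into a regular column (named "index" when the
# index has no name) and installs the default 0..n-1 integer index.
flat = df.reset_index().rename(columns={"index": "label"})

print(flat)
```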

Unable to get correct output from tsv file using pandas

I have a tsv file which I am trying to read with pandas. The first two rows of the file are of no use and need to be ignored. However, the output comes out as two columns: the first is named Index, and the name of the second is a random row from the file.
import pandas as pd
data = pd.read_csv('zahlen.csv', sep='\t', skiprows=2)
Please refer to the screenshot below.
The second column name, shown in bold black, is one of the rows from the file. Moreover, using '\t' as the delimiter does not separate the values into different columns. I am using the Spyder IDE for this. Am I doing something wrong here?
Try this:
data = pd.read_table('zahlen.csv', header=None, skiprows=2)
read_table() uses a tab as its default delimiter, so it suits tsv files; otherwise it behaves like read_csv(). header=None then makes pandas treat the first row it reads as data instead of as a header.
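The relationship between the two readers can be checked on a small tab-separated sample. This is a sketch with invented data; the two leading junk lines stand in for the rows the question wants to skip.

```python
import io

import pandas as pd

# Hypothetical tab-separated data preceded by two lines to be ignored.
raw = "junk line 1\njunk line 2\n1\t2\t3\n4\t5\t6\n"

# read_csv with an explicit tab delimiter and no header row.
via_csv = pd.read_csv(io.StringIO(raw), sep="\t", skiprows=2, header=None)

# read_table defaults to sep='\t', so the same call works without sep.
via_table = pd.read_table(io.StringIO(raw), skiprows=2, header=None)

print(via_csv)
print(via_csv.equals(via_table))
```

If the real file still loads as a single column with sep='\t', the values are probably not separated by tabs, and the raw text is worth inspecting in an editor.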
