pd.read_csv ignores columns that don't have headers - python

I have a .csv file that is generated by a third-party program. The data in the file is in the following format:
%m/%d/%Y 49.78 85 6 15
03/01/1984 6.63368 82 7 9.8 34.29056405 2.79984079 2.110346498 0.014652412 2.304545521 0.004732732
03/02/1984 6.53368 68 0 0.2 44.61471002 3.21623666 2.990408898 0.077444779 2.793385466 0.02661873
03/03/1984 4.388344 55 6 0 61.14463457 3.637231063 3.484310818 0.593098236 3.224973641 0.214360796
There are 5 column headers (row 1 in Excel, columns A-E) but 11 columns in total (in row 1, columns F-K are empty; rows 2-N contain float values for columns A-K).
I was not sure how to paste the .csv lines in so they are easily reproducible, sorry for that.
When I use the following code:
FWInds=pd.read_csv("path.csv")
or:
FWInds=pd.read_csv("path.csv", header=None)
the resulting dataframe FWInds does not contain the last 6 columns - it only contains the columns with headers (columns A-E from Excel, with column A used as the index):
FWInds.shape
Out[48]: (245, 4)
Ultimately the last 6 columns are the only ones I even want to read in.
I also tried:
FWInds=pd.read_csv('path.csv', header=None, index_col=False)
but got the following error
CParserError: Error tokenizing data. C error: Expected 5 fields in line 2, saw 11
I also tried to ignore the first row since the column titles are unimportant:
FWInds=pd.read_csv('path.csv', header=None, skiprows=0)
but get the same error.
Also no luck with the usecols parameter; it doesn't seem to understand that I'm referring to column numbers (not names), unless I'm doing it wrong:
FWInds=pd.read_csv('path.csv', header=None, usecols=[5,6,7,8,9,10])
Any tips? I'm sure it's an easy fix but I'm very new to python.

There are a couple of parameters that can be passed to pd.read_csv():
import pandas as pd
colnames = list('ABCDEFGHIJK')
df = pd.read_csv('test.csv', sep='\t', names=colnames)
With this, I can actually import your data quite fine (and it is accessible via e.g. df['K'] afterwards).
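If you prefer the integer-position usecols from the question, that also works once pandas knows each row has 11 fields - a minimal sketch, assuming a tab-separated file named test.csv:
import pandas as pd
# Declare 11 columns up front so the parser expects 11 fields per row,
# then keep only the last six by position.
df = pd.read_csv('test.csv', sep='\t', header=None,
                 names=range(11), usecols=[5, 6, 7, 8, 9, 10])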

You could do it as shown:
col_name = list('ABCDEFGHIJK')
data = 'path.csv'
pd.read_csv(data, delim_whitespace=True, header=None, names=col_name, usecols=col_name[5:])
To read all the columns from A → K, simply omit the usecols parameter.
Data:
from io import StringIO
data = StringIO(
'''
%m/%d/%Y,49.78,85,6,15
03/01/1984,6.63368,82,7,9.8,34.29056405,2.79984079,2.110346498,0.014652412,2.304545521,0.004732732
03/02/1984,6.53368,68,0,0.2,44.61471002,3.21623666,2.990408898,0.077444779,2.793385466,0.02661873
03/03/1984,4.388344,55,6,0,61.14463457,3.637231063,3.484310818,0.593098236,3.224973641,0.214360796
''')
col_name = list('ABCDEFGHIJK')
pd.read_csv(data, header=None, names=col_name, usecols=col_name[5:])
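If the header row (the %m/%d/%Y line) should be dropped entirely, skiprows=1 does that - the same sketch again (note that a StringIO can only be consumed once, so re-create data before re-reading):
pd.read_csv(data, header=None, names=col_name, usecols=col_name[5:], skiprows=1)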

Related

Pandas data frame, to_csv creating duplicate rows

Here is my current code
data = pd.read_csv('file', sep='\t', header=[2])
ndf = pd.DataFrame(data=nd)
new_data = pd.concat([data, ndf])
new_data.to_csv('file', sep='\t', index=False, mode='a', header=False)
So the file I am reading has 3 rows of headers, the headers in the first 2 rows are not used but I need to keep them there.
The headers in row 3 are the same as the headers in ndf, and when I concat data and ndf the new_data dataframe is correctly aligned, so there's no problem there.
The problem comes when I try to write the new_data back to the original file with append mode. Every row of data that was in the original file is duplicated. This happens each time.
I have tried adding drop_duplicates:
new_data = pd.concat([data, ndf]).drop_duplicates(subset='item_sku', keep=False)
But this still leaves me with 2 of each row each time I write back to file.
I also tried reading the file with multiple header rows: header=[0, 1, 2]
But this makes the concat fail, I'm guessing because I haven't told the concat function which row of headers to align with. I think passing keys= would work, but I'm not understanding the documentation very well.
EDIT-
This is an example of the file I am reading
load v1.0 74b FlatFile
ver raid week month
Dept Date Sales IsHoliday
1 2010-02-05 24924.50 False
This would be the data I am trying to append
Dept Date Sales IsHoliday
3 2010-07-05 6743.50 False
And this is the output I am getting
load v1.0 74b FlatFile
ver raid week month
Dept Date Sales IsHoliday
1 2010-02-05 24924.50 False
1 2010-02-05 24924.50 False
3 2010-07-05 6743.50 False
Try re-setting the columns of nd to the three-level header before concat, and write the result back without mode='a' - because the concatenated result already contains every row of the original file, appending it with mode='a' is what duplicates the rows:
data = pd.read_csv("file1.csv",sep="\t",header=[0,1,2])
nd = pd.read_csv("file2.csv",sep="\t")
nd.columns = data.columns
output = pd.concat([data,nd])
output.to_csv('file', sep='\t', index=False)
>>> output
load v1.0 74b FlatFile
ver raid week month
Dept Date Sales IsHoliday
0 1 2010-02-05 24924.5 False
0 3 2010-07-05 6743.5 False
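Alternatively, if append mode is what you want, append only the new rows and suppress the header, so nothing already in the file is re-written - a sketch assuming the file already contains the three header rows plus the original data:
nd.to_csv('file', sep='\t', index=False, header=False, mode='a')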
I'm sure there's a better way of doing it but I've ended up with this result that works.
data = pd.read_csv('file', sep='\t', header=[0, 1, 2])
columns = data.columns
ndf = pd.DataFrame(data=nd, columns=data.columns.get_level_values(2))
data.columns = data.columns.get_level_values(2)
new_data = pd.concat([data, ndf])
new_data.columns = columns
new_data.to_csv('file', sep='\t', index=False, header=True)
So what I did was set ndf to have the same columns as the third header row of data, then flattened data's columns the same way.
This allowed me to concat the two dataframes.
I still had the issue that I was missing the first 2 rows of headers, but since I had saved the columns from the original data file, I could assign the columns back to their original values before saving to csv.

converting a dataframe to a csv file

I am working with a dataset, Adult, that I have modified and would like to save as a csv. However, after saving it as a csv and re-loading it to work with again, the data is not converted properly. The headers are not preserved and some columns are now combined. I have looked through the page and online, but what I have tried is not working. I load the data in with the following code:
import numpy as np ##Import necessary packages
import pandas as pd
import matplotlib.pyplot as plt
from pandas.plotting import scatter_matrix
from sklearn.preprocessing import *
url2="http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data" #Reading in Data from a freely and easily available source on the internet
Adult = pd.read_csv(url2, header=None, skipinitialspace=True) #Decoding data by removing extra spaces in columns with skipinitialspace=True
##Assigning reasonable column names to the dataframe
Adult.columns = ["age","workclass","fnlwgt","education","educationnum","maritalstatus","occupation",
"relationship","race","sex","capitalgain","capitalloss","hoursperweek","nativecountry",
"less50kmoreeq50kn"]
After inserting missing values and changing the data frame as desired I have tried:
df = Adult
df.to_csv('file_name.csv',header = True)
df.to_csv('file_name.csv')
and a few other variations. How can I save the file to a CSV and preserve the correct format for the next time I read the file in?
When re-loading the data I use the code:
import pandas as pd
df = pd.read_csv('file_name.csv')
when running df.head (note: without parentheses, which is why the "bound method" line appears) the output is:
<bound method NDFrame.head of Unnamed: 0 Unnamed: 0.1 age ... Black Asian-Pac-Islander Other
0 0 0 39 ... 0 0 0
1 1 1 50 ... 0 0 0
2 2 2 38 ... 0 0 0
3 3 3 53 ... 1 0 0
and for print(df.loc[:,"age"].value_counts()) the output is:
36 898
31 888
34 886
23 877
35 876
which should not have 2 columns
If you pickle it like so:
Adult.to_pickle('adult.pickle')
You will, subsequently, be able to read it back in using read_pickle as follows:
original_adult = pd.read_pickle('adult.pickle')
Hope that helps.
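As a quick sanity check, the pickle round-trip preserves the index, columns and dtypes exactly (a sketch; note that pickles are not guaranteed to be portable across pandas versions):
assert pd.read_pickle('adult.pickle').equals(Adult)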
If you want to preserve the output column order you can specify the columns directly while saving the DataFrame:
import pandas as pd
url2 = "http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data"
df = pd.read_csv(url2, header=None, skipinitialspace=True)
my_columns = ["age", "workclass", "fnlwgt", "education", "educationnum", "maritalstatus", "occupation",
"relationship","race","sex","capitalgain","capitalloss","hoursperweek","nativecountry",
"less50kmoreeq50kn"]
df.columns = my_columns
# do the computation ...
df[my_columns].to_csv('file_name.csv')
You can add the parameter index=False to the to_csv('file_name.csv', index=False) call if you are not interested in saving the DataFrame row index. Otherwise, when reading the csv file again, you'd need to specify the index_col parameter.
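A minimal sketch of the two symmetric round-trips described above:
# Option 1: don't write the row index at all
df.to_csv('file_name.csv', index=False)
df2 = pd.read_csv('file_name.csv')
# Option 2: write the index, then tell read_csv which column holds it
df.to_csv('file_name.csv')
df2 = pd.read_csv('file_name.csv', index_col=0)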
According to the documentation, value_counts() returns a Series object - you see two columns because the first one is the index (the ages: 36, 31, ...) and the second is the count (898, 888, ...).
I replicated your code and it works for me. The order of the columns is preserved.
Let me show what I tried:
import numpy as np ##Import necessary packages
import pandas as pd
import matplotlib.pyplot as plt
from pandas.plotting import scatter_matrix
from sklearn.preprocessing import *
url2="http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data" #Reading in data from a freely and easily available source on the internet
Adult = pd.read_csv(url2, header=None, skipinitialspace=True) #Decoding data by removing extra spaces in columns with skipinitialspace=True
##Assigning reasonable column names to the dataframe
Adult.columns = ["age","workclass","fnlwgt","education","educationnum","maritalstatus","occupation",
"relationship","race","sex","capitalgain","capitalloss","hoursperweek","nativecountry",
"less50kmoreeq50kn"]
This worked perfectly. Then
df = Adult
This also worked.
Then I saved this data frame to a csv file. Make sure you are providing the absolute path to the file, even if it is being saved in the same folder as this script.
df.to_csv('full_path_to_the_file.csv',header = True)
# so something like
#df.to_csv('Users/user_name/Desktop/folder/NameFile.csv',header = True)
Load this csv file into a new_df. It will generate a new column for keeping track of the index; it is unnecessary, and you can drop it as follows:
new_df = pd.read_csv('Users/user_name/Desktop/folder/NameFile.csv', index_col = None)
new_df= new_df.drop('Unnamed: 0', axis =1)
When I compare the columns of the new_df from the original df, with this line of code
new_df.columns == df.columns
I get
array([ True, True, True, True, True, True, True, True, True,
True, True, True, True, True, True])
You might not have been providing the absolute path to the file, or you may have been saving the file twice, as here. You only need to save it once:
df.to_csv('file_name.csv',header = True)
df.to_csv('file_name.csv')
When you save the dataframe, in general the first column written is the index, so you should load the index back when reading the dataframe. Also, whenever you assign a dataframe to a variable, make sure to copy the dataframe:
df = Adult.copy()
df.to_csv('file_name.csv',header = True)
And to read:
df = pd.read_csv('file_name.csv', index_col=0)
The first column from print(df.loc[:,"age"].value_counts()) is the index, which is shown whenever you display a Series. To save the values to a list, use the to_list method:
print(df.loc[:,"age"].value_counts().to_list())
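To make the index/values split explicit (a small sketch):
ages = df.loc[:, "age"].value_counts()
ages.index.to_list()  # the distinct ages (the left column)
ages.to_list()        # the counts (the right column)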

How can I read a double-semicolon-separated .csv with quoted values using pandas?

I analyse huge financial data-sets that often give me trouble because of corrupt data fields. Luckily, in the near future I get the opportunity to change the way data is delivered to me. The data will get delivered as a double-semicolon-separated txt-file with the fields in double quotation marks, i.e. "A";;"B";;"C"
In using pandas' read_csv to convert this file to a pandas df, however, pandas doesn't seem to recognize the double quotation marks, only the double-semicolon separator, because the output looks like "A" "B" "C" instead of A B C.
I've tried passing quotechar='"' as a parameter and quoting=csv.QUOTE_ALL, but that doesn't change anything.
import pandas as pd
import csv
def create_df(loc):
    df = pd.read_csv(loc, sep=';;', dtype=object, encoding="ISO-8859-1", quotechar='"', quoting=csv.QUOTE_ALL, header=None)
    return df
directory = 'C:\\PycharmProjects\\Test\\'
file = directory + 'test;;qq;;.txt'
df = create_df(file)
writer = pd.ExcelWriter('test.xlsx')
df.to_excel(writer, 'test')
writer.save()
This is a bug that occurs when pandas has to use the Python engine because the separator is not a single character. If you pass a single-character separator instead, it imports and parses those columns correctly, but you end up with additional columns:
In[80]:
import io
import csv
import pandas as pd
t = '''"A";;"B";;"C"'''
df = pd.read_csv(io.StringIO(t), sep=';', quoting=csv.QUOTE_ALL)
df
Out[80]:
Empty DataFrame
Columns: [A, Unnamed: 1, B, Unnamed: 3, C]
Index: []
Then you can drop the extra columns by filtering:
In[81]:
df = df.loc[:,~df.columns.str.contains('Unnamed:')]
df
Out[81]:
Empty DataFrame
Columns: [A, B, C]
Index: []
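As a workaround for the original double-semicolon file, you can also keep sep=';;' (which selects the Python engine) and strip the stray quotes afterwards - a sketch assuming every column is read as a string (dtype=object, as in the question's code):
df = pd.read_csv(loc, sep=';;', dtype=object, engine='python', header=None)
df = df.apply(lambda col: col.str.strip('"'))  # remove leading/trailing quotes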

Problems when pandas reading Excel file that has blank top row and left column

I tried to read an Excel file with a blank top row and a blank left column (the original post included a screenshot of the sheet).
I was using pandas like this
xls = pd.ExcelFile(file_path)
assets = xls.parse(sheetname="Sheet1", header=1, index_col=1)
But I got error
ValueError: Expected 4 fields in line 3, saw 5
I also tried
assets = xls.parse(sheetname="Sheet1", header=1, index_col=1, parse_cols="B:E")
But I got a misparsed result.
Then tried
assets = xls.parse(sheetname="Sheet1", header=1, index_col=0, parse_cols="B:E")
Finally this works, but why index_col=0 and parse_cols="B:E"? This confuses me because, based on the pandas documentation, assets = xls.parse(sheetname="Sheet1", header=1, index_col=1) should just be fine. Have I missed something?
The read_excel documentation is not clear on one point.
skiprows=1 skips the first empty row at the top of the file (header=1 also works, using the second row as the column header).
parse_cols='B:E' is the way to skip the first empty column at the left of the file.
index_col=0 is optional and defines the first parsed column (B in this example) as the DataFrame index. The mistake is here, since index_col is relative to the columns selected through the parse_cols parameter.
With your example, you can use the following code
pd.read_excel('test.xls', sheetname='Sheet1', skiprows=1,
parse_cols='B:E', index_col=0)
# AA BB CC
# 10/13/16 1 12 -1
# 10/14/16 3 12 -2
# 10/15/16 5 12 -3
# 10/16/16 3 12 -4
# 10/17/16 5 23 -5
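For reference, later pandas versions renamed these parameters (sheetname became sheet_name, and parse_cols became usecols), so the equivalent modern call would be:
pd.read_excel('test.xls', sheet_name='Sheet1', skiprows=1,
              usecols='B:E', index_col=0)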

read_csv: delimiter before end-of-line (EOL) leads to wrong column number

When the values in a file end with a separator, the columns read in with read_csv are not assigned properly. For example,
import pandas as pd
# File content
columns = ['A','B']
data = [[1,2], [3,4]]
# Generate very simple test file
with open('test.dat', 'w') as fh:
    fh.writelines('{0}\t{1}'.format(*columns))
    for line in data:
        fh.write('\n')
        for val in line:
            # This is the crux: there is a tab-delimiter after each value,
            # even the last one!
            fh.write('{0}\t'.format(val))
# Try to read it
df = pd.read_csv('test.dat', sep='\t', index_col=None)
print(df)
produces
A B
1 2 NaN
3 4 NaN
Is this a bug, or a feature?
In this specific case, the problem can be fixed with
df = pd.read_csv('test.dat', sep='\t', index_col=None, usecols=['A','B'])
which correctly produces
A B
0 1 2
1 3 4
However, for files with an unknown, large number of columns, this fix is inconvenient. Is there any option to "pd.read_csv" that can fix this problem?
Interestingly, adding the * quantifier to the sep argument seems to work; a multi-character sep is treated as a regular expression and handled by the slower Python engine:
df = pd.read_csv('test.dat', sep='\t*', index_col=None)
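There is also an option documented for exactly this situation: per the read_csv docs, index_col=False forces pandas not to use the first column as the index, e.g. when you have a malformed file with delimiters at the end of each line:
df = pd.read_csv('test.dat', sep='\t', index_col=False)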
