I have a DataFrame that has columns such as ID, Name, Specification, Time.
I read the file like this:
mc = pd.read_csv("C:\\data.csv", sep = ",", header = 0, dtype = str)
When I checked the column names using
mc.columns.values
I found that the ID column name had a weird character in front of it:
['\ufeffID', 'Name', 'Specification', 'Time']
After this I overwrote that column name with ID like this:
mc.columns.values[0] = "ID"
When I checked this using
mc.columns.values
I got my result as,
array(['ID', 'Name', 'Specification', 'Time'])
Then, I checked with,
"ID" in mc.columns.values
it gave me "True"
Then I tried,
mc["ID"]
I got an error like this:
KeyError: 'ID'
I want to get the values of the ID column and get rid of the weird character in front of the column name. Is there any way to solve that? Any help would be appreciated. Thank you in advance.
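That \ufeff is a byte order mark (BOM); see: https://en.wikipedia.org/wiki/Byte_order_mark#Representations_of_byte_order_marks_by_encoding
Because the rest of the header decoded correctly, the file is most likely UTF-8 with a BOM, so pass encoding='utf-8-sig' to read_csv:
mc = pd.read_csv("C:\\data.csv", sep=",", header=0, dtype=str, encoding='utf-8-sig')
If the file really is UTF-16 (its first bytes would then be FE FF for big endian or FF FE for little endian), pass encoding='utf-16' instead.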
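Also, use rename rather than overwriting the NumPy array behind mc.columns. Index labels are meant to be immutable, so changing the array in place can leave label lookups such as mc["ID"] out of sync with what is displayed:
mc.rename(columns={mc.columns[0]: "ID"}, inplace=True)
This should work correctly.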
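As a rough, self-contained illustration of both suggestions, here is a minimal sketch. It assumes the file is UTF-8 with a BOM and uses an in-memory buffer in place of C:\data.csv; the sample rows are made up for the example.

```python
import io
import pandas as pd

# Hypothetical stand-in for the real file: UTF-8 bytes that start with a BOM.
raw = "\ufeffID,Name\n1,a\n2,b\n".encode("utf-8")

# 'utf-8-sig' consumes the BOM, so the first header is plain "ID".
mc = pd.read_csv(io.BytesIO(raw), dtype=str, encoding="utf-8-sig")

# If a frame already carries the stray character, rename the column
# instead of writing into mc.columns.values.
bad = pd.DataFrame({"\ufeffID": ["1", "2"], "Name": ["a", "b"]})
fixed = bad.rename(columns={bad.columns[0]: "ID"})
ids = fixed["ID"].tolist()
```

After the rename, fixed["ID"] is an ordinary column lookup and no longer raises a KeyError.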
I have some simple code that reads a CSV file. After that I rename the columns and print them. I found one weird issue: for some numeric columns an extra .0 is added. Here is my code:
v_df = pd.read_csv('csvfile', delimiter=';')
v_df = v_df.rename(columns={'Order No.': 'Order_Id'})
for index, csv_row in v_df.iterrows():
    print(csv_row.Order_Id)
Output is:
149545961155429.0
149632391661184.0
If I remove the empty row (2nd one in the above output) from the csv file, .0 does not appear in the ORDER_ID.
After some searching, I found that converting this column to string solves the problem. It does work if I change the first line of the code above to:
v_df = pd.read_csv('csvfile', delimiter=';', dtype={'Order No.': 'str'})
However, the column name 'Order No.' changes to Order_Id because of the rename, so I cannot use 'Order No.' afterwards. For this reason I tried the following:
v_df[['Order_Id']] = v_df[['Order_Id']].values.astype('str')
But unfortunately astype does not seem to change the datatype, and the .0 still appears. My questions are:
1- Why does the .0 appear in the first place if there is an empty row in the csv file?
2- Why does the datatype change not happen after the rename?
My aim is just to get rid of the .0; I don't want to change the datatype if the .0 can be removed some other way.
I am trying to emulate your df here, although it has some differences I think it will work for you:
import pandas as pd
import numpy as np
v_df = pd.DataFrame(
    [['13-Oct-22', '149545961155429.0', '149545961255429', 'Delivered'],
     ['12-Oct-22', None, None, 'delivered'],
     ['15-Oct-22', '149632391661184.0', '149632391761184', 'Delivered']],
    columns=['Transaction Date', 'Order_Id', 'Order Item No.', 'Order Item Status'])
# Parse to numbers, then use the nullable Int64 dtype so the missing row does not break the integer cast
v_df['Order_Id'] = pd.to_numeric(v_df['Order_Id']).astype('Int64').astype('string')
Try it and let me know
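To address the two questions directly, here is a small sketch. It assumes a semicolon-delimited file like the one described; the in-memory sample and its Status column are made up. An empty cell becomes NaN, and NaN forces the whole column to float64, which is where the .0 comes from. Passing dtype under the original header name and renaming afterwards avoids the problem:

```python
import io
import pandas as pd

# Hypothetical sample with one empty order number in the middle row.
csv_text = (
    "Order No.;Status\n"
    "149545961155429;Delivered\n"
    ";delivered\n"
    "149632391661184;Delivered\n"
)

# Default parsing: the missing value forces the column to float64.
plain = pd.read_csv(io.StringIO(csv_text), delimiter=";")

# dtype refers to the name in the file, so apply it first and rename afterwards.
as_text = pd.read_csv(
    io.StringIO(csv_text), delimiter=";", dtype={"Order No.": str}
).rename(columns={"Order No.": "Order_Id"})
first_id = as_text["Order_Id"].iloc[0]
```

The dtype mapping is applied while the file is parsed, before the rename runs, so using the original header name there is not a conflict.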
I am working on loading a sample csv file using koalas. What I see is a weird behavior.
The file has a blank column area_code which looks like this. As you can see, it is a blank column. All the rows for this column have blank.
When I read the file as df = ks.read_csv('zipcodes.csv'), I get the following output, which means that the column has nulls, as expected, all good.
When I read the file as df = ks.read_csv('zipcodes.csv', dtype = str), I get the following output, which means that the column doesn't have any nulls.
After a closer look, it seems that the dtype = str is causing this column to be loaded with a string value = None
Is there any reason why this would happen? Any help is appreciated. Thanks in advance.
Bhupesh C
For pandas, that issue was discussed here and seems to be solved.
I don't know much about koalas, but you can try this:
import numpy as np
df = ks.read_csv('zipcodes.csv', dtype=str, keep_default_na=False).replace('', np.nan)
I have the following code which imports a CSV file. There are 3 columns and I want to set the first two of them to variables. When I set the second column to the variable "efficiency" the index column is also tacked on. How can I get rid of the index column?
df = pd.DataFrame.from_csv('Efficiency_Data.csv', header=0, parse_dates=False)
energy = df.index
efficiency = df.Efficiency
print(efficiency)
I tried using
del df['index']
after I set
energy = df.index
which I found in another post but that results in "KeyError: 'index' "
When writing to and reading from a CSV file, pass index=False and index_col=False, respectively. For example:
To write:
df.to_csv(filename, index=False)
and to read from the CSV:
df = pd.read_csv(filename, index_col=False)
This should prevent the issue so you don't need to fix it later.
df.reset_index(drop=True, inplace=True)
DataFrames and Series always have an index. Although it displays alongside the column(s), it is not a column, which is why del df['index'] did not work.
If you want to replace the index with simple sequential numbers, use df.reset_index(). The old index is kept as a new column; pass drop=True to discard it instead.
To get a sense for why the index is there and how it is used, see e.g. 10 minutes to Pandas.
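A short sketch of the points above, using a made-up two-row frame in place of Efficiency_Data.csv (the values and the "Energy" index name are assumptions for illustration):

```python
import pandas as pd

# Hypothetical data: "Energy" is the index, "Efficiency" is the only column.
df = pd.DataFrame(
    {"Efficiency": [0.5, 0.7]},
    index=pd.Index([10, 20], name="Energy"),
)

# The index is not a column, which is why del df['index'] raises a KeyError.
has_index_column = "index" in df.columns

# Plain values without the index labels attached.
values_only = df["Efficiency"].to_numpy()

# Replace the index with 0..n-1 and discard the old labels.
renumbered = df.reset_index(drop=True)

# Leave the index out when writing.
csv_text = df.to_csv(index=False)
```

If the goal is only to print the numbers without the index, something like print(efficiency.to_string(index=False)) should also work.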
You can set one of the columns as an index in case it is an "id" for example.
In this case the index column will be replaced by one of the columns you have chosen.
df.set_index('id', inplace=True)
If your problem is the same as mine, where you just want to reset the column headers to 0 through the number of columns, do
df = pd.DataFrame(df.values)
EDIT:
Not a good idea if you have heterogeneous data types. Better to just use
df.columns = range(len(df.columns))
You can specify which column is the index in your CSV file by using the index_col parameter of the from_csv function.
If this doesn't solve your problem, please provide an example of your data.
One thing that I do is df=df.reset_index()
then df=df.drop(['index'],axis=1)
To remove, or to avoid creating, the default index column, you can set index_col to False and keep header as 0. Here is an example of how you can do it.
recording = pd.read_excel("file.xls",
                          sheet_name="sheet1",
                          header=0,
                          index_col=False)
header=0 turns the first row into the column headers, which you can use later to refer to the columns.
It works for me this way:
df = data.set_index("name of the column header to start as index column")
I am importing study data into a Pandas data frame using read_csv.
My subject codes are 6-digit numbers that encode, among other things, the day of birth. For some of my subjects this results in a code with a leading zero (e.g. "010816").
When I import into Pandas, the leading zero is stripped off and the column is formatted as int64.
Is there a way to import this column unchanged maybe as a string?
I tried using a custom converter for the column, but it does not work - it seems as if the custom conversion takes place before Pandas converts to int.
As indicated in this answer by Lev Landau, a simple solution is to use the converters option of read_csv for the column in question.
converters={'column_name': str}
Let's say I have csv file projects.csv like below:
project_name,project_id
Some Project,000245
Another Project,000478
For example, the code below trims the leading zeros:
from pandas import read_csv
dataframe = read_csv('projects.csv')
print(dataframe)
Result:
project_name project_id
0 Some Project 245
1 Another Project 478
Solution code example:
from pandas import read_csv
dataframe = read_csv('projects.csv', converters={'project_id': str})
print(dataframe)
Required result:
project_name project_id
0 Some Project 000245
1 Another Project 000478
To have all columns as str:
pd.read_csv('sample.csv', dtype=str)
To have certain columns as str:
# column names which need to be string
lst_str_cols = ['prefix', 'serial']
dict_dtypes = {x: 'str' for x in lst_str_cols}
pd.read_csv('sample.csv', dtype=dict_dtypes)
Here is a shorter, robust and fully working solution:
Simply define a mapping (dictionary) between variable names and desired data types:
dtype_dic = {'subject_id': str,
             'subject_number': 'float'}
Use that mapping with pd.read_csv():
df = pd.read_csv(yourdata, dtype=dtype_dic)
Et voilà!
If you have a lot of columns and don't know which ones contain leading zeros, or you just need to automate your code, you can do the following:
df = pd.read_csv("your_file.csv", nrows=1)  # Just take the first row to extract the column names
col_str_dic = {column: str for column in list(df)}
df = pd.read_csv("your_file.csv", dtype=col_str_dic)  # Now you can read the complete file
You could also do:
df = pd.read_csv("your_file.csv", dtype=str)
By doing this you will have all your columns as strings and you won't lose any leading zeros.
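For a quick check of the dtype approach, here is a minimal sketch that reuses the projects.csv sample from the earlier answer, held in an in-memory buffer so it runs without a file:

```python
import io
import pandas as pd

csv_text = (
    "project_name,project_id\n"
    "Some Project,000245\n"
    "Another Project,000478\n"
)

# Default parsing infers integers and drops the leading zeros.
inferred = pd.read_csv(io.StringIO(csv_text))

# Declaring the column as str keeps the text exactly as written.
kept = pd.read_csv(io.StringIO(csv_text), dtype={"project_id": str})
ids = kept["project_id"].tolist()
```

Only project_id is forced to text here, so any other columns would still get their inferred types.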
You can do this; it works on all versions of pandas:
pd.read_csv('filename.csv', dtype={'zero_column_name': object})
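You can use converters to pad the number to a fixed width if you know the width.
For example, if the width is 5, then
data = pd.read_csv('text.csv', converters={'column1': lambda x: str(x).zfill(5)})
This will do the trick. It works for pandas==0.23.0 and also with read_excel.
The converter receives each cell as a string when reading a CSV, so str.zfill is used here; a numeric format spec such as f"{x:05}" only zero-pads on the left for numbers, not strings.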
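A small sketch of the fixed-width idea, assuming a width of 5 and a made-up two-row sample. It pads with str.zfill, since read_csv hands each cell to the converter as text:

```python
import io
import pandas as pd

# Hypothetical sample: one value already padded, one not.
csv_text = "column1,column2\n42,a\n00007,b\n"

data = pd.read_csv(
    io.StringIO(csv_text),
    converters={"column1": lambda x: str(x).zfill(5)},  # pad on the left to width 5
)
padded = data["column1"].tolist()
```

This only helps when every code has the same known width; if widths vary, reading the column as str is the safer choice.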
I don't think you can specify a column type the way you want (unless there have been changes recently, or the 6-digit number is a date that you can convert to datetime). You could try using np.genfromtxt() and create the DataFrame from there.
EDIT: Take a look at Wes McKinney's blog, there might be something for you. It seems that there is a new parser coming in pandas 0.10 in November.
As an example, consider the following my_data.txt file:
id,A
03,5
04,6
To preserve the leading zeros for the id column:
df = pd.read_csv("my_data.txt", dtype={"id":"string"})
df
id A
0 03 5
1 04 6
A bit confused why I am getting this error. I thought skiprows should have taken care of me.
Error:
CParserError: Error tokenizing data. C error: Expected 1 fields in line 6, saw 13
Line:
df_data = pd.read_csv(infile.name, skiprows=[6], sep=',')
CSV:
Header: 1asdf
Header: 2fac
Header: 3aaz
Header: 4ssw
Header: 5aaa
0.0,-64,192,152,27023,3,0,26275,31473,149,67,77,0.0
0.04050016403198242,-64,192,148,27021,3,0,26274,31471,149,67,77,0.038919925689697266
0.08100008964538574,-64,192,148,27017,3,0,26275,31467,149,67,77,0.07783985137939453
0.12150001525878906,-60,192,148,27019,3,0,26277,31467,149,67,77,0.1167600154876709
0.16199994087219238,-60,192,144,27015,3,0,26277,31463,149,67,77,0.15567994117736816
0.2025001049041748,-60,192,148,27075,3,0,26319,31463,149,67,77,0.19460034370422363
If you pass a list to skiprows, it interprets it as 'skip the rows in this list (0 indexed)'. Pass an integer instead. You probably also want header=None so your first row of data doesn't become the column names.
pd.read_csv(infile.name, skiprows=6, header=None)
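To see the integer form in action, here is a minimal sketch with a shortened copy of the sample. It assumes exactly five preamble lines, as in the CSV shown in the question, so the count here is 5; adjust it to however many lines precede the data in your file.

```python
import io
import pandas as pd

# Shortened stand-in for the file: five preamble lines, then data rows.
sample = (
    "Header: 1asdf\n"
    "Header: 2fac\n"
    "Header: 3aaz\n"
    "Header: 4ssw\n"
    "Header: 5aaa\n"
    "0.0,-64,192\n"
    "0.04,-64,192\n"
)

# An integer skips that many lines from the top of the file;
# a list such as [6] would skip only the single line with that 0-based index.
df_data = pd.read_csv(io.StringIO(sample), skiprows=5, header=None)
shape = df_data.shape
```

With header=None the columns are simply numbered 0, 1, 2, so the first data row is not consumed as column names.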
I got the same error message. In my case it was because commas were used as decimal marks and the cells were separated by semicolons, so sep=";" solved the problem:
pd.read_csv(infile.name, sep=";")
This error occurs when a row has more fields than the parser expects.
That usually means one of your columns contains the delimiter inside its value.
The parser then assumes a new column is starting when there really is none, so the exception is raised at runtime.
Solution for that:
The best option is to ask whoever generates your input to fix it.
The second option, if you are allowed to skip the bad records, is to use this:
df = pd.read_csv(file_loc, sep=',', on_bad_lines='skip')  # pandas 1.3+; older versions use error_bad_lines=False
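As a hedged illustration of skipping malformed records, the sketch below uses on_bad_lines='skip', which is available in pandas 1.3 and later. The three-row sample is made up, and its middle row has one field too many:

```python
import io
import pandas as pd

# Hypothetical sample: the second data row has three fields instead of two.
sample = "a,b\n1,2\n3,4,5\n6,7\n"

# Rows with too many fields are dropped instead of raising a tokenizing error.
df = pd.read_csv(io.StringIO(sample), sep=",", on_bad_lines="skip")
rows = df.values.tolist()
```

Skipping silently discards data, so it is worth counting how many rows were dropped before relying on the result.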