Koalas Dataframe read_csv reads null column as not null - python

I am loading a sample CSV file using Koalas and I see some weird behavior.
The file has a blank column, area_code: every row in this column is empty.
When I read the file as df = ks.read_csv('zipcodes.csv'), I get output showing that the column has nulls, as expected, all good.
When I read the file as df = ks.read_csv('zipcodes.csv', dtype=str), I get output showing that the column doesn't have any nulls.
After a closer look, it seems that dtype=str is causing this column to be loaded with the string value 'None'.
Any reason why this would happen? Any help is appreciated. Thanks in advance.
Bhupesh C

For pandas, that issue was discussed here and seems to be solved.
I don't know much about Koalas, but you can try this:
import numpy as np
import databricks.koalas as ks

df = ks.read_csv('zipcodes.csv', dtype=str, keep_default_na=False).replace('', np.nan)
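To confirm the blank column really comes back as nulls after the replace, here is a quick check (assuming the column is named area_code, as in the question):
df['area_code'].isnull().sum()   # should equal the total number of rows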

Related

Python pandas extra 0 in numeric values

I have simple code that reads a CSV file. After that I rename the columns and print them. I found one weird issue: for some numeric columns it adds an extra .0. Here is my code:
v_df = pd.read_csv('csvfile', delimiter=';')
v_df = v_df.rename(columns={'Order No.': 'Order_Id'})
for index, csv_row in v_df.iterrows():
    print(csv_row.Order_Id)
Output is:
149545961155429.0
149632391661184.0
If I remove the empty row (the 2nd one in the above output) from the CSV file, the .0 does not appear in the Order_Id.
After doing some searching, I found that converting this column to string will solve the problem. It does work if I change the first line of the above code to:
v_df = pd.read_csv('csvfile', delimiter=';', dtype={'Order No.': 'str'})
However, the issue is that the column name 'Order No.' is changed to Order_Id by the rename, so I cannot use 'Order No.'. For this reason I tried the following:
v_df[['Order_Id']] = v_df[['Order_Id']].values.astype('str')
But unfortunately it seems that astype is not changing the datatype, and the .0 is still appearing. My questions are:
1- Why is the .0 appearing in the first place if there is an empty row in the CSV file?
2- Why is the datatype change not happening after the rename?
My aim is just to get rid of the .0; I don't want to change the datatype if the .0 can go away using any other method.
I am trying to emulate your df here; although it has some differences, I think it will work for you:
import pandas as pd
import numpy as np
v_df = pd.DataFrame([['13-Oct-22', '149545961155429.0', '149545961255429', 'Delivered'],
                     ['12-Oct-22', None, None, 'delivered'],
                     ['15-Oct-22', '149632391661184.0', '149632391761184', 'Delivered']],
                    columns=['Transaction Date', 'Order_Id', 'Order Item No.', 'Order Item Status'])
# Parse via float (to handle the '...0' strings), then nullable Int64 (so the missing
# row stays <NA> instead of becoming a garbage integer), then back to str
v_df['Order_Id'] = v_df['Order_Id'].astype('float').astype('Int64').astype('str')
Try it and let me know
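For what it's worth, the .0 shows up in the first place because a plain int64 column cannot hold missing values: as soon as the CSV has an empty cell in that column, pandas upcasts the whole column to float64, and floats print with a trailing .0. An alternative sketch, assuming the file still carries the original 'Order No.' header: the dtype mapping in read_csv is applied using the name as it appears in the file, before the rename, so the two can be combined.
v_df = pd.read_csv('csvfile', delimiter=';', dtype={'Order No.': 'str'})  # read as text, so no .0
v_df = v_df.rename(columns={'Order No.': 'Order_Id'})                     # then rename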

How to stop Pandas converting integer to decimal when reading in an .xlsx file?

I have an .xlsx file that I am loading into a dataframe using the pd.read_excel method. However, when I do so, one of my columns appears to change format, with pandas adding a decimal point. Does anyone know why this is happening and how to stop it please?
Example of data in the .xlsx file:
191001
191002
191003
Example of the same data in the dataframe:
191001.0
191002.0
191003.0
The relevant column is using the 'General' format option in Excel.
I tried removing the decimal point with the following method; however I got the error message "pandas.errors.IntCastingNaNError: Cannot convert non-finite values (NA or inf) to integer".
df.column1 = df.column1.astype(int)
Any help would be appreciated!
Your file most likely has infinite or NaN values within the column.
You will need to remove them first:
import numpy as np
df.replace([np.inf, -np.inf], np.nan, inplace=True)
df.fillna(0, inplace = True)
df.column1 = df.column1.astype(int)
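If filling the blanks with 0 is not acceptable, another option (a sketch, assuming the column is called column1; the file name below is a placeholder) is pandas' nullable integer dtype, which keeps missing cells as <NA> while dropping the .0:
import pandas as pd

df = pd.read_excel('file.xlsx')                # 'file.xlsx' is a placeholder path
df['column1'] = df['column1'].astype('Int64')  # nullable integer: blanks stay <NA>, no trailing .0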

Label encoding in Pandas

I am working with a data set which has numerical and categorical values. I have found a solution for the numerical values, so the next step is label encoding for the categorical values. In order to do that I wrote these lines of code:
import pandas as pd
dataset_categorical = dataset.select_dtypes(include = ['object'])
new_column = dataset_categorical.astype('category')
After executing the last line of code in Jupyter I don't see an error, but the values are not converted into encoded values.
Also, this line works when I try it with only one column, but it doesn't work with the whole data frame.
So can anybody help me solve this problem?
import pandas as pd

df1 = {'Name': ['George', 'Andrea', 'micheal', 'maggie', 'Ravi', 'Xien', 'Jalpa'],
       'Is_Male': [1, 0, 1, 0, 1, 1, 0]}
df1 = pd.DataFrame(df1, columns=['Name', 'Is_Male'])
# Typecast to a categorical column in pandas
df1['Is_Male'] = df1.Is_Male.astype('category')
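If the goal is the integer codes themselves rather than just the category dtype, one way (a sketch, assuming every object column in dataset should be encoded independently) is to go through .cat.codes:
dataset_categorical = dataset.select_dtypes(include=['object'])
# Each column becomes integer codes; missing values are encoded as -1
encoded = dataset_categorical.apply(lambda col: col.astype('category').cat.codes)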

filter off NaNs in a large DataFrame with headers

I have a large number of time series, with blanks on certain dates for some of them. I read them with xlwings from an Excel sheet:
Y0 = xw.Range('SomeRangeinXLsheet').options(pd.DataFrame, index=True, header=3).value
I'm trying to create a filter to run regressions on those series, so I have to take out the blank dates. If I do:
print(Y0.iloc[:,[i]]==Y0.iloc[:,[i]])
I get a proper series of true/false for my column number i, fine.
I'm then stuck: I can't find a way to filter the whole df with the true/false values for that column, or even just extract that clean series as a pd.Series.
I need them one by one so I can align my independent variables' dates with those of each of these series separately.
Thank you for your help.
I believe you want to use df.dropna()
I am not sure if I understood your problem, but if you want to check for NULLs in a specific column and drop those rows, you can try this:
import pandas as pd
df = df[pd.notnull(df['column_name'])]
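To do this one series at a time, as the question describes, here is a sketch using the Y0 frame and the column index i from the question: a boolean mask built with notna() can both extract the clean series and filter the whole frame.
mask = Y0.iloc[:, i].notna()                # True where column i has a value
clean_series = Y0.loc[mask, Y0.columns[i]]  # that column without its blank dates
filtered_df = Y0.loc[mask]                  # the whole frame restricted to those dates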
For deleting NaNs, df.dropna() should work, as suggested in the previous answer. If it is not working, you can try replacing the NaNs with placeholder text and then deleting the rows that contain it.
import numpy as np

df['column_name'] = df['column_name'].replace(np.nan, 'delete-it')
df = df[df["column_name"] != 'delete-it']
Hope this helps!

Pandas DataFrame- Finding Index Value for a Column

I have a DataFrame that has columns such as ID, Name, Specification, Time.
My file path to open them:
mc = pd.read_csv("C:\\data.csv", sep=",", header=0, dtype=str)
When I checked my column values using
mc.columns.values
I found that my ID column had a weird character in front of it, like this:
['\ufeffID', 'Name', 'Specification', 'Time']
After this I assigned that column the name ID like this:
mc.columns.values[0] = "ID"
When I checked this using
mc.columns.values
I got my result as:
array(['ID', 'Name', 'Specification', 'Time'])
Then, I checked with,
"ID" in mc.columns.values
it gave me "True"
Then I tried,
mc["ID"]
I got an error stating:
KeyError: 'ID'
I want to get the values of the ID column and get rid of that weird character in front of it. Is there any way to solve that? Any help would be appreciated. Thank you in advance.
That's a byte order mark (BOM), see: https://en.wikipedia.org/wiki/Byte_order_mark#Representations_of_byte_order_marks_by_encoding
Since the column name decoded cleanly to the single character \ufeff followed by ID, the file is UTF-8 with a BOM, so pass encoding='utf-8-sig' to read_csv and pandas will strip it:
mc = pd.read_csv("C:\\data.csv", sep=",", header=0, dtype=str, encoding='utf-8-sig')
(If the file were actually UTF-16, you would pass encoding='utf-16' instead; FE FF is the BOM for UTF-16 big endian.)
Also, you should use rename rather than trying to overwrite the numpy array value; overwriting the underlying array doesn't update the index's lookup table, which is why mc["ID"] still raised a KeyError:
mc.rename(columns={mc.columns[0]: "ID"}, inplace=True)
should work correctly
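Alternatively, if re-reading the file is inconvenient, the BOM can be stripped from the header after the fact (a small sketch on the mc frame from the question):
# Remove a leading BOM character from every column name
mc.columns = mc.columns.str.replace('\ufeff', '', regex=False)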
