Convert dask to pandas dataframe - python

I have a question quite similar to this one: Dask read_csv -- Mismatched dtypes found in `pd.read_csv`/`pd.read_table`
I am running the following script:
import pandas as pd
import dask.dataframe as dd
df2 = dd.read_csv("Path/*.csv", sep='\t', encoding='unicode_escape', sample=2500000)
df2 = df2.loc[~df2['Type'].isin(['STVKT','STKKT', 'STVK', 'STKK', 'STKET', 'STVET', 'STK', 'STKVT', 'STVVT', 'STV', 'STVZT', 'STVV', 'STKV', 'STVAT', 'STKAT', 'STKZT', 'STKAO', 'STKZE', 'STVAO', 'STVZE', 'STVT', 'STVNT'])]
df2 = df.compute()
And I get the following error: ValueError: Mismatched dtypes found in pd.read_csv/pd.read_table.
How can I avoid that? I have over 32 columns, so I can't set up the dtypes upfront. As a hint, the message also says: Specify dtype option on import or set low_memory=False

When Dask loads your CSVs, it infers the dtypes from a sample at the start of the data and then assumes that the remaining partitions of the files have the same dtypes for each column. Since the types pandas infers from a CSV depend on the set of values it sees, this is where the error comes from.
To fix this, either explicitly tell Dask what types to expect (dtype=), or increase the size of the portion Dask guesses the types from (sample=).
The error message should have told you which columns were not matching and the types found, so you only need to specify those to get things working.
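For example, a minimal sketch based on the snippet above (the dtype mapping is an assumption; use the column names and types listed in your error message, and adjust sample= as needed):
import dask.dataframe as dd
# Force the columns reported in the error to a safe type (here: object/string).
# 'Type' is taken from the question; add whichever other columns the error lists.
df2 = dd.read_csv("Path/*.csv", sep='\t', encoding='unicode_escape',
                  dtype={'Type': 'object'})
# Alternatively, let Dask infer types from a larger sample (value is in bytes):
# df2 = dd.read_csv("Path/*.csv", sep='\t', encoding='unicode_escape', sample=10000000)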

Maybe try this (note that the last line of your script calls compute() on df instead of df2):
df = pd.DataFrame()
df = df2.compute()  # compute() materializes the Dask dataframe into an in-memory pandas DataFrame

Related

How to remove b' from values in dataframe

I read my ARFF dataset from here https://archive.ics.uci.edu/ml/machine-learning-databases/00426/ like this:
from scipy.io import arff
import pandas as pd
data = arff.loadarff('Autism-Adult-Data.arff')
df = pd.DataFrame(data[0])
df.head()
But my dataframe has b' in all values in all columns:
How to remove it?
When I try this, it doesn't work either:
from scipy.io import arff
import pandas as pd
data = arff.loadarff('Autism-Adult-Data.arff')
df = pd.DataFrame(data[0].str.decode('utf-8'))
df.head()
It says AttributeError: 'numpy.ndarray' object has no attribute 'str'
As you can see, .str.decode('utf-8') from Removing b'' from string column in a pandas dataframe didn't solve the problem.
This doesn't work either:
df.index = df.index.str.encode('utf-8')
As you can see, both the strings and the numbers are bytes objects.
I was looking at the same dataset and had a similar issue, and I found a workaround that may be helpful. Rather than using from scipy.io import arff, I used another library called liac-arff. So the code should be like:
pip install liac-arff
Or whatever the pip command that works for your operating system or IDE, and then
import arff
import pandas as pd
data = arff.load(open('Autism-Adult-Data.arff'))  # liac-arff's load() takes a file object; loads() expects a string of ARFF content
data is a dictionary. To find what keys that dictionary has, you do
data.keys()
and you will find that all arff files have the following keys
['description', 'relation', 'attributes', 'data']
where data holds the actual rows and attributes holds the column names and their declared values. So to get a dataframe you need to do the following:
colnames = []
for i in range(len(data['attributes'])):
    colnames.append(data['attributes'][i][0])
df = pd.DataFrame.from_dict(data['data'])
df.columns = colnames
df.head()
I may have gone overboard with building the whole dataframe here, but this returns a dataframe with no b' issues, and the key is using import arff (liac-arff) instead of scipy.io.arff.
So the GitHub for the library I used can be found here.
Although Shimon shared an answer, you could also give this a try:
df.apply(lambda x: x.str.decode('utf8'))
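If only some columns contain byte strings (calling .str.decode on a numeric column raises an error), here is a minimal sketch that decodes just the object columns, assuming df was built with scipy's arff.loadarff as in the question:
# Columns loaded by loadarff as bytes end up with dtype object;
# numeric columns stay float64 and are left untouched.
byte_cols = df.select_dtypes([object]).columns
df[byte_cols] = df[byte_cols].apply(lambda x: x.str.decode('utf-8'))
df.head()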

Read specific columns from a pickle file

So I know that in pandas you can specify which columns to pull from a csv file to generate a DataFrame:
df = pd.read_csv('data.csv', usecols=['a','b','c'])
How do you do this with a pickled file?
df = pd.read_pickle('data.pkl', usecols=['a','b','c'])
gives TypeError: read_pickle() got an unexpected keyword argument 'usecols'
I can't find the correct argument in the documentation.
Since pickle files contain complete Python objects, I doubt you can select columns while loading them; at least it seems pandas doesn't support that directly. But you can load the file completely first and then filter for your columns, like so:
df = pd.read_pickle('data.pkl')
df = df.filter(['a', 'b', 'c'])
Documentation

Why does pandas only see one column in a csv dataset with numerous columns?

I am new to pandas and I am trying to work out why the shape of this csv dataset (https://www.kaggle.com/vfoufikos/airbnb-analysis-lisbon) is shown as (237, 1), when it appears that the dataset has 20 columns.
import time
import pandas as pd
import numpy as np
df = pd.read_csv('airbnb_lisbon.csv', error_bad_lines=False)
print(df.shape)
Could anyone please explain why?
You could use the usecols option to select the columns you'd like to use. For example, if you wanted to store specific dataset columns in df, you could use:
df = pd.read_csv(...., usecols=['col1', 'col2',..., 'coln'])
If you'd like to load all the data without specifying columns, I'd look into specifying your delimiter, as that might be the problem you've run into. You can set it with sep=',' or sep=';' in your pd.read_csv() call. Let me know if either of these works!
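A minimal sketch of both suggestions, using the file from the question (the usecols names are placeholders for columns in the dataset):
import pandas as pd
# Explicitly set the separator; df.shape should now report all 20 columns
df = pd.read_csv('airbnb_lisbon.csv', sep=',')
print(df.shape)
# Or additionally restrict the load to specific columns (placeholder names):
# df = pd.read_csv('airbnb_lisbon.csv', sep=',', usecols=['room_id', 'price'])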
I had the very same problem reported by LeoGER. I've tried the three solutions that you have suggested.
df = pd.read_csv(...., usecols=['col1', 'col2',..., 'coln']) DIDN'T WORK. Jupyter reported an error
sep=';' DIDN'T WORK, as the dataset kept the same single-column layout
sep=',' IT WORKED, and finally I can see the whole set of columns! :D
It seems that my mistake was using delimiter=';' instead of sep=','.

Trying to import all columns from a csv with an object data type with pandas

I'm trying to read a csv into a new dataframe with pandas. A number of the columns may only contain numeric values, but I still want to have them imported as strings/objects, rather than as columns of float type.
I'm trying to write some python scripts for data conversion/migration. I'm not an advanced Python programmer and mostly learning as I come across a problem that needs solving.
The csvs I am importing have a varying number of columns, with different column titles, in any order, over which I have no control, so I can't explicitly specify data types using the dtype parameter with read_csv. I just want every imported column to be treated as an object data type so I can analyse it further for data quality.
Examples would be the 'Staff ID' and 'License Number' columns in one CSV I tried, which should be string fields holding 7-digit IDs but are being imported as type float64.
I have tried using astype with read_csv and applymap on the dataframe after import.
Note, there is no hard and fast rule on the content type or quality of the data, which is why I want to always import everything with a dtype of object.
Thanks in advance for anyone who can help me figure this out.
I've used the following code to read it in.
import pandas as pd
df = pd.read_csv("agent.csv",encoding="ISO-8859-1")
This creates the 'License Number' column in df with a type of float64 (among others).
Here is an example of License Number which should be a string:
'1275595' being stored as 1275595.0
Converting it back to a string/object in df after the import changes it back to '1275595.0'
This should stop pandas from converting the data:
pd.read_csv(..., dtype=str)
From the read_csv documentation:
dtype: ... Use str or object together with suitable na_values settings to preserve and not interpret dtype.
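A minimal sketch based on the snippet from the question (the file name and encoding are taken from there):
import pandas as pd
# dtype=str forces every column to be read as string/object; nothing is parsed as float
df = pd.read_csv("agent.csv", dtype=str, encoding="ISO-8859-1")
print(df.dtypes)                      # every column now shows as 'object'
print(df['License Number'].head())    # e.g. '1275595' instead of 1275595.0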
I recommend you split your csv reading process into multiple, specific-purpose functions.
For example:
import pandas as pd
# Base function for reading a csv. All the parsing/formatting is done here
def read_csv(file_content, header=None, columns=None, encoding='utf-8'):
    df = pd.read_csv(file_content, header=header, encoding=encoding)
    if columns is not None:
        df.columns = columns
    return df

# Function with a specific purpose, as stated in its name.
def read_csv_license_plates(file_content, encoding='utf-8'):
    columns = ['col1', 'col2', 'col3']
    df = read_csv(file_content, header=0, columns=columns, encoding=encoding)
    return df
read_csv_license_plates('agent.csv', encoding='ISO-8859-1')

Proper way of writing and reading Dataframe to file in Python

I would like to write and later read a dataframe in Python.
df_final.to_csv(self.get_local_file_path(hash,dataset_name), sep='\t', encoding='utf8')
...
df_final = pd.read_table(self.get_local_file_path(hash,dataset_name), encoding='utf8',index_col=[0,1])
But then I get:
sys:1: DtypeWarning: Columns (7,17,28) have mixed types. Specify dtype
option on import or set low_memory=False.
I found this question, which in the bottom line says I should specify the field types when I read the file because "low_memory" is deprecated... I find that very inefficient.
Isn't there a simple way to write & later read a Dataframe? I don't care about the human-readability of the file.
You can pickle your dataframe:
df_final.to_pickle(self.get_local_file_path(hash,dataset_name))
Read it back later:
df_final = pd.read_pickle(self.get_local_file_path(hash,dataset_name))
If your dataframe is big and this gets too slow, you might have more luck using the HDF5 format:
df_final.to_hdf(self.get_local_file_path(hash,dataset_name), key='df')
Read it back later:
df_final = pd.read_hdf(self.get_local_file_path(hash,dataset_name), 'df')
You might need to install PyTables first.
Both ways store the data along with their types. Therefore, this should solve your problem.
The warning is because pandas has detected conflicting data values in your columns. You can specify the datatypes via the dtype argument when reading the file, if you wish:
dtype={'FIELD': int, 'FIELD2': str}
etc.
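For example, a minimal sketch for the read_table call from the question (the names 'col7', 'col17' and 'col28' are placeholders for the columns reported in the warning; replace them with the real column names):
# Pin the mixed-type columns from the DtypeWarning to str so they are not re-inferred.
df_final = pd.read_table(self.get_local_file_path(hash, dataset_name),
                         encoding='utf8', index_col=[0, 1],
                         dtype={'col7': str, 'col17': str, 'col28': str})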
