So I know that in Pandas you can specify which columns to pull from a CSV file to generate a DataFrame:
df = pd.read_csv('data.csv', usecols=['a','b','c'])
How do you do this with a pickled file?
df = pd.read_pickle('data.pkl', usecols=['a','b','c'])
gives TypeError: read_pickle() got an unexpected keyword argument 'usecols'
I can't find the correct argument in the documentation.
Since pickle files contain complete Python objects, I doubt you can select columns while loading them, or at least it seems pandas doesn't support that directly. But you can first load the file completely and then filter for your columns, like so:
df = pd.read_pickle('data.pkl')
df = df.filter(['a', 'b', 'c'])
Documentation
I have an h5 data file, which includes key rawreport
I can read the rawreport key and save it as a dataframe using read_hdf(filename, "rawreport") without any problems. But the data has 17 million rows and I'd like to use chunking.
When I ran this code
chunksize = 10**6
someval = 100
df = pd.DataFrame()
for chunk in pd.read_hdf(filename, 'rawreport', chunksize=chunksize, where='datetime < someval'):
    df = pd.concat([df, chunk], ignore_index=True)
I get "TypeError: can only use an iterator or chunksize on a table"
What does it mean that the rawreport isn't a table and how could I overcome this issue? I'm not the person who created the h5 file.
Chunking is only possible if your file was written in a Table format using PyTables. This must be specified when your file was first written:
df.to_hdf(filename, key='rawreport', format='table')
If this wasn't specified when you wrote the file, then pandas defaults to the fixed format. That makes the file quick to write and read later, but it means the entire dataframe must be read into memory. Unfortunately, chunking and the other read_hdf options for selecting particular rows or columns can't be used here.
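If you can afford to load the fixed-format file once, a minimal sketch of a workaround would be to rewrite it in table format and then read it back in chunks. The key 'rawreport', the 'datetime' column and someval are taken from the question; the output path is hypothetical:
import pandas as pd

filename = 'rawreport.h5'   # placeholder path to the original fixed-format file
someval = 100

# One-off conversion: load the fixed-format file fully, then rewrite it as a
# queryable table. data_columns makes 'datetime' usable in `where`
# (not needed if 'datetime' is the index).
df = pd.read_hdf(filename, 'rawreport')
df.to_hdf('rawreport_table.h5', key='rawreport', format='table',
          data_columns=['datetime'])

# Now chunked, filtered reads work.
chunks = []
for chunk in pd.read_hdf('rawreport_table.h5', 'rawreport',
                         chunksize=10**6, where=f'datetime < {someval}'):
    chunks.append(chunk)
df_filtered = pd.concat(chunks, ignore_index=True)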
In Python, I open a data frame from multiple hdf5 files with vaex (vdf = vaex.open('test_*.hdf5')). Everything seems to work nicely, e.g. combining two columns to make a new one (vdf['newcol'] = vdf.x+vdf.y).
But I cannot get vaex's groupby to work: vdf.groupby('x', agg='count') throws a TypeError: unhashable type: 'Expression'.
It doesn't seem to matter if x is an integer column or a string column. It works nicely when I'm reading only one hdf5 file, but fails as soon as multiple files are combined into one vaex data frame. What could be the reason for this error and how can I get around it?
Which version of Vaex are you running? If the following example works for you, the issue has already been fixed when Vaex is installed from source:
import vaex
import vaex.ml
df1 = vaex.ml.datasets.load_iris()
df2 = vaex.ml.datasets.load_iris()
df = vaex.concat([df1, df2])
df.groupby('class_', agg='count')
If the above example works for you, you can already try the latest alpha from pip.
I have quite a similar question to this one: Dask read_csv-- Mismatched dtypes found in `pd.read_csv`/`pd.read_table`
I am running the following script:
import pandas as pd
import dask.dataframe as dd
df2 = dd.read_csv("Path/*.csv", sep='\t', encoding='unicode_escape', sample=2500000)
df2 = df2.loc[~df2['Type'].isin(['STVKT','STKKT', 'STVK', 'STKK', 'STKET', 'STVET', 'STK', 'STKVT', 'STVVT', 'STV', 'STVZT', 'STVV', 'STKV', 'STVAT', 'STKAT', 'STKZT', 'STKAO', 'STKZE', 'STVAO', 'STVZE', 'STVT', 'STVNT'])]
df2 = df.compute()
And I get the following error: ValueError: Mismatched dtypes found in pd.read_csv/pd.read_table.
How can I avoid that? I have over 32 columns, so I can't set up the dtypes upfront. As a hint, the message also says: Specify dtype option on import or set low_memory=False
When Dask loads your CSV, it tries to infer the dtypes from a sample at the start of the file, and then assumes that the remaining parts of the files have the same dtypes for each column. Since the types pandas infers from a CSV depend on the set of values it actually sees, this is where the error comes from.
To fix it, you either have to explicitly tell Dask which types to expect, or increase the size of the portion Dask uses to guess types (sample=).
The error message should have told you which columns were not matching and the types found, so you only need to specify those to get things working.
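For example, a minimal sketch, assuming the error message flagged two hypothetical columns ColA and ColB (substitute the real names and the dtypes Dask suggests):
import dask.dataframe as dd

# ColA / ColB are placeholders for the columns named in the error message.
df2 = dd.read_csv("Path/*.csv", sep='\t', encoding='unicode_escape',
                  dtype={'ColA': 'object', 'ColB': 'float64'})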
Maybe try this: your last line calls compute() on df, which isn't defined; compute the Dask dataframe you actually built instead:
df = df2.compute()
I'm trying to read a csv into a new dataframe with pandas. A number of the columns may only contain numeric values, but I still want them imported as strings/objects rather than as columns of float type.
I'm trying to write some python scripts for data conversion/migration. I'm not an advanced Python programmer and mostly learning as I come across a problem that needs solving.
The CSVs I am importing have a varying number of columns, with different column titles in any order, over which I have no control, so I can't explicitly specify data types using the dtype parameter of read_csv. I just want every imported column treated as the object data type so I can analyse it further for data quality.
Examples would be the 'Staff ID' and 'License Number' columns on one CSV I tried: they should be string fields holding 7-digit IDs, but they are imported as float64.
I have tried using astype with read_csv, and applymap on the dataframe after import.
Note: there is no hard-and-fast rule on the contents or quality of the data, which is why I always want to import the columns as dtype object.
Thanks in advance for anyone who can help me figure this out.
I've used the following code to read it in.
import pandas as pd
df = pd.read_csv("agent.csv",encoding="ISO-8859-1")
This creates the 'License Number' column in df with a type of float64 (among others).
Here is an example of License Number which should be a string:
'1275595' being stored as 1275595.0
Converting it back to a string/object in df after the import turns it into '1275595.0'.
This should stop pandas from converting the data:
pd.read_csv(..., dtype=str)
Doc: read_csv
dtype: ... Use str or object together with suitable na_values settings
to preserve and not interpret dtype.
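For example, a minimal sketch against the agent.csv file from the question (dtype=str keeps every column as object; adding keep_default_na=False would additionally keep empty cells as empty strings rather than NaN, if you need that):
import pandas as pd

# Every column comes back as string/object, so '1275595' stays '1275595'.
df = pd.read_csv("agent.csv", encoding="ISO-8859-1", dtype=str)
print(df['License Number'].head())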
I recommend you split your csv reading process into multiple, specific-purpose functions.
For example:
import pandas as pd

# Base function for reading a csv. All the parsing/formatting is done here.
def read_csv(file_content, header=0, columns=None, encoding='utf-8'):
    # dtype=str keeps every column as string/object, as asked in the question
    df = pd.read_csv(file_content, header=header, encoding=encoding, dtype=str)
    if columns is not None:
        df.columns = columns
    return df

# Function with a specific purpose as stated in the name.
def read_csv_license_plates(file_content, encoding='utf-8'):
    columns = ['col1', 'col2', 'col3']
    df = read_csv(file_content, header=0, columns=columns, encoding=encoding)
    return df

read_csv_license_plates('agent.csv', encoding='ISO-8859-1')
I would like to write and later read a dataframe in Python.
df_final.to_csv(self.get_local_file_path(hash,dataset_name), sep='\t', encoding='utf8')
...
df_final = pd.read_table(self.get_local_file_path(hash,dataset_name), encoding='utf8',index_col=[0,1])
But then I get:
sys:1: DtypeWarning: Columns (7,17,28) have mixed types. Specify dtype
option on import or set low_memory=False.
I found this question, which in the bottom line says I should specify the field types when I read the file because "low_memory" is deprecated... I find that very inefficient.
Isn't there a simple way to write & later read a Dataframe? I don't care about the human-readability of the file.
You can pickle your dataframe:
df_final.to_pickle(self.get_local_file_path(hash,dataset_name))
Read it back later:
df_final = pd.read_pickle(self.get_local_file_path(hash,dataset_name))
If your dataframe is big and this gets too slow, you might have more luck using the HDF5 format:
df_final.to_hdf(self.get_local_file_path(hash,dataset_name), key='df_final')
Read it back later:
df_final = pd.read_hdf(self.get_local_file_path(hash,dataset_name), key='df_final')
You might need to install PyTables first.
Both ways store the data along with their types. Therefore, this should solve your problem.
The warning appears because pandas has detected conflicting data values in your columns. You can specify the datatypes in the read call if you wish:
,dtype={'FIELD':int,'FIELD2':str}
Etc.
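For completeness, a minimal sketch of where that mapping goes in the read call (the path is a placeholder and FIELD/FIELD2 are the example column names from above):
import pandas as pd

# 'data.tsv' is a placeholder path; use your own file here.
df_final = pd.read_table('data.tsv', encoding='utf8', index_col=[0, 1],
                         dtype={'FIELD': int, 'FIELD2': str})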