I use pandas' pd.read_csv() to read a column of strings, processing it with a converter function while reading. The column comes back with dtype 'object', but 'string' would be much more space efficient.
Can I somehow convince pd.read_csv() to make the column of type 'string' from the beginning? I know how to convert later, but that may become a memory issue, since the dataset is large.
f = lambda x: "/".join(x.split('/')[1:5])
pd.read_csv(..., converters={'path': f}, ...)
I use pandas 1.0.3 and Python 3.8.2.
It would be even better if I could create a category type (of strings) from the beginning ...
thank you,
Heiner
Pandas maps strings to the object type; that is why you are getting object as the dtype.
You can see the different mappings of Python types to pandas dtypes below.
[image: table mapping Python types to pandas dtypes]
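As a rough sketch of what is possible at read time (assuming pandas >= 1.0, as in the question, and the column name 'path'): a converter always yields an object column and takes precedence over any dtype given for the same column, so that column has to be cast right after the read, while columns without a converter can be given their dtype directly.
import pandas as pd

f = lambda x: "/".join(x.split('/')[1:5])

# Column processed by a converter: cast immediately after reading.
df = pd.read_csv("data.csv", converters={"path": f})
df["path"] = df["path"].astype("category")      # or .astype("string")

# Column without a converter: the dtype can be requested at read time.
df2 = pd.read_csv("data.csv", dtype={"path": "category"})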
I have noticed that when I specify the dtypes for columns I want to import from a csv as str, I get my dataframe with the object dtypes as expected. Some of these columns do have NaN values.
If I use str.strip() or other .str methods, will they actually work?
Or, after the above importation process, do I need to specify the dtype of each column individually with .astype("str") and then use the .str functions? When I do this, sys.getsizeof increases (probably because the missing values become strings).
Everything appears to work as intended using both methodologies, but I am curious whether I am making a mistake by relying on the first methodology, and I am a bit confused why sys.getsizeof differs between specifying the dtype during import and manually specifying it after importation.
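As a rough illustration of the difference (a toy CSV, not your data): with dtype=str at import the missing values stay NaN and the .str methods simply skip them, whereas .astype(str) after the fact turns NaN into the literal string 'nan', which is also part of why the measured size differs.
import io
import pandas as pd

csv = io.StringIO("name,code\nalice,  a1 \nbob,\n")

# dtype=str at read time: missing values remain NaN, .str methods work.
df1 = pd.read_csv(csv, dtype=str)
print(df1["code"].str.strip())     # 'a1' and NaN

# astype(str) afterwards: NaN becomes the string 'nan'.
csv.seek(0)
df2 = pd.read_csv(csv).astype(str)
print(df2["code"].tolist())        # ['  a1 ', 'nan']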
I have a column with text which is stored as object type in a pandas dataframe.
Neither of the following two attempts works; the column in question is still object type.
TD_Eco_Comb_c['CountryPair'] = TD_Eco_Comb_c['CountryPair'].astype('|S')
TD_Eco_Comb_c['CountryPair'] = TD_Eco_Comb_c['CountryPair'].astype('str')
Any advice?
A string is always going to have dtype == 'object' in a dataframe.
This comes from numpy, which uses purely numerical dtypes. Everything that isn't numerical is classified as an 'object'. Your data is already in the format you need it to be in.
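A small sketch to verify this yourself (the last line assumes pandas >= 1.0): the cast does happen, the column just keeps reporting 'object' because plain pandas stores Python strings in an object array.
import pandas as pd

s = pd.Series([1, 2, 3]).astype(str)
print(s.dtype)                    # object
print(s.map(type).unique())       # [<class 'str'>]

# pandas >= 1.0 also offers an explicit, opt-in string dtype:
print(s.astype("string").dtype)   # string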
reading a fixed width .dat file in pandas is not very complicated using the pd.read_csv('file.dat', sep='\s+') or the pd.read_fwf('file.dat', widths=[7, ..]) method. But the file also gives a format string like this:
Format = (i7,1x,i7,1x,i2,1x,i2,1x,i2,1x,f5.1,1x,i4,1x,3i,1x,f4.1,1x,i1,1x,f4.1,1x,i3,1x,i4,1x,i4,1x,i3,1x,i4,2x,i1)
looking at the columns' content, I assume the character indicates the datatype (i -> int, f -> float, x -> separator) and the number is obviously the width of the column. Is this a standard notation? Is there a more pythonic way to read data files by just passing this format string, making scripts safe against format changes in the data file?
I noticed the colspecs argument of the read_fwf() function, but it takes a list of pairs (int, int), not the type of format string that is given.
First rows of the data file:
This is the standard Fortran-style way of describing a fixed-width record layout. The format is only really important if you are trying to write the file back out in an identical manner; for the purpose of reading it all into pandas you don't really care. If you want control over the specific data type of each column as you read it in, use the dtype parameter. In the example below I make column 'a' a 64-bit float and 'b' a 32-bit int.
import numpy as np
import pandas as pd

my_dtypes = {'a': np.float64, 'b': np.int32}
pd.read_csv('file.dat', sep=r'\s+', dtype=my_dtypes)
You don't have to specify every column, just the ones that you want. It's likely that pandas figured out most of this already though by default. After your call to read_csv() try
df = pd.read_csv(....)
print(df.dtypes)
this will show you the data type of each of your columns.
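If you do want to drive read_fwf() from the format line itself, here is a rough sketch; fortran_format_to_colspecs is a made-up helper name, and it only handles plain iW, fW.D and nX tokens, not repeat groups such as the 3i in the original format line.
import re
import pandas as pd

def fortran_format_to_colspecs(fmt):
    """Turn a simple Fortran-style format string such as
    '(i7,1x,i7,1x,f5.1,2x,i1)' into colspecs for pd.read_fwf."""
    colspecs, kinds, pos = [], [], 0
    for tok in fmt.strip().strip('()').split(','):
        tok = tok.strip().lower()
        skip = re.fullmatch(r'(\d*)x', tok)
        if skip:                                   # nX: n blank columns
            pos += int(skip.group(1) or 1)
            continue
        field = re.fullmatch(r'([if])(\d+)(?:\.\d+)?', tok)
        if field:                                  # iW or fW.D fields
            width = int(field.group(2))
            colspecs.append((pos, pos + width))
            kinds.append(float if field.group(1) == 'f' else int)
            pos += width
            continue
        raise ValueError(f'unsupported format token: {tok!r}')
    return colspecs, kinds

colspecs, kinds = fortran_format_to_colspecs('(i7,1x,i7,1x,i2,1x,f5.1,2x,i1)')
df = pd.read_fwf('file.dat', colspecs=colspecs, header=None)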
What exactly happens when Pandas issues this warning? Should I worry about it?
In [1]: read_csv(path_to_my_file)
/Users/josh/anaconda/envs/py3k/lib/python3.3/site-packages/pandas/io/parsers.py:1139:
DtypeWarning: Columns (4,13,29,51,56,57,58,63,87,96) have mixed types. Specify dtype option on import or set low_memory=False.
data = self._reader.read(nrows)
I assume that this means that Pandas is unable to infer the type from values on those columns. But if that is the case, what type does Pandas end up using for those columns?
Also, can the type always be recovered after the fact? (after getting the warning), or are there cases where I may not be able to recover the original info correctly, and I should pre-specify the type?
Finally, how exactly does low_memory=False fix the problem?
Revisiting mbatchkarov's link, low_memory is not deprecated.
It is now documented:
low_memory : boolean, default True
Internally process the file in chunks, resulting in lower memory use while
parsing, but possibly mixed type inference. To ensure no
mixed types either set False, or specify the type with the dtype
parameter. Note that the entire file is read into a single DataFrame
regardless, use the chunksize or iterator parameter to return the data
in chunks. (Only valid with C parser)
I have asked what resulting in mixed type inference means, and chris-b1 answered:
It is deterministic - types are consistently inferred based on what's
in the data. That said, the internal chunksize is not a fixed number
of rows, but instead bytes, so whether you get a mixed dtype warning
or not can feel a bit random.
So, what type does Pandas end up using for those columns?
This is answered by the following self-contained example:
from io import StringIO
import pandas as pd

df = pd.read_csv(StringIO('\n'.join([str(x) for x in range(1000000)] + ['a string'])))
DtypeWarning: Columns (0) have mixed types. Specify dtype option on import or set low_memory=False.
type(df.loc[524287,'0'])
Out[50]: int
type(df.loc[524288,'0'])
Out[51]: str
The first chunk of the csv data contained only ints, so it was converted to int;
the second chunk also contained a string, so all of its entries were kept as strings.
Can the type always be recovered after the fact? (after getting the warning)?
I guess re-exporting to csv and re-reading with low_memory=False should do the job.
How exactly does low_memory=False fix the problem?
It reads all of the file before deciding the type, therefore needing more memory.
low_memory is apparently kind of deprecated, so I wouldn't bother with it.
The warning means that some of the values in a column have one dtype (e.g. str), and some have a different dtype (e.g. float). I believe pandas uses the lowest common super type, which in the example I used would be object.
You should check your data, or post some of it here. In particular, look for missing values or inconsistently formatted int/float values. If you are certain your data is correct, then use the dtype parameter to help pandas out.
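For completeness, a sketch of both remedies (file and column names here are made up):
import pandas as pd

# Let the parser see the whole file before inferring types ...
df = pd.read_csv("big.csv", low_memory=False)

# ... or, more robustly, pin down the dtype of the offending columns
# so no guessing happens at all.
df = pd.read_csv("big.csv", dtype={"col_4": str, "col_13": str})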
I am just getting started with Pandas and I am reading in a csv file using the read_csv() method. The difficulty I am having is preventing pandas from converting my telephone numbers to large numbers, instead of keeping them as strings. I defined a converter which just left the numbers alone, but then they still converted to numbers. When I changed my converter to prepend a 'z' to the phone numbers, then they stayed strings. Is there some way to keep them strings without modifying the values of the fields?
Since pandas 0.11.0 you can use the dtype argument to explicitly specify the data type for each column:
import pandas

d = pandas.read_csv('foo.csv', dtype={'BAR': 'S10'})
It looks like you can't stop pandas from trying to convert numeric/boolean values in the CSV file. Take a look at the source code of pandas for the IO parsers, in particular the functions _convert_to_ndarrays and _convert_types.
https://github.com/pydata/pandas/blob/master/pandas/io/parsers.py
You can always assign the type you want after you have read the file:
df.phone = df.phone.astype(str)
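Note that the two approaches are not interchangeable for phone numbers: a toy example (made-up data) shows that casting after the read cannot bring back a leading zero, whereas requesting str at read time keeps the field exactly as written.
import io
import pandas as pd

csv = io.StringIO("name,phone\nalice,0123456789\n")

# Casting after the read is too late: the column was already parsed
# as an integer, so the leading zero is gone.
late = pd.read_csv(csv)
print(late["phone"].astype(str).iloc[0])    # '123456789'

# Requesting str at read time preserves the field as written.
csv.seek(0)
early = pd.read_csv(csv, dtype={"phone": str})
print(early["phone"].iloc[0])               # '0123456789'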