I am trying to save just the indices of a dataframe to a file. Here is what I've tried:
A)
np.savetxt("file_name", df.index.values)
returns:
TypeError: Mismatch between array dtype ('object') and format specifier ('%.18e')
B)
df.index.values.tofile("file_name")
returns:
IOError: cannot write object arrays to a file in binary mode
C)
with open("file_name","w") as f:
f.write("\n".join(df_1_indexes.values.tolist()))
Can someone please explain why A) and B) failed? I'm at a loss.
Cheers
The error in A) is likely because you have strings or some other kind of object in your index. The default format specifier in np.savetxt ('%.18e') assumes the data is float-like. I got around this by setting fmt='%s', though it may not be a reliable solution for every index.
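For example, a minimal sketch of that workaround (the frame and file name here are just placeholders):
import numpy as np
import pandas as pd

# Placeholder frame with a string index, the case that trips up the default float format
df = pd.DataFrame({"a": [1, 2, 3]}, index=["x", "y", "z"])
np.savetxt("file_name", df.index.values, fmt='%s')  # '%s' writes each entry as text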
B) doesn't yield any errors for me, using some basic examples of Index and MultiIndex. Your error is probably due to the specific type of elements in your index.
Note that there is an easier and more reliable way to save just the index. You can set the columns parameter of to_csv to an empty list, which suppresses all columns from the output:
df.to_csv('file_name', columns=[], header=False)
If your index has a name and you want the name to appear in the output (similar to how column names appear), remove header=False from the code above.
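A minimal sketch of both variants (the frame, index name and file names are placeholders):
import pandas as pd

df = pd.DataFrame({"a": [1, 2]}, index=pd.Index(["x", "y"], name="my_index"))
# Writes only the index values, one per line, with no header
df.to_csv("file_name", columns=[], header=False)
# Per the note above, the index name ("my_index") appears as the first line
df.to_csv("file_name_with_name", columns=[])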
In addition to root's answer: if you saved the index in a variable and have since deleted df, you can still write it out like this:
saved_index = df.index
pd.Series(saved_index, index=saved_index).to_csv('file_name', header=False, index=False)
Related
I'm reading a large CSV with dask, setting the dtype of a column to string and then setting that column as the index:
dataframe = dd.read_csv(file_path, dtype={"colName": "string"}, blocksize=100e6)
dataframe.set_index("colName")
and it throws the following error:
TypeError: Cannot interpret 'StringDtype' as a data type
Why does this happen? How can I solve it?
As stated in this comment on a dask bug report for a different issue: https://github.com/dask/dask/issues/7206#issuecomment-797221227
When constructing the dask Array's meta object, we're currently assuming the underlying array type is a NumPy array, when in this case, it's actually going to be a pandas StringArray. But unlike pandas, NumPy doesn't know how to handle a StringDtype.
Currently, changing the column type from string to object works around the issue, though it's unclear whether this is a bug or expected behavior:
dataframe = dd.read_csv(file_path, dtype={"colName": "object"}, blocksize=100e6)
dataframe = dataframe.set_index("colName")
When I create a DataFrame with
pd.read_excel(my_excel_file, dtype=str)
blank cells in the spreadsheet are replaced with the literal string 'nan'. On the other hand, a DataFrame generated from
pd.read_csv(my_csv_file, dtype=str)
replaces blanks with numpy.nan objects. Why is this?
You can pass the na_values argument to read_excel.
df = pd.read_excel(my_excel_file, na_values=[''], dtype=object)
I'll try to answer the "Why is this?" part. Reading an Excel file with pd.read_excel and dtype=str does not give a result consistent with pd.read_csv. The advantage of replacing blank cells with numpy.nan objects, as pd.read_csv does, is that you can then use pd.isna, which recognizes numpy.nan but not the literal string 'nan'.
There has been a lot of discussion about this, aimed at keeping pd.read_csv and pd.read_excel consistent. You can read more on the pandas GitHub issue: read_excel with dtype=str converts empty cells to the string 'nan' #20377
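A minimal sketch of that difference, using two tiny hypothetical Series just to show the behavior of pd.isna:
import numpy as np
import pandas as pd

s_csv_style = pd.Series([np.nan, "text"])    # blanks become numpy.nan, as with read_csv
s_excel_style = pd.Series(["nan", "text"])   # blanks become the string 'nan', as with read_excel + dtype=str
print(pd.isna(s_csv_style))    # True, False  -> the missing value is detected
print(pd.isna(s_excel_style))  # False, False -> the string 'nan' is not treated as missing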
I have a csv file containing numbers, "***", "(X)" and NaN values.
I use pd.read_csv() to import it into a dataframe.
All the values in df come out as "str" type.
I want to keep the numbers, convert them to float, and turn everything else into NaN.
Please help me. Thanks!
Try the na_values option of pd.read_csv(). For each column you can specify different values that should be treated as NaN. In your case this should work:
df = pd.read_csv('your_file.csv', na_values={'HC04_VC03': '(X)', 'HC04_VC04': '***'})
Pandas will then automatically choose a fitting dtype for your data, and in this case you get the desired float columns. You can also specify the data types as you read in the csv file, using the parameter dtype={'GEO.id2': np.int64, 'HC04_VC04': np.float64, 'HC02_VC05': np.float64} or any other valid dtypes of your choice. Use this option with care: setting the dtype will raise an error if the data cannot be converted to the desired type, e.g. if you don't get rid of all the '***' strings first.
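A minimal sketch combining both ideas (the column names and file name are taken from the question and answer above; adjust them for your data):
import numpy as np
import pandas as pd

# Treat the placeholder strings as NaN per column, and force a dtype where it is safe
df = pd.read_csv(
    'your_file.csv',
    na_values={'HC04_VC03': '(X)', 'HC04_VC04': '***'},
    dtype={'GEO.id2': np.int64},  # only safe if GEO.id2 has no missing values
)
print(df.dtypes)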
Alternatively, you could read in the csv file without specifying data types and then convert the columns afterwards using pd.to_numeric. For example,
df['GEO.id2'] = pd.to_numeric(df['GEO.id2'], errors='ignore')  # if any value can't be converted, the input is returned unchanged
The documentation lists other options (the errors parameter) for handling data that can't be converted.
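For instance, a minimal sketch with errors='coerce', which replaces unparsable values with NaN (hypothetical small dataframe using the placeholder strings from the question):
import pandas as pd

df = pd.DataFrame({'HC04_VC04': ['1.5', '***', '2.0'], 'HC04_VC03': ['3', '(X)', '4']})
df['HC04_VC04'] = pd.to_numeric(df['HC04_VC04'], errors='coerce')  # '***' -> NaN, the rest -> float
df['HC04_VC03'] = pd.to_numeric(df['HC04_VC03'], errors='coerce')  # '(X)' -> NaN, the rest -> float
print(df.dtypes)  # both columns become float64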
I have a csv file with 60M+ rows. I am only interested in a subset of these rows and would like to put them in a dataframe.
Here is the code I am using:
iter_csv = pd.read_csv('/Users/xxxx/Documents/globqa-pgurlbymrkt-Report.csv', iterator=True, chunksize=1000)
df = pd.concat([chunk[chunk['Site Market (evar13)'].str.contains("Canada", na=False)] for chunk in iter_csv])
This is based on the answer here: pandas: filter lines on load in read_csv
I get the following error:
AttributeError: Can only use .str accessor with string values, which use np.object_ dtype in pandas
I can't seem to figure out what's wrong and would appreciate guidance here.
First, verify that the data in that column is actually string data.
What does the column look like in the chunk where .str.contains() fails?
If a chunk happens to contain only missing values in that column, pandas infers a numeric (float) dtype for it rather than object, and the .str accessor then raises exactly this error.
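A minimal sketch of one way around this, based on the code in the question: forcing the column to be read as strings so the .str accessor works even in chunks that are all NaN.
import pandas as pd

iter_csv = pd.read_csv(
    '/Users/xxxx/Documents/globqa-pgurlbymrkt-Report.csv',
    iterator=True,
    chunksize=1000,
    dtype={'Site Market (evar13)': str},  # keep this column as strings (object dtype) in every chunk
)
df = pd.concat(
    chunk[chunk['Site Market (evar13)'].str.contains('Canada', na=False)]
    for chunk in iter_csv
)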
What exactly happens when Pandas issues this warning? Should I worry about it?
In [1]: read_csv(path_to_my_file)
/Users/josh/anaconda/envs/py3k/lib/python3.3/site-packages/pandas/io/parsers.py:1139:
DtypeWarning: Columns (4,13,29,51,56,57,58,63,87,96) have mixed types. Specify dtype option on import or set low_memory=False.
data = self._reader.read(nrows)
I assume that this means that Pandas is unable to infer the type from values on those columns. But if that is the case, what type does Pandas end up using for those columns?
Also, can the type always be recovered after the fact? (after getting the warning), or are there cases where I may not be able to recover the original info correctly, and I should pre-specify the type?
Finally, how exactly does low_memory=False fix the problem?
Revisiting mbatchkarov's link, low_memory is not deprecated.
It is now documented:
low_memory : boolean, default True
Internally process the file in chunks, resulting in lower memory use while
parsing, but possibly mixed type inference. To ensure no
mixed types either set False, or specify the type with the dtype
parameter. Note that the entire file is read into a single DataFrame
regardless, use the chunksize or iterator parameter to return the data
in chunks. (Only valid with C parser)
I have asked what "possibly mixed type inference" means, and chris-b1 answered:
It is deterministic - types are consistently inferred based on what's
in the data. That said, the internal chunksize is not a fixed number
of rows, but instead bytes, so whether you get a mixed dtype warning
or not can feel a bit random.
So, what type does Pandas end up using for those columns?
This is answered by the following self-contained example:
import pandas as pd
from io import StringIO

df = pd.read_csv(StringIO('\n'.join([str(x) for x in range(1000000)] + ['a string'])))
DtypeWarning: Columns (0) have mixed types. Specify dtype option on import or set low_memory=False.
type(df.loc[524287,'0'])
Out[50]: int
type(df.loc[524288,'0'])
Out[51]: str
The first internal chunk of the csv contained only integers, so it was converted to int;
the second chunk also contained a string, so all of its entries were kept as strings.
Can the type always be recovered after the fact (i.e. after getting the warning)?
I guess re-exporting to csv and re-reading with low_memory=False should do the job.
How exactly does low_memory=False fix the problem?
It reads all of the file before deciding the type, therefore needing more memory.
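As a minimal sketch, re-using the synthetic example above (the exact per-value types depend on your data, but the column dtype becomes object because it mixes numbers with a string):
import pandas as pd
from io import StringIO

csv_data = StringIO('\n'.join([str(x) for x in range(1000000)] + ['a string']))
df = pd.read_csv(csv_data, low_memory=False)  # the whole column is inferred in one pass, no DtypeWarning
print(df['0'].dtype)  # object, because the column mixes numbers with a string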
low_memory is apparently kind of deprecated, so I wouldn't bother with it.
The warning means that some of the values in a column were parsed as one dtype (e.g. str) and some as a different dtype (e.g. float). I believe pandas then falls back to the lowest common super type, which in a case like this is object.
You should check your data, or post some of it here. In particular, look for missing values or inconsistently formatted int/float values. If you are certain your data is correct, then use the dtype parameter of read_csv to help pandas out.
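A minimal sketch of that last suggestion (the path and column names are placeholders for whichever columns the warning lists):
import pandas as pd

path_to_my_file = "my_file.csv"  # placeholder; use your own path
# Hypothetical column names: read the columns flagged in the warning as plain strings
df = pd.read_csv(path_to_my_file, dtype={"col_4": str, "col_13": str})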