pandas.read_csv changes values on import

I have a csv file that looks as so:
"3040",0.24948,-0.89496
"3041",0.25344,-0.89496
"3042",0.2574,-0.891
"3043",0.2574,-0.89496
"3044",0.26136,-0.89892
"3045",0.2574,-0.891
"3046",0.26532,-0.9108
"3047",0.27324,-0.9306
"3048",0.23424,-0.8910
This data is "reference" data intended to validate calculations run on other data. Reading the data in gives me this:
In [2]: test = pd.read_csv('test.csv', header=0, names=['lx', 'ly'])
In [3]: test
Out[3]:
lx ly
3041 0.25344 -0.89496
3042 0.25740 -0.89100
3043 0.25740 -0.89496
3044 0.26136 -0.89892
3045 0.25740 -0.89100
3046 0.26532 -0.91080
3047 0.27324 -0.93060
3048 0.23424 -0.89100
This looks as you might expect. The problem is that these values are not quite as they appear, and comparisons with them don't work:
In [4]: test.loc[3042,'ly']
Out[4]: -0.8909999999999999
Why is it doing that? It seems to be specific to values in the csv that only have 3 places to the right of the decimal, at least so far:
In [5]: test.loc[3048,'ly']
Out[5]: -0.891
In [6]: test.loc[3047,'ly']
Out[6]: -0.9306
In [7]: test.loc[3046,'ly']
Out[7]: -0.9108
I just want the exact values from the csv, not an interpretation. Ideas?
Update:
I set float_precision='round_trip' in the read_csv parameters and that seemed to fix it. Documented here. What I don't understand is why by default the data is being changed as it is read in. This doesn't seem good for comparing data sets. Is there a better way to read in data for testing against other dataframes?
Update with answer:
Changing float_precision is what I went with, although I still don't understand how pandas can misrepresent the data in this way. I get that a conversion happens on import, but 0.891 should be 0.891.
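(As an aside, a minimal sketch of why 0.891 cannot be stored as exactly 0.891 in a float64 column: a binary double can only hold the nearest representable value, and different parsers may even settle on slightly different neighbouring doubles, which is consistent with the -0.8909999999999999 versus -0.89100000000000001 outputs above. This is standard floating-point behaviour, not something pandas invents:)
print(0.1 + 0.2 == 0.3)    # False: the classic illustration of binary rounding
print('%.20f' % 0.891)     # shows the stored double with more digits than repr prints
print(repr(0.891))         # '0.891': repr picks the shortest string that round-trips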
For my comparison case rather than testing equivalence I went with something different:
# rather than
df1 == df2
# I tested as
(df1 / df2) - 1 > 1e-14
This works fine for my purposes.
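For completeness, a minimal sketch of that tolerance-based comparison (the 1e-14 threshold is the one from the snippet above; the two tiny frames are invented for illustration, and abs() is added so that negative deviations are caught as well):
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'ly': [-0.891, -0.89496]})                # reference values
df2 = pd.DataFrame({'ly': [-0.8909999999999999, -0.89496]})   # values as parsed

# relative-difference check, as in the update above
mismatch = ((df1 / df2) - 1).abs() > 1e-14
print(mismatch.any().any())    # False: everything agrees to within 1e-14

# numpy's built-in tolerance check gives the same verdict
print(np.isclose(df1.values, df2.values, rtol=1e-14, atol=0).all())   # True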

For comparison purposes with other DataFrames, you can use pd.option_context (note that I took off header=0 because with it your first row doesn't show up in the DataFrame):
import pandas as pd

test = pd.read_csv('./Desktop/dummy.csv', names=['lx', 'ly'])
test.dtypes

with pd.option_context('display.precision', 5):
    print(test.loc[3042, 'ly'])
output:
-0.891
This isn't the nicest fix, but adding
float_precision='round_trip'
won't always fix your problem either:
import pandas as pd

test = pd.read_csv('./Desktop/dummy.csv', names=['lx', 'ly'], float_precision='round_trip')
test.dtypes
test.loc[3042, 'ly']
output:
-0.89100000000000001
With display.precision, all code under the with statement runs with the display precision you set, so DataFrames compared under it will show the values you expect.

It seems it is linked to the data type you are loading, which in your case is float64. Using float32 you get what you expect, so you can change the dtype while loading:
import numpy as np

test = pd.read_csv('test.csv', header=0, names=['lx', 'ly'],
                   dtype={'lx': np.float32, 'ly': np.float32})
or afterward
print(type(test.loc[3042,'ly'])) # <class 'numpy.float64'>
test[['lx', 'ly']] = test[['lx', 'ly']].astype('float32')
print(test.loc[3042,'ly']) # -0.891

Related

How to read only first n rows of parquet files to pandas dataframe?

I want to read only the first n rows in pandas. I've pasted the code I tried below.
def s3_read_file(src_bucket_name, s3_path, s3_filename):
    try:
        src_bucket_name = "lla.analytics.dev"
        s3_path = "bigdata/dna/fixed/cwp/dt={}/".format(date_fmt)
        result = s3.list_objects(Bucket=src_bucket_name, Prefix=s3_path)  # getting dictionary
        for i in result["Contents"]:
            s3_filename = i['Key']
            # print(s3_filename)
            res = s3.get_object(Bucket=src_bucket_name, Key=s3_filename)  # s3://lla.analytics.dev/bigdata/dna/fixed/cwp/dt=2021-12-05/file.parquet
            # print(res)
            # df = pd.read_parquet(io.BytesIO(res['Body'].read()))
            # print(df)
            pf = spark.read.parquet().limit(1)
            logger.info("****")
            logging.info('dataframe head - {}'.format(pf.count()))
            logger.info("****")
    except Exception as error:
        logger.error(error)
I'm facing the error below. I also tried with PySpark but couldn't get it working:
ERROR:root:read_table() got an unexpected keyword argument 'nrows'
I also tried the line below, but BytesIO doesn't take two arguments:
#df = pd.read_parquet(io.BytesIO(s3_obj['Body'].read(),nrows = 10))
This may be a good place to start.
You can pass a subset of columns to read, which can be much faster than reading the whole file (due to the columnar layout):
pq.read_table('example.parquet', columns=['one', 'three'])
Out[11]:
pyarrow.Table
one: double
three: bool
----
one: [[-1,null,2.5]]
three: [[true,false,true]]
When reading a subset of columns from a file that used a Pandas dataframe as the source, we use read_pandas to maintain any additional index column data.
Also, you could write a loop and use read_row_group (a sketch follows the example output below).
https://arrow.apache.org/docs/python/parquet.html#:~:text=row%20groups%20with-,read_row_group,-%3A
parquet_file.num_row_groups
Out[22]: 1
parquet_file.read_row_group(0)
Out[23]:
pyarrow.Table
one: double
two: string
three: bool
__index_level_0__: string
----
one: [[-1,null,2.5]]
two: [["foo","bar","baz"]]
three: [[true,false,true]]
__index_level_0__: [["a","b","c"]]
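Putting the read_row_group suggestion together, here is a rough sketch of reading only the first n rows of a parquet file with pyarrow (the file name and n are placeholders, and it reads whole row groups, so it only saves work when the file has more than one):
import pyarrow as pa
import pyarrow.parquet as pq

def read_first_n_rows(path, n):
    """Collect row groups until at least n rows are available, then trim."""
    parquet_file = pq.ParquetFile(path)
    tables, rows = [], 0
    for i in range(parquet_file.num_row_groups):
        table = parquet_file.read_row_group(i)
        tables.append(table)
        rows += table.num_rows
        if rows >= n:
            break
    return pa.concat_tables(tables).slice(0, n).to_pandas()

df_head = read_first_n_rows('example.parquet', 10)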

How does Pandas.read_csv type casting work?

Using pandas.read_csv with the parse_dates option and a custom date parser, I find pandas has a mind of its own about the data type it's reading.
Sample csv:
"birth_date", "name"
"","Dr. Who"
"1625", "Rembrandt"
"1533", "Michel"
The actual datecleaner is here, but what I do boils down to this:
import pandas as pd

def dateclean(date):
    return str(int(date))  # Note: we return A STRING

df = pd.read_csv(
    'my.csv',
    parse_dates=['birth_date'],
    date_parser=dateclean,
    engine='python'
)
print(df.birth_date)
Output:
0 NaN
1 1625.0
2 1533.0
Name: birth_date, dtype: float64
I get type float64, even though I specified that a string be returned. Also, if I take out the first line in the CSV (the one with the empty birth_date), I get type int. The workaround is easy:
return '"{}"'.format(int(date))
Is there a better way?
In data analysis, I can imagine it's useful that Pandas will say 'Hey dude, you thought you were reading strings, but in fact they're numbers'. But what's the rationale for overruling me when I tell it not to?
Using parse_dates / date_parser looks a bit complicated to me, unless you want to generalise your import over many date columns. I think you have more control with the converters parameter, where your dateclean() function fits in. You can also experiment with the dtype parameter.
The problem with the original dateclean() function is that it fails on the "" value, because int("") raises ValueError. Pandas seems to fall back to its standard import when it encounters this problem, whereas with converters it will fail explicitly.
Below is the code to demonstrate a fix:
import pandas as pd
from pathlib import Path

doc = """"birth_date", "name"
"","Dr. Who"
"1625", "Rembrandt"
"1533", "Michel"
"""
Path('my.csv').write_text(doc)

def dateclean(date):
    try:
        return str(int(date))
    except ValueError:
        return ''

df = pd.read_csv(
    'my.csv',
    parse_dates=['birth_date'],
    date_parser=dateclean,
    engine='python'
)

df2 = pd.read_csv(
    'my.csv',
    converters={'birth_date': dateclean}
)

print(df2.birth_date)
Hope it helps.
The problem is that date_parser is designed specifically for conversion to datetime:
date_parser : function, default None
Function to use for converting a sequence of string columns to an array of datetime instances.
There is no reason you should expect this parameter to work for other types. Instead, you can use the converters parameter. Here we use toolz.compose to apply int and then str. Alternatively, you can use lambda x: str(int(x)).
from io import StringIO
import pandas as pd
from toolz import compose

mystr = StringIO('''"birth_date", "name"
"","Dr. Who"
"1625", "Rembrandt"
"1533", "Michel"''')

df = pd.read_csv(mystr,
                 converters={'birth_date': compose(str, int)},
                 engine='python')
print(df.birth_date)
0 NaN
1 1625
2 1533
Name: birth_date, dtype: object
If you need to replace NaN with empty strings, you can post-process with fillna:
print(df.birth_date.fillna(''))
0
1 1625
2 1533
Name: birth_date, dtype: object

pandas to_csv and then read_csv results in numpy.datetime64 messed up due to UTC

Here is my problem in short: I am trying to write my data (containing, among other things, np.datetime64 values) to csv and then read it back, and I want my times not to change...
As discussed in many places, np.datetime64 keeps everything binary and in UTC in memory, but interprets naive strings as local time.
Here is a trivial example of my problem: reading a file back with pd.read_csv("foo") that was saved with df.to_csv("foo") ends up altering the times:
In[184]: num = np.datetime64(datetime.datetime.now())
In[185]: num
Out[181]: numpy.datetime64('2015-10-28T19:19:42.408000+0100')
In[186]: df = pd.DataFrame({"Time":[num]})
In[187]: df
Out[183]:
Time
0 2015-10-28 18:19:42.408000
In[188]: df.to_csv("foo")
In[189]: df2=pd.read_csv("foo")
In[190]: df2
Out[186]:
Unnamed: 0 Time
0 0 2015-10-28 18:19:42.408000
In[191]: np.datetime64(df2.Time[0])
Out[187]: numpy.datetime64('2015-10-28T18:19:42.408000+0100')
In[192]: num == np.datetime64(df2.Time[0])
Out[188]: False
(as usual:)
import numpy as np
import pandas as pd
There are a very large number of questions and lots of info on the web, but I've been googling for a while now and have not been able to find an answer on how to overcome this. There should be some way to save the data in Zulu time, or read it back assuming UTC, but I have not found any directions on what would be the best (or even a good) way to do it.
I can do
In[193]: num == np.datetime64(df2.Time[0]+"Z")
Out[189]: True
but that seems really bad to me in terms of practice, portability and efficiency (plus it's annoying that the default save and read messes things up).
The numpy constructor is simply broken and will rarely do what you want. I would avoid it. Use instead:
pd.read_csv(StringIO(df.to_csv(index=False)),parse_dates=['Time'])
np.datetime64 is merely displayed in the local timezone. It is already stored in UTC.
In [42]: num = np.datetime64(datetime.datetime.now())
In [43]: num
Out[43]: numpy.datetime64('2015-10-28T10:02:22.298130-0400')
In [44]: df = pd.DataFrame({"Time":[num]})
In [45]: df
Out[45]:
Time
0 2015-10-28 14:02:22.298130
In [46]: pd.read_csv(StringIO(df.to_csv(index=False)),parse_dates=['Time'])
Out[46]:
Time
0 2015-10-28 14:02:22.298130
In [47]: pd.read_csv(StringIO(df.to_csv(index=False)),parse_dates=['Time']).Time.values
Out[47]: array(['2015-10-28T10:02:22.298130000-0400'], dtype='datetime64[ns]')
Out[47] is just the local display. The time is the same as above.
Internally datetimes are kept as an int64 of ns since epoch.
In [7]: Timestamp('2015-10-28 14:02:22.298130')
Out[7]: Timestamp('2015-10-28 14:02:22.298130')
In [8]: Timestamp('2015-10-28 14:02:22.298130').value
Out[8]: 1446040942298130000
In [9]: np.array([1446040942298130000],dtype='M8[ns]')
Out[9]: array(['2015-10-28T10:02:22.298130000-0400'], dtype='datetime64[ns]')
In [10]: Timestamp(np.array([1446040942298130000],dtype='M8[ns]').view('i8').item())
Out[10]: Timestamp('2015-10-28 14:02:22.298130')
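To make the recommended round-trip concrete, here is a small sketch (it assumes you compare the parsed datetime64 values rather than the raw strings):
from io import StringIO
import datetime
import numpy as np
import pandas as pd

num = np.datetime64(datetime.datetime.now())
df = pd.DataFrame({"Time": [num]})

# write to CSV and read back, parsing the Time column as datetime64[ns]
df2 = pd.read_csv(StringIO(df.to_csv(index=False)), parse_dates=['Time'])

# the stored values survive the round-trip unchanged
print(df.Time.values[0] == df2.Time.values[0])   # True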

Get the same hash value for a Pandas DataFrame each time

My goal is to get a unique hash value for a DataFrame that I obtain from a .csv file.
The whole point is to get the same hash each time I call hash() on it.
My idea was to create the function
def _get_array_hash(arr):
    arr_hashable = arr.values
    arr_hashable.flags.writeable = False
    hash_ = hash(arr_hashable.data)
    return hash_
which takes the underlying numpy array, makes it immutable, and hashes its buffer.
INLINE UPD.
As of 08.11.2016, this version of the function doesn't work anymore. Instead, you should use
hash(df.values.tobytes())
See comments for the Most efficient property to hash for numpy array.
END OF INLINE UPD.
It works for a regular pandas DataFrame:
In [12]: data = pd.DataFrame({'A': [0], 'B': [1]})
In [13]: _get_array_hash(data)
Out[13]: -5522125492475424165
In [14]: _get_array_hash(data)
Out[14]: -5522125492475424165
But when I apply it to a DataFrame obtained from a .csv file:
In [15]: fpath = 'foo/bar.csv'
In [16]: data_from_file = pd.read_csv(fpath)
In [17]: _get_array_hash(data_from_file)
Out[17]: 6997017925422497085
In [18]: _get_array_hash(data_from_file)
Out[18]: -7524466731745902730
Can somebody explain to me how that's possible?
I can create a new DataFrame out of it, like
new_data = pd.DataFrame(data=data_from_file.values,
                        columns=data_from_file.columns,
                        index=data_from_file.index)
and it works again
In [25]: _get_array_hash(new_data)
Out[25]: -3546154109803008241
In [26]: _get_array_hash(new_data)
Out[26]: -3546154109803008241
But my goal is to preserve the same hash value for a dataframe across application launches in order to retrieve some value from cache.
As of pandas 0.20.1, you can use the little-known (and poorly documented) hash_pandas_object (source code), which was recently made public in pandas.util. It returns one hash value for each row of the dataframe (and works on Series etc. too).
import pandas as pd
import numpy as np
np.random.seed(42)
arr = np.random.choice(['foo', 'bar', 42], size=(3,4))
df = pd.DataFrame(arr)
print(df)
# 0 1 2 3
# 0 42 foo 42 42
# 1 foo foo 42 bar
# 2 42 42 42 42
from pandas.util import hash_pandas_object
h = hash_pandas_object(df)
print(h)
# 0 5559921529589760079
# 1 16825627446701693880
# 2 7171023939017372657
# dtype: uint64
You can always do hash_pandas_object(df).sum() if you want an overall hash of all rows.
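As a quick sketch of turning that into a cache key that stays stable across reads and application launches (hash_pandas_object is deterministic with its default hash key; 'data.csv' is a placeholder for your file):
import pandas as pd
from pandas.util import hash_pandas_object

df_a = pd.read_csv('data.csv')
df_b = pd.read_csv('data.csv')

# one uint64 per row, reduced to a single overall value
digest_a = hash_pandas_object(df_a).sum()
digest_b = hash_pandas_object(df_b).sum()
assert digest_a == digest_b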
Joblib provides a hashing function optimized for objects containing numpy arrays (e.g. pandas dataframes).
import joblib
joblib.hash(df)
I had a similar problem: checking whether a dataframe has changed. I solved it by hashing the msgpack serialization string. This seems stable across different reloads of the same data.
import pandas as pd
import hashlib
DATA_FILE = 'data.json'
data1 = pd.read_json(DATA_FILE)
data2 = pd.read_json(DATA_FILE)
assert hashlib.md5(data1.to_msgpack()).hexdigest() == hashlib.md5(data2.to_msgpack()).hexdigest()
assert hashlib.md5(data1.values.tobytes()).hexdigest() != hashlib.md5(data2.values.tobytes()).hexdigest()
This function seems to work fine:
from hashlib import sha256

def hash_df(df):
    s = str(df.columns) + str(df.index) + str(df.values)
    return sha256(s.encode()).hexdigest()
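A quick usage sketch (the frame is invented):
import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [3.0, 4.0]})
print(hash_df(df) == hash_df(df.copy()))   # True: same content, same digest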

Looking for a python datastructure for cleaning/annotating large datasets

I'm doing a lot of cleaning, annotating and simple transformations on very large twitter datasets (~50M messages). I'm looking for some kind of data structure that would carry column info the way pandas does, but work with iterators rather than reading the whole dataset into memory at once. I'm considering writing my own, but I wondered if there was something with similar functionality out there. I know I'm not the only one doing things like this!
Desired functionality:
>>> ds = DataStream.read_sql("SELECT id, message from dataTable WHERE epoch < 129845")
>>> ds.columns
['id', 'message']
>>> ds.iterator.next()
[2385, "Hi it's me, Sally!"]
>>> ds = datastream.read_sql("SELECT id, message from dataTable WHERE epoch < 129845")
>>> ds_tok = get_tokens(ds)
>>> ds_tok.columns
['message_id', 'token', 'n']
>>> ds_tok.iterator.next()
[2385, "Hi", 0]
>>> ds_tok.iterator.next()
[2385, "it's", 1]
>>> ds_tok.iterator.next()
[2385, "me", 2]
>>> ds_tok.to_sql(db_info)
UPDATE: I've settled on a combination of dict iterators and pandas dataframes to satisfy these needs.
As commented, there is a chunksize argument for read_sql, which means you can work on the sql results piecemeal. I would probably use an HDFStore to save the intermediary results... or you could just append them back to another sql table:
dfs = pd.read_sql(..., chunksize=100000)

store = pd.HDFStore("store.h5")
for df in dfs:
    clean_df = ...  # whatever munging you have to do
    store.append("df", clean_df)
(see hdf5 section of the docs), or
dfs = pd.read_sql(..., chunksize=100000)
for df in dfs:
    clean_df = ...
    clean_df.to_sql(..., if_exists='append')
see the sql section of the docs.
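If you also want the row-by-row iterator interface sketched in the question on top of the chunked reads, here is a rough sketch (query, conn and the chunk size are placeholders for your own setup):
import pandas as pd

def stream_rows(query, conn, chunksize=100000):
    """Yield one row at a time as a dict, reading from SQL in chunks."""
    for chunk in pd.read_sql(query, conn, chunksize=chunksize):
        for row in chunk.to_dict('records'):
            yield row

# usage sketch:
# ds = stream_rows("SELECT id, message FROM dataTable WHERE epoch < 129845", conn)
# next(ds)   # -> {'id': 2385, 'message': "Hi it's me, Sally!"}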
