Dask OutOfBoundsDatetime when reading parquet files - python

The following code fails with
pandas._libs.tslibs.np_datetime.OutOfBoundsDatetime: Out of bounds nanosecond timestamp: -221047-10-07 10:43:35
from pathlib import Path
import dask.dataframe as dd
import numpy as np
import pandas as pd
import tempfile

def run():
    temp_folder = tempfile.TemporaryDirectory()
    rng = np.random.default_rng(42)
    filenames = []
    for i in range(2):
        filename = Path(temp_folder.name, f"file_{i}.gzip")
        filenames.append(filename)
        df = pd.DataFrame(
            data=rng.normal(size=(365, 1500)),
            index=pd.date_range(
                start="2021-01-01",
                end="2022-01-01",
                closed="left",
                freq="D",
            ),
        )
        df.columns = df.columns.astype(str)
        df.to_parquet(filename, compression="gzip")
    df = dd.read_parquet(filenames)
    result = df.mean().mean().compute()
    temp_folder.cleanup()
    return result

if __name__ == "__main__":
    run()
Why does this (sample) code fail?
What I'm trying to do:
The loop mimics creating data that is larger than memory, in batches.
In the next step I'd like to read that data back from the files and work with it in dask.
Observations:
If I only read one file,
for i in range(1):
the code works.
If I don't use the DatetimeIndex,
df = pd.DataFrame(
    data=rng.normal(size=(365, 1500)),
)
the code works.
If I use pandas only,
df = pd.read_parquet(filenames)
result = df.mean().mean()
the code works (which is odd, since read_parquet in pandas only expects one path, not a collection).
If I use the distributed client with concat as suggested here, I get a similar error: pandas._libs.tslibs.np_datetime.OutOfBoundsDatetime: Out of bounds nanosecond timestamp: 68024-12-20 01:46:56
Therefore I omitted the client from my sample.

Thanks to the helpful comment from @mdurant, specifying the engine explicitly helped:
engine = 'fastparquet'  # or 'pyarrow'
df.to_parquet(filename, compression="gzip", engine=engine)
df = dd.read_parquet(filenames, engine=engine)
Apparently engine='auto' selects different engines in dask vs. pandas when more than one parquet engine is installed.
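A quick sketch for checking which parquet engines are importable in your environment; having both installed is the precondition for 'auto' resolving differently in dask and pandas:

# Check (sketch) which parquet engines are available to the current interpreter
import importlib.util

for name in ("pyarrow", "fastparquet"):
    found = importlib.util.find_spec(name) is not None
    print(name, "installed" if found else "not installed")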
Sidenote:
I've tried different engine combinations, and this one triggers the error from the question:
df.to_parquet(filename, compression="gzip", engine='pyarrow')
df = dd.read_parquet(filenames, engine='fastparquet')

How to use pandas to read several files at a time? [duplicate]

I would like to read several CSV files from a directory into pandas and concatenate them into one big DataFrame. I have not been able to figure it out though. Here is what I have so far:
import glob
import pandas as pd

# Get data file names
path = r'C:\DRO\DCL_rawdata_files'
filenames = glob.glob(path + "/*.csv")

dfs = []
for filename in filenames:
    dfs.append(pd.read_csv(filename))

# Concatenate all data into one DataFrame
big_frame = pd.concat(dfs, ignore_index=True)
I guess I need some help within the for loop?
See pandas: IO tools for all of the available .read_ methods.
Try the following code if all of the CSV files have the same columns.
I have added header=0 so that the first row of each CSV file is used as the column names.
import pandas as pd
import glob
import os
path = r'C:\DRO\DCL_rawdata_files'  # use your path
all_files = glob.glob(os.path.join(path, "*.csv"))

li = []
for filename in all_files:
    df = pd.read_csv(filename, index_col=None, header=0)
    li.append(df)

frame = pd.concat(li, axis=0, ignore_index=True)
Or, with attribution to a comment from Sid.
all_files = glob.glob(os.path.join(path, "*.csv"))
df = pd.concat((pd.read_csv(f) for f in all_files), ignore_index=True)
It's often necessary to identify each sample of data, which can be accomplished by adding a new column to the dataframe.
pathlib from the standard library will be used for this example. It treats paths as objects with methods, instead of strings to be sliced.
Imports and Setup
from pathlib import Path
import pandas as pd
import numpy as np
path = r'C:\DRO\DCL_rawdata_files' # or unix / linux / mac path
# Get the files from the path provided in the OP
files = Path(path).glob('*.csv') # .rglob to get subdirectories
Option 1:
Add a new column with the file name
dfs = list()
for f in files:
    data = pd.read_csv(f)
    # .stem is a pathlib attribute that gives the file name without the extension
    data['file'] = f.stem
    dfs.append(data)

df = pd.concat(dfs, ignore_index=True)
Option 2:
Add a new column with a generic name using enumerate
dfs = list()
for i, f in enumerate(files):
    data = pd.read_csv(f)
    data['file'] = f'File {i}'
    dfs.append(data)

df = pd.concat(dfs, ignore_index=True)
Option 3:
Create the dataframes with a list comprehension, and then use np.repeat to add a new column.
[f'S{i}' for i in range(len(dfs))] creates a list of strings to name each dataframe.
[len(df) for df in dfs] creates a list of lengths
Attribution for this option goes to this plotting answer.
# Read the files into dataframes
dfs = [pd.read_csv(f) for f in files]
# Combine the list of dataframes
df = pd.concat(dfs, ignore_index=True)
# Add a new column
df['Source'] = np.repeat([f'S{i}' for i in range(len(dfs))], [len(df) for df in dfs])
Option 4:
One-liners using .assign to create the new column, with attribution to a comment from C8H10N4O2.
df = pd.concat((pd.read_csv(f).assign(filename=f.stem) for f in files), ignore_index=True)
or
df = pd.concat((pd.read_csv(f).assign(Source=f'S{i}') for i, f in enumerate(files)), ignore_index=True)
An alternative to darindaCoder's answer:
path = r'C:\DRO\DCL_rawdata_files' # use your path
all_files = glob.glob(os.path.join(path, "*.csv")) # advisable to use os.path.join as this makes concatenation OS independent
df_from_each_file = (pd.read_csv(f) for f in all_files)
concatenated_df = pd.concat(df_from_each_file, ignore_index=True)
# doesn't create a list, nor does it append to one
import glob
import os
import pandas as pd
df = pd.concat(map(pd.read_csv, glob.glob(os.path.join('', "my_files*.csv"))))
Almost all of the answers here are either unnecessarily complex (glob pattern matching) or rely on additional third-party libraries. You can do this in two lines using everything Pandas and Python (all versions) already have built in.
For a few files - one-liner
df = pd.concat(map(pd.read_csv, ['d1.csv', 'd2.csv','d3.csv']))
For many files
import os
filepaths = [f for f in os.listdir(".") if f.endswith('.csv')]
df = pd.concat(map(pd.read_csv, filepaths))
For No Headers
If you have specific things you want to change with pd.read_csv (i.e., no headers) you can make a separate function and call that with your map:
def f(i):
    return pd.read_csv(i, header=None)

df = pd.concat(map(f, filepaths))
This pandas line, which sets the df, utilizes three things:
Python's map(function, iterable) sends the iterable (our list of every CSV path in filepaths) to the function (pd.read_csv()).
pandas' read_csv() function reads in each CSV file as normal.
pandas' concat() brings all of these together under one df variable.
Easy and Fast
Import two or more CSV files without having to make a list of names.
import glob
import pandas as pd
df = pd.concat(map(pd.read_csv, glob.glob('data/*.csv')))
The Dask library can read a dataframe from multiple files:
>>> import dask.dataframe as dd
>>> df = dd.read_csv('data*.csv')
(Source: https://examples.dask.org/dataframes/01-data-access.html#Read-CSV-files)
The Dask dataframes implement a subset of the Pandas dataframe API. If all the data fits into memory, you can call df.compute() to convert the dataframe into a Pandas dataframe.
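A minimal sketch of that lazy-then-materialize workflow, assuming the same 'data*.csv' pattern as above:

import dask.dataframe as dd

ddf = dd.read_csv('data*.csv')   # lazy: nothing is read yet, one partition per file/block
print(ddf.npartitions)           # number of partitions dask created
pdf = ddf.compute()              # materialize into a single pandas DataFrame (must fit in memory)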
I googled my way into Gaurav Singh's answer.
However, as of late, I am finding it faster to do any manipulation using NumPy and then assign it once to a dataframe, rather than manipulating the dataframe itself iteratively, and it seems to work in this solution too.
I sincerely want anyone hitting this page to consider this approach, but I don't want to attach this huge piece of code as a comment and make it less readable.
You can leverage NumPy to really speed up the dataframe concatenation.
import os
import glob
import pandas as pd
import numpy as np
path = "my_dir_full_path"
allFiles = glob.glob(os.path.join(path, "*.csv"))

np_array_list = []
for file_ in allFiles:
    df = pd.read_csv(file_, index_col=None, header=0)
    np_array_list.append(df.to_numpy())  # .as_matrix() was removed in newer pandas

comb_np_array = np.vstack(np_array_list)
big_frame = pd.DataFrame(comb_np_array)
big_frame.columns = ["col1", "col2", ...]  # placeholder: list your actual column names
Timing statistics:
total files: 192
avg lines per file: 8492
-- approach 1, without NumPy -- 8.248656988143921 seconds
total records: 1630571
-- approach 2, with NumPy -- 2.289292573928833 seconds
A one-liner using map, but if you'd like to specify additional arguments, you could do:
import pandas as pd
import glob
import functools
df = pd.concat(map(functools.partial(pd.read_csv, sep='|', compression=None),
                   glob.glob("data/*.csv")))
Note: map by itself does not let you supply additional arguments.
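If you prefer not to use functools, a lambda works the same way (a small sketch with the same hypothetical data/*.csv pattern and read_csv arguments):

import glob
import pandas as pd

# A lambda lets you pass extra read_csv arguments inside map as well
df = pd.concat(
    map(lambda f: pd.read_csv(f, sep='|', compression=None), glob.glob("data/*.csv"))
)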
If you want to search recursively (Python 3.5 or above), you can do the following:
from glob import iglob
import pandas as pd
path = r'C:\user\your\path\**\*.csv'
all_rec = iglob(path, recursive=True)
dataframes = (pd.read_csv(f) for f in all_rec)
big_dataframe = pd.concat(dataframes, ignore_index=True)
Note that the last three lines can be expressed in a single line:
df = pd.concat((pd.read_csv(f) for f in iglob(path, recursive=True)), ignore_index=True)
You can find the documentation of ** here. Also, I used iglob instead of glob, as it returns an iterator instead of a list.
EDIT: Multiplatform recursive function:
You can wrap the above into a multiplatform function (Linux, Windows, Mac), so you can do:
df = read_df_rec(r'C:\user\your\path', '*.csv')
Here is the function:
from glob import iglob
from os.path import join
import pandas as pd
def read_df_rec(path, fn_regex=r'*.csv'):
    return pd.concat((pd.read_csv(f) for f in iglob(
        join(path, '**', fn_regex), recursive=True)), ignore_index=True)
Inspired from MrFun's answer:
import glob
import pandas as pd
list_of_csv_files = glob.glob(directory_path + '/*.csv')
list_of_csv_files.sort()
df = pd.concat(map(pd.read_csv, list_of_csv_files), ignore_index=True)
Notes:
By default, the list of files generated through glob.glob is not sorted. In many scenarios, though, it needs to be sorted, e.g. when one wants to analyze the number of sensor frame drops vs. timestamp.
In the pd.concat command, if ignore_index=True is not specified, then the original indices from each dataframe (i.e. from each individual CSV file in the list) are preserved, and the main dataframe looks like

    timestamp  id  valid_frame
0   ...
1   ...
2   ...
...
0   ...        <- the index restarts at 0 for every CSV file
1   ...
2   ...
...

With ignore_index=True, it looks like:

    timestamp  id  valid_frame
0   ...
1   ...
2   ...
...
108 ...
109 ...
...

IMO, this is helpful when one wants to manually create a histogram of the number of frame drops per one-minute (or any other duration) bin and base the calculation on the very first timestamp, e.g.
begin_timestamp = df['timestamp'][0]
Without ignore_index=True, df['timestamp'][0] returns a Series containing the very first timestamp from each of the individual dataframes; it does not give just a single value.
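As a concrete illustration of that use case, here is a small sketch of binning frame drops per minute from the first timestamp; the timestamp and valid_frame columns are the hypothetical ones from the example layout above:

import pandas as pd

# Hypothetical columns from the example layout above
df["timestamp"] = pd.to_datetime(df["timestamp"])
begin_timestamp = df["timestamp"].iloc[0]  # a single scalar thanks to ignore_index=True

# Bucket each row into 1-minute bins counted from the first timestamp
minute_bin = ((df["timestamp"] - begin_timestamp).dt.total_seconds() // 60).astype(int)

# Count dropped frames (valid_frame == 0) per bin
drops_per_minute = (df["valid_frame"] == 0).groupby(minute_bin).sum()
print(drops_per_minute.head())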
Another one-liner, using a list comprehension, which allows passing arguments to read_csv:
df = pd.concat([pd.read_csv(f'dir/{f}') for f in os.listdir('dir') if f.endswith('.csv')])
Alternative using the pathlib library (often preferred over os.path).
This method avoids iterative use of pandas concat()/append().
From the pandas documentation:
It is worth noting that concat() (and therefore append()) makes a full copy of the data, and that constantly reusing this function can create a significant performance hit. If you need to use the operation over several datasets, use a list comprehension.
import pandas as pd
from pathlib import Path
dir = Path("../relevant_directory")
df = (pd.read_csv(f) for f in dir.glob("*.csv"))
df = pd.concat(df)
If multiple CSV files are zipped, you may use zipfile to read all and concatenate as below:
import zipfile
import pandas as pd
ziptrain = zipfile.ZipFile('yourpath/yourfile.zip')
train = []
train = [ pd.read_csv(ziptrain.open(f)) for f in ziptrain.namelist() ]
df = pd.concat(train)
Based on Sid's good answer.
To identify issues of missing or unaligned columns
Before concatenating, you can load CSV files into an intermediate dictionary which gives access to each data set based on the file name (in the form dict_of_df['filename.csv']). Such a dictionary can help you identify issues with heterogeneous data formats, when column names are not aligned for example.
Import modules and locate file paths:
import os
import glob
import pandas
from collections import OrderedDict
path =r'C:\DRO\DCL_rawdata_files'
filenames = glob.glob(path + "/*.csv")
Note: OrderedDict is not necessary, but it'll keep the order of files which might be useful for analysis.
Load CSV files into a dictionary. Then concatenate:
dict_of_df = OrderedDict((f, pandas.read_csv(f)) for f in filenames)
pandas.concat(dict_of_df, sort=True)
Keys are file names f and values are the data frame content of CSV files.
Instead of using f as a dictionary key, you can also use os.path.basename(f) or other os.path methods to reduce the size of the key in the dictionary to only the smaller part that is relevant.
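As a small sketch of that variant (reusing filenames from the snippet above):

import os
from collections import OrderedDict
import pandas

# Key the dictionary by base file name instead of the full path
dict_of_df = OrderedDict((os.path.basename(f), pandas.read_csv(f)) for f in filenames)
big_frame = pandas.concat(dict_of_df, sort=True)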
import os
os.system("awk '(NR == 1) || (FNR > 1)' file*.csv > merged.csv")
Here NR and FNR are awk's line counters: NR is the cumulative line number across all input files, while FNR is the line number within the current file.
NR == 1 matches the first line of the first file (the header), while FNR > 1 skips the first line of each subsequent file.
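For reference, a pure-Python sketch of the same header-skipping merge, with placeholder file names (no awk or pandas required):

import glob

# Keep the header of the first file, skip it in every subsequent file
with open("merged.csv", "w") as out:
    for i, path in enumerate(sorted(glob.glob("file*.csv"))):
        with open(path) as f:
            header = f.readline()
            if i == 0:
                out.write(header)
            out.writelines(f)  # write the remaining lines of the current file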
In case of an unnamed-column issue, use this code to merge multiple CSV files along the row axis (axis=0).
import glob
import os
import pandas as pd
merged_df = pd.concat(
    [pd.read_csv(csv_file, index_col=0, header=0)
     for csv_file in glob.glob(os.path.join("data/", "*.csv"))],
    axis=0, ignore_index=True)
merged_df.to_csv("merged.csv")
You can do it this way also:
import pandas as pd
import os
new_df = pd.DataFrame()
for r, d, f in os.walk(csv_folder_path):
    for file in f:
        # join the walk root with the file name (works for subfolders too)
        complete_file_path = os.path.join(r, file)
        read_file = pd.read_csv(complete_file_path)
        # DataFrame.append was removed in newer pandas; concat does the same job
        new_df = pd.concat([new_df, read_file], ignore_index=True)

new_df.shape
Consider using convtools library, which provides lots of data processing primitives and generates simple ad hoc code under the hood.
It is not supposed to be faster than pandas/polars, but sometimes it can be.
e.g. you could concat csv files into one for further reuse - here's the code:
import glob
from convtools import conversion as c
from convtools.contrib.tables import Table
import pandas as pd
def test_pandas():
    df = pd.concat(
        (
            pd.read_csv(filename, index_col=None, header=0)
            for filename in glob.glob("tmp/*.csv")
        ),
        axis=0,
        ignore_index=True,
    )
    df.to_csv("out.csv", index=False)
    # took 20.9 s

def test_convtools():
    table = None
    for filename in glob.glob("tmp/*.csv"):
        table_ = Table.from_csv(filename, header=False)
        if table is None:
            table = table_
        else:
            table = table.chain(table_)
    table.into_csv("out_convtools.csv", include_header=False)
    # took 15.8 s
Of course, if you just want to obtain a dataframe without writing a concatenated file, it will take 4.63 s and 10.9 s respectively (pandas is faster here because it doesn't need to zip columns for writing it back).
import pandas as pd
import glob
path = r'C:\DRO\DCL_rawdata_files' # use your path
file_path_list = glob.glob(path + "/*.csv")
file_iter = iter(file_path_list)

list_df_csv = []
list_df_csv.append(pd.read_csv(next(file_iter)))

for file in file_iter:
    list_df_csv.append(pd.read_csv(file, header=0))

df = pd.concat(list_df_csv, ignore_index=True)
This is how you can do it using Colaboratory on Google Drive:
import pandas as pd
import glob
path = r'/content/drive/My Drive/data/actual/comments_only' # Use your path
all_files = glob.glob(path + "/*.csv")
li = []
for filename in all_files:
    df = pd.read_csv(filename, index_col=None, header=0)
    li.append(df)

frame = pd.concat(li, axis=0, ignore_index=True, sort=True)
frame.to_csv('/content/drive/onefile.csv')

Append data from TDMS file

I have a folder of TDMS files (they could also be Excel files).
These are stored in 5 MB packages, but all contain the same data structure.
Unfortunately there is no absolute time in the rows, and the timestamp is stored somewhat cryptically in the column "TimeStamp" in the following format:
"Tues. 17.11.2020 19:20:15"
But now I would like to load each file and plot them one after the other in the same graph.
For one file this is no problem, because I simply use the file's index for the x-axis; but if I load several files, the index in each file is the same and the data overlap.
Does anyone have an idea how I can write all the data into one DataFrame with a continuous timestamp, so that the data can be plotted one after the other, or so that I can specify a time period in which I would like to see the data?
My first approach is below.
If someone could upload an example with a CSV file (pandas.read_csv) instead of the npTDMS module, it would be just as helpful!
https://nptdms.readthedocs.io/en/stable/
import pandas as pd
import matplotlib.pyplot as plt
from nptdms import TdmsFile
tdms_file = TdmsFile.read("Datei1.tdms")
tdms_groups = tdms_file.groups()
tdms_Variables_1 = tdms_file.group_channels(tdms_groups[0])
MessageData_channel_1 = tdms_file.object('Data', 'Position')
MessageData_data_1 = MessageData_channel_1.data
#MessageData_channel_2 = tdms_file.object('Data', 'Timestamp')
#MessageData_data_2 = MessageData_channel_2.data
df_y = pd.DataFrame(data=MessageData_data_1).append(df_y)
plt.plot(df_y)
Here is an example with CSV. It will first create a bunch of files that should look similar to yours in the ./data/ folder. Then it will read those files back (finding them with glob). It uses pandas.concat to combine the dataframes into 1, and then it parses the date.
import glob
import os
import random
import pandas
import matplotlib.pyplot as plt

# Create a bunch of test files that look like your data (NOTE: my files aren't 5 MB, but 100 rows each)
os.makedirs("./data", exist_ok=True)  # make sure the output folder exists
df = pandas.DataFrame([{"value": random.randint(50, 100)} for _ in range(1000)])
df["timestamp"] = pandas.date_range(
    start="17/11/2020", periods=1000, freq="H"
).strftime(r"%a. %d.%m.%Y %H:%M:%S")
chunks = [df.iloc[i : i + 100] for i in range(0, len(df) - 100 + 1, 100)]
for index, chunk in enumerate(chunks):
    chunk[["timestamp", "value"]].to_csv(f"./data/data_{index}.csv", index=False)
# ===============
# Read the files back into a dataframe
dataframe_list = []
for file in glob.glob("./data/data_*.csv"):
    df = pandas.read_csv(file)
    dataframe_list.append(df)
# Combine all individual dataframes into 1
df = pandas.concat(dataframe_list)
# Parse the timestamp field correctly
df["timestamp"] = pandas.to_datetime(df["timestamp"], format=r"%a. %d.%m.%Y %H:%M:%S")
# Use the timestamp as the index for the dataframe, and make sure it's sorted
df = df.set_index("timestamp").sort_index()
# Create the plot
plt.plot(df)
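Since the question also asks about viewing only a specific time period, here is a small sketch of selecting a window once the sorted DatetimeIndex is in place (the date range below is made up):

# Plot only a chosen time window by slicing the sorted DatetimeIndex
window = df.loc["2020-11-18":"2020-11-20"]
plt.plot(window)
plt.show()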
@Gijs Wobben
Thank you so much! It works perfectly well and it will save me a lot of work!
As a mechanical engineer I don't write code like this very often, so I'm glad when people from other disciplines can help.
Here is the basic structure of how I did it directly with TDMS files, because I read afterwards that the npTDMS module can read the data directly into a dataframe, which I didn't know before:
import pandas as pd
import matplotlib.pyplot as plt
from nptdms import TdmsFile
import glob

file_names = glob.glob('*.tdms')

# Read the files into dataframes
dataframe_list = []
for file in file_names:
    tdms_file = TdmsFile.read(file)
    df = tdms_file['Sheet1'].as_dataframe()
    dataframe_list.append(df)

df_all = pd.concat(dataframe_list)

# Parse the timestamp field correctly
df_all["Timestamp"] = pd.to_datetime(df_all["Timestamp"], format=r"%a. %d.%m.%Y %H:%M:%S")

# Use the timestamp as the index for the dataframe, and make sure it's sorted
df_all = df_all.set_index("Timestamp").sort_index()

# Create the plot
plt.plot(df_all)

What is the best way to replace the format of data in a large dataset?

I am just starting out with data science, so apologies if this is a basic question with a simple answer, but I have been scanning Google for hours and have tried multiple solutions to no avail.
Basically, my dataset has automatically converted some values such as 3-5 to 03-May. I am not able to simply change the values in Excel; rather, I need to clean the data in Python. My first thought was simply to use the replace tool, i.e. df = df.replace('2019-05-03 00:00:00', '3-5'), but it doesn't work, presumably because the dtype differs between the timestamp and the str(?). It works if I adjust the code, i.e. df = df.replace('0-2', '3-5').
I can't simply treat that data as missing values either, as it is merely an error in formatting rather than a spurious entry.
Is there a simple way of doing this?
Listed below is an example snippet of the data I am working with:
GitHub public gist
Please see below for the code:
# Dependencies
import pytest
import pandas as pd
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)
pd.set_option('display.max_colwidth', None)
import numpy as np
from google.colab import drive
import io

# Import data
from google.colab import files
upload = files.upload()
df = pd.read_excel(io.BytesIO(upload['breast-cancer.xls']))
df

# Clean data
df.dtypes

# Correcting tumor-size and inv-nodes values
'''def clean_data(dataset):
    for i in dataset:
        dataset = dataset.replace('2019-05-03 00:00:00', '3-5')
        dataset = dataset.replace('2019-08-06 00:00:00', '6-8')
        dataset = dataset.replace('2019-09-11 00:00:00', '9-11')
        dataset = dataset.replace('2014-12-01 00:00:00', '12-14')
        dataset = dataset.replace('2014-10-01 00:00:00', '10-14')
        dataset = dataset.replace('2019-09-05 00:00:00', '5-9')
    return dataset

cleaned_dataset = dataset.apply(clean_data)
cleaned_dataset'''

df = df.replace('2019-05-03 00:00:00', '3-5')
df

# Check for duplicates
df.duplicated()
df[['tumor-size', 'inv-nodes']] = df[['tumor-size', 'inv-nodes']].astype(str)
That line of code saved the day.
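For completeness, a sketch of the full cleanup using the mappings from the commented-out clean_data function above (the tumor-size and inv-nodes column names come from the question's dataset):

# Cast the affected columns to string, then map the date-mangled values back
df[['tumor-size', 'inv-nodes']] = df[['tumor-size', 'inv-nodes']].astype(str)

corrections = {
    '2019-05-03 00:00:00': '3-5',
    '2019-08-06 00:00:00': '6-8',
    '2019-09-11 00:00:00': '9-11',
    '2014-12-01 00:00:00': '12-14',
    '2014-10-01 00:00:00': '10-14',
    '2019-09-05 00:00:00': '5-9',
}
df = df.replace(corrections)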

How to split a csv into multiple csv files using Dask

How to split a csv file into multiple files using Dask?
The code below seems to write to one file only, which takes a long time to write the full thing. I believe writing to multiple files would be faster.
import dask.dataframe as ddf
import dask
file_path = "file_name.csv"
df = ddf.read_csv(file_path)
futs = df.to_csv(r"*.csv", compute=False)
_, l = dask.compute(futs, df.size)
I suspect that when you read the file, df.npartitions is just 1.
import dask.dataframe as dd
file_path = "file_name.csv"
df = dd.read_csv(file_path)
# set how many files you would like to have
# in this case 10
df = df.repartition(npartitions=10)
df.to_csv("file_*.csv")
But as far as I can see it's not faster.
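An alternative sketch: instead of repartitioning afterwards, the partition size can be controlled at read time via the blocksize argument of dd.read_csv, so the output is split across files from the start (the 64 MB value below is an arbitrary choice):

import dask.dataframe as dd

# Roughly 64 MB of the input per partition, hence one output file per partition
df = dd.read_csv("file_name.csv", blocksize="64MB")
print(df.npartitions)
df.to_csv("file_*.csv")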

Inconsistent results when concatenating parsed csv files

I am puzzled by the following problem. I have a set of csv files, which I parse iteratively. Before collecting the dataframes in a list, I apply some function (as simple as tmp_df*2) to each of the tmp_df. It all worked perfectly fine at first glance, until I realized I have inconsistencies in the results from run to run.
For example, when I apply df.std() I might receive for a first run:
In[2]: df1.std()
Out[2]:
some_int 15281.99
some_float 5.302293
and for a second run after:
In[3]: df2.std()
Out[3]:
some_int 15281.99
some_float 6.691013
Strangely, I do not observe inconsistencies like this one when I don't manipulate the parsed data (simply comment out tmp_df = tmp_df*2). I also noticed that for the columns with int datatypes, the results are consistent from run to run, which does not hold for floats. I suspect it has to do with floating-point precision. I also cannot establish a pattern in how they vary; it might be that I have the same results for two or three consecutive runs. Maybe someone has an idea if I am missing something here. I am working on a replication example; I'll edit asap, as I cannot share the underlying data. Maybe someone can shed some light on this in the meantime. I am using Windows 8.1, pandas 0.17.1, Python 3.4.3.
Code example:
import pandas as pd
import numpy as np
data_list = list()
csv_files = ['a.csv', 'b.csv', 'c.csv']
for csv_file in csv_files:
    # load csv_file
    tmp_df = pd.read_csv(csv_file, index_col='ID', dtype=np.float64)
    # replace infs by na
    tmp_df.replace([np.inf, -np.inf], np.nan, inplace=True)
    # manipulate tmp_df
    tmp_df = tmp_df*2
    data_list.append(tmp_df)

df = pd.concat(data_list, ignore_index=True)
df.reset_index(inplace=True)
Update:
Running the same code and data on a Unix system works perfectly fine.
Edit:
I have managed to re-create the problem; the code should run on both Windows and Unix. I've tested on Windows 8.1 and face the same problem when with_function=True (typically after 1-5 runs); on Unix it runs without problems. with_function=False runs without differences on both Windows and Unix. I can also reject the hypothesis that it is related to an int vs. float issue, as the simulated ints also differ...
Here is the code:
import pandas as pd
import numpy as np
from pathlib import Path
from tempfile import gettempdir
def simulate_csv_data(tmp_dir, num_files=5):
    """ Simulate csv files
    :param tmp_dir: Path, csv files are saved to
    :param num_files: int, how many csv files to simulate
    :return:
    """
    rows = 20000
    columns = 5
    np.random.seed(1282)
    for file_num in range(num_files):
        file_path = tmp_dir.joinpath(''.join(['df_', str(file_num), '.csv']))
        simulated_df = pd.DataFrame(np.random.standard_normal((rows, columns)))
        simulated_df['some_int'] = np.random.randint(0, 100)
        simulated_df.to_csv(str(file_path))

def get_csv_data(tmp_dir, num_files=5, with_function=True):
    """ Collect various csv files and return a concatenated df
    :param tmp_dir: Path, csv files are saved to
    :param num_files: int, how many csv files to simulate
    :param with_function: bool, apply function to tmp_df
    :return:
    """
    data_list = list()
    for file_num in range(num_files):
        # current file path
        file_path = tmp_dir.joinpath(''.join(['df_', str(file_num), '.csv']))
        # load csv_file
        tmp_df = pd.read_csv(str(file_path), dtype=np.float64)
        # replace infs by na
        tmp_df.replace([np.inf, -np.inf], np.nan, inplace=True)
        # apply function to tmp_df
        if with_function:
            tmp_df = tmp_df*2
        data_list.append(tmp_df)
    df = pd.concat(data_list, ignore_index=True)
    df.reset_index(inplace=True)
    return df

def main():
    # INPUT ----------------------------------------------
    num_files = 5
    with_function = True
    max_comparisons = 50
    # ----------------------------------------------------
    tmp_dir = gettempdir()
    # use a temporary "non_existing" dir for the new files
    tmp_csv_folder = Path(tmp_dir).joinpath('csv_files_sdfs2eqqf')
    # if it exists already, don't simulate data/files again
    if tmp_csv_folder.exists() is False:
        tmp_csv_folder.mkdir()
        print('Simulating temp files...')
        simulate_csv_data(tmp_csv_folder, num_files)
    print('Getting benchmark data frame...')
    df1 = get_csv_data(tmp_csv_folder, num_files, with_function)
    df_is_same = True
    count_runs = 0
    # Run until a different df is found or max runs exceeded
    print('Comparing data frames...')
    while df_is_same:
        # get another data frame
        df2 = get_csv_data(tmp_csv_folder, num_files, with_function)
        count_runs += 1
        # compare data frames
        if df1.equals(df2) is False:
            df_is_same = False
            print('Found unequal df after {} runs'.format(count_runs))
            # print out standard deviations (arbitrary example)
            print('Std Run1: \n {}'.format(df1.std()))
            print('Std Run2: \n {}'.format(df2.std()))
        if count_runs > max_comparisons:
            df_is_same = False
            print('No unequal df found after {} runs'.format(count_runs))
    print('Delete the following folder if no longer needed: "{}"'.format(
        str(tmp_csv_folder)))

if __name__ == '__main__':
    main()
Your variations are caused by something else, like input data changing between executions, or source code changes.
Float precision never gives different results between executions.
By the way, clean up your example and you will find the bug. At the moment you say something about an int but display a decimal value instead!
Update numexpr to 2.4.6 (or later); numexpr 2.4.4 had some bugs on Windows. After running the update it works for me.
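A quick way to confirm which numexpr version is installed (a small sketch):

import numexpr
print(numexpr.__version__)  # should be 2.4.6 or later
# upgrade with, e.g.:  pip install --upgrade "numexpr>=2.4.6"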
