Inconsistent results when concatenating parsed csv files - python

I am puzzled by the following problem. I have a set of CSV files, which I parse iteratively. Before collecting the dataframes in a list, I apply some function (as simple as tmp_df*2) to each tmp_df. It all worked perfectly fine at first glance, until I realized that the results are inconsistent from run to run.
For example, when I apply df.std(), a first run might give:
In[2]: df1.std()
Out[2]:
some_int 15281.99
some_float 5.302293
and a second run:
In[3]: df2.std()
Out[3]:
some_int 15281.99
some_float 6.691013
Strangely, I don't observe inconsistencies like this when I don't manipulate the parsed data (simply comment out tmp_df = tmp_df*2). I also noticed that the results are consistent from run to run for columns with integer dtypes, which does not hold for floats. I suspect it has to do with floating-point precision. I also cannot establish a pattern in how they vary; it might be that I get the same results for two or three consecutive runs. Maybe someone has an idea whether I am missing something here. I am working on a replication example and will edit asap, as I cannot share the underlying data. Maybe someone can shed some light on this in the meantime. I am using Windows 8.1, pandas 0.17.1, Python 3.4.3.
Code example:
import pandas as pd
import numpy as np

data_list = list()
csv_files = ['a.csv', 'b.csv', 'c.csv']
for csv_file in csv_files:
    # load csv_file
    tmp_df = pd.read_csv(csv_file, index_col='ID', dtype=np.float64)
    # replace infs by na
    tmp_df.replace([np.inf, -np.inf], np.nan, inplace=True)
    # manipulate tmp_df
    tmp_df = tmp_df*2
    data_list.append(tmp_df)
df = pd.concat(data_list, ignore_index=True)
df.reset_index(inplace=True)
Update:
Running the same code and data on a Unix system works perfectly fine.
Edit:
I have managed to re-create the problem; the example below should run on Windows and Unix. I've tested on Windows 8.1 and face the same problem when with_function=True (typically after 1-5 runs); on Unix it runs without problems. With with_function=False there are no differences on either Windows or Unix. I can also reject the hypothesis that it is an int vs. float issue, as the simulated ints differ as well...
Here is the code:
import pandas as pd
import numpy as np
from pathlib import Path
from tempfile import gettempdir


def simulate_csv_data(tmp_dir, num_files=5):
    """ Simulate csv files
    :param tmp_dir: Path, csv files are saved to
    :param num_files: int, how many csv files to simulate
    :return:
    """
    rows = 20000
    columns = 5
    np.random.seed(1282)
    for file_num in range(num_files):
        file_path = tmp_dir.joinpath(''.join(['df_', str(file_num), '.csv']))
        simulated_df = pd.DataFrame(np.random.standard_normal((rows, columns)))
        simulated_df['some_int'] = np.random.randint(0, 100)
        simulated_df.to_csv(str(file_path))


def get_csv_data(tmp_dir, num_files=5, with_function=True):
    """ Collect various csv files and return a concatenated df
    :param tmp_dir: Path, csv files are saved to
    :param num_files: int, how many csv files to simulate
    :param with_function: bool, apply function to tmp_df
    :return:
    """
    data_list = list()
    for file_num in range(num_files):
        # current file path
        file_path = tmp_dir.joinpath(''.join(['df_', str(file_num), '.csv']))
        # load csv_file
        tmp_df = pd.read_csv(str(file_path), dtype=np.float64)
        # replace infs by na
        tmp_df.replace([np.inf, -np.inf], np.nan, inplace=True)
        # apply function to tmp_df
        if with_function:
            tmp_df = tmp_df*2
        data_list.append(tmp_df)
    df = pd.concat(data_list, ignore_index=True)
    df.reset_index(inplace=True)
    return df


def main():
    # INPUT ----------------------------------------------
    num_files = 5
    with_function = True
    max_comparisons = 50
    # ----------------------------------------------------
    tmp_dir = gettempdir()
    # use temporary "non_existing" dir for the new files
    tmp_csv_folder = Path(tmp_dir).joinpath('csv_files_sdfs2eqqf')
    # if it exists already, don't simulate data/files again
    if tmp_csv_folder.exists() is False:
        tmp_csv_folder.mkdir()
        print('Simulating temp files...')
        simulate_csv_data(tmp_csv_folder, num_files)
    print('Getting benchmark data frame...')
    df1 = get_csv_data(tmp_csv_folder, num_files, with_function)
    df_is_same = True
    count_runs = 0
    # run until a different df is found or max runs exceeded
    print('Comparing data frames...')
    while df_is_same:
        # get another data frame
        df2 = get_csv_data(tmp_csv_folder, num_files, with_function)
        count_runs += 1
        # compare data frames
        if df1.equals(df2) is False:
            df_is_same = False
            print('Found unequal df after {} runs'.format(count_runs))
            # print out the standard deviations (arbitrary example)
            print('Std Run1: \n {}'.format(df1.std()))
            print('Std Run2: \n {}'.format(df2.std()))
        if count_runs > max_comparisons:
            df_is_same = False
            print('No unequal df found after {} runs'.format(count_runs))
    print('Delete the following folder if no longer needed: "{}"'.format(
        str(tmp_csv_folder)))


if __name__ == '__main__':
    main()

Your variations are caused by something else, such as the input data changing between executions, or source code changes.
Floating-point precision never gives different results between executions.
Btw, clean up your example and you will find the bug. At the moment you say something about an int but display a decimal value instead!

Updating numexpr to 2.4.6 (or later) fixed it, as numexpr 2.4.4 had some bugs on Windows. After running the update it works for me.
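If you want to check which numexpr version your environment is picking up before upgrading, a quick check looks roughly like this (the pip/conda commands below are the standard ones, not specific to this setup):

import numexpr
print(numexpr.__version__)  # anything below 2.4.6 is suspect on Windows

# then, from the command line, one of:
#   pip install --upgrade numexpr
#   conda update numexpr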

Related

Dask OutOfBoundsDatetime when reading parquet files

The following code fails with
pandas._libs.tslibs.np_datetime.OutOfBoundsDatetime: Out of bounds nanosecond timestamp: -221047-10-07 10:43:35
from pathlib import Path
import dask.dataframe as dd
import numpy as np
import pandas as pd
import tempfile


def run():
    temp_folder = tempfile.TemporaryDirectory()
    rng = np.random.default_rng(42)
    filenames = []
    for i in range(2):
        filename = Path(temp_folder.name, f"file_{i}.gzip")
        filenames.append(filename)
        df = pd.DataFrame(
            data=rng.normal(size=(365, 1500)),
            index=pd.date_range(
                start="2021-01-01",
                end="2022-01-01",
                closed="left",
                freq="D",
            ),
        )
        df.columns = df.columns.astype(str)
        df.to_parquet(filename, compression="gzip")
    df = dd.read_parquet(filenames)
    result = df.mean().mean().compute()
    temp_folder.cleanup()
    return result


if __name__ == "__main__":
    run()
Why does this (sample) code fail?
What I'm trying to do:
The loop simulates creating, in batches, data that is larger than memory.
In the next step I'd like to read that data from the files and work with it in dask.
Observations:
If I only read one file
for i in range(1):
the code works.
If I don't use the DatetimeIndex
df = pd.DataFrame(
    data=rng.normal(size=(365, 1500)),
)
the code works.
If I use pandas only
df = pd.read_parquet(filenames)
result = df.mean().mean()
the code works (which is odd, since read_parquet in pandas expects a single path, not a collection).
If I use the distributed client with concat as suggested here I get a similar error pandas._libs.tslibs.np_datetime.OutOfBoundsDatetime: Out of bounds nanosecond timestamp: 68024-12-20 01:46:56
Therefore I omitted the client in my sample.
Thanks to the helpful comment from @mdurant, specifying the engine explicitly helped:
engine='fastparquet' # or 'pyarrow'
df.to_parquet(filename, compression="gzip", engine=engine)
df = dd.read_parquet(filenames, engine=engine)
Apparently engine='auto' selects different engines in dask vs. pandas when more than one parquet engine is installed.
Sidenote:
I've tried different combinations of the engine and this triggers the error in the question:
df.to_parquet(filename, compression="gzip", engine='pyarrow')
df = dd.read_parquet(filenames, engine='fastparquet')
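To see which parquet engines are actually available in a given environment (a quick, generic check; nothing here is specific to the question's setup):

import importlib

for engine in ("pyarrow", "fastparquet"):
    try:
        module = importlib.import_module(engine)
        print(engine, module.__version__)
    except ImportError:
        print(engine, "not installed")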

Append data from TDMS file

I have a folder of TDMS files (they could also be Excel).
These are stored in 5 MB packages, but they all contain the same data structure.
Unfortunately there is no absolute time in the rows, and the timestamp is stored somewhat cryptically in the column "TimeStamp" in the following format:
"Tues. 17.11.2020 19:20:15"
But now I would like to load each file and plot them one after the other in the same graph.
For one file this is no problem, because I simply use the index of the file for the x-axis, but if I load several files, the index in each file is the same and the data overlap.
Does anyone have an idea how I can write all the data into one DataFrame, but with a continuous timestamp, so that the data can be plotted one after the other, or so that I can specify a time period in which I would like to see the data?
My first approach is below.
If someone could upload an example with a CSV file (pandas.read_csv) instead of the npTDMS module, it would be just as helpful!
https://nptdms.readthedocs.io/en/stable/
import pandas as pd
import matplotlib.pyplot as plt
from nptdms import TdmsFile

tdms_file = TdmsFile.read("Datei1.tdms")
tdms_groups = tdms_file.groups()
tdms_Variables_1 = tdms_file.group_channels(tdms_groups[0])
MessageData_channel_1 = tdms_file.object('Data', 'Position')
MessageData_data_1 = MessageData_channel_1.data
#MessageData_channel_2 = tdms_file.object('Data', 'Timestamp')
#MessageData_data_2 = MessageData_channel_2.data
df_y = pd.DataFrame()  # start empty so the data below can be appended to it
df_y = pd.DataFrame(data=MessageData_data_1).append(df_y)
plt.plot(df_y)
Here is an example with CSV. It will first create a bunch of files that should look similar to yours in the ./data/ folder. Then it will read those files back (finding them with glob). It uses pandas.concat to combine the dataframes into 1, and then it parses the date.
import glob
import os
import random

import pandas
import matplotlib.pyplot as plt

# Create a bunch of test files that look like your data (NOTE: my files aren't 5MB, but 100 rows)
os.makedirs("./data", exist_ok=True)  # make sure the target folder exists
df = pandas.DataFrame([{"value": random.randint(50, 100)} for _ in range(1000)])
df["timestamp"] = pandas.date_range(
    start="17/11/2020", periods=1000, freq="H"
).strftime(r"%a. %d.%m.%Y %H:%M:%S")
chunks = [df.iloc[i : i + 100] for i in range(0, len(df) - 100 + 1, 100)]
for index, chunk in enumerate(chunks):
    chunk[["timestamp", "value"]].to_csv(f"./data/data_{index}.csv", index=False)

# ===============

# Read the files back into a dataframe
dataframe_list = []
for file in glob.glob("./data/data_*.csv"):
    df = pandas.read_csv(file)
    dataframe_list.append(df)

# Combine all individual dataframes into 1
df = pandas.concat(dataframe_list)

# Set the time field correctly
df["timestamp"] = pandas.to_datetime(df["timestamp"], format=r"%a. %d.%m.%Y %H:%M:%S")

# Use the timestamp as the index for the dataframe, and make sure it's sorted
df = df.set_index("timestamp").sort_index()

# Create the plot
plt.plot(df)
@Gijs Wobben
Thank you so much! It works perfectly well and it will save me a lot of work!
As a mechanical engineer I don't write code like this very often, so I'm happy when people from other disciplines can help.
Here is the basic structure of how I did it directly with TDMS files, because I read afterwards that the npTDMS module offers the possibility to read the data directly as a dataframe, which I didn't know before:
import glob

import pandas as pd
import matplotlib.pyplot as plt
from nptdms import TdmsFile

file_names = glob.glob('*.tdms')

# Read the files into a dataframe
dataframe_list = []
for file in file_names:
    tdms_file = TdmsFile.read(file)
    df = tdms_file['Sheet1'].as_dataframe()
    dataframe_list.append(df)
df = pd.concat(dataframe_list)

# Set the time field correctly
df["Timestamp"] = pd.to_datetime(df["Timestamp"], format=r"%a. %d.%m.%Y %H:%M:%S")

# Use the timestamp as the index for the dataframe, and make sure it's sorted
df = df.set_index("Timestamp").sort_index()

# Create the plot
plt.plot(df)

What are more efficient ways to loop through a pandas dataframe?

I need to extract some data from 37,000 xls files, which are stored in 2,100 folders (activity/year/month/day). I already wrote the script; however, when given a small sample of a thousand files, it takes 5 minutes to run, and each individual file can include up to ten thousand entries I need to extract. I haven't tried running it on the entire folder yet; I'm looking for suggestions on how to make it more efficient.
I would also like some help on how to export the dictionary to a new Excel file with two columns (or how to skip the dictionary and save directly to xls), and on how to point the script at a shared drive folder instead of Python's root.
import fnmatch
import os
import pandas as pd

docid = []
CoCo = []

for root, dirs, files in os.walk('Z_Option'):
    for filename in files:
        if fnmatch.fnmatch(filename, 'Z_*.xls'):
            df = pd.read_excel(os.path.join(root, filename), sheet_name='Sheet0')
            for i in df['BLDAT']:
                if isinstance(i, int):
                    docid.append(i)
                    CoCo.append(df['BUKRS'].iloc[1])

data = dict(zip(docid, CoCo))
print(data)
This walkthrough was very helpful for me when I was beginning with pandas. What is likely taking so long is the for i in df['BLDAT'] line.
Using something like an apply function can offer a speed boost:
def check_if_int(row):  # row is a pd.Series holding one row of the dataframe
    if isinstance(row['BLDAT'], int):
        docid.append(row['BLDAT'])
        CoCo.append(row.name)  # .name is the row's index label

df.apply(check_if_int, axis=1)  # axis=1 applies the function row-wise
It's unclear what exactly this script is trying to do, but if it's as simple as filtering the dataframe to only include rows where the 'BLDAT' column is an integer, using a mask would be much faster:
df_filtered = df.loc[df['BLDAT'].apply(lambda x: isinstance(x, int))]
Another advantage of filtering the dataframe as opposed to building lists is that you can use df_filtered.to_excel() to write the result straight to an Excel file (or df_filtered.to_csv() for a CSV that Excel can open).
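A minimal end-to-end sketch of the mask-plus-export idea (the column names 'BLDAT' and 'BUKRS' are taken from the question; the toy data and the output filename are only illustrative):

import pandas as pd

# in the real script, df would come from pd.read_excel(...)
df = pd.DataFrame({"BLDAT": [1000123, "n/a", 1000456], "BUKRS": ["A001", "A001", "A001"]})

# keep only rows whose 'BLDAT' value is an int
mask = df["BLDAT"].apply(lambda x: isinstance(x, int))
df_filtered = df.loc[mask, ["BLDAT", "BUKRS"]]

# write the two columns straight to Excel (needs openpyxl) or to CSV
df_filtered.to_excel("filtered.xlsx", index=False)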
Eventually I gave up due to time constraints (yay last minute "I need this tomorrow" reports), and came up with this. Dropping empty rows helped by some margin, and for the next quarter I'll try to do this entirely with pandas.
#Shared drive
import fnmatch
import os
import time

import pandas as pd

start_time = time.time()

docid = []
CoCo = []
errors = []  # paths of files that fail to parse

os.chdir(r"X:\Shared activities")
for root, dirs, files in os.walk("folder"):
    for filename in files:
        if fnmatch.fnmatch(filename, 'Z_*.xls'):
            try:
                df = pd.read_excel(os.path.join(root, filename), sheet_name='Sheet0')
                df.dropna(subset=['BLDAT'], inplace=True)
                for i in df['BLDAT']:
                    if isinstance(i, int):
                        docid.append(i)
                        CoCo.append(df['BUKRS'].iloc[1])
            except Exception:
                errors.append(os.path.join(root, filename))

data = dict(zip(docid, CoCo))

os.chdir(r"C:\project\reports")
pd.DataFrame.from_dict(data, orient="index").to_csv('test.csv')
with open('errors.csv', 'w') as f:
    for item in errors:
        f.write("%s\n" % item)

print("--- %s seconds ---" % (time.time() - start_time))

Removing rows whose index match certain values

My code:
import numpy as np
import pandas as pd
import time
tic = time.time()
I read a long file with the headers [meter] [daycode] [meter reading in kWh], a time series of over 6,000 meters.
consum = pd.read_csv("data/File1.txt", delim_whitespace=True, encoding = "utf-8", names =['meter', 'daycode', 'val'], engine='python')
consum.set_index('meter', inplace=True)
Because I have in fact six files of this humongous size in total, I want to filter out the meters with insufficient information. These are the time series with [meter] values under code 3 by category. I can collect this category information from another file; the following is where I extract it.
id_total = pd.read_csv("data/meter_id_code.csv", header = 0, encoding="cp1252")
#print(len(id_total.index))
id_total.set_index('Code', inplace=True)
id_other = id_total.loc[3].copy()
print id_other
And this is where I write to csv to check whether the last line is correctly performed:
id_other.to_csv('data/id_other.csv', sep='\t', encoding='utf-8')
print consum[~consum.index.isin(id_other)]
Output: (of print id_other)
Problem:
I get the following warning. It says there that the warning doesn't stop the code from working, but mine is affected. I checked the correct directory (I had earlier confused my remote connection to the GPU server with my local machine) and the csv file was created. It turns out the meter IDs in the file are not filtered.
How can I fix the last line?
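One thing that usually helps (a sketch, not a verified fix: 'ID' below stands in for whatever the meter-ID column of id_other is actually called, which the question doesn't show) is to pass a flat array of meter IDs to isin instead of the whole dataframe, since Index.isin compares against values, not against a frame:

# keep only the meters whose ID is NOT in the under-code-3 list
meter_ids = id_other['ID'].values
consum_filtered = consum[~consum.index.isin(meter_ids)]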

How to merge two programs with scheduled execution

I am trying to merge two programs, or to write a third program that will call these two programs as functions. They are supposed to run one after the other, with an interval of a certain number of minutes in between, something like a makefile, which will have a few more programs included later. I am not able to merge them, nor to put them into some format that allows me to call them from a new main program.
program_master_id.py picks up the *.csv files from a folder location and, after computing, appends to the master_ids.csv file in another folder location.
Program_master_count.py divides the count by the count of Ids in the respective time series.
Program_1 master_id.py
import pandas as pd
import numpy as np

# csv file contents
# Need to change to path as the Transition_Data has several *.CSV files
csv_file1 = 'Transition_Data/Test_1.csv'
csv_file2 = '/Transition_Data/Test_2.csv'
# master file to be appended only
master_csv_file = 'Data_repository/master_lac_Test.csv'
csv_file_all = [csv_file1, csv_file2]

# read csv into df using list comprehension
# I use buffer here, replace stringIO with your file path
df_all = [pd.read_csv(csv_file) for csv_file in csv_file_all]

# processing
# =====================================================
# concat along axis=0, outer join on axis=1
merged = pd.concat(df_all, axis=0, ignore_index=True, join='outer').set_index('Ids')

# custom function to handle/merge duplicates on Ids (axis=0)
def apply_func(group):
    return group.fillna(method='ffill').iloc[-1]

# remove Ids duplicates
merged_unique = merged.groupby(level='Ids').apply(apply_func)

# do the subtraction
df_master = pd.read_csv(master_csv_file, index_col=['Ids']).sort_index()

# select matching records and horizontal concat
df_matched = pd.concat([df_master, merged_unique.reindex(df_master.index)], axis=1)

# use broadcasting
df_matched.iloc[:, 1:] = df_matched.iloc[:, 1:].sub(df_matched.iloc[:, 0], axis=0)
print(df_matched)
Program_2 master_count.py  # This gives no error, but also produces no output.

import pandas as pd
import numpy as np

csv_file1 = '/Data_repository/master_lac_Test.csv'
csv_file2 = '/Data_repository/lat_lon_master.csv'

df1 = pd.read_csv(csv_file1).set_index('Ids')
# need to sort index in file 2
df2 = pd.read_csv(csv_file2).set_index('Ids').sort_index()

# df1 and df2 have a duplicated column 00:00:00, use df1 without its 1st column
temp = df2.join(df1.iloc[:, 1:])

# do the division by the number of occurrences of each Ids
# and add column 00:00:00
def my_func(group):
    num_obs = len(group)
    # process the columns from 00:30:00 (inclusive) onwards
    group.iloc[:, 4:] = (group.iloc[:, 4:]/num_obs).add(group.iloc[:, 3], axis=0)
    return group

result = temp.groupby(level='Ids').apply(my_func)
I am trying to write a main program that will call master_ids.py first and then master_count.py. Is there a way to merge both into one program, or to write them as functions and call those functions from a new program? Please suggest.
Okay, let's say you have program1.py:

import pandas as pd
import numpy as np

def main_program1():
    csv_file1 = 'Transition_Data/Test_1.csv'
    ...
    return df_matched

And then program2.py:

import pandas as pd
import numpy as np

def main_program2():
    csv_file1 = '/Data_repository/master_lac_Test.csv'
    ...
    result = temp.groupby(level='Ids').apply(my_func)
    return result
You can now use these in a separate Python program, say main.py:

import time
import program1  # imports program1.py
import program2  # imports program2.py

df_matched = program1.main_program1()
print(df_matched)

# wait
min_wait = 1
time.sleep(60*min_wait)

# call the second one
result = program2.main_program2()
There are lots of ways to 'improve' these, but hopefully this shows you the gist. In particular, I would recommend using an if __name__ == "__main__": guard (see the question "What does if __name__ == "__main__": do?") in each of the files, so that they can easily be executed from the command line or imported from Python; a small sketch follows below.
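For instance, a minimal, illustrative version of program1.py with such a guard (the real body from master_id.py is replaced by a placeholder here):

def main_program1():
    # placeholder for the real logic from master_id.py
    df_matched = "result from program 1"
    return df_matched

if __name__ == "__main__":
    # runs only when this file is executed directly, e.g. "python program1.py",
    # not when it is imported by main.py
    print(main_program1())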
Another option is a shell script, which for your 'master_id.py' and 'master_count.py' becomes (in its simplest form):

python master_id.py
sleep 60
python master_count.py

Saved as 'main.sh', this can be executed with
sh main.sh
