I am trying to merge two programs, or write a third program that calls these two programs as functions. They are supposed to run one after the other, separated by an interval of a certain number of minutes, something like a makefile, which will have a few more programs included later. I am able neither to merge them nor to put them into a format that would let me call them from a new main program.
program_master_id.py picks up the *.csv files from a folder and, after computing, appends to the master_ids.csv file in another folder.
Program_master_count.py divides the count by the number of occurrences of each Ids value in the respective time series.
Program_1 master_id.py
import pandas as pd
import numpy as np
# csv file contents
# Need to change to path as the Transition_Data has several *.CSV files
csv_file1 = 'Transition_Data/Test_1.csv'
csv_file2 = '/Transition_Data/Test_2.csv'
#master file to be appended only
master_csv_file = 'Data_repository/master_lac_Test.csv'
csv_file_all = [csv_file1, csv_file2]
# read csv into df using list comprehension
# I use buffer here, replace stringIO with your file path
df_all = [pd.read_csv(csv_file) for csv_file in csv_file_all]
# processing
# =====================================================
# concat along axis=0, outer join on axis=1
merged = pd.concat(df_all, axis=0, ignore_index=True, join='outer').set_index('Ids')
# custom function to handle/merge duplicates on Ids (axis=0)
def apply_func(group):
    return group.fillna(method='ffill').iloc[-1]
# remove Ids duplicates
merged_unique = merged.groupby(level='Ids').apply(apply_func)
# do the subtraction
df_master = pd.read_csv(master_csv_file, index_col=['Ids']).sort_index()
# select matching records and horizontal concat
df_matched = pd.concat([df_master,merged_unique.reindex(df_master.index)], axis=1)
# use broadcasting
df_matched.iloc[:, 1:] = df_matched.iloc[:, 1:].sub(df_matched.iloc[:, 0], axis=0)
print(df_matched)
Program_2 master_count.py #This does not give any error nor gives any output.
import pandas as pd
import numpy as np
csv_file1 = '/Data_repository/master_lac_Test.csv'
csv_file2 = '/Data_repository/lat_lon_master.csv'
df1 = pd.read_csv(csv_file1).set_index('Ids')
# need to sort index in file 2
df2 = pd.read_csv(csv_file2).set_index('Ids').sort_index()
# df1 and df2 has a duplicated column 00:00:00, use df1 without 1st column
temp = df2.join(df1.iloc[:, 1:])
# do the division by number of occurence of each Ids
# and add column 00:00:00
def my_func(group):
    num_obs = len(group)
    # process with column name after 00:30:00 (inclusive)
    group.iloc[:,4:] = (group.iloc[:,4:]/num_obs).add(group.iloc[:,3], axis=0)
    return group
result = temp.groupby(level='Ids').apply(my_func)
I am trying to write a main program that will call master_id.py first and then master_count.py. Is there a way to merge both into one program, or to write them as functions and call those functions from a new program? Please suggest.
Okay, let's say you have program1.py:
import pandas as pd
import numpy as np
def main_program1():
    csv_file1 = 'Transition_Data/Test_1.csv'
    ...
    return df_matched
And then program2.py:
import pandas as pd
import numpy as np
def main_program2():
    csv_file1 = '/Data_repository/master_lac_Test.csv'
    ...
    result = temp.groupby(level='Ids').apply(my_func)
    return result
You can now use these in a separate python program, say main.py
import time
import program1 # imports program1.py
import program2 # imports program2.py
df_matched = program1.main_program1()
print(df_matched)
# wait
min_wait = 1
time.sleep(60*min_wait)
# call the second one
result = program2.main_program2()
There are lots of ways to 'improve' these, but hopefully this will show you the gist. In particular, I would recommend using the if __name__ == "__main__": idiom in each of the files, so that they can easily be executed from the command line or called from Python.
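As a rough sketch of that idiom, assuming the two scripts have already been wrapped in functions: the function bodies below are hypothetical stand-ins (they only return strings instead of doing the pandas work), and run_pipeline with its wait_minutes and sleep parameters are names made up for illustration.

```python
import time


def main_program1():
    """Stand-in for the wrapped body of master_id.py (placeholder only)."""
    return "df_matched placeholder"


def main_program2():
    """Stand-in for the wrapped body of master_count.py (placeholder only)."""
    return "result placeholder"


def run_pipeline(wait_minutes=1, sleep=time.sleep):
    """Run the first step, wait, then run the second step."""
    first = main_program1()
    sleep(60 * wait_minutes)
    second = main_program2()
    return first, second


if __name__ == "__main__":
    # Only runs when the file is executed directly, not when it is imported
    print(run_pipeline(wait_minutes=0))
```

Because of the guard, importing this file from another program only defines the functions, while running it from the command line executes the whole sequence.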
Another option is a shell script, which for your 'master_id.py' and 'master_count.py' become (in its simplest form)
python master_id.py
sleep 60
python master_count.py
saved in 'main.sh' this can be executed as
sh main.sh
Related
I have a directory with hundreds of csv files that represent the pixels of a thermal camera (288x383), and I want to get the center value of each file (e.g. 144 x 191), and with each one of the those values collected, add them in a dataframe that presents the list with the names of each file.
Follow my code, where I created the dataframe with the lists of several csv files:
import os
import glob
import numpy as np
import pandas as pd
os.chdir("/Programming/Proj1/Code/Image_Data")
!ls
Out:
2021-09-13_13-42-16.csv
2021-09-13_13-42-22.csv
2021-09-13_13-42-29.csv
2021-09-13_13-42-35.csv
2021-09-13_13-42-47.csv
2021-09-13_13-42-53.csv
...
file_extension = '.csv'
all_filenames = [i for i in glob.glob(f"*{file_extension}")]
files = glob.glob('*.csv')
all_df = pd.DataFrame(all_filenames, columns = ['Full_name'])
all_df.head()
**Full_name**
0 2021-09-13_13-42-16.csv
1 2021-09-13_13-42-22.csv
2 2021-09-13_13-42-29.csv
3 2021-09-13_13-42-35.csv
4 2021-09-13_13-42-47.csv
5 2021-09-13_13-42-53.csv
6 2021-09-13_13-43-00.csv
You can loop through your files one by one, reading them in as a dataframe and taking the center value that you want. Then save this value along with the file name. This list of results can then be read in to a new dataframe ready for you to use.
result = []
for file in files:
    # read in the file, you may need to specify some extra parameters
    # check the pandas docs for read_csv
    df = pd.read_csv(file)
    # now select the value you want
    # this will vary depending on what your indexes look like (if any)
    # and also your column names
    value = df.loc[row, col]
    # append to the list
    result.append((file, value))
# you should now have a list in the format:
# [('2021-09-13_13-42-16.csv', 100), ('2021-09-13_13-42-22.csv', 255), ...
# load the list of tuples as a dataframe for further processing or analysis...
result_df = pd.DataFrame(result)
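A more concrete version of the loop might look like the sketch below. It assumes the files are plain comma-separated grids of numbers with no header row (hence header=None and positional .iloc), which may not match the real camera exports. center_values and the center_value column name are made up here, and the demo at the bottom writes a tiny 5x5 file to a temporary folder rather than using real data.

```python
import glob
import os
import tempfile

import numpy as np
import pandas as pd


def center_values(folder, row=144, col=191):
    """Collect one cell (row, col) from every csv file in folder."""
    rows = []
    for path in sorted(glob.glob(os.path.join(folder, "*.csv"))):
        # header=None: treat the first line as data, not as column names
        frame = pd.read_csv(path, header=None)
        rows.append((os.path.basename(path), frame.iloc[row, col]))
    return pd.DataFrame(rows, columns=["Full_name", "center_value"])


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        # Made-up 5x5 "image" so the example runs without the camera files
        grid = np.arange(25).reshape(5, 5)
        np.savetxt(os.path.join(tmp, "frame_a.csv"), grid, delimiter=",")
        print(center_values(tmp, row=2, col=2))
```

For the real 288x383 files the defaults row=144, col=191 would pick the cell mentioned in the question, provided the no-header assumption holds.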
I have a folder of TDMS files (they could also be Excel files).
These are stored in 5 MB packages, but all contain the same data structure.
Unfortunately there is no absolute time in the lines and the timestamp is stored somewhat cryptically in the column "TimeStamp" in the following format
"Tues. 17.11.2020 19:20:15"
But now I would like to load each file and plot them one after the other in the same graph.
For one file this is no problem, because I simply use the index of the file for the x-axis, but if I load several files, the index in each file is the same and the data overlap.
Does anyone have an idea how I can write all the data into a DataFrame, but with a continuous timestamp, so that the data can be plotted one after the other or I can also specify a time period in which I would like to see the data?
My first approach would be as follows.
If someone could upload an example with a CSV file (pandas.read_csv) instead of the npTDMS module, it would be just as helpful!
https://nptdms.readthedocs.io/en/stable/
import pandas as pd
import matplotlib.pyplot as plt
from nptdms import TdmsFile
tdms_file = TdmsFile.read("Datei1.tdms")
tdms_groups = tdms_file.groups()
tdms_Variables_1 = tdms_file.group_channels(tdms_groups[0])
MessageData_channel_1 = tdms_file.object('Data', 'Position')
MessageData_data_1 = MessageData_channel_1.data
#MessageData_channel_2 = tdms_file.object('Data', 'Timestamp')
#MessageData_data_2 = MessageData_channel_2.data
df_y = pd.DataFrame(data=MessageData_data_1).append(df_y)
plt.plot(df_y)
Here is an example with CSV. It will first create a bunch of files that should look similar to yours in the ./data/ folder. Then it will read those files back (finding them with glob). It uses pandas.concat to combine the dataframes into 1, and then it parses the date.
import glob
import random
import pandas
import matplotlib.pyplot as plt
# Create a bunch of test files that look like your data (NOTE: my files aren't 5MB, but 100 rows)
df = pandas.DataFrame([{"value": random.randint(50, 100)} for _ in range(1000)])
df["timestamp"] = pandas.date_range(
    start="17/11/2020", periods=1000, freq="H"
).strftime(r"%a. %d.%m.%Y %H:%M:%S")
chunks = [df.iloc[i : i + 100] for i in range(0, len(df) - 100 + 1, 100)]
for index, chunk in enumerate(chunks):
    chunk[["timestamp", "value"]].to_csv(f"./data/data_{index}.csv", index=False)
# ===============
# Read the files back into a dataframe
dataframe_list = []
for file in glob.glob("./data/data_*.csv"):
    df = pandas.read_csv(file)
    dataframe_list.append(df)
# Combine all individual dataframes into 1
df = pandas.concat(dataframe_list)
# Parse the timestamp column into real datetimes
df["timestamp"] = pandas.to_datetime(df["timestamp"], format=r"%a. %d.%m.%Y %H:%M:%S")
# Use the timestamp as the index for the dataframe, and make sure it's sorted
df = df.set_index("timestamp").sort_index()
# Create the plot
plt.plot(df)
@Gijs Wobben
Thank you so much ! It works perfectly well and it will save me a lot of work !
As a mechanical engineer I don't write code like this very often, so I'm happy when people from other disciplines can help.
Here is the basic structure of how I did it directly with TDMS files; I read afterwards that the npTDMS module can return the data directly as a DataFrame, which I didn't know before.
import glob
import pandas as pd
import matplotlib.pyplot as plt
from nptdms import TdmsFile
# Read every TDMS file into its own dataframe
dataframe_list = []
for file in glob.glob("*.tdms"):
    tdms_file = TdmsFile.read(file)
    df = tdms_file['Sheet1'].as_dataframe()
    dataframe_list.append(df)
# Combine all individual dataframes into one (not just the last file read)
df = pd.concat(dataframe_list)
# Parse the timestamp column into real datetimes
df["Timestamp"] = pd.to_datetime(df["Timestamp"], format=r"%a. %d.%m.%Y %H:%M:%S")
# Use the timestamp as the index for the dataframe, and make sure it's sorted
df = df.set_index("Timestamp").sort_index()
# Create the plot
plt.plot(df)
I have the code below:
import pandas as pd
import numpy as np
# Read data from file 'filename.csv'
# (in the same directory that your python process is based)
# Control delimiters, rows, column names with read_csv (see later)
df = pd.read_csv("../example.csv", parse_dates=['DateTime'])
# Preview the first 5 lines of the loaded data
df = df.assign(Burned = df['Quantity'])
df.loc[df['To'] != '0x0000000000000000000000000000000000000000', 'Burned'] = 0.0
# OR:
df['cum_sum'] = df['Burned'].cumsum()
df['percent_burned'] = df['cum_sum']/df['Quantity'].max()*100.0
per_day = df.groupby(df['DateTime'].dt.date)['Burned'].count().reset_index(name='Trx')
The last line creates a DataFrame called per_day, which groups all the transactions in df by day. Many of the transactions in df have 'Burned' = 0. My current code counts the total number of transactions per day, but I want to exclude the transactions in which 'Burned' = 0.
Also, how can I consolidate these 3 lines of code? I don't even want to create per_day_burned. How do I make it its own column in per_day without doing it the way I did?
per_day = dg.groupby(dg['DateTime'].dt.date)['Burned'].count().reset_index(name='Trx')
per_day_burned = dg.groupby(dg['DateTime'].dt.date)['Burned'].sum().reset_index(name='Burned')
per_day['Burned'] = per_day_burned['Burned']
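One possible way to do both at once, sketched on made-up data: filter out the zero rows first, then use named aggregation so the count and the sum come out of a single groupby. The sample values below are invented, and the column names Trx and Burned simply follow the question. Note that a day whose transactions all have Burned = 0 would disappear from the result with this approach.

```python
import pandas as pd

# Invented sample standing in for the real transaction data
df = pd.DataFrame({
    "DateTime": pd.to_datetime([
        "2021-01-01 10:00", "2021-01-01 12:00",
        "2021-01-02 09:00", "2021-01-02 11:00",
    ]),
    "Burned": [0.0, 5.0, 2.0, 3.0],
})

# Keep only the transactions that actually burned something
burned_only = df[df["Burned"] != 0]

# One groupby, two output columns: number of rows and total burned per day
per_day = (
    burned_only.groupby(burned_only["DateTime"].dt.date)["Burned"]
    .agg(Trx="count", Burned="sum")
    .reset_index()
)
print(per_day)
```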
I have a for loop that I want to:
1) Make a pivot table out of the data
2) Convert the 5min data to 30min data
My code is below:
import numpy as np
import pandas as pd
import os
import glob
os.chdir('C:/Users/george/Desktop/testing/output/test')
for filename in os.listdir('C:/Users/george/Desktop/testing/output/test'):
    data = pd.read_csv(filename,skiprows=[0])
    table = pd.pivot_table(data, values='SCADAVALUE',columns=['DUID'],index='SETTLEMENTDATE', aggfunc=np.sum)
    table.to_csv(filename+'pivoted.csv')
my_csv_files = []
for file in os.listdir("C:/Users/george/Desktop/testing/output/test"):
    if file.endswith("*pivoted.csv"):
        table.set_index(table.columns[0])
        table.index = pd.to_datetime(table.index)
        table_resampled = table.resample('30min',closed='right',label='right').mean()
        table_resampled = table_resampled.reset_index()
        table.to_csv(filename+'30min.csv')
The code performs the first loop, but the second loop does not work. Why is this? What's wrong with my code?
EDIT1:
See comment below
import numpy as np
import pandas as pd
import os
import glob
os.chdir('C:/Users/george/Desktop/testing/output/test')
for filename in os.listdir('C:/Users/george/Desktop/testing/output/test'):
    data = pd.read_csv(filename,skiprows=[0])
    table = pd.pivot_table(data, values='SCADAVALUE',columns=['DUID'],index='SETTLEMENTDATE', aggfunc=np.sum)
    table.to_csv(filename+'pivoted.csv')
my_csv_files = [] # what is this variable for?
for file in os.listdir("C:/Users/george/Desktop/testing/output/test"):
    if file.endswith("*pivoted.csv"):
        # At this point you are not reading the file, but you should.
        # The 'table' variable still refers to the last iteration
        # of the 'for' loop a few lines above
        # However, better than re-reading the file, you can remove
        # the second 'for file in...' loop,
        # and just merge the code with the first loop
        table.set_index(table.columns[0])
        table.index = pd.to_datetime(table.index)
        table_resampled = table.resample('30min',closed='right',label='right').mean()
        table_resampled = table_resampled.reset_index()
        table.to_csv(filename+'30min.csv')
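A sketch of what the merged loop could look like. Two further things in the posted code are worth noting: str.endswith does a literal comparison, so "*pivoted.csv" (with the asterisk) never matches any file name, and the last line writes table rather than table_resampled. The helper names pivot_and_resample and process_folder are made up for this example, and the sample data at the bottom is invented so the resampling step can be run without the original files.

```python
import glob
import os

import pandas as pd


def pivot_and_resample(data):
    """Pivot 5-minute readings by DUID, then average them into 30-minute bins."""
    table = pd.pivot_table(
        data, values="SCADAVALUE", columns=["DUID"],
        index="SETTLEMENTDATE", aggfunc="sum",
    )
    # The index holds strings after read_csv, so convert before resampling
    table.index = pd.to_datetime(table.index)
    resampled = table.resample("30min", closed="right", label="right").mean()
    return table, resampled


def process_folder(folder):
    """Hypothetical driver: write both outputs for every input csv in folder."""
    for path in glob.glob(os.path.join(folder, "*.csv")):
        # Skip files written by an earlier run; endswith takes no wildcards
        if path.endswith(("pivoted.csv", "30min.csv")):
            continue
        data = pd.read_csv(path, skiprows=[0])
        table, resampled = pivot_and_resample(data)
        table.to_csv(path + "pivoted.csv")
        resampled.reset_index().to_csv(path + "30min.csv", index=False)


# Small made-up sample: two units, twelve 5-minute readings each
times = pd.date_range("2020-01-01 00:05", periods=12, freq="5min")
sample = pd.DataFrame({
    "SETTLEMENTDATE": list(times.astype(str)) * 2,
    "DUID": ["A"] * 12 + ["B"] * 12,
    "SCADAVALUE": list(range(12)) + list(range(100, 112)),
})
table, resampled = pivot_and_resample(sample)
print(resampled)
```

Doing both steps on the in-memory table avoids re-reading the pivoted file, which is what the comments above suggest.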
I am puzzled by the following problem. I have a set of csv files, which I parse iteratively. Before collecting the dataframes in a list, I apply some function (as simple as tmp_df*2) to each tmp_df. It all worked perfectly fine at first glance, until I realized that the results are inconsistent from run to run.
For example, when I apply df.std() I might receive for a first run:
In[2]: df1.std()
Out[2]:
some_int 15281.99
some_float 5.302293
and for a second run after:
In[3]: df2.std()
Out[3]:
some_int 15281.99
some_float 6.691013
Strangely, I do not observe inconsistencies like this when I don't manipulate the parsed data (simply commenting out tmp_df = tmp_df*2). I also noticed that for the columns with int datatypes the results are consistent from run to run, which does not hold for floats. I suspect it has to do with floating-point precision. I also cannot establish a pattern in how they vary; I might get the same results for two or three consecutive runs. Maybe someone has an idea if I am missing something here. I am working on a reproducible example and will edit asap, as I cannot share the underlying data. Maybe someone can shed some light on this in the meantime. I am using win8.1, pandas 0.17.1, python 3.4.3.
Code example:
import pandas as pd
import numpy as np
data_list = list()
csv_files = ['a.csv', 'b.csv', 'c.csv']
for csv_file in csv_files:
    # load csv_file
    tmp_df = pd.read_csv(csv_file, index_col='ID', dtype=np.float64)
    # replace infs by na
    tmp_df.replace([np.inf, -np.inf], np.nan, inplace=True)
    # manipulate tmp_df
    tmp_df = tmp_df*2
    data_list.append(tmp_df)
df = pd.concat(data_list, ignore_index=True)
df.reset_index(inplace=True)
Update:
Running the same code and data on a UX system works perfectly fine.
Edit:
I have managed to re-create the problem; the code should run on win and ux. I've tested it on win8.1, where I face the same problem with with_function=True (typically after 1-5 runs); on ux it runs without problems. with_function=False runs without differences on both win and ux. I can also reject the hypothesis that it is an int-versus-float issue, as the simulated ints differ as well...
Here is the code:
import pandas as pd
import numpy as np
from pathlib import Path
from tempfile import gettempdir
def simulate_csv_data(tmp_dir,num_files=5):
    """ simulate a csv files
    :param tmp_dir: Path, csv files are saved to
    :param num_files: int, how many csv files to simulate
    :return:
    """
    rows = 20000
    columns = 5
    np.random.seed(1282)
    for file_num in range(num_files):
        file_path = tmp_dir.joinpath(''.join(['df_', str(file_num), '.csv']))
        simulated_df = pd.DataFrame(np.random.standard_normal((rows, columns)))
        simulated_df['some_int'] = np.random.randint(0,100)
        simulated_df.to_csv(str(file_path))
def get_csv_data(tmp_dir,num_files=5, with_function=True):
    """ Collect various csv files and return a concatenated dfs
    :param tmp_dir: Path, csv files are saved to
    :param num_files: int, how many csv files to simulate
    :param with_function: Bool, apply function to tmp_dataframe
    :return:
    """
    data_list = list()
    for file_num in range(num_files):
        # current file path
        file_path = tmp_dir.joinpath(''.join(['df_', str(file_num), '.csv']))
        # load csv_file
        tmp_df = pd.read_csv(str(file_path), dtype=np.float64)
        # replace infs by na
        tmp_df.replace([np.inf, -np.inf], np.nan, inplace=True)
        # apply function to tmp_dataframe
        if with_function:
            tmp_df = tmp_df*2
        data_list.append(tmp_df)
    df = pd.concat(data_list, ignore_index=True)
    df.reset_index(inplace=True)
    return df
def main():
    # INPUT ----------------------------------------------
    num_files = 5
    with_function = True
    max_comparisons = 50
    # ----------------------------------------------------
    tmp_dir = gettempdir()
    # use temporary "non_existing" dir for new file
    tmp_csv_folder = Path(tmp_dir).joinpath('csv_files_sdfs2eqqf')
    # if exists already don't simulate data/files again
    if tmp_csv_folder.exists() is False:
        tmp_csv_folder.mkdir()
        print('Simulating temp files...')
        simulate_csv_data(tmp_csv_folder, num_files)
    print('Getting benchmark data frame...')
    df1 = get_csv_data(tmp_csv_folder, num_files, with_function)
    df_is_same = True
    count_runs = 0
    # Run until different df is found or max runs exceeded
    print('Comparing data frames...')
    while df_is_same:
        # get another data frame
        df2 = get_csv_data(tmp_csv_folder, num_files, with_function)
        count_runs += 1
        # compare data frames
        if df1.equals(df2) is False:
            df_is_same = False
            print('Found unequal df after {} runs'.format(count_runs))
            # print out a standard deviations (arbitrary example)
            print('Std Run1: \n {}'.format(df1.std()))
            print('Std Run2: \n {}'.format(df2.std()))
        if count_runs > max_comparisons:
            df_is_same = False
            print('No unequal df found after {} runs'.format(count_runs))
    print('Delete the following folder if no longer needed: "{}"'.format(
        str(tmp_csv_folder)))
if __name__ == '__main__':
    main()
Your variations are caused by something else, such as the input data changing between executions, or source code changes.
Float precision never gives different results between executions.
By the way, clean up your examples and you will find the bug. At the moment you say something about an int but display a decimal value instead!
Updating numexpr to 2.4.6 (or later) fixed it, as numexpr 2.4.4 had some bugs on Windows. After running the update it works for me.
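To see which numexpr version is installed before and after upgrading, something like the following could be used. installed_version is a hypothetical helper name; it only reads package metadata through the standard library (Python 3.8 or later) and does not import numexpr itself.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of a package, or None if absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    print("numexpr:", installed_version("numexpr"))
    print("pandas:", installed_version("pandas"))
```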