I have a folder of TDMS files (they could also be Excel files).
These are stored in 5 MB packages, but all contain the same data structure.
Unfortunately there is no absolute time in the rows, and the timestamp is stored somewhat cryptically in the column "TimeStamp" in the following format:
"Tues. 17.11.2020 19:20:15"
But now I would like to load each file and plot them one after the other in the same graph.
For one file this is no problem, because I simply use the index of the file for the x-axis, but if I load several files, the index in each file is the same and the data overlap.
Does anyone have an idea how I can write all the data into one DataFrame with a continuous timestamp, so that the data can be plotted one after the other, or so that I can specify a time period in which I would like to see the data?
My first approach would be as follows.
If someone could upload an example with a CSV file (pandas.read_csv) instead of the npTDMS module, it would be just as helpful!
https://nptdms.readthedocs.io/en/stable/
import pandas as pd
import matplotlib.pyplot as plt
from nptdms import TdmsFile
tdms_file = TdmsFile.read("Datei1.tdms")
tdms_groups = tdms_file.groups()
tdms_Variables_1 = tdms_file.group_channels(tdms_groups[0])
MessageData_channel_1 = tdms_file.object('Data', 'Position')
MessageData_data_1 = MessageData_channel_1.data
#MessageData_channel_2 = tdms_file.object('Data', 'Timestamp')
#MessageData_data_2 = MessageData_channel_2.data
df_y = pd.DataFrame(data=MessageData_data_1)
plt.plot(df_y)
Here is an example with CSV. It will first create a bunch of files in the ./data/ folder that should look similar to yours. Then it reads those files back (finding them with glob), combines the individual dataframes into one with pandas.concat, and parses the date.
import glob
import os
import random
import pandas
import matplotlib.pyplot as plt

# Create a bunch of test files that look like your data (NOTE: my files aren't 5 MB, but 100 rows each)
os.makedirs("./data", exist_ok=True)  # make sure the target folder exists
df = pandas.DataFrame([{"value": random.randint(50, 100)} for _ in range(1000)])
df["timestamp"] = pandas.date_range(
    start="17/11/2020", periods=1000, freq="H"
).strftime(r"%a. %d.%m.%Y %H:%M:%S")
chunks = [df.iloc[i : i + 100] for i in range(0, len(df) - 100 + 1, 100)]
for index, chunk in enumerate(chunks):
    chunk[["timestamp", "value"]].to_csv(f"./data/data_{index}.csv", index=False)
# ===============
# Read the files back into a dataframe
dataframe_list = []
for file in glob.glob("./data/data_*.csv"):
    df = pandas.read_csv(file)
    dataframe_list.append(df)
# Combine all individual dataframes into 1
df = pandas.concat(dataframe_list)
# Parse the timestamp column into real datetimes
df["timestamp"] = pandas.to_datetime(df["timestamp"], format=r"%a. %d.%m.%Y %H:%M:%S")
# Use the timestamp as the index for the dataframe, and make sure it's sorted
df = df.set_index("timestamp").sort_index()
# Create the plot
plt.plot(df)
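This also covers the second part of the question: once the timestamps form a sorted DatetimeIndex, selecting a time period is just a slice on the index. A small follow-up sketch (the date range is an arbitrary example from the generated data):
# partial string slicing on the sorted DatetimeIndex
subset = df.loc["2020-11-20":"2020-11-25"]
plt.plot(subset)
plt.show()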
@Gijs Wobben
Thank you so much! It works perfectly and will save me a lot of work!
As a mechanical engineer I don't write code like this very often, so I'm glad when people from other disciplines can help out.
Here is the basic structure of how I did it directly with the TDMS files. I read afterwards that the npTDMS module can read the data directly as a DataFrame, which I didn't know before:
import glob
import pandas as pd
import matplotlib.pyplot as plt
from nptdms import TdmsFile

# Read all TDMS files into a list of dataframes
dataframe_list = []
for file in glob.glob("*.tdms"):
    tdms_file = TdmsFile.read(file)
    df = tdms_file['Sheet1'].as_dataframe()
    dataframe_list.append(df)
# Combine all individual dataframes into one
df_all = pd.concat(dataframe_list)
# Parse the timestamp column correctly
df_all["Timestamp"] = pd.to_datetime(df_all["Timestamp"], format=r"%a. %d.%m.%Y %H:%M:%S")
# Use the timestamp as the index for the dataframe, and make sure it's sorted
df_all = df_all.set_index("Timestamp").sort_index()
# Create the plot
plt.plot(df_all)
Related
I want to create an algorithm to extract data from CSV files in different folders/subfolders. Each folder will have 9,000 CSVs, and we will have 12 of them: 12 × 9,000, over 100,000 files.
If the files have consistent structure (column names and column order), then dask can create a large lazy representation of the data:
from dask.dataframe import read_csv
ddf = read_csv('my_path/*/file_*.csv')
# do something
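Nothing is actually read from disk until you ask for a result; a minimal sketch of how the lazy dataframe might be materialized:
# .compute() triggers the actual reading and returns a regular pandas DataFrame
df = ddf.compute()
# or reduce lazily first, so only the aggregated result is materialized
row_count = ddf.shape[0].compute()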
This is a working solution for over 100,000 files.
Credits : Abhishek Thakur - https://twitter.com/abhi1thakur/status/1358794466283388934
import glob
import time
import pandas as pd

start = time.time()
path = 'csv_test/data/'
all_files = glob.glob(path + "/*.csv")
frames = []
for filename in all_files:
    df = pd.read_csv(filename, index_col=None, header=0)
    frames.append(df)
frame = pd.concat(frames, axis=0, ignore_index=True)
frame.to_csv('output.csv', index=False)
end = time.time()
print(end - start)
I'm not sure if it can handle 200 GB of data, though; feedback on this would be welcome.
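If the combined data turns out to be too large for memory, one option (a sketch reusing the same folder layout) is to stream each file in chunks and append to the output file instead of concatenating everything at once:
import glob
import pandas as pd

first = True
for filename in glob.glob('csv_test/data/*.csv'):
    # read each file in bounded-size pieces so memory use stays flat
    for chunk in pd.read_csv(filename, chunksize=100_000):
        chunk.to_csv('output.csv', mode='w' if first else 'a', header=first, index=False)
        first = False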
You can read CSV files using pandas and store them space-efficiently on disk:
import pandas as pd

file = "your_file.csv"
data = pd.read_csv(file)
# cast columns to the types they actually hold to save memory and disk space
data = data.astype({"column1": int})
# store as HDF5; "key" identifies the table inside the file
data.to_hdf("new_filename.hdf", "key")
Depending on the contents of your file, you can make adjustments to read_csv as described here:
https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html
Make sure that after you've read your data in as a dataframe, the column types match the values they actually hold. This way you can save a lot of memory, and later a lot of disk space when saving the dataframes. You can use astype to make these adjustments.
After you've done that, store your dataframe to disk with to_hdf.
If your data is compatible across the CSV files, you can append the dataframes onto each other into one larger dataframe.
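For many CSV files, a minimal sketch (the file pattern and key name are assumptions) is to append each dataframe to one HDF5 table as you go:
import glob
import pandas as pd

with pd.HDFStore("combined.hdf") as store:
    for file in glob.glob("your_folder/*.csv"):
        df = pd.read_csv(file)
        # format="table" allows appending; all files must share the same schema
        store.append("data", df, format="table")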
I am just starting out with data science, so apologies if this is a basic question with a simple answer, but I have been scanning Google for hours and have tried multiple solutions to no avail.
Basically, Excel has automatically adjusted some values in my dataset, such as 3-5 to 03-May. I am not able to simply change the values in Excel; instead I need to clean the data in Python. My first thought was simply to use the replace tool, i.e. df = df.replace('2019-05-03 00:00:00', '3-5'), but it doesn't work, presumably because the dtype differs between the timestamp and the str(?). It works if I adjust the code to strings only, i.e. df = df.replace('0-2', '3-5').
I can't simply treat that data as a missing value either, as it is an error in formatting rather than a spurious entry.
Is there a simple way of doing this?
Listed below is an example snippet of the data I am working with:
GitHub public gist
Please see below for the code:
# Dependencies
import io
import numpy as np
import pandas as pd
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)
pd.set_option('display.max_colwidth', None)
from google.colab import drive

# Import data
from google.colab import files
upload = files.upload()
df = pd.read_excel(io.BytesIO(upload['breast-cancer.xls']))
df

# Clean data
df.dtypes
# Correcting tumor-size and inv-nodes values
'''def clean_data(dataset):
    for i in dataset:
        dataset = dataset.replace('2019-05-03 00:00:00', '3-5')
        dataset = dataset.replace('2019-08-06 00:00:00', '6-8')
        dataset = dataset.replace('2019-09-11 00:00:00', '9-11')
        dataset = dataset.replace('2014-12-01 00:00:00', '12-14')
        dataset = dataset.replace('2014-10-01 00:00:00', '10-14')
        dataset = dataset.replace('2019-09-05 00:00:00', '5-9')
    return dataset

cleaned_dataset = dataset.apply(clean_data)
cleaned_dataset'''
df = df.replace('2019-05-03 00:00:00', '3-5')
df
#Check for duplicates
df.duplicated()
df[['tumor-size', 'inv-nodes']] = df[['tumor-size', 'inv-nodes']].astype(str)
That line of code saved the day.
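For completeness, all the mangled values can be fixed in one pass with a replacement dictionary; this is just a sketch mirroring the values listed in the commented-out attempt above:
# map each Excel-mangled date string back to the intended range
fixes = {
    '2019-05-03 00:00:00': '3-5',
    '2019-08-06 00:00:00': '6-8',
    '2019-09-11 00:00:00': '9-11',
    '2014-12-01 00:00:00': '12-14',
    '2014-10-01 00:00:00': '10-14',
    '2019-09-05 00:00:00': '5-9',
}
# cast to str first so Timestamp cells become comparable strings
df[['tumor-size', 'inv-nodes']] = df[['tumor-size', 'inv-nodes']].astype(str).replace(fixes)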
How can I convert a large .xlsx file which contains a lot of timestamps (e.g. 1537892885364) into date and time in Python, and then save it as a new .xlsx file?
I am new to Python, and I tried lots of ways to achieve this today, but I did not find a solution.
Below is the code I used, but it gives me '[Errno 13] Permission denied'. I tried different approaches, which also gave problems.
from __future__ import absolute_import, division, print_function
import os
import pandas as pd

def main(path, filename, absolute_path_organisation_structure):
    absolute_filepath = os.path.join(path, filename)
    # Relevant list formed with 4th, 5th and 6th columns
    df = pd.read_excel(absolute_filepath, header=None, parse_cols=[4, 5, 6])
    # Transform column 0 and 2 to datetime
    df[0] = pd.to_datetime(df[0])
    df[2] = pd.to_datetime(df[2])
    print(df)

path = open(r'C:\\Users\\****\\data')
MISfile = 'filename.xlsx'
main(path, MISfile, None)
Hope this helps:
# requires the following packages:
# - pandas
# - xlrd
# - openpyxl
import pandas as pd
# reading in the excel file timestamps.xlsx
# this file contains a column 'epoch' with the unix epoch timestamps
df = pd.read_excel('timestamps.xlsx')
# translate epochs into human readable and write into newly created column
# note, your timestamps are in ms, hence the unit
df['timestamp'] = pd.to_datetime(df['epoch'],unit='ms')
# write to excel file 'new_timestamps.xlsx'
# index=False prevents pandas from adding the index as a new column
df.to_excel('new_timestamps.xlsx', index=False)
I have a table as below:
How can I print all the sources that have an 'X' for a particular column? For example, if I want to specify "Make", the output should be:
Delivery
Reputation
Profitability
PS: The idea is to import the Excel file in Python and do this operation.
Use pandas:
import pandas as pd
filename = "yourexcelfile"
dataframe = pd.read_excel(filename)
frame = dataframe.loc[dataframe["make"] == "X"]
print(frame["source"])
I have a CSV file with data like:
data,data,10.00
data,data,11.00
data,data,12.00
I need to update it so each row also gets the difference from the previous row:
data,data,10.00
data,data,11.00,1.00(11.00-10.00)
data,data,12.30,1.30(12.30-11.00)
Could you help me update the CSV file using Python?
You can use pandas and numpy: pandas reads/writes the CSV and numpy does the calculations:
import pandas as pd
import numpy as np

data = pd.read_csv('test.csv', header=None)
col_data = data[2].values
# difference between consecutive values; prepend 0 for the first row
diff = np.diff(col_data)
diff = np.insert(diff, 0, 0)
data['diff'] = diff
# write data to file
data.to_csv('test1.csv', header=False, index=False)
When you open test1.csv you will find the results described above, with the addition of a zero next to the first data point (there is no previous row to subtract from).
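As a side note, the same result can be computed with pandas alone, since Series.diff gives the row-to-row difference directly:
# pandas-only equivalent: diff() leaves NaN in the first row,
# which fillna(0) turns into the same leading zero
data['diff'] = data[2].diff().fillna(0)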
For more info see the following docs:
http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html
http://pandas.pydata.org/pandas-docs/version/0.18.1/generated/pandas.DataFrame.to_csv.html