Absolute noob here with EEG analysis. I used the following code to read one subject successfully:
import mne
file = "my_path\\my_file.edf"
data = mne.io.read_raw_edf(file)
raw_data = data.get_data()
channels = data.ch_names
This works perfectly fine. But my intention is to follow along with the MNE-Python documentation from this link, where they use
raws = [read_raw_edf(f, preload=True) for f in raw_fnames]
I have a dataset of 25 subjects, all in one directory and all with the .edf extension. I am trying to append the rows from all of these recordings into one, and I can't get this to work. Can anyone shed some light on this?
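For reference, this is roughly what I have been attempting (assuming all 25 .edf files are in that one directory and that the channels match across subjects; I am not sure it is the right approach):
import glob
import mne

# assumption: all 25 .edf files live in this one folder
raw_fnames = sorted(glob.glob("my_path\\*.edf"))

# read every subject, then concatenate the recordings end to end
raws = [mne.io.read_raw_edf(f, preload=True) for f in raw_fnames]
raw = mne.concatenate_raws(raws)

raw_data = raw.get_data()
channels = raw.ch_names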
I am new to the Python world and am struggling to build a solution. The goal is to check that some mandatory information (keywords) is present in a PDF. I have an Excel file where each row corresponds to a transaction, and I need to check that every transaction (and the mandatory information related to it) is in a corresponding PDF sent during the day.
So, on one side I have several Excel rows in a sheet with the mandatory information (one per transaction), and on the other side I have a folder with several PDFs.
I am trying to extract the text of each PDF so the workflow can check whether the information for each row of my Excel file is in a single PDF. I looked at some questions raised here and tried to apply their solutions to my problem, but I haven't managed to obtain a fully working solution.
I have been able to build partial code that extracts the PDF text and looks for the keywords:
import os
from glob import glob
import re
from PyPDF2 import PdfFileReader

def search_page(pattern, page):
    yield from pattern.findall(page.extractText())

def search_document(pattern, path):
    document = PdfFileReader(path)
    for page in document.pages:
        yield from search_page(pattern, page)

searchWords = ['my list of keywords in each row of my Excel file']
pattern = re.compile(r'\b(?:%s)\b' % '|'.join(searchWords))

# the glob pattern should end with '*.pdf' to match the files in the folder
for path in glob('path of my folder with all the pdf files'):
    matches = search_document(pattern, path)
    # inspired by a solution on Stack Overflow used to count the occurrences of keywords
Also, I think that using pandas to build the list of keywords should work, but I can't use it in my previous code; the search tool wants a string, not a list.
import pandas as pd
df=pd.read_excel('path of my Excel file', sheet_name=0, usecols='G,L,R,S,Z')
print(df)  # I wanted to check that the code was selecting only the right columns, as some other columns have unnecessary information
I don't know how to build a searchWords list for each row of my Excel file and feed it into the first part of the code. Also, I don't know how to search for ALL the keywords of the list (one row in Excel), as it is mandatory to have all the information of a transaction in the same PDF. When it finds all the info, it should return "ok row 1" or something like that and then do the check for the second row, etc. (and report an error if it doesn't find all the information).
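This is roughly what I am imagining for the per-row check (hypothetical file and folder names; it reuses PdfFileReader from above and the pandas column selection, and I am not sure it is the right approach):
import os
import re
from glob import glob

import pandas as pd
from PyPDF2 import PdfFileReader

# hypothetical paths -- replace with the real Excel file and PDF folder
df = pd.read_excel('transactions.xlsx', sheet_name=0, usecols='G,L,R,S,Z')
pdf_paths = glob(os.path.join('pdf_folder', '*.pdf'))

def pdf_text(path):
    # concatenate the text of every page of one PDF
    document = PdfFileReader(path)
    return ' '.join(page.extractText() for page in document.pages)

# parse each PDF only once and cache its text
texts = {path: pdf_text(path) for path in pdf_paths}

for idx, row in df.iterrows():
    keywords = [str(value) for value in row.dropna()]
    # a row is OK only if one single PDF contains ALL of its keywords
    row_ok = any(
        all(re.search(r'\b%s\b' % re.escape(kw), text) for kw in keywords)
        for text in texts.values()
    )
    print(f"ok row {idx + 1}" if row_ok else f"error row {idx + 1}")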
P.S.: Originally, I only wanted to extract the data with Python code and add it to an Alteryx workflow, but the Python tool of Alteryx doesn't accept some packages at my company.
I would be very thankful for any help!
I am using the book Forecasting: Methods and Applications by Makridakis, Wheelwright and Hyndman. I want to do the exercises along the way, but in Python, not R (as suggested in the book).
I do not know how to use R. I know that the datasets are available from an R package, fma. This is the link to the package.
Is there a possible script, in R or Python, which will allow me to download the datasets as .csv files? That way, I will be able to access them using Python.
one possibility:
## install and load package:
install.packages('fma')
library('fma')
## list example data of package fma:
data(package = 'fma')
## export single data as csv:
write.csv(cement, file = 'cement.csv')
## bulk export:
## data names are in `[,3]`rd column of list member "results"
## of `data(...)` output
for (data_name in data(package = 'fma')[['results']][,3]){
    write.csv(get(data_name), file = paste0(data_name, '.csv'))
}
Edit:
As Anirban noted, attaching the package {fma} exposes only a few datasets to the search path. The data can be obtained by cloning or downloading from Rob J. Hyndman's source (click the green Code button and choose a download option). The subfolder 'data' contains each dataset as an .rda file which can be load()ed and converted. (Observe the licence conditions - GPL-3.0 - and acknowledge the authors' efforts anyway.)
That said, you could load and convert the data like this:
setwd('path/to/fma-master/data')
for(data_name in dir()){
    cat(paste0('converting ', data_name, '... '))
    load(data_name)
    object_name <- gsub('\\.rda', '', data_name)
    ## write.csv() overwrites an existing file by default
    ## (and, unlike write.table(), does not accept an append argument)
    write.csv(get(object_name),
              file = paste0(object_name, '.csv'),
              row.names = FALSE)
}
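Once exported, the CSV files can be read from Python, e.g. (a minimal sketch, assuming 'cement.csv' was written to the working directory by the R loop above):
import pandas as pd

# 'cement.csv' as produced by the R export above
cement = pd.read_csv('cement.csv')
print(cement.head())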
Assume we have a folder with HDF5 files generated by pandas.to_hdf. I would like to create one master.h5 file that contains external links to all the DataFrames.
According to the documentation of h5py, the standard way to do this is
myfile = h5py.File('master.h5','w')
myfile['ext link'] = h5py.ExternalLink("some_sub_file.h5", "/path/to/resource")
But files generated by pandas.to_hdf contain not just datasets but also groups (h5py.Group objects). How exactly would you then set up the external link so that it works?
Links can point to any object in the HDF5 data structure (datasets or groups). The file object is a special form of a group, called the root group and referenced with '/'. So, to link to a whole file, use: h5py.ExternalLink(filename, '/').
You didn't say if you want a link for each dataframe/dataset in each file, or links for each file. It's simpler to create links to the file root groups. If you create individual links to the datasets, be sure you assign unique names.
There are two answers that demonstrate each method. The questions were not specifically about h5py.ExternalLink(), but my answers to both used external links. See these answers:
HDF5 Attributes of External Links: Creates links to the root group in multiple files. (Each file only has 1 dataset... but your process would be identical.)
I/O Issues in Loading Several Large H5PY Files: Creates links to multiple datasets in multiple files. (Requires unique dataset names to work "as-is". Can be modified if names are not unique.)
I modified the code from the second answer (70089964) to show how to create 3 external links from the master file to the root group in 3 files (where each file has 5 datasets).
Code to create 3 example files:
import h5py
import numpy as np

for fcnt in range(3):
    fname = f'file_{fcnt+1}.h5'
    with h5py.File(fname, 'w') as h5fw:
        for dscnt in range(1, 6, 1):
            arr = np.random.random(10).reshape(5, 2)
            h5fw.create_dataset(f'data_{fcnt*10+dscnt:02}', data=arr*dscnt)
Code to create links from the master file to the 3 files:
import h5py

fnames = ['file_1.h5', 'file_2.h5', 'file_3.h5']
with h5py.File(f'master_{len(fnames)}_links.h5', 'w') as h5fw:
    for fname in fnames:
        # link to the root group of each file
        h5fw[fname] = h5py.ExternalLink(fname, '/')
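For comparison, a per-dataset variant would iterate the keys of each source file and link to them individually (a sketch; it assumes the dataset names are unique across all files, as they are in the example above):
import h5py

fnames = ['file_1.h5', 'file_2.h5', 'file_3.h5']
with h5py.File('master_dataset_links.h5', 'w') as h5fw:
    for fname in fnames:
        with h5py.File(fname, 'r') as h5fr:
            for ds_name in h5fr.keys():
                # dataset names must be unique across all source files
                h5fw[ds_name] = h5py.ExternalLink(fname, ds_name)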
Additional research on Pandas and HDF5 links revealed an interesting discovery: there are limitations with links (you can create them with Pandas, but Pandas can't access the linked data). The links themselves are there and work fine with HDFView, h5py and PyTables. Reference these GitHub issues:
Pandas hdf functions should support the hdf5 ExternalLink functionality when reading/writing - Issue #6019
Presence of softlink in HDF5 file breaks HDFStore.keys() - Issue #20523
Status for both appears to be Open. My tests confirm previously reported errors.
The code below shows how to create both link types. It also shows the error message you will get when you try to access the linked data. (The error message is: KeyError: 'you cannot get attributes from this 'NoAttrs' instance'. This is due to HDF5 attribute restrictions on links: an HDFStore node has some required attributes, and the result is the 'NoAttrs' message when Pandas tries to read them.)
import pandas as pd

df1 = pd.DataFrame({"a": [1, 2, 3, 4], "b": [11, 12, 13, 14]})
print(df1.to_string())

# Create file 1 with simple dataframe
f1 = "test_1.hdf"
with pd.HDFStore(f1, mode="w") as hdf1:
    hdf1.put("/key1", df1)

# Create file 2 with external link
f2 = "test_extlink.hdf"
with pd.HDFStore(f2, mode="w") as hdf2:
    hdf2._handle.create_external_link(hdf2._handle.root, "extlink_key1", f"{f1}:/key1")
    print("Successful external link write")

with pd.HDFStore(f2, mode="r") as hdf2:
    print(hdf2.keys())  # Notice that [] (no keys) is printed
    # following lines will trigger the 'NoAttrs' error message
    # df2test = pd.read_hdf(f2, key="extlink_key1")
    # print(df2test.to_string())
    print("End external link read")

# Create file 3 with simple dataframe and symbolic (soft) link
f3 = "test_symlink.hdf"
with pd.HDFStore(f3, mode="w") as hdf3:
    hdf3.put("/key1", df1)
    hdf3._handle.create_soft_link(hdf3._handle.root, "symlink_key1", "/key1")
    print("Successful symbolic link write")

with pd.HDFStore(f3, mode="r") as hdf3:
    print(hdf3.keys())  # Notice that only ['key1'] is printed
    # following lines will trigger the 'NoAttrs' error message
    # df3test = pd.read_hdf(f3, key="symlink_key1")
    # print(df3test.to_string())
    print("End symbolic link read")
I have a VTK file that correctly populates the data in ParaView.
However, when I open that same file with VTK's Python API, I cannot for the life of me seem to find these same labeled datasets. Here's what I've tried:
import vtk
from vtk.numpy_interface import dataset_adapter as dsa
reader = vtk.vtkUnstructuredGridReader()
reader.SetFileName('test.vtk')
reader.Update()
adapter = dsa.WrapDataObject(reader.GetOutput())
print(adapter.PointData.keys()) # ['hu', 'disp']
print(adapter.CellData.keys()) # []
print(adapter.FieldData.keys()) # []
So, it seems that ParaView is able to identify the other datasets beyond just 'hu' and 'disp', but I cannot seem to find them in the corresponding Python object.
I'm assuming it's there somewhere. Anyone know why they, e.g., 'meanstress', don't appear as keys?
You need to ask the reader to read all the data.
reader.ReadAllScalarsOn()
reader.ReadAllVectorsOn()
...
Depending on which kind of data you are trying to load (scalars, vectors, tensors, ...). See the whole list: https://vtk.org/doc/nightly/html/classvtkDataReader.html#a831f470c6fbfc6e7209a1243ccb546e2
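For example, applied to the reader from the question (the remaining ReadAll*On() methods follow the same pattern):
import vtk
from vtk.numpy_interface import dataset_adapter as dsa

reader = vtk.vtkUnstructuredGridReader()
reader.SetFileName('test.vtk')
# request every array of each attribute type before Update();
# otherwise the legacy reader only loads one array per type
reader.ReadAllScalarsOn()
reader.ReadAllVectorsOn()
reader.ReadAllTensorsOn()
reader.ReadAllFieldsOn()
reader.Update()

adapter = dsa.WrapDataObject(reader.GetOutput())
print(adapter.PointData.keys())  # extra arrays such as 'meanstress' should now appear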
How can I read EDF data using Python? I want to analyze the data of an EDF file, but I cannot read it using pyEDFlib. It threw the error "OSError: The file is discontinous and cannot be read", and I'm not sure why.
I assume that your data are biological time series like EEG. Is this correct? If so, you can use the MNE library.
You have to install it first; since it is not a standard library, take a look here. Then you can use the read_raw_edf() method.
For example:
import mne
file = "my_path\\my_file.edf"
data = mne.io.read_raw_edf(file)
raw_data = data.get_data()
# you can get the metadata included in the file and a list of all channels:
info = data.info
channels = data.ch_names
See the documentation in the links above for other properties of the data object.