I have four lists containing image filenames, where each filename has the format:
(disease)-(randomized patient ID)-(image number by this patient)
A single patient can have multiple images per disease.
See these slices below:
print(train_cnv_list[0:3])
print(train_dme_list[0:3])
print(train_drusen_list[0:3])
print(train_normal_list[0:3])
>>>
['CNV-9911627-77.jpeg', 'CNV-9935363-45.jpeg', 'CNV-9911627-94.jpeg']
['DME-8889850-2.jpeg', 'DME-8773471-3.jpeg', 'DME-8797076-11.jpeg']
['DRUSEN-8986660-50.jpeg', 'DRUSEN-9100857-3.jpeg', 'DRUSEN-9025088-5.jpeg']
['NORMAL-9490249-31.jpeg', 'NORMAL-9509694-5.jpeg', 'NORMAL-9504376-3.jpeg']
I'd like to figure out:
How many images are there per patient / per list?
Is there any overlap in the "randomized patient ID" across the four lists? If so, can I aggregate that into some kind of report (patient, disease, number of images) using something like groupby?
patient - disease1 - total number of images
- disease2 - total number of images
- disease3 - total number of images
where the total number of images is max(image number by this patient).
I did see that this yields a patient id:
train_cnv_list[0][4:11]
>>> 9911627
Thanks, in advance, for any guidance.
You can do it easily with Pandas:
import pandas as pd
cnv_list=['CNV-9911627-77.jpeg', 'CNV-9935363-45.jpeg', 'CNV-9911627-94.jpeg']
dme_list=['DME-8889850-2.jpeg', 'DME-8773471-3.jpeg', 'DME-8797076-11.jpeg']
dru_list=['DRUSEN-8986660-50.jpeg', 'DRUSEN-9100857-3.jpeg', 'DRUSEN-9025088-5.jpeg']
nor_list=['NORMAL-9490249-31.jpeg', 'NORMAL-9509694-5.jpeg', 'NORMAL-9504376-3.jpeg']
data =[]
data.extend(cnv_list)
data.extend(dme_list)
data.extend(dru_list)
data.extend(nor_list)
df = pd.DataFrame(data, columns=["files"])
df["files"]=df["files"].str.replace ('.jpeg','')
df=df["files"].str.split('-', expand=True).rename(columns={0:"disease",1:"PatientID",2:"pictureName"})
res = df.groupby(['PatientID', 'disease'])['pictureName'].count()
print(res)
Result:
PatientID disease
8773471 DME 1
8797076 DME 1
8889850 DME 1
8986660 DRUSEN 1
9025088 DRUSEN 1
9100857 DRUSEN 1
9490249 NORMAL 1
9504376 NORMAL 1
9509694 NORMAL 1
9911627 CNV 2
9935363 CNV 1
and you can do even more now that you have a DataFrame...
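For example, a small sketch of how you might reshape res into the per-patient report layout from the question (one row per patient, one column per disease):
# Pivot diseases into columns; missing (patient, disease) pairs become 0
report = res.unstack(fill_value=0)
print(report)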
Here are a few functions that might get you on the right track, but as @rick-supports-monica mentioned, this is a great use case for pandas. You'll have an easier time manipulating data.
def contains_duplicate_ids(img_list):
    patient_ids = []
    for image in img_list:
        patient_id = image.split('.')[0].split('-')[1]
        patient_ids.append(patient_id)
    if len(set(patient_ids)) == len(patient_ids):
        return False
    return True

def get_duplicates(img_list):
    patient_ids = []
    duplicates = []
    for image in img_list:
        patient_id = image.split('.')[0].split('-')[1]
        if patient_id in patient_ids:
            duplicates.append(patient_id)
        patient_ids.append(patient_id)
    return duplicates

def count_images(img_list):
    return len(set(img_list))
From get_duplicates you can use the patient IDs returned to look up whatever you want from there. I'm not sure I completely understand the structure of the lists. It looks like {disease}-{patient_id}-{some_other_int}.jpeg. I'm not sure how to add additional lookups to the functionality without understanding the input a bit more.
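As a quick sketch of how these might be used on the question's sample lists:
print(contains_duplicate_ids(train_cnv_list))  # True: 9911627 appears twice
print(get_duplicates(train_cnv_list))          # ['9911627']
print(count_images(train_cnv_list))            # 3 distinct filenames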
I mentioned pandas, but didn't mention how to use it, here's one way you could get your existing data into a dataframe:
import pandas as pd
# Sample data
train_cnv_list = ['CNV-9911627-77.jpeg', 'CNV-9935363-45.jpeg', 'CNV-9911628-94.jpeg', 'CNM-9911629-94.jpeg']
train_dme_list = ['DME-8889850-2.jpeg', 'DME-8773471-3.jpeg', 'DME-8797076-11.jpeg']
train_drusen_list = ['DRUSEN-8986660-50.jpeg', 'DRUSEN-9100857-3.jpeg', 'DRUSEN-9025088-5.jpeg']
train_normal_list = ['NORMAL-9490249-31.jpeg', 'NORMAL-9509694-5.jpeg', 'NORMAL-9504376-3.jpeg']
# Convert list to dataframe
def dataframe_from_list(img_list):
    df = pd.DataFrame(img_list, columns=['filename'])
    df['disease'] = [filename.split('.')[0].split('-')[0] for filename in img_list]
    df['patient_id'] = [filename.split('.')[0].split('-')[1] for filename in img_list]
    df['some_other_int'] = [filename.split('.')[0].split('-')[2] for filename in img_list]
    return df
# Generate a dataframe for each list
cnv_df = dataframe_from_list(train_cnv_list)
dme_df = dataframe_from_list(train_dme_list)
drusen_df = dataframe_from_list(train_drusen_list)
normal_df = dataframe_from_list(train_normal_list)
# or combine them into one long dataframe
df = pd.concat([cnv_df, dme_df, drusen_df, normal_df], ignore_index=True)
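From the combined dataframe, one possible way to get the (patient, disease, number of images) report asked about in the question:
# Count images per (patient_id, disease) pair
report = df.groupby(['patient_id', 'disease'])['filename'].count()
print(report)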
Start by creating a well-defined data structure, then use a Counter to answer your first question.
from typing import NamedTuple
from collections import Counter,defaultdict
class FileInfo(NamedTuple):
    disease: str
    patient_id: str
    image_id: str
l1 = ['CNV-9911627-77.jpeg', 'CNV-9935363-45.jpeg', 'CNV-9911627-94.jpeg']
l2 = ['DME-8889850-2.jpeg', 'DME-8773471-3.jpeg', 'DME-8797076-11.jpeg']
l3 = ['DRUSEN-8986660-50.jpeg', 'DRUSEN-9100857-3.jpeg', 'DRUSEN-9025088-5.jpeg']
l4 = ['NORMAL-9490249-31.jpeg', 'NORMAL-9509694-5.jpeg', 'NORMAL-9504376-3.jpeg']
lists = [l1,l2,l3,l4]
data_lists = []
for l in lists:
    data_lists.append([FileInfo(*f[:-5].split('-')) for f in l])
counters = []
for l in data_lists:
    counters.append(Counter(fi.patient_id for fi in l))
print(counters)
print('-----------')
cross_lists_data = dict()
for l in data_lists:
    for file_info in l:
        if file_info.patient_id not in cross_lists_data:
            cross_lists_data[file_info.patient_id] = defaultdict(int)
        cross_lists_data[file_info.patient_id][file_info.disease] += 1
print(cross_lists_data)
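For the sample lists, the cross-list report should come out along these lines (abbreviated):
{'9911627': defaultdict(<class 'int'>, {'CNV': 2}),
 '9935363': defaultdict(<class 'int'>, {'CNV': 1}),
 '8889850': defaultdict(<class 'int'>, {'DME': 1}),
 ...}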
Start by concatenating your data
import pandas as pd
import numpy as np
train_cnv_list = ['CNV-9911627-77.jpeg', 'CNV-9935363-45.jpeg', 'CNV-9911627-94.jpeg']
train_dme_list = ['DME-8889850-2.jpeg', 'DME-8773471-3.jpeg', 'DME-8797076-11.jpeg']
train_drusen_list = ['DRUSEN-8986660-50.jpeg', 'DRUSEN-9100857-3.jpeg', 'DRUSEN-9025088-5.jpeg']
train_normal_list = ['NORMAL-9490249-31.jpeg', 'NORMAL-9509694-5.jpeg', 'NORMAL-9504376-3.jpeg']
train_data = np.array([
train_cnv_list,
train_dme_list,
train_drusen_list,
train_normal_list
])
Create a Series with the flattened array
>>> train = pd.Series(train_data.flat)
>>> train
0 CNV-9911627-77.jpeg
1 CNV-9935363-45.jpeg
2 CNV-9911627-94.jpeg
3 DME-8889850-2.jpeg
4 DME-8773471-3.jpeg
5 DME-8797076-11.jpeg
6 DRUSEN-8986660-50.jpeg
7 DRUSEN-9100857-3.jpeg
8 DRUSEN-9025088-5.jpeg
9 NORMAL-9490249-31.jpeg
10 NORMAL-9509694-5.jpeg
11 NORMAL-9504376-3.jpeg
dtype: object
Use Series.str.extract together with regex to extract the information from the filenames and separate it into different columns
>>> pat = r'(?P<Disease>\w+)-(?P<Patient_ID>\d+)-(?P<IMG_ID>\d+)\.jpeg'
>>> train = train.str.extract(pat)
>>> train
Disease Patient_ID IMG_ID
0 CNV 9911627 77
1 CNV 9935363 45
2 CNV 9911627 94
3 DME 8889850 2
4 DME 8773471 3
5 DME 8797076 11
6 DRUSEN 8986660 50
7 DRUSEN 9100857 3
8 DRUSEN 9025088 5
9 NORMAL 9490249 31
10 NORMAL 9509694 5
11 NORMAL 9504376 3
Finally, aggregate the data and compute the total number of images per group based on the maximum IMG_ID. Cast IMG_ID to int first: str.extract returns strings, and max on strings compares lexicographically (e.g. '9' > '11').
>>> report = train.astype({'IMG_ID': int}).groupby(["Patient_ID","Disease"])['IMG_ID'].agg(Total_IMGs="max")
>>> report
Total_IMGs
Patient_ID Disease
8773471 DME 3
8797076 DME 11
8889850 DME 2
8986660 DRUSEN 50
9025088 DRUSEN 5
9100857 DRUSEN 3
9490249 NORMAL 31
9504376 NORMAL 3
9509694 NORMAL 5
9911627 CNV 94
9935363 CNV 45
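The question also asks whether any patient ID overlaps across the four lists. With the same extracted frame, a short sketch:
>>> diseases_per_patient = train.groupby('Patient_ID')['Disease'].nunique()
>>> diseases_per_patient[diseases_per_patient > 1]  # empty for this sample: no ID spans two diseases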
I have a .dat file which looks something like the below....
#| step | Channel| Mode | Duration|Freq.| Amplitude | Phase|
0 1 AWG Pi/2 100 2 1
1 1 SIN^2 100 1 1
2 1 SIN^2 200 0.5 1
3 1 REC 50 100 1 1
100 0 REC Pi/2 150 1 1
I created a DataFrame and wanted to extract data from it, but I get an error:
TypeError: expected str, bytes or os.PathLike object, not DataFrame
My code is below:
import pandas as pd
import numpy as np
path = "updated.dat"
datContent = [i.strip().split() for i in open(path).readlines()]
#print(datContent)
column_names = datContent.pop(0)
print(column_names)
df = pd.DataFrame(datContent)
print(df)
extract_column = df.iloc[:,2]
with open(df, 'r') as openfile:
    for line in openfile:
        for column_search in line:
            column_search = df.iloc[:,2]
            if "REC" in column_search:
                print("Rec found")
Any suggestion would be appreciated
Since your post does not have any clear question, I have to guess based on your code. I am assuming that what you want to get is to find all rows in DataFrame where column Mode contains value REC.
Based on that, I prepared a small, self contained example that works on your data.
In your situation, the only line that you should use is the last one. Assuming that your DataFrame is created and filled correctly, your code below print(df) can be replaced by this single line.
I would really recommend reading the official documentation about indexing and selecting data from DataFrames: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html
import pandas as pd
from io import StringIO
data = StringIO("""
no;step;Channel;Mode;Duration;Freq.;Amplitude;Phase
;0;1;AWG;Pi/2;100;2;1
;1;1;SIN^2;;100;1;1
;2;1;SIN^2;;200;0.5;1
;3;1;REC;50;100;1;1
;100;0;REC;Pi/2;150;1;1
""")
df = pd.read_csv(data, sep=";")
df.loc[df.loc[:, 'Mode'] == "REC", :]
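Applied to the intent of the original loop, a sketch:
rec_rows = df.loc[df['Mode'] == 'REC']
if not rec_rows.empty:
    print("Rec found")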
As the title suggests, I am trying to display a progress bar while performing pandas.to_csv.
I have the following script:
def filter_pileup(pileup, output, lists):
    tqdm.pandas(desc='Reading, filtering, exporting', bar_format=BAR_DEFAULT_VIEW)
    # Reading files
    pileup_df = pd.read_csv(pileup, sep='\t', header=None).progress_apply(lambda x: x)
    lists_df = pd.read_csv(lists, sep='\t', header=None).progress_apply(lambda x: x)
    # Filtering pileup
    intersection = pd.merge(pileup_df, lists_df, on=[0, 1]).progress_apply(lambda x: x)
    intersection.columns = [i for i in range(len(intersection.columns))]
    intersection = intersection.loc[:, 0:5]
    # Exporting filtered pileup
    intersection.to_csv(output, header=None, index=None, sep='\t')
For the first few lines I found a way to integrate a progress bar, but this method doesn't work for the last line. How can I achieve that?
You can divide the dataframe into chunks of n rows and save the dataframe to a csv chunk by chunk, using mode='w' for the first chunk and mode='a' for the rest:
Example:
import numpy as np
import pandas as pd
from tqdm import tqdm
df = pd.DataFrame(data=[i for i in range(0, 10000000)], columns = ["integer"])
print(df.head(10))
chunks = np.array_split(df.index, 100) # split into 100 chunks
for chunk_idx, subset in enumerate(tqdm(chunks)):
    if chunk_idx == 0:  # first chunk: write a new file, with header
        df.loc[subset].to_csv('data.csv', mode='w', index=True)
    else:  # remaining chunks: append, without repeating the header
        df.loc[subset].to_csv('data.csv', header=None, mode='a', index=True)
Output:
integer
0 0
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
9 9
100%|██████████| 100/100 [00:12<00:00, 8.12it/s]
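If this comes up often, the same pattern can be wrapped in a small helper (a sketch reusing the imports above; to_csv_with_progress and chunk_count are names of my choosing):
def to_csv_with_progress(df, path, chunk_count=100):
    # Split the index into roughly equal chunks, then append one chunk at a time
    chunks = np.array_split(df.index, chunk_count)
    for i, subset in enumerate(tqdm(chunks)):
        df.loc[subset].to_csv(path, mode='w' if i == 0 else 'a',
                              header=(i == 0), index=True)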
I have a 100M line csv file (actually many separate csv files) totaling 84GB. I need to convert it to a HDF5 file with a single float dataset. I used h5py in testing without any problems, but now I can't do the final dataset without running out of memory.
How can I write to HDF5 without having to store the whole dataset in memory? I'm expecting actual code here, because it should be quite simple.
I was just looking into pytables, but it doesn't look like the array class (which corresponds to a HDF5 dataset) can be written to iteratively. Similarly, pandas has read_csv and to_hdf methods in its io_tools, but I can't load the whole dataset at one time so that won't work. Perhaps you can help me solve the problem correctly with other tools in pytables or pandas.
Use append=True in the call to to_hdf:
import numpy as np
import pandas as pd
filename = '/tmp/test.h5'
df = pd.DataFrame(np.arange(10).reshape((5,2)), columns=['A', 'B'])
print(df)
# A B
# 0 0 1
# 1 2 3
# 2 4 5
# 3 6 7
# 4 8 9
# Save to HDF5
df.to_hdf(filename, 'data', mode='w', format='table')
del df # allow df to be garbage collected
# Append more data
df2 = pd.DataFrame(np.arange(10).reshape((5,2))*10, columns=['A', 'B'])
df2.to_hdf(filename, 'data', append=True)
print(pd.read_hdf(filename, 'data'))
yields
A B
0 0 1
1 2 3
2 4 5
3 6 7
4 8 9
0 0 10
1 20 30
2 40 50
3 60 70
4 80 90
Note that you need to use format='table' in the first call to df.to_hdf to make the table appendable. Otherwise, the format is 'fixed' by default, which is faster for reading and writing, but creates a table which can not be appended to.
Thus, you can process each CSV one at a time, use append=True to build the hdf5 file. Then overwrite the DataFrame or use del df to allow the old DataFrame to be garbage collected.
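A sketch of that loop, assuming csv_files holds the paths to your separate CSV files:
import pandas as pd

csv_files = ['part1.csv', 'part2.csv']  # hypothetical paths
for path in csv_files:
    df = pd.read_csv(path)
    # format='table' keeps the store appendable across iterations
    df.to_hdf('/tmp/test.h5', 'data', format='table', append=True)
    del df  # let each chunk be garbage collected before the next read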
Alternatively, instead of calling df.to_hdf, you could append to a HDFStore:
import numpy as np
import pandas as pd
filename = '/tmp/test.h5'
store = pd.HDFStore(filename)
for i in range(2):
    df = pd.DataFrame(np.arange(10).reshape((5,2)) * 10**i, columns=['A', 'B'])
    store.append('data', df)
store.close()
store = pd.HDFStore(filename)
data = store['data']
print(data)
store.close()
yields
A B
0 0 1
1 2 3
2 4 5
3 6 7
4 8 9
0 0 10
1 20 30
2 40 50
3 60 70
4 80 90
This should be possible with PyTables. You'll need to use the EArray class though.
As an example, the following is a script I wrote to import chunked training data stored as .npy files into a single .h5 file.
import numpy
import tables
import os
training_data = tables.open_file('nn_training.h5', mode='w')
a = tables.Float64Atom()
bl_filter = tables.Filters(5, 'blosc') # fast compressor at a moderate setting
training_input = training_data.create_earray(training_data.root, 'X', a,
                                             (0, 1323), 'Training Input',
                                             bl_filter, 4000000)
training_output = training_data.create_earray(training_data.root, 'Y', a,
                                              (0, 27), 'Training Output',
                                              bl_filter, 4000000)
for filename in os.listdir('input'):
    print("loading {}...".format(filename))
    a = numpy.load(os.path.join('input', filename))
    print("writing to h5")
    training_input.append(a)

for filename in os.listdir('output'):
    print("loading {}...".format(filename))
    training_output.append(numpy.load(os.path.join('output', filename)))
Take a look at the docs for detailed instructions, but very briefly, the create_earray function takes 1) a data root or parent node; 2) an array name; 3) a datatype atom; 4) a shape with a 0 in the dimension you want to expand; 5) a verbose descriptor; 6) a compression filter; and 7) an expected number of rows along the expandable dimension. Only the first two are required, but you'll probably use all seven in practice. The function accepts a few other optional arguments as well; again, see the docs for details.
Once the array is created, you can use its append method in the expected way.
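For the CSV case in the question, a sketch of the same pattern (the column count, chunk size, and file names are assumptions to adapt):
import pandas as pd
import tables

NUM_COLUMNS = 10        # assumed width of the float dataset
CHUNK_ROWS = 1_000_000  # rows per read; tune to available memory

h5 = tables.open_file('data.h5', mode='w')
atom = tables.Float64Atom()
earray = h5.create_earray(h5.root, 'data', atom, (0, NUM_COLUMNS))

for chunk in pd.read_csv('data.csv', header=None, chunksize=CHUNK_ROWS):
    earray.append(chunk.to_numpy(dtype='float64'))
h5.close()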
If you have a very large single CSV file, you may want to stream the conversion to hdf, e.g.:
import numpy as np
import pandas as pd
from IPython.display import clear_output
CHUNK_SIZE = 5000000
filename = 'data.csv'
dtypes = {'latitude': float, 'longitude': float}
iter_csv = pd.read_csv(
    filename, iterator=True,
    dtype=dtypes, encoding='utf-8', chunksize=CHUNK_SIZE)
cnt = 0
for ix, chunk in enumerate(iter_csv):
    chunk.to_hdf(
        "data.hdf", 'data', format='table', append=True)
    cnt += CHUNK_SIZE
    clear_output(wait=True)
    print(f"Processed {cnt:,.0f} coordinates..")
Tested with a 64GB CSV file and 450 Million coordinates (about 10 Minutes conversion).
I have a data file with multiple rows, and 8 columns - I want to average column 8 of rows that have the same data on columns 1, 2, 5 - for example my file can look like this:
564645 7371810 0 21642 1530 1 2 30.8007
564645 7371810 0 21642 8250 1 2 0.0103
564645 7371810 0 21643 1530 1 2 19.3619
I want to average the last column of the first and third row since columns 1-2-5 are identical;
I want the output to look like this:
564645 7371810 0 21642 1530 1 2 25.0813
564645 7371810 0 21642 8250 1 2 0.0103
my files (text files) are pretty big (~10000 lines) and redundant data (based on the above rule) are not in regular intervals - so I want the code to find the redundant data, and average them...
In response to larsks' comment, here are my four lines of code...
import os
import numpy as np
datadirectory = input('path to the data directory, ')
os.chdir( datadirectory)
##READ DATA FILE AND CREATE AN ARRAY
dataset = open(input('dataset_to_be_used, ')).readlines()
data = np.loadtxt(dataset)
##Sort the data based on common X, Y and frequency
datasort = np.lexsort((data[:,0],data[:,1],data[:,4]))
datasorted = data[datasort]
You can use pandas to do this quickly:
import pandas as pd
from io import StringIO
data = StringIO("""564645 7371810 0 21642 1530 1 2 30.8007
564645 7371810 0 21642 8250 1 2 0.0103
564645 7371810 0 21643 1530 1 2 19.3619
""")
# Name the eight columns X.1..X.8 so the groupby keys below are explicit
df = pd.read_csv(data, sep=r"\s+", header=None, names=[f"X.{i}" for i in range(1, 9)])
df.groupby(["X.1","X.2","X.5"])["X.8"].mean()
the output is:
X.1     X.2      X.5
564645  7371810  1530    25.0813
                 8250     0.0103
Name: X.8, dtype: float64
if you don't need index, you can call:
df.groupby(["X.1","X.2","X.5"])["X.8"].mean().reset_index()
this will give the result as:
X.1 X.2 X.5 X.8
0 564645 7371810 1530 25.0813
1 564645 7371810 8250 0.0103
OK, based on Hury's input I updated the code:
import os #needed system utils
import numpy as np# for array data processing
import pandas as pd #import the pandas module
datadirectory = input('path to the data directory, ')
working = os.environ.get("WORKING_DIRECTORY", datadirectory)
os.chdir( working)
##READ DATA FILE AND and convert it to string
dataset = open(input('dataset_to_be_used, ')).readlines()
data = ''.join(dataset)
df = pd.read_csv(data, sep="\\s+", header=None)
sorted_data = df.groupby(["X.1","X.2","X.5"])["X.8"].mean().reset_index()
tuple_data = [tuple(x) for x in sorted_data.values]
datas = np.asarray(tuple_data)
This worked with the test data as posted by Hury, but when I use my file, everything after the df = ... line does not seem to work (I get an output like):
Traceback (most recent call last):
File "/media/DATA/arxeia/Programming/MyPys/data_refine_average.py", line 31, in
df = pd.read_csv(data, sep="\s+", header=None)
File "/usr/lib64/python2.7/site-packages/pandas/io/parsers.py", line 187, in read_csv
return _read(TextParser, filepath_or_buffer, kwds)
File "/usr/lib64/python2.7/site-packages/pandas/io/parsers.py", line 141, in _read
f = com._get_handle(filepath_or_buffer, 'r', encoding=encoding)
File "/usr/lib64/python2.7/site-packages/pandas/core/common.py", line 673, in _get_handle
f = open(path, mode)
IOError: [Errno 36] File name too long: '564645\t7371810\t0\t21642\t1530\t1\t2\t30.8007\r\n564645\t7371810\t0\t21642\t8250\t1\t2\t0.0103\r\n564645\t7371810\t0\t21642\t20370\t1\t2\t0.0042\r\n564645\t7371810\t0\t21642\t33030\t1\t2\t0.0026\r\n564645\t7371810\t0\t21642\t47970\t1\t2\t0.0018\r\n564645\t7371810\t0\t21642\t63090\t1\t2\t0.0013\r\n564645\t7371810\t0\t21642\t93090\t1\t2\t0.0009\r\n564645\t7371810\t0\t216..........
Any ideas?
It's not the most elegant of answers, and I have no idea how fast/efficient it is, but I believe it gets the job done based on the information you provided:
import numpy
data_file = "full_location_of_data_file"
data_dict = {}
for line in open(data_file):
    line = line.rstrip()
    columns = line.split()
    entry = [columns[0], columns[1], columns[4]]
    entry = "-".join(entry)
    try:  # valid if we have already seen this combination of columns 1, 2, 5
        data_dict[entry].append(float(columns[7]))
    except KeyError:  # KeyError the first time a combination of columns 1, 2, 5 appears
        data_dict[entry] = [float(columns[7])]
for entry in data_dict:
    value = numpy.mean(data_dict[entry])
    output = entry.split("-")
    output.append(str(value))
    output = "\t".join(output)
    print(output)
I'm unclear if you want/need columns 3, 6, or 7, so I omitted them. In particular, you do not make clear how you want to deal with differing values that may exist within them. If you can elaborate on what behavior you want (i.e. default to a certain value, or keep the first occurrence), I'd suggest either filling in with default values or storing the first instance in a dictionary of dictionaries rather than a dictionary of lists; a sketch of the latter follows.
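A sketch of the dictionary-of-dictionaries variant, keeping the first occurrence of the other columns (reusing data_file and the numpy import above; the "row"/"values" keys are names of my choosing):
data_dict = {}
for line in open(data_file):
    columns = line.rstrip().split()
    key = "-".join([columns[0], columns[1], columns[4]])
    if key not in data_dict:
        # First occurrence: remember all leading columns, start a list of values
        data_dict[key] = {"row": columns[:7], "values": []}
    data_dict[key]["values"].append(float(columns[7]))

for record in data_dict.values():
    row = record["row"] + [str(numpy.mean(record["values"]))]
    print("\t".join(row))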
import os #needed system utils
import numpy as np# for array data processing
datadirectory = '/media/DATA/arxeia/Dimitris/Testing/12_11'
working = os.environ.get("WORKING_DIRECTORY", datadirectory)
os.chdir( working)
##Here I was trying to read the file and then use the name of the string in the following line - that resulted in the same error described below (Errno 36, file name too long)
data_dict = {} #Create empty dictionary
for line in open('/media/DATA/arxeia/Dimitris/Testing/12_11/1a.dat'): ##above error resolved when used this
    line = line.rstrip()
    columns = line.split()
    entry = [columns[0], columns[1], columns[4]]
    entry = "-".join(entry)
    try:  #valid if have already seen combination of 1,2,5
        data_dict[entry].append(float(columns[7]))
    except KeyError:  #KeyError the first time you see a combination of columns 1,2,5
        data_dict[entry] = [float(columns[7])]
for entry in data_dict:
    value = np.mean(data_dict[entry])
    output = entry.split("-")
    output.append(str(value))
    output = "\t".join(output)
    print(output)
My other problem now is getting the output into string format (or any format) - then I believe I can get to the save part and manipulate the final format:
np.savetxt('sorted_data.dat', sorted, fmt='%s', delimiter='\t') #Save the data
I still have to figure out how to add the other columns - I am working on that too.
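A possible way to bridge the two parts above (a sketch; rows is a name of my choosing): collect the averaged rows into a list instead of printing them, then hand that list to np.savetxt.
rows = []
for entry in data_dict:
    value = np.mean(data_dict[entry])
    rows.append(entry.split("-") + [str(value)])
np.savetxt('sorted_data.dat', rows, fmt='%s', delimiter='\t')  #Save the averaged data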