As the title suggests, I am trying to display a progress bar while performing pandas.to_csv.
I have the following script:
def filter_pileup(pileup, output, lists):
tqdm.pandas(desc='Reading, filtering, exporting', bar_format=BAR_DEFAULT_VIEW)
# Reading files
pileup_df = pd.read_csv(pileup, '\t', header=None).progress_apply(lambda x: x)
lists_df = pd.read_csv(lists, '\t', header=None).progress_apply(lambda x: x)
# Filtering pileup
intersection = pd.merge(pileup_df, lists_df, on=[0, 1]).progress_apply(lambda x: x)
intersection.columns = [i for i in range(len(intersection.columns))]
intersection = intersection.loc[:, 0:5]
# Exporting filtered pileup
intersection.to_csv(output, header=None, index=None, sep='\t')
On the first few lines I have found a way to integrate a progress bar but this method doesn't work for the last line, How can I achieve that?
You can divide the dataframe into chunks of n rows and save the dataframe to a csv chunk by chunk using mode='w' for the first row and mode="a" for the rest:
Example:
import numpy as np
import pandas as pd
from tqdm import tqdm
df = pd.DataFrame(data=[i for i in range(0, 10000000)], columns = ["integer"])
print(df.head(10))
chunks = np.array_split(df.index, 100) # split into 100 chunks
for chunck, subset in enumerate(tqdm(chunks)):
if chunck == 0: # first row
df.loc[subset].to_csv('data.csv', mode='w', index=True)
else:
df.loc[subset].to_csv('data.csv', header=None, mode='a', index=True)
Output:
integer
0 0
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
9 9
100%|██████████| 100/100 [00:12<00:00, 8.12it/s]
Related
I have these four lists, which are the filenames of images and the filenames are in the format:
(disease)-(randomized patient ID)-(image number by this patient)
A single patient can have multiple images per disease.
See these slices below:
print(train_cnv_list[0:3])
print(train_dme_list[0:3])
print(train_drusen_list[0:3])
print(train_normal_list[0:3])
>>>
['CNV-9911627-77.jpeg', 'CNV-9935363-45.jpeg', 'CNV-9911627-94.jpeg']
['DME-8889850-2.jpeg', 'DME-8773471-3.jpeg', 'DME-8797076-11.jpeg']
['DRUSEN-8986660-50.jpeg', 'DRUSEN-9100857-3.jpeg', 'DRUSEN-9025088-5.jpeg']
['NORMAL-9490249-31.jpeg', 'NORMAL-9509694-5.jpeg', 'NORMAL-9504376-3.jpeg']
I'd like to figure out:
How many images are there per patient / per list?
Is there any overlap in the "randomized patient ID" across the four lists? If so, can I aggregate that into some kind of report (patient, disease, number of images) using something like groupby?
patient - disease1 - total number of images
- disease2 - total number of images
- disease3 - total number of images
where total number of images is a max(image number by this patient)
I did see that this yields a patient id:
train_cnv_list[0][4:11]
>>> 9911627
Thanks, in advance, for any guidance.
You can do it easily with Pandas:
import pandas as pd
cnv_list=['CNV-9911627-77.jpeg', 'CNV-9935363-45.jpeg', 'CNV-9911627-94.jpeg']
dme_list=['DME-8889850-2.jpeg', 'DME-8773471-3.jpeg', 'DME-8797076-11.jpeg']
dru_list=['DRUSEN-8986660-50.jpeg', 'DRUSEN-9100857-3.jpeg', 'DRUSEN-9025088-5.jpeg']
nor_list=['NORMAL-9490249-31.jpeg', 'NORMAL-9509694-5.jpeg', 'NORMAL-9504376-3.jpeg']
data =[]
data.extend(cnv_list)
data.extend(dme_list)
data.extend(dru_list)
data.extend(nor_list)
df = pd.DataFrame(data, columns=["files"])
df["files"]=df["files"].str.replace ('.jpeg','')
df=df["files"].str.split('-', expand=True).rename(columns={0:"disease",1:"PatientID",2:"pictureName"})
res = df.groupby(['PatientID','disease']).apply(lambda x: x['pictureName'].count())
print(res)
Result:
PatientID disease
8773471 DME 1
8797076 DME 1
8889850 DME 1
8986660 DRUSEN 1
9025088 DRUSEN 1
9100857 DRUSEN 1
9490249 NORMAL 1
9504376 NORMAL 1
9509694 NORMAL 1
9911627 CNV 2
9935363 CNV 1
and even more now than you have a dataFrame...
Here are a few functions that might get you on the right track, but as #rick-supports-monica mentioned, this is a great use case for pandas. You'll have an easier time manipulating data.
def contains_duplicate_ids(img_list):
patient_ids = []
for image in img_list:
patient_id = image.split('.')[0].split('-')[1]
patient_ids.append(patient_id)
if len(set(patient_ids)) == len(patient_ids):
return False
return True
def get_duplicates(img_list):
patient_ids = []
duplicates = []
for image in img_list:
patient_id = image.split('.')[0].split('-')[1]
if patient_id in patient_ids:
duplicates.append(patient_id)
patient_ids.append(patient_id)
return duplicates
def count_images(img_list):
return len(set(img_list))
From get_duplicates you can use the patient IDs returned to lookup whatever you want from there. I'm not sure I completely understand the structure of the lists. It looks like {disease}-{patient_id}-{some_other_int}.jpg. I'm not sure how to add additional lookups to the functionality without understanding the input a bit more.
I mentioned pandas, but didn't mention how to use it, here's one way you could get your existing data into a dataframe:
import pandas as pd
# Sample data
train_cnv_list = ['CNV-9911627-77.jpeg', 'CNV-9935363-45.jpeg', 'CNV-9911628-94.jpeg', 'CNM-9911629-94.jpeg']
train_dme_list = ['DME-8889850-2.jpeg', 'DME-8773471-3.jpeg', 'DME-8797076-11.jpeg']
train_drusen_list = ['DRUSEN-8986660-50.jpeg', 'DRUSEN-9100857-3.jpeg', 'DRUSEN-9025088-5.jpeg']
train_normal_list = ['NORMAL-9490249-31.jpeg', 'NORMAL-9509694-5.jpeg', 'NORMAL-9504376-3.jpeg']
# Convert list to dataframe
def dataframe_from_list(img_list):
df = pd.DataFrame(img_list, columns=['filename'])
df['disease'] = [filename.split('.')[0].split('-')[0] for filename in img_list]
df['patient_id'] = [filename.split('.')[0].split('-')[1] for filename in img_list]
df['some_other_int'] = [filename.split('.')[0].split('-')[2] for filename in img_list]
return df
# Generate a dataframe for each list
cnv_df = dataframe_from_list(train_cnv_list)
dme_df = dataframe_from_list(train_dme_list)
drusen_df = dataframe_from_list(train_drusen_list)
normal_df = dataframe_from_list(train_normal_list)
# or combine them into one long dataframe
df = pd.concat([cnv_df, dme_df, drusen_df, normal_df], ignore_index=True)
Start by creating a well defined data structure, use counter in order to answer your first question.
from typing import NamedTuple
from collections import Counter,defaultdict
class FileInfo(NamedTuple):
disease:str
patient_id:str
image_id: str
l1 = ['CNV-9911627-77.jpeg', 'CNV-9935363-45.jpeg', 'CNV-9911627-94.jpeg']
l2 = ['DME-8889850-2.jpeg', 'DME-8773471-3.jpeg', 'DME-8797076-11.jpeg']
l3 = ['DRUSEN-8986660-50.jpeg', 'DRUSEN-9100857-3.jpeg', 'DRUSEN-9025088-5.jpeg']
l4 = ['NORMAL-9490249-31.jpeg', 'NORMAL-9509694-5.jpeg', 'NORMAL-9504376-3.jpeg']
lists = [l1,l2,l3,l4]
data_lists = []
for l in lists:
data_lists.append([FileInfo(*f[:-5].split('-')) for f in l])
counters = []
for l in data_lists:
counters.append(Counter(fi.patient_id for fi in l))
print(counters)
print('-----------')
cross_lists_data = dict()
for l in data_lists:
for file_info in l:
if file_info.patient_id not in cross_lists_data:
cross_lists_data[file_info.patient_id] = defaultdict(int)
cross_lists_data[file_info.patient_id][file_info.disease] += 1
print(cross_lists_data)
Start by concatenating your data
import pandas as pd
import numpy as np
train_cnv_list = ['CNV-9911627-77.jpeg', 'CNV-9935363-45.jpeg', 'CNV-9911627-94.jpeg']
train_dme_list = ['DME-8889850-2.jpeg', 'DME-8773471-3.jpeg', 'DME-8797076-11.jpeg']
train_drusen_list = ['DRUSEN-8986660-50.jpeg', 'DRUSEN-9100857-3.jpeg', 'DRUSEN-9025088-5.jpeg']
train_normal_list = ['NORMAL-9490249-31.jpeg', 'NORMAL-9509694-5.jpeg', 'NORMAL-9504376-3.jpeg']
train_data = np.array([
train_cnv_list,
train_dme_list,
train_drusen_list,
train_normal_list
])
Create a Series with the flattened array
>>> train = pd.Series(train_data.flat)
>>> train
0 CNV-9911627-77.jpeg
1 CNV-9935363-45.jpeg
2 CNV-9911627-94.jpeg
3 DME-8889850-2.jpeg
4 DME-8773471-3.jpeg
5 DME-8797076-11.jpeg
6 DRUSEN-8986660-50.jpeg
7 DRUSEN-9100857-3.jpeg
8 DRUSEN-9025088-5.jpeg
9 NORMAL-9490249-31.jpeg
10 NORMAL-9509694-5.jpeg
11 NORMAL-9504376-3.jpeg
dtype: object
Use Series.str.extract together with regex to extract the information from the filenames and separate it into different columns
>>> pat = '(?P<Disease>\w+)-(?P<Patient_ID>\d+)-(?P<IMG_ID>\d+).jpeg'
>>> train = train.str.extract(pat)
>>> train
Disease Patient_ID IMG_ID
0 CNV 9911627 77
1 CNV 9935363 45
2 CNV 9911627 94
3 DME 8889850 2
4 DME 8773471 3
5 DME 8797076 11
6 DRUSEN 8986660 50
7 DRUSEN 9100857 3
8 DRUSEN 9025088 5
9 NORMAL 9490249 31
10 NORMAL 9509694 5
11 NORMAL 9504376 3
Finally, aggregate the data and compute the total number of images per group based on the maximum IMG_ID number.
>>> report = train.groupby(["Patient_ID","Disease"])['IMG_ID'].agg(Total_IMGs="max")
>>> report
Total_IMGs
Patient_ID Disease
8773471 DME 3
8797076 DME 11
8889850 DME 2
8986660 DRUSEN 50
9025088 DRUSEN 5
9100857 DRUSEN 3
9490249 NORMAL 31
9504376 NORMAL 3
9509694 NORMAL 5
9911627 CNV 94
9935363 CNV 45
Say I have in memory a large file, loaded using chunksize in pandas. Now I have to compare every value with the ones ajdacent to it. My problem is that I can't seem to select at the same time the extreme values (in first and last position) of two different chunks.
Example:
print(df)
a
0 102
1 101
2 104
3 110
4 104
5 105
count = 0
for i in range(len(df)-1):
if df.iloc[i+1]['a']>df.iloc[i]['a']:
count+=1
count would be equal to 3 in this example. But say I have loaded df from a .csv with chunksize=1, how would I achieve a similar result, considering that values will be in different chunks? In practice chunksize is 10000 and so the problem would be limited to the first and last value for each chunk.
EDIT:
Here is an example where I store the last_chunk_value to update the value when running the next loop.
I've tested a 'brut force' method to compare with the 'chunk script'. The results are the same with both methods.
By the way, I've simplified the 'brut force' method.
import pandas as pd
import numpy as np
import random
# 'data' generation as csv file
file = open("data.csv", 'w')
file.write('rand_int' + '\n')
for i in range(0, 10000):
file.write(str(random.randint(80,120)) + '\n')
file.close()
# "brute force method"
df = pd.read_csv("data.csv")
length = int( (df.shift(-1) - df > 0).sum() )
print('number=', length)
# chunksize method
chunksize = 33
length = 0
last_chunk_value = np.nan
for chunk in pd.read_csv("data.csv", chunksize=chunksize):
chunk['shift'] = chunk.shift(1)
chunk.iloc[0, 1] = last_chunk_value
length += (chunk['rand_int'] - chunk['shift'] > 0).sum()
last_chunk_value = chunk.iloc[-1, 0]
print('number=', length)
I want to read data (numbers) from a text file that has a random directory. The text file contains both words and numbers that looks like this, how can I extract these columns.
Start Time: 7/28/2019 7:58:06 PM Time Completed: 7/28/2019 8:21:24 PM Elapsed Time: 00:23:17
Sample ID: 190728-MTJ-IP
***DATA***
Field(Oe) Moment(emu)
987.95878 0.000046470297
963.27719 0.000046452876
938.57541 0.000046659299
913.89473 0.000046416303
889.19093 0.000046813005
864.50576 0.000047033128
839.80973 0.000046368291
815.12703 0.000046888714
790.45031 0.000045933749
765.75385 0.00004716459
741.05444 0.000046405491
I intend to use this but I am confused, what indexes I should put on:
def txtread(filepath):
data = []
with open(filepath+'.txt', 'r') as readfile:
datalines = readfile.readlines()
for lines in datalines:
temp = lines.strip('\t\n').split(',')
temp = np.array(temp[:],dtype=float)
data = np.array(data[0::2])
H = data[:,0]
M = data[:,1]
Pandas read_csv method has a bunch of parameters to handle all of these:
>>> import pandas as pd
>>> pd.read_csv('temp.txt', skiprows=5, delim_whitespace=True)
Field(Oe) Moment(emu)
0 987.95878 0.000046
1 963.27719 0.000046
2 938.57541 0.000047
3 913.89473 0.000046
4 889.19093 0.000047
5 864.50576 0.000047
6 839.80973 0.000046
7 815.12703 0.000047
8 790.45031 0.000046
9 765.75385 0.000047
10 741.05444 0.000046
The output of pd.read_csv is a DataFrame. if you prefer to work with numpy arrays,
df = pd.read_csv(...)
np_data = df.values
making a change from R to Python I have some difficulties to write multiple csv using pandas from a list of multiple DataFrames:
import pandas
from dplython import (DplyFrame, X, diamonds, select, sift, sample_n,
sample_frac, head, arrange, mutate, group_by, summarize,
DelayFunction)
diamonds = [diamonds, diamonds, diamonds]
path = "/user/me/"
def extractDiomands(path, diamonds):
for each in diamonds:
df = DplyFrame(each) >> select(X.carat, X.cut, X.price) >> head(5)
df = pd.DataFrame(df) # not sure if that is required
df.to_csv(os.path.join('.csv', each))
extractDiomands(path,diamonds)
That however generates an errors. Appreciate any suggestions!
Welcome to Python! First I'll load a couple libraries and download an example dataset.
import os
import pandas as pd
example_data = pd.read_csv("http://www.ats.ucla.edu/stat/data/binary.csv")
print(example_data.head(5))
first few rows of our example data:
admit gre gpa rank
0 0 380 3.61 3
1 1 660 3.67 3
2 1 800 4.00 1
3 1 640 3.19 4
4 0 520 2.93 4
Now here's what I think you want done:
# spawn a few datasets to loop through
df_1, df_2, df_3 = example_data.head(20), example_data.tail(20), example_data.head(10)
list_of_datasets = [df_1, df_2, df_3]
output_path = 'scratch'
# in Python you can loop through collections of items directly, its pretty cool.
# with enumerate(), you get the index and the item from the sequence, each step through
for index, dataset in enumerate(list_of_datasets):
# Filter to keep just a couple columns
keep_columns = ['gre', 'admit']
dataset = dataset[keep_columns]
# Export to CSV
filepath = os.path.join(output_path, 'dataset_'+str(index)+'.csv')
dataset.to_csv(filepath)
At the end, my folder 'scratch' has three new csv's called dataset_0.csv, dataset_1.csv, and dataset_2.csv
I have a 100M line csv file (actually many separate csv files) totaling 84GB. I need to convert it to a HDF5 file with a single float dataset. I used h5py in testing without any problems, but now I can't do the final dataset without running out of memory.
How can I write to HDF5 without having to store the whole dataset in memory? I'm expecting actual code here, because it should be quite simple.
I was just looking into pytables, but it doesn't look like the array class (which corresponds to a HDF5 dataset) can be written to iteratively. Similarly, pandas has read_csv and to_hdf methods in its io_tools, but I can't load the whole dataset at one time so that won't work. Perhaps you can help me solve the problem correctly with other tools in pytables or pandas.
Use append=True in the call to to_hdf:
import numpy as np
import pandas as pd
filename = '/tmp/test.h5'
df = pd.DataFrame(np.arange(10).reshape((5,2)), columns=['A', 'B'])
print(df)
# A B
# 0 0 1
# 1 2 3
# 2 4 5
# 3 6 7
# 4 8 9
# Save to HDF5
df.to_hdf(filename, 'data', mode='w', format='table')
del df # allow df to be garbage collected
# Append more data
df2 = pd.DataFrame(np.arange(10).reshape((5,2))*10, columns=['A', 'B'])
df2.to_hdf(filename, 'data', append=True)
print(pd.read_hdf(filename, 'data'))
yields
A B
0 0 1
1 2 3
2 4 5
3 6 7
4 8 9
0 0 10
1 20 30
2 40 50
3 60 70
4 80 90
Note that you need to use format='table' in the first call to df.to_hdf to make the table appendable. Otherwise, the format is 'fixed' by default, which is faster for reading and writing, but creates a table which can not be appended to.
Thus, you can process each CSV one at a time, use append=True to build the hdf5 file. Then overwrite the DataFrame or use del df to allow the old DataFrame to be garbage collected.
Alternatively, instead of calling df.to_hdf, you could append to a HDFStore:
import numpy as np
import pandas as pd
filename = '/tmp/test.h5'
store = pd.HDFStore(filename)
for i in range(2):
df = pd.DataFrame(np.arange(10).reshape((5,2)) * 10**i, columns=['A', 'B'])
store.append('data', df)
store.close()
store = pd.HDFStore(filename)
data = store['data']
print(data)
store.close()
yields
A B
0 0 1
1 2 3
2 4 5
3 6 7
4 8 9
0 0 10
1 20 30
2 40 50
3 60 70
4 80 90
This should be possible with PyTables. You'll need to use the EArray class though.
As an example, the following is a script I wrote to import chunked training data stored as .npy files into a single .h5 file.
import numpy
import tables
import os
training_data = tables.open_file('nn_training.h5', mode='w')
a = tables.Float64Atom()
bl_filter = tables.Filters(5, 'blosc') # fast compressor at a moderate setting
training_input = training_data.create_earray(training_data.root, 'X', a,
(0, 1323), 'Training Input',
bl_filter, 4000000)
training_output = training_data.create_earray(training_data.root, 'Y', a,
(0, 27), 'Training Output',
bl_filter, 4000000)
for filename in os.listdir('input'):
print "loading {}...".format(filename)
a = numpy.load(os.path.join('input', filename))
print "writing to h5"
training_input.append(a)
for filename in os.listdir('output'):
print "loading {}...".format(filename)
training_output.append(numpy.load(os.path.join('output', filename)))
Take a look at the docs for detailed instructions, but very briefly, the create_earray function takes 1) a data root or parent node; 2) an array name; 3) a datatype atom; 4) a shape with a 0 in the dimension you want to expand; 5) a verbose descriptor; 6) a compression filter; and 7) an expected number of rows along the expandable dimension. Only the first two are required, but you'll probably use all seven in practice. The function accepts a few other optional arguments as well; again, see the docs for details.
Once the array is created, you can use its append method in the expected way.
If you have a very large single CSV file, you may want to stream the conversion to hdf, e.g.:
import numpy as np
import pandas as pd
from IPython.display import clear_output
CHUNK_SIZE = 5000000
filename = 'data.csv'
dtypes = {'latitude': float, 'longitude': float}
iter_csv = pd.read_csv(
filename, iterator=True,
dtype=dtypes, encoding='utf-8', chunksize=CHUNK_SIZE)
cnt = 0
for ix, chunk in enumerate(iter_csv):
chunk.to_hdf(
"data.hdf", 'data', format='table', append=True)
cnt += CHUNK_SIZE
clear_output(wait=True)
print(f"Processed {cnt:,.0f} coordinates..")
Tested with a 64GB CSV file and 450 Million coordinates (about 10 Minutes conversion).