Convert large csv to hdf5 - python

I have a 100M-line CSV file (actually many separate CSV files) totaling 84 GB. I need to convert it to an HDF5 file with a single float dataset. I used h5py in testing without any problems, but now I can't build the final dataset without running out of memory.
How can I write to HDF5 without having to store the whole dataset in memory? I'm expecting actual code here, because it should be quite simple.
I was just looking into PyTables, but it doesn't look like the array class (which corresponds to an HDF5 dataset) can be written to iteratively. Similarly, pandas has read_csv and to_hdf methods in its io_tools, but I can't load the whole dataset at one time, so that won't work. Perhaps you can help me solve the problem correctly with other tools in PyTables or pandas.

Use append=True in the call to to_hdf:
import numpy as np
import pandas as pd
filename = '/tmp/test.h5'
df = pd.DataFrame(np.arange(10).reshape((5,2)), columns=['A', 'B'])
print(df)
#    A  B
# 0  0  1
# 1  2  3
# 2  4  5
# 3  6  7
# 4  8  9
# Save to HDF5
df.to_hdf(filename, 'data', mode='w', format='table')
del df # allow df to be garbage collected
# Append more data
df2 = pd.DataFrame(np.arange(10).reshape((5,2))*10, columns=['A', 'B'])
df2.to_hdf(filename, 'data', append=True)
print(pd.read_hdf(filename, 'data'))
yields
    A   B
0   0   1
1   2   3
2   4   5
3   6   7
4   8   9
0   0  10
1  20  30
2  40  50
3  60  70
4  80  90
Note that you need to use format='table' in the first call to df.to_hdf to make the table appendable. Otherwise, the format is 'fixed' by default, which is faster for reading and writing but creates a table that cannot be appended to.
Thus, you can process each CSV file one at a time, using append=True to build up the HDF5 file. Then overwrite the DataFrame or use del df to allow the old DataFrame to be garbage collected.
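A minimal sketch of that loop (the file location and the all-float column layout are assumptions, not taken from the question):
import glob
import pandas as pd
csv_files = sorted(glob.glob('data/*.csv'))   # hypothetical location of the CSVs
out = 'combined.h5'
for i, path in enumerate(csv_files):
    df = pd.read_csv(path, dtype=float)
    # The first file creates the table; every later file appends to it.
    df.to_hdf(out, key='data', mode='w' if i == 0 else 'a',
              format='table', append=(i > 0))
    del df  # let each chunk be garbage collected before loading the next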
Alternatively, instead of calling df.to_hdf, you could append to a HDFStore:
import numpy as np
import pandas as pd
filename = '/tmp/test.h5'
store = pd.HDFStore(filename)
for i in range(2):
    df = pd.DataFrame(np.arange(10).reshape((5,2)) * 10**i, columns=['A', 'B'])
    store.append('data', df)
store.close()
store = pd.HDFStore(filename)
data = store['data']
print(data)
store.close()
yields
    A   B
0   0   1
1   2   3
2   4   5
3   6   7
4   8   9
0   0  10
1  20  30
2  40  50
3  60  70
4  80  90

This should be possible with PyTables. You'll need to use the EArray class though.
As an example, the following is a script I wrote to import chunked training data stored as .npy files into a single .h5 file.
import numpy
import tables
import os
training_data = tables.open_file('nn_training.h5', mode='w')
a = tables.Float64Atom()
bl_filter = tables.Filters(5, 'blosc')       # fast compressor at a moderate setting
training_input = training_data.create_earray(training_data.root, 'X', a,
                                             (0, 1323), 'Training Input',
                                             bl_filter, 4000000)
training_output = training_data.create_earray(training_data.root, 'Y', a,
                                              (0, 27), 'Training Output',
                                              bl_filter, 4000000)
for filename in os.listdir('input'):
    print("loading {}...".format(filename))
    a = numpy.load(os.path.join('input', filename))
    print("writing to h5")
    training_input.append(a)
for filename in os.listdir('output'):
    print("loading {}...".format(filename))
    training_output.append(numpy.load(os.path.join('output', filename)))
Take a look at the docs for detailed instructions, but very briefly, the create_earray function takes 1) a data root or parent node; 2) an array name; 3) a datatype atom; 4) a shape with a 0 in the dimension you want to expand; 5) a verbose descriptor; 6) a compression filter; and 7) an expected number of rows along the expandable dimension. Only the first two are required, but you'll probably use all seven in practice. The function accepts a few other optional arguments as well; again, see the docs for details.
Once the array is created, you can use its append method in the expected way.
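For the original CSV-to-HDF5 problem, a rough sketch of how that append loop might look (the column count, chunk size, and file names below are made up, not taken from the question):
import numpy as np
import pandas as pd
import tables
h5 = tables.open_file('big.h5', mode='w')
atom = tables.Float64Atom()
# shape=(0, 8): expandable along the first axis, with 8 float columns assumed
earray = h5.create_earray(h5.root, 'data', atom, (0, 8),
                          expectedrows=100_000_000)
for path in ['part1.csv', 'part2.csv']:          # hypothetical file names
    for chunk in pd.read_csv(path, chunksize=1_000_000, dtype=np.float64):
        earray.append(chunk.to_numpy())
h5.close()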

If you have a very large single CSV file, you may want to stream the conversion to HDF5, e.g.:
import numpy as np
import pandas as pd
from IPython.display import clear_output
CHUNK_SIZE = 5000000
filename = 'data.csv'
dtypes = {'latitude': float, 'longitude': float}
iter_csv = pd.read_csv(
    filename, iterator=True,
    dtype=dtypes, encoding='utf-8', chunksize=CHUNK_SIZE)
cnt = 0
for ix, chunk in enumerate(iter_csv):
    chunk.to_hdf(
        "data.hdf", 'data', format='table', append=True)
    cnt += CHUNK_SIZE
    clear_output(wait=True)
    print(f"Processed {cnt:,.0f} coordinates..")
Tested with a 64 GB CSV file and 450 million coordinates (the conversion took about 10 minutes).
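As a follow-up, the table-format file produced this way can itself be read back in chunks, so the data never has to fit in memory on the consuming side either; a small sketch:
import pandas as pd
for chunk in pd.read_hdf("data.hdf", "data", chunksize=1_000_000):
    # do something with each chunk, e.g. aggregate or filter it
    print(len(chunk))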

Related

Reshaping a single column in to multiple column using Python

I have an Excel file containing a single column (the number of rows is not fixed). Using Python 3, I want to:
Import my Excel file/data in Python,
Read/select the data column (first column),
Reshape this column into multiple columns having 10 rows in each column, and finally
Write the output to a new Excel file.
I have tried the following code:
import pandas as pd
import numpy as np
df = pd.read_excel('sample.xlsx')
first_column = pd.DataFrame(df.iloc[:,0])
arr = np.array(first_column)
newArr = arr.reshape(10, -1)
However, I am facing the following error:
newArr = arr.reshape(arr, (10, -1))
TypeError: only integer scalar arrays can be converted to a scalar index
I'm looking for help achieving this in Python 3.
My Excel File
1. To read a file in Python you need pandas
To read the Excel file in Python, one option is to first save it as a CSV and then read that in Python. You can save the Excel file as CSV using the Save As option in Excel.
>>> import pandas as pd
>>> df = pd.read_csv('fazool.csv')
Then to print the head of the dataframe/table in python
>>> df.head()
kMEblue kMEgreen kMEturquoise kMEblack kMEbrown kMEred kMEyellow data$X count moduleColors
0 -0.762233 -0.115623 0.836647 -0.418418 -0.688068 -0.078625 0.316798 VWA5A 1 turquoise
1 -0.714720 -0.145856 0.802115 -0.420983 -0.670826 -0.039813 0.424616 EIF4G2 1 turquoise
2 -0.785788 -0.259762 0.777330 -0.301520 -0.585565 0.021812 0.412960 CFL1 1 turquoise
3 -0.736677 -0.296203 0.776179 -0.266430 -0.517727 0.109923 0.526707 NSUN2 1 turquoise
4 -0.697293 0.030126 0.772833 -0.621229 -0.733419 -0.341270 0.088465 ANXA2 1 turquoise
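Alternatively, pandas can read the .xlsx directly, skipping the Save As step (this needs the openpyxl package installed); a minimal sketch using the file name from the question:
>>> df = pd.read_excel('sample.xlsx')
>>> df.head()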
2. Selecting the first column of the dataframe:
>>> first_column_df = pd.DataFrame(df.iloc[:,0])
>>> first_column_df.head()
kMEblue
0 -0.762233
1 -0.714720
2 -0.785788
3 -0.736677
4 -0.697293
>>> first_column_df.columns # shows the column name
Index(['kMEblue'], dtype='object')
3. For reshaping this column into multiple columns, each having ten rows, you need numpy:
>>> import numpy as np
>>> n = 10 # number to be used as chunk size for the first column
>>> first_column_df_split = pd.concat(
...     [pd.Series(j, name='y' + str(i))
...      for i, j in enumerate(np.split(first_column_df['kMEblue'].to_numpy(),
...                                     range(n, len(first_column_df['kMEblue']), n)))],
...     axis=1)
>>> first_column_df_split.head()
y0 y1 y2 y3 y4 y5 ... y478 y479 y480 y481 y482 y483
0 -0.762233 -0.639253 -0.673571 -0.652639 -0.703227 -0.666183 ... 0.633533 0.628803 0.716792 0.783900 0.725757 0.791240
1 -0.714720 -0.680753 -0.696416 -0.686810 -0.636661 -0.613642 ... 0.678854 0.807758 0.736286 0.627988 0.853333 0.887149
2 -0.785788 -0.638530 -0.607706 -0.613452 -0.701420 -0.583315 ... 0.663671 0.649068 0.741015 0.847084 0.718821 0.786994
3 -0.736677 -0.728837 -0.665220 -0.613386 -0.596789 -0.614878 ... 0.722638 0.587891 0.658215 0.668980 0.794392 0.835687
4 -0.697293 -0.731756 -0.627547 -0.653920 -0.641218 -0.679153 ... 0.618696 0.740690 0.737382 0.679931 0.706449 0.919852
[5 rows x 484 columns]
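As a side note, an equivalent way to build the same layout is to pad the column to a multiple of n and reshape it; the following is only a sketch, under the same assumptions as above (column kMEblue, chunk size n):
import numpy as np
import pandas as pd
# Pad the column to a multiple of n with NaN, then reshape it chunk by chunk.
vals = first_column_df['kMEblue'].to_numpy().astype(float)
n = 10
n_chunks = -(-len(vals) // n)                    # ceiling division
padded = np.pad(vals, (0, n_chunks * n - len(vals)), constant_values=np.nan)
first_column_df_split = pd.DataFrame(padded.reshape(n_chunks, n).T,
                                     columns=['y' + str(i) for i in range(n_chunks)])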
4. For writing this out to a file, you can use pandas DataFrame.to_csv():
>>> first_column_df_split.to_csv("first_column_split.csv")
Adapted from here.
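Since the original goal was a new Excel file rather than a CSV, the reshaped frame can also be written with DataFrame.to_excel (this needs openpyxl installed); the output file name here is just an example:
>>> first_column_df_split.to_excel("first_column_split.xlsx")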

Reading from a .dat file into a DataFrame in Python

I have a .dat file which looks something like the below....
#| step | Channel| Mode | Duration|Freq.| Amplitude | Phase|
0 1 AWG Pi/2 100 2 1
1 1 SIN^2 100 1 1
2 1 SIN^2 200 0.5 1
3 1 REC 50 100 1 1
100 0 REC Pi/2 150 1 1
I created a DataFrame and wanted to extract data from it, but I get an error:
TypeError: expected str, bytes or os.PathLike object, not DataFrame
My code is below here,
import pandas as pd
import numpy as np
path = "updated.dat"
datContent = [i.strip().split() for i in open(path).readlines()]
#print(datContent)
column_names = datContent.pop(0)
print(column_names)
df = pd.DataFrame(datContent)
print(df)
extract_column = df.iloc[:,2]
with open(df, 'r') as openfile:
    for line in openfile:
        for column_search in line:
            column_search = df.iloc[:, 2]
            if "REC" in column_search:
                print("Rec found")
Any suggestion would be appreciated
Since your post does not have any clear question, I have to guess based on your code. I am assuming that what you want to get is to find all rows in DataFrame where column Mode contains value REC.
Based on that, I prepared a small, self contained example that works on your data.
In your situation, the only line that you should use is the last one. Assuming that your DataFrame is created and filled correctly, everything in your code below print(df) can be replaced by this single line.
I would really recommend reading the official documentation about indexing and selecting data from DataFrames: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html
import pandas as pd
from io import StringIO
data = StringIO("""
no;step;Channel;Mode;Duration;Freq.;Amplitude;Phase
;0;1;AWG;Pi/2;100;2;1
;1;1;SIN^2;;100;1;1
;2;1;SIN^2;;200;0.5;1
;3;1;REC;50;100;1;1
;100;0;REC;Pi/2;150;1;1
""")
df = pd.read_csv(data, sep=";")
df.loc[df.loc[:, 'Mode'] == "REC", :]
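If you would rather keep the DataFrame exactly as you built it in your question (positional columns, no header row), the same kind of boolean indexing works through iloc; a sketch, assuming Mode really ends up in the third column:
rec_rows = df[df.iloc[:, 2] == "REC"]
print(rec_rows)
if not rec_rows.empty:
    print("Rec found")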

How to read an Excel spreadsheet and convert units?

I've got an Excel spreadsheet, and I would like to use Python to convert the measurements from cm3/day to just cm3/year.
Is there a way to do this?
I've looked mostly into openpyxl, as this module seems to come up the most for Excel editing, but I'm confused about how to edit the units so they are all the same. I can't seem to find a module that supports what I'm trying to do.
You can do this easily with pandas. You may need to install xlrd:
pip3 install pandas xlrd
or just save your file as csv.
import pandas as pd
# Read the file with read_csv() or read_excel()
df = pd.read_excel('your_file.xlsx', index_col=0) # Your index is the first column
>>> df
         measure  amount
precip
1        cm3/day      45
2        cm3/day     132
3       cm3/year    9565
4        cm3/sec       5
5        cm3/day      67
6        cm3/day      52
7        cm3/sec       2
8        cm3/day      78
9        cm3/sec       3
10       cm3/day      92
Then you can use apply() to check and update values as you want. With the option axis=1 it applies a function to each row of a pd.DataFrame; the applied function receives a row of your data as a pd.Series object.
Let's define a function:
def _update(serie):
    val = serie['amount']                        # The original value
    volume, time = serie['measure'].split('/')   # The time unit
    # Check and update
    if time == 'year':
        return serie
    elif time == 'day':
        serie['amount'] = val * 365
    elif time == 'hour':
        serie['amount'] = val * 24 * 365
    elif time == 'sec':
        serie['amount'] = val * 3600 * 24 * 365
    # Update measure col
    serie['measure'] = 'cm3/year'
    return serie
Then apply the function:
new_df = df.apply(_update, axis=1)
>>> new_df
         measure     amount
precip
1       cm3/year      16425
2       cm3/year      48180
3       cm3/year       9565
4       cm3/year  157680000
5       cm3/year      24455
6       cm3/year      18980
7       cm3/year   63072000
8       cm3/year      28470
9       cm3/year   94608000
10      cm3/year      33580
# Save the new file:
new_df.to_excel('new_file.xlsx')
Hope this helps!
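For a very large sheet, a vectorised variant of the same conversion may be faster than apply(); this is just a sketch, assuming the same 'measure' and 'amount' columns as above:
factors = {'year': 1, 'day': 365, 'hour': 24 * 365, 'sec': 3600 * 24 * 365}
time_unit = df['measure'].str.split('/').str[1]   # 'day', 'sec', 'year', ...
df['amount'] = df['amount'] * time_unit.map(factors)
df['measure'] = 'cm3/year'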
If the file is in "*.xlsx" format, you can read it in Python like this:
#first import necessary packages
import pandas as pd
import numpy as np
data = pd.read_excel(file_name)
If in "*.csv" format do this:
#first import necessary packages
import pandas as pd
import numpy as np
data = pd.read_csv(file_name)
To perform a calculation on a column (I don't quite get the cm3/day/sec format, but if you had cm3/day you could convert it to cm3/year with the code below):
#first check the type of your column
data["column"].dtype
#based on what you get as type
#If your column's data type is string
#convert it to integer
data["column_name"] = data["column_name"].astype(int)
#convert it to float
data["column_name"] = data["column_name"].astype(float)
# if your column is already of numeric type don't change it
#to convert cm3/day to cm3/year
data["column_name"] = data["column_name"]*365
PS: I can't see the linked image, so I couldn't use the actual column names from the Excel sheet.

How to loop over multiple DataFrames and produce multiple csv?

Making the change from R to Python, I'm having some difficulty writing multiple CSVs with pandas from a list of DataFrames:
import pandas
from dplython import (DplyFrame, X, diamonds, select, sift, sample_n,
                      sample_frac, head, arrange, mutate, group_by, summarize,
                      DelayFunction)
diamonds = [diamonds, diamonds, diamonds]
path = "/user/me/"
def extractDiomands(path, diamonds):
    for each in diamonds:
        df = DplyFrame(each) >> select(X.carat, X.cut, X.price) >> head(5)
        df = pd.DataFrame(df) # not sure if that is required
        df.to_csv(os.path.join('.csv', each))
extractDiomands(path,diamonds)
That, however, generates an error. I'd appreciate any suggestions!
Welcome to Python! First I'll load a couple libraries and download an example dataset.
import os
import pandas as pd
example_data = pd.read_csv("http://www.ats.ucla.edu/stat/data/binary.csv")
print(example_data.head(5))
first few rows of our example data:
admit gre gpa rank
0 0 380 3.61 3
1 1 660 3.67 3
2 1 800 4.00 1
3 1 640 3.19 4
4 0 520 2.93 4
Now here's what I think you want done:
# spawn a few datasets to loop through
df_1, df_2, df_3 = example_data.head(20), example_data.tail(20), example_data.head(10)
list_of_datasets = [df_1, df_2, df_3]
output_path = 'scratch'
# in Python you can loop through collections of items directly, it's pretty cool.
# with enumerate(), you get the index and the item from the sequence, each step through
for index, dataset in enumerate(list_of_datasets):
    # Filter to keep just a couple of columns
    keep_columns = ['gre', 'admit']
    dataset = dataset[keep_columns]
    # Export to CSV
    filepath = os.path.join(output_path, 'dataset_' + str(index) + '.csv')
    dataset.to_csv(filepath)
At the end, my folder 'scratch' has three new CSVs called dataset_0.csv, dataset_1.csv, and dataset_2.csv.

Add two Pandas Series or DataFrame objects in-place?

I have a dataset where we record the electrical power demand from each individual appliance in the home. The dataset is quite large (2 years of data; 1 sample every 6 seconds; 50 appliances). The data is in a compressed HDF file.
We need to add the power demand for every appliance to get the total aggregate power demand over time. Each individual meter might have a different start and end time.
The naive approach (using a simple model of our data) is to do something like this:
LENGTH = 2**25
N = 30
cumulator = pd.Series()
for i in range(N):
    # change the index for each new_entry to mimic the fact
    # that our appliance meters have different start and end times.
    new_entry = pd.Series(1, index=np.arange(i, LENGTH+i))
    cumulator = cumulator.add(new_entry, fill_value=0)
This works fine for small amounts of data. It also works OK with large amounts of data as long as every new_entry has exactly the same index.
But, with large amounts of data, where each new_entry has a different start and end index, Python quickly gobbles up all the available RAM. I suspect this is a memory fragmentation issue. If I use multiprocessing to fire up a new process for each meter (to load the meter's data from disk, load the cumulator from disk, do the addition in memory, then save the cumulator back to disk, and exit the process) then we have fine memory behaviour but, of course, all that disk IO slows us down a lot.
So, I think what I want is an in-place Pandas add function. The plan would be to initialise cumulator to have an index which is the union of all the meters' indices. Then allocate memory once for that cumulator. Hence no more fragmentation issues.
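A sketch of that plan (here meters just stands in for however the individual Series get loaded from disk):
# Build the union of all meter indices up front, then allocate the cumulator once.
union_index = pd.Index([])
for meter in meters:
    union_index = union_index.union(meter.index)
cumulator = pd.Series(0.0, index=union_index)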
I have tried two approaches but neither is satisfactory.
I tried using numpy.add to allow me to set the out argument:
# Allocate enough space for the cumulator
cumulator = pd.Series(0, index=np.arange(0, LENGTH+N))
for i in range(N):
    new_entry = pd.Series(1, index=np.arange(i, LENGTH+i))
    cumulator, aligned_new_entry = cumulator.align(new_entry, copy=False, fill_value=0)
    del new_entry
    np.add(cumulator.values, aligned_new_entry.values, out=cumulator.values)
    del aligned_new_entry
But this gobbles up all my RAM too and doesn't seem to do the addition. If I change the penultimate line to cumulator.values = np.add(cumulator.values, aligned_new_entry.values, out=cumulator.values) then I get an error about not being able to assign to cumulator.values.
This second approach appears to have the correct memory behaviour but is far too slow to run:
for i in range(N):
    new_entry = pd.Series(1, index=np.arange(i, LENGTH+i))
    for index in cumulator.index:
        try:
            cumulator[index] += new_entry[index]
        except KeyError:
            pass
I suppose I could write this function in Cython. But I'd rather not have to do that.
So: is there any way to do an 'inplace add' in Pandas?
Update
In response to comments below, here is a toy example of our meter data and the sum we want. All values are watts.
time meter1 meter2 meter3 sum
09:00:00 10 10
09:00:06 10 20 30
09:00:12 10 20 30
09:00:18 10 20 30 50
09:00:24 10 20 30 50
09:00:30 10 30 40
If you want to see more details then here's the file format description of our data logger, and here's the 4TByte archive of our entire dataset.
After messing around a lot with multiprocessing, I think I've found a fairly simple and efficient way to do an in-place add without using multiprocessing:
import numpy as np
import pandas as pd
LENGTH = 2**26
N = 10
DTYPE = np.int
# Allocate memory *once* for a Series which will hold our cumulator
cumulator = pd.Series(0, index=np.arange(0, N+LENGTH), dtype=DTYPE)
# Get a numpy array from the Series' buffer
cumulator_arr = np.frombuffer(cumulator.data, dtype=DTYPE)
# Create lots of dummy data. Each new_entry has a different start
# and end index.
for i in range(N):
    new_entry = pd.Series(1, index=np.arange(i, LENGTH+i), dtype=DTYPE)
    aligned_new_entry = np.pad(new_entry.values, pad_width=((i, N-i)),
                               mode='constant', constant_values=((0, 0)))
    # np.pad could be replaced by new_entry.reindex(index, fill_value=0)
    # but np.pad is faster and more memory efficient than reindex
    del new_entry
    np.add(cumulator_arr, aligned_new_entry, out=cumulator_arr)
    del aligned_new_entry
del cumulator_arr
print(cumulator.head(N*2))
which prints:
0 1
1 2
2 3
3 4
4 5
5 6
6 7
7 8
8 9
9 10
10 10
11 10
12 10
13 10
14 10
15 10
16 10
17 10
18 10
19 10
Assuming that your dataframe looks something like:
df.index.names == ['time']
df.columns == ['meter1', 'meter2', ..., 'meterN']
then all you need to do is:
df['total'] = df.fillna(0).sum(axis=1)
(note that fillna(0, inplace=True) returns None, so it cannot be chained into .sum()).
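Since DataFrame.sum skips NaN values by default (which, for addition, is the same as treating a missing meter reading as zero), the fillna step can usually be dropped altogether:
df['total'] = df.sum(axis=1)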
