Spark load csv files and memorise filename in column - python

We have a blob storage where plenty of files are arriving during the whole day.
I have a Databricks notebook running in batch that reads the directory listing, loops over the files and sends them all into an Azure SQLDW. Works fine.
After that the processed files are moved into an archive.
But the process of looping over the file list, appending each one of them and adding the filename to a column is a bit slow.
I was wondering if this could be done in one run. Loading all CSVs at once is possible, but how do I keep the corresponding filename in a column?
Does anybody have a suggestion?

There are a couple of ways I can think of:
1. spark.read.format("csv").load("path").select(input_file_name())
2. spark.sparkContext.wholeTextFiles("path").map{case(x,y) => x} <-- avoid if data is huge
Both provide all filenames in the given path. The former is DataFrame-based and might be faster than the latter RDD-based one.
Note: Haven't tested the solution.
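For illustration, a minimal PySpark sketch of the first option, assuming the Databricks-provided spark session and a single landing path (the path and column name below are placeholders):

from pyspark.sql.functions import input_file_name

# Read every CSV under the landing path in one go (path is a placeholder)
df = (spark.read
      .option("header", "true")
      .csv("/mnt/blob/landing/*.csv")
      # keep the full path of the source file on every row
      .withColumn("filename", input_file_name()))

# df can now be written to Azure SQLDW in a single pass instead of
# appending file by file.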

Related

Pandas HDFStore caching

I am working with a medium-size dataset that consists of around 150 HDF files, 0.5GB each. There is a scheduled process that updates those files using store.append from pd.HDFStore.
I am trying to achieve the following scenario:
For each HDF file:
1. Keep the process that updates the store running
2. Open a store in read-only mode
3. Run a while loop that will continuously select the latest available row from the store
4. Close the store on script exit
Now, this works fine, because we can have as many readers as we want, as long as all of them are in read-only mode. However, in step 3, because HDFStore caches the file, it does not return the rows that were appended after the connection was opened. Is there a way to select the newly added rows without re-opening the store?
After doing more research, I concluded that this is not possible with HDF files. The only reliable way of achieving the functionality above is to use a database (SQLite is closest - the read/write speed is lower than HDF but still faster than a fully-fledged database like Postgres or MySQL).
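For illustration, a minimal sketch of the SQLite-based polling loop, assuming a hypothetical measurements table with an integer primary key (database file, table and column names are placeholders):

import sqlite3
import time

con = sqlite3.connect("store.db")  # hypothetical database file

while True:
    # Each SELECT runs in its own read transaction, so rows committed by
    # the writer process since the last iteration are visible without
    # re-opening the connection (unlike the cached HDFStore).
    row = con.execute(
        "SELECT * FROM measurements ORDER BY id DESC LIMIT 1"
    ).fetchone()
    if row is not None:
        print(row)  # placeholder for the actual processing
    time.sleep(1)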

Merge or Concatenate Hundreds of Excel files

I have 638 Excel files in a directory, each about 3,000 KB large. I want to concatenate all of them together, hopefully using only Python or the command line (no other programming software or languages).
Essentially, this is part of a larger process that involves some simple data manipulation, and I want it all to be doable by just running a single python file (or double clicking batch file).
I've tried variations of the code below with Pandas, openpyxl, and xlrd, and they seem to have about the same speed. Converting to CSV seems to require VBA, which I do not want to get into.
import os
import pandas as pd

temp_list = []
for filename in os.listdir(filepath):
    temp = pd.read_excel(os.path.join(filepath, filename),
                         sheet_name=X, usecols=fields)  # X and fields as in my setup
    temp_list.append(temp)
merged = pd.concat(temp_list, ignore_index=True)
Are there simpler command line solutions to convert these into csv files or merge into one excel document? Or is this pretty much it, just using the basic libraries to read individual files?
.xls(x) is a very (over)complicated format with lots of features and quirks accumulated over the years and is thus rather hard to parse. And it was never designed for speed or for large amounts of data but rather for ease of use for business people.
So with your number of files, your best bet is to convert those to .csv or another easy-to-parse format (or use such a format for data exchange in the first place) -- and preferably, do this before you get to process them -- e.g. upon a file's arrival.
E.g. this is how you can save the first sheet of a .xls(x) to .csv with pywin32 using Excel's COM interface:
import win32com.client
# Need the typelib metadata to have Excel-specific constants
x = win32com.client.gencache.EnsureDispatch("Excel.Application")
# Need to pass full paths, see https://stackoverflow.com/questions/16394842/excel-can-only-open-file-if-using-absolute-path-why
w = x.Workbooks.Open("<full path to file>")
s = w.Worksheets(1)
s.SaveAs("<full path to file without extension>", win32com.client.constants.xlCSV)
w.Close(False)
Running this in parallel would normally have no effect because the same server process would be reused. You can force creating a different process for each batch as per How can I force python(using win32com) to create a new instance of excel?.
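As a sketch only, one way to do that from Python is DispatchEx, which starts a separate Excel process per worker (the batching, pool size and path handling below are assumptions, not tested):

import multiprocessing
import win32com.client

def convert_batch(paths):
    # DispatchEx starts a fresh Excel process instead of reusing a running one
    excel = win32com.client.DispatchEx("Excel.Application")
    try:
        for path in paths:
            wb = excel.Workbooks.Open(path)
            # 6 == xlCSV (the named constants aren't available without gencache)
            wb.Worksheets(1).SaveAs(path.rsplit(".", 1)[0] + ".csv", 6)
            wb.Close(False)
    finally:
        excel.Quit()

if __name__ == "__main__":
    batches = [...]  # split the full paths of the 638 files into a few lists
    with multiprocessing.Pool(4) as pool:
        pool.map(convert_batch, batches)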

Can you give some advice on designing my Spark Stream source?

I will try to describe my requirement as best as I can, but please feel free to ask me if it is still unclear.
The environment
I have 5 nodes (there will be more in the future). Each of them generates a big CSV file (about 1 to 2 GB) every 5 minutes. I need to use Apache Spark Streaming to process these CSV files within five minutes. So these 5 files are my input DStream source.
What I plan to do
I plan to use textFileStream like below:
ssc.textFileStream(dataDirectory)
Every 5 minutes I will put those CSVs in a directory on HDFS, then use the above function to generate the input DStream.
The problem with the above way
textFileStream needs one complete file instead of 5 files, and I do not know how to merge files in HDFS.
Question
Can you tell me how to merge files in HDFS with Python?
Do you have any better suggestion than my approach? Please advise me.
You can always read the files in a directory using a wildcard character.
That should not be a problem, which means at any given time your DStream's RDD is the merged result of all the files present at that time.
As far as the approach goes, yours is simple and works.
NB: The only thing you should be careful about is the atomicity of the CSV files themselves. Your files should go into the folder you are watching for incoming files via mv, not copy.
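For illustration, a minimal PySpark Streaming sketch of the setup described above (the path, the 5-minute batch interval and the processing step are placeholders):

from pyspark.streaming import StreamingContext

ssc = StreamingContext(sc, 300)  # 5-minute batches; sc is the existing SparkContext

# Monitor the HDFS landing directory; each batch's RDD is the union of all
# files moved into it during that interval, so no manual merge is needed.
lines = ssc.textFileStream("hdfs:///data/landing")

lines.count().pprint()  # placeholder for the real processing

ssc.start()
ssc.awaitTermination()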
Thanks
Manas

Handling big files with Google Cloud Storage API

What I need to achieve is to concatenate a list of files into a single file, using the cloudstorage library. This needs to happen inside a mapreduce shard, which has a 512MB upper limit on memory, but the concatenated file could be larger than 512MB.
The following code segment breaks when the file size hits the memory limit.
list_of_files = [...]
with cloudstorage.open(filename...) as file_handler:
    for a in list_of_files:
        with cloudstorage.open(a) as f:
            file_handler.write(f.read())
Is there a way to work around this issue? Maybe open or append files in chunks? And how to do that? Thanks!
== EDIT ==
After some more testing, it seems that the memory limit only applies to f.read(), while writing to a large file is okay. Reading the files in chunks solved my issue, but I really like the compose() function that #Ian-Lewis pointed out. Thanks!
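For completeness, a sketch of the chunked copy described above (the chunk size and output path are arbitrary choices):

import cloudstorage

CHUNK_SIZE = 32 * 1024 * 1024  # 32 MB per read stays well under the 512 MB limit

list_of_files = [...]
with cloudstorage.open("/my_bucket/path/to/output", "w") as file_handler:
    for a in list_of_files:
        with cloudstorage.open(a) as f:
            while True:
                chunk = f.read(CHUNK_SIZE)
                if not chunk:
                    break
                file_handler.write(chunk)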
For large files you will want to break the file up into smaller files, upload each of those, and then merge them together as composite objects. You will want to use the compose() function from the library. It seems there are no docs on it yet.
After you've uploaded all the parts, something like the following should work. One thing to make sure of is that the paths of the files to be composed don't contain the bucket name or a slash at the beginning.
stat = cloudstorage.compose(
    [
        "path/to/part1",
        "path/to/part2",
        "path/to/part3",
        # ...
    ],
    "/my_bucket/path/to/output"
)
You may also want to check out using the gsutil tool if possible. It can do automatic splitting, uploading in parallel, and compositing of large files for you.

Most efficient way to store data on drive

baseline - I have CSV data with 10,000 entries. I save this as 1 csv file and load it all at once.
alternative - I have CSV data with 10,000 entries. I save this as 10,000 CSV files and load it individually.
Approximately how much more inefficient is this computationally? I'm not hugely interested in memory concerns. The reason for the alternative method is that I frequently need to access subsets of the data and don't want to have to read the entire array.
I'm using python.
Edit: I can use other file formats if needed.
Edit1: SQLite wins. Amazingly easy and efficient compared to what I was doing before.
SQLite is an ideal solution for your application.
Simply import your CSV file into a SQLite database table (it will be a single file), then add indexes as necessary.
To access your data, use the Python sqlite3 library. You can use this tutorial on how to use it.
Compared to many other solutions, SQLite will be the fastest way to select partial data sets locally - certainly much, much faster than accessing 10,000 files. Also read this answer which explains why SQLite is so good.
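For illustration, a minimal sketch of that workflow with the standard library, assuming a two-column CSV with a header row (file, table and column names are placeholders):

import csv
import sqlite3

con = sqlite3.connect("data.db")
con.execute("CREATE TABLE IF NOT EXISTS entries (key TEXT, value REAL)")

# One-time import of the CSV (skipping the header row)
with open("data.csv", newline="") as f:
    reader = csv.reader(f)
    next(reader)
    con.executemany("INSERT INTO entries VALUES (?, ?)", reader)
con.execute("CREATE INDEX IF NOT EXISTS idx_key ON entries(key)")
con.commit()

# Later: pull only the subset you need instead of loading everything
subset = con.execute(
    "SELECT * FROM entries WHERE key = ?", ("some_key",)
).fetchall()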
I would write all the lines to one file. For 10,000 lines it's probably not worthwhile, but you can pad all the lines to the same length - say 1000 bytes.
Then it's easy to seek to the n-th line: just multiply n by the line length.
10,000 files are going to be slower to load and access than one file, if only because the files' data will likely be fragmented around your disk drive, so accessing it will require a much larger number of seeks than would accessing the contents of a single file, which will generally be stored as sequentially as possible. Seek times are a big slowdown on spinning media, since your program has to wait while the drive heads are physically repositioned, which can take milliseconds. (Slow seek times aren't an issue for SSDs, but even then there will still be the overhead of 10,000 files' worth of metadata for the operating system to deal with.) Also, with a single file, the OS can speed things up for you by doing read-ahead buffering (as it can reasonably assume that if you read one part of the file, you will likely want to read the next part soon). With multiple files, the OS can't do that.
My suggestion (if you don't want to go the SQLite route) would be to use a single CSV file, and (if possible) pad all of the lines of your CSV file out with spaces so that they all have the same length. For example, say you make sure when writing out the CSV file that all lines in the file are exactly 80 bytes long. Then reading the n-th line of the file becomes relatively fast and easy:
myFileObject.seek(n*80)
theLine = myFileObject.read(80)
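The writing side of that scheme might look like this (80-byte records to match the example above; each line must be shorter than the record length, and csv_lines is a placeholder for your rows):

LINE_LEN = 80  # fixed record length, newline included

with open("data.csv", "w") as out:
    for line in csv_lines:  # csv_lines is a placeholder for your rows
        out.write(line.ljust(LINE_LEN - 1) + "\n")  # pad with spaces to 80 bytes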
