We have a data lake containing tons of files, where I can read these the contents of the files along with their paths:
sdf = spark.read.load(source)\
.withColumn("_path", F.input_file_name())
I would like to generate a unique ID for each row, for easier downstream joining between tables, and I want this ID to be reproducible between runs.
Simplest approach is to simply use the _path column as the identifier.
sdf.withColumn("id", F.col("_path"))
However, it would be "prettier" and more compact to have some kind of integer representation. And for other tables the unique identifier could be a combination of a few columns, uglyfying this a bit more.
Another approach is to use monotonically increasing ID.
sdf.withColumn("id", F.monotonically_increasing_id())
However, in this solution there is no guarantee that when running the analysis id=2 is also id=2 when the analysis is run a week later (when new data has arrived).
A third approach is to use the hashing function:
sdf.withColumn("id", F.hash("_path"))
Which could be quite nice, because it is easy to hash a combination of columns, but this is not stable since multiple inputs can give the same output:
Running such analysis on our actual data gave 396,702 hash-ids from a single origin _path, and 24 hash-ids originating from two paths. Hence a collision rate of 0.006%.
We could simply disregard this very small portion of the data, but there must be a more elegant way of achieving what I want to achieve?
You can try the xxhash64 hash in Spark SQL, which gives a 64-bit hash value and should be more robust to hash collisions:
sdf.withColumn("id", F.expr("xxhash64(_path)"))
or to use more robust hashing algorithms,
sdf.withColumn("id", F.expr("conv(sha2(_path,256),16,10)"))
Related
I have multiple Excel files with different sheets in each file, these files have been made my people, so each one has different formats, different number of columns and also different structures to represent the data.
For example, in one sheet, the dataframe/table starts at 8th row, second column. In other it starts at 122 row, etc...
I want to retrieve something in common from these Excels, it is variable names and information.
However, I don't how could I possibly retrieve all this information without needing to parse each individual file. This is not an option because there are lot of these files with lots of sheets in each file.
I have been thinking about using regex as well as edit distance between words, but I don't know if that is the best option.
Any help is appreciated.
I will divide my answer into what I think you can do now, and suggestions for the future (if feasible).
An attempt to "solve" the problem you have with existing files.
Without regularity on your input files (such as at least a common name in the column), I think what you're describing is among the best solutions. Having said that, perhaps a "fancier" similarity metric between column names would be more useful than using regular expressions.
If you believe that there will be some regularity in the column names, you could look at string distances such as the Hamming Distance or the Levenshtein distance, and using a threshold on the distance that works for you. As an example, let's say that you have a function d(a:str, b:str) -> float that calculates a distance between column names, you could do something like this:
# this variable is a small sample of "expected" column names
plausible_columns = [
'interesting column',
'interesting',
'interesting-column',
'interesting_column',
]
for f in excel_files:
# process the file until you find columns
# I'm assuming you can put the colum names into
# a variable `columns` here.
for c in columns:
for p in plausible_columns:
if d(c,p) < threshold:
# do something to process the column,
# add to a pandas DataFrame, calculate the mean,
# etc.
If the data itself can tell you something on whether you should process it (such as having a particular distribution, or being in a particular range), you can use such features to decide on whether you should be using that column or not. Even better, you can use many of these characteristics to make a finer decision.
Having said this, I don't think a fully automated solution exists without inspecting some of the data manually, and studying the ditribution of the data, or variability in the names of the columns, etc.
For the future
Even with fancy methods to calculate features and doing some data analysis on the data you have right now, I think it would be impossible to ensure that you will always get the data you need (by the very nature of the problem). A reasonable way to solve this, in my opinion (and if this is feasible in whatever context you're working in), is to impose a stricter format in the data generation end (I suppose this is a manual thing with people inputting data to excel directly). I would argue that the best solution is to get rid of the problem at the root, and create a unified form, or excel sheet format, and distribute it to the people that will fill the files with data, so that you can ensure the data is then automatically ingested minimizing the risk of errors.
Is sorting a DataFrame in pandas memory efficient? I.e., can I sort the dataframe without reading the whole thing into memory?
Internally, pandas relies on numpy.argsort to do all the sorting.
That being said: pandas DataFrames are backed by numpy arrays, which have to be present in memory as a whole. So, to answer your question: No, pandas needs the whole dataset in memory for sorting.
Additional thoughts:
You can of course implement such a disk-based external sorting using multiple steps: Load a chunk of your dataset, sort it, save the sorted version. Repeat. Load a part of each sorted subset, join them into one DataFrame and sort it You'll have to be careful here on how much t oload from each source. For example, if your 1000 element dataset is already sorted, getting the top 10 results from each of the 10 subsets won't get you the correct top 100. It will, however, give you the correct top 10.
Without further information about your data, I suggest you let some (relational) database handle all that stuff. They're made for this kind of thing, after all.
I have a large dataset with 500 million rows and 58 variables. I need to sort the dataset using one of the 59th variable which is calculated using the other 58 variables. The variable happens to be a floating point number with four places after decimal.
There are two possible approaches:
The normal merge sort
While calculating the 59th variables, i start sending variables in particular ranges to to particular nodes. Sort the ranges in those nodes and then combine them in the reducer once i have perfectly sorted data and now I also know where to merge what set of data; It basically becomes appending.
Which is a better approach and why?
I'll assume that you are looking for a total sort order without a secondary sort for all your rows. I should also mention that 'better' is never a good question since there is typically a trade-off between time and space and in Hadoop we tend to think in terms of space rather than time unless you use products that are optimized for time (TeraData has the capability of putting Databases in memory for Hadoop use)
Out of the two possible approaches you mention, I think only one would work within the Hadoop infrastructure. Num 2, Since Hadoop leverages many nodes to do one job, sorting becomes a little trickier to implement and we typically want the 'shuffle and sort' phase of MR to take care of the sorting since distributed sorting is at the heart of the programming model.
At the point when the 59th variable is generated, you would want to sample the distribution of that variable so that you can send it through the framework then merge like you mentioned. Consider the case when the variable distribution of x contain 80% of your values. What this might do is send 80% of your data to one reducer who would do most of the work. This assumes of course that some keys will be grouped in the sort and shuffle phase which would be the case unless you programmed them unique. It's up to the programmer to set up partitioners to evenly distribute the load by sampling the key distribution.
If on the other hand we were to sort in memory then we could accomplish the same thing during reduce but there are inherent scalability issues since the sort is only as good as the amount of memory available in the node currently running the sort and dies off quickly when it starts to use HDFS to look for the rest of the data that did not fit into memory. And if you ignored the sampling issue you will likely run out of memory unless all your key values pairs are evenly distributed and you understand the memory capacity within your data.
Check out the Hadoop Comparator Class Part of HadoopStreaming Wiki Page
You can move the datasets to HDFS, use Python to write a mapper and do a hadoop streaming mapper only job. The Hadoop Streaming will automatically help you sort them.
Then you can use hdfs dfs -getmerge and -copyToLocal to move the sorted records back to local if you want.
I have two groups of files that contain data in CSV format with a common key (Timestamp) - I need to walk through all the records chronologically.
Group A: 'Environmental Data'
Filenames are in format A_0001.csv, A_0002.csv, etc.
Pre-sorted ascending
Key is Timestamp, i.e.YYYY-MM-DD HH:MM:SS
Contains environmental data in CSV/column format
Very large, several GBs worth of data
Group B: 'Event Data'
Filenames are in format B_0001.csv, B_0002.csv
Pre-sorted ascending
Key is Timestamp, i.e.YYYY-MM-DD HH:MM:SS
Contains event based data in CSV/column format
Relatively small compared to Group A files, < 100 MB
What is best approach?
Pre-merge: Use one of the various recipes out there to merge the files into a single sorted output and then read it for processing
Real-time merge: Implement code to 'merge' the files in real-time
I will be running lots of iterations of the post-processing side of things. Any thoughts or suggestions? I am using Python.
im thinking importing it into a db (mysql, sqlite, etc) will give better performance than merging it in script. the db typically has optimized routines for loading csv and the join will be probably be as fast or much faster than merging 2 dicts (one being very large) in python.
"YYYY-MM-DD HH:MM:SS" can be sorted with a simple ascii compare.
How about reusing external merge logic? If the first field is the key then:
for entry in os.popen("sort -m -t, -k1,1 file1 file2"):
process(entry)
This is a similar to a relational join. Since your timestamps don't have to match, it's called a non-equijoin.
Sort-Merge is one of several popular algorithms. For non-equijoins, it works well. I think this would be what you're called "pre-merge". I don't know what you mean by "merge in real time", but I suspect it's still a simple sort-merge, which is a fine technique, heavily used by real databases.
Nested Loops can also work. In this case, you read the smaller table in the outer loop. In the inner loop you find all of the "matching" rows from the larger table. This is effectively a sort-merge, but with an assumption that there will be multiple rows from the big table that will match the small table.
This, BTW, will allow you to more properly assign meaning to the relationship between Event Data and Environmental Data. Rather than reading the result of a massive sort merge and trying to determine which kind of record you've got, the nested loops handle that well.
Also, you can do "lookups" into the smaller table while reading the larger table.
This is hard when you're doing non-equal comparisons because you don't have a proper key to do a simple retrieval from a simple dict. However, you can easily extend dict (override __contains__ and __getitem__) to do range comparisons on a key instead of simple equality tests.
I would suggest pre-merge.
Reading a file takes a lot of processor time. Reading two files, twice as much. Since your program will be dealing with a large input (lots of files, esp in Group A), I think it would be better to get it over with in one file read, and have all your relevant data in that one file. It would also reduce the number of variables and read statements you will need.
This will improve the runtime of your algorithm, and I think that's a good enough reason in this scenario to decide to use this approach
Hope this helps
You could read from the files in chunks of, say, 10000 records (or whatever number further profiling tells you to be optimal) and merge on the fly. Possibly using a custom class to encapsulate the IO; the actual records could then be accessed through the generator protocol (__iter__ + next).
This would be memory friendly, probably very good in terms of total time to complete the operation and would enable you to produce output incrementally.
A sketch:
class Foo(object):
def __init__(self, env_filenames=[], event_filenames=[]):
# open the files etc.
def next(self):
if self._cache = []:
# take care of reading more records
else:
# return the first record and pop it from the cache
# ... other stuff you need ...
I'm developing an app that handle sets of financial series data (input as csv or open document), one set could be say 10's x 1000's up to double precision numbers (Simplifying, but thats what matters).
I plan to do operations on that data (eg. sum, difference, averages etc.) as well including generation of say another column based on computations on the input. This will be between columns (row level operations) on one set and also between columns on many (potentially all) sets at the row level also. I plan to write it in Python and it will eventually need a intranet facing interface to display the results/graphs etc. for now, csv output based on some input parameters will suffice.
What is the best way to store the data and manipulate? So far I see my choices as being either (1) to write csv files to disk and trawl through them to do the math or (2) I could put them into a database and rely on the database to handle the math. My main concern is speed/performance as the number of datasets grows as there will be inter-dataset row level math that needs to be done.
-Has anyone had experience going down either path and what are the pitfalls/gotchas that I should be aware of?
-What are the reasons why one should be chosen over another?
-Are there any potential speed/performance pitfalls/boosts that I need to be aware of before I start that could influence the design?
-Is there any project or framework out there to help with this type of task?
-Edit-
More info:
The rows will all read all in order, BUT I may need to do some resampling/interpolation to match the differing input lengths as well as differing timestamps for each row. Since each dataset will always have a differing length that is not fixed, I'll have some scratch table/memory somewhere to hold the interpolated/resampled versions. I'm not sure if it makes more sense to try to store this (and try to upsample/interploate to a common higher length) or just regenerate it each time its needed.
"I plan to do operations on that data (eg. sum, difference, averages etc.) as well including generation of say another column based on computations on the input."
This is the standard use case for a data warehouse star-schema design. Buy Kimball's The Data Warehouse Toolkit. Read (and understand) the star schema before doing anything else.
"What is the best way to store the data and manipulate?"
A Star Schema.
You can implement this as flat files (CSV is fine) or RDBMS. If you use flat files, you write simple loops to do the math. If you use an RDBMS you write simple SQL and simple loops.
"My main concern is speed/performance as the number of datasets grows"
Nothing is as fast as a flat file. Period. RDBMS is slower.
The RDBMS value proposition stems from SQL being a relatively simple way to specify SELECT SUM(), COUNT() FROM fact JOIN dimension WHERE filter GROUP BY dimension attribute. Python isn't as terse as SQL, but it's just as fast and just as flexible. Python competes against SQL.
"pitfalls/gotchas that I should be aware of?"
DB design. If you don't get the star schema and how to separate facts from dimensions, all approaches are doomed. Once you separate facts from dimensions, all approaches are approximately equal.
"What are the reasons why one should be chosen over another?"
RDBMS slow and flexible. Flat files fast and (sometimes) less flexible. Python levels the playing field.
"Are there any potential speed/performance pitfalls/boosts that I need to be aware of before I start that could influence the design?"
Star Schema: central fact table surrounded by dimension tables. Nothing beats it.
"Is there any project or framework out there to help with this type of task?"
Not really.
For speed optimization, I would suggest two other avenues of investigation beyond changing your underlying storage mechanism:
1) Use an intermediate data structure.
If maximizing speed is more important than minimizing memory usage, you may get good results out of using a different data structure as the basis of your calculations, rather than focusing on the underlying storage mechanism. This is a strategy that, in practice, has reduced runtime in projects I've worked on dramatically, regardless of whether the data was stored in a database or text (in my case, XML).
While sums and averages will require runtime in only O(n), more complex calculations could easily push that into O(n^2) without applying this strategy. O(n^2) would be a performance hit that would likely have far more of a perceived speed impact than whether you're reading from CSV or a database. An example case would be if your data rows reference other data rows, and there's a need to aggregate data based on those references.
So if you find yourself doing calculations more complex than a sum or an average, you might explore data structures that can be created in O(n) and would keep your calculation operations in O(n) or better. As Martin pointed out, it sounds like your whole data sets can be held in memory comfortably, so this may yield some big wins. What kind of data structure you'd create would be dependent on the nature of the calculation you're doing.
2) Pre-cache.
Depending on how the data is to be used, you could store the calculated values ahead of time. As soon as the data is produced/loaded, perform your sums, averages, etc., and store those aggregations alongside your original data, or hold them in memory as long as your program runs. If this strategy is applicable to your project (i.e. if the users aren't coming up with unforeseen calculation requests on the fly), reading the data shouldn't be prohibitively long-running, whether the data comes from text or a database.
What matters most if all data will fit simultaneously into memory. From the size that you give, it seems that this is easily the case (a few megabytes at worst).
If so, I would discourage using a relational database, and do all operations directly in Python. Depending on what other processing you need, I would probably rather use binary pickles, than CSV.
Are you likely to need all rows in order or will you want only specific known rows?
If you need to read all the data there isn't much advantage to having it in a database.
edit: If the code fits in memory then a simple CSV is fine. Plain text data formats are always easier to deal with than opaque ones if you can use them.