I have a 20+GB dataset that is structured as follows:
1 3
1 2
2 3
1 4
2 1
3 4
4 2
(Note: the repetition is intentional and there is no inherent order in either column.)
I want to construct a file in the following format:
1: 2, 3, 4
2: 3, 1
3: 4
4: 2
Here is my problem: I have tried writing scripts in both Python and C++ that load the file, build long strings, and write the result line by line. Neither approach, however, seems able to handle a file of this size. Does anyone have suggestions on how to tackle this? Specifically, is there a particular method or program that is well suited to it? Any help or guided directions would be greatly appreciated.
You can try this with Hadoop by running a stand-alone MapReduce program. The mapper emits the first column as the key and the second column as the value; all outputs with the same key go to one reducer, so each reducer receives a key and the list of values for that key. The reducer then walks the value list and emits (key, valueString), which is the final output you want. A simple Hadoop tutorial will get you started with the mapper and reducer described here. I have not tried to scale 20 GB of data on a stand-alone Hadoop setup, but it is worth a try. Hope this helps.
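For illustration, here is a minimal Hadoop Streaming style sketch in Python of the mapper and reducer described above; the script names and the assumption of whitespace-separated pairs on stdin come from the example, not from a tested job.

mapper.py:

import sys

# Emit "key<TAB>value" for every input pair.
for line in sys.stdin:
    parts = line.split()
    if len(parts) == 2:
        print(parts[0] + "\t" + parts[1])

reducer.py:

import sys

# Hadoop Streaming delivers the mapper output sorted by key,
# so all values for one key arrive on consecutive lines.
current_key, values = None, []
for line in sys.stdin:
    key, value = line.rstrip("\n").split("\t", 1)
    if key != current_key and current_key is not None:
        print(current_key + ": " + ", ".join(values))
        values = []
    current_key = key
    values.append(value)
if current_key is not None:
    print(current_key + ": " + ", ".join(values))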
Have you tried using a std::vector of std::vector?
The outer vector represents each row. Each slot in the outer vector is a vector containing all the possible values for each row. This assumes that the row # can be used as an index into the vector.
Otherwise, you can try std::map<unsigned int, std::vector<unsigned int> >, where the key is the row number and the vector contains all values for the row.
A std::list of vectors would also work.
Does your program run out of memory?
Edit 1: Handling large data files
You can handle your issue by treating it like a merge sort.
Open a file for each row number.
Append the 2nd column values to the file.
After all data is read, close all files.
Open each file and read the values and print them out, comma separated.
Open an output file for each key.
While iterating over the lines of the source file, append each value to the corresponding key's output file.
Join the output files.
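A minimal Python sketch of this per-key bucket-file approach, assuming whitespace-separated integer pairs as in the example and hypothetical file names (one file handle is kept open per distinct key, so this assumes the number of distinct keys stays below the OS limit on open files):

import os

os.makedirs("buckets", exist_ok=True)
handles = {}  # one open output file per key

with open("pairs.txt") as src:
    for line in src:
        key, value = line.split()
        if key not in handles:
            handles[key] = open(os.path.join("buckets", key), "w")
        handles[key].write(value + "\n")

for h in handles.values():
    h.close()

with open("result.txt", "w") as out:
    for key in sorted(handles, key=int):
        with open(os.path.join("buckets", key)) as bucket:
            values = [v.strip() for v in bucket]
        out.write(key + ": " + ", ".join(values) + "\n")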
An interesting thought found also on Stack Overflow
If you want to persist a large dictionary, you are basically looking at a database.
As recommended there, use Python's sqlite3 module to write to a table where the primary key is auto-incremented, with a field called "key" (or "left") and a field called "value" (or "right").
Then SELECT the MIN(key) and MAX(key) from the table, and with that information you can SELECT all rows that share the same "key" (or "left") value, in sorted order, and print that information to an output file (if the database itself is not a suitable output format for you).
I have suggested this approach on the assumption that you call this problem "big data" because the number of keys does not fit comfortably into memory (otherwise a simple Python dictionary would be enough). However, IMHO this question is not correctly tagged as "big data": to require distributed computation on Hadoop or similar, your input data should be much larger than what fits on a single hard drive, or your computations should be much more costly than a simple hash-table lookup and insertion.
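A minimal sqlite3 sketch of this idea, assuming whitespace-separated integer pairs and hypothetical file names:

import sqlite3

conn = sqlite3.connect("pairs.db")
conn.execute('CREATE TABLE IF NOT EXISTS pairs '
             '(id INTEGER PRIMARY KEY AUTOINCREMENT, "key" INTEGER, "value" INTEGER)')

# Load the raw pairs; sqlite spills to disk, so the data never has to fit in RAM.
with open("pairs.txt") as src:
    rows = (tuple(map(int, line.split())) for line in src)
    conn.executemany('INSERT INTO pairs ("key", "value") VALUES (?, ?)', rows)
conn.commit()

# Walk the distinct keys in order and collect the values for each one.
with open("result.txt", "w") as out:
    for (key,) in conn.execute('SELECT DISTINCT "key" FROM pairs ORDER BY "key"'):
        values = [str(v) for (v,) in conn.execute(
            'SELECT "value" FROM pairs WHERE "key" = ? ORDER BY "value"', (key,))]
        out.write(str(key) + ": " + ", ".join(values) + "\n")
conn.close()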
Related
I have been working with Python for about 1.5 years and am looking for some direction. This is the first time I can't find what I need after doing a lot of searching; I must be missing something, most likely searching the wrong terms.
Problem: I am working on an app that has many processes (could be hundreds or even thousands). Each process may have a unique input and output data format: multiline strings, comma-separated strings, Excel or CSV with or without varying headers, and many others. I need something that will format the input correctly and handle the output based on the process, and new processes also need to be easy to add and define. I am open to whatever the best approach is, but my thought is to use a database that stores the template/data definition and use it to know the format for a given process. However, I'm struggling to work out exactly how to do that, and whether it is really the best approach; whatever the solution, it needs to be scalable. Any direction would be appreciated. Thank you.
A couple simple examples of data
Process 1 example data (multi line string with Header)
Input of
[ABC123, XYZ453, CDE987]
and the resulting data input below would be created:
Barcode
ABC123
XYZ453
CDE987
The code below works, but is not reusable for example 2.
barcodes = ['ABC123', 'XYZ453', 'CDE987']  # quote the strings; avoid shadowing the built-in list
text = 'Barcode\r\n'                        # '\r\n', not '/r/n'; avoid shadowing the built-in input
for code in barcodes:
    text = text + code + '\r\n'
Process 2 example input template (comma separated with Header):
Barcode,Location,Param1,Param2
Item1,L1,11,A
Item1,L1,22,B
Item2,L1,33,C
Item2,L2,44,F
Item3,L2,55,B
Item3,L2,66,P
Process 2 example resulting input data (comma separated with Header):
Input of
{'Barcode':['ABC123', 'XYZ453', 'CDE987', 'FGH487', 'YTR123'], 'Location':['Shelf1', 'Shelf2']}
and using the template to create the input data below:
Barcode,Location,Param1,Param2
ABC123,Shelf1,11,A
ABC123,Shelf1,22,B
XYZ453,Shelf1,33,C
XYZ453,Shelf2,44,F
CDE987,Shelf2,55,B
CDE987,Shelf2,66,P
FGH487,Shelf1,11,A
FGH487,Shelf1,22,B
YTR123,Shelf1,33,C
YTR123,Shelf2,44,F
I know how to handle each process with hardcoded loops, dataframe merges, etc. I've done some abstraction in other cases with dicts. However, how to define and store formats that vary this much, and how to create reusable, abstracted code, is where I am stuck.
Maybe you can have each function return a small structure (a dict or a two-element tuple) with a "datatype" entry describing the format and an "output" entry holding the actual output.
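As one possible direction (a rough sketch, not the asker's design; every name here is hypothetical), you could keep a registry that maps each process name to a datatype tag and a formatter function, so adding a new process only means registering one more function:

FORMATTERS = {}

def register(name, datatype):
    def wrap(func):
        FORMATTERS[name] = {"datatype": datatype, "format": func}
        return func
    return wrap

@register("process1", datatype="multiline")
def format_barcode_list(barcodes):
    # Header plus one barcode per line, CRLF-terminated (example 1).
    return "Barcode\r\n" + "".join(code + "\r\n" for code in barcodes)

@register("process2", datatype="csv")
def format_csv(header, rows):
    # Generic comma-separated output with a header row.
    lines = [",".join(header)] + [",".join(map(str, row)) for row in rows]
    return "\r\n".join(lines) + "\r\n"

def build_input(process_name, *args, **kwargs):
    entry = FORMATTERS[process_name]
    return entry["datatype"], entry["format"](*args, **kwargs)

Calling build_input("process1", ["ABC123", "XYZ453", "CDE987"]) returns the ("datatype", "output") pair suggested above; the per-process details (e.g. example 2's Param columns) could live in a stored template that each formatter reads.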
I have a list of 30m entries, containing a unique id and 4 attributes for each entry. In addition to that I have a second list with 10m entries, containing again a uniqe id and 2 other attributes.
The unique IDs in list 2 are a subset of the IDs in list 1.
I want to combine the two lists to do some analysis.
Example:
List 1:
ID|Age|Flag1|Flag2|Flag3
------------------------
ucab577|12|1|0|1
uhe4586|32|1|0|1
uhf4566|45|1|1|1
45e45tz|37|1|1|1
7ge4546|42|0|0|1
vdf4545|66|1|0|1
List 2:
ID|Country|Flag4|Flag5|Flag6
------------------------
uhe4586|US|0|0|1
uhf4566|US|0|1|1
45e45tz|UK|1|1|0
7ge4546|ES|0|0|1
I want to do analysis like:
"How many at the age of 45 have Flag4=1?" Or "What is the age structure of all IDs in US?"
My current approach is to load the two list into separate tables of a relational database and then doing a join.
Does a MapReduce approach make sense in this case?
If yes, how would a MapReduce approach look like?
How can I combine the attributes of list 1 with list 2?
Will it bring any advantages? (Currently I need more than 12 hours for importing the data)
When the files are big, Hadoop's distributed processing helps (it is faster). Once you bring the data into HDFS you can use Hive or Pig for your queries; both use Hadoop MapReduce for processing, so you do not need to write separate MapReduce code. Hive is almost SQL-like, and from your query types I guess you can manage with Hive; if your queries are more complex, you can consider Pig. If you use Hive, here are the sample steps:
Load the two files into two separate folders in HDFS.
Create external tables for both of them, pointing their locations at those folders.
Perform the join and run the query.
hive> create external table hiveint_r(id string, age int, Flag1 int, Flag2 int, Flag3 int)
> row format delimited
> fields terminated by '|'
> location '/user/root/data/hiveint_r'; (it is in hdfs)
The table is populated with the data automatically; there is no need to load it.
Create the other table in a similar way, then run the join and the query.
select a.* from hiveint_l a full outer join hiveint_r b on (a.id=b.id) where b.age>=30 and a.flag4=1 ;
MapReduce might be overkill for just 30m entries. How you should work really depends on your data. Is it dynamic (e.g. will new entries be added)? If not, just stick with your database; the data is already in it. 30m entries shouldn't take 12 hours to import, more likely 12 minutes (you should be able to get around 30,000 inserts/second with roughly 20-byte records), so your first step should be to fix the import. You might want to try bulk import, LOAD DATA INFILE, using transactions and/or generating the indexes afterwards, or another engine (InnoDB, MyISAM), ...
You can get just one big table (so you can get rid of the joins when you query which will speed them up) by e.g.
Update List1 join List2 on List1.Id = List2.Id
set List1.Flag4 = List2.Flag4, List1.Flag5 = List2.Flag5, List1.Flag6 = List2.Flag6
after adding the columns to List1 and the index on Id, of course; afterwards you should add indexes for all the columns you query on.
You can also combine your data before importing it into MySQL, e.g. by reading list 2 into a hash map (HashMap in C/C++/Java, dict in Python, array in PHP) and then generating a new import file with the combined data. Reading the data in should only take a few seconds. You can even do your evaluation here; it is not as flexible as SQL, but if you only have a few fixed queries, that might be the fastest approach if your data changes often.
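A rough Python sketch of that pre-join (the file names and the plain pipe-delimited layout with a single header line each are assumptions taken from the example):

# Read list 2 into a dict keyed by ID, then stream list 1 and emit combined rows.
list2 = {}
with open("list2.txt") as f:
    next(f)  # skip the header line
    for line in f:
        fields = line.rstrip("\n").split("|")
        list2[fields[0]] = fields[1:]  # Country, Flag4, Flag5, Flag6

with open("list1.txt") as f, open("combined.txt", "w") as out:
    next(f)  # skip the header line
    out.write("ID|Age|Flag1|Flag2|Flag3|Country|Flag4|Flag5|Flag6\n")
    for line in f:
        fields = line.rstrip("\n").split("|")
        extra = list2.get(fields[0], ["", "", "", ""])  # IDs missing from list 2 get blanks
        out.write("|".join(fields + extra) + "\n")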
In MapReduce you can process the two files by using join techniques. There are two types of joins: map-side and reduce-side.
A map-side join can be done efficiently using the DistributedCache API, where one file is loaded into memory. In your case you can create a HashMap with key -> ID and value -> Flag4, and during the map phase join the data based on the ID. Note that the cached file must be small enough to fit in memory.
If both files are large, go for a reduce-side join.
So first try to load the 2nd file into memory and do a map-side join.
Or you can go for Pig. Pig executes its statements as MapReduce jobs anyway, but hand-written MapReduce is fast compared to Pig and Hive.
So first, I know there are some answers out there for similar questions, but...my problem has to do with speed and memory efficiency.
I have a 60 GB text file that has 17 fields and 460,368,082 records. Column 3 has the ID of the individual, and the same individual can have several records in this file. Let's call this file File A.
I have a second file, File B, that has the ID of 1,000,000 individuals and I want to extract the rows of File A that have an ID that is in File B.
I have a Windows PC and I'm open to doing this in C or Python, or whatever is faster... but I'm not sure how to do it quickly and efficiently.
So far every solution I have come up with takes over 1.5 years according to my calculations.
What you are looking for is a sort-merge join. The idea is to sort File A on the column you are joining on (ID), sort File B on the column you are joining on (ID), and then read both files with a merge algorithm, ignoring records that don't have a match in both.
Sorting the files may require creation of intermediate files.
If the data is in a text file with a delimiter, you can also use the Linux sort command-line utility to perform the sort.
sort -k3,3 -t'|' fileA > fileA.sorted
sort fileB > fileB.sorted
dos2unix fileB.sorted #make sure the line endings are same style
dos2unix fileA.sorted #make sure the line endings are same style
If dos2unix is not available, this may be used as an alternative:
sort -k3,3 -t'|' fileA | tr -d '\r' > fileA.sorted
sort fileB | tr -d '\r' > fileB.sorted
Join the files
join -1 3 -2 1 -t'|' fileA.sorted fileB.sorted
The other option, if you have enough RAM, is to load File B into memory in a HashMap-like structure. Then read File A and look up each ID in the HashMap for a match. I think either language would work fine; it just depends on which you are more comfortable with.
It depends: if the file is unsorted, you're going to have to scan the entire thing, and I would use multiple threads for that. If you are going to have to do this search multiple times, I would also create an index.
If you have a massive amount of memory you could create a hash table to hold strings. You could then load all of the strings from the first file into the hash table. Then, load each of the strings from the second file one at a time. For each string, check if it's in the hash table. If so, report a match. This approach uses O(m) memory (where m is the number of strings in the first file) and takes at least Ω(m + n) time and possibly more, depending on how the hash function works. This is also (almost certainly) the easiest and most direct way to solve the problem.
If you have little RAM to work with but tons of disk space, see https://en.wikipedia.org/wiki/External_sorting; you could get this down to O(n log n) time.
It sounds like what you want to do is first read File B, collecting the IDs. You can store the IDs in a set or a dict.
Then read File A. For each line in File A, extract the ID, then see if it was in File B by checking for membership in your set or dict. If not, then skip that line and continue with the next line. If it is, then process that line as desired.
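A minimal Python sketch of this set-membership filter; the '|' delimiter and the ID sitting in column 3 are assumptions carried over from the description and the sort example above:

# Build the set of wanted IDs from File B, then stream File A and keep matching rows.
with open("fileB.txt") as fb:
    wanted_ids = {line.strip() for line in fb}

with open("fileA.txt") as fa, open("matches.txt", "w") as out:
    for line in fa:
        fields = line.rstrip("\n").split("|")
        if len(fields) > 2 and fields[2] in wanted_ids:
            out.write(line)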
I am reading a large text corpus in xml format and storing counts of some word occurrences in a dictionary where a key is a tuple of three elements {('a','b','c'):1}. This dictionary continuously grows in size while its values get updated.
I need to keep a dictionary in memory all the time ~25GB before writing it to hdf file.
I have tried to find some information about what data types could actually reflect the structure of the current dictionary, but did not find any concrete answer. My biggest concern is memory size. Is there any data structure in Python that can mitigate these constraints? I have read about the pyjudy library, but it looks like it is heavily 32-bit oriented and barely developed anymore. I would appreciate any advice.
I don't know if there are other dict implementations, but I'd say you've got 2.5 options:
Optimise your encoding locally - that is index your words separately, so that you've got a dictionary of {'a': 1, 'b': 2, ...} and then for the triplet dictionary use {(1,2,3): 1} instead. This will reduce the memory usage, because strings won't be repeated many times.
2.1. Implement the above in an extension in some other language.
2.2. Just use another language. The whole dictionary and its values in Python will always carry a serious overhead compared to the small values you're storing.
Edit: Longer answer to das-g's comment, because I think it deserves one. Python is good at copying only the data which changes. Immutable values remain in one place when they're assigned to a new variable. This allows the following to work with close to no new allocation:
# VmRSS: 27504 kB
In [1]: a=['a'*1024]
# VmRSS: 27540 kB
In [2]: a=['a'*1024]*10000
# VmRSS: 27540 kB
But that's not the case for identical values which come from different places. As in - they're created from scratch (for example read from a file) every time, rather than copied from existing value:
# VmRSS: 27540 kB
In [4]: a=['a'*1024 for _ in range(10000)]
# VmRSS: 38280 kB
In [5]: b=['a'*1024 for _ in range(10000)]
# VmRSS: 48840 kB
That's why if the words are read from some out-of-process source, it's worth deduplicating them yourself, because Python will not do it for you automatically.
So in practice, you could even save memory by doing something that looks silly like substituting:
def increase_triple(a, b, c):
    triples[(a, b, c)] += 1
with:
WORDS = {}

def dedup(s):
    s_dedup = WORDS.get(s)
    if s_dedup is None:
        WORDS[s] = s
        return s
    else:
        return s_dedup

def increase_triple(a, b, c):
    triples[(dedup(a), dedup(b), dedup(c))] += 1
As @StefanPochmann mentioned in the comments, the standard function intern() (sys.intern() in Python 3) does pretty much what dedup() above does, just better.
Processing big data with the traditional ways of processing small/mid-size data is often the wrong approach, in terms of efficiency and maintainability. Even if you manage to do it today, there is no guarantee that you will be able to do it tomorrow, for several reasons (e.g. your data grows larger than your available memory, data partitioning, etc.).
Depending on the behaviour of your input, you should have a look at either batch processing engines (Hadoop/MapReduce, Apache Spark) or stream processing engines (Apache Storm, Spark Streaming).
If you want to continue working in Python, Apache Spark has a Python interface.
Last of all, there is a project called Apache Zeppelin for interactive big data analytics (my thesis topic). It supports Apache Spark and several other platforms.
I need to keep a dictionary in memory all the time ~25GB before writing it to hdf file.
Does that mean the data is 25GB, or the resulting dictionary is 25GB in memory? That means you'd have a lot of those 3-tuples if each element in the tuple is a word. I don't think you actually need all of this in memory. But, I really doubt that the dictionary of three-word tuples to integers is really 25GB.
According to my /usr/share/dict/words, an average word is about 10 characters, and each character is a byte in the most common case. That's thirty bytes per record without the count, plus an integer value of roughly 4 bytes, so about 34 bytes per record. The dictionary will, of course, add overhead, but even so we're talking about more than 600 million 3-tuples, easily. And in this case these are distinct word tuples, since repeats are counted in the dictionary's values.
Without fully understanding what your question is, I would point you at shelve. It gives you something that looks like a dictionary (interface-wise) but is disk-backed.
Have you actually tried this? Premature optimization is the root of all evil :)
Honestly, I'd recommend a database, but if you're dead-set on keeping this in Python...
A mapping from words to indexes could help. Replace the key of ('a', 'b', 'c') with (1, 2, 3), where (1, 2, 3) are values in a lookup table.
lookup_table = {'a':1, 'b':2, 'c':3}
This would be of primary use if most of the words you have are really long, and are repeated a lot. If you have (a,b,c), (a,c,b), (b,a,c), (b,c,a), (c,a,b), and (c,b,a), each word is only stored once instead of 6 times.
EDIT: This depends on how you are generating your strings.
"ab c" is "ab c" : True
"a"+"b c" is "ab c" : False
Another way to do this is to read the file, but when the output grows to a certain point, sort the keys (a-z), and output to a file:
aaab:101
acbc:92
add:109
then wipe the dictionary and continue with processing. Once the file has finished processing, you can then merge the values. This doesn't use much memory because all the files are sorted; you just need to keep one line per file in memory. (Think mergesort.)
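A rough sketch of that final merge step, assuming each partial file holds lines in the sorted "key:count" form shown above (the file names are hypothetical):

import heapq
from itertools import groupby

partial_files = ["part0.txt", "part1.txt", "part2.txt"]
streams = [open(name) for name in partial_files]

with open("merged.txt", "w") as out:
    # heapq.merge keeps only one line per file in memory at a time.
    merged = heapq.merge(*streams)
    for key, lines in groupby(merged, key=lambda line: line.split(":")[0]):
        total = sum(int(line.split(":")[1]) for line in lines)
        out.write(key + ":" + str(total) + "\n")

for s in streams:
    s.close()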
Or you could just use a database, and let it worry about this.
I am using csv.DictReader to read some large files into memory to then do some analysis, so all objects from multiple CSV files need to be kept in memory. I need to read them as Dictionary to make analysis easier, and because the CSV files may be altered by adding new columns.
Yes SQL can be used, but I'd rather avoid it if it's not needed.
I'm wondering if there is a better and easier way of doing this. My concern is that I will have many dictionary objects with the same keys, wasting memory. The use of __slots__ was an option, but I will only know the attributes of an object after reading the CSV.
[Edit:] Due to being on legacy system and "restrictions", use of third party libraries is not possible.
If you are on Python 2.6 or later, collections.namedtuple is what you are asking for.
See http://docs.python.org/library/collections.html#collections.namedtuple
(there is even an example of using it with csv).
EDIT: It requires the field names to be valid as Python identifiers, so perhaps it is not suitable in your case.
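Still, a minimal sketch of reading a CSV into namedtuples looks like this (the file name is hypothetical):

import csv
from collections import namedtuple

with open("data.csv", newline="") as f:
    reader = csv.reader(f)
    header = next(reader)
    Row = namedtuple("Row", header)        # header names must be valid identifiers
    rows = [Row._make(r) for r in reader]  # one namedtuple per data row, fields by name

If the headers are not valid identifiers, namedtuple("Row", header, rename=True) replaces the invalid names with positional ones (_0, _1, ...), which may or may not be acceptable here.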
Have you considered using pandas?
It works very well for tables. Relevant for you are the read_csv function and the DataFrame type.
This is how you would use it:
>>> import pandas
>>> table = pandas.read_csv('a.csv')
>>> table
a b c
0 1 2 a
1 2 4 b
2 5 6 word
>>> table.a
0 1
1 2
2 5
Name: a
Use Python's shelve module. It gives you a dictionary-like object that is backed by a file on disk, so it can be written out when required and loaded back very easily.
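A tiny sketch of that with the CSV rows from this question (the file names are illustrative):

import csv
import shelve

# Store each CSV row dict in a disk-backed shelf, keyed by line number.
with open("data.csv", newline="") as f, shelve.open("rows.db") as db:
    for i, row in enumerate(csv.DictReader(f)):
        db[str(i)] = row  # shelve keys must be strings

with shelve.open("rows.db") as db:
    print(db["0"])  # rows can be read back later without holding them all in memory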
If all the data in one column are of the same type, you can use NumPy. NumPy's loadtxt and genfromtxt functions can be used to read CSV files, and because they return arrays, the memory usage is smaller than with dicts.
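A short sketch with genfromtxt, reusing the a.csv example above:

import numpy as np

# names=True takes column names from the header; dtype=None infers a type per column.
table = np.genfromtxt("a.csv", delimiter=",", names=True, dtype=None, encoding="utf-8")

print(table.dtype.names)  # ('a', 'b', 'c')
print(table["a"])         # each column is a NumPy array, accessed by name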
Possibilities:
(1) Benchmark the csv.DictReader approach and see if it causes a problem. Note that the dicts contain POINTERS to the keys and values; the actual key strings are not copied into each dict.
(2) For each file, use csv.reader; after the first row, build a class dynamically and instantiate it once per remaining row. Perhaps this is what you had in mind.
(3) Have one fixed class, instantiated once per file, which gives you a list of tuples for the actual data, a tuple that maps column indices to column names, and a dict that maps column names to column indices. Tuples occupy less memory than lists because there is no extra append-space allocated. You can then get and set your data via (row_index, column_index) and (row_index, column_name).
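For what option (3) might look like, here is a rough sketch (the class and method names are made up for illustration):

import csv

class Table:
    # One instance per file: rows stored as tuples, plus index<->name maps.
    def __init__(self, path):
        with open(path, newline="") as f:
            reader = csv.reader(f)
            self.col_names = tuple(next(reader))
            self.name_to_index = {name: i for i, name in enumerate(self.col_names)}
            self.rows = [tuple(row) for row in reader]

    def get(self, row_index, column):
        # column can be a numeric index or a column name
        if isinstance(column, str):
            column = self.name_to_index[column]
        return self.rows[row_index][column]

Setting a value would mean rebuilding the affected row tuple, which is the price paid for the smaller memory footprint of tuples.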
In any case, to get better advice, how about some simple facts and stats: What version of Python? How many files? rows per file? columns per file? total unique keys/column names?