I have a very large (~24 million lines) edge list that I'm trying to import into a Neo4j graph that is already populated with nodes. The CSV file has three columns: from, to, and period (a relationship property). I first tried the REST batch API with the following (Python) code:
batch_queue.append({"method":"POST","to":'index/node/people?uniqueness=get_or_create','id':1,'body':{'key':'name','value':row[0]}})
batch_queue.append({"method":"POST","to":'index/node/people?uniqueness=get_or_create','id':2,'body':{'key':'name','value':row[1]}})
batch_queue.append({"method":"POST","to":'{1}/relationships','body':{'to':"{2}","type":"FP%s" % row[2]}})
The third request failed. I then also tried the following Cypher statement:
USING PERIODIC COMMIT
LOAD CSV FROM "file:///file-name.csv" AS line
MATCH (a:Person {name: line[0]}),(b:Person {name:line[1]})
CREATE (a)-[:FOLLOWS {period: line[2]}]->(b)
This worked at small scale, but gave me an "Unknown Error" when run over the whole list (also with smaller periodic commit values).
Any guidance as to what I'm doing incorrectly would be appreciated.
You might want to look into my batch-importer for that: http://github.com/jexp/batch-import
Otherwise for LOAD CSV, see my blog post here: http://jexp.de/blog/2014/06/load-csv-into-neo4j-quickly-and-successfully/
Use the neo4j-shell for LOAD CSV.
Depending on the memory available, you might have to split the data a bit by moving a window over the file (e.g. 1M rows at a time, as below). Do you have an index or constraint created for :Person(name)?
USING PERIODIC COMMIT
LOAD CSV FROM "file:///file-name.csv" AS line
WITH line
SKIP 2000000 LIMIT 1000000
MATCH (a:Person {name: line[0]}),(b:Person {name:line[1]})
CREATE (a)-[:FOLLOWS {period: line[2]}]->(b)
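If you would rather drive the windowed import from Python instead of the neo4j-shell, a rough sketch could look like the following. This assumes the official neo4j Python driver and Neo4j 3.x parameter syntax; the connection URL, credentials and row counts are placeholders, not part of the original setup.
from neo4j import GraphDatabase

# placeholders: adjust the URL, credentials and total row count
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))
WINDOW = 1000000       # rows per window, matching the Cypher above
TOTAL = 24000000       # ~24M rows in the edge list

with driver.session() as session:
    # index on :Person(name) so the MATCH lookups don't scan every node
    session.run("CREATE INDEX ON :Person(name)")
    for skip in range(0, TOTAL, WINDOW):
        session.run(
            "USING PERIODIC COMMIT "
            "LOAD CSV FROM 'file:///file-name.csv' AS line "
            "WITH line SKIP $skip LIMIT $limit "
            "MATCH (a:Person {name: line[0]}), (b:Person {name: line[1]}) "
            "CREATE (a)-[:FOLLOWS {period: line[2]}]->(b)",
            skip=skip, limit=WINDOW)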
Related
I have to first partition by a "customer group", but I also want to make sure that there is a single CSV file per customer_group. This is because it is time-series data that is needed for inference, and it can't be spread across multiple files.
I tried:
datasink2 = spark_df1.write.format("csv").partitionBy('customer_group').option("compression","gzip").save(destination_path+'/traintestcsvzippartitionocalesce')
but it creates multiple smaller files inside the customer_group/ path, with names like csv.gz0000_part_00.gz, csv.gz0000_part_01.gz, and so on.
I also tried:
datasink2 = spark_df1.write.format("csv").partitionBy('customer_group').coalesce(1).option("compression","gzip").save(destination_path+'/traintestcsvzippartitionocalesce')
but it throws the following error:
AttributeError: 'DataFrameWriter' object has no attribute 'coalesce'
Is there a solution?
I cannot use repartition(1) or coalesce(1) directly without the partitionBy, as it creates only one file, only one worker node works at a time (serially), and it is computationally super expensive.
The repartition function also accepts column names as arguments, not only the number of partitions.
Repartitioning by the write partition column will make Spark save one file per folder.
Please note that if one of your partitions is skewed and one customer group holds the majority of the data, you might run into performance issues.
spark_df1 \
.repartition("customer_group") \
.write \
.partitionBy("customer_group") \
...
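For completeness, the full chain from the question with the extra repartition call would look roughly like this (untested sketch; the format, compression option and output path are taken from the question):
datasink2 = spark_df1 \
    .repartition("customer_group") \
    .write \
    .format("csv") \
    .partitionBy("customer_group") \
    .option("compression", "gzip") \
    .save(destination_path + '/traintestcsvzippartitionocalesce')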
I have two files on HDFS and I just want to join these two files on a column, say employee ID.
I am trying to simply print the files to make sure we are reading that correctly from HDFS.
lines = sc.textFile("hdfs://ip:8020/emp.txt")
print lines.count()
I have tried the foreach and println functions as well, but I am not able to display the file data.
I am working in Python and am totally new to both Python and Spark.
This is really easy: just do a collect.
You must be sure, though, that all the data fits in memory on your master.
my_rdd = sc.parallelize(xrange(10000000))
print my_rdd.collect()
If that is not the case, you should just take a sample using the take method.
# I use an exaggerated number to remind you it is very large and won't fit in memory on your master, so collect wouldn't work
my_rdd = sc.parallelize(xrange(100000000000000000))
print my_rdd.take(100)
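Applied to the file from the question, that boils down to something like this (untested sketch; the HDFS path comes from the question):
lines = sc.textFile("hdfs://ip:8020/emp.txt")
print lines.take(10)   # brings only the first 10 lines to the driver and prints them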
I'm trying to populate a SQLite database using Django with data from a file that consists of 6 million records. However, the code I've written is taking far too long, even with 50,000 records.
This is the code with which I'm trying to populate the database:
import os

def populate():
    with open("filename") as f:
        for line in f:
            col = line.strip().split("|")
            duns = col[1]
            name = col[8]
            job = col[12]
            dun_add = add_c_duns(duns)
            add_contact(c_duns=dun_add, fn=name, job=job)

def add_contact(c_duns, fn, job):
    c = Contact.objects.get_or_create(duns=c_duns, fullName=fn, title=job)
    return c

def add_c_duns(duns):
    cd = Contact_DUNS.objects.get_or_create(duns=duns)[0]
    return cd

if __name__ == '__main__':
    print "Populating Contact db...."
    os.environ.setdefault("DJANGO_SETTINGS_MODULE", "settings")
    from web.models import Contact, Contact_DUNS
    populate()
    print "Done!!"
The code works fine, since I have tested it with dummy records and it gives the desired results. I would like to know if there is a way to lower the execution time of this code. Thanks.
I don't have enough reputation to comment, but here's a speculative answer.
Basically, the only way to speed this up through Django's ORM is to use bulk_create. So the first thing to consider is your use of get_or_create. If your database has existing records that might duplicate entries in the input file, then your only choice is writing the SQL yourself. If you use it to avoid duplicates inside the input file, then preprocess the file to remove duplicate rows.
So if you can live without the get part of get_or_create, then you can follow this strategy:
1. Go through each row of the input file and instantiate a Contact_DUNS instance for each entry (don't actually create the rows, just write Contact_DUNS(duns=duns)) and save all instances to an array. Then pass the array to bulk_create to actually create the rows.
2. Generate a list of DUNS-id pairs with values_list and convert them to a dict with the DUNS number as the key and the row id as the value.
3. Repeat step 1, but with Contact instances. Before creating each instance, use the DUNS number to get the Contact_DUNS id from the dictionary of step 2, and instantiate each Contact as Contact(duns_id=c_duns_id, fullName=fn, title=job). Again, after collecting the Contact instances, just pass them to bulk_create to create the rows.
This should radically improve performance, as you will no longer be executing a query for each input line. But as I said above, this can only work if you can be certain that there are no duplicates in the database or in the input file.
EDIT Here's the code:
import os

def populate_duns():
    # Will only work if there are no DUNS duplicates
    # (both in the DB and within the file)
    duns_instances = []
    with open("filename") as f:
        for line in f:
            duns = line.strip().split("|")[1]
            duns_instances.append(Contact_DUNS(duns=duns))
    # Run a single INSERT query for all DUNS instances
    # (actually it will be run in batches, but it's still quite fast)
    Contact_DUNS.objects.bulk_create(duns_instances)

def get_duns_dict():
    # This is basically a SELECT query for these two fields
    duns_id_pairs = Contact_DUNS.objects.values_list('duns', 'id')
    return dict(duns_id_pairs)

def populate_contacts():
    # Repeat the same process for Contacts
    contact_instances = []
    duns_dict = get_duns_dict()
    with open("filename") as f:
        for line in f:
            col = line.strip().split("|")
            duns = col[1]
            name = col[8]
            job = col[12]
            ci = Contact(duns_id=duns_dict[duns],
                         fullName=name,
                         title=job)
            contact_instances.append(ci)
    # Again, run only a single INSERT query
    Contact.objects.bulk_create(contact_instances)

if __name__ == '__main__':
    print "Populating Contact db...."
    os.environ.setdefault("DJANGO_SETTINGS_MODULE", "settings")
    from web.models import Contact, Contact_DUNS
    populate_duns()
    populate_contacts()
    print "Done!!"
CSV Import
First of all, 6 million records is quite a lot for SQLite, and worse still, SQLite isn't very good at importing CSV data directly.
There is no standard as to what a CSV file should look like, and the
SQLite shell does not even attempt to handle all the intricacies of
interpreting a CSV file. If you need to import a complex CSV file and
the SQLite shell doesn't handle it, you may want to try a different
front end, such as SQLite Database Browser.
On the other hand, MySQL and PostgreSQL are more capable of handling CSV data, and MySQL's LOAD DATA INFILE and PostgreSQL's COPY are both painless ways to import very large amounts of data in a very short period of time.
Suitability of SQLite
You are using Django => you are building a web app => more than one user will access the database. This is what the manual says about concurrency:
SQLite supports an unlimited number of simultaneous readers, but it
will only allow one writer at any instant in time. For many
situations, this is not a problem. Writers queue up. Each application
does its database work quickly and moves on, and no lock lasts for
more than a few dozen milliseconds. But there are some applications
that require more concurrency, and those applications may need to seek
a different solution.
Even your read operations are likely to be rather slow, because an SQLite database is just one single file. So with this amount of data there will be a lot of seek operations involved. The data cannot be spread across multiple files or even disks, as is possible with proper client-server databases.
The good news for you is that with Django you can usually switch from SQLite to MySQL or PostgreSQL just by changing your settings.py. No other changes are needed. (The reverse isn't always true.)
So I urge you to consider switching to MySQL or PostgreSQL before you get in too deep. It will help you solve your present problem and also help you avoid problems that you will run into sooner or later.
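For example, switching the backend is roughly a matter of swapping the DATABASES entry in settings.py; the engine name below is the stock Django PostgreSQL backend, and the database name, user and password are placeholders:
# settings.py (placeholder credentials)
DATABASES = {
    'default': {
        'ENGINE': 'django.db.backends.postgresql_psycopg2',
        'NAME': 'contactsdb',
        'USER': 'dbuser',
        'PASSWORD': 'secret',
        'HOST': 'localhost',
        'PORT': '5432',
    }
}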
6,000,000 records is quite a lot to import via Python. If Python is not a hard requirement, you could write an SQLite script that imports the CSV data directly and creates your tables using SQL statements. Even faster would be to preprocess your file using awk and output two CSV files corresponding to your two tables.
I once imported 20,000,000 records using the sqlite3 CSV importer and it took only a few minutes.
I have two text files that have similar formatting. The first (732KB):
>lib_1749;size=599;
TACGGAGGATGCGAGCGTTATCCGGATTTATTGGGTTTAAAGGGAGCGTAGGCGGACTATTAAGTCAGCTGTGAAAGTTTGCGGCTCAACCGTAAAATTGCTAGCGGTGAAATGCTTAGATATCACGAAGAACTCCGATTGCGAAGGCAGCTCACTAGACTGTCACTGACACTGATGCTCGAAAGTGTGGGTATCAAACA
--
>lib_2235;size=456;
TACGGAGGATCCGAGCGTTATCCGGATTTATTGGGTTTAAAGGGAGCGTAGGCGGACTATTAAGTCAGCTGTGAAAGTTTGCGGCTCAACCGTAAAATTGCTAGCGGTGAAATGCTTAGATATCACGAAGAACTCCGATTGCGAAGGCAGCTTACTGGACTGTAACTGACGTTGAGGCTCGAAAGCGTGGGGAGCAAACA
--
>lib_13686;size=69;
TACGTATGGAGCAAGCGTTATCCGGATTTACTGGGTGTAAAGGGAGTGTAGGTGGCCAGGCAAGTCAGAAGTGAAAGCCCGGGGCTCAACCCCGGGGCTGGTAGCGGTGAAATGCGTAGATATTAGGAGGAACACCAGTGGCGAAGGCGGCTTGCTGGACTGTAACTGACACTGAGGCTCGAAAGCGTGGGGAGCAAACA
--
The second (5.26GB):
>Stool268_1 HWI-ST155_0605:1:1101:1194:2070#CTGTCTCTCCTA
TACGGAGGATGCGAGCGTTATCCGGATTTACTGGGTTTAAAGGGAGCGCAGACGGGACGTTAAGTCAGCTGTGAAAGTTTGGGGCTCAACCCTAAAACTGCTAGCGGTGAAATGCTTAGATATCGGGAGGAACTCCGGTTGCGAAGGCAGCATACTGGACTGCAACTGACGCTGATGCTCGAAAGTGTGGGTATCAAACAGG
--
Note the key difference is the header for each entry (lib_1749 vs. Stool268_1). What I need is to create a mapping file between the headers of one file and the headers of the second using the sequence (e.g., TACGGAGGATGCGAGCGTTATCCGGAT...) as a key.
Note, as one final complication, that the mapping is not going to be 1-to-1: there will be multiple entries of the form Stool****** for each entry of lib****. This is because the sequence in the first file was trimmed to 200 characters, but in the second file it can be longer.
For smaller files I would just do something like this in Python, but here I run into trouble because these files are so big that they cannot be read into memory all at once. Usually I try Unix utilities, but in this case I cannot think of how to accomplish this.
Thank you!
In my opinion, the easiest way would be to use BLAST+...
Set up the larger file as a BLAST database and use the smaller file as the query...
Then just write a small script to analyse the output, i.e. take the top hit or two to create the mapping file.
By the way, you might find SequenceServer (Google it) helpful for setting up a custom BLAST database and your BLAST environment...
BioPython should be able to read in large FASTA files.
from Bio import SeqIO
from collections import defaultdict

mapping = defaultdict(list)

for stool_record in SeqIO.parse('stool.fasta', 'fasta'):
    stool_seq = str(stool_record.seq)

    for lib_record in SeqIO.parse('libs.fasta', 'fasta'):
        lib_seq = str(lib_record.seq)

        if stool_seq.startswith(lib_seq):
            mapping[lib_record.id.split(';')[0]].append(stool_record.id)
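Since the lib file is only ~732 KB while the stool file is 5.26 GB, a more practical variant (just a sketch, assuming the lib sequences really are trimmed to 200 characters as described in the question) is to load the small file into a dictionary once and then stream the large file a single time:
from Bio import SeqIO
from collections import defaultdict

# load the small file once: trimmed sequence -> lib header
lib_by_seq = {}
for lib_record in SeqIO.parse('libs.fasta', 'fasta'):
    lib_by_seq[str(lib_record.seq)] = lib_record.id.split(';')[0]

# stream the large file once; compare on the first 200 characters only
mapping = defaultdict(list)
for stool_record in SeqIO.parse('stool.fasta', 'fasta'):
    prefix = str(stool_record.seq)[:200]
    if prefix in lib_by_seq:
        mapping[lib_by_seq[prefix]].append(stool_record.id)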
I am trying to import a JSON file for use in a Python editor so that I can perform analysis on the data. I am quite new to Python, so I am not sure how I am meant to achieve this. My JSON file is full of tweet data, an example of which is shown here:
{"id":441999105775382528,"score":0.0,"text":"blablabla","user_id":1441694053,"created":"Fri Mar 07 18:09:33 GMT 2014","retweet_id":0,"source":"twitterfeed","geo_long":null,"geo_lat":null,"location":"","screen_name":"SevenPS4","name":"Playstation News","lang":"en","timezone":"Amsterdam","user_created":"2013-05-19","followers":463,"hashtags":"","mentions":"","following":1062,"urls":"http://bit.ly/1lcbBW6","media_urls":"","favourites_count":4514,"reply_status_id":0,"reply_user_id":0,"is_truncated":false,"is_retweet":false,"original_text":null,"status_count":4514,"description":"Tweeting the latest Playstation news!","url":null,"utc_offset":3600}
My questions:
How do I import the JSON file so that I can perform analysis on it in a Python editor?
How do I perform analysis on only a set number of the tweets (i.e. 100/200 of them instead of all of them)?
Is there a way to get rid of some of the fields such as score, user_id, created, etc without having to go through all of my data manually to do so?
Some of the tweets have invalid/unusable symbols within them; is there any way to get rid of those without having to go through them manually?
I'd use Pandas for this job, as you will not only load the JSON but also perform some data analysis tasks on it. Depending on the size of your JSON file, something like this should do it:
import pandas as pd
import json

# read the json file (replace "yourfilename" with your file location)
with open("yourfilename") as f:
    j = json.load(f)

# you might select the relevant keys before constructing the data-frame
df = pd.DataFrame([{k: v for k, v in j.iteritems() if k in ["id", "retweet_count"]}])

# select a subset (the first five rows)
df.iloc[:5]

# do some analysis
df.retweet_count.sum()
>>> 200
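If the file actually holds one tweet object per line, as the example above suggests (this is an assumption), pandas can also read it directly; the file name and the selected columns below are placeholders taken from the sample tweet:
import pandas as pd

# read a JSON-lines file (one tweet object per line); the path is a placeholder
df = pd.read_json("tweets.json", lines=True)

# keep only the columns you care about (drops score, user_id, created, ...)
df = df[["id", "text", "screen_name", "lang", "followers"]]

# analyse only the first 200 tweets
subset = df.head(200)
print(subset["followers"].describe())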