py2neo - Neo4j - System Error - Create Batch Nodes/Relationships - python

Attempting to batch create nodes & relationships - batch creation is failing - Traceback at end of the post
Note: the code works with a smaller subset of nodes but fails once it gets into a massive number of relationships; it is unclear at what limit this occurs.
Wondering if I need to increase ulimit above 40,000 open files
I read somewhere that people were running into XStream issues with the REST API while doing batch creates - unclear whether the problem lies with py2neo, with Neo4j server tuning/configuration, or on the Python side.
Any guidance would be greatly appreciated.
One cluster within the data set ends up with around 625,525 relationships across 700+ nodes.
Total relationships will be 1M+. Hardware: Apple MacBook Pro Retina (x86_64) running Ubuntu 13.04, SSD, 8 GB memory.
Neo4j: auto_indexing & auto_relationships set to ON
Nodes clustered/grouped via Python Pandas DataFrame.groupby()
Nodes: 3 properties each
Relationships: 1 property; both IN and OUT relationships are created
ulimit set to 40,000 files open
Code
https://github.com/alienone/OSINT/blob/master/MANDIANTAPT/spitball.py
Operating System: Ubuntu 13.04
Python version: 2.7.5
py2neo Version: 1.5.1
Java version: 1.7.0_25-b15
Neo4j version: Community Edition 1.9.2
Traceback
Traceback (most recent call last):
File "/home/alienone/Programming/Python/OSINT/MANDIANTAPT/spitball.py", line 63, in
main()
File "/home/alienone/Programming/Python/OSINT/MANDIANTAPT/spitball.py", line 59, in main
graph_db.create(*sorted_nodes)
File "/home/alienone/.pythonbrew/pythons/Python-2.7.5/lib/python2.7/site-packages/py2neo/neo4j.py", line 420, in create
return batch.submit()
File "/home/alienone/.pythonbrew/pythons/Python-2.7.5/lib/python2.7/site-packages/py2neo/neo4j.py", line 2123, in submit
for response in self._submit()
File "/home/alienone/.pythonbrew/pythons/Python-2.7.5/lib/python2.7/site-packages/py2neo/neo4j.py", line 2092, in submit
for id, request in enumerate(self.requests)
File "/home/alienone/.pythonbrew/pythons/Python-2.7.5/lib/python2.7/site-packages/py2neo/rest.py", line 428, in _send
return self._client().send(request)
File "/home/alienone/.pythonbrew/pythons/Python-2.7.5/lib/python2.7/site-packages/py2neo/rest.py", line 365, in send
return Response(request.graph_db, rs.status, request.uri, rs.getheader("Location", None), rs_body)
File "/home/alienone/.pythonbrew/pythons/Python-2.7.5/lib/python2.7/site-packages/py2neo/rest.py", line 279, in init
raise SystemError(body)
SystemError: None
Process finished with exit code 1

I had a similar issue. One way to deal with it is to do the batch.submit() for chunks of your data and not the whole data set. This is slower of course, but splitting one million nodes in chunks of 5000 is still faster than adding every node separately.
I use a small helper class to do this, note that all my nodes are indexed: https://gist.github.com/anonymous/6293739
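In rough outline, the idea looks like this (a sketch only, not the actual gist; graph_db.create and sorted_nodes are the names from the question):
def create_in_chunks(graph_db, abstracts, chunk_size=5000):
    """Submit the batch in fixed-size chunks instead of one huge request."""
    created = []
    for start in range(0, len(abstracts), chunk_size):
        chunk = abstracts[start:start + chunk_size]
        created.extend(graph_db.create(*chunk))  # same call as in the question, just smaller batches
    return created

# usage with the list from the question:
# results = create_in_chunks(graph_db, sorted_nodes, chunk_size=5000)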

Related

Is there any limits of saving result on S3 from sagemaker Processing?

※ I used Google Translate, so if you have any questions, let me know!
I am trying to run a Python script on four large datasets using SageMaker Processing. My current situation is as follows:
the script runs with 3 of the datasets
the script fails with 1 dataset (the biggest one, with the same structure as the others)
for all 4 datasets, the script itself finishes (so I suspect the error happens in S3, i.e. when copying the SageMaker result to S3)
The error I got is this InternalServerError.
Traceback (most recent call last):
File "sagemaker_train_and_predict.py", line 56, in <module>
outputs=outputs
File "{xxx}/sagemaker_constructor.py", line 39, in run
outputs=outputs
File "{masked}/.pyenv/versions/3.6.8/lib/python3.6/site-packages/sagemaker/processing.py", line 408, in run
self.latest_job.wait(logs=logs)
File "{masked}/.pyenv/versions/3.6.8/lib/python3.6/site-packages/sagemaker/processing.py", line 723, in wait
self.sagemaker_session.logs_for_processing_job(self.job_name, wait=True)
File "{masked}/.pyenv/versions/3.6.8/lib/python3.6/site-packages/sagemaker/session.py", line 3111, in logs_for_processing_job
self._check_job_status(job_name, description, "ProcessingJobStatus")
File "{masked}/.pyenv/versions/3.6.8/lib/python3.6/site-packages/sagemaker/session.py", line 2615, in _check_job_status
actual_status=status,
sagemaker.exceptions.UnexpectedStatusException: Error for Processing job sagemaker-vm-train-and-predict-2020-04-12-04-15-40-655: Failed. Reason: InternalServerError: We encountered an internal error. Please try again.
There may be some issue transferring the output data to S3 if the output is generated at a high rate and its size is too large.
You can 1) try to slow down writing the output a bit, or 2) call S3 from your algorithm container to upload the output directly using the boto3 client (https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/s3.html).
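For option 2, a rough sketch (the bucket name, key prefix, and local path are placeholders, not values from your job):
import os
import boto3

s3 = boto3.client("s3")

def upload_result(local_path, bucket="my-output-bucket", key_prefix="processing-output/"):
    """Upload one result file straight to S3 from inside the container."""
    key = key_prefix + os.path.basename(local_path)
    s3.upload_file(local_path, bucket, key)  # multipart upload is handled automatically for large files

# e.g. upload_result("/opt/ml/processing/output/predictions.csv")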

Dask Distributed: Reading .csv from HDFS

I'm performance testing Dask using "Distributed Pandas on a Cluster with Dask DataFrames" as a guide.
In Matthew's example, he has a 20GB file and 64 workers (8 physical nodes).
In my case, I have an 82GB file and 288 workers (12 physical nodes; there's an HDFS data node on each).
On all 12 nodes, I can access HDFS and execute a simple Python script that displays info on a file:
import pyarrow as pa
fs = pa.hdfs.connect([url], 8022)
print(str(fs.info('/path/to/file.csv')))
If I create a single-node cluster (only 24 workers) using only the machine running Dask Scheduler, I can read the .csv from HDFS and print the length:
import dask
import dask.dataframe as dd
from dask.distributed import Client
client = Client()
dask.config.set(hdfs_backend='pyarrow')
df = dd.read_csv('hdfs://[url]:8022/path/to/file.csv')
df = client.persist(df)
print(str(len(df)))
That last line gives 1046250873 (nice!) and takes 3m17s to run.
However, when I use the full cluster, that last line calling len(df) dies and I get this error:
KilledWorker: ("('pandas_read_text-read-block-from-delayed-9ad3beb62f0aea4a07005d5c98749d7e', 1201)", 'tcp://[url]:42866')
This is similar to an issue mentioned here which has a solution here involving Dask Yarn and a config (?) that looks like: worker_env={'ARROW_LIBHDFS_DIR': ...}
However, I'm not using Yarn, although my guess is that the Dask Workers are somehow not configured with the HDFS/Arrow paths they need in order to connect.
I don't see any documentation on this, hence my question as to what I'm missing.
Edit:
Here's the error traceback I'm seeing in the output of the Dask Workers:
distributed.protocol.pickle - INFO - Failed to deserialize b'\x80\x04\x95N\x05\x00\x00\x00\x00\x00\x00(\x8c\x14dask.dataframe.utils\x94\x8c\ncheck_meta\x94\x93\x94(\x8c\x12dask
.compatibility\x94\x8c\x05apply\x94\x93\x94\x8c\x15dask.dataframe.io.csv\x94\x8c\x10pandas_read_text\x94\x93\x94]\x94(\x8c\x11pandas.io.parsers\x94\x8c\x08read_csv\x94\x93\x94(
\x8c\x0fdask.bytes.core\x94\x8c\x14read_block_from_file\x94\x93\x94h\r\x8c\x08OpenFile\x94\x93\x94(\x8c\x12dask.bytes.pyarrow\x94\x8c\x17PyArrowHadoopFileSystem\x94\x93\x94)\x8
1\x94}\x94(\x8c\x02fs\x94\x8c\x0cpyarrow.hdfs\x94\x8c\x10HadoopFileSystem\x94\x93\x94(\x8c\r10.255.200.91\x94MV\x1fNN\x8c\x07libhdfs\x94Nt\x94R\x94\x8c\x08protocol\x94\x8c\x04h
dfs\x94ub\x8c\x1a/path/to/file.csv\x94\x8c\x02rb\x94NNNt\x94R\x94K\x00J\x00\x90\xd0\x03C\x01\n\x94t\x94C\x12animal,weight,age\n\x94\x8c\x08builtins\x94\x8c\x04dict\x94
\x93\x94]\x94\x86\x94h*]\x94(]\x94(\x8c\x06animal\x94\x8c\x05numpy\x94\x8c\x05dtype\x94\x93\x94\x8c\x02O8\x94K\x00K\x01\x87\x94R\x94(K\x03\x8c\x01|\x94NNNJ\xff\xff\xff\xffJ\xff
\xff\xff\xffK?t\x94be]\x94(\x8c\x06weight\x94h2\x8c\x02i8\x94K\x00K\x01\x87\x94R\x94(K\x03\x8c\x01<\x94NNNJ\xff\xff\xff\xffJ\xff\xff\xff\xffK\x00t\x94be]\x94(\x8c\x03age\x94h<e
e\x86\x94]\x94(h/h9h#eeh*]\x94(]\x94(\x8c\x0cwrite_header\x94\x89e]\x94(\x8c\x07enforce\x94\x89e]\x94(\x8c\x04path\x94Nee\x86\x94t\x94\x8c\x11pandas.core.frame\x94\x8c\tDataFra
me\x94\x93\x94)\x81\x94}\x94(\x8c\x05_data\x94\x8c\x15pandas.core.internals\x94\x8c\x0cBlockManager\x94\x93\x94)\x81\x94(]\x94(\x8c\x18pandas.core.indexes.base\x94\x8c\n_new_In
dex\x94\x93\x94hW\x8c\x05Index\x94\x93\x94}\x94(\x8c\x04data\x94\x8c\x15numpy.core.multiarray\x94\x8c\x0c_reconstruct\x94\x93\x94h0\x8c\x07ndarray\x94\x93\x94K\x00\x85\x94C\x01
b\x94\x87\x94R\x94(K\x01K\x03\x85\x94h5\x89]\x94(h/h9h#et\x94b\x8c\x04name\x94Nu\x86\x94R\x94hY\x8c\x19pandas.core.indexes.range\x94\x8c\nRangeIndex\x94\x93\x94}\x94(hjN\x8c\x0
5start\x94K\x00\x8c\x04stop\x94K\x00\x8c\x04step\x94K\x01u\x86\x94R\x94e]\x94(h`hbK\x00\x85\x94hd\x87\x94R\x94(K\x01K\x02K\x00\x86\x94h<\x89C\x00\x94t\x94bh`hbK\x00\x85\x94hd\x
87\x94R\x94(K\x01K\x01K\x00\x86\x94h5\x89]\x94t\x94be]\x94(hYh[}\x94(h]h`hbK\x00\x85\x94hd\x87\x94R\x94(K\x01K\x02\x85\x94h5\x89]\x94(h9h#et\x94bhjNu\x86\x94R\x94hYh[}\x94(h]h`
hbK\x00\x85\x94hd\x87\x94R\x94(K\x01K\x01\x85\x94h5\x89]\x94h/at\x94bhjNu\x86\x94R\x94e}\x94\x8c\x060.14.1\x94}\x94(\x8c\x04axes\x94hV\x8c\x06blocks\x94]\x94(}\x94(\x8c\x06valu
es\x94hy\x8c\x08mgr_locs\x94h(\x8c\x05slice\x94\x93\x94K\x01K\x03K\x01\x87\x94R\x94u}\x94(h\x9dh\x7fh\x9eh\xa0K\x00K\x01K\x01\x87\x94R\x94ueust\x94b\x8c\x04_typ\x94\x8c\tdatafr
ame\x94\x8c\t_metadata\x94]\x94ub\x8c\x0cfrom_delayed\x94t\x94.'
Traceback (most recent call last):
File "/usr/lib/python3.6/site-packages/distributed/protocol/pickle.py", line 59, in loads
return pickle.loads(x)
File "/usr/lib64/python3.6/site-packages/pyarrow/hdfs.py", line 38, in __init__
self._connect(host, port, user, kerb_ticket, driver, extra_conf)
File "pyarrow/io-hdfs.pxi", line 89, in pyarrow.lib.HadoopFileSystem._connect
File "pyarrow/error.pxi", line 83, in pyarrow.lib.check_status
pyarrow.lib.ArrowIOError: Unable to load libjvm
Again, I can use pyarrow to successfully read a file from HDFS from any of the 12 nodes.
Looking at the traceback my guess is that PyArrow isn't correctly installed on the worker nodes. I might ask on the PyArrow issue tracker to see if they can help you to diagnose that traceback.
Ho boy! After building libhdfs3 from scratch, deploying it to part of the cluster, and finding the same exact result (ImportError: Can not find the shared library: libhdfs3.so), I realized the issue: I've been starting the Dask workers via pssh, so they weren't picking up the environment variables they should.
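For anyone else hitting this: the worker processes need something like the following environment before pyarrow connects (a sketch only; the exact paths depend on your cluster and are placeholders here):
import os

# These must be visible in the *worker* process environment, e.g. exported in the
# shell that pssh uses to start dask-worker. The paths below are placeholders.
os.environ.setdefault("JAVA_HOME", "/usr/lib/jvm/java")                   # so libjvm can be located
os.environ.setdefault("ARROW_LIBHDFS_DIR", "/usr/lib/hadoop/lib/native")  # so libhdfs can be located

import pyarrow as pa
fs = pa.hdfs.connect("namenode-host", 8022)  # same connect call as in the question, placeholder host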

KeyError: 'max_overlaps' on tensorflow ver Faster R-CNN

I tried to run a Python Faster R-CNN based on TensorFlow, which I cloned from https://github.com/smallcorgi/Faster-RCNN_TF
I built a dataset myself and rewrote the data API so the dataset fits. The images in the dataset are all composed of meaningless background and text.
I have a .txt file that records the text location in an image, such as
ID_card/train/3.jpg 1 209 39 261 89
And my goal is to find text from a new image.
But when I run
python ./tools/train_net.py --device cpu --device_id 1 --solver VGG_CNN_M_1024 --weight ./data/pretrain_model/VGG_imagenet.npy --imdb ID_card_train --network IDcard_train
I got this KeyError: 'max_overlaps'
and here is the terminal record and error traceback.
Traceback (most recent call last):
File "./tools/train_net.py", line 97, in <module>
max_iters=args.max_iters)
File"/Users/jay_fu/tasks/catpatch/ClickCatpatch/tools/../lib/fast_rcnn/train.py", line 259, in train_net
roidb = filter_roidb(roidb)
File"/Users/jay_fu/tasks/catpatch/ClickCatpatch/tools/../lib/fast_rcnn/train.py", line 250, in filter_roidb
filtered_roidb = [entry for entry in roidb if is_valid(entry)]
File"/Users/jay_fu/tasks/catpatch/ClickCatpatch/tools/../lib/fast_rcnn/train.py", line 239, in is_valid
overlaps = entry['max_overlaps']
KeyError: 'max_overlaps'
I did google this and tried deleting the /cache folder, but it didn't work. The next time I ran the code, the folder and the .pkl file were created again and the same error came up.
Some other answer said to delete another folder, lib/datasets/VOCdevkit-matlab-wrapper,
however smallcorgi/Faster-RCNN_TF does not contain this folder, so I have nowhere to go with that.
I wonder what happened to my code and what could cause this error. I have no idea what to do.
Can anyone give me some help, a solution, or even just a rough idea?
Edit:
I ran the demo following @VegardKT's suggestion, and it works fine.
The terminal shows success and figures 1-5 show up.

Sound files in PsychoPy won't load

I'm currently working on building an experiment in PsychoPy (v1.82.01 stand-alone). I started on the project several months ago with an older version of PsychoPy.
It worked great and I ran some pilot subjects. We have since adjusted the stimulus sounds, and now it won't run.
It looks like there is an issue with referencing the sound file, but I can’t figure out what’s going on.
I recreated the first part of the experiment with a single file rather than a loop so that it would be easier to debug. The sound file is referenced using:
study_sound = sound.Sound(u'2001-1.ogg', secs=-1)
When I run it, I get this output:
Running: /Users/dkbjornn/Desktop/Test/test.py
2016-04-29 14:05:43.164 python[65267:66229207] ApplePersistenceIgnoreState: Existing state will not be touched. New state will be written to /var/folders/9f/3kr6zwgd7rz95bcsfw41ynw40000gp/T/org.psychopy.PsychoPy2.savedState
0.3022 WARNING Movie2 stim could not be imported and won't be available
sndinfo: failed to open the file.
Traceback (most recent call last):
File "/Users/dkbjornn/Desktop/Test/test.py", line 84, in <module>
study_sound = sound.Sound(u'2001-1.ogg', secs=-1)
File "/Applications/PsychoPy2.app/Contents/Resources/lib/python2.7/psychopy/sound.py", line 380, in __init__
self.setSound(value=value, secs=secs, octave=octave, hamming=hamming)
File "/Applications/PsychoPy2.app/Contents/Resources/lib/python2.7/psychopy/sound.py", line 148, in setSound
self._setSndFromFile(value)
File "/Applications/PsychoPy2.app/Contents/Resources/lib/python2.7/psychopy/sound.py", line 472, in _setSndFromFile
start=self.startTime, stop=self.stopTime)
File "/Applications/PsychoPy2.app/Contents/Resources/lib/python2.7/pyolib/tables.py", line 1420, in setSound
saved data to u'/Users/dkbjornn/Desktop/Test/data/99_test_2016_Apr_29_1405_1.csv'
_size, _dur, _snd_sr, _snd_chnls, _format, _type = sndinfo(path)
TypeError: 'NoneType' object is not iterable
The important thing here is the sndinfo: failed to open the file. message. Most likely, it cannot find your file on the disk. Check the following:
Is the file 2001-1.ogg in the same folder as your experiment? Not in a subfolder? Or have you accidentally changed your path, e.g. using os.chdir?
Is it actually called 2001-1.ogg? Any differences in uppercase/lowercase, spaces, etc. all count.
Alternatively, there's something in the particular way the .ogg was saved that causes the problem, even though the Sound class can read a large set of different sound codecs. Try exporting the sound file in other formats, e.g. .mp3 or .wav.
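A quick way to check that the path points where you think it does (a sketch using the file name from your post; drop it in just before the Sound is created):
import os
from psychopy import sound

snd_file = '2001-1.ogg'
# prints the absolute path PsychoPy will resolve and whether the file is actually there
print(os.path.abspath(snd_file), os.path.isfile(snd_file))

# if the path checks out but the .ogg still fails, try a re-exported copy, e.g.:
# study_sound = sound.Sound(u'2001-1.wav', secs=-1)
study_sound = sound.Sound(snd_file, secs=-1)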

Memory error when using pandas read_csv

I am trying to do something fairly simple, reading a large csv file into a pandas dataframe.
data = pandas.read_csv(filepath, header = 0, sep = DELIMITER,skiprows = 2)
The code either fails with a MemoryError, or just never finishes.
Memory usage in Task Manager stopped at 506 MB, and after 5 minutes of no change and no CPU activity in the process I stopped it.
I am using pandas version 0.11.0.
I am aware that there used to be a memory problem with the file parser, but according to http://wesmckinney.com/blog/?p=543 this should have been fixed.
The file I am trying to read is 366 MB; the code above works if I cut the file down to something short (25 MB).
It has also happened that I get a pop up telling me that it can't write to address 0x1e0baf93...
Stacktrace:
Traceback (most recent call last):
File "F:\QA ALM\Python\new WIM data\new WIM data\new_WIM_data.py", line 25, in <module>
wimdata = pandas.read_csv(filepath, header = 0, sep = DELIMITER,skiprows = 2)
File "C:\Program Files\Python\Anaconda\lib\site-packages\pandas\io\parsers.py", line 401, in parser_f
return _read(filepath_or_buffer, kwds)
File "C:\Program Files\Python\Anaconda\lib\site-packages\pandas\io\parsers.py", line 216, in _read
return parser.read()
File "C:\Program Files\Python\Anaconda\lib\site-packages\pandas\io\parsers.py", line 643, in read
df = DataFrame(col_dict, columns=columns, index=index)
File "C:\Program Files\Python\Anaconda\lib\site-packages\pandas\core\frame.py", line 394, in __init__
mgr = self._init_dict(data, index, columns, dtype=dtype)
File "C:\Program Files\Python\Anaconda\lib\site-packages\pandas\core\frame.py", line 525, in _init_dict
dtype=dtype)
File "C:\Program Files\Python\Anaconda\lib\site-packages\pandas\core\frame.py", line 5338, in _arrays_to_mgr
return create_block_manager_from_arrays(arrays, arr_names, axes)
File "C:\Program Files\Python\Anaconda\lib\site-packages\pandas\core\internals.py", line 1820, in create_block_manager_from_arrays
blocks = form_blocks(arrays, names, axes)
File "C:\Program Files\Python\Anaconda\lib\site-packages\pandas\core\internals.py", line 1872, in form_blocks
float_blocks = _multi_blockify(float_items, items)
File "C:\Program Files\Python\Anaconda\lib\site-packages\pandas\core\internals.py", line 1930, in _multi_blockify
block_items, values = _stack_arrays(list(tup_block), ref_items, dtype)
File "C:\Program Files\Python\Anaconda\lib\site-packages\pandas\core\internals.py", line 1962, in _stack_arrays
stacked = np.empty(shape, dtype=dtype)
MemoryError
Press any key to continue . . .
A bit of background - I am trying to convince people that Python can do the same as R. For this I am trying to replicate an R script that does
data <- read.table(paste(INPUTDIR,config[i,]$TOEXTRACT,sep=""), HASHEADER, DELIMITER,skip=2,fill=TRUE)
R not only manages to read the above file just fine, it even reads several of these files in a for loop (and then does some stuff with the data). If Python does have a problem with files of that size, I might be fighting a losing battle...
Windows memory limitation
Memory errors happen a lot with Python when using the 32-bit version on Windows. This is because 32-bit processes only get 2 GB of memory to play with by default.
Tricks for lowering memory usage
If you are not using 32-bit Python on Windows but are looking to improve memory efficiency while reading csv files, there is a trick.
The pandas.read_csv function takes an option called dtype. This lets pandas know what types exist inside your csv data.
How this works
By default, pandas will try to guess what dtypes your csv file has. This is a very heavy operation because while it is determining the dtype, it has to keep all raw data as objects (strings) in memory.
Example
Let's say your csv looks like this:
name, age, birthday
Alice, 30, 1985-01-01
Bob, 35, 1980-01-01
Charlie, 25, 1990-01-01
This example is of course no problem to read into memory, but it's just an example.
If pandas were to read the above csv file without any dtype option, the age would be stored as strings in memory until pandas has read enough lines of the csv file to make a qualified guess.
I think the default in pandas is to read 1,000,000 rows before guessing the dtype.
Solution
Specifying dtype={'age':int} as an option to .read_csv() lets pandas know that age should be interpreted as a number. This saves you lots of memory.
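For the sample csv above, that looks like this (a minimal sketch with the csv inlined so it runs as-is):
import io
import pandas as pd

csv_text = """name,age,birthday
Alice,30,1985-01-01
Bob,35,1980-01-01
Charlie,25,1990-01-01
"""

df = pd.read_csv(io.StringIO(csv_text), dtype={'age': int})
print(df.dtypes)  # age comes back as int64 directly, with no object-to-int guessing pass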
Problem with corrupt data
However, if your csv file would be corrupted, like this:
name, age, birthday
Alice, 30, 1985-01-01
Bob, 35, 1980-01-01
Charlie, 25, 1990-01-01
Dennis, 40+, None-Ur-Bz
Then specifying dtype={'age':int} will break the .read_csv() command, because it cannot cast "40+" to int. So sanitize your data carefully!
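If you can't guarantee clean data, one possible workaround (not part of the dtype approach above) is to read the column as strings and coerce the bad values afterwards:
import io
import pandas as pd

csv_text = """name,age,birthday
Alice,30,1985-01-01
Dennis,40+,None-Ur-Bz
"""

df = pd.read_csv(io.StringIO(csv_text), dtype={'age': str})
df['age'] = pd.to_numeric(df['age'], errors='coerce')  # '40+' becomes NaN instead of raising
print(df)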
Here you can see how the memory usage of a pandas dataframe is a lot higher when floats are kept as strings:
Try it yourself (run each snippet in a fresh interpreter, since ru_maxrss only ever grows; on Linux it is reported in kilobytes):
import resource
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.choice(['1.0', '0.6666667', '150000.1'], (100000, 10)))
resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
# 224544 (~224 MB)
df = pd.DataFrame(np.random.choice([1.0, 0.6666667, 150000.1], (100000, 10)))
resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
# 79560 (~79 MB)
I had the same memory problem with a simple read of a tab delimited text file around 1 GB in size (over 5.5 million records) and this solved the memory problem:
df = pd.read_csv(myfile,sep='\t') # didn't work, memory error
df = pd.read_csv(myfile,sep='\t',low_memory=False) # worked fine and in less than 30 seconds
Spyder 3.2.3
Python 2.7.13 64bits
I tried chunksize while reading a big CSV file
reader = pd.read_csv(filePath,chunksize=1000000,low_memory=False,header=0)
The reader is now an iterator of chunks. We can iterate over it and write/append each chunk to a new csv, or perform any other operation:
for chunk in reader:
    print(chunk.columns)
    print("Chunk -> File process")
    with open(destination, 'a') as f:
        chunk.to_csv(f, header=False, sep='\t', index=False)
    print("Chunk appended to the file")
I use Pandas on my Linux box and faced many memory leaks that only got resolved after upgrading Pandas to the latest version (cloned from GitHub).
I encountered this issue as well when I was running in a virtual machine, or somewhere else where the memory is strictly limited. It has nothing to do with pandas or numpy or csv; it will always happen if you try to use more memory than you are allowed to, and not only in Python.
The only chance you have is what you already tried: chop the big thing down into smaller pieces that fit into memory.
If you ever asked yourself what MapReduce is all about, you just found out by yourself: MapReduce would distribute the chunks over many machines, while you process the chunks on one machine one after another.
What you found out with the concatenation of the chunk files might indeed be an issue; maybe some copying is needed in this operation. In the end this may save you in your current situation, but if your csv gets a little bit larger you might run into that wall again...
It could also be that pandas is smart enough to only load the individual data chunks into memory when you actually do something with them, like concatenating them into a big df?
Several things you can try:
Don't load all the data at once, but split it into pieces
As far as I know, hdf5 is able to do these chunks automatically and only loads the part your program currently works on (see the sketch after this list)
Check whether the types are OK; a string '0.111111' needs more memory than a float
Think about what you actually need: if the address is stored as a string, you might not need it for numerical analysis...
A database can help with accessing and loading only the parts you actually need (e.g. only the 1% active users)
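For the hdf5 point above, a rough sketch of what chunked storage could look like (requires PyTables; file, key, and column names here are placeholders):
import pandas as pd

# write the big csv to an HDF5 store chunk by chunk, so it never sits in RAM whole
with pd.HDFStore('wim_data.h5', mode='w') as store:                  # hypothetical output file
    for chunk in pd.read_csv('big_file.csv', chunksize=100000):      # hypothetical input file
        store.append('data', chunk, data_columns=True)               # appends on disk

# later, load only the rows you actually need
subset = pd.read_hdf('wim_data.h5', 'data', where='age > 30')        # 'age' is a placeholder column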
There is no error with Pandas 0.12.0 and NumPy 1.8.0.
I have managed to create a big DataFrame, save it to a csv file, and then successfully read it. Please see the example here. The size of the file is 554 MB (it even worked for a 1.1 GB file, which took longer; to generate the 1.1 GB file, use a frequency of 30 seconds). Though I have 4 GB of RAM available.
My suggestion is to try updating Pandas. Another thing that could be useful is to run your script from the command line, because for R you are not using Visual Studio (this was already suggested in the comments to your question), hence it has more resources available.
Add these:
ratings = pd.read_csv(..., low_memory=False, memory_map=True)
My memory usage with these two options:
# 319,082,496
Without these two:
# 349,110,272
Although this is more of a workaround than a fix, I'd try converting that CSV to JSON (should be trivial) and using the read_json method instead - I've been writing and reading sizable JSON dataframes (hundreds of MB) in Pandas this way without any problem at all.
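As a rough sketch of that detour (paths are placeholders), the one-time conversion can itself be done in chunks so the csv never has to fit in memory at once:
import pandas as pd

# one-time conversion: stream the csv through in chunks, write line-delimited JSON
with open('big_file.json', 'w') as out:                              # hypothetical output path
    for chunk in pd.read_csv('big_file.csv', chunksize=100000):      # hypothetical input path
        out.write(chunk.to_json(orient='records', lines=True))
        out.write('\n')

# afterwards the whole thing can be read back with read_json
df = pd.read_json('big_file.json', orient='records', lines=True)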
