PhantomJS exceeds 1GB memory usage and dies loading 50MB of json - python

I am attempting to download ~55MB of JSON data from a webpage with PhantomJS and Python on Windows 10.
The PhantomJS process dies with "Memory exhausted" upon reaching 1GB of memory usage.
The load is made by entering a username and password and then calling
myData = driver.page_source
on a page that contains just a simple header and the 55MB of text that makes up the JSON data.
It dies even if I'm not asking PhantomJS to do anything with it, just get the source.
If I load the page in Chrome it takes about a minute to load, and Chrome lists it as having loaded 54MB, exactly as expected.
The PhantomJS process takes about as long to reach 1GB of RAM usage and die.
This used to work perfectly, until recently when the data to be downloaded exceeded about 50MB.
Is there a way to stream the data directly to a file from PhantomJS, or some setting to keep it from ballooning to 20x the necessary RAM usage? (The computer has 16GB of RAM; the 1GB limit is apparently a known problem in PhantomJS that they won't fix.)
Is there an alternative, equally simple way of entering a username and password and grabbing some data that doesn't have this flaw? (And that does not pop up a browser window while working.)
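A common workaround is to skip the browser for the download itself: log in once with a requests session and stream the JSON straight to disk, so the 55MB body never has to live inside PhantomJS at all. This is only a minimal sketch; it assumes the login is an ordinary form POST and that the JSON page can be fetched directly once the session cookie is set, and the URLs and form field names below are placeholders.

# Sketch: authenticate with requests, then stream the JSON to a file
# in chunks so the whole body never sits in memory as one string.
# LOGIN_URL, DATA_URL and the form field names are assumptions.
import requests

LOGIN_URL = "https://example.com/login"   # placeholder
DATA_URL = "https://example.com/data"     # placeholder

with requests.Session() as session:
    # Log in once; the session object keeps the cookies afterwards.
    session.post(LOGIN_URL, data={"username": "me", "password": "secret"})

    with session.get(DATA_URL, stream=True) as resp:
        resp.raise_for_status()
        with open("data.json", "wb") as fh:
            for chunk in resp.iter_content(chunk_size=1 << 20):  # 1MB chunks
                fh.write(chunk)

If the page genuinely needs JavaScript to render, headless Chrome or Firefox through Selenium (e.g. options.add_argument("--headless")) avoids both the visible window and PhantomJS's 1GB ceiling.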

Related

Websites are taking a long time to load when scraping

I am unsure what tags to give this, but I am using Selenium in Python so I decided to start here. I am scraping a website thousands of times using Selenium and requests in Python. It starts fairly quickly, but around the 3,400th page load it slows down from around 0.1 seconds to 3 or 4 seconds. Any ideas on what is slowing the page loads? The program is being run on a very low-power Linode (1 shared CPU and 1GB of RAM). The CPU is pegged from the beginning, when it is still running fast, and from what I can tell it is not using all the RAM. I also gave it a 10GB swap. My internet download and upload speeds are above 200 MB/s. I was thinking the website hosts themselves are limiting me, but I don't know this stuff well enough to be sure.
Pretty sure it's the host. If they are limiting by your IP, you may want to use some proxies. If the website is on shared hosting or some other low-cost hosting, then proxies won't help.
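If it does turn out to be IP-based throttling, routing the requests through a proxy is simple enough. A minimal sketch with the requests library, where the proxy address is a placeholder:

# Sketch: send the scraping requests through an HTTP proxy.
# The proxy address below is a placeholder; rotate through a list
# of proxies if a single one gets throttled as well.
import requests

proxies = {
    "http": "http://203.0.113.10:8080",
    "https": "http://203.0.113.10:8080",
}

resp = requests.get("https://example.com/page", proxies=proxies, timeout=30)
print(resp.status_code)

If the slowdown is the host's own capacity rather than rate limiting, this will make no difference, which is a cheap way to tell the two cases apart.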

Managing Chrome memory usage with Selenium

I have a 4GB Raspberry Pi unit running a Selenium web scraper. Over the course of hours, Chrome's memory footprint eats up all the available memory and the entire bot crashes as a result. Is there a way for Selenium to automatically manage/reduce Chrome's memory usage? I've noticed that if I refresh the page, the memory heap is cleared.
If refreshing the page seems to work, you can just use the following code in a loop that runs every so often.
driver.navigate().refresh();
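That line is the Java/JavaScript binding syntax; in Python the equivalent call is simply driver.refresh(). A rough sketch of refreshing on a fixed interval, where the URL and the 10-minute interval are arbitrary:

# Sketch: reuse one driver and refresh the page periodically so
# Chrome's memory footprint gets cleared instead of growing for hours.
import time
from selenium import webdriver

driver = webdriver.Chrome()
driver.get("https://example.com")    # placeholder URL

while True:
    # ... do the scraping work against the current page here ...
    time.sleep(600)                  # arbitrary 10-minute interval
    driver.refresh()                 # Python equivalent of navigate().refresh()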

Python Selenium: invisible RAM usage until it hits 99% over time

EDIT: Fixed after implementing page refresh instead of opening new browsers.
I believe code isn't needed for this question, and since it's way too long I'll try to explain the problem without it.
I made a Selenium bot that checks a website for freshly added content. It simply starts chromedriver, checks the necessary page, quits the driver via driver.quit(), and repeats forever in a loop. But after about 24 hours, RAM usage climbs above 95% and simply locks the computer. Yet in the task manager there is nothing using the RAM, and RAM-cleaning programs can't get rid of the invisible usage either.
Now what I am wondering is: is it because of my endless cycle of turning the browser on and off, even though I am closing it with driver.quit(), or do I need to change the code to refresh the page instead of restarting the browser completely every time?
Just looking for ideas, not code. Thanks a lot.
Edit: The computer in the photo has 6GB of RAM, but it sits at 93% usage when nothing is using it. It gets like that over time; when the computer is restarted, RAM usage is at normal levels, like any idle computer. The app at the top is my bot.

I want to know how Django websites consume RAM or computer memory

I am about to develop a website, and I expect it to get a lot of traffic. It is in Python (Django). I wonder: if my web application uses 2MB of RAM for one service (for example, if I run it on my PC directly in a terminal, it consumes 2MB of RAM), then if I get 1000 users on my website at a particular time, will my website need 2000MB of RAM (2MB per user * 1000 users)? Does it work this way?
To test that, and to see what kind of increase in memory consumption you can expect, open a few incognito Chrome tabs and connect as different users. Then you can see whether the memory increases linearly with the number of users.
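To put numbers on it, you can also watch the server process's resident memory while you add sessions. A small sketch using psutil; the way it finds the Django process (matching manage.py runserver or gunicorn on the command line) is an assumption about how you run the server:

# Sketch: print the resident memory of the Django server process(es)
# every few seconds while you open more user sessions.
import time
import psutil

def django_processes():
    # Look for processes whose command line mentions the dev server or gunicorn.
    for proc in psutil.process_iter(["pid", "cmdline"]):
        cmdline = " ".join(proc.info["cmdline"] or [])
        if "manage.py runserver" in cmdline or "gunicorn" in cmdline:
            yield proc

while True:
    for proc in django_processes():
        rss_mb = proc.memory_info().rss / (1024 * 1024)
        print(f"pid {proc.pid}: {rss_mb:.1f} MB resident")
    time.sleep(5)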

Ubuntu server running out of memory when loading python app

I'm running a Python Flask app on an Ubuntu 14.04 server where I load some data, the two major things being:
Google News vectors, to be used with Word2vec (the GoogleNewsVec is about 4GB)
A 350MB JSON file with data
This all loads on my local Windows machine in ~5 minutes, and that machine has similar specs to the server (8GB RAM). The weird thing is that the two parts load fine separately. So when I do:
load_word2vec_model()
load_json_data()
It loads the model really quickly and then gets stuck at load_json_data():
[13:16:55] Loading model..
[13:17:13] Finished loading model.
[13:17:13] Loading scores..
[13:17:13] Loading scores dict from json file..
But when I do it in the reverse order:
load_json_data()
load_word2vec_model()
It gets stuck at loading the word2vec model:
[13:20:29] Loading scores..
[13:20:29] Loading scores dict from json file..
[13:22:42] Finished loading from json file.
[13:22:42] Finished loading scores.
[13:22:42] Loading model..
I do not get any Python error message. This leads me to believe that the server has somehow reached its maximum memory usage and will not load the entire model.
On my local Windows machine it does use up a lot of memory, but eventually it loads (in about 5 minutes total). Why does this not happen on the server? I've waited for an hour, but it never loads.
This is the htop output for the server:
The difference between your Windows machine and the Ubuntu server is most likely caused by the pagefile (Windows) / swappiness (Linux) configuration. To sum up, swapping is storing some parts of memory, preferably not-so-important stuff, to disk to make room for another process that is asking for memory.
Now, end-user targeted Windows machines come with a pagefile, i.e. the file that memory contents are written to, sized at around 75% of the memory size by default. But Ubuntu servers, AWS comes to mind, usually come with no swap partition/file and with swappiness, i.e. the likelihood that your memory will be swapped to disk, set to 0, i.e. not at all.
The solution is either setting up a swap file and a swappiness configuration, or throwing more memory at the problem. The former will make your application work just as it does on Windows; the latter will solve it for good.
Never mind: it seems you have swap enabled.
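For completeness, the RAM and swap situation can also be checked from inside the Python process before the big loads start, so it fails fast with a clear message instead of hanging. A small sketch using psutil; the 4GB threshold is arbitrary and just mirrors the size of the model:

# Sketch: report free RAM and swap, and bail out early if there is
# clearly not enough room to load the ~4GB word2vec model.
import psutil

mem = psutil.virtual_memory()
swap = psutil.swap_memory()

print(f"RAM: {mem.available / 2**30:.1f} GiB free of {mem.total / 2**30:.1f} GiB")
print(f"Swap: {swap.free / 2**30:.1f} GiB free of {swap.total / 2**30:.1f} GiB")

if mem.available + swap.free < 4 * 2**30:   # arbitrary threshold
    raise MemoryError("Not enough free RAM + swap to load the 4GB model")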
