Keeping numpy.load() data in memory regardless of re-running the code - python

I'm loading a huge file to process it:
file = numpy.load('path.txt')
... every time I change a single line of code I need to reload the file, which takes time. Is there a way to keep the file loaded in memory regardless of re-running the code? And how might loading the file with a different library, like pandas, differ?
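One minimal workaround sketch (assuming the data is actually a .npy array, which the question doesn't say): memory-map the file so each run only reads the parts it touches, or keep the interpreter alive between edits (e.g. an IPython/Jupyter session) so the loaded variable survives while only the changed code is re-run.
import numpy as np

# Sketch, assuming a .npy file: with mmap_mode the array is not read up front;
# only the slices you actually access are paged in from disk.
data = np.load('path.npy', mmap_mode='r')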

Related

Can a data copy have data issues if the copy is done using Python?

Will there be data issues if a file is being updated every 30 seconds and, at the same time, the file is being copied?
File type - CSV
I am copying the file from one ADL to another ADL (platform - Azure Databricks).
Copying the file using
dbutils.fs.cp(source, target)
I tried to find an answer for plain Python as well, but didn't find one.
If anyone has a way to perform this approach, a solution to this problem would be appreciated.
Can a data copy have data issues?
Probably NO. I tried to reproduce your scenario on my system: I created and triggered a pipeline that updates a file every 1 minute in Databricks while I continuously copied the file from one ADL to another. I didn't get any error; it keeps copying the file even while the file is being updated every 1 minute.
To find out whether the file is busy or not, I tried to fetch some data from the file: if it returns data, the file is available for work and is copied from one ADL to the other. If the file is busy, it goes to the else branch and prints Engaged.
path = "abfss://demo#demostoragepratik.dfs.core.windows.net/jsonvalue.csv"
if dbutils.fs.head(path, 25):
print("Available for work")
"dbutils.fs.cp("abfss://demo#demostoragepratik.dfs.core.windows.net/SampleCSV10000Rc.csv","abfss://demo#pratikdemostorage.dfs.core.windows.net/SampleCSV10000Rc16.csv")
else:
print("Engaged")

Efficient parallel pickle file reading

I'm running a program that analyzes pickle files, each pickle should be converted into a text file (of my own format).
In order to do that efficiently, I'm running N parallel python processes, using multiprocessing.Pool (N is a parameter).
Each process opens and reads a single file exclusively, converts it to text, and writes it to a separate directory (each pkl file is converted into a unique text file).
I'm running on a linux machine with N=multiprocessing.cpu_count() - 1.
The pkl processing takes much longer than expected. In total I'm analyzing 100K pkl files, each about 1 MB, and it takes about a day. The text conversion function (to_txt()) doesn't do any major calculations, just conversion to strings.
I suspect the inefficiency is caused by hitting the disk's I/O capacity (based on checking top on my Linux machine).
Any ideas about how to make this more efficient?
Each process's function is:
import pickle
from pathlib import Path
from typing import cast

def analyze(pkl_path: Path):
    with pkl_path.open("rb") as pkl_file:
        results = cast(Results, pickle.load(pkl_file))
    with pkl_path.with_suffix(".txt").open("w") as txt_file:
        txt_file.write(results.to_txt())
Something like this:
def analyze(pkl_path: Path):
    with pkl_path.open("rb") as pkl_file:
        file = pickle.load(pkl_file)
    results = cast(Results, file)
    inTxt = results.to_txt()
    with pkl_path.with_suffix(".txt").open("w") as txt_file:
        txt_file.write(inTxt)
This way, files are not kept open any longer than necessary, which lets other processes get on with their own file open/close work.
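For reference, a minimal sketch of the driver loop the question describes; the input directory name and chunksize are assumptions, not from the question:
import multiprocessing
from pathlib import Path

if __name__ == "__main__":
    pkl_paths = sorted(Path("pkl_data").glob("*.pkl"))  # hypothetical input dir
    n = multiprocessing.cpu_count() - 1
    with multiprocessing.Pool(n) as pool:
        # Each worker converts one file at a time; chunksize cuts IPC overhead.
        pool.map(analyze, pkl_paths, chunksize=64)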

How a slice of a file loaded with numpy.load gets into memory

If I want to load only a portion of a file using numpy.load, I use slicing:
np.load('myfile.npy')[start:end]
Does this guarantee that only this portion of the file, i.e. [start:end], is loaded into memory, or does it load the entire file first and then slice it?
Thanks,
That loads the whole thing. If you don't want to load the whole file, you can memory-map it and copy only the part you want:
part = numpy.load('myfile.npy', mmap_mode='r')[start:end].copy()
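To see the difference, a small self-contained demo (file name and array size are placeholders):
import numpy as np

# Create a sample file to slice: 10M float64 values, about 80 MB on disk.
np.save('myfile.npy', np.arange(10_000_000, dtype=np.float64))

# Plain load: the entire array is read into RAM, then sliced.
whole_slice = np.load('myfile.npy')[100:200]

# Memory-mapped load: only the pages backing [100:200] are actually read;
# .copy() detaches the result so it no longer references the mapping.
part = np.load('myfile.npy', mmap_mode='r')[100:200].copy()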

Sound files in PsychoPy won't load

I'm currently working on building an experiment in PsychoPy (v1.82.01 stand-alone). I started on the project several months ago with an older version of PsychoPy.
It worked great and I ran some pilot subjects. We have since adjusted the stimuli sounds, and now it won't run.
It looks like there is an issue with referencing the sound file, but I can’t figure out what’s going on.
I recreated the first part of the experiment with a single file rather than a loop so that it would be easier to debug. The sound file is referenced using:
study_sound = sound.Sound(u'2001-1.ogg', secs=-1)
When I run it, I get this output:
Running: /Users/dkbjornn/Desktop/Test/test.py
2016-04-29 14:05:43.164 python[65267:66229207] ApplePersistenceIgnoreState: Existing state will not be touched. New state will be written to /var/folders/9f/3kr6zwgd7rz95bcsfw41ynw40000gp/T/org.psychopy.PsychoPy2.savedState
0.3022 WARNING Movie2 stim could not be imported and won't be available
sndinfo: failed to open the file.
Traceback (most recent call last):
File "/Users/dkbjornn/Desktop/Test/test.py", line 84, in <module>
study_sound = sound.Sound(u'2001-1.ogg', secs=-1)
File "/Applications/PsychoPy2.app/Contents/Resources/lib/python2.7/psychopy/sound.py", line 380, in __init__
self.setSound(value=value, secs=secs, octave=octave, hamming=hamming)
File "/Applications/PsychoPy2.app/Contents/Resources/lib/python2.7/psychopy/sound.py", line 148, in setSound
self._setSndFromFile(value)
File "/Applications/PsychoPy2.app/Contents/Resources/lib/python2.7/psychopy/sound.py", line 472, in _setSndFromFile
start=self.startTime, stop=self.stopTime)
File "/Applications/PsychoPy2.app/Contents/Resources/lib/python2.7/pyolib/tables.py", line 1420, in setSound
saved data to u'/Users/dkbjornn/Desktop/Test/data/99_test_2016_Apr_29_1405_1.csv'
_size, _dur, _snd_sr, _snd_chnls, _format, _type = sndinfo(path)
TypeError: 'NoneType' object is not iterable
The important thing here is the sndinfo: failed to open the file. message. Most likely, it cannot find your file on disk. Check the following (a quick way to verify both points is shown after this list):
Is the file 2001-1.ogg in the same folder as your experiment, not in a subfolder? Or have you accidentally changed your path, e.g. using os.chdir?
Is it actually called 2001-1.ogg? Any differences in uppercase/lowercase, spaces, etc. all count.
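A minimal check, assuming the script runs from the experiment folder:
import os

# Print where Python is actually looking, and whether the file is visible there.
print(os.getcwd())
print(os.path.isfile('2001-1.ogg'))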
Alternatively, there's something in the particular way the .ogg was saved that causes the problem, even though the Sound class can read a large set of different sound codecs. Try exporting the sound file in other formats, e.g. .mp3 or .wav.

Memory error when using pandas read_csv

I am trying to do something fairly simple, reading a large csv file into a pandas dataframe.
data = pandas.read_csv(filepath, header=0, sep=DELIMITER, skiprows=2)
The code either fails with a MemoryError or just never finishes.
Memory usage in Task Manager stopped at 506 MB, and after 5 minutes of no change and no CPU activity in the process I stopped it.
I am using pandas version 0.11.0.
I am aware that there used to be a memory problem with the file parser, but according to http://wesmckinney.com/blog/?p=543 this should have been fixed.
The file I am trying to read is 366 MB; the code above works if I cut the file down to something short (25 MB).
It has also happened that I get a pop-up telling me that it can't write to address 0x1e0baf93...
Stacktrace:
Traceback (most recent call last):
  File "F:\QA ALM\Python\new WIM data\new WIM data\new_WIM_data.py", line 25, in <module>
    wimdata = pandas.read_csv(filepath, header = 0, sep = DELIMITER, skiprows = 2)
  File "C:\Program Files\Python\Anaconda\lib\site-packages\pandas\io\parsers.py", line 401, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "C:\Program Files\Python\Anaconda\lib\site-packages\pandas\io\parsers.py", line 216, in _read
    return parser.read()
  File "C:\Program Files\Python\Anaconda\lib\site-packages\pandas\io\parsers.py", line 643, in read
    df = DataFrame(col_dict, columns=columns, index=index)
  File "C:\Program Files\Python\Anaconda\lib\site-packages\pandas\core\frame.py", line 394, in __init__
    mgr = self._init_dict(data, index, columns, dtype=dtype)
  File "C:\Program Files\Python\Anaconda\lib\site-packages\pandas\core\frame.py", line 525, in _init_dict
    dtype=dtype)
  File "C:\Program Files\Python\Anaconda\lib\site-packages\pandas\core\frame.py", line 5338, in _arrays_to_mgr
    return create_block_manager_from_arrays(arrays, arr_names, axes)
  File "C:\Program Files\Python\Anaconda\lib\site-packages\pandas\core\internals.py", line 1820, in create_block_manager_from_arrays
    blocks = form_blocks(arrays, names, axes)
  File "C:\Program Files\Python\Anaconda\lib\site-packages\pandas\core\internals.py", line 1872, in form_blocks
    float_blocks = _multi_blockify(float_items, items)
  File "C:\Program Files\Python\Anaconda\lib\site-packages\pandas\core\internals.py", line 1930, in _multi_blockify
    block_items, values = _stack_arrays(list(tup_block), ref_items, dtype)
  File "C:\Program Files\Python\Anaconda\lib\site-packages\pandas\core\internals.py", line 1962, in _stack_arrays
    stacked = np.empty(shape, dtype=dtype)
MemoryError
Press any key to continue . . .
A bit of background - I am trying to convince people that Python can do the same as R. For this I am trying to replicate an R script that does
data <- read.table(paste(INPUTDIR,config[i,]$TOEXTRACT,sep=""), HASHEADER, DELIMITER, skip=2, fill=TRUE)
R not only manages to read the above file just fine, it even reads several of these files in a for loop (and then does some stuff with the data). If Python really does have a problem with files of that size, I might be fighting a losing battle...
Windows memory limitation
Memory errors happen a lot with Python when using the 32-bit version on Windows. This is because 32-bit processes only get 2 GB of memory to play with by default.
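A quick way to check which build you are running (a small probe, not from the original answer):
import struct

# Pointer size in bits: 32 for a 32-bit Python build, 64 for a 64-bit one.
print(struct.calcsize("P") * 8)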
Tricks for lowering memory usage
If you are not using 32-bit Python on Windows but are looking to improve your memory efficiency while reading csv files, there is a trick.
The pandas.read_csv function takes an option called dtype. This lets pandas know what types exist inside your csv data.
How this works
By default, pandas will try to guess what dtypes your csv file has. This is a very heavy operation, because while it is determining the dtype, it has to keep all the raw data as objects (strings) in memory.
Example
Let's say your csv looks like this:
name, age, birthday
Alice, 30, 1985-01-01
Bob, 35, 1980-01-01
Charlie, 25, 1990-01-01
This example is of course no problem to read into memory, but it's just an example.
If pandas were to read the above csv file without any dtype option, the ages would be stored as strings in memory until pandas had read enough lines of the csv file to make a qualified guess.
I think the default in pandas is to read 1,000,000 rows before guessing the dtype.
Solution
Specifying dtype={'age': int} as an option to .read_csv() lets pandas know that age should be interpreted as a number. This saves you lots of memory.
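A minimal sketch of that call against the example csv above ('people.csv' is a placeholder file name):
import pandas as pd

df = pd.read_csv(
    'people.csv',
    dtype={'age': int},        # parse age directly as integers, no guessing
    skipinitialspace=True,     # the example header has spaces after the commas
    parse_dates=['birthday'],  # optional: parse the date column as well
)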
Problem with corrupt data
However, if your csv file were corrupted, like this:
name, age, birthday
Alice, 30, 1985-01-01
Bob, 35, 1980-01-01
Charlie, 25, 1990-01-01
Dennis, 40+, None-Ur-Bz
Then specifying dtype={'age': int} will break the .read_csv() command, because it cannot cast "40+" to int. So sanitize your data carefully!
Here you can see how the memory usage of a pandas dataframe is a lot higher when floats are kept as strings:
Try it yourself
import resource
import numpy as np
import pandas as pd

# Run each block in a fresh interpreter: ru_maxrss reports the peak RSS,
# so a second measurement in the same process can never come out lower.
df = pd.DataFrame(np.random.choice(['1.0', '0.6666667', '150000.1'], (100000, 10)))
print(resource.getrusage(resource.RUSAGE_SELF).ru_maxrss)
# 224544 (~224 MB)

df = pd.DataFrame(np.random.choice([1.0, 0.6666667, 150000.1], (100000, 10)))
print(resource.getrusage(resource.RUSAGE_SELF).ru_maxrss)
# 79560 (~79 MB)
I had the same memory problem with a simple read of a tab-delimited text file around 1 GB in size (over 5.5 million records), and this solved the memory problem:
df = pd.read_csv(myfile, sep='\t')                    # didn't work, memory error
df = pd.read_csv(myfile, sep='\t', low_memory=False)  # worked fine, in less than 30 seconds
Spyder 3.2.3
Python 2.7.13 64bits
I tried chunksize while reading a big CSV file:
reader = pd.read_csv(filePath, chunksize=1000000, low_memory=False, header=0)
With chunksize, read_csv returns an iterator of DataFrame chunks. We can iterate over it and write/append each chunk to a new csv, or perform any other operation:
for chunk in reader:
    print(chunk.columns)
    print("Chunk -> File process")
    with open(destination, 'a') as f:
        chunk.to_csv(f, header=False, sep='\t', index=False)
    print("Chunk appended to the file")
I use pandas on my Linux box and faced many memory leaks that only got resolved after upgrading pandas to the latest version by cloning it from GitHub.
I encountered this issue as well when I was running in a virtual machine, or somewhere else where the memory is strictly limited. It has nothing to do with pandas or numpy or csv; it will always happen if you try to use more memory than you are allowed to, and not only in Python.
The only chance you have is what you already tried: chop the big thing down into smaller pieces which fit into memory.
If you ever asked yourself what MapReduce is all about, you just found out by yourself: MapReduce would distribute the chunks over many machines, while you process the chunks on one machine, one after another.
What you found out with the concatenation of the chunk files might be an issue indeed; maybe some copying is needed in that operation. In the end this may save you in your current situation, but if your csv gets a little bit larger you might run against that wall again...
It also could be that pandas is smart enough to actually only load the individual data chunks into memory when you do something with them, like concatenating them into a big df.
Several things you can try:
Don't load all the data at once, but split it into pieces.
As far as I know, hdf5 is able to do this chunking automatically and only loads the part your program is currently working on (see the sketch after this list).
Check whether the types are OK: a string '0.111111' needs more memory than a float.
Think about what you actually need: if the address is there as a string, you might not need it for numerical analysis...
A database can help with accessing and loading only the parts you actually need (e.g. only the 1% active users).
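A hedged sketch of the hdf5 idea using pandas' HDFStore (requires PyTables; all names are placeholders, not from the question):
import numpy as np
import pandas as pd

# Write once to HDF5 with a queryable column...
df = pd.DataFrame({'user_id': np.arange(100_000), 'value': np.random.rand(100_000)})
with pd.HDFStore('big.h5') as store:
    store.append('data', df, data_columns=['user_id'])

# ...then load only the rows you actually need, not the whole table.
active = pd.read_hdf('big.h5', 'data', where='user_id < 1000')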
There is no error with Pandas 0.12.0 and NumPy 1.8.0.
I have managed to create a big DataFrame, save it to a csv file, and then successfully read it back. Please see the example here. The size of the file is 554 MB (it even worked for a 1.1 GB file, which took longer; to generate a 1.1 GB file, use a frequency of 30 seconds). I do have 4 GB of RAM available, though.
My suggestion is to try updating pandas. Another thing that could be useful is to try running your script from the command line, because for R you are not using Visual Studio (this was already suggested in the comments to your question), hence it has more resources available.
Add these:
ratings = pd.read_csv(..., low_memory=False, memory_map=True)
My memory usage with these two options:
# 319,082,496
Without these two:
# 349,110,272
Although this is a workaround rather than a fix, I'd try converting that CSV to JSON (should be trivial) and using the read_json method instead - I've been writing and reading sizable JSON dataframes (hundreds of MB) in pandas this way without any problem at all.
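A hedged sketch of that conversion, done in chunks so the csv never has to fit in memory at once (file names and chunk size are placeholders):
import pandas as pd

# Convert the csv to line-delimited JSON chunk by chunk...
with open('big.json', 'w') as out:
    for chunk in pd.read_csv('big.csv', chunksize=100_000):
        chunk.to_json(out, orient='records', lines=True)
        out.write('\n')  # lines=True writes no trailing newline between chunks

# ...then read it back in one go on later runs.
df = pd.read_json('big.json', orient='records', lines=True)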
