Python crashes when loading a very large file

I get the error "Fatal Python error: _Py_CheckRecursiveCall: Cannot recover from stack overflow" when I load a very large .txt file into memory. With a small text file everything works perfectly.
I want to read each line of my text file, split it on "-", and put each [0] in one list and each [1] in another list. My text file looks like this:
aaaaa-bbbb
cccc-ddddddd
eeeee-fffff
So:
list1 = ["aaaaa", "cccc", "eeeee"]
list2 = ["bbbb", "ddddddd", "fffff"]

Try executing the code below in a terminal; it may run smoothly:

import pandas as pd
df = pd.read_csv('file_name',sep="-", header=None)
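If that works, the two lists from the question can be taken straight from the resulting DataFrame; a small sketch (the integer column labels 0 and 1 come from header=None):
list1 = df[0].tolist()
list2 = df[1].tolist()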

Try breaking the file into smaller sub-files, or process it in chunks of some number of lines.
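If you only need the two lists, you can also avoid loading the whole file at once by reading it line by line; a minimal sketch, assuming every line contains a "-" (the file name is a placeholder):
list1 = []
list2 = []
with open("file_name") as f:                             # read one line at a time
    for line in f:
        left, right = line.rstrip("\n").split("-", 1)    # split on the first "-"
        list1.append(left)
        list2.append(right)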

When I import an array from another file, do I get just the data, or does the array get rebuilt the way the original file built it?

Sorry if the question is not well formulated; I will reformulate it if necessary.
I have a file with an array that I fill with data from an online JSON database, and I import this array into another file to use its data.
#file1
import json
from urllib.request import urlopen

response = urlopen(url1)
a = []
data = json.loads(response.read())
for i in range(len(data)):
    a.append(data[i]['name'])
#file2
from file1 import a
'''do something with "a"'''
Does importing the array mean I'm filling the array each time I call it in file2?
If that is the case, what can I do to just keep the data from the array without "building" it each time I call it?
If you saved a to a file and then read it back, you would not need to rebuild a -- you could just open the file. For example, here's one way to open a text file and get the text from it:
# set a variable to be the open file
OpenFile = open(file_path, "r")
# set a variable to be everything read from the file, then you can act on that variable
file_guts = OpenFile.read()
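Applied to the question, one option (a sketch; the file name names.json and the use of the json module are my choices, not from the original post) is to save a to disk once in file1 and load it back in file2 instead of importing it:
# file1: after building the list, save it once
import json

with open("names.json", "w") as f:
    json.dump(a, f)

# file2: load the saved list instead of importing file1
import json

with open("names.json") as f:
    a = json.load(f)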
From the Modules section of the Python docs - link - you can read:
When you run a Python module with
python fibo.py <arguments>
the code in the module will be executed, just as if you imported it
This means that importing a module behaves the same as running it as a regular Python script, unless you use the __name__ check mentioned right after this quotation.
Also, if you think about it: you are opening something, reading from it, and then doing some operations on it. How can you be sure that the content you are reading now is the same as the content you read the first time?
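For reference, the __name__ check mentioned above looks like this (a minimal example, not taken from the question's files):
# file1
a = []
print("this runs on both `import file1` and `python file1.py`")

if __name__ == "__main__":
    # this block runs only with `python file1.py`, not on `import file1`
    print("running file1 as a script")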

Printing top few lines of a large JSON file in Python

I have a JSON file whose size is about 5 GB. I know neither how the JSON file is structured nor the names of the roots in the file. I'm not able to load the file on my local machine because of its size, so I'll be working on high-performance servers.
I need to load the file in Python and print the first N lines to understand the structure and proceed further with data extraction. Is there a way to load and print the first few lines of a JSON file in Python?
If you want to do it in Python, you can do this:
N = 3
with open("data.json") as f:
    for i in range(0, N):
        print(f.readline(), end='')
You can use the command head to display the first N lines of the file, to get a sample of the JSON and see how it is structured, and then use this sample to work on your data extraction.
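For example, run on the server (the file name is a placeholder):
head -n 20 data.json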

Saving multiple items to HDFS with (spark, python, pyspark, jupyter)

I'm used to programming in Python. My company now has a Hadoop cluster with Jupyter installed. Until now I have never used Spark / PySpark for anything.
I am able to load files from HDFS as easily as this:
text_file = sc.textFile("/user/myname/student_grades.txt")
And I'm able to write output like this:
text_file.saveAsTextFile("/user/myname/student_grades2.txt")
The thing I'm trying to achieve is to use a simple for loop to read text files one by one and write their content into one HDFS file. So I tried this:
list = ['text1.txt', 'text2.txt', 'text3.txt', 'text4.txt']
for i in list:
    text_file = sc.textFile("/user/myname/" + i)
    text_file.saveAsTextFile("/user/myname/all.txt")
So this works for the first element of the list, but then gives me this error message:
Py4JJavaError: An error occurred while calling o714.saveAsTextFile.
: org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory
XXXXXXXX/user/myname/all.txt already exists
To avoid confusion I "blurred" out the IP address with XXXXXXXX.
What is the right way to do this?
I will have tons of datasets (like 'text1', 'text2', ...) and want to apply a Python function to each of them before saving them to HDFS, but I would like to have the results all together in "one" output file.
Thanks a lot!
MG
EDIT:
It seems that my final goal was not really clear. I need to apply a function to each text file separately, and then I want to append the output to the existing output directory. Something like this:
for i in list:
    text_file = sc.textFile("/user/myname/" + i)
    text_file = really_cool_python_function(text_file)
    text_file.saveAsTextFile("/user/myname/all.txt")
I wanted to post this as a comment but could not do so, as I do not have enough reputation.
You have to convert your RDD to a DataFrame and then write it in append mode. To convert an RDD to a DataFrame, please look at this answer:
https://stackoverflow.com/a/39705464/3287419
or this link: http://spark.apache.org/docs/latest/sql-programming-guide.html
To save a DataFrame in append mode, the link below may be useful:
http://spark.apache.org/docs/latest/sql-programming-guide.html#save-modes
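A rough sketch of that idea (untested; the single column name "line", the output path, and the availability of an active SparkSession for toDF are assumptions on my part):
for i in list:
    rdd = sc.textFile("/user/myname/" + i)
    df = rdd.map(lambda line: (line,)).toDF(["line"])         # one string column
    df.write.mode("append").text("/user/myname/all_output")   # append into one output directory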
Almost the same question is asked here as well: Spark: Saving RDD in an already existing path in HDFS. But the answer provided is for Scala; I hope something similar can be done in Python too.
There is yet another (but ugly) approach. Convert your RDD to a string; let the resulting string be resultString. Use subprocess to append that string to the destination file, i.e.
subprocess.call("echo "+resultString+" | hdfs dfs -appendToFile - <destination>", shell=True)
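A slightly safer variant of the same idea (a sketch; it relies on hdfs dfs -appendToFile reading from stdin via "-", exactly as in the command above) pipes the string instead of building a shell command:
import subprocess

proc = subprocess.Popen(
    ["hdfs", "dfs", "-appendToFile", "-", "/user/myname/all.txt"],  # "-" means read from stdin
    stdin=subprocess.PIPE,
)
proc.communicate(resultString.encode())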
You can read multiple files and save them in one go with:
textfile = sc.textFile(','.join(['/user/myname/'+f for f in list]))
textfile.saveAsTextFile('/user/myname/all')
You will get all the part files within the output directory.
If the text files all have the same schema, you could use Hive to read the whole folder as a single table, and directly write that output.
I would try this; it should be fine:
list = ['text1.txt', 'text2.txt', 'text3.txt', 'text4.txt']
for i in list:
    text_file = sc.textFile("/user/myname/" + i)
    text_file.saveAsTextFile(f"/user/myname/{i}")

When I write to a CSV file, the stored data is deleted and replaced with the new data

I am trying to write to a CSV file, but when I check the file, only the last set of data is displayed. How can this be fixed?
My original code is:
highscores=open("Highscores.csv","w")
toPrint=(name+","+score+"\n")
for z in toPrint:
    highscores.write(z)
highscores.close()
and I also tried this:
toPrint=[]
output=(name+","+score+"\n")
toPrint.append(output)
for z in toPrint:
    highscores.write(z)
highscores.close()
You need to open the file in append mode instead of write mode. Your code stays the same; only the following line changes:
highscores=open("Highscores.csv","a")
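For example, reusing the variables from the question (a sketch; the with block also closes the file for you):
with open("Highscores.csv", "a") as highscores:   # "a" appends instead of overwriting
    highscores.write(name + "," + score + "\n")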

Writing output to a file in python

I am unable to write all values of the output to a file. Kindly help.
import numpy as np
theta=10
sigma=np.linspace(0,10,300)
Re=np.linspace(5,100,300)
file = open("New values sigma5.txt", "w")
for i in np.arange(0,300):
    mu=np.sqrt(Re[i]*sigma)
    A=(mu-1)*np.exp(mu)+(mu+1)*np.exp(-mu)
    B=2*mu*(theta-1)
    C=(A/B)
    D1=np.exp(mu)/2*(mu+sigma)
    D2=np.exp(-mu)/2*(mu-sigma)
    D3=mu**2
    D4=np.exp(-sigma)
    D5=sigma
    D6=mu**2-sigma**2
    D7=D3*D4
    D8=D5*D6
    H=D7/D8
    D9=(1/sigma)
    D=D1-D2+H-D9
    K1=C-D
    K2=np.delete(K1,0)
    K3=np.nonzero(K2>0)
    K33=np.array(K3)
    K4=np.shape(K3)
    K5=len(K33.T)
    K6=K5
    K7=sigma[K6]
    K77=np.array(K7)
    print K77
file.write(K77)
print(K77)
file.close()
The output is given by K77. With the present form of the code, I only see the last value of K77 in the file; I don't see the other ones.
You may want to append the data. Try opening the file in append mode, using "a" instead of "w" in
file = open("New values sigma5.txt", "w")
Currently you are overwriting the file content; with append mode, the new data is appended to the file.
The other problem I see is that you probably want to save data to the file during every iteration, so file.write(K77) should be inside the for loop.
Assuming that you want to capture each of the 300 values of K77, you need to change
    ...
    K77=np.array(K7)
    print K77
file.write(K77)
....
to (notice the different indentation, which moves the write inside the loop)
    ...
    K77=np.array(K7)
    print K77
    file.write(K77)
....
This will write each value to the file on its own line.
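Note that file.write() expects a string, so it is clearer to convert the NumPy value explicitly and add a newline; a minimal sketch of that pattern (variable names taken from the question, computation elided):
file = open("New values sigma5.txt", "w")
for i in np.arange(0, 300):
    # ... compute K77 as above ...
    file.write(str(K77) + "\n")   # write one value per line, converted to text
file.close()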
