Pyspark Java.lang.OutOfMemoryError: Java heap space - python

I am solving a problem using Spark running on my local machine.
I am reading a Parquet file from the local disk and storing it in a DataFrame.
import pyspark
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
spark = SparkSession.builder\
    .getOrCreate()
content = spark.read.parquet('./files/file')
The content DataFrame contains around 500k rows (columns EMPLOYEE_ID and MANAGER_ID), for example:
| 100| 0|
| 101| 100|
| 102| 100|
| 103| 100|
| 104| 100|
| 105| 100|
| 106| 101|
| 101| 101|
| 101| 101|
| 101| 101|
| 101| 102|
| 101| 102|
. .
. .
. .
I wrote this code to assign each EMPLOYEE_ID an EMPLOYEE_LEVEL according to its position in the hierarchy.
content_df = content.withColumn("EMPLOYEE_LEVEL", when(col("MANAGER_ID") == 0, 1).otherwise(lit('')))
level_df = content_df.select("*").filter("Level = 1")
level = 1
while True:
    ldf = level_df
    temp_df = content_df.join(
        ldf,
        ((ldf["EMPLOYEE_LEVEL"] == level) &
         (ldf["EMPLOYEE_ID"] == content_df["MANAGER_ID"])),
        "left")  # the rest of this chained call is cut off in the post
    if temp_df.count() == 0:
        break
    level_df = level_df.union(temp_df)
    level += 1
It runs, but execution is very slow, and after some time it fails with this error:
Py4JJavaError: An error occurred while calling o383.count.
: java.lang.OutOfMemoryError: Java heap space
at scala.collection.immutable.List.$colon$colon(List.scala:117)
at scala.collection.immutable.List.$plus$colon(List.scala:220)
at org.apache.spark.sql.catalyst.expressions.String2TrimExpression.children(stringExpressions.scala:816)
at org.apache.spark.sql.catalyst.expressions.String2TrimExpression.children$(stringExpressions.scala:816)
at org.apache.spark.sql.catalyst.expressions.StringTrim.children(stringExpressions.scala:948)
at org.apache.spark.sql.catalyst.trees.TreeNode.withNewChildren(TreeNode.scala:351)
at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:595)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:486)
at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$3(TreeNode.scala:486)
at org.apache.spark.sql.catalyst.trees.TreeNode$$Lambda$1822/0x0000000100d21040.apply(Unknown Source)
at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286)
at scala.collection.TraversableLike$$Lambda$61/0x00000001001d2040.apply(Unknown Source)
at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:595)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:486)
at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$3(TreeNode.scala:486)
at org.apache.spark.sql.catalyst.trees.TreeNode$$Lambda$1822/0x0000000100d21040.apply(Unknown Source)
at org.apache.spark.sql.catalyst.trees.BinaryLike.mapChildren(TreeNode.scala:1148)
at org.apache.spark.sql.catalyst.trees.BinaryLike.mapChildren$(TreeNode.scala:1147)
at org.apache.spark.sql.catalyst.expressions.BinaryExpression.mapChildren(Expression.scala:555)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:486)
at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$3(TreeNode.scala:486)
at org.apache.spark.sql.catalyst.trees.TreeNode$$Lambda$1822/0x0000000100d21040.apply(Unknown Source)
at org.apache.spark.sql.catalyst.trees.UnaryLike.mapChildren(TreeNode.scala:1122)
at org.apache.spark.sql.catalyst.trees.UnaryLike.mapChildren$(TreeNode.scala:1121)
at org.apache.spark.sql.catalyst.expressions.UnaryExpression.mapChildren(Expression.scala:467)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:486)
I tried many solutions, including increasing driver and executor memory and using cache() and persist() on the DataFrame, but none of them worked for me.
I am using Spark 3.2.1.
Any help will be appreciated.
Thank you.

I figured out the problem. This error is related to Spark's DAG mechanism: Spark uses the DAG lineage to track a series of transformations, and when an algorithm iterates, the lineage can grow quickly and hit the memory limit. Breaking the lineage is therefore necessary when implementing iterative algorithms.
There are mainly two ways to do this: 1. add a checkpoint; 2. recreate the DataFrame.
I modified my code as shown below. It just adds a checkpoint to break the lineage, and that works for me.
from pyspark.sql import Window, functions as F
epoch_cnt = 0
while True:
    print('cached df', len(spark.sparkContext._jsc.getPersistentRDDs().items()))
    # source nodes that do not belong to any group yet
    singer_pairs_undirected_ungrouped = singer_pairs_undirected.join(
        old_song_group_kernel,
        on=singer_pairs_undirected['src'] == old_song_group_kernel['id'],
        how='left').filter(F.col('id').isNull()) \
        .select('src', 'dst')
    windowSpec = Window.partitionBy("src").orderBy(F.col("song_group_id_cnt").desc())
    # for each ungrouped node, keep the group that occurs most often among its neighbours
    singer_pairs_vote = singer_pairs_undirected_ungrouped.join(
        old_song_group_kernel,
        on=singer_pairs_undirected_ungrouped['dst'] == old_song_group_kernel['id'],
        how='inner') \
        .groupBy('src', 'song_group_id') \
        .agg(F.count('song_group_id').alias('song_group_id_cnt')) \
        .withColumn('song_group_id_cnt_rnk', F.row_number().over(windowSpec)) \
        .filter(F.col('song_group_id_cnt_rnk') == 1)
    singer_pairs_vote_output = singer_pairs_vote.select('src', 'song_group_id') \
        .withColumnRenamed('src', 'id')
    # checkpoint() breaks the lineage so the plan does not keep growing
    new_song_group_kernel = old_song_group_kernel.union(singer_pairs_vote_output) \
        .select('id', 'song_group_id').dropDuplicates().persist().checkpoint()
    current_kernel_cnt = new_song_group_kernel.count()
    old_song_group_kernel = new_song_group_kernel
    epoch_cnt += 1
    print('epoch rounds: ', epoch_cnt)
    print('previous kernel count: ', previous_kernel_cnt)
    print('current kernel count: ', current_kernel_cnt)
    if current_kernel_cnt <= previous_kernel_cnt:
        print('Iteration done !')
        break
    previous_kernel_cnt = current_kernel_cnt
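The two ways of breaking the lineage can be wrapped in small helpers. This is only a minimal sketch, not the code from the job above: the names recreate, iterate and checkpoint_every are made up for illustration, step stands for whatever transformation one iteration applies, and checkpoint() assumes spark.sparkContext.setCheckpointDir(...) has already been called (localCheckpoint() needs no directory but is less fault tolerant).

```python
def recreate(spark, df):
    """Rebuild df from its RDD and schema so the new query plan starts fresh."""
    return spark.createDataFrame(df.rdd, df.schema)


def iterate(df, step, max_iters, checkpoint_every=1, local=False):
    """Apply step(df) repeatedly, truncating the lineage every few rounds.

    Stops early once the row count no longer grows.
    """
    previous_cnt = df.count()
    for i in range(1, max_iters + 1):
        df = step(df)
        if i % checkpoint_every == 0:
            # Both calls are eager by default and return a new DataFrame
            # whose plan no longer contains the earlier transformations.
            df = df.localCheckpoint() if local else df.checkpoint()
        current_cnt = df.count()
        if current_cnt <= previous_cnt:
            break
        previous_cnt = current_cnt
    return df
```

Checkpointing on every round is the simplest choice; a larger checkpoint_every trades a longer plan for fewer writes to the checkpoint directory.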


Pyspark DataFrame loop

I am new to Python and DataFrames. I am writing Python code to run an ETL job in AWS Glue. Please find the code snippet below.
test_DyF = glueContext.create_dynamic_frame.from_catalog(database="teststoragedb", table_name="testtestfile_csv")
test_dataframe = test_DyF.select_fields(['empid','name']).toDF()
The test_dataframe above is of type pyspark.sql.dataframe.DataFrame.
Now I need to loop through test_dataframe. As far as I can see, the only options are collect and toLocalIterator. Please find the sample code below:
for row_val in test_dataframe.collect():
But both of these methods are very slow and inefficient. I cannot use pandas as it is not supported by AWS Glue.
Please find the steps I am doing
source information:
productid|matchval|similar product|similar product matchval
product A|100|product X|100
product A|101|product Y|101
product B|100|product X|100
product C|102|product Z|102
expected result:
product |similar products
product A|product X, product Y
product B|product X
product C|product Z
This is the logic I am implementing:
1. Get a distinct DataFrame of productid values from the source.
2. Loop through this distinct DataFrame:
a) get the list of matchval values for the product from the source
b) identify the similar products based on matchval filters
c) loop through to build the concatenated string ---> this loop, which uses rdd.collect, is what hurts performance
Can you please suggest a better way to do this?
Please elaborate on the logic you want to implement. Looping over a DataFrame can be done with a SQL approach, or you can follow the RDD approach below:
def my_function(each_record):
    # loop through for each command.
    pass
I added the following code based on your input:
df = spark.read.csv("/mylocation/61250775.csv", header=True, inferSchema=True, sep="|")
seq = ['product X','product Y','product Z']
df2 = df.groupBy("productid").pivot("similar_product",seq).count()
|productid|product X|product Y|product Z|
|product B| 1| null| null|
|product A| 1| 1| null|
|product C| null| null| 1|
The final approach, which matches your requirement:
df = spark.read.csv("/mylocation/61250775.csv", header=True, inferSchema=True, sep="|")
>>> df.printSchema()
|-- id: string (nullable = true)
|-- matchval1: integer (nullable = true)
|-- similar: string (nullable = true)
|-- matchval3: integer (nullable = true)
from pyspark.sql.functions import col, collect_list, concat_ws
# gather the similar products of each id into a list, then join the list into one string
dfx = df.groupBy("id").agg(concat_ws(",", collect_list("similar")).alias("Similar_Items")).select(col("id"), col("Similar_Items"))
| id| Similar_Items|
|product B| product X|
|product A|product X,product Y|
|product C| product Z|
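To sanity-check the expected output on a small sample without a Spark session, the same grouping can be written in plain Python. This is only an illustrative sketch: rows reuses the sample data from the question, and group_similar is a hypothetical helper, not part of PySpark or Glue.

```python
from collections import defaultdict

# Sample rows from the question:
# (productid, matchval, similar product, similar product matchval)
rows = [
    ("product A", 100, "product X", 100),
    ("product A", 101, "product Y", 101),
    ("product B", 100, "product X", 100),
    ("product C", 102, "product Z", 102),
]


def group_similar(rows, sep=","):
    """Collect the similar products per product id and join them into one string."""
    grouped = defaultdict(list)
    for product_id, _matchval, similar, _similar_matchval in rows:
        grouped[product_id].append(similar)
    return {pid: sep.join(items) for pid, items in grouped.items()}


result = group_similar(rows)
for product_id, similar_items in result.items():
    print(product_id, "|", similar_items)
```

Unlike this sketch, collect_list does not guarantee the order of the collected elements after a shuffle, so sort the list first if the order of the joined string matters.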
You can also use the Map class. In my case, I was iterating through the data and calculating a hash for the full row.
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
import hashlib
## @params: [JOB_NAME]
args = getResolvedOptions(sys.argv, ['JOB_NAME'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)
## @type: DataSource
## @args: [database = "load-test", table_name = "table_test", transformation_ctx = "datasource0"]
## @return: datasource0
## @inputs: []
datasource0 = glueContext.create_dynamic_frame.from_catalog(database = "load-test", table_name = "table_test", transformation_ctx = "datasource0")
def hash_calculation(rec):
    md5 = hashlib.md5()
    rec["hash"] = md5.hexdigest()
    print("looping the recs")
    return rec
mapped_dyF = Map.apply(frame = datasource0, f = hash_calculation)
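The function passed to Map.apply is called with one record at a time and returns the modified record. A minimal sketch of a full-row hash follows. It treats the record as a plain dict, and serializing the fields as key-sorted JSON before hashing is an assumption made for illustration, not something the snippet above specifies.

```python
import hashlib
import json


def hash_calculation(rec):
    """Add an MD5 hex digest of all fields in rec under the key 'hash'."""
    # Sort the keys so the same field values always give the same digest;
    # default=str covers values JSON cannot encode directly (e.g. dates).
    payload = json.dumps(rec, sort_keys=True, default=str)
    rec["hash"] = hashlib.md5(payload.encode("utf-8")).hexdigest()
    return rec


record = hash_calculation({"empid": 1, "name": "alice"})
print(record["hash"])
```

Because the keys are sorted first, two records with the same values hash identically regardless of field order.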

SQLAlchemy aliased column "type object 'MiscUnit' has no attribute 'codeLabel'"

I am trying to get a row from my database, which contains multiple columns which are each paired with a unit id column, like so:
1| | 300| 1| 200| 1| 1000| 4|
2| 484| 300| 1| 200| 1| 1000| 4|
To do so, I am trying to alias the various unit columns while querying them in SQLAlchemy:
diesel_engine_installed_power_MiscUnit = aliased(MiscUnit)
pv_installed_power_MiscUnit = aliased(MiscUnit)
battery_capacity_MiscUnit = aliased(MiscUnit)
mg_res = session.query(ProcRun, ProcMemoGridInput, diesel_engine_installed_power_MiscUnit, pv_installed_power_MiscUnit, battery_capacity_MiscUnit). \
). \
filter( == ProcMemoGridInput.run_id). \
filter( == 484). \
filter(ProcMemoGridInput.diesel_engine_installed_power_unit_id == \
filter(ProcMemoGridInput.pv_installed_power_unit_id == \
filter(ProcMemoGridInput.battery_capacity_unit_id == \
It is based on this solution:
Usage of "aliased" in SQLAlchemy ORM
But it raises AttributeError: type object 'MiscUnit' has no attribute 'codeLabel'. I don't understand what the difference is. As far as I can tell, this is the same process for aliasing the MiscUnit ORM object.

Unable to use saved model as starting point for training Baselines' MlpPolicy?

I'm currently using code from OpenAI baselines to train a model, using the following code in my
from baselines.common import tf_util as U
import tensorflow as tf
import gym, logging
from visak_dartdeepmimic import VisakDartDeepMimicArgParse
def train(env, initial_params_path,
          save_interval, out_prefix, num_timesteps, num_cpus):
    from baselines.ppo1 import mlp_policy, pposgd_simple
    sess = U.make_session(num_cpu=num_cpus).__enter__()
    def policy_fn(name, ob_space, ac_space):
        print("Policy with name: ", name)
        policy = mlp_policy.MlpPolicy(name=name, ob_space=ob_space, ac_space=ac_space,
                                      hid_size=64, num_hid_layers=2)
        saver = tf.train.Saver()
        if initial_params_path is not None:
            print("Tried to restore from ", initial_params_path)
            saver.restore(tf.get_default_session(), initial_params_path)
        return policy
    def callback_fn(local_vars, global_vars):
        iters = local_vars["iters_so_far"]
        saver = tf.train.Saver()
        if iters % save_interval == 0:
            saver.save(tf.get_default_session(), out_prefix + str(iters))
    pposgd_simple.learn(env, policy_fn,
                        clip_param=0.2, entcoeff=0.0,
                        optim_epochs=10, optim_stepsize=3e-4, optim_batchsize=64,
                        gamma=1.0, lam=0.95, schedule='linear',
Which is based off of the code that OpenAI itself provides in the baselines repository
This works fine, except that I get some pretty weird looking learning curves which I suspect are due to some hyperparameters passed to the learn function which cause performance to decay / high variance as things go on (though I don't know for certain)
Anyways, to confirm this hypothesis I'd like to retrain the model, but not from scratch: I'd like to start from a high point, say iteration 1600, for which I have a saved model lying around (having saved it with saver.save in callback_fn).
So now I call the train function, but this time I provide it with an initial_params_path pointing to the save prefix for iteration 1600. By my understanding, the call to saver.restore in policy_fn should "reset" the model to where it was at iteration 1600 (and I've confirmed that the load routine runs using the print statement).
However, in practice I find that it's almost like nothing gets loaded. For instance, if I got statistics like
| EpLenMean | 74.2 |
| EpRewMean | 38.7 |
| EpThisIter | 209 |
| EpisodesSoFar | 662438 |
| TimeElapsed | 2.15e+04 |
| TimestepsSoFar | 26230266 |
| ev_tdlam_before | 0.95 |
| loss_ent | 2.7640965 |
| loss_kl | 0.09064759 |
| loss_pol_entpen | 0.0 |
| loss_pol_surr | -0.048767302 |
| loss_vf_loss | 3.8620138 |
for iteration 1600, then for iteration 1 of the new trial (ostensibly using 1600's parameters as a starting point), I get something like
| EpLenMean | 2.12 |
| EpRewMean | 0.486 |
| EpThisIter | 7676 |
| EpisodesSoFar | 7676 |
| TimeElapsed | 12.3 |
| TimestepsSoFar | 16381 |
| ev_tdlam_before | -4.47 |
| loss_ent | 45.355236 |
| loss_kl | 0.016298374 |
| loss_pol_entpen | 0.0 |
| loss_pol_surr | -0.039200217 |
| loss_vf_loss | 0.043219414 |
which is back to square one (this is around where my models trained from scratch start)
The funny thing is I know that the model is being saved properly at least, since I can actually replay it using
from baselines.common import tf_util as U
from baselines.ppo1 import mlp_policy, pposgd_simple
import numpy as np
import tensorflow as tf
class PolicyLoaderAgent(object):
"""The world's simplest agent!"""
def __init__(self, param_path, obs_space, action_space):
self.action_space = action_space = mlp_policy.MlpPolicy("pi", obs_space, action_space,
hid_size = 64, num_hid_layers=2)
saver = tf.train.Saver()
saver.restore(tf.get_default_session(), param_path)
def act(self, observation, reward, done):
action2, unknown =, observation)
return action2
if __name__ == "__main__":
parser = VisakDartDeepMimicArgParse()
parser.add_argument("--params-prefix", required=True, type=str)
args = parser.parse_args()
env = parser.get_env()
agent = PolicyLoaderAgent(args.params_prefix, env.observation_space, env.action_space)
while True:
ob = env.reset(0, pos_stdv=0, vel_stdv=0)
done = False
while not done:
action = agent.act(ob, reward, done)
ob, reward, done, _ = env.step(action)
and I can clearly see that it has learned something compared to an untrained baseline. The loading action is the same across both files (or rather, if there's a mistake there then I can't find it), so it appears probable to me that the training code is correctly loading the model and then, due to something in the pposgd_simple.learn function, promptly forgets about it.
Could anyone shed some light on this situation?
Not sure if this is still relevant since the baselines repository has changed quite a bit since this question was posted, but it seems that you are not actually initialising the variables before restoring them. Try moving the call of U.initialize() inside your policy_fn:
def policy_fn(name, ob_space, ac_space):
    print("Policy with name: ", name)
    policy = mlp_policy.MlpPolicy(name=name, ob_space=ob_space,
                                  ac_space=ac_space, hid_size=64, num_hid_layers=2)
    saver = tf.train.Saver()
    if initial_params_path is not None:
        print("Tried to restore from ", initial_params_path)
        U.initialize()  # initialise the variables before restoring them
        saver.restore(tf.get_default_session(), initial_params_path)
    return policy

Search and Find through a list Python

I have a main text file that looks like this:
OPEN| 43565| ACA6202| 10| Acting II| 3.00| Logan, G| SEE DEPT| | 01/12/15 - 04/27/15|
OPEN| 43566| ACA6206| 10| Topics:Classical Drama/Cult II| 2.00| Jacobson, L| SEE DEPT| | 01/12/15 - 04/27/15|
OPEN| 43567| ACA6210| 10| Text II| 2.00| Logan, G| SEE DEPT| | 01/12/15 - 04/27/15|
OPEN| 43568| ACA6212| 10| Voice and Speech II| 3.00| Logan, G| SEE DEPT| | 01/12/15 - 04/27/15|
OPEN| 43569| ACA6216| 10| Movement II| 2.00| Logan, G| SEE DEPT| | 01/12/15 - 04/27/15|
OPEN| 43570| ACA6220| 10| Alexander Technique II| 2.00| Logan, G| SEE DEPT| | 01/12/15 - 04/27/15|
OPEN| 43571| ACA6224| 10| Stage Combat II| 2.00| Logan, G| SEE DEPT| | 01/12/15 - 04/27/15|
OPEN| 43572| ACA6228| 10| Practicum IV| 3.00| Logan, G| SEE DEPT| | 01/12/15 - 04/27/15|
OPEN| 44500| ACA6595| 10| Selected Topics| 1.00| Logan, G| SEE DEPT| | 01/12/15 - 04/27/15|
My code below gathers only the "SUBJECT" column and strips the numbers from the string. So for example, the output from the top of the file would print several "ACA"s.
with open ("/Users/it/Desktop/Classbook/classAbrevs.txt", "r") as myfile:
    subsAndAbrevsMap = tuple(open("/Users/it/Desktop/Classbook/classAbrevs.txt", 'r'))
with open ("/Users/it/Desktop/Classbook/masterClassList.txt", "r") as myfile:
    masterSchedule = tuple(open("/Users/it/Desktop/Classbook/masterClassList.txt", 'r'))
for masterline in masterSchedule:
    masterSplitLine = masterline.split("|")
    if masterSplitLine[0] != "STATUS":
        subjectAbrev = ''.join([i for i in masterSplitLine[2] if not i.isdigit()])
I have another .txt file that looks like this:
Academy for Classical Acting,ACA
Africana Studies,AFST
American Studies,AMST
Anatomy & Regenerative Biology,ANAT
Applied Science,APSC
Art/Art History,AH
Art/Fine Arts,FA
Biological Sciences,BISC
In my code below, I check to see if the abbreviations(column 2) in my second .txt equal the abbreviations generated from my first .txt document. If it is a match I would like to append the full class name:
#open 2nd .txt, strip and split
for subsline in subsAndAbrevsMap:
    subLineSplit = subsline.split(",")
    print "subLineSplit is: " + subsline[0]
    if subLineSplit[1] == subjectAbrev:
        realSubjectName = subLineSplit[0]
        print "The subject name for abrev " + subjectAbrev + " is " + realSubjectName
I want the output to print:
"The subject name for abrev ACA is Academy for Classical Acting"
What am I doing wrong?
First of all, these are CSV files, so use the csv module!
# path to first file is ~/classes.csv
# path to second file is ~/abbr.csv
import csv
import os
# open() does not expand "~", so expand it explicitly
with open(os.path.expanduser("~/classes.csv"), 'r') as classes_csv,\
     open(os.path.expanduser("~/abbr.csv"), 'r') as abbr_csv:
    classes = csv.reader(classes_csv, delimiter='|')
    abbr = csv.reader(abbr_csv, delimiter=',')
    header = next(classes)
    # create a lookup dictionary for your tags -> names
    abbr_dict = {line[1].strip():line[0].strip() for line in abbr}
    # create a genexp for all the extant tags in ~/classes.csv
    class_tags = (line[2].strip("0123456789 ") for line in classes)
    # build the result while the files are still open, since the genexp is lazy
    result = {tag:abbr_dict[tag] for tag in class_tags if tag in abbr_dict}
Then it should be easy to format your result.
for abbr,cls in result.items():
    print("The abbreviation for {} is {}".format(cls,abbr))
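For reference, here is a self-contained sketch of the same lookup that runs on in-memory copies of a few sample lines, so it can be tried without the files. The classes_text and abbr_text strings are stand-ins for the two files, and the header names other than STATUS and SUBJECT are made up. It also shows why the comparison in the question never matches: lines read from a file keep their trailing newline, so the abbreviation is read as "ACA\n" unless it is stripped.

```python
import csv
import io

# Stand-ins for the two files; header names other than STATUS and
# SUBJECT are invented for this example.
classes_text = (
    "STATUS| CRN| SUBJECT| SECT| COURSE\n"
    "OPEN| 43565| ACA6202| 10| Acting II\n"
    "OPEN| 43566| ACA6206| 10| Topics:Classical Drama/Cult II\n"
)
abbr_text = (
    "Academy for Classical Acting,ACA\n"
    "Africana Studies,AFST\n"
)

# A raw line keeps its newline, which is why the == test in the question fails.
raw_line = abbr_text.splitlines(keepends=True)[0]
raw_abbrev = raw_line.split(",")[1]

# Map each abbreviation to its full subject name.
abbr_reader = csv.reader(io.StringIO(abbr_text), delimiter=",")
abbr_dict = {row[1].strip(): row[0].strip() for row in abbr_reader}

# Collect the distinct subject tags, with digits and spaces stripped.
class_reader = csv.reader(io.StringIO(classes_text), delimiter="|")
next(class_reader)  # skip the header row
tags = {row[2].strip("0123456789 ") for row in class_reader}

messages = [
    "The subject name for abrev {} is {}".format(tag, abbr_dict[tag])
    for tag in sorted(tags)
    if tag in abbr_dict
]
print("\n".join(messages))
```

Using a set for the tags means each abbreviation is reported once, however many classes share it.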

PyGTK Spacing in an HBox

I'm new to GTK and I'm trying to figure out how to accomplish something like this:
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
I want this done in an HBox. How would I accomplish this? Thanks.
It is done with "packing".
I always keep the class reference under my pillow :
Samples in the good tutorial found here :
And finally, this shows up something like your drawing :
import gtk as g
win = g.Window ()
win.set_default_size(600, 400)
win.connect ('delete_event', g.main_quit)
hBox = g.HBox()
win.add (hBox)
f1 = g.Frame()
f2 = g.Frame()
f3 = g.Frame()
win.show_all ()
g.main ()
Have fun ! (and I hope my answer is helpful)
The answer is pack_start() and pack_end().
Both functions take parameters (expand, fill and padding) that give you the desired effect.
If you use Louis' example:
hBox.pack_start(f1, expand =False, fill=False)
hBox.pack_start( f2, expand=True, fill=True, padding=50)
hBox.pack_end(f3, expand=False, fill=False)
Hope that helps!

