Why AWS Glue Jobs and PySpark Use No Memory (Almost)

Tom Harrison
Tom Harrison’s Blog
3 min read · Oct 14, 2022


I am new to AWS Glue Jobs and PySpark. I made some assumptions about how my jobs used memory.

They were wrong.

Glue and PySpark use almost no memory in a typical case of transformation. Sort of.

In typical Python (or whatever language) programming, code like this will result in the use of memory:

my_ten_bytes = "0123456789"

And those 10 bytes will remain available anywhere else in the program until they go out of scope. A global or top-level variable might always be in scope, but local variables in a function, or loop variables, will eventually be cleaned up, typically by a garbage collector.
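Here is a minimal sketch of that eager behavior (the function name is just illustrative): the bytes exist the moment the assignment runs, and they become reclaimable once the name goes out of scope.

def build_digits():
    # Memory for this string is allocated as soon as this line runs
    my_ten_bytes = "0123456789"
    return len(my_ten_bytes)

length = build_digits()
# After the function returns, my_ten_bytes is out of scope and the
# string becomes eligible for garbage collection.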

So when I started with AWS Glue, I was confused. For example, I needed to convert a string column called response_body, which held JSON content, into a structure I had defined, so that I could do other things with it. So I wrote code like this:

from awsglue.dynamicframe import DynamicFrame
from pyspark.sql.functions import col, from_json

logs_df = logs_dyf.toDF() \
    .withColumn("content", from_json(col("response_body"), json_schema))
logs_with_content_dyf = DynamicFrame.fromDF(
    dataframe=logs_df,
    glue_ctx=glueContext,  # the job's GlueContext
    name="logs_content_dyf")

The first statement appears to transform the DynamicFrame into a DataFrame, then add a column named “content” whose value is parsed from the JSON string in “response_body” using the schema I provide in the variable “json_schema”. The next call converts the DataFrame back to a DynamicFrame, now with a new column.

But that’s not exactly what happens.

It occurred to me that I now had two variables: logs_df (a Spark DataFrame) and logs_with_content_dyf (a Glue DynamicFrame). And of course they must use up tons of memory: those logs have millions of records!

But I was wrong.

Instead, what happens is that both methods return a description of the transformation, with the parameters baked in. Said another way, the methods return instructions for how to do something rather than actually doing it.

In the case above, I already had a few things sitting around:

  • logs_dyf is a DynamicFrame that holds the instructions to read the logs from their source.
  • The first method adds further instructions to convert the DynamicFrame to a DataFrame, then
  • withColumn adds the instruction to build a column named “content” by parsing the JSON in response_body with the schema provided.

That is, my code so far has simply said “plan to do this, then that, then the next thing, then the thing after that” using the output of the first as the input of the second, and so on.

My code has built a plan. It has done nothing else yet.
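One way to see this is to ask Spark for the plan it has built so far. This is a minimal sketch using the logs_df from the snippet above; explain() prints the logical and physical plans without touching any data.

# Nothing has been read yet; this just prints Spark's plan
logs_df.explain(True)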

In this case, the main user of memory is the json_schema variable that defines the structure of the JSON!
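For the record, that schema is just an ordinary PySpark StructType. A minimal sketch of what json_schema might look like (the field names here are made up for illustration; the real response_body schema will differ):

from pyspark.sql.types import StructType, StructField, StringType, LongType

# Hypothetical fields standing in for the real response_body structure
json_schema = StructType([
    StructField("status", StringType(), True),
    StructField("duration_ms", LongType(), True),
])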

It almost doesn’t matter whether I chain the methods (as in logs_dyf.toDF().withColumn(...)) or store the parts in intermediate variables. The only thing being stored is the plan, or instructions, about what to do.
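To make that concrete, these two versions build the same plan, and neither one moves any data (again, a sketch using the names and imports from the snippet above):

# Chained
content_df = logs_dyf.toDF().withColumn(
    "content", from_json(col("response_body"), json_schema))

# Intermediate variable
step_one_df = logs_dyf.toDF()
content_df = step_one_df.withColumn(
    "content", from_json(col("response_body"), json_schema))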

And then, at some point, the code (my plans or instructions) will execute. But it won’t execute in my program’s context; instead it will be sliced into small chunks and run on the executor nodes, the workers with the CPU and memory that know how to run each chunk. If I have 1 million records and 10 executors, each executor might only need to perform those operations on 100,000 records.
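Spark divides the data into partitions and hands those out to the executors. A quick way to see how many chunks the work would be split into (a sketch; this inspects the partitioning rather than processing the records):

# Number of partitions Spark plans to split the DataFrame into
print(logs_df.rdd.getNumPartitions())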

In my case, I just couldn’t get it out of my head that all of the stuff I was doing to make my code readable, testable, and composable was somehow creating a huge memory footprint. But that was wrong.

It is true that anything that causes a given DataFrame to produce some information, perhaps a count() of records or a write to a file, requires the plan to actually execute. But if you can defer these actions until your transformations are complete, PySpark will use hardly any memory, and Spark will resolve the full set of transformations without reading or writing anything until the end, when chances are we’ll write our transformed data out to a new location.
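That final write is the only action in my job. A minimal sketch of what it might look like, assuming the glueContext from the standard Glue job boilerplate and a made-up S3 path:

# The write is the action that finally triggers the whole plan
glueContext.write_dynamic_frame.from_options(
    frame=logs_with_content_dyf,
    connection_type="s3",
    connection_options={"path": "s3://example-bucket/transformed-logs/"},
    format="parquet")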
