Logging in PySpark
Introduction
The pyspark.logger module facilitates structured client-side logging for PySpark users.
This module includes a PySparkLogger class that provides several methods for logging messages at different levels in a structured JSON format:
The logger can be easily configured to write logs to either the console or a specified file.
Customizing Log Format
The default log format is JSON, which includes the timestamp, log level, logger name, and the log message along with any additional context provided.
Example log entry:
{
  "ts": "2024-06-28 19:53:48,563",
  "level": "ERROR",
  "logger": "DataFrameQueryContextLogger",
  "msg": "[DIVIDE_BY_ZERO] Division by zero. Use `try_divide` to tolerate divisor being 0 and return NULL instead. If necessary set \"spark.sql.ansi.enabled\" to \"false\" to bypass this error. SQLSTATE: 22012\n== DataFrame ==\n\"divide\" was called from\n/.../spark/python/test_error_context.py:17\n",
  "context": {
    "file": "/path/to/file.py",
    "line": "17",
    "fragment": "divide",
    "errorClass": "DIVIDE_BY_ZERO"
  },
  "exception": {
    "class": "Py4JJavaError",
    "msg": "An error occurred while calling o52.showString.\n: org.apache.spark.SparkArithmeticException: [DIVIDE_BY_ZERO] Division by zero. Use `try_divide` to tolerate divisor being 0 and return NULL instead. If necessary set \"spark.sql.ansi.enabled\" to \"false\" to bypass this error. SQLSTATE: 22012\n== DataFrame ==\n\"divide\" was called from\n/path/to/file.py:17 ...",
    "stacktrace": ["Traceback (most recent call last):", "  File \".../spark/python/pyspark/errors/exceptions/captured.py\", line 247, in deco", "    return f(*a, **kw)", "  File \".../lib/python3.9/site-packages/py4j/protocol.py\", line 326, in get_return_value" ...]
  }
}
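To make the structure above concrete, the sketch below reproduces the same JSON shape using only the standard logging module. The `JSONFormatter` class here is hypothetical and written for illustration; `pyspark.logger` ships its own formatter, so this is not the actual PySpark implementation.

```python
import json
import logging

# Hypothetical formatter mimicking the structured JSON layout shown above.
# PySparkLogger provides its own formatter; this stdlib-only version is a sketch.
class JSONFormatter(logging.Formatter):
    def format(self, record):
        entry = {
            "ts": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "msg": record.getMessage(),
            # Keyword context passed via `extra=` becomes an attribute on the
            # record; surface it here as the "context" field.
            "context": getattr(record, "context", {}),
        }
        return json.dumps(entry)

logger = logging.getLogger("DemoLogger")
handler = logging.StreamHandler()
handler.setFormatter(JSONFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info(
    "User test_user performed login",
    extra={"context": {"user": "test_user", "action": "login"}},
)
```

Fields like "exception" are added by PySpark itself when an error is captured; the sketch only covers the always-present fields.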
Setting Up
To start using the PySpark logging module, import PySparkLogger from the pyspark.logger module.
from pyspark.logger import PySparkLogger
Usage
Creating a Logger
You can create a logger instance by calling PySparkLogger.getLogger(). By default, it creates a logger named “PySparkLogger” with an INFO log level.
logger = PySparkLogger.getLogger()
Logging Messages
The logger provides three main methods for logging messages: PySparkLogger.info(), PySparkLogger.warning() and PySparkLogger.error().

- PySparkLogger.info: Use this method to log informational messages.

  user = "test_user"
  action = "login"
  logger.info(f"User {user} performed {action}", user=user, action=action)

- PySparkLogger.warning: Use this method to log warning messages.

  user = "test_user"
  action = "access"
  logger.warning(f"User {user} attempted an unauthorized {action}", user=user, action=action)

- PySparkLogger.error: Use this method to log error messages.

  user = "test_user"
  action = "update_profile"
  logger.error(f"An error occurred for user {user} during {action}", user=user, action=action)
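Because PySparkLogger follows standard Python logging-level semantics, messages below the logger's configured level are dropped. The stdlib-only sketch below demonstrates that filtering behavior (the `CaptureHandler` class is hypothetical, written just to record which messages get through):

```python
import logging

# Stdlib sketch of level filtering; PySparkLogger follows the same
# logging-level semantics, so this behavior carries over.
logger = logging.getLogger("LevelDemo")
logger.setLevel(logging.WARNING)  # INFO messages are now filtered out
logger.propagate = False          # keep this demo self-contained

captured = []

class CaptureHandler(logging.Handler):
    """Hypothetical handler that records emitted messages in a list."""
    def emit(self, record):
        captured.append((record.levelname, record.getMessage()))

logger.addHandler(CaptureHandler())

logger.info("not recorded")    # below WARNING: dropped
logger.warning("recorded")     # at WARNING: kept
logger.error("also recorded")  # above WARNING: kept
```

After this runs, `captured` holds only the WARNING and ERROR entries.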
Logging to Console
from pyspark.logger import PySparkLogger
# Create a logger that logs to console
logger = PySparkLogger.getLogger("ConsoleLogger")
user = "test_user"
action = "test_action"
logger.warning(f"User {user} takes an {action}", user=user, action=action)
This logs the message in the following JSON format:
{
  "ts": "2024-06-28 19:44:19,030",
  "level": "WARNING",
  "logger": "ConsoleLogger",
  "msg": "User test_user takes an test_action",
  "context": {
    "user": "test_user",
    "action": "test_action"
  }
}
Logging to a File
To log messages to a file, use PySparkLogger.addHandler() to attach a FileHandler from the standard Python logging module to your logger.
This approach aligns with the standard Python logging practices.
from pyspark.logger import PySparkLogger
import logging
# Create a logger that logs to a file
file_logger = PySparkLogger.getLogger("FileLogger")
handler = logging.FileHandler("application.log")
file_logger.addHandler(handler)
user = "test_user"
action = "test_action"
file_logger.warning(f"User {user} takes an {action}", user=user, action=action)
The log messages will be saved in application.log in the same JSON format.
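Since each record is stored as JSON, the log file can be post-processed with json.loads. The sketch below assumes each record occupies a single line; if your configured formatter pretty-prints records across multiple lines, this line-by-line approach would need adjusting. Both helper functions are hypothetical, not part of pyspark.logger:

```python
import json

# Hypothetical helper: parse a log file where each line is one JSON record.
# Verify that your formatter emits one object per line before relying on this.
def read_json_logs(path):
    entries = []
    with open(path) as f:
        for line in f:
            line = line.strip()
            if line:
                entries.append(json.loads(line))
    return entries

# Hypothetical helper: keep only WARNING-and-above entries.
def warnings_and_errors(entries):
    return [e for e in entries if e.get("level") in ("WARNING", "ERROR")]
```

For example, `warnings_and_errors(read_json_logs("application.log"))` would return only the structured WARNING and ERROR records from the file written above.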