Wednesday, February 21, 2024

Databricks - Interview Questions and Answers

Originally, Databricks was a notebook interface for using Spark without having to worry about the infrastructure for distributed computing. All you had to do was specify the desired cluster size, and Databricks took care of the rest. Before distributed computing became the norm, this was a big deal.

It has grown a lot since then (though I'm not sure in what order), notably by adding a front-end SQL interface (which, incidentally, runs Spark underneath). Although your data is saved as files rather than tables, you can now treat Databricks like a database or data warehouse thanks to the virtual data warehouse interface they have built. Once Delta Lake was released, your files became tables that can also be used outside Databricks in other contexts. Additionally, Databricks Workflows, which is integrated directly into Databricks, lets you orchestrate your Databricks work.

There is also Unity Catalog, which allows Databricks to manage and abstract your data access through a single lens, so that Databricks can act as a stand-alone data platform.


Why avoid UDFs in Databricks or PySpark code?

A user-defined function (UDF) is a function defined by a user. It allows custom logic to be reused in the user environment.
Databricks uses different code optimizers for code written with the included Apache Spark, SQL, and Delta Lake syntax. Custom logic written in UDFs, however, is a black box to these optimizers, so they cannot efficiently plan tasks around it. In addition, for some UDFs (Python UDFs, for example) the logic executes outside the JVM, which adds data serialization costs. Hence, UDFs hurt performance.

Should we never use UDFs?
Not never, but avoid UDFs wherever you can. When possible, use Spark SQL built-in functions, because the optimizers can see and optimize them. For example, to convert a string to upper case, use the built-in upper() instead of writing a UDF.
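
To make the upper-case case concrete, here is a minimal PySpark sketch that builds the same column twice, once through a Python UDF and once through the built-in upper(). The names `shout` and `compare_upper`, and the tiny sample data, are made up for illustration; the function expects an existing SparkSession such as the `spark` object a Databricks notebook provides.

```python
def shout(value):
    """Plain Python logic we might be tempted to wrap in a UDF (hypothetical name)."""
    return value.upper() if value is not None else None


def compare_upper(spark):
    """Return (udf_df, builtin_df) built from the same tiny sample DataFrame."""
    # Imported inside the function so the module still loads where pyspark is absent.
    from pyspark.sql import functions as F
    from pyspark.sql.types import StringType

    # Illustrative sample data, not taken from any real table.
    df = spark.createDataFrame([("alice",), ("bob",), (None,)], ["name"])

    # UDF version: opaque to the optimizer, and rows are shipped to a Python worker.
    shout_udf = F.udf(shout, StringType())
    udf_df = df.select(shout_udf(F.col("name")).alias("name_upper"))

    # Built-in version: stays in the JVM and is visible to the optimizer.
    builtin_df = df.select(F.upper(F.col("name")).alias("name_upper"))

    return udf_df, builtin_df
```

In a notebook you could call `udf_df, builtin_df = compare_upper(spark)` and run `.explain()` on each result. Both return the same values, but the UDF plan typically contains an extra Python evaluation step that the built-in plan does not.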

When to use UDFs then?
Consider creating a UDF only when no existing built-in SQL function provides the logic you need.

Do we need any additional settings for a shared cluster in Databricks?

✅Yes, there are some additional settings and considerations to keep in mind when using shared clusters in Databricks. Shared clusters are clusters that are shared among multiple users and can be used for collaborative workloads. Here are some key points to consider:

  1. Cluster Access Control: Ensure that appropriate access controls are in place to manage who can start, stop, and modify shared clusters. You can use Databricks workspace access control lists (ACLs) to control cluster access.
  2. Cluster Configuration: Configure the shared cluster with appropriate specifications (e.g., instance type, number of worker nodes, autoscaling settings) to meet the needs of all users who will be using the cluster. Consider the workload requirements and resource constraints when configuring the cluster.
  3. Library Management: Manage libraries and dependencies carefully on shared clusters to avoid conflicts and ensure compatibility across different notebooks and jobs. You can install libraries using the Databricks UI or programmatically using Databricks REST API or Databricks CLI.
  4. Session Management: Notebooks attached to the same shared cluster typically share a single Spark context while each notebook gets its own Spark session. This avoids the overhead of creating and tearing down a context for each user, but it also means users compete for the same cluster resources.
  5. Idle Cluster Termination: Configure idle cluster termination settings to automatically terminate the cluster when it is not in use for a specified period. This helps in cost optimization by avoiding unnecessary cluster runtime.
  6. Monitoring and Logging: Monitor cluster usage and performance to identify any bottlenecks or issues. Use Databricks monitoring tools and integrations with third-party logging and monitoring solutions to track cluster activity and health.
  7. Concurrency and Resource Management: Manage concurrency and resource contention on shared clusters to ensure fair allocation of resources among users. Tune Spark settings such as spark.scheduler.mode and spark.executor.instances to optimize resource utilization.
  8. Cost Management: Monitor cluster costs and usage to ensure that shared clusters are cost-effective. Use Databricks cost management features such as cost tracking and budget alerts to track spending and optimize resource usage.
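
Several of the points above (cluster configuration, idle termination, and scheduler settings) end up as fields in the cluster definition. The sketch below builds such a definition as a plain Python dict that could be serialized to JSON for the Databricks REST API or CLI. The helper name `shared_cluster_spec` is hypothetical, and the `spark_version` and `node_type_id` values are placeholders that you would replace with values valid in your own workspace.

```python
import json


def shared_cluster_spec(name, min_workers, max_workers, idle_minutes):
    """Build a cluster definition dict for a shared cluster (illustrative helper)."""
    if min_workers > max_workers:
        raise ValueError("min_workers must not exceed max_workers")
    return {
        "cluster_name": name,
        # Placeholder values: pick a runtime and node type available in your workspace.
        "spark_version": "<runtime-version>",
        "node_type_id": "<node-type>",
        # Point 2: autoscaling bounds sized for all users of the cluster.
        "autoscale": {"min_workers": min_workers, "max_workers": max_workers},
        # Point 5: terminate the cluster after this many idle minutes.
        "autotermination_minutes": idle_minutes,
        # Point 7: ask Spark to schedule concurrent jobs fairly.
        "spark_conf": {"spark.scheduler.mode": "FAIR"},
    }


spec = shared_cluster_spec("team-shared", min_workers=2, max_workers=8, idle_minutes=30)
payload = json.dumps(spec, indent=2)  # JSON text ready to send or save to a file
```

Keeping the definition in code like this makes it easy to review and version the settings for a shared cluster. Access control, library management, and monitoring are configured separately and are not covered by this sketch.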
