Spark is currently a must-have tool for processing large datasets. This technology has become the leading choice for many business applications in data engineering. The momentum is supported by managed services such as Databricks, which reduce part of the costs related to the purchase and maintenance of a distributed computing cluster. The most famous cloud providers also offer Spark integration services (AWS EMR, Azure HDInsight, GCP Dataproc).

Spark is commonly used to apply transformations on data, structured in most cases. There are two scenarios in which it is particularly useful. The first is when the data to be processed is too large for the available computing and memory resources: this is what we call the big data phenomenon. The second is when one wants to accelerate a calculation by distributing it across several machines within the same network.

In both cases, a major concern is the calculation time of a Spark job. A common response to this problem is to increase the resources allocated to the computing cluster, a trend encouraged by the ease of renting computing power from cloud providers. The objective of this article is instead to propose a strategy for optimizing a Spark job when resources are limited, illustrated by a time-saving optimization on a concrete use case.