Interested in understanding the basics of AI? We explain definitions and frameworks in this article.
We are living in the midst of a new digital wave, in which digital technologies are being combined with machines, markets, and nature. More recently, artificial intelligence has moved to the center of this wave, and AI promises to continue to dominate innovation in the near future.
Is this digital wave just the fourth wave of the industrial revolution as we know it, or is it the beginning of a whole new society? This question is taken from the book ‘Society 4.0’ (2021) by Bob de Wit, Professor of Strategic Leadership in the Netherlands.
So, as our society surfs this new wave, let’s have a look at what Artificial Intelligence is.
What is AI?
AI has captured the imagination of the world since the 1950s. Artificial Intelligence is intelligence demonstrated by machines, as opposed to the natural intelligence displayed by animals and humans.
AI, which builds on digitalization and the automation of processes, mimics the cognitive processes of human beings. This opens up new possibilities, including learning, problem solving, knowledge representation, and social and general intelligence. AI draws on tools from mathematics, logic, probability, and economics, and AI-powered tools are already used to optimize search engines. This has helped to cut costs on paperwork, reduce labor, and eliminate human error. However, early versions of AI left the higher cognitive work for human beings to take care of.
How Does Artificial Intelligence Work?
In brief, AI works as follows: data sources (datasets) are prepared and fed into a model, which produces a prediction, known as the outcome. Based on the outcome, users make decisions. What makes an AI model groundbreaking is that it is self-learning. However, one of the challenges of AI is to identify whether the model, and its outcome, can and should be trusted.
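The data-to-decision flow above can be sketched in a few lines of Python. This is a deliberately simplified toy, not a real AI system: the hypothetical "model" merely learns a threshold from historical values and flags new values far above it.

```python
# Toy sketch of the AI flow: data -> model -> prediction (outcome) -> decision.
# The "model" is hypothetical: it learns a threshold from the dataset and
# flags values far above the historical average.

def train(dataset):
    """'Learn' a threshold from the data (the self-learning step, vastly simplified)."""
    mean = sum(dataset) / len(dataset)
    return {"threshold": mean * 2}

def predict(model, value):
    """Produce an outcome for a new observation."""
    return "suspicious" if value > model["threshold"] else "normal"

# 1. Prepare the dataset, 2. train the model, 3. predict, 4. decide.
transactions = [20, 35, 30, 25, 40]   # historical data
model = train(transactions)           # the model "learns" from the data
outcome = predict(model, 120)         # prediction for a new transaction
decision = "review manually" if outcome == "suspicious" else "approve"
print(outcome, decision)              # suspicious review manually
```

The trust challenge mentioned above shows up even here: whether the learned threshold is reasonable depends entirely on how representative the training data was.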
What is Machine Learning?
Machine Learning (ML) started to gain traction around 1980. It is a field of inquiry devoted to understanding and building methods that ‘learn’, that is, leverage data to improve performance on some set of tasks. Machine Learning is considered a subset of Artificial Intelligence.
Machine Learning focuses on the ability of machines to receive data and learn for themselves, without being programmed with rules. ML differs from traditional programming in that you can teach an ML program with examples rather than a list of instructions. Instead of writing instructions or rules while programming, machine learning enables you to “train” an algorithm so that it can learn on its own, and then adjust and improve as it learns more about the information it is processing.
In short, ML is using data to answer questions. The first part, ‘using data’, is also referred to as training. The second part, ‘answer questions’, is known as making predictions or inference.
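Both parts can be illustrated with nothing beyond the Python standard library. In this sketch the "training" is an ordinary least-squares line fit: the program is given example input/output pairs but never told the underlying rule, and then answers a question (inference) about an input it has never seen.

```python
# Training: learn a relationship y ≈ a*x + b from example data,
# without any hand-written rules about what that relationship is.
def fit_line(xs, ys):
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    a = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
        sum((x - mean_x) ** 2 for x in xs)
    b = mean_y - a * mean_x
    return a, b

# Example data: the answers happen to follow y = 2*x + 1,
# but the program is never told that rule.
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]
a, b = fit_line(xs, ys)   # 'using data' (training)

# Inference: answer a question about an input the model has never seen.
print(a * 10 + b)         # predicts 21.0 for x = 10
```

Real ML models fit far more complex functions to far more data, but the split between a training phase and an inference phase is the same.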
Google Cloud describes the steps of ML as follows: (1) gather the data; (2) prepare the data to optimize its quality; (3) choose an ML model; (4) train the model (using approximately 80% of the gathered data); (5) evaluate the trained model (using the remaining 20% of the data); (6) tune the hyperparameters; and finally (7) make predictions.
What is Deep Learning?
Deep Learning has propelled the AI industry forward since around 2010. Deep learning is a machine learning technique that teaches computers to do what comes naturally to humans: learn by example.
Deep learning is a key technology behind driverless cars, enabling them to recognize a stop sign, or to distinguish a pedestrian from a lamppost. While machine learning works with regression algorithms or decision trees, deep learning uses neural networks that function very similarly to the biological neural connections of our brain.
Machine Learning vs Deep Learning: What’s the Difference?
Machine Learning is the ability of machines to learn from data without being programmed with rules. Deep Learning is a subset of Machine Learning; the key difference is that it uses neural networks. Neural networks mimic the behavior of the human brain, allowing computer programs to recognize patterns and solve common problems.
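To make "neural network" concrete, here is the smallest possible one: a single artificial neuron, trained by gradient descent to learn the logical AND function. This is a standard textbook toy using only the Python standard library; real deep networks stack millions of such neurons in layers.

```python
import math

def sigmoid(z):
    """Squash any input into the range (0, 1), like a neuron's activation."""
    return 1 / (1 + math.exp(-z))

# One neuron: two input weights and a bias, loosely analogous to the
# strength of connections between biological neurons.
w1, w2, b = 0.0, 0.0, 0.0
data = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]  # logical AND

for _ in range(5000):                      # repeated exposure to the examples
    for (x1, x2), target in data:
        out = sigmoid(w1 * x1 + w2 * x2 + b)
        error = out - target               # how wrong was the neuron?
        w1 -= 0.5 * error * x1             # nudge the weights to reduce the error
        w2 -= 0.5 * error * x2
        b  -= 0.5 * error

for (x1, x2), target in data:
    print(x1, x2, "->", round(sigmoid(w1 * x1 + w2 * x2 + b)))
```

After training, the neuron outputs (approximately) 1 only for the input (1, 1). The neuron was never given the rule for AND; it inferred it from examples, which is the essence of learning by example.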
What are AI Frameworks?
AI frameworks provide data scientists, AI developers, and researchers with the building blocks to architect, train, validate, and deploy models through high-level programming interfaces. There are multiple AI frameworks, including the popular TensorFlow and Apache Spark. Before we dive deeper into Apache Spark and Apache Hadoop, a few words on TensorFlow.
TensorFlow makes it easy to create machine learning models. It provides a collection of workflows to develop and train models, and to deploy them easily. It runs faster on GPUs (Graphics Processing Units) and can be executed on various GPU-enabled platforms, including servers.
Apache Hadoop allows you to manage big data sets by enabling a network of computers (or “nodes”) to solve vast and intricate data problems. It is a highly scalable, cost-effective solution that stores and processes structured, semi-structured and unstructured data. Hadoop supports advanced analytics for stored data (e.g. predictive analysis, data mining, machine learning (ML), etc.). It enables big data analytics processing jobs to be split into smaller tasks. These smaller tasks are distributed across a Hadoop cluster (i.e. nodes that perform parallel computations on big data sets) and performed in parallel using an algorithm such as MapReduce.
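The MapReduce idea, splitting a big job into small, independent map tasks whose grouped results are then reduced, can be illustrated with a word count in plain Python. Hadoop would run the map and reduce phases on many nodes in parallel; here all three phases run in one process:

```python
from collections import defaultdict

documents = [
    "big data on hadoop",
    "hadoop splits big jobs",
    "big jobs run in parallel",
]

# Map phase: each document is independently turned into (word, 1) pairs.
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle phase: group the pairs by key (the word).
# Hadoop performs this grouping across the whole cluster.
groups = defaultdict(list)
for word, count in mapped:
    groups[word].append(count)

# Reduce phase: combine each group into a single result.
counts = {word: sum(values) for word, values in groups.items()}
print(counts["big"], counts["hadoop"])   # 3 2
```

Because each map task only needs its own document and each reduce task only needs its own group, both phases parallelize naturally across nodes.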
Apache Spark can be installed in combination with Apache Hadoop.
Apache Spark is a unified engine for large-scale data analytics. In other words, Spark is a popular engine for executing data engineering, data science, and machine learning (on single-node machines or clusters). Thousands of companies use Apache Spark.
Apache Spark is built on an advanced distributed SQL engine for large-scale data. Apache Spark integrates with other frameworks for data science and machine learning, like Tensorflow, as well as frameworks for SQL analytics and BI, and frameworks for Storage and Infrastructure. Read more about Spark on https://spark.apache.org/.
Apache Spark, one of the largest open-source projects in data processing, combines large-scale data processing and artificial intelligence (AI) in a single framework. This enables users to perform large-scale data transformations and analyses, and then run state-of-the-art machine learning (ML) and AI algorithms.
The Spark ecosystem consists of five primary modules:
Spark Core: Underlying execution engine that schedules and dispatches tasks and coordinates input and output (I/O) operations.
Spark SQL: Gathers information about structured data to enable users to optimize structured data processing.
Spark Streaming and Structured Streaming: Both add stream processing capabilities. Spark Streaming takes data from different streaming sources and divides it into micro-batches for a continuous stream. Structured Streaming, built on Spark SQL, reduces latency and simplifies programming.
Machine Learning Library (MLlib): A set of scalable machine learning algorithms plus tools for feature selection and building ML pipelines. The primary API for MLlib is the DataFrame-based API, which provides uniformity across different programming languages like Java, Scala and Python.
GraphX: User-friendly computation engine that enables interactive building, modification and analysis of scalable, graph-structured data.
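Spark Streaming's micro-batch model can be mimicked in plain Python to show the idea: an incoming stream is cut into small batches, and each batch is processed as a unit. This is only an analogy; real Spark distributes each batch's work across the cluster.

```python
def micro_batches(stream, batch_size):
    """Cut a continuous stream into small batches, as Spark Streaming does."""
    batch = []
    for event in stream:
        batch.append(event)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:                # flush the final, possibly smaller batch
        yield batch

events = range(1, 11)        # a stand-in for a live event stream
results = [sum(batch) for batch in micro_batches(events, 3)]
print(results)               # [6, 15, 24, 10]
```

Processing in micro-batches trades a small amount of latency for the throughput and fault-tolerance of batch processing, which is exactly the trade-off Structured Streaming then improves on.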
Apache Spark vs. Hadoop
Like Hadoop, Spark splits up large tasks across different nodes. However, Spark tends to perform faster because it uses random access memory (RAM) to cache and process data, rather than reading and writing intermediate data to disk as a file system does. Spark processes and retains data in memory for subsequent steps, whereas MapReduce processes data on disk; this enables Spark to handle use cases that Hadoop cannot. Hadoop stores data on multiple sources and processes it in batches via MapReduce. Spark is not, itself, a purely memory-based technology, but for smaller workloads its data processing speeds can be up to 100x faster than MapReduce. According to Apache, Spark typically performs up to 3x faster than Hadoop for large workloads.
Hadoop runs at a lower cost since it relies on any disk storage type for data processing. Spark runs at a higher cost because it relies on in-memory computations for real-time data processing, which requires it to use high quantities of RAM to spin up nodes.
Though both Hadoop and Spark platforms process data in a distributed environment, Hadoop is ideal for batch processing and linear data processing. Spark is ideal for real-time processing and processing live unstructured data streams.
When data volume rapidly grows, Hadoop quickly scales to accommodate the demand via the Hadoop Distributed File System (HDFS). In turn, Spark relies on the fault-tolerant HDFS for large volumes of data.
Spark enhances security with authentication via shared secret or event logging, whereas Hadoop uses multiple authentication and access control methods. Though, overall, Hadoop is more secure, Spark can integrate with Hadoop to reach a higher security level.
Spark is superior to Hadoop in Machine Learning because it includes MLlib, which performs iterative in-memory ML computations. It also includes tools that perform regression, classification, persistence, pipeline construction, evaluation, etc.
Hadoop is most effective for scenarios that involve processing big data sets in environments where data size exceeds available memory, batch processing with tasks that exploit disk read and write operations, building data analysis infrastructure with a limited budget, completing jobs that are not time-sensitive, and historical and archive data analysis.
Spark is most effective for scenarios that involve dealing with chains of parallel operations using iterative algorithms. As mentioned before, Spark achieves quick results with in-memory computations and the option to analyze streaming data in real time.
Hardware for AI Frameworks
Ideally, run Spark on the same nodes as Hadoop (HDFS), on a common cluster. If this is not possible, run Spark on different nodes in the same local-area network as HDFS. Apache recommends having 4-8 disks per node, configured without RAID.
Spark can perform a lot of its computation in memory. In general, Spark runs well with anywhere from 8 GiB to hundreds of gigabytes of memory per machine. In all cases, Apache recommends allocating at most 75% of the memory for Spark, leaving the rest for the operating system and buffer cache. How much memory you need depends on your application. Once data is in memory, most applications are either CPU- or network-bound.
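As an illustration, on a machine with 64 GiB of RAM the 75% guideline might translate into a `spark-defaults.conf` along these lines. The exact values are an assumption for this example; tune them for your own workload:

```
# Hypothetical sizing for a 64 GiB node, keeping roughly a quarter of the
# RAM free for the operating system and buffer cache.
spark.executor.memory   44g
spark.driver.memory     4g
```

Memory set aside for the OS buffer cache is not wasted: it speeds up the disk reads and writes Spark still performs for shuffles and spills.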
Spark scales to tens of CPU cores per machine. You should likely provision at least 8-16 cores per machine. Apache recommends 10 Gbps (or higher) networking.
Leaseweb and AI
Leaseweb offers dedicated servers in multiple hardware configurations to meet your specific infrastructure requirements for running your AI environments. By selecting the right-sized infrastructure, you lay a solid foundation for your AI workloads in terms of performance and cost, with full control and self-service.
Learn more about what Leaseweb has to offer here.