For its second edition, the whole Adaltas crew is gathering in Corsica for a full week, with 2 days dedicated to technology: the 23rd and the 24th of September 2021.
After a year and a half of sanitary restrictions, we all deserve a majestic site. Porto Vecchio in Corsica lives up to our expectations: warm sun, transparent sea with white sand beaches, and impressive landscapes in the backcountry. We rented 2 villas to host everyone. Participants choose between waking up next to one of the nicest beaches of the Mediterranean and the comfort of a luxury villa with majestic scenery.
We are grateful to our customers for the flexibility they provided us. Organizing the event was a challenging task. Despite the late notice, almost everyone managed to be present. Some chose to take a few days off before the summit, others worked remotely, and a few are joining us in the middle of the week to attend the talks.
Participants can choose one of the 3 formats available:
- presentation: between 20min and 1h
- demonstration: between 45min and 2h
- training: between 1h and 2h
Adaltas is a team of hackers, leaders and innovators in software development located in France, Morocco and Canada.
Program
Once a talk has been given, its supporting material as well as an article covering it will be published on the Adaltas website. Here is the calendar and the list of topics covered during the week.
Thursday, September 23rd, 2021
Friday, September 24th, 2021
Abstracts
TDP, the 100% open source Big Data distribution
- Speaker: Leo SCHOUKROUN
- Duration: 1h
- Format: talk + demo
- Schedule: Thursday, September 23rd, 2021 at 10:00
Commercial Apache Hadoop distributions have come and gone. The two leaders, Cloudera and Hortonworks, have merged: HDP is no more, CDH is now CDP. MapR was acquired by HPE. IBM BigInsights is discontinued. Some of these organizations are important contributors to the Apache Hadoop project. Their clients rely on them to get secure, tested, and stable builds of Hadoop (and of other software from the Big Data ecosystem). Hadoop is now a decade-old project with thousands of commits, dozens of dependencies, and a complex architecture.
TOSIT Data Platform (TDP) is an initiative from TOSIT, a French organization promoting open source software. TDP is a collection of Ansible roles to deploy big data software from the Hadoop ecosystem (e.g. HDFS, YARN, Hive, HBase) to remote machines automatically and securely (Kerberos, SSL, ACLs, …). The deployed services are built directly from the Apache source code of the projects.
This presentation details the process we went through to make our releases of the Apache projects (building, patching, testing, packaging) and features a deployment demonstration of a TDP cluster on freshly provisioned Virtual Machines.
Jupyter Lab/Notebook integration with Spark on Hadoop and Kerberos
- Speaker: Aargan COINTEPAS
- Duration: 45min
- Format: discussion + demo
- Schedule: Thursday, September 23rd, 2021 at 11:00
Jupyter is currently one of the most popular notebook web servers. A wide range of users, data scientists and others, rely on Python and Jupyter for their experiments. In big data clusters based on the Apache ecosystem, Apache Zeppelin is a popular service. It is packaged in both the Cloudera HDP and CDP distributions. However, our users are not familiar with Apache Zeppelin, and it is not as mature and rich as Jupyter.
In this talk, I present two ways to connect your Jupyter server to your Spark cluster:
- a shell script to launch a personal Jupyter server
- sparkmagic to create the Spark interpreter for a shared Jupyter server
GitOps with Argo CD
- Speaker: Paul-Adrien CORDONNIER
- Duration: 1h
- Format: talk + demo
- Schedule: Thursday, September 23rd, 2021 at 12:00
The GitOps Pattern states that Git repositories are the source of truth for defining the desired application state.
Application definitions, configurations, and environments should be declarative and version-controlled.
Argo CD automates the deployment of the desired application states in the specified target environments. Application deployments can track updates to branches or tags, or be pinned to a specific version of manifests at a Git commit.
Automated infrastructure deployment with Nikita, on the road to version 1.0
- Speaker: David WORMS
- Duration: 1h
- Format: discussion + demo
- Schedule: Thursday, September 23rd, 2021 at 14:30
Automation is central when operating and scaling complex systems. The more servers and services there are to manage, the harder it gets for a team to fulfill their operational duties without proper automation in place.
Nikita presents several advantages over comparable solutions like Ansible, Puppet, and Chef. It is written in JavaScript, a language familiar to many and with a large ecosystem. SSH is transparent: every action runs locally or over a secured remote connection. There is no state, which makes it a good fit for GitOps and CI/CD integration. It combines the best of both worlds, pairing a declarative API (and a declarative-looking language when written in CoffeeScript) with an imperative language. Templating is also supported if you wish to take that road. It is extremely flexible: the core is written in 140 lines and the rest is just plugins.
Nikita, an almost 10-year-old project at Adaltas, was entirely rewritten during the lockdowns. The initiative covers a lot of areas: a monorepo Git organization, the ability for actions to return values, native async/await support across the API, a plugin architecture where everything is a plugin, schema validation, dynamic templating, a new website, and a lot more.
Before version 1.0.0, the public API should not be considered stable. The upcoming version 1.0.0 defines the public API. In the case of Nikita, it meant a lot of work. There are still a few things to work on before being ready, but we are getting closer. The current version is 1.0.0-alpha.2. We will present Nikita, its usage, its improvements, and its future evolutions.
Terraform modules and workspaces, avoid code duplication and isolate resources
- Speaker: Ferdinand DE BAECQUE
- Duration: 1h
- Format: talk + demo
- Schedule: Thursday, September 23rd, 2021 at 15:30
Terraform is an IaC (Infrastructure as Code) tool that helps users manage their Cloud resources. Every day, I use it to create and update GCP resources. It works as follows. Terraform builds a plan to compare what has already been created in a system versus the code pushed by a user. The plan stage detects the resources to create/update/delete. After that, the apply stage performs the actions described in the plan stage, and it persists the associated output information in a “state” file.
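The plan and apply stages described above can be illustrated with a small Python sketch. It is not Terraform's implementation: the function names, the resource names, and the dictionary-based "state" are all hypothetical and only mirror the idea of comparing recorded state with desired configuration.

```python
from typing import Dict, List, Tuple

Resource = Dict[str, str]


def build_plan(state: Dict[str, Resource], desired: Dict[str, Resource]) -> List[Tuple[str, str]]:
    """Compare the recorded state with the desired configuration."""
    plan = []
    for name, config in desired.items():
        if name not in state:
            plan.append(("create", name))
        elif state[name] != config:
            plan.append(("update", name))
    for name in state:
        if name not in desired:
            plan.append(("delete", name))
    return plan


def apply_plan(plan: List[Tuple[str, str]], state: Dict[str, Resource],
               desired: Dict[str, Resource]) -> Dict[str, Resource]:
    """Perform the planned actions and return the new state to persist."""
    new_state = dict(state)
    for action, name in plan:
        if action == "delete":
            del new_state[name]
        else:  # "create" and "update" both record the desired configuration
            new_state[name] = dict(desired[name])
    return new_state


# Hypothetical resources, for illustration only.
state = {"bucket-a": {"location": "EU"}, "vm-old": {"size": "small"}}
desired = {"bucket-a": {"location": "US"}, "bucket-b": {"location": "EU"}}

plan = build_plan(state, desired)
new_state = apply_plan(plan, state, desired)
print(plan)
print(build_plan(new_state, desired))  # nothing left to do after apply
```

Running the plan a second time against the new state yields no action, which is the property that makes repeated runs safe.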
When a team manages a large infrastructure in which multiple entities use the same types of resources, Modules and Workspaces make the work easier: Modules templatize how the resources are defined, and Workspaces isolate the resources’ state by entity. When resources need to be created for a new entity, their definitions are standardized and the plan only concerns the resources of that entity.
This presentation introduces Terraform Modules and Workspaces, then demonstrates how I use them with a simple example.
Demystifying the Linux overlay filesystem used by Docker
- Speaker: David WORMS
- Duration: 1h
- Format: talk
- Schedule: Thursday, September 23rd, 2021 at 16:30
Overlay filesystems (also called union filesystems) are a fundamental technology in Docker to create images and containers. They create a filesystem from a union of directories: multiple filesystems, which are just directories, are superposed one on top of another to create a new filesystem.
During this talk, we will learn how to create an overlay filesystem ourselves. Then, we will dive deep into its usage with Docker to build images and run containers.
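To give an intuition of how superposed directories resolve into a single view, here is a minimal in-memory Python sketch. It is a model of the idea only, not the Linux kernel implementation: layers are plain dictionaries, and the `WHITEOUT` marker, the `lookup` and `listing` helpers, and the file names are all hypothetical.

```python
from typing import Dict, List, Optional

WHITEOUT = object()  # marker meaning "this path is deleted in this layer"

Layer = Dict[str, object]


def lookup(path: str, layers: List[Layer]) -> Optional[object]:
    """Return the content seen through the union; layers[0] is the topmost layer."""
    for layer in layers:
        if path in layer:
            entry = layer[path]
            return None if entry is WHITEOUT else entry
    return None


def listing(layers: List[Layer]) -> List[str]:
    """List the paths visible in the merged view."""
    visible: Dict[str, object] = {}
    for layer in reversed(layers):  # bottom to top, so upper layers override
        for path, entry in layer.items():
            if entry is WHITEOUT:
                visible.pop(path, None)
            else:
                visible[path] = entry
    return sorted(visible)


# Two read-only image layers and one writable container layer.
base = {"/etc/os-release": "debian", "/bin/sh": "shell"}
app = {"/app/main.py": "print('hi')", "/etc/os-release": "patched"}
container = {"/bin/sh": WHITEOUT, "/tmp/log": "started"}
layers = [container, app, base]

print(lookup("/etc/os-release", layers))  # the upper copy hides the base one
print(lookup("/bin/sh", layers))          # deleted in the container layer
print(listing(layers))
```

The lower layers are never modified: a change or a deletion is recorded in the top layer only, which is why several containers can share the same image layers.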
Ansible best practices and limitations when maintaining complex configurations
- Speaker: Xavier HERMAND
- Duration: 1h
- Format: talk
- Schedule: Thursday, September 23rd, 2021 at 17:30
Ansible is the de facto standard for open-source configuration management. Its readable YAML format is what made it so popular, as well as its extensibility using Python. However, using Ansible as it is supposed to be used isn’t always obvious. This talk will cover the bad and best practices of using Ansible, and the recent evolution: Ansible collections.
We will go through the limitations of using Ansible for configuration management when dealing with configuration versioning and dependencies. As an example, we will showcase how Ansible has been used to deploy a TDP Hadoop cluster, and describe those challenges.
The final goal of this talk is to open a discussion on different potential technical solutions that enable complex configuration management.
What does monitoring mean to a data scientist?
In IT, monitoring is a prevailing technique to ensure the proper functioning of the resources of the underlying system. For example, we monitor the use of CPU, RAM, and network, or application performance. All that with the aim of keeping automated processes running and offering a good experience to the users. Crucial for smooth production operations, it is also of great importance when deploying machine learning models.
But does monitoring as an MLOps practice differ from monitoring in a DevOps context? In short, DevOps mostly deals with the development and deployment of code. In contrast, MLOps handles the code, the data, and the model at the same time, all of them being interdependent. Thus, in addition to the resources, we need to monitor the quality and the content of the data and the behavior of the model, which is itself data dependent. This step is complex and often left out of textbooks. Proprietary data science platforms include it more and more often, but open-source libraries and good literature are scarce.
During the talk, I will further detail the mechanisms that lead to changes in data and the ways they affect the model. We will see a practical example of model degradation, which raises a new and important question: when should we raise an alert and retrain?
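As an illustration of what monitoring the content of the data can look like, the sketch below compares the distribution of one feature at training time with the one observed in production, using the two-sample Kolmogorov-Smirnov statistic. This is one common technique among others and not necessarily the one covered in the talk; the feature, the sample values, and the 0.2 alert threshold are hypothetical.

```python
from bisect import bisect_right
from typing import Sequence


def ks_statistic(reference: Sequence[float], current: Sequence[float]) -> float:
    """Largest gap between the two empirical cumulative distributions.

    0 means identical samples, 1 means the samples do not overlap at all.
    """
    if not reference or not current:
        raise ValueError("both samples must be non-empty")
    ref = sorted(reference)
    cur = sorted(current)
    largest_gap = 0.0
    # The gap between two step functions can only peak at an observed value.
    for x in ref + cur:
        cdf_ref = bisect_right(ref, x) / len(ref)
        cdf_cur = bisect_right(cur, x) / len(cur)
        largest_gap = max(largest_gap, abs(cdf_ref - cdf_cur))
    return largest_gap


def should_alert(reference: Sequence[float], current: Sequence[float],
                 threshold: float = 0.2) -> bool:
    """Flag the feature when the drift exceeds an arbitrary threshold (assumption)."""
    return ks_statistic(reference, current) > threshold


# Hypothetical feature: customer age at training time vs. in production.
training_ages = list(range(20, 60))
stable_ages = list(range(20, 60))
drifted_ages = [age + 15 for age in training_ages]

print(should_alert(training_ages, stable_ages))
print(should_alert(training_ages, drifted_ages))
```

A drift alert on the inputs does not prove that the model has degraded; it is a signal to check the model's behavior and decide whether retraining is needed.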
Blockchain 101, the tech behind all the hype and speculation
- Speaker: Gauthier LEONARD
- Duration: 1h30
- Format: talk + demo
- Schedule: Friday, September 24th, 2021 at 10:00
Cryptocurrencies are booming in 2021, with a market cap moving from 750 to 2,400+ billion dollars. Let’s face it, this is mainly due to speculation. A lot of people involved do not actually have a clue of what is behind the tokens they invest in.
But if we put that aside and look at the technical fundamentals, we can acknowledge that 2020-2021 are bringing along a bunch of new blockchains with better, faster (viable?) consensus mechanisms, like Proof of Stake (PoS).
Before going into those, we need to take a step back: Why decentralization? What is a blockchain? a cryptocurrency? Proof of Work? a block? a wallet? a smart contract? the Ethereum Virtual Machine (EVM)? an ERC-20? an NFT? Decentralized Finance (DeFi)?
This presentation will go through all the main aspects of blockchain technology and a bit more!
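For readers who want a concrete picture of a block, a chain, and Proof of Work before the talk, here is a deliberately simplified Python sketch. It does not reproduce any real blockchain: the block layout, the JSON encoding, the helper names, and the difficulty of two leading zeros are assumptions made for illustration.

```python
import hashlib
import json
from typing import Dict, List


def block_hash(index: int, previous_hash: str, data: str, nonce: int) -> str:
    """Hash the block content; any change to a field changes the hash."""
    payload = json.dumps([index, previous_hash, data, nonce]).encode()
    return hashlib.sha256(payload).hexdigest()


def mine_block(index: int, previous_hash: str, data: str, difficulty: int = 2) -> Dict:
    """Proof of Work: try nonces until the hash starts with enough zeros."""
    prefix = "0" * difficulty
    nonce = 0
    while True:
        digest = block_hash(index, previous_hash, data, nonce)
        if digest.startswith(prefix):
            return {"index": index, "previous_hash": previous_hash,
                    "data": data, "nonce": nonce, "hash": digest}
        nonce += 1


def is_valid_chain(chain: List[Dict], difficulty: int = 2) -> bool:
    """Check every hash, the work done, and the link to the previous block."""
    prefix = "0" * difficulty
    for position, block in enumerate(chain):
        expected = block_hash(block["index"], block["previous_hash"],
                              block["data"], block["nonce"])
        if block["hash"] != expected or not expected.startswith(prefix):
            return False
        if position > 0 and block["previous_hash"] != chain[position - 1]["hash"]:
            return False
    return True


# Build a tiny chain of three blocks with made-up transactions.
chain = [mine_block(0, "0" * 64, "genesis")]
for index, data in enumerate(["alice pays bob 5", "bob pays carol 2"], start=1):
    chain.append(mine_block(index, chain[-1]["hash"], data))

print(is_valid_chain(chain))
```

Because each block stores the hash of its predecessor, rewriting an old block invalidates its own hash and every link after it, unless all the work is redone.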
Disclaimer: this is not financial advice.
WasmEdge, cloud native WebAssembly runtime for edge computing
- Speaker: Guillaume BOUTRY
- Duration: 45min
- Format: talk + demo
- Schedule: Friday, September 24th, 2021 at 11:00
WebAssembly (Wasm) solves many security challenges by design, and lots of projects benefit from using it.
WasmEdge runtime is an efficient Virtual Machine optimized for edge computing. Its main use cases are:
- Jamstack apps, through a static front end with a serverless backend (FaaS)
- Automobiles
- IoT and Stream processing
It is an embeddable virtual machine that can be used as a process, in a process, or orchestrated as a native container (providing an OCI compliant interface).
This talk covers the key features of this project and concludes with a demonstration project.
Azure Log Analytics to query and to analyze data in cloud and on-premises environments
- Speaker: Claire PLAYE
- Duration: 1h
- Format: talk + use case
- Schedule: Friday, September 24th, 2021 at 12:00
All systems and applications produce log files that contain essential information to detect issues, errors, and trends. The centralization of logging data sourced from a large variety of infrastructures and applications provides more than just a holistic view. It is used to find security breaches, to understand user behavior on applications, for real-time monitoring and alerting, and also to make operational decisions.
Microsoft Azure Log Analytics is a Microsoft tool integrated into the Azure platform. It collects and stores data from various log sources. A read-only query language named Kusto is used to process data and return results, visual reports named “Workbooks” can be created, and specific alerts on identified patterns can be configured through queries.
This presentation provides an overview of Log Analytics and its different tools. I also present a use case of a real-life implementation for real-time monitoring and alerting with Data Factory, Power BI, and Databricks.
Azure Purview, the SaaS Data Catalog service proposed by Microsoft
- Speaker: Jules HAMELIN-BOYER
- Duration: 1h
- Format: talk + demo
- Schedule: Friday, September 24th, 2021 at 14:30
Managing data governance is similar to reading the Terms of Service of a new service when signing up: everyone should do it, yet few are willing to make the effort from end to end.
In public preview at the moment, Azure Purview is a service that enables discovery, governance, and mapping of sources on Microsoft’s cloud platform.
Based on the Apache Atlas API, the tool helps build a broad understanding of the data environment. Labeling sensitive data, granting access to Data Stewards, assigning Data Experts to a specific data set, exposing lineage between data flows… The services proposed are numerous, and each of them answers specific requirements.
In this presentation, the discussion will explore the current state of Azure Purview. From the glossary module to the scan automation, an in-depth walkthrough of the Data Catalog is proposed. To better understand the implementation of the tool from a practical perspective, a demo of a user-defined solution for Databricks Lineage is performed.
HBase RegionServer collocation
- Speaker: Pierre BERLAND
- Duration: 1h
- Format: talk + demo
- Schedule: Friday, September 24th, 2021 at 15:30
Nowadays, many companies still have on-premises infrastructures to manage their data. Their bare-metal servers thus benefit from the entire memory space of their system. Among these companies, those relying on HBase are wasting resources: each RegionServer is deployed on its own worker node to take full advantage of scalability, which wastes power and money.
RegionServers are the processes that manage the storage and retrieval of data in Apache HBase. Since the JVMs they use are capped at 30GB, would it be possible to put several RegionServers on a single machine? Every machine could easily host more RegionServers in order to exploit the available RAM. Consequently, we could pull some machines out of the cluster while still meeting performance requirements, which would drastically reduce license costs, as they depend directly on the number of nodes used.
In this presentation, we will dive into the study I conducted on this topic and present its outcomes.
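The memory reasoning behind collocation can be sketched as back-of-the-envelope arithmetic. Only the 30GB heap cap comes from the abstract; the node size, the memory reserved for the OS and other services, the per-process overhead, and the cluster size below are hypothetical, and the sketch ignores CPU, disk, and network, which a real study has to measure.

```python
import math


def regionservers_per_node(node_ram_gb: int, heap_gb: int = 30,
                           reserved_gb: int = 16, overhead_gb: int = 2) -> int:
    """How many RegionServer JVMs fit in memory on one node.

    reserved_gb (OS and other services) and overhead_gb (non-heap memory
    per JVM) are assumptions, not measured values.
    """
    usable_gb = node_ram_gb - reserved_gb
    if usable_gb <= 0:
        return 0
    return usable_gb // (heap_gb + overhead_gb)


def nodes_needed(total_regionservers: int, per_node: int) -> int:
    """Number of nodes required to host all RegionServers."""
    if per_node <= 0:
        raise ValueError("a node must host at least one RegionServer")
    return math.ceil(total_regionservers / per_node)


# Hypothetical cluster: 40 RegionServers on nodes with 256GB of RAM.
per_node = regionservers_per_node(node_ram_gb=256)
print(per_node)
print(nodes_needed(40, 1))         # one RegionServer per node
print(nodes_needed(40, per_node))  # collocated RegionServers
```

Memory alone gives an upper bound on the savings; whether performance requirements are still met is exactly the question the study addresses.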
Containerized deep learning environment with Docker and nvidia-docker
- Speaker: Robert SOARES
- Duration: 1h
- Format: talk + demo
- Schedule: Friday, September 24th, 2021 at 16:30
In this world where artificial intelligence coexists more and more with us, it is important to understand how it works.
There are mainly two types of artificial intelligence: machine learning and deep learning. The latter is part of a broader family of machine learning methods based on artificial neural networks with representation learning.
In the context of deep learning, where operations are essentially matrix multiplications, GPUs (Graphics Processing Units) prove to be more efficient than CPUs. This is why the use of GPUs has grown in recent years. Indeed, the GPU is considered the heart of deep learning because of its architecture.
However, in practice, how do you use and communicate with your GPUs from your Python/R code? Technologies, including CUDA and cuDNN, emerged to communicate easily and efficiently with a GPU. Deep learning libraries like TensorFlow and Keras rely on these technologies.
This talk explains how to set up a containerized deep learning environment for data scientists based on Nvidia GPUs and how all its different technologies are interwoven.
Modern analytics on Azure, challenges and lessons learned
- Speaker: Nabil MELLAL
- Duration: 1h
- Format: talk
- Schedule: Friday, September 24th, 2021 at 17:30
For 2 years now, we’ve been building an analytics and ML platform on Azure for a customer.
In this session, we share the decisions we’ve made, the architecture, technologies, design principles, and most of all the challenges faced and lessons learned.