Expected outcomes & behaviour Clause Samples
Expected outcomes & behaviour. As stated before, the expected outcomes for this module are:
• Dynamic and transparent energy optimization for running jobs. EAR will dynamically analyse running jobs and select the optimal frequency. HEROES will apply a default policy for non-expert users and will give energy-aware users the opportunity to select a specific energy optimization policy. In any case, system-specific settings will be taken into account, and energy optimization policies will only be applied where system limitations allow.
• Automatic job accounting: metrics gathered by EAR will be reported to the HEROES DB. Not all data will be reported, only what is needed for HEROES job accounting and for the decision module. Since a job can generate many signatures at runtime, only relevant data will be selected, such as one signature per application phase together with the phase duration.
• System monitoring: EAR will report system status and relevant metrics / events to the decision module to characterize the ecosystem.
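The accounting reduction described above ("one signature per application phase, plus the phase duration") can be sketched as follows. This is a hypothetical illustration, not the actual EAR schema: the field names (phase, timestamp, avg_power_w) and the "last signature per phase wins" rule are assumptions.

```python
# Hypothetical sketch: reduce a job's time-ordered EAR signatures to one
# representative record per application phase, for HEROES job accounting.
# Field names (phase, timestamp, avg_power_w) are illustrative assumptions.
from collections import OrderedDict

def select_accounting_signatures(signatures):
    """Keep the last signature seen for each phase and add the phase duration."""
    by_phase = OrderedDict()
    for sig in signatures:                      # signatures are time-ordered
        phase = sig["phase"]
        entry = by_phase.setdefault(phase, {"first_ts": sig["timestamp"]})
        entry.update(sig)                       # last signature wins
        entry["phase_duration_s"] = sig["timestamp"] - entry["first_ts"]
    return list(by_phase.values())

signatures = [
    {"phase": "compute", "timestamp": 0,  "avg_power_w": 240},
    {"phase": "compute", "timestamp": 30, "avg_power_w": 250},
    {"phase": "io",      "timestamp": 40, "avg_power_w": 120},
]
reduced = select_accounting_signatures(signatures)
```

Only the two reduced records (one per phase) would then be reported to the HEROES DB, instead of every runtime signature.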
Expected outcomes & behaviour. The Costs Module main objectives are to:
• Provide means to manage business models for applications / workflows and resources. The history of each business model will be kept for auditing.
• Provide centralized accounting information for billing purposes (keeping track of all consumed resources and their associated prices).
Expected outcomes & behaviour. 4.8.5.1 Log events / messages and store in central repository
Thanks to the many source plugins available for FluentD, a component of the HEROES platform will be able to log using native libraries (Python, Java, etc.) that interact directly with the FluentD td-agent (TreasureData Agent) [11] process. For system services, the rsyslog [34] source plugin will be used to forward any event / message that would normally only be sent to the OS (Operating System) logging facilities. The td-agent process will then forward the logs to the central repository, where Elasticsearch will tag, analyse and index the received data, which will then be ready for consultation, search, and further analysis with Kibana.
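The message shape a HEROES component would hand to td-agent can be sketched with stdlib only. This is a minimal illustration of Fluentd's forward-style [tag, time, record] event triple; a real deployment would use an official client library (e.g. fluent-logger for Python) and MessagePack encoding rather than JSON, and the tag and record fields below are assumptions.

```python
# Minimal sketch of a Fluentd forward-style event a HEROES component might
# emit. Real deployments use a native Fluentd client library and MessagePack;
# the tag "heroes.gateway" and record fields are illustrative assumptions.
import json
import time

def make_event(tag, record, timestamp=None):
    """Build a [tag, time, record] triple as used by Fluentd's forward protocol."""
    ts = int(timestamp if timestamp is not None else time.time())
    return [tag, ts, record]

event = make_event("heroes.gateway",
                   {"level": "info", "msg": "job submitted"},
                   timestamp=1700000000)
payload = json.dumps(event)   # JSON here only for readability
```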
4.8.5.2 HEROES User access to logged information
HEROES Users will be able to access only pre-digested information through the HEROES web UI. When a User requests a “Dashboard” page from the HEROES UI, the UI sends pre-configured requests to the Kibana API / interface through the Service Gateway. The Service Gateway checks the request to confirm / deny access for the requesting User. If access is granted, Kibana sends data & graphs back to the HEROES UI, which incorporates them into the User’s “Dashboard”. Only HEROES administrators will be able to access the raw logs from the Kibana direct interface. Kibana shall be configured with SSO with the HEROES platform.
Expected outcomes & behaviour. The Data Management Module will be the core of the following workflows.
4.4.5.1 Service Data Management Workflow
The Service Data Management workflow takes care of all automated User Data and Internal Data movement operations within the HEROES platform. The main component is RClone: using its features, the workflow first checks whether an exact copy of the source data already exists on the target. If not, the source data is transferred to the target. The operation result is then sent to the Service Gateway together with any error code or extended message.
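The check-then-copy step above can be sketched around two real rclone subcommands, `rclone check` (exits non-zero when source and target differ) and `rclone copy`. The wrapper function, remote names, and the injectable runner are assumptions for illustration, not HEROES code.

```python
# Hypothetical sketch of the Service Data Management check-then-copy step.
# "rclone check" and "rclone copy" are real subcommands; the wiring below
# (function names, remote paths, injectable runner) is illustrative only.
import subprocess

def build_check_cmd(src, dst):
    return ["rclone", "check", src, dst]

def build_copy_cmd(src, dst):
    return ["rclone", "copy", src, dst]

def sync_if_needed(src, dst, run=subprocess.run):
    """Return (copied, returncode): copy only when the check finds differences."""
    check = run(build_check_cmd(src, dst))
    if check.returncode == 0:            # exact copy already on the target
        return False, 0
    copy = run(build_copy_cmd(src, dst))
    return True, copy.returncode
```

The returned tuple maps naturally onto the result / error code that the workflow reports back to the Service Gateway.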
Expected outcomes & behaviour. The Application and Container Management & Orchestration Module main objective is to allow the creation, management, and orchestration of applications. The following types of incoming tasks are expected:
• Create / Delete application
• Update application
• Publish / Unpublish application
The application packager (organization application administrator or Application Provider) can create an application within the HEROES platform using a container. The application creation includes:
• The application information will be stored in a database within the Application and Container Management & Orchestration module.
• The storage of the application’s container will be managed by the Data Management module.
After its creation, the application becomes visible to its owner for modification or publication. The owner of an application must run functional tests with the application within the HEROES platform before each new publication. After publication, the cost and license of the published application will be available to the Cost Management module.
4.5.5.1 Application referencing flow within the platform
The Application referencing flow (as defined in Figure 21) starts with an “Application packager” creating a new HEROES application and ends with a new HEROES application in the platform:
• The “Application packager” sends a request for a new HEROES application through the UI or the API, including:
o A selected or uploaded container for the new application.
o A configuration for the application, including but not limited to: Name, Metadata and Parameters.
• The “Service Gateway” transfers the container registering request to the Service Data Management, which stores and tags the container.
• Finally, the Service Gateway references the application in the Application and Container Management & Orchestration module.
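The registration payload described in the flow above can be sketched as a small record type. All field names here (container_ref, metadata, parameters, published) are assumptions derived from the listed request contents, not the actual HEROES API.

```python
# Illustrative sketch of the application-registration payload an
# "Application packager" might send. Field names are assumptions based on
# the flow above (container, name, metadata, parameters), not HEROES code.
from dataclasses import dataclass, field

@dataclass
class ApplicationRequest:
    name: str
    container_ref: str                       # selected or uploaded container
    metadata: dict = field(default_factory=dict)
    parameters: dict = field(default_factory=dict)
    published: bool = False                  # publication is a later flow (4.5.5.2)

req = ApplicationRequest(name="openfoam",
                         container_ref="registry/openfoam:v10",
                         metadata={"domain": "CFD"})
```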
4.5.5.2 Application updating and publishing flow within the platform
The Application updating and publishing flow (as defined in Figure 22) starts with an “Application packager” updating an existing HEROES application and ends with an updated HEROES application in the platform:
• The “Application packager” sends a HEROES application update request through the UI or the API, including:
o The selected application name / id to identify the correct application.
o Optional modifications of the application configuration (Metadata and Parameters).
o Optional license, cost, and target as required parameters for the application publication.
o Optional application publishm...
Expected outcomes & behaviour. The Workflow & Job Management Module main objective is to allow the creation, management, and orchestration of custom workflows, both for engineering and for architecture management. The following types of tasks are expected:
• Create / Delete architecture / engineering / scientific workflows
• Run a specific workflow, interacting with the rest of the HEROES architecture
• Log all the states and changes of the workflow
• Give proper metadata about the workflow state and execution
The Workflow & Job Management main tool (Ryax) can easily create workflows for the HEROES platform using its own GUI or by importing them as code from an external repository. Workflow creation and later execution involve interaction with other modules of the HEROES infrastructure:
• The execution of the workflow can be triggered directly by the user through the User Interface via a web browser.
• The user can define the data used by the workflow through the HEROES web interface or APIs. The Workflow & Job Management module then interacts indirectly with the Data Management module so that the data is transferred to the target infrastructures and used by the different workflow tasks.
• All the workflow submission parameters are forwarded by the Service Gateway to the Workflow Management Module, which uses them through the Decision Module to obtain a placement strategy for the workflow.
• The user chooses whether or not to continue with the workflow execution via a web browser using the User Interface.
• When the target infrastructure is selected, the Deployment and Integration Module deploys the required resources if needed.
• The toolset is selected from an available Singularity container in the Application and Container Management & Orchestration module.
• The job is submitted to the Resource Providers using the Submission / Transfer module.
• The job is monitored using the Submission / Transfer Module and the obtained metrics are placed in the Telemetry / Accounting Module.
• The metadata retrieved from the execution and workflow status is saved in the
4.6.5.1 Internal architecture workflow
The Workflow & Job Management module can also store and execute workflows outside the engineering scope, adding extra functionality for managing internal processes of the HEROES architecture. These workflows can be automated or executed manually by HEROES administrators, performing the tasks needed to maintain the platform or adding extra functionalities. An interna...
Expected outcomes & behaviour. The Identity Management module will be responsible for handling authentication and authorization control, whether for internal requests made by HEROES’ services or for external requests from a user using the published APIs. The following workflow schemas present the expected behaviour of the module when confronted with a user’s request to log in to the HEROES platform and then call one of the published APIs.
4.1.5.1 Authentication Workflow
The Authentication Workflow (as defined in Figure 6) starts with a “user’s” authentication request and ends with the authentication of the user:
• “Users” (user or provider accounts) send an authentication request with valid credentials to the Service Gateway.
• The Service Gateway transfers the request to Keycloak.
• Keycloak returns a token to propagate the “User” identity if the credentials are valid.
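The request the Service Gateway forwards to Keycloak can be sketched against Keycloak's standard OpenID Connect token endpoint (`/realms/{realm}/protocol/openid-connect/token`, direct-grant flow). The realm name, client id, and credentials below are placeholder assumptions, and a real gateway would use a client library rather than hand-built forms.

```python
# Sketch of the token request the Service Gateway would forward to Keycloak.
# The endpoint path is Keycloak's standard OIDC token endpoint; the realm,
# client id, and credentials shown are placeholder assumptions.
def build_token_request(base_url, realm, client_id, username, password):
    url = f"{base_url}/realms/{realm}/protocol/openid-connect/token"
    form = {
        "grant_type": "password",        # direct-grant (resource owner) flow
        "client_id": client_id,
        "username": username,
        "password": password,
    }
    return url, form

url, form = build_token_request("https://idm.example.org", "heroes",
                                "heroes-ui", "alice", "s3cret")
```

On success Keycloak answers this POST with a JSON body containing the access token that propagates the "User" identity.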
Expected outcomes & behaviour. The Deployment & Integration Module main objective is to implement / execute tasks sent from the Workflow Management module to ensure that resources are available for receiving ML/HPC jobs. The following types of incoming tasks are expected:
• Deploy / undeploy cloud compute resources
• Check whether HPC Centre HEROES resources and services are active and correctly configured
The Deployment & Integration Module will also contain a library of Ansible [13] playbooks that will be used to install and configure the HEROES Runtime both on cloud platforms and in the HPC Centres.
4.2.5.1 Deployment workflows
The Cloud Platform Deployment workflow will take care of:
• Starting / stopping compute nodes on existing Private Cloud Clusters within an Organization scope.
• Deploying / undeploying temporary compute nodes for single users and for organizations without an existing Private Cloud Cluster. (Please note that this covers all types of compute nodes, including but not limited to CPU nodes, GPU nodes and Remote Visualization nodes.)
The above tasks will be accomplished using the Cloud Platform CLI and APIs, mainly through scripts. The scripts will be exposed to the Workflow module either directly or via APIs.
The HPC Centre Compliance Check workflow will take care of:
• Checking that the HEROES Runtime is active on the HPC Centre Headnode (see section 4.2.1).
• Checking that the HEROES Runtime is configured as expected.
The above tasks will be accomplished using Ansible roles, containerized applications and services, and scripts. The Ansible roles will be exposed to the Workflow module either directly or via APIs.
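The compliance-check invocation can be sketched as a wrapper that builds an `ansible-playbook` command limited to one headnode and run in `--check` (dry-run) mode, so drift is reported without changing anything. The playbook name and inventory conventions are assumptions, not HEROES artefacts; `--limit` and `--check` are real ansible-playbook options.

```python
# Hypothetical sketch of how the compliance-check workflow might invoke an
# Ansible playbook against an HPC Centre headnode. The playbook name is an
# assumption; "--limit" and "--check" are real ansible-playbook flags.
import shlex

def build_compliance_check(headnode, playbook="heroes_runtime_check.yml"):
    """Return an ansible-playbook command limited to the given headnode."""
    return ["ansible-playbook", playbook,
            "--limit", headnode,
            "--check"]               # dry-run: report drift, change nothing

cmd = build_compliance_check("hpc1-headnode")
printable = shlex.join(cmd)          # shell-quoted form for logging
```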
Expected outcomes & behaviour. The decision module will operate in four successive steps:
1. Parametrization by each organization Within an organization, the managers define a list of platforms available for each workflow. They also either choose an optimization strategy (e.g., best speed, best price, lowest power consumption with configurable weights) or allow each user to select his own strategy.
2. Prediction of job resources and cost For each job, accounting logs and energy data are used to predict the “optimum” job resources (memory, execution time) as well as its power consumption. The cost API will then give an estimated price for executing the submitted workflow on the available platforms using the predicted resources.
3. Selection of the best platform Based on the selected strategy and other constraints (data localization, available licenses, GPUs…), the decision module will then select the “best” platform to execute the submitted workflow. This platform will be proposed to the user as the “best” solution, but the user will still be able to choose a different platform.
4. Workflow execution monitoring During the workflow execution (only if the workflow is composed of multiple jobs), the decision module will check at each step whether the selected platform is still the best one (if, by the end of the workflow, the selected platform is overloaded, it will be wise to execute the last jobs of the workflow on a different platform).
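Steps 1-3 above can be condensed into a small worked sketch: the organization's configurable weights (step 1) are applied to the predicted price, runtime, and energy per platform (step 2) to score and rank candidates (step 3). The platform numbers and weight names are illustrative assumptions, not HEROES data.

```python
# Worked sketch of step 3: score candidate platforms with the configurable
# strategy weights from step 1, applied to the predictions from step 2.
# Platform figures and weight names are illustrative assumptions.
def score(platform, weights):
    """Lower is better: a weighted sum of estimated price, runtime and
    energy (units left unnormalized for brevity)."""
    return (weights["price"]  * platform["est_price_eur"] +
            weights["speed"]  * platform["est_runtime_h"] +
            weights["energy"] * platform["est_energy_kwh"])

def select_platform(platforms, weights):
    return min(platforms, key=lambda p: score(p, weights))["name"]

platforms = [
    {"name": "hpc-centre", "est_price_eur": 10, "est_runtime_h": 1, "est_energy_kwh": 5},
    {"name": "cloud-a",    "est_price_eur": 6,  "est_runtime_h": 3, "est_energy_kwh": 4},
]
cheapest = select_platform(platforms, {"price": 1.0, "speed": 0.0, "energy": 0.0})
fastest  = select_platform(platforms, {"price": 0.0, "speed": 1.0, "energy": 0.0})
```

Re-running the same scoring at each workflow step, with refreshed load figures, gives the monitoring behaviour of step 4: the remaining jobs simply go to whichever platform currently scores best.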
