Accepted Posters

  1. Performance comparison of the Numerical Flow Iteration to Lagrangian and Semi-Lagrangian approaches for solving the Vlasov equation in the six-dimensional phase-space
    Rostislav-Paul Wilhelm, Manuel Torrilhon
    Abstract: The Vlasov system arising from kinetic theory and used for modelling high-temperature plasma dynamics is known for being both high-dimensional and exhibiting turbulence due to the non-linearity of the system. To solve the Vlasov system, one requires both advanced discretization techniques and the use of high-performance computing.
    Classical grid-based and particle-based approaches are heavily memory-bound and only allow for low resolutions in the full six-dimensional case, which is often insufficient to capture the correct dynamics. Additionally, the high memory footprint complicates parallelization, and MPI communication leads to sub-optimal scaling.
    The authors recently suggested an alternative approach: the Numerical Flow Iteration (NuFI). In this scheme one reduces the memory footprint by several orders of magnitude and trades it for more computation on the fly. In theory, this increases the flop/byte ratio and allows for higher-resolution simulations than were possible with classical approaches. Additionally, NuFI preserves the solution structure and thereby potentially reproduces the dynamics more accurately.
    In this work we want to investigate how NuFI compares to other state-of-the-art approaches using benchmarks in the six-dimensional case. In particular, we want to discuss scaling results on high-performance hardware and advantages as well as limitations of the different approaches.
    Extended abstract, Poster PDF
  2. Condition Number Estimation in a Solution Process of a Large and Sparse Linear System
    Yuya Kudo, Yuki Satake, Takeshi Fukaya, Takeshi Iwashita
    Abstract: Systems of linear equations with large sparse coefficient matrices arise in a wide range of applications. In these applications, Krylov subspace iterative methods such as the Conjugate Gradient (CG) method are typically used to solve the linear system. In an iterative method, the accuracy of a numerical solution is typically evaluated by the relative residual norm. From an application viewpoint, however, the (relative) error norm, which more directly represents the accuracy of the computed approximate solution vector, is more important than the residual norm. Although it is difficult to calculate the relative error norm, it can be bounded by the product of the condition number and the relative residual norm. This fact indicates the usefulness of estimating the condition number of a coefficient matrix, and we consider two condition number estimation methods under the assumption that we solve a linear system with a large sparse symmetric positive-definite matrix. One of the estimation methods is the Lanczos method, which computes both the largest and smallest eigenvalues. The other method is based on error vector sampling (ES), which can be used for finding the smallest eigenvalue; in this method, the largest eigenvalue is calculated using the power method. We conduct a numerical test using a large sparse matrix with more than 500,000 dimensions from a matrix database to evaluate the performance of both methods. (A minimal illustrative sketch of this error bound follows this entry.)
    Extended abstract, Poster PDF
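The bound used above, relative error <= kappa(A) x relative residual, can be illustrated with a few lines of SciPy. The sketch below uses a small 1D Laplacian and ARPACK's Lanczos-based eigsh as a stand-in for the estimation methods studied in the poster; the matrix, sizes, and iteration limits are illustrative assumptions, not the authors' test case.

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import cg, eigsh

# Small SPD test matrix (1D Laplacian); a real test would load a much larger
# matrix from a collection such as the SuiteSparse Matrix Collection.
n = 1000
A = sp.diags([-1.0, 2.0, -1.0], [-1, 0, 1], shape=(n, n), format="csr")
b = np.ones(n)

# Approximate solution: CG stopped early, so a nonzero residual remains.
x_hat, _ = cg(A, b, maxiter=200)

# Largest/smallest eigenvalues via Lanczos (shift-invert for the smallest).
lam_max = eigsh(A, k=1, which="LA", return_eigenvectors=False)[0]
lam_min = eigsh(A, k=1, sigma=0, which="LM", return_eigenvectors=False)[0]
kappa = lam_max / lam_min

# Relative error bound:  ||x - x_hat|| / ||x||  <=  kappa * ||r|| / ||b||.
rel_res = np.linalg.norm(b - A @ x_hat) / np.linalg.norm(b)
print(f"estimated condition number: {kappa:.3e}")
print(f"relative error bound      : {kappa * rel_res:.3e}")
```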
  3. IC(p) preconditioning for large symmetric linear equations
    Koki Masui, Fumihiko Ino
    Abstract: Iterative methods are often used to solve large-scale linear equations, such as those arising in high-frequency electromagnetic field analysis. However, as the matrix size becomes larger, the convergence of iterative methods such as Krylov subspace methods becomes worse, and it takes a long time to solve the equations. Preconditioning methods such as Incomplete Cholesky preconditioning with fill-in (IC(p)) [2] are used to improve the convergence of iterative methods, where p is the level of fill-in. While increasing the fill-in level generally improves convergence, the number of non-zero elements in the preconditioning matrix increases by a factor of several, sometimes more than ten, for each additional level, which can exhaust memory in huge problems. Therefore, in this study, we present a method that controls the level of fill-in independently of the matrix values, and we conduct numerical experiments. As a result, our proposed method succeeded in controlling the number of non-zero elements and reducing the calculation time compared to IC(0) and IC(1). (A minimal sketch of the level-of-fill rule follows this entry.)
    Extended abstract, Poster PDF
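The level-of-fill rule that defines the IC(p) pattern can be written in a few lines. The sketch below is a dense, purely symbolic illustration on a small 2D Laplacian (an assumed toy example, not the authors' implementation or electromagnetic test problem): entries of A start at level 0, a fill-in created through pivot k gets level lev(i,k) + lev(k,j) + 1, and only entries with level <= p are kept.

```python
import numpy as np

INF = 10**6  # "infinite" level for structurally zero entries

def icp_pattern(A, p):
    """Symbolic IC(p)/ILU(p) pattern: keep entries whose level of fill is <= p."""
    n = A.shape[0]
    lev = np.where(A != 0, 0, INF)        # original nonzeros start at level 0
    for k in range(n):                     # pivot columns, processed in order
        for i in range(k + 1, n):
            if lev[i, k] > p:              # dropped multipliers create no fill
                continue
            for j in range(k + 1, n):
                lev[i, j] = min(lev[i, j], lev[i, k] + lev[k, j] + 1)
    return np.tril(lev <= p)               # lower-triangular pattern for IC(p)

# 2D 5-point Laplacian on a 6x6 grid: higher p admits progressively more fill.
m = 6
T = np.diag([2.0] * m) + np.diag([-1.0] * (m - 1), -1) + np.diag([-1.0] * (m - 1), 1)
A = np.kron(np.eye(m), T) + np.kron(T, np.eye(m))
for p in (0, 1, 2):
    print(f"IC({p}): {int(icp_pattern(A, p).sum())} stored lower-triangular entries")
```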
  4. Performance evaluation of multi-level Parareal method
    Koki Hiromi, Akihiro Fujii, Teruo Tanaka
    Abstract: Computer performance is improving these days by incorporating parallelism. Simulations are generally parallelized in the spatial direction, but this has limitations and does not take full advantage of computer performance. Parallelism in the time direction for computer simulation has attracted research attention, and the area is actively studied. Parareal is a popular algorithm, and it is usually implemented as a 2-level algorithm. The 2-level algorithm has the problem of overhead from the parts that cannot be parallelized. We attempted to remedy this problem by performing the coarse calculation again and calling the Parareal method recursively. The three-level implementation of the Parareal method with recursive calls was effective in reducing sequential time. However, it was not faster than the 2-level Parareal method. We will proceed with performance analysis for problems with a larger number of time steps and different size ratios between levels, and present the results in the poster. (A minimal sketch of the basic Parareal correction follows this entry.)
    Extended abstract, Poster PDF
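The two-level Parareal iteration that the multi-level variant builds on can be sketched in a few lines. The example below applies the standard correction U_{k+1}^{n+1} = G(U_{k+1}^{n}) + F(U_k^{n}) - G(U_k^{n}) to a scalar test ODE with backward-Euler propagators; the ODE, step counts, and iteration counts are illustrative assumptions, and the fine-propagator evaluations marked below are the part a real implementation runs in parallel.

```python
import numpy as np

lam, T, N, K = -1.0, 2.0, 10, 5        # ODE du/dt = lam*u, horizon, time slices, iterations
dt = T / N

def G(u, dt):                           # coarse propagator: one backward Euler step
    return u / (1.0 - lam * dt)

def F(u, dt, m=100):                    # fine propagator: m backward Euler sub-steps
    for _ in range(m):
        u = u / (1.0 - lam * dt / m)
    return u

U = np.empty(N + 1)
U[0] = 1.0
for n in range(N):                      # initial sequential coarse sweep
    U[n + 1] = G(U[n], dt)

for k in range(K):                      # Parareal iterations
    F_old = [F(U[n], dt) for n in range(N)]   # independent -> parallel in practice
    G_old = [G(U[n], dt) for n in range(N)]
    for n in range(N):                  # sequential coarse correction sweep
        U[n + 1] = G(U[n], dt) + F_old[n] - G_old[n]

print(f"Parareal: {U[-1]:.6f}   exact: {np.exp(lam * T):.6f}")
```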
  5. Near kernel component setting method using iteration matrix in SA-AMG method
    Hiromichi Sakuta, Akihiro Fujii, Teruo Tanaka
    Abstract: Analysis by computer simulation comes down to solving simultaneous linear equations Ax=b, and it is important to solve them fast and stably. One method for solving large-scale simultaneous linear equations is the SA-AMG method. This method coarsens the problem matrix in a hierarchical manner to obtain a solution and is capable of solving large-scale problems at high speed. However, its effectiveness depends on the problem setup. Conventionally, it is known that the convergence can be improved by setting the near-kernel vector (zero-eigenvalue component) [1]. In this study, we set the largest eigenvalue component of the transformation matrix applied to the error vector in the smoother as the component that is difficult to converge, and we aim to verify its effectiveness.
    Extended abstract, Poster PDF
  6. An investigation of parallel performance of block epsilon-circulant preconditioner for time-dependent PDEs
    Ryo Yoda, Matthias Bolten
    Abstract: This work investigates the parallel performance of a memory-distributed implementation of block epsilon-circulant (BEC) preconditioning with MPI. This method is a promising parallel-in-time approach for all-at-once linear systems arising from time-dependent PDEs in massively parallel environments. The BEC preconditioner introduces a weighting parameter epsilon into the block circulant preconditioner and, with sufficiently small epsilon, achieves convergence independent of the spatial mesh size. However, its parallel performance has not been fully investigated. This work presents parallel results for convection-diffusion problems.
    Extended abstract, Poster PDF
  7. Digital Twin: An Autonomous System on Public Utility of Chemical Fiber Factory
    Jerry Chen, Rick Chang, Jiann-Shing Shieh
    Abstract: A digital twin is a virtual representation of a physical entity, with the physical entity and its virtual counterpart linked by data. This allows the user to explore much more of the potential information of the physical environment with computational techniques at minimal cost. Applications of digital twins include real-time monitoring, simulation, prediction, optimization, etc. The data-driven model is the core of this approach; an accurate depiction of the physical world enables the virtual part to meet the application requirements. Also, in order to maintain the accuracy of the model, the digital twin has to be governed by a system structure that deals with uncertainty. This research proposes an autonomous system architecture and its application to the public utility of a chemical fiber factory to build a digital twin.
    Extended abstract, Poster PDF
  8. Accurate and Fast Monocular 3D Object Detection with Adaptive Feature Aggregation Centric Enhance Network
    Peng-Wei Lin, Chih-Ming Hsu
    Abstract: Three-dimensional (3D) object detection is crucial in autonomous driving. Monocular 3D object detection has become a popular area of research in autonomous driving because of its ease of deployment and cost-effectiveness. In real-world applications of autonomous driving, a detector must be both real-time and accurate, which can be achieved using deep learning. A one-stage center-based object detector is suitable for real-world applications. However, in center-based object detectors, object-centric estimation plays an important role because it significantly influences detection results. To address this issue, we propose a real-time monocular 3D object detection neural network called the adaptive feature aggregation centric enhance network. The model is an anchor-free, center-based method. To enhance accuracy while maintaining inference speed, we propose an adaptive feature aggregation network that aggregates multiscale features with weighting. In addition, we propose a centric enhance module for heatmap prediction to improve the accuracy of object localization and classification. Our model achieves 35 frames per second on an Nvidia RTX3070 accelerator. Extensive experiments on the KITTI benchmark demonstrate that our method achieves good mean average precision (mAP) for small objects.
    Extended abstract, Poster PDF
  9. Graph500 benchmark with automatic performance tuning
    Masahiro Nakao, Koji Ueno, Katsuki Fujisawa, Yuetsu Kodama, Mitsuhisa Sato
    Abstract: In various fields such as social networks and drug discovery, there are many attempts to represent data relationships as graph structures and analyze them on computers at high speed. We have been developing breadth-first search (BFS) for the Graph500 benchmark, applying various techniques, and have achieved the world's top performance on the Graph500 list (https://graph500.org) as of the time of writing (October 2023). However, since most existing research, including our study, targets specific graphs and computer systems, the burden of performance tuning on users has become an issue. Therefore, this study develops an automatic performance tuning function that automatically determines the optimal parameters for BFS in the Graph500 benchmark. Finally, the automatic tuning achieved about a 27% performance improvement, from 113,146 GTEPS to 143,487 GTEPS.
    Extended abstract, Poster PDF
  10. An Evaluation of Discontinuous Galerkin Method based Global Nonhydrostatic Atmospheric Dynamical Core on A64FX Platform
    Xuanzhengbo Ren, Yuta Kawai, Hirofumi Tomita, Takahiro Katagiri, Seiya Nishizawa, Tetsuya Hoshino, Masatoshi Kawai, Toru Nagai
    Abstract: For future high-resolution atmospheric simulations, a dynamical core using the discontinuous Galerkin method (DGM), called SCALE-DG, is being developed as a high-order fluid scheme option in the SCALE library. Since the spatial discretization is done locally, we expect the computational performance to be highly desirable on modern computer architectures. In this study, we evaluated the scalability and single-process performance of SCALE-DG on the Fujitsu A64FX-based supercomputers Fugaku and Flow. Results show that SCALE-DG performed excellently in strong and weak scaling, while a load-imbalance issue was observed in one of the essential routines when running with the horizontally explicit vertically implicit (HEVI) temporal discretization scheme. The cause of the load imbalance and other performance issues are discussed.
    Extended abstract, Poster PDF
  11. Efficient Sample Exchange for Large-Scale Training Distributed Deep Learning with Local Sampling
    Truong Thao Nguyen, Yusuke Tanimura
    Abstract: Stochastic gradient descent (SGD) is the most prevalent algorithm for training Deep Neural Networks (DNNs). SGD iterates over the input data set in each training epoch, processing data samples in a random-access fashion (global shuffling). Because this puts enormous pressure on the I/O subsystem, the most common approach to distributed SGD in HPC environments is to replicate the entire dataset to node-local SSDs. However, due to rapidly growing data set sizes, this approach has become increasingly infeasible. An alternative is to partition the dataset among workers, i.e., each worker uses the same part of the dataset for all the epochs (known as local shuffling). Our prior work [Nguyen et al., IPDPS 2022] showed that local shuffling cannot achieve validation accuracy similar to that of the default global shuffling strategy in large-scale training. We therefore proposed a novel partial-local shuffling strategy that randomly exchanges only a proportion of the dataset among workers in each epoch. Through extensive experiments on up to 2,048 GPUs of ABCI, we demonstrated that the validation accuracy of global shuffling can be maintained when carefully tuning the partially distributed exchange. However, exchanging the samples randomly between workers leads to a personalized all-to-all communication pattern, which is sensitive to network congestion when scaling up. In this study, we propose a scalable exchange strategy.
    Extended abstract, Poster PDF
  12. Fast adjacent communication with RDMA, MPI-RMA, and double buffering
    Kota Yoshimoto, Akihiro Fujii, Teruo Tanaka
    Abstract: In large-scale scientific computing programs, parallelization by MPI communication is generally used. MPI is convenient because it can be executed on many computers. However, inter-process communication often becomes a bottleneck on highly parallel computers. There is an interface called RDMA (Remote Direct Memory Access) to reduce the delay caused by communication. In this poster presentation, we evaluate adjacent communication routines implemented with MPI-Isend/Irecv, MPI-RMA, RDMA, and a version with double buffering. MPI-RMA is effective when the number of processes per node is small. However, when there are many processes per node, RDMA with double buffering is also effective. We will present numerical experiment results with different inter-process topologies and different message sizes in the poster.
    Extended abstract, Poster PDF
  13. An optimisation of particle information exchange using one-sided communication for the MPS method
    Aoto Abe, Kazumi Kayajima, Dai Wada, Takaaki Miyajima
    Abstract: The Moving Particle Simulation (MPS) method is one of the computational methods for simulating fluid behaviour, classified as a particle-based method. It can simulate phenomena such as large deformation and separation of fluids more easily than stencil-based methods. Dynamic domain decomposition is inevitable for large-scale simulations on distributed-memory systems to balance the load of each process. Particles move during the computation, so the particles needed for a calculation may exist in another sub-region owned by another process. In this case, inter-process communication of particle data is required. As a pre-processing step of this particle data communication, each process must be informed how many particles will be transferred from where to where. Collective communication is often used for this step as a naive implementation. When collective communication is used, the communication time becomes a scaling bottleneck since the elapsed time increases rapidly in proportion to the number of processes. We propose a technique to reduce the amount of communication using one-sided communication while taking the characteristics of particle movement into account. Specifically, we use the one-sided communication MPI_Put instead of the collective communication MPI_Alltoall to shorten the time for exchanging this information. A test program modelling the particle information exchange was developed and used to evaluate the proposed technique. The results show that the proposed technique reduces the communication time to 1/3.63 of the original with 36 processes on six nodes and to 1/4.38 with 72 processes on nine nodes. (A minimal one-sided-communication sketch follows this entry.)
    Extended abstract, Poster PDF
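The replacement of the MPI_Alltoall pre-step with one-sided writes can be sketched with mpi4py. In the toy example below, each rank announces its send counts by putting them directly into the matching slot of each neighbour's window; the ring neighbourhood and counts are made-up placeholders, not the authors' MPS decomposition.

```python
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

# recv_counts[src] will end up holding the number of particles arriving from rank src.
recv_counts = np.zeros(size, dtype=np.int32)
win = MPI.Win.Create(recv_counts, disp_unit=recv_counts.itemsize, comm=comm)

# Toy send pattern: each rank sends a few particles to its two ring neighbours only.
neighbours = [(rank - 1) % size, (rank + 1) % size]
send_counts = {dst: np.array([10 + rank], dtype=np.int32) for dst in neighbours}

win.Fence()                              # open the access/exposure epoch
for dst, cnt in send_counts.items():
    win.Put(cnt, dst, target=rank)       # write our count into slot `rank` at dst
win.Fence()                              # close the epoch; recv_counts is now valid

print(f"rank {rank}: counts received per source rank = {recv_counts}")
win.Free()
```

Run, for example, with `mpiexec -n 4 python put_counts.py` (the file name is arbitrary); each rank only touches the windows of ranks it actually sends to, instead of participating in an all-to-all.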
  14. Performance evaluation of a computer cluster for the realization of submesoscale-resolved earth system models
    Rin Irie, Helen Stewart, Tetsuya Fukuda, Tsuneko Kura, Masaki Hisada
    Abstract: Earth system models (ESMs), which model the complex interactions between physical and biological systems in the climate, are among the most computationally demanding applications of HPC today. In the 6th Coupled Model Intercomparison Project (CMIP6), the current finest horizontal resolution for ESMs computed globally is on the mesoscale (O(10^4)-O(10^6) m). With the goal of resolving ESMs on the submesoscale (O(10^3) m), we construct a computer cluster capable of computing ocean simulations at resolutions of up to O(10^3) m. We evaluate the parallel performance and usefulness of the constructed cluster using the ocean physics component of MITgcm, a popular ocean general circulation model. In this evaluation, we measure the computation time to validate strong scaling under a range of conditions, including the number of file I/O processes and the MPI communication method. As a result, we confirm that the speed-up in computation time is maintained for up to 384 processes when using RDMA. In the poster, the benchmarking results and a comparison between the cluster and the supercomputer Fugaku will be discussed in more detail.
    Extended abstract, Poster PDF
  15. An energy-aware job scheduling method supporting on-demand job execution
    Daiki Nakai, Keichi Takahashi, Yoichi Shimomura, Hiroyuki Takizawa
    Abstract: There is a growing demand to execute on-demand jobs with deadlines on a publicly shared HPC system. There are two potential situations, however, in which a node is unavailable for on-demand job execution upon a job request. One is that the node is running another job, and the other is that the node is in a low-power mode or powered off. In the former case, the running job must be suspended, and in the latter case, the node needs to be started up. This work proposes a job scheduling method for selecting whether to suspend running jobs or to start up sleeping nodes when a sufficient amount of resources is unavailable to complete an on-demand job in time. In addition, by estimating the time required to suspend and resume a job, the proposed method considers the optimal combination of jobs to be suspended so as to minimize the time for suspending the jobs. Our evaluation indicates that the deadline achievement rate is improved by 39% thanks to the proper selection of running jobs to be suspended for on-demand jobs with strict deadlines. In the case of on-demand jobs with loose deadlines, starting up nodes successfully improves the execution efficiency of normal jobs by 18%, and suppresses the increase in power consumption by 0.26% while maintaining a 100% deadline achievement rate.
    Extended abstract, Poster PDF
  16. Slurm Simulator Development: Balancing Speed, Accuracy, and Maintainability
    Nikolay A. Simakov, Robert L. DeLeon
    Abstract: Slurm is an open-source job scheduling system widely used in many high-performance computing (HPC) resources. A Slurm simulator facilitates parameter tuning to optimize throughput or meet specific workload objectives. In the previous simulator version (v2) [3], the priorities were to minimize the changes to core Slurm and to maintain high simulation accuracy. This resulted in speed-dependent accuracy and a simulation speed only 20-40 times faster than real time (for a mid-sized system). This is not a very practical simulation speed, and it is more beneficial to trade some accuracy for increased speed. To achieve the desired speed-up goal, we use the same strategy as in our original Slurm simulator (v1) [1,2], namely: serialize the code and call all Slurm functions from a single thread in an event-driven fashion. The resulting version (v3) of our simulator achieves a more than 500-fold acceleration over real time, allowing simulation of a month-long workload in 90 minutes.
    Extended abstract, Poster PDF
  17. Clustering Based Job Runtime Prediction for Backfilling Using Classification
    Hang Cui, Keichi Takahashi, Yoichi Shimomura, Hiroyuki Takizawa
    Abstract: An underestimation of job runtime causes rescheduling, which has a large impact on scheduling performance. Most previous studies use regression algorithms to predict the job runtime, which do not consider the difference in performance impact between underestimation and overestimation. As a result, researchers have to use additional approaches to avoid underestimation. We propose a machine-learning-based method that predicts the job runtime while avoiding underestimation at the same time. Instead of regression, we treat the runtime prediction problem as a classification problem that classifies jobs into clusters, each of which has a predicted runtime plus a statistically reasonable offset value. To organize those clusters, past jobs recorded in log data are clustered based on their recorded runtimes, and the standard deviation of runtimes within each cluster is calculated so that two sigma is used as the reasonable offset for the cluster. As a result, the runtime prediction with a statistically reasonable offset value can avoid underestimation more effectively than the existing approaches, which combine regression with additional underestimation-avoidance methods. The evaluation results show that the proposed mechanism has a lower underestimation rate than the existing mechanism used with regression algorithms. Due to the lower underestimation rate, the average wait time and makespan of all the scheduled jobs using the runtime predicted by the proposed mechanism are shorter than those of the existing mechanism used with regression algorithms. Furthermore, the proposed mechanism also outperforms the others from the viewpoint of bounded slowdown. (A minimal sketch of this clustering approach follows this entry.)
    Extended abstract, Poster PDF
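The clustering-plus-offset idea can be sketched with scikit-learn. The synthetic job log, feature choice, number of clusters, and classifier below are all illustrative assumptions; only the overall flow (cluster recorded runtimes, attach a mean-plus-two-sigma value to each cluster, classify new jobs into a cluster) mirrors the description above.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Synthetic job log: features (requested nodes, requested walltime) and runtimes.
n_jobs = 2000
features = np.column_stack([rng.integers(1, 65, n_jobs),
                            rng.integers(600, 86400, n_jobs)])
runtimes = features[:, 1] * rng.uniform(0.2, 0.9, n_jobs)   # actual < requested

# 1) Cluster recorded runtimes (log scale keeps short and long jobs separated).
kmeans = KMeans(n_clusters=8, n_init=10, random_state=0)
labels = kmeans.fit_predict(np.log(runtimes).reshape(-1, 1))

# 2) Per-cluster prediction value: mean runtime plus a two-sigma offset.
cluster_pred = np.array([runtimes[labels == c].mean() + 2 * runtimes[labels == c].std()
                         for c in range(8)])

# 3) Classify new jobs into clusters from their submission-time features.
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(features, labels)

new_job = np.array([[16, 7200]])              # 16 nodes, 2 h requested walltime
predicted_runtime = cluster_pred[clf.predict(new_job)[0]]
print(f"predicted (offset) runtime: {predicted_runtime:.0f} s")
```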
  18. A New Matrix Reordering Method for GPU Acceleration of an ILU Preconditioner
    Kengo Suzuki, Takeshi Fukaya, Takeshi Iwashita
    Abstract: In this study, we improve the performance of the preconditioned Krylov subspace method on a GPU. Specifically, we propose a new matrix reordering method, called MMC, for enabling the ILU(0) preconditioner to exploit the GPU parallelism. MMC is a variant of our previous block-based multi-color reordering technique for CPU SIMD. By changing the definition of the dependencies in multi-coloring, MMC can further improve the fine-grained concurrency of the ILU(0) preconditioner.
    We compared the MMC-based ILU(0) preconditioner (MMC-ILU(0)) with the standard multi-color ordering-based ILU(0) (MC-ILU(0)) and an iterative triangular solver-based ILU(0) (Async-ILU(0)), which has received much attention recently, by using them with the FGMRES(50) method. The numerical results showed that the proposed MMC-ILU(0) was faster than MC-ILU(0) in 7 out of 9 tests; in the best case, it improved the solver speed by about 50%. The numerical results also demonstrated that MMC-ILU(0) outperformed Async-ILU(0) in 7 out of 9 tests, including tests in which MC-ILU(0) could work well while Async-ILU(0) took many iterations to converge.
    Extended abstract, Poster PDF
  19. Application of GPUs in CFD-based Turbine Wake Simulation
    Ji Qi, Kenji Ono
    Abstract: Understanding wind turbine wakes is a key aspect of wind farm design. The growth of computation capacity has allowed wind turbine wake research to be conducted on computers using CFD-based methods instead of in wind tunnels. However, such methods involve solving the turbulent Navier-Stokes equations repeatedly on numerous grid points, which makes computational performance crucial to both the speed and quality of the simulation.
    GPUs, originally designed for graphic computation, can run a large number of light-weight SIMD threads simultaneously, achieving a very high degree of parallelism, thus being potentially suitable for large scale CFD computations. Boosted by the advancement of GPU-targeted programming languages, applications of GPUs in large scale scientific computing have been increasing in recent years.
    In this study, we couple the wind turbine actuator line model with large eddy simulation of incompressible turbulent flow on a structured orthogonal grid, and implement the solver program using hybrid CUDA C++/MPI parallelization targeting multi-GPU systems. We then conduct GPU kernel profiling and strong and weak scaling experiments on Kyushu University's supercomputer ITO. The results show that our program is capable of solving a typical wind turbine wake simulation problem with reasonable performance.
    Extended abstract, Poster PDF
  20. Efficient implementation and acceleration of DIP-NMF-MM algorithm for high-precision 4D PET image reconstruction
    Yoshinao Yuasa, Kaito Matsumura, Tatsuya Yokota, Satoshi Ohshima, Hidekata Hontani, Toru Nagai, Yuichi Kimura, Muneyuki Sakata, Takahiro Katagiri
    Abstract: Positron emission tomography (PET) is a technique to observe biomolecular mechanisms such as metabolic processes by measuring the distribution of radioactive tracers in a human body. PET images are reconstructed from observed data called sinograms, and there are high expectations for the use of GPUs to handle the large data and computationally intensive algorithms. Yokota et al. proposed a new method using non-negative matrix factorization (NMF) and Deep Image Prior (DIP), which achieved more accurate reconstruction than conventional methods. Furthermore, Matsumura et al. used the MM algorithm instead of gradient descent in the reconstruction process to achieve more stable PET image reconstruction than conventional methods. In this study, we evaluate the performance of a 4D PET image reconstruction code that implements the DIP-NMF-MM algorithm in a 2X higher-resolution execution and with four acceleration methods. For the 2X higher-resolution execution, GPU memory is not sufficient for normal execution when the input data size is doubled. To solve the problem of insufficient GPU memory, we use NVIDIA's Unified Virtual Memory (UVM). To optimize execution time, we implemented the following four speed-up methods and evaluated their performance: 1. optimization of noise addition, 2. computational graphing of the entire process, 3. embedding input data into the graph, and 4. distributed execution of U-Net.
    Extended abstract, Poster PDF
  21. Accelerating Lattice Boltzmann method with C++ standard language parallel algorithm
    Ziheng Yuan, Takashi Shimokawabe
    Abstract: This poster displays the application and performance of standard C++ parallel algorithms in multi-GPU computation by solving a 3D fluid simulation problem using the Immersed Boundary Lattice Boltzmann Method (IB-LBM). The main content focuses on how standard C++ parallel algorithms help simulation code execute in parallel, and on how to use the existing CUDA library to implement the GPU-to-GPU communication functions currently unavailable in standard C++ parallel algorithms, thereby accelerating communication. Compared to directly using MPI for communication in standard C++ parallel algorithms, CUDA-aware MPI reduces communication time by 78.5% with 32 GPUs and by 60.3% with 64 GPUs. Meanwhile, the optimized C++ standard parallel code reaches 96% of CUDA's speed with 32 GPUs and 76% with 64 GPUs.
    Extended abstract, Poster PDF
  22. Enhancing spatial parallelism on loop structure for FPGA
    Yuka Sano, Taisuke Boku, Norihisa Fujita, Ryohei Kobayashi, Mitsuhisa Sato, Miwako Tsuji
    Abstract: In today's HPC systems, GPUs with high computational performance and memory bandwidth are the leading players. However, GPU-based acceleration is designed to excel when utilizing many computation cores in a SIMD/SIMT manner. One of the alternative solutions is the FPGA (Field Programmable Gate Array).
    Currently, it is possible to program FPGA devices in high-level languages. However, the programmer needs high optimization skills to exploit their potential performance. To solve this problem, we have been developing an OpenACC-ready compiler for FPGAs. This research is based on the Omni OpenACC compiler and has been performed in collaboration between the Center for Computational Sciences at the University of Tsukuba (CCS) and the RIKEN Center for Computational Science (R-CCS).
    In this study, we evaluate and examine high-level synthesis-based FPGA programming techniques towards compiler-based performance optimization. We try various techniques to increase the number of computational elements through spatial parallelism, such as pipelining, loop unrolling, and simultaneous execution of multiple kernels. Here we target the CG (Conjugate Gradient) method code for matrix calculation described in OpenCL.
    Based on the optimization methods obtained in this research, we are implementing the functionality to generate OpenCL code from OpenACC using the Omni OpenACC compiler. This feature will provide existing FPGA programmers with a more straightforward programming environment than OpenCL. Additionally, the programming approach of adding directives to sequential code is expected to reduce the amount of code and development time. Furthermore, FPGA acceleration efforts are expected to expand to applications that have been reluctant to use FPGA-based acceleration until now.
    Extended abstract, Poster PDF
  23. Performance Evaluation of OpenSWPC using Various GPU Programming Methods
    Tatsumasa Seimi, Akira Nukada
    Abstract: We ported the full application OpenSWPC to NVIDIA GPUs using OpenACC, OpenMP, standard parallelism, and CUDA Fortran. These four methods are easy to use, but each of them has limitations. We show a performance comparison of the four methods; the porting costs do not differ much among them. Although CUDA Fortran has some performance issues, the other three are realistic methods currently available.
    Extended abstract, Poster PDF
  24. Experimenting with GPTune for Optimizing Linear Algebra Computations
    Makoto Morishita, Osni Marques, Yang Liu, Takahiro Katagiri
    Abstract: In High Performance Computing (HPC), software often has many parameters that impact its performance. However, it is difficult to determine optimal values for such parameters in an impromptu way. The Auto-Tuning (AT) of parameters is therefore an area of great interest. The purpose of this work is to understand the methodology of GPTune, which is an AT framework developed by DOE's Exascale Computing Project, and use the framework in a set of applications of interest.
    Extended abstract, Poster PDF
  25. Auto-tuning of Hyperparameters by Parallel Search Using Xcrypt
    Tatsuro Hanyu, Masatoshi Kawai, Takahiro Katagiri, Tasuku Hiraishi, Tetsuya Hoshino, Toru Nagai
    Abstract: This study explores hyperparameter optimization in Convolutional Neural Networks (CNNs) using Xcrypt, a Perl-based scripting language. The focus is on auto-tuning batch sizes for ResNet50, demonstrating the significant role hyperparameters play in AI model performance. The research leverages Xcrypt's job-level parallel programming for efficient hyperparameter search across supercomputers, bypassing the need for extensive human resources and time. It presents results from extensive batch size tuning, highlighting the efficiency of Xcrypt in managing and executing computational tasks across varied supercomputing environments.
    Extended abstract, Poster PDF
  26. A Proposal of Automatic Parallelization using Transformer-based Large Language Models
    Soratouch Pornmaneerattanatri, Keichi Takahashi, Yutaro Kashiwa, Kohei Ichikawa, Hajimu Iida
    Abstract: The capabilities of the current generation of computer hardware require parallel programming to utilize all of its computing performance, which demands deep programming knowledge to coordinate hardware and software and involves a steep learning curve. Many studies have created tools that reduce the need to learn parallel programming, such as automatic parallelization tools. These tools usually use static analysis to determine loop patterns in source code and insert parallelization annotations into it. However, manually annotated source code demonstrates higher performance than code parallelized by automatic parallelization tools. In a related field, software engineering research has started to adopt deep-learning-based natural language processing models to understand patterns in source code and to solve downstream tasks. We propose building a generative model for OpenMP directives trained on source code from public GitHub repositories, utilizing CodeT5 / CodeT5+, a transformer-based Large Language Model (LLM) designed for code understanding and generation. For this purpose, we collected 57,170 OpenMP-parallelized for-loops from GitHub. Since training LLMs is computationally demanding, we employ models pre-trained on C and C++ source code and use our collected dataset for fine-tuning. We evaluate these models by the performance improvement of the source code modified by the tool on various benchmarks that provide both serial and OpenMP source codes.
    Extended abstract, Poster PDF
  27. Job level parallel search in software auto-tuning
    Yuga Yajima, Akihiro Fujii, Teruo Tanaka
    Abstract: Software auto-tuning (AT) is a technology that improves performance by automatically controlling parameters which affect performance (performance parameters) in a program. In AT, to search for appropriate values of the performance parameters, the program is executed iteratively while various values are assigned to the performance parameters. Meanwhile, the machine learning field is attracting a lot of attention. In machine learning programs, it is important to select proper hyperparameters, so hyperparameters can be treated as performance parameters. However, machine learning programs take a long time per execution. Therefore, we have reduced the search time by parallelizing the search. By setting the number of parallel executions appropriately, the search can be conducted properly. The appropriate number of parallel executions depends on the amount of available computing resources, but if other tasks are being executed, the available computing resources are not constant, so it is difficult to set the appropriate number of parallel executions. In this study, we implemented a mechanism that dynamically recognizes the available computing resources on a supercomputer and sets the appropriate number of parallel executions. The maximum number of parallel executions was recognized by measuring the period from job submission time to start time, and a time reduction of more than 50 percent was achieved.
    Extended abstract, Poster PDF
  28. A MapReduce-based Inter-Organizational and Distributed Process Log Clustering Framework
    Thanh-Hai Nguyen, Kyoung-Sook Kim, Sang-Eun Ahn, Kwanghoon Pio Kim
    Abstract: In this paper, we propose a distributed process log clustering framework that collects and classifies the distributed process execution event logs recorded by the distributed operations of an inter-organizational business process management system. The proposed framework is implemented on a MapReduce-based distributed processing framework and applied to the SICN-oriented process mining system. The framework proposed and implemented in this paper undertakes the splitting, mapping, shuffling, and reducing operations of MapReduce's preprocessing functionality, and it is embedded into the SICN-oriented process mining system as one of the essential components of a specific process mining algorithm, the rho-Algorithm, and of a predictive process monitoring algorithm to be developed in the authors' research group.
    Extended abstract, Poster PDF
  29. A Bigdata Acquisition Framework of Deep-Learning-based CCTV-Video Contextualization Machines
    Eun-Bee Cho, Kyung-Hee Sun, Dinh-Lam Pham, Kyoung-Sook Kim, Jeong-Hyun Chang, Kwanghoon Pio Kim
    Abstract: This paper proposes a Bigdata acquisition framework of CCTV-video contextualization machines and implements its concrete system by developing a video-object detection deep-learning model based on the YOLO neural network architecture and its variants. The primary functionality of the proposed framework is to detect active contextual clues, like objects, motions, and physical environs, in every CCTV-video frame and to codify the detected clues into a new structured code format named COME-Code1. Consequently, through the proposed CCTV-video contextualization machines as a crime-surveillance Bigdata acquisition tool, we can initiate not only a new era of CCTV-surveillance Bigdata archives and their engineering disciplines, but also a new paradigm of CCTV-driven crime-prevention services that detect, predict, and prevent criminal situations and behaviors, as well as provide intelligence-led policing and predictive patrol scheduling operations in real time.
    Extended abstract, Poster PDF
  30. Using lossy compression for interactive analysis over network
    Rei Aoyagi, Keichi Takahashi, Yoichi Shimomura, Hiroyuki Takizawa
    Abstract: Today's large-scale scientific simulations are executed on HPC systems hosted by computing centers. To reveal scientific insights, researchers conduct post-processing on the simulation results by running an interactive data analysis tool on the HPC system and retrieving the post-processed results. In certain scenarios, transferring the simulation results directly from the center is essential. In such scenarios, a portion of the data can be streamed over the network. However, maintaining interactivity poses two challenges: (1) limited network bandwidth, and (2) network latency. To tackle these challenges, we propose middleware to enable interactive array analysis over the network. By using error-bounded lossy compression, we increase the effective network bandwidth. Furthermore, we employ multi-level caching to hide the network latency and combine prefetching to improve the cache hit ratio. The cache replacement and prefetching policies are designed considering the data access pattern of interactive analysis. Our middleware demonstrates up to 58.9% reduction in latency compared to a state-of-the-art array database.
    Extended abstract, Poster PDF
  31. Attempt for Quantitative Evaluation of Warm Water Cooling using LINPACK and GeoFEM on the JCAHPC OFP
    Jorji Nonaka, Fumiyoshi Shoji, Toshihiro Hanawa
    Abstract: Energy efficiency is already an important topic for HPC sites, as we can observe from the activities of the EE HPC WG (Energy Efficient HPC Working Group). Its importance has become even more evident with the steep rise in global energy prices, triggered by military conflicts with no foreseeable end. For this purpose, warm water cooling has been widely recognized as an effective approach for improving the energy efficiency of HPC and data centers. Thanks to the "JCAHPC Large-Scale HPC Challenge" program, we had the opportunity to run some applications (LINPACK and GeoFEM) on the warm-water-cooled JCAHPC OakForest-PACS (OFP) under different cooling water temperature settings. In this poster, we try to shed light on the possible influence of the external temperature on the facility's energy consumption, as well as the influence of the cooling water temperature on the performance of the utilized applications. The evaluations include some considerations on the contrastive behaviors of groups of compute nodes with respect to the influence of the cooling water temperature on LINPACK and GeoFEM performance, and on the use of groups of compute nodes sorted by single-node GeoFEM performance to evaluate the possible influence on parallel GeoFEM.
    Extended abstract, Poster PDF
  32. Eco-Comp: Towards Responsible Computing in Materials Science
    El-Tayeb Bentria, Sai Surag Lingampalli, Fadwa El-Mellouhi
    Abstract: Computational methods such as density functional theory (DFT) and molecular dynamics (MD) simulations have become a key focus in materials science, especially with the rise of machine learning interatomic potentials that enable the simulation of multi-million-atom systems. The computational intensity of these simulations necessitates their deployment on high-performance computers (HPC) and the use of multi-node runs. However, a large number of calculations submitted by users target speed regardless of efficiency. To foster sustainable and ethical computing practices, we built Eco-Comp, a user-friendly automated Python tool that allows material scientists to optimize their simulations' use of computing power with one command. In this study, we employed the Large-scale Atomic/Molecular Massively Parallel Simulator (LAMMPS) to find the optimal allocation of computing resources based on the simulation input. Through the analysis of bulk metallic systems and surface reactions, we identified various factors that affect parallel efficiency, and from these we propose rules for responsible computing on HPC architectures, which Eco-Comp uses. Coinciding with the United Nations' Sustainable Development Goal 9, which focuses on sustainable industrialization, this poster gives a broad overview of the Eco-Comp software and its potential use for the materials science community through an interactive guide.
    Extended abstract, Poster PDF
  33. A Power Management Method to Improve Energy Budget Utilization
    Sho Ishii, Keichi Takahashi, Yoichi Shimomura, Hiroyuki Takizawa
    Abstract: While high performance computing (HPC) requires high computational efficiency, power management is also becoming increasingly important to make HPC technology more cost-effective and hence sustainable. Since IT infrastructure and power demand are becoming economically unaffordable, it is anticipated that future HPC systems will have a set energy budget. Previous work identified that improving the energy budget utilization, which is the ratio of actual energy consumption to the allocated energy budget, can increase the system throughput. This study proposes a power management method to improve energy budget utilization by dynamically adjusting the power cap of each job considering the power consumption and energy budget on each node. Specifically, the power cap of each job is adjusted so that the surplus energy generated by other preceding jobs is consumed to increase the energy budget utilization. To evaluate the effectiveness of the proposed method, we compare the proposed method with a baseline using a fixed power cap with our own simulator. The evaluation results show that in the best case, the proposed method can reduce the makespan by 10.7% while improving the energy budget utilization by 30.5%, compared to the baseline method.
    Extended abstract, Poster PDF
  34. Performance Analysis of Applications under CPU Power Constraints
    Riki Takahashi, Keiichiro Fukazawa
    Abstract: This study addresses the power issues of supercomputers. The optimization of power control at the application level has not been extensively explored, but it is becoming an increasingly important topic. Analyzing application performance under power constraints is essential for optimizing power control at the application level. Although power modeling for supercomputers and performance estimation for applications have been undertaken, there is scarcely any analysis of application performance under power constraints.
    Consequently, this study has done the following. Firstly, the research measured and evaluated the performance differences under CPU power constraints using the memory-bound benchmark application and the CPU-bound application. Secondly, it evaluated the relationship between B/F values and execution performance under CPU power constraints in practical scientific computing applications. Finally, the study conducted comprehensive measurements of execution performance in scientific computing applications, not only in general but also in specific parts such as main computation and data writing, to understand the variations in power performance.
    In our evaluation, we focused on the performance of benchmark applications under CPU power constraints. Our results showed that memory-bound applications tend to maintain their performance better than CPU-bound applications. Practical scientific computing applications with higher B/F values tend to have less performance reduction under CPU power constraints. Furthermore, by measuring execution characteristics under power constraints in different sections of scientific computing applications, we clarified the variations in power characteristics between these sections. These results suggest that finer power control within an application, based on per-section performance information, can further enhance power efficiency.
    Extended abstract, Poster PDF
  35. Performance Evaluation of Support Vector Machines with Quantum-inspired Annealers
    Ryoga Fukuhara, Makoto Morishita, Takahiro Katagiri
    Abstract: Quantum computers, drawing considerable attention due to their capacity for simultaneous parallel computations stemming from their quantum nature, are poised to emerge as the next-generation high-speed computing systems, boasting vastly superior computational capabilities compared to conventional, or classical, computers. This expectation arises from the prediction that Moore's Law will eventually plateau, rendering substantial speed improvements in classical computers unattainable. Meanwhile, the demands for data processing continue to surge, intensifying the need for high-performance computing solutions, thus elevating expectations for novel computer paradigms. Consequently, there is a global acceleration in the development of quantum computers. Conversely, when it comes to addressing combinatorial optimization problems that involve finding optimal combinations of variables to enhance a specific metric among multiple options within various constraints, there is a diversification of quantum annealing methods, semiconductor annealing machines, and other quantum-related hardware. Notably, semiconductor annealing machines have garnered attention as non-von Neumann computers capable of performing annealing processes to rapidly derive optimal solutions for combinatorial optimization problems at room temperature. Nevertheless, the range of practical applications for these machines is not yet extensive. As a result, in this study, we undertook the implementation of SVM (Support Vector Machine), a well-established machine learning algorithm, on a CMOS annealing machine, which falls under the category of quantum-inspired annealers.
    Extended abstract, Poster PDF
  36. Calibrating Simulations of Quantum Annealers for Predictive Models
    Michael Zielewski, Keichi Takahashi, Yoichi Shimomura, Hiroyuki Takizawa
    Abstract: The annealing schedule defining the evolution of a quantum annealer has a significant impact on the output of the annealer. Pausing is a modification of the standard forward annealing schedule in which certain parameters of the schedule are held constant for a period of time. Pauses have been shown to significantly improve results, but only when the location of the pause is near an instance-wise optimal pause location. The standard procedure for finding the optimal pause location is a costly grid search, however, our previous work showed that quantum annealing simulations can be used to train a neural network to quickly predict a high-quality pause location while avoiding any costs associated with the grid search approach.
    We also showed that our proposal was not achieving the maximum benefit from pausing. Our analysis indicated that the simulated and true optimal pause locations did not exactly align. In this work, we study how the simulation sweeps parameter, which is the number of Monte Carlo iterations, influences the optimal pause location. We show that while this parameter can be used to calibrate the simulation so that the simulated and true optimal pause locations align, such an approach is costly and would need to be repeated for each problem. Our future interests lie in modifying the simulator so that the calibration can be performed once, at the system level.
    Extended abstract, Poster PDF

Contact

Poster Chair: Keichi Takahashi (Tohoku University)
E-mail: hpca24-pc-chair [at] sighpc.ipsj.or.jp