Get maximum number of jobs allowed in SLURM cluster as a user Maximizing Your Job Count in SLURM: A User's Guide. Problem: As a user on a SLURM cluster, you want to run as many jobs as possible simultaneously. However, you're oft 2 min read 06-10-2024 8
Setting resources dynamically in Snakemake Dynamically Managing Resources in Snakemake Workflows. Snakemake is a powerful workflow management system that simplifies complex computational pipelines. One key 2 min read 04-10-2024 11
How enroot shares image cache and data in multi-node situations? Sharing the Load: How Enroot Manages Image Cache and Data in Multi-Node Environments. In the world of containerized applications, efficiency and scalability are pa 2 min read 04-10-2024 8
SLURM slurmschd.log - extremely big file size SLURM's Bulging Log: Why slurmschd.log Is Eating Your Disk Space and How to Tame It. The Problem: Your SLURM slurmschd.log file has ballooned in size, consuming prec 2 min read 04-10-2024 4
munge/slurm authentication issue (Protocol authentication error) Munge/Slurm Authentication Errors: A Comprehensive Guide to Troubleshooting "Protocol authentication error". Introduction: The "Protocol authentication error" is a com 2 min read 04-10-2024 12
In SLURM, lscpu and slurmd -c do not match, so resources are not usable Understanding Resource Mismatch in SLURM: lscpu vs. slurmd -c. In the context of High-Performance Computing (HPC), many administrators rely on SLURM (Simple Linux Utili 2 min read 30-09-2024 8
SLURM batch job - how to run a preparation task once per node on each node that will receive jobs from the same batch file? Running a Preparation Task Once Per Node in SLURM Batch Jobs. When working with SLURM (Simple Linux Utility for Resource Management), it's not uncommon to find yours 3 min read 30-09-2024 12
Running module commands in srun Running Module Commands in SRUN: A Comprehensive Guide. In the world of high-performance computing (HPC), managing software environments efficiently is crucial for o 2 min read 30-09-2024 8
How to get an estimate of when a job is going to start according to the current schedule? How to Estimate Job Start Dates According to Current Schedules. When managing a project, one of the most critical factors to consider is the timeline. Understandin 3 min read 29-09-2024 7
SLURM maximum buffer size Understanding SLURM's Maximum Buffer Size: A Comprehensive Guide. When working with SLURM (Simple Linux Utility for Resource Management), many users encounter various 3 min read 26-09-2024 12
Issues with Loading Pretrained Model and File Locking in DeepSpeed and Hugging Face Transformers Issues with Loading Pretrained Model and File Locking in DeepSpeed and Hugging Face Transformers. In the world of machine learning and natural language processi 3 min read 20-09-2024 14
Parameter tuning with Slurm, Optuna, PyTorch Lightning, and KFold Parameter Tuning with Slurm, Optuna, PyTorch Lightning, and KFold. Parameter tuning is a crucial step in optimizing machine learning models. In this article we wil 4 min read 17-09-2024 22
How can I let higher priority Slurm jobs pass through while not sharing individual CPUs among tasks? Allowing Higher Priority Slurm Jobs to Pass Through Without Sharing CPUs. When managing a cluster with Slurm Workload Manager, one common challenge system adminis 2 min read 16-09-2024 23
How do I set up Distributed Data Parallel (DDP) training using the PyTorch Lightning CLI? Setting Up Distributed Data Parallel (DDP) Training Using PyTorch Lightning CLI. Distributed Data Parallel (DDP) is a powerful way to train your machine learning mo 2 min read 16-09-2024 17
Run one program with different arguments in parallel with SLURM Running One Program with Different Arguments in Parallel Using SLURM. Introduction: In high-performance computing (HPC) environments, efficiently running multiple in 3 min read 15-09-2024 27
How to Set Exception Rules for Slurm Executor in Snakemake? How to Set Exception Rules for Slurm Executor in Snakemake. Snakemake is a popular workflow management system that allows users to define complex data workflows 2 min read 14-09-2024 32
Custom Select Plugin in Slurm Boosting Your Slurm Workflows with Custom Select Plugins. Slurm, the popular workload manager, offers a robust framework for managing high-performance computing HP 2 min read 13-09-2024 13
Slurm: using GPU sharding Unleashing GPU Power: Slurm and the Art of Sharding. Slurm, the popular workload manager, often finds itself handling computationally intensive tasks that demand th 2 min read 13-09-2024 28
Slurm: How to obtain only jobID using jobName through a script Extracting Job ID from Slurm Job Name: A Practical Guide. Finding the Job ID of a running or completed Slurm job based on its name can be a common task for system 2 min read 05-09-2024 12
SLURM+Docker: How to kill docker-created processes using SLURM's scancel Mastering SLURM and Docker: Ensuring Process Termination with scancel. When managing GPU-intensive deep learning workloads on a SLURM cluster, efficiently handling 2 min read 05-09-2024 30
slurmd unable to communicate with slurmctld Troubleshooting slurmd Unable to Communicate with slurmctld. This article aims to help you diagnose and fix the common issue of slurmd failing to communicate wit 3 min read 05-09-2024 12
Queue SLURM jobs to run X minutes after each other Scheduling SLURM Jobs with Time Delays: A Step-by-Step Guide. Running a series of tasks in a specific order with calculated time delays is a common requirement in 2 min read 05-09-2024 12
SLURM job array $SLURM_ARRAY_TASK_ID not working Debugging SLURM Job Arrays: Why $SLURM_ARRAY_TASK_ID Might Not Work as Expected. Using SLURM job arrays is a powerful way to run multiple instances of your script 2 min read 03-09-2024 15
Slurm: invalid job credential Troubleshooting "Invalid Job Credential" Errors in Slurm: A Practical Guide. Slurm, the popular workload manager, is known for its flexibility and scalability. However 3 min read 03-09-2024 12
Can I create a job name that reflects the array task ID? Dynamically Naming Slurm Array Jobs: A Guide to Tailoring Job Identifiers. Running large-scale simulations or analyses often involves using job arrays, a powerful 2 min read 03-09-2024 19