Sr. HPC Systems Engineer
2 weeks ago
Research Computing is seeking a Sr. HPC Systems Engineer who will design, build, and maintain advanced high-performance computing environments supporting Johns Hopkins University's research mission. This position focuses on the reliable operation, configuration, and optimization of HPC and AI systems, including multi-node CPU and GPU clusters, high-speed InfiniBand and Ethernet networks, and large-scale parallel and object storage. The engineer implements and automates secure, efficient, and reproducible computing platforms used by faculty, researchers, and students across diverse scientific disciplines. Assignments include both ticket-based support and project-based deployments. The role operates with moderate independence, collaborating closely with the IT Architect, Research Computing, and reporting to the IT Manager for Research Computing to ensure scalable, sustainable, and high-performance systems that enable cutting-edge scientific discovery.
Specific Duties & Responsibilities
- Support and administer production systems used by researchers and Research Centers.
- Provide technical leadership/project management for system configuration, implementation, management, and user support for both new and existing systems.
- Research and recommend new functionality for HPC management and administration tools by exploring system-wide impacts and working with functional users to define current and future processes.
- Apply expertise in architecting, operating, and debugging large-scale HPC network and storage infrastructure, including MPI, NCCL, RDMA, InfiniBand, and parallel file systems.
- Work with scientific support specialists to assign tasks and provide oversight as appropriate to the HPC engineering team to support scientific researchers who use a broad spectrum of applications from diverse fields.
- Analyze results of server monitoring and implement changes to improve performance, processing, and utilization.
- Propose, maintain, and enforce policies, practices and security procedures.
- Provide break/fix support, setup/installation support, escalation support, and solutions support.
- Collaborate closely with a variety of stakeholders, both internal and external, on all aspects of projects.
- Other duties as assigned.
In Addition to the Duties Described Above
- Deploy, configure, and maintain large-scale Linux-based HPC clusters comprising CPU and GPU nodes, high-speed interconnects, and parallel file systems.
- Implement and optimize workload schedulers (Slurm) and job submission policies to maximize system throughput and fair-share usage.
- Administer and monitor distributed storage systems (GPFS, Lustre, WekaFS, Ceph, MinIO) to ensure reliability and performance across multi-petabyte environments.
- Maintain high-speed fabric and network infrastructure (InfiniBand, Ethernet) to support low-latency data transfer and MPI workloads.
- Support research groups in deploying, testing, and optimizing scientific applications and AI/ML workflows on shared computing resources.
- Develop and maintain automation and monitoring frameworks for system provisioning, metrics collection, and alerting (Prometheus, Grafana, ELK).
- Participate in capacity planning, hardware lifecycle management, and evaluation of new technologies in collaboration with architects and management.
- Ensure security and compliance through configuration hardening, patch management, and integration with campus identity and access control systems.
- Document system designs, procedures, and troubleshooting guides to support knowledge transfer and team continuity.
- Contribute to a collaborative engineering culture that emphasizes service quality, innovation, and continuous improvement in research computing operations.
- Participate in the on-call rotation to ensure high availability and timely response to system alerts.
Minimum Qualifications
- Bachelor's Degree.
- Six years related experience.
- Additional education may substitute for required experience and additional related experience may substitute for required education beyond a high school diploma/graduation equivalent, to the extent permitted by the JHU equivalency formula.
- Eight or more years of experience in high-performance computing systems administration or engineering, including experience with cluster management, workload scheduling (e.g., Slurm), and distributed or parallel storage.
- Deep proficiency in Linux systems administration, configuration management (Ansible, Puppet, or Salt), performance monitoring, and tuning for HPC workloads.
- Experience with high-speed interconnects (InfiniBand, 100/400 Gb Ethernet) and parallel file systems (e.g., GPFS, Lustre, BeeGFS, or WekaFS).
- Working knowledge of containerization and orchestration (Singularity, Docker, Kubernetes for HPC).
- Ability to automate deployments and routine operations through scripting (Bash, Python).
- Familiarity with data-center operations, GPU acceleration, and research software environments (e.g., CUDA, MPI, AI/ML frameworks).
- Strong analytical and troubleshooting skills, with proven ability to support complex research workloads in multi-user, multi-tenant environments.
- Experience collaborating with faculty and research groups to translate scientific requirements into practical and performant computing solutions.
Technical Skills & Expected Level of Proficiency
- Automation - Authority
- Cloud Infrastructure - Authority
- Cloud Migration - Authority
- Cloud Security - Authority
- Cloud Strategy - Authority
- Job Scheduling Systems - Authority
- Operating Software - Authority
- Scripting - Authority
- Software Development Life Cycle - Authority
- Systems Architecture - Authority
- Systems Analysis - Authority
- Systems Configuration - Authority
- Systems Design - Authority
- Systems Development - Authority
- Systems Engineering - Authority
- Systems Integration - Authority
Classified Title: Sr. HPC Systems Engineer
Job Posting Title (Working Title): Sr. HPC Systems Engineer (Research Computing)
Role/Level/Range: ATP/04/PF
Starting Salary Range: $85,500 - $149,800 Annually (Commensurate w/exp.)
Employee group: Full Time
Schedule: Mon-Fri, 8:30am-5pm
FLSA Status: Exempt
Location: Johns Hopkins Bayview
Department name: Research Computing
Personnel area: University Administration