About the Role
Optimize ML model training and inference performance, improve efficiency of large-scale training, ensure low-latency inference in real-time systems, and optimize high-throughput inference in research.
Requirements
Experience in low-level systems programming and optimization, with a focus on improving ML model training and inference performance across host, GPU, storage, and networking systems.
Full Job Description
<p>We are looking for an engineer with experience in low-level systems programming and optimization to join our growing ML team.</p>
<p><a href="https://www.janestreet.com/machine-learning" target="_blank">Machine learning</a> is a critical pillar of Jane Street's global business. Our ever-evolving trading environment serves as a unique, rapid-feedback platform for ML experimentation, allowing us to incorporate new ideas with relatively little friction.</p>
<p>Your part here is optimizing the performance of our models – both training and inference. We care about efficient large-scale training, low-latency inference in real-time systems, and high-throughput inference in research. Part of this is improving straightforward CUDA, but the interesting part needs a whole-systems approach, including storage systems, networking, and host- and GPU-level considerations. Zooming in, we also want to ensure our platform makes sense even at the lowest level – is all that throughput actually goodput? Does loading that vector from the L2 cache really take that long?</p>
<p>If you’ve never thought about a career in finance, you’re in good company. Many of us were in the same position before working here. If you have a curious mind and a passion for solving interesting problems, we have a feeling you’ll fit right in. </p>
<p>There’s no fixed set of skills, but here are some of the things we’re looking for:</p>
<ul>
<li>An understanding of modern ML techniques and toolsets</li>
<li>The experience and systems knowledge required to debug a training run’s performance end to end</li>
<li>Low-level GPU knowledge of PTX, SASS, warps, cooperative groups, Tensor Cores, and the memory hierarchy</li>
<li>Debugging and optimization experience using tools like CUDA-GDB, Nsight Systems, and Nsight Compute</li>
<li>Library knowledge of Triton, CUTLASS, CUB, Thrust, cuDNN, and cuBLAS</li>
<li>Intuition about the latency and throughput characteristics of CUDA graph launch, tensor core arithmetic, warp-level synchronization, and asynchronous memory loads</li>
<li>Background in InfiniBand, RoCE, GPUDirect, PXN, rail optimization, and NVLink, and how to use these networking technologies to link up GPU clusters</li>
<li>An understanding of the collective algorithms supporting distributed GPU training in NCCL or MPI</li>
<li>An inventive approach and the willingness to ask hard questions about whether we're taking the right approaches and using the right tools</li>
</ul>
<p><em>If you're a recruiting agency and want to partner with us, please reach out to </em><a href="mailto:agency-partnerships@janestreet.com"><em>agency-partnerships@janestreet.com</em></a><em>.</em></p>