MVAPICH :: Publications

Journals (31)
1	K. Suresh, K. Khorassani, C. Chen, B. Ramesh, M. Abduljabbar, A. Shafi, H. Subramoni, and DK Panda, Network Assisted Non-Contiguous Transfers for GPU-Aware MPI Libraries, IEEE Micro, Jan 2023.
2	K. Khorassani, C. Chen, B. Ramesh, A. Shafi, H. Subramoni, and DK Panda, High Performance MPI over the Slingshot Interconnect, Special Issue of Journal of Computer Science and Technology (JCST), Feb 2023.
3	DK Panda, H. Subramoni, C. Chu, and M. Bayatpour, The MVAPICH project: Transforming Research into High-Performance MPI Library for HPC Community, Journal of Computational Science (JOCS), Special Issue on Translational Computer Science, Oct 2020.
4	J. Hashmi, C. Chu, S. Chakraborty, M. Bayatpour, H. Subramoni, and DK Panda, FALCON-X: Zero-copy MPI Derived Datatype Processing on Modern CPU and GPU Architectures, Journal of Parallel and Distributed Computing (JPDC), Volume 144, October 2020, Pages 1-13, https://doi.org/10.1016/j.jpdc.2020.05.008.
5	Ammar Awan, A. Jain, C. Chu, H. Subramoni, and DK Panda, Communication Profiling and Characterization of Deep Learning Workloads on Clusters with High-Performance Interconnects, IEEE Micro, vol. 40, no. 1, pp. 35-43, 1 Jan.-Feb. 2020.
6	A. Ruhela, H. Subramoni, S. Chakraborty, M. Bayatpour, P. Kousha, and DK Panda, Efficient Design for MPI Asynchronous Progress without Dedicated Resources, Parallel Computing - Systems & Applications, Volume 85, July 2019, Pages 13-26, https://doi.org/10.1016/j.parco.2019.03.003.
7	Ammar Awan, K. Vadambacheri Manian, C. Chu, H. Subramoni, and DK Panda, Optimized Large-Message Broadcast for Deep Learning Workloads: MPI, MPI+NCCL, or NCCL2?, Volume 85, July 2019, Pages 141-152, https://doi.org/10.1016/j.parco.2019.03.005.
8	S. Chakraborty, Ignacio Laguna, Murali Emani, Kathryn Mohror, DK Panda, Martin Schulz, and H. Subramoni, EReinit: Scalable and Efficient Fault Tolerance for Bulk-Synchronous MPI Applications, Concurrency and Computation: Practice and Experience, 14 August 2018, https://doi.org/10.1002/cpe.4863.
9	S. Ramesh, A. Mahéo, S. Shende, A. Malony, H. Subramoni, A. Ruhela, and DK Panda, MPI performance engineering with the MPI tool interface: The integration of MVAPICH and TAU, ISSN 0167-8191, Volume 77, Sep 2018.
10	H. Wang, S. Potluri, D. Bureddy, and DK Panda, GPU-Aware MPI on RDMA-Enabled Cluster: Design, Implementation and Evaluation, IEEE Transactions on Parallel & Distributed Systems, Vol. 25, No. 10, pp. 2595-2605, Oct 2014.
11	S. Sur, S. Potluri, K. Kandalla, H. Subramoni, K. Tomko, and DK Panda, Co-Designing MPI Library and Applications for InfiniBand Clusters, IEEE Computer, Nov 2011.
12	P. Lai, P. Balaji, R. Thakur, and DK Panda, ProOnE: A General-Purpose Protocol Onload Engine for Multi- and Many-Core Architectures, Computer Science: Research and Development, Special Issue of Scientific Papers from ISC '09, Jun 2009.
13	A. Vishnu, M. Koop, A. Moody, A. Mamidala, S. Narravula, and DK Panda, Topology Agnostic Hot-Spot Avoidance with InfiniBand, Concurrency and Computation: Practice and Experience, Special Issue of Best Papers from CCGrid '07, Jan 2008.
14	H. Jin, P. Balaji, C. Yoo, J.-Y. Choi, and DK Panda, Exploiting NIC Architectural Support for Enhancing IP based Protocols on High Performance Networks, OSU-CISRC-5/04-TR37, Nov 2005.
15	J. Liu, A. Mamidala, A. Vishnu, and DK Panda, Performance Evaluation of InfiniBand with PCI Express, IEEE Micro, Jan 2005.
16	J. Liu, J. Wu, and DK Panda, High Performance RDMA-Based MPI Implementation over InfiniBand, Int'l Journal of Parallel Programming: Volume 32, Number 3, Jun 2004.
17	J. Liu, B. Chandrasekaran, W. Yu, J. Wu, D. Buntinas, S. Kini, P. Wyckoff, and DK Panda, Micro-Benchmark Performance Comparison of High-Speed Cluster Interconnects, IEEE Micro, Jan 2004.
18	A. Wagner, D. Buntinas, R. Brightwell, and DK Panda, Application-Bypass Reduction for Large-Scale Clusters, Int'l Journal of High Performance Computing and Networking, Cluster 2003 Special Issue, In Press, Dec 2003.
19	R. Sivaram, C. Stunkel, and DK Panda, HIPIQS: A High-Performance Switch Architecture using Input Queuing, IEEE Transactions on Parallel and Distributed Systems, Vol. 13, No. 3, pp. 275-289, Mar 2002.
20	M. Banikazemi, B. Abali, L. Herger, and DK Panda, Design Alternatives for Virtual Interface Architecture (VIA) and an Implementation on IBM Netfinity NT Cluster, Journal of Parallel and Distributed Computing, Special Issue on Clusters, Volume 61, Number 11, pp. 1512-1545, Nov 2001.
21	M. Banikazemi, R. K. Govindaraju, R. Blackmore, and DK Panda, MPI-LAPI: An Efficient Implementation of MPI for IBM RS/6000 SP Systems, IEEE Transactions on Parallel and Distributed Systems, Vol. 12, No. 10, pp. 1081-1093, Oct 2001.
22	B. Abali, C. B. Stunkel, J. Herring, M. Banikazemi, DK Panda, C. Aykanat, and Y. Aydogan, Adaptive Routing on the New Switch Chip for IBM SP Systems, Journal of Parallel and Distributed Computing, Special Issue on Routing in Computer and Communication Networks, Volume 61, Number 9, pp. 1148-1179, Sep 2001.
23	R. Kesavan and DK Panda, Efficient Multicast on Irregular Switch-based Cut-Through Networks with Up-Down Routing, IEEE Transactions on Parallel and Distributed Systems, Vol. 12, No. 8, pp. 808-828, Aug 2001.
24	R. Sivaram, R. Kesavan, DK Panda, and C. Stunkel, Architectural Support for Efficient Multicasting in Irregular Networks, IEEE Transactions on Parallel and Distributed Systems, Vol. 12, No. 5, pp. 489-513, May 2001.
25	R. Sivaram, C. Stunkel, and DK Panda, Implementing Multidestination Worms in Switch-Based Parallel Systems: Architectural Alternatives and their Impact, IEEE Transactions on Parallel and Distributed Systems, Vol. 11, No. 8, pp. 794-812, Aug 2000.
26	R. Kesavan and DK Panda, Multiple Multicast with Minimized Node Contention on Wormhole k-ary n-cube Networks, IEEE Transactions on Parallel and Distributed Systems, Vol. 10, No. 4, pp. 371-393, Apr 1999.
27	D. Dai and DK Panda, Exploiting the Benefits of Multiple-Path Network in DSM Systems: Architectural Alternatives and Performance Evaluation, IEEE Transactions on Computers, Special Issue on Cache Memory, Vol. 48, No. 2, pp. 236-244, Feb 1999.
28	R. Prakash and DK Panda, Designing Communication Strategies for Heterogeneous Parallel Systems, Parallel Computing, Volume 24, pp. 2035-2052, Dec 1998.
29	R. Sivaram, DK Panda, and C. B. Stunkel, Efficient Broadcast and Multicast on Multistage Interconnection Networks using Multiport Encoding, IEEE Transactions on Parallel and Distributed Systems, Vol. 9, No. 10, pp. 1004-1028, Oct 1998.
30	D. Basak and DK Panda, Designing Clustered Multiprocessor Systems under Packaging and Technological Advancements, IEEE Transactions on Parallel and Distributed Systems, Vol. 7, No. 9, pp. 962-978, Sep 1996.
31	Srinivasan Ramesh, Aurele Maheo, Sameer Shende, Allen Malony, H. Subramoni, and DK Panda, MPI Performance Engineering with the MPI Tool Interface: the Integration of MVAPICH and TAU, May 2018.

Book Chapter (2)
1	X. Lu, J. Zhang, and DK Panda, Building Efficient HPC Cloud with SR-IOV Enabled InfiniBand: The MVAPICH2 Approach, Book "Research Advances in Cloud Computing", edited by Sanjay Chaudhary, Gaurav Somani, and Rajkumar Buyya, Springer International Publishing, Aug 2017.
2	X. Lu and DK Panda, Contribution on Multiple Chapters related to OpenStack, Virtualized HPC, HPC Network Fabric, and HPC Workload Management, Book "The Crossroads of Cloud and HPC: OpenStack for Scientific Research; Exploring OpenStack Cloud Computing for Scientific Workloads", edited by Stig Telfer - OpenStack Foundation Publishing (Invited Book Chapter), Nov 2016.

Conferences & Workshops (414)
1	Design and Implementation of an IPC-based Collective MPI Library for Intel GPUs, C. Chen, G. Kuncham, P. Kousha, H. Subramoni, and DK Panda, Practice & Experience in Advanced Research Computing, Jul 2024
2	Exploiting Inter-Layer Expert Affinity for Accelerating Mixture-of-Experts Model Inference, J. Yao, Q. Anthony, A. Shafi, H. Subramoni, and DK Panda, 38th IEEE International Parallel & Distributed Processing Symposium, May 2024
3	Accelerating MPI AllReduce Communication with Efficient GPU-Based Compression Schemes on Modern GPU Clusters, Q. Zhou, B. Ramesh, A. Shafi, M. Abduljabbar, H. Subramoni, and DK Panda, ISC HIGH PERFORMANCE 2024, May 2024
4	Accelerating Large Language Model Training with Hybrid GPU-based Compression, L. Xu, Q. Anthony, Q. Zhou, N. Alnaasan, R. Gulhane, A. Shafi, H. Subramoni, and DK Panda, IEEE/ACM International Symposium on Cluster, Cloud, and Internet Computing 2024, May 2024
5	Flover: A Temporal Fusion Framework for Efficient Autoregressive Model Parallel Inference, J. Yao, N. Alnaasan, T. Chen, A. Shafi, H. Subramoni, and DK Panda, 30th IEEE International Conference on High Performance Computing, Data, & Analytics, Dec 2023
6	HARVEST: High-Performance Artificial Vision Framework for Expert Labeling using Semi-Supervised Training, N. Alnaasan, M. Lieber, A. Shafi, H. Subramoni, S. Shearer, and DK Panda, 2023 IEEE International Conference on Big Data, Dec 2023
7	Benchmarking Modern Databases for Storing and Profiling Very Large Scale HPC Communication Data, P. Kousha, Q. Zhou, H. Subramoni, and DK Panda, The 15th BenchCouncil International Symposium On Benchmarking, Measuring And Optimizing, Dec 2023
8	MPI-xCCL: A Portable MPI Library over Collective Communication Libraries for Various Accelerators, C. Chen, K. Khorassani, P. Kousha, Q. Zhou, J. Yao, H. Subramoni, and DK Panda, Sixth Annual Workshop on Emerging Parallel and Distributed Runtime Systems and Middleware, Nov 2023
9	Democratizing HPC Access and Use with Knowledge Graphs, P. Kousha, V. Sathu, M. Lieber, H. Subramoni, and DK Panda, D-HPC 2023: The First International Workshop on Democratizing High-Performance Computing, Nov 2023
10	DPU-Bench: A Micro-Benchmark Suite to Measure Offload Efficiency Of SmartNICs, B. Michalowicz, K. Suresh, H. Subramoni, DK Panda, and S. Poole, Practice and Experience in Advanced Research Computing 23, Jul 2023
11	Enabling Reconfigurable HPC through MPI-based Inter-FPGA Communication, N. Contini, B. Ramesh, K. Suresh, T. Tran, B. Michalowicz, M. Abduljabbar, H. Subramoni, and DK Panda, International Conference on Supercomputing 2023, Jun 2023
12	SAI: AI-Enabled Speech Assistant Interface for Science Gateways in HPC, P. Kousha, A. Jain, A. Kolli, M. Lieber, M. Han, N. Contini, H. Subramoni, and DK Panda, ISC HIGH PERFORMANCE 2023, May 2023
13	A Novel Framework for Efficient Offloading of Communication Operations to Bluefield SmartNICs, K. Suresh, B. Michalowicz, B. Ramesh, N. Contini, J. Yao, S. Xu, A. Shafi, H. Subramoni, and DK Panda, 37th IEEE International Parallel & Distributed Processing Symposium (IPDPS '23), May 2023
14	Accelerating Distributed Deep Learning Training with Compression Assisted Allgather and Reduce-Scatter Communication, Q. Zhou, Q. Anthony, L. Xu, A. Shafi, M. Abduljabbar, H. Subramoni, and DK Panda, 37th IEEE International Parallel & Distributed Processing Symposium (IPDPS '23), May 2023
15	MCR-DL: Mix-and-Match Communication Runtime for Deep Learning, Q. Anthony, Ammar Awan, J. Rasley, Y. He, A. Shafi, M. Abduljabbar, H. Subramoni, and DK Panda, 37th IEEE International Parallel & Distributed Processing Symposium (IPDPS '23), May 2023
16	Designing and Optimizing GPU-aware Nonblocking MPI Neighborhood Collective Communication for PETSc, K. Khorassani, C. Chen, H. Subramoni, and DK Panda, 37th IEEE International Parallel & Distributed Processing Symposium (IPDPS '23), May 2023
17	In-Depth Evaluation of a Lower-Level Direct-Verbs API on InfiniBand-based Clusters: Early Experiences, B. Michalowicz, K. Suresh, B. Ramesh, A. Shafi, H. Subramoni, M. Abduljabbar, and DK Panda, 25th Workshop on Advances in Parallel and Distributed Computational Models, May 2023 [Held in conjunction with IPDPS 2023]
18	Implementing and Optimizing a GPU-aware MPI Library for Intel GPUs: Early Experiences, C. Chen, K. Khorassani, G. Kuncham, R. Vaidya, M. Abduljabbar, A. Shafi, H. Subramoni, and DK Panda, The 23rd IEEE/ACM International Symposium on Cluster, Cloud and Internet Computing, May 2023
19	AccDP: Accelerated Data-Parallel Distributed DNN Training for Modern GPU-Based HPC Clusters, N. Alnaasan, A. Jain, A. Shafi, H. Subramoni, and DK Panda, 29th IEEE International Conference on High Performance Computing, Data, and Analytics, Dec 2022
20	Accelerating Broadcast Communication with GPU Compression for Deep Learning Workloads, Q. Zhou, Q. Anthony, A. Shafi, H. Subramoni, and DK Panda, 29th IEEE International Conference on High Performance Computing, Data, and Analytics, Dec 2022
21	Spark Meets MPI: Towards High-Performance Communication Framework for Spark using MPI, K. Al Attar, A. Shafi, M. Abduljabbar, H. Subramoni, and DK Panda, IEEE Cluster '22, Sep 2022
22	Network-Assisted Non-Contiguous Transfers for GPU-Aware MPI Libraries, K. Suresh, K. Khorassani, C. Chen, B. Ramesh, M. Abduljabbar, A. Shafi, and DK Panda, Hot Interconnects 29, Aug 2022
23	High Performance MPI over the Slingshot Interconnect: Early Experiences, K. Khorassani, C. Chen, B. Ramesh, A. Shafi, H. Subramoni, and DK Panda, Practice and Experience in Advanced Research Computing, Jul 2022 [Best Student Paper Award]
24	Highly Efficient Alltoall and Alltoallv Communication Algorithms for GPU Systems, C. Chen, K. Khorassani, Q. Anthony, A. Shafi, H. Subramoni, and DK Panda, Heterogeneity in Computing Workshop (HCW 2022), May 2022 [held in conjunction with IPDPS'22]
25	OMB-Py: Python Micro-Benchmarks for Evaluating Performance of MPI Libraries on HPC Systems, N. Alnaasan, A. Jain, A. Shafi, H. Subramoni, and DK Panda, 23rd Parallel and Distributed Scientific and Engineering Computing Workshop (PDSEC) at IPDPS22, May 2022
26	Hy-Fi: Hybrid Five-Dimensional Parallel DNN Training on High-Performance GPU Clusters, A. Jain, A. Shafi, Q. Anthony, P. Kousha, H. Subramoni, and DK Panda, ISC HIGH PERFORMANCE 2022, May 2022
27	Accelerating MPI All-to-All Communication with Online Compression on Modern GPU Clusters, Q. Zhou, P. Kousha, Q. Anthony, K. Khorassani, A. Shafi, H. Subramoni, and DK Panda, ISC HIGH PERFORMANCE 2022, May 2022
28	OMB-Py: Python Micro-Benchmarks for Evaluating Performance of MPI Libraries and Machine Learning Applications on HPC Systems, N. Alnaasan, A. Jain, A. Shafi, H. Subramoni, and DK Panda, ISC HIGH PERFORMANCE 2022, May 2022 [Research Poster] [Best Poster Award]
29	DistMILE: A Distributed Multi-Level Framework for Scalable Graph Embedding, Yuntian He, Saket Gurukar, P. Kousha, H. Subramoni, Dhabaleswar K. Panda, and Srinivasan Parthasarathy, 28th IEEE International Conference on High Performance Computing, Data, Analytics, and Data Science, Dec 2021
30	Towards Architecture-aware Hierarchical Communication Trees on Modern HPC Systems, B. Ramesh, J. Hashmi, S. Xu, A. Shafi, M. Ghazimirsaeed, M. Bayatpour, H. Subramoni, and DK Panda, 28th IEEE International Conference on High Performance Computing, Data, Analytics, and Data Science, Dec 2021 [Best Paper Finalist]
31	Accelerating CPU-based Distributed DNN Training on Modern HPC Clusters using BlueField-2 DPUs, A. Jain, N. Alnaasan, A. Shafi, H. Subramoni, and DK Panda, 28th IEEE Hot Interconnects, Aug 2021
32	BluesMPI: Efficient MPI Non-blocking Alltoall Offloading Designs on Modern BlueField Smart NICs, M. Bayatpour, N. Sarkauskas, H. Subramoni, J. Hashmi, and DK Panda, ISC HIGH PERFORMANCE 2021, Jun 2021
33	Designing a ROCm-aware MPI Library for AMD GPUs: Early Experiences, K. Khorassani, J. Hashmi, C. Chu, C. Chen, H. Subramoni, and DK Panda, ISC HIGH PERFORMANCE 2021, Jun 2021
34	SUPER: SUb-Graph Parallelism for TransformERs, A. Jain, T. Moon, T. Benson, H. Subramoni, S. Jacobs, DK Panda, and B. Essen, 35th IEEE International Parallel & Distributed Processing Symposium, May 2021
35	Designing High-Performance MPI Libraries with On-the-fly Compression for Modern GPU Clusters, Q. Zhou, C. Chu, N. Senthil Kumar, P. Kousha, M. Ghazimirsaeed, H. Subramoni, and DK Panda, 35th IEEE International Parallel & Distributed Processing Symposium, May 2021 [Best Paper Finalist]
36	Adaptive and Hierarchical Large Message All-to-all Communication Algorithms for Large-scale Dense GPU Systems, K. Khorassani, C. Chu, Q. Anthony, H. Subramoni, and DK Panda, The 21st IEEE/ACM International Symposium on Cluster, Cloud and Internet Computing, May 2021
37	Efficient MPI-based Communication for GPU-Accelerated Dask Applications, A. Shafi, J. Hashmi, H. Subramoni, and DK Panda, The 21st IEEE/ACM International Symposium on Cluster, Cloud and Internet Computing, May 2021
38	Blink: Towards Efficient RDMA-based Communication Coroutines for Parallel Python Applications, A. Shafi, J. Hashmi, H. Subramoni, and DK Panda, 27th IEEE International Conference on High Performance Computing, Data, and Analytics, Dec 2020
39	GEMS: GPU Enabled Memory Aware Model Parallelism System for Distributed DNN Training, A. Jain, Ammar Awan, A. Aljuhani, J. Hashmi, Q. Anthony, H. Subramoni, DK Panda, R. Machiraju, and A. Parwani, SC 2020, Nov 2020
40	Exploring Hybrid MPI+Kokkos Tasks Programming Model, Samuel Khuvis, K. Tomko, J. Hashmi, and DK Panda, The 3rd Annual Parallel Applications Workshop, Alternatives to MPI+X (PAW-ATM), Nov 2020 [held in conjunction with SC’20]
41	Design and Characterization of InfiniBand Hardware Tag Matching in MPI, M. Bayatpour, M. Ghazimirsaeed, S. Xu, H. Subramoni, and DK Panda, The 20th IEEE/ACM International Symposium on Cluster, Cloud and Internet Computing, Nov 2020
42	Accelerating GPU-based Machine Learning in Python using MPI Library: A Case Study with MVAPICH2-GDR, M. Ghazimirsaeed, Q. Anthony, A. Shafi, H. Subramoni, and DK Panda, 6th Workshop on Machine Learning in HPC Environments, Nov 2020
43	Dynamic Kernel Fusion for Bulk Non-contiguous Data Transfer on GPU Clusters, C. Chu, K. Khorassani, Q. Zhou, H. Subramoni, and DK Panda, 22nd IEEE International Conference on Cluster Computing (IEEE Cluster 2020), Sep 2020
44	Accelerated Real-time Network Monitoring and Profiling at Scale using OSU INAM, P. Kousha, S. D. Kamal Raj, H. Subramoni, DK Panda, H. Na, T. Dockendorf, and K. Tomko, Practice and Experience in Advanced Research Computing 2020, Jul 2020
45	NV-Group: Link-Efficient Reductions for Distributed Deep Learning on Modern Dense GPU Systems, C. Chu, P. Kousha, Ammar Awan, K. Khorassani, H. Subramoni, and DK Panda, The 34th ACM International Conference on Supercomputing (ICS-2020), Jun 2020
46	Communication-Aware Hardware-Assisted MPI Overlap Engine, M. Bayatpour, J. Hashmi, S. Chakraborty, K. Suresh, M. Ghazimirsaeed, B. Ramesh, H. Subramoni, and DK Panda, ISC HIGH PERFORMANCE 2020, Jun 2020
47	HyPar-Flow: Exploiting MPI and Keras for Scalable Hybrid-Parallel DNN Training with TensorFlow, Ammar Awan, A. Jain, Q. Anthony, H. Subramoni, and DK Panda, ISC HIGH PERFORMANCE 2020, Jun 2020
48	Machine-agnostic and Communication-aware Designs for MPI on Emerging Architectures, J. Hashmi, S. Xu, B. Ramesh, M. Bayatpour, H. Subramoni, and DK Panda, 34th IEEE International Parallel & Distributed Processing Symposium (IPDPS '20), May 2020
49	High-Performance Adaptive MPI Derived Datatype Communication for Modern Multi-GPU Systems, C. Chu, J. Hashmi, K. Khorassani, H. Subramoni, and DK Panda, 26th IEEE International Conference on High Performance Computing, Data, Analytics and Data Science (HiPC '19), Dec 2019
50	Design and Evaluation of Shared Memory Communication Benchmarks on Emerging Architectures using MVAPICH2, S. Xu, J. Hashmi, S. Chakraborty, H. Subramoni, and DK Panda, Third Annual Workshop on Emerging Parallel and Distributed Runtime Systems and Middleware, held in conjunction with SC '19, Nov 2019
51	Leveraging Network-level parallelism with Multiple Process-Endpoints for MPI Broadcast, A. Ruhela, B. Ramesh, S. Chakraborty, H. Subramoni, J. Hashmi, and DK Panda, Third Annual Workshop on Emerging Parallel and Distributed Runtime Systems and Middleware, Nov 2019
52	OMB-UM: Design, Implementation, and Evaluation of CUDA Unified Memory Aware MPI Benchmarks, K. Vadambacheri Manian, C. Chu, Ammar Awan, K. Khorassani, H. Subramoni, and DK Panda, 10th International Workshop in Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems, Nov 2019
53	Scaling TensorFlow, PyTorch, and MXNet using MVAPICH2 for High-Performance Deep Learning on Frontera, A. Jain, Ammar Awan, H. Subramoni, and DK Panda, 3rd Deep Learning on Supercomputers Workshop (DLS) at SC19, Nov 2019
54	Performance Characterization of DNN Training using TensorFlow and PyTorch on Modern Clusters, A. Jain, Ammar Awan, Q. Anthony, H. Subramoni, and DK Panda, 21st IEEE International Conference on Cluster Computing, Sep 2019
55	Communication Profiling and Characterization of Deep-Learning Workloads on Clusters With High-Performance Interconnects, Ammar Awan, A. Jain, C. Chu, H. Subramoni, and DK Panda, 26th Symposium on High-Performance Interconnects (HotI '19), Aug 2019
56	Designing Scalable and High-performance MPI Libraries on Amazon Elastic Fabric Adapter, S. Chakraborty, S. Xu, H. Subramoni, and DK Panda, HOT Interconnects 26, Aug 2019
57	Performance Evaluation of MPI Libraries on GPU-enabled OpenPOWER Architectures: Early Experiences, K. Khorassani, C. Chu, H. Subramoni, and DK Panda, International Workshop on OpenPOWER for HPC, held in conjunction with ISC'19, Jun 2019
58	Reduction Operations on Modern Supercomputers: Challenges and Solutions, M. Bayatpour, J. Hashmi, S. Chakraborty, H. Subramoni, and DK Panda, ISC HIGH PERFORMANCE 2019, Jun 2019 [Best Poster Award]
59	FALCON: Efficient Designs for Zero-copy MPI Datatype Processing on Emerging Architectures, J. Hashmi, S. Chakraborty, M. Bayatpour, H. Subramoni, and DK Panda, 33rd IEEE International Parallel & Distributed Processing Symposium (IPDPS '19), May 2019 [Best Paper Finalist]
60	C-GDR: High-Performance Container-aware GPUDirect MPI Communication Schemes on RDMA Networks, J. Zhang, X. Lu, C. Chu, and DK Panda, 33rd IEEE International Parallel & Distributed Processing Symposium (IPDPS '19), May 2019
61	Design and Characterization of Shared Address Space MPI Collectives on Modern Architectures, J. Hashmi, S. Chakraborty, M. Bayatpour, H. Subramoni, and DK Panda, The 19th Annual IEEE/ACM International Symposium in Cluster, Cloud, and Grid Computing (CCGRID 2019), May 2019
62	Scalable Distributed DNN Training using TensorFlow and CUDA-Aware MPI: Characterization, Designs, and Performance Evaluation, Ammar Awan, J. Bedorf, C. Chu, H. Subramoni, and DK Panda, The 19th Annual IEEE/ACM International Symposium in Cluster, Cloud, and Grid Computing (CCGRID 2019), May 2019
63	Characterizing CUDA Unified Memory (UM)-Aware MPI Designs on Modern GPU Architectures, K. Vadambacheri Manian, Ammar Awan, A. Ruhela, C. Chu, and DK Panda, 12th Workshop on General Purpose Processing Using GPU (GPGPU 2019) @ ASPLOS 2019, Apr 2019
64	OC-DNN: Exploiting Advanced Unified Memory Capabilities in CUDA 9 and Volta GPUs for Out-of-Core DNN Training, Ammar Awan, C. Chu, H. Subramoni, X. Lu, and DK Panda, 25th IEEE International Conference on High Performance Computing, Data, and Analytics, Dec 2018
65	Cooperative Rendezvous Protocols for Improved Performance and Overlap, S. Chakraborty, M. Bayatpour, J. Hashmi, H. Subramoni, and DK Panda, 2018 The International Conference for High Performance Computing, Networking, Storage, and Analysis, Nov 2018 [Best Student Paper Finalist]
66	Optimized Broadcast for Deep Learning Workloads on Dense-GPU InfiniBand Clusters: MPI or NCCL?, Ammar Awan, C. Chu, H. Subramoni, and DK Panda, The EuroMPI 2018 Conference, Sep 2018
67	Multi-Threading and Lock-Free MPI RMA Based Graph Processing on KNL and POWER Architectures, M. Li, X. Lu, H. Subramoni, and DK Panda, The EuroMPI 2018 Conference, Sep 2018
68	Efficient Asynchronous Communication Progress for MPI without Dedicated Resources, A. Ruhela, H. Subramoni, S. Chakraborty, M. Bayatpour, P. Kousha, and DK Panda, EuroMPI 2018, Sep 2018
69	SALaR: Scalable and Adaptive Designs for Large Message Reduction Collectives, M. Bayatpour, J. Hashmi, S. Chakraborty, H. Subramoni, P. Kousha, and DK Panda, IEEE Cluster 2018, Sep 2018 [Best Paper Award]
70	Designing Efficient Shared Address Space Reduction Collectives for Multi-/Many-cores, J. Hashmi, S. Chakraborty, M. Bayatpour, H. Subramoni, and DK Panda, 32nd IEEE International Parallel & Distributed Processing Symposium (IPDPS '18), May 2018
71	Kernel-assisted Communication Engine for MPI on Emerging Manycore Processors, J. Hashmi, K. Hamidouche, H. Subramoni, and DK Panda, 24th IEEE International Conference on High Performance Computing, Data, and Analytics (HiPC'17), Dec 2017
72	Designing Registration Caching Free High-Performance MPI Library with Implicit On-Demand Paging (ODP) of InfiniBand, M. Li, X. Lu, H. Subramoni, and DK Panda, 24th IEEE International Conference on High Performance Computing, Data, and Analytics (HiPC'17), Dec 2017
73	MPI-LiFE: Designing High-Performance Linear Fascicle Evaluation of Brain Connectome with MPI, X. Lu, F. Pestilli, C.F. Caiafa, and DK Panda, 24th IEEE International Conference on High Performance Computing, Data, and Analytics (HiPC'17), Dec 2017
74	Is Singularity-based Container Technology Ready for Running MPI Applications on HPC Clouds?, J. Zhang, X. Lu, and DK Panda, 10th IEEE/ACM International Conference on Utility and Cloud Computing, Dec 2017 [Best Student Paper Award]
75	An In-depth Performance Characterization of CPU- and GPU-based DNN Training on Modern Architectures, Ammar Awan, H. Subramoni, and DK Panda, 3rd Workshop on Machine Learning in High Performance Computing Environments, held in conjunction with SC17, Nov 2017
76	Scalable Reduction Collectives with Data Partitioning-based Multi-Leader Design, M. Bayatpour, S. Chakraborty, H. Subramoni, X. Lu, and DK Panda, SuperComputing 2017, Nov 2017
77	MPI Performance Engineering with the MPI Tool Interface: the Integration of MVAPICH and TAU, DK Panda, 24th European MPI Users' Group Meeting, Sep 2017 [Best Paper]
78	Efficient and Scalable Multi-Source Streaming Broadcast on GPU Clusters for Deep Learning, C. Chu, X. Lu, Ammar Awan, H. Subramoni, J. Hashmi, Bracy Elton, and DK Panda, ICPP 2017: International Conference on Parallel Processing, Aug 2017
79	MPI-GDS: High Performance MPI Designs with GPUDirect-aSync for CPU-GPU Control Flow Decoupling, A. Venkatesh, C. Chu, K. Hamidouche, S. Potluri, Davide Rossetti, and DK Panda, ICPP 2017: International Conference on Parallel Processing, Aug 2017
80	Exploiting and Evaluating OpenSHMEM on KNL Architecture, J. Hashmi, M. Li, H. Subramoni, and DK Panda, Fourth Workshop on OpenSHMEM and Related Technologies, Aug 2017
81	Designing Dynamic and Adaptive MPI Point-to-point Communication Protocols for Efficient Overlap of Computation and Communication, H. Subramoni, S. Chakraborty, and DK Panda, International Supercomputing Conference (ISC ’17), Jun 2017 [Hans Meuer Award (Most Outstanding Research Paper)]
82	High-Performance Virtual Machine Migration Framework for MPI Applications on SR-IOV enabled InfiniBand Clusters, J. Zhang, X. Lu, and DK Panda, 31st IEEE International Parallel & Distributed Processing Symposium (IPDPS '17), May 2017
83	Designing Locality and NUMA Aware MPI Runtime for Nested Virtualization based HPC Cloud with SR-IOV Enabled InfiniBand, J. Zhang, X. Lu, and DK Panda, 13th ACM SIGPLAN/SIGOPS International Conference on Virtual Execution Environments (VEE '17), Apr 2017
84	S-Caffe: Co-designing MPI Runtimes and Caffe for Scalable Deep Learning on Modern GPU Clusters, Ammar Awan, K. Hamidouche, J. Hashmi, and DK Panda, 22nd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, Feb 2017
85	Mizan-RMA: Accelerating Mizan Graph Processing Framework with MPI RMA, M. Li, X. Lu, K. Hamidouche, J. Zhang, and DK Panda, 23rd IEEE International Conference on High Performance Computing, Data, and Analytics, Dec 2016
86	Re-designing CNTK Deep Learning Framework on Modern GPU Enabled Clusters, D. Banerjee, K. Hamidouche, and DK Panda, 8th IEEE International Conference on Cloud Computing Technology and Science (IEEE CloudCom '16), Dec 2016
87	Enabling Performance Efficient Runtime Support for Hybrid MPI+UPC++ Programming Models, J. Hashmi, K. Hamidouche, and DK Panda, 18th IEEE International Conference on High Performance Computing and Communications (HPCC'16), Dec 2016
88	Efficient Reliability Support for Hardware Multicast-based Broadcast in GPU-enabled Streaming Applications, C. Chu, K. Hamidouche, H. Subramoni, A. Venkatesh, B. Elton, and DK Panda, First Workshop on Optimization of Communication in HPC runtime systems (COMHPC, SC Workshop), Nov 2016
89	OpenSHMEM NonBlocking Data Movement Operations with MVAPICH2-X: Early Experiences, K. Hamidouche, J. Zhang, K. Tomko, and DK Panda, PGAS Applications Workshop, Nov 2016
90	Designing MPI Library with On-Demand Paging (ODP) of InfiniBand: Challenges and Benefits, M. Li, K. Hamidouche, X. Lu, H. Subramoni, J. Zhang, and DK Panda, SuperComputing 2016, Nov 2016
91	Designing High Performance Heterogeneous Broadcast for Streaming Applications on GPU Clusters, C. Chu, K. Hamidouche, H. Subramoni, A. Venkatesh, B. Elton, and DK Panda, 28th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD'16), Oct 2016
92	Efficient Large Message Broadcast using NCCL and CUDA-Aware MPI for Deep Learning, Ammar Awan, K. Hamidouche, A. Venkatesh, and DK Panda, The 23rd European MPI Users' Group Meeting (EuroMPI 16), Sep 2016 [Best Paper Runner-Up]
93	SLURM-V: Extending SLURM for Building Efficient HPC Cloud with SR-IOV and IVShmem, J. Zhang, X. Lu, S. Chakraborty, and DK Panda, 22nd International European Conference on Parallel and Distributed Computing (Euro-Par '16), Aug 2016
94	High Performance MPI Library for Container-based HPC Cloud on InfiniBand Clusters, J. Zhang, X. Lu, and DK Panda, The 45th International Conference on Parallel Processing (ICPP '16), Aug 2016
95	INAM^2: InfiniBand Network Analysis & Monitoring with MPI, H. Subramoni, A. Augustine, M. Arnold, J. Perkins, X. Lu, K. Hamidouche, and DK Panda, International Supercomputing Conference, Jun 2016
96	Performance Characterization of Hypervisor- and Container-based Virtualization for HPC on SR-IOV Enabled InfiniBand Clusters, J. Zhang, X. Lu, and DK Panda, IPDRM '16 (IPDPS Workshop), May 2016
97	Exploiting Maximal Overlap for Non-Contiguous Data Movement Processing on Modern GPU-enabled System, C. Chu, K. Hamidouche, A. Venkatesh, D. Banerjee, H. Subramoni, and DK Panda, The 30th IEEE International Parallel & Distributed Processing Symposium (IPDPS '16), May 2016
98	CUDA Kernel based Collective Reduction Operations on Large-scale GPU Clusters, C. Chu, K. Hamidouche, A. Venkatesh, Ammar Awan, and DK Panda, 16th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid'16), May 2016
99	SHMEMPMI - Shared Memory based PMI for Improved Performance and Scalability, S. Chakraborty, H. Subramoni, J. Perkins, and DK Panda, 16th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid'16), May 2016
100	High Performance OpenSHMEM Strided Communication Support with InfiniBand UMR, M. Li, K. Hamidouche, X. Lu, J. Zhang, J. Lin, and DK Panda, HiPC '15, Dec 2015
101	A Case for Application-Oblivious Energy-Efficient MPI Runtime A. Venkatesh, A. Vishnu, K. Hamidouche, N. Tallent, DK Panda, D. Kerbyson, and A. Hoisie, Supercomputing 2015, Nov 2015 [Best Student Paper Finalist] [Bib - Plain]
102	GPU-Aware Design, Implementation, and Evaluation of Non-blocking Collective Benchmarks Ammar Awan, K. Hamidouche, A. Venkatesh, J. Perkins, H. Subramoni, and DK Panda, EuroMPI 2015, Sep 2015 [Bib - Plain]
103	High Performance MPI Datatype Support with User-mode Memory Registration: Challenges, Designs and Benefits M. Li, H. Subramoni, K. Hamidouche, X. Lu, and DK Panda, IEEE Cluster 2015, Sep 2015 [Bib - Plain]
104	Exploiting GPUDirect RDMA in Designing High Performance OpenSHMEM for NVIDIA GPU Clusters K. Hamidouche, A. Venkatesh, Ammar Awan, H. Subramoni, and DK Panda, IEEE Cluster 2015, Sep 2015 [Bib - Plain]
105	Impact of InfiniBand DC Transport Protocol on Energy Consumption of All-to-all Collective Algorithms H. Subramoni, A. Venkatesh, K. Hamidouche, K. Tomko, and DK Panda, 23rd International Symposium on High Performance Interconnects 2015, Aug 2015 [Bib - Plain]
106	High Performance and Scalable Design of MPI-3 RMA on Xeon Phi Clusters M. Li, K. Hamidouche, X. Lu, J. Lin, and DK Panda, Euro-Par 2015, Aug 2015 [Bib - Plain]
107	A Case for Non-Blocking Collectives in OpenSHMEM: Design, Implementation, and Performance Evaluation using MVAPICH2-X Ammar Awan, K. Hamidouche, C. Chu, and DK Panda, OpenSHMEM 2015 for PGAS Programming in the Exascale Era, Aug 2015 [Bib - Plain]
108	Designing Non-Blocking Personalized Collectives with Near Perfect Overlap for RDMA-Enabled Clusters H. Subramoni, Ammar Awan, K. Hamidouche, D. Pekurovsky, A. Venkatesh, S. Chakraborty, K. Tomko, and DK Panda, ISC '15, Jul 2015 [Bib - Plain]
109	On-demand Connection Management for OpenSHMEM and OpenSHMEM+MPI S. Chakraborty, H. Subramoni, J. Perkins, Ammar Awan, and DK Panda, HIPS '15 (IPDPS Workshop), May 2015 [Bib - Plain]
110	High-Performance Coarray Fortran Support with MVAPICH2-X: Initial Experience and Evaluation J. Lin, K. Hamidouche, X. Lu, M. Li, and DK Panda, HIPS '15 (IPDPS Workshop), May 2015 [Bib - Plain]
111	Non-blocking PMI Extensions for Fast MPI Startup S. Chakraborty, H. Subramoni, A. Moody, A. Venkatesh, J. Perkins, and DK Panda, CCGrid '15, May 2015 [Bib - Plain]
112	MVAPICH2 over OpenStack with SR-IOV: An Efficient Approach to Build HPC Clouds J. Zhang, X. Lu, M. Arnold, and DK Panda, CCGrid '15, May 2015 [Bib - Plain]
113	Power-Check: An Energy-Efficient Checkpointing Framework for HPC Clusters R. Rajachandrasekar, A. Venkatesh, K. Hamidouche, and DK Panda, CCGrid '15, May 2015 [Bib - Plain]
114	High Performance MPI Library over SR-IOV Enabled InfiniBand Clusters J. Zhang, X. Lu, J. Jose, M. Li, R. Shi, and DK Panda, International Conference on High Performance Computing (HiPC'14), Dec 2014 [Bib - Plain]
115	Designing Efficient Small Message Transfer Mechanism for Inter-node MPI Communication on InfiniBand GPU Clusters R. Shi, S. Potluri, K. Hamidouche, M. Li, J. Perkins, D. Rossetti, and DK Panda, IEEE International Conference on High Performance Computing (HiPC ’14), Dec 2014 [Bib - Plain]
116	A High Performance Broadcast Design with Hardware Multicast and GPUDirect RDMA for Streaming Applications on Infiniband Clusters A. Venkatesh, H. Subramoni, K. Hamidouche, and DK Panda, IEEE International Conference on High Performance Computing (HiPC ’14), Dec 2014 [Bib - Plain]
117	Scalable MiniMD Design with Hybrid MPI and OpenSHMEM M. Li, J. Lin, X. Lu, K. Hamidouche, K. Tomko, and DK Panda, OUG '14 (Co-located with PGAS), Oct 2014 [Bib - Plain]
118	Designing Scalable Out-of-core Sorting with Hybrid MPI+PGAS Programming Models J. Jose, S. Potluri, H. Subramoni, X. Lu, K. Hamidouche, K. Schulz, H. Sundar, and DK Panda, International Conference on Partitioned Global Address Space Programming Models (PGAS '14), Oct 2014 [Bib - Plain]
119	PMI Extensions for Scalable MPI Startup S. Chakraborty, H. Subramoni, J. Perkins, A. Moody, M. Arnold, and DK Panda, EuroMPI/ASIA 2014, Sep 2014 [Bib - Plain]
120	Understanding the Memory-Utilization of MPI Libraries: Challenges and Designs in Implementing the MPI_T Interface R. Rajachandrasekar, J. Perkins, K. Hamidouche, M. Arnold, and DK Panda, EuroMPI/ASIA 2014, Sep 2014 [Bib - Plain]
121	HAND: A Hybrid Approach to Accelerate Non-contiguous Data Movement using MPI Datatypes on GPU Clusters R. Shi, X. Lu, S. Potluri, K. Hamidouche, J. Zhang, and DK Panda, International Conference on Parallel Processing (ICPP’14), Sep 2014 [Bib - Plain]
122	Designing Topology-Aware Communication Schedules for Alltoall Operations in Large InfiniBand Clusters H. Subramoni, K. Kandalla, J. Jose, K. Tomko, K. Schulz, D. Pekurovsky, and DK Panda, International Conference on Parallel Processing (ICPP’14), Sep 2014 [Bib - Plain]
123	High Performance OpenSHMEM for MIC Clusters: Extensions, Runtime Designs, and Application Co-Design J. Jose, K. Hamidouche, X. Lu, S. Potluri, J. Zhang, K. Tomko, and DK Panda, IEEE CLUSTER’14, Sep 2014 [Bib - Plain]
124	Scalable Graph500 Design with MPI-3 RMA M. Li, X. Lu, S. Potluri, K. Hamidouche, J. Jose, K. Tomko, and DK Panda, IEEE CLUSTER’14, Sep 2014 [Bib - Plain]
125	Can Inter-VM Shmem Benefit MPI Applications on SR-IOV based Virtualized InfiniBand Clusters? J. Zhang, X. Lu, J. Jose, R. Shi, and DK Panda, Euro-Par 2014 Parallel Processing, Aug 2014 [Bib - Plain]
126	MIC-Check: A Distributed Checkpointing Framework for the Intel Many Integrated Cores Architecture R. Rajachandrasekar, S. Potluri, A. Venkatesh, K. Hamidouche, M. W. Rahman, and DK Panda, International Symposium on High Performance and Distributed Computing (HPDC), Jun 2014 [Bib - Plain]
127	Designing MPI Library with Dynamic Connected Transport (DCT) of InfiniBand: Early Experiences H. Subramoni, K. Hamidouche, A. Venkatesh, S. Chakraborty, and DK Panda, IEEE International Supercomputing Conference (ISC ’14), Jun 2014 [Bib - Plain]
128	High Performance Alltoall and Allgather designs for InfiniBand MIC Clusters A. Venkatesh, S. Potluri, R. Rajachandrasekar, M. Luo, K. Hamidouche, and DK Panda, International Parallel and Distributed Processing Symposium (IPDPS’14), May 2014 [Bib - Plain]
129	Optimizing Collective Communication in UPC J. Jose, K. Hamidouche, J. Zhang, A. Venkatesh, and DK Panda, International Workshop on High-Level Parallel Programming Models and Supportive Environments (HIPS '14), May 2014 [Slides] [Bib - Plain]
130	A Comprehensive Performance Evaluation of OpenSHMEM Libraries on InfiniBand Clusters J. Jose, J. Zhang, A. Venkatesh, S. Potluri, and DK Panda, OpenSHMEM Workshop, Mar 2014 [Bib - Plain]
131	Initial Study of Multi-Endpoint Runtime for MPI+OpenMP Hybrid Programming Model on Multi-Core Systems M. Luo, X. Lu, K. Hamidouche, K. Kandalla, and DK Panda, International Symposium on Principles and Practice of Parallel Programming (PPoPP '14), Feb 2014 [Bib - Plain]
132	The MVAPICH Project: Evolution and Sustainability of an Open Source Production Quality MPI Library for HPC DK Panda, K. Tomko, K. Schulz, and A. Majumdar, Int'l Workshop on Sustainable Software for Science: Practice and Experiences, Nov 2013 [Bib - Plain]
133	MVAPICH-PRISM: A Proxy-based Communication Framework using InfiniBand and SCIF for Intel MIC Clusters S. Potluri, D. Bureddy, K. Hamidouche, A. Venkatesh, K. Kandalla, H. Subramoni, and DK Panda, International Conference on Supercomputing, Nov 2013 [Bib - Plain]
134	A Novel Functional Partitioning Approach to Design High-Performance MPI-3 Non-Blocking Alltoallv Collective on Multi-core Systems K. Kandalla, H. Subramoni, K. Tomko, D. Pekurovsky, and DK Panda, International Conference on Parallel Processing (ICPP '13), Oct 2013 [Bib - Plain]
135	UPC on MIC: Early Experiences with Native and Symmetric Modes M. Luo, M. Li, A. Venkatesh, X. Lu, and DK Panda, International Conference on Partitioned Global Address Space Programming Models (PGAS '13), Oct 2013 [Bib - Plain]
136	Optimizing Collective Communication in OpenSHMEM J. Jose, K. Kandalla, S. Potluri, J. Zhang, and DK Panda, International Conference on Partitioned Global Address Space Programming Models (PGAS '13), Oct 2013 [Bib - Plain]
137	Efficient Inter-node MPI Communication using GPUDirect RDMA for InfiniBand Clusters with NVIDIA GPUs S. Potluri, K. Hamidouche, A. Venkatesh, D. Bureddy, and DK Panda, International Conference on Parallel Processing 2013, Oct 2013 [Bib - Plain]
138	Design of Network Topology Aware Scheduling Services for Large InfiniBand Clusters H. Subramoni, D. Bureddy, K. Kandalla, K. Schulz, B. Barth, J. Perkins, M. Arnold, and DK Panda, IEEE Cluster (Cluster '13), Sep 2013 [Bib - Plain]
139	A Scalable and Portable Approach to Accelerate Hybrid HPL on Heterogeneous CPU-GPU Clusters R. Shi, S. Potluri, K. Hamidouche, X. Lu, K. Tomko, and DK Panda, IEEE Cluster (Cluster '13), Sep 2013 [Bib - Plain]
140	Efficient and Truly Passive MPI-3 RMA Using InfiniBand Atomics M. Li, S. Potluri, K. Hamidouche, J. Jose, and DK Panda, EuroMPI 2013, Sep 2013 [Slides] [Bib - Plain]
141	Designing Optimized MPI Broadcast and Allreduce for Many Integrated Core (MIC) InfiniBand Clusters K. Kandalla, A. Venkatesh, K. Hamidouche, S. Potluri, and DK Panda, International Symposium on High-Performance Interconnects (HotI '13), Aug 2013 [Bib - Plain]
142	MVAPICH2-MIC: A High-Performance MPI Library for Xeon Phi Clusters with InfiniBand S. Potluri, K. Hamidouche, D. Bureddy, and DK Panda, Extreme Scaling Workshop, Aug 2013 [Bib - Plain]
143	Optimized MPI Gather collective for Many Integrated Core (MIC) InfiniBand Clusters A. Venkatesh, K. Kandalla, and DK Panda, Extreme Scaling Workshop, Aug 2013 [Bib - Plain]
144	A 1PB/s File System to Checkpoint Three Million MPI Tasks R. Rajachandrasekar, A. Moody, K. Mohror, and DK Panda, International Conference on High Performance Distributed Computing (HPDC '13), Jun 2013 [Slides] [Bib - Plain]
145	Designing Scalable Graph500 Benchmark with Hybrid MPI+OpenSHMEM Programming Models J. Jose, S. Potluri, K. Tomko, and DK Panda, International Supercomputing Conference (ISC '13), Jun 2013 [Slides] [Bib - Plain]
146	MIC-RO: Enabling Efficient Remote Offload on Heterogeneous Many Integrated Core (MIC) Clusters with InfiniBand K. Hamidouche, S. Potluri, H. Subramoni, K. Kandalla, and DK Panda, International Conference on Supercomputing (ICS '13), Jun 2013 [Bib - Plain]
147	Extending OpenSHMEM for GPU Computing S. Potluri, D. Bureddy, H. Wang, H. Subramoni, and DK Panda, International Parallel and Distributed Processing Symposium (IPDPS '13), May 2013 [Slides] [Bib - Plain]
148	Evaluation of Energy Characteristics of MPI Communication Primitives with RAPL A. Venkatesh, K. Kandalla, and DK Panda, International Workshop on High-Performance, Power-Aware Computing, May 2013 [Bib - Plain]
149	High Performance RDMA-Based Design of HDFS over InfiniBand N. Islam, M. W. Rahman, J. Jose, R. Rajachandrasekar, H. Wang, H. Subramoni, C. Murthy, and DK Panda, International Conference on Supercomputing (SC '12), Nov 2012 [Slides] [Bib - Plain]
150	Design of a Scalable InfiniBand Topology Service to Enable Network-Topology-Aware Placement of Processes H. Subramoni, S. Potluri, K. Kandalla, B. Barth, J. Vienne, J. Keasler, K. Tomko, K. Schulz, A. Moody, and DK Panda, International Conference on Supercomputing (SC '12), Nov 2012 [Bib - Plain]
151	Multi-Threaded UPC Runtime for GPU to GPU communication over InfiniBand M. Luo, H. Wang, and DK Panda, International Conference on Partitioned Global Address Space Programming Models (PGAS '12), Oct 2012 [Slides] [Bib - Plain]
152	Supporting Hybrid MPI and OpenSHMEM over InfiniBand: Design and Performance Evaluation J. Jose, K. Kandalla, M. Luo, and DK Panda, International Conference on Parallel Processing (ICPP '12), Sep 2012 [Bib - Plain]
153	OMB-GPU: A Micro-benchmark suite for Evaluating MPI Libraries on GPU Clusters D. Bureddy, H. Wang, A. Venkatesh, S. Potluri, and DK Panda, EuroMPI 2012, Sep 2012 [Bib - Plain]
154	Minimizing Network Contention in InfiniBand Clusters with a QoS-Aware Data-Staging Framework R. Rajachandrasekar, J. Jaswani, H. Subramoni, and DK Panda, IEEE Cluster (Cluster '12), Sep 2012 [Bib - Plain]
155	Can Network-Offload based Non-Blocking Neighborhood MPI Collectives Improve Communication Overheads of Irregular Graph Algorithms? K. Kandalla, H. Subramoni, K. Tomko, J. Vienne, L. Oliker, and DK Panda, Int'l Workshop on Parallel Algorithm and Parallel Software (IWPAPS12), held in conjunction with IEEE Cluster (Cluster '12), Sep 2012 [Bib - Plain]
156	A Scalable InfiniBand Network-Topology-Aware Performance Analysis Tool for MPI H. Subramoni, J. Vienne, and DK Panda, International Workshop on Productivity and Performance (Proper '12), Aug 2012 [Bib - Plain]
157	Performance Analysis and Evaluation of InfiniBand FDR and 40GigE RoCE on HPC and Cloud Computing System J. Vienne, J. Chen, M. W. Rahman, N. Islam, H. Subramoni, and DK Panda, International Symposium on High-Performance Interconnects (HotI 2012), Aug 2012 [Bib - Plain]
158	Congestion Avoidance on Manycore High Performance Computing Systems M. Luo, DK Panda, C. Iancu, and K. Z. Ibrahim, International Conference on Supercomputing (ICS '12), Jun 2012 [Bib - Plain]
159	Redesigning MPI Shared Memory Communication for Large Multi-Core Architecture M. Luo, H. Wang, J. Vienne, and DK Panda, International Supercomputing Conference 2012, Jun 2012 [Bib - Plain]
160	Designing Non-blocking Allreduce with Collective Offload on InfiniBand Clusters: A Case Study with Conjugate Gradient Solvers K. Kandalla, U. Yang, J. Keasler, T. Kolev, A. Moody, H. Subramoni, K. Tomko, J. Vienne, and DK Panda, International Parallel and Distributed Processing Symposium 2012, May 2012 [Bib - Plain]
161	Designing Network Failover and Recovery in MPI for Multi-Rail InfiniBand Clusters S. P. Raikar, H. Subramoni, K. Kandalla, J. Vienne, and DK Panda, International Workshop on System Management Techniques, May 2012 [Bib - Plain]
162	Monitoring and Predicting Hardware Failures in HPC Clusters with FTB-IPMI R. Rajachandrasekar, X. Besseron, and DK Panda, International Workshop on System Management Techniques, May 2012 [Bib - Plain]
163	Optimizing MPI Communication on Multi-GPU Systems using CUDA Inter-Process Communication S. Potluri, H. Wang, D. Bureddy, A. Singh, C. Rosales, and DK Panda, International Workshop on Accelerators and Hybrid Exascale Systems (AsHES), May 2012 [Slides] [Bib - Plain]
164	Intra-MIC MPI Communication using MVAPICH2: Early Experience S. Potluri, K. Tomko, D. Bureddy, and DK Panda, TACC-Intel Highly-Parallel Computing Symposium, Apr 2012 [Slides] [Bib - Plain]
165	Multi-threaded UPC Runtime with Network Endpoints: Design Alternatives and Evaluation on Multi-core Architectures M. Luo, J. Jose, S. Sur, and DK Panda, International Conference on High Performance Computing (HiPC '11), Dec 2011 [Slides] [Bib - Plain]
166	UPC Queues for Scalable Graph Traversals: Design and Evaluation on InfiniBand Clusters J. Jose, S. Potluri, M. Luo, S. Sur, and DK Panda, Fifth Conference on Partitioned Global Address Space Programming Model (PGAS '11), Oct 2011 [Slides] [Bib - Plain]
167	Can a Decentralized Metadata Service Layer benefit Parallel Filesystems? V. Meshram, X. Besseron, X. Ouyang, R. Rajachandrasekar, and DK Panda, Workshop on Interfaces and Architectures for Scientific Data Storage (IASDS '11), held in conjunction with Cluster '11, Sep 2011 [Bib - Plain]
168	MPI Alltoall Personalized Exchange on GPGPU Clusters: Design Alternatives and Benefits A. Singh, S. Potluri, H. Wang, K. Kandalla, S. Sur, and DK Panda, International Workshop on Parallel Programming on Accelerator Clusters (PPAC '11), Sep 2011 [Slides] [Bib - Plain]
169	Design and Evaluation of Network Topology-/Speed- Aware Broadcast Algorithms for InfiniBand Clusters H. Subramoni, K. Kandalla, J. Vienne, S. Sur, B. Barth, K. Tomko, R. McLay, K. Schulz, and DK Panda, IEEE Cluster '11, Sep 2011 [Bib - Plain]
170	Optimized Non-contiguous MPI Datatype Communication for GPU Clusters: Design Implementation and Evaluation with MVAPICH2 H. Wang, S. Potluri, M. Luo, A. Singh, X. Ouyang, S. Sur, and DK Panda, IEEE Cluster '11, Sep 2011 [Slides] [Bib - Plain]
171	Optimizing MPI One Sided Communication on Multi-core InfiniBand Clusters using Shared Memory Backed Windows S. Potluri, H. Wang, V. Dhanraj, S. Sur, and DK Panda, EuroMPI '11, Sep 2011 [Bib - Plain]
172	Design and Implementation of Key Proposed MPI-3 One-Sided Communication Semantics on InfiniBand S. Potluri, S. Sur, D. Bureddy, and DK Panda, EuroMPI '11, Sep 2011 [Slides] [Poster/Short Paper] [Bib - Plain]
173	CRFS: A Lightweight User-Level Filesystem for Generic Checkpoint/Restart X. Ouyang, R. Rajachandrasekar, X. Besseron, H. Wang, J. Huang, and DK Panda, International Conference on Parallel Processing (ICPP '11), Sep 2011 [Slides] [Bib - Plain]
174	Can Checkpoint/Restart Mechanisms Benefit from Hierarchical Data Staging? R. Rajachandrasekar, X. Ouyang, X. Besseron, V. Meshram, and DK Panda, Workshop on Resiliency in High Performance Computing in Clusters, Clouds, and Grids 2011, held in conjunction with EuroPar, Aug 2011 [Bib - Plain]
175	INAM - A Scalable InfiniBand Network Analysis and Monitoring Tool N. Dandapanthula, H. Subramoni, J. Vienne, K. Kandalla, S. Sur, DK Panda, and R. Brightwell, 4th International Workshop on Productivity and Performance (PROPER 2011), Aug 2011 [Slides] [Bib - Plain]
176	Designing Non-blocking Broadcast with Collective Offload on InfiniBand Clusters: A Case Study with HPL K. Kandalla, H. Subramoni, J. Vienne, K. Tomko, S. Sur, and DK Panda, Hot Interconnect '11, Aug 2011 [Bib - Plain]
177	High-Performance and Scalable Non-Blocking All-to-All with Collective Offload on InfiniBand Clusters: A Study with Parallel 3D FFT K. Kandalla, H. Subramoni, K. Tomko, D. Pekurovsky, S. Sur, and DK Panda, International Supercomputing Conference '11 (ISC'11), Jun 2011 [Bib - Plain]
178	MVAPICH2-GPU: Optimized GPU to GPU Communication for InfiniBand Clusters H. Wang, S. Potluri, M. Luo, A. Singh, S. Sur, and DK Panda, International Supercomputing Conference '11 (ISC'11), Jun 2011 [Slides] [Bib - Plain]
179	Efficient Intra-node Communication on Intel-MIC Clusters S. Potluri, A. Venkatesh, D. Bureddy, K. Kandalla, and DK Panda, International Symposium on Cluster, May 2011 [Slides] [Bib - Plain]
180	SR-IOV Support for Virtualization on InfiniBand Clusters: Early Experience J. Jose, M. Li, X. Lu, K. Kandalla, M. Arnold, and DK Panda, International Symposium on Cluster, May 2011 [Slides] [Bib - Plain]
181	High Performance Pipelined Process Migration with RDMA X. Ouyang, R. Rajachandrasekar, X. Besseron, and DK Panda, International Symposium on Cluster, May 2011 [Slides] [Bib - Plain]
182	Beyond Block I/O: Rethinking Traditional Storage Primitives X. Ouyang, D. Nellans, R. Wipfel, D. Flynn, and DK Panda, 17th IEEE International Symposium on High Performance Computer Architecture (HPCA-17), Feb 2011 [Slides] [Bib - Plain]
183	Scalable Earthquake Simulation on Petascale Supercomputers Y. Cui, K. B. Olsen, T. H. Jordan, K. Lee, J. Zhou, P. Small, D. Roten, G. Ely, DK Panda, A. Chourasia, J. Levesque, S. M. Day, and P. Maechling, SuperComputing 2010, Nov 2010 [Bib - Plain]
184	Unifying UPC and MPI Runtimes: Experience with MVAPICH J. Jose, M. Luo, S. Sur, and DK Panda, International Workshop on Partitioned Global Address Space (PGAS '10), Oct 2010 [Slides] [Bib - Plain]
185	RDMA-Based Job Migration Framework for MPI over InfiniBand X. Ouyang, S. Marcarelli, R. Rajachandrasekar, and DK Panda, IEEE International Conference on Cluster Computing (Cluster '10), Sep 2010 [Bib - Plain]
186	Improving Application Performance and Predictability using Multiple Virtual Lanes in Modern Multi-Core InfiniBand Clusters H. Subramoni, P. Lai, S. Sur, and DK Panda, International Conference on Parallel Processing (ICPP '10), Sep 2010 [Slides] [Bib - Plain]
187	Designing Power-Aware Collective Communication Algorithms for InfiniBand Clusters K. Kandalla, E. Mancini, S. Sur, and DK Panda, International Conference on Parallel Processing (ICPP '10), Sep 2010 [Slides] [Bib - Plain]
188	High Performance Design and Implementation of Nemesis Communication Layer for Two-sided and One-Sided MPI Semantics in MVAPICH2 M. Luo, S. Potluri, P. Lai, E. Mancini, H. Subramoni, K. Kandalla, S. Sur, and DK Panda, International Workshop on Parallel Programming Models and Systems Software for High-End Computing (P2S2 '10), Sep 2010 [Bib - Plain]
189	Design and Evaluation of Generalized Collective Communication Primitives with Overlap using ConnectX-2 Offload Engine H. Subramoni, K. Kandalla, S. Sur, and DK Panda, International Symposium on High Performance Interconnects 2010, Aug 2010 [Bib - Plain]
190	Quantifying Performance Benefits of Overlap using MPI-2 in a Seismic Modeling Application S. Potluri, P. Lai, K. Tomko, S. Sur, Y. Cui, M. Tatineni, K. Schulz, W. Barth, A. Majumdar, and DK Panda, 24th International Conference on Supercomputing (ICS), Jun 2010 [Bib - Plain]
191	Designing Truly One-Sided MPI-2 RMA Intra-node Communication on Multi-core Systems P. Lai, S. Sur, and DK Panda, 24th International Conference on Supercomputing (ICS), Jun 2010 [Slides] [Bib - Plain]
192	High Performance Data Transfer in Grid Environment Using GridFTP over InfiniBand H. Subramoni, P. Lai, R. Kettimuthu, and DK Panda, 16th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid'10), May 2010 [Slides] [Bib - Plain]
193	Enhancing Checkpoint Performance with Staging IO and SSD X. Ouyang, S. Marcarelli, and DK Panda, IEEE International Workshop on Storage Network Architecture and Parallel I/Os (SNAPI), May 2010 [Slides] [Bib - Plain]
194	Designing Topology-Aware Collective Communication Algorithms for Large Scale InfiniBand Clusters: Case Studies with Scatter and Gather K. Kandalla, H. Subramoni, A. Vishnu, and DK Panda, International Workshop on Communication Architecture for Clusters (CAC 10), Apr 2010 [Bib - Plain]
195	Designing High-Performance and Resilient Message Passing on InfiniBand M. Koop, P. Shamis, I. Rabinovitz, and DK Panda, International Workshop on Communication Architecture for Clusters (CAC 10), Apr 2010 [Bib - Plain]
196	Designing Efficient FTP Mechanisms for High Performance Data-Transfer over InfiniBand P. Lai, H. Subramoni, S. Narravula, A. Mamidala, and DK Panda, International Conference on Parallel Processing (ICPP '09), Sep 2009 [Slides] [Bib - Plain]
197	Accelerating Checkpoint Operation by Node-Level Write Aggregation on Multicore Systems X. Ouyang, K. Gopalakrishnan, and DK Panda, International Conference on Parallel Processing (ICPP '09), Sep 2009 [Slides] [Bib - Plain]
198	CIFTS: A Coordinated Infrastructure for Fault-Tolerant Systems R. Gupta, P. Beckman, H. Park, E. Lusk, P. Hargrove, A. Geist, DK Panda, A. Lumsdaine, and J. Dongarra, International Conference on Parallel Processing (ICPP '09), Sep 2009 [Bib - Plain]
199	Designing and Evaluating MPI-2 Dynamic Process Management Support for InfiniBand T. Gangadharappa, M. Koop, and DK Panda, International Workshop on Parallel Programming Models and Systems Software for High-End Computing (P2S2 '09), Sep 2009 [Bib - Plain]
200	Impact of Node Level Caching in MPI Job Launch Mechanisms J. Sridhar and DK Panda, EuroPVM/MPI '09, Sep 2009 [Slides] [Bib - Plain]
201	An Efficient Hardware-Software Approach to Network Fault Tolerance with InfiniBand A. Vishnu, M. Krishnan, and DK Panda, International Conference on Cluster Computing (Cluster '09), Sep 2009 [Slides] [Bib - Plain]
202	Reducing Network Contention with Mixed Workloads on Modern Multicore Clusters M. Koop, M. Luo, and DK Panda, International Conference on Cluster Computing (Cluster '09), Sep 2009 [Slides] [Bib - Plain]
203	Design Alternatives for Implementing Fence Synchronization in MPI-2 One-sided Communication on InfiniBand Clusters G. Santhanaraman, T. Gangadharappa, S. Narravula, A. Mamidala, and DK Panda, International Conference on Cluster Computing (Cluster '09), Sep 2009 [Slides] [Bib - Plain]
204	RDMA over Ethernet - A Preliminary Study H. Subramoni, P. Lai, M. Luo, and DK Panda, International Workshop on High Performance Distributed Computing (HPI-DC '09), Sep 2009 [Slides] [Bib - Plain]
205	ProOnE: A General Purpose Protocol Onload Engine for Multi- and Many-Core Architectures P. Lai, P. Balaji, R. Thakur, and DK Panda, International Supercomputing Conference (ISC), Jun 2009 [Bib - Plain]
206	Designing Multi-Leader-Based Allgather Algorithms for Multi-Core Clusters K. Kandalla, H. Subramoni, G. Santhanaraman, and DK Panda, International Workshop on Communication Architecture for Clusters (CAC'09), May 2009 [Slides] [Bib - Plain]
207	Fast Checkpointing by Write Aggregation with Dynamic Buffer and Interleaving on Multicore Architecture X. Ouyang, K. Gopalakrishnan, and DK Panda, Int'l Conference on High Performance Computing 2009, Feb 2009 [Slides] [Bib - Plain]
208	ScELA: Scalable and Extensible Launching Architecture for Clusters J. Sridhar, M. Koop, J. Perkins, and DK Panda, International Symposium on High Performance Computing (HiPC), Dec 2008 [Slides] [Bib - Plain]
209	Designing High Performance pNFS With RDMA on InfiniBand R. Noronha, X. Ouyang, and DK Panda, International Symposium on High Performance Computing (HiPC), Dec 2008 [Bib - Plain]
210	Sockets Direct Protocol for Hybrid Network Stacks: A Case Study with iWARP over 10G Ethernet P. Balaji, S. Bhagvat, R. Thakur, and DK Panda, International Symposium on High Performance Computing (HiPC), Dec 2008 [Slides] [Bib - Plain]
211	Design and Evaluation of Benchmarks for Financial Applications using Advanced Message Queuing Protocol (AMQP) over InfiniBand H. Subramoni, G. Marsh, S. Narravula, P. Lai, and DK Panda, Workshop on High Performance Computational Finance (In conjunction with SC '08), Nov 2008 [OSU Technical Report Version (OSU-CISRC-10/08-TR51)] [Bib - Plain]
212	Scalable MPI Design over InfiniBand using eXtended Reliable Connection M. Koop, J. Sridhar, and DK Panda, IEEE Cluster 2008, Sep 2008 [Slides] [Bib - Plain]
213	Efficient One-Copy MPI Shared Memory Communication in Virtual Machines W. Huang, M. Koop, and DK Panda, IEEE Cluster 2008, Sep 2008 [Slides] [Bib - Plain]
214	IMCa: A High Performance Caching Frontend for GlusterFS on InfiniBand R. Noronha and DK Panda, International Conference on Parallel Processing 2008, Sep 2008 [Slides] [Bib - Plain]
215	Performance of HPC middleware over InfiniBand WAN S. Narravula, H. Subramoni, P. Lai, R. Noronha, and DK Panda, International Conference on Parallel Processing 2008, Sep 2008 [Bib - Plain]
216	Designing An Efficient Kernel-level and User-level Hybrid Approach for MPI Intra-node Communication on Multi-core Systems L. Chai, P. Lai, H. Jin, and DK Panda, International Conference on Parallel Processing 2008, Sep 2008 [Slides] [Bib - Plain]
217	Can Software Reliability Outperform Hardware Reliability on High Performance Interconnects? A Case Study with MPI over InfiniBand M. Koop, R. Kumar, and DK Panda, 22nd ACM International Conference on Supercomputing (ICS '08), Jun 2008 [Bib - Plain]
218	Advanced RDMA-based Admission Control for Modern Data-Centers P. Lai, S. Narravula, K. Vaidyanathan, and DK Panda, CCGrid '08, May 2008 [Slides] [Bib - Plain]
219	Optimized Distributed Data Sharing Substrate in Multi-Core Commodity Clusters: A Comprehensive Study with Applications K. Vaidyanathan and S. Narravula, CCGrid '08, May 2008 [Slides] [Bib - Plain]
220	MPI Collectives on modern Multicore clusters: Performance Optimizations and Communication Characteristics A. Mamidala, R. Kumar, D. De, and DK Panda, CCGrid '08, May 2008 [Bib - Plain]
221	Scaling Alltoall Collective on Multi-core Systems R. Kumar, A. Mamidala, and DK Panda, International Workshop on Communication Architecture for Clusters, Apr 2008 [Slides] [Bib - Plain]
222	pNFS/PVFS2 over InfiniBand: Early Experiences L. Chai, X. Ouyang, R. Noronha, and DK Panda, Petascale Data Storage Workshop, Nov 2007 [Slides] [Bib - Plain]
223	Virtual Machine Aware Communication Libraries for High Performance Computing W. Huang, M. Koop, Q. Gao, and DK Panda, SuperComputing (SC'07), Nov 2007 [Slides] [Best Student Paper Finalist] [Bib - Plain]
224	Enhancing the Performance of NFSv4 with RDMA R. Noronha, L. Chai, S. Shepler, and DK Panda, International Workshop on Storage Network Architecture and Parallel I/Os (SNAPI'07), Sep 2007 [Bib - Plain]
225	MPI-2 One Sided Usage and Implementation for Read Modify Write operations: A case study with HPCC G. Santhanaraman, S. Narravula, A. Mamidala, and DK Panda, EuroPVM/MPI 2007, Sep 2007 [Bib - Plain]
226	Zero-Copy Protocol for MPI using InfiniBand Unreliable Datagram M. Koop, S. Sur, and DK Panda, IEEE International Conference on Cluster Computing, Sep 2007 [Bib - Plain]
227	High Performance Virtual Machine Migration with RDMA over Modern Interconnects W. Huang, Q. Gao, J. Liu, and DK Panda, IEEE International Conference on Cluster Computing, Sep 2007 [Best Paper] [Bib - Plain]
228	Efficient Asynchronous Memory Copy Operations on Multi-Core Systems and I/OAT K. Vaidyanathan, L. Chai, W. Huang, and DK Panda, IEEE International Conference on Cluster Computing, Sep 2007 [Bib - Plain]
229	Group-based Coordinated Checkpointing for MPI: A Case Study on InfiniBand Q. Gao, W. Huang, M. Koop, and DK Panda, International Conference on Parallel Processing (ICPP'07), Sep 2007 [Slides] [Bib - Plain]
230	High Performance MPI over iWARP: Early Experiences S. Narravula, A. Mamidala, A. Vishnu, G. Santhanaraman, and DK Panda, Sep 2007 [Bib - Plain]
231	Designing NFS With RDMA For Security, Performance and Scalability R. Noronha, L. Chai, T. Talpey, and DK Panda, International Conference on Parallel Processing 2007, Sep 2007 [Bib - Plain]
232	Designing Next Generation Clusters: Evaluation of InfiniBand DDR/QDR on Intel Computing Platforms H. Subramoni, M. Koop, and DK Panda, International Symposium on Hot Interconnects (HotI), Aug 2007 [Slides] [Bib - Plain]
233	Performance Analysis and Evaluation of PCIe 2.0 and Quad-Data Rate InfiniBand M. Koop, W. Huang, K. Gopalakrishnan, and DK Panda, International Symposium on Hot Interconnects (HotI), Aug 2007 [Bib - Plain]
234	Performance Analysis and Evaluation of Mellanox ConnectX InfiniBand Architecture with Multi-Core Platforms S. Sur, M. Koop, L. Chai, and DK Panda, International Symposium on Hot Interconnects (HotI), Aug 2007 [Slides] [Bib - Plain]
235	High Performance MPI Design using Unreliable Datagram for Ultra-Scale InfiniBand Clusters M. Koop, S. Sur, Q. Gao, and DK Panda, 21st International ACM Conference on Supercomputing (ICS '07), Jun 2007 [Bib - Plain]
236	Nomad: Migrating OS-bypass Networks in Virtual Machines W. Huang, J. Liu, M. Koop, B. Abali, and DK Panda, Third International SIGPLAN/SIGOPS Conference on Virtual Execution Environments (VEE), Jun 2007 [Bib - Plain]
237	High Performance Distributed Lock Management Services using Network-based Remote Atomic Operations S. Narravula, A. Mamidala, A. Vishnu, K. Vaidyanathan, and DK Panda, International Symposium on Cluster Computing and the Grid, May 2007 [Slides] [Bib - Plain]
238	Design and Implementation of High Performance MVAPICH2: MPI2 over InfiniBand W. Huang, G. Santhanaraman, H. Jin, Q. Gao, and DK Panda, International Symposium on Cluster Computing and the Grid, May 2007 [Bib - Plain]
239	Benefits of I/O Acceleration Technology (I/OAT) in Clusters K. Vaidyanathan, and DK Panda, International Symposium on Performance Analysis of Systems and Software (ISPASS), Apr 2007 [Bib - Plain]
240	Designing Efficient Systems Services and Primitives for Next-Generation Data-Centers K. Vaidyanathan, S. Narravula, P. Balaji, and DK Panda, Workshop on NSF Next Generation Software (NGS) Program; held in conjunction with IPDPS, Apr 2007 [Bib - Plain]
241	Improving Scalability of OpenMP Applications on MultiCore Systems Using Large Page Support R. Noronha, and DK Panda, International Workshop on Multithreaded Architectures and Applications (MTAAP), Mar 2007 [Bib - Plain]
242	High Performance MPI on IBM 12x InfiniBand Architecture A. Vishnu, B. Benton, and DK Panda, International Workshop on High-Level Parallel Programming Models and Supportive Environments (HIPS), Mar 2007 [Bib - Plain]
243	Automatic Path Migration over InfiniBand: Early Experience A. Vishnu, A. Mamidala, S. Narravula, and DK Panda, Third International Workshop on System Management Techniques, Mar 2007 [Bib - Plain]
244	Designing Efficient Asynchronous Memory Operations Using Hardware Copy Engine: A Case Study with I/OAT K. Vaidyanathan, W. Huang, L. Chai, and DK Panda, International Workshop on Communication Architecture for Clusters (CAC), Mar 2007 [Bib - Plain]
245	Using Connection-Oriented and Connection-Less Transport on Performance and Scalability of Collective and One-sided operations: Trade-offs and Impact A. Mamidala, S. Narravula, A. Vishnu, G. Santhanaraman, and DK Panda, International Symposium on Principles and Practice of Parallel Programming (PPoPP 2007), Mar 2007 [Bib - Plain]
246	DDSS: A Low-Overhead Distributed Data Sharing Substrate for Cluster-Based Data-Centers over Modern Interconnects K. Vaidyanathan, S. Narravula, and DK Panda, International Conference on High Performance Computing (HiPC), Dec 2006 [Slides] [Bib - Plain]
247	Finding Bugs in Large-Scale Parallel Programs by Detecting Anomaly in Data Movements Q. Gao, F. Qin, and DK Panda, SuperComputing 2006, Nov 2006 [Bib - Plain]
248	Analyzing the Impact of Supporting Out-of-Order Communication on In-order Performance with iWARP P. Balaji, W. Feng, S. Bhagvat, DK Panda, R. Thakur, and W. Gropp, SuperComputing 2006, Nov 2006 [Bib - Plain]
249	High-Performance and Scalable MPI over InfiniBand with Reduced Memory Usage: An In-Depth Performance Analysis S. Sur, M. Koop, and DK Panda, SuperComputing 2006, Nov 2006 [Bib - Plain]
250	A Software Based Approach for Providing Network Fault Tolerance in Clusters Using the uDAPL Interface: MPI Level Design and Performance Evaluation A. Vishnu, P. Gupta, A. Mamidala, and DK Panda, SuperComputing 2006, Nov 2006 [Bib - Plain]
251	NemC: A Network Emulator for Cluster-of-Clusters H. Jin, S. Narravula, K. Vaidyanathan, and DK Panda, International Conference on Computer Communications and Networks, Oct 2006 [Bib - Plain]
252	Designing Efficient MPI Intra-node Communication Support for Modern Computer Architectures L. Chai, A. Hartono, and DK Panda, International Conference on Cluster Computing, Sep 2006 [Bib - Plain]
253	Efficient Shared Memory and RDMA based design for MPI_Allgather over InfiniBand A. Mamidala, A. Vishnu, and DK Panda, EuroPVM/MPI, Sep 2006 [Bib - Plain]
254	Exploiting RDMA operations for Providing Efficient Fine-Grained Resource Monitoring in Cluster-based Servers K. Vaidyanathan, H. Jin, and DK Panda, Workshop on Remote Direct Memory Access (RDMA): Applications, Implementations, and Technologies, Sep 2006 [Bib - Plain]
255	Memory Scalability Evaluation of the Next-Generation Intel Bensley Platform with InfiniBand M. Koop, W. Huang, A. Vishnu, and DK Panda, International Symposium on Hot Interconnect 2006 (HotI'06), Aug 2006 [Slides] [Bib - Plain]
256	Application-Transparent Checkpoint/Restart for MPI Programs over InfiniBand Q. Gao, W. Yu, W. Huang, and DK Panda, International Conference on Parallel Processing (ICPP), Aug 2006 [Slides] [Bib - Plain]
257	High Performance Block I/O for Global File System (GFS) with InfiniBand RDMA S. Liang, W. Yu, and DK Panda, International Conference on Parallel Processing (ICPP), Aug 2006 [Bib - Plain]
258	A Case for High Performance Computing with Virtual Machines W. Huang, J. Liu, B. Abali, and DK Panda, International Conference on Supercomputing (ICS), Jun 2006 [Slides] [Bib - Plain]
259	High Performance VMM-Bypass I/O in Virtual Machines J. Liu, W. Huang, B. Abali, and DK Panda, USENIX Annual Technical Conference, Jun 2006 [Bib - Plain]
260	An MPI-Stream Hybrid Programming Model for Computational Clusters E. Mancini, G. Marsh, and DK Panda, International Symposium on Cluster Computing and the Grid (CCGrid), May 2006 [Slides] [Bib - Plain]
261	Natively Supporting True One-sided Communication in MPI on Multi-core Systems with InfiniBand G. Santhanaraman, P. Balaji, K. Gopalakrishnan, R. Thakur, W. Gropp, and DK Panda, International Symposium on Cluster Computing and the Grid (CCGrid), May 2006 [Bib - Plain]
262	Reducing Connection Memory Requirements of MPI for InfiniBand Clusters: A Message Coalescing Approach M. Koop, T. Jones, and DK Panda, International Symposium on Cluster Computing and the Grid (CCGrid), May 2006 [Bib - Plain]
263	Understanding the Impact of Multi-Core Architecture in Cluster Computing: A Case Study with Intel Dual-Core System L. Chai, Q. Gao, and DK Panda, International Symposium on Cluster Computing and the Grid (CCGrid), May 2006 [Bib - Plain]
264	Hot-Spot Avoidance With Multi-Pathing Over InfiniBand: An MPI Perspective A. Vishnu, M. Koop, A. Moody, A. Mamidala, S. Narravula, and DK Panda, International Symposium on Cluster Computing and the Grid (CCGrid), May 2006 [Bib - Plain]
265	Designing Efficient Cooperative Caching Schemes for Multi-Tier Data-Centers over RDMA-enabled Networks S. Narravula, H. Jin, K. Vaidyanathan, and DK Panda, International Symposium on Cluster Computing and the Grid (CCGrid), May 2006 [Bib - Plain]
266	MPI over uDAPL: Can High Performance and Portability Exist Across Architectures? L. Chai, R. Noronha, and DK Panda, International Symposium on Cluster Computing and the Grid 2006, May 2006 [Bib - Plain]
267	Designing High Performance and Scalable MPI Intra-node Communication Support for Clusters L. Chai, and DK Panda, International Symposium on Cluster Computing and the Grid 2006, May 2006 [Slides] [Bib - Plain]
268	Designing Next-Generation Data-Centers with Advanced Communication Protocols and Systems Services P. Balaji, K. Vaidyanathan, S. Narravula, H. Jin, and DK Panda, Workshop on NSF Next Generation Software (NGS) Program; held in conjunction with IPDPS, Apr 2006 [Slides] [Bib - Plain]
269	Shared Receive Queue based Scalable MPI Design for InfiniBand Clusters S. Sur, L. Chai, H. Jin, and DK Panda, International Parallel and Distributed Processing Symposium (IPDPS '06), Apr 2006 [Bib - Plain]
270	Adaptive Connection Management for Scalable MPI over InfiniBand W. Yu, Qi Gao, and DK Panda, International Parallel and Distributed Processing Symposium (IPDPS '06), Apr 2006 [Slides] [Bib - Plain]
271	Efficient SMP-Aware MPI-Level Broadcast over InfiniBand's Hardware Multicast A. Mamidala, L. Chai, H. Jin, and DK Panda, Communication Architecture for Clusters (CAC) Workshop, Apr 2006 [Bib - Plain]
272	Asynchronous Zero-Copy Communication for Synchronous Sockets Direct Protocol (SDP) over InfiniBand P. Balaji, S. Bhagvat, H. Jin, and DK Panda, Communication Architecture for Clusters (CAC) Workshop, Apr 2006 [Bib - Plain]
273	Benefits of High Speed Interconnects to Cluster File Systems: A Case Study with Lustre W. Yu, R. Noronha, S. Liang, and DK Panda, Communication Architecture for Clusters (CAC) Workshop, Apr 2006 [Bib - Plain]
274	RDMA Read Based Rendezvous Protocol for MPI over InfiniBand: Design Alternatives and Benefits S. Sur, L. Chai, H. Jin, and DK Panda, International Symposium on Principles and Practice of Parallel Programming (PPoPP 2006), Mar 2006 [Slides] [Bib - Plain]
275	A Case for UDP Offload Engines in LambdaGrids V. Vishwanath, P. Balaji, W. Feng, J. Leigh, and DK Panda, International Workshop on Protocols for Fast Long-Distance Networks (PFLDnet 2006), Feb 2006 [Bib - Plain]
276	High Performance RDMA Based All-to-all Broadcast for InfiniBand Clusters S. Sur, U. Bondhugula, A. Mamidala, H. Jin, and DK Panda, International Conference on High Performance Computing (HiPC 2005), Dec 2005 [Bib - Plain]
277	Supporting MPI-2 One Sided Communication on Multi-Rail InfiniBand Clusters: Design Challenges and Performance Benefits A. Vishnu, G. Santhanaraman, W. Huang, H. Jin, and DK Panda, International Conference on High Performance Computing (HiPC 2005), Dec 2005 [Bib - Plain]
278	Supporting iWARP Compatibility and Features for Regular Network Adapters P. Balaji, H. Jin, K. Vaidyanathan, and DK Panda, Workshop on Remote Direct Memory Access (RDMA): Applications, Implementations, and Technologies, Sep 2005 [Slides] [Bib - Plain]
279	Head-to-TOE Evaluation of High-Performance Sockets over Protocol Offload Engines P. Balaji, W. Feng, Q. Gao, R. Noronha, W. Yu, and DK Panda, IEEE Cluster Computing 2005, Sep 2005 [Slides] [Bib - Plain]
280	Swapping to Remote Memory over InfiniBand: An Approach using a High Performance Network Block Device S. Liang, R. Noronha, and DK Panda, IEEE Cluster Computing 2005, Sep 2005 [Slides] [Bib - Plain]
281	Benefits of Quadrics Scatter/Gather to PVFS2 Noncontiguous I/O W. Yu, and DK Panda, International Workshop on Storage Network Architecture and Parallel I/Os (SNAPI) 2005, Sep 2005 [Slides] [Bib - Plain]
282	Can Memory-Less Network Adapters Benefit Next-Generation InfiniBand Systems? S. Sur, A. Vishnu, H. Jin, W. Huang, and DK Panda, Hot Interconnect 13 (HOTI 05), Aug 2005 [Slides] [Bib - Plain]
283	Performance Characterization of a 10-Gigabit Ethernet TOE W. Feng, P. Balaji, C. Baron, L. N. Bhuyan, and DK Panda, Hot Interconnect 13 (HOTI 05), Aug 2005 [Slides] [Bib - Plain]
284	Performance Evaluation of MM5 on Clusters With Modern Interconnects: Scalability and Impact R. Noronha, and DK Panda, Euro-Par, Aug 2005 [Bib - Plain]
285	Performance Evaluation of RDMA over IP: A Case Study with the Ammasso Gigabit Ethernet NIC H. Jin, S. Narravula, K. Vaidyanathan, P. Balaji, and DK Panda, Workshop on High Performance Interconnects for Distributed Computing (HPI-DC); In conjunction with HPDC-14, Jul 2005 [Bib - Plain]
286	High Performance Support of Parallel Virtual File System (PVFS2) over Quadrics W. Yu, S. Liang, and DK Panda, International Conference on Supercomputing (ICS '05), Jun 2005 [Bib - Plain]
287	LiMIC: Support for High-Performance MPI Intra-Node Communication on Linux Cluster H. Jin, S. Sur, L. Chai, and DK Panda, International Conference on Parallel Processing (ICPP-05), Jun 2005 [Slides] [Bib - Plain]
288	Architecture for Caching Responses with Multiple Dynamic Dependencies in Multi-Tier Data-Centers over InfiniBand S. Narravula, P. Balaji, K. Vaidyanathan, H. Jin, and DK Panda, IEEE/ACM International Symposium on Cluster Computing and the Grid (CCGrid 05), May 2005 [Slides] [Bib - Plain]
289	Can High Performance Software DSM Systems Designed With InfiniBand Features Benefit from PCI-Express? R. Noronha, and DK Panda, DSM Workshop, May 2005 [Bib - Plain]
290	Designing Multi-Level, Multi-Tier Data Center Architecture for Securing Distributed Infrastructure and Assets DK Panda, DHS Homeland Security Conference, Apr 2005 [Bib - Plain]
291	Analysis of Design Considerations for Optimizing Multi-Channel MPI over InfiniBand L. Chai, S. Sur, H. Jin, and DK Panda, Workshop on Communication Architecture on Clusters (CAC '05), Apr 2005 [Bib - Plain]
292	Scheduling of MPI-2 One Sided Operations over InfiniBand W. Huang, G. Santhanaraman, H. Jin, and DK Panda, Workshop on Communication Architecture on Clusters (CAC '05), Apr 2005 [Slides] [Bib - Plain]
293	Performance Modeling of Subnet Management on Fat Tree InfiniBand Networks using OpenSM A. Vishnu, A. Mamidala, and H.- W, Workshop on System Management Tools on Large Scale Parallel Systems, Apr 2005 [Bib - Plain]
294	Design and Implementation of Open MPI over Quadrics/Elan4 W. Yu, T. S. Woodall, R. L. Graham, and DK Panda, International Parallel and Distributed Processing Symposium (IPDPS 2005), Apr 2005 [Slides] [Bib - Plain]
295	On the Provision of Prioritization and Soft QoS in Dynamically Reconfigurable Shared Data-Centers over InfiniBand P. Balaji, S. Narravula, K. Vaidyanathan, H. Jin, and DK Panda, IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS 05), Mar 2005 [Slides] [Bib - Plain]
296	Workload-driven Analysis of File Systems in Shared Multi-Tier Data-Centers over InfiniBand K. Vaidyanathan, P. Balaji, H. Jin, and DK Panda, Computer Architecture Evaluation using Commercial Workloads (in conjunction with HPCA), Feb 2005 [Slides] [Bib - Plain]
297	Scalable Startup of Parallel Programs over InfiniBand W. Yu, J. Wu, and DK Panda, International Conference on High Performance Computing (HiPC '04), Dec 2004 [Slides] [Bib - Plain]
298	Building Multirail InfiniBand Clusters: MPI-Level Design and Performance Evaluation J. Liu, A. Vishnu, and DK Panda, SuperComputing 2004 Conference (SC 04), Nov 2004 [Slides] [Bib - Plain]
299	Reducing Diff Overhead in Software DSM Systems using RDMA Operations in InfiniBand R. Noronha, and DK Panda, Workshop on Remote Direct Memory Access (RDMA): Applications, Implementations, and Technologies in conjunction with the IEEE Cluster, Sep 2004 [Slides] [Bib - Plain]
300	Exploiting Remote Memory Operations to Design Efficient Reconfiguration for Shared Data-Centers over InfiniBand P. Balaji, K. Vaidyanathan, S. Narravula, K. Savitha, H. Jin, and DK Panda, Workshop on Remote Direct Memory Access (RDMA): Applications, Implementations, and Technologies in conjunction with the IEEE Cluster, Sep 2004 [Slides] [Bib - Plain]
301	Sockets vs RDMA Interface over 10-Gigabit Networks: An In-depth analysis of the Memory Traffic Bottleneck P. Balaji, H. V. Shah, and DK Panda, Workshop on Remote Direct Memory Access (RDMA): Applications, Implementations, and Technologies in conjunction with the IEEE Cluster, Sep 2004 [Slides] [Bib - Plain]
302	Scalable and High Performance NIC-Based Allgather over Myrinet/GM W. Yu, D. Buntinas, and DK Panda, International Conference on Cluster Computing 2004, Sep 2004 [Slides] [Bib - Plain]
303	Efficient Barrier and Allreduce on IBA Clusters using Hardware Multicast and Adaptive Algorithms A. Mamidala, J. Liu, and DK Panda, International Conference on Cluster Computing 2004, Sep 2004 [Bib - Plain]
304	NIC-Based Offload of Dynamic User-Defined Modules for Myrinet Clusters A. Wagner, H. Jin, R. Riesen, and DK Panda, International Conference on Cluster Computing 2004, Sep 2004 [Bib - Plain]
305	Zero-Copy MPI Derived Datatype Communication over InfiniBand G. Santhanaraman, J. Wu, and DK Panda, EuroPVM/MPI 2004, Sep 2004 [Slides] [Bib - Plain]
306	Efficient Implementation of MPI-2 Passive One-Sided Communication on InfiniBand Clusters W. Jiang, J. Liu, H. Jin, DK Panda, D. Buntinas, R. Thakur, and W. Gropp, EuroPVM/MPI 2004, Sep 2004 [Slides] [Bib - Plain]
307	Performance Evaluation of InfiniBand with PCI Express J. Liu, A. Mamidala, A. Vishnu, and DK Panda, Hot Interconnect 12 (HOTI 04), Aug 2004 [Bib - Plain]
308	Efficient and Scalable All-to-All Personalized Exchange for InfiniBand-based Clusters S. Sur, H. Jin, and DK Panda, International Conference on Parallel Processing (ICPP '04), Aug 2004 [Bib - Plain]
309	Design and Implementation of MPICH2 over InfiniBand with RDMA Support J. Liu, W. Jiang, P. Wyckoff, DK Panda, D. Ashton, D. Buntinas, W. Gropp, and B. Toonen, International Parallel and Distributed Processing Symposium (IPDPS 04), Apr 2004 [Slides] [Bib - Plain]
310	Fast and Scalable MPI-Level Broadcast using InfiniBand's Hardware Multicast Support J. Liu, A. Mamidala, and DK Panda, International Parallel and Distributed Processing Symposium (IPDPS 04), Apr 2004 [Slides] [Bib - Plain]
311	High Performance Implementation of MPI Datatype Communication over InfiniBand J. Wu, P. Wyckoff, and DK Panda, International Parallel and Distributed Processing Symposium (IPDPS 04), Apr 2004 [Bib - Plain]
312	Host-Assisted Zero-Copy Remote Memory Access Communication on InfiniBand V. Tipparaju, G. Santhanaraman, J. Nieplocha, and DK Panda, International Parallel and Distributed Processing Symposium (IPDPS 04), Apr 2004 [Bib - Plain]
313	Implementing Efficient and Scalable Flow Control Schemes in MPI over InfiniBand J. Liu, and DK Panda, International Workshop on Communication Architecture for Clusters (CAC 04), Apr 2004 [Slides] [Bib - Plain]
314	Efficient and Scalable Barrier over Quadrics and Myrinet with a New NIC-Based Collective Message Passing Protocol W. Yu, and DK Panda, International Workshop on Communication Architecture for Clusters (CAC 04), Apr 2004 [Slides] [Bib - Plain]
315	High Performance MPI-2 One-Sided Communication over InfiniBand W. Jiang, J. Liu, H. Jin, DK Panda, W. Gropp, and R. Thakur, International Symposium on Cluster Computing and the Grid (CCGrid 04), Apr 2004 [Slides] [Bib - Plain]
316	Unifier: Unifying Cache Management and Communication Buffer Management for PVFS over InfiniBand J. Wu, P. Wyckoff, DK Panda, and R. Ross, International Symposium on Cluster Computing and the Grid (CCGrid 04), Apr 2004 [Bib - Plain]
317	Designing High Performance DSM Systems using InfiniBand Features R. Noronha, and DK Panda, International Workshop on Distributed Shared Memory Systems, Apr 2004 [Slides] [Bib - Plain]
318	Sockets Direct Protocol over InfiniBand in Clusters: Is it Beneficial? P. Balaji, S. Narravula, K. Vaidyanathan, S. Krishnamoorthy, J. Wu, and DK Panda, International Symposium on Performance Analysis of Systems and Software (ISPASS 04), Apr 2004 [Bib - Plain]
319	Sockets Direct Protocol over InfiniBand in Clusters: Is it Beneficial? P. Balaji, S. Narravula, K. Vaidyanathan, S. Krishnamoorthy, J. Wu, and DK Panda, IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS 04), Apr 2004 [Slides] [Bib - Plain]
320	Evaluating the Impact of RDMA on Storage I/O over InfiniBand J. Liu, DK Panda, and M. Banikazemi, SAN-03 Workshop (in conjunction with HPCA), Feb 2004 [Slides] [Bib - Plain]
321	Application-Bypass Reduction for Large-Scale Clusters A. Wagner, D. Buntinas, R. Brightwell, and DK Panda, Cluster 2003 Conference, Dec 2003 [Bib - Plain]
322	Supporting Efficient Noncontiguous Access in PVFS over InfiniBand J. Wu, P. Wyckoff, and DK Panda, Cluster 2003 Conference, Dec 2003 [Bib - Plain]
323	Optimizing Mechanisms for Latency Tolerance in Remote Memory Access Communication V. Tipparaju, M. Krishnan, J. Nieplocha, G. Santhanaraman, and DK Panda, Cluster 2003 Conference, Dec 2003 [Bib - Plain]
324	Performance Comparison of MPI Implementations over InfiniBand, Myrinet and Quadrics J. Liu, B. Chandrasekaran, J. Wu, W. Jiang, S. Kini, W. Yu, D. Buntinas, P. Wyckoff, and DK Panda, SuperComputing 2003, Nov 2003 [Bib - Plain]
325	Scalable NIC-based Reduction on Large-scale Clusters A. Moody, J. Fernandez, F. Petrini, and DK Panda, SuperComputing (SC) Conference, Nov 2003 [Bib - Plain]
326	High Performance Broadcast Support in LA-MPI over Quadrics W. Yu, S. Sur, DK Panda, R. T. Aulwes, and R. Graham, Los Alamos Computer Science Institute (LACSI) Symposium, Oct 2003 [Slides] [Bib - Plain]
327	High Performance and Reliable NIC-Based Multicast over Myrinet/GM-2 W. Yu, D. Buntinas, and DK Panda, International Conference on Parallel Processing, Oct 2003 [Slides] [Bib - Plain]
328	PVFS over InfiniBand: Design and Performance Evaluation J. Wu, P. Wyckoff, and DK Panda, International Conference on Parallel Processing, Oct 2003 [Bib - Plain]
329	Designing a Portable MPI-2 over Modern Interconnects using uDAPL Interface L. Chai, R. Noronha, P. Gupta, G. Brown, and DK Panda, Euro PVM/MPI Conference, Sep 2003 [Bib - Plain]
330	Efficient Hardware Multicast Group Management for Multiple MPI Communicators over InfiniBand A. Mamidala, H. Jin, and DK Panda, Euro PVM/MPI Conference, Sep 2003 [Slides] [Bib - Plain]
331	Design Alternatives and Performance Trade-offs for Implementing MPI-2 over InfiniBand W. Huang, G. Santhanaraman, H. Jin, and DK Panda, Euro PVM/MPI Conference, Sep 2003 [Slides] [Bib - Plain]
332	Fast and Scalable Barrier using RDMA and Multicast Mechanisms for InfiniBand-Based Clusters S. Kini, J. Liu, J. Wu, P. Wyckoff, and DK Panda, Euro PVM/MPI Conference, Sep 2003 [Bib - Plain]
333	Demotion-Based Exclusive Caching through Demote Buffering: Design and Evaluations over Different Networks J. Wu, P. Wyckoff, and DK Panda, Workshop on Storage Network Architecture and Parallel I/O (SNAPI), Sep 2003 [Bib - Plain]
334	MIBA: A Micro-benchmark Suite for Evaluating InfiniBand Architecture Implementations B. Chandrasekaran, P. Wyckoff, and DK Panda, Performance TOOLS 2003, Sep 2003 [Bib - Plain]
335	Micro-Benchmark Level Performance Comparison of High-Speed Cluster Interconnects J. Liu, B. Chandrasekaran, W. Yu, J. Wu, D. Buntinas, S. P. Kini, P. Wyckoff, and DK Panda, Hot Interconnects 10, Aug 2003 [Bib - Plain]
336	High Performance RDMA-Based MPI Implementation over InfiniBand J. Liu, J. Wu, S. Kini, P. Wyckoff, and DK Panda, International Conference on Supercomputing (ICS '03), Jun 2003 [Bib - Plain]
337	QoS-aware Middleware for Cluster-based Servers to Support Interactive and Resource-Adaptive Applications S. Senapathi, B. Chandrasekharan, D. Stredney, H.-W. Shen, and DK Panda, High Performance Distributed Computing, Jun 2003 [Bib - Plain]
338	Impact of High Performance Sockets on Data Intensive Applications P. Balaji, J. Wu, T. Kurc, U. Catalyurek, DK Panda, and J. Saltz, High Performance Distributed Computing, Jun 2003 [Bib - Plain]
339	Application-Bypass Broadcast in MPICH over GM D. Buntinas, DK Panda, and R. Brightwell, Cluster Computing and Grid (CCGrid '03), May 2003 [Bib - Plain]
340	Optimizing Barrier and Lock Operations in ARMCI D. Buntinas, A. Saify, DK Panda, and Jarek Nieplocha, International Workshop on Communication Architecture for Clusters (CAC '03), Apr 2003 [Bib - Plain]
341	Efficient Collective Operations using Remote Memory Operations on VIA-Based Clusters R. Gupta, P. Balaji, DK Panda, and J. Nieplocha, International Parallel and Distributed Processing Symposium (IPDPS '03), Apr 2003 [Bib - Plain]
342	NIC-Based Reduction in Myrinet Clusters: Is It Beneficial? D. Buntinas, and DK Panda, SAN-02 Workshop (in conjunction with HPCA), Apr 2003 [Bib - Plain]
343	A Portable Client/Server Communication Middleware over SANs: Design and Performance Evaluation with InfiniBand J. Liu, M. Banikazemi, B. Abali, and DK Panda, SAN-02 Workshop (in conjunction with HPCA), Apr 2003 [Bib - Plain]
344	Supporting Strong Coherency for Active Caches in Multi-Tier Data-Centers over InfiniBand S. Narravula, P. Balaji, K. Vaidyanathan, S. Krishnamoorthy, J. Wu, and DK Panda, SAN-03 Workshop (in conjunction with HPCA), Feb 2003 [Slides] [Bib - Plain]
345	Impact of On-Demand Connection Management in MPI over VIA J. Wu, J. Liu, P. Wyckoff, and DK Panda, Cluster '02, Sep 2002 [Bib - Plain]
346	Efficient Barrier using Remote Memory Operations on VIA-Based Clusters R. Gupta, V. Tipparaju, J. Nieplocha, and DK Panda, Cluster '02, Sep 2002 [Bib - Plain]
347	High Performance User-Level Sockets over Gigabit Ethernet P. Balaji, P. Shivam, P. Wyckoff, and DK Panda, Cluster '02, Sep 2002 [Bib - Plain]
348	A QoS Framework for Clusters to support Applications with Resource Adaptivity and Predictable Performance S. Senapathi, DK Panda, D. Stredney, and H.-W. Shen, International Workshop on Quality of Service (IWQoS), May 2002 [Bib - Plain]
349	Can User Level Protocols Take Advantage of Multi-CPU NICs? P. Shivam, P. Wyckoff, and DK Panda, International Parallel and Distributed Processing Symposium (IPDPS '02), Apr 2002 [Bib - Plain]
350	MPI/IO on DAFS Over VIA: Implementation and Performance Evaluation J. Wu, and DK Panda, Communication Architecture for Clusters (CAC'02) Workshop, Apr 2002 [Bib - Plain]
351	Protocols and Strategies for Optimizing Remote Memory Operations on Clusters J. Nieplocha, V. Tipparaju, A. Saify, and DK Panda, Communication Architecture for Clusters (CAC'02) Workshop, held in conjunction with IPDPS '02, Apr 2002 [Bib - Plain]
352	NIC-Based Atomic Operations on Myrinet/GM D. Buntinas, DK Panda, and W. Gropp, SAN-1 Workshop, Feb 2002 [Bib - Plain]
353	EMP: Zero-copy OS-bypass NIC-driven Gigabit Ethernet Message Passing P. Shivam, P. Wyckoff, and DK Panda, Supercomputing '01, Feb 2002 [Bib - Plain]
354	Implementing TreadMarks over GM on Myrinet: Challenges, Design Experiences and Performance Evaluation R. Noronha, and DK Panda, The Workshop on Communication Architecture for Clusters held in conjunction with IPDPS 2003, Sep 2001 [Slides] [Bib - Plain]
355	NIC-based Rate Control for Proportional Bandwidth Allocation in Myrinet Clusters A. Gulati, DK Panda, P. Sadayappan, and P. Wyckoff, International Conference on Parallel Processing, Sep 2001 [Bib - Plain]
356	Implementing TreadMarks over VIA on Myrinet and Gigabit Ethernet: Challenges, Design Experience, and Performance Evaluation M. Banikazemi, J. Liu, DK Panda, and P. Sadayappan, International Conference on Parallel Processing 2001, Sep 2001 [Bib - Plain]
357	Performance Benefits of NIC-Based Barrier on Myrinet/GM D. Buntinas, DK Panda, and P. Sadayappan, Workshop on Communication Architecture for Clusters (CAC '01), Apr 2001 [Bib - Plain]
358	Can Scatter Communication Take Advantage of Multidestination Message Passing? M. Banikazemi, and DK Panda, Int'l Symposium on High Performance Computing 2000, Apr 2001 [Bib - Plain]
359	Fast NIC-Based Barrier over Myrinet/GM D. Buntinas, DK Panda, and P. Sadayappan, International Parallel and Distributed Processing Symposium, Apr 2001 [Bib - Plain]
360	Characterization and Enhancement of Static Mapping Heuristics for Heterogeneous Systems Praveen Holenarsipur, V. Yarmolenko, J. Duato, DK Panda, and P. Sadayappan, International Symposium on High Performance Computing (HiPC '00), Dec 2000 [Bib - Plain]
361	Dynamic Mapping Heuristics in Heterogeneous Systems V. Yarmolenko, J. Duato, DK Panda, and P. Sadayappan, Workshop on Network-Based Computing, Aug 2000 [Bib - Plain]
362	Balancing Web Server Load for Adaptive Video Distribution A. Paul, W.-C. Feng, DK Panda, and P. Sadayappan, Workshop on Multimedia Computing, Aug 2000 [Bib - Plain]
363	Implementing TreadMarks on Virtual Interface Architecture (VIA): Design Issues and Alternatives M. Banikazemi, DK Panda, and P. Sadayappan, Ninth Workshop on Scalable Shared Memory Multiprocessors, Jun 2000 [Bib - Plain]
364	TupleQ: Fully-Asynchronous and Zero-Copy MPI over InfiniBand M. Koop, J. Sridhar, and DK Panda, International Parallel and Distributed Processing Symposium (IPDPS), May 2000 [Slides] [Bib - Plain]
365	MVAPICH-Aptus: Scalable High-Performance Multi-Transport MPI over InfiniBand M. Koop, T. Jones, and DK Panda, International Parallel and Distributed Processing Symposium (IPDPS), May 2000 [Slides] [Bib - Plain]
366	Designing Passive Synchronization for MPI-2 One-Sided Communication to Maximize Overlap G. Santhanaraman, S. Narravula, and DK Panda, International Parallel and Distributed Processing Symposium (IPDPS), May 2000 [Bib - Plain]
367	VIBe: A Micro-benchmark Suite for Evaluating Virtual Interface Architecture (VIA) Implementations M. Banikazemi, J. Liu, S. Kutlug, A. Ramakrishna, P. Sadayappan, H. Sah, and DK Panda, International Parallel and Distributed Processing Symposium (IPDPS), May 2000 [Bib - Plain]
368	Efficient Multicast Algorithms for Heterogeneous Switch-based Irregular Networks of Workstations A. Singhal, M. Banikazemi, P. Sadayappan, and DK Panda, International Parallel and Distributed Processing Symposium (IPDPS), May 2000 [Bib - Plain]
369	Efficient Virtual Interface Architecture Support for the IBM SP Switch-Connected NT Clusters M. Banikazemi, V. Moorthy, L. Herger, DK Panda, and B. Abali, International Parallel and Distributed Processing Symposium (IPDPS), May 2000 [Bib - Plain]
370	Adaptive Routing in RS/6000 SP-like Bidirectional Multistage Interconnection Networks M. Banikazemi, C. B. Stunkel, DK Panda, and B. Abali, International Parallel and Distributed Processing Symposium (IPDPS), May 2000 [Bib - Plain]
371	Comparison and Evaluation of Design Choices for Implementing the Virtual Interface Architecture (VIA) M. Banikazemi, B. Abali, and DK Panda, Fourth International Workshop on Communication, Jan 2000 [Bib - Plain]
372	Broadcast/Multicast over Myrinet Using NIC-Assisted Multidestination Messages D. Buntinas, DK Panda, J. Duato, and P. Sadayappan, Fourth International Workshop on Communication and Architectural Support for Network-Based Parallel Computing (CANPC'00), Jan 2000 [Bib - Plain]
373	Fast Collective Communication Algorithms for Reflective Memory Network Clusters V. Moorthy, DK Panda, and P. Sadayappan, Fourth International Workshop on Communication and Architectural Support for Network-Based Parallel Computing (CANPC'00), Jan 2000 [Bib - Plain]
374	Implementing Efficient MPI on LAPI for the IBM-SP: Experiences and Performance Evaluation M. Banikazemi, R. Govindaraju, R. Blackmore, and DK Panda, International Parallel Processing Symposium (IPPS'99), Jan 2000 [Bib - Plain]
375	Low Latency Message Passing on Workstation Clusters using SCRAMNet V. Moorthy, M. Jacunski, M. Pillai, P. Ware, DK Panda, T. Page, P. Sadayappan, V. Nagarajan, and J. Daniel, International Parallel Processing Symposium (IPPS'99), Jan 2000 [Bib - Plain]
376	Communication Modeling of Heterogeneous Networks of Workstations for Performance Characterization of Collective Operations M. Banikazemi, S. Prabhu, J. Sampathkumar, DK Panda, and P. Sadayappan, International Workshop on Heterogeneous Computing (HCW'99), Jan 2000 [Bib - Plain]
377	All-to-All Broadcast on Switch-Based Clusters of Workstations M. Jacunski, P. Sadayappan, and DK Panda, International Parallel Processing Symposium 1999, Apr 1999 [Bib - Plain]
378	Low Latency Message-Passing for Reflective Memory Networks M. Jacunski, V. Moorthy, P. Ware, M. Pillai, DK Panda, and P. Sadayappan, International Workshop on Communication, Jan 1999 [Bib - Plain]
379	Where to Provide Support for Efficient Multicasting in Irregular Networks: Network Interface or Switch? R. Sivaram, R. Kesavan, DK Panda, and Craig B. Stunkel, International Conference on Parallel Processing, Aug 1998 [pp. 452-459] [Bib - Plain]
380	Experiences with Software MPEG-2 Video Decompression on an SMP PC A. Bala, D. Shah, W.-C. Feng, and DK Panda, ICPP Workshop, Aug 1998
381	HIPIQS: A High-Performance Switch Architecture using Input Queuing R. Sivaram, C. Stunkel, and DK Panda, International Parallel Processing Symposium (IPPS '98), Aug 1998
382	Prioritized Demand Multiplexing (PDM): A Low-Latency Virtual Channel Flow Control Framework for Prioritized Traffic A-H. Smai, DK Panda, and L-E. Thorelli, International Conference on High Performance Computing, Dec 1997
383	How Much Does Network Contention Affect Distributed Shared Memory Performance? D. Dai, and DK Panda, International Conference on Parallel Processing 1997, Dec 1997 [pp. 454-461]
384	Optimal Multicast with Packetization and Network Interface Support R. Kesavan, and DK Panda, International Conference on Parallel Processing (ICPP'97), Dec 1997 [pp. 370-377]
385	Multicasting on Switch-based Irregular Networks using Multi-drop Path-based Multidestination Worms R. Kesavan, and DK Panda, Parallel Computing, Routing, and Communication Workshop, Dec 1997
386	Multicasting in Irregular Networks with Cut-Through Switches using Tree-Based Multidestination Worms R. Sivaram, DK Panda, and C. B. Stunkel, Parallel Computing, Routing, and Communication Workshop, Dec 1997
387	How Can We Design Better Networks for DSM Systems? D. Dai, and DK Panda, Parallel Computing, Routing, and Communication Workshop, Dec 1997
388	Implementing Multidestination Worms in Switch-Based Parallel Systems: Architectural Alternatives and their Impact C. B. Stunkel, R. Sivaram, and DK Panda, International Symposium on Computer Architecture (ISCA'97), Jun 1997
389	A Reliable Hardware Barrier Synchronization Scheme R. Sivaram, C. B. Stunkel, and DK Panda, International Parallel Processing Symposium (IPPS'97), Apr 1997
390	Efficient Collective Communication on Heterogeneous Networks of Workstations M. Banikazemi, V. Moorthy, and DK Panda, International Conference on Parallel Processing, Aug 1996
391	Impact of Adaptivity on the Behavior of Networks of Workstations under Bursty Traffic F. Silla, M. P. Malumbres, J. Duato, D. Dai, and DK Panda, International Conference on Parallel Processing, Aug 1996
392	Designing Processor-cluster Based Systems: Interplay Between Cluster Organizations and Collective Communication Algorithms D. Basak, and DK Panda, International Conference on Parallel Processing, Aug 1996
393	Reducing Cache Invalidation Overheads in Wormhole DSMs using Multidestination Message Passing D. Dai, and DK Panda, International Conference on Parallel Processing, Aug 1996
394	Minimizing Node Contention in Multiple Multicast on Wormhole k-ary n-cube Networks R. Kesavan, and DK Panda, International Conference on Parallel Processing, Aug 1996
395	Hybrid Algorithms for Complete Exchange in 2D Meshes N. S. Sundar, D. N. Jayasimha, DK Panda, and P. Sadayappan, Proceedings of the International Conference on Supercomputing, May 1996
396	Multicast on Irregular Switch-based Networks with Wormhole Routing R. Kesavan, K. Bondalapati, and DK Panda, Proceedings of the Third International Symposium on High Performance Computer Architecture (HPCA-3), Feb 1996
397	Fast Barrier Synchronization in Wormhole k-ary n-cube Networks with Multidestination Worms DK Panda, International Symposium on High Performance Computer Architecture, Jan 1995
398	Issues in Designing Scalable Systems with k-ary n-cube cluster-c organization DK Panda, and D. Basak, International Workshop on Parallel Processing, Dec 1994
399	Architectural Issues in Designing Heterogeneous Parallel Systems with Passive Star-Coupled Optical Interconnection R. Prakash, and DK Panda, International Symposium on Parallel Architectures, Dec 1994
400	Designing Large Hierarchical Multiprocessor Systems under Processor D. Basak, and DK Panda, International Parallel Processing Conference (ICPP '94), Aug 1994
401	Message-Ordering for Wormhole-Routed Multiport Systems with Link Contention and Routing Adaptivity DK Panda, and V. Dixit-Radiya, Scalable High Performance Computing Conference, May 1994
402	Complete Exchange in 2D Meshes N. S. Sundar, D. N. Jayasimha, DK Panda, and P. Sadayappan, Scalable High Performance Computing Conference, May 1994
403	Multidestination Message Passing Mechanism Conforming to Base Wormhole Routing Scheme DK Panda, S. Singal, and P. Prabhakaran, Parallel Routing and Communication Workshop, May 1994
404	Scalable Architecture with k-ary n-cube cluster-c Organizations D. Basak, and DK Panda, Symposium on Parallel and Distributed Processing, Dec 1993
405	Task Assignment in Distributed-Memory Systems with Adaptive Wormhole Routing V. Dixit-Radiya, and DK Panda, Symposium on Parallel and Distributed Processing, Dec 1993
406	Optimal Phase Barrier Synchronization in k-ary n-cube Wormhole-routed Systems using Multirendezvous Primitives DK Panda, Workshop on Fine-Grain Massively Parallel Coordination, May 1993
407	Analysis of Routing in Pyramid Architectures T. Mzaik, S. Chandra, J. M. Jagadeesh, and DK Panda, IEEE National Aerospace and Electronics Conference (NAECON), May 1993
408	Benefits of Processor Clustering in Designing Large Parallel Systems: When and How? D. Basak, DK Panda, and M. Banikazemi, International Parallel Processing Symposium, Apr 1993
409	Global Reduction in Wormhole k-ary n-cube Networks with Multidestination Exchange Worms DK Panda, International Parallel Processing Symposium, Apr 1993
410	An Efficient Scheme for Complete Exchange in 2D Tori Y.-C. Tseng, S. K. S. Gupta, and DK Panda, International Parallel Processing Symposium, Apr 1993
411	Clustering and Intra-Processor Scheduling for Explicitly-Parallel Programs on Distributed-Memory Systems V. Dixit-Radiya, and DK Panda, International Parallel Processing Symposium, Apr 1993
412	Impact of Multiple Consumption Channels on Wormhole Routed k-ary n-cube Networks S. Balakrishnan, and DK Panda, International Parallel Processing Symposium, Apr 1993
413	Barrier Synchronization in Distributed-Memory Multiprocessors using Rendezvous Primitives S. K. S. Gupta, and DK Panda, International Parallel Processing Symposium, Apr 1993
414	A Trip-based Multicasting Model for Wormhole-routed Networks with Virtual Channels Y. C. Tseng, and DK Panda, International Parallel Processing Symposium, Apr 1993

Technical Reports (8)
1	K. Vaidyanathan, P. Lai, S. Narravula, and DK Panda, Benefits of Dedicating Resource Sharing Services in Data-Centers for Emerging Multi-Core Systems, OSU-CISRC-8/07-TR53
2	K. Vaidyanathan, H. Jin, S. Narravula, and DK Panda, Accurate Load Monitoring for Cluster-based Web Data-Centers over RDMA-enabled Networks, OSU-CISRC-7/05-TR49
3	G. Marsh, A. Sampat, S. Potluri, and DK Panda, Scaling Advanced Message Queuing Protocol (AMQP) Architecture with Broker Federation and InfiniBand, OSU-CISRC-5/09-TR17
4	W. Huang, J. Liu, B. Abali, and DK Panda, InfiniBand Support in Xen Virtual Machine Environment, OSU-CISRC-2/06-TR18
5	P. Balaji, W. Feng, and DK Panda, The Convergence of Ethernet and Ethernot: A 10-Gigabit Ethernet Perspective, OSU-CISRC-1/06-TR10
6	H. Jin, S. Narravula, G. Brown, K. Vaidyanathan, P. Balaji, and DK Panda, Performance Evaluation of RDMA over IP: A Case Study with Ammasso Gigabit Ethernet NIC, OSU-CISRC-6/05-TR40
7	K. Vaidyanathan, P. Balaji, J. Wu, H. Jin, and DK Panda, An Architectural Study of Cluster-Based Multi-Tier Data-Centers,
8	S. Krishnamoorthy, P. Balaji, K. Vaidyanathan, H. Jin, and DK Panda, Dynamic Reconfigurability Support for providing Soft QoS Guarantees in Cluster-based Multi-Tier Data-Centers over InfiniBand,

Ph.D. Dissertations (36)
1	M. Bayatpour, Designing High Performance Hardware-assisted Communication Middlewares for Next-Generation HPC Systems, May 2021
2	C. Chu, Accelerator-enabled Communication Middleware for Large-scale Heterogeneous HPC Systems with Modern Interconnects, Jul 2020
3	J. Hashmi, Designing High Performance Shared-Address-Space and Adaptive Communication Middlewares for Next-Generation HPC Systems, Apr 2020
4	Ammar Awan, Co-designing Communication Middleware and Deep Learning Frameworks for High-Performance DNN Training on HPC Systems, Apr 2020
5	S. Chakraborty, High Performance and Scalable Cooperative Communication Middleware for Next Generation Architectures, Jun 2019
6	J. Zhang, Designing and Building Efficient HPC Cloud with Modern Networking Technologies on Heterogeneous HPC Clusters, Jul 2018
7	M. Li, Designing High-Performance Remote Memory Access for MPI and PGAS Models with Modern Networking Technologies on Heterogeneous Clusters, Nov 2017
8	A. Venkatesh, High-Performance Heterogeneity/Energy-Aware Communication for MultiPetaflop HPC Systems, Dec 2016
9	R. Rajachandrasekar, Designing Scalable And Efficient I/O Middleware for Fault-Resilient High-performance Computing Clusters, Nov 2014
10	J. Jose, Designing High Performance and Scalable Unified Communication Runtime (UCR) for HPC and Big Data Middleware, Aug 2014
11	S. Potluri, Enabling Efficient Use of MPI and PGAS Programming Models on Heterogeneous Clusters with High Performance Interconnects, May 2014
12	K. Kandalla, High Performance Non-Blocking Collective Communication for Next Generation InfiniBand Clusters, Jul 2013
13	M. Luo, Designing Efficient MPI and UPC Runtime for Multicore Clusters with InfiniBand and Heterogeneous System, Jul 2013
14	H. Subramoni, Topology-Aware MPI communication and Scheduling for High Performance Computing Systems, Jul 2013
15	X. Ouyang, Efficient Storage Middleware Design in InfiniBand Clusters for High-End Computing, Mar 2012
16	G. Santhanaraman, Designing Scalable And High Performance One Sided Communication Middleware For Modern Interconnects, Jun 2009
17	M. Koop, High-Performance Multi-Transport MPI Design For Ultra-Scale Infiniband Clusters, Jun 2009
18	L. Chai, High Performance And Scalable MPI Intra-Node Communication Middleware For Multi-Core Clusters, Mar 2009
19	W. Huang, High Performance Network I/O In Virtual Machines Over Modern Interconnects, Aug 2008
20	R. Noronha, Designing High-Performance and Scalable Clustered Network Attached Storage With InfiniBand, Aug 2008
21	S. Narravula, Designing High-Performance and Scalable Distributed Datacenter Services over Modern Interconnects, Aug 2008
22	A. Mamidala, Scalable and High Performance Collective Communication For Next Generation Multicore InfiniBand Clusters, May 2008
23	K. Vaidyanathan, High Performance and Scalable Soft Shared State for Next-Generation Datacenters, May 2008
24	A. Vishnu, High Performance and Network Fault Tolerant MPI with Multi-Pathing Over InfiniBand, Dec 2007
25	S. Sur, Scalable and High Performance MPI Design for Very Large InfiniBand Clusters, Aug 2007
26	W. Yu, Enhancing MPI with Modern Networking Mechanisms in Cluster Interconnects, Jun 2006
27	P. Balaji, High Performance Communication Support for Sockets Based Applications over High-Speed Networks, Jun 2006
28	J. Liu, Designing High Performance and Scalable MPI over InfiniBand, Sep 2004
29	J. Wu, Communication and Memory Management in Networked Storage Systems, Sep 2004
30	D. Buntinas, Improving Cluster Performance through the Use of Programmable Network Interfaces, Jun 2003
31	M. Banikazemi, Design and Implementation of High Performance Communication Subsystems for Clusters, Dec 2000
32	D. Dai, Designing Efficient Communication Subsystems for Distributed Shared Memory (DSM) Systems, Mar 1999
33	R. Kesavan, Communication Mechanisms and Algorithms for Supporting Scalable Collective Communication on Parallel Systems, Oct 1998
34	R. Sivaram, Architectural Support for Efficient Communication in Scalable Parallel Systems, Aug 1998
35	D. Basak, Designing High Performance Parallel Systems: A Processor-Cluster Based Approach, Jul 1996
36	V. Dixit-Radiya, Mapping on Wormhole-routed Distributed-Memory Systems: A Temporal Communication Graph-based Approach, Mar 1995

M.S. Theses (31)
1	S. Srivastava, MVAPICH2-AutoTune: An Automatic Collective Tuning Framework for the MVAPICH2 MPI Library, May 2021
2	N. Senthil Kumar, Designing Optimized MPI+NCCL Hybrid Collective Communication Routines for Dense Many-GPU Clusters, May 2021
3	Kamal Raj Sankarapandian, Profiling MPI Primitives in Real-time Using OSU INAM, Apr 2020
4	R. Biswas, Benchmarking and Accelerating TensorFlow-based Deep Learning on Modern HPC Systems, Jul 2018
5	A. Augustine, Designing a Scalable Network Analysis and Monitoring Tool with MPI Support, Aug 2016
6	V. Dhanraj, Enhancement of LIMIC-Based Collectives for Multi-core Clusters, Aug 2012
7	A. Singh, Optimizing All-to-all and Allgather Communications on GPGPU Clusters, Apr 2012
8	S. Pai Raikar, Network Fault-Resilient MPI for Multi-Rail InfiniBand Clusters, Dec 2011
9	N. Dandapanthula, InfiniBand Network Analysis and Monitoring using OpenSM, Aug 2011
10	V. Meshram, Distributed Metadata Management for Parallel Systems, Aug 2011
11	G. Marsh, Evaluation of High Performance Financial Messaging on Modern Multi-core Systems, Mar 2010
12	K. Gopalakrishnan, Enhancing Fault Tolerance in MPI for Modern InfiniBand Clusters, Aug 2009
13	T. Gangadharappa, Designing Support For MPI-2 Programming Interfaces On Modern Interconnects, Jun 2009
14	J. Sridhar, Scalable Job Startup And Inter-Node Communication In Multi-Core Infiniband Clusters, Jun 2009
15	R. Kumar, Enhancing MPI Point-to-Point and Collectives for Clusters with Onloaded/Offloaded InfiniBand Adapters, Aug 2008
16	S. Bhagvat, Designing and Enhancing the Sockets Direct Protocol (SDP) over iWARP and InfiniBand, Aug 2006
17	S. Krishnamoorthy, Dynamic Re-Configurability Support to Provide Soft QoS Guarantees in Cluster-Based Multi-Tier Data-Centers over InfiniBand, Jun 2004
18	W. Jiang, High Performance MPICH2 One-Sided Communication Implementation over InfiniBand, Jun 2004
19	A. Wagner, Static and Dynamic Processing Offload on Myrinet Clusters with Programmable NIC Support, Jun 2004
20	A. Moody, NIC-based Reduction on Large-Scale Quadrics Clusters, Dec 2003
21	B. Chandrasekharan, Micro-benchmark Level Performance Evaluation and Comparison of High Speed Cluster Interconnects, Sep 2003
22	S. Kini, Efficient Collective Communication using Multicast and RDMA Operations for InfiniBand-based Clusters, Jun 2003
23	S. Senapathi, QoS-Aware Middleware to Support Interactive and Resource Adaptive Applications on Myrinet Clusters, Sep 2002
24	P. Shivam, High Performance User Level Protocol on Gigabit Ethernet, Aug 2002
25	R. Gupta, Efficient Collective Communication using Remote Memory Operations on VIA-Based Clusters, Aug 2002
26	A. Saify, Optimizing Collective Communication Operations in ARMCI, Jul 2002
27	S. Desai, Mechanisms for Implementing Efficient Collective Communication in Clusters with Application Bypass, Jun 2002
28	V. Tipparaju, Optimizing ARMCI Get/Put Operations on Myrinet/GM, Sep 2001
29	A. Gulati, A Proportional Bandwidth Allocation Scheme for Myrinet Clusters, Jun 2001
30	V. Kota, Designing Efficient Inter-Cluster Communication Layer for Distributed Computing, Jun 2001
31	S. Kutlug, Performance Evaluation and Analysis of User Level Networking Protocols in Clusters, Jun 2000

Conferences & Workshops (414)

Design and Implementation of an IPC-based Collective MPI Library for Intel GPUs

Exploiting Inter-Layer Expert Affinity for Accelerating Mixture-of-Experts Model Inference

Accelerating MPI AllReduce Communication with Efficient GPU-Based Compression Schemes on Modern GPU Clusters

Accelerating Large Language Model Training with Hybrid GPU-based Compression

Flover: A Temporal Fusion Framework for Efficient Autoregressive Model Parallel Inference

HARVEST: High-Performance Artificial Vision Framework for Expert Labeling using Semi-Supervised Training

Benchmarking Modern Databases for Storing and Profiling Very Large Scale HPC Communication Data

MPI-xCCL: A Portable MPI Library over Collective Communication Libraries for Various Accelerators

Democratizing HPC Access and Use with Knowledge Graphs

DPU-Bench: A Micro-Benchmark Suite to Measure Offload Efficiency Of SmartNICs

Enabling Reconfigurable HPC through MPI-based Inter-FPGA Communication

SAI: AI-Enabled Speech Assistant Interface for Science Gateways in HPC

A Novel Framework for Efficient Offloading of Communication Operations to Bluefield SmartNICs

Accelerating Distributed Deep Learning Training with Compression Assisted Allgather and Reduce-Scatter Communication

MCR-DL: Mix-and-Match Communication Runtime for Deep Learning

Designing and Optimizing GPU-aware Nonblocking MPI Neighborhood Collective Communication for PETSc

In-Depth Evaluation of a Lower-Level Direct-Verbs API on InfiniBand-based Clusters: Early Experiences

Implementing and Optimizing a GPU-aware MPI Library for Intel GPUs: Early Experiences

AccDP: Accelerated Data-Parallel Distributed DNN Training for Modern GPU-Based HPC Clusters

Accelerating Broadcast Communication with GPU Compression for Deep Learning Workloads

Spark Meets MPI: Towards High-Performance Communication Framework for Spark using MPI

Network-Assisted Non-Contiguous Transfers for GPU-Aware MPI Libraries

High Performance MPI over the Slingshot Interconnect: Early Experiences

Highly Efficient Alltoall and Alltoallv Communication Algorithms for GPU Systems

OMB-Py: Python Micro-Benchmarks for Evaluating Performance of MPI Libraries on HPC Systems

Hy-Fi: Hybrid Five-Dimensional Parallel DNN Training on High-Performance GPU Clusters

Accelerating MPI All-to-All Communication with Online Compression on Modern GPU Clusters

OMB-Py: Python Micro-Benchmarks for Evaluating Performance of MPI Libraries and Machine Learning Applications on HPC Systems

DistMILE: A Distributed Multi-Level Framework for Scalable Graph Embedding

Towards Architecture-aware Hierarchical Communication Trees on Modern HPC Systems

Accelerating CPU-based Distributed DNN Training on Modern HPC Clusters using BlueField-2 DPUs

BluesMPI: Efficient MPI Non-blocking Alltoall Offloading Designs on Modern BlueField Smart NICs

Designing a ROCm-aware MPI Library for AMD GPUs: Early Experiences

SUPER: SUb-Graph Parallelism for TransformERs

Designing High-Performance MPI Libraries with On-the-fly Compression for Modern GPU Clusters

Adaptive and Hierarchical Large Message All-to-all Communication Algorithms for Large-scale Dense GPU Systems

Efficient MPI-based Communication for GPU-Accelerated Dask Applications

Blink: Towards Efficient RDMA-based Communication Coroutines for Parallel Python Applications

GEMS: GPU Enabled Memory Aware Model Parallelism System for Distributed DNN Training

Exploring Hybrid MPI+Kokkos Tasks Programming Model

Design and Characterization of InfiniBand Hardware Tag Matching in MPI

Accelerating GPU-based Machine Learning in Python using MPI Library: A Case Study with MVAPICH2-GDR

Dynamic Kernel Fusion for Bulk Non-contiguous Data Transfer on GPU Clusters

Accelerated Real-time Network Monitoring and Profiling at Scale using OSU INAM

NV-Group: Link-Efficient Reductions for Distributed Deep Learning on Modern Dense GPU Systems

Communication-Aware Hardware-Assisted MPI Overlap Engine

HyPar-Flow: Exploiting MPI and Keras for Scalable Hybrid-Parallel DNN Training with TensorFlow

Machine-agnostic and Communication-aware Designs for MPI on Emerging Architectures

High-Performance Adaptive MPI Derived Datatype Communication for Modern Multi-GPU Systems

Design and Evaluation of Shared Memory Communication Benchmarks on Emerging Architectures using MVAPICH2

Leveraging Network-level parallelism with Multiple Process-Endpoints for MPI Broadcast

OMB-UM: Design, Implementation, and Evaluation of CUDA Unified Memory Aware MPI Benchmarks

Scaling TensorFlow, PyTorch, and MXNet using MVAPICH2 for High-Performance Deep Learning on Frontera

Performance Characterization of DNN Training using TensorFlow and PyTorch on Modern Clusters

Communication Profiling and Characterization of Deep-Learning Workloads on Clusters With High-Performance Interconnects

Designing Scalable and High-performance MPI Libraries on Amazon Elastic Fabric Adapter

Performance Evaluation of MPI Libraries on GPU-enabled OpenPOWER Architectures: Early Experiences

Reduction Operations on Modern Supercomputers: Challenges and Solutions

FALCON: Efficient Designs for Zero-copy MPI Datatype Processing on Emerging Architectures

C-GDR: High-Performance Container-aware GPUDirect MPI Communication Schemes on RDMA Networks

Design and Characterization of Shared Address Space MPI Collectives on Modern Architectures

Scalable Distributed DNN Training using TensorFlow and CUDA-Aware MPI: Characterization, Designs, and Performance Evaluation

Characterizing CUDA Unified Memory (UM)-Aware MPI Designs on Modern GPU Architectures

OC-DNN: Exploiting Advanced Unified Memory Capabilities in CUDA 9 and Volta GPUs for Out-of-Core DNN Training

Cooperative Rendezvous Protocols for Improved Performance and Overlap

Optimized Broadcast for Deep Learning Workloads on Dense-GPU InfiniBand Clusters: MPI or NCCL?

Multi-Threading and Lock-Free MPI RMA Based Graph Processing on KNL and POWER Architectures

Efficient Asynchronous Communication Progress for MPI without Dedicated Resources

SALaR: Scalable and Adaptive Designs for Large Message Reduction Collectives

Designing Efficient Shared Address Space Reduction Collectives for Multi-/Many-cores

Kernel-assisted Communication Engine for MPI on Emerging Manycore Processors

Designing Registration Caching Free High-Performance MPI Library with Implicit On-Demand Paging (ODP) of InfiniBand

MPI-LiFE: Designing High-Performance Linear Fascicle Evaluation of Brain Connectome with MPI

Is Singularity-based Container Technology Ready for Running MPI Applications on HPC Clouds?