Accelerating PyTorch Training Workloads with FP8