WebVTT has a feature called regions. A region is a defined area of the video viewport that cues are rendered into, which allows roll-up captions for multiple speakers.
This WebVTT example is taken directly from the WebVTT spec as it stood on 2014-10-19. As of that date, no browser I've tested handles regions completely correctly. Safari 8 is reported to have implemented regions, but either I'm misunderstanding how regions are supposed to work or its implementation doesn't follow the specification.
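For anyone repeating this test in other browsers, a rough first check is whether the `VTTRegion` interface from the spec is exposed at all. The sketch below is only that: the function name `supportsVttRegions` and its `scope` parameter are my own, made up for illustration, and a `true` result only means the constructor exists, not that regions are rendered correctly.

```typescript
// Minimal sketch: detect whether a scope exposes the VTTRegion constructor.
// `supportsVttRegions` and `scope` are hypothetical names, not part of any API.
// A true result does NOT prove that regions render according to the spec.
function supportsVttRegions(
  scope: Record<string, unknown> = globalThis as unknown as Record<string, unknown>
): boolean {
  return typeof scope["VTTRegion"] === "function";
}
```

In a browser console, calling `supportsVttRegions()` with no argument checks the page's global scope. Rendering still has to be verified by eye against the example.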