We review the limitations of BLEU and ROUGE, the most popular metrics for assessing hypothesis summaries against reference summaries. We then formulate criteria for how a good metric should behave, and propose concrete ways to use and test recent Transformer-based language models for the same assessment.