What is Machine Learning on Code?
Why is MLonCode hard?
“Programming languages, in theory, are complex, flexible and powerful, but the programs that real people actually write are mostly simple and rather repetitive, and thus they have usefully predictable statistical properties that can be captured in statistical language models and leveraged for software engineering tasks.” This statement justifies the usefulness of Big Code: the more source code is analyzed, the more pronounced these statistical properties become, and the better the metrics a trained machine learning model achieves. The same relationship holds for current state-of-the-art Natural Language Processing models such as XLNet and ULMFiT. Likewise, universal MLonCode models can be trained and then leveraged in downstream tasks.
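To make the "predictable statistical properties" claim concrete, here is a minimal sketch of a statistical language model over code tokens. It is an illustration only, not the method of any particular MLonCode system: the regex tokenizer, the add-one smoothing, and the names `tokenize`, `train_bigram` and `cross_entropy` are all assumptions made for this example. The model is a bigram counter trained on a tiny, hypothetical snippet, and it scores new token sequences in bits per token (lower means more predictable).

```python
import math
import re
from collections import Counter


def tokenize(source):
    # Crude lexer (assumption): identifiers, integers, or any single
    # non-whitespace character. Real lexers are language-specific.
    return re.findall(r"[A-Za-z_]\w*|\d+|\S", source)


def train_bigram(tokens):
    # Count adjacent token pairs and how often each token acts as a context.
    pair_counts = Counter(zip(tokens, tokens[1:]))
    context_counts = Counter(tokens[:-1])
    # Reserve one extra vocabulary slot for tokens never seen in training.
    vocab_size = len(set(tokens)) + 1
    return pair_counts, context_counts, vocab_size


def cross_entropy(tokens, model):
    # Average number of bits the model needs per token transition,
    # using add-one (Laplace) smoothing so unseen pairs get non-zero mass.
    pair_counts, context_counts, vocab_size = model
    pairs = list(zip(tokens, tokens[1:]))
    total_bits = 0.0
    for prev, cur in pairs:
        p = (pair_counts[(prev, cur)] + 1) / (context_counts[prev] + vocab_size)
        total_bits -= math.log2(p)
    return total_bits / len(pairs)


# Hypothetical training corpus: two near-identical loops.
training_code = (
    "for i in range(n):\n    total += a[i]\n"
    "for j in range(m):\n    count += b[j]\n"
)
model = train_bigram(tokenize(training_code))

# A new loop written in the same style, and the same tokens in reverse order.
familiar_tokens = tokenize("for k in range(n):\n    total += b[k]\n")
scrambled_tokens = list(reversed(familiar_tokens))

familiar_bits = cross_entropy(familiar_tokens, model)
scrambled_bits = cross_entropy(scrambled_tokens, model)
print(f"familiar:  {familiar_bits:.2f} bits/token")
print(f"scrambled: {scrambled_bits:.2f} bits/token")
```

Even with this toy corpus, the loop written in the familiar style scores lower than the same tokens in scrambled order, because most of its token transitions were already seen during training. Larger corpora sharpen these counts, which is the intuition behind the Big Code argument above.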