Hi Celso,

I have little background on your problem. I assume you are familiar with neural network and NLP. This Pytorch-based article [1] seems related.

* The most difficult part is how to encode the (code,docstring) into embedding (code_embedding, docstring_embedding).

I assume both (code, docstring) are strings.

If either (code, docstring) is short and categorical, I recommend using word2vector embedding[2].

If either (code, docstring) is long, you will need to start using recurrent neural networks [3].

If either (code, docstring) is very long, you will need to come up with tricks like splitting the text into a fixed splits, encoding each separately and concatenating. I recommend using RNN and word2vector before dipping into this option.

* Once you have the embedding (code_embedding, docstring_embedding), applying N-pair loss is straight forward. There are standard implementations [4]

Resources (sorted based on importance)


[2] Word embedding https://www.tensorflow.org/tutorials/text/word_embeddings

[3] RNN for text https://www.tensorflow.org/tutorials/text/text_classification_rnn

[4] https://www.tensorflow.org/addons/api_docs/python/tfa/losses/npairs_loss

I write reviews on computer vision papers. Writing tips are welcomed.

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store