Hi Celso,

1 min readMay 26, 2020

Hi Celso,

I have little background on your problem. I assume you are familiar with neural network and NLP. This Pytorch-based article [1] seems related.

* The most difficult part is how to encode the (code,docstring) into embedding (code_embedding, docstring_embedding).

I assume both (code, docstring) are strings.

If either (code, docstring) is short and categorical, I recommend using word2vector embedding[2].

If either (code, docstring) is long, you will need to start using recurrent neural networks [3].

If either (code, docstring) is very long, you will need to come up with tricks like splitting the text into a fixed splits, encoding each separately and concatenating. I recommend using RNN and word2vector before dipping into this option.

* Once you have the embedding (code_embedding, docstring_embedding), applying N-pair loss is straight forward. There are standard implementations [4]

Resources (sorted based on importance)

[1]https://ai.facebook.com/blog/neural-code-search-ml-based-code-search-using-natural-language-queries/

[2] Word embedding https://www.tensorflow.org/tutorials/text/word_embeddings

[3] RNN for text https://www.tensorflow.org/tutorials/text/text_classification_rnn

[4] https://www.tensorflow.org/addons/api_docs/python/tfa/losses/npairs_loss

Written by Ahmed Taha

Responses (1)