Background:
In NLP code there is a common form of autoregressive (N2N) decoder that takes the previous step's prediction, together with the current step's features, as input to predict the current step's output. From what I have seen, most implementations use teacher forcing during training and obtain the whole result quickly through matrix multiplication; during testing, the results are predicted one step at a time in a for loop.
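For concreteness, here is a minimal toy sketch of the two modes. The decoder (an nn.GRUCell), the dimensions, and all variable names are illustrative assumptions, not code taken from any particular project:

import torch
import torch.nn as nn

vocab, dim, batch, seq_len = 100, 32, 4, 10
embed = nn.Embedding(vocab, dim)
cell = nn.GRUCell(2 * dim, dim)
proj = nn.Linear(dim, vocab)

features = torch.randn(batch, seq_len, dim)
targets = torch.randint(0, vocab, (batch, seq_len))

# Training with teacher forcing: the "previous prediction" is replaced by the
# ground-truth previous token, so every step's input is known in advance
# (transformer-style models exploit this to compute all steps in one batched pass).
h = torch.zeros(batch, dim)
prev = torch.zeros(batch, dtype=torch.long)  # stands in for a <bos> token
loss = torch.zeros(())
for t in range(seq_len):
    h = cell(torch.cat([embed(prev), features[:, t]], dim=-1), h)
    loss = loss + nn.functional.cross_entropy(proj(h), targets[:, t])
    prev = targets[:, t]  # ground truth, not the model's own prediction
loss.backward()

# Testing: the model's own prediction from the previous step is fed back,
# one step at a time, inside torch.no_grad().
with torch.no_grad():
    h = torch.zeros(batch, dim)
    prev = torch.zeros(batch, dtype=torch.long)
    preds = []
    for t in range(seq_len):
        h = cell(torch.cat([embed(prev), features[:, t]], dim=-1), h)
        prev = proj(h).argmax(dim=-1)  # feed back the prediction
        preds.append(prev)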
Problem:
During testing there are no memory issues as long as the loop runs under torch.no_grad(). However, if we do not use teacher forcing during training, that is, if the result is produced step by step instead of in one pass of matrix multiplication, we run into an out-of-memory error.
Solution:
At every place where a tensor is accumulated by addition or appended to a list, use the form .data[0] so that the accumulated value does not carry the computation graph of the earlier steps.
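Here is a minimal sketch of the idea, using .detach() as the modern spelling of "take the value without its computation graph" (.data behaves similarly; the extra [0] in the form above additionally indexes the first element, so whether it is needed depends on the shapes in the specific project). Keep in mind that detached values no longer receive gradients through that path, so the loss must still depend on the non-detached outputs:

import torch

layer = torch.nn.Linear(16, 16)
x = torch.randn(8, 16)

# Without detaching: `total` and `history` keep every iteration's graph alive,
# so memory grows with the number of decoding steps.
total, history = torch.zeros(8, 16), []
for _ in range(1000):
    out = layer(x)
    total = total + out        # the whole history of graphs stays reachable
    history.append(out)        # list appends have the same effect

# With detaching: only the current step's graph exists at any moment.
total, history = torch.zeros(8, 16), []
for _ in range(1000):
    out = layer(x)
    total = total + out.detach()   # value only, no graph attached
    history.append(out.detach())
    # gradients cannot flow back through `total` or `history` here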
Example:
In the project gongliym/data2text-transformer (Enhanced Transformer Model for Data-to-Text Generation) on GitHub:
Since the ultimate goal is to generate text, the model follows the N2N form, and during testing the memory overflows if torch.no_grad() is not used. If, in the file .model/src/model/transformer.py, the places where the tensor is accumulated onto itself, such as line 380:
tensor = tensor + attn
are changed to:
tensor = tensor + attn.data[0]
then the memory overflow issue will not occur.
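For reference, on recent PyTorch versions the same "value without graph" effect is usually written with .detach(); a hedged equivalent of the change above, assuming the [0] indexing is only there to drop the graph and is not needed for shape reasons, would be:

tensor = tensor + attn.detach()  # same values as attn, but with no autograd history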
Additionally:
The problem described above cannot be solved with torch.cuda.empty_cache(). In my experiments, calling this command did not free the memory in question and greatly increased training time. This makes sense: empty_cache() only releases unoccupied cached blocks back to the GPU driver and cannot free memory that is still referenced by the computation graph.