Deep learning and attention models. Accurately predicting the behavior of a customer with specific characteristics, after certain life events or with a unique set of preferences, can quickly become a very broad and complex task. In theory, every small event could trigger a client to become actively involved in their financial situation, for example a simple discussion with a relative at a birthday party. This complexity and the vast amount of (different) data sets tend to be handled best by neural networks (▇▇▇▇▇▇▇▇, 1982). A neural network is a set of multiple interconnected layers of neurons, inspired by the human brain, which can be trained to represent data at high levels of abstraction (LeCun, ▇▇▇▇▇▇, and ▇▇▇▇▇▇, 2015). Neurons are artificial units that transmit signals to the next neuron. Influenced by input from neurons in the previous layer or by the input data, these neurons output a level of activity. By use of backpropagation, this network of neurons can be trained to recognize patterns and to predict the outcome of new data instances. Backpropagation is the gradient-based learning method applied in the training of most neural networks. Because the activation functions are differentiable, the network is able to propagate the contribution of each neuron to the error of the training instance backwards. This contribution is used to update the weights of each of the neurons in order to achieve higher accuracy in the next iteration. Since a known output matched to the input is required, backpropagation is mostly used in supervised learning tasks. Backpropagation was first described by ▇▇▇▇▇, Boser, et al. (1989) and is considered the accelerator neural networks needed to develop into an applicable algorithm. Deep learning is the collective term for neural networks with multiple layers of neurons. Given the sequential nature of the problem at hand (sequences of events), recurrent neural networks (RNNs) seem best suited. 
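To make the forward and backward pass concrete, the following minimal NumPy sketch trains a toy two-layer network with backpropagation. The network sizes, the sigmoid activations, the squared-error loss, and the learning rate are illustrative assumptions, not the setup used in this thesis.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy network: 2 inputs -> 2 hidden neurons (sigmoid) -> 1 output (sigmoid).
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(2, 2)), np.zeros(2)
W2, b2 = rng.normal(size=(2, 1)), np.zeros(1)

x = np.array([0.5, -0.2])   # illustrative input
y = np.array([1.0])         # known target: supervised learning requires a matched output
lr = 0.5

for _ in range(500):
    # Forward pass: compute each neuron's level of activity.
    h = sigmoid(x @ W1 + b1)
    out = sigmoid(h @ W2 + b2)
    # Backward pass: the differentiable activations let us propagate
    # each neuron's contribution to the error back through the network.
    d_out = (out - y) * out * (1 - out)   # error signal at the output neuron
    d_h = (d_out @ W2.T) * h * (1 - h)    # contribution of each hidden neuron
    # Gradient-based weight updates to reduce the error on the next iteration.
    W2 -= lr * np.outer(h, d_out); b2 -= lr * d_out
    W1 -= lr * np.outer(x, d_h);   b1 -= lr * d_h

error = float(abs(out - y))   # should shrink as training progresses
```

After a few hundred iterations the output approaches the target, which is all the weight updates are designed to achieve.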
This type of neural network architecture allows the previous output of a unit to influence its next input, and is therefore able to account for events in the past when predicting the next event. For example, RNNs are widely used in text translation, where the output (the translation) depends not on a single word but on the sequence of words. General RNNs, however, suffer from the difficulty of learning dependencies over time. Due to the gradient-based backpropagation algorithm, it becomes harder to train the weights of the recurrent layers when the number of layers is increased (i.e. the length of the input sequence is increased); this is known as the vanishing gradient problem (▇▇▇▇▇▇, ▇▇▇▇▇▇, and ▇▇▇▇▇▇▇▇, 1994). Over time, extended variations of the vanilla RNN model, Long Short-Term Memory (LSTM) and Gated Recurrent Units (GRU), became more popular because they are less affected by this issue, allowing them to handle longer input sequences (▇▇▇▇▇▇▇▇▇▇, 1998). LSTMs are a specific type of RNN architecture in which information can be stored in and removed from an internal memory cell (▇▇▇▇▇▇▇▇▇▇ and ▇▇▇▇▇▇▇▇▇▇▇, 1997). These LSTM cells make use of four gates, in contrast to the single gate included in a general RNN cell, enabling them to learn what information to use from the input, what to forget or store in the memory cell, and what to output in each state; this makes them less susceptible to the vanishing gradient problem. In order to divide emphasis more equally over the input sequence, instead of concentrating it at the end, one can use a bidirectional LSTM (bi-LSTM). In this type of architecture there are two layers of hidden recurrent nodes, both connected to the input and output; however, the second layer processes the input sequence in reverse order (▇▇▇▇▇▇▇▇ and ▇▇▇▇▇▇▇, 1997). 
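As a sketch of the four-gate mechanism, the NumPy fragment below implements a single (untrained) LSTM step applied over a toy sequence. The stacked-gate weight layout and the dimensions are illustrative assumptions rather than a reference implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM step; W, U, b hold the four gates stacked row-wise."""
    n = h_prev.shape[0]
    z = W @ x + U @ h_prev + b
    i = sigmoid(z[0 * n:1 * n])    # input gate: what to write to the memory cell
    f = sigmoid(z[1 * n:2 * n])    # forget gate: what to erase from the memory cell
    g = np.tanh(z[2 * n:3 * n])    # candidate values for the memory cell
    o = sigmoid(z[3 * n:4 * n])    # output gate: what to expose as the hidden state
    c = f * c_prev + i * g         # updated internal memory cell
    h = o * np.tanh(c)             # new hidden state
    return h, c

rng = np.random.default_rng(1)
d_in, d_h = 3, 4                   # illustrative input and hidden sizes
W = rng.normal(scale=0.1, size=(4 * d_h, d_in))
U = rng.normal(scale=0.1, size=(4 * d_h, d_h))
b = np.zeros(4 * d_h)

h, c = np.zeros(d_h), np.zeros(d_h)
for x in rng.normal(size=(5, d_in)):   # a toy sequence of 5 events
    h, c = lstm_step(x, h, c, W, U, b)
```

A bi-LSTM would run a second copy of this loop over the reversed sequence and concatenate the two hidden states.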
This advanced RNN structure should be able to detect clients, with a specific set of characteristics or events, who will be more or less likely to become actively involved in their pension. However, it is the particular event or set of events that caused the trigger in the life of the client that is most interesting and valuable to the business. This is why attention mechanisms will be researched and implemented on top of the deep learning architectures. In a typical ‘many-to-one’ classification problem tackled by an RNN, where a sequence is used to predict a single outcome, all information from the input sequence is summarized in the final hidden state of the recurrent layer. Attention models not only use the intermediate hidden states, but can also put more or less emphasis on previous hidden states. This technique has recently proved its value in visual recognition tasks (▇▇ et al., 2015), where the challenge was to describe the content of an image: the attention model was able to focus on the specific part of the image crucial to describing the next word of the output. More recent developments in the field of attention models have revolved either around the application to more complex problems or around combining the methodology with other advancements in the field of deep learning. Examples include visual question answering (▇▇▇▇▇▇▇▇, ▇▇▇▇▇▇▇, and ▇▇▇▇▇, 2017), where the machine learns to answer a question based on (parts of) an image, and Spatial Transformer Networks (▇▇▇▇▇▇▇▇▇, ▇▇▇▇▇▇▇▇, ▇▇▇▇▇▇▇▇▇, et al., 2015), which deal with the inability to be spatially invariant to (graphical) input data. With the emphasis on accuracy in the latest developments in, for example, deep learning, models become more complex and therefore harder to interpret. This creates a tension between model performance and interpretability (▇▇▇▇▇▇▇▇ and ▇▇▇, 2017). 
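A minimal sketch of such an attention mechanism over the hidden states of a many-to-one RNN could look as follows. The random hidden states stand in for trained RNN outputs, and the dot-product scoring against a single vector is one of several common scoring choices, assumed here for brevity.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def attention_pool(H, v):
    """Attend over a sequence of hidden states H (T x d) with scoring vector v."""
    scores = H @ v           # one relevance score per time step
    alpha = softmax(scores)  # attention weights, non-negative and summing to 1
    context = alpha @ H      # weighted sum of ALL hidden states,
    return context, alpha    # instead of only the final hidden state

rng = np.random.default_rng(2)
H = rng.normal(size=(6, 4))  # 6 intermediate hidden states of dimension 4
v = rng.normal(size=4)       # stand-in for a learned scoring vector
context, alpha = attention_pool(H, v)
```

The `context` vector, rather than the last hidden state alone, would then feed the final classification layer.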
However, interpretability of models is vital both in understanding the results and in gaining the user’s trust in those results (▇▇▇▇▇▇▇, ▇▇▇▇▇, and ▇▇▇▇▇▇▇▇, 2016). This is why a second advantage of applying attention models, next to performance, is their explainability. Because the attention weights are trained by the model, they can be extracted and visualized, making it possible to pinpoint the part of the sequence or image that contributed most in each phase of the prediction (▇▇ et al., 2015). In our case, we should be able to apply attention to the sequence of (life) events in order to conclude which event(s) contributed most to the activation of the client.
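As an illustration of this explainability, the fragment below turns hypothetical attention scores for one client's event sequence into normalized weights and reads off the most influential event. The event names and score values are invented for the example, not taken from the data used in this thesis.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Hypothetical trained attention scores for one client's life-event sequence.
events = ["marriage", "house purchase", "job change", "child born"]
scores = np.array([0.2, 1.5, 0.3, 2.4])   # illustrative values only

alpha = softmax(scores)                   # extracted attention weights
top_event = events[int(np.argmax(alpha))] # the event that contributed most
```

Plotting `alpha` against `events` (e.g. as a bar chart) gives the kind of visualization that lets the business see which life event triggered the client's activation.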