{"id":5059,"date":"2020-10-17T16:44:40","date_gmt":"2020-10-17T16:44:40","guid":{"rendered":"https:\/\/www.dinu.at\/profile\/home\/?p=5059"},"modified":"2022-08-13T10:58:40","modified_gmt":"2022-08-13T10:58:40","slug":"xai-and-strategy-extraction-via-reward-redistribution","status":"publish","type":"post","link":"https:\/\/www.dinu.at\/profile\/home\/xai-and-strategy-extraction-via-reward-redistribution\/","title":{"rendered":"XAI and Strategy Extraction via Reward Redistribution"},"content":{"rendered":"<div id=\"themify_builder_content-5059\" data-postid=\"5059\" class=\"themify_builder_content themify_builder_content-5059 themify_builder\">\n\n    <\/div>\n\n\n\n<h2>Abstract<\/h2>\n\n\n\n<p>In reinforcement learning, an agent interacts with an environment from which it receives rewards, that are then used to learn a task. However, it is often unclear what strategies or concepts the agent has learned to solve the task. Thus, interpretability of the agent\u2019s behavior is an important aspect in practical applications, next to the agent\u2019s performance at the task itself. However, with the increasing complexity of both tasks and agents, interpreting the agent\u2019s behavior becomes much more difficult. Therefore, developing new interpretable RL agents is of high importance. To this end, we propose to use Align-RUDDER as an interpretability method for reinforcement learning. Align-RUDDER is a method based on the recently introduced RUDDER framework, which relies on contribution analysis of an LSTM model, to redistribute rewards to key events. From these key events a strategy can be derived, guiding the agent\u2019s decisions in order to solve a certain task. More importantly, the key events are in general interpretable by humans, and are often sub-tasks; where solving these sub-tasks is crucial for solving the main task. Align-RUDDER enhances the RUDDER framework with methods from multiple sequence alignment (MSA) to identify key events from demonstration trajectories. MSA needs only a few trajectories in order to perform well, and is much better understood than deep learning models such as LSTMs. Consequently, strategies and concepts can be learned from a few expert demonstrations, where the expert can be a human or an agent trained by reinforcement learning. By substituting RUDDER\u2019s LSTM with a profile model that is obtained from MSA of demonstration trajectories, we are able to interpret an agent at three stages: First, by extracting common strategies from demonstration trajectories with MSA. Second, by encoding the most prevalent strategy via the MSA profile model and therefore explaining the expert\u2019s behavior. And third, by allowing the interpretation of an arbitrary agent\u2019s behavior based on its demonstration trajectories.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Abstract In reinforcement learning, an agent interacts with an environment from which it receives rewards, that are then used to learn a task. However, it is often unclear what strategies or concepts the agent has learned to solve the task. 