{"id":4697,"date":"2020-09-30T07:10:04","date_gmt":"2020-09-30T07:10:04","guid":{"rendered":"https:\/\/www.dinu.at\/profile\/home\/?p=4697"},"modified":"2020-09-30T07:10:06","modified_gmt":"2020-09-30T07:10:06","slug":"align-rudder-learning-from-few-demonstrations-by-reward-redistribution","status":"publish","type":"post","link":"https:\/\/www.dinu.at\/profile\/home\/align-rudder-learning-from-few-demonstrations-by-reward-redistribution\/","title":{"rendered":"Align-RUDDER: Learning From Few Demonstrations by  Reward Redistribution"},"content":{"rendered":"<div id=\"themify_builder_content-4697\" data-postid=\"4697\" class=\"themify_builder_content themify_builder_content-4697 themify_builder\">\n\n    \n\t\t<!-- module_row -->\n\t\t<div  class=\"themify_builder_row module_row clearfix module_row_0 themify_builder_4697_row module_row_4697-0\" data-id=\"8c768a5\">\n\t\t\t\t\t\t<div class=\"row_inner col_align_top\" >\n                                    <div  class=\"module_column tb-column col-full first tb_4697_column module_column_0 module_column_4697-0-0\" data-id=\"da9e05d\" >\n                                                                <div class=\"tb-column-inner\">\n                            \n\n    <!-- module plain text -->\n    <div  id=\"plain-text-4697-0-0-0\" class=\"module module-plain-text plain-text-4697-0-0-0  \" data-id=\"2fc653f\">\n        <!--insert-->\n        Reinforcement Learning algorithms require a large number of samples to solve complex tasks with sparse and delayed rewards. Complex tasks can often be hierarchically decomposed into sub-tasks. A step in the Q-function can be associated with solving a sub-task, where the expectation of the return increases. RUDDER has been introduced to identify these steps and then redistribute reward to them, thus immediately giving reward if sub-tasks are solved. Since the problem of delayed rewards is mitigated, learning is considerably sped up. However, for complex tasks, current exploration strategies as deployed in RUDDER struggle with discovering episodes with high rewards. Therefore, we assume that episodes with high rewards are given as demonstrations and do not have to be discovered by exploration. Typically the number of demonstrations is small and RUDDER's LSTM model as a deep learning method does not learn well. Hence, we introduce Align-RUDDER, which is RUDDER with two major modifications. First, Align-RUDDER assumes that episodes with high rewards are given as demonstrations, replacing RUDDER's safe exploration and lessons replay buffer. Second, we replace RUDDER's LSTM model by a profile model that is obtained from multiple sequence alignment of demonstrations. Profile models can be constructed from as few as two demonstrations as known from bioinformatics. Align-RUDDER inherits the concept of reward redistribution, which considerably reduces the delay of rewards, thus speeding up learning. Align-RUDDER outperforms competitors on complex artificial tasks with delayed reward and few demonstrations. On the MineCraft ObtainDiamond task, Align-RUDDER is able to mine a diamond, though not frequently. 
Code is published on <a href="https://github.com/ml-jku/align-rudder">GitHub</a>.
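To give a rough feel for how reward redistribution from aligned demonstrations works, here is a minimal Python sketch, not the published implementation: the function names, the event encoding, and the position-wise scoring are illustrative assumptions, and the sketch assumes the demonstrations are already aligned to a common length, whereas Align-RUDDER builds its profile model from a proper multiple sequence alignment. The idea shown is that an episode's delayed return is redistributed to the steps at which the episode matches the demonstrated sub-tasks.

<pre>
# Minimal, self-contained sketch of the reward-redistribution idea (not the authors' code):
# 1) summarize demonstrations as sequences of discrete events,
# 2) build a simple position-wise profile from (already aligned) demonstrations,
# 3) score a new episode against the profile and redistribute its return as
#    differences of consecutive scores, so reward arrives when sub-tasks are
#    accomplished instead of only at the end of the episode.

from collections import Counter
from typing import List

def build_profile(demos: List[List[str]]) -> List[Counter]:
    """Position-wise event frequencies of aligned demonstrations."""
    length = len(demos[0])
    assert all(len(d) == length for d in demos), "sketch assumes aligned demos"
    return [Counter(d[i] for d in demos) for i in range(length)]

def profile_score(events: List[str], profile: List[Counter], n_demos: int) -> List[float]:
    """Cumulative match score g_t of an episode prefix against the profile."""
    scores, g = [], 0.0
    for t, e in enumerate(events):
        if t < len(profile):
            g += profile[t][e] / n_demos  # fraction of demos with this event at step t
        scores.append(g)
    return scores

def redistribute_reward(events: List[str], episode_return: float,
                        profile: List[Counter], n_demos: int) -> List[float]:
    """Per-step rewards proportional to g_t - g_{t-1}, summing to the episode return."""
    g = profile_score(events, profile, n_demos)
    total = g[-1] if g[-1] > 0 else 1.0
    prev, rewards = 0.0, []
    for gt in g:
        rewards.append(episode_return * (gt - prev) / total)
        prev = gt
    return rewards

if __name__ == "__main__":
    # Two toy demonstrations of a "collect wood -> craft table -> craft pickaxe -> mine ore" task.
    demos = [["wood", "table", "pickaxe", "ore"],
             ["wood", "table", "pickaxe", "ore"]]
    profile = build_profile(demos)
    episode = ["wood", "table", "pickaxe", "ore"]
    print(redistribute_reward(episode, episode_return=1.0,
                              profile=profile, n_demos=len(demos)))
    # -> [0.25, 0.25, 0.25, 0.25]: the single delayed reward is spread over the
    #    steps at which the episode matches the demonstrated sub-tasks.
</pre>

In this toy example the delayed final reward is spread evenly over the four sub-task steps; for the actual profile models and experiments, see the repository linked above.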