{"id":5076,"date":"2022-08-01T14:37:12","date_gmt":"2022-08-01T14:37:12","guid":{"rendered":"https:\/\/www.dinu.at\/profile\/home\/?p=5076"},"modified":"2022-08-13T10:59:16","modified_gmt":"2022-08-13T10:59:16","slug":"a-dataset-perspective-on-offline-reinforcement-learning","status":"publish","type":"post","link":"https:\/\/www.dinu.at\/profile\/home\/a-dataset-perspective-on-offline-reinforcement-learning\/","title":{"rendered":"A Dataset Perspective on Offline Reinforcement Learning"},"content":{"rendered":"<div id=\"themify_builder_content-5076\" data-postid=\"5076\" class=\"themify_builder_content themify_builder_content-5076 themify_builder\">\n\n    <\/div>\n\n\n\n<h2>Abstract<\/h2>\n\n\n\n<p>The application of Reinforcement Learning (RL) in real world environments can be expensive or risky due to sub-optimal policies during training. In Offline RL, this problem is avoided since interactions with an environment are prohibited. Policies are learned from a given dataset, which solely determines their performance. Despite this fact, how dataset characteristics influence Offline RL algorithms is still hardly investigated. The dataset characteristics are determined by the behavioral policy that samples this dataset. Therefore, we define characteristics of behavioral policies as exploratory for yielding high expected information in their interaction with the Markov Decision Process (MDP) and as exploitative for having high expected return. We implement two corresponding empirical measures for the datasets sampled by the behavioral policy in deterministic MDPs. The first empirical measure SACo is defined by the normalized unique state-action pairs and captures exploration. The second empirical measure TQ is defined by the normalized average trajectory return and captures exploitation. Empirical evaluations show the effectiveness of TQ and SACo. In large-scale experiments using our proposed measures, we show that the unconstrained off-policy Deep Q-Network family requires datasets with high SACo to find a good policy. Furthermore, experiments show that policy constraint algorithms perform well on datasets with high TQ and SACo. Finally, the experiments show, that purely dataset-constrained Behavioral Cloning performs competitively to the best Offline RL algorithms for datasets with high TQ.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Abstract The application of Reinforcement Learning (RL) in real world environments can be expensive or risky due to sub-optimal policies during training. In Offline RL, this problem is avoided since interactions with an environment are prohibited. Policies are learned from a given dataset, which solely determines their performance. 