{"id":8183,"date":"2023-11-22T17:19:46","date_gmt":"2023-11-23T01:19:46","guid":{"rendered":"https:\/\/live-cometml.pantheonsite.io\/?p=8183"},"modified":"2025-04-24T17:04:23","modified_gmt":"2025-04-24T17:04:23","slug":"diving-deep-into-langchains-comparison-evaluators","status":"publish","type":"post","link":"https:\/\/www.comet.com\/site\/blog\/diving-deep-into-langchains-comparison-evaluators\/","title":{"rendered":"Diving Deep into LangChain\u2019s Comparison Evaluators"},"content":{"rendered":"\n<h2 class=\"wp-block-heading pw-subtitle-paragraph tt fy tg be b tu tv tw tx ty tz ua ub uc ud ue uf ug uh ui ro dq\" id=\"dcb7\">Mastering Pairwise Assessments for Optimized Language Model Outputs<\/h2>\n\n\n\n<div class=\"ew tb tc td te\">\n<div class=\"ab cm\">\n<div class=\"hy bg hz ia ib ic\">\n<figure class=\"yb yc yd ye yf yg lp lq paragraph-image\">\n<div class=\"yh yi dl yj bg yk\" tabindex=\"0\" role=\"button\">\n<div class=\"lp lq amo\">\n<\/div><\/div><\/figure><\/div><\/div><\/div>\n\n\n\n<figure class=\"wp-block-image alignnone bg xl yl c\"><img decoding=\"async\" src=\"https:\/\/miro.medium.com\/v2\/resize:fit:1400\/0*jKHnRyHfDmYvpR9_\" alt=\"langchain comparison evaluators\"\/><figcaption class=\"wp-element-caption\">Photo by\u00a0<a href=\"https:\/\/unsplash.com\/@dietmarbecker?utm_source=medium&amp;utm_medium=referral\">Dietmar Becker<\/a>\u00a0on\u00a0<a href=\"http:\/\/Unsplash.com\">Unsplash<\/a><\/figcaption><\/figure>\n\n\n\n<h2 class=\"wp-block-heading yq yr tg be ys yt yu tw mk yv yw tz mp yx yy yz za zb zc zd ze zf zg zh zi zj bj\" id=\"dd53\">Introduction<\/h2>\n\n\n\n<p class=\"pw-post-body-paragraph zk zl tg be b tu zm zn zo tx zp zq zr mq zs zt zu mv zv zw zx na zy zz aba abb ew bj wp-block-paragraph\" id=\"5193\">In LangChain, comparison evaluators are designed to measure and compare outputs from two different chains or LLMs. These tools are invaluable for A\/B testing between models or analyzing distinct versions. 
Moreover, they can be employed to generate preference scores for AI-assisted reinforcement learning.<\/p>\n\n\n\n<p class=\"pw-post-body-paragraph zk zl tg be b tu abc zn zo tx abd zq zr mq abe zt zu mv abf zw zx na abg zz aba abb ew bj wp-block-paragraph\" id=\"aad9\">At their core, these evaluators derive from the PairwiseStringEvaluator class, facilitating a comparison between two output strings. This could result from two distinct prompts, models, or simply different versions of the same model. Essentially, these evaluators assess pairs of strings, providing a detailed evaluation score and other pertinent information.<\/p>\n\n\n\n<p class=\"pw-post-body-paragraph zk zl tg be b tu abc zn zo tx abd zq zr mq abe zt zu mv abf zw zx na abg zz aba abb ew bj wp-block-paragraph\" id=\"1807\">To craft a tailored comparison evaluator, developers can inherit from the PairwiseStringEvaluator class and modify the&nbsp;<code class=\"eg abh abi abj abk b\">_evaluate_string_pairs<\/code>&nbsp;method. 
Asynchronous evaluations are also supported by overriding the&nbsp;<code class=\"eg abh abi abj abk b\">_aevaluate_string_pairs<\/code>&nbsp;method.<\/p>\n\n\n\n<p class=\"pw-post-body-paragraph zk zl tg be b tu abc zn zo tx abd zq zr mq abe zt zu mv abf zw zx na abg zz aba abb ew bj wp-block-paragraph\" id=\"6432\">Key features of a comparison evaluator include:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><code class=\"eg abh abi abj abk b\">evaluate_string_pairs<\/code>: Evaluates a pair of output strings. Custom evaluators supply its logic by overriding&nbsp;<code class=\"eg abh abi abj abk b\">_evaluate_string_pairs<\/code>.<\/li>\n\n\n\n<li><code class=\"eg abh abi abj abk b\">aevaluate_string_pairs<\/code>: The asynchronous counterpart, backed by&nbsp;<code class=\"eg abh abi abj abk b\">_aevaluate_string_pairs<\/code>.<\/li>\n\n\n\n<li><code class=\"eg abh abi abj abk b\">requires_input<\/code>: Indicates whether the evaluator needs an input string.<\/li>\n\n\n\n<li><code class=\"eg abh abi abj abk b\">requires_reference<\/code>: Indicates whether the evaluator needs a reference label.<\/li>\n<\/ul>\n\n\n\n<p class=\"pw-post-body-paragraph zk zl tg be b tu abc zn zo tx abd zq zr mq abe zt zu mv abf zw zx na abg zz aba abb ew bj wp-block-paragraph\" id=\"5828\">Comparison evaluators compare outputs from two models or prompts and return a score that expresses which output is preferred. They can be adapted to specific comparative analysis requirements.<\/p>\n\n\n\n<p class=\"pw-post-body-paragraph zk zl tg be b tu abc zn zo tx abd zq zr mq abe zt zu mv abf zw zx na abg zz aba abb ew bj wp-block-paragraph\" id=\"83ad\">For detailed evaluation, the PairwiseStringEvalChain class\u2019s&nbsp;<code class=\"eg abh abi abj abk b\">evaluate_string_pairs<\/code>&nbsp;method compares two output strings, determining the preferred one based on specific criteria. This method can be used with or without a reference. 
While using a reference provides a more reliable result, the absence of one will rely on the evaluator&#8217;s preference, which might be less accurate.<\/p>\n\n\n\n<p class=\"pw-post-body-paragraph zk zl tg be b tu abc zn zo tx abd zq zr mq abe zt zu mv abf zw zx na abg zz aba abb ew bj wp-block-paragraph\" id=\"76bf\">Customization is at the heart of these evaluators. Developers can define their evaluation criteria or use predefined ones from LangChain. Additionally, one can customize the evaluation prompt for task-specific instructions, ensuring the evaluator scores the output as desired.<\/p>\n\n\n\n<h2 class=\"wp-block-heading acx yr tg be ys mg acy mh mk ml acz mm mp mq ada mr mu mv adb mw mz na adc nb ne add bj\" id=\"2be6\">Comparison evaluators in LangChain help measure two different chains or LLM outputs.<\/h2>\n\n\n\n<p class=\"pw-post-body-paragraph zk zl tg be b tu zm zn zo tx zp zq zr mq zs zt zu mv zv zw zx na zy zz aba abb ew bj wp-block-paragraph\" id=\"3c14\">These evaluators are helpful for comparative analyses, such as A\/B testing between two language models or comparing different versions of the same model. 
They can also help generate preference scores for AI-assisted reinforcement learning.<\/p>\n\n\n\n<p class=\"pw-post-body-paragraph zk zl tg be b tu abc zn zo tx abd zq zr mq abe zt zu mv abf zw zx na abg zz aba abb ew bj wp-block-paragraph\" id=\"2c1c\">These evaluators inherit from the&nbsp;<code class=\"eg abh abi abj abk b\">PairwiseStringEvaluator<\/code>&nbsp;class, providing a comparison interface for two strings &#8211; typically, the outputs from two different prompts or models or two versions of the same model.<\/p>\n\n\n\n<blockquote class=\"wp-block-quote is-layout-flow wp-block-quote-is-layout-flow\">\n<p class=\"zk zl ade be b tu abc zn zo tx abd zq zr mq abe zt zu mv abf zw zx na abg zz aba abb ew bj wp-block-paragraph\" id=\"3342\">tl;dr<em class=\"tg\">: A comparison evaluator evaluates a pair of strings and returns a dictionary containing the evaluation score and other relevant details.<\/em><\/p>\n<\/blockquote>\n\n\n\n<p class=\"pw-post-body-paragraph zk zl tg be b tu abc zn zo tx abd zq zr mq abe zt zu mv abf zw zx na abg zz aba abb ew bj wp-block-paragraph\" id=\"1a0b\">To create a custom comparison evaluator, inherit from the&nbsp;<code class=\"eg abh abi abj abk b\">PairwiseStringEvaluator<\/code>&nbsp;class and override the&nbsp;<code class=\"eg abh abi abj abk b\">_evaluate_string_pairs<\/code>&nbsp;method.<\/p>\n\n\n\n<p class=\"pw-post-body-paragraph zk zl tg be b tu abc zn zo tx abd zq zr mq abe zt zu mv abf zw zx na abg zz aba abb ew bj wp-block-paragraph\" id=\"9744\">If you require asynchronous evaluation, override the&nbsp;<code class=\"eg abh abi abj abk b\">_aevaluate_string_pairs<\/code>&nbsp;method.<\/p>\n\n\n\n<p class=\"pw-post-body-paragraph zk zl tg be b tu abc zn zo tx abd zq zr mq abe zt zu mv abf zw zx na abg zz aba abb ew bj wp-block-paragraph\" id=\"a927\">Here\u2019s a summary of the essential methods and properties of a comparison evaluator:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><code class=\"eg abh abi abj abk 
b\">evaluate_string_pairs<\/code>: Evaluate the output string pairs. Custom evaluators supply the logic for this method by overriding&nbsp;<code class=\"eg abh abi abj abk b\">_evaluate_string_pairs<\/code>.<\/li>\n\n\n\n<li><code class=\"eg abh abi abj abk b\">aevaluate_string_pairs<\/code>: Asynchronously evaluate the output string pairs. Custom evaluators supply the logic for this method by overriding&nbsp;<code class=\"eg abh abi abj abk b\">_aevaluate_string_pairs<\/code>.<\/li>\n\n\n\n<li><code class=\"eg abh abi abj abk b\">requires_input<\/code>: This property indicates whether this evaluator requires an input string.<\/li>\n\n\n\n<li><code class=\"eg abh abi abj abk b\">requires_reference<\/code>: This property specifies whether this evaluator requires a reference label.<\/li>\n<\/ul>\n\n\n\n<p class=\"pw-post-body-paragraph zk zl tg be b tu abc zn zo tx abd zq zr mq abe zt zu mv abf zw zx na abg zz aba abb ew bj wp-block-paragraph\" id=\"1b4e\">In summary, comparison evaluators compare two models or prompts by evaluating their outputs. They return a score quantifying the preference between the two outputs. 
You can customize them for your specific comparative analysis needs.<\/p>\n\n\n\n<div class=\"ab cm abu abv pk hb\" role=\"separator\"><\/div>\n\n\n\n<div class=\"ew tb tc td te\">\n<div class=\"ab cm\">\n<div class=\"hy bg hz ia ib ic\">\n<blockquote class=\"abz\"><p id=\"ef38\" class=\"aca acb tg be acc acd ace acf acg ach aci abb dq\" data-selectable-paragraph=\"\">Want to learn how to build modern software with LLMs using the newest tools and techniques in the field?&nbsp;<a class=\"af hd\" href=\"https:\/\/www.comet.com\/production\/site\/llm-course\/?utm_source=Heartbeat&amp;utm_medium=referral&amp;utm_content=Medium&amp;utm_campaign=Heartbeat_LangChain_Series_HS\" target=\"_blank\" rel=\"noopener ugc nofollow\">Check out this free LLMOps course<\/a>&nbsp;from industry expert Elvis Saravia of DAIR.AI.<\/p><\/blockquote>\n<\/div>\n<\/div>\n<\/div>\n\n\n\n<div class=\"ab cm abu abv pk hb\" role=\"separator\"><\/div>\n\n\n\n<div class=\"ew tb tc td te\">\n<div class=\"ab cm\">\n<div 
class=\"hy bg hz ia ib ic\">\n<h2 id=\"64ac\" class=\"yq yr tg be ys yt acj tw mk yv ack tz mp yx acl yz za zb acm zd ze zf acn zh zi zj bj\">Pairwise string comparison<\/h2>\n<p id=\"457d\" class=\"pw-post-body-paragraph zk zl tg be b tu zm zn zo tx zp zq zr mq zs zt zu mv zv zw zx na zy zz aba abb ew bj\" data-selectable-paragraph=\"\">Often, you will want to compare predictions of an LLM, Chain, or Agent for a given input.<\/p>\n<p id=\"9567\" class=\"pw-post-body-paragraph zk zl tg be b tu abc zn zo tx abd zq zr mq abe zt zu mv abf zw zx na abg zz aba abb ew bj\" data-selectable-paragraph=\"\">The&nbsp;<code class=\"eg abh abi abj abk b\">StringComparison<\/code>&nbsp;evaluators facilitate this so you can answer questions like:<\/p>\n<ul class=\"\">\n<li id=\"d9e4\" class=\"zk zl tg be b tu abc zn zo tx abd zq zr mq abe zt zu mv abf zw zx na abg zz aba abb aco acp acq bj\" data-selectable-paragraph=\"\">Which LLM or prompt produces a preferred output for a given question?<\/li>\n<li id=\"e39c\" class=\"zk zl tg be b tu acr zn zo tx acs zq zr mq act zt zu mv acu zw zx na acv zz aba abb aco acp acq bj\" data-selectable-paragraph=\"\">Which examples should I include for few-shot example selection?<\/li>\n<li id=\"6ac3\" class=\"zk zl tg be b tu acr zn zo tx acs zq zr mq act zt zu mv acu zw zx na acv zz aba abb aco acp acq bj\" data-selectable-paragraph=\"\">Which output is better to include for fine-tuning?<\/li>\n<\/ul>\n<p id=\"93e5\" class=\"pw-post-body-paragraph zk zl tg be b tu abc zn zo tx abd zq zr mq abe zt zu mv abf zw zx na abg zz aba abb ew bj\" data-selectable-paragraph=\"\">The simplest and often most reliable automated way to choose a preferred prediction for a given input is to use the&nbsp;<code class=\"eg abh abi abj abk b\">pairwise_string<\/code>&nbsp;evaluator. 
When a reference answer is available, use the&nbsp;<code class=\"eg abh abi abj abk b\">labeled_pairwise_string<\/code>&nbsp;variant, which takes the reference into account.<\/p>\n<pre class=\"yb yc yd ye yf abl abk abm bo abn ba bj\"><span id=\"1947\" class=\"abo yr tg abk b bf abp abq l abr abs\" data-selectable-paragraph=\"\"><span 
class=\"hljs-keyword\">from<\/span> langchain.evaluation <span class=\"hljs-keyword\">import<\/span> load_evaluator\n\nevaluator = load_evaluator(<span class=\"hljs-string\">\"labeled_pairwise_string\"<\/span>)<\/span><\/pre>\n<h2 id=\"c65c\" class=\"acx yr tg be ys mg acy mh mk ml acz mm mp mq ada mr mu mv adb mw mz na adc nb ne add bj\" data-selectable-paragraph=\"\">Method:&nbsp;<code class=\"eg abh abi abj abk b\">evaluate_string_pairs<\/code><\/h2>\n<p id=\"bcd6\" class=\"pw-post-body-paragraph zk zl tg be b tu zm zn zo tx zp zq zr mq zs zt zu mv zv zw zx na zy zz aba abb ew bj\" data-selectable-paragraph=\"\">The&nbsp;<code class=\"eg abh abi abj abk b\">evaluate_string_pairs<\/code>&nbsp;method of the&nbsp;<code class=\"eg abh abi abj abk b\">PairwiseStringEvalChain<\/code>&nbsp;class is designed to evaluate and compare two output strings (<code class=\"eg abh abi abj abk b\">prediction<\/code>&nbsp;and&nbsp;<code class=\"eg abh abi abj abk b\">prediction_b<\/code>) to determine which one is preferred based on certain criteria.<\/p>\n<h2 id=\"4265\" class=\"acx yr tg be ys mg acy mh mk ml acz mm mp mq ada mr mu mv adb mw mz na adc nb ne add bj\" data-selectable-paragraph=\"\">Parameters:<\/h2>\n<ul class=\"\">\n<li id=\"c706\" class=\"zk zl tg be b tu zm zn zo tx zp zq zr mq zs zt zu mv zv zw zx na zy zz aba abb aco acp acq bj\" data-selectable-paragraph=\"\"><strong class=\"be fx\">prediction<\/strong>: The output string from the first model.<\/li>\n<li id=\"3b8b\" class=\"zk zl tg be b tu acr zn zo tx acs zq zr mq act zt zu mv acu zw zx na acv zz aba abb aco acp acq bj\" data-selectable-paragraph=\"\"><strong class=\"be fx\">prediction_b<\/strong>: The output string from the second model.<\/li>\n<li id=\"4f19\" class=\"zk zl tg be b tu acr zn zo tx acs zq zr mq act zt zu mv acu zw zx na acv zz aba abb aco acp acq bj\" data-selectable-paragraph=\"\"><strong class=\"be fx\">input<\/strong>: (Optional) The input or task string that led to the 
predictions.<\/li>\n<li id=\"9688\" class=\"zk zl tg be b tu acr zn zo tx acs zq zr mq act zt zu mv acu zw zx na acv zz aba abb aco acp acq bj\" data-selectable-paragraph=\"\"><strong class=\"be fx\">reference<\/strong>: (Optional) A reference string for comparison.<\/li>\n<li id=\"c2d9\" class=\"zk zl tg be b tu acr zn zo tx acs zq zr mq act zt zu mv acu zw zx na acv zz aba abb aco acp acq bj\" data-selectable-paragraph=\"\"><strong class=\"be fx\">callbacks<\/strong>: (Optional) Callbacks to use during the evaluation.<\/li>\n<li id=\"8183\" class=\"zk zl tg be b tu acr zn zo tx acs zq zr mq act zt zu mv acu zw zx na acv zz aba abb aco acp acq bj\" data-selectable-paragraph=\"\"><strong class=\"be fx\">tags<\/strong>: (Optional) List of tags to associate with the evaluation.<\/li>\n<li id=\"6666\" class=\"zk zl tg be b tu acr zn zo tx acs zq zr mq act zt zu mv acu zw zx na acv zz aba abb aco acp acq bj\" data-selectable-paragraph=\"\"><strong class=\"be fx\">metadata<\/strong>: (Optional) Additional metadata to associate with the evaluation.<\/li>\n<li id=\"1f8e\" class=\"zk zl tg be b tu acr zn zo tx acs zq zr mq act zt zu mv acu zw zx na acv zz aba abb aco acp acq bj\" data-selectable-paragraph=\"\"><strong class=\"be fx\">include_run_info<\/strong>: (Optional) Boolean to decide whether to include run information in the result.<\/li>\n<li id=\"acb3\" class=\"zk zl tg be b tu acr zn zo tx acs zq zr mq act zt zu mv acu zw zx na acv zz aba abb aco acp acq bj\" data-selectable-paragraph=\"\"><strong class=\"be fx\">kwargs<\/strong>: Additional keyword arguments.<\/li>\n<\/ul>\n<h2 id=\"62ea\" class=\"acx yr tg be ys mg acy mh mk ml acz mm mp mq ada mr mu mv adb mw mz na adc nb ne add bj\" data-selectable-paragraph=\"\">Process:<\/h2>\n<ol class=\"\">\n<li id=\"9535\" class=\"zk zl tg be b tu zm zn zo tx zp zq zr mq zs zt zu mv zv zw zx na zy zz aba abb acw acp acq bj\" data-selectable-paragraph=\"\">The method starts by preparing the input using the&nbsp;<code 
class=\"eg abh abi abj abk b\">_prepare_input<\/code>&nbsp;method. This method organizes the input data (<code class=\"eg abh abi abj abk b\">prediction<\/code>,&nbsp;<code class=\"eg abh abi abj abk b\">prediction_b<\/code>,&nbsp;<code class=\"eg abh abi abj abk b\">input<\/code>, and&nbsp;<code class=\"eg abh abi abj abk b\">reference<\/code>) into a dictionary format suitable for evaluation.<\/li>\n<li id=\"e112\" class=\"zk zl tg be b tu acr zn zo tx acs zq zr mq act zt zu mv acu zw zx na acv zz aba abb acw acp acq bj\" data-selectable-paragraph=\"\">The method then evaluates the prepared input by running the chain: the evaluator LLM is prompted to compare the two predictions (<code class=\"eg abh abi abj abk b\">prediction<\/code>&nbsp;and&nbsp;<code class=\"eg abh abi abj abk b\">prediction_b<\/code>) against the criteria the evaluator was configured with.<\/li>\n<li id=\"cf41\" class=\"zk zl tg be b tu acr zn zo tx acs zq zr mq act zt zu mv acu zw zx na acv zz aba abb acw acp acq bj\" data-selectable-paragraph=\"\">After evaluation, the method prepares the output using the&nbsp;<code class=\"eg abh abi abj abk b\">_prepare_output<\/code>&nbsp;method. 
This method organizes the raw evaluation result into a more structured and readable format.<\/li>\n<\/ol>\n<h2 id=\"e7aa\" class=\"acx yr tg be ys mg acy mh mk ml acz mm mp mq ada mr mu mv adb mw mz na adc nb ne add bj\" data-selectable-paragraph=\"\">Returns:<\/h2>\n<p id=\"7977\" class=\"pw-post-body-paragraph zk zl tg be b tu zm zn zo tx zp zq zr mq zs zt zu mv zv zw zx na zy zz aba abb ew bj\" data-selectable-paragraph=\"\">The method returns a dictionary with the following keys:<\/p>\n<ul class=\"\">\n<li id=\"6dc5\" class=\"zk zl tg be b tu abc zn zo tx abd zq zr mq abe zt zu mv abf zw zx na abg zz aba abb aco acp acq bj\" data-selectable-paragraph=\"\"><strong class=\"be fx\">reasoning<\/strong>: Explains why one prediction is preferred over the other.<\/li>\n<li id=\"74db\" class=\"zk zl tg be b tu acr zn zo tx acs zq zr mq act zt zu mv acu zw zx na acv zz aba abb aco acp acq bj\" data-selectable-paragraph=\"\"><strong class=\"be fx\">value<\/strong>: Indicates the preferred prediction. It can be \u2018A\u2019 (for&nbsp;<code class=\"eg abh abi abj abk b\">prediction<\/code>), &#8216;B&#8217; (for&nbsp;<code class=\"eg abh abi abj abk b\">prediction_b<\/code>), or&nbsp;<code class=\"eg abh abi abj abk b\">None<\/code>&nbsp;if there&#8217;s no preference.<\/li>\n<li id=\"3761\" class=\"zk zl tg be b tu acr zn zo tx acs zq zr mq act zt zu mv acu zw zx na acv zz aba abb aco acp acq bj\" data-selectable-paragraph=\"\"><strong class=\"be fx\">score<\/strong>: A numerical score representing the preference. 
It\u2019s&nbsp;<code class=\"eg abh abi abj abk b\">1<\/code>&nbsp;for &#8216;A&#8217;,&nbsp;<code class=\"eg abh abi abj abk b\">0<\/code>&nbsp;for &#8216;B&#8217;, and&nbsp;<code class=\"eg abh abi abj abk b\">0.5<\/code>&nbsp;if there&#8217;s no preference.<\/li>\n<\/ul>\n<p id=\"6ee7\" class=\"pw-post-body-paragraph zk zl tg be b tu abc zn zo tx abd zq zr mq abe zt zu mv abf zw zx na abg zz aba abb ew bj\" data-selectable-paragraph=\"\">In essence, the&nbsp;<code class=\"eg abh abi abj abk b\">evaluate_string_pairs<\/code>&nbsp;method is a utility to compare two model outputs and determine which is better based on predefined criteria.<\/p>\n<h2 id=\"185b\" class=\"acx yr tg be ys mg acy mh mk ml acz mm mp mq ada mr mu mv adb mw mz na adc nb ne add bj\" data-selectable-paragraph=\"\">Note: You can customize the LLM by passing in a value for the&nbsp;<code class=\"eg abh abi abj abk b\">llm<\/code>&nbsp;argument. By default, it uses&nbsp;<code class=\"eg abh abi abj abk b\">GPT-4<\/code>.<\/h2>\n<pre class=\"yb yc yd ye yf abl abk abm bo abn ba bj\"><span id=\"4fd8\" class=\"abo yr tg abk b bf abp abq l abr abs\" data-selectable-paragraph=\"\">evaluator.evaluate_string_pairs(\n    prediction=<span class=\"hljs-string\">\"Sikhism was founded by Guru Nanak Dev Ji in the 15th century.\"<\/span>,\n    prediction_b=<span class=\"hljs-string\">\"Sikhism was established by a philosopher named Ravi in the 16th century.\"<\/span>,\n    <span class=\"hljs-built_in\">input<\/span>=<span class=\"hljs-string\">\"Who is the founder of Sikhism?\"<\/span>,\n    reference=<span class=\"hljs-string\">\"Sikhism was founded by Guru Nanak Dev Ji in the late 15th century.\"<\/span>,\n    verbose=<span class=\"hljs-literal\">True<\/span>\n)<\/span><\/pre>\n<pre class=\"abt abl abk abm bo abn ba bj\"><span id=\"b219\" class=\"abo yr tg abk b bf abp abq l abr abs\" data-selectable-paragraph=\"\">{'reasoning': \"Assistant A's response is more helpful, relevant, and correct. 
It accurately identifies Guru Nanak Dev Ji as the founder of Sikhism in the 15th century, which aligns with the reference answer provided. On the other hand, Assistant B's response is incorrect. It incorrectly identifies a philosopher named Ravi as the founder of Sikhism in the 16th century, which is not accurate according to the reference answer and historical facts. Therefore, Assistant A's response demonstrates a greater depth of thought and knowledge about the topic. \\n\\nFinal Verdict: [[A]]\",\n 'value': 'A',\n 'score': 1}<\/span><\/pre>\n<h2 id=\"bf0a\" class=\"acx yr tg be ys mg acy mh mk ml acz mm mp mq ada mr mu mv adb mw mz na adc nb ne add bj\" data-selectable-paragraph=\"\">Use without reference<\/h2>\n<p id=\"df7b\" class=\"pw-post-body-paragraph zk zl tg be b tu zm zn zo tx zp zq zr mq zs zt zu mv zv zw zx na zy zz aba abb ew bj\" data-selectable-paragraph=\"\">When references aren\u2019t available, you can still predict the preferred response.<\/p>\n<p id=\"62c1\" class=\"pw-post-body-paragraph zk zl tg be b tu abc zn zo tx abd zq zr mq abe zt zu mv abf zw zx na abg zz aba abb ew bj\" data-selectable-paragraph=\"\">The results will reflect the evaluation model\u2019s preference, which is less reliable and may result in preferences that are factually incorrect.<\/p>\n<pre class=\"yb yc yd ye yf abl abk abm bo abn ba bj\"><span id=\"73b7\" class=\"abo yr tg abk b bf abp abq l abr abs\" data-selectable-paragraph=\"\"><span class=\"hljs-keyword\">from<\/span> langchain.evaluation <span class=\"hljs-keyword\">import<\/span> load_evaluator\n\nevaluator = load_evaluator(<span class=\"hljs-string\">\"pairwise_string\"<\/span>)\n\nevaluator.evaluate_string_pairs(\n    prediction=<span class=\"hljs-string\">\"Stars are primarily made of hydrogen.\"<\/span>,\n    prediction_b=<span class=\"hljs-string\">\"Stars are primarily composed of hydrogen, which undergoes nuclear fusion to produce helium, releasing energy in the process.\"<\/span>,\n    <span 
class=\"hljs-built_in\">input<\/span>=<span class=\"hljs-string\">\"What is the primary component of a star?\"<\/span>,\n    verbose=<span class=\"hljs-literal\">True<\/span>\n)<\/span><\/pre>\n<pre class=\"abt abl abk abm bo abn ba bj\"><span id=\"f53d\" class=\"abo yr tg abk b bf abp abq l abr abs\" data-selectable-paragraph=\"\">{'reasoning': \"Both Assistant A and Assistant B provided correct and relevant answers to the user's question. However, Assistant B's response was more detailed and insightful, explaining not only that stars are primarily composed of hydrogen, but also how this hydrogen undergoes nuclear fusion to produce helium, releasing energy in the process. This additional information demonstrates a greater depth of thought and understanding of the topic. Therefore, Assistant B's response is superior based on the evaluation criteria. \\n\\nFinal Verdict: [[B]]\",\n 'value': 'B',\n 'score': 0}<\/span><\/pre>\n<h2 id=\"1e88\" class=\"yq yr tg be ys yt yu tw mk yv yw tz mp yx yy yz za zb zc zd ze zf zg zh zi zj bj\">Defining the Criteria<\/h2>\n<p id=\"2e16\" class=\"pw-post-body-paragraph zk zl tg be b tu zm zn zo tx zp zq zr mq zs zt zu mv zv zw zx na zy zz aba abb ew bj\" data-selectable-paragraph=\"\">By default, the LLM is instructed to select the \u2018preferred\u2019 response based on helpfulness, relevance, correctness, and depth of thought.<\/p>\n<p id=\"6500\" class=\"pw-post-body-paragraph zk zl tg be b tu abc zn zo tx abd zq zr mq abe zt zu mv abf zw zx na abg zz aba abb ew bj\" data-selectable-paragraph=\"\">You can customize the criteria by passing in a criteria argument, where the criteria could take any of the following forms:<\/p>\n<ul class=\"\">\n<li id=\"243a\" class=\"zk zl tg be b tu abc zn zo tx abd zq zr mq abe zt zu mv abf zw zx na abg zz aba abb aco acp acq bj\" data-selectable-paragraph=\"\">Criteria or its string value \u2014 to use one of the default criteria and their descriptions. 
The following are available out of the box:&nbsp;<code class=\"eg abh abi abj abk b\">conciseness<\/code>,&nbsp;<code class=\"eg abh abi abj abk b\">relevance<\/code>,&nbsp;<code class=\"eg abh abi abj abk b\">correctness<\/code>,&nbsp;<code class=\"eg abh abi abj abk b\">coherence<\/code>,&nbsp;<code class=\"eg abh abi abj abk b\">harmfulness<\/code>,&nbsp;<code class=\"eg abh abi abj abk b\">maliciousness<\/code>,&nbsp;<code class=\"eg abh abi abj abk b\">helpfulness<\/code>,&nbsp;<code class=\"eg abh abi abj abk b\">controversiality<\/code>,&nbsp;<code class=\"eg abh abi abj abk b\">misogyny<\/code>,&nbsp;<code class=\"eg abh abi abj abk b\">criminality<\/code>,&nbsp;<code class=\"eg abh abi abj abk b\">insensitivity<\/code>,&nbsp;<code class=\"eg abh abi abj abk b\">depth<\/code>,&nbsp;<code class=\"eg abh abi abj abk b\">creativity<\/code>,&nbsp;<code class=\"eg abh abi abj abk b\">detail<\/code>.<\/li>\n<li id=\"1e95\" class=\"zk zl tg be b tu acr zn zo tx acs zq zr mq act zt zu mv acu zw zx na acv zz aba abb aco acp acq bj\" data-selectable-paragraph=\"\">Constitutional principle \u2014 use any of the constitutional principles defined in LangChain.<\/li>\n<li id=\"ffcb\" class=\"zk zl tg be b tu acr zn zo tx acs zq zr mq act zt zu mv acu zw zx na acv zz aba abb aco acp acq bj\" data-selectable-paragraph=\"\">Dictionary: a mapping of custom criteria, where each key is the name of a criterion and each value is its description.<\/li>\n<li id=\"13e1\" class=\"zk zl tg be b tu acr zn zo tx acs zq zr mq act zt zu mv acu zw zx na acv zz aba abb aco acp acq bj\" data-selectable-paragraph=\"\">A list of criteria or constitutional principles \u2014 to combine multiple criteria in one.<\/li>\n<\/ul>\n<p id=\"c309\" class=\"pw-post-body-paragraph zk zl tg be b tu abc zn zo tx abd zq zr mq abe zt zu mv abf zw zx na abg zz aba abb ew bj\" data-selectable-paragraph=\"\">Here\u2019s an example of determining the scientific rigour and quality of a given 
text:<\/p>\n<pre class=\"yb yc yd ye yf abl abk abm bo abn ba bj\"><span id=\"56ae\" class=\"abo yr tg abk b bf abp abq l abr abs\" data-selectable-paragraph=\"\">scientific_criteria = {\n    <span class=\"hljs-string\">\"accuracy\"<\/span>: <span class=\"hljs-string\">\"Is the information presented accurate based on known scientific knowledge?\"<\/span>,\n    <span class=\"hljs-string\">\"comprehensiveness\"<\/span>: <span class=\"hljs-string\">\"Does the text cover the topic in a thorough manner, addressing all relevant aspects?\"<\/span>,\n    <span class=\"hljs-string\">\"referencing\"<\/span>: <span class=\"hljs-string\">\"Are claims and statements backed up with appropriate citations or sources?\"<\/span>,\n    <span class=\"hljs-string\">\"objectivity\"<\/span>: <span class=\"hljs-string\">\"Is the writing unbiased and free from personal opinions or beliefs?\"<\/span>,\n    <span class=\"hljs-string\">\"terminology\"<\/span>: <span class=\"hljs-string\">\"Does the text use correct and appropriate scientific terms and language?\"<\/span>,\n    <span class=\"hljs-string\">\"methodology\"<\/span>: <span class=\"hljs-string\">\"If applicable, is the scientific method or approach described in a clear and rigorous manner?\"<\/span>,\n    <span class=\"hljs-string\">\"relevance\"<\/span>: <span class=\"hljs-string\">\"Is the information presented relevant to the current state of the field or topic?\"<\/span>,\n    <span class=\"hljs-string\">\"innovation\"<\/span>: <span class=\"hljs-string\">\"Does the text introduce new concepts, theories, or methodologies?\"<\/span>,\n}\n\nevaluator = load_evaluator(<span class=\"hljs-string\">\"pairwise_string\"<\/span>, criteria=scientific_criteria)\n\nevaluator.evaluate_string_pairs(\n    prediction=<span class=\"hljs-string\">\"The theory of relativity, proposed by Einstein, suggests that time and space are relative and all the motion must be relative to a frame of reference.\"<\/span>,\n    prediction_b=<span 
class=\"hljs-string\">\"Einstein's relativity idea posits that if you travel super fast, like near the speed of light, time slows down relative to others who are stationary.\"<\/span>,\n    <span class=\"hljs-built_in\">input<\/span>=<span class=\"hljs-string\">\"Explain the theory of relativity in a sentence.\"<\/span>,\n)<\/span><\/pre>\n<pre class=\"abt abl abk abm bo abn ba bj\"><span id=\"7692\" class=\"abo yr tg abk b bf abp abq l abr abs\" data-selectable-paragraph=\"\">{'reasoning': \"Both Assistant A and Assistant B provided accurate and relevant responses to the user's question. However, Assistant A's response is more comprehensive as it covers both the aspects of relativity - time and space, and the concept of motion relative to a frame of reference. On the other hand, Assistant B's response focuses only on the time aspect of relativity and does not mention the space aspect or the concept of relative motion. Both responses use appropriate scientific terminology and are objective, without any personal opinions or beliefs. Neither response introduces new concepts, theories, or methodologies, which is appropriate given the user's request for a one-sentence explanation. Neither assistant provided references, but this is not expected in a one-sentence explanation. Therefore, based on the criteria provided, Assistant A's response is superior. 
\\n\\nFinal Verdict: [[A]]\",\n 'value': 'A',\n 'score': 1}<\/span><\/pre>\n<h2 id=\"8274\" class=\"yq yr tg be ys yt yu tw mk yv yw tz mp yx yy yz za zb zc zd ze zf zg zh zi zj bj\">Customize the Evaluation Prompt<\/h2>\n<p id=\"28ba\" class=\"pw-post-body-paragraph zk zl tg be b tu zm zn zo tx zp zq zr mq zs zt zu mv zv zw zx na zy zz aba abb ew bj\" data-selectable-paragraph=\"\">You can use your custom evaluation prompt to add task-specific instructions or instruct the evaluator to score the output.<\/p>\n<p id=\"08aa\" class=\"pw-post-body-paragraph zk zl tg be b tu abc zn zo tx abd zq zr mq abe zt zu mv abf zw zx na abg zz aba abb ew bj\" data-selectable-paragraph=\"\"><strong class=\"be fx\">Note<\/strong>: If you use a prompt that expects to generate a result in a unique format, you may also have to pass in a custom output parser (<code class=\"eg abh abi abj abk b\">output_parser=your_parser()<\/code>) instead of the default&nbsp;<code class=\"eg abh abi abj abk b\">PairwiseStringResultOutputParser<\/code>.<\/p>\n<pre class=\"yb yc yd ye yf abl abk abm bo abn ba bj\"><span id=\"f261\" class=\"abo yr tg abk b bf abp abq l abr abs\" data-selectable-paragraph=\"\"><span class=\"hljs-keyword\">from<\/span> langchain.prompts <span class=\"hljs-keyword\">import<\/span> PromptTemplate\n\nprompt_template = PromptTemplate.from_template(\n    <span class=\"hljs-string\">\"\"\"\n**Task**: Compare the two responses, A and B, based on the provided criteria.\nProvide a step-by-step reasoning for your preference and conclude with either [[A]] or [[B]] on a separate line.\nEnsure your evaluation is objective and based solely on the given criteria.\n\n**Criteria**:\n{criteria}\n\n**Data**:\n- **Input Context**: {input}\n- **Reference Answer**: {reference}\n- **Response A**: {prediction}\n- **Response B**: {prediction_b}\n\n**Begin Reasoning Below**:\n\n\"\"\"<\/span>\n)\nevaluator = load_evaluator(\n    <span class=\"hljs-string\">\"labeled_pairwise_string\"<\/span>, 
prompt=prompt_template\n)\n\n<span class=\"hljs-comment\"># The prompt was assigned to the evaluator<\/span>\n<span class=\"hljs-built_in\">print<\/span>(evaluator.prompt)\n\n<\/span><\/pre>\n<pre class=\"abt abl abk abm bo abn ba bj\"><span id=\"c542\" class=\"abo yr tg abk b bf abp abq l abr abs\" data-selectable-paragraph=\"\">input_variables=['input', 'prediction_b', 'reference', 'prediction'] partial_variables={'criteria': 'For this evaluation, you should primarily consider the following criteria:\\nhelpfulness: Is the submission helpful, insightful, and appropriate?\\nrelevance: Is the submission referring to a real quote from the text?\\ncorrectness: Is the submission correct, accurate, and factual?\\ndepth: Does the submission demonstrate depth of thought?'} template='\\n**Task**: Compare the two responses, A and B, based on the provided criteria. \\nProvide a step-by-step reasoning for your preference and conclude with either [[A]] or [[B]] on a separate line. \\nEnsure your evaluation is objective and based solely on the given criteria.\\n\\n**Criteria**:\\n{criteria}\\n\\n**Data**:\\n- **Input Context**: {input}\\n- **Reference Answer**: {reference}\\n- **Response A**: {prediction}\\n- **Response B**: {prediction_b}\\n\\n**Begin Reasoning Below**:\\n\\n'<\/span><\/pre>\n<pre class=\"abt abl abk abm bo abn ba bj\"><span id=\"96d9\" class=\"abo yr tg abk b bf abp abq l abr abs\" data-selectable-paragraph=\"\">evaluator.evaluate_string_pairs(\n    prediction=<span class=\"hljs-string\">\"The primary gas in Earth's atmosphere is carbon dioxide.\"<\/span>,\n    prediction_b=<span class=\"hljs-string\">\"Earth's atmosphere is primarily composed of nitrogen.\"<\/span>,\n    <span class=\"hljs-built_in\">input<\/span>=<span class=\"hljs-string\">\"What is the primary gas in Earth's atmosphere?\"<\/span>,\n    reference=<span class=\"hljs-string\">\"The primary gas in Earth's atmosphere is nitrogen.\"<\/span>,\n)<\/span><\/pre>\n<pre class=\"abt abl abk abm bo 
abn ba bj\"><span id=\"beea\" class=\"abo yr tg abk b bf abp abq l abr abs\" data-selectable-paragraph=\"\">{<span class=\"hljs-string\">'reasoning'<\/span>: <span class=\"hljs-string\">\"Helpfulness: Both responses attempt to answer the question, but Response B is more helpful because it provides the correct answer.\\n\\nRelevance: Both responses are relevant to the input context as they both refer to the primary gas in Earth's atmosphere.\\n\\nCorrectness: Response A is incorrect because the primary gas in Earth's atmosphere is not carbon dioxide, it's nitrogen. Response B is correct.\\n\\nDepth: Neither response demonstrates a significant depth of thought, as they both provide straightforward answers to the question. However, Response B is more accurate.\\n\\nBased on these criteria, Response B is the better response.\\n\\n[[B]]\"<\/span>,\n <span class=\"hljs-string\">'value'<\/span>: <span class=\"hljs-string\">'B'<\/span>,\n <span class=\"hljs-string\">'score'<\/span>: <span class=\"hljs-number\">0<\/span>}<\/span><\/pre>\n<p id=\"5b16\" class=\"pw-post-body-paragraph zk zl tg be b tu abc zn zo tx abd zq zr mq abe zt zu mv abf zw zx na abg zz aba abb ew bj\" data-selectable-paragraph=\"\">In conclusion, LangChain\u2019s comparison evaluators offer a robust and versatile toolset for assessing and contrasting the outputs of different chains or LLMs.<\/p>\n<p id=\"3b27\" class=\"pw-post-body-paragraph zk zl tg be b tu abc zn zo tx abd zq zr mq abe zt zu mv abf zw zx na abg zz aba abb ew bj\" data-selectable-paragraph=\"\">They are well suited to A\/B testing, comparing model versions, and generating preference scores for AI-assisted reinforcement learning. Built on the foundational PairwiseStringEvaluator class, these evaluators return a preference, a score, and the reasoning behind them for each pair of strings, which makes them useful to developers and researchers alike. 
The flexibility to craft custom evaluators, define unique evaluation criteria, and modify evaluation prompts ensures that users can tailor evaluations to specific needs.<\/p>\n<p id=\"2375\" class=\"pw-post-body-paragraph zk zl tg be b tu abc zn zo tx abd zq zr mq abe zt zu mv abf zw zx na abg zz aba abb ew bj\" data-selectable-paragraph=\"\">As LLMs evolve and integrate into various applications, such evaluators will be crucial in ensuring the optimal performance, accuracy, and utility of language models and their outputs.<\/p>\n<\/div>\n<\/div>\n<\/div>\n","protected":false},"excerpt":{"rendered":"<p>Mastering Pairwise Assessments for Optimized Language Model Outputs Introduction In LangChain, comparison evaluators are designed to measure and compare outputs from two different chains or LLMs. These tools are invaluable for A\/B testing between models or analyzing distinct versions. Moreover, they can be employed to generate preference scores for AI-assisted reinforcement learning. At their core, [&hellip;]<\/p>\n","protected":false},"author":68,"featured_media":0,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"customer_name":"","customer_description":"","customer_industry":"","customer_technologies":"","customer_logo":"","_jetpack_memberships_contains_paid_content":false,"footnotes":""},"categories":[65,7],"tags":[70,71,52,31,34],"coauthors":[166],"class_list":["post-8183","post","type-post","status-publish","format-standard","hentry","category-llmops","category-tutorials","tag-langchain","tag-language-models","tag-llm","tag-llmops","tag-prompt-engineering"],"yoast_head":"<!-- This site is optimized with the Yoast SEO Premium plugin v25.9 (Yoast SEO v25.9) - https:\/\/yoast.com\/wordpress\/plugins\/seo\/ -->\n<title>Diving Deep into LangChain\u2019s Comparison Evaluators - Comet<\/title>\n<meta name=\"description\" content=\"LangChain comparison evaluators measure + compare outputs from 2 different chains 
or LLMs + are invaluable for A\/B testing + AI-assisted RL\" \/>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/www.comet.com\/site\/blog\/diving-deep-into-langchains-comparison-evaluators\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"Diving Deep into LangChain\u2019s Comparison Evaluators\" \/>\n<meta property=\"og:description\" content=\"LangChain comparison evaluators measure + compare outputs from 2 different chains or LLMs + are invaluable for A\/B testing + AI-assisted RL\" \/>\n<meta property=\"og:url\" content=\"https:\/\/www.comet.com\/site\/blog\/diving-deep-into-langchains-comparison-evaluators\/\" \/>\n<meta property=\"og:site_name\" content=\"Comet\" \/>\n<meta property=\"article:publisher\" content=\"https:\/\/www.facebook.com\/cometdotml\" \/>\n<meta property=\"article:published_time\" content=\"2023-11-23T01:19:46+00:00\" \/>\n<meta property=\"article:modified_time\" content=\"2025-04-24T17:04:23+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/miro.medium.com\/v2\/resize:fit:1400\/0*jKHnRyHfDmYvpR9_\" \/>\n<meta name=\"author\" content=\"Harpreet Sahota\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:creator\" content=\"@Cometml\" \/>\n<meta name=\"twitter:site\" content=\"@Cometml\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"Harpreet Sahota\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"10 minutes\" \/>\n<!-- \/ Yoast SEO Premium plugin. 
-->","yoast_head_json":{"title":"Diving Deep into LangChain\u2019s Comparison Evaluators - Comet","description":"LangChain comparison evaluators measure + compare outputs from 2 different chains or LLMs + are invaluable for A\/B testing + AI-assisted RL","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/www.comet.com\/site\/blog\/diving-deep-into-langchains-comparison-evaluators\/","og_locale":"en_US","og_type":"article","og_title":"Diving Deep into LangChain\u2019s Comparison Evaluators","og_description":"LangChain comparison evaluators measure + compare outputs from 2 different chains or LLMs + are invaluable for A\/B testing + AI-assisted RL","og_url":"https:\/\/www.comet.com\/site\/blog\/diving-deep-into-langchains-comparison-evaluators\/","og_site_name":"Comet","article_publisher":"https:\/\/www.facebook.com\/cometdotml","article_published_time":"2023-11-23T01:19:46+00:00","article_modified_time":"2025-04-24T17:04:23+00:00","og_image":[{"url":"https:\/\/miro.medium.com\/v2\/resize:fit:1400\/0*jKHnRyHfDmYvpR9_","type":"","width":"","height":""}],"author":"Harpreet Sahota","twitter_card":"summary_large_image","twitter_creator":"@Cometml","twitter_site":"@Cometml","twitter_misc":{"Written by":"Harpreet Sahota","Est. 
reading time":"10 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/www.comet.com\/site\/blog\/diving-deep-into-langchains-comparison-evaluators\/#article","isPartOf":{"@id":"https:\/\/www.comet.com\/site\/blog\/diving-deep-into-langchains-comparison-evaluators\/"},"author":{"name":"Harpreet Sahota","@id":"https:\/\/www.comet.com\/site\/#\/schema\/person\/46036ab474aa916e2873daece26a28d6"},"headline":"Diving Deep into LangChain\u2019s Comparison Evaluators","datePublished":"2023-11-23T01:19:46+00:00","dateModified":"2025-04-24T17:04:23+00:00","mainEntityOfPage":{"@id":"https:\/\/www.comet.com\/site\/blog\/diving-deep-into-langchains-comparison-evaluators\/"},"wordCount":1314,"publisher":{"@id":"https:\/\/www.comet.com\/site\/#organization"},"image":{"@id":"https:\/\/www.comet.com\/site\/blog\/diving-deep-into-langchains-comparison-evaluators\/#primaryimage"},"thumbnailUrl":"https:\/\/miro.medium.com\/v2\/resize:fit:1400\/0*jKHnRyHfDmYvpR9_","keywords":["LangChain","Language Models","LLM","LLMOps","Prompt Engineering"],"articleSection":["LLMOps","Tutorials"],"inLanguage":"en-US"},{"@type":"WebPage","@id":"https:\/\/www.comet.com\/site\/blog\/diving-deep-into-langchains-comparison-evaluators\/","url":"https:\/\/www.comet.com\/site\/blog\/diving-deep-into-langchains-comparison-evaluators\/","name":"Diving Deep into LangChain\u2019s Comparison Evaluators - Comet","isPartOf":{"@id":"https:\/\/www.comet.com\/site\/#website"},"primaryImageOfPage":{"@id":"https:\/\/www.comet.com\/site\/blog\/diving-deep-into-langchains-comparison-evaluators\/#primaryimage"},"image":{"@id":"https:\/\/www.comet.com\/site\/blog\/diving-deep-into-langchains-comparison-evaluators\/#primaryimage"},"thumbnailUrl":"https:\/\/miro.medium.com\/v2\/resize:fit:1400\/0*jKHnRyHfDmYvpR9_","datePublished":"2023-11-23T01:19:46+00:00","dateModified":"2025-04-24T17:04:23+00:00","description":"LangChain comparison evaluators measure + compare outputs from 2 
different chains or LLMs + are invaluable for A\/B testing + AI-assisted RL","breadcrumb":{"@id":"https:\/\/www.comet.com\/site\/blog\/diving-deep-into-langchains-comparison-evaluators\/#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/www.comet.com\/site\/blog\/diving-deep-into-langchains-comparison-evaluators\/"]}]},{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/www.comet.com\/site\/blog\/diving-deep-into-langchains-comparison-evaluators\/#primaryimage","url":"https:\/\/miro.medium.com\/v2\/resize:fit:1400\/0*jKHnRyHfDmYvpR9_","contentUrl":"https:\/\/miro.medium.com\/v2\/resize:fit:1400\/0*jKHnRyHfDmYvpR9_"},{"@type":"BreadcrumbList","@id":"https:\/\/www.comet.com\/site\/blog\/diving-deep-into-langchains-comparison-evaluators\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/www.comet.com\/site\/"},{"@type":"ListItem","position":2,"name":"Diving Deep into LangChain\u2019s Comparison Evaluators"}]},{"@type":"WebSite","@id":"https:\/\/www.comet.com\/site\/#website","url":"https:\/\/www.comet.com\/site\/","name":"Comet","description":"Build Better Models Faster","publisher":{"@id":"https:\/\/www.comet.com\/site\/#organization"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/www.comet.com\/site\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"en-US"},{"@type":"Organization","@id":"https:\/\/www.comet.com\/site\/#organization","name":"Comet ML, 
Inc.","alternateName":"Comet","url":"https:\/\/www.comet.com\/site\/","logo":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/www.comet.com\/site\/#\/schema\/logo\/image\/","url":"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2025\/01\/logo_comet_square.png","contentUrl":"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2025\/01\/logo_comet_square.png","width":310,"height":310,"caption":"Comet ML, Inc."},"image":{"@id":"https:\/\/www.comet.com\/site\/#\/schema\/logo\/image\/"},"sameAs":["https:\/\/www.facebook.com\/cometdotml","https:\/\/x.com\/Cometml","https:\/\/www.youtube.com\/channel\/UCmN63HKvfXSCS-UwVwmK8Hw"]},{"@type":"Person","@id":"https:\/\/www.comet.com\/site\/#\/schema\/person\/46036ab474aa916e2873daece26a28d6","name":"Harpreet Sahota","image":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/www.comet.com\/site\/#\/schema\/person\/image\/2d21512be19ba7e19a71a803309e2a88","url":"https:\/\/secure.gravatar.com\/avatar\/a6ca5a533fc9f143a0a7428037ff652aa0633d66bf27e76ae89b955ae72a0f2d?s=96&d=mm&r=g","contentUrl":"https:\/\/secure.gravatar.com\/avatar\/a6ca5a533fc9f143a0a7428037ff652aa0633d66bf27e76ae89b955ae72a0f2d?s=96&d=mm&r=g","caption":"Harpreet 
Sahota"},"url":"https:\/\/www.comet.com\/site\/blog\/author\/theartistsofdatasciencegmail-com\/"}]}},"jetpack_featured_media_url":"","jetpack_sharing_enabled":true,"_links":{"self":[{"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/posts\/8183","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/users\/68"}],"replies":[{"embeddable":true,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/comments?post=8183"}],"version-history":[{"count":1,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/posts\/8183\/revisions"}],"predecessor-version":[{"id":15447,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/posts\/8183\/revisions\/15447"}],"wp:attachment":[{"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/media?parent=8183"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/categories?post=8183"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/tags?post=8183"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/coauthors?post=8183"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}