Mike McMeekin


Savvy Search Marketer Series: Ad Copy Testing Tips, Part III

In the first part of the Savvy Search Marketer series on Ad Copy Testing Tips, we focused primarily on the different metrics you use, or could use, to help manage and evaluate your ad testing.  In the second part, we focused more on the ad testing experiment itself and some tips and tricks to avoid being "Fooled by Randomness" (great book, by the way, by one of my favorite authors, Nassim Taleb).

For this, the third and final installment in this series on Ad Copy Testing Tips, we'll continue with the examples presented in the previous posts, so if you haven't read them yet, please do so now to ensure you have the context needed to follow along in this wrap-up.

We’ll start off with a segue on sample sizes – let’s take the first example from Part 2 in this series and add slightly more complexity by looking at the Ad-KW combinations to see if we can determine if this could be the driver of the CTR differences between ads.


  • First off, notice the large differences in impression distribution for the same keywords against the two ads.  The Even Ad Rotation only evenly distributes at the ad group level, and in fact, only evenly distributes the ads in the auction.  This means advertisers can’t control the distribution of an ad to each keyword (or against any other variable).
  • When you compare the individual keywords for each ad (i.e. KW1 Ad1 vs Ad2), we have statistical significance for each KW – but recall we did not have significance at the ad group level.  If we were to evenly distribute impressions by ad and keyword, we would have had ad group level significance as well (2.45% vs 3.58% CTR).  This demonstrates the potential for keyword/ad group organization to confound results that otherwise would have been significant and actionable.
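
A minimal sketch of this confounding effect, using made-up numbers (the impression and click counts below are hypothetical, not the figures from the example above). Each ad's ad-group CTR is computed twice: once from the impressions it actually received, and once reweighted as if every keyword had contributed an equal share of impressions to each ad.

```python
# Hypothetical (impressions, clicks) per keyword for two ads in one ad group.
# These numbers are illustrative only; they are not the data from the post.
stats = {
    "Ad1": {"KW1": (8000, 320), "KW2": (2000, 20)},
    "Ad2": {"KW1": (2000, 100), "KW2": (8000, 120)},
}

def raw_ctr(per_kw):
    """Ad-group CTR from the impressions the ad actually received."""
    impressions = sum(i for i, _ in per_kw.values())
    clicks = sum(c for _, c in per_kw.values())
    return clicks / impressions

def even_mix_ctr(per_kw):
    """Ad-group CTR if every keyword supplied an equal share of impressions."""
    kw_ctrs = [c / i for i, c in per_kw.values()]
    return sum(kw_ctrs) / len(kw_ctrs)

for ad, per_kw in stats.items():
    print(f"{ad}: raw CTR {raw_ctr(per_kw):.2%}, "
          f"even-mix CTR {even_mix_ctr(per_kw):.2%}")
```

In this deliberately exaggerated case, Ad2 has the higher CTR on both keywords, yet its raw ad-group CTR looks worse because most of its impressions came from the lower-CTR keyword.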


Takeaway: Many variables are mostly, if not entirely, out of the control of the advertiser when doing a test and need to at least be considered as potentially impacting the validity of any test (including those that deliver insignificant results). 

  • Luckily, keyword impression distribution is one variable that advertisers can control to a certain extent by effectively organizing keywords into ad groups.  An ad test can thus be just as effective in helping identify keywords that should be organized together.  When the same ad has similar results for the same keyword (i.e. no statistical significance), this can be a signal to continue grouping those keywords together. Conversely, those with different performance should be considered for restructuring.
  • Some of the other variables, many of which are visible through performance reporting, include:
    • ML vs SB Placement – the extent to which ads are distributed unevenly between the ML and SB has a significant impact on results – especially when the metric being evaluated is CTR.  Since the ML has CTRs 10X-50X higher than the sidebar, it only takes a slightly uneven distribution to create very different CTR results.  Keep bid changes to a minimum during the test (or eliminate them altogether) to try to control for this.
    • Match Types – the more an ad group is exposed to Broad Match, the more likely it is that variances in query distribution are contributing to any CTR difference.  Consider only ad testing with Exact Match and then extrapolating those learnings, within reason, to other match types.
    • User location, demographics, etc – given the variation in user behavior and the personalization algorithms that show ads differently to different types of users, the distribution of impressions across different users can add uncertainty to ad tests.  Consider randomly choosing different geographic or demographic targets to test with to attempt to control for this.
    • Marketplace differences – at any given time there are dozens of experiments occurring in the Bing Ads marketplace.  These experiments can affect ranking and placement and thus impact keyword/ad combinations differently.  Again, the smaller the sample size the more likely the distribution of impressions against these sets of traffic can invalidate the results of a test.
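
To see how little placement skew it takes, here is a rough sketch with assumed placement-level CTRs (5% in the Main Line and 0.25% in the Side Bar, a 20X gap picked from within the range quoted above; neither figure is Bing Ads data). Two ads with identical CTRs in each placement are given slightly different Main Line shares.

```python
# Assumed placement-level CTRs (hypothetical values, 20X apart).
ML_CTR = 0.05     # Main Line
SB_CTR = 0.0025   # Side Bar

def blended_ctr(ml_share):
    """Overall CTR for an ad that gets `ml_share` of its impressions in the Main Line."""
    return ml_share * ML_CTR + (1 - ml_share) * SB_CTR

ad1 = blended_ctr(0.30)  # 30% of impressions served in the Main Line
ad2 = blended_ctr(0.35)  # 35% of impressions served in the Main Line

print(f"Ad 1 CTR: {ad1:.3%}")
print(f"Ad 2 CTR: {ad2:.3%}")
print(f"Relative CTR gap from placement alone: {ad2 / ad1 - 1:.1%}")
```

Under these assumptions, a five-point difference in Main Line share produces a relative CTR gap of roughly 14% between two ads that perform identically within each placement.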

On a regular basis, Bing Ads provides guidance on what types of ads to write depending on what type of advertiser you are or keywords you bid on.  Over the last year, we have done a number of in-depth analyses that control for many of the variables mentioned above, in order to identify those ad copy elements (i.e. phrases, words, symbols, etc) that can help any advertiser improve their CTR and Quality Score.

As you can see in this article, there is substantial opportunity for advertisers in the Travel space to make improvements in their ad copy without needing to test and test and test – or risk getting the wrong signal from their tests that may be invalidated by the variables mentioned above.

In this example for travel, it bears mentioning that over 80% of the ads in the travel space are using the ad copy elements that are highlighted in the article, yet they can produce, when used as guided, an average of ~110% increases in Ad Quality.

In the last part of our series, we’ll dive deeper into how to use Bing Ads reporting to gather the data necessary to assess the impact of some of the variables mentioned in this article as well as provide more examples of Vertical specific ad copy opportunities for advertisers in the Bing Ads marketplace.

Have any thoughts or comments about this blog or any other thoughts about ad testing challenges and opportunities? Comment below and we can discuss.

Thanks for Reading!

Mike McMeekin and Vivian Li

Advertiser Insights and Analytics, Bing Ads






Savvy Search Marketer Series: Ad Copy Testing Tips, Part II

In the first part of the Savvy Search Marketer series, we focused primarily on the different metrics advertisers use, or could use, to help manage and evaluate their ad testing.  In this, the second part of our series, we are going to focus more on the ad testing experiment itself and some tips and tricks to avoid being “Fooled by Randomness” (great book, by the way, by one of my favorite authors, Nassim Taleb).  

This post will cover some of the basics around gathering a large enough sample set, what confounding variables may exist that are difficult (or impossible!) to control for, but should be considered, and guidance on how to reduce the cost of experimentation with data regarding what to test.

While the tools provided by Bing Ads and other advertising platforms make the operations of an ad test pretty easy, it would be a mistake to not take the time to plan and set up your test to ensure the results you get are useful and actionable.  You should be asking yourself questions like…

  • How many ad groups will you test with? 
  • How many ads will you test simultaneously? 
  • Will you test against exact match only or on all match types? 
  • How many different keywords are in the ad groups you want to test? 
  • Will you test in a nationally targeted campaign or against your geo-targeted campaigns as well? 
  • Do you have exit criteria to determine the success or failure of your test ads or will you just be waiting a pre-determined amount of time? 
  • How many new variables will you test in your ad creative? 

These and many more will help you define goals before embarking on an ad test of any scale.  In the remainder of this article, we’ll discuss how the answers to these could impact the efficiency and efficacy of your ad testing approach.

Ensuring You Have the Right Sample Size For Your Test

Even those of us who aren’t formally trained statisticians know the importance of gathering enough data to make an informed decision.  The great thing is there are tons of free tools online that can help you ensure you have collected enough data – here is a great, simple tool for comparing CTRs or conversion rates.

As we in Bing Ads observe advertiser behavior, we see that most will add 2 or more ads to each ad group in which they want to test, in many cases pushing the total number of ads to 4 or more.  The total number of ads that you will have live in an ad group during an ad test is a crucial decision that should be considered carefully, as the more ads you have live, the more data you will have to collect before you can be confident in your results.

For instance, let’s say you had the following scenario over a 1 week period in which at 100% share of voice (SOV), an ad group could max out at 10,000 impressions:

  • At a 95% confidence level, you would not have enough impressions at this difference in CTR (.5%) to say that Ad 2 is actually better than Ad 1.  In fact, you’d need somewhere around 8,200 impressions for each Ad at this CTR difference to get statistically significant results that would tell you that Ad 2 is better for any reason other than randomness.
  • This # of impressions could be sufficient if the CTR difference was great enough – for instance if Ad 2 had a CTR of 3.15%, this would be sufficient.
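
The significance check described above can be sketched with a standard two-proportion z-test. The scenario table is not reproduced here, so the inputs are assumptions: Ad 1 at 2.5% CTR, Ad 2 at 3.0% CTR, and 5,000 impressions each, chosen because they are consistent with the 0.5% difference and the other figures quoted above. The function names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_z(ctr_a, n_a, ctr_b, n_b):
    """z statistic for the difference between two CTRs (pooled standard error)."""
    pooled = (ctr_a * n_a + ctr_b * n_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    return (ctr_b - ctr_a) / se

def impressions_needed(ctr_a, ctr_b, confidence=0.95):
    """Impressions per ad at which this CTR difference just reaches significance."""
    z_crit = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # about 1.96 at 95%
    pooled = (ctr_a + ctr_b) / 2
    return 2 * pooled * (1 - pooled) * (z_crit / (ctr_b - ctr_a)) ** 2

z_crit_95 = NormalDist().inv_cdf(0.975)

# Assumed scenario: 5,000 impressions per ad, 2.5% vs 3.0% CTR.
z_small_gap = two_proportion_z(0.025, 5000, 0.030, 5000)
# Same impressions, but Ad 2 at 3.15% CTR.
z_large_gap = two_proportion_z(0.025, 5000, 0.0315, 5000)

print(f"2.5% vs 3.0%:  z = {z_small_gap:.2f}, significant: {z_small_gap > z_crit_95}")
print(f"2.5% vs 3.15%: z = {z_large_gap:.2f}, significant: {z_large_gap > z_crit_95}")
print(f"Impressions needed per ad at 2.5% vs 3.0%: {impressions_needed(0.025, 0.030):,.0f}")
```

With these assumed inputs the first comparison falls short of the 95% threshold, the second just clears it, and the required sample comes out near 8,200 impressions per ad. Note that this is the point where the observed difference just becomes significant; planning a test with a comfortable chance of detecting the difference (statistical power) calls for more impressions than this.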

Now, if we add 2 more ads to this ad test, even if we assume one of them now has a 4% CTR, our ability to generate significant results becomes an even larger challenge.

  • First, you have to compare each ad to every other ad in the ad group – why?  Well, at this sample size, Ad 4 is better than Ad 1 and Ad 2, but not Ad 3.  Another test would be required, likely for a longer period of time to gather more impressions, to see if there is actually a difference not likely caused by randomness.  Also, when comparing Ad 2 to the other three ads, there is no significance in the differences – making a decision here looks easy, but really isn’t.
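
Comparing every ad to every other ad can be sketched as a loop over all pairs. The counts below are hypothetical (four ads splitting 10,000 impressions evenly) and are not the data from the example above; they only show the mechanics.

```python
from itertools import combinations
from math import sqrt

Z_CRIT_95 = 1.96  # two-sided critical value at 95% confidence

# Hypothetical (impressions, clicks) for four ads in one ad group.
ads = {
    "Ad 1": (2500, 63),
    "Ad 2": (2500, 75),
    "Ad 3": (2500, 85),
    "Ad 4": (2500, 100),
}

def z_score(a, b):
    """Two-proportion z statistic (pooled standard error) for two ads."""
    (n_a, c_a), (n_b, c_b) = a, b
    pooled = (c_a + c_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    return (c_b / n_b - c_a / n_a) / se

significant_pairs = []
for name_a, name_b in combinations(ads, 2):  # 6 pairs for 4 ads
    z = z_score(ads[name_a], ads[name_b])
    is_significant = abs(z) > Z_CRIT_95
    if is_significant:
        significant_pairs.append((name_a, name_b))
    print(f"{name_a} vs {name_b}: z = {z:.2f}, significant: {is_significant}")
```

With these made-up counts, only the Ad 1 vs Ad 4 comparison clears the threshold, even though the CTRs range from about 2.5% to 4%. Splitting the same traffic four ways leaves each pair with too few impressions to separate the rest.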

Takeaway – It is critical when planning your tests to try and get a sense of how likely it will be that you can get a statistically significant result based on the # of impressions you will likely receive.

  • Look at your SOV data for the test ad groups – at current SOV levels, is it reasonable to assume that you’ll generate enough impressions?  Even if you push to 100% would you have enough impressions?
  • Remember that the # of impressions you need is largely based on the differences in CTR (or whatever other metric you’re testing).  What is a reasonable assumption on performance improvements?  Be conservative and you’ll be more likely to ensure you have enough impressions.  Look back at other tests, assuming they were done correctly: what was the typical difference between good and bad ads?  Use this as an estimate.
  • Consider smaller (in terms of # of ads) tests that happen for shorter periods of time and do them more frequently.
    • This will also help ensure changes in user behavior, competition, marketplace algorithms, etc brought on by how long the test is in market don’t confound or mess with your results.
    • Don’t force a decision or extract learnings where there isn’t enough data – if you are running a test on tail keyword ad groups, it just may not be possible to gather enough data in each ad group in a reasonable period of time.  Consider segmenting your ads based on the variables tested, clustering the ads within the same segment across ad groups together, and evaluating them as a group.
      • Ensure that the keywords in the ad groups have some semantic and contextual consistency – i.e. don’t group “cheap” and “luxury” keywords together
    • Evaluating a conversion metric?  Replace “impressions”, “clicks”, and  “CTR” from the example above  with “clicks”, “conversions”  and your other conversion metric (e.g. CR, CPA, etc) respectively.
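
The SOV check from the takeaway can be roughed out as follows. This is a back-of-the-envelope sketch that assumes impressions scale linearly with share of voice and that rotation splits them evenly across ads; the weekly impression count and the 40% SOV are hypothetical, while 8,200 is the per-ad requirement from the two-ad example above.

```python
from math import ceil

def weeks_needed(required_per_ad, weekly_impressions,
                 current_sov_pct, target_sov_pct, num_ads):
    """Weeks of testing needed for each ad to reach `required_per_ad` impressions.

    Assumes impressions scale linearly with share of voice and that
    ad rotation splits them evenly across `num_ads` ads.
    """
    projected_weekly = weekly_impressions * target_sov_pct / current_sov_pct
    per_ad_weekly = projected_weekly / num_ads
    return ceil(required_per_ad / per_ad_weekly)

# Hypothetical ad group: 3,000 impressions per week at 40% SOV.
print(weeks_needed(8200, 3000, 40, 40, num_ads=2))   # 2 ads at current SOV
print(weeks_needed(8200, 3000, 40, 100, num_ads=2))  # 2 ads at 100% SOV
print(weeks_needed(8200, 3000, 40, 100, num_ads=4))  # 4 ads at 100% SOV
```

For a conversion metric, the same function applies with weekly clicks in place of weekly impressions and the required number of clicks per ad in place of `required_per_ad`.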

In the next part of our series, we’ll discuss what might be driving your test results and how to use Bing Ads to increase certainty on ad tests.

Have any thoughts or comments about this blog or any other thoughts about ad testing challenges and opportunities? Comment below and we can discuss.


Thanks for Reading!

Mike McMeekin and Vivian Li

Advertiser Insights and Analytics, Bing Ads





Savvy Search Marketer Series: Ad Copy Testing Tips, Part I

While the search engine marketing space has been advancing at a frantic pace with new capabilities, ad formats, reporting tools, and many other features, effective ad testing and ad copy writing is still crucial to the success of any search engine marketing (SEM) campaign. 

Whether you’re a marketer focused on the scale needed for an enterprise or someone who only has an hour a week to look at your SEM campaigns, if you’re not testing effectively, you’re losing opportunities to generate quality leads and sales.  While there are a lot of fantastic and effective approaches addressing how to 1) generate new ad copy to test and 2) evaluate those new ad copies, very few of them come from the perspective of someone who works for a company that owns one of the search advertising platforms.

With this series, we’ll review the typical metrics advertisers use to evaluate ad copy, the variables that might confound tests using those metrics, strategies to minimize those issues, and lastly, what we know about the types of ad copy that drive user response.  We’ll do all of this through transparency around our auction and useful big-data driven insights from our marketplace.

Choosing Ad Copy Testing Metrics

For the most part, there are really 4 different types of metrics that can be used, and each of them has its merits and flaws.  The primary types of metrics that advertisers tell us they use are (in order of frequency):

  1. Return-on-investment (ROI), or some other function of both costs and revenue
  2. Click-through rate (CTR)
  3. Cost-per-acquisition (CPA), or some other “cost-per” metric
  4. Quality Score

Many advertisers will even evaluate multiple metrics, but as everyone knows, when two or more metrics conflict, one metric consistently wins out.  Let’s discuss some of the highlights of the use of each metric type and then go into details on how they might be unexpectedly impacted by some auction variables.
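
As a small illustration of how the first three metric types can pull in different directions, here is a sketch that computes them for two ads. The figures are hypothetical, and ROI is taken as (revenue - cost) / cost, which is one common definition rather than a formula prescribed here. Quality Score is omitted because it is reported by the platform rather than computed by the advertiser.

```python
from dataclasses import dataclass

@dataclass
class AdStats:
    impressions: int
    clicks: int
    conversions: int
    cost: float
    revenue: float

    @property
    def ctr(self) -> float:
        """Click-through rate: clicks per impression."""
        return self.clicks / self.impressions

    @property
    def cpa(self) -> float:
        """Cost per acquisition: spend per conversion."""
        return self.cost / self.conversions

    @property
    def roi(self) -> float:
        """Return on investment, assumed here to be (revenue - cost) / cost."""
        return (self.revenue - self.cost) / self.cost

# Hypothetical results for two ads in the same ad group.
ad_a = AdStats(impressions=10_000, clicks=400, conversions=8, cost=200.0, revenue=400.0)
ad_b = AdStats(impressions=10_000, clicks=300, conversions=9, cost=150.0, revenue=450.0)

for name, ad in (("Ad A", ad_a), ("Ad B", ad_b)):
    print(f"{name}: CTR {ad.ctr:.2%}, CPA ${ad.cpa:.2f}, ROI {ad.roi:.0%}")
```

In this made-up case Ad A wins on CTR while Ad B wins on both CPA and ROI, which is the kind of conflict where an advertiser has to decide in advance which metric takes priority.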

Return on Investment (ROI)

ROI is an advertiser favorite for obvious reasons – it is often the North Star of their campaigns. 

  • Pros:
    • You ensure the juice is worth the squeeze - what’s the point of generating more clicks if they don’t convert or cost too much to be worth it?
    • While CTR and CPC are negatively correlated when an ad does not change position, when an ad moves up in position due to Ad Copy improvements, the CPC is more than likely to go up too.
      • The exception being position one, where the CPC could go down to the minimum with big enough increases in CTR
      • Given this, understanding your cost changes due to ad copy changes is important

Illustration of Typical Relationships Between CPC, CTR, and Ad Position

  • Cons
    • Could take more time to gather significant data, relative to CTR focused metrics, since one would need to generate enough clicks to have statistically significant samples, not impressions
    • Focusing on ROI as an evaluation metric often puts undue weight on an ad’s ability to increase or decrease conversion rates, value per conversion, etc.
      • Our data shows that most of conversion and value per conversion is related to the site-side user experience and things like traffic mix (exact vs. broad, O&O vs. network)
      • Exceptions are when an ad sets the wrong expectations about what the website has to offer or over-qualifies users in some way.

Click-Through Rate (CTR)

CTR is often considered, but not necessarily as the primary metric used to evaluate an ad copy test.  Having said that, there are some important considerations for these rate-based metrics to keep in mind.

  • Pros:
    • This is where ads are most effective – i.e. in getting users to click or not – so it may be prudent to judge your test using this metric
    • Using CTR can help you get to a sound decision on the effectiveness of one ad over another more quickly, since gathering sufficient data should take less time relative to a post-click based metric
  • Cons:
    • CTR is also impacted by a number of other variables that are difficult to control for, even when using even ad rotation
      • A good example is ad position – ads in the Main Line have CTRs, on average, >10X that of the Side Bar.  If two ads have even slightly different impression distributions between Main Line and Side Bar, there is likely to be a confounding difference in CTR
    • Given that prices and conversions also change due to ad position (both are often higher for higher positions), if you don’t consider something other than CTR, you may generate more clicks at a non-profitable rate

Cost per Acquisition (CPA)

Like ROI, a significant portion of our advertiser base (roughly 30%) uses a “Cost per” metric to evaluate their ad testing.

  • Basically, this has all the same pros and cons as ROI, except you are less likely to get thrown off by artifact-driven changes to value per conversion (which ads, at scale, typically don’t have much effect on)

Quality Score

Quality Score continues to be a popular metric, but like CTR, it’s often secondary to ROI or CPA based metrics.

  • Pros
    • Only metric that can control for a number of variables that advertisers can’t – ad position, user differences, O&O vs. syndication mix, etc.
    • Gives signals not only on your ad quality performance, but gives you a relative performance metric to your competition (Keyword Relevance)
    • Gives signals on landing page relevance and user experience to help line up ad copy and landing pages (a primary focus for ensuring that ad copy doesn’t hurt conversion rate)
  • Cons
    • Trailing metric, gives a sense of how the system has been evaluating you over time
    • Doesn’t consider cost or revenue related information

In summary, there are lots of metrics to choose from when doing your ad testing; you should choose the one that best reflects what you are trying to achieve with your ad testing.  Having said that, you should also be skeptical about whether ad text can always have a large impact on other metrics besides CTR or Quality Score.  From internal testing, we know the average user “reads” the top 4 ads in less than 3 seconds.  The ad copy is likely most effective in just getting the user to your website and then it’s up to your products and services and user experience to sell them.

In the next part of our series, we’ll discuss more about performing ad tests: how big your sample sizes should be, what data points you have access to (or don’t have access to) that can help you better analyze the differences in performance between two ads and explain the variances you see, and how you can operate your ad tests to expose yourself to as few of these confounding variables as possible.

Have any thoughts or comments about this blog or any other metrics you might use for ad testing purposes? Comment below and we can discuss.

Thanks for reading!

Mike McMeekin and Vivian Li

Advertiser Insights and Analytics, Bing Ads
