Sporting goods retailer evo is officially the first online retailer to bring a ChatGPT experience to the PDP (or what I affectionately call “ChatPDP”).
But this was no simple plug-and-play project. Over a year of testing and iterations went into this killer conversational commerce experience, and Ecom Ideas has the inside scoop on how it came to life.
If you missed it on LinkedIn the first time, check out this quick demo of it in action:
The Alby app adds a choose-your-own-adventure-style Q&A to PDPs and:
🦾 Predicts questions most likely to be asked at the product level
🦾 Self-optimizes for questions that drive revenue
🦾 Serves different questions to different customers based on profile data/order history (see the sketch below)
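Alby hasn't published implementation details, so take this as a conceptual sketch only, with all names and numbers hypothetical: per product, the widget reduces to a list of predicted questions ranked by estimated revenue impact rather than clicks alone, with the top few served on the PDP.

```python
from dataclasses import dataclass

@dataclass
class PredictedQuestion:
    text: str             # question shown on the PDP
    click_rate: float     # observed click-through rate
    revenue_score: float  # estimated impact on purchase likelihood

def questions_to_serve(candidates: list[PredictedQuestion], k: int = 3) -> list[str]:
    """Rank by revenue impact, not clicks alone, and serve the top k."""
    ranked = sorted(candidates, key=lambda q: q.revenue_score, reverse=True)
    return [q.text for q in ranked[:k]]

# Hypothetical candidates for a ski PDP
candidates = [
    PredictedQuestion("Is this ski good for a beginner?", 0.08, 0.9),
    PredictedQuestion("Do you have a cheaper option?", 0.12, 0.1),
    PredictedQuestion("What does the warranty cover?", 0.05, 0.7),
]
print(questions_to_serve(candidates, k=2))
# ['Is this ski good for a beginner?', 'What does the warranty cover?']
```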
Q: It’s fair to say evo is the first ecommerce site to deliver a baked-in ChatGPT experience this way on a PDP. What went into this from idea to launch?
A: About a year ago, when we started working on Alby, we had this idea of taking ChatGPT-like technology and embedding it in ecommerce shopping experiences as a shopping assistant.
So I partnered with Nathan from evo to experiment and see what works. We went on this very long journey testing many different types of experiences -- and most of them failed.
We were always doing rigorous CSAT scoring to see if customers liked the experience, along with rigorous A/B tests. We ran holdout groups, and it took us about six months before we finally found an experience that showed really great lifts and really great CSAT scores.
And now we've been rolling this out across many clients and we're seeing consistent results. But it didn't work out of the gate; it took a lot of experimentation to get there.
Q: How would a client get ready to use a tool like this? Is it as simple as turning it on, or do they need to work with your team?
A: The key thing is the system needs data about your products. So, if you're on an ecommerce platform like Shopify, that process is quite simple. We have an app in the Shopify app store; you just install it and upload any peripheral documents you might have on your products.
If you're in the electronics category, there are usually specification PDFs you'll want to upload, and if you're in beauty or nutrition there are often ingredient PDFs or lists.
As long as we have rich product information, which usually means installing the app from the Shopify store and uploading documents, Alby will work out of the gate.
Over time, you can tune the model. We have really nice flows where, if you see responses in the interface that you don't like or think can be improved, you can give feedback and say, "I think you should respond with x and y instead." Feeding that data back in can be an ongoing process.
But initially to launch you just need products and documents about products.
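Alby's actual ingestion schema isn't public, but per the above, the inputs reduce to structured catalog data plus supporting documents. A hypothetical payload, just to make the shape concrete:

```python
# Hypothetical payload -- Alby's real ingestion schema isn't public.
# The point is the shape: core catalog fields plus peripheral documents.
product_payload = {
    "product_id": "ski-all-mtn-2024",
    "title": "All-Mountain Ski 2024",
    "description": "A versatile ski for mixed terrain and changing snow.",
    "category": "skis",
    "attributes": {"lengths_cm": [170, 177, 184], "ability_level": "intermediate"},
    "documents": [
        {"type": "spec_sheet", "url": "https://example.com/docs/all-mtn-2024-specs.pdf"},
        {"type": "sizing_guide", "url": "https://example.com/docs/ski-sizing.pdf"},
    ],
}
```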
Q: Every CRO and product manager is scared of losing control of the experience. How do you approach QA of AI-generated questions and validate that the questions are good at scale?
A: There are a few high-level quantitative metrics we've used, and a bunch of things you can do to directionally know whether this is working at all.
Holdout group
One is to do a holdout group. We’ve done many, many holdout groups on evo and it took us a while to get to the lift we see today.
Ask customers
Another thing is to ask customers whether they're satisfied with the answers. We did random tests where we would ask people: on a scale of one to five, how would you rate this experience?
At first, we were getting very low CSAT scores because people would say, "this is obviously wrong" or "not useful." Over time, we got to the point we're at now, where about 85 to 90% of people rate it a four or five.
Measure engagement
Then you can also look at engagement. What percentage of people actually engage with this thing, and then end up buying after engaging with it?
Flag negative feedback
We also have a way to flag questions that got low ratings and bubble those to the top for experts to review. In the beginning, there were a few cases on very niche ski topics where people would say, "this is a bad response."
We flagged them for an internal evo expert, who took probably two hours total to fix pretty much all of the issues that appeared. He just went in and made some minor tweaks to make sure the system knew the edge-case information it wasn't privy to.
You can also just do very basic spot checking by randomly taking a sample of questions and reviewing them.
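None of these checks require exotic tooling. As a rough illustration (not Alby's code; names and thresholds are made up), the quantitative side reduces to a few small computations:

```python
import random

def conversion_lift(treated_orders: int, treated_visits: int,
                    holdout_orders: int, holdout_visits: int) -> float:
    """Relative conversion lift of the widget group over the holdout group."""
    return (treated_orders / treated_visits) / (holdout_orders / holdout_visits) - 1.0

def top_two_box_csat(ratings: list[int]) -> float:
    """Share of 1-5 ratings that are a 4 or 5 (the 85-90% figure above)."""
    return sum(r >= 4 for r in ratings) / len(ratings)

def review_queue(answers: list[dict], low: float = 2.5, sample_size: int = 20) -> list[dict]:
    """Bubble low-rated answers up for an expert, plus a random spot-check sample."""
    flagged = [a for a in answers if a["avg_rating"] < low]
    spot_check = random.sample(answers, min(sample_size, len(answers)))
    return flagged + spot_check
```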
Q: Where do you focus and prioritize? With your flagship or top-selling products, or with the broken needles in your haystack?
A: Usually people don't just launch it: they upload all their data, then go to their top products, see what the questions are, and start going back and forth evaluating, tuning, and giving feedback.
Start with the products you know matter the most and make sure those are rock solid. Then, for the long-tail products, you can see what bubbles to the top.
With evo, we did all product categories from the start, but only with 10% of the audience to see which product categories performed best.
Our original thesis was that there would be categories where it added no value and categories where it added a lot. But that wasn't what we saw: the impact was pretty consistent across the board, so we ended up rolling it out across all categories.
Q: Alby auto-optimizes the questions that get served over time for clicks, but also for KPIs like conversion and revenue. How does this work, and how long does it take to perfect itself?
A: One of the key learnings we've had is that not all questions are created equal. Just because a lot of people click on a question does not mean that information will drive them to buy the product, and some questions make shoppers less likely to buy.
For example, “do you have a cheaper option” puts shoppers into decision paralysis.
The final unlock in our journey -- once we got to the point of high engagement, but still flat lift -- was in tuning the system to try different types of questions and learn which ones actually drive sales.
We now have a pre-trained system that scores questions based on which ones drive sales, and it usually works out of the gate.
But it also continuously self-optimizes per client: we use a neural network that looks at product information and at the questions, scores them, and predicts which types of questions matter for which types of products.
A very basic example is that warranty information drives a ton of conversions on products like bikes and skis, but almost no incremental conversions on a product like a hat.
The scoring system learns which types of products map well to which types of questions and filters out the low-performing types.
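Alby describes this as a pre-trained neural scorer. The much simpler sketch below (hypothetical data and names, not their model) illustrates the core idea: estimate a purchase rate for each (product type, question type) pair from interaction logs and filter out the low performers.

```python
from collections import defaultdict

def score_question_types(events: list[dict]) -> dict[tuple[str, str], float]:
    """Purchase rate per (product_type, question_type) among shoppers who clicked it."""
    clicks: dict[tuple[str, str], int] = defaultdict(int)
    buys: dict[tuple[str, str], int] = defaultdict(int)
    for e in events:
        key = (e["product_type"], e["question_type"])
        clicks[key] += 1
        buys[key] += int(e["purchased"])
    return {k: buys[k] / clicks[k] for k in clicks}

def questions_for(scores: dict[tuple[str, str], float],
                  product_type: str, floor: float = 0.02) -> list[str]:
    """Keep only question types that clear a purchase-rate floor for this product type."""
    return [q for (p, q), rate in scores.items() if p == product_type and rate >= floor]
```

On data like the warranty example above, ("skis", "warranty") would clear the floor while ("hats", "warranty") would not, so hat PDPs simply stop getting warranty questions.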
Q: Can this personalize questions based on customer profiles and browsing behavior?
A: Today, the system doesn't do any form of personalization like you're describing. But our hope is within the next few quarters to allow the questions to be personalized based on the actual browse behavior of a customer on a website.
On evo, for example, that would mean inferring their expertise level, which sports they're interested in, and what types of questions they tend to ask before making purchasing choices, then using that to change the questions we prompt people with and the responses they get.
Q: Who on the merchant side owns this experience (maintenance, testing, etc), and how much human effort is involved in optimizing it?
A: I've seen two types of organizational approaches to this with equal success, so I don't know if there's a preference. In the evo case, the ecommerce team rolls it out with some partnership from the product team: fundamentally, they deploy it through a tag manager, hook up a product feed to Alby, upload some documents, and are off to the races.
For other organizations, it's the product team that manages the website UI and UX, and they partner with us to figure out where to place it on the site and how to roll it out.
On Shopify, it's a point-and-click process to get set up. We've had people launch this within 20 minutes. The additional things that need to be implemented are the product feed and tagging the website; with Google Tag Manager, that's usually copying and pasting a tag. Usually folks have a standardized feed they can just push to Alby.
For some of our larger customers, there’s more process getting the data to flow to us, and they have a ton more data through multiple data feeds.
Q: You did many rounds of testing to perfect the UI, including different implementations across different sections of the site. Can you walk through your testing and iteration strategy and process?
A: First, we tried just embedding a chatbot on the website in the bottom right, and if you opened it there was a shopping assistant. It looked like any other chatbot. But the numbers were just not good. It showed no lift and the CSAT scores were bad.
The main challenge was that so few people ever clicked on it -- about 0.2% of people opened the chatbot, and those who did expected customer support rather than shopping assistance.
For anyone looking at different AI tools out there, there are some really amazing demos of someone asking a really crazy question and an AI chatbot serving an amazing answer. But the hidden secret is that, to get that amazing answer, the customer had to write a really complex question. And in practice, consumers don't do that.
We were hoping people would write questions like “I’m a beginner skier, I really love West Coast skiing but I only like these types of curves, so help me pick a ski.”
Nobody wrote those types of questions.
What was more common was people confusing us with a support bot. What they were really saying was, “where’s my order?”
We did try the proactive chat approach as well, popping out the chat from the bottom right. Our surveys show customers really don’t like that.
Then we tried embedding a button over products where you could hover over it, click to ask a question, and type a long-form question about the product. That showed really nice lift numbers, but really low engagement.
What that showed us was that people who asked questions about products became much more likely to buy, but only a very small fraction of people actually clicked that button to write those long-form questions. That led us to the hypothesis that nobody wants to do lots of typing. And this is where the analogy to ChatGPT does not hold -- people do go to ChatGPT to write really complex things, because their intent is to ask a question they couldn't get answered elsewhere.
But when people are shopping, they're not interested in writing long form questions. They're trained in a shopping context to have very low cognitive overhead in the choices they're making. So that led to this idea of trying to do something much harder. It ended up being a very long build to actually use AI to predict the questions so the consumer doesn't actually have to think much and can just click to get their answer.
And when we did that, we saw a 10-20x lift in engagement, although the conversion lift was sort of spotty: some questions drove lift and others didn't.
Then the big breakthrough was building a proprietary system that figures out which questions actually do drive sales. That’s when we finally got to an experience that was good.
Q: Did you see a difference between mobile vs desktop in your UI testing?
A: Lift numbers are pretty consistent between desktop and mobile, but in general conversion rates on mobile are lower than on desktop.
My hunch with some of the work we're doing now is the impact will actually be even greater on mobile, because the challenge with mobile is there's a lot less real estate. Instead of 8 page scrolls on mobile to find information, we can condense that for a specific customer into the 3 key pieces of information they really need to know, and that’s where AI can really shine.
Ideal desktop placement
In terms of placement, most of the desktop lift is when it’s placed under hero images and above the litany of product info below the fold, keeping the key purchasing interactions on the right side (Buy Box area) as usual.
Ideal mobile placement
For mobile, it's definitely below the fold because the key thing that you see first is usually images, then size/variant selection, the add to cart button, and below that we insert the widget.
Q: What about using ChatGPT experiences for cross-sell and upsell after add-to-cart?
A: In the next few weeks we're rolling out upsell and cross-sell to the experience, where we're going to not only help people build conviction in the current product they're on, but also help them pair that product with other products in the catalog.
Q: Can ChatGPT answers on a PDP help SEO?
A: SEO is a really cool use case for this because we're creating this new data set for what we now know are the exact questions people have on different products.
So we can absolutely use that to auto-optimize SEO on product pages, and that’s on our roadmap.
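Neither side has shared how that will be implemented, but one standard way to expose PDP Q&A to crawlers is schema.org FAQPage structured data. A minimal sketch, assuming the captured question/answer pairs are already on hand:

```python
import json

def faq_jsonld(qa_pairs: list[tuple[str, str]]) -> str:
    """Render captured PDP Q&A as schema.org FAQPage JSON-LD for the page <head>."""
    return json.dumps({
        "@context": "https://schema.org",
        "@type": "FAQPage",
        "mainEntity": [
            {"@type": "Question", "name": q,
             "acceptedAnswer": {"@type": "Answer", "text": a}}
            for q, a in qa_pairs
        ],
    }, indent=2)
```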
And for merchants, we can capture really interesting insights into what types of questions people are asking before buying or before abandoning. We learn, for example, whether a particular question makes people leave and buy an alternative product, or buy nothing at all. These types of insights can drive product cycles for a brand, or purchasing and buying decisions for a retailer.
I think such insights are going to be really valuable, because it used to be the case that we only had community-driven Q&A data to power that, and that's really sparse because very few people actually engage with those widgets.
We're also working with evo to test knowledge base generation. But we're prioritizing figuring out which key data points to put on each page so Google doesn't ding you for having too much content to crawl; you can't have an infinitely growing repository of Q&A pages that, at the end of the day, doesn't help users.