Welcome, data enthusiasts! 🌟 Join us for an engaging session where we dive deep into the fascinating world of AI, data engineering, and cutting-edge models. I’m thrilled to introduce Julian Wiffen, our Director of Data Science & AI at Matillion. 🚀
Julian & Matillion have explored the potential of generative AI to transform data engineering. From standardizing job titles to handling vast blocks of unstructured data, he’s been at the forefront of innovation. 🤖
Now, it’s your turn! 🗣️ Ask Julian Anything about AI, data pipelines, or the future of analytics. Whether you’re a seasoned pro or just curious, this is your chance to engage with an expert. 🤝
Post your questions below; Julian will be on hand to answer them from Monday 22nd April until Friday 26th April.
- data enrichment (e.g. 'summarise what you know about this company') - a rough sketch of this follows below
- generation of test data...
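To make the data enrichment idea concrete, here is a minimal sketch using the OpenAI Python client; the model name, prompt wording, and company details are illustrative assumptions rather than anything Matillion-specific.

```python
# Minimal sketch: enrich a row of company data with an LLM-generated summary.
# Assumes the OpenAI Python SDK and an OPENAI_API_KEY in the environment;
# the model name and prompt are illustrative choices, not a recommendation.
from openai import OpenAI

client = OpenAI()

def enrich_company(name: str, website: str) -> str:
    """Ask the model to summarise what it knows about a company."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed model; swap for whatever you have access to
        messages=[
            {"role": "system", "content": "You are a data enrichment assistant."},
            {"role": "user",
             "content": f"Summarise what you know about the company '{name}' ({website}) in two sentences."},
        ],
        temperature=0.2,  # lower temperature for more repeatable enrichment
    )
    return response.choices[0].message.content

print(enrich_company("Matillion", "https://www.matillion.com"))
```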
What are the biggest challenges? Getting used to the non-deterministic nature of it, first of all. We'll need to learn to wrap quality processes around it, just as we might around any business process done by a human, especially as we are looking at either text-based outputs that are hard to code checks around, or use cases where we are trying to measure the quality of a judgement call.
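As a loose illustration of wrapping a quality process around a non-deterministic step, a retry-until-valid gate might look something like this; `call_llm`, `looks_valid`, and the retry count are hypothetical placeholders, not a prescribed pattern.

```python
# Sketch of a quality gate around a non-deterministic LLM step.
# `call_llm` and `looks_valid` are hypothetical placeholders for your own
# model call and business-rule check; the retry count is arbitrary.
from typing import Callable

def with_quality_gate(call_llm: Callable[[str], str],
                      looks_valid: Callable[[str], bool],
                      prompt: str,
                      max_attempts: int = 3) -> str:
    """Retry a non-deterministic call until its output passes a check."""
    last = ""
    for _ in range(max_attempts):
        last = call_llm(prompt)
        if looks_valid(last):
            return last
    # After exhausting retries, flag for human review rather than failing silently.
    raise ValueError(f"Output failed validation after {max_attempts} attempts: {last!r}")
```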
There's still a lot to be learned about how to measure the quality. Some things we are exploring include:
- the use of multiple choice questions to test model reasoning in a way that can be automated
- taking a random sample for human evaluation of quality
- feedback links on anything that is going to a human end user
- using vector separation to compare how consistently a model answers the same question (there's a rough sketch of this below)

As models become more sophisticated, I can see an approach in which a lower cost, high throughput model processes all transactions and a small percentage are passed to a larger, more costly but more advanced model to evaluate the quality. We may also start to see the approach of asking multiple models the same question in an ensemble method - effectively having them vote on the outcome.
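To make the vector separation idea concrete, here is a rough sketch that embeds repeated answers to the same question and averages their pairwise cosine similarity as a crude consistency score; the embedding model, the toy answers, and any threshold you apply to the score are assumptions for illustration.

```python
# Sketch: measure how consistently a model answers the same question by
# embedding repeated answers and comparing their pairwise cosine similarity.
# The embedding model and the example answers are illustrative assumptions.
import numpy as np
from openai import OpenAI

client = OpenAI()

def consistency_score(answers: list[str]) -> float:
    """Return the mean pairwise cosine similarity of a set of answers."""
    result = client.embeddings.create(model="text-embedding-3-small", input=answers)
    vectors = np.array([item.embedding for item in result.data])
    # Normalise so the dot product is cosine similarity.
    vectors /= np.linalg.norm(vectors, axis=1, keepdims=True)
    sims = vectors @ vectors.T
    # Average the off-diagonal entries (pairwise similarities only).
    n = len(answers)
    return float((sims.sum() - n) / (n * (n - 1)))

if __name__ == "__main__":
    # Five answers collected by asking the model the same question repeatedly (toy example).
    answers = [
        "Head of Data is a senior data leadership role.",
        "A senior leadership position responsible for data strategy.",
        "Leads the data function at a company.",
        "A senior data leadership role.",
        "An executive responsible for the data team.",
    ]
    print(f"consistency: {consistency_score(answers):.3f}")
```

A low score suggests the model's answers drift between runs, which is a signal to route a sample of that workload to human review.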
The other big challenge is speed and rate/throughput limits. We've grown used to handling very large datasets - 100,000 rows would not be classed as a large volume for any normal ETL purposes these days - but it would take significant time to process that through even a simple LLM. This will of course improve over time, but we may need to do some mental recalibration when working in this space. We're certainly going to need to get adept at balancing accuracy and quality against cost and speed.
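To put rough numbers on that: at around two seconds per call, 100,000 rows processed one at a time is over 55 hours, so you typically need some concurrency within the provider's rate limits. Below is a minimal sketch of concurrency-capped processing; `summarise_row`, the simulated two-second latency, and the limit of 10 concurrent requests are all illustrative assumptions.

```python
# Sketch: push many rows through an LLM call with a cap on concurrent requests,
# as a crude way of balancing throughput against provider rate limits.
# `summarise_row` is a hypothetical stand-in for a real LLM call; the
# concurrency limit of 10 is an arbitrary illustrative choice.
import asyncio

async def summarise_row(row: dict) -> str:
    await asyncio.sleep(2)  # simulate a ~2 second LLM round trip
    return f"summary of row {row['id']}"

async def process_rows(rows: list[dict], max_concurrency: int = 10) -> list[str]:
    semaphore = asyncio.Semaphore(max_concurrency)

    async def bounded(row: dict) -> str:
        async with semaphore:
            return await summarise_row(row)

    return await asyncio.gather(*(bounded(r) for r in rows))

if __name__ == "__main__":
    rows = [{"id": i} for i in range(100)]
    results = asyncio.run(process_rows(rows))
    print(len(results), "rows processed")
```

In practice you would also add retries with backoff for rate-limit errors, which is exactly the kind of accuracy/cost/speed balancing described above.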