🎙️ Ask Matillion Anything with Julian Wiffen 🎙️

 

Welcome, data enthusiasts! 🌟 Join us for an engaging session where we dive deep into the fascinating world of AI, data engineering, and cutting-edge models. I’m thrilled to introduce Julian Wiffen, our Director of Data Science & AI at Matillion. 🚀

 

Julian & Matillion have explored the potential of generative AI to transform data engineering. From standardizing job titles to handling vast blocks of unstructured data, he’s been at the forefront of innovation. 🤖

 

Now, it’s your turn! 🗣️ Ask Julian Anything about AI, data pipelines, or the future of analytics. Whether you’re a seasoned pro or just curious, this is your chance to engage with an expert. 🤝

 

Post your questions below; Julian will be on hand to answer them from Monday 22nd April until Friday 26th April.

 

Questions

  • Do you think generative AI will become a mandatory skill for all data professionals in future?
  • What are the biggest challenges of using generative AI in data pipelines, and how do we ensure the quality of data generated by AI?

Thanks @JoeCommunityManager and team for having these sessions!


 

Will Generative AI become a mandatory skill for all data professionals? I would say probably yes, at least at a basic level.

 

It offers so many potentially useful techniques to the data engineer, scientist or analyst. A few examples off the top of my head:

  • classifying widely varying or free text columns into simple categorical buckets
  • feature engineering for BI purposes or predictive models based on yes/no questions over unstructured data
  • data clean-up or standardisation
  • the ability to handle inputs in multiple languages
  • summarisation of very granular data to an appropriate level for the recipient
  • effectively extremely sophisticated regex filtering
  • data enrichment (e.g. 'summarise what you know about this company')
  • generation of test data...
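Several of the items above boil down to using an LLM as a row-wise transform. Here's a minimal sketch of the first one, classifying free-text job titles into fixed buckets. The `call_llm` function is a hypothetical stand-in for whatever model API you use; it's stubbed with keyword rules here so the example runs offline, and the bucket names are illustrative only:

```python
# Sketch: mapping a free-text job-title column into fixed buckets.
# call_llm is a hypothetical placeholder for a real model API call,
# stubbed with keyword rules so the example is runnable offline.

BUCKETS = ["Engineering", "Sales", "Other"]

def call_llm(prompt: str) -> str:
    # Stand-in for an actual LLM request. A real implementation would
    # send `prompt` to a model endpoint and return its completion.
    title = prompt.rsplit("Title:", 1)[-1].lower()
    if "engineer" in title or "developer" in title:
        return "Engineering"
    if "sales" in title or "account exec" in title:
        return "Sales"
    return "Other"

def classify_title(title: str) -> str:
    prompt = (
        f"Classify this job title into one of {BUCKETS}. "
        f"Answer with the bucket name only.\nTitle: {title}"
    )
    answer = call_llm(prompt).strip()
    # Guard against the model inventing a category outside the list.
    return answer if answer in BUCKETS else "Other"

titles = ["Senior Software Engineer", "Account Exec, EMEA", "Chief Vibes Officer"]
print([classify_title(t) for t in titles])  # -> ['Engineering', 'Sales', 'Other']
```

The same pattern (prompt template, model call, validation against an allowed set) covers the clean-up, standardisation and enrichment cases too.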

 

What are the biggest challenges? Getting used to the non-deterministic nature of it, first of all. We'll need to learn to wrap quality processes around it, just as we might around any business process done by a human - especially as we are looking at either text-based outputs that are hard to code checks around, or use cases where we are trying to measure the quality of a judgement call.
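One concrete shape that "wrapping quality processes around it" can take is a validate-and-retry loop, where the model's output must pass a deterministic check before it is accepted. The `model` function below is a hypothetical stand-in that simulates a flaky LLM:

```python
# Sketch: accept LLM output only when it passes a deterministic check.

ALLOWED = {"positive", "negative", "neutral"}

def model(prompt: str, attempt: int) -> str:
    # Hypothetical stand-in for a non-deterministic LLM: the first
    # attempt returns chatter, a later attempt returns a clean label.
    if attempt == 0:
        return "Sure! The sentiment is: positive"
    return "positive"

def checked_call(prompt: str, retries: int = 3) -> str:
    # Only outputs that pass the validity check are accepted,
    # much like QA steps around a human-run business process.
    for attempt in range(retries):
        answer = model(prompt, attempt).strip().lower()
        if answer in ALLOWED:
            return answer
    raise ValueError("model never produced a valid label")

print(checked_call("Sentiment of: 'Great product, fast delivery'"))  # -> positive
```

For free-form text outputs the check is harder to write, which is exactly why the judgement-call cases need the measurement techniques below.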

 

There's still a lot to be learned about how to measure the quality. Some things we are exploring include:

  • multiple-choice questions to test model reasoning in a way that can be automated
  • taking a random sample for human evaluation of quality
  • feedback links on anything that is going to a human end user
  • using vector separation to compare how consistently a model answers the same question

As models become more sophisticated, I can see an approach in which a lower-cost, high-throughput model processes all transactions and a small percentage are passed to a larger, more costly but more advanced model to evaluate the quality. We may also start to see the approach of asking multiple models the same question in an ensemble method - effectively having them vote on the outcome.
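The ensemble idea is easy to sketch: ask several models the same question and take the majority answer. The three `model_*` functions are hypothetical stand-ins for calls to distinct LLMs:

```python
# Sketch: majority voting across several models on the same question.
from collections import Counter

def model_a(q: str) -> str: return "churn"      # stand-ins for three
def model_b(q: str) -> str: return "no churn"   # different model APIs
def model_c(q: str) -> str: return "churn"      # answering one question

def ensemble_vote(question: str) -> str:
    # Each model "votes"; the most common answer wins. The level of
    # disagreement is itself a useful signal for routing rows to
    # human review.
    answers = [m(question) for m in (model_a, model_b, model_c)]
    return Counter(answers).most_common(1)[0][0]

print(ensemble_vote("Will this customer churn?"))  # -> churn
```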

 

The other big challenge is speed and rate/throughput limits. We've grown used to handling very large datasets - 100,000 rows would not be classed as a large volume for normal ETL purposes these days - but it would take significant time to process that many rows through even a simple LLM. This will of course improve over time, but we may need to do some mental recalibration when working in this space. We're certainly going to need to get adept at balancing accuracy and quality against cost and speed.
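To make that recalibration concrete, here's a back-of-the-envelope throughput estimate. The per-call latency, rate limit and concurrency figures are assumed illustrative numbers, not measurements of any particular provider:

```python
# Sketch: wall-clock time to push a table through an LLM, bounded by
# per-call latency (spread over concurrent workers) and by the
# provider's rate limit, whichever is slower.

def llm_batch_hours(rows: int, seconds_per_call: float,
                    max_calls_per_minute: int, concurrency: int) -> float:
    latency_bound = rows * seconds_per_call / concurrency   # seconds
    rate_bound = rows / max_calls_per_minute * 60           # seconds
    return max(latency_bound, rate_bound) / 3600            # hours

# Illustrative numbers only: 100,000 rows, 1 s per call,
# a 600 calls/minute rate limit, 10 concurrent workers.
print(round(llm_batch_hours(100_000, 1.0, 600, 10), 1))  # -> 2.8
```

Hours, not seconds, for a table a warehouse would scan instantly - which is why routing only a sample of rows to the model, or to a larger model, matters so much.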

 

Spot-on insights! Appreciate it greatly. Thank you @JulianWiffen

Your insights are really motivating me to dive deeper into learning more about Generative AI.