EP 59: Jacob Matson from MotherDuck

0:00 Yeah, so third time's the charm. We were talking for about 20 minutes and realized we didn't hit record. And then as soon as we started recording, my power blipped and the Wi-Fi router went out.

0:12 So we're trying again, but welcome to another episode of Energy Bites. I'm Bobby Neeland. We've got John Calfan with us. He's got the Axis Energy Services swag rocking. Shout out to

0:23 Santo. Appreciate the swag, man. And yeah, we're really excited. We've got Jacob Matson with us. He's the developer advocate from MotherDuck. And yeah, super excited to have you on. We've

0:33 gone back and forth on social media, various platforms, for the last couple of years, and excited to have you on. Yeah, I'm excited to be here, you know, get to talk

0:44 about the cool things we're doing with DuckDB and at MotherDuck. So yeah, I'm really, really excited to talk about that. But also, you know, talk with people who are in the domain

0:53 and not just thinking about software, but doing real stuff with it, which is, you know, one of my favorite things to do. So love it. No, absolutely.

1:03 And real quick, Bobby, 'cause we're gonna be talking at some point about data and analytics: I was actually looking at our channel analytics, and 75% of you watching this are not

1:15 subscribed. So please go subscribe really quickly. It takes two seconds. We'd appreciate it. It's somewhere over here. Yeah. And yeah, as most people who've been listening

1:29 consistently know, I try to find someone to talk about DuckDB. And this time I was like, I'm going to make sure it happens. I'm going to talk to one of the

1:37 guys that works at MotherDuck, probably one of the main companies maintaining DuckDB. And I know Jacob's been using it for some pretty cool use cases

1:45 over the last couple of years. So I'm really excited to dive into it. 'Cause again, I think there's a ton of opportunity in the oil and gas and energy space to use it. I mean, probably any

1:54 industry, but especially there, where we have applications all the way from IoT up to the cloud. So, you know, I think it's super powerful. And, you know, kind of want

2:05 to lead off looking at a somewhat current event: about a week or so ago, they came out with DuckDB for Google Sheets. DuckDB-GSheets, I think is what it's called. But basically it

2:15 allows you to read from and write to Google Sheets using pure SQL. Yeah. So, you know, it really kind of opens up a lot of doors for people that maybe only had that if they used BigQuery or, you

2:27 know - Yeah, totally. We're really excited about it. It actually came from the community: one of the guys over at this BI company called Evidence, Archie. Archie built that. You know, I

2:38 think it was one of those things where he was like, oh, this takes too long. Instead of it taking, you know, an

2:46 hour, now he's going to spend a hundred hours writing code to fix the problem. But it helps everyone else, right? Yeah. Yeah. But yeah, you know, what it does that's

2:55 really cool is it kind of opens the doors around a lot of workloads where you can take data out of Google Sheets (it handles authentication seamlessly) and then push data back in. And so if you

3:06 think about, John, on our first recording you mentioned, hey, this is kind of like Power Query, but for Google Sheets, it's like, yes, exactly, exactly. Yeah, you get to let the end users

3:17 have the data where they want it and not allow them access to the database to screw things up. So it's a beautiful world, right? Yeah, totally, totally. And I think it's also, you know, it's

3:27 really nice to be able to get access to these database primitives on top of Google Sheets and, you know, kind of do a lot of fun things that maybe you wouldn't be able to do in the past or

3:39 without, you know, authorization and now you just kind of do it and it's very cool. Yeah, well, that's the thing, right? Like Google has this whole ecosystem and most people aren't familiar

3:49 with it, at least on the enterprise side, because most of them are Microsoft shops and so they have no need for it. But our company is a Google shop; all of our stuff is on Google and Gmail and all

4:00 that stuff. And it's just incredibly powerful how much

4:05 kind of interoperability and connectivity the G Suite platform offers, right? Being able to pass things from Drive to - or Docs to Sheets and back, and then into Forms, and basically every direction

4:20 in between. And so something like this is really, really slick where you can pull from your database right to an end user's kind of notebook or worksheet and let them do whatever they want to do

4:34 from there. And then you can layer Apps Scripts on top of that and do all kinds of other really interesting things as well. So it's really, really cool. I just love Google Sheets and all of that

4:44 stuff for POCs and stuff. It's such a quick and easy platform. And all the language models know how to write an Apps Script. And it's fun. Yeah, totally. You know, John, I didn't think

4:54 about it the first time, but this should play really well with, probably one of our first five episodes, we had Jeff Davies on, and he's got Rapify, which is basically a way to, I

5:04 think he pulls the data out of or into Google Sheets and then serves it out, and basically creates ready-to-use investor presentation type things for people. It ties into the

5:15 Google presentations layer and creates the charts and everything from it. So it would actually be pretty interesting to layer this into that process too. So yeah,

5:26 why don't you tell us about your journey into just data in general and then how it kind of got you to this point? Yeah, sure. So if we go back in time a little bit, I actually have a

5:41 degree in accounting and computer science. I started my career actually more on the accounting side.

5:48 And as it turns out, there's a lot of shared primitives. If you zoom out a little bit, a lot of analytics is kind of just stuff on top of ledgers. That's also

5:59 my own bias, like: oh, this is just a ledger. Cool. What's a snapshot? That's a ledger. Cool. Yeah. I mean, you're even getting into slowly changing dimensions and stuff, right?

6:07 Yeah, totally. It's all the same. Like, oh yeah, I know what that is. Cool. So yeah, you can definitely apply a lot of the same principles to it. And what really happened is

6:18 that I kind of realized pretty early on that I'm pretty good at persisting on a problem, probably longer than I should. And so what that means is you kind of just

6:31 naturally drift into harder and harder problems. And eventually I got to the point where Excel was not enough. This was probably maybe five years into my career.

6:42 And I said, OK, well, we're running an ERP system on-premise. It's Microsoft Dynamics NAV. It's got SQL Server behind the scenes. How do I get access to that system so that I can

6:52 start doing analysis directly on that?

6:55 And so that was the start of the journey,

6:58 kind of into the data side.

7:02 And just over time, just got more and more in depth with SQL Server, honestly. In fact, I'm actually running it on my MacBook right now in a Docker container, which is hilarious. But I found out

7:15 yesterday that it's running the Linux version of SQL Server inside the Docker container, and the syntax is slightly different than Windows SQL Server. Yes. Slightly.

7:29 So it's like, well, okay. These are the things that I love about coding. It's like, oh yeah, syntax matters. Capitals and spaces, like that. And I'm also going back and forth

7:41 between dialects too. I mean, like, whether it's case sensitive or not. I beat my head against the wall with Snowflake 'cause I didn't realize it was case sensitive, you

7:53 know, for an IN clause. And like, why is it not finding this? Oh, sure, like, yeah, for text. Yeah, yeah, yeah, absolutely. Like the collation types in SQL Server, for sure. It's

8:02 like, if someone didn't install it right the first time, well, now you have a case-sensitive SQL Server, and everything else in the Microsoft stack is case insensitive. I don't know if you

8:10 guys - I ran into that with a, you know, hot tip for people in oil and gas: there's a software called OpenWells. And I think they made either one table

8:22 or one column case sensitive, but everything else wasn't. It was this ID column, and the ID was repeated, but with different capitalization; the actual

8:35 characters were exactly the same. So it was fine in Snowflake, but then Power BI was choking on it, 'cause it said it had duplicates. Yeah, Power BI is case insensitive by default

8:46 on joins. Yeah. That's right. Yeah, so that was an interesting one too. I mean, I kind of want to make it a mission in my life now: I'm just going to make a library that gets rid of

9:00 all of these things when you code and talk between these systems, because it's so frustrating. You know, like, oh, the code is there and it should work and it doesn't. And it's because, yeah,

9:13 that's what turned me off of coding when I was in high school. Oh yeah, for sure. I remember being early in my career and dealing with Salesforce data, and the opportunity ID, for those

9:22 of you who don't know, is a case-sensitive 15-character ID, I believe. And basically, if you do a VLOOKUP on it, you can get the wrong data back. And I learned that by putting data into a board

9:34 presentation that was incorrect.
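[Editor's note: a minimal Python sketch of the pitfall described here, using two made-up Salesforce-style IDs that differ only by case.]

```python
# Two made-up Salesforce-style IDs that differ only by capitalization --
# they identify two *different* records.
ids = ["0065e00000AbCdE", "0065e00000ABCDE"]

# A case-insensitive lookup (what Excel's VLOOKUP effectively does)
# collapses them into one key, so the lookup can return the wrong row.
print(len(set(i.lower() for i in ids)))  # -> 1

# A case-sensitive match keeps them distinct.
print(len(set(ids)))  # -> 2
```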

9:37 Yeah, always the hardest way. Yeah, yeah. So yes, learned that lesson. So basically, as it turns out, a lot of the accounting principles I

9:51 apply have applied usefully to databases, and eventually I got to the point where the type of operations I was doing were too slow in a

10:00 transactional database. And so around that time, I started exploring what other options were out there and discovered DuckDB, which is an embeddable, in-process database. And it works seamlessly

10:12 on all types of data, including CSV,

10:15 Excel, Parquet, et cetera. And so that was really my door into even faster analytical queries

10:23 using kind of an open-source, MIT-licensed library, which is DuckDB. I'll do a quick aside on that, Bobby. So from a maintenance standpoint of DuckDB, there are actually owners of it:

10:27 it's the DuckDB Foundation,

10:41 and it's maintained by DuckDB Labs. And so they're on point. We have relationships with them in terms of getting stuff updated and maintained for us on the MotherDuck side. They're independent,

10:52 governed their own way, and I'm only calling that out because it's important to understand the difference in how we fit. So yeah. No, absolutely. In our previous recording,

11:05 or non-recording, you dove a little bit into OLTP

11:12 versus OLAP, the transactional versus analytical processing thing. Could you touch on that real quickly? Yeah, sure. I mean, I think, you know,

11:26 basically there's just different architectures for these databases and, you know, you can kind of broadly put them into two buckets. There's the OLTP, which is online transactional processing,

11:35 which are databases designed to have high throughput on transactions, and a transaction you can think of as just a row in a table.

11:43 And so they're built with certain trade-offs in mind. And really what that means is when you're accessing the data, you're accessing it row at a time. And from

11:57 an analytics perspective (OLAP, again, is online analytical processing),

12:04 we generally store data as columns instead of rows. So we access and aggregate columns, slice and dice dimensions, et cetera. And so

12:14 it just thinks about things differently in terms of how it interacts with low-level hardware and software. And so what that means is something like an aggregation operation in an

12:24 analytical database is much faster than in a row-oriented, transactional database, broadly because

12:32 we can just look at a single column versus reading all the rows in order to get a single column. And that's a massive oversimplification; there's a whole bunch of

12:43 different trade-offs that different databases make. SQL Server makes different trade-offs than Postgres, than Oracle, et cetera. So I want to be careful there.
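[Editor's note: a rough way to feel the difference being described. This minimal sketch runs the same aggregate against row-oriented SQLite and columnar DuckDB on identical synthetic data; timings are machine-dependent.]

```python
# Same GROUP BY aggregate against SQLite (row-oriented) and DuckDB (columnar).
import sqlite3, time
import duckdb  # pip install duckdb

N = 1_000_000
lite = sqlite3.connect(":memory:")
lite.execute("CREATE TABLE t (id INTEGER, dim INTEGER, val REAL)")
lite.executemany("INSERT INTO t VALUES (?, ?, ?)",
                 ((i, i % 100, i * 0.5) for i in range(N)))

duck = duckdb.connect()  # in-memory DuckDB; same rows, generated in SQL
duck.execute(f"""
    CREATE TABLE t AS
    SELECT i AS id, i % 100 AS dim, i * 0.5 AS val FROM range({N}) r(i)
""")

for name, con in (("sqlite", lite), ("duckdb", duck)):
    start = time.perf_counter()
    con.execute("SELECT dim, SUM(val) FROM t GROUP BY dim").fetchall()
    print(name, f"{time.perf_counter() - start:.3f}s")
```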

12:55 I'm sure they'll tell you about it. They'll tell you about it in the comments. If you wanna tell Jacob where he's wrong, go drop them in the comments below, guys. That'll be fun. Yeah, so

13:06 that's kind of the core difference. I think there's one more thing that's really interesting that is specific to DuckDB that I'll share, which is DuckDB kind of came of age in the era of multi-core

13:17 processors, which is different than

13:22 more mature databases in the analytical space like Snowflake or Databricks, et cetera, which are kind of designed to scale out across multiple nodes. And so part of the reason why DuckDB is so fast

13:33 is it's designed to be super fast on a single node, and it has vectorization and really advanced query planning built around the fact that the laptop I'm recording this on has 14 cores, right? So I've got

13:45 all this compute available to me, and I can just break my query up into a bunch of chunks and run them in parallel. And so that's kind of the core innovation that they bring, and they bring all

13:54 that power to the fact that you can run it locally, more or less as a library, which is really cool. Nice. And I was thinking about this earlier, but can it also happen on a GPU at all,

14:06 or is that kind of outside of its scope for now? So yeah, right now it's outside of the scope. I'm sure someone's thinking about it. I know that there are lots of interesting

14:17 engines that are doing that. There was an open source one. I think it's called BlazeDB. It's not maintained anymore. I think it maybe got acquired by Voltron,

14:27 perhaps. Okay. But there's a company called Voltron Data that's doing a bunch of GPU stuff.

14:33 Yeah, it's not there, but as it turns out, we have so many cores that we'll see if it's even necessary, at least for these kinds of relational algebra problems.

14:45 I think some of the interesting stuff from our prior conversations was

14:51 being able to run it literally anywhere, even in the browser. And then what is really interesting to me is querying URLs and basically being able to ingest JSON directly. That's pretty wild.

15:05 That's really slick. Yeah. So one thing that's really cool about DuckDB is it has excellent native JSON handling. So the first thing it can do: you can use this read_json function

15:19 to read a

15:26 JSON file, or an API response. If it's a public API, you can just hit it and load it straight into your database, and it'll do one level of transformation and un-nesting when it does

15:35 that. There are more complex controls you can use too, to do your full un-nesting all in SQL, which is really nice. And then recently they added the ability to

15:45 pass authentication headers with these calls, which means you can basically do

15:52 a GET API call like you would with curl, for example, and do that all in

15:57 SQL and get a response back, which is really awesome, because it means you can take these interfaces that used to require lower levels of

16:09 abstraction and just move it all into SQL. So yeah, some really, really cool stuff there. I've done a few of these. I'm a big sports fan, and I actually run a project called MDS-in-a-box, which has an NBA

16:18 Elo model that updates every night based on real results; it does team-level forecasting for games and things. Oh, we need to talk, because I got into the

16:32 data side for fantasy football, and so I'm right there with you on the football side. So there's a football model too, actually, but I have to update that one manually. I never got the API set up,

16:42 but I should. But yeah, there's a bunch of really cool stuff that you can do, and it's all pure SQL.
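[Editor's note: a minimal sketch of the read_json pattern discussed above, via DuckDB's Python client. The endpoint and token are placeholders, and the HTTP-secret syntax is the form used in recent DuckDB releases.]

```python
import duckdb

con = duckdb.connect()

# For authenticated APIs, attach headers via an HTTP secret; subsequent
# HTTP reads send them with the request. The token is a placeholder.
con.sql("""
    CREATE SECRET api_auth (
        TYPE HTTP,
        EXTRA_HTTP_HEADERS MAP {'Authorization': 'Bearer <your-token>'}
    )
""")

# read_json() fetches the URL (a placeholder here) and un-nests one level
# of the response into columns automatically.
con.sql("""
    SELECT *
    FROM read_json('https://api.example.com/v1/games.json')
    LIMIT 10
""").show()
```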

16:49 I did used to have a little stored-procedure thing in there that I've replaced with Python, so I guess it's no longer pure SQL. But yeah, that entire project is open source, so

16:59 you can check it out at mdsinabox.com. It

17:02 does use DuckDB and dbt and a few of those so-called modern data stack pieces, just to solve a problem that was tickling my brain. So hopefully you can enjoy it too. Am I

17:14 right? You're running a Monte Carlo simulation inside of

17:19 SQL? Yeah, so it runs in two ways. So yes, there's a Monte Carlo simulation at the core of that, right? Because if we can basically say, hey, here's the probability that Team A beats

17:32 Team B, if we can express that, then we can just roll a bunch of dice and say, okay, run 10,000 scenarios,

17:41 and we combine all of these across an entire schedule of a thousand games, right?
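[Editor's note: the dice-rolling step is compact enough to sketch in SQL. The 55% win probability below is invented; in MDS-in-a-box it would come from the rating model instead.]

```python
import duckdb

# Roll the dice 10,000 times for one matchup with a made-up 55% win chance.
duckdb.sql("""
    WITH sims AS (
        SELECT s.i AS scenario,
               (random() < 0.55)::INT AS team_a_wins  -- one simulated game
        FROM range(10000) s(i)
    )
    SELECT AVG(team_a_wins) AS team_a_win_pct, COUNT(*) AS scenarios
    FROM sims
""").show()
```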

17:46 And so, yes, there's Monte Carlo there. And this ties in nicely to running DuckDB everywhere: there's also a Wasm library, which is WebAssembly, and it runs

17:55 entirely in the browser. And so there's actually a page on there where you can compare any team from any era in the NBA. That's cool. It has the '86 Celtics versus the '96 Bulls as

18:07 kind of the default. And what it does is it runs a seven-game simulation and says, okay, if these teams played each other seven times, and we run that scenario 10,000 times, which is the Monte Carlo,

18:17 it says, hey, what percentage of the time would the Bulls win versus the Celtics win? And what percentage of the time would it be a four-game series, or a five-game, or six-

18:27 game, or seven-game series, and who wins, and all that kind of stuff? And that's all in the browser. What's really cool is, if you take a look at that page, you can change the

18:36 parameters, maybe choose a different team in the dropdown. I'm a big Sonics fan, you know, rest in peace. They don't exist anymore. But you can check, like, hey, those Gary Payton,

18:45 Shawn Kemp Sonics: how do they stack up against those Celtics or whatever? And you can just change the dropdown. And it's

18:54 instant. It just calculates. It's crazy how fast it is. And that's, you know, Wasm

19:11 is just one core. And so once you can get all your cores, it's even faster. It's honestly mind-blowing, the things you can do with this little library, with database

19:13 primitives. Yeah, that was pretty amazing. NBA Jam flashbacks. Oh yeah, for sure. Yeah, pretty amazing. Gimme Shawn Kemp, yeah. Detlef Schrempf was on that team. Oh

19:21 yeah, yeah. Actually, you know those Arcade1Up arcade cabs? I have one down in the basement, an NBA Jam cab. Oh, that's nice.

19:31 Yeah, nice. I have the Shawn Kemp shoes too, the ones with almost like a mountain range on them. Oh yeah, oh yeah, that's sick.

19:39 That's back in the day. But actually, that brings me back: my daughter now, her feet are just growing at an insane rate. I think I was about third grade as well, and she's

19:48 in, like, a women's eight and a half now, in third grade. But I remember just blowing through shoes, and I think the Shawn Kemps were one of those where I

19:57 wore them for about three months and, mom, they don't fit anymore. I was gonna say, beg your mom to buy

20:03 Yeah. Yeah. So, all right, we're getting into it. You actually brought up MDS-in-a-box, so let's talk about that, because we brought up

20:15 the modern data stack a little bit. But I mean, this is an end-to-end data pipeline and visualization, all in one container essentially, right? You can containerize it, you can run it

20:25 on your own PC or server or whatever. And I think that's a super interesting concept, even just, say, if I was doing consulting or something like that, where if I had an oil and

20:37 gas database-in-a-box kind of thing. This is something people can think about: you could package it, and because it's containerized, run it

20:46 anywhere, right? Yeah. And you brought up Evidence, sorry, John. You know, Evidence is one of the visualization options you have in MDS-in-a-

20:56 box, isn't it? Sure, yeah, it's the core one right now. I used to have Superset in there too, but Superset ended up being a little bit heavy for an embedded solution. So I

21:05 ended up going away from it, but it's a great tool as well. Yeah, and I know you had dbt, but as far as the ingestion, is it Meltano now, or how do you say that? Yeah, so

21:16 there was some Meltano stuff in there at one point. That's been removed. I'm using a simple script from dlt (dltHub) now, which is a Python-based replication library. Highly, highly recommend

21:34 it. It has a great SQL Server source, by the way: very fast to get data out of there and into wherever you're trying to get it. Yeah, if anyone hires me, you know, again, and I get to build

21:45 something greenfield and something better hasn't come out, like, I'll do that instead of Fivetran. Sorry, Fivetran, peace. But look, Fivetran is partnered with MotherDuck. Like,

21:55 we love what they do, you know, the simplicity and the abstraction layer they have is excellent, but, you know, sometimes you need a little bit more, or, you know, if you're cost constrained,

22:06 it can be a little hard to do that, or maybe you're running something, you know, out in the field, right, where it's like, you're not connected to the cloud. Well, you know, what can you use

22:18 in that case? And yeah, the MDS-in-a-box primitives, because it's open source, I kind of built it with the idea that you can stick that thing

22:27 anywhere. And I think this is actually a nice segue into what I was doing before I was at MotherDuck. I was running the finance and ops team at an IoT data platform called Symmetric. And

22:38 so we were sitting on top of the carrier networks and ingesting a bunch of data for customers, and then repackaging it and giving them automation rules and things

22:45 that they could use to actually make changes to their deployments back on the carrier side. So in some ways MDS-in-a-box is just a stripped-down version inspired

22:56 by that type of work. A lot of the time when you're building a data platform, you're thinking about segmenting it by customer or by types of devices or

23:05 whatever. And these problems that look really big and intractable, if we segment them down or partition them small enough, now it's like, oh, we can actually fit

23:14 that whole thing on, like, a Mac mini, right? Which is crazy. Yeah. And I think, for John and I, you know, we have experience with this and we bring

23:24 it up quite a bit. But say, in oil and gas, a frack van or even a drilling rig: these are remote locations. The cloud is great, but if you lose that internet

23:33 connection and you're utilizing that software for critical operations, you might have to shut down operations. But with something like this, you

23:44 can run it

23:46 on the edge or on-prem or in the cloud, and then you have that same functionality locally that you have in the cloud. I mean, that becomes even more powerful. Yeah, totally. I think

23:58 there's actually some really interesting stuff that this guy Brian Chambers at Chick-fil-A has written about hybrid cloud

24:05 stuff, where they actually have, in every Chick-fil-A, like, three little mini PCs. Yep,

24:14 they've got Kubernetes clusters. Yeah, I think it's K3s, but yeah, okay. So you guys are familiar. But those things are designed to predict, here's how many fries you need to put in based on how

24:25 long the line is in the drive-through or whatever, and they work if the internet's down. And all they're really doing is sending aggregate data back to the cloud to improve

24:34 the statistical models and things. That's definitely - The POS is even on it, to some extent. Because in Houston, we have hurricanes, and the power goes out when the

24:45 hurricanes come, and there are very few places. I worked for an edge startup before I came to DW. And so

24:54 some of these, like, people in our industry aren't thinking of the edge as robustly as I think they should be. Look at Chick-fil-A, right? I would love to have Chick-fil-A on the pod,

25:04 by the way, if they are listening, but they're doing a

25:09 masterclass in it. It really is fascinating. I mean, they were already operational savants to begin with, and then they layered this tech on. And you'd think,

25:18 like, you know, I ordered Chick-fil-A for lunch on my phone yesterday. You put in the type of vehicle and the color, and now they've got the cameras outside that can see there's

25:28 a gray truck in line, and a guy with a gray truck just ordered food, make sure it's ready. They're doing a lot of really cool stuff that I don't think a lot of people outside of the direct

25:40 tech community are even aware of. Yeah, I mean, their stuff is crazy. But isn't that kind of like what we've said, even just about AI now too: if you're doing a

25:52 good job with it, people shouldn't even know you have it in the platform. It just makes the user experience better. Yeah, that's right. And even if you're

26:01 designing a web page, people don't care if you use Angular or React or whatever JavaScript. Did it work? Did the button do what I wanted, and was it fast? You know, and the same thing applies

26:10 across all different spectrums: just make the user experience better. And that's what it's all about. Yeah, totally. I think you hit on

26:21 something really interesting there that we've seen with DuckDB too, on the MotherDuck side, which is: because it takes all these really nice abstractions and brings them to pure SQL,

26:34 it's maintainable and iterable across groups. It doesn't matter if it's a business user or a data analyst or a data scientist or a data engineer or even a software engineer; the

26:43 primitives are understood across those groups. And so what that means is if you have a hard business problem, getting people around the table and speaking the same

26:55 language now becomes a much simpler problem. It's not like, oh, this team uses this, this team uses that. It's like, no, we're all jamming on

27:04 SQL. And by doing that, it really means the things that matter to our business, we can get at faster. And I think that's a really cool thing. And obviously,

27:15 one advantage of SQL today, too, is that there's 50 years of training data in every large language model, right? As much as there are things that are kind of crappy about SQL,

27:29 right: it's written in the wrong order, the query plans are not deterministic, if you're not using DuckDB it's slow, you have to install a really

27:39 heavy server somewhere and access it from a client, et cetera. All those things are pretty bad trade-offs. It's very verbose as well. But with

27:52 DuckDB, it's less verbose, it's faster, there are libraries. All these things that are traditionally challenging with SQL become less of a

28:02 problem. And in fact, about 23% of users that use DuckDB identify as software engineers. So it's not just data people using it, which is excellent. You know, I think it might be the

28:14 first time that I've heard a software engineer not hate

28:19 SQL. That is like the tagline right there. Yeah. Like, genuinely, right? John, we were talking to someone recently, or maybe it

28:29 was someone else, about how you would assume a lot of software engineers know SQL, but it's really not the case, because of ORMs and everything. They end up having it written by the

28:39 ORM, and they never actually have to go write SQL on their database. Yeah, totally. I mean, that tracks with my experience. And a lot of it, too, is query optimization, which is

28:53 something, for those of you that are on SQL Server, you're probably very good at, just in a more constrained environment. And I've definitely seen a lot of SQL written by

29:02 software engineers where I'm just like, oh, well, let me fix that, let me add some tables there, let me fix a few little things here, and now it takes, you know, a second to run.

29:11 And the pushback is, why do I need to know how this thing thinks about it? Why doesn't it just execute the thing I wrote? And it's like, well, there are really good

29:20 reasons for that, but also, totally understood. The thing I would say about DuckDB is it's the first time I've seen, repeatedly, that the query planner is good enough that they're

29:30 like, okay, this is fast enough that I don't care. I'll actually write SQL, because an ORM is actually going to make it slower. And so I think there are some really

29:40 interesting things happening there that are only possible with a really great library like DuckDB. Well, I think the other point there is, and I feel like I've kind of

29:51 learned this in my product manager journey. We've built software across a bunch of different languages and stuff, but I don't think enough people appreciate that, yeah,

30:04 SQL's been around forever and it's robust and there's all this documentation, but from a maintenance and longevity perspective, it is infinitely easier to go find someone to hire that knows SQL

30:16 than, you know, Rust or something. I know those aren't direct comparisons, but right: the new shiny, sexy thing is what people always chase, and then you go start building

30:26 and you're like, well, shit, I can't hire anybody, because there's only 50 Rust experts in the world or whatever, you know. Or you can, but you can't afford them. Right, exactly, that's my point,

30:35 right? There's a lot of secondary, tertiary value in using these very robust, well-known languages in as many places as you can, just because of the longevity

30:50 perspective, the maintenance perspective, right? Like, Bobby and I, and I'm sure it's similar in other industries, it's always the build-versus-buy question. Oh,

30:59 yeah. Should we build it ourselves? Well, if we build it ourselves, how are we gonna maintain it? Well, if it's SQL, that's a pretty easy thing: you can find people to maintain it for you

31:07 instead of some new abstract thing. Totally. So, and I think we haven't necessarily gotten too far into it here. I'm trying to remember what we covered the first time versus

31:18 now that we're on here. But when we talk about being able to query the data where it lives - Oh, sure. - I think for a while there

31:27 I was working on a little project that died on the vine because I didn't have time for it. I'm not sure if you're familiar with Immaculate Grid or not. It's like a three-by-three

31:36 game. Me and a couple buddies. Shout out to Jeremy Funk and Brian Becker, friends of the show.

31:42 We had a little text group where we'd share our results, kind of like Wordle, for people who don't know it. And I was trying to come up with

31:51 an Immaculate Grid solver kind of thing. And I wanted to play with some technologies that I hadn't had a chance to play with at work. So, could

31:59 I use LLMs to help write the queries, and then use DuckDB to query the data in place? And I found the Lahman dataset would be one I could use to probably answer

32:10 most of the questions. And

32:14 so I was working on it. But what I'm getting to is the fact that you can query a URL. I can say: select these columns from github-dot-blah-blah-blah-

32:26 slash-fielders.csv, and it'll pull them all, and it'll filter them and do all its stuff, and I never have to move that CSV onto my machine or anything. It just queries it in place and

32:39 it's super fast. I think the first time you do stuff like that, you can't believe that it actually works. You have to run the query twice.
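[Editor's note: a minimal sketch of querying a CSV in place over HTTP. The URL and column names are placeholders standing in for the fielders file mentioned above.]

```python
import duckdb

# DuckDB fetches the file over HTTP and queries it in place, no download step.
duckdb.sql("""
    SELECT player_id, COUNT(*) AS seasons
    FROM 'https://raw.githubusercontent.com/<org>/<repo>/main/fielders.csv'
    GROUP BY player_id
    ORDER BY seasons DESC
    LIMIT 10
""").show()
```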

32:48 I've heard that from multiple people, where it's like, you run the query, and the person is like, wait, hang on, why did that take, like, you know,

32:54 500 milliseconds? I know databases; that query doesn't take that long. You run it again, and you're like, what? Yeah, totally. You know, it's super interesting. And like, one

33:05 thing that's interesting, too, in terms of querying data at rest: CSVs in general are pretty large. They're not very well compressed. And so one thing that we can do

33:16 with DuckDB, that is relevant to think about for the IoT space, is, you can do a select star from a CSV and it'll keep the data in the CSV. But you can also do just

33:27 a create-table-as, and it'll take that data, pull it in, and save it in the DuckDB format. And that DuckDB format is

33:37 roughly five times more compressed than Parquet, and Parquet is probably five to ten times more compressed than a

33:43 CSV. And so what that means is if you're going DuckDB to DuckDB, or DuckDB to MotherDuck, and maybe you're in a bandwidth-constrained environment, it's very easy for you to write a small amount of

33:55 data up to the cloud in a way that understands the constraints of your environment, right? And so I think there's some interesting stuff there where you can do a bunch of

34:06 analysis on maybe a big dataset generated by a bunch of sensors out in the field, and then either get all the raw data up to the cloud, or aggregate

34:17 and compress at the source, and then send it up in a tabular, database format, the DuckDB format, up to the cloud. And we've seen some really

34:25 interesting stuff there too. And it runs well enough, too. I've seen people, I think they might do this in the Chick-fil-A scenario too, where if the network's not available, it just keeps

34:34 buffering the data that's coming in, and then once the network's available, it jams everything up to the cloud, like MQTT.

34:42 Yeah, yeah, correct. Yeah, yeah.
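[Editor's note: a minimal sketch of the compress-at-the-source idea, aggregating a hypothetical sensor CSV into a single-file DuckDB database that can be shipped up when bandwidth allows.]

```python
import os
import duckdb

con = duckdb.connect("field_readings.duckdb")  # single-file database
con.execute("""
    CREATE OR REPLACE TABLE readings AS
    SELECT sensor_id,
           date_trunc('hour', read_at) AS hour,
           AVG(value) AS avg_value
    FROM 'raw_sensor_dump.csv'   -- placeholder file and column names
    GROUP BY sensor_id, hour
""")
con.close()

# The .duckdb file is what you'd sync to the cloud when the network is up.
print(os.path.getsize("field_readings.duckdb"), "bytes to ship")
```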

34:46 So there are some interesting scenarios there too. Because in order to get really fast querying, you actually really care about compression: you want to get as much

34:54 data as you can into your RAM and into your CPU, right? And those buses are not very big. And so the better your compression is, the faster your queries can run. And so it becomes

35:06 this really virtuous cycle of, hey, the better compression is, the faster we can run queries, so let's figure out how to make compression better. And so they've implemented a bunch of cool

35:12 tricks that we take advantage of in MotherDuck too, to make these things super fast. It's just like Silicon Valley with Pied Piper, right? Yeah,

35:21 I mean, yeah, for sure. It's definitely a virtuous loop, and it unlocks a bunch of stuff if you can figure out compression. Yes,

35:30 and that brings up something I want to hit on too, because we're going to start talking about passing data around. I think one thing that's going to be very interesting over the next three to five years is

35:40 hopefully a shift from ODBC to ADBC, the Arrow-based one. Oh, sure.

35:49 Can you expand on that? I can talk high level about it; I imagine you're a little more in the weeds on it from your side of the world. Yeah, sure. So

35:59 that actually aligns really nicely with what we talked about earlier with OLTP versus OLAP, in terms of rows versus columns. The cleanest way I've thought about ODBC, which is kind of the

36:12 traditional way to access data, is that it's row-based - JDBC is similar, right? Yeah, JDBC also, I think, is row-based. Yeah,

36:22 I don't know if that's exclusively true, I'd have to Google it. But nonetheless: ODBC versus ADBC. ADBC, you know, is based on Arrow, which is a very performant format, and

36:38 it's columnar.

36:40 So what that means is you can push your query down a little bit into your source layer and only get the data you need, whereas with a row-based interface, if you're

36:53 doing aggregates, you have to get all the rows.
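[Editor's note: a minimal taste of the columnar hand-off. This sketch uses DuckDB's built-in Arrow integration rather than an actual ADBC driver, but the column-at-a-time interchange idea is the same.]

```python
import duckdb
import pyarrow  # required by .arrow(); pip install pyarrow

# Query results materialize as an Arrow table: columns, not rows.
tbl = duckdb.sql("SELECT i AS id, i % 7 AS bucket FROM range(1000) r(i)").arrow()
print(tbl.schema)

# And the reverse: DuckDB scans the in-memory Arrow table directly.
print(duckdb.sql("SELECT bucket, COUNT(*) FROM tbl GROUP BY bucket").fetchall())
```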

36:56 And so there's more to it than that. Again, I'm generalizing maybe a little bit too much, but yeah, there's some really cool stuff you can do with ADBC as it gets broader adoption.

37:10 I'd love to see an ADBC Excel connector. I have not yet seen one, but someday. Isn't that something where people have run DuckDB even within Power Query? You could technically do it,

37:22 right, or something along those lines?

37:27 Yeah. I'll talk about this briefly. So

37:30 DuckDB is just a single binary. And so you can install it almost anywhere: Linux, Mac, PC, you know, npm, inside Python. So you can just do pip install duckdb and you have a

37:41 full database running inside Python. I think it's like - That's crazy. - a 20-megabyte executable or something. It's not very big. And so

37:52 what that means, though, 'cause I believe Power BI, for example, can execute arbitrary Python in Power Query in the Power BI service, is that you can build a Python script in Power BI that uses DuckDB,

38:03 passes the data back as a data frame, and therefore you can then use all the DuckDB primitives.
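[Editor's note: a minimal sketch of what such a Power Query Python step could look like. The API URL and columns are placeholders; Power BI picks up whatever DataFrames the script leaves behind.]

```python
import duckdb

# Power BI's Python step ingests DataFrame variables left by the script.
league_stats = duckdb.sql("""
    SELECT team, SUM(points) AS total_points
    FROM read_json('https://api.example.com/fantasy/scores.json')
    GROUP BY team
""").df()  # .df() returns a pandas DataFrame
```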

38:17 That's interesting, right? So

38:22 I got into Power BI because of fantasy football, of course.

38:26 One of the annoying things about Power BI, if you're a citizen coder, is that I don't want to go set up a fucking database every time I want to query fantasy football stats. And so it's

38:39 like, you can call the APIs from Power BI, but then you've got to loop through pagination and nesting, and all that fun stuff is not great to do in Power BI. So with something

38:50 like this, I could just spin it up in a Python script into a database. That's pretty slick, and you never have to mess with it again. Yeah, and you can even do a pre-processing step where you just

38:59 rip your data down into a Parquet file or CSV or whatever, because actually the Power Query and Excel CSV reader is excellent, right? It's literally best in class, it is so

39:11 fast. The DuckDB one is also very fast; they benchmark very favorably against each other. So yeah, you can do all your pre-processing in a Python script. You know, one thing I was

39:21 talking about earlier, like, you know, LLMs are great at SQL. They're also great at Python.

39:27 If you can think a little bit about how a computer works

39:33 and reverse-engineer what you're trying to do, you can express that as a prompt and then get 90% of the way there really rapidly. It hallucinates a little bit here and there. You still need to, I

39:45 think one thing that's been really beneficial for me, as someone coming from an accounting background, is: obviously I can use this code that's generated from an

39:54 LLM, but the thing I really need to do is make sure that my tests pass, right? Which is just auditing. It's just auditing data. It's like, okay, great, let's write some tests. If it passes the

40:03 tests, great, we can use it. If it doesn't pass the tests, we can't use it. And so that's been very helpful, especially because then you can include those assertions in

40:12 your prompt and say, okay, make sure it passed this test or whatever. And that framing has been very beneficial for me, for sure. That's a good tip right there. There you go. Including the test

40:22 with the prompt, right? I use LLMs all the time. I use them to code Python almost exclusively. And there's still lots of people that are like, these things suck, they're not

40:34 going to do blah, blah, blah. And I'm just like, you're not giving it enough context. So giving it your tests, that's another really good one, right? I tell people to give it the

40:42 documentation, pass it the GitHub repo. There's another one I just saw where, I think, you change the 'g' in the GitHub URL to something else, and it turns the entire repo into

40:54 plain text that you can pass into a language model. And I'm like, that's really slick. But anyway, the crazy thing, I think, is I don't even necessarily need Power BI, because I can just put

41:08 in the API into the query, and it can pull it directly, parse the JSON, and return it directly into the database. And then I can just use Python for the visualization, right? I don't need - Yeah,

41:19 totally, I mean, that's exactly right. I think one of the cool things

41:25 about DuckDB specifically is, if you can install a Python package, you can install DuckDB, right? It gives you all the database stuff without having, you

41:34 know, gatekeepers in the way. Yeah. And because it runs all locally, if you mess something up, it's just your computer. You're not

41:42 deleting records from a production database, you know? And so it's a really great way to actually learn and pressure-test, and all these things that are really big blockers,

41:53 especially as you work your way up the stack and get closer to the product. And it's like, okay, well, how do you learn these things, especially for production, ERP-type

42:03 planning systems? We need more people who can write and read SQL, in my opinion, and it's a great pathway in there too.

42:13 So we kind of get on to the AI thing, and I want to get into that. And you tell me which one you want to tackle first, 'cause some of it segues a little more, 'cause it's more specific to Mother-

42:21 Duck, I think. So I don't know if you want to talk about what MotherDuck is, who you all are, and how you fit in that DuckDB ecosystem, and then we can jump into all the different kind

42:33 of value-adds there. But again, you guys have a few things: what was it, almost exactly a month ago, with the prompt function. Yeah, yeah. And then also, you know, the

42:44 text-to-SQL stuff. Yeah. So yeah, all right, I'll talk about a couple things there. So what we're doing at MotherDuck: we're building a cloud service on top

42:52 of DuckDB that is effectively a data warehouse. And

43:01 so we are strong believers that most data can be processed on a single node, and that single nodes are only getting bigger. And a node is just a single computer, right, even in the

43:10 cloud. Here's a quick quiz: how big do you guys think the biggest instance you can get

43:20 on AWS is, in terms of number of cores and terabytes of RAM? I'll let you take that one, Bobby.

43:28 I've never even tried to go that far, 'cause I think I'd be afraid to even click on it.

43:37 But you looked at it, so we're gonna go ahead and click on it. Yeah. I don't know how Azure or Google Cloud stack up, but the biggest node on AWS is 32

43:50 terabytes of RAM: 32 terabytes of RAM and 896

43:55 vCPUs. Holy shit. Now, I think those are built specifically to support SAP HANA. Sounds right. So I don't think you can just buy them and run DuckDB on them, for example.

44:10 By the way, DuckDB doesn't quite scale out to that number of cores yet. Its sweet spot is like 32, 64 cores, somewhere around there. But that's the point:

44:22 there are huge nodes available, and they're only going to get bigger. We can fit huge data problems into these things. And even a relatively smaller node, with, say,

44:34 128 CPUs and a terabyte or two of RAM, fits very nicely in the notion of where we really see the sweet spot of where it fits in.

44:47 And so, yeah, we're building a cloud service on top of that. We have our own little UI. It's also very simple to use from the DuckDB command line. So, DuckDB has a binary installer. You can

44:56 put it anywhere: Mac, PC, Linux. Well, not everywhere; you can't put it on Alpine Linux yet, hopefully soon. Anyways, you can install it on your local machine, and then you just type

45:09 duckdb and then md: in your command line, and now you're connected to MotherDuck, and you can run queries both locally and in the cloud. And what's really cool about that is

45:21 what's called dual execution. So I can take data that's maybe in MotherDuck in the cloud, and I can bring it locally. I can do a create-table-as-select and move that into my local

45:33 database. And now it's free for me to query from there. It doesn't cost me anything, 'cause I'm using my local compute. And what that also means is I can do something like take a CSV that's

45:41 maybe some local data and join it with data that's in the cloud very, very easily. So, I like to think about a marketing example here. Someone gives me an email

45:51 list. Okay, I've got a list of 2,000 emails from some conference, right? And I have a HubSpot or a Salesforce account with a bunch of emails in it. Okay, well,

46:00 which ones are new? Yeah, you've gotta do a bunch of stuff to clean it up, make it all match right, and then you can join

46:08 them together. So that's something you can do all in pure SQL, the entity-resolution kind of stuff, and the functions that are available in DuckDB are excellent. They're super fast. And

46:21 yeah, that's what we're building at MotherDuck: this dual-execution database that extends into the cloud and can run queries both locally and in the cloud. And just as a

46:35 technical example of that, I do a bunch of testing with the TPC-DS dataset at scale factor 100. So that means it's a few billion rows and about 40

46:46 gigabytes of Parquet files. To run the full suite on my laptop, which has like 32 gigabytes of RAM, takes about 20, 25 minutes. To run that in MotherDuck, it's about five minutes.
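[Editor's note: a minimal sketch of the dual-execution flow, assuming a MotherDuck account with the MOTHERDUCK_TOKEN environment variable set. Database, table, and file names are invented.]

```python
import duckdb

con = duckdb.connect("md:")  # one connection sees cloud and local data

# Pull a cloud table into a local, free-to-query copy (create table as select).
con.execute("ATTACH ':memory:' AS local_db")
con.execute("""
    CREATE TABLE local_db.contacts AS
    SELECT * FROM my_cloud_db.crm_contacts
""")

# Or join a local CSV directly against the cloud table, e.g. the
# "which conference emails are new?" example above.
new_emails = con.execute("""
    SELECT c.email
    FROM 'conference_emails.csv' c
    LEFT JOIN my_cloud_db.crm_contacts crm
           ON lower(c.email) = lower(crm.email)
    WHERE crm.email IS NULL
""").fetchall()
```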

47:02 So a lot of that is just, well, there's more RAM available. Actually, cloud CPU speeds, as it turns out, are not great. It really becomes a function of how much RAM is

47:12 available. And so for things like that, there are a lot of really clever things you can do. And again, it's all seamless. Like I was saying earlier before you guys cut out: in my dbt project, I just

47:22 put the S3 path for where the files are located, load them across into my MotherDuck instance, and then run all the queries. And it's super fast. It's awesome. So, yeah. Well, and I

47:34 remember you commenting, 'cause I think I quoted the blog in a LinkedIn post, it was almost exactly two months ago, but I think it was George Fraser from Fivetran who put out something

47:45 basically pointing out that most queries are small. Yeah, yeah. I think it was looking at Snowflake and Redshift, and, I'm reading it

47:53 here, it says of queries that scan at least one megabyte, the median query scans about 100 megabytes. Yep. And the 99th percentile is 300 gigabytes. So you see all these benchmarks

48:02 on terabytes or petabytes of data or whatever they want to talk about, and no one actually does that in real life. Yeah, totally. That's our thesis. Jordan, our

48:11 CEO, wrote this great blog post called 'Big Data Is Dead.' If you type in 'Big Data Is Dead,' you'll find the post, where he looked at that data as well, some of the same stuff that George looked at, and

48:22 yeah, the reality is that when you

48:26 do query data, you're not querying that much. Part of

48:31 the story there is

48:35 that when we came into the big data era, let's say 2004-ish,

48:41 the underlying assumption was you just needed to make all your software run across multiple computers at a time, and that's this notion of massively parallel processing, that a bunch of things like

48:49 Spark, for example, are excellent at, right? Really nice abstraction, but as it turns out, there's a lot of overhead to running systems that way. And so we're kind of returning

48:58 back to the single-node era with what we're doing with DuckDB, and part of that is just because, well, actually now

49:06 the amount of compute and storage available to us has exceeded, or will soon exceed, our ability to create data, which is a crazy thing to say. I would not have believed it 10 years ago. You

49:19 know, high-school me, buying a 256-megabyte flash drive for $150, does not believe that. Yeah, correct. I remember

49:30 in 2005 it cost $2 per gigabyte for a hard drive. Now it's one cent per gigabyte for a hard drive, right? Like, it's insane. You can fit

49:46 like,

49:48 you know, a spinning disk drive. There are 36-terabyte spinning disk drives now. A single drive. Like, what? It's insane how these things have scaled.

50:01 And so really what that means is that single node is an insane benefit for us, right? Because we don't have all the coordination overhead to compete with. And, for better

50:12 or for worse, systems like Snowflake and Databricks,

50:17 excellent, excellent databases, right, are designed around this notion of running against a bunch of nodes at a time. And so they have extra coordination expense that we just

50:27 don't have. So that's horizontal scaling versus vertical. Yeah, we just scale up. We only scale up. And that's not exclusively true, but for the purpose of

50:38 just a thought experiment, that's right. We're working on some scale-out-on-read type stuff where you can spin up multiple nodes and read against the same database at once. That'll be out

50:48 relatively soon, which is really cool. But yeah, the thing is, single machines just keep getting bigger, and we'll just focus on making sure that our system scales up to the biggest possible nodes. And

50:59 if we can do that, we believe that we can give you a faster experience, a better experience, and we can deliver it for cheaper. And that's really powerful. So I'll just talk about what my

51:11 hesitancy is. It's probably more ignorance than anything. Sure. Just with MotherDuck versus Snowflake or Databricks, it goes back to the driver question of, how can I connect my BI

51:22 tools to MotherDuck? 'Cause I mean, I guess Snowflake has excellent drivers and stuff like that. So just, what does that look like if I wanna connect? Yeah. You know, Power BI is obviously

51:33 probably one that you get asked about often, but in oil and gas, TIBCO Spotfire is one, and I'm sure they don't have a native connector for MotherDuck at this point. We're happy to build one. So

51:42 we definitely recognize that a big part of it is the ecosystem, right? So on the ingest side, we have, for example, a native connection with Fivetran. We just - Good

51:51 start. Yeah, yeah, exactly. And where does that land? I mean, so then I'm thinking like, y'all are hosting that in AWS or various cloud providers and it's going to blob storage, I'm assuming,

52:01 or? I mean,

52:03 yes, but it's in MotherDuck's native format, which is

52:09 kind of a differential storage layer on top of the DuckDB format. Okay. So we get to take advantage of all the really nice things that DuckDB has, but then also take advantage of things like

52:20 zero-copy cloning, by building a clever storage format. So yes, it goes into the MotherDuck native format as the substrate from Fivetran. We have an Airbyte connector as well, you know, if you're

52:28 in that world. There's a bunch. We

52:32 have an ecosystems team that's very focused on building all of these out. Do

52:37 we have everything yet? No. Are we continuing to build it out? Yes. Do we have a Power BI connector? Yes,

52:46 one that we built, right? So we recognize that the interop is definitely a key part of that, and we're building it. This is actually a nice segue into kind of another thing

52:56 we're thinking about, which is: we recently announced this project called pg_duckdb, which is running DuckDB inside of Postgres, and that can connect to MotherDuck as well. And so really what

53:08 that means is anything that can connect to Postgres can also connect to MotherDuck. Now, it's very early; it's alpha software today. But we're continuing to work on that in partnership with

53:18 some other folks. Walk me through that real fast, 'cause our app is on Postgres. This is why I'm asking. Sure. And I'm using Fivetran to move that to Snowflake

53:31 currently. Yeah, sure. Right. So walk me through it, just as a potential user. Yeah, sure. So if you're on Postgres, there are multiple ways to do it.

53:42 Again, this is kind of early innings here, but let's talk about it. So we've got pg_duckdb, which installs DuckDB

53:54 inside of Postgres and does a whole bunch of really interesting things there. We're still working on things like reading Postgres indexes, for example; we don't read indexes when we

54:04 read data from Postgres with pg_duckdb. It's very early, so I'd be hesitant to advise you to run it on a production database today. But like, if I'm trying to do

analytics on the Postgres data, right? So it's a social app, think of it as a Reddit, but for the energy industry. So it's a relational database, but there's

lots of events that happen inside that relational database, with data attached to them and all this other shit, you know. Yeah, yeah. So the trick there is that because it

has MotherDuck connected, you can very easily send data over the wire to MotherDuck and then not burn all of your prod server's resources on analytics.

You can basically just write the query, it'll go out to MotherDuck, and the rows get returned back to you in your

Postgres-native format, so that you don't have to think about it. But there's a whole bunch of really cool stuff that team is working on. Like, again,

being able to read directly from the native Postgres format very fast. That's awesome. You know, it's not a foreign data wrapper. So it's not like a linked server in SQL

Server, where it's a separate server and there's really no predicate pushdown? No, we're trying to solve all that stuff. It's very hard, it's gonna take a long time.

The next thing is, there's actually something called the Postgres scanner for DuckDB. It's an extension: you just type install postgres and then load it, I

believe. And then you create a connection to your Postgres database, and you can query it with DuckDB on a separate machine. So like if you're on your laptop and you want to query

55:52 Postgres, you can use it just like you'd use the Postgres command line.

So that's really cool, because what it also lets you do is things like create table as select. It'll pull the whole table across into DuckDB, and then you can do all your analytics

and DuckDB queries, right? And you can do that against MotherDuck as well.

So those are kind of two easy options. Again, all the other things still work, Fivetran from Postgres to MotherDuck, things like

56:20 that.
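[Editor's note: a minimal sketch of the Postgres-scanner pattern Jacob just described, using DuckDB's postgres extension from Python. The connection string, schema, and table names are placeholders.]

```python
# Sketch: query Postgres from DuckDB, then pull a table across for local analytics.
import duckdb

con = duckdb.connect()
con.sql("INSTALL postgres")
con.sql("LOAD postgres")

# Attach the Postgres database; its tables become queryable as pg.<schema>.<table>
con.sql("ATTACH 'host=localhost dbname=app user=analyst' AS pg (TYPE postgres)")

# CREATE TABLE AS SELECT pulls the whole table into DuckDB's own format,
# so follow-up analytics queries never touch the prod server again
con.sql("CREATE TABLE events AS SELECT * FROM pg.public.events")
con.sql("SELECT event_type, count(*) AS n FROM events GROUP BY ALL ORDER BY n DESC").show()
```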

There's a really interesting project that I've seen right now from one of the guys at LiveRamp called Universql. It's basically a DuckDB cache on top of Snowflake, so

you can pull down a thin layer of data from Snowflake, do a bunch of DuckDB stuff on it, and then write it back, as an example. But there's a whole bunch of early

stuff there. For replication, I think there's a reason the reliable tools are the ones we know and love: because

those problems are really hard, and when you run into them, it feels like a waste of time, so it's very hard to dedicate resources. I will say we've had a lot of success recently with

Estuary (estuary.dev) as a kind of CDC replication partner from Postgres. We had David on, what, a month or two ago, John? Yeah. Oh, cool, that's awesome. Yeah, so we've actually seen a lot

of customers using that. And I remember, I've done some SQL Server CDC, probably five years ago, and it was so hard. So I think the abstractions

have gotten tremendously better recently for those types of replication. So yeah, there's a bunch of options. The one that I would recommend to people who have Postgres today is use

the DuckDB Postgres scanner. It literally can just read from it; just think about it as a little analytics cache on Postgres. And you still have to do all the reads. Yeah, you still have the cost

of reading the data, right? So you have that cost regardless. Yeah, well, it just means that your first read will be the same speed as reading it from Postgres directly,

but the thing is, once it's in the DuckDB format, you get all the advantages of the DuckDB stuff, and you can just run queries on the table there. Right, but I'm also not querying

58:14 anything in Snowflake at that point either, right? Like all the analytics queries are happening in DuckDB, so I'm not getting charged from Snowflake. Yeah, correct, correct. Yeah, especially

if you're doing stuff like, you know, one thing I really like about DuckDB, too, is that there's a lot of work, especially in the more data science-y workflows, where

you're not sure if it's gonna work. And so,

it's like, okay, I'm gonna run this on Snowflake, I'm gonna crank this up to a 2XL, and it's gonna take three hours. You're like, do you wanna do that? Yeah. Don't wanna do that.

But if I have a DuckDB set of queries or a Python script that takes three hours, first off, congrats, I don't think I've ever seen one that takes that long.

But it doesn't cost you anything, which means you can take bigger bets

in terms of the things you're thinking about. If I can get this to work, then I can optimize it. And

it's effectively free. And then there's always MotherDuck to offload to, if you need it, right? And so I think that's a really interesting pattern, because I think a

lot of teams on the data side are a little gun-shy right now. There was a ton of hiring; data scientist was like the sexiest job,

you know, that was written up by HBR, right? Or Time magazine, one of the two. Yeah. So

there was this big promise that, oh man, our businesses are gonna be so much better, it's gonna rule, and then actually everything stayed exactly the same. Yeah. So now

it's like, wait, we have this data team, but are we better off? What's the ROI of this thing? And okay, those are really hard questions;

when you get in that situation, it's hard, right? Maybe you hired the wrong people, maybe expectations were wrong, whatever. But if I can say,

1:00:06 hey, I can bring in this tool, and I can actually de-risk our entire team because we can run it on either a small server or our laptops, and then I can take some really big bets and see if these

things work. I think that's really compelling, because a lot of people have been burned by Snowflake bills just going up every month, and their businesses don't really

have a good sense of whether things are improving or what's happening. And so I think there's something to that too. Yeah, I think the thing there, especially with our

1:00:37 industry, we've got a lot of engineers, right? We're one of the most technical workforces in the world. And the engineers are the SMEs on the data, but typically IT gets involved and kind of cuts

them off from access to that data. And so, in energy, everything is thought about from a risk perspective first, because instead of the app not loading, a catastrophic failure means

people die; there's Macondo, there's all kinds of bad things that can happen. Which is good. But the flip side is it disincentivizes people to do new things, take risks and all of

1:01:15 that. And so just imagine this world where it's like, hey, SMEs, you can have this abstracted layer or access to this layer to do your analytics and to test all kinds of different stuff on. And

it's not costing us money in Snowflake every time we touch it. Yeah. And I think this brings up a pattern that I maybe saw you talk about

even a year ago, Jacob: now with DuckDB and dbt, and they all run locally, you could really do an ELTL.

Even if you wanted that cloud thing to serve it out, you know, even just load it into Snowflake so that it's a nice fast way for people to interact with that

final gold layer, if you're doing the Medallion thing. But, you know, why can't I just store all this stuff, even running on a local server again, because it can go single node.

And I don't need to run my dbt on those cloud servers all the time, or on Snowflake. I mean, you could run these transformation things, and now we're coming full circle

1:02:16 then right back to - Yeah, oh, absolutely. We're definitely going full circle, yes. Yeah, yeah. I mean, I think one thing that I love about it too is that like, it really brings tools that

previously were out of reach to, you know,

like you're talking about oil and gas SMEs, engineers and geologists and stuff like that, right? What if they could have all that power?

Like, they're probably already using Python because, right, pandas is very convenient. Well, what if we can supercharge that? DuckDB is literally a hundred

times faster than pandas. And you can read your pandas data frames from DuckDB too; I want people to know that. Yeah, you can, it's amazing.
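[Editor's note: a minimal sketch of querying a pandas DataFrame directly from DuckDB, as mentioned above. The data is invented for illustration.]

```python
# Sketch: DuckDB reading a pandas DataFrame in place (replacement scan).
import duckdb
import pandas as pd

df = pd.DataFrame({
    "well_id": ["W-1", "W-1", "W-2"],
    "boe": [100, 120, 80],
})

# The local variable `df` is visible to DuckDB by name; no copy or load step needed
duckdb.sql("SELECT well_id, sum(boe) AS total_boe FROM df GROUP BY ALL").show()
```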

So I'll send you guys some stuff. Actually, what I was working on yesterday is some basic replication patterns. And because SQL Server is the database I know best,

I'm actually starting with SQL Server. So I wrote something yesterday where I spun up SQL Server on my Mac. This is where I discovered its slightly different dialect.

And loaded AdventureWorks into it, which I'm sure you guys have seen. And then, yeah, I wrote a couple of lines of Python, and now that data's in MotherDuck, right? You can just

use pyodbc, the library that's officially supported by Microsoft, return the result as a data frame, and then load it straight into MotherDuck.
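[Editor's note: a sketch of the SQL Server to MotherDuck hop Jacob describes. The driver name, credentials, and table are placeholders, and 'md:' assumes a MotherDuck account and token are configured.]

```python
# Sketch: SQL Server -> pandas -> MotherDuck in a few lines of Python.
import duckdb
import pandas as pd
import pyodbc

# Read from SQL Server with Microsoft's supported ODBC driver
mssql = pyodbc.connect(
    "DRIVER={ODBC Driver 18 for SQL Server};SERVER=localhost;"
    "DATABASE=AdventureWorks;UID=sa;PWD=<password>;TrustServerCertificate=yes"
)
df = pd.read_sql("SELECT * FROM SalesLT.Customer", mssql)

# Connect to MotherDuck ('md:' prefix) and load the DataFrame as a table
md = duckdb.connect("md:my_db")
md.sql("CREATE OR REPLACE TABLE customer AS SELECT * FROM df")
```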

That's how easy it is today. I'm trying not to just say, oh yeah, it's so easy, but for those of you that were here 10 years ago, you'd know how hard this was. Yeah.

Like I said, if you were doing it on a UNIX-based system, you wrestled with all the different ODBC drivers just to get SQL Server connected, and had to mess with all of that. I

mean, it's just insane. Yeah, I actually was surprised it worked basically the first time. It's crazy too, something we learned on one of our

podcasts: we had this guy John Archer on, who's from Red Hat. They've actually got an office right across the street from where John and them are, an innovation center with Chevron

or whatever. But apparently SQL Server runs faster on Red Hat than it does on Windows. I'm not surprised. I mean,

1:04:40 there's lots of crazy stuff that you can do, and there's a fair amount of overhead, I think, in the Windows side. But yeah, it's definitely - so I'm definitely - part of my focus over the next

few months is going to be writing about, hey, here's all the ways you can take your BI load and lift bits and pieces of it into MotherDuck, and here's how you can use DuckDB, because that's

very near and dear to my heart. I want to empower analysts broadly. That's why I came to MotherDuck. It was like, I know I'm a great analyst, but how do I make 100,000 or more people have that

power, and that was part of what inspired me to join. So we would be remiss if we didn't talk about AI features, by the way, which we kind of - Yeah, I think we quickly glossed over that.

A couple of things we're building that are really cool from a MotherDuck perspective: we're making it really easy to use embeddings, so like vectors, and large language models inside of

MotherDuck. And there's kind of two ways we're thinking about that. The first way is as a query interface. We've actually trained our own model where you can ask it questions about

your data and it will return SQL and execute it for you. That's actually available on Hugging Face, so it's pretty cool. And we use that in our own product as well. So

if you write a SQL query and it doesn't work right, we make a suggestion on, hey, here's how you can fix it. And where it really helps, honestly, is with the DuckDB

dialect. Yeah, correct. And we're training another one right now. So like Bobby mentioned, well, I don't know which recording, but at one point Bobby mentioned, I

play with LLMs a lot. Yeah. So walk me through setting that up, right? I've got a SQL table, and I want to be able to query it in natural language. I installed DuckDB,

set up my DuckDB instance, pulled my table into DuckDB. Do

I need to map anything? How granular do we have to get? I'm very interested in this problem specifically, because it's a universal problem regardless of the

industry. Everybody that you talk to, at least at a management level that's remotely familiar with LLMs, the first thing they say is, oh, I want to be able to query, you know, I want to

figure out why my IoT data is so bad, or what asset is failing, or what the outliers are, and all this stuff. And I'm like, well, for that you really need ML, like true ML, not language

models. But the language model piece is like, hey, I use language models to write SQL all the time, because I never learned SQL, and I don't know if I'll have to, because the language model

can do it for me now. Yeah, sure. The language models know SQL pretty damn well; you just have to make sure they understand the schema and the structure and the relationships

and all that. And I think whoever figures this out is going to set themselves up incredibly well in the future, because everyone has this question and problem. So just briefly walk me

through what that looks like. You know, we don't have to get into a ton of detail, but. Yeah, we have a few interfaces for that.

Basically, we've set them up as functions you can call. We have one that I just linked called prompt_sql. You can give it a prompt, and then you also tell it what

tables you want to look at. You don't have to give it tables; if you don't, it'll just use the database you're currently in as context to answer the question. That's awesome.

So it's pretty good. Obviously, the quality of your column names will really make or break it. If you've got a set of gold tables, it's actually incredibly good against those. If you can

scope it to, hey, you only care about these 10 tables, we've got a fact table, a dim table, whatever, it works really, really well. You can also set it

to just return the query to you, so you can see the query. One workflow we've seen is it'll be like, hey, I'm thinking about this problem, here's what the query looks like. Hey,

business analyst or data analyst, is this the right query? Instead of the conversation starting with, hey, I want to see this thing, someone starts with a query, and now

we're all speaking the same language, right? Because if it's too much for the data analyst to confirm, they can work up the chain, you know, to a data scientist, to a

1:09:11 data engineer, to a software engineer, if they need to, right? And that's actually very powerful. Being able to start there,

we gave a really cool demo at PyData last week about this topic, where we pulled a bunch of data down, ran it locally, and then ran these things. So yeah, it's as

simple as connecting to MotherDuck, and then you have these functions available. You can do that from the command line or in Python: you

just open DuckDB with an 'md:' connection string, and now you're connected to MotherDuck and all of these functions are available to you.
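[Editor's note: a sketch of the natural-language query helpers described here. Function names follow MotherDuck's documented prompt_sql and prompt_query helpers as discussed; the question and table names are made up, so treat the exact signatures as indicative.]

```python
# Sketch: connect to MotherDuck and use its text-to-SQL helpers.
import duckdb

con = duckdb.connect("md:")  # 'md:' opens a MotherDuck connection

# Ask for the SQL only, scoped to specific tables
con.sql(
    "CALL prompt_sql('total oil volume by well for 2023', "
    "include_tables=['wells', 'daily_production'])"
).show()

# Or generate and execute the SQL in one step
con.sql("PRAGMA prompt_query('which 10 wells produced the most oil last month?')").show()
```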

And the first thing, because it's lightweight: can I push my language models and the associated data with them to the edge and do it all there? Yeah, let me think

about this. You could use our model on the edge, yes. But if you're trying to use the functions we've built into DuckDB, you would need internet. Okay, gotcha.

Yeah. And the other thing we're doing is we have this prompt function, which is a scalar function. So you can do select

prompt(...), and you pass a prompt into it, and you can pass in values from a column of some table. I'll give you a practical example. My buddy sent me a wine

list. He's like, hey, I'm ordering some wine, do you want to add a couple of things? Like, sure, I don't know a lot about wine, but sure. And so I just had the name of the

bottle, the region, the year, and the price. And I'm like, okay, I don't know a lot about these, but what I can do is pass the name and the region and the year to a prompt

and say, hey, write a description of this wine. Because it knows, hey, this region produces this type of wine, it knows this year was a good year, it knows a few

things about this producer, it can knit together a pretty good description of the wine. And then I can say, oh, now I have a description, I can actually make a decision that's sort

of educated. And it worked surprisingly well. And that's all using SQL.
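[Editor's note: a sketch of the prompt() scalar function applied per row, mirroring the wine-list story. The table and column names are invented for illustration.]

```python
# Sketch: run an LLM prompt against every row of a table with plain SQL.
import duckdb

con = duckdb.connect("md:")
con.sql("""
    SELECT
        name,
        prompt(
            'Write a short description of this wine: '
            || name || ', ' || region || ', ' || vintage
        ) AS description
    FROM wine_list
""").show()
```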

What's also cool is that we do all the parallelization behind the scenes: we run about 256 requests at a time when you're prompting, running against the ChatGPT kind of APIs. We're

working on broadening that so you can bring your own token or choose a different model and all that stuff. But that stuff does require an internet connection to work. Yeah. But

really what we're seeing is that folks are really liking it for things like I was talking about, like writing a product description, but it can also do structured

text. You can say, return an answer that matches this structure, so maybe if you're doing some data profiling. And again, what all of this stuff does is accelerate the

things that ML researchers have been talking about for a long time, which is, hey, this stuff is only 80% right, so you can't just use it blindly. You have to think about it. The good news is there's

lots of really good writing out there by these researchers, like, here's how you judge if a prompt function is good, and here's how you improve it, things like that. And so now you can

push all that stuff down instead of it being really, really expensive. And instead of researchers, you can have data scientists or data analysts doing that same type of loop, which is really powerful, again,

especially for unstructured data that you're just trying to understand, reason about, and group. You were just talking about Reddit-type comments: maybe extracting sentiment or

adding topics to them so that you can sort them later. There are probably a handful of things like that where it makes a lot of sense to use LLMs, if you can do it at the right price with

a high enough accuracy. As someone who historically had to come up with all the labels for the data in our CRM and ERP systems, I'm all about feeding a prompt

with the list that I want, or letting the prompt come up with labels from the data, which is how it really should be done. I love these things for a lot of data work, whether it's extraction

or labeling. And you can easily see where, one, it's gonna do it the same way every time, hopefully, depending on what your prompt and temperature are.

But even then, as long as it's consistent, it's gonna be more consistent than a human, because it doesn't get tired and it doesn't have a five-year-old that comes and wakes

it up at three in the morning, you know, the normal stuff. But you could easily see a world, though, and I'm

thinking of it for the oil field out of the gate, right? Our enterprise clients are gonna want to be able to use our energy language model, or their energy language model,

from their phone. The field guys wanna be able to use it in the field. And so, hey, we use a Phi-3 model or one of the smaller lightweight open-source ones that gets fine-tuned on text-to-SQL,

then you can deploy it with DuckDB and MotherDuck right there on the edge device, and that's completely secure. I think ultimately, for a lot of the enterprise field-level

stuff at least, that's what a lot of architectures will end up looking like. I totally agree. I'm seeing some of the same, especially in more secure, air-gapped environments, where

they've got a

SQL Server that's 20 years old that someone set up, and they're doing all their analytics on it, and maybe a historian or some SCADA or whatever. It's like, can we just please get DuckDB in here so that we can

actually analyze our data? And yeah, as it turns out, that's not necessarily a MotherDuck use case, but certainly one that I'm seeing and have been talking to people about, 'cause it's super

interesting. Well, also, at some point all the cloud credits run out and people actually have to start paying for their compute, and they're gonna be pissed off when they're like, oh shit, I

used a lot of ChatGPT credits, and they're expensive. Because of course those will only get more expensive somehow over time. Totally.

I know we're getting into question time. The only other thing I wanna talk about, that people should be aware of with DuckDB, is how they've pushed forward some SQL syntax and

functionality, like group by all, or, and some of this will trigger a lot of people, whether you should even do group by all, or even just allowing a trailing comma. But I

think there's more to it than that. Can you talk about a couple of those, or maybe your favorite? Yeah, so there's a couple that I really

like. One of them is you can just skip select star: you can just type from table and it'll return the table.

Another one is - I already like that, I like that a lot. It also has this notion of select star exclude. You can do select star and then exclude, and then type in the columns you just

don't wanna return, right? A lot of times it's like, okay, I wanna get this table, but I don't want this ID, I want an ID from somewhere else or whatever, and so you

can just ignore that. There are a few other ones, like some really interesting stuff with text-matching functions. This isn't really SQL syntax per se, but it

has things like Levenshtein distance and stuff like that. So when you're doing entity resolution, it's very fast and easy to do with native functions. So like fuzzy matching kind of

stuff? Yeah, fuzzy matching kind of stuff, and ranking, for sure.
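[Editor's note: a sketch collecting the "friendly SQL" features mentioned in this stretch. The wells table is invented for illustration.]

```python
# Sketch: DuckDB's friendly-SQL extensions in one place.
import duckdb

con = duckdb.connect()
con.sql("""
    CREATE TABLE wells AS
    SELECT * FROM (VALUES
        ('W-1', 'Permian',    'OpCo A', 120),
        ('W-2', 'Permian',    'OpCo B',  80),
        ('W-3', 'Eagle Ford', 'OpCo A',  95)
    ) AS t(internal_id, basin, operator, boe)
""")

con.sql("FROM wells").show()                                  # FROM-first: implicit SELECT *
con.sql("SELECT * EXCLUDE (internal_id) FROM wells").show()   # drop one column
con.sql("""
    SELECT
        basin,
        operator,
        sum(boe) AS total_boe,   -- trailing comma is allowed
    FROM wells
    GROUP BY ALL                 -- groups by every non-aggregated column
""").show()
con.sql("SELECT levenshtein('Permian', 'Permain') AS edit_distance").show()  # fuzzy matching
```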

I'll talk about one other one too. You know when you're looking for the latest record for some key, where you'd use qualify or a row number and order your data

set, right? When you use those types of functions today, it requires ordering the entire data set. DuckDB has added these functions called arg_max and arg_min. What they let you

do is pass in a column and sort on some arbitrary dimension, and what that means is that instead of sorting the entire data set, it can sort only within each group.

That also means it can break the work up into pieces and run them in parallel. And so for some scenarios, instead of using qualify, if you use

arg_max you can see a 10x or 100x improvement in query speed. That's a really fun one where you have an append-only database with a bunch of versions of rows, where you can

just select only the latest version of each one using a timestamp.
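[Editor's note: a minimal sketch of the arg_max pattern for latest-row-per-key, with invented data.]

```python
# Sketch: arg_max instead of a window function + QUALIFY for "latest row per key".
import duckdb

con = duckdb.connect()
con.sql("""
    CREATE TABLE readings AS
    SELECT * FROM (VALUES
        ('W-1', TIMESTAMP '2024-01-01', 100),
        ('W-1', TIMESTAMP '2024-02-01', 120),
        ('W-2', TIMESTAMP '2024-01-15',  80)
    ) AS t(well_id, ts, rate)
""")

# For each well, take the rate from the row with the latest timestamp;
# no global sort of the whole data set is required
con.sql("""
    SELECT well_id,
           arg_max(rate, ts) AS latest_rate,
           max(ts)           AS latest_ts
    FROM readings
    GROUP BY well_id
""").show()
```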

And it is insanely fast and practical for stuff like that. The documentation is just like, yeah, we have this function called arg_max, it's here. And you're like,

okay, what does that mean? Then you see someone using it and you're like, oh, that makes perfect sense. And it's really fast. So there's some really cool stuff like that. They're

definitely doing some cool stuff. Yeah, that uses an algorithm called radix sort, which is slightly different from traditional sorting. So anyway, it's a little

faster; actually, it's a lot faster if your data is structured a certain way, so very fun. Awesome. John, we'll get real quick into the speed round and then we'll wrap it up. Yes. What is

1:18:28 your favorite LLM for coding currently?

That's a good question. I am a GitHub,

or a VS Code, user, so I use whatever the GitHub Copilot stuff is. They do have Claude in there now too. Works pretty well. I'm not really tied to one. I do pay for ChatGPT personally and I use

it all the time too. Yeah, you should check out Cursor if you haven't played with it yet. Okay, okay. The only reason I bring it up: it's a VS Code fork, so it's very familiar,

but they have some easy things, like copy a piece of code, add it to the chat, add my repo, all the files, the config, everything. It writes the code, and

then I click a button and it applies the code, and I don't have to copy and paste anything. It's incredibly nice. Noted. I've been using Claude a lot recently as well, and I think the hype is

real with the latest update, as far as it being better than ChatGPT at coding. Got it. What's your favorite non-DuckDB database?

Oh, I mean, it's definitely SQL Server. I have a little bit of, it's probably Stockholm Syndrome, to be honest with you. But the

most important thing that I learned about SQL Server is to be able to understand how the query planner works, and for better or for worse, that's now second nature. So

that means I can see something and optimize it mentally very quickly, and it's served me very well, although I don't have to activate that part of my brain too much

anymore. But yeah,

set up your partitions, set up your indexes, and use temp tables, and you'll get really far. Nice. Favorite video or board game? Oh, that's a good question. So

there's a game that came out probably eight years ago called Rocket League. I play a little bit of that, man. I love Rocket League, big Rocket League fan. It's soccer

with RC cars, for those who aren't familiar. It is so much fun. Yeah, it's a lot of fun. Scratches a competitive itch, for sure. It's also really fun to get humbled by a 12-year-old. Yeah,

that's a good way to do it. Yeah, but I have kids now too, so we'll play Mario Kart and stuff too, a lot of fun. All right, and I'll stick with my food theme. So if we took a

trip up to Seattle, where do we need to go? Yeah, there are a few things that I think are excellent in Seattle: really good teriyaki, really good seafood and sushi. And there's the classic,

a burger place called Dick's that's kind of a local legend. Very nice. That would be the three, yeah.

1:21:36 All right, last one, most importantly, and it goes to both of you, who's gonna win the fight tonight since we've been talking sports in and out through this whole

thing, Tyson or Jake Paul? I think I'll go conservatively here. The betting markets seem to favor Jake Paul, so I'll go with him.

The guy that's 30 years younger. Yeah, exactly.

I'll go Mike Tyson; I just think he can't turn it off. I imagine they probably have how this fight's gonna go planned out right now, and it doesn't matter how it's planned out, 'cause Mike Tyson's

gonna get in the ring, and if he doesn't let him knock him out, like, yeah. Yeah, I think

if it goes more than four rounds, Paul will win, but if he doesn't get knocked out by Tyson early, I don't know, that dude still looks like he can murder people. For sure.

I would not want to get in the ring. No, not at all. Cool. Well, awesome. Thanks so much for joining us. And yeah, where can people find you, or how can I

reach out to you? Yeah, absolutely. I'm on LinkedIn; you can find me, Jacob Matson at MotherDuck. I think my LinkedIn URL is just Jacob Matson. And then also on X slash Twitter at

matsonj. Feel free to harass me there; I'm happy to chat, DuckDB or otherwise. Beautiful. Well, thanks again. And make sure you guys subscribe. Like I

said, 75% of you are not subscribed, so go click the button wherever it may be. Appreciate it, guys. Thanks, Jacob.

Creators and Guests

Bobby Neelon, Host: Husband, Father, Baseball, Upstream Oil and Gas, R, Python, JS, SQL, Cloud Computing
John Kalfayan, Host: Raddad, energy tech, crypto, data, sports, cars