To your point:
If you have your data as parquet files and load it into memory using DuckDB before passing the arrow object to Pandas you can run in arbitrary large datasets as long as you aggregate them or filter them somehow.
In my case I run analysis on an Apple M1 with 16GB of RAM and my files on disk are thousands of parquet files that add up to hundreds of gigs.
Apart from that:
Being able to go from duckdb to pandas very quickly to make operations that make more sense on the other end and come back while not having to change the format is super powerful. The author talks about this with Polars as an example.
> Being able to go from duckdb to pandas very quickly to make operations that make more sense on the other end and come back while not having to change the format is super powerful
I can't stress enough how much I think this is truly transformative. It's generally nice as a working pattern, but much more importantly it lets the scale of a problem that a tool needs to solve shrink dramatically. Pandas doesn't need to do everything, nor does DuckDB, nor does some niche tool designed to perform very specific forecasting - any can be slotted into an in memory set of processes with no overhead. This lowers the barrier to entry for new tools, so they should be quicker and easier to write for people with detailed knowledge just in their area.
It extends beyond this too, as you can then also get free data serialisation. I can read data from a file with duckdb, make a remote gRPC call to a flight endpoint written in a few lines of python that performs whatever on arrow data, returns arrow data that gets fed into something else... in a very easy fashion. I'm sure there's bumps and leaky abstractions if you do deep work here but I've absolutely got remote querying of dataframes & files working with a few lines of code, calling DuckDB on my local machine through ngrok from a colab instance.
Yes, I agree with your assessment that technologies that can deal with larger than memory datasets (e.g. Polars) can be used to filter data so there are less rows for technologies that can only handle datasets a fraction of the data (e.g. pandas).
Another way to solve this problem is using a Lakehouse storage format like Delta Lake so you only read in a fraction of the data to the pandas DataFrame (disclosure: I am on the Delta Lake team). I've blogged about this and think predicate pushdown filtering / Z ORDERING data is more straightforward that adding an entire new tech like Polars / DuckDB to the stack.
If you're using Polars of course, it's probably best to just keep on using it rather than switching to pandas. I suppose there are some instances when you need to switch to pandas (perhaps to access a library), but think it'll be better to just stick with the more efficient tech normally.
In my case I run analysis on an Apple M1 with 16GB of RAM and my files on disk are thousands of parquet files that add up to hundreds of gigs.
Apart from that: Being able to go from duckdb to pandas very quickly to make operations that make more sense on the other end and come back while not having to change the format is super powerful. The author talks about this with Polars as an example.