Hacker News | wesm's comments

See also The Mythical Agent-Month https://wesmckinney.com/blog/mythical-agent-month/


I've been building https://roborev.io/ (continuous background code review for agents), essentially as a coping mechanism for the uneven quality of the agents' work, since my agents write much more code than I can possibly review directly or QA thoroughly. I think we'll see a bunch of interesting new tools to help alleviate the cognitive burden of supervising their work output.


You can see the exponential growth of tokens in real time! lol

Do you find it works well?

With these agents I've found that making the workflows more complicated has severely diminishing returns, and is outright worse in a lot of cases.

The real productivity boost I've found is giving it useful tools.


Super well! I don't work without this tool running in the background supervising all the agents' work.


I was especially excited to learn that RZ is built on Apache Arrow internally, which makes it easy to integrate with other Arrow-based applications and the emerging "Composable Data Stack". Really exciting stuff, they're just getting started!


If you read my slide decks over the last 7 years or so (while I've been working actively on Arrow and sibling projects like Ibis) I've been saying exactly this.

See e.g. https://ibis-project.org/


Almost no database systems support multidimensional arrays. So they are not appropriate for many use cases?

* BigQuery: no * Redshift: no * Spark SQL: no * Snowflake: no * Clickhouse: no * Dremio: no * Impala: no * Presto: no ... list continues

We've invited developers to add the extension types for tensor data, but no one has contributed them yet. I'm not seeing a lot of tabular data with embedded tensors out in the wild.


I think that implementing good ndim=2 support would already be a huge leap forward; it doesn't have to be something super generic. E.g., given that most classic machine learning essentially uses 2-dimensional data (samples x features) as inputs, this is a very common use case.

E.g., as of right now, having to concatenate hundreds of columns manually just to pass them to some ML library in a contiguous format is always a pain and often doubles the max RAM requirement.
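A minimal numpy sketch of that pain point (hypothetical column names; the point is that the concatenated matrix is a fresh allocation, so columns and copy coexist in memory):

```python
import numpy as np

# Hypothetical feature columns stored as separate 1-D arrays,
# as they would be in a columnar table.
cols = {f"feat_{i}": np.arange(5, dtype=np.float64) * i for i in range(4)}

# Handing them to an ML library as one contiguous (rows x features) matrix
# requires a brand-new allocation; while it is being filled, both the
# original columns and the copy are live, roughly doubling peak RAM.
X = np.column_stack(list(cols.values()))

assert X.shape == (5, 4)
assert X.flags["C_CONTIGUOUS"]
```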


This may help you do zero-copy for a column of multidimensional values without losing the value types; the struct is just encoding the multidimensional shape. This example is for values that are 3x3 int8s:

```
import pyarrow as pa

# One int8 field per cell of the 3x3 grid
my_col_of_3x3s = pa.struct(
    [(f'f_{x}_{y}', pa.int8()) for x in range(3) for y in range(3)]
)
```

If using ndarrays, I think our helpers are another ~4 lines each. Interop with C is even easier, just cast. You can now pass this data through any Arrow-compatible compute stack / DB and not lose the value types. We do this for streaming into webgl's packed formats, for example.

What you don't get is a hint to downstream systems that the column is multidimensional. Tableau would just let you do individual bar charts, not say a heatmap, assuming it supports rank-2 data at all. To convert, you'd need to do that zero-copy cast to whatever representation they do support. I agree a targetable standard would avoid the need for that manual conversion and would increase the likelihood that systems use the same data representation.

Native support would also avoid some header bloat from using structs. However, we find that's fine in practice since it's only metadata: e.g., our streaming code reads the schema at the beginning and then passes it along, so the actual payloads are pure data and skip resending metadata.



If you put a blank line between your bullet points, they'll display properly:

* BigQuery: no

* Redshift: no

* Spark SQL: no

* Snowflake: no

* Clickhouse: no

* Dremio: no

* Impala: no

* Presto: no


I suspect that AllegroCache accepts arrays with rank>=2, although I never got around to trying it out. (At the very least its documentation has nothing to say about any limitations on what kinds of arrays can be stored, so I'm assuming it stores all of them.)


On a side note, Clickhouse had some Arrow support

https://github.com/ClickHouse/ClickHouse/issues/12284


ClickHouse has support for multidimensional arrays.


> Arrow's serialization is Protobuf

Incorrect. Only Arrow Flight embeds the Arrow wire format in a Protocol Buffer, but the Arrow protocol itself does not use Protobuf.


Apologies, off base there. Edited with a pointer to Flight :)


There's no serde by design (aside from inspecting a tiny piece of metadata indicating the location of each constituent block of memory). So data processing algorithms execute directly against the Arrow wire format without any deserialization.


Of course there is. There is always deserialization. The data format is most definitely not native to the CPU.


I challenge you to have a closer look at the project.

Deserialization by definition requires bytes or bits to be relocated from their position in the wire protocol to other data structures which are used for processing. Arrow does not require any bytes or bits to be relocated. So if a "C array of doubles" is not native to the CPU, then I don't know what is.
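The point can be illustrated outside Arrow with a toy "wire" buffer (not the Arrow format itself): a buffer of little-endian doubles can be viewed as a C array of doubles without relocating a single byte.

```python
import struct

import numpy as np

# A toy "wire" payload: four little-endian IEEE 754 doubles.
wire = struct.pack("<4d", 1.0, 2.0, 3.0, 4.0)

# Interpreting it as an array of doubles is a zero-copy view over
# the very same bytes -- no relocation, no deserialization step.
view = np.frombuffer(wire, dtype="<f8")

assert view.tolist() == [1.0, 2.0, 3.0, 4.0]
assert not view.flags["OWNDATA"]  # the view borrows the buffer's memory
```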


Perhaps "zero-copy" is a more precise or well-defined term?


CPUs come in many flavors. One area where they differ is the way the bytes of a word are represented in memory. Two common formats are Big Endian and Little Endian. This is an example where a "C array of doubles" would be incompatible and some form of deserialization would be needed.

My understanding is that an Apache Arrow library provides an API to manipulate the format in a platform-agnostic way. But to claim that it eliminates deserialization is false.
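The byte-order difference is easy to see with Python's struct module: the same double serializes to reversed byte sequences under the two conventions.

```python
import struct

# The double 1.0 under the two byte orders:
le = struct.pack("<d", 1.0)  # little-endian
be = struct.pack(">d", 1.0)  # big-endian

# For a single 8-byte word, one encoding is the byte-reverse of the other,
# so a consumer on the "wrong" architecture must swap bytes before use.
assert le == be[::-1]
assert struct.unpack(">d", be)[0] == 1.0
```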


I wish I had your confidence, to argue with Wes McKinney about the details of how Arrow works


Iunno. That looks like "where angels fear to tread" territory to me.


Hiya, a bit of OT (again, last one promise!): I saw your comment about type systems in data science the other day (https://news.ycombinator.com/item?id=25923839). From what I understood, it seems you want a contract system, wouldn't you think? The reason I'm asking is that I'm fishing for opinions on building data science infra in Racket (and saw your deleted comment in https://news.ycombinator.com/item?id=26008869 so thought you'd perhaps be interested), and Racket (and R) dataframes happen to support contracts on their columns.


You are right that if you want to do this in a heterogeneous computing environment, one of the layouts is going to be "wrong" and require an extra step regardless of how you do this.

But ... (a) this is way less common than it was decades ago (these are rare use cases we're talking about), and (b) it seems to be addressed in a sensible way (i.e., Arrow defaults to little-endian, but you can swap it on a big-endian network). I think it includes utility functions for conversion as well.

So the usual case incurs no overhead, and the corner cases are covered. I'm not sure exactly what you're complaining about, unless it's the lack of caveats like "no deserialization in most use cases" sprinkled liberally around the comments?


Big endian is pretty rare among anything you'd be doing in-memory analytics on. Looks like you can choose the endianness of the format if you need to, but it's little-endian by default: https://arrow.apache.org/docs/format/Columnar.html. I'd suggest reading up on the format; it covers the properties it provides to be friendly to direct random access.


Microsoft is also on top of this with their Magpie project

http://cidrdb.org/cidr2021/papers/cidr2021_paper08.pdf

"A common, efficient serialized and wire format across data engines is a transformational development. Many previous systems and approaches (e.g., [26, 36, 38, 51]) have observed the prohibitive cost of data conversion and transfer, precluding optimizers from exploiting inter-DBMS performance advantages. By contrast, in-memory data transfer cost between a pair of Arrow-supporting systems is effectively zero. Many major, modern DBMSs (e.g., Spark, Kudu, AWS Data Wrangler, SciDB, TileDB) and data-processing frameworks (e.g., Pandas, NumPy, Dask) have or are in the process of incorporating support for Arrow and Arrow Flight. Exploiting this is key for Magpie, which is thereby free to combine data from different sources and cache intermediate data and results, without needing to consider data conversion overhead."


I wish MS would put some resources behind Arrow in .NET. I tried raising some remarks about it on the dotnet repos (esp. within ML.NET), but to no avail. Hopefully that will change now that Arrow is more popular, and also written about by MS itself.


Way cool! Is Magpie end-user facing anywhere yet? We were using the azureml-dataprep library for a while, which seems similar but not all of Magpie.


(Wes here) I appreciate the Arrow shout-out but note that Apache Arrow has been a major open source community collaboration and not something I can take sole credit for.


There is no “JIRA politics” blocking the LZ4 work, only a lack of volunteers to do the development and testing.

