
I am working on something adjacent to this problem. We focus less on data pipelines and more on automation, but in the end we also have an abstraction for flows that one can use to build data pipelines. The lock-in issue was something I thought a lot about, and I ended up deciding that our generic steps should just be plain code in TypeScript/Python/Go/Bash; the only requirement is that those code snippets have a main function and return a result. We built https://hub.windmill.dev, where users can share their scripts directly, and we have a team of moderators to approve the ones to integrate directly into the main product. The goal is for those snippets to be generic enough to be reusable outside of Windmill; the Python ones might even work straight out of the box for Orchest.
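To make the contract concrete, here is a minimal sketch of such a "plain code" step in Python (the function and values are made up for illustration; the only real requirement described above is a main function that returns a result):

```python
# A plain-code step: the main function's parameters are the step's
# inputs, and its return value is the step's output. Nothing below is
# engine-specific, so the same snippet runs standalone or inside any
# orchestrator that honors this convention.

def main(base_url: str, timeout_s: int = 30) -> dict:
    """Hypothetical step: build a request config from its inputs."""
    return {"url": base_url.rstrip("/") + "/health", "timeout": timeout_s}

if __name__ == "__main__":
    # Running outside any engine works too, which is the point.
    print(main("https://example.com"))
```

Because the step is just a function, it stays testable and portable on its own.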

nb: author of https://github.com/windmill-labs/windmill



Thanks for chipping in.

I’ve been leaning in this direction. I think I/O is the biggest part that, in the case of plain-code steps, still needs fixing: input being data/streams plus parameterization/config, and output being some sort of typed data/stream.

My “let’s not reinvent the wheel” alarm is going off as I write that, though. Examples that come to mind are text-based (Unix / https://scale.com/blog/text-universal-interface) but also the Singer tap protocol (https://github.com/singer-io/getting-started/blob/master/doc...). And config obviously has many standard forms: ini, yaml, json, environment key-value pairs, and more.

At the same time, text feels horribly inefficient as an encoding for some of the data objects being passed around in these flows. More specialized and optimized binary formats come to mind (Arrow, HDF5, Protobuf).

Plenty of directions to explore, each with its own advantages and disadvantages. I wonder which direction is favored by users of tools like ours. It will be good to poll them (do they even care?).

PS Windmill looks equally impressive! Nice job


Yes, inputs/outputs are likely the most interesting problem for our diverse specs of flows.

Because data pipelines are not the primary concern of Windmill, we took the stance that the inputs and outputs of steps are simply JSON in, JSON out. For all languages, we extract the JSON object into the parameters of the main function, and then wrap the return value with the respective language's native serializer (e.g. JSON.stringify in TypeScript). Each step can then use a JavaScript expression, executed by v8, to do lightweight transformations between the output of any previous step and the input of that step.
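The JSON-in/JSON-out mapping can be sketched in a few lines. This is a hypothetical harness, not Windmill's actual runner; `run_step` and the example `main` are invented names for illustration:

```python
import inspect
import json

# The engine maps a JSON object onto the main function's parameters
# and serializes the return value back to JSON.

def main(name: str, retries: int = 3) -> dict:
    return {"greeting": f"hello {name}", "retries": retries}

def run_step(step_main, input_json: str) -> str:
    args = json.loads(input_json)
    # Only pass keys the function actually declares.
    accepted = inspect.signature(step_main).parameters
    result = step_main(**{k: v for k, v in args.items() if k in accepted})
    return json.dumps(result)

print(run_step(main, '{"name": "world", "extra": true}'))
# → {"greeting": "hello world", "retries": 3}
```

Filtering the input object against the declared parameters is one way to keep steps tolerant of upstream outputs carrying extra keys.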

A lot of the simplification we made comes from parsing the main function's parameters into the corresponding JSON Schema, supporting deeply nested objects where relevant.

That works great for automations that do not have big inputs/outputs, but not for data. So what we do for data is use a folder, symlinked so it is shared by all steps, if a specific flag is set for that flow. It also forces the same worker to process all the steps in that flow, when otherwise flow steps could have been processed by any worker. It is very fast since it's all local filesystem, but not super scalable.
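The pattern amounts to passing a reference in the JSON while the heavy payload lives on a filesystem every step can see. A minimal sketch, where `SHARED_DIR` and both step functions are assumed names, not Windmill's actual layout:

```python
import json
from pathlib import Path

# Assumed shared mount visible to all steps of the flow (here, all
# steps on the same worker).
SHARED_DIR = Path("/tmp/flow_shared")
SHARED_DIR.mkdir(exist_ok=True)

def producer_main() -> dict:
    out = SHARED_DIR / "rows.jsonl"
    with out.open("w") as f:
        for i in range(3):
            f.write(json.dumps({"row": i}) + "\n")
    # The step's JSON output stays tiny: just a reference.
    return {"data_path": str(out)}

def consumer_main(data_path: str) -> dict:
    rows = [json.loads(line) for line in Path(data_path).open()]
    return {"count": len(rows)}

ref = producer_main()
print(consumer_main(**ref))  # small JSON in, small JSON out
```

Swapping the local folder for a network filesystem changes only where `SHARED_DIR` points, which is what the next paragraph is getting at.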

I am not pleased with that solution, and I believe that if we were to expand on the data problem, we would certainly rely on a fast network and HDFS/Amazon EFS/etc. to share that mounted folder across the network.

Anyway, sorry for the rambling, but I do feel like we're all taking different approaches to the same underlying problem of building the best abstraction for flows, and I believe we might learn from each other's choices.

ps: congrats Patterns on the launch, the tool looks absolutely amazing.




