I've had to build out some version of a geospatial vector embedding / latent variable dataset for at least 4 separate projects now. Come see the viewer I've built on top of it!
The embeddings come from globally available Copernicus land cover data.
Sure! The basic idea is that each hexagon is a discrete unit of space for which I obtain a vector embedding. This vector is supposed to represent a sort of data-based summary of that location, obtained in this case using deep learning.
When you run the search on a hex, it looks up the vector for that hex and then performs a similarity search against all other vectors within the circle, showing the ones that are most similar in terms of land cover. The dependence on land cover / land use data is just because that was easy to get.
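In pseudocode-ish Python, the lookup-then-rank step is basically the following (the hex IDs and vectors here are made up for illustration, not the actual model outputs; in practice the candidate set is every hex inside the search circle):

```python
import math

# Toy embeddings: hex ID -> vector (in reality these come from the model).
embeddings = {
    "hex_a": [0.9, 0.1, 0.0],
    "hex_b": [0.8, 0.2, 0.1],
    "hex_c": [0.0, 1.0, 0.3],
    "hex_d": [0.1, 0.9, 0.2],
}

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def most_similar(query_hex, candidate_hexes, k=2):
    """Rank candidate hexes (e.g. all hexes inside the search circle)
    by similarity to the query hex's embedding, returning the top k."""
    q = embeddings[query_hex]
    scored = [(h, cosine(q, embeddings[h])) for h in candidate_hexes if h != query_hex]
    return sorted(scored, key=lambda t: t[1], reverse=True)[:k]

print(most_similar("hex_a", ["hex_b", "hex_c", "hex_d"]))
```

Here hex_b wins because its toy vector points in nearly the same direction as hex_a's.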
As other folks have pointed out here, raw satellite imagery is also a potential input source for this. I'm playing around with other sources and really want to integrate something like GeoVex (https://openreview.net/forum?id=7bvWopYY1H) into the embeddings as well.
Would it provide useful/interesting results if the similarity search was global? E.g. find me neighborhoods in London most similar to this one in Chicago?
It's a way to encode land to make predictions about it. E.g. is the land arable, is it rural, how similar is it to X, etc. Embeddings help encode data in formats more usable by ML models.
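For example, once a hex has an embedding, a downstream model just consumes the vector as features. A toy sketch (the weights and vectors are made up; a real model would learn them from labeled data):

```python
# Hypothetical learned weights for an "is this land arable?" classifier
# that takes a 3-dimensional embedding as input features.
weights = [0.5, -1.2, 0.3]
bias = 0.1

def predict_arable(embedding):
    # Simple linear classifier over the embedding: positive score -> arable.
    score = bias + sum(w * x for w, x in zip(weights, embedding))
    return score > 0

print(predict_arable([0.8, 0.1, 0.4]))  # -> True for this toy vector
```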
The original idea came from something I saw at work - we needed a way to build generic feature sets representing something about real estate, but beyond the data we had on prices, floors, and other house-specific details.
My guess is this site is simply a way to explore the embeddings. People make similar data visualization tools for word embeddings, so that's what I assumed this was.
The embeddings are used by algorithms, not people, generally. You could ask something like "what's the most similar place to X within Y", and it would use the embeddings (which cover a variety of facts) to calculate the answer. An embedding is an N-dimensional vector (where the dimensions may or may not be meaningful to us), and similarity can be implemented by measuring how close the vectors are to each other.
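Concretely (with made-up 4-dimensional vectors), "how close" can be as simple as the distance between two embeddings in that space:

```python
import math

# Two made-up 4-dimensional embeddings for places X and Z.
place_x = [0.2, 0.7, 0.1, 0.5]
place_z = [0.3, 0.6, 0.2, 0.4]

def euclidean_distance(u, v):
    """Straight-line distance between two vectors; smaller = more similar."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# Cosine similarity is the other common choice for comparing embeddings.
print(euclidean_distance(place_x, place_z))  # -> 0.2
```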
Yup, and while the similarity search is perhaps the most visually appealing way to work with it, the real use (in my opinion) is in providing generic sets of geospatial features which are reusable across applications. I've built out versions of H3-referenced feature sets at each of the jobs I've had over the last 10 years.
Looks like Copernicus updates yearly? I can't tell if they include elevation from the "technical" tab on their home page.
Having originally come from the world of geointelligence, let me tell you this is not an easy problem to solve. For rural land use, this is probably fairly reliable, but it depends on the granularity of change detection you want: cities often build new neighborhoods in the span of months, large construction projects finish, human movement happens in the span of hours or even minutes, and that's just for land. If you want maritime tracking, you need nearly continuous updates. We managed to do it for the Navy, but the infrastructure required for this is immense, much of the sensor technology is classified and not even available for commercial use, and the resource requirements are not remotely practical for a personal side project.
Of course, military intelligence is primarily trying to track the land use of other militaries, especially in active theaters of operations, and that changes even more frequently than in regular places, where people aren't constantly erecting and moving temporary headquarters, living under camouflage cover, and blowing up existing infrastructure.
I guess you're doing this for peacetime domestic real estate, like neighborhood X in city Y is similarity-ranked against neighborhood U in city V? Are you incorporating pricing and demographic data or just land use? It seems to me like the neighbors make the neighborhood, as much as or more than qualities of the land, along with things like the usability of the sidewalks, how quickly road disrepair gets fixed, crime rates, the level of visible homelessness, air quality, and vehicular traffic congestion.
I don't want to shit on the approach too much; usefulness is determined by the results you get. But given the heterogeneity of the data here (some of it ordinal, some of it nominal, discrete versus continuous, with irreconcilable scaling and dimensional analysis, and not necessarily coming from similar distributions if you tried to just z-score it all), I can think of ways to use pure numerical voodoo to put it all into the same vector space, but the statistical validity of doing so is dubious at best.
The embeddings were obtained using a CNN triplet loss model (~10M parameters) on the Copernicus land cover data. I haven't used DEM data yet, but I have done generative modeling on DEMs in other work and would like to do that too.
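For anyone unfamiliar, triplet loss just trains the network so an anchor's embedding ends up closer to a "positive" example (e.g. a hex with similar land cover) than to a "negative" one, by at least some margin. A toy numeric sketch, not the actual training code (the vectors and margin are made up):

```python
import math

def dist(u, v):
    """Euclidean distance between two embeddings."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def triplet_loss(anchor, positive, negative, margin=1.0):
    # Loss is zero once the negative is at least `margin` farther
    # from the anchor than the positive is; otherwise the model is penalized.
    return max(0.0, dist(anchor, positive) - dist(anchor, negative) + margin)

anchor   = [0.0, 0.0]
positive = [0.1, 0.0]  # similar land cover -> should embed nearby
negative = [2.0, 0.0]  # different land cover -> should embed far away

print(triplet_loss(anchor, positive, negative))  # -> 0.0, triplet already satisfied
```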