
Hi good #data-science people, ML beginner looking for enlightenment here. I'm thinking of doing some classification using a nearest-neighbours approach, e.g. kNN. It seems to me the underlying assumption in kNN is that the labeled set is "dense" enough, and I would like to assess this assumption on my entire dataset; do you know of approaches for doing that? My intuition is that the labeled set achieves good density when each (or at least most) unlabeled point is among the very nearest unlabeled neighbours of its own nearest labeled neighbour. Is this a thing? Are there quantitative tests / visualizations based on this?


(if that matters, we're talking high-dimensional sparse data here — bags of words or similar)
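A rough sketch of that intuition as a quantitative test, using scikit-learn's `NearestNeighbors` on synthetic stand-in data (the arrays, the cutoff `k`, and the "coverage" score are all illustrative assumptions, not an established metric):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
# Hypothetical synthetic data standing in for a real corpus:
X_labeled = rng.normal(size=(200, 10))
X_unlabeled = rng.normal(size=(1000, 10))

k = 5  # "very nearest" cutoff; a knob to tune

# 1. Nearest labeled neighbour of each unlabeled point.
nn_labeled = NearestNeighbors(n_neighbors=1).fit(X_labeled)
_, nearest_lab = nn_labeled.kneighbors(X_unlabeled)
nearest_lab = nearest_lab[:, 0]

# 2. For each labeled point, its k nearest unlabeled neighbours.
nn_unlabeled = NearestNeighbors(n_neighbors=k).fit(X_unlabeled)
_, near_unlab = nn_unlabeled.kneighbors(X_labeled)

# 3. Fraction of unlabeled points that sit in the k-nearest-unlabeled
#    set of their own nearest labeled neighbour.
hits = sum(i in near_unlab[nearest_lab[i]] for i in range(len(X_unlabeled)))
coverage = hits / len(X_unlabeled)
print(f"coverage at k={k}: {coverage:.2f}")
```

A coverage near 1 would suggest the labeled set sits "inside" the unlabeled cloud at the scale of `k`; a low coverage flags unlabeled regions the labels don't reach.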


Hey @val_waeselynck; I think the bottom line is that you want your labeled (training) data to do a good job of covering the space of unlabeled (test) data you intend to run on. I'd say that how densely you need to sample has more to do with how granular the "true" underlying relationship is, relative to the input space. If you think of the input space as a map, is the labeling on the order of coloring by zip code, county, state, or nation? If the former (zip or county), you'll need much deeper sampling in the training data than for the latter (state or nation). You should be able to look at a scatterplot of the training data colored by the property you're trying to predict and visually guess the shapes of the boundaries the classes define. Obviously, there aren't always clearly defined boundaries in real-world data, and regions can probabilistically "blur" together; your success with kNN (or any method, really) will be constrained by how much your data does this.


If you have high-dimensional data, you may want to do some dimensionality reduction (PCA, UMAP) to get a visual sense of how strong a relationship you can find between this space and the category you wish to predict, as well as a sense of how granular this relationship is.
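For example, a minimal PCA sketch (synthetic data standing in for real feature vectors; for sparse bag-of-words input, scikit-learn's `TruncatedSVD` is the usual substitute, since PCA densifies the matrix):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Two hypothetical classes with slightly shifted means, standing in
# for real feature vectors:
X = np.vstack([rng.normal(0.0, 1.0, size=(300, 50)),
               rng.normal(0.5, 1.0, size=(300, 50))])
y = np.array([0] * 300 + [1] * 300)

# Project down to 2D for visual inspection.
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)
print("explained variance ratio:", pca.explained_variance_ratio_)

# A scatter of X_2d coloured by y (e.g. with matplotlib) then shows
# how separable, and how fine-grained, the class regions look.
```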

☝️ 8

Thanks @U05100J3V! I'll have a look, I wonder how these dimensionality-reduction techniques will fare for the sort of text-based similarity I'm working on.


That being said, I don't really get to tune how "densely" I sample: my use case is that I have a 'feature-rich' subset of the data points that I can (hopefully) classify using another ML algorithm, and I then want to propagate that classification to the whole dataset by similarity, hence the nearest-neighbours approach. However, this essentially assumes that my 'feature-rich' subset has the same distribution as the entire dataset, which might not be the case; that's what I'm trying to assess.
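One rough way to probe that assumption, sketched with scikit-learn on synthetic stand-ins for the two subsets: compare each unlabeled point's distance to its nearest labeled neighbour against the nearest-neighbour distances within the labeled subset itself (the data and the median comparison are illustrative assumptions, not a formal test):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(1)
# Hypothetical stand-ins: a 'feature-rich' labeled subset and the full pool.
X_labeled = rng.normal(size=(300, 20))
X_unlabeled = rng.normal(size=(2000, 20))

nn = NearestNeighbors(n_neighbors=2).fit(X_labeled)

# Distance from each unlabeled point to its nearest labeled neighbour.
d_unlab, _ = nn.kneighbors(X_unlabeled, n_neighbors=1)

# Baseline: distance from each labeled point to its nearest *other*
# labeled point (second neighbour, since the first is itself).
d_lab, _ = nn.kneighbors(X_labeled, n_neighbors=2)
d_lab = d_lab[:, 1]

# If the two subsets share a distribution, these medians should be
# comparable; a much larger unlabeled median flags regions the
# labeled subset doesn't cover.
print(f"median labeled->labeled:   {np.median(d_lab):.3f}")
print(f"median unlabeled->labeled: {np.median(d_unlab):.3f}")
```

Looking at the full histogram of `d_unlab` (not just the median) also surfaces the specific outlying points that the labeled subset fails to cover.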


are you trying to group terms that are associated by a topic?


if so, probabilistic latent semantic analysis (pLSA) and latent Dirichlet allocation (LDA) are good starting points

👌 4

it sounds like you have documents you want to classify. in this scenario, you can represent a single document as a mixture of topics fit by expectation-maximization; that vector of topic weights can then be used as a reduced representation of the documents
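A minimal sketch of that idea with scikit-learn's `LatentDirichletAllocation` (the toy corpus and the choice of 2 topics are illustrative assumptions; note scikit-learn fits LDA by variational inference rather than plain EM):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Tiny hypothetical corpus; real inputs would be your documents.
docs = [
    "cats purr and cats sleep",
    "dogs bark and dogs fetch",
    "stocks rise and markets fall",
    "bonds yield and markets trade",
]

# Bag-of-words counts, then a 2-topic LDA model.
counts = CountVectorizer().fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0)
topic_weights = lda.fit_transform(counts)  # one row of topic weights per doc

# Each row is a normalized topic distribution and can serve as a dense,
# low-dimensional representation of the document for kNN or another
# classifier.
print(topic_weights.shape)
```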


Thanks @UCF779XFS this looks very relevant