지역 민감 해싱을 이해하는 방법?

Programming

지역 민감 해싱을 이해하는 방법?

procodes 2020. 6. 13. 20:23

지역 민감 해싱을 이해하는 방법?

LSH가 고차원 특성을 가진 유사한 항목을 찾는 좋은 방법 인 것 같습니다.

http://www.slaney.org/malcolm/yahoo/Slaney2008-LSHTutorial.pdf 논문을 읽은 후에도 여전히 그 공식과 혼동됩니다.

누구나 쉬운 방법을 설명하는 블로그 나 기사를 알고 있습니까?

내가 LSH에 대해 본 최고의 자습서는 대규모 데이터 세트 마이닝 책에 있습니다. 3 장-유사 항목 찾기 http://infolab.stanford.edu/~ullman/mmds/ch3a.pdf

또한 아래 슬라이드를 추천합니다 : http://www.cs.jhu.edu/%7Evandurme/papers/VanDurmeLallACL10-slides.pdf . 슬라이드의 예제는 코사인 유사성에 대한 해싱을 이해하는 데 많은 도움이됩니다.

ACL2010 Benjamin Van Durme & Ashwin Lall 에서 2 개의 슬라이드를 빌려서 코사인 거리에 대한 LSH 패밀리의 직관을 설명하려고 노력합니다. 여기에 이미지 설명을 입력하십시오

그림에서 2 개의 2 차원 데이터 포인트를 나타내는 빨간색 과 노란색의 두 개의 원이 있습니다 . LSH를 사용하여 코사인 유사성 을 찾으려고합니다 .
회색 선은 균일하게 무작위로 선택된 평면입니다.
데이터 포인트가 회색 선 위 또는 아래에 있는지 여부에 따라이 관계를 0/1로 표시합니다.
왼쪽 위 모서리에는 두 개의 데이터 점의 서명을 나타내는 두 개의 흰색 / 검정색 사각형 행이 있습니다. 각 사각형은 비트 0 (흰색) 또는 1 (검정)에 해당합니다.
따라서 평면 풀이 있으면 평면에 해당하는 위치로 데이터 포인트를 인코딩 할 수 있습니다. 풀에 더 많은 평면이있을 때 서명에 인코딩 된 각도 차이가 실제 차이에 더 가깝다고 상상해보십시오. 두 점 사이에있는 평면 만 두 데이터에 다른 비트 값을 제공하기 때문입니다.

여기에 이미지 설명을 입력하십시오

이제 우리는 두 데이터 포인트의 서명을 살펴 봅니다. 예에서와 같이 각 데이터를 나타내는 데 6 비트 (제곱) 만 사용합니다. 이것이 우리가 가지고있는 원본 데이터의 LSH 해시입니다.
서명이 1 비트 만 다르기 때문에 두 해시 값 사이의 해밍 거리는 1입니다.
서명의 길이를 고려하여 그래프에 표시된 것처럼 각 유사성을 계산할 수 있습니다.

I have some sample code (just 50 lines) in python here which is using cosine similarity. https://gist.github.com/94a3d425009be0f94751

Tweets in vector space can be a great example of high dimensional data.

Check out my blog post on applying Locality Sensitive Hashing to tweets to find similar ones.

http://micvog.com/2013/09/08/storm-first-story-detection/

And because one picture is a thousand words check the picture below:

여기에 이미지 설명을 입력하십시오 http://micvog.files.wordpress.com/2013/08/lsh1.png

Hope it helps. @mvogiatzis

Here's a presentation from Stanford that explains it. It made a big difference for me. Part two is more about LSH, but part one covers it as well.

A picture of the overview (There are much more in the slides):

여기에 이미지 설명을 입력하십시오

Near Neighbor Search in High Dimensional Data - Part1: http://www.stanford.edu/class/cs345a/slides/04-highdim.pdf

Near Neighbor Search in High Dimensional Data - Part2: http://www.stanford.edu/class/cs345a/slides/05-LSH.pdf

LSH is a procedure that takes as input a set of documents/images/objects and outputs a kind of Hash Table.
The indexes of this table contain the documents such that documents that are on the same index are considered similar and those on different indexes are "dissimilar".
Where similar depends on the metric system and also on a threshold similarity s which acts like a global parameter of LSH.
It is up to you to define what the adequate threshold s is for your problem.

It is important to underline that different similarity measures have different implementations of LSH.

In my blog, I tried to thoroughly explain LSH for the cases of minHashing( jaccard similarity measure) and simHashing (cosine distance measure). I hope you find it useful: https://aerodatablog.wordpress.com/2017/11/29/locality-sensitive-hashing-lsh/

I am a visual person. Here is what works for me as an intuition.

Say each of the things you want to search for approximately are physical objects such as an apple, a cube, a chair.

My intuition for an LSH is that it is similar to take the shadows of these objects. Like if you take the shadow of a 3D cube you get a 2D square-like on a piece of paper, or a 3D sphere will get you a circle-like shadow on a piece of paper.

Eventually, there are many more than three dimensions in a search problem (where each word in a text could be one dimension) but the shadow analogy is still very useful to me.

Now we can efficiently compare strings of bits in software. A fixed length bit string is kinda, more or less, like a line in a single dimension.

So with an LSH, I project the shadows of objects eventually as points (0 or 1) on a single fixed length line/bit string.

The whole trick is to take the shadows such that they still make sense in the lower dimension e.g. they resemble the original object in a good enough way that can be recognized.

A 2D drawing of a cube in perspective tells me this is a cube. But I cannot distinguish easily a 2D square from a 3D cube shadow without perspective: they both looks like a square to me.

How I present my object to the light will determine if I get a good recognizable shadow or not. So I think of a "good" LSH as the one that will turn my objects in front of a light such that their shadow is best recognizable as representing my object.

So to recap: I think of things to index with an LSH as physical objects like a cube, a table, or chair, and I project their shadows in 2D and eventually along a line (a bit string). And a "good" LSH "function" is how I present my objects in front of a light to get an approximately distinguishable shape in the 2D flatland and later my bit string.

Finally when I want to search if an object I have is similar to some objects that I indexed, I take the shadows of this "query" object using the same way to present my object in front of the light (eventually ending up with a bit string too). And now I can compare how similar is that bit string with all my other indexed bit strings which is a proxy for searching for my whole objects if I found a good and recognizable way to present my objects to my light.

매우 짧은 tldr 답변 :

로컬 리티에 민감한 해싱의 예는 먼저 입력 공간에서 평면을 임의로 (회전 및 오프셋과 함께) 해시로 설정 한 다음 포인트를 해시로 드롭하고 각 평면에 대해 포인트가 위 또는 아래 (예 : 0 또는 1)이며 대답은 해시입니다. 따라서 공간에서 비슷한 점은 전후의 코사인 거리로 측정하면 유사한 해시를 갖습니다.

scikit-learn을 사용 하여이 예제를 읽을 수 있습니다 : https://github.com/guillaume-chevalier/SGNN-Self-Governing-Neural-Networks-Projection-Layer

참고 URL : https://stackoverflow.com/questions/12952729/how-to-understand-locality-sensitive-hashing

저작자표시 (새창열림)

'Programming' 카테고리의 다른 글

ActiveRecord 속성 메소드 대체 (0)	2020.06.13
Google지도에 영향을주는 Twitter 부트 스트랩 CSS (0)	2020.06.13
Java에서 사용자 정의 예외를 작성하는 방법은 무엇입니까? (0)	2020.06.13
리스트 / 튜플 쌍을 두리스트 / 튜플로 풀기 (0)	2020.06.13
쿼리 실행시 Sequelize가 콘솔에 SQL을 출력하지 못하게 하시겠습니까? (0)	2020.06.13

현재글지역 민감 해싱을 이해하는 방법?

procodes

지역 민감 해싱을 이해하는 방법?

지역 민감 해싱을 이해하는 방법?

'Programming' 카테고리의 다른 글

'Programming'의 다른글

티스토리툴바

지역 민감 해싱을 이해하는 방법?

지역 민감 해싱을 이해하는 방법?

'Programming' 카테고리의 다른 글

'Programming'의 다른글

관련글

티스토리툴바