>>5435
>Can't tell which end of the bell curve you're on, but it's not the middle
In practice, hashes behave as if they were unique for any input: cryptographic ones are designed so that finding 2 things with the same hash is computationally infeasible, and even a one-character change to the input produces a completely different digest. Hardly useful for identifying the same author of different texts. However, there is a concept called locality-sensitive hashing which might be useful (or unnecessary) for you.
What you want is a machine learning model that takes in text and spits out a lower-dimensional vector describing it (analogous to a convolutional net producing a feature vector for an image). Then you can apply locality-sensitive hashing to collapse similar vectors into buckets you treat as identities. For outliers, you'd even be able to give a % chance of each identity by comparing the vector against each identity's average output.
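A minimal sketch of the locality-sensitive-hashing step, assuming your model already gives you fixed-size embedding vectors (the dimension, number of hyperplanes, and the "similar" vector below are made-up stand-ins for whatever your model actually outputs):

import numpy as np

def lsh_bucket(vec, hyperplanes):
    # Random-hyperplane LSH: the sign pattern of the vector against a fixed set of
    # hyperplanes is the bucket key, so nearby vectors tend to land in the same bucket.
    bits = (hyperplanes @ vec) > 0
    return bits.astype(int).tobytes()   # hashable bucket id

rng = np.random.default_rng(0)
dim, n_planes = 64, 16                  # embedding size and hash length are guesses
hyperplanes = rng.standard_normal((n_planes, dim))

a = rng.standard_normal(dim)            # embedding of text 1
b = a + 0.01 * rng.standard_normal(dim) # stand-in for an embedding of a similar text
print(lsh_bucket(a, hyperplanes) == lsh_bucket(b, hyperplanes))  # usually True

Fewer hyperplanes means more collisions (coarser identities); more means stricter matching.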
However, if you do the machine learning properly, you won't need hashing; it's just a way to calibrate the model after the fact if you fuck up.
Now, for how to create and train the model, the first step is data. Ideally, you would have a large, diverse corpus of text labelled by author. If not, this is easy enough to create by scraping the web.
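For what it's worth, once you have the data, a flat layout of one folder per author is enough to train on; something like this loads it into (text, author) pairs (the corpus/ directory and one-.txt-per-document convention are my assumptions, not anything standard):

from pathlib import Path

def load_corpus(root="corpus"):
    # Expects corpus/<author_name>/<doc>.txt; returns (text, author) training pairs.
    pairs = []
    for author_dir in Path(root).iterdir():
        if not author_dir.is_dir():
            continue
        for doc in author_dir.glob("*.txt"):
            pairs.append((doc.read_text(encoding="utf-8"), author_dir.name))
    return pairs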
This problem is very similar to facial identification (different from facial recognition; the text analogue of recognition would be a program that decides whether text was produced randomly or intelligently, not who produced it) in that both are solved by transforming the data into a vector of features which can then be compared in the resulting metric space, and both fundamentally deal with identity. E.g., the principal component of the facial metric space probably corresponds to gender. By comparing how white, how black, how asian, how fat, how thin, how masculine, how feminine, etc., the faces in multiple pictures are, you can tell which ones are probably of the same person. It is the same process for identifying the author of a text: the text has analogous features like syntax, verbosity, vocabulary, and tone which, taken together, can identify an author.
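Once the model maps texts into that feature space, the comparison itself is trivial, e.g. cosine similarity between two texts' vectors (the 0.8 cutoff below is a placeholder you'd tune on held-out author pairs, not a known-good value):

import numpy as np

def probably_same_author(vec_a, vec_b, threshold=0.8):
    # Cosine similarity between two texts' feature vectors in the learned metric space.
    cos = vec_a @ vec_b / (np.linalg.norm(vec_a) * np.linalg.norm(vec_b))
    return cos >= threshold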
Anyway, set up a model with an appropriate input size for your data and your best guess at how many features you need in the output (you will tweak this until you stop seeing improvement). The cost function should be the squared euclidean distance (proportional to euclidean, but more computationally efficient) of the output vector from the average of outputs for the same author, minus the sum of its distances from each of the other authors' averages, normalized. If it doesn't work after training completes, mess with the output size and try again. Once it's as good as it'll get, optionally apply locality-sensitive hashing for maximum effort.
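Here's roughly what that cost looks like for a batch, written in PyTorch as I read the description (author averages are computed within the batch for simplicity, and the batch has to contain at least two authors; a sketch, not a canonical loss):

import torch

def authorship_loss(embeddings, author_ids):
    # embeddings: (N, D) model outputs for a batch; author_ids: (N,) integer labels.
    unique_ids = author_ids.unique()
    centroids = torch.stack([embeddings[author_ids == a].mean(dim=0) for a in unique_ids])
    # Squared euclidean distance from every embedding to every author average
    # (proportional to euclidean, but cheaper: no square root).
    d = torch.cdist(embeddings, centroids).pow(2)               # (N, num_authors)
    own = author_ids.unsqueeze(1) == unique_ids.unsqueeze(0)    # (N, num_authors) mask
    pull = d[own]                                               # distance to own author's average
    push = d[~own].reshape(len(embeddings), -1).sum(dim=1)      # sum of distances to the rest
    return ((pull - push) / len(unique_ids)).mean()             # normalized, averaged over batch

Minimizing it pulls each text toward its own author's average output and pushes it away from everyone else's, which is exactly the clustering behaviour you want before any hashing.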