Essence

Essence is a Natural Language Processing (NLP) and Text Summarization library for Elixir. The project is currently at a very early stage.

Installation

If available in Hex, the package can be installed by adding essence to your list of dependencies in mix.exs:

def deps do
  [{:essence, "~> 0.1.0"}]
end

Examples

In the following examples we will use test/genesis.txt, a copy of the Book of Genesis from the King James Bible (http://www.gutenberg.org/ebooks/8001.txt.utf-8).

We provide a convenience function for reading the plain text of the Book of Genesis into Essence: Essence.genesis/1.

Let's first create a document from the text:

  iex> document = Essence.Document.from_text Essence.genesis

The text contains 44,741 tokens, 1,533 paragraphs and 1,663 sentences, which we can check by counting each in turn:

  iex> document |> Essence.Document.enumerate_tokens |> Enum.count
  iex> document |> Essence.Document.paragraphs |> Enum.count
  iex> document |> Essence.Document.sentences |> Enum.count
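
To make the three counts concrete, here is a minimal sketch of naive segmentation in plain Elixir. The module SegmentSketch and its splitting rules are hypothetical and are not Essence's implementation; the sketch only assumes, as the token list further below suggests, that punctuation marks count as tokens of their own.

```elixir
# Hypothetical helper module for illustration; not part of Essence.
defmodule SegmentSketch do
  # Paragraphs: blocks of text separated by at least one blank line.
  def paragraphs(text) do
    String.split(text, ~r/\n\s*\n/, trim: true)
  end

  # Sentences: split on whitespace that follows ".", "!" or "?".
  def sentences(text) do
    String.split(text, ~r/(?<=[.!?])\s+/, trim: true)
  end

  # Tokens: runs of word characters, or single punctuation marks.
  def tokens(text) do
    ~r/\w+|[^\w\s]/u
    |> Regex.scan(text)
    |> List.flatten()
  end
end

text = "The cat sat. The dog ran!\n\nWho won?"

text |> SegmentSketch.paragraphs() |> Enum.count() # 2
text |> SegmentSketch.sentences() |> Enum.count()  # 3
text |> SegmentSketch.tokens() |> Enum.count()     # 11
```

Real text needs more care than this (abbreviations, quotations, verse numbering), which is why a dedicated library is useful.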

What might the first sentence of Genesis be?

  iex> Essence.Document.sentence document, 0

Now let's compute the frequency distribution for tokens in the Book of Genesis:

  iex> fd = Essence.Vocabulary.freq_dist document

What is the vocabulary of this text?

  iex> vocabulary = Essence.Vocabulary.vocabulary document

Alternatively, the keys of the frequency distribution give the same vocabulary:

  iex> vocabulary = Map.keys fd
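
The relationship between a frequency distribution and a vocabulary can be sketched on a toy token list using only the Elixir standard library. This assumes, as the Map.keys call above implies, that the distribution is a map from token to count; the token list is made up for illustration and no Essence functions are called.

```elixir
# Toy stand-in for a tokenized document (made-up data).
tokens = ["and", "the", "and", "of", "the", "and"]

# Enum.frequencies/1 (Elixir 1.10+) maps each distinct token to its count.
fd = Enum.frequencies(tokens)
# fd is %{"and" => 3, "of" => 1, "the" => 2}

# The vocabulary is the set of distinct tokens, i.e. the keys of the map.
vocabulary = Map.keys(fd)

# The counts add up to the total number of tokens.
total = fd |> Map.values() |> Enum.sum()
# total is 6
```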

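What might the most frequent tokens be? Note that Enum.slice(1, 10) starts at index 1, so the expression below returns the ten tokens ranked second through eleventh: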

  iex> vocabulary |> Enum.sort_by( fn(x) -> Map.get(fd, x) end, &>=/2 ) |> Enum.slice(1, 10)
  ["and", "the", "of", ".", "And", ":", "his", "he", "to", ";"]
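
The ranking step can be sketched on a small, made-up frequency map. The sketch sorts the map entries directly rather than looking each token up again, and shows how Enum.take/2 differs from Enum.slice/3 with a start index of 1. The :desc sorter requires Elixir 1.10 or later.

```elixir
# Made-up frequency map with distinct counts, so the ranking is unambiguous.
fd = %{"and" => 3, "the" => 2, "of" => 1}

# Sort the {token, count} pairs by count, highest first, then keep the tokens.
ranked =
  fd
  |> Enum.sort_by(fn {_token, count} -> count end, :desc)
  |> Enum.map(fn {token, _count} -> token end)
# ranked is ["and", "the", "of"]

# Enum.take/2 starts at the most frequent token.
top_two = Enum.take(ranked, 2)
# top_two is ["and", "the"]

# Enum.slice/3 with a start index of 1 skips the most frequent token.
skipping_first = Enum.slice(ranked, 1, 2)
# skipping_first is ["the", "of"]
```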

Next, we can compute the lexical richness of the text:

  iex> Essence.Vocabulary.lexical_richness document
  16.74438622754491
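
One common reading of lexical richness is the average number of times each distinct token is used: total tokens divided by distinct tokens. The figures above are consistent with that reading, since 44,741 tokens divided by 16.744 gives roughly 2,672 distinct tokens, but that vocabulary size is inferred here and the definition is an assumption rather than a statement about Essence's code. RichnessSketch is a hypothetical module for illustration.

```elixir
# Hypothetical module for illustration; not part of Essence.
defmodule RichnessSketch do
  # Assumed definition: total tokens divided by distinct tokens.
  # Raises ArithmeticError on an empty list (division by zero).
  def lexical_richness(tokens) do
    length(tokens) / length(Enum.uniq(tokens))
  end
end

# 6 tokens, 3 of them distinct.
RichnessSketch.lexical_richness(["and", "the", "and", "of", "the", "and"])
# returns 2.0
```

Under this definition a value of 1.0 means every token is unique, and larger values mean more repetition.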

ToDo

[x] Tokenization (Basic, done)
[x] Sentence Detection and Chunking (Basic, done)
[x] Vocabulary (Basic, done)
[x] Documents (Draft, done)
[ ] Readability (ARI done, SMOG in progress, FC, GF, DC, CL todo)
[ ] Corpora
[ ] Bi-Grams
[ ] Tri-Grams
[ ] n-Grams
[ ] Frequency Measures
[ ] Time-Series Documents
[ ] Dispersion
[ ] Similarity Measures
[ ] Part of Speech Tagging
[ ] Sentiment Analysis
[ ] Classification
[ ] Summarization
[ ] Document Hierarchies
