findigl powers up its technology with Natural Language Processing to turn documents into reportable numbers

Natural Language Processing (NLP) is a technique used to take content and classify a document in terms of its structure. More advanced NLP attempts to infer meaning and context. One major goal of NLP is Artificial Intelligence, whereby computers can converse with humans, and with each other, in a manner we understand.

The purpose of this document is to introduce a new software application developed to create reportable data from documents. The application will certainly be enhanced and improved; there are many improvements to the software I already have in mind. In addition, the accuracy needs to be improved.

Please contact me or visit my website at https://www.inforhino.co.uk

Basic premise of the NLP application by Info Rhino

One of our platforms, soon to be re-released, is findigl. It is a property platform that takes sources of information and places them into a reportable format, helping users understand the property market better. The solution displays maps and tables, and includes third-party articles to provide a powerful perspective on the property market. The biggest challenge with a data platform is the data itself - how to source it and restructure the information into a reportable format.

For findigl, the goal is to search property listings and adverts, extract the useful information, and create a structured dataset which can be used within web and reporting applications.

WordNet

For many years, I have been aware of Princeton University's WordNet database. WordNet is a database of lexical and conceptual semantic relations - in plain terms, a thesaurus. The relationships found within WordNet are: AlsoSee, Antonym, Attribute, Cause, DerivationallyRelated, DerivedFromAdjective, Entailment, Hypernym, Hyponym, InstanceHypernym, InstanceHyponym, MemberHolonym, MemberMeronym, PartHolonym, ParticipleOfVerb, PartMeronym, Pertainym, RegionDomain, RegionDomainMember, SimilarTo, SubstanceHolonym, SubstanceMeronym, TopicDomain, TopicDomainMember, UsageDomain, UsageDomainMember and VerbGroup.

WordNet also puts a focus on parts of speech, which are as follows: (None), Noun, Verb, Adjective, Adverb.
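Purely as an illustration (this is my sketch, not the findigl source), those parts of speech could be modelled as a C# enum. The numbering is an assumption, chosen so that it lines up with the numeric PartsOfSpeech values ([3,2,1] for Adjective, Verb, Noun) that appear in the JSON output later in this post.

// Hypothetical sketch - not the actual findigl source. The numeric values
// are inferred from the output shown later, where [3,2,1] corresponds to
// Adjective, Verb, Noun; None = 0 is assumed as the default.
public enum PartOfSpeech
{
    None = 0,
    Noun = 1,
    Verb = 2,
    Adjective = 3,
    Adverb = 4
}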

Being that I never attended a private school, I was late to the grammar party.

Other Technologies

We use proprietary code to extract numerical information, handle special words, and exclude certain words, helping to cover use cases that WordNet cannot.

 

Info Rhino's Natural Language Processing in practice

We have a definition file, which should be relatively easy to understand:

[{
    "PartsOfSpeech": ["Adjective", "Verb", "Noun"],
    "synSetMode": "RelatedSynsets",
    "SynSetRelations": ["Hypernym", "Hyponym"],
    "LexicalRelations": ["DerivationallyRelated"],
    "Catgegories": ["Testing it out"],
    "SearchWords": ["farming", "old", "animal", "beef", "food"],
    "SpecialWords": ["dinmore"],
    "ExcludedWords": ["to", "and", "it"],
    "MinimumMatchRank": 1,
    "MatchDirection": "DocumentToConfiguredSearch"
}]
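
As a rough sketch of how this definition file could be modelled in C# - the property names simply mirror the JSON keys above (including the Catgegories spelling), while the class names and the use of System.Text.Json are my own assumptions rather than the application's actual types:

using System.Collections.Generic;
using System.IO;
using System.Text.Json;
using System.Text.Json.Serialization;

// Hypothetical model of the definition file; property names mirror the JSON
// keys shown above, the types themselves are an assumption.
public class SearchDefinition
{
    public List<string> PartsOfSpeech { get; set; }

    [JsonPropertyName("synSetMode")]
    public string SynSetMode { get; set; }

    public List<string> SynSetRelations { get; set; }
    public List<string> LexicalRelations { get; set; }
    public List<string> Catgegories { get; set; }
    public List<string> SearchWords { get; set; }
    public List<string> SpecialWords { get; set; }
    public List<string> ExcludedWords { get; set; }
    public int MinimumMatchRank { get; set; }
    public string MatchDirection { get; set; }
}

public static class DefinitionLoader
{
    // The file is a JSON array, so it deserialises to a list of definitions.
    public static List<SearchDefinition> Load(string path) =>
        JsonSerializer.Deserialize<List<SearchDefinition>>(File.ReadAllText(path));
}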

The application initialises the databases and searches a directory for files.
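
The initialisation itself is not shown here, but a simplified, hypothetical driver loop - using output file names that match the attachments listed further down - might look like this:

using System.IO;

// Hypothetical driver loop, not the actual findigl code. Each document gets a
// "<name>_DefinedNLPMatching.json" result file, matching the attachment names
// shown later in this post. MatchDocument stands in for the matching logic.
public static class NlpRunner
{
    public static void Run(string inputDirectory, string definitionPath)
    {
        foreach (var path in Directory.EnumerateFiles(inputDirectory, "*.txt"))
        {
            var content = File.ReadAllText(path);
            var resultJson = MatchDocument(content, definitionPath);
            var outputPath = Path.Combine(
                inputDirectory,
                Path.GetFileNameWithoutExtension(path) + "_DefinedNLPMatching.json");
            File.WriteAllText(outputPath, resultJson);
        }
    }

    // Placeholder only - the real matching is described in the rest of the post.
    private static string MatchDocument(string content, string definitionPath) => "{}";
}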

One of the files contains the following content:

Cattle
50 acres with shoes.
cow 

dinmore

Running the application against this file produces the following output:

{"Content":"Cattle\r\n50 acres with shoes.\r\ncow\r\ndinmore","ContentWords":["cattle","50","acres","with","shoes","cow","dinmore"],"searchTerm":{"PartsOfSpeech":[3,2,1],"SynSetRelations":[8,9],"LexicalRelations":[5],"synSetMode":1,"Catgegories":["Testing it out"],"SearchWords":["farming","old","animal","beef","food"],"SpecialWords":["dinmore"],"ExcludedWords":["to","and","it"],"MinimumMatchRank":1,"MatchDirection":0},"matchComponents":[{"matchedStrength":1,"Source":"beef cattle","Target":"cattle","SearchWord":"beef","Sources":["beef","cattle"],"Targets":["cattle"],"MatchStrengthRanking":1},{"matchedStrength":3,"Source":"cattle","Target":"cattle","SearchWord":"beef","Sources":["cattle"],"Targets":["cattle"],"MatchStrengthRanking":3}],"criteriaWordBridge":null,"specialMatches":[{"Word":"dinmore","MatchStrengthRanking":4,"matchedStrength":4}]}

Many other files are produced - a key file being a full NLP Analysis file, which applies NLP against the content and acts as a verification of the output.

Cattle.txt (42.00 bytes)

Cattle_DefinedNLPMatching.json (853.00 bytes)

Cattle_FullNLPAnalysis.json (21.54 kb)

Word treatment, tokenisation and stemming (Lemma), match ranking

Our NLP application identifies sentences, but currently focuses on words rather than sentences and context. We generate a list of synset-related information for either the document or the configured search words. We haven't yet looked at stemming or lemmatisation, which reduce a word to its canonical form.
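
Purely to illustrate the word-level treatment (this is not the application's actual tokeniser), a simple pass that splits on whitespace, trims punctuation and lowercases each token would reproduce the ContentWords array seen in the output above:

using System;
using System.Linq;

public static class Tokeniser
{
    // Illustrative tokeniser: split on whitespace, trim punctuation, lowercase.
    // Applied to the Cattle example it yields
    // ["cattle", "50", "acres", "with", "shoes", "cow", "dinmore"].
    public static string[] Tokenise(string content) =>
        content
            .Split((char[])null, StringSplitOptions.RemoveEmptyEntries)
            .Select(w => w.Trim('.', ',', ';', ':', '!', '?', '"', '\''))
            .Where(w => w.Length > 0)
            .Select(w => w.ToLowerInvariant())
            .ToArray();
}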

Our application grades matches as follows:

/// <summary>
/// The stronger match is last. Helps to create a benchmark of matching capability.
/// </summary>
public enum MatchedStrength
{
    Excluded = -2,
    Ignored = -1,
    Unmatched = 0,
    LooseMatch = 1,
    CleanedMatch = 2,
    FullMatch = 3,
    SpecialMatch = 4
}

To reduce the amount of garbage output, we allow a minimum match strength filter to be configured.
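
A minimal sketch of that filter, assuming a result shape loosely based on the matchComponents entries in the output above (the type and member names here are my own, not the application's):

using System.Collections.Generic;
using System.Linq;

// Assumed shape of a single match, loosely modelled on the matchComponents
// entries in the JSON output shown earlier.
public class MatchComponent
{
    public MatchedStrength MatchedStrength { get; set; }
    public string Source { get; set; }
    public string Target { get; set; }
    public string SearchWord { get; set; }
}

public static class MatchFilter
{
    // Drop anything below the configured minimum strength; for example, a
    // minimum of LooseMatch (1) removes Excluded, Ignored and Unmatched results.
    public static IEnumerable<MatchComponent> ApplyMinimum(
        IEnumerable<MatchComponent> matches, MatchedStrength minimum) =>
        matches.Where(m => m.MatchedStrength >= minimum);
}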

Numerical extraction

From another file, we extracted the following numerical values:

[{"Text":"50 acres","Start":8,"End":15,"TypeName":"dimension","Resolution":{"unit":"Acre","value":"50"}}]

Consider the potential - finding numerical amounts in documents, alongside their classifications.
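
The numerical extraction itself is proprietary, but as a rough regular-expression illustration of the idea (not the actual implementation, and the unit list is only an example):

using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class NumberExtractor
{
    // Illustrative only: find "number + unit" spans such as "50 acres" and
    // report them with their character positions, similar in spirit to the
    // dimension result shown above (End is the index of the last character).
    private static readonly Regex Dimension =
        new Regex(@"(?<value>\d+(\.\d+)?)\s*(?<unit>acres?|sq\s*ft|miles?)",
                  RegexOptions.IgnoreCase);

    public static IEnumerable<(string Text, int Start, int End, string Unit, string Value)>
        Extract(string content)
    {
        foreach (Match m in Dimension.Matches(content))
        {
            yield return (m.Value, m.Index, m.Index + m.Length - 1,
                          m.Groups["unit"].Value, m.Groups["value"].Value);
        }
    }
}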

The underlying technology

I am a .Net and .Net Core developer (C#). Whilst I am certain I could pick up enough Python in a weekend to start building an application, my preference is to stick with C#, where I am most proficient. Similarly, SharpNLP exists as a .Net port of a Java NLP library, but I preferred to work with WordNet.

The application and results data objects will certainly be refined, but they should already allow reporting to be undertaken on content.

Find out more about our NLP application

Feel free to visit Info Rhino's website. You can contact me through this blog.

We aren't at the production stage yet; when we are, we will need to cite Princeton University and thank them for their excellent database.
