Hi there,
I'd just like to ask about the proper way to do English word stemming, and how affixes are separated from the stem (root word) and recognized so that the word is ready for translation into another language, using Java.
Thanks in advance
You will be dealing with Java's built-in String functions or string libraries.
You should have a list of root words stored in memory or in a database.
Try one of the following stemming algorithms; the best solution would be to combine them to get the root word.
1) Use brute force - In Java, create a Hashtable, Map, or HashMap (the implementation choice is yours). The root word is the key (which of course is unique), and the value is a collection/array of its inflected forms (i.e. the key is 'run', the values would be a collection - runs, running, ran, ... etc). To stem a word, look for it among the values and return the matching key.
2) Use suffix stripping - In the case of 'running', prune the string to take out 'ing' (i.e. running - runn, eating - eat). Once you have the remainder, look it up with the brute force algorithm above. Note that a remainder like 'runn' only matches if you also undo the doubled consonant.
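Here is a minimal sketch of how the two steps above could be combined in plain Java. Everything in it is an assumption for illustration: the class name SimpleStemmer, the tiny sample word list, the suffix list, and the doubled-consonant rule are hypothetical and not taken from any library. A real project would load the root words from a file or database.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Optional;

final class SimpleStemmer {
    // Hypothetical sample data: root word -> known inflected forms.
    private static final Map<String, List<String>> ROOTS = Map.of(
        "run", List.of("runs", "running", "ran"),
        "eat", List.of("eats", "eating", "ate"),
        "walk", List.of("walks"),
        "stop", List.of("stops"));

    // Reverse index built once: every known form (and each root) -> its root.
    private static final Map<String, String> FORM_TO_ROOT = new HashMap<>();
    static {
        ROOTS.forEach((root, forms) -> {
            FORM_TO_ROOT.put(root, root);
            forms.forEach(form -> FORM_TO_ROOT.put(form, root));
        });
    }

    // Assumed suffix list, longest first; extend as needed.
    private static final List<String> SUFFIXES = List.of("ing", "ed", "es", "s");

    static Optional<String> stem(String word) {
        String w = word.toLowerCase(Locale.ROOT);

        // Step 1: brute force lookup of the whole word.
        String root = FORM_TO_ROOT.get(w);
        if (root != null) {
            return Optional.of(root);
        }

        // Step 2: strip a suffix, then check the remainder against the roots.
        for (String suffix : SUFFIXES) {
            if (!w.endsWith(suffix) || w.length() <= suffix.length() + 1) {
                continue;
            }
            String rest = w.substring(0, w.length() - suffix.length());
            if (ROOTS.containsKey(rest)) {
                return Optional.of(rest);
            }
            // Undo a doubled final consonant, e.g. "stopp" -> "stop".
            int n = rest.length();
            if (n >= 2 && rest.charAt(n - 1) == rest.charAt(n - 2)) {
                String undoubled = rest.substring(0, n - 1);
                if (ROOTS.containsKey(undoubled)) {
                    return Optional.of(undoubled);
                }
            }
        }
        return Optional.empty(); // unknown word
    }

    public static void main(String[] args) {
        for (String w : List.of("ran", "walking", "stopping", "xyzzy")) {
            System.out.println(w + " -> " + stem(w).orElse("(no root found)"));
        }
    }
}
```

Irregular forms like 'ran' are only found because they are listed in the table; suffix stripping alone cannot recover them, which is why combining both steps works better than either one alone.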
Those are the easiest algorithms I can think of; other approaches you can use are lemmatisation algorithms and stochastic algorithms.
Better yet, if you want to pursue this project, I suggest looking at libtranslate; you can download the source code and look at its implementation. Although it is in C/C++, you can still study its algorithm. It's simple enough to provide your own natural language conversion plugin to convert whole paragraphs/documents from English->Cebuano and vice versa.
http://www.nongnu.org/libtranslate
Or if you really want to dig deeper into Natural Language Processing ... get the NLTK (Natural Language Toolkit), which is written in Python - http://www.nltk.org
Cheers!
Last edited by kolz; 07-08-2009 at 10:30 AM.
OK guys, I will try to take a look at your suggested algorithms and generators. But are these generators built into Java? I'd really like to keep everything in Java, guys.
Just want to make the point that NLP (Natural Language Processing) is different from Formal Language Processing (more commonly known as Computer Language Processing).
Natural Language refers to written and spoken languages (like English, Cebuano, Tagalog ... etc.), while Formal Language/Computer Language Processing deals with languages such as C++, Perl, and what-have-you.
The tools suggested by eax are for Formal Languages, where you need parsers and parser generators (lex, yacc, bison, etc.). These tools are used for crafting your own compiler, interpreter, etc. They target machine language (computer language). The concepts can sometimes overlap when it comes to parsing, but they are two entirely different sets of tools and should not be confused with NLP.
For NLP, which is what the TS needs ... try OpenNLP, which is a set of Java tools for Natural Language Processing ... The OpenNLP Homepage
To quote:
"OpenNLP is an organizational center for open source projects related to natural language processing. Its primary role is to encourage and facilitate the collaboration of researchers and developers on such projects. Click here to see the current list of OpenNLP projects. We'll also try to keep a fairly up-to-date list of useful links related to NLP software in general.
OpenNLP also hosts a variety of java-based NLP tools which perform sentence detection, tokenization, pos-tagging, chunking and parsing, named-entity detection, and coreference using the OpenNLP Maxent machine learning package. To start using these tools download the latest release here, and check out the OpenNLP Tools API. For the latest news about these tools and to participate in discussions, check out OpenNLP's Sourceforge project page. "
Also check out the Stanford University Natural Language Processing lab ...
The Stanford NLP (Natural Language Processing) Group
They have an online version of their Java NLP parser at work ...
http://nlp.stanford.edu:8080/parser/index.jsp
Last edited by kolz; 07-09-2009 at 10:29 AM.
Yep, thanks for the clarification, kolz. That's why I was asking whether the suggested generator is built into Java, because it seems to be a separate language-parsing generator, independent of NetBeans. But at least the idea is there: the way words are parsed in those suggested generators can be used with NetBeans, I think.