NLGen2 GSoC Project: GSoC application

Task Description:

This proposal outlines the construction of a natural language generation module for integration with the RelEx system. The solution combines the ideas underlying the SegSim approach with the formulator suggested by Guhe's (2003) Incremental Conceptualization and Levelt's (1989) theory of speaking. From a linguistic standpoint, the approach is most heavily influenced by the Minimalist Program (Chomsky 1995) and Simpler Syntax (Culicover and Jackendoff 2005).

NLGen2 will consist of several interleaved phases of operation. Each phase roughly corresponds to a task given by the SegSim approach, but the phases are interleaved and share a common buffer to allow for incremental processing and parallel computation. To illustrate the operation of each phase, a running example will be given. This example will sketch the generation of the sentence “Alice ate the mushroom with a spoon.” The example will not address ambiguity resolution, which would significantly complicate matters; its purpose is simply to illustrate the general operation of the algorithm.

The input to NLGen2 will be propositions similar to those generated by RelEx. These propositions will be grouped together to form preverbal messages (Levelt 1989). Once a preverbal message is formed, it is added to a formulation buffer, which serves as a shared resource for the other phases of processing. Forming syntactic structures from the contents of the formulation buffer consists of a select/merge/linearize/deselect cycle. When a syntactic structure is complete, as defined by the particular grammar formalism being used, it will be “spelled out” for sentence production. The input for the example is the output of RelEx's simple view for the example sentence:

with(ate, spoon)

_obj(ate, mushroom)

_subj(ate, Alice)

tense(ate, past)

inflection-TAG(ate, .v)

pos(ate, verb)

pos(., punctuation)

inflection-TAG(spoon, .n)

pos(spoon, noun)

noun_number(spoon, singular)

pos(with, prep)

pos(a, det)

DEFINITE-FLAG(mushroom, T)

inflection-TAG(mushroom, .s)

pos(mushroom, noun)

noun_number(mushroom, singular)

DEFINITE-FLAG(Alice, T)

gender(Alice, feminine)

inflection-TAG(Alice, .f)

person-FLAG(Alice, T)

pos(Alice, noun)

noun_number(Alice, singular)

pos(the, det)

Preverbal message formation will be based around the object and event entities described by the propositional input to NLGen2. The entities will be recognized and combined by using simple variable matching rules to group together the propositions that describe an entity. This stage of processing will also encompass lemma identification. If no lemma is given with the input, NLGen2 will query link-grammar's dictionary to find appropriate lemmas. Link structures will be bound to preverbal messages by querying the link-grammar dictionary with the message's lemma. Simple syntactico-semantic rules will then be applied to associate links with propositions.
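
The grouping step described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the class and method names (PreverbalMessageSketch, Proposition, group) are hypothetical and are not part of RelEx, link-grammar, or NLGen, and the rule that any argument of _subj, _obj, or "with" names an entity is a simplification of the variable matching rules. Lemma lookup and link binding are left out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Illustrative sketch only; none of these names come from RelEx or NLGen.
final class PreverbalMessageSketch {

    /** A binary RelEx-style proposition such as _subj(ate, Alice). */
    static final class Proposition {
        final String predicate;
        final String first;
        final String second;

        Proposition(String predicate, String first, String second) {
            this.predicate = predicate;
            this.first = first;
            this.second = second;
        }

        boolean mentions(String entity) {
            return first.equals(entity) || second.equals(entity);
        }

        @Override
        public String toString() {
            return predicate + "(" + first + ", " + second + ")";
        }
    }

    // Assumption: both arguments of these predicates name entities.
    static final Set<String> ENTITY_RELATIONS =
            new LinkedHashSet<>(Arrays.asList("_subj", "_obj", "with"));

    /** Groups propositions by the entity they mention, in order of first mention. */
    static Map<String, List<Proposition>> group(List<Proposition> input) {
        Map<String, List<Proposition>> messages = new LinkedHashMap<>();
        // First pass: discover the entities.
        for (Proposition p : input) {
            if (ENTITY_RELATIONS.contains(p.predicate)) {
                messages.putIfAbsent(p.first, new ArrayList<>());
                messages.putIfAbsent(p.second, new ArrayList<>());
            }
        }
        // Second pass: attach every proposition to each entity it mentions.
        for (Proposition p : input) {
            for (Map.Entry<String, List<Proposition>> message : messages.entrySet()) {
                if (p.mentions(message.getKey())) {
                    message.getValue().add(p);
                }
            }
        }
        return messages;
    }

    public static void main(String[] args) {
        List<Proposition> input = Arrays.asList(
                new Proposition("with", "ate", "spoon"),
                new Proposition("_obj", "ate", "mushroom"),
                new Proposition("_subj", "ate", "Alice"),
                new Proposition("tense", "ate", "past"),
                new Proposition("pos", "with", "prep"),
                new Proposition("noun_number", "spoon", "singular"));
        group(input).forEach((entity, props) -> System.out.println(entity + " -> " + props));
    }
}
```

Propositions that mention no entity, such as pos(with, prep), fall out of every group in this sketch, which matches their absence from the preverbal messages in the running example.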

The preverbal messages generated from the example input are:

PVM1:

_subj(2, 1)

DEFINITE-FLAG(1, T)

gender(1, feminine)

inflection-TAG(1, .f)

person-FLAG(1, T)

pos(1, noun)

noun_number(1, singular)

lemma(1, “Alice”)

PVM2:

_obj(2,3)

_subj(2,1)

tense(2, past)

inflection-TAG(2, .v)

pos(2, verb)

with(2,4)

lemma(2, “eat”)

PVM3:

_obj(2,3)

DEFINITE-FLAG(3, T)

inflection-TAG(3, .s)

pos(3,noun)

noun_number(3, singular)

lemma(3, “mushroom”)

PVM4:

with(2,4)

inflection-TAG(4, .n)

pos(4, noun)

noun_number(4, singular)

lemma(4, “spoon”)

Each of these preverbal messages is formed by grouping propositions around a variable that appears in a _subj, _obj, or prepositional predicate; each such variable identifies a distinct entity. Several other types of predicates, such as clausal connectives, would also signal entity identification. Notice that in the above example the lemma for preverbal message 2 is not the same as the word it corresponds to in the input. This is because the preverbal message generation process will look up or derive the root form of a word. Full link information has been omitted for space, but the particular link/propositional bindings relevant to this example are:

PVM1:

_subj(2,1) & noun_number(1, singular) : Ss+

PVM2:

_subj(2,1): S-

_obj(2,3): O+

with(2,4): MV+

PVM3:

_obj(2,3): O-

PVM4:

with(2,4): MV-
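
To make the bindings above concrete, here is a minimal sketch of how two connectors might be tested for compatibility. It is a simplification written for this example and is not link-grammar's API: the names ConnectorSketch, head, subscript, and canLink are hypothetical, and the rule used (equal upper-case type, a "+" connector on the left word meeting a "-" connector on the right word, subscripts compared position by position with "*" as a wildcard) is an assumption about connector matching that ignores the rest of the dictionary syntax.

```java
// Illustrative sketch only; this is not the link-grammar API.
final class ConnectorSketch {

    /** Returns the leading upper-case part of a connector, e.g. "S" for "Ss+". */
    static String head(String connector) {
        int i = 0;
        while (i < connector.length() && Character.isUpperCase(connector.charAt(i))) {
            i++;
        }
        return connector.substring(0, i);
    }

    /** Returns the subscript between type and direction, e.g. "s" for "Ss+", "" for "S-". */
    static String subscript(String connector) {
        return connector.substring(head(connector).length(), connector.length() - 1);
    }

    /** True if the left word's connector can link to the right word's connector. */
    static boolean canLink(String left, String right) {
        // The left word must point rightward (+) and the right word leftward (-).
        if (!left.endsWith("+") || !right.endsWith("-")) {
            return false;
        }
        if (!head(left).equals(head(right))) {
            return false;
        }
        String a = subscript(left);
        String b = subscript(right);
        // Assumption: a missing subscript position matches anything.
        int shared = Math.min(a.length(), b.length());
        for (int i = 0; i < shared; i++) {
            char x = a.charAt(i);
            char y = b.charAt(i);
            if (x != y && x != '*' && y != '*') {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(canLink("Ss+", "S-"));   // PVM1 with PVM2
        System.out.println(canLink("O+", "O-"));    // PVM2 with PVM3
        System.out.println(canLink("MV+", "MV-"));  // PVM2 with PVM4
        System.out.println(canLink("Ss+", "O-"));   // different link types
    }
}
```

Under these assumptions the Ss+ connector bound in PVM1 is compatible with the S- connector bound in PVM2, which is the pairing the merger of those two messages relies on.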

Ambiguity will be handled in one of three ways, selectable by the user. The least resource-intensive strategy will choose one of the returned lexical items arbitrarily. A slightly more resource-intensive option will perform a probability analysis of each lemma based on corpus occurrences. The third option is to fork the generation process and perform ongoing statistical analysis of the formulated preverbal messages and syntactic objects. If the likelihood of a solution drops below a certain threshold and other solutions meet the threshold, that solution will be abandoned. A preverbal message will be considered complete, and will be added to the formulation buffer, once all required links for the given lemma are fulfilled by at least one proposition, which may reference other preverbal messages.
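
The abandonment rule for the third strategy can be sketched as a small pruning function. The names here (AmbiguityPruningSketch, prune) and the representation of candidate solutions as a map from a label to a likelihood are assumptions made for illustration; how likelihoods are actually estimated is not specified in this proposal.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative sketch only; the likelihood values are supplied by the caller.
final class AmbiguityPruningSketch {

    /**
     * Drops candidates below the threshold, but only when at least one
     * other candidate meets it; otherwise every candidate is kept.
     */
    static Map<String, Double> prune(Map<String, Double> likelihoods, double threshold) {
        Map<String, Double> survivors = new LinkedHashMap<>();
        for (Map.Entry<String, Double> candidate : likelihoods.entrySet()) {
            if (candidate.getValue() >= threshold) {
                survivors.put(candidate.getKey(), candidate.getValue());
            }
        }
        // Nothing meets the threshold, so there is no better alternative to prefer.
        return survivors.isEmpty() ? new LinkedHashMap<>(likelihoods) : survivors;
    }

    public static void main(String[] args) {
        Map<String, Double> forks = new LinkedHashMap<>();
        forks.put("reading A", 0.62);
        forks.put("reading B", 0.05);
        System.out.println(prune(forks, 0.30).keySet());
        System.out.println(prune(forks, 0.90).keySet());
    }
}
```

Keeping every candidate when none meets the threshold follows the wording above, which abandons a solution only if better ones exist.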

NLGen2 will produce syntactic structures by merging preverbal messages into more complex syntactic structures. At any given time NLGen2 will have a current syntactic element which may be empty. Merging anything with the empty element will serve as the identity function. Merging two non-empty elements will yield either the empty element, in which case those two structures cannot be merged in a meaningful way, or it will yield a new element that incorporates the input elements in a way that is consistent with the links that the head lemmas of those structures are capable of forming. If there are multiple ways of merging two elements, ambiguity will be handled in a manner identical to that described for lexical ambiguity.

A select function will determine which element will be merged with the current syntactic element. This function will examine the current syntactic element and determine search criteria for relevant candidates. What constitutes a relevant candidate will be determined by the links that the head lemmas of the current syntactic element are capable of forming. If the current syntactic element is the empty element then the first element in the formulation buffer will be selected.
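
A brute-force version of select might look like the sketch below. Everything in it is hypothetical (SelectSketch, Element, complementary, select), and the relevance test is deliberately crude: two connectors are treated as complementary when they have the same upper-case type and opposite directions, with subscripts ignored.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

// Illustrative sketch only; a real select would consult link-grammar.
final class SelectSketch {

    static final class Element {
        final String name;
        final List<String> openConnectors;

        Element(String name, String... openConnectors) {
            this.name = name;
            this.openConnectors = Arrays.asList(openConnectors);
        }
    }

    /** The empty syntactic element. */
    static final Element EMPTY = new Element("");

    /** Simplified test: same connector type, opposite direction, subscripts ignored. */
    static boolean complementary(String a, String b) {
        String typeA = a.replaceAll("[^A-Z]", "");
        String typeB = b.replaceAll("[^A-Z]", "");
        char directionA = a.charAt(a.length() - 1);
        char directionB = b.charAt(b.length() - 1);
        return typeA.equals(typeB) && directionA != directionB;
    }

    /** Picks the first buffer element that can link to the current element. */
    static Optional<Element> select(Element current, List<Element> buffer) {
        if (buffer.isEmpty()) {
            return Optional.empty();
        }
        if (current == EMPTY) {
            return Optional.of(buffer.get(0));
        }
        for (Element candidate : buffer) {
            for (String mine : current.openConnectors) {
                for (String theirs : candidate.openConnectors) {
                    if (complementary(mine, theirs)) {
                        return Optional.of(candidate);
                    }
                }
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        Element eat = new Element("eat", "S-", "O+", "MV+");
        Element alice = new Element("Alice", "Ss+");
        Element spoon = new Element("spoon", "MV-");
        System.out.println(select(EMPTY, Arrays.asList(spoon, alice)).get().name);
        System.out.println(select(alice, Arrays.asList(spoon, eat)).get().name);
    }
}
```

A heuristic select would replace the nested loops with a ranking of candidates; the sketch only shows the contract, including the rule that the empty element selects the first item in the buffer.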

The order of mergers for the running example was chosen for clarity, with the selection process omitted. The operation would in fact be order independent: the final result would be the same if the mergers were applied in a different order, because at no time between mergers is the syntactic object linearized. If linearizations are interspersed with mergers, they may affect the final result. The preverbal messages for the example undergo merger in the following manner:

Merger 1: Empty and PVM2 resulting in SO1

Description: The merger of anything with the empty element yields that thing itself, so SO1 is simply PVM2.

Merger 2: SO1 and PVM1 resulting in SO2

Description: The propositions in the two objects are compared for compatibility. The two objects are discovered to share “_subj(2, 1)”. This similarity allows merger. The resulting syntactic object will contain the following propositions:

DEFINITE-FLAG(1, T)

gender(1, feminine)

inflection-TAG(1, .f)

person-FLAG(1, T)

pos(1, noun)

noun_number(1, singular)

lemma(1, “Alice”)

_obj(2,3)

tense(2, past)

inflection-TAG(2, .v)

pos(2, verb)

with(2,4)

lemma(2, “eat”)

verb_number(2, singular)

verb_gender(2, feminine)

Established Linkages:

1--Ss--2

The propositions of the two objects are conjoined and all shared propositions are removed; the syntactic object records established links in their place. (Again, ambiguity is ignored in this example: although multiple linkages might satisfy a merger, only a single option is examined here.) The other change is that propositions may be added based on the nature of any linkages added. In this example, verb agreement propositions are added to facilitate subject/verb agreement.

Merger 3: SO2 with PVM3 resulting in SO3

Description: These objects share the proposition “_obj(2,3)”. The resulting propositions of SO3 are the union of SO2's propositions with PVM3's propositions minus their shared proposition. The linkage “2--Os--3” is added. Because there is no subject/object agreement in English no new predicates are added. Also because English has a weak case system no propositions would be added to mark the accusative case unless a pronoun is involved. A general purpose solution would allow for both of these (and quite a few more) possibilities.

Merger 4: SO3 with PVM4 resulting in SO4

Description: These objects share the proposition “with(2,4)”. As in the previous mergers, the resultant object will contain the union of their propositions minus their intersection. The linkage “2--MV--4” will be added to the established linkage list.
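
The four mergers above can be replayed with a small sketch of merge. The names (MergeSketch, Element, merge, runningExample) and the binding table from shared propositions to linkage labels are hypothetical stand-ins for the link/propositional bindings listed earlier. Only a few propositions per message are included, and the agreement propositions added in Merger 2 are omitted to keep the sketch short.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Illustrative sketch only; ambiguity and agreement features are ignored.
final class MergeSketch {

    static final class Element {
        final Set<String> propositions;
        final List<String> linkages;

        Element(Set<String> propositions, List<String> linkages) {
            this.propositions = propositions;
            this.linkages = linkages;
        }

        boolean isEmpty() {
            return propositions.isEmpty() && linkages.isEmpty();
        }
    }

    static final Element EMPTY =
            new Element(Collections.<String>emptySet(), Collections.<String>emptyList());

    static Element of(String... propositions) {
        return new Element(new LinkedHashSet<>(Arrays.asList(propositions)),
                new ArrayList<String>());
    }

    /** Union of propositions minus the shared ones, which become linkages. */
    static Element merge(Element a, Element b, Map<String, String> bindings) {
        if (a.isEmpty()) return b;   // merging with the empty element is the identity
        if (b.isEmpty()) return a;
        Set<String> shared = new LinkedHashSet<>(a.propositions);
        shared.retainAll(b.propositions);
        shared.retainAll(bindings.keySet());
        if (shared.isEmpty()) return EMPTY;   // no meaningful way to merge
        Set<String> propositions = new LinkedHashSet<>(a.propositions);
        propositions.addAll(b.propositions);
        propositions.removeAll(shared);
        List<String> linkages = new ArrayList<>(a.linkages);
        linkages.addAll(b.linkages);
        for (String proposition : shared) {
            linkages.add(bindings.get(proposition));
        }
        return new Element(propositions, linkages);
    }

    static Element runningExample() {
        Map<String, String> bindings = new LinkedHashMap<>();
        bindings.put("_subj(2,1)", "1--Ss--2");
        bindings.put("_obj(2,3)", "2--Os--3");
        bindings.put("with(2,4)", "2--MV--4");
        Element pvm1 = of("_subj(2,1)", "lemma(1,Alice)");
        Element pvm2 = of("_obj(2,3)", "_subj(2,1)", "with(2,4)", "lemma(2,eat)");
        Element pvm3 = of("_obj(2,3)", "lemma(3,mushroom)");
        Element pvm4 = of("with(2,4)", "lemma(4,spoon)");
        Element so1 = merge(EMPTY, pvm2, bindings);
        Element so2 = merge(so1, pvm1, bindings);
        Element so3 = merge(so2, pvm3, bindings);
        return merge(so3, pvm4, bindings);
    }

    public static void main(String[] args) {
        Element so4 = runningExample();
        System.out.println(so4.propositions);
        System.out.println(so4.linkages);
    }
}
```

In this reduced example SO4 ends up with only the four lemma propositions and the linkages 1--Ss--2, 2--Os--3, and 2--MV--4, since each relational proposition has been replaced by the link it licenses.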

The current syntactic element will be deselected and returned to the formulation buffer if at any time it is complete and there are no optional links that could be fulfilled by any element in the formulation buffer. To be returned to the formulation buffer, a syntactic element will be linearized. Any morphological processing required for linearization will be performed at this time. For instance, if the current syntactic element constitutes a verb phrase, then tense will be incorporated at this time, because in English the tense of a verb is applied to the verb itself and not to the verb phrase. If the element were a noun phrase, however, a morphological element such as possession would not be incorporated at this time, because that feature is applied at the phrasal level rather than the word level (e.g. “the man with the hat's friend”).

In the running example, linearization will occur only once, after all four preverbal messages are consumed. The order of operations for linearization is significant; however, the specific logic involved is outside the scope of this application.

Linearization of SO4 resulting in PVM5:

The portion of SO4 indicated by <4> will be linearized first. It is an indefinite singular noun with the lemma “spoon”. Morphological processing will remove all non-lemma propositions that involve only <4> and update the lemma of <4> to “a spoon”. The propositions involving only <1> will be removed next, and the lemma of <1> will remain “Alice”. <3> will be processed similarly and its lemma will be altered to “the mushroom”. Finally, <2> will have its lemma updated to “ate”. Once single-lemma linearization is complete, multi-lemma linearization begins. <2> and <3> will be linearized: the lemma for <3> will be removed altogether, while the lemma for <2> will be changed to “ate the mushroom”. <2> and <4> will be linearized next: <4>'s lemma will be removed while <2>'s lemma will be updated to “ate the mushroom with a spoon”. Finally, <1> and <2> will be linearized and the entire structure will reduce to a lemma proposition for <2> that is “Alice ate the mushroom with a spoon”. After this, the object is returned to the formulation buffer, where it will eventually be chosen by spellout.
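
The multi-lemma steps just described can be traced with a short sketch. It starts from the state after single-lemma linearization and folds each dependent into its head. The names (LinearizeSketch, Side, attach, runningExample) are hypothetical, and the side on which each dependent attaches, along with the connective "with", is simply supplied by hand here rather than derived from the linkages.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative sketch only; word order is hard-coded rather than derived from links.
final class LinearizeSketch {

    enum Side { LEFT, RIGHT }

    /** Folds the dependent's surface string into the head's and removes the dependent. */
    static void attach(Map<Integer, String> surface, int head, int dependent,
                       Side side, String connective) {
        String dependentText = surface.remove(dependent);
        if (!connective.isEmpty()) {
            dependentText = connective + " " + dependentText;
        }
        String headText = surface.get(head);
        surface.put(head, side == Side.LEFT
                ? dependentText + " " + headText
                : headText + " " + dependentText);
    }

    static Map<Integer, String> runningExample() {
        // State after single-lemma linearization.
        Map<Integer, String> surface = new LinkedHashMap<>();
        surface.put(1, "Alice");
        surface.put(2, "ate");
        surface.put(3, "the mushroom");
        surface.put(4, "a spoon");
        attach(surface, 2, 3, Side.RIGHT, "");      // object follows the verb
        attach(surface, 2, 4, Side.RIGHT, "with");  // prepositional modifier follows
        attach(surface, 2, 1, Side.LEFT, "");       // subject precedes the verb
        return surface;
    }

    public static void main(String[] args) {
        // prints {2=Alice ate the mushroom with a spoon}
        System.out.println(runningExample());
    }
}
```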

Note: In this example all propositions are consumed in a single linearization. This will not always be the case. If some propositions are not consumed, the resultant object may or may not be eligible for spellout, depending on the result of the underlying possible linkage mergers. The merger of potential linkages was omitted for brevity; suffice it to say that this process is largely concerned with unifying the potential linkages of the objects while ensuring that links cannot cross.

The final macro-process is spellout. Spellout will monitor the formulation buffer for elements that can be adjoined to the right of all elements previously spelled out. For instance if no elements have thus far been spelled out then something capable of adjoining the left wall will be looked for. If the only element that has been spelled out is a noun phrase then a complement of that noun phrase or the beginning of a verb phrase will be looked for. If at any point the formulation buffer is empty and the right wall can be adjoined to the sentence being produced then spellout will signal that item to be expressed.

Spellout of PVM5:

At this stage the PVM is linearized with respect to whatever link it is forming with the already expressed object. The lemma is then appended to the expressed object. For the current example the only processing required is to ensure that the first word is capitalized and to append a period and the right wall.
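
For the running example, the surface side of this final step is small enough to sketch directly. The names SpelloutSketch and finish are hypothetical, and the right wall is treated here as a structural marker that leaves no trace in the output string, which is an assumption of this sketch.

```java
// Illustrative sketch only; covers just the capitalization and final period.
final class SpelloutSketch {

    /** Capitalizes the first character and appends a sentence-final period. */
    static String finish(String linearized) {
        if (linearized.isEmpty()) {
            return linearized;
        }
        String capitalized =
                Character.toUpperCase(linearized.charAt(0)) + linearized.substring(1);
        return capitalized + ".";
    }

    public static void main(String[] args) {
        System.out.println(finish("Alice ate the mushroom with a spoon"));
    }
}
```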

Limitations and Future Work:

I do not anticipate strong ambiguity resolution to be completed by the end of the summer. This will be an area that I will continue to pursue and which I believe will become incrementally better over time. Also, the above described system does not incorporate error correction. The ability for an incremental system to make mistakes and correct them is one of its major strengths. This will definitely be one of the areas worth working on in the future.

Preliminary schedule for work:

Week 1:

The portion of the architecture concerning the structure of the preverbal messages and the functions involved in their formation will be the focus of week 1.

Week 2:

Work on the preverbal messages will be completed during the first part of week 2.
The remainder of week 2 will be spent coding the skeleton of the rest of the architecture. This will include the formulation buffer, the management of the current syntactic element, and the organization of the other portions of the architecture within NLGen2.

Week 3:

Merge will be the focus of weeks 3 and 4. This task will consist of fleshing out the classes for the current syntactic element. Much of the work will involve determining which methods of merger yield the best results: for instance, whether an element can be merged internally to a given structure or whether only the edges are viable merge points, and whether to use a primarily flat or a strictly binary branching structure. These choices will be made primarily on the basis of accuracy, but also on speed of computation. At this point in development, ambiguity will be handled by simply selecting options on a “first come, first served” basis.

Week 4:

Work on merge will continue in week 4.

Week 5:

Any remainder of the work on merge will be finished by the middle of week 5.
Select will be coded in the remainder of Week 5. A simple brute force algorithm will be the first option explored and heuristic algorithms will be explored with the remainder of the time in this week.

Week 6:

Several disjoint tasks will be taken up during this week.
Firstly, functions to determine whether a given syntactic element is complete or not will be coded during this week.
Classes and functions to support the produced utterance will be developed.
The remaining portion of week 6 will be used to begin the work described under week 7.

Week 7:

Morphological processing and linearization will be developed during this week.
Morphological processing will be handled in a way similar to how it is currently handled by NLGen.
The specifics of linearization cannot be given at this time, as they will be determined by the decisions made during weeks 3 and 4.

Week 8:

Week 8 will involve fixing any remaining problems with the developments from previous weeks and finishing any integration issues.

Week 9:

The focus of week 9 will be on documentation and build issues. Although code will be documented as it is written this week will be used to ensure continuity and completeness of the documentation.

Week 10:

This week will involve taking care of whatever loose ends need to be tied up at this point, be they clarifications in the documentation, bug fixes, or otherwise.

Week 11:

If all goes well by this point the system will be ready for future development to include incorporating the more sophisticated ambiguity resolution options mentioned earlier. Otherwise this week will be used in the same way as week 10.

Knowledge required for project completion:

I am already very familiar with the Java programming language, which is the language in which NLGen2 will be developed. I am also very familiar with the theoretical basis for language generation. In addition, I will need to be very familiar with the APIs of link-grammar, RelEx, and the current NLGen system. I currently have a decent understanding of what these three systems do at a high level of abstraction, and I will gain specific knowledge of their classes between now and the beginning of the summer program.

Reason for choosing this project:

I chose this project for largely the same reasons that I chose natural language processing as my research field in graduate school. Language is one of the few capabilities that sets humans apart from other animals. I believe that in order to create artificial general intelligence we need to gain greater understanding of natural language. This close relationship between AGI and NLP is evidenced by the fact that the Turing test is a test in a natural language medium.

Previous programming experience:

The languages I have used most extensively are C++, Java, and Python. A majority of my school projects were done in C++. Furthermore, from December '06 to August '08 I was employed as the in-house programmer for a local company named Environmental Safety Solutions. During my tenure at that company I was responsible for the design and development of two programs. One was a Java application for in-house use and the other was a set of PHP scripts for a database front end for one of the company's products. My natural language research has been done mostly in Python. Of these areas, I have most enjoyed the natural language processing work. To date my major works in the field are a formalism that describes a data structure useful in combining elements of competing syntactic theories and a chart parser that accounts for movement by forming DAGs instead of trees. The first was presented at the 2008 Undergraduate Linguistics Colloquium at Harvard. The paper for the second is being revised and has not yet been accepted for publication.

Previous “Open Source” experience:

This project will be my first contribution to an “Open Source” project. I am very excited about this opportunity because I believe that the free software development model is the best model for producing reliable, stable, and useful programs.

Description of OpenCog:

OpenCog is a free software project that has as its long-term goal facilitating the creation of artificial general intelligence, or what is classically known as “strong AI”. At the core of OpenCog is its unified knowledge representation system, AtomSpace. The OpenCog core architecture also features plug-ins for specific knowledge manipulation tasks. Examples of these are PLN, a probabilistic reasoning plug-in, and MOSES, a plug-in that dynamically improves at searching similar state spaces. This architecture and its plug-ins facilitate the development of MindAgents. These agents each perform a particular cognitive task. This architecture is based on the idea that cognition can be described as a coordinated system of activities.

Internet access:

I have a permanent internet connection at my house. In addition, I will have access to my university's facilities over the course of the summer, so if for some reason my internet goes out I can go to my office at school to work. I am both able and willing to hang out on IRC and post on the group mailing list. Because my wife is also in school and we are raising two small children, I would not be able to invert my day/night rhythm. However, I live in the CST zone and am usually up until midnight. It has been my experience over the last few weeks that this schedule allows me to interact with most of the mentors on IRC. I will primarily be using the nick “blemoine”, but I sometimes forget that I am still logged in at home, in which case I will log in as “Blake_Lemoine” on my laptop.

University obligations:

The last day of spring classes is May 1 and exams end on May 8. This will not overlap with the Google Summer of Code time frame. For the summer term I am enrolled in independent study courses which will consist of working on this project. The fall semester begins on August 24, which also should not overlap with the Google Summer of Code time frame.

Time Commitment:

I intend to spend between 35 and 50 hours each week on my GSoC project this summer. I will try to maintain a steady schedule of 8 to 10 hours per day.

Other Plans:

I have no other major activities scheduled for this summer.

License Agreement:

I am able and willing to execute the SIAI Individual Contributor License Agreement.

Progress updates:

In addition to regular communication with my mentor and the group in general, I will post status updates at nlgen2.blogspot.com. These updates will appear at least weekly, and a highly productive week may see several.

Post-GSoC code maintenance:

I intend to make sure that my code will be maintained and supported by doing so myself. If I am accepted into this program I would be happy to continue contributing after the summer program is over. The work that needs to be done for language generation and NLP tasks in general for OpenCog and its associated programs lines up very well with my research plan.

Wednesday, April 1, 2009