SAP HANA CHINESE TEXT PROCESS (2): SAP HANA CHINESE SEGMENTATION

A Chinese version of this document is also available.

In the previous post, we crawled HTML source files and extracted their contents using RLANG. In this post, we discuss how to perform Chinese segmentation with SAP HANA, since segmentation is the basis of text-based scenarios.

1. Introduction to SAP HANA Chinese segmentation

SAP HANA segmentation is part of SAP HANA text analysis. We can use the SAP HANA segmentation engine by creating a full-text index on a table. SAP HANA text analysis supports seven data types: TEXT, BINTEXT, NVARCHAR, VARCHAR, NCLOB, CLOB and BLOB.

To use SAP HANA Chinese segmentation, we should first ensure that the installed HANA database supports Chinese segmentation. We can use the following SQL to check the feature:

SELECT * FROM SYS.M_TEXT_ANALYSIS_LANGUAGES

The result shows that Simplified Chinese is supported.

2. DEMO

First, we create a table for testing:

CREATE COLUMN TABLE SEGMENTATION_TEST(
  URL VARCHAR(200) PRIMARY KEY,
  CONTENT NCLOB,
  LANGU VARCHAR(10)
);

The column CONTENT stores the text to be segmented. The column LANGU specifies the language of each record; here we set it to ZH.

Next, we create a full-text index on the CONTENT column of the table SEGMENTATION_TEST. We can use the following SQL:

CREATE FULLTEXT INDEX FT_INDEX
ON SEGMENTATION_TEST(CONTENT) TEXT ANALYSIS
ON CONFIGURATION 'LINGANALYSIS_FULL'
LANGUAGE COLUMN "LANGU";

Note that a table on which we create a full-text index must have a primary key; otherwise an error occurs.

After we create the full-text index, SAP HANA automatically generates a table named $TA_<index_name>.

Now we insert a record into the table SEGMENTATION_TEST:
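As a quick sanity check, the generated table can be looked up in the catalog. This is only a sketch: it assumes the index was named FT_INDEX as in the statement above, so the generated table would be $TA_FT_INDEX, and it uses the standard SYS.TABLES catalog view.

```sql
-- Look up the text analysis table that SAP HANA generated for FT_INDEX.
-- Assumption: the index name is FT_INDEX, so the table is $TA_FT_INDEX.
SELECT SCHEMA_NAME, TABLE_NAME
FROM SYS.TABLES
WHERE TABLE_NAME = '$TA_FT_INDEX';
```

If the query returns no row, the index was probably created without the TEXT ANALYSIS ON option.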

INSERT INTO SEGMENTATION_TEST(URL,CONTENT,LANGU)
VALUES('http://xxx.xxx.xxx','想获取更多SAP HANA学习资料或有任何疑问，请关注新浪微博@HANAGeek！我们欢迎你的加入！','ZH');

Then we query the contents of the table $TA_FT_INDEX. Each row contains one word together with some additional information.

As the result shows, the segmentation table contains not only the segmented words but also each word's part of speech. For example, the part of speech of the word “获取” (to obtain) is verb. The word HANA is not recognized, so its part of speech is unknown.
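The query itself is not reproduced in the text, so here is a minimal sketch of what it could look like. It assumes the column names that SAP HANA text analysis uses for $TA tables (TA_COUNTER for the token position, TA_TOKEN for the word, TA_TYPE for the part of speech, TA_STEM for the stem); the exact set of columns may differ between HANA revisions.

```sql
-- List the segmented words of the demo record in document order.
-- Assumption: $TA_FT_INDEX is in the current schema; qualify it otherwise.
SELECT URL, TA_COUNTER, TA_TOKEN, TA_TYPE, TA_STEM
FROM "$TA_FT_INDEX"
WHERE URL = 'http://xxx.xxx.xxx'
ORDER BY TA_COUNTER;
```

The table name has to be quoted because it starts with a $ character.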

SAP HANA supports various text analysis configurations:

LINGANALYSIS_BASIC: This configuration provides the following language processing capabilities for linguistic analysis of unstructured data:

Segmentation, also known as tokenization - the separation of input text into its elements

LINGANALYSIS_STEMS: This configuration provides the following language processing capabilities for linguistic analysis of unstructured data:

Segmentation, also known as tokenization - the separation of input text into its elements
Stemming - the identification of word stems or dictionary base forms

LINGANALYSIS_FULL: This configuration provides the following language processing capabilities for linguistic analysis of unstructured data:

Segmentation, also known as tokenization - the separation of input text into its elements
Stemming - the identification of word stems or dictionary base forms
Tagging - the labeling of words' parts of speech

EXTRACTION_CORE: This configuration extracts entities of interest from unstructured text, such as people, organizations, or places mentioned. In most use cases, this option is sufficient.

EXTRACTION_CORE_VOICEOFCUSTOMER: Voice of the customer content includes a set of entity types and rules that address requirements for extracting customer sentiments and requests. You can use this content to retrieve specific information about your customers' needs and perceptions when processing and analyzing text. The configuration involves complex linguistic analysis and pattern matching that includes processing parts of speech, syntactic patterns, negation, and so on, to identify the patterns to be extracted.
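To try a different configuration on the demo table, one option is to drop the index and create it again with another configuration name. The sketch below assumes that FT_INDEX from the demo already exists and that entity extraction is available for the languages stored in LANGU on your installation.

```sql
-- Remove the existing index; its generated $TA table is removed with it.
DROP FULLTEXT INDEX FT_INDEX;

-- Recreate the index with entity extraction instead of linguistic analysis.
CREATE FULLTEXT INDEX FT_INDEX
ON SEGMENTATION_TEST(CONTENT)
TEXT ANALYSIS ON
CONFIGURATION 'EXTRACTION_CORE'
LANGUAGE COLUMN "LANGU";
```

With an extraction configuration, the rows of $TA_FT_INDEX describe extracted entities and their types rather than every word and its part of speech.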

To keep track of deletions in the source table, the keys in the table $TA_FT_INDEX need to be aligned to the keys of the source table. To do this, use the following SQL statement:

ALTER TABLE TEST."$TA_FT_INDEX"
ADD CONSTRAINT R_KEY FOREIGN KEY(URL)
REFERENCES TEST.SEGMENTATION_TEST(URL) ON DELETE CASCADE;
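One way to check the effect of the constraint is to delete the demo record and count what remains for its key. This is a sketch under the assumption that the record inserted earlier still exists and that both tables are in the current schema (add the TEST schema prefix otherwise).

```sql
-- Delete the demo record from the source table.
DELETE FROM SEGMENTATION_TEST
WHERE URL = 'http://xxx.xxx.xxx';

-- Count the tokens still stored for the deleted key.
SELECT COUNT(*) AS REMAINING_TOKENS
FROM "$TA_FT_INDEX"
WHERE URL = 'http://xxx.xxx.xxx';
```

With the ON DELETE CASCADE constraint in place, the count would be expected to drop to zero.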
