We have a Chinese version of this document.
In the previous posts we crawled HTML source files and extracted their content using RLANG. In this post we discuss how to perform Chinese word segmentation with SAP HANA, since segmentation is the foundation of most text-based scenarios.
1. Introduction to SAP HANA Chinese segmentation
SAP HANA segmentation is part of SAP text analysis. We can use the segmentation engine by creating a full-text index on a table. SAP HANA text analysis supports seven data types: TEXT, BINTEXT, NVARCHAR, VARCHAR, NCLOB, CLOB and BLOB.
To use SAP HANA Chinese segmentation, we should first make sure that the installed HANA database supports it. We can use the following SQL to check:
SELECT * FROM SYS.M_TEXT_ANALYSIS_LANGUAGES
As the following picture shows, simplified Chinese is supported.
2. DEMO
First, we create a table for testing:
CREATE COLUMN TABLE SEGMENTATION_TEST( URL VARCHAR(200) PRIMARY KEY, CONTENT NCLOB, LANGU VARCHAR(10) );
The column CONTENT stores the text to be segmented. The column LANGU specifies the language of each row; in this demo we set it to ZH.
Next, we create a full-text index on the CONTENT column of table SEGMENTATION_TEST with the following SQL:
CREATE FULLTEXT INDEX FT_INDEX ON SEGMENTATION_TEST(CONTENT) TEXT ANALYSIS ON CONFIGURATION 'LINGANALYSIS_FULL' LANGUAGE COLUMN "LANGU";
Note that a table must have a primary key before a full-text index with text analysis can be created on it; otherwise an error occurs.
After we create the full-text index, SAP HANA automatically generates a table named $TA_<index_name>.
Now we insert a record into table SEGMENTATION_TEST:
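To confirm that the result table was generated, one option is to look it up in the system catalog. This is a minimal sketch, assuming the index is named FT_INDEX as above and that the current user is allowed to read SYS.TABLES:

```sql
-- Look for the text analysis result table generated for index FT_INDEX.
-- The result table is named $TA_ followed by the index name.
SELECT SCHEMA_NAME, TABLE_NAME
FROM SYS.TABLES
WHERE TABLE_NAME = '$TA_FT_INDEX';
```

If the query returns no row, the index was probably created without the TEXT ANALYSIS ON option.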
INSERT INTO SEGMENTATION_TEST(URL,CONTENT,LANGU) VALUES('http://xxx.xxx.xxx','想获取更多SAP HANA学习资料或有任何疑问,请关注新浪微博@HANAGeek!我们欢迎你的加入!','ZH');
Then we query the contents of table $TA_FT_INDEX. Each row contains one segmented word together with some additional information.
As shown above, the segmentation table contains not only the split words but also each word's part of speech. For example, the part of speech of the word "获取" is verb. The word HANA is not recognized, so its part of speech is unknown.
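A minimal sketch of such a query follows. The column names TA_COUNTER, TA_TOKEN and TA_TYPE are an assumption based on the usual layout of $TA tables; the exact set of columns varies between HANA revisions, so SELECT * is the safe fallback.

```sql
-- The table name starts with '$', so it has to be quoted.
-- TA_COUNTER is assumed to number the tokens of a document in order,
-- TA_TOKEN to hold the segmented word, and TA_TYPE its part of speech.
SELECT URL, TA_COUNTER, TA_TOKEN, TA_TYPE
FROM "$TA_FT_INDEX"
ORDER BY URL, TA_COUNTER;
```

Ordering by the source key and the token counter returns the words in the order in which they appear in the text.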
SAP HANA supports various text analysis configurations:
LINGANALYSIS_BASIC: This configuration provides the following language processing capabilities for linguistic analysis of unstructured data:
- Segmentation, also known as tokenization - the separation of input text into its elements
LINGANALYSIS_STEMS: This configuration provides the following language processing capabilities for linguistic analysis of unstructured data:
- Segmentation, also known as tokenization - the separation of input text into its elements
- Stemming - the identification of word stems or dictionary base forms
LINGANALYSIS_FULL: This configuration provides the following language processing capabilities for linguistic analysis of unstructured data:
- Segmentation, also known as tokenization - the separation of input text into its elements
- Stemming - the identification of word stems or dictionary base forms
- Tagging - the labeling of words' parts of speech
EXTRACTION_CORE: This configuration extracts entities of interest from unstructured text, such as people, organizations, or places mentioned. In most use cases, this option is sufficient.
EXTRACTION_CORE_VOICEOFCUSTOMER: Voice of the customer content includes a set of entity types and rules that address requirements for extracting customer sentiments and requests. You can use this content to retrieve specific information about your customers' needs and perceptions when processing and analyzing text. The configuration involves complex linguistic analysis and pattern matching that includes processing parts of speech, syntactic patterns, negation, and so on, to identify the patterns to be extracted.
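To try a different configuration on the demo table, the index can be dropped and created again with another configuration name. The sketch below assumes that only one full-text index is kept on the CONTENT column, that dropping the index also removes its $TA table, and that entity extraction is available for the languages stored in LANGU on your installation:

```sql
-- Remove the existing index (its generated $TA table is expected to go with it).
DROP FULLTEXT INDEX FT_INDEX;

-- Recreate the index with entity extraction instead of full linguistic analysis.
CREATE FULLTEXT INDEX FT_INDEX ON SEGMENTATION_TEST(CONTENT)
  TEXT ANALYSIS ON
  CONFIGURATION 'EXTRACTION_CORE'
  LANGUAGE COLUMN "LANGU";
```

With an extraction configuration the rows in $TA_FT_INDEX describe extracted entities rather than individual words, so the two result sets are not directly comparable.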
To keep track of deletions in the source table, the keys in the table $TA_FT_INDEX need to be aligned with the keys of the source table. To do this, use the following SQL statement (here both tables are in the schema TEST):
ALTER TABLE TEST."$TA_FT_INDEX" ADD CONSTRAINT R_KEY FOREIGN KEY(URL) REFERENCES TEST.SEGMENTATION_TEST(URL) ON DELETE CASCADE;
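One way to check that the constraint works is to delete the demo record and count the tokens that remain for it. This sketch assumes the tables are in schema TEST, as in the statement above, and reuses the placeholder URL from the INSERT example:

```sql
-- Delete the source row; the foreign key should cascade to the $TA table.
DELETE FROM TEST.SEGMENTATION_TEST
WHERE URL = 'http://xxx.xxx.xxx';

-- Count the tokens still stored for the deleted document.
SELECT COUNT(*) AS REMAINING_TOKENS
FROM TEST."$TA_FT_INDEX"
WHERE URL = 'http://xxx.xxx.xxx';
```

If the cascade is in place, the count is expected to be zero.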