<?xml version="1.0"?>
<api batchcomplete="">
<query>
<pages>
<page _idx="1112569" pageid="1112569" ns="0" title="紅葉溫泉">
<revisions>
<rev contentformat="text/x-wiki" contentmodel="wikitext" xml:space="preserve">
'''紅葉溫泉''' 可以是指: *[[南投縣]][[仁愛鄉]]的南投紅葉溫泉([[紅香溫泉]]) *[[花蓮縣]][[瑞穗鄉]]的[[花蓮紅葉溫泉]] *[[台東縣]][[延平鄉]]的[[台東紅葉溫泉]] {{disambig|Cat=四字地名消歧义}}
</rev>
</revisions>
</page>
</pages>
</query>
</api>
In this project, we need to extract at least two main pieces of information from the Wiki XML format:
(1) Title, which appears in the "title" attribute of the "<page ... >" tag, as shown in the example above.
(2) Wiki text, which appears between the opening "<rev contentformat="text/x-wiki" contentmodel="wikitext" ...>" tag and the closing "</rev>" tag. The extracted text needs further processing to remove markup such as "[[ ]]" and "{{ }}", so that only the plain text is output. You therefore need to identify each markup construct and what it means on the rendered Wiki page. For example, '''紅葉溫泉''' renders as bold text, and [[南投縣]] is a hyperlink to the page "南投縣". You can learn these by comparing the XML with the rendered Wiki page.
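The two extraction steps above can be sketched in Python as follows. This is only a minimal illustration, not a required implementation: the function names `extract_pages` and `strip_wiki_markup` are made up for this example, the sample XML is a shortened copy of the one above, and the regular expressions only cover the three constructs mentioned here (bold/italic quotes, "[[ ]]" links, and non-nested "{{ }}" templates).

```python
import re
import xml.etree.ElementTree as ET

# Shortened copy of the example XML above (assumption: real files have the same shape).
SAMPLE_XML = """<?xml version="1.0"?>
<api batchcomplete="">
<query><pages>
<page _idx="1112569" pageid="1112569" ns="0" title="紅葉溫泉">
<revisions>
<rev contentformat="text/x-wiki" contentmodel="wikitext" xml:space="preserve">'''紅葉溫泉''' 可以是指: *[[南投縣]][[仁愛鄉]]的南投紅葉溫泉 {{disambig|Cat=四字地名消歧义}}</rev>
</revisions>
</page>
</pages></query>
</api>"""


def extract_pages(xml_text):
    """Return a list of (title, wikitext) pairs, one per <page> element."""
    root = ET.fromstring(xml_text)
    pages = []
    for page in root.iter("page"):
        rev = page.find("./revisions/rev")
        wikitext = rev.text if rev is not None and rev.text else ""
        pages.append((page.get("title"), wikitext))
    return pages


def strip_wiki_markup(wikitext):
    """Remove a small, hand-picked subset of wiki markup."""
    # Drop {{ ... }} templates (non-nested only).
    text = re.sub(r"\{\{[^{}]*\}\}", "", wikitext)
    # Replace [[target]] and [[target|label]] with the visible text.
    text = re.sub(r"\[\[(?:[^\[\]|]*\|)?([^\[\]|]*)\]\]", r"\1", text)
    # Remove runs of two or more apostrophes ('' italic, ''' bold).
    text = re.sub(r"'{2,}", "", text)
    return text.strip()


if __name__ == "__main__":
    for title, wikitext in extract_pages(SAMPLE_XML):
        print(title)
        print(strip_wiki_markup(wikitext))
```

Other markup, such as the "*" list marker, nested templates, tables, and references, is left untouched by this sketch and would need its own rules.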
Where to get the XML file:
1. Link: http://140.116.41.55/PlanningWordNets/spaceDB_index.php
2. Enter any Chinese word in the "查詢地名" (look up place name) field and press "確定" (OK).
3. If the Wiki XML file is created successfully, a "右鍵存檔" (right-click to save) link appears. Right-click it and save the file; the file is named "save.txt" and is UTF-8 encoded.
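Once "save.txt" has been downloaded, it can be loaded along these lines. This is a hedged sketch: it assumes the saved file contains the same XML structure as the example at the top of this post, and the helper name `load_titles` is hypothetical. The file is read as bytes so that the XML parser handles the UTF-8 decoding itself.

```python
import xml.etree.ElementTree as ET


def load_titles(path="save.txt"):
    """Read a saved Wiki XML file and return the title of every <page>."""
    with open(path, "rb") as fh:
        data = fh.read()
    # Drop a UTF-8 byte-order mark if the file was saved with one.
    if data.startswith(b"\xef\xbb\xbf"):
        data = data[3:]
    root = ET.fromstring(data)
    return [page.get("title") for page in root.iter("page")]


if __name__ == "__main__":
    print(load_titles("save.txt"))
```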
Note that this project focuses on Chinese natural language processing, so please use Chinese terms as training examples.
[1] https://www.mediawiki.org/wiki/API:Main_page/zh