September 14, 2015

Wiki API: XML format

Wiki documents are retrieved through the Wiki API [1], which can return its results in XML format. The format is as follows:


<?xml version="1.0"?>
<api batchcomplete="">
  <query>
    <pages>
      <page _idx="1112569" pageid="1112569" ns="0" title="紅葉溫泉">
        <revisions>
          <rev contentformat="text/x-wiki" contentmodel="wikitext" xml:space="preserve">
          '''紅葉溫泉''' 可以是指: *[[南投縣]][[仁愛鄉]]的南投紅葉溫泉([[紅香溫泉]]) *[[花蓮縣]][[瑞穗鄉]]的[[花蓮紅葉溫泉]] *[[台東縣]][[延平鄉]]的[[台東紅葉溫泉]] {{disambig|Cat=四字地名消歧义}}
          </rev>
        </revisions>
      </page>
    </pages>
  </query>
</api>

In the example, characters in blue are XML syntax. Characters in purple are the tags predefined by the Wiki API; they are defined by Wiki itself, not by XML. Characters in bold black are the values of the tags. The "紅葉溫泉" example is a simple one; the Wiki API XML format has many more predefined tags, and you can learn about them by sending other queries.
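One way to see which tags and attributes a given response uses is to walk the parsed tree. The following is a minimal sketch using Python's standard xml.etree.ElementTree; the function name summarize_tags and the shortened SAMPLE string are illustrative assumptions, not part of the Wiki API.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the example response (assumption: trimmed for brevity).
SAMPLE = """<?xml version="1.0"?>
<api batchcomplete="">
  <query>
    <pages>
      <page _idx="1112569" pageid="1112569" ns="0" title="紅葉溫泉">
        <revisions>
          <rev contentformat="text/x-wiki" contentmodel="wikitext" xml:space="preserve">[[南投縣]]</rev>
        </revisions>
      </page>
    </pages>
  </query>
</api>"""


def summarize_tags(xml_text):
    """Map each tag name found in the document to the set of its attribute names."""
    root = ET.fromstring(xml_text)
    summary = {}
    for elem in root.iter():  # visits the root and every descendant
        summary.setdefault(elem.tag, set()).update(elem.attrib)
    return summary


if __name__ == "__main__":
    for tag, attrs in summarize_tags(SAMPLE).items():
        print(tag, sorted(attrs))
```

Note that ElementTree reports the xml:space attribute under its full namespace name, so it appears with a "{http://www.w3.org/XML/1998/namespace}" prefix.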

In this project, we need to extract at least two major pieces of information from the Wiki XML format:

(1) Title, which appears in the "title" attribute of the "<page ... >" tag, as shown in the example.

(2) Wiki-Text, which appears between the tags "<rev contentformat="text/x-wiki" contentmodel="wikitext">" and "</rev>". The extracted text needs further processing to remove markup characters such as "[[ ]]" and "{{ }}" so that only the original text is output. You therefore need to identify these markup characters and what they mean on the Wiki webpage; for example, '''紅葉溫泉''' renders as bold text, and [[南投縣]] is a hyperlink to "南投縣". You can learn this by comparing the XML with the Wiki webpage.
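The two extraction steps above can be sketched in Python with the standard library. This is only a rough starting point under stated assumptions: the function names are hypothetical, the element path query/pages/page/revisions/rev is taken from the example response, and the regular expressions handle only simple, non-nested "{{ }}" templates, "[[ ]]" links, and quote-based bold or italic markers. Real Wiki-Text has many more constructs.

```python
import re
import xml.etree.ElementTree as ET


def extract_title_and_wikitext(xml_text):
    """Return (title, wikitext) for the first <page> in the response."""
    root = ET.fromstring(xml_text)
    page = root.find("./query/pages/page")
    if page is None:
        raise ValueError("no <page> element found")
    title = page.get("title")
    rev = page.find("./revisions/rev")
    wikitext = rev.text if rev is not None and rev.text else ""
    return title, wikitext.strip()


def strip_wiki_markup(wikitext):
    """Remove a few common markup constructs (simplified, not exhaustive)."""
    # Drop non-nested templates such as {{disambig|...}} entirely.
    text = re.sub(r"\{\{[^{}]*\}\}", "", wikitext)
    # Replace [[target]] with target and [[target|label]] with label.
    text = re.sub(r"\[\[(?:[^\[\]|]*\|)?([^\[\]|]*)\]\]", r"\1", text)
    # Remove bold/italic markers, which are runs of two or more apostrophes.
    text = re.sub(r"'{2,}", "", text)
    return text.strip()


if __name__ == "__main__":
    sample = (
        '<api><query><pages><page pageid="1112569" ns="0" title="紅葉溫泉">'
        '<revisions><rev contentformat="text/x-wiki" contentmodel="wikitext" '
        'xml:space="preserve">\'\'\'紅葉溫泉\'\'\' [[南投縣]] '
        "{{disambig|Cat=四字地名消歧义}}</rev></revisions></page></pages></query></api>"
    )
    title, wikitext = extract_title_and_wikitext(sample)
    print(title)
    print(strip_wiki_markup(wikitext))
```

Keeping the link label rather than deleting the whole "[[ ]]" span is a design choice made here so that place names such as 南投縣 stay in the output text.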

Where to get the XML file:
1. Link: http://140.116.41.55/PlanningWordNets/spaceDB_index.php
2. Enter any Chinese word in the "查詢地名" (place-name query) field and then press "確定" (confirm).
3. If the Wiki XML file is created successfully, the "右鍵存檔" (right-click to save) link will appear. Right-click it and save the file; the filename is "save.txt", encoded in UTF-8.
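Once the file is saved, it can be loaded for processing. The sketch below assumes the file is named "save.txt" as described in the steps above, and that it may or may not begin with a UTF-8 byte order mark; the function name load_saved_response is hypothetical. It reads the file as bytes so that the XML parser decodes the UTF-8 content itself.

```python
from pathlib import Path
import xml.etree.ElementTree as ET

UTF8_BOM = b"\xef\xbb\xbf"


def load_saved_response(path="save.txt"):
    """Parse the saved Wiki API response and return the root <api> element."""
    raw = Path(path).read_bytes()
    # Some tools prepend a byte order mark; drop it if it is there.
    if raw.startswith(UTF8_BOM):
        raw = raw[len(UTF8_BOM):]
    # Leading whitespace before the XML declaration would be a parse error.
    return ET.fromstring(raw.lstrip())


if __name__ == "__main__":
    root = load_saved_response("save.txt")
    for page in root.iter("page"):
        print(page.get("title"))
```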

Note that this project focuses on Chinese natural language processing, so please use Chinese terms as training examples.

 [1] https://www.mediawiki.org/wiki/API:Main_page/zh
