"Beginning Python" (4): "Instant Markup 1"

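This post walks through "Instant Markup", one of the ten application projects at the end of "Beginning Python". The project converts plain text into marked-up text such as HTML, XML, or LaTeX. Although it only demonstrates plain text to HTML, it is easy to extend to other markup formats.

Note: for an introduction to HTML, see http://www.w3.org/MarkUp/Guide/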

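I. Problem and Goals

The problem is clear-cut: convert plain text into HTML. Specifically:

1) Distinguish between different kinds of text blocks: headings and paragraphs.

2) Handle special text: list items, and in-line elements such as emphasized text and URLs.

3) Stay extensible, so that other markup formats can be supported.

The test text, "test_input.txt", is as follows: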

Welcome to World Wide Spam, Inc.


These are the corporate web pages of *World Wide Spam*, Inc. We hope
you find your stay enjoyable, and that you will sample many of our
products.

A short history of the company

World Wide Spam was started in the summer of 2000. The business
concept was to ride the dot-com wave and to make money both through
bulk email and by selling canned meat online.

After receiving several complaints from customers who weren't
satisfied by their bulk email, World Wide Spam altered their profile,
and focused 100% on canned goods. Today, they rank as the world's
13,892nd online supplier of SPAM.

Destinations

From this page you may visit several of our interesting web pages:

- What is SPAM? (http://wwspam.fu/whatisspam)

- How do they make it? (http://wwspam.fu/howtomakeit)

- Why should I eat it? (http://wwspam.fu/whyeatit)

How to get in touch with us

You can get in touch with us in *many* ways: By phone (555-1234), by
email (wwspam@wwspam.fu) or by visiting our customer feedback page
(http://wwspam.fu/feedback).

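II. Technical Breakdown

1. Topics involved

1) Reading and writing files - Chapter 11, fileinput

2) Iterating line by line - same as above

3) String handling - Chapter 3

4) Generators - Chapter 9

5) Regular expressions (re) - Chapter 10

2. Subtasks

1) Splitting the text

Because HTML needs to distinguish head1, head2, and paragraph elements, our first subtask is to use the features of the input (plain) text to split out the heading lines and the paragraph blocks.

Looking at "test_input.txt", it is clear that blocks are separated by one or more blank lines.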

# util.py
def lines(file):
    for line in file: yield line
    yield '\n'

def blocks(file):
    block = []
    for line in lines(file):
        if line.strip():
            block.append(line)
        elif block:
            yield ''.join(block).strip()
            block = []

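As shown above, util.py contains two generators: lines and blocks. Note that they are not ordinary functions. Because their bodies contain yield, calling them returns a generator.

lines turns the input file (stream) into lines and yields them one at a time; stepping through it in the VS debugger is a good way to watch how it works. The input stream, file, already provides a line iterator, and lines simply relies on it to emit the file line by line. After the last line it yields one extra '\n', which guarantees that blocks sees a blank line at the end and emits the final block.

Note: for more on "File Iterators", see Chapter 11. Python file objects and sys.stdin can both be iterated over directly with for.

blocks is built around a list, and its code is easy to follow: it collects consecutive natural lines (each ending with a newline) into a list until it reaches a blank line, then yields them joined into a single string. The strip() string method removes leading and trailing whitespace by default, not only spaces. blocks also filters out blank lines, so a run of several blank lines never produces an empty block.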
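To make this behaviour concrete, here is a small self-contained run. It repeats the two generators from util.py and feeds them an in-memory stream instead of a real file; the sample text and the use of io.StringIO are assumptions made only for this illustration.

```python
import io

def lines(file):
    # Pass every line through, then add one blank line as a sentinel
    for line in file:
        yield line
    yield '\n'

def blocks(file):
    block = []
    for line in lines(file):
        if line.strip():
            # Non-blank line: keep collecting
            block.append(line)
        elif block:
            # Blank line after some content: emit the block and reset
            yield ''.join(block).strip()
            block = []

# Hypothetical sample: a title, two blank lines, then a two-line paragraph
# with no trailing newline
sample = io.StringIO("Title\n\n\nfirst line\nsecond line")
result = list(blocks(sample))
print(result)
```

The two blank lines after the title yield only one block boundary, and the last paragraph is still emitted even though the input does not end with a blank line, thanks to the sentinel '\n' from lines.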

2) Adding markup

A markup file generally consists of three parts:

a. Header

b. Body paragraphs

c. Footer

For reference, see the HTML guide: http://www.w3.org/MarkUp/Guide/
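The three parts can be sketched as follows. This is a minimal illustration, not the project's final code: the function name to_html, its title parameter, and the choice to treat the first block as the heading are all assumptions made for this example.

```python
from html import escape

def to_html(blocks, title="Untitled"):
    """Wrap text blocks in a minimal HTML page (hypothetical helper)."""
    parts = []
    # a. Header
    parts.append('<html><head><title>%s</title></head><body>' % escape(title))
    # b. Body: assume the first block is the heading, the rest are paragraphs
    for i, block in enumerate(blocks):
        tag = 'h1' if i == 0 else 'p'
        parts.append('<%s>%s</%s>' % (tag, escape(block), tag))
    # c. Footer
    parts.append('</body></html>')
    return '\n'.join(parts)

page = to_html(['Welcome', 'Some text.'], title='Demo')
print(page)
```

Hard-coding the tags like this is enough for a first prototype, but it ties the program to HTML, which is what the modular design in the next section is meant to avoid.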

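III. Code Analysis

1. Modularization

To make the program easy to extend and maintain, we design it as modules in an object-oriented way. It breaks down roughly into the following modules:

1) A Parser

This is the integrating class. Its main jobs are reading the file and managing the other classes. Naturally, the program's entry-point object is created from it.

2) Rules

Each rule corresponds to one kind of text block.

3) Filters

Filters wrap regular expressions and process in-line text (deal with in-line elements). Note that they work within a line, not on whole text blocks.

4) Handlers

Handlers generate the output text, and each handler corresponds to one kind of output. They are in fact the cornerstone of the program's extensibility: defining a different handler produces a different kind of markup text.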
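How the four modules could fit together is sketched below. This is a simplified illustration under stated assumptions, not the book's actual code: the class and method names (HTMLHandler, HeadingRule, ParagraphRule, add_rule, add_filter, parse) and the heading test are chosen for this example only.

```python
import re

class HTMLHandler:
    """Generates HTML; another handler could emit XML or LaTeX instead."""
    def start_document(self): return '<html><body>'
    def end_document(self): return '</body></html>'
    def heading(self, text): return '<h2>%s</h2>' % text
    def paragraph(self, text): return '<p>%s</p>' % text
    def emphasis(self, match): return '<em>%s</em>' % match.group(1)

class HeadingRule:
    """Assumed heuristic: a short single line that does not end with a colon."""
    def condition(self, block):
        return '\n' not in block and len(block) <= 70 and not block.endswith(':')
    def action(self, block, handler):
        return handler.heading(block)

class ParagraphRule:
    """Fallback rule: everything else is a paragraph."""
    def condition(self, block):
        return True
    def action(self, block, handler):
        return handler.paragraph(block)

class Parser:
    """Ties the handler, the rules, and the filters together."""
    def __init__(self, handler):
        self.handler = handler
        self.rules = []
        self.filters = []
    def add_rule(self, rule):
        self.rules.append(rule)
    def add_filter(self, pattern, method_name):
        # A filter pairs a regular expression with a handler method
        self.filters.append((re.compile(pattern), getattr(self.handler, method_name)))
    def parse(self, blocks):
        out = [self.handler.start_document()]
        for block in blocks:
            # Filters rewrite in-line elements inside the block
            for pattern, repl in self.filters:
                block = pattern.sub(repl, block)
            # The first rule whose condition matches decides the block type
            for rule in self.rules:
                if rule.condition(block):
                    out.append(rule.action(block, self.handler))
                    break
        out.append(self.handler.end_document())
        return '\n'.join(out)

parser = Parser(HTMLHandler())
parser.add_filter(r'\*(.+?)\*', 'emphasis')
parser.add_rule(HeadingRule())
parser.add_rule(ParagraphRule())
html = parser.parse(['Destinations', 'You can reach us in *many* ways:'])
print(html)
```

Because the parser only ever calls methods on the handler, swapping HTMLHandler for a handler that returns other markup would change the output format without touching the rules or the parser.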

The class diagram is as follows:

For an introduction to UML, see: http://design-patterns.readthedocs.io/zh_CN/latest/read_uml.html