
Checking whether a character is whitespace, control, or punctuation

Popularity: 76   Published: 2024-02-20 00:11:09

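I have recently been reading https://github.com/google-research/bert
A piece of code in tokenization.py struck me as very useful, so I am noting it down here; it may come in handy later, haha.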

import unicodedata


def _is_whitespace(char):
    """Checks whether `char` is a whitespace character."""
    # \t, \n, and \r are technically control characters but we treat them
    # as whitespace since they are generally considered as such.
    if char == " " or char == "\t" or char == "\n" or char == "\r":
        return True
    cat = unicodedata.category(char)
    if cat == "Zs":  # [Zs] Separator, Space
        return True
    return False


def _is_control(char):
    """Checks whether `char` is a control character."""
    # These are technically control characters but we count them as whitespace
    # characters.
    if char == "\t" or char == "\n" or char == "\r":  # \r: carriage return, \n: line feed
        return False
    # unicodedata.category(char) returns the Unicode general category of a
    # character. The categories are listed at
    # https://blog.csdn.net/xc_zhou/article/details/82079753
    cat = unicodedata.category(char)
    if cat in ("Cc", "Cf"):  # [Cc] Other, Control; [Cf] Other, Format
        return True
    return False


def _is_punctuation(char):
    """Checks whether `char` is a punctuation character."""
    cp = ord(char)
    # We treat all non-letter/number ASCII as punctuation.
    # Characters such as "^", "$", and "`" are not in the Unicode
    # Punctuation class but we treat them as punctuation anyways, for
    # consistency.
    if ((cp >= 33 and cp <= 47) or (cp >= 58 and cp <= 64) or
            (cp >= 91 and cp <= 96) or (cp >= 123 and cp <= 126)):
        return True
    cat = unicodedata.category(char)
    # Categories that start with "P":
    #   [Pc] Punctuation, Connector
    #   [Pd] Punctuation, Dash
    #   [Pe] Punctuation, Close
    #   [Pf] Punctuation, Final quote (may behave like Ps or Pe depending on usage)
    #   [Pi] Punctuation, Initial quote (may behave like Ps or Pe depending on usage)
    #   [Po] Punctuation, Other
    #   [Ps] Punctuation, Open
    if cat.startswith("P"):
        return True
    return False

unicodedata.category(char) takes a single character and returns its Unicode general category as a short string. The categories are listed here: https://blog.csdn.net/xc_zhou/article/details/82079753
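A quick way to get a feel for these categories is to print them for a few characters. The sketch below is only an illustration; the helper name `categories` and the sample string are my own choices.

```python
import unicodedata


def categories(text):
    """Return (character, general category) pairs for every character in text."""
    return [(char, unicodedata.category(char)) for char in text]


if __name__ == "__main__":
    # A space, a tab, a zero width joiner, and some punctuation and symbols.
    for char, cat in categories(" \t\u200d,(-^$`A"):
        print(repr(char), cat)
```

Note that "^", "$" and "`" fall into symbol categories (names starting with "S") rather than punctuation ones, which is why the punctuation check above needs the explicit ASCII ranges in addition to `cat.startswith("P")`.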
