Machine Learning Algorithm Notes: Bayesian Methods (Implementing a Spell Checker)
Implementing a Bayesian Spell Checker
    import re, collections

    def words(text): return re.findall('[a-z]+', text.lower())

    def train(features):
        model = collections.defaultdict(lambda: 1)
        for f in features:
            model[f] += 1
        return model

    NWORDS = train(words(open('big.txt').read()))

    alphabet = 'abcdefghijklmnopqrstuvwxyz'

    def edits1(word):
        n = len(word)
        return set([word[0:i]+word[i+1:] for i in range(n)] +                     # deletion
                   [word[0:i]+word[i+1]+word[i]+word[i+2:] for i in range(n-1)] + # transposition
                   [word[0:i]+c+word[i+1:] for i in range(n) for c in alphabet] + # alteration
                   [word[0:i]+c+word[i:] for i in range(n+1) for c in alphabet])  # insertion

    def known_edits2(word):
        return set(e2 for e1 in edits1(word) for e2 in edits1(e1) if e2 in NWORDS)

    def known(words): return set(w for w in words if w in NWORDS)

    def correct(word):
        candidates = known([word]) or known(edits1(word)) or known_edits2(word) or [word]
        return max(candidates, key=lambda w: NWORDS[w])
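Goal: argmax_c P(c|w) = argmax_c P(w|c) P(c) / P(w). Because P(w) is the same for every candidate c, this is equivalent to argmax_c P(w|c) P(c).
- P(c): the probability that a correctly spelled word c appears in a text, that is, how often c occurs in English writing.
- P(w|c): the probability that the user types w when they meant to type c, that is, how likely it is that c gets mistyped as w.
- argmax_c: enumerates all possible c and selects the one with the highest probability.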
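To make the argmax concrete, here is a minimal sketch with made-up numbers. The typo "thew", the candidate list, and every probability below are hypothetical values chosen only for illustration; none of them are estimated from big.txt.

```python
# Hypothetical values for the typo w = "thew"; not taken from any corpus.
prior = {"the": 0.05, "thaw": 0.0001, "threw": 0.0005}       # P(c)
error_model = {"the": 0.001, "thaw": 0.01, "threw": 0.008}   # P(w|c)

def best_correction(candidates):
    # P(w) is identical for every candidate c, so it is dropped from the argmax.
    return max(candidates, key=lambda c: error_model[c] * prior[c])

# Unnormalized posterior score P(w|c) * P(c) for each candidate.
scores = {c: error_model[c] * prior[c] for c in prior}
best = best_correction(prior)
```

With these numbers "the" wins: although its P(w|c) is the smallest of the three, its much larger prior P(c) dominates the product.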
    # Extract every word from the corpus and lowercase it; any non-letter character acts as a separator
    def words(text): return re.findall('[a-z]+', text.lower())

    def train(features):
        model = collections.defaultdict(lambda: 1)
        for f in features:
            model[f] += 1
        return model

    nwords = train(words(open('big.txt').read()))
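What if we encounter a new word that we have never seen before? Suppose a word is spelled perfectly correctly but the corpus does not contain it, so it never appears in the training data.
We would then have to report the probability of this word as 0. That is not good, because a probability of 0 means the event can never happen, whereas in our probability model
we want to represent this case with a very small probability instead. This is the purpose of lambda: 1 in train: every word starts with a default count of 1.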
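The effect of the default count can be checked on a tiny corpus. This is a small sketch using the same words and train functions; the sentence passed in and the name toy are hypothetical stand-ins for big.txt and nwords.

```python
import collections
import re

def words(text):
    return re.findall('[a-z]+', text.lower())

def train(features):
    model = collections.defaultdict(lambda: 1)
    for f in features:
        model[f] += 1
    return model

# Hypothetical toy sentence standing in for big.txt.
toy = train(words("The cat saw the other cat."))

seen_twice = toy["the"]    # 2 occurrences + default count 1 = 3
seen_once = toy["saw"]     # 1 occurrence + default count 1 = 2
is_known = "dog" in toy    # False: a membership test does not insert the key
unseen = toy["dog"]        # 1: an unseen word gets the default count
now_known = "dog" in toy   # True: indexing a defaultdict inserts the key
```

One consequence worth noting: indexing the defaultdict with an unseen word adds that word to the dictionary, so membership should be tested with in before the counts are looked up, which is what known does.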
nwords
    defaultdict(<function __main__.train.<locals>.<lambda>()>,
                {'the': 80031,
                 'project': 289,
                 'gutenberg': 264,
                 'ebook': 88,
                 'of': 40026,
                 'adventures': 18,
                 'sherlock': 102,
                 'holmes': 468,
                 'by': 6739,
                 'sir': 178,
                 'arthur': 35,
                 'conan': 5,
                 'doyle': 6,
                 'in': 22048,
                 'our': 1067,
                 'series': 129,
                 'copyright': 70,
                 'laws': 234,
                 'are': 3631,
                 'changing': 45,
                 ...})
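Edit distance:
The edit distance between two words is the minimum number of insertions (inserting a single letter into the word), deletions (removing a single letter), transpositions (swapping two adjacent letters), and replacements (changing one letter into another) needed to turn one word into the other.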
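The definition above can be turned into a small dynamic-programming function. The sketch below is not part of the spell checker itself, and the name edit_distance is my own. It computes the "optimal string alignment" variant of the distance, which counts an adjacent swap as one operation but never edits the same substring twice, so in rare cases it can differ from applying edits1 repeatedly.

```python
def edit_distance(a, b):
    # d[i][j] holds the distance between the prefixes a[:i] and b[:j].
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i                      # delete all i letters
    for j in range(len(b) + 1):
        d[0][j] = j                      # insert all j letters
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # replacement (or match)
            if (i > 1 and j > 1
                    and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[len(a)][len(b)]

vowel_swap = edit_distance("hello", "hallo")        # one replacement
adjacent_swap = edit_distance("teh", "the")         # one transposition
two_steps = edit_distance("something", "soothing")  # delete 'e', replace 'm'
```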
    # Return the set of all strings at edit distance 1 from word
    alphabet = 'abcdefghijklmnopqrstuvwxyz'
    def edits1(word):
        n = len(word)
        return set([word[0:i]+word[i+1:] for i in range(n)] +                     # deletion
                   [word[0:i]+word[i+1]+word[i]+word[i+2:] for i in range(n-1)] + # transposition
                   [word[0:i]+c+word[i+1:] for i in range(n) for c in alphabet] + # alteration
                   [word[0:i]+c+word[i:] for i in range(n+1) for c in alphabet])  # insertion
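The number of strings at edit distance 2 from "something" comes to a surprising 114,324.
Optimization: among these strings within edit distance 2, keep only the correctly spelled (known) words as candidates. This leaves just 3 words: 'smoothing', 'something' and 'soothing'.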
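    # Return the set of all strings at edit distance 2 from word.
    # Among these strings, only the known words are kept as candidates;
    # that filtering is done by known_edits2 in the full listing.
    def edits2(word):
        return set(e2 for e1 in edits1(word) for e2 in edits1(e1))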
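The growth of the candidate space, and the effect of filtering against a vocabulary, can be checked with a short sketch. It reuses edits1 and edits2 as defined in this article; the set vocabulary is a hypothetical stand-in for the keys of nwords, so the filtered result only illustrates the mechanism and is not the result for big.txt.

```python
alphabet = 'abcdefghijklmnopqrstuvwxyz'

def edits1(word):
    n = len(word)
    return set([word[0:i] + word[i+1:] for i in range(n)] +                          # deletion
               [word[0:i] + word[i+1] + word[i] + word[i+2:] for i in range(n-1)] +  # transposition
               [word[0:i] + c + word[i+1:] for i in range(n) for c in alphabet] +    # alteration
               [word[0:i] + c + word[i:] for i in range(n+1) for c in alphabet])     # insertion

def edits2(word):
    return set(e2 for e1 in edits1(word) for e2 in edits1(e1))

word = 'something'
n = len(word)

# n deletions, n-1 transpositions, 26n alterations, 26(n+1) insertions:
# 54n + 25 strings are generated before the set removes duplicates.
raw_count = n + (n - 1) + 26 * n + 26 * (n + 1)

one_edit = edits1(word)
two_edits = edits2(word)

# Hypothetical tiny vocabulary standing in for the keys of nwords.
vocabulary = {'something', 'soothing', 'smoothing', 'nothing', 'some'}
candidates = two_edits & vocabulary
```

Here candidates contains 'something', 'soothing' and 'smoothing', while 'nothing' and 'some' are more than two edits away and are dropped. Note that the word itself is among its own candidates, because an alteration may replace a letter with the same letter.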
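Normally, mistyping one vowel as another is more likely than mistyping one consonant as another (people often type hallo for hello), misspelling the first letter of a word is relatively unlikely, and so on.
For simplicity, however, a simple rule is used here: known words at edit distance 1 take priority over known words at edit distance 2, and a known word at edit distance 0 takes priority over those at edit distance 1.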
    def known(words): return set(w for w in words if w in nwords)

    # If a known(...) call returns a non-empty set, candidates takes that set and the later alternatives are not evaluated.
    # known_edits2 is defined in the full listing at the top.
    def correct(word):
        candidates = known([word]) or known(edits1(word)) or known_edits2(word) or [word]
        return max(candidates, key=lambda w: nwords[w])
correct('knona')
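For readers without big.txt at hand, the whole pipeline can be tried on a toy corpus. The sketch below repeats the functions from this article unchanged in behavior; the corpus string and the name toy_counts are hypothetical, so the corrections it produces say nothing about what the real model returns for 'knona'.

```python
import collections
import re

alphabet = 'abcdefghijklmnopqrstuvwxyz'

def words(text):
    return re.findall('[a-z]+', text.lower())

def train(features):
    model = collections.defaultdict(lambda: 1)
    for f in features:
        model[f] += 1
    return model

# Hypothetical toy corpus standing in for big.txt.
toy_counts = train(words("the the the then they spelling checker"))

def edits1(word):
    n = len(word)
    return set([word[0:i] + word[i+1:] for i in range(n)] +
               [word[0:i] + word[i+1] + word[i] + word[i+2:] for i in range(n-1)] +
               [word[0:i] + c + word[i+1:] for i in range(n) for c in alphabet] +
               [word[0:i] + c + word[i:] for i in range(n+1) for c in alphabet])

def known_edits2(word):
    return set(e2 for e1 in edits1(word) for e2 in edits1(e1) if e2 in toy_counts)

def known(ws):
    return set(w for w in ws if w in toy_counts)

def correct(word):
    # Each "or" stops at the first non-empty set: distance 0, then 1, then 2.
    candidates = known([word]) or known(edits1(word)) or known_edits2(word) or [word]
    return max(candidates, key=lambda w: toy_counts[w])

unchanged = correct("the")       # already a known word (distance 0)
by_frequency = correct("thn")    # "the" and "then" are both 1 edit away; "the" is more frequent
by_distance = correct("thenn")   # "then" (1 edit) beats the more frequent "the" (2 edits)
inserted = correct("speling")    # "spelling" is 1 insertion away
no_candidate = correct("zzzzzz") # nothing within 2 edits, so the input is returned
```

The by_distance case shows the priority rule at work: frequency is only compared among candidates at the same, smallest available edit distance.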