Three methods for text manipulation are discussed here: lexical tokenizers, regular expressions and abstract syntax trees (AST). In this study I will focus on tokenizers, as they are common in NLP tasks. An AST, like David Halter’s “Parso” library, is mainly used for modifying syntactic constructs in Python source code. The regular expression (regexp) method for text processing is often called the naive method. The re.match function, for instance, matches a pattern, written in a special syntax, against a string. It takes the pattern as the first argument and the string as the second argument, and returns a match object, or None if there is no match. The match function only matches at the beginning of the string, whereas re.search scans the entire string and returns the first place where the pattern occurs. Regexp often fails on “edge cases” and is difficult to maintain correctly. It typically cannot rule out false positives and is prone to errors. The ast module can avoid false positives, but lexical tokenizers can capture more information: an AST will lose comments, parentheses and whitespace. Lexical tokenizers (lexers) tokenize strings as separate entities, which helps to avoid common errors, map parts of speech, and match or remove unwanted tokens. A tokenizer reads tokens one at a time from the input stream and passes them to the parser.
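
The differences described above can be illustrated with a minimal sketch that uses only Python's standard library (re, tokenize and ast). The sample strings and variable names are assumptions made up for illustration, and ast.unparse requires Python 3.9 or later.

```python
import ast
import io
import re
import tokenize

# Hypothetical sample text, chosen only for illustration.
text = "say hello to the tokenizer"

# re.match succeeds only if the pattern matches at the start of the string,
# so this returns None.
at_start = re.match(r"hello", text)

# re.search scans the string and returns the first match found anywhere.
anywhere = re.search(r"hello", text)

# Hypothetical one-line Python source with redundant parentheses and a comment.
source = "x = (1 + 2)  # add numbers\n"

# The lexical tokenizer keeps the comment as a COMMENT token.
tokens = list(tokenize.generate_tokens(io.StringIO(source).readline))
comments = [tok.string for tok in tokens if tok.type == tokenize.COMMENT]

# Round-tripping through the AST drops the comment and the redundant
# parentheses, because neither is stored in the tree.
regenerated = ast.unparse(ast.parse(source))

print(at_start, anywhere.span(), comments, regenerated)
```

The token stream still contains the comment, while the source regenerated from the AST does not, which is the sense in which a tokenizer captures more information than the ast module.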

