User:Macrakis/Content extraction

Content extraction, body text extraction, or boilerplate removal is the automated extraction of the substantive textual content of a web page, PDF, or other formatted document, ignoring headers, footers, navigational tools, advertising, legal notices, and similar material. Search engines use it to remove redundant and uninformative parts of a web page before evaluating the page's relevance to queries. "Boilerplate" in this context means content and design elements that are repeated across multiple pages, typically through templates.
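The idea can be illustrated with a minimal Python sketch. It assumes, purely for illustration, that boilerplate sits inside a fixed set of HTML elements (`nav`, `header`, `footer`, `aside`, `script`, `style`) and keeps only the text outside them. The names `BOILERPLATE_TAGS`, `BodyTextExtractor`, and `extract_body_text` are hypothetical. Practical extractors generally cannot rely on tag names alone and use richer heuristics, so this is a sketch of the concept rather than a usable implementation.

```python
from html.parser import HTMLParser

# Assumption: these elements usually hold boilerplate rather than body text.
BOILERPLATE_TAGS = {"script", "style", "nav", "header", "footer", "aside"}


class BodyTextExtractor(HTMLParser):
    """Collects text that is not nested inside a boilerplate element."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # how many boilerplate elements are currently open
        self._chunks = []     # text fragments kept so far

    def handle_starttag(self, tag, attrs):
        if tag in BOILERPLATE_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in BOILERPLATE_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self._chunks.append(text)

    def text(self):
        return " ".join(self._chunks)


def extract_body_text(html):
    """Return the text of `html` with the assumed boilerplate elements removed."""
    parser = BodyTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


page = (
    "<body><header>Site name</header>"
    "<nav><a href='/'>Home</a></nav>"
    "<p>Main article text.</p>"
    "<footer>Legal notice</footer></body>"
)
print(extract_body_text(page))
```

Here the header, navigation link, and footer are dropped and only the paragraph text is kept. A tag list like this fails on pages that build their layout from generic `div` elements, which is one reason tag names alone are not a reliable signal.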

Uses

Content extraction is typically used as a first step in the automated processing of web pages.