Skip to content

lightweight command line HTML processor using CSS and XPath selectors

License

Notifications You must be signed in to change notification settings

ludovicianul/hq

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

42 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

hq

Small utility to parse and grep HTML files. It uses CSS selectors or XPath Selectors to extract HTML elements.

Usage

hq - command line HTML elements finder; version 1.0.0

Usage: hq [-hptV] [-a=<attribute>] [-f=<FILE>] [-o=<FILE>] [-s=<POLICY>] [-x=<XPATH>] <selector>
          [COMMAND]
      <selector>            The CSS selector
  -a, --attribute=<attribute>
                            Return only this attribute from the selected HTML elements
  -f, --file=<FILE>         The HTML input file. If not supplied it will default to stdin
  -h, --help                Show this help message and exit.
  -o, --output=<FILE>       The output file. If not supplied it will default to stdout
  -p, --pretty              Force pretty printing the output
  -r, --remove=<SELECTOR>   Remove nodes matching given selector
  -s, --sanitize=<POLICY>   Sanitizes the html input according to the given policy
  -t, --text                Display only the inner text of the selected HTML top element
  -V, --version             Print version information and exit.
  -x, --xpath=<XPATH>       Supply an XPath selector instead of CSS
Commands:
  generate-completion  Generate bash/zsh completion script for hq.

Installation

Homebrew

> brew tap ludovicianul/tap
> brew install ludovicianul/tap/hq

Manual

hq is compiled to native code using GraalVM. Check the release page for binaries (Linux, MacOS, uberjar).

After download, you can make hq globally available:

sudo cp hq-macos /usr/local/bin/hq

The uberjar can be run using java -jar hq. Requires Java 11+.

Autocomplete

Run the following commands to get autocomplete:

hq generate-completion >> hq_autocomplete

source hq_autocomplete

HTML Sanitizing

hq can sanitize html output. Supported modes are: NONE, BASIC, SIMPLE_TEXT, BASIC_WITH_IMAGES, RELAXED.

This is how sanitization works:

Policy Details
NONE Allows only text nodes: all HTML will be stripped.
BASIC Allows a fuller range of text nodes: a, b, blockquote, br, cite, code, dd, dl, dt, em, i, li, ol, p, pre, q, small, span, strike, strong, sub, sup, u, ul, and appropriate attributes. Does not allow images.
SIMPLE_TEXT Allows only simple text formatting: b, em, i, strong, u. All other HTML (tags and attributes) will be removed.
BASIC_WITH_IMAGES Allows the same text tags as BASIC, and also allows img tags, with appropriate attributes, with src pointing to http or https.
RELAXES Allows a full range of text and structural body HTML: a, b, blockquote, br, caption, cite, code, col, colgroup, dd, div, dl, dt, em, h1, h2, h3, h4, h5, h6, i, img, li, ol, p, pre, q, small, span, strike, strong, sub, sup, table, tbody, td, tfoot, th, thead, tr, u, ul.

Examples

Get the div with id mainLeaderboard:

➜ curl -s https://www.w3schools.com/cssref/css_selectors.php | hq "#main > p:nth-child(6)" -t

In CSS, selectors are patterns used to select the element(s) you want to style.

Get the text inside an article:

➜ curl -s https://ludovicianul.github.io/2021/07/16/unicode_language_version/ | hq '.post' -t

Make sure you know which Unicode version is supported by your programming language version 16 Jul 2021 While enhancing CATS I recently added a feature to send requests that include 
single and multi code point emojis. This is a single code point emoji: 🥶, which can be represented in Java as the \uD83E\uDD76 string. The test case is simple: inject emojis within 
strings and expect that the REST endpoint will sanitize the input and remove them entirely (I appreciate this might not be a valid case for all APIs, this is why the behaviour is 
configurable in CATS, but not the focus of this article). I usually recommend that any REST endpoint should sanitize input before validating it and remove special characters. 
A typical regex for this would be [\p{C}\p{Z}\p{So}]+ (although you should enhance it to allow spaces between words), which means: p{C} - match Unicode invisible Control 
Chars (\u000D - carriage return for example) ...
...

Sanitize the html according to the specified policy:

 ➜ curl -s https://ludovicianul.github.io/2021/07/16/unicode_language_version/ | hq html -s=BASIC -p

<html>
    <head></head>
    <body>
        <a href="https://ludovicianul.github.io/" rel="nofollow"> m's blog </a>
        <p>practical thoughts about software engineering</p>
        <a href="https://ludovicianul.github.io/" rel="nofollow">Home</a>
        <a rel="nofollow">About</a>
        <a href="https://github.com/ludovicianul" rel="nofollow">GitHub</a>
        <p>© 2021. All rights reserved.</p>
        Make sure you know which Unicode version is supported by your programming language version
        <span>16 Jul 2021</span>
        <p>
...
    </body>
</html>

Get all href attributes from a given page:

 ➜ curl -s https://ludovicianul.github.io | hq "*" -a "href"
http://gmpg.org/xfn/11
https://ludovicianul.github.io/public/css/poole.css
https://ludovicianul.github.io/public/css/syntax.css
https://ludovicianul.github.io/public/css/hyde.css
https://fonts.googleapis.com/css?family=PT+Sans:400,400italic,700|Abril+Fatface
https://ludovicianul.github.io/public/apple-touch-icon-144-precomposed.png
https://ludovicianul.github.io/public/favicon.ico
/atom.xml
https://ludovicianul.github.io/
https://ludovicianul.github.io/
/about/
...

Resources