hq

A small command-line utility to parse and grep HTML files. It uses CSS or XPath selectors to extract HTML elements.

Usage

hq - command line HTML elements finder; version 1.0.0

Usage: hq [-hptV] [-a=<attribute>] [-f=<FILE>] [-o=<FILE>] [-s=<POLICY>] [-x=<XPATH>] <selector>
          [COMMAND]
      <selector>            The CSS selector
  -a, --attribute=<attribute>
                            Return only this attribute from the selected HTML elements
  -f, --file=<FILE>         The HTML input file. If not supplied it will default to stdin
  -h, --help                Show this help message and exit.
  -o, --output=<FILE>       The output file. If not supplied it will default to stdout
  -p, --pretty              Force pretty printing the output
  -r, --remove=<SELECTOR>   Remove nodes matching given selector
  -s, --sanitize=<POLICY>   Sanitizes the html input according to the given policy
  -t, --text                Display only the inner text of the selected HTML top element
  -V, --version             Print version information and exit.
  -x, --xpath=<XPATH>       Supply an XPath selector instead of CSS
Commands:
  generate-completion  Generate bash/zsh completion script for hq.
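
As a rough sketch of how these options combine, the snippet below writes a small sample page to a temporary file and defines a helper that reads it with -f, extracts one attribute with -a, and writes the result with -o. The sample page, the `extract_links` name, and the file paths are illustrative assumptions; only the flags come from the help text above.

```shell
# Hypothetical sample page, used only for this illustration.
sample="$(mktemp)"
cat > "$sample" <<'EOF'
<html><body>
  <p id="intro">Hello <a href="https://example.com/docs">docs</a></p>
  <p><a href="/about/">about</a></p>
</body></html>
EOF

# extract_links INPUT OUTPUT
# Writes the href attribute of every <a> element in INPUT to OUTPUT.
# Flags used: -f (input file), -a (attribute), -o (output file).
extract_links() {
  hq -f "$1" -a href -o "$2" "a"
}
```

With hq installed, `extract_links "$sample" links.txt` should leave the extracted attribute values in links.txt.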

Installation

Homebrew

> brew tap ludovicianul/tap
> brew install ludovicianul/tap/hq

Manual

hq is compiled to native code using GraalVM. Check the releases page for downloads (Linux and macOS binaries, plus an uberjar).

After download, you can make hq globally available:

sudo cp hq-macos /usr/local/bin/hq

The uberjar can be run using java -jar followed by the downloaded jar file. It requires Java 11+.
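
If you use the uberjar, one possible convenience is a shell function that forwards its arguments to java -jar, so the examples in this README work unchanged. This is only a sketch: the `HQ_JAR` variable and its default path are hypothetical, so point it at the jar you actually downloaded.

```shell
# HQ_JAR is a hypothetical variable; set it to the location of the downloaded uberjar.
HQ_JAR="${HQ_JAR:-$HOME/tools/hq.jar}"

# Forward every argument to the uberjar so "hq ..." behaves like the native binary.
hq() {
  java -jar "$HQ_JAR" "$@"
}
```

Adding these lines to your shell startup file makes the wrapper available in every session.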

Autocomplete

Run the following commands to get autocomplete:

hq generate-completion >> hq_autocomplete

source hq_autocomplete
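
The commands above enable completion for the current shell session only. A minimal sketch of making it persistent follows. The `install_hq_completion` helper and the paths passed to it are assumptions, not part of hq. It overwrites the script instead of appending to it, and adds the source line to the startup file only once.

```shell
# install_hq_completion SCRIPT RCFILE
# Regenerates the completion script and makes sure RCFILE sources it exactly once.
install_hq_completion() {
  script="$1"
  rcfile="$2"
  # Overwrite rather than append, so re-running does not duplicate the script.
  hq generate-completion > "$script" || return 1
  line="source $script"
  # Append the source line only if RCFILE does not already contain it.
  grep -qxF "$line" "$rcfile" 2>/dev/null || printf '%s\n' "$line" >> "$rcfile"
}
```

For example, `install_hq_completion "$HOME/.hq_autocomplete" "$HOME/.bashrc"` (or `"$HOME/.zshrc"` for zsh).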

HTML Sanitizing

hq can sanitize HTML using the -s option. Supported policies are: NONE, BASIC, SIMPLE_TEXT, BASIC_WITH_IMAGES, RELAXED.

Each policy allows the following:

Policy	Details
`NONE`	Allows only text nodes: all HTML will be stripped.
`BASIC`	Allows a fuller range of text nodes: `a, b, blockquote, br, cite, code, dd, dl, dt, em, i, li, ol, p, pre, q, small, span, strike, strong, sub, sup, u, ul`, and appropriate attributes. Does not allow images.
`SIMPLE_TEXT`	Allows only simple text formatting: `b, em, i, strong, u`. All other HTML (tags and attributes) will be removed.
`BASIC_WITH_IMAGES`	Allows the same text tags as `BASIC`, and also allows `img` tags, with appropriate attributes, with `src` pointing to `http` or `https`.
`RELAXED`	Allows a full range of text and structural body HTML: `a, b, blockquote, br, caption, cite, code, col, colgroup, dd, div, dl, dt, em, h1, h2, h3, h4, h5, h6, i, img, li, ol, p, pre, q, small, span, strike, strong, sub, sup, table, tbody, td, tfoot, th, thead, tr, u, ul`.
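
To get a feel for how much each policy strips, one option is to run the same file through every policy and compare output sizes. The helper below is a sketch under that assumption. The `sanitize_with_each_policy` name is hypothetical, and it relies only on the -f and -s flags and the `html` selector used elsewhere in this README.

```shell
# sanitize_with_each_policy FILE
# Prints one line per policy: the policy name, a tab, and the number of bytes
# hq emits after sanitizing FILE with that policy.
sanitize_with_each_policy() {
  for policy in NONE BASIC SIMPLE_TEXT BASIC_WITH_IMAGES RELAXED; do
    size="$(hq html -f "$1" -s="$policy" | wc -c | tr -d ' ')"
    printf '%s\t%s\n' "$policy" "$size"
  done
}
```

Run it as `sanitize_with_each_policy page.html`, where page.html is any locally saved page.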

Examples

Get the text of the paragraph matching #main > p:nth-child(6):

➜ curl -s https://www.w3schools.com/cssref/css_selectors.php | hq "#main > p:nth-child(6)" -t

In CSS, selectors are patterns used to select the element(s) you want to style.

Get the text inside an article:

➜ curl -s https://ludovicianul.github.io/2021/07/16/unicode_language_version/ | hq '.post' -t

Make sure you know which Unicode version is supported by your programming language version 16 Jul 2021 While enhancing CATS I recently added a feature to send requests that include 
single and multi code point emojis. This is a single code point emoji: 🥶, which can be represented in Java as the \uD83E\uDD76 string. The test case is simple: inject emojis within 
strings and expect that the REST endpoint will sanitize the input and remove them entirely (I appreciate this might not be a valid case for all APIs, this is why the behaviour is 
configurable in CATS, but not the focus of this article). I usually recommend that any REST endpoint should sanitize input before validating it and remove special characters. 
A typical regex for this would be [\p{C}\p{Z}\p{So}]+ (although you should enhance it to allow spaces between words), which means: p{C} - match Unicode invisible Control 
Chars (\u000D - carriage return for example) ...
...

Sanitize the HTML using the BASIC policy and pretty-print the result:

➜ curl -s https://ludovicianul.github.io/2021/07/16/unicode_language_version/ | hq html -s=BASIC -p

<html>
    <head></head>
    <body>
        <a href="https://ludovicianul.github.io/" rel="nofollow"> m's blog </a>
        <p>practical thoughts about software engineering</p>
        <a href="https://ludovicianul.github.io/" rel="nofollow">Home</a>
        <a rel="nofollow">About</a>
        <a href="https://github.com/ludovicianul" rel="nofollow">GitHub</a>
        <p>© 2021. All rights reserved.</p>
        Make sure you know which Unicode version is supported by your programming language version
        <span>16 Jul 2021</span>
        <p>
...
    </body>
</html>

Get all href attributes from a given page:

➜ curl -s https://ludovicianul.github.io | hq "*" -a "href"
http://gmpg.org/xfn/11
https://ludovicianul.github.io/public/css/poole.css
https://ludovicianul.github.io/public/css/syntax.css
https://ludovicianul.github.io/public/css/hyde.css
https://fonts.googleapis.com/css?family=PT+Sans:400,400italic,700|Abril+Fatface
https://ludovicianul.github.io/public/apple-touch-icon-144-precomposed.png
https://ludovicianul.github.io/public/favicon.ico
/atom.xml
https://ludovicianul.github.io/
https://ludovicianul.github.io/
/about/
...
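
Because hq prints one attribute value per line, its output can be piped into standard tools. As a small sketch, the hypothetical `absolute_links` filter below keeps only absolute http(s) URLs and removes duplicates. It uses plain grep and sort, not any hq feature.

```shell
# absolute_links
# Reads one URL per line on stdin, keeps only http:// and https:// URLs,
# and prints them sorted with duplicates removed.
absolute_links() {
  grep -E '^https?://' | sort -u
}
```

For example: `curl -s https://ludovicianul.github.io | hq "*" -a "href" | absolute_links`.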
