Lynx is one of the oldest web browsers still in active use—a lightweight, text-based browser that runs entirely in the terminal. While its text-only interface may seem limiting at first glance, Lynx has a host of advantages for both browsing and web scraping. In this post, we’ll dive into how you can use Lynx for day-to-day web browsing, and then explore its powerful capabilities for extracting data from web pages.
Table of Contents
- Introduction
- Installing and Configuring Lynx
- Basic Web Browsing with Lynx
- Web Scraping with Lynx
- Scripting and Automation Examples
- Advanced Tips and Customizations
- Conclusion
Introduction
Lynx is a command-line web browser that renders web pages as plain text. Because it does not execute JavaScript or load multimedia content, it shows only what the server sends in its HTML response. This behavior is particularly useful for web scrapers that need a simplified view of a webpage without the extra clutter of images, scripts, and style sheets.
Whether you’re troubleshooting a site’s text output for SEO purposes, or you need to quickly harvest links and data from a website using a shell script, Lynx offers an efficient and reliable solution.
Installing and Configuring Lynx
Installation
Lynx is available on many platforms:
- Linux (Ubuntu/Debian):
sudo apt-get update
sudo apt-get install lynx
- macOS (using Homebrew):
brew install lynx
- Windows: Install Lynx through Windows Subsystem for Linux (WSL) or via a package manager like Chocolatey.
Basic Configuration
After installation, you can check the version by running:
lynx --version
Lynx also supports a configuration file (lynx.cfg) where you can customize settings such as:
- Keybindings (e.g., enabling vi keys for navigation)
- Link numbering and text formatting options
- Cookie handling for smoother browsing
A properly tuned configuration file can enhance both your browsing experience and scraping accuracy.
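As an illustrative sketch, a few commonly adjusted lynx.cfg directives are shown below; the option names come from a stock lynx.cfg, but verify them against the copy shipped with your version:
# Enable vi-style movement keys (h/j/k/l)
VI_KEYS_ALWAYS_ON:TRUE
# Number links so you can follow one by typing its number
DEFAULT_KEYPAD_MODE:LINKS_ARE_NUMBERED
# Accept cookies without prompting on every page
ACCEPT_ALL_COOKIES:TRUE
# Default display character set
CHARACTER_SET:utf-8
Rather than editing the system-wide file, you can keep a personal copy and point Lynx at it with lynx -cfg=/path/to/lynx.cfg.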
Basic Web Browsing with Lynx
Using Lynx for browsing is straightforward. Launch it by specifying a URL:
lynx https://example.com/
Navigation in Lynx is done entirely via the keyboard. Links are typically numbered, so you can jump to a specific link by entering its corresponding number. Here are some common commands:
- Arrow Keys (or j and k with vi keys enabled): Move between links and scroll
- Space / b: Page down and up
- Enter: Follow a link
- q: Quit Lynx
- p: Print or save the current page
The text-only view provided by Lynx is great for quickly scanning content without distractions, especially when you need a clear view of the server-rendered HTML.
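If you would rather not edit lynx.cfg, similar navigation tweaks can be passed on the command line; --vikeys and --number_links are standard Lynx options, though the exact flag set can vary slightly between versions:
lynx --vikeys --number_links https://example.com/
This enables vi-style movement and numbered links for the current session only.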
Web Scraping with Lynx
Because Lynx strips away JavaScript and multimedia, it produces a clean dump of a web page’s content. This makes it ideal for scraping data such as:
- Lists of links
- Plain text content
- Metadata visible in the HTML source
Dumping a Webpage
To output the rendered text of a page, use the --dump option:
lynx --dump https://example.com/ > output.txt
This command prints the page’s content to standard output, which you can redirect into a file. The dumped content often includes a list of numbered links at the bottom of the file.
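When you want the rendered text without the trailing link list, Lynx's --nolist and --width dump options keep the output tidy; the 100-column width below is just an arbitrary choice:
lynx --dump --nolist --width=100 https://example.com/ > article.txt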
Extracting Links
If your goal is to extract just the links from a page, combine a few options:
- --listonly: Outputs only the list of links
- --nonumbers: Removes the line numbers from the output
- --display_charset=utf-8: Ensures proper character encoding
Example:
lynx --listonly --nonumbers --display_charset=utf-8 --dump https://www.nytimes.com/ | grep "^http" | sort | uniq > links.txt
This command extracts every URL from the New York Times homepage, sorts the list, and removes duplicates.
Scripting and Automation Examples
By integrating Lynx with other shell utilities, you can create powerful web scraping scripts. Here are two practical examples.
Example 1: Extracting Links with a Shell Function
Add this function to your shell configuration file (e.g., ~/.bashrc or ~/.zshrc):
extract_links() {
  lynx --listonly --nonumbers --display_charset=utf-8 --dump "$1" | grep "^http" | sort | uniq
}
Usage:
extract_links https://www.example.com/ > example_links.txt
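If you keep a list of target pages in a file, a short loop can reuse the function; urls.txt here is a hypothetical file containing one URL per line:
# Collect links from every URL listed in urls.txt (hypothetical file)
while read -r url; do
  extract_links "$url" >> all_links.txt
done < urls.txt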
Example 2: Scraping Specific Data
Suppose you want to extract weather data from a specific site. You can pipe Lynx’s dump into grep and awk to filter the data:
#!/bin/bash
# weather_scrape.sh: Extract weather info
URL="https://weather.example.com/today"

# Dump the page content and filter for temperature information
weather_info=$(lynx --dump "$URL" | grep "Temperature:")

echo "Today's weather: $weather_info"
Make the script executable:
chmod +x weather_scrape.sh
And run it:
./weather_scrape.sh
Advanced Tips and Customizations
Using Regular Expressions
For more refined data extraction, integrate regular expressions with grep or sed. For instance, to extract IP addresses from a webpage:
lynx --dump https://example.com/ | grep -Eo '([0-9]{1,3}\.){3}[0-9]{1,3}' > ips.txt
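Since sed is mentioned alongside grep, here is a small sketch that uses Lynx's --source option (which prints the raw HTML instead of the rendered text) and a sed expression to pull out the page title; it assumes the <title> tag sits on a single line, which is not true for every site:
# Print the raw HTML and extract the contents of the <title> tag
lynx --source https://example.com/ | sed -n 's/.*<title>\(.*\)<\/title>.*/\1/p'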
Combining Lynx with Other Tools
- curl: curl is often better at handling custom headers, authentication, and retries; fetch the page with curl and hand the HTML to Lynx for rendering (see the sketch after this list).
- Mailcap configuration: Lynx is mailcap-aware, allowing you to specify external programs to handle specific MIME types. This is useful if you need to process or view certain types of data with specialized tools.
- Scripting in Bash: Automate regular scrapes by scheduling your scripts with cron or integrating them into larger data pipelines.
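As a minimal sketch of both ideas, the snippet below fetches a page with curl and renders it through Lynx's --stdin and --force_html options, then shows an illustrative crontab entry that runs the earlier weather script every morning; the schedule and file paths are assumptions, not recommendations:
# Fetch with curl (which handles headers, redirects, and auth), then render with Lynx
curl -sL https://example.com/ | lynx --stdin --force_html --dump > page.txt

# Example crontab entry (added with crontab -e): run the weather script daily at 7:00
0 7 * * * /home/user/scripts/weather_scrape.sh >> /home/user/weather.log 2>&1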
Handling Dynamic Content
Remember that Lynx does not execute JavaScript. For sites that heavily rely on client-side scripting, Lynx may not capture all dynamic content. However, many sites still serve a basic HTML version that is perfect for text scraping and SEO analysis.
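A quick way to check whether the data you need is actually server-rendered is to dump the page and grep for a string you expect to find; if the count is zero, the content is most likely injected by JavaScript and Lynx alone will not see it. The URL and search string below are placeholders:
lynx --dump https://example.com/ | grep -c "Expected text"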
Conclusion
Lynx may have started its life in the early 1990s, but its simplicity and efficiency still make it a valuable tool for both web browsing and web scraping. Its ability to render pages in pure text provides an uncluttered view of server-rendered content—ideal for debugging, SEO analysis, and quick data extraction.
By mastering Lynx’s command-line options and combining it with powerful shell utilities like grep, awk, and sed, you can create flexible scripts that handle everything from link extraction to comprehensive web scraping tasks.