2010-04-14 - Jakarta Taglibs has been retired.

For more information, please explore the Attic.

Jakarta Project: Scrape Tag library (Pre Beta)

Version: 1.0

Overview

The scrape tag library can scrape or extract content from web documents and display the content in your JSP. For example, you could scrape stock quotes from other web sites and display them in your pages.

After your JSP scrapes a document for the first time, the results of the scrape are cached for subsequent JSP requests. These results are returned unless the JSP determines that the document must be rescraped. Rescraping is determined by the following logic:

  1. The status of the scrape tags and attributes in the JSP is examined. Any modifications to the tags or attributes trigger a rescrape. If the tags have not been modified, the JSP proceeds to step 2.
  2. The minimum time for rescraping, specified by the time attribute of the page tag, is examined. The default time is 10 minutes. If this time has not passed since the last scrape, cached results are returned. If this time has passed, the JSP proceeds to step 3.
  3. The Expires header of the scraped document is examined. If the expiration date/time has not passed, cached results are returned. If the expiration date/time is not specified or the document has expired, the JSP proceeds to step 4.
  4. The headers for the scraped document are requested and examined. If the document has not been modified since the last scrape, cached results are returned. If the document has been modified, it is rescraped and the new results are returned.
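
The four steps above amount to a short decision function. The following is a minimal Java sketch of that logic, not the tag library's actual implementation. The class name RescrapeDecision, the method shouldRescrape, and all of its parameters are hypothetical, and treating an unknown modification time as "modified" in step 4 is an assumption that the text above does not confirm.

```java
import java.time.Duration;
import java.time.Instant;

/** Hypothetical model of the four-step rescrape check; not the taglib's own classes. */
final class RescrapeDecision {

    /** Default minimum rescrape interval stated in the text (10 minutes). */
    static final Duration DEFAULT_MIN_INTERVAL = Duration.ofMinutes(10);

    /**
     * @param tagsModified step 1: scrape tags or attributes changed since the last scrape
     * @param lastScrape   when the cached results were produced
     * @param minInterval  step 2: value of the page tag's time attribute
     * @param expires      step 3: expiration time from the document headers, or null if absent
     * @param lastModified step 4: modification time from a fresh header request, or null if unknown
     * @param now          current time, passed in so the logic can be tested
     * @return true if the document should be rescraped, false if cached results can be returned
     */
    static boolean shouldRescrape(boolean tagsModified, Instant lastScrape, Duration minInterval,
                                  Instant expires, Instant lastModified, Instant now) {
        if (tagsModified) {
            return true;                       // step 1: any tag change forces a rescrape
        }
        if (now.isBefore(lastScrape.plus(minInterval))) {
            return false;                      // step 2: minimum time has not passed
        }
        if (expires != null && now.isBefore(expires)) {
            return false;                      // step 3: document has not expired yet
        }
        // Step 4: rescrape only if modified since the last scrape.
        // Assumption: an unknown modification time counts as modified.
        return lastModified == null || lastModified.isAfter(lastScrape);
    }

    public static void main(String[] args) {
        Instant now = Instant.now();
        Instant lastScrape = now.minus(Duration.ofMinutes(30));
        // Tags unchanged, 30 minutes elapsed, no expiration, modified 5 minutes ago.
        boolean rescrape = shouldRescrape(false, lastScrape, DEFAULT_MIN_INTERVAL,
                null, now.minus(Duration.ofMinutes(5)), now);
        System.out.println("rescrape? " + rescrape);
    }
}
```

Passing the current time as a parameter keeps the check deterministic, which makes each of the four steps easy to exercise in isolation.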

Requirements

This custom tag library requires a servlet container that supports the JavaServer Pages Specification, version 1.1 or higher. It also requires an up-to-date version of the jakarta-oro package.

Configuration

Follow these steps to configure your web application with this tag library:

To use the tags from this library in your JSP pages, add the following directive at the top of each page:

<%@ taglib uri="http://jakarta.apache.org/taglibs/scrape-1.0" prefix="scrp" %>

where "scrp" is the tag name prefix you wish to use for tags from this library. You can change this value to any prefix you like.

Tag Summary

Scrape Tags
page    Specify the URL of the document to be scraped and the minimum time that must pass before the document is rescraped.
url     Specify the URL of the document that contains the content to be scraped. Use this tag as an alternative to the page tag's url attribute when the URL must be generated dynamically.
header  Set an HTTP header for the request.
scrape  Specify the text anchors that mark the beginning and end of the content to be scraped.
result  Retrieve the content from a scrape.

Tag Reference

page Availability: 1.0

Specify the URL of the document to be scraped and the minimum time that must pass before the document is rescraped.

Tag Body: JSP
Restrictions: None

Attributes

Name  Required  Runtime Expression Evaluation  Availability
 url  No   No  1.0
 

The fully qualified URL of the document that is to be scraped, such as:

http://domain.name/directory/document.html

Note that if you must dynamically generate the URL, perhaps via a set of tags from a different tag library, you can omit the url attribute in the page tag and instead use the url tag.

 time  No   No  1.0
 

The length of time the JSP waits before attempting to rescrape the document. The value of time is specified in minutes. The minimum value is 10 minutes. Note that the minimum value is used if a time attribute is not specified.

 useProxy  No   No  1.0
 

Tells the taglib to use a proxy for the connection. The name and port of the proxy server are retrieved from the system properties http.proxyHost and http.proxyPort. This attribute is not necessary if the name and port are set with the proxyServer and proxyPort attributes.

 proxyServer  No   No  1.0
 

The name of the proxy server to use.

 proxyPort  No   No  1.0
 

The number of the port to use to connect to the proxy server. Defaults to 3128.

 proxyName  No   No  1.0
 

The username for authentication to the proxy server.

 proxyPass  No   No  1.0
 

The password for authentication to the proxy server.

 charset  No   No  1.0
 

The charset used by the scraped page. This attribute is useful when the page being scraped uses a different charset than the web server.
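
To illustrate why the charset matters, here is a minimal Java sketch, assuming the scraped document arrives as raw bytes that must be decoded into text. The class CharsetDecoding and the method decode are hypothetical names used only for illustration; they are not part of the tag library.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper showing the effect of decoding with the right or wrong charset. */
final class CharsetDecoding {

    /** Decodes raw response bytes using the named charset, such as the value of the charset attribute. */
    static String decode(byte[] body, String charsetName) {
        return new String(body, Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        // "cafe" with an accented final letter, encoded as ISO-8859-1 by the remote server.
        byte[] latin1 = "caf\u00e9".getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(decode(latin1, "ISO-8859-1")); // decoded with the page's charset
        System.out.println(decode(latin1, "UTF-8"));      // decoded with a mismatched charset
    }
}
```

Decoding the same bytes with a mismatched charset corrupts the accented character, which is the situation the charset attribute is meant to avoid.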

Variables: None

Example: Specify a document to be scraped with a rescrape time of 20 minutes. Note that a scrape tag must be nested within the body of the page tag.

<scrp:page url="http://finance.yahoo.com/q?s=SUNW" time="20">
   <scrp:scrape id="qt" begin="<table border=1" end="</table>" anchors="true"/>
</scrp:page>
       
          

Example: Specify a document to be scraped with a connection that must be made through a proxy on a port other than the default 3128. Note that a scrape tag must be nested within the body of the page tag.

<scrp:page url="http://finance.yahoo.com/q?s=SUNW" proxyServer="proxy.server"
proxyPort="3129">
   <scrp:scrape id="qt" begin="<table border=1" end="</table>" anchors="true"/>
</scrp:page>
       
          

Example: Specify a document to be scraped with a connection that must be made through a proxy. Use the Java system properties http.proxyHost and http.proxyPort. Note that a scrape tag must be nested within the body of the page tag.

<scrp:page url="http://finance.yahoo.com/q?s=SUNW" useProxy="true">
   <scrp:scrape id="qt" begin="<table border=1" end="</table>" anchors="true"/>
</scrp:page>
       
          

Example: Specify a document to be scraped with a connection that must be made through a proxy on a port other than the default 3128. The proxy server requires authentication. Note that a scrape tag must be nested within the body of the page tag.

<scrp:page url="http://finance.yahoo.com/q?s=SUNW" proxyServer="proxy.server"
proxyPort="3129" proxyName="foo" proxyPass="foobar">
   <scrp:scrape id="qt" begin="<table border=1" end="</table>" anchors="true"/>
</scrp:page>
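
The proxy attributes used in the examples above can be read as a small resolution procedure. The Java sketch below shows one plausible interpretation and is not the tag library's code. The names ProxySettings, resolve, and basicCredentials are hypothetical. Two further assumptions are labeled in the comments: that port 3128 is also used when http.proxyPort is unset, and that authentication uses the HTTP Basic scheme, which the documentation does not state.

```java
import java.net.InetSocketAddress;
import java.net.Proxy;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

/** Hypothetical resolution of the page tag's proxy attributes. */
final class ProxySettings {

    /** Default proxy port stated in the documentation. */
    static final int DEFAULT_PROXY_PORT = 3128;

    /** Returns the proxy to use, or Proxy.NO_PROXY when none is configured. */
    static Proxy resolve(String proxyServer, Integer proxyPort, boolean useProxy) {
        String host = proxyServer;
        Integer port = proxyPort;
        if (host == null && useProxy) {
            // useProxy="true": fall back to the Java system properties.
            host = System.getProperty("http.proxyHost");
            String systemPort = System.getProperty("http.proxyPort");
            if (port == null && systemPort != null) {
                port = Integer.valueOf(systemPort);
            }
        }
        if (host == null) {
            return Proxy.NO_PROXY;
        }
        // Assumption: 3128 applies whenever no port is given by attribute or property.
        int effectivePort = (port != null) ? port : DEFAULT_PROXY_PORT;
        return new Proxy(Proxy.Type.HTTP, InetSocketAddress.createUnresolved(host, effectivePort));
    }

    /** Assumption: Basic scheme. Builds a value for a Proxy-Authorization header. */
    static String basicCredentials(String proxyName, String proxyPass) {
        String raw = proxyName + ":" + proxyPass;
        return "Basic " + Base64.getEncoder().encodeToString(raw.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        Proxy proxy = resolve("proxy.server", 3129, false);
        System.out.println(proxy.address());
    }
}
```

The explicit proxyServer attribute takes precedence over useProxy in this sketch, which matches the statement that useProxy is unnecessary when the server and port are set directly.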
       
          

url Availability: 1.0

Specify the URL of the document that contains the content to be scraped. Use this tag as an alternative to the page tag's url attribute when the URL must be generated dynamically.

Tag Body: JSP
Restrictions: Must be nested within a page tag.

Attributes: None
Variables: None

Example: Specify a document to be scraped. Note that a url tag must be nested within the body of the page tag.

<scrp:page>
   <scrp:url>http://finance.yahoo.com/q?s=SUNW</scrp:url>
   <scrp:scrape id="qt" begin="<table border=1" end="</table>" anchors="true"/>
</scrp:page>
               
          

header Availability: 1.1

Set an HTTP header for the request.

Tag Body: JSP
Restrictions: Must be nested within a page tag.

Attributes

Name  Required  Runtime Expression Evaluation  Availability
 name  Yes  1.1
 

The name of the HTTP header to be sent in the HTTP request.

 value  No  1.1
 

The value of the HTTP header to be sent in the HTTP request. The value can instead be supplied in the body of the header tag.

Variables: None

Example: Specify that the HTTP request for the scrape sets the User-Agent and Referer headers. The User-Agent header is set using the name and value attributes. The Referer header is set using the name attribute and the body of the header tag. Note that a header tag must be nested within the body of the page tag.

<scrp:page>
   <scrp:header name="User-Agent" value="mozilla/1.2"/>
   <scrp:header name="Referer">
       http://localhost:8080/scrape-examples/scrape.jsp
   </scrp:header>
   <scrp:url>http://finance.yahoo.com/q?s=SUNW</scrp:url>
   <scrp:scrape id="qt" begin="<table border=1" end="</table>" anchors="true"/>
</scrp:page>
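
For readers who want to see what setting request headers looks like at the Java level, the sketch below applies the same two headers to a java.net.URLConnection. It illustrates the idea only and is not how the tag library is known to be implemented. The names RequestHeaders and prepare are hypothetical, and trimming whitespace from a value taken from a tag body is an assumption.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URI;
import java.net.URLConnection;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper that applies request headers before a document is fetched. */
final class RequestHeaders {

    /** Creates a connection object with the given headers set; nothing is sent until connect(). */
    static URLConnection prepare(String url, Map<String, String> headers) {
        try {
            URLConnection conn = URI.create(url).toURL().openConnection();
            for (Map.Entry<String, String> header : headers.entrySet()) {
                // Assumption: surrounding whitespace from a tag body is not part of the value.
                conn.setRequestProperty(header.getKey(), header.getValue().trim());
            }
            return conn;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Map<String, String> headers = new LinkedHashMap<>();
        headers.put("User-Agent", "mozilla/1.2");
        headers.put("Referer", "http://localhost:8080/scrape-examples/scrape.jsp");
        URLConnection conn = prepare("http://finance.yahoo.com/q?s=SUNW", headers);
        System.out.println(conn.getRequestProperty("User-Agent"));
    }
}
```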
               
          

scrape Availability: 1.0

Specify the text anchors that mark the beginning and end of the content to be scraped.

Tag Body: JSP
Restrictions: Must be nested within a page tag.

Attributes

Name  Required  Runtime Expression Evaluation  Availability
 id  Yes   No  1.0
 

A unique identifier that distinguishes this scrape from all others. Each scrape is unique and accessible only by this id.

 begin  Yes   No  1.0
 

The text anchor that marks the beginning of the content to be scraped from the document.

 end  Yes   No  1.0
 

The text anchor that marks the end of the content to be scraped from the document.

 strip  No   No  1.0
 

If strip is set to true, markup tags (HTML, XML, and so on) are removed from the output of the result tag. That is, nothing enclosed in < > is included in the scrape result. The default value is false. Note that strip can be used in conjunction with the anchors attribute.

 anchors  No   No  1.0
 

If anchors is set to true, the begin and end text anchors are included in the scrape result. The default value is false. Note that anchors can be used in conjunction with the strip attribute.

Variables

Name  Scope  Availability
id attribute value  Start of tag to end of page  1.0

Name used to retrieve the scrape later in the page.

Properties: None
Example: Set a scrape on a page with anchors included. Note that the page tag is first and the scrape tag is nested.

<scrp:page url="http://finance.yahoo.com/q?s=SUNW" time="20">
   <scrp:scrape id="qt" begin="<table border=1" end="</table>" anchors="true" />
</scrp:page>

          

Example: Set a scrape on a page with results set to have no tags.

<scrp:page url="http://finance.yahoo.com/q?s=SUNW" time="20">
   <scrp:scrape id="qt" begin="<table border=1" end="</table>" strip="true"/>
</scrp:page>
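
The interaction of the begin, end, anchors, and strip attributes can be sketched in a few lines of Java. This is an illustration under stated assumptions, not the tag library's implementation: it uses plain string search rather than the jakarta-oro package, the names AnchorScrape and extract are hypothetical, and both the choice of the first matching anchors and the stripping of tags after the anchors are applied are assumptions.

```java
/** Hypothetical anchor-based extraction modeled on the scrape tag's attributes. */
final class AnchorScrape {

    /**
     * Returns the text between the first occurrence of begin and the next occurrence of end,
     * or null if either anchor is missing.
     *
     * @param anchors if true, the begin and end anchors are included in the result
     * @param strip   if true, everything enclosed in < > is removed from the result
     */
    static String extract(String document, String begin, String end, boolean anchors, boolean strip) {
        int beginIndex = document.indexOf(begin);
        if (beginIndex < 0) {
            return null;
        }
        int endIndex = document.indexOf(end, beginIndex + begin.length());
        if (endIndex < 0) {
            return null;
        }
        String result = anchors
                ? document.substring(beginIndex, endIndex + end.length())
                : document.substring(beginIndex + begin.length(), endIndex);
        if (strip) {
            // Remove anything that looks like a tag: a '<', any non-'>' characters, then '>'.
            result = result.replaceAll("<[^>]*>", "");
        }
        return result;
    }

    public static void main(String[] args) {
        String document = "<html><table border=1><tr><td>42.50</td></tr></table></html>";
        System.out.println(extract(document, "<table border=1", "</table>", true, false));
        System.out.println(extract(document, "<table border=1", "</table>", true, true));
    }
}
```

Note that when a begin anchor ends in the middle of a tag, as "<table border=1" does, setting anchors to true keeps the tag intact so that strip can remove it cleanly.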

	  

result Availability: 1.0

Retrieve the content from a scrape.

Tag Body: Empty
Restrictions: None

Attributes

Name  Required  Runtime Expression Evaluation  Availability
 scrape  Yes   No  1.0
 

The id of a previously performed scrape whose results you would like to retrieve.

Variables: None

Example: Get the results of a previously performed scrape.

<scrp:result scrape="qt"/>
               
          

Examples

See the example application scrape-examples.war for examples of the usage of the tags from this custom tag library.

Java Docs

Java programmers can view the Java class documentation for this tag library as Javadocs.

Revision History

Review the complete revision history of this tag library.