Web Crawlers in Java

October 24th, 2007

Filed under: Programming — Unggul_USA @ 9:02 am — View blog reactions


  • Arachnid - Arachnid is a Java-based web spider framework. It includes a simple HTML parser object that parses an input stream containing HTML content. Simple Web spiders can be created by sub-classing Arachnid and adding a few lines of code called after each page of a Web site is parsed. Two example spider applications are included to illustrate how to use the framework.
  • Arale - While many bots are focused on page indexing, Arale is primarily designed for personal use. It fits the needs of advanced web surfers and web developers. Some real-life cases are: downloading only images, videos, mp3 or zip files from a site; manuals, articles and ebooks fragmented into many files to discourage download; and user-unfriendly sites with popups, banners and tricky scripts that annoy you before you can download a resource.
  • Grunk - Grunk (for GRammar UNderstanding Kernel) is a library for
    parsing and extracting structured metadata from semi-structured text formats. It
    is based on a very flexible parsing engine capable of detecting a wide variety
    of patterns in text formats and extracting information from them. Formats are
    described in a simple and powerful XML configuration from which Grunk builds a
    parser at runtime, so adapting Grunk to a new format does not require a coding
    or compilation step.
    Grunk features:

    • Powerful two-step parser with pattern-matching based on Perl5 regular
      expressions
    • Inline transformations making it possible to parse otherwise tricky
      syntaxes
    • XML-based configuration
    • Support for XML output
    • Flexible API
  • Heritrix - Heritrix is the Internet Archive's open-source, extensible, web-scale, archival-quality web crawler project.
    Heritrix (sometimes spelled heretrix, or misspelled or missaid as heratrix/heritix/heretix/heratix) is an archaic word for heiress (a woman who inherits). Since our crawler seeks to collect and preserve the digital artifacts of our culture for the benefit of future researchers and generations, this name seemed apt.
    It is designed to respect the robots.txt exclusion directives and META robots tags.
  • HyperSpider - HyperSpider is a Java application that collects the link structure of a website by following its hyperlinks. Data can be imported from and exported to a database and CSV files. Export formats include Graphviz DOT, Resource Description Framework (RDF/DC), XML Topic Maps (XTM), Prolog and HTML, and the structure can be visualized as a hierarchy or as a map. The range of supported export formats is what makes this project unique, especially RDF and XTM, which allow the data to be imported into forthcoming visualization and analysis tools.
  • J-Spider - A Java implementation of a flexible and extensible web spider engine. Optional modules allow functionality to be added (searching for dead links, testing the performance and scalability of a site, creating a sitemap, etc.).
  • LARM - LARM is a 100% Java search solution for end-users of the Jakarta Lucene search engine framework. It contains methods for indexing files and database tables, and a crawler for indexing web sites. Well, it will be. At the moment we only have some specifications. It's up to you to turn this into a working program. Its predecessor was an experimental crawler called larm-webcrawler, available from the Jakarta project. Some people joined to leverage LARM on a higher level and wrote down some ideas. This resulted in a new project currently hosted on SourceForge.
  • Metis - Metis is a tool to collect information from the content of web sites. This was written for the Ideahamster Group for finding the competitive intelligence weight of a web server and assists in satisfying the CI Scouting portion of the Open Source Security Testing Methodology Manual (OSSTMM). The tool is distributed under the GNU Public license.
    The tool is written in Java and is composed of two packages:
    The web spider engine: the faust.sacha.web Java package.
    This package handles the web spidering process and collects and stores the information in memory.
    The data analysis part: the Metis org.idehamster.metis Java package.
    This package reads the data collected by the spider and generates a report.
  • Nutch - Nutch is open source web-search software. It builds on Lucene Java, adding web-specifics, such as a crawler, a link-graph database, parsers for HTML and other document formats, etc.
  • Spider - Spider is a complete standalone Java application designed to easily integrate varied datasources.
    XML driven framework for data retrieval from network accessible sources
    Scheduled pulling
    Highly extensible
    Provides hooks for custom post-processing and configuration
    Implemented as an Avalon/Keel framework datafeed service
    Included Core Connectors:
    Files and Zip Archives via HTTP/FTP/HTTPS/FileSystem
    Supports access via links described as literals or regular expressions
    Supports sessions/cookies/form parameters
    Included Optional Connectors:
    Axis (SOAP webservices)
  • Spindle - Spindle is a web indexing/search tool built on top of the Lucene toolkit. It includes an HTTP spider that is used to build the index, and a search class that is used to search the index. In addition, support is provided for the Bitmechanic listlib JSP TagLib, so that a search can be added to a JSP-based site without writing any Java classes. This library is released free of charge with source code included under the terms of the GPL. See the LICENSE file for details.
  • WebLech - WebLech is a fully featured web site download/mirror tool in Java, which supports many features required to download websites and emulate standard web-browser behaviour as much as possible. WebLech is multithreaded and will feature a GUI console.
  • WebSPHINX - WebSPHINX (Website-Specific Processors for HTML INformation eXtraction) is a Java class library and interactive development environment for web crawlers. A web crawler (also called a robot or spider) is a program that browses and processes Web pages automatically.
    WebSPHINX consists of two parts: the Crawler Workbench and the WebSPHINX class library.
    The Crawler Workbench is a graphical user interface that lets you configure and control a customizable web crawler.
    The WebSPHINX class library provides support for writing web crawlers in Java.
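
What most of the spiders above have in common is a simple loop: load a page, pull out its links, and queue the ones not seen yet. Here is a minimal sketch of that loop in plain Java. It is not the API of any project listed above: the class name MiniCrawler and the methods extractLinks and crawl are made up for illustration, the fetcher is assumed to be any function from a URL to its HTML (an in-memory map stands in for HTTP here), and the regular expression is a rough stand-in for a real HTML parser. It also does not check robots.txt, which a real crawler should.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical minimal crawler, for illustration only. */
class MiniCrawler {
    // Naive matcher for <a href="..."> and <a href='...'>; not a full HTML parser.
    private static final Pattern HREF = Pattern.compile(
            "<a\\s[^>]*?href\\s*=\\s*[\"']([^\"']+)[\"']",
            Pattern.CASE_INSENSITIVE);

    /** Returns the href targets found in the page, in document order. */
    public static List<String> extractLinks(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            links.add(m.group(1));
        }
        return links;
    }

    /**
     * Breadth-first crawl starting at 'start'.
     * 'fetcher' maps a URL to its HTML, or to null if the page cannot be loaded.
     * Returns each visited URL mapped to its outgoing links, in visit order.
     */
    public static Map<String, List<String>> crawl(
            String start, Function<String, String> fetcher, int maxPages) {
        Map<String, List<String>> graph = new LinkedHashMap<>();
        Deque<String> queue = new ArrayDeque<>();
        queue.add(start);
        while (!queue.isEmpty() && graph.size() < maxPages) {
            String url = queue.poll();
            if (graph.containsKey(url)) {
                continue; // already visited
            }
            String html = fetcher.apply(url);
            List<String> links =
                    (html == null) ? new ArrayList<>() : extractLinks(html);
            graph.put(url, links);
            for (String link : links) {
                if (!graph.containsKey(link)) {
                    queue.add(link);
                }
            }
        }
        return graph;
    }

    public static void main(String[] args) {
        // A three-page in-memory "site" standing in for real HTTP requests.
        Map<String, String> site = Map.of(
                "/index", "<a href=\"/a\">A</a> <a href='/b'>B</a>",
                "/a", "<a href=\"/index\">home</a>",
                "/b", "no links here");
        crawl("/index", site::get, 10)
                .forEach((page, links) -> System.out.println(page + " -> " + links));
    }
}
```

The map returned by crawl is the kind of link structure a tool like HyperSpider exports, and the pages for which the fetcher returns null are the dead links a J-Spider module would report. Swapping the in-memory map for a real HTTP client only changes the fetcher; the loop stays the same.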


Email Tips : Chain Email : Security #2

October 10th, 2007

Filed under: XtraPost — Unggul_USA @ 4:03 pm — View blog reactions


A typical chain letter consists of a message that attempts to induce the recipient to make a number of copies of the letter and then pass them on to one or more new recipients.

“Chain email” means any email that suggests to the recipient that he forward it to “all your friends and relatives” or anything similar, thus forming a chain between the author of the email and each recipient.

How Should I Respond?

  • Please do not forward chain email to anyone else.
  • Reply to the sender (if you know them) without including the contents of the original e-mail and politely ask them not to send you any more. If you do not know the sender, ignore the e-mail and report it as spam.

Why Do Chain Emails Happen?

  • Keep in mind that people who initiate such emails, whether willfully perpetrating a scam or simply overreacting to some bit of news, usually don’t have the credibility to convince lots of people to take some action. So, they try hard to gain credibility for their message by encouraging everyone they can to “forward” it for them.

However, please remember this.

  • No chain e-mails are legitimate; credible companies do not conduct their marketing in such a haphazard fashion. Chain e-mails cannot bring you fortune or cause bad luck. They will not make you rich, and you will never get that luxury holiday. They are lies: at best mischievous, at worst (like virus hoaxes) designed to cause worry and disruption.
