The solution was developed as two Perl scripts:
1) Search Engine Parser, which parsed the result pages of a search engine and extracted the links to websites. Its configuration file could be customized for any search engine of the Client’s choice (including country-specific search engines).
2) Spider Component, which crawled the websites from that list, recursively following pages and extracting e-mail addresses.
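The two extraction steps described above (links from a page, e-mail addresses from a page) can be illustrated with a minimal sketch. The original scripts were written in Perl and are not shown here; the Python below is only an illustration under stated assumptions. The names `extract_links`, `extract_emails` and `crawl`, the `fetch` callback, the `max_pages` limit and the same-host restriction are all hypothetical, and the e-mail pattern is deliberately simple. The crawl uses a queue rather than literal recursion.

```python
import re
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

# Deliberately simple pattern (assumption); real address syntax is broader.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


class LinkCollector(HTMLParser):
    """Collects absolute URLs from the href attributes of <a> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the page's own URL.
                self.links.append(urljoin(self.base_url, value))


def extract_links(html, base_url):
    """Return the links found in html, in document order."""
    parser = LinkCollector(base_url)
    parser.feed(html)
    return parser.links


def extract_emails(html):
    """Return the distinct e-mail addresses found in html, sorted."""
    return sorted(set(EMAIL_RE.findall(html)))


def crawl(start_url, fetch, max_pages=50):
    """Breadth-first crawl of one site; fetch(url) must return the page HTML.

    Only links on the same host as start_url are followed (assumption).
    """
    host = urlparse(start_url).netloc
    queue = deque([start_url])
    seen = {start_url}
    emails = set()
    pages = 0
    while queue and pages < max_pages:
        url = queue.popleft()
        html = fetch(url)
        pages += 1
        emails.update(extract_emails(html))
        for link in extract_links(html, url):
            if urlparse(link).netloc == host and link not in seen:
                seen.add(link)
                queue.append(link)
    return sorted(emails)
```

Passing `fetch` in as a parameter keeps the sketch independent of any particular HTTP library; a real spider would also need timeouts, error handling and politeness rules, which are omitted here.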
Benefits: The spider software made it possible to crawl 100,000 pages per week. The solution mined the Web quickly and without charges for incoming traffic. With a customized configuration, the Search Engine Parser could be tuned for almost any search engine that provided search result listings.
Development time: 30 hours.
Web Spider for publishing control
Problem: The Client’s company wrote press releases for its customers in a specific industry and issued those press releases to many agencies. The Client needed custom web spider software to monitor where the press releases were published. The spider had to crawl a defined set of websites and report which sites contained specific press releases.
Solution: The solution was developed to the Client’s requirements. The spider crawled the websites and checked which press releases were published on which sites. Starting from a predefined URL, it crawled recursively and searched the pages for matching text fragments (chosen by the customer from the full text of the press release). When the spider found a match, it reported it and displayed the location of the matching page in a list. Since the text of a press release might have been slightly reformatted, it also displayed a matching factor (as a percentage). The texts of the press releases were stored in a database. Searches ran daily so that publication could be monitored in a timely manner. The list of starting URLs could be customized, so new search locations could easily be added to the database. For each press release, the system maintained a list of the locations where it had been found; this list was also stored in the database and could be accessed.
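How the matching factor was actually computed is not described. As a hedged illustration only, one way to score a slightly reformatted fragment is to compare it word by word against every window of the page text that has the same length and keep the best similarity ratio. The function name `matching_factor`, the word-level tokenization and the use of Python's `difflib` are assumptions made for this sketch, not the original Perl implementation.

```python
import re
from difflib import SequenceMatcher


def matching_factor(fragment, page_text):
    """Return the best word-level similarity (0.0 to 100.0) between the
    fragment and any equally long window of words in the page text."""
    # Lowercase and keep only word characters, so that changes in case,
    # whitespace and punctuation do not count as differences.
    frag_words = re.findall(r"\w+", fragment.lower())
    page_words = re.findall(r"\w+", page_text.lower())
    if not frag_words or not page_words:
        return 0.0

    window = len(frag_words)
    last_start = max(len(page_words) - window, 0)
    best = 0.0
    for start in range(last_start + 1):
        candidate = page_words[start:start + window]
        ratio = SequenceMatcher(None, frag_words, candidate, autojunk=False).ratio()
        best = max(best, ratio)
    return round(best * 100, 1)
```

An unchanged fragment scores 100.0 even if the page alters line breaks or capitalization, while inserted or dropped words lower the score. The sliding window compares every position, so the cost grows with page length times fragment length; that is acceptable for a sketch but would need optimizing for large pages.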
For stability and unattended operation, the system was implemented on a web server as a collection of Perl scripts with a web interface. The spiders ran automatically in several threads to improve overall performance.
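Running several spiders in parallel can be sketched as follows. This is not the original Perl code: it is a minimal Python illustration using a thread pool, and the names `check_sites`, `check_site` and `max_workers` are hypothetical. A failure on one site is recorded and does not stop the others, which is one plausible way to keep an unattended daily run going.

```python
from concurrent.futures import ThreadPoolExecutor


def check_sites(start_urls, check_site, max_workers=4):
    """Run check_site(url) for every starting URL in a pool of threads.

    Returns a dict mapping each URL to its result, or to the exception
    that check_site raised for that URL.
    """
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Submit all sites first so that they are processed concurrently.
        futures = {url: pool.submit(check_site, url) for url in start_urls}
        for url, future in futures.items():
            try:
                results[url] = future.result()
            except Exception as exc:
                # Record the failure and carry on with the remaining sites.
                results[url] = exc
    return results
```

Threads suit this workload because a spider spends most of its time waiting on network responses rather than computing.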
Technologies used: The web spider solution was implemented in Perl and used a MySQL database. It could be set up on either Linux or Windows.
Development time: 73 hours.