Custom Web mining software
Case Studies

E-mail Spider

Problem:
The Client needed web spider software that could mine predefined web pages and extract e-mail addresses from them.

Solution:
Softomate provided the Client with a fully functional solution that mined web pages taken from the result listings of search engines, based on specified sets of keywords (country-specific engines such as fireball.de and virgilio.it were also supported).
The e-mail extraction solution crawled the Web quickly because it was a collection of Perl scripts that could be installed on a hosting provider's server, where it used connection speeds close to OC-3 capacity. This made it possible to spider 100,000 pages per week. The Client did not have to pay for incoming traffic.

 


The solution was developed as two Perl scripts:

1) Search Engine Parser, which parsed a search engine’s result pages and extracted the links to websites. Its configuration file could be customized for any search engine of the Client’s choice (including country-specific search engines).
2) Spider Component, which mined the websites from that list, recursively following pages and extracting e-mail addresses.
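
The two steps above can be sketched roughly as follows. This is only an illustration in Python, not the original Perl scripts: the `ENGINE_CONFIG` dictionary, the `search.example` host, and all function names are assumptions, since the case study does not describe the real configuration file or code. The sketch works on HTML that has already been downloaded and performs no network access.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

# Hypothetical per-engine configuration. The original configuration file
# format is not described, so this dict is only an illustration.
ENGINE_CONFIG = {
    "base_url": "https://search.example/results",
    "skip_hosts": {"search.example"},  # links that stay on the engine itself
}

# Deliberately simple address pattern; real-world address syntax is broader.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+")


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def extract_links(html, base_url):
    """Return absolute http(s) links from html, in page order, without duplicates."""
    collector = LinkCollector()
    collector.feed(html)
    seen, links = set(), []
    for href in collector.hrefs:
        url = urljoin(base_url, href)
        if urlparse(url).scheme in ("http", "https") and url not in seen:
            seen.add(url)
            links.append(url)
    return links


def parse_result_page(html, config):
    """Search Engine Parser step: keep only links that leave the engine's hosts."""
    return [
        url
        for url in extract_links(html, config["base_url"])
        if urlparse(url).hostname not in config["skip_hosts"]
    ]


def extract_emails(text):
    """Spider Component step: return sorted, de-duplicated addresses found in text."""
    return sorted({match.lower() for match in EMAIL_RE.findall(text)})
```

In this sketch, supporting another search engine would mean supplying a different configuration dictionary rather than changing the parsing code, which mirrors the customizable configuration file described above.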

Benefits:
The spider software made it possible to spider 100,000 pages per week. The solution mined the Web quickly and without any charge for incoming traffic.
Through customization, the Search Engine Parser could be tuned for almost any search engine that provided search result listings.

Development time: 30 hours


Web Spider for publishing control

Problem:
The Client’s company wrote press releases for its customers in a specific industry and issued them to many agencies. The Client needed custom web spider software to monitor where those press releases were published. The spider had to crawl a defined set of websites and report which sites carried specific press releases.

Solution:
The solution was developed to the Client’s requirements.
The spider crawled the websites and checked which press releases had been published and on which sites. Starting from a predefined URL, it followed links recursively and searched the pages for matching text fragments (chosen by the customer from the full text of the press release). When it found a match, it reported the match and listed the location of the page. Because the text of a press release might have been slightly reformatted, it also displayed a matching factor (as a percentage).
The texts of the press releases were stored in a database. Searches ran daily so that publication could be monitored in a timely manner.
The list of starting URLs was customizable, so new search locations could easily be added to the database.
For each press release, the system maintained a list of the locations where it had been found. This list was also stored in the database and could be accessed.
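
One way a matching factor of this kind could be computed is sketched below in Python. The case study does not say how the original Perl scripts scored a match, so the approach here (comparing the fragment against every same-length word window of the page with `difflib.SequenceMatcher`) and the names `normalize` and `matching_factor` are assumptions made purely for illustration.

```python
import re
from difflib import SequenceMatcher


def normalize(text):
    """Lower-case the text and collapse whitespace so reformatting matters less."""
    return re.sub(r"\s+", " ", text).strip().lower()


def matching_factor(fragment, page_text):
    """Return the best similarity (0 to 100) between the fragment and any
    window of the page text that has the same number of words."""
    frag_words = normalize(fragment).split()
    page_words = normalize(page_text).split()
    if not frag_words or not page_words:
        return 0.0
    size = len(frag_words)
    target = " ".join(frag_words)
    best = 0.0
    # Slide a window of `size` words across the page and keep the best score.
    for start in range(max(1, len(page_words) - size + 1)):
        window = " ".join(page_words[start:start + size])
        ratio = SequenceMatcher(None, target, window).ratio()
        best = max(best, ratio)
    return round(best * 100, 1)
```

A page that carries the fragment word for word scores 100 regardless of line breaks or letter case, while a lightly edited copy scores somewhat lower. A report would then typically list only pages above a chosen threshold. The window scan is quadratic in the worst case, so this is a sketch for short fragments, not a tuned implementation.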

For stability and unattended operation, the system was implemented on a web server as a collection of Perl scripts with a web interface.
The spiders ran automatically in several threads to improve overall performance.
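
Running the checks for several sites in parallel could look roughly like the Python sketch below. It is not the original Perl implementation: `run_spiders`, its `check_site` callback, and the default of four workers are hypothetical names and values chosen for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def run_spiders(start_urls, check_site, max_workers=4):
    """Run check_site(url) for every start URL on a small thread pool.

    Returns a dict mapping each URL to its result, or to the exception
    that was raised, so one failing site does not stop the others.
    """
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Submit every site first so the checks overlap in time.
        futures = {url: pool.submit(check_site, url) for url in start_urls}
        for url, future in futures.items():
            try:
                results[url] = future.result()
            except Exception as exc:  # keep going if a single site fails
                results[url] = exc
    return results
```

Threads suit this workload because a spider spends most of its time waiting for remote servers, so several sites can be checked while others are still responding.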

Technologies used:
The web spider solution was implemented in Perl and used a MySQL database. It could be set up on either Linux or Windows.
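
As a rough illustration of the data the system kept (press release texts, starting URLs, and the locations where each release was found), a storage layer might be sketched as follows. The table and column names are assumptions, since the real schema is not described, and Python's built-in `sqlite3` module stands in for MySQL so that the example is self-contained.

```python
import sqlite3

# Hypothetical schema; the original MySQL tables are not documented.
SCHEMA = """
CREATE TABLE IF NOT EXISTS press_release (
    id INTEGER PRIMARY KEY,
    title TEXT NOT NULL,
    fragment TEXT NOT NULL            -- text fragment chosen by the customer
);
CREATE TABLE IF NOT EXISTS start_url (
    id INTEGER PRIMARY KEY,
    url TEXT NOT NULL UNIQUE          -- where a crawl begins
);
CREATE TABLE IF NOT EXISTS finding (
    press_release_id INTEGER NOT NULL REFERENCES press_release(id),
    page_url TEXT NOT NULL,
    matching_factor REAL NOT NULL,    -- percentage, 0 to 100
    found_on TEXT NOT NULL,           -- ISO date of the daily run
    PRIMARY KEY (press_release_id, page_url)
);
"""


def open_store(path=":memory:"):
    """Open the database and create the tables if they do not exist yet."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def record_finding(conn, press_release_id, page_url, factor, found_on):
    """Insert or update one location where a press release was found."""
    conn.execute(
        "INSERT OR REPLACE INTO finding VALUES (?, ?, ?, ?)",
        (press_release_id, page_url, factor, found_on),
    )
    conn.commit()


def locations_for(conn, press_release_id):
    """Return (page_url, matching_factor) pairs for one press release."""
    rows = conn.execute(
        "SELECT page_url, matching_factor FROM finding "
        "WHERE press_release_id = ? ORDER BY page_url",
        (press_release_id,),
    )
    return rows.fetchall()
```

Keying each finding on the press release and page URL means a daily re-run updates the existing row instead of adding a duplicate.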

Development time: 73 hours.



US Office
104 6th Street, Unit B
Lynden, Washington 98264
USA
email: info@softomate.com
tel +1-877-2438735
fax +1-801-4578820

Australian Office
5/11 Sully Street
Randwick, NSW 2031
Sydney, Australia
email: oz@softomate.com
tel +61-2-93985845
fax +61-2-85707027

Russian Office
Nemirovicha Danchenko Street 122, 630087
Novosibirsk,
Russia
email: ru@softomate.com
tel +7-383-3462806
fax +7-383-3462806