The aim of the project is to create a search engine that indexes hackerspace-related websites in Belgium (and maybe the Benelux, Europe, or the whole world later).
I have chosen YaCy because I am biased. I am currently customizing everything in order to have an up-to-date index of the HSBXL wiki as a start.
Please feel free to request features and make suggestions! (I don't promise anything though.)
- install YaCy (done: http://4o4.dyndns.org:8081)
- write a script which reads the changelog from the wiki's RSS feed and adds changed pages to the crawler (done, quick and dirty, but works)
- define filters to ignore unwanted pages (done)
- create a stylesheet which matches the HSBXL wiki (tbd)
- compile a list of sites which should be included (tbd)
- update Ismael.pm to work with current versions of YaCy (tbd)
- move search engine to server @HSBXL once everything is working???
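
The RSS step in the list above could look roughly like the following. This is only a minimal sketch, not the actual quick-and-dirty script: it assumes the changelog is a plain RSS 2.0 feed whose `<item>` entries carry a `<link>` element, and the function name `changed_page_urls` and the `wiki.example.org` sample feed are made up for illustration. Handing the URLs to the YaCy crawler is deliberately left out, because that part depends on the YaCy version and setup.

```python
import xml.etree.ElementTree as ET


def changed_page_urls(rss_xml: str) -> list[str]:
    """Return the unique <link> URLs of all <item> entries, in feed order."""
    root = ET.fromstring(rss_xml)
    seen = set()
    urls = []
    for item in root.iter("item"):
        link = (item.findtext("link") or "").strip()
        # A page edited several times appears several times in the feed;
        # it only needs to be re-crawled once.
        if link and link not in seen:
            seen.add(link)
            urls.append(link)
    return urls


# Hypothetical sample feed, only here to make the sketch self-contained.
SAMPLE_FEED = """<rss version="2.0"><channel>
<title>Recent changes</title>
<link>https://wiki.example.org/</link>
<item><title>Page A</title><link>https://wiki.example.org/Page_A</link></item>
<item><title>Page B</title><link>https://wiki.example.org/Page_B</link></item>
<item><title>Page A</title><link>https://wiki.example.org/Page_A</link></item>
</channel></rss>"""

if __name__ == "__main__":
    for url in changed_page_urls(SAMPLE_FEED):
        print(url)
```

In a real run the feed text would be downloaded first, and each returned URL would be submitted to the crawler instead of printed.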
Sites which should be indexed
- hackerspace.be and subdomains
(remark from ptr_: stick to hackerspace.be + subdomains)
- Does YaCy support caching the content?
- It is possible to cache contents, but there is no decent mechanism (yet) to retrieve them and use the cache the way Google Cache works. Low012
- Why require Google Cache? You don't need Google Cache if you cache it yourself.
- Does it index all the versions of a page?
- Not with the current settings, but it would be possible in principle. The result page might become pretty cluttered by similar results though. It would indeed be nice to be able to search old contents as well. It is possible to include a URL filter when you send a search request. This way it would be possible to have a regular search (which filters out all URLs that contain a question mark and so only shows pages that have been crawled using their "pretty" names) and a search without restrictions. I'll think about it. Low012
- Is it possible to crawl the webpages we link to? Those pages are likely to contain content related to HSB.
- Yes, in fact I have done that. I have crawled hackerspace.be with a certain depth (I forgot the exact value, probably 3) and I also set the crawler to only index domains which have a direct link from hackerspace.be. If you try searching for a pretty common word, you will probably get results from domains which have a direct link from this wiki. Low012
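
Two of the filters mentioned in the answers above can be illustrated in a few lines: the "pretty names only" URL filter (no question mark in the URL) and the restriction to domains that are linked directly from the start site. This is a sketch of the logic only, not YaCy configuration. All function names and the example URLs are made up for illustration, and how such filters are actually passed to YaCy is not shown here.

```python
import re
from urllib.parse import urlparse

# A "pretty" URL is one without a question mark anywhere in it.
PRETTY_URL = re.compile(r"[^?]*")


def is_pretty_url(url: str) -> bool:
    return PRETTY_URL.fullmatch(url) is not None


def regular_search(result_urls: list[str]) -> list[str]:
    """Keep only results that were crawled under their pretty names."""
    return [u for u in result_urls if is_pretty_url(u)]


def directly_linked_domains(links_on_start_site: list[str]) -> set[str]:
    """Collect the host names of all links found on the start site."""
    hosts = (urlparse(link).hostname for link in links_on_start_site)
    # Links without a host (e.g. mailto:) yield None and are skipped.
    return {h for h in hosts if h}


def in_crawl_scope(url: str, allowed_domains: set[str]) -> bool:
    return urlparse(url).hostname in allowed_domains


if __name__ == "__main__":
    # Hypothetical search results: one pretty URL, one old revision.
    results = [
        "https://hackerspace.be/YaCy",
        "https://hackerspace.be/index.php?title=YaCy&oldid=123",
    ]
    print(regular_search(results))

    # Hypothetical links found on the start site.
    allowed = directly_linked_domains(
        ["https://yacy.net/", "https://example.org/page", "mailto:someone@example.org"]
    )
    print(in_crawl_scope("https://example.org/other", allowed))
```

The unrestricted search would simply skip `regular_search`, which is where old revisions and other URLs with query strings would show up.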