This writeup is based on my adventures in getting Apache Solr working for the first time. Solr is a wrapper over Lucene, so to get Solr working you need some basic knowledge of Lucene. This tutorial helped me understand the basics of Lucene.
I started by following the starter's guide on the Solr site. I downloaded the nightly build and unzipped it; the Solr war file is in the dist folder. I wanted to host Solr in resin, whereas the tutorial has steps for doing it in Jetty.
I did the following to start Solr in resin:
- Put the Solr war file in resin webapps folder.
- Renamed the war file to solr.war.
- Created a folder called solr and copied the bin and conf folders from the Solr package into it. This became my Solr home.
- When you fire up Solr, you need to tell it where to find its home folder. So I added the following to the resin.conf file inside the host element.
<system-property solr.solr.home="E:\software\solr"/>
The host element looked as below:
<host id='*'>
  <root-directory>./solr</root-directory>
  <web-app id="solr/">
    <document-directory>webapps/solr</document-directory>
  </web-app>
  <system-property solr.solr.home="E:\software\solr"/>
</host>
- The tutorial assumes that your Solr is listening on port 8983. I changed mine to 8080 in the resin.conf file.
<http server-id="" host="*" port="8080"/>
- Bounced resin and entered the URL http://localhost:8080/solr/ in Mozilla to view the Solr welcome page.
Wow…first hurdle passed.
In case you do not want the "solr" string in your URI (i.e. no context path), change the web-app id in the resin.conf file from <web-app id="solr/"> to <web-app id="/"> and bounce resin. Now you do not have to type solr in all your URLs; http://localhost:8080/ should get you the Solr welcome page.
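With that change the web-app element inside the host block simply becomes (the rest of the host element stays the same):

<web-app id="/">
  <document-directory>webapps/solr</document-directory>
</web-app>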
I already had one application running in my resin and I did not want to change that. So I set up resin to host two applications as separate virtual hosts, by adding the following entry to the Windows hosts file present in <windows folder>\system32\drivers\etc, so that solr.com resolves to my local machine:
127.0.0.1 solr.com
My resin.conf file looked as below:
<host id='mumbai.brp.com'>
  <root-directory>.</root-directory>
  <web-app id="/">
    <document-directory>webapps/burrp</document-directory>
  </web-app>
</host>

<host id='solr.com'>
  <root-directory>./solr</root-directory>
  <web-app id="solr/">
    <document-directory>webapps/solr</document-directory>
  </web-app>
  <system-property solr.solr.home="E:\software\solr"/>
</host>
To get the above working, you have to create a directory called solr (containing a webapps folder) in your resin home folder, and place solr.war in that webapps folder, roughly as shown below.
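Roughly, the layout ends up like this (the resin home path is whatever yours is; E:\software\solr is my Solr home from above):

<resin home>\
    webapps\
        burrp\              (existing application)
    solr\
        webapps\
            solr.war        (webapps folder for the second virtual host)

E:\software\solr\           (solr.solr.home)
    bin\
    conf\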
Now comes the part where I need to provide solr with Lucene documents for it to index.
The tutorial has steps for uploading Lucene documents in XML format using a command line tool (post.jar) that comes along with the download. But this tries to upload the document to http://localhost:8983/solr/update, which I could not use as I had changed the Solr URL as well as the port. I tried to find the source of post.jar so that I could make the appropriate changes to it, but could not get hold of it. Since Solr talks plain HTTP, I tried cURL instead. This was my first experience with cURL as well.
My first goal was to index the data in the monitor.xml file from the Solr example docs folder. I tried various cURL options for uploading a file by name; the one I found used HTTP PUT rather than a multipart form file transfer, and it did not work as Solr does not support HTTP PUT. I got some pointers on how to do it here, and tried the curl command cited in the link with the contents of monitor.xml and my own Solr URL.
curl http://solr.com:8080/solr/update -H "Content-Type: text/xml" --data-binary '<contents of monitor.xml>'
I got the following error message.
"< was unexpected at this time."
After some experimenting I got the above working: I replaced the double quotes inside monitor.xml with single quotes so that the whole payload could be wrapped in double quotes (the Windows command prompt does not treat single quotes as quoting, which is why the < was being misinterpreted).
curl http://solr.com:8080/solr/update -H "Content-Type: text/xml" --data-binary "<contents of monitor.xml with single quotes>"
Once you do this, you have to commit the index changes. I did this with the command below:
curl http://solr.com:8080/solr/update -H "Content-Type: text/xml" --data-binary "<commit waitFlush='false' waitSearcher='false'/>"
Yippee…Now I could search the contents of monitor.xml from the admin search interface.
Now I wanted to upload my own data for indexing. But my XML data was too big to paste into the command prompt, and I had not figured out a way to upload a file by name through cURL. So I whipped up an HTML page with a file upload form, with the action set to my Solr update URL.
<html>
  <head></head>
  <body>
    <form action="http://solr.com:8080/solr/update" enctype="multipart/form-data" method="post">
      <input type="file" name="file">
      <input type="submit" value="Send">
    </form>
  </body>
</html>
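(In hindsight, cURL can also read the request body straight from a file with the @ prefix, which would have avoided both the pasting and the quoting trouble. Something along these lines should do the same job; mydata.xml here is just a placeholder for whatever file holds your <add> document.)

curl http://solr.com:8080/solr/update -H "Content-Type: text/xml" --data-binary @mydata.xml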
I configured Solr for auto commit by changing the options in solrconfig.xml, present in the conf folder under the Solr home. The section comes commented out in the Solr download.
<autoCommit>
  <maxDocs>10</maxDocs>
</autoCommit>
I uploaded a document and tried searching for it, but the query did not return any results. After lots of fidgeting around the net I came to know that you have to declare the fields present in the document you upload in the schema.xml file, present in the conf folder under the Solr home.
My XML document looked as below:
<add>
  <doc>
    <field name="name">foo</field>
    <field name="category">foocat</field>
    <field name="id">0</field>
  </doc>
  <doc>
    <field name="name">bar</field>
    <field name="category">barCat</field>
    <field name="id">1</field>
  </doc>
  ...
</add>
So I added the following under the fields element in schema.xml:
<field name="id" type="string" indexed="true" stored="true" required="true" /> <field name="name" type="string" indexed="true" stored="true" required="true" /> <field name="category" type="string" indexed="true" stored="true" required="true" />
After making these changes you have to bounce resin, otherwise they will not be picked up. Still my searches were returning blanks. More fidgeting around and I got to know that the fields have to be added as copyField elements too: the admin search box queries the default search field, which in the example schema is text, so each field's contents need to be copied into text for them to be found that way.
Added the below to schema.xml
<copyField source="id" dest="text"/> <copyField source="category" dest="text"/> <copyField source="name" dest="text"/>
Bounced resin. Yippeeee. My searches started working.
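For the curious, you can also fire queries straight from the browser instead of the admin page (the select handler and the q parameter are standard Solr; the host and port are just mine):

http://solr.com:8080/solr/select?q=foo

This should bring back the doc with name foo.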
One day of adventure with Lucene and Solr comes to an end. Will keep you guys posted as I tread deeper into Apache Lucene and Solr.
Source for the simple post tool in post.jar …
http://svn.apache.org/viewvc/lucene/solr/trunk/src/java/org/apache/solr/util/SimplePostTool.java
But you can just run it with the -help option to see how to change the URL…
$ java -jar post.jar -help
SimplePostTool: version 1.2
This is a simple command line tool for POSTing raw XML to a Solr
port. XML data can be read from files specified as commandline
args; as raw commandline arg strings; or via STDIN.
Examples:
java -Ddata=files -jar post.jar *.xml
java -Ddata=args -jar post.jar '42'
java -Ddata=stdin -jar post.jar < hd.xml
Other options controlled by System Properties include the Solr
URL to POST to, and whether a commit should be executed. These
are the defaults for all System Properties…
-Ddata=files
-Durl=http://localhost:8983/solr/update
-Dcommit=yes
…keep in mind, this is just a simple example tool. If you want to use curl that works too (the post.sh script shows how to do that and specify the filename on the command line), or you can use any other tool, app, library, whathaveyou that knows how to do an HTTP POST.
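For example, to point it at the URL used in this post (with the other properties left at their defaults):

java -Durl=http://solr.com:8080/solr/update -jar post.jar monitor.xml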
I am getting the following error when I try to submit an XML file from the UI that you suggested.
message: missing content stream
description: The request sent by the client was syntactically incorrect (missing content stream).
I checked the XML and it was perfectly fine. Please suggest if I am missing something.
Hi Mady,
The post is based on my playing with Solr for a couple of hours out of sheer curiosity. My work is not based on Solr, nor do I at present have the setup on my box to try anything, so I am sorry to say I will not be of much help to you.
But from the message you posted it looks like a problem with the XML file. Try the Solr mailing list. It was pretty active the last time I checked.
Does anyone have a Solr newbie writeup with an Oracle table as the data source using DIH? In the XML response to my data import, I get "Total Rows Fetched" as 3690.
You can get more Solr information here:
http://antguider.blogspot.com/2012/06/solr-search.html
I have OpenCMS on one of my servers and use Solr for searching the index files. Each night the OpenCMS index files are updated, but Solr still uses the old index files. How can I configure Solr so that it picks up the new index files shortly after they are updated?