Thrift Client
The following examples use the Thrift API directly. You will need the following libraries at a minimum:
- blur-thrift-*.jar
- blur-util-*.jar
- slf4j-api-1.6.1.jar
- slf4j-log4j12-1.6.1.jar
- commons-logging-1.1.1.jar
- log4j-1.2.15.jar
Note
Other versions of these libraries could work, but these are the versions that Blur currently uses.
Getting A Client Example
Connection String
The connection string can be parsed or constructed through the "Connection" object. If you are using the parsed version there are several options. At a minimum you will have to provide a hostname and port:
host1:40010
You can list multiple hosts:
host1:40010,host2:40010
You can add a SOCKS proxy server for each host:
host1:40010/proxyhost1:6001
You can also add a timeout on the socket of 90 seconds (the default is 60 seconds):
host1:40010/proxyhost1:6001#90000
Multiple hosts with a different timeout:
host1:40010,host2:40010,host3:40010#90000
Here are all the options together:
host1:40010/proxyhost1:6001,host2:40010/proxyhost1:6001#90000
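For example, the full connection string above can be passed directly to the client factory described in the next section. This is a minimal sketch; the hosts, proxy, and timeout values are placeholders:
// Hosts, proxy, and timeout below are placeholder values.
Iface client = BlurClient.getClient("host1:40010/proxyhost1:6001,host2:40010/proxyhost1:6001#90000");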
Thrift Client
Client Example 1:
Iface client = BlurClient.getClient("controller1:40010,controller2:40010");
Client Example 2:
Connection connection = new Connection("controller1:40010");
Iface client = BlurClient.getClient(connection);
Client Example 3:
BlurClientManager.execute("controller1:40010,controller2:40010", new BlurCommand<T>() {
@Override
public T call(Client client) throws BlurException, TException {
// your code here...
}
});
Client Example 4:
List<Connection> connections = BlurClientManager.getConnections("controller1:40010,controller2:40010");
BlurClientManager.execute(connections, new BlurCommand<T>() {
  @Override
  public T call(Client client) throws BlurException, TException {
    // your code here...
  }
});
Query Example
This is a simple example of how to run a query via the Thrift API and get back search results. By default the first 10 results are returned, containing only row ids.
Iface client = BlurClient.getClient("controller1:40010,controller2:40010");
Query query = new Query();
query.setQuery("+docs.body:\"Hadoop is awesome\"");
BlurQuery blurQuery = new BlurQuery();
blurQuery.setQuery(query);
BlurResults results = client.query("table1", blurQuery);
System.out.println("Total Results: " + results.totalResults);
for (BlurResult result : results.getResults()) {
System.out.println(result);
}
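If more than the default 10 results are needed, the start offset and fetch count can be set on the BlurQuery before running the query. A minimal sketch, assuming the Thrift-generated setStart and setFetch setters on BlurQuery:
// Fetch results 100 through 149 instead of the default first 10 (assumed setters).
blurQuery.setStart(100);
blurQuery.setFetch(50);
BlurResults page = client.query("table1", blurQuery);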
Query Example with Data
This is an example of how to run a query via the Thrift API and get back search results with data. All the columns in the "fam0" family are returned for each Record in the Row.
Iface client = BlurClient.getClient("controller1:40010,controller2:40010");
Query query = new Query();
query.setQuery("+docs.body:\"Hadoop is awesome\"");
Selector selector = new Selector();
// This will fetch all the columns in family "fam0".
selector.addToColumnFamiliesToFetch("fam0");
// This will fetch the "col1", "col2" columns in family "fam1".
Set<String> cols = new HashSet<String>();
cols.add("col1");
cols.add("col2");
selector.putToColumnsToFetch("fam1", cols);
BlurQuery blurQuery = new BlurQuery();
blurQuery.setQuery(query);
blurQuery.setSelector(selector);
BlurResults results = client.query("table1", blurQuery);
System.out.println("Total Results: " + results.totalResults);
for (BlurResult result : results.getResults()) {
System.out.println(result);
}
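Because a Selector was supplied, each BlurResult carries the fetched data. Below is a minimal sketch of pulling the Row out of each result; it assumes the Thrift-generated getFetchResult getter on BlurResult, while the row and record accessors mirror the Fetch Data example later in this section:
for (BlurResult result : results.getResults()) {
  // The data selected by the Selector is attached to each result (assumed getter).
  FetchResult fetchResult = result.getFetchResult();
  Row row = fetchResult.getRowResult().getRow();
  for (Record record : row.getRecords()) {
    System.out.println(record);
  }
}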
Query Example with Sorting
This is an example of how to run a query via the Thrift API and get back search results with data being sorted by the "docs.timestamp" column. All the columns in the records will be returned.
Note
Sorting is only allowed on Record queries at this point.
Iface client = BlurClient.getClient("controller1:40010,controller2:40010");
Query query = new Query();
query.setQuery("+docs.body:\"Hadoop is awesome\"");
query.setRowQuery(false);
Selector selector = new Selector();
selector.setRecordOnly(true);
BlurQuery blurQuery = new BlurQuery();
blurQuery.setQuery(query);
blurQuery.setSelector(selector);
blurQuery.addToSortFields(new SortField("docs", "timestamp", true));
BlurResults results = client.query("table1", blurQuery);
System.out.println("Total Results: " + results.totalResults);
for (BlurResult result : results.getResults()) {
System.out.println(result);
}
Faceting Example
This is an example of how to use the faceting feature in a query. This API will likely be updated in a future version.
Iface client = BlurClient.getClient("controller1:40010,controller2:40010");
Query query = new Query();
query.setQuery("+docs.body:\"Hadoop is awesome\"");
final BlurQuery blurQuery = new BlurQuery();
blurQuery.setQuery(query);
// This facet will stop counting once the count has reached 10000. However this is only counted
// on each server, so it is likely you will receive a count larger than your max.
blurQuery.addToFacets(new Facet("fam1.col1:value1 OR fam1.col1:value2", 10000));
blurQuery.addToFacets(new Facet("fam1.col1:value100 AND fam1.col1:value200", Long.MAX_VALUE));
BlurResults results = client.query("table1", blurQuery);
System.out.println("Facet Results:");
List<Long> facetCounts = results.getFacetCounts();
List<Facet> facets = blurQuery.getFacets();
for (int i = 0; i < facets.size(); i++) {
System.out.println("Facet [" + facets.get(i) + "] got [" + facetCounts.get(i) + "]");
}
BlurResults results = client.query("table1", blurQuery);
System.out.println("Total Results: " + results.totalResults);
for (BlurResult result : results.getResults()) {
System.out.println(result);
}
Fetch Data
This is an example of how to fetch data via the Thrift API. All the records of the Row "rowid1" are returned. If it is not found then the Row will be null.
Iface client = BlurClient.getClient("controller1:40010,controller2:40010");
Selector selector = new Selector();
selector.setRowId("rowid1");
FetchResult fetchRow = client.fetchRow("table1", selector);
FetchRowResult rowResult = fetchRow.getRowResult();
Row row = rowResult.getRow();
for (Record record : row.getRecords()) {
System.out.println(record);
}
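A single Record can be fetched in a similar way by also setting the record id on the Selector. This is a minimal sketch; it assumes the Selector's setRecordId setter and the FetchResult's getRecordResult getter, mirroring the row fetch above:
Selector recordSelector = new Selector();
recordSelector.setRowId("rowid1");
recordSelector.setRecordId("recordid1"); // assumed setter for fetching a single record
recordSelector.setRecordOnly(true);
FetchResult fetchRecord = client.fetchRow("table1", recordSelector);
// For record-only fetches the record result is populated instead of the row result (assumed getter).
Record record = fetchRecord.getRecordResult().getRecord();
System.out.println(record);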
Mutate Example
This is an example of how to perform a mutate on a table and either add or replace an existing Row.
Iface client = BlurClient.getClient("controller1:40010,controller2:40010");
Record record1 = new Record();
record1.setRecordId("recordid1");
record1.setFamily("fam0");
record1.addToColumns(new Column("col0", "val0"));
record1.addToColumns(new Column("col1", "val1"));
Record record2 = new Record();
record2.setRecordId("recordid2");
record2.setFamily("fam1");
record2.addToColumns(new Column("col4", "val4"));
record2.addToColumns(new Column("col5", "val5"));
List<RecordMutation> recordMutations = new ArrayList<RecordMutation>();
recordMutations.add(new RecordMutation(RecordMutationType.REPLACE_ENTIRE_RECORD, record1));
recordMutations.add(new RecordMutation(RecordMutationType.REPLACE_ENTIRE_RECORD, record2));
// This will replace the existing Row of "rowid1" (if one exists) in table "table1". It will
// write the mutation to the write ahead log (WAL) and it will not block waiting for the
// mutation to become visible.
RowMutation mutation = new RowMutation("table1", "rowid1", true, RowMutationType.REPLACE_ROW,
recordMutations, false);
mutation.setRecordMutations(recordMutations);
client.mutate(mutation);
Shortened Mutate Example
This is the same example as above but is shortened with a helper class.
import static org.apache.blur.thrift.util.BlurThriftHelper.*;
Iface client = BlurClient.getClient("controller1:40010,controller2:40010");
// This will replace the existing Row of "rowid1" (if one exists) in table "table1". It will
// write the mutation to the write ahead log (WAL) and it will not block waiting for the
// mutation to become visible.
RowMutation mutation = newRowMutation("table1", "rowid1",
newRecordMutation("fam0", "recordid1", newColumn("col0", "val0"), newColumn("col1", "val2")),
newRecordMutation("fam1", "recordid2", newColumn("col4", "val4"), newColumn("col5", "val4")));
client.mutate(mutation);
Shell
The shell can be invoked by running:
$BLUR_HOME/bin/blur shell
Any shell command can also be invoked as a CLI command by running:
$BLUR_HOME/bin/blur <command>
# For example to get help
$BLUR_HOME/bin/blur help
The following rules are used when interacting with the shell:
- Arguments are denoted by "< >".
- Optional arguments are denoted by "[ ]".
- Options are denoted by "-".
- Multiple options / arguments are denoted by "*".
Map Reduce
Here is an example of the typical usage of the BlurOutputFormat. The Blur table has to be created before the MapReduce job is started. The setupJob method configures the following:
- Sets the reducer class to DefaultBlurReducer
- Sets the number of reducers equal to the number of shards in the table
- Sets the output key class to the standard Text writable from the Hadoop library
- Sets the output value class to the BlurMutate writable from the Blur library
- Sets the output format to BlurOutputFormat
- Sets the TableDescriptor in the Configuration
- Sets the output path to the TableDescriptor.getTableUri() value
- Configures the job to use the BlurOutputCommitter class to commit or roll back the MapReduce job
Example Usage
Iface client = BlurClient.getClient("controller1:40010");
TableDescriptor tableDescriptor = client.describe(tableName);
Job job = new Job(jobConf, "blur index");
job.setJarByClass(BlurOutputFormatTest.class);
job.setMapperClass(CsvBlurMapper.class);
job.setInputFormatClass(TextInputFormat.class);
FileInputFormat.addInputPath(job, new Path(input));
CsvBlurMapper.addColumns(job, "cf1", "col");
BlurOutputFormat.setupJob(job, tableDescriptor);
BlurOutputFormat.setIndexLocally(job, true);
BlurOutputFormat.setOptimizeInFlight(job, true);
job.waitForCompletion(true);
Options
- BlurOutputFormat.setIndexLocally(Job,boolean) - Enabled by default, this will enable local indexing on the machine where the task is running. When the RecordWriter closes, the index is copied to the remote destination in HDFS.
- BlurOutputFormat.setMaxDocumentBufferSize(Job,int) - Sets the maximum number of documents that the buffer will hold in memory before overflowing to disk. By default this is 1000, which will probably be very low for most systems.
- BlurOutputFormat.setOptimizeInFlight(Job,boolean) - Enabled by default, this will optimize the index while copying from the local index to the remote destination in HDFS. Used in conjunction with setIndexLocally.
- BlurOutputFormat.setReducerMultiplier(Job,int) - This will multiply the number of reducers for this job. For example, if the table has 256 shards the normal number of reducers is 256; with a reducer multiplier of 4 the number of reducers will be 1024 and each shard will get 4 new segments instead of the normal 1. A short example of setting these options follows this list.
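For example, the buffer size and reducer multiplier can be set on the same job as in the example usage above. This is a minimal sketch; the values are illustrative only:
// Assumed illustrative values; tune for the cluster and data volume.
BlurOutputFormat.setMaxDocumentBufferSize(job, 10000); // buffer up to 10,000 documents before spilling to disk
BlurOutputFormat.setReducerMultiplier(job, 4); // 4 reducers per shard, so 4 new segments per shard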
CSV Loader
The CSV Loader program can be invoked by running:
$BLUR_HOME/bin/blur csvloader
Caution
Also the machine that will execute this command will need to have Hadoop installed and configured locally, otherwise the scripts will not work correctly.
usage: csvloader
The "csvloader" command is used to load delimited files into a Blur table.
The required options are "-c", "-t", and "-d". The standard format for the contents of a file is:
"rowid,recordid,family,col1,col2,...". However there are several options: the rowid and recordid
can be generated based on the data in the record via the "-A" and "-a" options, the family can be
assigned based on the path via the "-I" option, and the column name order can be mapped via the
"-d" option. You can also set the input format to sequence files via the "-S" option or leave the
default of text files.
-A No Row Ids - Automatically generate row ids for each record based on an MD5
hash of the data within the record.
-a No Record Ids - Automatically generate record ids for each record based on an
MD5 hash of the data within the record.
-b <size> The maximum number of Lucene documents to buffer in the reducer for a single
row before spilling over to disk. (default 1000)
-c <controller*> * Thrift controller connection string. (host1:40010 host2:40010 ...)
-C <minimum maximum> Enables a combine file input to help deal with many small files as the
input. Provide the minimum and maximum size per mapper. For a minimum of
1GB and a maximum of 2.5GB: (1000000000 2500000000)
-d <family column*> * Define the mapping of fields in the CSV file to column names. (family col1
col2 col3 ...)
-I <family path*> The directory to index with a family name, the family name is assumed to NOT
be present in the file contents. (family hdfs://namenode/input/in1)
-i <path*> The directory to index, the family name is assumed to BE present in the file
contents. (hdfs://namenode/input/in1)
-l Disable the use of local storage on the server that is running the reduce
task and the copy to the Blur table once complete. (enabled by default)
-o Disable optimizing the indexes during the copy; this has very little overhead.
(enabled by default)
-p <codec> Sets the compression codec for the map compress output setting.
(SNAPPY,GZIP,BZIP,DEFAULT, or classname)
-r <multiplier> The reducer multiplier allows for an increase in the number of reducers per
shard in the given table. For example if the table has 128 shards and the
reducer multiplier is 4 the total number of reducers will be 512, 4 reducers
per shard. (default 1)
-s <delimiter> The file delimiter to be used. (default value ',') NOTE: For special
characters like the default Hadoop separator of ASCII value 1, you can use
standard Java escaping (\u0001)
-S The input files are sequence files.
-t <tablename> * Blur table name.
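For example, a hypothetical invocation that loads comma-delimited text files from HDFS into an existing table might look like the following; the controller, table, family, column names, and input path are placeholders:
$BLUR_HOME/bin/blur csvloader -c controller1:40010 -t table1 -d fam0 col1 col2 -i hdfs://namenode/input/in1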
JDBC
The JDBC driver is very experimental and is currently read-only. It has a very basic SQL-ish
language that should allow for most Blur queries.
Basic SQL syntax will work for example:
select * from testtable where fam1.col1 = 'val1'
You may also use Lucene syntax by wrapping the Lucene query in a "query()" function:
select * from testtable where query(fam1.col1:val?)
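A minimal sketch of issuing one of these queries through the standard java.sql API is shown below; the JDBC URL is a placeholder and is not documented in this section, so consult the driver itself for the actual value:
// The URL below is hypothetical; see the driver for the actual JDBC URL format.
String jdbcUrl = "jdbc:blur://controller1:40010"; // placeholder
java.sql.Connection conn = DriverManager.getConnection(jdbcUrl);
Statement statement = conn.createStatement();
ResultSet rs = statement.executeQuery("select * from testtable where fam1.col1 = 'val1'");
while (rs.next()) {
  System.out.println(rs.getString(1));
}
rs.close();
statement.close();
conn.close();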
Here is a screenshot of the JDBC driver in SQuirrel:
