output of locate into a dynamic String array -- C++

I'm writing my own file indexer similar to Google Desktop. I'm in the early stages right now, and I'm trying to reuse as many FreeBSD commands as I can so I don't have to write the same thing again. So far I've created a class called Crawler. I'm using $ locate / * > database.tmp to get a list of all the files the user can see on the HDD. Once that is done I will go through each file and do my indexing, which I haven't started writing yet. My question: instead of running the locate command and dumping its output into a file, can I put it directly into a dynamic String array? I think this would be faster and easier.

Here's the code I have so far

Code:
/*
 * Crawler.h
 *
 *  Created on: Jul 8, 2012
 *      Author: kclark
 */

#ifndef CRAWLER_H_
#define CRAWLER_H_

#include <stdio.h>	// popen, pclose, fscanf
#include <stdlib.h>	// system

class Crawler {
public:
	void startCrawl();
private:
	void getFiles();
	int getLines();
};

void Crawler::startCrawl()
{
	getFiles();
	int dbLines = getLines();
}

void Crawler::getFiles()
{
	// Get the list of files and dump it into database.tmp
	system("/usr/bin/locate * / > database.tmp");
}

int Crawler::getLines()
{
	// system() returns the exit status of the command, not its output,
	// so the count has to be read back through a pipe
	int lines = -1;
	FILE *pipe = popen("cat database.tmp | wc -l", "r");	// Count the lines
	if (pipe != NULL) {
		if (fscanf(pipe, "%d", &lines) != 1)
			lines = -1;										// Could not parse a number
		pclose(pipe);
	}
	return lines;
}

#endif /* CRAWLER_H_ */
 
You can walk the directory tree recursively with directory(3); that would be the quickest (programmatically) and most correct way to do it. As you walk the tree you can index the files directly into a file-based hash table such as BDB (dbopen(3)) instead of buffering everything in large string arrays or text files.
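
For illustration, here is a minimal sketch of such a recursive walk using the directory(3) calls. The function name walkTree and the choice to collect paths into a vector are assumptions made for the example; in a real indexer the push_back line is where each file would be handed to the index (for instance a dbopen(3) database) instead of being buffered.

```cpp
#include <dirent.h>     // opendir, readdir, closedir: the directory(3) family
#include <sys/stat.h>   // lstat, S_ISDIR, S_ISREG
#include <cstdio>       // std::perror
#include <cstring>      // std::strcmp
#include <string>
#include <vector>

// Hypothetical helper: append every regular file below `root` to `out`.
void walkTree(const std::string& root, std::vector<std::string>& out)
{
    DIR* dir = opendir(root.c_str());
    if (dir == NULL) {
        std::perror(root.c_str());   // unreadable directory: report and skip
        return;
    }
    while (dirent* entry = readdir(dir)) {
        // Skip the self and parent links, or the recursion never ends.
        if (std::strcmp(entry->d_name, ".") == 0 ||
            std::strcmp(entry->d_name, "..") == 0)
            continue;

        std::string path = root + "/" + entry->d_name;
        struct stat st;
        if (lstat(path.c_str(), &st) != 0)   // lstat does not follow symlinks
            continue;

        if (S_ISDIR(st.st_mode))
            walkTree(path, out);             // descend into the subdirectory
        else if (S_ISREG(st.st_mode))
            out.push_back(path);             // a real indexer would index here
    }
    closedir(dir);
}
```

Each level of recursion keeps one directory handle open, so a very deep tree could run into the per-process file descriptor limit; fts(3) may be worth a look if that becomes a problem.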

Also, you should avoid defining class functions in header files; header files should be used only for declarations.
 
It also seems to me that getFiles (and getLines) could be static services. They should not be tied to each Crawler instance; at the very least, the database should not be built by each instance.
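
Taking the last two suggestions together, the layout might look roughly like the sketch below. It is shown as one listing for brevity, but the two halves are meant to live in separate files. The static function countLines and its ifstream-based body are assumptions for the example, not the original code.

```cpp
// ---- Crawler.h: declarations only ----
#ifndef CRAWLER_H_
#define CRAWLER_H_

#include <string>

class Crawler {
public:
    void startCrawl();
    // Static: callable as Crawler::countLines(...) without an instance.
    static int countLines(const std::string& path);
};

#endif /* CRAWLER_H_ */

// ---- Crawler.cpp: definitions, compiled exactly once ----
// #include "Crawler.h"   // needed when this half lives in its own file
#include <fstream>

int Crawler::countLines(const std::string& path)
{
    std::ifstream in(path.c_str());
    std::string line;
    int count = 0;
    while (std::getline(in, line))
        ++count;
    return count;      // 0 if the file is missing or empty
}

void Crawler::startCrawl()
{
    int dbLines = countLines("database.tmp");
    (void)dbLines;     // placeholder until the indexing step exists
}
```

With the definitions in a single .cpp file, including Crawler.h from both main.cpp and Crawler.cpp no longer produces duplicate definitions at link time.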
 
mtt said:
Since you're writing in C++ anyway, you can also take a look at Boost.Filesystem: http://www.boost.org/libs/filesystem/ // it's likely to end up in the next version of the C++ standard: http://en.wikipedia.org/wiki/C++_Technical_Report_1#Technical_Report_2
The tutorial includes a directory iteration example: http://www.boost.org/doc/libs/release/libs/filesystem/doc/tutorial.html#Directory-iteration

Thanks! I've been reading the tutorials and I think this will be a big help. I'm at a major design decision here: I can go one of two ways with my search engine.

When a search term is entered, I can look for it on the hard drive in one of two ways:

1. Search the files on the fly, sorting the results by the number of word hits in each file and then by last used.

2. Build a database once that contains all of the words used on the HDD, with the path names linked to the DB entries for the words each file contains, and then sort the results as above.

The trade-off is that 1 would take less space but would be much slower, while 2 would be faster (I think?).
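
To make option 2 concrete, here is a rough in-memory sketch of a word-to-paths index. The names WordIndex, addDocument and lookup are made up for the example, words are split on whitespace only, and a real version would keep the table on disk rather than in a std::map.

```cpp
#include <map>
#include <set>
#include <sstream>
#include <string>

// Each word maps to the set of paths whose contents include it.
typedef std::map<std::string, std::set<std::string> > WordIndex;

// Hypothetical helper: record every whitespace-separated word of `text`
// as occurring in the file at `path`.
void addDocument(WordIndex& index, const std::string& path,
                 const std::string& text)
{
    std::istringstream words(text);
    std::string word;
    while (words >> word)
        index[word].insert(path);
}

// A query is a single map search; no file is opened at search time.
std::set<std::string> lookup(const WordIndex& index, const std::string& word)
{
    WordIndex::const_iterator it = index.find(word);
    if (it == index.end())
        return std::set<std::string>();   // word never seen
    return it->second;
}
```

The cost moves to build time and storage: every file is read once up front and the table grows with the number of distinct words, which is the space-for-speed trade described above.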
 
Well, "1" vs "2" is arguably "find" vs "locate" -- both tools are useful, what's optimal depends on the application :)

http://www.unix.com/man-page/freebsd/1/find/
http://www.unix.com/man-page/freebsd/1/locate/
 
I'm not sure whether I should start another thread for this, but I installed Boost and got the following:

Code:
$ make
g++ -I/usr/local/include ./main.cpp ./Crawler.cpp -o ./Desktop
/var/tmp//ccVfGlTQ.o: In function `__static_initialization_and_destruction_0(int, int)':
main.cpp:(.text+0x1de): undefined reference to `boost::system::generic_category()'
main.cpp:(.text+0x1e8): undefined reference to `boost::system::generic_category()'
main.cpp:(.text+0x1f2): undefined reference to `boost::system::system_category()'

I found a post on Stack Overflow that suggested passing -lboost_system to g++, and now I get this new error that I can't find any information on:

Code:
$ make
g++ -I/usr/local/include -lboost_system ./main.cpp ./Crawler.cpp -o ./Desktop
/usr/bin/ld: cannot find -lboost_system
*** Error code 1
 
Yes, while most of the Boost libraries are header-only, Boost.Filesystem is among the ones that aren't.
This means it has to be built separately. Then you'll be able to link with the -lboost_filesystem flag, provided that you also give the lib directory using -L (just as you use -I for the include directory) and list your .cpp files BEFORE the library flags (see the compiler manual): http://www.boost.org/libs/filesystem/#Building

See also:
http://www.boost.org/more/getting_started/unix-variants.html

Note that there's also Boost in the FreeBSD Ports and Packages Collection, although it might be relatively old(er):
http://forums.freebsd.org/showthread.php?t=11379
http://www.freshports.org/devel/boost-all/
 
I think the use case is that we type in a full or partial file/directory name and the OS returns the list of matching file or directory paths.

First, searching with a wildcard is not optimal (unless you wish to throw away your HDD a few months later). What really needs to be done is to run /usr/libexec/locate.updatedb when the index needs to be rebuilt or, better still, hand that off to cron.

Second, once we type in a full/partial string the file/directory path is fetched, but we also need to capture the stdout stream of the 'locate' output. As someone suggested, popen() would do the job.
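
A minimal sketch of the popen() approach, which also answers the original question about getting the output into a dynamic array of strings without a temporary file. The helper name readLines is an assumption for the example, and popen() is POSIX rather than standard C++.

```cpp
#include <stdio.h>    // popen, pclose (POSIX), fgets
#include <string>
#include <vector>

// Hypothetical helper: run `command` through the shell and return its
// standard output split into lines.
std::vector<std::string> readLines(const std::string& command)
{
    std::vector<std::string> lines;
    FILE* pipe = popen(command.c_str(), "r");
    if (pipe == NULL)
        return lines;                       // could not start the command

    char buffer[4096];
    std::string current;
    while (fgets(buffer, sizeof buffer, pipe) != NULL) {
        current += buffer;
        // A path longer than the buffer arrives in pieces; only a trailing
        // newline marks the end of a line.
        if (!current.empty() && current[current.size() - 1] == '\n') {
            current.erase(current.size() - 1);
            lines.push_back(current);
            current.clear();
        }
    }
    if (!current.empty())
        lines.push_back(current);           // last line without a newline
    pclose(pipe);
    return lines;
}
```

It could then be called as readLines("/usr/bin/locate somepattern"). Since the command string goes through the shell, any user-supplied pattern would need to be quoted or escaped before being appended.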

Some people are suggesting directory enumerators and similar stuff, which will not work.
 
Thanks for all the info, everyone. From all of the feedback it sounds like I'm headed in the right direction, but I have what I think might be a really simple question.

I've got ./file.txt open in C++ and I want to search for "word" inside the file. All I need is a boolean telling me whether "word" is in the file. Is there a simple solution for this?
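
One simple way to do this with only the standard library is to read the file line by line and use std::string::find. The function name fileContains is made up for the sketch.

```cpp
#include <fstream>
#include <string>

// Hypothetical helper: true if `word` occurs anywhere in the file at `path`.
bool fileContains(const std::string& path, const std::string& word)
{
    std::ifstream in(path.c_str());
    std::string line;
    while (std::getline(in, line)) {
        if (line.find(word) != std::string::npos)
            return true;     // stop at the first hit
    }
    return false;            // no match, or the file could not be opened
}
```

Note that this is a plain substring test, so "word" also matches inside "swordfish", and a match that spans a line break is not found.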
 
Why did you choose C++ for this? If you are going to rely on external commands and litter your code with system() calls, you might as well save time, write it in the form of a shell script, and use grep to search files.
 
I'm only using system() calls to get the file paths from the locate DB and dump them into a text file, unless you can suggest an easier way to read the DB into a string so I can parse the data how I need to. I've cleaned up a lot of the code. I started a project on Google Code for my svn. I'll post the link later if anyone wants to see it or contribute.
 
As already stated in this thread, if you are going to develop a C/C++ program you should use C/C++ libraries as much as possible. Using system() has several drawbacks, mainly (i) requesting a new process to be created and scheduled and (ii) requiring some kind of IPC (reading the output is a baby IPC; pipes and sockets are more complex solutions). Having a process that handles the data the way the system does, that is, using POSIX libraries, is much better.
 
I agree with you, but I'm only using the system() call once to get the file paths. I don't know how I'd read the locate database without it; if someone has a better suggestion I'd use that instead.
 