About

Filetracker is a module that provides shared storage for files, together with some extra metadata.

It was designed with the intent to be used along with a relational database in cases where large files need to be stored and accessed from multiple locations, but storing them as blobs in the database is not suitable.

Filetracker supports caching of files downloaded from the remote master store.

The Filetracker API allows versioning of stored files, but implementing it is optional, and the default store classes do not provide it.

Files, names and versions

A file may contain arbitrary data. Each file has a name that looks like an absolute filesystem path: components are separated by slashes, and the name must start with a slash. Filetracker does not support folders explicitly. For now, you may assume that a file in filetracker is identified by a name which only by convention looks like a filesystem path. Future versions may rely on this convention, so please follow it.

Many methods accept or return versioned names, which look like regular names with a version number appended, separated by @. For those methods, passing an unversioned name usually means “the latest version of that file”.
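To make the naming rules concrete, here is a minimal sketch of how a versioned name could be taken apart. The helper parse_versioned_name is hypothetical and exists only for illustration; it assumes that versions are integers and it is not the library's own code (the library documents filetracker.split_name for this purpose).

```python
def parse_versioned_name(name):
    """Split '/path/to/file@42' into ('/path/to/file', 42).

    Hypothetical helper for illustration; assumes integer versions.
    An unversioned name yields a version of None.
    """
    if not name.startswith('/'):
        raise ValueError('filetracker names must start with a slash: %r' % name)
    base, sep, version = name.rpartition('@')
    if not sep:
        # No '@' present: the whole string is the unversioned name.
        return name, None
    return base, int(version)
```

Under these assumptions, an unversioned name comes back with None as its version, which a caller would treat as "the latest version".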

Configuration and usage

The only class you will probably need to know and use is Client.

class filetracker.Client(local_store='auto', remote_store='auto', lock_manager='auto', cache_dir=None, remote_url=None, locks_dir=None)

The main filetracker client class.

The client instance can be built in one of several ways. The easiest is to call the constructor without arguments. In this case the configuration is taken from the environment variables:

FILETRACKER_DIR
the folder to use as the local cache; if not specified, ~/.filetracker-store is used.
FILETRACKER_URL
the URL of the filetracker server; if not present, the constructed client is a stand-alone local client, which stores the files and metadata locally — this can be safely used by multiple processes on the same machine, too.

Another way to create a client is to pass these values as the constructor arguments remote_url and cache_dir.
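The two configuration routes can be pictured with a small sketch. The function resolve_client_config below is hypothetical and does not come from filetracker; it only mirrors the description above, and it assumes that explicit arguments take precedence over environment variables.

```python
import os

DEFAULT_CACHE_DIR = '~/.filetracker-store'


def resolve_client_config(cache_dir=None, remote_url=None, environ=None):
    """Return (cache_dir, remote_url) following the rules described above.

    Assumption: explicit arguments win over environment variables.
    A remote_url of None means a stand-alone local client.
    """
    environ = os.environ if environ is None else environ
    if cache_dir is None:
        cache_dir = environ.get('FILETRACKER_DIR', DEFAULT_CACHE_DIR)
    if remote_url is None:
        remote_url = environ.get('FILETRACKER_URL')
    return os.path.expanduser(cache_dir), remote_url
```

Passing environ explicitly keeps the sketch easy to test without touching the real process environment.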

If you are a power user, you may create the client by manually passing local_store and remote_store to the constructor (see Filetracker Cache Cleaner).

delete_file(name)

Deletes the file identified by name along with its metadata.

The file is removed from both the local store and the remote store.

file_size(name, force_refresh=False)

Returns the size of the file.

For efficiency, this operation does not use locking, so it may return inconsistent data. Use it for informational purposes only.

file_version(name)

Returns the newest available version number of the file.

If a remote store is configured, it is queried; otherwise the local version is returned. The remote store is assumed to always have the newest version of the file.

If name includes a version, that version is ignored.

get_file(name, save_to, add_to_cache=True, force_refresh=False, _lock_exclusive=False)

Retrieves the file identified by name.

The file is saved as save_to. If add_to_cache is True, the file is added to the local store. If force_refresh is True and a remote store is configured, the local cache is not examined.

If a remote store is configured, but name does not contain a version, the local data store is not used, as we cannot guarantee that the version there is fresh.
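The rules for when the local store may be consulted can be summarized in a short sketch. The function may_serve_from_local_store is a hypothetical helper that restates the two paragraphs above as code; it is not part of the filetracker API and ignores everything else get_file() does.

```python
def may_serve_from_local_store(name, remote_configured, force_refresh=False):
    """Decide whether the local store may satisfy a get_file() request.

    Hypothetical helper mirroring the rules described above.
    """
    has_version = '@' in name
    if not remote_configured:
        # Stand-alone client: the local store is the only source.
        return True
    if force_refresh:
        # The caller explicitly asked to bypass the local cache.
        return False
    # Without a version we cannot tell whether the cached copy is fresh.
    return has_version
```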

The local data store implemented in LocalDataStore tries to avoid copying the entire file to save_to and uses hardlinking instead where possible. Therefore, do not modify the retrieved file in place, or you may corrupt the cached copy.

This method returns the full versioned name of the retrieved file.

get_stream(name, force_refresh=False, serve_from_cache=False)

Retrieves the file identified by name in streaming mode.

Works like get_file(), except that it returns a tuple (file-like object, versioned name).

When both remote_store and local_store are present, serve_from_cache can be used to ensure that the file is downloaded to and served from the local cache. If a full version is specified and the file exists in the cache, the file is always served locally.

list_local_files()

Returns a list of all locally stored files.

Each element of this list is of DataStore.FileInfoEntry type.

put_file(name, filename, to_local_store=True, to_remote_store=True)

Adds file filename to the filetracker under the name name.

If the file already exists, a new version is created. In practice if the store does not support versioning, the file is overwritten.

The file may be added to local store only (if to_remote_store is False), to remote store only (if to_local_store is False) or both. If only one store is configured, the values of to_local_store and to_remote_store are ignored.

Local data store implemented in LocalDataStore tries to not directly copy the data to the final cache destination, but uses hardlinking. Therefore you should not modify the file in-place later as this would be disastrous.
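The hardlinking warning is easier to appreciate with a small demonstration of general filesystem behavior. The sketch below does not use filetracker at all: the file names are made up, and the "cached" path merely stands in for an entry in the local store. It assumes a filesystem that supports hardlinks.

```python
import os
import tempfile


def demonstrate_hardlink_aliasing():
    """Show that in-place edits leak through a hardlink, while replacing does not."""
    with tempfile.TemporaryDirectory() as tmp:
        source = os.path.join(tmp, 'source.txt')
        cached = os.path.join(tmp, 'cached.txt')  # stands in for the cache entry

        with open(source, 'w') as f:
            f.write('original')
        os.link(source, cached)  # both names now refer to the same data

        # In-place modification: the "cached" copy changes as well.
        with open(source, 'w') as f:
            f.write('modified')
        with open(cached) as f:
            after_in_place = f.read()

        # Safer alternative: write a new file and atomically swap it in.
        replacement = os.path.join(tmp, 'source.txt.new')
        with open(replacement, 'w') as f:
            f.write('replaced')
        os.replace(replacement, source)
        with open(cached) as f:
            after_replace = f.read()

        return after_in_place, after_replace
```

The first value shows the in-place edit leaking into the linked copy; the second shows that replacing the source with a new file leaves the linked copy untouched.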

If you write tests, you may also be interested in filetracker.dummy.DummyClient.

Filetracker server

At some point you probably want to run a filetracker server, so that more than one machine can share the store. Just do:

$ filetracker-server --help

This script can be used to start the metadata and file servers with minimal effort.

Using filetracker from the shell

No programmer can live without a way to fiddle with filetracker from the shell:

$ filetracker --help

Filetracker Cache Cleaner

For usage, please run:

$ filetracker-cache-cleaner --help

class filetracker.cachecleaner.CacheCleaner(cache_size_limit, glob_cache_dirs, scan_interval=datetime.timedelta(0, 600), percent_cleaning_level=50.0)

Tool for periodically cleaning the filetracker cache. Designed to work as a daemon.

The cache cleaner is started by calling the run() method. It supports multiple instances of Client. Configuration is passed as constructor parameters:

Parameters:
  • cache_size_limit (int) – soft limit on the total logical size of the cached files
  • glob_cache_dirs (iterable) – list of paths to filetracker.Client cache directories, given as glob expressions
  • scan_interval (datetime.timedelta) – how often to scan the disk and, if necessary, clean the cache
  • percent_cleaning_level (float) – the share of cache_size_limit, in percent, that the newest cache files may occupy and that is kept during cleaning

Cache cleaner runs the following algorithm:

  1. Ask each client (specified in the constructor by its cache directory) to list all stored files. This is the file index.
  2. Analyze the file index: check whether the cache should be cleaned and, if so, which files to remove.
  3. Clean cache if necessary.
  4. Wait time specified in constructor.
  5. Go to step 1.

Files are deleted from the oldest to the newest, based on modification time. If two files have the same modification time, the larger one is deleted first.
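The analysis step and the deletion order can be sketched as follows. This is an illustration under stated assumptions, not the CacheCleaner implementation: CachedFile and select_files_to_delete are hypothetical names, cleaning is assumed to start only when the total size exceeds cache_size_limit, and it is assumed to stop once the remaining size is at most percent_cleaning_level percent of that limit.

```python
from collections import namedtuple

# Hypothetical stand-in for the file index entries described in this section.
CachedFile = namedtuple('CachedFile', ['name', 'size', 'mtime'])


def select_files_to_delete(file_index, cache_size_limit, percent_cleaning_level=50.0):
    """Return the files a cleaning pass would remove, oldest first.

    Assumptions: cleaning starts only when the total size exceeds
    cache_size_limit, and stops once the remaining size is at most
    percent_cleaning_level percent of that limit.
    """
    total = sum(f.size for f in file_index)
    if total <= cache_size_limit:
        return []
    target = cache_size_limit * percent_cleaning_level / 100.0
    # Oldest first; for equal modification times the larger file goes first.
    candidates = sorted(file_index, key=lambda f: (f.mtime, -f.size))
    to_delete = []
    for entry in candidates:
        if total <= target:
            break
        to_delete.append(entry)
        total -= entry.size
    return to_delete
```

Sorting by the pair (mtime, -size) expresses both ordering rules in a single key, so no separate tie-breaking pass is needed.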

class FileIndexEntry(file_info, client)

Entry for file index.

Associates a DataStore.FileInfoEntry with the filetracker.Client that owns the given file.

Fields:

  • file_info – instance of DataStore.FileInfoEntry
  • client – instance of filetracker.Client that owns the given file

client

Alias for field number 1

file_info

Alias for field number 0

run()

Starts cleaning the cache in an infinite loop.

Internal API Reference

filetracker.split_name(name)

Splits a (possibly versioned) name into unversioned name and version.

Returns a tuple (unversioned_name, version), where version may be None.

class filetracker.dummy.DummyDataStore

A dummy data store which uses memory to store files.

Handy for testing, but do not try to store too much data in it. This class is also not thread-safe.

class filetracker.dummy.DummyClient

Filetracker client which uses a dummy local data store.

To-dos and ideas

  • access control
  • cache pruning
  • support for “directories”: especially ls
  • fuse client
  • rm
