
Creative Commons Catalog API

Build Status License

Purpose

The Creative Commons Catalog API ('cccatalog-api') is a system that allows programmatic access to public domain digital media. It is our ambition to index and catalog billions of Creative Commons works, including articles, songs, videos, photographs, paintings, and more. Using this API, developers will be able to access the digital commons in their own applications.

This repository is primarily concerned with back end infrastructure like datastores, servers, and APIs. The pipeline that feeds data into this system can be found in the cccatalog repository. A front end web application that interfaces with the API can be found at the cccatalog-frontend repository.

API Documentation

Browsable API documentation can be found here.

Running the server locally

Ensure that you have installed Docker (with Docker Compose) and that the Docker daemon is running.

git clone https://github.com/creativecommons/cccatalog-api.git
cd cccatalog-api
docker-compose up

After executing docker-compose up, you will be running:

  • A Django API server
  • Two PostgreSQL instances (one simulates the upstream data source, the other serves as the application database)
  • Elasticsearch
  • Redis
  • A thumbnail-generating image proxy
  • ingestion-server, a service for bulk ingesting and indexing search data
  • analytics, a REST API server for collecting search usage data
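
Before loading data, you may want to confirm that the services came up cleanly. The commands below are a quick sketch; the host ports assume the defaults in docker-compose.yml and may differ in your checkout.

docker-compose ps
curl localhost:9200/_cluster/health   # Elasticsearch (assumed host port 9200)
curl localhost:8000                   # Django API server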

Once everything has initialized, with docker-compose still running in the background, load the sample data. You will need to install PostgreSQL client tools to perform this step. On Debian, the package is called postgresql-client-common.

./load_sample_data.sh

You are now ready to start sending the API server requests. Hit the API with a request to make sure it is working:

curl localhost:8000/v1/images?q=honey
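
If you want to inspect the response more comfortably, you can pipe it through Python's built-in JSON pretty-printer. The page_size parameter below is an assumption based on the browsable API documentation; omit it if your deployment does not support it.

curl 'localhost:8000/v1/images?q=honey&page_size=5' | python3 -m json.tool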

Diagnosing local Elasticsearch issues

If the API server container failed to start, there's a good chance that Elasticsearch failed to start on your machine. Ensure that you have allocated enough memory to Docker applications; otherwise, the container will exit immediately with an error. Also, if the logs mention "insufficient max map count", you will need to raise the maximum number of memory map areas a process may have. On most Linux machines, you can fix this by adding the following line to /etc/sysctl.conf:

vm.max_map_count=262144

To make this setting take effect, run:

sudo sysctl -p
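
Alternatively, if you prefer not to edit /etc/sysctl.conf, the same setting can be applied for the current session only and then verified:

sudo sysctl -w vm.max_map_count=262144
sysctl vm.max_map_count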

System Architecture

[System architecture diagram]

Basic flow of data

Search data is ingested from upstream sources provided by the data pipeline. At the time of writing, this includes data from Common Crawl and multiple third-party APIs. Once the data has been scraped and cleaned, it is transferred to the upstream database, indicating that it is ready for production use.

Every week, the latest version of the data is automatically bulk copied ("ingested") from the upstream database to the production database by the Ingestion Server. Once the data has been copied into the production database, it is indexed in Elasticsearch, at which point the new data can be served by the CC Catalog API servers.
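
As a rough illustration of this flow, a manually triggered ingestion might look something like the request below. The port, endpoint, and payload here are hypothetical placeholders rather than the documented Ingestion Server API; consult the ingestion_server subproject for the real interface.

curl -X POST localhost:8001/task \
  -H 'Content-Type: application/json' \
  -d '{"model": "image", "action": "INGEST_UPSTREAM"}'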

Description of subprojects

  • cccatalog-api is a Django Rest Framework API server. For a full description of its capabilities, please see the browsable documentation.
  • ingestion-server is a service for downloading and indexing search data once it has been prepared by the CC Catalog.
  • analytics is a Falcon REST API for collecting usage data.

Running the tests

Running API live integration tests

You can check the health of a live deployment of the API by running the live integration tests.

cd cccatalog-api
pipenv install
pipenv shell
./test/run_test.sh

Running the Ingestion Server test

This end-to-end test ingests and indexes some dummy data using the Ingestion Server API.

cd ingestion_server
pipenv install
pipenv shell
python3 test/integration_tests.py

Deploying and monitoring the API

The API infrastructure is orchestrated using Terraform, hosted in creativecommons/ccsearch-infrastructure. More details can be found on this wiki page.

Django Admin

Custom administration views can be viewed at the /admin/ endpoint.
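
To log in to the admin views you will need a Django superuser. One way to create one, assuming the API container is the web service defined in docker-compose.yml (adjust the service name if yours differs):

docker-compose exec web python3 manage.py createsuperuser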

Contributing

Pull requests are welcome! Feel free to join us on Slack and discuss the project with the engineers in #cc-search. You are welcome to take any open issue in the tracker labeled 'help wanted' or 'good first issue'; there is no need to ask for permission in advance. See the CONTRIBUTORS file for details. Other issues are open for contribution as well, but they may be less accessible or less well defined than those that are explicitly labeled, so consider reaching out to us if you are interested in working on them.
