Merged
6 changes: 3 additions & 3 deletions .github/dependabot.yml
Original file line number Diff line number Diff line change
@@ -1,12 +1,12 @@
 version: 2
 updates:
-  - package-ecosystem: "pip"
-    directory: "/requirements"
+  - package-ecosystem: "uv"
+    directory: "/"
     schedule:
       interval: "daily"
     # Only use this to bump our libraries
     allow:
-      - dependency-name: "unstructured[local-inference]"
+      - dependency-name: "unstructured[all-docs]"

   - package-ecosystem: "github-actions"
     # NOTE(robinson) - Workflow files stored in the
20 changes: 9 additions & 11 deletions .github/workflows/bump_libraries.yaml
@@ -6,32 +6,31 @@ on:
       - opened
       - reopened
     paths:
-      - 'requirements/**'
-
-env:
-  PYTHON_VERSION: "3.8"
+      - 'uv.lock'
+      - 'pyproject.toml'

 jobs:
   bump-changelog:
-    runs-on: ubuntu-latest
+    runs-on: opensource-linux-8core
     if: ${{ github.actor == 'dependabot[bot]' }}
     permissions:
       contents: write
     steps:
     - uses: actions/checkout@v5
+    - name: Read Python version from .python-version
+      run: echo "PYTHON_VERSION=$(cat .python-version)" >> $GITHUB_ENV
+    - name: Install uv
+      uses: astral-sh/setup-uv@v5
     - name: Set up Python ${{ env.PYTHON_VERSION }}
-      uses: actions/setup-python@v6
-      with:
-        python-version: ${{ env.PYTHON_VERSION }}
+      run: uv python install ${{ env.PYTHON_VERSION }}
     - name: Dependabot metadata
       id: metadata
       uses: dependabot/fetch-metadata@v2
       with:
         github-token: "${{ secrets.GITHUB_TOKEN }}"
     - name: Create release version
       run: |
-        pip install pip-tools
-        make pip-compile
+        uv lock --upgrade
         package=${{ steps.metadata.outputs.dependency-names }}
         # Strip any [extras] from name
         package=${package%\[*}
@@ -41,4 +40,3 @@ jobs:
     - uses: stefanzweifel/git-auto-commit-action@v6
       with:
         commit_message: "Bump libraries and release"
-
106 changes: 36 additions & 70 deletions .github/workflows/ci.yml
@@ -7,147 +7,113 @@
branches: [ main ]

env:
PYTHON_VERSION: "3.12"
PIPELINE_FAMILY: "general"

jobs:
setup:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v5
- uses: actions/cache@v5
id: virtualenv-cache
with:
path: |
.venv
key: ci-venv-${{ env.PIPELINE_FAMILY }}-${{ hashFiles('requirements/*.txt') }}
- name: Set up Python ${{ env.PYTHON_VERSION }}
uses: actions/setup-python@v6
with:
python-version: ${{ env.PYTHON_VERSION }}
- name: Setup virtual environment (no cache hit)
if: steps.virtualenv-cache.outputs.cache-hit != 'true'
run: |
python${{ env.PYTHON_VERSION }} -m venv .venv
source .venv/bin/activate
make install-ci

lint:
runs-on: ubuntu-latest
needs: setup
runs-on: opensource-linux-8core
steps:
- uses: actions/checkout@v5
- uses: actions/cache@v5
id: virtualenv-cache
- name: Read Python version from .python-version
run: echo "PYTHON_VERSION=$(cat .python-version)" >> $GITHUB_ENV
- name: Install uv
uses: astral-sh/setup-uv@v5
with:
path: |
.venv
key: ci-venv-${{ env.PIPELINE_FAMILY }}-${{ hashFiles('requirements/*.txt') }}
enable-cache: true
cache-dependency-glob: "uv.lock"
- name: Set up Python ${{ env.PYTHON_VERSION }}
run: uv python install ${{ env.PYTHON_VERSION }}
- name: Install lint dependencies
run: uv sync --only-group lint --frozen

CI lint mypy loses all third-party type information

Medium Severity

The CI lint job uses `uv sync --only-group lint --frozen`, which installs only `ruff`, `mypy`, and `types-requests`, and none of the project's runtime dependencies. When `mypy` then runs with `--ignore-missing-imports`, it silently treats every unresolved import from `fastapi`, `unstructured`, `pypdf`, `pandas`, etc. as `Any`. Since the source code in `prepline_general/api/` depends heavily on these packages (many of which ship inline types / `py.typed`), mypy checks almost nothing meaningful. The previous CI setup installed all dependencies before linting, giving mypy full access to third-party type information. This trade-off isn't explicitly documented and significantly weakens the type-safety net.
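
One possible way to address this is sketched below. It assumes the `lint` dependency group and the project's runtime dependencies are both declared in `pyproject.toml`. `uv sync --group lint` installs the named group in addition to the project's own dependencies, whereas `--only-group lint` installs the group alone. The step names are taken from the diff. Whether the longer install is acceptable for the lint job is a call for the maintainers.

```yaml
# Sketch only: install runtime dependencies plus the lint group so that
# mypy can resolve third-party types. Assumes a `lint` group exists in
# pyproject.toml and that uv.lock is up to date (required by --frozen).
- name: Install lint dependencies
  run: uv sync --group lint --frozen
- name: Lint
  run: make check
```

If lint speed matters more than type coverage, the alternative is to keep `--only-group lint` and document the reduced mypy coverage explicitly.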


- name: Lint
run: |
source .venv/bin/activate
make check
run: make check

shellcheck:

Check warning (Code scanning / CodeQL, Medium): Workflow does not contain permissions
Actions job or workflow does not limit the permissions of the GITHUB_TOKEN. Consider setting an explicit permissions block, using the following as a minimal starting point: {contents: read}
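
A minimal sketch of what the warning asks for, assuming no job in `ci.yml` needs to write to the repository or to pull requests. The workflow-level `permissions` key sets the default `GITHUB_TOKEN` scope for every job. A job that needs more could declare its own `permissions` block to override it. The `shellcheck` job is copied from the diff for context, and its indentation is an assumption.

```yaml
# Sketch only: workflow-level default that limits GITHUB_TOKEN to
# read-only access to repository contents for all jobs below.
permissions:
  contents: read

jobs:
  shellcheck:
    runs-on: opensource-linux-8core
    steps:
      - uses: actions/checkout@v5
      - name: ShellCheck
        uses: ludeeus/action-shellcheck@master
```

Placing the block once at the top of the workflow would likely clear the same warning on the other jobs in this file, since jobs without their own `permissions` block inherit the workflow-level setting.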
runs-on: ubuntu-latest
runs-on: opensource-linux-8core
steps:
- uses: actions/checkout@v5
- name: ShellCheck
uses: ludeeus/action-shellcheck@master

test:

Check warning (Code scanning / CodeQL, Medium): Workflow does not contain permissions
Actions job or workflow does not limit the permissions of the GITHUB_TOKEN. Consider setting an explicit permissions block, using the following as a minimal starting point: {contents: read}
runs-on: ubuntu-latest
needs: [setup, lint]
runs-on: opensource-linux-8core
needs: lint
steps:
- uses: actions/checkout@v5
- uses: actions/cache@v5
id: virtualenv-cache
- name: Read Python version from .python-version
run: echo "PYTHON_VERSION=$(cat .python-version)" >> $GITHUB_ENV
- name: Install uv
uses: astral-sh/setup-uv@v5
with:
path: |
.venv
key: ci-venv-${{ env.PIPELINE_FAMILY }}-${{ hashFiles('requirements/test.txt') }}

enable-cache: true
cache-dependency-glob: "uv.lock"
- name: Set up Python ${{ env.PYTHON_VERSION }}
uses: actions/setup-python@v6
with:
python-version: ${{ env.PYTHON_VERSION }}
- name: Run core tests
run: uv python install ${{ env.PYTHON_VERSION }}
- name: Install dependencies and run core tests
run: |
python${{ env.PYTHON_VERSION }} -m venv .venv
source .venv/bin/activate
sudo apt-get update && sudo apt-get install --yes poppler-utils libreoffice
make install-test
uv sync --group test --frozen
make install-pandoc
make install-nltk-models
sudo add-apt-repository -y ppa:alex-p/tesseract-ocr5
sudo apt-get install -y tesseract-ocr tesseract-ocr-kor
tesseract --version
make test
make check-coverage

changelog:

Check warning (Code scanning / CodeQL, Medium): Workflow does not contain permissions
Actions job or workflow does not limit the permissions of the GITHUB_TOKEN. Consider setting an explicit permissions block, using the following as a minimal starting point: {contents: read}
runs-on: ubuntu-latest
runs-on: opensource-linux-8core
steps:
- uses: actions/checkout@v5
- if: github.ref != 'refs/heads/main'
uses: dorny/paths-filter@v3
id: changes
with:
filters: |
src:
- 'doc_recipe/**'
- 'recipe-notebooks/**'

- if: steps.changes.outputs.src == 'true' && github.ref != 'refs/heads/main'
uses: dangoslen/changelog-enforcer@v3

# TODO - figure out best practice for caching docker images
# (Using the virtualenv to get pytest)
test_dockerfile:

Check warning (Code scanning / CodeQL, Medium): Workflow does not contain permissions
Actions job or workflow does not limit the permissions of the GITHUB_TOKEN. Consider setting an explicit permissions block, using the following as a minimal starting point: {contents: read}
runs-on: ubuntu-latest
needs: [setup, lint]
runs-on: opensource-linux-8core
needs: lint
steps:
- uses: actions/checkout@v5
- uses: actions/cache@v5
id: virtualenv-cache
- name: Read Python version from .python-version
run: echo "PYTHON_VERSION=$(cat .python-version)" >> $GITHUB_ENV
- name: Install uv
uses: astral-sh/setup-uv@v5
with:
python-version: ${{ env.PYTHON_VERSION }}
Smoke test fails: bare pytest call after venv activation removed

High Severity

The old CI workflows ran `source .venv/bin/activate` before `make docker-test`, which put `pytest` on `PATH`. This PR removes that activation and uses `uv sync` instead, but `docker-smoke-test.sh` (line 83) still calls bare `pytest` rather than `uv run pytest`. Because `uv sync` creates a `.venv` without activating it, and `setup-uv` only adds `uv` to `PATH`, `pytest` won't be found. Every other `pytest` invocation in the Makefile was updated to use `uv run`, but this script was missed.
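
Besides changing the script to call `uv run pytest`, a workflow-side workaround could look like the sketch below. It assumes `uv sync` creates the environment at `.venv` in the repository root (its default) and that the repository is checked out at `$GITHUB_WORKSPACE`. Splitting the install into its own step is also an assumption, needed because entries written to `$GITHUB_PATH` only take effect in later steps.

```yaml
# Sketch only: expose the virtualenv created by `uv sync` to later steps,
# so scripts that call bare `pytest` can still find it.
- name: Install test dependencies
  run: |
    uv sync --group test --frozen
    # Directories appended to $GITHUB_PATH are prepended to PATH
    # for all subsequent steps in this job.
    echo "$GITHUB_WORKSPACE/.venv/bin" >> "$GITHUB_PATH"
- name: Test Dockerfile
  run: |
    make docker-build
    make docker-test
```

Fixing the script itself keeps local and CI behavior aligned, so the workflow change is probably best treated as a stopgap.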


path: |
.venv
key: ci-venv-${{ env.PIPELINE_FAMILY }}-${{ hashFiles('requirements/test.txt') }}
enable-cache: true
cache-dependency-glob: "uv.lock"
- name: Set up Python ${{ env.PYTHON_VERSION }}
uses: actions/setup-python@v6
with:
python-version: ${{ env.PYTHON_VERSION }}
run: uv python install ${{ env.PYTHON_VERSION }}
- name: Free up disk space
run: |
# Clear some space (https://github.com/actions/runner-images/issues/2840)
echo "Disk usage before cleanup:"
df -h

# Remove unnecessary pre-installed software
sudo rm -rf /usr/share/dotnet
sudo rm -rf /opt/ghc
sudo rm -rf /usr/local/share/boost
sudo rm -rf /usr/local/lib/android
sudo rm -rf /opt/hostedtoolcache/CodeQL
sudo rm -rf /usr/local/.ghcup
sudo rm -rf /usr/share/swift

# Clean up docker to ensure we start fresh
docker system prune -af --volumes

echo "Disk usage after cleanup:"
df -h
- name: Test Dockerfile
run: |
python${{ env.PYTHON_VERSION }} -m venv .venv
source .venv/bin/activate
make install-test
uv sync --group test --frozen
make docker-build
make docker-test

Check warning (Code scanning / CodeQL, Medium): Workflow does not contain permissions
Actions job or workflow does not limit the permissions of the GITHUB_TOKEN. Consider setting an explicit permissions block, using the following as a minimal starting point: {contents: read}
# - name: Scan image
# uses: anchore/scan-action@v3
# with:
# image: "pipeline-family-${{ env.PIPELINE_FAMILY }}-dev"
# # NOTE(robinson) - revert this to medium when we bump libreoffice
# severity-cutoff: critical
2 changes: 1 addition & 1 deletion .github/workflows/claude.yml
@@ -17,7 +17,7 @@ jobs:
       (github.event_name == 'pull_request_review_comment' && contains(github.event.comment.body, '@claude')) ||
       (github.event_name == 'pull_request_review' && contains(github.event.review.body, '@claude')) ||
       (github.event_name == 'issues' && (contains(github.event.issue.body, '@claude') || contains(github.event.issue.title, '@claude')))
-    runs-on: ubuntu-latest
+    runs-on: opensource-linux-8core
     permissions:
       contents: read
       pull-requests: read
79 changes: 26 additions & 53 deletions .github/workflows/docker-publish.yml
@@ -10,56 +10,34 @@
DOCKER_BUILD_REPOSITORY: quay.io/unstructured-io/build-unstructured-api
PACKAGE: "unstructured-api"
PIPELINE_FAMILY: "general"
PIP_VERSION: "25.1.1"
PYTHON_VERSION: "3.12"

jobs:
setup:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v5
- uses: actions/cache@v5
id: virtualenv-cache
with:
path: |
.venv
key: ci-venv-${{ env.PIPELINE_FAMILY }}-${{ hashFiles('requirements/test.txt') }}
- name: Set up Python ${{ env.PYTHON_VERSION }}
uses: actions/setup-python@v6
with:
python-version: ${{ env.PYTHON_VERSION }}
- name: Setup virtual environment (no cache hit)
if: steps.virtualenv-cache.outputs.cache-hit != 'true'
run: |
python${{ env.PYTHON_VERSION }} -m venv .venv
source .venv/bin/activate
make install-ci
set-short-sha:
runs-on: ubuntu-latest
runs-on: opensource-linux-8core
outputs:
short_sha: ${{ steps.set_short_sha.outputs.short_sha }}
steps:
- name: Set Short SHA
id: set_short_sha
run: echo "short_sha=$(echo ${{ github.sha }} | cut -c1-7)" >> $GITHUB_OUTPUT
build-images:

Check warning (Code scanning / CodeQL, Medium): Workflow does not contain permissions
Actions job or workflow does not limit the permissions of the GITHUB_TOKEN. Consider setting an explicit permissions block, using the following as a minimal starting point: {}
strategy:
matrix:
#arch: ["arm64", "amd64"]
# NOTE(luke): temporary disable arm64 since its failing the smoke test
arch: ["amd64"]
runs-on: ubuntu-latest
needs: [setup, set-short-sha]
arch: ["arm64", "amd64"]
runs-on: ${{ matrix.arch == 'arm64' && 'opensource-linux-arm64-4core' || 'opensource-linux-8core' }}
needs: set-short-sha
env:
SHORT_SHA: ${{ needs.set-short-sha.outputs.short_sha }}
DOCKER_PLATFORM: linux/${{ matrix.arch }}
steps:
- name: Set up Docker Buildx
uses: docker/setup-buildx-action@v3
with:
driver: ${{ matrix.arch == 'amd64' && 'docker' || 'docker-container' }}
driver: docker
- name: Checkout code
uses: actions/checkout@v5
- name: Read Python version from .python-version
run: echo "PYTHON_VERSION=$(cat .python-version)" >> $GITHUB_ENV
- name: Login to Quay.io
uses: docker/login-action@v3
with:
@@ -90,25 +68,23 @@
run: |
DOCKER_BUILDKIT=1 docker buildx build --load -f Dockerfile \
--platform=$DOCKER_PLATFORM \
--build-arg PIP_VERSION=$PIP_VERSION \
--build-arg BUILDKIT_INLINE_CACHE=1 \
--build-arg PIPELINE_PACKAGE=${{ env.PIPELINE_FAMILY }} \
--provenance=false \
--progress plain \
--cache-from $DOCKER_BUILD_REPOSITORY:${{ matrix.arch }} \
-t $DOCKER_BUILD_REPOSITORY:${{ matrix.arch }}-$SHORT_SHA .
- name: Set virtualenv cache
uses: actions/cache@v5
id: virtualenv-cache
- name: Install uv
uses: astral-sh/setup-uv@v5
with:
path: |
.venv
key: ci-venv-${{ env.PIPELINE_FAMILY }}-${{ hashFiles('requirements/test.txt') }}
- name: Set up QEMU
uses: docker/setup-qemu-action@v3
enable-cache: true
cache-dependency-glob: "uv.lock"
- name: Set up Python ${{ env.PYTHON_VERSION }}
run: uv python install ${{ env.PYTHON_VERSION }}
- name: Install test dependencies
run: uv sync --group test --frozen
- name: Test image
run: |
source .venv/bin/activate
export DOCKER_IMAGE="$DOCKER_BUILD_REPOSITORY:${{ matrix.arch }}-$SHORT_SHA"
if [ "$DOCKER_PLATFORM" == "linux/arm64" ]; then
SKIP_INFERENCE_TESTS=true make docker-test
@@ -120,43 +96,40 @@
# write to the build repository to cache for the publish-images job
docker push $DOCKER_BUILD_REPOSITORY:${{ matrix.arch }}-$SHORT_SHA
publish-images:
runs-on: ubuntu-latest
needs: [setup, set-short-sha, build-images]
runs-on: opensource-linux-8core
needs: [set-short-sha, build-images]
env:
SHORT_SHA: ${{ needs.set-short-sha.outputs.short_sha }}
steps:
- name: Checkout code
uses: actions/checkout@v5
- name: Set SHORT_SHA
run: echo "SHORT_SHA=$(git rev-parse --short HEAD)" >> $GITHUB_ENV
- name: Login to Quay.io
uses: docker/login-action@v3
with:
registry: quay.io
username: ${{ secrets.QUAY_IO_ROBOT_USERNAME }}
password: ${{ secrets.QUAY_IO_ROBOT_TOKEN }}
- name: Pull AMD image
run: |
docker pull $DOCKER_BUILD_REPOSITORY:amd64-$SHORT_SHA
# - name: Pull ARM image
# run: |
# docker pull $DOCKER_BUILD_REPOSITORY:arm64-$SHORT_SHA
- name: Pull ARM image
run: |
docker pull $DOCKER_BUILD_REPOSITORY:arm64-$SHORT_SHA
- name: Push AMD and ARM tags
run: |
# these are used to construct the final manifest but also cache-from in subsequent runs
docker tag $DOCKER_BUILD_REPOSITORY:amd64-$SHORT_SHA $DOCKER_BUILD_REPOSITORY:amd64
docker push $DOCKER_BUILD_REPOSITORY:amd64
#docker tag $DOCKER_BUILD_REPOSITORY:arm64-$SHORT_SHA $DOCKER_BUILD_REPOSITORY:arm64
#docker push $DOCKER_BUILD_REPOSITORY:arm64
docker tag $DOCKER_BUILD_REPOSITORY:arm64-$SHORT_SHA $DOCKER_BUILD_REPOSITORY:arm64
docker push $DOCKER_BUILD_REPOSITORY:arm64
- name: Push multiarch manifest
run: |
#docker manifest create ${DOCKER_REPOSITORY}:latest $DOCKER_BUILD_REPOSITORY:amd64 $DOCKER_BUILD_REPOSITORY:arm64
docker manifest create ${DOCKER_REPOSITORY}:latest $DOCKER_BUILD_REPOSITORY:amd64
docker manifest create ${DOCKER_REPOSITORY}:latest $DOCKER_BUILD_REPOSITORY:amd64 $DOCKER_BUILD_REPOSITORY:arm64
docker manifest push $DOCKER_REPOSITORY:latest
#docker manifest create ${DOCKER_REPOSITORY}:$SHORT_SHA $DOCKER_BUILD_REPOSITORY:amd64 $DOCKER_BUILD_REPOSITORY:arm64
docker manifest create ${DOCKER_REPOSITORY}:$SHORT_SHA $DOCKER_BUILD_REPOSITORY:amd64
docker manifest create ${DOCKER_REPOSITORY}:$SHORT_SHA $DOCKER_BUILD_REPOSITORY:amd64 $DOCKER_BUILD_REPOSITORY:arm64
docker manifest push $DOCKER_REPOSITORY:$SHORT_SHA
VERSION=$(grep -m1 version preprocessing-pipeline-family.yaml | cut -d ' ' -f2)
#docker manifest create ${DOCKER_REPOSITORY}:$VERSION $DOCKER_BUILD_REPOSITORY:amd64 $DOCKER_BUILD_REPOSITORY:arm64
docker manifest create ${DOCKER_REPOSITORY}:$VERSION $DOCKER_BUILD_REPOSITORY:amd64
VERSION=$(grep -oP '(?<=__version__ = ")[^"]+' prepline_general/api/__version__.py)
docker manifest create ${DOCKER_REPOSITORY}:$VERSION $DOCKER_BUILD_REPOSITORY:amd64 $DOCKER_BUILD_REPOSITORY:arm64
docker manifest push ${DOCKER_REPOSITORY}:$VERSION

Check warning (Code scanning / CodeQL, Medium): Workflow does not contain permissions
Actions job or workflow does not limit the permissions of the GITHUB_TOKEN. Consider setting an explicit permissions block, using the following as a minimal starting point: {contents: read}
6 changes: 3 additions & 3 deletions .gitignore
@@ -81,9 +81,6 @@ target/
 profile_default/
 ipython_config.py

-# pyenv
-.python-version
-
 # pipenv
 # According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
 # However, in case of collaboration, if having platform-specific dependencies or dependencies
@@ -120,6 +117,9 @@ venv.bak/
 # mkdocs documentation
 /site

+# ruff
+.ruff_cache/
+
 # mypy
 .mypy_cache/
 .dmypy.json
1 change: 1 addition & 0 deletions .python-version
@@ -0,0 +1 @@
+3.12