⚠ This page is served via a proxy. Original site: https://github.com
This service does not collect credentials or authentication data.
Skip to content

Conversation

@AlenkaF
Copy link
Member

@AlenkaF AlenkaF commented Dec 22, 2025

Rationale for this change

In many places in the Python User Guide the code exampels are written with IPython directive (elsewhere code-block is used). IPython directives are converted to IPython format (In and Out during the doc build). This can lead to slower builds.

What changes are included in this PR?

IPython directives are converted to runnable code-block (with >>> and ...) and pytest doctest support for .rst files is added to the conda-python-docs CI job. This means the code in the Python User Guide is tested separately to the building of the documentation.

Are these changes tested?

Yes, with the CI.

Are there any user-facing changes?

Changes to the Python User Guide examples will have to be tested with pytest --doctest-glob='*.rst' docs/source/python/file.rst

@AlenkaF
Copy link
Member Author

AlenkaF commented Dec 23, 2025

Converting this PR to draft till I figure out what would be the best way to run RST doctest on 3.12 Sphinx Documentation CI job and not on the Python 3.10 Sphinx & Numpydoc.

@AlenkaF
Copy link
Member Author

AlenkaF commented Jan 8, 2026

@raulcd the main pain point why AMD64 Conda Python 3.10 Sphinx & Numpydoc fails and AMD64 Conda Python 3.12 Sphinx Documentation succeeds is the Python version and the use of datetime.UTC which was only added in Python 3.11, see https://docs.python.org/3/library/datetime.html#datetime.UTC.

I think the easiest solution would be to run Sphinx & Numpydoc on Python 3.11, or even Python 3.12 (I am not aware of any reason we would need the olderst Python version we support here. Sphinx Documentation runs on docs changes only while Sphinx & Numpydoc runs on any Python or C++ changes and validates the docstrings).

@raulcd
Copy link
Member

raulcd commented Jan 8, 2026

Thanks for checking that @AlenkaF ! So currently we are providing a snippet on our documentation:

.. ipython:: python
    :okexcept:

    import datetime

    current_year = datetime.datetime.now(datetime.UTC).year
    for table_chunk in birthdays_dataset.to_batches():
        print("AGES", pc.subtract(current_year, table_chunk["years"]))

that will fail for some users as we are still supporting Python 3.10, right? Is it worth for the example to add the datetime.UTC? Should we just use for the example: current_year = datetime.datetime.now().year
Or maybe add a comment with a note?

I am ok to just bump the Python version of the job but we probably should not provide examples that will fail on some of the supported versions.

@AlenkaF
Copy link
Member Author

AlenkaF commented Jan 8, 2026

Yeah, you are right. Changing to datetime.datetime.now().year or even datetime.datetime.now(datetime.timezone.utc) makes much more sense! Will update 👍

@AlenkaF
Copy link
Member Author

AlenkaF commented Jan 8, 2026

Ha ha, the example would fail anyways as the year changed in the meantime 🤣
Probably it is best to just hardcode it.

@raulcd
Copy link
Member

raulcd commented Jan 8, 2026

Ha ha, the example would fail anyways as the year changed in the meantime 🤣 Probably it is best to just hardcode it.

Yes, we don't want to have to update this every year because the data changes 😄

@AlenkaF AlenkaF force-pushed the gh-28859-python-docs-examples-testing branch from f417177 to 1fb6f0a Compare January 9, 2026 08:03
@AlenkaF
Copy link
Member Author

AlenkaF commented Jan 12, 2026

@github-actions crossbow submit preview-docs

@github-actions
Copy link

Revision: 1fb6f0a

Submitted crossbow builds: ursacomputing/crossbow @ actions-b1a47e0770

Task Status
preview-docs GitHub Actions

@AlenkaF AlenkaF marked this pull request as ready for review January 12, 2026 09:42
@AlenkaF
Copy link
Member Author

AlenkaF commented Jan 12, 2026

Hi @rmnskb @tadeja @zhengruifeng @HyukjinKwon! In case anybody fancies giving a review, it would be much appreciated.
This PR looks like a big change but it only unifies how we write code examples in the Python User guide (code-block and not ipython directive. Note >>> is needed in order for the examples to be tested).

Link to the preview: https://s3.amazonaws.com/arrow-data/pr_docs/48619/python/index.html

This PR also adds a doctest of the .rst files to the two existing documentation CI jobs. One job runs only with changes to the documentation, the other job runs with changes in the C++ and Python code. cc @raulcd in case you have time to look at the ci/ changes.

Copy link
Contributor

@rmnskb rmnskb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM 🔥 Thanks for working on that! I can imagine it was a tremendous amount of work. Left some general comments about some smaller things that I picked up while looking at the PR, otherwise I think it's good to merge.

.. code-block:: python
>>> import pyarrow as pa
>>> import pyarrow.compute as pc
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did you decide to opt out from the explicit imports? Does the documentation still compile?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I decided to not duplicate imports per page. Meaning once on the top should suffice (see lines above, approx line 32). Yes, the compilation of docs and doctest should work.

Copy link
Member Author

@AlenkaF AlenkaF Jan 13, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: if we keep the explicit imports, the examples will be copy-paste able which might be more friendly to users.

@github-actions github-actions bot added awaiting committer review Awaiting committer review and removed awaiting review Awaiting review labels Jan 12, 2026
@HyukjinKwon
Copy link
Member

ack. taking a look now

@AlenkaF
Copy link
Member Author

AlenkaF commented Jan 13, 2026

Thanks for the review @rmnskb! 🎈

| naive| aware|
+-------------------+-------------------+
|2019-01-01 00:00:00|2019-01-01 08:00:00|
|2018-12-31 23:00:00|2019-01-01 08:00:00|
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think 2019-01-01 00:00:00 became 2018-12-31 23:00:00 here cuz I suspect you or CI (?) is somewhere in GMT+1. datetime(2019, 1, 1, 0) is assumed as local time (yes it's up to the system to interpret but Spark thinks so). So, Spark thought that it's a local time but the timezone was set as UTC so it decreased one hour.

I think we should probably just skip all here cuz now it seems depending on local timezone.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe it's better to just keep the original input/output here and skip all. I will take a separate look.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, agree - these tests should already be skipped. I can remove the diff and keep it as it was before?
The example is good altogether and shows the timezone conversion behaviour. Maybe we could add a note? Claude suggests:

.. note::
   The examples above demonstrate timezone conversion behaviour.
   The exact output may differ depending on your system's local timezone, as Spark
   interprets naive timestamps relative to the local timezone when converting to UTC.

Copy link
Member

@HyukjinKwon HyukjinKwon Jan 13, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can remove the diff and keep it as it was before?

yupyup. whichever easier. I will take a separate look after this gets merged.

<FileInfo for 'test.arrow': type=FileType.File, size=3250>
.. code-block:: python
>>> local.get_file_info('test.arrow')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is also a really nitpick .. feel free to ignore it for now as I don't want this kind of comment to block the PR.

Should we remove the file after the test somewhere somehow? Seems like test.arrow file will be created but not removed?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All saved files should be cleaned up thanks to the fixture here: https://github.com/apache/arrow/pull/48619/files#diff-de7516be9fdc98a7bfc2fe897bd93bff7c7d0a5d62ea50759956c9745082a310.

I have tested this locally.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PS: not a nitpick, I think this is a very valid comment!

Copy link
Member

@HyukjinKwon HyukjinKwon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TL;DR: LGTM

How much does it save time BTW? Some might argue that IPython input/output are better and the speed of building could be considered as a secondary (e.g., PySpark doc build takes super duper long - it generates things a lot). For myself, I prefer faster build in any event so my take is on this change.

@AlenkaF
Copy link
Member Author

AlenkaF commented Jan 13, 2026

How much does it save time BTW? Some might argue that IPython input/output are better and the speed of building could be considered as a secondary (e.g., PySpark doc build takes super duper long - it generates things a lot). For myself, I prefer faster build in any event so my take is on this change.

My main aim was to unify the docs and have the possibility of running doctest on the examples separately. But am curious if there is any change in performance so I will try it out now 😄 (not sure if the amount of IPython directives has been that big before this change, though).

.. code-block:: python
>>> import pyarrow as pa
>>> import pyarrow.compute as pc
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: if we keep the explicit imports, the examples will be copy-paste able which might be more friendly to users.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants