Skip to content

excel

pinky_snowpark.excel

openpyxl helpers for building Excel workbooks from Snowpark DataFrames.

Requires the excel optional dependency::

pip install pinky-snowpark[excel]

All openpyxl imports are lazy — the module is importable without openpyxl installed, and errors surface only when a function is actually called.

CellFormat dataclass

Formatting descriptor for an Excel column (openpyxl).

All fields are optional — only non-None / non-False attributes are applied, leaving the openpyxl default intact for everything else.

Attributes:

Name Type Description
number_format str | None

openpyxl number format string (e.g. '#,##0.00 "€"').

bold bool

Bold font.

italic bool

Italic font.

font_color str | None

Font colour as RRGGBB hex without # (e.g. "FF0000").

font_size int | None

Font size in points.

bg_color str | None

Background fill colour as RRGGBB hex (e.g. "FFF2CC").

h_align str | None

Horizontal alignment — "left" | "center" | "right".

v_align str | None

Vertical alignment — "top" | "center" | "bottom".

wrap bool

Enable text wrap.

Source code in src/pinky_snowpark/excel.py
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
@dataclass
class CellFormat:
    """Formatting descriptor for an Excel column (openpyxl).

    All fields are optional — only non-None / non-False attributes are applied,
    leaving the openpyxl default intact for everything else.

    Attributes:
        number_format: openpyxl number format string (e.g. ``'#,##0.00 "€"'``).
        bold:          Bold font.
        italic:        Italic font.
        font_color:    Font colour as RRGGBB hex without ``#`` (e.g. ``"FF0000"``).
        font_size:     Font size in points.
        bg_color:      Background fill colour as RRGGBB hex (e.g. ``"FFF2CC"``).
        h_align:       Horizontal alignment — ``"left"`` | ``"center"`` | ``"right"``.
        v_align:       Vertical alignment — ``"top"`` | ``"center"`` | ``"bottom"``.
        wrap:          Enable text wrap.
    """
    number_format: str | None = None
    bold: bool = False
    italic: bool = False
    font_color: str | None = None
    font_size: int | None = None
    bg_color: str | None = None
    h_align: str | None = None
    v_align: str | None = None
    wrap: bool = False

ParseResult

Bases: NamedTuple

Return value of :func:parse_worksheet — supports tuple unpacking.

Attributes:

Name Type Description
headers list[str]

Column names after normalization.

rows list[dict[str, str | None]]

Row data — every cell value is cast to str, empty cells as None.

Source code in src/pinky_snowpark/excel.py
30
31
32
33
34
35
36
37
38
class ParseResult(NamedTuple):
    """Return value of :func:`parse_worksheet` — supports tuple unpacking.

    Attributes:
        headers: Column names after normalization.
        rows:    Row data — every cell value is cast to ``str``, empty cells as ``None``.
    """
    headers: list[str]
    rows: list[dict[str, str | None]]

create_workbook_from_df(df, sheet_name='Data', freeze_panes='A2', auto_fit=True, fixed_column_widths=None, wrap_columns=None, col_fmt=None, col_transform=None, wb=None, header_fmt=_DEFAULT_HEADER_FMT)

Create (or extend) an openpyxl Workbook from a Snowpark DataFrame.

Writes column headers from the DataFrame schema, then all data rows. Optionally applies: pane freeze, auto column widths, fixed widths, text wrap, per-column number formats and per-column value transforms.

Parameters:

Name Type Description Default
df 'snowpark.DataFrame'

Source Snowpark DataFrame.

required
sheet_name str

Sheet tab name (default "Data").

'Data'
freeze_panes str

Top-left cell of the frozen region (default "A2").

'A2'
auto_fit bool

Auto-size column widths from content (default True).

True
fixed_column_widths dict[str, int] | None

{"A": 20, "C": 40} — overrides auto_fit for those columns.

None
wrap_columns list[int] | None

1-based column indices where text wrap is enabled.

None
col_fmt dict[int | str, str | CellFormat] | None

{col: preset_name_or_CellFormat} applied to data rows. Column keys accept either a 1-based integer index (3) or a letter string ("C"); both forms may coexist. Available presets: date_fr, date_iso, date_us, date_ch, datetime_fr, eur, gbp, chf, usd, pct, pct_val, int, decimal, bold, center, right, wrap, bg_yellow, bg_red, bg_green, bg_blue, bg_grey, eur_bold, eur_right, pct_center, header, header_dark.

None
col_transform dict[int | str, Callable[[Any], Any]] | None

{col: callable} applied to each cell value before writing. Column keys accept either a 1-based integer index or a letter string (same as col_fmt). Use functions from pinky_core.fmt such as format_boolean, format_duration, format_trend, format_stars, format_fraction, format_number.

None
wb 'Workbook | None'

Existing Workbook to add a sheet to (None = new Workbook).

None
header_fmt CellFormat | None

Format applied to header row (default: bold 12pt).

_DEFAULT_HEADER_FMT

Returns:

Type Description
'Workbook'

openpyxl Workbook with the populated sheet.

Source code in src/pinky_snowpark/excel.py
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
def create_workbook_from_df(
    df: "snowpark.DataFrame",
    sheet_name: str = "Data",
    freeze_panes: str = "A2",
    auto_fit: bool = True,
    fixed_column_widths: dict[str, int] | None = None,
    wrap_columns: list[int] | None = None,
    col_fmt: dict[int | str, str | CellFormat] | None = None,
    col_transform: dict[int | str, Callable[[Any], Any]] | None = None,
    wb: "Workbook | None" = None,
    header_fmt: CellFormat | None = _DEFAULT_HEADER_FMT,
) -> "Workbook":
    """Create (or extend) an openpyxl Workbook from a Snowpark DataFrame.

    Writes column headers from the DataFrame schema, then all data rows.
    Optionally applies: pane freeze, auto column widths, fixed widths, text
    wrap, per-column number formats and per-column value transforms.

    Args:
        df:                   Source Snowpark DataFrame.
        sheet_name:           Sheet tab name (default ``"Data"``).
        freeze_panes:         Top-left cell of the frozen region (default ``"A2"``).
        auto_fit:             Auto-size column widths from content (default ``True``).
        fixed_column_widths:  ``{"A": 20, "C": 40}`` — overrides auto_fit for those columns.
        wrap_columns:         1-based column indices where text wrap is enabled.
        col_fmt:              ``{col: preset_name_or_CellFormat}`` applied to data rows.
                              Column keys accept either a 1-based integer index (``3``)
                              or a letter string (``"C"``); both forms may coexist.
                              Available presets: ``date_fr``, ``date_iso``, ``date_us``,
                              ``date_ch``, ``datetime_fr``, ``eur``, ``gbp``, ``chf``,
                              ``usd``, ``pct``, ``pct_val``, ``int``, ``decimal``,
                              ``bold``, ``center``, ``right``, ``wrap``, ``bg_yellow``,
                              ``bg_red``, ``bg_green``, ``bg_blue``, ``bg_grey``,
                              ``eur_bold``, ``eur_right``, ``pct_center``,
                              ``header``, ``header_dark``.
        col_transform:        ``{col: callable}`` applied to each cell value before
                              writing. Column keys accept either a 1-based integer index
                              or a letter string (same as ``col_fmt``).
                              Use functions from ``pinky_core.fmt`` such as
                              ``format_boolean``, ``format_duration``, ``format_trend``,
                              ``format_stars``, ``format_fraction``, ``format_number``.
        wb:                   Existing Workbook to add a sheet to (``None`` = new Workbook).
        header_fmt:           Format applied to header row (default: bold 12pt).

    Returns:
        openpyxl Workbook with the populated sheet.
    """
    from openpyxl import Workbook as _Workbook
    from openpyxl.styles import Alignment, Font

    col_fmt = _resolve_col_keys(col_fmt) if col_fmt else None
    col_transform = _resolve_col_keys(col_transform) if col_transform else None

    if wb is None:
        wb = _Workbook()
        ws: Worksheet = wb.active  # type: ignore[assignment]
    else:
        ws = wb.create_sheet()  # type: ignore[assignment]
    ws.title = sheet_name

    headers = [f.name.strip('"') for f in df.schema.fields]
    ws.append(headers)

    if header_fmt:
        font_kwargs: dict[str, Any] = {}
        if header_fmt.bold:
            font_kwargs["bold"] = True
        if header_fmt.font_size:
            font_kwargs["size"] = header_fmt.font_size
        if header_fmt.font_color:
            font_kwargs["color"] = header_fmt.font_color
        if font_kwargs:
            for col_idx in range(1, len(headers) + 1):
                ws.cell(row=1, column=col_idx).font = Font(**font_kwargs)

    for row in [r.as_dict(recursive=True) for r in df.collect()]:
        values: list[Any] = list(row.values())
        if col_transform:
            values = [
                col_transform[i + 1](v) if (i + 1) in col_transform else v
                for i, v in enumerate(values)
            ]
        ws.append(values)

    ws.freeze_panes = freeze_panes

    if auto_fit:
        _auto_fit_columns(ws)

    if fixed_column_widths:
        for col_letter, width in fixed_column_widths.items():
            ws.column_dimensions[col_letter].width = width

    if wrap_columns:
        alignment = Alignment(wrapText=True, vertical="center")
        for col_idx in wrap_columns:
            for row_idx in range(2, ws.max_row + 1):
                ws.cell(row=row_idx, column=col_idx).alignment = alignment

    if col_fmt:
        for col_idx, fmt in col_fmt.items():
            resolved = CELL_FORMATS[fmt] if isinstance(fmt, str) else fmt
            _apply_cell_format(ws, col_idx, resolved)

    return wb

parse_worksheet(ws, *, normalize_headers=True, deduplicate=True)

Parse an openpyxl Worksheet into a list of row dicts.

Skips leading empty rows to find the header row, then reads all subsequent non-empty rows as string values.

Typical use: download an XLSX from an API (INSEE Melodi, DGEFP…), open with openpyxl.load_workbook(io.BytesIO(xlsx_bytes)), then call this function on each sheet before building a Snowpark DataFrame.

Parameters:

Name Type Description Default
ws 'Worksheet'

openpyxl Worksheet object (read-only or normal).

required
normalize_headers bool

Convert headers to UPPER_SNAKE_CASE using to_upper_snake_case from pinky_core.fmt (default True). Set False to keep raw header strings as-is.

True
deduplicate bool

Append _2, _3… to duplicate header names after normalization (default True).

True

Returns:

Type Description
ParseResult

Tuple (headers, rows) where headers is the list of normalized

ParseResult

column names and rows is a list[dict[str, str | None]] — cell

ParseResult

values cast to str, empty cells as None.

Example::

import io, openpyxl
from pinky_snowpark import parse_worksheet

wb = openpyxl.load_workbook(io.BytesIO(xlsx_bytes), read_only=True, data_only=True)
for sheet_name in wb.sheetnames:
    headers, rows = parse_worksheet(wb[sheet_name])
    # → build Snowpark DataFrame or stage_write(session, rows, ...)
Source code in src/pinky_snowpark/excel.py
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
def parse_worksheet(
    ws: "Worksheet",
    *,
    normalize_headers: bool = True,
    deduplicate: bool = True,
) -> ParseResult:
    """Parse an openpyxl Worksheet into a list of row dicts.

    Skips leading empty rows to find the header row, then reads all
    subsequent non-empty rows as string values.

    Typical use: download an XLSX from an API (INSEE Melodi, DGEFP…),
    open with ``openpyxl.load_workbook(io.BytesIO(xlsx_bytes))``, then
    call this function on each sheet before building a Snowpark DataFrame.

    Args:
        ws:                 openpyxl Worksheet object (read-only or normal).
        normalize_headers:  Convert headers to UPPER_SNAKE_CASE using
                            ``to_upper_snake_case`` from ``pinky_core.fmt``
                            (default ``True``).  Set ``False`` to keep raw
                            header strings as-is.
        deduplicate:        Append ``_2``, ``_3``… to duplicate header names
                            after normalization (default ``True``).

    Returns:
        Tuple ``(headers, rows)`` where ``headers`` is the list of normalized
        column names and ``rows`` is a ``list[dict[str, str | None]]`` — cell
        values cast to ``str``, empty cells as ``None``.

    Example::

        import io, openpyxl
        from pinky_snowpark import parse_worksheet

        wb = openpyxl.load_workbook(io.BytesIO(xlsx_bytes), read_only=True, data_only=True)
        for sheet_name in wb.sheetnames:
            headers, rows = parse_worksheet(wb[sheet_name])
            # → build Snowpark DataFrame or stage_write(session, rows, ...)
    """
    from pinky_core.fmt import to_upper_snake_case

    rows_iter = ws.iter_rows(values_only=True)

    raw_headers: list[str] | None = None
    for row in rows_iter:
        if any(cell is not None for cell in row):
            raw_headers = [
                str(c) if c is not None else f"COL_{i}"
                for i, c in enumerate(row)
            ]
            break

    if not raw_headers:
        return [], []

    headers = (
        [to_upper_snake_case(h) for h in raw_headers]
        if normalize_headers
        else list(raw_headers)
    )

    if deduplicate:
        seen: dict[str, int] = {}
        deduped: list[str] = []
        for h in headers:
            if h in seen:
                seen[h] += 1
                deduped.append(f"{h}_{seen[h]}")
            else:
                seen[h] = 0
                deduped.append(h)
        headers = deduped

    data_rows: list[dict[str, str | None]] = []
    for row in rows_iter:
        if not any(cell is not None for cell in row):
            continue
        cells = list(row)
        if len(cells) < len(headers):
            cells += [None] * (len(headers) - len(cells))
        else:
            cells = cells[: len(headers)]
        data_rows.append({
            h: str(v) if v is not None else None
            for h, v in zip(headers, cells)
        })

    return ParseResult(headers, data_rows)