Fast serialize Excel file to XML with apache poi

Excel is a human readable and writable format, and XML is an important machine language. We need an efficient bridge between those two technologies.

Since 2007, the Excel files are a zip file containing XML data. Those XML files must be serialized into a minimal XML containing the cells data inside tags with the column headers as names. The resulting XML will look like that :

<?xml version="1.0" encoding="UTF-8"?>
<pre:root xmlns:pre="http://uri/">
	<pre:feuil1>
		<pre:ligne>
			<pre:header1>A</pre:header1>
			<pre:header2>B</pre:header2>
		</pre:ligne>
		<pre:ligne>
			<pre:header1>A</pre:header1>
			<pre:header2>B</pre:header2>
		</pre:ligne>
	</pre:feuil1>
	<pre:feuil2>
	</pre:feuil2>
	<pre:feuil3>
	</pre:feuil3>
</pre:root>

Here is the code to do it.

Read Excel file

In that post, I only treat the case of the recent XLSX format. I will use Apache POI and I want it to be able to manage very big files, so I have to use an event API.

The POI streaming API only treats the old XLS format. For XLSX, you have to use the SAX API which only helps you providing access to a stream of the XML of the sheets. The reading of the tags must be coded into your SAX handler and the given example is very useful but not complete.

The code in order to open a file and trigger the SAX reading is :

OPCPackage p = OPCPackage.open(xls.getPath(), PackageAccess.READ);
XSSFReader xssfReader = new XSSFReader(p);
XSSFReader.SheetIterator iter = (XSSFReader.SheetIterator) xssfReader.getSheetsData();
while (iter.hasNext()) {
    InputStream stream = iter.next();
    InputSource sheetSource = new InputSource(stream);
    SAXParserFactory saxFactory = SAXParserFactory.newInstance();
    SAXParser saxParser = saxFactory.newSAXParser();
    XMLReader sheetParser = saxParser.getXMLReader();
    sheetParser.setContentHandler(this);
    sheetParser.parse(sheetSource);
    stream.close();
}

Write XML

As I said, I want that serializer to be fast. So I will not use a XML API in order to write the XML output. With some lines of codes, I write the content as a XML containing a namespace for reusability.

Here is the code :

public void writeXmlStart(Writer writer) throws IOException {
        writer.write("<?xml version=\"1.0\" encoding=\"UTF-8\"?>");
        writeTag(writer, "root");
        writer.write(" xmlns:");
        writer.write(PREFIX);
        writer.write("=\"");
        writer.write("http://uri/");
        writer.write('"');
        writeSuperior(writer);
    }

    public void writeSheet(Writer writer, String normalizedName) throws IOException {
        writeTag(writer, normalizedName);
        writeSuperior(writer);
    }

    public void writeLine(Writer writer, StringWriter rowStr) throws IOException {
        writer.write('\t');
        writeTag(writer, LIGNE);
        writeSuperior(writer);
        writer.write(rowStr.toString());
        writer.write('\t');
        writeClosingTag(writer, LIGNE);
    }

    public void writeCell(Writer writer, String cellToString, String colHeader) throws IOException {
        writer.write("\t\t");
        writeTag(writer, colHeader);
        writer.write('>');
        String escaped = StringEscapeUtils.escapeXml10(cellToString);
        writer.write(escaped);
        writeClosingTag(writer, colHeader);
    }

    public void writeSuperior(Writer writer) throws IOException {
        writer.write(">\n");
    }

    public void writeClosingTag(Writer writer, String name) throws IOException {
        writer.write("</");
        writeQName(writer, name);
        writeSuperior(writer);
    }

    public void writeTag(Writer writer, String name) throws IOException {
        writer.write('<');
        writeQName(writer, name);
    }

    public void writeQName(Writer writer, String name) throws IOException {
        writer.write(PREFIX);
        writer.write(':');
        writer.write(name);
    }

Normalizing sheet names and column headers

In order to avoid issues with XML tag names, I normalize the sheet names and the column headers I found. It removes accents, normalizes spaces, and omits any characters which is not a letter, ‘_’ or a digit.

Here is the normalizing code :

public static String normalize(final String input) {
        final StringBuilder validName = new StringBuilder();
        final String spaceNormalized = StringUtils.normalizeSpace(input);
        final String trimed = Normalizer.normalize(spaceNormalized, Normalizer.Form.NFD);
        if (trimed.isEmpty()) {
            return "_";
        }
        final char firstChar = trimed.charAt(0);
        if (Character.isLetter(firstChar) || firstChar == '_') {
            validName.append(Character.toLowerCase(firstChar));
        } else if (Character.isDigit(firstChar)) {
            validName.append("_");
            validName.append(firstChar);
        }
        for (int i = 1; i < trimed.length(); i++) {
            final char charAt = trimed.charAt(i);
            if (Character.isWhitespace(charAt)) {
                validName.append("_");
            } else if (Character.isDigit(charAt) || charAt == '_') {
                validName.append(charAt);
            } else if (Character.isLetter(charAt)) {
                validName.append(Character.toLowerCase(charAt));
            }
        }
        return validName.toString();
    }

Quick overview of XSLX format

The contents of the cells are inside the tag <sheetData> of the XML files of the sheets.
In the files I use, I saw 3 different structures :

    • one with attribute t="inlineStr" :
<row r="1">
    <c r="A1" s="2" t="inlineStr">
        <is>
            <t>First column header</t>
        </is>
    </c>
</row>
    • a structure with the value in clear, with attribute t="n" :
<row r="2">
   <c r="A2" s="3" t="n">
       <v>18.0</v>
   </c>
</row>
    • and a last case, where it uses the “shared strings”. In that case the value is the index of the String in that list of shared strings :
<row r="2" spans="1:34" ht="38.25">
    <c r="A2" s="3" t="s">
        <v>29</v>
    </c>
</row>

I did not see any tag <inlineStr>, as it is expected in example from Apache POI. And in the case of an attribute t="inlineStr", the value is in a tag <t> instead of <v>, as expected in example.

Whole code

You can download the whole code here :
FastXlsx2XmlSerializer

Categories: Excel Tags: Tags: , ,

Write a ResultSet to Excel with Apache POI

We often face the need to export data in a human-readable and human-writable way. Export to Excel seems natural, giving access to all his tools of filtering, etc…

I had the occasion to do it a generic way, I share it here. It does not cover all cases of data found in database, up to you to complete it and maybe share your improvements.

Dump headers

First step is to dump the column headers, found in ResultSetMetaData. The cells will be styled in BOLD font.

public HSSFWorkbook dump(ResultSet resultSet) throws SQLException {
    HSSFWorkbook workbook = new HSSFWorkbook();
    HSSFFont boldFont = workbook.createFont();
    boldFont.setBoldweight(HSSFFont.BOLDWEIGHT_BOLD);
    HSSFSheet sheet = workbook.createSheet("sheet");
    HSSFRow titleRow = sheet.createRow(0);
    ResultSetMetaData metaData = resultSet.getMetaData();
    int columnCount = metaData.getColumnCount();
    for (int colIndex = 0; colIndex &lt; columnCount; colIndex++) {
        String title = metaData.getColumnLabel(colIndex + 1);
        HSSFCell cell = HSSFCellUtil.createCell(titleRow, colIndex, title);
        HSSFCellStyle style = workbook.createCellStyle();
        style.setFont(boldFont);
        cell.setCellStyle(style);
    }
    return workbook;
}

Dump data

Then the resultSet is browsed (a call to beforeFirst() resets the cursor), and different cases of Object retrieved are covered. That does not cover all different object types the driver can return.

    private void dumpData(ResultSet resultSet, HSSFSheet sheet, int columnCount) throws SQLException {
        int currentRow = 1;
        resultSet.beforeFirst();
        while (resultSet.next()) {
            HSSFRow row = sheet.createRow(currentRow++);
            for (int colIndex = 0; colIndex < columnCount; colIndex++) {
                Object value = resultSet.getObject(colIndex + 1);
                final HSSFCell cell = row.createCell(colIndex);
                if (value == null) {
                    cell.setCellValue("");
                } else {
                    if (value instanceof Calendar) {
                        cell.setCellValue((Calendar) value);
                    } else if (value instanceof Date) {
                        cell.setCellValue((Date) value);
                    } else if (value instanceof String) {
                        cell.setCellValue((String) value);
                    } else if (value instanceof Boolean) {
                        cell.setCellValue((Boolean) value);
                    } else if (value instanceof Double) {
                        cell.setCellValue((Double) value);
                    } else if (value instanceof BigDecimal) {
                        cell.setCellValue(((BigDecimal) value).doubleValue());
                    }
                }
            }
        }
        for (int i = 0; i < columnCount; i++) {
            sheet.autoSizeColumn(i);
        }
    }

(Optional) Add cell formatting

If you want to format the cells, you can parameter a map between JDBC formats and Excel formatting. The JDBC formats will be set with regular expressions. Such a map can be initialized with a Spring file, like this :

<util:map map-class="java.util.LinkedHashMap">
    <entry key="NUMBER\(\d+,2\)" value="0.00" />
    <entry key="NUMBER\(\d+,0\)" value="0" />
    <entry key="NUMBER.*" value="0.###" />
    <entry key="DECIMAL\(\d+,2\)" value="0.00" />
    <entry key="DECIMAL\(\d+,0\)" value="0" />
    <entry key="DECIMAL.*" value="0.###" />
    <entry key=".*CHAR.*" value="text" />
    <entry key="DATE.*" value="dd/MM/yyyy" />
    <entry key="TIMESTAMP.*" value="dd/MM/yyyy" />
</util:map>

Then in the code, add the methods :


    private LinkedHashMap<String, String> mapSqlTypeExcelFormat;

    private HSSFCellStyle getDataStyle(HSSFWorkbook workbook, ResultSetMetaData metaData, int colIndex, HSSFDataFormat dataFormat) throws SQLException {
        HSSFCellStyle dataStyle = workbook.createCellStyle();
        String columnType = metaData.getColumnTypeName(colIndex + 1);
        columnType += "(" + metaData.getPrecision(colIndex + 1);
        columnType += "," + metaData.getScale(colIndex + 1) + ")";
        String excelFormat = getExcelFormat(columnType);
        final short format = dataFormat.getFormat(excelFormat);
        dataStyle.setDataFormat(format);
        return dataStyle;
    }

    private String getExcelFormat(String columnType) {
        for (Map.Entry<String, String> entry : mapSqlTypeExcelFormat.entrySet()) {
            if (Pattern.matches(entry.getKey(), columnType)) {
                return entry.getValue();
            }
        }
        return "text";
    }

Whole code

Java Class :
ResultSetToExcel

JUnit Test Class :
ResultSetToExcelTest

Categories: Excel Tags: Tags: , ,