Fast serialize Excel file to XML with apache poi

Fast serialize Excel file to XML with apache poi

Excel is a human readable and writable format, and XML is THE machine language. We need an efficient bridge between those two technologies, for example in order to send the content of an Excel file to a Web Service.

Since 2007, the Excel files are a zip file containing XML data. Using that XML directly is difficult, I must serialize the data into a minimal XML containing the cells data inside tags with the column headers as names. The resulting XML will look like that :

<?xml version="1.0" encoding="UTF-8"?>
<pre:root xmlns:pre="http://uri/">
	<pre:feuil1>
		<pre:ligne>
			<pre:header1>A</pre:header1>
			<pre:header2>B</pre:header2>
		</pre:ligne>
		<pre:ligne>
			<pre:header1>A</pre:header1>
			<pre:header2>B</pre:header2>
		</pre:ligne>
	</pre:feuil1>
	<pre:feuil2>
	</pre:feuil2>
	<pre:feuil3>
	</pre:feuil3>
</pre:root>

Here is the code to do it.

Read Excel file

In that post, I only treat the case of the recent XLSX format. I will use Apache POI and I want it to be able to manage very big files, so I have to use an event API.

The POI streaming API only treats the old XLS format. For XLSX, you have to use the SAX API which only helps you providing access to a stream of the XML of the sheets. The reading of the tags must be coded into your SAX handler and the given example is very useful but not complete.

I started from that example of XLSX serialization to CSV in order to code that serializer.

The code in order to open a file and trigger the SAX reading is :

OPCPackage p = OPCPackage.open(xls.getPath(), PackageAccess.READ);
XSSFReader xssfReader = new XSSFReader(p);
XSSFReader.SheetIterator iter = (XSSFReader.SheetIterator) xssfReader.getSheetsData();
while (iter.hasNext()) {
    InputStream stream = iter.next();
    InputSource sheetSource = new InputSource(stream);
    SAXParserFactory saxFactory = SAXParserFactory.newInstance();
    SAXParser saxParser = saxFactory.newSAXParser();
    XMLReader sheetParser = saxParser.getXMLReader();
    sheetParser.setContentHandler(this);
    sheetParser.parse(sheetSource);
    stream.close();
}

Write XML

As I said, I want that serializer to be fast. So I will not use a XML API in order to write the XML output. With some lines of codes, I write the content as a XML containing a namespace for reusability.

Here is the code :

public void writeXmlStart(Writer writer) throws IOException {
        writer.write("<?xml version=\"1.0\" encoding=\"UTF-8\"?>");
        writeTag(writer, "root");
        writer.write(" xmlns:");
        writer.write(PREFIX);
        writer.write("=\"");
        writer.write("http://uri/");
        writer.write('"');
        writeSuperior(writer);
    }

    public void writeSheet(Writer writer, String normalizedName) throws IOException {
        writeTag(writer, normalizedName);
        writeSuperior(writer);
    }

    public void writeLine(Writer writer, StringWriter rowStr) throws IOException {
        writer.write('\t');
        writeTag(writer, LIGNE);
        writeSuperior(writer);
        writer.write(rowStr.toString());
        writer.write('\t');
        writeClosingTag(writer, LIGNE);
    }

    public void writeCell(Writer writer, String cellToString, String colHeader) throws IOException {
        writer.write("\t\t");
        writeTag(writer, colHeader);
        writer.write('>');
        String escaped = StringEscapeUtils.escapeXml10(cellToString);
        writer.write(escaped);
        writeClosingTag(writer, colHeader);
    }

    public void writeSuperior(Writer writer) throws IOException {
        writer.write(">\n");
    }

    public void writeClosingTag(Writer writer, String name) throws IOException {
        writer.write("</");
        writeQName(writer, name);
        writeSuperior(writer);
    }

    public void writeTag(Writer writer, String name) throws IOException {
        writer.write('<');
        writeQName(writer, name);
    }

    public void writeQName(Writer writer, String name) throws IOException {
        writer.write(PREFIX);
        writer.write(':');
        writer.write(name);
    }

Normalizing sheet names and column headers

In order to avoid issues with XML tag names, I normalize the sheet names and the column headers I found. It removes accents, normalizes spaces, and omits any characters which is not a letter, ‘_’ or a digit.

Here is the normalizing code :

public static String normalize(final String input) {
        final StringBuilder validName = new StringBuilder();
        final String spaceNormalized = StringUtils.normalizeSpace(input);
        final String trimed = Normalizer.normalize(spaceNormalized, Normalizer.Form.NFD);
        if (trimed.isEmpty()) {
            return "_";
        }
        final char firstChar = trimed.charAt(0);
        if (Character.isLetter(firstChar) || firstChar == '_') {
            validName.append(Character.toLowerCase(firstChar));
        } else if (Character.isDigit(firstChar)) {
            validName.append("_");
            validName.append(firstChar);
        }
        for (int i = 1; i < trimed.length(); i++) {
            final char charAt = trimed.charAt(i);
            if (Character.isWhitespace(charAt)) {
                validName.append("_");
            } else if (Character.isDigit(charAt) || charAt == '_') {
                validName.append(charAt);
            } else if (Character.isLetter(charAt)) {
                validName.append(Character.toLowerCase(charAt));
            }
        }
        return validName.toString();
    }

Quick overview of XSLX format

The contents of the cells are inside the tag <sheetData> of the XML files of the sheets.
In the files I use, I saw 3 different structures :

    • one with attribute t="inlineStr" :
<row r="1">
    <c r="A1" s="2" t="inlineStr">
        <is>
            <t>First column header</t>
        </is>
    </c>
</row>
    • a structure with the value in clear, with attribute t="n" :
<row r="2">
   <c r="A2" s="3" t="n">
       <v>18.0</v>
   </c>
</row>
    • and a last case, where it uses the “shared strings”. In that case the value is the index of the String in that list of shared strings :
<row r="2" spans="1:34" ht="38.25">
    <c r="A2" s="3" t="s">
        <v>29</v>
    </c>
</row>

I did not see any tag <inlineStr>, as it is expected in example from Apache POI. And in the case of an attribute t="inlineStr", the value is in a tag <t> instead of <v>, as expected in example.

Whole code

You can download the whole code here :
FastXlsx2XmlSerializer

No comments.

Leave a Reply

Your email address will not be published. Required fields are marked *

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <s> <strike> <strong>