Parsing large XML files with the StAX parser

Foreword

Most of this article is excerpted from IBM developerWorks (mainly the theory); the three articles listed below cover it in detail. I made the excerpt mainly to deepen my own understanding, as notes and as a reference for later use. It is not comprehensive, and the originals are much richer, so see them for details.

Reference article:

Parsing XML with StAX, Part 1: Introduction to the Streaming API for XML (StAX): http://www.ibm.com/developerworks/cn/xml/x-stax1.html
Parsing XML with StAX, Part 2: Pull parsing and events: http://www.ibm.com/developerworks/cn/xml/x-stax2.html
Parsing XML with StAX, Part 3: Using custom events and writing XML: http://www.ibm.com/developerworks/cn/xml/x-stax3.html
—————
Original link: https://blog.csdn.net/zhyh1986/article/details/8528649

I will not describe StAX itself in much detail. Instead, let me describe the problem I ran into while parsing an XML file.

Requirement:
I need to extract every entity element from a 4GB XML file, including its nested child elements and their content, and write the extracted entities evenly into 7 new XML files.

There are generally four ways to parse XML:

  • DOM parsing
  • SAX parsing
  • DOM4J parsing
  • JDOM parsing

The pros and cons of these four methods compare as follows:

1. SAX parsing (Simple API for XML)

SAX parsing method: the parser scans the document sequentially and reports events as it goes, so analysis happens during the scan. Unlike DOM, a SAX parse can be stopped at any point, and it is generally the faster and lighter method.

Advantages: The entire document does not need to be loaded in advance, so it takes up few resources. Parsing can start immediately, is fast, and puts little pressure on memory.

Disadvantage: nodes cannot be modified.

Applicable: reading XML files
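
As a rough illustration of the scan-and-react model described above, the sketch below counts entity elements with the SAX parser that ships with the JDK. The class name SaxEntityCounter and the sample string in main are made up for this note; only the javax.xml.parsers and org.xml.sax calls are real JDK APIs.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: counts <entity> start tags while the parser streams through the input.
final class SaxEntityCounter {

    static int countEntities(String xml) {
        final int[] count = {0};
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName, String qName, Attributes attrs) {
                // Called once per start tag; the parser keeps no tree of what it has already seen.
                // The default factory is not namespace-aware, so the tag name arrives in qName.
                if ("entity".equals(qName)) {
                    count[0]++;
                }
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("SAX parsing failed", e);
        }
        return count[0];
    }

    public static void main(String[] args) {
        String xml = "<gwl><entities><entity id=\"1\"/><entity id=\"2\"/></entities></gwl>";
        System.out.println(countEntities(xml));
    }
}
```

The handler only observes events, which is why SAX suits reading but not modifying a document.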

2. DOM parsing (Document Object Model)

DOM parsing method: defines a set of interfaces for parsing XML documents. The parser reads in the entire document and builds a tree structure in memory, which can then be manipulated using the DOM interface.

Advantages: The entire document tree is in memory, easy to operate; supports multiple functions such as deletion, modification, rearrangement, etc.

Disadvantage: The entire document, including nodes you do not need, is loaded into memory, which wastes time and space. For a relatively large file this puts memory under pressure and makes parsing take longer.

Applicable: modifying XML data
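
To make the "whole tree in memory, easy to modify" point concrete, here is a minimal sketch using the DOM parser bundled with the JDK. The class DomRenameDemo and its method are hypothetical names for this note; the input is assumed to be small enough to fit in memory.

```java
import java.io.ByteArrayInputStream;
import java.io.StringWriter;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

// Hypothetical helper: loads the whole document, edits one node, and serializes the tree again.
final class DomRenameDemo {

    static String renameFirst(String xml, String newName) {
        try {
            // The complete tree is built in memory before any node can be touched.
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            NodeList names = doc.getElementsByTagName("name");
            if (names.getLength() > 0) {
                // In-place modification of the first <name> element.
                names.item(0).setTextContent(newName);
            }
            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            transformer.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("DOM processing failed", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(renameFirst("<entity id=\"1\"><name>OLD</name></entity>", "NEW"));
    }
}
```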

3. JDOM

JDOM is a pure Java API for processing XML. It offers tree traversal and follows Java conventions in the way SAX does. JDOM differs from DOM in two main respects.

First, JDOM uses only concrete classes and no interfaces. This simplifies the API in some ways, but also limits flexibility.

Second, the API makes extensive use of the Java Collections classes, which simplifies its use for Java developers who are already familiar with them.

JDOM itself does not contain a parser. It typically uses a SAX2 parser to parse and validate the input XML document (although it can also take as input a previously constructed DOM representation). It contains converters to output JDOM representations as SAX2 event streams, DOM models or XML text documents.

Advantages: 1. It is a tree-based Java API for processing XML and loads the tree into memory.

2. It has no backward-compatibility constraints, so it is simpler than DOM.

3. It is fast.

4. It follows Java conventions, as SAX does.

Disadvantages: 1. It cannot handle documents larger than memory.

2. JDOM represents the logical model of an XML document and does not guarantee that every byte is reproduced faithfully.

3. It provides no real model of the DTD or schema for instance documents.

4. It has no counterpart to the traversal package in DOM.

4. DOM4J

DOM4J has a more complex API, which also gives it greater flexibility than JDOM. DOM4J performs very well, and even Sun's JAXM uses it. Many open source projects rely heavily on DOM4J; the well-known Hibernate, for example, uses DOM4J to read its XML configuration files. If portability is not a concern, use DOM4J.

Advantages: highest flexibility, ease of use and powerful functions, excellent performance

Disadvantages: complex API, poor portability

I tried essentially all four of these methods against the requirement above.

The first was DOM parsing, but it can only handle smaller XML files. Because it loads the entire document at once, a file that is too large causes an out-of-memory error.

I then tried DOM4J and SAX, but given the memory available on my machine, the JVM still reported out-of-memory errors.

With no other option left, I finally found that StAX can also parse large XML files.
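
For orientation: StAX is a pull API, meaning the application asks the reader for the next event instead of receiving callbacks. The sketch below shows that loop in its smallest form by listing the id attribute of each entity element. The class name StaxIdLister and the sample input are assumptions for illustration only.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper: pulls events one at a time and records the id of every <entity>.
final class StaxIdLister {

    static List<String> entityIds(String xml) {
        List<String> ids = new ArrayList<>();
        try {
            XMLStreamReader reader = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new StringReader(xml));
            while (reader.hasNext()) {
                // next() advances the cursor; only the current event is held by the reader.
                if (reader.next() == XMLStreamConstants.START_ELEMENT
                        && "entity".equals(reader.getLocalName())) {
                    // A null namespace means "match the attribute name in any namespace".
                    ids.add(reader.getAttributeValue(null, "id"));
                }
            }
            reader.close();
        } catch (XMLStreamException e) {
            throw new IllegalStateException("StAX parsing failed", e);
        }
        return ids;
    }

    public static void main(String[] args) {
        String xml = "<gwl><entities><entity id=\"1123831\"/><entity id=\"1124766\"/></entities></gwl>";
        System.out.println(entityIds(xml));
    }
}
```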

An excerpt of the XML file to be parsed:

<?xml version='1.0' encoding='UTF-8'?>
<gwl>
<version>20230417084108</version>
<entities>
<entity id="1123831" version="20230414163503">
    <name>ALMOND, LINCOLN CARTER</name>
    <listId>1021</listId>
    <listCode>USP</listCode>
    <entityType>03</entityType>
    <createdDate>09/02/2004</createdDate>
    <lastUpdateDate>04/14/2023</lastUpdateDate>
    <source>USP</source>
    <OriginalSource>PEP</OriginalSource>
    <dobs>
        <dob Y="1936">06/16/1936</dob>
    </dobs>
    <pobs>
        <pob>Pawtucket, Rhode Island, United States</pob>
    </pobs>
    <titles>
        <title>FORMER GOVERNOR OF RHODE ISLAND (JANUARY 3, 1995 - JANUARY 7, 2003). DECEASED JANUARY 02, 2023.</title>
    </titles>
    <sdfs>
        <sdf name="OtherInformation">Career: Governor of Rhode Island (January 03, 1995 - January 07, 2003); United State Attorney for the District of Rhode Island (October 09, 1981 - January 20, 1993); State Attorney for the District of Rhode Island (1969 - 1978).</sdf>
        <sdf name="DirectID">https://accuity.worldcompliance.com/signin.aspx?ent=d14d930f-7943-4363-b4d0-aa2c59437e1b</sdf>
        <sdf name="EffectiveDate">1981</sdf>
        <sdf name="EntityLevel">State</sdf>
        <sdf name="ExpirationDate">1993</sdf>
        <sdf name="Gender">MALE</sdf>
        <sdf name="NameSource">Website</sdf>
        <sdf name="Org_PID">1706394</sdf>
        <sdf name="OriginalID">7031</sdf>
        <sdf name="Relationship">Father</sdf>
        <sdf name="SubCategory">Former PEP</sdf>
    </sdfs>
    <addresses>
        <address>
            <country>US</country>
            <countryName>UNITED STATES</countryName>
        </address>
    </addresses>
</entity>
<entity id="1124766" version="20230414163503">
    <name>BAUCUS, MAX SIEBEN</name>
    <listId>1021</listId>
    <listCode>USP</listCode>
    <entityType>03</entityType>
    <createdDate>09/02/2004</createdDate>
    <lastUpdateDate>04/14/2023</lastUpdateDate>
    <source>USP</source>
    <OriginalSource>PEP</OriginalSource>
    <dobs>
        <dob Y="1941">12/11/1941</dob>
    </dobs>
    <pobs>
        <pob>Helena, Montana, United States</pob>
    </pobs>
    <aliases>
        <alias type="Alias">ENKE, MAX SIEBEN</alias>
    </aliases>
    <titles>
        <title>FORMER AMBASSADOR OF THE UNITED STATES TO CHINA (MARCH 20, 2014 - JANUARY 16, 2017).</title>
    </titles>
    <sdfs>
        <sdf name="OtherInformation">Political Party: Democratic. Career: Ambassador Extraordinary and Plenipotentiary of the United States to China, (March 20, 2014 - January 16, 2017); Member of the United States Congress, Senate from Montana (December 15, 1978 - February 06, 2014);</sdf>
        <sdf name="DirectID">https://accuity.worldcompliance.com/signin.aspx?ent=945fd382-f5b7-42c4-ad1f-a40c4bf0e285</sdf>
        <sdf name="EffectiveDate">1978</sdf>
        <sdf name="EntityLevel">National</sdf>
        <sdf name="ExpirationDate">2014</sdf>
        <sdf name="Gender">MALE</sdf>
        <sdf name="NameSource">Website</sdf>
        <sdf name="Org_PID">548118</sdf>
        <sdf name="OriginalID">7542</sdf>
        <sdf name="Relationship">Brother</sdf>
        <sdf name="SubCategory">Former PEP</sdf>
    </sdfs>
    <addresses>
        <address>
            <country>US</country>
            <countryName>UNITED STATES</countryName>
            <province>WASHINGTON, DC</province>
            <postalCode>20515</postalCode>
        </address>
        <address>
            <country>US</country>
            <countryName>UNITED STATES</countryName>
            <province>WASHINGTON, D.C.</province>
            <postalCode>20510</postalCode>
        </address>
        <address>
            <address1>55 ANJIALOU RD</address1>
            <city>BEIJING</city>
            <country>CN</country>
            <countryName>CHINA</countryName>
            <postalCode>100600</postalCode>
        </address>
    </addresses>
</entity>
<entity id="1124842" version="20230414163503">
    <name>THOMAS, CRAIG LYLE</name>
    <listId>1021</listId>
    <listCode>USP</listCode>
    <entityType>03</entityType>
    <createdDate>09/02/2004</createdDate>
    <lastUpdateDate>04/14/2023</lastUpdateDate>
    <source>USP</source>
    <OriginalSource>PEP</OriginalSource>
    <dobs>
        <dob Y="1933">02/17/1933</dob>
    </dobs>
    <pobs>
        <pob>Cody, Wyoming, United States</pob>
    </pobs>
    <titles>
        <title>FORMER MEMBER OF THE UNITED STATES CONGRESS (JANUARY 03, 1995 - JUNE 04, 2007). DECEASED JUNE 04, 2007.</title>
    </titles>
    <sdfs>
        <sdf name="OtherInformation">Political Party: Republican. Career: Member of the United States Congress, Senate, Class I (January 03, 1995 - June 04, 2007); Member of the United States Congress, House of Representatives , At-Large (April 27, 1989 - January 03, 1995). Member of the</sdf>
        <sdf name="DirectID">https://accuity.worldcompliance.com/signin.aspx?ent=4e7b1050-36b5-4b1c-9037-c2349c519d40</sdf>
        <sdf name="EffectiveDate">1989</sdf>
        <sdf name="EntityLevel">National</sdf>
        <sdf name="ExpirationDate">1995</sdf>
        <sdf name="Gender">MALE</sdf>
        <sdf name="NameSource">Website</sdf>
        <sdf name="Org_PID">1817490</sdf>
        <sdf name="OriginalID">7629</sdf>
        <sdf name="Relationship">Father</sdf>
        <sdf name="SubCategory">Former PEP</sdf>
    </sdfs>
    <addresses>
        <address>
            <country>US</country>
            <countryName>UNITED STATES</countryName>
            <province>WASHINGTON D.C.</province>
            <postalCode>20510</postalCode>
        </address>
        <address>
            <address1>200 WEST 24TH STREET</address1>
            <city>CHEYENNE</city>
            <state>WY</state>
            <stateName>WYOMING</stateName>
            <country>US</country>
            <countryName>UNITED STATES</countryName>
            <postalCode>82002</postalCode>
        </address>
    </addresses>
</entity>
<entity id="1125230" version="20230414163051">
    <name>PATRIAT, FRANCOIS</name>
    <listId>1020</listId>
    <listCode>PEP</listCode>
    <entityType>03</entityType>
    <createdDate>09/02/2004</createdDate>
    <lastUpdateDate>04/14/2023</lastUpdateDate>
    <source>PEP</source>
    <OriginalSource>PEP</OriginalSource>
    <dobs>
        <dob Y="1943">03/21/1943</dob>
    </dobs>
    <pobs>
        <pob>Semur-en-Auxois, , France</pob>
    </pobs>
    <titles>
        <title>MEMBER OF THE FRENCH PARLIAMENT (OCTOBER 01, 2008 - 2026).</title>
    </titles>
    <sdfs>
        <sdf name="OtherInformation">Political party: La Republique en marche (LREM) (currently known as Renaissance). Career: Member of the Executive Bureau of La Republique en Marche (LREM), The Republic on the Move (currently known as Renaissance), effective from November 18, 2017;</sdf>
        <sdf name="DirectID">https://accuity.worldcompliance.com/signin.aspx?ent=a4ffd4f3-5c75-440b-aeca-4e3a7d2ef642</sdf>
        <sdf name="EffectiveDate">2008</sdf>
        <sdf name="EntityLevel">National</sdf>
        <sdf name="ExpirationDate">2026</sdf>
        <sdf name="Gender">MALE</sdf>
        <sdf name="NameSource">Website</sdf>
        <sdf name="Org_PID">3759009</sdf>
        <sdf name="OriginalID">8117</sdf>
        <sdf name="Relationship">Associate</sdf>
        <sdf name="SubCategory">Govt Branch Member</sdf>
    </sdfs>
    <addresses>
        <address>
            <address1>15, RUE DE VAUGIRARD</address1>
            <city>PARIS</city>
            <country>FR</country>
            <countryName>FRANCE</countryName>
            <postalCode>75291</postalCode>
        </address>
    </addresses>
</entity>
<entity id="1125282" version="20230414163052">
    <name>BENOUTIQ, ABDELKRIM</name>
    <listId>1020</listId>
    <listCode>PEP</listCode>
    <entityType>03</entityType>
    <createdDate>09/02/2004</createdDate>
    <lastUpdateDate>04/14/2023</lastUpdateDate>
    <source>PEP</source>
    <OriginalSource>PEP</OriginalSource>
    <dobs>
        <dob Y="1959">08/19/1959</dob>
    </dobs>
    <pobs>
        <pob>Rabat, Rabat-Sale-Kenitra Region, Morocco</pob>
    </pobs>
    <aliases>
        <alias type="Alias">BEN ATIQ, ABDELKRIM</alias>
        <alias type="Alias">BENATIQ, ABDELKRIM</alias>
    </aliases>
    <nativeCharNames>
        <nativeCharName charSet="" latinCharName="BEN ATIQ, ABDELKRIM" type="Alias">?   ?</nativeCharName>
        <nativeCharName charSet="" latinCharName="BENATIQ, ABDELKRIM" type="Alias">?  ?</nativeCharName>
        <nativeCharName charSet="" latinCharName="BENOUTIQ, ABDELKRIM" type="Primary">?  ?</nativeCharName>
    </nativeCharNames>
    <titles>
        <title>FORMER MEMBER OF THE POLITICAL BUREAU OF SOCIALIST UNION OF POPULAR FORCES PARTY, MOROCCO, ELECTED JUNE 10, 2017, EFFECTIVE UNTIL APRIL 24, 2022.</title>
    </titles>
    <sdfs>
        <sdf name="OtherInformation">Political Party: Union Socialiste Des Forces Populaires (USFP) Career: Member of the Political Bureau of Union Socialiste Des Forces Populaires (USFP), Socialist Union of Popular Forces Party, elected June 10, 2017 , effective until April 24, 2022;</sdf>
        <sdf name="DirectID">https://accuity.worldcompliance.com/signin.aspx?ent=35f8bcea-6169-4a8f-9715-81de730d1c17</sdf>
        <sdf name="EffectiveDate">2000</sdf>
        <sdf name="EntityLevel">National</sdf>
        <sdf name="ExpirationDate">2001</sdf>
        <sdf name="Gender">MALE</sdf>
        <sdf name="NameSource">Website</sdf>
        <sdf name="OriginalID">8181</sdf>
        <sdf name="SubCategory">Former PEP</sdf>
    </sdfs>
    <addresses>
        <address>
            <address1>9, AVENUE AL ARAAR</address1>
            <city>RABAT</city>
            <country>MA</country>
            <countryName>MOROCCO</countryName>
            <province>RABAT-SALE-KENITRA REGION</province>
        </address>
        <address>
            <address1>AVENUE F. ROOSEVELT</address1>
            <city>RABAT</city>
            <country>MA</country>
            <countryName>MOROCCO</countryName>
            <province>RABAT-SALE-KENITRA REGION</province>
        </address>
        <address>
            <address1>NO. 9 ARAR STREET</address1>
            <city>RABAT</city>
            <country>MA</country>
            <countryName>MOROCCO</countryName>
            <province>RABAT-SALE-KENITRA REGION</province>
        </address>
    </addresses>
</entity>
<entity id="1125443" version="20230414163053">
    <name>OLLING, SVEND</name>
    <listId>1020</listId>
    <listCode>PEP</listCode>
    <entityType>03</entityType>
    <createdDate>09/02/2004</createdDate>
    <lastUpdateDate>04/14/2023</lastUpdateDate>
    <source>PEP</source>
    <OriginalSource>PEP</OriginalSource>
    <dobs>
        <dob Y="1967">11/09/1967</dob>
    </dobs>
    <pobs>
        <pob>Glostrup, , Denmark</pob>
    </pobs>
    <titles>
        <title>AMBASSADOR OF DENMARK TO SOUTH KOREA, AS OF MARCH 30, 2023.</title>
    </titles>
    <sdfs>
        <sdf name="OtherInformation">Career: Ambassador of Denmark to South Korea, as of March 30, 2023; Ambassador of Denmark to Egypt, as of May 28, 2020, expiration reported March 20, 2023; Non-Resident Ambassador of Denmark to Azerbaijan, effective from March 26, 2017, expiration</sdf>
        <sdf name="DirectID">https://accuity.worldcompliance.com/signin.aspx?ent=ef160921-f06b-4942-9527-0ee7565467c0</sdf>
        <sdf name="EffectiveDate">2023</sdf>
        <sdf name="EntityLevel">International</sdf>
        <sdf name="Gender">MALE</sdf>
        <sdf name="NameSource">Website</sdf>
        <sdf name="Org_PID">8698914</sdf>
        <sdf name="OriginalID">8384</sdf>
        <sdf name="Relationship">Father</sdf>
        <sdf name="SubCategory">Diplomat</sdf>
    </sdfs>
    <addresses>
        <address>
            <address1>416, HANGANG-DAERO, JUNG-GU</address1>
            <city>SEOUL</city>
            <country>KR</country>
            <countryName>KOREA, REPUBLIC OF</countryName>
            <postalCode>04637</postalCode>
        </address>
        <address>
            <address1>TURAN GUENES BULVARI 106</address1>
            <city>ANKARA</city>
            <country>TR</country>
            <countryName>TURKEY</countryName>
            <postalCode>06550</postalCode>
        </address>
        <address>
            <address1>ASIATISK PLADS 2</address1>
            <city>COPENHAGEN</city>
            <country>DK</country>
            <countryName>DENMARK</countryName>
            <postalCode>1448</postalCode>
        </address>
        <address>
            <address1>NORTH AVENUE</address1>
            <city>DHAKA</city>
            <country>BD</country>
            <countryName>BANGLADESH</countryName>
            <postalCode>1212</postalCode>
        </address>
        <address>
            <city>CAIRO</city>
            <country>EG</country>
            <countryName>EGYPT</countryName>
        </address>
    </addresses>
</entity>
<entity id="1125610" version="20230414163054">
    <name>TAKAHASHI, KOICHI</name>
    <listId>1020</listId>
    <listCode>PEP</listCode>
    <entityType>03</entityType>
    <createdDate>09/02/2004</createdDate>
    <lastUpdateDate>04/14/2023</lastUpdateDate>
    <source>PEP</source>
    <OriginalSource>PEP</OriginalSource>
    <dobs>
        <dob Y="1944">1944</dob>
    </dobs>
    <nativeCharNames>
        <nativeCharName charSet="" latinCharName="TAKAHASHI, KOICHI" type="Primary">たかはしこういち</nativeCharName>
        <nativeCharName charSet="" latinCharName="TAKAHASHI, KOICHI" type="Primary">Takahashi Hengichi</nativeCharName>
    </nativeCharNames>
    <titles>
        <title>FORMER AMBASSADOR OF JAPAN TO THE CZECH REPUBLIC (FEBRUARY 03, 2003 - 2005).</title>
    </titles>
    <sdfs>
        <sdf name="OtherInformation">Career: Ambassador of Japan to the Czech Republic (February 03, 2003 - 2005); Deputy Vice-Minister in charge of Immigration Bureau, Ministry of Justice (1999 - 2001); Consul-General of Japan to Berlin City, Germany (1995 - 1997); Minister of Japan to</sdf>
        <sdf name="DirectID">https://accuity.worldcompliance.com/signin.aspx?ent=9b2a063e-8d55-4806-b2f2-f2c79d815a33</sdf>
        <sdf name="EffectiveDate">1999</sdf>
        <sdf name="EntityLevel">National</sdf>
        <sdf name="ExpirationDate">2001</sdf>
        <sdf name="Gender">MALE</sdf>
        <sdf name="NameSource">Website</sdf>
        <sdf name="OriginalID">8483</sdf>
        <sdf name="SubCategory">Former PEP</sdf>
    </sdfs>
    <addresses>
        <address>
            <country>JP</country>
            <countryName>JAPAN</countryName>
        </address>
    </addresses>
</entity>
<entity id="1125925" version="20230414163054">
    <name>PINTER, SANDOR</name>
    <listId>1020</listId>
    <listCode>PEP</listCode>
    <entityType>03</entityType>
    <createdDate>09/02/2004</createdDate>
    <lastUpdateDate>04/14/2023</lastUpdateDate>
    <source>PEP</source>
    <OriginalSource>PEP</OriginalSource>
    <dobs>
        <dob Y="1948">07/03/1948</dob>
    </dobs>
    <pobs>
        <pob>Budapest, , Hungary</pob>
    </pobs>
    <titles>
        <title>DEPUTY PRIME MINISTER OF HUNGARY, EFFECTIVE FROM MAY 04, 2018.</title>
    </titles>
    <sdfs>
        <sdf name="OtherInformation">Career: Deputy Prime Minister, effective from May 04, 2018; Minister of Interior, effective from May 29, 2010; Minister of Interior (July 08, 1998 - May 27, 2002); of the Hungarian National Police (September 18, 1991 - 1996).</sdf>
        <sdf name="DirectID">https://accuity.worldcompliance.com/signin.aspx?ent=cd135a22-6242-4999-bc6f-5aae5b0f92e2</sdf>
        <sdf name="EffectiveDate">2018</sdf>
        <sdf name="EntityLevel">National</sdf>
        <sdf name="Gender">MALE</sdf>
        <sdf name="NameSource">Website</sdf>
        <sdf name="Org_PID">2544374</sdf>
        <sdf name="OriginalID">11549</sdf>
        <sdf name="Relationship">Father</sdf>
        <sdf name="SubCategory">Govt Branch Member</sdf>
    </sdfs>
    <addresses>
        <address>
            <address1>TEVE U. 4-6.</address1>
            <city>BUDAPEST</city>
            <country>HU</country>
            <countryName>HUNGARY</countryName>
            <postalCode>1139</postalCode>
        </address>
        <address>
            <address1>JOZSEF ATTILA U. 2-4.</address1>
            <city>BUDAPEST</city>
            <country>HU</country>
            <countryName>HUNGARY</countryName>
            <postalCode>1051</postalCode>
        </address>
    </addresses>
</entity>
</entities>
</gwl>

The following code uses StAX to extract every entity element from the XML file above and write the entities evenly into 7 new XML files, each in a fixed custom format:

import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.InputStream;
import java.io.OutputStream;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;
import javax.xml.stream.XMLStreamWriter;

public class StAXParserTest {
    public static void main(String[] args) {
        String inputFile = "D:\\Desktop\\PEP\\ENTITY.XML"; // input XML file path
        String outputPrefix = "D:\\Desktop\\PEP\\"; // output XML file prefix
        int numFiles = 7; // number of new files

        try {
            // create the XML input factory
            XMLInputFactory inputFactory = XMLInputFactory.newInstance();
            // create the input stream
            InputStream inputStream = new FileInputStream(inputFile);
            // create the XMLStreamReader from the input factory
            XMLStreamReader reader = inputFactory.createXMLStreamReader(inputStream);

            // create the XML output factory
            XMLOutputFactory outputFactory = XMLOutputFactory.newInstance();
            // one output stream per new file
            OutputStream[] outputStreams = new OutputStream[numFiles];
            // one XMLStreamWriter per new file
            XMLStreamWriter[] writers = new XMLStreamWriter[numFiles];

            for (int i = 0; i < numFiles; i++) {
                String outputFileName = outputPrefix + (i + 1) + ".xml";
                outputStreams[i] = new FileOutputStream(outputFileName);
                // pass the encoding explicitly so it matches the XML declaration written below
                writers[i] = outputFactory.createXMLStreamWriter(outputStreams[i], "UTF-8");
                // write the XML declaration at the head of the file
                writers[i].writeStartDocument("UTF-8", "1.0");
                // add a line break
                writers[i].writeCharacters("\n");
                // open the gwl element
                writers[i].writeStartElement("gwl");
                writers[i].writeCharacters("\n");
                // open the version element and write its value
                writers[i].writeStartElement("version");
                writers[i].writeCharacters("20230417084108");
                // close the version element: </version>
                writers[i].writeEndElement();
                writers[i].writeCharacters("\n");
                // open the entities element; it is closed after all entities are written
                writers[i].writeStartElement("entities");
            }

            // parse XML and write to new file
            int currentFileIndex = 0;
            int entityCount = 0;

            while (reader.hasNext()) {
                int event = reader.next();

                switch (event) {
                    case XMLStreamConstants.START_ELEMENT:
                        String elementName = reader.getLocalName();
                        if ("entity".equals(elementName)) {
                            // parse the entity element and its child elements
                            writeEntityElement(reader, writers[currentFileIndex]);
                            entityCount++;

                            // switch to the next file
                            currentFileIndex = (currentFileIndex + 1) % numFiles;
                        }
                        break;
                }
            }

            // finish each document, then close its writer and output stream
            for (int i = 0; i < numFiles; i++) {
                writers[i].writeCharacters("\n");
                writers[i].writeEndElement(); // </entities>
                writers[i].writeCharacters("\n");
                writers[i].writeEndElement(); // </gwl>
                writers[i].writeCharacters("\n");
                writers[i].writeEndDocument();
                writers[i].flush();
                writers[i].close();
                outputStreams[i].close();
            }

            // close the reader and the input stream
            reader.close();
            inputStream.close();

            System.out.println("total number of entities: " + entityCount);
            // integer division: when the count is not a multiple of numFiles, the first files hold one more
            System.out.println("Entities per file (rounded down): " + (entityCount / numFiles));

        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    private static void writeEntityElement(XMLStreamReader reader, XMLStreamWriter writer) throws XMLStreamException {
        writer.writeCharacters("\n");
        // open the entity element and copy its attributes (id, version)
        writer.writeStartElement("entity");
        copyAttributes(reader, writer);

        // copy the child elements of the entity element
        while (reader.hasNext()) {
            int event = reader.next();
            switch (event) {
                case XMLStreamConstants.START_ELEMENT:
                    // open the child element and copy its attributes, e.g. Y on dob or name on sdf
                    writer.writeStartElement(reader.getLocalName());
                    copyAttributes(reader, writer);
                    break;

                case XMLStreamConstants.END_ELEMENT:
                    // close the most recently opened element
                    writer.writeEndElement();
                    if ("entity".equals(reader.getLocalName())) {
                        // the whole entity element has been copied
                        return;
                    }
                    break;

                case XMLStreamConstants.CHARACTERS:
                    writer.writeCharacters(reader.getText());
                    break;
            }
        }
    }

    // Copies every attribute of the reader's current start element to the writer.
    private static void copyAttributes(XMLStreamReader reader, XMLStreamWriter writer) throws XMLStreamException {
        for (int i = 0; i < reader.getAttributeCount(); i++) {
            writer.writeAttribute(reader.getAttributeLocalName(i), reader.getAttributeValue(i));
        }
    }
}

The XML excerpt above contains 8 entity elements in total. The entities are assigned to the 7 output files in rotation, so each file receives one and the eighth goes back to the first file. The first XML file therefore holds 2 entities, and each of the other 6 holds 1.
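
The distribution can be checked with a few lines of arithmetic. This is only a sketch of the rotation rule (entity k, counted from 0, goes to file k % 7); the class name RoundRobinCounts is made up for illustration.

```java
import java.util.Arrays;

// Hypothetical helper: counts how many items each output receives under round-robin assignment.
final class RoundRobinCounts {

    static int[] perFile(int total, int files) {
        int[] counts = new int[files];
        for (int k = 0; k < total; k++) {
            // Same rule as the rotating file index: item k goes to output k % files.
            counts[k % files]++;
        }
        return counts;
    }

    public static void main(String[] args) {
        // 8 entities over 7 files
        System.out.println(Arrays.toString(perFile(8, 7))); // prints [2, 1, 1, 1, 1, 1, 1]
    }
}
```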

I have parsed the complete 4GB Entity.xml file this way. There was no out-of-memory error, and parsing was also very fast!
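
StAX also has an event-iterator API (XMLEventReader and XMLEventWriter) in which a whole event, attributes included, can be handed to the writer in one call. The sketch below is an alternative I did not use for the 4GB file, so treat it as an untested-at-scale assumption; EntityEventCopier is a hypothetical name, and it copies only the first entity subtree from a small string.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLEventWriter;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.XMLEvent;

// Hypothetical helper: copies the first <entity> subtree event by event.
final class EntityEventCopier {

    static String copyFirstEntity(String xml) {
        StringWriter out = new StringWriter();
        try {
            XMLEventReader reader = XMLInputFactory.newInstance()
                    .createXMLEventReader(new StringReader(xml));
            XMLEventWriter writer = XMLOutputFactory.newInstance().createXMLEventWriter(out);
            int depth = 0; // 0 means "not inside an entity yet"
            while (reader.hasNext()) {
                XMLEvent event = reader.nextEvent();
                if (depth == 0) {
                    if (event.isStartElement()
                            && "entity".equals(event.asStartElement().getName().getLocalPart())) {
                        depth = 1;
                        writer.add(event); // the start event carries its attributes
                    }
                    continue;
                }
                writer.add(event);
                if (event.isStartElement()) {
                    depth++;
                } else if (event.isEndElement() && --depth == 0) {
                    break; // the matching </entity> has been written
                }
            }
            writer.flush();
            writer.close();
            reader.close();
        } catch (XMLStreamException e) {
            throw new IllegalStateException("StAX event copy failed", e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String xml = "<gwl><entities><entity id=\"1\"><dobs><dob Y=\"1936\">06/16/1936</dob></dobs>"
                + "</entity><entity id=\"2\"/></entities></gwl>";
        System.out.println(copyFirstEntity(xml));
    }
}
```

Tracking depth instead of comparing element names means the copy stops at the matching end tag even if an element named entity were nested inside another one.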