Reading the content of a Word docx document with PHP

PHP reads the text and pictures in a Word document and saves them.

1. Install PHPWord with Composer

composer require phpoffice/phpword
Package page: https://packagist.org/packages/phpoffice/phpword

2. Read a docx document with PHPWord (note that the file must be in docx format, not doc format)

If your file is in doc format, save it as docx first; if you have a lot of doc documents, you can use a batch conversion tool: http://www.batchwork.com/en/doc2doc/download.htm
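Before handing a file to the reader, it can help to check which of the two formats it really is, since the extension alone may be wrong. The following is a minimal sketch, not part of PHPWord: the function name sniffWordFormat is hypothetical, and it only inspects the leading bytes (a docx is a ZIP archive starting with "PK\x03\x04", a legacy doc is an OLE compound file), so any other ZIP archive would also be reported as docx.

```php
<?php
/**
 * Hypothetical helper (not a PHPWord API): guess the Word format
 * of a file from its leading "magic" bytes.
 *
 * Returns 'docx' for a ZIP container, 'doc' for an OLE compound
 * file, and 'unknown' for anything else or an unreadable file.
 */
function sniffWordFormat(string $path): string
{
    $handle = @fopen($path, 'rb');
    if ($handle === false) {
        return 'unknown';
    }
    $magic = (string) fread($handle, 8);
    fclose($handle);

    // ZIP local file header; docx files are ZIP archives.
    if (strncmp($magic, "PK\x03\x04", 4) === 0) {
        return 'docx';
    }
    // OLE compound file header used by legacy .doc files.
    if ($magic === "\xD0\xCF\x11\xE0\xA1\xB1\x1A\xE1") {
        return 'doc';
    }
    return 'unknown';
}
```

A file reported as 'doc' would need to be converted before loading it as described in this section.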

If you haven't configured autoloading yet, do that first:

require './vendor/autoload.php';

Load the document:

$dir = str_replace('\\', '/', __DIR__) . '/';
$source = $dir . 'test.docx';
$phpWord = \PhpOffice\PhpWord\IOFactory::load($source);

3. Key points

1) Alignment: \PhpOffice\PhpWord\Style\Paragraph -> getAlignment()

2) Font name: \PhpOffice\PhpWord\Style\Font -> getName()

3) Font size: \PhpOffice\PhpWord\Style\Font -> getSize()

4) Whether the text is bold: \PhpOffice\PhpWord\Style\Font -> isBold()

5) Read pictures: \PhpOffice\PhpWord\Element\Image -> getImageStringData()

6) Save the base64 image data as an image file: file_put_contents($imageSrc, base64_decode($imageData))
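The last key point can be wrapped in a small function that also creates the target directory and rejects invalid input. This is only a sketch under stated assumptions: saveBase64Image is a hypothetical name, not a PHPWord function, and it assumes the input is plain base64 (possibly wrapped across lines) without a "data:" prefix.

```php
<?php
/**
 * Hypothetical helper (not a PHPWord API): decode base64 image data
 * and write it to $path, creating the directory if needed.
 *
 * Returns true on success, false on invalid base64 or a write error.
 */
function saveBase64Image(string $base64, string $path): bool
{
    // The data may be wrapped across lines; strip all whitespace first.
    $clean = preg_replace('/\s+/', '', $base64);

    // Strict mode returns false when the input is not valid base64.
    $binary = base64_decode($clean, true);
    if ($binary === false) {
        return false;
    }

    $dir = dirname($path);
    if (!is_dir($dir) && !mkdir($dir, 0777, true) && !is_dir($dir)) {
        return false;
    }

    return file_put_contents($path, $binary) !== false;
}
```

With an image element, the call might look like saveBase64Image($image->getImageStringData(true), $imageSrc), where $image and $imageSrc are the values described in the key points above.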

4. Complete code

require './vendor/autoload.php';

function docx2html($source)
{
    $phpWord = \PhpOffice\PhpWord\IOFactory::load($source);
    $html = '';
    foreach ($phpWord->getSections() as $section) {
        foreach ($section->getElements() as $ele1) {
            // Not every element type (for example a table) has a paragraph style
            $paragraphStyle = method_exists($ele1, 'getParagraphStyle') ? $ele1->getParagraphStyle() : null;
            if ($paragraphStyle instanceof \PhpOffice\PhpWord\Style\Paragraph) {
                $html .= '<p style="text-align:' . $paragraphStyle->getAlignment() . ';">';
            } else {
                $html .= '<p>';
            }
            if ($ele1 instanceof \PhpOffice\PhpWord\Element\TextRun) {
                foreach ($ele1->getElements() as $ele2) {
                    if ($ele2 instanceof \PhpOffice\PhpWord\Element\Text) {
                        $style = $ele2->getFontStyle();
                        // Work in GBK internally; the result is converted back to UTF-8 on return
                        $fontFamily = mb_convert_encoding((string) $style->getName(), 'GBK', 'UTF-8');
                        $fontSize = $style->getSize();
                        $isBold = $style->isBold();
                        $styleString = '';
                        $fontFamily && $styleString .= "font-family:{$fontFamily};";
                        // PHPWord font sizes are in points
                        $fontSize && $styleString .= "font-size:{$fontSize}pt;";
                        $isBold && $styleString .= "font-weight:bold;";
                        $html .= sprintf('<span style="%s">%s</span>',
                            $styleString,
                            mb_convert_encoding($ele2->getText(), 'GBK', 'UTF-8')
                        );
                    } elseif ($ele2 instanceof \PhpOffice\PhpWord\Element\Image) {
                        // The images/ directory must already exist and be writable
                        $imageSrc = 'images/' . md5($ele2->getSource()) . '.' . $ele2->getImageExtension();
                        $imageData = $ele2->getImageStringData(true);
                        // $imageData = 'data:' . $ele2->getImageType() . ';base64,' . $imageData;
                        file_put_contents($imageSrc, base64_decode($imageData));
                        $html .= '<img src="' . $imageSrc . '">';
                    }
                }
            }
            $html .= '</p>';
        }
    }

    return mb_convert_encoding($html, 'UTF-8', 'GBK');
}

$dir = str_replace('\\', '/', __DIR__) . '/';
$source = $dir . 'test.docx';
echo docx2html($source);

5. Supplement

Obviously, this is only a simple example of reading a Word document. It reads just the paragraph alignment, the font name, size and boldness of the text, and the pictures; other information such as text color and line height is ignored. If you need more, please check the PHPWord source code yourself and see which getter methods are available in classes such as \PhpOffice\PhpWord\Style\xxx and \PhpOffice\PhpWord\Element\xxx.
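As an illustration of picking up a little more of that information, the sketch below builds an inline CSS string from a font style object. It is not the author's code: fontStyleToCss is a hypothetical helper, the parameter is duck-typed so that any object with these getters works, and it assumes the style object offers isItalic() and getColor() (returning a hex value without "#") in addition to the getters listed in the key points. Check these against the \PhpOffice\PhpWord\Style\Font class in your installed version.

```php
<?php
/**
 * Hypothetical helper: turn a font style object into inline CSS.
 * $font only needs the getters called below; isItalic() and
 * getColor() are assumed to exist on the PHPWord Font style.
 */
function fontStyleToCss(object $font): string
{
    $css = [];

    if ($font->getName()) {
        $css[] = "font-family:'" . $font->getName() . "'";
    }
    if ($font->getSize()) {
        // Assumes the size is expressed in points.
        $css[] = 'font-size:' . $font->getSize() . 'pt';
    }
    if ($font->isBold()) {
        $css[] = 'font-weight:bold';
    }
    if ($font->isItalic()) {
        $css[] = 'font-style:italic';
    }
    // Only emit a color when it looks like a 6-digit hex value.
    $color = (string) $font->getColor();
    if (preg_match('/^[0-9a-f]{6}$/i', $color) === 1) {
        $css[] = 'color:#' . $color;
    }

    return implode(';', $css);
}
```

The returned string can be placed in a style attribute that is delimited by double quotes, since the font name is wrapped in single quotes.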

6. 2020-07-21 Supplement

You can directly obtain the complete html using the following method

$phpWord = \PhpOffice\PhpWord\IOFactory::load('xxx.docx');
$xmlWriter = \PhpOffice\PhpWord\IOFactory::createWriter($phpWord, 'HTML');
$html = $xmlWriter->getContent();
Note: the HTML content includes the head part. If you only need the style and body, you have to process it yourself. The images are embedded as base64; if you want to save them as files, you also have to process them yourself.

Please refer to the code above to save base64 data as an image.

If you only want the content of the body, you can refer to the write method in \PhpOffice\PhpWord\Writer\HTML\Part\Body:

$phpWord = \PhpOffice\PhpWord\IOFactory::load('xxxx.docx');
$htmlWriter = \PhpOffice\PhpWord\IOFactory::createWriter($phpWord, 'HTML');
$content = '';
foreach ($phpWord->getSections() as $section) {
    // Same container writer that Body::write() uses for each section
    $writer = new \PhpOffice\PhpWord\Writer\HTML\Element\Container($htmlWriter, $section);
    $content .= $writer->write();
}
echo $content;
exit;
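Another way to drop the head part, assuming only the markup inside body is needed and the style block in head can be discarded, is to parse the complete HTML string and keep the children of the body element. This is a minimal sketch using PHP's DOM extension; extractBodyHtml is a hypothetical name, and the parser may normalize the markup slightly.

```php
<?php
/**
 * Hypothetical helper: return the inner HTML of the <body> element
 * of a complete HTML document string. Requires the DOM extension.
 */
function extractBodyHtml(string $html): string
{
    $doc = new DOMDocument();

    // Collect parser warnings internally instead of emitting them.
    $previous = libxml_use_internal_errors(true);
    // The XML declaration prefix is a common hint to treat input as UTF-8.
    $doc->loadHTML('<?xml encoding="UTF-8">' . $html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $body = $doc->getElementsByTagName('body')->item(0);
    if ($body === null) {
        return '';
    }

    // Serialize each child node of <body> in document order.
    $inner = '';
    foreach ($body->childNodes as $child) {
        $inner .= $doc->saveHTML($child);
    }

    return $inner;
}
```

It could be applied to the string returned by getContent() in the snippet at the start of this supplement.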

As for image processing, there is currently no good way to handle it without modifying the source code. If you change the source code, the relevant code is in \PhpOffice\PhpWord\Writer\HTML\Element\Image

public function write()
{
    if (!$this->element instanceof ImageElement) {
        return '';
    }
    $content = '';
    $imageData = $this->element->getImageStringData(true);
    if ($imageData !== null) {
        $styleWriter = new ImageStyleWriter($this->element->getStyle());
        $style = $styleWriter->write();
        // $imageData = 'data:' . $this->element->getImageType() . ';base64,' . $imageData;
        $imageSrc = 'images/' . md5($this->element->getSource()) . '.' . $this->element->getImageExtension();
        // You can handle the image yourself here, for example upload it to OSS or similar
        file_put_contents($imageSrc, base64_decode($imageData));

        $content .= $this->writeOpening();
        $content .= "<img border=\"0\" style=\"{$style}\" src=\"{$imageSrc}\"/>\n";
        $content .= $this->writeClosing();
    }

    return $content;
}
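If patching a file under vendor/ is undesirable, one possible alternative is to post-process the generated HTML instead: find the img tags whose src is a base64 data URI, write each image to disk, and replace the src with the file path. The sketch below is an assumption-laden illustration, not something taken from PHPWord: externalizeDataUriImages, $dir and $urlPrefix are hypothetical names, and it relies on the images appearing as src="data:image/...;base64,..." with double quotes, as described in the note above.

```php
<?php
/**
 * Hypothetical helper: replace base64 data-URI images in an HTML
 * string with files written to $dir, referenced via $urlPrefix.
 */
function externalizeDataUriImages(string $html, string $dir, string $urlPrefix): string
{
    if (!is_dir($dir) && !mkdir($dir, 0777, true) && !is_dir($dir)) {
        throw new RuntimeException("Cannot create directory: {$dir}");
    }

    $result = preg_replace_callback(
        '/src="data:image\/([a-z0-9]+);base64,([^"]+)"/i',
        function (array $m) use ($dir, $urlPrefix): string {
            // The payload may be wrapped across lines; strip whitespace.
            $binary = base64_decode(preg_replace('/\s+/', '', $m[2]), true);
            if ($binary === false) {
                return $m[0]; // leave invalid data untouched
            }
            $ext = strtolower($m[1]) === 'jpeg' ? 'jpg' : strtolower($m[1]);
            // Name the file after its content so duplicates are stored once.
            $name = md5($binary) . '.' . $ext;
            file_put_contents($dir . '/' . $name, $binary);

            return 'src="' . $urlPrefix . $name . '"';
        },
        $html
    );

    // preg_replace_callback() returns null on a PCRE error.
    return $result ?? $html;
}
```

The file_put_contents() call inside the callback is the place where an upload to OSS or similar storage could be substituted.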