How to remove specific line breaks in PDF

I have a PDF file with unquoted currencies, and the file is slightly broken. I am using PHP with Smalot\PdfParser\Parser. When I parse the file and get the full text, one entry is displayed incorrectly because of line breaks.
The code:
$parser = new Parser();
$pdf = $parser->parseFile($path . '/' . $filename);
$pdf_text = $pdf->getText();
$pdf_with_line_breaks = nl2br($pdf_text);
The result:
PEN Peruvian Sol 4,145256 0,241240 <br />
GEL Georgian Lari 2,879715 0,347257 <br />
XAF <br />
Central African CFA Franc <br />
BEAC 655,957000 0,001524 <br />
FJD Fijian Dollar 2,341310 0,427111 <br />
BYN Belarusian Ruble 2,706597 0,369468 <br />
My problem is that the PDF creator did not follow the correct principles of PDF compilation, so one currency entry is split across several lines. This entry needs to be put back on a single line so that the text can then be processed with
explode("\n", $pdf_text);
See attachment
How can I combine the erroneous lines marked in the attachment into one line without line breaks?
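One possible approach, as a minimal sketch: buffer the lines and only emit a row once the buffer ends with two comma-decimal numbers. This assumes every complete entry ends with two such numbers, as in the output shown above; the helper name mergeBrokenLines is hypothetical and not part of Smalot\PdfParser.

```php
<?php
// Hypothetical helper: joins consecutive lines until the buffer ends with
// two comma-decimal numbers (e.g. "655,957000 0,001524"), which is assumed
// to mark a complete currency entry.
function mergeBrokenLines(string $text): array
{
    $rows = [];
    $buffer = '';
    foreach (preg_split('/\R/', $text) as $line) {
        $line = trim($line);
        if ($line === '') {
            continue; // skip empty lines
        }
        $buffer = $buffer === '' ? $line : $buffer . ' ' . $line;
        // A complete row ends with two comma-decimal numbers.
        if (preg_match('/\d+,\d+\s+\d+,\d+$/', $buffer)) {
            $rows[] = $buffer;
            $buffer = '';
        }
    }
    if ($buffer !== '') {
        $rows[] = $buffer; // keep any trailing incomplete fragment
    }
    return $rows;
}

$pdf_text = "GEL Georgian Lari 2,879715 0,347257\nXAF\nCentral African CFA Franc\nBEAC 655,957000 0,001524\nFJD Fijian Dollar 2,341310 0,427111";
$rows = mergeBrokenLines($pdf_text);
print_r($rows);
```

The returned array takes the place of the explode("\n", $pdf_text) result, with the XAF entry on a single line.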

Related

phpWord how to load images using PhpOffice\PhpWord\IOFactory::load

I'm using the PHPWord library and am having difficulty getting it to read this XML excerpt:
<mc:AlternateContent>
<mc:Choice Requires="wps">
<w:drawing>
<wp:anchor distT="0" distB="0" distL="114300" distR="114300" simplePos="0" relativeHeight="251657728" behindDoc="1" locked="0" layoutInCell="1" allowOverlap="1">
<wp:simplePos x="0" y="0" />
<wp:positionH relativeFrom="column">
<wp:posOffset>-54610</wp:posOffset>
</wp:positionH>
<wp:positionV relativeFrom="paragraph">
<wp:posOffset>132715</wp:posOffset>
</wp:positionV>
<wp:extent cx="5470525" cy="657225" />
<wp:effectExtent l="6350" t="0" r="0" b="0" />
<wp:wrapNone />
<wp:docPr id="2" name="Rectangle 5" />
<wp:cNvGraphicFramePr>
<a:graphicFrameLocks
xmlns:a="http://schemas.openxmlformats.org/drawingml/2006/main" />
</wp:cNvGraphicFramePr>
<a:graphic
xmlns:a="http://schemas.openxmlformats.org/drawingml/2006/main">
<a:graphicData uri="http://schemas.microsoft.com/office/word/2010/wordprocessingShape">
<wps:wsp>
<wps:cNvSpPr>
<a:spLocks noChangeArrowheads="1" />
</wps:cNvSpPr>
<wps:spPr bwMode="auto">
<a:xfrm>
<a:off x="0" y="0" />
<a:ext cx="5470525" cy="657225" />
</a:xfrm>
<a:prstGeom prst="rect">
<a:avLst />
</a:prstGeom>
<a:solidFill>
<a:srgbClr val="DBE5F1">
<a:alpha val="50000" />
</a:srgbClr>
</a:solidFill>
<a:ln>
<a:noFill />
</a:ln>
<a:extLst>
<a:ext uri="{91240B29-F687-4F45-9708-019B960494DF}">
<a14:hiddenLine
xmlns:a14="http://schemas.microsoft.com/office/drawing/2010/main" w="9525">
<a:solidFill>
<a:srgbClr val="000000" />
</a:solidFill>
<a:miter lim="800000" />
<a:headEnd />
<a:tailEnd />
</a14:hiddenLine>
</a:ext>
</a:extLst>
</wps:spPr>
<wps:bodyPr rot="0" vert="horz" wrap="square" lIns="91440" tIns="45720" rIns="91440" bIns="45720" anchor="t" anchorCtr="0" upright="1">
<a:noAutofit />
</wps:bodyPr>
</wps:wsp>
</a:graphicData>
</a:graphic>
<wp14:sizeRelH relativeFrom="page">
<wp14:pctWidth>0</wp14:pctWidth>
</wp14:sizeRelH>
<wp14:sizeRelV relativeFrom="page">
<wp14:pctHeight>0</wp14:pctHeight>
</wp14:sizeRelV>
</wp:anchor>
</w:drawing>
</mc:Choice>
<mc:Fallback>
<w:pict>
<v:rect w14:anchorId="3C49E1DC" id="Rectangle 5" o:spid="_x0000_s1026" style="position:absolute;margin-left:-4.3pt;margin-top:10.45pt;width:430.75pt;height:51.75pt;z-index:-251658752;visibility:visible;mso-wrap-style:square;mso-width-percent:0;mso-height-percent:0;mso-wrap-distance-left:9pt;mso-wrap-distance-top:0;mso-wrap-distance-right:9pt;mso-wrap-distance-bottom:0;mso-position-horizontal:absolute;mso-position-horizontal-relative:text;mso-position-vertical:absolute;mso-position-vertical-relative:text;mso-width-percent:0;mso-height-percent:0;mso-width-relative:page;mso-height-relative:page;v-text-anchor:top" o:gfxdata="UEsDBBQABgAIAAAAIQC2gziS/gAAAOEBAAATAAAAW0NvbnRlbnRfVHlwZXNdLnhtbJSRQU7DMBBF
90jcwfIWJU67QAgl6YK0S0CoHGBkTxKLZGx5TGhvj5O2G0SRWNoz/78nu9wcxkFMGNg6quQqL6RA
0s5Y6ir5vt9lD1JwBDIwOMJKHpHlpr69KfdHjyxSmriSfYz+USnWPY7AufNIadK6MEJMx9ApD/oD
OlTrorhX2lFEilmcO2RdNtjC5xDF9pCuTyYBB5bi6bQ4syoJ3g9WQ0ymaiLzg5KdCXlKLjvcW893
SUOqXwnz5DrgnHtJTxOsQfEKIT7DmDSUCaxw7Rqn8787ZsmRM9e2VmPeBN4uqYvTtW7jvijg9N/y
JsXecLq0q+WD6m8AAAD//wMAUEsDBBQABgAIAAAAIQA4/SH/1gAAAJQBAAALAAAAX3JlbHMvLnJl
bHOkkMFqwzAMhu+DvYPRfXGawxijTi+j0GvpHsDYimMaW0Yy2fr2M4PBMnrbUb/Q94l/f/hMi1qR
JVI2sOt6UJgd+ZiDgffL8ekFlFSbvV0oo4EbChzGx4f9GRdb25HMsYhqlCwG5lrLq9biZkxWOiqY
22YiTra2kYMu1l1tQD30/bPm3wwYN0x18gb45AdQl1tp5j/sFB2T0FQ7R0nTNEV3j6o9feQzro1i
OWA14Fm+Q8a1a8+Bvu/d/dMb2JY5uiPbhG/ktn4cqGU/er3pcvwCAAD//wMAUEsDBBQABgAIAAAA
IQCH0ufrjAIAABwFAAAOAAAAZHJzL2Uyb0RvYy54bWysVG1v2yAQ/j5p/wHxPfWL7CS26lRt00yT
uq1atx9ADI7RMDAgcbpq/30HJGm6fZmm5YPDwfHcPXfPcXm1HwTaMWO5kg3OLlKMmGwV5XLT4K9f
VpM5RtYRSYlQkjX4iVl8tXj75nLUNctVrwRlBgGItPWoG9w7p+sksW3PBmIvlGYSDjtlBuLANJuE
GjIC+iCSPE2nyagM1Ua1zFrYXcZDvAj4Xcda96nrLHNINBhyc+Frwnftv8niktQbQ3TP20Ma5B+y
GAiXEPQEtSSOoK3hf0ANvDXKqs5dtGpIVNfxlgUOwCZLf2Pz2BPNAhcojtWnMtn/B9t+3D0YxGmD
c4wkGaBFn6FoRG4EQ6Uvz6htDV6P+sF4glbfq/abRVLd9uDFro1RY88IhaQy75+8uuANC1fRevyg
KKCTrVOhUvvODB4QaoD2oSFPp4awvUMtbJbFLC3zEqMWzqblLIe1D0Hq421trHvH1ID8osEGcg/o
ZHdvXXQ9uoTsleB0xYUIhtmsb4VBOwLiWN7class3hW6J3G3TOF3CGmjewhvz3GE9GhSedwYMu4A
C0jCn3k+QQ3PVZYX6U1eTVbT+WxSrIpyUs3S+STNqptqmhZVsVz99FlkRd1zSpm855IdlZkVf9f5
w4xETQVtorHBlS9lIH6e/YFW5OvpvhA+dxu4g0EVfGjw/OREat/5O0mBNqkd4SKuk9fph5JBDY7/
oSpBJ14aUWJrRZ9AJkZBF2FQ4UmBRa/MD4xGGM8G2+9bYhhG4r0EqVVZUfh5DkYBygDDnJ+sz0+I
bAGqwQ6juLx18Q3YasM3PUSKnZfqGuTZ8aAcL92YFeTtDRjBwODwXPgZP7eD18ujtvgFAAD//wMA
UEsDBBQABgAIAAAAIQBFxM1S3QAAAAkBAAAPAAAAZHJzL2Rvd25yZXYueG1sTI/BasMwDIbvg72D
UWGX0TozXcmyOKUURmG3pWNnNXZj09gOtttkbz/ttN0k/o9fn+rt7AZ20zHZ4CU8rQpg2ndBWd9L
+Dy+LUtgKaNXOASvJXzrBNvm/q7GSoXJf+hbm3tGJT5VKMHkPFacp85oh2kVRu0pO4foMNMae64i
TlTuBi6KYsMdWk8XDI56b3R3aa9OQjJ2Zwc8TOWj2Mcztvb98NVK+bCYd6/Asp7zHwy/+qQODTmd
wtWrxAYJy3JDpARRvACjvHwWNJwIFOs18Kbm/z9ofgAAAP//AwBQSwECLQAUAAYACAAAACEAtoM4
kv4AAADhAQAAEwAAAAAAAAAAAAAAAAAAAAAAW0NvbnRlbnRfVHlwZXNdLnhtbFBLAQItABQABgAI
AAAAIQA4/SH/1gAAAJQBAAALAAAAAAAAAAAAAAAAAC8BAABfcmVscy8ucmVsc1BLAQItABQABgAI
AAAAIQCH0ufrjAIAABwFAAAOAAAAAAAAAAAAAAAAAC4CAABkcnMvZTJvRG9jLnhtbFBLAQItABQA
BgAIAAAAIQBFxM1S3QAAAAkBAAAPAAAAAAAAAAAAAAAAAOYEAABkcnMvZG93bnJldi54bWxQSwUG
AAAAAAQABADzAAAA8AUAAAAA
" fillcolor="#dbe5f1" stroked="f">
<v:fill opacity="32896f" />
</v:rect>
</w:pict>
</mc:Fallback>
</mc:AlternateContent>
I am using the following code to read the Word file:
if ($model->getIsNewRecord()) {
$phpWord = \PhpOffice\PhpWord\IOFactory::load(
\Yii::getAlias('@app') . '/web/test.docx'
);
$htmlWriter = \PhpOffice\PhpWord\IOFactory::createWriter($phpWord, 'HTML');
$model->corpo = $htmlWriter->getContent();
}
My problem is how to get the library to read the <w:drawing> tag correctly. While analyzing the XML, I also noticed something that may be an image, although I'm not sure: the <w:pict> <v:rect> element inside <mc:Fallback> in the excerpt above.
As for the HTML I'm trying to generate: the two images in the docx are in the header and footer, and I need both of them to appear in the HTML. The styled text comes through fine; only the images are missing.

I believe that the PHPWord library is not up to date on this yet. Current Office Open XML documents define images with w:drawing, whereas older versions used w:pict.
I found a pull request that may help you further:
https://github.com/PHPOffice/PHPWord/pull/1324

php movie search bar using variables

I am trying to create a page where, when someone types a specific movie into the search bar, it goes to a page on my server with an embedded VLC player and plays that movie. I tried storing the user input as a variable in an array using PHP, but I get one of two errors: "Array to string conversion" or "Undefined index".
I'm using preg_match to scan a text file containing my movies, and I want it to take the file name that best matches the user input and use it as the src of my embedded player (src = $variable), so I only need one page to play 100 or more movies.
Here is the PHP source and the embedded video player HTML.
html video embed :
' />
and the source code for the preg_match search:
<div class="search" align = "center">
<form action="MovieMatch.php" method ="post">
<input type="text" name="userget" />
<input type="submit"name="submit" value="<?php$pick ?>" />
</form>
</div>
<h1 align = "center" class="text-primary">Search our Movies!</h1>
<div class="images" >
<?php
ini_set('display_errors', 'On');
error_reporting(E_ALL);
if(isset($_POST['submit'])){
$submit = $_POST ["userget"];//storing input
$file = 'movielist.txt';
// get the file contents, assuming the file to be readable (and exist)
$contents = file_get_contents($file);
// escape special characters in the query
$pattern = $submit;
// finalise the regular expression, matching the whole line
$pattern = "/^.*$pattern.*\$/m";
$handle = fopen("movielist.txt", "r");
// search, and store all matching occurences in $matches
if(preg_match_all($pattern, $contents, $matches)){
$pick = implode("\n", $matches[0]);
echo $pick[1];
echo "Found matches:\n";
echo 'click me!'; // this is my fix bu
}
else{
echo "No matches ";
}
}
?>
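For reference, a minimal sketch of the lookup step on its own. In the code above, $matches[0] is an array and $pick is a string after implode(), so $pick[1] is only its second character. The sketch assumes the movie list is already loaded into an array of titles (for example with file('movielist.txt', FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES)); the helper name findMovie and the sample titles are hypothetical.

```php
<?php
// Hypothetical helper: returns the first title that contains the query
// (case-insensitive), or null when nothing matches.
function findMovie(string $query, array $titles): ?string
{
    $query = trim($query);
    if ($query === '') {
        return null;
    }
    // preg_quote() escapes regex metacharacters in the user input.
    $pattern = '/' . preg_quote($query, '/') . '/i';
    $matches = preg_grep($pattern, $titles);
    // preg_grep() preserves keys, so reset() is used to fetch the first match.
    return $matches ? reset($matches) : null;
}

// Sample titles, made up for illustration.
$titles = ['The Matrix.mp4', 'Toy Story (1995).mp4', 'Up.mp4'];
$pick = findMovie('toy story', $titles);
echo $pick === null ? 'No matches' : htmlspecialchars($pick);
```

The single string in $pick could then be escaped with htmlspecialchars() and written into the player's src attribute.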

Strlen is not giving correct output

I have a string that I retrieve from a database. I want to calculate its length without spaces, but the reported length is larger than expected (21 characters more than the actual count). I have removed tab and newline characters and also the PHP and HTML tags, with no result. I have tried almost every function in the w3schools PHP reference without success. I have also observed that if I don't retrieve the value from the database and instead assign it like this:
$string = "my string";
I get the correct length. Please help me. Here is the code:
if($res_tutor[0]['tutor_experience']){
$str = trim(strip_tags($res_tutor[0]['tutor_experience']));
$str = $this->real_string($str);
$space = substr_count($str, ' ');
$experience = strlen($str) - $space;
}
function real_string($str)
{
$search = array("\t","\n","\r\n","\0","\v");
$replace = array('','','','','');
$str = str_replace($search,$replace,$str);
return $str;
}
This is the string from the database. As you can see above, I have removed all PHP and HTML tags using strip_tags():
<span class=\"experience_font\">You are encouraged to write a short description of yourself, teaching experience and teaching method. You may use the guidelines below to assist you in your writing.<br />
<br />
.Years of teaching experience<br />
.Total number of students taught<br />
.Levels & subjects that you have taught<br />
.The improvements that your students have made<br />
.Other achievements/experience (Relief teaching, a tutor in a tuition centre, Dean's list, scholarship, public speaking etc.)<br />
.For Music (Gigs at Esplanade, Your performances in various locations etc.)</span><br />
</p>
and when I print it, it displays as:
<span class=\"experience_font\">You are encouraged to write a short description of yourself, teaching experience and teaching method. You may use the guidelines below to assist you in your writing.<br />
<br />
.Years of teaching experience<br />
.Total number of students taught<br />
.Levels & subjects that you have taught<br />
.The improvements that your students have made<br />
.Other achievements/experience (Relief teaching, a tutor in a tuition centre, Dean's list, scholarship, public speaking etc.)<br />
.For Music (Gigs at Esplanade, Your performances in various locations etc.)</span><br />
</p>

@Svetilo, not to be rude, I just wanted to post my findings. Your str_replace worked wonderfully, except that I was still getting incorrect values with the search array in the order you currently have. I found that the following worked flawlessly:
$string = str_replace(array("\t","\r\n","\n","\0","\v"," "),'', $string);
mb_strlen($string, "UTF-8");
Swapping the order of \r\n and \n stops str_replace from stripping the \n out of each \r\n and leaving just a \r behind.
Cheers.

Try using mb_strlen: http://php.net/manual/en/function.mb-strlen.php
It counts characters rather than bytes, so it is more precise for multibyte strings.
mb_strlen($str,"UTF-8")
where UTF-8 is your default encoding.
To remove all whitespace, try something like this:
$string = str_replace(array("\t","\n","\r\n","\0","\v"," "),"",$string);
mb_strlen($string, "UTF-8");
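The effect of the search order discussed in these answers can be checked with a small sketch. The sample string is made up for illustration, and the preg_replace variant is an alternative that neither answer uses.

```php
<?php
$raw = "a b\r\nc\td";

// "\n" listed before "\r\n": the "\n" is removed first, leaving a stray "\r".
$wrongOrder = str_replace(["\t", "\n", "\r\n", "\0", "\v", " "], '', $raw);

// "\r\n" listed first: the pair is removed as a unit.
$rightOrder = str_replace(["\t", "\r\n", "\n", "\0", "\v", " "], '', $raw);

// Alternative: strip every whitespace character in one pass
// (the /u flag makes PCRE treat the subject as UTF-8).
$regex = preg_replace('/\s+/u', '', $raw);

echo mb_strlen($wrongOrder, 'UTF-8'), "\n"; // 5 ("ab\rcd")
echo mb_strlen($rightOrder, 'UTF-8'), "\n"; // 4 ("abcd")
echo mb_strlen($regex, 'UTF-8'), "\n";      // 4 ("abcd")
```

str_replace processes the search array in order, which is why the position of "\r\n" relative to "\n" matters.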

How can I get the principal image from MediaWiki API?

Hello, I'm using cURL to get information from Wikipedia, and I want to receive information only about the principal image, not about every image in an article.
For example:
If I want to get information about all images in the English language article (http://en.wikipedia.org/wiki/English_language), I go to this URL:
http://en.wikipedia.org/w/api.php?action=query&titles=English_Language&prop=images
but I receive the flags of countries where people speak English, in XML:
<?xml version="1.0"?> <api> <query>
<normalized>
<n from="English_language" to="English language" />
</normalized>
<pages>
<page pageid="8569916" ns="0" title="English language">
<images>
<im ns="6" title="File:Anglospeak(800px)Countries.png" />
<im ns="6" title="File:Anglospeak.svg" />
<im ns="6" title="File:Circle frame.svg" />
<im ns="6" title="File:Commons-logo.svg" />
<im ns="6" title="File:Flag of Argentina.svg" />
<im ns="6" title="File:Flag of Aruba.svg" />
<im ns="6" title="File:Flag of Australia.svg" />
<im ns="6" title="File:Flag of Bolivia.svg" />
<im ns="6" title="File:Flag of Brazil.svg" />
<im ns="6" title="File:Flag of Canada.svg" />
I only want the information about the principal image.

There's news! (from 2014)
A new extension, PageImages, is available and is already installed on the Wikimedia wikis.
Instead of prop=images, use prop=pageimages, and you'll get a pageimage attribute and a <thumbnail> child node for each <page> element.
Admittedly, it's not guaranteed to give the best results, but in your example (English Language) it works well and only yields the map of the geographic distribution, not all the flags.
Also, the OpenSearch API does return an <image> in its XML representation, but this API is not usable with lists and cannot be combined with the Query API.

This is how I got it working...
$.getJSON("http://en.wikipedia.org/w/api.php?action=query&format=json&callback=?", {
titles: "India",
prop: "pageimages",
pithumbsize: 150
},
function(data) {
var source = "";
var imageUrl = GetAttributeValue(data.query.pages);
if (imageUrl == "") {
$("#wiki").append("<div>No image found</div>");
} else {
var img = "<img src=\"" + imageUrl + "\">"
$("#wiki").append(img);
}
}
);
function GetAttributeValue(data) {
var urli = "";
for (var key in data) {
if (data[key].thumbnail != undefined) {
if (data[key].thumbnail.source != undefined) {
urli = data[key].thumbnail.source;
break;
}
}
}
return urli;
}
<script src="https://ajax.googleapis.com/ajax/libs/jquery/2.1.1/jquery.min.js"></script>
<html>
<head></head>
<body>
<div id="wiki"></div>
</body>
</html>
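The same lookup could be sketched server-side in PHP, which the question uses. This is an assumption-laden sketch: the response shape (query.pages.*.thumbnail.source) is taken from the jQuery code above, the sample response is hand-written rather than real API output, and the function names pageImageUrl and firstThumbnail are hypothetical.

```php
<?php
// Builds the query URL; the title and thumbnail size are illustrative values.
function pageImageUrl(string $title, int $thumbSize = 150): string
{
    return 'https://en.wikipedia.org/w/api.php?' . http_build_query([
        'action'      => 'query',
        'format'      => 'json',
        'prop'        => 'pageimages',
        'pithumbsize' => $thumbSize,
        'titles'      => $title,
    ]);
}

// Returns the first thumbnail URL found in a decoded response, or null.
function firstThumbnail(array $response): ?string
{
    foreach ($response['query']['pages'] ?? [] as $page) {
        if (isset($page['thumbnail']['source'])) {
            return $page['thumbnail']['source'];
        }
    }
    return null;
}

// Hand-written sample in the assumed response shape (not real API output).
$sample = json_decode(
    '{"query":{"pages":{"1":{"title":"Example","thumbnail":{"source":"https://example.org/thumb.png"}}}}}',
    true
);
echo pageImageUrl('English language'), "\n";
echo firstThumbnail($sample) ?? 'No image found', "\n";
```

In a real script, the URL would be fetched with cURL and the body passed through json_decode($body, true) before calling firstThumbnail().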

As others have noted, Wikipedia articles don't really have any such thing as a "principal image", so your first problem will be deciding how to choose between the different images used on a given page. Some possible selection criteria might be:
Biggest image in the article.
First image exceeding some specific minimum dimensions, e.g. 60 × 60 pixels.
First image referenced directly in the article's source text, rather than through a template.
For the first two options, you'll want to fetch the rendered HTML code of the page via action=parse and use an HTML parser to find the img tags in the code, like this:
http://en.wikipedia.org/w/api.php?action=parse&page=English_language&prop=text|images
(The reason you can't just get the sizes of the images, as used on the page, directly from the API is that that information isn't actually stored anywhere in the MediaWiki database.)
For the last option, what you want is the source wikitext of the article, available via prop=revisions with rvprop=content:
http://en.wikipedia.org/w/api.php?action=query&titles=English_language&prop=revisions|images&rvprop=content
Note that many images in infoboxes and such are specified as parameters to a template, so just parsing for [[Image:...]] syntax will miss some of them. A better solution is probably to just get the list of all images used on the page via prop=images (which you can do in the same query, as I showed above) and look for their names (with or without Image: / File: prefix) in the wikitext.
Keep in mind the various ways in which MediaWiki automatically normalizes page (and image) names: most notably, underscores are mapped to spaces, consecutive whitespace is collapsed to a single space and the first letter of the name is capitalized. If you decide to go this way, here's some sample PHP code that will convert a list of file names into a regexp that should match any of them in wikitext:
foreach ($names as &$name) {
$name = trim( preg_replace( '/[_\s]+/u', ' ', $name ) );
$name = preg_quote( $name, '/' );
$name = preg_replace( '/^(\\\\?.)/us', '(?i:$1)', $name );
$name = preg_replace( '/\\\\? /u', '[_\s]+', $name );
}
$regexp = '/' . implode( '|', $names ) . '/u';
For example, when given the list:
Anglospeak(800px)Countries.png
Anglospeak.svg
Circle frame.svg
Commons-logo.svg
Flag of Argentina.svg
Flag of Aruba.svg
the generated regexp will be:
/(?i:A)nglospeak\(800px\)Countries\.png|(?i:A)nglospeak\.svg|(?i:C)ircle[_\s]+frame\.svg|(?i:C)ommons\-logo\.svg|(?i:F)lag[_\s]+of[_\s]+Argentina\.svg|(?i:F)lag[_\s]+of[_\s]+Aruba\.svg/u

Important addendum
Bergi's answer, above, seemed super great, but I was bashing my head out because I couldn't get it to work.
I needed to include pilicense=any in my query, because otherwise any copyrighted imagery was ignored.
Here's the query I ultimately got working:
https://en.wikipedia.org/w/api.php?action=query&pilicense=any&format=jsonfm&prop=pageimages&generator=search&gsrsearch=My+incategory:English-language_films+prefix:My&gsrlimit=3
I know it's been a while, but this is one of the first pages I landed on when I started my days-long search for how to do this, so I wanted to share it on this page for others like me who might come here.

You can limit your query to the first image in the article with the imlimit parameter:
http://en.wikipedia.org/w/api.php?action=query&titles=English_Language&redirects&prop=images&imlimit=1

Regular Expression: Converting non-block elements with <br /> to <p> in PHP

Someone has asked a similar question, but the accepted answer doesn't meet my requirements.
Input:
<strong>bold <br /><br /> text</strong><br /><br /><br />
link<br /><br />
<pre>some code</pre>
I'm a single br, <br /> leave me alone.
Expected output:
<p><strong>bold <br /> text</strong><br /></p>
<p>link<br /></p>
<pre>some code</pre>
<p>I'm a single br, <br /> leave me alone.</p>
The accepted answer I mentioned above will convert multiple br to p, and at last wrap all the input with another p. But in my case, you can't wrap pre inside a p tag. Can anyone help?
Update:
The expected output before this edit was a little confusing. The whole point is to:
convert multiple br to a single one (achieved with preg_replace('/(<br \/>)+/', '<br />', $str);)
check for inline elements and unwrapped text (there's no parent element in this case; the input is from $_POST) and wrap them with <p>, leaving block-level elements alone.
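The first step can be sketched as follows. This is a minimal example that only collapses runs of br tags and does not attempt the <p> wrapping; the function name collapseBreaks is hypothetical, and tolerating whitespace between the tags is an added assumption.

```php
<?php
// Collapses any run of two or more <br>, <br/> or <br /> tags
// (optionally separated by whitespace) into a single <br />.
function collapseBreaks(string $html): string
{
    // "#" is used as the delimiter so the "/" inside the tag needs no escaping.
    return preg_replace('#<br\s*/?>(?:\s*<br\s*/?>)+#i', '<br />', $html);
}

echo collapseBreaks('<strong>bold <br /><br /> text</strong><br /><br /><br />'), "\n";
// <strong>bold <br /> text</strong><br />
echo collapseBreaks("I'm a single br, <br /> leave me alone."), "\n";
// I'm a single br, <br /> leave me alone.
```

A single br is left untouched because the pattern requires at least two tags in a row.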

Do not use regex. Why? See: RegEx match open tags except XHTML self-contained tags
Use proper DOM manipulators. See: http://php.net/manual/en/book.dom.php
EDIT:
I'm not really a fan of giving cookbook-recipes, so here's a solution for changing double <br />'s to text wrapped in <p></p>:
script.php:
<?php
function isBlockElement($nodeName) {
$blockElementsArray = array("pre", "div"); // edit to suit your needs
return in_array($nodeName, $blockElementsArray);
}
function hasBlockParent($node) {
if (!($node instanceof DOMNode)) {
// return whatever you wish to return on error
// or throw an exception
}
if (is_null($node->parentNode))
return false;
// compare the parent's tag name, not the node object itself
if (isBlockElement($node->parentNode->nodeName))
return true;
return hasBlockParent($node->parentNode);
}
$myDom = new DOMDocument;
$myDom->loadHTMLFile("in-file");
$myDom->normalizeDocument();
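$elems = $myDom->getElementsByTagName("*");
for ($i = 0; $i < $elems->length; $i++) {
$element = $elems->item($i);
// guard against missing siblings before reading their node names
$next = $element->nextSibling;
$afterNext = is_null($next) ? NULL : $next->nextSibling;
if (!is_null($next) && !is_null($afterNext) && $next->nodeName == "br" && $afterNext->nodeName == "br" && !hasBlockParent($element)) {
$parent = $element->parentNode;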
$parent->removeChild($element->nextSibling->nextSibling);
$parent->removeChild($element->nextSibling);
// check if there are further nodes on the same level
$nSibling;
if (!is_null($element->nextSibling))
$nSibling = $element->nextSibling;
else
$nSibling = NULL;
// delete the old node
$saved = $parent->removeChild($element);
$newNode = $myDom->createElement("p");
$newNode->appendChild($saved);
if ($nSibling == NULL)
$parent->appendChild($newNode);
else
$parent->insertBefore($newNode, $nSibling);
}
}
$myDom->saveHTMLFile("out-file");
?>
This is not really a full solution, but it's a starting point. This is the best I could write during my lunch break, and please bear in mind that the last time I coded in PHP was about 2 years ago (been doing mostly C++ since then). I was not writing it as a full solution but rather to give you a...well, starting point :)
So anyways, the input file:
[dare2be@schroedinger dom-php]$ cat in-file
<strong>bold <br /><br /> text</strong><br /><br /><br />
link<br /><br />
<pre>some code</pre>
I'm a single br, <br /> leave me alone.
And the output file:
[dare2be@schroedinger dom-php]$ cat out-file
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body><p><strong>bold <br><br> text</strong></p><br><p>link</p><pre>some code</pre>
I'm a single br, <br> leave me alone.</body></html>
The whole DOCTYPE mumbo jumbo is a side-effect. The code doesn't do the rest of the things you said, like changing <bold><br><br></bold> to <bold><br></bold>. Also, this whole script is a quick draft, but you'll get the idea.

Alright, I've got myself an answer, and I believe this is going to work really well.
It's from WordPress: the wpautop function.
I've tested it with the input from my question, and the output is almost what I expected; I just need to modify it a bit to fit my needs.
Thanks dare2be, but I'm not very familiar with DOM manipulation in PHP.
