Perl to extract from a pdf


 
# 1  
Old 04-25-2017

The Perl one-liner below produces metrics.txt (shown further down) from run.txt.

Code:
perl -ne 'BEGIN{print join("\t","R_Index", "ISP Loading", "Pre-Enrichment", "Total Reads", "Read Length", "Key Signal", "Usable Sequence", "Enrichment",   "Polyclonal" ,"Low Quality" ,"Test Fragment", "Aligned Bases", "Unaligned Bases",
"Exception"),"\n"};s/[\%\,]//g;@f=split/\s+/;/Aligned Read/ and $ar=$f[-1];/TF/ and $tf||=$f[-1];/Low Quality/ and $lq=$f[-1];/Polyclonal/ and $pc=$f[-1];/Live/ and $en||=$f[-1];/Usable/ and ($il,$us)=@p[1,2];/Key Signal/ and
($ks,$tr,$rl)=@p[3..5];/Unaligned Reads/ and print join("\t",1,$il,".",$tr,$rl,$ks,$us,$en,$pc,$lq,$tf,$ar,$f[-1]," "),"\n";@p=@f' run.txt > metrics.txt

run.txt was created with pdftotext -layout:
Code:
                      Run Report for Name


Run Summary
       20.4 G              84              94,495,222                     216 bp       229 bp       275 bp
      Total Bases    Key Signal             Total Reads                     Mean        Median       Mode

                94%                           68%                                   Read Length
             ISP Loading                   Usable Reads
             ISP Density                  ISP Summary




                                                          Addressable Wells        148,155,732
                                                          With ISPs                 139,740,599    94.3%
                                                          Live                      139,057,205    99.5%
                                                          Test Fragment               1,063,667    00.8%
                                                          Library                   137,993,538    99.2%

                                                        Library ISPs                137,993,538
                                                        Filtered: Polyclonal          37,959,918    27.5%
                                                        Filtered: Low Quality          6,536,035    04.7%
                                                        Filtered: Adapter Dimer              349    00.0%
                                                        Final Library ISPs           94,495,222     68.5%




    Barcode Name       Sample             Bases           ≥ Q20           Reads           Mean Read Length
    No barcode         none               151,751,086     122,614,710     844,020         180 bp
    IonXpress 004      00-0000 Last-      8,373,945,632 7,188,703,690 38,774,136          216 bp
                       First
    IonXpress 005      00-0001 LastN-   5,226,515,080 4,502,314,522 24,025,446          218 bp
                       FirstN
    IonXpress 006      00-0002 La-    6,651,737,354 5,681,526,265 30,850,757          216 bp
                       Fi




     Test Fragment          Reads          Percent 50AQ17           Read Length Histogram

     TF 1                   192,011        86%




1
                        Run Report for Name


Alignment Summary (aligned to Homo sapiens)
          17.01 G                  5.5X                                          98.6%
    Total Alignment Bases    Average Coverage                            Mean Raw Accuracy 1x
                             Depth of Reference




                               Count         %
     Total Reads            93,650,339        –
     Aligned Reads          93,073,879    99.4%
     Unaligned Reads           576,460     0.6%



                                 Alignment Quality
                                             AQ17           AQ20      Perfect
           Total Number of Bases [Mbp]            14.9 G    11.2 G     2.55 G
           Mean Length [bp]                          182       178        130
           Longest Alignment [bp]                    360       355        336
           Mean Coverage Depth                        4.8       3.6        0.8




2
                Run Report for Name


coverageAnalysis




variantCaller




3
                   Run Report for Name


Analysis Details
    Run Name              RunName
    Run Date              Date and time
    Run Flows             500
    Projects              name
    Sample                00-0000 Last-First ,  00-0001 LastN-FirstN, 00-0002 La
    Reference
    Instrument            S5-00580
    Flow Order            TACGTACGTCTGAGCATCGATCGATGTACAGC
    Library Key           TCAG
    TF Key                ATCG
    Chip ID               DACJ01029
    Chip Check            Passed
    Chip Type             540
    Chip Data             tiled
    Chip Lot Number       QNC297
    Barcode Set           IonXpress
    Analysis Name         Name
    Analysis Date         Date and time
    Analysis Flows        0
    runID                 DLIUA
    BeadFind Args         justBeadFind –args-json /opt/ion/config/args5 40b eadf ind.json
    Analysis Args         Analysis –args-json /opt/ion/config/args˙540˙analysis.json
    Pre-BaseCaller Args   BaseCaller –barcode-filter 0.01 –barcode-filter-minreads 10
    for calibration       –phasing-residual-filter=2.0 –max-phasing-levels 2 –wells-normalization on
    Calibration Args      Calibration
    BaseCaller Args       BaseCaller –barcode-filter 0.01 –barcode-filter-minreads 10
                          –phasing-residual-filter=2.0 –max-phasing-levels 2 –num-unfiltered 1000
                          –barcode-filter-postpone 1 –wells-normalization on
    Alignment Args        tmap mapall -q 50000 ... stage1 map4
    IonStats Args         ionstats alignment
    Analysis Parameters   default




4
                     Run Report for Name


Chef Summary
Chef Template Prep Information:
    Chef Last Updated          Date and time
    Chef Instrument Name       number
    Sample Position            1
    Tip Rack Barcode           46C080060
    Chip Type 1                540v1
    Chip Type 2                540v1
    Chip Expiration 1          None
    Chip Expiration 2          None
    Templating Kit Type        Ion 540 Kit-Chef
    Reagent Expiration         171031
    Reagent Lot Number         1824918
    Reagent Part Number        A27758C
    Solution Lot Number        1817390
    Solution Part Number       A27754C
    Solution Expiration        170731
    Chef Script Version        406
    Chef Package Version       IC.5.2.1
    Templating Protocol        (use instrument default)

S5 Consumables Summary
    Chip Type                540v1
    Chip Barcode             DACJ01029

       Product Description          Part Number     Lot Number      Exp. Date    Remaining Uses
       Ion S5 Sequencing Reagents       100033230          013309   2017/07/31                1
       Ion S5 Cleaning Solution         100031096          013718   2017/12/31                3
       Ion S5 Wash Solution             100031090          013315   2017/07/31                1




5
                    Run Report for Name


Software Version
    Torrent Suite       5.2.1
    host                tsvm
    ion-analysis        5.2.25-1
    ion-chefupdates     5.2.8
    ion-dbreports       5.2.49-1
    ion-gpu             5.2.0-1
    ion-pipeline        5.2.13-1
    ion-plugins         5.2.20-1
    ion-protonupdates   5.2.4
    ion-s5updates       5.2.7
    ion-torrentpy       5.2.2-1
    ion-torrentr        5.2.0-1
    S5 Script           0.1.16
    LiveView            2196
    DataCollect         3401
    OIA                 5208
    OS                  20
    Graphics            86
    Ion Chef            IC.5.2.1




6

metrics.txt (tab-delimited)
Code:
R_Index ISP Loading     Pre-Enrichment  Total Reads     Read Length     Key Signal      Usable Sequence Enrichment      Polyclonal      Low Quality     Test Fragment   Aligned Bases   Unaligned Bases Exception
1       94      .       94495222        216     84      68      99.5    27.5    04.7    86      99.4    0.6

I am having trouble modifying it to produce the desired output shown at the end of this post.

The last 4 fields come from the Barcode Name table in run.txt. The barcodes used vary from run to run, but the naming format is always the same: this run used IonXpress 004, IonXpress 005, and IonXpress 006, while the next run may use IonXpress 001, IonXpress 002, and IonXpress 003. In the first case the order in run.txt will be:

Code:
Barcode Name       Sample             Bases           Q20           Reads           Mean Read Length
No barcode         none               151,751,086     122,614,710     844,020         180 bp
IonXpress 004      00-0000 Last-      8,373,945,632 7,188,703,690 38,774,136          216 bp
                       First
IonXpress 005      00-0001 LastN-   5,226,515,080 4,502,314,522 24,025,446          218 bp
                       FirstN
IonXpress 006      00-0002 La-    6,651,737,354 5,681,526,265 30,850,757          216 bp

and in the second case the order will be:
Barcode Name
No Barcode
IonXpress 001

IonXpress 002

IonXpress 003

Since I am going to use a few if statements later, each IonXpress barcode (usually 3, but not always) is mapped in order to Barcode1, Barcode2, Barcode3. If a barcode is not present its field is not printed, so if IonXpress 003 is not present then Barcode3 is not printed.

The Reads column (f[4]) is what is extracted for each IonXpress barcode in the output. The No Barcode field is a calculation: (No barcode Reads / Total Reads) * 100, which for this run is (844,020 / 94,495,222) * 100. Thank you.

desired tab-delimited
Code:
Read Length     Usable Sequence Polyclonal      Low Quality     Unaligned Bases Barcode1   Barcode2   Barcode3   No Barcode     Exception
216                    68                         27.5                4.7                   0.6                      38774136   24025446   30850757   0.89

Hopefully this is a start, or maybe there is a better way... thank you :).

Last edited by cmccabe; 04-28-2017 at 08:52 AM.. Reason: fixed format
# 2  
Old 04-25-2017
Final alignment step

Hi.

Given, say on file data1, TABs as shown:
Code:
Read Length^IUsable Sequence Polyclonal^ILow Quality^IUnaligned Bases^IReads1^IReads2^IReads3^INo Barcode^IException$
216^I68^I27.5^I4.7^I0.6^I38774136^I24025446^I30850757^I0.89$

Looking like:
Code:
Read Length     Usable Sequence Polyclonal      Low Quality     Unaligned Bases Reads1  Reads2  Reads3  No Barcode      Exception
216     68      27.5    4.7     0.6     38774136        24025446        30850757        0.89

Then align -an data1 produces, as a final alignment step:
Code:
Read Length Usable Sequence Polyclonal Low Quality Unaligned Bases Reads1   Reads2   Reads3 No Barcode Exception
        216                         68        27.5             4.7    0.6 38774136 24025446   30850757      0.89

Some details for align:
Code:
align   Align columns of text. (what)
Path    : ~/p/stm/common/scripts/align
Version : 1.7.0
Length  : 270 lines
Type    : Perl script, ASCII text executable
Shebang : #!/usr/bin/perl
Help    : probably available with --help
Home    : http://kinzler.com/me/align/
Modules : (for perl codes)
 Getopt::Std    1.10

Best wishes ... cheers, drl
# 3  
Old 04-30-2017
The perl script below is very close, but I am struggling with extracting the last 4 fields and running a calculation in the last field. I hope I have included enough detail below, and thank you :).

The Reads1, Reads2, Reads3, and NoBarcode fields are extracted from this portion of run.txt:

Code:
Reads1 is IonXpress 004 with the Reads value extracted
Reads2 is IonXpress 005 with the Reads value extracted
Reads3 is IonXpress 006 with the Reads value extracted
NoBarcode is No Barcode with the Reads value extracted

Reads1, Reads2, and Reads3 can each have different IonXpress digits, but the format will always be the same. In this example the digits were 004, 005, 006, but next time they might be 001, 002, 003. The same layout applies either way: f[0] will always be the barcode and f[4] will always be the reads.
Code:
Barcode Name       Sample             Bases           ≥ Q20           Reads           Mean Read Length
No barcode         none               151,751,086     122,614,710     844,020         180 bp
IonXpress 004      00-0000 Last-      8,373,945,632 7,188,703,690 38,774,136          216 bp
                       First
IonXpress 005      00-0001 LastN-   5,226,515,080 4,502,314,522 24,025,446          218 bp
                       FirstN
IonXpress 006      00-0002 La-    6,651,737,354 5,681,526,265 30,850,757          216 bp
                       Fi

Calculation in last field:
The very last field is a calculation: the f[4] reads from the No barcode row divided by the Total Reads, times 100. In the original code in post 1, /Key Signal/ and ($ks,$tr,$rl)=@p[3..5] extracted the Total Reads into $tr, but I am not sure how to perform the calculation of (844020 / 94495222)*100.

Code:
perl -ne 'BEGIN{print join("\t","ReadLength", "UsableSequence", "Polyclonal", "LowQuality", "UnalignedBases", "Barcode1", "Barcode2", "Barcode3", "NoBarcode", "Exception"),"\n"};s/[\%\,]//g;@f=split/\s+/;/Key Signal/ and ($ks,$tr,$rl)=@p[3..5];/Usable/ and ($il,$us)=@p[1,2];/Polyclonal/ and $pc=$f[-1];/Low Quality/ and $lq=$f[-1];/Unaligned Reads/ and $ur=$f[-1] and print join("\t",$rl,$us,$pc,$lq,$ur,$f[-1]," "),"\n";@p=@f' run.txt

current output
Code:
ReadLength UsableSequence Polyclonal LowQuality UnalignedBases Barcode1 Barcode2 Barcode3 NoBarcode Exception
216       68              27.5         04.7    0.6           0.6

desired output (tab-delimited)
Code:
ReadLength UsableSequence Polyclonal LowQuality UnalignedBases Barcode1 Barcode2 Barcode3 NoBarcode Exception
216       68              27.5         4.7    0.6     38774136  24025446 30850757  0.89   
                                                                                   (Reads / TotalReads) * 100


Last edited by cmccabe; 04-30-2017 at 10:03 AM.. Reason: fixed format