Data to import the database as snippets


 
# 8  
Old 10-11-2016
Hi,

Below is an updated version of the script data2imp.sh, with many explanatory comments.

As for improving the algorithm, I cannot guess what needs to be done without seeing the files.

Could you, for instance, provide the first lines of the files once they have been formatted by this script (the .lst files)?

The following commands write the first 10 records of each file to headers.txt and zip it.
Code:
$ head -n 10 *.lst > headers.txt
$ zip headers.zip headers.txt

Then attach the zipped file to your reply, if its content is not confidential.
(The zipped file should not exceed 100 KB for 1000 .lst files.)
Code:
#!/bin/bash
# data2imp.sh
# This script processes all *.txt files in the current directory.
# The result file contains one line per input file, with the original file name as the first field.
#
# First we need to format the input file.
function format_item {

awk '
# Select non-blank lines. NF = Number of Fields (NF = 0 for an empty line).
# Replace each run of consecutive blanks with a single blank.
NF {gsub(/  */," ")
# Print the line without a line feed
 printf "%s", $0
 istext++
 next
}
# Print a line feed when text is followed by an empty line (NF==0)
istext {print "";istext=0}' "$1"
}
# Create one CSV record per item.
# Make sure there is no semicolon inside the original text; if there is, change the field separator.
function insert_item {
awk -v OFS=";" -v ITEM="$itemname" '
# Record 1 is the item date
NR==1 {DATE=$1 " " $2}
# Record 2 is the item title
# If length <= 100: TITLE is the full record.
# Otherwise look for a dot between positions 51 and 100 and cut the record at that dot.
#    If there is no dot between positions 51 and 100: keep the first 100 characters and append "...".
NR==2 { if (length($0) <= 100) TITLE = $0
        else {
           dotposition=index(substr($0,51,50),".")
           if (dotposition == 0) {
             TITLE = substr($0,1,100) "..."}
           else {
             TITLE = substr($0,1,50 + dotposition)
           }
        }
       }
# Record 4 is the item snippet: same method as for TITLE, with a dot searched between positions 401 and 500
NR==4{ if (length($0) <= 500) SNIPPET = $0
        else {
           dotposition=index(substr($0,401,100),".")
           if (dotposition == 0) {
             SNIPPET = substr($0,1,500) "..."}
           else {
             SNIPPET = substr($0,1,400 + dotposition)
           }
        }
print ITEM,DATE,TITLE,SNIPPET
}' "$1"
}
#--- main -----------------------------------
# Initialise the output file
#
>items.csv
#
for i in *.txt
do
[ -e "$i" ] || continue   # skip the literal pattern when no .txt file exists
itemname=$(basename "$i" .txt)
echo "Processing item $itemname"
format_item "$i" > "$itemname.lst"
insert_item "$itemname.lst" >> items.csv
done


Last edited by blastit.fr; 10-11-2016 at 06:48 PM.. Reason: typo
# 9  
Old 10-13-2016
Hi,

See the attached script, which is a big enhancement of my previous one.
The fields are now identified by different criteria, independent of the record number.
The script creates temporary files for debugging purposes.
These files are the real input, cleaned of what we can call "noise", i.e. useless keywords or whole lines.

You can check for any remaining noise using this command:

Code:
$ ./data2imp.sh 
(output removed)
# check for useless keywords :
$ sort *.tmp |uniq -c |sort -rn | head -20 |cut -c-60
     56
     40 (FOR FY
     18 2017)
     18 2016)
     17 AUGUST 2012 1
     16 2015)
     13 DEPARTMENT OF
     13 AUGUST 2011 1
     10 (FY
      9 AUGUST 2015 TABLE OF CONTENTS
      9 1. MANDATE
      8 1
      7 DEPARTMENT OF ENERGY
      6 AUGUST 2016 1
      6 AUGUST 2013 2
      6 ((FFYY 22001133))
      5 SITUATIONER
      5 SEPTEMBER 2013 1
      5 DEPARTMENT OF TRADE AND INDUSTRY
      5 DEPARTMENT OF FINANCE

You can then enhance the code by adding new exclusions, for instance /FFYY/.
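
As a rough sketch of what such exclusions could look like: the attached script is not reproduced in this thread, so the function name strip_noise and the patterns below are assumptions drawn from the listing above, not the script's actual code.

```shell
# Hypothetical helper: drop "noise" lines before the fields are extracted.
# The patterns are examples taken from the frequency listing above.
strip_noise() {
  awk '
    /FFYY/                      { next }  # doubled-letter debris, e.g. ((FFYY 22001133))
    /^\(FOR FY$/ || /^\(FY$/    { next }  # orphaned fiscal-year fragments
    /^[0-9][0-9][0-9][0-9]\)$/  { next }  # orphaned years such as 2017)
    /TABLE OF CONTENTS/         { next }  # page furniture
    { print }                             # everything else is kept
  ' "$@"
}

# Example: only the last of these four lines is printed.
printf '%s\n' '((FFYY 22001133))' '(FOR FY' '2017)' 'DEPARTMENT OF ENERGY' | strip_noise
```

Each new exclusion is one more pattern followed by { next }, so the list can grow as the sort | uniq -c check turns up more candidates.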

Regards
# 10  
Old 10-14-2016
Hi,

Thanks for the update.
When I run the script with this block commented out, as it is when you don't need to format the files, I get an error message:

Code:
# remove comment if required to format files (txt to lst)
# for i in *.txt
# do
# itemname=$(basename $i .txt)
# echo Formatting item $i
# format_item $i > $itemname.lst
# done

Code:
rm: cannot remove `*.tmp': No such file or directory
Processing item *.lst
awk: cmd. line:40: fatal: cannot open file `*.lst' for reading (No such file or directory)

Thanks

Last edited by lxdorney; 10-14-2016 at 02:58 AM..
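
Messages like these appear when a glob such as *.lst or *.tmp matches nothing: the shell then passes the pattern through literally. A minimal sketch of one way to guard against that, assuming the script loops over *.lst and removes *.tmp as the error output suggests (the function names list_items and clean_tmp are made up for illustration):

```shell
# Hypothetical guard: print the item name of every .lst file that really exists.
list_items() {
  for f in *.lst
  do
    # With no match the loop sees the literal string "*.lst"; skip it.
    [ -e "$f" ] || continue
    basename "$f" .lst
  done
}

# Hypothetical cleanup: rm -f stays quiet when there is nothing to remove.
clean_tmp() {
  rm -f -- *.tmp
}
```

With no .lst files in the directory, list_items prints nothing instead of handing *.lst to awk.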
# 11  
Old 10-15-2016
Hi blastit.fr,

How can I write this on one line, in awk or in a sh file?
Code:
NR==2 { if (length($0) <= 100) TITLE = $0
        else {
           dotposition=index(substr($0,51),".")
           if (dotposition == 0) {
             TITLE = substr($0,1,100) "..."}
           else {
             TITLE = substr($0,1,50 + dotposition)
           }
        }
       }

Thanks
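
One way to collapse such a block onto a single line is the awk conditional operator ?:. The sketch below wraps it in a shell function so it can be tried from a sh file. The function name short_title is an assumption, it applies the rule to every input line rather than only to record 2, and it limits the dot search to positions 51 to 100 (as the script comments describe) by passing a length of 50 to substr.

```shell
# Hypothetical helper: apply the TITLE rule to each line read from stdin.
short_title() {
  awk '{ p = index(substr($0, 51, 50), "."); TITLE = (length($0) <= 100) ? $0 : (p ? substr($0, 1, 50 + p) : substr($0, 1, 100) "..."); print TITLE }'
}

# Example: a short line is returned unchanged.
echo "Short title" | short_title
```

Inside data2imp.sh the same expression would go after the NR==2 pattern, without the print.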
# 12  
Old 10-16-2016
Hi,
I have updated my script: see the attached file.
This greatly improves the results, as the rules have changed.

1) Field recognition is no longer based on the line number.
2) The rule for selecting the TITLE: find the longest line that contains no lowercase letter.

You should now see the output file items.csv filled with relevant results.
Note that many files, like the 3*.txt ones, won't fit this filter.
It only fits files like 1181.txt.
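
The attached file is not reproduced in this thread, so the following is only a sketch of rule 2 as stated above, not the script itself. The function name pick_title and the extra requirement of at least one uppercase letter (to skip blank or purely numeric lines) are assumptions.

```shell
# Hypothetical helper: print the longest line that has no lowercase letter.
pick_title() {
  # LC_ALL=C keeps the [a-z] and [A-Z] ranges limited to ASCII letters.
  LC_ALL=C awk '
    # Candidate: no lowercase letter, at least one uppercase letter.
    !/[a-z]/ && /[A-Z]/ {
      if (length($0) > length(best)) best = $0   # keep the longest candidate
    }
    END { print best }
  ' "$@"
}

# Example: the all-caps department line wins over the shorter date line.
printf '%s\n' 'AUGUST 2016 1' 'DEPARTMENT OF TRADE AND INDUSTRY' \
  'This line has lowercase letters and is the longest of all' | pick_title
```

A file with no all-uppercase line yields an empty TITLE, which is consistent with the note that files like 3*.txt won't fit this filter.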

Regards