Data to import the database as snippets


 
# 1  
Old 10-09-2016

Hi all,

I don't know if this is the right place to post my question, but I would like some ideas on how to do this.

I have text files extracted by OCR, and I need to pull snippets from them to import into a database, into a snippet table that has the columns "snippet, date, title". I don't know whether a shell script can do this with the power of grep and regular expressions on Linux, or whether there are open-source or commercial tools for this task.

Thank you
# 2  
Old 10-09-2016
Hi,
Could you provide a sample input file and the expected result?
# 3  
Old 10-09-2016
Hi, here's the sample text file, 1181.txt, shared publicly on Google Drive.
Code:
date: September  2015 
title: THE MACROECONOMIC PERSPECTIVE: ENSURING THE BUDGET AS AN EFFECTIVE HANDLE AMID PRESSING CHALLENGES

Code:
snippet:  The proposed 2016 national budget is being considered in the legislative mill amid daunting global challenges, the vagaries of a harsh El Nino phenomenon, and the heightened political uncertainty owing to the looming transition in the reins of government.  Further, China’s continued unraveling in recent weeks has spooked global investors, affirmed the persistent global economic slowdown, and exposed the vulnerabilities of emerging markets.

Alternatively, the snippet would be one paragraph of at least 400 and at most 500 characters ending with a period, or a paragraph cut at a maximum of 500 characters and followed by three dots.

title: at least 50 and at most 100 characters ending with a period, or a paragraph cut at a maximum of 100 characters and followed by three dots.
Thanks for the response.

Moderator's Comments:
Please use CODE tags (as required by forum rules) when posting sample input, sample output, and code segments.

Last edited by Don Cragun; 10-09-2016 at 03:58 PM.. Reason: Add CODE and ICODE tags.
# 4  
Old 10-09-2016
Hi lxdorney,
This script should fit your request.
Code:
#
# data2imp.sh
# Step 1: format the input file so that each paragraph becomes one line.
awk 'NF {
       $1 = $1                                  # squeeze runs of blanks to one space
       printf "%s%s", (istext ? " " : ""), $0   # append the line to its paragraph
       istext = 1
       next
     }
     istext { print ""; istext = 0 }            # blank line: the paragraph ends
     END    { if (istext) print "" }            # end the last paragraph too
    ' 1181.txt > 1181.res
# Step 2: create a sample .csv file, ready for database input.
# Take care that there is no semicolon inside the original text; if there is,
# the field separator must be changed.
awk -v OFS=";" '
# Return text unchanged when it fits in max characters. Otherwise cut it at the
# first period after position min, or at max characters plus "..." when no
# period falls inside the limit.
function clip(text, min, max,    dotposition) {
  if (length(text) <= max) return text
  dotposition = index(substr(text, min + 1), ".")
  if (dotposition == 0 || min + dotposition > max)
    return substr(text, 1, max) "..."
  return substr(text, 1, min + dotposition)
}
# Line 1 of 1181.res holds the date, line 2 the title, line 4 the snippet.
NR==1 {DATE=$1 " " $2}
NR==2 {TITLE = clip($0, 50, 100)}
NR==4 {SNIPPET = clip($0, 400, 500); print DATE,TITLE,SNIPPET}
' 1181.res >1181.csv
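
If the OCR text may contain semicolons, one alternative to changing the separator is to quote every field. The following is only a sketch: `csv_line` and `csvq` are names made up here, and whether your database import tool accepts double-quoted fields is an assumption you would have to check.

```shell
# csv_line: hypothetical helper, not part of the script above.
# Reads three lines (date, title, snippet) from standard input and prints
# them as one semicolon-separated line with every field quoted.
csv_line() {
  awk -v OFS=";" '
    # Wrap s in double quotes and double any embedded double quote.
    function csvq(s) { gsub(/"/, "\"\"", s); return "\"" s "\"" }
    { field[NR] = $0 }
    END { print csvq(field[1]), csvq(field[2]), csvq(field[3]) }
  '
}
```

For example, `printf '%s\n' "$DATE" "$TITLE" "$SNIPPET" | csv_line` would emit one quoted line, so a semicolon inside the title no longer splits the field.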


Last edited by blastit.fr; 10-09-2016 at 08:15 PM.. Reason: comments added
# 5  
Old 10-09-2016
Hi blastit.fr,

Thanks for your time and effort in creating and sharing the script. I see that it works on the single file 1181.txt. Can it be run in bulk? We are talking about 2000 to 4000 text files, with the results appended to one csv file that also includes the name of each text file.

Thanks again

Last edited by lxdorney; 10-10-2016 at 08:00 AM..
# 6  
Old 10-10-2016
Hi,
See my attached file: this script processes all *.txt files in the current directory.
The result file contains one line per file, with the original file name as the first field.
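
The attachment itself is not reproduced in the thread, so here is a rough sketch of what such a bulk run could look like. It is not the attached script. The function name `all_txt_to_csv` and the helper names are made up, and it assumes the same layout as 1181.txt: paragraph 1 is the date, paragraph 2 the title, paragraph 4 the snippet.

```shell
# all_txt_to_csv: hypothetical name. Prints one line per *.txt file found in
# the directory given as $1 (default: current directory):
#   filename;date;title;snippet
all_txt_to_csv() {
  for f in "${1:-.}"/*.txt; do
    [ -f "$f" ] || continue          # nothing matched: skip the literal pattern
    awk '
      # Keep text whole if it fits in max characters, else stop at the first
      # period after position min, else cut at max characters and add "...".
      function clip(text, min, max,    p) {
        if (length(text) <= max) return text
        p = index(substr(text, min + 1), ".")
        if (p == 0 || min + p > max) return substr(text, 1, max) "..."
        return substr(text, 1, min + p)
      }
      # Called at the end of each paragraph; n counts the paragraphs seen.
      function flush(    w) {
        if (para == "") return
        n++
        if (n == 1) { split(para, w, " "); date = w[1] " " w[2] }
        if (n == 2) title = clip(para, 50, 100)
        if (n == 4) snippet = clip(para, 400, 500)
        para = ""
      }
      NF { $1 = $1; para = (para == "" ? $0 : para " " $0); next }
      { flush() }                    # blank line: the paragraph ends
      END {
        flush()
        if (n >= 4) {                # files with fewer paragraphs are skipped
          name = FILENAME
          sub(/.*\//, "", name)      # strip the directory part
          print name ";" date ";" title ";" snippet
        }
      }
    ' "$f"
  done
}
```

Running `all_txt_to_csv . > all.csv` would collect every file in the current directory into one csv file. Write the result outside the scanned directory or with a non-.txt extension, as here, so it is not picked up as input.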
# 7  
Old 10-11-2016
I will try this, thanks again.

---------- Post updated 10-11-16 at 10:46 AM ---------- Previous update was 10-10-16 at 05:53 PM ----------

Hi,

Here's the result after executing the script.

1. Field ITEM - The result looks good.

2. Field DATE - I tested 100 text files and only 2 out of 100 results were good, maybe because not all files have "Month Year" in the first row.
I tried to replace
Code:
NR==1 {DATE=$1 " " $2}

to this
Code:
NR==1 {DATE=system("egrep -R "^[a-zA-Z]{3,9} [0-9]{4} $" -m 1")}

to match the date pattern, but I had no luck making it work.

3. Field TITLE - The title is sometimes 10 characters or shorter. Maybe we could add a condition: for example, the first match for the title must be at least 30 characters but must not exceed 100, skipping lines that do not match and stopping at the first match.

4. Field SNIPPET - The snippet is sometimes 70 characters or shorter. Maybe we could add a condition: for example, the first match for the snippet must be at least 400 characters but must not exceed 500, skipping lines that do not match and stopping at the first match.
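
A note on the `system()` attempt: in awk, `system()` returns the exit status of the command, not its output, and the unescaped double quotes inside the string end the awk string early. The pattern can be matched in awk directly instead. Below is a sketch of the first-match rules from points 2 to 4, under several assumptions: the input has one paragraph per line (the format of 1181.res), a title candidate is any line of at least 30 characters, a snippet candidate is any later line of at least 400 characters, and `pick_fields` is a made-up name. The date pattern avoids `{3,9}` and `{4}` because some awk versions do not support interval expressions.

```shell
# pick_fields: hypothetical name. Reads paragraphs, one per line, from the
# given files (or standard input) and prints date;title;snippet.
pick_fields() {
  awk -v OFS=";" '
    # Keep text whole if it fits in max characters, else stop at the first
    # period after position min, else cut at max characters and add "...".
    function clip(text, min, max,    p) {
      if (length(text) <= max) return text
      p = index(substr(text, min + 1), ".")
      if (p == 0 || min + p > max) return substr(text, 1, max) "..."
      return substr(text, 1, min + p)
    }
    # Date: first line that looks like "Month Year", e.g. "September 2015".
    DATE == "" && /^[A-Za-z]+ +[0-9][0-9][0-9][0-9] *$/ {
      DATE = $0; sub(/ +$/, "", DATE); next
    }
    # Title: first line of at least 30 characters; shorter lines are skipped.
    TITLE == "" && length($0) >= 30 { TITLE = clip($0, 30, 100); next }
    # Snippet: first line after the title with at least 400 characters.
    TITLE != "" && SNIPPET == "" && length($0) >= 400 {
      SNIPPET = clip($0, 400, 500)
    }
    END { print DATE, TITLE, SNIPPET }
  ' "$@"
}
```

For example, `pick_fields 1181.res` prints one line for that file. A field that never matches stays empty, which makes problem files easy to find in the csv.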

Also, could you explain the flow of the script in more detail, not only for me but for the benefit of other users?

Thank you for reading, and for your effort and patience.

Last edited by lxdorney; 10-11-2016 at 12:54 PM..