File filter script


 
# 1  

I need help writing a script to filter the input file INPUT.TXT, shown below:

Code:
<DOC id="ID-NAME" type="story" >
<HEADLINE>
Relative Size Capital 
</HEADLINE>
<DATELINE>
Los , Monday 
</DATELINE>
<TEXT>
<P>
The first para consists of this format.have fully
</P>
<P>
Meanwhile, the rest of the story are in the XML format as in the present document format. 
</P>
</TEXT>
</DOC>

After filtering the above document, I want to get the output shown below as OUTPUT.TXT:


Code:
Relative Size Capital 
Los , Monday 
The first para consists of this format.have fully
Meanwhile, the rest of the story are in the XML format as in the present document format.

Thanks in advance.
# 2  
Hello my_Perl,

You could try the following.
Code:
awk '($0 !~ /</ && $0 !~ />/)' Input_file

Output will be as follows.
Code:
Relative Size Capital
Los , Monday
The first para consists of this format.have fully
Meanwhile, the rest of the story are in the XML format as in the present document format.

Thanks,
R. Singh
This User Gave Thanks to RavinderSingh13 For This Post:
# 3  
In case some of your input lines contain both tags and text you want to keep, you could also try:
Code:
awk '{gsub(/<[^>]*>/, "")}$0!=""' INPUT.TXT

Note that this code (and the code RavinderSingh13 suggested) doesn't throw away trailing spaces at the end of input lines as you did in your desired output.

If that is important to you, you can add a call to sub() or gsub() after the call to gsub() to strip trailing (or leading, or both leading and trailing) spaces or spaces and tabs.

As always, if you want to try this on a Solaris/SunOS system, change awk to /usr/xpg4/bin/awk or nawk.
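
The extra trimming step described above could look something like the sketch below. It is only an illustration: the function name strip_markup is made up for this example, and it assumes that no tag is split across lines and that no attribute value contains a ">" character.

```shell
#!/bin/sh
# strip_markup is a hypothetical helper name, not something from the thread.
# Assumes every tag opens and closes on the same input line.
strip_markup() {
    awk '{
        gsub(/<[^>]*>/, "")    # remove every tag on the line
        sub(/^[ \t]+/, "")     # drop leading spaces and tabs
        sub(/[ \t]+$/, "")     # drop trailing spaces and tabs
    }
    $0 != ""' "$@"             # print only lines that still have text
}

# Run on the files named on the command line, if any were given.
if [ "$#" -gt 0 ]; then
    strip_markup "$@"
fi
```

Saved as a script, it could be run as `sh strip_markup.sh INPUT.TXT > OUTPUT.TXT` (the script name is arbitrary). The same Solaris note applies: swap in /usr/xpg4/bin/awk or nawk there.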
This User Gave Thanks to Don Cragun For This Post:
# 4  
Hi.

I like awk scripts, but I also like generality, as long as it's not difficult or complicated. If the files are formatted nicely into lines as shown in the OP, then basic awk scripts are fine. However, if the markup spans lines as shown below, then other solutions might be useful (and not a lot more difficult), as shown here:
Code:
#!/usr/bin/env bash

# @(#) s1	Demonstrate plain-text transformation of URLs.

# Utility functions: print-as-echo, print-line-with-visual-space, debug.
# export PATH="/usr/local/bin:/usr/bin:/bin"
LC_ALL=C ; LANG=C ; export LC_ALL LANG
pe() { for _i;do printf "%s" "$_i";done; printf "\n"; }
pl() { pe;pe "-----" ;pe "$*"; }
db() { ( printf " db, ";for _i;do printf "%s" "$_i";done;printf "\n" ) >&2 ; }
db() { : ; }
C=$HOME/bin/context && [ -f $C ] && $C awk sed grep elinks

FILE=${1-data2}

pl " Input data file $FILE:"
cat $FILE

pl " Results, first awk:"
awk '($0 !~ /</ && $0 !~ />/)' $FILE

pl " Results, second awk:"
awk '{gsub(/<[^>]*>/, "")}$0!=""' $FILE

pl " Results, links (or elinks):"
links -dump $FILE

pl " Results, elinks (with added paragraph after headline):"
sed 's/<\/HEADLINE>/& <p>/' $FILE  > f1
elinks -dump f1

pl " Results, elinks (with added paragraph after headline, delete empty lines):"
sed 's/<\/HEADLINE>/& <p>/' $FILE  > f1
elinks -dump f1 |
grep -v '^[[:space:]]*$'

exit 0

producing:
Code:
$ ./s1

Environment: LC_ALL = C, LANG = C
(Versions displayed with local utility "version")
OS, ker|rel, machine: Linux, 2.6.26-2-amd64, x86_64
Distribution        : Debian 5.0.8 (lenny, workstation) 
bash GNU bash 3.2.39
awk GNU Awk 3.1.5
sed GNU sed version 4.1.5
grep GNU grep 2.5.3
ELinks 0.11.4 (built on Sep 20 2008 16:40:51)

-----
 Input data file data2:
<DOC id="ID-NAME" type="story" > <HEADLINE> Relative Size Capital
</HEADLINE> <DATELINE> Los , Monday </DATELINE> <TEXT> <P> The first
para consists of this format.have fully </P> <P> Meanwhile, the rest
of the story are in the XML format as in the present document format.
</P> </TEXT> </DOC>

-----
 Results, first awk:
of the story are in the XML format as in the present document format.

-----
 Results, second awk:
  Relative Size Capital
  Los , Monday    The first
para consists of this format.have fully   Meanwhile, the rest
of the story are in the XML format as in the present document format.
  

-----
 Results, links (or elinks):
   Relative Size Capital Los , Monday

   The first para consists of this format.have fully

   Meanwhile, the rest of the story are in the XML format as in the present
   document format.

-----
 Results, elinks (with added paragraph after headline):
   Relative Size Capital

   Los , Monday

   The first para consists of this format.have fully

   Meanwhile, the rest of the story are in the XML format as in the present
   document format.

-----
 Results, elinks (with added paragraph after headline, delete empty lines):
   Relative Size Capital
   Los , Monday
   The first para consists of this format.have fully
   Meanwhile, the rest of the story are in the XML format as in the present
   document format.

The utility elinks is available in many repositories, including those for CentOS and Debian, and on the Mac through the brew repository.
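
If installing elinks is not an option, an awk-only approach can also cope with markup and text that wrap across lines. The sketch below is one possible way to do it, not a tested replacement for the commands above: it reads the whole input into one string, turns every tag into a line break, and then tidies the whitespace. The name flatten_markup is made up for this example, and the code assumes that the file fits in memory, that no attribute value contains ">", and that text between two tags should come out as a single line.

```shell
#!/bin/sh
# flatten_markup is a hypothetical helper name, not something from the thread.
# Assumes the whole input fits in memory and no attribute value contains ">".
flatten_markup() {
    awk '
        { buf = buf " " $0 }                  # join all input lines with spaces
        END {
            gsub(/<[^>]*>/, "\n", buf)        # every tag becomes a line break
            n = split(buf, part, "\n")
            for (i = 1; i <= n; i++) {
                gsub(/[ \t]+/, " ", part[i])  # squeeze runs of spaces and tabs
                sub(/^ /, "", part[i])        # trim the leading space
                sub(/ $/, "", part[i])        # trim the trailing space
                if (part[i] != "") print part[i]
            }
        }
    ' "$@"
}

# Run on the files named on the command line, if any were given.
if [ "$#" -gt 0 ]; then
    flatten_markup "$@"
fi
```

Because the lines are joined before the tags are removed, a tag that starts on one line and ends on the next is still matched by the same regular expression. The trade-off is that original line breaks inside a paragraph are lost, which is what the desired output in post #1 asks for anyway.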

Best wishes ... cheers, drl
This User Gave Thanks to drl For This Post:
