Extracting content of a file


 
# 1  
Old 08-21-2010

Hello, I'm working on a script that extracts the contents of a file (generally a plain text file with numbers, symbols, and letters) and writes them to a .txt file, but my attempt is kind of all over the place. The output must not include duplicates, and the content has to be readable. I jumped around a lot while learning scripting, but I managed to get the translate (tr) feature down. I'm fairly new to awk, but I've heard it can be more effective and works similarly. I was also wondering whether I'm making this more complicated than it needs to be, when sort and uniq might be able to do the job?


Note: I will be running this script many times. Is it possible to keep updating the output file so that the content is extracted cumulatively?

My logic for the script so far is:

1. read (a while loop, maybe?)
2. sort/uniq -c (to eliminate duplicates)
3. awk (to eliminate gibberish?)

> filename.txt


My code so far:

Code:
#!/bin/bash
# Check for input file on command line.
ARGS=1
E_BADARGS=65
E_NOFILE=66

if [ $# -ne "$ARGS" ]  # Correct number of arguments passed to script? Or is this too complicated for something easy?
then
  echo "Usage: $(basename "$0") filename"
  exit $E_BADARGS
fi

if [ ! -f "$1" ]       # Check if file exists.
then
  echo "File \"$1\" does not exist."
  exit $E_NOFILE
fi


# So far I have it translating the input by piping tr into tr. Will this work?
# Or is awk more effective? What about using | sort | uniq -c?
# Is a for or while loop needed here?

tr 'A-Z' 'a-z' < "$1" | tr '[:space:]' Z | \
tr -cs '[:alpha:]' Z | tr -s '\173-\377' Z | tr Z ' ' > output.txt


exit 0


Last edited by l20N1N; 08-21-2010 at 09:09 PM.. Reason: corrections
# 2  
Old 08-21-2010
I think all of that can be done in awk, using an associative array that flags every entry and prevents a duplicate from being printed a second time. Character conversion is also easily handled. To solve this quickly in one shot, can you give us a representative example of the file's contents and the intended output?
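
As a rough sketch of the associative-array idea (not necessarily what the poster had in mind), awk can remember each line it has seen and print only the first occurrence. The function name dedupe_lines and the sample input are made up for illustration.

```shell
#!/bin/sh
# dedupe_lines: hypothetical helper; prints each distinct input line once,
# keeping the order of first appearance.
dedupe_lines() {
    # seen[$0]++ is 0 (false) the first time a line appears, so !seen[$0]++
    # is true only for first occurrences; awk's default action prints them.
    awk '!seen[$0]++' "$@"
}

# Sample input invented for this example.
printf '%s\n' 'John Smith' '555-5555' 'John Smith' 'Jane Doe' '555-5555' | dedupe_lines
```

Unlike sort -u, this keeps the lines in their original order, and it does not depend on the bash version.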
# 3  
Old 08-21-2010
You are on the right track; maybe there's no need for awk.

uniq -c prints a count in front of each line.

To simply drop duplicates, you can use sort -u.

Define gibberish.

You can use tr -cd to complement the set (if that's any easier),
e.g. to delete anything that is not alphanumeric or whitespace:

Code:
tr -cd '[:space:][:alnum:]'
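
To make the difference between these commands concrete, here is a small sketch. The sample words and the sample line are invented for the example.

```shell
#!/bin/sh
# Sample data invented for this example.
sample=$(printf '%s\n' pear apple pear fig apple pear)

# sort | uniq -c prefixes each distinct line with the number of times it occurs.
printf '%s\n' "$sample" | sort | uniq -c

# sort -u just prints each distinct line once, with no count.
printf '%s\n' "$sample" | sort -u

# tr -cd deletes every character NOT in the set; here '#', '@' and '!' are removed.
printf 'John #Smith@ 555!\n' | tr -cd '[:space:][:alnum:]'
```

Note that a hyphen is neither alphanumeric nor whitespace, so with this set a phone number such as 555-5555 would lose its hyphen as well.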

This User Gave Thanks to bigearsbilly For This Post:
# 4  
Old 08-21-2010
The thing is...

The thing is, the input files vary. They could be logs, records, databases, or other information converted into plain text. The script needs to be able to read everything in them. One file, for example, had:

Code:
John Smith  555-5555  to 555-5555 Hello Jane Doe

Another file was an email message, so it was all text.

The output just needs to contain everything taken from the input. The problem is that this has to be done cumulatively. For example, I feed in one file and write it to the output file, then feed in another file and add it to the same output file. That's where I'm stuck. I read that redirecting will overwrite the existing file, but I was wondering if the file can be updated instead.

Quote:
define gibberish
By gibberish I mean non-printable characters that might be mixed in with the regular text.


Update:

So for the while loop portion, where it reads the input, can I use this code?

Code:
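# Read file.lst line by line; IFS= and -r keep spaces and backslashes intact.
while IFS= read -r line
do
    echo "${line}"
done < file.lst

# file.lst contains (and the loop prints):
/tmp/file1.txt
/tmp/file with space.txt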

Would that take a list of files to extract content from, and write the result to a .txt file in the temp directory?

---------- Post updated at 07:38 PM ---------- Previous update was at 04:52 PM ----------

Will this also work?

Code:
cd <input_file_directory>
# A plain glob keeps names with spaces intact; parsing the output of dir would split them.
for file in *; do
<exeFile with full path> "$file" <output_file_path/"$file".out>
done
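
A minimal sketch of such a per-file loop follows, assuming the "extraction" step is simply stripping non-printable characters. The function name process_dir and the directory arguments are hypothetical, and tr stands in for whatever command actually does the extraction.

```shell
#!/bin/bash
# process_dir: hypothetical helper; runs a cleanup filter over every regular
# file in input_dir and writes the result to output_dir/<name>.out.
process_dir() {
    local input_dir=$1 output_dir=$2 file
    mkdir -p "$output_dir"
    for file in "$input_dir"/*; do
        [ -f "$file" ] || continue    # skip directories and an unmatched glob
        # Stand-in for the real extraction command:
        # keep printable characters and newlines, drop everything else.
        tr -cd '[:print:]\n' < "$file" > "$output_dir/$(basename "$file").out"
    done
}
```

It could be called as, say, process_dir /tmp/in /tmp/out (both paths are placeholders). Because the glob result is always quoted, file names containing spaces are handled correctly.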


Last edited by l20N1N; 08-21-2010 at 11:39 PM.. Reason: update
# 5  
Old 08-22-2010
If you intend to do that in bash:
Code:
#!/bin/bash

[[ BASH_VERSINFO -ge 4 ]] || {
    echo "Bash version 4.0 or newer is required by this script."
    exit 1
}

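# Associative array: one key per line already printed.
declare -A FLAGS=()

# Plain "read" stores the whole line in REPLY; -r keeps backslashes literal.
while read -r; do
    REPLY=${REPLY//[^[:print:]]}              # strip non-printable characters
    [[ -n ${FLAGS[x$REPLY]} ]] && continue    # already printed: skip it
    FLAGS[x$REPLY]=.                          # "x" prefix avoids an empty key on blank lines
    echo "$REPLY"
done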

exit 0

Run it as:
Code:
bash script.sh < input_file

The associative array (declare -A) is why it requires version 4.0 or newer of bash.

With bigearsbilly's suggestion:
Code:
# tr reads only standard input, and '\n' must be kept or all lines run together.
tr -cd '[:print:]\n' < input_file | sort -u

# 6  
Old 08-22-2010
I'm using bash 3.2.39.

OK, let's try this approach. Let's say I use this simple code:
Code:
cat file.txt |while read line; do echo "${line}"; done >> output.txt

Is it possible to code it so that "file.txt" could be all the .txt files in a directory?
# 7  
Old 08-22-2010
Just change it to *.txt:
Code:
cat *.txt ...
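
Putting the pieces of this thread together, here is one possible sketch of the cumulative, duplicate-free output the original poster asked for. The key point is that > truncates the output file on every run, while >> appends; the sketch instead merges the new lines with the existing file through sort -u. The function name merge_into is hypothetical, and the sketch assumes that sorted output is acceptable.

```shell
#!/bin/sh
# merge_into: hypothetical helper. Usage: merge_into OUTPUT INPUT...
# Adds the cleaned lines of each INPUT to OUTPUT, keeping OUTPUT sorted
# and free of duplicate lines across runs.
merge_into() {
    out=$1; shift
    touch "$out"    # first run: start from an empty file
    # Strip non-printable characters (keeping newlines), then merge with the
    # existing contents. sort reads all of its input before -o writes, so
    # OUTPUT can safely be both an input and the destination.
    cat "$@" | tr -cd '[:print:]\n' | sort -u -o "$out" - "$out"
}
```

For example, merge_into /tmp/output.txt *.txt could be run repeatedly as new files arrive (the path is a placeholder). Keeping the output file outside the directory matched by *.txt avoids feeding it back in as an input.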

This User Gave Thanks to konsolebox For This Post: