UNIX for Dummies Questions & Answers
Extracting information from text fields.
Dear friends,
I'm a novice Unix user and I'm trying to learn the ropes. I have a big task to accomplish and I'm convinced Unix can get the job done; I just haven't figured out how. I recently posted on the topic of cutting text between unique text patterns and somebody helped me a great deal. It worked great. There are other tasks, however, that I want to accomplish.
I'm doing a content analysis of newspaper articles that I've exported in .txt format from a ProQuest database. The .rtf files look like this when I cat them in Unix.
Quote:
Ultimately, I will have hundreds of these news stories to extract the information from. Does anybody have any suggestions?
Simon
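Just a snippet of it:
Code:
awk 'BEGIN { FS = ":" }    # split every input line on colons
{ gsub(/\\/, "") }         # strip backslashes from the current line (in memory only)
/Author/ { print "Author: " $2 }
/Document types/ { print "Document: " $2 }
/Text Word Count/ { sub(/Text Word Count/, ""); print "Word count: " $0 }
' "file"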
Code:
# ./test1.sh
Author: Karen Miles
Document: News
Word count: 314
That's really helpful. For some reason, however, the gsub command doesn't delete the backslashes in the files. When I play with it, the command doesn't return an error; it seems to complete, but the files still contain the backslashes.
Also, the field variables don't print exactly right. The /Author/{print "Author: "$2 } string only prints the first name of the author, not both. I had to modify it to read "$2, $3" to get first and last names. Any clue why that might be? Thanks again though, this is what I was looking for!
Simon
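On the backslashes: awk does not edit its input file. It reads each line, changes its own in-memory copy, and writes whatever you print to standard output. The snippet below is a minimal sketch of that behaviour, assuming made-up sample lines and hypothetical file names (sample.rtf, clean.txt), since the real ProQuest export is not shown in the thread.

```shell
#!/bin/sh
# Hypothetical sample input standing in for one of the exported articles.
workdir=$(mktemp -d)
printf '%s\n' 'Author(s): Karen Miles\' 'Document types: News\' > "$workdir/sample.rtf"

# gsub() edits awk's copy of the line; "print" writes it to stdout,
# and the shell redirection saves that output in a new file.
awk '{ gsub(/\\/, ""); print }' "$workdir/sample.rtf" > "$workdir/clean.txt"

original=$(cat "$workdir/sample.rtf")   # untouched, still has the backslashes
cleaned=$(cat "$workdir/clean.txt")     # same text with the backslashes removed
printf 'original:\n%s\ncleaned:\n%s\n' "$original" "$cleaned"

rm -rf "$workdir"
```

To keep the cleaned text, redirect to a new file as shown and rename it afterwards; redirecting straight onto the input file would truncate it before awk reads it.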
Hi,
On the first point, this is the code I was playing with from the command line, based on your suggestion.
awk 'BEGIN { gsub(/\\/,"")}' 03152000.rtf
Did you put the gsub command in there just to get rid of the backslashes? Or is it related to the field extraction process?
Regarding the second point: here is the text, copied straight from the .txt file I'm trying to extract information from.
Author(s): Ashley Geddes, Provincial Affairs Writer
Here is the code (entered at the command line) I got from you, and its output.
awk 'Begin{FS=":"} ; /Author/{print $2 }' 03152000.rtf > Author.txt
Ashley
Here is the modified way I wrote the code, and its result.
awk 'Begin{FS=":"} ; /Author/{print $2,$3 }' 03152000.rtf > Author.txt
Ashley Geddes,
The same problem occurred with the field "Document Types"; I had to change around the fields as you wrote them to get the result. Obviously I'm not that concerned, because it seems to be working. However, I am very curious, because I'd like to know how the thing works. Thanks for your help!
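The outputs reported here are consistent with awk being case-sensitive. Only BEGIN in capitals is the special pattern; Begin is read as an ordinary, unset variable, which is false as a pattern, so FS=":" never runs and fields are split on whitespace. That would make $2 the second word, "Ashley". A minimal sketch using the Author(s) line quoted in this post (the sub() call that trims the leading space is an addition for illustration, not part of the original snippet):

```shell
#!/bin/sh
line='Author(s): Ashley Geddes, Provincial Affairs Writer'

# "Begin" is an unset variable used as a pattern: always false, so FS is
# never changed and the line is split on whitespace.
wrong=$(printf '%s\n' "$line" | awk 'Begin{FS=":"} ; /Author/{print $2 }')

# "BEGIN" runs before any input is read, so FS=":" applies from the first line.
# sub() removes the space that follows the colon.
right=$(printf '%s\n' "$line" | awk 'BEGIN{FS=":"} /Author/{sub(/^ +/, "", $2); print $2}')

printf 'Begin: %s\nBEGIN: %s\n' "$wrong" "$right"
```

The same timing explains the gsub experiment: a BEGIN block runs before the first line is read, so a gsub placed there has no text to work on, and a program made up only of a BEGIN block exits without reading the file at all.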