Extracting the column containing URL from a text file
Post 302909399 by MadeInGermany on Wednesday 16th of July 2014 02:49:43 PM
Or this one?
Code:
awk 'BEGIN {OFS=FS="\t"}
{
  s=""
  # Scan the URL (2) and Text (3) columns
  for (col=2; col<=3; col++) {
    sep=""
    n=split($col, T, "[ ,<>]+")
    # Keep every token of the header line, and only URL-like tokens elsewhere
    for (i=1; i<=n; i++)
      if (NR==1 || T[i]~/^(http|https|ftp|gopher|mailto):|^www\.[-a-z0-9]+\./) {
        s=s sep T[i]; sep=" "
      }
    s=s OFS
  }
  print s
}' file
URL     Text
http://example.com      www.test.com http://example4.com
http://example1.net     http://example6.com
http://example2.net
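
If you want to try it without the original file, here is a small self-contained sketch. The sample rows are the two complete data rows quoted in the question; the tab separators, the temporary file, and the variable names `input` and `out` are my own assumptions for illustration. Note that each output line ends with a tab, because OFS is appended after every scanned column.

```shell
#!/bin/sh
# Rebuild a sample input from the rows quoted in the thread.
# Assumption: the three columns are separated by single tabs.
input=$(mktemp)
printf '%s\t%s\t%s\n' \
  'Timestamp' 'URL' 'Text' \
  '1331635241000' 'http://example.com' 'Peoples footage at www.test.com,http://example4.com' \
  '1331635231000' 'http://example1.net' 'crack the nuts http://example6.com' > "$input"

# Same logic as the command above: scan columns 2 and 3, split each one on
# blanks, commas and angle brackets, and keep only URL-like tokens
# (everything is kept on the header line).
out=$(awk 'BEGIN {OFS=FS="\t"}
{
  s=""
  for (col=2; col<=3; col++) {
    sep=""
    n=split($col, T, "[ ,<>]+")
    for (i=1; i<=n; i++)
      if (NR==1 || T[i]~/^(http|https|ftp|gopher|mailto):|^www\.[-a-z0-9]+\./) {
        s=s sep T[i]; sep=" "
      }
    s=s OFS
  }
  print s
}' "$input")

rm -f "$input"
printf '%s\n' "$out"
```

To scan other columns, adjust the bounds of the `col` loop; to accept more URL schemes, extend the alternation in the regular expression.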
