Using Linux Commands on selected text


 
# 8  
Old 11-01-2015
Your sample input file did not contain any duplicated text. It also isn't clear whether you want to remove only the text that is duplicated, or the text together with its opening and closing XML <Text> tags when both the text and the tags are duplicated.

Furthermore, although gawk on Linux systems uses the entire string assigned to the RS variable as the input record separator, the standards say that the behavior is unspecified if RS is more than one character. The version of awk I'm using only uses the first character of RS, so the structure of my code is slightly different from that suggested by cgkmal.
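As a rough illustration of the portability point, the sketch below sticks to a single-character RS, which is the case the standard defines, so it should behave the same way in gawk and in awk implementations that only look at the first character of RS. The function name split_on_gt and the sample input are made up for this example.

```shell
# split_on_gt is a hypothetical helper name used only for this illustration.
# It reads standard input and prints each ">"-terminated record on its own
# line, prefixed by the record number.
split_on_gt() {
	awk 'BEGIN { RS = ">" } { print NR ": " $0 }'
}

# Made-up sample input: two short elements on one line.
printf '<Text a="1">one</Text><Text a="2">two</Text>' | split_on_gt
```

With this input, the odd-numbered records hold the opening tags (without the closing ">") and the even-numbered records hold the text followed by "</Text".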

Your sample input included the line:
Code:
<Text Text_ID="10154713369385165_10154714426085165" From="415855878601070" Created="2015-10-30T23:27:48+0000" use_count="1">This is the fourth text........</Text>

but that line does not appear in the output that you said should be produced. Why shouldn't this line be included in the output?

Assuming that you just want to consider the text between the tags (and not the tags themselves) when looking for duplicates, you could try something like:
Code:
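awk '
BEGIN {	RS = ">"	# split input at each ">" so tags and text become separate records
}
# Record is an opening tag: save it (dropping any leading newline) and read on.
$1 == "<Text" {
	open_tag = substr($0 RS, index($0, "<"))
	next
}
# Record is text plus "</Text": print it if it has at least limit words
# and this text has not been seen before.
NF >= limit && !($0 in seen) {
	print open_tag $0 RS
	seen[$0]	# referencing the element creates it, marking the text as seen
}' limit=4 file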

If the sample input you showed us in post #1 is contained in a file named file, it produces the output:
Code:
<Text Text_ID="10155645315850165_10155645333075165" From="460350337463650" Created="2014-10-16T17:05:37+0000" use_count="536">This is the first text</Text>
<Text Text_ID="10155645315850165_10155645317025165" From="1626711840908498" Created="2014-10-16T17:01:02+0000" use_count="408">This is the second text</Text>
<Text Text_ID="10155645315850165_10155645320000165" From="1481727095388591" Created="2014-10-16T17:02:04+0000" use_count="1064">This is the third text
If counted 
GOT IT... ����</Text>
<Text Text_ID="10154713369385165_10154714450825165" From="464236763734179" Created="2015-10-30T23:34:47+0000" use_count="1">This is is just a sample text......</Text>
<Text Text_ID="10154713369385165_10154714444345165" From="642181809247720" Created="2015-10-30T23:31:48+0000" use_count="1">This is just another sample text.......</Text>
<Text Text_ID="10154713369385165_10154714426085165" From="415855878601070" Created="2015-10-30T23:27:48+0000" use_count="1">This is the fourth text........</Text>
<Text Text_ID="10154713369385165_10154714406055165" From="10202898434142187" Created="2015-10-30T23:23:34+0000" use_count="1">Jor se Bharat Mata ki jai</Text>

(including the "This is the fourth text........" line, shown in red in the original post, that was not included in your desired output).

Note that this code assumes that there is no whitespace between the last word in your text and the closing </Text> tag, that there are no > characters in the text in your file, and that the only tags in your XML file are opening and closing text tags (<Text ...> and </Text>, respectively). The code cgkmal provided makes these same assumptions and additionally assumes that there is no whitespace after the end of the opening text tag before the first word of the text, that there is nothing other than a <newline> character after a closing text tag, and that there are always five words in an opening text tag.
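Because the sample input contained no duplicates, a small made-up test can show what the awk program above does when duplicates are present. This is only a sketch: the wrapper name dedup_text, the shortened tags, and the sample text are all invented for the illustration, and the limit is passed with -v rather than as a trailing operand so that the function can read standard input.

```shell
# dedup_text is a hypothetical wrapper around the awk program shown above.
# It reads XML on standard input; the optional first argument is the minimum
# number of words a text must contain (default 4).
dedup_text() {
	awk -v limit="${1:-4}" '
	BEGIN {	RS = ">" }
	$1 == "<Text" {
		open_tag = substr($0 RS, index($0, "<"))
		next
	}
	NF >= limit && !($0 in seen) {
		print open_tag $0 RS
		seen[$0]
	}'
}

# Made-up input: the second text duplicates the first, the third is too short.
dedup_text 4 <<'EOF'
<Text Text_ID="1" use_count="1">one two three four</Text>
<Text Text_ID="2" use_count="1">one two three four</Text>
<Text Text_ID="3" use_count="1">too short</Text>
<Text Text_ID="4" use_count="1">five six seven eight</Text>
EOF
```

With this input only the lines with Text_ID 1 and 4 are printed. If the second line had a space before its closing </Text> tag, the two texts would no longer compare equal and both would be printed, which is why the whitespace assumption matters.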
# 9  
Old 11-01-2015
Missed out in the output

Yes, the following text was missing from my output. Sorry for the confusion.

Code:
<Text Text_ID="10154713369385165_10154714426085165" From="415855878601070" Created="2015-10-30T23:27:48+0000" use_count="1">This is the fourth text........</Text>

should appear in the output. Thanks a lot for the correction. I have corrected my earlier post accordingly.

Is there some way I can remove the non-Roman characters from the texts, remove the blanks in between line (if any), and then finish with trimming of tabs and spaces at the end of each line (if any)?

Last edited by my_Perl; 11-01-2015 at 01:38 AM..
# 10  
Old 11-01-2015
I'm glad to hear that I understood part of your original requirements.

Does the code I suggested work for you?

What do you consider to be duplicated text? If I didn't guess correctly, please show us some sample input that contains duplicated text and explain clearly what your requirements for removing duplicates are.

UPDATE: I see that you added text to your last post after I started answering it...

PLEASE stop making us guess at your requirements. State them clearly with examples! If you refuse to make any attempts to write your own code and expect volunteers to write all of your code for you, you could at least clearly define your terms and show us examples of what you're trying to do. You can start with:
  1. What is a non-Roman character (or alternatively, what is a Roman character)?
  2. What are "blanks in between line"?
  3. What lines are you referring to in "trimming of tabs and spaces at the end of each line"? Do you mean blanks after the closing Text tag? Do you mean blanks at the end of the text before a closing Text tag? Do you mean blanks at the end of a line of text with no tags that is part of multiple line text between opening and closing Text tags?
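Until those questions are answered, any code is a guess. Purely as a sketch of one possible reading, the pipeline below assumes that "non-Roman" means any byte other than tab, newline, and printable ASCII, that "blanks in between line" means empty lines, and that trailing spaces and tabs should be trimmed from every line. The function name clean_text is made up, and none of these assumptions has been confirmed by the original poster.

```shell
# clean_text is a hypothetical helper; each step encodes one guess about the
# requirements, not a confirmed specification.
clean_text() {
	# 1. Delete every byte that is not tab, newline, or printable ASCII
	#    (octal 040-176). LC_ALL=C makes tr work on bytes, so every byte of
	#    a multibyte UTF-8 character is removed.
	LC_ALL=C tr -cd '\11\12\40-\176' |
	# 2. Trim trailing spaces and tabs, then drop lines that are left empty.
	sed -e 's/[[:blank:]]*$//' -e '/^$/d'
}

# Made-up input: an emoji, trailing blanks, an empty line, a blank-only line.
printf 'GOT IT... \360\237\230\200\t \n\n  \nnext line\n' | clean_text
```

Note that this reading would also strip accented Latin letters, which may or may not be what is wanted; a different answer to question 1 would need a different first step.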

Last edited by Don Cragun; 11-01-2015 at 01:15 AM..
# 11  
Old 11-01-2015
Thanks a lot, it worked.