Comparing fastq files and outputting common records
Post 303008673 by Don Cragun on Monday 4th of December 2017 08:36:11 PM
What RavinderSingh13 suggested might work for you, but it is comparing lines instead of records. With his suggestion it is theoretically possible to match (and copy) parts of records without matching entire records. How likely false matches are depends on the sources of the data in your two input files.

The standards say that only the first character of RS matters and that if more than one character is contained in RS, the results are unspecified.

If the documentation for the awk on your system says that it handles multi-character RS values, I think the following trivial change will work for you, but I don't have an awk I can use to test the results:
Code:
awk 'BEGIN { FS=" "; ORS=RS="@M" } FNR==NR {a[$1]; next} $1 in a' File_1.txt File_2.txt
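
If your awk does not document multi-character RS support (GNU awk, for one, documents treating such an RS as a regular expression), a sketch along the following lines avoids RS altogether by counting lines. It rests on assumptions that are not confirmed in this thread: every record is exactly four lines long, the read ID is the first field of the header line, and the first file is not empty. The function name common_records and the sample reads are made up for illustration. It also sidesteps any risk of the characters @M turning up inside a quality line.

```shell
#!/bin/sh
# Sketch: print the records of the second FASTQ file whose read ID also
# appears in the first file, without relying on a multi-character RS.
# Assumptions (not from the thread): four lines per record, read ID is
# the first field of the header line, first file is not empty.
common_records() {
    awk '
        FNR % 4 == 1 { id = $1 }                            # header line: remember the read ID
        FNR == NR    { if (FNR % 4 == 1) seen[id]; next }   # first file: collect IDs only
        id in seen                                          # second file: print lines of matching records
    ' "$1" "$2"
}

# Hypothetical sample data, only to show the behaviour.
tmp=$(mktemp -d)
printf '%s\n' '@M01 1:N:0' 'ACGT' '+' 'IIII' '@M02 1:N:0' 'TTTT' '+' 'IIII' > "$tmp/File_1.txt"
printf '%s\n' '@M02 2:N:0' 'GGGG' '+' 'JJJJ' '@M03 2:N:0' 'CCCC' '+' 'JJJJ' > "$tmp/File_2.txt"

result=$(common_records "$tmp/File_1.txt" "$tmp/File_2.txt")
printf '%s\n' "$result"
rm -rf "$tmp"
```

With the sample data above, only the @M02 record from the second file is printed, since that is the only read ID present in both files.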

In addition, the sample data you have provided includes a <space> character at the start of each line in all three of your files. If there really are <space>s at the start of the lines in your real files, that may skew the results you get from this code.
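
If the leading <space>s do turn out to be in the real files, one possible way to get rid of them before comparing is sketched below. The function name strip_leading_blanks and the sample record are hypothetical, and the sketch assumes that no line is supposed to start with whitespace.

```shell
#!/bin/sh
# Sketch: write a copy of a file to standard output with leading
# whitespace removed from every line. The input file is not modified.
# Assumption: no line in the real data legitimately starts with a blank.
strip_leading_blanks() {
    sed 's/^[[:space:]]*//' "$1"
}

# Hypothetical sample record with a leading space on each line.
tmp=$(mktemp -d)
printf '%s\n' ' @M01 1:N:0' ' ACGT' ' +' ' IIII' > "$tmp/File_1.txt"

cleaned=$(strip_leading_blanks "$tmp/File_1.txt")
printf '%s\n' "$cleaned"
rm -rf "$tmp"
```

Redirecting the output of strip_leading_blanks into a new file for each input, and then running the awk code on those copies, leaves the original files untouched.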
 
