Finding contiguous numbers in a list but with a gap number tolerance


 
# 1  
Old 11-08-2012

Dear all,
I have an input file like this:
input
Code:
scaffold_0      10558458        10558459        1.8
scaffold_0      10558464        10558465        1.75
scaffold_0      10558467        10558468        1.8
scaffold_0      10558468        10558469        1.71428571428571
scaffold_0      10558469        10558470        1.71428571428571
scaffold_0      10558470        10558471        1.71428571428571
scaffold_0      10558471        10558472        1.59090909090909
scaffold_0      10558472        10558473        1.66666666666667
scaffold_0      10558473        10558474        1.75
scaffold_0      10558474        10558475        1.75
scaffold_0      10558475        10558476        1.7
scaffold_0      10558476        10558477        1.7
scaffold_0      10558477        10558478        1.7
scaffold_0      10558478        10558479        1.7
scaffold_0      10558479        10558480        1.7
scaffold_0      10558480        10558481        1.61904761904762
scaffold_0      10577262        10577263        1.6

I would like to retrieve the lines whose numbers in the second column are contiguous. In this example, I would get:

output
Code:
scaffold_0      10558467        10558468        1.8
scaffold_0      10558468        10558469        1.71428571428571
scaffold_0      10558469        10558470        1.71428571428571
scaffold_0      10558470        10558471        1.71428571428571
scaffold_0      10558471        10558472        1.59090909090909
scaffold_0      10558472        10558473        1.66666666666667
scaffold_0      10558473        10558474        1.75
scaffold_0      10558474        10558475        1.75
scaffold_0      10558475        10558476        1.7
scaffold_0      10558476        10558477        1.7
scaffold_0      10558477        10558478        1.7
scaffold_0      10558478        10558479        1.7
scaffold_0      10558479        10558480        1.7
scaffold_0      10558480        10558481        1.61904761904762

Note that the line "scaffold_0 10558464 10558465 1.75" is not included, because the numbers 10558465 and 10558466 are missing. However, I would like to allow a tolerance of up to five numbers, which would include that line and any others separated by a gap of up to 5 numbers.

Could anybody help me?

Cheers.
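
Before choosing a tolerance it can help to look at the gaps themselves. A minimal sketch, assuming "gap" means the distance from the previous line's end (column 3) to the current line's start (column 2); the question leaves the exact definition open, and `show_gaps` is a hypothetical helper name:

```shell
# Hypothetical helper: prefix every line with its gap to the previous line.
# Assumption: gap = this line's column 2 minus the previous line's column 3.
show_gaps() {
  sort -k2,2n | awk '
    {
      gap = (NR == 1) ? "-" : $2 - prev_end   # first line has no predecessor
      print gap "\t" $0
      prev_end = $3
    }'
}

# Usage: show_gaps < file
```

On the sample above, the second line shows a gap of 5 and the third a gap of 2, so under this reading a tolerance of 5 would pull in the first line as well.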
# 2  
Old 11-08-2012
Try something like this

Code:
sort -nk2 file | awk '{if($2 == s){print P;k=1}else if(1==k){print P;k=0;}}  
  {s=$3;P=$0}'

To set a gap tolerance, use:
Code:
sort -nk2 file | awk -v tol="5" '{if(($2-s)<=tol){print P;k=1}else if(1==k){print P;k=0;}}  
  {s=$3;P=$0}'

# 3  
Old 11-08-2012
Hi Pamu, thanks for your script. I'm very happy with the results.
I just have one problem that I would like your help with, because I'm a newbie at these things.
My input is:

input
Code:
scaffold_0    1    2    1.6
scaffold_0    2    3    1.6
scaffold_0    100    101    1.6
scaffold_0    104    105    1.6
scaffold_100    1    2    1.6
scaffold_100    1000    1001    1.6
scaffold_65    543    544    1.6
scaffold_10    1    2    1.6
scaffold_10    200    201    1.6
scaffold_10    1000    1001    1.6

Running the following script:

script
Code:
#!/bin/bash
cat imput |cut -f1 |sort |uniq >scaffolds
wait
while read line
do
one_position=`grep -w -c "$line" teste`
wait
    if [ "$one_position" -ne "0" ] # -ne: not equal
      then
        cat imput |grep -w "$line" |sed 's/scaffold_/scaffold_ /g' |sort -nk2 -nk3 |sed 's/scaffold_ /scaffold_/g' | awk -v tol="1" '{if(($2-s)<=tol){print P;k=1}else if(1==k){print P;k=0;}}
        {s=$3;P=$0}'
    fi
done < scaffolds

I get this output:

output
Code:
scaffold_0    1    2    1.6
scaffold_0    2    3    1.6
scaffold_10    1    2    1.6
scaffold_100    1    2    1.6



However, since I'm running awk with a tolerance of 1 (-v tol=1), I expected to get just:

desired output
Code:
scaffold_0    1    2    1.6
scaffold_0    2    3    1.6

How could I fix this script so that I get only the output shown above? Could you also explain this awk?


Cheers.
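
To see where the extra lines come from, here is the awk from post #2 laid out with comments and fed only the first two scaffold_10 lines. `first_version` is just a hypothetical wrapper name; the awk logic itself is unchanged:

```shell
# Hypothetical wrapper around the awk from post #2 (logic unchanged).
first_version() {
  awk -v tol="$1" '
    {
      if (($2 - s) <= tol) { print P; k = 1 }   # small gap: print the PREVIOUS line
      else if (1 == k)     { print P; k = 0 }   # run just ended: flush its last line
    }
    { s = $3; P = $0 }                          # remember the end position and the line
  '
}

printf 'scaffold_10 1 2 1.6\nscaffold_10 200 201 1.6\n' | first_version 1
```

On the first line `s` is still unset, so `$2-s` is simply `$2` (here 1), which is within the tolerance: awk prints the still-empty `P` and sets `k`. The second line is far away, so the `1==k` branch flushes the first line. The result is an empty line followed by `scaffold_10 1 2 1.6`. The same logic also never flushes a run that reaches the end of the input.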
# 4  
Old 11-09-2012
Grepping is not a good way to do this.

Try this (replace your whole script with it):

Code:
awk '!X[$1]++{print $1}' file > scaffolds
while read line
do
  awk -v var="$line" '$1 == var' file | sort -nk2 | awk -v tol="1" '{if(($2-s)<=tol && s){print P;k=1}else if(1==k){print P;k=0;}else{k=0}}
  {s=$3;P=$0}END{if(1==k){print P}}'
done < scaffolds

I am working on a single awk one-liner. I will post it when I get time.

pamu
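
For reference, the per-scaffold loop can probably be folded into one sort and one awk pass. This is only a sketch, not the one-liner promised above: `contiguous_runs` is a made-up name, it reads the data on standard input, and it assumes the gap is measured from the previous line's column 3 to the current line's column 2, as in the scripts in this thread.

```shell
# Hypothetical helper: contiguous_runs TOL < file
# Prints every line that is within TOL of a neighbour on the same scaffold.
contiguous_runs() {
  sort -k1,1 -k2,2n | awk -v tol="$1" '
    {
      # linked: same scaffold as the previous line and a small enough gap
      linked = (NR > 1 && $1 == name && $2 - end <= tol)
      if (linked) {
        if (!shown) print prev   # previous line opens the run, print it once
        print
      }
      shown = linked             # was the current line already printed?
      name = $1; end = $3; prev = $0
    }'
}

# Usage: contiguous_runs 1 < file
```

Because a line is printed either when it links to its predecessor or when its successor links to it, no special handling is needed for the first line or for the end of the input.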
# 5  
Old 11-09-2012
Hey Pamu, great awk script! It worked very well! Could you help me one more time?
After running your script I have a file that I would like to split into pieces, keeping consecutive lines together as long as the second column of the next line minus the second column of the current line is <= 100.

input
Code:
scaffold_100    1    2    10.6
scaffold_100    2    3    4.6
scaffold_100    102    103    5.6
scaffold_100    103    104    6.6
scaffold_100    1000    1001    6.6
scaffold_100    1001    1002    9.6
scaffold_100    3000    3001    6.6
scaffold_100    3002    3003    9.6
scaffold_100    3100    3101    6.6

output one
Code:
scaffold_100    1    2    10.6
scaffold_100    2    3    4.6
scaffold_100    102    103    5.6
scaffold_100    103    104    6.6

output two
Code:
scaffold_100    1000    1001    6.6
scaffold_100    1001    1002    9.6

output three
Code:
scaffold_100    3000    3001    6.6
scaffold_100    3002    3003    9.6
scaffold_100    3100    3101    6.6

I tried a lot of things using while loops and a bunch of variables, but I did not get the correct output. Do you know how to do this in a smarter way, using split, csplit, awk or whatever you prefer?

I'm very grateful for your help and time.

Cheers.
# 6  
Old 11-10-2012
Try:

Code:
awk -v tol="100" '{if(($2-s)>tol || NR==1){fn="out" (++a)}}
{s=$2; print > fn}' file

This creates three files: out1, out2 and out3.
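
If the file holds more than one scaffold, or produces many pieces, a slightly more defensive variant may be useful. This is a sketch under assumptions not stated in the thread: a new piece also starts whenever column 1 changes, each finished file is closed so awk does not run out of open file handles, and `split_by_gap` and its prefix argument are hypothetical.

```shell
# Hypothetical helper: split_by_gap TOL PREFIX < file
# Writes PREFIX1, PREFIX2, ... ; input must already be sorted by position.
split_by_gap() {
  awk -v tol="$1" -v prefix="$2" '
    # start a new piece on the first line, on a new scaffold, or after a big jump
    NR == 1 || $1 != name || $2 - start > tol {
      if (fn != "") close(fn)   # finished with the previous piece
      fn = prefix (++n)
    }
    { name = $1; start = $2; print > fn }'
}

# Usage: split_by_gap 100 out < file
```

As in the answer above, the jump is measured between the second columns of consecutive lines, so a difference of exactly 100 keeps both lines in the same piece.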