Please help here. I have 4 columns, and I am looking for consistent 'geno' values within 'line', 'part' combinations. If the geno values are not consistent within a 'line', 'part' block, then we delete that block. One complication is that geno values are always 2 characters, but AT and TA mean the same thing; similarly CG = GC, etc.
I have 2 requests
1) Convert all 'geno' values in the whole dataset like AT and TA to either one or the other. I don't care which one, as long as they are consistent across the whole dataset.
2) Delete inconsistent 'line', 'part' combinations (red) from the data. If a combination is consistent (green), then just keep any one row; in the example I have kept the first one.
Being a bit short on time, I will deal only with the first part:
Quote:
Originally Posted by ritakadm
1) Convert all 'geno' values in the whole dataset like AT and TA to either one or the other. I don't care which one, as long as they are consistent across the whole dataset.
For the first part I can try the following
but is there a smarter way to do this?
No, your solution is good. The same can be done in awk, though, and if you want to use awk for the second part of your requirement anyway, you might want to do it all in one pass instead of something like sed '....' file | awk '....'. This only applies if your two requirements are always carried out together as one step. If not, the point is moot.
I would make the regexp a bit more robust, though (here just the first rule as example):
This will change every "TA" to "AT". But in fact you are interested only in changing the "TA" at the end of the line. Anchoring the pattern there also improves your code, because accidental changes somewhere in the middle of a line can no longer happen. Therefore:
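As a small illustration of the difference, here is a sketch using a made-up row in which the first column happens to contain "TA" as well. The sample data is hypothetical; only the geno value in the last column should change. Once the pattern is anchored with `$`, at most one match per line is possible, so the "g" flag is left off in the second command.

```shell
# Hypothetical sample row: "TA" occurs in column 1 and as the geno value.
# Unanchored with "g": both occurrences are rewritten.
printf 'TA12 leaf x TA\n' | sed 's/TA/AT/g'    # prints: AT12 leaf x AT

# Anchored to the end of the line: only the geno value is rewritten.
printf 'TA12 leaf x TA\n' | sed 's/TA$/AT/'    # prints: TA12 leaf x AT
```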
For the same reason you can drop the "g" option of the substitution: you want to change only one "TA" per line (the one immediately before the line end), hence:
Finally, as long as it is possible you should enclose sed-expressions in single quotes instead of double quotes. In your special case it doesn't matter but some meta-characters are expanded within double-quotes whereas in single quotes nothing is expanded at all. Therefore, finally:
One last tip: I usually write sed scripts with one command per line; it is easier to read this way, IMHO:
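Putting these points together, a minimal sketch of such a script might look like the following. It assumes the alleles are limited to A, C, G and T, that geno is the last field with no trailing whitespace, and that the alphabetically ordered spelling is the one to keep. The function name normalize_geno is made up for illustration.

```shell
# Sketch only: rewrite the reversed spelling of each allele pair at the
# end of the line. Assumes alleles are A, C, G, T and geno is the last field.
normalize_geno() {
    sed 's/CA$/AC/
         s/GA$/AG/
         s/TA$/AT/
         s/GC$/CG/
         s/TC$/CT/
         s/TG$/GT/' "$@"
}

# Usage: normalize_geno file > file.normalized
```

Each rule is anchored with `$`, carries no "g" flag, and the whole script sits inside single quotes, so the shell leaves it untouched.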
Thanks, very helpful explanation! I was wondering if there is a shortcut for reversing the 'geno' values without writing a specific sed rule for each pair, like converting all generic xy$ and yx$ values to xy. Still waiting on some help with part 2.
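One possible generic approach, sketched here in awk rather than sed, is to compare the two characters of the geno value and swap them when they are out of alphabetical order. This assumes geno is the last whitespace-separated field and is always exactly two characters; the function name canonical_geno is hypothetical.

```shell
# Sketch only: put the two characters of the last field in alphabetical
# order, so that TA becomes AT, GC becomes CG, and so on.
canonical_geno() {
    awk '{
        a = substr($NF, 1, 1)      # first allele
        b = substr($NF, 2, 1)      # second allele
        if (a > b) $NF = b a       # string comparison; swap if out of order
        print
    }' "$@"
}

# Usage: canonical_geno file > file.normalized
```

Note that assigning to $NF makes awk rebuild the line with single spaces between fields, so rows that get swapped lose any original column padding.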
Yes, Scrutinizer's script works with unsorted data.
Yes, this section prints the 1st line then jumps to the next line.
The script uses a lot of memory, because the data is stored and only processed in the END section, after the whole file has been read.
Senhia's script reads the file twice, which takes time, but may use less memory.
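To make the two-pass trade-off concrete, here is a rough sketch (not Senhia's script). It assumes 'line' and 'part' are fields 1 and 2, geno is the last field and has already been normalized, and the input is a regular file that can be read twice. The function name filter_two_pass is made up.

```shell
# Sketch only: pass 1 remembers one geno value per (line, part) key and
# flags keys with conflicting values; pass 2 prints the first row of
# every key that was never flagged.
filter_two_pass() {
    awk '
        { key = $1 SUBSEP $2 }
        NR == FNR {                                   # first pass
            if (!(key in geno)) geno[key] = $NF
            else if (geno[key] != $NF) bad[key] = 1
            next
        }
        !(key in bad) && !printed[key]++              # second pass
    ' "$1" "$1"
}

# Usage: filter_two_pass file.normalized
```

Only one short geno value per key is held in memory, at the cost of reading the file twice.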
---------- Post updated at 01:14 PM ---------- Previous update was at 12:24 PM ----------
The following is a little improvement regarding memory consumption and speed:
Doesn't need a sorted file.
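As a rough sketch of a one-pass approach that works on unsorted data (not the script from the post), awk can keep only the first row of each 'line', 'part' block plus a flag for blocks that turn out to be inconsistent. The column positions (fields 1 and 2 as the key, geno last) and the name filter_one_pass are assumptions.

```shell
# Sketch only: single pass, stores the first row per (line, part) key
# and prints the consistent ones in order of first appearance.
filter_one_pass() {
    awk '
        {
            key = $1 SUBSEP $2
            if (!(key in first)) {
                first[key] = $0        # row to keep if the block is consistent
                geno[key]  = $NF
                order[++n] = key       # remember input order
            } else if (geno[key] != $NF) {
                bad[key] = 1           # conflicting geno: drop the block
            }
        }
        END {
            for (i = 1; i <= n; i++)
                if (!(order[i] in bad)) print first[order[i]]
        }
    ' "$@"
}

# Usage: filter_one_pass file.normalized
```

Memory grows with the number of distinct blocks rather than with the number of rows.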
---------- Post updated at 01:59 PM ---------- Previous update was at 01:14 PM ----------
The following uses minimum memory but is a little slower:
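One way to trade speed for memory, assumed here rather than taken from the post, is to sort the data by key first and then stream through it, holding only the current block. The sketch assumes fields 1 and 2 are 'line' and 'part', geno is the last field and already normalized, and that sort supports the `-s` (stable) option, as GNU and BSD sort do. The name filter_sorted_stream is hypothetical.

```shell
# Sketch only: sort by (line, part), then keep just one block in memory.
# The output comes out in sorted key order, not in input order.
filter_sorted_stream() {
    sort -s -k1,1 -k2,2 "$@" | awk '
        function emit() { if (have && ok) print first }
        {
            key = $1 SUBSEP $2
            if (!have || key != prev) {
                emit()                 # finish the previous block
                prev = key; first = $0; geno = $NF; ok = 1; have = 1
            } else if ($NF != geno) {
                ok = 0                 # conflicting geno within this block
            }
        }
        END { emit() }
    '
}

# Usage: filter_sorted_stream file.normalized
```

The awk part needs memory for a single row only; the extra time goes into the sort.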
Last edited by MadeInGermany; 01-13-2015 at 02:20 PM..