The reason to save them into a map is for later retrieval.
Say there are millions of entries (that's exactly what any small lab would have!), but only tens or hundreds need to be retrieved from it.
After some reading, it seems current programs take two steps:
1) index the dataset;
2) retrieve subsample from the indexed dataset.
It seems to me a hash_map was used. This morning I was reviewing the code we discussed, and I thought a program could do the job this way:
where sample.list only has the sequence names, i.e. the keys of the map.
sample.list:
Does this make any sense to you? Or what did I miss?
I tried this on a tab-delimited file, which worked fine, but that is not general. If the file is tab-delimited, the job can be done with an awk script; even grep can do it easily. However, it does not seem easy with grep for the generic format. Thanks.
---------------
You are so fast! While I was writing, your second reply popped up. Thanks a lot!
---------- Post updated at 03:30 PM ---------- Previous update was at 03:22 PM ----------
I was trying to print each key and value of the map, i.e. each pair of seqID vs. sequence.
Quote:
The reason to save them into a map is for later retrieval.
Say there are millions of entries (that's exactly what any small lab would have!), but only tens or hundreds need to be retrieved from it.
I see, I see. Hmmm.
How about, instead of storing the entire file, store the locations you've found things. That's your "index". Then, when asked for that information, seek to that spot in the file and read it.
A plain std::map may not be the best data structure for this, but lookup is not a linear scan either: std::map is tree-based (a balanced binary tree), so with 2 million sequences, map["mysequence"] costs roughly log2(2 million), about 21 comparisons, not a 2-million-item loop. If you want a true hash table, C++11 added std::unordered_map as a standard generic hash type; before that, many compilers shipped hash_map as an extension.
On the other hand -- if you know what items you want, why not just print them?
I understand that:
Quote:
How about, instead of storing the entire file, store the locations you've found things. That's your "index". Then, when asked for that information, seek to that spot in the file and read it.
Isn't that the same as looping over / hashing the map? And is it doable?
Quote:
On the other hand -- if you know what items you want, why not just print them?
Two things there:
1) I do not know whether the entry is in the dataset or not;
2) If it is there, I want to get the full information for that entry (the sequence may be stored in an unknown number of rows!), so I need to use a program.
I am aware bioperl/biopython is better for this type of job, but I am catching up on C++. And C++ is surely way faster than perl for millions of queries.
Knowing where in 10 gigs of data your information is, and keeping all that 10 gigs of data in memory whether you need it or not, are somewhat different.
Quote:
On the other hand -- if you know what items you want, why not just print them?
Two things there:
1) I do not know whether the entry is in the dataset or not;
2) If it is there, I want to get the full information for that entry (the sequence may be stored in an unknown number of rows!), so I need to use a program.
OK, now I see the situation.
But I still think you have it backwards. Whenever an idea begins with "store the universe in memory, then use a tiny part of it" my hackles go up. Keep a list of the things you want to find. Scan the file and print only those without storing the universe.
Quote:
I am aware bioperl/biopython is better for this type of job, but I am catching up on C++. And C++ is surely way faster than perl for millions of queries.
I think I mentioned, long ago, a thread on this forum where the OP was using C++ for text processing. But he kept wanting to do more and more with it -- to the point it had rudimentary expressions. In the end it was still a little faster than awk, but it wasn't that fast.
awk, perl, and python are all written in C or C++. If they're slower than your programs, it's because your program does a whole lot less.
awk honestly sounds great for the job here. If your awk program is short, awk will run fast. It already has a very fast array that's based on a hash or tree.
You seem to know every tiny corner of my mind! I am not fluent in any of those programming languages, so my comments on speed do not count at all.
My colleague simply says to me, "You are overthinking it!" or "You are resistant to this approach!" whenever I ask about technical details for things like this.
For this practice, I am struggling to catch the flow of the
Regular books seldom address this part in great detail. When I took the CS200 course, the professor always emphasized: "C and only C, no OOP allowed!"
Now I realize what he meant, seriously!
The part I am still not sure is:
1) In the line with ">", the first field is stored as one string, minus the '>' char, which acts as a separator for each record (like RS in awk).
2) All the lines following the ">" line are concatenated into a single string. That is easy for printing, but how to track them in memory with
I am not sure at all. Neither am I sure about this line:
For example, the entry:
Onlyseq01 is picked up as the key from the first line, and the other parts are discarded; from the second row of the entry onward, everything is concatenated into AGCTACGTACATCAGTCGTGTGATCGAGCGGG as the value of the map (if I insist that a map be used!).
I seem to understand the syntax, since I can print out the individual parsed fields, but I do not know how to combine certain fields together when needed. Maybe I should not say I understand the syntax.
How the pointer/reference is manipulated behind the scenes is the bottleneck keeping me from catching the whole point. Can you elaborate on that? Thanks!