If the input file is compressed, the number of bytes read from HDFS is reduced, which means less time spent reading data. This time saving is beneficial to the performance of job execution.

If the input files are compressed, they will be decompressed automatically as they are read by MapReduce, using the filename extension to determine which codec to use. A file ending in .gz, for example, is identified as gzip-compressed and is read with GzipCodec.

Often we need to store the output as history files. If the amount of output per day is extensive, and we often need to store history results for future use, these accumulated results will consume a large amount of HDFS space. However, these history files may not be used very frequently, resulting in wasted HDFS space. It is therefore worthwhile to compress the output before storing it on HDFS.

Even if your MapReduce application reads and writes uncompressed data, it may benefit from compressing the intermediate output of the map phase. Since the map output is written to disk and transferred across the network to the reducer nodes, using a fast compressor such as LZO or Snappy can yield performance gains simply because the volume of data to transfer is reduced.

gzip is based on the DEFLATE algorithm, which is a combination of LZ77 and Huffman coding.

bzip2 is a freely available, patent-free, high-quality data compressor. It typically compresses files to within 10% to 15% of the best available techniques (the PPM family of statistical compressors), whilst being around twice as fast at compression and six times faster at decompression.

The LZO compression format is composed of many smaller (~256K) blocks of compressed data, allowing jobs to be split along block boundaries. Moreover, it was designed with speed in mind: it decompresses about twice as fast as gzip, meaning it is fast enough to keep up with hard drive read speeds. It doesn't compress quite as well as gzip - expect files that are on the order of 50% larger than their gzipped version. But that is still 20-50% of the size of the files without any compression at all, which means that I/O-bound jobs complete the map phase about four times faster.

Snappy is a compression/decompression library. It does not aim for maximum compression, or compatibility with any other compression library; instead, it aims for very high speeds and reasonable compression. For instance, compared to the fastest mode of zlib, Snappy is an order of magnitude faster for most inputs, but the resulting compressed files are anywhere from 20% to 100% bigger. On a single core of a Core i7 processor in 64-bit mode, Snappy compresses at about 250 MB/sec or more and decompresses at about 500 MB/sec or more. Snappy is widely used inside Google, in everything from BigTable and MapReduce to its internal RPC systems.

All compression algorithms exhibit a space/time trade-off: faster compression and decompression speeds usually come at the expense of smaller space savings. Most of these tools give some control over this trade-off at compression time by offering nine levels: -1 means optimize for speed and -9 means optimize for space. Beyond that, the different tools have very different compression characteristics.
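The trade-off is easy to observe directly. Below is a minimal sketch, assuming the python-snappy package is installed (`pip install python-snappy`) and that `sample.log` stands in for any large, compressible file you have on hand; it compares gzip at both extremes with Snappy:

```python
import gzip
import time

import snappy  # python-snappy; assumed installed


def measure(name, compress, data):
    """Time one compressor and report its compression ratio."""
    start = time.perf_counter()
    out = compress(data)
    elapsed = time.perf_counter() - start
    print(f"{name:>8}: {elapsed:6.3f}s  ->  {len(out) / len(data):5.1%} of original")


# Any large, compressible file will do; the path is illustrative.
with open("sample.log", "rb") as f:
    data = f.read()

measure("gzip -1", lambda d: gzip.compress(d, compresslevel=1), data)
measure("gzip -9", lambda d: gzip.compress(d, compresslevel=9), data)
measure("snappy", snappy.compress, data)
```

On typical text data you should see Snappy finish well ahead of even gzip's fastest level, while gzip -9 produces the smallest output by a wide margin, which is exactly the spectrum described above.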
The following question, from a user trying to read a python-snappy-compressed file in PySpark, shows how these codec differences surface in practice.

I have compressed a file using python-snappy and put it in my HDFS store. I am now trying to read it in like so, but I get the traceback below. I can't find an example of how to read the file in so I can process it. I can read the uncompressed text version of the file fine. Should I be using sc.sequenceFile? Thanks!

I first compressed the file and pushed it to HDFS:

```
python -m snappy -c gene_regions.vcf gene_regions.vcf.snappy
```

I then added the following to spark-env.sh:

```
export JAVA_LIBRARY_PATH=$JAVA_LIBRARY_PATH:$HADOOP_HOME/lib/native
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$HADOOP_HOME/lib/native
export SPARK_LIBRARY_PATH=$SPARK_LIBRARY_PATH:$HADOOP_HOME/lib/native
export SPARK_CLASSPATH=$SPARK_CLASSPATH:$HADOOP_HOME/lib/lib/snappy-java-1.1.1.8-SNAPSHOT.jar
```

I then launch my Spark master and slave, and finally my IPython notebook, where I am executing the code below:

```
a_file = sc.textFile("hdfs://master:54310/gene_regions.vcf.snappy")
a_file.first()
```

This fails with:

```
ValueError                                Traceback (most recent call last)
/home/user/Software/spark-1.3.0-bin-hadoop2.4/python/pyspark/rdd.pyc in first(self)
...
ValueError: RDD is empty
```

Working code for the uncompressed text file:

```
a_file = sc.textFile("hdfs://master:54310/gene_regions.vcf")
a_file.first()
```
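No answer appears here, but the symptom is consistent with a format mismatch: `python -m snappy` writes snappy's framing (stream) format, which is not the block layout Hadoop's SnappyCodec expects, so sc.textFile cannot decode any records. One workaround is to bypass the Hadoop codec entirely and decompress in Python. This is only a sketch under those assumptions, reusing the question's paths, and it requires python-snappy on the driver and executors plus Spark 1.3+ for binaryFiles:

```python
import snappy  # python-snappy; assumed installed where the tasks run

# Read the file as raw bytes so no Hadoop codec is involved.
raw = sc.binaryFiles("hdfs://master:54310/gene_regions.vcf.snappy")


def decode(pair):
    path, data = pair
    # `python -m snappy -c` writes snappy's framing (stream) format,
    # so undo it with StreamDecompressor rather than snappy.decompress().
    # Note: this holds the whole decompressed file in memory at once.
    text = snappy.StreamDecompressor().decompress(data).decode("utf-8")
    return text.splitlines()


lines = raw.flatMap(decode)
print(lines.first())
```

Alternatively, recompressing the data with a Hadoop-compatible codec (or plain gzip) would let sc.textFile work unchanged; sc.sequenceFile only helps if the data was actually written as a SequenceFile, which is not the case here.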