Friday, December 6, 2013

HBase Notes

1. Pre-split regions
rm /tmp/region.splits; for i in $(seq 1 1 99); do printf "%02d00\n" $i >> /tmp/region.splits ; done
create 'networkProfile3', {NAME => 'c', BLOOMFILTER => 'ROW', REPLICATION_SCOPE => '0', VERSIONS => '1', COMPRESSION => 'SNAPPY', MIN_VERSIONS => '1', TTL => '7776000', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'}, {SPLITS_FILE => '/tmp/region.splits'}


2. flush sharded redis server

for i in $(seq 1 7)
do
  host="flredis0${i}-mini.private"
  ssh $host '(echo "flushdb" | redis-cli)'
done

Tuesday, November 19, 2013

Scala Notes

1. A function that calls an async function should return a Future too.

def foo(b: B): Future[A] = {
    val c = async {
       ...
    }

2. Access a future (this for/yield is the last expression of foo, so the trailing brace closes foo):

    for {
      s <- c
    } yield {
      ...
    }
}

3. Turn a list of futures into a future of a list:
Future.sequence(x)

4. Wrap a plain value in an already-completed Future:
Future.successful(None)

Thursday, November 14, 2013

Test a UDF


 java -cp ./insights-etl-0.7.6.jar:~/insights/lib/*:/usr/lib/hive/lib/* com.klout.perk.GnipTupleUDTF

Friday, November 1, 2013

Hive Mapside Join Configuration

Disable map-side join:
set hive.auto.convert.join=false;

For other settings, go to https://cwiki.apache.org/confluence/display/Hive/Configuration+Properties and search for "mapside join".
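
For a single query, the same setting can go inline; a minimal sketch (table names are made up):

# force a common (shuffle) join instead of a map-side join for this query only
hive -e "
set hive.auto.convert.join=false;
select b.id, s.name
from big_table b
join small_table s on b.id = s.id;
"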

Wednesday, September 25, 2013

Scala

This error means that different branches return different (case class / tuple) types, so the compiler infers their common supertype, Product with Serializable, instead of the expected (T, U):

 Cannot prove that Product with Serializable <:< (T, U).

Tuesday, September 17, 2013

ElasticSearch

1. Elasticsearch result paging (from/size): the query below returns up to 100 documents, starting from the first one.

_search?from=0&size=100
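
A minimal sketch with curl, assuming a node on localhost:9200 and an index named profiles (both made up):

# fetch the first 100 documents; increase 'from' to page through results
curl "http://localhost:9200/profiles/_search?from=0&size=100"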

Thursday, August 22, 2013

Bash Looping

# number of times to try
times=36
# interval between tries, in seconds
interval=450

# on Mondays (day of week 1), allow three times as many tries
if [ $(date +%u) -eq 1 ]
then
 times=108
fi

echo "check if /data/hive/insights/brand/irm-free-insights-pipeline/${date}/_SUCCESS exits."
i=0
while [ $i -lt $times ]
do
s=$(ssh insights@jobs-aa-sched1 "(hadoop fs -ls /data/hive/insights/brand/irm-free-insights-pipeline/${date} | grep _SUCCESS | wc -l)")
if [ $s -eq 1 ]
then
  echo "perks pipeline success."
  exit 0
else
  echo "sleep for $interval seconds and check again, tried $i out of $times"
  sleep $interval
fi
i=$(expr $i + 1)
done

echo "insights pipeline failed."
exit 1

Wednesday, August 14, 2013

Hubot Lock Structure

1. Two kinds of locks:

"Uploading Lock": a single znode in ZooKeeper
"Health": a znode with two children, "targetingA" and "targetingB"

2. Whenever we upload (index) to one cluster, we acquire the "Uploading Lock" and set "Health/targetingA" to false, so the API can no longer read from targetingA. Setting "Health/targetingB" to true (we call this "overwrite to targetingB") allows the API to read from targetingB.

3. "Uploading Lock" make sure anytime, only one cluster is doing uploading.
"Health" lock mark the one API can use.

Tuesday, August 6, 2013

Several Hive Problems

1.
insert overwrite table foo
select a.*
from
(select c, d, e from too1
union all
select c, d, e from too2
) a


This won't do what you want if the column order in foo is anything other than c, d, e:
Hive maps data to columns by position, not by name (as long as the column types happen to match).

2.
select * from table where id not in ('a', 'b', 'c');

The records filtered out are not only those with id = 'a', 'b', or 'c': rows where id is NULL are also filtered out, because NOT IN evaluates to NULL for them (see the sketch after item 3).

3.
Avoid grouping by too many columns, especially long strings: it slows the job down and makes it easy to hit errors while processing rows. Group by the key columns (id1, id2, id3) and pull the remaining columns through with group_first(other column).
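
For problem 2 above, a minimal sketch of keeping the NULL rows (table name is made up):

# NOT IN evaluates to NULL for NULL ids, so those rows are silently dropped;
# keep them by checking for NULL explicitly
hive -e "
select * from some_table
where id not in ('a', 'b', 'c') or id is null;
"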

Thursday, July 11, 2013

Backfill HBase Data

put 'perk_benchmark_complete', '51c0a5a5e4b04e95568bc531', 'c:q', '{"benchmark":{"score":{"other":25.73531446900622,"self":63.61400714763741},"contentCreated":{"other":39.478001720716996,"self":181.0},"contentCreatedPerUser":{"other":0.04595809280642258,"self":0.21071012805587894},"networkSize":{"other":175.5376223347399,"self":1898.0},"totalTrueReach":{"other":933052.3242595217,"self":7167783.878724266},"totalImpressions":{"other":1925504.5258158229,"self":1.0672166152029522E7},"trueImpressions":{"other":1925504.5258158229,"self":1.0672166152029522E7}},"brandComparison":{"size":{"peer":-1.0,"self":-1.0},"averageScore":{"peer":-1.0,"self":-1.0},"averageContent":{"peer":-1.0,"self":-1.0},"feedback":{"peer":-1.0,"self":-1.0}}}'


deleteall 'perk_benchmark_complete', '51c0a5a5e4b04e95568bc531'


scan 'perk_benchmark_complete', {STARTROW=>'51c0a5a5e4b04e95568bc531', LIMIT=>3}

Wednesday, May 29, 2013

Bloom Filter and Distributed Map


insert overwrite local directory 'delisted_user_${networkAbbr}'
select
  ks_uid , delist_date
from
  delist_user
where dt=${dateString}
...
 (select  *
    from collect_moment_contrib_view_${networkAbbr}
    where ! bloom_contains(concat( cast(ks_uid as string), "_", content_id),
           distributed_bloom( 'dup_moment_bloom_${networkAbbr}'))
      and ! bloom_contains( cast(ks_uid as string),
           distributed_bloom( 'optout_bloom_${networkAbbr}'))
      and distributed_map( ks_uid, "delisted_user_${networkAbbr}" ) is null
insert overwrite local directory 'dup_moment_bloom_${networkAbbr}'
select bloom( concat(cast(ks_uid as string), "_", content_id) )
 from duplicate_moments
    where dt = ${dateString}
      and network_abbr = "${networkAbbr}"
      and label = "DUPLICATE"
;
add file dup_moment_bloom_${networkAbbr};

Wednesday, May 15, 2013

Distcp between CDH3 and CDH4

Assume aa is CDH3 and dev is CDH4:

Run this on jobs-aa
hadoop distcp hftp://jobs-aa-hnn/root/my/dir hdfs://jobs-dev-hnn/root/my/dir

Reverse direction:


hadoop distcp hdfs://jobs-dev-hnn:50070/root/my/dir hdfs://jobs-aa-hnn/root/my/dir
(you have to include this port number because CDH4 requires it)

Monday, April 22, 2013

Read content of HFile via CLI

hbase org.apache.hadoop.hbase.io.hfile.HFile -p -f hdfs://jobs-aa-hnn:8020/data/prod/jobs/hfiles/primaryNetworkProfile/20130415/output/c/d3cc3d77adb8451187be4123a0964062

Wednesday, April 10, 2013

Bash Command

1. cut, rev, uniq, sort

cat 111 | egrep -o "Deleted: /.*/[0-9]{8}" | rev | cut -d "/" -f2- | rev | uniq -c | sort -nr


2. egrep all numbers and sum them up

cat 111 | egrep -o "\[[0-9]+\] bytes" | egrep -o "[0-9]+" | awk '{sum+=$1} END {print sum}'

3. sh bash.sh parameters

"$@" expands to all the parameters passed to the script (a short sketch is at the end of this note).

4. strace all system call logs of a specific bash command

   strace -fvo /home/insights/insights/hive/tt3 -e\!futex -s 8192 bash ./hv


grep " open(" tt3|grep -v ENOENT|grep -v WR|awk -F\" '{print $2}'|sort -u | sed 's/home\/insights/xxx/g'|sed 's/xxx\/insights/yyy/g' | sed 's/yyy-etl-0.3.9-bin/yyy-etl-1.60-cdh4-bin/g' > files.7


5. For BSD or GNU grep you can use -B num to set how many lines before the match and -A num for the number of lines after the match.
grep -B 3 -A 2 foo README.txt
If you want the same amount of lines before and after you can use -C num.
grep -C 3 foo README.txt


6. Copy to or paste from the clipboard (macOS)

pbcopy
pbpaste



7. Print number of fields of each line delimited by '\t'

cat kfb_topic_task1_v4 | awk -F'\t' '{print NF}' 

8. Redirection doesn't work with sudo.
For example, this fails if you don't have permission to write the file, because sudo applies to echo but not to the redirection:

sudo echo 1 > /proc/sys/vm/overcommit_memory

To solve this:
sudo sh -c 'echo 1 > /proc/sys/vm/overcommit_memory'

You can also do it easily with tee: echo 1 | sudo tee /proc/sys/vm/overcommit_memory
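
For item 3 above, a minimal sketch of "$@" (the script is made up):

#!/bin/bash
# args.sh: print each argument on its own line;
# quoting "$@" preserves arguments that contain spaces
for arg in "$@"; do
  echo "$arg"
done

# usage: sh args.sh one "two words" three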




Friday, April 5, 2013

Regex



re.match(r"^[a-z]+[*]?$", s)
  1. The ^ matches the start of the string.
  2. The [a-z]+ matches one or more lowercase letters.
  3. The [*]? matches zero or one asterisks.
  4. The $ matches the end of the string.
The original regex (before the fix) matched exactly one lowercase character followed by one or more asterisks.

Monday, April 1, 2013

HBase Maintenance Tool


Usage: fsck [opts] {only tables}
 where [opts] are:
   -help Display help options (this)
   -details Display full report of all regions.
   -timelag {timeInSeconds}  Process only regions that  have not experienced any metadata updates in the last  {{timeInSeconds} seconds.
   -sleepBeforeRerun {timeInSeconds} Sleep this many seconds before checking if the fix worked if run with -fix
   -summary Print only summary of the tables and status.
   -metaonly Only check the state of ROOT and META tables.

  Metadata Repair options: (expert features, use with caution!)
   -fix              Try to fix region assignments.  This is for backwards compatiblity
   -fixAssignments   Try to fix region assignments.  Replaces the old -fix
   -fixMeta          Try to fix meta problems.  This assumes HDFS region info is good.
   -fixHdfsHoles     Try to fix region holes in hdfs.
   -fixHdfsOrphans   Try to fix region dirs with no .regioninfo file in hdfs
   -fixHdfsOverlaps  Try to fix region overlaps in hdfs.
   -fixVersionFile   Try to fix missing hbase.version file in hdfs.
   -maxMerge <n>     When fixing region overlaps, allow at most <n> regions to merge. (n=5 by default)
   -sidelineBigOverlaps  When fixing region overlaps, allow to sideline big overlaps
   -maxOverlapsToSideline <n>  When fixing region overlaps, allow at most <n> regions to sideline per group. (n=2 by default)
   -fixSplitParents  Try to force offline split parents to be online.
   -ignorePreCheckPermission  ignore filesystem permission pre-check

  Datafile Repair options: (expert features, use with caution!)
   -checkCorruptHFiles     Check all Hfiles by opening them to make sure they are valid
   -sidelineCorruptHfiles  Quarantine corrupted HFiles.  implies -checkCorruptHfiles

  Metadata Repair shortcuts
   -repair           Shortcut for -fixAssignments -fixMeta -fixHdfsHoles -fixHdfsOrphans -fixHdfsOverlaps -fixVersionFile -sidelineBigOverlaps
   -repairHoles      Shortcut for -fixAssignments -fixMeta -fixHdfsHoles

Thursday, March 28, 2013

Bash Tool: ps

Print all processes:

ps -ef

Bash Tool: Crontab

1. To edit the cron config file:

crontab -e

2. To print the cron config file:

crontab -l
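
An example entry (added via crontab -e; the paths are made up):

# run a pipeline check every day at 02:00 and append output to a log
0 2 * * * /home/insights/bin/check_pipeline.sh >> /home/insights/logs/check_pipeline.log 2>&1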

Wednesday, March 27, 2013

Bash pipe redirection


To redirect stdout in bash, overwriting file
cmd > file.txt
To redirect stdout in bash, appending to file
cmd >> file.txt
To redirect both stdout and stderr, overwriting
cmd &> file.txt

To redirect both stdout and stderr, appending to file
cmd >>file.txt 2>&1

Monday, March 25, 2013

Java heap space / GC overhead limit exceeded issues


set hive.map.aggr=true;
set hive.map.aggr.hash.force.flush.memory.threshold=0.75;
set hive.map.aggr.hash.percentmemory=0.3;
set hive.groupby.mapaggr.checkinterval=10000;
set mapred.child.java.opts=-Xmx3072M;
set hive.exec.compress.output=true;
set io.seqfile.compression.type=BLOCK;
All of those are good parameters,
except maybe set hive.exec.compress.output=true;
and
set io.seqfile.compression.type=BLOCK;

Wednesday, March 20, 2013

Awk

cat ~/12 | awk '{print "/data/prod/"$1}'




echo "list_snapshots" | hbase shell | egrep "\([A-Za-z]{3} [A-Za-z]{3} [0-9]{2} [0-9]{2}:[0-9]{2}:[0-9]{2} \+[0-9]{4} [0-9]{4}\)" | grep $(date +%b) | awk -F ' ' '{printf ("delete_snapshot '\''%s'\''\n", $1)}' | hbase shell




cat 1  | grep -v main | grep "\[.*\]" | egrep -o "\"[^,|^\"]*\"" | tr -d '"' | awk -v date=$dateString '{printf("snapshot '\''%s'\'', '\''%s-snapshot-%s'\''\n", $1, $1, date)}'

Weird Oozie job error: No input path specified


Today I debugged a strange Oozie error with a couple of colleagues. The MapReduce job failed with
"No input path specified"
even though the input dir was set in the configuration and the workflow.
It turned out we were missing the two properties that tell Oozie to use the new MapReduce API:

<property>
    <name>mapred.mapper.new-api</name>
    <value>true</value>
</property>
<property>
    <name>mapred.reducer.new-api</name>
    <value>true</value>
</property>

Tuesday, March 19, 2013

Bash Quick Notes

1. Loop over each line in a file (this splits on whitespace; a whitespace-safe variant is sketched at the end of this note):

for line in $(cat 2); do echo $line; done;

2. Sed
Prefer '|' as the delimiter if possible:

sed 's|my/home/directory||g' < in > out

In-place replacement:
sed -i 's|analytics/etl/maxwell/src/assembly/hive/maxwell/||g' in


3. Sed replace \n to ,\n

sed ':a;N;$!ba;s/\n/,\n/g'

4. sort by column

sort -t "," -k 2 -n input.csv
sorted numerically by column 2

5. bash loop in number

for i in $(seq 0 855)
do
date=$(date --date "$i day ago" "+%Y%m%d")
echo "alter table oauth_user_services add if not exists partition (dt = '$date') location '$date';" >> partitions
done
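
For item 1 above, a whitespace-safe variant (file name kept from the note):

# read the file line by line without word-splitting or glob expansion
while IFS= read -r line; do
  echo "$line"
done < 2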

Friday, March 15, 2013

HBase Lock and Override

The HBase lock acts as a gatekeeper.

Before bulkloading, set the HBase lock first, then set the HBase override.

After bulkloading, release the override first, then unlock HBase.

Thursday, March 7, 2013

SSH Key

ssh-keygen
Generate it with no passphrase.
Keep ~/.ssh at permission 700.
Keep ~/.ssh/id_rsa at permission 600.
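
A minimal sketch of the whole setup (the remote host is made up):

# generate a key with no passphrase and keep the expected permissions
ssh-keygen -t rsa -N "" -f ~/.ssh/id_rsa
chmod 700 ~/.ssh
chmod 600 ~/.ssh/id_rsa
ssh-copy-id user@somehost    # append the public key to the remote authorized_keys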



Thursday, February 21, 2013

Hive table for lzo.deflated files

The table should be declared as stored as textfile, but a plain select * ... where partition = ... can't read it.

When you do select, you must force Hive to run a MapReduce job, for example select count(*).
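
A minimal sketch of what that looks like (table, columns, and paths are made up; the point is "stored as textfile" plus a query that forces MapReduce):

hive -e "
create external table if not exists raw_events (line string)
partitioned by (dt string)
stored as textfile
location '/data/raw/events';

alter table raw_events add if not exists partition (dt = '20130221') location '20130221';

-- per the note above, a plain 'select *' over the partition can fail here;
-- an aggregate such as count(*) forces a MapReduce job and works
select count(*) from raw_events where dt = '20130221';
"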

Wednesday, January 16, 2013

Git Quick Notes


1. Check unpushed commits
git log origin/master..HEAD
git diff origin/master..HEAD

2. Revert uncommitted changes

# Revert changes to modified files.
git reset --hard

# Remove all untracked files and directories.
git clean -fd


3. Unstage a staged file
git reset HEAD <file>

4. Amend the last commit:

$ git commit -m 'initial commit'
$ git add forgotten_file
$ git commit --amend

5. Show the diff of a commit by its hash:
git show 7f1ef64274b588b8d7430f31fbf915257a605f45

6. Reset an unpushed commit:
Delete the most recent commit:
git reset --hard HEAD~1    (or a commit hash)
Delete the most recent commit without destroying the work you've done:
git reset --soft HEAD~1    (or a commit hash)

7. Revert a single file:
git checkout filename
(git reset --hard would revert all changes, not just one file.)

8. Keep others' changes out of my commit history (rebase workflow)
git checkout master
git pull --rebase
git checkout -b your-branch

git commit -m "something"

git commit -m "more things"

git checkout master

git pull --rebase   (puts your commits on top of the stack), or git pull -r

git checkout your-branch

git rebase master

git checkout master

git merge your-branch   (fast-forward merge; then push)


9. git rebase -i HEAD~2
combine last two commits

Friday, January 11, 2013

Eclipse is slow

Tweak eclipse.ini for more heap space:

eclipse -vmargs -Xms512m -Xmx1024m

Tuesday, January 8, 2013

HBase CopyTable convenience job


hbase org.apache.hadoop.hbase.mapreduce.CopyTable --peer.adr=jobs-dev-zoo1,jobs-dev-zoo2,jobs-dev-zoo3:2181:/hbase tableName

reference : http://hbase.apache.org/book/ops_mgt.html#copytable


On the other hand, copying files between clusters is convenient with distcp:


hadoop distcp hdfs://jobs-hnn1/data/prod/jobs/mr/gnip/harvester/user_data/20130110/20130110004838936/output/scor* hdfs://jobs-hnn2/data/prod/jobs/mr/gnip/harvester/user_data/20130110/20130110004838936/output/

Maven version

Sometimes, when an artifact goes missing in an extremely strange way, don't forget to check that your Maven version is compatible with the old pom.

Saturday, January 5, 2013

When your MapReduce job yells "Too many fetch-failures"

That happens when too many fetch failures occur on a specific reducer task node.

Three properties are worth checking for this issue:

- mapred.reduce.slowstart.completed.maps = 0.80

allows reducers from other jobs to run while a big job waits on mappers

- tasktracker.http.threads = 80

specifies the number of HTTP threads a task tracker uses to serve map output to reducers

- mapred.reduce.parallel.copies = sqrt(#of nodes) with a floor of 10

number of parallel copies used by reducers to fetch map output

Friday, January 4, 2013

SSH pub key auto pass-through (agent forwarding)

Scenario:
Log in from your laptop to a cluster and then go from there to some other cluster. You need to set up agent forwarding so your SSH key passes through automatically.

1. Make sure all your SSH keys are added to the ssh agent:
ssh-add
ssh-add -l

2. ssh to the intermediate cluster with -A (agent forwarding):
ssh -A user@host
ssh-add -l

3. ssh to the destination cluster:
ssh user@destination
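
The -A flag can also be made permanent per host in ~/.ssh/config (host name is made up):

# forward the agent automatically whenever connecting to the jump host
cat >> ~/.ssh/config <<'EOF'
Host jumpbox.example.com
    ForwardAgent yes
EOF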

Thursday, January 3, 2013