Friday, December 6, 2013

HBase Notes

1. Pre-split regions
rm /tmp/region.splits; for i in $(seq 1 1 99); do printf "%02d00\n" $i >> /tmp/region.splits ; done
create 'networkProfile3', {NAME => 'c', BLOOMFILTER => 'ROW', REPLICATION_SCOPE => '0', VERSIONS => '1', COMPRESSION => 'SNAPPY', MIN_VERSIONS => '1', TTL => '7776000', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'}, {SPLITS_FILE => '/tmp/region.splits'}


2. flush sharded redis server

for i in $(seq 1 7)
do
  host="flredis0${i}-mini.private"
  ssh $host '(echo "flushdb" | redis-cli)'
done

Tuesday, November 19, 2013

Scala Notes

1. A function that calls an async function should return a Future too.

def foo(b: B): Future[A] = {
    val c = async {
       ...
    }

2. Access a future (this for/yield is the last expression of foo, so the trailing brace closes foo):

    for {
      s <- c
    } yield {
      ...
    }
}

3. Turn a list of futures into a future of a list:
Future.sequence(x)

4. Wrap a plain value in an already-completed Future:
Future.successful(None)

Thursday, November 14, 2013

Test a UDF


 java -cp ./insights-etl-0.7.6.jar:~/insights/lib/*:/usr/lib/hive/lib/* com.klout.perk.GnipTupleUDTF

Friday, November 1, 2013

Hive Mapside Join Configuration

Disable map-side join:
set hive.auto.convert.join=false;

For other settings, go to https://cwiki.apache.org/confluence/display/Hive/Configuration+Properties and search for "mapside join".
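
For a single query, the same setting can go inline; a minimal sketch (table names are made up):

# force a common (shuffle) join instead of a map-side join for this query only
hive -e "
set hive.auto.convert.join=false;
select b.id, s.name
from big_table b
join small_table s on b.id = s.id;
"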

Wednesday, September 25, 2013

Scala

This error means that different branches return different (case class / tuple) types, so the compiler infers their common supertype, Product with Serializable, instead of the expected (T, U):

 Cannot prove that Product with Serializable <:< (T, U).

Tuesday, September 17, 2013

ElasticSearch

1. Elasticsearch result paging (from/size): the query below returns up to 100 documents, starting from the first one.

_search?from=0&size=100
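
A minimal sketch with curl, assuming a node on localhost:9200 and an index named profiles (both made up):

# fetch the first 100 documents; increase 'from' to page through results
curl "http://localhost:9200/profiles/_search?from=0&size=100"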

Thursday, August 22, 2013

Bash Looping

# number of times to try
times=36
# interval between tries, in seconds
interval=450

# on Mondays (day of week 1), allow three times as many tries
if [ $(date +%u) -eq 1 ]
then
 times=108
fi

echo "check if /data/hive/insights/brand/irm-free-insights-pipeline/${date}/_SUCCESS exits."
i=0
while [ $i -lt $times ]
do
s=$(ssh insights@jobs-aa-sched1 "(hadoop fs -ls /data/hive/insights/brand/irm-free-insights-pipeline/${date} | grep _SUCCESS | wc -l)")
if [ $s -eq 1 ]
then
  echo "perks pipeline success."
  exit 0
else
  echo "sleep for $interval seconds and check again, tried $i out of $times"
  sleep $interval
fi
i=$(expr $i + 1)
done

echo "insights pipeline failed."
exit 1

Wednesday, August 14, 2013

Hubot Lock Structure

1. Two kinds of locks:

"Uploading Lock": a single znode in ZooKeeper
"Health": a znode with two children, "targetingA" and "targetingB"

2. Whenever we upload (index) to one cluster, we acquire the "Uploading Lock" and set "Health/targetingA" to false, so the API can no longer read from targetingA. Setting "Health/targetingB" to true (we call this "overwrite to targetingB") allows the API to read from targetingB.

3. "Uploading Lock" make sure anytime, only one cluster is doing uploading.
"Health" lock mark the one API can use.

Tuesday, August 6, 2013

Several Hive Problems

1.
insert overwrite table foo
select a.*
from
(select c, d, e from too1
union all
select c, d, e from too2
) a


This won't do what you want if the column order in foo is anything other than c, d, e:
Hive maps data to columns by position, not by name (as long as the column types happen to match).

2.
select * from table where id not in ('a', 'b', 'c');

The records filtered out are not only those with id = 'a', 'b', or 'c': rows where id is NULL are also filtered out, because NOT IN evaluates to NULL for them (see the sketch after item 3).

3.
Avoid grouping by too many columns, especially long strings: it slows the job down and makes it easy to hit errors while processing rows. Group by the key columns (id1, id2, id3) and pull the remaining columns through with group_first(other column).
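
For problem 2 above, a minimal sketch of keeping the NULL rows (table name is made up):

# NOT IN evaluates to NULL for NULL ids, so those rows are silently dropped;
# keep them by checking for NULL explicitly
hive -e "
select * from some_table
where id not in ('a', 'b', 'c') or id is null;
"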

Thursday, July 11, 2013

Backfill HBase Data

put 'perk_benchmark_complete', '51c0a5a5e4b04e95568bc531', 'c:q', '{"benchmark":{"score":{"other":25.73531446900622,"self":63.61400714763741},"contentCreated":{"other":39.478001720716996,"self":181.0},"contentCreatedPerUser":{"other":0.04595809280642258,"self":0.21071012805587894},"networkSize":{"other":175.5376223347399,"self":1898.0},"totalTrueReach":{"other":933052.3242595217,"self":7167783.878724266},"totalImpressions":{"other":1925504.5258158229,"self":1.0672166152029522E7},"trueImpressions":{"other":1925504.5258158229,"self":1.0672166152029522E7}},"brandComparison":{"size":{"peer":-1.0,"self":-1.0},"averageScore":{"peer":-1.0,"self":-1.0},"averageContent":{"peer":-1.0,"self":-1.0},"feedback":{"peer":-1.0,"self":-1.0}}}'


deleteall 'perk_benchmark_complete', '51c0a5a5e4b04e95568bc531'


scan 'perk_benchmark_complete', {STARTROW=>'51c0a5a5e4b04e95568bc531', LIMIT=>3}

Wednesday, May 29, 2013

Bloom Filter and Distributed Map


insert overwrite local directory 'delisted_user_${networkAbbr}'
select
  ks_uid , delist_date
from
  delist_user
where dt=${dateString}
...
 (select  *
    from collect_moment_contrib_view_${networkAbbr}
    where ! bloom_contains(concat( cast(ks_uid as string), "_", content_id),
           distributed_bloom( 'dup_moment_bloom_${networkAbbr}'))
      and ! bloom_contains( cast(ks_uid as string),
           distributed_bloom( 'optout_bloom_${networkAbbr}'))
      and distributed_map( ks_uid, "delisted_user_${networkAbbr}" ) is null
insert overwrite local directory 'dup_moment_bloom_${networkAbbr}'
select bloom( concat(cast(ks_uid as string), "_", content_id) )
 from duplicate_moments
    where dt = ${dateString}
      and network_abbr = "${networkAbbr}"
      and label = "DUPLICATE"
;
add file dup_moment_bloom_${networkAbbr};

Wednesday, May 15, 2013

Distcp between CDH3 and CDH4

Assume aa is CDH3 and dev is CDH4:

Run this on jobs-aa
hadoop distcp hftp://jobs-aa-hnn/root/my/dir hdfs://jobs-dev-hnn/root/my/dir

Reverse direction:


hadoop distcp hdfs://jobs-dev-hnn:50070/root/my/dir hdfs://jobs-aa-hnn/root/my/dir
(you have to include this port number because CDH4 requires it)

Monday, April 22, 2013

Read content of HFile via CLI

hbase org.apache.hadoop.hbase.io.hfile.HFile -p -f hdfs://jobs-aa-hnn:8020/data/prod/jobs/hfiles/primaryNetworkProfile/20130415/output/c/d3cc3d77adb8451187be4123a0964062

Wednesday, April 10, 2013

Bash Command

1. cut, rev, uniq, sort

cat 111 | egrep -o "Deleted: /.*/[0-9]{8}" | rev | cut -d "/" -f2- | rev | uniq -c | sort -nr


2. egrep all numbers and sum them up

cat 111 | egrep -o "\[[0-9]+\] bytes" | egrep -o "[0-9]+" | awk '{sum+=$1} END {print sum}'

3. sh bash.sh parameters

"$@" expands to all the parameters passed to the script (a short sketch is at the end of this note).

4. strace all system call logs of a specific bash command

   strace -fvo /home/insights/insights/hive/tt3 -e\!futex -s 8192 bash ./hv


grep " open(" tt3|grep -v ENOENT|grep -v WR|awk -F\" '{print $2}'|sort -u | sed 's/home\/insights/xxx/g'|sed 's/xxx\/insights/yyy/g' | sed 's/yyy-etl-0.3.9-bin/yyy-etl-1.60-cdh4-bin/g' > files.7


5. For BSD or GNU grep you can use -B num to set how many lines before the match and -A num for the number of lines after the match.
grep -B 3 -A 2 foo README.txt
If you want the same amount of lines before and after you can use -C num.
grep -C 3 foo README.txt


6. Copy to or paste from the clipboard (macOS)

pbcopy
pbpaste



7. Print number of fields of each line delimited by '\t'

cat kfb_topic_task1_v4 | awk -F'\t' '{print NF}' 

8. Redirection doesn't work with sudo.
For example, this fails if you don't have permission to write the file, because sudo applies to echo but not to the redirection:

sudo echo 1 > /proc/sys/vm/overcommit_memory

To solve this:
sudo sh -c 'echo 1 > /proc/sys/vm/overcommit_memory'

You can also do it easily with tee: echo 1 | sudo tee /proc/sys/vm/overcommit_memory
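
For item 3 above, a minimal sketch of "$@" (the script is made up):

#!/bin/bash
# args.sh: print each argument on its own line;
# quoting "$@" preserves arguments that contain spaces
for arg in "$@"; do
  echo "$arg"
done

# usage: sh args.sh one "two words" three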




Friday, April 5, 2013

Regex



re.match(r"^[a-z]+[*]?$", s)
  1. The ^ matches the start of the string.
  2. The [a-z]+ matches one or more lowercase letters.
  3. The [*]? matches zero or one asterisks.
  4. The $ matches the end of the string.
The original regex (before the fix) matched exactly one lowercase character followed by one or more asterisks.

Monday, April 1, 2013

HBase Maintenance Tool


Usage: fsck [opts] {only tables}
 where [opts] are:
   -help Display help options (this)
   -details Display full report of all regions.
   -timelag {timeInSeconds}  Process only regions that  have not experienced any metadata updates in the last  {{timeInSeconds} seconds.
   -sleepBeforeRerun {timeInSeconds} Sleep this many seconds before checking if the fix worked if run with -fix
   -summary Print only summary of the tables and status.
   -metaonly Only check the state of ROOT and META tables.

  Metadata Repair options: (expert features, use with caution!)
   -fix              Try to fix region assignments.  This is for backwards compatiblity
   -fixAssignments   Try to fix region assignments.  Replaces the old -fix
   -fixMeta          Try to fix meta problems.  This assumes HDFS region info is good.
   -fixHdfsHoles     Try to fix region holes in hdfs.
   -fixHdfsOrphans   Try to fix region dirs with no .regioninfo file in hdfs
   -fixHdfsOverlaps  Try to fix region overlaps in hdfs.
   -fixVersionFile   Try to fix missing hbase.version file in hdfs.
   -maxMerge <n>     When fixing region overlaps, allow at most <n> regions to merge. (n=5 by default)
   -sidelineBigOverlaps  When fixing region overlaps, allow to sideline big overlaps
   -maxOverlapsToSideline <n>  When fixing region overlaps, allow at most <n> regions to sideline per group. (n=2 by default)
   -fixSplitParents  Try to force offline split parents to be online.
   -ignorePreCheckPermission  ignore filesystem permission pre-check

  Datafile Repair options: (expert features, use with caution!)
   -checkCorruptHFiles     Check all Hfiles by opening them to make sure they are valid
   -sidelineCorruptHfiles  Quarantine corrupted HFiles.  implies -checkCorruptHfiles

  Metadata Repair shortcuts
   -repair           Shortcut for -fixAssignments -fixMeta -fixHdfsHoles -fixHdfsOrphans -fixHdfsOverlaps -fixVersionFile -sidelineBigOverlaps
   -repairHoles      Shortcut for -fixAssignments -fixMeta -fixHdfsHoles

Thursday, March 28, 2013

Bash Tool: ps

Print all processes:

ps -ef

Bash Tool: Crontab

1. To edit the cron config file:

crontab -e

2. To print the cron config file:

crontab -l
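
An example entry (added via crontab -e; the paths are made up):

# run a pipeline check every day at 02:00 and append output to a log
0 2 * * * /home/insights/bin/check_pipeline.sh >> /home/insights/logs/check_pipeline.log 2>&1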

Wednesday, March 27, 2013

Bash pipe redirection


To redirect stdout in bash, overwriting file
cmd > file.txt
To redirect stdout in bash, appending to file
cmd >> file.txt
To redirect both stdout and stderr, overwriting
cmd &> file.txt

To redirect both stdout and stderr, appending to file
cmd >>file.txt 2>&1

Monday, March 25, 2013

Java heap space / GC overhead limit exceeded issues


set hive.map.aggr=true;
set hive.map.aggr.hash.force.flush.memory.threshold=0.75;
set hive.map.aggr.hash.percentmemory=0.3;
set hive.groupby.mapaggr.checkinterval=10000;
set mapred.child.java.opts=-Xmx3072M;
set hive.exec.compress.output=true;
set io.seqfile.compression.type=BLOCK;
All of those are good parameters,
except maybe set hive.exec.compress.output=true;
and
set io.seqfile.compression.type=BLOCK;

Wednesday, March 20, 2013

Awk

cat ~/12 | awk '{print "/data/prod/"$1}'




echo "list_snapshots" | hbase shell | egrep "\([A-Za-z]{3} [A-Za-z]{3} [0-9]{2} [0-9]{2}:[0-9]{2}:[0-9]{2} \+[0-9]{4} [0-9]{4}\)" | grep $(date +%b) | awk -F ' ' '{printf ("delete_snapshot '\''%s'\''\n", $1)}' | hbase shell




cat 1  | grep -v main | grep "\[.*\]" | egrep -o "\"[^,|^\"]*\"" | tr -d '"' | awk -v date=$dateString '{printf("snapshot '\''%s'\'', '\''%s-snapshot-%s'\''\n", $1, $1, date)}'

Weird Oozie job error: No input path specified


Today I debugged a strange Oozie error with a couple of colleagues. The MapReduce job failed with
"No input path specified"
even though the input dir was set in the configuration and the workflow.
It turned out we were missing the two properties that tell Oozie to use the new MapReduce API:

<property>
    <name>mapred.mapper.new-api</name>
    <value>true</value>
</property>
<property>
    <name>mapred.reducer.new-api</name>
    <value>true</value>
</property>

Tuesday, March 19, 2013

Bash Quick Notes

1. Loop over each line in a file (this splits on whitespace; a whitespace-safe variant is sketched at the end of this note):

for line in $(cat 2); do echo $line; done;

2. Sed
Prefer '|' as the delimiter if possible:

sed 's|my/home/directory||g' < in > out

In-place replacement:
sed -i 's|analytics/etl/maxwell/src/assembly/hive/maxwell/||g' in


3. Sed replace \n to ,\n

sed ':a;N;$!ba;s/\n/,\n/g'

4. sort by column

sort -t "," -k 2 -n input.csv
sorted numerically by column 2

5. bash loop in number

for i in $(seq 0 855)
do
date=$(date --date "$i day ago" "+%Y%m%d")
echo "alter table oauth_user_services add if not exists partition (dt = '$date') location '$date';" >> partitions
done
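
For item 1 above, a whitespace-safe variant (file name kept from the note):

# read the file line by line without word-splitting or glob expansion
while IFS= read -r line; do
  echo "$line"
done < 2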

Friday, March 15, 2013

HBase Lock and Override

The HBase lock acts as a gatekeeper.

Before bulkloading, set the HBase lock first, then set the HBase override.

After bulkloading, release the override first, then unlock HBase.

Thursday, March 7, 2013

SSH Key

ssh-keygen
Generate it with no passphrase.
Keep ~/.ssh at permission 700.
Keep ~/.ssh/id_rsa at permission 600.
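
A minimal sketch of the whole setup (the remote host is made up):

# generate a key with no passphrase and keep the expected permissions
ssh-keygen -t rsa -N "" -f ~/.ssh/id_rsa
chmod 700 ~/.ssh
chmod 600 ~/.ssh/id_rsa
ssh-copy-id user@somehost    # append the public key to the remote authorized_keys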



Thursday, February 21, 2013

Hive table for lzo.deflated files

The table should be declared as stored as textfile, but a plain select * ... where partition = ... can't read it.

When you do select, you must force Hive to run a MapReduce job, for example select count(*).
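
A minimal sketch of what that looks like (table, columns, and paths are made up; the point is "stored as textfile" plus a query that forces MapReduce):

hive -e "
create external table if not exists raw_events (line string)
partitioned by (dt string)
stored as textfile
location '/data/raw/events';

alter table raw_events add if not exists partition (dt = '20130221') location '20130221';

-- per the note above, a plain 'select *' over the partition can fail here;
-- an aggregate such as count(*) forces a MapReduce job and works
select count(*) from raw_events where dt = '20130221';
"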

Wednesday, January 16, 2013

Git Quick Notes


1. Check unpushed commits
git log origin/master..HEAD
git diff origin/master..HEAD

2. Revert uncommitted changes

# Revert changes to modified files.
git reset --hard

# Remove all untracked files and directories.
git clean -fd


3. Unstage a staged file
git reset HEAD <file>

4. Amend the last commit:

$ git commit -m 'initial commit'
$ git add forgotten_file
$ git commit --amend

5. Show the diff of a commit by its hash:
git show 7f1ef64274b588b8d7430f31fbf915257a605f45

6. Reset an unpushed commit:
Delete the most recent commit:
git reset --hard HEAD~1    (or a commit hash)
Delete the most recent commit without destroying the work you've done:
git reset --soft HEAD~1    (or a commit hash)

7. Revert a single file:
git checkout filename
(git reset --hard would revert all changes, not just one file.)

8. Keep others' changes out of my commit history (rebase workflow)
git checkout master
git pull --rebase
git checkout -b your-branch

git commit -m "something"

git commit -m "more things"

git checkout master

git pull --rebase   (puts your commits on top of the stack), or git pull -r

git checkout your-branch

git rebase master

git checkout master

git merge your-branch   (fast-forward merge; then push)


9. git rebase -i HEAD~2
combine last two commits

Friday, January 11, 2013

Eclipse is slow

Tweak eclipse.ini for more heap space:

eclipse -vmargs -Xms512m -Xmx1024m

Tuesday, January 8, 2013

HBase CopyTable convenience job


hbase org.apache.hadoop.hbase.mapreduce.CopyTable --peer.adr=jobs-dev-zoo1,jobs-dev-zoo2,jobs-dev-zoo3:2181:/hbase tableName

reference : http://hbase.apache.org/book/ops_mgt.html#copytable


On the other hand, copying files between clusters is convenient with distcp:


hadoop distcp hdfs://jobs-hnn1/data/prod/jobs/mr/gnip/harvester/user_data/20130110/20130110004838936/output/scor* hdfs://jobs-hnn2/data/prod/jobs/mr/gnip/harvester/user_data/20130110/20130110004838936/output/

Maven version

Sometimes, when an artifact goes missing in an extremely strange way, don't forget to check that your Maven version is compatible with the old pom.

Saturday, January 5, 2013

When your MapReduce job yells "Too many fetch-failures"

That happens when too many fetch failures occur on a specific reducer task node.

Three properties are worth checking for this issue:

- mapred.reduce.slowstart.completed.maps = 0.80

allows reducers from other jobs to run while a big job waits on mappers

- tasktracker.http.threads = 80

specifies the number of HTTP threads a task tracker uses to serve map output to reducers

- mapred.reduce.parallel.copies = sqrt(#of nodes) with a floor of 10

number of parallel copies used by reducers to fetch map output

Friday, January 4, 2013

SSH pub key auto pass-through (agent forwarding)

Scenario:
Log in from your laptop to a cluster and then go from there to some other cluster. You need to set up agent forwarding so your SSH key passes through automatically.

1. Make sure all your SSH keys are added to the ssh agent:
ssh-add
ssh-add -l

2. ssh to the intermediate cluster with -A (agent forwarding):
ssh -A user@host
ssh-add -l

3. ssh to the destination cluster:
ssh user@destination
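
The -A flag can also be made permanent per host in ~/.ssh/config (host name is made up):

# forward the agent automatically whenever connecting to the jump host
cat >> ~/.ssh/config <<'EOF'
Host jumpbox.example.com
    ForwardAgent yes
EOF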

Thursday, January 3, 2013