Thursday, August 22, 2013

Bash Looping

# number of tries
times=36
# interval between tries, in seconds
interval=450

# on Mondays, allow three times as many tries
if [ "$(date +%u)" -eq 1 ]
then
  times=108
fi

echo "check if /data/hive/insights/brand/irm-free-insights-pipeline/${date}/_SUCCESS exists."
i=0
while [ $i -lt $times ]
do
s=$(ssh insights@jobs-aa-sched1 "(hadoop fs -ls /data/hive/insights/brand/irm-free-insights-pipeline/${date} | grep _SUCCESS | wc -l)")
if [ "$s" -eq 1 ]
then
  echo "insights pipeline succeeded."
  exit 0
else
  echo "sleeping $interval seconds before checking again (attempt $((i + 1)) of $times)"
  sleep "$interval"
fi
i=$((i + 1))
done

echo "insights pipeline failed."
exit 1
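The script above is an instance of a generic poll-with-retries pattern. A minimal Python sketch of the same loop (the function name and the injectable sleep parameter are illustrative, not part of the original script):

```python
import time

def poll_until_success(check, times, interval, sleep=time.sleep):
    """Call check() up to `times` times, waiting `interval` seconds
    between attempts; return True on the first success, else False."""
    for _ in range(times):
        if check():
            return True
        sleep(interval)
    return False
```

Exiting 0 in the bash script corresponds to returning True here; exhausting all tries corresponds to exit 1 / False.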

Wednesday, August 14, 2013

Hubot Lock Structure

1. There are two kinds of locks:

"Uploading Lock": a single node in ZooKeeper
"Health": a node with two children, "targetingA" and "targetingB"

2. Whenever uploading (indexing) to one cluster,
we acquire the "Uploading Lock" and set "Health/targetingA" to false, so the API can no longer read from targetingA.
Setting "Health/targetingB" to true (we call this "overwrite to targetingB") allows the API to read from targetingB.

3. The "Uploading Lock" ensures that only one cluster is uploading at any time.
The "Health" lock marks which cluster the API can use.
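This is not the actual Hubot code, but a minimal in-memory Python model of the protocol above (the class and method names are invented for illustration; the real locks live in ZooKeeper):

```python
import threading

class ClusterLocks:
    """Toy in-memory model of the two locks: a single 'Uploading Lock'
    plus per-cluster 'Health' flags the API consults before reading."""

    def __init__(self):
        self.uploading = threading.Lock()  # models the "Uploading Lock" node
        # models the "Health" node's two children
        self.health = {"targetingA": True, "targetingB": True}

    def start_upload(self, cluster):
        # Only one cluster may upload at a time.
        if not self.uploading.acquire(blocking=False):
            raise RuntimeError("another cluster is already uploading")
        self.health[cluster] = False  # API must stop reading this cluster

    def finish_upload(self, cluster):
        self.health[cluster] = True   # "overwrite to" the cluster
        self.uploading.release()

    def readable_clusters(self):
        # The clusters the API is currently allowed to read from.
        return [c for c, healthy in self.health.items() if healthy]
```

While targetingA is being uploaded, `readable_clusters()` returns only targetingB, and a second `start_upload` call fails until the first finishes.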

Tuesday, August 6, 2013

Several Hive Problems

1.
insert overwrite table foo
select a.*
from
(select c, d, e from too1
union all
select c, d, e from too2
) a


This wouldn't work if the column order in foo is anything other than c, d, e.
Hive maps data to columns by position, so values silently end up in the wrong columns (as long as the column types happen to match).
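SQLite's INSERT ... SELECT maps columns by position just like Hive's, so the pitfall can be reproduced with Python's built-in sqlite3 module (table names follow the example above; the fix is to reorder the SELECT to match foo's declared column order):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row
conn.execute("CREATE TABLE too1 (c TEXT, d TEXT, e TEXT)")
conn.execute("INSERT INTO too1 VALUES ('c-val', 'd-val', 'e-val')")
# foo declares its columns in a different order: d before c
conn.execute("CREATE TABLE foo (d TEXT, c TEXT, e TEXT)")
# INSERT ... SELECT maps by position, not by name: the SELECT's first
# column (c) lands in foo's first declared column (d), and vice versa
conn.execute("INSERT INTO foo SELECT c, d, e FROM too1")
row = conn.execute("SELECT * FROM foo").fetchone()
# row["d"] now holds 'c-val' and row["c"] holds 'd-val'
```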

2.
select * from table where id not in ('a', 'b', 'c');

The filter drops not only rows where id is 'a', 'b', or 'c', but also rows where id is NULL:
NULL NOT IN (...) evaluates to NULL, which the WHERE clause treats as false, so those rows are silently filtered out too.
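This is standard SQL three-valued logic, not a Hive quirk, so SQLite (via Python's sqlite3) reproduces it; adding an explicit IS NULL branch keeps the NULL rows:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id TEXT)")
conn.executemany("INSERT INTO t VALUES (?)", [("a",), ("x",), (None,)])

# NULL NOT IN ('a','b','c') evaluates to NULL, treated as false:
# the NULL row is dropped along with 'a'
kept = [r[0] for r in
        conn.execute("SELECT id FROM t WHERE id NOT IN ('a','b','c')")]

# explicit IS NULL branch keeps the NULL rows
kept_with_null = [r[0] for r in conn.execute(
    "SELECT id FROM t WHERE id NOT IN ('a','b','c') OR id IS NULL")]
```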

3.
Avoid grouping by too many columns, especially long strings; it slows the query down and makes row-processing errors more likely. Instead, GROUP BY id1, id2, id3 and use group_first(other_column) for the remaining columns.
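The group_first above is presumably a custom Hive UDF; since the extra column is the same within each group, a standard aggregate like MIN works as a stand-in. A sqlite3 sketch of the rewrite, with invented table and column names:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id1 TEXT, id2 TEXT, descr TEXT, n INTEGER)")
conn.executemany("INSERT INTO events VALUES (?,?,?,?)", [
    ("a", "x", "a long descriptive string", 1),
    ("a", "x", "a long descriptive string", 2),
    ("b", "y", "another long string", 5),
])

# Instead of GROUP BY id1, id2, descr (comparing the long string on every
# row), group by the short ids only and pick one representative value of
# the long column with MIN(); the result is identical because descr is
# constant within each (id1, id2) group.
rows = conn.execute(
    "SELECT id1, id2, MIN(descr), SUM(n) FROM events GROUP BY id1, id2"
).fetchall()
```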