PDA

View Full Version : Indexing images on Tumblr



Peppwyn
03-31-2016, 06:17 PM
I've found that Bing, Google, and TinEye don't have a very good index of Tumblr and the massive number of images stored there. I've started a little side project to hopefully provide some indexing of those images and a somewhat reliable way to find similar images using image hashing. It uses Microsoft's cloud computing platform to search for all images on a given Tumblr blog and then indexes their hashes (it doesn't store the images) for later comparison.

If you're interested, check out http://tumblr-index.cloudapp.net and submit some sites to get indexed. I've got roughly 200k images indexed so far, and just need to know other sites to index.

m444w
03-31-2016, 06:35 PM
Interesting idea. Is the app published on a repo somewhere?

Peppwyn
03-31-2016, 06:54 PM
No, not at the moment. I've only been working on it for about 4 hours.

Peppwyn
03-31-2016, 06:58 PM
I think I need to actually split some of the worker roles apart to make it more robust, in case something needs to scale differently depending on demand.

Peppwyn
03-31-2016, 08:43 PM
I've taken down the tumblr-index website, but I'll bring it back up once I move it around some. I've also updated the roles to follow links from each indexed site so it can branch out to other Tumblr pages. Went from 3 blogs to 1,600 in 30 minutes.

Latrinsorm
04-01-2016, 07:34 PM
Post here when you bring it back up, please; this is something I'm interested in.

Peppwyn
04-03-2016, 10:14 PM
I am revamping some of the concepts here. I got to about 2.2M images and had some database issues. I've also swapped to some different algorithms for the image hashing... hopefully I'll get more accurate matches.

Peppwyn
04-05-2016, 12:00 PM
As expected, it just gets too damn expensive to index and store all that data in the cloud. I am looking into some other alternatives that may help with cost but reduce scale. Since this is just something I am doing for fun, I'm not too concerned.

Peppwyn
04-06-2016, 06:47 PM
Moved away from MS SQL and am now using MySQL on Ubuntu ... running much better!

m444w
04-06-2016, 07:32 PM
As expected, it just gets too damn expensive to index and store all that data in the cloud. I am looking into some other alternatives that may help with cost but reduce scale. Since this is just something I am doing for fun, I'm not too concerned.

What's the schema look like? You should be able to store a billion or more perceptual hashes in the cloud before it becomes even remotely expensive for a toy project.

Peppwyn
04-07-2016, 10:32 AM
Sorry, I didn't mean monetarily expensive; I meant computationally expensive. The problem is that I don't know how to store the images in partitions of other 'similar' images. If you look at Azure Table storage (cheap storage), the power comes from the ability to partition the data so you can look up by partition/row VERY quickly. For example, to store every email address in the world you could partition on the domain, with the username as the row key. But if each image has a unique hash, I don't know how to partition: ba19c8ab5fa05a59 IS similar to ba19caab5f205a59, which is similar to ba3dcfabbc004a49.
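To make the problem concrete with those three hashes: Hamming distance (the number of differing bits) captures the similarity just fine, but no prefix-style partition key does, because the differing bits can land anywhere in the hash:

```python
def hamming(a_hex, b_hex):
    # Number of differing bits between two 64-bit hex hashes.
    return bin(int(a_hex, 16) ^ int(b_hex, 16)).count("1")

h1, h2, h3 = "ba19c8ab5fa05a59", "ba19caab5f205a59", "ba3dcfabbc004a49"

print(hamming(h1, h2))  # 2 bits apart: very similar
print(hamming(h2, h3))  # 12 bits apart: loosely similar
print(hamming(h1, h3))  # 14 bits apart
```

Notice h1 and h2 differ in the 3rd and 5th bytes, so any partition built from a fixed slice of the hash can split near-identical images into different partitions.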

m444w
04-07-2016, 11:01 AM
Google says Azure supports Lucene-based queries*, which means you should be able to use Levenshtein distance, which is probably going to be way more performant than some sort of partition-based hack.

* I don't use Azure for anything

Peppwyn
04-07-2016, 12:59 PM
I've been calculating the aHash, dHash, and pHash of the images this morning, trying to figure out which is going to give me the results I'm looking for. I was originally using Compact Composite Descriptors, but those are more for finding 'similar' images: say, a picture of a person on a couch in a room, and then a picture of that same room with no one on the couch. That requires a lot more computation to find 'matches'.
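For reference, dHash is the simplest of the three to sketch. A pure-Python toy version, assuming the image has already been resized to a 9-wide by 8-tall grayscale grid (real code would do the resize with an imaging library):

```python
def dhash(grid):
    """64-bit difference hash: one bit per left/right brightness
    comparison over a 9x8 grayscale grid."""
    h = 0
    for row in grid:               # 8 rows
        for x in range(8):         # 8 comparisons per row of 9 pixels
            h = (h << 1) | (row[x] < row[x + 1])
    return h

# A gradient that brightens left to right sets every bit:
print(hex(dhash([list(range(9))] * 8)))  # 0xffffffffffffffff
```

Because each bit only records whether a pixel is brighter than its right neighbor, small brightness or compression changes tend to flip only a few bits, which is what makes Hamming distance a sensible similarity measure.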

I am currently using Hamming distance to find similar images; since MySQL has BIT_COUNT built in, the query is quite fast.
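The query can be sketched like this (the table and column names are mine, not the actual schema), with the XOR-plus-BIT_COUNT scan simulated over an in-memory toy index:

```python
# MySQL side, roughly (hypothetical schema):
#   SELECT page, url FROM images
#   WHERE BIT_COUNT(hash ^ :query_hash) <= :threshold;
toy_index = {
    "blogA/1.jpg": 0xBA19C8AB5FA05A59,
    "blogA/2.jpg": 0xBA19CAAB5F205A59,
    "blogB/9.jpg": 0x0123456789ABCDEF,
}

def similar(query_hash, threshold=8):
    # Same scan in Python: keep rows within `threshold` differing bits.
    return sorted(name for name, h in toy_index.items()
                  if bin(h ^ query_hash).count("1") <= threshold)

print(similar(0xBA19C8AB5FA05A59))  # ['blogA/1.jpg', 'blogA/2.jpg']
```

Note this is still a full-table scan; BIT_COUNT just makes each row's comparison cheap, which is why it holds up at hundreds of thousands of rows but struggles at tens of millions.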

Peppwyn
04-08-2016, 04:25 PM
Created a site, http://tafuta.cloudapp.net, that you can use to search for images in the index. You have to put in a URL; I don't think people are going to want to upload pictures. The index is around 300k items right now.

Peppwyn
04-08-2016, 04:42 PM
critical mistake, please hold.

Peppwyn
04-10-2016, 08:12 PM
I am switching to using the Tumblr API rather than grepping the site content. I found that I was missing 'hidden' posts and posts behind the 'secure' iframe that Tumblr allows people to use.

Peppwyn
04-11-2016, 02:14 PM
Okay, it's back online and plugging away. Thanks for those private messages and rep comments. Appreciate the support.

Peppwyn
04-14-2016, 06:07 PM
Up to 2.5M images. I think the service would certainly benefit from having a dedicated database server and web server; then it could scale based on the number of 'workers' going through the blogs. Right now I am just running it on a single instance because I don't want to pay for the added processing. The basic structure of the workers is as follows:

1) Check if anything is in the queue to process (name of blog and startIndex of the search, since the Tumblr API only allows 20 posts per request).
2) If we didn't find anything, use a stored procedure on the database that returns a blog that has not been indexed in the last 30 days. I use a stored proc here because I need a transaction and need to lock the row so multiple workers don't get the same answer.
2a) Push that to the page queue with a startIndex of 0. Go to step 1.
3) If we did find a page, use the API to get all the photo posts from that blog starting at the index.
4) If no records were found, go back to step 1.
5) If we did find records, move through them, hash each image, and insert it into the database.
5a) The database has a unique index on the page name and the hash of the image so we don't get duplicates. I was doing an existence check here, but found that just letting the unique index decide whether the insert succeeds was much faster.
6) If any images went into the database successfully, we assume there might be more pages we don't know about, so we create a queue message with startIndex + the number of posts returned from the API.
7) Check each of the posts to see where it came from (repost). If it is NOT the site we're currently looking at, attempt to add it to the list of sites (unique index on name here too) so we can index it later.
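The steps above can be sketched as a single pass of the loop, with the queue, database, and Tumblr API replaced by in-memory stand-ins (all names here are mine, not the actual code's):

```python
from collections import deque

def run_worker(queue, blogs, index, fetch_posts, hash_image):
    """One pass of the loop. queue: deque of (blog, start_index);
    blogs: set of known blog names; index: set of (blog, hash) pairs,
    standing in for the table and its unique index."""
    if not queue:                                 # 1) anything queued?
        if blogs:                                 # 2) real code uses a
            queue.append((next(iter(blogs)), 0))  #    locking stored proc
        return                                    # 2a) startIndex 0
    blog, start = queue.popleft()
    posts = fetch_posts(blog, start)              # 3) API, 20 posts max
    if not posts:
        return                                    # 4) back to step 1
    inserted = 0
    for url, source_blog in posts:                # 5) hash and insert
        key = (blog, hash_image(url))
        if key not in index:                      # 5a) unique index
            index.add(key)                        #     rejects duplicates
            inserted += 1
        if source_blog and source_blog != blog:   # 7) reposts seed blogs
            blogs.add(source_blog)
    if inserted:                                  # 6) maybe more pages
        queue.append((blog, start + len(posts)))
```

One nice property of this shape is that each pass is idempotent-ish: if a worker dies mid-page, the unique index means re-running the same page just produces failed inserts rather than duplicates.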

Peppwyn
07-07-2016, 02:46 PM
I've indexed 60M images now, and have changed to just look for exact matches. If you're interested in looking for a particular image, let me know. I've taken the site offline right now, but will probably put something up in the near future.

Peppwyn
07-29-2016, 10:58 PM
Exact matches aren't really working out. The subtle differences can really suck.

I've read up on a few ideas like "Locality Sensitive Hashing" to help with indexing the images and am going to try that in order to reduce scanning the entire database to find similar images. At 60M images, scanning everything to find a similar image simply takes too long (especially when I am trying to keep the cost of the entire solution under $150/mo).
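A toy sketch of the banding flavor of that idea (names mine): split each 64-bit hash into 4 bands of 16 bits and index the hash under each (band, value) pair. Any two hashes within Hamming distance 3 must then agree exactly on at least one band (pigeonhole), so a lookup only touches rows that share a band value instead of the whole table:

```python
from collections import defaultdict

BANDS = 4
BAND_BITS = 16
MASK = (1 << BAND_BITS) - 1

def band_keys(h):
    # (band_number, band_value) pairs for a 64-bit hash.
    return [(i, (h >> (i * BAND_BITS)) & MASK) for i in range(BANDS)]

buckets = defaultdict(set)  # stand-in for a partitioned table

def insert(h):
    for key in band_keys(h):
        buckets[key].add(h)

def candidates(h):
    # Union of every bucket the query hash lands in; only these
    # candidates need an exact Hamming-distance check.
    out = set()
    for key in band_keys(h):
        out |= buckets[key]
    return out
```

This also answers the earlier partitioning question for Table storage: the partition key could be the (band, value) pair, at the cost of storing each hash 4 times.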