Downloading Flickr image sets with Python and wget

May 2nd, 2008

Filed under: python, web — starenka @ 23:55

Have you ever come across a kool image set on Flickr and wanted to download all those pics in the “large” size? Yes, there are a thousand ways to skin a cat… I wrote a very simple “spider” [dl here] in Python to do all the clicking for me. Once I have all the image links, I run wget to retrieve them.

Let’s make a function to fetch a given URI first:

#!/usr/bin/env python

import sys,re,string
import urllib,urllib2

def readFile(uri):
    # Fetch the given URI; returns the response body, or None on failure.
    # Errors go to stderr so they never end up in a redirected link list.
    request = urllib2.Request(uri)
    try:
        response = urllib2.urlopen(request)
    except urllib2.HTTPError, e:
        print >> sys.stderr, 'ERR: ('+str(e.code)+') Error occurred. Current URI: '+uri
    except urllib2.URLError, e:
        print >> sys.stderr, 'ERR: Failed to reach the URI ('+str(e.reason)+')'
    else:
        return response.read()
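
For readers on Python 3, where urllib2 no longer exists, the same fetch could look roughly like the sketch below. It is only an illustration: the name read_uri is made up here, and urllib.request and urllib.error from the standard library stand in for urllib2. Errors are printed to stderr so they cannot leak into a redirected list of links.

```python
import sys
import urllib.error
import urllib.request


def read_uri(uri):
    """Return the body of `uri` as bytes, or None if the request fails.

    `read_uri` is a hypothetical Python 3 counterpart of readFile.
    """
    try:
        with urllib.request.urlopen(uri) as response:
            return response.read()
    except urllib.error.HTTPError as e:
        # The server answered, but with an error status (404, 500, ...).
        # HTTPError is a subclass of URLError, so it has to be caught first.
        print("ERR: (%d) Error occurred. Current URI: %s" % (e.code, uri),
              file=sys.stderr)
    except urllib.error.URLError as e:
        # The request could not be completed at all (DNS, refused connection, ...).
        print("ERR: Failed to reach the URI (%s)" % e.reason, file=sys.stderr)
    return None
```

Note that read_uri returns bytes, so the page has to be decoded (or matched with bytes patterns) before running regular expressions over it.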

Then come two simple functions that search for the appropriate hrefs and images. getThumbs extracts all links to “image detail” pages and sends them to getImages. getImages jumps to the “Available sizes” page of each image and prints the image URI.

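def getThumbs(data):
    # Follow every "image detail" link found on the set page.
    global thumb_match

    if data is None:    # readFile failed, nothing to parse
        return
    for match in re.finditer(thumb_match,data):
        getImages(match.group(1))

def getImages(uri):
    # Insert "/sizes/o/" in front of "in/set-<id>" to reach the sizes page.
    global set_id,big_match

    pos = uri.find('in/set-'+set_id)
    uri = uri[0:pos-1]+'/sizes/o/'+uri[pos:]
    data = readFile('http://www.flickr.com'+uri)
    if data is None:    # readFile failed, skip this image
        return
    for match in re.finditer(big_match,data):
        print match.group(1)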
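
To make the string slicing in getImages easier to follow, here is a small standalone sketch of just that step, written for Python 3. The function name sizes_path and the photo id 123456789 are invented for the example; the set id is the one used in the commands further down.

```python
def sizes_path(href, set_id):
    """Turn an "image detail" path into its "Available sizes" path.

    Hypothetical helper mirroring the slicing done in getImages.
    """
    marker = "in/set-" + set_id
    pos = href.find(marker)
    if pos < 1:
        # Marker missing (or nothing in front of it): nothing to rewrite.
        return None
    # href[:pos - 1] drops the slash in front of "in/set-...",
    # "/sizes/o/" is the segment the script inserts, href[pos:] keeps the rest.
    return href[:pos - 1] + "/sizes/o/" + href[pos:]


# 123456789 is a made-up photo id, used only for illustration.
detail = "/photos/pinkponk/123456789/in/set-72157600267969060/"
print(sizes_path(detail, "72157600267969060"))
# prints /photos/pinkponk/123456789/sizes/o/in/set-72157600267969060/
```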

The last section just parses the given parameters, compiles the regular expressions and starts harvesting links:

if(len(sys.argv)>1):
    uri = sys.argv[1]
    if(uri.find('?page=') > 0):
        # split off the "?page=N" suffix so the set id can be read from the path
        pos = uri.find('?page=')
        page = uri[pos:]
        uri = uri[0:pos-1]
        set_id = uri[uri.rfind('/')+1:]
        uri = uri+'/'+page
    else:
        if(uri[len(uri)-1:] == '/'):
            uri = uri[0:-1]
        set_id = uri[uri.rfind('/')+1:]

    thumb_match = re.compile(r'.*?<a.*?href="(.*?set-'+set_id+'/)".*?>.*?',re.IGNORECASE)
    big_match = re.compile(r'.*?<p><img.*?src="(.*?static\.flickr\.com.*?)".*? /></p>.*?',re.IGNORECASE)

    getThumbs(readFile(uri))

else:
    print 'ERR: missing parameter (URI of the Flickr set)'
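
The argument handling boils down to two things: finding the set id and keeping an optional ?page=N suffix. A minimal Python 3 sketch of that logic follows, together with a slightly tightened variant of the thumbnail pattern (it uses [^"] instead of .*? so a match cannot run past the end of the attribute, and re.escape on the id). The names split_set_uri and find_detail_links and the HTML snippet are made up for illustration; real Flickr markup may well differ.

```python
import re


def split_set_uri(uri):
    """Return (normalized_uri, set_id) for a Flickr set URI.

    Hypothetical helper; roughly mirrors the if/else in the script.
    """
    base, sep, page = uri.partition("?page=")
    base = base.rstrip("/")
    set_id = base[base.rfind("/") + 1:]
    if sep:
        # Put the page selector back after the set id.
        base = base + "/" + sep + page
    return base, set_id


def find_detail_links(html, set_id):
    """Return all hrefs in `html` that end in "set-<set_id>/"."""
    pattern = re.compile(
        r'<a[^>]*?href="([^"]*?set-' + re.escape(set_id) + r'/)"',
        re.IGNORECASE,
    )
    return pattern.findall(html)


uri, set_id = split_set_uri(
    "http://www.flickr.com/photos/pinkponk/sets/72157600267969060/?page=2"
)
print(uri)     # http://www.flickr.com/photos/pinkponk/sets/72157600267969060/?page=2
print(set_id)  # 72157600267969060

# Made-up markup with one link into the set and one unrelated link.
sample = (
    '<a href="/photos/pinkponk/123456789/in/set-72157600267969060/">thumb</a>'
    '<a href="/photos/pinkponk/">profile</a>'
)
print(find_detail_links(sample, set_id))
```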

Now I just run the script on all gallery pages and let it save the links into a file:

python herrflick.py 'http://www.flickr.com/photos/pinkponk/sets/72157600267969060/' >> /down/adverts.txt
python herrflick.py 'http://www.flickr.com/photos/pinkponk/sets/72157600267969060/?page=2' >> /down/adverts.txt
python herrflick.py 'http://www.flickr.com/photos/pinkponk/sets/72157600267969060/?page=3' >> /down/adverts.txt
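
Typing one command per page gets tedious for bigger sets. As a rough sketch (Python 3), the page URIs could be generated instead. page_uris is a hypothetical helper, and the number of pages still has to be looked up by hand on the set page.

```python
def page_uris(set_uri, pages):
    """Yield the URI of every gallery page, 1..pages, of a Flickr set."""
    base = set_uri.rstrip("/") + "/"
    for n in range(1, pages + 1):
        # Page 1 is the plain set URI; later pages carry "?page=N".
        yield base if n == 1 else "%s?page=%d" % (base, n)


for uri in page_uris("http://www.flickr.com/photos/pinkponk/sets/72157600267969060/", 3):
    print(uri)
```

Each printed URI can then be handed to the script exactly as in the commands above.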

and let wget download the images:

wget -i /down/adverts.txt

PS. It is more comfy to make the script executable (Linux only):

  • rename the script to “herrflick”
  • copy it to /usr/bin
  • make it executable (chmod +x /usr/bin/herrflick)

and run it:

 

herrflick http://www.flickr.com/photos/pinkponk/sets/72157600267969060/ | wget -i /dev/stdin
