Hw2
Movie Recommendations based on IMDB Scraping
For this project, I am creating a scraper that starts from the movie Catch Me if You Can and recommends other movies based on how many actors they share with it.
First Parse Method - Movie Page
This method starts on the movie page and navigates to the full credits page.
- After creating the spider class and saving the movie page’s URL in the start_urls list, define the first parse method
- Take the first element (index zero) of the movie URL list, append “fullcredits”, and save the result in a variable
- Yield a request object for the variable from the previous step. Its callback argument is the next parse method.
import scrapy

class ImdbSpider(scrapy.Spider):
    name = 'imdb_spider'
    start_urls = ['https://www.imdb.com/title/tt0264464/']

    def parse(self, response):
        # Build the full credits URL from the movie page URL
        cast_crew = self.start_urls[0] + "fullcredits"
        yield scrapy.Request(cast_crew, callback=self.parse_full_credits)
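The concatenation above only works because the start URL ends with a slash. As a minimal sketch of a more forgiving way to build the same address, the standard library's urljoin can be used (inside a spider, response.urljoin plays a similar role). The helper name full_credits_url is hypothetical and not part of the original spider.

```python
from urllib.parse import urljoin


def full_credits_url(movie_url: str) -> str:
    """Return the full credits URL for a movie page URL (hypothetical helper)."""
    # urljoin replaces the last path segment unless the base ends with "/",
    # so add the trailing slash when it is missing.
    base = movie_url if movie_url.endswith("/") else movie_url + "/"
    return urljoin(base, "fullcredits")


print(full_credits_url("https://www.imdb.com/title/tt0264464/"))
print(full_credits_url("https://www.imdb.com/title/tt0264464"))
```

Both calls print the same full credits address, so the spider would not depend on how the start URL happens to be written.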
Second Parse Method - Full Credits
This method navigates from the full credits page to each actor’s page.
- Use a list comprehension to collect the relative path to each actor’s page and save it in the variable actor_path
- Create a prefix variable for https://www.imdb.com
- Create an actor_urls variable that adds the prefix to every path in actor_path; this mimics clicking on the actor headshots
- Loop through actor_urls and yield a request object for each URL, with the next parse method as its callback argument
    def parse_full_credits(self, response):
        actor_path = [a.attrib["href"] for a in response.css("td.primary_photo a")]
        prefix = "https://www.imdb.com"
        actor_urls = [prefix + suffix for suffix in actor_path]  # clicking on actor headshot to go to actor page
        for info in actor_urls:
            yield scrapy.Request(info, callback=self.parse_actor_page)
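To make the selector step more concrete, here is a rough sketch of what "td.primary_photo a" picks out, written with the standard library's HTMLParser as a stand-in for Scrapy's CSS selectors (it is not how Scrapy works internally). The HTML snippet, the /name/ paths, and the class name HeadshotLinkParser are made-up placeholders, not real IMDB markup.

```python
from html.parser import HTMLParser


class HeadshotLinkParser(HTMLParser):
    """Collect href values of <a> tags inside <td class="primary_photo">."""

    def __init__(self):
        super().__init__()
        self.in_photo_cell = False
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "td":
            # Track whether the current table cell is a headshot cell
            classes = (attrs.get("class") or "").split()
            self.in_photo_cell = "primary_photo" in classes
        elif tag == "a" and self.in_photo_cell and "href" in attrs:
            self.hrefs.append(attrs["href"])

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_photo_cell = False


# Placeholder markup loosely shaped like a cast table (an assumption)
sample_html = """
<table>
  <tr>
    <td class="primary_photo"><a href="/name/nm1111111/"><img></a></td>
    <td><a href="/name/nm1111111/">Actor One</a></td>
  </tr>
  <tr>
    <td class="primary_photo"><a href="/name/nm2222222/"><img></a></td>
    <td><a href="/name/nm2222222/">Actor Two</a></td>
  </tr>
</table>
"""

parser = HeadshotLinkParser()
parser.feed(sample_html)

prefix = "https://www.imdb.com"
actor_urls = [prefix + suffix for suffix in parser.hrefs]
print(actor_urls)
```

Only the links inside the headshot cells are collected, so each actor contributes one URL even though their name link appears again in the next cell.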
Third Parse Method - Actor Page
This method collects the actor’s name and the movies they have appeared in.
- Use response.css to get the actor’s name
- Use a second response.css call to get the titles of all the actor’s movies
- Loop through this list of movie titles
- Inside the loop, yield a dictionary that holds the actor’s name and one of the movies they have been in
    def parse_actor_page(self, response):
        actor_name = response.css("span.itemprop::text").get()  # actor name
        # get all the names of the movies the actor was in
        for movie in response.css("div.filmo-row b a::text").getall():
            yield {
                "actor": actor_name,
                "movie": movie
            }
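Each yielded dictionary ends up as one row of the scraper's output. The following is a small sketch of that shape, assuming the output is a CSV file with actor and movie columns (the export step itself is not shown in this post). The sample rows are made-up placeholders, not scraped data.

```python
import csv
import io

# Placeholder items in the same shape the spider yields
items = [
    {"actor": "Actor One", "movie": "Movie A"},
    {"actor": "Actor One", "movie": "Movie B"},
    {"actor": "Actor Two", "movie": "Movie A"},
]

# Write to an in-memory buffer instead of a real file
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["actor", "movie"], lineterminator="\n")
writer.writeheader()
writer.writerows(items)

csv_text = buffer.getvalue()
print(csv_text)
```

Because there is one row per actor and movie pair, counting rows per movie later on amounts to counting how many scraped actors appear in that movie.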
Recommendation
By loading the CSV file produced by the scraper into pandas, we can see which movies share the most actors with Catch Me if You Can.
import pandas as pd
data = pd.read_csv("results.csv")
topten = data.groupby("movie").count() #group the data by the movies
topten = topten.sort_values(by = "actor", ascending= False).head(10) #sort and show the data based on what movies have the most shared actors
topten
topten.plot.bar()
If you like Catch Me if You Can, these are 10 other movies you might like, based on the actors they share with it.
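Since every scraped actor appears in Catch Me if You Can, the seed movie itself will likely have the highest count in a ranking like this. Here is a minimal sketch of counting shared actors while skipping the seed title, using only the standard library and made-up rows in place of results.csv. The function name top_shared, the sample data, and the removal of duplicate rows (for example, if one title shows up twice in an actor's filmography) are assumptions, not part of the original analysis.

```python
from collections import Counter


def top_shared(rows, seed_title, n=10):
    """Rank movies by number of distinct actors, skipping seed_title."""
    # Drop duplicate (actor, movie) rows so each actor counts once per movie
    pairs = {(row["actor"], row["movie"]) for row in rows}
    counts = Counter(movie for _, movie in pairs if movie != seed_title)
    # Sort by count (descending), then title, so ties come out in a stable order
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:n]


# Placeholder rows in the same shape as the scraper output
rows = [
    {"actor": "Actor One", "movie": "Seed Movie"},
    {"actor": "Actor One", "movie": "Movie A"},
    {"actor": "Actor One", "movie": "Movie B"},
    {"actor": "Actor Two", "movie": "Seed Movie"},
    {"actor": "Actor Two", "movie": "Movie A"},
    {"actor": "Actor Three", "movie": "Seed Movie"},
    {"actor": "Actor Three", "movie": "Movie A"},
    {"actor": "Actor Three", "movie": "Movie A"},  # duplicate row
    {"actor": "Actor Three", "movie": "Movie B"},
]

ranking = top_shared(rows, seed_title="Seed Movie")
print(ranking)
```

With real data, rows could be read from results.csv with csv.DictReader, and seed_title would need to match the title exactly as it appears in the scraped filmographies.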
Written on May 1, 2022