Sunday, March 27, 2016

Intro to database week 3 - SQL gets real

Week 3 - SQL gets real

Up until this point we've mostly been talking about the theory of databases and how they generally work. This week was nice because we started to chug along on a real database. Since I have no experience writing SQL commands the first thing I did was hit the ground running by downloading an SQL training app on my phone so I could do problems on BART. 

My phone runs Android so there was a nice knowledify SQL app to run through lots of practice problems. The video below shows how this app works if people are interested. This app is nice for a few reasons. First of all it has auto-complete so you don't feel like you're wasting your life typing things out in full for no good reason. Overall the user interface is very nice and it has a quick guide to let you brush up on basic commands as you learn them. The app looks like it runs out of questions pretty quickly but I think it is a good free way to get some repetition in for someone just getting started like me.  


With basic SQL queries in hand the week's homework was filled with setting up example databases and querying them. I think if we stopped here I could do some useful things in my side projects. For example just with what we learned in this weeks homework we can download the cx_Oracle python library and start parsing files and loading the information into a a database. This let's us scrape data from social media sites to harvest information about people and automate tasks based on that information. Without the information from this weeks topics I would be storing that information in a comma delimited file which is a bit of a mess. Having the information in a database makes searching for data like last profile view data simple in python. For example after harvesting used id links you could search for pages you haven't visited yet by doing this:
import cx_Oracle
db = cx_Oracle.connect('linkspider', 'password', 'localhost:1521')
cursor = db.cursor()
cursor.execute("select User_ID from linkedINUsers where LastViewed is null and rownum < 10")
userIDs = cursor.fetchall()

cx_Oracle wants you to get a cursor object which let's you act on the database after you connect to a database and get a database object. From there you can use the cursor to start performing the commands that we learned this week. Select rows, modify them, add new ones, etc. By using rownum in your query you can limit the results returned because if your bot tries to view too many pages too quickly you'll get caught! 

Before you send your bot out to start harvesting you'll want to get some data of course. The example I wrote below shows how to do that for a basic linkedin html file taken from the people you may know page. By seatching for "/view?id=" we can scrape a list of user names and store them in the database. The critical part here is that you must perform a commit on the database. Otherwise your results won't be stored! You also probably don't want to waste time trying to commit duplicate user ID's, so it is nice to keep track of them as you go along. Another important aspect is that when you try to add rows to the database that already exist then cx_Oracly will throw an integrity error. So make sure to catch that and do whatever you like. 
import sys
import cx_Oracle
db = cx_Oracle.connect('linkspider','password', 'localhost:1521')
cursor = db.cursor()
sourceFileName = sys.argv[1]
sourceFile = open(sourceFileName, "r", errors='ignore')
buffer = sourceFile.read()
index = 0
done = False
urls = []
OK = 1
matchString = "https://www.linkedin.com/profile/view?id="
matchStringLen = len(matchString)
while OK != 0:
   index = buffer.find(matchString, index)
   if(index == -1):
      OK = 0
      continue
   index += matchStringLen
   index2 = buffer.find("&", index)
   match = substring = buffer[index:index2]
 
   duplicate = 0
   for url in urls:
     if(match == url):
           duplicate = 1
   
   if(duplicate == 0):
      urls.append(match)
      try:
        cursor.execute("insert into linkedInUsers values('" + match + "', NULL, NULL, NULL, NULL)")
      except cx_Oracle.IntegrityError :
        print("Already in database")
       
   index = index2
db.commit()
Overall I think this was a solid week. We learned some really useful tricks to create and use databases. Now we just have to make sure to keep going otherwise we could end up thinking we know what we're doing, which would end up creating a real mess! At least now we could do basics like monitor home automation systems, aquaponics system sensors, and other simple projects. Just don't go off trying to apply for a database administrator job at a bank! 

No comments:

Post a Comment