Sunday, March 27, 2016

Intro to database week 3 - SQL gets real

Week 3 - SQL gets real

Up until this point we've mostly been talking about the theory of databases and how they generally work. This week was nice because we started to chug along on a real database. Since I have no experience writing SQL commands the first thing I did was hit the ground running by downloading an SQL training app on my phone so I could do problems on BART. 

My phone runs Android so there was a nice knowledify SQL app to run through lots of practice problems. The video below shows how this app works if people are interested. This app is nice for a few reasons. First of all it has auto-complete so you don't feel like you're wasting your life typing things out in full for no good reason. Overall the user interface is very nice and it has a quick guide to let you brush up on basic commands as you learn them. The app looks like it runs out of questions pretty quickly but I think it is a good free way to get some repetition in for someone just getting started like me.  


With basic SQL queries in hand the week's homework was filled with setting up example databases and querying them. I think if we stopped here I could do some useful things in my side projects. For example just with what we learned in this weeks homework we can download the cx_Oracle python library and start parsing files and loading the information into a a database. This let's us scrape data from social media sites to harvest information about people and automate tasks based on that information. Without the information from this weeks topics I would be storing that information in a comma delimited file which is a bit of a mess. Having the information in a database makes searching for data like last profile view data simple in python. For example after harvesting used id links you could search for pages you haven't visited yet by doing this:
import cx_Oracle
db = cx_Oracle.connect('linkspider', 'password', 'localhost:1521')
cursor = db.cursor()
cursor.execute("select User_ID from linkedINUsers where LastViewed is null and rownum < 10")
userIDs = cursor.fetchall()

cx_Oracle wants you to get a cursor object which let's you act on the database after you connect to a database and get a database object. From there you can use the cursor to start performing the commands that we learned this week. Select rows, modify them, add new ones, etc. By using rownum in your query you can limit the results returned because if your bot tries to view too many pages too quickly you'll get caught! 

Before you send your bot out to start harvesting you'll want to get some data of course. The example I wrote below shows how to do that for a basic linkedin html file taken from the people you may know page. By seatching for "/view?id=" we can scrape a list of user names and store them in the database. The critical part here is that you must perform a commit on the database. Otherwise your results won't be stored! You also probably don't want to waste time trying to commit duplicate user ID's, so it is nice to keep track of them as you go along. Another important aspect is that when you try to add rows to the database that already exist then cx_Oracly will throw an integrity error. So make sure to catch that and do whatever you like. 
import sys
import cx_Oracle
db = cx_Oracle.connect('linkspider','password', 'localhost:1521')
cursor = db.cursor()
sourceFileName = sys.argv[1]
sourceFile = open(sourceFileName, "r", errors='ignore')
buffer = sourceFile.read()
index = 0
done = False
urls = []
OK = 1
matchString = "https://www.linkedin.com/profile/view?id="
matchStringLen = len(matchString)
while OK != 0:
   index = buffer.find(matchString, index)
   if(index == -1):
      OK = 0
      continue
   index += matchStringLen
   index2 = buffer.find("&", index)
   match = substring = buffer[index:index2]
 
   duplicate = 0
   for url in urls:
     if(match == url):
           duplicate = 1
   
   if(duplicate == 0):
      urls.append(match)
      try:
        cursor.execute("insert into linkedInUsers values('" + match + "', NULL, NULL, NULL, NULL)")
      except cx_Oracle.IntegrityError :
        print("Already in database")
       
   index = index2
db.commit()
Overall I think this was a solid week. We learned some really useful tricks to create and use databases. Now we just have to make sure to keep going otherwise we could end up thinking we know what we're doing, which would end up creating a real mess! At least now we could do basics like monitor home automation systems, aquaponics system sensors, and other simple projects. Just don't go off trying to apply for a database administrator job at a bank! 

Wednesday, March 16, 2016

Intro to databases week 2 - A relationship with Algebra

This week was serious meat and potatoes business: relational algebra and some SQL.

I know what you're thinking: what's relational algebra? Well it's like your logic & computation class met SQL and decided it didn't want to have a good structure and be a real language. So relational algebra gives us some example ways to perform operations on database tables in a bit of pseudo code. Useful keywords in this area include the following including simplified description:

SELECT - pull some columns out of a table.
PROJECT - pull some rows out of a table.
UNION - join two tables that share columns
INTERSECT - join two tables that share columns only on the rows that match
DIFFERENCE - subtract one table from another, giving you the first table with the values in the second removed from it when they match
PRODUCT - makes a super table that is the combination of the two multiplied out in almost every combination. it generated a bit of a mess but is used as part of the join operation.
JOIN - add some columns from one table to another to make a super table where the keys match.
DIVIDE - this one is kind of weird and hard to describe without a picture, but it's included for keyword completeness.

After getting all the keywords loaded up we had some exercises to do. Those exercises really served to provide the syntax for using relational algebra as if it was SQL in the final homework for the week. That seemed a bit akward like we should have just finished loading up SQL commands and then wrote it like that. But it is what it is.

From there we went on to reading but not much homework related to SQL. That's where it gets to the serious business part with real structure to send commands to a DBMS.  The big take away for me from this weeks section on SQL is that it seems like something that is straightforward to get the basics of but then tricky to become really good at. Being able to ask questions of a database is fine but the trick seems like it will be figuring out what questions to ask and how to organize the data to generate useful information. Then there is the ever looming problem of creating a database that turns into a mess so it seems important to follow along closely with these beginning sections.

Next week should be interesting as we continue our exercises but start to use an Oracle DMBS to run databases on our own computer. I am looking forward to the hands on work coming up.


Wednesday, March 9, 2016

Intro Database class week 1!

Hello internet,

This week's in-school learning was really about pre-loading students with all of the important vocabulary necessary for us to discuss databases. There were different key types, integrity between tables, and lots of examples of how you can go terribly wrong by knowing "just enough to be dangerous" with databases. I think the most important aspect of this was that just because you know how to use some database management software that doesn't mean you know what you're doing. Worse yet, if you actually convinced people you knew what you were doing you could create a massive time suck that becomes difficult to maintain. Hopefully though students don't head off on their own immediately and convince someone they can run a database.

From the homework we saw the importance of unique keys to identify rows within a table, and how to neatly link tables together with those keys. Things like SKU's on items in a store, student ID numbers, store codes, and all the other random numbers that don't seem to really give any info become magic numbers inside the database. This of course depends on being smart about how you set up your tables and which information you put inside of them. For example you probably don't want to list names unnecessarily when you can refer to people by number. Although they might feel a bit alienated during conversation, this actually ensures that you can update their name in only one location and have it updated correctly in all tables, without creating redundant names that have different spelling.

So overall I think it makes sense to continue this course and learn how to make useful databases instead of just trying to convince people I know a few key words related to database development. After all, it doesn't really matter what people think you know unless you're a shifty salesman. If you want to be part of an agile team that can manipulate matter and information to develop devices and harvest information to generate knowledge then you have to be able to do things.

Also I have some kind of fever & cold this week which makes me feel a bit dumb. It's always interesting being sick and seeing how it affects your head.

Have a good week.