Everything I Learned About Databases From...: February 2008

Saturday, February 23, 2008

Database Privacy Issue Quotable: "Garbage In and Garbage Out"

While contemplating my next blog topic about databases, I decided to google privacy issues and databases. My search returned quite a few interesting articles, as expected. The one I found most interesting covered the very controversial subject of using massive databases for law enforcement applications. Everyone except the law enforcement agencies themselves seems to be concerned with the potential problems that could stem from databases that are supposed to maintain data about criminals and facilitate efficient information sharing among all of our nation's law enforcement systems. There is much debate about the potential for abuse among members of these agencies, but more importantly about the possibility that information entered into a database about individuals may not be verified as accurate and may do more harm than its proposed good. Many questions surround this issue as it relates back to our individual privacy.

Who will police the collecting, generating, storing, and dissemination of information? What safeguards, if any, exist to protect the innocent who may become victims of human errors with respect to managing such personal data? If there are any monitoring and controlling measures in place, how well are they truly being implemented? Are we really any safer if we give up our right to privacy in exchange for an even greater unknown? It is that unknown that we put into the hands of the agencies and systems that are supposed to protect us, yet continue to fall short daily. Will we feel any greater sense of security in our daily lives if these massive communicative efforts are undertaken on our behalf when we know that life holds no guarantees?

Each day presents uncertainty for us all and despite the efforts to safeguard lives, we will not be able to prevent every instance of a crime. We may come close, but at what price? Is giving up our privacy and freedoms comparable to feeling secure in our environment? Can we truly feel secure in a world where we must give up what is ours despite where we may fit into society?

When I contemplate the possibilities that exist at both ends of the spectrum, I am always left with more uncertainty and questions. Initially, I read this article and the idea of one department of justice "OneDOJ (Eggen 1)" facilitated by massive information sharing seemed like a great idea. It spoke of unity of all the systems created to protect us and almost signaled an end to the bureaucratic divisions that too often have failed us due to insufficient communication and coordination. If I closed my eyes and dared to dream, then I could see it like the Justice League with Captain America and his Super Friends working together for the greater good. If only it were so simple, then the controversy may not exist. Unfortunately, life is only comical sometimes and quite different from the comic books.

Law enforcement and all of its systems are not superheroes despite their ability to fight crime. Generally, the comic book characters who are the objects of heroism do not feel victimized by those who are trying to protect them. However, the same cannot be said for innocent people victimized by errors in database systems that may lead to false accusations and arrests. As the article states, some information stored in these databases is grossly inaccurate, so this broad-based information sharing is like putting trash into the system: data pollution, to say the least. We may as well forget about One Department of Justice being akin to Captain America and the Justice League. We must instead scrap the comic and go back to the drawing board, because for as long as we remain human there will be errors in these database systems. Thus, in this case it is basically "garbage in and garbage out" (Eggen 1).

"Justice Dept. Database Stirs Privacy Fears Size and Scope of the Interagency Investigative Tool Worry Civil Libertarians" (By Dan Eggen Washington Post Staff Writer December 26, 2006; A07)

Wednesday, February 20, 2008

IMDb.com: The Internet Movie Database (Data Validity and Transaction Management)

As I was searching for more database articles to blog about, Google returned results for a professional online database for people in the entertainment industry. This seems like a great site for entertainment professionals. The database must be enormous, since users can contribute data, much like the wiki technology used on Wikipedia. I explored the site a little bit, trying to find out more about it in the Home section, but could not find anything useful. So I decided to investigate the frequently asked questions section, digging for more information about IMDb Inc. I did not find much about the company there either. I did find, however, a question about the source of the data in the database and the nature of its accuracy or reliability. This question made me think about the "garbage in and garbage out" quote from a previous blog and then about transaction management.

I thought about the previous blog because the FAQ section of this site, like many websites today, issued a disclaimer regarding the source and validity of the information it presents. There was an honest statement about how the information being shared might not be accurate because the sources vary, coming from within the industry or from the public. Therefore, the information being viewed by various users at some point in time might be inaccurate until the database is updated. Now, I know this does not necessarily mean that there is no lock on the database tables that store the information, but it made me think about transaction management.

I thought about how today I might access some data about a movie or a business contact in the industry (given that I may be in that league of site members) that is incorrect, but that might have been updated by the time I access it again. Now, I just want to make sure that I have the concept of transaction management correct. Transaction management applies to users with rights to read, insert, update, or delete objects in a database. If I access this website just searching for information, then I am only able to read the information as it is presented, regardless of its accuracy. However, if I can update information about a business contact that is shared with other users, then the database has to coordinate my change with any other users who are reading or changing that same data at the same moment, typically by locking it until my transaction finishes. As I understand it, this only comes into play when we are accessing the same tables or rows, i.e. the objects related to the information that I am trying to change, rather than the entire database. It seems like common sense to me in some instances, but in others I am a little confused. At any rate, I am still learning about transaction management, so I will go back to the validity of data.
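To check my own understanding, here is a minimal sketch of that idea using Python's built-in sqlite3 module. The contact table, the names, and the phone numbers are all made up for illustration, and SQLite is only a stand-in; I have no idea what database the site actually runs on. It shows a second user reading the last committed value while a change is still in progress, and a failed change being rolled back.

```python
import os
import sqlite3
import tempfile

# Hypothetical table and data, invented purely for illustration.
path = os.path.join(tempfile.mkdtemp(), "contacts.db")

writer = sqlite3.connect(path)
writer.execute("CREATE TABLE contact (id INTEGER PRIMARY KEY, name TEXT, phone TEXT)")
writer.execute("INSERT INTO contact VALUES (1, 'Pat Example', '555-0100')")
writer.commit()

reader = sqlite3.connect(path)  # a second user of the same database


def read_phone(conn):
    # fetchall() finishes the query so the read lock is released right away.
    return conn.execute("SELECT phone FROM contact WHERE id = 1").fetchall()[0][0]


# The writer starts a transaction and changes the row, but does not commit yet.
writer.execute("UPDATE contact SET phone = '555-0199' WHERE id = 1")

# The reader still sees the last committed value.
before = read_phone(reader)

writer.commit()

# After the commit, the reader sees the new value.
after = read_phone(reader)

# A transaction that fails part way through is rolled back as a whole.
try:
    with writer:  # commits on success, rolls back if an exception escapes
        writer.execute("UPDATE contact SET phone = NULL WHERE id = 1")
        raise ValueError("simulated failure before the change is complete")
except ValueError:
    pass

final = read_phone(reader)
print(before, after, final)
```

The point I take from this is that a transaction either happens completely or not at all, and other users only ever see committed data. Other database systems handle the locking details differently, so this is one example rather than the rule.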

If I were networking on this site or researching anything on it, as with any other site, I would have to worry about how accurate the data is because of who supplies it. Since this site acknowledges that its database may contain data that is inaccurate or inconsistent depending on user input, it is not fully reliable; it is like putting some garbage into the system and getting some garbage out of it.
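One partial defense against garbage going in is to have the database itself refuse obviously bad values. The sketch below is only my own illustration using Python's sqlite3 module; the movie table, its columns, and the year range are assumptions I made up, not anything taken from the site.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical schema: the CHECK constraints reject obviously malformed input.
conn.execute("""
    CREATE TABLE movie (
        movie_id     INTEGER PRIMARY KEY,
        title        TEXT    NOT NULL CHECK (length(trim(title)) > 0),
        release_year INTEGER NOT NULL CHECK (release_year BETWEEN 1888 AND 2100)
    )
""")

# Made-up user submissions: one good, one with a blank title, one with a typo in the year.
submissions = [("Example Feature", 2007), ("", 2007), ("Future Film", 20007)]

accepted, rejected = [], []
for title, year in submissions:
    try:
        conn.execute(
            "INSERT INTO movie (title, release_year) VALUES (?, ?)", (title, year)
        )
        accepted.append(title)
    except sqlite3.IntegrityError:
        # The constraint failed, so the row never made it into the table.
        rejected.append((title, year))

print(accepted, rejected)
```

Constraints like these only catch malformed data. A plausible but wrong value, such as a real year that is simply the wrong year for that movie, still gets in, which is why the source of the data matters so much.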

This blog is not meant to be heavily analytical or otherwise educationally valid. I was just exploring some of the database concepts and topics I have been exposed to in class or have considered by way of blog responses to database articles.

Google vs. Scroogle: The Good and Bad of Search Engines

http://boston.com/news/nation/articles/2006/01/21/google_subpoena_roils_the_web/

The link above references another article that sheds light on how our privacy is further being compromised by database technology, by the companies who initiate this compromise, and by the government with its attempts to gain greater access to personal information via the claim of civil protection. As the article's headline, "Google Subpoena Roils The Web," suggests, there are indeed a lot of conflicting emotions regarding whether Google, the mega search engine, should release the contents of its repositories to the government in an effort to fight Internet crime.

The information in Google's repositories can or could identify users via a reverse search of information obtained from people who use search engines. It seems that Google and many other search engine companies collect data from users conducting searches, store it for various business-related research (at least that is their official statement, with no real admission of what they actually do with that data), and maintain this information indefinitely despite the risk of data compromise. These search engine companies create indices to the information that they maintain, like a reference resource in a library, and as with a reference book of printed information, the information remains available as long as it is not discarded. The only difference here is that the public does not have access to the very information about them that the search engine companies seem to treat as public. Until I read this article, I was not fully aware of just how much information, or the exact nature of the information, was being collected and stored by companies who gather data from Internet users. Now I am aware, and unfortunately so is the government, not that this was an entity left in the dark, because it is basically becoming more like George Orwell's Big Brother. If Big Brother is becoming more of a reality, then all I can say is Poor Us and Poor Them.

Poor Us: all of us Internet users duped by the convenience and useful resources that these mega databank companies provide. And now it's Poor Them: the search engine companies whose chickens have come home to roost as the government seeks to benefit from their profiteering efforts. Now they are being taken to court to release the information that they perhaps never should have collected in the first place, or at the very least should never have stored. They seem to have no definite plans for these data warehouses, so instead of discarding the data they continue to stockpile it despite the potential for misuse in the hands of malicious individuals or, in other worst-case scenarios, possibly our government in its quest to help.

A lot of these companies have begun to cooperate with the government, vowing to supply only necessary information while protecting users' identities. However, Google is putting up a fight, supposedly to protect its users, yet this seems a little contradictory because there is no mention of simply purging the data. I say if Google is really interested in the rights of the Internet users to whom it provides both a good service and a disservice, then it can choose to end this two-sided arrangement and simply do good by its users: delete those repositories of user information. If Google does nothing more than put up a good fight with the government over its right to keep to itself the data collected through users' ignorance of its unauthorized data gathering, then Google and companies like it might as well join the government in the emerging role of Big Brother.

Google and companies like it that carry out these sneaky activities without fully disclosing them to the users of their services represent what is bad about search engines. Google is one of the most popular search engines around and has become a catchphrase in pop culture, but I wonder how popular it would be if ALL users knew that their personal information was being collected. I don't believe most users are aware, regardless of what is on the news, on the Internet, or in the papers. It is big news to me, and I obtain information from all media. Perhaps I have been too comfortable with the information that I could get from the resource to notice what information that resource could and possibly would get from me. I am sure that is likely the case for most users. Yet I wonder what they would do differently if armed with the knowledge of all the companies like Google to whom we must be easy prey.

Well, I know what I would do: go back to the basics for research, use public computers when I can, and search for online services from public interest groups (privacy activists) who believe in our continued right to privacy despite the advances of technology. One such service mentioned in this article is Scroogle.org, which represents the good and what could be a good quality of all search engines. It is run by a "public interest group, Public Information Research Inc. of San Antonio," which "runs scroogle.org, an Internet service that disguises the Internet address of searchers who want to run Google and Yahoo searches anonymously" (Hiawatha Bray, Globe, 2006, page 2).

"...Internet users concerned about privacy should do their Internet searches through Scroogle or other Internet ''proxies" that hide the address of the searcher (Osphalt, Bray 2). " Also, Google and other search companies should regularly erase their database of saved searches (Osphalt, Bray 2). "Perhaps they should consider whether it's worthwhile to keep all this information indefinitely. (Osphalt, Bray 2)."

After reading this, I visited the site scroogle.org and added it to my list of search engines on Internet Explorer. Now, I would like to believe that if I use this site to conduct my searches online that it will honestly perform the duty that it advertises. I will keep my fingers crossed because in today's society everyone has something to lose and something to gain even the "do good" groups. It is that tradeoff that got us all in this mess in the first place as we gained the information we searched for and lost our privacy in the process. However, I remain hopeful that there are still pure, privacy advocates out there who lend legitimate services such as scroogle.org to all of us still concerned about the loss of our privacy.

Could "End User Buy-In And Support For Accurate Data" Resolve The "Garbage In, Garbage Out" Issues of Databases?

In a previous blog, I referenced a database privacy quotable, "Garbage In, Garbage Out," to address not only the concerns for data privacy but also the accuracy of that data. Database accuracy is such a big concern because information about every aspect of our lives continues to be recorded and stored in enormous data warehouses, accessible to individuals who have the power to grant or deny us privileges. It is not enough that our ability to make decisions is no longer an autonomous process; that which should be our sole right is now dictated by powerhouses of information that could be dangerously inaccurate. We see the effects of bad data maintained in businesses when individuals experience identity theft, when a keeper of the data stored about us makes a clerical error, or when a company never validates the integrity or quality of the data it stores.

I recall reading an excerpt of Database Nation (Simson Garfinkel) in which he discusses how outdated the data is at a lot of the companies that continuously collect and store data about individuals. This old or bad data may be rarely updated or never validated in a lot of companies whose sole function is to share information for the purpose of conducting business. We know this to be true because of the credit reporting, billing, and mailing errors that many of us have to live with daily. How many times have you received someone else's mail? And how many times have you received calls at your residence for the same wrong person at the same wrong number, even after several years have passed? Someone or some company simply is not properly maintaining the data being stored. This failure to validate and update data is not only annoying, but can be detrimental if this bad data is used to make critical business decisions.

We recognize this problem obviously, but what can we do about it? How do we begin to tackle this problem when it seems like an infinite one that will take infinite lifetimes to resolve? When companies maintain bad data while simultaneously collecting and storing new data, how can they reasonably expect to achieve data accuracy and consistency? It is possible we may never solve this problem, but if we choose to tackle the problem then we might minimize the number of errors and their impacts. There are companies who have taken steps toward approaching a solution.

In the article referenced below, "The Secret to Successful Business Intelligence A Top Notch Data Warehouse," Rensselaer Polytechnic Institute seeks to gain end-user endorsement and support for achieving accurate data. An essential first step for the institution was to review how data was being defined, stored, and used by various entities within the institution. It appeared that data was being managed with no real guidelines in place. There was chaos in a sense because no "data definitions" were established (Daniel 1). Different departments and business functions used their own definitions and methods for looking at data (Daniel 1). This presented a lot of problems for Rensselaer, as the following excerpt reveals.

"...Finally, the admissions staff needed more timely demographic information about its applicants to inform student selection decisions.

Getting a handle on the data has been critical because higher education today is a tough arena. Government funding is down, requests for financial aid are up and admitting a diverse student body—in terms of gender, geography, ethnicity and academic achievement—has become more challenging. All these factors make balancing the supply of enrollment acceptances and financial aid with the demand from student applicants more challenging than in the past. The better Rensselaer could optimize its administrative resources and time, the more revenue it would have for courses and scholarships to attract the best and the brightest.

The answer was a business intelligence and enterprise data warehouse implementation (Daniel 1)."

Ultimately, Rensselaer decided to "create enterprisewide processes for collecting and using data" (Daniel 1), which included communication, training, and support for end users (Daniel 1). They implemented this new process via the following focused steps, as listed in the article:

1. Create cross-functional support.
2. Think big, start small, deliver quickly.
3. Create one version of data truth.
4. Provide support for new behaviors.

I like Rensselaer's approach to solving an ongoing business intelligence problem. It showed some maturity (e.g. as it applies to Capability Maturity Model® Integration) on the part of the institution to go from chaos to at least recognizing the problem and trying to find a valid solution. Also, I believe that their new approach of putting more value on the keepers of data by providing "broad user support" (Daniel 2) could serve as a best-practice methodology for other companies seeking to improve the quality of the information they store. Rensselaer's redirection serves as a good best practice because "enterprise data warehouse and business intelligence projects' success depends on broad user support and because consequential business decisions are made on the faith that information is accurate" (Daniel 2).

If every data warehouse infrastructure applied this ideal of continuous process improvement (CPI) or total quality management (TQM) to its data, then we might get closer to weeding out the garbage that could potentially go into databases, thus minimizing the garbage that comes out of them as well. It looks like it worked well for Rensselaer in terms of ROI and "optimized expenses" (Daniel 4). See the section titled "An A+ for Rensselaer's Business Intelligence" on page 4 of the article.

Referencing Article:
http://www.cio.com/article/151601/The_Secret_to_Successful_Business_Intelligence_A_Top_Notch_Data_Warehouse
"Outdated information and disagreement over data definitions was impeding Rensselaer Polytechnic Institute's progress. To the rescue: a business intelligence plan that emphasized end user buy-in and support for accurate data"

Privacy Issues Surrounding Databases

This topic is one of an infinite nature once you begin to explore all the instances of privacy piracy. I got more information about how our privacy gets compromised by databases than I could ever have imagined. When I discover the ways that databases and database technologies can be used, and are being used, to capture information about us as we go about our daily lives, I am in awe that more is not being done to regulate these acts.

When I speak of these acts, I am referring to how easily we get tricked into providing information about ourselves to individuals who use it for their own internal purposes or sell or share it with third parties without our consent. As I read more excerpts of Simson Garfinkel's Database Nation, I could not believe that I had not been more guarded with my information.

There are certain instances where I have questioned the collection of my information, e.g. in hospitals, doctors' offices, and other cases where the collection of my personal data might seem more pertinent to my well-being. Yet I discovered the gross abuse of my personal information even in places where I least expected infringement upon my privacy (e.g. medical records). I was not aware that when I signed consent to release my medical information to what I believed were eligible third parties, like the providers of my medical insurance, other third parties to whom I never knowingly would have given consent also gained permission to access my medical data. Who knew that the keepers of medical records could sell that information to insurance companies, current and potential employers, and anyone else who could use that information to make critical decisions about individuals?

It is like we do not even own our information any more once we give third parties consent, whether in awareness or in ignorance, to collect our personal data. When I think of how I use my Visa check cards instead of cash and sign up for discount programs in grocery stores in the quest to be frugal, I feel silly now each time I go to my mailbox and am inundated with junk mail and solicitations. Then I think of all my efforts to opt out of marketing promotions and credit card offers and see them as wasted, when there is no way to eliminate each group of individuals responsible for the intricate web of advertising. Who has time to call each company that sells their information and ask it to stop doing so or to remove their name from the proverbial list? I even get ads faxed to me now, which is such a gross waste of paper. It angers me to think that not only am I paying the price for someone else's careless mishandling of my information, but I also incur costs in terms of paper and other printer supplies, not to mention the infinite cost of privacy lost.

The privacy issues surrounding databases are so numerous, and so tightly coupled with the ever-increasing advances in technologies that assist the neglectful transmission of our personal information, that we may never see an end to this problem. At best we can only expect that technology will exacerbate the problem, allowing it to get worse before it gets better. Who decides when things get better is ultimately not up to the keepers, seekers, and senders of our information, but up to us, the victims of database privacy abuses.

References:
Database Nation: The Death of Privacy in the 21st Century by Simson Garfinkel
ISBN 0-596-00105-3

Monday, February 11, 2008

Exploiting Children Via Databases (Database Nation Discussion)

In Chapter 7, "Buy Now!: Selling It To Our Youngest Consumers", Simson Garfinkel talks about the exploitative practice of collecting data from children while they use the Internet and storing it in databases for marketing purposes. For some odd reason, I was unaware that sites I once deemed child-friendly because of their age-appropriate content could still be harmful to my children. I now know why I have received all types of marketing offers in the mail for various children's magazines, toys, and other products that I did not personally seek.

It is not enough that we have to deal with all of the commercials marketing to kids and other child predators. Now, we have to start screening the sites that our children frequent on the Internet that appear harmless. Who preys on children's ignorance of data gathering methods anyway? Are companies so concerned with increased profits that they will stop at nothing to entice a child to provide personal information about themselves and sometimes other members of their families?
I wonder if they realize that in their attempts to gain information for marketing campaigns geared toward children, they might be putting children in harm's way. There is no way to guarantee that some malicious individuals are not intercepting the same personal data that they see as harmless.

Where is the social responsibility in all of these ploys to collect information? Who will protect the information that they are gathering about children? I do not want to even think about what type of malicious individuals could be working for the companies who collect address information from children via mandatory product or site registrations. I think that we really need tougher regulations in place to sanction any company who attempts to collect information from children without their parents' consent despite disclaimers and acceptance agreements. There needs to be some way to protect our kids from further exploitation.

I read that some regulatory efforts have been made toward limiting how data can be gathered, but not toward prohibiting or completely outlawing data collection from minors. We as parents will simply have to continue regulating this practice ourselves. Such a task will require continuous monitoring and trying to control what information is allowed into our homes and what is allowed to go out of them. A Database Nation is definitely where we reside today, but you would think they would take it a little easier on the parents, the decision and purchase makers. It's like our jobs as parents are not difficult enough; now we must learn to navigate around this intricate mess.

Understanding The Normalization of Database Tables

The first time I was introduced to this concept of normalization was in my Information Systems Design and Analysis class (IT 361), and honestly it never made sense to me. Perhaps it was due to the fact that we were not focusing as much on this material in the analysis class as we do now. Now that I look back, and have reviewed more of the material in the textbook Database Systems: Design, Implementation, and Management by Rob and Coronel, 7th edition, I see how they are related. I cannot explain it quite as eloquently as the text does, but I know the relationship that exists has to do with the System Development Life Cycle (SDLC) and the Database Life Cycle (DBLC). If I recall this correctly, each of these cycles or processes has a framework or methodology (best practices) for being executed. They are connected in that each is a reflection of the other. The Rob and Coronel text puts it more eloquently in Chapter 9, Database Design: "that successful database design must reflect the information system of which the database is a part. Successful information systems are developed within a framework known as the SDLC. That within the information system, the most successful databases are subject to frequent evaluation and revision within a framework known as the DBLC" (Coronel 359). This text talks about how the database is part of the overall picture in an information system (Coronel 359), and that seems logical to me only now.

It was not that I simply could not make the connection between Information System and Database System, because obviously the common thread between them is information. However, it was not until I began to learn database design that things became clearer to me. I began to see them as partners in a marriage in the sense that one is not more important than the other; they must coexist in a way that the needs of each are met; and to be successful they cannot be rushed into or occur at random but must be carefully planned so that all needs or requirements are met. Otherwise, like a marriage poorly put together, the two will fall apart or fail. Thus, I realize the importance of the sequence of the systems analysis class and the database design class and why it was important to introduce some database concepts in the analysis class.

I recall my instructor explaining primary keys, candidate keys, composite keys, and first, second, and third normal form, and feeling overwhelmed because I simply could not figure out why these concepts mattered to systems analysis and design. Now, I was taking the class via delayed tape and was always playing catch-up with the material, so perhaps that is the reason why I never quite understood it. At any rate, I remember my instructor saying that we would see this again later in some class. Honestly, I'm sure he likely said in database design, but I totally forgot about it.

So, now the concepts of database design have been introduced to me and we finally get to the normalization of database tables and I begin to panic. I'm thinking uh-oh, here goes that difficult stuff again that never quite made sense to me. However, I watch the DVD for IT 450 and recall the ERD concepts and it does not seem so bad. After watching the class lecture, I then decide to tackle the chapter on normalization to see if the book reads easier. I find myself rereading and dozing off a little bit because of the wording, but then I start to pay attention to the diagrams. It seems simple enough, at least going from first normal form (1NF) to second normal form (2NF). This simplicity I discovered by finally understanding what partial dependency ("dependency based only on part of a composite key," Coronel 154), transitive dependency (dependency of one attribute that is not part of a key upon another attribute that is not part of a key, Coronel 154), and desirable dependencies actually mean to the database design.
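To convince myself about partial dependency, I put together a small sketch with Python's sqlite3 module. The tables and sample rows are made up by me and are not from the textbook. In the first table the key is (student_id, course_id), but student_name depends only on student_id, so the name is repeated for every enrollment. Moving it into its own table removes the repetition without losing any information.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

conn.executescript("""
-- 1NF table with a composite key. student_name depends only on
-- student_id, which is just part of the key (a partial dependency).
CREATE TABLE enrollment_1nf (
    student_id   INTEGER,
    course_id    TEXT,
    student_name TEXT,
    grade        TEXT,
    PRIMARY KEY (student_id, course_id)
);
INSERT INTO enrollment_1nf VALUES (1, 'IT361', 'Ann', 'A');
INSERT INTO enrollment_1nf VALUES (1, 'IT450', 'Ann', 'B');
INSERT INTO enrollment_1nf VALUES (2, 'IT450', 'Bob', 'A');

-- 2NF: the partially dependent attribute moves to its own table.
CREATE TABLE student (
    student_id   INTEGER PRIMARY KEY,
    student_name TEXT
);
CREATE TABLE enrollment (
    student_id INTEGER REFERENCES student(student_id),
    course_id  TEXT,
    grade      TEXT,
    PRIMARY KEY (student_id, course_id)
);
INSERT INTO student SELECT DISTINCT student_id, student_name FROM enrollment_1nf;
INSERT INTO enrollment SELECT student_id, course_id, grade FROM enrollment_1nf;
""")

# How many places store the name 'Ann' before and after the split.
copies_before = conn.execute(
    "SELECT COUNT(*) FROM enrollment_1nf WHERE student_name = 'Ann'"
).fetchone()[0]
copies_after = conn.execute(
    "SELECT COUNT(*) FROM student WHERE student_name = 'Ann'"
).fetchone()[0]

# Joining the two 2NF tables gives back exactly the original rows.
original = conn.execute(
    "SELECT student_id, course_id, student_name, grade "
    "FROM enrollment_1nf ORDER BY student_id, course_id"
).fetchall()
rejoined = conn.execute(
    "SELECT e.student_id, e.course_id, s.student_name, e.grade "
    "FROM enrollment AS e JOIN student AS s ON e.student_id = s.student_id "
    "ORDER BY e.student_id, e.course_id"
).fetchall()

print(copies_before, copies_after, rejoined == original)
```

With the name stored in one place, a correction to it only has to be made once, which is the kind of integrity benefit the chapter is getting at. Removing transitive dependencies to reach third normal form follows the same pattern of moving the dependent attribute into its own table.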

I understand that the goal of this normalization process is to maintain the integrity of data and to minimize errors that could impact the performance of the database. Yes, this is basically what I should have gotten from class, but it is still sometimes difficult to apply these concepts. In an effort to understand normalization, I tried to find real-world examples of situations in which database design was poor due to normalization errors; however, finding that exact example was like finding a needle in a haystack. I did find some blogs written by professional database programmers that helped me to understand the normalization of database tables (http://database-programmer.blogspot.com/2007/12/database-skills-first-normal-form.html). I am not as strong with it as I would like, therefore I will not try to explain it. This is just a blog acknowledging that I am more comfortable with the normalization of database tables because of external research as well as class lectures and the text. Now, how I will fare come test time depends on how much time I get to practice normalizing tables and the difficulty level of the questions.

My Introduction to Relational Databases

My very first experience with relational databases was between 1997 and 2000. I was enrolled in community college seeking an Associate's degree in Information Systems Technology. The course was called Introduction to Database Management; however, I do not believe I really learned how to manage databases. My instructor was great and I managed to get through the course just fine, but I did not retain much of what I learned. I believe that was largely due to not using the skills I learned beyond the course. What I remember most was that the course project required a lot of time and that I ended up buying a computer and the Microsoft Professional Suite to have unlimited access to Microsoft Access 97.

I spent countless hours learning to build a database, creating primary keys, and performing queries either for sample projects for student enrollment (courses) or medical records (patients). Today, I have discovered that those sample projects must be the best examples because we use them as models again in my Introduction to Database Concepts course IT 450.
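For anyone who has never done this, the sketch below shows roughly what those exercises boiled down to, written with Python's sqlite3 module rather than Microsoft Access. The patient table and its rows are invented for illustration and are not from any course material.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical patient table; patient_id is the primary key.
conn.execute("""
    CREATE TABLE patient (
        patient_id INTEGER PRIMARY KEY,
        last_name  TEXT NOT NULL,
        birth_year INTEGER
    )
""")
conn.executemany(
    "INSERT INTO patient VALUES (?, ?, ?)",
    [(1, "Rivera", 1970), (2, "Chen", 1985), (3, "Okafor", 1962)],
)

# The primary key guarantees that each patient_id identifies exactly one row.
try:
    conn.execute("INSERT INTO patient VALUES (2, 'Duplicate', 1990)")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

# A simple query: last names of patients born before 1980, in alphabetical order.
older = [
    row[0]
    for row in conn.execute(
        "SELECT last_name FROM patient WHERE birth_year < 1980 ORDER BY last_name"
    )
]

print(duplicate_rejected, older)
```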

In my IT 450 course, I am learning about ERDs and relational schemas for the first time since they were previewed in my Systems Analysis and Design classes (IT 361 and IT 473). They are critical to database design, yet I never learned about them in my very first database management course. I only recall using a thick textbook called Microsoft Access 97 and building the database according to a long list of requirements. The requirements were not written like the business requirements that I am being exposed to now. They were more along the lines of what would be expected once our database design was analyzed with respect to primary keys, field and index requirements, data types, and other things. This did not bother me at the time because I never knew how I would use this course later. However, it intrigues me today that either I was asleep at the wheel or the very basis for database design was not a focus of the course. Honestly, I do not remember any discussion of ERDs and relational schemas.

Today, I am a little disappointed because I feel like I would be more prepared for my current course, especially with all of the logistical issues I'm having with the course delivery, had we at least discussed this topic back then. I would feel a lot less pressure while trying to grasp the concepts. Perhaps the idea was to just have us be succinctly introduced to databases and leave the real groundwork for higher educational institutions, those providing the Bachelor's and Master's degrees. If that is the case, then I can minimize my disappointment and focus on actually learning, and possibly applying in the workforce, what I gain from the Introduction to Database Concepts course today.
