IT 385 Allegany College of Maryland Data Quality Assessment Paper
Scenario: Your team is doing research on web traffic on Marymount's network (dorms and academic/admin buildings). You will be using web traffic data, firewall data, and data from two external data sources of your choosing. (You can use your best guess for the size of the university's datasets.) Please write out a Data Quality Assessment that includes the information below for your data sets. Student privacy must be included.
Data Quality Assessment
Description of Data
Type of Research
Types of Data
Format and Scale of Data
Data Collection / Generation
Methodologies for data collection/generation
Data Quality and Standards
Data Management, Documentation, and Curation
Managing/storing and curating data
Metadata standards and data documentation
Data preservation strategy and standards
Data Security and confidentiality
Format information / data security standards
Main risks to data security
Data Sharing and Access
Suitability for sharing
Discovery by potential users of the research data
Governance of Access
The study team's exclusive use of the data
Regulation of responsibilities of users
Relevant institutional, departmental or study policies on data sharing and data security
Include policy name and URL/Reference to it. Also any laws that may apply.
Deliverable: Please provide the following in a PDF or Word document(s):
Research Question/Problem you will use.
Data Quality Assessment Report, including the 2 chosen external data sources used to match against web traffic.
IT 385
Managing Big Data
Professor Tim Eagle
Module 3:
Data Assets
Data Assets
• ETL
• Data Formats
• Sources of Data
  • Internal vs. External
  • Log Data
  • Public vs. Private
  • Free vs. Paid
  • Refreshing Data
• Data Providers
• Data Scraping
  • Old Way – Scraping from HTML
  • New Way – APIs, XML, CAML, etc.
• Data Formatting
  • Strategies – ETL vs. ELT
  • Tools
Extract, Transform, and Load
• Extract is the process of reading data from a database. In this stage, the data is collected, often from multiple and different types of sources.
• Transform is the process of converting the extracted data from its previous form into the form it needs to be in so that it can be placed into another database. Transformation occurs by using rules or lookup tables or by combining the data with other data.
• Load is the process of writing the data into the target database.
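To make the three stages concrete, here is a minimal Python sketch of an ETL pass: it extracts rows from a CSV file, transforms a couple of fields, and loads the result into a SQLite table. The file name, columns, and target schema are made up for illustration.

```python
import csv
import sqlite3

# Extract: read rows from a (hypothetical) delimited source file
# assumed to have exactly a "host" and a "bytes" column.
with open("web_traffic.csv", newline="") as f:
    rows = list(csv.DictReader(f))

# Transform: normalize case and convert the byte count to an integer.
for row in rows:
    row["host"] = row["host"].strip().lower()
    row["bytes"] = int(row["bytes"] or 0)

# Load: write the cleaned rows into the target database.
conn = sqlite3.connect("warehouse.db")
conn.execute("CREATE TABLE IF NOT EXISTS traffic (host TEXT, bytes INTEGER)")
conn.executemany("INSERT INTO traffic (host, bytes) VALUES (:host, :bytes)", rows)
conn.commit()
conn.close()
```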
Data Formats
• Excel
• Delimited Files – Most common for larger sets
  • Can be easy, but older sets are often complex (FECA example)
• JSON – JavaScript Object Notation – Becoming more popular
• XML – Extensible Markup Language – Most common for newer data sets
• Proprietary Formats
  • SAS Datasets
  • Database Backups/Dumps
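As a quick illustration, the sketch below reads the same kind of record from a delimited file, a JSON file, and an XML file using only the Python standard library. The file names and fields are hypothetical.

```python
import csv
import json
import xml.etree.ElementTree as ET

# Delimited (CSV/TSV): csv.DictReader handles quoting and custom delimiters.
with open("hosts.csv", newline="") as f:
    csv_rows = list(csv.DictReader(f, delimiter=","))

# JSON: typically a list of objects, or one object per line (JSON Lines).
with open("hosts.json") as f:
    json_rows = json.load(f)

# XML: walk the tree and pull out the elements you need.
tree = ET.parse("hosts.xml")
xml_rows = [
    {"host": rec.findtext("host"), "bytes": rec.findtext("bytes")}
    for rec in tree.getroot().iter("record")
]

print(len(csv_rows), len(json_rows), len(xml_rows))
```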
Sources of Data
Internal vs. External
Internal
• Free
• Varied depending on business
• Should be able to paint a picture of your business
• Can be quite messy
External
• Not free
• Plenty of places to look
• Pay for cleaner data
Sources of Data
Log Data
• Depending on the project you are working with, log data might be useful
• Sources:
  • System Logs
  • Network Logs
  • Application Logs
• Often need more formatting than other data sources
• Often very noisy
• Log aggregation tools bring it all together, and might handle formatting
  • Splunk
  • Graylog
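For example, here is a hedged sketch of parsing web/firewall-style log lines with a regular expression, assuming the common (Apache-style) access log format; real campus logs may differ, and a log aggregator would normally do this formatting for you.

```python
import re

# Common Log Format: host ident authuser [timestamp] "request" status bytes
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\d+|-)'
)

def parse_line(line):
    """Return a dict of fields, or None for lines that don't match (noise)."""
    m = LOG_PATTERN.match(line)
    if not m:
        return None
    rec = m.groupdict()
    rec["size"] = 0 if rec["size"] == "-" else int(rec["size"])
    return rec

sample = '10.1.2.3 - - [04/Mar/2021:10:15:32 -0500] "GET /portal HTTP/1.1" 200 5123'
print(parse_line(sample))
```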
Sources of Data
Public vs. Private
Public Data Sets:
• NWS Weather
• Data.gov
• SSA DMF
• https://cloud.google.com/public-datasets/
• https://aws.amazon.com/opendata/public-datasets/
• https://datasetsearch.research.google.com/
• https://github.com/awesomedata/awesome-public-datasets
Private Data Sets:
• Typically not free
• Very business-specific
• Update frequently
• Very specific licensing terms
• https://www.fdbhealth.com/solutions/medknowledge/medknowledge-drug-pricing
• Dun & Bradstreet
• https://risk.lexisnexis.com/products/accurint-for-healthcare
Sources of Data
Free vs. Paid
Free
• Generally will require more cleaning
• Updated less often
• Some sets are free for partial sets of data
• No support for data issues
• You get what you paid for…
Paid
• Cleanliness of data varies
• Pay for more updates
• Full sets of data, or built for your specific needs
• Support often provided
• Can get costly (Cost-Benefit Analysis needed)
• Some public data costs as well
Sources of Data
Refreshing Data
• How often?
• Refresh Policy
  • Replace Existing Data
  • Update Existing Data
  • Append Data
  • Update and Add New Data
  • Storage Space Limitations
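A small sketch of the refresh policies against a SQLite table (table and columns are invented; the upsert syntax assumes SQLite 3.24 or newer):

```python
import sqlite3

conn = sqlite3.connect("warehouse.db")
conn.execute(
    "CREATE TABLE IF NOT EXISTS sites (host TEXT PRIMARY KEY, visits INTEGER)"
)
new_rows = [("library.example.edu", 120), ("portal.example.edu", 340)]

# Replace existing data: wipe the table and reload it from scratch.
conn.execute("DELETE FROM sites")
conn.executemany("INSERT INTO sites VALUES (?, ?)", new_rows)

# Append data: just add rows (here, quietly skipping key collisions).
conn.executemany("INSERT OR IGNORE INTO sites VALUES (?, ?)", new_rows)

# Update and add new data (upsert): insert, or update the row if the key exists.
conn.executemany(
    "INSERT INTO sites VALUES (?, ?) "
    "ON CONFLICT(host) DO UPDATE SET visits = excluded.visits",
    new_rows,
)
conn.commit()
conn.close()
```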
Data Providers
Data Providers – Companies/websites that aggregate various datasets, then provide the data under either a paid license or an open license. Can be very market-specific. Often have a single API to access all data sets.
• https://www.ecoinvent.org/home.html
• https://intrinio.com/
• https://www.programmableweb.com/category/all/apis
• Data.gov
Also, there is a new push for Data as a Service, where we don't download data sets; we just query against a service provider.
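As a sketch of the query-a-service model, the snippet below searches the CKAN catalog API that Data.gov has exposed at catalog.data.gov; the endpoint, parameters, and response shape should be checked against the current docs, and requests is a third-party package.

```python
import requests  # third-party: pip install requests

# DaaS-style query: ask the catalog for datasets about "web traffic"
# instead of downloading and hosting the files ourselves.
resp = requests.get(
    "https://catalog.data.gov/api/3/action/package_search",
    params={"q": "web traffic", "rows": 5},
    timeout=30,
)
resp.raise_for_status()
for result in resp.json().get("result", {}).get("results", []):
    print(result.get("title"))
```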
Data Scraping
Old Way
• Since the internet used to be fairly static, to get data we would scrape it from web pages using code.
• You'd have scripts run through pages, look for certain spots or words, and capture what they found in another file.
• Often memory- and resource-intensive.
Data Scraping
New Way
• Grab RSS/XML feeds from pages
• Use APIs to access a site's data
• Use tools to scrape data from social media or webpages
  • Data Scraper – Chrome plugin
  • WebHarvey
  • Import.io
• Buy pre-scraped and formatted data
• Use a hybrid of the new and old ways to see the whole picture
• https://www.scrapehero.com/tutorial-how-to-scrape-amazon-product-details-using-python-and-selectorlib/
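And a minimal old-way sketch for comparison: fetch a page and pull table cells out of the HTML with the standard-library parser. The URL and page layout are placeholders; check robots.txt and the site's terms before scraping anything for real, and prefer an API or feed when one exists.

```python
from html.parser import HTMLParser
from urllib.request import urlopen

class CellCollector(HTMLParser):
    """Collect the text of every <td> cell on the page (old-way scraping)."""
    def __init__(self):
        super().__init__()
        self.in_cell = False
        self.cells = []

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self.in_cell = True

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_cell = False

    def handle_data(self, data):
        if self.in_cell and data.strip():
            self.cells.append(data.strip())

# Hypothetical page with a data table in it.
html = urlopen("https://example.com/price-table.html").read().decode("utf-8")
parser = CellCollector()
parser.feed(html)
print(parser.cells[:10])
```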
Data Formatting
ETL vs. ELT
https://www.xplenty.com/blog/etl-vs-elt/
Data Formatting
ETL
• A continuous, ongoing process with a well-defined workflow: ETL first extracts data from homogeneous or heterogeneous data sources. Next, it deposits the data into a staging area. Then the data is cleansed, enriched, transformed, and stored in the data warehouse.
• Used to require detailed planning, supervision, and coding by data engineers and developers: The old-school methods of hand-coding ETL transformations in data warehousing took an enormous amount of time. Even after designing the process, it took time for the data to go through each stage when updating the data warehouse with new information.
• Modern ETL solutions are easier and faster: Modern ETL, especially for cloud-based data warehouses and cloud-based SaaS platforms, happens a lot faster.
Data Formatting
ELT
• Ingest anything and everything as the data becomes available: ELT paired with a data lake lets you ingest an ever-expanding pool of raw data immediately, as it becomes available. There's no requirement to transform the data into a special format before saving it in the data lake.
• Transforms only the data you need: ELT transforms only the data required for a particular analysis. Although it can slow down the process of analyzing the data, it offers more flexibility, because you can transform the data in different ways on the fly to produce different types of metrics, forecasts, and reports. Conversely, with ETL, the entire ETL pipeline, and the structure of the data in the OLAP warehouse, may require modification if the previously decided structure doesn't allow for a new type of analysis.
• ELT is less reliable than ETL: It's important to note that the tools and systems of ELT are still evolving, so they're not as reliable as ETL paired with an OLAP database. Although it takes more effort to set up, ETL provides more accurate insights when dealing with massive pools of data. Also, ELT developers who know how to use ELT technology are more difficult to find than ETL developers.
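A rough contrast in code: the ELT-style sketch below lands raw JSON records in a SQLite "lake" table first and leaves the transformation to SQL run inside the database when an analysis needs it. The table, columns, and use of the JSON functions assume a reasonably recent SQLite build.

```python
import json
import sqlite3

conn = sqlite3.connect("lake.db")

# Load first: land the raw records as-is, with no up-front schema for the payload.
conn.execute("CREATE TABLE IF NOT EXISTS raw_events (payload TEXT)")
events = [{"host": "Portal.Example.EDU", "bytes": "512"},
          {"host": "library.example.edu", "bytes": "2048"}]
conn.executemany(
    "INSERT INTO raw_events VALUES (?)", [(json.dumps(e),) for e in events]
)
conn.commit()

# Transform later, only for the analysis at hand, using SQL in the database.
rows = conn.execute(
    """
    SELECT lower(json_extract(payload, '$.host')) AS host,
           SUM(CAST(json_extract(payload, '$.bytes') AS INTEGER)) AS total_bytes
    FROM raw_events
    GROUP BY host
    """
).fetchall()
print(rows)
conn.close()
```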
Data Formatting
Tools
• Excel – It can do a ton of great stuff; however, it SH*** the bed with larger data sources
• Scripting – Perl or Python – Work great with flat files, XML, JSON, but not with others
• SQL – Can do much of the formatting in SQL and create new tables
• SAS / R – Same as scripting, very powerful, but a learning curve
• ETL-Specific Tools
  • Informatica
  • Microsoft SSIS
  • Oracle Data Integrator
  • IBM InfoSphere DataStage
  • Apache Airflow, Kafka, NiFi
  • Talend Open Studio
Data Formatting
Tips
• Dates/Times – convert to the same timezone
• Units – convert to the same units when possible
• Standardize addresses
• Classify/tag certain data
• Timestamp data
• If you are going to map, geocode early – it can be costly.
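Two of these tips in code, as a sketch with made-up formats and units:

```python
from datetime import datetime, timezone

# Tip: convert timestamps to a single timezone (UTC) and a single format.
def to_utc_iso(raw, fmt="%m/%d/%Y %H:%M %z"):
    """Parse a (hypothetical) source format and re-emit as UTC ISO-8601."""
    return datetime.strptime(raw, fmt).astimezone(timezone.utc).isoformat()

# Tip: convert measurements to one unit before they ever hit the warehouse.
KB_PER_UNIT = {"B": 1 / 1024, "KB": 1, "MB": 1024, "GB": 1024 * 1024}

def to_kilobytes(value, unit):
    return value * KB_PER_UNIT[unit.upper()]

print(to_utc_iso("03/04/2021 10:15 -0500"))   # -> 2021-03-04T15:15:00+00:00
print(to_kilobytes(2, "MB"))                  # -> 2048
```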
In Class Assignment
Find the best data source to help solve these problems.
• When should our nation-wide business start to stock snow shovels?
• Where is the best location to buy real estate for a car dealership?
• Which third baseman plays better in day games?
• How many people have larger incomes and fewer children in a specific geographic area?
• What data sources would have me (your professor) in them?
Post answers to the discussion board, as replies to each question.
The End
• Quiz 1 posted, due next week
• Assignment 1 posted, due in 2 weeks.
IT 385
Managing Big Data
Professor Tim Eagle
Module 4:
Data Quality
Data Quality
• Garbage In, Garbage Out
• Definition of Data Quality
• The Continuum of Data Quality
• Other Problems with Data Quality
• Creating Better Data Quality
• Data Cleansing
• Master Data Management
• Data Deduplication
• Data Interpretation
• The Need for Domain Experts
Garbage In, Garbage Out
Definition of Data Quality
• Validity – Data measure what they are supposed to measure.
• Reliability – Everyone defines, measures, and collects data the same way, all the time.
• Completeness – Data include all of the values needed to calculate indicators. No variables are missing.
• Precision – Data have sufficient detail. Units of measurement are very clear.
• Timeliness – Data are up to date. Information is available on time.
• Integrity – Data are true. The values are safe from deliberate bias and have not been changed for political or personal reasons.
Validity
Data measure what they are supposed to measure.
When it goes wrong:
AI and facial recognition.
https://www.wmm.com/sponsored-project/codedbias/?fbclid=IwAR1xSLJToeXvMbNePKnlTndCFjnO3485Iv0AMf5wcZC-Tb1UuIJSiT3ivnQ
Reliability
Everyone defines, measures, and collects data the same way – all the time.
When it goes wrong:
Even the discovery of the Americas was a result of bad data. Christopher Columbus made a few significant miscalculations when charting the distance between Europe and Asia. First, he favored the values given by the Persian geographer Alfraganus over the more accurate calculations of the Greek geographer Eratosthenes. Second, Columbus assumed Alfraganus was referring to Roman miles in his calculations when, in reality, he was referring to Arabic miles.
Completeness
Data include all of the values needed to calculate indicators. No variables are missing.
When it goes wrong:
The 2016 United States Presidential election was also mired in bad data. National polling data used to predict state-by-state Electoral College votes led to the prediction of a Hillary Clinton landslide, a forecast that led many American voters to stay home on Election Day.
Also, The Big Short.
Precision
Data have sufficient detail. Units of measurement are very clear.
When it goes wrong:
In 1999, NASA took a $125 million hit when it lost the Mars Orbiter. It turns out that the engineering team responsible for developing the Orbiter used English units of measurement while NASA used the metric system. The problem here is that the data were inconsistent, making it a rather costly and disastrous mistake.
Timeliness
Data are up to date. Information is available on time.
When it goes wrong:
The turning point of the Civil War: Gettysburg. Lee, general of the Confederate army, had old intel and didn't know the accurate count of troops.
Integrity
Data are true. The values are safe from deliberate bias and have not been changed for political or personal reasons.
When it goes wrong:
The Enron scandal in 2001 was largely a result of bad data. Enron was once the sixth-largest company in the world. A host of fraudulent data provided to Enron's shareholders resulted in Enron's meteoric rise and subsequent crash. An ethical external auditing firm could have prevented this fraud from occurring.
Or the anti-vaccination movement.
The Data Quality Continuum
Data and information are not static; they flow through a data collection and usage process:
Data gathering
Data delivery
Data storage
Data integration
Data retrieval
Data mining/analysis
Data Gathering
How does the data enter the system?
Sources of problems:
Manual entry
No uniform standards for content and formats
Parallel data entry (duplicates)
Approximations, surrogates – SW/HW constraints
Measurement errors.
Solutions
Potential Solutions:
Preemptive:
Process architecture (build in integrity checks)
Process management (reward accurate data entry, data sharing,
data stewards)
Retrospective:
Cleaning focus (duplicate removal, merge/purge, name & address matching, field value standardization)
Diagnostic focus (automated detection of glitches).
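A small retrospective-cleaning sketch using pandas (a third-party library); the columns and the standardization rules are invented:

```python
import pandas as pd  # third-party: pip install pandas

df = pd.DataFrame({
    "name":  ["Ada Lovelace", "ada lovelace ", "Grace Hopper"],
    "state": ["MD", "Maryland", "md"],
})

# Field value standardization: trim, fix case, map synonyms to one code.
df["name"] = df["name"].str.strip().str.title()
df["state"] = df["state"].str.strip().str.upper().replace({"MARYLAND": "MD"})

# Duplicate removal / merge-purge: drop rows that are now identical.
df = df.drop_duplicates(subset=["name", "state"])
print(df)
```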
Data Delivery
Destroying or mutilating information by inappropriate pre-processing
Inappropriate aggregation
Nulls converted to default values
Loss of data:
Buffer overflows
Transmission problems
No checks
Solutions
Build reliable transmission protocols
Use a relay server
Verification
Checksums, verification parser
Do the uploaded files fit an expected pattern?
Relationships
Are there dependencies between data streams and processing steps?
Interface agreements
Data quality commitment from the data stream supplier.
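A minimal verification sketch: compute a SHA-256 checksum for a delivered file and compare it against the digest the supplier published (file name and expected value are placeholders):

```python
import hashlib

def sha256_of(path, chunk_size=1 << 20):
    """Stream the file in chunks so large deliveries don't exhaust memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

expected = "PUT-THE-SUPPLIER-PROVIDED-DIGEST-HERE"
actual = sha256_of("daily_feed.csv")
if actual != expected:
    raise ValueError(f"Checksum mismatch: expected {expected}, got {actual}")
```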
Data Storage
You get a data set. What do you do with it?
Problems in physical storage
Can be an issue, but terabytes are cheap.
Problems in logical storage (ER → relations)
Poor metadata.
Data feeds are often derived from application programs or legacy
data sources. What does it mean?
Inappropriate data models.
Missing timestamps, incorrect normalization, etc.
Ad-hoc modifications.
Structure the data to fit the GUI.
Hardware / software constraints.
Data transmission via Excel spreadsheets, Y2K
Solutions
Metadata
Document and publish data specifications.
Planning
Assume that everything bad will happen.
Can be very difficult.
Data exploration
Use data browsing and data mining tools to examine the data.
Does it meet the specifications you assumed?
Has something changed?
Data Integration
Combine data sets (acquisitions, across departments).
Common source of problems
Heterogeneous data: no common key, different field formats
Approximate matching
Different definitions
What is a customer: an account, an individual, a family, …
Time synchronization
Does the data relate to the same time periods? Are the time windows compatible?
Legacy data
IMS, spreadsheets, ad-hoc structures
Sociological factors
Reluctance to share – loss of power.
Solutions
Commercial Tools
Significant body of research in data integration
Many tools for address matching, schema mapping are available.
Data browsing and exploration
Many hidden problems and meanings: must extract metadata.
View before and after results: did the integration go the way you thought?
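A toy approximate-matching sketch using difflib from the standard library; the hostnames and the similarity cut-off are invented, and real integration work usually leans on dedicated matching tools:

```python
from difflib import SequenceMatcher

def normalize(s):
    """Lowercase and strip everything that is not a letter or digit."""
    return "".join(ch for ch in s.lower() if ch.isalnum())

def similarity(a, b):
    """Crude string similarity in [0, 1] after light normalization."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()

campus_hosts = ["residence-hall-a.marymount.edu", "science-bldg.marymount.edu"]
vendor_names = ["Residence Hall A (marymount.edu)", "Science Building, marymount.edu"]

# Link each vendor record to the most similar campus host, with no common key.
for name in vendor_names:
    best = max(campus_hosts, key=lambda host: similarity(name, host))
    if similarity(name, best) > 0.6:      # arbitrary cut-off
        print(f"{name!r} -> {best!r}")
```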
Data Retrieval
Exported data sets are often a view of the actual data. Problems occur
because:
Source data not properly understood.
Need for derived data not understood.
Just plain mistakes.
Inner join vs. outer join
Understanding NULL values
Computational constraints
E.g., too expensive to give a full history, we'll supply a snapshot.
Incompatibility
EBCDIC?
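A quick illustration of inner vs. outer joins and where NULLs appear, using SQLite with invented tables:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE hosts (host TEXT PRIMARY KEY);
    CREATE TABLE traffic (host TEXT, bytes INTEGER);
    INSERT INTO hosts VALUES ('portal'), ('library'), ('dorm-a');
    INSERT INTO traffic VALUES ('portal', 512), ('library', 2048);
""")

# Inner join: hosts with no traffic rows silently disappear from the export.
print(conn.execute(
    "SELECT h.host, t.bytes FROM hosts h JOIN traffic t ON h.host = t.host"
).fetchall())

# Left outer join: every host appears; missing traffic shows up as NULL (None).
print(conn.execute(
    "SELECT h.host, t.bytes FROM hosts h LEFT JOIN traffic t ON h.host = t.host"
).fetchall())
```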
Data Mining and Analysis
What are you doing with all this data anyway?
Problems in the analysis.
Scale and performance
Confidence bounds?
Black boxes and dart boards
"Fire your statisticians"
Attachment to models
Insufficient domain expertise
Casual empiricism
Solutions
Data exploration
Determine which models and techniques are appropriate, find data bugs,
develop domain expertise.
Continuous analysis
Are the results stable? How do they change?
Accountability
Make the analysis part of the feedback loop.
Other problems in DQ – Missing Data
Missing data – values, attributes, entire records, entire sections
Missing values and defaults are indistinguishable
Truncation/censoring – not aware, mechanisms not known
Problem: Misleading results, bias.
Data Glitches
Systemic changes to data which are external to the recorded process.
Changes in data layout / data types
Integer becomes string, fields swap positions, etc.
Changes in scale / format
Dollars vs. euros
Temporary reversion to defaults
Failure of a processing step
Missing and default values
Application programs do not handle NULL values well
Gaps in time series
Especially when records represent incremental changes.
Departmental Silos
• Everyone sees their job, department, business as the most important thing.
• Often Departments or other Groups will have their own Data Quality Standards for their specific mission.
• Data Quality suffers when you have to look at data across the business or between companies.
• Example: Federal ID for Companies and Businesses
  • DUNS vs. TaxID vs. NPI vs. SSN vs. Department ID vs. Universal ID
Then why is every DB dirty?
Consistency constraints are often not used
Cost of enforcing the constraint
E.g., foreign key constraints, triggers.
Loss of flexibility
Constraints not understood
E.g., large, complex databases with rapidly changing requirements
DBA does not know / does not care.
Garbage in
Merged, federated, web-scraped DBs.
Undetectable problems
Incorrect values, missing data
Metadata not maintained
Database is too complex to understand
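As a tiny example of the kind of consistency constraint that often goes unenforced, here is a foreign key in SQLite (which even has to be switched on per connection); the tables are invented:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")   # SQLite leaves FK enforcement off by default
conn.executescript("""
    CREATE TABLE buildings (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE traffic (
        building_id INTEGER REFERENCES buildings(id),
        bytes INTEGER
    );
    INSERT INTO buildings VALUES (1, 'Library');
""")

conn.execute("INSERT INTO traffic VALUES (1, 512)")       # OK: parent row exists
try:
    conn.execute("INSERT INTO traffic VALUES (99, 64)")   # no such building
except sqlite3.IntegrityError as e:
    print("Constraint caught the garbage:", e)
```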
Improving Data Quality
Data Cleansing
• Not just about the data itself; also about standardizing business log data & metrics
• Create universal identifiers across your business; look at best practices.
• Convert dates to the same timezone and format.
• Standardize naming conventions in metadata
Cleansing Methods
• Histograms
• Conversion Tables
  • Example – USA, U.S., U.S.A., US, United States
• Tools
• Algorithms
• Manually
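A conversion-table sketch for the USA/U.S./US example above; the mapping and the normalization rule are choices you would adapt:

```python
# Conversion table: collapse known variants to one canonical code.
COUNTRY_MAP = {
    "USA": "US",
    "US": "US",
    "UNITEDSTATES": "US",
    "UNITEDSTATESOFAMERICA": "US",
}

def canonical_country(raw):
    """Normalize punctuation and case, then look the value up; None means unmapped."""
    key = "".join(ch for ch in raw.upper() if ch.isalnum())
    return COUNTRY_MAP.get(key)

for value in ["USA", "U.S.", "U.S.A.", "US", "United States"]:
    print(value, "->", canonical_country(value))
```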
Master Data Management
MDM Continued
Data Deduplication
• Data deduplication: A process that examines new data blocks using hashing, compares them to existing data blocks, and skips redundant blocks when data is transferred to the target.
• Data reduction: A process that tracks block changes, usually using some kind of log or journal, and then transfers only new blocks to the backup target.
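A toy sketch of the hashing idea behind deduplication: split the stream into fixed-size blocks, hash each block, and keep only blocks whose hash has not been seen. The block size and the in-memory index are simplifications.

```python
import hashlib

def dedupe_blocks(data: bytes, block_size: int = 4096):
    """Yield only blocks whose hash has not been seen before (toy model)."""
    seen = set()
    for start in range(0, len(data), block_size):
        block = data[start:start + block_size]
        digest = hashlib.sha256(block).hexdigest()
        if digest not in seen:        # redundant blocks are skipped
            seen.add(digest)
            yield digest, block

payload = b"A" * 8192 + b"B" * 4096 + b"A" * 4096   # repeated content
unique = list(dedupe_blocks(payload))
print(f"{len(unique)} unique blocks out of {-(-len(payload) // 4096)} total")
```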
Data Interpretation
• Data interpretation refers to the implementation of processes through which data is reviewed for the purpose of arriving at an informed conclusion. The interpretation of data assigns a meaning to the information analyzed and determines its signification and implications.
• Qualitative Interpretation – Observations, Documents, Interviews
• Quantitative Interpretation – Mean, Standard Deviation, Frequency Distribution
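A quick quantitative-interpretation sketch with the standard library (sample values invented):

```python
import statistics
from collections import Counter

daily_requests = [1020, 980, 1500, 1010, 995, 2400, 1005]

print("mean:", statistics.mean(daily_requests))
print("std dev:", round(statistics.stdev(daily_requests), 1))

# Frequency distribution: bucket the values, then count per bucket.
buckets = Counter((value // 500) * 500 for value in daily_requests)
for low in sorted(buckets):
    print(f"{low}-{low + 499}: {buckets[low]}")
```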
Data Interpretation Problems
• Correlation mistaken for Causation
• Confirmation Bias
• Irrelevant Data
Domain Expertise
Data quality gurus: "We found these peculiar records in your database after running sophisticated algorithms!"
Domain experts: "Oh, those apples – we put them in the same baskets as oranges because there are too few apples to bother. Not a big deal. We knew that already."
Why Domain Expertise?
DE is important for understanding the data, the problem, and interpreting the results
The counter resets to 0 if the number of calls exceeds N.
The missing values are represented by 0, but the default billed amount is 0 too.
Insufficient DE is a primary cause of poor DQ – data are unusable
DE should be documented as metadata
Where is the Domain Expertise?
Usually in people's heads – seldom documented
Fragmented across organizations
Often experts don't agree. Force consensus.
Lost during personnel and project transitions
If undocumented, it deteriorates and becomes fuzzy over time
The End
• Readings: ebook Ch. 3 and Ch. 5
• Homework 1 is due next week
ALSPAC DATA MANAGEMENT PLAN, 2019-2024
0. Proposal name
The Avon Longitudinal Study of Parents and Children (ALSPAC). Core Program Support 2019-2024.
1. Description of the data
1.1 Type of study
ALSPAC is a multi-generation, geographically based cohort study following 14,541 mothers recruited during pregnancy (in 1990-1992) and their partners (G0), offspring (G1), and grandchildren (G2).
1.2 Types of data
Quantitative data
Data from numerous self-completed pa…