Frictionless is a library I heard about recently that impressed me a great deal. It offers:
- Support for various ingestion formats (CSV, Excel, Google Sheets, JSON, YAML, SQL, AWS, etc.)
- Automatic issue detection
- Fast response times even on big data
- Seamless integration with other tools
- A wide variety of checks
- A useful report afterwards that surfaces errors early, so you can fix them now rather than breaking everything downstream!
If you are running into headaches trying to find that one error in a file, or you need to get a file into your ecosystem as soon as possible but are uncertain it conforms to a standard, this should be your go-to first stop for ensuring data quality.
To get started, you first need to install Frictionless.
Install

```shell
pip install frictionless
```
Next, you want to start coding! Use whichever editor you prefer; I like VS Code for small projects like this.
Usage
If you copy this code, make sure you change the file_path variable to point to the file you want; the basic file I used is shown below.
```python
from frictionless import validate
import os


def validate_file(file_path):
    try:
        # Use Frictionless to validate the file
        return validate(file_path)
    except Exception as e:
        # Report the failure and return None so the caller can bail out
        print(f"Error occurred while validating file: {e}")
        return None


def output_validation_results(report):
    if report.valid:
        print("Validation successful! No issues found.")
    else:
        print("Validation failed! Issues found:")
        for error in report.flatten():
            print(f"- {error}")


def main():
    # Provide the file path here
    file_path = "./invalid_file.csv"  # Replace with the actual file path

    if not os.path.exists(file_path):
        print("File not found.")
        return

    print(f"Validating file: {file_path}")

    # Validate the provided file
    report = validate_file(file_path)

    # Output validation results
    if report is not None:
        output_validation_results(report)


if __name__ == "__main__":
    main()
```
Invalid CSV file:

```
id,name,,name
1,english
1,english
2,german,1,2,3
```
As you've seen, running your code with Frictionless reveals a list of errors within the file you used. While this example may seem trivial with a small dataset, the true value of Frictionless shines when working with larger datasets. Imagine dealing with hundreds of thousands of data points: hunting down that one error manually, versus having it surfaced in a few seconds, could save an enormous amount of time and effort.
But let's consider the broader potential. I began to envision Frictionless not just as a tool for data professionals, but as a catalyst for organisational efficiency. What if we flipped the script? Instead of stakeholders delivering flawed data to us, what if we provided them with a user-friendly interface to upload files? With Frictionless integrated into a web UI, stakeholders could instantly identify any issues with their files, enabling us to collaborate on fixes in real-time. This not only ensures the integrity of each individual file but also establishes a proactive approach to data quality assurance.
By leveraging Frictionless in innovative ways, we can transform data validation from a reactive process to a proactive strategy, driving efficiency and confidence in our data-driven initiatives.