Dear workflow experts,
is it possible to compute a textual diff of two string columns in workflows, preferably without Python or other scripting solutions?
My use case is that I need to diff the content of JIRA fields, i.e. a simplified markdown structure. Output in a format similar to what GNU diff or other shell diff tools provide would be fine.
best regards,
Adrian
example:
text 1
"* A user can view / manage / organize test runs/results
* A user can specify a location where tests are stored
* A user can easily find the testing environment
* A user can specify which apps can be tested
* Documentation about the new module is available
* A can start a test run
Both automated and manual tests were conducted for the changes of this epic.
"
text 2
"* A user can view / manage / organize test runs/results
* A user can specify a location where tests are stored
* A user can easily find the testing environment
* A user can specify which apps can be tested
* Documentation about the new module is available
* A user can start a test run
Both automated and manual tests were conducted for the changes of this epic.
"
example output (using GNU diff)
6c6
< * A can start a test run
---
> * A user can start a test run
Hey Adrian,
this is not a solution for your diff problem, but these two Spark functions might give you a first indication of how similar the strings are, without any additional scripting:
https://spark.apache.org/docs/2.3.0/api/sql/#levenshtein
and
https://spark.apache.org/docs/2.3.0/api/sql/#soundex
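To illustrate what such a similarity measure reports, here is a rough pure-Python sketch of a Levenshtein (edit) distance. It is not Spark's implementation; the function name `levenshtein` and the sample strings (taken from the example in the first post) are only assumptions for this illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance via dynamic programming (insert, delete, substitute)."""
    # previous[j] holds the distance between the processed prefix of a and b[:j]
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute (or keep if equal)
            ))
        previous = current
    return previous[-1]


old_line = "* A can start a test run"
new_line = "* A user can start a test run"
distance = levenshtein(old_line, new_line)
print(distance)
```

The result is a single number per pair of strings, so it indicates how much two values differ, but not where.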
Hi @jakob.schreff
thanks for the pointers. Unfortunately, they don't help with my challenge: the data is already prefiltered to show only rows with different values, and I need to determine the exact difference.
With help from @christoph.schober I was able to create a solution in Python. The following script is used in the Python processor:
import pandas as pd
import difflib as dl

def my_diff(row):
    # Compare the old and new value line by line and return a unified diff
    expected = row.old_value.splitlines(False)
    actual = row.new_value.splitlines(False)
    diff = dl.unified_diff(expected, actual, lineterm='')
    return '\n'.join(diff)

# od_input keys represent name of the input dataset set in OD processor
dataset = od_input['input'].get_as_pandas()
dataset["delta"] = dataset.apply(my_diff, axis=1)
# publish your output - key will be used for assigning the dataset to a specific output of OD processor in the future
# currently, key can be any non-empty string
od_output.add_data("output", dataset)
This is the generated output for the example in the initial post:
---
+++
@@ -3,7 +3,7 @@
 * A user can easily find the testing environment
 * A user can specify which apps can be tested
 * Documentation about the new module is available
-* A can start a test run
+* A user can start a test run
 Both automated and manual tests were conducted for the changes of this epic.
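If output closer to the GNU diff "normal" format from the first post is preferred over a unified diff, it could be derived from `difflib.SequenceMatcher` opcodes. The following is only a sketch using the standard library; `normal_diff` and `_fmt_range` are hypothetical helper names, and it has only been reasoned through for simple cases like the one in this thread.

```python
import difflib


def _fmt_range(start: int, stop: int) -> str:
    """Turn a 0-based half-open range into diff's 1-based 'first,last' form."""
    first, last = start + 1, stop
    return str(first) if first == last else f"{first},{last}"


def normal_diff(old_text: str, new_text: str) -> str:
    """Line-based diff of two strings, rendered like GNU diff's normal format."""
    old = old_text.splitlines()
    new = new_text.splitlines()
    matcher = difflib.SequenceMatcher(None, old, new, autojunk=False)
    out = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue
        if tag == "replace":
            out.append(f"{_fmt_range(i1, i2)}c{_fmt_range(j1, j2)}")
        elif tag == "delete":
            out.append(f"{_fmt_range(i1, i2)}d{j1}")
        else:  # tag == "insert"
            out.append(f"{i1}a{_fmt_range(j1, j2)}")
        out.extend(f"< {line}" for line in old[i1:i2])
        if tag == "replace":
            out.append("---")
        out.extend(f"> {line}" for line in new[j1:j2])
    return "\n".join(out)


COMMON_HEAD = (
    "* A user can view / manage / organize test runs/results\n"
    "* A user can specify a location where tests are stored\n"
    "* A user can easily find the testing environment\n"
    "* A user can specify which apps can be tested\n"
    "* Documentation about the new module is available\n"
)
COMMON_TAIL = (
    "Both automated and manual tests were conducted "
    "for the changes of this epic.\n"
)
TEXT_1 = COMMON_HEAD + "* A can start a test run\n" + COMMON_TAIL
TEXT_2 = COMMON_HEAD + "* A user can start a test run\n" + COMMON_TAIL

print(normal_diff(TEXT_1, TEXT_2))
```

In the processor script above, this function could be applied per row in the same way as `my_diff`, passing `row.old_value` and `row.new_value`.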