Textual diff of two columns in workflows

Dear workflow experts,

Is it possible to compute a textual diff of two string columns in workflows? Preferably without Python or other scripting solutions.

My use case is that I need to diff the content of JIRA fields, i.e. a simplified markdown structure. Output in a format similar to what GNU diff or other shell diff tools produce would be fine.

best regards,
Adrian

example:
text 1

"* A user can view / manage / organize test runs/results
* A user can specify a location where tests are stored
* A user can easily find the testing environment
* A user can specify which apps can be tested
* Documentation about the new module is available
* A can start a test run

Both automated and manual tests were conducted for the changes of this epic.




"

text 2

"* A user can view / manage / organize test runs/results
* A user can specify a location where tests are stored
* A user can easily find the testing environment
* A user can specify which apps can be tested
* Documentation about the new module is available
* A user can start a test run

Both automated and manual tests were conducted for the changes of this epic.




"

example output (using GNU diff)

6c6
< * A can start a test run
---
> * A user can start a test run

Hey Adrian,

This is not a solution for your diff problem, but maybe these two Spark functions could give you a first indication of how similar the strings are, without any additional scripting:

https://spark.apache.org/docs/2.3.0/api/sql/#levenshtein

and

https://spark.apache.org/docs/2.3.0/api/sql/#soundex
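
To illustrate what such a similarity measure reports, here is a minimal pure-Python sketch of the Levenshtein edit distance. It is not Spark's implementation, and the function name `levenshtein` is only chosen here for illustration. Applied to the changed line from the example, it tells you how many characters differ, but not where.

```python
def levenshtein(a, b):
    """Count the single-character insertions, deletions and
    substitutions needed to turn string a into string b."""
    # previous[j] holds the distance between the processed prefix of a
    # and the first j characters of b.
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute or keep
            ))
        previous = current
    return previous[-1]


old_line = "* A can start a test run"
new_line = "* A user can start a test run"
print(levenshtein(old_line, new_line))  # 5 (the inserted "user ")
```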

Hi @jakob.schreff

Thanks for the pointers. Unfortunately, they don’t help with my challenge. The data is already prefiltered to show only rows with different values, and I need to determine the exact difference.

With help from @christoph.schober I was able to create a solution with Python. The following is the script used in the Python processor:

import pandas as pd
import difflib as dl

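def my_diff(row):
    # Split both cell values into lines, dropping the line endings
    expected = row.old_value.splitlines(False)
    actual = row.new_value.splitlines(False)

    # unified_diff yields the diff line by line; lineterm='' keeps the
    # header lines free of trailing newlines
    diff = dl.unified_diff(expected, actual, lineterm='')

    return '\n'.join(diff)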

# od_input keys represent name of the input dataset set in OD processor
dataset = od_input['input'].get_as_pandas()

dataset["delta"] = dataset.apply(my_diff, axis=1)

# publish your output - key will be used for assigning the dataset to a specific output of OD processor in the future
# currently, key can be any non-empty string
od_output.add_data("output", dataset)

This is the generated output for the example in the initial post:

---
+++
@@ -3,7 +3,7 @@
* A user can easily find the testing environment
* A user can specify which apps can be tested
* Documentation about the new module is available
-* A can start a test run
+* A user can start a test run

Both automated and manual tests were conducted for the changes of this epic.
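
If output closer to the GNU diff example from the initial post is preferred over the unified format, the following is a rough sketch built on `difflib.SequenceMatcher`. The function names `normal_diff` and `_line_range` are made up for this illustration, and the result is only meant to resemble GNU diff's default output; it is not guaranteed to match it in every case (for instance, the two tools may align repeated lines differently).

```python
import difflib


def _line_range(start, end):
    """Format a 0-based half-open range as 1-based diff line numbers."""
    if end - start == 1:
        return str(start + 1)
    return f"{start + 1},{end}"


def normal_diff(old_text, new_text):
    """Return a diff of two strings in a GNU-diff-like 'normal' format."""
    old = old_text.splitlines()
    new = new_text.splitlines()
    matcher = difflib.SequenceMatcher(None, old, new, autojunk=False)
    out = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue
        if tag == "replace":
            out.append(f"{_line_range(i1, i2)}c{_line_range(j1, j2)}")
        elif tag == "delete":
            # j1 is the line in the new text after which the lines are missing
            out.append(f"{_line_range(i1, i2)}d{j1}")
        elif tag == "insert":
            # i1 is the line in the old text after which the lines are added
            out.append(f"{i1}a{_line_range(j1, j2)}")
        out.extend("< " + line for line in old[i1:i2])
        if tag == "replace":
            out.append("---")
        out.extend("> " + line for line in new[j1:j2])
    return "\n".join(out)


old_value = "first line\n* A can start a test run\nlast line"
new_value = "first line\n* A user can start a test run\nlast line"
print(normal_diff(old_value, new_value))
```

In the processor script, this function could be applied per row in place of `my_diff`, passing `row.old_value` and `row.new_value`.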