Python Forum
Thread Rating:
  • 0 Vote(s) - 0 Average
  • 1
  • 2
  • 3
  • 4
  • 5
Dataframe Comparison
#1
Hi all. I am pretty new to this language. I have a dataframe of 18col x 36k rows containing SKU data. Within this data there are Category and Subcategory columns. Then I have a second dataframe with the valid combinations of Category and Subcategory. What is the best approach for auditing the SKU dataframe to ensure that all Category and Subcategory pairs used are valid according to second dataframe master list? The second dataframe only has about 15 rows of data as it only contains the valid combinations.

Thank you!
Ted
Reply
#2
One possible way it to evaluate specific columns to boolean values and then use that for indices (needed to be inverted in order to get indices of False):

import pandas as pd

products = (('fruit', 'apple', 'Granny Smith'),
            ('fruit', 'pear', 'Red Anjour'),
            ('car', 'sedan', 'Model S'),
            ('fruit', 'tomato', 'Cherry'))


audit = (('fruit', 'apple'),
         ('fruit', 'pear'))

product_df = pd.DataFrame(products)
audit_df = pd.DataFrame(audit)

indices = pd.eval("(product_df[0] in audit_df[0]) & (product_df[1] in audit_df[1])")

print(product_df[~indices])

       0       1        2
2    car   sedan  Model S
3  fruit  tomato   Cherry
I'm not 'in'-sane. Indeed, I am so far 'out' of sane that you appear a tiny blip on the distant coast of sanity. Bucky Katt, Get Fuzzy

Da Bishop: There's a dead bishop on the landing. I don't know who keeps bringing them in here. ....but society is to blame.
Reply
#3
This looks very promising - I very much appreciate your help. Will try to implement this today!
Reply
#4
Having issues with this.

here's what I have. I am getting a whole mess of errors I cannot even interpret. But it seems to start breaking-down with the reference to pd.eval. This is not defined anywhere prior and not sure if it needs to be. In addition once this is working it will return the rows that match the master valid data. What I need is the rows that do NOT match that are invalid - so I will also need to essentially know how to say "not in". thank you!
import pandas as pd
skudata = pd.read_csv("SKU Data.csv")
cat_master = pd.read_csv("Valid Categories & Sub-Categories.csv")

indices = pd.eval("(product_df[0] in audit_df[0]) & (product_df[1] in audit_df[1])")
print(skudata[~indices])
Yoriz write Dec-06-2022, 06:31 PM:
Please post all code, output and errors (in their entirety) between their respective tags. Refer to BBCode help topic on how to post. Use the "Preview Post" button to make sure the code is presented as you expect before hitting the "Post Reply/Thread" button.
Reply
#5
(Dec-05-2022, 01:01 PM)tdbozarth Wrote: Having issues with this.

Provide 5-10 rows of sample data, audit data and your code with which you have problem. Then we can have meaningful discussion about errors.
I'm not 'in'-sane. Indeed, I am so far 'out' of sane that you appear a tiny blip on the distant coast of sanity. Bucky Katt, Get Fuzzy

Da Bishop: There's a dead bishop on the landing. I don't know who keeps bringing them in here. ....but society is to blame.
Reply
#6
Valid Entries example data:

[U_SOP_CATEGORY], [U_SOP_SUBCATEGORY]
Sparkling - All Other Spec Can,
Sparkling - Std Glass,
Inactive,
Non-SOP, Cups and Lids
Non-SOP, Raw Materials

Main SKU file example data:

[ITEM], [DESCR], [LOC], [XFER], [U_SOP_CATEGORY], [U_SUB_CATEGORY] (12 more irrelevant cols follow)
100278, 12Z CN 6P_COKE, C001, TRUE, Sparkling - 12oz Can w/o 35pk,
Reply
#7
Questions regarding sample data:

In valid entries first three records have only U_SOP_CATEGORY and no U_SOP_SUBCATEGORY. Is this correct?

In example data there are 6 column labels but only 5 column values. Is this correct?
I'm not 'in'-sane. Indeed, I am so far 'out' of sane that you appear a tiny blip on the distant coast of sanity. Bucky Katt, Get Fuzzy

Da Bishop: There's a dead bishop on the landing. I don't know who keeps bringing them in here. ....but society is to blame.
Reply
#8
Yes, I put commas at end in an attempt to show blank entries in last cols. In other words a valid value for Category is a blank and same for subcategory.
Reply


Forum Jump:

User Panel Messages

Announcements
Announcement #1 8/1/2020
Announcement #2 8/2/2020
Announcement #3 8/6/2020