How unique is each row based on 3-4 columns? - python

I am going to merge two datasets soon by 3 columns.
The hope is that there are no (or few) repeated 3-column groups in the original dataset. I would like to produce something that says approximately how unique each row is: maybe some kind of frequency plot (which might not work, as I have a very large dataset), or a table that displays the average frequency for each 0.5 million rows, or something like that.
Is there a way to determine how unique each row is compared to the other rows?
1 2 3
A 100 B
A 200 B
A 300 B
Like for the above data frame, I would like to say that each row is unique
1 2 3
A 200 B
A 200 B
A 100 B
For this data set, rows 1 and 2 are not unique. I don't want to drop one, but I am hoping to quantify/weigh the number of non-unique rows.
The problem is that my dataframe is 14,000,000 rows long, so I need a way to show how unique each row is on a set this big.

Assuming you are using pandas, here's one possible way:
import pandas as pd

# Setup, which you can probably skip since you already have the data.
cols = ["1", "2", "3"]
rows = [
    ["A", 200, "B"],
    ["A", 200, "B"],
    ["A", 100, "B"],
]
df1 = pd.DataFrame(rows, columns=cols)

# Get the key column names before adding a new column.
key_columns = df1.columns.values.tolist()

# Add a helper column of ones.
df1["line"] = 1

# Set the new column to the cumulative sum of the ones within each key group,
# i.e. a running count of how often that key combination has appeared so far.
df1["match_count"] = df1.groupby(key_columns)["line"].cumsum()

# Drop the helper column.
df1.drop("line", axis=1, inplace=True)

# Print the results.
print(df1)
Output -
   1    2  3  match_count
0  A  200  B            1
1  A  200  B            2
2  A  100  B            1
Return one row per key combination (the first occurrence of each):
# We only want rows where the running count is less than 2, i.e. the first
# occurrence of each key combination. Because we saved the key columns, we
# can return just those and leave out 'match_count'.
df_unique = df1.loc[df1["match_count"] < 2, key_columns]
print(df_unique)
Output -
   1    2  3
0  A  200  B
2  A  100  B
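For the 14-million-row case described in the question, a row-by-row plot is unlikely to be practical; a per-key frequency summary is usually enough. Below is a minimal sketch, reusing df1 and key_columns from the answer above (substitute your own dataframe and merge keys):

# Size of each key group: one entry per distinct key combination.
key_counts = df1.groupby(key_columns).size()

# How many key combinations occur once, twice, three times, ...
print(key_counts.value_counts().sort_index())

# Per-row flag: does this row share its key combination with another row?
non_unique = df1.duplicated(subset=key_columns, keep=False)
print("share of non-unique rows:", non_unique.mean())

Both groupby(...).size() and duplicated(...) are vectorised, so they stay workable at that scale as long as the key columns fit in memory.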

Keep unique values with only 1 instance

I have the following dataset:
Col_A Amounts
0 A 100
1 B 200
2 C 500
3 D 100
4 E 500
5 F 300
The output I am trying to achieve is basically to remove all rows whose value in the "Amounts" column is duplicated, and to keep only the rows where a value occurs exactly once.
Desired Output:
Col_A Amounts
1 B 200
5 F 300
I have tried to use the following with no luck:
df_1.drop_duplicates(subset=['Amounts'])
This removes the duplicates; however, it still keeps the first occurrence of each value that appears more than once.
Using the pandas .unique function also gives a similarly undesired output.
You are close; you need keep=False to remove all duplicates in the Amounts column:
print (df.drop_duplicates(subset=['Amounts'], keep=False))
Col_A Amounts
1 B 200
5 F 300
Less straightforward than the previous answer, but if you want to be able to keep rows based on how often they appear, you could use value_counts() as a mask and keep only the rows whose value appears exactly / at least / fewer than n times:
import pandas as pd

data = {
    "Col_1": ["A", "B", "C", "D", "E", "F"],
    "Amounts": [100, 200, 500, 100, 500, 300],
}
df = pd.DataFrame(data)

n = 1
# value_counts() gives the number of occurrences of each amount; we keep the
# rows whose amount occurs at most n times (i.e. fewer than n + 1 times).
mask = df.Amounts.value_counts()
df[df.Amounts.isin(mask.index[mask.lt(n + 1)])]
outputs:
Col_1 Amounts
1 B 200
5 F 300
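If you instead want the rows whose amount appears exactly n times, or at least n times, the same counts can be filtered with .eq() or .ge() rather than .lt(). A small sketch, reusing mask and n from the snippet above:

# Keep rows whose amount occurs exactly n times.
df[df.Amounts.isin(mask.index[mask.eq(n)])]

# Keep rows whose amount occurs at least n times.
df[df.Amounts.isin(mask.index[mask.ge(n)])]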

Using Numpy to filter two dataframes

I have two data frames. They are structured like this:
df a

Letter  ID
A       3
B       4

df b

Letter  ID  Value
A       3   100
B       4   300
B       4   100
B       4   150
A       3   200
A       3   400
For each combination of Letter and ID in df a, I need to take the corresponding Values from df b and run an outlier function on them.
Currently df a has over 40,000 rows and df b has about 4,500,000 rows.
a['Results'] = a.apply(lambda x: outliers(b[(b['Letter'] == x['Letter']) & (b['ID'] == x['ID'])]['Value'].to_list()), axis=1)
As you can imagine this is taking forever. Is there some mistake I'm making, or something that could improve this code?
I'd first aggregate every combination of [Letter, ID] in df_b into a list using .groupby, then merge with df_a and apply your outliers function afterwards. Should be faster:
df_a["results"] = df_a.merge(
df_b.groupby(["Letter", "ID"])["Value"].agg(list),
left_on=["Letter", "ID"],
right_index=True,
how="left",
)["Value"].apply(outliers)
print(df_a)
You can also first merge datasets a and b, then group by Letter and ID and aggregate Value with the outlier function.
pd.merge(a, b, how="inner", on=["Letter", "ID"]).groupby(["Letter", "ID"]).agg(outliers).reset_index()
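For reference, here is a minimal runnable sketch of the groupby-then-merge approach. The question never defines outliers, so a simple IQR-based placeholder is assumed below purely for illustration:

import pandas as pd
import numpy as np

df_a = pd.DataFrame({"Letter": ["A", "B"], "ID": [3, 4]})
df_b = pd.DataFrame({
    "Letter": ["A", "B", "B", "B", "A", "A"],
    "ID": [3, 4, 4, 4, 3, 3],
    "Value": [100, 300, 100, 150, 200, 400],
})

def outliers(values):
    # Hypothetical placeholder: values outside 1.5 * IQR of the group.
    arr = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(arr, [25, 75])
    iqr = q3 - q1
    return arr[(arr < q1 - 1.5 * iqr) | (arr > q3 + 1.5 * iqr)].tolist()

# Aggregate df_b's Values into one list per (Letter, ID), merge onto df_a,
# then apply the outlier function once per row of df_a.
df_a["results"] = df_a.merge(
    df_b.groupby(["Letter", "ID"])["Value"].agg(list),
    left_on=["Letter", "ID"],
    right_index=True,
    how="left",
)["Value"].apply(outliers)

print(df_a)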

How can I compare part of the string of a cell on a row with another string on the same row and swap them places if they meet my conditions?

I have the following Data Frame:
import pandas as pd
df = {
    "Country": ["A", "A", "B", "B", "B"],
    "MY_Product": ["NS_1", "SY_1", "BMX_3", "NS_5", "NK"],
    "Cost": [5, 35, 34, 45, 9],
    "Competidor_Country_2": ["A", "A", "B", "B", "B"],
    "Competidor_Product_2": ["BMX_2", "TM_0", "NS_6", "SY_8", "NA"],
    "Competidor_Cost_2": [35, 20, 65, 67, 90],
}
df_new = pd.DataFrame(
    df,
    columns=[
        "Country", "MY_Product", "Cost",
        "Competidor_Country_2", "Competidor_Product_2", "Competidor_Cost_2",
    ],
)
print(df_new)
Information:
My products must start with "NS", "SY", "NK" or "NA";
The first three columns contain information about my products, and the last three contain the competitor's product;
I did not include every case, to keep the example simple.
Problem:
As you can see in the third row, there is a product that is not mine ("BMX_3") while the competitor's product is one of mine. So I would like to swap not only the product but the other competitor columns too, leaving the first three columns with my product and the last three with the competitor's.
Considerations:
If the two products in the row are both mine (the last row, for example), I don't need to do anything (but if possible, leave a code comment where that check could be removed, just in case).
If I understand you right, you want to swap values of the 3 columns if the product in MY_Product isn't yours:
# create a mask: True where MY_Product does not start with one of my prefixes
mask = ~df_new.MY_Product.str.contains(r"^(?:NS|SY|NK|NA)")

# swap the values of the two groups of three columns:
vals = df_new.loc[mask, ["Country", "MY_Product", "Cost"]].values
df_new.loc[mask, ["Country", "MY_Product", "Cost"]] = df_new.loc[
    mask, ["Competidor_Country_2", "Competidor_Product_2", "Competidor_Cost_2"]
].values
df_new.loc[
    mask, ["Competidor_Country_2", "Competidor_Product_2", "Competidor_Cost_2"]
] = vals

# print the dataframe
print(df_new)
Prints:
Country MY_Product Cost Competidor_Country_2 Competidor_Product_2 Competidor_Cost_2
0 A NS_1 5 A BMX_2 35
1 A SY_1 35 A TM_0 20
2 B NS_6 65 B BMX_3 34
3 B NS_5 45 B SY_8 67
4 B NK 9 B NA 90
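On the consideration about rows where both products are already mine (the last row): the mask above only selects rows whose MY_Product does not start with one of your prefixes, so those rows are never touched. If you still want an explicit, easy-to-delete check, here is one possible sketch (my_prefixes and both_mine are illustrative names, not part of the original answer):

my_prefixes = r"^(?:NS|SY|NK|NA)"

# True where both MY_Product and the competitor's product are mine.
both_mine = (
    df_new.MY_Product.str.contains(my_prefixes)
    & df_new.Competidor_Product_2.str.contains(my_prefixes)
)

# The swap mask. Appending "& ~both_mine" changes nothing here, because rows
# whose MY_Product is mine are already excluded; delete this comment and the
# extra condition whenever you no longer need the reminder.
mask = ~df_new.MY_Product.str.contains(my_prefixes)  # & ~both_mine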

Adding elements to an empty dataframe in pandas

I am new to Python and have a basic question. I have an empty dataframe Resulttable with columns A, B and C, which I want to keep filling with the results of some calculations that I run in a loop with index n. For example, I want to store the value 12 in the nth row of column A, 35 in the nth row of column B, and so on for the whole range of n.
I have tried something like
Resulttable['A'].iloc[n] = 12
Resulttable['B'].iloc[n] = 35
I get the error "single positional indexer is out-of-bounds" for the first value of n, n = 0.
How do I resolve this? Thanks!
You can first create an empty pandas dataframe and then append rows one by one as you calculate. In your range you need to specify one above the highest value you want, i.e. range(0, 13) if you want to iterate over 0-12.
import pandas as pd

df = pd.DataFrame([], columns=["A", "B", "C"])

for i in range(0, 13):
    x = i**1
    y = i**2
    z = i**3
    df_tmp = pd.DataFrame([(x, y, z)], columns=["A", "B", "C"])
    df = df.append(df_tmp)

df = df.reset_index()
This will result in a DataFrame as follows:
df.head()
   index  A   B   C
0      0  0   0   0
1      0  1   1   1
2      0  2   4   8
3      0  3   9  27
4      0  4  16  64
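Note that DataFrame.append, used above, was deprecated in pandas 1.4 and removed in pandas 2.0, so the loop will fail on recent versions. A sketch of the same idea that collects the rows in a plain list and builds the dataframe once at the end (also faster, and in line with the list-based approach in the answer below):

import pandas as pd

rows = []
for i in range(0, 13):
    # Collect each calculated row as a dict instead of appending to the frame.
    rows.append({"A": i**1, "B": i**2, "C": i**3})

df = pd.DataFrame(rows, columns=["A", "B", "C"])
print(df.head())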
There is no way of filling an empty dataframe like that. Since there are no entries in your dataframe something like
Resulttable['A'].iloc[n]
will always result in the IndexError you described.
Instead of trying to fill the dataframe like that, it is better to store the results from your loop in a list, which you could call result_list. Then you can create a dataframe from your list like this:
Resulttable = pd.DataFrame({"A": result_list})
If you've got another list of results you want to store in another column of your dataframe, say result_list2, you can create your dataframe like this:
Resulttable = pd.DataFrame({"A": result_list, "B": result_list2})
If Resulttable has already been created, you can add column B like this:
Resulttable["B"] = result_list2
I hope I could help you.

pandas count number of filled cells within row

I have a large dataset with columns labelled 1-65 (among other titled columns), and want to find how many of those columns, per row, have a string (of any value) in them. For example, if all of columns 1-65 are filled in a particular row, the count should be 65 for that row; if only 10 are filled, the count should be 10.
Is there any easy way to do this? I'm currently using the following code, which is taking very long as there are a large number of rows.
import pandas as pd

array = pd.read_csv(csvlocation, encoding="ISO-8859-1")
for i in range(0, lengthofarray):
    for k in range(1, 66):
        if array[k][i] != "":
            array["count"][i] = array["count"][i] + 1
From my understanding of the post and the subsequent comments, you are interested in knowing the number of strings in each row for the columns labelled 1 through 65. There are two steps: the first is to subset your data down to columns 1 through 65, and the second is to count the number of strings in each row. To do this:
import pandas as pd
import numpy as np
# create sample data
df = pd.DataFrame({'col1': list('abdecde'),
                   'col2': np.random.rand(7)})
# change one val of column two to string for illustration purposes
df.loc[3, 'col2'] = 'b'
# to create the subset of columns, you could use
# subset = [str(num) for num in list(range(1, 66))]
# and then just use df[subset]
# for each row, count the number of columns that have a string value
# applymap operates elementwise, so we are essentially creating
# a new representation of your data in place, where a 1 represents a
# string value was there, and a 0 represent not a string.
# we then sum along the rows to get the final counts
col_str_counts = np.sum(df.applymap(lambda x: 1 if isinstance(x, str) else 0), axis=1)
# we changed the column two value above, so to check that the count is 2 for that row idx:
col_str_counts[3]
>>> 2
# and for the subset, it would simply become:
# col_str_counts = np.sum(df[subset].applymap(lambda x: 1 if isinstance(x, str) else 0), axis=1)
You should be able to adapt your problem to this example
Say we have this dataframe
df = pd.DataFrame([["","foo","bar"],["","","bar"],["","",""],["foo","bar","bar"]])
     0    1    2
0       foo  bar
1            bar
2
3  foo  bar  bar
Then we create a boolean mask where a cell != "" and sum those values
df['count'] = (df != "").sum(1)
print(df)
     0    1    2  count
0       foo  bar      2
1            bar      1
2                     0
3  foo  bar  bar      3
import pandas as pd

df = pd.DataFrame([["", "foo", "bar"], ["", "", "bar"], ["", "", ""], ["foo", "bar", "bar"]])
total_cells = df.size
df['filled_cell_count'] = (df != "").sum(1)
print(f"{df}")
     0    1    2  filled_cell_count
0       foo  bar                  2
1            bar                  1
2                                 0
3  foo  bar  bar                  3
fraction_filled = df['filled_cell_count'].sum() / total_cells
print()
print(f"Fraction of filled cells in dataframe: {fraction_filled}")
Fraction of filled cells in dataframe: 0.5
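One caveat for all of the above: when pandas reads a CSV, empty cells usually come back as NaN rather than empty strings, and NaN != "" evaluates to True, so the (df != "") mask would count those cells as filled. A small sketch of a NaN-based per-row count (the column subset shown in the comment is an assumption based on the question's 1-65 labels):

import pandas as pd
import numpy as np

df = pd.DataFrame({"1": ["a", np.nan, "c"], "2": [np.nan, np.nan, "x"]})

# Count non-missing cells per row.  On the real data you could restrict this
# to the labelled columns, e.g. df[[str(n) for n in range(1, 66)]].count(axis=1).
df["count"] = df.count(axis=1)
print(df)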
