remove duplicate word from pandas column - python

I have a dataframe with information like the below stored in one column
>>> Results.Category[:5]
0 issue delivery wrong master account
1 data wrong master account batch
2 order delivery wrong data account
3 issue delivery wrong master account
4 delivery wrong master account batch
Name: Category, dtype: object
Now I want to keep only the unique words in the Category column.
For example:
In the first row the word "wrong" is present, so I want to remove it from all the rest of the rows and keep the word "wrong" in the first row only.
In the second row the word "data" is present, so I want to remove it from all the rest of the rows and keep the word "data" in the second row only.
I found that duplicates within a row can be removed as below, but I need to remove the duplicate words across the whole column. Can anyone please help me here?
AFResults['FinalCategoryN'] = AFResults['FinalCategory'].apply(lambda x: remove_dup(x))
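For reference, remove_dup is not shown in the question; presumably it is a per-row word deduplicator along these lines (a hypothetical sketch, not the asker's actual helper):
def remove_dup(text):
    # Keep only the first occurrence of each word within a single string (hypothetical stand-in).
    seen = set()
    return ' '.join(w for w in text.split() if not (w in seen or seen.add(w)))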

It seems you want something like,
out = []
seen = set()
for c in df['Category']:
    words = c.split()
    out.append(' '.join([w for w in words if w not in seen]))
    seen.update(words)

df['FinalCategoryN'] = out
df
Category FinalCategoryN
0 issue delivery wrong master account issue delivery wrong master account
1 data wrong master account batch data batch
2 order delivery wrong data account order
3 issue delivery wrong master account
4 delivery wrong master account batch
If you don't care about the ordering, you can use set logic:
u = df['Category'].apply(str.split)
v = u.shift().map(lambda x: [] if x != x else x).cumsum().map(set)
(u.map(set) - v).str.join(' ')
0 account delivery issue master wrong
1 batch data
2 order
3
4
Name: Category, dtype: object

In your case you need to split it first, then remove the duplicates with drop_duplicates (here the column is named c):
df.c.str.split(expand=True).stack().drop_duplicates().\
    groupby(level=0).apply(','.join).reindex(df.index)
Out[206]:
0 issue,delivery,wrong,master,account
1 data,batch
2 order
3 NaN
4 NaN
dtype: object
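If you would rather have empty strings than NaN for the fully-duplicated rows, the same chain can be capped with fillna (a small optional addition, not part of the original answer):
df['FinalCategoryN'] = (df.c.str.split(expand=True).stack().drop_duplicates()
                        .groupby(level=0).apply(','.join).reindex(df.index).fillna(''))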

What you want cannot be vectorized, so let us just forget about pandas and use a Python set:
total = set()
result = []
for line in AFResults['FinalCategory']:
    line = set(line.split()).difference(total)
    total = total.union(line)
    result.append(' '.join(line))
You get that list: ['wrong issue master delivery account', 'batch data', 'order', '', '']
You can use it to populate a dataframe column:
AFResults['FinalCategoryN'] = result

Use apply with sorted, set, str.join and str.index:
AFResults['FinalCategoryN'] = AFResults['FinalCategory'].apply(lambda x: ' '.join(sorted(set(x.split()), key=x.index)))

Related

Python, Lists, Removing Rows

I have a list of items (each row has the following: item number, lot number, description, total quantity). If a certain lot number in my list exists twice, I add the quantities of both those rows together. "data" is my original list. "max_item" is the max number of times an item occurs in "data". I created a new list (one_lot_per_row_list) and have appended my updated rows to it, but I also need to add the rows from "data" that did not have duplicate lots. Or I need to remove the row that was not updated from "data" (data[i+1+j]) in my code below. Not sure if the best way to approach this is to create a new list or to remove rows from my original. Hopefully this makes sense! All help very appreciated!
Example list below -- The final 2 rows have the same Internal Lot number. I would like to add their Total Available quantities together, and then remove the row that was not updated.
Part Internal Lot Number Description Total available Expiration Date Location
0001 QLN03867 P 2 3/31/2025 FRZ06 Half 1
0002 QLN03923 A 15 4/30/2023 F01-S01-05
0002 QLN03469 A 3 9/30/2022 F01-S03-02
0003 QLN03924 G 15 9/30/2022 F01-S01-05
0003 QLN03470 G 2 9/30/2022 F01-S01-02
0004 QLN03466 U 4 10/31/2022 F01-S03-02
0005 QLN03925 C 10 4/30/2023 F01-S01-02
0005 QLN03471 C 2 9/30/2022 F01-S01-02
0006 QLN03468 R 5 7/31/2021 F01-S03-02
0007 QLN03994 I 2 4/13/2025 F01-S03-03
0007 QLN03994 I 1 4/13/2025 F01-S03-02
data = []
for row in csv_reader:
    azpn = row[0]
    azln = row[1]
    description = row[2]
    location = row[5]
    date = datetime.strptime(row[4], '%m/%d/%Y')
    total_available = int(row[3])
    data.append([azpn, azln, description, total_available, date, location])

one_lot_per_row_list = []
i = 0
j = 1
for i in range(len(data) - max_item):
    # if the lot number of row i is equal to the lot number of row i + j
    for j in range(max_item):
        if data[i][1] == data[i+1+j][1]:
            # add total available of data[i] to row data[i+1+j]
            data[i][3] += data[i+1+j][3]
            # append the new row to one_lot_per_row_list
            one_lot_per_row_list.append(data[i])
        j += 1
    i += 1
You could pursue your approach or go for a more elegant method as follows:
Sort by lot number
Group by lot number
Use the reduce function to merge the items in each group.
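A minimal sketch of that approach, assuming data is the list of rows built in the question's loop ([part, lot, description, total_available, date, location]):
from functools import reduce
from itertools import groupby
from operator import itemgetter

def merge(a, b):
    # Combine two rows that share a lot number by adding their Total available values
    merged = list(a)
    merged[3] += b[3]
    return merged

data.sort(key=itemgetter(1))                              # 1. sort by lot number
one_lot_per_row_list = [
    reduce(merge, group)                                  # 3. merge the items in each group
    for _, group in groupby(data, key=itemgetter(1))      # 2. group by lot number
]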
IIUC, you can do this very easily via pandas. The alternative is itertools groupby:
Here's one way via pandas:
groupby lot number and transform the column Total.
drop the duplicates based on subset = ['Internal Lot Number', 'Total']
Finally, save the CSV file via to_csv.
import pandas as pd
df = pd.read_csv('your csv file path here')
df.assign(Total=df.groupby('Internal Lot Number')['Total'].transform('sum')).drop_duplicates(
    ['Internal Lot Number', 'Total']).to_csv('output csv file path here')
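One caveat: in the sample data the quantity column is labelled Total available rather than Total, so the column name may need adjusting. A hedged variant assuming that header:
out = df.assign(**{'Total available': df.groupby('Internal Lot Number')['Total available'].transform('sum')})
out = out.drop_duplicates(['Internal Lot Number', 'Total available'])
out.to_csv('output csv file path here', index=False)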

How to calculate the number of occurrences between data in excel?

I have a huge CSV table with thousands of rows of data. I want to make a table of the number of occurrences of two elements together, divided by how many times that element is present.
For example, Bitcoin appears 8 times in these rows, 2 of them together with API, so for the relation between Bitcoin and API: API always exists with Bitcoin, so the value of API appearing with Bitcoin is 1, while Bitcoin appearing with API is 2/8 = 1/4.
I want something that looks like this in the end.
How can I do it with Python or any other tool?
This is a sample of the file.
This, I think, does do the job. I typed your spreadsheet into a csv by hand (would have been nice to be able to cut and paste), and the results seem reasonable.
import itertools
import csv
import numpy as np

words = {}
for row in open('input.csv'):
    parts = row.rstrip().split(',')
    for a, b in itertools.combinations(parts, 2):
        if a not in words:
            words[a] = [b]
        else:
            words[a].append(b)
        if b not in words:
            words[b] = [a]
        else:
            words[b].append(a)
print(words)

size = len(words)
keys = list(words.keys())
track = np.zeros((size, size))
for i, k in enumerate(keys):
    track[i, i] = len(words[k])
    for j in words[k]:
        track[i, keys.index(j)] += 1
        track[keys.index(j), i] += 1
print(keys)

# Scale to [0,1].
for row in range(track.shape[0]):
    track[row, :] /= track[row, row]

# Create a csv with the results.
fout = open('corresp.csv', 'w')
print(','.join([' '] + keys), file=fout)
for row in range(track.shape[0]):
    print(keys[row], file=fout, end=',')
    print(','.join(f"{track[row,i]}" for i in range(track.shape[1])), file=fout)
Here's the first few lines of the result:
,API,Backend Development,Bitcoin,Docker,Article Rewriting,Article writing,Blockchain,Content Writing,Ghostwriting,Android,Ethereum,PHP,React.js,C Programming,C++ Programming,ASIC,Digital ASIC Coding,Embedded Software,Article Writing,Blog,Copy Typing,Affiliate Marketing,Brand Marketing,Bulk Marketing,Sales,BlockChain,Business Strategy,Non-fungible Tokens,Technical Writing,.NET,Arduino,Software Architecture,Bluetooth Low Energy (BLE),C# Programming,Ada programming,Programming,Haskell,Rust,Algorithm,Java,Mathematics,Machine Learning (ML),Matlab and Mathematica,Data Entry,HTML,Circuit Designs,Embedded Systems,Electronics,Microcontroller, C++ Programming,Python
API,1.0,0.14285714285714285,0.5714285714285714,0.14285714285714285,0.0,0.0,0.2857142857142857,0.0,0.0,0.0,0.14285714285714285,0.0,0.14285714285714285,0.2857142857142857,0.2857142857142857,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Backend Development,0.6666666666666666,1.0,0.6666666666666666,0.6666666666666666,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Bitcoin,0.21052631578947367,0.05263157894736842,1.0,0.05263157894736842,0.0,0.0,0.2631578947368421,0.0,0.0,0.05263157894736842,0.10526315789473684,0.10526315789473684,0.05263157894736842,0.15789473684210525,0.21052631578947367,0.05263157894736842,0.05263157894736842,0.05263157894736842,0.0,0.0,0.0,0.05263157894736842,0.05263157894736842,0.05263157894736842,0.05263157894736842,0.05263157894736842,0.05263157894736842,0.05263157894736842,0.05263157894736842,0.0,0.0,0.05263157894736842,0.0,0.0,0.0,0.0,0.05263157894736842,0.05263157894736842,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Docker,0.6666666666666666,0.6666666666666666,0.6666666666666666,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
I had a look at this by creating a pivot table in Excel for every combination of columns there is: AB, AC, AD, BC, BD, CD. Putting the unique entries from the first column (e.g. A) in the rows and the unique entries from the second (e.g. B) in the columns, and then putting column A in the values area, I find all matches and the count of all matches.
This is a clunky method, but I note from the Python-based method that has been submitted that my answer is essentially no more or less clunky than that!

Search for column in pandas

How do you search if a value exists in a specific column?
Example I have this file which contains the following:
ID Name
1 Mark
2 John
3 Mary
The user will input 1 and it will
print("the value already exist.")
But if the user inputs 4, it will add a new row containing 4 and
name = input('Name')
and update the file like this
ID Name
1 Mark
2 John
3 Mary
4 (userinput)
An easy approach would be:
import pandas as pd

bool_val = False
for i in range(0, df.shape[0]):
    if str(df.iloc[i]['ID']) == str(input_str):
        bool_val = False
        break
    else:
        print("there")
        bool_val = True

if bool_val == True:
    df = df.append(pd.Series([input_str, name], index=['ID', 'Name']), ignore_index=True)
Remember to add the parameter ignore_index to avoid TypeError. I added a bool value to avoid appending a row multiple times.
searchid = 20  # use sys.argv[1] if needed to be passed as argument to the program, or read it as raw_input
if str(searchid) in df.index.astype(str):
    print("ID found")
else:
    name = raw_input("ID not found. Specify the name for this ID to update the data:")  # use input() if python version >= 3
    df.loc[searchid] = [str(name)]
If ID is not index:
if str(searchid) in df.ID.values.astype(str):
    print("ID found")
else:
    name = raw_input("ID not found. Specify the name for this ID to update the data:")  # use input() if python version >= 3
    df.loc[searchid] = [str(searchid), str(name)]
Specifying the column headers to update during the df update might avoid mismatch errors:
df.loc[searchid]={'ID': str(searchid), 'Name': str(name)}
This should help
Also read https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html, which mentions the inherent nature of append and concat to copy the full dataframe.
df.loc[some_id] will return the row for that ID, assuming the IDs are the index values of the df you are referring to.
If you have a list of IDs and wish to search for them all together then:
assuming:
listofids=['ID1','ID2','ID3']
df.loc[listofids]
will yield the rows containing the above IDs
If the IDs are not in the index:
Assuming df['ids'] contains the given ID list:
'searchedID' in df.ids.values
will return True or False based on presence or absence
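Putting it together, a minimal sketch of the check-then-append flow using isin and concat (df.append is deprecated in recent pandas; the file name and prompts below are assumptions, not from the question):
import pandas as pd

df = pd.read_csv('ids.csv')                 # hypothetical file with columns ID, Name

user_id = int(input('ID: '))
if df['ID'].isin([user_id]).any():          # membership test on the ID column
    print("the value already exist.")
else:
    name = input('Name: ')
    new_row = pd.DataFrame([{'ID': user_id, 'Name': name}])
    df = pd.concat([df, new_row], ignore_index=True)
    df.to_csv('ids.csv', index=False)       # write the updated table back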

List to dataframe, list to multiple lists, single column to dataframe

Still figuring out programming, help is appreciated! I have a single column of information that I would ultimately like to turn into a dataframe. I could transpose it, but the address information varies; it is either 2 lines or 3 lines (some have suite numbers etc.).
It generally looks like this.
name x,
ID 1,
123-xyz,
ID 2,
abcdefg,
ACTIVITY,
ggg,
TYPE,
C,
COUNTY,
orange county,
ADDRESS,
123 stack st,
city state zip,
PHONE,
111-111-1111,
EXPIRES,
date,
name y,
ID 1,
456-abc,
ID 2,
cvbnmnb,
ACTIVITY,
ggg,
TYPE,
A,
COUNTY,
dakota county,
ADDRESS,
234 overflow st,
lot a,
city state zip,
PHONE,
000-000-0000,
EXPIRES,
date,
name z,
...,
I was thinking of creating new lists for all desired columns and conditionally appending values with a for loop.
for i in list
    if value = ID
        append previous value to name list
        append next value to ID list
    elif value = phone
        send next value to phone
    elif value = address
        evaluate 3 rows down
        if value = phone
            concatenate previous two values and append to address list
        if value != phone
            concatenate current and previous 2 values and append to address list
    else print error message
Would this be a decently efficient option for lists of around ~20,000 values?
I don't really know how to write this, I am using python in a jupyter notebook. Looking for solutions but also looking to learn more!
-EDIT-
A user had suggested a while loop, and the original data sample I gave was simplified and contained 4 fields. My actual set contained 9, and I tried playing around but unfortunately wasn't able to figure it out on my own.
count = 0  # Pointer to start of a cluster
lengthdf = len(df)  # Getting the length of the existing dataframe to use it as the terminating condition
while count != lengthdf:
    name = id1 = id2 = activity = type = county = address = phone = expires = ""  # Reset the fields for every cluster of information
    name = df[0][count]      # Name is always the first line of cluster
    id1 = df[0][count+2]     # id is always third line of cluster
    id2 = df[0][count+4]
    activity = df[0][count+6]
    type = df[0][count+8]
    county = df[0][count+10]
    n = 11
    while df[0][count+n] != "Phone":  # While row is not 'PHONE', everything else in between is the address, appended and separated by comma.
        address = address + df[0][count+n] + ", "
        n += 1
    phone = df[0][count+n+1]  # Phone number is always the row after 'PHONE', and is only of 1 line.
    expires = df[0][count+n+3]
    n += 2
    newdf = newdf.append({'NAME': name, 'ID 1': id1, 'ID 2': id2, 'ACTIVITY': activity, 'TYPE': type, 'COUNTY': county, 'ADDRESS': address, 'Phone': phone, 'Expires': expires}, ignore_index=True)  # Append the data into the new dataframe
    count = count + n
You seem to have a brief understanding of what you need to do judging by the pseudocode you provided!
I'm assuming that your xlsx file looks something like this without the commas.
Based on your sample data, this is what I can come with for you. I'll be referencing each user data as a 'cluster'.
This code works under a few assumptions:
The PHONE field always only have 1 line of data
There is complete data for all cluster (or if there is missing data, a blank exists on the next row).
Data is always in this particular order (i.e. name, ID, address, Phone)
count will be like a pointer to the start of a cluster, while n will be the offset from count. Read the comments for the explanations.
import pandas as pd

df = pd.read_excel(r'test.xlsx', header=None)  # Import xlsx file
newdf = pd.DataFrame(columns=['name', 'id', 'address', 'phone'])  # Creating blank dataframe

count = 0  # Pointer to start of a cluster
lengthdf = len(df)  # Getting the length of the existing dataframe to use it as the terminating condition
while count != lengthdf:
    this_add = this_name = this_id = this_phone = ""  # Reset the fields for every cluster of information
    this_name = df[0][count]  # Name is always the first line of cluster
    this_id = df[0][count+2]  # id is always third line of cluster
    n = 4
    while df[0][count+n] != "PHONE":  # While row is not 'PHONE', everything else in between is the address, appended and separated by comma.
        this_add = this_add + df[0][count+n] + ", "
        n += 1
    this_phone = df[0][count+n+1]  # Phone number is always the row after 'PHONE', and is only of 1 line.
    n += 2
    newdf = newdf.append({'name': this_name, 'id': this_id, 'address': this_add, 'phone': this_phone}, ignore_index=True)  # Append the data into the new dataframe
    count = count + n
As for performance, I honestly do not think there is much optimisation that can be done given the nature of the dataset (I might be wrong). You may have realised my solution is pretty "hard-coded" to reduce the need for if-else statements, but 20,000 lines should not be too huge a problem for Jupyter Notebook. It may take a couple of minutes, but that should be alright.
I hope this gets you started on tackling other scenarios you may encounter with the remaining datasets!

Most efficient method to modify values within large dataframes - Python

Overview: I am working with pandas dataframes of census information; while they only have two columns, they are several hundred thousand rows in length. One column is a census block ID number and the other is a 'place' value, which is unique to the city in which that census block ID resides.
Example Data:
BLOCKID PLACEFP
0 60014001001000 53000
1 60014001001001 53000
...
5844 60014099004021 53000
5845 60014100001000
5846 60014100001001
5847 60014100001002 53000
Problem: As shown above, there are several place values that are blank, even though they have a census block ID in their corresponding row. What I found was that in several instances, a census block ID that is missing a place value is located within the same city as the surrounding blocks that do have one, especially if the bookending place values are the same. As shown above with indexes 5844 through 5847, those two blank blocks are located within the same general area as the surrounding blocks, but just seem to be missing the place value.
Goal: I want to be able to go through this dataframe, find these instances and fill in the missing place value, based on the place value before the missing value and the place value that immediately follows.
Current State & Obstacle: I wrote a loop that goes through the dataframe to correct these issues, shown below.
current_state_blockid_df = pandas.DataFrame({
    'BLOCKID': [60014099004021, 60014100001000, 60014100001001, 60014100001002, 60014301012019, 60014301013000, 60014301013001, 60014301013002, 60014301013003, 60014301013004, 60014301013005, 60014301013006],
    'PLACEFP': [53000, '', '', 53000, 11964, '', '', '', '', '', '', 11964]})
for i in current_state_blockid_df.index:
    if current_state_blockid_df.loc[i, 'PLACEFP'] == '':
        # Get value before blank
        prior_place_fp = current_state_blockid_df.loc[i - 1, 'PLACEFP']
        next_place_fp = ''
        _n = 1
        # Find the end of the blank section
        while next_place_fp == '':
            next_place_fp = current_state_blockid_df.loc[i + _n, 'PLACEFP']
            if next_place_fp == '':
                _n += 1
        # if the blanks could likely be in the same city, assign them the city's place value
        if prior_place_fp == next_place_fp:
            for _i in range(_n):
                current_state_blockid_df.loc[i + _i, 'PLACEFP'] = prior_place_fp
However, as expected, it is very slow when dealing with hundreds of thousands of rows of data. I have considered using a ThreadPool executor to split up the work, but I haven't quite figured out the logic I'd use to get that done. One possibility to speed it up slightly is to eliminate the check for where the gap ends and instead just fill it in with whatever the previous place value was before the blanks. While that may end up being my go-to, there's still a chance it's too slow, and ideally I'd like it to only fill in if the before and after values match, eliminating the possibility of a block being mistakenly assigned. If someone has another suggestion as to how this could be achieved quickly, it would be very much appreciated.
You can use shift to help speed up the process. However, this doesn't solve for cases where there are multiple blanks in a row (a forward/back-fill sketch for that case follows the code below).
df['PLACEFP_PRIOR'] = df['PLACEFP'].shift(1)
df['PLACEFP_SUBS'] = df['PLACEFP'].shift(-1)
criteria1 = df['PLACEFP'].isnull()
criteria2 = df['PLACEFP_PRIOR'] == df['PLACEFP_SUBS']
df.loc[criteria1 & criteria2, 'PLACEFP'] = df.loc[criteria1 & criteria2, 'PLACEFP_PRIOR']
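For runs of consecutive blanks, a vectorized alternative (a sketch of the idea, not taken from this answer) is to forward- and back-fill the column and only accept rows where the two fills agree, mirroring the original before/after check:
import numpy as np

place = df['PLACEFP'].replace('', np.nan)  # treat empty strings as missing as well

prior = place.ffill()                      # nearest non-blank value above each row
after = place.bfill()                      # nearest non-blank value below each row

# Fill a blank only when the surrounding values agree
fill_mask = place.isna() & (prior == after)
df.loc[fill_mask, 'PLACEFP'] = prior[fill_mask]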
If you end up needing to iterate over the dataframe, use df.itertuples. Each row comes back as a namedtuple, so you can access the column values via dot notation (row.column_name) and the index via row.Index.
for row in df.itertuples():
    # logic goes here
Using your dataframe as defined
def fix_df(current_state_blockid_df):
    df_with_blanks = current_state_blockid_df[current_state_blockid_df['PLACEFP'] == '']
    df_no_blanks = current_state_blockid_df[current_state_blockid_df['PLACEFP'] != '']

    sections = {}
    last_i = 0
    grouping = []
    for i in df_with_blanks.index:
        if i - 1 == last_i:
            grouping.append(i)
            last_i = i
        else:
            last_i = i
            if len(grouping) > 0:
                sections[min(grouping)] = {'indexes': grouping}
                grouping = []
            grouping.append(i)
    if len(grouping) > 0:
        sections[min(grouping)] = {'indexes': grouping}

    for i in sections.keys():
        sections[i]['place'] = current_state_blockid_df.loc[i-1, 'PLACEFP']

    l = []
    for i in sections:
        for x in sections[i]['indexes']:
            l.append(sections[i]['place'])

    df_with_blanks['PLACEFP'] = l
    final_df = pandas.concat([df_with_blanks, df_no_blanks]).sort_index(axis=0)
    return final_df

df = fix_df(current_state_blockid_df)
print(df)
Output:
BLOCKID PLACEFP
0 60014099004021 53000
1 60014100001000 53000
2 60014100001001 53000
3 60014100001002 53000
4 60014301012019 11964
5 60014301013000 11964
6 60014301013001 11964
7 60014301013002 11964
8 60014301013003 11964
9 60014301013004 11964
10 60014301013005 11964
11 60014301013006 11964
