Remove partial duplicate row using column value - python

I'm trying to clean data that contains a lot of partial duplicates, keeping only the first row whenever the key in Col A is duplicated.
      A    B    C    D
0   foo  bar  lor  ips
1   foo  bar
2  test   do  kin  ret
3  test   do
4    er   ed   ln   pr
Expected output after cleaning:
      A    B    C    D
0   foo  bar  lor  ips
1  test   do  kin  ret
2    er   ed   ln   pr
I have been looking at methods such as drop_duplicates or even groupby, but they don't really help in my case: the duplicates are partial, since some rows contain empty data and only share values in Col A and B.
groupby partially works, but it doesn't return the transformed data; it just filters through.
I'm very new to pandas and pointers are appreciated. I could probably do it outside pandas, but I'm thinking there might be a better way.
Edit: sorry, just noticed a mistake I made in the provided example ("test" had become "tes").

How would you define a partial duplicate in your case? Please provide a more complex example. In the example above, you could deduplicate on Col B instead of Col A.
The expected output can be obtained with the following snippet:
print(df.drop_duplicates(subset=['B']))
Note: the suggested solution only works for the sample above; it won't work when rows have different Col A values but the same Col B value.
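For reference, here is a minimal sketch of deduplicating on Col A directly, as the question asks. It assumes the empty cells are read as missing values (None/NaN); the variable names first_rows and merged are only illustrative. The first variant keeps the first row per key as-is, the second takes the first non-missing value per column within each key, which helps if the complete row is not always the first one.

```python
import pandas as pd

# Sample data from the question; empty cells are assumed to be missing values
df = pd.DataFrame({
    "A": ["foo", "foo", "test", "test", "er"],
    "B": ["bar", "bar", "do", "do", "ed"],
    "C": ["lor", None, "kin", None, "ln"],
    "D": ["ips", None, "ret", None, "pr"],
})

# Variant 1: keep the first row seen for each value of A
first_rows = df.drop_duplicates(subset=["A"], keep="first").reset_index(drop=True)

# Variant 2: per key in A, take the first non-missing value of every column
merged = df.groupby("A", sort=False, as_index=False).first()

print(first_rows)
print(merged)
```

On this sample both variants give the expected three rows; they differ only when the most complete row is not the first one for its key.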

Related

Dataframe- Remove similar rows related based on two columns

I have the following dataset:
It lists the correlation between the two columns on the left.
If you look at rows 3 and 42, you will find they are the same; only the column positions are swapped, which does not affect the correlation. I want to remove row 42. But this dataset has many such rows with similar values. I need a general algorithm to remove them and keep only the unique pairs.
As the correlation_value seems to be the same, the operation should be commutative, so whatever the value, you just have to focus on the first two columns. Sort each pair into a tuple and remove duplicates:
# Sorting makes the key order-insensitive, so (A, B) and (B, A) compare equal
# You can probably replace 'tuple(sorted(x))' by 'frozenset(x)'
key = df[['source_column', 'destination_column']] \
    .apply(lambda x: tuple(sorted(x)), axis='columns')
out = df.loc[~key.duplicated()]
>>> out
source_column destination_column correlation_Value
0 A B 1
2 C E 2
3 D F 4
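If the row-wise apply turns out to be slow on a large frame, the same idea can probably be expressed with NumPy. This is only a sketch: the sample values are made up to match the output above, and pairs and mask are illustrative names. It sorts each (source, destination) pair along the row axis and then flags repeated pairs.

```python
import numpy as np
import pandas as pd

# Made-up sample consistent with the output shown above
df = pd.DataFrame({
    "source_column": ["A", "B", "C", "D"],
    "destination_column": ["B", "A", "E", "F"],
    "correlation_Value": [1, 1, 2, 4],
})

# Sort each pair within its row so (A, B) and (B, A) become identical
pairs = np.sort(df[["source_column", "destination_column"]].to_numpy(), axis=1)

# Flag every pair that has already appeared in an earlier row
mask = pd.DataFrame(pairs, index=df.index).duplicated()

out = df.loc[~mask]
print(out)
```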
You could try a self join. Without a code example, it's hard to answer, but something like this maybe:
df.merge(df, left_on="source_column", right_on="destination_column")
You can follow that up with a call to drop_duplicates.

Creating a new column but creates copy of dataframe

I would like to check the value of the row above and see if it is the same as the current row. I found a great answer here: df['match'] = df.col1.eq(df.col1.shift()) where col1 is the column you are comparing.
However, when I tried it, I received a SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame. warning. My col1 is a string. I know you can suppress warnings but how would I check the same row above and make sure that I am not creating a copy of the dataframe? Even with the warning I do get my desired output, but was curious if there exists a better way.
import pandas as pd
data = {'col1': ['a','a','a','b','b','c','c','c','d','d'],
        'week': [1,1,1,1,1,2,2,2,2,2]}
df = pd.DataFrame(data, columns=['col1','week'])
df['check_condition'] = 1
while sum(df.check_condition) != 0:
    for week in df.week:
        wk = df.loc[df.week == week]
        wk['match'] = wk.col1.eq(wk.col1.shift())  # <-- where the warning occurs
        # fix the repetitive value...which I have not done yet
        # for now just exit out of the while loop
        df.loc[df.week == week, 'check_condition'] = 0
You can't ignore a pandas SettingWithCopyWarning!
It's 100% telling you that your code is not going to work as intended, if at all. Stop, investigate and fix it. (It's not an ignorable thing you can filter out, like a pandas FutureWarning nagging about deprecation.)
Multiple issues with your code:
You're trying to iterate over a dataframe (but not with groupby()) and take slices of it (the sub-dataframe wk, which yes, is a copy of a slice)...
then assign to the (nonexistent) new column wk['match']. This is bad; you shouldn't do this. (You could initialize df['match'] = np.nan, but it would still be wrong to try to assign to the copy in wk)...
SettingWithCopyWarning is triggered when you try to assign to wk['match']. It's telling you wk is a copy of a slice from dataframe df, not df itself. Hence, as it tells you: A value is trying to be set on a copy of a slice from a DataFrame. That assignment would only get thrown away every time wk gets overwritten by your loop, so even if you could force it to work on wk it would be wrong. That's why SettingWithCopyWarning is a code smell: you shouldn't be making a copy of a slice of df in the first place.
Later on, you also try to assign to column df['check_condition'] while iterating over the df; that's also bad.
Solution:
df['check_condition'] = df['col1'].eq(df['col1'].shift()).astype(int)
df
col1 week check_condition
0 a 1 0
1 a 1 1
2 a 1 1
3 b 1 0
4 b 1 1
5 c 2 0
6 c 2 1
7 c 2 1
8 d 2 0
9 d 2 1
More generally, for more complicated code where you want to iterate over each group of a dataframe according to some grouping criterion, you'd use groupby() and split-apply-combine instead.
Here you're comparing with col1.eq(col1.shift()), i.e. flagging rows where the col1 value doesn't change from the preceding row,
and the solution above sets check_condition to 1 on those rows
and 0 on rows where the col1 value did change from the preceding row.
But in this simpler case you can skip groupby() and do a direct assignment.
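If the comparison is meant to restart within each week (which the loop in the question seems to be aiming for, though that is an assumption), a groupby-based sketch could look like the following. The names previous_in_week and match_in_slice are illustrative. The second part shows that an explicit .copy() is one way to work on a per-week slice without the warning, as long as you accept that changes stay in the copy.

```python
import pandas as pd

df = pd.DataFrame({
    "col1": list("aaabbcccdd"),
    "week": [1] * 5 + [2] * 5,
})

# Shift col1 by one row *within each week*, then compare with the current row
previous_in_week = df.groupby("week")["col1"].shift()
df["match"] = df["col1"].eq(previous_in_week)

# If a separate per-week frame is really needed, copy it explicitly;
# assignments then go to the copy and never back into df
wk = df.loc[df["week"] == 1].copy()
wk["match_in_slice"] = wk["col1"].eq(wk["col1"].shift())

print(df)
```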

What is the best way to process a list of numerical codes as descriptions in Pandas?

Here is the dataset:
df = pd.read_csv('https://data.lacity.org/api/views/d5tf-ez2w/rows.csv?accessType=DOWNLOAD')
The problem:
I have a pandas dataframe of traffic accidents in Los Angeles.
Each accident has a column of mo_codes which is a string of numerical codes (which I converted into a list of codes). Here is a screenshot:
I also have a dictionary of mo_codes description for each respective mo_code and loaded in the notebook.
Now, using the code below I can combine the numeric code with the description:
mo_code_list_final = []
for i in range(20):
    for j in df.mo_codes.iloc[i]:
        print(i, mo_code_dict[j])
So, I haven't added this as a column to Pandas yet. I wanted to ask if there is a better way to solve the problem I have which is, how best to add the textual description in pandas as a column.
Also, is there an easier way to process this with a pandas function like .assign instead of the for loop. Maybe a list comprehension to process the mo_codes into a new dataframe with the description?
Thanks in advance.
ps. if there is a technical word for this type of problem, pls let me know.
import pandas
codes = {0:'Test1',1:'test 2',2:'test 3',3:'test 4'}
df1 = pandas.DataFrame([["red",[0,1,2],5],["blue",[3,1],6]],columns=[0,'codes',2])
# first explode the list into its own rows
df2 = df1['codes'].apply(pandas.Series).stack().astype(int).reset_index(level=1, drop=True).to_frame('codes').join(df1[[0,2]])
#now use map to apply the text descriptions
df2['desc'] = df2['codes'].map(codes)
print(df2)
"""
codes 0 2 desc
0 0 red 5 Test1
0 1 red 5 test 2
0 2 red 5 test 3
1 3 blue 6 test 4
1 1 blue 6 test 2
"""
I finally figured out how to do this. I found the answer in JavaScript, but the same concept applies.
You simply create a dictionary mapping each mo_code to its string value.
export const mocodesDict = {
  "0100": "Suspect Impersonate",
  "0101": "Aid victim",
  "0102": "Blind",
  "0103": "Crippled",
  ...
}
After that, it's as simple as doing this:
mocodesDict[item]
where item is the code you want to convert.
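Back in pandas, the dictionary lookup can be attached as a column without an explicit loop. The sketch below is only illustrative: the record column, the code "9999", and the "unknown" fallback are made up, and the codes are assumed to be strings so that leading zeros survive. It shows one description list per row, and alternatively one row per code via explode.

```python
import pandas as pd

# A few entries taken from the dictionary above; keys kept as strings
mo_code_dict = {"0100": "Suspect Impersonate", "0101": "Aid victim", "0102": "Blind"}

# Hypothetical sample: each row already holds a list of codes
df = pd.DataFrame({
    "record": [1, 2],
    "mo_codes": [["0100", "0102"], ["0101", "9999"]],
})

# Option 1: keep one row per record, store the descriptions as a list
df["mo_descriptions"] = df["mo_codes"].apply(
    lambda code_list: [mo_code_dict.get(code, "unknown") for code in code_list]
)

# Option 2: one row per (record, code); codes missing from the dict map to NaN
long_df = df[["record", "mo_codes"]].explode("mo_codes")
long_df["description"] = long_df["mo_codes"].map(mo_code_dict)

print(df)
print(long_df)
```

The long format is usually easier to count or group by description; the list column keeps the original row count.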

How to do columnwise operations in pandas?

I have a dataframe that looks something like:
sample parameter1 parameter2 parameter3
A 9 6 3
B 4 5 7
C 1 5 8
and I want to do an operation that does something like:
for sample in dataframe:
    df['new parameter'] = df[sample, parameter1]/df[sample, parameter2]
so far I have tried:
df2.loc['ratio'] = df2.loc['reads mapped']/df2.loc['raw total sequences']
but I get the error:
KeyError: 'the label [reads mapped] is not in the [index]'
when I know well that it is in the index, so I figure I am missing some concept somewhere. Any help is much appreciated!
I should add that the parameter values are floats, just in case that is a problem as well!
The method .loc first expects row indices, then column indices, so the following should work, since you wanted to do column-wise operations:
df2['ratio'] = df2.loc[:, 'reads mapped'] / df2.loc[:, 'raw total sequences']
You can find more info in the documentation.
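To make the row/column distinction concrete, here is a small sketch built on the sample table from the question. It assumes sample is the index and the parameters are columns; row_lookup_error is just an illustrative name. The column division is applied to every row at once, and asking .loc for a column name as a row label reproduces the KeyError.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "parameter1": [9.0, 4.0, 1.0],
        "parameter2": [6.0, 5.0, 5.0],
        "parameter3": [3.0, 7.0, 8.0],
    },
    index=pd.Index(["A", "B", "C"], name="sample"),
)

# Column-wise operation: divides parameter1 by parameter2 for every sample
df["ratio"] = df["parameter1"] / df["parameter2"]

# .loc with a single label looks in the row index, so a column name fails
row_lookup_error = None
try:
    df.loc["parameter1"]
except KeyError as exc:
    row_lookup_error = exc

print(df)
```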

select certain value then output

I have a file containing mixed information, of which I only need certain columns.
Below is my example file.
A B C D
1 2 3 abcdef
5 6 7 abcdef
1 2 3 abcdef
I want to extract only the information I need. For example, my output file should look like this:
A C D # I only need A, C, and D column.
1 3 ab # For D column, I only need ab.
5 7 ab
1 3 ab
It is not a csv or txt file; the columns are separated by a single space.
You can still read a space-separated file with the csv module by using the delimiter kwarg:
>>> import csv
>>> with open('/tmp/data.txt') as f:
...     reader = csv.DictReader(f, delimiter=' ')
...     for row in reader:
...         print(row['A'], row['C'], row['D'][:2])
...
1 3 ab
5 7 ab
1 3 ab
If you want something generic for managing data structures, the easiest thing you can do is use Python libraries to ease the job.
You can use pandas (Python Data Analysis Library) to quickly parse the file into a DataFrame, which provides methods to do what you want.
You can treat your data file as a csv (comma-separated values) file with spaces as separators.
With pandas you can easily parse the file with read_csv:
import pandas as pd
dataFrame = pd.read_csv("file.txt", sep=' ')
To select columns, index the DataFrame with a list of column names (older pandas versions also offered an as_matrix method for this, which returned a NumPy array; it was removed in pandas 1.0):
selection = dataFrame[['A', 'C', 'D']]
The result is itself a DataFrame, so you can continue using its methods.
Dropping "cdef" from the "abcdef" values in column D looks like something that can be solved with a simple for loop and the string methods provided by Python. It's a very particular instruction and I don't know of any library method that accomplishes this.
I hope I helped you.
PS: I tried to post a lot of links but the system didn't let me. I recommend you look up pandas (and NumPy, which it depends on) on Google if you don't have them.
You should check the pandas DataFrame docs for the available methods. In case you didn't understand what I did, look for the pandas.read_csv and pandas.DataFrame docs on Google.
And if you don't know how to work with strings, look in the Python docs for str.
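For what it's worth, the pandas .str accessor seems to cover the truncation step too, so the whole task can be sketched in pandas. The snippet below reads the sample from an in-memory string instead of a real file so that it is self-contained; out and text are illustrative names.

```python
import io
import pandas as pd

# Sample data from the question, held in memory instead of a file
raw = "A B C D\n1 2 3 abcdef\n5 6 7 abcdef\n1 2 3 abcdef\n"
df = pd.read_csv(io.StringIO(raw), sep=" ")

# Keep only columns A, C and D (copy so the next assignment is unambiguous)
out = df[["A", "C", "D"]].copy()

# Keep only the first two characters of every value in D
out["D"] = out["D"].str[:2]

# Render as space-separated text; pass a file path to to_csv to write a file
text = out.to_csv(sep=" ", index=False)
print(text)
```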
Edit: Anyway, if you don't want to use libraries, you can parse the text file into a list of lists imitating a matrix, or use the csv structure that wim mentions in his answer. Then create a function to drop columns by checking the first element of every column (the column identifier), and with some for loops export that to another matrix.
Then create another function that deletes the desired values of a column, with some more for loops.
The point is that using functions to accomplish what you want makes the solution generic for any table managed as a matrix.
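A rough sketch of that library-free idea, using only the standard csv module. The function name select_columns and its keep/truncate parameters are made up for illustration; the table is a list of rows whose first row is the header.

```python
import csv
import io

def select_columns(rows, keep, truncate=None):
    """Return a new table with only the columns named in keep.

    truncate optionally maps a column name to the number of leading
    characters to retain in that column.
    """
    truncate = truncate or {}
    header, *body = rows
    positions = [header.index(name) for name in keep]
    result = [list(keep)]
    for row in body:
        new_row = []
        for pos in positions:
            value = row[pos]
            name = header[pos]
            if name in truncate:
                value = value[:truncate[name]]
            new_row.append(value)
        result.append(new_row)
    return result

# Sample data from the question, held in memory instead of a file
raw = "A B C D\n1 2 3 abcdef\n5 6 7 abcdef\n1 2 3 abcdef\n"
rows = list(csv.reader(io.StringIO(raw), delimiter=" "))

result = select_columns(rows, keep=["A", "C", "D"], truncate={"D": 2})
for row in result:
    print(*row)
```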
If you have more than one column like D and want to treat them all the same way, you can do the following, provided you're OK with selecting columns by index instead of by letter:
# your data like this
A B C D E
1 2 3 abcdef abbbb
5 6 7 abcdef abbbb
1 2 3 abcdef abbbb
Import csv, then:
>>> import csv
>>> with open('yourdata.txt') as f:
...     reader = csv.reader(f, delimiter=' ')
...     for row in reader:
...         print(row[0], row[1], *[c[:2] for c in row[3:]])
...
A B D E
1 2 ab ab
5 6 ab ab
1 2 ab ab
The * operator before [c[:2] for c in row[3:]] performs argument unpacking. * basically converts [1,2,3] into 1,2,3, so print(*[1,2,3]) is identical to print(1,2,3). It works on tuples as well.
However, this is Python 3. If you are using Python 2, print will give you a syntax error, but you can write a wrapper function that takes the unpacked arguments, and replace print with this function:
def myprint(*args):
    print ' '.join([str(i) for i in args])
