Python Remove duplicates from csv if value in column duplicated

Python Remove duplicates from csv if value in column duplicated - python

I am trying to write csv parser so if i have the same name in the name column i will delete the second name's line. For example:
['CSE_MAIN\\LC-CSEWS61', 'DEREGISTERED', '2018-04-18-192446'],
['CSE_MAIN\\IT-Laptop12', 'DEREGISTERED', '2018-03-28-144236'],
['CSE_MAIN\\LC-CSEWS61', 'DEREGISTERED', '2018-03-28-144236']]
I need that the last line will be deleted because it has the same name as the first one.
What i wrote is:
file2 = str(sys.argv[2])
print ("The first file is:" + file2)
reader2 = csv.reader (open(file2))
with open("result2.csv",'wb') as result2:
wtr2= csv.writer( result2 )
for r in reader2:
wtr2.writerow( (r[0], r[6], r[9] ))
newreader2 = csv.reader (open("result2.csv"))
sortedlist2 = sorted(newreader2, key=lambda col: col[2] , reverse = True)
for i in range(len(sortedlist2)):
for j in range(len(sortedlist2)-1):
if (sortedlist2[i][0] == sortedlist2[j+1][0] and sortedlist2[i][1]!=sortedlist2[j+1][1]):
if(sortedlist2[i][1]>sortedlist2[j+1][1]):
del sortedlist2[i][0-2]
else:
del sortedlist2[j+1][0-2]
Thanks.

Try with pandas:
import pandas as pd
df = pd.read_csv('path/name_file.csv')
df = df.drop_duplicates([0]) #0 this is columns which will compare.
df.to_csv('New_file.csv') #save to csv
This method delete all duplicates from columns 1.
If you need simple delete you can use method drop.
#You file after use pandas (print(df)):
0 1 2
0 CSE_MAIN\LC-CSEWS61 DEREGISTERED 2018-04-18-192446
1 CSE_MAIN\IT-Laptop12 DEREGISTERED 2018-03-28-144236
2 CSE_MAIN\LC-CSEWS61 DEREGISTERED 2018-03-28-144236
For example you need delete 2 row.
df.drop(2,axis=0, inplace=True) #axis=0 means row, if you switch 1 this is columns.
Output:
0 1 2
0 CSE_MAIN\LC-CSEWS61 DEREGISTERED 2018-04-18-192446
1 CSE_MAIN\IT-Laptop12 DEREGISTERED 2018-03-28-144236

Related

How to avoid column header repeating in pandas dataframe when forloop placed inside while loop in python

code:
i = 0
while i<3:
def avoid_repeat_df():
rows = []
for i in range(1):
try:
rows. append([i, i + 2])
df = pd. DataFrame(rows, columns=["Aaa", "Bee"])
df.to_csv('example.csv')#mode='a' write the ouput same as df else only last occurance
print(df)
except:
pass
avoid_repeat_df()
i+=1
it prints output as
Aaa Bee
0 0 2
Aaa Bee
0 0 2
Aaa Bee
0 0 2
But in csv only last line writes (in append mode same as df output all the line)
What i expect :
avoid repeating column header
Aaa Bee
0 0 2
0 0 2
0 0 2
Hope i explained in details. Thanks for your time and help.

Try defining rows variable(list) outside of the while loop:
i = 0
rows = []
while i<3:
def avoid_repeat_df():
for i in range(1):
try:
rows. append([i, i + 2])
df = pd. DataFrame(rows, columns=["Aaa", "Bee"])
df.to_csv('example.csv')#mode='a' write the ouput same as df else only last occurance
print(df)
except:
pass
avoid_repeat_df()
i+=1
Now If you check example.csv:
,Aaa,Bee
0,0,2
1,0,2
2,0,2

try this:
i = 0
rows = []
while i<3:
l = [[k, k + 2] for k in range(1)]
rows.extend(l)
i+=1
df = pd.DataFrame(rows,columns=["Aaa", "Bee"])
df.to_csv("example.csv")
print(df)
Output
Aaa Bee
0 0 2
1 0 2
2 0 2

I see 2 ways to handle this, one as mentioned by #Anurag Dabas, to move data frame definition outside the loop. I can help you with the second one where while writing you can simply use header option in to_csv method.
While using mode='a', use header=False, this will skip writing column header each time.
But according to what is presented, the best way would be to use list instead of creating a dataframe, as it is computationally inefficient.
i = 0
rows = list()
while i<3:
def avoid_repeat_df():
global rows
for i in range(1):
try:
rows.append([i, i + 2])
except:
pass
avoid_repeat_df()
i+=1
df = pd. DataFrame(rows, columns=["Aaa", "Bee"])
df.to_csv('example.csv')
print(df)
If you still want to create df only
i = 0
while i<3:
def avoid_repeat_df():
rows = list()
for i in range(1):
try:
rows.append([i, i + 2])
df = pd.DataFrame(rows, columns=["Aaa", "Bee"])
if os.path("example.csv").exists():
df.to_csv("example.csv", mode="a", header=False)
else:
df.to_csv("example.csv")
# This won't change actual df though
print(df)
except:
pass
avoid_repeat_df()
i+=1
You could also append to df and save when all your iterations have been completed but that would be as good as creating a list and then creating data frame out of it.

Python & Pandas: appending data to new column

With Python and Pandas, I'm writing a script that passes text data from a csv through the pylanguagetool library to calculate the number of grammatical errors in a text. The script successfully runs, but appends the data to the end of the csv instead of to a new column.
The structure of the csv is:
The working code is:
import pandas as pd
from pylanguagetool import api
df = pd.read_csv("Streamlit\stack.csv")
text_data = df["text"].fillna('')
length1 = len(text_data)
for i, x in enumerate(range(length1)):
# this is the pylanguagetool operation
errors = api.check(text_data, api_url='https://languagetool.org/api/v2/', lang='en-US')
result = str(errors)
# this pulls the error count "message" from the pylanguagetool json
error_count = result.count("message")
output_df = pd.DataFrame({"error_count": [error_count]})
output_df.to_csv("Streamlit\stack.csv", mode="a", header=(i == 0), index=False)
The output is:
Expected output:
What changes are necessary to append the output like this?

Instead of using a loop, you might consider lambda which would accomplish what you want in one line:
df["error_count"] = df["text"].fillna("").apply(lambda x: len(api.check(x, api_url='https://languagetool.org/api/v2/', lang='en-US')["matches"]))
>>> df
user_id ... error_count
0 10 ... 2
1 11 ... 0
2 12 ... 0
3 13 ... 0
4 14 ... 0
5 15 ... 2
Edit:
You can write the above to a .csv file with:
df.to_csv("Streamlit\stack.csv", index=False)
You don't want to use mode="a" as that opens the file in append mode whereas you want (the default) write mode.

My strategy would be to keep the error counts in a list then create a separate column in the original database and finally write that database to csv:
text_data = df["text"].fillna('')
length1 = len(text_data)
error_count_lst = []
for i, x in enumerate(range(length1)):
errors = api.check(text_data, api_url='https://languagetool.org/api/v2/', lang='en-US')
result = str(errors)
error_count = result.count("message")
error_count_lst.append(error_count)
text_data['error_count'] = error_count_lst
text_data.to_csv('file.csv', index=False)

How to read a file containing more than one record type within?

I have a .csv file that contains 3 types of records, each with different quantity of columns.
I know the structure of each record type and that the rows are always of type1 first, then type2 and type 3 at the end, but I don't know how many rows of each record type there are.
The first 4 characters of each row define the record type of that row.
CSV Example:
typ1,John,Smith,40,M,Single
typ1,Harry,Potter,22,M,Married
typ1,Eva,Adams,35,F,Single
typ2,2020,08,16,A
typ2,2020,09,02,A
typ3,Chevrolet,FC101TT,2017
typ3,Toyota,CE972SY,2004
How can I read It with Pandas? It doesn't matter if I have to read one record type each time.
Thanks!!

Here it is a pandas solution.
First we must read the csv file in a way that pandas keeps the entires lines in one cell each. We do that by simply using a wrong separator, such as the 'at' symbol '#'. It can be whatever we want, since we guarantee it won't ever appear in our data file.
wrong_sep = '#'
right_sep = ','
df = pd.read_csv('my_file.csv', sep=wrong_sep).iloc[:, 0]
The .iloc[:, 0] is used as a quick way to convert a DataFrame into a Series.
Then we use a loop to select the rows that belong to each data structure based on their starting characters. Now we use the "right separator" (probably a comma ',') to split the desired data into real DataFrames.
starters = ['typ1', 'typ2', 'typ3']
detected_dfs = dict()
for start in starters:
_df = df[df.str.startswith(start)].str.split(right_sep, expand=True)
detected_dfs[start] = _df
And here you go. If we print the resulting DataFrames, we get:
0 1 2 3 4 5
0 typ1 Harry Potter 22 M Married
1 typ1 Eva Adams 35 F Single
0 1 2 3 4
2 typ2 2020 08 16 A
3 typ2 2020 09 02 A
0 1 2 3
4 typ3 Chevrolet FC101TT 2017
5 typ3 Toyota CE972SY 2004
Let me know if it helped you!

Not Pandas:
from collections import defaultdict
filename2 = 'Types.txt'
with open(filename2) as dataLines:
nL = dataLines.read().splitlines()
defDList = defaultdict(list)
subs = ['typ1','typ2','typ3']
dataReadLines = [defDList[i].append(j) for i in subs for j in nL if i in j]
# dataReadLines = [i for i in nL]
print(defDList)
Output:
defaultdict(<class 'list'>, {'typ1': ['typ1,John,Smith,40,M,Single', 'typ1,Harry,Potter,22,M,Married', 'typ1,Eva,Adams,35,F,Single'], 'typ2': ['typ2,2020,08,16,A', 'typ2,2020,09,02,A'], 'typ3': ['typ3,Chevrolet,FC101TT,2017', 'typ3,Toyota,CE972SY,2004']})

You can make use of the skiprows parameter of pandas read_csv method to skip the rows you are not interested in for a particular record type. The following gives you a dictionary dfs of dataframes for each type. An advantage is that records of the same types don't necessarily have to be adjacent to each other in the csv file.
For larger files you might want to adjust the code such that the file is only read once instead of twice.
import pandas as pd
from collections import defaultdict
indices = defaultdict(list)
types = ['typ1', 'typ2', 'typ3']
filename = 'test.csv'
with open(filename) as csv:
for idx, line in enumerate(csv.readlines()):
for typ in types:
if line.startswith(typ):
indices[typ].append(idx)
dfs = {typ: pd.read_csv(filename, header=None,
skiprows=lambda x: x not in indices[typ])
for typ in types}

Read the file as a CSV file using the CSV reader. The reader fortunately does not care about line formats:
import csv
with open("yourfile.csv") as infile:
data = list(csv.reader(infile))
Collect the rows with the same first element and build a dataframe of them:
import pandas as pd
from itertools import groupby
dfs = [pd.DataFrame(v) for _,v in groupby(data, lambda x: x[0])]
You've got a list of three dataframes (or as many as necessary).
dfs[1]
# 0 1 2 3 4
#0 typ2 2020 08 16 A
#1 typ2 2020 09 02 A

Panda module export, split data

I'm trying to read a .txt file and output the count of each letter which works, however, I'm having trouble exporting that data to .csv in a specific way.
A snippet of the code:
freqs = {}
with open(Book1) as f:
for line in f:
for char in line:
if char in freqs:
freqs[char] += 1
else:
freqs[char] = 1
print(freqs)
And for the exporting to csv, I did the following:
test = {'Book 1 Output':[freqs]}
df = pd.DataFrame(test, columns=['Book 1 Output'])
df.to_csv(r'book_export.csv', sep=',')
Currently when I run it, the export looks like this (Manually done):
However I want the output to be each individual row, so it should look something like this when I open it:
I want it to separate it from the ":" and "," into 3 different columns.
I've tried various other answers on here but most of them end up with giving ValueErrors so maybe I just don't know how to apply it like the following one.
df[[',']] = df[','].str.split(expand=True)

Use DataFrame.from_dict with DataFrame.rename_axis for set index name, then csv looks like you need:
#sample data
freqs = {'a':5,'b':2}
df = (pd.DataFrame.from_dict(freqs, orient='index',columns=['Book 1 Output'])
.rename_axis('Letter'))
print (df)
Book 1 Output
Letter
a 5
b 2
df.to_csv(r'book_export.csv', sep=',')
Or alternative is use Series:
s = pd.Series(freqs, name='Book 1 Output').rename_axis('Letter')
print (s)
Letter
a 5
b 2
Name: Book 1 Output, dtype: int64
s.to_csv(r'book_export.csv', sep=',')
EDIT:
If there are multiple frequencies change DataFrame constructor:
freqs = {'a':5,'b':2}
freqs1 = {'a':9,'b':3}
df = pd.DataFrame({'f1':freqs, 'f2':freqs1}).rename_axis('Letter')
print (df)
f1 f2
Letter
a 5 9
b 2 3

Splitting a CSV file into multiple csv by target columns values

I'm fairly new to programming and Python in general. I've a big CSV file that I need to split into multiple CSV files based on the target values of the target column (last column).
Here's a simplified version of the CSV file data that I want to split.
1254.00 1364.00 4562.33 4595.32 1
1235.45 1765.22 4563.45 4862.54 1
6235.23 4563.00 7832.31 5320.36 1
8623.75 5632.09 4586.25 9361.86 0
5659.92 5278.21 8632.02 4567.92 0
4965.25 1983.78 4326.50 7901.10 1
7453.12 4993.20 4573.30 8632.08 1
8963.51 7496.56 4219.36 7456.46 1
9632.23 7591.63 8612.37 4591.00 1
7632.08 4563.85 4632.09 6321.27 0
4693.12 7621.93 5201.37 7693.48 0
6351.96 7216.35 795.52 4109.05 0
I want to split so that the output extracts the data in different csv files like below:
sample1.csv
1254.00 1364.00 4562.33 4595.32 1
1235.45 1765.22 4563.45 4862.54 1
6235.23 4563.00 7832.31 5320.36 1
sample2.csv
8623.75 5632.09 4586.25 9361.86 0
5659.92 5278.21 8632.02 4567.92 0
sample3.csv
4965.25 1983.78 4326.50 7901.10 1
7453.12 4993.20 4573.30 8632.08 1
8963.51 7496.56 4219.36 7456.46 1
9632.23 7591.63 8612.37 4591.00 1
sample4.csv
7632.08 4563.85 4632.09 6321.27 0
4693.12 7621.93 5201.37 7693.48 0
6351.96 7216.35 795.52 4109.05 0
I tried with pandas and some groupby functions but it merges all 1 and 0 together in separate files one containing all values with 1 and another 0, which is not the output that I needed.
Any help would be appreciated.

What you can do is get the value of the last column in each row. If the value is the same as the value in previous row, add that row to the same list, and if it's not just create a new list and add that row to that empty list. For data structure use list of lists.

Assume the file 'input.csv' contains the original data.
1254.00 1364.00 4562.33 4595.32 1
1235.45 1765.22 4563.45 4862.54 1
6235.23 4563.00 7832.31 5320.36 1
8623.75 5632.09 4586.25 9361.86 0
5659.92 5278.21 8632.02 4567.92 0
4965.25 1983.78 4326.50 7901.10 1
7453.12 4993.20 4573.30 8632.08 1
8963.51 7496.56 4219.36 7456.46 1
9632.23 7591.63 8612.37 4591.00 1
7632.08 4563.85 4632.09 6321.27 0
4693.12 7621.93 5201.37 7693.48 0
6351.96 7216.35 795.52 4109.05 0
code below
target = None
counter = 0
with open('input.csv', 'r') as file_in:
lines = file_in.readlines()
tmp = []
for idx, line in enumerate(lines):
_target = line.split(' ')[-1].strip()
if idx == 0:
tmp.append(line)
target = _target
continue
else:
last_line = idx + 1 == len(lines)
if _target != target or last_line:
if last_line:
tmp.append(line)
counter += 1
with open('sample{}.csv'.format(counter), 'w') as file_out:
file_out.writelines(tmp)
tmp = [line]
else:
tmp.append(line)
target = _target

Perhaps you want something like this:
from itertools import groupby
from operator import itemgetter
sep = ' '
with open('data.csv') as f:
data = f.read()
split_data = [row.split(sep) for row in data.split('\n')]
gb = groupby(split_data, key=itemgetter(4))
for index, (key, group) in enumerate(gb):
with open('sample{}.csv'.format(index), 'w') as f:
write_data = '\n'.join(sep.join(cell) for cell in group)
f.write(write_data)
Unlike pd.groupby, itertools.groupby doesn't sort the source beforehand. This parses the input CSV into a list of lists and performs a groupby on the outer list based on the 5th column, which contains the target. The groupby object is an iterator over the groups; by writing each group to a different file, the result you want can be achieved.

I propose to use a function to do what was asked for.
There is the possibility of leaving unreferenced the file objects that
we have opened for writing, so that they are automatically closed when
garbage collected but here I prefer to explicitly close every output
file before opening another one.
The script is heavily commented, so no further explanations:
def split_data(data_fname, key_len=1, basename='file%03d.txt')
data = open(data_fname)
current_output = None # because we have yet not opened an output file
prev_key = int(1) # because a string is always different from an int
count = 0 # because we want to count the output files
for line in data:
# line has a trailing newline so that to extract the key
# we have to take into account that
key = line[-key_len-1:-1]
if key != prev_key # key has changed!
count += 1 # a new file is going to be opened
prev_key = key # remember the new key
if current_output: # if a file was opened, close it
current_output.close()
# open a new output file, its name derived from the variable count
current_output = open(basename%count, 'w')
# now we can write to the output file
current_output.write(line)
# note that line is already newline terminated
# clean up what is still going
current_output.close()
This answer has an history.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Python Remove duplicates from csv if value in column duplicated - python

Related

How to avoid column header repeating in pandas dataframe when forloop placed inside while loop in python

Python & Pandas: appending data to new column

How to read a file containing more than one record type within?

Panda module export, split data

Splitting a CSV file into multiple csv by target columns values

Categories

Resources