This is the data
row1| sbkjd nsdnak ABC
row2| vknfe edcmmi ABC
row3| fjnfn msmsle XYZ
row4| sdkmm tuiepd XYZ
row5| adjck rulsdl LMN
I have already tried this using pandas with help from Stack Overflow, but I want to be able to remove the duplicates without using pandas or any library at all. So only one of the rows containing "ABC" must be chosen, only one of the rows containing "XYZ" must be chosen, and the last row is unique, so it should be chosen. How do I do this?
So, my final output should contain this:
[ row1 or row2 + row3 or row4 + row5 ]
This should only select the unique rows from your original table. If there are two or more rows which share duplicate data, it will select the first row.
data = [["sbkjd", "nsdnak", "ABC"],
["vknfe", "edcmmi", "ABC"],
["fjnfn", "msmsle", "XYZ"],
["sdkmm", "tuiepd", "XYZ"],
["adjck", "rulsdl", "LMN"]]
def check_list_uniqueness(candidate_row, unique_rows):
    # Reject the candidate if any of its values already appears in a kept row
    for element in candidate_row:
        for unique_row in unique_rows:
            if element in unique_row:
                return False
    return True

final_rows = []
for row in data:
    if check_list_uniqueness(row, final_rows):
        final_rows.append(row)

print(final_rows)
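Tracing this on the sample data, row2 and row4 are rejected because their last value ("ABC" and "XYZ") already appears in a row that was kept, so the print should give:
[['sbkjd', 'nsdnak', 'ABC'], ['fjnfn', 'msmsle', 'XYZ'], ['adjck', 'rulsdl', 'LMN']]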
This Bash command would do it (assuming your data is in a file called test, and that values of column 4 do not appear in other columns):
cut -d ' ' -f 4 test | tr '\n' ' ' | sed 's/\([a-zA-Z][a-zA-Z]*[ ]\)\1/\1/g' | tr ' ' '\n' | while read str; do grep -m 1 $str test; done
cut -d ' ' -f 4 test chooses the data in the fourth column
tr '\n' ' ' turns the column into a row (translating new line character to a space)
sed 's/\([a-zA-Z][a-zA-Z]*[ ]\)\1/\1/g' deletes the repetitions
tr ' ' '\n' turns the row of unique values to a column
while read str; do grep -m 1 $str test; done reads the unique words and prints the first line from test that matches that word
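Since the question asks for a way without any library at all, here is a minimal plain-Python sketch of the same idea (my own, not from either answer): keep only the first row seen for each distinct value in the last column. It assumes the data list from the Python answer above.
seen = set()
unique_rows = []
for row in data:
    key = row[-1]          # the last column, e.g. "ABC"
    if key not in seen:    # keep only the first row for each key
        seen.add(key)
        unique_rows.append(row)
print(unique_rows)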
Experts, I have a simple pipe-delimited file from a source system which has a free-flow text field, and for one of the records I see that a "|" character is coming in as part of the data. This breaks the record unevenly, so it does not get parsed into the correct number of fields. I want to replace the "|" in the data field with a "#".
Here is the record coming in from the source system. There are 9 fields in total in the file.
OutboundManualCall|H|RTYEHLA HTREDFST|Free"flow|Text|20191029|X|X|X|3456
If you notice the 4th field - Free"flow|Text - this is the complete value from the source, and it has a pipe in it.
I want to change it to Free"flow#Text and then read the file with a pipe delimiter.
Desired Outcome-
OutboundManualCall|H|RTYEHLA HTREDFST|Free"flow#Text|20191029|X|X|X|3456
I tried a few awk/sed combinations, but didn't get the desired output.
Thanks
Since you know there are 9 fields and the 4th is the problem, take the first 3 fields and the last 5 fields; whatever is left over is the 4th field.
You did tag shell, so here's some bash; I'm sure the Python equivalent is close:
line='OutboundManualCall|H|RTYEHLA HTREDFST|Free"flow|Text|20191029|X|X|X|3456'
IFS='|'
read -ra fields <<<"$line"
first3=( "${fields[@]:0:3}" )
last5=( "${fields[@]: -5}" )
tmp=${line#"${first3[*]}$IFS"} # remove the first 3 joined with pipe
field4=${tmp%"$IFS${last5[*]}"} # remove the last 5 joined with pipe
data=( "${first3[@]}" "$field4" "${last5[@]}" )
newline="${first3[*]}$IFS${field4//$IFS/#}$IFS${last5[*]}"
# .......^^^^^^^^^^^^....^^^^^^^^^^^^^^^^^....^^^^^^^^^^^
printf "%s\n" "$line" "$newline"
OutboundManualCall|H|RTYEHLA HTREDFST|Free"flow|Text|20191029|X|X|X|3456
OutboundManualCall|H|RTYEHLA HTREDFST|Free"flow#Text|20191029|X|X|X|3456
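For reference, a minimal Python sketch of the same idea (mine, not part of the original answer; the variable names are my own):
line = 'OutboundManualCall|H|RTYEHLA HTREDFST|Free"flow|Text|20191029|X|X|X|3456'
fields = line.split("|")
# first 3 fields and last 5 fields; whatever sits in between is the broken 4th field
first3, last5 = fields[:3], fields[-5:]
field4 = "#".join(fields[3:-5])   # glue the pieces of field 4 back together with '#'
newline = "|".join(first3 + [field4] + last5)
print(newline)
On a correct 9-field record, fields[3:-5] is a single element, so the line passes through unchanged.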
With awk it's simpler: if there are 10 fields, join fields 4 and 5, and shift the rest down one.
echo "$line" | awk '
BEGIN { FS = OFS = "|" }
NF == 10 {
    $4 = $4 "#" $5
    for (i=5; i<NF; i++)
        $i = $(i+1)
    NF--
}
1
'
OutboundManualCall|H|RTYEHLA HTREDFST|Free"flow#Text|20191029|X|X|X|3456
You tagged your question with Python, so I assume a Python-based answer is acceptable.
I assume not all records in your file have the additional "|"; only some records have it in the free-text column.
For a more realistic example, I create an input with some correct records and some erroneous records.
I use StringIO to simulate the file; in your environment, read the real file with 'open'.
from io import StringIO
sample = 'OutboundManualCall|H|RTYEHLA HTREDFST|Free"flow|Text|20191029|X|X|X|3456\nOutboundManualCall|J|LALALA HTREDFST|FreeHalalText|20191029|X|X|X|3456\nOutboundManualCall|J|LALALA HTREDFST|FrulaalText|20191029|X|X|X|3456\nOutboundManualCall|H|RTYEHLA HTREDFST|Free"flow|Text|20191029|X|X|X|3456'
infile = StringIO(sample)
outfile = StringIO()
for line in infile.readlines():
    cols = line.split("|")
    if len(cols) > 9:
        print(f"bad column {cols[3:5]}")
        # rebuild the line: first 3 fields, the broken field glued with '#', then the rest
        line = "|".join(cols[:3]) + "|" + "#".join(cols[3:5]) + "|" + "|".join(cols[5:])
    outfile.write(line)
print("Corrected file:")
print(outfile.getvalue())
Results in:
> bad column ['Free"flow', 'Text']
> bad column ['Free"flow', 'Text']
> Corrected file:
> OutboundManualCall|H|RTYEHLA HTREDFST|Free"flow#Text|20191029|X|X|X|3456
> OutboundManualCall|J|LALALA HTREDFST|FreeHalalText|20191029|X|X|X|3456
> OutboundManualCall|J|LALALA HTREDFST|FrulaalText|20191029|X|X|X|3456
> OutboundManualCall|H|RTYEHLA HTREDFST|Free"flow#Text|20191029|X|X|X|3456
I need to remove unwanted quotes and commas from a CSV file. Sample data is below:
header1, header2, header3, header4
1, "ABC", BCD, "EDG",GHT\2\TST"
The last column has some free-text values that look like an extra column, but when the file is opened in Excel the field looks like this:
EDG",GHT\2\TST
Please guide me in fixing this last column.
Tried this -
sed 's/","/|/g' $filename | sed 's/|",/||/g' | sed 's/|,"/|/g' | sed 's/",/ /g' | sed 's/^.//' | awk '{print substr($0, 1, length($0)-1)}' | sed 's/,/ /g' | sed 's/"/ /g' | sed 's/|/,/g' > "out_"$filename
This should find " or , in the columns and replace them with nothing:
df = df.str.replace('[",]','',regex=True)
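Note that .str is a Series accessor, so if df is a whole DataFrame the line above needs to target a single column; a minimal sketch, assuming the offending column is named header4 as in the sample header:
df['header4'] = df['header4'].str.replace('[",]', '', regex=True)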
You can do it like this:
with open("data.txt", "r") as f:
    for line in f.readlines():
        columns = line.split(", ")  # Split by ", "
        columns[3] = "".join(columns[3:])  # Merge columns 4 to the last
        columns[3] = columns[3].replace("\"", "").replace(",", "")  # Remove unwanted characters
        del columns[4:]  # Remove all unnecessary columns
        print("%s | %s | %s | %s" % (columns[0], columns[1], columns[2], columns[3]))
My data.txt file :
1, "ABC", BCD, "EDG",GHT\2\TST"
2, "CBA", DCB, "GDV",DHZ,\2RS"
Output :
1 | "ABC" | BCD | EDGGHT\2\TST
2 | "CBA" | DCB | GDVDHZ\2RS
This solution works only if the last column is the only one that contains commas.
Is it possible to split a CSV file vertically into multiple files? I know we can split a single large file into smaller files by a given number of rows using the command line. I have CSV files in which the columns repeat after a certain column number, and I want to split such a file column-wise. Is that possible with the command line, and if not, how can we do it with Python?
For example:
Consider the sample above, in which site and address appear multiple times vertically. I want to create 3 different CSV files, each containing a single site and a single address.
Any help would be highly appreciated,
Thanks
Assuming your input file is named ~/Downloads/sites.csv and looks like this:
Google,google.com,Google,google.com,Google,google.com
MS,microsoft.com,MS,microsoft.com,MS,microsoft.com
Apple,apple.com,Apple,apple.com,Apple,apple.com
You can use cut to create 3 files, each containing one pair of company/site:
cut -d "," -f 1-2 < ~/Downloads/sites.csv > file1.csv
cut -d "," -f 3-4 < ~/Downloads/sites.csv > file2.csv
cut -d "," -f 5-6 < ~/Downloads/sites.csv > file3.csv
Explanation:
For the cut command, we declare the comma (,) as the separator, which splits every line into a set of 'fields'.
We then specify, for each output file, which fields we want included.
HTH!
If the site-address pairs are regularly repeated, how about:
awk '{
    n = split($0, ary, ",");
    for (i = 1; i <= n; i += 2) {
        j = (i + 1) / 2;
        print ary[i] "," ary[i+1] >> "file" j ".csv";
    }
}' input.csv
The following script produces what you want (based on the SO answer, adjusted for your needs: number of columns, field separator). It splits the original file vertically into 2-column chunks (note n=2) and creates 3 different files (tmp.examples.1, tmp.examples.2, tmp.examples.3, or whatever you specify for the f variable):
awk -F "," -v f="tmp.examples" '{for (i=1; i<=NF; i++) printf (i%n==0||i==NF)?$i RS:$i FS > f "." int((i-1)/n+1) }' n=2 example.txt
If your example.txt file has the subsequent data:
site,address,site,address,site,address
Google,google.com,MS,microsoft.com,Apple,apple.com
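Since the question also asks how to do it in Python, here is a minimal sketch using the standard csv module (my own; the two-column block width and the output file names are assumptions):
import csv

width = 2  # each repeated block is a site,address pair
with open("example.txt", newline="") as src:
    rows = list(csv.reader(src))

blocks = len(rows[0]) // width
for b in range(blocks):
    with open("file%d.csv" % (b + 1), "w", newline="") as dst:
        writer = csv.writer(dst)
        for row in rows:
            writer.writerow(row[b * width:(b + 1) * width])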
I need to remove the trailing zeros from an export.
The code reads the original tempFile; I need columns 2 and 6, which contain:
12|9781624311390|1|1|0|0.0000
13|9781406273687|1|1|0|99.0000
14|9781406273717|1|1|0|104.0000
15|9781406273700|1|1|0|63.0000
The awk command changes the format to comma-separated and dumps columns 2 and 6 into tempFile2, and I need to remove the trailing zeros from column 6 so the end result looks like this:
9781624311390,0
9781406273687,99
9781406273717,104
9781406273700,63
I believe this should do the trick but have had no luck implementing it:
awk '{sub("\\.*0+$",""); print}'
Below is the code I need to adjust; $6 is the column whose trailing zeros need to be removed:
if not isError:
    print "Translating SQL output to tab delimited format"
    awkRunSuccess = os.system(
        "awk -F\"|\" '{print $2 \"\\,\" $6}' %s > %s" %
        (tempFile, tempFile2)
    )
    if awkRunSuccess != 0: isError = True
You can use gsub("\\.*0+$","",$2) to do this, as per the following transcript:
pax> echo '9781624311390|0.0000
9781406273687|99.0000
9781406273717|104.0000
9781406273700|63.0000' | awk -F'|' '{gsub("\\.*0+$","",$2);print $1","$2}'
9781624311390,0
9781406273687,99
9781406273717,104
9781406273700,63
However, given you're already within Python (and it's no slouch when it comes to regexes), you'd probably want to use it natively rather than start up an awk process.
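For instance, a minimal native-Python sketch of the same transformation (mine, not the answer's code; tempFile and tempFile2 are the variables from the question, and the regex strips a literal dot followed by trailing zeros):
import re

with open(tempFile) as src, open(tempFile2, "w") as dst:
    for line in src:
        fields = line.rstrip("\n").split("|")
        # column 2 is the identifier, column 6 the value to trim
        trimmed = re.sub(r"\.0+$", "", fields[5])
        dst.write("%s,%s\n" % (fields[1], trimmed))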
Try this awk command
awk -F '[|.]' '{print $2","$(NF-1)}' FileName
Output:
9781624311390,0
9781406273687,99
9781406273717,104
9781406273700,63
Input, where the unique identifier is specified by two rows (rows 1-2):
L1_I L1_I C-14 <---| unique identifier
WWPTH WWPT WWPTH <---| on two rows
1 2 3
Goal: how to concatenate the rows?
L1_IWWPTH L1_IWWPT C-14WWPTH <--- unique identifier
1 2 3
P.S. I will accept the simplest and most elegant solution.
Assuming that the input is in a file called file:
$ awk 'NR==1{for (i=1;i<=NF;i++) a[i]=$i;next} NR==2{for (i=1;i<=NF;i++) printf "%-20s",a[i] $i;print"";next} 1' file
L1_IWWPTH L1_IWWPT C-14WWPTH
1 2 3
How it works
NR==1{for (i=1;i<=NF;i++) a[i]=$i;next}
For the first line, save all the column headings in the array a. Then, skip over the rest of the commands and jump to the next line.
NR==2{for (i=1;i<=NF;i++) printf "%-20s",a[i] $i;print"";next}
For the second line, print all the column headings, merging together the ones from the first and second rows. Then, skip over the rest of the commands and jump to the next line.
1
1 is awk's cryptic shorthand for "print the line as is". This is done for all lines after the second.
Tab-separated columns with possible missing columns
If columns are tab-separated:
awk -F'\t' 'NR==1{for (i=1;i<=NF;i++) a[i]=$i;next} NR==2{for (i=1;i<=NF;i++) printf "%s\t",a[i] $i;print"";next} 1' file
If you plan to use Python, you can use zip in the following way:
input = [['L1_I', 'L1_I', 'C-14'], ['WWPTH','WWPT','WWPTH'],[1,2,3]]
output = [[i+j for i,j in zip(input[0],input[1])]] + input[2:]
print output
output:
[['L1_IWWPTH', 'L1_IWWPT', 'C-14WWPTH'], [1, 2, 3]]
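If the table lives in a file instead of a literal list, a small follow-up sketch for building the same structure (the file name "file" and whitespace-separated columns are my assumptions):
with open("file") as f:
    rows = [line.split() for line in f]
header = [h1 + h2 for h1, h2 in zip(rows[0], rows[1])]
output = [header] + rows[2:]
print(output)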
#!/usr/bin/awk -f

# Remember the first row's headings (and how many there are).
NR == 1 {
    n = split($0, a)
    next
}

# Print each saved heading merged with the matching field of the second row.
NR == 2 {
    for (i = 1; i <= n; i++)
        printf "%-20s", a[i] $i
    print ""
    next
}

# Print all remaining lines unchanged.
1