2 questions on Python : created table & fail to find duplicates in rows

I have a data set in a CSV file, in this format:
1st question: I am trying to find duplicate rows in the table I just created in Python (code below).
I tried running the set function over the rows, but the output says there are
no duplicates even though there is a duplicate row in the data set.
2nd question: is it possible to reference this table? I realized that it only becomes a table when I print it, and I would like to use it in the next step for calculations.
COL_1_WIDTH = 10
COL_2_WIDTH = 35
for row in data:
    IC1 = len(str(row[0]))
    IC2 = len(str(row[1]))
    print( str(row[0])+ str( (COL_1_WIDTH-IC1) *' ') +\
           str(row[1]) + str( (COL_2_WIDTH-IC2) *' ') +\
           str(row[2]))
for row in data:
    if len(set(row)) !=len(row):
        print ('duplicates: ', row)
    else:
        print ('no duplicates:', row)
P.S. I am only permitted to use built-in functions and numpy.
Grateful for any ideas. Thank you!

You don't really explain what kind of object 'data' is, so I assumed it was a list of strings.
Here's how I created mine from a csv file:
with open('/home/sebastien/Documents/answerSO.csv') as file:
    data=file.read() #a string
data=data.split('\n') #a list of strings
data.pop() #to delete the last element, an empty string
(note that using the csv module may be a better idea)
Now, to look for duplicates, I used the method explained here:
How do I find the duplicates in a list and create another list with them?
seen = set()
uniq = []
for row in data:
    if row not in seen:
        uniq.append(row)
        seen.add(row)
    else:
        print("found a duplicate:",row)
As for referencing the table: it is already in 'data', which you can reuse in later steps.
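To spell out why the original check reported nothing: set(row) compares the values inside a single row, so it only flags a row whose own fields repeat. Finding repeated rows means comparing each row against the rows seen so far. The following is a minimal sketch of that idea using the csv module mentioned above; the sample data and the name find_duplicate_rows are made up for illustration and are not from the question.

```python
import csv
import io

# Hypothetical sample data standing in for the asker's CSV file.
SAMPLE = """1,apple,3.5
2,banana,1.2
1,apple,3.5
3,cherry,7.0
"""

def find_duplicate_rows(rows):
    """Return (unique_rows, duplicate_rows), keeping first-seen order."""
    seen = set()
    unique, duplicates = [], []
    for row in rows:
        key = tuple(row)  # lists are not hashable, tuples are
        if key in seen:
            duplicates.append(row)
        else:
            seen.add(key)
            unique.append(row)
    return unique, duplicates

# csv.reader yields each line as a list of strings.
data = list(csv.reader(io.StringIO(SAMPLE)))
unique, duplicates = find_duplicate_rows(data)
print("duplicates:", duplicates)
```

Because data and unique are ordinary lists of rows, they can be passed on to the next calculation step directly.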

Related

Python - Get item from a list under a list

I have a list like below.
list = [[Name,ID,Age,mark,subject],[karan,2344,23,87,Bio],[karan,2344,23,87,Mat],[karan,2344,23,87,Eng]]
I need to get only the name 'Karan' as output.
How can I get that?

This is a 2D list,
list[i][j]
will give you the 'i'th list within your list and the 'j'th item within that list.
Indices start at 0, so to get karan you want list[1][0]

I upvoted Lio Elbammalf, but decided to provide an answer that made a couple of assumptions that should have been clarified in the question:
1. The first item of the list is the headers: they are actually in the list (and not there as part of the question), and they are provided as part of the list because there is no guarantee that the headers will always be in the same order.
2. This is probably a CSV file.
Ignoring 2 for the moment, what you would want to do is remove the "headers" from the list (so that the rest of the list is uniform), and then find the index of "Name" (your desired output).
myinput = [["Name","ID","Age","mark","subject"],
           ["karan",2344,23,87,"Bio"],
           ["karan",2344,23,87,"Mat"],
           ["karan",2344,23,87,"Eng"]]
## Remove the headers from the list to simplify everything
headers = myinput.pop(0)
## Figure out where to find the person's Name
nameindex = headers.index("Name")
## Collect the Name from each row (inside a function you would return this list)
names = [stats[nameindex] for stats in myinput]
If the name is guaranteed to be the same in each row, you can just use myinput[0][nameindex], as suggested in the other answer.
Now, if 2 is true, I'm assuming you're using the csv module, in which case load the file using the DictReader class and then just access each row using the 'Name' key:
import csv

def loadfile(myfile):
    with open(myfile) as f:
        reader = csv.DictReader(f)
        return list(reader)

def getname(rows):
    ## This is the same return as above, and again you can just
    ## return rows[0]['Name'] if you know you only need the first one
    return [row['Name'] for row in rows]

In Python 3 you can do this
_, [x, _, _, _, _], *_ = ls
Now x will be karan.

Issues with adding a variable to python gspread

I have started to use the gspread library and already have a sheet that I'd like to append to after the last row that has data in it. I'll retrieve the values between A1 and maxrows and loop through them to check whether they are empty. However, I am unable to add a variable to the second line here. Perhaps I am just not escaping it correctly? I bet this is very simple:
maxrows = "A" + str(worksheet.row_count)
cell_list = worksheet.range('A1:A%s') % (maxrows)

Your variable maxrows is already in the form "An": the concatenation contains both the letter and the number.
But you are adding an extra A to it in worksheet.range('A1:A%s').
You are also not using string interpolation with % correctly: in your code the % is applied to the result of worksheet.range(...) instead of to the range string.
It should have been one of these
maxrows = "A" + str(worksheet.row_count)
worksheet.range('A1:%s' % maxrows)
or
worksheet.range('A1:A%d' % worksheet.row_count)
(among other possible solutions)
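One way to keep the letter and the number from being mixed up is to build the range string in a small helper and only then pass it to worksheet.range. This is a sketch of the string handling only; column_range is a hypothetical name, and row_count here is a plain integer standing in for worksheet.row_count.

```python
def column_range(column, last_row, first_row=1):
    """Build an A1-notation range such as 'A1:A250' for a single column."""
    return "%s%d:%s%d" % (column, first_row, column, last_row)

row_count = 250  # stand-in for worksheet.row_count
print(column_range("A", row_count))
```

The resulting string can then be used as the single argument in worksheet.range(column_range("A", worksheet.row_count)).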

How to remove duplicate lines - only in certain sections? Python 2.7.9

I am trying to consolidate a .txt file into a cleaned version of the data. Currently, the file is structured as the following:
IDENTIFIER: unique values
DATA ONE: more unique values
DATA TWO: more unique values
DATA TWO: more unique values
DATA TWO: more unique values
IDENTIFIER: unique values
DATA ONE: more unique values
DATA TWO: more unique values
DATA TWO: more unique values
IDENTIFIER:
And so on, for about 500 'identifiers'. I want to read this file and simply remove the duplicate "DATA TWO:" lines. While I am familiar with how to remove duplicate lines in general, I need to remove the duplicates within each unique section, to yield:
IDENTIFIER: unique values
DATA ONE: more unique values
DATA TWO: more unique values
The number of "DATA TWO:" lines varies per identifier, usually two or three. It does not matter which of the "DATA TWO:" lines is written to the new file; although each is worded slightly differently, they all capture what I am trying to find, and any one would suffice.
I am relatively new to programming, using Python 2.7.9.

with open("input.txt") as f, open("out.txt", "w") as out:
    found = False
    for line in f:
        # new section always reset flag
        if line.startswith("IDENTIFIER:"):
            out.write(line)
            found = False
        # if first time we have seen DATA TWO write and set flag to true
        elif line.startswith("DATA TWO:") and not found:
            out.write(line)
            found = True
        # ignore lines with "DATA TWO:" if we have already found one in the current section and continue
        elif line.startswith("DATA TWO:"):
            continue
        # else write the other lines in the section
        else:
            out.write(line)
Output using your example input:
IDENTIFIER: unique values
DATA ONE: more unique values
DATA TWO: more unique values
IDENTIFIER: unique values
DATA ONE: more unique values
DATA TWO: more unique values
IDENTIFIER:

You can easily do this by using sets. For instance, if you have a list [1,1,3,3,4,4], calling set([1,1,3,3,4,4]) gives you a set containing only 1, 3 and 4.
>>> lines_lst = open('file.txt', 'r').readlines()
>>> lst_set = set(lines_lst)
>>> output = open('cleanfile.txt', 'w')
>>> for line in lst_set:
...     output.write(line)
...
>>> output.close()
Bear in mind that this solution does not preserve order, and that it only drops lines that are exactly identical.

Python extract substring with location of field and symbols

I have been trying to clean a field in a CSV file. The field is populated with numbers and characters, which I read into a pandas DataFrame and convert to a string.
The goal is to extract the following variables from the long string: StopId, StopCode (there can be several for each record), rte, and routeId. Here is what I have attempted so far.
After extracting the variables listed above, I need to merge them with another file that holds location data for each stop/route/rte.
Sample records for the FIELD:
'Web Log: Page generated Query [cid=SM&rte=50183&dir=S&day=5761&dayid=5761&fst=0%2c&tst=0%2c]'
'Web Log: Page generated Query: [_=1407744540393&agencyId=SM&stopCode=361096&rte=7878%7eBus%7e251&dir=W]'
Web Log: Page generated Query: [_=1407744956001&agencyId=AC&stopCode=55451&stopCode=55452stopCode=55489&&rte=43783%7eBus%7e88&dir=S]
Solutions I tried below, but I am stuck! Advice and recommendations are appreciated
# Idea 1: Splits field above in a loop by '&' into a list. This is useful but I'll
# have to write additional code to pull out relevant variables
i = 0
for t in data['EVENT_DESCRIPTION']:
    s = list(t.split('&'))
    data['STOPS'][i] = [ x for x in s if "Web Log" not in x ]
    i+=1
# Idea 1 next step help - how to pull out necessary variables from the list in data['STOPS']
# Idea 2: Loop through the field and find the start and end of each variable name. The output for stopcode_pl (and the other variables) is a list of tuples, one per occurrence in the string
for i in data['EVENT_DESCRIPTION']:
    stopcode_pl = [(a.start(), a.end() ) for a in list(re.finditer('stopCode=', i))]
    stopid_pl = [(a.start(), a.end() ) for a in list(re.finditer('stopId=', i))]
    rte_pl = [(a.start(), a.end() ) for a in list(re.finditer('rte=', i))]
    routeid_pl = [(a.start(), a.end() ) for a in list(re.finditer('routeId=', i))]
#Idea2: Next Step Help - how to use the string location for variable names to pull the number of the relevant variable. Is there a trick to grab the characters in between the variable name last place (i.e. after the '=' of the variable name) and the next '&'?

This function
def qdata(rec):
    return [tuple(item.split('=')) for item in rec[rec.find('[')+1:rec.find(']')].split('&')]
yields, for instance, on the first record:
[('cid', 'SM'), ('rte', '50183'), ('dir', 'S'), ('day', '5761'), ('dayid', '5761'), ('fst', '0%2c'), ('tst', '0%2c')]
You can then step across that list searching for your specific items.
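As a possible next step, the pairs can be grouped by key so that repeated keys such as stopCode end up in one list. The sketch below is a variation on qdata, not the answer's own code: query_fields is a hypothetical name, it uses str.partition so that empty pieces (for example from '&&') are skipped, and it decodes escapes such as %7e with urllib.parse.unquote from Python 3.

```python
from urllib.parse import unquote

def query_fields(record):
    """Map each key inside the [...] part of a log record to a list of its values."""
    query = record[record.find('[') + 1:record.find(']')]
    fields = {}
    for item in query.split('&'):
        key, sep, value = item.partition('=')
        if not sep:  # skip empty pieces, e.g. the one produced by '&&'
            continue
        fields.setdefault(key, []).append(unquote(value))
    return fields

# Second sample record from the question.
record = ("Web Log: Page generated Query: "
          "[_=1407744540393&agencyId=SM&stopCode=361096&rte=7878%7eBus%7e251&dir=W]")
fields = query_fields(record)
print(fields.get('stopCode'), fields.get('rte'))
```

Using fields.get(...) returns None for keys that a record does not contain, such as stopId in this sample, which makes it easy to fill a DataFrame column record by record.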

Help with Excel, Python and XLRD

I'm relatively new to programming, which is why I've chosen Python to learn with.
At the moment I'm attempting to read a list of usernames and passwords from an Excel spreadsheet with XLRD and use them to log in to something, then back out, go to the next line, log in again, and keep going.
Here is a snippet of the code:
import xlrd
wb = xlrd.open_workbook('test_spreadsheet.xls')
# Load XLRD Excel Reader
sheetname = wb.sheet_names() #Read for XCL Sheet names
sh1 = wb.sheet_by_index(0) #Login
def readRows():
    for rownum in range(sh1.nrows):
        rows = sh1.row_values(rownum)
        userNm = rows[4]
        Password = rows[5]
        supID = rows[6]
        print userNm, Password, supID
print readRows()
I've gotten the variables out and it reads all of them in one shot; here is where my lack of programming skills comes into play. I know I need to iterate through these and do something with them, but I'm kind of lost on what the best practice is. Any insight would be great.
Thank you again

A couple of pointers:
I'd suggest you not print a function that has no return value; instead just call it, or return something to print.
def readRows():
    for rownum in range(sh1.nrows):
        rows = sh1.row_values(rownum)
        userNm = rows[4]
        Password = rows[5]
        supID = rows[6]
        print userNm, Password, supID
readRows()
Or, looking at the docs, you can take a slice from row_values:
row_values(rowx, start_colx=0, end_colx=None)
Returns a slice of the values of the cells in the given row.
Because you just want the columns with index 4 to 6 (end_colx is exclusive, so pass 7):
def readRows():
    # using list comprehension
    return [ sh1.row_values(idx, 4, 7) for idx in range(sh1.nrows) ]
print readRows()
With the second method your function returns a list, which you can assign to a variable holding all of the data you read from the Excel file. The list is actually a list of lists containing your row values.
L1 = readRows()
for row in L1:
    print row[0], row[1], row[2]
After you have your data, you are able to manipulate it by iterating through the list, much like for the print example above.
def login(name, password, id):
    # do stuff with name password and id passed into method
    ...
for row in L1:
    login(*row) # unpack the three values of the row into the three arguments
You may also want to look into different data structures for storing your data. If you need to find a user by name, a dictionary is probably your best bet:
def readRows():
    rows = [ sh1.row_values(idx, 4, 7) for idx in range(sh1.nrows) ]
    # each sliced row is [name, password, id], so index from 0
    return dict([ [row[0], (row[1], row[2])] for row in rows ])
D1 = readRows()
print D1['Bob']
('sdfadfadf',23)
import pprint
pprint.pprint(D1)
{'Bob': ('sdafdfadf',23),
'Cat': ('asdfa',24),
'Dog': ('fadfasdf',24)}
One thing to note is that dictionary entries come back in arbitrary order in Python 2 (insertion order is only guaranteed from Python 3.7 onward).
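The two ideas above, unpacking a row into the login arguments and indexing users by name, can be tried without a spreadsheet. The sketch below uses a hard-coded list in place of the values read by xlrd; the rows, the sup_id parameter name, and the body of login are placeholders for illustration only.

```python
# Stand-in for the [username, password, supervisor id] rows read from the sheet.
rows = [["Bob", "sdfadfadf", 23], ["Cat", "asdfa", 24]]

def login(name, password, sup_id):
    # Placeholder: a real implementation would talk to the target system here.
    return "logging in %s (supervisor %s)" % (name, sup_id)

# *row spreads the three values of each row over the three parameters.
results = [login(*row) for row in rows]

# Index the same data by username for direct lookups.
by_name = {name: (password, sup_id) for name, password, sup_id in rows}
print(results)
print(by_name["Bob"])
```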

I'm not sure if you are intent on using xlrd, but you may want to check out PyWorkbooks (note, I am the writer of PyWorkbooks :D)
from PyWorkbooks.ExWorkbook import ExWorkbook
B = ExWorkbook()
B.change_sheet(0)
# Note: it might be B[:1000, 3:6]. I can't remember if xlrd uses pythonic addressing (0 is first row)
data = B[:1000,4:7] # gets a generator, the '1000' is arbitrarily large.
def readRows():
    while True:
        try:
            userNm, Password, supID = data.next() # you could also do data[0]
            print userNm, Password, supID
            if userNm == None: break # when there is no more data it stops
        except IndexError:
            print 'list too long'
readRows()
You will find that this is significantly faster (and easier I hope) than anything you would have done. Your method will get an entire row, which could be a thousand elements long. I have written this to retrieve data as fast as possible (and included support for such things as numpy).
In your case, speed probably isn't as important. But in the future, it might be :D
Check it out. Documentation is available with the program for newbie users.
http://sourceforge.net/projects/pyworkbooks/

Seems to be good. With one remark: you should replace "rows" by "cells" because you actually read values from cells in every single row
