.csv data into a dictionary in Python: Duplicate values - python

I'm attempting to turn .csv data into a dictionary in Python but I appear to be getting duplicate dictionary entries.
This is an example of what the .csv data looks like:
ticker,1,2,3,4,5,6
XOM,10,15,17,11,13,20
AAPL,12,11,12,13,11,22
My intention is to use the first column as the key and the remaining columns as the values. Ideally I should have 3 entries: ticker, XOM, and AAPL. But instead I get this:
{'ticker': ['1', '2', '3', '4', '5', '6']}
{'ticker': ['1', '2', '3', '4', '5', '6']}
{'XOM': ['10', '15', '17', '11', '13', '20']}
{'ticker': ['1', '2', '3', '4', '5', '6']}
{'XOM': ['10', '15', '17', '11', '13', '20']}
{'AAPL': ['12', '11', '12', '13', '11', '22']}
So it looks like I'm getting row 1, then row 1 & 2, then row 1, 2 & 3.
This is the code I'm using:
def data_pull():
    #gets data out of a .csv file
    datafile = open("C:\sample.csv")
    data = [] #blank list
    dict = {} #blank dictionary
    for row in datafile:
        data.append(row.strip().split(",")) #removes whitespace and commas
        for x in data: #organizes data from list into dictionary
            k = x[0]
            v = x[1:]
            dict = {k:v for x in data}
            print dict
data_pull()
I'm trying to figure out why the duplicate entries are showing up.

You have too many loops; you extend data then loop over the whole data list with all entries gathered so far:
for row in datafile:
    data.append(row.strip().split(",")) #removes whitespace and commas
    for x in data:
        # will loop over all entries parsed so far
so you'd append a row to data, then loop over the list, with one item:
data = [['ticker', '1', '2', '3', '4', '5', '6']]
then you'd read the next line and append to data, so then you loop over data again and process:
data = [
    ['ticker', '1', '2', '3', '4', '5', '6'],
    ['XOM', '10', '15', '17', '11', '13', '20'],
]
so the inner loop runs twice, then you add the next line and it runs three times, and so on.
You could simplify this to:
for row in datafile:
    x = row.strip().split(",")
    dict[x[0]] = x[1:]
You can save yourself some work by using the csv module:
import csv
def data_pull():
    results = {}
    with open("C:\sample.csv", 'rb') as datafile:
        reader = csv.reader(datafile)
        for row in reader:
            results[row[0]] = row[1:]
    return results
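For readers on Python 3, the same idea might look like the sketch below. It is an adaptation, not the answerer's code: in Python 3 the csv documentation recommends opening files in text mode with newline='' rather than 'rb', and here the sample rows from the question are fed in through io.StringIO so the example runs without C:\sample.csv. The name rows_to_dict is made up for illustration.

```python
import csv
import io

# Sample data from the question, held in memory instead of a file on disk.
SAMPLE = "ticker,1,2,3,4,5,6\nXOM,10,15,17,11,13,20\nAAPL,12,11,12,13,11,22\n"


def rows_to_dict(lines):
    """Map the first column of each CSV row to a list of the remaining columns."""
    results = {}
    for row in csv.reader(lines):
        if row:  # skip blank lines
            results[row[0]] = row[1:]
    return results


result = rows_to_dict(io.StringIO(SAMPLE))
print(result)
```

With a real file you would pass the open file object instead, for example by wrapping the call in with open(path, newline='') as f.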

Use the built in csv module:
import csv
output = {}
with open("C:\sample.csv") as f:
    freader = csv.reader(f)
    for row in freader:
        output[row[0]] = row[1:]

The loop for x in data should be outside of the loop for row in datafile:
for row in datafile:
    data.append(row.strip().split(",")) #removes whitespace and commas
for x in data: #organizes data from list into dictionary
    k = x[0]
Or the csv module can be your friend:
import csv
with open("text.csv") as lines:
    print {row[0]: row[1:] for row in csv.reader(lines)}
A side note: it's always a good idea to use raw strings for Windows paths:
open(r"C:\sample.csv")
If your file was named, e.g., C:\text.csv, then \t would be interpreted as a tab character.
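A quick way to see the difference is to compare the two spellings of that path directly. This is only an illustration; the variable names plain and raw are made up.

```python
plain = "C:\text.csv"   # "\t" is read as a single tab character
raw = r"C:\text.csv"    # raw string: the backslash stays a backslash

print(repr(plain))
print(repr(raw))
print(len(plain), len(raw))  # the raw version is one character longer
```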

Related

Writing list and list of list inside the same file in Python

I have a file say, file1.txt which looks something like below.
27,28,29,30,1,0.67
31,32,33,34,1,0.84
35,36,37,38,1,0.45
39,40,41,42,1,0.82
43,44,45,46,1,0.92
43,44,45,46,1,0.92
51,52,53,54,2,0.28
55,56,57,58,2,0.77
59,60,61,62,2,0.39
63,64,65,66,2,0.41
75,76,77,78,3,0.51
90,91,92,93,3,0.97
Where the last column is the fitness and the 2nd last column is the class.
Now I read this file like :
rule_file_name = 'file1.txt'
rule_fp = open(rule_file_name)
list1 = []
for line in rule_fp.readlines():
    list1.append(line.replace("\n","").split(","))
Then a default dictionary was created to ensure the rows are separated according to the classes.
from collections import defaultdict
classes = defaultdict(list)
for _list in list1:
    classes[_list[-2]].append(_list)
Then they are paired up within each class using the below logic.
import random
from operator import itemgetter
from random import sample, seed
seed(1)
for key, _list in classes.items():
    _list=sorted(_list,key=itemgetter(-1),reverse=True)
    length = len(_list)
    middle_index = length // 2
    first_half = _list[:middle_index]
    second_half = _list[middle_index:]
    result=[]
    result=list(zip(first_half,second_half))
Later using the 2 rows of the pair, a 3rd row is being created using the below logic:
ans=[[random.choice(choices) for choices in zip(*item)] for item in result]
So if there were initially 12 rows in the file1, that will now form 6 pairs and hence 6 new rows will be created. I simply want to append those newly created rows to the file1 using below logic:
list1.append(ans)
print(ans)
with open("output.txt", 'w') as out:
    new_rules = [list(map(str, i)) for i in list1]
    for item in new_rules:
        out.write("{}\n".format(",".join(item)))
        #out.write("{}\n".format(item))
But now my output.txt looks like:
27,28,29,30,1,0.67
31,32,33,34,1,0.84
35,36,37,38,1,0.45
39,40,41,42,1,0.82
43,44,45,46,1,0.92
43,44,45,46,1,0.92
51,52,53,54,2,0.28
55,56,57,58,2,0.77
59,60,61,62,2,0.39
63,64,65,66,2,0.41
75,76,77,78,3,0.51
90,91,92,93,3,0.97
['43', '44', '41', '46', '1', '0.82'],['27', '28', '45', '46', '1', '0.92'],['35', '36', '33', '38', '1', '0.84']
['55', '60', '57', '58', '2', '0.77'],['51', '64', '53', '66', '2', '0.28']
['75', '91', '77', '93', '3', '0.51']
But my desired outcome is:
27,28,29,30,1,0.67
31,32,33,34,1,0.84
35,36,37,38,1,0.45
39,40,41,42,1,0.82
43,44,45,46,1,0.92
43,44,45,46,1,0.92
51,52,53,54,2,0.28
55,56,57,58,2,0.77
59,60,61,62,2,0.39
63,64,65,66,2,0.41
75,76,77,78,3,0.51
90,91,92,93,3,0.97
43,44,41,46,1,0.82
27,28,45,46,1,0.92
35,36,33,38,1,0.84
55,60,57,58,2,0.77
51,64,53,66,2,0.28
75,91,77,93,3,0.51
I would use numpy; it is flexible and compact.
import numpy as np
fin = 'file1.txt'
col1, col2, col3, col4, jclass, fitness = np.loadtxt(fin, unpack=True, delimiter=',')
rows = np.column_stack((col1, col2, col3, col4, jclass, fitness))
print(rows[0])
print(rows[-1])
print(fitness)
Then apply your logic to the rows array.
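Staying with plain lists, the bracketed lines in output.txt look like what you would get if list1.append(ans) added each whole list of new rows as a single element. One possible fix, sketched below on a shortened copy of the data, is to use extend so that the new rows are added one by one. The name new_rows and the four sample rows are assumptions made for the example, and lines are collected in output_lines instead of being written to a file.

```python
import random
from collections import defaultdict
from operator import itemgetter

random.seed(1)

# A few rows shaped like file1.txt (class in column -2, fitness in column -1).
list1 = [
    ['27', '28', '29', '30', '1', '0.67'],
    ['31', '32', '33', '34', '1', '0.84'],
    ['51', '52', '53', '54', '2', '0.28'],
    ['55', '56', '57', '58', '2', '0.77'],
]

classes = defaultdict(list)
for row in list1:
    classes[row[-2]].append(row)

new_rows = []
for key, rows in classes.items():
    rows = sorted(rows, key=itemgetter(-1), reverse=True)
    middle = len(rows) // 2
    pairs = zip(rows[:middle], rows[middle:])
    # Build one new row per pair by picking each column from either parent.
    ans = [[random.choice(choices) for choices in zip(*pair)] for pair in pairs]
    new_rows.extend(ans)  # extend adds the rows themselves, not one nested list

list1.extend(new_rows)
output_lines = [",".join(row) for row in list1]
print("\n".join(output_lines))
```

Every element of list1 is then a flat list of strings, so ",".join(item) produces one comma-separated line per row.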

How to check if a item is the first item in a list

I am trying to read a dat file and extract certain information from the dat file.
My code looks like this:
datContent = [i.strip().split() for i in open("data.dat").readlines()]
positions = []
myItem = 'ST'
# write it as a new CSV file
for list in datContent:
    if myItem in list:
        positions.append(list)
I would like to check whether an item is the first item in the list, and I want the two lists below it. How do I do that?
If you want the second list after each list whose first item is myItem, you can use:
[s for f, s in zip(datContent, datContent[2:]) if f[0] == myItem]
example:
datContent = [['ST', '1', '2', '3'], ['1', '5', '3'],['2', '6', '3'],['ST', '2', '4'], ['ST', '2', '2'],['2', '6', '3']]
myItem = 'ST'
[s for f, s in zip(datContent, datContent[2:]) if f[0] == myItem]
output:
[['2', '6', '3'], ['2', '6', '3']]
You can have a look at the built-in zip function.
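If the goal is to collect both lists that follow each match rather than only the second one, a slice per match is one option. This is a sketch on the same example data; the name following is made up, and a match near the end of the data simply yields fewer than two rows.

```python
datContent = [['ST', '1', '2', '3'], ['1', '5', '3'], ['2', '6', '3'],
              ['ST', '2', '4'], ['ST', '2', '2'], ['2', '6', '3']]
myItem = 'ST'

following = [
    datContent[i + 1:i + 3]          # the (up to) two rows after the match
    for i, row in enumerate(datContent)
    if row and row[0] == myItem      # "row and" guards against empty rows
]
print(following)
```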

how to split a list every nth item

I am trying to split a list every 5th item, then delete the next two items ('nan'). I have attempted to use List[:5], but that does not seem to work in a loop. The desired output is: [['1','2','3','4','5'],['1','2','3','4','5'],['1','2','3','4','5'],['1','2','3','4','5']]
List = ['1','2','3','4','5','nan','nan','1','2','3','4','5','nan','nan','1','2','3','4','5','nan','nan','1','2','3','4','5','nan','nan']
for i in List:
    # split first 5 items
    # delete next two items
# Desired output:
# [['1','2','3','4','5'],['1','2','3','4','5'],['1','2','3','4','5'],['1','2','3','4','5']]
There are lots of ways to do this. I recommend stepping by 7, then slicing off the first 5 items of each step.
data = ['1','2','3','4','5','nan','nan','1','2','3','4','5','nan','nan','1','2','3','4','5','nan','nan','1','2','3','4','5','nan','nan']
# Step by 7 and keep the first 5
chunks = [data[i:i+5] for i in range(0, len(data), 7)]
print(*chunks, sep='\n')
Output:
['1', '2', '3', '4', '5']
['1', '2', '3', '4', '5']
['1', '2', '3', '4', '5']
['1', '2', '3', '4', '5']
Reference: Split a python list into other “sublists”...
WARNING: make sure the list follows the pattern you described: every 5 items are followed by 2 'nan' entries.
This loop will add the first 5 items as a list, and delete the first 7 items.
lst = ['1','2','3','4','5','nan','nan','1','2','3','4','5','nan','nan','1','2','3','4','5','nan','nan','1','2','3','4','5','nan','nan']
output = []
while True:
    if len(lst) <= 0:
        break
    output.append(lst[:5])
    del lst[:7]
print(output) # [['1', '2', '3', '4', '5'], ['1', '2', '3', '4', '5'], ['1', '2', '3', '4', '5'], ['1', '2', '3', '4', '5']]
List=['1','2','3','4','5','nan','nan','1','2','3','4','5','nan','nan','1','2','3','4','5','nan','nan','1','2','3','4','5','nan','nan']
new_list = list()
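for k in range(len(List)//7):
    new_list.append(List[k*7:k*7+5])
remainder = len(List) % 7
if remainder: # keep a trailing partial group, if any, without its 'nan' padding
    new_list.append(List[-remainder:][:5])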
A straightforward solution in case the list doesn't follow the pattern you mentioned but you always want to split the sequence at the 'nan' entries:
result, temp = [], []
for item in lst:
    if item != 'nan':
        temp.append(item)
    elif temp:
        result.append(list(temp))
        temp = []
if temp: # keep the last group when the list does not end with 'nan'
    result.append(temp)
Using itertools.groupby would also support chunks of different lengths:
from itertools import groupby
[list(v) for k, v in groupby(List, key='nan'.__ne__) if k]
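To make the groupby variant concrete, here is a small self-contained sketch on a list whose chunks have different lengths. The sample list and the names data and chunks are made up for the example, and a lambda is used as the key instead of 'nan'.__ne__ purely for readability.

```python
from itertools import groupby

data = ['1', '2', '3', 'nan', 'nan', '4', '5', 'nan', '6', '7', '8', '9']

# groupby yields (key, group) pairs for runs of consecutive items that share a key.
# The key is True for real values and False for 'nan', so only True runs are kept.
chunks = [
    list(group)
    for is_value, group in groupby(data, key=lambda item: item != 'nan')
    if is_value
]
print(chunks)  # [['1', '2', '3'], ['4', '5'], ['6', '7', '8', '9']]
```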
I guess there is a more Pythonic way to do the same, but:
result = []
while (len(List) > 5):
    result.append(List[0:0+5])
    del List[0:0+5]
    del List[0:2]
This results in: [['1', '2', '3', '4', '5'], ['1', '2', '3', '4', '5'], ['1', '2', '3', '4', '5'], ['1', '2', '3', '4', '5']]
mainlist=[]
sublist=[]
count=0
for i in List:
    if i != "nan":
        # collect the next item of the current group of 5
        sublist.append(i)
        count += 1
        if count == 5:
            # group complete; the following 'nan' items are skipped by the check above
            mainlist.append(sublist)
            count = 0
            sublist = []
Generally numpy.split(...) will do any kind of custom splitting for you. Some reference:
https://docs.scipy.org/doc/numpy/reference/generated/numpy.split.html
And the code:
import numpy as np
lst = ['1','2','3','4','5','nan','nan','1','2','3','4','5','nan','nan','1','2','3','4','5','nan','nan','1','2','3','4','5','nan','nan']
ind=np.ravel([[i*7+5, (i+1)*7] for i in range(len(lst)//7)])
lst2=np.split(lst, ind)[:-1:2]
print(lst2)
Outputs:
[array(['1', '2', '3', '4', '5'], dtype='<U3'), array(['1', '2', '3', '4', '5'], dtype='<U3'), array(['1', '2', '3', '4', '5'], dtype='<U3'), array(['1', '2', '3', '4', '5'], dtype='<U3')]
I like the slice answers.
Here are my 2 cents.
# changed var name away from var type
myList = ['1','2','3','4','5','nan','nan','1','2','3','4','10','nan','nan','1','2','3','4','15','nan','nan','1','2','3','4','20','nan','nan']
newList = [] # declare new list of lists to create
addItem = [] # declare temp list
myIndex = 0 # declare temp counting variable
for i in myList:
    myIndex +=1
    if myIndex==6:
        pass #do nothing
    elif myIndex==7: #add sub list to new list and reset variables
        if len(addItem)>0:
            newList.append(list(addItem))
        addItem=[]
        myIndex = 0
    else:
        addItem.append(i)
#output
print(newList)

Getting data from a list on a specific line in a file (python)

I've got a very large file that has a format like this:
[['1', '2', '3', '4']['11', '12', '13', '14']]
[['5', '6', '7', '8']['55', '66', '77', '88']]
(numbers indicate line number)
The lists on each line are very long, unlike this example.
Now if it was only 1 list I could for example obtain the '11' value with:
itemdatatxt = open("tempoutput", "r")
itemdata = eval(itemdatatxt.read())
print itemdata[1][0]
However because the file contains a new list on each line I cannot see how I can for example obtain the '55' value.
I thought itemdatatxt.readline(1) would select the second line of the file but after reading about the .readline I understand that this would result in the 2nd symbol on the first line.
Can anyone explain to me how to do this? (preferably I wouldn't want to change the 'tempoutput' datafile format)
Try this:
import ast
with open("tempoutput", "r") as f:
    for i, line in enumerate(f):
        if i == 1:
            itemdata = ast.literal_eval(line)
            print itemdata[1][0]
            break
enumerate(f) returns:
0, <<first line>>
1, <<second line>>
...
So when i becomes 1, we've reached second line and we output 55. We also break the loop since we don't care about reading the rest of the lines.
I used ast.literal_eval because it's a safer form of eval.
You can add the whole file to a dictionary where the key is the line number and the value is the content (the two lists). This way you can easily get any value you want by selecting first the line number, then the list and then the index.
data.txt
[['1', '2', '3', '4'], ['11', '12', '13', '14']]
[['5', '6', '7', '8'], ['55', '66', '77', '88']]
[['5', '6', '3', '8'], ['155', '66', '277', '88']]
code
import ast
data = {}
with open('data.txt', 'r') as f:
    for indx, ln in enumerate(f):
        data[indx] = ast.literal_eval(ln.strip())
print data[1][1][0] #55
print data[1][1][3] #88
readline() reads until the next line break. If you call it a second time, it will read from where it stopped to the line break after that. Thus, you could have a loop:
lines = []
with open('filepath', 'r') as f:
    line = f.readline()
    while line: # readline() returns '' at the end of the file
        lines.append(eval(line))
        line = f.readline()
print lines # [[['1', '2', '3', '4'],['11', '12', '13', '14']],
            # [['5', '6', '7', '8'],['55', '66', '77', '88']]]
Or you could read the entire file and split by linebreak:
lines = open('filepath', 'r').read().split('\n')
Alternatively if you want to read a specific line you can use the linecache module:
import linecache
line = linecache.getline('filepath', 2) # 2 is the second line of the file
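Putting linecache together with ast.literal_eval from the earlier answer, fetching the '55' value could look roughly like this. It is a sketch in Python 3 that writes two sample lines (in the comma-separated form used in data.txt above) to a temporary file so it can run on its own; the names content, path and value are made up. Note that linecache reads and caches the file's lines, so for a very large file the enumerate loop that stops early may use less memory.

```python
import ast
import linecache
import os
import tempfile

# Two sample lines in the same format as data.txt above.
content = (
    "[['1', '2', '3', '4'], ['11', '12', '13', '14']]\n"
    "[['5', '6', '7', '8'], ['55', '66', '77', '88']]\n"
)
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write(content)
    path = tmp.name

line = linecache.getline(path, 2)          # line numbers start at 1
itemdata = ast.literal_eval(line.strip())  # parse the text as a Python literal
value = itemdata[1][0]
print(value)  # 55

# Clean up the cache entry and the temporary file.
linecache.clearcache()
os.remove(path)
```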

Remove new line \n reading from CSV

I have a CSV file which looks like:
Name1,1,2,3
Name2,1,2,3
Name3,1,2,3
I need to read it into a 2D list line by line. The code I have written almost does the job; however, I am having problems removing the new line characters '\n' at the end of the third index.
score=[]
for eachLine in file:
    student = eachLine.split(',')
    score.append(student)
print(score)
The output currently looks like:
[['name1', '1', '2', '3\n'], ['name2', '1', '2', '3\n'],
I need it to look like:
[['name1', '1', '2', '3'], ['name2', '1', '2', '3'],
Simply call str.strip on each line as you process it:
score=[]
for eachLine in file:
    student = eachLine.strip().split(',')
    score.append(student)
print(score)
You can use splitlines
First method
>>> s = '''Name1,1,2,3
... Name2,1,2,3
... Name3,1,2,3'''
>>> [ item.split(',') for item in s.splitlines() ]
[['Name1', '1', '2', '3'], ['Name2', '1', '2', '3'], ['Name3', '1', '2', '3']]
Second method
>>> l = []
>>> for item in s.splitlines():
...     l.append(item.split(','))
...
>>> l
[['Name1', '1', '2', '3'], ['Name2', '1', '2', '3'], ['Name3', '1', '2', '3']]
If you know it's a \n, and a \n only,
score=[]
for eachLine in file:
    student = eachLine[:-1].split(',')
    score.append(student)
print(score)
Uses slicing to remove the trailing new line characters before the split happens.
EDITED, per the suggestions of the commenters ;) Much neater.
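Use the rstrip method to remove the \n at the end of every line. Note that rstrip returns a new string, so the result has to be assigned, and the file must be opened for reading.
See the code below for reference.
score = []
with open('myfile.csv', 'r') as file:
    for line in file:
        line = line.rstrip('\n')
        score.append(line.split(','))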
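Another option, borrowed from the csv answers earlier on this page rather than from this thread, is to let csv.reader do the splitting, since it drops the line terminator from each row. The sketch below reads the sample data from an in-memory string through io.StringIO instead of a real file.

```python
import csv
import io

text = "Name1,1,2,3\nName2,1,2,3\nName3,1,2,3\n"

# csv.reader splits each line on commas and leaves out the trailing newline.
score = list(csv.reader(io.StringIO(text)))
print(score)
# [['Name1', '1', '2', '3'], ['Name2', '1', '2', '3'], ['Name3', '1', '2', '3']]
```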
