Search of elements inside a big CSV file using Python

I'm trying to filter a CSV file and get the fifth value of each list inside another list, but I keep getting an "index out of range" error.
import csv
from operator import itemgetter
teste=[]
f = csv.reader(open('power_supply_info.csv'), delimiter =',' )
for word in f:
    teste.append(word)
#print teste
#print ('\n')
print map( itemgetter(5), teste)
But I got this error:
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Users\rafael.paiva\Dev\Python2.7\WinPython-64bit-2.7.6.4\python-2.7.6.amd64\lib\site-packages\spyderlib\widgets\externalshell\sitecustomize.py", line 540, in runfile
    execfile(filename, namespace)
  File "C:/Users/rafael.paiva/Desktop/Rafael/CSV.py", line 24, in <module>
    print map( itemgetter(5), teste)
IndexError: list index out of range
This is what the "word" variable holds once each row has been appended to "teste" in the steps above:
[['2015-12-31-21:02:30.754271', '25869', '500000', 'Unknown', '1', '0', '4790780', '1', '0', '0', '375', '0', '-450060', '-326040', '3437000', 'Normal', 'N/A', '93', 'Good', '19', '1815372', 'Unknown', 'Charging', '4195078', '4440000', '4208203', '4171093', '0', '44290', 'Li-ion', '95', '1', '3000000', '1', '375', '-450060', '-326040', '3437000', '93', 'Good', '1815372', '4195000', '4440000', '4208203', '4165625', '0', '44290', '95', '3000000', '1', ''],
['2015-12-31-21:03:30.910972', '25930', '500000', 'Unknown', '1', '0', '4794730', '1', '0', '0', '377', '0', '55692', '107328', '3437000', 'Normal', 'N/A', '92', 'Good', '19', '1814234', 'Unknown', 'Charging', '4200390', '4440000', '4207734', '4214062', '0', '41200', 'Li-ion', '95', '1', '3000000', '1', '377', '55692', '107328', '3437000', '92', 'Good', '1814234', '4200390', '4440000', '4207734', '4214062', '0', '41200', '95', '3000000', '1', '']]
Can someone help me with this, please?

You should add some diagnostics to your loop; this will help show where the problem might be in your CSV file:
import csv
from operator import itemgetter
teste = []
with open('power_supply_info.csv', 'rb') as f_input:
    for line, words in enumerate(csv.reader(f_input, delimiter =',' ), start=1):
        if len(words) <= 5:
            print "Line {} only has {} elements".format(line, len(words))
        else:
            # Keep only rows long enough for itemgetter(5)
            teste.append(words)
print map(itemgetter(5), teste)
It is likely that one of your lines is either blank or has too few entries; this script will list which line numbers have problems.
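For readers on Python 3, the same diagnostic idea might be sketched as follows. The helper name column_values and the inline sample data are assumptions made for illustration; they stand in for the real power_supply_info.csv, whose contents are not known.

```python
import csv
import io

def column_values(lines, index):
    """Return (values, problems): the value at `index` for every row that is
    long enough, and (line_number, row_length) for every row that is not."""
    values, problems = [], []
    for line_number, row in enumerate(csv.reader(lines), start=1):
        if len(row) <= index:
            problems.append((line_number, len(row)))
        else:
            values.append(row[index])
    return values, problems

# Hypothetical sample: two complete rows, one short row, one blank line.
sample = io.StringIO("a,b,c,d,e,f,g\n1,2,3,4,5,6,7\nx,y\n\n")
values, problems = column_values(sample, 5)
print(values)    # values taken from the rows that were long enough
print(problems)  # (line number, length) of each row that was too short
```

Collecting the problem rows instead of printing them immediately keeps the reading loop reusable; with a real file, pass the object returned by open(path, newline='') in place of the StringIO sample.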

I don't know what's in your power_supply_info.csv file, but it's clear what you have after csv.reader has done its job:
a list containing 2 lists (i.e., 2 elements).
That's why indexing the outer list at position 5 raises an error: it has only 2 elements.
A possible approach to your problem:
import csv
f = csv.reader(open('power_supply_info.csv'), delimiter =',' )
# First iterate over the rows and then get each list in the row
teste = [x for x in (row for row in f)]
print map(lambda x: x[5], teste)
The real challenge would be to see the input in your CSV file, to understand why you end up with those 2 lists inside a list.
Note: If the output you showed is the content of teste and not of word, the code could be:
import csv
f = csv.reader(open('power_supply_info.csv'), delimiter =',' )
teste = [row for row in f]
print [x[5] for x in teste]
Best regards

The code you show works correctly with the data sample you have provided:
In [8]: l = [['2015-12-31-21:02:30.754271', '25869', '500000', 'Unknown', '1', '0', '4790780', '1', '0', '0'],
...: ['2015-12-31-21:03:30.910972', '25930', '500000', 'Unknown', '1', '0', '4794730', '1', '0', '0']]
In [9]: list(map(itemgetter(5),l))
Out[9]: ['0', '0']
I suspect that one line (probably the last line) in your CSV file is blank, therefore the last element of teste is actually an empty list, and therefore itemgetter(5) fails for that last line.
Instead of cramming everything into a single line, try
for item in teste:
    if item:
        print item[5]
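The blank-line explanation can be checked with a small Python 3 sketch. The three-line sample text is an assumption made for illustration; the point is only that csv.reader turns a blank line into an empty list, which has no index 5.

```python
import csv
import io
from operator import itemgetter

# Hypothetical sample: two data lines followed by a blank line at the end.
text = "t1,1,2,3,4,5\nt2,6,7,8,9,10\n\n"
rows = list(csv.reader(io.StringIO(text)))
print(rows[-1])  # the blank line is parsed as an empty list

try:
    list(map(itemgetter(5), rows))
    failed = False
except IndexError:
    failed = True  # the empty row has no element at index 5

# Skipping empty rows avoids the error.
sixth = [row[5] for row in rows if row]
print(failed, sixth)
```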

Related

Increasing order Python

This code makes a connection to HBase and then prints the result of a given row:
import happybase
connection = happybase.Connection('MacBook-Air.local')
table = connection.table('twitter_db')
row = table.row('2018-09-21 11:55:24')
print(row)
But the result is printed like this:
{'Hashtag:#AFDs': '1', 'Hashtag:#Job': '1', 'Hashtag:#pumpkinpoundcake': '1', 'Lang:und': '81', 'Lang:pt ': '17', 'Hashtag:#InsomniaInFourWords': '2', 'Hashtag:#thegreatindianbooktour': '1', 'Hashtag:#pdx911': '2', 'Hashtag:#US': '1', 'Lang:en ': '246', 'Lang:es ': '31', 'Hashtag:#travelling': '1', 'Hashtag:#prohibition': '1', 'Hashtag:#FF': '2', 'Lang:in ': '15'}
I would like to print, on one side, all the hashtags with their counts in order and, on the other, all the languages with their counts in order. For example:
'Hashtag:#FF': '2'
'Hashtag:#AFDs': '1'
'Hashtag:#Job': '1'
.........
----------------
'Lang:en ': '246'
'Lang:und': '81'
'Lang:es ': '31'
.......
For example I could create a method but how?

How to pair 2 list into 1 list

I have a code like this:
def datauji(self):
    uji = []
    for x in self.fiturs:
        a = [x[0],x[-5:]] #I think the problem in this line
        uji.append(a)
    return uji
with open('DataUjiBaru.csv','wb') as dub:
    testing = csv.writer(dub)
    datatest = d.datauji()
    datatest.pop(0)
    for x in datatest:
        testing.writerow(x)
I want to pair the values in self.fiturs, which contains:
F37,0,1,0,1,1,1,0,1,0,2,1,0,0,0,1
F10,8,4,3,3,3,6,8,5,8,4,8,4,5,6,4
F8,1,0,2,0,0,0,2,0,0,0,0,0,2,0,0
So I want to pair index [0] with the slice [-5:] and write the result to the CSV. The current output in the CSV looks like this:
F37,"['1', '0', '0', '0', '1']"
F10,"['8', '4', '5', '6', '4']"
F8,"['0', '0', '2', '0', '0']"
What I expect in that CSV is this:
F37,1,0,0,0,1
F10,8,4,5,6,4
F8,0,0,2,0,0
How can I fix that?

You were correct about where the issue is; it is in this line:
a = [x[0],x[-5:]]
This creates nested items that look like this:
['F37', ['1', '0', '0', '0', '1']]
Here are two ways to fix this:
Option 1 - Use the splat (*) operator (unpacking inside a list literal needs Python 3.5 or later):
a = [x[0],*x[-5:]]
Option 2 - Concatenate two slices of your list:
a = x[:1] + x[-5:]
Both of these will remove the nesting of your lists, and instead give you lines looking like:
['F37', '1', '0', '0', '0', '1']
Which you can then write to your output file.
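As a rough end-to-end check, the slice concatenation from Option 2 can be run against the three sample rows from the question. This is a Python 3 sketch: the fiturs list here is a hypothetical stand-in for self.fiturs, and an in-memory buffer replaces DataUjiBaru.csv.

```python
import csv
import io

# Hypothetical rows standing in for self.fiturs from the question.
fiturs = [
    "F37,0,1,0,1,1,1,0,1,0,2,1,0,0,0,1".split(","),
    "F10,8,4,3,3,3,6,8,5,8,4,8,4,5,6,4".split(","),
    "F8,1,0,2,0,0,0,2,0,0,0,0,0,2,0,0".split(","),
]

# Option 2: first element plus the last five, as one flat list per row.
flat_rows = [row[:1] + row[-5:] for row in fiturs]

# Write to an in-memory buffer instead of a real file.
buffer = io.StringIO()
csv.writer(buffer, lineterminator="\n").writerows(flat_rows)
output = buffer.getvalue()
print(output)
```

When writing to a real file in Python 3, open it with open('DataUjiBaru.csv', 'w', newline='') rather than 'wb', since the csv module there expects text mode.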

Getting data from a list on a specific line in a file (python)

I've got a very large file that has a format like this:
[['1', '2', '3', '4'], ['11', '12', '13', '14']]
[['5', '6', '7', '8'], ['55', '66', '77', '88']]
(numbers indicate line number)
The lists on each line are very long, unlike this example.
Now if it was only 1 list I could for example obtain the '11' value with:
itemdatatxt = open("tempoutput", "r")
itemdata = eval(itemdatatxt.read())
print itemdata[1][0]
However because the file contains a new list on each line I cannot see how I can for example obtain the '55' value.
I thought itemdatatxt.readline(1) would select the second line of the file but after reading about the .readline I understand that this would result in the 2nd symbol on the first line.
Can anyone explain to me how to do this? (preferably I wouldn't want to change the 'tempoutput' datafile format)

Try this:
import ast
with open("tempoutput", "r") as f:
    for i, line in enumerate(f):
        if i == 1:
            itemdata = ast.literal_eval(line)
            print itemdata[1][0]
            break
enumerate(f) returns:
0, <<first line>>
1, <<second line>>
...
So when i becomes 1, we've reached second line and we output 55. We also break the loop since we don't care about reading the rest of the lines.
I used ast.literal_eval because it's a safer form of eval.
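The same "stop at the wanted line" idea could also be written with itertools.islice, which skips ahead without keeping earlier lines in memory. This is a Python 3 sketch: the helper name literal_on_line is hypothetical, and a temporary file stands in for the tempoutput file from the question.

```python
import ast
import itertools
import os
import tempfile

def literal_on_line(path, line_index):
    """Parse and return the Python literal stored on the zero-based line."""
    with open(path, "r") as f:
        # islice yields only the requested line; None if the file is too short.
        line = next(itertools.islice(f, line_index, line_index + 1), None)
    if line is None:
        raise IndexError("file has no line at index %d" % line_index)
    return ast.literal_eval(line.strip())

# Hypothetical stand-in for the 'tempoutput' file.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("[['1', '2', '3', '4'], ['11', '12', '13', '14']]\n")
    tmp.write("[['5', '6', '7', '8'], ['55', '66', '77', '88']]\n")

itemdata = literal_on_line(tmp.name, 1)
print(itemdata[1][0])
os.remove(tmp.name)
```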

You can add the whole file to a dictionary where the key is the line number and the value is the content (the two lists). This way you can easily get any value you want by selecting first the line number, then the list and then the index.
data.txt
[['1', '2', '3', '4'], ['11', '12', '13', '14']]
[['5', '6', '7', '8'], ['55', '66', '77', '88']]
[['5', '6', '3', '8'], ['155', '66', '277', '88']]
code
import ast
data = {}
with open('data.txt', 'r') as f:
    for indx, ln in enumerate(f):
        data[indx] = ast.literal_eval(ln.strip())
print data[1][1][0] #55
print data[1][1][3] #88

readline() reads until the next line break. If you call it a second time, it reads from where it stopped to the line break after that. Thus, you could have a loop:
lines = []
with open('filepath', 'r') as f:
    line = f.readline()
    while line:  # readline() returns '' at end of file
        lines.append(eval(line))
        line = f.readline()
print lines # [[['1', '2', '3', '4'],['11', '12', '13', '14']],
            # [['5', '6', '7', '8'],['55', '66', '77', '88']]]
Or you could read the entire file and split by linebreak:
lines = open('filepath', 'r').read().split('\n')
Alternatively if you want to read a specific line you can use the linecache module:
import linecache
line = linecache.getline('filepath', 2) # 2 is the second line of the file
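A slightly fuller Python 3 sketch of the linecache route is shown below, with a temporary file as a hypothetical stand-in for the real data file. Two details worth knowing: getline counts lines from 1, and it returns an empty string rather than raising when the line does not exist. As far as I know, linecache reads and caches the whole file on first use, so for a very large file the line-by-line loop may be the more memory-friendly choice.

```python
import ast
import linecache
import os
import tempfile

# Hypothetical stand-in for the real data file.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("[['1', '2', '3', '4'], ['11', '12', '13', '14']]\n")
    tmp.write("[['5', '6', '7', '8'], ['55', '66', '77', '88']]\n")

second = linecache.getline(tmp.name, 2)    # line numbers start at 1
missing = linecache.getline(tmp.name, 99)  # '' when there is no such line

# ast.literal_eval is used here instead of eval, as in the answer above.
value = ast.literal_eval(second.strip())[1][0]
print(value)

linecache.clearcache()  # drop the cached copy of the file
os.remove(tmp.name)
```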

How do I create a python list with unique indexes from a comma delimited string with duplicate values?

I'm trying to create a list from a string of comma delimited values in python using split(). I am observing when I do this my list appears to have multiple indexes that are the same, which appears to be because some of the values are the same. I'd like to have each element have its own sequential index, so I can use the index to access them positionally, how do I do this? Here is the code for context:
haproxy_socket_data ='''
pxname,svname,qcur,qmax,scur,smax,slim,stot,bin,bout,dreq,dresp,ereq,econ,eresp,wretr,wredis,status,weight,act,bck,chkfail,chkdown,lastchg,downtime,qlimit,pid,iid,sid,throttle,lbtot,tracked,type,rate,rate_lim,rate_max,check_status,check_code,check_duration,hrsp_1xx,hrsp_2xx,hrsp_3xx,hrsp_4xx,hrsp_5xx,hrsp_other,hanafail,req_rate,req_rate_max,req_tot,cli_abrt,srv_abrt,
fe,FRONTEND,,,0,1,2000,45,0,8415,0,0,45,,,,,OPEN,,,,,,,,,1,1,0,,,,0,0,0,1,,,,0,0,0,45,0,0,,0,1,45,,,
bend,host1,0,0,0,0,,0,0,0,,0,,0,0,0,0,UP,1,1,0,0,0,113,0,,1,2,1,,0,,2,0,,0,L4OK,,0,0,0,0,0,0,0,0,,,,0,0,
'''
haproxy_socket_data = haproxy_socket_data.splitlines()
for line in haproxy_socket_data:
    stats = line.split(',')
    print line
    print stats
    for i in stats:
        print i
        print "index: %s" % stats.index(i)
Here is the output of this code: https://gist.github.com/wjimenez5271/74df2b16b540a7d9de0c
I found these examples of how to get this data into a list, but none of them addresses my situation, where some values are the same:
How can I split this comma-delimited string in Python?
How to convert comma-delimited string to list in Python?

You're misunderstanding what index() does. The Python documentation says:
s.index(x[, i[, j]])
index of the first occurrence of x in s (at or after index i and before index j)
So, each time you call stats.index(i) in your code, the index of first occurrence of i in stats will be returned.
If you want to keep track of the index of elements of a list as you iterate over it, you want enumerate():
for index, item in enumerate(stats):
    print item
    print "index: %s" % index

It only seems like you have duplicate indices: list.index() in Python returns the index of the first occurrence of a value. Try a for loop that tracks the position of each element explicitly, rather than a for-in loop that only yields the values.

If you want to keep the index, use a for loop with enumerate, or with range():
haproxy_socket_data = """
pxname,svname,qcur,qmax,scur,smax,slim,stot,bin,bout,dreq,dresp,ereq,econ,eresp,wretr,wredis,status,weight,act,bck,chkfail,chkdown,lastchg,downtime,qlimit,pid,iid,sid,throttle,lbtot,tracked,type,rate,rate_lim,rate_max,check_status,check_code,check_duration,hrsp_1xx,hrsp_2xx,hrsp_3xx,hrsp_4xx,hrsp_5xx,hrsp_other,hanafail,req_rate,req_rate_max,req_tot,cli_abrt,srv_abrt,
fe,FRONTEND,,,0,1,2000,45,0,8415,0,0,45,,,,,OPEN,,,,,,,,,1,1,0,,,,0,0,0,1,,,,0,0,0,45,0,0,,0,1,45,,,
bend,
"""
haproxy_socket_data = haproxy_socket_data.splitlines()
for line in haproxy_socket_data:
    stats = [item for item in line.split(',') if len(item) >= 1] #Gets rid of items like ['']
    print line
    print stats
    for ind, it in enumerate(stats):
        print it
        print "index: %d" % ind
Or, use range(len()):
haproxy_socket_data ="""
pxname,svname,qcur,qmax,scur,smax,slim,stot,bin,bout,dreq,dresp,ereq,econ,eresp,wretr,wredis,status,weight,act,bck,chkfail,chkdown,lastchg,downtime,qlimit,pid,iid,sid,throttle,lbtot,tracked,type,rate,rate_lim,rate_max,check_status,check_code,check_duration,hrsp_1xx,hrsp_2xx,hrsp_3xx,hrsp_4xx,hrsp_5xx,hrsp_other,hanafail,req_rate,req_rate_max,req_tot,cli_abrt,srv_abrt,
fe,FRONTEND,,,0,1,2000,45,0,8415,0,0,45,,,,,OPEN,,,,,,,,,1,1,0,,,,0,0,0,1,,,,0,0,0,45,0,0,,0,1,45,,,
bend,
"""
haproxy_socket_data = haproxy_socket_data.splitlines()
for line in haproxy_socket_data:
    stats = [item for item in line.split(',') if len(item) >= 1] #Gets rid of items like ['']
    print line
    print stats
    for i in range(len(stats)):
        print stats[i]
        print "index: %d" % i
list.index() returns the first occurrence of the item:
>>> item = [1, 2, 5, 7, 3, 3, 8, 9, 5]
>>> item.index(5)
2
>>> item[2]
5
>>> item[8]
5
>>>
Using enumerate():
>>> for ind, it in enumerate(item):
...     if it == 5:
...         print ind
...
2
8
>>>

If data values have a comma in them, then the straightforward split(",") won't be correct.
Check the csv module. It supports figuring out ("sniffing") the proper split and quote parameters. It also lets you read each row of data into a dictionary, so you can refer to data by name. No more column counting!
Example. Note the backslash after the opening quotes, so the data starts with the header line and the sniffer can read the header from the first line:
haproxy_socket_data ='''\
pxname,svname,qcur,qmax,scur,smax,slim,stot,bin,bout,dreq,dresp,ereq,econ,eresp,wretr,wredis,status,weight,act,bck,chkfail,chkdown,lastchg,downtime,qlimit,pid,iid,sid,throttle,lbtot,tracked,type,rate,rate_lim,rate_max,check_status,check_code,check_duration,hrsp_1xx,hrsp_2xx,hrsp_3xx,hrsp_4xx,hrsp_5xx,hrsp_other,hanafail,req_rate,req_rate_max,req_tot,cli_abrt,srv_abrt,
fe,FRONTEND,,,0,1,2000,45,0,8415,0,0,45,,,,,OPEN,,,,,,,,,1,1,0,,,,0,0,0,1,,,,0,0,0,45,0,0,,0,1,45,,,
bend,host1,0,0,0,0,,0,0,0,,0,,0,0,0,0,UP,1,1,0,0,0,113,0,,1,2,1,,0,,2,0,,0,L4OK,,0,0,0,0,0,0,0,0,,,,0,0,
'''
import csv, StringIO
dialect = csv.Sniffer().sniff(haproxy_socket_data)
reader = csv.reader(
    StringIO.StringIO(haproxy_socket_data), dialect=dialect,
)
for row in reader:
    print row
print
dictr = csv.DictReader(
    StringIO.StringIO(haproxy_socket_data),
    dialect=dialect,
)
for drow in dictr:
    print 'svname',drow['svname']
Output:
['pxname', 'svname', 'qcur', 'qmax', 'scur', 'smax', 'slim', 'stot',
'bin', 'bout', 'dreq', 'dresp', 'ereq', 'econ', 'eresp', 'wretr',
'wredis', 'status', 'weight', 'act', 'bck', 'chkfail', 'chkdown',
'lastchg', 'downtime', 'qlimit', 'pid', 'iid', 'sid', 'throttle',
'lbtot', 'tracked', 'type', 'rate', 'rate_lim', 'rate_max',
'check_status', 'check_code', 'check_duration', 'hrsp_1xx',
'hrsp_2xx', 'hrsp_3xx', 'hrsp_4xx', 'hrsp_5xx', 'hrsp_other',
'hanafail', 'req_rate', 'req_rate_max', 'req_tot', 'cli_abrt',
'srv_abrt', '']
['fe', 'FRONTEND', '', '', '0', '1', '2000', '45',
'0', '8415', '0', '0', '45', '', '', '', '', 'OPEN', '', '', '', '',
'', '', '', '', '1', '1', '0', '', '', '', '0', '0', '0', '1', '', '',
'', '0', '0', '0', '45', '0', '0', '', '0', '1', '45', '', '', '']
['bend', 'host1', '0', '0', '0', '0', '', '0', '0', '0', '', '0', '',
'0', '0', '0', '0', 'UP', '1', '1', '0', '0', '0', '113', '0', '',
'1', '2', '1', '', '0', '', '2', '0', '', '0', 'L4OK', '', '0', '0',
'0', '0', '0', '0', '0', '0', '', '', '', '0', '0', '']

svname FRONTEND
svname host1
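In Python 3 the StringIO class lives in the io module, so the DictReader part might look like the sketch below. The four-column text is a shortened, hypothetical version of the HAProxy data, and the Sniffer step is left out because the delimiter is already known to be a comma.

```python
import csv
import io

# Shortened, hypothetical version of the HAProxy stats text.
text = (
    "pxname,svname,scur,status\n"
    "fe,FRONTEND,0,OPEN\n"
    "bend,host1,0,UP\n"
)

# Each row becomes a mapping keyed by the header names.
rows = list(csv.DictReader(io.StringIO(text)))
for row in rows:
    print(row["svname"], row["status"])
```

Looking values up by column name sidesteps the duplicate-value problem entirely, because nothing depends on searching the row for a value.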

How to get non-csv lines in csv file

I have a csv like:
"Equipment","LNKEQP","METAST","METSER","MODSTA","METEOD"
"HLL_POS_00098",1,1,0,0,0
"TOY_GAT_00003",0,0,0,3,0
"NAT_POS_00010",0,3,0,3,0
"NAT_GAT_00002",0,0,0,0,0
"NAT_GAT_00001",0,0,0,4,0
A machine A is unavailable
And I use this code to read that CSV file:
reader = csv.DictReader(f)
s=[]
for row in reader:
But the rows don't include "A machine A is unavailable". How can I get this line and produce output like this example:
'METEOD': '0', 'MODSTA': '0', 'METSER': '0', 'LNKEQP': '0', 'METAST': '0', 'Equipment': 'NAT_VCF_00001'
'METEOD': '0', 'MODSTA': '0', 'METSER': '0', 'LNKEQP': '1', 'METAST': '1', 'Equipment': 'NAT_TVM_00002'
A machine A is unavailable
Thanks for your help.

Remove the offending lines before parsing them:
import csv
from StringIO import StringIO
i = """"Equipment","LNKEQP","METAST","METSER","MODSTA","METEOD"
"HLL_POS_00098",1,1,0,0,0
"TOY_GAT_00003",0,0,0,3,0
"NAT_POS_00010",0,3,0,3,0
"NAT_GAT_00002",0,0,0,0,0
"NAT_GAT_00001",0,0,0,4,0
A machine A is unavailable
"""
# Take only those lines that contain a comma.
j = "".join([line for line in StringIO(i).readlines() if ',' in line])
# Parse the taken lines as CSV.
reader = csv.reader(StringIO(j))
for line in reader:
    print line
Output:
['Equipment', 'LNKEQP', 'METAST', 'METSER', 'MODSTA', 'METEOD']
['HLL_POS_00098', '1', '1', '0', '0', '0']
['TOY_GAT_00003', '0', '0', '0', '3', '0']
['NAT_POS_00010', '0', '3', '0', '3', '0']
['NAT_GAT_00002', '0', '0', '0', '0', '0']
['NAT_GAT_00001', '0', '0', '0', '4', '0']
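If the non-CSV line should be kept rather than discarded, as the question asks, the same comma test can split the input into two groups. This is a Python 3 sketch under the same assumption as the answer above, namely that any line without a comma is not CSV data; the sample text is shortened from the question.

```python
import csv
import io

# Shortened sample from the question.
text = '''"Equipment","LNKEQP","METAST","METSER","MODSTA","METEOD"
"HLL_POS_00098",1,1,0,0,0
"TOY_GAT_00003",0,0,0,3,0
A machine A is unavailable
'''

csv_lines, other_lines = [], []
for line in io.StringIO(text):
    # Assumption: any line without a comma is not CSV data.
    if "," in line:
        csv_lines.append(line)
    else:
        other_lines.append(line)

# DictReader accepts any iterable of lines, not only file objects.
records = list(csv.DictReader(csv_lines))
messages = [line.strip() for line in other_lines if line.strip()]

for record in records:
    print(dict(record))
for message in messages:
    print(message)
```

A free-text message that happens to contain a comma would be misclassified by this test, so a stricter check, such as comparing the field count with the header, may be needed for real data.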
