I have recently stumbled upon a task utilizing some CSV files that are, to say the least, very poorly organized, with one cell containing what should be multiple separate columns. I would like to use this data in a Python script but want to know if it is possible to delete a portion of the row (all of it after a certain point) then write that to a dictionary.
Although I can't show the exact contents of the CSV, it looks like this:
useful. useless useless useless useless
I understand that this will most likely require either a regular expression or an endswith statement, but doing all of that to a CSV file is beyond me. Also, the period written after useful on the CSV should be removed as well, and is not a typo.
If you know the character you want to split on you can use this simple method:
good_data = bad_data.split(".")[0]
good_data = good_data.strip() # remove excess whitespace at start and end
This method will always work: split returns a list that always has at least one entry (the full string when the separator is absent), so [0] is safe. Using index, by contrast, raises a ValueError when the separator is missing.
You can also limit the # of splits that will happen if necessary using split(".", N).
https://docs.python.org/2/library/stdtypes.html#str.split
>>> "good.bad.ugly".split(".", 1)
['good', 'bad.ugly']
>>> "nothing bad".split(".")
['nothing bad']
>>> stuff = "useful useless"
>>> stuff = stuff[:stuff.index(".")]
ValueError: substring not found
Actual Answer
OK, then notice that you can use indexing for strings just like you do for lists, i.e. "this is a very long string but we only want the first 4 letters"[:4] gives "this". If we knew the index of the dot, we could get what you want the same way. Strings have the index method for exactly that. So in total you do:
stuff = "useful. useless useless useless useless"
stuff = stuff[:stuff.index(".")]
Now stuff is very useful :).
If the file contains multiple lines like that, you can do this for each line: cut the line at the period, split what remains at the commas, and store every field in a dictionary.
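data = {}
with open("./test.txt") as f:
    for i, line in enumerate(f.read().split("\n")):
        # note: index raises ValueError if a line contains no "."
        csv_line = line[:line.index(".")]
        for j,col in enumerate(csv_line.split(",")):
            # key each field by its (row, column) position
            data[(i,j)] = col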
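A sketch of the same loop that tolerates lines without a period (str.index raises ValueError on those), using str.partition instead. The function name rows_to_dict and the sample lines are made up for illustration; the (row, column) keys follow the code above.

```python
def rows_to_dict(lines):
    """Map (row, column) -> cell text, dropping everything from the first '.' on."""
    data = {}
    for i, line in enumerate(lines):
        # partition returns the whole line as its first element when there is no "."
        kept = line.partition(".")[0]
        for j, col in enumerate(kept.split(",")):
            data[(i, j)] = col.strip()
    return data


# Hypothetical sample lines, not the asker's real data.
sample = ["useful. useless useless", "a,b. junk", "no dot here"]
print(rows_to_dict(sample))
```

Whether a line without a period should be kept whole or skipped depends on the data; this sketch keeps it.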
How one would do this
Notice that most people would not want to do this by hand. Working with tabular data is a common task, and there is a library called pandas for it. It may be a good idea to familiarise yourself a bit more with Python before you dive into pandas, though. I think a good point to start is this. Using pandas, your task would look like this:
import pandas as pd
pd.read_csv("./test.txt", comment=".")
giving you what is called a dataframe.
I have a gz file and I want to extract the unique values from each column. The field separator is |. I tried using Python as below.
import sys,os,csv,gzip
from sets import Set
ig = 0
max_d = 1
with gzip.open("fundamentals.20170724.gz","rb") as f:
    reader = csv.reader(f,delimiter="|")
    for i in range(0,400):
        unique = Set()
        print "Unique_value for column "+str(i+1)
        flag = 0
        for line in reader:
            try:
                unique.add(line[i])
                max_d +=1
                if len(unique) >= 10:
                    print unique
                    flag = 1
                    break
            except:
                continue
        if flag == 0: print unique
It works, more or less, but it is not efficient for large files, so I am looking at this problem from a bash point of view.
Is there a shell script solution?
for example i have the data in my file as
5C4423,COMP,ISIN,CA2372051094,2016-04-19,
41C528,COMP,ISIN,US2333774071,2000-01-01,
B62545,COMP,ISIN,NL0000344265,2000-01-01,2007-05-11
9E7F41,COMP,ISIN,CA39260W1023,2013-02-13,2013-08-09
129DC8,COMP,ISIN,US37253A1034,2012-09-07,
4DE8CD,COMP,ISIN,QA000A0NCQB1,2008-03-06,
and I want all the unique values from each column.
With the gunzipped file, you could do:
awk -F, 'END { for (i=1;i<=NF;i++) { print "cut -d\",\" -f "i" filename | sort -u" } }' filename | sh
Set the field separator to , and then, for each field in the file, construct a cut command piped through sort -u (plain uniq only collapses adjacent duplicates, so unsorted input has to be sorted first), and finally pipe the whole awk output through sh. Running cut, sort and sh once per column will slow things down and there is probably a more efficient way, but it's worth a go.
A shell-built pipeline could indeed do this job faster, though likely with worse memory efficiency. There are two primary reasons: parallelism and native code.
First, since we have little description of the task, I'll have to read the Python code and figure out what it does.
from sets import Set is an odd line. The sets module is a deprecated Python 2 leftover (it was removed in Python 3), and the built-in set type has replaced it. I'll assume Set is at best another name for the standard set type, or else a less efficient variant of the same concept.
gzip.open lets the script read a gzipped file. We can replace this with a zcat process.
csv.reader reads character-separated values, in this case splitting on '|'. Deeper inside the code we find that only one column (line[i]) is read, so we can replace it with cut or awk ... until i changes. awk can handle that case too, but it's a little trickier.
The trickiest part is the end logic. Every time 10 unique values are found in a column, the program outputs those values and switches to the next column. By the way, Python's for has an else clause specifically for this case, so you don't need a flag variable.
One of the odder parts of the code is how you catch all exceptions from the inner data processing block. Why is this? There are basically only two sources of exceptions in there: Firstly, the indexing could fail if there aren't that many columns. Secondly, the unknown Set type could be throwing exceptions; the standard set type would not.
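The for/else idea mentioned above can be sketched as follows, in Python 3. The function name report_unique and the sample values are illustrative only; the limit of ten comes from the question's code.

```python
def report_unique(values, limit=10):
    """Collect distinct values, stopping early once `limit` of them are found."""
    unique = set()
    for value in values:
        unique.add(value)
        if len(unique) >= limit:
            print("limit reached:", sorted(unique))
            break
    else:
        # This branch runs only when the loop ended without hitting `break`.
        print("input exhausted:", sorted(unique))
    return unique


report_unique(["a", "b", "a", "c", "d"], limit=3)
report_unique(["a", "a"], limit=3)
```

The else branch takes over the job of the flag == 0 check, so no flag variable is needed.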
So, the analysis of your function is: in a diagonal manner (since the file is never rewound, and columns are not processed in parallel), collect unique values from each column until ten are found, and print them. This means, for instance, that if the first column had fewer than ten unique items, the reader is exhausted on that first pass and nothing but empty sets is printed for the other columns. I'm not sure this is the logic you intended.
With such complicated logic, Python's set functionality actually is a good choice; if we could partition the data more easily then uniq might have been better. What throws us off is how the program moves from column to column and only wants a specific number of values.
Thus, the two big time wasters in the Python program are decompressing in the same thread as we do all the parsing, and splitting into all columns when we only need one. The former can be addressed using a thread, and the latter is probably best done using a regular expression such as r'^(?:[^|]*\|){3}([^|]*)'. That expression would skip three columns and the fourth can be read as group 1. It gets more complicated if the CSV has quoting to contain the separator within some column. We could do the line parsing itself in a separate thread, but that wouldn't solve the issue of the many unneeded string allocations.
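As a quick illustration of that regular expression, here is a minimal sketch. The sample line is the question's first data row with | substituted for the commas, which is an assumption about what the real file looks like.

```python
import re

# Skip three '|'-terminated fields, then capture the fourth as group 1.
FOURTH_COLUMN = re.compile(r'^(?:[^|]*\|){3}([^|]*)')

# Assumed sample: the question's first row, using | as the separator.
line = "5C4423|COMP|ISIN|CA2372051094|2016-04-19|"
match = FOURTH_COLUMN.match(line)
fourth = match.group(1) if match else None
print(fourth)
```

A line with fewer than four columns does not match at all, so the None check matters. As noted, this does not handle quoted fields that contain the separator.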
Note that the problem actually becomes considerably different if what you really want is to process all columns from the start of the file. I also don't know why you specifically process 400 columns regardless of the amount that exist. If we remove those two constraints, the logic would be more like:
firstline = next(reader)
sets = [{column} for column in firstline]
for line in reader:
    for column, columnset in zip(line, sets):
        columnset.add(column)
this is a pure python version based on your idea:
from io import StringIO
from csv import reader
txt = '''5C4423,COMP,ISIN,CA2372051094,2016-04-19,
41C528,COMP,ISIN,US2333774071,2000-01-01,
B62545,COMP,ISIN,NL0000344265,2000-01-01,2007-05-11
9E7F41,COMP,ISIN,CA39260W1023,2013-02-13,2013-08-09
129DC8,COMP,ISIN,US37253A1034,2012-09-07,
4DE8CD,COMP,ISIN,QA000A0NCQB1,2008-03-06,'''
with StringIO(txt) as file:
    rows = reader(file)
    first_row = next(rows)
    unique = [{item} for item in first_row]
    for row in rows:
        for item, s in zip(row, unique):
            s.add(item)
which yields for your input:
[{'129DC8', '41C528', '4DE8CD', '5C4423', '9E7F41', 'B62545'},
{'COMP'},
{'ISIN'},
{'CA2372051094',
'CA39260W1023',
'NL0000344265',
'QA000A0NCQB1',
'US2333774071',
'US37253A1034'},
{'2000-01-01', '2008-03-06', '2012-09-07', '2013-02-13', '2016-04-19'},
{'', '2007-05-11', '2013-08-09'}]
oops, now that i have posted my answer i see, that this is exactly what Yann Vernier proposes at the end of his answer. please upvote this answer which was here way earlier than mine...
if you want to limit the number of unique values, you could stop adding to a set once it holds MAX_LEN items:
from io import StringIO
from csv import reader
MAX_LEN = 3
with StringIO(txt) as file:
    rows = reader(file)
    first_row = next(rows)
    unique = [{item} for item in first_row]
    for row in rows:
        for item, s in zip(row, unique):
            if len(s) < MAX_LEN:
                s.add(item)
print(unique)
with the result:
[{'41C528', '5C4423', 'B62545'},
{'COMP'},
{'ISIN'},
{'CA2372051094', 'NL0000344265', 'US2333774071'},
{'2000-01-01', '2013-02-13', '2016-04-19'},
{'', '2007-05-11', '2013-08-09'}]
this way you would save some memory if one of your columns holds only unique values.
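To run the same idea directly against a gzipped, |-separated file on Python 3, something like the following sketch could work. The function name unique_per_column and its parameters are made up for illustration, and it assumes the file decodes with the default text encoding.

```python
import csv
import gzip


def unique_per_column(path, delimiter="|", max_len=None):
    """Return one set of distinct values per column of a gzipped, delimited file."""
    unique = []
    # "rt" decompresses and decodes to text, which csv.reader needs on Python 3.
    with gzip.open(path, "rt", newline="") as f:
        for row in csv.reader(f, delimiter=delimiter):
            # Grow the list if this row has more columns than any seen so far.
            while len(unique) < len(row):
                unique.append(set())
            for item, seen in zip(row, unique):
                # Optionally cap each set, as in the MAX_LEN example above.
                if max_len is None or len(seen) < max_len:
                    seen.add(item)
    return unique
```

A call such as unique_per_column("fundamentals.20170724.gz", max_len=10) would then read the file once and handle all columns in the same pass.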
I am self-taught in MySQL, Python and Linux based OS, and quite sure there must be a more elegant solution to the problem, or, at least, one that works along my lines.
The code is taking data from the last 24 hours from a database and storing them to a .txt file to be handled further on. However, the output I am getting has additional symbols that are making further analysis troublesome - I want to know if there is a way to remove them.
My (relevant) code is:
...
cur = db.cursor()
cur.execute("SELECT * FROM Sens WHERE sdate > DATE_SUB(NOW(),INTERVAL 24 HOUR)")
query = cur.fetchall()
OutputFile = open("/root/Desktop/data.txt", "w")
for i in range (0, len(query)):
    print>>OutputFile, query[i]
...
The reason I am using for loop is to have each row fetched printed in a newline.
The result I get is as follows:
('0,01/24/16,12:41:49,45.185\r\r\n',)
The result I need is:
0,01/24/16,12:41:49,45.185
Much appreciate the help,
LZ.
You could use:
for i in range (0, len(query)):
    print>>OutputFile, query[i][0].strip()
The [0] index selects the string from the tuple, and the strip() function removes the whitespace from the left and right hand side of the string.
For a start, you shouldn't ever be iterating over range(len(something)). Always iterate over the thing itself. Also, it's more idiomatic to use file.write to output, rather than printing.
From there, all you need to do is output the first element of each row, with [0], stripping the stray \r\r\n and writing a single newline in its place.
for q in query:
    OutputFile.write(q[0].strip() + "\n")
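For reference, a Python 3 sketch of the same fix, since print>>OutputFile is Python 2 only syntax. The function name write_rows and the sample row are stand-ins; a real script would pass the result of cur.fetchall() and an open file.

```python
import io


def write_rows(rows, out):
    """Write the first column of each fetched row to `out`, one clean line per row."""
    for row in rows:
        # Each row is a 1-tuple such as ('0,01/24/16,12:41:49,45.185\r\r\n',)
        print(row[0].strip(), file=out)


# Stand-in for cur.fetchall(); StringIO replaces the real output file here.
sample_rows = [("0,01/24/16,12:41:49,45.185\r\r\n",)]
buffer = io.StringIO()
write_rows(sample_rows, buffer)
print(repr(buffer.getvalue()))
```

Opening the real file in a with block, e.g. with open(path, "w") as out, also makes sure it is closed once the rows are written.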
To start, I am a complete newcomer to Python and to programming in anything other than web languages.
So, I have developed a script using Python as an interface between a piece of Software called Spendmap and an online app called Freeagent. This script works perfectly. It imports and parses the text file and pushes it through the API to the web app.
What I am struggling with is that Spendmap exports multiple lines per order, whereas Freeagent wants one line per order. So I need to add up the cost values from any order spread across multiple lines and then 'flatten' those lines into one so it can be sent through the API. The 'key' field is the 'PO' field. So if the script sees any matching PO numbers, I want it to flatten them as per above.
This is a 'dummy' example of the text file produced by Spendmap:
5090071648,2013-06-05,2013-09-05,P000001,1133997,223.010,20,2013-09-10,104,xxxxxx,AP
COMMENT,002091
301067,2013-09-06,2013-09-11,P000002,1133919,42.000,20,2013-10-31,103,xxxxxx,AP
COMMENT,002143
301067,2013-09-06,2013-09-11,P000002,1133919,359.400,20,2013-10-31,103,xxxxxx,AP
COMMENT,002143
301067,2013-09-06,2013-09-11,P000003,1133910,23.690,20,2013-10-31,103,xxxxxx,AP
COMMENT,002143
The above has been formatted for easier reading; normally it is just one line after the next with no text formatting.
The 'key' or PO field is the fourth field (P000001, P000002, ...) and the cost to be totalled is the sixth field (223.010, 42.000, ...). So if this example were passed through the script, I'd expect the first row to be left alone, the second and third rows' costs to be added together as they're both from the same PO number, and the fourth row to be left alone.
Expected result:
5090071648,2013-06-05,2013-09-05,P000001,1133997,223.010,20,2013-09-10,104,xxxxxx,AP
COMMENT,002091
301067,2013-09-06,2013-09-11,P000002,1133919,401.400,20,2013-10-31,103,xxxxxx,AP
COMMENT,002143
301067,2013-09-06,2013-09-11,P000003,1133910,23.690,20,2013-10-31,103,xxxxxx,AP
COMMENT,002143
Any help with this would be greatly appreciated and if you need any further details just say.
Thanks in advance for looking!
I won't give you the solution. But you should:
Write and test a regular expression that breaks the line down into its parts, or use the CSV library.
Parse the numbers out so they're decimal numbers rather than strings
Collect the lines up by ID. Perhaps you could use a dict that maps IDs to lists of orders?
When all the input is finished, iterate over that dict and add up all orders stored in that list.
Make a string format function that outputs the line in the expected format.
Maybe feed the output back into the input to test that you get the same result. Second time round there should be no changes, if I understood the problem.
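The steps above can be sketched roughly as follows, using the csv module and Decimal. This is only one possible shape: the function name flatten_orders is made up, the field positions are read off the sample data, and each order is assumed to sit on a single line.

```python
import csv
import io
from decimal import Decimal

PO_FIELD = 3    # assumed position of the PO number (fourth field)
COST_FIELD = 5  # assumed position of the cost (sixth field)


def flatten_orders(text):
    """Merge rows that share a PO number, summing their cost field."""
    orders = {}  # PO -> list of fields; dicts keep insertion order on Python 3.7+
    for fields in csv.reader(io.StringIO(text)):
        if len(fields) <= COST_FIELD:
            continue  # skip blank or malformed lines
        po = fields[PO_FIELD]
        if po in orders:
            # Decimal keeps the three decimal places exactly as written.
            total = Decimal(orders[po][COST_FIELD]) + Decimal(fields[COST_FIELD])
            orders[po][COST_FIELD] = str(total)
        else:
            orders[po] = fields
    return [",".join(fields) for fields in orders.values()]


sample = """301067,2013-09-06,2013-09-11,P000002,1133919,42.000,20,2013-10-31,103,xxxxxx,AP
301067,2013-09-06,2013-09-11,P000002,1133919,359.400,20,2013-10-31,103,xxxxxx,AP
301067,2013-09-06,2013-09-11,P000003,1133910,23.690,20,2013-10-31,103,xxxxxx,AP"""
for line in flatten_orders(sample):
    print(line)
```

Feeding the output back in as input should leave it unchanged, which gives a cheap sanity check.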
Good luck!
I would use a dictionary to compile the lines, using get(key,0.0) to sum values if they exist already, or start with zero if not:
InputData = """5090071648,2013-06-05,2013-09-05,P000001,1133997,223.010,20,2013-09-10,104,xxxxxx,AP COMMENT,002091
301067,2013-09-06,2013-09-11,P000002,1133919,42.000,20,2013-10-31,103,xxxxxx,AP COMMENT,002143
301067,2013-09-06,2013-09-11,P000002,1133919,359.400,20,2013-10-31,103,xxxxxx,AP COMMENT,002143
301067,2013-09-06,2013-09-11,P000003,1133910,23.690,20,2013-10-31,103,xxxxxx,AP COMMENT,002143"""
OutD = {}
ValueD = {}
for Line in InputData.split('\n'):
    # commas in comments won't matter because we are joining after anyway
    Fields = Line.split(',')
    PO = Fields[3]
    Value = float(Fields[5])
    # set up the output string with a placeholder for .format()
    OutD[PO] = ",".join(Fields[:5] + ["{0:.3f}"] + Fields[6:])
    # add the value to the old value or to zero if it is not found
    ValueD[PO] = ValueD.get(PO,0.0) + Value
# the output is unsorted by default, but you could sort or preserve original order
for POKey in ValueD:
    print OutD[POKey].format(ValueD[POKey])
P.S. Yes, I know Capitals are for Classes, but this makes it easier to tell what variables I have defined...
I'm new to programming, and also to this site, so my apologies in advance for anything silly or "newbish" I may say or ask.
I'm currently trying to write a script in python that will take a list of items and write them into a csv file, among other things. Each item in the list is really a list of two strings, if that makes sense. In essence, the format is [[Google, http://google.com], [BBC, http://bbc.co.uk]], but with different values of course.
Within the CSV, I want this to show up as the first item of each list in the first column and the second item of each list in the second column.
This is the part of my code that I need help with:
with open('integration.csv', 'wb') as f:
    writer = csv.writer(f, delimiter=',', dialect='excel')
    writer.writerows(w for w in foundInstances)
For whatever reason, it seems that the delimiter is being ignored. When I open the file in Excel, each cell has one list. Using the old example, each cell would have "Google, http://google.com". I want Google in the first column and http://google.com in the second. So basically "Google" and "http://google.com", and then below that "BBC" and "http://bbc.co.uk". Is this possible?
Within my code, foundInstances is the list in which all the items are contained. As a whole, the script works fine, but I cannot seem to get this last step. I've done a lot of looking around within stackoverflow and the rest of the Internet, but I haven't found anything that has helped me with this last step.
Any advice is greatly appreciated. If you need more information, I'd be happy to provide you with it.
Thanks!
In your code on pastebin, the problem is here:
foundInstances.append(['http://' + str(num) + 'endofsite' + ', ' + desc])
Here, for each row in your data, you create one string that already has a comma in it. That is not what you need for the csv module. The CSV module makes comma-delimited strings out of your data. You need to give it the data as a simple list of items [col1, col2, col3]. What you are doing is ["col1, col2, col3"], which already has packed the data into a string. Try this:
foundInstances.append(['http://' + str(num) + 'endofsite', desc])
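To see the difference in isolation, here is a small Python 3 sketch that writes both shapes to an in-memory buffer. The helper name to_csv is made up, and the rows reuse the Google/BBC example from the question.

```python
import csv
import io


def to_csv(rows):
    """Render rows with csv.writer and return the resulting text."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()


# One string per row: the writer sees a single field and quotes it.
packed = [["Google, http://google.com"], ["BBC, http://bbc.co.uk"]]
# One list item per column: the writer inserts the delimiter itself.
split = [["Google", "http://google.com"], ["BBC", "http://bbc.co.uk"]]

print(to_csv(packed))
print(to_csv(split))
```

The packed rows come out wrapped in quotes, which is why Excel shows each of them in a single cell.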
I just tested the code you posted with
foundInstances = [[1,2],[3,4]]
and it worked fine. It definitely produces the output csv in the format
1,2
3,4
So I assume that your foundInstances has the wrong format. If you construct the variable in a complex manner, you could try to add
import pdb; pdb.set_trace()
before the actual variable usage in the csv code. This lets you inspect the variable at runtime with the python debugger. See the Python Debugger Reference for usage details.
As a side note, according to the PEP-8 Style Guide, the name of the variable should be found_instances in Python.