I am trying to write some data to a CSV file, but I can't get the values into different columns.
car=["car 11"]
finish=["Landhaus , Nord"]
time=["['05:36']", "['06:06']", "['06:36']", "['07:06']", "['07:36']", "['08:06']", "['08:36']", "['09:06']", "['09:36']", "['10:06']", "['10:36']", "['11:06']", "['11:36']", "['12:06']", "['12:36']", "['13:06']", "['13:36']", "['14:06']", "['14:36']", "['15:06']", "['15:36']", "['16:06']", "['16:36']", "['17:06']", "['17:36']", "['18:06']", "['18:36']", "['19:06']", "['19:36']", "['20:06']", "['20:36']"]
myfile = open("Informationen.csv", "wb")
writer = csv.writer(myfile,dialect='excel',delimiter=' ')
bla =[car,finish,time]
writer.writerow(bla)
Output:
car 11 Landhaus , Nord "['05:36']", "['06:06']", [..]
Everything ends up in row 1, column 1.
But I want it like this:
car 11 (row 1, column 1) | "Landhaus , Nord" (row 1, column 2) | ['05:36'] (row 1, column 3) | ['06:06'] (row 1, column 4), and so on up to column n
Thanks for any help!
Edit
One more example of how it should look:
Row 1: car 11 (column 1) Landhaus, Nord (column 2) ['05:36'] (column 3) ['06:06'] (column 4) [...]
example http://img13.imageshack.us/img13/4964/unbenanntvilw.png
Solution so far (I still have problems with the time list):
car=["car 11"]
trenn=[';']
finish=['Landhaus , Nord']
time=["['05:36']", "['06:06']", "['06:36']", "['07:06']", "['07:36']", "['08:06']", "['08:36']", "['09:06']", "['09:36']", "['10:06']", "['10:36']", "['11:06']", "['11:36']", "['12:06']", "['12:36']", "['13:06']", "['13:36']", "['14:06']", "['14:36']", "['15:06']", "['15:36']", "['16:06']", "['16:36']", "['17:06']", "['17:36']", "['18:06']", "['18:36']", "['19:06']", "['19:36']", "['20:06']", "['20:36']"]
myfile = open("Informationen2.csv", "wb")
writer = csv.writer(myfile,delimiter=' ')
bla = car + trenn + finish + trenn + time
writer.writerow(bla)
myfile.close()
The Python documentation states that for the csv.writer() function the ...
"optional dialect parameter can be given which is used to define
a set of parameters specific to a particular CSV dialect".
... and that ...
"the other optional fmtparams keyword arguments can be given to override individual formatting parameters in the current dialect".
The problem you are experiencing is a consequence of writing a string representation of the time list to file and writing to file using a whitespace delimiter. If you were to view Informationen.csv as a plain text file the problem becomes apparent.
Firstly, writing to file having passed a whitespace delimiter as an argument in csv.writer(myfile, dialect='excel', delimiter=' ') overrides the default delimiter as defined in the Excel dialect and results in the elements of list bla being written to file with the format element1 element2 element3 as opposed to element1,element2,element3.
Secondly, although the majority of the elements in the time list are allocated their own columns in a spreadsheet as desired, writing the list to file as a string representation of itself has contributed to the overall undesired formatting.
When you open the file created by your script in Excel, Excel reads in two initial values based on the first two commas it finds, which happen to be in the middle of the string 'Landhaus , Nord' and within the string representation of the time list.
You can achieve the column separation you require firstly by appending the elements of the time list to the bla list, as opposed to nesting the former within the latter. You then need to omit delimiter=' ' in csv.writer(myfile, dialect='excel', delimiter=' '), thus avoiding the delimiter overriding effect when writing to file:
import csv

car = ['car 11']
finish = ['Landhaus , Nord']
time = ["['05:36']", "['06:06']", "['06:36']", "['07:06']", "['07:36']"]

try:
    with open('Informationen.csv', 'w') as myfile:
        writer = csv.writer(myfile, dialect='excel')
        # Concatenate car and finish so the row is one flat list, not nested lists
        bla = car + finish
        # Append each time as its own element so it gets its own column
        for each_time in time:
            bla.append(each_time)
        writer.writerow(bla)
except IOError as ioe:
    print('Error: ' + str(ioe))
producing the following output in Excel:
http://imageshack.us/a/img839/8061/screenshotkn.png
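As a rough Python 3 sketch of the same idea, the row can be built as one flat list and written with the default comma delimiter. Two things here are assumptions rather than part of the question: the time list is shortened to two entries, and the "['...']" wrapper is stripped from each time on the guess that only the time itself is wanted in the cell. The row is written to an in-memory buffer so that the resulting text can be inspected directly.

```python
import csv
import io

car = ["car 11"]
finish = ["Landhaus , Nord"]
time = ["['05:36']", "['06:06']"]  # shortened sample of the question's list

# Optional clean-up (an assumption about the desired output):
# strip the "['...']" wrapper so each cell holds only the time.
clean_times = [t.strip("[]'") for t in time]

# One flat list means one cell per element.
row = car + finish + clean_times

# The buffer stands in for open("Informationen.csv", "w", newline="")
buffer = io.StringIO()
csv.writer(buffer, dialect="excel").writerow(row)
csv_text = buffer.getvalue()
print(repr(csv_text))
```

The field containing a comma is quoted by the writer, so Excel should still treat 'Landhaus , Nord' as a single cell.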
import csv
car=["car 11"]
finish=['Landhaus , Nord']
time=["['05:36']", "['06:06']", "['06:36']", "['07:06']", "['07:36']", "['08:06']", "['08:36']", "['09:06']", "['09:36']", "['10:06']", "['10:36']", "['11:06']", "['11:36']", "['12:06']", "['12:36']", "['13:06']", "['13:36']", "['14:06']", "['14:36']", "['15:06']", "['15:36']", "['16:06']", "['16:36']", "['17:06']", "['17:36']", "['18:06']", "['18:36']", "['19:06']", "['19:36']", "['20:06']", "['20:36']"]
myfile = open("derp.csv", "wb")
writer = csv.writer(myfile)
bla = car + finish + time
writer.writerow(bla)
myfile.close()
Here is the output I get from Excel:
The problem
I have a csv file called data.csv. On each row I have:
timestamp: int
account_id: int
data: float
for instance:
timestamp,account_id,value
10,0,0.262
10,0,0.111
13,1,0.787
14,0,0.990
This file is ordered by timestamp.
The number of rows is too large to hold them all in memory.
Order of magnitude: 100 M rows and 5 M accounts.
How can I quickly get all rows for a given account_id? What would be the best way to make the data accessible by account_id?
Things I tried
to generate a sample:
import os
import random
import shutil

import tqdm

N_ROW = 10**6
N_ACCOUNT = 10**5

# Generate data to split
with open('./data.csv', 'w') as csv_file:
    csv_file.write('timestamp,account_id,value\n')
    for timestamp in tqdm.tqdm(range(N_ROW), desc='writing csv file to split'):
        account_id = random.randint(1, N_ACCOUNT)
        data = random.random()
        csv_file.write(f'{timestamp},{account_id},{data}\n')

# Clean result folder
if os.path.isdir('./result'):
    shutil.rmtree('./result')
os.mkdir('./result')
Solution 1
Write a script that creates a file for each account: read the rows of the original CSV one by one and write each row to the file that corresponds to its account (opening and closing a file for each row).
Code:
# Split the data
p_bar = tqdm.tqdm(total=N_ROW, desc='splitting csv file')
with open('./data.csv') as data_file:
    next(data_file)  # skip header
    for row in data_file:
        account_id = row.split(',')[1]
        account_file_path = f'result/{account_id}.csv'
        file_opening_mode = 'a' if os.path.isfile(account_file_path) else 'w'
        with open(account_file_path, file_opening_mode) as account_file:
            account_file.write(row)
        p_bar.update(1)
Issues:
It is quite slow (I think it is inefficient to open and close a file for each row). It takes around 4 minutes for 1 M rows.
Even if it works, will lookups be fast? Given an account_id, I know the name of the file I should read, but the system has to search among 5 M files to find it. Should I create some kind of binary tree of folders, with the files as the leaves?
Solution 2 (works on small example not on large csv file)
Same idea as solution 1 but instead of opening / closing a file for each row, store files in a dictionary
Code:
# A dict that will contain all files
account_file_dict = {}

# Given an account id, return the file to write to (create a new file if it does not exist)
def get_account_file(account_id):
    file = account_file_dict.get(account_id, None)
    if file is None:
        file = open(f'./result/{account_id}.csv', 'w')
        account_file_dict[account_id] = file
        file.__enter__()
    return file

# Split the data
p_bar = tqdm.tqdm(total=N_ROW, desc='splitting csv file')
with open('./data.csv') as data_file:
    next(data_file)  # skip header
    for row in data_file:
        account_id = row.split(',')[1]
        account_file = get_account_file(account_id)
        account_file.write(row)
        p_bar.update(1)
Issues:
I am not sure it is actually faster.
I have to open simultaneously 5M files (one per account). I get an error OSError: [Errno 24] Too many open files: './result/33725.csv'.
Solution 3 (works on small example not on large csv file)
Use awk command, solution from: split large csv text file based on column value
code:
after generating the file, run: awk -F, 'NR==1 {h=$0; next} {f="./result/"$2".csv"} !($2 in p) {p[$2]; print h > f} {print >> f}' ./data.csv
Issues:
I get the following error: input record number 28229, file ./data.csv source line number 1 (28229 is an example; it usually fails at around 28k). I assume this is also because I am opening too many files.
@VinceM:
While not quite 15 GB, I do have a 7.6 GB file with 3 columns:
148 million prime numbers, their base-2 logs, and their hex representations
in0: 7.59GiB 0:00:09 [ 841MiB/s] [ 841MiB/s] [========>] 100%
148,156,631 lines 7773.641 MB ( 8151253694) /dev/stdin
|
f="$( grealpath -ePq ~/master_primelist_19d.txt )"
( time ( for __ in '12' '34' '56' '78' '9'; do
( gawk -v ___="${__}" -Mbe 'BEGIN {
___="^["(___%((_+=_^=FS=OFS="=")+_*_*_)^_)"]"
} ($_)~___ && ($NF = int(($_)^_))^!_' "${f}" & ) done |
gcat - ) ) | pvE9 > "${DT}/test_primes_squared_00000002.txt"
|
out9: 13.2GiB 0:02:06 [98.4MiB/s] [ 106MiB/s] [ <=> ]
( for __ in '12' '34' '56' '78' '9'; do; ( gawk -v ___="${__}" -Mbe "${f}" &)
0.36s user 3 out9: 13.2GiB 0:02:06 [ 106MiB/s] [ 106MiB/s]
Using only 5 instances of gawk with the GNU GMP big-integer package, each handling a designated subset of leading digits of the prime numbers,
it calculated the full-precision squares of those primes in just 2 minutes 6 seconds, yielding an unsorted 13.2 GB output file.
If it can square that quickly, then merely grouping by account_id should be a walk in the park.
Have a look at https://docs.python.org/3/library/sqlite3.html
You could import the data, create the required indexes, and then run queries normally. There are no dependencies other than Python itself.
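A minimal sketch of that approach, assuming the column layout from the question. The table name, the index name, and the in-memory sample that stands in for data.csv are all made up for illustration; for the real file you would connect to a database file on disk and pass the opened CSV to the reader.

```python
import csv
import io
import sqlite3

# Tiny in-memory stand-in for data.csv (hypothetical sample rows)
sample_csv = io.StringIO(
    "timestamp,account_id,value\n"
    "10,0,0.262\n"
    "10,0,0.111\n"
    "13,1,0.787\n"
    "14,0,0.990\n"
)

# ":memory:" keeps the sketch self-contained; a path such as "data.db" would persist it
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE data (timestamp INTEGER, account_id INTEGER, value REAL)")

reader = csv.reader(sample_csv)
next(reader)  # skip header
# executemany iterates over the reader row by row, so the CSV is not loaded at once
conn.executemany("INSERT INTO data VALUES (?, ?, ?)", reader)

# The index is what makes lookups by account_id fast
conn.execute("CREATE INDEX idx_account ON data (account_id)")
conn.commit()

rows_for_account_0 = conn.execute(
    "SELECT timestamp, value FROM data WHERE account_id = ? ORDER BY timestamp",
    (0,),
).fetchall()
print(rows_for_account_0)
conn.close()
```

Creating the index after the bulk insert rather than before is a common choice, since it avoids updating the index on every inserted row.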
https://pola-rs.github.io/polars/py-polars/html/reference/api/polars.scan_csv.html
If you have to query the raw data every time and you are limited to plain Python, then you can either write code that reads the file manually and yields the matching rows, or use a helper like this:
from convtools.contrib.tables import Table
from convtools import conversion as c

iterable_of_matched_rows = (
    Table.from_csv("tmp/in.csv", header=True)
    .filter(c.col("account_id") == "1")
    .into_iter_rows(dict)
)
However, this won't be faster than reading the 100M-row CSV file with csv.reader.
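The "read it manually and yield matched rows" option could look roughly like the sketch below, which uses only the standard library. The function name iter_account_rows is hypothetical, and the in-memory sample stands in for the real file; every query is still a full scan of the CSV.

```python
import csv
import io


def iter_account_rows(csv_file, account_id):
    """Yield rows (as dicts) whose account_id column matches, one at a time."""
    wanted = str(account_id)  # csv yields strings, so compare as strings
    for row in csv.DictReader(csv_file):
        if row["account_id"] == wanted:
            yield row


# Hypothetical sample standing in for open("data.csv", newline="")
sample = io.StringIO(
    "timestamp,account_id,value\n"
    "10,0,0.262\n"
    "13,1,0.787\n"
    "14,0,0.990\n"
)
matched = list(iter_account_rows(sample, 0))
print(matched)
```

Because the function is a generator, only one row is held in memory at a time unless the caller collects the results, as the list() call does here.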
screenshot of the csv file
Hi (sorry if this is a dumb question). I have a data set stored as a CSV file. Every row contains 44 columns, and every cell contains 44 floating-point numbers separated by two spaces (see the screenshot). I tried csv with readline/readlines plus numpy, and none of it worked.
I want to read every row as a list of 1936 values (44*44)
and then combine the whole data set into a 2D array: my_data[n_of_samples][1936]
So, as stated by user ybl, this is not a CSV. It's not even close to being a CSV.
This means that you have to implement some processing to turn this into something usable. I put the screenshot through OCR to extract the actual text values, but next time please provide the input file. Screenshots of data are annoying to work with.
The processing you need to do is to find the start and end of each row, using the [ and ] characters respectively. Then you split this data with the basic str.split(), which doesn't care about the number of spaces.
Try the code below and see if it works for the input file.
rows = []
current_row = ""
with open("somefile.txt") as infile:
    for line in infile.readlines():
        cleaned = line.replace('"', '').replace("\n", " ")
        if "]" in cleaned:
            # A closing bracket ends the current row
            current_row = f"{current_row} {cleaned.split(']')[0]}"
            rows.append(current_row.split())
            current_row = ""
            cleaned = cleaned.split(']')[1]
        if "[" in cleaned:
            # An opening bracket starts a new row
            cleaned = cleaned.split("[")[1]
        current_row = f"{current_row} {cleaned}"

for row in rows:
    print(len(row))
output
44
44
44
input file:
"[ 1.79619717e+04 1.09988207e+02 4.13270009e+01 1.72227906e+01
1.06178751e+01 5.20957856e+00 7.50891645e+00 4.57943370e+00
2.65572713e+00 2.96725867e-01 2.43040664e+00 1.32822091e+00
4.09853169e-01 1.18412873e+00 6.43398990e-01 1.23796528e+00
9.63975374e-02 2.95295579e-01 7.68998970e-01 4.98040980e-01
2.84036936e-01 1.76004564e-01 1.43527613e-01 1.64765236e-01
1.51171075e-01 1.02586637e-01 3.27835810e-02 1.21872869e-02
-7.59824907e-02 8.48217334e-02 7.29953754e-02 4.89750588e-02
5.89426950e-02 5.05485266e-02 2.34761263e-02 -2.41095452e-02
5.15952510e-02 1.39933210e-02 2.12354074e-02 3.40820680e-03
-2.57466949e-03 -1.06481222e-02 -8.35155410e-03 1.21653512e-12]","[-6.12189619e+02 1.03584744e+04 2.34417495e+02 7.01761526e+01
3.92495170e+01 1.81609738e+01 2.58114624e+01 1.52275550e+01
8.59676934e+00 9.45036161e-01 7.71943506e+00 4.17516432e+00
1.27920413e+00 3.68862368e+00 1.99582544e+00 3.82999035e+00
2.96068511e-01 9.06341796e-01 2.35621065e+00 1.52094079e+00
8.64565916e-01 5.34605108e-01 4.35456793e-01 4.99450615e-01
4.57778770e-01 3.10324997e-01 9.90860520e-02 3.68281889e-02
-2.29532895e-01 2.56108491e-01 2.20284123e-01 1.47727878e-01
1.77724506e-01 1.52350751e-01 7.07318164e-02 -7.26252404e-02
1.55364050e-01 4.21222079e-02 6.39113311e-02 1.02558665e-02
-7.74736016e-03 -3.20368093e-02 -2.51241082e-02 1.21653512e-12]","[-5.03959282e+02 -5.64452044e+02 7.90433958e+03 1.94146598e+02
1.06178751e+01 5.20957856e+00 7.50891645e+00 4.57943370e+00
2.65572713e+00 2.96725867e-01 2.43040664e+00 1.32822091e+00
4.09853169e-01 1.18412873e+00 6.43398990e-01 1.23796528e+00
9.63975374e-02 2.95295579e-01 7.68998970e-01 4.98040980e-01
2.84036936e-01 1.76004564e-01 1.43527613e-01 1.64765236e-01
1.51171075e-01 1.02586637e-01 3.27835810e-02 1.21872869e-02
-7.59824907e-02 8.48217334e-02 7.29953754e-02 4.89750588e-02
5.89426950e-02 5.05485266e-02 2.34761263e-02 -2.41095452e-02
5.15952510e-02 1.39933210e-02 2.12354074e-02 3.40820680e-03
-2.57466949e-03 -1.06481222e-02 -8.35155410e-03 1.21653512e-12]"
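To get from the bracketed groups to a numeric array, one possible sketch is to pull out every [...] group with a regular expression and convert the tokens to floats. This is a swapped-in alternative to the line-by-line loop above, not the same code; the function name parse_bracketed_rows and the tiny three-value sample are made up for illustration.

```python
import re

import numpy as np


def parse_bracketed_rows(text):
    """Return a 2D float array with one row per [...] group found in text."""
    groups = re.findall(r"\[([^\]]*)\]", text)
    # str.split() with no argument ignores newlines and repeated spaces
    return np.array([[float(token) for token in group.split()] for group in groups])


# Hypothetical miniature of the input file: two cells of three numbers each
sample = '"[ 1.0  2.0\n  3.0]","[-4.0  5.0\n  6.0e-01]"'
cells = parse_bracketed_rows(sample)
print(cells.shape)
```

For the real file, each group should hold 44 numbers, so something like cells.reshape(-1, 44 * 44) would give one row of 1936 values per sample, assuming every sample really consists of 44 complete groups.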
Another option is this:
import numpy as np
import csv
c = np.array([n_of_samples])
with open('cocacola_sick.csv') as f:
    p = csv.reader(f)  # read file as csv
    for s in p:
        a = ','.join(s)  # concatenate all cells of the row into one string
        a = a.replace("\n", "")  # remove line breaks
        b = np.array(np.mat(a))
        my_data = np.vstack((c,b))
print(my_data)
I'm currently working on pulling a data set from .txt files. The data set has two types of spacing that are not uniform or consistent. For example one row will be:
10 0 1 10
and the next will be:
10 0 1 -10
This is giving me errors, because numpy.loadtxt(Data, delimiter=' ') will sometimes create 4 columns per row and other times 3 columns per row if there is a negative integer.
I tried to take the raw .txt file and replace the ' -' with ' -' so the delimiter will pick it up, but I get the error line 533, in open raise IOError("%s not found." % path)
Any help is greatly appreciated!
My Current Code:
Raw_02 = open('IEA-15-240-RWT_AeroDyn15_Polar_02.txt', 'r')
Raw_02 = Raw_02.read().replace(' -', ' -')
data_02 = np.loadtxt(Raw_02, delimiter=' ', skiprows=54, dtype=str, max_rows=200) #error here
data_02_a = np.array(data_02)
data_tab_02 = pd.DataFrame(data_02_a, columns=col_names2)
data_tab_02.to_excel('Raw_Data02.xlsx', sheet_name='02')
This error occurs because np.loadtxt treats a string argument as a file name, not as the file's contents.
Wrap the text in io.StringIO so that it can be read like a file:
from io import StringIO
import numpy as np
import pandas as pd

# Read the raw text, apply the replacement, and wrap the result in a file-like object
with open('IEA-15-240-RWT_AeroDyn15_Polar_02.txt', 'r') as raw_file:
    Raw_02 = StringIO(raw_file.read().replace(' -', ' -'))
data_02 = np.loadtxt(Raw_02, delimiter=' ', skiprows=54, dtype=str, max_rows=200)
data_02_a = np.array(data_02)
# col_names2 is defined elsewhere in the question's script
data_tab_02 = pd.DataFrame(data_02_a, columns=col_names2)
data_tab_02.to_excel('Raw_Data02.xlsx', sheet_name='02')
But I think you should also pay attention to the fact that, when replacing - with -, the separator between the other columns may no longer be correct.
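One way to sidestep the separator problem, sketched here on made-up sample rows rather than the real AeroDyn file, is to leave out the delimiter argument altogether. By default np.loadtxt splits on any run of whitespace, so one space and two spaces are treated alike and no replacement is needed.

```python
from io import StringIO

import numpy as np

# Hypothetical rows with mixed spacing: two spaces normally, one before a minus sign
raw_text = "10  0  1  10\n10  0  1 -10\n"

# No delimiter argument: any run of whitespace separates columns
data = np.loadtxt(StringIO(raw_text))
print(data.shape)
```

The skiprows and max_rows arguments from the question can still be passed in the same call, and the file name can be given directly instead of a StringIO object.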
import csv
import datetime

with open('soundTransit1_remote_rawMeasurements_15m.txt','r') as infile, open('soundTransit1.txt','w') as outfile:
    inr = csv.reader(infile,delimiter='\t')
    #ouw = csv.writer(outfile,delimiter=' ')
    for row in inr:
        d = datetime.datetime.strptime(row[0],'%Y-%m-%d %H:%M:%S')
        s = 1
        p = int(row[5])
        nr = [format(s,'02')+format(d.year,'04')+format(d.month,'02')+format(d.day,'02')+format(d.hour,'02')+format(d.minute,'02')+format(int(p*0.2),'04')]
        outfile.writelines(nr+'/n')
Using the above script, I have read in a .txt file and reformatted it as 'nr' so it looks like this:
['012015072314000000']
['012015072313450000']
['012015072313300000']
['012015072313150000']
['012015072313000000']
['012015072312450000']
['012015072312300000']
['012015072312150000']
..etc.
I need to now print it onto my new .txt file, but Python is not allowing me to print 'nr' with line breaks after each entry, I think because the data is in strings. I get this error:
TypeError: can only concatenate list (not "str") to list
Is there another way to do this?
You are trying to combine a list with a string, which cannot work. Simply don't create a list in nr.
import csv
import datetime

with open('soundTransit1_remote_rawMeasurements_15m.txt','r') as infile, open('soundTransit1.txt','w') as outfile:
    inr = csv.reader(infile,delimiter='\t')
    #ouw = csv.writer(outfile,delimiter=' ')
    for row in inr:
        d = datetime.datetime.strptime(row[0],'%Y-%m-%d %H:%M:%S')
        s = 1
        p = int(row[5])
        nr = "{:02d}{:%Y%m%d%H%M}{:04d}\n".format(s,d,int(p*0.2))
        outfile.write(nr)
There is no need to put your string into a list; just use outfile.write() here and build a string without a list:
nr = format(s,'02') + format(d.year,'04') + format(d.month, '02') + format(d.day, '02') + format(d.hour, '02') + format(d.minute, '02') + format(int(p*0.2), '04')
outfile.write(nr + '\n')
Rather than use 7 separate format() calls, use str.format():
nr = '{:02}{:%Y%m%d%H%M}{:04}\n'.format(s, d, int(p * 0.2))
outfile.write(nr)
Note that I formatted the datetime object with one formatting operation, and I included the newline into the string format.
You appear to have hard-coded the s value; you may as well put that into the format directly:
nr = '01{:%Y%m%d%H%M}{:04}\n'.format(d, int(p * 0.2))
outfile.write(nr)
Together, that updates your script to:
import csv
import datetime

with open('soundTransit1_remote_rawMeasurements_15m.txt', 'r') as infile,\
        open('soundTransit1.txt','w') as outfile:
    inr = csv.reader(infile, delimiter='\t')
    for row in inr:
        d = datetime.datetime.strptime(row[0], '%Y-%m-%d %H:%M:%S')
        p = int(int(row[5]) * 0.2)
        nr = '01{:%Y%m%d%H%M}{:04}\n'.format(d, p)
        outfile.write(nr)
Take into account that the csv module works better if you follow the guidelines about opening files; in Python 2 you need to open the file in binary mode ('rb'), in Python 3 you need to set the newline parameter to ''. That way the module can control newlines correctly and supports including newlines in column values.
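For Python 3, the newline advice could be sketched as follows. The scratch file path and the sample row are made up for illustration; the point is only that a column value containing a newline survives a round trip when both files are opened with newline=''.

```python
import csv
import os
import tempfile

# Hypothetical scratch file so the sketch does not touch real data
path = os.path.join(tempfile.mkdtemp(), "example.tsv")

# newline='' leaves line-ending handling to the csv module
with open(path, "w", newline="") as outfile:
    writer = csv.writer(outfile, delimiter="\t")
    writer.writerow(["2015-07-23 14:00:00", "note with\nembedded newline"])

with open(path, "r", newline="") as infile:
    rows = list(csv.reader(infile, delimiter="\t"))
print(rows)
```

The writer quotes the field that contains the newline, and the reader puts it back together as a single value.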
I'm having trouble writing to text file. Here's my code snippet.
ram_array= map(str, ram_value)
cpu_array= map(str, cpu_value)
iperf_ba_array= map(str, iperf_ba)
iperf_tr_array= map(str, iperf_tr)
#with open(ram, 'w') as f:
    #for s in ram_array:
        #f.write(s + '\n')
#with open(cpu,'w') as f:
    #for s in cpu_array:
        #f.write(s + '\n')
with open(iperf_b,'w') as f:
    for s in iperf_ba_array:
        f.write(s+'\n')
    f.close()
with open(iperf_t,'w') as f:
    for s in iperf_tr_array:
        f.write(s+'\n')
    f.close()
The ram and cpu files both work flawlessly; however, when writing to a file for iperf_ba and iperf_tr, the output always comes out looking like this:
[45947383.0, 47097609.0, 46576113.0, 47041787.0, 47297394.0]
Instead of
1
2
3
They're both reading from global lists. The cpu and ram have values appended 1 by 1, but otherwise they look exactly the same pre processing.
Here's how they're made
filename= "iperfLog_2015_03_12_20:45:18_123_____tag_33120L06.csv"
write_location= self.tempLocation()
location=(str(write_location) + str(filename));
df = pd.read_csv(location, names=list('abcdefghi'))
transfer = df.h
transfer=transfer[~transfer.isnull()]#uses pandas to remove nan
transfer=transfer.tolist()
length= int(len(transfer))
extra= length-1
del transfer[extra]
bandwidth= df.i
bandwidth=bandwidth[~bandwidth.isnull()]
bandwidth=bandwidth.tolist()
del bandwidth[extra]
iperf_tran.append(transfer)
iperf_band.append(bandwidth)
[from comment]
You need to use .extend(list) if you want to add the elements of a list to another list. And don't worry: we all spend hours debugging and chasing classy-stupid-me mistakes sometimes ;)
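A small sketch of the difference, using a shortened copy of the values from the question. The names nested, lines_nested and lines_flat are made up for illustration.

```python
bandwidth = [45947383.0, 47097609.0, 46576113.0]

nested = []
nested.append(bandwidth)  # adds the whole list as ONE element

iperf_band = []
iperf_band.extend(bandwidth)  # adds each value as its own element

# What would end up on each line of the output file
lines_nested = [str(x) for x in nested]
lines_flat = [str(x) for x in iperf_band]
print(lines_nested)
print(lines_flat)
```

With append, the single element is the list itself, so str() turns it into one bracketed line, which matches the output seen in the question.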