I have some data in a CSV file, and what I'm trying to do now is add another column that contains the row number. I've tried writing it out normally, but I need to be able to append to my file, so the numbering needs to pick up where it left off. I thought maybe the best thing to do was to overwrite the whole column again, but for that I think you need to read the file back in, which proved harder than I thought.
This depends - is the row number meaningful to the data in the row itself, or is this solely for your reporting so you can count how many lines appear as rows?
import csv

my_file = r'some.csv'

with open(my_file, newline='') as csv_file:
    reader = csv.DictReader(csv_file)
    row_index = 0
    for row in reader:
        print(row_index, row['col_1'], row['col_2'])
        row_index += 1
This will read the CSV file and print out an incremented index per-row.
Otherwise, this sounds like a script that checks for a column ('row_count', for example), removes it if present, re-creates it, and populates every field of 'row_count' with an incrementing integer.
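A rough sketch of that, assuming the file has a header row and fits in memory ('row_count' and the filename are placeholders):

import csv

my_file = 'some.csv'

# Read everything, including the header
with open(my_file, newline='') as csv_file:
    reader = csv.reader(csv_file)
    header = next(reader)
    rows = list(reader)

# Drop any existing 'row_count' column
if 'row_count' in header:
    idx = header.index('row_count')
    header.pop(idx)
    rows = [row[:idx] + row[idx + 1:] for row in rows]

# Rewrite the file with a freshly renumbered column at the end
with open(my_file, 'w', newline='') as csv_file:
    writer = csv.writer(csv_file)
    writer.writerow(header + ['row_count'])
    for i, row in enumerate(rows):
        writer.writerow(row + [i])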
Picking up where it last left off: manually adding these row numbers creates no meaningful relationship between the index count and the rows of the file itself. If you want meaningful persistence from this new column (i.e. a relational database using this column to link to another table), it should be implemented earlier, such as in the schema of whatever is exporting these CSV files.
def deletedata(uniquecode):
    with open('Stallingsbestand.csv', 'r+') as CSV:
        writer = csv.writer(CSV, delimiter=';')
        for row in CSV:
            if uniquecode in row:
                writer.writerow((uniquecode, ''))
Stallingsbestand.csv consists of rows that look like this:
uniquecode;Date_of_last_opening_a_function
I want to be able to delete the date of last opening and just have the unique code there.
(appending False at the end of the row can work too but I don't know which is easier)
I thought that just overwriting the row would be the easiest, but I can't get it to work. Does anyone know how to make this work?
You want to rename the file to Stallingsbestand.old and write out a new version of Stallingsbestand.csv. One way to do this is to copy (sometimes modified) rows from a csv.reader to a csv.writer within a loop, similar to your current code.
You might find it more convenient to create an in-memory dataframe with pandas.read_csv(), mutate one of its rows, and then persist it with pandas.to_csv().
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.to_csv.html
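A minimal sketch of the rename-and-rewrite approach, assuming the unique code is the first semicolon-separated field and that blanking the date (rather than appending False) is what you want:

import csv
import os

def deletedata(uniquecode):
    os.replace('Stallingsbestand.csv', 'Stallingsbestand.old')
    with open('Stallingsbestand.old', newline='') as src, \
         open('Stallingsbestand.csv', 'w', newline='') as dst:
        reader = csv.reader(src, delimiter=';')
        writer = csv.writer(dst, delimiter=';')
        for row in reader:
            if row and row[0] == uniquecode:
                # keep the code, blank out the date column
                writer.writerow([uniquecode, ''])
            else:
                writer.writerow(row)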
I have a csv reference data file of around 1m rows. I have a csv data file of 3m rows. I need to perform a reference data lookup for each of the 3m rows into the 1m row csv file.
For various reasons I am constrained to Python and csv. I have tried holding the 1m-row table in a pandas DataFrame in memory, but the whole thing is very slow.
Can someone recommend an alternative approach?
As I mentioned above, a good solution to this type of thing is to dump the CSV into a sqlite db and then just query it as needed :)
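A minimal sketch of that approach, assuming the reference file has a header row with 'key' and 'value' columns to look up by (the filenames and column names are placeholders):

import csv
import sqlite3

conn = sqlite3.connect(':memory:')  # or a file path, to persist the db
conn.execute('CREATE TABLE ref (key TEXT PRIMARY KEY, value TEXT)')

# Load the 1m-row reference file once; the PRIMARY KEY gives an index
with open('reference.csv', newline='') as f:
    conn.executemany(
        'INSERT INTO ref VALUES (?, ?)',
        ((row['key'], row['value']) for row in csv.DictReader(f)),
    )
conn.commit()

# Stream the 3m-row data file and look each row up
with open('data.csv', newline='') as f:
    for row in csv.DictReader(f):
        hit = conn.execute(
            'SELECT value FROM ref WHERE key = ?', (row['key'],)
        ).fetchone()
        print(row['key'], hit[0] if hit else 'no match')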
Here is one idea.
import csv

# Ask for search criteria from the user
search_parts = input("Enter search criteria:\n").split(",")

# Open the csv data file
with open("C:\\your_path_here\\test.csv", newline='') as f:
    reader = csv.reader(f)
    # Go over each row and print it if it contains all the search terms
    for row in reader:
        if all(x in row for x in search_parts):
            print(row)
The Problem
I have a CSV file that contains a large number of items.
The first column can contain either an IP address or random garbage. The only other column I care about is the fourth one.
I have written the below snippet of code in an attempt to check if the first column is an IP address and, if so, write that and the contents of the fourth column to another CSV file side by side.
with open('results.csv','r') as csvresults:
    filecontent = csv.reader(csvresults)
    output = open('formatted_results.csv','w')
    processedcontent = csv.writer(output)
    for row in filecontent:
        first = str(row[0])
        fourth = str(row[3])
        if re.match('\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}', first) != None:
            processedcontent.writerow(["{},{}".format(first,fourth)])
        else:
            continue
    output.close()
This works to an extent. However, when viewed in Excel, both items are placed in a single cell rather than in two adjacent ones. If I open the file in Notepad, I can see that each line is wrapped in quotation marks; if these are removed, Excel displays the columns properly.
Example Input
1.2.3.4,rubbish1,rubbish2,reallyimportantdata
Desired Output
1.2.3.4 reallyimportantdata - two separate columns
Actual Output
"1.2.3.4,reallyimportantdata" - single column
The Question
Is there any way to fudge the format part to not write out with quotations? Alternatively, what would be the best way to achieve what I'm trying to do?
I've tried writing out to another file and stripping the lines but, despite not throwing any errors, the result was the same...
writerow() takes a list of elements and writes each element into its own column. Since you are feeding it a list with only one element, everything is placed into one column.
Instead, feed writerow() a list:
processedcontent.writerow([first,fourth])
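Applied to the loop from the question (with newline='' on both files so the csv module controls the line endings):

import csv
import re

with open('results.csv', newline='') as csvresults, \
     open('formatted_results.csv', 'w', newline='') as output:
    filecontent = csv.reader(csvresults)
    processedcontent = csv.writer(output)
    for row in filecontent:
        first = row[0]
        fourth = row[3]
        if re.match(r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}', first):
            # two list elements -> two adjacent columns
            processedcontent.writerow([first, fourth])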
Have you considered using Pandas?
import re
import pandas as pd

df = pd.read_csv("myFile.csv", header=0, low_memory=False, index_col=None)

fid = open("outputp.csv", "w")
for index, row in df.iterrows():
    aa = re.match(r"^\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}$", row['IP'])
    if aa:
        tline = '{0},{1}\n'.format(row['IP'], row['fourth column'])
        fid.write(tline)
fid.close()
There may be an error or two and I got the regex from here.
This assumes the first row of the csv has titles which can be referenced. If it does not, you can use header=None and reference the columns with iloc.
Come to think of it, you could probably run the regex on the DataFrame directly, copy the first and fourth columns to a new DataFrame, and use the to_csv method in pandas.
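A sketch of that, reusing the 'IP' and 'fourth column' header names assumed above:

import pandas as pd

df = pd.read_csv('myFile.csv', header=0, index_col=None)

# Keep only rows whose IP column matches the dotted-quad pattern
mask = df['IP'].str.match(r'^\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}$', na=False)

# Copy the two columns of interest and write them out
df.loc[mask, ['IP', 'fourth column']].to_csv('outputp.csv', index=False)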
I'm trying to use Python to manipulate data in large txt-files.
I have a txt-file with more than 2000 columns, and about a third of these have a title which contains the word 'Net'. I want to extract only these columns and write them to a new txt file. Any suggestion on how I can do that?
I have searched around a bit but haven't been able to find something that helps me. Apologies if similar questions have been asked and solved before.
EDIT 1: Thank you all! At the time of writing, 3 users have suggested solutions and they all work really well. I honestly didn't think people would answer, so I didn't check for a day or two, and was happily surprised. I'm very impressed.
EDIT 2: I've added a picture that shows what part of the original txt-file can look like, in case it helps anyone in the future.
One way of doing this, without installing third-party modules like numpy/pandas, is as follows. Given an input file called "input.csv" like this:
a,b,c_net,d,e_net
0,0,1,0,1
0,0,1,0,1
The following code does what you want.
import csv

input_filename = 'input.csv'
output_filename = 'output.csv'

# Instantiate a CSV reader and writer; check that you have the
# appropriate delimiter
with open(input_filename, newline='') as infile, \
     open(output_filename, 'w', newline='') as outfile:
    reader = csv.reader(infile, delimiter=',')
    writer = csv.writer(outfile, delimiter=',')

    # Get the first row (assuming this row contains the header)
    input_header = next(reader)

    # Record the indices of the columns you want to keep
    # (note this match is case-sensitive)
    columns_to_keep = []
    for i, name in enumerate(input_header):
        if 'net' in name:
            columns_to_keep.append(i)

    # Construct the header of the output file and write it out
    output_header = []
    for column_index in columns_to_keep:
        output_header.append(input_header[column_index])
    writer.writerow(output_header)

    # Iterate over the remainder of the input file, construct a row
    # with the columns you want to keep, and write it to the output file
    for row in reader:
        new_row = []
        for column_index in columns_to_keep:
            new_row.append(row[column_index])
        writer.writerow(new_row)
Note that there is no error handling. At least two cases should be handled: the first is checking for the existence of the input file (hint: see the functionality provided by the os and os.path modules); the second is handling blank lines or lines with an inconsistent number of columns.
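For instance, a sketch of the same flow with both checks added (same placeholder filenames as above):

import csv
import os.path

input_filename = 'input.csv'
output_filename = 'output.csv'

# First check: fail early if the input file is missing
if not os.path.isfile(input_filename):
    raise SystemExit('input file not found: ' + input_filename)

with open(input_filename, newline='') as infile, \
     open(output_filename, 'w', newline='') as outfile:
    reader = csv.reader(infile, delimiter=',')
    writer = csv.writer(outfile, delimiter=',')
    header = next(reader)
    columns_to_keep = [i for i, name in enumerate(header) if 'net' in name]
    writer.writerow([header[i] for i in columns_to_keep])
    for row in reader:
        # Second check: skip blank lines and rows with too few columns
        if not row or len(row) <= max(columns_to_keep, default=-1):
            continue
        writer.writerow([row[i] for i in columns_to_keep])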
This could be done for instance with Pandas,
import pandas as pd
df = pd.read_csv('path_to_file.txt', sep=r'\s+')
print(df.columns) # check that the columns are parsed correctly
selected_columns = [col for col in df.columns if "net" in col]
df_filtered = df[selected_columns]
df_filtered.to_csv('new_file.txt')
Of course, since we don't have the structure of your text file, you would have to adapt the arguments of read_csv to make this work in your case (see the corresponding documentation).
This will load the whole file into memory and then filter out the unnecessary columns. If your file is so large that it cannot be loaded into RAM at once, there is a way to load only specific columns, using the usecols argument.
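For example, usecols also accepts a callable that is applied to each column name, so the filtering can happen at load time (a sketch, under the same whitespace-separated assumption):

import pandas as pd

# Only columns whose name contains 'net' are ever loaded into memory
df_filtered = pd.read_csv('path_to_file.txt', sep=r'\s+',
                          usecols=lambda name: 'net' in name)
df_filtered.to_csv('new_file.txt')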
You can use the pandas filter function to select columns based on a regex:
data_filtered = data.filter(regex='net')
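For example, as a complete round trip (assuming the same whitespace-separated file as in the previous answer):

import pandas as pd

data = pd.read_csv('path_to_file.txt', sep=r'\s+')

# filter(regex=...) keeps every column whose name matches the pattern
data_filtered = data.filter(regex='net')
data_filtered.to_csv('new_file.txt', index=False)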
Simple problem, but maybe a tricky answer:
The problem is how to handle a huge .txt file with pytables.
I have a big .txt file, with MILLIONS of lines, short lines, for example:
line 1 23458739
line 2 47395736
...........
...........
The content of this .txt file must be saved into a pytable; OK, that's easy, nothing else to do with the info in the txt file, just copy it into pytables. Now we have a pytable with, for example, 10 columns and millions of rows.
The problem comes up when, from the content of the txt file, 10 columns x millions of lines are directly generated in the pytable BUT, depending on the data on each line of the .txt file, new columns must be created in the pytable. So how can this be handled efficiently?
Solution 1: first copy the whole text file, line by line, into the pytable (millions of rows), then iterate over each row of the pytable (millions again) and, depending on the values, generate the new columns needed.
Solution 2: read the .txt file line by line, do whatever is needed, calculate the new values, and then send all the info to a pytable.
Solution 3: ... any other efficient and faster solution?
I think the basic problem here is one of conceptual model. PyTables' Tables only handle regular (or structured) data. However, the data that you have is irregular or unstructured, in that the structure is determined as you read the data. Said another way, PyTables needs the column description to be known completely by the time create_table() is called. There is no way around this.
Since in your problem statement any line may add a new column, you have no choice but to do this in two full passes through the data: (1) read through the data and determine the columns, and (2) write the data to the table. In pseudocode:
import tables as tb

cols = {}

# discover columns
d = open('data.txt')
for line in d:
    for col in line:
        if col not in cols:
            cols[col] = ...  # map each new column name to a column description

# write table
d.seek(0)
f = tb.open_file(...)
t = f.create_table(..., description=cols)
for line in d:
    row = line_to_row(line)
    t.append(row)
d.close()
f.close()
Obviously, if you knew the table structure ahead of time you could skip the first loop and this would be much faster.
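For comparison, a minimal sketch of the known-structure case, assuming each line holds a short label and an integer (the column names and sizes are placeholders):

import tables as tb

class LineRow(tb.IsDescription):
    label = tb.StringCol(16)  # first field on the line
    value = tb.Int64Col()     # last field on the line

with tb.open_file('data.h5', 'w') as f, open('data.txt') as d:
    t = f.create_table('/', 'data', LineRow)
    r = t.row
    for line in d:
        parts = line.split()
        r['label'] = parts[0]
        r['value'] = int(parts[-1])
        r.append()
    t.flush()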