I'm trying to use Python to manipulate data in large txt-files.
I have a txt-file with more than 2000 columns, and about a third of these have a title which contains the word 'Net'. I want to extract only these columns and write them to a new txt file. Any suggestion on how I can do that?
I have searched around a bit but haven't been able to find something that helps me. Apologies if similar questions have been asked and solved before.
EDIT 1: Thank you all! At the moment of writing 3 users have suggested solutions and they all work really well. I honestly didn't think people would answer so I didn't check for a day or two, and was happily surprised by this. I'm very impressed.
EDIT 2: I've added a picture that shows what a part of the original txt-file can look like, in case it will help anyone in the future:
One way of doing this, without the installation of third-party modules like numpy/pandas, is as follows. Given an input file, called "input.csv" like this:
a,b,c_net,d,e_net
0,0,1,0,1
0,0,1,0,1
(remove the blank lines in between, they are just for formatting the content in this post)
The following code does what you want.
import csv
input_filename = 'input.csv'
output_filename = 'output.csv'
# Instantiate a CSV reader, check if you have the appropriate delimiter
reader = csv.reader(open(input_filename), delimiter=',')
# Get the first row (assuming this row contains the header)
input_header = next(reader)
# Filter out the columns that you want to keep by storing the column
# index (note that the match is case-sensitive)
columns_to_keep = []
for i, name in enumerate(input_header):
    if 'net' in name:
        columns_to_keep.append(i)
# Create a CSV writer to store the columns you want to keep
writer = csv.writer(open(output_filename, 'w'), delimiter=',')
# Construct the header of the output file
output_header = []
for column_index in columns_to_keep:
    output_header.append(input_header[column_index])
# Write the header to the output file
writer.writerow(output_header)
# Iterate over the remainder of the input file, construct a row
# with the columns you want to keep and write this row to the output file
for row in reader:
    new_row = []
    for column_index in columns_to_keep:
        new_row.append(row[column_index])
    writer.writerow(new_row)
Note that there is no error handling. There are at least two cases that should be handled. The first one is checking that the input file exists (hint: check the functionality provided by the os and os.path modules). The second one is handling blank lines or lines with an inconsistent number of columns.
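The two cases mentioned above could be handled roughly as follows. This is a minimal sketch, not part of the original answer: the function name extract_columns and its parameters are hypothetical, and it assumes a comma-delimited file whose first row is the header.

```python
import csv
import os


def extract_columns(input_filename, output_filename, keyword='net', delimiter=','):
    """Copy only the columns whose header contains `keyword`.

    Hypothetical helper: returns the number of data rows written.
    """
    # First case: the input file may not exist
    if not os.path.isfile(input_filename):
        raise FileNotFoundError('No such input file: ' + input_filename)

    with open(input_filename, newline='') as fin, \
            open(output_filename, 'w', newline='') as fout:
        reader = csv.reader(fin, delimiter=delimiter)
        writer = csv.writer(fout, delimiter=delimiter)

        header = next(reader, None)
        if header is None:
            return 0  # empty input file, nothing to copy

        keep = [i for i, name in enumerate(header) if keyword in name]
        writer.writerow([header[i] for i in keep])

        written = 0
        for row in reader:
            # Second case: skip blank lines and rows with the wrong
            # number of columns
            if len(row) != len(header):
                continue
            writer.writerow([row[i] for i in keep])
            written += 1
    return written
```

Silently skipping malformed rows is only one possible policy; depending on the data it may be safer to log them or raise an error instead.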
This could be done for instance with Pandas,
import pandas as pd
df = pd.read_csv('path_to_file.txt', sep=r'\s+')
print(df.columns) # check that the columns are parsed correctly
selected_columns = [col for col in df.columns if "net" in col]
df_filtered = df[selected_columns]
# index=False keeps the row index from being written as an extra column
df_filtered.to_csv('new_file.txt', index=False)
Of course, since we don't have the structure of your text file, you would have to adapt the arguments of read_csv to make this work in your case (see the corresponding documentation).
This will load the whole file into memory and then filter out the unnecessary columns. If your file is so large that it cannot be loaded into RAM at once, there is a way to load only specific columns with the usecols argument.
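As a rough illustration of the usecols route: read_csv accepts a callable for usecols, which is called with each column name and keeps the columns for which it returns True. The helper name below is hypothetical, and the whitespace separator is an assumption carried over from the snippet above.

```python
import pandas as pd


def load_matching_columns(path, keyword='net', sep=r'\s+'):
    """Hypothetical helper: parse only the columns whose name contains `keyword`."""
    # The callable is evaluated against each column name; columns for
    # which it returns False are not loaded into the DataFrame.
    return pd.read_csv(path, sep=sep, usecols=lambda name: keyword in name)
```

The result can then be written out as before, for example with load_matching_columns('path_to_file.txt').to_csv('new_file.txt', index=False).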
You can use the pandas filter function to select the columns whose names match a regex:
data_filtered = data.filter(regex='net')
Related
I am trying to get the dimensions (shape) of a data frame using pandas in python without reading the entire data frame first in memory given that the file is quite large.
To get the number of columns with minimal loading of the file into the memory, I can for example use the argument below.
import pandas as pd
df = pd.read_csv("myData.csv", nrows=1)
print(df.shape)
To get the number of rows I can use the argument usecols = [1] when reading the file, but there must be a simpler way of doing this.
If there are other packages or scripts that can easily give me such metadata information, I would be happy as well. It is really metadata I am looking for such as column names, number of rows, number of columns etc but I don't want to read the entire file in!
You don't even need pandas for this. Use the built-in csv module to parse the file:
import csv
with open('myData.csv') as fp:
    reader = csv.reader(fp)
    headers = next(reader) # The header row is now consumed
    ncol = len(headers)
    nrow = sum(1 for _ in reader) # What remains are the data rows
I know this topic has been extensively treated, but I'm not able to get what I want, sorry about the probably newbie question. So the thing is I have a CSV like this:
Date,"Tmax","Tmin","Tmedia","Rachas","Vmax","LT","L1","L2","L3","L4"
23 nov 2018,"14.0 (15:30)","7.3 (23:59)","10.7","12 (14:50)","5 (14:50)","2.0","1.6","0.4","0.0","0.0"
I am getting a new CSV like that one each day, with multiple rows, but I'm interested only in the first row after the header. What I want to do is copying that first row each day to a new CSV iteratively, so at the end of the week, that CSV should have seven rows. Additionally, I'd like to check if that date is already in that daily file. The thing is that I'm not getting the new CSV right, here's my try:
import pandas as pd
df = pd.read_csv('file.csv', skiprows=4, header=None)
writer=df[df.index.isin([0])].to_csv('output.csv',header=None)
The problem with this code is that it overwrites the file output.csv each time. Then I considered changing it to:
writer=df[df.index.isin([0])]
pd.read_csv('output.csv').append(writer).to_csv('output.csv',header=None)
The problem now is that it does need the file to previously exist; and even so, the information is not correctly copied to the new file. I think it must be simpler than this, but I'm stuck. Thanks for your help.
If you only want the first row after the header, read the header and just use nrows=1. Then use open in append mode to write your one-row dataframe to the end of the csv file. The header=False argument deals nicely with excluding the header when writing.
df = pd.read_csv('file.csv', nrows=1)
with open('output.csv', 'a') as fout:
    df.to_csv(fout, header=False)
I've omitted skiprows=4 because it's not clear how this relates to your input data.
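The question also asks to skip the row when its date is already in the output file. A minimal sketch of that check using only the standard csv module, assuming the date is the first field of each row; the function name append_first_row is made up for illustration.

```python
import csv
import os


def append_first_row(daily_filename, output_filename):
    """Append the first data row of the daily file unless its date is already present.

    Returns True if a row was appended, False otherwise.
    """
    with open(daily_filename, newline='') as fin:
        reader = csv.reader(fin)
        next(reader, None)              # skip the header row
        first_row = next(reader, None)  # first data row, or None if absent
    if not first_row:
        return False

    # Collect the dates already stored; the output file may not exist yet
    seen_dates = set()
    if os.path.exists(output_filename):
        with open(output_filename, newline='') as fout:
            seen_dates = {row[0] for row in csv.reader(fout) if row}

    if first_row[0] in seen_dates:
        return False

    # Append mode creates the file on first use
    with open(output_filename, 'a', newline='') as fout:
        csv.writer(fout).writerow(first_row)
    return True
```

Calling append_first_row('file.csv', 'output.csv') once per day would then leave seven rows in output.csv after a week, even if the script happens to run twice on the same day.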
def deletedata(uniquecode):
    with open('Stallingsbestand.csv', 'r+') as CSV:
        writer = csv.writer(CSV, delimiter=';')
        for row in CSV:
            if uniquecode in row:
                writer.writerow((uniquecode, ''))
Stallingsbestand.csv consists of rows that look like this:
uniquecode;Date_of_last_opening_a_function
I want to be able to delete the date of last opening and just have the unique code there.
(appending False at the end of the row can work too but I don't know which is easier)
I thought that just overwriting the row would be the easiest but I can't get it to work. Is there anyone who knows how to make this work?
You want to rename the file to Stallingsbestand.old, and write out a new version of Stallingsbestand.csv. One way to do this is to copy (sometimes) modified rows from a csv.reader to a csv.writer within a loop, similar to your current code.
You might find it more convenient to create an in-memory dataframe with pandas.read_csv(), mutate one of its rows, and then persist it with DataFrame.to_csv().
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.to_csv.html
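A minimal sketch of the reader-to-writer approach, assuming the ';' delimiter from the question and that the unique code is the first field of each row. Instead of keeping a .old copy, it writes to a temporary file (the '.tmp' suffix is an arbitrary choice) and then swaps it in with os.replace.

```python
import csv
import os


def deletedata(uniquecode, filename='Stallingsbestand.csv'):
    """Blank the date field of the row whose first field equals `uniquecode`."""
    tmp_filename = filename + '.tmp'
    with open(filename, newline='') as fin, \
            open(tmp_filename, 'w', newline='') as fout:
        reader = csv.reader(fin, delimiter=';')
        writer = csv.writer(fout, delimiter=';')
        for row in reader:
            if row and row[0] == uniquecode:
                row = [uniquecode, '']  # keep the code, drop the date
            writer.writerow(row)
    # Replace the original only after the new file is completely written
    os.replace(tmp_filename, filename)
```

Rows that do not match are copied through unchanged, so the original file is never left half-written.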
The Problem
I have a CSV file that contains a large number of items.
The first column can contain either an IP address or random garbage. The only other column I care about is the fourth one.
I have written the below snippet of code in an attempt to check if the first column is an IP address and, if so, write that and the contents of the fourth column to another CSV file side by side.
with open('results.csv','r') as csvresults:
    filecontent = csv.reader(csvresults)
    output = open('formatted_results.csv','w')
    processedcontent = csv.writer(output)
    for row in filecontent:
        first = str(row[0])
        fourth = str(row[3])
        if re.match('\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}', first) != None:
            processedcontent.writerow(["{},{}".format(first,fourth)])
        else:
            continue
    output.close()
This works to an extent. However, when viewing in Excel, both items are placed in a single cell rather than two adjacent ones. If I open it in notepad I can see that each line is wrapped in quotation marks. If these are removed Excel will display the columns properly.
Example Input
1.2.3.4,rubbish1,rubbish2,reallyimportantdata
Desired Output
1.2.3.4 reallyimportantdata - two separate columns
Actual Output
"1.2.3.4,reallyimportantdata" - single column
The Question
Is there any way to fudge the format part to not write out with quotations? Alternatively, what would be the best way to achieve what I'm trying to do?
I've tried writing out to another file and stripping the lines but, despite not throwing any errors, the result was the same...
writerow() takes a list of elements and writes each of those into a column. Since you are feeding a list with only one element, it is being placed into one column.
Instead, feed writerow() a list with one element per column:
processedcontent.writerow([first,fourth])
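Put together, the loop from the question might look like the sketch below. It is wrapped in a function whose name, filter_ip_rows, is hypothetical; the regex is the one from the question, and newline='' is what the csv documentation recommends when opening files for the csv module.

```python
import csv
import re

IP_PATTERN = re.compile(r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}')


def filter_ip_rows(input_filename, output_filename):
    with open(input_filename, newline='') as fin, \
            open(output_filename, 'w', newline='') as fout:
        writer = csv.writer(fout)
        for row in csv.reader(fin):
            # Skip rows too short to have a fourth column
            if len(row) < 4:
                continue
            if IP_PATTERN.match(row[0]):
                # Two list elements -> two separate columns
                writer.writerow([row[0], row[3]])
```

With the files from the question this would be called as filter_ip_rows('results.csv', 'formatted_results.csv').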
Have you considered using Pandas?
import re
import pandas as pd
df = pd.read_csv("myFile.csv", header=0, low_memory=False, index_col=None)
fid = open("outputp.csv", "w")
for index, row in df.iterrows():
    aa = re.match(r"^\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}$", str(row['IP']))
    if aa:
        # End each record with a newline so rows do not run together
        tline = '{0},{1}\n'.format(row['IP'], row['fourth column'])
        fid.write(tline)
fid.close()
There may be an error or two and I got the regex from here.
This assumes the first row of the csv has titles which can be referenced. If it does not, you can use header=None and reference the columns with iloc.
Come to think of it, you could probably run the regex on the DataFrame, copy the first and fourth columns to a new DataFrame and use the to_csv method in pandas.
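That idea might look like the following sketch. It assumes the file has no header row, so the columns are addressed by position; the function name extract_ip_rows is hypothetical.

```python
import pandas as pd

IP_REGEX = r'^\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}$'


def extract_ip_rows(input_filename, output_filename):
    # dtype=str keeps every field as text so the regex can be applied
    df = pd.read_csv(input_filename, header=None, dtype=str)
    # Boolean mask: True where the first column looks like an IP address;
    # na=False treats empty cells as non-matching
    is_ip = df.iloc[:, 0].str.match(IP_REGEX, na=False)
    # Keep the first and fourth columns of the matching rows
    df.loc[is_ip, [0, 3]].to_csv(output_filename, header=False, index=False)
```

This avoids the explicit Python loop, at the cost of loading the whole file into memory first.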
I have a two-column csv which I have uploaded via an HTML page to be operated on by a python cgi script. Looking at the file on the server side, it looks to be a long string i.e for a file called test.csv with the contents.
col1, col2
x,y
has become
('upfile', 'test.csv', 'col1,col2'\t\r\nx,y')
Col1 contains the data I want to operate on (i.e. x) and col 2 contains its identifier (y). Is there a better way of doing the uploading or do I need to manually extract the fields I want - this seems potentially very error-prone
thanks
If you're using the cgi module in python, you should be able to do something like:
form = cgi.FieldStorage()
thefile = form['upfile']
reader = csv.reader(thefile.file)
header = next(reader) # list of column names
for row in reader:
    # row is a list of fields
    process_row(row)
See, for example, cgi programming or the python cgi module docs.
Can't you use the csv module to parse this? It's certainly better than rolling your own.
Something along the lines of
import csv
import cgi
form = cgi.FieldStorage()
thefile = form['upfile']
# csv.reader needs the uploaded file object, not the form field itself
reader = csv.reader(thefile.file, delimiter=',')
for row in reader:
    for field in row:
        doThing()
EDIT: Correcting my answer from the ars answer posted below.
Looks like your file is becoming modified by the HTML upload. Is there anything stopping you from just ftp'ing in and dropping the csv file where you need it?
Once the CSV file is more proper, here is a quick function that will put it into a 2D array:
import csv

def genTableFrCsv(incsv):
    table = []
    # Text mode with newline='' is what the csv module expects in Python 3
    fin = open(incsv, newline='')
    reader = csv.reader(fin)
    for row in reader:
        table.append(row)
    fin.close()
    return table
From here you can then operate on the whole list in memory rather than pulling bit by bit from the file as in Vitor's solution.
The easy solution is rows = [r.split('\t') for r in csv_string.split('\r\n')]. It's only error-prone if you have users from different platforms submitting data. They might submit commas or tabs, and their line breaks could be \n, \r\n, \r, or ^M. The easiest solution is to use regular expressions. Bookmark this page if you don't know regular expressions:
http://regexlib.com/CheatSheet.aspx
And here's the solution:
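import re
csv_string = 'col1,col2\t\r\nx,y' # obviously your csv opening code goes here
# One (first, second) tuple per line; the fields are separated by a tab or comma
rows = re.findall(r'^([^\t,\r\n]*)[\t,]([^\t,\r\n]*)', csv_string, re.MULTILINE)
rows = rows[1:] # remove header
rows is now a list of tuples, one for each data row.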
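For comparison, the csv module can also parse text that is already held in a string, which avoids hand-written regular expressions. A minimal sketch, assuming comma-separated fields; the sample string mirrors the upload shown in the question and the function name parse_csv_string is made up.

```python
import csv
import io


def parse_csv_string(csv_string):
    """Return (header, rows) parsed from CSV text held in a string."""
    # newline='' leaves the line endings untranslated for the csv module
    reader = csv.reader(io.StringIO(csv_string, newline=''))
    # Strip stray whitespace such as a trailing tab, and drop blank lines
    rows = [[field.strip() for field in row] for row in reader if row]
    if not rows:
        return [], []
    return rows[0], rows[1:]


header, rows = parse_csv_string('col1,col2\t\r\nx,y')
```

Here header holds the column names and rows holds one list of fields per data row.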