I have a data set that is more than 100 MB in size, spread across many files. The files have more than 20 columns and over a million rows.
The main problems with the data are:
Repeating headers -- duplicate header rows
Fully duplicated rows, i.e. the data in every column of that particular row is duplicated.
Without worrying about which or how many columns are involved, I only need to keep the first occurrence and remove the rest.
I did find plenty of examples, but what I am looking for is for the input and the output to be the same file. The only reason I'm asking for help is that I want the same file to be edited in place.
Sample input is here:
https://www.dropbox.com/s/sl7y5zm0ppqfjn6/sample_duplicate.csv?dl=0
I appreciate the help, thanks in advance.
If the number of duplicate headers is known and constant, skip those rows:
csv = pd.read_csv('https://www.dropbox.com/s/sl7y5zm0ppqfjn6/sample_duplicate.csv?dl=1', skiprows=4)
Alternatively, and this comes with the bonus of removing all duplicates based on all columns, do this:
csv = pd.read_csv('https://www.dropbox.com/s/sl7y5zm0ppqfjn6/sample_duplicate.csv?dl=1')
csv = csv.drop_duplicates()
Now you still have one header line left in the data; just skip it:
csv = csv.iloc[1:]
You can then overwrite the input file with pandas.DataFrame.to_csv.
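Putting it together, a minimal sketch of the whole in-place clean-up, assuming a local copy of the file named sample_duplicate.csv (adjust the path to your case):
import pandas as pd

path = 'sample_duplicate.csv'  # hypothetical local copy of the input file

df = pd.read_csv(path)
df = df.drop_duplicates()  # keep the first occurrence of every fully duplicated row
df = df.iloc[1:]           # drop the leftover header-as-data row (assumes it is the first row, as above)
df.to_csv(path, index=False)  # write back to the same path, replacing the original file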
Related
I have 2 csv files that contain related information; each row of one csv file corresponds to a row in the other file. To prepare the data, I needed to remove certain values from the first csv file, which meant removing certain rows from it. Now when I print those rows out, the index jumps around: for example, a certain portion of the first csv file jumps from row number 20838 to 20842, 20843, etc.
So what I want to do is compare the first csv file (which had rows removed) with the second csv file, remove from the second csv file the rows that are no longer in the first one, and then renumber all the rows so that both csv files have rows listed from 0 to 20000. I am using pandas and numpy.
This is the code I have used to remove the information from the first csv file:
data_csv1 = pd.read_csv("address1")
data_csv2 = pd.read_csv("address2")
data_csv1 = data_csv1.drop(data_csv1.columns[[0]], axis=1)  # drop the first column
data_csv1 = data_csv1[(data_csv1 !=0).all(1)]
How would I go about doing this? I personally do not care if the data is removed or simply ignored, I just need both csv files to contain the same row numbers.
Assuming that your two files started out with exactly the same index, you can apply the index of the first file to the second file after post-processing:
data_csv2 = data_csv2.iloc[data_csv1.index]
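For completeness, a minimal sketch of the whole flow, assuming both files were read with the default RangeIndex (the file names are just the placeholders from the question):
import pandas as pd

data_csv1 = pd.read_csv("address1")
data_csv2 = pd.read_csv("address2")

# filter the first file, keeping its original (positional) index labels
data_csv1 = data_csv1[(data_csv1 != 0).all(axis=1)]

# keep only the surviving row positions in the second file
data_csv2 = data_csv2.iloc[data_csv1.index]

# renumber both frames 0..N-1 so the row labels line up again
data_csv1 = data_csv1.reset_index(drop=True)
data_csv2 = data_csv2.reset_index(drop=True)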
I have a csv file that contains extra rows at the very end (the last 9 rows) that are important but do not fit the schema at all and need to be processed differently. They just contain the number of clicks for different sites. I want to split these last few rows off from the original csv and save them as a different file.
So far, I can get the most important rows out using pandas by skipping the footer. If the number of rows were consistent, I could do the same for saving the footer using something like skiprows=range(0, 2000), but the number of rows changes from file to file.
The code to save all the main rows is as follows:
reader = pd.read_csv(os.path.join(DATA_DIR, file), encoding='utf8', header=0, skipfooter=9, index_col=0)
trimmed_file_name = 'trimmed_{}'.format(file)
full_path = os.path.join(DATA_DIR, trimmed_file_name)  # os.path.join handles the separator, no escaping issues
print(full_path)
reader.to_csv(full_path, mode='a')
So how do I just get out those last 9 rows without 'skiprows'? Any ideas? The footer is consistently the last 9 rows if that helps.
After reading in the first dataframe, we know how many regular rows there are, so just read the remaining rows with:
footer = pd.read_csv(file, skiprows=len(reader) + 1, header=None)  # +1 accounts for the header row consumed by the first read
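If you also want to persist those footer rows, a minimal sketch under the same assumptions (the output file name is made up):
import os
import pandas as pd

main = pd.read_csv(file, header=0, skipfooter=9, index_col=0, engine='python')  # skipfooter needs the python engine
footer = pd.read_csv(file, skiprows=len(main) + 1, header=None)

footer_path = os.path.join(DATA_DIR, 'footer_{}'.format(file))  # hypothetical output name
footer.to_csv(footer_path, index=False, header=False)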
It seems that you can look at columns in a file no problem, but there's no apparent way to look at rows. I know I can read the entire file (CSV or excel) into a crazy huge dataframe in order to select rows, but I'd rather be able to grab particular rows straight from the file and store those in a reasonably sized dataframe.
I do realize that I could just transpose/pivot the df before saving it to the aforementioned CSV/Excel file. This would be a problem for Excel because I'd run out of columns (the transposed rows) far too quickly. I'd rather use Excel than CSV.
My original, not transposed data file has 9000+ rows and 20ish cols. I'm using Excel 2003 which supports up to 256 columns.
EDIT: I figured out a solution that works for me. It's a lot simpler than I expected. I ended up using CSV instead of Excel (I found no serious difference for my project). Here it is for whoever may have the same problem:
import pandas as pd
selectionList = (2, 43, 792, 4760) #rows to select
df = pd.read_csv(your_csv_file, index_col=0).T
selection = {}
for item in selectionList:
    selection[item] = df[item]
selection = pd.DataFrame.from_dict(selection)
selection.T.to_csv(your_path)
I think you can use the skiprows and nrows arguments in pandas.read_csv to pick out individual rows to read in.
With skiprows, you can provide a long list (0-indexed) of row numbers not to import, e.g. [0, 5, 6, 10]. That might end up being a huge list, though. If you provide it a single integer, it will skip that number of rows and then start reading. Set nrows to the number of rows you want to read from the point where it starts.
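For example, a minimal sketch (the file name and row numbers are made up) that grabs 50 rows starting at data row 1000 without loading the rest:
import pandas as pd

# keep the header (row 0), skip data rows 1..1000, then read the next 50 rows
rows = pd.read_csv("big_file.csv", skiprows=range(1, 1001), nrows=50)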
If I've misunderstood the issue, let me know.
I'm trying to use Python to manipulate data in large txt files.
I have a txt file with more than 2000 columns, and about a third of them have a title containing the word 'Net'. I want to extract only these columns and write them to a new txt file. Any suggestions on how I can do that?
I have searched around a bit but haven't been able to find something that helps me. Apologies if similar questions have been asked and solved before.
EDIT 1: Thank you all! At the moment of writing 3 users have suggested solutions and they all work really well. I honestly didn't think people would answer so I didn't check for a day or two, and was happily surprised by this. I'm very impressed.
EDIT 2: I've added a picture that shows what part of the original txt file can look like, in case it helps anyone in the future.
One way of doing this, without installing third-party modules like numpy/pandas, is as follows. Given an input file called "input.csv" like this:
a,b,c_net,d,e_net
0,0,1,0,1
0,0,1,0,1
The following code does what you want.
import csv

input_filename = 'input.csv'
output_filename = 'output.csv'

# Instantiate a CSV reader; check that you have the appropriate delimiter
reader = csv.reader(open(input_filename), delimiter=',')

# Get the first row (assuming this row contains the header)
input_header = next(reader)

# Record the indices of the columns that you want to keep
columns_to_keep = []
for i, name in enumerate(input_header):
    if 'net' in name:
        columns_to_keep.append(i)

# Create a CSV writer to store the columns you want to keep
writer = csv.writer(open(output_filename, 'w', newline=''), delimiter=',')

# Construct the header of the output file
output_header = []
for column_index in columns_to_keep:
    output_header.append(input_header[column_index])

# Write the header to the output file
writer.writerow(output_header)

# Iterate over the remainder of the input file, construct a row
# with the columns you want to keep and write this row to the output file
for row in reader:
    new_row = []
    for column_index in columns_to_keep:
        new_row.append(row[column_index])
    writer.writerow(new_row)
Note that there is no error handling. There are at least two cases that should be handled. The first is checking for the existence of the input file (hint: check the functionality provided by the os and os.path modules). The second is handling blank lines or lines with an inconsistent number of columns.
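As a rough sketch of what that error handling could look like (not the only way to do it):
import os
import csv

input_filename = 'input.csv'

# check that the input file exists before trying to read it
if not os.path.isfile(input_filename):
    raise SystemExit('input file not found: %s' % input_filename)

with open(input_filename, newline='') as f:
    reader = csv.reader(f, delimiter=',')
    header = next(reader)
    for row in reader:
        if not row:                   # skip blank lines
            continue
        if len(row) != len(header):   # skip rows with an inconsistent column count
            continue
        # the row is valid; process it here
        pass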
This could be done, for instance, with pandas:
import pandas as pd
df = pd.read_csv('path_to_file.txt', sep='\s+')
print(df.columns) # check that the columns are parsed correctly
selected_columns = [col for col in df.columns if "net" in col]
df_filtered = df[selected_columns]
df_filtered.to_csv('new_file.txt')
Of course, since we don't have the structure of your text file, you will have to adapt the arguments of read_csv to make this work in your case (see the corresponding documentation).
This will load the whole file into memory and then filter out the unnecessary columns. If your file is so large that it cannot be loaded into RAM at once, you can load only specific columns with the usecols argument.
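For instance, a minimal sketch using a callable usecols so that only the matching columns are ever parsed (the separator and the case-insensitive match are assumptions about your file):
import pandas as pd

df_filtered = pd.read_csv('path_to_file.txt', sep=r'\s+',
                          usecols=lambda name: 'net' in name.lower())
df_filtered.to_csv('new_file.txt')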
You can use the pandas filter function to select a few columns based on a regex:
data_filtered = data.filter(regex='net')
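In context, a minimal sketch (the separator and the case-insensitive regex are assumptions about your file):
import pandas as pd

data = pd.read_csv('path_to_file.txt', sep=r'\s+')
data_filtered = data.filter(regex='(?i)net')  # (?i) makes the match case-insensitive
data_filtered.to_csv('new_file.txt', index=False)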
A simple problem, but maybe a tricky answer:
The problem is how to handle a huge .txt file with PyTables.
I have a big .txt file, with MILLIONS of short lines, for example:
line 1 23458739
line 2 47395736
...........
...........
The content of this .txt file must be saved into a pytable; OK, that's easy. Nothing else needs to be done with the info in the txt file, just copy it into pytables, and now we have a pytable with, for example, 10 columns and millions of rows.
The problem comes up when, from the content of the .txt file, 10 columns x millions of lines are directly generated in the pytable BUT, depending on the data on each line of the .txt file, new columns must be created in the pytable. So how can this be handled efficiently?
Solution 1: first copy the whole text file, line by line, into the pytable (millions of rows), and then iterate over each row of the pytable (millions again) and, depending on the values, generate the new columns needed.
Solution 2: read the .txt file line by line, do whatever is needed, calculate the new values, and then send all the info to the pytable.
Solution 3: ... any other efficient and faster solution?
I think the basic problem here is one of conceptual model. PyTables' Tables only handle regular (or structured) data. However, the data that you have is irregular or unstructured, in that the structure is determined as you read the data. Said another way, PyTables needs the column description to be known completely by the time create_table() is called. There is no way around this.
Since in your problem statement any line may add a new column, you have no choice but to do this in two full passes through the data: (1) read through the data and determine the columns, and (2) write the data to the table. In pseudocode:
import tables as tb
cols = {}

# discover columns
d = open('data.txt')
for line in d:
    for col in line:
        if col not in cols:
            cols['colname'] = col

# write table
d.seek(0)
f = tb.open_file(...)
t = f.create_table(..., description=cols)
for line in d:
    row = line_to_row(line)
    t.append(row)
d.close()
f.close()
Obviously, if you knew the table structure ahead of time you could skip the first loop and this would be much faster.
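To make that concrete, a minimal runnable sketch of the two-pass approach, assuming whitespace-separated numeric values and generated column names like col0, col1, ... (your real column-discovery rule will depend on your data):
import tables as tb

txt_path = 'data.txt'  # hypothetical input file

# pass 1: discover how many columns are needed
n_cols = 0
with open(txt_path) as fh:
    for line in fh:
        n_cols = max(n_cols, len(line.split()))

# build the full column description before creating the table
description = {'col%d' % i: tb.Float64Col(pos=i) for i in range(n_cols)}

# pass 2: copy the data into the table
with tb.open_file('data.h5', mode='w') as h5, open(txt_path) as fh:
    table = h5.create_table('/', 'data', description=description)
    row = table.row
    for line in fh:
        for i, value in enumerate(line.split()):
            row['col%d' % i] = float(value)
        row.append()  # short lines leave the missing columns at their default (0.0)
    table.flush()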