I want to create a matrix from a CSV file.
Here's what I've tried:
df = pd.read_csv('csv-path', usecols=[0,1], names=['A', 'B'])
pd.pivot_table(df,columns='A', values='B')
Output: [9197337 rows x 2 columns].
I want to take fewer rows, e.g. build the matrix from only the first 100 or 1000 entries. How can I do that?
Since the csv module only deals in complete files, it would be easiest to extract the lines of interest before you use it. You could do this before running your program with the Unix head utility. Here's one way that should work in Python:
with open("csv-path") as inf, open("mod_csv_path", "w") as outf:
    for i in range(1000):
        outf.write(inf.readline())
Obviously you'd then read "mod_csv_path" rather than "csv-path" as the input file.
Pandas seems to be the right approach. Can you provide a sample of your CSV file?
Also, with pandas, you can limit the size of your dataframe:
limited_df = df.head(num_elements)
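If the goal is to avoid ever loading all 9 million rows, read_csv's nrows parameter (not used above, but a standard pandas option) can cap what gets read. A minimal sketch reusing the question's column names:
import pandas as pd

# Read only the first 1000 rows of the file instead of all ~9.2 million
df = pd.read_csv('csv-path', usecols=[0, 1], names=['A', 'B'], nrows=1000)
matrix = pd.pivot_table(df, columns='A', values='B')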
I have to split a dataframe containing 15000 rows into sections of 300 rows each.
split_df = np.array_split(data, np.arange(0, len(data),300))
I need to convert the split groups into dataframes and then write each one to a csv.
Any ideas?
I have one idea.
split_df = np.array_split(data, np.arange(0, len(data), 300))
for i in range(len(split_df)):
    split_df[i].to_csv('data' + str(i) + '.csv')
You can use this code to output everything as a csv
I hope it fits what you want to do.
The only thing is that this code will take some time to finish running.
Please be aware of that.
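One small aside: because np.arange starts at 0, np.array_split also returns an empty first piece (data[:0]), which produces one header-only csv. A sketch that avoids it, assuming data is the 15000-row dataframe:
import numpy as np

# Split at 300, 600, ... so the first piece is data[0:300] and no empty chunk appears
split_df = np.array_split(data, np.arange(300, len(data), 300))
for i in range(len(split_df)):
    split_df[i].to_csv('data' + str(i) + '.csv')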
I've got about 40 tsv files, with the size of any given tsv ranging from 250mb to 3GB. I'm looking to pull data from the tsvs where rows contain certain values.
My current approach is far from efficient:
nums_to_look = ['23462346', '35641264', ... , '35169331'] # being about 40k values I'm interested in
all_tsv_files = glob.glob(PATH_TO_FILES + '*.tsv')
all_dfs = []
for file in all_tsv_files:
    df = pd.read_csv(file, sep='\t')
    # Extract rows which match values in nums_to_look
    df = df[df['col_of_interest'].isin(nums_to_look)].reset_index(drop=True)
    all_dfs.append(df)
Surely there's a much more efficient way to do this without having to read in every single file fully, and go through the entire file?
Any thoughts / insights would be much appreciated.
Thanks!
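One hedged sketch of a lighter-weight approach (the chunk size and the str dtype for col_of_interest are assumptions, since the tsv layout isn't shown): turn nums_to_look into a set and stream each file in chunks, so no 3GB file is ever held in memory whole:
import glob
import pandas as pd

nums_to_look = set(nums_to_look)  # set membership checks are O(1); isin accepts a set

filtered_parts = []
for file in glob.glob(PATH_TO_FILES + '*.tsv'):
    # Stream the file in pieces instead of reading it fully
    for chunk in pd.read_csv(file, sep='\t', chunksize=500000,
                             dtype={'col_of_interest': str}):
        filtered_parts.append(chunk[chunk['col_of_interest'].isin(nums_to_look)])

result = pd.concat(filtered_parts, ignore_index=True)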
I have an empty data frame with about 120 columns, and I want to fill it using data I have in a file.
I'm iterating over a file that has about 1.8 million lines.
(The lines are unstructured, I can't load them to a dataframe directly)
For each line in the file I do the following:
Extract the data I need from the current line
Copy the last row in the data frame and append it to the end: df = df.append(df.iloc[-1]). The copy is critical; most of the data in the previous row won't be changed.
Change several values in the last row according to the data I've extracted df.iloc[-1, df.columns.get_loc('column_name')] = some_extracted_value
This is very slow; I assume the fault is in the append.
What is the correct approach to speed things up? Preallocate the dataframe?
EDIT:
After reading the answers I did the following:
I preallocated the dataframe (saved like 10% of the time)
I replaced df = df.append(df.iloc[-1]) with df.iloc[i] = df.iloc[i-1] (i is the current iteration in the loop). (Saved about another 10% of the time.)
I did profiling; even though I removed the append, the main issue is copying the previous line: df.iloc[i] = df.iloc[i-1] takes about 95% of the time.
You may need plenty of memory, whichever option you choose.
However, what you should certainly avoid is using pd.DataFrame.append within a loop. This is expensive versus list.append.
Instead, aggregate to a list of lists, then feed into a dataframe. Since you haven't provided an example, here's some pseudo-code:
# initialize empty list
L = []
for line in my_binary_file:
    # extract components required from each line to a list of Python types
    line_vars = [line['var1'], line['var2'], line['var3']]
    # append to list of results
    L.append(line_vars)
# create dataframe from list of lists
df = pd.DataFrame(L, columns=['var1', 'var2', 'var3'])
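Since the question notes that most of a row carries over from the previous one, a hedged variation of the same idea keeps the previous row as a plain Python list and copies it before overwriting the few changed fields; columns, my_file and extract_changes below are placeholders:
import pandas as pd

columns = ['col_a', 'col_b', 'col_c']            # placeholder for the ~120 real column names
prev_row = [None] * len(columns)

rows = []
for line in my_file:
    row = list(prev_row)                         # cheap list copy instead of df.iloc[i] = df.iloc[i-1]
    for col_idx, value in extract_changes(line): # placeholder: yields (column index, new value) pairs
        row[col_idx] = value
    rows.append(row)
    prev_row = row

df = pd.DataFrame(rows, columns=columns)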
The fastest way would be to load the dataframe directly via pd.read_csv().
Try separating out the logic that cleans the unstructured data into structured data, then use pd.read_csv to load the dataframe.
If you can share a sample unstructured line and the logic used to extract the structured data, that might give some more insight.
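A minimal sketch of that two-pass split, assuming a hypothetical parse_line helper that turns one raw line into a list of field strings (the file names are also placeholders):
import pandas as pd

# Pass 1: clean the unstructured lines into a structured, comma-separated file
with open('raw_input.txt') as src, open('structured.csv', 'w') as dst:
    for line in src:
        fields = parse_line(line)      # placeholder: returns a list of field strings
        dst.write(','.join(fields) + '\n')

# Pass 2: let read_csv's optimized parser build the dataframe in one go
df = pd.read_csv('structured.csv', header=None)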
Where you use append, you end up copying the dataframe, which is inefficient. Try the whole thing again, but avoid this line:
df = df.append(df.iloc[-1])
You could do something like this to copy the last row to a new row (only do this if the last row contains information that you want in the new row):
df.iloc[...calculate the next available index...] = df.iloc[-1]
Then edit the last row accordingly as you have done
df.iloc[-1, df.columns.get_loc('column_name')] = some_extracted_value
You could try some multiprocessing to speed things up:
from multiprocessing.dummy import Pool as ThreadPool  # thread-based pool
import pandas as pd

def YourCleaningFunction(line):
    # placeholder: clean one raw line and return a list of fields
    # (or use the kind of function jpp just provided)
    return line.split(',')

pool = ThreadPool(8)  # your number of cores
with open('your_big_csv.csv') as f:
    lines = f.read().splitlines()  # your csv as a list of lines

rows = pool.map(YourCleaningFunction, lines)  # clean the lines in parallel
pool.close()
pool.join()

df = pd.DataFrame(rows)
# Note: multiprocessing.dummy uses threads, which are limited by the GIL for
# CPU-bound parsing; multiprocessing.Pool may be needed for a real speed-up.
I have a large text file, separated by semicolons. I am trying to retrieve the values of a column (e.g. the second column) and work on it iteratively using numpy. An example of the data contained in the text file is given below:
10862;2;1;1;0;0;0;3571;0;
10868;2;1;1;1;0;0;3571;0;
10875;2;1;1;1;0;0;3571;0;
10883;2;1;1;1;0;0;3571;0;
...
11565;2;1;1;1;0;0;3571;0;
11572;2;1;1;1;0;0;3571;0;
11579;2;1;1;1;0;0;3571;0;
11598;2;1;1;1;0;0;3571;0;
11606;2;1;1;
Please note that the last line may not contain the same number of values as the previous ones.
I am trying to use pandas.read_csv to read this large file by chunks. For the purpose of the example, let's assume that the chunk size is 40.
I have tried so far 2 different approaches:
1)
Set nrows, and iteratively increase the skiprows so as to read the entire file by chunk.
nrows_set = 40
n_it = 0
while(1):
    df = pd.read_csv(filename, nrows=nrows_set, sep=';', skiprows=n_it * nrows_set)
    vect2 = df[1]  # trying to access the values of the second column -- works
    n_it = n_it + 1
Issue when accessing the end of the file: pandas generates an error when one tries to read a number of rows bigger than the number of rows contained in the file.
For example, if the file contains 20 lines and nrows is set to 40, the file cannot be read. My first approach hence generated a bug when I was trying to read the last 40 lines of my file, when fewer than 40 lines were remaining.
I do not know how to check for the end of file before trying to read from the file - and I do not want to load the entire file to obtain the total rows number since the file is large. Hence, I tried a second approach.
2)
Set chunksize. This works well; however, I have an issue when I then try to access the data in a chunk:
reader = pd.read_csv(filename, chunksize=40, sep=';')
for chunk in reader:
    print(chunk)  # displays data -- the data looks correct
    chunk[1]  # trying to access the values of the second column -- generates an error
What is the data type of chunk, and how can I convert it so that this operation works?
Alternatively, how can I retrieve the number of lines contained in a file without loading the entire file in memory (solution 1)?
Thank you for your help!
Gaelle
chunk is a data frame,
so you can access it using indexers (accessors) like .loc / .iloc / .at / etc. (.ix also used to work, but it is deprecated in newer pandas versions):
chunk.loc[:, 'col_name']
chunk.iloc[:, 1] # second column
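As a side note on the actual error: the sample file has no header row, so the first data line is probably being consumed as column names, and the integer key 1 then does not exist. A small sketch with header=None (an assumption based on the sample in the question):
import pandas as pd

reader = pd.read_csv(filename, chunksize=40, sep=';', header=None)
for chunk in reader:
    vect2 = chunk[1]  # second column; columns are now the integers 0, 1, 2, ...
    # ... work on vect2 here (e.g. vect2.values for a numpy array)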
I have many text files (1,409) with 259,200 x 1 data points (each text file is a year and variable). I want to combine these into one text file column wise, i.e.
botTemp_1950 | botTemp_1951 | botTemp_1952 | ... etc.
....         | ....         | ....         |
....         | ....         | ....         |
...
I have already done this, but the data is arranged into one column and is 4GB in size. Is there a way of doing this column-wise in Windows, or will I need a scripting language like Python?
I then want to mask the data for only 200 or so rows for each column and take the average for each column so I have a time series for each variable.
If I can do both of these things, ideal; but I am mainly after the column-wise combination, since I can then apply the mask fairly easily in Excel, provided I can open the text file.
Thanks in advance.
In Python you can try the following:
import os
import glob
from itertools import izip  # Python 2; on Python 3 use the built-in zip instead

os.chdir(current_dir_with_txt_files)
txtfiles = [open(name) for name in glob.glob('*.txt')]  # os.listdir() does not accept glob patterns
out = open(out_name, 'w')
# izip yields one line from every file at a time, so no file is ever fully loaded
for lines in izip(*txtfiles):
    out.write(','.join(line.strip() for line in lines) + '\n')
out.close()
This may take a while, but it will not consume all your RAM.
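For the masking/averaging part of the question, a hedged pandas sketch, assuming the combined file's first line carries the column names and that the ~200 rows of interest can be expressed as a row range (the file name and the row range below are placeholders):
import pandas as pd

df = pd.read_csv('combined.csv')     # placeholder name for the column-wise file built above
masked = df.iloc[1000:1200]          # placeholder range covering the ~200 rows of interest
series_per_variable = masked.mean()  # one average per column, i.e. per year/variable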