Converting a split dataframe into csv - python

I have to split a dataframe containing 15000 rows into sections of 300 rows each.
split_df = np.array_split(data, np.arange(0, len(data),300))
I need each of the resulting groups to be a dataframe of its own, and then each of those written out to a csv.
Any ideas?

I have one idea.
split_df = np.array_split(data, np.arange(0, len(data), 300))
for i in range(len(split_df)):
    split_df[i].to_csv('data' + str(i) + '.csv')  # write each chunk to its own file
You can use this code to write everything out as csv files; I hope it fits what you want to do.
Just be aware that it will take some time to finish running.
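As a side note, passing np.arange(0, len(data), 300) as the split points leaves the first piece empty, since array_split also cuts before index 0. A minimal alternative sketch that slices the frame directly (assuming data is your original dataframe; index=False is just a choice to skip writing the index):
import pandas as pd

chunk_size = 300
for i, start in enumerate(range(0, len(data), chunk_size)):
    # rows [start, start + chunk_size) go to their own csv
    data.iloc[start:start + chunk_size].to_csv('data' + str(i) + '.csv', index=False)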

Related

Best ways to combine two Pandas DataFrames in Python

I have two data frames (for now) that I want to combine:
head_site_df = pd.DataFrame(head_sitetupaggr_list, columns =['Die Loc', 'X Coord', 'Y Coord'])
regvalfile_df = pd.DataFrame(regvaltupaggr_list)
regvalfile_df contains more than 800 columns that are generated dynamically, so I cannot list the columns out.
The reason I say "for now" is that the situation is a little complicated. I need to produce an Excel sheet laid out as described below.
This data comes from a log file. After column K, you can have 800 or more columns with those headers, holding integer values. These are the columns contained in regvalfile_df. The other DF captures the columns 'Die Loc', 'X Coord', and 'Y Coord'. I intend to modify my code so that this second DF eventually also captures columns A to K. So in total I will have two DFs: one with A to K and another with L onwards.
My question is: in order to combine these two separate DFs to get the Excel sheet, what is the best method to use, concat(), merge(), join(), or append()? Which is most efficient in terms of speed and memory consumption? I'd be combining two pandas DFs, unless it'd be more efficient to somehow have one DF capture everything. I don't see how that would work, given that column L onwards has a "dynamic" number of headers that changes for each file.
Sample of my code so far. Please note the code works; I just took out a huge chunk that contains data structures I have used so far:
for odfslogp_obj in odfslogs_plist:
    with zipfile.ZipFile(odfslogp_obj, mode='r') as z:
        for name in z.namelist():
            dfregval = pd.DataFrame()
            with z.open(name) as etest_zip:
                for head_site, loclist in zip(head_siteparam_tup_list, linesineed):  # is there a way to turn this all into a function inside a list comprehension?
                    regvals_ext = [x for x in loclist if pattern.search(x)]
                    #print(regvals_ext)
                    regvaltups_list = [tuple(x.split(":")[0:2]) for x in regvals_ext]
                    regvaldict = dict(regvaltups_list)
                    regvaltupaggr_list.append(regvaldict)
                    head_siteloc_tup = (head_site[1], head_site[0].split(',')[0], head_site[0].split(',')[1])
                    # print(head_siteloc_tup)
                    head_sitetupaggr_list.append(head_siteloc_tup)

#print(head_sitetupaggr_list)
head_site_df = pd.DataFrame(head_sitetupaggr_list, columns=['Die Loc', 'X Coord', 'Y Coord'])
regvalfile_df = pd.DataFrame(regvaltupaggr_list)
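Not a definitive answer, but one common sketch for this kind of side-by-side layout is a positional, column-wise pd.concat, assuming both frames were built from the same files in the same order so their rows line up (the output file name is just a placeholder):
import pandas as pd

# align the two frames purely by position, then stack them side by side:
# head_site_df supplies the fixed columns, regvalfile_df the dynamic register columns
combined_df = pd.concat(
    [head_site_df.reset_index(drop=True), regvalfile_df.reset_index(drop=True)],
    axis=1,
)
combined_df.to_excel('combined.xlsx', index=False)  # placeholder output name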

How to efficiently 'query' multiple tsv files?

I've got about 40 tsv files, with the size of any given tsv ranging from 250 MB to 3 GB. I'm looking to pull data from the tsvs where rows contain certain values.
My current approach is far from efficient:
nums_to_look = ['23462346', '35641264', ... , '35169331'] # being about 40k values I'm interested in
all_tsv_files = glob.glob(PATH_TO_FILES + '*.tsv')
all_dfs = []
for file in all_tsv_files:
    df = pd.read_csv(file, sep='\t')
    # Extract rows which match values in nums_to_look
    df = df[df['col_of_interest'].isin(nums_to_look)].reset_index(drop=True)
    all_dfs.append(df)
Surely there's a much more efficient way to do this without having to read in every single file fully, and go through the entire file?
Any thoughts / insights would be much appreciated.
Thanks!
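One way to at least cap memory use is to stream each file in chunks and filter as you go; a rough sketch reusing the names from the question (the chunk size is an arbitrary choice):
import glob
import pandas as pd

nums_to_look = set(nums_to_look)  # set membership tests are much faster than a list

filtered = []
for file in glob.glob(PATH_TO_FILES + '*.tsv'):
    # read the file in pieces instead of all at once
    for chunk in pd.read_csv(file, sep='\t', chunksize=100_000):
        filtered.append(chunk[chunk['col_of_interest'].isin(nums_to_look)])

result = pd.concat(filtered, ignore_index=True)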

How to add rows to pandas dataframe with reasonable performance

I have an empty data frame with about 120 columns, I want to fill it using data I have in a file.
I'm iterating over a file that has about 1.8 million lines.
(The lines are unstructured, I can't load them to a dataframe directly)
For each line in the file I do the following:
Extract the data I need from the current line
Copy the last row in the data frame and append it to the end: df = df.append(df.iloc[-1]). The copy is critical; most of the data in the previous row won't be changed.
Change several values in the last row according to the data I've extracted df.iloc[-1, df.columns.get_loc('column_name')] = some_extracted_value
This is very slow, I assume the fault is in the append.
What is the correct approach to speed things up? Preallocate the dataframe?
EDIT:
After reading the answers I did the following:
I preallocated the dataframe (saved about 10% of the time)
I replaced this: df = df.append(df.iloc[-1]) with this: df.iloc[i] = df.iloc[i-1] (i is the current iteration in the loop). (saved about another 10% of the time)
Did profiling; even though I removed the append, the main issue is copying the previous line, meaning df.iloc[i] = df.iloc[i-1] takes about 95% of the time.
You may need plenty of memory, whichever option you choose.
However, what you should certainly avoid is using pd.DataFrame.append within a loop. This is expensive versus list.append.
Instead, aggregate to a list of lists, then feed into a dataframe. Since you haven't provided an example, here's some pseudo-code:
# initialize empty list
L = []
for line in my_binary_file:
    # extract components required from each line to a list of Python types
    line_vars = [line['var1'], line['var2'], line['var3']]
    # append to list of results
    L.append(line_vars)
# create dataframe from list of lists
df = pd.DataFrame(L, columns=['var1', 'var2', 'var3'])
The fastest way would be to load into the dataframe directly via pd.read_csv().
Try separating the logic that cleans the unstructured data into structured data, and then use pd.read_csv to load the dataframe.
If you can share a sample unstructured line and the logic that extracts the structured data, that might provide more insight.
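A minimal sketch of that split, with a hypothetical clean_line() standing in for your extraction logic and a hypothetical input file name:
import io
import pandas as pd

def clean_line(raw_line):
    # hypothetical: pull out the fields you need and join them with commas
    return ','.join(raw_line.split())

with open('unstructured.log') as f:  # hypothetical file name
    cleaned = '\n'.join(clean_line(line) for line in f)

df = pd.read_csv(io.StringIO(cleaned), header=None)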
Where you use append you end up copying the dataframe, which is inefficient. Try this whole thing again but avoid this line:
df = df.append(df.iloc[-1])
You could do something like this to copy the last row to a new row (only do this if the last row contains information that you want in the new row):
df.iloc[...calculate the next available index...] = df.iloc[-1]
Then edit the last row accordingly as you have done
df.iloc[-1, df.columns.get_loc('column_name')] = some_extracted_value
You could try some multiprocessing to speed things up
from multiprocessing.dummy import Pool as ThreadPool
import pandas as pd

def YourCleaningFunction(line):
    # do your per-line parsing/cleaning here and
    # return the extracted fields for that line  # or use the kind of function jpp just provided
    ...

pool = ThreadPool(8)  # your number of cores
lines = open('your_big_csv.csv').read().split('\n')  # your csv as a list of lines
rows = pool.map(YourCleaningFunction, lines)
df = pd.DataFrame(rows)
pool.close()
pool.join()

I want to create a matrix from a CSV file

I want to create a matrix from a CSV file.
Here's what I've tried:
df = pd.read_csv('csv-path', usecols=[0,1], names=['A', 'B'])
pd.pivot_table(df,columns='A', values='B')
output : [9197337 rows x 2 columns].
I want to take fewer rows, e.g. make the matrix from just the first 100 or 1000 entries. How can I do that?
Since the csv module only deals in complete files, it would be easiest to extract the lines of interest before you use it. You could do this before running your program with the Unix head utility. Here's one way that should work in Python:
with open("csv-path") as inf, open("mod_csv_path", "w") as outf:
for i in range(1000):
outf.write(inf.readline())
Obviously you'd then read "mod_csv_path" rather than "csv-path" as the input file.
Pandas seems to be the right approach. Can you provide a sample of your CSV file?
Also, with pandas, you can limit the size of your dataframe:
limited_df = df.head(num_elements)
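For completeness, pandas can also limit the rows at read time; a small sketch using the nrows parameter of read_csv with the same columns as in the question:
import pandas as pd

# read only the first 1000 rows of the file
df = pd.read_csv('csv-path', usecols=[0, 1], names=['A', 'B'], nrows=1000)
matrix = pd.pivot_table(df, columns='A', values='B')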

Loop for creating csv out of dataframe column index

I want to create a loop which creates multiple csvs which have the same 9 columns in the beginning but differ iteratively in the last column.
[col1,col2,col3,col4,...,col9,col[i]]
I have a dataframe with a shape of (20000,209).
What I want is a loop which does not take too much computation power and resources but creates 200 csvs which differ in the last column. All columns exist in one dataframe. The columns which should be added are columns i = [10:-1].
I thought of something like:
for col in df.columns[10:-1]:
    dfi = df[:9]
    dfi.concat(df[10])
    dfi.dropna()
    dfi.to_csv('dfi.csv')
Maybe it is also possible to use
dfi.to_csv('dfi.csv', sequence = [:9,i])
The i should display the number of the added column. Any idea how to make this happen easily? :)
Thanks a lot!
I'm not sure I understand fully what you want but are you saying that each csv should just have 10 columns, all should have the first 9 and then one csv for each of the remaining 200 columns?
If so I would go for something as simple as:
base_cols = list(range(9))
for i in range(9, 209):
    df.iloc[:, base_cols+[i]].to_csv('csv{}.csv'.format(i))
Which should work I think.
