Stop overwriting when creating new df from looping through original df - python

I have a large df whose last column is a filename. I want to make a new CSV containing the rows of all files that have an 'M' in the filename. I have managed the majority of this, but the resulting csv has only one row, containing the last file found in the large csv. I want each row to be written to the csv on a new line.
I have tried df.append in many ways but haven't had any luck. I have seen some very different approaches, but they required changing all my code when it feels like only a minor adjustment is needed.
import glob
import pandas as pd

path = '.../files/'
big_data = pd.read_csv('landmark_coordinates.csv', sep=',', skipinitialspace=True)  # open big CSV as a df
# put photos into a male array based on the M character that appears in the filename
male_files = [f for f in glob.glob(path + "**/*[M]*.??g", recursive=True)]
for each_male in male_files:  # for all male files
    male_data = big_data.loc[big_data['photo_name'] == each_male]  # extract their row of data from the CSV into a new dataframe
    # NEEDED: ON A NEW LINE! MUST APPEND. right now it just overwrites
    male_data.to_csv('male_landmark_coordinates.csv', index=False, sep=',')  # transport new df to csv format
Like I said, I need to make sure each file starts on a new row. Would really appreciate any help as it feels like I am so close!

Every time you call df.to_csv you overwrite the csv. Build the full dataframe first and write it out once:
male_data = pd.DataFrame()
for each_male in male_files:  # for all male files
    # append returns a new dataframe rather than modifying in place, so assign the result back
    male_data = male_data.append(big_data.loc[big_data['photo_name'] == each_male])
male_data.to_csv('male_landmark_coordinates.csv', index=False, sep=',')  # write the combined df to csv once
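As an aside, DataFrame.append was removed in pandas 2.0, and growing a dataframe row by row copies it on every iteration. A sketch of an append-free variant, assuming photo_name holds exactly the strings that glob returns:

import glob
import pandas as pd

big_data = pd.read_csv('landmark_coordinates.csv', sep=',', skipinitialspace=True)
male_files = glob.glob(path + "**/*[M]*.??g", recursive=True)  # path as defined above

# one boolean filter instead of a per-file loop
male_data = big_data[big_data['photo_name'].isin(male_files)]
male_data.to_csv('male_landmark_coordinates.csv', index=False)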

Related

Python - Extract mutual column of multiple csv files into one DataFrame for statistics/plotting

I want to analyze 26 csv files from lab experiments, all in the same format.
However, I just want to extract column #5 of each csv file and put all of those columns into one DataFrame, with each column named after its source csv file.
The final df should contain 26 columns and one row as header.
In simpler words: load multiple csv files, extract same column of each, put all extracted columns into a new DataFrame with column names = filename_n, filename_n+1, ...
I found some lines of code that get me a dict of DataFrames, but I'm not able to adjust them for the final goal:
import glob
import pandas as pd

path = '*'  # use your path
files = glob.glob(path + "/*.csv")
get_df = lambda f: pd.read_csv(f, header=None)
dodf = {f: get_df(f) for f in files}
dodf
If someone has an idea, I'd highly appreciate it!
Thanks,
Niels
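One possible approach, sketched under the assumption that column #5 means the fifth column (positional index 4) and that the files have no header row:

import glob
import os
import pandas as pd

files = glob.glob("*.csv")  # use your path

# fifth column of each file, keyed by the bare filename
cols = {os.path.splitext(os.path.basename(f))[0]: pd.read_csv(f, header=None).iloc[:, 4]
        for f in files}

# one DataFrame with a column per input file, named after it
result = pd.concat(cols, axis=1)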

Adding rows with csv.writer.writerow / csv.writer.writerows is removing headers / columns

Been testing some data management techniques with pandas and csv. What I'm trying to do is read a csv file, add some extra rows to it, and save it again in the same format.
I've created a dataframe of shape (250, 20) with random values, dates as index and alphabets as columns then saved it as a csv file. Ultimately what I've tried is to append the same dataframe below an existing csv file.
import csv

def _writeBulk(savefile, data):
    df = data.reset_index()
    with open(savefile, 'w', newline='') as outfile:  # the with block closes the file automatically
        writer = csv.writer(outfile)
        writer.writerows(df.to_numpy().tolist())

def _writeData(savefile, data):
    df = data.reset_index()
    with open(savefile, 'w', newline='') as outfile:
        writer = csv.writer(outfile)
        for row in range(df.shape[0]):
            writer.writerow(df.iloc[row].tolist())
When reading the file back in after the edit, the result I'm expecting is a dataframe of shape (500, 20). But it seems the file no longer has headers (columns), and the shape comes out as (499, 20).
I've searched for solutions and explanations, but skipping the header while writing rows is the closest I've got to the actual issue:
Skip the headers when editing a csv file using Python
Any explanations or solutions would be appreciated.
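The header disappears because open(savefile, 'w') truncates the existing file, and df.to_numpy() contains only the data rows, never the column names. A minimal sketch of the bulk writer opened in append mode instead (a hypothetical _appendBulk variant, assuming the file already exists with its header row):

import csv

def _appendBulk(savefile, data):
    df = data.reset_index()
    # 'a' appends below the existing contents instead of truncating,
    # so the original header row survives
    with open(savefile, 'a', newline='') as outfile:
        writer = csv.writer(outfile)
        writer.writerows(df.to_numpy().tolist())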
Firstly, if your .csv has the date as its first column (which is also the index), you don't have to call .reset_index() after reading the file into a DataFrame: pd.read_csv gives you Date, A, B, C, ... back as ordinary columns.
If you simply want to append a new DataFrame with the same columns Date, A, B, C, ..., you can do so:
import pandas as pd

source_df = pd.read_csv('initial.csv')
# copy of the same dataframe
# it could be a different df as per your requirement
new_df = source_df.copy()
# appending the 2nd dataframe to the 1st
# (pd.concat replaces DataFrame.append, which was removed in pandas 2.0)
final_df = pd.concat([source_df, new_df])
# writing to .csv
final_df.to_csv('final.csv', index=False)
We're setting index=False to stop the DataFrame's default integer index (0, 1, 2, 3, ...) from being written to the final .csv file.
The final .csv then contains a single header row followed by both sets of data rows.
Having said all this, if you want your DataFrame to have date as the index, but the written .csv should still have the columns date, A, B, C, ...,
Do this:
source_df = pd.read_csv('test.csv')
# copy of the same dataframe
new_df = source_df.copy()
# appending the 2nd dataframe to the 1st
final_df = pd.concat([source_df, new_df])
final_df.set_index(['date'], inplace=True)
final_df.to_csv('final.csv')
You'll end up with a df indexed by date, and the written .csv keeps date as its first column.

How to efficiently remove junk above headers in an .xls file

I have a number of .xls datasheets which I am looking to clean and merge.
Each data sheet is generated by a larger system which cannot be changed.
The method that generates the data sets prints the selected parameters for the data set above the data itself, and I am looking to automate the removal of these rows.
The number of rows this takes up varies, so I am unable to blanket-remove x rows from each sheet. Furthermore, the system that generates the report arbitrarily merges cells in the blank sections to the right of the information.
Currently I am attempting what feels like a very inelegant solution where I convert the file to a CSV, read it as a string, and remove everything before the first column header:
import pandas as pd

data_xls = pd.read_excel('InputFile.xls', index_col=None)
data_xls.to_csv('Change1.csv', encoding='utf-8')
with open("Change1.csv") as f:
    s = f.read() + '\n'
a = s[s.index("Col1"):]
df = pd.DataFrame([x.split(',') for x in a.split('\n')])
This works but it seems wildly inefficient:
Multiple format conversions
Reading every line in the file when the only rows being altered occur within the first ~20
The dataframe ends up with its column headers shifted over by one and must be re-aligned (less of a concern)
With some of the files being around 20mb, merging a batch of 8 can take close to 10 minutes.
A little hacky, but here's an idea to speed up your process by doing some operations directly on your dataframe. Considering you know your first column name to be Col1, you could try something like this:
df = pd.read_excel('InputFile.xls', index_col=None)
# Find the first occurrence of "Col1"
column_row = df.index[df.iloc[:, 0] == "Col1"][0]
# Use this row as header
df.columns = df.iloc[column_row]
# Remove the columns' name (currently a useless index number)
df.columns.name = None
# Keep only the data after the (old) column row
df = df.iloc[column_row + 1:]
# And tidy it up by resetting the index
df.reset_index(drop=True, inplace=True)
This should work for any dynamic number of header rows in your Excel (xls & xlsx) files, as long as you know the title of the first column...
If you know the number of junk rows, you can skip them with skiprows:
data_xls = pd.read_excel('InputFile.xls', index_col=None, skiprows=2)
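If the number of junk rows varies, here's a sketch combining the two ideas: peek at just the first column to locate the real header, then re-read with skiprows. This assumes "Col1" appears somewhere in the first 30 rows.

import pandas as pd

# Read only the first column of the first rows to find the real header cheaply
peek = pd.read_excel('InputFile.xls', header=None, usecols=[0], nrows=30)
header_row = int(peek.index[peek[0] == "Col1"][0])

# Re-read the file, skipping everything above the real header row
df = pd.read_excel('InputFile.xls', skiprows=header_row)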

Python / glob.glob - change datatype during import

I'm looping through all excel files in a folder and appending them to a dataframe. One column (column C) has an ID number. In some of the sheets, the ID is formatted as text and in others it's formatted as a number. What's the best way to change the data type during or after the import so that the datatype is consistent? I could always change them in each excel file before importing but there are 40+ sheets.
for f in glob.glob(path):
    dftemp = pd.read_excel(f, sheetname=0, skiprows=13)
    dftemp['file_name'] = os.path.basename(f)
    df = df.append(dftemp, ignore_index=True)
Don't append to a dataframe in a loop: every append copies the whole dataframe to a new location in memory, which is very slow. Do one single concat after reading all your dataframes:
dfs = []
for f in glob.glob(path):
    df = pd.read_excel(f, sheet_name=0, skiprows=13)  # sheet_name in current pandas (sheetname is the old spelling)
    df['file_name'] = os.path.basename(f)
    df['c'] = df['c'].astype(str)
    dfs.append(df)
df = pd.concat(dfs, ignore_index=True)
It sounds like your ID, i.e. the c column, is really a string that sometimes happens to contain only digits, so pandas infers a numeric dtype for those files. Ideally it should always be treated as a string.
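Alternatively, you can force the dtype at read time, assuming the ID column really is named c in the sheet:

df = pd.read_excel(f, sheet_name=0, skiprows=13, dtype={'c': str})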

Looping through lists of csv file names

I am having some issues with the code below. Its purpose is to take a list of lists, where each inner list contains a series of csv file names, loop through those lists one at a time, and pull in only the csv files found in the respective list.
My current code accumulates all the data instead of starting from scratch on each pass. On the first loop it should use only the csv files at index 0, on the second loop only those at index 1, and so on - without accumulating.
path = "C:/DataFolder/"
allFiles = glob.glob(path + "/*.csv")
fileChunks = [['2003.csv','2004.csv','2005.csv'],['2006.csv','2007.csv','2008.csv']]
for i in range(len(fileChunks)):
"""move empty dataframe here"""
df = pd.DataFrame()
for file_ in fileChunks[i]:
df_temp = pd.read_csv(file_, index_col = None, names = names, parse_dates=True)
df = df.append(df_temp)
Note: fileChunks is derived from a function, and it spits out a list of lists like the example above.
Any help or pointers to documentation would be great - I want to learn from this. Thank you.
EDIT
It seems that moving the empty dataframe to within the first for loop works.
This unnests your files and reads each one separately using a list comprehension, then joins them all with concat. It is much more efficient than appending each read to a growing dataframe:
df = pd.concat(
    [pd.read_csv(file_, index_col=None, names=names, parse_dates=True)
     for chunk in fileChunks for file_ in chunk],
    ignore_index=True)
>>> [file_ for chunk in fileChunks for file_ in chunk]
['2003.csv', '2004.csv', '2005.csv', '2006.csv', '2007.csv', '2008.csv']
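If you do want one DataFrame per chunk rather than one flat DataFrame, here's a sketch of the same idea applied per inner list, still assuming names is defined elsewhere:

chunk_dfs = [
    pd.concat(
        [pd.read_csv(file_, index_col=None, names=names, parse_dates=True)
         for file_ in chunk],
        ignore_index=True)
    for chunk in fileChunks
]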
