Just a quick request for guidance before my nerves break! :)
I have multiple CSV files that I would like to merge into one BIG new CSV file.
All the files have the exact same structure:
Muzicast V2;;;;;;;;
Zoom média sur Virgin Radio;;;;;;;;;
Sem. 16 : Du 15 avril 2016 au 21 avril 2016;;;;;;;;;
;;;;;;;;;
;;;;;;;;;
TOP 40;;;;;;;;;
Rg;Evo.;Rg-1;Artiste;Titre;Genre;Label;Audience;Nb.Diffs;Nb.Sem
1;+3;4;Twenty One Pilots;Stressed out;Pop/Rock International;WEA;5 982 000;56;18
2;+1;3;Coldplay;Hymn for the weekend;Pop/Rock International;WEA;5 933 000;55;13
3;-2;1;Imany;Don't be so shy (Filatov & Karas remix);Dance;THINK ZIK;5 354 000;55;7
4;-2;2;Lukas Graham;7 years;Pop/Rock International;POLYDOR;5 927 000;54;16
5; =;5;Justin Bieber;Love yourself;Pop/Rock International;MERCURY GROUP;5 481 000;49;21
All the CSV files have the same formatting.
I would like to:
- open each file one after the other and ignore the first 10 lines
- parse the fields using ";" as the separator
- insert variables at the beginning of each line
- write a new file containing the data from every file.
I managed to open a file and make the changes I needed:
import pandas as pd

file_dir = 'VIRGIN'  # folder holding the csv files
handle = open(file_dir + '/' + 'virgin092016.csv', 'r')
results = []
for line in handle:
    line = '12;2016;' + line  # prepend the week and year
    line = line.lower()
    line = line.strip()
    line = line.split(';')
    line = line[0], line[1], line[5]  # keep week, year and the artist field
    results.append(line)
handle.close()
df = pd.DataFrame(results)
print(df)
I managed to open multiple files and create a DataFrame:
file_dir = "VIRGIN"
main_df = pd.DataFrame()
for i, file_name in enumerate(os.listdir(file_dir)):
if i == 0 :
main_df = pd.read_csv(file_dir + "/" + file_name, sep=";")
main_df["file_name"] = file_name
else :
current_df = pd.read_csv(file_dir + "/" + file_name, sep=";")
current_df["file_name"] = file_name
current_df = current_df
main_df = pd.concat([main_df,current_df],ignore_index=True)
print main_df
But now I have an issue trying to do both of them at the same time.
I am missing a piece, and I think it is because I am not sure of the order I should do things in.
Do I have to open a file, make the changes, write directly to MAIN.CSV (which will hold the final data from all files), and then build a DataFrame,
OR should I open a file, build a DataFrame, and after that make the changes I need?
I'm new to Python (taking multiple online courses and reading books), but I feel that I'm still not really "pythonic" in my way of thinking.
Help would be much appreciated.
Thanks
I'm assuming all your csv files are in "./data/", defined in main_dir, and that the combined size of all your CSVs does not exceed your RAM. The trick is to use a temporary variable current_df and then append it to a final dataframe final_df with pd.concat.
import os
import pandas as pd

main_dir = "./data/"
all_files = os.listdir(main_dir)
for i, file_name in enumerate(all_files):
    current_df = pd.read_csv(main_dir + file_name,
                             sep=";",
                             skiprows=10)  # adjust if needed so the 'Rg;Evo.;...' header is the first line read
    # add here whatever information you need to your dataframe
    # dump the results into a separate file with current_df.to_csv()
    if i == 0:
        final_df = current_df
    else:
        final_df = pd.concat([final_df, current_df], axis=0)
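To combine both steps (skip the preamble, tag each row, merge everything), one way is to build the extra columns inside the same loop. Here is a minimal sketch under two assumptions: the 'Rg;Evo.;...' header row is the last line of the skipped preamble (hence skiprows=9), and the values you want to prepend can be parsed from the file name. The regex below is a guess based on names like virgin092016.csv, so adjust it to your real naming scheme:

import os
import re
import pandas as pd

main_dir = "./data/"
frames = []
for file_name in sorted(os.listdir(main_dir)):
    if not file_name.endswith(".csv"):
        continue
    df = pd.read_csv(os.path.join(main_dir, file_name), sep=";", skiprows=9)
    # hypothetical pattern: 'virgin' + 2-digit week + 4-digit year
    match = re.search(r"(\d{2})(\d{4})\.csv$", file_name)
    if match:
        df.insert(0, "week", int(match.group(1)))
        df.insert(1, "year", int(match.group(2)))
    frames.append(df)

final_df = pd.concat(frames, ignore_index=True)
final_df.to_csv("MAIN.csv", sep=";", index=False)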
I'm trying to merge my 119 csv files into one file with a Python script. The only issue I'm facing is that even though I've applied the sort method, it isn't working and my files are not ordered, which leaves the date column un-ordered. Below is the code; when I run it and open my new csv file "call.sms.merged", it appears that after my 1st csv file the data is merged from the 10th csv, then the 100th csv through the 109th, and only then from csv 11. I'm attaching an image for better understanding.
file_path = "C:\\Users\\PROJECT\\Data Set\\SMS Data\\"
file_list = [file_path + f for f in os.listdir(file_path) if f.startswith('call. sms ')]
csv_list = []
for file in sorted(file_list):
csv_list.append(pd.read_csv(file).assign(File_Name = os.path.basename(file)))
csv_merged = pd.concat(csv_list, ignore_index=True)
csv_merged.to_csv(file_path + 'calls.sms.merged.csv', index=False)
(Screenshots attached: the un-ordered merged output, the Python code, and the error.)
You can extract the number of each call/file with pandas.Series.str.extract, then use pandas.DataFrame.sort_values to sort ascending on that number.
Try this:
file_path = "C:\\Users\\PROJECT\\Data Set\\SMS Data\\"
file_list = [file_path + f for f in os.listdir(file_path) if f.startswith('call. sms ')]
csv_list = []
for file in file_list:
csv_list.append(pd.read_csv(file).assign(File_Name = os.path.basename(file)))
csv_merged = (
pd.concat(csv_list, ignore_index=True)
.assign(num_call= lambda x: x["File_Name"].str.extract("(\d{1,})", expand=False).astype(int))
.sort_values(by="num_call", ignore_index=True)
.drop(columns= "num_call")
)
csv_merged.to_csv(file_path + 'calls.sms.merged.csv', index=False)
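Alternatively, the file list can be sorted numerically up front (a "natural sort"), so the frames are already in order before the concat. A minimal sketch, assuming the call number is the first run of digits in each file name:

import os
import re

def numeric_key(path):
    # first run of digits in the file name, e.g. a name like 'call. sms 10.csv' -> 10
    match = re.search(r"\d+", os.path.basename(path))
    return int(match.group()) if match else -1

file_list = sorted(file_list, key=numeric_key)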
I have many .csv files like this (with one column):
(picture of one of the single-column CSV files was attached)
I'd like to merge them into one .csv file, so that each column contains one CSV file's data. The headings should look like this (when converted to a spreadsheet):
(picture attached: the first heading is the number of minutes extracted from the file name, the second is the first word after "export_" in the file name, and the third is the whole file name)
I'd like to work in Python.
Can someone please help me with this? I am new to Python.
Thank you very much.
I tried to join only 2 files, but I have no idea how to do it with more files without writing them all out manually. Also, I don't know how to extract the headings from the file names:
import pandas as pd

file_list = ['export_Control 37C 4h_Single Cells_Single Cells_Single Cells.csv', 'export_Control 37C 0 min_Single Cells_Single Cells_Single Cells.csv']
df = pd.DataFrame()
for file in file_list:
    temp_df = pd.read_csv(file)
    df = pd.concat([df, temp_df], axis=1)
print(df)
df.to_csv('output2.csv', index=False)
Assuming that your .csv files all have a header and the same number of rows, you can use the code below to put all the (single-column) .csv files one beside the other in a single worksheet.
import os
import pandas as pd

csv_path = r'path_to_the_folder_containing_the_csvs'
#keep only the .csv files (this also avoids re-reading output.csv on a second run)
csv_files = [file for file in os.listdir(csv_path) if file.endswith('.csv')]

list_of_dfs = []
for file in csv_files:
    temp = pd.read_csv(csv_path + '\\' + file, header=0, names=['Header'])
    time_number = pd.DataFrame([[file.split('_')[1].split()[2]]], columns=['Header'])
    file_title = pd.DataFrame([[file.split('_')[1].split()[0]]], columns=['Header'])
    file_name = pd.DataFrame([[file]], columns=['Header'])
    out = pd.concat([time_number, file_title, file_name, temp]).reset_index(drop=True)
    list_of_dfs.append(out)

final = pd.concat(list_of_dfs, axis=1, ignore_index=True)
final.columns = ['Column' + str(col + 1) for col in final.columns]
final.to_csv(csv_path + '\\' + 'output.csv', index=False)
final  #displays the merged dataframe in Jupyter
For example, with three .csv files, running the code above yields:
(screenshots of the output in Jupyter and in Excel were attached)
So this is kind of weird but I'm new to Python and I'm committed to seeing my first project with Python through to the end.
So I am reading about 100 .xlsx files in from a file path. I then trim each file and send only the important information to a list, as an individual and unique dataframe. So now I have a list of 100 unique dataframes, but iterating through the list and writing to Excel just overwrites the data in the file. I want to append to the end of the .xlsx file. The biggest catch to all of this is that I can only use Excel 2010; I do not have any other version of the application. The openpyxl library seems to have some interesting stuff, so I've tried something like this:
from openpyxl import load_workbook
from openpyxl.utils.dataframe import dataframe_to_rows

wb = load_workbook(outfile_path)
ws = wb.active
for frame in main_df_list:
    for r in dataframe_to_rows(frame, index=True, header=True):
        ws.append(r)
wb.save(outfile_path)  # without an explicit save the appended rows are lost
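If staying in pandas is an option, a sheet-per-frame append can also go through pandas' ExcelWriter with the openpyxl engine. A rough sketch, assuming pandas 1.3+ and that the workbook at outfile_path already exists (main_df_list as above):

import pandas as pd

# append each dataframe as its own sheet instead of overwriting the workbook;
# if_sheet_exists='new' makes openpyxl create a fresh, numbered sheet on name clashes
with pd.ExcelWriter(outfile_path, engine='openpyxl', mode='a',
                    if_sheet_exists='new') as writer:
    for i, frame in enumerate(main_df_list):
        frame.to_excel(writer, sheet_name='frame_' + str(i))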
Note: In another post I was told it's not best practice to read dataframes line by line using loops, but when I started I didn't know that. I am however committed to this monstrosity.
Edit after reading Comments
So my code scrapes .xlsx files and stores specific data, based on a keyword comparison, into dataframes. These dataframes are stored in a list. I will include the entire program below so hopefully I can explain what's in my head. Also, feel free to roast my code, because I have no idea what good Python practice actually looks like.
import os
import pandas as pd
from openpyxl import load_workbook

#the file path I want to pull from
in_path = r'W:\R1_Manufacturing\Parts List Project\Tool_scraping\Excel'
#the file path where row search items are stored
search_parameters = r'W:\R1_Manufacturing\Parts List Project\search_params.xlsx'
#the file I will write the dataframes to
outfile_path = r'W:\R1_Manufacturing\Parts List Project\xlsx_reader.xlsx'

#establishing my lists that I will store looped data into
file_list = []
main_df = []
master_list = []

#open the file path to store the directory in files
files = os.listdir(in_path)

#database with terms that I want to track
search = pd.read_excel(search_parameters)
search_size = search.index

#searching only for files that end with .xlsx
for file in files:
    if file.endswith('.xlsx'):
        file_list.append(in_path + '/' + file)

#read in the files to a dataframe, main loop the files will be manipulated in
for current_file in file_list:
    df = pd.read_excel(current_file)
    #get column headers and a range for total rows
    columns = df.columns
    total_rows = df.index
    #adding to store where headers are stored in DF
    row_list = []
    column_list = []
    header_list = []
    for name in columns:
        for number in total_rows:
            cell = df.at[number, name]
            if isinstance(cell, str) == False:
                continue
            elif cell == '':
                continue
            for place in search_size:
                search_loop = search.at[place, 'Parameters']
                #main compare, if str and matches search params, then do...
                #(insensitive_compare is a helper defined elsewhere in my project)
                if insensitive_compare(search_loop, cell) == True:
                    if cell not in header_list:
                        header_list.append(df.at[number, name])  #store data headers
                        row_list.append(number)  #store row number where it is in that data frame
                        column_list.append(name)  #store column number where it is in that data frame
                    else:
                        continue
                else:
                    continue
    for thing in column_list:
        df = pd.concat([df, pd.DataFrame(0, columns=[thing], index=range(2))], ignore_index=True)
    #turns the dataframe into a set of booleans where it's true if
    #there's something there
    na_finder = df.notna()
    #create a new dataframe to write the output to
    outdf = pd.DataFrame(columns=header_list)
    for i in range(len(row_list)):
        k = 0
        while na_finder.at[row_list[i] + k, column_list[i]] == True:
            #I turn the dataframe into booleans and read until False
            if df.at[row_list[i] + k, column_list[i]] not in header_list:
                #Store actual dataframe into my output dataframe, outdf
                outdf.at[k, header_list[i]] = df.at[row_list[i] + k, column_list[i]]
            k += 1
    main_df.append(outdf)
So main_df is a list that has 100+ dataframes in it. For this example I will only use 2 of them. I would like them to print out into Excel like this (example screenshot attached):
So the comment from Ashish really helped me. All of the dataframes had different column titles, so my 100+ dataframes eventually concatenated into a dataframe that is 569 x 52. Here is the code that I used. I completely abandoned openpyxl, because once I was able to concat all of the dataframes together, I just had to export it using pandas:
# what I want to do here is grab all the data in the same column as each
# header, then move to the next column
for i in range(len(row_list)):
    k = 0
    while na_finder.at[row_list[i] + k, column_list[i]] == True:
        if df.at[row_list[i] + k, column_list[i]] not in header_list:
            outdf.at[k, header_list[i]] = df.at[row_list[i] + k, column_list[i]]
        k += 1
main_df.append(outdf)

to_xlsx_df = pd.DataFrame()
for frame in main_df:
    to_xlsx_df = pd.concat([to_xlsx_df, frame])
to_xlsx_df.to_excel(outfile_path)
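A side note on that final loop: concatenating inside a loop re-copies everything accumulated so far on every pass. Since main_df is already a list of dataframes, pd.concat can take the whole list in one call:

# one concat over the whole list instead of growing a frame per iteration
to_xlsx_df = pd.concat(main_df, ignore_index=True)
to_xlsx_df.to_excel(outfile_path)

(ignore_index=True is optional here; drop it if the original row labels matter.)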
The output to Excel ended up looking something like this:
Hopefully this can help someone else out too.
I have 30 Excel files and want to combine them all into just 1 Excel file (using Python), and I want only 1 header at the top of the file (not repeated 30 times).
I don't know how to write this in Python.
Please help.
Thank you so much.
This is my code snippet for merging 13 CSV files into 1 file.
fout = open("1300 restaurant data.csv", "a", encoding="utf8")
# now the rest:
for num in range(1, 14):  # 13 files, named 1.csv .. 13.csv
    f = open(str(num) + ".csv", encoding="utf8")
    if num > 1:
        next(f)  # skip the header row on every file after the first
    for line in f:
        fout.write(line)
    f.close()  # not really needed
fout.close()
Try using the below code to merge a list of files into a single file.
import glob
import pandas as pd

path = "C:/documents"
file_list = glob.glob(path + "/*.xlsx")

excel_list = []
for file in file_list:  # iterate over the files found (not the empty excel_list)
    excel_list.append(pd.read_excel(file))

# DataFrame.append was removed in pandas 2.0; concatenate the list in one call
excel_merged = pd.concat(excel_list, ignore_index=True)
excel_merged.to_excel('mergedFile.xlsx', index=False)
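For what it's worth, once the loop bug is fixed, the same merge compresses to a single expression; an equivalent sketch:

import glob
import pandas as pd

# read every workbook that glob finds and stack them in one call
excel_merged = pd.concat(
    (pd.read_excel(f) for f in glob.glob("C:/documents/*.xlsx")),
    ignore_index=True,
)
excel_merged.to_excel('mergedFile.xlsx', index=False)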
I’m extremely new to Python & trying to figure the below out:
I have multiple CSV files (monthly files) that I'm trying to combine into one yearly file. The monthly files all have headers, so I'm trying to keep the first header and remove the rest. I used the script below, which accomplished this; however, there are 10 blank rows between each month.
Does anyone know what I can add to remove the blank rows?
import shutil
import glob

#import csv files from folder
path = r'data/US/market/merged_data'
allFiles = glob.glob(path + "/*.csv")
allFiles.sort()  # glob lacks reliable ordering, so impose your own if output order matters

with open('someoutputfile.csv', 'wb') as outfile:
    for i, fname in enumerate(allFiles):
        with open(fname, 'rb') as infile:
            if i != 0:
                infile.readline()  # Throw away header on all but first file
            # Block copy rest of file from input to output without parsing
            shutil.copyfileobj(infile, outfile)
            print(fname + " has been imported.")
Thank you in advance!
Assuming the dataset isn't bigger than your memory, I suggest reading each file with pandas, concatenating the dataframes, and filtering from there. Blank rows will probably show up as NaN.
import pandas as pd
import glob

path = r'data/US/market/merged_data'
allFiles = glob.glob(path + "/*.csv")
allFiles.sort()

frames = []
for fname in allFiles:
    #read each file (DataFrame.append was removed in pandas 2.0, so collect and concat instead)
    frames.append(pd.read_csv(fname))
df = pd.concat(frames, ignore_index=True)
#this will drop rows where every column is blank (NaN)
df = df.dropna(how='all')
#write to file
df.to_csv('someoutputfile.csv', index=False)
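If you'd rather keep your original stream-copy approach instead of loading everything into pandas, you can also drop the blank rows as you copy. A minimal sketch along the lines of your first script (same path and output name):

import glob

path = r'data/US/market/merged_data'
allFiles = sorted(glob.glob(path + "/*.csv"))

with open('someoutputfile.csv', 'w', newline='') as outfile:
    for i, fname in enumerate(allFiles):
        with open(fname, 'r', newline='') as infile:
            if i != 0:
                infile.readline()  # throw away header on all but the first file
            for line in infile:
                # skip rows that contain nothing but whitespace or separators
                if line.strip(' \t\r\n;,'):
                    outfile.write(line)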