How do I add a blank line between merged files - python
I have several CSV files that I have managed to merge. However, I need to add a blank row between each file as they merge, so that I know a different file starts at that point. I have tried everything. Please help.
import os
import glob
import pandas
def concatenate(indir="C:\\testing", outfile="C:\\done.csv"):
    os.chdir(indir)
    fileList = glob.glob("*.csv")
    dfList = []
    colnames = ["Creation Date", "Author", "Tweet", "Language", "Location", "Country", "Continent"]
    for filename in fileList:
        print(filename)
        df = pandas.read_csv(filename, header=None)
        ins = df.insert(len(df), '\n')
        dfList.append(ins)
    concatDf = pandas.concat(dfList, axis=0)
    concatDf.columns = colnames
    concatDf.to_csv(outfile, index=None)
Here is an example script. You can use the loc method with a non-existent key to enlarge the DataFrame and set the value of the new row.
The simplest solution seems to be to create a template DataFrame to use as a separator, with its values set as desired, and then insert it into the list of DataFrames at the appropriate positions before concatenating.
Lastly, I removed the chdir, since glob can search in any path.
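For illustration (not part of the original answer), the loc-based enlargement mentioned above can be seen on its own in a minimal sketch with a hypothetical two-column frame:

import pandas

df = pandas.DataFrame({"a": ["x", "y"], "b": ["u", "v"]})
# the label len(df) == 2 does not exist yet, so loc enlarges the frame by one row
df.loc[len(df)] = ["", ""]
print(df)  # row 2 now holds empty strings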
import glob
import pandas

def concatenate(input_dir, output_file_name):
    file_list = glob.glob(input_dir + "/*.csv")
    column_names = ["Creation Date",
                    "Author",
                    "Tweet",
                    "Language",
                    "Location",
                    "Country",
                    "Continent"]
    # Create a separator template
    separator = pandas.DataFrame(columns=column_names)
    separator.loc[0] = [""] * 7
    dataframes = []
    for file_name in file_list:
        print(file_name)
        if len(dataframes):
            # The list is not empty, so we need to add a separator
            dataframes.append(separator)
        dataframes.append(pandas.read_csv(file_name))
    concatenated = pandas.concat(dataframes, axis=0)
    concatenated.to_csv(output_file_name, index=None)
    print(concatenated)
concatenate("input", ".out.csv")
An alternative, even shorter, way is to build the concatenated DataFrame iteratively, using the append method.
def concatenate(input_dir, output_file_name):
    file_list = glob.glob(input_dir + "/*.csv")
    column_names = ["Creation Date",
                    "Author",
                    "Tweet",
                    "Language",
                    "Location",
                    "Country",
                    "Continent"]
    concatenated = pandas.DataFrame(columns=column_names)
    for file_name in file_list:
        print(file_name)
        if len(concatenated):
            # The DataFrame is not empty, so we need to add a separator row
            concatenated.loc[len(concatenated)] = [""] * 7
        concatenated = concatenated.append(pandas.read_csv(file_name))
    concatenated.to_csv(output_file_name, index=None)
    print(concatenated)
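As a side note (not part of the original answer): DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0, so on current pandas versions the append call in the loop above has to be replaced with pandas.concat. A minimal sketch of the replacement, using hypothetical stand-in frames:

import pandas

# hypothetical stand-ins for the variables used in the loop above
concatenated = pandas.DataFrame({"Author": ["foo"], "Tweet": ["Hello"]})
new_chunk = pandas.DataFrame({"Author": ["bar"], "Tweet": ["Bye"]})

# pandas >= 2.0: DataFrame.append is gone, so concatenate explicitly instead
concatenated = pandas.concat([concatenated, new_chunk])
print(concatenated)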
I tested the script with 3 input CSV files:
input/1.csv
Creation Date,Author,Tweet,Language,Location,Country,Continent
2015-12-17,foo,Hello,EN,London,UK,Europe
2015-12-18,bar,Bye,EN,Manchester,UK,Europe
2015-12-28,baz,Hallo,DE,Frankfurt,Germany,Europe
input/2.csv
Creation Date,Author,Tweet,Language,Location,Country,Continent
2016-01-09,bar,Tweeeeet,EN,New York,USA,America
2016-01-09,cat,Miau,FI,Helsinki,Finland,Europe
input/3.csv
Creation Date,Author,Tweet,Language,Location,Country,Continent
2018-12-12,who,Hello,EN,Delhi,India,Asia
When I ran it, the following output was written to console:
Console Output (using concat)
input\1.csv
input\2.csv
input\3.csv
Creation Date Author Tweet Language Location Country Continent
0 2015-12-17 foo Hello EN London UK Europe
1 2015-12-18 bar Bye EN Manchester UK Europe
2 2015-12-28 baz Hallo DE Frankfurt Germany Europe
0
0 2016-01-09 bar Tweeeeet EN New York USA America
1 2016-01-09 cat Miau FI Helsinki Finland Europe
0
0 2018-12-12 who Hello EN Delhi India Asia
The console output of the shorter variant is slightly different (note the indices in the first column); however, this has no effect on the generated CSV file.
Console Output (using append)
input\1.csv
input\2.csv
input\3.csv
Creation Date Author Tweet Language Location Country Continent
0 2015-12-17 foo Hello EN London UK Europe
1 2015-12-18 bar Bye EN Manchester UK Europe
2 2015-12-28 baz Hallo DE Frankfurt Germany Europe
3
0 2016-01-09 bar Tweeeeet EN New York USA America
1 2016-01-09 cat Miau FI Helsinki Finland Europe
6
0 2018-12-12 who Hello EN Delhi India Asia
Finally, this is what the generated output CSV file looks like:
out.csv
Creation Date,Author,Tweet,Language,Location,Country,Continent
2015-12-17,foo,Hello,EN,London,UK,Europe
2015-12-18,bar,Bye,EN,Manchester,UK,Europe
2015-12-28,baz,Hallo,DE,Frankfurt,Germany,Europe
,,,,,,
2016-01-09,bar,Tweeeeet,EN,New York,USA,America
2016-01-09,cat,Miau,FI,Helsinki,Finland,Europe
,,,,,,
2018-12-12,who,Hello,EN,Delhi,India,Asia
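A side note that is not part of the original answer: if out.csv is later read back with pandas, the blank separator lines come in as rows that are NaN in every column, so they are easy to strip again:

import pandas

# the ,,,,,, separator lines become all-NaN rows when the file is re-read
df = pandas.read_csv("out.csv")
data_only = df.dropna(how="all")  # drop rows where every column is NaN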
Related
Openpyxl to create dataframe with sheet name and specific cell values?
What I need to do: open an Excel spreadsheet in Python/pandas and create a df with [name, balance]. Example:

name             balance
Jones Ministry   45,408.83
Smith Ministry   38,596.20
Doe Ministry     28,596.20

What I have done so far:

import pandas as pd
import openpyxl as op
from openpyxl import load_workbook
from pathlib import Path

Then:

# Excel File
src_file = Path.cwd() / 'lm_balance.xlsx'
df = load_workbook(filename=src_file)

I viewed all the sheet names with:

df.sheetnames

And created a dataframe with the 'name' column:

balance_df = pd.DataFrame(df.sheetnames)

My spreadsheet looks like this... I now need to loop through each sheet and add the 'Ending Fund Balance' and its corresponding value. The "Ending Fund Balance" is at a different row in each sheet, but it is always the final row, and the value is always in column 'G'. How do I go about doing this?

I have read through examples in Automate the Boring Stuff, the openpyxl documentation, PBPython.com examples, and Stack Overflow questions. I appreciate your help!

Working samples on GitHub: JohnMillstead: Balance_Study
To get a cell value, first set data_only=True on load_workbook, otherwise you could end up getting the cell formula instead of its value. To get the last row of a worksheet you can use ws.max_row. Combine the two with the already created dataframe: for each worksheet name, apply a function that fetches the last value in that worksheet's G column (wb[x][f'G{wb[x].max_row}']).

import pandas as pd
from openpyxl import load_workbook

src_file = 'test_balance.xlsx'
wb = load_workbook(filename=src_file, data_only=True)
df = pd.DataFrame(data=wb.sheetnames, columns=["name"])
df["balance"] = df.name.apply(lambda x: wb[x][f'G{wb[x].max_row}'].value)
print(df)

Output from df:

                   name   balance
0        Jones Ministry  15100.08
1        Smith Ministry  45408.83
2        Stark Ministry   1561.75
3          Doe Ministry   7625.75
4       Bright Ministry   3078.30
5      Lincoln Ministry   6644.59
6     Martinez Ministry  11500.54
7       Patton Ministry   9782.65
8         Rich Ministry   8429.88
9        Seitz Ministry   2974.58
10       Bhiri Ministry    622.83
11  Pignatelli Ministry  34992.05
12      Cortez Ministry   -283.48
13      Little Ministry  13755.80
14     Johnson Ministry  -2035.31
Pandas/Python - Merging files on different columns based on incoming files
I have a Python program which receives incoming files. The incoming files are based on different countries. Sample files are below.

File 1 (USA):

country  state  city     population
USA      IL     Chicago  2000000
USA      TX     Dallas   1000000
USA      CO     Denver   5000000

File 2 (Non-USA):

country  state  city     population
UK              London   2000000
UK              Bristol  1000000
UK              Glasgow  5000000

Then I have a mapping file which needs to be merged with the incoming files. The mapping file looks like this:

Country  state  Continent
UK              Europe
Egypt           Africa
USA      TX     North America
USA      IL     North America
USA      CO     North America

The requirement is that I need to join the incoming file with the mapping file on the state column if it is a USA file, and on the Country column if it is a non-USA file. For example:

If it is a USA file:

result_file = pd.merge(input_file, mapping_file, on="state", how="left")

If it is a non-USA file:

result_file = pd.merge(input_file, mapping_file, on="country", how="left")

How can I place a condition which can identify the incoming file and do the merge accordingly? Thanks in advance.
To get unified code for both cases: after reading the files, add another column named country_state to both the DataFrame of fileX (df) and the DataFrame of the mapping file (dfmap), combining country and state, and then use this column as the join key. For example:

import pandas as pd

df = pd.read_csv('fileX.txt')            # assumed for fileX
dfmap = pd.read_csv('mapping_file.txt')  # assumed for mapping file
df = df.fillna('')  # replace NaN values with ''
if 'state' in df.columns:
    df['country_state'] = df['country'] + df['state']
else:
    df['country_state'] = df['country']
dfmap['country_state'] = dfmap['country'] + dfmap['state']
result_file = pd.merge(df, dfmap, on="country_state", how="left")

Then you can drop the columns you do not need.

A modified variant adds the state column if it does not exist and joins on country and state directly, without adding the country_state column shown in the previous code:

import pandas as pd

df = pd.read_csv('file1.txt')
dfmap = pd.read_csv('file_map.txt')
df = df.fillna('')
if 'state' not in df.columns:
    df['state'] = ''
result_file = pd.merge(df, dfmap, on=["country", "state"], how="left")
First, empty the state column for non-USA files:

input_file.loc[input_file.country != 'USA', 'state'] = ''

Then, merge on two columns:

result_file = pd.merge(input_file, mapping_file, on=["country", "state"], how="left")
How are you loading the files? Is there any pattern in the file names that you can work with? If they are in the same folder, you can list them with:

import os
list_of_files = os.listdir('my_directory/')

Or you could do a simple search in the country column looking for USA, and then apply the merge according to the situation.
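Building on that comment, a rough sketch of the branching merge (the file names here are hypothetical, and the country/state column names are assumed to match across files):

import pandas as pd

input_file = pd.read_csv("incoming_file.csv")
mapping_file = pd.read_csv("mapping_file.csv")

# a USA file is recognised by the values in its country column
if (input_file["country"] == "USA").any():
    result_file = pd.merge(input_file, mapping_file, on="state", how="left")
else:
    result_file = pd.merge(input_file, mapping_file, on="country", how="left")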
How to efficiently concatenate columns from different csv files based on ids python
I have 4 CSV files. Each file has different fields, e.g. name, id_number, etc., but each file is talking about the same thing, for which there is a unique id that every file shares. I would like to concatenate the fields of the 4 files into a single DataFrame. For instance, one file contains first_name and another contains last_name, and I want to merge those two so that I have the first and last name for each object. Doing that is trivial, but I'd like to know the most efficient way, or whether there is a built-in function that does it very efficiently.

The files look something like this:

file1:

id   name   age  pets
b13  Marge  18   cat
y47  Dan    13   dog
h78  Mark   20   lizard

file2:

id   last_name  income  city
y47  Schmidt    1800    Dallas
b13  Olson      1670    Paris
h78  Diaz       2010    London

Files 3 and 4 are like that with different fields. The ids are not necessarily ordered. The goal, again, is to have one DataFrame looking like this:

id   name   age  pets    last_name  income  city
b13  Marge  18   cat     Olson      1670    Paris
y47  Dan    13   dog     Schmidt    1800    Dallas
h78  Mark   20   lizard  Diaz       2010    London

What I've done is this:

file1 = pd.read_csv('file1.csv')
file2 = pd.read_csv('file2.csv')
file3 = pd.read_csv('file3.csv')
file4 = pd.read_csv('file4.csv')

f1_group = file1.groupby(['id'])
f2_group = file2.groupby(['id'])
f3_group = file3.groupby(['id'])
f4_group = file4.groupby(['id'])

data = []
for id1, group1 in f1_group:
    for id2, group2 in f2_group:
        for id3, group3 in f3_group:
            for id4, group4 in f4_group:
                if id1 == id2 == id3 == id4:
                    frames = [group1, group2, group3, group4]
                    con = pd.concat(frames, axis=1)
                    data.append(con)

That works but is extremely inefficient. If I could eliminate an element from group1, group2, etc. once it has been considered, that would help, but it would still be inefficient. Thanks in advance.
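For what it's worth, the built-in way to line the files up on their shared key is an index-based join rather than nested loops; a minimal sketch, assuming each file has a unique id column (the file names here are hypothetical):

import pandas as pd
from functools import reduce

files = ["file1.csv", "file2.csv", "file3.csv", "file4.csv"]
frames = [pd.read_csv(f).set_index("id") for f in files]

# join aligns rows on the shared index (id) and appends each file's columns
merged = reduce(lambda left, right: left.join(right, how="inner"), frames).reset_index()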
Hi, maybe you can try this :) https://www.freecodecamp.org/news/how-to-combine-multiple-csv-files-with-8-lines-of-code-265183e0854/

import os
import glob
import pandas as pd

# set working directory
os.chdir("/mydir")

# find all csv files in the folder
# use glob pattern matching -> extension = 'csv'
# save result in list -> all_filenames
extension = 'csv'
all_filenames = [i for i in glob.glob('*.{}'.format(extension))]
# print(all_filenames)

# combine all files in the list
combined_csv = pd.concat([pd.read_csv(f) for f in all_filenames])

# export to csv
combined_csv.to_csv("combined_csv.csv", index=False, encoding='utf-8-sig')
How to parse string as a pandas dataframe
I'm trying to build a self-contained Jupyter notebook that parses a long address string into a pandas dataframe for demonstration purposes. Currently I'm having to highlight the entire string and use pd.read_clipboard:

data = pd.read_clipboard(f, comment='#', header=None, names=['address']).values.reshape(-1, 2)
matched_address = pd.DataFrame(data, columns=['addr_zagat', 'addr_fodor'])

I'm wondering if there is an easier way to read the string in directly instead of relying on having something copied to the clipboard. Here are the first few lines of the string for reference:

f = """###################################################################################################
#
# There are 112 matches between the tuples. The Zagat tuple is listed first,
# and then its Fodors pair.
#
###################################################################################################
Arnie Morton's of Chicago 435 S. La Cienega Blvd. Los Angeles 90048 310-246-1501 Steakhouses
Arnie Morton's of Chicago 435 S. La Cienega Blvd. Los Angeles 90048 310/246-1501 American
########################
Art's Deli 12224 Ventura Blvd. Studio City 91604 818-762-1221 Delis
Art's Delicatessen 12224 Ventura Blvd. Studio City 91604 818/762-1221 American
########################
Bel-Air Hotel 701 Stone Canyon Rd. Bel Air 90077 310-472-1211 Californian
Hotel Bel-Air 701 Stone Canyon Rd. Bel Air 90077 310/472-1211 Californian
########################
Cafe Bizou 14016 Ventura Blvd. Sherman Oaks 91423 818-788-3536 French Bistro
Cafe Bizou 14016 Ventura Blvd. Sherman Oaks 91423 818/788-3536 French
########################"""

Does anybody have any tips as to how to parse this string directly into a pandas dataframe? I realise there is another question that addresses this here: Create Pandas DataFrame from a string, but the string there is delimited by a semicolon and the format is totally different to the one in my example.
You should add an example of what your output should look like, but generally I would suggest something like this:

import pandas as pd
import numpy as np

# read file, split into lines
f = open("./your_file.txt", "r").read().split('\n')

accumulator = []
# loop through lines
for line in f:
    # define criteria for selecting lines
    if len(line) > 1 and line[0].isupper():
        # define criteria for splitting the line
        # get name
        first_num_char = [c for c in line if c.isdigit()][0]
        name = line.split(first_num_char, 1)[0]
        line = line.replace(name, '')
        # get restaurant type
        rest_type = line.split()[-1]
        line = line.replace(rest_type, '')
        # get phone number
        number = line.split()[-1]
        line = line.replace(number, '')
        # remainder should be the address
        address = line
        accumulator.append([name, rest_type, number, address])

# turn accumulator into a numpy array, pass with column names to the DataFrame constructor
df = pd.DataFrame(np.asarray(accumulator), columns=['name', 'restaurant_type', 'phone_number', 'address'])
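As an aside (an addition, not from the answer above): if the aim is only to avoid the clipboard, io.StringIO makes the string file-like, so the same read_csv parameters from the question can be used directly, assuming f is the multi-line string defined there:

import io
import pandas as pd

# f is the multi-line string from the question; StringIO wraps it as a file-like object
data = pd.read_csv(io.StringIO(f), comment='#', header=None,
                   names=['address']).values.reshape(-1, 2)
matched_address = pd.DataFrame(data, columns=['addr_zagat', 'addr_fodor'])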
R scripting Error { : missing value where TRUE/FALSE needed on Dataframe
I have a data frame which looks like this:

Name   Surname   Country  Path
John   Snow      UK       /Home/drive/John
BOB    Anderson           /Home/drive/BOB
Tim    David     UK       /Home/drive/Tim
Wayne  Green     UK       /Home/drive/Wayne

I have written a script which first checks if Country == "UK" and, if true, changes Path from "/Home/drive/" to "/Server/files/" using gsub in R.

Script:

Pattern <- "/Home/drive/"
Replacement <- "/Server/files/"
for (i in 1:nrow(gs_catalog_Staging_123)) {
  if (gs_catalog_Staging_123$country[i] == "UK" && !is.na(gs_catalog_Staging_123$country[i])) {
    gs_catalog_Staging_123$Path <- gsub(Pattern, Replacement, gs_catalog_Staging_123$Path, ignore.case = T)
  }
}

The output I get:

Name   Surname   Country  Path
John   Snow      UK       /Server/files/John
BOB    Anderson           /Server/files/BOB
Tim    David     UK       /Server/files/Tim
Wayne  Green     UK       /Server/files/Wayne

The output I want:

Name   Surname   Country  Path
John   Snow      UK       /Server/files/John
BOB    Anderson           /Home/drive/BOB
Tim    David     UK       /Server/files/Tim
Wayne  Green     UK       /Server/files/Wayne

As we can clearly see, gsub fails to recognize missing values and replaces that row as well.
Many R functions are vectorized, so we can avoid a loop here.

# example data
df <- data.frame(
  name = c("John", "Bob", "Tim", "Wayne"),
  surname = c("Snow", "Ander", "David", "Green"),
  country = c("UK", "", "UK", "UK"),
  path = paste0("/Home/drive/", c("John", "Bob", "Tim", "Wayne")),
  stringsAsFactors = FALSE
)

# fix the path
df$newpath <- ifelse(df$country == "UK" & !is.na(df$country),
                     gsub("/Home/drive/", "/Server/files/", df$path),
                     df$path)

# view result
df
   name surname country              path             newpath
1  John    Snow      UK  /Home/drive/John  /Server/files/John
2   Bob   Ander           /Home/drive/Bob     /Home/drive/Bob
3   Tim   David      UK   /Home/drive/Tim   /Server/files/Tim
4 Wayne   Green      UK /Home/drive/Wayne /Server/files/Wayne

In fact, this is the issue with your code: each time through your loop you check row i, but then you do a full replacement of the whole column. A fix would be to add [i] at the appropriate places in your final line of code:

gs_catalog_Staging_123$Path[i] <- gsub(Pattern, Replacement, gs_catalog_Staging_123$Path[i], ignore.case = T)