How do I add a blank line between merged files - python

I have several CSV files that I have managed to merge. However, I need to add a blank row between each file as they merge, so I know where a different file starts. I've tried everything. Please help.
import os
import glob
import pandas

def concatenate(indir="C:\\testing", outfile="C:\\done.csv"):
    os.chdir(indir)
    fileList = glob.glob("*.csv")
    dfList = []
    colnames = ["Creation Date", "Author", "Tweet", "Language", "Location", "Country", "Continent"]
    for filename in fileList:
        print(filename)
        df = pandas.read_csv(filename, header=None)
        ins = df.insert(len(df), '\n')
        dfList.append(ins)
    concatDf = pandas.concat(dfList, axis=0)
    concatDf.columns = colnames
    concatDf.to_csv(outfile, index=None)

Here is an example script. You can use the loc method with a non-existent key to enlarge the DataFrame and set the values of the new row.
The simplest solution seems to be to create a template DataFrame to use as a separator, with its values set as desired, and then insert it into the list of data frames at the appropriate positions before concatenating.
Lastly, I removed the chdir call, since glob can search any path.
import glob
import pandas

def concatenate(input_dir, output_file_name):
    file_list = glob.glob(input_dir + "/*.csv")
    column_names = ["Creation Date",
                    "Author",
                    "Tweet",
                    "Language",
                    "Location",
                    "Country",
                    "Continent"]
    # Create a separator template
    separator = pandas.DataFrame(columns=column_names)
    separator.loc[0] = [""] * 7
    dataframes = []
    for file_name in file_list:
        print(file_name)
        if len(dataframes):
            # The list is not empty, so we need to add a separator
            dataframes.append(separator)
        dataframes.append(pandas.read_csv(file_name))
    concatenated = pandas.concat(dataframes, axis=0)
    concatenated.to_csv(output_file_name, index=None)
    print(concatenated)

concatenate("input", "out.csv")
An alternative, even shorter, way is to build the concatenated DataFrame iteratively, using the append method. (Note that DataFrame.append has since been deprecated and was removed in pandas 2.0, so this variant only works on older pandas versions.)
def concatenate(input_dir, output_file_name):
    file_list = glob.glob(input_dir + "/*.csv")
    column_names = ["Creation Date",
                    "Author",
                    "Tweet",
                    "Language",
                    "Location",
                    "Country",
                    "Continent"]
    concatenated = pandas.DataFrame(columns=column_names)
    for file_name in file_list:
        print(file_name)
        if len(concatenated):
            # The DataFrame is not empty, so we need to add a separator
            concatenated.loc[len(concatenated)] = [""] * 7
        concatenated = concatenated.append(pandas.read_csv(file_name))
    concatenated.to_csv(output_file_name, index=None)
    print(concatenated)
I tested the script with 3 input CSV files:
input/1.csv
Creation Date,Author,Tweet,Language,Location,Country,Continent
2015-12-17,foo,Hello,EN,London,UK,Europe
2015-12-18,bar,Bye,EN,Manchester,UK,Europe
2015-12-28,baz,Hallo,DE,Frankfurt,Germany,Europe
input/2.csv
Creation Date,Author,Tweet,Language,Location,Country,Continent
2016-01-09,bar,Tweeeeet,EN,New York,USA,America
2016-01-09,cat,Miau,FI,Helsinki,Finland,Europe
input/3.csv
Creation Date,Author,Tweet,Language,Location,Country,Continent
2018-12-12,who,Hello,EN,Delhi,India,Asia
When I ran it, the following output was written to console:
Console Output (using concat)
input\1.csv
input\2.csv
input\3.csv
Creation Date Author Tweet Language Location Country Continent
0 2015-12-17 foo Hello EN London UK Europe
1 2015-12-18 bar Bye EN Manchester UK Europe
2 2015-12-28 baz Hallo DE Frankfurt Germany Europe
0
0 2016-01-09 bar Tweeeeet EN New York USA America
1 2016-01-09 cat Miau FI Helsinki Finland Europe
0
0 2018-12-12 who Hello EN Delhi India Asia
The console output of the shorter variant is slightly different (note the indices in the first column); however, this has no effect on the generated CSV file.
Console Output (using append)
input\1.csv
input\2.csv
input\3.csv
Creation Date Author Tweet Language Location Country Continent
0 2015-12-17 foo Hello EN London UK Europe
1 2015-12-18 bar Bye EN Manchester UK Europe
2 2015-12-28 baz Hallo DE Frankfurt Germany Europe
3
0 2016-01-09 bar Tweeeeet EN New York USA America
1 2016-01-09 cat Miau FI Helsinki Finland Europe
6
0 2018-12-12 who Hello EN Delhi India Asia
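As an aside, not part of the original answer: if you want the concatenated result to carry a clean sequential 0..N index in the console output, pd.concat accepts ignore_index=True; either way the index is not written to the CSV here, since to_csv is called with index=None.
concatenated = pandas.concat(dataframes, axis=0, ignore_index=True)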
Finally, this is what the generated output CSV file looks like:
out.csv
Creation Date,Author,Tweet,Language,Location,Country,Continent
2015-12-17,foo,Hello,EN,London,UK,Europe
2015-12-18,bar,Bye,EN,Manchester,UK,Europe
2015-12-28,baz,Hallo,DE,Frankfurt,Germany,Europe
,,,,,,
2016-01-09,bar,Tweeeeet,EN,New York,USA,America
2016-01-09,cat,Miau,FI,Helsinki,Finland,Europe
,,,,,,
2018-12-12,who,Hello,EN,Delhi,India,Asia
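A side note, based on an assumption about how the merged file will be consumed (this is not part of the original answer): when the merged CSV is read back with pandas, the separator rows come back as all-NaN rows, so they can be located to tell where each original file started.
import pandas as pd

merged = pd.read_csv("out.csv")
# separator rows are entirely empty, i.e. all NaN after reading
separator_positions = merged.index[merged.isna().all(axis=1)]
print(separator_positions)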

Openpyxl to create dataframe with sheet name and specific cell values?

What I need to do:
Open Excel Spreadsheet in Python/Pandas
Create df with [name, balance]
Example:
name            balance
Jones Ministry  45,408.83
Smith Ministry  38,596.20
Doe Ministry    28,596.20
What I have done so far...
import pandas as pd
import openpyxl as op
from openpyxl import load_workbook
from pathlib import Path
Then...
# Excel File
src_file = Path.cwd() / 'lm_balance.xlsx'
df = load_workbook(filename = src_file)
I viewed all the sheet names by...
df.sheetnames
And created a dataframe with the 'name' column
balance_df = pd.DataFrame(df.sheetnames)
My spreadsheet looks like this...
I now need to loop through each sheet and add the 'Ending Fund Balance' and its corresponding value.
The "Ending Fund Balance" is at a different row in each sheet, but it is always the final row. The value is always in column 'G'.
How do I go about doing this?
I have read through examples in:
Automate the Boring Stuff
Openpyxl documentation
PBPython.com examples
Stack Overflow questions
I appreciate your help!
Working samples on GitHub: JohnMillstead: Balance_Study
To get a cell value, first set data_only=True on load_workbook; otherwise you could end up getting the cell formula instead. To get the last row of a worksheet you can use ws.max_row. Combine this with the DataFrame you already created and apply, for each worksheet name, a function that fetches the last value in column G of that worksheet (wb[x][f'G{wb[x].max_row}']).
import pandas as pd
from openpyxl import load_workbook

src_file = 'test_balance.xlsx'
# data_only=True returns cell values instead of formulas
wb = load_workbook(filename=src_file, data_only=True)
df = pd.DataFrame(data=wb.sheetnames, columns=["name"])
# for each sheet, take the value of column G in its last row
df["balance"] = df.name.apply(lambda x: wb[x][f'G{wb[x].max_row}'].value)
print(df)
Output from df
name balance
0 Jones Ministry 15100.08
1 Smith Ministry 45408.83
2 Stark Ministry 1561.75
3 Doe Ministry 7625.75
4 Bright Ministry 3078.30
5 Lincoln Ministry 6644.59
6 Martinez Ministry 11500.54
7 Patton Ministry 9782.65
8 Rich Ministry 8429.88
9 Seitz Ministry 2974.58
10 Bhiri Ministry 622.83
11 Pignatelli Ministry 34992.05
12 Cortez Ministry -283.48
13 Little Ministry 13755.80
14 Johnson Ministry -2035.31
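An alternative sketch, not from the original answer, that stays entirely in pandas: read every sheet with read_excel(sheet_name=None) and take the last value of column G from each sheet. It assumes the 7th column (Excel column G) holds the balances and that the "Ending Fund Balance" value is its last non-empty entry.
import pandas as pd

sheets = pd.read_excel('test_balance.xlsx', sheet_name=None, header=None)
rows = []
for name, sheet in sheets.items():
    balance = sheet.iloc[:, 6].dropna().iloc[-1]   # column G -> positional index 6
    rows.append({'name': name, 'balance': balance})
df = pd.DataFrame(rows)
print(df)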

Pandas/Python - Merging files on different columns based on incoming files

I have a Python program which receives incoming files. The incoming files are based on different countries. Sample files are below:
File 1 (USA):
country  state  city     population
USA      IL     Chicago  2000000
USA      TX     Dallas   1000000
USA      CO     Denver   5000000
File 2 (Non USA):
country  state  city     population
UK              London   2000000
UK              Bristol  1000000
UK              Glasgow  5000000
Then I have a mapping file which needs to be merged with the incoming files. The mapping file looks like this:
Country  state  Continent
UK              Europe
Egypt           Africa
USA      TX     North America
USA      IL     North America
USA      CO     North America
Now the requirement is that I need to join the incoming file with the mapping file on the state column if it is a USA file, and on the country column if it is a non-USA file. For example:
If its a USA file -
result_file = pd.merge(input_file, mapping_file, on="state", how="left")
If its a non USA file -
result_file = pd.merge(input_file, mapping_file, on="country", how="left")
How can I write a condition that identifies the type of incoming file and does the merge accordingly?
Thanks in advance
To get unified code for both cases: after reading the files, add a column named country_state to both the DataFrame of fileX (df) and the DataFrame of the mapping file (dfmap), combining country and state, and then merge on that column.
for example:
import pandas as pd

df = pd.read_csv('fileX.txt')            # assumed for fileX
dfmap = pd.read_csv('mapping_file.txt')  # assumed for mapping file
df = df.fillna('')                       # replace NaN values with ''
dfmap = dfmap.fillna('')                 # do the same for the mapping file
if 'state' in df.columns:
    df['country_state'] = df['country'] + df['state']
else:
    df['country_state'] = df['country']
dfmap['country_state'] = dfmap['country'] + dfmap['state']
result_file = pd.merge(df, dfmap, on="country_state", how="left")
Then you can drop the columns you do not need, for example the helper country_state column.
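For instance, a minimal sketch (column name taken from the code above):
result_file = result_file.drop(columns=['country_state'])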
As a modification, you can instead add a state column if it does not exist and merge on both country and state, without adding the country_state column used in the previous code:
import pandas as pd

df = pd.read_csv('file1.txt')
dfmap = pd.read_csv('file_map.txt')
df = df.fillna('')
dfmap = dfmap.fillna('')   # empty states must match on both sides
if 'state' not in df.columns:
    df['state'] = ''
result_file = pd.merge(df, dfmap, on=["country", "state"], how="left")
First, empty the state column for non-US files.
input_file.loc[input_file.country != 'USA', 'state'] = ''
Then, merge on two columns:
result_file = pd.merge(input_file, mapping_file, on=["country", "state"], how="left")
How are you loading the files?
Is there any pattern in the file names that you can work from?
If they are in the same folder, you can list them with
import os
list_of_files = os.listdir('my_directory/')
Or you could do a simple search in the country column looking for USA, and then apply the merge according to the situation, as sketched below.
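A minimal sketch of that last idea, with hypothetical file names and assuming the country/state column names match between the files; it checks whether the incoming file's country column contains USA and picks the merge key accordingly, reusing the merge calls from the question:
import pandas as pd

# hypothetical file names, for illustration only
input_file = pd.read_csv('incoming.csv')
mapping_file = pd.read_csv('mapping.csv')

if (input_file['country'] == 'USA').any():
    # USA file: join on the state column
    result_file = pd.merge(input_file, mapping_file, on="state", how="left")
else:
    # non-USA file: join on the country column
    result_file = pd.merge(input_file, mapping_file, on="country", how="left")

result_file.to_csv('result.csv', index=False)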

How to efficiently concatenate columns from different csv files based on ids python

I have 4 csv files. Each file has different fields, e.g. name, id_number, etc. Each file describes the same objects, identified by a unique id present in every file. So I would like to combine the fields of the 4 files into a single DataFrame. For instance, if one file contains first_name and another contains last_name, I want to merge them so that I have both first and last name for each object.
Doing that is trivial, but I'd like to know the most efficient way, or if there is some built-in function that does it very efficiently.
The files look something like this:
file1:
id name age pets
b13 Marge 18 cat
y47 Dan 13 dog
h78 Mark 20 lizard
file2:
id last_name income city
y47 Schmidt 1800 Dallas
b13 Olson 1670 Paris
h78 Diaz 2010 London
Files 3 and 4 are like that, with different fields. The ids are not necessarily in the same order. The goal, again, is to have one DataFrame looking like this:
id name age pets last_name income city
b13 Marge 18 cat Olson 1670 Paris
y47 Dan 13 dog Schmidt 1800 Dallas
h78 Mark 20 lizard Diaz 2010 London
What I've done is this:
file1 = pd.read_csv('file1.csv')
file2 = pd.read_csv('file2.csv')
file3 = pd.read_csv('file3.csv')
file4 = pd.read_csv('file4.csv')

f1_group = file1.groupby(['id'])
f2_group = file2.groupby(['id'])
f3_group = file3.groupby(['id'])
f4_group = file4.groupby(['id'])

data = []
for id1, group1 in f1_group:
    for id2, group2 in f2_group:
        for id3, group3 in f3_group:
            for id4, group4 in f4_group:
                if id1 == id2 == id3 == id4:
                    frames = [group1, group2, group3, group4]
                    con = pd.concat(frames, axis=1)
                    data.append(con)
That works, but it is extremely inefficient. If I could eliminate the elements that have already been considered from group1, group2, etc., that would help, but it would still be inefficient.
Thanks in advance.
Hi maybe you can try this :)
https://www.freecodecamp.org/news/how-to-combine-multiple-csv-files-with-8-lines-of-code-265183e0854/
import os
import glob
import pandas as pd
#set working directory
os.chdir("/mydir")
#find all csv files in the folder
#use glob pattern matching -> extension = 'csv'
#save result in list -> all_filenames
extension = 'csv'
all_filenames = [i for i in glob.glob('*.{}'.format(extension))]
#print(all_filenames)
#combine all files in the list
combined_csv = pd.concat([pd.read_csv(f) for f in all_filenames ])
#export to csv
combined_csv.to_csv( "combined_csv.csv", index=False, encoding='utf-8-sig')
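Note that the snippet above stacks the files row-wise rather than joining their columns by id. For the merge-on-id goal described in the question, one common approach (a sketch, with the question's file names assumed) is to merge the four DataFrames on the id column, for example with functools.reduce:
import functools
import pandas as pd

files = ['file1.csv', 'file2.csv', 'file3.csv', 'file4.csv']
frames = [pd.read_csv(f) for f in files]

# successively merge each DataFrame onto the previous result, matching rows by id
merged = functools.reduce(
    lambda left, right: pd.merge(left, right, on='id', how='inner'),
    frames,
)
print(merged)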

How to parse string as a pandas dataframe

I'm trying to build a self-contained Jupyter notebook that parses a long address string into a pandas dataframe for demonstration purposes. Currently I'm having to highlight the entire string and use pd.read_clipboard:
data = pd.read_clipboard(f,
                         comment='#',
                         header=None,
                         names=['address']).values.reshape(-1, 2)
matched_address = pd.DataFrame(data, columns=['addr_zagat', 'addr_fodor'])
I'm wondering if there is an easier way to read the string in directly instead of relying on having something copied to the clipboard. Here are the first few lines of the string for reference:
f = """###################################################################################################
#
# There are 112 matches between the tuples. The Zagat tuple is listed first,
# and then its Fodors pair.
#
###################################################################################################
Arnie Morton's of Chicago 435 S. La Cienega Blvd. Los Angeles 90048 310-246-1501 Steakhouses
Arnie Morton's of Chicago 435 S. La Cienega Blvd. Los Angeles 90048 310/246-1501 American
########################
Art's Deli 12224 Ventura Blvd. Studio City 91604 818-762-1221 Delis
Art's Delicatessen 12224 Ventura Blvd. Studio City 91604 818/762-1221 American
########################
Bel-Air Hotel 701 Stone Canyon Rd. Bel Air 90077 310-472-1211 Californian
Hotel Bel-Air 701 Stone Canyon Rd. Bel Air 90077 310/472-1211 Californian
########################
Cafe Bizou 14016 Ventura Blvd. Sherman Oaks 91423 818-788-3536 French Bistro
Cafe Bizou 14016 Ventura Blvd. Sherman Oaks 91423 818/788-3536 French
########################
h Bistro
Cafe Bizou 14016 Ventura Blvd. Sherman Oaks 91423 818/788-3536 French
########################"""
Does anybody have any tips as to how to parse this string directly into a pandas dataframe?
I realise there is another question that addresses this here: Create Pandas DataFrame from a string, but the string there is delimited by a semicolon and totally different from the format used in my example.
You should add an example of what your output should look like, but generally I would suggest something like this:
import pandas as pd
import numpy as np

# read file, split into lines
f = open("./your_file.txt", "r").read().split('\n')

accumulator = []
# loop through lines
for line in f:
    # define criteria for selecting lines
    if len(line) > 1 and line[0].isupper():
        # define criteria for splitting the line
        # get name
        first_num_char = [c for c in line if c.isdigit()][0]
        name = line.split(first_num_char, 1)[0]
        line = line.replace(name, '')
        # get restaurant type
        rest_type = line.split()[-1]
        line = line.replace(rest_type, '')
        # get phone number
        number = line.split()[-1]
        line = line.replace(number, '')
        # remainder should be the address
        address = line
        accumulator.append([name, rest_type, number, address])

# turn accumulator into numpy array, pass with column index to DataFrame constructor
df = pd.DataFrame(np.asarray(accumulator), columns=['name', 'restaurant_type', 'phone_number', 'address'])
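Since the question specifically asks how to read the string in directly, another option (a sketch, not part of the answer above) is to wrap the string in io.StringIO, which read_csv accepts as a file-like object. The separator below is assumed never to occur in the data, so each line becomes a single 'address' field, mirroring the original read_clipboard call:
import io
import pandas as pd

data = pd.read_csv(
    io.StringIO(f),        # treat the string as a file
    comment='#',           # drop the '#...' banner lines
    header=None,
    names=['address'],
    sep='\x01',            # assumed: this character never appears, so each line is one field
).values.reshape(-1, 2)    # pair consecutive Zagat/Fodors rows (assumes an even row count)

matched_address = pd.DataFrame(data, columns=['addr_zagat', 'addr_fodor'])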

R scripting Error { : missing value where TRUE/FALSE needed on Dataframe

I have a Data Frame which looks like this
Name Surname Country Path
John Snow UK /Home/drive/John
BOB Anderson /Home/drive/BOB
Tim David UK /Home/drive/Tim
Wayne Green UK /Home/drive/Wayne
I have written a script which first checks if country == "UK" and, if true, changes Path from "/Home/drive/" to "/Server/files/" using gsub in R.
Script
Pattern <- "/Home/drive/"
Replacement <- "/Server/files/"

for (i in 1:nrow(gs_catalog_Staging_123)) {
  if (gs_catalog_Staging_123$country[i] == "UK" && !is.na(gs_catalog_Staging_123$country[i])) {
    gs_catalog_Staging_123$Path <- gsub(Pattern, Replacement, gs_catalog_Staging_123$Path, ignore.case = T)
  }
}
The output I get:
Name Surname Country Path
John Snow UK /Server/files/John
*BOB Anderson /Server/files/BOB*
Tim David UK /Server/files/Tim
Wayne Green UK /Server/files/Wayne
The output I want
Name Surname Country Path
John Snow UK /Server/files/John
BOB Anderson /Home/drive/BOB
Tim David UK /Server/files/Tim
Wayne Green UK /Server/files/Wayne
As we can clearly see, gsub fails to recognize missing values and changes that row as well.
Many R functions are vectorized, so we can avoid a loop here.
# example data
df <- data.frame(
  name = c("John", "Bob", "Tim", "Wayne"),
  surname = c("Snow", "Ander", "David", "Green"),
  country = c("UK", "", "UK", "UK"),
  path = paste0("/Home/drive/", c("John", "Bob", "Tim", "Wayne")),
  stringsAsFactors = FALSE
)

# fix the path
df$newpath <- ifelse(df$country == "UK" & !is.na(df$country),
                     gsub("/Home/drive/", "/Server/files/", df$path),
                     df$path)

# view result
df
name surname country path newpath
1 John Snow UK /Home/drive/John /Server/files/John
2 Bob Ander /Home/drive/Bob /Home/drive/Bob
3 Tim David UK /Home/drive/Tim /Server/files/Tim
4 Wayne Green UK /Home/drive/Wayne /Server/files/Wayne
In fact, this is the issue with your code. Each time through your loop, you check row i but then do a full replacement of the whole column. A fix would be to add [i] at the appropriate places in the final line of your code:
gs_catalog_Staging_123$Path[i] <- gsub(Pattern, Replacement, gs_catalog_Staging_123$Path[i], ignore.case = T)
