I am facing an issue where I have to load a huge CSV file, split it into multiple files based on the unique values in certain columns, and write the pieces out as multiple CSVs with a predefined naming pattern.
An example of the original CSV is below.
date place type product value zone
09/10/16 NY Zo shirt 19 1
09/10/16 NY Mo jeans 18 2
09/10/16 CA Zo trouser 13 3
09/10/16 CA Co tie 17 4
09/10/16 WA Wo bat 11 1
09/10/16 FL Zo ball 12 2
09/10/16 NC Mo belt 13 3
09/10/16 WA Zo buckle 15 4
09/10/16 WA Co glass 16 1
09/10/16 FL Zo cup 19 2
I have to filter this massive pandas DataFrame into multiple DataFrames based on place, type and zone, and each output DataFrame should be written to its own CSV file with the naming convention place_type_product_zone.csv.
The code I have so far is below.
def list_of_dataframes(df, col_list):
    df_list = [df]
    name_list = []
    for col in col_list:
        df_list, names = _split_dataframes(df_list, col)
    file_name = zip(name_list, df_list)  # name_list is never filled; this is the naming part I'm stuck on
    names_to_dfs = dict(zip(names, df_list))
    for k, v in names_to_dfs.items():
        v.to_csv("{0}.csv".format(k))
    print("CSV files created")
    return df_list, file_name
def _split_dataframes(df_list, col):
    names = []
    dfs = []
    for df in df_list:
        for c in df[col].unique():
            dfs.append(df.loc[df[col] == c])
            names.append(c)
    return dfs, names
list_of_dataframes(df, ['place', 'type', 'zone'])
It outputs CSV files with the titles 1.csv, 2.csv, etc. How do I create a loop in the function to get the naming convention NY_zo_shirt_1.csv, CA_Zo_trouser_3.csv, etc.? Should I be creating a dictionary that stores all the keys?
Thanks in advance.
Here it is -
# Part 1
places = df['place'].unique()
types = df['type'].unique()
products = df['product'].unique()
zones = df['zone'].unique()
# Part 2
import itertools
combs = list(itertools.product(*[places, types, products, zones]))
# Part 3
for comb in combs:
    place, type_, prod, zone = comb
    df_subset = df[(df['place']==place) & (df['type']==type_) & (df['product']==prod) & (df['zone']==zone)]
    if df_subset.shape[0] > 0:
        df_subset.to_csv('temp1/{}_{}_{}_{}.csv'.format(place, type_, prod, zone), index=False)
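If the dataframe is really huge, you can also get the same files without enumerating every possible combination first: a single groupby over the key columns only visits combinations that actually occur. A minimal sketch, assuming the same column names and output folder as above:

# Sketch: one pass with groupby instead of itertools.product over all combinations
for (place, type_, prod, zone), df_subset in df.groupby(['place', 'type', 'product', 'zone']):
    df_subset.to_csv('temp1/{}_{}_{}_{}.csv'.format(place, type_, prod, zone), index=False)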
I have multiple CSV files, each formatted with several tables inside, separated by blank lines.
Example:
Technology C_inv [MCHF/y] C_maint [MCHF/y]
NUCLEAR 70.308020 33.374568
HYDRO_DAM_EXISTING 0.000000 195.051200
HYDRO_DAM 67.717942 1.271600
HYDRO_RIVER_EXISTING 0.000000 204.820000
IND_BOILER_OIL 2.053610 0.532362
IND_BOILER_COAL 4.179935 1.081855
IND_BOILER_WASTE 11.010126 2.849652
DEC_HP_ELEC 554.174644 320.791276
DEC_THERMAL_HP_GAS 77.077291 33.717477
DEC_BOILER_GAS 105.586089 41.161335
DEC_BOILER_OIL 33.514266 25.948450
H2_FROM_GAS 145.185290 59.178082
PYROLYSIS 132.200818 112.392123
Storage technology C_inv [MCHF/y] C_maint [MCHF/y]
HYDRO_STORAGE 0.000000 0.000000
Resource C_op [MCHF/y]
ELECTRICITY 1174.452848
GASOLINE 702.000000
DIESEL 96.390000
OIL 267.787558
NG 1648.527242
WOOD 592.110000
COAL 84.504083
URANIUM 18.277626
WASTE 0.000000
All my CSV files have different subtable names, but there are few enough of them that I could list them manually for detection if required.
Another issue is that many titles include spaces (e.g. "Storage Technology"), which pandas reads as two columns.
I initially tried to do it directly with pandas, splitting manually, but the on_bad_lines='skip' argument that avoids the errors also skips useful lines:
Cost_bd = pd.read_csv(f"{Directory}/cost_breakdown.csv", on_bad_lines='skip', delim_whitespace=True).dropna(axis=1, how='all')
colnames = ['Technology', 'C_inv[MCHF/y]', 'C_maint[MCHF/y]']
Cost_bd.columns = colnames
I believe it might be better to scan the file and split it myself, but I'm unsure of the best way to do this.
I have also tried the solution provided in this thread:
import csv
import pandas as pd
from os.path import dirname  # gets parent folder in a path
from os.path import join     # concatenate paths

table_names = ["Technology", "Storage technology", "Resource"]
df = pd.read_csv(f"{Directory}/cost_breakdown.csv", header=None, names=range(3))
groups = df[0].isin(table_names).cumsum()
tables = {g.iloc[0, 0]: g.iloc[1:] for k, g in df.groupby(groups)}
but it doesn't work:
tables.keys()=
dict_keys(['Technology\tC_inv [MCHF/y]\tC_maint [MCHF/y]'])
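A likely reason, judging from the literal \t characters in that key, is that the file is tab-delimited, so each whole line ends up in a single cell and the isin check never matches a bare table name. A hedged sketch of the same approach with an explicit tab separator:

# Assumption: the file is tab-delimited, as the '\t' in the printed key suggests
df = pd.read_csv(f"{Directory}/cost_breakdown.csv", header=None, sep='\t', names=range(3))
groups = df[0].isin(table_names).cumsum()
tables = {g.iloc[0, 0]: g.iloc[1:] for k, g in df.groupby(groups)}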
EDIT: Final solution, based on @Rabinzel's answer below:
import re
import pandas as pd

def make_df(group, dict_of_dfs):
    header, data = re.split(r'\t', group[0]), list(map(str.split, group[1:]))
    if len(header) != len(data[0]):  # header is missing columns: borrow them from the first table read
        header = header + dict_of_dfs[list(dict_of_dfs.keys())[0]].columns.tolist()[1:]
    dict_of_dfs[header[0]] = pd.DataFrame(data, columns=header)
    return dict_of_dfs

def Read_csv_as_df(path, file_name):
    with open(path + file_name) as f:
        dict_of_dfs = {}
        group = []
        for line in f:
            if line != '\n':
                group.append(line.strip())
            else:
                dict_of_dfs = make_df(group, dict_of_dfs)
                group = []
        dict_of_dfs = make_df(group, dict_of_dfs)  # last table has no trailing blank line
    return dict_of_dfs
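For reference, a minimal usage sketch of the function above (the directory and file name are placeholders, not the real paths):

# Hypothetical paths; adjust to your own layout
tables = Read_csv_as_df('./', 'cost_breakdown.csv')
print(tables.keys())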
I would do it the following way: iterate through the rows, append each chunk separated by a blank line to a list, and build DataFrames from those lists. For the column names containing spaces, use re.split and split the header only where there are two or more consecutive spaces.
Save the different DataFrames in a dictionary where the key is the first element of each DataFrame's header.
import re
import pandas as pd

def make_df(group):
    # Header: split on two or more spaces; data rows: split on any whitespace
    header, data = re.split(r'\s\s+', group[0]), list(map(str.split, group[1:]))
    dict_of_dfs[header[0]] = pd.DataFrame(data, columns=header)

with open('your_csv_file.csv') as f:
    dict_of_dfs = {}
    group = []
    for line in f:
        if line != '\n':
            group.append(line.strip())
        else:
            make_df(group)
            group = []
make_df(group)  # build the last table, which has no trailing blank line

for key, value in dict_of_dfs.items():
    print(f"{key=}\ndf:\n{value}\n---------------------")
Output:
key='Technology'
df:
Technology C_inv [MCHF/y] C_maint [MCHF/y]
0 NUCLEAR 70.308020 33.374568
1 HYDRO_DAM_EXISTING 0.000000 195.051200
2 HYDRO_DAM 67.717942 1.271600
3 HYDRO_RIVER_EXISTING 0.000000 204.820000
4 IND_BOILER_OIL 2.053610 0.532362
5 IND_BOILER_COAL 4.179935 1.081855
6 IND_BOILER_WASTE 11.010126 2.849652
7 DEC_HP_ELEC 554.174644 320.791276
8 DEC_THERMAL_HP_GAS 77.077291 33.717477
9 DEC_BOILER_GAS 105.586089 41.161335
10 DEC_BOILER_OIL 33.514266 25.948450
11 H2_FROM_GAS 145.185290 59.178082
12 PYROLYSIS 132.200818 112.392123
---------------------
key='Storage technology'
df:
Storage technology C_inv [MCHF/y] C_maint [MCHF/y]
0 HYDRO_STORAGE 0.000000 0.000000
---------------------
key='Resource'
df:
Resource C_op [MCHF/y]
0 ELECTRICITY 1174.452848
1 GASOLINE 702.000000
2 DIESEL 96.390000
3 OIL 267.787558
4 NG 1648.527242
5 WOOD 592.110000
6 COAL 84.504083
7 URANIUM 18.277626
8 WASTE 0.000000
---------------------
I have a .csv file that contains 3 types of records, each with a different number of columns.
I know the structure of each record type, and that the rows always come with type1 first, then type2, and type3 at the end, but I don't know how many rows of each record type there are.
The first 4 characters of each row define the record type of that row.
CSV Example:
typ1,John,Smith,40,M,Single
typ1,Harry,Potter,22,M,Married
typ1,Eva,Adams,35,F,Single
typ2,2020,08,16,A
typ2,2020,09,02,A
typ3,Chevrolet,FC101TT,2017
typ3,Toyota,CE972SY,2004
How can I read it with pandas? It doesn't matter if I have to read one record type at a time.
Thanks!!
Here is a pandas solution.
First we must read the CSV file in a way that makes pandas keep each entire line in a single cell. We do that by using a "wrong" separator, such as the hash symbol '#'. It can be whatever we want, as long as we can guarantee it never appears in the data file.
import pandas as pd

wrong_sep = '#'
right_sep = ','
df = pd.read_csv('my_file.csv', sep=wrong_sep).iloc[:, 0]
The .iloc[:, 0] is used as a quick way to convert a DataFrame into a Series.
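An equivalent spelling, if you prefer to make the intent explicit (a sketch, assuming a reasonably recent pandas where squeeze accepts an axis argument):

# Alternative to .iloc[:, 0]: squeeze the single-column frame into a Series
df = pd.read_csv('my_file.csv', sep=wrong_sep).squeeze('columns')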
Then we use a loop to select the rows that belong to each record type based on their starting characters, and use the "right" separator (here a comma ',') to split the selected rows into real DataFrames.
starters = ['typ1', 'typ2', 'typ3']
detected_dfs = dict()
for start in starters:
    _df = df[df.str.startswith(start)].str.split(right_sep, expand=True)
    detected_dfs[start] = _df
And here you go. If we print the resulting DataFrames, we get:
0 1 2 3 4 5
0 typ1 Harry Potter 22 M Married
1 typ1 Eva Adams 35 F Single
0 1 2 3 4
2 typ2 2020 08 16 A
3 typ2 2020 09 02 A
0 1 2 3
4 typ3 Chevrolet FC101TT 2017
5 typ3 Toyota CE972SY 2004
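One caveat worth noting: with the default header handling, the very first line of the file becomes the column label, which is why John Smith is missing from the typ1 frame above. If you need every row, a sketch of the same read with header=None:

# Assumed variant: header=None keeps the first line ('typ1,John,...') as data
df = pd.read_csv('my_file.csv', sep=wrong_sep, header=None).iloc[:, 0]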
Let me know if it helped you!
Not Pandas:
from collections import defaultdict

filename2 = 'Types.txt'
with open(filename2) as dataLines:
    nL = dataLines.read().splitlines()

defDList = defaultdict(list)
subs = ['typ1', 'typ2', 'typ3']
dataReadLines = [defDList[i].append(j) for i in subs for j in nL if i in j]
# dataReadLines = [i for i in nL]
print(defDList)
Output:
defaultdict(<class 'list'>, {'typ1': ['typ1,John,Smith,40,M,Single', 'typ1,Harry,Potter,22,M,Married', 'typ1,Eva,Adams,35,F,Single'], 'typ2': ['typ2,2020,08,16,A', 'typ2,2020,09,02,A'], 'typ3': ['typ3,Chevrolet,FC101TT,2017', 'typ3,Toyota,CE972SY,2004']})
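If you do want pandas DataFrames afterwards, a small sketch that builds one per record type from the collected lines (assuming pandas is available):

import pandas as pd

# One DataFrame per record type, splitting each stored line on commas
dfs = {k: pd.DataFrame([row.split(',') for row in v]) for k, v in defDList.items()}
print(dfs['typ1'])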
You can make use of the skiprows parameter of pandas read_csv method to skip the rows you are not interested in for a particular record type. The following gives you a dictionary dfs of dataframes for each type. An advantage is that records of the same types don't necessarily have to be adjacent to each other in the csv file.
For larger files you might want to adjust the code so that the file is only read once rather than several times; a sketch of that follows the code below.
import pandas as pd
from collections import defaultdict

indices = defaultdict(list)
types = ['typ1', 'typ2', 'typ3']
filename = 'test.csv'

with open(filename) as csv:
    for idx, line in enumerate(csv.readlines()):
        for typ in types:
            if line.startswith(typ):
                indices[typ].append(idx)

dfs = {typ: pd.read_csv(filename, header=None,
                        skiprows=lambda x: x not in indices[typ])
       for typ in types}
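And the single-pass variant mentioned above, a minimal sketch assuming the whole file fits in memory:

import io
import pandas as pd
from collections import defaultdict

# Collect the raw lines per record type in one pass, then parse each group once
lines = defaultdict(list)
with open('test.csv') as f:
    for line in f:
        for typ in ['typ1', 'typ2', 'typ3']:
            if line.startswith(typ):
                lines[typ].append(line)

dfs = {typ: pd.read_csv(io.StringIO(''.join(rows)), header=None)
       for typ, rows in lines.items()}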
Read the file as a CSV file using the CSV reader. The reader fortunately does not care about line formats:
import csv
with open("yourfile.csv") as infile:
data = list(csv.reader(infile))
Collect the rows with the same first element and build a dataframe of them:
import pandas as pd
from itertools import groupby
dfs = [pd.DataFrame(v) for _,v in groupby(data, lambda x: x[0])]
You've got a list of three dataframes (or as many as necessary).
dfs[1]
# 0 1 2 3 4
#0 typ2 2020 08 16 A
#1 typ2 2020 09 02 A
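If you would rather look the frames up by record type than by list position, a small variant of the same idea (the groupby key is simply the first field of each row):

# Sketch: key each DataFrame by its record type instead of its position in a list
dfs_by_type = {k: pd.DataFrame(list(v)) for k, v in groupby(data, lambda x: x[0])}
dfs_by_type['typ2']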
I have 4 CSV files. Each file has different fields, e.g. name, id_number, etc., but all the files describe the same objects, which share a unique id present in every file. I would like to combine the fields of the 4 files into a single DataFrame. For instance, if one file contains first_name and another contains last_name, I want to merge them so that I have both the first and last name for each object.
Doing that is trivial, but I'd like to know the most efficient way, or if there is some built-in function that does it very efficiently.
The files look something like this:
file1:
id name age pets
b13 Marge 18 cat
y47 Dan 13 dog
h78 Mark 20 lizard
file2:
id last_name income city
y47 Schmidt 1800 Dallas
b13 Olson 1670 Paris
h78 Diaz 2010 London
Files 3 and 4 are similar, with different fields. The ids are not necessarily in the same order. The goal, again, is to have one DataFrame looking like this:
id name age pets last_name income city
b13 Marge 18 cat Olson 1670 Paris
y47 Dan 13 dog Schmidt 1800 Dallas
h78 Mark 20 lizard Diaz 2010 London
What I've done is this:
file1 = pd.read_csv('file1.csv')
file2 = pd.read_csv('file2.csv')
file3 = pd.read_csv('file3.csv')
file4 = pd.read_csv('file4.csv')
f1_group = file1.groupby(['id'])
f2_group = file2.groupby(['id'])
f3_group = file3.groupby(['id'])
f4_group = file4.groupby(['id'])
data = []
for id1, group1 in f1_group:
    for id2, group2 in f2_group:
        for id3, group3 in f3_group:
            for id4, group4 in f4_group:
                if id1 == id2 == id3 == id4:
                    frames = [group1, group2, group3, group4]
                    con = pd.concat(frames, axis=1)
                    data.append(con)
That works, but it is extremely inefficient. If I could remove an element from group1, group2, etc. once it has been matched, that would help, but it would still be inefficient.
Thanks in advance.
Hi, maybe you can try this :)
https://www.freecodecamp.org/news/how-to-combine-multiple-csv-files-with-8-lines-of-code-265183e0854/
import os
import glob
import pandas as pd
#set working directory
os.chdir("/mydir")
#find all csv files in the folder
#use glob pattern matching -> extension = 'csv'
#save result in list -> all_filenames
extension = 'csv'
all_filenames = [i for i in glob.glob('*.{}'.format(extension))]
#print(all_filenames)
#combine all files in the list
combined_csv = pd.concat([pd.read_csv(f) for f in all_filenames ])
#export to csv
combined_csv.to_csv( "combined_csv.csv", index=False, encoding='utf-8-sig')
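Note that pd.concat here stacks the files row-wise. If the goal from the question is to line the fields up column-wise by the shared id, a hedged sketch of the same idea using pd.merge (assuming every file has an 'id' column):

import glob
from functools import reduce
import pandas as pd

# Merge all files on the shared 'id' column instead of stacking rows
all_filenames = glob.glob('*.csv')
frames = [pd.read_csv(f) for f in all_filenames]
merged = reduce(lambda left, right: pd.merge(left, right, on='id', how='outer'), frames)
merged.to_csv('merged_on_id.csv', index=False)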
I have multiple CSV files (around 200) in a folder that I want to merge into one single DataFrame. Each file has 3 columns, of which 2 are common to all the files (Country and Year); the third column is different in each file.
For example, one file has the following columns:
Country Year X
----------------------
Mexico 2015 10
Spain 2014 6
And other file can be like this:
Country Year A
--------------------
Mexico 2015 90
Spain 2014 67
USA 2020 8
I can read these files and merge them with the following code:
x = pd.read_csv("x.csv")
a = pd.read_csv("a.csv")
df = pd.merge(a, x, how="left", left_on=["country", "year"],
              right_on=["country", "year"], indicator=False)
And this result in the output that I want, like this:
Country Year A X
-------------------------
Mexico 2015 90 10
Spain 2014 67 6
USA 2020 8
However, my problem is doing that process for each file; there are more than 200, and I want to know if I can use a loop (or another method) to read the files and merge them into a single DataFrame.
Thank you very much, I hope I was clear enough.
Use glob like this:
import glob
print(glob.glob("/home/folder/*.csv"))
This gives all your files in a list : ['/home/folder/file1.csv', '/home/folder/file2.csv', .... ]
Now you can just iterate over this list from index 1 to the end, keeping index 0 as your base, and do pd.read_csv() and pd.merge() - it should be sorted!
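A minimal sketch of that loop, assuming the key columns are called Country and Year as in the examples (adjust the names and case to match your files):

import glob
import pandas as pd

# Start from the first file and left-merge every remaining file on Country/Year
files = glob.glob("/home/folder/*.csv")
df = pd.read_csv(files[0])
for f in files[1:]:
    df = df.merge(pd.read_csv(f), how="left", on=["Country", "Year"])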
Try this:
import os
import pandas as pd
# update this to path that contains your .csv's
path = '.'
# get files that end with csv in path
dir_list = [file for file in os.listdir(path) if file.endswith('.csv')]
# initiate empty list
df_list = []
# simple for loop with Try, Except that passes on iterations that throw errors when trying to 'read_csv' your files
for file in dir_list:
    try:
        # append to df_list and set your indices to match across your df's for later pd.concat to work
        df_list.append(pd.read_csv(file).set_index(['Country', 'Year']))
    except:  # change this depending on whatever errors pd.read_csv() throws
        pass

concatted = pd.concat(df_list)
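One hedged tweak: with every frame indexed on ['Country', 'Year'], passing axis=1 makes concat align the per-file value columns side by side (an outer join on the index) rather than stacking rows:

# Assumed intent: one row per (Country, Year) with each file's value column alongside
concatted = pd.concat(df_list, axis=1).reset_index()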
I have a list of tuples in this format:
[("25.00", u"A"), ("44.00", u"X"),("17.00", u"E"),("34.00", u"Y")]
I want to count the number of times each letter appears.
I have already created a sorted list of all the letters, and now I want to count them.
First of all, I have a problem with the u before the second item of each tuple; I don't know how to remove it. I guess it's something to do with encoding.
Here is my code
# coding=utf-8
from collections import Counter
import pandas as pd
from pandas import ExcelWriter
from pandas import ExcelFile

df = pd.read_excel('test.xlsx', sheet_name='Essais', skiprows=1)

groupes = []
students = []
group_of_each_letter = []
number_of_students_per_group = []
final_list = []

def print_a_list(list):
    for items in list:
        print(items)

for i in df.index:
    groupes.append(df['GROUPE'][i])
    students.append(df[u'ÉTUDIANT'][i])

groupes = groupes[1:]
students = students[1:]

group_of_each_letter = list(set(groupes))
group_of_each_letter = sorted(group_of_each_letter)

z = zip(students, groupes)
z = list(set(z))
final_list = list(zip(*z))

for j in group_of_each_letter:
    number_of_students_per_group.append(final_list.count(j))

print_a_list(number_of_students_per_group)
group_of_each_letter is a list of the group letters without duplicates.
The problem is that the for loop at the end gives me the right number of values, but the list is filled with 0s.
In the Excel file, the column "ÉTUDIANT" means "student number"; I can't edit the file, I have to deal with it as-is. GROUPE obviously means group. The goal is to count the number of students per group. I think I'm on the right track, even if there are easier ways to do this.
Thanks in advance for your help, even though I know my question is a bit ambiguous.
Building off of kerwei's answer:
Use groupby() and then nunique()
This will give you the number of unique Student IDs in each Group.
import pandas as pd
df = pd.read_excel('test.xlsx', sheet_name='Essais', skiprows=1)
# Drop the empty row, which is actually the subheader
df.drop(0, axis=0, inplace=True)
# Now we get a count of unique students by group
student_group = df.groupby('GROUPE')[u'ÉTUDIANT'].nunique()
I think a groupby.count() should be sufficient. It'll count the number of occurrences of your GROUPE letter in the dataframe.
import pandas as pd
df = pd.read_excel('test.xlsx', sheet_name='Essais', skiprows=1)
# Drop the empty row, which is actually the subheader
df.drop(0, axis=0, inplace=True)
# Now we get a count of students by group
sub_student_group = df.groupby(['GROUPE','ETUDIANT']).count().reset_index()
>>>sub_student_group
GROUPE ETUDIANT
0 B 29
1 L 88
2 N 65
3 O 27
4 O 29
5 O 34
6 O 35
7 O 54
8 O 65
9 O 88
10 O 99
11 O 114
12 O 122
13 O 143
14 O 147
15 U 122
student_group = sub_student_group.groupby('GROUPE').count()
>>>student_group
ETUDIANT
GROUPE
B 1
L 1
N 1
O 12
U 1