I have over two thousand CSV files in a folder, named as follows:
University_2010_USA.csv, University_2011_USA.csv, Education_2012_USA.csv, Education_2012_Mexico.csv, Education_2012_Argentina.csv,
and
Results_2010_USA.csv, Results_2011_USA.csv, Results_2012_USA.csv, Results_2012_Mexico.csv, Results_2012_Argentina.csv,
I would like to match the first csv files in the list with the second ones based on "year" (2012, etc.) and "country" (Mexico, etc.) in the file name. Is there a way to do so quickly? Both sets of csv files have the same column names, and I'm looking at the following code:
import pandas as pd

df0 = pd.read_csv('University_2010_USA.csv')
df1 = pd.read_csv('Results_2010_USA.csv')
new_df = pd.merge(df0, df1, on=['year', 'country', 'region', 'sociodemographics'])
So basically, I need help writing a for-loop that iterates over the datasets... Thanks!
Try this:
import pandas as pd
from pathlib import Path

university = []
results = []
for file in Path('/path/to/data/folder').glob('*.csv'):
    # Determine the properties from the file's name
    file_type, year, country = file.stem.split('_')
    # The first group of files is prefixed University or Education,
    # the second group is prefixed Results; skip anything else
    if file_type not in ['University', 'Education', 'Results']:
        continue
    # Make the data frame, with 2 extra columns using properties
    # we extracted from the file's name
    tmp = pd.read_csv(file).assign(
        year=int(year),
        country=country
    )
    if file_type == 'Results':
        results.append(tmp)
    else:
        university.append(tmp)

df = pd.merge(
    pd.concat(university),
    pd.concat(results),
    on=['year', 'country', 'region', 'sociodemographics']
)
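One thing worth noting (not in the original answer): since both file families share the same column names, any columns beyond the four merge keys will get pandas' default _x/_y suffixes after the merge. If you want clearer names, you can pass your own suffixes; a small sketch, where the suffix labels are my own choice:
df = pd.merge(
    pd.concat(university, ignore_index=True),
    pd.concat(results, ignore_index=True),
    on=['year', 'country', 'region', 'sociodemographics'],
    suffixes=('_university', '_results'),  # hypothetical labels, pick your own
)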
I have many .csv files like this, each with a single column.
I'd like to merge them into one .csv file, so that each column contains the data of one of the csv files. Each column should have three heading rows (when converted to a spreadsheet): the first is the number of minutes extracted from the file name, the second is the first word in the file name after "export_", and the third is the whole file name.
I'd like to work in Python.
Can someone please help me with this? I am new to Python.
Thank you very much.
I tried to join only 2 files, but I have no idea how to do it with more files without writing everything down manually. Also, I don't know how to extract the headings from the file names:
import pandas as pd

file_list = ['export_Control 37C 4h_Single Cells_Single Cells_Single Cells.csv',
             'export_Control 37C 0 min_Single Cells_Single Cells_Single Cells.csv']
df = pd.DataFrame()
for file in file_list:
    temp_df = pd.read_csv(file)
    df = pd.concat([df, temp_df], axis=1)

print(df)
df.to_csv('output2.csv', index=False)
Assuming that your .csv files all have a header and the same number of rows, you can use the code below to put all the (single-columned) .csv files one beside the other in a single Excel worksheet.
import os
import pandas as pd

csv_path = r'path_to_the_folder_containing_the_csvs'
csv_files = [file for file in os.listdir(csv_path)]

list_of_dfs = []
for file in csv_files:
    temp = pd.read_csv(csv_path + '\\' + file, header=0, names=['Header'])
    # Build one-cell frames for the three heading rows extracted from the file name
    time_number = pd.DataFrame([[file.split('_')[1].split()[2]]], columns=['Header'])
    file_title = pd.DataFrame([[file.split('_')[1].split()[0]]], columns=['Header'])
    file_name = pd.DataFrame([[file]], columns=['Header'])
    out = pd.concat([time_number, file_title, file_name, temp]).reset_index(drop=True)
    list_of_dfs.append(out)

final = pd.concat(list_of_dfs, axis=1, ignore_index=True)
final.columns = ['Column' + str(col + 1) for col in final.columns]
final.to_csv(csv_path + '\\output.csv', index=False)
final
For example, with three .csv files, running the code above yields:
(output screenshots in Jupyter and in Excel omitted)
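A side note on the path handling above: os.path.join builds the same paths without hard-coding the Windows '\\' separator, in case the script ever needs to run on another OS. A minimal illustration:
import os

csv_path = r'path_to_the_folder_containing_the_csvs'
# equivalent to csv_path + '\\output.csv', but portable
print(os.path.join(csv_path, 'output.csv'))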
For my thesis project I need to import and join 117 .json files into one dataframe. It works manually, but I can't figure out a loop. Another problem is that the features need to have the filename in the data frame.
Basically I need to automate this process below:
import pandas as pd

df_aa = pd.read_json(r'path')
df_aa.columns = ['Time', 'Active_adresses']

df_tf = pd.read_json(r'path')
df_tf.columns = ['Time', 'Total_fees']

df = df_tf.merge(df_aa, on='Time', how='left')
Anyone who can help me? I don't have much programming experience.
Find a common pattern by which you can split/chunk all of the files, so that your loop only has to do the single action of loading. If no file-specific treatment is required, use
from glob import glob

print(list(glob("*.json")))  # wrap in list() if you want to print it

for item in glob("*.json"):
    do_your_loading()
to get all of the files, then load them one by one (or via an aggregating function, if the library provides one) and join the dataframes into a single one, if necessary.
To get the name into the dataframe, simply add a new column after the initial load; it will contain the filename on every row (taken from glob or otherwise).
For example:
import pandas as pd
from glob import glob

dataframes = {}
for item in glob("*.json"):
    df_aa = pd.read_json(item)
    df_aa.columns = ['Time', 'Active_adresses']
    # put the filename into a new column
    df_aa = df_aa.assign(filename=item)

    df_tf = pd.read_json(item)
    df_tf.columns = ['Time', 'Total_fees']
    # put the filename into a new column
    df_tf = df_tf.assign(filename=item)

    # merge the two for this file if needed
    merged = df_tf.merge(df_aa, on='Time', how='left')

    dataframes[item] = {"aa": df_aa, "tf": df_tf}

dataframes["my-filename.json"]["aa"].head()
dataframes["my-filename.json"]["tf"].head()
Is there a way in Python to drop columns in multiple Excel files? I.e., I have a folder with several xlsx files. Each file has about 5 columns (date, value, latitude, longitude, region). I want to drop all columns except date and value in each Excel file.
Let's say you have a folder with multiple excel files:
import pandas as pd
from pathlib import Path

folder = Path('excel_files')
xlsx_only_files = list(folder.rglob('*.xlsx'))

def process_files(xls_file):
    # stem is a pathlib attribute that gives just the file name
    # without the parent directory or the suffix
    filename = xls_file.stem
    # sheet_name=None makes read_excel return a dictionary
    # keyed by sheet name; usecols reads only the relevant columns
    df = pd.read_excel(xls_file, usecols=['date', 'value'], sheet_name=None)
    df_cleaned = [data.assign(sheetname=sheetname, filename=filename)
                  for sheetname, data in df.items()]
    return df_cleaned

# process_files returns a list of frames per file, so flatten before concatenating
combo = [df for xlsx in xlsx_only_files for df in process_files(xlsx)]
final = pd.concat(combo, ignore_index=True)
Let me know how it goes
I suggest you define the columns you want to keep as a list and then select them as a new dataframe.
# after opening the excel file as
df = pd.read_excel(...)

keep_cols = ['date', 'value']
df = df[keep_cols]  # keep only the selected columns; this returns a dataframe
df.to_excel(...)
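If the goal is to rewrite each workbook in place with only those two columns (rather than combining them), here is a minimal sketch, assuming one sheet per file, a folder named excel_files, and that overwriting the originals is acceptable:
from pathlib import Path

import pandas as pd

for xls_file in Path('excel_files').glob('*.xlsx'):
    df = pd.read_excel(xls_file, usecols=['date', 'value'])
    df.to_excel(xls_file, index=False)  # overwrites the original file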
How do I go about manipulating each file of a folder based on values pulled from a dictionary? Basically, say I have x files in a folder. I use pandas to reformat the dataframe, add a column which includes the date of the report, and save the new file under the same name and the date.
import os
import pandas as pd
from pathlib import Path

source = Path("Users/Yay/AlotofFiles/April")
items = os.listdir(source)
d_dates = {'0401': '04/1/2019', '0402': '4/2/2019', '0403': '04/03/2019'}

for item in items:
    for key, value in d_dates.items():
        df = pd.read_excel(item, header=None)
        df.columns = ['A', 'B', 'C']
        df = df[df['A'].str.contains("Awesome")]
        df['Date'] = value
        file_basic = "retrofile"
        short_date = key
        xlsx = ".xlsx"
        file_name = file_basic + short_date + xlsx
        df.to_excel(file_name)
I want each file to be unique and categorized by the date. In this case, I would want to have three files, for example "retrofile0401.xlsx" that has a column that contains "04/01/2019" and only has data relevant to the original file.
The actual result: the code loops over each individual item, creates three different files with those values, moves on to the next file, and repeats, replacing the previous iteration's output until I am left with three files that are all copies of the last file. The only differences are that each file has a different date and a different name. That is what I want, except for the duplicated data from the last file.
If I remove the second loop, it works the way I want, but then there's no way of categorizing the files based on the values I put in the dictionary.
Try the following. I'm only making input filenames explicit to make clear what's going on. You can continue to use yours from the source.
import pandas as pd

input_filenames = [
    'retrofile0401_raw.xlsx',
    'retrofile0402_raw.xlsx',
    'retrofile0403_raw.xlsx',
]
date_dict = {
    '0401': '04/1/2019',
    '0402': '4/2/2019',
    '0403': '04/03/2019',
}

for filename in input_filenames:
    date_key = filename[9:13]
    # name the columns as in your original script
    df = pd.read_excel(filename, header=None, names=['A', 'B', 'C'])
    df = df[df['A'].str.contains("Awesome")]
    df['Date'] = date_dict[date_key]
    df.to_excel('retrofile{date_key}.xlsx'.format(date_key=date_key))
filename[9:13] takes characters #9-12 from the filename. Those are the ones that correspond to your date codes.
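Slicing by position works as long as the prefix is always nine characters; if the naming ever varies, a regex pulls out the four-digit code regardless. A small sketch, purely optional for the files shown:
import re

filename = 'retrofile0401_raw.xlsx'
match = re.search(r'\d{4}', filename)
if match:
    date_key = match.group()  # '0401'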
All of my files have the following titles and they stretch back for a few years. I want to be able to read each file and then add the date from the file name as a column.
Filetype as of 2015-04-01.csv
import os
from pandas import Series

path = 'C:\\Users\\'
filelist = os.listdir(path)  # all of the .csv files I am working with
file_count = len(filelist)  # I thought I could do a for-loop and use this as the range
df = Series(filelist)  # I added this because I couldn't get the date from a list
date_name = df.str[15:-4]  # this gives me the date
So what I have tried is:
for file in filelist:
    df = pd.read_csv(file)
Now I want to take the date_name from the file name and add a column called date. Every file is exactly the same but I want to track changes over time and the only date is found just on the name of the file.
Then I will append it.
import glob
import pandas as pd

path = 'C:\\Users\\'
filelist = glob.glob(path + "/*.csv")
frame = pd.DataFrame()
list_ = []
for file in filelist:
    df = pd.read_csv(file)
    list_.append(df)

frame = pd.concat(list_)
How can I add the date_name to the file/dataframe? 1) Read the file, 2) Add the date column based on the file name, 3) Read the next file, 4) Add the date column, 5) Append, 6) Repeat for all files in the path
Edit:
I think I got something to work. Is this the best way? Can someone explain what list = [] and the rest of it are doing?
path = 'C:\\Users\\'
filelist = os.listdir(path)
list = []
frame = pd.DataFrame()
for file in filelist:
    df2 = pd.read_csv(path + file)
    date_name = file[15:-4]
    df2['Date'] = date_name
    list.append(df2)

frame = pd.concat(list)
This seems like a reasonable way to do it. The pd.concat takes a list of pandas objects and concatenates them. append adds each frame to the list as you loop through the files. I see two things to change though.
You don't need frame = pd.DataFrame(). It does nothing, since you are appending the dataframes to the list and building frame from pd.concat at the end.
I'd change the name of the variable list to something else, maybe frames, as that is descriptive of the contents and doesn't shadow the built-in list.
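Putting both suggestions together, the loop might look like this (same logic as your edit, with the unused frame initialization dropped and list renamed):
import os
import pandas as pd

path = 'C:\\Users\\'
frames = []
for file in os.listdir(path):
    df2 = pd.read_csv(path + file)
    df2['Date'] = file[15:-4]  # the date portion of the file name
    frames.append(df2)

frame = pd.concat(frames)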