When I import several CSVs and save them in a list, every imported file just shows up as q = [DataFrame, DataFrame, DataFrame, DataFrame, DataFrame, DataFrame, DataFrame, DataFrame]. I would like to name each entry based on the name of the file it came from.
import pandas as pd

files_array_Q = []
files_array_F = []
files_array_MRG = []

for files in files_import_Q:
    qs_matrix = pd.read_csv(files, delimiter=" ", header=None)
    files_array_Q.append(qs_matrix)

for files in files_import_F:
    in_fam = pd.read_csv(files, delimiter=" ", header=None)
    files_array_F.append(in_fam)
For example, after reading files named file1.Q, file2.Q, file3.Q and file4.Q, I would like the list to look like files_array_Q = [file1, file2, file3, file4].
You can use the Pandas string split function. This solution assumes that the file names do not have extraneous periods.
q_list = pd.DataFrame(['file1.Q', 'file2.Q', 'file3.Q'], columns=['files'])
# split each name into file and type
q_split = pd.DataFrame(q_list.files.str.split('.', n=1).tolist())
# now keep only the first column (the file name without the extension):
q_name_only = q_split[q_split.columns[0]]
You can combine these two steps into one line using Pandas iloc to choose the column by its integer location:
q_name_only = pd.DataFrame(q_list.files.str.split('.', n=1).tolist()).iloc[:, 0]
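To tie this back to the original loop, one option (a minimal sketch, not the only approach) is to keep the dataframes in a dictionary keyed by the file's base name instead of a plain list. Here files_import_Q is the same iterable of paths from the question:

import os
import pandas as pd

# dict mapping base file name (e.g. 'file1') to its dataframe
files_dict_Q = {}
for path in files_import_Q:
    base = os.path.splitext(os.path.basename(path))[0]  # 'file1.Q' -> 'file1'
    files_dict_Q[base] = pd.read_csv(path, delimiter=" ", header=None)

# access a dataframe by the name of the file it came from, e.g. files_dict_Q['file1']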
I have over two thousand CSV files in a folder, named as follows:
University_2010_USA.csv, University_2011_USA.csv, Education_2012_USA.csv, Education_2012_Mexico.csv, Education_2012_Argentina.csv,
and
Results_2010_USA.csv, Results_2011_USA.csv, Results_2012_USA.csv, Results_2012_Mexico.csv, Results_2012_Argentina.csv,
I would like to match the files in the first group with those in the second group based on the "year" (2012, etc.) and "country" (Mexico, etc.) in the file name. Is there a way to do this quickly? Both sets of CSV files have the same column names, and I'm currently looking at the following code:
df0 = pd.read_csv('University_2010_USA.csv')
df1 = pd.read_csv('Results_2010_USA.csv')
new_df = pd.merge(df0, df1, on=['year','country','region','sociodemographics'])
So basically, I need help writing a for-loop that iterates over the datasets... Thanks!
Try this:
import pandas as pd
from pathlib import Path

university = []
results = []

for file in Path('/path/to/data/folder').glob('*.csv'):
    # Determine the properties from the file's name
    file_type, year, country = file.stem.split('_')
    if file_type not in ['University', 'Results']:
        continue

    # Make the data frame, with 2 extra columns built from the
    # properties we extracted from the file's name
    tmp = pd.read_csv(file).assign(
        year=int(year),
        country=country
    )
    if file_type == 'University':
        university.append(tmp)
    else:
        results.append(tmp)

df = pd.merge(
    pd.concat(university),
    pd.concat(results),
    on=['year', 'country', 'region', 'sociodemographics']
)
I have multiple CSV files that I want to compare. The file contents are the same except for some additional changes, and I want to list those additional changes.
For example:
files = [1.csv, 2.csv, 3.csv]
I want to compare 1.csv and 2.csv, get the difference and store it somewhere, then compare 2.csv and 3.csv and store that diff somewhere.
import glob
import os

for dirs in glob.glob(INPUT_PATH + "*"):
    if os.path.isdir(dirs):
        for files in glob.glob(dirs + '*/' + '/*.csv'):
            # this lists all the csv files, but how do I read them to get the difference?
You can use pandas to read each CSV into a dataframe, collect them in a list, and then compare them from that list:

import pandas as pd

dfList = []
dfList.append(pd.read_csv('FilePath'))

dfList[0] contains the contents of the first CSV file, and so on. So, to compare the first and second CSVs, you compare dfList[0] with dfList[1].
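The answer stops short of the actual comparison. One way to list the rows that differ between two of those dataframes (a sketch, assuming both files share the same columns and you only care about added or removed rows, not row order) is an outer merge with an indicator column:

import pandas as pd

df1 = pd.read_csv('1.csv')
df2 = pd.read_csv('2.csv')

# keep only the rows that appear in just one of the two files
diff = df1.merge(df2, how='outer', indicator=True)
diff = diff[diff['_merge'] != 'both']
print(diff)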
The first function compares 2 files, and the second function creates an additional file containing the difference between them.
def compare(file_compared, file_master):
    """
    A = [100,200,300]
    B = [400,500,100]
    compare(A,B) = [200,300]
    """
    file_compared_list = []
    file_master_list = []
    with open(file_compared, 'r') as fc:
        for line in fc:
            file_compared_list.append(line.strip())
    with open(file_master, 'r') as fm:
        for line in fm:
            file_master_list.append(line.strip())
    return list(set(file_compared_list) - set(file_master_list))

def create_file(filename):
    diff = compare("file1.csv", "file2.csv")
    with open(filename, 'w') as f:
        for element in diff:
            f.write(element + '\n')  # write each differing line on its own line

create_file("test.csv")
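Note that compare() only goes one way: it returns the lines of file_compared that are missing from file_master. If you want the differences in both directions, a small sketch (reusing the compare() defined above) is to call it twice:

# lines only in file1.csv plus lines only in file2.csv
only_in_first = compare("file1.csv", "file2.csv")
only_in_second = compare("file2.csv", "file1.csv")
full_diff = only_in_first + only_in_second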
How do I go about manipulating each file in a folder based on values pulled from a dictionary? Basically, say I have x files in a folder. I use pandas to reformat the dataframe, add a column that includes the date of the report, and save the new file under the same name plus the date.
import pandas as pd
import os
from pathlib2 import Path

source = Path("Users/Yay/AlotofFiles/April")
items = os.listdir(source)
d_dates = {'0401': '04/1/2019', '0402': '4/2/2019', '0403': '04/03/2019'}

for item in items:
    for key, value in d_dates.items():
        df = pd.read_excel(item, header=None)
        df.columns = ['A', 'B', 'C']
        df = df[df['A'].str.contains("Awesome")]
        df['Date'] = value
        file_basic = "retrofile"
        short_date = key
        xlsx = ".xlsx"
        file_name = file_basic + short_date + xlsx
        df.to_excel(file_name)
I want each file to be unique and categorized by the date. In this case, I would want to have three files, for example "retrofile0401.xlsx" that has a column that contains "04/01/2019" and only has data relevant to the original file.
What actually happens is that the code loops over each individual item, creates three different files with those values, moves on to the next file, repeats, and overwrites the earlier iterations, until I am left with three files that are all copies of the last file. The only differences are that each file has a different date and a different name. That part is what I want, but the data is duplicated from the last file.
If I remove the second loop, it works the way I want, but then there's no way to categorize the files based on the values in the dictionary.
Try the following. I'm only making input filenames explicit to make clear what's going on. You can continue to use yours from the source.
import pandas as pd

input_filenames = [
    'retrofile0401_raw.xlsx',
    'retrofile0402_raw.xlsx',
    'retrofile0403_raw.xlsx',
]
date_dict = {
    '0401': '04/1/2019',
    '0402': '4/2/2019',
    '0403': '04/03/2019'}

for filename in input_filenames:
    date_key = filename[9:13]
    df = pd.read_excel(filename, header=None)
    df.columns = ['A', 'B', 'C']                # name the columns as in the question
    df = df[df['A'].str.contains("Awesome")]    # keep only the matching rows
    df['Date'] = date_dict[date_key]
    df.to_excel('retrofile{date_key}.xlsx'.format(date_key=date_key))
filename[9:13] takes characters #9-12 from the filename. Those are the ones that correspond to your date codes.
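If the prefix length ever changes, slicing at a fixed position becomes fragile. A more flexible sketch (assuming the date code is the only four-digit run in the name) is to pull it out with a regular expression:

import re

filename = 'retrofile0401_raw.xlsx'
match = re.search(r'\d{4}', filename)
if match:
    date_key = match.group(0)  # '0401'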
I have many CSV files and I would like to rename each column of each file. For example, a CSV file has a column named "wind" and I would like to transform it automatically into wind_Dar (Dar is the name of one file). In other words, I would like each column of each file to be labelled "column name"_"currentFilename".
Here is my code:

import pandas as pd

path = ".../As-Pre-"
path_previsions = ["Dar.csv", "Ope.csv", "Wea.csv", "Wun.csv"]
path_observations = ".../As-Ob.csv"

def get_forecast(path, path_pre, path_ob):
    list_data = []
    for forecaster in path_pre:
        dataframe = pd.read_csv(path + forecaster, sep=";").dropna(subset=["temperature"])
        dataframe["time"] = dataframe["time"].apply(lambda x: str(x).split(":")[0])
        dataframe = dataframe.groupby(['time']).mean()
        dataframe = dataframe.rename(index=str, columns={"humidity": "humidity_Y",
                                                         "precipitation": "precipitation_Y",
                                                         "temperature": "temperature_Y"})
        list_data.append(dataframe)
I'm not sure where your code fails, but here is an easy way to rename the columns the way you want, using a list comprehension:

dataframe.columns = [x + '_' + forecaster.split('.')[0] for x in dataframe.columns]
I have two or more CSV files in a folder, and I want to import each file into a different dataframe and name each dataframe after its file.
import os
import pandas as pd

# The folder contains 2 csv files, i.e. Prod_IMFIFSS2017_Indicator.csv & Prod_IMFIFSS2017_Location.csv
redShiftKeyFolderPath = r'D:\Sunil_Work\temp8\Prod_IMFIFSS2017'

def importRedshiftKeys(redShiftKeyFolderPath):
    for file in os.listdir(redShiftKeyFolderPath):
        if file.endswith('.csv'):
            redShiftKey = pd.read_csv(os.path.join(redShiftKeyFolderPath, file), dtype=object)  # importing
            # take the last part of the file name after '_' and drop '.csv'
            i = file.rfind('_')
            file = file[i:]
            rDfName = 'redShiftKey_' + file.replace('_', '').replace('.csv', '')
            print('Need to rename dataframe as: ', rDfName)
            # Here I want to rename the dataframe "redShiftKey" with the new name stored in "rDfName"
    return

importRedshiftKeys(redShiftKeyFolderPath)
I expect 2 dataframes, i.e. redShiftKey_Indicator & redShiftKey_Location.
You can use a dictionary, where each key holds a dataframe as a value:
import os
import pandas as pd

# The folder contains 2 csv files, i.e. Prod_IMFIFSS2017_Indicator.csv & Prod_IMFIFSS2017_Location.csv
redShiftKeyFolderPath = r'D:\Sunil_Work\temp8\Prod_IMFIFSS2017'

def importRedshiftKeys(redShiftKeyFolderPath):
    data = {}
    for file in os.listdir(redShiftKeyFolderPath):
        if file.endswith('.csv'):
            csv_file_name = file
            # take the last part of the file name after '_' and drop '.csv'
            i = file.rfind('_')
            file = file[i:]
            rDfName = 'redShiftKey_' + file.replace('_', '').replace('.csv', '')
            data[rDfName] = pd.read_csv(os.path.join(redShiftKeyFolderPath, csv_file_name), dtype=object)  # importing
    return data

data = importRedshiftKeys(redShiftKeyFolderPath)
For creating variable names dynamically, check this discussion: generating variable names on the fly in Python.
However, the dictionary strategy is better because it handles a dynamic number of CSV files, and you can iterate through them easily:

for df_name, df in data.items():
    # do any further processing here
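With the dictionary returned above, each dataframe can also be retrieved by the name built from its file; a small usage sketch, assuming the two CSVs named in the question are in the folder:

# access each dataframe by the name built from its file
indicator_df = data['redShiftKey_Indicator']
location_df = data['redShiftKey_Location']
print(indicator_df.head())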
Alternatively, use globals()[rDfName] = redShiftKey inside the loop; this will work too.