I have a script that outputs a whole bunch of CSVs to the folder c:\Scripts\CSV. This particular script loops through all of the dataframes and counts the usage of the top 100 words in the data set. The top 100 words and their counts are added to a list, the dataframes are concatenated, and then the CSV should be exported. The print shows the correct information, but the script doesn't output any file.
#! python3
import pandas as pd
import os

path = r'Scripts\\CSV\\'
directory = os.path.join("c:\\", path)
appended_data = []
for root, dirs, files in os.walk(directory):
    for file in files:
        if file.endswith(".csv"):
            thread = pd.read_csv(directory + file)
            thread.columns = ['num', 'id', 'body', 'title', 'url']
            s = pd.Series(''.join(thread['body']).lower().split()).value_counts()[:100]
            appended_data.append(s)

thatdata = pd.concat(appended_data)
#print(appended_data)
thatdata.to_csv = (directory + 'somename.csv')
The immediate reason nothing is written is the last line: thatdata.to_csv = (directory + 'somename.csv') assigns a string to the to_csv attribute instead of calling the method, so no file is ever created. With that fixed, try using pathlib for the path handling. Note that you need the concrete Path class here, since PureWindowsPath is a pure path with no I/O operations such as glob:

from pathlib import Path

directory = Path('c:/Scripts/CSV/')
for csv_f in directory.glob('**/*.csv'):
    # process inputs
    ...

target_path = directory / 'somename.csv'
thatdata.to_csv(target_path)
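Putting the two together, a minimal sketch of the full script using pathlib, assuming the same five-column CSV layout and top-100 word count from the question:

from pathlib import Path
import pandas as pd

directory = Path('c:/Scripts/CSV/')
appended_data = []
for csv_f in directory.glob('**/*.csv'):
    thread = pd.read_csv(csv_f)
    thread.columns = ['num', 'id', 'body', 'title', 'url']
    # join with spaces so words from adjacent rows don't run together
    # (the question used '' as the separator)
    s = pd.Series(' '.join(thread['body']).lower().split()).value_counts()[:100]
    appended_data.append(s)

thatdata = pd.concat(appended_data)
thatdata.to_csv(directory / 'somename.csv')  # call to_csv, don't assign to it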
I have an Excel file that needs to be refreshed automatically every week. It must be extended with data from other Excel files. The problem is that these files have different names each time.
So, in my opinion, I can't use code like:
import pandas as pd
NG = 'NG.xlsx'
df = pd.read_excel(NG)
because the filename is not always "NG" as in this case.
Do you have any ideas?
Best regards
You could read all the files in your folder like this, because it allows you to ignore name changes:

import glob
import pandas as pd

# get data file names
path = r"C:\.......\folder_with_excel"
filenames = glob.glob(path + "/*.xlsx")

dfs = []
for filename in filenames:
    xl_file = pd.ExcelFile(filename)
    dfs.append(xl_file.parse('Sheet1'))

# combine the per-file DataFrames into one
df = pd.concat(dfs, ignore_index=True)
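If the sheet name is not always 'Sheet1', a hedged tweak reads the first sheet by position instead of by name:

# sheet_name=0 selects the first worksheet regardless of what it is called
dfs = [pd.read_excel(f, sheet_name=0) for f in filenames]
df = pd.concat(dfs, ignore_index=True)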
Alternatively:
import os
import pandas as pd

path = os.getcwd()
files = os.listdir(path)  # list all the files in your directory
files_xlsx = [f for f in files if f.endswith('.xlsx')]  # keep only the xlsx files

# read each file, then concatenate the results into one DataFrame
dfs = [pd.read_excel(f, sheet_name='<sheet name>') for f in files_xlsx]  # drop sheet_name if you don't need it
df = pd.concat(dfs, ignore_index=True)
I have a directory ../customer_data/* with 15 folders. Each folder is a unique customer.
Example: ../customer_data/customer_1
Within each customer folder there is a csv called surveys.csv.
GOAL: I want to iterate through all the folders in ../customer_data/*, find the surveys.csv for each unique customer, and create a concatenated dataframe. I also want to add a customer id column to the dataframe, where the customer id is the name of the folder.
import glob
import os
import pandas as pd

rootdir = '../customer_data/*'
# a list to hold all the individual pandas DataFrames
dataframes = []
for subdir, dirs, files in os.walk(rootdir):
    for file in files:
        csvfiles = glob.glob(os.path.join(rootdir, 'surveys.csv'))
        # loop through the files and read them in with pandas
        df = pd.read_csv(csvfiles)
        df['customer_id'] = os.path.dirname
        dataframes.append(df)

# concatenate them all together
result = pd.concat(dataframes, ignore_index=True)
result.head()

This code is not giving me all 15 files. Please help.
You can use the pathlib module for this.
from pathlib import Path
import pandas as pd

dfs = []
for filepath in Path("customer_data").glob("customer_*/surveys.csv"):
    this_df = pd.read_csv(filepath)
    # Set the customer ID as the name of the parent directory.
    this_df.loc[:, "customer_id"] = filepath.parent.name
    dfs.append(this_df)

df = pd.concat(dfs)
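As a quick sanity check against the "not giving me all 15 files" symptom, you can count the distinct customer IDs after concatenating (15 expected here, one per customer folder):

# each customer folder should contribute exactly one surveys.csv
print(df["customer_id"].nunique())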
Let's try pathlib with rglob, which will recursively search your directory structure for all files that match a glob pattern, in this instance surveys.csv:

import pandas as pd
from pathlib import Path

root_dir = Path('/top_level_dir/')
files = {file.parent.parts[-1]: file for file in root_dir.rglob('surveys.csv')}
df = pd.concat([pd.read_csv(file).assign(customer=name) for name, file in files.items()])
Note you'll need Python 3.4+ for pathlib.
I have a folder with several Excel files in the formats xls and xlsx, and I am trying to read them and concatenate them into one single DataFrame. The problem I am facing is that Python does not read the files in the folder in the correct order.
My folder contains the following files:
190.xls , 195.xls , 198.xls , 202.xlsx , 220.xlsx and so on
This is my code:
import pandas as pd
from pathlib import Path
my_path = 'my_Dataset/'
xls_files = pd.concat([pd.read_excel(f2) for f2 in Path(my_path).rglob('*.xls')], sort = False)
xlsx_files = pd.concat([pd.read_excel(f1) for f1 in Path(my_path).rglob('*.xlsx')],sort = False)
all_files = pd.concat([xls_files,xlsx_files],sort = False).reset_index(drop=True)
I get what I want, but the files are not concatenated in the order they appear in the folder: in the all_files DataFrame I first have data from 202.xlsx and then from 190.xls.
How can I solve this problem?
Thank you in advance!
Try using:

import pandas as pd
from pathlib import Path

my_path = 'my_Dataset/'
# gather both extensions, then sort numerically on the filename stem (190, 195, 198, 202, 220, ...)
all_paths = sorted(
    list(Path(my_path).rglob('*.xls')) + list(Path(my_path).rglob('*.xlsx')),
    key=lambda x: int(x.stem),
)
all_files = pd.concat([pd.read_excel(f) for f in all_paths], sort=False).reset_index(drop=True)

Sorting on int(x.stem) puts the files in numeric order instead of whichever order the two rglob passes happen to return them in.
Or update this

all_files = pd.concat([xls_files,xlsx_files],sort = False).reset_index(drop=True)

to this

all_files = pd.concat([xlsx_files,xls_files],sort = False).reset_index(drop=True)

(this only swaps which extension's files come first; it does not interleave the two groups numerically).
I want to perform an analysis on 24 Excel documents: 12 containing one side of the story and the other 12 the other side. I managed to load them into Python, but when I try to get them into two separate dataframes, Python combines them back into one.
This is for a windows server using Python3.7
import pandas as pd
import os

path = os.getcwd()
files = os.listdir(path)
files

files_car = [f for f in files if f.startswith("CAR")]
files_car
for f in files_car:
    data1 = pd.read_excel(f)
    car = car.append(data1)

path = os.getcwd()
files2 = os.listdir(path)
files2

files_ean = [f for f in files2 if f.startswith("ELEK")]
files_ean
ean = pd.DataFrame()
for x in files_ean:
    data2 = pd.read_excel(f)
    ean = ean.append(data2)
I expected that files_car would contain the 12 files that start with "CAR"
and that files_ean would contain the 12 files that start with "ELEK".
This should do what you want, based on your comment:
import pandas as pd
import os
path = os.getcwd()
files = os.listdir(path)
car_dfs = [pd.read_excel(f) for f in files if f.startswith("CAR")]
ean_dfs = [pd.read_excel(f) for f in files if f.startswith("ELEK")]
car_df = pd.concat(car_dfs)
ean_df = pd.concat(ean_dfs)
Couple of points:
you don't need to recreate files if you're going to run the same command (i.e. os.listdir(path)), unless path were to change
you should not append to a DataFrame in a loop like you're doing, since this creates a copy every time you call append. It's better to create a list of DataFrames and then concatenate that list into one big DataFrame.
You could shorten this even more by just doing:
import pandas as pd
import os
path = os.getcwd()
files = os.listdir(path)
car_df = pd.concat([pd.read_excel(f) for f in files if f.startswith("CAR")])
ean_df = pd.concat([pd.read_excel(f) for f in files if f.startswith("ELEK")])
unless you have some need for the individual file DataFrames
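One caveat: os.listdir returns bare file names, so pd.read_excel(f) only works here because path is the current working directory. If the files lived somewhere else, a hedged variant would join the directory onto each name first:

# hypothetical: reading from a directory other than the current one
car_df = pd.concat([pd.read_excel(os.path.join(path, f)) for f in files if f.startswith("CAR")])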
I am trying to open all files from a folder, append each csv file with another csv file called Append.csv, and write the resulting files with their names to a different folder.
For example, I have 5 csv files saved in a folder called CSV FILES FOLDER. These files are F1.csv, F2.csv, F3.csv, F4.csv and F5.csv. What I am trying to do is open each file using pandas in a for loop, append Append.csv to it, and store the result in a different folder called NEW CSV FILES FOLDER as:
F1_APPENDED.csv
F2_APPENDED.csv
F3_APPENDED.csv
F4_APPENDED.csv
In other words, the _APPENDED is added with each file and then the file with the new name having _APPENDED is saved.
I have already defined the path for this folder but can't save to it. The code is below:
import pandas as pd
import glob
import os.path
import pathlib

path = r'C:\Users\Ahmed Ismail Khalid\Desktop\CSV FILES FOLDER'
allFiles = glob.glob(path + "/*.csv")
path1 = r'C:\Users\Ahmed Ismail Khalid\Desktop\Different Folder\Bitcoin Prices Hourly Based.csv'
outpath = r'C:\Users\Ahmed Ismail Khalid\Desktop\NEW CSV FILES FOLDER'

for f in allFiles:
    file = open(f, 'r')
    df1 = pd.read_csv(path1)
    df2 = pd.read_csv(f)
    output = pd.merge(df1, df2, how="inner", on="created_at")
    df3 = output.created_at.value_counts().rename_axis('created_at').reset_index(name='count')
    df3 = df3.sort_values(by=['created_at'])
    #print(df3,'\n\n')
    df3.to_csv(outpath+f, encoding='utf-8',index=False)
    #print(f,'\n\n')
How can I do this? I tried to look up the official documentation but couldn't understand anything
Any and all help would be appreciated
Thanks
Here, I added lines in the for loop that get just the file name, without its directory or extension. You can use that instead of the full path when you write the file, to build the output .csv filename.
import pandas as pd
import glob
import os.path

path = r'C:\Users\Ahmed Ismail Khalid\Desktop\CSV FILES FOLDER'
allFiles = glob.glob(path + "/*.csv")
path1 = r'C:/Users/Ahmed Ismail Khalid/Desktop/Different Folder/Bitcoin Prices Hourly Based.csv'
# You need to have a slash at the end so it knows it's a folder
outpath = r'C:/Users/Ahmed Ismail Khalid/Desktop/NEW CSV FILES FOLDER/'

for f in allFiles:
    # get just the file name, then strip its extension
    _, fname = os.path.split(f)
    fname, ext = os.path.splitext(fname)
    df1 = pd.read_csv(path1)
    df2 = pd.read_csv(f)
    output = pd.merge(df1, df2, how="inner", on="created_at")
    df3 = output.created_at.value_counts().rename_axis('created_at').reset_index(name='count')
    df3 = df3.sort_values(by=['created_at'])
    #print(df3,'\n\n')
    df3.to_csv(outpath+fname+'_appended.csv', encoding='utf-8',index=False)
    #print(f,'\n\n')
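For reference, pathlib gives you the same name manipulation more compactly via Path.stem; a minimal sketch, using a hypothetical input path from the folder above:

from pathlib import Path

outpath = Path(r'C:/Users/Ahmed Ismail Khalid/Desktop/NEW CSV FILES FOLDER')
f = r'C:/Users/Ahmed Ismail Khalid/Desktop/CSV FILES FOLDER/F1.csv'  # hypothetical example input
fname = Path(f).stem                          # 'F1': no directory, no extension
target = outpath / (fname + '_appended.csv')  # usable as df3.to_csv(target, index=False)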