After checking Stack Overflow, I reviewed the path directory, and it works fine as shown in "Loading files".
However, when I try to concatenate, I get the above ValueError, as shown in "Concatenate files".
What can I learn from this?
Also, is it ideal to use the line of code below?
files_names=os.listdir()
Loading files:
import pandas as pd
import os

files_names = os.listdir()

def load_all_csv(files_names):
    # Follow this function template: take a list of file names and return one dataframe
    # YOUR CODE HERE
    return files_names

print(files_names)
all_data = load_all_csv(files_names)
all_data
Concatenate files:
combined_data = []
for filename in all_data:
    if filename.endswith('csv'):
        # print("All csv files: ", filename)
        df = pd.read_csv(filename, index_col='Month')
        combined_data.append(df)
all_data = pd.concat(combined_data, axis=1)
all_data
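For reference, a minimal sketch of how load_all_csv could be completed so the concatenation step receives dataframes rather than bare file names (assuming the CSV files sit in the current working directory and each has a 'Month' column). As for os.listdir(), it returns every entry in the directory, not just CSVs, so filtering by extension (or using glob.glob('*.csv')) is the usual pattern.
import os
import pandas as pd

def load_all_csv(files_names):
    # keep only the CSV files; os.listdir() also returns other entries
    csv_files = [f for f in files_names if f.endswith('.csv')]
    # read each file with 'Month' as the index, then join them column-wise
    frames = [pd.read_csv(f, index_col='Month') for f in csv_files]
    return pd.concat(frames, axis=1)

all_data = load_all_csv(os.listdir())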
Related question:
I have a few CSV files in my Azure File share, which I am accessing as text with the following code:
from azure.storage.file import FileService
storageAccount='...'
accountKey='...'
file_service = FileService(account_name=storageAccount, account_key=accountKey)
share_name = '...'
directory_name = '...'
file_name = 'Name.csv'
file = file_service.get_file_to_text(share_name, directory_name, file_name)
print(file.content)
The contents of the CSV files are displayed, but I need to turn them into a dataframe, which I am not able to do. Can anyone please tell me how to read file.content as a pandas dataframe?
After reproducing this on my end, I was able to read a CSV file into a dataframe from the file contents using the code below.
import pandas as pd
from io import StringIO

generator = file_service.list_directories_and_files('fileshare/')
for file_or_dir in generator:
    print(file_or_dir.name)
    file = file_service.get_file_to_text('fileshare', '', file_or_dir.name)
    df = pd.read_csv(StringIO(file.content), sep=',')
    print(df)
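If the goal is a single dataframe across every CSV in the share, the per-file frames could also be collected and concatenated; a sketch reusing the objects above (the share and directory names are the same placeholders):
from io import StringIO
import pandas as pd

frames = []
for file_or_dir in file_service.list_directories_and_files('fileshare/'):
    if file_or_dir.name.endswith('.csv'):
        file = file_service.get_file_to_text('fileshare', '', file_or_dir.name)
        frames.append(pd.read_csv(StringIO(file.content)))
combined = pd.concat(frames, ignore_index=True)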
I am having a problem converting an .xlsx file to .csv using the pandas library.
Here is the code:
import pandas as pd
# If pandas is not installed: pip install pandas
class Program:
    def __init__(self):
        # file = input("Insert file name (without extension): ")
        file = "Daty"
        self.namexlsx = "D:\\" + file + ".xlsx"
        self.namecsv = "D:\\" + file + ".csv"
        Program.export(self.namexlsx, self.namecsv)

    def export(namexlsx, namecsv):
        try:
            read_file = pd.read_excel(namexlsx, sheet_name='Sheet1', index_col=0)
            read_file.to_csv(namecsv, index=False, sep=',')
            print("Conversion to .csv file has been successful.")
        except FileNotFoundError:
            print("File not found, check file name again.")
            print("Conversion to .csv file has failed.")

Program()
After running the code, the console shows the error: ValueError: File is not a recognized excel file.
The file I have in that directory is "Daty.xlsx". I tried a couple of things, like looking at the documentation and other examples around the internet, but most had similar code.
Edit & Update
What I intend to do afterwards is use the created CSV file for conversion to a .db file, so in the end the import pipeline will go .xlsx -> .csv -> .db. The idea for such a program came as a training exercise, but I can't get past the point described above.
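For the later .csv -> .db step, one common route is pandas together with sqlite3 from the standard library; a minimal sketch (the table name and the .db file path are placeholders, not from the original post):
import sqlite3
import pandas as pd

df = pd.read_csv("D:\\Daty.csv")
with sqlite3.connect("D:\\Daty.db") as conn:
    # write the dataframe into an SQLite table, replacing it if it already exists
    df.to_sql("daty", conn, if_exists="replace", index=False)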
You can use something like this:
import pandas as pd
data_xls = pd.read_excel('excelfile.xlsx', 'Sheet1', index_col=None)
data_xls.to_csv('csvfile.csv', encoding='utf-8', index=False)
I checked the xlsx itself, and apparently for some reason it was corrupted: the columns in the initial file had been merged into one column. After opening the file and correcting the cells, everything runs smoothly.
Thank you for your time, and apologies for the inconvenience.
How can I create a file list of all my files in the same folder?
In this question, I asked how to put the names of all my numpy files from the same folder into one file.
import os

path_For_Numpy_Files = 'C:\\Users\\user\\My_Test_Traces\\1000_Traces_npy'
with open('C:\\Users\\user\\My_Test_Traces\\Traces.list_npy', 'w') as fp:
    fp.write('\n'.join(os.listdir(path_For_Numpy_Files)))
I have 10000 numpy files in my folder, so the result is:
Tracenumber=01_Pltx1
Tracenumber=02_Pltx2
Tracenumber=03_Pltx3
Tracenumber=04_Pltx4
Tracenumber=05_Pltx5
Tracenumber=06_Pltx6
Tracenumber=07_Pltx7
Tracenumber=08_Pltx8
Tracenumber=09_Pltx9
Tracenumber=10_Pltx10
Tracenumber=1000_Pltx1000
Tracenumber=100_Pltx100
Tracenumber=101_Pltx101
The order is very important for analysing my results. How do I keep that order when creating the list? I mean that I need my results like this:
Tracenumber=01_Pltx1
Tracenumber=02_Pltx2
Tracenumber=03_Pltx3
Tracenumber=04_Pltx4
Tracenumber=05_Pltx5
Tracenumber=06_Pltx6
Tracenumber=07_Pltx7
Tracenumber=08_Pltx8
Tracenumber=09_Pltx9
Tracenumber=10_Pltx10
Tracenumber=11_Pltx11
Tracenumber=12_Pltx12
Tracenumber=13_Pltx13
I tried to iterate over it by using:
import os
import re

path_For_Numpy_Files = 'C:\\Users\\user\\My_Test_Traces\\1000_Traces_npy'
with open('C:\\Users\\user\\My_Test_Traces\\Traces.list_npy', 'w') as fp:
    list_files = os.listdir(path_For_Numpy_Files)
    list_files_In_Order = sorted(list_files, key=lambda x: (int(re.sub('D:\tt', '', x)), x))
    fp.write('\n'.join(sorted(os.listdir(list_files_In_Order))))
It gives me this error:
invalid literal for int() with base 10: ' Tracenumber=01_Pltx1'
How can I solve this problem?
I edited the solution; it may work now:
This will sort your files based on modification time.
import os

path_For_Numpy_Files = 'C:\\Users\\user\\My_Test_Traces\\1000_Traces_npy'
path_List_File = 'C:\\Users\\user\\My_Test_Traces\\Traces.list_npy'
with open(path_List_File, 'w') as fp:
    os.chdir(path_For_Numpy_Files)
    list_files = os.listdir(os.getcwd())
    fp.write('\n'.join(sorted(list_files, key=os.path.getmtime)))
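If the intended order is actually numeric by trace number rather than by modification time, a natural-sort key is an alternative; a sketch assuming the file names follow the Tracenumber=N_PltxN pattern shown above:
import os
import re

path_For_Numpy_Files = 'C:\\Users\\user\\My_Test_Traces\\1000_Traces_npy'
path_List_File = 'C:\\Users\\user\\My_Test_Traces\\Traces.list_npy'

def trace_number(name):
    # pull the first run of digits out of the name, e.g. 'Tracenumber=100_Pltx100' -> 100
    match = re.search(r'\d+', name)
    return int(match.group()) if match else 0

list_files = sorted(os.listdir(path_For_Numpy_Files), key=trace_number)
with open(path_List_File, 'w') as fp:
    fp.write('\n'.join(list_files))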
I read all the files in one folder one by one into a pandas.DataFrame and then I check them for some conditions. There are a few thousand files, and I would love to make pandas raise an exception when a file is empty, so that my reader function would skip this file.
I have something like:
class StructureReader(FileList):
    def __init__(self, dirname, filename):
        self.dirname = dirname
        self.filename = str(self.dirname + "/" + filename)

    def read(self):
        self.data = pd.read_csv(self.filename, header=None, sep=",")
        if len(self.data) == 0:
            raise ValueError

class Run(object):
    def __init__(self, dirname):
        self.dirname = dirname
        self.file__list = FileList(dirname)
        self.result = Result()

    def run(self):
        for k in self.file__list.file_list[:]:
            self.b = StructureReader(self.dirname, k)
            try:
                self.b.read()
                self.b.find_interesting_bonds(self.result)
                self.b.find_same_direction_chain(self.result)
            except ValueError:
                pass
A regular file that I'm checking for some condition looks like this:
"A/C/24","A/G/14","WW_cis",,
"B/C/24","A/G/15","WW_cis",,
"C/C/24","A/F/11","WW_cis",,
"d/C/24","A/G/12","WW_cis",,
But somehow I never get a ValueError raised, and my functions keep searching the empty files, which gives me a lot of "Empty DataFrame ..." lines in my results file. How can I skip empty files?
I'd first check whether the file is empty, and only if it isn't would I hand it to pandas. Following this link https://stackoverflow.com/a/15924160/5088142 you can find a nice way to check whether a file is empty:
import os

def is_non_zero_file(fpath):
    return os.path.isfile(fpath) and os.path.getsize(fpath) > 0
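A short usage sketch: filter the file list with that check before handing anything to pandas (file_list here is a hypothetical list of CSV paths, not from the question):
import pandas as pd

# file_list is a placeholder list of paths; empty or missing files are dropped first
non_empty = [f for f in file_list if is_non_zero_file(f)]
frames = [pd.read_csv(f, header=None, sep=",") for f in non_empty]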
You should not use pandas for this; use the Python standard library directly. The answer is here: python how to check file empty or not
You can get this done with the following code: just add the path to your CSVs to the path variable and run it. You should end up with an object raw_data, which is a pandas dataframe.
import os
import glob
import pandas as pd

path = "/home/username/data_folder"
files_list = glob.glob(os.path.join(path, "*.csv"))
for i in range(0, len(files_list)):
    try:
        raw_data = pd.read_csv(files_list[i])
    except pd.errors.EmptyDataError:
        print(files_list[i], " is empty and has been skipped.")
How about this:
import os, glob
import pandas as pd
files = glob.glob('*.csv')
files = list(filter(lambda file: os.stat(file).st_size > 0, files))
# read_csv takes one path at a time, so read each non-empty file and combine them
data = pd.concat((pd.read_csv(f) for f in files), ignore_index=True)
I would like to know how to read several JSON files from a single folder (without specifying the file names, just that they are JSON files).
Also, is it possible to turn them into a pandas DataFrame?
Can you give me a basic example?
One option is listing all files in a directory with os.listdir and then finding only those that end in '.json':
import os, json
import pandas as pd
path_to_json = 'somedir/'
json_files = [pos_json for pos_json in os.listdir(path_to_json) if pos_json.endswith('.json')]
print(json_files) # for me this prints ['foo.json']
Now you can use pandas DataFrame.from_dict to load the json (a Python dictionary at this point) into a pandas dataframe:
montreal_json = pd.DataFrame.from_dict(many_jsons[0])
print(montreal_json['features'][0]['geometry'])
Prints:
{u'type': u'Point', u'coordinates': [-73.6051013, 45.5115944]}
In this case I had appended some jsons to a list many_jsons. The first json in my list is actually a geojson with some geo data on Montreal. I'm familiar with the content already so I print out the 'geometry' which gives me the lon/lat of Montreal.
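For context, many_jsons is not defined in the snippet above; it could have been filled along these lines (a sketch reusing path_to_json and json_files):
import os, json

many_jsons = []
for js in json_files:
    with open(os.path.join(path_to_json, js)) as json_file:
        many_jsons.append(json.load(json_file))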
The following code sums up everything above:
import os, json
import pandas as pd

# this finds our json files
path_to_json = 'json/'
json_files = [pos_json for pos_json in os.listdir(path_to_json) if pos_json.endswith('.json')]

# here I define my pandas Dataframe with the columns I want to get from the json
jsons_data = pd.DataFrame(columns=['country', 'city', 'long/lat'])

# we need both the json and an index number so use enumerate()
for index, js in enumerate(json_files):
    with open(os.path.join(path_to_json, js)) as json_file:
        json_text = json.load(json_file)

        # here you need to know the layout of your json and each json has to have
        # the same structure (obviously not the structure I have here)
        country = json_text['features'][0]['properties']['country']
        city = json_text['features'][0]['properties']['name']
        lonlat = json_text['features'][0]['geometry']['coordinates']
        # here I push a list of data into a pandas DataFrame at row given by 'index'
        jsons_data.loc[index] = [country, city, lonlat]

# now that we have the pertinent json data in our DataFrame let's look at it
print(jsons_data)
for me this prints:
country city long/lat
0 Canada Montreal city [-73.6051013, 45.5115944]
1 Canada Toronto [-79.3849008, 43.6529206]
It may be helpful to know that for this code I had two geojsons in a directory name 'json'. Each json had the following structure:
{"features":
[{"properties":
{"osm_key":"boundary","extent":
[-73.9729016,45.7047897,-73.4734865,45.4100756],
"name":"Montreal city","state":"Quebec","osm_id":1634158,
"osm_type":"R","osm_value":"administrative","country":"Canada"},
"type":"Feature","geometry":
{"type":"Point","coordinates":
[-73.6051013,45.5115944]}}],
"type":"FeatureCollection"}
Iterating a (flat) directory is easy with the glob module
from glob import glob

for f_name in glob('foo/*.json'):
    ...
As for reading JSON directly into pandas, see here.
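A short sketch of that combination, assuming each file holds a JSON document that pandas can parse on its own:
import pandas as pd
from glob import glob

frames = [pd.read_json(f_name) for f_name in glob('foo/*.json')]
df = pd.concat(frames, ignore_index=True)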
Loads all files that end with *.json from a specific directory into a dict:
import os, json

path_to_json = '/lala/'
for file_name in [file for file in os.listdir(path_to_json) if file.endswith('.json')]:
    with open(path_to_json + file_name) as json_file:
        data = json.load(json_file)
        print(data)
Try it yourself:
https://repl.it/#SmaMa/loadjsonfilesfromfolderintodict
To read the JSON files:
import os
import glob
import json

contents = []
json_dir_name = '/path/to/json/dir'
json_pattern = os.path.join(json_dir_name, '*.json')
file_list = glob.glob(json_pattern)
for file in file_list:
    # parse each file and collect the resulting objects
    with open(file) as f:
        contents.append(json.load(f))
If turning into a pandas dataframe, use the pandas API.
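For the dataframe part, the collected objects can be handed to pandas once every file has been parsed; a sketch assuming each entry in contents is a flat JSON object:
import pandas as pd

# one row per parsed object; pd.json_normalize(contents) can flatten nested fields
df = pd.DataFrame(contents)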
More generally, you can use a generator:
import glob
import json

def data_generator(my_path_regex):
    # my_path_regex is a glob pattern; each matched file is read as JSON lines
    for filename in glob.glob(my_path_regex):
        for json_line in open(filename, 'r'):
            yield json.loads(json_line)

my_arr = [_json for _json in data_generator(my_path_regex)]
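If a dataframe is the end goal here as well, the collected records can go straight into the constructor (assuming each line parses to a flat JSON object):
import pandas as pd

df = pd.DataFrame(my_arr)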
I am using glob with pandas. Check out the code below:
import pandas as pd
from glob import glob
df = pd.concat([pd.read_json(f_name, lines=True) for f_name in glob('foo/*.json')])
A simple and very easy-to-understand answer.
import os
import glob
import pandas as pd
path_to_json = r'\path\here'
# import all files from folder which ends with .json
json_files = glob.glob(os.path.join(path_to_json, '*.json'))
# convert all files to dataframe
df = pd.concat((pd.read_json(f) for f in json_files))
print(df.head())
I feel a solution using pathlib is missing :)
from pathlib import Path
file_list = list(Path("/path/to/json/dir").glob("*.json"))
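To go from that file list to a dataframe, the paths can be fed to pandas directly; a sketch assuming each file is a JSON document that read_json understands:
import pandas as pd

df = pd.concat((pd.read_json(p) for p in file_list), ignore_index=True)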
One more option is to read the data as a PySpark DataFrame and then convert it to a pandas DataFrame (if really necessary; depending on the operation, I'd suggest keeping it as a PySpark DataFrame). Spark natively handles a directory of JSON files as the input path, without any extra libraries for reading or iterating over each file:
# pip install pyspark
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
spark_df = spark.read.json('/some_dir_with_json/*.json')
Next, in order to convert into a Pandas Dataframe, you can do:
df = spark_df.toPandas()