I have 30 CSV files of wind speed data on my computer; each file represents data at a different location. I have written code to calculate the statistics I need for each site; however, I am currently pulling in each CSV file individually to do so (see the code below):
from google.colab import files
import io
import pandas as pd

data_to_load = files.upload()
df = pd.read_csv(io.BytesIO(data_to_load['Downtown.csv']))
Is there a way to pull in all 30 CSV files at once so that each file is run through my statistical analysis code block and an array is produced with the file name and the calculated statistic?
Use a loop.
https://intellipaat.com/community/17913/import-multiple-csv-files-into-pandas-and-concatenate-into-one-dataframe
import glob
import pandas as pd
# get data file names
local_path = r'/my_files'
filenames = glob.glob(local_path + "/*.csv")
dfs = [pd.read_csv(filename) for filename in filenames]
# if needed concatenate all data into one DataFrame
big_frame = pd.concat(dfs, ignore_index=True)
You can also try putting the data online (GitHub or Google Drive) and reading it from there:
https://towardsdatascience.com/3-ways-to-load-csv-files-into-colab-7c14fcbdcb92
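Since you're already in Colab, you can also run your statistics directly on the dict that files.upload() returns; the upload dialog lets you select all 30 files at once, and the dict is keyed by filename. A minimal sketch, assuming a wind_speed column and using the mean as a stand-in for your own statistics block:
from google.colab import files
import io
import pandas as pd

# select all 30 CSV files in the upload dialog at once
uploaded = files.upload()

results = []
for name, content in uploaded.items():
    df = pd.read_csv(io.BytesIO(content))
    # stand-in statistic; replace with your own analysis code
    stat = df['wind_speed'].mean()
    results.append((name, stat))

results_df = pd.DataFrame(results, columns=['file', 'statistic'])
print(results_df)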
I am trying to merge a couple of parquet files inside a folder into a dataframe along with their respective metadata. I have the code for converting all the parquet files to a dataframe, but I am not able to find a solution for getting the metadata out of each parquet file. Any help is appreciated. Thank you.
#code
from pathlib import Path
import pandas as pd
data_dir = Path(r'C:\python\Datas\parq\Merged')
full_df1 = pd.concat(
    pd.read_parquet(parquet_file)
    for parquet_file in data_dir.glob('*.parquet'))
full_df1.to_csv('csv_file_lat.csv')
I managed to merge the parquet files, but I am clueless on how to merge the metadata.
Update:
import json
import pyarrow.parquet as pq

data_dir = Path(r'C:\python\Datas\parq\Merged')
for i, parquet_path in enumerate(data_dir.glob('*.parquet')):
    df = pd.read_parquet(parquet_path)
    pt = pq.read_table(parquet_path)
    # each file's metadata is stored under the b'ais' schema key
    meta_own = json.loads(pt.schema.metadata[b'ais'])
    dff = pd.json_normalize(meta_own)
    fin = pd.concat([df, dff], axis=1)
    # overwrite and write the header only for the first file, then append
    write_mode = 'w' if i == 0 else 'a'
    write_header = i == 0
    fin.to_csv('testda.csv', mode=write_mode, header=write_header)
I tried to implement a for loop using glob to iterate through the folder of parquet files, extracting the metadata for each file at the same time. But now I am stuck on how to combine the parquet file data and the metadata, since both are dataframes generated inside the for loop as it iterates.
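One way to combine them, sticking with the b'ais' metadata key from the update above, is to collect each per-file frame in a list and concatenate once after the loop instead of appending to the CSV inside it. A rough sketch; whether the metadata should be repeated on every data row or attached only once per file is up to you (here it is broadcast to every row, and a source_file column is added so the files stay distinguishable):
import json
from pathlib import Path

import pandas as pd
import pyarrow.parquet as pq

data_dir = Path(r'C:\python\Datas\parq\Merged')
combined = []
for parquet_path in data_dir.glob('*.parquet'):
    df = pd.read_parquet(parquet_path)
    pt = pq.read_table(parquet_path)
    # assumes every file stores its metadata under the b'ais' schema key
    meta = pd.json_normalize(json.loads(pt.schema.metadata[b'ais']))
    # broadcast the single metadata row to every data row of this file
    meta = pd.concat([meta] * len(df), ignore_index=True)
    part = pd.concat([df.reset_index(drop=True), meta], axis=1)
    part['source_file'] = parquet_path.name
    combined.append(part)

full_df = pd.concat(combined, ignore_index=True)
full_df.to_csv('merged_with_meta.csv', index=False)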
I'm trying to fetch the name of the file I upload. I wrote a program which does a statistical test based on the data in the file; the program is currently set up in two steps:
1 - Upload the file using the following method:
from google.colab import files
import io
uploaded = files.upload()
This triggers a small "uploader" widget. I then upload the CSV file, and my next block of code only needs to read the file name; here's the code:
2 - Read the data by specifying the uploaded file name (let's say "filename", for example):
data = pd.read_csv(io.BytesIO(uploaded["filename.csv"]))
Every time I run this code, I need to manually update the name of the file. I'm trying to automate fetching the filename so the file can be read automatically.
Thanks
To upload the file:
from google.colab import files
import numpy as np
import pandas as pd
import io
uploaded = files.upload()
To read the file (currently the name of the file needs to be updated manually each time):
data = pd.read_csv(io.BytesIO(uploaded["filename.csv"]))
The following contains the name of your CSV:
list(uploaded.keys())[0]
so your line should look like
data = pd.read_csv(io.BytesIO(uploaded[list(uploaded.keys())[0]]))
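Equivalently, since files.upload() returns a dict keyed by filename, you can grab the key once and reuse it, for example to label your output. A small sketch:
from google.colab import files
import io
import pandas as pd

uploaded = files.upload()

# assumes exactly one file was uploaded
filename = next(iter(uploaded))
data = pd.read_csv(io.BytesIO(uploaded[filename]))
print(f"loaded {filename} with {len(data)} rows")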
I'm trying to read some Excel files as pandas dataframes. The problem is that they are quite large (about 2,500 rows, columns up to the 'CYK' label in the Excel sheet, and there are 14 of them).
Every time I run my program, it has to import the files from Excel again. This makes the runtime grow a lot; it's currently a bit more than 15 minutes, and so far the program doesn't do anything significant except import the files.
I would like to be able to import the files just once, then save the dataframe objects somewhere and make my program work only on those dataframes.
Any suggestions?
This is the code I developed until now:
import pandas as pd
import os

path = r'C:/Users/damia/Dropbox/Tesi/WIOD'
dirs = os.listdir(path)

complete_dirs = []
for f in dirs:
    complete_dirs.append(path + r"/" + f)

data = []
for el in complete_dirs:
    wiod = pd.read_excel(el, engine='pyxlsb')
    data.append(wiod)
If anyone is interested, you can find the files I'm trying to read at this link:
http://www.wiod.org/database/wiots16
You could use the to_pickle and read_pickle methods provided by pandas to serialize the dataframes and store them in files (see the pandas docs).
Example pickling:
data = []
pickle_paths = []

for el in complete_dirs:
    wiod = pd.read_excel(el, engine='pyxlsb')

    # here's where you store it
    pickle_loc = 'your_unique_path_to_save_this_frame'
    wiod.to_pickle(pickle_loc)
    pickle_paths.append(pickle_loc)

    data.append(wiod)
Depickling
data = []
for el in pickle_paths:
    data.append(pd.read_pickle(el))
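To pay the Excel parsing cost only once, you could also check for the pickle before reading each workbook, so later runs load straight from disk. A rough sketch, assuming complete_dirs lists only the workbook files (as built in the question) and that saving one .pkl next to each source file is acceptable:
import os
import pandas as pd

data = []
for el in complete_dirs:
    pickle_loc = os.path.splitext(el)[0] + '.pkl'
    if os.path.exists(pickle_loc):
        # fast path: reuse the frame saved by a previous run
        wiod = pd.read_pickle(pickle_loc)
    else:
        # slow path: parse the workbook once, then cache it
        wiod = pd.read_excel(el, engine='pyxlsb')
        wiod.to_pickle(pickle_loc)
    data.append(wiod)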
Another solution using to_pickle and read_pickle.
As an aside, you can read Excel files directly from URLs if you don't want to save to your drive first.
# read each file from the URL and save to disk
for year in range(2000, 2015):
    pd.read_excel(f"http://www.wiod.org/protected3/data16/wiot_ROW/WIOT{year}_Nov16_ROW.xlsb").to_pickle(f"{year}.pkl")

# read pickle files from disk to a dataframe
data = list()
for year in range(2000, 2015):
    data.append(pd.read_pickle(f"{year}.pkl"))
data = pd.concat(data)
I'm using Google Colab and I'm trying to import multiple CSV files from Google Drive into the program.
I know how to import the datasets one by one, but I'm not sure how to create a loop that reads in all of the CSV files, so that I can just have one line of code that imports all of the datasets for me.
You can create a dictionary with all dataframes like this:
from glob import glob
import pandas as pd
filepaths = glob('/content/drive/My Drive/location_of_the_files/*.csv')
dfs = {f'df{n}': pd.read_csv(i) for n, i in enumerate(filepaths)}
Individual dataframes can then be accessed with dfs['df0'], dfs['df1'], etc.
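If numbered keys like df0 are hard to keep track of, a variant of the same idea keys the dictionary by each file's base name instead (a sketch):
from glob import glob
from pathlib import Path
import pandas as pd

filepaths = glob('/content/drive/My Drive/location_of_the_files/*.csv')
# key each frame by the CSV's base name (filename without the .csv extension)
dfs = {Path(p).stem: pd.read_csv(p) for p in filepaths}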
I have a two-column CSV with download links in the first column and company symbols in the second column. For example:
http://data.com/data001.csv, BHP
http://data.com/data001.csv, TSA
I am trying to loop through the list so that Python opens each CSV via its download link and saves it separately under the company name. Each file should therefore be downloaded and saved as follows:
BHP.csv
TSA.csv
Below is the code I am using. It currently exports the entire CSV into a single-row tabbed format, then loops back and does it again and again in an infinite loop.
import pandas as pd

data = pd.read_csv('download_links.csv', names=['download', 'symbol'])

file = pd.DataFrame()
cache = []

for d in data.download:
    df = pd.read_csv(d, index_col=None, header=0)
    cache.append(df)
    file = pd.DataFrame(cache)

    for s in data.symbol:
        file.to_csv(s + '.csv')

print("done")
Up until I convert the list 'cache' into the DataFrame 'file' to export it, the data is formatted perfectly. It's only when it gets converted to a DataFrame that the trouble starts.
I'd love some help on this one as I've been stuck on it for a few hours.
import pandas as pd

# the links file has no header row, so name the columns explicitly
data = pd.read_csv('download_links.csv', names=['download', 'symbol'])

links = data.download
file_names = data.symbol

for link, file_name in zip(links, file_names):
    pd.read_csv(link).to_csv(file_name + '.csv', index=False)
Iterate over both fields in parallel:
for download, symbol in data.itertuples(index=False):
    df = pd.read_csv(download, index_col=None, header=0)
    df.to_csv('{}.csv'.format(symbol))