For my thesis project I need to import and join 117 .json files into one dataframe. Manually it works, but I can't figure out a loop. Another problem is that each feature needs to have its filename in the dataframe.
Basically I need to automate this process below:
df_aa = pd.read_json(r'path')
df_aa.columns = ['Time', 'Active_adresses']
#df_aa.head()
#df_aa.tail()
df_tf = pd.read_json(r'path')
df_tf.columns = ['Time', 'Total_fees']
#df_tf.head()
#df_tf.tail()
df_tf.merge(df_aa, on='Time', how='left')
Can anyone help me? I don't have much programming experience.
Find a common pattern by which you can split/chunk all of the files, so your loop only has to do the single action of loading. If no file needs specific treatment, use
from glob import glob

print(list(glob("*.json")))  # wrap in list() if you want to print the matches

for item in glob("*.json"):
    do_your_loading()
to get all of the files, then load them one by one (or via an aggregating function, if the library provides one) and then join the dataframes into a single one, if necessary.
For the name in the dataframe, simply add a new column right after loading each file; it will contain the filename (from glob or otherwise) on all of the rows.
For example:
dataframes = {}

for item in glob("*.json"):
    df_aa = pd.read_json(r'path')
    df_aa.columns = ['Time', 'Active_adresses']
    #df_aa.head()
    #df_aa.tail()
    # put the filename into a new column
    df_aa = df_aa.assign(filename=item)

    df_tf = pd.read_json(r'path')
    df_tf.columns = ['Time', 'Total_fees']
    #df_tf.head()
    #df_tf.tail()
    # merge() returns a new dataframe, so assign the result
    df_tf = df_tf.merge(df_aa, on='Time', how='left')
    # put the filename into a new column
    df_tf = df_tf.assign(filename=item)

    dataframes[item] = {"aa": df_aa, "tf": df_tf}

dataframes["my-filename.json"]["aa"].head()
dataframes["my-filename.json"]["tf"].head()
Related
I have over two thousand csv files in a folder, as follows:
University_2010_USA.csv, University_2011_USA.csv, Education_2012_USA.csv, Education_2012_Mexico.csv, Education_2012_Argentina.csv,
and
Results_2010_USA.csv, Results_2011_USA.csv, Results_2012_USA.csv, Results_2012_Mexico.csv, Results_2012_Argentina.csv,
I would like to match the first set of csv files with the second based on the "year" (2012, etc.) and "country" (Mexico, etc.) in the file name. Is there a way to do this quickly? Both sets of csv files have the same column names, and I'm looking at the following code:
df0 = pd.read_csv('University_2010_USA.csv')
df1 = pd.read_csv('Results_2010_USA.csv')
new_df = pd.merge(df0, df1, on=['year','country','region','sociodemographics'])
So basically, I'd need help to write a for-loop that iterates over the datasets... Thanks!
Try this:
from pathlib import Path

university = []
results = []

for file in Path('/path/to/data/folder').glob('*.csv'):
    # Determine the properties from the file's name
    file_type, year, country = file.stem.split('_')
    if file_type not in ['University', 'Results']:
        continue

    # Make the data frame, with 2 extra columns using properties
    # we extracted from the file's name
    tmp = pd.read_csv(file).assign(
        year=int(year),
        country=country,
    )
    if file_type == 'University':
        university.append(tmp)
    else:
        results.append(tmp)

df = pd.merge(
    pd.concat(university),
    pd.concat(results),
    on=['year', 'country', 'region', 'sociodemographics'],
)
I have two csv files, and I want to combine them into one csv file. Assume the two csv files are A.csv and B.csv; I already know there are some conflicts in them. For example, there are two columns, ID and name. In A.csv, ID "12345" has name "Jack"; in B.csv, ID "12345" has name "Tom". So there is a conflict: the same ID has different names. I want to keep ID "12345", choose the name from A.csv, and abandon the name from B.csv. How can I do that?
Here is some code I have tried, but it can only combine the csv files and cannot deal with the conflicts; more precisely, it cannot choose the definitive value from A.csv:
import pandas as pd
import glob
def merge(csv_list, outputfile):
    for input_file in csv_list:
        f = open(input_file, 'r', encoding='utf-8')
        data = pd.read_csv(f, error_bad_lines=False)
        data.to_csv(outputfile, mode='a', index=False)
    print('Combine Completed')

def distinct(file):
    df = pd.read_csv(file, header=None)
    datalist = df.drop_duplicates()
    datalist.to_csv('result_new_month01.csv', index=False, header=False)
    print('Distinct Completed')

if __name__ == '__main__':
    csv_list = glob.glob('*.csv')
    output_csv_path = 'result.csv'
    print(csv_list)
    merge(csv_list, output_csv_path)
    distinct(output_csv_path)
P.S. English is not my native language. Please excuse my syntax error.
Will put down my comments here as an answer:
The problem with your merge function is that you're reading one file at a time and appending it to the same result.csv without doing any check for duplicate names. In order to check for duplicates, the rows need to be in the same dataframe. From your code, you're combining multiple CSV files, not necessarily just A.csv and B.csv. So when you say "I want to choose name from A.csv, and abandon name from B.csv", it looks like you really mean "keep the first one".
Anyway, fix your merge() function to keep reading files into a list of dataframes - with A.csv being first. And then use #gold_cy's answer to concatenate the dataframes keeping only the first occurrence.
Or in your distinct() function, instead of datalist = df.drop_duplicates(), put datalist = df.drop_duplicates("ID", keep="first").reset_index(drop=True) - but this is better done after the read-loop: instead of writing out a CSV full of duplicates, first drop the dupes and then write out to csv.
So here's the change using the first method:
import pandas as pd
import glob
def merge_distinct(csv_list, outputfile):
    all_frames = []  # list of dataframes
    for input_file in csv_list:
        # skip your file-open line and pass those params to pd.read_csv
        data = pd.read_csv(input_file, encoding='utf-8', error_bad_lines=False)
        all_frames.append(data)  # append to list of dataframes
    print('Combine Completed')

    final = pd.concat(all_frames).drop_duplicates("ID", keep="first").reset_index(drop=True)
    final.to_csv(outputfile, index=False, header=False)
    print('Distinct Completed')

if __name__ == '__main__':
    csv_list = sorted(glob.glob('*.csv'))  # sort the list so A.csv comes before B.csv
    output_csv_path = 'result.csv'
    print(csv_list)
    merge_distinct(csv_list, output_csv_path)
Notes:
Instead of doing f = open(...), just pass those arguments to pd.read_csv().
Why are you writing out the final csv with header=False? The header is useful to have.
glob.glob() doesn't guarantee sorting (it depends on the filesystem), so I've used sorted() above.
File-system ordering is different from sorted(), which sorts essentially by ASCII/Unicode code point.
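For completeness, a rough sketch of the second method (keeping the two-pass structure) could look like the following; it assumes the combined file is read back with headers so that "ID" is a real column name, and it has to filter out the extra header rows left behind by the append-mode writes, which is exactly why the single-pass version above is cleaner:

def merge(csv_list, outputfile):
    for input_file in csv_list:
        data = pd.read_csv(input_file, encoding='utf-8')
        # each append re-writes the header line into the combined file
        data.to_csv(outputfile, mode='a', index=False)
    print('Combine Completed')

def distinct(file):
    df = pd.read_csv(file)  # keep the header so "ID" exists as a column name
    df = df[df["ID"] != "ID"]  # drop the repeated header rows from the appends
    # keep="first" keeps A.csv's values only if A.csv was appended first
    datalist = df.drop_duplicates("ID", keep="first").reset_index(drop=True)
    datalist.to_csv('result_new_month01.csv', index=False)
    print('Distinct Completed')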
If you want to keep one DataFrame's value over the other, concatenate them and keep the first duplicate in the output. This means the frame with the preferred values must come first in the sequence you pass to concat, as shown below.
df = pd.DataFrame({"ID": ["12345", "4567", "897"], "name": ["Jack", "Tom", "Frank"]})
df1 = pd.DataFrame({"ID": ["12345", "333", "897"], "name": ["Tom", "Sam", "Rob"]})
pd.concat([df, df1]).drop_duplicates("ID", keep="first").reset_index(drop=True)
ID name
0 12345 Jack
1 4567 Tom
2 897 Frank
3 333 Sam
For a current project, I am planning to crawl over all CSV files within a given folder, filter the content of each file by a certain word, and then save the filtered dataframe as a new file whose name includes the search keyword.
The script below is however yielding the message TypeError: list indices must be integers or slices, not str for the line df2 = df[df['tag'] == "Sales"], hence indicating an issue with the data type.
I have already tried to solve things by adding a generic data type definition such as dtype='unicode', which did not solve things. Is there any smart tweak to make this work?
import pandas as pd
import csv
import glob
# Crawl over all CSV files within folder
df = glob.glob(r'/Users/name/SEC/Merged/*.csv')
# Filter by key word "Sales"
df2 = df[df['tag'] == "Sales"]
# Remove duplicates
df2 = df2.drop_duplicates(subset=None, keep='first', inplace=False)
# Save as new file that includes the name of the "input" file as well as the extension '-sales'.
df2.to_csv(basename+'-sales.csv')
# Sanity check print command
print(df2)
Loop over the paths and read each one into a dataframe:
import pandas as pd
import csv
import glob
# Crawl over all CSV files within folder
for csv_path in glob.glob(r'/Users/name/SEC/Merged/*.csv'):
    df = pd.read_csv(csv_path)
    # Filter by key word "Sales"
    df2 = df[df['tag'] == "Sales"]
    # Remove duplicates
    df2 = df2.drop_duplicates(subset=None, keep='first', inplace=False)
    # Save as new file that includes the name of the "input" file as well as the extension '-sales'.
    df2.to_csv(csv_path + '-sales.csv')
    # Sanity check print command
    print(df2)
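The output path above ends up looking like file.csv-sales.csv. If you would rather insert the keyword before the extension, a small tweak using the standard os.path.splitext (this replaces the df2.to_csv line inside the loop) might be:

import os

base, ext = os.path.splitext(csv_path)  # e.g. '/Users/name/SEC/Merged/file', '.csv'
df2.to_csv(base + '-sales' + ext, index=False)  # writes '/Users/name/SEC/Merged/file-sales.csv'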
Excuse my question, I know this is trivial but for some reason I am not getting it right. Reading dataframes one by one is highly inefficient, especially if you have a lot of dataframes to read. Remember DRY - DO NOT REPEAT YOURSELF.
So here is my approach:
files = ["company.csv", "house.csv", "taxfile.csv", "reliablity.csv", "creditloan.csv", "medicalfunds.csv"]
DataFrameName = ["company_df", "house_df", "taxfile_df", "reliablity_df", "creditloan_df", "medicalfunds_df"]
for file in files:
for df in DataFrameName:
df = pd.read_csv(file)
This only gives me df as one of the frames; I am not sure which, but I guess the last one. How can I read through the csv files and store each one under its dataframe name from DataFrameName?
My goal:
To have 6 dataframes loaded in the workspace, named as in DataFrameName
For example company_df holds the data from "company.csv"
You could set up
DataFrameDic = {"company": [], "house": [], "taxfile": [], "reliablity": [], "creditloan": [], "medicalfunds": []}

for key in DataFrameDic:
    DataFrameDic[key] = pd.read_csv(key + '.csv')
This should give you a dictionary of dataframes.
Something like this:
files = [
    "company.csv",
    "house.csv",
    "taxfile.csv",
    "reliablity.csv",
    "creditloan.csv",
    "medicalfunds.csv",
]

DataFrameName = [
    "company_df",
    "house_df",
    "taxfile_df",
    "reliablity_df",
    "creditloan_df",
    "medicalfunds_df",
]

dfs = {}
for name, file in zip(DataFrameName, files):
    dfs[name] = pd.read_csv(file)
zip lets you iterate two lists at the same time, so you can get both the name and the filename.
You'll end up with a dict of DataFrames.
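You can then access each frame by name, for example:

dfs["company_df"].head()  # the data loaded from company.csv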
Using pathlib we can create a generator expression and then build a dictionary with the file stem as the key and the dataframe as the value.
With pathlib we can use the .glob method to grab all the csv's in a target path.
Replace "\tmp\files" with the path to your files; if you're using Windows, use a raw string or escape the slashes.
from pathlib import Path

trg_files = (f for f in Path(r"\tmp\files").glob("*.csv"))
dataframe_dict = {f"{file.stem}_df": pd.read_csv(file) for file in trg_files}
print(dataframe_dict.keys())
'company_df'
print(dataframe_dict['company_df'])
Dictionaries are the way to go, since you can name their contents dynamically.
names = ["company", "house", "taxfile", "reliablity", "creditloan", "medicalfunds"]

dataframes = {}
for name in names:
    dataframes[f"{name}_df"] = pd.read_csv(f"{name}.csv")
The fact that you have a nice naming convention makes it easy to append the _df or .csv part to the name when needed.
How do I go about manipulating each file of a folder based on values pulled from a dictionary? Basically, say I have x files in a folder. I use pandas to reformat the dataframe, add a column which includes the date of the report, and save the new file under the same name and the date.
import pandas as pd
import pathlib2 as Path
import os

source = Path("Users/Yay/AlotofFiles/April")
items = os.listdir(source)

d_dates = {'0401': '04/1/2019', '0402': '4/2/2019', '0403': '04/03/2019'}

for item in items:
    for key, value in d_dates.items():
        df = pd.read_excel(item, header=None)
        df.set_columns = ['A', 'B', 'C']
        df[df['A'].str.contains("Awesome")]
        df['Date'] = value
        file_basic = "retrofile"
        short_date = key
        xlsx = ".xlsx"
        file_name = file_basic + short_date + xlsx
        df.to_excel(file_name)
I want each file to be unique and categorized by the date. In this case, I would want to have three files, for example "retrofile0401.xlsx" that has a column that contains "04/01/2019" and only has data relevant to the original file.
The actual result is that the loop takes each individual item, creates three different files with those values, moves on to the next file, repeats and overwrites the previous iteration, until I am left with three files that are copies of the last file. The only thing that differs is that each file has a different date and a different name. This is the structure I want, but it duplicates the data from the last file.
If I remove the second loop, it works the way I want, but then there's no way of categorizing the output based on the value from the dictionary.
Try the following. I'm only making input filenames explicit to make clear what's going on. You can continue to use yours from the source.
input_filenames = [
    'retrofile0401_raw.xlsx',
    'retrofile0402_raw.xlsx',
    'retrofile0403_raw.xlsx',
]

date_dict = {
    '0401': '04/1/2019',
    '0402': '4/2/2019',
    '0403': '04/03/2019',
}

for filename in input_filenames:
    date_key = filename[9:13]
    df = pd.read_excel(filename, header=None)
    df.columns = ['A', 'B', 'C']
    df = df[df['A'].str.contains("Awesome")]
    df['Date'] = date_dict[date_key]
    df.to_excel('retrofile{date_key}.xlsx'.format(date_key=date_key))
filename[9:13] takes characters #9-12 from the filename. Those are the ones that correspond to your date codes.
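If the filenames don't all share the same 'retrofile' prefix length, a slightly more robust sketch (a hypothetical helper, assuming the date code is always a four-digit run somewhere in the name) is to pull it out with a regular expression instead of fixed indices:

import re

def extract_date_key(filename):
    # grab the first four-digit sequence, e.g. '0401' from 'retrofile0401_raw.xlsx'
    match = re.search(r'\d{4}', filename)
    return match.group(0) if match else None

date_key = extract_date_key('retrofile0401_raw.xlsx')  # -> '0401'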