Find if a column value exists in multiple dataframes - python

I have 4 Excel files: 'a1.xlsx', 'a2.xlsx', 'a3.xlsx', 'a4.xlsx'.
The format of the files is the same.
For example, a1.xlsx looks like:
id code name
1 100 abc
2 200 zxc
... ... ...
I have to read these files into pandas dataframes and check whether the same value of the code column exists in multiple Excel files or not.
Something like this: if code=100 exists in 'a1.xlsx' and 'a3.xlsx', and code=200 exists only in 'a1.xlsx', the final dataframe should look like:
code filename
100 a1.xlsx,a3.xlsx
200 a1.xlsx
... ....
and so on
I have all the files in a directory and tried to iterate over them in a loop:
import pandas as pd
import os

x = next(os.walk('path/to/files/'))[2]  # list all files in the directory
os.chdir('path/to/files/')
for i in range(len(x)):
    df = pd.read_excel(x[i])
How do I proceed? Any leads?

Use:
import glob
import os
import pandas as pd

# get all filenames
files = glob.glob('path/to/files/*.xlsx')
# list comprehension, assigning a new column for the filename (without extension)
dfs = [pd.read_excel(fp).assign(filename=os.path.basename(fp).split('.')[0]) for fp in files]
# one big df from the list of dfs
df = pd.concat(dfs, ignore_index=True)
# join all filenames sharing the same code
df1 = df.groupby('code')['filename'].apply(', '.join).reset_index()
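A quick sketch of what that groupby/join produces, using small in-memory frames with hypothetical codes and filenames in place of the real Excel files:

```python
import pandas as pd

# hypothetical data standing in for the concatenated Excel files
df = pd.DataFrame({
    'code': [100, 200, 100],
    'filename': ['a1', 'a1', 'a3'],
})

# collapse duplicate codes, joining the filenames they appear in
df1 = df.groupby('code')['filename'].apply(', '.join).reset_index()
```

This yields one row per code with a comma-separated list of the filenames it occurred in.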

Related

How can I read multiple CSV files and merge them in single dataframe in PySpark

I have 4 CSV files with different columns. Some CSVs share column names as well. The details of the CSVs are:
capstone_customers.csv: [customer_id, customer_type, repeat_customer]
capstone_invoices.csv: [invoice_id,product_id, customer_id, days_until_shipped, product_line, total]
capstone_recent_customers.csv: [customer_id, customer_type]
capstone_recent_invoices.csv: [invoice_id,product_id, customer_id, days_until_shipped, product_line, total]
My code is:
df1 = spark.read.options(inferSchema='True',header='True',delimiter=',').csv("capstone_customers.csv")
df2 = spark.read.options(inferSchema='True',header='True',delimiter=',').csv("capstone_invoices.csv")
df3 = spark.read.options(inferSchema='True',header='True',delimiter=',').csv("capstone_recent_customers.csv")
df4 = spark.read.options(inferSchema='True',header='True',delimiter=',').csv("capstone_recent_invoices.csv")
from functools import reduce

def unite_dfs(df1, df2):
    return df2.union(df1)

list_of_dfs = [df1, df2, df3, df4]
united_df = reduce(unite_dfs, list_of_dfs)
but I got the error:
Union can only be performed on tables with the same number of columns, but the first table has 6 columns and the second table has 3 columns;;\n'Union\n:- Relation[invoice_id#234,product_id#235,customer_id#236,days_until_shipped#237,product_line#238,total#239] csv\n+- Relation[customer_id#218,customer_type#219,repeat_customer#220] csv\n
How can I merge them into a single data frame and deduplicate the shared column names using PySpark?
To read multiple files in Spark you can make a list of all the files you want and read them at once; you don't have to read them in order.
Here is an example of code you can use:
path = ['file1.csv', 'file2.csv']
df = spark.read.options(header=True).csv(path)
df.show()
You can provide a list of files or a path to the files to read, instead of reading them one by one. And don't forget the mergeSchema option:
files = [
"capstone_customers.csv",
"capstone_invoices.csv",
"capstone_recent_customers.csv",
"capstone_recent_invoices.csv"
]
df = spark.read.options(inferSchema='True', header='True', delimiter=',', mergeSchema='True').csv(files)
# or point the reader at the whole directory
df = spark.read.options(inferSchema='True', header='True', delimiter=',', mergeSchema='True').csv('/path/to/files/')

How do I bring the filename into the data frame with read_excel?

I have a directory of excel files that will continue to grow with weekly snapshots of the same data fields. Each file has a date stamp added to the file name (e.g. "_2021_09_30").
I have figured out how to read all of the excel files into a python data frame using the code below:
import os
import pandas as pd

cwd = os.path.abspath('NETWORK DRIVE DIRECTORY')
files = os.listdir(cwd)
df = pd.DataFrame()
for file in files:
    if file.endswith('.xlsx'):
        df = df.append(pd.read_excel(cwd + "/" + file), ignore_index=True)
df.head()
Since these files are snapshots of the same data fields, I want to be able to track how the underlying data changes from week to week. So I would like to add/include a column that has the filename so I can incorporate the date stamp in downstream analysis.
Any thoughts? Thank you in advance.
Welcome to StackOverflow! I agree with the comments that it's not exactly clear what you're looking for, so maybe clearing that up will help us be more helpful.
For example, with the filename "A_FILENAME_2020-01-23", do you want to use the name "A_FILENAME", or "A_FILENAME_2020-01-23"? Or are you not sure, because you're trying to think through how to track this downstream?
If the latter, this is how you would add the new column:
for file in files:
    if file.endswith('.xlsx'):
        tmp = pd.read_excel(cwd + "/" + file)
        tmp['filename'] = file
        df = df.append(tmp, ignore_index=True)
This would allow you to filter the table on the start of the 'filename' column and pull the discrete data of each snapshot of the file side by side. Unfortunately, this is a LOT of data.
If you ONLY want to store differences, you'd be able to use the .drop_duplicates function to try to drop based off a unique value that you use to decide whether there's a new, modified, or deleted row: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop_duplicates.html
But if you don't have a unique identifier for rows, that makes this quite a tough engineering problem. Do you have a unique identifier you can use as your diffing strategy?
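As a sketch of that diffing idea (assuming an 'id' column serves as the unique key), drop_duplicates with keep=False discards every row that appears unchanged in both snapshots:

```python
import pandas as pd

# two hypothetical weekly snapshots sharing an 'id' key
week1 = pd.DataFrame({'id': [1, 2, 3], 'value': ['a', 'b', 'c']})
week2 = pd.DataFrame({'id': [1, 2, 3], 'value': ['a', 'b', 'x']})

both = pd.concat([week1, week2], ignore_index=True)
# keep=False removes all copies of rows that occur more than once,
# leaving only the rows that changed between snapshots
changed = both.drop_duplicates(subset=['id', 'value'], keep=False)
```

Here only the row with id 3 survives (its old and new values), since ids 1 and 2 are identical in both snapshots.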
Extra code to split the filename into separate columns for easier filtering later on (no harm in adding these; the more columns the better, I think):
from datetime import datetime

stem = file.rsplit('.', 1)[0]          # filename without the extension
tmp['filename_stripped'] = stem[:-11]  # drop the trailing "_YYYY_MM_DD" stamp
tmp['filename_date'] = datetime.strptime(stem[-10:], "%Y_%m_%d")
You could add an additional column to the dataframe. Modifying your code:
temp = pd.read_excel(cwd + "/" + file)
temp['date'] = file[-11:]
df = df.append(temp, ignore_index=True)
You can use glob to easily combine xlsx or csv files into one dataframe. Just replace '/xlsx_path' with your files' absolute path. You can also change read_excel to read_csv if you have csv files.
import pandas as pd
import glob
all_files = glob.glob(r'/xlsx_path' + "/*.xlsx")
file_list = [pd.read_excel(f) for f in all_files]
all_df = pd.concat(file_list, axis=0, ignore_index=True)
Alternatively you can use the one-liner below:
all_df = pd.concat(map(pd.read_excel, glob.glob('/xlsx_path/*.xlsx')))
Not sure what you really want, but related to tracking changes: say you have 2 Excel files; you can track changes by doing the following:
df1 = pd.read_excel("file-1.xlsx")
df1
values
0 aa
1 bb
2 cc
3 dd
4 ee
df2 = pd.read_excel("file-2.xlsx")
df2
values
0 aa
1 bb
2 cc
3 ddd
4 e
...and generate a new dataframe containing the rows that changed between your 2 files:
df = pd.concat([df1, df2])
df = df.reset_index(drop=True)
new_df = df.groupby(list(df.columns))
diff = [x[0] for x in new_df.groups.values() if len(x) == 1]
df.reindex(diff)
Output :
values
0 dd
1 ddd
2 e
3 ee
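The same row-level diff can also be obtained with an outer merge and the indicator flag, which additionally records which file each changed row came from; a sketch using the sample values above:

```python
import pandas as pd

df1 = pd.DataFrame({'values': ['aa', 'bb', 'cc', 'dd', 'ee']})
df2 = pd.DataFrame({'values': ['aa', 'bb', 'cc', 'ddd', 'e']})

# _merge is 'left_only' for rows only in df1, 'right_only' for rows only in df2
diff = df1.merge(df2, how='outer', indicator=True)
diff = diff[diff['_merge'] != 'both']
```

Keeping the _merge column tells you whether a changed row came from the first or the second file, which the groupby approach does not.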

Loop pandas directory

I have many csv files in a directory, each with two columns:
miRNA read_counts
miR1 10
miR1 5
miR2 2
miR2 3
miR3 100
I would like to sum read_counts if the miRNA id is the same.
Result:
miRNA read_counts
miR1 15
miR2 5
miR3 100
To do that I wrote a little script. However, I don't know how to loop it over all my csv files so that I don't have to copy-paste file names and output names each time. Any help will be much appreciated. Thanks!
import pandas as pd
df = pd.read_csv("modified_LC1a_miRNA_expressed.csv")
df_new = df.groupby('miRNA')['read_count'].sum()
print(df_new)
df_new.to_csv('sum_LC1a_miRNA_expressed.csv')
Try looking into the glob module.
from glob import glob
import os
import pandas as pd

path = "./your/path"
files = glob(os.path.join(path, "*.csv"))
dataframes = []
for file in files:
    df = pd.read_csv(file)
    # collect each file's dataframe for later concatenation
    dataframes.append(df)
Then, use pd.concat to join the dataframes and perform the groupby operation.
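Putting those two steps together, with small hypothetical frames standing in for the per-file reads:

```python
import pandas as pd

# hypothetical per-file frames, as produced by the loop above
dataframes = [
    pd.DataFrame({'miRNA': ['miR1', 'miR2'], 'read_count': [10, 2]}),
    pd.DataFrame({'miRNA': ['miR1', 'miR3'], 'read_count': [5, 100]}),
]

# one big frame, then sum read counts per miRNA id
combined = pd.concat(dataframes, ignore_index=True)
totals = combined.groupby('miRNA')['read_count'].sum()
```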
EDIT 1:
Based on the request mentioned in the comment:
results = {}
for file in files:
    df = pd.read_csv(file)
    # perform the groupby per file
    df_new = df.groupby('miRNA')['read_count'].sum()
    results[file] = df_new
Not trying to steal the answer. I would have put this in a comment under #Asif Ali's answer if I had enough rep.
Assuming all input .csv files follow the format:
"modified_{rest_of_the_file_name}.csv"
And you want the outputs to be:
"sum_{same_rest_of_the_file_name}.csv"
import os
import glob
import pandas as pd

path = "./your/path"
files = glob.glob(os.path.join(path, "*.csv"))
for file in files:
    df = pd.read_csv(file)
    df_new = df.groupby('miRNA')['read_count'].sum()
    print(df_new)
    # swap the "modified" prefix for "sum" in the output filename
    df_new.to_csv(file.replace('modified', 'sum'))

Iterating over files in a directory and remove rows from them based on other file

I am looking for a way to iterate over 30 files in a directory and remove rows from them based on the IDs in another file. The files contain two columns, an ID and a value, without column names. The other file contains just a column with the IDs ("id") that should be removed ("ids_toberemoved"). After the 30 files are cleaned I want to export them to another folder.
This is what I have so far:
import pandas as pd
import os

ids_toberemoved = pd.read_csv('F:\\ids.csv')
myPath = "F:\\Other"
filesList = []
for path, subdirs, files in os.walk(myPath):
    for name in files:
        filesList.append(os.path.join(name))

dataframes = []
for filename in filesList:
    dataframes.append(pd.read_csv(filename))

for df in dataframes:
    df_cleaned = df.merge(ids_toberemoved, left_index=True, right_on=['id'],
                          how='left', indicator=True)
    df_cleaned[df_cleaned._merge != 'both']
I am missing something in the step where I iterate over the dataframes and join them with 'ids_toberemoved' in order to delete the rows with matching IDs. Also, I can't figure out how to store every single file, after cleaning, to the other folder.
Any help appreciated!
Try the following approach:
from pathlib import Path
import pandas as pd

myPath = Path("F:\\Other")
ids_toberemoved = pd.read_csv('F:\\ids.csv', squeeze=True)
res = pd.concat([pd.read_csv(f, header=None, names=["ID", "val"])
                   .query("ID not in @ids_toberemoved")
                 for f in myPath.glob("*.csv")],
                ignore_index=True)
UPDATE: in order to clean the files and to export them separately as "filename_clean.csv":
_ = [pd.read_csv(f, header=None, names=["ID", "val"])
       .query("ID not in @ids_toberemoved")
       .to_csv(f.with_name(f"{f.stem}_clean{f.suffix}"), index=False)
     for f in myPath.glob("*.csv")]
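Note the @ prefix inside the query string, which refers back to the Python variable holding the IDs; a minimal in-memory sketch of the filtering step, with made-up IDs:

```python
import pandas as pd

ids_toberemoved = pd.Series([2, 4])  # hypothetical IDs to drop
df = pd.DataFrame({'ID': [1, 2, 3, 4], 'val': ['a', 'b', 'c', 'd']})

# '@' lets the query expression reference the local variable
cleaned = df.query("ID not in @ids_toberemoved")
```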

Adding a column to dataframe while reading csv files [pandas]

I'm reading multiple csv files and combining them into a single dataframe like below:
pd.concat([pd.read_csv(f, encoding='latin-1') for f in glob.glob('*.csv')],
ignore_index=False, sort=False)
Problem:
I want to add a column that doesn't exist in any csv (to the dataframe) based on the csv file name for every csv file that is getting concatenated to the dataframe. Any help will be appreciated.
glob.glob returns plain strings, so you can just add a column to every individual dataframe in a loop.
Assuming you have files df1.csv and df2.csv in your directory:
import glob
import pandas as pd

files = glob.glob('df*csv')
dfs = []
for file in files:
    df = pd.read_csv(file)
    df['filename'] = file
    dfs.append(df)
df = pd.concat(dfs, ignore_index=True)
df
a b filename
0 1 2 df1.csv
1 3 4 df1.csv
2 5 6 df2.csv
3 7 8 df2.csv
I have multiple csv files in my local directory. Each filename contains some numbers, some of which identify the years the file is for. I need to add a year column to each file as I concatenate it, pulling the year information from the filename into that column. I'm using a regex to extract the two-digit year and prepending '20', so '20' + '11' = '2011'. Then I set the column's data type to int32.
import re
import glob
import pandas as pd

pd.concat(
    [
        pd.read_csv(f)
          .assign(year='20' + re.search('[a-z]+(?P<year>[0-9]{2})', f).group('year'))
          .astype({'year': 'int32'})
        for f in glob.glob('stateoutflow*[0-9].csv')
    ],
    ignore_index=True
)
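To see what the regex captures, here is the extraction step on a hypothetical filename matching the stateoutflow*[0-9].csv pattern (assuming, as above, that the first two digits after the letters encode the year):

```python
import re

# hypothetical filename following the stateoutflow*[0-9].csv pattern
f = 'stateoutflow1011.csv'

# '[a-z]+' consumes the letters, then the named group grabs the next two digits
year = '20' + re.search('[a-z]+(?P<year>[0-9]{2})', f).group('year')
```

For 'stateoutflow1011.csv' the captured group is '10', so year becomes '2010'.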
