How to Merge Two CSV Files in Python Based on a Common Column

I am attempting to merge two CSV files together.
I have 28 countries to test, with the data in File-1.csv.
Each country has roughly 10 test scenarios, with the data in File-2.csv.
I need the merged file so that I can use Python to create the N*M (280) unit test cases.
File-1.csv
Case,Country,URL,Message1,Message2
01,UK,www.acuvue.co.uk,registersuccess,Fail
02,LU,www.acuvue.lu,denaissance,Vousdevez
03,DE,www.acuvue.de,,
File-2.csv
Country,Scenario,Mail,Name,Password
UK,InvalidMail,TEST,Susan_UK,Password1#
UK,InvalidPass,susan#test.com,Susan_UK,TEST
LU,InvalidMail,TEST,Susan_LU,Password1#
DE,InvalidMail,TEST,Susan_DE,Password1#
I want Python to merge those two CSV files as below:
Case,Country,URL,Message1,Message2,Scenario,Mail,Name,Password
010,UK,www.acuvue.co.uk,registersuccess,Fail,InvalidMail,TEST,Susan_UK,Password1#
011,UK,www.acuvue.co.uk,registersuccess,Fail,InvalidPass,susan#test.com,Susan_UK,TEST
020,LU,www.acuvue.lu,denaissance,Vousdevez,InvalidMail,TEST,Susan_LU,Password1#
030,DE,www.acuvue.de,,,InvalidMail,TEST,Susan_DE,Password1#
How could I do this in Python?

Try reading both CSV files into two separate lists, use zip_longest to merge the two lists, and store the merged result in a single list.
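Given the expected output, a join on the Country column does the pairing directly. Below is a minimal sketch using pandas.merge (a join, rather than zip_longest), assuming the file names from the question; the Case renumbering (010, 011, ...) is left out:

import pandas as pd

# Read both input files (names taken from the question);
# dtype=str keeps leading zeros in the Case column
df1 = pd.read_csv("File-1.csv", dtype=str)
df2 = pd.read_csv("File-2.csv", dtype=str)

# An inner join on "Country" pairs every File-1 row with every
# matching File-2 row, producing the N*M combinations per country
merged = pd.merge(df1, df2, on="Country")

merged.to_csv("merged.csv", index=False)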

Related

Dataframe instance management in Python

I recently worked on a project parsing CSV files containing cable modem MAC address (CMMAC) data, which made it useful to work with dataframes from the Pandas module. One of the problems I encountered related to the overall approach to and structure of the dataframes themselves. Specifically, I was concerned with having to keep incrementing the number of dataframe instances to perform specific actions on the data. I did not feel that having to invoke "df1", "df2", "df3", etc. was an efficient way to write Python.
Below is a segment of the code where I had to instantiate a new dataframe for each action. The sample files (file1.csv and file2.csv) are identical and are posted below as well.
file1.csv and file2.csv
cmmac,match
AABBCCDDEEFF,true
001122334455,false
001122334455,false
Python script:
import os
import glob
from functools import partial
import pandas as pd
#read and concatenate all CSV files in working directory
df1 = pd.concat(map(partial(pd.read_csv, header=0), glob.glob(os.path.join('', "*.csv"))))
#sort by column labeled "cmmac"
df2 = df1.sort_values(by='cmmac')
#delete any duplicate records
df3 = df2.drop_duplicates()
#convert MAC address format to colon notation (e.g. 001122334455 to 00:11:22:33:44:55)
df3['cmmac'] = df3['cmmac'].apply(lambda x: ':'.join(x[i:i+2] for i in range(0, len(x), 2)))
There were additional actions performed on the data in the CSV files, and by the end I had thirteen dataframes (up to df13). With more complex projects I would have been in a death spiral of dataframes using this method.
The question I have is: how should dataframes be managed in order to avoid using this many instances? If it were necessary to drop a column or rearrange the columns, would each of those actions require a new dataframe? In "df1" I was able to combine two distinct actions, reading in all the CSV files and concatenating them, but adding further actions to that line would eventually make it difficult to read. Which approach have you adopted when working with dataframes across many smaller tasks? Thanks.
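One common way to avoid df1 through df13 is method chaining: each pandas method returns a new dataframe, so the steps can be strung together under a single name. A minimal sketch of the same pipeline, reusing the code from the question:

import glob
import os
from functools import partial
import pandas as pd

# Read, concatenate, sort, dedupe, and reformat in one chained expression
df = (
    pd.concat(map(partial(pd.read_csv, header=0),
                  glob.glob(os.path.join('', "*.csv"))))
    .sort_values(by='cmmac')
    .drop_duplicates()
    # assign() replaces the column without a separate intermediate frame
    .assign(cmmac=lambda d: d['cmmac'].apply(
        lambda x: ':'.join(x[i:i + 2] for i in range(0, len(x), 2))))
)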

compare 2 csv file and find out missing, inserted data and modified data using pandas

Using Pandas, I want to compare 2 CSV files. Both files have the same data, but in the 2nd file some rows have been deleted, some inserted, and some modified. I want to compare the two files and find the deleted, inserted, and modified rows.
A simple way to do this is to use:
df1[~df1.apply(tuple, axis=1).isin(df2.apply(tuple, axis=1))]
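A minimal self-contained example of that idea, with hypothetical frames standing in for the two CSV files:

import pandas as pd

# Hypothetical stand-ins for the two CSV files
df1 = pd.DataFrame({"id": [1, 2, 3], "val": ["a", "b", "c"]})
df2 = pd.DataFrame({"id": [1, 3], "val": ["a", "x"]})

# Turn each row into a tuple so whole rows can be compared
rows1 = df1.apply(tuple, axis=1)
rows2 = df2.apply(tuple, axis=1)

# Rows of df1 absent from df2: deleted rows plus old versions of modified rows
print(df1[~rows1.isin(rows2)])
# Rows of df2 absent from df1: inserted rows plus new versions of modified rows
print(df2[~rows2.isin(rows1)])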

Is there a way to import several .txt files each becoming a separate dataframe using pandas?

I have to work with 50+ .txt files, each containing 2 columns and 631 rows, where I have to do different operations on each (sometimes with each other) before doing data analysis. I was hoping there was a way to import each text file as a separate dataframe in pandas instead of doing it individually. The code I've been using for a single file has been:
df = pd.read_table(file_name, skiprows=1, index_col=0)
print(df)
I use index_col=0 because the first column holds the x-values. I use skiprows=1 because I have to drop the title, which is the first row (and the file name in the folder) of each .txt file. I was thinking maybe I could use the glob package to import everything as a single dataframe from the folder and then split it into different dataframes while keeping the first column as the name of each variable. Is there a feasible way to import all of these files at once as different dataframes from a folder, storing them under the first column name? All .txt files would be dataframes of 2 columns x 631 rows, not including the first title row. All values in the columns are integers.
Thank you
Yes. If you store your file names in a list named filelist (maybe built with glob), you can use the following to read all the files and store them in a dict:
dfdict = {f: pd.read_table(f, ...) for f in filelist}
Then you can access each dataframe with dfdict["filename.txt"].
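Filled in with the read_table arguments from the question (the folder path is an assumption), that looks like:

import glob
import pandas as pd

# Collect every .txt file in the current folder (path assumed)
filelist = glob.glob("*.txt")

# One dataframe per file, keyed by its file name
dfdict = {f: pd.read_table(f, skiprows=1, index_col=0) for f in filelist}

# Look up any single frame by name, e.g. the first one found
print(dfdict[filelist[0]])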

Pulling data from two excel files with openpyxl in python

I am trying to pull data from two Excel files using openpyxl. One file has two columns, employee names and hours worked; the other has two columns, employee names and hourly wage. Ultimately, I'd like the files compared by name, wage multiplied by hours worked, and the result dumped into a third sheet by name and wages payable. But at this point, I'm struggling to get the items from the two columns in the first sheet into Python to be able to manipulate them.
I thought I'd create two lists from the columns, then combine them into a dictionary, but I don't think that will get me where I need to be.
Any suggestions on how to get this data into python to manipulate it would be fantastic!
import openpyxl

wb = openpyxl.load_workbook("Test_book.xlsx")
sheet = wb["Hours"]  # get_sheet_by_name() is deprecated in current openpyxl

employee_names = []
employee_hours = []
for row in sheet['A']:
    employee_names.append(row.value)
for row in sheet['B']:
    employee_hours.append(row.value)

my_dict = dict(zip(employee_names, employee_hours))
print(my_dict)
A dict comprehension may do it, using zip to iterate over the two columns:
my_dict = {name.value: hours.value for name, hours in zip(sheet['A'], sheet['B'])}
What zip does is iterate through the two columns in parallel, pairing cells by position; .value is needed on each cell to get the stored data rather than the Cell object.
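Extending that to the original wage-times-hours goal, a rough sketch; the second sheet name "Wages" is an assumption, as only "Hours" appears in the question:

import openpyxl

wb = openpyxl.load_workbook("Test_book.xlsx")

def column_pairs(sheet):
    # Map column A values to column B values, skipping empty cells
    return {
        a.value: b.value
        for a, b in zip(sheet["A"], sheet["B"])
        if a.value is not None
    }

hours = column_pairs(wb["Hours"])
wages = column_pairs(wb["Wages"])  # hypothetical sheet name

# Wages payable per employee, matched by name
payable = {name: hours[name] * wages[name] for name in hours if name in wages}
print(payable)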

Pandas: Reading CSV files with different delimiters - merge error

I have 4 separate CSV files that I wish to read into Pandas. I want to merge these CSV files into one dataframe.
The problem is that the CSV files use different delimiters: , ; | and spaces. Therefore I have to use different delimiters when reading the different CSV files and do some transformations to get them into the correct format.
Each CSV file contains an 'ID' column. When I merge my dataframes, the merge is not done correctly and I get 'NaN' in the merged columns.
Do you have to use the same delimiter in order for the dataframes to merge properly?
In short: no, you do not need identical delimiters across your files to merge pandas dataframes. Once the data has been imported (which requires setting the right delimiter for each of your files), it is placed in memory and keeps no record of the original delimiter. You can see this by writing your imported dataframes back out with the .to_csv method: the delimiter will always be , by default.
Now, in order to understand what is going wrong with your merge, please post more details about your data and the code you are using to perform the operation.
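To illustrate the first point, a sketch with assumed file names and separators; each file is read with its own delimiter, and the frames then merge on 'ID' like any others:

import pandas as pd

# Each file gets the delimiter it actually uses (names/separators assumed)
df_a = pd.read_csv("file_a.csv", sep=",")
df_b = pd.read_csv("file_b.csv", sep=";")
df_c = pd.read_csv("file_c.csv", sep="|")
df_d = pd.read_csv("file_d.csv", sep=r"\s+")

# In memory the original delimiters no longer matter
merged = df_a.merge(df_b, on="ID").merge(df_c, on="ID").merge(df_d, on="ID")
print(merged.head())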
