Merging multiple CSVs with different columns - python
Let's say I have a CSV which is generated yearly by my business. Each year, my business decides there is a new type of data we want to collect. So Year2002.csv looks like this:
Age,Gender,Address
A,B,C
Then year2003.csv adds a new column:
Age,Gender,Address,Location
A,B,C,D
By the time we get to year 2021, my CSV now has 7 columns and looks like this:
Age,Gender,Address,Location,Height,Weight,Race
A,B,C,D,E,F,G
My business wants to create a single CSV which contains all of the data recorded. Where data is not available (for example, Location data is not recorded in the 2002 CSV), there can be a 0, a NaN, or an empty cell.
What is the best method available to merge the CSVs into a single CSV? It may be worth saying that I have 15,000 CSV files which need to be merged, ranging from 2002 to 2021. In 2002 the CSV starts off with three columns, but by 2020 the CSV has 10 columns. I want to create one 'master' spreadsheet which contains all of the data.
Just a little extra context... I am doing this because I will then be using Python to replace the empty values using the new data, e.g. calculate an average and replace the empty CSV values with that average.
Hope this makes sense. I am just looking for some direction on how best to approach this. I have been playing around with Excel, Power BI and Python, but I cannot figure out the best way to do this.
With pandas you can use pandas.read_csv() to create DataFrames, which you can combine using pandas.concat().
import pandas as pd

data1 = pd.read_csv(csv1)
data2 = pd.read_csv(csv2)
# concat expects a list of DataFrames; columns are aligned by name,
# and columns missing from a file are filled with NaN
data = pd.concat([data1, data2], ignore_index=True)
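For the full 15,000-file job the same idea scales up: read every yearly file and let concat line the columns up. A minimal sketch, where the two StringIO "files" below are made-up stand-ins for year2002.csv and year2003.csv (with real data you would glob a folder of paths instead):

```python
import io
import pandas as pd

# Toy stand-ins for year2002.csv and year2003.csv; with real files you
# would do: paths = sorted(glob.glob("data/year*.csv"))
year2002 = io.StringIO("Age,Gender,Address\nA,B,C\n")
year2003 = io.StringIO("Age,Gender,Address,Location\nA,B,C,D\n")

# concat aligns on column names and fills columns a file lacks with NaN
frames = [pd.read_csv(f) for f in (year2002, year2003)]
master = pd.concat(frames, ignore_index=True)

print(master.columns.tolist())  # ['Age', 'Gender', 'Address', 'Location']
print(master['Location'].isna().tolist())  # [True, False]
```

The NaNs left behind are exactly the cells you would later fill with your computed averages before writing `master.to_csv("master.csv", index=False)`.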
You should take a look at the Python csv module.
A good place to start: https://www.geeksforgeeks.org/working-csv-files-python/
It is simple and useful for reading CSVs and creating new ones.
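If you'd rather avoid pandas entirely, the csv module's DictReader/DictWriter can do the same column alignment: collect the union of all headers in one pass, then write every row with restval filling the gaps. A small sketch with StringIO stand-ins for the real files:

```python
import csv
import io

# Toy inputs standing in for two yearly files (StringIO simulates files)
files = [io.StringIO("Age,Gender\nA,B\n"),
         io.StringIO("Age,Gender,Address\nA,B,C\n")]

# First pass: collect the union of all headers, preserving first-seen order
fieldnames = []
rows = []
for f in files:
    reader = csv.DictReader(f)
    for name in reader.fieldnames:
        if name not in fieldnames:
            fieldnames.append(name)
    rows.extend(reader)

# Second pass: write every row; restval='' fills columns a row lacks
out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=fieldnames, restval='')
writer.writeheader()
writer.writerows(rows)
print(out.getvalue())
```

With real files you would open each path twice (or cache the rows as here) and write to an on-disk master file instead of a StringIO.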
Related
How to get rid of rows with pandas in a CSV where the value of cells in a specific column is under 100 Billion?
I'm trying to filter through a CSV and make a new CSV which is exactly the same, except it gets rid of any rows that have a value of greater than 100 billion in the 'marketcap' column. The code I've written so far just spits out the same CSV as the original over again and doesn't cut out any lines from the old CSV to the new CSV. Code:
db = pd.read_csv('SF1_original.csv')
db = db[db['marketcap'] <= 100000000000]
db.to_csv('new_SF1_original.csv')
Example of old CSV (it's long, don't look through the whole thing, just to give you an idea):
ticker,dimension,calendardate,datekey,reportperiod,lastupdated,accoci,assets,assetsavg,assetsc,assetsnc,assetturnover,bvps,capex,cashneq,cashnequsd,cor,consolinc,currentratio,de,debt,debtc,debtnc,debtusd,deferredrev,depamor,deposits,divyield,dps,ebit,ebitda,ebitdamargin,ebitdausd,ebitusd,ebt,eps,epsdil,epsusd,equity,equityavg,equityusd,ev,evebit,evebitda,fcf,fcfps,fxusd,gp,grossmargin,intangibles,intexp,invcap,invcapavg,inventory,investments,investmentsc,investmentsnc,liabilities,liabilitiesc,liabilitiesnc,marketcap,ncf,ncfbus,ncfcommon,ncfdebt,ncfdiv,ncff,ncfi,ncfinv,ncfo,ncfx,netinc,netinccmn,netinccmnusd,netincdis,netincnci,netmargin,opex,opinc,payables,payoutratio,pb,pe,pe1,ppnenet,prefdivis,price,ps,ps1,receivables,retearn,revenue,revenueusd,rnd,roa,roe,roic,ros,sbcomp,sgna,sharefactor,sharesbas,shareswa,shareswadil,sps,tangibles,taxassets,taxexp,taxliabilities,tbvps,workingcapital
A,ARQ,1999-12-31,2000-03-15,2000-01-31,2020-09-01,53000000,7107000000,,4982000000,2125000000,,10.219,-30000000,1368000000,1368000000,1160000000,131000000,2.41,0.584,665000000,111000000,554000000,665000000,281000000,96000000,0,0.0,0.0,202000000,298000000,0.133,298000000,202000000,202000000,0.3,0.3,0.3,4486000000,,4486000000,50960600000,,,354000000,0.806,1.0,1086000000,0.484,0,0,4337000000,,1567000000,42000000,42000000,0,2621000000,2067000000,554000000,51663600000,1368000000,-160000000,2068000000,111000000,0,1192000000,-208000000,-42000000,384000000,0,131000000,131000000,131000000,0,0,0.058,915000000,171000000,635000000,0.0,11.517,,,1408000000,0,114.3,,,1445000000,131000000,2246000000,2246000000,290000000,,,,,0,625000000,1.0,452000000,439000000,440000000,5.116,7107000000,0,71000000,113000000,16.189,2915000000
Example of new CSV (exact same, when this line should have been cut):
,ticker,dimension,calendardate,datekey,reportperiod,lastupdated,accoci,assets,assetsavg,assetsc,assetsnc,assetturnover,bvps,capex,cashneq,cashnequsd,cor,consolinc,currentratio,de,debt,debtc,debtnc,debtusd,deferredrev,depamor,deposits,divyield,dps,ebit,ebitda,ebitdamargin,ebitdausd,ebitusd,ebt,eps,epsdil,epsusd,equity,equityavg,equityusd,ev,evebit,evebitda,fcf,fcfps,fxusd,gp,grossmargin,intangibles,intexp,invcap,invcapavg,inventory,investments,investmentsc,investmentsnc,liabilities,liabilitiesc,liabilitiesnc,marketcap,ncf,ncfbus,ncfcommon,ncfdebt,ncfdiv,ncff,ncfi,ncfinv,ncfo,ncfx,netinc,netinccmn,netinccmnusd,netincdis,netincnci,netmargin,opex,opinc,payables,payoutratio,pb,pe,pe1,ppnenet,prefdivis,price,ps,ps1,receivables,retearn,revenue,revenueusd,rnd,roa,roe,roic,ros,sbcomp,sgna,sharefactor,sharesbas,shareswa,shareswadil,sps,tangibles,taxassets,taxexp,taxliabilities,tbvps,workingcapital
0,A,ARQ,1999-12-31,2000-03-15,2000-01-31,2020-09-01,53000000.0,7107000000.0,,4982000000.0,2125000000.0,,10.219,-30000000.0,1368000000.0,1368000000.0,1160000000.0,131000000.0,2.41,0.584,665000000.0,111000000.0,554000000.0,665000000.0,281000000.0,96000000.0,0.0,0.0,0.0,202000000.0,298000000.0,0.133,298000000.0,202000000.0,202000000.0,0.3,0.3,0.3,4486000000.0,,4486000000.0,50960600000.0,,,354000000.0,0.8059999999999999,1.0,1086000000.0,0.484,0.0,0.0,4337000000.0,,1567000000.0,42000000.0,42000000.0,0.0,2621000000.0,2067000000.0,554000000.0,51663600000.0,1368000000.0,-160000000.0,2068000000.0,111000000.0,0.0,1192000000.0,-208000000.0,-42000000.0,384000000.0,0.0,131000000.0,131000000.0,131000000.0,0.0,0.0,0.057999999999999996,915000000.0,171000000.0,635000000.0,0.0,11.517000000000001,,,1408000000.0,0.0,114.3,,,1445000000.0,131000000.0,2246000000.0,2246000000.0,290000000.0,,,,,0.0,625000000.0,1.0,452000000.0,439000000.0,440000000.0,5.1160000000000005,7107000000.0,0.0,71000000.0,113000000.0,16.189,2915000000.0
I've seen two questions somewhat related to this on Stack Overflow, but they haven't helped me much. One uses the csv library instead of pandas (which is an option for me). The other is more helpful since it uses pandas, but it hasn't had much interaction and isn't exactly the same as my use case.
You can get the indexes of the rows with "marketcap" over 100 billion like so:
df.loc[df["marketcap"] > 100000000000].index
All that's left to do is drop them from the DataFrame:
df.drop(df.loc[df["marketcap"] > 100000000000].index, inplace=True)
Reading from the CSV and writing to the CSV are already correctly taken care of in your code.
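For what it's worth, the boolean mask the asker already has and the drop-by-index route select the same rows; a tiny sketch on made-up two-row data showing both:

```python
import pandas as pd

# Made-up stand-in for the real file: two tickers, one over 100 billion
db = pd.DataFrame({"ticker": ["A", "B"],
                   "marketcap": [51_663_600_000, 150_000_000_000]})

# Boolean-mask filter (what the original code does) ...
kept_mask = db[db["marketcap"] <= 100_000_000_000]

# ... and the drop-by-index route; both produce the same rows
kept_drop = db.drop(db.loc[db["marketcap"] > 100_000_000_000].index)

print(kept_mask["ticker"].tolist())  # ['A']
```

If the output file still looks identical, it is worth checking that the filtered frame is actually the one passed to to_csv, and passing index=False so pandas doesn't prepend the extra unnamed index column visible in the "new CSV" example.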
Filtering data on multiple csv files - python - pandas
This is my first question on Stack Overflow. I just started to learn Python 2 months ago. I had a look on this site and others, but I can't find a solution to my problem.
I'm trying to speed up an annoying data-filtering task I have to do every time for my job. I want to use the pandas library to read multiple .csv files (12 to be precise) and assign each one to a variable (df_1, df_2, ..., df_12) that corresponds to a new filtered dataframe. Each .csv file contains the raw data of a tensile test from one of the company Instron machines we have in the lab. Example: first .csv file with raw data, first 9 rows. I will use the filtered dataframes to do some other analysis with Minitab software. This is what I managed to do so far:
import pandas as pd
dataset_1 = pd.read_csv('Specimen_RawData_1.csv')
df_1 = pd.DataFrame({'X': dataset_1.iloc[1:, -1].values,
                     'y': dataset_1.iloc[1:, 2].values,
                     })
df_1 = df_1.loc[df_1['X'].isin(['1.0', '2.0', '3.0', '4.0', '5.0'])]
The code will take the last column and assign it to X, and take the third column and assign it to y. X will then be filtered, keeping only the values equal to 1, 2, 3, 4, 5. This works for the first .csv file. I could copy and paste it 12 times, but I thought that using a list or a dictionary might help instead. I understand I can't create variables in a loop, and I have failed so far because the dictionary I created takes the variables as strings, so I can't use them for data analysis. Any idea, please? Thank you
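One direction (a sketch, not the real data): keep the frames in a dictionary keyed by file number instead of creating df_1 ... df_12 by hand. The StringIO blocks below are made-up stand-ins for Specimen_RawData_1.csv etc., and the numeric filter assumes the column really holds floats rather than strings:

```python
import io
import pandas as pd

# Made-up stand-ins for the 12 raw files; with real data the dict would
# map i to 'Specimen_RawData_{i}.csv' paths instead
raw = "time,load,y,X\n0.0,1,10,1.0\n0.1,2,20,2.0\n0.2,3,30,9.0\n"
files = {1: io.StringIO(raw), 2: io.StringIO(raw)}

dfs = {}
for i, f in files.items():
    dataset = pd.read_csv(f)
    # last column -> X, third column -> y, as in the single-file version
    df = pd.DataFrame({"X": dataset.iloc[:, -1].values,
                       "y": dataset.iloc[:, 2].values})
    dfs[i] = df[df["X"].isin([1.0, 2.0, 3.0, 4.0, 5.0])]

print(len(dfs))  # 2 filtered frames, accessed as dfs[1], dfs[2]
```

dfs[1] then behaves exactly like df_1 did (it is a real DataFrame, not a string), and the Minitab export step can loop over dfs.items().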
Change column names in pandas dataframe
I'm completely new to coding, learning on the go, and need some advice. I have a dataset imported into a Jupyter notebook from an excel .csv file. The column headers are all dates in the format "1/22/20" (22nd January 2020) and I want them to read as "Day1", "Day2", "Day3" etc. I have changed them manually to read as I want, but the csv file updates with a new column every day, which means that when I read it into my notebook to produce the graphs I want, I first have to update the code in my notebook and add the extra "Dayxxx". This isn't a major problem, but I now have 92 days in the csv file/dataset and it's getting boring. I wondered if there is a way to automatically add the "Dayxxx" by reading the file and changing the column headers with a for or while loop. Any advice gratefully received, thanks. Steptho.
I understand that those are your only columns and that they are already ordered from first to last day? You can obtain the number of days by getting the length of the list of column names returned by df.columns. From there you can create a new list with the column names you desire.
import pandas as pd

df = pd.read_csv("your_csv")
no_columns = len(df.columns)
new_column_names = []
for day in range(no_columns):
    new_column_names.append("Day" + str(day + 1))
df.columns = new_column_names
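The same renaming can also be written as a single list comprehension; a quick demo on a made-up three-column frame:

```python
import pandas as pd

# Made-up frame with date headers like the real file
df = pd.DataFrame([[1, 2, 3]], columns=["1/22/20", "1/23/20", "1/24/20"])

# Build DayN names for however many columns exist, so a newly added
# date column is picked up automatically on each re-read
df.columns = [f"Day{i + 1}" for i in range(len(df.columns))]
print(df.columns.tolist())  # ['Day1', 'Day2', 'Day3']
```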
Need help isolating specific parts of a csv into a pandas dataframe
I am trying to open a specific part of a csv file into a pandas dataframe. Here is what the start of the csv file looks like: AMZN,"[{'cik': '0001018724', 'file': '\n<TYPE>10-K\n<SEQUENCE>1\n<FILENAME>amzn-20161231x10k.htm\n<DESCRIPTION>FORM 10-K\n<TEXT>\n<!DOCTYPE html PUBLIC ""-//W3C//DTD HTML 4.01 Transitional//EN"" ""http://www.w3.org/TR/html4/loose.dtd"">\n<html>\n\t<head>\n\t\t<!-- Document created using Wdesk 1 -->\n\t\t<!-- Copyright 2017 Workiva -->\n\t\t<title>Document</title>\n\t</head>\n\t<body style=""font-family:Times New Roman;font-size:10pt;"">\n<div><a name=""s00CB6310C0A752A3B7DA1FF03AF57C4C""></a></div><div style=""line-height:120%;text-align:center;font-size:10pt;""><div style=""padding-left:0px;text-indent:0px;line-height:normal;padding-top:10px;""><table cellpadding=""0"" cellspacing=""0"" style=""font-family:Times New Roman;font-size:10pt;margin-left:auto;margin-right:auto;width:100%;border-collapse:collapse;text-align:left;""><tr><td colspan=""1""></td></tr><tr><td style=""width:100%;""></td></tr><tr><td style=""vertical-align:bottom;border-bottom:1px solid #000000;padding-left:2px;padding-top:2px;padding-bottom:2px;padding-right:2px;border-top:1px solid #000000;""><div style=""overflow:hidden;height:5px;font-size:10pt;""><font style=""font-family:inherit;font-size:10pt;""> </font></div></td></tr></table></div></div><div style=""line-height:120%;text-align:center;font-size:16pt;""><font style=""font-family:inherit;font-size:16pt;font-weight:bold;"">UNITED STATES</font></div><div style=""line-height:120%;text-align:center;font-size:16pt;""><font style=""font-family:inherit;font-size:16pt;font-weight:bold;"">SECURITIES AND EXCHANGE COMMISSION As you can notice, it starts with a company ticker, followed by a list of dictionaries in quotation marks. 
Each dictionary has 4 keys (you can see 2 in the snippet above, 'cik' and 'file') and there are about 100 dictionaries in the list for each ticker (each dictionary containing the same keys but with data from different time periods), and 17 tickers overall. So to summarise, it is like this: ticker, "[{},{},{}...]" followed by the next ticker.
What I would like to do is open this csv into a pandas df with 2 columns, the first having the tickers and the second containing the information held under a specific key in the list of dictionaries. This key is called 'file'; you can see it in the csv snippet above. For each ticker it will appear about 100 times, as there are about 100 dictionaries per ticker and each dictionary has the 'file' key followed by a long string of HTML text.
I have tried to open it with the usual read_csv, as such:
ten_k_data_csv = pd.read_csv(filepath_or_buffer='output_final_cleaned_for_manipulation.csv', header=None)
But it just opens into a 17x2 df, with one row for each ticker: the first column contains the ticker symbol and the second column contains the entire list of dictionaries as a string. I cannot use any of the df manipulations I have learnt to isolate the information I want, as the 'file' key appears many times per row. The dataframe I want should instead be about 1700 rows, as I want each individual dictionary to be its own row.
Alternatively, if there is a way to manipulate the csv file directly through code to isolate the 'file' keys and tickers and delete everything else, I don't actually need to load it into a pandas dataframe. I just usually load data into pandas dataframes as it, most of the time, makes the data easier to manipulate. But in this case, I have no idea where to start. Thanks in advance for any help!
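One possible direction (a sketch on made-up data, not the real file): parse each quoted list with ast.literal_eval, then explode so every dictionary becomes its own row. This assumes each stored string really is a valid Python literal; the real 'file' values are long HTML strings, but the mechanics are the same:

```python
import ast
import pandas as pd

# Made-up two-dict stand-in for one of the 17 ticker rows
df = pd.DataFrame({"ticker": ["AMZN"],
                   "filings": ["[{'cik': '0001018724', 'file': 'html one'},"
                               " {'cik': '0001018724', 'file': 'html two'}]"]})

# literal_eval turns the quoted list into a real list of dicts
# (json.loads would choke on the single quotes)
df["filings"] = df["filings"].apply(ast.literal_eval)

# one row per dictionary, ticker repeated; then keep only the 'file' value
df = df.explode("filings", ignore_index=True)
df["file"] = df["filings"].apply(lambda d: d["file"])
result = df[["ticker", "file"]]
print(result["file"].tolist())  # ['html one', 'html two']
```

On the real 17x2 frame from read_csv(header=None) the columns would be named 0 and 1 rather than 'ticker' and 'filings', and the result should come out at roughly 1700 rows.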
how to write an empty column in a csv based on other columns in the same csv file
I don't know whether this is a very simple question, but I would like to write a conditional statement based on two other columns. I have two columns, age and SES, and another, empty column which should be based on these two. For example, when a person is 65 years old and their corresponding socio-economic status is high, then in the third column (empty column = vitality class) a value of, for example, 1 is given. I have an idea of what I want to achieve, but I have no idea how to implement it in Python itself. I know I should use a for loop and I know how to write conditions; however, because I want to take two columns into consideration when determining what will be written in the empty column, I have no idea how to write that in a function, and furthermore how to write back into the same csv (in the respective empty column).
Use the pandas module to import the csv as a DataFrame object. Then you can use logical statements to fill the empty column:
import pandas as pd

df = pd.read_csv('path_to_file.csv')
df.loc[(df['age'] == 65) & (df['SES'] == 'high'), 'vitality_class'] = 1
df.to_csv('path_to_new_file.csv', index=False)
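If more than one (age, SES) combination has to map to a class, numpy.select keeps the conditions readable; a sketch on made-up data (the class codes 1-3 here are invented for illustration):

```python
import numpy as np
import pandas as pd

# Made-up rows; real data would come from pd.read_csv
df = pd.DataFrame({"age": [65, 40, 70],
                   "SES": ["high", "low", "low"]})

conditions = [
    (df["age"] >= 65) & (df["SES"] == "high"),
    (df["age"] >= 65) & (df["SES"] == "low"),
]
# first matching condition wins; rows matching none get the default
df["vitality_class"] = np.select(conditions, [1, 2], default=3)
print(df["vitality_class"].tolist())  # [1, 3, 2]
```

Writing the result back is the same df.to_csv call as above.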