Need help isolating specific parts of a csv into a pandas dataframe - python

I am trying to open a specific part of a csv file into a pandas dataframe. Here is what the start of the csv file looks like:
AMZN,"[{'cik': '0001018724', 'file': '\n<TYPE>10-K\n<SEQUENCE>1\n<FILENAME>amzn-20161231x10k.htm\n<DESCRIPTION>FORM 10-K\n<TEXT>\n<!DOCTYPE html PUBLIC ""-//W3C//DTD HTML 4.01 Transitional//EN"" ""http://www.w3.org/TR/html4/loose.dtd"">\n<html>\n\t<head>\n\t\t<!-- Document created using Wdesk 1 -->\n\t\t<!-- Copyright 2017 Workiva -->\n\t\t<title>Document</title>\n\t</head>\n\t<body style=""font-family:Times New Roman;font-size:10pt;"">\n<div><a name=""s00CB6310C0A752A3B7DA1FF03AF57C4C""></a></div><div style=""line-height:120%;text-align:center;font-size:10pt;""><div style=""padding-left:0px;text-indent:0px;line-height:normal;padding-top:10px;""><table cellpadding=""0"" cellspacing=""0"" style=""font-family:Times New Roman;font-size:10pt;margin-left:auto;margin-right:auto;width:100%;border-collapse:collapse;text-align:left;""><tr><td colspan=""1""></td></tr><tr><td style=""width:100%;""></td></tr><tr><td style=""vertical-align:bottom;border-bottom:1px solid #000000;padding-left:2px;padding-top:2px;padding-bottom:2px;padding-right:2px;border-top:1px solid #000000;""><div style=""overflow:hidden;height:5px;font-size:10pt;""><font style=""font-family:inherit;font-size:10pt;""> </font></div></td></tr></table></div></div><div style=""line-height:120%;text-align:center;font-size:16pt;""><font style=""font-family:inherit;font-size:16pt;font-weight:bold;"">UNITED STATES</font></div><div style=""line-height:120%;text-align:center;font-size:16pt;""><font style=""font-family:inherit;font-size:16pt;font-weight:bold;"">SECURITIES AND EXCHANGE COMMISSION
As you can see, it starts with a company ticker, followed by a list of dictionaries in quotation marks. Each dictionary has 4 keys (you can see 2 in the snippet above, 'cik' and 'file'), and there are about 100 dictionaries in the list for each ticker (each containing the same keys but with data from different time periods), with 17 tickers overall. To summarise, it looks like this:
ticker, "[{},{},{}...]"
followed by the next ticker
What I would like to do is open this csv into a pandas df with 2 columns: the first holding the tickers and the second holding the information stored under a specific key in the list of dictionaries. This key is called 'file'; you can see it in the csv snippet above. For each ticker it appears about 100 times, as there are about 100 dictionaries per ticker and each dictionary has the 'file' key followed by a long string of html text.
I have tried to open it with the usual read_csv, as such:
import pandas as pd

ten_k_data_csv = pd.read_csv(filepath_or_buffer='output_final_cleaned_for_manipulation.csv', header=None)
But it just opens into a 17x2 df, with one row per ticker: the first column contains the ticker symbol and the second contains the entire list of dictionaries as a single string. I cannot use any of the df manipulations I have learnt to isolate the information I want, as the 'file' key appears many times per row. The dataframe I want should instead have about 1700 rows, with each individual dictionary in its own row.
Alternatively, if there is a way to manipulate the csv file directly through code to isolate the 'file' keys and tickers and delete everything else, I don't actually need to load it into a pandas dataframe. I usually load data into pandas dataframes because, most of the time, it makes the data easier to manipulate. But in this case I have no idea where to start. Thanks in advance for any help!
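A sketch of one possible approach, assuming the second column really is a Python-literal list of dicts that ast.literal_eval can parse (the two-row string below is toy data standing in for the real file; real entries contain long html strings and may need cleaning before parsing):

```python
import ast
import io
import pandas as pd

# toy two-row csv standing in for the real file (list of dicts stored as a quoted string)
raw = '''AMZN,"[{'cik': '0001', 'file': 'html one'}, {'cik': '0002', 'file': 'html two'}]"
AAPL,"[{'cik': '0003', 'file': 'html three'}]"
'''

df = pd.read_csv(io.StringIO(raw), header=None, names=["ticker", "filings"])

# parse each string into an actual list of dicts, give every dict its own row,
# then keep only the value under the 'file' key
df["filings"] = df["filings"].apply(ast.literal_eval)
df = df.explode("filings", ignore_index=True)
df["file"] = df["filings"].apply(lambda d: d["file"])
df = df[["ticker", "file"]]
```

explode() repeats the ticker once per dictionary, which is what turns the 17 rows into roughly 1700.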

Related

Matching header row from pandas dataframe with given dictionary for a row and add corresponding value to dataframe

I have a dataset of different car models and variants. For each variant of a car, I have a dictionary of its specifications (these dictionaries don't have the same keys); one car may have different specification headers and values than another.
So I have created a master list of 172 headers (body type, rating, no. of airbags, etc.) that all my 900 cars' specification headers will eventually fall under.
I want to create a dataset/excel file out of it. I am using Jupyter notebook.
I am trying to make a pandas dataframe so that I can convert it to a csv file. I have a pandas dataframe whose first row is my master headers. Now, for each variant dictionary, I want to check that master header row and, wherever a key from the dictionary matches, add the corresponding value to that column in that particular row.
What I have tried till now:
final_df.loc[len(final_df)] = m  # append dict m as a new row
final_df = final_df.fillna("NA")
print(final_df)
Here m is the dictionary of header keys and corresponding values for each variant. Note that a for loop runs over all the variants, returning a dictionary m for each one, and as soon as I get m for a variant I want to add it as a row in my dataframe.
The problem I am facing is that my data values are not ending up under the proper headers, because each m has different keys than the master header row.
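A sketch of one way to line the dictionaries up under the master headers (the header names and variant dicts below are hypothetical toy stand-ins for the 172 real headers and 900 variants): pd.DataFrame aligns dict keys to columns by name, and reindex forces the master header order, filling unmatched headers with NaN.

```python
import pandas as pd

# hypothetical subset of the 172 master headers
master_headers = ["body type", "rating", "no. of airbags"]

# toy variant dictionaries with differing keys
variants = [
    {"body type": "SUV", "rating": 4.5},
    {"no. of airbags": 6, "rating": 4.0},
]

# dict keys are matched to columns automatically; reindex imposes the master order
df = pd.DataFrame(variants).reindex(columns=master_headers).fillna("NA")
```

Building the whole frame in one call from the list of dicts also avoids appending row by row inside the loop, which is slow in pandas.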

How to compile list of rows from excel and make it as a list array in Python

def check_duplication(excelfile, col_Date, col_Name):
    list_rows = []
Above is a bit of the code.
How do I make lists in Python from the excel file? I want to compile every row that contains a value for Date and Name in the excel sheet and make it into a list. The reason I want a list is that later I want to compare the rows within it to check whether there are duplicates.
Dataframe Method
To compare excel content, you do not need to make a list. But if you do want one, a good starting point is a dataframe, which you can inspect in python. To make a dataframe, use:
import pandas as pd
doc_path = r"the_path_of_excel_file"
sheets= pd.read_excel(doc_path, sheet_name= None, engine= "openpyxl", header= None)
These lines read all of the excel document's sheets, without headers. You may change the parameters.
(For more information: https://pandas.pydata.org/docs/reference/api/pandas.read_excel.html)
Assume Sheet1 is the sheet we have our data in:
d_frame = sheets["Sheet1"]  # sheet_name=None returns a dict keyed by sheet name
list_rows = [d_frame.iloc[i, :] for i in range(d_frame.shape[0])]
I assume you want to use all columns; the code above gives you the list of rows.
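For the duplicate check itself, DataFrame.duplicated can flag repeated (Date, Name) pairs directly, without building the list at all; a minimal sketch with toy data standing in for the excel sheet:

```python
import pandas as pd

# toy frame standing in for the sheet read from the excel file
d_frame = pd.DataFrame({
    "Date": ["2021-01-01", "2021-01-02", "2021-01-01"],
    "Name": ["Alice", "Bob", "Alice"],
})

# duplicated() marks each row whose (Date, Name) pair has already appeared
dupes = d_frame.duplicated(subset=["Date", "Name"])
```

By default the first occurrence is kept unflagged; pass keep=False to flag every copy of a duplicated pair.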

Merging multiple CSV's with different columns

Let's say I have a CSV which is generated yearly by my business. Each year my business decides there is a new type of data we want to collect. So Year2002.csv looks like this:
Age,Gender,Address
A,B,C
Then Year2003.csv adds a new column:
Age,Gender,Address,Location
A,B,C,D
By the time we get to year 2021, my CSV now has 7 columns and looks like this:
Age,Gender,Address,Location,Height,Weight,Race
A,B,C,D,E,F,G
My business wants to create a single CSV which contains all of the data recorded. Where data is not available (for example, Address data is not recorded in the 2002 CSV) there can be a 0, a NaN, or an empty cell.
What is the best method available to merge the CSVs into a single CSV? It may be worth saying that I have 15,000 CSV files which need to be merged, ranging from 2002-2021. In 2002 the CSV starts off with three columns, but by 2020 the csv has 10 columns. I want to create one 'master' spreadsheet which contains all of the data.
Just a little extra context... I am doing this because I will then use Python to replace the empty values using the new data, e.g. calculate an average and replace empty CSV values with that average.
Hope this makes sense. I am just looking for some direction on how best to approach this. I have been playing around with excel, power bi and python but I cannot figure out the best way to do this.
With pandas you can use pandas.read_csv() to create a DataFrame, and merge DataFrames using pandas.concat().
import pandas as pd

data1 = pd.read_csv(csv1)  # csv1, csv2 are the file paths
data2 = pd.read_csv(csv2)
data = pd.concat([data1, data2])  # concat takes a list of DataFrames
You should take a look at python csv module.
A good place to start: https://www.geeksforgeeks.org/working-csv-files-python/
It is simple and useful for reading CSVs and creating new ones.
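To scale this to many files, pd.concat accepts a list of any length and aligns on column names, so columns missing from an earlier year simply become NaN. A minimal sketch with toy frames standing in for two of the yearly files (in practice the list would come from globbing the 15,000 file paths):

```python
import pandas as pd

# toy frames standing in for Year2002.csv and Year2003.csv
year2002 = pd.DataFrame({"Age": ["A"], "Gender": ["B"], "Address": ["C"]})
year2003 = pd.DataFrame({"Age": ["A"], "Gender": ["B"], "Address": ["C"], "Location": ["D"]})

# concat aligns on column names; columns absent from a year become NaN
master = pd.concat([year2002, year2003], ignore_index=True, sort=False)
master = master.fillna("")  # or 0 / leave as NaN for the later averaging step
```

Leaving the gaps as NaN rather than "" may be more convenient here, since the stated follow-up step is filling them with computed averages.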

Change column names in pandas dataframe

I'm completely new to coding and learning on the go and need some advice.
I have a dataset imported into jupyter notebook from an excel .csv file. The column headers are all dates in the format "1/22/20" (22nd January 2020), and I want them to read as "Day1", "Day2", "Day3" etc. I have changed them manually, but the csv file gains a new column every day, which means that each time I read it into my notebook to produce the graphs I want I first have to update the code and add the extra "Dayxxx". This isn't a major problem, but there are now 92 days in the csv file/dataset and it's getting boring. I wondered if there is a way to add the "Dayxxx" automatically, by reading the file and changing the column headers with a for or while loop.
Any advice gratefully received, thanks.
Steptho.
I understand that those are your only columns, and that they are already ordered from first to last day?
You can obtain the number of days from the length of df.columns, the list of column names. From there you can build a new list with the column names you want.
import pandas as pd

df = pd.read_csv("your_csv")
no_columns = len(df.columns)
new_column_names = []
for day in range(no_columns):
    new_column_names.append("Day" + str(day + 1))
df.columns = new_column_names
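The loop can also be collapsed into a single list comprehension; a minimal sketch with a toy frame standing in for the date-headed csv:

```python
import pandas as pd

# toy frame standing in for the csv whose headers are dates
df = pd.DataFrame([[1, 2, 3]], columns=["1/22/20", "1/23/20", "1/24/20"])

# rename every column positionally; newly added day columns are picked up automatically
df.columns = [f"Day{i + 1}" for i in range(df.shape[1])]
```

Because the names are generated from the column count, re-running this after the csv gains a new day's column needs no code change.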

Import table to DataFrame and set group of column as list

I have a table (Tab delimited .txt file) in the following form:
each row is an entry;
the first row contains the headers;
the first 5 columns are simple numeric parameters;
all columns after the 7th are supposed to form a list of values.
My problem is: how can I import this and create a dataframe where the last column contains a list of values?
----- Problem 1 -----
The header (first row) is "shorter", containing only the names of some columns. All the columns after the 7th have no header (because they are supposed to be a list). If I import the file as is, this appears to confuse the import functions.
If, for example, I import as follow
df = pd.read_table( path , sep="\t")
the DataFrame created has only as many columns as there are elements in the first row. Moreover, the data values assigned are mismatched.
----- Problem 2 -----
What is really confusing to me is that if I open the .txt in Excel and save it as Tab-delimited (without changing anything), I can then import it without problems, headers too: columns with no header simply get an "Unnamed XYZ" tag.
Why would saving in Excel change it? Using Notepad++ I can see only one difference: the original .txt uses "Unix (LF)" line endings, while the one saved from Excel uses "Windows (CR LF)". Both are UTF-8, so I do not understand how this would be an issue.
Nevertheless, from there I could manipulate the data and gather all the columns I want into a list. However, I hope there is a more elegant and faster way to do it.
Here is a screen-shot of the .txt file
Thank you,
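One possible workaround for Problem 1, sketched on toy data: skip the short header row, pass read_csv an explicit names list wide enough for the longest row (max_cols below is an assumed, pre-scanned value, as are the column names), then collapse the unnamed tail columns into a single list-valued column.

```python
import io
import pandas as pd

# toy tab-delimited text standing in for the file: 5 named header columns,
# then a variable-length unnamed tail of values on each data row
raw = "a\tb\tc\td\te\n1\t2\t3\t4\t5\t6\t7\n8\t9\t10\t11\t12\t13\t14\t15\n"

# assumed width of the widest row (would be pre-scanned for the real file)
max_cols = 8
names = ["a", "b", "c", "d", "e"] + [f"x{i}" for i in range(max_cols - 5)]

# skiprows=1 drops the short header; explicit names stop pandas inferring
# the column count from the first row
df = pd.read_csv(io.StringIO(raw), sep="\t", skiprows=1, names=names)

# collapse the unnamed tail columns into one list per row, dropping padding NaNs
tail = [c for c in names if c.startswith("x")]
df["values"] = df[tail].apply(lambda r: r.dropna().tolist(), axis=1)
df = df.drop(columns=tail)
```

Rows shorter than max_cols are padded with NaN by read_csv, which the dropna() then removes, so ragged tails come out as lists of the right length.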
