Cleaning a messy Dataset - python

I have a file that was originally a .txt and was converted to .csv.
I have the column names, but the unprocessed dataset is practically a single row.
How do I clean the dataset with pandas, NumPy, etc. so that each string or integer between commas is placed in a separate column under the proper column name?
Thanks,
Ido
cols = ['AIRLINE_ID', 'AIRLINE_NAME', 'ALIAS', 'IATA', 'ICAO', 'CALLSIGN', 'COUNTRY', 'ACTIVE']
Airlines_raw_dataset
I looked for videos regarding this topic on youtube but I didn't encounter specific info for this highly dirty dataset.

pandas has a built-in method for reading CSV files. It can be used like this:
import pandas as pd
df = pd.read_csv('filename.csv')
You can read more about this method in the official pandas documentation.
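If the file has no header row, as the question suggests, a sketch along these lines may be closer to what is needed. The two sample rows are made up for illustration (the real file's contents are not shown), and io.StringIO stands in for the actual file path; header=None and names= are standard read_csv parameters.

```python
import io
import pandas as pd

cols = ['AIRLINE_ID', 'AIRLINE_NAME', 'ALIAS', 'IATA', 'ICAO',
        'CALLSIGN', 'COUNTRY', 'ACTIVE']

# Hypothetical sample rows; replace `raw` with the path to the real file.
raw = io.StringIO(
    '1,"Private flight",,"PF","PFL","PRIVATE","Nowhere","Y"\n'
    '2,"135 Airways",,"GA","GNL","GENERAL","United States","N"\n'
)

# header=None: the first line is data, not column names.
# names=cols: label the comma-separated fields with the known column names.
df = pd.read_csv(raw, header=None, names=cols)
print(df.shape)
print(df[['AIRLINE_ID', 'AIRLINE_NAME', 'COUNTRY']])
```

If the whole dataset really sits on one physical line, it would first need to be split into records before this step; how to do that depends on what separates the records in the raw file.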

Related

reshape data using python?

Being new to Python, I am looking for some help reshaping this data. I already know how to do it in Excel but want a Python-specific solution.
I want it to be in this format.
The entire dataset is 70k rows with different vc_firm_name values; any help would be great.
If you care about performance, then I suggest you take a look at other methods (such as using numpy, or sorting the table):
https://stackoverflow.com/a/42550516/17323241
https://stackoverflow.com/a/66018377/17323241
https://stackoverflow.com/a/22221675/17323241 (look at second comment)
Otherwise, you can do:
import pandas as pd
# load data from csv file
df = pd.read_csv("example.csv")
# collect each firm's industries into a list
df.groupby("vc_firm_name")["investment_industry"].apply(list)
Assuming the original file is "original.csv", and you want to save it as "new.csv" I would do:
pd.read_csv("original.csv").groupby(by=["vc_firm_name"],as_index=False).aggregate(lambda x: ','.join(x)).to_csv("new.csv", index=False)
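As a small illustration of both answers, here is a sketch on made-up data. The firm and industry values are hypothetical; the column names follow the snippets above.

```python
import pandas as pd

# Hypothetical rows standing in for the 70k-row dataset.
df = pd.DataFrame({
    "vc_firm_name": ["Alpha", "Alpha", "Beta"],
    "investment_industry": ["Biotech", "Software", "Energy"],
})

# One list of industries per firm (result is a Series indexed by firm name).
as_lists = df.groupby("vc_firm_name")["investment_industry"].apply(list)

# One comma-joined string per firm (result is a DataFrame, ready for to_csv).
as_text = df.groupby("vc_firm_name", as_index=False).agg(
    {"investment_industry": ",".join}
)

print(as_lists)
print(as_text)
```

The list form is easier to keep working with in Python, while the joined-string form is the one that writes cleanly to a CSV cell.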

Creating an object from csv data

I want to create an object(maybe list?? still waiting for suggestions) that contains data read from a .csv file
The data looks like this:
[';;;;;;;;;;;;;;Number;;Semester;;Grade;;;']
[';;;;;;;;;;;;;;1;;I;;A;;;']
[';;;;;;;;;;;;;;2;;I;;C;;;']
[';;;;;;;;;;;;;;3;;II;;A;;;']
I'm thinking this could be solved using regex.
The idea is that the first row determines the order in which 'number, semester, grade' appears and the next rows are what we need to store.
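How about using pandas and read_csv? The sample rows are delimited by semicolons, so pass sep=';'; header=0 takes the first row as the column names.
import pandas
pandas.read_csv('file.csv', sep=';', header=0)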
https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html
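A minimal sketch of that idea on the rows from the question, assuming the file really is semicolon-delimited with many empty fields. io.StringIO stands in for the file, and dropping all-empty columns is one possible cleanup, not the only one.

```python
import io
import pandas as pd

# The four rows shown in the question, as they would appear in the file.
raw = io.StringIO(
    ";;;;;;;;;;;;;;Number;;Semester;;Grade;;;\n"
    ";;;;;;;;;;;;;;1;;I;;A;;;\n"
    ";;;;;;;;;;;;;;2;;I;;C;;;\n"
    ";;;;;;;;;;;;;;3;;II;;A;;;\n"
)

# The first row becomes the header; empty header cells get "Unnamed: N" names.
df = pd.read_csv(raw, sep=";", header=0)

# Columns that contain no values at all are just padding, so drop them.
df = df.dropna(axis=1, how="all")

# Convert to a list of dicts if a plain Python object is preferred.
records = df.to_dict("records")
print(df)
print(records)
```

Because the header row decides the column names, this keeps working if Number, Semester, and Grade appear in a different order, with no regex needed.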

Is there any possibility to cut a long vector of output into specific pieces and save them in different cells in Excel?

I just started to use Python.
Actually, I'm setting up a new methodology to read patent data. With textrazor this patent data should be analyzed. I'm interested in getting the topics and save them in a term-document-matrix. It's already possible for me to save the output topics, but only in one big cell with a very long vector. How can I split this long vector, to save the topics in different cells in an Excel file?
If you have any ideas regarding this problem, I would be thankful for your answer. Also, feel free to recommend or help me with my code.
data = open('Patentdaten1.csv')
content= data.read()
table=[]
row = content.split('\n')
for i in range(len(row)):
    column = row[i].split(';')
    table.append(column)
patent1= table[1][1]
import textrazor
textrazor.api_key ="b033067632dba8a710c57f088115ad4eeff22142629bb1c07c780a10"
client = textrazor.TextRazor(extractors= ["entities", "categories", "topics"])
client.set_classifiers(['textrazor_newscodes'])
response = client.analyze(content)
topics= response.topics()
import pandas as pd
df = pd.DataFrame({'topic' : [topics]})
df.to_csv('test.csv')
It's a bit difficult to see exactly what is the problem without an example input and/or output, but saving data to excel via pandas removes any need for intermediate processing:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_excel.html
For instance:
import pandas
data = pandas.DataFrame.from_dict({"patents": ["p0", "p1"], "authors": ["a0", "a1"]})
data.to_excel("D:\\test.xlsx")
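Applied to the question, the single big cell comes from wrapping the whole list in another list ({'topic': [topics]}). A sketch of the two usual alternatives follows; the topic labels are hypothetical plain strings standing in for whatever the TextRazor response returns.

```python
import pandas as pd

# Hypothetical topic labels; the real values would come from the API response.
topics = ["Patent", "Invention", "Intellectual property"]

# One topic per row: pass the list itself as the column.
one_per_row = pd.DataFrame({"topic": topics})

# One topic per column: pass a list containing one row.
one_per_column = pd.DataFrame([topics])

# to_csv() without a path returns the CSV text, handy for a quick check.
print(one_per_row.to_csv(index=False))
print(one_per_column.to_csv(index=False, header=False))
```

Either frame can then be written with to_csv or to_excel, and each topic lands in its own cell.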

Write Python Dataframe into a Word/Excel Document with Specific Formatting

Hi, I'm relatively new to Python and was hoping someone could provide advice on templating matters.
I've managed to parse an excel file, made a dataframe out of the data (using xl.parse, .loc, str.contains, str.split, sort_index etc. methods) and output it into another excel file like so:
Excel doc with dataframe
I'm stuck at formatting - adding borders, bolding certain rows of strings (not necessarily in the same position between 2 different output files), highlighting certain cells with color, etc.
I have a template which I have to follow, like so (Word doc): Format to replicate (word doc)
I'm considering two ways about this:
1) Replicate the formatting from scratch through python (either as an excel or word doc)
2) Write the raw data from the output excel file to the word doc with the template
It'd be great if someone can enlighten me on which way is more efficient, and what libraries, methods/functions I can look into to get the job done.
Thank you!
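I recommend using xlsxwriter for formatting. Note that the border and bold snippets below use Excel's COM object model (for example through pywin32's win32com.client) rather than xlsxwriter; ws is assumed to be an already opened worksheet object. You can add borders with code like this:
# left edge (7 = xlEdgeLeft)
begcol = 2 # skip first col
endcol = ws.UsedRange.Columns.Count
begrow = 2 # skip first row
endrow = ws.UsedRange.Rows.Count
ws.Range(ws.Cells(begrow, begcol),
         ws.Cells(endrow, endcol)).Borders(7).LineStyle = 1 # continuous
ws.Range(ws.Cells(begrow, begcol),
         ws.Cells(endrow, endcol)).Borders(7).Weight = 2 # thin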
You can bold a row this way:
# bold last row
ws.Range(ws.Cells(endrow, begcol),
         ws.Cells(endrow, endcol)).Font.Bold = True
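You can set the background color of a cell with xlsxwriter like this:
import xlsxwriter
workbook = xlsxwriter.Workbook('formatted.xlsx') # hypothetical output file name
worksheet = workbook.add_worksheet()
cell_format = workbook.add_format()
cell_format.set_pattern(1) # This is optional when using a solid fill.
cell_format.set_bg_color('green')
worksheet.write('A1', 'Ray', cell_format)
workbook.close() # the file is written when the workbook is closed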
For writing to Word Documents you can use docx with an example of how to do that here: http://pbpython.com/python-word-template.html
There are a few good ways to do this. I typically do one of the following two approaches:
1) XlsxWriter: This package supports changing the formatting of Excel files. My workflow is to export to Excel using pandas, then adjust the formatting with XlsxWriter. pandas and XlsxWriter play well together, as you can see from this demo.
2) For some workflows I found the amount or type of formatting I wanted in Excel was just not reasonable to do with XlsxWriter. In those cases the best bet is to put your data in something that's NOT Excel and then link Excel to it. One easy approach is dumping the data to a CSV and linking your well-formatted Excel file to the CSV. You could also push data into a database with pandas and then have the Excel file pull data from the DB.
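The first approach can be sketched as follows, assuming pandas and XlsxWriter are both installed. The data, sheet name, and output path are made up for illustration; the file goes to a temporary directory.

```python
import os
import tempfile
import pandas as pd

# Hypothetical data; the last row is the one to be shown in bold.
df = pd.DataFrame({"item": ["a", "b", "Total"], "amount": [1, 2, 3]})
path = os.path.join(tempfile.mkdtemp(), "formatted.xlsx")

with pd.ExcelWriter(path, engine="xlsxwriter") as writer:
    df.to_excel(writer, sheet_name="Sheet1", index=False)

    # pandas exposes the underlying XlsxWriter objects for extra formatting.
    workbook = writer.book
    worksheet = writer.sheets["Sheet1"]

    bold = workbook.add_format({"bold": True})
    highlight = workbook.add_format({"bg_color": "green", "border": 1})

    # Row 0 holds the header, so the last data row has index len(df).
    worksheet.set_row(len(df), None, bold)

    # Write one highlighted, bordered cell outside the data area.
    worksheet.write("D1", "Ray", highlight)

print(path)
```

Since the bold row is found from len(df) rather than a fixed position, the same code works when the row of interest moves between output files.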

Methods of reading Excel data into a list/array

Imagine I am given two columns: a,b,c,d,e,f and e,f,g,h,i,j (commas indicating a new row in the column)
How could I read such information from Excel and put it in two separate arrays? I would like to manipulate this data and output it later.
You have a few choices here. If your data is rectangular, starts from A1, etc. you should just use pandas.read_excel:
import pandas as pd
df = pd.read_excel("/path/to/excel/file", sheet_name="My Sheet Name")
print(df["column1"].values)
print(df["column2"].values)
If your data is a little messier and the read_excel options aren't enough to get at it, then I think your only choice is to use something lower level, like the fantastic xlrd module (read the quickstart in the README). Note that xlrd 2.0 and later reads only legacy .xls files; for .xlsx files, openpyxl plays the same role.
