Select before aggregate with Pandas (Python)

I'm pretty new to Pandas in Python and I need help to see if I'm doing it right. Basically, I have an Excel file containing data and I'm using pandas to work with it. The question asks me to select four ports before aggregating by Year and Port, so I tried something like this:
# Build a boolean mask for the four ports, filter, then group
filterport = portTraffic.Port.isin(['Adelaide', 'Brisbane', 'Sydney', 'Melbourne'])
new = portTraffic[filterport]
year_port = new.groupby(['Year', 'Port'])
If I print the head, the output only shows the data for the four ports I filtered, but I wonder if I'm doing it correctly.
Note: portTraffic is the DataFrame read from the Excel file.
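For reference, here is a minimal end-to-end sketch of the same filter-then-aggregate approach. The file name, the 'Traffic' column, and the sum() aggregation are assumptions for illustration; only the filter and groupby mirror the question.

import pandas as pd

# Load the spreadsheet into a DataFrame (file name is an assumption)
portTraffic = pd.read_excel('port_traffic.xlsx')

# Keep only the four ports of interest, then aggregate
mask = portTraffic['Port'].isin(['Adelaide', 'Brisbane', 'Sydney', 'Melbourne'])
filtered = portTraffic[mask]

# groupby alone only defines the groups; an aggregation such as sum()
# (over a hypothetical 'Traffic' column) produces the actual result
year_port = filtered.groupby(['Year', 'Port'])['Traffic'].sum()
print(year_port.head())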

Related

Merging multiple CSVs with different columns

Let's say I have a CSV which is generated yearly by my business. Each year, my business decides there is a new type of data we want to collect. So Year2002.csv looks like this:
Age,Gender,Address
A,B,C
Then year2003.csv adds a new column
Age,Gender,Address,Location
A,B,C,D
By the time we get to year 2021, my CSV now has 7 columns and looks like this:
Age,Gender,Address,Location,Height,Weight,Race
A,B,C,D,E,F,G
My business wants to create a single CSV which contains all of the data recorded. Where data is not available (for example, Location data is not recorded in the 2002 CSV), there can be a 0, a NaN, or an empty cell.
What is the best method available to merge the CSVs into a single CSV? It may be worth saying that I have 15,000 CSV files which need to be merged, ranging from 2002 to 2021. In 2002 the CSV starts off with three columns, but by 2020 it has 10 columns. I want to create one 'master' spreadsheet which contains all of the data.
Just a little extra context... I am doing this because I will then be using Python to replace the empty values using the new data. E.g. calculate an average and replace CSV empty values with that average.
Hope this makes sense. I am just looking for some direction on how best to approach this. I have been playing around with Excel, Power BI, and Python, but I cannot figure out the best way to do this.
With pandas you can use pandas.read_csv() to create a DataFrame, and you can combine multiple DataFrames using pandas.concat().
import pandas as pd

data1 = pd.read_csv(csv1)
data2 = pd.read_csv(csv2)
# concat takes a list of DataFrames; columns are aligned by name
data = pd.concat([data1, data2])
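Since there are thousands of files, here is a hedged sketch of scaling this up with glob (the 'data/*.csv' path pattern is an assumption):

import glob
import pandas as pd

# Collect every yearly CSV; concat aligns the differing columns by name
# and fills the missing values with NaN
paths = glob.glob('data/*.csv')
frames = [pd.read_csv(p) for p in paths]
master = pd.concat(frames, ignore_index=True)

master.to_csv('master.csv', index=False)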
You should take a look at the Python csv module.
A good place to start: https://www.geeksforgeeks.org/working-csv-files-python/
It is simple and useful for reading CSVs and creating new ones.
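For instance, a minimal sketch of reading one of the yearly files with the csv module (the file name matches the question's example):

import csv

# DictReader maps each row to the column names from the header line
with open('Year2002.csv', newline='') as f:
    for row in csv.DictReader(f):
        print(row)  # e.g. {'Age': 'A', 'Gender': 'B', 'Address': 'C'}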

Filtering data on multiple CSV files (Python, pandas)

This is my first question on Stack Overflow. I just started to learn Python 2 months ago.
I had a look on this site and others, but I can't find a solution to my problem.
I'm trying to speed up an annoying data-filtering task I have to do every time for my job.
I want to use the pandas library to read multiple .csv files (12, to be precise) and assign each one a variable (df_1, df_2, ..., df_12) that corresponds to a new filtered DataFrame.
Each .csv file contains the raw data of a tensile test from one of the company's Instron machines we have in the lab.
Example: [image: first .csv file with raw data, first 9 rows]
I will use the filtered dataframe to do some other analysis with Minitab software.
This is what I managed to do so far:
import pandas as pd

dataset_1 = pd.read_csv('Specimen_RawData_1.csv')
df_1 = pd.DataFrame({'X': dataset_1.iloc[1:, -1].values,
                     'y': dataset_1.iloc[1:, 2].values})
df_1 = df_1.loc[df_1['X'].isin(['1.0', '2.0', '3.0', '4.0', '5.0'])]
The code takes the last column and assigns it to X, and takes the third column and assigns it to y.
X is then filtered, keeping only the values equal to 1, 2, 3, 4, 5.
This works for the first .csv file. I could copy and paste it 12 times, but I thought that using a list or a dictionary might help instead.
I understand I can't create variables in a loop.
I have failed so far because the dictionary I created stores the variable names as strings, so I can't use them for data analysis.
Any idea, please?
Thank you.
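One common pattern, sketched here under the question's own file naming and column positions, is to store the DataFrames in a dictionary keyed by file number instead of 12 separate variables:

import pandas as pd

# Build a dict of filtered DataFrames instead of df_1 ... df_12
dataframes = {}
for i in range(1, 13):
    raw = pd.read_csv(f'Specimen_RawData_{i}.csv')
    df = pd.DataFrame({'X': raw.iloc[1:, -1].values,
                       'y': raw.iloc[1:, 2].values})
    dataframes[i] = df.loc[df['X'].isin(['1.0', '2.0', '3.0', '4.0', '5.0'])]

# Any one of them is then a normal DataFrame, e.g. the first:
print(dataframes[1].head())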

Python: Searching an Excel sheet for a row based on a keyword and returning the row

Disclosure: I am not an expert in this, nor do I have much practice with it. I have, however, spent several hours attempting to figure this out on my own.
I have an Excel sheet with thousands of object serial numbers and the addresses where they are located. I am attempting to write a script that will search columns 'A' and 'B' for either the serial number or the address, and return the whole row with the additional information on the particular object. I am using Python because I am trying to integrate this with a preexisting script I have that accesses other sources. Using pandas, I have been able to load the .xls sheet and return all of its values, but I cannot figure out how to search it in such a way as to return only the row I am interested in.
[image: Excel sheet example]
Here is the code that I have:
import pandas as pd
data = pd.read_excel('my/path/to.xls')
print(data.head())
I can work with the print function to print various parts of the sheet; however, whenever I try to add a search function to it, I get lost, and my online research has been less than helpful. A.) Is there a better Python module to be using for this task? Or B.) how do I implement a search function that returns a row of data as a variable, to be displayed and/or used in other parts of the program?
Something like this should work. pandas works with rows, columns, and indices, and we can take advantage of all three to get your utility working.
import pandas as pd

serial_number = input("What is your serial number: ")
address = input("What is your address: ")

# read in the dataframe
data = pd.read_excel('my/path/to.xls')

# filter the dataframe with the str.contains method
filter_df = data.loc[data['A'].str.contains(serial_number)
                     | data['B'].str.contains(address)]
print(filter_df)
If you have a list of items, e.g.
serial_nums = [0, 5, 9, 11]
you can use isin, which filters your dataframe based on a list:
data.loc[data['A'].isin(serial_nums)]
Hopefully this gets you started.
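And since the question asks for the row as a variable, a small hedged continuation of the sketch above (the 'A' and 'B' column names are carried over from it):

# Grab the first matching row as a pandas Series for use elsewhere
if not filter_df.empty:
    row = filter_df.iloc[0]
    print(row['A'], row['B'])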

Exporting Pandas DataFrame cells directly to Excel/CSV (Python)

I have a Pandas DataFrame that has sports records in it. They all look like this: "1-2-0", "17-12-1", etc., for wins, losses, and ties. When I export this, the records come up in different date formats within Excel. Some come up as "12-May", others as "9/5/2001", and others come up as I want them to.
The DataFrame that I want to export is named 'x', and this is the command I'm currently using. I tried it without the date_format part and it gave the same result in Excel.
x.to_csv(r'C:\Users\B\Desktop\nba.csv', date_format='%s')
I also tried using to_excel and kept getting errors while exporting. Any ideas? I was thinking I am doing the date_format part wrong, but I don't know how to transfer the text directly instead of it getting automatically converted.
Thanks!
I don't think it's a Python issue; rather, Excel is auto-detecting dates in your data.
But see below to convert your scores to strings.
Try this:
import pandas as pd
df = pd.DataFrame({"lakers" : ["10-0-1"],"celtics" : ["11-1-3"]})
print(df.head())
Here is the DataFrame with made-up data:
lakers celtics
0 10-0-1 11-1-3
Convert the DataFrame to strings:
df = df.astype(str)
and save the CSV:
df.to_csv('nba.csv')
Opening the file in LibreOffice gives me two columns with the (made-up) scores.
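As a further hedged option (assuming openpyxl is installed): writing an .xlsx instead of a CSV stores the values as typed text cells, so Excel has nothing to re-parse as dates when the file is opened.

import pandas as pd

df = pd.DataFrame({"lakers": ["10-0-1"], "celtics": ["11-1-3"]})

# .xlsx cells carry a type, so these values stay text when opened in Excel
df.astype(str).to_excel('nba.xlsx', index=False)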
You might have an Excel usage issue going on here. In line with my comment, you can change any column in Excel to lots of different formats. In this case I believe Excel is auto-detecting date formatting, incorrectly. Select your columns of data, right-click, select Format, and change to anything else, like 'General'.

Is there a way to loop through rows of data in Excel with Python until an empty cell is reached? [duplicate]

This question already has answers here: How to find the last row in a column using openpyxl normal workbook? (4 answers). Closed 3 years ago.
I am working with a large Excel sheet. For each row of data I need to perform several tasks. Is there a way to construct a loop in Python that runs through each line until an empty cell is found?
For example:
Project1 Data Data Data
Project2 Data Data Data
Project3 Data Data Data
Project4 Data Data Data
In this scenario, I would want to run through the sheet until after Project4. But different documents will have differently sized tables, so it will need to run until it hits an empty cell rather than stopping at a fixed cell.
I am thinking a do-until type loop (as you can tell, I don't know Python very well) would be useful. I also know there is a way to test for empty cells via openpyxl, which I am using for this project:
if sheet.cell(0, 0).value == xlrd.empty_cell.value:
    # Do something
Currently, I would try to figure out a way to do something similar to this, unless someone suggests a better alternative:
for i in range(10, 1000):  # setting an upper limit of 1000 rows
    if sheet.cell(0, i).value != xlrd.empty_cell.value:
        variable = sheet.cell(2, i).value
        # other body stuff
    else:
        break
I know this code is rather undeveloped; I just wanted to ask before going in the wrong direction. I am also unsure how to make i run through the rows.
If what you need is to read the Excel file in Python, I'd recommend taking a look at pandas' read_excel.
Hope this helps!
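For the stop-at-the-first-empty-cell part specifically, here is a minimal sketch with openpyxl; the file name and the assumption that the projects sit in column A starting at row 1 are illustrative:

from openpyxl import load_workbook

wb = load_workbook('projects.xlsx')  # hypothetical file name
sheet = wb.active

row = 1
# Walk down column A until the first empty cell is reached
while sheet.cell(row=row, column=1).value is not None:
    project = sheet.cell(row=row, column=1).value
    # ... perform the per-row tasks here ...
    row += 1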
