Print specific columns from an Excel file imported into Python - python

I have a table as below:
How can I print all the sources that have an 'X' for a particular column? For example, if I want to specify "Make", the output should be:
Delivery
Reputation
Profitability
PS: The idea is to import the excel file in python and do this operation.

Use pandas:
import pandas as pd

filename = "yourexcelfile.xlsx"
dataframe = pd.read_excel(filename)
# Keep the rows that have an 'X' in the 'Make' column
frame = dataframe.loc[dataframe["Make"] == "X"]
print(frame["Source"])
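If the headers in the sheet don't match exactly (stray spaces, different capitalisation), the lookup raises a KeyError. A minimal self-contained sketch, assuming columns called 'Source' and 'Make' as in the question, that normalises the header names before filtering:

```python
import pandas as pd

# Hypothetical stand-in for the spreadsheet described in the question,
# with deliberately messy header names.
df = pd.DataFrame({
    "Source ": ["Delivery", "Reputation", "Profitability", "Cost"],
    " Make":   ["X", "X", "X", None],
})

# Normalise the headers: strip whitespace and lowercase them.
df.columns = df.columns.str.strip().str.lower()

# Rows where the 'make' column holds an 'X'.
matches = df.loc[df["make"] == "X", "source"]
print(matches.tolist())  # ['Delivery', 'Reputation', 'Profitability']
```

The same normalisation works on a frame loaded with read_excel, so the code no longer depends on how the headers were typed in the sheet.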

Related

How to preserve complicated excel header formats when manipulating data using Pandas Python?

I am parsing a large Excel data file into another one, but the headers are very abnormal. I tried to use read_excel's skiprows and that did not work. I also tried to include the header with
df = pd.read_excel(user_input, header=[1:3], sheet_name='PN Projection'), but then I get this error: "ValueError: cannot join with no overlapping index names." To get around this I tried to name the columns by location, and that did not work either.
When I run the code shown below everything works fine, but past cell "U" the header titles come out as "Unnamed: 1", "Unnamed: 2", and so on. I understand this is because pandas is treating the first row (which is empty there) as the header, but how do I fix this? Is there a way to preserve the headers without manually typing in the format for each cell? Any and all help is appreciated, thank you!
small section of the excel file header
the code I am trying to run
#!/usr/bin/env python
import sys
import os
import pandas as pd
#load source excel file
user_input = input("Enter the path of your source excel file (omit 'C:'): ")
#reads the source excel file
df = pd.read_excel(user_input, sheet_name = 'PN Projection')
#Filtering dataframe
#Filters out rows with 'EOL' in column 'item status' and 'xcvr' in 'description'
df = df[~(df['Item Status'] == 'EOL')]
df = df[~(df['Description'].str.contains("XCVR", na=False))]
#Filters in rows with "XC" or "spartan" in 'description' column
df = df[(df['Description'].str.contains("XC", na=False) | df['Description'].str.contains("Spartan", na=False))]
print(df)
#Saving to a new spreadsheet called Filtered Data
df.to_excel('filtered_data.xlsx', sheet_name='filtered_data')
If you do not need the top 2 rows, then:
df = pd.read_excel(user_input, sheet_name='PN Projection', skiprows=range(0, 2))
This has worked for me when handling several strangely formatted files. Let me know if this isn't what you're looking for, or if there are additional issues.
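If the top rows do carry real header information, another option is to read them as a two-row (MultiIndex) header and flatten it afterwards. A self-contained sketch with an invented sheet layout (read_csv is used here so the example runs without an Excel file; read_excel accepts the same header=[0, 1] argument):

```python
import io
import pandas as pd

# Two header rows followed by data (invented layout for illustration).
raw = io.StringIO(
    "Group A,Group A,Group B\n"
    "PN,Qty,Status\n"
    "XC7020,5,Active\n"
    "Spartan6,2,EOL\n"
)
df = pd.read_csv(raw, header=[0, 1])

# Flatten the MultiIndex columns into single strings like 'Group A PN'.
df.columns = [" ".join(level.strip() for level in col) for col in df.columns]
print(df.columns.tolist())  # ['Group A PN', 'Group A Qty', 'Group B Status']
```

This keeps the information from both header rows instead of discarding it with skiprows.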

How to read a file name and append the name to a new column in a csv file using python pandas?

As the question title says, I need to get file names. I am using Ubuntu. I have a folder called Sample_csv_files in which every file has the same name format except for the ID, e.g.
agent_op_023jlafa45459a390-.csv
agent_op_3rjfigr837yw749jh-.csv
agent_op_f78jlajk7h6559a39-.csv
Here I need to get those IDs and add them to a new column. If I take, for example, the file agent_op_023jlafa45459a390-.csv, then I should populate the new column with the ID alone, e.g.
x | y | new_column
abc|xyz| 023jlafa45459a390
for the entire CSV file. Similarly, I need to do this for the rest of the files.
I hope the description above is clear.
Can anyone help me solve it? This is what I have tried:
df1 = pd.read_csv('/home/user/Downloads/Sample_csv_files/agent_op_023jlafa45459a390-.csv')
df1['filename'] = "agent_op_023jlafa45459a390-.csv"
df1['filename'] = df1['filename'].map(lambda x: x.lstrip('agent-output').rstrip('-.csv'))
df2 = []
df3 = df1['filename'].append(df2)
print(df1.head(10))
df1.to_csv("/home/user/Downloads/sample_work.csv", index=False)
You can use glob.glob() to give you a list of all of the CSV files and then just extract the ID from each filename and add a new column. The file can then be updated as follows:
from glob import glob
import pandas as pd
import os.path
for filename in glob('my/source/folder/agent_op*.csv'):
    # Extract the ID between the 'agent_op_' prefix and the '-.csv' suffix
    id = os.path.basename(filename).lstrip('agent_op_').rstrip('-.csv')
    df = pd.read_csv(filename)
    df['run_id'] = id
    df.to_csv(filename, index=False)
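One caveat: str.lstrip and str.rstrip remove any run of characters from the given set, not a literal prefix or suffix, so an ID that happens to end in 'c', 's' or 'v' would lose those characters. On Python 3.9+ the removeprefix/removesuffix methods strip exact strings instead; a sketch using the filenames from the question:

```python
import os.path

def extract_id(filename: str) -> str:
    """Strip the literal 'agent_op_' prefix and '-.csv' suffix from a path."""
    base = os.path.basename(filename)
    return base.removeprefix("agent_op_").removesuffix("-.csv")

print(extract_id("agent_op_023jlafa45459a390-.csv"))  # 023jlafa45459a390

# The pitfall with rstrip: it strips *characters* from the set '-.csv',
# so an ID ending in 's' or 'c' loses those characters too.
print("agent_op_abcs-.csv".rstrip("-.csv"))  # 'agent_op_ab' -- 'cs' eaten
```

For the sample IDs in the question (which all end in digits) both approaches give the same result, but removeprefix/removesuffix is safe for any ID.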

Python, how to add a new column in excel

I have the below file (file1.xlsx) as input. In total there are 32 columns in this file and almost 2500 rows; just as an example I am mentioning 5 columns in the screen print.
I want to edit the same file with Python and want the output as (file1.xlsx).
It should be noted that I am adding one column named 'short', and its data is a substring of the data present in the Name (A) column of the same Excel file, up to the first decimal point.
Request you to please help.
Regards,
Kawaljeet
Here is what you need...
import pandas as pd

file_name = "file1.xlsx"
df = pd.read_excel(file_name)  # Read the Excel file as a DataFrame
# Keep the part of each 'Name' value before the first '.'
df['short'] = df['Name'].str.split(".").str[0]
df.to_excel("file1.xlsx", index=False)
Hello guys, I solved the problem with the below code:
import pandas as pd
import os

def add_column():
    file_name = "cmdb_inuse.xlsx"
    os.chmod(file_name, 0o777)
    df = pd.read_excel(file_name)  # Read the Excel file as a DataFrame
    df['short'] = [x.split(".")[0] for x in df['Name']]
    df.to_excel("cmdb_inuse.xlsx", index=False)

add_column()

How to import the Excel tabs and give the name of the tabs in a new column accordingly in Python?

I have a file named Example.xlsx in which I have data in the tabs Sales and Purchase.
We have data in both tabs from column A to E.
When I import these data through the pandas module, I want the result to span columns A to F, where column F should display the sheet name. How do I display the sheet name with the pandas module?
I am using this code:
All = pd.read_excel('Example.xlsx', sheet_name=['Sales', 'Purchase'])
and then
df = pd.concat(All[frame] for frame in All.keys())
and afterwards I want to put the name of the tabs in the last column of my data frame, which is F.
I think this is the simplest way.
import pandas as pd

path = r'path_of_your_file'
workbook = pd.read_excel(path, sheet_name=None)
df = pd.DataFrame()
for sheet_name, sheet in workbook.items():
    sheet['sheet'] = sheet_name
    df = df.append(sheet)
# Reset your index or you'll have duplicates
df = df.reset_index(drop=True)
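DataFrame.append was deprecated in pandas 1.4 and removed in 2.0, so on current pandas the same loop can collect each sheet in a list and concatenate once at the end. A sketch using a plain dict of DataFrames as a stand-in for what pd.read_excel(path, sheet_name=None) returns:

```python
import pandas as pd

# Stand-in for pd.read_excel(path, sheet_name=None): a dict mapping
# sheet names to DataFrames (contents invented for illustration).
workbook = {
    "Sales":    pd.DataFrame({"A": [1, 2]}),
    "Purchase": pd.DataFrame({"A": [3]}),
}

frames = []
for sheet_name, sheet in workbook.items():
    sheet = sheet.copy()
    sheet["sheet"] = sheet_name   # new column holding the tab name
    frames.append(sheet)

# Concatenate once; ignore_index avoids duplicate index labels.
df = pd.concat(frames, ignore_index=True)
print(df)
```

Concatenating once is also faster than appending inside the loop, since each append copied the whole accumulated frame.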
The below code will solve your problem:
import os
from glob import glob
import pandas as pd

f_mask = r'path\*.xlsx'  # the folder path where your Example.xlsx is stored
df = pd.concat([df.assign(file=os.path.splitext(os.path.basename(f))[0],
                          sheet=sheet)
                for f in glob(f_mask)
                for sheet, df in pd.read_excel(f, sheet_name=None).items()],
               ignore_index=True)
The code works in the following way:
- Check the base folder and take all the .xlsx files in it
- Read the files one by one
- Add two extra columns, one for the file name and the other for the sheet name
This solution will work even if you want to do the exercise for more than one .xlsx file.

Read specific columns with pandas or other python module

I have a csv file from this webpage.
I want to read some of the columns in the downloaded file (the csv version can be downloaded in the upper right corner).
Let's say I want 2 columns:
59 which in the header is star_name
60 which in the header is ra.
However, for some reason the authors of the webpage sometimes decide to move the columns around.
In the end I want something like this, keeping in mind that values can be missing.
data = #read data in a clever way
names = data['star_name']
ras = data['ra']
This will prevent my program from malfunctioning when the columns are changed again in the future, as long as the names stay correct.
Until now I have tried various approaches using the csv module and, recently, the pandas module, both without any luck.
EDIT (added two lines + the header of my datafile. Sorry, but it's extremely long.)
# name, mass, mass_error_min, mass_error_max, radius, radius_error_min, radius_error_max, orbital_period, orbital_period_err_min, orbital_period_err_max, semi_major_axis, semi_major_axis_error_min, semi_major_axis_error_max, eccentricity, eccentricity_error_min, eccentricity_error_max, angular_distance, inclination, inclination_error_min, inclination_error_max, tzero_tr, tzero_tr_error_min, tzero_tr_error_max, tzero_tr_sec, tzero_tr_sec_error_min, tzero_tr_sec_error_max, lambda_angle, lambda_angle_error_min, lambda_angle_error_max, impact_parameter, impact_parameter_error_min, impact_parameter_error_max, tzero_vr, tzero_vr_error_min, tzero_vr_error_max, K, K_error_min, K_error_max, temp_calculated, temp_measured, hot_point_lon, albedo, albedo_error_min, albedo_error_max, log_g, publication_status, discovered, updated, omega, omega_error_min, omega_error_max, tperi, tperi_error_min, tperi_error_max, detection_type, mass_detection_type, radius_detection_type, alternate_names, molecules, star_name, ra, dec, mag_v, mag_i, mag_j, mag_h, mag_k, star_distance, star_metallicity, star_mass, star_radius, star_sp_type, star_age, star_teff, star_detected_disc, star_magnetic_field
11 Com b,19.4,1.5,1.5,,,,326.03,0.32,0.32,1.29,0.05,0.05,0.231,0.005,0.005,0.011664,,,,,,,,,,,,,,,,,,,,,,,,,,,,,1,2008,2011-12-23,94.8,1.5,1.5,2452899.6,1.6,1.6,Radial Velocity,,,,,11 Com,185.1791667,17.7927778,4.74,,,,,110.6,-0.35,2.7,19.0,G8 III,,4742.0,,
11 UMi b,10.5,2.47,2.47,,,,516.22,3.25,3.25,1.54,0.07,0.07,0.08,0.03,0.03,0.012887,,,,,,,,,,,,,,,,,,,,,,,,,,,,,1,2009,2009-08-13,117.63,21.06,21.06,2452861.05,2.06,2.06,Radial Velocity,,,,,11 UMi,229.275,71.8238889,5.02,,,,,119.5,0.04,1.8,24.08,K4III,1.56,4340.0,,
An easy way to do this is using the pandas library like this.
import pandas as pd
fields = ['star_name', 'ra']
df = pd.read_csv('data.csv', skipinitialspace=True, usecols=fields)
# See the keys
print df.keys()
# See content in 'star_name'
print df.star_name
The key here is skipinitialspace, which removes the spaces after the delimiter in the header, so ' star_name' becomes 'star_name'.
According to the latest pandas documentation you can read a csv file selecting only the columns which you want to read.
import pandas as pd
df = pd.read_csv('some_data.csv', usecols = ['col1','col2'], low_memory = True)
Here we use usecols which reads only selected columns in a dataframe.
We use low_memory so that the file is internally processed in chunks, which lowers memory use while parsing.
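usecols can also take a callable: pandas calls it with each raw header name and keeps the column when it returns True, which sidesteps the stray spaces without relying on skipinitialspace. A self-contained sketch on an in-memory CSV (the star_name/ra names are borrowed from the question's file):

```python
import io
import pandas as pd

# Miniature version of the question's file, with spaces after the commas.
raw = io.StringIO(
    "# name, star_name, ra, dec\n"
    "11 Com b, 11 Com, 185.1791667, 17.7927778\n"
)

wanted = {"star_name", "ra"}
# Keep a column when its stripped header name is one we want.
df = pd.read_csv(raw, usecols=lambda c: c.strip() in wanted)

# The kept headers still carry their leading space; strip them afterwards.
df.columns = df.columns.str.strip()
print(df.columns.tolist())  # ['star_name', 'ra']
```

This keeps working even when the authors reorder the columns, because selection is by name rather than by position.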
The answers above are for Python 2, so for Python 3 users I am giving this answer. You can use the code below:
import pandas as pd
fields = ['star_name', 'ra']
df = pd.read_csv('data.csv', skipinitialspace=True, usecols=fields)
# See the keys
print(df.keys())
# See content in 'star_name'
print(df.star_name)
I got a solution to the above problem in a different way: although I read the entire csv file, I tweak the display part to show only the desired content.
import pandas as pd
df = pd.read_csv('data.csv', skipinitialspace=True)
print df[['star_name', 'ra']]
This one could help in some scenarios when learning the basics and filtering data on the basis of DataFrame columns.
I think you need to try this method.
import pandas as pd
data_df = pd.read_csv('data.csv')
print(data_df['star_name'])
print(data_df['ra'])
