Column index number by cell value in python - python

I am new to python and I am having a hard time with this issue and i need your help.
Q1 Q2 Q3 Q4 Q5
25 9 57 23 7
61 41 29 5 57
54 34 58 10 7
13 13 63 26 45
31 71 40 40 40
24 38 63 63 47
31 50 43 2 61
68 33 13 9 63
28 1 30 39 71
I have an excel report with the data above. I'd like to write a code that looks through all columns in the 1st row and output the index number of the column with S in the column value (i.e., 3). I want to use the index number to extract data for that column. I do not want to use row and cell reference as the excel file gets updated regularly, thus d column will always move.
def find_idx():
wb = xlrd.open_workbook(filename='data.xlsx') # open report
report_sheet1 = wb.sheet_by_name('Sheet 1')
for j in range(report_sheet1.ncols):
j=report_sheet1.cell_value(0, j)
if 'YTD' in j:
break
return j.index('Q4')
find_idx()
the i get "substring not found" erro
What i want is to return the column index number (i.e, 3), so that i can call it easily in another code. How can i fix this?

Hass!
As far as I understood, you want to get the index of a column of an excel file whose name contains a given substring such as Y. Is that right?
If so, here's a working snippet that does not requires pandas:
import xlrd
def find_idx(excel_filename, sheet_name, col_name_lookup):
"""
Returns the column index of the first column that
its name contains the string col_name_lookup. If
the col_name_lookup is not found, it returns -1.
"""
wb = xlrd.open_workbook(filename=excel_filename)
report_sheet1 = wb.sheet_by_name(sheet_name)
for col_ix in range(report_sheet1.ncols):
col_name = report_sheet1.cell_value(0, col_ix)
if col_name_lookup in col_name:
return col_ix
return -1
if __name__ == "__main__":
excel_filename = "./data.xlsx"
sheet_name = "Sheet 1"
col_name_lookup = "S"
print(find_idx(excel_filename, sheet_name, col_name_lookup))
I tried to give more semantic names to your variables (I transformed your variable j into two other variables: col_ix (actual column index of the loop) and also the variable col_name which really stands for the column name.
This code assumes that the first line of your excel file contains the column names, and if your desired substring to be looked in each of these names is not found, it returns -1.

Related

How to display truncated data points in a dataframe and csv files [duplicate]

I work with Series and DataFrames on the terminal a lot. The default __repr__ for a Series returns a reduced sample, with some head and tail values, but the rest missing.
Is there a builtin way to pretty-print the entire Series / DataFrame? Ideally, it would support proper alignment, perhaps borders between columns, and maybe even color-coding for the different columns.
You can also use the option_context, with one or more options:
with pd.option_context('display.max_rows', None, 'display.max_columns', None): # more options can be specified also
print(df)
This will automatically return the options to their previous values.
If you are working on jupyter-notebook, using display(df) instead of print(df) will use jupyter rich display logic (like so).
No need to hack settings. There is a simple way:
print(df.to_string())
Sure, if this comes up a lot, make a function like this one. You can even configure it to load every time you start IPython: https://ipython.org/ipython-doc/1/config/overview.html
def print_full(x):
pd.set_option('display.max_rows', len(x))
print(x)
pd.reset_option('display.max_rows')
As for coloring, getting too elaborate with colors sounds counterproductive to me, but I agree something like bootstrap's .table-striped would be nice. You could always create an issue to suggest this feature.
After importing pandas, as an alternative to using the context manager, set such options for displaying entire dataframes:
pd.set_option('display.max_columns', None) # or 1000
pd.set_option('display.max_rows', None) # or 1000
pd.set_option('display.max_colwidth', None) # or 199
For full list of useful options, see:
pd.describe_option('display')
Use the tabulate package:
pip install tabulate
And consider the following example usage:
import pandas as pd
from io import StringIO
from tabulate import tabulate
c = """Chromosome Start End
chr1 3 6
chr1 5 7
chr1 8 9"""
df = pd.read_table(StringIO(c), sep="\s+", header=0)
print(tabulate(df, headers='keys', tablefmt='psql'))
+----+--------------+---------+-------+
| | Chromosome | Start | End |
|----+--------------+---------+-------|
| 0 | chr1 | 3 | 6 |
| 1 | chr1 | 5 | 7 |
| 2 | chr1 | 8 | 9 |
+----+--------------+---------+-------+
Using pd.options.display
This answer is a variation of the prior answer by lucidyan. It makes the code more readable by avoiding the use of set_option.
After importing pandas, as an alternative to using the context manager, set such options for displaying large dataframes:
def set_pandas_display_options() -> None:
"""Set pandas display options."""
# Ref: https://stackoverflow.com/a/52432757/
display = pd.options.display
display.max_columns = 1000
display.max_rows = 1000
display.max_colwidth = 199
display.width = 1000
# display.precision = 2 # set as needed
set_pandas_display_options()
After this, you can use either display(df) or just df if using a notebook, otherwise print(df).
Using to_string
Pandas 0.25.3 does have DataFrame.to_string and Series.to_string methods which accept formatting options.
Using to_markdown
If what you need is markdown output, Pandas 1.0.0 has DataFrame.to_markdown and Series.to_markdown methods.
Using to_html
If what you need is HTML output, Pandas 0.25.3 does have a DataFrame.to_html method but not a Series.to_html. Note that a Series can be converted to a DataFrame.
If you are using Ipython Notebook (Jupyter). You can use HTML
from IPython.core.display import HTML
display(HTML(df.to_html()))
Try this
pd.set_option('display.height',1000)
pd.set_option('display.max_rows',500)
pd.set_option('display.max_columns',500)
pd.set_option('display.width',1000)
datascroller was created in part to solve this problem.
pip install datascroller
It loads the dataframe into a terminal view you can "scroll" with your mouse or arrow keys, kind of like an Excel workbook at the terminal that supports querying, highlighting, etc.
import pandas as pd
from datascroller import scroll
# Call `scroll` with a Pandas DataFrame as the sole argument:
my_df = pd.read_csv('<path to your csv>')
scroll(my_df)
Disclosure: I am one of the authors of datascroller
Scripts
Nobody has proposed this simple plain-text solution:
from pprint import pprint
pprint(s.to_dict())
which produces results like the following:
{'% Diabetes': 0.06365372374283895,
'% Obesity': 0.06365372374283895,
'% Bachelors': 0.0,
'% Poverty': 0.09548058561425843,
'% Driving Deaths': 1.1775938892425206,
'% Excessive Drinking': 0.06365372374283895}
Jupyter Notebooks
Additionally, when using Jupyter notebooks, this is a great solution.
Note: pd.Series() has no .to_html() so it must be converted to pd.DataFrame()
from IPython.display import display, HTML
display(HTML(s.to_frame().to_html()))
which produces results like the following:
You can set expand_frame_repr to False:
display.expand_frame_repr : boolean
Whether to print out the full DataFrame repr for wide DataFrames
across multiple lines, max_columns is still respected, but the output
will wrap-around across multiple “pages” if its width exceeds
display.width.
[default: True]
pd.set_option('expand_frame_repr', False)
For more details read How to Pretty-Print Pandas DataFrames and Series
this link can help ypu
hi my friend just run this
pd.set_option("display.max_rows", None, "display.max_columns", None)
print(df)
just do this
Output
Column
0 row 0
1 row 1
2 row 2
3 row 3
4 row 4
5 row 5
6 row 6
7 row 7
8 row 8
9 row 9
10 row 10
11 row 11
12 row 12
13 row 13
14 row 14
15 row 15
16 row 16
17 row 17
18 row 18
19 row 19
20 row 20
21 row 21
22 row 22
23 row 23
24 row 24
25 row 25
26 row 26
27 row 27
28 row 28
29 row 29
30 row 30
31 row 31
32 row 32
33 row 33
34 row 34
35 row 35
36 row 36
37 row 37
38 row 38
39 row 39
40 row 40
41 row 41
42 row 42
43 row 43
44 row 44
45 row 45
46 row 46
47 row 47
48 row 48
49 row 49
50 row 50
51 row 51
52 row 52
53 row 53
54 row 54
55 row 55
56 row 56
57 row 57
58 row 58
59 row 59
60 row 60
61 row 61
62 row 62
63 row 63
64 row 64
65 row 65
66 row 66
67 row 67
68 row 68
69 row 69
You can achieve this using below method. just pass the total no. of columns present in the DataFrame as arg to
'display.max_columns'
For eg :
df= DataFrame(..)
with pd.option_context('display.max_rows', None, 'display.max_columns', df.shape[1]):
print(df)
Try using display() function. This would automatically use Horizontal and vertical scroll bars and with this you can display different datasets easily instead of using print().
display(dataframe)
display() supports proper alignment also.
However if you want to make the dataset more beautiful you can check pd.option_context(). It has lot of options to clearly show the dataframe.
Note - I am using Jupyter Notebooks.

How to obtain the first 4 rows for every 20 rows from a CSV file

I've Read the CVS file using pandas and have managed to print the 1st, 2nd, 3rd and 4th row for every 20 rows using .iloc.
Prem_results = pd.read_csv("../data sets analysis/prem/result.csv")
Prem_results.iloc[:320:20,:]
Prem_results.iloc[1:320:20,:]
Prem_results.iloc[2:320:20,:]
Prem_results.iloc[3:320:20,:]
Is there a way using iloc to print the 1st 4 rows of every 20 lines together rather then seperately like I do now? Apologies if this is worded badly fairly new to both python and using pandas.
Using groupby.head:
Prem_results.groupby(np.arange(len(Prem_results)) // 20).head(4)
You can concat slices together like this:
pd.concat([df[i::20] for i in range(4)]).sort_index()
MCVE:
df = pd.DataFrame({'col1':np.arange(1000)})
pd.concat([df[i::20] for i in range(4)]).sort_index().head(20)
Output:
col1
0 0
1 1
2 2
3 3
20 20
21 21
22 22
23 23
40 40
41 41
42 42
43 43
60 60
61 61
62 62
63 63
80 80
81 81
82 82
83 83
Start at 0 get every 20 rows
Start at 1 get every 20 rows
Start at 2 get every 20 rows
And, start at 3 get every 20 rows.
You can also do this while reading the csv itself.
df = pd.DataFrame()
for chunk in pd.read_csv(file_name, chunksize = 20):
df = pd.concat((df, chunk.head(4)))
More resources:
You can read more about the usage of chunksize in Pandas official documentation here.
I also have a post about its usage here.

Select middle value for nth rows in Python

I am creating new dataframe which should contain an only middle value (not Median!!) for every nth rows, however my code doesn't work!
I've tried several approaches through pandas or simple Python but I always fail.
value date index
14 40 1983-07-15 14
15 86 1983-07-16 15
16 12 1983-07-17 16
17 78 1983-07-18 17
18 69 1983-07-19 18
19 78 1983-07-20 19
20 45 1983-07-21 20
21 47 1983-07-22 21
22 48 1983-07-23 22
23 ..... ......... ..
RSDF5 = RSDF4.groupby(pd.Grouper(freq='15D', key='DATE')).[int(len(RSDF5)//2)].reset_index()
I know that the code is wrong and I am completely out of ideas!
SyntaxError: invalid syntax
A solution based on indexes.
df is your original dataframe, N is the number of rows you want to group (assumed to be ad odd number, so there is a unique middle row).
df2 = df.groupby(np.arange(len(df))//N).apply(lambda x : x.iloc[len(x)//2])
Be aware that if the total number or rows is not divisible by N, the last group is shorter (you still get its middle value, though).
If N is an even number, you get the central row closer to the end of the group: for example, if N=6, you get the 4th row of each group of 6 rows.

how to get column number by cell value in python using openpyxl

I am completely new to openpyxl and python and I am having a hard time with this issue and i need your help.
JAN FEB MAR MAR YTD 2019 YTD
25 9 57 23 7
61 41 29 5 57
54 34 58 10 7
13 13 63 26 45
31 71 40 40 40
24 38 63 63 47
31 50 43 2 61
68 33 13 9 63
28 1 30 39 71
I have an excel report with the data above. I'd like to search cells for those that contain a specific string (i.e., YTD) and get the column number for YTD column. I want to use the column number to extract data for that column. I do not want to use row and cell reference as the excel file gets updated regularly, thus d column will always move.
def t_PM(ff_sheet1,start_row):
wb = openpyxl.load_workbook(filename='report') # open report
report_sheet1 = wb.get_sheet_by_name('sheet 1')
col = -1
for j, keyword in enumerate(report_sheet1.values(0)):
if keyword=='YTD':
col = j
break
ff_sheet1.cell(row=insert_col + start_row, column= header['YTD_OT'], value=report_sheet1.cell(row=i + 7, column=col).value)
But then, I get an " 'generator' object is not callable" error. How can i fix this?
Your problem is that report_sheet1.values is a generator so you can't call it with (0). I'm assuming by your code that you don't want to rely that the "YTD" will appear in the first row so you iterate all cells. Do this by:
def find_YTD():
wb = openpyxl.load_workbook(filename='report') # open report
report_sheet1 = wb.get_sheet_by_name('sheet 1')
for col in report_sheet1.iter_cols(values_only=True):
for value in col:
if isinstance(value, str) and 'YTD' in value:
return col
If you are assuming this data will be in the first row, simply do:
for cell in report_sheet1[1]:
if isinstance(value, str) and 'YTD' in cell.value:
return cell.column
openpyxl uses '1-based' line indexing
Read the docs - access many cells

Fetching records and then putting in a new column

I am working with Pandas data frame for one of my projects.
I have a column name Count having integer values in that column.
I have 720 values for each hour i.e 24 * 30 days.
I want to run a loop which can get initially first 24 values from the data frame and put in a new column and then take the next 24 and put in the new column and then so on.
for example:
input:
34
45
76
87
98
34
output:
34 87
45 98
76 34
here it is a row of 6 and I am taking first 3 values and putting them in the first column and the next 3 in the 2nd one.
Can someone please help with writing a code/program for the same. It would be of great help.
Thanks!
You can also try numpy's reshape method performed on pd.Series.values.
s = pd.Series(np.arange(720))
df = pd.DataFrame(s.values.reshape((30,24)).T)
Or split (specify how many arrays you want to split into),
df = pd.DataFrame({"day" + str(i): v for i, v in enumerate(np.split(s.values, 30))})

Categories

Resources