When you print a pandas DataFrame, which calls DataFrame.to_string, it normally inserts a minimum of 2 spaces between the columns. For example, this code
import pandas as pd

df = pd.DataFrame({
    "c1": ("a", "bb", "ccc", "dddd", "eeeeee"),
    "c2": (11, 22, 33, 44, 55),
    "a3235235235": [1, 2, 3, 4, 5]
})
print(df)
outputs
       c1  c2  a3235235235
0       a  11            1
1      bb  22            2
2     ccc  33            3
3    dddd  44            4
4  eeeeee  55            5
which has a minimum of 2 spaces between each column.
I am copying DataFrames printed on the console and pasting them into documents, and I have received feedback that the result is hard to read: people would like more space between the columns.
Is there a standard way to do that?
I see no option in either DataFrame.to_string or pandas.set_option.
I have done a web search and not found an answer. One existing question asks how to remove those 2 spaces, while another asks why sometimes only 1 space appears between columns instead of 2 (I have also seen that bug; I hope someone answers that question).
My hack solution is to define a function that converts a DataFrame's columns to type str and then prepends a string of the specified number of spaces to each element.
This code (added to the code above)
def prependSpacesToColumns(df: pd.DataFrame, n: int = 3):
    spaces = ' ' * n
    # ensure every column name has the leading spaces:
    if isinstance(df.columns, pd.MultiIndex):
        for i in range(df.columns.nlevels):
            levelNew = [spaces + str(s) for s in df.columns.levels[i]]
            # set_levels(inplace=True) was removed in pandas 2.0; reassign instead
            df.columns = df.columns.set_levels(levelNew, level=i)
    else:
        df.columns = spaces + df.columns
    # ensure every element has the leading spaces:
    df = df.astype(str)
    df = spaces + df
    return df
dfSp = prependSpacesToColumns(df, 3)
print(dfSp)
outputs
          c1     c2     a3235235235
0          a     11               1
1         bb     22               2
2        ccc     33               3
3       dddd     44               4
4     eeeeee     55               5
which is the desired effect.
But I think that pandas surely must have some simple built-in standard way to do this. Did I miss it?
Also, the solution needs to handle a DataFrame whose columns are a MultiIndex. To continue the code example, consider this modification:
idx = (("Outer", "Inner1"), ("Outer", "Inner2"), ("Outer", "a3235235235"))
df.columns = pd.MultiIndex.from_tuples(idx)
You can accomplish this through formatters; it takes a bit of code to create the dictionary mapping each column name to its formatting function. Find the max character length in each column or the length of the column header, whichever is greater, add some padding, and then pass a formatting string.
Use partial from functools, as the formatters expect a one-parameter function yet we need to specify a different width for each column.
Sample Data
import pandas as pd
df = pd.DataFrame({"c1": ("a", "bb", "ccc", "dddd", 'eeeeee'),
"c2": (1, 22, 33, 44, 55),
"a3235235235": [1,2,3,4,5]})
Code
from functools import partial

# Formatting string
def get_fmt_str(x, fill):
    return '{message: >{fill}}'.format(message=x, fill=fill)

# Max character length per column
s = df.astype(str).agg(lambda x: x.str.len()).max()

pad = 6  # how many spaces between columns
fmts = {}
for idx, c_len in s.items():  # Series.iteritems was removed in pandas 2.0; use items
    # Deal with MultiIndex tuples or simple string labels.
    if isinstance(idx, tuple):
        lab_len = max(len(str(x)) for x in idx)
    else:
        lab_len = len(str(idx))
    fill = max(lab_len, c_len) + pad - 1
    fmts[idx] = partial(get_fmt_str, fill=fill)

print(df.to_string(formatters=fmts))
            c1       c2       a3235235235
0            a       11                 1
1           bb       22                 2
2          ccc       33                 3
3         dddd       44                 4
4       eeeeee       55                 5
# MultiIndex Output
         Outer
        Inner1       Inner2      a3235235235
0            a           11                 1
1           bb           22                 2
2          ccc           33                 3
3         dddd           44                 4
4       eeeeee           55                 5
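As an aside, if a fixed minimum column width (rather than a fixed gap between columns) is good enough, to_string also accepts a col_space argument; a minimal sketch, reusing the df from the sample data above:

# col_space pads every column to at least this many characters
print(df.to_string(col_space=15))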
Related
I have a column in a pandas df that has this format: "1_A01_1_1_NA". I want to extract the text that is between the underscores, e.g. "A01", "1", "1" and "NA". I tried to use left, right and mid, but the problem is that at some point the column value changes into something like 11_B40_11_8_NA.
PS: the df has 7510 rows.
Use str.split:
df = pd.DataFrame({'Col1': ['1_A01_1_1_NA', '11_B40_11_8_NA']})
out = df['Col1'].str.split('_', expand=True)
Output:
>>> out
    0    1   2  3   4
0   1  A01   1  1  NA
1  11  B40  11  8  NA
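If you then want meaningful headers instead of 0-4, you can rename the pieces and join them back onto the original frame; the names below are made up for illustration:

out.columns = ['id1', 'code', 'id2', 'id3', 'flag']  # hypothetical names
df = df.join(out)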
The function you are looking for is pandas.Series.str.split().
You should be able to take your nasty column as a Series and use the str.split("_", expand=True) method. The expand keyword is exactly what you need to make new columns out of the results (it splits on the "_" character, not at any fixed position).
So, something like this:
First we need to create a little bit of nonsense like yours.
(Please forgive my messy and meandering code, I'm still new)
import pandas as pd
from random import choice
import string

# Creating Nonsense Data Frame
def make_nonsense_codes():
    """
    Returns a string of nonsense like '11_B40_11_8_NA'
    """
    nonsense = "_".join([
        "".join(choice(string.digits) for i in range(2)),
        "".join([
            choice(string.ascii_uppercase),
            "".join([choice(string.digits) for i in range(2)]),
        ]),
        "".join(choice(string.digits) for i in range(2)),
        choice(string.digits),
        "NA",
    ])
    return nonsense

my_nonsense_df = pd.DataFrame(
    {"Nonsense": [make_nonsense_codes() for i in range(5)]}
)
print(my_nonsense_df)
#          Nonsense
# 0  25_S91_13_1_NA
# 1  80_O54_58_4_NA
# 2  01_N98_68_3_NA
# 3  88_B37_14_9_NA
# 4  62_N65_73_7_NA
Now we can select our "Nonsense" column, and use str.split().
# Wrangling the nonsense column with series.str.split()
wrangled_nonsense_df = my_nonsense_df["Nonsense"].str.split("_", expand = True)
print(wrangled_nonsense_df)
#     0    1   2  3   4
# 0  25  S91  13  1  NA
# 1  80  O54  58  4  NA
# 2  01  N98  68  3  NA
# 3  88  B37  14  9  NA
# 4  62  N65  73  7  NA
I have the following table:
and for each cell, I'd like to obtain the number of values different from 0.
As an example for the first 2 rows:
denovoLocus10 9 C 0 1 0
denovoLocus12 7 G 3 3 4
After creating a simple test data frame, as the data itself is in a screenshot rather than something copyable:
df = pd.DataFrame({'A': ['0/0/0/0', '0/245/42/0']})
Just extract all the integers as strings using a regular expression and replace every '0' string with np.nan. Then count within each original-index group (note that count excludes NaN automatically):
>>> df['A_count'] = df['A'].str.extractall(r'(\d+)').replace('0', np.nan) \
... .groupby(level=0).count()
>>> df
            A  A_count
0     0/0/0/0        0
1  0/245/42/0        2
If you want to do this for multiple columns, filter the columns and loop over them with a for loop. (This could also be done with an apply over those columns.) E.g.:
for c in df.filter(regex=r'RA\d{2}_R1_2'):
    df[c + '_count'] = df[c].str.extractall(r'(\d+)').replace('0', np.nan) \
                            .groupby(level=0).count()
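If the regex feels heavy, a plain str.split works too, assuming every cell is a '/'-separated string; a minimal sketch:

# split each cell on '/' and count the pieces that are not '0'
df['A_count'] = df['A'].str.split('/').apply(lambda parts: sum(p != '0' for p in parts))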
Here is how I would do it in R.
# load packages
library(tidyverse)

# here is the data you gave us
test_data <- tibble(Tag = paste0("denovoLocus", c(10, 12, 14, 16, 17)),
                    Locus = c(9, 7, 37, 5, 4),
                    ref = c("C", "G", "C", "T", "C"),
                    RA02_R1_2 = c("0/0/0/0", "22/0/262/1", "0/0/0/0", "0/0/0/0", "0/7/0/0"),
                    RA03_R1_2 = c("0/223/0/0", "22/0/989/15", "0/5/0/0", "0/0/0/0", "0/42/0/0"),
                    RA06_R1_2 = c("0/0/0/0", "25/3/791/3", "0/4/0/0", "0/0/0/8", "0/31/0/3"))

# split each element, count the entries that do not equal zero, then collapse the columns
test_data %>%
  mutate(across(RA02_R1_2:RA06_R1_2,
                ~map_dbl(., ~str_split(.x, pattern = "/") %>%
                           map_dbl(., ~sum(.x != "0"))))) %>%
  unite(col = "final", everything(), sep = " ")
#> # A tibble: 5 x 1
#> final
#> <chr>
#> 1 denovoLocus10 9 C 0 1 0
#> 2 denovoLocus12 7 G 3 3 4
#> 3 denovoLocus14 37 C 0 1 1
#> 4 denovoLocus16 5 T 0 0 1
#> 5 denovoLocus17 4 C 1 1 2
First, using across, I transform the "/"-separated columns. Within each element I split on "/" using str_split, then count the entries that do not equal zero (sum(.x != "0")). It is a little complicated because splitting produces a list, so you need to map over the list to pull the values out. Lastly, we use unite to collapse all the columns into the string format that you wanted.
I have a data frame that looks like this:
Numbers Names
0 A
1 A
2 B
3 B
4 C
5 C
6 C
8 D
10 D
And my numbers (integers) need to be sequential IF the value in the column "Names" is the same for both numbers: so for example, between 6 and 8 the numbers are not sequential, but that is fine since the column "Names" changes from C to D. However, between 8 and 10 this is a problem, since both rows have the same "Names" value but are not sequential.
I would like to do a code that returns the numbers missing that need to be added according to the logic explained above.
import itertools as it
import pandas as pd

df = pd.read_excel("booki.xlsx")
c1 = df['Numbers'].copy()
c2 = df['Names'].copy()

for i in it.chain(range(1, len(c2) - 1), range(1, len(c1) - 1)):
    b = c2[i]
    c = c2[i + 1]
    x = c1[i]
    n = c1[i + 1]
    if c == b and n - x > 1:
        print(x + 1)
It prints the numbers that are missing but two times, so for the data frame in the example it would print:
9
9
but I would like to print only:
9
Perhaps it's some failure in the logic?
Thank you
You can use groupby('Names') and then shift to get the differences between consecutive elements within each group, then pick only the rows whose difference is not -1 and print the number that follows them.
Try this:
import pandas as pd
import numpy as np
from io import StringIO

df = pd.read_csv(StringIO("""
Numbers Names
0       A
1       A
2       B
3       B
4       C
5       C
6       C
8       D
10      D"""), sep=r"\s+")

differences = (df.groupby('Names', as_index=False)
                 .apply(lambda g: g['Numbers'] - g['Numbers'].shift(-1))
                 .fillna(-1).reset_index())
missing_numbers = (df[differences != -1]['Numbers'].dropna() + 1).tolist()
print(missing_numbers)
Output:
[9.0]
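For comparison, groupby().diff() gives a slightly more direct route to the same result; a small sketch, assuming the frame is sorted by Numbers within each group as shown:

# gap to the previous row within each Names group (NaN at each group's first row)
gaps = df.groupby('Names')['Numbers'].diff()
for cur, gap in zip(df['Numbers'], gaps):
    if gap > 1:  # NaN compares False, so group starts are skipped
        for n in range(int(cur - gap) + 1, int(cur)):
            print(n)  # prints 9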
I'm not sure itertools is needed here. Here is one solution using only pandas methods:
Group the data by the Names column using groupby
Select the min and max of the Numbers column
Define an integer range from min to max
Merge this range with the sub-dataframe
Filter the rows with missing values using isna
Return the filtered df
Optional: move Names back to a column for prettier output with reset_index
Here is the code:
df = pd.DataFrame({"Numbers": [0, 1, 2, 3, 4, 5, 6, 8, 10, 15],
"Names": ["A", "A", "B", "B", "C", "C", "C", "D", "D", "D"]})
def select_missing(df):
# Select min and max values
min_ = df.Numbers.min()
max_ = df.Numbers.max()
# Create integer range
serie = pd.DataFrame({"Numbers": [i for i in range(min_, max_ + 1)]})
# Merge with df
m = serie.merge(df, on=['Numbers'], how='left')
# Return rows not matching the equality
return m[m.isna().any(axis=1)]
# Group the data per Names and apply "select_missing" function
out = df.groupby("Names").apply(select_missing)
print(out)
#          Numbers  Names
# Names
# D     1        9    NaN
#       3       11    NaN
#       4       12    NaN
#       5       13    NaN
#       6       14    NaN
out = out[["Numbers"]].reset_index(level=0)
print(out)
#    Names  Numbers
# 1      D        9
# 3      D       11
# 4      D       12
# 5      D       13
# 6      D       14
I have a dataframe where the row indices and column headings should determine the content of each cell. I'm working with a much larger version of the following df:
df = pd.DataFrame(index=['afghijklde', 'afghijklmde', 'ade', 'afghilmde', 'amde'],
                  columns=['ae', 'azde', 'afgle', 'arlde', 'afghijklbcmde'])
Specifically, I want to apply the custom function edit_distance() or equivalent (see here for function code) which calculates a difference score between two strings. The two inputs are the row and column names. The following works but is extremely slow:
for seq in df.index:
    for seq2 in df.columns:
        df.loc[seq, seq2] = edit_distance(seq, seq2)
This produces the result I want:
             ae  azde  afgle  arlde  afghijklbcmde
afghijklde    8     7      5      6              3
afghijklmde   9     8      6      7              2
ade           1     1      3      2             10
afghilmde     7     6      4      5              4
amde          2     1      3      2              9
What is a better way to do this, perhaps using applymap()? Everything I've tried with applymap(), apply, or df.iterrows() has returned errors of the kind AttributeError: 'float' object has no attribute 'index'. Thanks.
Turns out there's an even better way to do this. onepan's dictionary comprehension answer above is good but returns the df index and columns in random order. Using a nested .apply() accomplishes the same thing at about the same speed and doesn't change the row/column order. The key is to not get hung up on naming the df's rows and columns first and filling in the values second. Instead, do it the other way around, initially treating the future index and columns as standalone pandas Series.
series_rows = pd.Series(['afghijklde', 'afghijklmde', 'ade', 'afghilmde', 'amde'])
series_cols = pd.Series(['ae', 'azde', 'afgle', 'arlde', 'afghijklbcmde'])
df = pd.DataFrame(series_rows.apply(lambda x: series_cols.apply(lambda y: edit_distance(x, y))))
df.index = series_rows
df.columns = series_cols
You could use comprehensions, which speed it up ~4.5x on my PC:
first = ['afghijklde', 'afghijklmde', 'ade', 'afghilmde', 'amde']
second = ['ae', 'azde', 'afgle', 'arlde', 'afghijklbcmde']
pd.DataFrame.from_dict({f:{s:edit_distance(f, s) for s in second} for f in first}, orient='index')
# output
#              ae  azde  afgle  arlde  afghijklbcmde
# ade           1     1      3      2             10
# afghijklde    8     7      5      6              3
# afghijklmde   9     8      6      7              2
# afghilmde     7     6      4      5              4
# amde          2     1      3      2              9
# this matches edit_distance('ae', 'afghijklde') == 8, e.g.
Note: I used this code for edit_distance (first response in your link):
def edit_distance(s1, s2):
    if len(s1) > len(s2):
        s1, s2 = s2, s1
    distances = range(len(s1) + 1)
    for i2, c2 in enumerate(s2):
        distances_ = [i2 + 1]
        for i1, c1 in enumerate(s1):
            if c1 == c2:
                distances_.append(distances[i1])
            else:
                distances_.append(1 + min((distances[i1], distances[i1 + 1], distances_[-1])))
        distances = distances_
    return distances[-1]
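A quick sanity check of this function against the expected output in the question:

print(edit_distance('ade', 'ae'))         # 1
print(edit_distance('afghijklde', 'ae'))  # 8
print(edit_distance('amde', 'azde'))      # 1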
I was hoping you would be able to help me solve a small problem.
I am using a small device that prints out two properties that I save to a file. The device rasters in X and Y direction to form a grid. I am interested in plotting the relative intensity of these two properties as a function of the X and Y dimensions. I record the data in 4 columns that are comma separated (X, Y, property 1, property 2).
The grid is examined in lines, so for each Y value, it will move from X1 to X2 which are separated several millimeters apart. Then it will move to the next line and over again.
I am able to process the data in Python with pandas/numpy, but it doesn't work too well when there are missing rows (which unfortunately does happen).
I have attached a sample of the output (and annotated the problems):
44,11,500,1
45,11,120,2
46,11,320,3
47,11,700,4
New << used as my Y axis separator
44,12,50,5
45,12,100,6
46,12,1500,7
47,12,2500,8
Sometimes, however, a line or a few will be missing, making it impossible to process and plot. Currently I have not been able to fix this automatically and have to do it manually. The bad output looks like this:
44,11,500,1
45,11,120,2
46,11,320,3
47,11,700,4
New << used as my Y axis separator
45,12,100,5 << missing 44,12...
46,12,1500,6
47,12,2500,7
I know the number of lines I expect since I know my range of X and Y.
What would be the best way to deal with this? Currently I manually enter the missing X and Y values and populate properties 1 and 2 with values of 0. This can be time-consuming and I would like to automate it. I have two questions.
Question 1: How can I automatically fill in my missing data with the corresponding values of X and Y and two zeros? This could be obtained from a pre-generated array of X and Y values that correspond to the experimental range.
Question 2: Is there a better way to split the file into separate arrays for plotting (rather than using the 'New' line?) For instance, by having a 'if' function that will output each line between X(start) and X(end) to a separate array? I've tried doing that but with no success.
I've attached my current (crude) code:
import numpy as np
import pandas as pd

df = pd.read_csv('FileName.csv', delimiter=',', skiprows=0)
rows = [-1] + np.where(df['X'] == 'New')[0].tolist() + [len(df.index)]
dff = {}
for i, r in enumerate(rows[:-1]):
    dff[i] = df[r + 1: rows[i + 1]]
maxY = len(dff)
data = []
data2 = []
for yaxes in range(0, maxY):
    data2.append(dff[yaxes].iloc[:, 2])  # .ix was removed from pandas; use .iloc
<data2 is then used for plotting using matplotlib>
To answer my Question 1, I was thinking about using the reindex and reset_index functions, but I haven't managed to make them work.
I would appreciate any suggestions.
Is this what you want?
Q1: fill X using reindex, and the other columns using fillna
Q2: passing each separated chunk to read_csv via StringIO is easier (the code below is written for Python 3)
import numpy as np
import pandas as pd
from io import StringIO

# read the file and split the input on the 'New' separator
with open('temp.csv', 'r') as f:
    chunks = f.read().split('New')

# read each chunk as a separate dataframe, using the first column as index
dfs = [pd.read_csv(StringIO(chunk), header=None, index_col=0) for chunk in chunks]

def pad(df):
    # reindex; you should know the range of x
    df = df.reindex(np.arange(44, 48))
    # pad y backward / forward, assuming y has a single value per chunk
    df[1] = df[1].bfill()
    df[1] = df[1].ffill()
    # pad the remaining columns with zeros
    df = df.fillna(0)
    # revert the index to a column
    return df.reset_index(drop=False)

dfs = [pad(df) for df in dfs]
dfs[0]
#     0   1    2  3
# 0  44  11  500  1
# 1  45  11  120  2
# 2  46  11  320  3
# 3  47  11  700  4
# dfs[1]
#     0   1     2  3
# 0  44  12     0  0
# 1  45  12   100  5
# 2  46  12  1500  6
# 3  47  12  2500  7
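To get from the padded chunks to something plottable, one option is to stack them and pivot into a 2-D grid; a sketch, where the column names are made up:

grid = pd.concat(dfs, ignore_index=True)
grid.columns = ['X', 'Y', 'prop1', 'prop2']  # hypothetical names
# one row per Y, one column per X -- ready for e.g. plt.pcolormesh
heat = grid.pivot(index='Y', columns='X', values='prop1')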
First Question
I've included print statements inside the function (commented out here) to explain how it works:
In [89]:
def replace_missing(df, Ids):
    # check which values are missing
    missing = np.setdiff1d(Ids, df[0])
    if len(missing) > 0:
        missing_df = pd.DataFrame(data=np.zeros((len(missing), 4)))
        #print('---missing df---')
        #print(missing_df)
        missing_df[0] = missing
        #print('---missing df---')
        #print(missing_df)
        missing_df[1].replace(0, df[1].iloc[0], inplace=True)
        #print('---missing df---')
        #print(missing_df)
        df = pd.concat([df, missing_df])
        #print('---final df---')
        #print(df)
    return df
In [91]:
Ids = np.arange(44, 48)
final_df = df1.groupby(df1[1], as_index=False).apply(replace_missing, Ids).reset_index(drop=True)
final_df
Out[91]:
    0   1     2  3
0  44  11   500  1
1  45  11   120  2
2  46  11   320  3
3  47  11   700  4
4  45  12   100  5
5  46  12  1500  6
6  47  12  2500  7
7  44  12     0  0
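Note the filled-in row (44, 12, 0, 0) lands at the end of its group; assuming you want grid order back, sort on the Y column then the X column (1 then 0 here):

final_df = final_df.sort_values(by=[1, 0]).reset_index(drop=True)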
Second question
In [92]:
group = final_df.groupby(final_df[1])
In [99]:
separate = [group.get_group(key) for key in group.groups.keys()]
separate[0]
Out[104]:
    0   1    2  3
0  44  11  500  1
1  45  11  120  2
2  46  11  320  3
3  47  11  700  4
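If it is more convenient to look up chunks by their Y value than by position, a dict keyed on the group label works too:

# separate_by_y[11] is the Y == 11 chunk
separate_by_y = {y: grp for y, grp in final_df.groupby(final_df[1])}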