I have a data frame that looks like this:
Numbers Names
0 A
1 A
2 B
3 B
4 C
5 C
6 C
8 D
10 D
My numbers (integers) need to be sequential IF the value in the column "Names" is the same for both rows. For example, between 6 and 8 the numbers are not sequential, but that is fine since the column "Names" changes from C to D. However, between 8 and 10 there is a problem, since both rows have the same value in "Names" but the numbers are not sequential.
I would like to write code that returns the missing numbers that need to be added according to the logic explained above.
import itertools as it
import pandas as pd
df = pd.read_excel("booki.xlsx")
c1 = df['Numbers'].copy()
c2 = df['Names'].copy()
for i in it.chain(range(1, len(c2)-1), range(1, len(c1)-1)):
    b = c2[i]
    c = c2[i+1]
    x = c1[i]
    n = c1[i+1]
    if c == b and n - x > 1:
        print(x+1)
It prints the missing numbers, but each one twice; for the data frame in the example it prints:
9
9
but I would like to print only:
9
Perhaps it's some failure in the logic?
Thank you
The duplicated output comes from your loop: it.chain(range(1, len(c2)-1), range(1, len(c1)-1)) walks the same index range twice (c1 and c2 have the same length), so every check runs twice. (It also starts at index 1, so the first pair of rows is never checked.) Instead, you can use groupby('Names') and then shift to get the differences between consecutive elements within each group, pick only the pairs whose difference is not -1, and print the number that follows.
try this:
import pandas as pd
import numpy as np
from io import StringIO
df = pd.read_csv(StringIO("""
Numbers Names
0 A
1 A
2 B
3 B
4 C
5 C
6 C
8 D
10 D"""), sep=r"\s+")
# within each group, the difference between an element and the next should be -1
differences = (df.groupby('Names', as_index=False)
                 .apply(lambda g: g['Numbers'] - g['Numbers'].shift(-1))
                 .fillna(-1)
                 .reset_index())
# rows whose difference is not -1 sit just before a gap; the missing number is that row's Number + 1
missing_numbers = (df[differences != -1]['Numbers'].dropna() + 1).tolist()
print(missing_numbers)
Output:
[9.0]
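Note that this reports only the first missing number after each gap. If a gap can span more than one value, here is a minimal sketch that expands every gap within a group (assuming the same df as above, with Numbers sorted within each group):

missing = []
for _, g in df.groupby('Names'):
    nums = g['Numbers'].tolist()
    # collect everything strictly between each consecutive pair
    for prev, nxt in zip(nums, nums[1:]):
        missing.extend(range(prev + 1, nxt))
print(missing)  # [9]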
I'm not sure itertools is needed here. Here is one solution using only pandas methods.
Group the data according to the Names column using groupby
Select the min and max from the Numbers column
Define an integer range from min to max
Merge this range with the sub-dataframe
Filter to the missing values using isna
Return the filtered df
Optional: reindex the columns for prettier output with reset_index
Here is the code:
df = pd.DataFrame({"Numbers": [0, 1, 2, 3, 4, 5, 6, 8, 10, 15],
"Names": ["A", "A", "B", "B", "C", "C", "C", "D", "D", "D"]})
def select_missing(df):
# Select min and max values
min_ = df.Numbers.min()
max_ = df.Numbers.max()
# Create integer range
serie = pd.DataFrame({"Numbers": [i for i in range(min_, max_ + 1)]})
# Merge with df
m = serie.merge(df, on=['Numbers'], how='left')
# Return rows not matching the equality
return m[m.isna().any(axis=1)]
# Group the data per Names and apply "select_missing" function
out = df.groupby("Names").apply(select_missing)
print(out)
# Numbers Names
# Names
# D 1 9 NaN
# 3 11 NaN
# 4 12 NaN
# 5 13 NaN
# 6 14 NaN
out = out[["Numbers"]].reset_index(level=0)
print(out)
# Names Numbers
# 1 D 9
# 3 D 11
# 4 D 12
# 5 D 13
# 6 D 14
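If you only need the missing values per group rather than a merged frame, the same idea fits in a couple of lines; a hedged sketch using a set difference (assuming the same df as above):

out = (df.groupby("Names")["Numbers"]
         .apply(lambda s: sorted(set(range(s.min(), s.max() + 1)) - set(s))))
print(out)
# Names
# A                     []
# B                     []
# C                     []
# D    [9, 11, 12, 13, 14]
# Name: Numbers, dtype: object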
Below, I have a dictionary called 'date_dict'. I want to create a DataFrame in which each key of this dictionary appears in n rows, where n is the key's value. For example, the date '20220107' would appear in 75910 rows. Would this be possible?
{'20220107': 75910,
'20220311': 145012,
'20220318': 214286,
'20220325': 283253,
'20220401': 351874,
'20220408': 419064,
'20220415': 486172,
'20220422': 553377,
'20220429': 620635,
'20220506': 684662,
'20220513': 748368,
'20220114': 823454,
'20220520': 886719,
'20220527': 949469,
'20220121': 1023598,
'20220128': 1096144,
'20220204': 1167590,
'20220211': 1238648,
'20220218': 1310080,
'20220225': 1380681,
'20220304': 1450031}
Maybe this could help.
import pandas as pd
myDict = {'20220107': 3, '20220311': 4, '20220318': 5 }
wrkList = []
for k, v in myDict.items():
    for i in range(v):
        rowList = []
        rowList.append(k)
        wrkList.append(rowList)
df = pd.DataFrame(wrkList)
print(df)
'''
R e s u l t
0
0 20220107
1 20220107
2 20220107
3 20220311
4 20220311
5 20220311
6 20220311
7 20220318
8 20220318
9 20220318
10 20220318
11 20220318
'''
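For counts as large as those in the question, a vectorized sketch using Index.repeat avoids the Python loops; 'date' is just an arbitrary column name, and the small myDict from above is reused for readability:

import pandas as pd

myDict = {'20220107': 3, '20220311': 4, '20220318': 5}
s = pd.Series(myDict)
# repeat each key as many times as its value, then build the DataFrame
df = pd.DataFrame({'date': s.index.repeat(s.values)})
print(df)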
I have a data frame that contains a single column Positive Dispatch,
index Positive Dispatch
0 a,c
1 b
2 a,b
Each keyword has its own value:
a,b,c = 12,22,11
I want to create a new column that contains the max for each row. For example, in the first row there are a and c, and between them a has the bigger value, which is 12, and so on:
Positive Dispatch Max
a,c 12
b 22
a,b 22
My attempt:
import pandas as pd

dic1 = {
    'a': [12, 0, 22],
    'b': [0, 13, 22],
    'c': [12, 0, 0],   # there can be N number of columns here, for example
}                      # 'd': [11, 22, 333]
a, b, c = 12, 22, 11   # d will have its own value, for example d = 33
df = pd.DataFrame(dic1)
df['Positive Dispatch'] = df.gt(0).dot(df.columns + ',').str[:-1]  # creating the Positive Dispatch column
print(df['Positive Dispatch'].max(axis=1))
But this gives the error:
ValueError: No axis named 1 for object type <class 'pandas.core.series.Series'>
IIUC:
Create a dict, then calculate the max according to the keys and values of the dictionary by using split() + max() + map():
d = {'a': a, 'b': b, 'c': c}
df['Max'] = df['Positive Dispatch'].str.split(',').map(lambda x: max([d.get(y) for y in x]))
# for more columns, use applymap() in place of map(); the logic remains the same
OR
If you have more columns like 'Dispatch', then use:
d = {'a': a, 'b': b, 'c': c, 'd': d}
# note: this applies the same max() lambda to both columns; swap in min() where a true minimum is needed
df[['Max', 'Min']] = df[['Positive Dispatch', 'Negative Dispatch']].applymap(lambda x: max([d.get(y) for y in x.split(',')]))
Sample Dataframe used:
dic1 = {
    'a': [12, 0, 22],
    'b': [0, 13, 22],
    'c': [12, 0, 0],   # there can be N number of columns here, for example
    'd': [11, 22, 333]}
a, b, c, d = 12, 22, 11, 33  # d will have its own value, for example d = 33
df = pd.DataFrame(dic1)
df['Positive Dispatch'] = df.gt(0).dot(df.columns + ',').str[:-1]
df['Negative Dispatch'] = [['a,d'], ['c,b,a'], ['d,c']]
df['Negative Dispatch'] = df['Negative Dispatch'].str.join(',')
output:
a b c Positive Dispatch Max
0 12 0 12 a,c 12
1 0 13 0 b 22
2 22 22 0 a,b 22
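Putting the pieces together, a runnable sketch of the map-based approach using just the data from the question (a, b, c as given there):

import pandas as pd

df = pd.DataFrame({'Positive Dispatch': ['a,c', 'b', 'a,b']})
a, b, c = 12, 22, 11
d = {'a': a, 'b': b, 'c': c}
df['Max'] = df['Positive Dispatch'].str.split(',').map(lambda x: max(d[y] for y in x))
print(df)
#   Positive Dispatch  Max
# 0               a,c   12
# 1                 b   22
# 2               a,b   22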
I have the following table (posted as a screenshot), and for each cell I'd like to obtain the number of values different from 0.
As an example for the first 2 rows:
denovoLocus10 9 C 0 1 0
denovoLocus12 7 G 3 3 4
After creating a simple test data frame, as the data itself is in a screenshot rather than something copyable:
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': ['0/0/0/0', '0/245/42/0']})
Just extract all integers as strings using a regular expression, replace all strings '0' with np.nan. Then count, within each original-index-level group (note that count excludes NaN automatically):
>>> df['A_count'] = df['A'].str.extractall(r'(\d+)').replace('0', np.nan) \
... .groupby(level=0).count()
>>> df
A A_count
0 0/0/0/0 0
1 0/245/42/0 2
If you want to do it with multiple columns, filter your columns and loop over them with a for loop. (This could also be done with an apply over those columns.) E.g.:
for c in df.filter(regex=r'RA\d{2}_R1_2'):
    df[c + '_count'] = df[c].str.extractall(r'(\d+)').replace('0', np.nan) \
                            .groupby(level=0).count()
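If the cells are always slash-separated and never missing, a simpler sketch without regular expressions is to split and count directly (same test frame as above):

df['A_count'] = df['A'].str.split('/').map(lambda parts: sum(p != '0' for p in parts))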
Here is how I would do it in R.
#load package
library(tidyverse)
#here is the data you gave us
test_data <- tibble(Tag = paste0("denovoLocus", c(10, 12, 14, 16, 17)),
                    Locus = c(9, 7, 37, 5, 4),
                    ref = c("C", "G", "C", "T", "C"),
                    RA02_R1_2 = c("0/0/0/0", "22/0/262/1", "0/0/0/0", "0/0/0/0", "0/7/0/0"),
                    RA03_R1_2 = c("0/223/0/0", "22/0/989/15", "0/5/0/0", "0/0/0/0", "0/42/0/0"),
                    RA06_R1_2 = c("0/0/0/0", "25/3/791/3", "0/4/0/0", "0/0/0/8", "0/31/0/3"))
#split and count the elements that do not equal zero, then collapse the columns
test_data %>%
  mutate(across(RA02_R1_2:RA06_R1_2, ~map_dbl(., ~str_split(.x, pattern = "/") %>%
                                                map_dbl(., ~sum(.x != "0"))))) %>%
  unite(col = "final", everything(), sep = " ")
#> # A tibble: 5 x 1
#> final
#> <chr>
#> 1 denovoLocus10 9 C 0 1 0
#> 2 denovoLocus12 7 G 3 3 4
#> 3 denovoLocus14 37 C 0 1 1
#> 4 denovoLocus16 5 T 0 0 1
#> 5 denovoLocus17 4 C 1 1 2
First, using across, I summarize the slash-separated columns. I split each element on "/" using str_split, then count the elements that do not equal zero (sum(.x != "0")). It is a little complicated because splitting produces a list, so you need to map over the list to pull the values out. Lastly, unite collapses all the columns into the string format you wanted.
I have scraped a webpage table, and the table items are in a sequential 1D list, with repeated headers. I want to reconstitute the table into a DataFrame.
I have an algorithm to do this, but I'd like to know if there is a more pythonic/efficient way to achieve this? NB. I don't necessarily know how many columns there are in my table. Here's an example:
input = ['A',1,'B',5,'C',9,
'A',2,'B',6,'C',10,
'A',3,'B',7,'C',11,
'A',4,'B',8,'C',12]
output = {}
it = iter(input)
val = next(it)
while val:
    if val in output:
        output[val].append(next(it))
    else:
        output[val] = [next(it)]
    val = next(it, None)
df = pd.DataFrame(output)
print(df)
with the result:
A B C
0 1 5 9
1 2 6 10
2 3 7 11
3 4 8 12
If your data is always "well behaved", then something like this should suffice:
import pandas as pd
data = ['A',1,'B',5,'C',9,
'A',2,'B',6,'C',10,
'A',3,'B',7,'C',11,
'A',4,'B',8,'C',12]
result = {}
# pair up headers (even positions) with values (odd positions)
for k, v in zip(data[::2], data[1::2]):
    result.setdefault(k, []).append(v)
df = pd.DataFrame(result)
You can also use numpy reshape (using data as the input list, as above):
import numpy as np
cols = sorted(set(data[::2]))
df = pd.DataFrame(np.reshape(data, (len(data) // (len(cols) * 2), len(cols) * 2)).T[1::2].T,
                  columns=cols)
A B C
0 1 5 9
1 2 6 10
2 3 7 11
3 4 8 12
Explanation:
# get the column names
cols = sorted(set(data[::2]))
# reshape the flat list into a list of rows
shape = (len(data) // (len(cols) * 2), len(cols) * 2)
np.reshape(data, shape)
# get only the values of the data:
# this transposes the data and slices every second row
.T[1::2].T
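A variant of the same reshape idea that may read more directly, as a sketch; it assumes the headers repeat in the same order in every row:

import numpy as np
import pandas as pd

arr = np.array(data, dtype=object).reshape(-1, 2)   # (header, value) pairs
keys, vals = arr[:, 0], arr[:, 1]
n_cols = len(set(keys))
df = pd.DataFrame(vals.reshape(-1, n_cols), columns=keys[:n_cols])
print(df)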
I have a dataframe with 15 columns named 0,1,2,...,14. I would like to write a method that would take in this data, and a vector of length 15. I would like it to return dataframe conditionally selected based on this vector that I have passed. E.g. the data passed is data_ and the vector passed is v_
I would like it to produce this:
data[(data[0] == v_[0]) & (data[1] == v_[1]) & ... & (data[14] == v_[14])]
However, I would like the method to be flexible, e.g. I could pass in a dataframe of 100 columns named 0, ..., 99 and a vector of length 99. My problem is that I do not know how to cleverly create [(data[0] == v_[0]) & (data[1] == v_[1]) & ... & (data[14] == v_[14])] programmatically to account for the "&" sign. Equally, I would be satisfied if someone gave me a method that could merge multiple NxM matrices filled with True and False on "and" or "or" into a single NxM matrix.
Thank you very much!
You can try this:
def custom_filter(data, v):
    if len(data.columns) == len(v):
        # If data has the same number of columns
        # as v has elements
        mask = (data == v).all(axis=1)
    else:
        # If they have a different length, we'll need to subset
        # the data first, then create our mask.
        # This attempts to subset the dataframe by assuming columns
        # 0 .. len(v) - 1 exist as columns, and will throw an error
        # otherwise
        colnames = list(range(len(v)))
        mask = (data[colnames] == v).all(axis=1)
    return data.loc[mask, :]
df = pd.DataFrame({
    "F": list("hiadsfin"),
    0: list("aaaabbbb"),
    1: list("cccdddee"),
    2: list("ffgghhij"),
    "H": list(range(1, 9))
})
v = ["a", "c", "f"]
df
F 0 1 2 H
0 h a c f 1
1 i a c f 2
2 a a c g 3
3 d a d g 4
4 s b d h 5
5 f b d h 6
6 i b e i 7
7 n b e j 8
custom_filter(df, v)
F 0 1 2 H
0 h a c f 1
1 i a c f 2
Note that with this function, if the number of columns exactly matches the length of your vector v, then you do not need to ensure the columns are labelled 0, 1, 2, ..., len(v)-1. However, if you have more columns than elements of v, you need to ensure that a subset of those columns is labelled 0, 1, 2, ..., len(v)-1. If v is longer than the number of columns in your dataframe, this will throw an error.
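For the side question about merging many boolean masks on "and" or "or", a minimal sketch using functools.reduce (it assumes, as in the question, that the columns to test are named 0 .. len(v_)-1):

import operator
from functools import reduce

masks = [data[i] == v_[i] for i in range(len(v_))]  # one boolean Series per column
combined = reduce(operator.and_, masks)             # swap in operator.or_ for "or"
result = data[combined]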
This might work:
data[(data==v_.transpose())].dropna(axis=1)