Suppose I have a pandas dataframe with following columns:
A , B , C , D , E , F , G , H
I want to select all the columns with a specific interval, say n. For example, if n=2 and I start from A, I would select:
A, B, E, F (select the first two, drop the next two, and so on)
If I start from the end, I would select:
H,G,D,C
I can even start from any random column in between. What would be an efficient way of doing so?
Use compress and cycle, i.e.:
import numpy as np
import pandas as pd
from itertools import compress, cycle

ndf = pd.DataFrame(np.random.randn(2, 6), columns=['A', 'B', 'C', 'D', 'E', 'F'])
ndf[list(compress(ndf.columns, cycle([True] * 2 + [False] * 2)))]
A B E F
0 0.833114 -0.616667 -0.908963 -0.486292
1 1.285927 -0.335325 0.562466 1.218459
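A general version is a small step from there. Below is a minimal sketch of my own (take_skip is a hypothetical helper, not a pandas function) that also handles an arbitrary n, a reversed order, and a start column somewhere in the middle, assuming the pattern is always "take n, skip n":
from itertools import compress, cycle

def take_skip(df, n, start=None, reverse=False):
    # Walk the columns (reversed if asked), optionally beginning at `start`,
    # and keep them in a take-n / skip-n pattern.
    cols = list(df.columns[::-1] if reverse else df.columns)
    if start is not None:
        cols = cols[cols.index(start):]
    mask = cycle([True] * n + [False] * n)  # take n, drop n, repeat
    return df[list(compress(cols, mask))]

take_skip(ndf, 2)                # columns A, B, E, F
take_skip(ndf, 2, reverse=True)  # columns F, E, B, A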
I would write your logic into a function and then iterate through the columns using that function.
For example, to go by n=2, use the list slicing syntax [start:stop:step] to step through df.columns:
df = pd.DataFrame()  # insert creation code here
cols = list(df.columns[::2])
df.loc[:, cols]  # result
Edit: this is wrong per the comment; [::2] takes every other column rather than two at a time.
To skip columns, maybe check the mod of the column position:
[val for num, val in enumerate(df.columns)
 if num % 4 in [0, 1]]
I'm doing this on my phone, so sorry if it's not so nicely formatted.
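To generalize that mod trick for any n (my own sketch, not from the original answer): within every block of 2*n columns, the first n positions are kept, so test num % (2 * n) < n.
import pandas as pd

df = pd.DataFrame(columns=list('ABCDEFGH'))  # assumed sample frame

n = 2
cols = [val for num, val in enumerate(df.columns) if num % (2 * n) < n]
df.loc[:, cols]  # columns A, B, E, F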
I have 2 dataframes:
DF A:
and DF B:
I need to check every row in the DFA['item'] if it contains some of the values in the DFB['original'] and if it does, then add new column in DFA['my'] that would correspond to the value in DFB['my'].
So here is the result I need:
I thought of converting DFB['original'] into a list and then using regex, but that way I won't get the matching result from column 'my'.
Ok, maybe not the best solution, but it seems to be working.
I did a cartesian join and then checked which records contain the data needed:
dfa['join'] = 1
dfb['join'] = 1
dfFull = dfa.merge(dfb, on='join').drop('join', axis=1)
dfFull['match'] = dfFull.apply(lambda x: x.original in x.item, axis=1)
dfFull[dfFull['match']]
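A possible final step, since the question actually wants the matching my value attached back onto DFA (my own continuation, assuming the column names item, original, and my from the question):
# Keep the matching rows and drop the helper columns; what remains is
# DFA's columns plus the corresponding 'my' value. If an item matches
# several originals, drop_duplicates() may be needed on top.
result = dfFull[dfFull['match']].drop(['match', 'original'], axis=1)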
I'm facing a challenge in a python/pandas script.
My data is a gene expression table, which is organized as follow:
Basically, Index 0 contains both conditions studied, while Index 1 has the information about the gene identified between the samples.
I would then like to produce a table with index 0 and index 1 joined together, as follows:
I've tried a lot of things, such as generating a list of index 0 to join to index 1...
Save me, guys, please!
Thank you
Assuming your first row of column names is in row 0 and your second row of column names is in row 1, try this:
df.columns = [f'{c1}.{c2}'.strip('.') for c1, c2 in zip(df.loc[0], df.loc[1])]
df.loc[2:]
The result should then look like the table you described.
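Here is a small self-contained demo of the idea, with made-up values standing in for the table from the question (condition labels sit in row 0, sample labels in row 1, and the data starts at row 2):
import pandas as pd

df = pd.DataFrame([
    ['Condition1', 'Condition1', 'Condition2', 'Condition2'],
    ['Sample 1', 'Sample 2', 'Sample 3', 'Sample 4'],
    [1213, 1353, 0, 0],
])
df.columns = [f'{c1}.{c2}'.strip('.') for c1, c2 in zip(df.loc[0], df.loc[1])]
df = df.loc[2:]
# columns are now 'Condition1.Sample 1', 'Condition1.Sample 2', ...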
According to the OP's comment, I changed the add_suffix function.
Construct the dataframe:
s1 = "Gene name,Description,Foldchange,Anova,Sample 1,Sample 2,Sample 3,Sample 4,Sample 5,Sample 6".split(",")
s2 = "HK1,Hexokinase,Infinity,0.05,1213,1353,14356,0,0,0".split(",")
df = pd.DataFrame(s2).T
df.columns = s1
Define a function (change the function according to different situations):
def add_suffix(x):
    try:
        flag = int(x[-1])
    except ValueError:
        return x
    if flag <= 4:
        return x + '.Condition1'
    else:
        return x + '.Condition2'
And then assign the columns:
cols = df.columns.to_series().apply(add_suffix)
df.columns = cols
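For the sample frame built above, the renamed columns come out as:
print(list(df.columns))
# ['Gene name', 'Description', 'Foldchange', 'Anova',
#  'Sample 1.Condition1', 'Sample 2.Condition1', 'Sample 3.Condition1',
#  'Sample 4.Condition1', 'Sample 5.Condition2', 'Sample 6.Condition2']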
I have the following data frame of the form:
1 2 3 4 5 6 7 8
A C C T G A T C
C A G T T A D N
Y F V H Q A F D
I need to randomly select a column k times where k is the number of columns in the given sample. My program creates a list of empty lists of size k and then randomly selects a column from the dataframe to be appended to the list. Each list must be unique and cannot have duplicates.
From the above example dataframe, an expected output should be something like:
[[2][4][6][1][7][3][5][8]]
However I am obtaining results like:
[[1][1][3][6][7][8][8][2]]
What is the most pythonic way to go about doing this? Here is my sorry attempt:
k = len(df.columns)
k_clusters = [[] for i in range(k)]
for i in range(len(k_clusters)):
    for j in range(i + 1, len(k_clusters)):
        k_clusters[i].append(df.sample(1, axis=1))
        if k_clusters[i] == k_clusters[j]:
            k_clusters[j].pop(0)
            k_clusters[j].append(df.sample(1, axis=1))
Aside from the shuffling step, your question is very similar to How to change the order of DataFrame columns?. Shuffling can be done in any number of ways in Python:
import numpy as np

cols = np.array(df.columns)
np.random.shuffle(cols)
Or using the standard library:
import random

cols = list(df.columns)
random.shuffle(cols)
You do not want to do cols = df.columns.values, because that will give you write access to the underlying column name data. You will then end up shuffling the column names in-place, messing up your dataframe.
Rearranging your columns is then easy:
df = df[cols]
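Putting it together for the frame from the question (a sketch; the resulting order will differ on every run):
import numpy as np
import pandas as pd

df = pd.DataFrame([list('ACCTGATC'),
                   list('CAGTTADN'),
                   list('YFVHQAFD')],
                  columns=range(1, 9))

cols = np.array(df.columns)
np.random.shuffle(cols)
df = df[cols]  # same rows, columns in one random permutation, no repeats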
You can use numpy.random.shuffle to just shuffle the column indexes, since from your question that is what I assume you want to do.
An example:
import numpy as np
to_shuffle = np.array(df.columns)
np.random.shuffle(to_shuffle)
print(to_shuffle)
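To actually apply that order to the frame (my addition, not in the original answer):
df = df[to_shuffle]  # reindex with the shuffled column order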
I have a specific series of datasets which come in the following general form:
import pandas as pd
import random
df = pd.DataFrame({'n': random.sample(range(1000), 3), 't0': ['a', 'b', 'c'], 't1': ['d', 'e', 'f'], 't2': ['g', 'h', 'i'], 't3': ['i', 'j', 'k']})
The number of tn columns (t0, t1, t2 ... tn) varies depending on the dataset, but is always <30.
My aim is to merge the content of the tn columns for each row so that I achieve this result (note that for readability I need to keep the whitespace between elements):
df['result'] = df.t0 +' '+df.t1+' '+df.t2+' '+ df.t3
So far so good. This code may be simple, but it becomes clumsy and inflexible as soon as I receive another dataset where the number of tn columns goes up. This is where my question comes in:
Is there any other syntax to merge the content across multiple columns? Something agnostic to the number of columns, akin to:
df['result'] = ' '.join(df.iloc[:, 1:])
Basically, I want to achieve the same as the OP in the link below, but with whitespace between the strings:
Concatenate row-wise across specific columns of dataframe
The key to operating on columns (Series) of strings en masse is the Series.str accessor.
I can think of two .str methods to do what you want.
str.cat()
The first is str.cat. You have to start from a series, but you can pass a list of series (unfortunately you can't pass a dataframe) to concatenate with an optional separator. Using your example:
column_names = df.columns[1:]  # skipping the first, numeric, column
series_list = [df[c] for c in column_names]
# concatenate:
df['result'] = series_list[0].str.cat(series_list[1:], sep=' ')
Or, in one line:
df['result'] = df[df.columns[1]].str.cat([df[c] for c in df.columns[2:]], sep=' ')
str.join()
The second is the .str.join() method, which works like the standard Python method string.join(), but for which you need to have a column (Series) of iterables, for example a column of tuples. We can get one by applying tuple row-wise to a sub-dataframe of the columns you're interested in:
tuple_series = df[column_names].apply(tuple, axis=1)
df['result'] = tuple_series.str.join(' ')
Or, in one line:
df['result'] = df[df.columns[1:]].apply(tuple, axis=1).str.join(' ')
BTW, don't try the above with list instead of tuple. As of pandas-0.20.1, if the function passed into the DataFrame.apply() method returns a list and the returned list has the same number of entries as the columns of the original (sub)dataframe, DataFrame.apply() returns a DataFrame instead of a Series.
Other than using apply to concatenate the strings, you can also use agg to do so.
df[df.columns[1:]].agg(' '.join, axis=1)
Out[118]:
0 a d g i
1 b e h j
2 c f i k
dtype: object
Here is a slightly alternative solution:
In [57]: df['result'] = df.filter(regex=r'^t').apply(lambda x: x.add(' ')).sum(axis=1).str.strip()
In [58]: df
Out[58]:
n t0 t1 t2 t3 result
0 92 a d g i a d g i
1 916 b e h j b e h j
2 363 c f i k c f i k
I have a file where the separator (delimiter) is ';'. I read that file into a pandas dataframe df. Now, I want to select some rows from df using a criterion from column c in df. The format of data in column c is as follows:
[0]science|time|boot
[1]history|abc|red
and so on...
I have another list of words L, which has values such as
[history, geography,....]
Now, if I split the text in column c on '|', then I want to select those rows from df where the first word does not belong to L.
Therefore, in this example, I will select df[0] but will not choose df[1], since history is present in L and science is not.
I know I can write a for loop and iterate over each object in the dataframe, but I was wondering if I could do something in a more compact and efficient way.
For example, we can do:
df.loc[df['column_name'].isin(some_values)]
I have this:
df = pd.read_csv(path, sep=';', header=None, error_bad_lines=False, warn_bad_lines=False)
dat = df.iloc[:, c].str.split('|')
But, I do not know how to index 'dat'. 'dat' is a Pandas Series, as follows:
0 [science, time, boot]
1 [history, abc, red]
....
I tried indexing dat as follows:
dat.iloc[:][0]
But, it gives the entire series instead of just the first element.
Any help would be appreciated.
Thank you in advance.
Here is an approach:
Data
df = pd.DataFrame({'c':['history|science','science|chemistry','geography|science','biology|IT'],'col2':range(4)})
Out[433]:
c col2
0 history|science 0
1 science|chemistry 1
2 geography|science 2
3 biology|IT 3
lst = ['geography', 'biology','IT']
Resolution
You can use a list comprehension:
df.loc[pd.Series([x.split('|')[0] not in lst for x in df.c.tolist()])]
Out[444]:
c col2
0 history|science 0
1 science|chemistry 1
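An alternative that stays vectorized, using the same .str accessor idea from earlier (my own variant, not part of the original answer): split each value on '|', take the first token, and negate an isin() test.
mask = ~df['c'].str.split('|').str[0].isin(lst)
df.loc[mask]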