How can I add a column to a pandas dataframe with the values 'A', 'B', 'C', 'A', 'B', etc., i.e. ABC repeating down the rows? I also need to vary the letter assigned to the first row (i.e. it could start ABCAB..., BCABC... or CABCA...).
I can get as far as:
df.index % 3
which gets me the index as 0,1,2 etc, but I cannot see how to get that into a column with A, B, C.
Many thanks,
Julian
If I've understood your question correctly, you can create a list of the letters as follows, and then add that to your dataframe:
from itertools import cycle
from random import randint
letter_generator = cycle('ABC')
offset = randint(0, 2)
dataframe_length = 10 # or just use len(your_dataframe) to avoid hardcoding it
column = [next(letter_generator) for _ in range(dataframe_length + offset)]
column = column[offset:]
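A runnable sketch of the idea above (the 7-row frame and the column name `letter` are just placeholders, not from the question):

```python
from itertools import cycle

import pandas as pd

df = pd.DataFrame({'x': range(7)})
letters = cycle('ABC')
offset = 2  # 0 starts the column at 'A', 1 at 'B', 2 at 'C'

# generate `offset` extra letters, then drop them from the front
df['letter'] = [next(letters) for _ in range(len(df) + offset)][offset:]
print(df['letter'].tolist())  # ['C', 'A', 'B', 'C', 'A', 'B', 'C']
```

The slice works because dropping `offset` letters from an ABC... stream is the same as starting the cycle `offset` letters later.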
What I would do:
df['col'] = (df.index % 3).map({0: 'A', 1: 'B', 2: 'C'})
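Building on the map approach, a sketch of how the starting letter could be varied as the question asks (the `start` variable and the toy frame are mine, not from the question):

```python
import pandas as pd

df = pd.DataFrame({'x': range(7)})
start = 1  # 0 -> ABCABC..., 1 -> BCABCA..., 2 -> CABCAB...
# shift the modulo result by `start` before mapping to letters
df['col'] = ((df.index + start) % 3).map({0: 'A', 1: 'B', 2: 'C'})
print(df['col'].tolist())  # ['B', 'C', 'A', 'B', 'C', 'A', 'B']
```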
Related
I have a dataframe:
import pandas as pd
df = pd.DataFrame({'val': ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h']})
that I would like to slice into two new dataframes such that the first contains every nth value, while the second contains the remaining values not in the first.
For example, in the case of n=3, the second dataframe would keep two values from the original dataframe, skip one, keep two, skip one, etc. This slice is illustrated in the following image where the original dataframe values are blue, and these are split into a green set and a red set:
I have achieved this successfully using a combination of iloc and isin:
df1 = df.iloc[::3]
df2 = df[~df.val.isin(df1.val)]
but what I would like to know is:
Is this the most Pythonic way to achieve this? It seems inefficient and not particularly elegant to take what I want out of a dataframe, then get the rest by checking what is not in the new dataframe. Instead, is there an iloc expression, like the one used to generate df1, which could do the second part of the slicing and replace the isin line? Even better, is there a single expression that could execute the entire two-step slice in one step?
Use modulo 3 and compare for inequality with 0 - the positions that compare unequal are exactly the rows not in the first slice:
# for a default RangeIndex
df2 = df[df.index % 3 != 0]

# for any Index
import numpy as np
df2 = df[np.arange(len(df)) % 3 != 0]
print(df2)
val
1 b
2 c
4 e
5 f
7 h
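To address the "single expression" part of the question: a boolean mask can be built once and used for both halves - a sketch on the question's 8-row frame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'val': ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h']})

# True at every 3rd position; one mask drives both slices
mask = np.arange(len(df)) % 3 == 0
df1, df2 = df[mask], df[~mask]
print(df1['val'].tolist())  # ['a', 'd', 'g']
print(df2['val'].tolist())  # ['b', 'c', 'e', 'f', 'h']
```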
Suppose I have a pandas dataframe with following columns:
A , B , C , D , E , F , G , H
I want to select all the columns with a specific interval, say n. For example, if n=2 and I start from A, I would select:
A, B, E, F (select the first two, drop the next two, and so on)
If I start from the end, I would select:
H,G,D,C
I can even start from any random column in between. What would be an efficient way of doing so?
Use itertools.compress and cycle, i.e.
from itertools import compress, cycle
import numpy as np
import pandas as pd

ndf = pd.DataFrame(np.random.randn(2, 6), columns=['A', 'B', 'C', 'D', 'E', 'F'])
ndf[list(compress(ndf.columns, cycle([True] * 2 + [False] * 2)))]
A B E F
0 0.833114 -0.616667 -0.908963 -0.486292
1 1.285927 -0.335325 0.562466 1.218459
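The same compress/cycle pattern can start from the end or from an arbitrary column simply by changing the sequence it is applied to - a sketch, assuming the 8-column A..H frame from the question:

```python
from itertools import compress, cycle

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(2, 8), columns=list('ABCDEFGH'))
n = 2

# start from the end: apply the keep-2/drop-2 pattern to the reversed columns
rev_cols = list(compress(reversed(df.columns), cycle([True] * n + [False] * n)))
print(rev_cols)  # ['H', 'G', 'D', 'C']

# start from an arbitrary column, e.g. 'C'
start = df.columns.get_loc('C')
mid_cols = list(compress(df.columns[start:], cycle([True] * n + [False] * n)))
print(mid_cols)  # ['C', 'D', 'G', 'H']
```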
I would write your logic into a function and then iterate through the columns using that function.
For example, to go by n=2, use the slicing syntax [start:stop:step] on df.columns:
df = pd.DataFrame()  # insert creation code here
cols = list(df.columns[::2])
df.loc[:, cols]  # result
Edit: this is wrong per the comment - [::2] takes every other column, not pairs. To keep two and skip two, check the mod of the column position instead:
cols = [val for num, val in enumerate(df.columns) if num % 4 in (0, 1)]
I'm doing this on my phone, so sorry if it's not so nicely formatted.
I have a dataframe df with multiple time series variables, say 'A', 'B', 'C', etc.
They have date as the index. How can I create 3, 6 and 12 month lagged versions in a loop? I guess I could type it out manually for each variable as below, but I was hoping there is a more efficient way to do it. Thanks.
df['A_3'] = df['A'].shift(3)
df['A_6'] = df['A'].shift(6)
df['A_12'] = df['A'].shift(12)
df['B_3'] = df['B'].shift(3)
df['B_6'] = df['B'].shift(6)
df['B_12'] = df['B'].shift(12)
Try this:
lag = [3, 6, 12]
for col in df.columns:
    for l in lag:
        df.loc[:, col + "_" + str(l)] = df[col].shift(l)
You can also use itertools.product, i.e.
from itertools import product

lags = [3, 6, 12]
for col, lag in product(df.columns, lags):
    df[col + '_' + str(lag)] = df[col].shift(lag)
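A sketch of a variant that builds all lagged columns in a dict first and joins them in one go (the toy frame is mine); this avoids growing the frame one column at a time:

```python
from itertools import product

import pandas as pd

df = pd.DataFrame({'A': range(24), 'B': range(24)})
lags = [3, 6, 12]

# build every lagged series up front, then attach them all at once
lagged = {'{}_{}'.format(col, lag): df[col].shift(lag)
          for col, lag in product(df.columns, lags)}
df = df.join(pd.DataFrame(lagged))
print(sorted(c for c in df.columns if '_' in c))
# ['A_12', 'A_3', 'A_6', 'B_12', 'B_3', 'B_6']
```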
I am currently in the process of automating SQL scripts using a csv file and the pandas module, where the WHERE condition is based on the values present in my csv file.
A sample csv file would be as below.
First Last
X A
Y B
Z C
I want a new dataframe which should look like this (with the new column added):
First Last condition
X A First='X' and Last='A'
Y B First='Y' and Last='B'
Z C First='Z' and Last='C'
so I can use the third column in my SQL WHERE condition.
Note:
I can achieve this with the method below, but I cannot use it because my column names are not static; I will be using this on multiple csv/df's which will have different column names, and the number of columns might be more than 2.
df['condition'] = 'First=\'' + df['First'] +'\' And ' + 'Last=\'' + df['Last'] +'\''
If I resolve the 'condition' column, then my final SQL should look like this:
Select First, Last from mydb.customers
where
(First='X' and Last='A') or
(First='Y' and Last='B') or
(First='Z' and Last='C')
Thanks
You can use apply with axis=1 to execute a function on every row - the function receives the whole row, so it has access to all the information in it: the column names and the values.
import pandas as pd
df = pd.DataFrame({
'First': ['X', 'Y', 'Z'],
'Second': ['1', '2', '3'],
'Last': ['A', 'B', 'C'],
})
print(df)
def concatenate(row):
    parts = []
    for name, value in row.items():
        parts.append("{}='{}'".format(name, value))
    return ' and '.join(parts)
df['condition'] = df.apply(concatenate, axis=1)
print(df['condition'])
Data:
(because I used a dictionary, which isn't guaranteed to keep order, Second comes out as the last element ;) - on Python 3.7+ dicts do keep insertion order, so you may not see this)
First Last Second
0 X A 1
1 Y B 2
2 Z C 3
Result:
0 First='X' and Last='A' and Second='1'
1 First='Y' and Last='B' and Second='2'
2 First='Z' and Last='C' and Second='3'
Name: condition, dtype: object
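As a follow-up, the per-row conditions can then be combined into the full WHERE clause the question shows - a sketch; the table name mydb.customers is taken from the question:

```python
import pandas as pd

df = pd.DataFrame({'First': ['X', 'Y', 'Z'], 'Last': ['A', 'B', 'C']})

# per-row condition, built from whatever columns the frame happens to have
df['condition'] = df.apply(
    lambda row: ' and '.join("{}='{}'".format(n, v) for n, v in row.items()),
    axis=1)

# OR the parenthesized row conditions together into one clause
where = ' or '.join('(' + c + ')' for c in df['condition'])
sql = 'Select First, Last from mydb.customers where ' + where
print(sql)
```

This prints `Select First, Last from mydb.customers where (First='X' and Last='A') or (First='Y' and Last='B') or (First='Z' and Last='C')`. (For real applications, parameterized queries are safer than string-built SQL.)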
You can create a function that accomplishes what you are attempting. It takes any string series (such as yours) and builds the pattern you want using the series name.
Avoiding explicitly naming the columns is the hard part.
from functools import reduce  # needed in Python 3; it is a builtin in 2

def series_to_str(s):
    n = s.name
    return n + "='" + s + "'"

df['condition'] = reduce(lambda x, y: x + ' and ' + y,
                         map(series_to_str, (df[col] for col in df)))
I want to specify a label index, then slice X rows from a dataframe, and I do not necessarily know my end label. My labels are usually timestamps, but that should not matter. I am having trouble achieving this, mixing labels and an integer number of rows wanted.
so if:
df = pd.DataFrame(np.random.rand(8, 3), columns=list('abc'), index=list('lmnopqrs'))
How do I get the result given by this code:
df.loc['q':'o':-1]
BUT, what if I only know the 'q' index? I want something that returns logic like this:
df.loc['q':"3 rows only":-1]
Normally I would never know which integer index 'q' is at, but I would know its name, and I do not know where in the dataframe it sits. Thanks.
I am not sure if there are better ways to do this, but you can use df.index to access the indexes in the DataFrame, and df.index.tolist() to get the index as a list.
So in your case, df.index.tolist() would give -
In [13]: df.index.tolist()
Out[13]: ['l', 'm', 'n', 'o', 'p', 'q', 'r', 's']
Then you can find the position of 'q' in that list using the list.index() method, and get the element that is 2 positions before 'q' (start and end inclusive gives 3 rows). Example -
In [19]: df.index[df.index.tolist().index('q')-2]
Out[19]: 'o'
You can use this to index your DataFrame. Example -
In [20]: df.loc['q':df.index[df.index.tolist().index('q')-2]:-1]
Out[20]:
a b c
q 0.791467 0.703116 0.268405
p 0.643924 0.434607 0.918549
o 0.630881 0.209446 0.351309
You can do this by converting the label to an integer position first. (This originally used the .ix attribute, which has since been removed from pandas; .iloc works the same way here because both bounds are positions.)
index = df.index.searchsorted('q')  # or just a number if you already have it
offset = 3
df.iloc[index : index - offset : -1]
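A related sketch using Index.get_loc, which raises a KeyError if the label is missing instead of silently picking a nearby position the way searchsorted can:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(8, 3), columns=list('abc'),
                  index=list('lmnopqrs'))

pos = df.index.get_loc('q')           # integer position of the label
result = df.iloc[pos : pos - 3 : -1]  # 3 rows, walking backwards
# caveat: needs pos >= 3, since a negative stop would wrap around
print(result.index.tolist())  # ['q', 'p', 'o']
```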