I have two lists of the same length, each with over 11 rows. I would like df[0] to find a match in any position of df2[0], df[1] to find a match in any position of df2[1], and so on. Instead of typing the comparisons one by one, is there an easier method?
df = [[[1, 5, 7, 9, 12, 13, 17],
       [2, 17, 18, 23, 32, 34, 45],
       [3, 5, 11, 33, 34, 36, 45]],
      [[6, 21, 22, 50, 56, 58, 72],
       [7, 5, 12, 13, 55, 56, 74],
       [8, 23, 24, 32, 56, 58, 64]]]
df2 = [[[100, 5, 12, 15, 27, 32, 54],
        [120, 10, 17, 18, 19, 43, 55],
        [99, 21, 32, 33, 34, 36, 54]],
       [[41, 16, 32, 45, 66, 67, 76],
        [56, 10, 11, 43, 54, 55, 56],
        [77, 12, 16, 18, 19, 21, 23]]]
I would like my output to look like this:
output = [[[[5,12],[17]],
           [[17,18],[32,34,36]]],
          [[[55,56],[32]],[[56]]]]
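One way, assuming the goal is a positionwise intersection (row i of each block in df matched against row i of the corresponding block in df2), is a nested list comprehension over zip; a minimal sketch:
# intersect corresponding sublists, keeping the order of df's values
output = [[[v for v in row if v in set(row2)]
           for row, row2 in zip(block, block2)]
          for block, block2 in zip(df, df2)]
# [[[5, 12], [17, 18], [33, 34, 36]], [[], [55, 56], [23]]]
(The values differ slightly from the expected output above, which looks like it contains a few typos.)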
I have a 1st DataFrame with column 'X' as:
X
A468593-3
A697269-2
A561044-2
A239882 04
and a 2nd DataFrame with column 'Y' as:
Y
000A561044
000A872220
I would like to match substrings from both columns with a minimum number of characters (for example, 7 characters; only alphanumeric characters should be considered for matching, and all special characters excluded).
So, my output DataFrame should look like this:
X
A561044-2
Any possible solution would be highly appreciated.
Thanks in advance.
IIUC, and assuming that the first three characters of Y are always zeros, you can slice Y with [3:] to remove them. Then you can join the resulting values with |. Finally, you can build a mask using contains, which checks whether each entry of a series matches the pattern (a pattern like 'A|B' checks whether a value contains 'A' or 'B'). This mask can then be used to filter your other data frame.
Code:
import pandas as pd
df1 = pd.DataFrame({"X": ["A468593-3", "A697269-2", "A561044-2", "A239882 04"]})
df2 = pd.DataFrame({"Y": ["000A561044", "000A872220"]})
# strip the three leading zeros, join into an alternation pattern, then match
mask = df1["X"].str.contains(f'({"|".join(df2["Y"].str[3:])})')
df1.loc[mask]
Output:
X
2 A561044-2
If you have values in Y that do not start with exactly three zeros, you can use this function instead to strip any number of leading numeric characters from each value.
def remove_first_numerics(s):
    # advance past the leading numeric characters, guarding against all-numeric strings
    counter = 0
    while counter < len(s) and s[counter].isnumeric():
        counter += 1
    return s[counter:]
df_test = pd.DataFrame({"A": ["01Abd3Dc", "3Adv3Dc", "d31oVgZ", "10dZb1B", "CCcDx10"]})
df_test["A"].apply(remove_first_numerics)
Output:
0 Abd3Dc
1 Adv3Dc
2 d31oVgZ
3 dZb1B
4 CCcDx10
Name: A, dtype: object
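Putting the two pieces together for the general case would then look something like this (a sketch, reusing the names from above):
cleaned = df2["Y"].apply(remove_first_numerics)
mask = df1["X"].str.contains("|".join(cleaned))
df1.loc[mask]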
I have a pandas DataFrame that looks like this:
df = pd.DataFrame({
    'a': [['Often'], ['Not Often', 'Not Often', 'Often'], ['Somewhat Often', 'Never']],
    'b': [['j0003'], ['j0002', 'j0005', 'j0006'], ['j0009', 'j0010']],
    'c': [['jump'], ['skip', 'throw', 'stab'], ['walk', 'sleep']]
})
I want to merge the columns of this dataframe such that we have a single column that has each row with a list of tuples. The length of each row's list varies.
Merged_Column
0 [('Often','j0003','jump')]
1 [('Not Often','j0002','skip'),('Not Often', 'j0005','throw'),('Often','j0006','stab')]
2 [('Somewhat Often','j0009','walk'),('Never','j0010','sleep')]
I've tried the following code, with the same data sourced from lists:
lst1 = [['Often'], ['Not Often', 'Not Often', 'Often'], ['Somewhat Often', 'Never']]
lst2 = [['j0003'], ['j0002', 'j0005', 'j0006'], ['j0009', 'j0010']]
lst3 = [['jump'], ['skip', 'throw', 'stab'], ['walk', 'sleep']]

merged = []
x = 0
while x < len(lst1):
    for i in range(len(lst1[x])):
        merged.append((lst1[x][i], lst2[x][i], lst3[x][i]))
    x += 1
which results in the following structure (when we call merged):
[('Often','j0003','jump'), ('Not Often', 'j0002','skip'),('Not Often','j0005','throw'),
('Often','j0006','stab'), ('Somewhat Often','j0009','walk'),('Never','j0010','sleep')]
Thing is, I need an extra level of structure in here, so that instead of getting a list of length 6, I get a list of length 3.
[[('Often','j0003','jump')],[('Not Often','j0002','skip'),('Not Often', 'j0005','throw'),
('Often','j0006','stab')],[('Somewhat Often','j0009','walk'),('Never','j0010','sleep')]]
I figure if I can get a data structure looking like this I can pretty easily do pd.DataFrame() and change my list of lists of tuples into a dataframe/series. But I'm having a lot of trouble getting there. Any tip/suggestions/pointers would be very much appreciated.
This can be done very easily with explode. Just explode all the columns, then convert each row into a tuple, then re-combine the tuples into lists:
merged_df = (
    df.explode(df.columns.tolist())   # one row per list element
      .apply(tuple, axis=1)           # each exploded row becomes a tuple
      .groupby(level=0).agg(list)     # regroup the tuples by original row
      .to_frame('Merged_Column')
)
Output:
>>> merged_df
Merged_Column
0 [(Often, j0003, jump)]
1 [(Not Often, j0002, skip), (Not Often, j0005, throw), (Often, j0006, stab)]
2 [(Somewhat Often, j0009, walk), (Never, j0010, sleep)]
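Note that exploding multiple columns at once requires pandas 1.3 or newer. On older versions, a plain zip over the columns builds the same structure; a minimal sketch:
# zip the three list-columns row by row, then zip within each row into tuples
merged_df = pd.DataFrame(
    {"Merged_Column": [list(zip(*row)) for row in zip(df["a"], df["b"], df["c"])]}
)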
I have the following dataset, which lists the correlation of pairs of columns (the pair appears in the two leftmost columns):
If you look at rows 3 and 42, you will find they are the same; only the column positions are swapped, which does not affect the correlation. I want to remove row 42. But this dataset has many such mirrored rows. I need a general algorithm to remove these duplicates and keep only the unique pairs.
As the correlation_value is the same either way, the relation is commutative, so whatever the value, you only have to focus on the first two columns: sort each pair into a tuple and remove duplicates.
# You can probably replace 'sorted' by 'set'
key = df[['source_column', 'destination_column']] \
.apply(lambda x: tuple(sorted(x)), axis='columns')
out = df.loc[~key.duplicated()]
>>> out
source_column destination_column correlation_Value
0 A B 1
2 C E 2
3 D F 4
You could try a self-join. Without a reproducible example it's hard to answer, but something like this maybe:
df.merge(df, left_on="source_column", right_on="destination_column")
You can follow that up with a call to drop_duplicates.
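A self-contained sketch of that idea, with the data reconstructed from the answer above (an assumption) and each pair joined against its reversed orientation:
import pandas as pd

df = pd.DataFrame({
    "source_column": ["A", "B", "C", "D"],
    "destination_column": ["B", "A", "E", "F"],
    "correlation_Value": [1, 1, 2, 4],
})

# self-join: match each (source, destination) pair with its reverse
mirrored = df.merge(
    df,
    left_on=["source_column", "destination_column"],
    right_on=["destination_column", "source_column"],
    suffixes=("", "_rev"),
)
# mirrored now holds one row per orientation of the A-B pair;
# drop one orientation afterwards, e.g. with the sorted-tuple key shown above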
I have a pandas dataframe with a column whose values are lists of two-part strings, as in the example below. The first part of each string is a datetime and the second part is a price. Records in the dataframe have price_trend lists of different lengths.
Id  Name   Color   price_trend
1   apple  red     '1420848000:1.25', '1440201600:1.35', '1443830400:1.52'
2   lemon  yellow  '1403740800:0.32', '1422057600:0.25'
I'd like to split each of the strings in the list into two parts around the colon (:). However, when I run the code below, all the values in price_trend are replaced with nan:
df['price_trend'] = df['price_trend'].str.split(':')
I would like to keep the array in this dataframe, not create a new one.
.str.split returns nan here because the cells hold lists rather than single strings, so split each element inside the list instead:
df['price_trend'] = df['price_trend'].apply(lambda x: [i.split(':') for i in x])
df['price_trend']
0    [[1420848000, 1.25], [1440201600, 1.35], [1443830400, 1.52]]
1    [[1403740800, 0.32], [1422057600, 0.25]]
I assume the code below should work for you:
>>> df={}
>>> df['p']=['1420848000:1.25', '1440201600:1.35', '1443830400:1.52']
>>> df['p']=[ x.split(':') for x in df['p']]
>>> df
{'p': [['1420848000', '1.25'], ['1440201600', '1.35'], ['1443830400', '1.52']]}
I have a pandas Series with a MultiIndex, and I want to get the integer row numbers that belong to one level of the MultiIndex.
For example, if I have sample data s
s = pandas.Series([10, 23, 2, 19],
index=pandas.MultiIndex.from_product([['a', 'b'], ['c', 'd']]))
which looks like this:
a c 10
d 23
b c 2
d 19
I want to get the row numbers that correspond to the level b. So here, I'd get [2, 3] as the output, because the last two rows are under b. Also, I really only need the first row that belongs under b.
I wanted to get the numbers so that I can compare across Series. Say I have five Series objects with a b level. These are time-series data, and b corresponds to a condition that was present during some of the observations (and c is a sub-condition, etc). I want to see which Series had the conditions present at the same time.
Edit: To clarify, I don't need to compare the values themselves, just the indices. For example, in R if I had this dataframe:
d = data.frame(col_1 = c('a','a','b','b'), col_2 = c('c','d','c','d'), col_3 = runif(4))
Then the command which(d$col_1 == 'b') would produce the results I want.
If the index level that you want to select by is the outermost one, you can use loc:
s.loc['b']
To get the first row, I find the head method the easiest:
s.loc['b'].head(1)
The idiomatic way to do the second part of your question is as follows. Say your series are named series1, series2 and series3.
big_series = pd.concat([series1, series2, series3])
big_series.loc['b']
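If you need the integer positions themselves (the equivalent of R's which, as asked in the question), one sketch using get_level_values on the question's s:
import numpy as np

# boolean mask over the outermost index level, converted to integer positions
positions = np.where(s.index.get_level_values(0) == 'b')[0]
positions       # array([2, 3])
positions[0]    # 2 -- the first row under 'b'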