I have the following dataframe where I am trying to create a new column C based on the cumulative values of columns 'A' and 'B' as a dictionary. Also, if '0' appears in column 'B', the entry for that key is deleted from 'C'.
df = pd.DataFrame({'A': [1, 2, 3, 2, 3, 2],
                   'B': ['Hi', 'Hello', 'HiWorld', 'HelloWorld', '0', '0']})
for indx, row in df.iterrows():
    df['C'].append(dict(zip([row['A'], row['B']])))
I am looking for the following output in column C:
   A  B           C
0  1  Hi          {1:Hi}
1  2  Hello       {1:Hi,2:Hello}
2  3  HiWorld     {1:Hi,2:Hello,3:HiWorld}
3  2  HelloWorld  {1:Hi,2:HelloWorld,3:HiWorld}
4  3  0           {1:Hi,2:HelloWorld}
5  2  0           {1:Hi}
I have tried potential solutions using cumsum, concat & Series.shift(1) but hit a block. Now I came across using dict & zip, which seems like a clean solution but doesn't work for me. Any suggestions?
Try this:
d = dict()
column = list()
for _, a, b in df.itertuples():
    if b != '0':
        d[a] = b          # add or overwrite the key with the latest value
    else:
        d.pop(a, None)    # '0' means: drop this key if present
    column.append(d.copy())  # snapshot the running dict for this row
df['C'] = column
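Running this against the sample frame reproduces the column shown above:
print(df['C'].tolist())
# [{1: 'Hi'}, {1: 'Hi', 2: 'Hello'}, {1: 'Hi', 2: 'Hello', 3: 'HiWorld'},
#  {1: 'Hi', 2: 'HelloWorld', 3: 'HiWorld'}, {1: 'Hi', 2: 'HelloWorld'}, {1: 'Hi'}]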
I have a dataframe df which has 4 columns 'A', 'B', 'C', 'D'.
I have to search for a substring in each column and return the complete dataframe in the search order. For example, if I find the substring in column B rows 3, 4, 5, then my final df would have
3 rows. For this I am using df[df['A'].str.contains('string_to_search')] and it's working fine, but one of the columns has each element as a list of strings, like column B:
A B C D
0 asdfg [asdfgh, cvb] asdfg nbcjsh
1 fghjk [ertyu] fghhjk yrewf
2 xcvb [qwerr, hjklk, bnm] cvbvb gjfsjgf
3 ertyu [qwert] ertyhhu ertkkk
so df[df['B'].str.contains('string_to_search')] is not working for column B. Please suggest how I can search in this column and maintain the order of the complete dataframe.
Column B contains lists, so you need an in statement:
df1 = df[df['B'].apply(lambda x: 'cvb' in x)]
print (df1)
A B C D
0 asdfg [asdfgh, cvb] asdfg nbcjsh
If you want to use str.contains, you can apply str.join first, which also makes it possible to search for substrings:
df1 = df[df['B'].str.join(' ').str.contains('er')]
print (df1)
A B C D
1 fghjk [ertyu] fghhjk yrewf
2 xcvb [qwerr, hjklk, bnm] cvbvb gjfsjgf
3 ertyu [qwert] ertyhhu ertkkk
If you want to search in all columns:
df2 = (df[df.assign(B=df['B'].str.join(' '))
          .apply(' '.join, axis=1)
          .str.contains('g')])
print (df2)
A B C D
0 asdfg [asdfgh, cvb] asdfg nbcjsh
1 fghjk [ertyu] fghhjk yrewf
2 xcvb [qwerr, hjklk, bnm] cvbvb gjfsjgf
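A hedged variant of the all-columns search, in case some columns are not strings (this sketch assumes every value can be cast to str):
df2 = df[df.assign(B=df['B'].str.join(' '))
         .astype(str)                   # cast all columns so ' '.join works row-wise
         .apply(' '.join, axis=1)
         .str.contains('g')]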
I have a dataframe A with a column 'col_1' whose values are A and B, and I am trying to map the values A and B to the lists present in a dictionary.
DataFrame A:
  col_1
0     A
1     A
2     A
3     B
4     B
and I have the dictionary:
dict1 = {"A": [1, 2, 3], "B": [1, 2]}
and I want the output like this:
col_1 Values
A 1
A 2
A 3
B 1
B 2
Any help will be highly appreciated, thanks!
I tried to frame your problem properly:
df = pd.DataFrame({"col_1":["A","A","A","B","B"]})
Printing df gives us the dataframe shown above:
print(df)
col_1
0 A
1 A
2 A
3 B
4 B
Here is your dictionary:
dict1 = {"A":[1,2,3], "B":[1,2]}
I created an empty list to hold the elements, filled it with the key/value combinations from your request, and finally created a new column called values and wrote the list into it:
values1 = []
for key, value_list in dict1.items():
    for item in value_list:
        value_item = key + " " + str(item)
        values1.append(value_item)
df["values"] = values1
Printing df results in:
df
col_1 values
0 A A 1
1 A A 2
2 A A 3
3 B B 1
4 B B 2
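If the goal is the exact two-column output shown in the question (the key in col_1, the number in Values), a minimal alternative sketch is to build the frame directly from the dictionary; the column names here just mirror the desired output:
df = pd.DataFrame(
    [(key, item) for key, value_list in dict1.items() for item in value_list],
    columns=["col_1", "Values"],  # names taken from the desired output above
)
print(df)
#   col_1  Values
# 0     A       1
# 1     A       2
# 2     A       3
# 3     B       1
# 4     B       2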
I would like to convert a dictionary of key-value pairs to an excel file with column names that match the values to the corresponding columns.
For example :
I have an excel file with column names as:
a,b,c,d,e,f,g and h.
I have a dictionary like:
{1:['c','d'],2:['a','h'],3:['a','b','b','f']}.
I need the output to be:
   a  b  c  d  e  f  g  h
1        1  1
2  1                    1
3  1  2        1
The 1, 2, 3 are the keys from the dictionary.
The rest of the columns could be either 0 or null.
I have tried splitting the dictionary and am getting
1 = ['c','d']
2 = ['a','h']
3 = ['a','b','b','f']
but, I don't know how to pass this to match with the excel file.
Your problem can be solved with pandas and collections (there may exist a more efficient solution):
import pandas as pd
from collections import Counter
d = {...} # Your dictionary
series = pd.Series(d) # Convert the dict into a Series
counts = series.apply(Counter) # Count items row-wise
counts = counts.apply(pd.Series) # Convert the counters to Series
table = counts.fillna(0).astype(int) # Fill the gaps and make the counts integer
print(table)
#    a  b  c  d  f  h
# 1  0  0  1  1  0  0
# 2  1  0  0  0  0  1
# 3  1  2  0  0  1  0
It is not clear what type of output you expect, so I leave it to you to convert the DataFrame to the output of your choice.
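For instance, if the target really is an .xlsx file with all eight columns a through h, a hedged sketch (the filename is a placeholder, and to_excel assumes an engine such as openpyxl is installed) would be:
table = table.reindex(columns=list('abcdefgh'), fill_value=0)  # add the missing e and g columns as 0
table.to_excel('output.xlsx')  # 'output.xlsx' is a hypothetical filename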
A simple solution based only on standard lists and dictionaries. It generates a 2D list, which is then easy to convert into a CSV file that can be loaded by Excel.
d = {1: ['c', 'd'], 2: ['a', 'h'], 3: ['a', 'b', 'b', 'f']}
cols = dict((c, n) for n, c in enumerate('abcdefgh'))
rows = dict((k, n) for n, k in enumerate(d))  # the keys are ints, so enumerate the dict keys
table = [[0 for col in cols] for row in rows]
for row, values in d.items():
    for col in values:
        table[rows[row]][cols[col]] += 1
print(table)
# output:
# [[0,0,1,1,0,0,0,0], [1,0,0,0,0,0,0,1], [1,2,0,0,0,1,0,0]]
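To turn that 2D list into something Excel can open, one hedged sketch with the standard csv module (the filename is just a placeholder) is:
import csv

with open('counts.csv', 'w', newline='') as f:  # 'counts.csv' is a hypothetical filename
    writer = csv.writer(f)
    writer.writerow([''] + list('abcdefgh'))  # header row: blank corner plus the column names
    for key, row in zip(d, table):            # rows come out in the same order as d's keys
        writer.writerow([key] + row)          # dict key as the row label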
I have a data frame with columns of strings and integers.
In one of the columns containing strings I want to search all the items for a specific substring, let's say "abc", and delete the row if the substring exists. How do I do that? It sounds easy but somehow I struggle with this.
The substring is always the last three characters.
I tried the following:
df1 = df.drop(df[df.Hostname[-4:]== "abc"])
which gives me
UserWarning: Boolean Series key will be reindexed to match DataFrame index
so I tried to modify the values in that column and filter out all values that do not have "abc" at the end:
red = [c for c in df.Hostname[-4:] if c != 'abc']
which gives me
KeyError('%s not in index' % objarr[mask])
What am I doing wrong?
Thanks for your help!
Use boolean indexing: add indexing with str if you need to check the last 4 (or 3) characters of the Hostname column, and change the condition from == to !=:
df1 = df[df.Hostname.str[-4:] != "abc"]
Or maybe:
df1 = df[df.Hostname.str[-3:] != "abc"]
Sample:
df = pd.DataFrame({'Hostname': ['k abc', 'abc', 'dd'],
                   'b': [1, 2, 3],
                   'c': [4, 5, 6]})
print (df)
Hostname b c
0 k abc 1 4
1 abc 2 5
2 dd 3 6
df1 = df[df.Hostname.str[-3:] != "abc"]
print (df1)
Hostname b c
2 dd 3 6
str.endswith also works if you need to check the last characters:
df1 = df[~df.Hostname.str.endswith("abc")]
print (df1)
Hostname b c
2 dd 3 6
EDIT:
If you need to check whether abc appears in the last 4 characters and then remove those rows, first slice the values and then use str.contains:
df1 = df[~df.Hostname.str[-4:].str.contains('abc')]
print (df1)
Hostname b c
2 dd 3 6
EDIT1:
For a default index add reset_index (Python counts from 0, so the index values are 0, 1, 2, ...):
df1 = df[df.Hostname.str[-3:] != "abc"].reset_index(drop=True)
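One caveat worth hedging: if Hostname can contain missing values, the .str comparison propagates NaN and the boolean mask may fail. A variant that assumes you want to keep rows with missing hostnames passes na=False:
df1 = df[~df.Hostname.str.endswith("abc", na=False)]  # NaN rows are kept, not dropped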
Imagine I have a dataframe, df, that has 2 columns, a USER_ID and a PRODUCT they bought.
df
USER_ID | PRODUCT
1 a
1 b
1 c
2 d
2 a
2 k
I want to convert this DataFrame into a new one, df2, where each row is one user and the products are aggregated into a list of strings.
df2
USER_ID | PRODUCT
1 [a,b,c]
2 [d,a,k]
Finally, I’d like to be able to find the intersection between two user’s PRODUCT lists.
I’m able to create the second dataframe, but the method I’m using results in a list that is somehow not iterable.
Specifically I do:
df2 = df.groupby('USER_ID')['PRODUCT'].agg(lambda x: x.tolist())
which gives me a series that I convert back to a dataframe.
df2 = df2.to_frame()
which gives me the df2 I'm looking for, but each list of products has length = 1, thus I can't compare one to another to find the intersection of products. For instance, when I execute:
s1 = df2.PRODUCT[df2.USER_ID == 1]
s2 = df2.PRODUCT[df2.USER_ID == 2]
common_elements = list(set(s1).intersection(set(s2)))
common_elements
The result is an empty list rather than [a]. What am I doing wrong?
Try this:
df3 = pd.crosstab(df.PRODUCT, df.USER_ID, margins=True)
print(df3[df3['All'] > 1])
# USER_ID  1  2  All
# PRODUCT
# a        1  1    2
# All      3  3    6
My solution is very similar to @Nikil's, so using his:
df2 = df.groupby('USER_ID')['PRODUCT'].apply(list)
df2 = df2.reset_index()
print(df2)
# USER_ID PRODUCT
# 0 1 [a, b, c]
# 1 2 [d, a, k]
More info on crosstab (it returns a DataFrame):
pd.crosstab(df.PRODUCT, df.USER_ID, margins=True)
# USER_ID 1 2 All
# PRODUCT
# a 1 1 2
# b 1 0 1
# c 1 0 1
# d 0 1 1
# k 0 1 1
# All 3 3 6
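The intersection itself can also be read straight off the crosstab; a small sketch under the same sample data:
ct = pd.crosstab(df.PRODUCT, df.USER_ID)
common = ct.index[(ct[1] > 0) & (ct[2] > 0)].tolist()  # products bought by both user 1 and user 2
print(common)
# ['a']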
You can perform a groupby followed by finding the intersection between the two lists, as shown:
>>> df2 = df.groupby('USER_ID')['PRODUCT'].apply(list).reset_index()
>>> df2
USER_ID PRODUCT
0 1 [a, b, c]
1 2 [d, a, k]
>>> list(set(df2['PRODUCT'].loc[0]).intersection(df2['PRODUCT'].loc[1]))
['a']
Or, more concisely:
>>> df2 = df.groupby('USER_ID')['PRODUCT'].apply(list)
>>> list(set(df2.loc[1]).intersection(df2.loc[2]))
['a']
Is that what you want?
In [7]: pd.Series(np.intersect1d(df.loc[df.USER_ID == 1, 'PRODUCT'], df.loc[df.USER_ID == 2, 'PRODUCT']))
Out[7]:
0 a
dtype: object
Or using index.intersection():
In [18]: (df.set_index('PRODUCT').query('USER_ID == 1').index
....: .intersection(df.set_index('PRODUCT').query('USER_ID == 2').index)
....: .to_series()
....: )
Out[18]:
PRODUCT
a a
Name: PRODUCT, dtype: object
PS: I wouldn't convert your df into df2, as you will most probably have a lot of difficulties with this data model (I mean having lists in columns).
This will give you a generalizable solution to find the intersection of any two users' product lists without the sloppy second dataframe:
from collections import defaultdict

user1 = 1
user2 = 2
products = defaultdict(set)
for record in df.to_dict('records'):
    products[record['USER_ID']].add(record['PRODUCT'])  # one set of products per user
common_elements = products[user1].intersection(products[user2])
print(common_elements)
And then, if you wanted all intersections for all pairs of users:
from itertools import combinations
common_elements = {(x,y): products[x].intersection(products[y]) for x,y in combinations(products.keys(),2)}
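With the sample data above, this leaves a single pair; a quick sanity check:
print(common_elements)
# {(1, 2): {'a'}}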