Not able to assign values to a column. Bag_of_words - python

I am trying to assign values to a column in my pandas df; however, I am getting a blank column. Here's the code:
df['Bag_of_words'] = ''
columns = ['genre', 'director', 'actors', 'key_words']
for index, row in df.iterrows():
    words = ''
    for col in columns:
        words += ' '.join(row[col]) + ' '
    row['Bag_of_words'] = words
The output is an empty column. Can someone please help me understand what is happening here? I am not getting any errors.

From the iterrows documentation:
You should never modify something you are iterating over.
This is not guaranteed to work in all cases. Depending on the
data types, the iterator returns a copy and not a view, and writing
to it will have no effect.
So you do row['Bag_of_words'] = words, and it turns out row is a copy, so the assignment does not affect the original rows.
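A quick way to see this for yourself (a minimal sketch with toy data; per the docs quoted above, whether the row is a copy can depend on the dtypes):
import pandas as pd

df = pd.DataFrame({'genre': [['action']], 'Bag_of_words': ['']})
for index, row in df.iterrows():
    row['Bag_of_words'] = 'action'  # writes to the per-row copy
print(df['Bag_of_words'].tolist())  # [''] -- the original frame is unchanged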
iterrows is frowned upon anyway, so you can instead:
- join each row's word lists into strings,
- aggregate those strings row-wise with " ".join,
- and add a trailing space to them:
df["Bag_of_words"] = (df[columns].apply(lambda col: col.str.join(" "))
.agg(" ".join, axis="columns")
.add(" "))

Instead of:
row['Bag_of_words'] = words
use:
df.at[index, 'Bag_of_words'] = words
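In context, the corrected loop might look like this (a sketch, assuming the same columns list as the question):
for index, row in df.iterrows():
    words = ''
    for col in columns:
        words += ' '.join(row[col]) + ' '
    df.at[index, 'Bag_of_words'] = words  # writes into df itself, not into the copy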

Related

How to Compare two columns of CSV simultaneously in Python?

I have a csv file with a huge dataset containing two columns. I want to compare the data of these two columns such that, if a duplicated pair is present, it gets deleted. For example, if my data file looks something like this:
Column A    Column B
DIP-1N      DIP-1N
DIP-2N      DIP-3N
DIP-3N      DIP-2N
DIP-4N      DIP-5N
Then the first entry gets deleted because I don't want two "DIP-1N"s. Also, the order of occurrence within a pair does not matter as long as the pair is unique. For example, here, DIP-2N & DIP-3N and DIP-3N & DIP-2N are both pairs, but both entries mean the same thing, so I want to keep one entry and delete the rest.
I have written the following code, but I don't know how to compare simultaneously the entry of both the columns.
import csv
import pandas as pd

file = pd.read_csv("/home/staph.csv")
for i in range(len(file['Column A'])):
    for j in range(len(file['Column B'])):
        list1 = []
        list2 = []
        list1.append(file[file['Column A'].str.contains('DIP-'+str(i)+'N')])
        list2.append(file[file['Column B'].str.contains('DIP-'+str(i)+'N')])
        for ele1, ele2 in list1, list2:
            if(list1[ele1]==list2[ele2]):
                print("Duplicate")
            else:
                print("The 1st element is :", ele1)
                print("The 2nd element is :", ele2)
Seems like something is wrong, as there is no output. The program just ends without any output or error. Any help would be much appreciated in terms of whether my code is wrong or if I can optimize the process in a better way. Thanks :)
It might not be the best way to get what you need, but it works.
# Combine both columns into one string per row
df['temp'] = df['Column A'] + " " + df['Column B']
df['temp'] = df['temp'].str.split(" ")
# Sort the pair so that the order of occurrence no longer matters
df['temp'] = df['temp'].apply(lambda list_: " ".join(sorted(list_)))
# Keep only the first occurrence of each sorted pair
df.drop_duplicates(subset=['temp'], inplace=True)
# Drop self-pairs such as DIP-1N / DIP-1N
df = df[df['Column A'] != df['Column B']]
df.drop('temp', axis=1, inplace=True)
Output:
index    Column A    Column B
1        DIP-2N      DIP-3N
3        DIP-4N      DIP-5N
With some tweaking you could use built-in pandas methods:
# get indices of duplicate-free (except first occurrence) combined sets of col A and B
keep_ind = pd.Series(df[["Column A", "Column B"]].values.tolist()).apply(set).drop_duplicates(keep="first").index
# use these indices to filter the DataFrame
df = df.loc[keep_ind]
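A related sketch, assuming you also want to drop self-pairs such as DIP-1N / DIP-1N (frozensets make each pair order-insensitive and hashable):
pairs = df.apply(lambda r: frozenset((r["Column A"], r["Column B"])), axis=1)
df = df[~pairs.duplicated() & (df["Column A"] != df["Column B"])]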

Most pythonic/stylish/efficient way to create a dataframe from a 2-dimensional list of strings with varied lengths

I guess professional data analysts know an answer to this, but I'm no analyst.
And I just barely know Pandas. So I am at a loss.
There are two lists. Their contents are unpredictable (parsed from web counters, web analytics, web statistics, etc).
list1 = ['WordA', 'WordB', ..., 'WordXYZ']
...and...
list2 = [['WordA1', 'WordA2'], ['WordB1'], ['WordC1', 'WordC2', ..., 'WordC96'], ..., ['WordXYZ1', 'WordXYZ2']]
The lengths of the two lists are always equal (they're the output of a parser I already wrote).
What I need is to create a dataframe with two rows for each item in list1, each of those rows holding the word in the first column; the corresponding words from list2 then go into the first of those two rows, starting from the second column (the first column is already filled from list1).
So I imagine the following steps:
Create a dataframe filled with empty strings ('') with the number of columns equal to len(max(list2, key=len)) and the number of rows equal to twice the length of list1 (aaaand I don't know how; this is actually my very second time using Pandas at all!);
Somehow fill first column of resulting dataframe with contents of list1, filling two rows for each item in list1;
Somehow put contents of list2 into every even row of the dataframe, starting with second column;
Save into .xls file (yes, that's the final goal), enjoy job done.
Now first thing, I already spent half a day trying to find an answer to "how to create a pandas dataframe filled with empty strings with a given number of rows and columns", and found a lot of different articles, which contradict each other.
And second, there's got to be a more pythonic, more efficient and more stylish way to do all this!
Aaaand, maybe there's a way to create an excel file without using pandas at all, which I just don't know about (hopefully, yet)
Can anyone help, please?
UPD: (to answer a question) the results should look like:
WordA WordA1 WordA2
WordA
WordB WordB1
WordB
WordC WordC1 WordC2 (...) WordC96
WordC
(...)x2
WordXYZ WordXYZ1 WordXYZ2
WordXYZ
If you just want to write the lists to an Excel file, you don't need pandas. You can use for instance openpyxl:
from openpyxl import Workbook

wb = Workbook()
ws = wb.active
for *word, words in zip(list1, list2):
    ws.append(word + words)
    ws.append(word)
wb.save('output.xlsx')
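To illustrate with hypothetical toy input matching the question's shape:
list1 = ['WordA', 'WordB']
list2 = [['WordA1', 'WordA2'], ['WordB1']]
# the loop above then appends, in order:
#   ['WordA', 'WordA1', 'WordA2']
#   ['WordA']
#   ['WordB', 'WordB1']
#   ['WordB']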
If you really want to use pandas:
import pandas as pd

# Interleave each words-list (prefixed with None) with its bare word,
# then backfill column 0 so the words-row picks up the word below it
df = pd.DataFrame([[None] + x if isinstance(x, list) else [x]
                   for pair in zip(list2, list1) for x in pair])
df[0] = df[0].bfill()
df.to_excel('output.xlsx', index=False, header=False)
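A hedged trace of why the bfill works, with list1 = ['WordA'] and list2 = [['WordA1', 'WordA2']]:
rows = [[None] + x if isinstance(x, list) else [x]
        for pair in zip([['WordA1', 'WordA2']], ['WordA']) for x in pair]
# rows == [[None, 'WordA1', 'WordA2'], ['WordA']]
# bfill on column 0 then fills the None from the 'WordA' row below it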
The following should give you (almost) what you want:
import pandas as pd
from itertools import chain
list1 = ['WordA', 'WordB']
list2 = [['WordA1', 'WordA2'], ['WordB1']]
# Flatten list 2
list2 = list(chain(*list2))
# Create DataFrames
list1 = pd.DataFrame(data=list1, columns=["word1"])
list2 = pd.DataFrame(data=list2, columns=["word2"])
# Extract the non-numeric prefix from each word in list2 (e.g. 'WordA' from 'WordA1')
list2["prefix"] = list2["word2"].str.extract("([^0-9]+)")
list1 = list1.merge(list2, left_on="word1", right_on="prefix", how="inner")
# Concatenated words
list1 = list1.groupby("word1")["word2"].agg(lambda x: " ".join(x)).reset_index()
list1["word2"] = list1["word1"].str.cat(list1["word2"], sep=" ")
list1 = pd.melt(list1).sort_values(by="value")
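With the toy lists above, the final value column comes out sorted like this (the bare word lands before its expanded row, hence the "almost"):
print(list1["value"].tolist())
# ['WordA', 'WordA WordA1 WordA2', 'WordB', 'WordB WordB1']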

How do I remove square brackets for all the values in a column?

I have a column named keywords in my pandas dataset. The values of the column are like this :
[jdhdhsn, cultuere, jdhdy]
I want my output to be
jdhdhsn, cultuere, jdhdy
Try this
keywords = ['jdhdhsn', 'cultuere', 'jdhdy']
if isinstance(keywords, list):
    output = ', '.join(keywords)
else:
    # otherwise assume a string like "[jdhdhsn, cultuere, jdhdy]" and strip the brackets
    output = keywords[1:-1]
The column of your dataframe seems to contain lists.
Lists are formatted with brackets around the repr() of each of their elements.
Pandas has built-in functions for dealing with strings:
df['column_name'].str lets you apply a string function to each element in the column, just like ', '.join(['foo', 'bar', 'baz']).
Thus df['column_name_str'] = df['column_name'].str.join(', ') will produce a new column with the formatting you're after.
You can also use the .apply to perform arbitrary lambda functions on a column, such as:
df['column_name'].apply(lambda row: ', '.join(row))
But since pandas has the .str accessor built in, this isn't needed for this example.
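A minimal sketch of that, assuming the column really holds Python lists:
import pandas as pd

df = pd.DataFrame({"keywords": [["jdhdhsn", "cultuere", "jdhdy"]]})
df["keywords_str"] = df["keywords"].str.join(", ")
print(df["keywords_str"][0])  # jdhdhsn, cultuere, jdhdy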
Try this
data = ["[jdhdhsn, cultuere, jdhdy]"]
df = pd.DataFrame(data, columns = ["keywords"])
new_df = df['keywords'].str[1:-1]
print(df)
print(new_df)

df.iloc Not Assigning Values in a For-loop? (pandas)

I'm fairly new to pandas. I have a dataset containing about 250,000 rows, stored in a JSON. One of my columns contains a long, possibly unique string in each cell which I have to filter before the data is usable. For some reason, each value is being accessed and filtered correctly (meaning the correct value is stored in my processing variable at the end), but when it comes to assignment with df.iloc[x]['notes'], the values are not correctly reassigned into the dataframe. I've read about issues with chained indexing and assignment in pandas, but I thought that this would be circumvented by using .iloc, and it just isn't working for me right now.
Here's an example:
Assume this is my dataframe and some filtering code:
import pandas as pd

# Listing the things I want to filter out
greeting = ['Hello,', 'Hi']
goodbye = ['Thank you', 'Goodbye']

df = pd.DataFrame({'ID': [123, 456, 789], 'Group': ['A', 'B', 'C'],
                   'notes': ['Hello, this is John', 'Thank you for your help',
                             'This is a message.']})

# Doing the actual filtering
for x in range(0, len(df['notes'])):
    note = df.iloc[x]['notes']
    for y in greeting:
        if y in note:
            note = note.replace(y, '')
    for z in goodbye:
        if z in note:
            note = note.replace(z, '')
    # The variable note is correctly filtered here, but then it doesn't assign
    # and leaves the df unchanged at the previous index, so the error is
    # probably beyond this point
    df.iloc[x]['notes'] = note

df.to_json('final_data.json', orient='records')
Another thing I've used in place of .iloc is df.at[x, 'notes'] = note, but this seems to have the same problem.
So in the final version, instead of getting something like:
[{'ID':1, 'Group': "A", 'notes':' this is John'}..etc.]
I get:
[{'ID':1, 'Group': "A", 'notes':'Hello, this is John'}..etc.]
(which is completely unchanged)
What is happening here? Is there some unpredictable assignment going on that I can somehow fix?
Why not:
df['notes'] = df['notes'].str.replace('|'.join(greeting + goodbye), '', regex=True)
(regex=True makes the joined pattern an alternation; in pandas >= 2.0 str.replace no longer treats the pattern as a regex by default.)
And now:
df.to_json('final_data.json', orient = 'records')
Will give you a good desired json file.
As:
[{"Group":"A","ID":123,"notes":" this is John"},{"Group":"B","ID":456,"notes":" for your help"},{"Group":"C","ID":789,"notes":"This is a message."}]
Use the code below.
The variable idx is the index label of the dataframe df, which you can pass to .loc for assignment; row is a Series containing the data of a single row.
for idx, row in df.iterrows():
    note = row['notes']
    for y in greeting:
        if y in note:
            note = note.replace(y, '')
    for z in goodbye:
        if z in note:
            note = note.replace(z, '')
    df.loc[idx, 'notes'] = note
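With the example frame from the question, a quick check after the loop (a hedged expectation) shows the replacements landed in df itself:
print(df['notes'].tolist())
# [' this is John', ' for your help', 'This is a message.']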

Correct way of concatenating two columns in pandas

I have a data frame df with two columns "values" and "values1". I want to concatenate both these columns and create a new column "values2". Values are as follows:
values                 values1
[u'12f4',u'ff45']      [u'12f4']
                       [u'sd45',u'45ty']
[u'12f34',u'ff2345']   []
If you notice, the 2nd cell in column "values" is empty, and the last cell in column "values1" is []. I want to concatenate as below -
values                 values1              values2
[u'12f4',u'ff45']      [u'12f4']            [u'12f4',u'ff45',u'12f4']
                       [u'sd45',u'45ty']    [u'sd45',u'45ty']
[u'12f34',u'ff2345']   []                   [u'12f34',u'ff2345']
Code I am using is -
df["values2"] = data["values"] + ', ' + data["values1"]
This creates extra commas or brackets. What would be an ideal code for this?
Since you are concatenating text, I don't think it is possible to take advantage of numpy's ufuncs (I could be wrong).
So, assuming that, I would just use a list comprehension.
df["values2"] = [", ".join([str(data.loc[x, "values"]), str(data.loc[x, "values1"])]) for x in df.index]
@piRSquared is right (as usual). If values and values1 are lists, then...
df = pd.DataFrame({'values': [[u'12f4', u'ff45'], [], [u'12f34', u'ff2345']],
                   'values1': [[u'12f4'], [u'sd45', u'45ty'], []]},
                  columns=['values', 'values1'])
You can sum them like this...
>>> df[['values', 'values1']].sum(axis=1)
0 [12f4, ff45, 12f4]
1 [sd45, 45ty]
2 [12f34, ff2345]
Since the code you are using is data["values"] + ', ' + data["values1"], and it's creating extra commas or brackets, it sounds like your data is not lists but strings instead.
df1 = pd.DataFrame({'values': ["[u'12f4',u'ff45']", "''", "[u'12f34',u'ff2345']"],
                    'values1': ["[u'12f4']", "[u'sd45',u'45ty']", '[]']})
There are a million different ways to do this. If you don't need the u prefix in front of the strings, the easiest might be this:
import ast
df1[['values', 'values1']].applymap(ast.literal_eval).applymap(lambda x: x if x else []).sum(axis=1)
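A hedged end-to-end run of that, using the same df1 as above:
import ast

df1["values2"] = (df1[['values', 'values1']]
                  .applymap(ast.literal_eval)
                  .applymap(lambda x: x if x else [])
                  .sum(axis=1))
print(df1["values2"].tolist())
# [['12f4', 'ff45', '12f4'], ['sd45', '45ty'], ['12f34', 'ff2345']]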
