Remove list of words from a string Series - python

I have a list of words to remove:
words_list_to_remove = ['abc', 'def', 'ghi', 'jkl']
I want to remove these words from the string Series (df):
My_strings
first
abc
second
third
def
forth
ghi
jkl
My goal new_df:
My_new_strings
first
second
third
forth
I want to keep each element as a string and also keep the index of each element. I tried converting both to sets, but that did not work for me.
Any help would be appreciated!

You can use .isin() and pass your words_list_to_remove to it:
import pandas as pd
# Define Pandas Series that holds your data
df = pd.Series(["first","abc","second","third","def","forth","ghi","jkl"])
print("before dropping:\n", df)
# Define list of strings to drop
words_list_to_remove = ['abc', 'def', 'ghi', 'jkl']
# Only keep rows that are not in list
df = df[~df.isin(words_list_to_remove)]
print("\nafter dropping:\n", df)
As you can see in the output, the index is preserved as well:
before dropping:
0     first
1       abc
2    second
3     third
4       def
5     forth
6       ghi
7       jkl
dtype: object

after dropping:
0     first
2    second
3     third
5     forth
dtype: object
Note: df is the conventional name for a DataFrame, so it would be better to give your Series a different name to avoid confusion.
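As a minimal sketch, the same filter can be wrapped in a small helper. The names remove_words and my_strings are illustrative and not from the original post, and the sketch assumes whole elements should be matched exactly:

```python
import pandas as pd


def remove_words(series, words):
    """Return the elements of `series` that are not in `words`, keeping the original index."""
    return series[~series.isin(words)]


# Series named my_strings (hypothetical name) instead of df, as suggested in the note
my_strings = pd.Series(["first", "abc", "second", "third", "def", "forth", "ghi", "jkl"])
words_list_to_remove = ["abc", "def", "ghi", "jkl"]

my_new_strings = remove_words(my_strings, words_list_to_remove)
print(my_new_strings)
```

Note that .isin() compares whole elements, so a value such as "abcdef" would not be dropped by this filter.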

Related

How to split one string element in a list into two in Python?

Sorry in advance if this is a silly question.
I need to use pandas to sort some data, but what I have been given is formatted strangely, and I get an error message "2 columns passed, passed data had 1 columns"
['Fred Green,20/11/2020\n', 'Jack Wilson,01/05/2021\n',] etc.
How can I go about splitting the elements into two at the , point, so I can get my columns to work properly?
I'd use list-comprehension to split each string and then pass it to pd.DataFrame:
lst = ['Fred Green,20/11/2020\n', 'Jack Wilson,01/05/2021\n',]
df = pd.DataFrame([item.strip().split(',') for item in lst], columns=['name', 'date'])
Output:
>>> df
          name        date
0   Fred Green  20/11/2020
1  Jack Wilson  01/05/2021
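An alternative sketch that does the split inside pandas with Series.str.split(expand=True). The conversion to datetime at the end is an extra step that assumes the dates are in day/month/year order, which the question does not state explicitly:

```python
import pandas as pd

lst = ['Fred Green,20/11/2020\n', 'Jack Wilson,01/05/2021\n']

# Strip the trailing newline, then split each string into two columns
raw = pd.Series(lst).str.strip()
df = raw.str.split(',', expand=True)
df.columns = ['name', 'date']

# Assumption: dates are day/month/year
df['date'] = pd.to_datetime(df['date'], format='%d/%m/%Y')
print(df)
```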

Python Panda Dataframe Count Specific Values from List

Say I have a list:
mylist = ['a','b','c']
and a Pandas dataframe (df) that has a column named "rating". How can I get the count for number of occurrence of a rating while iterating my list? For example, here is what I need:
for item in mylist:
    # Do a bunch of stuff in here that takes a long time
    # want to do print statement below to show progress
    # print df['rating'].value_counts().a <- I can do this,
    # but want to use variable 'item'
    # print df['rating'].value_counts().item <- Or something like this
I know I can get counts for all distinct values of 'rating', but that is not what I am after.
If you must do it this way, you can use .loc to filter the df prior to getting the size of the resulting df.
mylist = ['a','b','c']
df = pd.DataFrame({'rating':['a','a','b','c','c','c','d','e','f']})
for item in mylist:
    print(item, df.loc[df['rating']==item].size)
Output
a 2
b 1
c 3
Instead of thinking about this problem as one of going "from the list to the Dataframe" it might be easiest to flip it around:
mylist = ['a','b','c']
df = pd.DataFrame({'rating':['a','a','b','c','c','c','d','e','f']})
ValueCounts = df['rating'].value_counts()
ValueCounts[ValueCounts.index.isin(mylist)]
Output:
c 3
a 2
b 1
Name: rating, dtype: int64
You don't even need a for loop, just do:
df['rating'].value_counts()[mylist]
Or to make it a dictionary:
df['rating'].value_counts()[['a', 'b', 'c']].to_dict()
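One caveat with indexing value_counts() by the list directly: it raises a KeyError if an item in the list never occurs in the column. A small sketch that reports 0 for such items instead, using reindex with fill_value (the value 'z' is made up here to show the missing case):

```python
import pandas as pd

mylist = ['a', 'b', 'z']  # 'z' is a hypothetical value that does not occur in the column
df = pd.DataFrame({'rating': ['a', 'a', 'b', 'c', 'c', 'c', 'd', 'e', 'f']})

# Compute the counts once, then align them to the list; missing items get 0
counts = df['rating'].value_counts().reindex(mylist, fill_value=0)

for item in mylist:
    print(item, counts[item])
```

Computing the counts once before the loop also avoids re-scanning the column on every iteration.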

Is there any way to remove column and rows numbers from DataFrame.from_dict?

So, I have a problem with my dataframe from dictionary - python actually "names" my rows and columns with numbers.
Here's my code:
a = dict()
dfList = [x for x in df['Marka'].tolist() if str(x) != 'nan']
dfSet = set(dfList)
dfList123 = list(dfSet)
for i in range(len(dfList123)):
    number = dfList.count(dfList123[i])
    a[dfList123[i]]=number
sorted_by_value = sorted(a.items(), key=lambda kv: kv[1], reverse=True)
dataframe=pd.DataFrame.from_dict(sorted_by_value)
print(dataframe)
I've tried to rename columns like this:
dataframe=pd.DataFrame.from_dict(sorted_by_value, orient='index', columns=['A', 'B', 'C']), but it gives me an error:
AttributeError: 'list' object has no attribute 'values'
Is there any way to fix it?
Edit:
Here's the first part of my data frame:
               0     1
0             VW  1383
1           AUDI  1053
2          VOLVO   789
3            BMW   749
4           OPEL   621
5  MERCEDES BENZ   593
...
The 1st rows and columns are exactly what I need to remove/rename
index and columns are properties of your dataframe
As long as len(df.index) > 0 and len(df.columns) > 0, i.e. your dataframe has nonzero rows and nonzero columns, you cannot get rid of the labels from your pd.DataFrame object. Whether the dataframe is constructed from a dictionary, or otherwise, is irrelevant.
What you can do is remove them from a representation of your dataframe, with output either as a Python str object or a CSV file. Here's a minimal example:
df = pd.DataFrame([[1, 2, 3], [4, 5, 6]])
print(df)
#    0  1  2
# 0  1  2  3
# 1  4  5  6
# output to string without index or headers
print(df.to_string(index=False, header=False))
# 1  2  3
# 4  5  6
# output to csv without index or headers
df.to_csv('file.csv', index=False, header=False)
By sorting the dict_items object (a.items()), you have created a list.
You can check this with type(sorted_by_value). Then, when you try to use the pd.DataFrame.from_dict() method, it fails because it is expecting a dictionary, which has 'values', but instead receives a list.
Probably the smallest fix you can make to the code is to replace the line:
dataframe=pd.DataFrame.from_dict(sorted_by_value)
with:
dataframe = pd.DataFrame(dict(sorted_by_value), index=[0])
(The index=[0] argument is required here because pd.DataFrame expects a dictionary to be in the form {'key1': [list1, of, values], 'key2': [list2, of, values]} but instead sorted_by_value is converted to the form {'key1': value1, 'key2': value2}.)
Another option is to use pd.DataFrame(sorted_by_value) to generate a dataframe directly from the sorted items, although you may need to tweak sorted_by_value or the result to get the desired dataframe format.
Alternatively, look at collections.OrderedDict to avoid sorting to a list and then converting back to a dictionary.
Edit
Regarding naming of columns and the index, without seeing the data/desired result it's difficult to give specific advice. The options above will remove the error and allow you to create a dataframe, the columns of which can then be renamed using dataframe.columns = [list, of, column, headings]. For the index, look at pd.DataFrame.set_index(drop=True) and pd.DataFrame.reset_index().
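A hedged sketch of one way to get named columns and print without the row numbers, assuming the goal is simply the number of occurrences of each 'Marka' value. The sample data and the column name 'count' are made up for illustration:

```python
import pandas as pd

# Hypothetical stand-in for the asker's data; None plays the role of the missing values
df = pd.DataFrame({'Marka': ['VW', 'AUDI', 'VW', None, 'BMW', 'VW', 'AUDI']})

counts = (
    df['Marka']
    .value_counts()             # drops missing values and sorts by count, descending
    .rename_axis('Marka')       # name the index so it becomes a labelled column
    .reset_index(name='count')  # turn the index into a regular column
)

# The labels still exist on the object; they are only hidden in the printed output
print(counts.to_string(index=False))
```

Here value_counts() replaces the manual counting loop and the sort, so no intermediate dictionary or list of tuples is needed.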

Python, Pandas: Using isin() like functionality but do not ignore duplicates in input list

I am trying to filter an input dataframe (df_in) against a list of indices. The indices list contains duplicates and I want my output df_out to contain all occurrences of a particular index. As expected, isin() gives me only a single entry for every index.
How do I try and not ignore duplicates and get output similar to df_out_desired?
import pandas as pd
import numpy as np
df_in = pd.DataFrame(index=np.arange(4), data={'A':[1,2,3,4],'B':[10,20,30,40]})
indices_needed_list = pd.Series([1,2,3,3,3])
# In the output df, I do not particularly care about the 'index' from the df_in
df_out = df_in[df_in.index.isin(indices_needed_list)].reset_index()
# With isin, as expected, I only get a single entry for each occurrence of index in indices_needed_list
# What I am trying to get is an output df that has as many rows and occurrences of df_in index as in the indices_needed_list
temp = df_out[df_out['index'] == 3]
# This is what I would like to try and get
df_out_desired = pd.concat([df_out, df_out[df_out['index']==3], df_out[df_out['index']==3]])
Thanks!
Check reindex
df_out_desired = df_in.reindex(indices_needed_list)
df_out_desired
Out[177]:
   A   B
1  2  20
2  3  30
3  4  40
3  4  40
3  4  40
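For comparison, a short sketch showing that label-based selection with .loc also repeats rows for duplicated labels. The variable names via_reindex and via_loc are illustrative. The two approaches differ when a requested label is absent from the index: reindex inserts a row of NaN, while .loc with a list raises a KeyError in recent pandas versions.

```python
import numpy as np
import pandas as pd

df_in = pd.DataFrame(index=np.arange(4), data={'A': [1, 2, 3, 4], 'B': [10, 20, 30, 40]})
indices_needed_list = [1, 2, 3, 3, 3]

# Both selections return one row per requested label, duplicates included
via_reindex = df_in.reindex(indices_needed_list)
via_loc = df_in.loc[indices_needed_list]

print(via_loc)
```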

Efficiently adding a new column to a Pandas DataFrame with values processed from an existing column?

I have a string column foo in my DataFrame. I need to create a new column bar, whose values are derived from corresponding foo values by a sequence of string-processing operations - a bunch of str.split()s and str.join()s in this particular case.
What is the most efficient way to do this?
Take a look at the vectorized string methods of pandas dataframes.
http://pandas.pydata.org/pandas-docs/dev/text.html#text-string-methods
# You can call whatever vectorized string methods on the RHS
df['bar'] = df['foo']
e.g.
df = pd.DataFrame(['a c', 'b d'], columns=['foo'])
df['bar'] = df['foo'].str.split(' ').str.join('-')
print(df)
yields
   foo  bar
0  a c  a-c
1  b d  b-d
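When the processing does not map cleanly onto the .str accessor, a plain Python function applied with Series.map is a common fallback. The sketch below shows both forms producing the same column; the function name process and the column names are made up for illustration, and no claim is made here about which form is faster on a given dataset:

```python
import pandas as pd

df = pd.DataFrame(['a c', 'b d'], columns=['foo'])


def process(value):
    """Hypothetical per-element processing: split on spaces, then join with '-'."""
    return '-'.join(value.split(' '))


# Chained .str methods
df['bar_vectorized'] = df['foo'].str.split(' ').str.join('-')

# Element-wise fallback with an ordinary Python function
df['bar_mapped'] = df['foo'].map(process)

print(df)
```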
Pandas can do this for you. A simple example might look like:
foo = ["this", "is an", "example!"]
df = pd.DataFrame({'foo':foo})
df['upper_bar'] = df.foo.str.upper()
df['lower_bar'] = df.foo.str.lower()
df['split_bar'] = df.foo.str.split('_')
print(df)
which will give you
        foo upper_bar lower_bar   split_bar
0      this      THIS      this      [this]
1     is an     IS AN     is an     [is an]
2  example!  EXAMPLE!  example!  [example!]
See the link above from Alex
