Creating a DataFrame from a variable-length text string - Python

I am new to numpy and pandas. I am trying to add the words and their indexes to a dataframe. The text string can be of variable length.
from nltk.tokenize import word_tokenize
import numpy as np
import pandas as pd

text = word_tokenize('this string can be of variable length')
df2 = pd.DataFrame({'index': np.array([]), 'word': np.array([])})
for i in text:
    for i, row in df2.iterrows():  # note: the inner loop reuses i, shadowing the outer loop variable
        word_val = text[i]
        index_val = text.index(i)
        df2.set_value(i, 'word', word_val)    # set_value is deprecated in modern pandas
        df2.set_value(i, 'index', index_val)
print(df2)

To create a DataFrame from each word of your string (which can be of any length), you can directly use
df2 = pd.DataFrame(text, columns=['word'])
nltk's word_tokenize gives you a list of words, which can be used as the column data; by default, pandas takes care of the index.
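For completeness, a minimal runnable sketch of that one-liner (assuming nltk is installed and the punkt tokenizer data has been downloaded):

import pandas as pd
from nltk.tokenize import word_tokenize  # requires nltk plus the 'punkt' tokenizer data

text = word_tokenize('this string can be of variable length')
df2 = pd.DataFrame(text, columns=['word'])
print(df2)
#        word
# 0      this
# 1    string
# 2       can
# 3        be
# 4        of
# 5  variable
# 6    length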

Just pass the list directly into the DataFrame constructor:
pd.DataFrame(['i', 'am', 'a', 'fellow'], columns=['word'])
     word
0       i
1      am
2       a
3  fellow
I'm not sure you want to name a column 'index'; in this case its values would be the same as the index of the DataFrame itself. Also, it's not good practice to name a column 'index', as you won't be able to access it with the df.column_name syntax, and your code could be confusing to other people.
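If you do want the word positions as data rather than only as the DataFrame index, one option is reset_index with a neutral column name; 'position' below is just an illustrative name:

df2 = pd.DataFrame(text, columns=['word'])
df2 = df2.reset_index().rename(columns={'index': 'position'})
df2.position  # attribute access works, unlike a column literally named 'index'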

Related

How to generate a column in a pandas data frame using other columns and string formatting

I am trying to generate a third column in a pandas dataframe using two other columns in the dataframe. The requirement is very particular to my scenario.
The requirement is stated as:
Let the dataframe be df, the first column 'first_name', and the second column 'last_name'.
I need to generate the third column in such a manner that it uses string formatting to build a particular string, passes that string to a function, and uses whatever the function returns as the value of the third column.
Problem 1
base_string = "my name is {first} {last}"
df['summary'] = base_string.format(first=df['first_name'], last=df['last_name'])
Problem 2
df['summary'] = some_func(base_string.format(first=df['first_name'], last=df['last_name']))
My ultimate goal is to solve problem 2, but problem 1 is a prerequisite for that, and as of now I'm unable to solve it. I have tried converting my dataframe values to strings, but it is not working the way I expected.
You can use apply:
df['summary'] = df.apply(lambda r: base_string.format(first=r['first_name'],
                                                      last=r['last_name']),
                         axis=1)
Or a list comprehension:
df['summary'] = [base_string.format(first=x, last=y)
                 for x, y in zip(df['first_name'], df['last_name'])]
And then, for a general function some_func:
df['summary'] = [some_func(base_string.format(first=x, last=y))
                 for x, y in zip(df['first_name'], df['last_name'])]
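For example, with a toy frame (the names here are made up for illustration):

import pandas as pd

df = pd.DataFrame({'first_name': ['Ada', 'Alan'],
                   'last_name': ['Lovelace', 'Turing']})
base_string = "my name is {first} {last}"

df['summary'] = [base_string.format(first=x, last=y)
                 for x, y in zip(df['first_name'], df['last_name'])]
print(df['summary'].tolist())
# ['my name is Ada Lovelace', 'my name is Alan Turing']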
You could use pandas.DataFrame.apply with axis=1, so your code would look like this:
def mapping_function(row):
    # make your calculation here
    return value

df['summary'] = df.apply(mapping_function, axis=1)
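Putting it together with a stand-in some_func (just an upper-casing placeholder here, since the real function isn't shown in the question):

import pandas as pd

df = pd.DataFrame({'first_name': ['Ada'], 'last_name': ['Lovelace']})
base_string = "my name is {first} {last}"

def some_func(s):
    # placeholder for the question's unspecified function
    return s.upper()

def mapping_function(row):
    # format the string for this row, then pass it through some_func
    return some_func(base_string.format(first=row['first_name'],
                                        last=row['last_name']))

df['summary'] = df.apply(mapping_function, axis=1)
print(df['summary'][0])  # MY NAME IS ADA LOVELACE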

Check if one or more elements of a list are present in Pandas column

This question is an extension of the following question
Check if pandas column contains all elements from a list
In that question, the output keeps rows where all the members of a list are found in the Pandas column. My need is to check for one or more elements of the list, i.e. even if only one element of the list matches the Pandas column, I want that row in the output.
The sample data would be
frame = pd.DataFrame({'a' : ['a,b,c', 'a,c,f', 'b,d,f','a,z,c']})
letters = ['a','c','m']
I want to get all the rows from the df where one or more elements of the letters list are found
You can change issubset to isdisjoint to test for no common values, and then add ~ to invert the mask:
letters = ['a','c','m']
letters_s = set(letters)
df = frame[~frame.a.str.split(',').map(letters_s.isdisjoint)]
print(df)
       a
0  a,b,c
1  a,c,f
3  a,z,c
Alternatively, the first solution can be modified with np.any to test for at least one match:
contains = [frame['a'].str.contains(i) for i in letters]
df = frame[np.any(contains, axis=0)]
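Both approaches keep the same rows on the sample data; a minimal sketch to verify:

import numpy as np
import pandas as pd

frame = pd.DataFrame({'a': ['a,b,c', 'a,c,f', 'b,d,f', 'a,z,c']})
letters = ['a', 'c', 'm']

# set-based: keep rows whose letters are NOT disjoint from `letters`
out1 = frame[~frame.a.str.split(',').map(set(letters).isdisjoint)]

# contains-based: one boolean mask per letter, OR-ed together with np.any
contains = [frame['a'].str.contains(i) for i in letters]
out2 = frame[np.any(contains, axis=0)]

assert out1.equals(out2)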

Search for index value in a dataframe

I am looking to set different conditions depending on the index value.
I have the following index values:
country
Uk
Us
Es
In
It
Ge
Ho
where country is an index in my dataframe.
I would need to do the following
if index value is equal to 'Uk' then do something;
if index value is equal to 'Us' then do something else;
and so on.
I have tried as follows
if df.index.isin(['Us']) or df.isin(['Uk']):
    stop_words = stopwords.words('english')
if df.index.isin(['Es']):
    stop_words = stopwords.words('spanish')
but it is the wrong approach. I am not familiar with indices in pandas dataframes, as I have always used columns.
Help and suggestions are appreciated.
You can select rows of your dataframe by index label using .loc[]:
df.loc['Us', 'stop_words'] = 'english'
df.loc['Uk', 'stop_words'] = 'english'
df.loc['Es', 'stop_words'] = 'spanish'
This example will create a new column stop_words with 'english' or 'spanish' depending on the index.
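If there are many countries, a dictionary plus Index.map scales better than one .loc assignment per country; the language mapping below is only an illustration:

import pandas as pd

df = pd.DataFrame(index=pd.Index(['Uk', 'Us', 'Es', 'In'], name='country'))
lang = {'Uk': 'english', 'Us': 'english', 'Es': 'spanish'}
df['stop_words'] = df.index.map(lang)  # NaN where a country isn't in the dict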

I'm using Pandas in Python and want to know how to split a value in a column and search for that value in the column

Normally when splitting a value which is a string, one would simply do:
string = 'aabbcc'
small = string[0:2]
And that's simply it. I thought it would be the same thing for a dataframe by doing:
df = df['Column'][Range][Length of Desired value]
df = df['Date'][0:4][2:4]
Note: every string in the column has the same length, and all are integers written as strings.
If I use the code above, the program just ignores the Range and takes [2:4] as the range, which is weird.
When doing this individually it works:
df2 = df['Column'][index][2:4]
So right now I have a loop that goes one by one and appends to a new DataFrame.
To do the operation element-wise, you can use apply:
df['Column'][0:4].apply(lambda x : x[2:4])
When you do df2 = df['Column'][0:4][2:4], you are doing the same as df2 = df['Column'][2:4]:
the first slice takes rows 0 through 3 of the column, and the second slice then takes rows 2 through 3 of that result, so the characters are never sliced at all.
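Alternatively, pandas exposes string slicing directly through the .str accessor, which avoids the per-element lambda; using the question's df:

df['Column'][0:4].str[2:4]         # characters 2..3 of the first four rows
df['Column'][0:4].str.slice(2, 4)  # same thing, spelled out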

Scalable solution for str.contains with list of strings in pandas

I am parsing a pandas dataframe df1 containing string object rows. I have a reference list of keywords and need to delete every row in df1 containing any word from the reference list.
Currently, I do it like this:
reference_list = ["words", "to", "remove"]
df1 = df1[~df1[0].str.contains(r"words")]
df1 = df1[~df1[0].str.contains(r"to")]
df1 = df1[~df1[0].str.contains(r"remove")]
This is not scalable to thousands of words. However, when I do:
df1 = df1[~df1[0].str.contains(reference_word for reference_word in reference_list)]
I get the error: first argument must be string or compiled pattern.
Following this solution, I tried:
reference_list: "words|to|remove"
df1 = df1[~df1[0].str.contains(reference_list)]
This doesn't raise an exception, but it doesn't match all the words either.
How can I effectively use str.contains with a list of words?
For a scalable solution, do the following:
- join the contents of words with the regex OR pipe |
- pass this to str.contains
- use the result to filter df1
To index the 0th column, don't use df1[0] (as this might be considered ambiguous). It would be better to use loc or iloc (see below).
words = ["words", "to", "remove"]
mask = df1.iloc[:, 0].str.contains(r'\b(?:{})\b'.format('|'.join(words)))
df1 = df1[~mask]
Note: This will also work if words is a Series.
Alternatively, if your 0th column is a column of words only (not sentences), then you can use df.isin, which should be faster -
df1 = df1[~df1.iloc[:, 0].isin(words)]
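A quick sanity check of the word-boundary filter on toy data (the sentences are made up; the integer column label 0 matches the question's df1):

import pandas as pd

df1 = pd.DataFrame({0: ['keep this', 'words here', 'remove me', 'fine']})
words = ["words", "to", "remove"]

pattern = r'\b(?:{})\b'.format('|'.join(words))  # matches whole words only
df1 = df1[~df1.iloc[:, 0].str.contains(pattern)]
print(df1)
#            0
# 0  keep this
# 3       fine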
