I'm trying to add a " at the beginning and end of a column in a df.
E.g., initial dataframe:
A B
Hello What
I Is
AM MY
Output:
A B
Hello "What
I Is
AM MY"
You could use iat to index on the specific strings, and format them with:
df.iat[0,1] = f'"{df.iat[0,1]}'
df.iat[-1,1] = f'{df.iat[-1,1]}"'
print(df)
A B
0 Hello "What
1 I Is
2 AM MY"
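For anyone who wants to reproduce this, here is a minimal self-contained sketch of the same iat approach. The sample frame is rebuilt from the question, and looking up the column position with get_loc (instead of hard-coding 1) is my own addition, not part of the answer above.

```python
import pandas as pd

# Sample frame rebuilt from the question's data
df = pd.DataFrame({"A": ["Hello", "I", "AM"], "B": ["What", "Is", "MY"]})

# Integer position of column "B" (assumption: the target column is named "B")
col = df.columns.get_loc("B")

# Prefix the first cell and suffix the last cell of that column
df.iat[0, col] = f'"{df.iat[0, col]}'
df.iat[-1, col] = f'{df.iat[-1, col]}"'

print(df)
```

Only the first and last cells change; the rows in between are left untouched.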
I have a list of "states" over which I have to iterate:
states = ['antioquia', 'boyaca', 'cordoba', 'choco']
I have to go through one column of a pandas df and cut the state text off the string wherever it is found, so I tried:
df_copy['joined'].apply([(lambda x: x.replace(x,x[:-len(j)]) if x.endswith(j) and len(j) != 0 else x) for j in states])
And the result is:
Result wanted:
The joined column is the input and the p_joined column is the desired output.
If possible, I would also like to find the state anywhere in the string, not only at the end, and remove it.
Thanks in advance for your help.
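This will do what your question asks (regex=True is required in pandas 2.0 and later, where str.replace treats the pattern as a literal string by default):
df_copy['p_joined'] = df_copy.joined.str.replace('(' + '|'.join(states) + ')$', '', regex=True)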
Output:
joined p_joined
0 caldasantioquia caldas
1 santafeantioquia santafe
2 medelinantioquiamedelinantioquia medelinantioquiamedelin
3 yarumalantioquia yarumal
4 medelinantioquiamedelinantioquia medelinantioquiamedelin
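The question also asks about removing the state when it appears anywhere in the string, not only at the end. A rough sketch of both variants follows. The sample rows are copied from the output above; the re.escape step and the p_anywhere column name are my own assumptions, added for illustration.

```python
import re
import pandas as pd

states = ["antioquia", "boyaca", "cordoba", "choco"]
df_copy = pd.DataFrame(
    {"joined": ["caldasantioquia", "santafeantioquia", "medelinantioquiamedelinantioquia"]}
)

# Escape each name so regex metacharacters, if any, are matched literally
alternation = "|".join(re.escape(s) for s in states)

# Variant 1: remove a state name only when it ends the string
df_copy["p_joined"] = df_copy["joined"].str.replace(f"(?:{alternation})$", "", regex=True)

# Variant 2: remove a state name wherever it occurs (hypothetical column name)
df_copy["p_anywhere"] = df_copy["joined"].str.replace(alternation, "", regex=True)

print(df_copy)
```

Dropping the trailing $ is the only difference between the two patterns, so the third row becomes medelinmedelin in the second variant.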
I would like to keep only the first two words of each cell in a dataframe column.
For instance:
df = pd.DataFrame(["I'm learning Python", "I don't have money"])
I would like the column to end up with the following values:
"I'm learning" ; "I don't"
After that, if possible, I would like to put '*' around each word, so it would look like:
"*I'm* *learning*" ; "*I* *don't*"
Thanks for all the help!
You can use a regex with str.replace:
df[0].str.replace(r'(\S+)\s(\S+).*', r'*\1* *\2*', regex=True)
output:
0 *I'm* *learning*
1 *I* *don't*
Name: 0, dtype: object
As a new column:
df['new'] = df[0].str.replace(r'(\S+)\s(\S+).*', r'*\1* *\2*', regex=True)
output:
0 new
0 I'm learning Python *I'm* *learning*
1 I don't have money *I* *don't*
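If the regex is hard to read, roughly the same result can be had by splitting on whitespace. This is only a sketch of an alternative, not the method above, and it behaves differently on cells with a single word: the regex leaves those unchanged, while this version wraps the one word in asterisks.

```python
import pandas as pd

df = pd.DataFrame(["I'm learning Python", "I don't have money"])

def star_first_two(cell: str) -> str:
    """Keep the first two whitespace-separated words and wrap each in asterisks."""
    words = cell.split()[:2]
    return " ".join(f"*{w}*" for w in words)

# Apply the helper (hypothetical name) to every cell of column 0
df["new"] = df[0].apply(star_first_two)

print(df)
```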
I have the following string:
"hello, I'm going to eat to the fullest today hhhhhhhhhhhhhhhhhhhhh"
I have collected many tweets like that and put them in a dataframe. How can I clean those rows by removing "hhhhhhhhhhhhhhhhhh" and keeping only the rest of the string in each row?
I'm also using CountVectorizer later, and a lot of the vocabulary entries contained 'hhhhhhhhhhhhhhhhhhhhhhh'.
Using Regex.
Ex:
import pandas as pd
df = pd.DataFrame({"Col": ["hello, I'm going to eat to the fullest today hhhhhhhhhhhhhhhhhhhhh", "Hello World"]})
#df["Col"] = df["Col"].str.replace(r"\b(.)\1+\b", "", regex=True)
df["Col"] = df["Col"].str.replace(r"\s+(.)\1+\b", "", regex=True).str.strip()
print(df)
Output:
Col
0 hello, I'm going to eat to the fullest today
1 Hello World
You may try this:
df["Col"] = df["Col"].str.replace(u"h{4,}", "", regex=True)
Here 4 is the minimum number of consecutive characters to match; adjust it as needed.
Before:
Col
0 hello, I'm today hh hhhh hhhhhhhhhhhhhhh
1 Hello World
After:
Col
0 hello, I'm today hh
1 Hello World
I used unicode matching, since you mentioned you are working with tweets (in Python 3 every str is already Unicode, so the u prefix is optional).
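If the repeated character is not always h, a backreference can generalise the idea. The sketch below is my own variation rather than either answer's exact pattern: (\w)\1{3,} is an assumed pattern that matches any word character repeated four or more times in a row. Note that it deletes the whole run, so it would also eat into legitimate words that contain such a run.

```python
import pandas as pd

df = pd.DataFrame(
    {"Col": ["hello, I'm going to eat to the fullest today hhhhhhhhhhhhhhhhhhhhh", "Hello World"]}
)

# Remove any word character repeated four or more times in a row
cleaned = df["Col"].str.replace(r"(\w)\1{3,}", "", regex=True)

# Collapse the whitespace left behind and trim the ends
df["Col"] = cleaned.str.replace(r"\s+", " ", regex=True).str.strip()

print(df)
```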
I have a dataframe as follows:
Name Rating
0 ABC Good
1 XYZ Good #
2 GEH Good
3 ABH *
4 FEW Normal
I want to replace special characters in the Rating column: if a value contains #, the # should be replaced by Can be improve, and if it contains *, the * should be replaced by Very Poor. I tried the following, but it replaces the whole string, whereas I only want to replace the special character when it is present. It does work for the other case, where the value consists of the special character alone.
import pandas as pd
df = pd.DataFrame() # Load with data
df['Rating'] = df['Rating'].str.replace('.*#+.*', 'Can be improve')
is returning
Name Rating
0 ABC Good
1 XYZ Can be improve
2 GEH Good
3 ABH Very Poor
4 FEW Normal
Can anybody help me out with this?
import pandas as pd
df = pd.DataFrame({"Rating": ["Good", "Good #", "*"]})
df["Rating"] = df["Rating"].str.replace("#", "Can be improve", regex=False)
df["Rating"] = df["Rating"].str.replace("*", "Very Poor", regex=False)
print(df)
Output:
Rating
0 Good
1 Good Can be improve
2 Very Poor
You replace the whole string because the .* on either side of #+ matches everything before and after the symbol, so the entire value is part of the match.
If your special values are always at the end of the string you might use:
.str.replace(r'#$', "Can be improve", regex=True)
.str.replace(r'\*$', "Very Poor", regex=True)
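To keep the symbol-to-text pairs in one place, a small dictionary and a loop are enough. This is just a sketch under the assumption that each symbol should be replaced wherever it occurs; the replacements name is mine, and the frame is rebuilt from the question.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Name": ["ABC", "XYZ", "GEH", "ABH", "FEW"],
        "Rating": ["Good", "Good #", "Good", "*", "Normal"],
    }
)

# Symbol -> replacement text (hypothetical mapping name)
replacements = {"#": "Can be improve", "*": "Very Poor"}

for symbol, text in replacements.items():
    # regex=False treats the symbol literally, so "*" needs no escaping
    df["Rating"] = df["Rating"].str.replace(symbol, text, regex=False)

print(df)
```

Only the symbol itself is replaced, so "Good #" becomes "Good Can be improve" rather than losing the "Good" part.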
I have a text file. Its contents and number of lines can change each time, and each line has the following form:
string (one, two, or more words) ^ string of one word
Example:
level country ^ layla
hello sandra ^ organization
hello people ^ layla
hello samar ^ organization
I want to create a dataframe using pandas such that:
item0 (country, people)
item1 (sandra, samar)
For example, every time layla appears after the "^", I want to collect the rightmost word of the string before the "^" and put those words in the second column, which here gives (country, people). layla itself is labelled item0 and used as the index of the dataframe. I can't work out the logic for grouping the lines that share the same value after the "^" and returning the list of rightmost words that belong to each one. My attempt so far, which doesn't really do it, is:
def text_file(file):
    list=[]
    file_of_text = "text.txt"
    with open(file_of_context) as f:
        for l in f:
            l_dict = l.split(" ")
            list.append(l_dict)
    return(list)
def items(file_of_text):
    list_of_items= text_file(file_of_text)
    for a in list_of_items:
        for b in a:
            if a[-1]==
def main():
    file_of_text = "text.txt"
if __name__ == "__main__":
    main()
Start with pandas read_csv(), specifying '^' as the delimiter and using arbitrary column names:
import pandas as pd
df = pd.read_csv('data.csv', delimiter='^', names=['A', 'B'])
print (df)
A B
0 level country layla
1 hello sandra organization
2 hello people layla
3 hello samar organization
Then we split to get the values we want. The expand argument was added in pandas 0.16, I believe.
df['A'] = df['A'].str.split(' ', expand=True)[1]
print(df)
A B
0 country layla
1 sandra organization
2 people layla
3 samar organization
Then we group by column B and apply the tuple function. Note: we're resetting the index so we can use it later.
g = df.groupby('B')['A'].apply(tuple).reset_index()
print(g)
B A
0 layla (country, people)
1 organization (sandra, samar)
Finally, create a new column from the string 'item' and the index:
g['item'] = 'item' + g.index.astype(str)
print (g[['item','A']])
item A
0 item0 (country, people)
1 item1 (sandra, samar)
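Putting the steps together, here is a rough end-to-end sketch that runs without a file on disk. A few things differ from the steps above and are my own assumptions: the data comes from an in-memory string via io.StringIO, the separator is a regex that also strips the spaces around "^", the last word is taken with .str[-1] so lines with more than two words on the left still work, and sort=False keeps the groups in order of first appearance.

```python
import io
import pandas as pd

# In-memory stand-in for the text file (assumption: same layout as the question)
raw = """level country ^ layla
hello sandra ^ organization
hello people ^ layla
hello samar ^ organization
"""

# Split on "^" plus surrounding whitespace; a regex separator needs the python engine
df = pd.read_csv(io.StringIO(raw), sep=r"\s*\^\s*", engine="python", names=["A", "B"])

# Keep the rightmost word of column A, however many words it holds
df["A"] = df["A"].str.split().str[-1]

# Collect the words per key, keeping keys in order of first appearance
g = df.groupby("B", sort=False)["A"].apply(tuple).reset_index()

# Label the groups item0, item1, ... and use the labels as the index
g["item"] = "item" + g.index.astype(str)
g = g.set_index("item")

print(g)
```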
Let's assume that your file is called file_of_text.txt and contains the following:
level country ^ layla
hello sandra ^ organization
hello people ^ layla
hello samar ^ organization
You can get your data from a file to a dataframe similar to your desired output with the following lines of code:
import re
import pandas as pd
def main(myfile):
    # Open the file and read the lines (the with block closes the file afterwards)
    with open(myfile, 'r') as f:
        text = f.readlines()
    # Split the lines into lists
    text = [re.split(r"\s[\^\s]*", line.strip()) for line in text]
    # Put it in a DataFrame
    data = pd.DataFrame(text, columns=['A', 'B', 'C'])
    # Create an output DataFrame with rows "item0" and "item1"
    final_data = pd.DataFrame(['item0', 'item1'], columns=['D'])
    # Create your desired column
    final_data['E'] = data.groupby('C')['B'].apply(lambda x: tuple(x.values)).values
    print(final_data)
if __name__ == "__main__":
    myfile = "file_of_text.txt"
    main(myfile)
The idea is to read the lines from the text file and then split each line using the split method from the re module. The result is then passed to the DataFrame method to generate a dataframe called data, which is used to create the desired dataframe final_data. The result should look like the following:
# data
A B C
0 level country layla
1 hello sandra organization
2 hello people layla
3 hello samar organization
# final_data
D E
0 item0 (country, people)
1 item1 (sandra, samar)
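To see what the re.split pattern does to a single line, here is a small standalone check. It only illustrates the splitting step; the sample line is taken from the file contents above.

```python
import re

line = "level country ^ layla\n"

# \s matches one whitespace character; [\^\s]* then consumes any following
# run of "^" and whitespace, so " ^ " is swallowed as a single separator
parts = re.split(r"\s[\^\s]*", line.strip())

print(parts)  # ['level', 'country', 'layla']
```

Because the pattern splits on every whitespace run, a line whose left-hand side has more or fewer than two words yields a different number of fields, which would no longer line up with the three column names A, B and C.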
Please take a look at the script and ask further questions, if you have any.
I hope this helps.