I have a .tsv file dataset, and I transformed it into a DataFrame using Pandas.
Imagine that my_tsv_file was something like:
A Apple
B Orange
C Pear
To build the DataFrame I used:
df = pandas.read_csv(my_tsv_file, sep='\t')
Now, the first row of my_tsv_file was originally part of the data, but it has been turned into the header row of the new DataFrame. So now the DataFrame is something like:
A Apple
0 B Orange
1 C Pear
As "A" and "Apple" were keys, when they actually are not. I would like to add the correct "key row", in order to obtain something like:
ID Fruit
0 A Apple
1 B Orange
2 C Pear
How can I achieve this?
I can't modify the original .tsv file.
Please keep in mind that I am at the very beginning with Python and Pandas.
Have you tried
df = pandas.read_csv(my_tsv_file, sep='\t', names=['ID', 'Fruit'])
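For context: passing names= implies header=None, so pandas reads the first file line as data instead of using it as column labels. A minimal sketch, assuming my_tsv_file is the path to your file:
import pandas
# names= supplies the header, so every row of the file stays in the data
df = pandas.read_csv(my_tsv_file, sep='\t', names=['ID', 'Fruit'])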
I want to subset a DataFrame using two columns from different dataframes, keeping the rows where the values in the columns are the same. Here is an example of df1 and df2:
df1
A
0 apple
1 pear
2 orange
3 apple
df2
B
0 apple
1 orange
2 orange
3 pear
I would like the output to be a subsetted df1 based upon the df2 column:
A
0 apple
2 orange
I tried
df1 = df1[df1.A == df2.B]
but get the following error:
ValueError: Can only compare identically-labeled Series objects
I do not want to rename the column in either.
What is the best way to do this? Thanks
If you need to compare the index values together with the column values, create a MultiIndex and use Index.isin:
df = df1[df1.set_index('A', append=True).index.isin(df2.set_index('B', append=True).index)]
print(df)
A
0 apple
2 orange
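If the two frames are always the same length and you only care about row-by-row equality, a simpler sketch is to compare the underlying arrays, which sidesteps the label-alignment error:
# .to_numpy() strips the index labels, so the comparison is purely positional
df = df1[df1['A'].to_numpy() == df2['B'].to_numpy()]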
I have a list of strings, let's say:
fruit_list = ["apple", "banana", "coconut"]
And I have a Pandas DataFrame like this:
import pandas as pd
data = [['Apple farm', 10], ['Banana field', 15], ['Coconut beach', 14], ['corn field', 10]]
df = pd.DataFrame(data, columns = ['fruit_source', 'value'])
And I want to populate a new column based on a text search of the existing column 'fruit_source'. The new column should hold whichever list element matches the text in each row. One way of writing it is:
df["fruit"] = NaN
for index, row in df.iterrows():
for fruit in fruit_list:
if fruit in row['fruit_source']:
df.loc[index,'fruit'] = fruit
else:
df.loc[index,'fruit'] = "fruit not found"
This populates the dataframe with a new column recording which fruit each source refers to.
When expanding this out to a larger dataframe, though, this iteration can become a performance problem: every row is checked against every fruit, so the work grows with both the number of rows and the length of the list.
Is there more of an efficient method that can be done?
You can let Pandas do the work like so:
# Prime the column with the "fruit not found" default
df['fruit'] = "fruit not found"
for fruit in fruit_list:
    # Boolean mask of the rows matching this fruit (case-insensitive)
    mask = df['fruit_source'].str.contains(fruit, case=False)
    # Overwrite those rows with the name of the fruit
    df['fruit'] = df['fruit'].mask(mask, fruit)
print(df) will then say
fruit_source value fruit
0 Apple farm 10 apple
1 Banana field 15 banana
2 Coconut beach 14 coconut
3 corn field 10 fruit not found
Use str.extract with a regex pattern to avoid a loop:
import re
pattern = fr"({'|'.join(fruit_list)})"
df['fruit'] = df['fruit_source'].str.extract(pattern, flags=re.IGNORECASE) \
.fillna('fruit not found')
Output:
>>> df
fruit_source value fruit
0 Apple farm 10 Apple
1 Banana field 15 Banana
2 Coconut beach 14 Coconut
3 corn field 10 fruit not found
>>> pattern
'(apple|banana|coconut)'
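If you want the extracted values lowercased so they line up with fruit_list exactly, a small follow-up tweak (expand=False keeps the result a Series):
df['fruit'] = (df['fruit_source']
               .str.extract(pattern, flags=re.IGNORECASE, expand=False)
               .str.lower()
               .fillna('fruit not found'))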
I am cleaning a dataset and I need to remove formatting errors in column A if the value in column B matches a specific string.
A B
foo//, cherry
bar//, orange
bar//, cherry
bar apple
So in this situation if column B is 'cherry' I want to replace "//," with "," in column A. The final result would look like this.
A B
foo, cherry
bar//, orange
bar, cherry
bar apple
Any advice is much appreciated
You can simply write a function that takes a row as a Series, checks the cherry condition, fixes the string with str.replace, and returns the row. Then you can use df.apply over axis=1.
def fix(s):
    if s['B'] == 'cherry':
        s['A'] = s['A'].replace('//,', ',')
    return s

df.apply(fix, axis=1)
A B
0 foo, cherry
1 bar//, orange
2 bar, cherry
3 bar apple
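Note that apply returns a new DataFrame rather than modifying df in place, so assign the result back if you want to keep the fix:
df = df.apply(fix, axis=1)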
I would first check which rows contain cherry in the B column:
rows = df['B'].str.contains('cherry')
and then replace "//" with "" in column A for just those rows.
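A sketch completing that idea with a boolean mask and .loc (regex=False makes the replacement literal):
rows = df['B'].str.contains('cherry')
# Replace only in the masked rows of column A
df.loc[rows, 'A'] = df.loc[rows, 'A'].str.replace('//', '', regex=False)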
I am trying to drop rows in pandas based on whether the cells in column "Price" contain "/". I have referred to the question: Drop rows in pandas if they contains "???".
As such, I have tried both codes:
df = df[~df["Price"].str.contains('/')]
and
df = df[~df["Price"].str.contains('/',regex=False)]
However, both codes give the error:
AttributeError: Can only use .str accessor with string values!
For reference, the first few rows of my dataframe are as follows:
Fruit Price
0 Apple 3
1 Apple 2/3
2 Banana 2
3 Orange 6/7
May I know what went wrong and how can I fix this problem? Thank you very much!
Try this:
df = df[~df['Price'].astype(str).str.contains('/')]
print(df)
Fruit Price
0 Apple 3
2 Banana 2
You need to convert the Price column to string first and then apply this operation. I believe the Price column doesn't have string dtype:
df['Price'] = df['Price'].astype(str)
and then try
df = df[~df["Price"].str.contains('/',regex=False)]
After using transpose on a dataframe, there is always an extra row left over from the initial dataframe's index. For example:
import pandas as pd
df = pd.DataFrame({'fruit':['apple','banana'],'number':[3,5]})
df
fruit number
0 apple 3
1 banana 5
df.transpose()
0 1
fruit apple banana
number 3 5
Even when I have no index:
df.reset_index(drop = True, inplace = True)
df
fruit number
0 apple 3
1 banana 5
df.transpose()
0 1
fruit apple banana
number 3 5
The problem is that when I save the dataframe to a csv file by:
df.to_csv(f)
this extra row stays at the top and I have to remove it manually every time.
Also this doesn't work:
df.to_csv(f, index = None)
because the old index is no longer considered an index (just another row...).
It also happened when I transposed the other way around and I got an extra column which I could not remove.
Any tips?
I had the same problem; I solved it by setting a meaningful index before doing the transpose, i.e. df.set_index('fruit').transpose():
import pandas as pd
df = pd.DataFrame({'fruit':['apple','banana'],'number':[3,5]})
df
fruit number
0 apple 3
1 banana 5
And df.set_index('fruit').transpose() gives:
fruit apple banana
number 3 5
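Alternatively, if you only want to stop the old index from turning into a header row in the CSV, you can keep the transpose as-is and suppress the header when writing (assuming f is your target path):
# header=False skips the transposed frame's column labels (the old 0, 1 index)
df.transpose().to_csv(f, header=False)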
Instead of removing the extra index, why not try setting the new index that you want and then using slicing?
Step 1: Set the new index you want:
df.columns = df.iloc[0]
Step 2: Create a new dataframe, removing the extra row:
df_new = df[1:]
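Put together, this is the usual promote-first-row-to-header pattern; a self-contained sketch on a hypothetical frame whose first row carries the real labels:
import pandas as pd
raw = pd.DataFrame([['fruit', 'number'], ['apple', 3], ['banana', 5]])
raw.columns = raw.iloc[0]                # step 1: the first row becomes the header
df_new = raw[1:].reset_index(drop=True)  # step 2: slice that row out of the data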