Changing column in DataFrame - python

I am looking to change part of the string in a column of a DataFrame; however, I cannot get it to update in the DataFrame. This is my code:
import pandas as pd
# File path
csv = '/home/test.csv'
# Read csv into pandas
df = pd.read_csv(csv, header=None, names=['A', 'B', 'C', 'D', 'E', 'F'])
# Select data to update
paths = df['A']
# Loop over data
for x in paths:
    # Select data to update
    old = x[:36]
    # Update value
    new = '/Datasets/RetinaNetData'
    # Replace
    new_path = x.replace(old, new)
    # Save values to DataFrame
    paths.update(new_path)
# Print updated DataFrame
print(df)
The inputs and output I would like are:
Input:
/Annotations/test_folder/10_m03293_ORG.png
/Annotations/test_folder/10_m03293_ORG.png
/Annotations/test_folder/10_m03293_ORG.png
/Annotations/test_folder/10_m03293_ORG.png
Output:
/Datasets/RetinaNetData/10_m03293_ORG.png
/Datasets/RetinaNetData/10_m03293_ORG.png
/Datasets/RetinaNetData/10_m03293_ORG.png
/Datasets/RetinaNetData/10_m03293_ORG.png

Assuming that all of the rows are strings and all of them have at least 36 characters, you can use .str to get the part of the cells after the 36th character. Then you can just use the + operator to combine the new beginning with the remainder of each cell's contents:
df.A = '/Datasets/RetinaNetData' + df.A.str[36:]
As a general tip, methods like this that operate across the whole dataframe at once are going to be more efficient than looping over each row individually.
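As an illustration of that tip, here is a minimal, self-contained sketch (the sample paths come from the question, but the 25-character slice is an assumption that matches the length of the '/Annotations/test_folder/' prefix; adjust it to your real data):
import pandas as pd

# Hypothetical data mirroring column 'A' from the question
df = pd.DataFrame({'A': ['/Annotations/test_folder/10_m03293_ORG.png'] * 4})
# One vectorized step replaces the old prefix for every row at once
df['A'] = '/Datasets/RetinaNetData/' + df['A'].str[25:]
print(df['A'].iloc[0])  # /Datasets/RetinaNetData/10_m03293_ORG.png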

Related

Can't access DataFrame elements after reading from CSV

I'm creating a matrix and converting it into a DataFrame after creation. Since I'm working with lots of data and it takes a while to create, I wanted to store the matrix in a CSV so I can just read it back once it has been created. This is what I'm doing:
transitions = create_matrix(alpha, N)
# convert the matrix to a DataFrame
df = pd.DataFrame(transitions, columns=list(tags), index=list(tags))
df.to_csv(r'D:\U\k\Desktop\prg\F_transition_' + language + '.csv')
df_r = pd.read_csv('transition_en.csv')
The fact is that after reading from CSV I got the error:
in get_loc raise KeyError(key). KeyError: 'O'
It seems this is thrown by those lines of code:
if i == 0:
    tran_pr = df_r.loc['O', tag]
else:
    tran_pr = df_r.loc[st[-1], tag]
I imagine that once the data is stored in a CSV, reading the file back does not give me a DataFrame equivalent to the one I had before. How can I adapt these lines of code so the lookup works like it did before?
I tried setting index=False when creating the CSV and also skip_blank_lines=True when reading. Nothing changes.
df_r looks like this (screenshot not reproduced here):
can you try:
import pandas as pd
df = pd.DataFrame([[1, 2], [2, 3]], columns = ['A', 'B'], index = ['C', 'D'])
print(df['A']['C'])
When using .loc you need to provide the index first and then the column, so
df_r.loc[tag, 'O']
will work.
Don't use index=False when writing the CSV, as that will not include the index in the file, so it won't be there in the DataFrame you read back.
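For completeness, a minimal sketch of the round trip (the file name, labels, and values here are made up): passing index_col=0 to read_csv tells pandas to use the first CSV column as row labels again, after which the original df_r.loc['O', tag] style lookup works:
import pandas as pd

# Hypothetical labelled matrix, written to CSV and read back
df = pd.DataFrame([[0.1, 0.9], [0.4, 0.6]], columns=['O', 'N'], index=['O', 'N'])
df.to_csv('transition_en.csv')

# index_col=0 restores the row labels from the first column of the file
df_r = pd.read_csv('transition_en.csv', index_col=0)
print(df_r.loc['O', 'N'])  # row label first, then column label -> 0.9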

Remove data in csv after _ characters per cell python

I have a csv file that has some info inside of it. For my use case, I only need the first four characters in every cell.
So, using Python, I ideally need a solution that will remove all characters in each cell after the first four, and optionally remove all spaces. If I could be pointed in the right direction, that'd be great!
one        two        three
OneOneOne  TwoTwoTwo  ThreeThreeThree

My ideal output should look like:

one   two   three
OneO  TwoT  Thre
It seems like your data contains some numeric values that are not of string type. In that case, you can convert the data to string first, then remove all spaces, and finally take the first 4 characters of each converted string, as follows:
df = pd.read_csv("mycsv.csv") # read csv if not already read
df = df.apply(lambda x: x.astype(str).str.replace(' ', '').str[0:4])
df.to_csv("mycsv.csv") # save to csv
If you don't need to remove spaces, you can use:
df = pd.read_csv("mycsv.csv") # read csv if not already read
df = df.apply(lambda x: x.astype(str).str[0:4])
df.to_csv("mycsv.csv") # save to csv
Result:
print(df)
one two three
0 OneO TwoT Thre
Edit
If you want to apply this to only specific columns, for example only columns one and two, you can use:
df = pd.read_csv("mycsv.csv") # read csv if not already read
df[['one', 'two']] = df[['one', 'two']].apply(lambda x: x.astype(str).str.replace(' ', '').str[0:4])
df.to_csv("mycsv.csv") # save to csv
Adapting the answer by @SeaBean to show how to apply it to just selected columns:
df = pd.read_csv("mycsv.csv")  # read csv if not already read
cols = ['col_1', 'col_2']      # cols to apply
for col in cols:
    df[col] = df[col].astype(str).str[0:4]
df.to_csv("mycsv.csv")         # save to csv
There may be a better way, but I think this could get you started:
import pandas as pd
df = pd.read_csv("myfile.csv")
# remove spaces and keep first four letters
df = df.applymap(lambda x: x.replace(' ', '')[:4])
Update to account for non-string columns. This only changes string columns. If you want to truncate numbers as well, other answers have covered that.
import pandas as pd
file = "myfile.csv"
df = pd.read_csv(file)
# select only columns of type str
cols = (df.applymap(type) == str).all(0)
# first 4 letters of each cell
first_four_no_space = lambda x: x.replace(' ', '')[:4]
df.loc[:, cols] = df.loc[:, cols].applymap(first_four_no_space)
# Warning! This will overwrite your existing file.
# I would rename the output, but it sounds like you want to
# overwrite. Uncomment if you want to overwrite your existing
# file.
# df.to_csv(file, index=False)

Pandas, groupby() function that separates out by "," but retains datatype?

Is there a way to use the .groupby() function to consolidate repeating rows in a data frame, separate out non-similar elements by a ',', and have the resulting .groupby() data frame retain the original datatype of the non-similar elements / convert non-similar items to an object?
As I understand, a column in pandas can hold multiple datatypes, so I feel like this should be possible.
I can use the .agg() function to separate out non-similar elements by a ',', but it doesn't work with non-string elements. I'd like to separate out the datatypes for error checking later when looking for rows with bad entries after the .groupby().
#Libraries
import pandas as pd
import numpy as np
#Example dataframe
col = ['Acol', 'Bcol', 'Ccol', 'Dcol']
df = pd.DataFrame(columns = col)
df['Acol'] = [1,1,2,3]
df['Bcol'] = ['a', 'b', 'c', 'd']
df['Ccol'] = [1,2,3,4]
df['Dcol'] = [1,'a',2,['a', 'b']]
#Code
outdf = df.groupby(by='Acol').agg(lambda x: ','.join(x)).reset_index()
outdf
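One possible direction, sketched below using the question's own example DataFrame (this is an assumption, not an answer from the thread): aggregating the grouped values into Python lists instead of joining them into a string keeps every element together with its original type, which can then be inspected for bad entries later.
# Sketch: collect grouped values as lists, preserving each element's original type
outdf = df.groupby(by='Acol').agg(list).reset_index()
print(outdf)
# If a comma-separated string is required instead, converting each element to str first
# also avoids the join error, at the cost of losing the original types:
outdf_str = df.groupby(by='Acol').agg(lambda x: ','.join(map(str, x))).reset_index()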

Python csv Merge several column to one cell

How can I merge several columns into one cell?
How can I convert a CSV file that contains 1-by-X cells, where 1 is the row count and X is a column count unknown to the user, into a new CSV file containing a single cell that combines all of the data from the original CSV file?
Right now (one row, four columns; in fact there will be a variable number of columns, as the data is extracted from a log file):
1 A B C D
My purpose (one row, one column):
1 A
B
C
D
The index of the row may not always be 1, as I have many similar rows like that.
Please refer to the original file and the expected new file at the following URL for details:
https://drive.google.com/drive/folders/1cX0o86nbLAV5Foj5avCK6E2oAvExzt5t
161.csv is the file at present.
161-final.csv is what I want...
The count of rows will not change. However, the count of columns is variable, as the data is extracted from a log file. In the end, each row only needs to have one column.
I am just a beginner with pandas. Would one way be to calculate the count of columns and then merge them into one cell?
I would very much appreciate your help.
code:
import pandas as pd
import numpy as np
df = pd.DataFrame([['a', 'a', 'a', 'a'], ['b', 'b', 'b', 'b']])
print(df)
df1 = np.empty(df.shape[0], dtype=object)
df1[:] = df.values.tolist()
print(df1)
output:
0 1 2 3
0 a a a a
1 b b b b
[list(['a', 'a', 'a', 'a']) list(['b', 'b', 'b', 'b'])]
Not sure this is what you want, but you can manage to put the content of one row into a single column with groupby():
import pandas as pd
import numpy as np
df = pd.DataFrame(np.arange(4).reshape(1, 4))
df.columns = ['A', 'B', 'C', 'D']
df = df.groupby(df.index).apply(lambda x: x.values.ravel())
df = df.to_frame('unique_col')  # If needed
Output :
unique_col
0 [0, 1, 2, 3]
I'm not sure it's possible to have the output not be a list, as you showed in your example.
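If a single string per row (rather than a list) would do, here is a minimal sketch of an alternative, assuming all values can be cast to str (the column name merged is just a placeholder):
import pandas as pd

df = pd.DataFrame([['A', 'B', 'C', 'D']])
# Cast everything to str, then join each row into one newline-separated string
merged = df.astype(str).apply(lambda row: '\n'.join(row), axis=1).to_frame('merged')
print(merged)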
I used a "stupid" way to do it; it is pd.read_table(file) that does the magic:
import pandas as pd

def CsvMerge(file1, file2, output):
    # It is for preparation of the getFormated()
    # Merge the "Format.csv" and the "text.csv.csv"
    df1 = pd.read_csv(file1)
    # This is for merging the columns into one column
    # It can also avoid a csv error
    df2 = pd.read_table(file2)
    df3 = pd.concat([df1, df2], axis=1)
    print(df3)
    with open(output, 'w+', encoding="utf-8") as f:
        df3.to_csv(f, index=False)

Count non-empty cells in pandas dataframe rows and add counts as a column

Using Python, I want to count the number of cells in each row of a pandas DataFrame that have data in them, and record the count in the leftmost cell of the row.
To count the number of cells missing data in each row, you probably want to do something like this:
df.apply(lambda x: x.isnull().sum(), axis='columns')
Replace df with the label of your data frame.
You can create a new column and write the count to it using something like:
df['MISSING'] = df.apply(lambda x: x.isnull().sum(), axis='columns')
The column will be created at the end (rightmost) of your data frame.
You can move your columns around like this:
df = df[['Count', 'M', 'A', 'B', 'C']]
Update
I'm wondering if your missing cells are actually empty strings as opposed to NaN values. Can you confirm? I copied your screenshot into an Excel workbook. My full code is below:
df = pd.read_excel('count.xlsx', na_values=['', ' '])
df.head() # You should see NaN for empty cells
df['M']=df.apply(lambda x: x.isnull().sum(), axis='columns')
df.head() # Column M should report the values: first row: 0, second row: 1, third row: 2
df = df[['Count', 'M', 'A', 'B', 'C']]
df.head() # Column order should be Count, M, A, B, C
Notice the na_values parameter in the pd.read_excel method.
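Since the question asks for the count of cells that do have data (rather than the missing ones), here is a minimal sketch of that variant, under the same assumption that empty cells are read in as NaN; the sample values are made up:
import pandas as pd

# Hypothetical frame with a few missing values
df = pd.DataFrame({'A': [1, None, 3], 'B': ['x', 'y', None], 'C': [None, None, 'z']})
# notnull() marks cells that contain data; summing across columns gives a per-row count
df['Count'] = df.notnull().sum(axis='columns')
print(df)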
