How can I merge several columns into one cell?
How can I convert a CSV file that contains 1-by-X cells (where 1 is the row count and X is a column count unknown to the user) into a new CSV file in which a single cell combines all of the data from the original CSV file?
Right now (one row, four columns; in fact the number of columns will vary, as the data is extracted from a log file):
1 A B C D
My goal (one row, one column):
1 A
B
C
D
The row index will not always be 1, as I have many similar rows like that.
Please refer to the original file and the expected new file at the following URL for details:
https://drive.google.com/drive/folders/1cX0o86nbLAV5Foj5avCK6E2oAvExzt5t
161.csv is the current file.
161-final.csv is what I want.
The number of rows will not change. However, the number of columns varies because the data is extracted from a log file. In the end, I need each row to have only one column.
I am new to pandas. Would it work to count the columns and then merge them into one cell?
Any help is much appreciated.
code:
import pandas as pd
import numpy as np
df = pd.DataFrame([['a', 'a', 'a', 'a'], ['b', 'b', 'b', 'b']])
print(df)
df1 = np.empty(df.shape[0], dtype=object)
df1[:] = df.values.tolist()
print(df1)
output:
0 1 2 3
0 a a a a
1 b b b b
[list(['a', 'a', 'a', 'a']) list(['b', 'b', 'b', 'b'])]
Not sure this is exactly what you want, but you can put the contents of each row into a single column with groupby():
import numpy as np
import pandas as pd
df = pd.DataFrame(np.arange(4).reshape(1, 4))
df.columns = ['A', 'B', 'C', 'D']
df = df.groupby(df.index).apply(lambda x: x.values.ravel())
df = df.to_frame('unique_col') # If needed
Output:
unique_col
0 [0, 1, 2, 3]
I'm not sure you can get the output as something other than a list, as shown in your example.
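That said, if you want a plain string in that single column (the newline-separated layout from the question), one option is to join each row's values yourself. This is only a sketch and assumes every cell can be converted to str:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(4).reshape(1, 4), columns=['A', 'B', 'C', 'D'])
# join each row's values with newlines so they end up in a single cell
merged = df.astype(str).apply('\n'.join, axis=1).to_frame('unique_col')
merged.to_csv('merged.csv', index=False)   # 'merged.csv' is a placeholder name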
I used a "stupid" way to do it: pd.read_table(file) does the magic.
import pandas as pd

def CsvMerge(file1, file2, output):
    ## Preparation for getFormated():
    ## merge "Format.csv" and "text.csv".
    df1 = pd.read_csv(file1)
    ## read_table reads each line as a single field, which merges
    ## the columns into one column and also avoids a CSV parsing error
    df2 = pd.read_table(file2)
    df3 = pd.concat([df1, df2], axis=1)
    print(df3)
    with open(output, 'w+', encoding="utf-8") as f:
        df3.to_csv(f, index=False)
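A usage example, assuming the file names from the comments above ('merged.csv' is just a placeholder output name):
CsvMerge('Format.csv', 'text.csv', 'merged.csv')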
Related
I have a scenario where my file schema might change. For example, I might be getting 3 columns "A", "B", "C" now; next time, 2 new columns might be added to the file, giving "A", "B", "C", "D", "E".
In that case, I want to put the extra columns into an "others" column (of JSON type) in the dataframe, something like:
A B C others
------------
1 2 3 {"D":4,"E":5}
There can also be a case where a column is missing; for example, I might not get column A and receive only "B" and "C".
How do I handle this in PySpark?
You can use something like the following:
from pyspark.sql import SparkSession
from pyspark.sql.functions import struct, to_json

spark = SparkSession.builder.getOrCreate()

df = spark.read.csv("demo.csv", header=True)
df_1 = df.select([c for c in df.columns if c not in {'A', 'B', 'C'}])  # this will give your unknown columns
df.select("A", "B", "C", to_json(struct(df_1.columns)).alias("other")).show(10, False)
Note: demo.csv has the columns ['A', 'B', 'C', 'D', 'E'] (this list can vary), of which ['A', 'B', 'C'] are your fixed/known columns.
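The question also mentions the case where one of the known columns is missing from the file. One way to cover that (a sketch, not validated against the original data) is to add any absent known column as nulls before building the JSON column:
from pyspark.sql.functions import lit, struct, to_json

known = ["A", "B", "C"]

# add any known column that is missing in this particular file as nulls
for c in known:
    if c not in df.columns:
        df = df.withColumn(c, lit(None).cast("string"))

extra = [c for c in df.columns if c not in known]
others = to_json(struct(*extra)) if extra else lit(None).cast("string")
df.select(*known, others.alias("other")).show(10, False)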
I have a CSV file that contains 436 columns and 14k rows.
All of the data inside the cells is strings.
For example, it looks like this:
A,A,A,B,B,C,C,,,,,
D,F,D,F,D,F,H,,,,,
My goal is to keep only the unique values in every row, like this:
A,B,C,,,,,,,,
D,F,H,,,,,,,,
The file is a CSV/TXT file. I can use a Jupyter notebook (with Python 3, or any other code you provide), as that is my working environment. Any help would be amazing!
I have also loaded the CSV as a DataFrame in the notebook. What do you suggest?
First, read your CSV file into a NumPy array. Then, for each row, I'd do something like:
import numpy as np
s='A,A,A,B,B,C,C'
f=s.split(',')
np.unique(np.array(f))
which prints array(['A', 'B', 'C'], dtype='<U1') on Python 3 (dtype='|S1' on Python 2).
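A sketch of doing that for the whole file (assuming the file is called 'data.csv' and that the empty trailing fields should be dropped):
import numpy as np

rows = []
with open('data.csv') as fh:                                # 'data.csv' is a placeholder name
    for line in fh:
        fields = [f for f in line.strip().split(',') if f]  # drop the empty trailing fields
        rows.append(np.unique(np.array(fields)))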
If you have the csv loaded as a dataframe df:
0 1 2 3 4 5 6
0 A A A B B C C
1 D F D F D F H
Iterate over the rows and find the unique values in each row:
unique_vals = []
for _, row in df.iterrows():
    unique_vals.append(row.unique().tolist())
unique_vals
[['A', 'B', 'C'], ['D', 'F', 'H']]
You haven't mentioned the return data type so I've returned a list.
Edit: If the data set is too large, consider using the chunksize option in read_csv.
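A sketch of the chunked variant (assuming the file has no header row; 'data.csv' is a placeholder name):
import pandas as pd

unique_vals = []
for chunk in pd.read_csv('data.csv', header=None, chunksize=1000):
    for _, row in chunk.iterrows():
        unique_vals.append(row.dropna().unique().tolist())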
I am looking to change part of the strings in a column of a DataFrame. However, I cannot get the change to show up in the DataFrame. This is my code:
import pandas as pd
#File path
csv = '/home/test.csv'
#Read csv to pandas
df = pd.read_csv(csv, header=None, names=['A', 'B', 'C', 'D', 'E', 'F'])
#Select Data to update
paths = df['A']
#Loop over data
for x in paths:
    #Select data to update
    old = x[:36]
    #Update value
    new = '/Datasets/RetinaNetData'
    #Replace
    new_path = x.replace(old, new)
    #Save values to DataFrame
    paths.update(new_path)
#Print updated DataFrame
print(df)
The input and the output I would like are:
Input:
/Annotations/test_folder/10_m03293_ORG.png
/Annotations/test_folder/10_m03293_ORG.png
/Annotations/test_folder/10_m03293_ORG.png
/Annotations/test_folder/10_m03293_ORG.png
Output:
/Datasets/RetinaNetData/10_m03293_ORG.png
/Datasets/RetinaNetData/10_m03293_ORG.png
/Datasets/RetinaNetData/10_m03293_ORG.png
/Datasets/RetinaNetData/10_m03293_ORG.png
Assuming that all of the values are strings and all of them have at least 36 characters, you can use .str to get the part of each cell after the 36th character. Then you can just use the + operator to combine the new beginning with the remainder of each cell's contents:
df.A = '/Datasets/RetinaNetData' + df.A.str[36:]
As a general tip, methods like this that operate across the whole dataframe at once are going to be more efficient than looping over each row individually.
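As a quick check of the idea with the sample paths from the question (their directory prefix '/Annotations/test_folder/' happens to be 25 characters long, whereas the 36 above presumably matches the real data):
import pandas as pd

df = pd.DataFrame({'A': ['/Annotations/test_folder/10_m03293_ORG.png'] * 4})
df.A = '/Datasets/RetinaNetData/' + df.A.str[25:]
print(df.A.iloc[0])   # /Datasets/RetinaNetData/10_m03293_ORG.png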
Using Python, I want to count the number of cells in each row of a pandas DataFrame that have data in them, and record the count in the leftmost cell of the row.
To count the number of cells missing data in each row, you probably want to do something like this:
df.apply(lambda x: x.isnull().sum(), axis='columns')
Replace df with the name of your data frame.
You can create a new column and write the count to it using something like:
df['MISSING'] = df.apply(lambda x: x.isnull().sum(), axis='columns')
The column will be created at the end (rightmost) of your data frame.
You can move your columns around like this:
df = df[['Count', 'M', 'A', 'B', 'C']]
Update
I'm wondering if your missing cells are actually empty strings as opposed to NaN values. Can you confirm? I copied your screenshot into an Excel workbook. My full code is below:
df = pd.read_excel('count.xlsx', na_values=['', ' '])
df.head() # You should see NaN for empty cells
df['M']=df.apply(lambda x: x.isnull().sum(), axis='columns')
df.head() # Column M should report the values: first row: 0, second row: 1, third row: 2
df = df[['Count', 'M', 'A', 'B', 'C']]
df.head() # Column order should be Count, M, A, B, C
Notice the na_values parameter in the pd.read_excel method.
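Since the question actually asks for the number of cells that do contain data, recorded in the leftmost column, a small variation of the above might work (a sketch; 'Filled' is just a hypothetical column name):
import pandas as pd

df = pd.read_excel('count.xlsx', na_values=['', ' '])
# count the non-empty cells in each row and put the result in the first column
df.insert(0, 'Filled', df.notnull().sum(axis='columns'))
df.head()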
I am using pandas in Python 2.7 and read a csv file like this:
import pandas as pd
df = pd.read_csv("test_file.csv")
df has a column titled 'rating' and a column titled 'review'. I do some manipulations on df, for example:
df3 = df[df['rating'] != 3]
Now if I look in a debugger at df['review'] and df3['review'] I see this information:
df['review'] = {Series}0
df3['review'] = {Series}1
Also if I want to see the first element of df['review'] I use:
df['review'][0]
which is fine, but if I do the same for df3, I get this error:
df3['review'][0]
{KeyError}0L
However, it looks like I can do this:
df3['review'][1]
Can someone please explain the difference?
Indexing with an integer on a Series doesn't work like a list. In particular, df['review'][0] doesn't get the first element of the "review" column; it gets the element whose index label is 0:
In [4]: s = pd.Series(['a', 'b', 'c', 'd'], index=[1, 0, 2, 3])
In [5]: s
Out[5]:
1 a
0 b
2 c
3 d
dtype: object
In [6]: s[0]
Out[6]: 'b'
Presumably, in generating df3 you dropped the row with index 0. If you actually want to get the first element regardless of the index, use iloc:
In [7]: s.iloc[0]
Out[7]: 'a'
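So for the frames in the question, you could either use positional access on df3 or rebuild a clean 0..n-1 index after filtering (both shown as a sketch):
df3['review'].iloc[0]              # first element by position

df3 = df3.reset_index(drop=True)   # renumber the rows after filtering
df3['review'][0]                   # label 0 exists again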