How to add an element to every row in a data frame? - python

I have three types of columns in my dataframe: numeric, string and datetime.
I need to add the element | to the end of every value as a separator
I have tried:
df['column'] = df['column'] + '|'
but it does not work for the datetime columns, and I have to add .astype(str) to the numeric columns, which may result in formatting issues later.
Any other suggestions?

You can use DataFrame.to_csv() with sep="|" if you want to create a CSV.
Further documentation:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_csv.html
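A minimal sketch (the frame and column names here are made up):

import pandas as pd

df = pd.DataFrame({
    "amount": [1.5, 2.0],
    "name": ["alice", "bob"],
    "when": pd.to_datetime(["2021-01-04", "2021-02-05"]),
})

# to_csv stringifies every dtype and joins the fields with the separator
df.to_csv("out.csv", sep="|", index=False)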

Not too sure why you would want to do this, but if you want to make a CSV file with | as the delimiter, you can set that with df.to_csv('out.csv', sep='|'). I think a cleaner way of doing this in place would be to use a lambda function:
df['column'] = df['column'].apply(lambda x: f"{x}|")
Either way the column will end up as strings, though (the f-string does the conversion for you, so no explicit .astype(str) is needed).

This may help you in this case:
df['column'] = df['column'].astype(str) + "|"
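If you need this on every column at once, here is a sketch assuming you pin the datetime format explicitly first (the frame is made up):

import pandas as pd

df = pd.DataFrame({
    "amount": [1.5, 2.0],
    "name": ["alice", "bob"],
    "when": pd.to_datetime(["2021-01-04", "2021-02-05"]),
})

# Control the datetime formatting yourself, then stringify everything
df["when"] = df["when"].dt.strftime("%Y-%m-%d")
df = df.astype(str) + "|"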

Related

Converting all pandas column: row to key:value pair json

I am trying to add a new column at the end of my pandas dataframe that will contain the values of previous cells in key:value pair. I have tried the following:
import json
df["json_formatted"] = df.apply
(
lambda row: json.dumps(row.to_dict(), ensure_ascii=False), axis=1
)
It creates the column json_formatted successfully with all the required data, but the problem is that it also adds json_formatted itself as another extra key. I don't want that; I want the JSON data to contain only the information from the original df columns. How can I do that?
Note: I set ensure_ascii=False because the column names are in Japanese characters.
Create a new variable holding the created column and add it afterwards:
json_formatted = df.apply(lambda row: json.dumps(row.to_dict(), ensure_ascii=False), axis=1)
df['json_formatted'] = json_formatted
This behaviour shouldn't happen on a fresh DataFrame, but it can occur if you ran the function more than once (you added the column, then ran df.apply on the same dataframe).
You can avoid this by making your columns explicit: df[['col1', 'col2']].apply(...)
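A minimal sketch of that idea (col1, col2 and the sample values are made up):

import json
import pandas as pd

df = pd.DataFrame({"col1": [1, 2], "col2": ["a", "b"]})

# Restrict apply to the original columns, so re-running this cell
# never picks up the json_formatted column itself
original_cols = ["col1", "col2"]
df["json_formatted"] = df[original_cols].apply(
    lambda row: json.dumps(row.to_dict(), ensure_ascii=False), axis=1
)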
Apply is an expensive operation in Pandas, and if performance matters it is better to avoid it. An alternative way to do this is
df["json_formatted"] = [json.dumps(s, ensure_ascii=False) for s in df.T.to_dict().values()]

How do I strip data from a row in Pandas?

I have a Pandas dataframe and I need to strip out components.schemas.Person.properties and just call it id.
column                                    data_type   data_description
components.schemas.Person.properties.id  string      Unique Mongo Id generated for the person.
Like this?
df['column'] = df['column'].apply(lambda x: x.split('.')[-1])
or a more compact solution by @Chris Adams:
df['column'].str.split('.').str[-1]
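A quick demonstration on the sample value:

import pandas as pd

df = pd.DataFrame({"column": ["components.schemas.Person.properties.id"]})

# Keep only the part after the last dot
df["column"] = df["column"].str.split(".").str[-1]
print(df["column"].iloc[0])  # id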

How to delete everything after first space in Python?

I have a column in a data frame with dates in the format of “1/4/2021 0:00”. And I would like to get rid of everything after the first space, including the first space so that way it becomes “1/4/2021”.
How can I do that in Python? Also, does the column already have to be a specific data type in order to complete this task?
If you are using pandas, you can try the following, assuming the entire column follows a similar datetime format and is already a datetime dtype (convert it with pd.to_datetime first if it is not).
Your dataframe is called df, and your column of dates is date.
df['date'] = df['date'].dt.date
or
df['date'] = pd.to_datetime(df['date'].dt.date)
or
df['date'] = df['date'].dt.normalize()
Depending on what you want the format of your date column to be.
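A sketch of all three options, starting from strings like the one in the question:

import pandas as pd

df = pd.DataFrame({"date": ["1/4/2021 0:00", "1/5/2021 12:30"]})

# The .dt accessor requires a datetime column, so convert first
df["date"] = pd.to_datetime(df["date"])

as_date    = df["date"].dt.date                  # Python date objects
as_dt      = pd.to_datetime(df["date"].dt.date)  # datetime64, time dropped
normalized = df["date"].dt.normalize()           # datetime64, time set to 00:00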
Try this:
df['date'] = df['date'].apply(lambda x: x.split(' ')[0] if isinstance(x, str) else x)
Note that the split only applies to string values; anything that is not a string passes through unchanged.
To check the column's data type, run df.dtypes.
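If the column does hold strings throughout, a vectorized alternative (my own suggestion, not from the answer above) is:

import pandas as pd

df = pd.DataFrame({"date": ["1/4/2021 0:00", "1/5/2021 12:30"]})

# Keep everything before the first space
df["date"] = df["date"].str.split(" ").str[0]
print(df["date"].tolist())  # ['1/4/2021', '1/5/2021']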

How to remove commas from ALL the columns in pandas at once

I have a data frame where all the columns are supposed to be numbers. While reading it, some of them were read with commas. I know a single column can be fixed by
df['x']=df['x'].str.replace(',','')
However, this works only for Series objects and not for the entire data frame. Is there an elegant way to apply it to the entire data frame, since every single entry in the data frame should be a number?
P.S.: To ensure I can use str.replace, I have first converted the data frame to str by using
df.astype('str')
So I understand, I will have to convert them all to numeric once the comma is removed.
Numeric columns contain no commas, so converting to strings is not necessary; just use DataFrame.replace with regex=True for substring replacement:
df = df.replace(',','', regex=True)
Or:
df.replace(',','', regex=True, inplace=True)
And last, convert the string columns to numeric (thanks @anki_91):
c = df.select_dtypes(object).columns
df[c] = df[c].apply(pd.to_numeric, errors='coerce')
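Putting it together on a small made-up frame:

import pandas as pd

df = pd.DataFrame({"x": ["1,200", "3,400"], "y": ["5", "6,000"]})

# Strip commas everywhere, then convert the object columns to numbers
df = df.replace(",", "", regex=True)
c = df.select_dtypes(object).columns
df[c] = df[c].apply(pd.to_numeric, errors="coerce")
print(df.dtypes)  # x and y are now int64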
Well, you can simply do:
df = df.apply(lambda x: x.str.replace(',', ''))
(this assumes every column is already of string dtype, as in your df.astype('str') step). Hope it helps!
In case you want to manipulate just one column:
df.column_name = df.column_name.apply(lambda x: x.replace(',', ''))

Can't drop columns with pandas if index_col=0 is used while reading CSVs [duplicate]

I have the following code which imports a CSV file. There are 3 columns and I want to set the first two of them to variables. When I set the second column to the variable "efficiency" the index column is also tacked on. How can I get rid of the index column?
df = pd.DataFrame.from_csv('Efficiency_Data.csv', header=0, parse_dates=False)
energy = df.index
efficiency = df.Efficiency
print efficiency
I tried using
del df['index']
after I set
energy = df.index
which I found in another post but that results in "KeyError: 'index' "
When writing to and reading from a CSV file, include the arguments index=False and index_col=False, respectively. Here is an example:
To write:
df.to_csv(filename, index=False)
and to read from the csv
df = pd.read_csv(filename, index_col=False)
This should prevent the issue so you don't need to fix it later.
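A round-trip sketch using an in-memory buffer in place of a real file:

import io
import pandas as pd

df = pd.DataFrame({"Energy": [1, 2], "Efficiency": [0.9, 0.8]})

buf = io.StringIO()
df.to_csv(buf, index=False)   # no index column is written
buf.seek(0)
df2 = pd.read_csv(buf)        # nothing extra to drop on the way back
print(df2.columns.tolist())   # ['Energy', 'Efficiency']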
df.reset_index(drop=True, inplace=True)
DataFrames and Series always have an index. Although it displays alongside the column(s), it is not a column, which is why del df['index'] did not work.
If you want to replace the index with simple sequential numbers, use df.reset_index().
To get a sense for why the index is there and how it is used, see e.g. 10 minutes to Pandas.
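For instance:

import pandas as pd

df = pd.DataFrame({"Efficiency": [0.9, 0.8]}, index=["a", "b"])

# drop=True discards the old labels instead of turning them into a column
df = df.reset_index(drop=True)
print(df.index)  # RangeIndex(start=0, stop=2, step=1)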
You can set one of the columns as the index, in case it is an "id" for example; the default index will then be replaced by the column you have chosen.
df.set_index('id', inplace=True)
If your problem is the same as mine, where you just want to reset the column headers to 0 through the number of columns, do
df = pd.DataFrame(df.values)
EDIT:
Not a good idea if you have heterogeneous data types. Better to just use
df.columns = range(len(df.columns))
You can specify which column is the index in your CSV file by using the index_col parameter of the from_csv function.
If this doesn't solve your problem, please provide an example of your data.
One thing that I do is
df = df.reset_index()
then
df = df.drop(['index'], axis=1)
To avoid creating the default index column, you can set index_col to False and keep header as 0. Here is an example of how you can do it.
recording = pd.read_excel("file.xls",
                          sheet_name="sheet1",
                          header=0,
                          index_col=False)
header=0 turns the first row into your column headers, which you can later use to refer to the columns.
It works for me this way:
df = data.set_index("name of the column header to use as the index column")
