Python, Pandas: remove n letters of a string without removing the index/title

This is probably an easy fix that just eludes me right now.
I have an Excel file with the following content.
From it I want to strip the "Num-" prefix from the values. For simplicity's sake I use .str here.
import pandas as pd

df_test = pd.read_excel(r'C:\...\test.xlsx')
df_test = df_test.filter(like='Order_Number', axis=1)
df_test = df_test['Order_Number'].str[4:]
df_test.head()
The output comes out without the title Order_Number, though, and I am not sure why. How can I preserve it without adding it back manually?

It appears you are assigning the new values of the 'Order_Number' column to the entire dataframe variable instead of assigning them back to the column itself. Try:
df_test['Order_Number'] = df_test['Order_Number'].str[4:]
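For illustration, here is a minimal sketch of the difference on a toy frame (the order numbers are made up):
import pandas as pd

df_test = pd.DataFrame({'Order_Number': ['Num-0001', 'Num-0002']})
# Reassigning the whole variable leaves you with a plain Series,
# which no longer prints as a table with a column header:
just_a_series = df_test['Order_Number'].str[4:]
# Assigning back into the column keeps the DataFrame intact,
# so the Order_Number title is preserved:
df_test['Order_Number'] = df_test['Order_Number'].str[4:]
print(df_test.head())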

Related

How to use .translate() correctly to remove characters that are neither alphabetical nor numerical?

I want to remove symbols (most of them, but not all) from my data column 'Review'.
A little background on my code:
from pandas.core.frame import DataFrame
# convert to lower case
data['Review'] = data['Review'].str.lower()
# remove leading/trailing whitespace
data['Review'] = data['Review'].str.strip()
This is what I did based on what I read on the internet (I'm still at the beginner level of NLP, so don't be surprised to find more than one mistake; I just want to know what they are):
import string
sep = '|'
punctuation_chars = '"#$%&\'()*+,-./:;<=>?@[\\]^_`{}~'
mapping_table = str.maketrans(dict.fromkeys(punctuation_chars, ''))
data['Review'] = sep.join(DataFrame(data['Review']).tolist()).translate(mapping_table).split(sep)
However, I get the following error:
AttributeError: 'DataFrame' object has no attribute 'tolist'
How could I solve it? I want to use .translate() because I read it's more efficient than other methods.
The AttributeError is raised because DataFrame.tolist() doesn't exist. It looks like the code assumes that DataFrame(data['Review']) is a Series, but it is actually a DataFrame.
df = DataFrame(data['Review'])
translated_reviews = sep.join(df['Review'].tolist()).translate(mapping_table).split(sep)
It's unclear whether data is a DataFrame. If it is, just use it in the join() without calling tolist() or instantiating a new DataFrame.
translated_reviews = sep.join(data['Review']).translate(mapping_table).split(sep)
Your problem is that you were trying to create a DataFrame object from a column of your data DataFrame and then convert that to a list (the DataFrame(data['Review']).tolist() part). You can either use df.values.tolist(), which converts the whole DataFrame df to a list, or, if you just want to convert a single column, use data['Review'].tolist().
So in your situation the final line of your code becomes:
data['Review'] = sep.join(data['Review'].tolist()).translate(mapping_table).split(sep)
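Putting it together on toy data (the reviews here are made up), the join/translate/split round trip looks like this:
sep = '|'
punctuation_chars = '"#$%&\'()*+,-./:;<=>?@[\\]^_`{}~'
mapping_table = str.maketrans(dict.fromkeys(punctuation_chars, ''))

reviews = ['great product!', 'would buy again...', 'meh, ok']
# Join all reviews into one string, strip the punctuation in a single
# translate() call, then split back into individual reviews:
cleaned = sep.join(reviews).translate(mapping_table).split(sep)
print(cleaned)  # ['great product!', 'would buy again', 'meh ok']
One caveat: this trick breaks if a review itself contains the separator character, so pick a sep that cannot occur in the data.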

Code not permanently deleting rows with specific string in df

I'm trying to permanently delete all rows that contain a given string. I tried this code; it runs, but if you call df.head() afterwards it doesn't show that anything was dropped.
df[df["column"].str.contains('text')==False]
Try assigning the result back to df, like:
df = df[df["column"].str.contains('text')==False]
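As a side note, the same filter is usually written with ~ (boolean NOT) instead of comparing against False; the behavior is the same:
df = df[~df["column"].str.contains('text')]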

KeyError using Index

Apologies for the fairly basic question.
Basically, I have a large dataframe from which I'm pulling out the top dates by the sum of certain values. It looks like this:
hv_toploss = hv.groupby(['END_VALID_DT']).sum()
hv_toploss=hv_toploss.sort_values('TOTALPL',ascending=False).iloc[:10]
hv_toploss['END_VALID_DT'] = pd.to_datetime(hv_toploss['END_VALID_DT'])
Now, END_VALID_DT becomes the index of hv_toploss, and I get a KeyError when running line 3. If I try to reindex, I get a multi-index error, and since these are the values I need, I can't just drop the index.
I will be calling these values in a line like:
PnlByDay = PnlByDay.loc[hv_toploss['END_VALID_DT']]
Any help here would be great. I'm still a novice using Python.
You can use the index directly instead of creating another column containing the index.
the_dates = hv_toploss.sort_values('TOTALPL',ascending=False).iloc[:10].index
PnlByDay.loc[PnlByDay.index.isin(the_dates)]
I don't know the structure of PnlByDay, so you may have to modify that part.
OK, I got around this by just copying the index values into a new column and using that:
hv_toploss = hv.groupby(['END_VALID_DT']).sum()
hv_toploss['Scenario_Dates'] = hv_toploss.index
hv_toploss=hv_toploss.sort_values('TOTALPL',ascending=False).iloc[:10]
However, if anyone has input on how to do this properly, please advise.
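One way to avoid the problem from the start (a sketch, assuming hv has END_VALID_DT and TOTALPL columns as in the question) is to pass as_index=False to groupby, so END_VALID_DT stays a regular column and the pd.to_datetime line works unchanged:
hv_toploss = hv.groupby('END_VALID_DT', as_index=False).sum()
hv_toploss = hv_toploss.sort_values('TOTALPL', ascending=False).iloc[:10]
hv_toploss['END_VALID_DT'] = pd.to_datetime(hv_toploss['END_VALID_DT'])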

How can I create index for python pandas dataframe?

I am importing several csv files into Python using a Jupyter notebook and pandas, and some are created without a proper index column. Instead, the first column, which is data that I need to manipulate, is used as the index. How can I create a regular index column as the first column? This seems like a trivial matter, but I can't find any useful help anywhere.
(Screenshot: what my dataframe looks like.)
(Screenshot: what my dataframe should look like.)
Could you please try this:
df.reset_index(inplace=True, drop=True)
Let me know if this works.
When you are reading in the csv, use pandas.read_csv(..., index_col=#), where # is the number of the column to use as the index. If the files don't have a proper index column, set index_col=False.
To change the indices of an existing DataFrame df, try the methods df = df.reset_index() or df = df.set_index(#).
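For example, on a toy frame (the column names are made up):
import pandas as pd

df = pd.DataFrame({'id': [10, 20], 'val': ['a', 'b']})
df = df.set_index('id')   # promote the 'id' column to the index
df = df.reset_index()     # demote it back to a regular column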
When you imported your csv, did you use the index_col argument? It should default to None, according to the documentation. If you don't use the argument, you should be fine.
Either way, you can force it not to use a column by using index_col=False. From the docs:
Note: index_col=False can be used to force pandas to not use the first column as the index, e.g. when you have a malformed file with delimiters at the end of each line.
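For illustration, a minimal sketch (the file name data.csv is a placeholder):
import pandas as pd

# Default: pandas generates a RangeIndex (0, 1, 2, ...) and keeps
# every csv column as data.
df = pd.read_csv('data.csv')

# Explicitly forbid using the first column as the index, e.g. for a
# malformed file with trailing delimiters on each line:
df = pd.read_csv('data.csv', index_col=False)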
Python 3.8.5
pandas==1.2.4
pd.read_csv('file.csv', header=None)
I found the solution in the documentation: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html

Save Pandas dataframe with numeric column as text in Excel

I am trying to export a Pandas dataframe to Excel where all columns are of text format. By default, the DataFrame.to_excel() function lets Excel decide the data type. Exporting a column containing [1, 2, 'w'] results in the cells containing 1 and 2 being numeric and the cell containing 'w' being text. I'd like all rows in the column to be text (i.e. ['1', '2', 'w']).
I was able to solve the problem by converting the column I need to be text using .astype(str). However, if the data is large, I am concerned that I will run into performance issues. If I understand correctly, df[col] = df[col].astype(str) makes a copy of the data, which is not efficient.
import pandas as pd
df = pd.DataFrame({'a':[1,2,'w'], 'b':['x','y','z']})
df['a'] = df['a'].astype(str)
df.to_excel(r'c:\tmp\test.xlsx')
Is there a more efficient way to do this?
I searched SO several times and didn't see anything on this. Forgive me if this has been answered before. This is my first post, and I'm really happy to participate in this cool forum.
Edit: Thanks to the comments I've received, I see that "Converting a series of ints to strings - Why is apply much faster than astype?" gives me alternatives to astype(str). This is really useful. I also wanted to know whether astype(str) was inefficient because it made a copy of the data, which I now see it does not.
I don't think you'll have performance issues with that approach, since the data is not copied but replaced. You may also convert the whole dataframe to string type using
df = df.astype(str)
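For completeness, a sketch of the whole-frame conversion in context (same toy data as the question; the output path is a placeholder):
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 'w'], 'b': ['x', 'y', 'z']})
df = df.astype(str)        # convert every column to str in one call
print(df.dtypes)           # both columns now have dtype 'object'
df.to_excel(r'c:\tmp\test.xlsx')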
