I have a df which looks like:
BBG.LON.123.S_CAR_ADJ_DPS 343.94325
BBG.LON.436.S_CAR_ADJ_DPS 236.51530
I am trying to rename the row names (removing the '_CAR_ADJ_DPS' element of each row name) and name the index 'id', so my resulting df looks like:
id
BBG.LON.123.S 343.94325
BBG.LON.436.S 236.51530
I have tried using the following line without success:
pd.DataFrame(pd.Series(np.unique([row.split('_')[0] for row in df.rows]), name='id'))
What can I try next?
I think you can use str.split with rename_axis (new in pandas 0.18.0):
print (df)
a
BBG.LON.123.S_CAR_ADJ_DPS 343.94325
BBG.LON.436.S_CAR_ADJ_DPS 236.51530
df.index = df.index.str.split('_').str[0]
df = df.rename_axis('id')
# if using pandas below 0.18.0:
# df.index.name = 'id'
print (df)
a
id
BBG.LON.123.S 343.94325
BBG.LON.436.S 236.51530
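For completeness, here is a copy/pastable version of the same approach, with the example frame built in place (the column name 'a' is just a placeholder from the output above):

import pandas as pd

df = pd.DataFrame({'a': [343.94325, 236.51530]},
                  index=['BBG.LON.123.S_CAR_ADJ_DPS',
                         'BBG.LON.436.S_CAR_ADJ_DPS'])

df.index = df.index.str.split('_').str[0]  # keep everything before the first '_'
df = df.rename_axis('id')                  # name the index 'id'
print(df)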
You may also be interested in str.extract to pull out the entries as columns:
In [11]: df[0].str.extract(r'(?P<A>.*)\.(?P<B>.*)\.(?P<C>\d+)\.(?P<D>.)_.*', expand=True)
Out[11]:
A B C D
0 BBG LON 123 S
1 BBG LON 436 S
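If the identifiers live in the index rather than in a column, the same regex can be applied through the index's string accessor (a sketch; Index.str.extract with expand=True returns a DataFrame in recent pandas versions):

parts = df.index.str.extract(r'(?P<A>.*)\.(?P<B>.*)\.(?P<C>\d+)\.(?P<D>.)_.*', expand=True)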
df = pd.DataFrame(data=JLSData)
DataFrame = df.transpose()
DataFrame.head()
Employees = DataFrame['Employees']
That is, I want to be able to index the data frame by date and by the column names.
Without a copy/pastable example and desired output, this is what I can give you:
df = pd.DataFrame(data=JLSData)
df = df.transpose()
df.columns = df.iloc[0, :]
df = df.iloc[1:, :]
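Absent the real JLSData, here is a minimal self-contained sketch of that header-promotion pattern, with made-up data standing in for JLSData (an assumption, since the question does not show it):

import pandas as pd

# Hypothetical stand-in for JLSData: each key becomes a row after transposing
JLSData = {'c0': ['Date', 'Employees'],
           'c1': ['2016-01-01', 10],
           'c2': ['2016-02-01', 12]}

df = pd.DataFrame(data=JLSData)
df = df.transpose()           # rows are now c0, c1, c2
df.columns = df.iloc[0, :]    # the first row holds the real headers ('Date', 'Employees')
df = df.iloc[1:, :]           # drop the header row from the data
df = df.set_index('Date')     # now df['Employees'] and df.loc['2016-01-01'] both work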
df_new = df[1:]
df_new.columns = df.iloc[0]
To set the date column as the index:
df_new.rename(columns={0: 'Date'}, inplace=True)  # rename the first header cell (assumed to be 0 here) to 'Date'
df_new = df_new.set_index('Date')
Are you trying to rename the first column that contains dates (where the column name says "Unnamed: 0") to DATE? If so, then do this:
df = df.rename(columns={df.columns[0]: "whatever"})
where whatever is the name you want to give.
If you can provide a copy of the data in text format, and provide more clarity, we can try and provide a better answer.
I have two DataFrames:
df = pd.DataFrame({'ID': ['bumgm001', 'lestj001',
'tanam001', 'hellj001', 'chacj001']})
df1 = pd.DataFrame({'playerID': ['bumgama01', 'lestejo01',
'tanakama01', 'hellije01', 'chacijh01'],
'retroID': ['bumgm001', 'lestj001', 'tanam001', 'hellj001', 'chacj001']})
OR
df            df1
ID            playerID      retroID
'bumgm001'    'bumgama01'   'bumgm001'
'lestj001'    'lestejo01'   'lestj001'
'tanam001'    'tanakama01'  'tanam001'
'hellj001'    'hellije01'   'hellj001'
'chacj001'    'chacijh01'   'chacj001'
Now, my actual DataFrames are a little more complicated than this, but I simplified it here so it's clearer what I'm trying to do.
I would like to take all of the ID's in df and replace them with the corresponding playerID's in df1.
My final df should look like this:
df
ID
'bumgama01'
'lestejo01'
'tanakama01'
'hellije01'
'chacijh01'
I have tried to do it using the following method:
for row in df.itertuples(): #row[1] == the retroID column
playerID = df1.loc[df1['retroID']==row[1], 'playerID']
df.loc[df['ID']==row[1], 'ID'].replace(to_replace=
df.loc[df['ID']==row[1], 'ID'], value=playerID)
The code seems to run just fine, but my retroIDs in df have been changed to NaN rather than the proper playerIDs.
This strikes me as a datatype problem, but I'm not familiar enough with Pandas to diagnose any further.
EDIT:
Unfortunately, I made my example too simplistic. I edited to better represent the issue I'm having. I'm trying to look up the item from one DataFrame in a second DataFrame, then I want to replace the item from the first Dataframe with an item from the corresponding row of the second Dataframe. The columns DO NOT have the same name.
You can use the second dataframe as a dictionary for replacement:
to_replace = df1.set_index('retroID')['playerID'].to_dict()
df['ID'].replace(to_replace, inplace=True)
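Put together with the example frames, a copy/pastable sketch:

import pandas as pd

df = pd.DataFrame({'ID': ['bumgm001', 'lestj001', 'tanam001', 'hellj001', 'chacj001']})
df1 = pd.DataFrame({'playerID': ['bumgama01', 'lestejo01', 'tanakama01', 'hellije01', 'chacijh01'],
                    'retroID': ['bumgm001', 'lestj001', 'tanam001', 'hellj001', 'chacj001']})

to_replace = df1.set_index('retroID')['playerID'].to_dict()  # retroID -> playerID lookup
df['ID'] = df['ID'].replace(to_replace)
print(df)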
According to your example, this is what you want:
df['ID'] = df1['playerID']
If data is not in order (row 1 from df is not the same as row 1 from df1) then use
df['ID'] = df1.set_index('retroID').reindex(df['ID'])['playerID'].values
Credit to Wen for second approach
Output
ID
0 bumgama01
1 lestejo01
2 tanakama01
3 hellije01
4 chacijh01
Let me know if it's correct
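For reference, Series.map with the same lookup gives an equivalent result (a common idiom, not taken from the answers above); note that unmatched IDs become NaN rather than being left in place:

df['ID'] = df['ID'].map(df1.set_index('retroID')['playerID'])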
OK, I've figured out a solution. As it turns out, my problem was a type problem. I updated my code from:
for row in df.itertuples(): #row[1] == the retroID column
playerID = df1.loc[df1['retroID']==row[1], 'playerID']
df.loc[df['ID']==row[1], 'ID'].replace(to_replace=
df.loc[df['ID']==row[1], 'ID'], value=playerID)
to:
for row in df.itertuples(): #row[1] == the retroID column
playerID = df1.loc[df1['retroID']==row[1], 'playerID'].values[0]
df.loc[df['ID']==row[1], 'ID'].replace(to_replace=
df.loc[df['ID']==row[1], 'ID'], value=playerID)
This works because "playerID" is now a scalar value (thanks to .values[0]) rather than a Series, which replace could not use as a replacement value.
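The difference is easy to see in isolation: .loc with a boolean mask returns a Series even when only one row matches, and .values[0] pulls out the underlying scalar (a minimal illustration using the example frames):

s = df1.loc[df1['retroID'] == 'bumgm001', 'playerID']
print(type(s))        # <class 'pandas.core.series.Series'>
print(s.values[0])    # bumgama01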
I have a database with a sample as below (screenshot of the source table omitted). The data frame is generated when I load the data in Python with the code below:
import os
import pandas as pd
data_dir="D:\\userdata\\adbharga\\Desktop\\AVA\\PythonCoding\\VF-Aus\\4G Cell Graphs"
os.chdir(data_dir)
df = pd.read_csv('CA Throughput(Kbit_s) .csv', index_col=None, header=0)
Output: (screenshot of the resulting DataFrame, which contains repeated Date columns, omitted)
Is there any way by which we can avoid reading duplicate columns in Pandas, or remove the duplicate columns after reading?
Please note: the column names are different once the data is read into Pandas (read_csv mangles duplicates to Date.1, Date.2, and so on), so a command like df = df.loc[:, ~df.columns.duplicated()] won't work. The actual database is very big and has many duplicate columns containing only dates.
There are 2 ways you can do this.
Ignore columns when reading the data
pandas.read_csv has the argument usecols, which accepts a list of column indices.
So you can try:
# work out required columns
df = pd.read_csv('file.csv', header=0)
cols = [0] + list(range(1, len(df.columns), 2))
# use column integer list
df = pd.read_csv('file.csv', usecols=cols)
Remove columns from dataframe
You can use similar logic with pd.DataFrame.iloc to remove unwanted columns.
# cols as defined in previous example
df = df.iloc[:, cols]
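An end-to-end sketch of that two-pass read, using a small inline CSV in place of the real file (and assuming, as in the screenshot, that the duplicate Date columns alternate with the value columns):

import pandas as pd
from io import StringIO

data = '''Date,Value1,Date,Value2
2018-01-01,0,2018-01-01,1
2018-01-02,0,2018-01-02,1'''

# first pass: discover how many columns there are
df = pd.read_csv(StringIO(data), header=0)
cols = [0] + list(range(1, len(df.columns), 2))  # keep column 0 plus every odd column

# second pass: read only the wanted columns
df = pd.read_csv(StringIO(data), usecols=cols)
print(df)
#          Date  Value1  Value2
# 0  2018-01-01       0       1
# 1  2018-01-02       0       1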
One way to do it is to read only the first row and create a mask using drop_duplicates(). We pass this to usecols without needing to specify the indices beforehand. It should be failsafe.
m = pd.read_csv(StringIO(data), nrows=1, header=None).T.drop_duplicates().index
df = pd.read_csv(StringIO(data), usecols=m)
Full example:
import pandas as pd
from io import StringIO
data = '''\
Date,Value1,Date,Value2
2018-01-01,0,2018-01-01,1
2018-01-02,0,2018-01-02,1'''
m = pd.read_csv(StringIO(data), nrows=1, header=None).T.drop_duplicates().index
df = pd.read_csv(StringIO(data), usecols=m)
print(df)
# Date Value1 Value2
#0 2018-01-01 0 1
#1 2018-01-02 0 1
Another way to do it would be to remove all columns with a dot (.) in the name: read_csv mangles duplicate column names to Date.1, Date.2, and so on. This should work in most cases, as the dot is rarely used in column names:
df = df.loc[:,~df.columns.str.contains('.', regex=False)]
Full example:
import pandas as pd
from io import StringIO
data = '''\
Date,Value1,Date,Value2
2018-01-01,0,2018-01-01,1
2018-01-02,0,2018-01-02,1'''
df = pd.read_csv(StringIO(data))
df = df.loc[:,~df.columns.str.contains('.', regex=False)]
print(df)
# Date Value1 Value2
#0 2018-01-01 0 1
#1 2018-01-02 0 1
I have a dataset of crimes reported by Gloucestershire Constabulary from 2011-16. It's a .csv file that I have imported to a Pandas dataframe. The data include a column stating the Lower Super Output Area (LSOA) in which the crime occurred, so for crimes in Tewkesbury, for instance, each record has the corresponding LSOA name, e.g. 'Tewkesbury 009D'; 'Tewkesbury 009E'.
I want to group these data by the town/city they relate to, e.g. 'Gloucester', 'Tewkesbury', ignoring the specific LSOAs within each conurbation. Ideally, I would append a new column to the dataframe, with just the place name copied across, and group on that. I am comfortable with how to do the grouping, just not the new column in the first place. Any advice on how to do this is gratefully received.
I am no Pandas expert, but I think you can do string slicing to strip out the last five characters (it supports regex too if I recall correctly, so you can do a proper 'search' if required).
# x is the original dataframe
new_col = x['LSOA'].str[:-5]  # LSOA is the column containing the place names
x = pd.concat([x, new_col.rename('town')], axis=1)  # give the new column a distinct name before joining
The str accessor can be used to extract a substring from the LSOA column of the dataframe.
Something along these lines should work:
df['town'] = [x.split()[0] for x in df['LSOA']]
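Equivalently, the vectorized string accessor does the same split without the Python-level loop:

df['town'] = df['LSOA'].str.split().str[0]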
You can use regex to extract the city name from the DataFrame and then join the result to the original DataFrame. If your initial DataFrame is df:
df = pd.DataFrame([ 'Tewkesbury 009D', 'Tewkesbury 009E'], columns=['LSOA'])
In [2]: df
Out[2]:
LSOA
0 Tewkesbury 009D
1 Tewkesbury 009E
Then you can extract the city name, and optionally the LSOA code, into a new DataFrame df_new:
df_new = df['LSOA'].str.extract(r'(\w*)\s(\d+\w*)', expand=True)
In [10]: df_new
Out[10]:
0 1
0 Tewkesbury 009D
1 Tewkesbury 009E
If you want to discard the code and just keep the city name, remove the second capture group from the regex, i.e. r'(\w*)\s\d+\w*'. Now you can append this result to the original DataFrame:
In [11]: df.join(df_new)
Out[11]:
LSOA 0 1
0 Tewkesbury 009D Tewkesbury 009D
1 Tewkesbury 009E Tewkesbury 009E
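To give the extracted columns meaningful names instead of 0 and 1, named capture groups can be used (a small variation on the same regex; the names 'town' and 'code' are just illustrative):

df_new = df['LSOA'].str.extract(r'(?P<town>\w*)\s(?P<code>\d+\w*)', expand=True)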
I want to delete the columns that start with the word "TYPE" and do not contain _1.
df =
TYPE_1 TYPE_2 TYPE_3 COL1
aaa asb bbb 123
The result should be:
df =
TYPE_1 COL1
aaa 123
Currently I am deleting these columns manually; however, this approach is not very efficient if the number of columns is big:
df = df.drop(["TYPE_2","TYPE_3"], axis=1)
A list comprehension can be used. Note: axis=1 denotes that we are referring to columns, and inplace=True can also be used, as per the pandas.DataFrame.drop docs.
droplist = [i for i in df.columns if i.startswith('TYPE') and '_1' not in i]
df.drop(droplist, axis=1, inplace=True)
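Run against the question's frame, as a quick check:

import pandas as pd

df = pd.DataFrame({'TYPE_1': ['aaa'], 'TYPE_2': ['asb'], 'TYPE_3': ['bbb'], 'COL1': [123]})
droplist = [i for i in df.columns if i.startswith('TYPE') and '_1' not in i]
df.drop(droplist, axis=1, inplace=True)
print(df)
#   TYPE_1  COL1
# 0    aaa   123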
This is the fifth answer, but I wanted to showcase the power of the filter method, which filters column names with a regex. This one keeps columns that either don't start with TYPE or have _1 somewhere in them.
df.filter(regex='^(?!TYPE)|_1')
Easy:
unwanted = [column for column in df.columns
if column.startswith("TYPE") and "_1" not in column]
df = df.drop(unwanted, axis=1)
t_cols = [c for c in df.columns.values if c.startswith('TYPE_') and c != 'TYPE_1']
df = df.drop(t_cols, axis=1)
Should do the job