How to slice within multiple columns?

How to slice within multiple columns? - python

I have a dataframe with some columns and I want to work with 3 of them (with some nan). In order to simplify, let's say that the columns are:
A B C
2135 87539 5255
213
9841 126
The first thing I want to do is only have the first two digits of each cell, but I don't know how to do it, since the dytpe is float and I have some missing values. I want to have this:
A B C
21 87 52
21
98 12
Then I want to replace the nan values by '103'. This part I did this way and it worked.
df.update(df[['A', 'B', 'C']].fillna(103))
So my final dataframe would be like this:
A B C
21 87 52
21 103 103
98 12 103
I just really don't know how to do the first part, where I slice the integers. Anyone can help me?

IIUC, use Series.astype and then DataFrame.stack
to access the first 2 characters.
df.stack().astype(str).str[:2].unstack()
Output
A B C
0 21 87 52
1 21
2 98 12
or using pd.to_numeric if you want get float values at the end
pd.to_numeric(df2.stack().astype(str).str[:2], errors='coerce').unstack()
A B C
0 21.0 87.0 52.0
1 21.0 NaN NaN
2 98.0 12.0 NaN

Here's what you can do. I'm pretty sure it is not the perfect solution, but it's a vague idea and it does work. Maybe you can try to optimize it a bit.
import pandas as pd
data = [[2135,87539,5255],[213,130,130],[9841,126,130]]
df = pd.DataFrame(data,columns=['A','B','C'], dtype=float)
print df
#Converting to object Dtype
df[["A", "B","C"]] = df[["A","B", "C"]].astype(str)
print df
#Storing columns in lists
listA = list(df.A)
listB = list(df.B)
listC = list(df.C)
#For generating required data
newListA = []
newListB = []
newListC = []
#Storing required data into new lists
for itemA,itemB,itemC in zip(listA,listB,listC):
newListA.append(str(itemA[0:2]))
newListB.append(str(itemB[0:2]))
newListC.append(str(itemC[0:2]))
#Converting the lists to new Dataframe with float Dtype
new_df = pd.DataFrame(zip(newListA,newListB,newListC), columns=['A','B','C'], dtype=float)
print(new_df)

Related

Python DataFrame String replace accidently Returing NaN

I encounter a weird problem in Python Pandas, while I read a excel and replace a character "k", the result gives me NaN for the rows without "K". see below image
It should return 173 on row #4，instead of NaN, but if I create a brand new excel, and type the same number. it will work.
or if i use this code,
df = pd.DataFrame({ 'sales':['75.8K','6.9K','7K','6.9K','173','148']})
df
then it will works well. Why? please advise!

This is because the 173 and 148 values from the excel import are numbers, not strings. Since str.replace returns a value that is non-numeric, these values become NaN. You can see that demonstrated by setting up the dataframe with numbers in those position:
df = pd.DataFrame({ 'sales':['75.8K','6.9K','7K','6.9K',173,148]})
df.dtypes
# sales object
# dtype: object
df['num'] = df['sales'].str.replace('K','')
Output:
sales num
0 75.8K 75.8
1 6.9K 6.9
2 7K 7
3 6.9K 6.9
4 173 NaN
5 148 NaN
If you don't mind all your values being strings, you can use
df = pd.read_excel('manual_import.xlsx', dtype=str)
or
df = pd.read_excel('manual_import.xlsx', converters={'sales':str})
should just convert all the sales values to strings.

Try this:
df['nums'] = df['sales'].astype(str)
df['nums'] = pd.to_numeric(df['nums'].str.replace('K', ''))

Index match in Pandas?

I am trying to match x value based on their row and column keys. In excel I have used INDEX & MATCH to fetch the correct values, but I am struggling to do the same in Pandas.
Example:
I want to add the highlighted value (saved in df2) to my df['Cost'] column.
I have got df['Weight'] & df['Country'] as keys but I don't know how to use them to look up the highlighted value in df2.
How can I fetch the yellow value into df3['Postage'], which I can then use to add that to my df['Cost'] column?
I hope this makes sense. Let me know i should provide more info.
Edit - more info (sorry, I could not figure out how to copy the output from Jupyter):
When I run [93] I get the following error:
ValueError: Row labels must have same size as column labels
Thanks!

To get the highlighted value 1.75 simply
df2.loc[df2['Country']=='B', 3]
So generalizing the above and using country-weight key pairs from df1:
cost = []
for i in range(df1.shape[0]):
country = df1.loc[i, 'Country']
weight = df1.loc[i, 'Weight']
cost.append(df2.loc[df2['Country']==country, weight]
df1['Cost'] = cost
Or much better:
df1['Cost'] = df1.apply(lambda x: df2.loc[df2['Country']==x['Country'], x['Weight'], axis=1)

for your case use (note [0] is needed to index into array)
row = df1.iloc[1]
df2[df2.Country == row.Country][row.Weight][0]
Hope this helps with .iloc and .loc
d = {chr(ord('A')+r):[c+r*10 for c in range(5)] for r in range(5)}
df = pd.DataFrame(d).transpose()
df.columns=['a','b','c','d','e']
print(df)
print("--------")
print(df.loc['B']['c'])
print(df.iloc[1][2])
output
a b c d e
A 0 1 2 3 4
B 10 11 12 13 14
C 20 21 22 23 24
D 30 31 32 33 34
E 40 41 42 43 44
--------
12
12

Finding duplicates in one column with non-dups in another

I am struggling with how to take a dataset and output a result that finds duplicate information in one column with non-duplicate items in another. If say column 0 and 2 are exact duplicates I don't care about the set of data, only if there are rows where column 0 has entries with more than one value in column 2. And, if that is the case, I want all of the rows that match column 0.
I am first using concat to narrow down the dataset to rows that have duplicates. My problem is now trying to get only the rows where column 2 is different.
My example dataset is:
Pattern or URI,Route Filter Clause,Partition,Pattern Usage,Owning Object,Owning Object Partition,Cluster ID,Catalog Name,Route String,Device Name,Device Description
"11111",,Prod_P,Device,"11111",Prod_P,,,,SEPFFFF0723AFE8,device1
"11111",,Prod_P,Device,"11111",Prod_P,,,,SEPFFFF862FAF74,device2
"11111",,Prod_P,Device,"11111",Prod_P,,,,SEPFFFFF2A8AA38,device3
"11111",,Prod_P,Device,"11111",Prod_P,,,,SEPFFFFD2C0A2C6,device4
"22334",,Prod_P,Device,"22334",Prod_P,,,,SEPFFFFCF87AB31,device5
"33333",,Prod_P,Device,"33333",Prod_P,,,,SEPFFFFCF87AAEA,device6
"33333",,Dummy_P,Device,"33333",Dummy_P,,,,SEPFFFF18FF65A0,device7
"33333",,Prod_P,Device,"33333",Prod_P,,,,SEPFFFFCFCCAABB,device8
In this set, I want to have a result of the last three rows that are "33333" as they have more than one type of value in column 2. "11111" only matches Prod_P so I don't care about it.
import pandas as pd
ignorelist = []
inputfile = "pandas-problem-data.txt"
data = pd.read_csv(inputfile)
data.columns = data.columns.str.replace(' ','_')
data = pd.concat(g for _, g in data.groupby("Pattern_or_URI") if len(g) > 1)
data = data.loc[(data["Pattern_Usage"]=="Device"), ["Pattern_or_URI","Partition","Pattern_Usage","Device_Name","Device_Description"]]
new_rows = []
tempdup = pd.DataFrame()
for i, row in data.iterrows():
if row["Pattern_or_URI"] in ignorelist:
continue
ignorelist.append(row["Pattern_or_URI"])
# testdup = pd.concat(h for _, h in (data.loc[(data["Pattern_or_URI"]==row["Pattern_or_URI"], ["Pattern_or_URI","Partition","Pattern_Usage","Device_Name","Device_Description"])]).groupby("Partition") if len(h) > 1)
# print(data.loc[(data["Pattern_or_URI"]==row["Pattern_or_URI"], ["Pattern_or_URI","Partition","Pattern_Usage","Device_Name","Device_Description"])])
newrow = data.loc[(data["Pattern_or_URI"]==row["Pattern_or_URI"], ["Pattern_or_URI","Partition","Pattern_Usage","Device_Name","Device_Description"])]
If I uncomment the line where I try to use the same concat to find entries with "Partition" > 1 I get an error ValueError: No objects to concatenate. I know it gets through the first iter with the print statement uncommented.
Is there an easier or better way of doing this? I'm new to pandas and keep thinking there is probably a way to find this that I haven't figured out.
Thank you.
Desired output:
Pattern or URI,Route Filter Clause,Partition,Pattern Usage,Owning Object,Owning Object Partition,Cluster ID,Catalog Name,Route String,Device Name,Device Description
"33333",,Prod_P,Device,"33333",Prod_P,,,,SEPFFFFCF87AAEA,device6
"33333",,Dummy_P,Device,"33333",Dummy_P,,,,SEPFFFF18FF65A0,device7
"33333",,Prod_P,Device,"33333",Prod_P,,,,SEPFFFFCFCCAABB,device8

I think it's a bit misleading to say you're looking for duplicates. This is really a grouping problem.
You want to find groups of identical values in Pattern or URI that correspond with more than one unique value in your Partition Series.
transform + nunique
s = df.groupby('Pattern or URI')['Partition'].transform('nunique').gt(1)
df.loc[s]
Pattern or URI Route Filter Clause Partition Pattern Usage Owning Object Owning Object Partition Cluster ID Catalog Name Route String Device Name Device Description
5 33333 NaN Prod_P Device 33333 Prod_P NaN NaN NaN SEPFFFFCF87AAEA device6
6 33333 NaN Dummy_P Device 33333 Dummy_P NaN NaN NaN SEPFFFF18FF65A0 device7
7 33333 NaN Prod_P Device 33333 Prod_P NaN NaN NaN SEPFFFFCFCCAABB device8

Using df.drop_duplicates() as follows:
df=pd.DataFrame({'a':[111,111,111,222,222,333,333,333],
'b':['a','a','a','b','b','a','b','c'],
'c':[12,13,14,15,61,71,81,19]})
df
a b c
0 111 a 12
1 111 a 13
2 111 a 14
3 222 b 15
4 222 b 61
5 333 a 71
6 333 b 81
7 333 c 19
df1=df.drop_duplicates(['a','b'],keep=False)
df1
a b c
5 333 a 71
6 333 b 81
7 333 c 19
Note, instead of assigning it to a new DF, you can add inplace=True to apply it to the original

Pandas way to keep int portion of a column of numbers in a dataframe

I have a dataframe (df) column of floats:
0 59.9179
1 50.3874
2 50.3874
3 55.0089
4 58.423
5 58.8227
6 55.2471
7 57.2266
8 46.4312
9 59.9097
10 57.1417
Is there a way in pandas to keep the integer portion of the number and discard the decimal, so the resulting column would look like:
0 59
1 50
2 50
3 55
4 58
5 58
6 55
7 57
8 46
9 59
10 57
I can see a way to do this for 1 number
>>> s = 59.9179
>>> i, d = divmod(s, 1)
>>> i
59
but not for a whole column in one go
Many thanks

You've got two options:
Casting the column type (or even the whole dataframe):
df[column] = df[column].astype(int)
Or using numpy's floor method (for positive floats as in your example)
df[column] = np.floor(df[column])

You can use apply method by passing a lambda function.
df = df.apply(lambda row: int(row['second_column']), axis=1)
Another method is to use astype method.
df[second_column] = df[second_column].astype(int)

If your column name is col then do the following:
df[col]=df[col].astype('int')

You can use this:
df[your_column] = df[your_column].astype(int)
Or for the whole df:
df = df.astype(int)

To keep integer part use trunc() function as shown below.
Easiest method is to use trunc() function on the column.
df[col_name] = trunc(df[col_name])

Renaming columns in pandas dataframe error [duplicate]

I have the following data frames:
print(df_a)
mukey DI PI
0 100000 35 14
1 1000005 44 14
2 1000006 44 14
3 1000007 43 13
4 1000008 43 13
print(df_b)
mukey niccdcd
0 190236 4
1 190237 6
2 190238 7
3 190239 4
4 190240 7
When I try to join these data frames:
join_df = df_a.join(df_b, on='mukey', how='left')
I get the error:
*** ValueError: columns overlap but no suffix specified: Index([u'mukey'], dtype='object')
Why is this so? The data frames do have common 'mukey' values.

Your error on the snippet of data you posted is a little cryptic, in that because there are no common values, the join operation fails because the values don't overlap it requires you to supply a suffix for the left and right hand side:
In [173]:
df_a.join(df_b, on='mukey', how='left', lsuffix='_left', rsuffix='_right')
Out[173]:
mukey_left DI PI mukey_right niccdcd
index
0 100000 35 14 NaN NaN
1 1000005 44 14 NaN NaN
2 1000006 44 14 NaN NaN
3 1000007 43 13 NaN NaN
4 1000008 43 13 NaN NaN
merge works because it doesn't have this restriction:
In [176]:
df_a.merge(df_b, on='mukey', how='left')
Out[176]:
mukey DI PI niccdcd
0 100000 35 14 NaN
1 1000005 44 14 NaN
2 1000006 44 14 NaN
3 1000007 43 13 NaN
4 1000008 43 13 NaN

The .join() function is using the index of the passed as argument dataset, so you should use set_index or use .merge function instead.
Please find the two examples that should work in your case:
join_df = LS_sgo.join(MSU_pi.set_index('mukey'), on='mukey', how='left')
or
join_df = df_a.merge(df_b, on='mukey', how='left')

This error indicates that the two tables have one or more column names that have the same column name.
The error message translates to: "I can see the same column in both tables but you haven't told me to rename either one before bringing them into the same table"
You either want to delete one of the columns before bringing it in from the other on using del df['column name'], or use lsuffix to re-write the original column, or rsuffix to rename the one that is being brought in.
df_a.join(df_b, on='mukey', how='left', lsuffix='_left', rsuffix='_right')

The error indicates that the two tables have the 1 or more column names that have the same column name.
Anyone with the same error who doesn't want to provide a suffix can rename the columns instead. Also make sure the index of both DataFrames match in type and value if you don't want to provide the on='mukey' setting.
# rename example
df_a = df_a.rename(columns={'a_old': 'a_new', 'a2_old': 'a2_new'})
# set the index
df_a = df_a.set_index(['mukus'])
df_b = df_b.set_index(['mukus'])
df_a.join(df_b)

Mainly join is used exclusively to join based on the index,not on the attribute names,so change the attributes names in two different dataframes,then try to join,they will be joined,else this error is raised

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

How to slice within multiple columns? - python

Related

Python DataFrame String replace accidently Returing NaN

Index match in Pandas?

Finding duplicates in one column with non-dups in another

Pandas way to keep int portion of a column of numbers in a dataframe

Renaming columns in pandas dataframe error [duplicate]

Categories

Resources