Slicing Pandas DataFrame based on csv - python

Let's say I have a Pandas DataFrame like following.
df = pd.DataFrame({'Name': ['A', 'B', 'C'],
                   'Country': ['US', 'UK', 'SL']})
Country Name
0 US A
1 UK B
2 SL C
And I'm having a csv like following.
Name,Extended
A,Jorge
B,Alex
E,Mark
F,Bindu
I need to check whether each df['Name'] is in the csv and, if so, get the "Extended" value. If not, I need to just get the "Name". So my expected output is like the following.
Country Name Extended
0 US A Jorge
1 UK B Alex
2 SL C C
Following shows what I tried so far.
f = open('mycsv.csv', 'r')
lines = f.readlines()

def parse(x):
    for line in lines:
        if x in line.split(',')[0]:
            return line.strip().split(',')[1]

df['Extended'] = df['Name'].apply(parse)
Name Country Extended
0 A US Jorge
1 B UK Alex
2 C SL None
I cannot figure out how to get the "Name" for C in "Extended" (the else part of the code). Any help?

You can use the "fillna" function from pandas like this:
import pandas as pd

df1 = pd.DataFrame({'Name': ['A', 'B', 'C'],
                    'Country': ['US', 'UK', 'SL']})
df2 = pd.read_csv('mycsv.csv')
df_merge = pd.merge(df1, df2, how="left", on="Name")
# Fill the unmatched 'Extended' entries with the row's own 'Name' value
df_merge["Extended"] = df_merge["Extended"].fillna(df_merge["Name"])

You could just load the csv as a df, merge it in, and then assign using where:
df2 = pd.read_csv('mycsv.csv')
merged = df.merge(df2, how='left', on='Name')
df['Extended'] = merged['Extended'].where(merged['Extended'].notna(), merged['Name'])
So here we use the boolean condition to test whether 'Extended' found a match and use that value, otherwise fall back to 'Name'.
Also, is there always an 'Extended' entry for every 'Name'? If so, why not just assign the value of Extended to the dataframe:
df['Name'] = df2['Extended']
This would be a lot simpler.
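Putting the merge-and-fill approach from the first answer together, here is a minimal end-to-end sketch; the csv contents from the question are built inline as a stand-in for pd.read_csv('mycsv.csv'):

```python
import pandas as pd

df = pd.DataFrame({'Name': ['A', 'B', 'C'],
                   'Country': ['US', 'UK', 'SL']})
# Stand-in for pd.read_csv('mycsv.csv'): the lookup table from the question
lookup = pd.DataFrame({'Name': ['A', 'B', 'E', 'F'],
                       'Extended': ['Jorge', 'Alex', 'Mark', 'Bindu']})

merged = df.merge(lookup, how='left', on='Name')
# Rows with no match get NaN in 'Extended'; fall back to 'Name' there
merged['Extended'] = merged['Extended'].fillna(merged['Name'])
```

This gives 'Jorge', 'Alex', 'C' in the 'Extended' column, matching the expected output.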


convert a list in rows of dataframe in one column to simple string

I have a dataframe with a list in one column that I want to convert into a plain string:
id data_words_nostops
26561364 [andrographolide, major, labdane, diterpenoid]
26561979 [dgat, plays, critical, role, hepatic, triglyc]
26562217 [despite, success, imatinib, inhibiting, bcr]
DESIRED OUTPUT
id data_words_nostops
26561364 andrographolide, major, labdane, diterpenoid
26561979 dgat, plays, critical, role, hepatic, triglyc
26562217 despite, success, imatinib, inhibiting, bcr
Try this :
df['data_words_nostops'] = df['data_words_nostops'].apply(lambda row : ','.join(row))
Complete code :
import pandas as pd
l1 = ['26561364', '26561979', '26562217']
l2 = [['andrographolide', 'major', 'labdane', 'diterpenoid'],['dgat', 'plays', 'critical', 'role', 'hepatic', 'triglyc'],['despite', 'success', 'imatinib', 'inhibiting', 'bcr']]
df = pd.DataFrame(list(zip(l1, l2)),
                  columns=['id', 'data_words_nostops'])
df['data_words_nostops'] = df['data_words_nostops'].apply(lambda row : ','.join(row))
Output :
id data_words_nostops
0 26561364 andrographolide,major,labdane,diterpenoid
1 26561979 dgat,plays,critical,role,hepatic,triglyc
2 26562217 despite,success,imatinib,inhibiting,bcr
Another option is apply along rows:
df["data_words_nostops"] = df.apply(lambda row: ", ".join(row["data_words_nostops"]), axis=1)
You can use pandas str join for this:
df["data_words_nostops"] = df["data_words_nostops"].str.join(",")
df
id data_words_nostops
0 26561364 andrographolide,major,labdane,diterpenoid
1 26561979 dgat,plays,critical,role,hepatic,triglyc
2 26562217 despite,success,imatinib,inhibiting,bcr
I tried the following as well
df_ready['data_words_nostops_Joined'] = df_ready.data_words_nostops.apply(', '.join)
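One caveat worth adding: if the dataframe was itself loaded from a csv, the "lists" are usually stored as their string representation, and .str.join would then join individual characters. A sketch using the standard library's ast.literal_eval to recover real lists first (assuming that string representation):

```python
import ast
import pandas as pd

# Column as it often looks after a csv round-trip: strings, not lists
df = pd.DataFrame({
    'id': ['26561364'],
    'data_words_nostops': ["['andrographolide', 'major', 'labdane', 'diterpenoid']"],
})

# Parse the string representation back into real Python lists, then join
df['data_words_nostops'] = df['data_words_nostops'].apply(ast.literal_eval).str.join(', ')
```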

How do you sort only the first column in a csv file and not affect any of the other columns while doing so using python

I am working on a table (csv file) with the following data:
roll_no,student_name,grade,email_id
1,Aarav Gosalia,Grade 11,aarav.gosalia#flag.org.in
2,Aarav Rawal,Grade 11,aarav.rawal#flag.org.in
3,Abizar Chitalwala,Grade 11,abizar.chitalwala#flag.org.in
4,Ahad Motorwala,Grade 11,ahad.motorwala#flag.org.in
5,Armaan Adenwala,Grade 11,armaan.adenwala#flag.org.in
6,Aryan Shah,Grade 11,aryan.shah#flag.org.in
7,Baasit Motorwala,Grade 11,baasit.motorwala#flag.org.in
16,Caroline Walker,Grade 11,caroline.walker#flag.org.in
8,Darsshan Kavedia,Grade 11,darsshan.kavedia#flag.org.in
9,Devanshi Rajgharia,Grade 11,devanshi.rajgharia#flag.org.in
10,Dhruv Jain,Grade 11,dhruv.jain#flag.org.in
11,Eisa Patel,Grade 11,eisa.patel#flag.org.in
12,Esha Khimawat,Grade 11,esha.khimawat#flag.org.in
13,Fatima Unwala,Grade 11,fatima.unwala#flag.org.in
14,Hamza Erfan,Grade 11,hamza.erfan#flag.org.in
15,Harsh Gosar,Grade 11,harsh.gosar#flag.org.in
So as you can see, all of the names are sorted, but the roll number of Caroline Walker is 16. I want a way to sort only the roll numbers without affecting any of the other columns.
I want the final table to look like this:
roll_no,student_name,grade,email_id
1,Aarav Gosalia,Grade 11,aarav.gosalia#flag.org.in
2,Aarav Rawal,Grade 11,aarav.rawal#flag.org.in
3,Abizar Chitalwala,Grade 11,abizar.chitalwala#flag.org.in
4,Ahad Motorwala,Grade 11,ahad.motorwala#flag.org.in
5,Armaan Adenwala,Grade 11,armaan.adenwala#flag.org.in
6,Aryan Shah,Grade 11,aryan.shah#flag.org.in
7,Baasit Motorwala,Grade 11,baasit.motorwala#flag.org.in
8,Caroline Walker,Grade 11,caroline.walker#flag.org.in
9,Darsshan Kavedia,Grade 11,darsshan.kavedia#flag.org.in
10,Devanshi Rajgharia,Grade 11,devanshi.rajgharia#flag.org.in
11,Dhruv Jain,Grade 11,dhruv.jain#flag.org.in
12,Eisa Patel,Grade 11,eisa.patel#flag.org.in
13,Esha Khimawat,Grade 11,esha.khimawat#flag.org.in
14,Fatima Unwala,Grade 11,fatima.unwala#flag.org.in
15,Hamza Erfan,Grade 11,hamza.erfan#flag.org.in
16,Harsh Gosar,Grade 11,harsh.gosar#flag.org.in
Please help me, and keep in mind that I am still a beginner in python.
Just use pandas: df['roll_no'] = range(1, len(df) + 1)
pandas is one way.
import pandas as pd

df = pd.read_csv('file.csv')
df = df.sort_values(by='roll_no')  # note: this reorders whole rows, not just the one column
df.to_csv('file.csv', index=False)
This will work:
df = pd.DataFrame({'id': [1, 3, 2, 7],
                   'name': ['M', 'r', 'd', 'd']})
df['id'] = list(df['id'].sort_values())
df
Result:
id name
0 1 M
1 2 r
2 3 d
3 7 d
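Applied to the roll_no column, the sorted-column assignment from the last answer looks like this; the csv is simulated inline with a few rows from the question (one roll number out of place):

```python
import pandas as pd

# Small stand-in for pd.read_csv('file.csv')
df = pd.DataFrame({'roll_no': [1, 2, 16, 3],
                   'student_name': ['Aarav Gosalia', 'Aarav Rawal',
                                    'Caroline Walker', 'Darsshan Kavedia']})

# Sort only the roll_no values, leaving every other column in place
df['roll_no'] = list(df['roll_no'].sort_values())
```

The names stay in their original order; only the roll numbers are rearranged.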

Passing a defaultdict into a df

I am trying to import a txt file with states and universities listed in it. I have used a defaultdict to import and parse the txt, so that I end up with lists of universities attached to each state. How do I then put the data into a pandas dataframe with two columns (State, RegionName)? Nothing so far has worked.
I built an empty dataframe with:
ut = pd.DataFrame(columns=['State', 'RegionName'])
and have tried a couple of different methods but none have worked.
with open('ut.txt') as ut:
    for line in ut:
        if '[edit]' in line:
            a = line.rstrip().split('[')
            d[a[0]].append(a[1])
        else:
            b = line.rstrip().split(' ')
            d[a[0]].append(b[0])
            continue
This gets me a nice list:
defaultdict(<class 'list'>, {'State': ['edit]', 'School', 'School2', 'School3', 'School4', 'School5', 'School6', 'School7', 'School8'],
The "edit]" is part of the original txt file, signifying a state. Everything after it is the towns the schools are in.
I'd like to build a nice 2 column dataframe where state is the left column and all schools on the right...
Consider the following dictionary
data_dict = {"a": 1, "b": 2, "c": 3}
Assuming that from this dictionary you want to create a dataframe with columns named State and RegionName, respectively, the following will do the work:
data_items = data_dict.items()
data_list = list(data_items)
df = pd.DataFrame(data_list, columns = ["State", "RegionName"])
Which will get
[In]: print(df)
[Out]:
State RegionName
0 a 1
1 b 2
2 c 3
If one doesn't pass the column names when creating the dataframe, and the columns end up named a and b, one can rename them with pandas.DataFrame.rename:
df = df.rename(columns = {"a": "State", "b": "RegionName"})
If the goal is solely reading a txt file with a structure like this
column1 column2
1 2
3 4
5 6
Then the following will do the work
colnames = ['State', 'RegionName']
df = pd.read_csv("file.txt", sep=r"\s+", header=0, names=colnames)
Note that if the columns are already named as desired, just use the following:
df = pd.read_csv("file.txt")
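For the defaultdict-of-lists in the question (a state mapped to a list of schools), the dict first needs flattening into (state, school) pairs before building the frame. A sketch with made-up state and school names:

```python
from collections import defaultdict
import pandas as pd

d = defaultdict(list)
d['Utah'].extend(['School1', 'School2'])
d['Ohio'].append('School3')

# Flatten the mapping into one (state, school) row per school
rows = [(state, school) for state, schools in d.items() for school in schools]
df = pd.DataFrame(rows, columns=['State', 'RegionName'])
```

Each school gets its own row, with its state repeated in the left column.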

Remove grave accent from IDs

I have an ID column containing a grave accent, like `1234ABC40, and I want to remove just that character from the column while keeping the dataframe form.
I tried this on the column only. The file is named x here and has multiple columns; id is the column I want to fix.
x = pd.read_csv(r'C:\filename.csv', index_col=False)
id = str(x['id'])
id2 = unidecode.unidecode(id)
id3 = id2.replace('`', '')
This changes to str but I want that column back in the dataframe form
DataFrames have their own replace() function. Note that for partial (substring) replacements you must enable regex=True:
import pandas as pd

d = {'id': ["12`3", "32`1"], 'id2': ["004`", "9`99"]}
df = pd.DataFrame(data=d)
df["id"] = df["id"].replace('`', '', regex=True)
print(df)
id id2
0 123 004`
1 321 9`99
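An equivalent route is the .str accessor's replace, which also keeps everything in the dataframe; a minimal sketch on the ID from the question:

```python
import pandas as pd

df = pd.DataFrame({'id': ['`1234ABC40', '32`1']})
# str.replace with regex=False treats the backtick as a literal character
df['id'] = df['id'].str.replace('`', '', regex=False)
```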

How to change all values in a dataframe to lowercase

I want to change all values in a dataframe to lowercase, using the following code,
import pandas as pd
import numpy as np
path1= "C:\\Users\\IBM_ADMIN\\Desktop\\ml-1m\\SELECT_FROM_HRAP2P3_SAAS_ZTXDMPARAM_201611291745.csv"
frame1 = pd.read_csv(path1,encoding='utf8',dtype = {'COUNTRY_CODE': str})
for x in frame1:
    frame1[x] = frame1[x].str.lower()
frame1
but i get the following error for this row:
frame1[x] = frame1[x].str.lower()
error:
AttributeError: Can only use .str accessor with string values, which use np.object_ dtype in pandas
I don't know the reason.
You can use the applymap function.
import pandas as pd

df1 = pd.DataFrame({'MovieName': ['LIGHTS OUT', 'Legend'], 'Actors': ['MARIA Bello', 'Tom Hard']})
# Guard against non-string cells so non-string columns don't raise
df2 = df1.applymap(lambda x: x.lower() if isinstance(x, str) else x)
print(df1, "\n")
print(df2)
Output:
Actors MovieName
0 MARIA Bello LIGHTS OUT
1 Tom Hard Legend
Actors MovieName
0 maria bello lights out
1 tom hard legend
Try str.lower on the Series object.
Suppose your DataFrame is like below:
df = pd.DataFrame(dict(name=["HERE", "We", "are"]))
name
0 HERE
1 We
2 are
Then lower all values and output:
df['name'] = df['name'].str.lower()
name
0 here
1 we
2 are
You can try this:
df2 = pd.DataFrame({'A': 1.,
                    'B': pd.Timestamp('20130102'),
                    'C': pd.Series(1, index=list(range(4)), dtype='float32'),
                    'D': np.array([3] * 4, dtype='int32'),
                    'E': pd.Series(["TEST", "Train", "test", "train"]),
                    'F': 'foo'})
mylist = list(df2.select_dtypes(include=['object']).columns)  # strings are stored as object dtype
for i in mylist:
    df2[i] = df2[i].str.lower()
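The AttributeError in the question comes from calling .str.lower() on a column that does not hold strings; .str only works on object-dtype (string) columns. A dtype-aware sketch that lowercases only the object columns and passes everything else through unchanged:

```python
import pandas as pd

df = pd.DataFrame({'name': ['HERE', 'We', 'are'],
                   'n': [1, 2, 3]})

# Lowercase string (object-dtype) columns only; numeric columns pass through
df = df.apply(lambda col: col.str.lower() if col.dtype == object else col)
```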
