Subsetting into several data frames by identifier in Python?

I wish to subset a data frame in Python by identifier. For instance, suppose we have the below data:
ID Number
A 50
A 45
A 21
B 78
B 79
B 12
C 15
C 74
C 10
I want to split the data into three separate data frames, i.e. all data for A would be the first data frame, B would be the second, C the third.
I'm having trouble going about this. I've tried using set to get the unique values, but I don't think that's the right approach. Any help appreciated.

Is this what you want? (PS: this auto-assigns a variable name to each DataFrame.)
variables = locals()
for i in df['ID'].unique():
    variables["df{0}".format(i)] = df.loc[df.ID == i]
dfA
Out[147]:
  ID  Number
0  A      50
1  A      45
2  A      21
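A more robust pattern than writing into locals() is to keep the sub-frames in a dict keyed by ID. A minimal sketch using groupby (the dfs name here is just illustrative):
# Split into a dict of DataFrames, one per unique ID
dfs = {key: group for key, group in df.groupby('ID')}
dfs['A']  # the sub-frame for ID 'A'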

Related

How to do a Lookup of Data from a CSV file in Python?

How do I achieve this in Python? I know there is a VLOOKUP function in Excel, but if there is a way to do it in Python, I'd prefer that. Basically, my goal is to get data from the Quantity column of CSV2 and write it to the Quantity column of CSV1, matched on Bin_Name. The script should not copy all the values at once; it must select by Bin_Name. For example: today I would like to get the data for Bin_Name ABCDE from CSV2 and write it into the Quantity column of CSV1. If this is possible, I will be very grateful and will learn a lot from this. Thank you very much in advance.
CSV1:                   CSV2:
Bin_Name  Quantity      Bin_Name  Quantity
A                       A         43
B                       B         32
C                       C         28
D                       D         33
E                       E         37
F                       F         38
G                       G         39
H                       H         41
I would simply use pandas' built-in functions in this case; there is no need for loops.
So, assuming there are no duplicate bin names, try the code below to copy the whole column:
df1= pd.read_csv("file1.csv")
df2= pd.read_csv("file2.csv")
df1["Quantity"]= df2["Quantity"].where(df1["Bin_Name"].eq(df2["Bin_Name"]))
print(df1)
Bin_Name Quantity
0 A 43
1 B 32
2 C 28
3 D 33
4 E 37
5 F 38
6 G 39
7 H 41
If you need to copy only a subset of rows, use boolean indexing with pandas.DataFrame.loc:
vals= ["A", "B", "C", "D"]
df1.loc[df1["Bin_Name"].isin(vals), "Quantity"] = df2.loc[df1["Bin_Name"].isin(vals), "Quantity"]
print(df1)
Bin_Name Quantity
0 A 43.0
1 B 32.0
2 C 28.0
3 D 33.0
4 E NaN
5 F NaN
6 G NaN
7 H NaN
I am not really sure if I understood your question fully, but let me know if this answers your challenge.
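One caveat worth noting: the .where approach above matches rows by position, so it assumes both files list the bins in the same order. A lookup keyed on the bin name itself, sketched here with Series.map, avoids that assumption:
# Look up each bin's quantity in df2 by name rather than by row position
df1["Quantity"] = df1["Bin_Name"].map(df2.set_index("Bin_Name")["Quantity"])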
The normal way of doing Excel-type operations in Python is with the pandas library. Using it, you can read, manipulate and save your CSV files (and many other formats) with Python code.
Setting up the example
EDIT: Ensure you have installed pandas, e.g. by typing the following in your terminal: pip install pandas
Since I don't have your CSV files, I will create them with pandas directly, rather than reading them with the built-in read_csv() method.
import pandas as pd

csv1 = pd.DataFrame.from_dict({
    "Bin_Name": ["A", "B", "C", "D", "E", "F", "G", "H"],
    "Quantity": []
}, orient="index").T
csv2 = pd.DataFrame.from_dict({
    "Bin_Name": ["A", "B", "C", "D", "E", "F", "G", "H"],
    "Quantity": [43, 32, 28, 33, 37, 38, 39, 41]
}, orient="index").T
The way I understood your question, you want to specify which bins should be copied from your csv2 file to your csv1 file. In your example, you mention something like this:
# Specify bins you want to copy
bins_to_copy = ["A", "B", "C", "D", "E"]
Now, there are several ways of doing the copy operation you mentioned, some better than others. Since you explicitly say "the script should not copy all the value at once", I will give one suggestion that follows your instructions, and one that I believe is a better approach.
Solution 1 (bad - using for-loops)
# Loop through each bin and copy the cell value from csv2 to csv1
for bin_to_copy in bins_to_copy:
    csv1.loc[csv1["Bin_Name"] == bin_to_copy, "Quantity"] = csv2.loc[csv2["Bin_Name"] == bin_to_copy, "Quantity"]
# OUTPUT:
> csv1
Bin_Name Quantity
0 A 43
1 B 32
2 C 28
3 D 33
4 E 37
5 F None
6 G None
7 H None
This approach does exactly what I believe you are asking for. However, there are several weaknesses with it:
Looping through rows is a very slow approach compared to using more efficient, built-in methods provided in the Pandas-library
The approach is vulnerable to situations where you have duplicate bins in either of the CSV-files
The approach is vulnerable to situations where a bin only exists in one of the CSV-files
Since we updated one cell at a time, pandas doesn't realise that the datatype of the column has changed, so we are left with None for the missing values (and an "object" dtype for the column) rather than NaN (which would indicate a numeric float column).
If I have understood your problem correctly, then a better approach would be as follows
Solution 2 (better - using merge)
# Select the columns with bins from csv1
csv1_bins = csv1["Bin_Name"]
# Select only the rows with the desired bins from csv2
csv2_desired_bins = csv2[csv2["Bin_Name"].isin(bins_to_copy)]
# Merge the columns (just "Quantity" in this case) from csv2 to csv1 using "Bin_Name" as "merging-key"
result = pd.merge(left=csv1_bins, right=csv2_desired_bins, on="Bin_Name", how="left")
# OUTPUT
> result
Bin_Name Quantity
0 A 43
1 B 32
2 C 28
3 D 33
4 E 37
5 F NaN
6 G NaN
7 H NaN
The merge() method is much more powerful and answers all the challenges I listed for solution 1. It is also a more generic version of the join() method, which according to the documentation is "like an Excel VLOOKUP operation" (which is what you mention as your Excel equivalent).
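For completeness, the join() form of the same lookup might look like this (a sketch reusing csv1 and csv2_desired_bins from above):
# Index the desired csv2 rows by bin name, then join Quantity onto csv1's bins
result = csv1[["Bin_Name"]].join(csv2_desired_bins.set_index("Bin_Name"), on="Bin_Name")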
Hi, you can simply iterate over CSV2 first and then, after gathering the wanted value, search for it in CSV1. I wrote some code below that might help you, but there may be much more efficient ways to do it.
def func(wanted_rows: list, csv1df: pd.DataFrame, csv2df: pd.DataFrame):
    # Iterate over csv2df
    for index, row in csv2df.iterrows():
        # Check if the index is in the wanted list
        if index in wanted_rows:
            # Find the index in csv1df holding the same Bin_Name
            csv1_index = csv1df[csv1df.Bin_Name == row['Bin_Name']].index[0]
            csv1df.at[csv1_index, 'Quantity'] = row['Quantity']
    return csv1df

wanted_list = [1, 2, 3, 4, 5]
func(wanted_list, CSV1, CSV2df)

How to parse a dataframe efficiently, while storing data (a specific row, or multiple rows) in other dataframes using a specific pattern?

How can I parse data in all rows, and use each row to populate other dataframes with data from multiple rows?
I am trying to parse a CSV file containing several data entries, for training purposes, as I am quite new to this technology.
My data consists of 10 columns and hundreds of rows.
The first column is filled with a code that is either 10, 50, or 90.
Example:
Dataframe 1:
0   1
10  Power-220
90  End
10  Power-290
90  End
10  Power-445
90  End
10  Power-390
50  Clotho
50  Kronus
90  End
10  Power-550
50  Ares
50  Athena
50  Artemis
50  Demeter
90  End
And the list goes on..
On one hand, I want to be able to read the first cell and populate another dataframe directly if it is a code 10.
On the other hand, I'd like to populate another dataframe with all the code 50s, but I also want to get the data from the previous code 10, as it holds the type of Power that is used, and populate a new column of this dataframe with it.
The new data frames are supposed to look like this:
Dataframe 2:
0   1
10  Power-220
10  Power-290
10  Power-445
10  Power-390
10  Power-550
Dataframe 3:
0   1        2
50  Clotho   Power-390
50  Kronus   Power-390
50  Ares     Power-550
50  Athena   Power-550
50  Artemis  Power-550
50  Demeter  Power-550
So far I have been using iterrows, and I've read everywhere that this is a bad idea, but I'm struggling to implement another method.
In my code I just create two other dataframes, but I don't know yet how to retrieve data from the previous cell. I would usually use a classic method, but I think it's rather archaic.
for index, row in df.iterrows():
    if df.iat[index, 0] == '10':
        df2 = df2.append(df.loc[index], ignore_index=True)
    if df.iat[index, 0] == '50':
        df3 = df3.append(df.loc[index], ignore_index=True)
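(Side note: DataFrame.append, used above, was removed in pandas 2.0; pd.concat([df2, df.loc[[index]]]) is the modern replacement, as one of the answers below shows.)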
Any ideas ?
(Update)
For df2, it's pretty simple:
df2 = df.rename(columns={'Power/Character': 'Power'}) \
        .loc[df['Code'] == 10, :]
For df3, it's a bit more complex:
# Extract the power values and forward-fill them
power = df.loc[df['Code'] == 10, 'Power/Character'].reindex(df.index).ffill()
df3 = df.rename(columns={'Power/Character': 'Character'}) \
        .assign(Power=power).loc[lambda x: x['Code'] == 50]
Output:
>>> df2
Code Power
0 10 Power-220
2 10 Power-290
4 10 Power-445
6 10 Power-390
10 10 Power-550
>>> df3
Code Character Power
7 50 Clotho Power-390
8 50 Kronus Power-390
11 50 Ares Power-550
12 50 Athena Power-550
13 50 Artemis Power-550
14 50 Demeter Power-550
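Note: the two snippets above assume the columns are named Code and Power/Character. With the question's integer headers, a hypothetical rename such as df = df.rename(columns={0: 'Code', 1: 'Power/Character'}) (plus df['Code'] = df['Code'].astype(int) if the codes were read as strings) would be needed first.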
You could simply copy the required rows to another dataframe:
df2 = df[df.col_1 == '10'].copy()
This makes a new dataframe df2 containing only the rows where column col_1 matches the criterion. The copy() call guarantees the two dataframes are independent, so changes in one do not affect the other.
If df2 already exists, you can concatenate them
df2 = pd.concat([df2, df[df.col_1 == '10'].copy()])

Pandas DataFrame MultiIndex Pivot - Remove Empty Headers and Axis Rows

this is closely related to the question I asked earlier here Python Pandas Dataframe Pivot Table Column and Values Order. Thanks again for the help. Very much appreciated.
I'm trying to automate a report that will be distributed via email to a large audience so it needs to look "pretty" :)
I'm having trouble resetting/removing the indexes and/or axes after the pivot so that I can use the .style CSS functions (i.e. creating a Styler object out of the df) to make the table look nice.
I have a DataFrame where two of the principal fields (in my example here they are "Name" and "Bucket") will be variable. The desired display order will also change (so it can't be hard-coded) but it can be derived earlier in the application (e.g. "Name_Rank" and "Bucket_Rank") into Integer "Sorting Values" which can be easily sorted (and theoretically dropped later).
I can drop the column Sorting Value, but not the row/header/axis(?). Additionally, no matter what I try, I just can't seem to get rid of the blank row between the headers and the data table.
I think I need to set the index to Bucket and the headers to "Name" and "TDY/Change" to use the .style Styler-object functionality properly.
import pandas as pd
import numpy as np

data = [
    ['AAA', 2, 'X', 3, 5, 1],
    ['AAA', 2, 'Y', 1, 10, 2],
    ['AAA', 2, 'Z', 2, 15, 3],
    ['BBB', 3, 'X', 3, 15, 3],
    ['BBB', 3, 'Y', 1, 10, 2],
    ['BBB', 3, 'Z', 2, 5, 1],
    ['CCC', 1, 'X', 3, 10, 2],
    ['CCC', 1, 'Y', 1, 15, 3],
    ['CCC', 1, 'Z', 2, 5, 1],
]
df = pd.DataFrame(data, columns=['Name', 'Name_Rank', 'Bucket', 'Bucket_Rank', 'Price', 'Change'])
display(df)
  Name  Name_Rank Bucket  Bucket_Rank  Price  Change
0  AAA          2      X            3      5       1
1  AAA          2      Y            1     10       2
2  AAA          2      Z            2     15       3
3  BBB          3      X            3     15       3
4  BBB          3      Y            1     10       2
5  BBB          3      Z            2      5       1
6  CCC          1      X            3     10       2
7  CCC          1      Y            1     15       3
8  CCC          1      Z            2      5       1
Based on the prior question/answer I can pretty much get the table into the right format:
df2 = (pd.pivot_table(df, values=['Price', 'Change'], index=['Bucket_Rank', 'Bucket'],
                      columns=['Name_Rank', 'Name'], aggfunc=np.mean)
         .swaplevel(1, 0, axis=1)
         .sort_index(level=0, axis=1)
         .reindex(['Price', 'Change'], level=1, axis=1)
         .swaplevel(2, 1, axis=1)
         .rename_axis(columns=[None, None, None])
       ).reset_index().drop('Bucket_Rank', axis=1).set_index('Bucket').rename_axis(columns=[None, None, None])
which looks like this:
            1             2             3
          CCC           AAA           BBB
        Price Change  Price Change  Price Change
Bucket
Y          15      3     10      2     10      2
Z           5      1     15      3      5      1
X          10      2      5      1     15      3
Ok, so...
A) How do I get rid of the row/header/axis(?) that used to be "Name_Rank" (i.e. the integer "Sorting Values" 1, 2, 3)? I figured out a hack where the df is exported to XLS and re-imported with header=(1,2), but that can't be the best way to accomplish the objective.
B) How do I get rid of the blank row above the data in the table? From what I've read online it seems like you should use rename_axis([None]), but this doesn't seem to work no matter which order I try.
C) Is there a way to set the header(s) such that both what used to be "Name" and the "Price/Change" rows are headers, so that the .style functionality can be employed to format them separately from the data in the table below?
Thanks a lot for whatever suggestions anyone might have. I'm totally stuck!
Cheers,
Devon
In pandas 1.4.0 the options for A and B are directly available using the Styler.hide method:
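A minimal sketch (assuming the df2 built above and pandas >= 1.4, where level hides a specific header level and names hides the index name that produces the stray row):
styler = (df2.style
             .hide(axis="columns", level=0)   # (A) hide the integer Name_Rank sorting level
             .hide(axis="index", names=True)) # (B) hide the index name behind the blank-looking row
styler  # renders in a notebook; styler.to_html() for the emailed report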

select first value of specific column for each ID in sorted pandas data frame

For example, my data frame is:
ID  time      number
a   14:03:01  11
b   14:03:02   7
b   14:03:15   2
c   14:03:09   5
a   14:03:02   9
d   14:03:17   1
a   14:03:35  15
c   14:03:11   8
I sort this data frame by time, and for each ID I want to get the value of the number column at the earliest time. I know how to do this in SQL, but I'm confused about how to do it in pandas. The desired output is:
ID  number
a   11
b    7
c    5
d    1
How can I do this using pandas? (I don't want to use a "for loop".)
Try the sort_values(), drop_duplicates() and drop() methods:
out = df.sort_values('time').drop_duplicates(subset=['ID']).drop(columns='time')
Or via groupby() and first(), sorting by time first so that first() picks the earliest row:
out = df.sort_values('time').groupby('ID', as_index=False)['number'].first()
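A third option, sketched here for completeness, picks each ID's row with the minimum time via idxmin; parsing the times first ensures they compare chronologically:
# Parse times so they sort chronologically, then take each ID's earliest row
df['time'] = pd.to_datetime(df['time'], format='%H:%M:%S')
out = df.loc[df.groupby('ID')['time'].idxmin(), ['ID', 'number']]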

Transpose all rows in one column of dataframe to multiple columns based on certain conditions

I would like to convert one column of data into multiple columns in a dataframe, based on certain values/conditions.
Please find below the code to generate the input dataframe:
df1 = pd.DataFrame({'VARIABLE': ['studyid', 1, 'age_interview', 65, 'Gender', '1.Male',
                                 '2.Female',
                                 'Ethnicity', '1.Chinese', '2.Indian', '3.Malay']})
The data looks as shown below.
Please note that I may not know the column names in advance, but the data usually follows this format. What I have shown above is sample data; real data might have around 600-700 columns arranged in this fashion.
What I would like to do is turn the values which start with non-digits (characters) into new columns of a dataframe. It can be a new dataframe.
I attempted to write a for loop but failed due to the error below. Can you please help me achieve this outcome?
for i in range(3, len(df1)):
    # str(df1['VARIABLE'][i].contains('^\d'))
    if df1['VARIABLE'][i].astype(str).contains('^\d') == True:
Through the above loop, I was trying to check whether the first character is a digit: if yes, retain it as a value (e.g. 1, 2, 3, etc.), and if it's a character (e.g. Gender, Ethnicity, etc.), create a new column. But I guess this is an incorrect and lengthy approach.
For example, in the above example, the columns would be studyid, age_interview, Gender and Ethnicity.
The final output would look like this
Can you please let me know if there is an elegant approach to do this?
You can use groupby to do something like:
m = ~df1['VARIABLE'].str[0].str.isdigit().fillna(True)
new_df = (pd.DataFrame(df1.groupby(m.cumsum()).VARIABLE.apply(list).values.tolist())
            .set_index(0).T)
print(new_df.rename_axis(None, axis=1))
  studyid age_interview    Gender  Ethnicity
1       1            65    1.Male  1.Chinese
2    None          None  2.Female   2.Indian
3    None          None      None    3.Malay
Explanation: m is a helper series which helps separate the groups:
print(m.cumsum())
0 1
1 1
2 2
3 2
4 3
5 3
6 3
7 4
8 4
9 4
10 4
Then we group on this helper series and apply list:
df1.groupby(m.cumsum()).VARIABLE.apply(list)
VARIABLE
1 [studyid, 1]
2 [age_interview, 65]
3 [Gender, 1.Male, 2.Female]
4 [Ethnicity, 1.Chinese, 2.Indian, 3.Malay]
Name: VARIABLE, dtype: object
At this point we have each group as a list, with the column name as the first entry.
So we create a dataframe from this, set the first column as the index, and transpose to get our desired output.
Use itertools.groupby and then construct pd.DataFrame:
import itertools

import pandas as pd

l = ['studyid', 1, 'age_interview', 65, 'Gender', '1.Male',
     '2.Female',
     'Ethnicity', '1.Chinese', '2.Indian', '3.Malay']
# Work with strings so the first character can be tested
l = list(map(str, l))
# Alternate groups: header names (non-numeric start) and their values
grouped = [list(g) for k, g in itertools.groupby(l, key=lambda x: x[0].isnumeric())]
d = {k[0]: v for k, v in zip(grouped[::2], grouped[1::2])}
pd.DataFrame.from_dict(d, orient='index').T
Output:
Gender studyid age_interview Ethnicity
0 1.Male 1 65 1.Chinese
1 2.Female None None 2.Indian
2 None None None 3.Malay
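(Side note: the column order above reflects an older Python; dicts preserve insertion order on Python 3.7+, so you would now get studyid, age_interview, Gender, Ethnicity.)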
