I have a table that is already grouped according to the first column. I would like to split the table into sub-tables containing only the corresponding second column. I would like to use pandas or something else in Python. I am not keen to use "awk" because that would require "subprocess" or "os". In the end I actually only need the entries in the second column separated according to the first. The size of the table can be about 10000 rows × 6 columns.
These are similar posts that I found, but I could not figure out how to modify them for my purpose.
Split pandas dataframe based on groupby
Splitting groupby() in pandas into smaller groups and combining them
The table/dataframe that I have looks like this:
P0A910 sp|A0A2C5WRC3| 84.136 0.0 100
P0A910 sp|A0A068Z9R6| 73.816 0.0 99
Q9HVD1 sp|A0A2G2MK84| 37.288 4.03e-34 99
Q9HVD1 sp|A0A1H2GM32| 40.571 6.86e-32 98
P09169 sp|A0A379DR81| 52.848 2.92e-117 99
P09169 sp|A0A127L436| 49.524 2.15e-108 98
And I would like it to be split like the following
group1:
P0A910 A0A2C5WRC3
P0A910 A0A068Z9R6
group2:
Q9HVD1 A0A2G2MK84
Q9HVD1 A0A1H2GM32
group3:
P09169 A0A379DR81
P09169 A0A127L436
OR into lists
P0A910:
A0A2C5WRC3
A0A068Z9R6
Q9HVD1:
A0A2G2MK84
A0A1H2GM32
P09169:
A0A379DR81
A0A127L436
So your problem is rather to separate the strings. Is this what you want?
new_col = df[1].str[3:-1]
list(new_col.groupby(df[0]))
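For reference, a minimal sketch of how this might look end to end, assuming the table is whitespace-separated with no header row (the file name BlastOut.tsv is an assumption):
import pandas as pd

# read the whitespace-separated table with no header row (assumed layout)
df = pd.read_csv("BlastOut.tsv", sep=r"\s+", header=None)

# strip the leading "sp|" and trailing "|" from the second column
new_col = df[1].str[3:-1]

# build a dict: first-column ID -> list of cleaned second-column IDs
groups = {key: vals.tolist() for key, vals in new_col.groupby(df[0])}
print(groups)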
So I managed to get a solution of some sort. In this solution I removed the prefixes in the second column and used groupby in pandas to group the entries by the first column. Then I looped through the groups and wrote each group to a separate csv file. I took help from @Quang's answer and this link. It could probably be done in better ways, but here is my code:
import pandas as pd

# read .csv as dataframe
data = pd.read_csv("BlastOut.csv")

# truncate sp| | from the second column (['B'])
new_col = data['B'].str[3:-1]

# replace the second column with new_col
data['B'] = new_col

# group dataframe by the first column (['A'])
grouped = data.groupby('A')

# loop through the groups and write each group to a .csv file named after
# the group ([group_name].csv)
for group_name, group in grouped:
    group.to_csv('Out_{}.csv'.format(group_name))
Update: removed all columns except the column of interest. This is a continuation of the previous code.
import glob

# read all csv files whose names start with "Out_"
files = glob.glob("Out_*.csv")

# loop through all csv files
for f in files:
    df = pd.read_csv(f, index_col=0)
    # drop columns by column title (["A"])
    df.drop(["A"], axis=1, inplace=True)
    df.to_csv(f, index=False)
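As a side note on the design, the second pass over the files can be avoided by keeping only column B before writing each group. A minimal single-pass sketch under the same column names:
import pandas as pd

data = pd.read_csv("BlastOut.csv")
data['B'] = data['B'].str[3:-1]

# write only column B for each group, one file per value of column A
for group_name, group in data.groupby('A'):
    group[['B']].to_csv('Out_{}.csv'.format(group_name), index=False)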
I'm stuck with this issue:
I want to replace each value in one column of a csv file with an id.
I have vehicle names and ids in the database:
In the csv file this column looks like this:
I was thinking of using pandas to make the replacement:
df = pd.read_csv(file).replace('ALFA ROMEO 147 (937), 10.04 - 05.10', '0')
But writing replace 2000+ times like this is clearly the wrong way.
So, how can I use the names from the db and replace them with the correct id?
A possible solution is to merge the second dataset with the first one:
After reading the two datasets (df1, the one from the csv file, and df2, the one with vehicle_id):
df1.merge(df2, how='left', on='vehicle')
So that the final output will be a dataset with columns:
id, vehicle, vehicle_id
Imagine df1 as:
and df2 as:
the result will be:
Here you can find the documentation: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html
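Since the example tables are not shown above, here is a minimal runnable sketch with made-up data (the second vehicle name and the ids are assumptions):
import pandas as pd

# df1: the rows from the csv file, with an id and a vehicle name
df1 = pd.DataFrame({
    "id": [1, 2, 3],
    "vehicle": ["ALFA ROMEO 147 (937), 10.04 - 05.10",
                "AUDI A4 (8E2, B6), 11.00 - 12.04",
                "ALFA ROMEO 147 (937), 10.04 - 05.10"],
})

# df2: the lookup table from the database, vehicle name -> vehicle_id
df2 = pd.DataFrame({
    "vehicle": ["ALFA ROMEO 147 (937), 10.04 - 05.10",
                "AUDI A4 (8E2, B6), 11.00 - 12.04"],
    "vehicle_id": [0, 1],
})

# left merge keeps every csv row and attaches the matching vehicle_id
merged = df1.merge(df2, how='left', on='vehicle')
print(merged)  # columns: id, vehicle, vehicle_id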
I have a dataframe with two columns, Stock and DueDate, where I need to select the first row from repeated consecutive entries based on the Stock column.
df:
I am expecting output like below,
Expected output:
My Approach
The approach I tried is to first list which rows are repeating based on the Stock column by creating a new column repeated_yes, and then subset the first row of each run of repeated values.
I have used the lines of code below to create the new column "repeated_yes":
ss = df.Stock.ne(df.Stock.shift())
df['repeated_yes'] = ss.groupby(ss.cumsum()).cumcount() + 1
So the new, updated dataframe looks like this:
df_new
But I am stuck on subsetting only rows 3 and 8 in order to attain the result. If there is any other effective approach, it would be helpful.
Edited:
Forgot to include the actual full question:
If there are any other rows below the last row in the dataframe df it should not display any output.
Chain another mask, created by Series.duplicated with keep=False, using & for bitwise AND, and filter with boolean indexing:
ss = df.Stock.ne(df.Stock.shift())
ss1 = ss.cumsum().duplicated(keep=False)
df = df[ss & ss1]
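Since the example frame is not shown above, here is a small made-up illustration of how the two masks combine (the Stock and DueDate values are assumptions):
import pandas as pd

df = pd.DataFrame({
    "Stock": ["A", "A", "A", "B", "C", "C"],
    "DueDate": ["2021-01-01", "2021-01-02", "2021-01-03",
                "2021-01-04", "2021-01-05", "2021-01-06"],
})

ss = df.Stock.ne(df.Stock.shift())        # True at the start of each consecutive run
ss1 = ss.cumsum().duplicated(keep=False)  # True for runs longer than one row
print(df[ss & ss1])                       # first rows of the "A" and "C" runs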
I have two csv's in which the rows can be matched by the value in one column (after some tweaking of this column). After the matching I want to take some values from both of them and make a new, combined row. I thought of a simple script using csv.DictReader for both of them and then a double loop:
for row1 in csv1:
    for row2 in csv2:
        if row1['someID'] == row2['someID']:
            newdict = ...  # etc.
However, one file is 9 million rows and the other is 500k rows, so my code would take 4.5 * 10^12 iterations. Hence my question: what is a fast way to match them? Important:
This 'someID' on which they are matched is not unique in either csv.
I want additional rows for every match. So if a 'someID' appears twice in csv1 and 3 times in csv2, I expect 6 rows with this 'someID' in the final result.
Try this: instead of iterating, use pandas.read_csv() on both files, and merge them on someID. https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.merge.html
For example:
import pandas as pd
csv1 = pd.read_csv(path1)
csv2 = pd.read_csv(path2)
merged = csv1.merge(csv2, on='someID')
merged['new_column'] = ...
Pandas operations work over entire numpy arrays, which is much faster than iterating at the element level.
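On the point about duplicate keys: merge performs a many-to-many join by default, so repeated someID values multiply out exactly as described in the question. A small made-up illustration:
import pandas as pd

csv1 = pd.DataFrame({"someID": [7, 7], "a": ["x", "y"]})
csv2 = pd.DataFrame({"someID": [7, 7, 7], "b": [1, 2, 3]})

# many-to-many join: 2 rows x 3 rows with someID == 7 -> 6 combined rows
merged = csv1.merge(csv2, on="someID")
print(len(merged))  # 6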
I am looking into searching a csv file with 242000 rows and want to count the unique identifiers in one of the columns. The column name is 'logid' and it has a number of different values, e.g. 1002, 3004, 5003. I want to search the csv file using a pandas dataframe and count how often each unique identifier occurs. If possible, I would then like to create a new csv file that stores this information. For example, if I find there are 50 logids of 1004, I would like to create a csv file that has the column name 1004 with the count of 50 displayed below. I would do this for all unique identifiers and add them to the same csv file. I am completely new at this and have done some searching but have no idea where to start.
Thanks!
As you haven't posted your code, I can only give you an answer about the general way it would work.
1. Load the CSV file into a pd.DataFrame using pandas.read_csv.
2. Save all values which have an occurrence > 1 in a separate df1 using pandas.DataFrame.drop_duplicates like:
df1 = df.drop_duplicates(keep="first")
--> This will return a DataFrame which only contains the rows with the first occurrence of duplicate values. E.g. if the value 1000 is in 5 rows, only the first row will be returned while the others are dropped.
--> Applying df1.shape[0] will give you the number of distinct values in your df.
3. If you want to store all rows of df which contain a "duplicate value" in a separate CSV file, you have to do something like this:
df = pd.DataFrame({"A": [0, 1, 2, 3, 0, 1, 2, 5, 5]})  # this should represent your original data set
print(df)

# I assume the column with the duplicate values is column "A"; if you want to
# check the whole row, just omit the subset keyword.
df1 = df.drop_duplicates(subset="A", keep="first")
print(df1)

frames = []  # avoid shadowing the built-in name "list"
for m in df1["A"]:
    mask = (df == m)
    frames.append(df[mask].dropna())

for dfx in range(len(frames)):
    name = "file{0}".format(dfx)
    frames[dfx].to_csv(r"YOUR PATH\{0}".format(name))
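As an aside, for the specific output the question asks for (one column per logid with its count below), pandas.Series.value_counts can produce the counts directly; a minimal sketch, assuming the input file is named log.csv:
import pandas as pd

df = pd.read_csv("log.csv")

# count how many times each logid occurs
counts = df["logid"].value_counts()

# one column per logid, with its count in the single row below it
counts.to_frame().T.to_csv("logid_counts.csv", index=False)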
I have an empty pandas dataframe as displayed in the first picture.
(picture: first dataframe)
So, many, many Pfam IDs as columns and many different gene IDs as indices. Then I have a second dataframe like this.
(picture: second dataframe)
Now what I would like to do is get the data from the second into the first. Doing this, I would simply like to write a 0 in each Pfam column that has no entry for a particular gene ID, and a 1 in each case where a gene has a Pfam.
Any help would be highly appreciated.
Assume the first dataframe is named d1 and the second is d2:
d1.fillna(d2.groupby([d2.index, 'Pfam']).size().clip(upper=1).unstack()).fillna(0)
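A minimal sketch of the idea with made-up gene and Pfam IDs (d1 starts out as an all-NaN frame, as described in the question):
import numpy as np
import pandas as pd

genes = ["gene1", "gene2"]
pfams = ["PF00001", "PF00002", "PF00003"]

# d1: empty frame, gene IDs as index, Pfam IDs as columns
d1 = pd.DataFrame(np.nan, index=genes, columns=pfams)

# d2: one row per (gene, Pfam) association
d2 = pd.DataFrame({"Pfam": ["PF00001", "PF00003", "PF00002"]},
                  index=["gene1", "gene1", "gene2"])

# 1 where a gene has that Pfam, 0 everywhere else
out = d1.fillna(d2.groupby([d2.index, "Pfam"]).size().clip(upper=1).unstack()).fillna(0)
print(out)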