Searching CSV files with Pandas (unique id's) - Python

I am searching a CSV file with 242000 rows and want to count the unique identifiers in one of the columns. The column is named 'logid' and contains a number of different values, e.g. 1002, 3004, 5003. I want to search the CSV file using a pandas DataFrame and count how many times each unique identifier appears. If possible, I would then like to create a new CSV file that stores this information. For example, if I find there are 50 logids of 1004, I would like the new CSV file to have a column named 1004 with the count of 50 displayed below it. I would do this for all unique identifiers and add them to the same CSV file. I am completely new at this and have done some searching, but have no idea where to start.
Thanks!

As you haven't posted your code, I can only describe the general way it would work.
1. Load the CSV file into a pd.DataFrame using pandas.read_csv.
2. Keep only the first occurrence of each value in a separate DataFrame df1 using pandas.DataFrame.drop_duplicates, like:
df1 = df.drop_duplicates(keep="first")
--> This will return a DataFrame which only contains the rows with the first occurrence of each value. E.g. if the value 1000 appears in 5 rows, only the first row is returned while the others are dropped.
--> Applying df1.shape[0] will then give you the number of unique values in your df.
3. If you want to store all rows of df which contain a "duplicate value" in a separate CSV file, you have to do something like this:
import pandas as pd

df = pd.DataFrame({"A": [0, 1, 2, 3, 0, 1, 2, 5, 5]})  # This should represent your original data set
print(df)
# I assume the column with the duplicate values is column "A"; if you want to check the whole row just omit the subset keyword.
df1 = df.drop_duplicates(subset="A", keep="first")
print(df1)
frames = []
for m in df1["A"]:
    mask = (df == m)
    frames.append(df[mask].dropna())
for dfx in range(len(frames)):
    name = "file{0}".format(dfx)
    frames[dfx].to_csv(r"YOUR PATH\{0}".format(name))
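If the goal is simply the count of each logid laid out as one column per identifier, as described in the question, pandas.Series.value_counts gets there more directly. A minimal sketch, assuming the input file is called logs.csv and the identifier column is named 'logid':

import pandas as pd

df = pd.read_csv("logs.csv")           # assumed input file name
counts = df["logid"].value_counts()    # number of rows per unique logid
# transpose so each logid becomes a column header with its count below it
counts.to_frame().T.to_csv("logid_counts.csv", index=False)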

Related

Creating a new dataframe based on rows of an existing dataframe which contain only specific characters

In Python I am trying to create a new dataframe by appending all rows which do not contain certain
characters in a certain column of another dataframe. Afterwards I want to turn the generated list containing the results into a dataframe.
However, this result is only a one-column dataframe and does not include all the columns of the first dataframe (for the rows which do not contain those characters, which is what I need).
Does anybody have a suggestion on how to add all the columns to the new dataframe?
%%time
newlist = []
for row in old_dataframe['column']:
    if row != (r'^[^\s]'):
        newlist.append(row)
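One way to keep every column, rather than collecting values from a single column, is to build a boolean mask on that column and index the whole DataFrame with it. A minimal sketch with made-up data and a hypothetical unwanted character '#' (the real pattern is not shown in the question):

import pandas as pd

# toy stand-in for old_dataframe from the question; 'column' holds the text being checked
old_dataframe = pd.DataFrame({'column': ['keep me', '#drop me', 'also keep'],
                              'other': [1, 2, 3]})
unwanted = '#'  # hypothetical character(s) to exclude
mask = ~old_dataframe['column'].str.contains(unwanted, regex=False)
new_dataframe = old_dataframe[mask]  # boolean indexing keeps every column, not just 'column'
print(new_dataframe)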

Cleaning dataframe - assign value in one cell to column

I am reading multiple CSV files from a folder into a dataframe. I loop over all the files in the folder and then concat the dataframes to obtain the final dataframe.
However, each CSV file has one summary row from which I want to extract the date and then add it as a new column for all the rows in that csv/dataframe.
df = pd.read_csv(f, header=None, names=['Inverter', 'Day Yield', 'month Yield', 'Year Yield', 'SpecificYieldDay', 'SYMth', 'SYYear', 'Power'], sep=';', **kwargs)
df['date'] = df.loc[[0], ['Day Yield']]
df
I expect the ['date'] column to be filled with the date for that file for all the rows in that particular csv, but it gets filled correctly only for the first row.
Refer to the attached image of the dataframe: I want all the rows of the 'date' column to show 7/25/2019 instead of only the first row.
I have also attached an image showing an example of one of the csv files I am reading from.
If I understood correctly, the value that you want to add as a new column for all rows is in df.loc[[0],['Day Yield']].
If that is correct you can do the following:
df = df.assign(date=[df.loc[0, 'Day Yield']] * len(df))
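For context, a minimal sketch of how that fits into the read-and-concat loop described in the question; the glob pattern is an assumption, and assigning the scalar directly also works because pandas broadcasts it to every row:

import glob
import pandas as pd

frames = []
for f in glob.glob("*.csv"):  # assumed location/pattern of the input files
    df = pd.read_csv(f, header=None,
                     names=['Inverter', 'Day Yield', 'month Yield', 'Year Yield',
                            'SpecificYieldDay', 'SYMth', 'SYYear', 'Power'],
                     sep=';')
    df['date'] = df.loc[0, 'Day Yield']  # scalar from the summary row, broadcast to every row
    frames.append(df)
final = pd.concat(frames, ignore_index=True)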

split groups in a table into tables of its sub-groups

I have a table that is already grouped according to the first column. I would like to split the table into sub-tables containing only the corresponding second column. I would like to use pandas or something else in Python; I am not keen on using "awk" because that would require "subprocess" or "os". In the end I actually only need the entries in the second column, separated according to the first. The table can be about 10000 rows x 6 columns.
These are similar posts that I found but I could not figure how to modify them for my purpose.
Split pandas dataframe based on groupby
Splitting groupby() in pandas into smaller groups and combining them
The table/dataframe that I have looks like this:
P0A910 sp|A0A2C5WRC3| 84.136 0.0 100
P0A910 sp|A0A068Z9R6| 73.816 0.0 99
Q9HVD1 sp|A0A2G2MK84| 37.288 4.03e-34 99
Q9HVD1 sp|A0A1H2GM32| 40.571 6.86e-32 98
P09169 sp|A0A379DR81| 52.848 2.92e-117 99
P09169 sp|A0A127L436| 49.524 2.15e-108 98
And I would like it to be split like the following
group1:
P0A910 A0A2C5WRC3
P0A910 A0A068Z9R6
group2:
Q9HVD1 A0A2G2MK84
Q9HVD1 A0A1H2GM32
group3:
P09169 A0A379DR81
P09169 A0A127L436
OR into lists
P0A910:
A0A2C5WRC3
A0A068Z9R6
Q9HVD1:
A0A2G2MK84
A0A1H2GM32
P09169:
A0A379DR81
A0A127L436
So your problem is rather to separate the strings. Is this what you want:
new_col = df[1].str[3:-1]
list(new_col.groupby(df[0]))
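If the "lists" layout at the end of the question is the desired result, the same grouping can be collected into a dict keyed by the first column. A small sketch, reusing new_col and df from the two lines above:

# map each accession in the first column to the list of cleaned IDs from the second
groups = {key: vals.tolist() for key, vals in new_col.groupby(df[0])}
# e.g. groups['P0A910'] -> ['A0A2C5WRC3', 'A0A068Z9R6']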
So I managed to get a solution of some sort. In this solution I removed the prefixes in the second column and used groupby in pandas to group the entries by the first column. Then I looped through the groups and wrote each one to a separate csv file. I took help from #Quang's answer and this link. It could probably be done in better ways, but here is my code:
import pandas as pd

# read .csv as dataframe
data = pd.read_csv("BlastOut.csv")
# truncates sp| | from second column (['B'])
new_col = data['B'].str[3:-1]
# replaces second column with new_col
data['B'] = new_col.to_frame(name=None)
# groups dataframe by first column (['A'])
grouped = data.groupby('A')
# loops through grouped items and writes each group to .csv file with title
# of group ([group_name].csv)
for group_name, group in grouped:
    group.to_csv('Out_{}.csv'.format(group_name))
Update: removed all columns except the column of interest. This is a continuation of the previous code.
import glob

# reads all csv files starting with "Out_" in filename
files = glob.glob("Out_*.csv")
# loop through all csv files
for f in files:
    df = pd.read_csv(f, index_col=0)
    # drop columns by column title (["A"])
    df.drop(["A"], axis=1, inplace=True)
    df.to_csv(f, index=False)

How to count attributes (columns) from a txt file in Python using pandas?

I have a txt file that has rows and columns of data in it. I am trying to figure out how to count the number of columns (attributes) in the whole txt file. Here is my code to read the txt file and count the columns, but it is giving me the wrong answer.
import pandas as pd
data_file = pd.read_csv('3human_evolution.txt')
data_file.columns = data_file.columns.str.strip()
A = len(data_file.columns)
print(A)
Note that len(data_file) would give you the number of rows, and DataFrame.size the product of rows and columns. The number of rows and columns is accessible with DataFrame.shape: shape gives you a tuple where the first entry is the number of rows and the second the number of columns. So you can print the number of columns with:
print(data_file.shape[1])
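A tiny self-contained illustration of how shape relates to rows and columns, using made-up data:

import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
print(df.shape)     # (3, 2): 3 rows, 2 columns
print(df.shape[1])  # 2 -> number of columns (attributes)
print(len(df))      # 3 -> len counts rows, not columns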

Python pandas: fill a dataframe with data from another

I have an empty pandas dataframe, as displayed in the first picture: many, many Pfam IDs as columns and many different gene IDs as indices. Then I have a second dataframe, shown in the second picture.
Now what I would like to do is get the data from the second dataframe into the first: write a 0 in each Pfam column that has no entry for a particular gene ID, and a 1 wherever a gene has that Pfam.
Any help would be highly appreciated.
Assume the first dataframe is named d1 and the second is d2:
d1.fillna(d2.groupby([d2.index, 'Pfam']).size().clip(upper=1).unstack(fill_value=0)).fillna(0)
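A small self-contained check of that pattern with made-up gene and Pfam IDs (all names here are illustrative, and d2 is assumed to be indexed by gene ID with a 'Pfam' column):

import pandas as pd

d1 = pd.DataFrame(index=['g1', 'g2', 'g3'], columns=['PF001', 'PF002'])              # empty frame to fill
d2 = pd.DataFrame({'Pfam': ['PF001', 'PF002', 'PF001']}, index=['g1', 'g1', 'g2'])   # gene -> Pfam pairs

filled = d1.fillna(d2.groupby([d2.index, 'Pfam']).size().clip(upper=1).unstack(fill_value=0)).fillna(0)
print(filled)
# expected: g1 -> 1, 1 ; g2 -> 1, 0 ; g3 -> 0, 0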

Categories

Resources