Python pandas: fill a dataframe with data from another

I have an empty pandas dataframe, as displayed in the first picture: many, many Pfam IDs as columns and many different gene IDs as the index. Then I have a second dataframe, shown in the second picture, with one row per gene ID/Pfam pair.
What I would like to do is get the data from the second dataframe into the first: write a 0 in each Pfam column that has no entry for a particular gene ID, and a 1 wherever a gene does have that Pfam.
Any help would be highly appreciated.

Assume the first dataframe is named d1 and the second is d2:
d1 = d1.fillna(d2.groupby([d2.index, 'Pfam']).size().clip(upper=1).unstack()).fillna(0)
Here size() counts the rows for each (gene ID, Pfam) pair, clip(upper=1) turns every positive count into a 1, unstack() pivots the Pfam values into columns, and the trailing fillna(0) writes 0 into every combination that never occurs.
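For illustration, here is a minimal runnable sketch of the same idea; the gene and Pfam IDs (geneA, PF00001, etc.) are made up, not from the question:

import pandas as pd

# Hypothetical long-format data: one row per (gene ID, Pfam) hit
d2 = pd.DataFrame({'Pfam': ['PF00001', 'PF00002', 'PF00001']},
                  index=['geneA', 'geneA', 'geneB'])

# Empty target frame with all gene IDs as index and all Pfam IDs as columns
d1 = pd.DataFrame(index=['geneA', 'geneB', 'geneC'],
                  columns=['PF00001', 'PF00002', 'PF00003'])

# Count hits per (gene, Pfam), cap the counts at 1, pivot Pfam into columns
presence = d2.groupby([d2.index, 'Pfam']).size().clip(upper=1).unstack()

# Fill the matching cells of d1, then zero out everything still missing
result = d1.fillna(presence).fillna(0)
print(result)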

Related

How to combine columns in a dataframe based on date & time and take the mean of the values

I would like to ask how I can group the dataframe shown in (existing dataframe) based on date & time and take the means of the values. What I mean is that if column B has two values in the same minute, it should take the average of those values, and do the same for the rest of the columns. What I want to achieve is one value per minute, as shown in (preprocessed dataframe).
Thank you
If your dataframe is called df, you can do the following:
df.groupby(['DataTime']).mean()
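If the timestamps are raw readings rather than values already rounded to the minute, you can floor them first. A sketch with invented readings, assuming the timestamp column is named DataTime as above:

import pandas as pd

# Invented data: the first two readings fall inside the same minute
df = pd.DataFrame({
    'DataTime': pd.to_datetime(['2021-01-01 10:00:10',
                                '2021-01-01 10:00:50',
                                '2021-01-01 10:01:30']),
    'B': [1.0, 3.0, 5.0],
})

# Floor each timestamp to its minute so rows in the same minute share a key,
# then average every numeric column within each minute
per_minute = df.groupby(df['DataTime'].dt.floor('min')).mean(numeric_only=True)
print(per_minute)  # 10:00 -> 2.0, 10:01 -> 5.0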

How to make a new dataframe from an existing dataframe with unique row values of one column and corresponding row values from other columns?

I have a dataframe 'raw' that looks like this -
It has many rows with duplicate values in each column.
I want to make a new dataframe 'new_df' which has each unique customer_code and its corresponding market_code.
The new_df should look like this -
It sounds like you simply want to create a DataFrame with unique customer_code which also shows market_code. Here's a way to do it:
df = df[['customer_code','market_code']].drop_duplicates('customer_code')
Output:
customer_code market_code
0 Cus001 Mark001
1 Cus003 Mark003
3 Cus004 Mark003
4 Cus005 Mark004
The part reading df[['customer_code','market_code']] gives us a DataFrame containing only the two columns of interest, and drop_duplicates('customer_code') eliminates all but the first occurrence of each duplicated value in the customer_code column (you could instead keep the last occurrence by passing the keep='last' argument).
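For illustration, the same call end to end on toy data invented to mirror the output above:

import pandas as pd

# Toy data shaped like the question's columns; Cus003 appears twice
raw = pd.DataFrame({
    'customer_code': ['Cus001', 'Cus003', 'Cus003', 'Cus004', 'Cus005'],
    'market_code':   ['Mark001', 'Mark003', 'Mark002', 'Mark003', 'Mark004'],
})

# First occurrence wins by default; keep='last' would retain Mark002 for Cus003
new_df = raw[['customer_code', 'market_code']].drop_duplicates('customer_code')
print(new_df)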

Drop duplicates from a pandas dataframe based on all columns starting from the third one

I have a dataframe with 50+ columns, and the first 2 are unique IDs. For some reason, rows with different IDs can have exactly the same data from the third column onward.
What I want to achieve is to delete the duplicates from the dataframe based on all columns starting from the third one. If there is more than one row with different IDs but the same data from the third column onward, it does not matter which row we keep; it can be the first or the last, whichever is easier.
I am fairly new to pandas; what I tried is something like this:
df.drop_duplicates(subset=df.iloc[2:], keep="last")
df.drop_duplicates expects column labels as the subset argument, but df.iloc[2:] selects rows, not columns. Use df.columns[2:] to get the labels from the third column on:
df.drop_duplicates(subset=df.columns[2:], keep="last")
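A quick sketch of why this works; the column names here are invented:

import pandas as pd

# Two ID columns followed by data columns; rows 0 and 1 share the same data
df = pd.DataFrame({
    'id1': [1, 2, 3],
    'id2': ['a', 'b', 'c'],
    'x':   [10, 10, 20],
    'y':   [5, 5, 9],
})

# df.columns[2:] is Index(['x', 'y']), i.e. every column from the third one on
deduped = df.drop_duplicates(subset=df.columns[2:], keep='last')
print(deduped)  # the row with id1 == 1 is dropped in favour of id1 == 2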

pandas max function results in inoperable DataFrame

I have a DataFrame with four columns and want to generate a new DataFrame with only one column containing the maximum value of each row.
Using df2 = df1.max(axis=1) gave me the correct results, but the column is titled 0 and is not operable, meaning I cannot check its data type or change its name, which is critical for further processing. Does anyone know what is going on here? Or better yet, is there a better way to generate this new DataFrame?
df1.max(axis=1) returns a Series; for a one-column DataFrame, use Series.to_frame:
df2 = df1.max(axis=1).to_frame('maximum')
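A small sketch showing the difference; the data is invented:

import pandas as pd

df1 = pd.DataFrame({'a': [1, 4], 'b': [7, 2], 'c': [3, 8], 'd': [5, 6]})

# max(axis=1) alone yields a Series, which has no column to rename
s = df1.max(axis=1)

# to_frame('maximum') wraps it in a one-column DataFrame with a usable name
df2 = df1.max(axis=1).to_frame('maximum')
print(df2['maximum'].dtype)                       # now checkable
df2 = df2.rename(columns={'maximum': 'row_max'})  # and renamable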

Searching CSV files with Pandas (unique id's) - Python

I am searching a CSV file with 242,000 rows and want to count the occurrences of each unique identifier in one of the columns. The column name is 'logid' and it has a number of different values, e.g. 1002, 3004, 5003. I want to search the CSV file using a pandas dataframe and count the unique identifiers. If possible, I would then like to create a new CSV file that stores this information. For example, if I find there are 50 logids of 1004, I would like to create a CSV file that has the column name 1004 with the count of 50 displayed below it. I would do this for all unique identifiers, adding them to the same CSV file. I am completely new at this and have done some searching, but I have no idea where to start.
Thanks!
As you haven't posted your code, I can only describe the general way it would work.
1. Load the CSV file into a pd.DataFrame using pandas.read_csv.
2. Save the first occurrence of each value in a separate df1 using pandas.DataFrame.drop_duplicates, like:
df1 = df.drop_duplicates(keep="first")
--> This will return a DataFrame which only contains the rows with the first occurrence of each duplicated value. E.g. if the value 1000 appears in 5 rows, only the first of those rows is kept while the others are dropped.
--> df1.shape[0] then gives the number of distinct values, and df.shape[0] - df1.shape[0] the number of dropped duplicate rows.
3. If you want to store all rows of df which contain a "duplicate value" in separate CSV files, you have to do something like this:
import pandas as pd

df = pd.DataFrame({"A": [0, 1, 2, 3, 0, 1, 2, 5, 5]})  # stand-in for your original data set
print(df)

# Keep the first occurrence of each value in column "A"; omit the subset
# keyword if you want to compare whole rows instead.
df1 = df.drop_duplicates(subset="A", keep="first")
print(df1)

# Collect the rows of df matching each distinct value, then write one CSV per value
frames = []
for m in df1["A"]:
    frames.append(df[df["A"] == m])
for i, frame in enumerate(frames):
    name = "file{0}".format(i)
    frames[i].to_csv(r"YOUR PATH\{0}".format(name))
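For the specific output the question describes (one column per unique logid with its count below it), Series.value_counts is a shorter route; a sketch assuming the input file is named data.csv:

import pandas as pd

# The file name is an assumption; the column name 'logid' comes from the question
df = pd.read_csv('data.csv')
counts = df['logid'].value_counts()

# Transpose so each logid becomes a column header with its count beneath it
counts.to_frame().T.to_csv('logid_counts.csv', index=False)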
