Python: Create dataframe with 'uneven' column entries - python

I am trying to create a dataframe where the column lengths are not equal. How can I do this?
I was trying to use groupby. But I think this will not be the right way.
import pandas as pd
data = {'filename':['file1','file1'], 'variables':['a','b']}
df = pd.DataFrame(data)
grouped = df.groupby('filename')
print(grouped.get_group('file1'))
Above is my sample code. The output of which is:
What can I do to just have one entry of 'file1' under 'filename'?
Eventually I need to write this to a csv file.
Thank you

If a column has only one entry, the remaining cells in that column will be NaN. So you could just filter out the NaNs by doing something like df = df.loc[df["filename"].notnull()]
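To make the NaN behaviour concrete, here is a minimal sketch. It assumes the columns are built from lists of different lengths wrapped in pd.Series (the question does not show how the uneven data arrives), and the names filtered and csv_text are only illustrative.

```python
import pandas as pd

# Columns of different lengths: wrapping each list in a Series lets pandas
# align them on the index and pad the shorter column with NaN.
df = pd.DataFrame({
    "filename": pd.Series(["file1"]),
    "variables": pd.Series(["a", "b"]),
})

# Keep only the rows where 'filename' is present.
filtered = df[df["filename"].notnull()]

# NaN is written as an empty field by default, so the unfiltered frame
# shows 'file1' only once in the CSV text.
csv_text = df.to_csv(index=False)
print(csv_text)
```

If the goal is a CSV where 'file1' appears once but both variables are kept, writing the unfiltered frame may already be enough, since the missing cell is left empty.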

Related

Adding rows using timestamp

I saw the code in "combine rows and add up value in dataframe", but I want to add the values in cells for the same day, i.e. add all data for a day. How do I modify the code to achieve this?
Check the code below:
import pandas as pd
df = pd.DataFrame({'Price':[10000,10000,10000,10000,10000,10000],
'Time':['2012.05','2012.05','2012.05','2012.06','2012.06','2012.07'],
'Type':['Q','T','Q','T','T','Q'],
'Volume':[10,20,10,20,30,10]
})
df.assign(daily_volume = df.groupby('Time')['Volume'].transform('sum'))
Output:
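As a rough sketch of what the call returns, the same frame can be rebuilt and compared against a plain aggregation. The names with_totals and per_period are only illustrative and not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame({
    "Price": [10000] * 6,
    "Time": ["2012.05", "2012.05", "2012.05", "2012.06", "2012.06", "2012.07"],
    "Type": ["Q", "T", "Q", "T", "T", "Q"],
    "Volume": [10, 20, 10, 20, 30, 10],
})

# transform('sum') broadcasts each group's total back onto every row,
# so the result keeps the original six rows.
with_totals = df.assign(daily_volume=df.groupby("Time")["Volume"].transform("sum"))

# A plain aggregation instead collapses to one row per 'Time' value.
per_period = df.groupby("Time", as_index=False)["Volume"].sum()

print(with_totals)
print(per_period)
```

Use the transform version when the total should sit next to every original row, and the aggregated version when one row per period is enough.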

I have to extract all the rows in a .csv corresponding to the rows with 'watermelon' through pandas

I am using this code, but instead of a new file with just the required rows, I'm getting an empty .csv with just the header.
import pandas as pd
df = pd.read_csv("E:/Mac&cheese.csv")
newdf = df[df["fruit"]=="watermelon"+"*"]
newdf.to_csv("E:/Mac&cheese(2).csv",index=False)
I believe the problem is in how you select the rows containing the word "watermelon". Instead of:
newdf = df[df["fruit"]=="watermelon"+"*"]
Try:
newdf = df[df["fruit"].str.contains("watermelon")]
In your example, pandas is looking for cells whose entire value is exactly the string "watermelon*"; the == comparison does not treat * as a wildcard.
The underscore is missing in pd.read_csv on the first call. It also looks like the file location is incorrect: the // is missing from the path.
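A small sketch of the difference between the two filters. The frame below is made-up data standing in for the CSV, and the column qty is hypothetical.

```python
import pandas as pd

# Hypothetical data standing in for the CSV from the question.
df = pd.DataFrame({
    "fruit": ["watermelon", "apple", "watermelon juice", None],
    "qty": [3, 5, 2, 1],
})

# Exact comparison: no cell equals the literal string "watermelon*",
# so the result is empty and only the header would be written out.
exact = df[df["fruit"] == "watermelon" + "*"]

# Substring match: na=False treats missing values as non-matches
# instead of propagating NaN into the boolean mask.
partial = df[df["fruit"].str.contains("watermelon", na=False)]

print(exact)
print(partial)
```

Note that str.contains treats its pattern as a regular expression by default; pass regex=False if the search text contains characters such as * or . that should be matched literally.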

Supplying the values from first column of dataframe

In one of the code snippets, the authors provide the input as:
variants = [ 'rs425277', 'rs1571149', 'rs1240707', 'rs1240708', 'rs873927', 'rs880051', 'rs1878745', 'rs2296716', 'rs2298217', 'rs2459994' ]
However, I have similar values as one of the columns in a csv file. How can I supply that column as input, similar to the example above?
Thanks in advance
First, import your csv as a Pandas df.
df = pd.read_csv('data.csv')
Then, you can get a list from pandas dataframe column:
col_one_list = df['column_one'].tolist()
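Put together, the two steps might look like the sketch below. The CSV text, the column names variant and score, and the use of io.StringIO in place of a real file are all assumptions made so the example runs on its own.

```python
import io

import pandas as pd

# Hypothetical CSV content; in practice this would be a file path.
csv_text = "variant,score\nrs425277,0.1\nrs1571149,0.2\nrs1240707,0.3\n"
df = pd.read_csv(io.StringIO(csv_text))

# Select the column by name and convert it to a plain Python list.
variants = df["variant"].tolist()

# If only the position is known, the first column can be taken with iloc.
first_col = df.iloc[:, 0].tolist()

print(variants)
```

The resulting variants list has the same shape as the hard-coded list in the snippet, so it can be passed in wherever that list was used.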

Python- loop through df and output as many dfs as rows

My python code produces a pandas dataframe that looks as follows:
I need to transform it to another format: loop through every row in the dataframe and output as many dataframes as there are rows in the table. Each dataframe should have an additional column, timestamp, and be named after the value in the "Type" column.
I am struggling with where to start- I hope someone here can advise me?
Here is code for what you want to achieve. It reads a csv file like yours, loops through the rows, adds a column with the current time, and saves each row to a separate csv. Let me know if it works for you.
import pandas as pd
from datetime import datetime
# Give the path to your csv
df = pd.read_csv('C:/Users/username/Downloads/test.csv')
# Iterate over the rows in the dataframe
for index, row in df.iterrows():
    # Add a Timestamp value for the current row
    df.loc[index, 'Timestamp'] = datetime.now().strftime('%c')
    print(df.loc[index])
    # Turn the row into a new one-row dataframe
    df_new = df.loc[index].to_frame().T
    # Save the dataframe in a separate csv
    df_new.to_csv(f'C:/Users/username/Downloads/test_{index}.csv', index=False)
Pandas' .to_dict(orient='records') is your friend.
from datetime import datetime
list_of_final_dataframes = []
for record in df.to_dict(orient='records'):
    record_with_timestamp = {**record, **{'timestamp': datetime.now()}}
    list_of_final_dataframes.append(pd.DataFrame([record_with_timestamp]))
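Since the question also asks for each dataframe to be named after its "Type" value, one possible variation is to keep the results in a dict keyed by that value. This is only a sketch: the sample frame and its Value column are hypothetical, and it assumes the "Type" values are unique (a repeated value would overwrite the earlier entry).

```python
from datetime import datetime

import pandas as pd

# Hypothetical stand-in for the dataframe from the question.
df = pd.DataFrame({"Type": ["A", "B", "C"], "Value": [1, 2, 3]})

frames = {}
for record in df.to_dict(orient="records"):
    # Build a one-row dataframe from the record and add the timestamp column.
    row_df = pd.DataFrame([record])
    row_df["timestamp"] = datetime.now()
    # Use the row's 'Type' value as the name (dict key) of the dataframe.
    frames[record["Type"]] = row_df

print(frames["B"])
```

A dict avoids creating variables dynamically, and each entry can still be written out with frames[name].to_csv(...) if separate files are needed.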

Combining pandas dataframe values based on other column values

I have a pandas dataframe like so:
import pandas as pd
import numpy as np
df = pd.DataFrame([['WY','M',2014,'Seth',5],
['WY','M',2014,'Spencer',5],
['WY','M',2014,'Tyce',5],
['NY','M',2014,'Seth',25],
['MA','M',2014,'Spencer',23]],columns = ['state','sex','year','name','number'])
print(df)
How do I manipulate the data to get a dataframe like:
df1 = pd.DataFrame([['M',2014,'Seth',30],
['M',2014,'Spencer',28],
['M',2014,'Tyce',5]],
columns = ['sex','year','name','number'])
print(df1)
This is just part of a very large dataframe, how would I do this for every name for every year?
df[['sex','year','name','number']].groupby(['sex','year','name']).sum().reset_index()
For a brief description of what this does, from left to right:
Select only the columns we care about. We could replace this part with df.drop('state',axis=1)
Perform a groupby on the columns we care about.
Sum the remaining columns (in this case, just number).
Reset the index so that the columns ['sex','year','name'] are no longer a part of the index.
You can use a pivot table:
df.pivot_table(values = 'number',aggfunc = 'sum',columns = ['sex','year','name']).reset_index().rename(columns={0:'number'})
Group by the columns you want, sum number, and flatten the multi-index:
df.groupby(['sex','year','name'])['number'].sum().reset_index()
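In your case the only other column, state, is not numeric, so you can shorten this to the following; numeric_only=True skips state, which recent pandas versions would otherwise concatenate as strings:
df.groupby(['sex','year','name']).sum(numeric_only=True).reset_index()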
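For reference, here is a runnable sketch of the groupby approach on the sample data from the question. The variable name result and the use of as_index=False (instead of reset_index) are choices made for this example.

```python
import pandas as pd

df = pd.DataFrame(
    [["WY", "M", 2014, "Seth", 5],
     ["WY", "M", 2014, "Spencer", 5],
     ["WY", "M", 2014, "Tyce", 5],
     ["NY", "M", 2014, "Seth", 25],
     ["MA", "M", 2014, "Spencer", 23]],
    columns=["state", "sex", "year", "name", "number"],
)

# Selecting 'number' before summing keeps 'state' out of the result;
# as_index=False returns the group keys as ordinary columns.
result = df.groupby(["sex", "year", "name"], as_index=False)["number"].sum()

print(result)
```

Because year and name are both grouping keys, the same call covers every name in every year of the larger dataframe.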
