How to create a bipartite graph from a csv file - python

I am trying to create a bipartite graph from an excel file that looks similar to this:
     xyz  pqr  tsu
abc   -1    1   -2
def   -2   -1    2
ghj    2   -1    1
To begin with, I tried the following:
import networkx as nx
import matplotlib.pyplot as plt
import pandas as pd
import xlrd
import numpy as np
from numpy import genfromtxt
df = pd.read_csv (r'C:\Users\Dragos\Desktop\networkx project\proiect.csv')
G=nx.read_edgelist('proiect.csv', create_using=nx.Graph(), nodetype=str)
nx.draw(G)
plt.show()
But I keep getting the error: Failed to convert edge data (['wage,carbon', 'tax,imigration,healthcare,voting,drugs,dc', 'statehood,abortion,UBI,wealthtax']) to dictionary.
Right now I'm at a loss and not sure how to proceed.

The first step is to load the CSV file into a pandas dataframe:
pd.read_csv(path, sep=',')
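Then you need to create a new dataframe in which each row describes one edge, following this format:
>>> df
   weight  cost  0  b
0       4     7  A  D
1       7     1  B  A
2      10     9  C  E
In NetworkX 2.x and later, from_pandas_dataframe is named from_pandas_edgelist:
G = nx.from_pandas_edgelist(df, 0, 'b', ['weight', 'cost'])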
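As a minimal sketch of the reshaping step, assuming the CSV holds the matrix shown in the question (row labels in the first column, column labels in the header), each cell can become a weighted edge between its row node and its column node. The CSV text below is a stand-in for the asker's file, and the `bipartite=0` / `bipartite=1` node attributes are just the usual NetworkX convention for marking the two node sets.

```python
import io

import networkx as nx
import pandas as pd

# Hypothetical CSV text mirroring the table in the question
csv_text = """,xyz,pqr,tsu
abc,-1,1,-2
def,-2,-1,2
ghj,2,-1,1
"""

# The first column holds the row labels, so use it as the index
df = pd.read_csv(io.StringIO(csv_text), index_col=0)

G = nx.Graph()
G.add_nodes_from(df.index, bipartite=0)    # row labels form one node set
G.add_nodes_from(df.columns, bipartite=1)  # column labels form the other

# Every cell becomes an edge between its row node and its column node
for row_label, row in df.iterrows():
    for col_label, weight in row.items():
        G.add_edge(row_label, col_label, weight=int(weight))

print(G.number_of_nodes(), G.number_of_edges())
```

Passing `pos=nx.bipartite_layout(G, df.index)` to `nx.draw` should then place the two node sets in separate columns.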


pie chart drawing for a specific column in pandas python

I have a dataframe df with many columns. The column df["house_electricity"] contains the values 1, 0, or blank/NA. I want to plot this column as a pie chart that shows the percentages of only 1 and 0. I also want a second pie chart that shows the percentages of 1, 0, and blank/NA together.
customer_id  house_electricity  house_refrigerator
cid01        0                  0
cid02        1                  na
cid03                           1
cid04        1
cid05        na                 0
# I wrote the following, but it didn't give me my expected result
import pandas as pd
import matplotlib.pyplot as plt
df=pd.read_csv("my_file.csv")
df_col=df.columns
df["house_electricity"].plot(kind="pie")
For a dataframe such as
import numpy as np
import pandas as pd
df = pd.DataFrame({'a': [1, 0, np.nan, 1, 1, 1, '', 0, 0, np.nan]})
df
     a
0    1
1    0
2  NaN
3    1
4    1
5    1
6
7    0
8    0
9  NaN
the code below draws one slice per distinct value, counting NaN and the empty string as separate values:
df["a"].value_counts(dropna=False).plot(kind="pie")
If you want to combine NaN and empty values into one slice, replace the empty strings with np.nan before plotting:
df["a"].replace("", np.nan).value_counts(dropna=False).plot(kind="pie")
To get a pie chart with three slices (0, 1, and NaN), you can try this code:
import pandas as pd
import matplotlib.pyplot as plt
data = {'customer_id': ['cid01', 'cid02', 'cid03', 'cid04', 'cid05'],
        'house_electricity': [0, 1, None, 1, None],
        'house_refrigerator': [0, None, 1, None, 0]}
df = pd.DataFrame(data)
# dropna=False keeps missing values as their own category
counts = df['house_electricity'].value_counts(dropna=False)
# Take the labels from the index: value_counts sorts by frequency, not by value
counts.plot.pie(autopct='%1.1f%%', labels=[str(v) for v in counts.index], shadow=True)
plt.title('Percentage distribution of house_electricity column')
plt.axis('equal')
plt.show()
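The question asks for two charts: one with only 0 and 1, and one that also includes missing values. A minimal sketch of how the two sets of percentages could be computed, assuming blanks and "na" are both read as NaN (the sample values below are illustrative):

```python
import numpy as np
import pandas as pd

# Illustrative column: two missing values, two ones, one zero
s = pd.Series([0, 1, np.nan, 1, np.nan], name="house_electricity")

# Percentages over 0 and 1 only: value_counts drops NaN by default
known_share = s.value_counts(normalize=True) * 100

# Percentages over 0, 1 and missing values together
all_share = s.value_counts(normalize=True, dropna=False) * 100

print(known_share)
print(all_share)
```

Either series can then be plotted with `.plot.pie(autopct='%1.1f%%')`, one chart per series.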

Python plot heatmap from csv pixel file with pandas

I would like to plot a heatmap from a csv file that contains pixel positions. The file looks like this:
0 0 8.400000e+01
1 0 8.500000e+01
2 0 8.700000e+01
3 0 8.500000e+01
4 0 9.400000e+01
5 0 7.700000e+01
6 0 8.000000e+01
7 0 8.300000e+01
8 0 8.900000e+01
9 0 8.500000e+01
10 0 8.300000e+01
I tried writing a few lines of Python, but they return an error. I suspect the cause is the format of column 3, which is read as strings. Is there any way to plot this kind of file?
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
path_to_csv= "/run/media/test.txt"
df= pd.read_csv(path_to_csv ,sep='\t')
plt.imshow(df,cmap='hot',interpolation='nearest')
plt.show(df)
I also tried seaborn, but with no success.
Here is the error returned:
TypeError: Image data of dtype object cannot be converted to float
You can set dtype=float as a keyword argument of pandas.read_csv:
df = pd.read_csv(path_to_csv, sep='\t', dtype=float)
Or use pandas.DataFrame.astype:
plt.imshow(df.astype(float), cmap='hot', interpolation='nearest', aspect='auto')
plt.show()
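Note that calling imshow on the three-column dataframe draws the table itself (one image row per line of the file). If the columns are meant as x position, y position, and pixel value (an assumption, since the file has no header), one possible approach is to pivot the data into a 2D grid first. The column names and the second row of pixels below are made up for illustration:

```python
import io

import pandas as pd

# Hypothetical tab-separated data in the same layout as the question's file,
# extended with a second row of pixels (y = 1) for illustration
text = (
    "0\t0\t8.400000e+01\n"
    "1\t0\t8.500000e+01\n"
    "2\t0\t8.700000e+01\n"
    "0\t1\t8.300000e+01\n"
    "1\t1\t8.900000e+01\n"
    "2\t1\t8.500000e+01\n"
)

# header=None keeps the first data line from being used as column names
df = pd.read_csv(io.StringIO(text), sep="\t", header=None,
                 names=["x", "y", "value"], dtype=float)

# One grid row per y, one grid column per x
grid = df.pivot(index="y", columns="x", values="value")
image = grid.to_numpy()

print(image)
```

The resulting array can be passed to `plt.imshow(image, cmap='hot', interpolation='nearest')`.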

How to efficiently update pandas rows when the computation involves looking up values in another array

The objective is to update the df rows by combining each row's values with reference values from an external NumPy array.
Currently, I use a for loop to update each row, as below.
However, I wonder whether this can be tackled with a pandas built-in.
import pandas as pd
import numpy as np
arr=np.array([1,2,5,100,3,6,8,3,99,12,5,6,8,11,14,11,100,1,3])
arr=arr.reshape((1,-1))
df=pd.DataFrame(zip([1,7,13],[4,11,17],['a','g','t']),columns=['start','end','o'])
for n in range(len(df)):
    a=df.loc[n]
    drange=list(range(a['start'],a['end']+1))
    darr=arr[0,drange]
    r=np.where(darr==np.amax(darr))[0].item()
    df.loc[n,'pos_peak']=drange[r]
Expected output:
   start  end  o  pos_peak
0      1    4  a       3.0
1      7   11  g       8.0
2     13   17  t      16.0
My approach would be to use the pandas apply() function, which applies a function to each row of your dataframe. To find the index of the maximum element, I applied the NumPy function argmax() to the relevant slice of arr and added the slice's start offset. Here is the code:
import pandas as pd
import numpy as np
arr=np.array([1,2,5,100,3,6,8,3,99,12,5,6,8,11,14,11,100,1,3])
arr=arr.reshape((1,-1))
df=pd.DataFrame(zip([1,7,13],[4,11,17],['a','g','t']),columns=['start','end','o'])
df['pos_peak'] = df.apply(lambda x: x['start'] + np.argmax(arr[0][x['start']:x['end']+1]), axis=1)
df
Output:
   start  end  o  pos_peak
0      1    4  a         3
1      7   11  g         8
2     13   17  t        16
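Since apply() with axis=1 still runs a Python function once per row, a plain list comprehension over the start and end columns is an alternative that avoids building a Series for every row. This is only a sketch of the same argmax idea, using a flat 1D array and the data from the question:

```python
import numpy as np
import pandas as pd

arr = np.array([1, 2, 5, 100, 3, 6, 8, 3, 99, 12, 5, 6, 8, 11, 14, 11, 100, 1, 3])
df = pd.DataFrame({"start": [1, 7, 13], "end": [4, 11, 17], "o": ["a", "g", "t"]})

# For each row, argmax gives the peak offset inside arr[start:end + 1];
# adding start converts it back to an index into arr
df["pos_peak"] = [
    start + int(np.argmax(arr[start:end + 1]))
    for start, end in zip(df["start"], df["end"])
]

print(df)
```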

How to modify CSV format data using pandas in Jupyter notebook?

I am reading a CSV file into variable called 'data' as follows in Jupyter Notebook using pandas
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
data = pd.read_csv("C:/Users/hp/Desktop/dv project/googleplaystorecleaned.csv")
I tried to modify the 'Size' column of the data set to remove the characters 'M' and 'k' using the following code:
for i in range(len(data['Size'])):
    data['Size'][i]=str(data['Size'][i])
    data['Size'][i]=data['Size'][i].replace('M','')
    data['Size'][i]=data['Size'][i].replace('k','')
    data['Size'][i]=data['Size'][i].replace('Varies with device','')
    data['Size'][i]=float(data['Size'][i])
print(data['Size'])
The code seems to work only partially on the data set, as I am getting the following output:
0 19
1 14
2 8.7
3 25
4 2.8
...
10836 53M
10837 3.6M
10838 9.5M
10839 Varies with device
10840 19M
Name: Size, Length: 10829, dtype: object
Please suggest a proper way to do this.
I created an example dataframe to show the result:
import pandas as pd
df = pd.DataFrame({'A': [1, 2, 1], 'B': [3, 4, 3], 'Size': ['Mfoobar1', 'kfoobar2', 'Varies with devicefoobar3']})
for i, v in enumerate(df['Size'].values):
    v = v.replace('M', '')
    v = v.replace('k', '')
    v = v.replace('Varies with device', '')
    # Write back through .loc instead of mutating the underlying array
    df.loc[i, 'Size'] = v
print(df)
Before:
   A  B                       Size
0  1  3                   Mfoobar1
1  2  4                   kfoobar2
2  1  3  Varies with devicefoobar3
After:
   A  B     Size
0  1  3  foobar1
1  2  4  foobar2
2  1  3  foobar3
Hi, you can also try this:
import pandas as pd
list1 = ['20M', '9M', '10K', '10']
dataframe1 = pd.DataFrame(data=list1, columns=['Size'])
# Create the target column first so values without a suffix are kept as-is
dataframe1['Col1'] = dataframe1['Size']
for i, s in enumerate(dataframe1['Size']):
    if s[-1] == 'M':
        dataframe1.loc[i, 'Col1'] = s.replace('M', '')
    if s[-1] == 'K':
        dataframe1.loc[i, 'Col1'] = s.replace('K', '')
dataframe1
You will get your expected output.
Note: You can add more if conditions according to your requirements.
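The same cleanup can also be done without an explicit loop. The following is a minimal sketch using the pandas string methods, assuming the sizes end in 'M' or 'k' and that anything non-numeric (such as 'Varies with device') should become NaN. The sample values and the `Size_num` column name are made up for illustration:

```python
import pandas as pd

# Illustrative values in the style of the question's 'Size' column
data = pd.DataFrame({"Size": ["19M", "8.7M", "201k", "Varies with device"]})

# Strip a trailing 'M' or 'k' from every value in one pass
cleaned = data["Size"].astype(str).str.replace(r"[Mk]$", "", regex=True)

# Convert to numbers; text that is not numeric becomes NaN
data["Size_num"] = pd.to_numeric(cleaned, errors="coerce")

print(data)
```

Like the loop in the question, this only drops the unit letter. It does not rescale 'k' values relative to 'M' values.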

Shuffle pandas dataframe n times and rename it each time

I want to shuffle a pandas dataframe 'n' times and save the shuffled dataframe with a new name and then export it to a 'csv' file. What I mean is-
import pandas as pd
import sklearn
import numpy as np
from sklearn.utils import shuffle
df = pd.read_csv('example.csv')
Then something like this:
for i in np.arange(n):
    df_%i = shuffle(df)
    df_%i.to_csv('example.csv')
I appreciate any help. Thanks!
You can use
for i in range(n):
    df.sample(frac=1).to_csv(f"example_{i}.csv")
If you need to create an arbitrary number of variables, you should store them in a dictionary and you can reference them later by their keys; in this case the integer you loop over.
d = {}
for i in range(n):
    d[i] = df.sample(frac=1)  # d[i] = shuffle(df) in your case
    d[i].to_csv(f'example_{i}.csv')
For example:
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randint(1, 10, (3, 3)))
d = {}
for i in range(5):
    d[i] = df.sample(frac=1)
d[1]
#   0  1  2
#0  6  3  2
#1  7  6  4
#2  2  6  9
d[2]
#   0  1  2
#2  2  6  9
#1  7  6  4
#0  6  3  2
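If the shuffles need to be reproducible, one option is to pass a different random_state to each call. The sketch below assumes that is acceptable and writes the files to a temporary directory so that it does not touch real data. The sample dataframe and the output location are illustrative:

```python
import tempfile
from pathlib import Path

import pandas as pd

# Illustrative dataframe standing in for the contents of example.csv
df = pd.DataFrame({"a": range(5), "b": list("vwxyz")})
n = 3

# One shuffled copy per key; the seed makes each shuffle repeatable
shuffled = {i: df.sample(frac=1, random_state=i) for i in range(n)}

# Write each copy to its own file in a scratch directory
out_dir = Path(tempfile.mkdtemp())
for i, frame in shuffled.items():
    frame.to_csv(out_dir / f"example_{i}.csv", index=False)

print(sorted(p.name for p in out_dir.glob("example_*.csv")))
```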
