How to modify CSV format data using pandas in Jupyter notebook? - python

I am reading a CSV file into a variable called 'data' in Jupyter Notebook using pandas, as follows:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
data = pd.read_csv("C:/Users/hp/Desktop/dv project/googleplaystorecleaned.csv")
I tried to modify the 'Size' column of the data set to remove the characters 'M' and 'k' using the following code:
for i in range(len(data['Size'])):
    data['Size'][i]=str(data['Size'][i])
    data['Size'][i]=data['Size'][i].replace('M','')
    data['Size'][i]=data['Size'][i].replace('k','')
    data['Size'][i]=data['Size'][i].replace('Varies with device','')
    data['Size'][i]=float(data['Size'][i])
print(data['Size'])
The code seems to work only partially on the data set, as I am getting the following output:
0 19
1 14
2 8.7
3 25
4 2.8
...
10836 53M
10837 3.6M
10838 9.5M
10839 Varies with device
10840 19M
Name: Size, Length: 10829, dtype: object
Please suggest a proper way to do this.

I created an example dataframe to show the result:
df = pd.DataFrame({'A': [1,2,1], 'B': [3,4,3], 'Size': ['Mfoobar1','kfoobar2','Varies with devicefoobar3']})
for i, v in enumerate(df['Size'].values):
    # Strip each unwanted substring from the current value
    v = v.replace('M', '')
    v = v.replace('k', '')
    v = v.replace('Varies with device', '')
    df['Size'].values[i] = v
print(df)
Before :
A B Size
0 1 3 Mfoobar1
1 2 4 kfoobar2
2 1 3 Varies with devicefoobar3
After :
A B Size
0 1 3 foobar1
1 2 4 foobar2
2 1 3 foobar3
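
A vectorized alternative may be simpler than looping. The output in the question shows index labels up to 10840 for a Series of length 10829, so range(len(...)) never reaches the last labels, which may be why those rows stayed unchanged. The sketch below uses pandas string methods on the whole column instead; the sample data is hypothetical and assumes the column only holds values like '19M', '201k' and 'Varies with device'.

```python
import pandas as pd

# Hypothetical sample mimicking the 'Size' column from the question.
data = pd.DataFrame({"Size": ["19M", "8.7M", "201k", "Varies with device"]})

cleaned = (
    data["Size"]
    .astype(str)
    .str.replace("Varies with device", "", regex=False)
    .str.replace("M", "", regex=False)
    .str.replace("k", "", regex=False)
)

# Empty strings cannot be parsed, so errors="coerce" turns them into NaN.
data["Size"] = pd.to_numeric(cleaned, errors="coerce")
print(data["Size"])
```

Like the loop, this discards the unit, so '201k' and '201M' would both become 201.0; convert to a common unit first if that matters.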

You can also try this:
import pandas as pd
list1 = ['20M', '9M', '10K', '10']
dataframe1 = pd.DataFrame(data=list1, columns=['Size'])
# Create the target column first so values without a suffix are kept as-is
dataframe1['Col1'] = dataframe1['Size']
for i, s in enumerate(dataframe1['Size']):
    if s[-1] == 'M':
        dataframe1.loc[i, 'Col1'] = s.replace('M', '')
    if s[-1] == 'K':
        dataframe1.loc[i, 'Col1'] = s.replace('K', '')
dataframe1
This gives the expected output.
Note: you can add further if conditions according to your requirements.

Related

Python plot heatmap from csv pixel file with panda

I would like to plot a heatmap from a csv file which contains pixels position. This csv file has this shape:
0 0 8.400000e+01
1 0 8.500000e+01
2 0 8.700000e+01
3 0 8.500000e+01
4 0 9.400000e+01
5 0 7.700000e+01
6 0 8.000000e+01
7 0 8.300000e+01
8 0 8.900000e+01
9 0 8.500000e+01
10 0 8.300000e+01
I tried writing a few lines of Python, but they raise an error. I suspect the third column is being read as strings. Is there any way to plot this kind of file?
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
path_to_csv= "/run/media/test.txt"
df= pd.read_csv(path_to_csv ,sep='\t')
plt.imshow(df,cmap='hot',interpolation='nearest')
plt.show(df)
I also tried seaborn, but with no success.
Here is the error returned:
TypeError: Image data of dtype object cannot be converted to float
You can set dtype=float as a keyword argument of pandas.read_csv :
df = pd.read_csv(path_to_csv, sep='\t', dtype=float)
Or use pandas.DataFrame.astype :
plt.imshow(df.astype(float), cmap='hot', interpolation='nearest', aspect='auto')
plt.show()
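
Since each row of the file appears to be an "x y value" triple, it may also help to pivot the data into a 2D grid before plotting. This is only a sketch: it assumes the file has no header row, and the column names x, y and value as well as the sample data are made up for illustration.

```python
import io
import pandas as pd

# Hypothetical sample in the same tab-separated "x y value" layout.
raw = "0\t0\t8.4e+01\n1\t0\t8.5e+01\n0\t1\t8.7e+01\n1\t1\t9.4e+01\n"

# header=None keeps the first data row from being consumed as column names.
df = pd.read_csv(io.StringIO(raw), sep="\t", header=None, names=["x", "y", "value"])

# One row per y position, one column per x position.
grid = df.pivot(index="y", columns="x", values="value")
print(grid)

# grid.to_numpy() can then be passed to plt.imshow(..., cmap='hot').
```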

How to efficiently update pandas row if computation involving lookup another array value

The objective is to update the df rows using elements of the df and reference values from an external NumPy array.
Currently, I use a for loop to update each row, as below.
However, I wonder whether this can be tackled with any pandas built-in functionality.
import pandas as pd
import numpy as np
arr=np.array([1,2,5,100,3,6,8,3,99,12,5,6,8,11,14,11,100,1,3])
arr=arr.reshape((1,-1))
df=pd.DataFrame(zip([1,7,13],[4,11,17],['a','g','t']),columns=['start','end','o'])
for n in range(len(df)):
    a=df.loc[n]
    drange=list(range(a['start'],a['end']+1))
    darr=arr[0,drange]
    r=np.where(darr==np.amax(darr))[0].item()
    df.loc[n,'pos_peak']=drange[r]
Expected output
start end o pos_peak
0 1 4 a 3.0
1 7 11 g 8.0
2 13 17 t 16.0
My approach would be to use the pandas apply() function, which applies a function to each row of your dataframe. To find the index of the maximum element, I used the numpy function argmax() on the relevant slice of arr. Here is the code:
import pandas as pd
import numpy as np
arr=np.array([1,2,5,100,3,6,8,3,99,12,5,6,8,11,14,11,100,1,3])
arr=arr.reshape((1,-1))
df=pd.DataFrame(zip([1,7,13],[4,11,17],['a','g','t']),columns=['start','end','o'])
df['pos_peak'] = df.apply(lambda x: x['start'] + np.argmax(arr[0][x['start']:x['end']+1]), axis=1)
df
Output:
start end o pos_peak
0 1 4 a 3
1 7 11 g 8
2 13 17 t 16
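
If apply() turns out to be slow on larger frames, a plain list comprehension over the two columns is another option. A minimal sketch, assuming arr is kept one-dimensional (the reshape to (1, -1) in the question is not needed here):

```python
import numpy as np
import pandas as pd

arr = np.array([1, 2, 5, 100, 3, 6, 8, 3, 99, 12, 5, 6, 8, 11, 14, 11, 100, 1, 3])
df = pd.DataFrame({"start": [1, 7, 13], "end": [4, 11, 17], "o": ["a", "g", "t"]})

# argmax is relative to the slice, so add the slice's start offset back.
df["pos_peak"] = [
    s + int(np.argmax(arr[s:e + 1]))
    for s, e in zip(df["start"], df["end"])
]
print(df)
```

As with the apply() version, ties resolve to the first maximum in each range.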

How to create a bipartite graph from a csv file

I am trying to create a bipartite graph from an excel file that looks similar to this:
xyz pqr tsu
abc -1 1 -2
def -2 -1 2
ghj 2 -1 1
To begin with, I tried the following:
import networkx as nx
import matplotlib.pyplot as plt
import pandas as pd
import xlrd
import numpy as np
from numpy import genfromtxt
df = pd.read_csv (r'C:\Users\Dragos\Desktop\networkx project\proiect.csv')
G=nx.read_edgelist('proiect.csv', create_using=nx.Graph(), nodetype=str)
nx.draw(G)
plt.show()
But I keep getting the error Failed to convert edge data (['wage,carbon', 'tax,imigration,healthcare,voting,drugs,dc', 'statehood,abortion,UBI,wealthtax']) to dictionary.
Right now I'm at a loss and not sure how to proceed.
The first step is to load the data into a pandas DataFrame:
df = pd.read_csv(path, sep=',')
See the pandas read_csv documentation for more options.
Then you need to create a new dataframe that follows this edge-list format, with one row per edge:
>>> df
weight cost 0 b
0 4 7 A D
1 7 1 B A
2 10 9 C E
G = nx.from_pandas_edgelist(df, 0, 'b', ['weight', 'cost'])
Note that from_pandas_dataframe was removed in NetworkX 2.x; from_pandas_edgelist is its replacement.
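
One way to get from the matrix in the question to such an edge list is sketched below. It assumes the row labels and column labels are the two node sets and that every cell is an edge weight; the DataFrame is rebuilt by hand from the sample table rather than read from the file.

```python
import networkx as nx
import pandas as pd

# Hand-built copy of the sample table from the question.
df = pd.DataFrame(
    [[-1, 1, -2], [-2, -1, 2], [2, -1, 1]],
    index=["abc", "def", "ghj"],
    columns=["xyz", "pqr", "tsu"],
)

G = nx.Graph()
# The 'bipartite' node attribute is the NetworkX convention for marking the two sets.
G.add_nodes_from(df.index, bipartite=0)
G.add_nodes_from(df.columns, bipartite=1)

# One weighted edge per (row label, column label) cell.
G.add_weighted_edges_from(
    (row, col, int(df.loc[row, col])) for row in df.index for col in df.columns
)
print(G.number_of_nodes(), G.number_of_edges())
```

For drawing, nx.bipartite_layout(G, df.index) can place the two node sets in separate columns.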

Using data from pythons pandas dataframes to sample from normal distributions

I'm trying to sample from a normal distribution using means and standard deviations that are stored in pandas DataFrames.
For example:
means= numpy.arange(10)
means=means.reshape(5,2)
produces:
0 1
0 0 1
1 2 3
2 4 5
3 6 7
4 8 9
and:
sts=numpy.arange(10,20)
sts=sts.reshape(5,2)
produces:
0 1
0 10 11
1 12 13
2 14 15
3 16 17
4 18 19
How would I produce another pandas dataframe with the same shape but with values sampled from the normal distribution using the corresponding means and standard deviations?
i.e. position 0,0 of this new dataframe would sample from a normal distribution with mean=0 and standard deviation=10, and so on.
My function so far:
def make_distributions(self):
    num_data_points,num_species= self.means.shape
    samples=[]
    for i,j in zip(self.means,self.stds):
        for k,l in zip(self.means[i],self.stds[j]):
            samples.append( numpy.random.normal(k,l,self.n) )
will sample from the distributions for me but I'm having difficulty putting the data back into the same shaped dataframe as the mean and standard deviation dfs. Does anybody have any suggestions as to how to do this?
Thanks in advance.
You can use numpy.random.normal to sample from a random normal distribution.
IIUC, then this might be easiest, taking advantage of broadcasting:
import numpy as np
np.random.seed(1) # only for demonstration
np.random.normal(means,sts)
array([[ 16.24345364, -5.72932055],
[ -4.33806103, -10.94859209],
[ 16.11570681, -29.52308045],
[ 33.91698823, -5.94051732],
[ 13.74270373, 4.26196287]])
Check that it works:
np.random.seed(1)
# Python 3 prints more decimal places than shown below
print(np.random.normal(0,10))
print(np.random.normal(1,11))
16.2434536366
-5.72932055015
If you need a pandas DataFrame:
import pandas as pd
pd.DataFrame(np.random.normal(means,sts))
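
When means and standard deviations are already DataFrames, the same broadcasting idea can keep their index and columns. A minimal sketch, assuming both frames share the same shape and labels; the seed, n and the name draws are chosen only for illustration, and it uses NumPy's Generator API rather than np.random.seed:

```python
import numpy as np
import pandas as pd

means = pd.DataFrame(np.arange(10).reshape(5, 2))
stds = pd.DataFrame(np.arange(10, 20).reshape(5, 2))

rng = np.random.default_rng(0)  # seeded only for reproducibility

# One draw per cell, with the original labels preserved.
samples = pd.DataFrame(
    rng.normal(means.to_numpy(), stds.to_numpy()),
    index=means.index,
    columns=means.columns,
)

# n draws per cell: broadcast the parameters over a trailing axis.
n = 3
draws = rng.normal(
    means.to_numpy()[..., None], stds.to_numpy()[..., None], size=(5, 2, n)
)
print(samples.shape, draws.shape)
```

Here draws[i, j] holds the n samples for the cell at row i, column j.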
I would use a dictionary to construct this dataframe. Suppose the indices and columns are the same for means and stds:
import numpy
import pandas as pd
means= numpy.arange(10)
means=pd.DataFrame(means.reshape(5,2))
stds=numpy.arange(10,20)
stds=pd.DataFrame(stds.reshape(5,2))
samples={}
for i in means.columns:
    col={}
    for j in means.index:
        # .loc replaces the .ix indexer, which newer pandas versions removed
        col[j]=numpy.random.normal(means.loc[j,i],stds.loc[j,i],2)
    samples[i]=col
print(pd.DataFrame(samples))
# 0 1
#0 [0.0760974520154, 3.29439282825] [11.1292510583, 0.318246201796]
#1 [-25.4518020981, 19.2176263823] [17.0826945017, 9.36179435872]
#2 [14.5402484325, 8.33808246538] [6.96459947914, 26.5552235093]
#3 [0.775891790613, -2.09168601369] [2.38723023677, 15.8099942902]
#4 [-0.828518484847, 45.4592922652] [26.8088977308, 16.0818556353]
Or reset the dtype of a DataFrame and reassign values:
import itertools
samples = means * 0
samples = samples.astype(object)
for i,j in itertools.product(means.index, means.columns):
    # .at replaces set_value and .loc replaces .ix, both removed in newer pandas
    samples.at[i,j] = numpy.random.normal(means.loc[i,j],stds.loc[i,j],2)

Messy data in excel: importing using pandas; multiple occurrence of variables in columns

I have an Excel file with up to 100 measurements produced by a poorly designed export function. Each measurement consists of 200 rows:
Name1 Name2
' some other stuff related to the measurements'
v Qv vm qlnv 'empty column' v Qv vm qlnv
1 2 3 4 5 6 7 8
I import it with:
df = pd.read_excel('data.xls', skiprows=2, index_col=None)
Afterwards
df.dropna(axis=1, inplace=True)
df.columns
gives me:
Index([ u'v', u'Qv', u'vm', u'qlnv', u'v.1', u'Qv.1', u'vm.1', u'qlnv.1'])
I would like to reshape the data frame like:
name v Qv vm qlnv
1 1 2 3 4
2 5 6 7 8
How could I do that? Is there perhaps a feature of the CSV parser that can do this?
You can achieve this with the help of numpy. Here df denotes your dataframe with the results in one row, and I assume there are 4 features per measurement.
import numpy as np
# Integer division: reshape requires integer dimensions
pd.DataFrame(np.array(df).reshape(df.shape[0]*df.shape[1]//4,4))
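
A small worked version of the reshape, restoring the column names as well. This is a sketch with a hypothetical one-row frame built from the sample values in the question; the names list and long_df are illustrative.

```python
import numpy as np
import pandas as pd

# Hypothetical one-row frame with the duplicated column names from the question.
df = pd.DataFrame(
    [[1, 2, 3, 4, 5, 6, 7, 8]],
    columns=["v", "Qv", "vm", "qlnv", "v.1", "Qv.1", "vm.1", "qlnv.1"],
)

names = ["v", "Qv", "vm", "qlnv"]

# -1 lets NumPy infer the number of rows from the total element count.
long_df = pd.DataFrame(df.to_numpy().reshape(-1, len(names)), columns=names)
long_df.index = long_df.index + 1
long_df.index.name = "name"
print(long_df)
```

With more than one input row, the reshape interleaves the measurements (row 1 of the first, row 1 of the second, and so on), so a sort or a per-block selection may be needed afterwards.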
