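Problem description
I'm wondering whether there is a better way to perform this algorithm. I need to do this type of operation often, and my current approach takes hours, because I believe it counts as an n^2 algorithm. I've attached it below.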
import csv

with open("location1", 'r') as main:
    csvMain = csv.reader(main)
    mainList = list(csvMain)

with open("location2", 'r') as anno:
    csvAnno = csv.reader(anno)
    annoList = list(csvAnno)

output = []

# Compare every row of the first file with every row of the second: O(n^2)
for full in mainList:
    geneName = full[2].lower()
    for annot in annoList:
        if geneName == annot[2].lower():
            # Start a fresh row for each match. Looping over the list with
            # `del i` does not empty it, so one shared list would keep growing.
            tempList = []
            tempList.extend(full)
            tempList.append(annot[3])
            tempList.append(annot[4])
            tempList.append(annot[5])
            tempList.append(annot[6])
            output.append(tempList)

with open("location3", 'w') as final:
    a = csv.writer(final, delimiter=',')
    a.writerows(output)
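I have two csv files containing 15,000 strings each, and I want to compare a column in each of them; where they match, the end of the row from the second csv should be concatenated onto the end of the row from the first. Any help would be greatly appreciated!
Thanks!
Answer 1
This should be more efficient: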
import csv
from collections import defaultdict

with open("location1", 'r') as main:
    csvMain = csv.reader(main)
    mainList = list(csvMain)

with open("location2", 'r') as anno:
    csvAnno = csv.reader(anno)
    annoList = list(csvAnno)

output = []
annoMap = defaultdict(list)

for annot in annoList:
    tempList = annot[3:]  # adapt this to the needed columns
    # store these columns in the map under the (lowercased) column of interest
    annoMap[annot[2].lower()].append(tempList)

for full in mainList:
    geneName = full[2].lower()
    if geneName in annoMap:  # check if a matching column exists
        # write the main row followed by the annotation columns, one row per match
        for extra in annoMap[geneName]:
            output.append(full + extra)

with open("location3", 'w') as final:
    a = csv.writer(final, delimiter=',')
    a.writerows(output)
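This is more efficient because you only need to iterate over each list once. A dictionary lookup is O(1) on average, so you essentially get a linear algorithm.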
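To make the one-pass-per-list idea concrete, here is a minimal sketch of the same dictionary-based join on in-memory rows. The function name `hash_join`, its parameters, and the sample gene rows are hypothetical and only for illustration; the column positions (key in column 2, extra columns from 3 onward) follow the script above.

```python
from collections import defaultdict


def hash_join(main_rows, anno_rows, key_index=2, first_extra=3):
    """Join two lists of rows on a case-insensitive key column (illustrative helper)."""
    # One pass over the annotation rows builds the lookup table.
    anno_map = defaultdict(list)
    for annot in anno_rows:
        anno_map[annot[key_index].lower()].append(annot[first_extra:])

    # One pass over the main rows; each dictionary lookup is O(1) on average.
    joined = []
    for full in main_rows:
        for extra in anno_map.get(full[key_index].lower(), []):
            joined.append(full + extra)
    return joined


# Made-up sample data standing in for the two csv files
main_rows = [["m1", "x", "BRCA1"], ["m2", "y", "tp53"], ["m3", "z", "EGFR"]]
anno_rows = [["a1", "p", "brca1", "DNA repair"],
             ["a2", "q", "TP53", "tumor suppressor"]]

result = hash_join(main_rows, anno_rows)
for row in result:
    print(row)
```

With n rows in each file, this does roughly 2n steps instead of n * n comparisons, which for 15,000 rows is about 30,000 steps instead of 225 million.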
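Answer 2
A simple approach is to use a library like pandas. Its built-in functions are very efficient.
You can load each csv into a dataframe with pandas.read_csv() and then manipulate it with pandas functions.
For example, you can use pandas.merge() to merge the two dataframes (that is, your two csv files) on a specific column, and then drop the columns you don't need.
If you have some database knowledge, the logic here is very similar.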
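As a rough sketch of what that could look like, the example below merges two small dataframes on a lowercased gene column, much like an SQL inner join. It assumes pandas is installed and that the files have no header row; the column names (`id`, `info`, `gene`, `function`) and the sample data are made up for illustration, with `io.StringIO` standing in for the real files.

```python
import io

import pandas as pd

# In-memory stand-ins for the two csv files (hypothetical contents)
main_csv = io.StringIO("m1,x,BRCA1\nm2,y,tp53\nm3,z,EGFR\n")
anno_csv = io.StringIO("a1,p,brca1,DNA repair\na2,q,TP53,tumor suppressor\n")

# The files are assumed to have no header row, so column names are supplied here
main_df = pd.read_csv(main_csv, header=None, names=["id", "info", "gene"])
anno_df = pd.read_csv(anno_csv, header=None,
                      names=["id", "info", "gene", "function"])

# Build a case-insensitive join key in both dataframes
main_df["key"] = main_df["gene"].str.lower()
anno_df["key"] = anno_df["gene"].str.lower()

# Inner join on the key, keeping only the annotation columns that are needed,
# then drop the helper key column
merged = pd.merge(main_df, anno_df[["key", "function"]], on="key", how="inner")
merged = merged.drop(columns="key")

print(merged)
```

The result could then be written out with `merged.to_csv("location3", index=False, header=False)`.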
Answer 3
Thanks for the help, @limes. Here is the final script I used; I thought I'd post it in case it helps others. Thanks again!
import csv
from collections import defaultdict

with open("location1", 'r') as main:
    csvMain = csv.reader(main)
    mainList = list(csvMain)

with open("location2", 'r') as anno:
    csvAnno = csv.reader(anno)
    annoList = list(csvAnno)

output = []
annoMap = defaultdict(list)

for annot in annoList:
    tempList = annot[3:]  # adapt this to the needed columns
    # store these columns in the map under the (lowercased) column of interest
    annoMap[annot[2].lower()].append(tempList)

for full in mainList:
    geneName = full[2].lower()
    if geneName in annoMap:  # check if a matching column exists
        matches = annoMap[geneName]  # renamed from `list` to avoid shadowing the built-in
        full.extend(matches[0])  # only the first matching annotation is used
        output.append(full)

with open("location3", 'w') as final:
    a = csv.writer(final, delimiter=',')
    a.writerows(output)